By Kajal Sahu
In this tutorial, we will learn, step by step, how to scrape feature-film data from the IMDb website using BeautifulSoup in Python.
Prerequisites for scraping a website:
-> HTML structure.
-> Python basics.
-> Python libraries (requests, BeautifulSoup, pandas).
-> A CSV file for storing the data.
The details of each movie that we will scrape from the IMDb website:
movie title
year of release
certificate
runtime
genre
IMDb rating
metascore
votes
gross
director of the movie
actors in the film
Importing the required Python libraries.
import requests
from bs4 import BeautifulSoup
import pandas
URL of the page to scrape.
url = 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=animation&sort=user_rating,desc'
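The query string in this URL carries the search filters. As a minimal sketch, the same URL can be assembled from a dictionary with the standard library, which makes it easier to change a filter such as the genre. The names params and base_url are introduced here for illustration only, and reading num_votes=25000, as "at least 25,000 votes" is an interpretation rather than something the tutorial states.

```python
from urllib.parse import urlencode

# Filters copied from the tutorial's URL; dict insertion order is preserved.
params = {
    'title_type': 'feature',
    'num_votes': '25000,',
    'genres': 'animation',
    'sort': 'user_rating,desc',
}

# base_url and params are illustrative names, not part of the original script.
base_url = 'https://www.imdb.com/search/title/'

# safe=',' keeps commas literal instead of percent-encoding them as %2C.
url = base_url + '?' + urlencode(params, safe=',')
print(url)
```

Changing params['genres'] to another genre name would then produce the URL for that genre, assuming the site accepts the same parameter names.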
Creating the lists that store the dataset column by column.
name_col = []
year_col = []
certificate_col = []
runtime_col = []
genre_col = []
imdb_rating_col = []
metascore_col = []
votes_col = []
gross_col = []
director_col = []
actor_col = []
Sending a request to the URL to fetch the HTML of the page, parsing it into an html_soup object, and collecting the container element of each movie.
response = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
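To see what find_all returns without depending on the live site, the same call can be tried on a small inline HTML string. This is only a sketch: sample_html is a hypothetical, heavily simplified stand-in for the results page, and the real markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the search results page.
sample_html = """
<div class="lister-list">
  <div class="lister-item mode-advanced"><h3><a>First Film</a></h3></div>
  <div class="lister-item mode-advanced"><h3><a>Second Film</a></h3></div>
  <div class="sidebar"><h3><a>Not a film</a></h3></div>
</div>
"""

html_soup = BeautifulSoup(sample_html, 'html.parser')

# A class_ string containing a space is matched against the exact value
# of the class attribute, so only the two movie containers are selected.
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
titles = [container.h3.a.text for container in movie_containers]
print(len(movie_containers), titles)
```

With the live page, calling response.raise_for_status() before parsing would surface HTTP errors early; otherwise an error page simply yields an empty movie_containers list.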
Scraping the movie details. Here we extract each field from the BeautifulSoup object using HTML tags and class names.
for container in movie_containers:
    # Title and year of release are inside the <h3> heading.
    name = container.h3.a.text
    name_col.append(name)
    year = container.h3.find('span', class_='lister-item-year').text
    year_col.append(year)

    # The certificate is optional, so fall back to '-' when it is missing.
    certificate_tag = container.find('span', class_='certificate')
    certificate = certificate_tag.text if certificate_tag else '-'
    certificate_col.append(certificate)

    runtime = container.find('span', class_='runtime').text
    runtime_col.append(runtime)
    # strip() removes any whitespace surrounding the genre text.
    genre = container.find('span', class_='genre').text.strip()
    genre_col.append(genre)

    imdb_rating = float(container.strong.text)
    imdb_rating_col.append(imdb_rating)

    # The metascore is optional as well.
    metascore_tag = container.find('span', class_='metascore')
    metascore = metascore_tag.text if metascore_tag else '-'
    metascore_col.append(metascore)

    # The first 'nv' span holds the votes; the second, if present, the gross.
    nv = container.find_all('span', attrs={'name': 'nv'})
    vote = nv[0].text
    votes_col.append(vote)
    gross = nv[1].text if len(nv) > 1 else '-'
    gross_col.append(gross)

    # The first link is taken as the director, the remaining links as actors.
    people = container.find('p', class_='').find_all('a')
    director = people[0].text
    director_col.append(director)
    actor_col.append([actor.text for actor in people[1:]])
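The certificate and metascore lookups share one pattern: find a tag, and use a placeholder when it is absent. As a sketch, that pattern can be wrapped in a small helper. The function text_or_default and the inline sample_html are hypothetical and not part of the original script.

```python
from bs4 import BeautifulSoup


def text_or_default(container, tag, class_name, default='-'):
    """Return the stripped text of the first matching tag, or default."""
    element = container.find(tag, class_=class_name)
    return element.text.strip() if element is not None else default


# Hypothetical container that has a certificate but no metascore.
sample_html = """
<div class="lister-item mode-advanced">
  <span class="certificate">PG</span>
  <span class="runtime">95 min</span>
</div>
"""
container = BeautifulSoup(sample_html, 'html.parser').div

certificate = text_or_default(container, 'span', 'certificate')
metascore = text_or_default(container, 'span', 'metascore')
print(certificate, metascore)
```

The same helper could also guard the runtime and genre lookups, which would otherwise raise an AttributeError if a movie lacks one of those fields.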
Creating the dataset using a dictionary that maps each column name to its list.
movie_dict = {
    'name': name_col,
    'year': year_col,
    'certificate': certificate_col,
    'runtime': runtime_col,
    'genre': genre_col,
    'rating': imdb_rating_col,
    'metascore': metascore_col,
    'votes': votes_col,
    'gross': gross_col,
    'director': director_col,
    'actors': actor_col,
}
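pandas.DataFrame raises a ValueError when the lists in such a dictionary have different lengths, which could happen if the scraping loop were later changed to skip a field for some movies. A minimal sketch of a check that reports the length of every column first is given below; column_lengths and assert_same_length are hypothetical helpers, and the two small dictionaries are made-up data.

```python
def column_lengths(columns):
    """Map each column name to the number of values collected for it."""
    return {name: len(values) for name, values in columns.items()}


def assert_same_length(columns):
    """Raise ValueError naming the column lengths if they are not all equal."""
    lengths = column_lengths(columns)
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')


# Made-up data: 'good' is consistent, 'bad' is missing one year.
good = {'name': ['A', 'B'], 'year': ['(2001)', '(2010)']}
bad = {'name': ['A', 'B'], 'year': ['(2001)']}

assert_same_length(good)
try:
    assert_same_length(bad)
except ValueError as error:
    print(error)
```

Calling assert_same_length(movie_dict) before building the DataFrame would point directly at the column that fell behind.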
Creating the DataFrame and saving it to a CSV file.
df = pandas.DataFrame(movie_dict)
df.to_csv('feature.csv', index=True)
The file feature.csv has now been created and can be used as required.
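The year, runtime, and votes columns are stored as raw strings, so they may need converting before any numeric analysis. The sketch below assumes formats such as '(2016)', '95 min', and '1,234,567'; these formats are assumptions about what the page returns, and parse_year and parse_int are hypothetical helpers.

```python
import re


def parse_year(text):
    """Return the first four-digit number in text as an int, or None."""
    match = re.search(r'\d{4}', text)
    return int(match.group()) if match else None


def parse_int(text):
    """Drop every non-digit character and return an int, or None if empty."""
    digits = re.sub(r'[^\d]', '', text)
    return int(digits) if digits else None


print(parse_year('(I) (2016)'))
print(parse_int('95 min'))
print(parse_int('1,234,567'))
print(parse_int('-'))
```

parse_int is not suitable for the gross column if its values contain a decimal point or a unit suffix, because stripping non-digits would change the magnitude.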
Submitted by Kajal Sahu (kajalsahu1311)