Coders Packet

Scraping Data from the IMDb Website Using Python

By Kajal Sahu

In this tutorial, we will learn step by step how to scrape feature-film data from the IMDb website using BeautifulSoup in Python.

Prerequisites for scraping a website:

-> HTML structure.
-> Python basics.
-> Python libraries (requests, BeautifulSoup, pandas).
-> A CSV file for storing the data.

 

The details we will scrape for each movie from the IMDb website:

movie title
year of release
certificate
runtime
genre
IMDb rating
metascore
votes
gross
director of the movie
actors in the film


Importing the required Python libraries.
import requests 
from bs4 import BeautifulSoup
import pandas

 

URL of the page to scrape.
url = 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=animation&sort=user_rating,desc'
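The query string in this URL carries the search filters. As a minimal sketch, the same URL can be assembled from its parts with the standard library, which makes it easier to switch genre or vote threshold. The names BASE_URL and build_search_url are hypothetical helpers, and the reading of "25000," as a lower bound is an assumption based on the query string.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"


def build_search_url(genre, min_votes=25000):
    """Build the search URL for feature films of one genre, best rated first."""
    params = {
        "title_type": "feature",
        # "25000," is assumed to mean a range with an open upper bound.
        "num_votes": f"{min_votes},",
        "genres": genre,
        "sort": "user_rating,desc",
    }
    # safe="," keeps the commas unescaped so the result matches the URL above.
    return BASE_URL + "?" + urlencode(params, safe=",")


url = build_search_url("animation")
print(url)
```

Calling build_search_url("animation") reproduces the URL used in this tutorial.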

 

Creating the lists that store the data column by column.

name_col = []
year_col = []
certificate_col = []
runtime_col = []
genre_col = []
imdb_rating_col = []
metascore_col = []
votes_col = []
gross_col = []
director_col = []
actor_col = []

 

Sending a request to the URL to fetch the HTML content of the webpage, then creating the html_soup object and collecting the container of each movie.

response=requests.get(url)
html_soup=BeautifulSoup(response.text,'html.parser')
movie_containers=html_soup.find_all('div',class_='lister-item mode-advanced')
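The request above has no timeout and does not check the HTTP status, so a failed request may go unnoticed until parsing finds no movies. The sketch below shows one way to make the fetch stricter. The helper name fetch_html, the header value, and the 10-second timeout are all assumptions chosen for illustration; some sites may reject the default requests User-Agent, which is why a header is sent.

```python
import requests

# Hypothetical header value; some sites reject the default requests User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0 (tutorial scraper)"}


def fetch_html(url, session=None):
    """Return the page HTML, raising an exception on HTTP errors or timeouts."""
    # Accepting an optional session allows connection reuse and easy testing.
    client = session or requests
    response = client.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx replies
    return response.text
```

The returned text can then be passed to BeautifulSoup in place of response.text, for example BeautifulSoup(fetch_html(url), 'html.parser').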

 

Scraping the movie details. Here we extract each field from the BeautifulSoup object using the HTML tags and class names.

for container in movie_containers:
    # Title and release year sit inside the <h3> header.
    name = container.h3.a.text
    name_col.append(name)

    year = container.h3.find('span', class_='lister-item-year').text
    year_col.append(year)

    # The certificate is missing for some films, so fall back to '-'.
    certificate_tag = container.find('span', class_='certificate')
    certificate = certificate_tag.text if certificate_tag else '-'
    certificate_col.append(certificate)

    runtime = container.find('span', class_='runtime').text
    runtime_col.append(runtime)

    genre = container.find('span', class_='genre').text
    genre_col.append(genre)

    imdb_rating = float(container.strong.text)
    imdb_rating_col.append(imdb_rating)

    # The metascore is also optional.
    metascore_tag = container.find('span', class_='metascore')
    metascore = metascore_tag.text if metascore_tag else '-'
    metascore_col.append(metascore)

    # Spans with name="nv" hold the votes first and, when present, the gross.
    nv = container.find_all('span', attrs={'name': 'nv'})
    vote = nv[0].text
    votes_col.append(vote)

    gross = nv[1].text if len(nv) > 1 else '-'
    gross_col.append(gross)

    # In the credits paragraph, the first link is the director; the rest are actors.
    credit_links = container.find('p', class_='').find_all('a')
    director = credit_links[0].text
    director_col.append(director)

    actor_col.append([actor.text for actor in credit_links[1:]])
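The loop above guards the certificate, metascore, and gross fields, but it would raise an AttributeError if any other tag, such as the runtime, were missing. One possible way to treat every optional field the same is a small helper, sketched below on a hand-written HTML fragment. The helper name text_or_default and the sample markup are assumptions; the fragment only imitates the class names used in this tutorial and is not copied from a real page.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the class names used in the loop above;
# it is an assumption, not a copy of a real IMDb page.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3><a href="#">Sample Movie</a>
      <span class="lister-item-year">(2001)</span></h3>
  <span class="runtime">90 min</span>
  <span class="genre">Animation, Comedy</span>
  <strong>8.1</strong>
  <span name="nv">12,345</span>
</div>
"""


def text_or_default(parent, tag, default="-", **attrs):
    """Return the stripped text of the first matching tag, or a default."""
    found = parent.find(tag, **attrs)
    return found.text.strip() if found else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find("div", class_="lister-item mode-advanced")

title = container.h3.a.text
year = text_or_default(container.h3, "span", class_="lister-item-year")
certificate = text_or_default(container, "span", class_="certificate")
runtime = text_or_default(container, "span", class_="runtime")

# Votes come first; gross is only present when a second name="nv" span exists.
votes_and_gross = container.find_all("span", attrs={"name": "nv"})
gross = votes_and_gross[1].text if len(votes_and_gross) > 1 else "-"

print(title, year, certificate, runtime, gross)
```

Here the sample has no certificate and no gross, so both fall back to '-' instead of stopping the scrape.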

 

Creating the dataset using a dictionary.

movie_dict = {'name':name_col,
              'year':year_col,
              'certificate':certificate_col,
              'runtime':runtime_col,
              'genre':genre_col,
              'rating':imdb_rating_col,
              'metascore':metascore_col,
              'votes':votes_col,
              'gross':gross_col,
              'director':director_col,
              'actors':actor_col
              }
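Most columns in the dictionary above hold raw text rather than numbers. As a hedged sketch, the helpers below convert a few of them into numeric or list form before analysis. The function names are hypothetical, and the input formats shown in the docstrings (such as '(2019)' and '100 min') are assumptions about what the page returns.

```python
import re


def parse_year(raw):
    """'(2019)' or '(I) (2019)' -> 2019; None if no four-digit year is found."""
    match = re.search(r"\d{4}", raw)
    return int(match.group()) if match else None


def parse_runtime(raw):
    """'100 min' -> 100; None if the text has no digits."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else None


def parse_votes(raw):
    """'1,234,567' -> 1234567; None for the '-' placeholder."""
    digits = raw.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def clean_genre(raw):
    """Split a comma-separated genre string into a list of trimmed names."""
    return [part.strip() for part in raw.split(",") if part.strip()]
```

These could be applied to each list before building the dictionary, for example [parse_year(y) for y in year_col].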

 

Creating the DataFrame and saving it to a CSV file.

df=pandas.DataFrame(movie_dict)
df.to_csv('feature.csv',index = True)

The file feature.csv has now been created and can be used as required.
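Two details of the saved file are worth knowing: index = True writes the row numbers as an extra unnamed column, and the actors column contains Python lists, which end up in the CSV as their text representation. The sketch below shows one possible alternative on a hypothetical two-row dataset, using an in-memory buffer instead of a real file.

```python
import io

import pandas

# Hypothetical two-row dataset with the same shape as two of the columns above.
sample = pandas.DataFrame({
    "name": ["Movie A", "Movie B"],
    "actors": [["Actor One", "Actor Two"], ["Actor Three"]],
})

# Join each actor list into one string so the CSV cell is plain text
# rather than the repr of a Python list.
sample["actors"] = sample["actors"].apply(", ".join)

buffer = io.StringIO()  # in-memory stand-in for 'feature.csv'
sample.to_csv(buffer, index=False)  # index=False omits the row-number column

# Read the data back to confirm that it round-trips cleanly.
buffer.seek(0)
restored = pandas.read_csv(buffer)
print(restored)
```

Whether to keep the index column is a matter of preference; dropping it simply avoids an unnamed first column when the file is read back.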

 


Submitted by Kajal Sahu (kajalsahu1311)
