Scraping data from the IMDb website using Python

imdb_feature.py

In this tutorial, we will learn step by step how to scrape feature-film data from the IMDb website using BeautifulSoup in Python.

Prerequisites for scraping a website:

-> HTML structures.
-> Python Basics.
-> Python Libraries.
-> CSV file for storing data.

The details of each movie that we will scrape from the IMDb website:

  movie title
  year of release
  certificate
  runtime
  genre
  IMDb rating
  metascore
  votes
  gross
  director of the movie
  actors in the film


Importing the required Python libraries.

import requests 
from bs4 import BeautifulSoup
import pandas

URL of the page we want to scrape.

url = 'https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=animation&sort=user_rating,desc'
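The URL above is a base address followed by a query string, so it can also be assembled from its parts. The following is a minimal sketch, not part of the original script: the names BASE_URL and build_search_url are hypothetical, and reading num_votes=25000, as a lower bound on the vote count is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.imdb.com/search/title/'  # hypothetical constant name


def build_search_url(genre, min_votes=25000):
    """Rebuild the search URL used in this tutorial for a given genre."""
    params = {
        'title_type': 'feature',
        'num_votes': f'{min_votes},',   # assumed to mean "at least min_votes"
        'genres': genre,
        'sort': 'user_rating,desc',
    }
    # safe=',' leaves the commas unescaped so the result matches the URL above
    return f'{BASE_URL}?{urlencode(params, safe=",")}'


url = build_search_url('animation')
print(url)
```

Changing the genre argument, for example build_search_url('comedy'), would give the same kind of listing for another genre without editing the string by hand.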

Creating the lists that store the dataset column by column.

name_col = []
year_col = []
certificate_col = []
runtime_col = []
genre_col = []
imdb_rating_col = []
metascore_col = []
votes_col = []
gross_col = []
director_col = []
actor_col = []

Sending a request to the URL to fetch the HTML content of the webpage, creating the html_soup object from it, and collecting the container of every movie.

response = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
# every movie on the page sits in its own <div class="lister-item mode-advanced">
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
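To see what find_all returns without sending a request, the same call can be tried on a small HTML string. This is only a sketch: SAMPLE_HTML is invented markup that reuses the class names from this tutorial, the film titles are made up, and find_movie_containers is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented sample markup that mimics the structure this tutorial relies on.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3><a href="/title/tt0000001/">First Film</a>
      <span class="lister-item-year">(2001)</span></h3>
</div>
<div class="lister-item mode-advanced">
  <h3><a href="/title/tt0000002/">Second Film</a>
      <span class="lister-item-year">(2016)</span></h3>
</div>
<div class="sidebar">not a movie</div>
"""


def find_movie_containers(html):
    """Return every <div> whose class attribute is 'lister-item mode-advanced'."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find_all('div', class_='lister-item mode-advanced')


containers = find_movie_containers(SAMPLE_HTML)
titles = [container.h3.a.text for container in containers]
print(len(containers), titles)
```

Only the two movie containers are returned; the sidebar div is ignored because its class does not match. If the list comes back empty on the real page, the class names in the page have probably changed.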

Scraping the movie details. Here we extract each field from the BeautifulSoup containers using HTML tags and class names.

for container in movie_containers:
    # title and release year are inside the <h3> header
    name = container.h3.a.text
    name_col.append(name)

    year = container.h3.find('span', class_='lister-item-year').text
    year_col.append(year)

    # the certificate is optional; store '-' when it is missing
    certificate_tag = container.find('span', class_='certificate')
    certificate_col.append(certificate_tag.text if certificate_tag else '-')

    runtime = container.find('span', class_='runtime').text
    runtime_col.append(runtime)

    genre = container.find('span', class_='genre').text
    genre_col.append(genre)

    # the rating is the text of the first <strong> tag
    imdb_rating = float(container.strong.text)
    imdb_rating_col.append(imdb_rating)

    # the metascore is optional as well
    metascore_tag = container.find('span', class_='metascore')
    metascore_col.append(metascore_tag.text if metascore_tag else '-')

    # spans with name="nv": the first holds the votes, the second (if any) the gross
    nv = container.find_all('span', attrs={'name': 'nv'})
    votes_col.append(nv[0].text)
    gross_col.append(nv[1].text if len(nv) > 1 else '-')

    # links in the class-less <p>: the first is taken as the director,
    # the remaining ones as the actors
    people = container.find('p', class_='').find_all('a')
    director_col.append(people[0].text)
    actor_col.append([person.text for person in people[1:]])
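The loop stores most fields as raw text, so a cleaning step is usually needed before analysis. The helpers below are a sketch under assumptions: the function names are hypothetical, and the input formats ('(2001)', '125 min', '1,234,567', a comma-separated genre string padded with whitespace) are assumed rather than confirmed by this tutorial.

```python
import re


def parse_year(text):
    """Return the first 4-digit number in text as an int, or None."""
    match = re.search(r'\d{4}', text)
    return int(match.group()) if match else None


def parse_runtime(text):
    """Return the first run of digits in text as an int, or None."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


def parse_votes(text):
    """Drop thousands separators; return None for placeholders such as '-'."""
    digits = text.replace(',', '').strip()
    return int(digits) if digits.isdigit() else None


def parse_genres(text):
    """Split a comma-separated genre string into a list of clean names."""
    return [genre.strip() for genre in text.split(',') if genre.strip()]


print(parse_year('(I) (2016)'), parse_runtime('125 min'),
      parse_votes('1,234,567'), parse_genres('\nAnimation, Adventure   '))
```

Each helper returns None (or an empty list) instead of raising when the text does not have the expected shape, which fits the '-' placeholder used for missing values above.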

Creating the dataset using a dictionary that maps each column name to its list.

movie_dict = {'name':name_col,
              'year':year_col,
              'certificate':certificate_col,
              'runtime':runtime_col,
              'genre':genre_col,
              'rating':imdb_rating_col,
              'metascore':metascore_col,
              'votes':votes_col,
              'gross':gross_col,
              'director':director_col,
              'actors':actor_col
              }
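pandas raises a ValueError when the lists in such a dictionary have different lengths, which can happen if one field is skipped for a movie. A quick check beforehand makes the problem easier to locate. This is a minimal sketch; check_column_lengths is a hypothetical helper and the sample values are made up.

```python
def check_column_lengths(columns):
    """Return the common length of all column lists, or raise ValueError."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # report every column so the short one is easy to spot
        raise ValueError(f'column lengths differ: {lengths}')
    return next(iter(lengths.values()), 0)


# made-up sample data with two rows per column
sample = {'name': ['First Film', 'Second Film'], 'year': ['(2001)', '(2016)']}
row_count = check_column_lengths(sample)
print(row_count)
```

Calling check_column_lengths(movie_dict) just before building the DataFrame would report which column is short instead of a generic pandas error.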

Creating the DataFrame and saving it to a CSV file.

df = pandas.DataFrame(movie_dict)
df.to_csv('feature.csv', index=True)
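Because each entry of the actors column is a Python list, the CSV file stores it as the list's text form, for example "['A', 'B']". When the file is read back, that cell is a plain string. The sketch below shows one way to turn it into a list again with the standard library; parse_actors_cell is a hypothetical helper and the names in the example are placeholders.

```python
import ast


def parse_actors_cell(cell):
    """Convert the text form of a list, as stored in the CSV, back to a list."""
    value = ast.literal_eval(cell)   # safely evaluates Python literals only
    if not isinstance(value, list):
        raise ValueError(f'expected a list, got {type(value).__name__}')
    return value


actors = parse_actors_cell("['Actor One', 'Actor Two']")
print(actors)
```

After reading the file with pandas, the helper could be applied to the whole column, for example with df['actors'].apply(parse_actors_cell).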
