Hey fellas!
Let us create a movie recommendation system based on what the user likes, using Python and the pandas library.
We will process a movie dataset downloaded from the web. It contains thousands of movies with their genres, release years, and other features. From it, we will produce a list of recommended movies, each with a score that measures how closely it matches the user's taste.
Download the movie dataset (MovieLens 25M) here: https://files.grouplens.org/datasets/movielens/ml-25m.zip
Follow the steps below to create the program.
Importing the necessary libraries and the dataset
import pandas as pd

# Load movies.csv from the extracted zip file
movies = pd.read_csv("movies.csv")
movies.head()  # view the first 5 rows of the movies DataFrame
Cleaning the movie titles using regex
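The clean_title function removes special characters and punctuation from a movie title, keeping only letters, digits, and spaces. Spaces must stay in the title, otherwise every title collapses into a single word and the vectorizer in the next step cannot split it into terms. We apply this function to all movie titles and store the cleaned titles in a new column.
import re

def clean_title(title):
    # Keep letters, digits, and spaces; drop everything else
    title = re.sub("[^a-zA-Z0-9 ]", "", title)
    return title

movies["clean_title"] = movies["title"].apply(clean_title)
movies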
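To see what the cleaning step does before running it on the whole dataset, here is a minimal standalone sketch. The function name clean_title_sketch and the two sample titles are made up for illustration, and the sketch assumes the character class keeps spaces.

```python
import re

def clean_title_sketch(title):
    # Hypothetical standalone copy of the cleaning step:
    # keep ASCII letters, digits, and spaces; drop everything else.
    return re.sub(r"[^a-zA-Z0-9 ]", "", title)

# Illustrative sample titles (not read from movies.csv)
samples = ["Toy Story (1995)", "Léon: The Professional (1994)"]
cleaned = [clean_title_sketch(t) for t in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Note that accented letters such as "é" are also removed, because the pattern only allows ASCII letters. For title search this is usually acceptable, but it is worth knowing when a search for an accented title returns unexpected matches.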
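Finding the unique terms using TfidfVectorizer
TfidfVectorizer converts the cleaned movie titles into numerical vectors using the Term Frequency-Inverse Document Frequency (TF-IDF) method. Each row of the resulting tfidf matrix represents one movie title, and each column represents a term. With ngram_range=(1,2), the terms are single words and pairs of consecutive words. Terms that are rare across all titles get a higher weight than common ones. The matrix does not store similarities itself; we compute those from it in the next step.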
from sklearn.feature_extraction.text import TfidfVectorizer

# Use single words and two-word phrases as terms
vectorizer = TfidfVectorizer(ngram_range=(1,2))
tfidf = vectorizer.fit_transform(movies["clean_title"])
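A small sketch on three made-up titles may help show what the vectorizer learns. The names toy_titles, toy_vectorizer, and toy_tfidf are illustrative and are not part of the tutorial's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three illustrative, already-cleaned titles (not read from movies.csv)
toy_titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995"]

# Same setting as the tutorial: unigrams and bigrams
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_tfidf = toy_vectorizer.fit_transform(toy_titles)

# One row per title, one column per learned term
print(toy_tfidf.shape)

# The learned terms: single words plus pairs of consecutive words
print(sorted(toy_vectorizer.vocabulary_))
```

With the default settings, TfidfVectorizer lowercases the text and ignores tokens that are only one character long, so the "2" in "Toy Story 2 1999" does not become a term of its own.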
Defining the search function
The search function performs three tasks: it cleans the input title, transforms it into a TF-IDF vector, and computes the cosine similarity between that vector and every movie title in the dataset. It then returns the five most similar movies, with the best match first.
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search(title):
    title = clean_title(title)
    query_vec = vectorizer.transform([title])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # argsort is ascending, so take the last five and reverse them for best-first order
    indices = np.argsort(similarity)[-5:][::-1]
    results = movies.iloc[indices]
    return results
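The ranking idea can be tried on a tiny example first. The sketch below is only an illustration: the four titles, the query, and the names toy_titles, scores, and ranked are all made up. It uses np.argsort to get a fully ordered top-k list.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative, already-cleaned titles (not read from movies.csv)
toy_titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995", "Heat 1995"]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_tfidf = toy_vectorizer.fit_transform(toy_titles)

# Vectorize the query with the SAME fitted vectorizer, then compare it to every title
query_vec = toy_vectorizer.transform(["Toy Story"])
scores = cosine_similarity(query_vec, toy_tfidf).flatten()

# Sort ascending, reverse to descending, keep the top_k best matches
top_k = 2
ranked = np.argsort(scores)[::-1][:top_k]

for i in ranked:
    print(toy_titles[i], round(float(scores[i]), 3))
```

Titles that share no term with the query get a similarity of exactly 0, so only the two "Toy Story" titles appear in the top results here.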
Creating an interactive search box using ipywidgets
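We create an interactive text box where users can type a movie title. The search runs every time the text changes, and results are shown once the input is longer than five characters.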
import ipywidgets as widgets
from IPython.display import display

movie_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)
movie_list = widgets.Output()

def on_type(data):
    with movie_list:
        # Clear the previous results before showing new ones
        movie_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            display(search(title))

# Call on_type whenever the text box value changes
movie_input.observe(on_type, names='value')
display(movie_input, movie_list)
Loading the “ratings” dataset
The ratings.csv file is in the same zip file linked at the beginning.
ratings = pd.read_csv("ratings.csv")
ratings.dtypes  # check the data type of each column
Finding similar movies based on users
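This function finds related movies by looking at users who rated the given movie highly. It starts by finding the users who gave the specified movie_id a rating above 4. We call them similar users. Next, it collects the other movies those similar users also rated above 4 and keeps the ones liked by more than 10% of them. For each of these movies, it then computes two shares: the share of similar users who liked it, and the share of all users who liked it. The score is the ratio of the first share to the second. A high score means the movie is much more popular among similar users than among users in general, so the movies with the highest scores are recommended. In this way, the recommendations reflect the taste of people who liked the same movie as the user.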
def find_similar_movies(movie_id):
    # Users who rated this movie above 4
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()
    # Other movies those users rated above 4
    similar_user_recs = ratings[(ratings["userId"].isin(similar_users)) & (ratings["rating"] > 4)]["movieId"]
    # Share of similar users who liked each movie; keep movies liked by more than 10%
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)
    similar_user_recs = similar_user_recs[similar_user_recs > .10]
    # Share of all users who liked each of those movies
    all_users = ratings[(ratings["movieId"].isin(similar_user_recs.index)) & (ratings["rating"] > 4)]
    all_user_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())
    # Score = share among similar users / share among all users
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ["similar", "all"]
    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]
    rec_percentages = rec_percentages.sort_values("score", ascending=False)
    # Attach titles and genres to the ten best-scoring movies
    return rec_percentages.head(10).merge(movies, left_index=True, right_on="movieId")[["score", "title", "genres"]]
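To make the score easier to follow, here is a minimal sketch of the same ratio on a tiny hand-made ratings table. The data and the names toy_ratings, target, liked, and scores are invented for illustration. The 10% filter and the merge with movie titles are left out to keep it short.

```python
import pandas as pd

# Hypothetical ratings: 5 users, 3 movies (IDs 10, 20, 30)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 5],
    "movieId": [10, 20, 10, 20, 10, 30, 30, 30],
    "rating":  [5.0, 5.0, 4.5, 5.0, 5.0, 4.5, 5.0, 5.0],
})
target = 10  # the movie the user searched for

# Only ratings above 4 count as "liked"
liked = toy_ratings[toy_ratings["rating"] > 4]

# Users who liked the target movie: users 1, 2, and 3
similar_users = liked.loc[liked["movieId"] == target, "userId"].unique()

# Share of similar users who liked each movie
similar = (
    liked[liked["userId"].isin(similar_users)]["movieId"].value_counts()
    / len(similar_users)
)

# Share of all users who liked each of those movies
all_users = liked[liked["movieId"].isin(similar.index)]
everyone = all_users["movieId"].value_counts() / all_users["userId"].nunique()

# Score = share among similar users / share among all users
scores = pd.concat([similar, everyone], axis=1, keys=["similar", "all"])
scores["score"] = scores["similar"] / scores["all"]
print(scores.sort_values("score", ascending=False))
```

In this toy data, movie 30 is liked by 3 of the 5 users overall but by only 1 of the 3 similar users, so its score falls below 1. Movie 20 is liked by 2 of the 3 similar users and by 2 of the 5 users overall, so its score is above 1 and it ranks higher.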
Setting up movie recommendation interaction
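This part combines the search and recommendation functions into one interactive system: the typed title is searched, the first search result is taken as the movie the user means, and its recommendations are displayed.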
movie_name_input = widgets.Text(
    value='Toy Story',
    description='Movie Title:',
    disabled=False
)
recommendation_list = widgets.Output()

def on_type(data):
    with recommendation_list:
        recommendation_list.clear_output()
        title = data["new"]
        if len(title) > 5:
            results = search(title)
            # Use the first search result as the movie to recommend from
            movie_id = results.iloc[0]["movieId"]
            display(find_similar_movies(movie_id))

movie_name_input.observe(on_type, names='value')
display(movie_name_input, recommendation_list)
Output:
The output is a table of up to 10 recommended movies with three columns: score, title, and genres. The movies are sorted by score, so the ones most strongly preferred by similar users appear first.