Web Scraping of Top Repositories from GitHub Topics using Beautiful Soup library

scraping-github-topics-repositories.ipynb

In this project, I scraped the top repositories from the Topics section of GitHub using the BeautifulSoup library in Python and stored them in a CSV file.

Web scraping is the automated extraction of data from web pages. It is a data mining technique: the extracted data can be analyzed, and it can also be used afterward to build machine learning models.

WARNING: Not every website on the internet permits scraping. Proceed with caution, and do not scrape sites that strictly prohibit the practice.
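One common signal of what a site allows is its robots.txt file. The following is a minimal sketch of checking a URL against robots.txt rules with Python's standard urllib.robotparser module. The helper name is_allowed and the sample rules are hypothetical and only illustrate the idea; they are not GitHub's actual rules, and a site's terms of service may impose further limits that robots.txt does not express.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules used only for illustration.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/topics"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

For a real site, the text passed to is_allowed would be the contents of that site's own /robots.txt file.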

In this project, I used the web scraping library BeautifulSoup to scrape data from https://github.com/topics and created CSV files from the extracted data. The approach is implemented as follows:

1) Import the requests library and use it to download the contents of the web page.

2) Use BeautifulSoup to parse the HTML of the web page and extract the topic title tags, description tags, and link tags.

3) Store the topic titles, descriptions, and URLs in separate lists, combine the lists in a dictionary, and build a pandas data frame from it so that all the topic information is easy to look up. Convert the data frame into a CSV file.

4) Then, for every topic, get the information about its top repositories by extracting the repo tags and the star tags for each repo.

5) Finally, wrap all of the above steps in functions and combine them into a final function that automatically extracts the information for a particular topic and writes a CSV file with all the information needed about its top repositories.
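The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names, column names, and the CSS selectors passed as arguments are all assumptions. GitHub's markup changes over time, so the selectors have to be found by inspecting the live page. The sketch also assumes that each repo tag contains two links (owner first, repository second) and that star counts appear as text such as "950" or "71.2k".

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"


def download_page(url):
    """Step 1: download a page and return its HTML text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def parse_topics(html, title_sel, desc_sel, link_sel):
    """Steps 2-3: collect topic titles, descriptions and URLs in a data frame.

    The three selectors are supplied by the caller (assumed, not GitHub's own).
    pandas raises ValueError if the three lists differ in length.
    """
    doc = BeautifulSoup(html, "html.parser")
    titles = [tag.get_text(strip=True) for tag in doc.select(title_sel)]
    descriptions = [tag.get_text(strip=True) for tag in doc.select(desc_sel)]
    urls = [BASE_URL + tag["href"] for tag in doc.select(link_sel)]
    return pd.DataFrame(
        {"title": titles, "description": descriptions, "url": urls}
    )


def parse_star_count(text):
    """Convert a star label such as '71.2k' or '950' into an integer."""
    text = text.strip().lower().replace(",", "")
    if text.endswith("k"):
        return int(round(float(text[:-1]) * 1000))
    return int(text)


def parse_repos(html, repo_sel, star_sel):
    """Step 4: extract owner, repo name, URL and stars for each repository."""
    doc = BeautifulSoup(html, "html.parser")
    rows = []
    for repo_tag, star_tag in zip(doc.select(repo_sel), doc.select(star_sel)):
        links = repo_tag.find_all("a")  # assumed order: owner link, repo link
        rows.append(
            {
                "username": links[0].get_text(strip=True),
                "repo_name": links[1].get_text(strip=True),
                "repo_url": BASE_URL + links[1]["href"],
                "stars": parse_star_count(star_tag.get_text()),
            }
        )
    return pd.DataFrame(rows)


def scrape_topic(topic_url, csv_path, repo_sel, star_sel):
    """Step 5: download one topic page and save its repositories as CSV."""
    repos = parse_repos(download_page(topic_url), repo_sel, star_sel)
    repos.to_csv(csv_path, index=False)
    return repos
```

Keeping the parsing functions separate from the download step means they can be tried on a saved HTML file before any request is sent. The topics data frame from parse_topics can be written out the same way with to_csv, and scrape_topic can then be called once per row using that row's url value.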

This is how you implement web scraping in general. You can adapt the same approach to extract whatever information you need.

Coders Packet
