This project contains Python code to scrape player data from basketball-reference.com and clean it using Scrapy, Selenium, and pandas.
This packet contains source code to obtain player details (such as games played, position, and other stats) for every player on the Brooklyn Nets from basketball-reference.com. It is written in Python using the Scrapy and Selenium libraries: the Scrapy spider requests and parses the pages, while the Selenium WebDriver launches a browser instance to render the HTML markup of dynamic web pages.
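To illustrate how the two libraries fit together, here is a minimal sketch of a spider that renders a page with Selenium and extracts fields with XPath. The spider name, team URL, table id, and data-stat attributes are assumptions for the example and may differ from the packet's actual code:

    import scrapy
    from scrapy.selector import Selector
    from selenium import webdriver

    class NetsRosterSpider(scrapy.Spider):
        name = "netsStats"  # use the spider name you generated
        start_urls = ["https://www.basketball-reference.com/teams/BRK/2021.html"]

        def parse(self, response):
            # Re-open the page in a real browser so dynamically generated markup is rendered.
            # Assumes the WebDriver executable is discoverable (e.g. on PATH); see the path example below.
            driver = webdriver.Chrome()
            driver.get(response.url)
            sel = Selector(text=driver.page_source)
            driver.quit()

            # Example XPaths against the roster table; adjust them to the data you need
            for row in sel.xpath('//table[@id="roster"]/tbody/tr'):
                yield {
                    "player": row.xpath('.//td[@data-stat="player"]//a/text()').get(),
                    "position": row.xpath('.//td[@data-stat="pos"]/text()').get(),
                }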
To use this, install the Scrapy and Selenium Python packages, then create a Scrapy project and spider (under names of your choice) with the commands "scrapy startproject {project name}" and "scrapy genspider {spider name} {website domain}". You also have to download the Selenium WebDriver executable for your browser (linked from the Selenium documentation) and place it in the same folder as the Scrapy project (you can place it in any other folder as well, but you will then have to provide the path to that file accordingly). You can then use the code to crawl the website and obtain the required data. The code given here is in the spiders folder of nbaStats, which is where the spider is created; likewise, look for your own spider in your project's spiders folder and put the code there. The simplest way to run it is from the command prompt with "scrapy crawl {spider name}".
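If the WebDriver executable is not on your PATH, one way to point Selenium at it explicitly (shown here for chromedriver with Selenium 4; the file name and location are assumptions) is:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Path to the chromedriver executable placed alongside the Scrapy project
    driver = webdriver.Chrome(service=Service("./chromedriver"))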
You can save the extracted data with the command "scrapy crawl {spider name} -o {file name}.{extension}", where the extension can be either .csv or .json. The results are also printed in the command prompt window, so you can see what kind of data you will be extracting.
The same code can be modified to obtain other data by changing the XPath expressions, or adapted to another website by changing the URLs and XPaths accordingly. Other techniques, such as creating and logging requests, are covered in Scrapy's documentation.
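For instance, to pull per-game statistics instead of roster details, only the XPath expressions inside parse() of the spider sketched above need to change. The table id and data-stat names below are assumptions based on basketball-reference's usual markup and should be checked against the live page:

    # Example XPaths for a per-game stats table instead of the roster table
    for row in sel.xpath('//table[@id="per_game"]/tbody/tr'):
        yield {
            "player": row.xpath('.//td[@data-stat="player"]//a/text()').get(),
            "games": row.xpath('.//td[@data-stat="g"]/text()').get(),
            "points_per_game": row.xpath('.//td[@data-stat="pts_per_g"]/text()').get(),
        }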
Upon obtaining the data, cleaning it before using it for any further purpose is very important. Data cleaning involves going through the obtained data file, making sure the data is well aligned and indexed (if necessary), removing outliers and NaN values, and so on. The cleaning done here removes the extra header rows that are added when the website is crawled multiple times, and drops NaN values after comparing them against the rest of the dataset. When scraping other websites, compare the obtained data with the original data on the site to work out what kind of cleaning is required; some domain knowledge also helps in identifying outliers in the scraped data.
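A minimal pandas sketch of that kind of cleaning, assuming the spider's output was exported to a CSV file (the file names and the "player" column used to spot repeated header rows are assumptions):

    import pandas as pd

    # Load the data exported with "scrapy crawl {spider name} -o players.csv"
    df = pd.read_csv("players.csv")

    # Drop repeated header rows added by multiple crawls
    # (they show up as the literal string "player" in the player column)
    df = df[df["player"] != "player"]

    # Drop rows containing NaN values and reset the index
    df = df.dropna().reset_index(drop=True)

    df.to_csv("players_clean.csv", index=False)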
The most important thing regarding web scraping is that it is not legal to scrape a website without that website's permission. Each website has a file called robots.txt, and it is very important to go through that file to know which endpoints are open to bots for crawling. In this regard, make sure your spider obeys the robots.txt file. To be on the safe side, also go through the website's terms of service to make sure you are allowed to scrape the data you need.
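Scrapy can enforce this automatically: newly generated projects ship with the ROBOTSTXT_OBEY setting enabled in the project's settings.py, and it should be left on (or switched back on if it has been changed):

    # settings.py of the Scrapy project
    ROBOTSTXT_OBEY = True  # skip requests that the site's robots.txt disallows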
Submitted by Srivatsan Sridhar (srivatsansridhar99)