Scraping HTML Tables from Websites using Python

In this tutorial, we will learn how to scrape HTML tables from websites in Python with some simple examples. You might wonder why we can't use BeautifulSoup for such a task. BeautifulSoup can do it, but extracting tables with it takes noticeably more code, so here we will use an alternative library built specifically for parsing tables.

Here we will cover the following task:

  • Scraping of HTML table from websites.

Scraping is an essential skill for getting information out of any website. Here we describe a library that makes it easy to scrape a table from almost any web page: you only have to provide the URL of the website, and the work gets done much faster.

Scraping of HTML table from websites

Select a website from which tables need to be scraped.

INSTALLATION

Proceed by installing the library:

pip install html-table-parser-python3
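
The parser returns plain Python lists, and the examples below also use pandas to turn them into DataFrames, so make sure pandas is available as well (this extra install step is only needed if pandas is not already present):

pip install pandas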

After that, get started by importing the libraries required for the task: urllib.request, pprint, html_table_parser.parser, and pandas. Then open the URL that needs to be scraped and decode it with UTF-8. Define a function that fetches the contents of a website; the url_get_contents function is called with the URL as its parameter and returns the raw page, so we only have to specify the URL of the website whose tables we need to parse. The downloaded markup is saved in xhtml and handed to the parser: the data is fed in with the feed() function, and each table row is stored as a list inside p.tables. These lists can easily be converted into a pandas DataFrame and used for further analysis. Here we use pprint, which shows the output in a nicely formatted manner.

import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser
import pandas as pd


# Fetch the raw contents of a web page.
def url_get_contents(url):
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()


# Download the page and decode it as UTF-8 text.
xhtml = url_get_contents('https://www.moneycontrol.com/india/stockpricequote/refineries/relianceindustries/RI').decode('utf-8')

# Feed the markup to the parser; every table on the page ends up in p.tables.
p = HTMLTableParser()
p.feed(xhtml)

# Pretty-print the second table found on the page.
pprint(p.tables[1])

print("\n\nPANDAS DATAFRAME\n")
print(pd.DataFrame(p.tables[1]))
OUTPUT:
[['BUY', 'SELL'],
['QTY', 'PRICE', 'PRICE', 'QTY'], 
['35', '1452.50', '1452.80', '107'], 
['200', '1452.00', '1453.35', '1'],
['1', '1451.85', '1453.40', '5'],
['25', '1451.75', '1453.50', '11'],
['391', '1451.70', '1454.00', '94'],
['0', 'Total', 'Total', '0']]

PANDAS DATAFRAME

     0      1        2           3
0   BUY   SELL     None        None
1   QTY  PRICE     PRICE       QTY
2   35   1452.50   1452.80     107
3   200  1452.00   1453.35     1
4   1    1451.85   1453.40     5
5   25   1451.75   1453.50     11
6   391  1451.70   1454.00     94
7    0   Total     Total       0
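
Continuing from the snippet above, the raw rows can be given meaningful column names before analysis. This is only a minimal sketch: the names BUY_QTY, BUY_PRICE, SELL_PRICE and SELL_QTY are assumptions based on the layout shown in the output, not something the library provides.

# A sketch (column names are assumed from the output above): skip the two
# header rows and the trailing 'Total' row, then convert the strings to numbers.
rows = p.tables[1]
df = pd.DataFrame(rows[2:-1], columns=['BUY_QTY', 'BUY_PRICE', 'SELL_PRICE', 'SELL_QTY'])
df = df.astype(float)
print(df)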

Accordingly, we now know how to scrape a table from a website. But what if we have an HTML file on our desktop and we want to scrape the table inside it?

Basically we start by creating an HTML file as follows-

<html>
<head>
    <title>Example HTML Table</title>
</head>
<body>
    <h1>Sample HTML Table</h1>
    <table border = "1">
        <tr>
            <th>NAME</th>
            <th>AGE</th>
            <th>GENDER</th>
        </tr>
        <tr>
            <td>Cherry</td>
            <td>22</td>
            <td>M</td>
        </tr>
        <tr>
            <td>Sri</td>
            <td>20</td>
            <td>F</td>
        </tr>
        <tr>
            <td>Krish</td>
            <td>15</td>
            <td>M</td>
        </tr>
        </table>
</body>
</html>

For instance, you can create the HTML file in Notepad and save it with a .html extension. Thereafter, you can open it in the browser to check that it renders correctly: you should see the heading followed by the three-row table.

After that, we do the same thing with the local HTML file:

import urllib.request
from pprint import pprint
from html_table_parser.parser import HTMLTableParser
import pandas as pd


# Fetch the raw contents of a page; file:// URLs work as well as http(s)://.
def url_get_contents(url):
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
    return f.read()


# Read and decode the local HTML file via a file:// URL.
xhtml = url_get_contents('file:///C:/Users/sanke/OneDrive/Desktop/Untitled-1.html').decode('utf-8')

# Parse it and print the first (and only) table.
p = HTMLTableParser()
p.feed(xhtml)

pprint(p.tables[0])
print(pd.DataFrame(p.tables[0]))
OUTPUT:
[['NAME', 'AGE', 'GENDER'],
 ['Cherry', '22', 'M'],
 ['Sri', '20', 'F'],
 ['Krish', '15', 'M']]

        0    1       2
0    NAME  AGE  GENDER
1  Cherry   22       M
2     Sri   20       F
3   Krish   15       M
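
If you want to keep a parsed table for later analysis, one option is to write it out as CSV using the first parsed row as the header. This is just a sketch continuing from the snippet above; the file name people.csv is made up for illustration.

# A sketch: use the first parsed row as the header and save the rest to CSV.
rows = p.tables[0]
pd.DataFrame(rows[1:], columns=rows[0]).to_csv('people.csv', index=False)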

 

We have scraped tables both from a live website and from a local HTML file, and the output contains the relevant information in a form that is easy to analyse further. I hope this helps.

 
