Coders Packet

Webpage Text Scraper using Python

By Shivang Kohli

I have used Python to check an entered URL and extract all the textual content from the webpage at that URL.

I have used Python to create a tool that lets you get all the text content from a webpage. The program takes the URL as its input.

It uses urlopen from the urllib.request module, the validators and requests libraries, and the BeautifulSoup class from the bs4 library.

To use this program, you must first install these libraries:

>> pip3 install validators

>> pip3 install requests

>> pip3 install beautifulsoup4

The program first checks that the URL is well formed using the validators.url() method, which returns a truthy value only for a valid URL. If the URL is valid, the program then checks that it is actually reachable on the internet using requests.get(); if it is not, an exception is raised.

It then calls urlopen(url).read() to read the raw HTML content from the URL. This content is passed to BeautifulSoup, which parses it into an HTML tree.
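A minimal sketch of this fetch-and-parse step follows. The helper names fetch_html and parse_html are hypothetical, and the built-in "html.parser" backend is an assumption, since the text does not say which parser the program uses.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download the page and return its raw HTML as bytes."""
    with urlopen(url) as response:
        return response.read()


def parse_html(html):
    """Parse raw HTML (bytes or str) into a BeautifulSoup tree.

    "html.parser" is Python's built-in backend, chosen here as an assumption.
    """
    return BeautifulSoup(html, "html.parser")
```

For example, parse_html(fetch_html("https://www.wikipedia.org/")) would return a tree whose elements can be accessed as attributes, such as soup.title.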

After removing the style content from the parsed HTML, the program reads all the remaining text. It then splits that text into lines, strips the trailing spaces from each line, and drops the blank lines, leaving the formatted textual content of the site.
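The clean-up described above might be sketched as follows. The function name extract_text is hypothetical. Two further assumptions: script elements are removed along with style elements, and str.strip() is used, which trims leading as well as trailing whitespace.

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return the visible text of an HTML document, one non-blank line per row."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop <style> elements so CSS rules do not end up in the output.
    # Removing <script> as well is an assumption, not stated in the text.
    for tag in soup(["style", "script"]):
        tag.decompose()

    text = soup.get_text()

    # Split into lines and trim the whitespace around each one.
    lines = (line.strip() for line in text.splitlines())

    # Keep only the lines that still contain something.
    return "\n".join(line for line in lines if line)
```

A short usage example with an inline HTML string:

```python
html = "<html><head><style>p{color:red}</style></head><body><p>  Hello  </p>\n\n<p>World</p></body></html>"
print(extract_text(html))
# Hello
# World
```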

For instance, for the URL https://www.wikipedia.org/ we get the following output:

[Image: Output]

