Image Fetcher python script - fetch images from static HTML pages

process_htmls.py

Fetches images from static HTML pages. The input can be HTML page file names, HTML links, or the name of a directory containing HTML pages; the images found in each page are saved to an output directory. Developed in Python.

This project implements an image fetcher that parses HTML pages and downloads the images found in them to an output directory. The project can be understood through the pipeline shown below.

Pre-Parse HTML Pages ------\
                            +----> Fetch HTML Page ------> Parse HTML Page ----> Filter Image Links -----> Fetch Images
Pre-Parse HTML Links ------/

As can be seen, there are two ways to start the pipeline: one takes HTML files as input, the other takes HTML links. Once the pipeline is started, the next step is to fetch the HTML page, either from disk or from the internet. The page is then passed through the parser, and the links it outputs are passed through the image link filter. Finally, the remaining links are sent to the fetch image function.
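The parse and filter stages described above can be sketched with the Python standard library alone. This is only an illustration, not the code in process_htmls.py: the names LinkCollector, parse_html_page, filter_image_links and the IMAGE_EXTENSIONS list are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed list of image extensions; the real script may use a different one.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp", ".svg")


class LinkCollector(HTMLParser):
    """Collects every src/href attribute value found in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.links.append(value)


def parse_html_page(html_text, base_url=""):
    """Parse stage: return all links in the page, resolved against base_url."""
    collector = LinkCollector()
    collector.feed(html_text)
    return [urljoin(base_url, link) for link in collector.links]


def filter_image_links(links):
    """Filter stage: keep only links whose path ends in an image extension."""
    return [
        link
        for link in links
        if urlparse(link).path.lower().endswith(IMAGE_EXTENSIONS)
    ]
```

The filter looks only at the path part of each link, so a query string such as "?size=2" does not hide the extension. Images served from URLs without a file extension would be missed by this simple check.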

Each stage of the pipeline is run by a separate worker thread, so the stages work in parallel.
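One common way to run pipeline stages in separate threads is to connect them with queues. The sketch below assumes one thread per stage and a sentinel object to signal shutdown; run_stage, run_pipeline and _STOP are hypothetical names, and the actual script may organise its threads differently.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage there is no more work


def run_stage(work, inbox, outbox):
    """Apply `work` to each item from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            break
        outbox.put(work(item))


def run_pipeline(items, stages):
    """Run `stages` (one-argument functions) as a chain of worker threads."""
    # One queue in front of each stage, plus one for the final results.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_STOP)
    for thread in threads:
        thread.join()
    # Drain the last queue up to the sentinel.
    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    return results
```

With this layout a later stage can start working on the first page while an earlier stage is still handling the next one, which is the benefit the pipeline design is after.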

The script supports the following command-line arguments:

--from_files_name : name of a file containing the locations of HTML pages on disk

--from_html_links : name of a file containing HTML links

--from_dir_name : name of a directory containing HTML pages

--output_dir : name of the directory in which to save the fetched images

--worker_threads : number of parallel threads to use

At least one of the first three arguments must be passed. If two are passed, one is chosen according to the order --from_files_name, --from_dir_name, --from_html_links.
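The argument handling and the selection order can be sketched with argparse. The flag names come from the list above; the default values, the help texts and the function names build_parser and choose_input are assumptions made for this example, not taken from the script.

```python
import argparse

# Selection order stated above: files first, then directory, then links.
INPUT_PRIORITY = ("from_files_name", "from_dir_name", "from_html_links")


def build_parser():
    parser = argparse.ArgumentParser(prog="process_htmls.py")
    parser.add_argument("--from_files_name", help="file listing HTML pages on disk")
    parser.add_argument("--from_html_links", help="file listing HTML links")
    parser.add_argument("--from_dir_name", help="directory containing HTML pages")
    parser.add_argument("--output_dir", default="images")  # assumed default
    parser.add_argument("--worker_threads", type=int, default=4)  # assumed default
    return parser


def choose_input(args):
    """Return (argument name, value) for the first input argument that was passed."""
    for name in INPUT_PRIORITY:
        value = getattr(args, name)
        if value is not None:
            return name, value
    raise SystemExit(
        "pass at least one of --from_files_name, --from_dir_name, --from_html_links"
    )
```

For example, parsing "--from_html_links links.txt --from_dir_name pages" and calling choose_input on the result selects the directory, because --from_dir_name comes before --from_html_links in the order.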
