Fetches images from static HTML pages. Given a file listing HTML page locations, a file listing HTML links, or a directory containing HTML pages, it saves the images found in each page into an output directory. Developed in Python.
This project implements an image fetcher that parses HTML pages and saves the images found in them to an output directory. The project is organized as the pipeline shown below.
Pre-Parse HTML Pages -----
                         |
Pre-Parse HTML Links ------> Fetch HTML Page ------> Parse HTML Page ----> Filter Image Links -----> Fetch Images
As the diagram shows, the pipeline can start in two ways: from HTML files or from HTML links. The next step fetches each HTML page, either from disk or from the internet. The page is then passed through the parser, and the links it outputs go through the image link filter. Finally, the remaining links are sent to the image fetch function.
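The parse and filter steps can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the names `LinkCollector`, `parse_links`, `filter_image_links`, and the `IMAGE_EXTENSIONS` list are assumptions made for this example, and the real filter may use a different rule than matching file extensions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical list of extensions treated as images (an assumption).
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".svg", ".webp")


class LinkCollector(HTMLParser):
    """Collects every src/href attribute value found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.links.append(value)


def parse_links(html_text):
    """Parse step: return all links in the page, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


def filter_image_links(links, base_url=""):
    """Filter step: keep links whose URL path ends in an image extension.

    Relative links are resolved against base_url before the check.
    """
    images = []
    for link in links:
        absolute = urljoin(base_url, link)
        if urlparse(absolute).path.lower().endswith(IMAGE_EXTENSIONS):
            images.append(absolute)
    return images
```

Checking the URL path rather than the raw string means a link such as `/img/b.JPG?x=1` is still recognized as an image despite the query string and the upper-case extension.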
Each stage of the pipeline is run by a separate worker thread, so the stages work in parallel.
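One common way to run pipeline stages in separate threads is to connect them with queues. The sketch below assumes one thread per stage and a sentinel object to signal the end of input. The names `run_stage`, `run_pipeline`, and `_STOP` are hypothetical, and the project may distribute its worker threads differently.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the input stream


def run_stage(func, inbox, outbox):
    """Apply func to each item from inbox and forward every result to outbox.

    func must return an iterable, so one input can yield zero or more outputs
    (for example, one page yields many image links).
    """
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # pass the shutdown signal downstream
            break
        for result in func(item):
            outbox.put(result)


def run_pipeline(stages, inputs):
    """Chain stages with queues, run each in its own thread, collect the output."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for item in inputs:
        queues[0].put(item)
    queues[0].put(_STOP)
    for thread in threads:
        thread.join()
    # Drain the last queue up to the sentinel.
    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    return results
```

With this layout a slow stage, such as downloading images, does not block the parsing of pages that have already been fetched.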
The script supports the following command-line arguments:
--from_files_name : name of a file listing the locations of HTML pages on disk
--from_html_links : name of a file listing HTML links
--from_dir_name : name of a directory containing HTML pages
--output_dir : name of the directory in which to save fetched images
--worker_threads : number of parallel threads to use
At least one of the first three arguments must be passed. If more than one is passed, only one is used, chosen in the order --from_files_name, --from_dir_name, --from_html_links.
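The argument handling described above could look roughly like this with `argparse`. Only the option names and the precedence order come from the description. The function names `build_parser` and `choose_source`, the help texts, and the absence of default values are assumptions made for this sketch.

```python
import argparse

# Precedence order stated in the description above.
SOURCE_PRIORITY = ("from_files_name", "from_dir_name", "from_html_links")


def build_parser():
    parser = argparse.ArgumentParser(description="Fetch images from static HTML pages")
    parser.add_argument("--from_files_name", help="file listing HTML page locations on disk")
    parser.add_argument("--from_html_links", help="file listing HTML links")
    parser.add_argument("--from_dir_name", help="directory containing HTML pages")
    parser.add_argument("--output_dir", help="directory to save fetched images")
    parser.add_argument("--worker_threads", type=int, help="number of parallel threads")
    return parser


def choose_source(args):
    """Return (option_name, value) for the highest-priority source that was passed."""
    for name in SOURCE_PRIORITY:
        value = getattr(args, name)
        if value is not None:
            return name, value
    raise SystemExit(
        "pass at least one of --from_files_name, --from_dir_name, --from_html_links"
    )
```

For example, parsing `--from_html_links links.txt --from_dir_name pages` and calling `choose_source` selects `from_dir_name`, because it comes before `from_html_links` in the precedence order.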
Submitted by Aakash sharma (aakash140799)
Download packets of source code on Coders Packet