The goal of this project is to automate the scraping of auto parts information. The pipeline works as follows:
- Download all the .xml files linked in the sitemap.xml file of a website that sells car parts;
- Process those .xml files to build a list of the URLs of every car-parts page on that website;
- Use that list to download the HTML contents of each of those pages;
- Process those HTML files into raw JSONs, which contain general information that might be interesting to us;
- Process those raw JSONs to extract the specific data we want, generating refined JSONs;
- Use those refined JSONs to generate .csv files that contain the information we actually want, in the structure we want it;
- Success!
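The first two steps boil down to collecting every `<loc>` entry from the sitemap files. A minimal sketch of that extraction, using only the standard library (the sample XML below is hypothetical, standing in for a downloaded sitemap; the actual project code may differ):

```python
import xml.etree.ElementTree as ET

# Sitemap files use this namespace, per the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]

# Tiny inline example standing in for one of the downloaded .xml files.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/parts/brake-pad-123</loc></url>
  <url><loc>https://example.com/parts/oil-filter-456</loc></url>
</urlset>"""

urls = extract_urls(sample)
```

The same function works for a sitemap index, since index files also list their children in `<loc>` elements.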
Run the whole pipeline with `make pecahoje`.
Eventually you may need to set up a proxy. If so, add it to the corresponding Makefile variable, or remove the proxy argument from the request call in html_download.py.
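As a rough illustration of how an optional proxy can be wired into the download step, here is a standard-library sketch (the proxy address is a placeholder, and `build_opener_for` is a hypothetical helper, not the actual code in html_download.py):

```python
import urllib.request

# Placeholder proxy address: replace with your own, or pass None
# to skip the explicit proxy configuration entirely.
PROXY = "http://127.0.0.1:8080"

def build_opener_for(proxy):
    """Return an opener routing traffic through `proxy`, or a default one."""
    if proxy:
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)
    return urllib.request.build_opener()

opener = build_opener_for(PROXY)
# Pages would then be fetched with opener.open(url).read()
```

Keeping the proxy in one variable (here `PROXY`, in the project a Makefile variable) makes it easy to swap or drop without touching the download logic.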