The goal of this project is to automate the scraping of auto parts information. The pipeline works as follows:
- Download all the .xml files linked in the sitemap.xml file of a website that sells car parts;
- Process those .xml files to build a list of the URLs of every car-parts page on that website;
- Use that list to download the HTML contents of each of those pages;
- Process those HTML files into raw JSONs, which contain general information that might be interesting to us;
- Process those raw JSONs to extract the specific data we want, generating refined JSONs;
- Use those refined JSONs to generate .csv files that contain the information we actually want, in the structure we want it;
- Success!
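The first two steps boil down to collecting every `<loc>` entry from the sitemap files. A minimal sketch of that extraction, using only the standard library (the sample XML below is hypothetical, standing in for a downloaded sitemap; the actual project code may differ):

```python
import xml.etree.ElementTree as ET

# Sitemap files use this namespace, per the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]

# Tiny inline example standing in for one of the downloaded .xml files.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/parts/brake-pad-123</loc></url>
  <url><loc>https://example.com/parts/oil-filter-456</loc></url>
</urlset>"""

urls = extract_urls(sample)
```

The same function works for a sitemap index, since index files also list their children in `<loc>` elements.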
Run the whole pipeline with `make pecahoje`.
Eventually you may need to set up a proxy. If so, add it to the corresponding Makefile variable, or remove the proxy argument from the request call in html_download.py.
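As a rough illustration of how an optional proxy can be wired into the download step, here is a standard-library sketch (the proxy address is a placeholder, and `build_opener_for` is a hypothetical helper, not the actual code in html_download.py):

```python
import urllib.request

# Placeholder proxy address: replace with your own, or pass None
# to skip the explicit proxy configuration entirely.
PROXY = "http://127.0.0.1:8080"

def build_opener_for(proxy):
    """Return an opener routing traffic through `proxy`, or a default one."""
    if proxy:
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)
    return urllib.request.build_opener()

opener = build_opener_for(PROXY)
# Pages would then be fetched with opener.open(url).read()
```

Keeping the proxy in one variable (here `PROXY`, in the project a Makefile variable) makes it easy to swap or drop without touching the download logic.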