
Retrieve data from Alibaba E-Commerce and save as spreadsheet

Primary Language: Python

alibaba_scraper

Alibaba is one of the largest and most popular e-commerce platforms in the world. It holds an enormous amount of data about products, suppliers, and the market in general. This project collects that data by scraping: a Python script retrieves it and stores the result as a spreadsheet file.

Technique

Many tools and libraries can be used to crawl a website's content. In this project we combine Selenium and Requests. Selenium acts as the crawler that collects product links, and Requests scrapes each product's detail page using the links that Selenium gathered.
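The division of labor between the two libraries can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the helper names `collect_product_links`, `fetch_product_page`, and `demo`, as well as the `/product-detail/` link filter, are assumptions made for the example.

```python
def collect_product_links(driver, listing_url, marker="/product-detail/"):
    """Open a listing page in the browser and return the product links on it."""
    driver.get(listing_url)
    links = []
    # "css selector" is the string value behind selenium's By.CSS_SELECTOR.
    for anchor in driver.find_elements("css selector", "a[href]"):
        href = anchor.get_attribute("href")
        # Keep only links that look like product pages (assumed URL pattern),
        # and skip duplicates while preserving the order they appear in.
        if href and marker in href and href not in links:
            links.append(href)
    return links


def fetch_product_page(session, url, timeout=30):
    """Download one product page over plain HTTP and return its HTML."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def demo():
    """Hypothetical end-to-end run; needs selenium, requests and chromedriver."""
    import requests
    from selenium import webdriver

    # Assumes chromedriver is discoverable by Selenium (for example, on PATH).
    driver = webdriver.Chrome()
    try:
        links = collect_product_links(
            driver, "https://www.alibaba.com/products/jewelry.html"
        )
    finally:
        driver.quit()

    with requests.Session() as session:
        for link in links:
            html = fetch_product_page(session, link)
            print(link, len(html))
```

Only the link collection needs a real browser here. The detail pages are fetched with plain HTTP requests, which is what makes it practical to run them in parallel.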

Optimization

All of the requests run as workers in concurrent threads, which is faster than fetching pages one at a time in a single thread. Selenium runs in the main process of the application, and the requests run as threads.
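One common way to run requests as worker threads is a thread pool from the standard library. The sketch below assumes that approach; the actual script may manage its threads differently, and `scrape_all` and its `fetch` argument are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(links, fetch, workers=5):
    """Fetch every link on a pool of worker threads, keeping the input order.

    `fetch` is any callable that takes a URL and returns the scraped data.
    Each result is a (link, value) pair; if the fetch failed, `value` is the
    exception, so one bad page does not abort the whole run.
    """

    def safe_fetch(link):
        try:
            return link, fetch(link)
        except Exception as exc:
            return link, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as `links`,
        # even though the downloads themselves overlap in time.
        return list(pool.map(safe_fetch, links))
```

Threads suit this workload because each worker spends most of its time waiting on the network, so several downloads can be in flight at once.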

How to Use

  1. Install Python 3
https://www.python.org/downloads/
  2. Install all libraries listed in requirements.txt
pip3 install -r requirements.txt
  3. Download chromedriver (choose the release that matches your installed Chrome version)
https://chromedriver.chromium.org/downloads

or directly for v80.0.3987.16:
https://chromedriver.storage.googleapis.com/index.html?path=80.0.3987.16/

After the download completes, place the executable in the same folder as this script.

  4. Run main.py to start scraping
python main.py directory filename url_target worker page_start page_end

directory : name of the working folder; write "." for the current folder
filename : base name for the temporary files and the result spreadsheet (e.g. "alibaba" produces "alibaba.xlsx")
url_target : URL to scrape; must be wrapped in double quotes, like "https://gzhengdian.en.alibaba.com/productlist.html?spm=a2700.icbuShop.41413.45.2ded68dbUnr74k"
worker : number of threads used to send the requests (e.g. 5)
page_start : page number to start scraping from
page_end : page number to stop at; write none to scrape until the last page of results
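For readers who want to see how six positional arguments like these could be read, here is a hedged sketch using `argparse`. It is not taken from main.py, which may parse `sys.argv` differently; `build_parser` and `parse_page_end` are hypothetical helpers.

```python
import argparse


def parse_page_end(value):
    """Turn the page_end argument into an int, or None for 'until the end'."""
    if value.lower() == "none":
        return None
    return int(value)


def build_parser():
    """Build a parser for the six positional arguments described above."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("directory", help='working folder, "." for the current one')
    parser.add_argument("filename", help="base name for temp files and the .xlsx result")
    parser.add_argument("url_target", help="listing URL, quoted on the command line")
    parser.add_argument("worker", type=int, help="number of request threads")
    parser.add_argument("page_start", type=int, help="first page to scrape")
    parser.add_argument("page_end", type=parse_page_end, help='last page, or "none"')
    return parser
```

With this sketch, `python main.py . alibaba "https://www.alibaba.com/products/jewelry.html" 5 1 none` would parse to `worker=5`, `page_start=1`, and `page_end=None`.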

There are two types of pages:

  1. gallery-like : https://www.alibaba.com/products/jewelry.html
  2. hosted shop : https://gzhengdian.en.alibaba.com/productlist.html
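Judging from the two example URLs, the page types could be told apart by hostname alone. The sketch below rests on that assumption: the rule (main site host means gallery, any other `alibaba.com` subdomain means hosted shop) and the `page_type` name are hypothetical, and the script itself may use a different test.

```python
from urllib.parse import urlparse


def page_type(url):
    """Guess whether a listing URL is a gallery page or a hosted shop page."""
    host = (urlparse(url).hostname or "").lower()
    # The main site serves the gallery-like search listings.
    if host in ("alibaba.com", "www.alibaba.com"):
        return "gallery"
    # Supplier storefronts live on their own subdomains.
    if host.endswith(".alibaba.com"):
        return "hosted shop"
    raise ValueError(f"not an Alibaba listing URL: {url}")
```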

Output

Screenshot Run