Use Scrapy in Python to crawl: https://mall.ikang.com/
This is an advanced web-crawling example because the website has anti-crawler mechanisms.
Thus, we use the Selenium package together with Scrapy.
Selenium can simulate browser actions such as opening the browser, scrolling, clicking a button, and switching back and forth among tabs.
The key feature to use in Selenium is the driver, which "directs" the actions mentioned above (see the sketch after the setup steps below).
- Install the Google Chrome browser. It's the most commonly used browser with Selenium and probably has the best performance.
- Install chromedriver from: https://chromedriver.chromium.org/downloads
NOTICE: You need to check the exact version of your Chrome and download the corresponding driver. You can find the version under Settings -> About Google Chrome.
- Unzip the file and put it into your project folder.
- Install the scrapy and selenium packages and make sure they function in your project environment.
Make sure you put the unzipped chromedriver executable in the correct place; otherwise you will get an error when initializing the driver.
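As a rough illustration (not this project's actual spider code), here is a minimal sketch of driver-directed actions. It assumes Selenium 3.x, where webdriver.Chrome takes an executable_path argument (Selenium 4 replaces it with a Service object); the chromedriver path and the CSS selector are placeholders:

```python
# Minimal sketch of driver-directed browser actions, assuming Selenium 3.x.
# The chromedriver path and the CSS selector below are placeholders.
from selenium import webdriver

driver = webdriver.Chrome(executable_path='./chromedriver')
driver.get('https://mall.ikang.com/')  # open the page in a real browser

# Scroll to the bottom of the page (e.g., to trigger lazy-loaded content).
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

# Click an element found by CSS selector (placeholder selector).
driver.find_element_by_css_selector('a.product-link').click()

# Switch to the newly opened tab, then back to the original one.
driver.switch_to.window(driver.window_handles[-1])
driver.switch_to.window(driver.window_handles[0])

driver.quit()  # close the browser when done
```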
The STARTER.py file is just for convenience: instead of typing the crawl command in the terminal, you can simply run this file in your IDE.
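For reference, such a starter file is typically just a thin wrapper around Scrapy's command line; a minimal sketch, where the spider name 'ikang' is an assumption:

```python
# STARTER.py -- run the crawl from the IDE instead of the terminal.
from scrapy import cmdline

# Equivalent to running `scrapy crawl ikang` in the terminal;
# the spider name 'ikang' is an assumption.
cmdline.execute('scrapy crawl ikang'.split())
```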
- I use PostgreSQL as a local database and want to save the data into the table I created. Each Scrapy item becomes one row (a piece of data) in the table.
- You can either execute a simple SQL command to create the table (see the first sketch after this list), or create it manually in PostgreSQL visualization software like Navicat (it's paid, but super powerful).
- Remember to set the data type of each field in your table to varchar.
- Most importantly, the 'exams' field (which holds all medical examinations included in a product) must be long enough, 1023 characters in my case; otherwise you will get an error when writing data.
- After creating the table, we need to modify pipelines.py to enable data writing (see the second sketch after this list). The open_spider() and close_spider() functions simply connect to and disconnect from your DB, and process_item() writes each item into the DB.
- Last step: enable the pipeline feature by uncommenting those 3 lines in settings.py.
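For reference, here is a minimal sketch of creating such a table with psycopg2; the connection parameters, table name, and all fields other than 'exams' are assumptions, so adjust them to your own schema:

```python
# Minimal sketch: create the table via psycopg2.
# Connection parameters and the schema below are assumptions.
import psycopg2

conn = psycopg2.connect(host='localhost', dbname='ikang',
                        user='postgres', password='password')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS products (
        name  VARCHAR(255),
        price VARCHAR(255),
        exams VARCHAR(1023)  -- long enough for the full exam list
    );
""")
conn.commit()
cur.close()
conn.close()
```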
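And a minimal sketch of such a pipelines.py; the open_spider/close_spider/process_item hooks are Scrapy's standard pipeline interface, while the class name, connection parameters, and item fields other than 'exams' are assumptions:

```python
# pipelines.py -- a minimal sketch of the pipeline described above.
# Connection parameters, table name, and most item fields are assumptions.
import psycopg2

class IkangPipeline:
    def open_spider(self, spider):
        # Connect to the local PostgreSQL database when the crawl starts.
        self.conn = psycopg2.connect(host='localhost', dbname='ikang',
                                     user='postgres', password='password')
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        # Disconnect cleanly when the crawl ends.
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        # Write one item as one row in the table.
        self.cur.execute(
            "INSERT INTO products (name, price, exams) VALUES (%s, %s, %s)",
            (item.get('name'), item.get('price'), item.get('exams')),
        )
        self.conn.commit()
        return item
```

Uncommenting those lines in settings.py then amounts to something like the following (the module path and priority are assumptions):

```python
ITEM_PIPELINES = {
    'ikang.pipelines.IkangPipeline': 300,  # module path is an assumption
}
```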
Special thanks to Harry Wang (https://github.com/harrywang)
for his fantastic five-part end-to-end tutorial, https://towardsdatascience.com/a-minimalist-end-to-end-scrapy-tutorial-part-i-11e350bcdec0,
and his scrapy-selenium-demo, https://github.com/harrywang/scrapy-selenium-demo