Help in running the code
Hi @dynamohuang, I am new to Python, Scrapy, and MySQL. I have finished setting up the working environment for running your code (i.e., a MySQL backend, and I ran the scripts under amazon/db to create the tables accordingly).
However, how can I start scraping the best sellers' ASINs? What I do now is run main.py, but it fetches 0 items. Would you please walk me through the working process?
Hey, I figured out how to run the different spiders now.
However, another problem came up: when I run the spider named 'cate' to grab all the best sellers' ASINs, the response in the parse callback yields an empty list:

```python
if response.meta['level'] == 1:
    list = response.css('#zg_browseRoot ul')[0].css('li a')  # (cate_spider.py line 26)
```
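For context, here is roughly how that callback uses the selector once the quoting is fixed (a simplified sketch, not the exact repo code; the "Robot Check" guard is my own addition):

```python
def parse(self, response):
    # Amazon serves a "Robot Check" page instead of the category tree
    # when it blocks a client, so bail out early in that case.
    title = response.css('title::text').extract_first(default='')
    if 'Robot Check' in title:
        self.logger.warning('Blocked by Amazon: %s', response.url)
        return
    if response.meta['level'] == 1:
        # Top-level best-seller categories in the left-hand browse tree.
        for link in response.css('#zg_browseRoot ul')[0].css('li a'):
            yield response.follow(link, callback=self.parse, meta={'level': 2})
```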
To see what was going on, I opened the Scrapy shell and printed the result of the following CSS query:

```python
response.css('title::text').extract()
```

and got "Robot Check".
Does this mean I was blocked by Amazon?
If so, do you think it is because I used the proxies from proxy.json, as you do?
Thank you in advance.
Thanks for raising this issue.
- If you get "Robot Check", it means you have been blocked by Amazon (usually Amazon blocks you for about 10 minutes).
- proxy.json contains some free IP proxies scraped from http://fineproxy.org/ by https://github.com/dynamohuang/amazon-scrapy/blob/master/amazon/amazon/spiders/proxy/fineproxy_spider.py. Unfortunately, these proxies are often unstable (maybe too many people use them).
- If you have your own IP pool, you can export it to proxy.json and enable the proxy middleware in settings.py; then everything will be OK (a sketch of such a middleware follows the snippet below).
- If you do not have your own IP pool, do not run the spiders too frequently, and disable the proxy middleware in settings.py (it is enabled by default):
```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'amazon.middlewares.RotateUserAgentMiddleware.RotateUserAgentMiddleware': 543,
    # 'amazon.middlewares.ProxyMiddleware.ProxyMiddleware': 542
}
```
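If you do wire up your own pool, here is a minimal sketch of what a proxy middleware like this typically does (the repo's actual ProxyMiddleware may differ, and the proxy.json layout shown, a JSON list of "host:port" strings, is an assumption):

```python
import json
import random


class ProxyMiddleware(object):
    """Assign a random proxy from proxy.json to every outgoing request."""

    def __init__(self):
        # Assumed layout: ["1.2.3.4:8080", "5.6.7.8:3128", ...]
        with open('proxy.json') as f:
            self.proxies = json.load(f)

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy'].
        request.meta['proxy'] = 'http://' + random.choice(self.proxies)
```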
Sorry for not completing the docs.
Thank you very much for the reply! I have now successfully updated the proxy list, and everything is running smoothly! Would you mind sharing the order in which the spiders should be run?
My current order is "cate -> asin -> details". How about the keyword and sales spiders?
Yeah.
cate -> asin -> details is the right order (a way to run them in sequence is sketched below).
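For example, one way to run them back to back from a single script, following the pattern from the Scrapy docs for sequential crawls (the spider names are assumed from the discussion above):

```python
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())


@defer.inlineCallbacks
def crawl():
    # Each spider starts only after the previous one has finished.
    yield runner.crawl('cate')    # best-seller category tree
    yield runner.crawl('asin')    # ASINs per category
    yield runner.crawl('detail')  # product details per ASIN
    reactor.stop()


crawl()
reactor.run()
```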
In the detail spider, the scraped results are stored in the variable self.product_pool.
Note that the detail spider only fetches the top 300 ASINs from the asin table, as a trial run.
If you want to scrape more, you can change the function findall_asin_level1 in sql.py:
```python
@classmethod
def findall_asin_level1(cls):
    sql = "SELECT distinct(asin), cid FROM py_asin_best limit 0,300"
    cursor.execute(sql)
    return cursor.fetchall()
```
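For example, dropping the LIMIT clause fetches every stored ASIN (my variation; adjust it to your needs and mind the request volume):

```python
@classmethod
def findall_asin_level1(cls):
    # Same query without "limit 0,300", so every distinct ASIN is returned.
    sql = "SELECT distinct(asin), cid FROM py_asin_best"
    cursor.execute(sql)
    return cursor.fetchall()
```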