dynamohuang/amazon-scrapy

Help in running the code


Hi @dynamohuang, I am new to Python, Scrapy, and MySQL. I have finished setting up the working environment for running your code (i.e. a MySQL backend, with the scripts under amazon/db run to create the tables).

However, how can I start scraping the best sellers' ASINs? Right now I run main.py, but it fetches 0 items. Could you please walk me through the workflow?

Hey, I have figured out how to run the different spiders now.
However, another problem occurs: when I run the spider named 'cate' to grab all best-seller ASINs, the response in the parse callback yields an empty list:

if response.meta['level'] == 1:
    list = response.css('#zg_browseRoot ul')[0].css('li a')  # cate_spider.py line 26

To make sure everything works, I opened the debugging console and printed the following CSS query:

response.css('title::text').extract()

and got "Robot Check".
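For anyone hitting the same wall, the block can be confirmed programmatically with a small title check (a hypothetical helper, not part of the repo):

```python
def is_robot_check(title_text):
    """Return True if Amazon served its robot-check page instead of real content.

    Hypothetical helper: looks for the "Robot Check" marker in the
    extracted <title> text of the response.
    """
    return bool(title_text) and "robot check" in title_text.lower()
```

In a spider callback this could guard the parse logic, e.g. `if is_robot_check(response.css('title::text').extract_first()): return`.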

Does this mean I was blocked by Amazon?

If so, do you think it is because I used the proxies from your proxy.json?

Thank you in advance

Thanks for your issue.

  1. Yes, getting "Robot Check" means Amazon has blocked you (usually the block lasts about 10 minutes).
  2. proxy.json contains some free IP proxies scraped from http://fineproxy.org/ by https://github.com/dynamohuang/amazon-scrapy/blob/master/amazon/amazon/spiders/proxy/fineproxy_spider.py. Unfortunately, these free proxies are often unstable (probably because too many people use them).
  3. If you have your own IP pool, export it to proxy.json and enable the proxy middleware in setting.py; then everything should work.
  4. If you do not have your own IP pool, do not run the spiders too frequently, and leave the proxy middleware disabled in setting.py (it is disabled by default).

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'amazon.middlewares.RotateUserAgentMiddleware.RotateUserAgentMiddleware': 543,
    # 'amazon.middlewares.ProxyMiddleware.ProxyMiddleware': 542
}
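For context, a proxy middleware like the one referenced above could look roughly like this. This is a hedged sketch only: the repo's actual ProxyMiddleware and the real proxy.json format may differ; here I assume proxy.json is a JSON list of `"host:port"` strings.

```python
import json
import random


class SimpleProxyMiddleware:
    """Sketch of a rotating proxy middleware.

    Assumption: proxy.json holds a JSON list such as
    ["1.2.3.4:8080", "5.6.7.8:3128"] -- the repo's real format may differ.
    """

    def __init__(self, path="proxy.json"):
        with open(path) as f:
            self.proxies = json.load(f)

    def process_request(self, request, spider):
        # Pick a random proxy for each outgoing request.
        if self.proxies:
            request.meta["proxy"] = "http://" + random.choice(self.proxies)
```

Scrapy calls `process_request` for every outgoing request once the middleware is enabled in DOWNLOADER_MIDDLEWARES.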
Sorry, the documentation is not complete yet.

Thank you very much for the reply! I have updated the proxy list and everything is running smoothly now. Would you mind sharing the order in which the spiders should be run?

My current order is "cate -> asin -> details". Where do the keyword and sales spiders fit in?

Yes, cate -> asin -> details is the right order.
In the detail spider, the scraped results are stored in the variable self.product_pool.
The detail spider only fetches the top 300 ASINs from the asin table as a trial run.
If you want to scrape more, change the function findall_asin_level1 in sql.py:

@classmethod
def findall_asin_level1(cls):
    sql = "SELECT distinct(asin), cid FROM py_asin_best limit 0,300"
    cursor.execute(sql)
    return cursor.fetchall()
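One way to avoid editing the SQL every time is to parameterize the LIMIT. The sketch below is a hypothetical variation (not the repo's code); it uses an in-memory SQLite table to stand in for the MySQL `py_asin_best` table so it can run standalone.

```python
import sqlite3


def findall_asin(conn, limit=300):
    """Hypothetical variation of findall_asin_level1 with a configurable LIMIT.

    Fetches up to `limit` distinct (asin, cid) pairs instead of the
    hard-coded "limit 0,300".
    """
    cur = conn.execute(
        "SELECT DISTINCT asin, cid FROM py_asin_best LIMIT ?", (limit,))
    return cur.fetchall()


# Demo: an in-memory SQLite table standing in for the MySQL one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE py_asin_best (asin TEXT, cid INTEGER)")
conn.executemany("INSERT INTO py_asin_best VALUES (?, ?)",
                 [("B000123", 1), ("B000123", 1), ("B000456", 2)])
print(findall_asin(conn, limit=10))
```

With MySQL the placeholder would be `%s` rather than `?`, but the idea is the same.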