Priority issue with yield feapder.Request
tisoz opened this issue · 2 comments
tisoz commented
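My idea is to have three tasks, a, b, and c, run on a single thread and complete in the following order:
a1 -> download middleware
a2 -> download middleware
a3 -> download middleware
b1 -> download middleware
b2 -> download middleware
b3 -> download middleware
c1 -> download middleware
c2 -> download middleware
c3 -> download middleware
But no matter how I adjust things, the best I can get is: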
a1 b1 c1
a2 b2 c2
a3 b3 c3
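If instead of three tasks I had thousands or tens of thousands of tasks that need to recurse, the delay between two consecutive requests of the same task would be several hours.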
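To put a rough number on that delay, here is a back-of-the-envelope sketch. It assumes one thread, level-by-level processing as in the a1 b1 c1 / a2 b2 c2 pattern above, and a hypothetical 1 second per request; the helper name `gap_between_steps` and the figures are illustrative only and do not come from feapder.

```python
def gap_between_steps(num_tasks, seconds_per_request):
    """Wait between a task's step k and step k+1 when one thread works level by level.

    After a1 is handled, the other (num_tasks - 1) first-step requests are
    handled before a2 gets its turn.
    """
    return (num_tasks - 1) * seconds_per_request


# Hypothetical workload: 10,000 tasks at 1 second per request.
seconds = gap_between_steps(num_tasks=10_000, seconds_per_request=1.0)
print(f"{seconds / 3600:.2f} hours")  # prints "2.78 hours"
```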
I wrote a minimal standalone example; the output is as follows:
E:\program\py38\python.exe "F:\onedrive\OneDrive - TSCN\桌面(1)\feapder爬虫\Amazon\feapder_test.py"
2023-03-21 11:36:07.691 | INFO | feapder.core.scheduler:<lambda>:111 -
********** feapder begin **********
2023-03-21 11:36:07.872 | INFO | __main__:start_requests:36 - task us
2023-03-21 11:36:07.872 | INFO | __main__:start_requests:36 - task jp
2023-03-21 11:36:07.872 | INFO | __main__:start_requests:36 - task tr
2023-03-21 11:36:07.873 | INFO | __main__:start_requests:36 - task es
2023-03-21 11:36:07.873 | INFO | __main__:start_requests:36 - task fd
2023-03-21 11:36:07.873 | INFO | __main__:start_requests:36 - task tg
2023-03-21 11:36:11.191 | INFO | __main__:parse_valid_token:45 - tg111 | priority:940000
2023-03-21 11:36:11.191 | INFO | __main__:parse_valid_token:45 - fd111 | priority:950000
2023-03-21 11:36:11.192 | INFO | __main__:parse_valid_token:45 - es111 | priority:960000
2023-03-21 11:36:11.192 | INFO | __main__:parse_valid_token:45 - tr111 | priority:970000
2023-03-21 11:36:11.192 | INFO | __main__:parse_valid_token:45 - jp111 | priority:980000
2023-03-21 11:36:11.192 | INFO | __main__:parse_valid_token:45 - us111 | priority:990000
2023-03-21 11:36:14.218 | INFO | __main__:parse_csrf_token:54 - tg222 | priority:840000
2023-03-21 11:36:14.218 | INFO | __main__:parse_csrf_token:54 - fd222 | priority:850000
2023-03-21 11:36:14.218 | INFO | __main__:parse_csrf_token:54 - es222 | priority:860000
2023-03-21 11:36:14.218 | INFO | __main__:parse_csrf_token:54 - tr222 | priority:870000
2023-03-21 11:36:14.218 | INFO | __main__:parse_csrf_token:54 - jp222 | priority:880000
2023-03-21 11:36:14.218 | INFO | __main__:parse_csrf_token:54 - us222 | priority:890000
2023-03-21 11:36:17.228 | INFO | __main__:parse:64 - tg333 | priority:740000
2023-03-21 11:36:17.228 | INFO | __main__:parse:64 - fd333 | priority:750000
2023-03-21 11:36:17.228 | INFO | __main__:parse:64 - es333 | priority:760000
2023-03-21 11:36:17.228 | INFO | __main__:parse:64 - tr333 | priority:770000
2023-03-21 11:36:17.228 | INFO | __main__:parse:64 - jp333 | priority:780000
2023-03-21 11:36:17.228 | INFO | __main__:parse:64 - us333 | priority:790000
2023-03-21 11:36:20.960 | INFO | feapder.core.scheduler:<lambda>:116 -
********** feapder end **********
2023-03-21 11:36:21.023 | INFO | feapder.core.scheduler:spider_end:518 - 《amazon_temp:amazon_address_ck》spider finished, elapsed 14 seconds
2023-03-21 11:36:21.206 | INFO | feapder.core.scheduler:delete_tables:442 - deleting key amazon_temp:amazon_address_ck:z_spider_status
Process finished with exit code 0
The code for the standalone example is as follows:
import feapder
import feapder.utils.tools
from feapder.utils.log import log


class AMAZON_ASIN_test(feapder.Spider):
    # Custom database settings; can be removed if the project has a setting.py file
    __custom_setting__ = dict(
        # Framework log level
        LOG_LEVEL="INFO",
        LOG_COLOR=True,  # whether to colorize log output
        LOG_IS_WRITE_TO_CONSOLE=True,  # whether to print to the console
        MONGO_DB="Amazon_spider",  # name of the database to save to
        SPIDER_MAX_RETRY_TIMES=3,
    )

    def init_base(self, save_table_name=None, item_list=None):
        # Use None as the default to avoid sharing one mutable list between calls
        self.item_list = [] if item_list is None else item_list
        self.save_table_name = save_table_name

    def download_midware(self, request):
        return request, {}

    def start_requests(self):
        country_list = {
            "us": "locationType=LOCATION_INPUT&zipCode=90001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
            "jp": "locationType=LOCATION_INPUT&zipCode=163-8001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
            "tr": "locationType=LOCATION_INPUT&zipCode=163-8001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
            "es": "locationType=LOCATION_INPUT&zipCode=163-8001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
            "fd": "locationType=LOCATION_INPUT&zipCode=163-8001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
            "tg": "locationType=LOCATION_INPUT&zipCode=163-8001&storeContext=apparel&deviceType=web&pageType=Detail&actionSource=glow&almBrandId=undefined",
        }
        priority = 1000000
        for i in country_list:
            # Each task starts 10000 lower than the previous one
            priority -= 10000
            log.info(f"task {i}")
            yield feapder.Request(
                url=f"https://www.amazon.com/{i}",
                priority=priority,
                country=i,
                auto_request=False,
                callback=self.parse_valid_token,
            )

    def parse_valid_token(self, request, response):
        log.info(f"{request.country}111 | priority:{request.priority}")
        # The follow-up request gets a priority value 100000 lower than its parent
        yield feapder.Request(
            url=f"https://www.amazon.com/a{request.country}",
            priority=request.priority - 100000,
            auto_request=False,
            country=request.country,
            callback=self.parse_csrf_token,
        )

    def parse_csrf_token(self, request, response):
        log.info(f"{request.country}222 | priority:{request.priority}")
        yield feapder.Request(
            url=f"https://www.amazon.com/b{request.country}",
            priority=request.priority - 100000,
            auto_request=False,
            country=request.country,
            callback=self.parse,
        )

    def parse(self, request, response):
        log.info(f"{request.country}333 | priority:{request.priority}")


if __name__ == "__main__":
    save_table_name = "amazon_address_ck"
    amazon = AMAZON_ASIN_test(redis_key=f"amazon_temp:{save_table_name}", delete_keys=True, thread_count=1)
    amazon.init_base(save_table_name=f"amazon:{save_table_name}")
    amazon.start()
tisoz commented
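Solved.
class AMAZON_ASIN_test(feapder.AirSpider):
After switching to the lightweight spider (AirSpider), priorities work as expected; with the distributed spider (Spider) they do not.
Log after the switch: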
E:\program\py38\python.exe "F:\onedrive\OneDrive - TSCN\桌面(1)\feapder爬虫\Amazon\feapder_test.py"
2023-03-21 13:38:05.076 | INFO | __main__:start_requests:36 - task us
2023-03-21 13:38:05.076 | INFO | __main__:start_requests:36 - task jp
2023-03-21 13:38:05.076 | INFO | __main__:start_requests:36 - task tr
2023-03-21 13:38:05.076 | INFO | __main__:start_requests:36 - task es
2023-03-21 13:38:05.076 | INFO | __main__:start_requests:36 - task fd
2023-03-21 13:38:05.076 | INFO | __main__:start_requests:36 - task tg
2023-03-21 13:38:06.079 | INFO | __main__:parse_valid_token:45 - us111 | priority:3
2023-03-21 13:38:06.079 | INFO | __main__:parse_csrf_token:54 - us222 | priority:2
2023-03-21 13:38:06.079 | INFO | __main__:parse:64 - us333 | priority:1
2023-03-21 13:38:06.079 | INFO | __main__:parse_valid_token:45 - tr111 | priority:3
2023-03-21 13:38:06.079 | INFO | __main__:parse_csrf_token:54 - tr222 | priority:2
2023-03-21 13:38:06.079 | INFO | __main__:parse:64 - tr333 | priority:1
2023-03-21 13:38:06.079 | INFO | __main__:parse_valid_token:45 - jp111 | priority:3
2023-03-21 13:38:06.080 | INFO | __main__:parse_csrf_token:54 - jp222 | priority:2
2023-03-21 13:38:06.080 | INFO | __main__:parse:64 - jp333 | priority:1
2023-03-21 13:38:06.080 | INFO | __main__:parse_valid_token:45 - fd111 | priority:3
2023-03-21 13:38:06.080 | INFO | __main__:parse_csrf_token:54 - fd222 | priority:2
2023-03-21 13:38:06.080 | INFO | __main__:parse:64 - fd333 | priority:1
2023-03-21 13:38:06.080 | INFO | __main__:parse_valid_token:45 - es111 | priority:3
2023-03-21 13:38:06.080 | INFO | __main__:parse_csrf_token:54 - es222 | priority:2
2023-03-21 13:38:06.080 | INFO | __main__:parse:64 - es333 | priority:1
2023-03-21 13:38:06.080 | INFO | __main__:parse_valid_token:45 - tg111 | priority:3
2023-03-21 13:38:06.080 | INFO | __main__:parse_csrf_token:54 - tg222 | priority:2
2023-03-21 13:38:06.080 | INFO | __main__:parse:64 - tg333 | priority:1
2023-03-21 13:38:10.098 | INFO | feapder.core.spiders.air_spider:run:104 - no tasks, spider finished
Process finished with exit code 0
Boris-code commented
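The distributed spider fetches a batch of tasks into memory and then consumes them; each batch is fetched in priority order. Your task count is probably too small, so the first batch only picked up a1 b1 c1, and a2 b2 c2 were not picked up until the second batch.
Whether to choose AirSpider or Spider depends on whether your task volume needs distribution; for millions or tens of millions of tasks, Spider is the better fit.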
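The batch-fetching explanation can be illustrated with a toy model. This is only a sketch under stated assumptions, not feapder's actual scheduler: it uses a plain in-memory heap, assumes that a smaller priority number is served first (consistent with the logs above, where tg at 940000 is handled before us at 990000), and the names `run`, `batch_size`, and `STEPS` are hypothetical.

```python
import heapq

STEPS = 3  # each task has three chained requests, e.g. a1 -> a2 -> a3


def run(tasks, batch_size):
    """Simulate one worker thread; return the order in which requests are handled.

    Handling request <task><step> enqueues <task><step + 1> with a smaller
    (more urgent) priority number.
    """
    # Queue entries are (priority, seq, task, step); seq breaks priority ties.
    queue = []
    seq = 0
    for i, task in enumerate(tasks):
        heapq.heappush(queue, (1000 + i, seq, task, 1))
        seq += 1

    order = []
    while queue:
        # Pull up to batch_size requests into a local buffer, most urgent first.
        batch = [heapq.heappop(queue) for _ in range(min(batch_size, len(queue)))]
        for priority, _, task, step in batch:
            order.append(f"{task}{step}")
            if step < STEPS:
                # The follow-up only becomes visible when the next batch is pulled.
                heapq.heappush(queue, (priority - 100, seq, task, step + 1))
                seq += 1
    return order


print(run(["a", "b", "c"], batch_size=3))  # whole level fetched at once
print(run(["a", "b", "c"], batch_size=1))  # one request fetched at a time
```

With `batch_size=3` the model yields a1 b1 c1 a2 b2 c2 a3 b3 c3, because the follow-up requests are created only after the first batch has already been pulled. With `batch_size=1` it yields a1 a2 a3 b1 b2 b3 c1 c2 c3, which matches the ordering seen in the AirSpider log.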