Scrapy template: Explore the possibility of running more Spiders per Actor
Closed this issue · 1 comment
vdusek commented
honzajavorek commented
UPDATE: I abandoned trying to get multiple spiders working. Instead, I'm investing my time in implementing the monorepo approach, registering each spider as an individual Actor:
```python
import importlib
import os

from apify import Actor
# import path of the settings helper may vary with the apify package version
from apify.scrapy.utils import apply_apify_settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


async def main() -> None:
    actor_path = os.environ['ACTOR_PATH_IN_DOCKER_CONTEXT']  # e.g. juniorguru_plucker/jobs_startupjobs
    spider_module_name = f"{actor_path.replace('/', '.')}.spider"
    async with Actor:
        Actor.log.info(f'Actor {actor_path} is being executed…')
        settings = apply_apify_settings(get_project_settings())
        crawler = CrawlerProcess(settings, install_root_handler=False)
        Actor.log.info(f"Actor's spider: {spider_module_name}")
        crawler.crawl(importlib.import_module(spider_module_name).Spider)
        crawler.start()
```
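The path-to-module convention above can be illustrated in isolation. This is a minimal sketch; the helper name and the example path value are hypothetical, taken only from the comment in the code:

```python
def spider_module_for(actor_path: str) -> str:
    """Translate a directory path like 'pkg/actor_name' into the dotted
    name of that actor's spider module (hypothetical helper)."""
    return f"{actor_path.replace('/', '.')}.spider"


print(spider_module_for('juniorguru_plucker/jobs_startupjobs'))
# prints juniorguru_plucker.jobs_startupjobs.spider
```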
Original Post
I do something like this:
```python
import importlib
from pathlib import Path

from apify import Actor
# import path of the settings helper may vary with the apify package version
from apify.scrapy.utils import apply_apify_settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


async def main() -> None:
    async with Actor:
        Actor.log.info('Actor is being executed...')
        actor_input = await Actor.get_input() or {}
        spider_names = set(actor_input.get('sources', ['all']))
        if 'all' in spider_names:
            for path in Path(__file__).parent.glob('spiders/*.py'):
                if path.stem != '__init__':
                    spider_names.add(path.stem)
            spider_names.remove('all')
        Actor.log.info(f"Executing spiders: {', '.join(spider_names)}")
        settings = apply_apify_settings(get_project_settings())
        crawler = CrawlerProcess(settings, install_root_handler=False)
        for spider_name in spider_names:
            spider_module_name = f"{settings['NEWSPIDER_MODULE']}.{spider_name}"
            spider_module = importlib.import_module(spider_module_name)
            crawler.crawl(spider_module.Spider)
        crawler.start()
```
But for mysterious reasons, it doesn't work correctly. I'm getting this exception:
```
ValueError: Method 'parse_job' not found in: <Spider 'startupjobs' at 0x1035afed0>
```
Although my `startupjobs` spider has no `parse_job` method at all! That's a method of the other spider. I suspect either the asyncio or the apify sorcery causes the code of the two spiders to somehow mingle 🤯 My full proof of concept is here:

Any ideas on what could have gone wrong?
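One plausible explanation (an assumption, not verified against the code above): when requests pass through a persistent queue, Scrapy stores the callback as a plain method name and, on deserialization, resolves that name against whichever spider the crawler is running, roughly like the sketch below. If both spiders feed one shared queue, a request scheduled by one spider can end up being resolved against the other, producing exactly this `ValueError`. The class and helper here are illustrative, not Scrapy's actual code:

```python
class StartupJobsSpider:
    """Stand-in spider: has parse(), but no parse_job()."""
    name = 'startupjobs'

    def parse(self, response): ...


def resolve_callback(spider, callback_name: str):
    """Mimic looking up a serialized callback name on a spider instance."""
    method = getattr(spider, callback_name, None)
    if method is None:
        raise ValueError(f"Method {callback_name!r} not found in: {spider}")
    return method


spider = StartupJobsSpider()
resolve_callback(spider, 'parse')  # resolves fine
try:
    resolve_callback(spider, 'parse_job')  # defined only on the *other* spider
except ValueError as e:
    print(e)  # an error of the same shape as the one in the traceback
```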