apify/actor-templates

Scrapy template: Explore the possibility of running multiple Spiders per Actor

UPDATE: I abandoned trying to get multiple spiders working; instead, I'm investing my time in implementing the monorepo approach, registering each spider as an individual Actor:

import importlib
import os

from apify import Actor
from apify.scrapy.utils import apply_apify_settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


async def main() -> None:
    actor_path = os.environ['ACTOR_PATH_IN_DOCKER_CONTEXT']  # e.g. juniorguru_plucker/jobs_startupjobs
    spider_module_name = f"{actor_path.replace('/', '.')}.spider"

    async with Actor:
        Actor.log.info(f'Actor {actor_path} is being executed…')
        # Apply the Apify integration settings on top of the Scrapy project settings
        settings = apply_apify_settings(get_project_settings())
        crawler = CrawlerProcess(settings, install_root_handler=False)
        Actor.log.info(f"Actor's spider: {spider_module_name}")
        crawler.crawl(importlib.import_module(spider_module_name).Spider)
        crawler.start()
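
For context, this assumes each Actor directory in the monorepo contains a spider module exposing a class literally named Spider, since the loader above does importlib.import_module(spider_module_name).Spider. A minimal sketch of what such a module might look like (the name, URL, and selectors are hypothetical):

import scrapy


class Spider(scrapy.Spider):
    name = 'jobs-startupjobs'
    start_urls = ['https://example.com/jobs']  # hypothetical URL

    def parse(self, response):
        # hypothetical selectors; yield one item per job posting
        for job in response.css('.job'):
            yield {'title': job.css('h2::text').get()}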
Original Post

I do something like this:

import importlib
from pathlib import Path

from apify import Actor
from apify.scrapy.utils import apply_apify_settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


async def main() -> None:
    async with Actor:
        Actor.log.info('Actor is being executed...')
        actor_input = await Actor.get_input() or {}
        spider_names = set(actor_input.get('sources', ['all']))

        # 'all' expands to every spider module in the spiders/ package
        if 'all' in spider_names:
            for path in Path(__file__).parent.glob('spiders/*.py'):
                if path.stem != '__init__':
                    spider_names.add(path.stem)
            spider_names.remove('all')

        Actor.log.info(f"Executing spiders: {', '.join(spider_names)}")
        settings = apply_apify_settings(get_project_settings())
        crawler = CrawlerProcess(settings, install_root_handler=False)
        # Schedule all requested spiders into a single CrawlerProcess
        for spider_name in spider_names:
            spider_module_name = f"{settings['NEWSPIDER_MODULE']}.{spider_name}"
            spider_module = importlib.import_module(spider_module_name)
            crawler.crawl(spider_module.Spider)
        crawler.start()
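
For reference, the input this code expects is a sources list of spider names: {"sources": ["startupjobs"]} would run a single spider, and {"sources": ["all"]} (also the default) expands to every module found in the spiders/ package.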

But for mysterious reasons, it doesn't work correctly. I'm getting this exception:

ValueError: Method 'parse_job' not found in: <Spider 'startupjobs' at 0x1035afed0>

Although my startupjobs spider has no parse_job method at all! That's a method of the other spider. I suspect either asyncio or Apify sorcery causes the code of the two spiders to somehow mingle 🤯 My full proof of concept is here:

https://github.com/juniorguru/plucker/blob/404f677f4748dfae5389072fc01b7d736abbc62f/juniorguru_plucker/main.py#L40
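
For what it's worth, the wording of the error matches Scrapy's request deserialization: when a request is restored from its serialized form, the callback is resolved by name against a spider instance, so a request scheduled by one spider but restored against another fails with exactly this ValueError. A minimal sketch that reproduces the message (class and method names are hypothetical; assumes Scrapy 2.6+, where request_from_dict lives in scrapy.utils.request):

import scrapy
from scrapy.utils.request import request_from_dict


class Spider(scrapy.Spider):  # stands in for the startupjobs spider
    name = 'startupjobs'


# A serialized request whose callback name belongs to a *different* spider:
serialized = {'url': 'https://example.com', 'callback': 'parse_job'}

# Restoring it against the wrong spider raises:
# ValueError: Method 'parse_job' not found in: <Spider 'startupjobs' at 0x...>
request_from_dict(serialized, spider=Spider())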

Any ideas on what could have gone wrong?