roach-php/core

Question: Best way to process items that depend on one another?

Hello,

I am taking a look at this package, specifically through the Laravel integration.

I have a specific need. I would like to scrape a page, but the page has three components:

  • a parent component (usually)...let's say a country
  • the component itself...let's say a city
  • child components (usually)...let's say citizens

The country and the citizens are reachable via links on the main (city) page. The citizen entries on the city page may also carry a few bits of data that a citizen's own page does not provide.

I'd like to parse the city page, but I'm not quite sure how to handle processing everything.

I would like to be able to follow the country link, create or update the country in my database, and then pass its ID along to the city processing, so that the city can be inserted or updated as part of that country.

Finally, I'd like to be able to process any of the citizens, again inserting/updating them in my database, as needed, all with reference to the city ID to which they belong. (And, ideally, handling some of the "extra" bits of data that might exist on the main city page.)
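To make the desired flow concrete, here is a rough plain-PHP sketch of the order in which I want things saved. Nothing in it comes from Roach: `FakeDb`, `upsert()`, the `$cityPage` array, and the example URLs are all hypothetical stand-ins for my real database and for whatever the spider would extract.

```php
<?php

declare(strict_types=1);

// In-memory stand-in for my database tables (hypothetical).
// In the real Laravel app this would probably be Eloquent's updateOrCreate().
final class FakeDb
{
    /** @var array<string, array<int, array<string, mixed>>> rows per table, keyed by ID */
    public array $tables = [];

    /** Insert or update the row identified by its URL and return the row's ID. */
    public function upsert(string $table, string $url, array $attributes): int
    {
        foreach ($this->tables[$table] ?? [] as $id => $row) {
            if ($row['url'] === $url) {
                // Existing row: merge in the new attributes and keep the same ID.
                $this->tables[$table][$id] = array_merge($row, $attributes);
                return $id;
            }
        }

        // New row: IDs simply count up from 1 in this sketch.
        $id = count($this->tables[$table] ?? []) + 1;
        $this->tables[$table][$id] = array_merge(['url' => $url], $attributes);
        return $id;
    }
}

// Hypothetical result of parsing one city page.
$cityPage = [
    'url' => 'https://example.test/cities/springfield',
    'name' => 'Springfield',
    'country' => ['url' => 'https://example.test/countries/freedonia', 'name' => 'Freedonia'],
    'citizens' => [
        ['url' => 'https://example.test/citizens/1', 'name' => 'Alice', 'role' => 'mayor'],
        ['url' => 'https://example.test/citizens/2', 'name' => 'Bob', 'role' => null],
    ],
];

$db = new FakeDb();

// 1. Parent first, so its ID exists before the city is saved.
$countryId = $db->upsert('countries', $cityPage['country']['url'], [
    'name' => $cityPage['country']['name'],
]);

// 2. The city row references the country ID.
$cityId = $db->upsert('cities', $cityPage['url'], [
    'name' => $cityPage['name'],
    'country_id' => $countryId,
]);

// 3. Each citizen references the city ID and keeps the extra data
//    ("role" here) that only the city page shows.
foreach ($cityPage['citizens'] as $citizen) {
    $db->upsert('citizens', $citizen['url'], [
        'name' => $citizen['name'],
        'role' => $citizen['role'],
        'city_id' => $cityId,
    ]);
}
```

The part I cannot map onto Roach is steps 1 to 3: each step needs the ID produced by the previous one.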

I can't quite figure out if I should have three different spiders, with the city spider calling the country and citizen spiders...or one city spider with different parser methods...? 🤔 In any case, I can't figure out how to pass the country/city database IDs along...and in the case of a single spider, I can't figure out how to make the item processors process one component versus another.
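For the single-spider variant, this is roughly the behavior I am after, again as a plain-PHP sketch rather than real Roach code. It assumes that every processor sees every item (which is how I understand the item pipeline to work) and that each item carries a `type` key I would set myself. The `Processor` interface, `TypedProcessor` class, and the `type` key are my own hypothetical names, not part of the package.

```php
<?php

declare(strict_types=1);

// Hypothetical processor contract: every processor receives every item
// and returns it, changed or not.
interface Processor
{
    public function process(array $item): array;
}

// Handles only items of one type and passes everything else through untouched.
final class TypedProcessor implements Processor
{
    public function __construct(
        private string $type,
        private Closure $handler,
    ) {
    }

    public function process(array $item): array
    {
        if (($item['type'] ?? null) !== $this->type) {
            return $item; // Not this processor's component; leave it alone.
        }

        return ($this->handler)($item);
    }
}

// Records what each handler did, standing in for database writes.
$saved = [];

$processors = [
    new TypedProcessor('country', function (array $item) use (&$saved): array {
        $saved[] = 'country:' . $item['name'];
        return $item;
    }),
    new TypedProcessor('city', function (array $item) use (&$saved): array {
        $saved[] = 'city:' . $item['name'];
        return $item;
    }),
    new TypedProcessor('citizen', function (array $item) use (&$saved): array {
        $saved[] = 'citizen:' . $item['name'];
        return $item;
    }),
];

// Items as a single spider might emit them, each tagged with its component type.
$items = [
    ['type' => 'country', 'name' => 'Freedonia'],
    ['type' => 'city', 'name' => 'Springfield'],
    ['type' => 'citizen', 'name' => 'Alice'],
];

// Run every item through every processor, in order.
foreach ($items as $item) {
    foreach ($processors as $processor) {
        $item = $processor->process($item);
    }
}
```

Whether a type check like this belongs inside each processor, or whether the package offers a cleaner way to route items, is exactly what I am unsure about.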

Any help or suggestions? I took a look at https://github.com/ksassnowski/roach-example-project, which was quite helpful, in general, but I didn't see how it could help with these particular problems.

Thanks in advance for your help. 🤓