opencivicdata/pupa

Recommended approach to scrape multiple jurisdictions at once?

Closed this issue · 9 comments

For example, if a provincial website has information for all its municipalities.

any idea of how you'd like to see pupa handle this?

I don't have strong opinions on how the API should work, but one way is to be able to change the "active jurisdiction" so that objects are yielded to the appropriate jurisdiction. Pseudo-code:

# __init__.py
from utils import CanadianJurisdiction

class QuebecMunicipalities(CanadianJurisdiction):
    # Here you would find either:
    # * nothing, since this is a fake jurisdiction
    # * dummy variables which will be ignored by the scraper
    # * a list of all the jurisdictions if one of the above two can't be implemented
    pass

# people.py
from pupa.scrape import Scraper

class QuebecMunicipalitiesPersonScraper(Scraper):
    def scrape(self):
        # get the list of municipalities
        for municipality in municipalities:
            # create a jurisdiction object
            self.set_jurisdiction(jurisdiction)  # proposed method, not an existing one
            # yield a lot of people

However, I can imagine a lot of challenges in changing Pupa to work this way.

Maybe there are some Python metaprogramming tricks I can use, to make it seem like there are several thousand modules with common people.py scraper code, without requiring me to have thousands of folders of __init__.py files and small people.py files all inheriting from the same meta-scraper class.

the people.py files won't be needed if they're all the same, as multiple jurisdictions can point to the same scraper(s)

your proposed solution might work, I'll play with some proof of concept code

Cool - how do you make multiple jurisdictions point to the same scrapers?

there's now an example of this in https://github.com/opencivicdata/scrapers-us-state

there's still one file per jurisdiction (maybe we can improve that, maybe this is good enough though) but they all point to the same scraper (and the jurisdictions in this case are actually auto-generated classes)
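To illustrate the idea of auto-generated jurisdiction classes that all point to one scraper, here is a minimal sketch. It is not the code from scrapers-us-state. `Jurisdiction` and `SharedPersonScraper` are stand-ins defined here so the example runs without pupa installed, and the municipality names and division IDs are made up.

```python
# Stand-in base class (assumption): in a real project this would presumably
# be pupa's own Jurisdiction class.
class Jurisdiction:
    division_id = None
    name = None
    scrapers = {}


class SharedPersonScraper:
    """Hypothetical scraper class reused by every generated jurisdiction."""

    def __init__(self, jurisdiction):
        self.jurisdiction = jurisdiction


# Made-up input rows: (class name, division id, display name).
MUNICIPALITIES = [
    ("ExampleTownA", "ocd-division/country:ca/csd:0000001", "Example Town A"),
    ("ExampleTownB", "ocd-division/country:ca/csd:0000002", "Example Town B"),
]


def make_jurisdiction(class_name, division_id, name):
    """Build a Jurisdiction subclass at runtime with the three-argument type()."""
    return type(
        class_name,
        (Jurisdiction,),
        {
            "division_id": division_id,
            "name": name,
            # Every generated class points at the same scraper class.
            "scrapers": {"people": SharedPersonScraper},
        },
    )


# One generated class per row, keyed by class name.
GENERATED = {row[0]: make_jurisdiction(*row) for row in MUNICIPALITIES}

for cls in GENERATED.values():
    print(cls.__name__, cls.division_id, cls.scrapers["people"].__name__)
```

Generating the classes in a loop inside a single module would avoid one file per jurisdiction, though whether pupa's command-line tooling can pick up several jurisdictions from one module is exactly the open question in this thread.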

Thanks! In Quebec I'll have 1000 auto-generated jurisdictions, mixed in with manual jurisdictions; we scrape the big cities individually (to get email addresses), but we're happy to use a provincial directory for the smaller cities (which has one email for the entire council). It may be confusing to have this mix, so avoiding one file per jurisdiction would still be ideal.

How is Pupa 0.4 coming along? How soon can I start upgrading to the PostgreSQL version?

pupa 0.4 is pretty much ready; there are still rough edges, but no more than existed in the mongo version, I believe. I was hoping to update some docs before calling it 0.4 officially, but we're using it in development now, and we will be releasing it as 0.4 and switching production over soon.

the 1000-jurisdiction issue still requires more work/thinking on the best way to do it. I think a different command like pupa bulkupdate might get around some of the challenges we'd face; once things settle down here I'll try to think of a cleaner interface for this.

Pinging for any updates on how to implement common scraper code for 1000s of jurisdictions.

In the update command's handle method, I'm wondering if instead of getting a single jurisdiction from a module, it might get a list of jurisdictions instead, and then loop over them. Alternatively, there could be a bulkupdate command as mentioned earlier, which expects the module to define multiple jurisdictions.
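A rough sketch of the "list of jurisdictions" idea, under stated assumptions: this does not use or reflect pupa's actual update command. `Jurisdiction`, `find_jurisdictions`, and `bulk_update` are hypothetical names, and the module is built in memory so the example is self-contained.

```python
import inspect
import types


class Jurisdiction:
    """Stand-in base class (assumption), not pupa's Jurisdiction."""

    name = None

    def scrape(self):
        return []


def find_jurisdictions(module):
    """Return every Jurisdiction subclass defined on the module."""
    return [
        obj
        for _, obj in inspect.getmembers(module, inspect.isclass)
        if issubclass(obj, Jurisdiction) and obj is not Jurisdiction
    ]


def bulk_update(module):
    """Loop over all jurisdictions in a module instead of expecting one."""
    results = {}
    for cls in find_jurisdictions(module):
        jurisdiction = cls()
        results[jurisdiction.name] = list(jurisdiction.scrape())
    return results


class TownA(Jurisdiction):
    name = "Town A"

    def scrape(self):
        return ["Alice"]


class TownB(Jurisdiction):
    name = "Town B"

    def scrape(self):
        return ["Bob", "Carol"]


# Build a throwaway module holding two jurisdictions plus the base class.
demo_module = types.ModuleType("demo_module")
demo_module.Jurisdiction = Jurisdiction
demo_module.TownA = TownA
demo_module.TownB = TownB

RESULTS = bulk_update(demo_module)
print(RESULTS)
```

The same discovery step could sit behind either interface discussed here: a changed `update` that accepts several jurisdictions, or a separate `bulkupdate` command.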

My workaround is to just put all the jurisdictions into one jurisdiction, in an organization hierarchy, which is fine for my needs, but maybe not in the general case. However, as there is no other demand for the general case, I'm closing.
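For readers wondering what that workaround might look like, here is a minimal sketch. It assumes one umbrella organization with one child council per municipality. The `Organization` and `Person` dataclasses are simplified stand-ins rather than pupa's models, and the directory data is invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Organization:
    """Simplified stand-in (assumption) for an organization record."""

    name: str
    classification: str
    parent: Optional["Organization"] = None


@dataclass
class Person:
    """Simplified stand-in (assumption) for a person with one membership."""

    name: str
    organization: Organization


def scrape(directory):
    """Yield one umbrella organization, then a council and people per town."""
    umbrella = Organization("Provincial municipalities", "government")
    yield umbrella
    for town, members in directory.items():
        council = Organization(f"{town} council", "legislature", parent=umbrella)
        yield council
        for member in members:
            yield Person(member, council)


# Invented directory data: town name -> council member names.
DIRECTORY = {"Town A": ["Alice"], "Town B": ["Bob", "Carol"]}
OBJECTS = list(scrape(DIRECTORY))

for obj in OBJECTS:
    print(obj)
```

Everything lives in a single jurisdiction, so the municipality is recoverable only by walking from a person to their council, which is the trade-off mentioned above.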