rootsdev/genscrape

Testing

Closed this issue · 7 comments

I figure we'll use PhantomJS for testing since we need to simulate a browser.

Authentication will be an issue since most trees and records are behind a paywall, or at least require a login. We can't put credentials in this public repo, so we'll probably need a script of some sort that prompts the developer for the relevant auth credentials before running the test suite.

gives the EIDB the side eye

I am all for this. 👍

@Asparagirl What would you want to do with this?

Oh man, what wouldn't I do with that? :-)

Build an entirely new interface, with an open API, for starters! And then query the ever-lovin' crap out of that data.

Like, not even for just finding names of people, but studying larger national and ethnic immigration trends over time. How many people came via Ellis Island who listed their nationality as Hungarian but their ethnicity/race as Slovak? How many six-year-olds came over in 1912? When a group of people with the same surname came over on the same ship on the same date numbered sequentially on the manifest (i.e. usually a family), what was the average number of children coming together with their parent(s)?

And I'd love to do a pass on the last town of origin data, which of course is usually misspelled and mistranscribed a zillion different ways. But if I could hook that data up to a Places API with historical place name data, I could at least attempt a rough clean-up. I could get a rough count of immigrants from "Lemberg / Lwow / Lviv / Lvov" (plus misspellings) and graph their immigration patterns over time, and further break it down by departure port or age or ethnicity or a million other things. And since the borders changed like crazy, I could map immigrant town names to actual modern locations and do breakdowns on modern maps, rather than self-reported hundred-year-old maps.

Not that I've spent massive amounts of time thinking about this, to the point where I've seriously considered scraping the entire site, or anything. whistles innocently

...say, that new URL structure for the passenger records on the new Ellis Island website looks an awful lot like Base64, doesn't it?
http://www.libertyellisfoundation.org/passenger-details/czoxMjoiMTAyMDU4MDQwNzUyIjs=/czo5OiJwYXNzZW5nZXIiOw==
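It does — both path segments decode cleanly (a quick Node.js check; `Buffer` is built in):

```javascript
// Decode the two Base64 segments from the passenger-details URL above.
const segments = [
  'czoxMjoiMTAyMDU4MDQwNzUyIjs=',
  'czo5OiJwYXNzZW5nZXIiOw=='
];

segments.forEach((seg) => {
  console.log(Buffer.from(seg, 'base64').toString('utf8'));
});
// Prints:
//   s:12:"102058040752";
//   s:9:"passenger";
```

Those look like PHP-serialized strings (`s:<length>:"<value>";`) — a 12-digit record ID and the literal string `passenger`.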

tl;dr: BIG DATA FOR GENEALOGY NERDS. GIMME.

This is a little out of scope for my initial vision. For this library to be useful as a whole (as opposed to just being a collection of scrapers that are never used together) it needs to have a shared schema for output. This is non-trivial.

Because this idea sprouted from my work on roots-search, the schema was very simple (names, dates, and places). My intent was for v1 to have a similar schema because I don't want to take on the work involved in designing a more advanced schema yet.

Perhaps the answer is to have two different methods:

// Simple shared schema
gs.simple('ellis-island')

// All the data; not necessarily a shared schema
gs.full('ellis-island')
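To illustrate the relationship between the two — `full` returns whatever the scraper finds, and `simple` projects that onto the shared schema. Everything below (`toSimple()`, the field names, the sample record) is made up for illustration, not genscrape's actual API:

```javascript
// Hypothetical projection from a "full" record to the shared simple schema.
// Field names here are illustrative placeholders.
function toSimple(fullRecord) {
  // Keep only the fields every scraper can agree on: names, dates, places.
  return {
    name:  fullRecord.name,
    date:  fullRecord.arrivalDate,
    place: fullRecord.lastResidence
  };
}

// A made-up "full" Ellis Island record:
const full = {
  name: 'Jan Novak',
  arrivalDate: '1912-04-03',
  lastResidence: 'Lemberg',
  ship: 'SS Kaiser Wilhelm II',
  ethnicity: 'Slovak',
  manifestLine: 17
};

console.log(toSimple(full));
// { name: 'Jan Novak', date: '1912-04-03', place: 'Lemberg' }
```

That way each scraper only has to implement the full extraction once, and the simple view comes for free.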

What do you think?

I think that could work. But then again I'm happy with any bit of info we can get from the EIDB site. Basic names, dates, and places will do for a start.

Closing this issue because I have the basic testing framework set up.