/genea

Scrape parent-child relationships from Wikipedia infoboxes.

Primary LanguagePythonMIT LicenseMIT

Genea

Pronounced "genie". Scrape parent-child relationships from Wikipedia infoboxes.

Why Infoboxes?

Infoboxes give us a digest of a particular Wikipedia page, in addition to the relational information that we'll need to build a tree.

infobox_washington.png

Modified infobox as seen on the Wikipedia page for George Washington

In the image above, we can see rows of data under the "Personal Details" section; each of these rows contain a header (bolded text) and (typically) associated links.

We'll use regular expression patterns to match with these headers, some of which provide ancestral relationships ("Parents", in this case), some provide descendant relationships ("Children"), and others that could provide extra links that we can walk out from ("Relatives").

Let's try out the above example.

Installation

Clone this repository to your local machine with git, then install with Python.

git clone https://github.com/shanedrabing/genea.git
cd genea
python setup.py install

Getting Started

Run the program with Python.

python genea.py "George Washington" "^Parent" "^Child"

Positional Arguments

  • term : Search term. Redirects to initial Wikipedia page.
  • pre : (optional, regex) If matched, will add ancestor.
  • post : (optional, regex) If matched, will add descendant.

Named Arguments

  • -n [STEPS] : How many steps to walk from initial page?
  • -e [EXTRA] : (regex) If matched, will add additional links (no relation).

Text Output

ANCESTORS of George Washington
├── Augustine Washington Sr.  
│   ├── Mildred Gale
│   │   └── Augustine Warner Jr.
│   │       └── Augustine Warner
│   └── Lawrence Washington
│       └── John Washington
│           └── Lawrence Washington
└── Mary Washington

DESCENDANTS of George Washington
└── John Parke Custis
    ├── George Washington Parke Custis
    │   ├── Mary Anna Custis Lee
    │   │   ├── Eleanor Agnes Lee
    │   │   ├── George Washington Custis Lee
    │   │   ├── William Henry Fitzhugh Lee
    │   │   ├── Robert E. Lee Jr.
    │   │   ├── Mildred Childe Lee
    │   │   ├── Anne Carter Lee
    │   │   └── Mary Custis Lee
    │   └── Maria Carter Syphax
    ├── Martha Parke Custis Peter
    ├── Elizabeth Custis Law
    └── Eleanor Parke Custis Lewis

Motivating Examples

Try out these other searches! Genea is intended to be general, meaning that any infobox labels you find can define the relationships between pages.

# how many cars succeeded the Ford Quadricycle?
python genea.py "Ford Quadricycle" "^Predecessor" "^Successor"

# what is the pedigree of Secretariat? (goes back to the 1700s!)
python genea.py "Secretariat (horse)" "^(Sire|Dam)$" --extra "sire"

# where did Windows XP come from, where did it go?
python genea.py "Windows XP" "^(Preceded by)$" "^(Succeeded by)$"

# how many child companies does Disney have?
python genea.py "Disney" "Parent" "(Divisions|Subsidiaries)"

License

MIT