city-directory-entry-parser

city-directory-entry-parser parses lines from OCR’d New York City directories into separate fields, such as names, occupations, and addresses.

city-directory-entry-parser is part of NYPL’s NYC Space/Time Directory project.

For more tools that are used to turn digitized city directories into datasets, see Space/Time’s City Directories repository.

This module relies on the sklearn-crfsuite implementation of a conditional random fields algorithm.
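A conditional random field labels each token in a sequence based on features of that token and its neighbours. The sketch below shows one way a directory line could be turned into per-token feature dictionaries, which is the input shape that sklearn-crfsuite's CRF estimator accepts. The `tokenize`, `token_features`, and `entry_to_features` helpers and the feature names are illustrative assumptions, not the features this module actually uses.

```python
import re


def tokenize(entry):
    """Split a directory line into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", entry)


def token_features(tokens, i):
    """Build a feature dict for the token at position i (hypothetical feature set)."""
    token = tokens[i]
    features = {
        "lower": token.lower(),
        "is_digit": token.isdigit(),
        "is_capitalized": token[:1].isupper(),
        "is_punct": not token.isalnum(),
    }
    # Neighbouring tokens give the model context, e.g. a number before a street name.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sequence
    return features


def entry_to_features(entry):
    """Return the tokens of an entry and one feature dict per token."""
    tokens = tokenize(entry)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


tokens, features = entry_to_features("Calder William W, clerk, 206 W. 24th")
for token, feature in zip(tokens, features):
    print(token, feature)
```

A list of such per-entry feature lists, paired with a list of per-token labels, is what a CRF would be trained on.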

Example

Input:

"Calder William W, clerk, 206 W. 24th"

Output:

{
  "subjects": [
    "Calder William W"
  ],
  "occupations": [
    "clerk"
  ],
  "addresses": [
    [
      "206 W . 24th"
    ]
  ]
}

If the output contains an address field, nyc-street-normalizer can be used to turn this abbreviated address into a full address (e.g. 668 Sixth av. ⟶ 668 Sixth Avenue).
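To hand addresses to a later step such as a street normalizer, it can help to flatten the parser output into one record per address. The following is a minimal sketch that works on the output structure shown above. `flatten_entry` is a hypothetical helper, and it assumes that each inner list under "addresses" holds the parts of a single address.

```python
import json


def flatten_entry(categories):
    """Return one flat record per address found in a parsed entry."""
    subjects = categories.get("subjects", [])
    occupations = categories.get("occupations", [])
    rows = []
    for address_parts in categories.get("addresses", []):
        rows.append({
            "subject": "; ".join(subjects),
            "occupation": "; ".join(occupations),
            # Assumption: the inner list contains the parts of one address.
            "address": " ".join(address_parts),
        })
    return rows


parsed = json.loads("""
{
  "subjects": ["Calder William W"],
  "occupations": ["clerk"],
  "addresses": [["206 W . 24th"]]
}
""")

rows = flatten_entry(parsed)
print(rows)
```

An entry without an "addresses" field yields an empty list, so callers can skip normalization for it.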

Prerequisites

city-directory-entry-parser depends on the following Python modules:

numpy
sklearn (installed from PyPI as scikit-learn)
nltk
scipy
sklearn_crfsuite
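Before training, it may be useful to confirm that these modules are importable in the current environment. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper and not part of this repository.

```python
import importlib.util

# Import names of the modules listed above.
REQUIRED_MODULES = ("numpy", "sklearn", "nltk", "scipy", "sklearn_crfsuite")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All prerequisites are importable.")
```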

Installation & usage

From Python:

import json

from cdparser import Classifier, Features, LabeledEntry, Utils

## Create a classifier object and load some labeled data from a CSV
classifier = Classifier.Classifier()
classifier.load_training("/full/path/to/training/nypl-labeled-train.csv")

## Optionally, load validation dataset
classifier.load_validation("/full/path/to/validation/nypl-labeled-validate.csv")

## Train your classifier (with default settings)
classifier.train()

## Create an entry object from string
entry = LabeledEntry.LabeledEntry("Cappelmann Otto, grocer, 133 VVashxngton, & liquors, 170 Greenwich, h. 109 Cedar")

## Pass the entry to the classifier
classifier.label(entry)

## Export the labeled entry as JSON and print it
print(json.dumps(entry.categories))

From bash (using parse.py):

cat /path/to/nypl-1851-1852-entries-sample.txt | python3 parse.py --training /path/to/nypl-labeled-70-training.csv
