This project was done in support of the "An example driven introduction to Data Science" presentation that I gave for the @KC_DC

Engineer's Notebook

I've documented my thoughts about creating and using a notebook. Here I discuss today's digital notebook experience.

Finding the data

sourcing the data

learning key words

searching

bearing fruit

Missouri Department of Corrections Sunshine Law Offender Data File

Analysis

Offender Data File Layout specification

The data file is too large for most applications to open, so I have included a sample file containing its first 200 lines.
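A sample like this can be cut from the full file without loading it into memory. The following is a minimal sketch, not the method actually used for the included sample; the function name write_sample and its parameters are assumptions made for illustration.

```python
from itertools import islice


def write_sample(source_path, sample_path, line_count=200):
    """Copy the first `line_count` lines of source_path to sample_path.

    Returns the number of lines actually copied.
    """
    # Work in bytes so undecodable characters cannot raise an error here.
    with open(source_path, "rb") as source, open(sample_path, "wb") as sample:
        # islice stops reading after line_count lines, so the rest of the
        # large file is never touched.
        lines = list(islice(source, line_count))
        sample.writelines(lines)
    return len(lines)
```

Calling write_sample with the path of the extracted data file and a destination path writes the sample; if the source has fewer lines than requested, the whole file is copied.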

Extract, Transform and Load

After you download and extract the Offender Data file, update the path to the file in LoadOffenderData.py.

Next, run the Python script:

python LoadOffenderData.py > offender_data.json

This produces a new file, offender_data.json, which contains the offender data as JSON documents. You may also see some errors caused by UTF-8 translation.
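LoadOffenderData.py itself is not reproduced here. As a rough sketch of the kind of transform involved, the code below assumes fixed-width records and decodes each line tolerantly, so that bytes that are not valid UTF-8 become replacement characters instead of raising errors. EXAMPLE_LAYOUT and its offsets are hypothetical; the real positions come from the Offender Data File Layout specification.

```python
import json

# HYPOTHETICAL layout: (field name, start, end) as 0-based slice offsets.
# The real offsets come from the Offender Data File Layout specification.
EXAMPLE_LAYOUT = [
    ("docId", 0, 8),
    ("lastName", 8, 26),
    ("firstName", 26, 41),
]


def parse_record(raw_line, layout=EXAMPLE_LAYOUT):
    """Decode one raw line (bytes) and slice it into a dict of strings."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    text = raw_line.decode("utf-8", errors="replace").rstrip("\r\n")
    return {name: text[start:end].strip() for name, start, end in layout}


def to_json_line(raw_line, layout=EXAMPLE_LAYOUT):
    """Return the parsed record as a single-line JSON document."""
    return json.dumps(parse_record(raw_line, layout))
```

Reading the source file in binary mode and passing each line to to_json_line yields one JSON document per line.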

Individual documents in the generated file look like this:

{
	"sentenceLengthDays":99,
	"suffix":"",
	"sentenceDate":"19560316",
	"MissouriCharge":"10021040",
	"birthDate":"19290622",
	"sentenceProbationDate":"00000000",
	"probationType":"",
	"sentenceLengthYears":9999,
	"OffenseDescription":"TC:MURDER1ST-FIST",
	"middleName":"",
	"sentenceMinimumReleaseDate":"99999999",
	"probationTermYears":0,
	"sentenceLengthMonths":99,
	"CcCsInd":"",
	"probationTermDays":0,
	"docId":"00000001",
	"OffenseCounty":"St.LouisCity",
	"completed":"Y",
	"NcicCode":"0904",
	"firstName":"PAUL",
	"CauseNo":"1265D",
	"probationTermMonths":0,
	"lastName":"SMITH",
	"SentenceCounty":"St.LouisCity",
	"DocLocFuncFlag":"",
	"sentenceMaximumReleaseDate":"99999999",
	"offenderAssignedPlace":"",
	"race":"Black",
	"gender":"Male"
}
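The date fields in this document are eight-character strings. The sketch below assumes they are in YYYYMMDD form and that all-zero and all-nine values are placeholders rather than real dates; both points are inferred from the sample above, not confirmed by the layout specification. The helper names are made up for illustration.

```python
from datetime import date

# ASSUMPTION: all-zero and all-nine dates are placeholders, not real dates.
PLACEHOLDER_DATES = {"00000000", "99999999"}


def parse_doc_date(value):
    """Convert a YYYYMMDD string to a date, or None if it is not usable."""
    if value in PLACEHOLDER_DATES or len(value) != 8 or not value.isdigit():
        return None
    try:
        return date(int(value[:4]), int(value[4:6]), int(value[6:]))
    except ValueError:
        # Digits that do not form a real calendar date, e.g. month 13.
        return None


def age_at_sentencing(document):
    """Whole years between birthDate and sentenceDate, or None if unknown."""
    born = parse_doc_date(document.get("birthDate", ""))
    sentenced = parse_doc_date(document.get("sentenceDate", ""))
    if born is None or sentenced is None:
        return None
    # Subtract one year if the birthday had not yet occurred that year.
    before_birthday = (sentenced.month, sentenced.day) < (born.month, born.day)
    return sentenced.year - born.year - before_birthday
```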

Now load this data file into MongoDB:

mongoimport --db doc --collection offenders offender_data.json

You'll end up with a doc database that contains an offenders collection using ~2 GB of disk space.

Additional Analysis

Using the distinct operation on the offenders collection is a good way to do some discovery work in the data.

> db.offenders.distinct('race')
[
        "Asian/Pacific Islander",
        "Black",
        "Nat Am/Alaskan",
        "Unknown",
        "White"
]

> db.offenders.distinct('gender')
[ "Male", "Female", "Unknown" ]
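The same kind of discovery can be done directly on offender_data.json without MongoDB. This is a sketch of a different technique from the distinct operation above: it counts values in plain Python and assumes the file holds one JSON document per line. The function name count_values is an assumption for illustration.

```python
import json
from collections import Counter


def count_values(json_lines, field):
    """Count each distinct value of `field` across JSON documents.

    `json_lines` is any iterable of strings, one JSON document per line.
    """
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        document = json.loads(line)
        # Documents missing the field are counted under None.
        counts[document.get(field)] += 1
    return counts
```

Passing an open file handle and a field name such as 'race' returns a Counter whose keys are the distinct values and whose counts show how common each one is.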

Important note about the Missouri Charge field.

Refer back to the DOC format description and you will find:

This will contain the 8-digit code associated with this offense from court papers or the Missouri Charge Code Manual. Felony class may be used to insure the correct match. Positions 1 through 5 are the major category code. Positions 6 and 7 contain the NCIC/State Modifier range. These positions of the MO Code match the last two digits of the NCIC code for the charge. The eighth position may be 0 for Not Applicable, 1 for Attempt, 2 for Accessory or 3 for Conspiracy.

So, to find first-degree murder charges, match on the five-digit major category code 10021:

> db.offenders.distinct('MissouriCharge',{"MissouriCharge":/^10021.*/})
[
        "10021040",
        "10021070",
        "10021990",
        "10021020",
        "10021010",
        "10021030",
        "10021110",
        "10021120",
        "10021991",
        "1002199",
        "10021993",
        "10021992",
        "10021090",
        "10021",
        "10021050",
        "10021033",
        "10021121",
        "10021113",
        "10021013",
        "10021103",
        "10021023",
        "10021043"
]
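The position layout quoted from the DOC format description can be sketched as a small decoder. The function and key names are assumptions for illustration. Note that the query results above include codes shorter than eight characters, such as "10021", so the sketch flags those rather than failing on them.

```python
# Meanings of the eighth position, as listed in the DOC format description.
EIGHTH_POSITION = {
    "0": "Not Applicable",
    "1": "Attempt",
    "2": "Accessory",
    "3": "Conspiracy",
}


def decode_charge(code):
    """Split a Missouri charge code into its documented parts."""
    return {
        "majorCategory": code[:5],      # positions 1 through 5
        "modifier": code[5:7] or None,  # positions 6 and 7
        # Short codes have no eighth position; unlisted digits map to None.
        "eighthPosition": EIGHTH_POSITION.get(code[7:8]),
        "wellFormed": len(code) == 8 and code.isdigit(),
    }


def modifier_matches_ncic(code, ncic_code):
    """Check that positions 6 and 7 equal the last two NCIC digits."""
    return len(code) >= 7 and code[5:7] == ncic_code[-2:]
```

For the sample document shown earlier, the charge "10021040" decodes to major category "10021" and modifier "04", which matches the last two digits of its NcicCode "0904".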

Other fields

The completed field indicates whether the offender has completed the sentence. SentenceCounty may differ from OffenseCounty, for example after a change of venue. A sentenceLengthYears value of all 9s (9999) indicates a life sentence.
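These field notes can be turned into a few derived flags per document. This is a minimal sketch; the function name and output keys are made up, and it assumes that "Y" in the completed field means the sentence was completed, as the sample document suggests.

```python
LIFE_SENTENCE_YEARS = 9999  # sentenceLengthYears of all 9s


def summarize_sentence(document):
    """Derive simple boolean flags from one offender document."""
    return {
        # ASSUMPTION: "Y" marks a completed sentence.
        "completed": document.get("completed") == "Y",
        "lifeSentence": document.get("sentenceLengthYears") == LIFE_SENTENCE_YEARS,
        # True when sentencing happened in a different county than the offense.
        "countyDiffers": document.get("SentenceCounty") != document.get("OffenseCounty"),
    }
```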

Visualization

Springboard

I think it would be interesting to load this data into a graph database and do some querying and visualization that way. For entities (nodes), I'm thinking offenders, counties and charges would be prime candidates.
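As a first step in that direction, the node and relationship idea can be prototyped in plain Python before choosing a graph database. This is only a sketch: the relationship names (CHARGED_WITH, OFFENSE_IN, SENTENCED_IN) and function names are hypothetical and do not come from any existing schema.

```python
from collections import defaultdict


def build_edges(documents):
    """Return a set of (source node, relationship, target node) triples.

    Nodes are (kind, identifier) tuples for offenders, charges and counties.
    """
    edges = set()
    for doc in documents:
        offender = ("offender", doc["docId"])
        edges.add((offender, "CHARGED_WITH", ("charge", doc["MissouriCharge"])))
        edges.add((offender, "OFFENSE_IN", ("county", doc["OffenseCounty"])))
        edges.add((offender, "SENTENCED_IN", ("county", doc["SentenceCounty"])))
    return edges


def offenders_by_charge(edges):
    """Group offender ids by the charge code they are linked to."""
    grouped = defaultdict(set)
    for (_, offender_id), relationship, (_, target_id) in edges:
        if relationship == "CHARGED_WITH":
            grouped[target_id].add(offender_id)
    return dict(grouped)
```

Because the edges are a set, an offender with several documents for the same charge and county contributes each relationship only once.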

k0emt/corrections