This project was done in support of the "An example driven introduction to Data Science" presentation that I gave for @KC_DC.
I've documented my thoughts about creating and using a notebook. Here I discuss Today's digital notebook experience.
Missouri Department of Corrections Sunshine Law Offender Data File
Offender Data File Layout specification
The data file is too large for most applications to open, so I have included a sample file containing its first 200 lines.
After you download and extract the Offender Data file, update the path to the file in the LoadOffenderData.py code.
Next, run the Python script and redirect its output to a file:
python LoadOffenderData.py > offender_data.json
This produces a new file, offender_data.json, which, as its name implies, contains the offender data. You may also see some errors caused by UTF-8 translation.
Individual documents in the generated file look like this:
{
"sentenceLengthDays":99,
"suffix":"",
"sentenceDate":"19560316",
"MissouriCharge":"10021040",
"birthDate":"19290622",
"sentenceProbationDate":"00000000",
"probationType":"",
"sentenceLengthYears":9999,
"OffenseDescription":"TC:MURDER1ST-FIST",
"middleName":"",
"sentenceMinimumReleaseDate":"99999999",
"probationTermYears":0,
"sentenceLengthMonths":99,
"CcCsInd":"",
"probationTermDays":0,
"docId":"00000001",
"OffenseCounty":"St.LouisCity",
"completed":"Y",
"NcicCode":"0904",
"firstName":"PAUL",
"CauseNo":"1265D",
"probationTermMonths":0,
"lastName":"SMITH",
"SentenceCounty":"St.LouisCity",
"DocLocFuncFlag":"",
"sentenceMaximumReleaseDate":"99999999",
"offenderAssignedPlace":"",
"race":"Black",
"gender":"Male"
}
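The date fields are stored as YYYYMMDD strings, and the sample record uses "00000000" and "99999999" where no real date applies. A minimal sketch of turning them into Python dates follows. Treating both placeholder values as "no date" is an assumption drawn from the sample record, not from the layout specification, and parse_doc_date is a hypothetical helper name.

```python
from datetime import datetime

# Assumption: these placeholder strings mean "no date". The sample record
# uses both, but the layout specification is the authority on their meaning.
PLACEHOLDER_DATES = {"", "00000000", "99999999"}


def parse_doc_date(value):
    """Return a datetime.date for a YYYYMMDD string, or None for a placeholder or invalid value."""
    if value in PLACEHOLDER_DATES:
        return None
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except ValueError:
        return None


# Date fields copied from the sample document above.
record = {
    "sentenceDate": "19560316",
    "birthDate": "19290622",
    "sentenceProbationDate": "00000000",
    "sentenceMaximumReleaseDate": "99999999",
}

parsed = {field: parse_doc_date(value) for field, value in record.items()}
print(parsed["sentenceDate"])           # 1956-03-16
print(parsed["sentenceProbationDate"])  # None
```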
Now, take this data file and load it into MongoDB.
mongoimport --db doc --collection offenders offender_data.json
You'll end up with a doc database that contains an offenders collection using ~2 GB of disk space.
Using the distinct operation on the offenders collection is a good way to do some discovery work in the data.
> db.offenders.distinct('race')
[
"Asian/Pacific Islander",
"Black",
"Nat Am/Alaskan",
"Unknown",
"White"
]
> db.offenders.distinct('gender')
[ "Male", "Female", "Unknown" ]
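The same kind of discovery can be approximated without MongoDB by scanning the JSON file directly. The sketch below assumes offender_data.json holds one JSON document per line, which is not confirmed here. distinct_values is a hypothetical helper, and the three sample lines are made up for illustration rather than taken from the data file.

```python
import json


def distinct_values(lines, field):
    """Collect the sorted unique values of one field from an iterable of JSON lines."""
    values = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        document = json.loads(line)
        if field in document:
            values.add(document[field])
    return sorted(values)


# Made-up illustration records, not real rows from the data file.
sample_lines = [
    '{"docId": "A", "race": "Black", "gender": "Male"}',
    '{"docId": "B", "race": "White", "gender": "Female"}',
    '{"docId": "C", "race": "Black", "gender": "Male"}',
]

print(distinct_values(sample_lines, "race"))    # ['Black', 'White']
print(distinct_values(sample_lines, "gender"))  # ['Female', 'Male']
```

To run it against the real file, pass an open file handle instead of the list, for example `distinct_values(open("offender_data.json", encoding="utf-8"), "race")`. Unlike the MongoDB output, the result here is sorted.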
Refer back to the DOC format description and you will find this explanation of the MissouriCharge field:
This will contain the 8-digit code associated with this offense from court papers or the Missouri Charge Code Manual. Felony class may be used to insure the correct match. Positions 1 through 5 are the major category code. Positions 6 and 7 contain the NCIC/State Modifier range. These positions of the MO Code match the last two digits of the NCIC code for the charge. The eighth position may be 0 for Not Applicable, 1 for Attempt, 2 for Accessory or 3 for Conspiracy.
So, to list the charge codes recorded for first-degree murder (major category 10021, as in the sample document), the query looks like this:
> db.offenders.distinct('MissouriCharge',{"MissouriCharge":/^10021.*/})
[
"10021040",
"10021070",
"10021990",
"10021020",
"10021010",
"10021030",
"10021110",
"10021120",
"10021991",
"1002199",
"10021993",
"10021992",
"10021090",
"10021",
"10021050",
"10021033",
"10021121",
"10021113",
"10021013",
"10021103",
"10021023",
"10021043"
]
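The code layout quoted from the format description can be applied in a few lines of Python. This is a minimal sketch: decode_charge and its output keys are hypothetical names, and returning None for anything that is not exactly eight digits is one possible way to handle the short values that appear in the result above, such as "1002199" and "10021".

```python
# Meaning of the eighth position, as quoted from the format description.
CHARGE_TYPES = {
    "0": "Not Applicable",
    "1": "Attempt",
    "2": "Accessory",
    "3": "Conspiracy",
}


def decode_charge(code):
    """Split an 8-digit Missouri charge code into its parts, or return None if malformed."""
    if len(code) != 8 or not code.isdigit():
        return None
    return {
        "majorCategory": code[0:5],  # positions 1 through 5
        "modifier": code[5:7],       # positions 6 and 7 (NCIC/State modifier)
        # Position 8; values outside 0-3 are not described in the quote.
        "chargeType": CHARGE_TYPES.get(code[7], "Unknown"),
    }


print(decode_charge("10021040"))
print(decode_charge("10021991"))
print(decode_charge("1002199"))  # None
```

For the sample document, the modifier "04" matches the last two digits of its NcicCode, "0904", as the description says it should.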
The completed field indicates whether the offender has completed their sentence.
SentenceCounty may differ from OffenseCounty; think change of venue.
A sentenceLengthYears value of all 9s (9999) indicates a life sentence.
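The three field notes above can be turned into simple flags on a document. This is only a sketch: summarize and its output keys are hypothetical names, and reading "Y" as "completed" is an assumption based on the sample document, since the other possible values of the field are not shown here.

```python
LIFE_SENTENCE_YEARS = 9999  # all 9s in sentenceLengthYears means a life sentence


def summarize(record):
    """Derive a few boolean flags from one offender document."""
    return {
        # Assumption: "Y" means the sentence has been completed.
        "completed": record.get("completed") == "Y",
        "lifeSentence": record.get("sentenceLengthYears") == LIFE_SENTENCE_YEARS,
        # True when the sentencing county is not the offense county.
        "countiesDiffer": record.get("SentenceCounty") != record.get("OffenseCounty"),
    }


# Fields copied from the sample document.
sample = {
    "completed": "Y",
    "sentenceLengthYears": 9999,
    "OffenseCounty": "St.LouisCity",
    "SentenceCounty": "St.LouisCity",
}

print(summarize(sample))
```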
I think it would be interesting to get this data into a graph-oriented database and do some querying and visualization that way. For entities (nodes), I'm thinking offenders, counties and charges would be prime candidates.
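As a rough sketch of that idea, one document could be mapped to nodes and edges before loading it into any particular graph database. The relationship names (CHARGED_WITH, OFFENSE_IN, SENTENCED_IN) and the to_edges helper are hypothetical choices for illustration, not part of this project.

```python
def to_edges(record):
    """Map one offender document to (source node, relationship, target node) triples."""
    offender = ("offender", record["docId"])
    return [
        (offender, "CHARGED_WITH", ("charge", record["MissouriCharge"])),
        (offender, "OFFENSE_IN", ("county", record["OffenseCounty"])),
        (offender, "SENTENCED_IN", ("county", record["SentenceCounty"])),
    ]


# Fields copied from the sample document.
sample = {
    "docId": "00000001",
    "MissouriCharge": "10021040",
    "OffenseCounty": "St.LouisCity",
    "SentenceCounty": "St.LouisCity",
}

edges = to_edges(sample)

# Nodes are the unique endpoints; the county appears once even though two edges use it.
nodes = {node for source, _, target in edges for node in (source, target)}

for edge in edges:
    print(edge)
print(len(nodes))  # 3
```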