CodeForPhilly/pbf-scraping

Error parsing offenses for some dockets

Closed · 1 comment

The most common error I got when parsing the 13,000 dockets was the one below. As best I can tell, it happens because the charges on these dockets are not numbered sequentially. Some dockets have only charges 3 and 4. In other cases, charges 1, 2, 3, 5, 8, and 9 were on the docket, but the others had been dropped along the way. We only encounter this when parsing dockets for older cases, since charges may be dropped or otherwise updated over time.
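To illustrate the suspected cause, here is a minimal sketch, not the actual code in `funcs_parse.py`. It assumes the parser sizes its array by the number of charges found and then indexes it by charge number. The helper `bottoms_by_position` and the sample coordinates are hypothetical, and plain lists stand in for the NumPy array that the error message suggests the real code uses.

```python
def bottoms_by_position(charge_numbers, y_values):
    """Store one y-coordinate per charge, slotted by position, not by charge number.

    Hypothetical helper for illustration; not part of pbf-scraping.
    """
    # Map each charge number that actually appears to a 0-based slot.
    slot = {seq: i for i, seq in enumerate(sorted(set(charge_numbers)))}
    y_bottom = [0.0] * len(slot)
    for seq, y in zip(charge_numbers, y_values):
        y_bottom[slot[seq]] = y
    return y_bottom


# A docket that only lists charges 3 and 4 (made-up coordinates).
charge_numbers = [3, 4]
y_values = [510.0, 480.0]

# Assumed failure mode: the container has one slot per charge found,
# but is indexed by charge number minus one.
y_by_number = [0.0] * len(charge_numbers)
try:
    y_by_number[charge_numbers[0] - 1] = y_values[0]  # index 2, size 2
except IndexError as exc:
    print("indexing by charge number fails:", exc)

# Indexing by position works regardless of gaps in the numbering.
print(bottoms_by_position(charge_numbers, y_values))
```

Whether the real fix should remap charge numbers like this or size the array by the highest charge number depends on how the rest of the parser consumes the array.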

Here are some samples. I can provide more.

MC-51-CR-0002740-2020.pdf
MC-51-CR-0000353-2020.pdf

```
  File "/Users/Shared/CFP Scraping/pbf-scraping/analyses/full_dockets/one_time_parse.py", line 13, in <module>
    parse = parse_pdf(path_folder+file, text)
  File "/Users/Shared/CFP Scraping/pbf-scraping/analyses/full_dockets/parse_docket.py", line 113, in parse_pdf
    result['offenses'] = get_charges(pdf, pages_charges)
  File "/Users/Shared/CFP Scraping/pbf-scraping/analyses/full_dockets/funcs_parse.py", line 141, in get_charges
    charges = offense(pdf,p,y2_1,y1_0,x1_0,x3_0,charges)
  File "/Users/Shared/CFP Scraping/pbf-scraping/analyses/full_dockets/funcs_parse.py", line 49, in offense
    y_array_bottom[k-1-h] = y
IndexError: index 2 is out of bounds for axis 0 with size 2
```

Solved