Handling chapters/subchapters

Question

Closed this issue 12 years ago · 15 comments

So typically, chapters are not used in citations. You'll see "5 USC 552", and 5 is the title, and 552 is the section. Technically, section 552 is inside some chapter, but that seems more like a publishing convenience than a formal location in the Code.

I have observed the occasional US Code citation that references a whole chapter instead of a section. You can see a mix of them here:
https://scout.sunlightfoundation.com/search/federal_bills/%22chapter%2083%22%20%22title%22/advanced

So I'm not suggesting discarding the data. Perhaps this could generate two JSON files, one just for chapters within titles, and one for structure sans chapters.

Or maybe the simplest solution is to just document this, so that anyone who's storing these section markers as citations for later lookup should know to ignore the chapter and subchapter levels of the hierarchy. Either way, I want to flag the dissonance and see what people think.

JoshData commented 12 years ago

Cool!

Answer 1 · 2013-04-01T16:02:19.000Z

FWIW, this is a big problem within state codes, too. Handling chapter citations is hard enough (and of limited enough value) that I finally just gave up.

Answer 2 · 2013-04-01T16:07:07.000Z

Then maybe I am suggesting discarding the data! It'd certainly be a lot easier if people using the JSON can just assemble citations from the hierarchy provided, without white- or blacklisting anything.

Answer 3 · 2013-04-01T16:11:11.000Z

I didn't mean to encourage you to give up. :) Because of how my data storage is structured and how wildly inconsistent references to chapters are, it just wasn't worth the effort. Your mileage may vary!

Answer 4 · 2013-04-01T16:11:36.000Z

The reason I created the script is that I want the whole structure for browse/navigation purposes and so you can "track" a whole title/chapter/etc. and get updates about bills that cite any part of it.

Getting a list of sections only is a much easier task (just grep for h3's that start with the section symbol).
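As a rough illustration of that "grep for h3's" idea, here is a minimal Python sketch. The sample markup, the class names, and the `section_headings` helper are assumptions made up for this example; the real US Code HTML may encode the section symbol or the headings differently.

```python
import re

# Assumed pattern: an <h3> whose text starts with the section symbol.
SECTION_H3 = re.compile(r"<h3[^>]*>\s*(§\s*[^<]*)</h3>", re.IGNORECASE)


def section_headings(html):
    """Return the text of every <h3> that begins with the section symbol."""
    return [match.strip() for match in SECTION_H3.findall(html)]


# Hypothetical sample markup, not copied from the real download.
sample = """
<h3 class="chapter-head">CHAPTER 5 - ADMINISTRATIVE PROCEDURE</h3>
<h3 class="section-head">§552. Public information</h3>
<h3 class="section-head">§552a. Records maintained on individuals</h3>
"""

print(section_headings(sample))
```

The chapter heading is skipped because it does not start with the section symbol, which is the whole trick.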

Answer 5 · 2013-04-01T16:14:31.000Z

Not sections only - every other "level" is relevant to a cite (paragraph, subsection, subparagraph, whatever). It's just chapters that are this weird parallel hierarchy that doesn't often get used.

If you allow someone to track a chapter, your detection will always be indirect, because citations are nearly always title+section+sub-stuff, and ignore the chapter. So you'd have to do a DB lookup on every cite you see, to figure out what chapter that cite is contained in.
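To make the lookup concrete, here is a sketch of building that section-to-chapter index once from the structure output, so each cite becomes a dictionary lookup. The node shape (`level`, `number`, `name`, `subparts`) and the toy data are assumptions for illustration, and `index_sections_by_chapter` is a hypothetical helper, not part of the script.

```python
def index_sections_by_chapter(node, title=None, chapter=None, found=None):
    """Walk a nested structure and map (title, section) -> chapter number."""
    if found is None:
        found = {}
    level = node.get("level")
    if level == "title":
        title = node["number"]
    elif level == "chapter":
        chapter = node["number"]
    elif level == "section":
        found[(title, node["number"])] = chapter
    # Levels such as subchapter or part fall through and keep the current chapter.
    for child in node.get("subparts", []):
        index_sections_by_chapter(child, title, chapter, found)
    return found


# Toy data in an assumed shape; only two sections are included.
structure = {
    "level": "title", "number": "5", "name": "Government Organization and Employees",
    "subparts": [
        {"level": "chapter", "number": "5", "name": "Administrative Procedure",
         "subparts": [
             {"level": "subchapter", "number": "II", "name": "Administrative Procedure",
              "subparts": [
                  {"level": "section", "number": "552", "name": "Public information"},
              ]},
         ]},
        {"level": "chapter", "number": "83", "name": "Retirement",
         "subparts": [
             {"level": "section", "number": "8331", "name": "Definitions"},
         ]},
    ],
}

chapter_of = index_sections_by_chapter(structure)
print(chapter_of[("5", "552")])   # chapter that contains 5 USC 552
print(chapter_of[("5", "8331")])  # chapter that contains 5 USC 8331
```

Someone tracking "chapter 83 of title 5" would then match any cite whose (title, section) pair maps to "83".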

Answer 6 · 2013-04-01T16:28:01.000Z

every other "level" is relevant to a cite

Nothing between title and section is relevant (chapter, subchapter, part, subpart, division, and I believe even 'title' again).

So you'd have to do a DB lookup on every cite you see, to figure out what chapter that cite is contained in.

Yep.

Answer 7 · 2013-04-01T16:54:45.000Z

Oh, I see what you mean. I didn't realize that. Also, it looks like the expcites don't go below section. Are subsections something you're interested in collecting through this, too? I'm personally more interested in the "bottom half" (section and below) than the top (section and above).

Answer 8 · 2013-04-01T18:07:49.000Z

Are subsections something you're interested in collecting through this, too?

The more the better, but I haven't looked at what that is like to parse. If it fits in, great. Otherwise I think it might be better to put the bottom-half citations in a separate script and separate output.

Answer 9 · 2013-04-02T15:07:36.000Z

The subsections aren't easy to parse at all. Nor do they have names. So I'll probably content myself with title+section, even if citation parsing needs to handle subsections.

I'd like to make two changes. One is to rename the task to just structure.py, so the command is just ./run structure. The other is to add a flag that skips a bunch of levels and returns only titles and sections (e.g. --sections-only), since that produces a predictable, mostly flat hierarchy and is the only information needed to augment most cite parsing.

Answer 10 · 2013-04-02T15:15:29.000Z

I'd also like to add a citation_id to each returned section, that can sync up with unitedstates/citation and can potentially be a simple ID format we each end up using and that makes integration easier. I was looking through this bill text XML from Congress earlier and I see it uses IDs of the form usc/[title]/[section]. I also like that slashes follow the mindset that @grantcv1 suggests for legal identifiers. So how about that?

unitedstates/citation doesn't use slashes, or that order of pieces, but I'm certainly fine changing it to do so, even if it means re-indexing all my cites downstream. I don't want to keep doing that though, so if anyone objects to that, let me know.

Answer 11 · 2013-04-02T15:44:45.000Z

add a flag to skip a bunch of levels and return only titles and sections

Can you do it as a filter on the output (as opposed to changing how the parsing works)?

I'd also like to add a citation_id ... of the form usc/[title]/[section]

Sounds great!

Answer 12 · 2013-04-02T22:30:02.000Z

Can you do it as a filter on the output (as opposed to changing how the parsing works)?

I did sort of a mix - I added a conditional in parse_h3 that doesn't bother adding a thing onto the path if --sections is specified and it's not a title or section. But it doesn't affect the actual page parsing logic or program flow. Is this okay?
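For comparison, a pure output-filter version (applied after parsing, rather than inside parse_h3) could look roughly like the sketch below. It is not the code in the repository: the node shape (`level`, `number`, `name`, `subparts`, plus `citation` on sections) is assumed, and `sections_only` is a hypothetical helper.

```python
def sections_only(title_node):
    """Return a copy of a title node whose subparts are just its sections, in order."""

    def collect(node):
        sections = []
        for child in node.get("subparts", []):
            if child.get("level") == "section":
                # Copy the section without any nested subparts.
                sections.append({k: v for k, v in child.items() if k != "subparts"})
            else:
                # Skip the intermediate level but keep descending through it.
                sections.extend(collect(child))
        return sections

    flat = {k: v for k, v in title_node.items() if k != "subparts"}
    flat["subparts"] = collect(title_node)
    return flat


# Toy input in an assumed shape.
title5 = {
    "level": "title", "number": "5", "name": "Government Organization and Employees",
    "subparts": [
        {"level": "chapter", "number": "5", "name": "Administrative Procedure",
         "subparts": [
             {"level": "section", "number": "552", "name": "Public information",
              "citation": "usc/5/552"},
         ]},
        {"level": "chapter", "number": "83", "name": "Retirement",
         "subparts": [
             {"level": "section", "number": "8331", "name": "Definitions",
              "citation": "usc/5/8331"},
         ]},
    ],
}

flat = sections_only(title5)
print([s["citation"] for s in flat["subparts"]])
```

The filter builds new dictionaries instead of mutating its input, so the full hierarchy stays available for browse/navigation use.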

I added a citation field to the output for sections. I also updated the repository to use the double-dash format the others do (so now it's --year=2011).

It defaults to "uscprelim" for --year. --limit limits the number of processed titles. --title does only a specified title (though it still returns an array, no change to the output format). --sections returns only a flattish hierarchy of title+sections. --debug causes it to print out only debug info, and doesn't output JSON to STDOUT.

For the --title option, it expects titles of the form "5" or "5a" (no 0-prefix). On output, the citation format uses "usc/[title]/[section]", and the title in that citation is also not 0-prefixed.
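A small sketch of assembling an ID in that format, for anyone building cites on the consuming side. The `usc_citation_id` helper and its validation rule (digits plus an optional trailing "a") are assumptions for illustration, not code from the script.

```python
import re


def usc_citation_id(title, section):
    """Build a "usc/[title]/[section]" ID with no 0-prefix on the title."""
    title = str(title).strip().lstrip("0")
    section = str(section).strip()
    # Assumed title shape: digits with an optional trailing "a", e.g. "5" or "5a".
    if not re.fullmatch(r"\d+a?", title):
        raise ValueError(f"unexpected title: {title!r}")
    if not section:
        raise ValueError("section must not be empty")
    return f"usc/{title}/{section}"


print(usc_citation_id("5", "552"))
print(usc_citation_id("05a", "1"))
```

Stripping the leading zero on input means "05" and "5" produce the same ID, which keeps downstream lookups consistent.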

Answer 13 · 2013-04-02T22:31:12.000Z

Oh, and - (and I apologize for causing you inconvenience downstream) - I've renamed the script to just "structure". I also updated the README with all of this information. The last thing left I want to do is merge the download work into the Python script, and cut out the extra step. The utils.py file already has all the basic download logic we use in our other repos, so I'll use that.

Answer 14 · 2013-04-11T21:55:50.000Z

For anyone who was following this - I used the output of this work in Scout, so that every search page for a US Code section includes the name of its title and section, and a link to Cornell:

https://scout.sunlightfoundation.com/search/all/5%20usc%20552