Handling chapters/subchapters
So typically, chapters are not used in citations. You'll see "5 USC 552", and 5 is the title, and 552 is the section. Technically, section 552 is inside some chapter, but that seems more like a publishing convenience than a formal location in the Code.
I have observed the occasional US Code citation that references a whole chapter instead of a section. You can see a mix of them here:
https://scout.sunlightfoundation.com/search/federal_bills/%22chapter%2083%22%20%22title%22/advanced
So I'm not suggesting discarding the data. Perhaps this could generate two JSON files, one just for chapters within titles, and one for structure sans chapters.
Or maybe the simplest solution is to just document this, so that anyone who's storing these section markers as citations for later lookup should know to ignore the chapter and subchapter levels of the hierarchy. Either way, I want to flag the dissonance and see what people think.
FWIW, this is a big problem within state codes, too. Handling chapter citations is hard enough (and of limited enough value) that I finally just gave up.
Then maybe I am suggesting discarding the data! It'd certainly be a lot easier if people using the JSON can just assemble citations from the hierarchy provided, without white- or blacklisting anything.
I didn't mean to encourage you to give up. :) Because of how my data storage is structured and how wildly inconsistent references to chapters are, it just wasn't worth the effort. Your mileage may vary!
The reason I created the script is that I want the whole structure for browse/navigation purposes and so you can "track" a whole title/chapter/etc. and get updates about bills that cite any part of it.
Getting a list of sections only is a much easier task (just grep for h3's that start with the section symbol).
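For anyone following along, here is a rough sketch of that "grep for h3's" idea. It assumes section headings appear as h3 elements whose text begins with the section symbol; the sample markup, class names, and the function name are made up for illustration and are not taken from the real files.

```python
import re
from html import unescape

# Matches any <h3 ...>...</h3> element; inner markup is stripped afterwards.
H3 = re.compile(r"<h3[^>]*>\s*(.*?)\s*</h3>", re.IGNORECASE | re.DOTALL)


def section_headings(html_text):
    """Return the text of every h3 that starts with the section symbol."""
    headings = []
    for match in H3.finditer(html_text):
        # Drop nested tags, then decode entities such as &sect;
        text = unescape(re.sub(r"<[^>]+>", "", match.group(1))).strip()
        if text.startswith("§"):
            headings.append(text)
    return headings


# Hypothetical markup, only to exercise the function.
sample = """
<h3 class="chapter-head">CHAPTER 5 - ADMINISTRATIVE PROCEDURE</h3>
<h3 class="section-head">&sect;552. Public information</h3>
<h3 class="section-head">§553. Rule making</h3>
"""

print(section_headings(sample))
```

This only yields a flat list of headings, which is why it is the easy case: nothing here knows which chapter or part a section belongs to.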
Not sections only - every other "level" is relevant to a cite (paragraph, subsection, subparagraph, whatever). It's just chapters that are this weird parallel hierarchy that don't often get used.
If you allow someone to track a chapter, your detection will always be indirect, because citations are nearly always title+section+sub-stuff, and ignore the chapter. So you'd have to do a DB lookup on every cite you see, to figure out what chapter that cite is contained in.
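To make the lookup concrete, here is a minimal sketch of the kind of index I mean. It assumes the structure JSON is a list of nested nodes with "level", "number", and "subparts" keys; those key names and both function names are assumptions for illustration, so adjust them to whatever the script actually emits.

```python
def build_ancestor_index(titles):
    """Map (title, section) to the list of levels between title and section."""
    index = {}

    def walk(node, title, ancestors):
        level = node["level"]
        if level == "title":
            title = node["number"]
        elif level == "section":
            index[(title, node["number"])] = ancestors
            return
        else:
            # chapter, subchapter, part, etc.: remember it, keep descending
            ancestors = ancestors + [(level, node["number"])]
        for child in node.get("subparts", []):
            walk(child, title, ancestors)

    for title_node in titles:
        walk(title_node, None, [])
    return index


def cite_is_in(index, title, section, level, number):
    """True if the cited section sits under the given chapter/part/etc."""
    return (level, number) in index.get((title, section), [])


# Hand-written sample in the assumed shape, not real script output.
sample = [{
    "level": "title", "number": "5", "subparts": [{
        "level": "part", "number": "I", "subparts": [{
            "level": "chapter", "number": "5", "subparts": [{
                "level": "subchapter", "number": "II", "subparts": [
                    {"level": "section", "number": "552"},
                ],
            }],
        }],
    }],
}]

index = build_ancestor_index(sample)
print(index[("5", "552")])
print(cite_is_in(index, "5", "552", "chapter", "5"))
```

With an index like this built once, matching a "5 USC 552" cite against someone tracking chapter 5 of title 5 becomes a dictionary lookup, but it is still a lookup on every cite.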
every other "level" is relevant to a cite
Nothing between title and section is relevant (chapter, subchapter, part, subpart, division, and I believe even 'title' again).
So you'd have to do a DB lookup on every cite you see, to figure out what chapter that cite is contained in.
Yep.
Oh, I see what you mean. I didn't realize that. Also, it looks like the expcites don't go below section. Are subsections something you're interested in collecting through this, too? I'm personally more interested in the "bottom half" (section and below) than the top (section and above).
Are subsections something you're interested in collecting through this, too?
The more the better, but I haven't looked at what that is like to parse. If it fits in, great. Otherwise I think it might be better to put the bottom-half citations in a separate script and separate output.
The subsections aren't easy to parse at all. Nor do they have names. So I'll probably content myself with title+section, even if citation parsing needs to handle subsections.
I'd like to make two changes - one is to rename the task to just structure.py, so the command is just ./run structure, and the other is to add a flag to skip a bunch of levels and return only titles and sections (e.g. --sections-only), since that produces a predictable mostly-flat hierarchy, and the only information needed to augment most cite parsing.
I'd also like to add a citation_id to each returned section, that can sync up with unitedstates/citation and can potentially be a simple ID format we each end up using and that makes integration easier. I was looking through this bill text XML from Congress earlier and I see it uses IDs of the form usc/[title]/[section]. I also like that slashes follow the mindset that @grantcv1 suggests for legal identifiers. So how about that?
unitedstates/citation doesn't use slashes, or that order of pieces, but I'm certainly fine changing it to do so, even if it means re-indexing all my cites downstream. I don't want to keep doing that though, so if anyone objects to that, let me know.
add a flag to skip a bunch of levels and return only titles and sections
Can you do it as a filter on the output (as opposed to changing how the parsing works)?
I'd also like to add a citation_id ... of the form usc/[title]/[section]
Sounds great!
Can you do it as a filter on the output (as opposed to changing how the parsing works)?
I did sort of a mix - I added a conditional in parse_h3 that doesn't bother adding a thing onto the path if --sections is specified and it's not a title or section. But it doesn't affect the actual page parsing logic or program flow. Is this okay?
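For comparison, a pure filter on the output (the alternative asked about above) could look roughly like the sketch below. This is not what the script does; it assumes nodes shaped as dicts with "level", "number", and "subparts" keys, and the function name is hypothetical.

```python
def sections_only(title_node):
    """Return a copy of a title whose subparts are just its sections."""

    def collect(node):
        for child in node.get("subparts", []):
            if child["level"] == "section":
                # Keep the section itself, without anything nested below it
                yield {k: v for k, v in child.items() if k != "subparts"}
            else:
                # Skip the intermediate level but keep looking inside it
                yield from collect(child)

    flat = {k: v for k, v in title_node.items() if k != "subparts"}
    flat["subparts"] = list(collect(title_node))
    return flat


# Hand-written sample in the assumed shape, not real script output.
title5 = {
    "level": "title", "number": "5", "subparts": [
        {"level": "part", "number": "I", "subparts": [
            {"level": "chapter", "number": "5", "subparts": [
                {"level": "section", "number": "552"},
            ]},
        ]},
        {"level": "part", "number": "III", "subparts": [
            {"level": "chapter", "number": "83", "subparts": [
                {"level": "section", "number": "8331"},
            ]},
        ]},
    ],
}

flat = sections_only(title5)
print([s["number"] for s in flat["subparts"]])
```

Either route should end up with the same title+sections shape; the filter just does the skipping after the full hierarchy has been built instead of during parse_h3.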
I added a citation field to the output for sections. I also updated the repository to use the double-dash format the others do (so now it's --year=2011).
It defaults to "uscprelim" for --year. --limit limits the number of processed titles. --title does only a specified title (though it still returns an array, no change to the output format). --sections returns only a flattish hierarchy of title+sections. --debug causes it to print out only debug info, and doesn't output JSON to STDOUT.
For the --title option, it expects titles of the form "5" or "5a" (no 0-prefix). On output, the citation format uses "usc/[title]/[section]", and the title in that citation is also not 0-prefixed.
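If you are building these IDs yourself to match against the output, a minimal sketch of the format might be the following. The function name is hypothetical, and stripping leading zeros from the input is my assumption about how you would normalize a title like "05"; the script itself may do this differently.

```python
def usc_citation_id(title, section):
    """Build an ID of the form usc/[title]/[section], title not 0-prefixed."""
    # Assumption: callers may pass "05" or "05a"; normalize to "5" / "5a".
    title = str(title).strip().lstrip("0")
    section = str(section).strip()
    return f"usc/{title}/{section}"


print(usc_citation_id("5", "552"))
print(usc_citation_id("05a", "1"))
```

The section is passed through untouched, since the thread only settles the title and section parts of the ID and nothing below that.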
Oh, and - (and I apologize for causing you inconvenience downstream) - I've renamed the script to just "structure". I also updated the README with all of this information. The last thing left I want to do is merge the download work into the Python script, and cut out the extra step. The utils.py file already has all the basic download logic we use in our other repos, so I'll use that.
For anyone who was following this - I used the output of this work in Scout, so that every search page for a US Code section includes the name of its title and section, and a link to Cornell:
https://scout.sunlightfoundation.com/search/all/5%20usc%20552
Cool!