harvard-lil/capstone

Directory structure amendments

kilbergr opened this issue · 4 comments

We've manually experimented with creating the file structure we expect.
Now we have opinions!

  • Revise the proposed file formats to reflect our improved understanding of what the format should be.
  • Get sign-off from Jack

From @mdellabitta

redacted/
    Reporters.json
    Volumes.json
    ${reporter_id}/ # aka Reporter Folder; e.g. "pa-d-c"; shortcode already in case.law urls
        Metadata.json
        Volumes.json
        Cases.jsonl
        ${volume_id}/ # aka Volume Folder; e.g. "6"; already in case.law urls
            Metadata.json
            Volume.pdf
            Cases.jsonl
            case/
                1.html # named after the page the case starts on; similar to case.law urls
                1.json
                1.pdf
                6.html
                6.json
                6.pdf
                ...
            vendor/
                ${volume_id}.tar # compression?
                ${volume_id}.csv
                ${volume_id}.tar.sha256
unredacted/
    [same as above, but more secret]

misc/
    [stuff from https://case.law/download/]

@jcushman what do you think of the above world order?
It's a slight variation on your proposal.

This makes sense to me.

  • How do you want to handle cases with the same start page? It could be 1-01.html, 1-02.html, for example. This happens when a case is shorter than a page, and sometimes there can be a bunch on the same page if we digitized a list of cases on a page as separate cases.
  • This discussion may have happened elsewhere, but I'll flag that I'm on the fence about keeping individual case PDFs, since they're large and redundant and infrequently used. The alternative would be to just have something in the case metadata json like "pdf_url": "../Volume.pdf#page=<page_index>".
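
To make the two points above concrete, here is a rough Python sketch, not a settled design. The function names `case_file_stems` and `volume_pdf_url` are made up for illustration. It assumes a case that is alone on its start page keeps the bare page number, and that cases sharing a page get two-digit suffixes in reading order; the thread only floats "1-01.html, 1-02.html" as one possibility. What `<page_index>` counts (PDF page vs. printed page) is left open here too.

```python
from collections import defaultdict


def case_file_stems(first_pages):
    """Return one file stem per case, given each case's start page in reading order.

    Assumption: a page holding a single case keeps the bare page number ("6");
    a page holding several cases gets zero-padded suffixes ("1-01", "1-02", ...).
    """
    # First pass: count how many cases start on each page.
    totals = defaultdict(int)
    for page in first_pages:
        totals[page] += 1

    # Second pass: hand out stems, numbering only the pages that are shared.
    seen = defaultdict(int)
    stems = []
    for page in first_pages:
        if totals[page] == 1:
            stems.append(str(page))
        else:
            seen[page] += 1
            stems.append(f"{page}-{seen[page]:02d}")
    return stems


def volume_pdf_url(page_index):
    """Build the relative pointer into Volume.pdf, as seen from the case/ folder."""
    return f"../Volume.pdf#page={page_index}"


# Hypothetical volume: two short cases start on page 1, one case starts on page 6.
stems = case_file_stems([1, 1, 6])
pdf_url = volume_pdf_url(6)
```

With this scheme the case metadata json would carry the `pdf_url` value in place of a per-case PDF file.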

Altered:

redacted/
    Reporters.json
    Volumes.json
    ${reporter_id}/ # aka Reporter Folder; e.g. "pa-d-c"; shortcode already in case.law urls
        Metadata.json
        Volumes.json
        Cases.jsonl
        ${volume_id}/ # aka Volume Folder; e.g. "6"; already in case.law urls
            Metadata.json
            Volume.pdf
            Cases.jsonl
            case/
                1.html # named after the page the case starts on; similar to case.law urls
                1.json
                6.html
                6.json
                ...
            vendor/
                ${volume_id}.tar # compression?
                ${volume_id}.csv
                ${volume_id}.tar.sha256
unredacted/
    [same as above, but more secret]

misc/
    [stuff from https://case.law/download/]
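
For anyone scripting against the altered layout, a small path helper might look like the following. This is only a sketch under the layout as written above; `case_paths` is a hypothetical name, and it treats the `unredacted/` tree as having exactly the same shape as `redacted/`, per the "same as above" note.

```python
from pathlib import PurePosixPath


def case_paths(reporter_id, volume_id, stem, redacted=True):
    """Return the expected .html and .json paths for one case.

    `stem` is the case file name without extension, e.g. "1" for a case
    that starts on page 1 (hypothetical helper; not part of capstone).
    """
    root = PurePosixPath("redacted" if redacted else "unredacted")
    case_dir = root / reporter_id / str(volume_id) / "case"
    # The altered layout keeps only HTML and JSON per case (no per-case PDF).
    return {ext: str(case_dir / f"{stem}.{ext}") for ext in ("html", "json")}


# Example using the ids mentioned in the tree comments: reporter "pa-d-c", volume "6".
paths = case_paths("pa-d-c", 6, "1")
```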