Directory structure amendments
kilbergr opened this issue · 4 comments
kilbergr commented
We've manually experimented with creating the file structure we expect.
Now we have opinions!
- Alter the proposed file formats according to our new understanding of an improved format.
- Get sign off from Jack
kilbergr commented
From @mdellabitta
redacted/
Reporters.json
Volumes.json
${reporter_id}/ # aka Reporter Folder; e.g. "pa-d-c"; shortcode already in case.law urls
Metadata.json
Volumes.json
Cases.jsonl
${volume_id}/ # aka Volume Folder; e.g. "6"; already in case.law urls
Metadata.json
Volume.pdf
Cases.jsonl
case/
1.html # file names named after page case starts on; similar to case.law urls
1.json
1.pdf
6.html
6.json
6.pdf
...
vendor/
${volume_id}.tar # compression?
${volume_id}.csv
${volume_id}.tar.sha256
unredacted/
[same as above, but more secret]
misc/
[stuff from https://case.law/download/]
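
As a rough illustration of how a consumer might address files in this proposal, here is a minimal Python sketch. It assumes the nesting implied by the comments in the listing (Reporter Folder inside `redacted/`, Volume Folder inside the Reporter Folder, `case/` inside the Volume Folder). The function name `case_paths` and its parameters are hypothetical and not part of the proposal.

```python
from pathlib import PurePosixPath


def case_paths(reporter_id: str, volume_id: str, first_page: int,
               root: str = "redacted") -> dict:
    """Build the per-case file paths for one case.

    Assumption: files live under <root>/<reporter_id>/<volume_id>/case/
    and are named after the page the case starts on.
    """
    case_dir = PurePosixPath(root) / reporter_id / volume_id / "case"
    # One file per format listed in the proposal: html, json, pdf.
    return {ext: case_dir / f"{first_page}.{ext}" for ext in ("html", "json", "pdf")}


paths = case_paths("pa-d-c", "6", 1)
print(paths["html"])
```

Passing `root="unredacted"` would address the parallel tree, since that tree is described as having the same shape.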
kilbergr commented
@jcushman what do you think of the above world order?
It's a slight variation on your proposal.
jcushman commented
This makes sense to me.
- How do you want to handle cases with the same start page? could be `1-01.html`, `1-02.html` for example. This would happen because a case is shorter than a page, and in some cases there can be a bunch on the same page if we digitized a list of cases on a page as separate cases.
- This discussion may have happened elsewhere, but I'll flag that I'm on the fence about keeping individual case PDFs, since they're large and redundant and infrequently used. The alternative would be to just have something in the case metadata json like `"pdf_url": "../Volume.pdf#page=<page_index>"`.
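
Both suggestions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every case gets a two-digit sequence suffix (even when it is the only case starting on its page), cases arrive in reading order, and `page_index` is passed through unchanged. The helper names `case_file_stems` and `volume_pdf_url` are hypothetical.

```python
from collections import defaultdict


def case_file_stems(first_pages):
    """Return one file stem per case, in input order, as '<page>-<NN>'.

    Assumption: NN counts cases that share a start page, starting at 01.
    """
    seen = defaultdict(int)
    stems = []
    for page in first_pages:
        seen[page] += 1
        stems.append(f"{page}-{seen[page]:02d}")
    return stems


def volume_pdf_url(page_index):
    """Relative link from a file in case/ to a page of the volume PDF."""
    return f"../Volume.pdf#page={page_index}"


# Two cases start on page 1, one on page 6.
for stem in case_file_stems([1, 1, 6]):
    print(stem + ".html")
print(volume_pdf_url(12))
```

The relative `../Volume.pdf` only resolves correctly if the metadata file sits one level below the Volume Folder, as the `case/` directory does in the listing.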
kilbergr commented
Altered:
redacted/
Reporters.json
Volumes.json
${reporter_id}/ # aka Reporter Folder; e.g. "pa-d-c"; shortcode already in case.law urls
Metadata.json
Volumes.json
Cases.jsonl
${volume_id}/ # aka Volume Folder; e.g. "6"; already in case.law urls
Metadata.json
Volume.pdf
Cases.jsonl
case/
1.html # file names named after page case starts on; similar to case.law urls
1.json
6.html
6.json
...
vendor/
${volume_id}.tar # compression?
${volume_id}.csv
${volume_id}.tar.sha256
unredacted/
[same as above, but more secret]
misc/
[stuff from https://case.law/download/]
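
The `.tar.sha256` file next to each vendor tarball suggests an integrity check. A minimal sketch of that check in Python follows. It assumes the sidecar file holds the hex digest as its first whitespace-separated token (the format `sha256sum` writes); the thread does not specify this. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large tarballs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tar_matches_sidecar(tar_path):
    """Compare <name>.tar against the digest stored in <name>.tar.sha256.

    Assumption: the digest is the first token in the sidecar file.
    """
    tar_path = Path(tar_path)
    sidecar = tar_path.with_name(tar_path.name + ".sha256")
    expected = sidecar.read_text().split()[0].lower()
    return sha256_of(tar_path) == expected
```

A call such as `tar_matches_sidecar("vendor/6.tar")` would then return `True` only when the stored and computed digests agree.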