harvard-lil/capstone

First foray into manipulating current S3 dir structure into static file structure

kilbergr opened this issue · 1 comments

Given the success of last week's spike, we will continue working toward a static file site that pulls from S3. The problem? Our current file structure in S3 is not the one we ultimately want.

We've experimented manually with creating the file structure we expect. Now we will do the same programmatically.

AC:

  • Either get access to current CAP resources in S3 OR create a fake setup with public buckets that Bex can access. @bensteinberg will determine what is appropriate here.
  • Create a transformation script that will:
    • move files and directories as required
      The following have been moved to a separate ticket:
      • Put HTML section of each case into a separate file
      • Remove \n and escape \ from HTML.
      • Add attributes for paragraphs and bounding boxes to HTML
  • Demonstrate this can work on a limited subset (could be the same amount as used in the last ticket).
  • Can be done in the language of choice.
  • Use the same file structure doc, although it may change.
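As a rough starting point for the "move files and directories" step, here is a minimal Python sketch that only plans the moves and does not touch S3. The current S3 layout is not described in this issue, so the key mapping is passed in as a function; `plan_moves` and `remap` are hypothetical names, not existing code. S3 has no rename operation, so each planned pair would later be carried out as a copy followed by a delete.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def plan_moves(
    keys: Iterable[str],
    remap: Callable[[str], Optional[str]],
) -> List[Tuple[str, str]]:
    """Return (source_key, target_key) pairs for every key that needs to move.

    `remap` is a caller-supplied (hypothetical) function that returns the new
    key for a source key, or None if the key should be left where it is.
    """
    plan: List[Tuple[str, str]] = []
    claimed: Dict[str, str] = {}  # target key -> source key that claimed it
    for key in keys:
        target = remap(key)
        if target is None or target == key:
            continue  # nothing to do for this key
        if target in claimed:
            # Two sources landing on one target would silently lose data.
            raise ValueError(
                f"{key!r} and {claimed[target]!r} both map to {target!r}"
            )
        claimed[target] = key
        plan.append((key, target))
    return plan
```

Planning first makes it easy to review the result on a limited subset before anything is copied or deleted.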

OK, so the pieces done up to this point are:

redacted/
    Reporters.json
    Volumes.json
    ${reporter_id}/ # aka Reporter Folder; e.g. "pa-d-c"; shortcode already in case.law urls
        Metadata.json
        Cases.jsonl
        Volume.pdf
        ${volume_id}/ # aka Volume Folder; e.g. "6"; already in case.law urls
            Metadata.json
            Cases.jsonl
            case/
                1.json # files named after the page the case starts on; similar to case.law urls
                6.json
                ...
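
For reference, the target keys in the tree above could be built with small helpers like these. This is only a sketch: the function names are made up for illustration, and `redacted` is used as the root simply because that is how the tree is written here.

```python
from pathlib import PurePosixPath

# Root prefix as written in the tree above; the real prefix is not given here.
ROOT = PurePosixPath("redacted")


def volume_dir(reporter_id: str, volume_id: str) -> PurePosixPath:
    """Volume Folder, e.g. redacted/pa-d-c/6."""
    return ROOT / reporter_id / volume_id


def volume_metadata_key(reporter_id: str, volume_id: str) -> str:
    """Key of the Metadata.json inside a Volume Folder."""
    return str(volume_dir(reporter_id, volume_id) / "Metadata.json")


def case_json_key(reporter_id: str, volume_id: str, first_page: int) -> str:
    """Key of a case file, named after the page the case starts on."""
    return str(volume_dir(reporter_id, volume_id) / "case" / f"{first_page}.json")
```

`PurePosixPath` keeps the separators as `/` on any platform, which matches how S3 keys are written.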

The pieces that remain are:

redacted/
    ${reporter_id}/ # aka Reporter Folder; e.g. "pa-d-c"; shortcode already in case.law urls
        Volumes.json
        ${volume_id}/ # aka Volume Folder; e.g. "6"; already in case.law urls
            case/
                1.html # files named after the page the case starts on; similar to case.law urls
                6.html
                ...
            vendor/
                ${volume_id}.tar # compression?
                ${volume_id}.csv
                ${volume_id}.tar.sha256
misc/
    [stuff from https://case.law/download/]
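
For the `${volume_id}.tar.sha256` entry under `vendor/`, a minimal sketch of writing the checksum file next to the tar might look like this. The sidecar format (hex digest, two spaces, file name, as `sha256sum` prints it) is an assumption, since the format is not specified here, and `write_sha256_sidecar` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def write_sha256_sidecar(tar_path: Path) -> Path:
    """Write <name>.sha256 next to tar_path and return the sidecar's path."""
    digest = hashlib.sha256()
    with tar_path.open("rb") as fh:
        # Read in 1 MiB chunks so large volume tars are not loaded into memory.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    sidecar = tar_path.with_name(tar_path.name + ".sha256")
    # Assumed format: "<hexdigest>  <filename>", as produced by sha256sum.
    sidecar.write_text(f"{digest.hexdigest()}  {tar_path.name}\n")
    return sidecar
```

With that format, running `sha256sum -c 6.tar.sha256` in the same directory would verify the tar.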