Assemble NCEI image archiving functional requirements
This is the initial draft of the workflow; it is subject to change.
sequenceDiagram
autonumber
actor U as User
participant F as FathomNet
participant M as MSU
participant N as NCEI
U->>F: Upload ZIP of images + CSV via HTTP
F-->>U: Ack/200
rect rgb(150, 114, 114)
Note left of F: MBARI
F-)+F: Repackage using NCEI naming conventions
F-)F: Extract CSV
F-)F: Update image names/locations to NCEI name, MSU location
rect rgb(114, 114, 150)
Note left of M: MSU
F->>+M: Upload (FTP, HTTP?)
M-)-M: Unzip images at standard location and provide web access
end
F-)-F: Use CSV to register images
end
rect rgb(114, 150, 114)
Note left of M: NCEI
M->>N: At 6 months migrate to NCEI
N-)N: Unpack at standard location
N->>F: Notify FathomNet of the location change?
end
F-)F: Update image URLs to new location
Updated sequence diagram based on MSU's proposal to poll for zip files rather than have them pushed.
sequenceDiagram
autonumber
actor U as User
participant F as FathomNet
participant M as MSU
participant N as NCEI
U->>F: Upload ZIP of images + CSV via HTTP
F-->>U: Ack/200
rect rgb(150, 114, 114)
Note left of F: MBARI
F-)+F: Extract CSV
F-)F: Update image names/locations to NCEI name, MSU location
F-)F: Repackage using NCEI naming conventions
F-)-F: Stage zip file of images/csv to https://fathomnet.org/static/...
rect rgb(114, 114, 150)
Note left of M: MSU
loop Every Day?
Note left of M: This would require us to enable directory listing. Do we want that?
M-)+F: Scan for new zip files
end
M->>F: Download new zip via HTTP
F-->>-M: <zip>
M-)+M: Unzip images and CSV at standard location and provide web access
M-)-F: Send email notification with unzipped location?
end
rect rgb(75, 57, 57)
loop Every Day?
F-)F: Poll for emails periodically
end
F-)F: On new email, extract location of new directory
F-)F: Extract location of CSV in new directory
F-)F: Use CSV to register images
F-)F: Delete local zip file from https://fathomnet.org/static/...
F-)U: Send email that images are registered
end
end
rect rgb(114, 150, 114)
Note left of M: NCEI
M->>N: At 6 months migrate to NCEI
N-)N: Unpack at standard location
N->>F: Notify FathomNet of the location change?
end
F-)F: Update image URLs to new location
Things we'd have to do on the FathomNet side for this:
- Set up an email account to receive the emails.
- Formalize the email content. I would guess NCEI would use the same mechanism to notify us when files are migrated from MSU to NCEI, so we need a way to differentiate the two kinds of email: (a) initial hosting at MSU and (b) migration from MSU to NCEI.
- Set up a service to poll for emails and process their contents.
- Enable directory listing in the web server for the staging location of the zip files.
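The "scan for new zip files" step depends on directory listing being enabled for the staging location. As a rough sketch of what that scan could look like, the snippet below pulls `.zip` links out of a listing page using only the Python standard library. The function and class names are hypothetical, and it assumes the web server returns a plain HTML index with one `<a href>` per file; fetching the page itself is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ZipLinkParser(HTMLParser):
    """Collect href values ending in .zip from a directory-listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".zip"):
                self.hrefs.append(value)


def find_new_zips(listing_html, base_url, already_seen):
    """Return absolute URLs of zip files that are not in already_seen."""
    parser = ZipLinkParser()
    parser.feed(listing_html)
    urls = [urljoin(base_url, href) for href in parser.hrefs]
    return [url for url in urls if url not in already_seen]
```

The caller would keep `already_seen` (for example, a set of URLs persisted between daily runs) so that each package is downloaded only once.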
Email from David Moffitt on 2023-12-01:
I've gotten the email notifications working with a simulated smtp server. I'm putting in a ticket with MSU so I can start testing it with the actual smtp server and have it set up as a cron job. Currently the emails only have a list of the files downloaded and the file size, what other information would be good to have in the notifications?
My response to David's email:
The entire workflow and handshake between FathomNet and MSU is described in a sequence diagram at #136 (comment).
Currently the emails only have a list of the files downloaded and the file size, what other information would be good to have in the notifications?
Ideally, these are the things I would like in the email:
- The URL to the original file fetched from https://fathomnet.org/static/staging/
- The URL to the unzipped root directory of that file on MSU servers.
- If directory listing is enabled on MSU’s server, the URL of the unzipped root directory is enough; we can just scrape the directory listing for the files that were in the zip file. If directory listing is not enabled, the email should contain the full URL to every file that was extracted from the zip file.
- The text body of the email should contain the date/time when the file was extracted.
It would be ideal if the email body is easily parsable by automated code. Example email body with directory listing enabled:
description: MSU file transfer from FathomNet
timestamp: 2023-12-07T01:23:45Z
source: https://fathomnet.org/static/staging/FN2309-small.zip
destination: https://msu.server.edu/path/to/FN2309-small/
Example email body if directory listing is not enabled:
description: MSU file transfer from FathomNet
timestamp: 2023-12-07T01:23:45Z
source: https://fathomnet.org/static/staging/FN2309-small.zip
destination: https://msu.server.edu/path/to/FN2309-small/
files:
- https://msu.server.edu/path/to/FN2309-small/FN2309_355922--fb7616ae-38b0-45b5-883b-3d18ab7121cd.png
- https://msu.server.edu/path/to/FN2309-small/FN2309_414280--704a79af-98ce-4c65-95a8-7273bd3dbaed.png
- https://msu.server.edu/path/to/FN2309-small/FN2309_550917--64849cc7-2e3f-4eb1-93be-5f016aa540a2.png
- https://msu.server.edu/path/to/FN2309-small/foobar.csv
Let me know if you think I’m missing anything. Thanks!
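To illustrate what "easily parsable" could mean for the example bodies above, here is a minimal sketch of a parser for that `key: value` layout, including the optional `files:` list. The function name is hypothetical, and the field names are only those proposed in the examples; nothing here is an agreed format.

```python
def parse_notification(body):
    """Parse 'key: value' lines plus an optional '- item' list after 'files:'."""
    result = {}
    current_list = None  # list being filled by '- item' lines, if any
    for raw in body.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            # List items only make sense after a key with an empty value.
            if current_list is not None:
                current_list.append(line[2:].strip())
            continue
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # ignore lines that are not key/value pairs
        key, value = key.strip(), value.strip()
        if value:
            result[key] = value
            current_list = None
        else:
            current_list = result.setdefault(key, [])
    return result
```

With directory listing enabled, the result simply has no `files` key, and the caller would fall back to scraping the `destination` URL.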
Response from @errol-ronje:
[...] Yee Lau is now standing by to complete the FathomNet automation. Yee and I met last week and came up with a few questions for clarification to help move this forward. Please check our notes below for accuracy and let us know the answers to our questions. We may also have some follow-up questions, since so much time has passed, as we try to get up to speed and back on this project:
Notes
- NCEI script will do a daily check/transfer of fathomnet server
- File name convention for each package: FNYYMM where YY is the 2-digit year and MM is the 2-digit month.
- Package should be unzipped in https://oer.hpc.msstate.edu/FathomNet/
- NCEI script will move MSU data from the MSU fathomnet directory to NCEI for archiving prep
Questions:
- Can we delete the test package on https://oer.hpc.msstate.edu/FathomNet/20230427_test_package/
- What are the other images currently in the FathomNet directory? Can we delete ? https://oer.hpc.msstate.edu/FathomNet/ (e.g.,Acanthogorgiidae001_trimmed.png)
- FN2304-large and FN2309-small have already been transferred to https://fathomnet.org/static/staging/, is this the complete dataset that is ready for the first archive package?
My response:
Hi Errol and Yee,
I’m very excited that we’re moving forward! As a reminder, I keep notes related to this effort on GitHub at https://github.com/orgs/fathomnet/projects/7/views/1. The current, notional data flow is documented in a diagram at #136 (comment). Since nothing is currently set in stone, we can change this workflow as needed so that it works best for both FathomNet and NOAA.
My responses to your notes and questions ….
NOTES:
File name convention for each package: FNYYMM where YY is the 2-digit year and MM is the 2-digit month.
My understanding is that package names start with FNYYMM followed by a suffix, for example FN2304-small and FN2304-large, and that these will be extracted to directories on MSU servers with the same names. The extra characters are needed to avoid naming collisions between packages.
Package should be unzipped in https://oer.hpc.msstate.edu/FathomNet/
You will need to preserve the package name. So a package FN2304-small would be extracted into https://oer.hpc.msstate.edu/FathomNet/FN2304-small. Otherwise, we will have problems with name collisions between files.
Once the package is unzipped, it would be helpful if an email were sent to us (or some other notification; I still have to set up an email account for this). The contents of the email need to be structured so that they can be parsed by code. An example email is at #136 (comment). Again, nothing is set yet, so we can adapt this as needed.
NCEI script will move MSU data from the MSU fathomnet directory to NCEI for archiving prep
When the data is moved from MSU to NCEI, can you send us a notification via email?
QUESTIONS:
Can we delete the test package on https://oer.hpc.msstate.edu/FathomNet/20230427_test_package/
Yes! All images from FathomNet at MSU are just for testing purposes. It’s safe to remove any and all of them.
What are the other images currently in the FathomNet directory? Can we delete ? https://oer.hpc.msstate.edu/FathomNet/ (e.g.,Acanthogorgiidae001_trimmed.png)
Yes!
FN2304-large and FN2309-small have already been transferred to https://fathomnet.org/static/staging/, is this the complete dataset that is ready for the first archive package?
Those are just packages to use for testing and development and not meant to be permanently archived.
Please let me know if you have any other questions. Yee, I’m looking forward to working with you.
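The naming and extraction rules discussed in the reply above (FNYYMM plus a suffix, unzipped into a directory of the same name) can be sketched as a small validator. The exact suffix pattern is an assumption based only on the `-small` and `-large` examples, and the helper names are hypothetical.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# FN + 2-digit year + 2-digit month, then an optional "-suffix".
# The allowed suffix characters are an assumption, not an agreed rule.
PACKAGE_RE = re.compile(r"^FN\d{2}(0[1-9]|1[0-2])(-[A-Za-z0-9]+)?$")

MSU_ROOT = "https://oer.hpc.msstate.edu/FathomNet/"


def package_name_from_zip_url(zip_url):
    """Return the package name (zip file name without .zip) or raise ValueError."""
    stem = PurePosixPath(urlparse(zip_url).path).stem
    if not PACKAGE_RE.match(stem):
        raise ValueError(f"not a valid package name: {stem!r}")
    return stem


def extraction_url(zip_url, root=MSU_ROOT):
    """Build the directory URL where the package should be unzipped."""
    return f"{root.rstrip('/')}/{package_name_from_zip_url(zip_url)}/"
```

Keeping the package name in the extraction path is what prevents file-name collisions between packages.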
Errol sent this email:
Brian, please find notification of FathomNet files transferred below. Is this notification sufficient, and are the files organized as expected?
Subject: FathomNet Download List
description: MSU file transfer from FathomNet
timestamp: 2024-05-01T21:46:26Z
source: https://fathomnet.org/static/staging/FN2304-large.zip
target: https://oer.hpc.msstate.edu/FathomNet/staging/FN2304-large/
source: https://fathomnet.org/static/staging/FN2309-small.zip
target: https://oer.hpc.msstate.edu/FathomNet/staging/FN2309-small/
My reply:
It’s a good start, but can we tweak how the zip files are unpacked? The file might be a zip file of images OR it might be a zipped directory of images. If it’s the latter, the images get unpacked into a somewhat random directory. It would be much more useful if, after the file is unzipped, all the png or jpg images were moved into the correct staging directory. For example, unzipping FN2304-large results in the images being in a rather redundant path location: https://oer.hpc.msstate.edu/FathomNet/staging/FN2304-large/FN2304-large/. Ideally the images should be moved to https://oer.hpc.msstate.edu/FathomNet/staging/FN2304-large. The same goes for FN2309: the images are in https://oer.hpc.msstate.edu/FathomNet/staging/FN2309-small/FN2309/, but it would be better if they were relocated to https://oer.hpc.msstate.edu/FathomNet/staging/FN2309-small/
Let me know if that’s possible.
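For reference, the flattening requested above could look roughly like the sketch below. It is not MSU's actual script. It assumes the CSV should be hoisted along with the png/jpg images, and it raises an error rather than overwriting when two files share a name.

```python
import shutil
import zipfile
from pathlib import Path

# Assumption: the CSV is moved up together with the images.
HOIST_SUFFIXES = {".png", ".jpg", ".jpeg", ".csv"}


def extract_flat(zip_path, staging_dir):
    """Unzip into staging_dir/<package>/ and move nested files up to that directory."""
    zip_path = Path(zip_path)
    target = Path(staging_dir) / zip_path.stem
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)

    # Hoist matching files from any nested directory into the package root.
    for path in sorted(target.rglob("*")):
        if not path.is_file() or path.parent == target:
            continue
        if path.suffix.lower() not in HOIST_SUFFIXES:
            continue
        dest = target / path.name
        if dest.exists():
            raise FileExistsError(f"name collision while flattening: {dest}")
        shutil.move(str(path), str(dest))

    # Remove directories left empty, deepest first.
    for path in sorted(target.rglob("*"), reverse=True):
        if path.is_dir() and not any(path.iterdir()):
            path.rmdir()
    return target
```

A zip of bare images and a zipped directory of images both end up with the same layout, so the notification's target URL always points at the files directly.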
Sent this email to Yee, Errol and others:
Thanks Yee. That looks great.
To follow up on the email format: the format you have below is fine for how the web server is currently configured (with directory listing enabled). One note: we should standardize the email’s subject so it’s simple for automated code to watch for the emails. The subject itself doesn’t matter much to me. I’ll throw out “MSU file transfer from FathomNet” as a straw man, but if you have a preference, just let me know.
Cheers
—— EMAIL
description: MSU file transfer from FathomNet
timestamp: 2024-05-08T14:42:37Z
source: https://fathomnet.org/static/staging/FN2304-large.zip
target: https://oer.hpc.msstate.edu/FathomNet/staging/FN2304-large/
source: https://fathomnet.org/static/staging/FN2309-small.zip
target: https://oer.hpc.msstate.edu/FathomNet/staging/FN2309-small/
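A sketch of how the FathomNet side might watch for and read this email, assuming a single-part plain-text message and the straw-man subject proposed above. The function name is hypothetical, and retrieving the raw message from the mailbox is out of scope here.

```python
from email import message_from_string

# Straw-man subject from the thread; not yet agreed.
EXPECTED_SUBJECT = "MSU file transfer from FathomNet"


def parse_transfer_email(raw_message):
    """Return header fields and (source, target) pairs, or None if the subject differs."""
    msg = message_from_string(raw_message)
    if (msg["Subject"] or "").strip() != EXPECTED_SUBJECT:
        return None
    body = msg.get_payload()  # assumes a non-multipart, plain-text body
    info = {"transfers": []}
    pending_source = None
    for raw in body.splitlines():
        # Split on the first colon only, so URLs stay intact.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "source":
            pending_source = value
        elif key == "target":
            # Pair each target with the source line that preceded it.
            if pending_source is not None:
                info["transfers"].append((pending_source, value))
                pending_source = None
        else:
            info[key] = value
    return info
```

Each `(source, target)` pair gives the staged zip that can be deleted and the MSU directory to scan for the CSV.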
Latest updated workflow
sequenceDiagram
autonumber
actor U as User
participant F as FathomNet
participant M as MSU
participant N as NCEI
U->>F: Upload ZIP of images + CSV via HTTP
F-->>U: Ack/200
rect rgb(150, 114, 114)
Note left of F: MBARI - Repackage Zip file
F-)+F: Read zip
F-)F: Rename images using NCEI naming conventions
F-)F: Extract CSV and update image names
F-)F: Generate new zip
F-)-F: Stage to public archive
end
rect rgb(114, 114, 150)
Note left of M: MSU - Provide public access
M-)+F: Scan for new zip files in public archive
F->>-M: Fetch new zip files (HTTP)
M-)M: Unzip images at standard location and provide web access
end
rect rgb(150, 114, 114)
Note left of F: MBARI - Scan for new uploads
F-)+M: Scan for new FathomNet directories
M->>-F: Fetch new CSV
F-)F: Use CSV to register images
end
rect rgb(114, 150, 114)
Note left of M: NCEI
M->>N: At 6 months migrate to NCEI
N-)N: Unpack at standard location
N->>F: Notify FathomNet of the location change?
end
F-)F: Update image URLs to new location
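The final step, updating image URLs after the migration, amounts to swapping the base of each registered URL. A minimal sketch follows; the NCEI base URL in the usage note is a placeholder, since the real location is not stated anywhere in this thread.

```python
def migrate_url(url, old_base, new_base):
    """Rewrite url from old_base to new_base; leave unrelated URLs unchanged."""
    old = old_base.rstrip("/") + "/"
    if not url.startswith(old):
        return url
    return new_base.rstrip("/") + "/" + url[len(old):]
```

For example, `migrate_url(image_url, "https://oer.hpc.msstate.edu/FathomNet/", "https://ncei.example.gov/fathomnet/")` would keep the package directory and file name and change only the host and root path.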