nomad-coe/nomad

Problems in file processing

JohannesUniUlm opened this issue · 10 comments

Dear NOMAD team,

I encountered several problems with the attached test upload:

  • There seem to be some issues with the names. The directory where I store the data is in this case called 100_1_cubic_hs and the files inside have the typical VASP names INCAR, OUTCAR, KPOINTS, etc. Unfortunately, the parser takes cubic_hs/OUTCAR_1_100 as the main file, which does not exist.
  • The process status is FAILURE, perhaps as a consequence of the previous point.
  • I included a JSON file in the upload, but unfortunately it is not read by the parser. At least, the entry shows "no datasets" in the corresponding column.

I have been trying to solve this issue for quite a while now, but I have not succeeded. Do you have any advice? I would like to upload several thousand calculations soon, all of which have names, folder structures, etc. similar to the test upload.

Thank you very much for your help!

Best,

Johannes

I am afraid I cannot upload the file. Why doesn't GitHub accept .tgz files?

@JohannesUniUlm: Thanks for your report. Here is the upload in question:

100_1_cubic_hs.zip

It seems to be a combination of at least two things:

  1. The VASP parser has trouble parsing some of the methodology (this should be fairly easy to fix).
  2. The name of the mainfile is incorrect or incorrectly displayed.

I will have a look at this and report back to you. Once a solution is provided, I will update our beta deployment. You can then proceed with the upload there. The data, datasets and DOIs from the beta are all valid and will be available in the production version later as well.

Dear @JohannesUniUlm,

I have now deployed a fix to our beta site at https://nomad-lab.eu/prod/v1/staging/gui/. It fixes the problem with displaying the mainfile name and the parsing issue with your VASP data. Could you try it out and report back here?

Unfortunately, we do not seem to be properly handling the nomad.json/nomad.yaml files, as the coauthors and datasets do not get correctly updated. Comments and references do seem to work. I will make this into a separate issue. While we work on this, I would suggest using the GUI to add the datasets and coauthors to your upload:

  • Adding coauthors can be done with the "Manage upload members"-button on the top right of the upload interface.
  • Assigning a dataset can be done with the "Edit author metadata of all X entries"-button in the upload interface.

Dear Lauri,

the mainfile of my upload is now correctly recognized by the beta site. The upload also appears in the release version, which displays "Process status: Success" for it, and all data is correctly extracted.

How shall I proceed with my uploads?

Thank you very much!

Best,
Johannes

TLCFEM commented

Now you can 1) use the frontend to browse the processed data, and 2) use the provided API to query the processed data and feed it into subsequent data analyses.
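For the API route, a minimal sketch of such a query with curl (the base URL, endpoint and request body schema are assumptions on my part; please check the API documentation for the authoritative version):

# query all entries belonging to one upload (placeholder <your_upload_id>)
curl -X POST "https://nomad-lab.eu/prod/v1/api/v1/entries/query" \
     -H "Content-Type: application/json" \
     -d '{"query": {"upload_id": "<your_upload_id>"}, "pagination": {"page_size": 10}}'

The response contains the processed metadata of the matching entries, which you can then post-process in a script of your choice.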

@JohannesUniUlm: Here are some steps to consider:

  1. If you are planning to e.g. submit your data for publication, first try uploading all of it to the beta site and check that it gets processed correctly. Do not, however, press the "Publish" button yet.
  2. Add all comments, references and authors.
  3. Add a dataset if you want. Only datasets can be assigned a DOI.
  4. Create a DOI for your dataset if you wish. This is done by navigating with the top menu to: "Publish/Datasets".
  5. Once everything is good to go, you can publish your data at the bottom of the upload page. Note that this will permanently fix your data: you will not be able to modify it later. You can, however, create e.g. new datasets that supersede your old data if you realize a mistake or want to add new data that was missing.

Dear Lauri,

thank you very much for your help. It seems like my data is now correctly processed by the beta version. Now I am facing a problem concerning the amount of data. I would like to

i) upload several thousand individual calculations stored in individual .tgz files,
ii) assign a dataset and other metadata to all files,
iii) publish the files all together.

I am obviously not going to do the upload manually. I used a bash script with curl, but I get an error message after the 10th file, since this is the maximum number of unpublished uploads I am allowed to have. What would you suggest? Can't I use curl to add several files to one upload?

Best,
Johannes

Dear Johannes,

You can bundle several files together by compressing them into a single .zip or .tar archive. This will also help with the upload process, as the upload size will be smaller. E.g. to zip one or more folders/files, run the following command:

zip -r <zip_filename> <filepath1> <filepath2>

Then you can upload the resulting zip file using the curl command. Our processing will automatically unzip the contents to your upload. There are also ways to add files to an existing upload, you can see more details in our API documentation.
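For illustration, a minimal sketch of the whole sequence with curl (the folder names, the upload_name parameter and the token handling are placeholders/assumptions on my part; the API documentation has the exact endpoint and parameters):

# bundle several calculation folders into one archive (placeholder folder names)
zip -r calculations_batch_01.zip 100_1_cubic_hs 100_2_cubic_hs

# upload the archive as a new upload; <your_access_token> is obtained via the API or GUI
curl -X POST "https://nomad-lab.eu/prod/v1/api/v1/uploads?upload_name=batch_01" \
     -H "Authorization: Bearer <your_access_token>" \
     -T calculations_batch_01.zip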

The size of a single upload is limited to 32 GB. Maybe you can even fit all of your calculations into a single .zip file? If not, you can break it into several .zip files of up to 32 GB each and upload them individually. If you still hit the limit of 10 unpublished uploads (meaning that you have more than 320 GB of data to upload), it is possible to publish some of the earlier uploads, but in that case we should probably discuss an alternative solution for the transfer.

Dear Johannes,

There is a way: you can upload individual files to a specific existing upload using the /uploads/{upload_id}/raw/{path} PUT API endpoint. You can find the documentation here.
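A minimal sketch of such a call with curl (the upload_id, the path inside the upload and the token are placeholders/assumptions; see the linked documentation for the exact parameters):

# add a single file to an existing, still unpublished upload
curl -X PUT "https://nomad-lab.eu/prod/v1/api/v1/uploads/<upload_id>/raw/<path_inside_upload>" \
     -H "Authorization: Bearer <your_access_token>" \
     -T 100_2_cubic_hs.tgz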