feat: dump sent `old.json` after each successful upload
Hi @Jorropo,
Thank you so much for developing this package. I hope this is a good place to raise some suggestions/issues that I have.
One issue I ran into is that my filesystem (Lustre) does not support reflinks. The creation of cars is therefore quite slow. Because this process takes some time to finish for a dataset of ~1TB, especially when using the default estuary driver with its direct upload, I ran into issues with the job scheduler on my HPC cluster. The scheduler just stops the job after a certain time, before it has successfully finished.
Is there any way to restart the process without creating all cars again (and probably fail again due to the time limits)?
Thank you!
First, it is very likely the upload would not be faster even with reflinks.
The chunking of the files is pipelined with the upload, which means the throughput that matters is whichever is slower between chunking and uploading (except for the first 32GiB and last 32GiB, where the two stages cannot overlap).
Estuary is not fast (I get ~15MiB/s on France -> US uploads). Unless your disk is slower than this, the upload won't be faster because the chunker still has to wait for the data to be uploaded.
Reflinking is important if you have a fast remote server or use `-driver car`.
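Roughly, the pipeline looks like this (a minimal Go sketch of the idea only, not the actual linux2ipfs code; all names are made up):

```go
package main

import "fmt"

// Minimal sketch of the pipelining: the chunker and the uploader run
// concurrently, connected by a small buffered channel, so overall
// throughput is set by whichever stage is slower.
func main() {
	cars := make(chan []byte, 1) // at most one finished car waiting

	go func() { // chunking stage
		for i := 0; i < 4; i++ {
			cars <- makeCar(i) // blocks while the uploader lags behind
		}
		close(cars)
	}()

	for car := range cars { // upload stage
		upload(car)
	}
}

// makeCar stands in for chunking 32GiB of files into a .car.
func makeCar(i int) []byte { return []byte(fmt.Sprintf("car %d", i)) }

// upload stands in for the estuary upload.
func upload(car []byte) { fmt.Printf("uploaded %d bytes\n", len(car)) }
```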
Secondly, about your issue:
> The scheduler just stops the job after a certain time, before it has successfully finished.
I have two questions:
- What does "stops the job" mean? `kill -9`? (Because if it does that, imagine it being killed in the middle of a backup: you would just lose the `old.json` content, so not much use anyway.)
- And sorry if that's a dumb question, but can't you just make it not do that?
It would be possible to dump `old.json` after every successful upload if we attach a modtime per file instead of a global one for all of the content. A "snapshot" would then be done after every 32GiB (at least that is how big the car target is for estuary).
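The difference would look roughly like this (a hedged sketch; the field names are my assumptions, not the real `old.json` schema):

```go
package main

// Sketch of the idea only. Today, a single global modtime covers
// everything, so old.json is only valid once the whole run finished:
type oldGlobal struct {
	ModTime int64             `json:"modtime"` // one timestamp for everything
	Files   map[string]string `json:"files"`   // path -> cid
}

// With a modtime per file, a dump after every successful 32GiB upload
// only commits the files that were actually sent:
type savedFile struct {
	Cid     string `json:"cid"`
	ModTime int64  `json:"modtime"` // per-file, replaces the global one
}

type oldPerFile struct {
	Files map[string]savedFile `json:"files"` // path -> entry
}
```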
Having proper, more complex state recovery would be really hard with the current architecture; I would work on multithreaded traversal before working on that (if I ever work on it).
Thank you very much for your quick response. I experience similar upload limitations from Germany.
To your replies:
What does "stops the job" means ? kill -9 ?
Not necessarily; the job scheduler (SLURM) does allow sending any signal (e.g. SIGTERM) ahead of time before doing `kill -9`.
> Can't you just make it not do that?
The compute resources are shared among users, and in an attempt to make the usage fair, jobs are only allowed to run for a certain amount of time (up to 8 hours in my case).
> It would be possible to dump `old.json` after every successful upload if we attach a modtime per file instead of a global one for all of the content. A "snapshot" would then be done after every 32GiB (at least that is how big the car target is for estuary).
This seems like a good solution!
Currently, the creation of one 32GiB car takes about 10 minutes, so converting a 1TB file takes ~5h. Having intermediate "snapshots" would help reduce the risk of having to recreate the cars after a failure.
I'll probably work on this in the next few days.
Note to self: we cannot just dump `old.json`, because it would save files we haven't uploaded yet.
A solution to fix this is to double buffer the `old` mapping, so that what is inside the `.car` stays in accordance with `old.json`.
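Something like this (a rough sketch of the double buffering with made-up names, not the actual implementation):

```go
package main

// Entries for files chunked into the in-flight .car go into a pending
// map and are only promoted to the committed map (the one serialized to
// old.json) once the upload of that .car succeeds.
type recoveryState struct {
	committed map[string]int64 // safe to write out as old.json
	pending   map[string]int64 // inside the current .car, not yet uploaded
}

func (s *recoveryState) record(path string, modtime int64) {
	s.pending[path] = modtime // seen and chunked, but not safe to persist
}

func (s *recoveryState) uploadSucceeded() {
	for p, t := range s.pending { // promote everything from the last .car
		s.committed[p] = t
	}
	s.pending = map[string]int64{}
	// at this point s.committed can be dumped to old.json
}
```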
@observingClouds I have implemented this.
Can you please retry with the current master (0297e30)?
Performance might take a slight hit (since it's not very efficiently programmed), but it should be mostly fine since this is massively bottlenecked by the network anyway.
I might move that to yet another background job.
Thank you so much @Jorropo! It does seem to work. I was just tricked by the fact that the numbering of the output cars started again at 1 (`out.1.car`) after a restart, overwriting the cars of the first run. This is of course only an issue if you use the `car` driver and create local cars, not with the estuary driver.
> It does seem to work. I was just tricked by the fact that the numbering of the output cars started again at 1 after a restart, overwriting the cars of the first run.
Silently overwriting previous files is an issue; I'll fix it: #4
FYI, you can specify a pattern when using the car driver.
So you could do this (`%d` gets replaced by the current output car number):
`linux2ipfs -driver car-out.run.1.%d.car files`
`linux2ipfs -driver car-out.run.2.%d.car files`
But I'll just fix it so it logs something and skips to the next file.