size of runs
We need CAX to calculate the size of each run (in GB) and fill a column of the runs DB with this information. This is very important for computing, in order to estimate how much space is needed to transfer entire campaigns in the future. Dan Coderre asked to be contacted so that he can create the proper column in the DB. (Someone suggested that a good moment to do this would be when the checksum is computed...)
On Mar 15, 2017, at 11:43 AM, daniel.coderre@lhep.unibe.ch wrote:
Hi Luca, others,
I updated the runs DB with a new field ‘size’ contained now in the raw data subdocuments. To be more specific, I’m referring to the entries in the ‘data’ array that track locations of the data. Every run in the csv Luca sent (with the instructions he sent) was updated with a new field called size for all data subdocuments of type ‘raw’. The size is in bytes and should be the file size. All entries of type ‘raw’ were updated for each run.
This probably sounds confusing but it’s quite simple. I put an example here: https://gist.github.com/coderdj/4aa92d1d57a43137cf14d0861476dca8
The example shows how to (a) get the size for a single run given its number and (b) get the total size for all runs on a given host (here midway). Note that the total size seems to deviate from what I get when I just du /project/…raw (and project2). I don’t know if the deviation is because I’m missing size entries for some runs, if some runs are stored twice, if some runs are stored but not tracked in the runs DB, or if there’s a bug in my script. I can check that if needed.
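As a rough sketch of the two lookups described above (not the code from the gist), the functions below work on run documents that have already been fetched as plain dictionaries. The `data`, `type` and `size` fields come from the description above; the `host` and `number` field names and the sample values are assumptions made for illustration. Counting the entries without a `size` is one way to check the first of the possible causes of the deviation.

```python
def raw_entries(run_doc, host=None):
    """Yield the 'raw' data subdocuments of one run, optionally for one host."""
    for entry in run_doc.get('data', []):
        if entry.get('type') != 'raw':
            continue
        # 'host' is an assumed field name for where the copy is stored.
        if host is not None and entry.get('host') != host:
            continue
        yield entry


def run_size_bytes(run_doc, host=None):
    """Size in bytes of the first raw entry that has one, else None."""
    for entry in raw_entries(run_doc, host):
        if 'size' in entry:
            return entry['size']
    return None


def total_size_on_host(run_docs, host):
    """Return (total_bytes, n_entries_without_size) for raw data on `host`."""
    total, missing = 0, 0
    for doc in run_docs:
        for entry in raw_entries(doc, host):
            if 'size' in entry:
                total += entry['size']
            else:
                missing += 1
    return total, missing


# Made-up documents, only to show the expected shape.
example_runs = [
    {'number': 1, 'data': [{'type': 'raw', 'host': 'midway', 'size': 1000},
                           {'type': 'processed', 'host': 'midway'}]},
    {'number': 2, 'data': [{'type': 'raw', 'host': 'midway'}]},
]

total, missing = total_size_on_host(example_runs, 'midway')
print(total, missing)
```

Note that a run with two raw entries on the same host is counted twice here, which would be one way for the sum to exceed what is actually on disk.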
Also, this new field is probably of limited use if it isn’t being added to new data so we should get it into cax asap.
Ciao,
Dan
@coderdj Ciao Dan,
I'm trying to understand how to implement the "size" entry in the database, starting from your example.
First of all, where do we want to put it?
Luca suggests the checksum task, which also seems like a good place to me.
We can calculate the total size of the raw data and put the final value for the update here:
https://github.com/XENON1T/cax/blob/master/cax/tasks/checksum.py#L95
Do you agree?
But one thing is not clear to me: how do you calculate the size of the raw data? Of course it will be the sum of all *.zip files, but I see from your code that you can print doc['data'][0]['size'].
Is there something already implemented?
https://gist.github.com/coderdj/4aa92d1d57a43137cf14d0861476dca8#file-checksizes-py-L14
I see in the Python manual that the os library has the function os.stat('filename.dat').st_size, which should provide all this information, right?
Maybe I can write a small function that calculates this value.
For the moment I'm trying to run your example, but it gives me this error:
pymongo.errors.ServerSelectionTimeoutError: gw:27017: [Errno -2] Name or service not known
Do I need some particular permission for my account?
Hi Francesco,
> We can calculate the total size of the raw data and put the final value for the update here:
> https://github.com/XENON1T/cax/blob/master/cax/tasks/checksum.py#L95
> Do you agree?
I don't understand the link. The specific line you link is only invoked in case of a checksum error, right? So exactly that spot would not be the place to put this. In general, the initial checksum creation is probably the best place for the size calculation.
> Is there something already implemented?
Luca sent me a list of runs and their sizes in csv. All I did was put this information into the run docs. It was not for all runs and some of the sizes were only precise to two decimals (in units of GB, but I put them in the runs DB in bytes). I think if we don't have every file in there this may be of limited use, but I don't know how to access the rest of the runs.
> I see in the Python manual that the os library has the function os.stat('filename.dat').st_size, which should provide all this information, right?
You want the size of the directory calculated recursively. You'll probably need a small function to do it. You can always look online; there are a lot of examples of how to do this.
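One possible version of such a small function, as a sketch using only the standard library (`os.walk` plus `os.path.getsize`); the function name is made up for illustration. It sums apparent file sizes in bytes, which can differ from what `du` prints, since `du` reports allocated disk blocks by default.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes in bytes of all regular files below `path`, recursively."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            # Skip symbolic links so a linked file is not counted twice.
            if not os.path.islink(full_path):
                total += os.path.getsize(full_path)
    return total
```

For a run stored as a directory of *.zip files, calling `directory_size_bytes` on that directory would give the value to store in the `size` field.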
> For the moment I'm trying to run your example, but it gives me this error:
> pymongo.errors.ServerSelectionTimeoutError: gw:27017: [Errno -2] Name or service not known
Well, the name or service "gw:27017" is not known to your PC. That's the runs DB address from within the DAQ. Just replace it with whatever address you usually use for connecting to the runs database.
ciao,
Dan
@lucrlom You should make a new AddFilesize class (similarly to what you did in #85).
Dan's example script shows what field you need to fill (data.size). @coderdj can you provide an example snippet for adding this info into the DB (since we're not using a test DB for these developments)? (And we're still not using the API; I'm not really sure I understand everything well enough to merge #38, especially since it's been so long and many things have changed.)
This task should of course eventually be running on xe1t-datamanager since that's the first place that cax is run on raw data.
@pdeperio @lucrlom are you waiting for me to comment on this? There are tons of example snippets in cax already, right? For example, the code used to add a checksum is analogous: https://github.com/XENON1T/cax/blob/master/cax/tasks/checksum.py#L97
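For reference, a hedged sketch of what an analogous update for the size could look like. This is not the cax code: the function name is hypothetical, `collection` is assumed to be a pymongo collection for the runs DB, and the `host` and `location` keys used to pick out one entry of the `data` array are assumed field names. It relies on `update_one` and the MongoDB positional `$` operator to set `size` only on the matched entry.

```python
def add_size(collection, run_id, data_doc, size_bytes):
    """Set 'size' on the one entry of the 'data' array matching `data_doc`."""
    query = {
        '_id': run_id,
        # Match a single array element; the key names are assumptions.
        'data': {'$elemMatch': {'host': data_doc['host'],
                                'type': data_doc['type'],
                                'location': data_doc['location']}},
    }
    # 'data.$.size' updates only the element matched by the query above.
    update = {'$set': {'data.$.size': int(size_bytes)}}
    return collection.update_one(query, update)
```

Since there is no test DB for these developments, it may be safer to try this against a copy of a single run document before pointing it at the real collection.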