Importing information about job input files
Introduction
For simulating caching, we need information about the input files a job requires, plus further metadata such as the size of those input files.
We currently have a working solution, which is great!
The information about input files is attached to a modified HTCondor job export in JSON format.
For the future, I would like to change the approach a bit.
Issue
I do see three issues here:
- Manual modification of an export is required to add the information about input files,
- the resulting file is not a real HTCondor export anymore so the reading should not be in the specific HTCondor importer, and
- this method does not extend easily to the other import options for jobs.
Proposed solution
I would like to propose separating the information about the assignment of required input files by putting it into a separate file. This file should still be in csv format and should follow the specification that @tfesenbecker introduced, see #51 (comment). However, job-specific information like the requested number of cores or memory should be omitted from this file; only input file-specific information should be included. Further, we should introduce another field to reference a specific job. This would then be a kind of lapis-configuration file that could even include more information.
The exports from HTCondor currently don't contain a name or identifier for jobs. Question to the experts (@maxfischer2781, @tfesenbecker): is it possible to also export an identifier? The job input file should in any case contain the id of jobs to allow proper references.
Otherwise I would take the line count to create a job specifier: the job on the first line gets id `1`, the job on the next line gets id `2`, and so on. As several files can be imported, we could even extend this to the format `<input-nr>-<job-nr>`. That way we can even mix different formats, e.g. SWF and HTCondor.
When a job is skipped because of wrong parameters, I would still count it, so that we have a defined id for each of the following jobs.
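To illustrate the counting rule, here is a minimal Python sketch; the `read_jobs` and `is_valid` callables are hypothetical placeholders, not existing lapis functions. It assigns `<input-nr>-<job-nr>` identifiers and keeps counting skipped jobs:

```python
def assign_job_ids(import_files, read_jobs, is_valid):
    """Enumerate jobs across several import files.

    ``read_jobs`` yields raw job records from one file and ``is_valid``
    applies the usual parameter filtering; both are assumptions for this
    sketch. A skipped job still consumes its id, so all following jobs
    keep a well-defined identifier.
    """
    for input_nr, path in enumerate(import_files, start=1):
        for job_nr, job in enumerate(read_jobs(path), start=1):
            job_id = f"{input_nr}-{job_nr}"
            if not is_valid(job):
                continue  # the id is consumed, but the job is dropped
            yield job_id, job
```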
The header fields for the additional input file should be named: `JobID`, `URL`, `size`, and `used`.
See #56 (comment) for a complete writeup.
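To make the proposal concrete, such a file could look like the example below and be read with Python's `csv` module (delimiter, values, and file content are made up for illustration):

```python
import csv
import io

# Hypothetical content of a job input file; JobID 1 appears twice to
# attach two input files to the same job.
EXAMPLE = """\
JobID,URL,size,used
1,xrootd://a.root,25,20
1,xrootd://b.root,25,20
2,xrootd://a.root,25,20
"""

for row in csv.DictReader(io.StringIO(EXAMPLE)):
    print(row["JobID"], row["URL"], int(row["size"]), int(row["used"]))
```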
Call for discussion
Did I miss something obvious in the described proposal above? Does anyone have a better idea? Does this fit with the other configuration options we need to import, e.g. information about storage elements (see #53)?
Every feedback is welcome!
I agree that we should split the job input files the way you proposed.
For job identification we could use the `GlobalJobId` ClassAd. This is a string containing the scheduler's name, the job's ID in HTCondor, and the job's QDate, for example:
`bms2.etp.kit.edu#686288.24#1572861762`
This should uniquely identify jobs in the use cases that I can currently think of, and it contains the information we would need to look up the job's input file related ClassAds, e.g. in a script that creates the input file.
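For illustration, the three components can be recovered by splitting on `#` (a minimal sketch; the variable names are my own):

```python
global_job_id = "bms2.etp.kit.edu#686288.24#1572861762"
schedd_name, condor_job_id, q_date = global_job_id.split("#")
# schedd_name   -> "bms2.etp.kit.edu"  (the scheduler's name)
# condor_job_id -> "686288.24"         (cluster.proc in HTCondor)
# q_date        -> "1572861762"        (the job's QDate as a timestamp)
```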
Sounds like a good idea. The only thing we should still think about is whether this is actually good in terms of reusability. Do we want specifically tailored input file information per job, or should it be possible to reuse the same input file configuration for different exports?
I'm not sure what you mean by
> specifically tailored input file information per job.
We could use the `GlobalJobId` to match any kind of additional job information to standard job output. It would be especially useful if some of this information is extracted from the job's other ClassAds, because the JobId would allow us to track the job down in `condor_history` or elsewhere. Or did you mean something else?
I am thinking at a more abstract level. Imagine that you want to compare different types of jobs, e.g. HTC and HPC, that have the same input files. So I could imagine having the same input file configuration for jobs but two different exports of jobs that I use. You would then do two different simulations: 1) with the HTC import and the input file configuration, and 2) with the HPC import and the very same input file configuration. If the input file configuration utilised the `GlobalJobId`, I could only use one specific HTCondor export and it wouldn't match any other export.
Just thinking and questioning aloud :)
Ok, in this case you are right. I don't think there is a reusable identifier that we could export from HTCondor, so we should do the identification you proposed, or something similar.
But based on your last comment we could consider whether we really want to add the job identifier into the input file configuration.
What do you think of putting an identifier on both the jobs and the input files and declaring the combinations in a separate file? In this case we could have richer identifiers like the `GlobalJobId` for jobs and something similar for the additional input, and the combination would be less dependent on whether or not an entry in any of the input files is removed.
So the question is whether to use a relative identifier or to have an additional mapping.
I currently tend towards relative identifiers as I prefer the usability of that approach. The other requires a) more files and b) even more configuration, resulting in more things that can go wrong.
@maxfischer2781: any opinion from your side?
Have I understood correctly that you would put the job's relative identifier into the additional input file?
> Have I understood correctly that you would put the job's relative identifier into the additional input file?
Yes! So the exported files from HTCondor etc. should / must stay untouched.
Maybe we should call this additional input file a lapis configuration file?
> Yes! So the exported files from HTCondor etc. should / must stay untouched.
I agree.
> Maybe we should call this additional input file a lapis configuration file?
To me this sounds more like a config file that defines lapis mechanisms, like which caching algorithm should be used or something similar, and not like a config file that contains additional information about what is passed to lapis. Or did you plan to include "technical" information in this file?
> To me this sounds more like a config file that defines lapis mechanisms, like which caching algorithm should be used or something similar, and not like a config file that contains additional information about what is passed to lapis. Or did you plan to include "technical" information in this file?
I meant it for the discussion here, to make clear what we are talking about :) How it looks in reality will probably evolve over time while we decide what to actually put in there. So maybe it is going to be a caching-related config/information file.
Sorry for the late reply. Some comments:
Jobs having a list of input files is totally fine for HTCondor. As with everything in HTC, however, they are optional and of variable format. We could make the parser ultra-configurable, but I don't see how that would be practical. So I'm in favour of having some plugin/annotation mechanism. ✅
Being able to annotate multiple data sets (e.g. HPC + HTC) seems interesting, but very complicated, and we will probably not use it much. So I'd be in favour of focusing on a less powerful but easier to use/implement plugin mechanism for the time being. `GlobalJobId` and file URLs should work well for that. If the plugin mechanism is pluggable (heh...) we could still switch it out for something more powerful later on.
@maxfischer2781, could you please review the description of this ticket to check whether I remembered the decisions we made last week correctly? Thanks in advance :)
Rough summary of what we talked about offline:
Every job has a job identifier. This MAY be externally specified; otherwise, the importer MUST derive an identifier by enumerating all jobs starting at 0.
The `job-file` including the plain resource format of whatever we currently have -- CPU, Memory, Disk, Walltime, QueueTime -- stays untouched for the time being EXCEPT for the addition of the job identifier. We MAY review the plain resource format at a later time to be more generic, but it is not required now.
# JobID QTime CPUS ...
J0 1567155456 1 ...
J1 1567155456 1 ...
Information on input file usage belongs to separate annotation files that SHALL use the job identifier to extend the respective jobs. The job identifier MUST NOT be optional in these files, and is used to identify jobs from the main `job-file`. A job identifier MAY be used multiple times to annotate the same job with several items.
# JobID URL size used
J0 xrootd://a.root 25 20
J0 xrootd://b.root 25 20
J0 xrootd://c.root 25 20
J1 xrootd://a.root 25 20
J1 xrootd://b.root 25 20
J1 xrootd://c.root 25 20
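A minimal sketch of how an importer might consume such an annotation file, assuming whitespace-separated columns with a `#`-prefixed header line as in the example above (the function name is illustrative, not existing lapis code):

```python
from collections import defaultdict

def read_annotations(path):
    """Group the input-file records of an annotation file by job identifier.

    A repeated JobID simply appends another input file to the same job,
    matching the MAY-be-used-multiple-times rule above.
    """
    inputs_by_job = defaultdict(list)
    with open(path) as annotation_file:
        header = next(annotation_file).lstrip("#").split()  # JobID URL size used
        for line in annotation_file:
            if not line.strip():
                continue
            record = dict(zip(header, line.split()))
            inputs_by_job[record.pop("JobID")].append(record)
    return inputs_by_job

# e.g. inputs_by_job["J0"] ==
#   [{"URL": "xrootd://a.root", "size": "25", "used": "20"}, ...]
```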
The description and the last comment contain what we talked about offline and describe well what we should do. The only thing I'm unsure about is the choice of job identifier. I remember that we agreed that the HTCondor `GlobalJobId` should work, but we can't know whether we will always have this information. Wouldn't it be worthwhile working with a custom job identifier from the start?
> [...] Wouldn't it be worthwhile working with a custom job identifier from the start?
I think the summary in comment #56 (comment) suits well. If no `job_id` is given, the id defaults to the enumeration count of the job. This works even for filtered-out jobs, as the specific enumeration count is just skipped. Whenever a user needs an explicit assignment, she could either use `GlobalJobId` or a manually assigned id. However, the default enumeration count will work either way.
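As a sketch of that default (the `job_id` field name is an assumption):

```python
def job_identifier(record, enumeration_count):
    # An explicit id (e.g. GlobalJobId or a manually assigned one) wins;
    # otherwise fall back to the enumeration count, which is consumed
    # even by jobs that are filtered out later.
    return record.get("job_id") or str(enumeration_count)
```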