Jobs for ACDC wfs not being submitted to sites with RSEs ending with _Disk
Closed this issue · 5 comments
Impact of the bug
WMAgent, WorkQueue
Describe the bug
Jobs for an ACDC workflow that reads pileup locally (i.e. TrustPUSitelists: false) will not be submitted to a site if its phedex_name / storage name ends with _Disk.
Related to #12012
This issue is similar to the one above, which was recently fixed (Pull Request).
Now this issue is limited to ACDC workflows.
For Example:
This ACDC workflow was created after the fix was deployed to agents:
https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240917_112116_3408
However, its WorkQueue Pileup locations still include the storage name for the site.
How to reproduce it
Steps to reproduce the behavior:
- Create an ACDC workflow with TrustPUSitelists: false, assigned only to a site whose storage element name ends with _Disk
- The workflow will remain stuck in the acquired state and agents will not create jobs for it.
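The failure mode in the steps above can be illustrated with a minimal sketch. It assumes that site matching amounts to a plain set intersection between the site whitelist and the pileup locations whenever pileup is read locally. The function name `candidate_sites` and that matching rule are assumptions made for illustration, not the actual WMAgent or WorkQueue code.

```python
def candidate_sites(site_whitelist, pileup_locations, trust_pu_sitelists=False):
    """Return the sites where a job could run under the assumed matching rule."""
    whitelist = set(site_whitelist)
    if trust_pu_sitelists:
        # Pileup is read remotely, so its location does not restrict the sites.
        return whitelist
    # Pileup is read locally: the site must also appear in the pileup locations.
    return whitelist & set(pileup_locations)


# Storage name in the pileup locations: it never equals the computing site name.
print(candidate_sites(["T1_US_FNAL"], ["T1_US_FNAL_Disk", "T2_CH_CERN"]))  # set()

# Computing site name in the pileup locations: the site is matched.
print(candidate_sites(["T1_US_FNAL"], ["T1_US_FNAL", "T2_CH_CERN"]))  # {'T1_US_FNAL'}
```

Under this assumption, an exact string comparison between "T1_US_FNAL" and "T1_US_FNAL_Disk" leaves no candidate site, which would be consistent with the workflow staying in acquired.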
Expected behavior
Jobs should be properly submitted to all sites in the site whitelist even if the SE and CE names differ, e.g. "T1_US_FNAL_Disk" and "T1_US_FNAL".
WorkQueue Pileup Locations should contain computing site names instead of storage names for ACDC workflows.
Additional context and error message
These 2 non-ACDC workflows were created after the fix was deployed but also have incorrect WorkQueue Pileup Locations.
Thank you for reporting this, Ahmed.
Looking into the ACDC workflow/workqueue element under this link https://cmsweb.cern.ch/couchdb/workqueue/_design/WorkQueue/_rewrite/element/8d243a44de187913148d3c27a9efb3d4, I can extract the following relevant information:
    "Inputs": {
        "/acdc/cmsunified_ACDC0_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240917_112116_3408/:pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136:TSG-Phase2Spring24wmLHEGS-00001_0:TSG-Phase2Spring24DIGIRECOMiniAOD-00105_0:TSG-Phase2Spring24DIGIRECOMiniAOD-00105_1/0/8": [
            "T2_CH_CERN_HLT",
            "T2_CH_CERN_P5",
            "T2_CH_CERN"
        ]
    },
    ...
    "SiteWhitelist": [
        "T1_US_FNAL",
        "T2_CH_CERN",
        "T2_CH_CERN_P5"
    ],
    "SiteBlacklist": [],
    ...
    "PileupData": {
        "/MinBias_TuneCP5_14TeV-pythia8/Phase2Spring24GS-140X_mcRun4_realistic_v4-v1/GEN-SIM": [
            "T1_US_FNAL_Disk",
            "T2_CH_CERN"
        ]
    },
so it looks like the PileupData info above needs to be converted to PSN as well.
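As a rough illustration of that conversion, here is a minimal sketch that rewrites the PileupData locations from storage names (RSEs) to processing site names (PSNs). The `RSE_TO_PSNS` table and the `pileup_to_psns` helper are hypothetical and exist only for this example. In particular, the entries of the table are assumptions, and in a real deployment the mapping would come from the site information service rather than from a literal dictionary.

```python
# Hypothetical RSE -> PSN table, for illustration only (entries are assumptions).
RSE_TO_PSNS = {
    "T1_US_FNAL_Disk": ["T1_US_FNAL"],
    "T2_CH_CERN": ["T2_CH_CERN", "T2_CH_CERN_HLT", "T2_CH_CERN_P5"],
}


def pileup_to_psns(pileup_data, rse_to_psns):
    """Return a copy of pileup_data with every RSE replaced by its PSN(s)."""
    converted = {}
    for dataset, rses in pileup_data.items():
        psns = set()
        for rse in rses:
            # RSEs missing from the table are silently dropped in this sketch;
            # a real implementation would probably log or raise instead.
            psns.update(rse_to_psns.get(rse, []))
        converted[dataset] = sorted(psns)
    return converted


pileup_data = {
    "/MinBias_TuneCP5_14TeV-pythia8/Phase2Spring24GS-140X_mcRun4_realistic_v4-v1/GEN-SIM": [
        "T1_US_FNAL_Disk",
        "T2_CH_CERN",
    ]
}
print(pileup_to_psns(pileup_data, RSE_TO_PSNS))
```

With the assumed table, the pileup dataset ends up listed at "T1_US_FNAL" rather than "T1_US_FNAL_Disk", so it can be compared directly against the SiteWhitelist entries.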
In addition, if I look at the original ACDC collection under:
https://cmsweb.cern.ch/couchdb/acdcserver/_design/ACDC/_view/byCollectionName?key=%22pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136%22&include_docs=true&reduce=false
and search for that specific fileset named after:
"InitialTaskPath": "/pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136/TSG-Phase2Spring24wmLHEGS-00001_0/TSG-Phase2Spring24DIGIRECOMiniAOD-00105_0/TSG-Phase2Spring24DIGIRECOMiniAOD-00105_1",
here is one full document:
{'id': '5ed3a93702dc6448baee3f494f3dde32',
'key': 'pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136',
'value': {'_rev': '1-c0daf77c40aef82ff969da57b2808d5b',
'_id': '5ed3a93702dc6448baee3f494f3dde32'},
'doc': {'_id': '5ed3a93702dc6448baee3f494f3dde32',
'_rev': '1-c0daf77c40aef82ff969da57b2808d5b',
'collection_name': 'pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136',
'collection_type': 'ACDC.CollectionTypes.DataCollection',
'fileset_name': '/pdmvserv_task_TSG-Phase2Spring24wmLHEGS-00001__v1_T_240806_122806_6136/TSG-Phase2Spring24wmLHEGS-00001_0/TSG-Phase2Spring24DIGIRECOMiniAOD-00105_0/TSG-Phase2Spring24DIGIRECOMiniAOD-00105_1',
'files': {'/store/unmerged/Phase2Spring24DIGIRECOMiniAOD/TT_TuneCP5_14TeV-powheg-pythia8/GEN-SIM-DIGI-RAW/PU200_Trk1GeV_140X_mcRun4_realistic_v4-v2/2560000/615ee026-93b2-458e-9b25-11ea9f1981ec.root': {'last_event': 0,
'first_event': 0,
'lfn': '/store/unmerged/Phase2Spring24DIGIRECOMiniAOD/TT_TuneCP5_14TeV-powheg-pythia8/GEN-SIM-DIGI-RAW/PU200_Trk1GeV_140X_mcRun4_realistic_v4-v2/2560000/615ee026-93b2-458e-9b25-11ea9f1981ec.root',
'locations': ['T2_CH_CERN'],
'id': 18796449,
'checksums': {},
'events': 1000,
'merged': '0',
'size': 64881560795,
'runs': [{'run_number': 1, 'lumis': [71]}],
'parents': []}},
'acdc_version': 2,
'timestamp': 1725560053.592352}}
so we might also need to check the component that uploads these documents to the ACDCServer (I guess it is ErrorHandler), so that locations is consistent (I suppose it is really meant to be an RSE location...)
Having another look into this, I just noticed that the bug-fix that Kenyi made 10 days ago, #12094, was not deployed yet in production. This is the reason why we still don't have site names defined for PileupData in the workqueue elements.
About ErrorHandler, here is how we deal with the file locations:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/ErrorHandler/ErrorHandlerPoller.py#L206
and I don't think there is anything to be changed on this component, as it is indeed expected to be a list of locations/RSEs.
Sorry for the miscommunication on this, we applied the relevant fix to the agents, but not to central services.
I am moving it over to Waiting right now and we might end up closing this as "not planned" (not actually an issue).
@hassan11196 we have pushed in a hot-fix today for Global WorkQueue, version 2.3.5.1.
I scanned all of the Resubmission workflows in acquired status in production, but none of them are using pileup data that is available at FNAL, so I could not cross-check this fix.
If you create more ACDC workflows and/or know of any ACDC that requires pileup available at FNAL, can you please check that and/or let us know? Thanks
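One possible way to run such a cross-check is sketched below. It assumes the element fields are available as a plain dictionary shaped like the excerpt quoted earlier in this thread, with PileupData at the top level, and it uses the _Disk suffix as a simple heuristic for spotting storage names. The helper name `storage_like_pileup_locations` is hypothetical.

```python
def storage_like_pileup_locations(element, suffix="_Disk"):
    """Return {dataset: [locations]} for pileup locations ending with suffix."""
    flagged = {}
    for dataset, locations in element.get("PileupData", {}).items():
        # Heuristic only: a name ending with the suffix is treated as a storage name.
        suspicious = [loc for loc in locations if loc.endswith(suffix)]
        if suspicious:
            flagged[dataset] = suspicious
    return flagged


element = {
    "SiteWhitelist": ["T1_US_FNAL", "T2_CH_CERN", "T2_CH_CERN_P5"],
    "PileupData": {
        "/MinBias_TuneCP5_14TeV-pythia8/Phase2Spring24GS-140X_mcRun4_realistic_v4-v1/GEN-SIM": [
            "T1_US_FNAL_Disk",
            "T2_CH_CERN",
        ]
    },
}
print(storage_like_pileup_locations(element))
```

An empty result would suggest that the element no longer carries _Disk storage names in its pileup locations; it does not prove that every location is a valid computing site name.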
Hello @amaltaro
Thank you for the hotfix. I can confirm that it's working for ACDCs.
I created this ACDC request and its WorkQueue Pileup locations are as expected:
cmsunified_ACDC0_task_TSG-Phase2Spring24GS-00152__v1_T_240926_211603_3410
Thank you.
Awesome! Thank you for promptly looking and validating this.
With that, I am closing this issue as "Not Planned", as it has actually been fixed by another Issue/PR. Thanks again!