dmwm/WMCore

Files deleted by EOS after successful stageOut at T0


Impact of the bug
T0 WMAgent; probably the standard WMAgent as well

Describe the bug
This is a resurgence of #11998. We see jobs that successfully stage out their output to EOS, but the files are deleted shortly after. Capturing the exit code of xrdcp with #12058 did not help, as the command is successful, returning 0.

How to reproduce it
This seems to happen randomly; however, in the last 24h, T0 production has experienced 31 occurrences, including during replays.

Additional context and error message
Log of the successful stage-out of a file that was later deleted:

2024-08-28 07:39:29,312:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/data/Run2024G/TestEnablesEcalHcal/RAW/Express-v1/000/385/057/00000/448389f8-043b-4352-ab47-9c34e6d404f7.root
PFN: root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/TestEnablesEcalHcal/RAW/Express-v1/000/385/057/00000/448389f8-043b-4352-ab47-9c34e6d404f7.root?eos.app=cmst0

2024-08-28 07:39:29,312:INFO:StageOutImpl:Creating output directory...
2024-08-28 07:39:29,313:INFO:StageOutImpl:Running the stage out...
2024-08-28 07:41:27,626:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 48638573
xrdcp exit code: 0
Remote File Size is: 48638573
Local File Checksum is: fd4eb3b5
Remote File Checksum is: fd4eb3b5

stderr: 

@ccaffy @amadio could you please describe your hypothesis about the cause of this issue?

Here's what we observed on EOS side:

A first attempt to copy the file to EOS failed, and xrdcp retried internally via its write-recovery mechanism. The retry did not close the first attempt's connection and managed to successfully copy the file.
xrdcp therefore considered the transfer a success and closed all the connections. When the first connection was then closed, the file got deleted, because that attempt had failed to transfer the file properly.

Disabling write recovery in xrdcp by setting the environment variable XRD_WRITERECOVERY=0 will fix the issue.

@ccaffy could you provide logs where you see this happening? It would be good to have the EOS logs along with the WMAgent logs to get the full picture.

Sure, for the file /eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root:

log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data29/00073003/118c38d44&ruid=103031&rgid=1399&td=cmsplt01.5642:8236@b9g20p8382&host=st-096-ii2019bd.cern.ch&lid=1048850&fid=4710436164&fsid=33377&ots=1724871893&otms=106&cts=1724871994&ctms=473&nrc=1295&nwc=1255&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=1315175702&wb_min=261398&wb_max=1048576&wb_sigma=22211.51&sfwdb=878632239104&sbwdb=25165824&sxlfwdb=878632239104&sxlbwdb=25165824&nfwds=1296&nbwds=1&nxlfwds=1296&nxlbwds=1&usage=98.52&iot=101366.961&idt=1501.294&lrt=0.000&lrvt=0.000&lwt=87318.619&ot=7.571&ct=12539.477&rt=5213.35&rvt=0.00&wt=4691.41&osize=0&csize=1357118742&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data11/00073003/118c38d44&ruid=103031&rgid=1399&td=1.213123:1617@st-096-ii2019bd&host=st-096-100gb-ip306-28341.cern.ch&lid=1048850&fid=4710436164&fsid=35398&ots=1724871893&otms=112&cts=1724871994&ctms=472&nrc=1295&nwc=1255&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=878632239104&sbwdb=25165824&sxlfwdb=878632239104&sxlbwdb=25165824&nfwds=1296&nbwds=1&nxlfwds=1296&nxlbwds=1&usage=15.63&iot=101359.914&idt=85518.695&lrt=0.000&lrvt=0.000&lwt=8995.287&ot=0.558&ct=6845.374&rt=6362.97&rvt=0.00&wt=8501.17&osize=0&csize=1357118742&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data11/00073003/118c38d44&ruid=103031&rgid=1399&td=3.213123:2336@st-096-ii2019bd&host=st-096-100gb-ip306-28341.cern.ch&lid=1048850&fid=4710436164&fsid=35398&ots=1724871771&otms=690&cts=1724872036&ctms=999&nrc=0&nwc=49&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=0.12&iot=265308.750&idt=264987.750&lrt=0.000&lrvt=0.000&lwt=319.672&ot=0.617&ct=0.724&rt=0.00&rvt=0.00&wt=298.37&osize=0&csize=0&delete_on_close=1&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data29/00073003/118c38d44&ruid=103031&rgid=1399&td=cmsplt01.5642:2031@b9g20p8382&host=st-096-ii2019bd.cern.ch&lid=1048850&fid=4710436164&fsid=33377&ots=1724871771&otms=689&cts=1724872036&ctms=999&nrc=0&nwc=49&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=51380224&wb_min=1048576&wb_max=1048576&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=99.94&iot=265310.062&idt=161.072&lrt=0.000&lrvt=0.000&lwt=114506.528&ot=2.208&ct=150640.255&rt=0.00&rvt=0.00&wt=189.50&osize=0&csize=0&delete_on_close=1&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0

The fields of interest for you are ots (open timestamp), cts (close timestamp), td (connection identifier), idt (idle time, i.e. the time in ms during which nothing happened), and delete_on_close, which, when equal to 1, means the file will be deleted on close.
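As an illustration, these report lines are plain ampersand-separated key=value pairs, so they can be picked apart with a few lines of Python. The shortened sample line below is made up for the example; real report lines carry many more fields.

```python
from urllib.parse import parse_qsl

def parse_eos_report(line):
    """Split one EOS FST report line (ampersand-separated key=value pairs) into a dict."""
    return dict(parse_qsl(line, keep_blank_values=True))

# Shortened, made-up sample; real report lines carry many more fields.
sample = ("td=cmsplt01.5642:2031@b9g20p8382&ots=1724871771"
          "&cts=1724872036&idt=161.072&delete_on_close=1")
rec = parse_eos_report(sample)

open_duration = int(rec["cts"]) - int(rec["ots"])   # seconds the file was open
will_be_deleted = rec["delete_on_close"] == "1"
print(rec["td"], open_duration, will_be_deleted)
```

Applied to the third and fourth lines above, this immediately shows the long-lived first connection closing after the retry, with delete_on_close=1.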

For completeness' sake, this is the wmagentJob.log extract for the stage-out of this file:

2024-08-28 19:02:51,338:INFO:StageOutMgr:==>Working on file: /store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root
2024-08-28 19:02:51,339:INFO:StageOutMgr:===> Attempting 1 Stage Outs
2024-08-28 19:02:51,340:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root
PFN: root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root?eos.app=cmst0

2024-08-28 19:02:51,341:INFO:StageOutImpl:Creating output directory...
2024-08-28 19:02:51,353:INFO:StageOutImpl:Running the stage out...
2024-08-28 19:06:34,882:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 1357118742
xrdcp exit code: 0
Remote File Size is: 1357118742
Local File Checksum is: d9b2f991
Remote File Checksum is: d9b2f991

stderr: 
2024-08-28 19:06:34,883:INFO:StageOutMgr:attempting stageOut
2024-08-28 19:06:34,883:INFO:StageOutMgr:===> Stage Out Successful: {'LFN': '/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root', 'PFN': 'root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root?eos.app=cmst0', 'PNN': 'T0_CH_CERN_Disk', 'StageOutCommand': 'xrdcp', 'Checksums': {'adler32': 'd9b2f991', 'cksum': '2524514726'}, 'StageOutReport': []}
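For reference, the 8-hex-digit checksums in these logs are adler32 values. A self-contained sketch of the local-side computation (the helper name is ours, not the actual StageOutImpl code) could look like:

```python
import zlib

def adler32_of(path):
    """Compute an adler32 checksum in the 8-hex-digit form the stage-out log prints."""
    value = 1  # adler32 starting value
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):  # stream in 1 MiB chunks
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")
```

Comparing this local value (plus the file size) against what the remote side reports is what makes the "Local/Remote File Checksum" lines above meaningful, and it is also why an exit code of 0 with matching checksums looked like an unambiguous success here.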

We are currently running a replay with XRD_WRITERECOVERY=0

This is the PR disabling write recovery on the replay:
germanfgv#16

That's great news, thanks!

@germanfgv do you know against which storage endpoints this issue is happening? By default, write-recovery should be disabled at least at T0_CH_CERN_Disk and T2_CH_CERN, see:
https://gitlab.cern.ch/SITECONF/T0_CH_CERN/-/blob/master/JobConfig/site-local-config.xml#L14
https://gitlab.cern.ch/SITECONF/T2_CH_CERN/-/blob/master/JobConfig/site-local-config.xml?ref_type=heads#L16

However, it is missing for T2_CH_CERN_P5 and T2_CH_CERN_HLT. It might be worth reaching out to the SST to update those site-local-config files.

We are writing to T0 Disk, so it seems this is not working properly. I've checked the final xrdcp command, and it does not include the environment variable. This is an example of the copy command that we usually generate:

xrdcp --force --nopbar --cksum adler32:2d8dc345  "/srv/job/WMTaskSpace/cmsRun1/write_L1ScoutingSelection_L1SCOUT.root"  "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelection/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root?eos.app=cmst0"

This is an example of the command in the running replay:

env XRD_WRITERECOVERY=0 xrdcp --force --nopbar  "/srv/job/WMTaskSpace/logArch1/logArchive.tar.gz"  "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/logs/prod/2024/8/29/PromptReco_Run382686_EGamma0_Tier0_REPLAY_2024_ID240829104828_v29104828/Reco/0000/0/a96807d0-7d81-4221-8229-9288d4e1a1d3-0-0-logArchive.tar.gz?eos.app=cmst0" 
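To cross-check the two command shapes above, here is a minimal sketch of how such a command line could be assembled with the env prefix; the function name and signature are illustrative, not the actual WMCore API.

```python
def build_xrdcp_command(source, dest, checksum=None, disable_write_recovery=True):
    """Assemble a stage-out command as a list of argv tokens (illustrative only)."""
    cmd = []
    if disable_write_recovery:
        # Prevent xrdcp's internal write recovery from racing the original
        # attempt, which can end in delete-on-close on the EOS side.
        cmd += ["env", "XRD_WRITERECOVERY=0"]
    cmd += ["xrdcp", "--force", "--nopbar"]
    if checksum:
        cmd += ["--cksum", f"adler32:{checksum}"]
    cmd += [source, dest]
    return cmd

# Hypothetical source/destination, just to show the shape of the output.
print(" ".join(build_xrdcp_command("local.root", "root://eoscms.cern.ch//eos/cms/...")))
```

With disable_write_recovery=True the argv starts with the env prefix, matching the replay command above; with it False, the result matches the original command that was missing the variable.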

@germanfgv @LinaresToine I wonder if you have any news on this issue? Has it resolved itself? Is this issue still happening?