uproot cannot open files on dCache when the 'https' protocol is used.
When I try to open a file on dCache using 'https' + X509, uproot fails to open it. I am using:
Python 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uproot
>>> uproot.__version__
'5.3.10'
This can be reproduced with:
import sys
import os
import ssl
import uproot
filenames = [{"T1_US_FNAL root":"root://cmsxrootd-site2.fnal.gov//store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root"}]
filenames.append({"T2_US_Wisconsin https":"https://cmsxrootd.hep.wisc.edu:1094/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/44187D37-0301-3942-A6F7-C723E9F4813D.root"})
filenames.append({"T1_US_FNAL https":"https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root"})
filenames.append({"T2_DE_DESY https":"https://dcache-cms-webdav-wan.desy.de:2880//pnfs/desy.de/cms/tier2/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root"})
i=0
for thefile in filenames:
    i=i+1
    try:
        sslctx = ssl.create_default_context()
        sslctx.load_cert_chain(os.environ["X509_USER_PROXY"], os.environ["X509_USER_PROXY"])
        uproot_options={'ssl': sslctx}
        site_protocol=list(thefile.keys())[0]
        the_file = uproot.open({thefile[site_protocol]: None}, **uproot_options)
        print ("[",i,"] OPEN OK ",site_protocol, thefile[site_protocol], " size of CA certs ", len(uproot_options['ssl'].get_ca_certs()))
        #print ("[",i,"] file is open and the_file is ",the_file)
        the_file.close()
    except Exception as e:
        print ( "[",i,"] OPEN Exception ",site_protocol, thefile[site_protocol], " Exception was ",e)
The output of the above script looks like:
[ 1 ] OPEN OK T1_US_FNAL root root://cmsxrootd-site2.fnal.gov//store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root size of CA certs 147
[ 2 ] OPEN OK T2_US_Wisconsin https https://cmsxrootd.hep.wisc.edu:1094/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/44187D37-0301-3942-A6F7-C723E9F4813D.root size of CA certs 147
[ 3 ] OPEN Exception T1_US_FNAL https https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root Exception was https://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root
[ 4 ] OPEN Exception T2_DE_DESY https https://dcache-cms-webdav-wan.desy.de:2880//pnfs/desy.de/cms/tier2/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root Exception was https://dcache-cms-webdav-wan.desy.de:2880//pnfs/desy.de/cms/tier2/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root
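The exception text printed above is just the URL, which hides the exception type and any underlying error. As a minimal sketch (the helper name `describe_exception` is hypothetical, not part of uproot or fsspec), the `except` branch could print the whole exception chain instead of only `e`:

```python
def describe_exception(exc):
    """Return one 'TypeName: message' line per exception in the chain.

    Follows __cause__ (set by 'raise ... from ...') and falls back to
    __context__ (set implicitly when raising inside an except block).
    """
    lines = []
    seen = set()  # guards against cycles in the chain
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        lines.append(f"{type(exc).__name__}: {exc}")
        exc = exc.__cause__ or exc.__context__
    return lines
```

In the reproducer this would be used as `print("\n".join(describe_exception(e)))` inside the `except Exception as e:` branch. Whether the chain contains anything beyond the outermost exception depends on how uproot and fsspec raise it.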
To access the files, one needs to pass the X509 proxy through an SSLContext as above, and might also need to create the SSLContext inside fsspec itself.
In lib/python3.12/site-packages/fsspec/implementations/http.py, add the following between lines 224 and 225 and between lines 825 and 826:
import os
import ssl
import socket
import copyreg
def save_sslcontext(obj):
    return obj.__class__, (obj.protocol,)
copyreg.pickle(ssl.SSLContext, save_sslcontext)
sslctx = ssl.create_default_context()
sslctx.load_cert_chain(os.environ['X509_USER_PROXY'], os.environ['X509_USER_PROXY'])
sslctxdic={'ssl': sslctx}
# Last - 0 necessary
kw.update(sslctxdic)
On the other hand, this traditional script properly downloads files from dCache:
import json,os,time
import urllib.request, urllib.error
import ssl
import os.path
url = 'https://cms-cric.cern.ch/api/accounts/user/query/?json&preset=people'
url = "https://cmsio9.rc.ufl.edu:1094/store/user/bockjoo/nano_dy.root"
url = "https://cmsxrootd.hep.wisc.edu:1094//store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/44187D37-0301-3942-A6F7-C723E9F4813D.root"
url = "https://dcache-cms-webdav-wan.desy.de:2880//pnfs/desy.de/cms/tier2/store/mc/RunIISummer20UL18NanoAODv9/TTTo2L2Nu_TuneCP5_13TeV-powheg-pythia8/NANOAODSIM/106X_upgrade2018_realistic_v16_L1v1-v1/130000/5D2E4672-C7D3-AF49-B699-E0F7E83A699C.root"
CERTIFICATE_CRT = '/home/bockjoo/.cmsuser.proxy'
CERTIFICATE_KEY = '/home/bockjoo/.cmsuser.proxy'
try:
    myContext = ssl.SSLContext()
    myContext.load_cert_chain(CERTIFICATE_CRT,
                              CERTIFICATE_KEY)
    with urllib.request.urlopen(url,
                                context=myContext) as urlHandle:
        urlCharset = urlHandle.headers.get_content_charset()
        if urlCharset is None:
            urlCharset = "utf-8"
        # Read the body once; a second read() on the same handle returns b''.
        rawData = urlHandle.read()
        try:
            myData = rawData.decode( urlCharset )
        except UnicodeDecodeError:
            # Binary payload (e.g. a ROOT file): keep the raw bytes.
            myData = rawData
    #response = requests.get(url, headers=headers) #, context=myContext)
except:
    print("Failed to download ",url)
    raise
print ( type (myData) )
with the output:
<class 'bytes'>
I'm not sure if this is related, but I also noticed some issues when reading through dCache with the current default https method in uproot/fsspec. My main observation is that it hangs once I request columns (leading to the uproot source calling .chunks). I believe this boils down to two things:
- aiohttp uses a connection pool of 100 TCP connections. dCache does not like this: a single client is typically expected to open only a few connections, and requests get queued when too many are opened.
- dCache answers a GET request with a redirect to a URL that carries a unique identifier in its parameters (I believe the server treats these connections as stateful, whereas HTTP requests are in principle stateless). This location is supposed to be used for subsequent requests to the same file, but that is not what aiohttp does. Instead, the next GET request asks the original URL again, gets another redirect (with a new state), and uses that. It also doesn't help that aiohttp keeps the TCP connections open, since it does not remember the redirect URLs (with the unique identifiers in the parameters).
Illustration of the second point in code (to reproduce, replace the URL with something you have access to; the following probably needs a Belle II VO X509 certificate):
import ssl
import os
from urllib.parse import urlparse
from http.client import HTTPSConnection
ctx = ssl.create_default_context(capath=os.environ["X509_CERT_DIR"])
ctx.load_cert_chain(os.environ["X509_USER_PROXY"])
path = "https://lcg-lrz-http.grid.lrz.de:443/pnfs/lrz-muenchen.de/data/belle/localgroupdisk/belle/user/nhart/test_202408141225/sub00/RootOutput_00000_job428876195_00.root"
parsed = urlparse(path)
conn = HTTPSConnection(parsed.hostname, port=parsed.port, context=ctx)
Now
conn.request("GET", f"{parsed.path}?{parsed.query}", headers={"Range": "bytes=0-10"})
resp = conn.getresponse()
print(resp.headers.as_string())
print(resp.status)
print(resp.read())
Gives something like
Date: Fri, 23 Aug 2024 14:42:33 GMT
Server: dCache/9.2.17
Location: https://lcg-lrz-dc46.grid.lrz.de:62240/pnfs/lrz-muenchen.de/data/belle/localgroupdisk/belle/user/nhart/test_202408141225/sub00/RootOutput_00000_job428876195_00.root?dcache-http-uuid=d20f7960-32dc-42e3-802c-efb44f66e184&dcache-http-ref=https%3A%2F%2Flcg-lrz-http.grid.lrz.de%3A443
Content-Length: 0
302
b''
So we get a redirect URL with a uuid in it. I can now open a connection to it and make multiple requests over that connection (but I can't use the URL parameters with the uuid across multiple connections):
location = resp.headers["Location"]
parsed = urlparse(location)
conn = HTTPSConnection(parsed.hostname, port=parsed.port, context=ctx) # this is the new connection to the redirect location
Now I can repeat the following many times, also with different ranges:
conn.request("GET", f"{parsed.path}?{parsed.query}", headers={"Range": "bytes=0-10"}) # the parsed.query now contains the url parameters needed
resp = conn.getresponse()
print(resp.headers.as_string())
print(resp.status)
print(resp.read())
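The two steps above (follow the redirect once, then keep using that single connection) could be wrapped roughly as follows. This is only a sketch under assumptions: `PinnedRangeReader`, `request_target` and the `connection_factory` parameter are hypothetical names, the `Location` header is assumed to be an absolute URL as in the output above, and nothing here handles the uuid expiring or the connection being dropped.

```python
from http.client import HTTPSConnection
from urllib.parse import urlparse


def request_target(parsed):
    """Path plus query string, without a dangling '?' when there is no query."""
    return f"{parsed.path}?{parsed.query}" if parsed.query else parsed.path


class PinnedRangeReader:
    """Hypothetical helper: resolve the redirect once, then reuse that connection."""

    def __init__(self, url, context, connection_factory=HTTPSConnection):
        self._context = context
        self._factory = connection_factory
        self._parsed = urlparse(url)
        self._conn = self._connect(self._parsed)

    def _connect(self, parsed):
        return self._factory(parsed.hostname, port=parsed.port, context=self._context)

    def read_range(self, start, end):
        """Return bytes start..end (end inclusive, as in the HTTP Range header)."""
        headers = {"Range": f"bytes={start}-{end}"}
        for _ in range(2):  # the original request plus at most one redirect hop
            self._conn.request("GET", request_target(self._parsed), headers=headers)
            resp = self._conn.getresponse()
            body = resp.read()  # drain the response so the connection can be reused
            location = resp.headers.get("Location")
            if resp.status in (301, 302, 303, 307, 308) and location:
                # Remember the redirect target; later calls go straight to it.
                self._conn.close()
                self._parsed = urlparse(location)
                self._conn = self._connect(self._parsed)
                continue
            if resp.status not in (200, 206):
                raise OSError(f"unexpected HTTP status {resp.status}")
            return body
        raise OSError("more than one redirect in a row")
```

With `ctx` and `path` from the snippet above, `reader = PinnedRangeReader(path, ctx)` followed by repeated `reader.read_range(0, 10)` calls would hit the original URL only once. The `connection_factory` argument exists only so the logic can be exercised without a real server.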
I'm not sure what the solution is. One would need to introduce corresponding behavior in the fsspec https source and/or make it use multi-range requests again (which is what the old uproot HTTPSource did). Concerning multi-range requests, what I gather from @jpivarski's comments on older issues like #3 is that it was quite a struggle even to find out whether an HTTP server supports them.
Or do we stick to xrootd for storage systems like dCache? ...
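For reference, a multi-range request is an ordinary GET whose `Range` header lists several byte ranges separated by commas. A minimal sketch of building such a header is below; the helper name `multi_range_header` is hypothetical, and treating each chunk as a `(start, stop)` pair with an exclusive `stop` is an assumption made for this example.

```python
def multi_range_header(chunks):
    """Build a Range header value from (start, stop) byte chunks.

    `stop` is exclusive here (an assumption for this sketch), while the
    end position in an HTTP Range header is inclusive, hence the `stop - 1`.
    """
    if not chunks:
        raise ValueError("at least one chunk is required")
    parts = []
    for start, stop in chunks:
        if start < 0 or stop <= start:
            raise ValueError(f"invalid chunk: ({start}, {stop})")
        parts.append(f"{start}-{stop - 1}")
    return "bytes=" + ",".join(parts)
```

For example, `multi_range_header([(0, 11), (100, 200)])` gives `bytes=0-10,100-199`. A server that honors all ranges is expected to answer 206 with a `multipart/byteranges` body, but a server may also ignore the header, so the client still has to check the status and content type of the response.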