Reading a hipscat dataset from a web URL fails
dougbrn opened this issue · 1 comments
Bug report
I'm trying to use LSDB/hipscat to read a small web-hosted dataset (https://epyc.astro.washington.edu/~lincc-frameworks/1deg_surveys/ztf_1deg/), and hipscat appears to be unable to load this dataset correctly. The following line fails:
lsdb.read_hipscat("https://epyc.astro.washington.edu/~lincc-frameworks/1deg_surveys/ztf_1deg/ztf_object")
and produces an error where it cannot find the catalog_info.json file. Here are a few relevant snippets of the long stack trace:
My main thought is that somewhere it's dropping the https and failing to resolve the link in fsspec? For clarity, I'm able to load this file independently in fsspec using:
import json
import fsspec
with fsspec.open("https://epyc.astro.washington.edu/~lincc-frameworks/1deg_surveys/gaia_1deg/catalog_info.json") as json_file:
    outfile = json.load(json_file)
Relevant Version Information
- python == 3.10
- hipscat == 0.2.3
- lsdb == 0.1.1
Before submitting
Please check the following:
- I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
- I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
- [] If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.
With some further investigation, I think I've homed in on the issue. This is happening in the file_io module: the protocol is indeed stripped away from the file path, causing fsspec's https filesystem reader to fail. Here's a reproducible code cell:
import json
from hipscat.io import file_io
catalog_info_file = "https://epyc.astro.washington.edu/~lincc-frameworks/1deg_surveys/ztf_1deg/ztf_object/catalog_info.json"
# Issue is happening in the get_fs command
fs, fp = file_io.file_pointer.get_fs(catalog_info_file)
# fs.open(fp)  # using the file pointer (fp) doesn't work
with fs.open(catalog_info_file) as json_file:  # the open command still needs the "https://" protocol prepended
    res = json.load(json_file)
I'm not that knowledgeable about what the best solution to this should be, but if I were to do it, I would add explicit handling of the https protocol in this function and avoid removing the protocol:
https://github.com/astronomy-commons/hipscat/blob/main/src/hipscat/io/file_io/file_pointer.py#L74
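To make the suggestion concrete, here is a minimal sketch of the idea using only the standard library. It is not hipscat's actual get_fs code: `split_protocol` and `KEEP_FULL_URL` are hypothetical names, and it assumes that only http/https need the full URL while other protocols are fine with the prefix removed.

```python
from urllib.parse import urlsplit

# Protocols whose reader needs the full URL (an assumption for this sketch).
KEEP_FULL_URL = {"http", "https"}


def split_protocol(path: str) -> tuple[str, str]:
    """Return (protocol, path_for_filesystem) for a local path or URL.

    Hypothetical helper, not hipscat's actual get_fs implementation.
    """
    if "://" not in path:
        # Plain local path: there is no protocol to strip.
        return "file", path
    protocol = urlsplit(path).scheme
    if protocol in KEEP_FULL_URL:
        # Keep "https://..." intact so the HTTP reader can resolve it.
        return protocol, path
    # Other protocols (e.g. "s3"): drop the "protocol://" prefix.
    return protocol, path.split("://", 1)[1]
```

With something like this, the path returned for an https catalog would still start with "https://", so the commented-out `fs.open(fp)` call in the cell above would get a URL it can resolve.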