astronomy-commons/hipscat

Reading a hipscat dataset from a web url fails

dougbrn opened this issue · 1 comments

Bug report
I'm trying to use LSDB/hipscat to read a small web-hosted dataset (https://epyc.astro.washington.edu/~lincc-frameworks/1deg_surveys/ztf_1deg/), and hipscat appears to be unable to load this dataset correctly. The following line fails:

lsdb.read_hipscat("https://epyc.astro.washington.edu/~lincc-frameworks/1deg_surveys/ztf_1deg/ztf_object")

and produces an error saying it cannot find the catalog_info.json file. Here are a few relevant snippets of the long stack trace:
[Screenshot: Screen Shot 2024-02-12 at 11 44 54 AM]
[Screenshot: Screen Shot 2024-02-12 at 11 44 35 AM]

My guess is that the "https://" prefix is being dropped somewhere, so fsspec fails to resolve the link. For reference, I'm able to load a catalog_info.json file from the same server directly with fsspec:

import json
import fsspec
with fsspec.open("https://epyc.astro.washington.edu/~lincc-frameworks/1deg_surveys/gaia_1deg/catalog_info.json") as json_file:
    outfile = json.load(json_file)

Relevant Version Information

  • python == 3.10
  • hipscat == 0.2.3
  • lsdb == 0.1.1

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • [] If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.

With some further investigation, I think I've homed in on the issue. This is happening in the file_io module: the protocol is stripped from the file path, which causes fsspec's https filesystem reader to fail. Here's a reproducible code cell:

import json
from hipscat.io import file_io
catalog_info_file = "https://epyc.astro.washington.edu/~lincc-frameworks/1deg_surveys/ztf_1deg/ztf_object/catalog_info.json"

# The issue happens in the get_fs call
fs, fp = file_io.file_pointer.get_fs(catalog_info_file)

# fs.open(fp)  # using the returned file pointer (fp) doesn't work
with fs.open(catalog_info_file) as json_file:  # open still needs the "https://" protocol prepended
    res = json.load(json_file)

I'm not that knowledgeable about what the best solution would be, but if I were to do it, I would add explicit handling of the https protocol in this function and avoid removing the protocol:
https://github.com/astronomy-commons/hipscat/blob/main/src/hipscat/io/file_io/file_pointer.py#L74
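
The suggestion above could look roughly like the following. This is a minimal sketch of the idea, not hipscat's actual code: `split_pointer` and `KEEP_PROTOCOL` are hypothetical names, and the sketch assumes that only http/https paths need to keep their protocol prefix while other protocols can still have it stripped.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: only these protocols need the full URL,
# scheme included, when the path is later handed to the filesystem.
KEEP_PROTOCOL = {"http", "https"}


def split_pointer(file_pointer: str) -> tuple[str, str]:
    """Return (protocol, path_to_open) for a file pointer (hypothetical helper)."""
    scheme = urlsplit(file_pointer).scheme
    if "://" not in file_pointer or not scheme:
        # No protocol prefix: treat it as a plain local path.
        return "file", file_pointer
    if scheme in KEEP_PROTOCOL:
        # Explicit handling for http/https: leave the URL intact.
        return scheme, file_pointer
    # Any other protocol: drop the "scheme://" prefix.
    return scheme, file_pointer.split("://", 1)[1]
```

The returned protocol could then be used to pick the filesystem (for example via `fsspec.filesystem(protocol)`), while the second value is the path passed to `fs.open`, so an https URL would reach the reader with its prefix still attached.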