Implement file-list-batch style catalog import
Feature request
PLACEHOLDER.
There are a lot of details I'm glossing over. I'll write up more later.
Before submitting
Please check the following:
- I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
- I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
- If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.
If I'm interpreting the title correctly, I think the feature request is:
Add docs and code to the file readers module that show how to pass lists of input files to the reader and have the reader concatenate data from multiple files, if necessary, to yield chunks with at least x rows. This should reduce the number of files in the intermediate dataset in cases where the input files are small and numerous.
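Roughly the kind of reader I'm imagining, as a minimal sketch assuming pandas-backed readers; the class name, the `min_rows` parameter, and the `read` signature here are all illustrative, not the existing file_readers API:

```python
import pandas as pd


class FileListBatchReader:
    """Sketch: yield chunks of at least `min_rows` rows from a list of input files."""

    def __init__(self, min_rows=1_000_000):
        self.min_rows = min_rows

    def read(self, input_files):
        buffered = []
        buffered_rows = 0
        for file_path in input_files:
            # Read one small input file; the format-specific call would vary.
            frame = pd.read_parquet(file_path)
            buffered.append(frame)
            buffered_rows += len(frame)
            # Once enough rows have accumulated, emit one concatenated chunk.
            if buffered_rows >= self.min_rows:
                yield pd.concat(buffered, ignore_index=True)
                buffered = []
                buffered_rows = 0
        if buffered:
            # Flush whatever remains, even if it is smaller than min_rows.
            yield pd.concat(buffered, ignore_index=True)
```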
For reference, a recent import of the ZTF lightcurves resulted in an intermediate dataset with 4.4 million files. The import took several days to run, and multiple things went wrong at different stages, including obscure but crucial problems with the compute nodes. The large number of files made it practically impossible to verify what was actually on disk at any given time, which became especially hard after some of the intermediate files were deleted during the reducing step; I ended up having to start over completely.
As I recall, @delucchi-cmu recommended sizing the lists so that there are 50-100 lists per worker. A single list of input files per worker is not recommended, because it prevents the pipeline from skipping previously completed input files when resuming the splitting step.
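To make that sizing concrete, here is an illustrative helper (hypothetical, not part of the pipeline) that splits the full input file list into roughly 50-100 batches per worker, so a resume can still skip batches whose splitting step already finished:

```python
def batch_input_files(input_files, n_workers, batches_per_worker=75):
    """Split `input_files` into roughly `n_workers * batches_per_worker` lists."""
    target_batches = max(1, n_workers * batches_per_worker)
    # Ceiling division so every file lands in some batch.
    batch_size = max(1, -(-len(input_files) // target_batches))
    return [
        input_files[i : i + batch_size]
        for i in range(0, len(input_files), batch_size)
    ]


# Example: 4.4 million input files across 100 workers would give
# ~7,500 batches of ~587 files each.
# batches = batch_input_files(file_paths, n_workers=100)
```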