huggingface/huggingface_hub

Support server-side filtering for list_repo_tree

younik opened this issue · 1 comments

Is your feature request related to a problem? Please describe.
I need to retrieve some files from a dataset, for example all the *.json files. Currently, I call the list_repo_tree function on the whole repo and check each entry for a match. However, it would be more efficient to have server-side filtering.

Describe the solution you'd like
Add filtering support on the API, similar to list_spaces's filter argument.

Describe alternatives you've considered
Continue with my current solution.

Hi @younik, listing Spaces with list_spaces and listing entries in a repo with list_repo_tree have completely different implementations server-side. Searching through Spaces is a very common use case, so the data is already indexed in a search engine, making it easy to request only a subset of it based on a filter. By contrast, listing files in a repo is a costly operation, as files are not indexed. Since it's not done very often, the cost of indexing every file would be much greater than the cost of indexing repos (we have 2M+ repos, each of them possibly containing thousands of files...), for a limited benefit.

This explains why it is not possible to filter them server-side when listing files from a repo. In theory we could filter the listing on the server before returning the response, but in that case we would be hiding the complexity of the request from the end user (which is bad design-wise). So the best solution is still to do the filtering client-side :)
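
For reference, client-side filtering could look something like the sketch below. It assumes a recursive listing with list_repo_tree and glob matching on each entry's path attribute; the helper names filter_paths and list_matching_files and the repo id in the usage note are made up for illustration and are not part of huggingface_hub.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Iterator


def filter_paths(entries: Iterable, pattern: str) -> Iterator[str]:
    """Yield the path of every entry whose path matches a glob pattern.

    `entries` is any iterable of objects with a `.path` attribute, such as
    the items returned by list_repo_tree. Note that `*` in fnmatch patterns
    also matches `/`, so "*.json" matches files in nested folders too.
    """
    for entry in entries:
        if fnmatchcase(entry.path, pattern):
            yield entry.path


def list_matching_files(repo_id: str, pattern: str, repo_type: str = "dataset") -> list[str]:
    """Hypothetical helper: list the whole repo, then filter client-side."""
    # Imported lazily so filter_paths stays usable without the library installed.
    from huggingface_hub import list_repo_tree

    # The server still walks the full tree; only the matching is done locally.
    entries = list_repo_tree(repo_id, recursive=True, repo_type=repo_type)
    return list(filter_paths(entries, pattern))
```

Usage would be along the lines of list_matching_files("user/my-dataset", "*.json"), where the repo id is a placeholder. Folders are part of the listing as well, so a folder whose name fits the pattern would also be returned.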

Note that you can narrow the listing by only listing files from a subdirectory. Hope this answers your question. I'll close this issue as "not planned", but please let me know if you have any further questions.
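
A rough sketch of the subdirectory approach, assuming the path_in_repo argument of list_repo_tree is used to restrict the listing before matching locally. The helper name list_subdir_matches and its list_tree parameter (only there so the function can be tried without network access) are hypothetical, as are the repo id and folder in the usage note.

```python
from fnmatch import fnmatchcase
from typing import Callable, Optional


def list_subdir_matches(
    repo_id: str,
    subdir: str,
    pattern: str,
    repo_type: str = "dataset",
    list_tree: Optional[Callable] = None,
) -> list[str]:
    """Hypothetical helper: list one subdirectory, then filter client-side."""
    if list_tree is None:
        # Default to the real API; imported lazily since it needs the library.
        from huggingface_hub import list_repo_tree as list_tree

    # path_in_repo limits the listing to one folder, so fewer entries
    # are fetched than when listing the whole repo.
    entries = list_tree(
        repo_id, path_in_repo=subdir, recursive=True, repo_type=repo_type
    )
    return [entry.path for entry in entries if fnmatchcase(entry.path, pattern)]
```

For example, list_subdir_matches("user/my-dataset", "data/train", "*.json") would only look at entries under data/train (both names are placeholders).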