datalad/datalad-catalog

Efficient way to list / iterate through all datasets and versions included in a catalog

Using datalad-next's iterators / tree command?

An easy way to identify all dataset-versions is to list all second-level directories relative to the metadata directory. With the planned changes for dataset aliases (design in #423 (comment)), every first-level directory will be either a dataset-id or an alias. Only the dataset-id directories will contain second-level directories, which are the dataset-version directories. Globbing two levels deep therefore bypasses the alias directories, so they are not counted as duplicates:

Example:

> target = catalog_dir / 'metadata'
> ds_version_list = list(target.glob('*/*'))

[PosixPath('/Users/jsheunis/Documents/psyinf/abcdj/data-catalog/catalog/metadata/0036b2c6-f131-4660-9ef9-945087ad02d3/1139b4c82d7ecd0d9ab25bf4c00ec2c06461d6da'),
 PosixPath('/Users/jsheunis/Documents/psyinf/abcdj/data-catalog/catalog/metadata/372e66b3-e654-4a0f-ba4c-6394bf314f2f/d468e9a9455b38b959db34bc5335fd66e42b2884'),
 PosixPath('/Users/jsheunis/Documents/psyinf/abcdj/data-catalog/catalog/metadata/db7592d0-6206-5684-a29c-8059a6033241/0.1.0'),
 PosixPath('/Users/jsheunis/Documents/psyinf/abcdj/data-catalog/catalog/metadata/1015ed7c-0a3d-4dfc-9c4f-11fe71673a41/40ab84384760f9b87e8107ea81ede45054101622'),
 PosixPath('/Users/jsheunis/Documents/psyinf/abcdj/data-catalog/catalog/metadata/1015ed7c-0a3d-4dfc-9c4f-11fe71673a41/c58f0f563011222c618a58951fa21b61c8eb189b'),
 PosixPath('/Users/jsheunis/Documents/psyinf/abcdj/data-catalog/catalog/metadata/1015ed7c-0a3d-4dfc-9c4f-11fe71673a41/ced65ccd8e6a6c3402c6695f10bf2d71c119696c'),
 PosixPath('/Users/jsheunis/Documents/psyinf/abcdj/data-catalog/catalog/metadata/1015ed7c-0a3d-4dfc-9c4f-11fe71673a41/38333c8e44a42892e09b24d898c86c23faf8134f'),
 PosixPath('/Users/jsheunis/Documents/psyinf/abcdj/data-catalog/catalog/metadata/1015ed7c-0a3d-4dfc-9c4f-11fe71673a41/5c01525ef4799485809e521abd9ab48f04af1744'),
 PosixPath('/Users/jsheunis/Documents/psyinf/abcdj/data-catalog/catalog/metadata/1015ed7c-0a3d-4dfc-9c4f-11fe71673a41/3f91fe14a8841affb7084d0277410347b6f2a597'),
 PosixPath('/Users/jsheunis/Documents/psyinf/abcdj/data-catalog/catalog/metadata/3f8a45c0-08fc-479c-b561-cb6f744d2b5c/32f3303308c89b037fe1700547348470539a30cb')]


This list of paths can then be parsed to find all datasets and their versions, along with counts of both: here, 10 dataset-versions across 5 datasets.
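The parsing step could look roughly like the sketch below. It assumes every path has the form `<metadata>/<dataset-id>/<dataset-version>`, as in the listing above. The helper name `group_versions` and the shortened example paths are hypothetical and only illustrate the idea; they are not part of datalad-catalog.

```python
from collections import defaultdict
from pathlib import Path


def group_versions(paths):
    """Map each dataset-id to the list of its dataset-version names.

    Assumes each path ends in <dataset-id>/<dataset-version>.
    """
    versions_by_dataset = defaultdict(list)
    for p in paths:
        # The parent directory name is the dataset-id,
        # the final path component is the dataset-version.
        versions_by_dataset[p.parent.name].append(p.name)
    return dict(versions_by_dataset)


# Shortened, hypothetical input mirroring the layout of the listing above
example = [
    Path("metadata/0036b2c6-f131-4660-9ef9-945087ad02d3/1139b4c82d7ecd0d9ab25bf4c00ec2c06461d6da"),
    Path("metadata/1015ed7c-0a3d-4dfc-9c4f-11fe71673a41/40ab84384760f9b87e8107ea81ede45054101622"),
    Path("metadata/1015ed7c-0a3d-4dfc-9c4f-11fe71673a41/c58f0f563011222c618a58951fa21b61c8eb189b"),
]

grouped = group_versions(example)
n_datasets = len(grouped)
n_versions = sum(len(versions) for versions in grouped.values())
print(f"{n_versions} dataset-versions across {n_datasets} datasets")
```

In a real catalog, `ds_version_list` from the example above would be passed in instead of `example`. It may be worth filtering the glob results with `Path.is_dir()` first, in case regular files ever appear at the second level.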