IntelLabs/matsciml

Example of how to create your own LMDB dataset for training.


Feature/behavior summary

We have some larger materials datasets we would like to train on, and your repo is the best I could find in terms of support for large-scale training.
It would be great if you could provide an example of how to create your own LMDB dataset, similar to the existing ones, from a list of structures and properties, and then use it for training. I am sure it's quite simple, but I have to admit I am getting a bit lost in all the different dataset classes.
Alternatively, could you point me in the right direction, i.e. what key and dictionary structure would be correct to pass to write_lmdb_data(key: Any, data: Any, target_lmdb: lmdb.Environment) so that the result is consistent with the existing datasets, e.g. the SinglePointLmdbDataset?
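
For concreteness, here is roughly what I would try at the moment. The import path, the integer keys, the dict field names ("pos", "atomic_numbers", "cell", "energy") and the "length" entry are all guesses on my part, so confirming or correcting any of them would already help a lot:

```python
import lmdb
import numpy as np

# guessing the import path here; the signature is the one from the repo
from matsciml.datasets.utils import write_lmdb_data

# toy example with made-up data; the dict keys ("pos", "atomic_numbers",
# "cell", "energy") are guesses and presumably need to match whatever
# SinglePointLmdbDataset actually looks up
samples = [
    {
        "pos": np.random.rand(8, 3),      # Cartesian coordinates
        "atomic_numbers": np.full(8, 6),  # Z for each atom
        "cell": np.eye(3) * 5.0,          # lattice vectors
        "energy": -42.0,                  # target property
    },
]

# map_size is an upper bound; LMDB only allocates what it actually uses
env = lmdb.open("my_dataset.lmdb", map_size=1024**4, subdir=False)

for index, sample in enumerate(samples):
    # one entry per structure, keyed by its integer index (is that right?)
    write_lmdb_data(index, sample, env)

# some LMDB datasets also seem to store the number of entries;
# not sure whether that is required here
write_lmdb_data("length", len(samples), env)

env.close()
```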

Thank you so much for your help.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

No response

Solution description

Add an example to the tutorials showing how to create your own LMDB dataset from a list of structures and properties.

Additional notes

No response

Thanks for creating the issue @JonathanSchmidt1! I agree with you that we're missing that documentation: when we get a chance we'll look at writing something up and updating CONTRIBUTING.md with an overview of how to use existing functionality to implement new LMDB sources.

Once we have that in place, hopefully you'll have what you need, but please feel free to reach out to us in the interim to see if we can get something ad hoc going.