
Data Storage/Access/Manipulation


Add a section on best practices for storing, accessing, and manipulating data

A goal for scientific publishing is automated reproducibility of analyses, which the Jupyter notebook excels at. But, more than that, it should be possible to efficiently reproduce the analysis with different data sets. This entails having a single point of access to a data set within the notebook, rather than copy-pasting data into variables, i.e. this:

data = read_in_data('data_key')  # one function providing access to the whole data set
variable1 = data.key1
variable2 = data.key2
...

rather than this:

variable1 = 12345
variable2 = 'something'
...

The best practice for this (in my opinion) is to use the JSON format (as long as the data isn't relational), because it:

  • is applicable to any data structure
  • is lightweight and easy to read and edit
  • has a simple read/write mapping to Python objects (via the built-in json module; see the sketch below)
  • is widely used (especially in web technologies)
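
As a minimal sketch of that read/write mapping (the file name and keys here are hypothetical, standing in for the read_in_data call above):

import json

# single point of access: load the whole data set from one JSON file
with open('data.json') as f:
    data = json.load(f)

variable1 = data['key1']
variable2 = data['key2']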

A good way to store multiple bits of JSON data is in a MongoDB database, accessed via pymongo. This also makes it easy to move all the data to a cloud server at a later date, if required.

conda install pymongo
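
A rough sketch of what this access pattern might look like (the database, collection, and key names here are assumptions, not a prescribed schema):

from pymongo import MongoClient

client = MongoClient()  # connects to a local MongoDB instance by default
collection = client['experiments']['simulations']  # hypothetical db/collection names

# store one JSON-like document per data set
collection.insert_one({'data_key': 'run1', 'key1': 12345, 'key2': 'something'})

# single point of access, as above
data = collection.find_one({'data_key': 'run1'})
variable1 = data['key1']
variable2 = data['key2']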

But, if the data comes from files output by different simulation or experimental codes, where the user has no control over the output format, then writing parsers to convert them to JSON may be the way to go. This is where jsonextended comes in, which implements:

  • a lightweight plugin system to define bespoke classes for parsing different file extensions and data types (see the parser sketch after the example below).

  • a 'lazy loader' for treating an entire directory structure as a nested dictionary.

    from jsonextended import plugins, edict

    # load parser plugins from a directory of parser classes
    plugins.load_plugins_dir('path/to/folder_of_parsers', 'parsers')

    # treat the directory structure as a nested dictionary
    data = edict.LazyLoad('path/to/data')

    # keys can be accessed as attributes...
    variable1 = data.folder1.file1_json.key1
    # ...or as lists of path components
    variable2 = data[['folder1', 'file1.json', 'key2']]
    variable3 = data[['folder1', 'file2.csv', 'key1']]
    variable4 = data[['folder2', 'subfolder1', 'file3.other', 'key1']]
    ...
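
For reference, a parser plugin is just a class with a few required attributes, something like the sketch below (the attribute names follow the jsonextended docs; the file pattern and parsed keys are placeholders):

    # minimal parser plugin sketch; jsonextended discovers classes with
    # these attributes when load_plugins_dir is called
    class ExampleParser(object):
        plugin_name = 'example.parser'
        plugin_descript = 'read *.example files as a single key'
        file_regex = '*.example'

        def read_file(self, file_obj, **kwargs):
            # placeholder parsing logic: first line becomes 'key1'
            return {'key1': file_obj.readline().strip()}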