peekbank/peekbank-data-import

Proposal for IDless imports


Since IDs (and linking them) have been the biggest source of confusion, errors, code duplication, and bloat when importing datasets into the peekbank format, we could think about a future approach that forgoes IDs completely on the importer side and handles them via peekds. Since IDs are uniquely defined by properties within each table, they can be inferred.

In that version, all of the data would be put into one big dataframe with one row per xy/aoi timepoint, and all other data (stimulus info, subject info, etc.) placed redundantly in every row. This table is then fed into peekds, which infers the IDs and sorts the data into the separate tables. The intermediate dataset is slightly larger this way, but most of the datasets we receive come as a single large unnormalized table anyway, so adding some of the peekbank variables to it should not add much bloat.
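To make the mechanism concrete, here is a minimal sketch of the ID-inference step. The actual digestion function lives in peekds (an R package); this is an illustrative Python/pandas version, and all column names and the `infer_id` helper are assumptions, not the real peekbank schema.

```python
import pandas as pd

# Hypothetical wide table: one row per aoi timepoint, with subject and
# stimulus info repeated on every row (column names are illustrative).
wide = pd.DataFrame({
    "lab_subject_id": ["s1", "s1", "s2", "s2"],
    "stimulus_label": ["ball", "dog", "ball", "dog"],
    "t_ms":           [0, 40, 0, 40],
    "aoi":            ["target", "distractor", "target", "target"],
})

def infer_id(df, identifying_cols, id_name):
    """Assign a numeric ID to each unique combination of identifying columns."""
    df[id_name] = df.groupby(identifying_cols, sort=False).ngroup()
    return df

# The centralized rules: which columns uniquely identify rows of each table.
wide = infer_id(wide, ["lab_subject_id"], "subject_id")
wide = infer_id(wide, ["stimulus_label"], "stimulus_id")

# Normalization: split out a per-entity table, dropping the redundant copies.
subjects = wide[["subject_id", "lab_subject_id"]].drop_duplicates()
```

The key point is that the `identifying_cols` lists live in one place rather than being re-derived in every import script.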

The productivity gains in the MB2 imports were striking, so I think this approach is promising.

From a software upkeep/error reduction perspective: right now, every single import script has to encode the rules for which variables within a table uniquely identify rows when creating IDs. The "huge table into peekds" approach would move those rules into one centralized location, leaving the import scripts to handle only dataset-specific functionality.

Happy to have more opinions on this before I go and write the code!

We just decided in the meeting that this is worth a try; I will have a look.

I like this idea, Adrian!! I think IDs are a huge source of error (and also take a lot of mental energy to think through), so it would be awesome to solve this "once", rather than solving it each time for every dataset. Happy to help look at this!

Note: Put age into this and calculate it from lab_age and lab_age_units
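A sketch of what that calculation could look like. The unit vocabulary (`"months"`, `"days"`, `"years"`) and the target unit of months are assumptions for illustration, not the actual peekds behavior:

```python
# Hypothetical helper: convert a lab-reported age to months.
DAYS_PER_MONTH = 365.25 / 12

def age_in_months(lab_age, lab_age_units):
    if lab_age_units == "months":
        return lab_age
    if lab_age_units == "days":
        return lab_age / DAYS_PER_MONTH
    if lab_age_units == "years":
        return lab_age * 12
    raise ValueError(f"unknown age unit: {lab_age_units}")
```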

The first version is done; it still needs:

  • test this with legacy imported datasets to check equivalence (ids will be different, but the structure/content should be the same)
  • automated testing framework for legacy comparisons (Stephan is working on it)
  • testing for datasets with xy coordinates
  • a cdi helper to facilitate cdi data import
  • documentation on how to use it (only the spreadsheet left)
  • validation that helps importers with common errors
  • a marking in the dataset table (which dataset uses which import method, legacy vs idless)
  • integration into peekds (once it is more mature - maybe also think about combining peekbankr and peekds)
  • think of quality-of-life additions (e.g. accepting L, l, r, R as inputs for target_side and transforming them in the function; same for other common inputs)
  • the current version does not support having multiple aoi region sets for a dataset - should be easy to fix though (?)
  • automatic aoi generation for datasets that provide xy data (we trust our aois more than theirs)
  • expose the column list from the idless import so importers can easily filter their WIP wide table down to the needed columns, making the idless import easier on memory
  • better aux data nesting helper function (maybe move this into a separate issue as this might be a bigger thing)
  • make language entry more permissive by mapping language names to ISO codes in the digestion function
  • relax the "space after comma for multiple languages in iso format" rule by normalizing the spacing in the digestion function
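The input-normalization items above (target_side aliases, language-to-ISO mapping, comma spacing) could all follow the same pattern. A minimal sketch; the canonical values and the mapping tables are assumptions for illustration, not the actual peekds vocabulary:

```python
# Permissive input handling for common importer-supplied values.
TARGET_SIDE_MAP = {"l": "left", "left": "left", "r": "right", "right": "right"}
LANGUAGE_MAP = {"english": "eng", "german": "deu", "french": "fra"}

def normalize_target_side(value):
    """Accept L, l, r, R (and full words) and return the canonical form."""
    return TARGET_SIDE_MAP[value.strip().lower()]

def normalize_languages(value):
    # Accept "eng,deu", "eng, deu", or full language names; emit the
    # canonical "iso, iso" form with a space after each comma.
    parts = [p.strip().lower() for p in value.split(",")]
    isos = [LANGUAGE_MAP.get(p, p) for p in parts]
    return ", ".join(isos)
```

Doing this once in the digestion function keeps the import scripts free of per-dataset cleanup code.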