alephdata/followthemoney

Proposal: declarative mapping to convert complex nested documents into the FTM entities

dchaplinsky opened this issue · 2 comments

Foreword

Aleph already has some tools (including a nice UI) to map flat files (such as CSV/XLSX) into FTM entities. Those mappings are YAML-based, can be version-controlled, and can be read and used by non-programmers. Great success.

Problem

Some input files are not born as tables and cannot be converted to tables (without blowing up the data). For example, take an asset declaration, in which a person declares their real estate, cars, income, and bank accounts. That creates a bunch of problems:

  • Each section has its own rules and fields, and generally produces a different subset of entities.
  • There might be more than a hundred records in each section.
  • Some records in some sections have even more levels of nesting, as well as back-references. For example, rights to an asset: you have a plot of land that you co-own with relatives and third parties. Each such record yields not only a record for the asset (for example, Real Estate) but also N records of type Company or Person, plus intervals to connect them to the asset.
  • The generated entities can be of different types. Back to the example above: some real estate can be co-owned or co-used by 2 persons and 1 company.

The usual solution for this kind of data source is to write some Python code.
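To make the cost of that approach concrete, here is a minimal sketch of what such hand-written code tends to look like. Everything in it is an assumption made for illustration: the input layout (`declarant`, `real_estate`, `rights`, `is_company`, `share`), the helper names, and the use of plain dicts instead of the followthemoney API. Only the schema names are borrowed from FTM.

```python
import hashlib


def make_id(*parts):
    """Derive a stable entity id from the given key parts."""
    raw = "|".join(str(part) for part in parts)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def entity(schema, key, **props):
    """Build a plain-dict stand-in for an FTM entity (not the real API)."""
    return {
        "id": make_id(schema, *key),
        "schema": schema,
        "properties": {k: [v] for k, v in props.items() if v is not None},
    }


def convert(declaration):
    """Walk one hypothetical declaration and yield entities for it."""
    doc_id = declaration["id"]
    yield entity("Person", (doc_id, "declarant"),
                 name=declaration["declarant"]["name"])
    for idx, estate in enumerate(declaration.get("real_estate", [])):
        asset = entity("RealEstate", (doc_id, "real_estate", idx),
                       name=estate.get("address"))
        yield asset
        # Each right yields an owner of varying type plus a linking interval.
        for right in estate.get("rights", []):
            schema = "Company" if right.get("is_company") else "Person"
            owner = entity(schema, (doc_id, right["name"]), name=right["name"])
            yield owner
            yield entity("Ownership", (asset["id"], owner["id"]),
                         owner=owner["id"], asset=asset["id"],
                         percentage=right.get("share"))


declaration = {
    "id": "decl-1",
    "declarant": {"name": "John Doe"},
    "real_estate": [{
        "address": "Plot 7",
        "rights": [
            {"name": "Jane Doe", "share": "50"},
            {"name": "Acme LLC", "is_company": True, "share": "50"},
        ],
    }],
}
entities = list(convert(declaration))
schemas = [e["schema"] for e in entities]
```

Even for a single section, the traversal, the type switch, and the id generation are all tangled together in code, which is the part a declarative mapping would need to replace.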

Proposal

It would be nice to have an extension (or a separate project/product/tool) to map such data sources into entities.
Here are some principles:

  • Declarative
  • YAML based
  • Somewhat compatible with existing mappings (so it can read them too)
  • Probably JMESPath-based: you give one JMESPath expression to extract the section and further JMESPath expressions to extract the content of the section
  • Probably with its own pseudo-language for expressions, or a way to call a predefined macro or user Python function, to deal with objects of a different nature in the same list. For example, if a condition on some flag, field, or combination of fields is met, we yield a Person.
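A rough sketch of how these principles could fit together, under loud assumptions: the mapping keys (`section`, `entities`, `when`, `properties`) are invented here and are not an existing FTM mapping format, the mapping is written as a Python dict where a real tool would load YAML, and `extract` is a tiny dotted-key stand-in for a real JMESPath search so the example runs on the standard library alone.

```python
def extract(data, path):
    """Stand-in for a JMESPath search; supports dotted keys only."""
    for key in path.split("."):
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


# Hypothetical mapping; in practice this would live in a YAML file.
MAPPING = {
    "section": "assets.owners",  # path selecting the list of records
    "entities": [
        {"schema": "Company",
         "when": {"path": "kind", "equals": "legal"},
         "properties": {"name": "title"}},
        {"schema": "Person",
         "when": {"path": "kind", "equals": "natural"},
         "properties": {"name": "title"}},
    ],
}


def apply_mapping(document, mapping):
    """Yield one plain-dict entity per record and matching entity spec."""
    for record in extract(document, mapping["section"]) or []:
        for spec in mapping["entities"]:
            cond = spec.get("when")
            # Skip specs whose condition does not hold for this record.
            if cond and extract(record, cond["path"]) != cond["equals"]:
                continue
            props = {}
            for name, path in spec["properties"].items():
                value = extract(record, path)
                if value is not None:
                    props[name] = [value]
            yield {"schema": spec["schema"], "properties": props}


DOC = {"assets": {"owners": [
    {"kind": "natural", "title": "Jane Doe"},
    {"kind": "legal", "title": "Acme LLC"},
]}}
result = list(apply_mapping(DOC, MAPPING))
```

The `when` clause is the simplest form the "pseudo-language" could take; calling a named user function instead would fit in the same place.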

I have no idea what to do about back-references (for example, when one section describes relatives and another section refers to those relatives by their internal id).
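One possible direction, offered only as a sketch and not as a settled answer, is a two-pass approach: first index the referenced section by internal id, then resolve references from other sections against that index. The field names (`relatives`, `vehicles`, `owner_ref`) and the id format below are hypothetical.

```python
def index_relatives(document):
    """First pass: map each relative's internal id to a generated entity id."""
    return {
        rel["id"]: f"person-{document['id']}-{rel['id']}"
        for rel in document.get("relatives", [])
    }


def resolve_owners(document, index):
    """Second pass: swap internal ids in another section for entity ids."""
    for vehicle in document.get("vehicles", []):
        # A dangling reference resolves to None instead of raising.
        yield vehicle["model"], index.get(vehicle.get("owner_ref"))


DOC = {
    "id": "decl-1",
    "relatives": [{"id": 1, "name": "Jane Doe"}],
    "vehicles": [
        {"model": "Sedan", "owner_ref": 1},
        {"model": "Truck", "owner_ref": 9},
    ],
}
links = list(resolve_owners(DOC, index_relatives(DOC)))
```

How a declarative mapping would express the index step is still the open question; the sketch only shows that the data flow itself is simple.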

@pudo, what do you think?

pudo commented

I think it'd be fun to try and prototype a v2 of mappings, with a much richer modelling language. We should definitely maintain the separation of data cleaning and mapping as two separate tasks - mappings should not do any more data re-formatting than they do right now.

That, of course, does not need to stop us from building better ways to unroll and project complex data structures to entities. Have you ever worked with JMESPath before? It may also be interesting to look at https://github.com/kindly/libflatterer as a way to spec out the projections....

pudo commented

Work on this has now moved to https://github.com/dchaplinsky/thebeast, where there's a pretty cool mapping syntax for nested data structures. We should still consider back-porting this into ftm main, but that first requires making the mapper work on JSON data.