New data integration API
We'll use this issue to track the implementation of the new bdi-kit API.
Implementation:
- `bdi.match_columns()`
- `bdi.top_matches()`
- `bdi.preview_domains()`
- `bdi.preview_value_matches()`
- `bdi.update_matches()`
- `bdi.match_values()`
- `bdi.materialize_mapping()`
Documentation:
- Make sure that all public functions have pydocs
- Make sure that readthedocs is generating API documentation
- Create a notebook demonstrating the API usage
I suggest some naming refactoring before merging this branch into devel. For example, instead of `bdikit.mapping_algorithms.column_mapping.algorithms` we could have `bdikit.matching.column.algorithms`, and likewise `bdikit.matching.value.algorithms` for value matching. As a naming standard, I would suggest that we stick to the schema matching/mapping conventions from a textbook like:
https://link.springer.com/book/10.1007/978-3-642-16518-4
So far `algorithms` is a Python module, but we will have to split it as we add new methods. In that case, `bdikit.matching.column.algorithms` would become a package with modules like `algorithm_type1.py`, `algorithm_type2.py`, etc. Otherwise, we will end up with a lot of code in a single module, which is harder to maintain. I think it's better to do this refactoring in another PR.
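To make the proposed split concrete, here is a rough sketch of how `algorithms` could become a package whose `__init__.py` re-exports the classes, so the import path stays the same for callers. Everything in it is a placeholder: the top-level name `bdikit_sketch` (used so it cannot clash with an installed `bdikit`), the class names `AlgorithmType1` and `AlgorithmType2`, and the empty class bodies are assumptions for illustration, not the actual bdi-kit code. The script writes the layout to a temporary directory only so that it can be run as-is.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout: package, module, and class names are placeholders
# taken from the discussion above, not the real bdi-kit sources.
PKG = "bdikit_sketch/matching/column/algorithms"
FILES = {
    "bdikit_sketch/__init__.py": "",
    "bdikit_sketch/matching/__init__.py": "",
    "bdikit_sketch/matching/column/__init__.py": "",
    # The package __init__ re-exports the classes, so callers keep importing
    # from "...matching.column.algorithms" after the module is split.
    f"{PKG}/__init__.py": (
        "from .algorithm_type1 import AlgorithmType1\n"
        "from .algorithm_type2 import AlgorithmType2\n"
        "__all__ = ['AlgorithmType1', 'AlgorithmType2']\n"
    ),
    f"{PKG}/algorithm_type1.py": "class AlgorithmType1:\n    pass\n",
    f"{PKG}/algorithm_type2.py": "class AlgorithmType2:\n    pass\n",
}


def build_package(root: Path) -> None:
    """Write the placeholder package tree under root."""
    for rel_path, source in FILES.items():
        path = root / rel_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(source)


# Materialize the tree in a temporary directory and import from it.
root = Path(tempfile.mkdtemp())
build_package(root)
sys.path.insert(0, str(root))
importlib.invalidate_caches()

algorithms = importlib.import_module("bdikit_sketch.matching.column.algorithms")
```

With this layout each algorithm lives in its own module, while `from bdikit_sketch.matching.column.algorithms import AlgorithmType1` still works, so the later split would not need to change user-facing imports.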