NickCrews/mismo

The SQL/Ibis powered sklearn of record linkage

Python · LGPL-3.0

Issues

Unable to use 'cluster.connected_components' on a pyspark dataframe
#53 opened 3 months ago by jstammers
2
Add active learning API for predicate-based blocking and matching
#54 opened 3 months ago by jstammers
2
Incorrect join on large tables for add_tfidf
#50 opened 4 months ago by jstammers
4
DuckDB ConversionException when running MinHashLSH example
#45 opened 4 months ago by lmores
6
Testing: Standardized workflow and datasets for speed and performance benchmarking
#19 opened 4 months ago by OlivierBinette
3
Address parsing performance
#47 opened 4 months ago by lmores
4
viz: Plot pair scores using manifold learning
#3 opened a year ago by NickCrews
0
Add ability to sample from blocked pairs when training an FS model
#44 opened 5 months ago by jstammers
3
Poor scaling of add_tfidf to larger datasets
#42 opened 6 months ago by jstammers
7
Inefficient Sampling From Known Labels
#35 opened 6 months ago by jstammers
17
expose Leipzig affiliations dataset
#41 opened 6 months ago by NickCrews
0
Incremental clustering
#36 opened 7 months ago by lmores
1
benchmarks for array.filter(x -> x.isin(<column from other relation>))
#32 opened 8 months ago by NickCrews
1
Add more example datasets
#13 opened a year ago by NickCrews
4
Add TF-IDF comparer based on sklearn
#31 opened 8 months ago by NickCrews
0
feat: test on spark using docker
#27 opened 9 months ago by NickCrews
1
joining on arrays is slow
#29 opened 9 months ago by NickCrews
3
explore ipydatagrid for showing data
#28 opened 9 months ago by NickCrews
0
Consider supporting latent-entity based algorithms
#26 opened 9 months ago by NickCrews
0
feat: plot clusters
#22 opened 10 months ago by NickCrews
1
clustering: silhouette, Rand index, adjusted Rand index, mutual information, etc. reference
#6 opened 10 months ago by NickCrews
1
Support set-wise comparison and pooling
#1 opened 10 months ago by NickCrews
1
ROADMAP
#4 opened a year ago by NickCrews
0
better connected_components() API
#5 opened 10 months ago by NickCrews
1
Testing: test_fs is too computationally and memory intensive
#21 opened 10 months ago by OlivierBinette
8
Testing: Refactor some of the tests to facilitate test-driven development and modularity
#17 opened a year ago by OlivierBinette
2
Why deal with left and right tables?
#20 opened a year ago by OlivierBinette
4
Design: Should type aliases be used for Ibis types?
#16 opened a year ago by OlivierBinette
2
FEAT: Implement Pipelines?
#18 opened a year ago by OlivierBinette
2
Add usage note on metrics
#12 opened a year ago by NickCrews
1
Block using KDTree
#15 opened a year ago by NickCrews
0
Eval: look into new cluster eval metric
#14 opened a year ago by NickCrews
0
Assess link quality via comparison of links vs non-links
#11 opened a year ago by NickCrews
0
Assess link quality via sensitivity analysis
#10 opened a year ago by NickCrews
0
viz: plot blocking with https://upset.app/
#7 opened a year ago by NickCrews
1
EM algorithm of FS model for unlabeled data
#9 opened a year ago by NickCrews
0
FEAT: wizard that checks things for you
#8 opened a year ago by NickCrews
0
Consider using DuckDB for SQL operations
#2 opened a year ago by NickCrews
1