Issues
Incorrect join on large tables for add_tfidf
#50 opened by jstammers - 6
Testing: Standardized workflow and datasets for speed and performance benchmarking
#19 opened by OlivierBinette - 4
Address parsing performance
#47 opened by lmores - 0
Poor scaling of add_tfidf to larger datasets
#42 opened by jstammers - 17
Inefficient Sampling From Known Labels
#35 opened by jstammers - 0
expose Leipzig affiliations dataset
#41 opened by NickCrews - 1
Incremental clustering
#36 opened by lmores - 1
Add more example datasets
#13 opened by NickCrews - 0
Add TF-IDF comparer based on sklearn
#31 opened by NickCrews - 1
feat: test on spark using docker
#27 opened by NickCrews - 3
joining on arrays is slow
#29 opened by NickCrews - 0
explore ipydatagrid for showing data
#28 opened by NickCrews - 0
feat: plot clusters
#22 opened by NickCrews - 1
clustering: silhouette, Rand index, adjusted Rand index, mutual information, etc. reference
#6 opened by NickCrews - 1
Support set-wise comparison and pooling
#1 opened by NickCrews - 0
better connected_components() API
#5 opened by NickCrews - 8
Testing: Refactor some of the tests to facilitate test-driven development and modularity
#17 opened by OlivierBinette - 4
Why deal with left and right tables?
#20 opened by OlivierBinette - 2
FEAT: Implement Pipelines?
#18 opened by OlivierBinette - 1
Add usage note on metrics
#12 opened by NickCrews - 0
Block using KDTree
#15 opened by NickCrews - 0
Eval: look into new cluster eval metric
#14 opened by NickCrews - 0
Assess link quality via sensitivity analysis
#10 opened by NickCrews - 1
viz: plot blocking with https://upset.app/
#7 opened by NickCrews - 0
EM algorithm of FS model for unlabeled data
#9 opened by NickCrews - 0
FEAT: wizard that checks things for you
#8 opened by NickCrews - 1
Consider using DuckDB for SQL operations
#2 opened by NickCrews