bmiller1009/deduper

General deduping engine for JDBC sources with output to JDBC/csv targets

Kotlin · Apache-2.0

Issues

Remove row ids from dupe output
#51 opened 4 years ago by bmiller1009
0
Lock file for csv isn't always deleted when a run is over
#50 opened 4 years ago by bmiller1009
0
Allow ability for string capitalization to be considered or ignored
#49 opened 4 years ago by bmiller1009
0
If publishing thread fails, all consuming threads should be shut down
#48 opened 4 years ago by bmiller1009
0
Update README after Async code merge
#46 opened 5 years ago by bmiller1009
0
Add library build instructions to README
#41 opened 5 years ago by bmiller1009
1
Host dokka content on git
#47 opened 5 years ago by bmiller1009
0
Update javadocs for async merged code
#45 opened 5 years ago by bmiller1009
0
Refactor Consumer classes
#44 opened 5 years ago by bmiller1009
0
Make executor service timeout a parameter in the builder
#43 opened 5 years ago by bmiller1009
0
Use Producer/Consumer pattern to push items to be saved to database
#40 opened 5 years ago by bmiller1009
1
Add option to include json for output hashes into csv option
#42 opened 5 years ago by bmiller1009
0
Allow for on disk/cache hybrid persistence model of "seen" hashes
#37 opened 5 years ago by bmiller1009
1
Flat file output should be locked while being written to
#38 opened 5 years ago by bmiller1009
0
Add the ability for deduper to source hash values from a Kafka topic as well as write them to a target topic
#36 opened 5 years ago by bmiller1009
1
SqlPersistor transaction rollback shouldn't occur in finally block
#39 opened 5 years ago by bmiller1009
0
Improve and expand unit testing
#10 opened 5 years ago by bmiller1009
0
Publish library to maven central
#33 opened 5 years ago by bmiller1009
1
Add Dokka documentation
#34 opened 5 years ago by bmiller1009
0
Fill out proper README
#11 opened 5 years ago by bmiller1009
0
Performance metrics
#35 opened 5 years ago by bmiller1009
0
Improve csv output
#4 opened 5 years ago by bmiller1009
0
Use trove4j to store the long representation of string hashes when building up hash list in type loop
#28 opened 5 years ago by bmiller1009
0
Add ability to preview a sample row of data being hashed
#32 opened 5 years ago by bmiller1009
0
Allow each JNDI connection to be attached to a different context
#31 opened 5 years ago by bmiller1009
0
Add Apache Drill JDBC to open up more sources to read from
#15 opened 5 years ago by bmiller1009
1
Add ability to detect dupes using COUNT/HAVING SQL Syntax
#23 opened 5 years ago by bmiller1009
1
Add command-line functionality
#8 opened 5 years ago by bmiller1009
1
"Seen" hashes to be cached on disk or in memory
#3 opened 5 years ago by bmiller1009
1
Deduper should process null values based on input parameter
#2 opened 5 years ago by bmiller1009
1
Make dupe persistence more efficient
#19 opened 5 years ago by bmiller1009
1
Refactor table creation/deletion code into SqlUtils library
#30 opened 5 years ago by bmiller1009
0
Sql Target should reflect null/not null of the jdbc source
#5 opened 5 years ago by bmiller1009
0
Remove null checks from result set tight loop.
#29 opened 5 years ago by bmiller1009
0
Add ability to gather and persist hashes
#12 opened 5 years ago by bmiller1009
0
Move the delete APIs for target and dupes to the JNDI Target/Dupes objects
#27 opened 5 years ago by bmiller1009
0
Change Builder to take in Csv/Sql target JNDI objects rather than just strings
#26 opened 5 years ago by bmiller1009
0
Option to delete dupe/target persistence
#6 opened 5 years ago by bmiller1009
0
Options for emitting deduped data and duplicates
#13 opened 5 years ago by bmiller1009
0
Improve logging
#7 opened 5 years ago by bmiller1009
0
API should support listing JNDI contexts available
#24 opened 5 years ago by bmiller1009
0
API should support showing items in a JNDI context
#25 opened 5 years ago by bmiller1009
0
Add new JNDI entries programmatically
#20 opened 5 years ago by bmiller1009
0
Make hash column primary key in SQL Persistor
#22 opened 5 years ago by bmiller1009
0
Dupe Count report should contain a count of all dupes as well as unique dupes
#21 opened 5 years ago by bmiller1009
0
File output defaults
#16 opened 5 years ago by bmiller1009
0
Use builder pattern to improve config parameter collection
#17 opened 5 years ago by bmiller1009
1
Duplicate API should store all instances of duplicates
#14 opened 5 years ago by bmiller1009
0
Store string hash as one of the values in the duplicate table
#18 opened 5 years ago by bmiller1009
0
CI/CD pipeline
#9 opened 5 years ago by bmiller1009
0