The DBs need fixing: they contain hex-encoded strings and duplicates
minanagehsalalma opened this issue · 0 comments
a pic of the hex encoded strings
a site that can decode them with no problem
For a long time I have been using them and missed that there are hex-encoded strings at the end of the files, which made me miss results!
I tested on the largest of the DBs, Egypt:
Number of information 👉🏽 44,823,547
Download size 👉🏽 1.55 GB
File size after decompression: 14.2 GB
File size after removing duplicates and hex decoding: 13.6 GB
Line count of the original file: 45,203,980
New line count: 44,411,457 (after using the remove-duplicates function of the PilotEdit tool)
New line count: 44,411,457 (after removing the duplicated lines with a simple Python script)
Duplicated line count: 792,522
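The Python script mentioned above is not included in this issue. A minimal sketch of how such a deduplication could work is shown here. The function name `dedupe_file` and the choice to remember a 16-byte hash of each line instead of the line itself are assumptions for illustration, not the script that was actually used.

```python
import hashlib
from typing import Tuple


def dedupe_file(src: str, dst: str) -> Tuple[int, int]:
    """Copy src to dst, keeping only the first occurrence of each line.

    Returns (total_lines_read, lines_kept). Hypothetical helper, not the
    original script from the issue.
    """
    seen = set()  # digests of the lines already written
    total = kept = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for line in fin:
            total += 1
            body = line.rstrip(b"\r\n")  # ignore line-ending differences
            key = hashlib.blake2b(body, digest_size=16).digest()
            if key in seen:
                continue
            seen.add(key)
            fout.write(body + b"\n")
            kept += 1
    return total, kept
```

The file is streamed line by line, so only the digests are held in memory. With tens of millions of lines that set is still large, so this may need a machine with several GB of free RAM.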
Statistics:
Number of hex-encoded strings: 5K lines
The Egypt file has 41,001,675 unique user IDs, since 3,393,719 users have more than one phone number.
Only 16,473,892 have usernames, which means 27,921,502 don't have usernames.
the already hex decoded files.zip
I decoded them using a simple TypeScript script.
To swap in the decoded strings right away, first merge the 4 files into one using App.Merge.zip with this command:
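The decoding script itself is not reproduced in this issue. The idea can be sketched in Python as follows. This sketch assumes that an affected line consists entirely of hex digits and decodes to UTF-8 text. The exact layout of the encoded lines is not stated here, so treat `decode_hex_line` as a hypothetical helper.

```python
import re

# A whole line made of hex byte pairs, e.g. "48656c6c6f"
HEX_LINE = re.compile(r"(?:[0-9A-Fa-f]{2})+")


def decode_hex_line(line: str) -> str:
    """Return the decoded text if line looks hex-encoded, else line unchanged.

    Assumption: encoded lines are pure hex and decode to printable UTF-8.
    """
    text = line.strip()
    if not HEX_LINE.fullmatch(text):
        return line
    try:
        decoded = bytes.fromhex(text).decode("utf-8")
    except UnicodeDecodeError:
        return line  # valid hex digits, but not UTF-8 text
    # Reject results with control characters (likely a plain number, not text)
    return decoded if decoded.isprintable() else line
```

For example, `decode_hex_line("48656c6c6f")` returns `"Hello"`, while a normal record is returned untouched. An all-digit value could in rare cases decode to printable text by accident, so spot-check the output before replacing anything.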
App.Merge.exe o="output-file.txt" "1.txt" "2.txt" "3.txt" "4.txt"
Then open the output file in UltraEdit or PilotEdit, go to line 45150001, remove all the text after it, and paste in the text from the already hex-decoded files.
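If opening a file this large in an editor is a problem, the same cut-and-paste step can be sketched in Python. `splice_decoded` and its file names are hypothetical. The sketch assumes that line 45150001 itself is kept and that everything after it is replaced.

```python
import shutil
from itertools import islice


def splice_decoded(merged_path: str, decoded_path: str,
                   out_path: str, keep_lines: int) -> None:
    """Write the first keep_lines lines of merged_path, then all of decoded_path."""
    with open(out_path, "wb") as out:
        last = b""
        with open(merged_path, "rb") as merged:
            # Copy only the leading part of the merged file
            for line in islice(merged, keep_lines):
                out.write(line)
                last = line
        # Make sure the decoded text starts on its own line
        if last and not last.endswith(b"\n"):
            out.write(b"\n")
        with open(decoded_path, "rb") as decoded:
            shutil.copyfileobj(decoded, out)
```

A call would look like `splice_decoded("output-file.txt", "decoded.txt", "fixed.txt", keep_lines=45150001)`, where `decoded.txt` and `fixed.txt` are placeholder names for the extracted decoded text and the result.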