ODBParser

TL;DR

ODBParser is a tool to search for PII being exposed in open databases.

ONLY to be used to identify exposed PII and warn server owners of irresponsible database maintenance
OR to query databases you have permission to access!

PLEASE USE RESPONSIBLY

What is this?

I wrote this because I wanted a one-stop OSINT tool for searching, parsing and analyzing open databases in order to identify leaks of PII on third-party servers. Other tools seem to either only search for open databases, or dump them once you've identified them and grab data indiscriminately. This grew from a function or two into what's in this repo, so the code isn't as clean and pretty as it could be.

Features

To identify open databases you can:

query Shodan and BinaryEdge using any of their search parameters (filter by country, port number, whatever)
specify a single IP address
load a file that has a list of IP addresses
paste a list of IP addresses from the clipboard
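
As a rough illustration of the file and clipboard options, here is a minimal sketch of turning a blob of text into a clean list of IP addresses. The function names `parse_ip_list` and `load_ip_file` are hypothetical and are not taken from the script; the sketch assumes addresses are separated by newlines, commas or spaces, and simply skips anything that doesn't parse.

```python
import ipaddress


def parse_ip_list(text):
    """Return the unique, valid IP addresses found in text, in first-seen order."""
    seen = set()
    ips = []
    # Accept one address per line, or several separated by commas/whitespace.
    for token in text.replace(",", " ").split():
        try:
            ip = str(ipaddress.ip_address(token))
        except ValueError:
            continue  # not a valid IPv4/IPv6 address, skip it
        if ip not in seen:
            seen.add(ip)
            ips.append(ip)
    return ips


def load_ip_file(path):
    """Read a text file and parse it with parse_ip_list (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return parse_ip_list(handle.read())
```

The same `parse_ip_list` helper would work for pasted clipboard text, since it only needs a string.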

Dumping options:

parse all databases/collections to identify the data you specify
grab everything hosted on a server
grab just one index/collection
use Ctrl+C to skip dumping a particular index

Post-Processing:

convert JSON dumps to CSV
remove useless columns from CSV
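
To give a feel for the conversion step, here is a minimal standard-library sketch that turns a line-separated JSON dump into CSV while dropping exact duplicate records, records with no values, and columns that are empty everywhere. This is not the script's actual function: the name `jsonl_to_csv` is hypothetical, it assumes flat records, and it treats `null` and empty strings as "empty" rather than working with NaN values.

```python
import csv
import json


def jsonl_to_csv(jsonl_path, csv_path):
    """Convert a line-separated JSON dump to CSV; return the number of rows written."""
    rows, seen, columns = [], set(), []
    with open(jsonl_path, encoding="utf-8") as src:
        for line in src:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Drop empty values so all-empty rows and columns disappear.
            record = {k: v for k, v in record.items() if v not in (None, "")}
            if not record:
                continue
            # Serialize with sorted keys to detect exact duplicate records.
            key = json.dumps(record, sort_keys=True)
            if key in seen:
                continue
            seen.add(key)
            rows.append(record)
            for name in record:
                if name not in columns:
                    columns.append(name)
    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Because a column is only registered when some record has a non-empty value for it, columns that are empty in every record never reach the CSV header.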

Other features:

keeps track of all the IP addresses and databases you have queried, along with info about each server
maintains a stats file with the number of IPs you've queried, the number of databases you've parsed and the number of records you've dumped
converts JSON dumps you already have to CSV
for every database whose total number of records is above your limit, the script creates an entry in a special file along with 5 sample records, so you can review them and decide whether the database is worth grabbing
default output is a line-separated JSON file with one JSON object on each line. You can choose to have it output a "proper JSON" file instead by using the "properjson" flag
you can convert the files to CSV on the fly, or convert only certain files after the run is complete (I recommend the latter). Converted JSON files are moved to a folder called "JSON backups" in the same directory. NOTE: When converting to CSV, the script drops exact duplicate rows and drops columns and rows where all values are NaN, because that's what I wanted to do. Feel free to edit the function if you'd rather have an exact copy of the JSON file.
Windows ONLY: if the script pulls back a huge number of indices that have a field you care about, it lists the names of the DBs, pauses and gives you ten seconds to decide whether to go ahead and pull the data from every index. I've found that if too many databases are returned even after you've specified the fields you want, there's a good chance the data is fake or just useless logs, and you can usually tell which from the names. If you don't act within 10 seconds, the script goes ahead and dumps every index.
as you may have noticed, a lot of people have been scanning for MongoDB databases and holding them hostage, often changing the name to something like "TO_RESTORE_EMAIL_XXXRESTORE.COM". The MongoDB scraper ignores all databases and collections that have been pwned by checking the name of the DB/collection against a list of strings that indicate pwnage
the script is pretty verbose (maybe too verbose), but I like seeing what's going on. Feel free to silence the print statements if you prefer.
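
The pwned-database check described above boils down to a substring match on names. Here is a minimal sketch of that idea; the marker strings and the names `PWNED_MARKERS`, `is_pwned` and `filter_pwned` are made up for illustration, and the script's real list will differ.

```python
# Hypothetical marker substrings; the script's actual list may be different.
PWNED_MARKERS = ("to_restore", "read_me", "hacked")


def is_pwned(name, markers=PWNED_MARKERS):
    """Return True if a DB/collection name contains any ransom-style marker."""
    lowered = name.lower()
    return any(marker in lowered for marker in markers)


def filter_pwned(names, markers=PWNED_MARKERS):
    """Keep only the names that do not look like ransom notes."""
    return [name for name in names if not is_pwned(name, markers)]
```

The comparison is case-insensitive because ransom names tend to show up in all caps, as in the example above.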

Customization

See the odbconfig.py file to specify your parameters, because the name of the game really is exposing the data YOU are interested in. I've provided some examples in the config file. Play around with them!

You can:

specify which index or collection names you want to collect by listing substrings in the config file. For example, if you have the term "client", the script will pull an index called "clients" or "client_data". I recommend keeping these lists blank, as you never know what the databases you care about will be called, and specifying the fields you care about instead.
specify which fields you care about: if you only want to grab ES indices that have "email" in a field name, e.g. "user_emails", you can do that. If you want to make sure the index has at least 2 fields you care about, you can do that too. Or if you just want to grab everything no matter what fields are in there, you can do that too.
specify which indices you DON'T want, e.g. system index names and others that are generally used for basic logging. Examples are provided in the config file.
override the config and grab everything on a server
specify the output format (default is JSON, can choose CSV)
set the minimum and maximum size of database the script will dump by default; you can set a flag to override max docs on a case-by-case basis.
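
To make the name and field rules above concrete, here is a minimal sketch of how such a filter could decide whether an index is worth dumping. The function `wants_index` and its parameter names are hypothetical and do not mirror the variables in the config file; the sketch also assumes that "at least 2 fields" means at least two distinct field names matching any of your terms.

```python
def wants_index(index_name, field_names, *, name_terms=(), excluded_terms=(),
                field_terms=(), min_field_matches=1):
    """Decide whether an index/collection should be dumped (illustrative only)."""
    name = index_name.lower()
    # Unwanted indices (system or logging names) are rejected first.
    if any(term.lower() in name for term in excluded_terms):
        return False
    # If name substrings are configured, the index name must contain one.
    if name_terms and not any(term.lower() in name for term in name_terms):
        return False
    # No field terms configured means "grab everything".
    if not field_terms:
        return True
    # Count field names that contain at least one term of interest.
    matching_fields = [
        field for field in field_names
        if any(term.lower() in field.lower() for term in field_terms)
    ]
    return len(matching_fields) >= min_field_matches
```

For example, with `field_terms=["email"]` an index that has a `user_emails` field passes, while raising `min_field_matches` to 2 would reject it unless a second field of interest is present.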

Installation and Requirements

Clone or download the repo to your machine
Get API keys for Shodan and/or BinaryEdge
Configure parameters in the ODBconfig.py file
Install the requirements from the requirements file

I suggest creating a virtual environment for ODBParser so you have no issues with incorrect module versions. Note: Tested ONLY on Python 3.7.3 and on Windows 10.