/SQLite-Mega-Merge

script for merging thousands of identical sqlite databases based on original mergeScript.py by @gopherchucks

Primary LanguagePythonCreative Commons Zero v1.0 UniversalCC0-1.0

SQLite-Mega-Merge

Script for merging thousands of SQLite databases. Based on original mergeScript.py by @gopherchucks. See the original here: https://github.com/gopherchucks/Database-Merging-SQLite3

If you have thousands of databases with a number of column constraints that are causing you errors with the original merge script, this may be helpful. However, there are some aspects of this code that are specific to my use case. I have tried to document them here.

Essentially, this script differs from the original in three important respects:

  1. Data are pre-processed prior to merging such that data from each table are copied into new tables with identical schema except that the columns do not have specific primary key, NOT NULL, or UNIQUE constraints. This may not be useful in your use case, but improves the likelihood that merges are sucessful.

  2. The databases are attached in blocks of ten to the "mainDB" for the merging process because SQLite3 can only handle ten attached databases at a time.

  3. The donor databases (databases that data is sent from to mainDB) are removed from the directory automatically after merging to conserve disk space. A copy of the original data should therefore be kept elsewhere.

Step 1

Move your databases to an empty directory

Step 2

Pull a copy of MegaMergeScript.py into the directory

Step 3a

Create a list of filenames

For example, using:
ls /home/ubuntu/databases/*.db > filenames.txt

The filenames.txt file should contain a plain list of filenames with a complete filepath, such as that produced by the command above. There is code within the script that will handle this file. It should contain all databases (incl. mainDB at position 0, ideally). Leave it in the databases/ directory with the .db files.

Step 3b (optional)

Modifying the input parameters

#################################################################################
############################## Input Parameters #################################

# 1. DEFINE A LIST OF DBs TO MERGE
################################## 
# define a list of DBs using filenames from the folder containing databases by running ls /home/ubuntu/databases/*.db > filenames.txt
# ...needs to be full path for sqlite3 to interpret correctly
# otherDBs can also be defined by a list e.g. otherDBs = [] for example,
# otherDBs = ['/home/ubuntu/databases/5-9856-4928.db', '/home/ubuntu/databases/5-9856-5376.db']

with open('filenames.txt', 'r') as fileNames:
    otherDBs = [line.strip() for line in fileNames]

# 2. DEFINE A MAIN DB AS A TEMPLATE FOR THE MERGE
################################## 
# This is where the main database is ie. the first if a list of identical DBs but can be defined otherwise

mainDB = otherDBs[0]

... ...

***** If for some reason you're no longer working with mainDB as the first item in filenames.txt, make sure to comment out the pop statement at the top of #3. Database Merge Module.*****

# 3. Database Merge Module 
##################################
## A nested for loop iterates through the blocks in this version
## to prevent an error from trying to attach more than ten DBs at a time

print("Merging databases initiated at: " + strftime("%H:%M", gmtime()))
otherDBs.pop(0) # removes the first database (mainDB) from filenames, turn this off if MainDB is not in filenames.txt <<<==== HERE ======
DBs_attacher = list(divide_list(otherDBs, 10))
nBlocks = len(DBs_attacher)
Total_DBs_attacher = int(sum([len(block) for block in DBs_attacher]))
print("Total: "+str(Total_DBs_attacher)+" Blocks: "+str(nBlocks))

... ...

The program will merge the databases in otherDBs into mainDB. The code as-is is meant to set the first .db file in filenames.txt as mainDB and merge the others into this file. However, mainDB and otherDBs can be set normally, e.g. otherDBs = [obj1, obj2, ... objn]. If so, you need to comment out the pop statement in #3. Database Merge Module (see above).

Step 4

If there are tables that don't need to be merged, set them in Define Functions (#3) get_table_names()

# 3. Get the table names of a database
#
# @param db_name the name of the database file (i.e. "example.db")
# @return a string array of the table names

def get_table_names():
    temp = []
    tables = []
    curs.execute("SELECT name FROM sqlite_master WHERE type='table';")
    temp = curs.fetchall()
    for i in range(0, len(temp)):
        if ("table_example1" in temp[i][0]):        #### Skips merging table "table_example1"
            continue
        if ("table_example2" in temp[i][0]):        #### Skips merging table "table_example1"
            continue
        if ("table_example3" in temp[i][0]):        #### Etc.
            continue
        if ("table_example4" in temp[i][0]):        #### Etc.
            continue
        else:
            tables.append(temp[i][0])
    return tables

Add or comment out the if/else statements to skip merging specific tables.

Step 5

Run MegaMergeScript from /databases/ directory

python3 MegaMergeScript.py

Basic Troubleshooting Insights

Constraint errors - these are tricky and are caused by Primary Key, NOT NULL, UNIQUE constraints on specific columns, etc. These should not occur because you're moving all data into new tables without constrained columns, but getting these errors usually require a look under the hood. Use DB-Browser or equivalent to visually inspect a db and double check the columns and tables you're trying to merge. Adding print statements to check table or column name output can be helpful to check against.

Database size or disk space error - when working with large datasets, the DISK SPACE error is the more likely culprit (unless your data is >100 TB). Make sure to check that your storage is adequate (e.g. if using an AWS instance), and that there are no ongoing processes eating your memory.

SQLite error in view, no such table - if your db has views, it will require the PRAGMA legacy_alter_table=TRUE statement in the pre-processing loop. It comes set this way.

Database connection, or database closed - The conn = sqlite.connect(, timeout = 10) setting can be set to a longer timeout if your processing is significant. This may occur if more than one process is trying to write to the same db or if you've introduced edits and the commmits are not happening frequently enough or in the right places.

What the script is doing

Input Parameters

Here you're telling the script what dbs you're working with. By default, the script will try to merge all the dbs in filenames.txt with the first db in filenames.txt.

Quality Control

Here the scripts gets the names of the tables for merging from mainDB, then iterates through all the otherDBs and compares the table numbers. If there are less tables in main DB than in a 'otherDB', only the tables printed at the top of the QC step will be merged. If there are more tables in mainDB than in otherDB, otherDB is discarded from the merge process entirely. The QC section will print statements to the console that will record which tables are logged as matches or exceptions during the iteration phase as well as at the end. If there are a lot of dbs, this is helpful to avoid excess scrolling.

Merging Databases

Firstly, the databases go through a pre-processing stage where the data is copied into identical tables, but without column constraints. This may require tweaking for your use case but I found personally that column constraints were difficult to work around while trying to merge tables. There are ways to renumber UNIQUE INTEGER or PRIMARY KEY columns for instance, using counters during the loop process, but I have left this out for the sake of simplicity. Alternatively, you can renumber these columns in the mainDB after merging and use similar code to re-copy the full dataset into new tables with the required constraints. This section is a useful template in that respect. Again, the code does not do this because I felt this was an overly specific addition.

Secondly, the dataases go through the merging process. This relies on the same scheme as in @gopherchuck's original code. However, SQLite3 can only attach ten databases to mainDB at a time, so the otherDBs list is used to create DBs_attacher, a nested list of lists, ten a piece. The code essentially goes through the same process, but requires the database attachment and merge process in a nested for-loop, instead of two separate loops. Elsewhere, counters have been adjusted to reflect the counting process for handling blocks and sub-blocks.

Other notes

The code is not generalized and contains some parts that are vestiges of other modules I am not currently running. I apologize if there are some inefficiencies, as this was not my goal in developing this version of @gopherchucks code.