probcomp/bayeslite

Slow query performance with loom crosscat vs. baseline (lovcat) crosscat

versar opened this issue · 4 comments

I compared query performance between loom crosscat and the default baseline (lovcat) crosscat. ESTIMATE SIMILARITY and ESTIMATE DEPENDENCE queries take longer with loom. The slowdown is roughly a factor of n_models (a.k.a. the number of ANALYSES), though not exactly.

For example, below are runtimes from identical workflows I performed with loom and the default crosscat. The runtime differences are similar with multiprocess on and with multiprocess off.

n_models = 32:
ESTIMATE SIMILARITY with 300 variables: 3.91 seconds default crosscat, 69.9 seconds loom
ESTIMATE DEPENDENCE with 300 variables: 300 seconds default crosscat; loom interrupted after a long wait
ESTIMATE DEPENDENCE with 30 variables: 0.22 seconds default crosscat, 9.8 seconds loom

n_models = 16:
ESTIMATE SIMILARITY with 300 variables: 0.63 seconds default crosscat, 29 seconds loom
ESTIMATE DEPENDENCE with 300 variables: 148 seconds default crosscat; loom interrupted after a long wait
ESTIMATE SIMILARITY with 30 variables: 0.62 seconds default crosscat, 5.3 seconds loom
ESTIMATE DEPENDENCE with 30 variables: 0.17 seconds default crosscat, 5.0 seconds loom

n_models = 2:
ESTIMATE DEPENDENCE with 30 variables: 0.10 seconds default crosscat, 0.84 seconds loom
ESTIMATE SIMILARITY with 30 variables: 0.23 seconds default crosscat, 0.93 seconds loom
ESTIMATE DEPENDENCE with 300 variables: 0.10 seconds default crosscat, 0.84 seconds loom
ESTIMATE SIMILARITY with 300 variables: 0.23 seconds default crosscat, 0.93 seconds loom

fsaad commented

We need to convert the Python loop over models into a single SQL lookup:

dependence probability:

    for modelno in modelnos:
        kind0 = self._get_kind_id(bdb, generator_id, modelno, colno0)
        kind1 = self._get_kind_id(bdb, generator_id, modelno, colno1)
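One round trip can fetch the kind assignments of both columns across all models at once, after which the per-model comparison happens in Python. A hedged sketch against a stand-in in-memory sqlite3 database (the schema is inferred from _get_kind_id below; the generator, column numbers, and data are illustrative, not from bayeslite):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE bayesdb_loom_column_kind_partition
        (generator_id, modelno, colno, kind_id);
    -- Two models: columns 3 and 4 share a kind in model 0 but not model 1.
    INSERT INTO bayesdb_loom_column_kind_partition VALUES
        (1, 0, 3, 2), (1, 0, 4, 2),
        (1, 1, 3, 5), (1, 1, 4, 6);
''')

# Single query replacing the per-model _get_kind_id calls.
cursor = db.execute('''
    SELECT modelno, colno, kind_id
    FROM bayesdb_loom_column_kind_partition
    WHERE generator_id = ? AND colno IN (?, ?)
''', (1, 3, 4))

kinds = {}  # modelno -> {colno: kind_id}
for modelno, colno, kind_id in cursor:
    kinds.setdefault(modelno, {})[colno] = kind_id

# Columns are structurally dependent in a model iff they share a kind.
dependent = [int(k[3] == k[4]) for _, k in sorted(kinds.items())]
```

Here `dependent` comes out as `[1, 0]` for the illustrative data: the columns share a kind in model 0 only.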

row similarity:

    for modelno in modelnos:
        kind_id = self._get_kind_id(
            bdb, generator_id, modelno, colnos[0])
        cursor = bdb.sql_execute('''
            SELECT partition_id
            FROM bayesdb_loom_row_kind_partition
            WHERE generator_id = ?
                AND modelno = ?
                AND kind_id = ?
                AND rowid IN (?, ?)
        ''', (generator_id, modelno, kind_id, rowid, target_rowid,))
        partition_ids = cursor.fetchall()
        assert len(partition_ids) == 2
        similar = partition_ids[0] == partition_ids[1]
        model_similarities.append(int(similar))

kind lookup:

    def _get_kind_id(self, bdb, generator_id, modelno, colno):
        cursor = bdb.sql_execute('''
            SELECT kind_id
            FROM bayesdb_loom_column_kind_partition
            WHERE generator_id = ?
                AND modelno = ?
                AND colno = ?
        ''', (generator_id, modelno, colno,))
        return util.cursor_value(cursor)
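Combining the snippets above, the row-similarity loop could collapse into one JOIN between the column and row partition tables, yielding a per-model similarity bit in a single query. A hedged sketch on a stand-in sqlite3 database (schema inferred from the snippets; two models, rows 10 and 11, column 5 are illustrative):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE bayesdb_loom_column_kind_partition
        (generator_id, modelno, colno, kind_id);
    CREATE TABLE bayesdb_loom_row_kind_partition
        (generator_id, modelno, kind_id, rowid, partition_id);
    INSERT INTO bayesdb_loom_column_kind_partition VALUES
        (1, 0, 5, 2), (1, 1, 5, 7);
    INSERT INTO bayesdb_loom_row_kind_partition VALUES
        (1, 0, 2, 10, 3), (1, 0, 2, 11, 3),
        (1, 1, 7, 10, 4), (1, 1, 7, 11, 5);
''')

# One query: for each model, find the kind containing the target column,
# then compare the partition ids of the two rows within that kind.
cursor = db.execute('''
    SELECT c.modelno, r0.partition_id = r1.partition_id
    FROM bayesdb_loom_column_kind_partition AS c
    JOIN bayesdb_loom_row_kind_partition AS r0
        ON r0.generator_id = c.generator_id
        AND r0.modelno = c.modelno
        AND r0.kind_id = c.kind_id
    JOIN bayesdb_loom_row_kind_partition AS r1
        ON r1.generator_id = c.generator_id
        AND r1.modelno = c.modelno
        AND r1.kind_id = c.kind_id
    WHERE c.generator_id = ? AND c.colno = ?
        AND r0.rowid = ? AND r1.rowid = ?
    ORDER BY c.modelno
''', (1, 5, 10, 11))
model_similarities = [sim for _, sim in cursor]
```

For the illustrative data, `model_similarities` is `[1, 0]`: the two rows share a partition in model 0 but not in model 1.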

fsaad commented

After re-running the benchmarks, @versar reports that the candidate fix in feb1e22 does not appear to have improved the runtime. The remaining overhead of Loom may be due to reading data from disk rather than from memory; if that is the case, we should consider caching results.

  • Retrieve the benchmark test from @versar.
  • Re-profile the dependence queries and confirm the hypothesis.
  • Decide on next steps, e.g. caching or otherwise.

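On the caching point: one option is to memoize the per-(generator, model, column) kind lookup for the lifetime of a session, so repeated ESTIMATE queries avoid re-reading the partition tables from disk. A minimal sketch; the class and names are hypothetical, not bayeslite API, and the cache would need invalidation if the generator is re-analyzed:

```python
class KindCache:
    """Memoize kind lookups; illustrative sketch, not part of bayeslite."""

    def __init__(self, query_fn):
        self._query_fn = query_fn  # the real lookup, e.g. a SQL query
        self._cache = {}
        self.misses = 0  # counts how often we actually hit the database

    def get_kind_id(self, generator_id, modelno, colno):
        key = (generator_id, modelno, colno)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._query_fn(*key)
        return self._cache[key]

# Stand-in for the SQL lookup: pretend every column lives in kind 7.
cache = KindCache(lambda g, m, c: 7)
first = cache.get_kind_id(1, 0, 5)
again = cache.get_kind_id(1, 0, 5)  # served from the cache, no second query
```

After the two calls above, `cache.misses` is 1: only the first call reaches the underlying lookup.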
fsaad commented

Confirmed the issue for both cases (and the same will hold for all Loom queries): the culprit is the per-query invocation of _check_loom_initialized. The check gives good error messages but adds unreasonable computational overhead.

    def _check_loom_initialized(self, bdb, generator_id):
        cursor = bdb.sql_execute('''
            SELECT COUNT(*)
            FROM bayesdb_loom_row_kind_partition
            WHERE generator_id = ?
        ''', (generator_id,))
        count_row = cursor.fetchall()
        cursor = bdb.sql_execute('''
            SELECT COUNT(*)
            FROM bayesdb_loom_column_kind_partition
            WHERE generator_id = ?
        ''', (generator_id,))
        count_col = cursor.fetchall()
        if count_row[0][0] == 0 or count_col[0][0] == 0:
            raise BQLError(bdb, 'Analyze must be run before any BQL'
                ' queries when using loom.')

Resolution plan:

  • Fix #586 to significantly reduce the chance that a user encounters this error.
  • Remove all invocations of _check_loom_initialized.
  • Add a ticket for comprehensible error messages, perhaps using a cached boolean flag or a better SQL query for the check.

fsaad commented

0d80bf2 should contain the resolution for this issue (tested on the same benchmarks as in the first post). @versar can reopen the ticket if they encounter further issues.

Note that 0d80bf2 also fixes #586, so that INITIALIZE actually initializes models using a single streaming pass through the data.