evidence-dev/evidence

[Bug]: SQLite: Error: Unknown Error


Describe the bug

I'm trying to load an SQLite database that's around 100MB.

It seems I'm hitting this line when trying to access a table in the database that's bigger than 32MB:
https://github.com/evidence-dev/evidence/blob/main/packages/lib/sdk/src/plugins/datasources/wrapSimpleConnector.js#L52

By reading the code, I can't see a workaround for this.
Note: all of the .sql files query the same database file.

Steps to Reproduce

  • Try to load a SQLite database that's bigger than 32MB (a sketch for generating one is below)
  • Try to access a table that's bigger than 32MB
  • Run npm run sources
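
For reference, a database like this can be generated with plain SQL using a recursive CTE and randomblob(). This is a minimal sketch, assuming the sqlite3 CLI; the table and column names (big_table, payload) are made up for illustration:

-- Run inside the sqlite3 CLI against a fresh file, e.g.: sqlite3 repro-database.db
-- 500k rows x ~200 bytes of random payload comfortably exceeds 32MB.
CREATE TABLE big_table AS
WITH RECURSIVE n(i) AS (
  SELECT 1
  UNION ALL
  SELECT i + 1 FROM n
  LIMIT 500000
)
SELECT i AS id, hex(randomblob(100)) AS payload FROM n;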

Logs

> evidence sources

✔ Loading plugins & sources
-----
  [Processing] articles_connection
  article_categories ✔ Finished, wrote 108594 rows.
  article_classification ✔ Finished, wrote 64421 rows.
  articles-database ⚠ No results returned.
  articles ✖ Error: Unknown Error
  categories ✔ Finished, wrote 155 rows.
-----
  Evaluated sources, saving manifest
  ✅ Done!


With debug enabled, I get the following output:

$ NODE_OPTIONS="--max-old-space-size=4096" npm run sources -- --debug;

> my-evidence-project@0.0.1 sources
> evidence sources --debug

Evidence running with debug logging
✔ Loading plugins & sources
-----
  [Processing] articles_connection
  article_categories ◢ Processing...[DEBUG]:  Building parquet file article_categories.parquet
[DEBUG]:  Reading rows from a generator object
  article_categories ◥ Processing...[DEBUG]:  Measure: "buildMultipartParquet" {
  duration: 2868.07013,
  meta: { 'batch number': 0 },
  parents: [ 'buildMultipartParquet' ]
}
[DEBUG]:  Flushing batch 0 with 108594 rows
[DEBUG]:  Flushing batch 0 with 108594 rows
  article_categories ◢ Processing...[DEBUG]:  Measure: "flush" {
  duration: 418.50450400000045,
  meta: { 'batch number': 0 },
  parents: [ 'buildMultipartParquet' ]
}
[DEBUG]:  Flushed batch 0 with 108594 rows
  article_categories ◣ Processing...[DEBUG]:  Measure: "buildMultipartParquet" {
  duration: 3679.8906589999997,
  meta: { 'output filename': 'article_categories.parquet' },
  parents: []
}
  article_categories ✔ Finished, wrote 108594 rows.
  article_classification ◢ Processing...[DEBUG]:  Building parquet file article_classification.parquet
[DEBUG]:  Reading rows from a generator object
  article_classification ◢ Processing...[DEBUG]:  Measure: "buildMultipartParquet" {
  duration: 2006.8502520000002,
  meta: { 'batch number': 0 },
  parents: [ 'buildMultipartParquet' ]
}
[DEBUG]:  Flushing batch 0 with 64421 rows
[DEBUG]:  Flushing batch 0 with 64421 rows
  article_classification ◣ Processing...[DEBUG]:  Measure: "flush" {
  duration: 806.4016789999996,
  meta: { 'batch number': 0 },
  parents: [ 'buildMultipartParquet' ]
}
[DEBUG]:  Flushed batch 0 with 64421 rows
  article_classification ◤ Processing...[DEBUG]:  Measure: "buildMultipartParquet" {
  duration: 3263.2427499999994,
  meta: { 'output filename': 'article_classification.parquet' },
  parents: []
}
  article_classification ✔ Finished, wrote 64421 rows.
Will not eagerly load files larger than 32 Megabytes.
  articles-database ⚠ No results returned.
  articles ✖ Error: Unknown Error
  categories ◢ Processing...[DEBUG]:  Building parquet file categories.parquet
[DEBUG]:  Reading rows from a generator object
[DEBUG]:  Measure: "buildMultipartParquet" {
  duration: 4.869833999999173,
  meta: { 'batch number': 0 },
  parents: [ 'buildMultipartParquet' ]
}
[DEBUG]:  Flushing batch 0 with 155 rows
[DEBUG]:  Flushing batch 0 with 155 rows
[DEBUG]:  Measure: "flush" {
  duration: 3.3371879999995144,
  meta: { 'batch number': 0 },
  parents: [ 'buildMultipartParquet' ]
}
[DEBUG]:  Flushed batch 0 with 155 rows
[DEBUG]:  Measure: "buildMultipartParquet" {
  duration: 14.027275999998892,
  meta: { 'output filename': 'categories.parquet' },
  parents: []
}
  categories ✔ Finished, wrote 155 rows.
-----
  Evaluated sources, saving manifest
  Updating schema 'articles_connection'
  | Schema exists already
  | 4 queries found
  |   article_categories
  |   article_classification
  |   articles-database
  |   categories
  | 0 queries are new
  | 3 queries already exists
  |   static/data/articles_connection/article_categories/article_categories.parquet
  |   static/data/articles_connection/article_classification/article_classification.parquet
  |   static/data/articles_connection/categories/categories.parquet
  | 3 queries to be rendered
  |   static/data/articles_connection/article_categories/article_categories.parquet
  |   static/data/articles_connection/article_classification/article_classification.parquet
  |   static/data/articles_connection/categories/categories.parquet
  ✅ Done!

System Info

System:
    OS: macOS 15.1.1
    CPU: (12) x64 Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
    Memory: 217.51 MB / 16.00 GB
    Shell: 3.2.57 - /bin/bash
  Binaries:
    Node: 22.11.0 - ~/.nvm/versions/node/v22.11.0/bin/node
    npm: 10.5.0 - ~/.nvm/versions/node/v22.11.0/bin/npm
    pnpm: 8.7.5 - ~/Library/pnpm/pnpm
    Watchman: 2023.07.03.00 - /usr/local/bin/watchman
  npmPackages:
    @evidence-dev/bigquery: ^2.0.8 => 2.0.8
    @evidence-dev/core-components: ^4.8.13 => 4.8.13
    @evidence-dev/csv: ^1.0.13 => 1.0.13
    @evidence-dev/databricks: ^1.0.7 => 1.0.7
    @evidence-dev/duckdb: ^1.0.12 => 1.0.12
    @evidence-dev/evidence: ^39.1.17 => 39.1.17
    @evidence-dev/motherduck: ^1.0.3 => 1.0.3
    @evidence-dev/mssql: ^1.1.1 => 1.1.1
    @evidence-dev/mysql: ^1.1.3 => 1.1.3
    @evidence-dev/postgres: ^1.0.6 => 1.0.6
    @evidence-dev/snowflake: ^1.2.1 => 1.2.1
    @evidence-dev/sqlite: ^2.0.6 => 2.0.6
    @evidence-dev/trino: ^1.0.8 => 1.0.8

Severity

blocking all usage of Evidence

Additional Information, or Workarounds

This is my connection.yaml file:

name: articles_connection
type: sqlite
options:
  filename: articles-database.db
  readonly: true

This is the articles.sql file:

select * from articles limit 1;

Can you also confirm what columns and column types are in your SQLite file, in the articles and articles-database tables?

@archiewood This is the query that creates the articles table:

CREATE TABLE IF NOT EXISTS articles (
	id TEXT PRIMARY KEY,
	title TEXT NOT NULL,
	published TEXT NOT NULL,
	abstract TEXT NOT NULL,
	conclusion TEXT,
	link TEXT UNIQUE NOT NULL,
	input_token INTEGER,
	output_token INTEGER,
	created_at TEXT DEFAULT CURRENT_TIMESTAMP,
	updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
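
All of the declared types can be verified directly against the database with SQLite's built-in pragma_table_info table-valued function (a quick sketch, assuming SQLite 3.16 or later):

-- List every column of the articles table with its declared type;
-- an empty "type" would indicate a typeless column.
SELECT name, type, "notnull", pk
FROM pragma_table_info('articles');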

There is no articles-database table; that's just the database file itself. This is the structure of the folder:

$ cd sources/articles_connection/
$ ls -l
total 122832
-rw-r--r--  1 user  staff        33 Nov 21 19:06 article_categories.sql
-rw-r--r--  1 user  staff        37 Nov 21 19:06 article_classification.sql
-rw-r--r--@ 4 user  staff  61861888 Nov 21 13:40 articles-database.db
-rw-r--r--  1 user  staff        23 Nov 21 19:53 articles.sql
-rw-r--r--  1 user  staff        25 Nov 21 19:05 categories.sql
-rw-r--r--  1 user  staff        98 Nov 21 19:50 connection.yaml

I think this is the expected behaviour regarding the loading of the files. We don't need to read the SQLite file's contents as text; we just need to query it.

However, the error is happening when running the query articles.sql, and the error message that comes back is not helpful!

@archiewood I see. Indeed, the 32MB log line probably comes from trying to load the .db file itself, not from the query! That makes total sense!

But my biggest problem is indeed the articles.sql file. Please let me know what I can do to provide more relevant data! I'm very interested in solving this problem.

I assume this same query runs successfully against sqlite in some other client?

@archiewood Yes, it does:

SQLite version 3.42.0 2023-05-16 12:36:15
Enter ".help" for usage hints.
sqlite> select * from articles limit 1;
1653afcd-92ea-4468-8891-99c9fcc7275e|Randomized Autoregressive Visual Generation|2024-11-01|This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence-typically ordered in raster form-is randomly permuted into different factorization orders with a probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing training strategy enables the model to learn to maximize the expected likelihood over all factorization orders and thus effectively improve the model's capability of modeling bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made available at https://github.com/bytedance/1d-tokenizer||https://arxiv.org/abs/2411.00776|2024-11-04 17:28:49|2024-11-07 09:26:35||
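
If it helps, the storage classes of the actual stored values can also be checked in the same session with SQLite's typeof() function (a diagnostic sketch; the column list is taken from the schema above):

-- SQLite values carry their own storage class independent of the declared
-- column type; a surprising class (e.g. a BLOB in a TEXT column) is a
-- common cause of reader errors downstream.
select typeof(id), typeof(title), typeof(published), typeof(abstract),
       typeof(conclusion), typeof(link), typeof(input_token),
       typeof(output_token), typeof(created_at), typeof(updated_at)
from articles limit 1;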

@archiewood Seems like I managed to track down the cause and put together a fix:
#2849

Please modify as you see fit!
If possible, can you let me know when this will be released? This is blocking a page that I'm trying to build, and I would love to release my code as soon as possible.

Thank you!

Hi @luanmuniz - thanks for the PR.

We'll look at this next week. Next release is scheduled for Thursday 28th.

If you want to get unblocked faster, you could release your version as a community plugin.

https://docs.evidence.dev/plugins/create-source-plugin/

Since you have written all the code already, I imagine it will just be a bit of copy-pasting.

You can then install your plugin in Evidence and drive on!

Should be decent instructions here in the template:

https://github.com/evidence-dev/datasource-template

if you have any questions or issues with the template let me know!

More context on this from one user:

https://evidencedev.slack.com/archives/C023LRB9Z40/p1733499597210059?thread_ts=1732827248.791199&cid=C023LRB9Z40

In SQLite you are able to define a column without a datatype (argh). Those columns are causing the error in Evidence.
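
For anyone hitting this, here is a minimal sketch of what such a table looks like and how to detect it (the table name untyped_demo is made up; pragma_table_info reports the declared type as an empty string):

-- A perfectly legal SQLite table where neither column declares a type.
CREATE TABLE untyped_demo (id, label);

-- Detect typeless columns: the declared type comes back as an empty string.
SELECT name, type
FROM pragma_table_info('untyped_demo')
WHERE type = '';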

@archiewood Just to make it clear, this is not the problem I was having. You can see in the table schema I sent a few messages back that all columns have types.

Am I missing something?

Ah, good point, I misremembered. I thought these might have the same cause.