mithrandie/csvq

Can it replace the H2 database engine?

Opened this issue · 15 comments

The functionality is very powerful. I have a question. If a CSV file is 1 TB in size, what would the query efficiency be?
Does csvq need to read all 1 TB of content into memory before further processing?
Can it replace the H2 database engine?

No, csvq reads all the data into memory at runtime, so trying to handle a 1TB file is nearly impossible.

hw2499 commented
  1. Is there any guidance on what file sizes csvq is suited to processing?
  2. How can I handle this 1 TB file scenario?

The file size you can handle depends on the query you want to execute and your system.

csvq is not a DBMS but an SQL interpreter that executes queries against text files such as CSV.
I don't know what you want to do with the file or how you want to handle it, so I can't say which approach would be appropriate. In general, however, a DBMS designed to handle large amounts of data is the appropriate choice.
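For reference, csvq can also be used from a Go program through its database/sql driver (mithrandie/csvq-driver). A minimal sketch, assuming that driver and a hypothetical users.csv under ./data:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mithrandie/csvq-driver" // registers the "csvq" driver
)

func main() {
	// The data source name is the directory that holds the CSV files.
	db, err := sql.Open("csvq", "./data")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Query a CSV file directly; the table name is the file name.
	rows, err := db.Query("SELECT id, name FROM `users.csv` WHERE id < 100")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int
		var name string
		if err := rows.Scan(&id, &name); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, name)
	}
}
```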

hw2499 commented

I want to use this SQL interpreter to process large text files, but I am not sure how efficiently it handles them. If it is very efficient, it could do many of the things an in-memory database does.

Have you tried DuckDB?

hw2499 commented

I am not looking for a database, but for a solution that can process big data in memory. I feel that csvq is very powerful.

I agree that csvq is a very capable tool for querying CSV files using SQL and I still use it where it makes sense. DuckDB can do many of the same things and can process as much data as will fit in memory without ever creating a single table. Like csvq, it can read this data directly from CSV files, but also from Parquet files, and write results to such files. You need not create a DuckDB database to query very large CSV files in memory!
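To make that concrete, here is a minimal sketch of querying a CSV directly from Go and writing the result to Parquet, assuming the marcboeker/go-duckdb binding (which registers a database/sql driver named "duckdb" and needs cgo); the file names and columns are made up:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/marcboeker/go-duckdb" // registers the "duckdb" driver
)

func main() {
	// An empty DSN opens an in-memory DuckDB database; no tables are created below.
	db, err := sql.Open("duckdb", "")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Query the CSV file directly, just like a table.
	row := db.QueryRow(`SELECT count(*), CAST(sum(amount) AS DOUBLE)
	                    FROM read_csv_auto('sales.csv')`)
	var n int64
	var total float64
	if err := row.Scan(&n, &total); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("rows=%d total=%.2f\n", n, total)

	// Results can also be written straight to a Parquet file.
	_, err = db.Exec(`COPY (SELECT * FROM read_csv_auto('sales.csv') WHERE amount > 0)
	                  TO 'sales_positive.parquet' (FORMAT PARQUET)`)
	if err != nil {
		log.Fatal(err)
	}
}
```

No CREATE TABLE or import step is involved; the CSV and Parquet files themselves act as the tables.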

hw2499 commented

What type of database is DuckDB? I read the documentation and it seems similar to SQLite. What are the main scenarios in which DuckDB is used?

DuckDB is similar to SQLite in that it is an embedded database, but unlike SQLite, DuckDB is a column-oriented database, meaning it stores data column-wise rather than row-wise. Column stores like DuckDB are designed for analysis of large data sets, while SQLite is designed more for transaction processing. Neither has a server component, so both are designed for local data processing on a personal computer. Column-oriented databases use various column data compression methods to store data more efficiently, scan and retrieve only the columns that a query selects (good for querying wide tables), and generally execute aggregate queries on columns more quickly.
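A small sketch of the column-pruning point, under the same go-duckdb assumptions as above and with a hypothetical wide events.parquet; only the two referenced columns are read, however many others the file contains:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/marcboeker/go-duckdb"
)

func main() {
	db, err := sql.Open("duckdb", "")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Only the 'country' and 'bytes' columns are read from the Parquet file.
	rows, err := db.Query(`
		SELECT country, CAST(sum(bytes) AS BIGINT) AS total_bytes
		FROM read_parquet('events.parquet')
		GROUP BY country
		ORDER BY total_bytes DESC
		LIMIT 10`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var country string
		var total int64
		if err := rows.Scan(&country, &total); err != nil {
			log.Fatal(err)
		}
		fmt.Println(country, total)
	}
}
```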

hw2499 commented
  1. Can DuckDB process (query) big data like 1 TB in memory, or are there any other solutions?
  2. Does the DuckDB client API support Go?
  1. Can DuckDB process (query) big data like 1 TB in memory, or are there any other solutions?

DuckDB streams the input and results of many query operations, so many DuckDB queries can process very large data sets, even 1 TB, on computers that have far less memory.
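As a rough sketch of what that looks like, under the same go-duckdb assumptions as above (the file name and limits are made up), you can cap DuckDB's memory well below the input size and let it stream the scan and spill to disk where needed:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/marcboeker/go-duckdb"
)

func main() {
	db, err := sql.Open("duckdb", "")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Cap DuckDB's memory use far below the input size; operators that need
	// more than this spill to the temp directory instead of failing.
	if _, err := db.Exec(`SET memory_limit = '4GB'`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`SET temp_directory = '/tmp/duckdb_spill'`); err != nil {
		log.Fatal(err)
	}

	// The CSV is scanned in a streaming fashion; it is never loaded whole.
	row := db.QueryRow(`
		SELECT count(*), count(DISTINCT customer_id)
		FROM read_csv_auto('huge_1tb_file.csv')`)
	var rowCount, customers int64
	if err := row.Scan(&rowCount, &customers); err != nil {
		log.Fatal(err)
	}
	fmt.Println("rows:", rowCount, "distinct customers:", customers)
}
```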

  2. Does the DuckDB client API support Go?

https://pkg.go.dev/github.com/benjajaja/go-duckdb

hw2499 commented
  1. OK, I will look into using DuckDB and then adapt it to the etl-engine product.
    https://github.com/hw2499/etl-engine
  2. What OLAP functions does DuckDB support?
kpym commented

@hw2499, @derekmahar Maybe the end of this discussion should be moved to the DuckDB discussion forum? IMHO, the csvq issue tracker is not the place for this.

@kpym, I agree. I actually didn't realise that this discussion was about a csvq issue. I thought it was a csvq discussion topic. In any case, in the DuckDB discussion forums on GitHub or the DuckDB Discord server, the DuckDB developers and other users could better answer @hw2499's questions.

hw2499 commented

Okay, thank you. @kpym @derekmahar