/singlestoredb-go-arrow-driver

SingleStoreDB driver for reading data in arrow format

Primary LanguageGoApache License 2.0Apache-2.0

SingleStoreDB Go Arrow Driver

Attention: The code in this repository is intended for experimental use only and is not fully tested, documented, or supported by SingleStore. Visit the SingleStore Forums to ask questions about this repository.

The SingleStoreDB Go Arrow driver facilitates the reading of data in Apache Arrow format from SingleStoreDB databases. Note that this is the alpha release of the driver, and there may be changes to the API, type conversion, and other internal implementations in the future.

Installation

Run the following command to add the SingleStoreDB Go Arrow driver as a dependency to your Go module:

go get github.com/singlestore-labs/singlestoredb-go-arrow-driver

MySQL driver dependency is required to use this driver:

go get github.com/go-sql-driver/mysql@v1.7.2-0.20230809113539-7cf548287682

Use the following code to import dependencies:

import (
    "database/sql"

    _ "github.com/go-sql-driver/mysql"
    s2db_arrow_driver "github.com/singlestore-labs/singlestoredb-go-arrow-driver"
)

API

The S2DBArrowReader interface provides an API for reading Apache Arrow data from SingleStoreDB databases. To create a new instance of S2DBArrowReader, use the NewS2DBArrowReader function. S2DBArrowReader provides the following methods:

  • GetNextArrowRecordBatch: Retrieves a single Record object (arrow.Record) from the database. When there are no more records to fetch, it returns nil as the first part of the result tuple. You must release the returned Record using the Release() method after use.
  • Close: Finalizes the reading of query results and releases all the acquired resources.

Configuration

The NewS2DBArrowReader function takes S2DBArrowReaderConfig as a parameter. Here are the supported reader configurations and their explanations:

Name Default Description
Conn No default (required) The sql.DB object used to connect with a SingleStoreDB database.
Args nil (no arguments) Arguments for placeholder parameters in the query.
RecordSize 10000 The maximum number of rows in the resulting records.
ParallelReadConfig nil (sequential read) Additional configurations for parallel read. If this value is non-nil, parallel read is enabled.
UseClientConvesion false Indicates if the data should be converted to Arrow Record format on the client. It can be set to true for test purposes, performance optimizations take place only when it is false
EnableQueryLogging false Controls whether the driver should generate debug logs. Debug logs are printed to the standard output.

The S2DBParallelReadConfig allows you to configure additional settings for parallel read. Here are the additional configurations that can be set:

Name Default Description
DatabaseName No default (required) The name of the SingleStoreDB database. It is used to determine the number of partitions for parallel reading.
ChannelSize 10000 The size of the channel buffer. The channel stores references to Arrow Records while reading is in progress and transfers them to the main goroutine.
EnableDebugProfiling false Controls whether to profile the query. Profiling result is printed to the standart output.

Note: Set interpolateParams=true parameter of the sql.DB in order to use parallel read. If this parameter is not set - you will get the following error: This command is not supported in the prepared statement protocol yet

Note: Currently parallel read with UseClientConvesion = false is not supported. So if you set ParallelReadConfig, you must also set UseClientConvesion to true.

Usage example

dsn := fmt.Sprintf("%s:%s@tcp(%s:%d)/%s?interpolateParams=true", "user", "password", "host", 3306, "database")
dsn += fmt.Sprintf("&connectionAttributes=%s:%s,%s:%s", "program_name", "CompanyName_AppName", "program_version", "1.2.3")
db, err := sql.Open("mysql", dsn)
if err != nil {
    // Handle the error
}

arrowReader, err := s2db_arrow_driver.NewS2DBArrowReader(
    context.Background(), 
    s2db_arrow_driver.S2DBArrowReaderConfig{
        Conn:  db,
        Query: "SELECT * FROM t WHERE a > ? AND a < ?",
            Args: []interface{}{1, 10},
        // uncomment lines below to use parallel read instead of Arrow conversion on Server
        // ParallelReadConfig: &s2db_arrow_driver.S2DBParallelReadConfig{
        //     DatabaseName: "db",
        // },
        // UseClientConvesion: true,
    })
if err != nil {
    // Handle the error
}
defer arrowReader.Close()

for batch, err := arrowReader.GetNextArrowRecordBatch(); batch != nil; batch, err = arrowReader.GetNextArrowRecordBatch() {
    if err != nil {
        // Handle the error
    }
    defer batch.Release()

    // Process the batch
}

Performance Considerations

To achieve maximum performance, consider using parallel read. The performance of parallel read depends on the size of the SingleStore cluster and the number of CPU cores on the machine where the code runs. SingleStore recommends using a machine where the number of CPU cores is equal to the number of partitions in the SingleStoreDB database. While the above holds true, parallel read will be supported in the next release.

Additionally, performance is influenced by the data types in the SingleStoreDB database. Performance tests conducted by the SingleStore team demonstrated that nullable data types are slower than non-nullable types. Therefore, consider using non-nullable data types when appropriate.

Data type mapping

The following table maps the SingleStoreDB data types to the corresponding Arrow data types. Note that this mapping is based on the alpha version of the driver and it may change in the future.

SingleStoreDB Data Type Arrow Data Type
UNSIGNED TINYINT uint8
UNSIGNED SMALLINT uint16
UNSIGNED MEDIUMINT uint32
UNSIGNED INT uint32
UNSIGNED BIGINT uint64
TINYINT boolean
SMALLINT int16
MEDIUMINT int32
INT int32
BIGINT int64
FLOAT float32
DOUBLE float64
DECIMAL string
YEAR int16
DATE string
TIME string
DATETIME string
TIMESTAMP string
CHAR string
VARCHAR string
TINYTEXT string
TEXT string
MEDIUMTEXT string
LONGTEXT string
JSON string
BIT binary
BINARY binary
VARBINARY binary
TINYBLOB binary
BLOB binary
MEDIUMBLOB binary
LONGBLOB binary