/RKinetica

An R ODBC connector package for Kinetica

Primary LanguageROtherNOASSERTION

RKinetica Connector Guide

The following guide provides step by step instructions to get started using R with Kinetica. The RKinetica connector provides access to the Kinetica database through a REST interface. The RKinetica package can be built locally from source or a prebuilt package release can be downloaded from the release page. Kinetica strongly recommends that release builds be used instead of building the package locally.

RKinetica interfaces with Kinetica similarly to other R-database interactions: using the DBI and dplyr packages. Interacting with RKinetica is easiest using the RStudio interface. RStudio and RKinetica require that R be installed locally.

Source code for the connector can be found at:

Contents

Build & Install

RKinetica depends on several other R packages from the comprehensive R archive network (CRAN):

NOTE: If installing the RKinetica package in RStudio, the dependencies should be installed automatically prior to RKinetica being installed. Otherwise, all packages can be installed via RStudio (or the R console) like so:

install.packages(c("DBI", "RJSONIO", "httr", "bit64", "hms", "methods", "purrr", "stats", "dplyr"))

If opting to build the RKinetica package instead of downloading, the connector package can be built as follows:

git clone https://github.com/kineticadb/RKinetica.git -b release/v7.1 --single-branch
R CMD build RKinetica

This sequence produces a tar.gz file, which, once installed, is made available to R. The tar.gz file is created in the same directory the build command was issued, i.e. not in the RKinetica repository. Verify that tar.gz file was created before installing the RKinetica package:

ls RKinetica*

To install the RKinetica package via the command line:

R CMD install RKinetica_7.1.0.0.tar.gz

To install the RKinetica package in RStudio (or R console):

> install.packages("/path/to/RKinetica_7.1.0.0.tar.gz", repos = NULL, type = "source")

Usage

Before using RKinetica, the package must be loaded:

library(RKinetica)

Create a KineticaConnection object using the dbConnect() method, passing in Kinetica URL, user, and password parameters:

con <- dbConnect(RKinetica::Kinetica(),
                 url = "http://<kinetica-host>:9191",
                 username = "<user>",
                 password = "<password>")

NOTE: A parameter cannot be added to an existing KineticaConnection object. Instead, a new KineticaConnection object must be created to properly initialize any functionality enabled by additional parameters.

IMPORTANT: If using RStudio, you can use the rstudioapi package to instead prompt for username and password:

con <- dbConnect(RKinetica::Kinetica(),
                 url = "http://<kinetica-host>:9191",
                 username = rstudioapi::askForPassword("Database username?"),
                 password = rstudioapi::askForPassword("Database password?"))

If you expect a result set from your queries to exceed 10,000 rows, set the row_limit parameter value accordingly:

con <- dbConnect(RKinetica::Kinetica(),
                 url = "http://<kinetica-host>:9191",
                 username = "<user>",
                 password = "<password>",
                 row_limit = 1000000L)

You can then use the KineticaConnection object as a regular DBI connection. Some of the most common commands:

  • Print connection info:

    dbGetInfo(con)
    
  • List top level database objects and collections with their types:

    dbListObjects(con)
    
  • Get a list of top level database tables and collections names:

    dbListTables(con)
    
  • Drop a table if it exists:

    dbRemoveTable(con, "tableA")
    
  • Check if the table exists:

    dbExistsTable(con, "tableA")
    
  • Write a table with 3 columns:

    dbWriteTable(con, "tableA", data.frame(a = 1L, b = 2L, c = 3.0), row.names = NULL)
    
  • List tableA fields:

    dbListFields(con, "tableA")
    
  • Add records to tableA:

    dbAppendTable(con, "tableA", data.frame(a = 2L:3L, b = 3L:4L, c = 4.0:5.0), row.names = NULL)
    
  • Read table into variable:

    rows <- dbReadTable(con, "tableA")
    print(rows)
    
  • Disconnect:

    dbDisconnect(con)
    

Large tables, views, or query results can have pagination parameters passed into the dbSendQuery or dbSendStatement methods. Assuming that the connection was established with a row limit of 1,000,000, i.e. row_limit = 1000000L, the following example query extracts all records from the acquisition table sorted by l_id. The data is retrieved in batches of one million records and the offset is increased by 1 million each batch. This loop continues until the dbSendQuery() resultset is returned empty:

sql_query <- "SELECT l_id, product_type, term, score FROM acquisition ORDER BY l_id"
offset <- 0L
repeat {
  result <- dbSendQuery(con, sql_query, limit = 1000000L, offset = offset)

  # work with current data.frame provided in result@data

  if (nrow(result@data) > 0) {
    # increase offset to get next page of data
    offset <- offset + 1000000L
  } else {
    # exit pagination loop when data.frame is empty
    break
  }
}

Additional code examples are available in the "examples" subdirectory.

Strings vs Factors

When RKinetica reads a character list into R dataframe it can be converted in to a factor. This option is controlled by environment property that's read into as.data.frame() parameter stringsAsFactors:

stringsAsFactors = default.stringsAsFactors()

To set environment option stringsAsFactors to TRUE or FALSE explicitly, use the following syntax at the beginning of your R script or once per session:

options(stringsAsFactors = FALSE)

Schema naming support

Starting with version 7.1 Kinetica DB stops supporting comma and period characters in table and column names. Instead user can use <schema_name>.<table_name> notation with only one period character allowed or <table_name> notation to lookup tables in the user's default schema. RKinetica added parameter default_schema to KineticaConnection object that stores user's default schema name:

con@default_schema
[1] "ki_home"

All of the table managing functions of KineticaConnection (dbCreateTable(), dbAppendTable(), dbReadTable(), dbWriteTable(), dbExistsTable(), and dbRemoveTable()) now allow name argument to be passed as a character value such as some_schema.tableA (or tableA, short for omitting default schema in ki_home.tableA notation) or a KineticaId object modelled after DBI::Id class that encapsulates a named vector with schema and table parameter values defined separately:

id  <- KineticaId(schema = "ki_home", table = "some_table")
show(id)
[1] <KineticaId> schema = ki_home, table = some_table

You can also create KineticaId object by submitting an unnamed character value that is going to be parsed into named schema and table pair:

id  <- KineticaId("ki_home.some_table")
show(id)
[1] <KineticaId> schema = ki_home, table = some_table

Configuring High Availablity (HA)

Automatic Discovery

When two or more Kinetica clusters have been configured for an HA ring via KAgent, RKinetica will automatically discover the additional Kinetica instance URLs available in the ring. If the connection to the URL of the primary cluster fails, each additional URL will be tried until a successful connection is established; every failed connection will result in a warning message. If all connection attempts fail, an error message will be thrown. Only the URL of the primary cluster to connect to needs to be specified (via the url parameter in the dbConnect method); the URLs for the failover clusters will be retrieved from the primary cluster upon first connecting to it. The KineticaConnection object has additional parameters to store these failover URLs (via the ha_ring parameter) as well as other connection information:

con <- dbConnect(RKinetica::Kinetica(), url = "http://172.123.45.61:9191")
dbGetInfo(con)

$url
[1] "http://172.123.45.61:9191"

$host
[1] "172.123.45.61"

$port
[1] 9191

$ha_enabled
[1] TRUE

$ha_ring
[1] "http://172.123.45.63:9191" "http://172.123.45.61:9191" "http://172.123.45.62:9191"

URLs in ha_ring list are randomly selected to balance load on secondary URL instances when the primary URL fails. You can use a show() command on a KineticaConnection object at any time to check which URL is being used in the current connection:

show(con)

<KineticaConnection>
HA enabled
Current url: http://172.123.45.62:9191

Manual Configuration

If you want to provide URLs for failover clusters manually, you can do so by adding the ha_ring parameter to the dbConnect() method with a comma-separated list of URIs for the secondary cluster(s):

con <- dbConnect(RKinetica::Kinetica(), url = "http://172.123.45.61:9191",
       ha_ring = "http://172.123.45.62:9191,http://172.123.45.63:9191")
show(con)

<KineticaConnection>
Self-provided HA enabled
Current url: http://172.123.45.61:9191

Documentation

Detailed RKinetica information including method and function descriptions, usage examples, and arguments can be found in the documentation.