Access The Spark Catalog API Via 'sparklyr'

{catalog}

Overview

{catalog} gives access to the Spark Catalog API through {sparklyr}. The Catalog is Spark's interface for managing a metastore (also known as a metadata catalog) of relational entities such as databases, tables, functions, table columns and temporary views.

Installation

You can install:

  • the development version from GitHub with
# install.packages("remotes")
remotes::install_github("nathaneastwood/catalog")
  • the latest release from CRAN with
install.packages("catalog")

Usage

{catalog} provides an API that mirrors the Spark Catalog API and exposes all of its methods. The example below shows a small part of that functionality.

# Connect to a local Spark instance and copy mtcars into it as a temporary view
sc <- sparklyr::spark_connect(master = "local")
mtcars_spark <- sparklyr::copy_to(dest = sc, df = mtcars)

library(catalog)

list_tables(sc)
# # A tibble: 1 × 5
#   name   database description tableType isTemporary
#   <chr>  <chr>    <chr>       <chr>     <lgl>      
# 1 mtcars <NA>     <NA>        TEMPORARY TRUE

list_columns(sc, "mtcars")
# # A tibble: 11 × 6
#    name  description dataType nullable isPartition isBucket
#    <chr> <chr>       <chr>    <lgl>    <lgl>       <lgl>   
#  1 mpg   <NA>        double   TRUE     FALSE       FALSE   
#  2 cyl   <NA>        double   TRUE     FALSE       FALSE   
#  3 disp  <NA>        double   TRUE     FALSE       FALSE   
#  4 hp    <NA>        double   TRUE     FALSE       FALSE   
#  5 drat  <NA>        double   TRUE     FALSE       FALSE   
#  6 wt    <NA>        double   TRUE     FALSE       FALSE   
#  7 qsec  <NA>        double   TRUE     FALSE       FALSE   
#  8 vs    <NA>        double   TRUE     FALSE       FALSE   
#  9 am    <NA>        double   TRUE     FALSE       FALSE   
# 10 gear  <NA>        double   TRUE     FALSE       FALSE   
# 11 carb  <NA>        double   TRUE     FALSE       FALSE

list_functions(sc)
# # A tibble: 349 × 5
#    name  database description className                                  isTem…¹
#    <chr> <chr>    <chr>       <chr>                                      <lgl>  
#  1 !     <NA>     <NA>        org.apache.spark.sql.catalyst.expressions… TRUE   
#  2 %     <NA>     <NA>        org.apache.spark.sql.catalyst.expressions… TRUE   
#  3 &     <NA>     <NA>        org.apache.spark.sql.catalyst.expressions… TRUE   
#  4 *     <NA>     <NA>        org.apache.spark.sql.catalyst.expressions… TRUE   
#  5 +     <NA>     <NA>        org.apache.spark.sql.catalyst.expressions… TRUE   
#  6 -     <NA>     <NA>        org.apache.spark.sql.catalyst.expressions… TRUE   
#  7 /     <NA>     <NA>        org.apache.spark.sql.catalyst.expressions… TRUE   
#  8 <     <NA>     <NA>        org.apache.spark.sql.catalyst.expressions… TRUE   
#  9 <=    <NA>     <NA>        org.apache.spark.sql.catalyst.expressions… TRUE   
# 10 <=>   <NA>     <NA>        org.apache.spark.sql.catalyst.expressions… TRUE   
# # … with 339 more rows, and abbreviated variable name ¹​isTemporary
# # ℹ Use `print(n = ...)` to see more rows

drop_temp_view(sc, "mtcars")
# [1] TRUE
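
Because the listing functions return ordinary tibbles, their results can be filtered with standard R subsetting. The following is a minimal sketch rather than part of the package: `temp_view_names()` is a hypothetical helper, `example_tables` is made-up stand-in data, and the column names are assumed to match the `list_tables()` output shown above.

```r
# Hypothetical helper (not part of {catalog}): given a data frame shaped like
# the output of list_tables(), return the names of the temporary views.
temp_view_names <- function(tables) {
  # Treat a missing isTemporary value as "not temporary"
  is_temp <- !is.na(tables$isTemporary) & tables$isTemporary
  tables$name[is_temp]
}

# Made-up stand-in for a list_tables(sc) result, so the helper can be tried
# without a Spark connection. The columns mirror the output shown above.
example_tables <- data.frame(
  name = c("mtcars", "sales"),
  database = c(NA, "default"),
  description = c(NA_character_, NA_character_),
  tableType = c("TEMPORARY", "MANAGED"),
  isTemporary = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

temp_view_names(example_tables)
```

With a live connection, passing `list_tables(sc)` to the helper and calling `drop_temp_view(sc, view)` on each returned name would remove every temporary view in one loop.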

For more information, please refer to the package website.