/spark-sas7bdat

Splittable SAS (.sas7bdat) Input Format for Hadoop and Spark SQL

Primary LanguageScalaApache License 2.0Apache-2.0

Spark SAS Data Source (sas7bdat)

A library for reading SAS data (.sas7bdat) with Spark.

Build Status Join the chat at https://gitter.im/saurfang/spark-sas7bdat

Requirements:

Download:

The latest jar can be downloaded from spark-packages.

Version Scala Version Spark Version
3.0.0-s_2.11 2.11.x 2.4.x
3.0.0-s_2.12 2.12.x 3.0.x

Features:

  • This package allows reading SAS files from local and distributed filesystems, into Spark DataFrames.
  • Schema is automatically inferred from metadata embedded in the SAS file. (Behaviour can be customised, see parameters below)
  • The SAS format is splittable when not file-system compressed, thus we are able to convert a 200GB (1.5Bn rows) .sas7bdat file to .csv files using 2000 executors in under 2 minutes.
  • This library uses parso for parsing as it is the only public available parser that handles both forms of SAS compression (CHAR and BINARY).

NOTE: this package does not support writing sas7bdat files

Docs:

Parameters:

  • extractLabel (Default: false)
    • Boolean: extract column labels as column comments for Parquet/Hive
  • forceLowercaseNames (Default: false)
    • Boolean: force column names to lower case
  • inferDecimal (Default: false)
    • Boolean: infer numeric columns with format width >0 and format precision >0, as Decimal(Width, Precision)
  • inferDecimalScale (Default: each column's format width)
    • Int: scale of inferred decimals
  • inferFloat (Default: false)
    • Boolean: infer numeric columns with <=4 bytes, as Float
  • inferInt (Default: false)
    • Boolean: infer numeric columns with <=4 bytes, format width >0 and format precision =0, as Int
  • inferLong (Default: false)
    • Boolean: infer numeric columns with <=8 bytes, format width >0 and format precision =0, as Long
  • inferShort (Default: false)
    • Boolean: infer numeric columns with <=2 bytes, format width >0 and format precision =0, as Short
  • metadataTimeout (Default: 60)
    • Int: number of seconds to allow reading of file metadata (stops corrupt files hanging)
  • minSplitSize (Default: mapred.min.split.size)
    • Long: minimum byte length of input splits (splits are always at least 1MB, to ensure correct reads)
  • maxSplitSize (Default: mapred.max.split.size)
    • Long: maximum byte length of input splits, (can be decreased to force higher parallelism)

NOTE:

  • the order of precedence for numeric type inference is: Long -> Int -> Short -> Decimal -> Float -> Double
  • sas doesn’t have a concept of Long/Int/Short, instead people typically use column formatters with 0 precision

Scala API

val df = {
  spark.read
    .format("com.github.saurfang.sas.spark")
    .option("forceLowercaseNames", true)
    .option("inferLong", true)
    .load("cars.sas7bdat")
}
df.write.format("csv").option("header", "true").save("newcars.csv")

You can also use the implicit readers:

import com.github.saurfang.sas.spark._

// DataFrameReader
val df = spark.read.sas("cars.sas7bdat")
df.write.format("csv").option("header", "true").save("newcars.csv")

// SQLContext
val df2 = sqlContext.sasFile("cars.sas7bdat")
df2.write.format("csv").option("header", "true").save("newcars.csv")

(Note: you cannot use parameters like inferLong with the implicit readers.)

Python API

df = spark.read.format("com.github.saurfang.sas.spark").load("cars.sas7bdat", forceLowercaseNames=True, inferLong=True)
df.write.csv("newcars.csv", header=True)

R API

df <- read.df("cars.sas7bdat", source = "com.github.saurfang.sas.spark", forceLowercaseNames = TRUE, inferLong = TRUE)
write.df(df, path = "newcars.csv", source = "csv", header = TRUE)

SQL API

SAS data can be queried in pure SQL by registering the data as a (temporary) table.

CREATE TEMPORARY VIEW cars
USING com.github.saurfang.sas.spark
OPTIONS (path="cars.sas7bdat")

SAS Export Runner

We included a simple SasExport Spark program that converts .sas7bdat to .csv or .parquet files:

sbt "run input.sas7bdat output.csv"
sbt "run input.sas7bdat output.parquet"

To achieve more parallelism, use spark-submit script to run it on a Spark cluster. If you don't have a spark cluster, you can always run it in local mode and take advantage of multi-core.

Spark Shell

spark-shell --master local[4] --packages saurfang:spark-sas7bdat:3.0.0-s_2.12

Caveats

  1. spark-csv writes out null as "null" in csv text output. This means if you read it back for a string type, you might actually read "null" instead of null. The safest option is to export in parquet format where null is properly recorded. See databricks/spark-csv#147 for alternative solution.

Related Work

Acknowledgements

This project would not be possible without parso continued improvements and generous contributions from @mulya, @thesuperzapper, and many others. We are hornored to be a recipient of 2020 WiseWithData ELEVATE Awards and appreciate their generous donations.