/tablesaw

Part data frame. Part column store. All Java.

Primary LanguageJavaOtherNOASSERTION

Tablesaw

Tablesaw is a high-performance, in-memory data table, plus tools for data manipulation and a column-oriented storage format. In Java.

With Tablesaw, you can import, sort, transform, filter, and summarize tables with hundreds of millions of rows on a laptop. It uses tricks like primitive collections and compressed, column-oriented data structures to maximize what you can do in one VM.

The goal is to make large data wrangling jobs possible without distributed computing (HDFS, Hadoop, Spark, etc.). Analysis is more productive with less engineering overhead and shorter iteration cycles. A fluent API lets you express operations in a concise and readable fashion.

Tablesaw provides general-purpose analytic support, with rich functionality for working with time-series. It's designed for ease-of-use, and with Java 9, you'll be able to work interactively in the REPL.

For more information and examples see: https://javadatascience.wordpress.com

Getting started

Tablesaw uses maven, but we're not on Maven Central yet.

Download the latest release from:

https://github.com/lwhite1/tablesaw/releases

Build and install:

mvn clean install

Then add a dependency to your pom file:

<dependency>
    <groupId>com.github.lwhite1</groupId>
    <artifactId>tablesaw</artifactId>
    <version>1.0-SNAPSHOT</version>
</dependency>

Tablesaw requires Java 8 or newer.

A tutorial introduction to Tablesaw

Lets explore a tornado data set using Tablesaw. Here's what we'll cover:

  • Read and writing CSV files
  • Viewing table metadata
  • Adding and removing columns
  • Printing the first few rows for a peak at the data
  • Sorting
  • Running descriptive stats (mean, min, max, etc.)
  • Performing mapping operations over columns
  • Filtering rows
  • Using groups and reduce functions to compute stats for subsets
  • Storing tables in Tablesaw's compressed columnar storage format

All the data is in the projects data folder. The code is taken from the TornadoExample test class.

Read a CSV file

Here we read a csv file of tornado data. First, we say what column types are present.

    ColumnType[] CT = {LOCAL_DATE, LOCAL_TIME, CATEGORY, INTEGER, INTEGER, INTEGER, 
                       INTEGER, CATEGORY, FLOAT, FLOAT, FLOAT, FLOAT, FLOAT};
    Table tornadoes = Table.fromCSV(CT, "data/1950-2014_torn.csv");

Specifying the column types is the most tedious part of using Tablesaw. That requirement will be removed soon.

Viewing table metadata

Often, the best way to start is to print the column names for reference:

    
    tornadoes.columnNames();

which produces:

	[Date, Time, State, State No, Scale, Injuries, Fatalities, Start Lat, Start Lon, Length, Width]

The shape() method displays the row and column counts:

    
    tornadoes.shape();
    
    >> 59945 rows X 10 cols

The structure() method shows the index, name and type of each column

    tornadoes.structure();
    
    >> Structure of data/tornadoes_1950-2014.csv
	Index Column Names Column Type 
	0     Date         LOCAL_DATE  
	1     Time         LOCAL_TIME  
	2     State        CATEGORY    
	3     State No     INTEGER     
	4     Scale        INTEGER     
	5     Injuries     INTEGER     
	6     Fatalities   INTEGER     
	7     Start Lat    FLOAT       
	8     Start Lon    FLOAT       
	9     Length       FLOAT       
	10    Width        FLOAT       
        

Like many Tablesaw methods, structure() returns a table; You can then call print() to produce a string representation for display, or perform any other table operations on it, like:

    
    tornadoes.structure()
    	.selectWhere(column("Column Type").isEqualTo("INTEGER"));
    
    >> Structure of data/tornadoes_1950-2014.csv
	Index Column Name Column Type 
	3     State No    INTEGER     
	4     Scale       INTEGER     
	5     Injuries    INTEGER     
	6     Fatalities  INTEGER     
    

Of course, that also returned a table. We'll cover selecting rows in more detail later.

Viewing data

The first(n) method returns the first n rows.

    table.first(3);
    >>
    Date       Time     State Scale Injuries Fatalities Start Lat Start Lon Length Width 
	1950-01-03 11:00:00 MO    3     3        0          38.77     -90.22    9.5    150.0 
	1950-01-03 11:00:00 MO    3     3        0          38.77     -90.22    6.2    150.0 
	1950-01-03 11:10:00 IL    3     0        0          38.82     -90.12    3.3    100.0 

Mapping operations

Mapping operations in Tablesaw take one or more columns as inputs and produce a new column as output. We can map arbitrary expressions onto the table, but many common operations are built in. You can, for example, calculate the difference in days, weeks, or years between the values in two date columns. The method below extracts the Month name from the date column into a new column.

    CategoryColumn month = tornadoes.localDateColumn("Date").month();

Once you have a new column, you can add it to a table:

    tornadoes.addColumn(2, month);

You can also remove columns from tables to save memory or reduce clutter:

    tornadoes.removeColumn("State No);

Sorting by column

Now lets sort the table in reverse order by the id column

    tornadoes.sortDescendingOn("Fatalities");

Descriptive statistics

Descriptive statistics are calculated using the describe() method:

    table.column("Fatalities").describe();

This outputs:

	Measure  Value     
	n        1590.0    
	Missing  0.0       
	Mean     4.2779875 
	Min      1.0       
	Max      158.0     
	Range    157.0     
	Std. Dev 9.573451  

Filtering Rows

To filter rows you can use arbitrary logic, but it's easier to use the built-in filter classes as shown below:

    tornadoes.selectWhere(column("Fatalities").isGreaterThan(0));
    
    tornadoes.selectWhere(column("Date").isInApril());
    
    tornadoes.selectWhere(either(column("Width").isGreaterThan(300)),   // 300 yards
    			 	(column("Length").isGreaterThan(10)));    // 10 miles
    							
    tornadoes.select("State", "Date", "Scale").where(column("Date").isInQ2());
    

The last example above returns a table containing only the three columns named in select() parameters.

Performing totals and sub-totals

Column metrics can be calculated using methods like sum(), product(), mean(), max(), etc.

You can apply those methods to a table, calculating results on one column, grouped by the values in another.

    Table injuriesByScale = tornadoes.reduce("Injuries", "Scale", median);
    injuriesByScale.setName("Median injuries by Tornado Scale");

This produces the following table, in which Group represents the Tornado Scale and Median the median injures for that group:

Median injuries by Tornado Scale
Group Median 
-9    0.0    
0     0.0    
1     0.0    
2     0.0    
3     1.0    
4     12.0   
5     107.0  

Write the new CSV file to disk

    tornadoes.exportToCsv("data/rev_tornadoes_1950-2014.csv");

Read and write data using Tablesaw's ".saw" format

Once you've imported data you can use Tablesaw's own disk format to save it. In .saw format, reads and writes are an order of magnitude faster than optimized CSV operations.

    String dbName = tornadoes.save("/tmp/tablesaw/testdata");

    Table tornadoes = Table.readTable(dbName);

This is just a little of what Tablesaw can do. More information is available on the project web site: https://javadatascience.wordpress.com

A work-in-progress

Tablesaw is moving towards stability of the core functionality and APIs. A production release planned for Q3 2016. A great deal of additional functionality will follow the initial release, including window operations (like rolling averages), outlier detection, database integration, and integrated machine-learning.