A framework for reading and writing tabular data using popular data formats, especially spreadsheets.
Tabular data is an important interface between humans and machines. However, supporting multiple spreadsheet formats transparently is complex. This library provides a framework of adapters and facades for reading and writing tabular data using popular spreadsheet data formats.
- To support the straightforward reading and writing of tabular data in common file formats, especially spreadsheets.
- To support common value types for cells to streamline development and minimize errors.
- Not a goal: supporting non-tabular layouts of data, e.g. merged cells in Excel.
- Not a goal: exposing all features of all data formats, e.g. charts in Excel. Outstanding specialized libraries already exist for the use and manipulation of specific data formats.
- Not a goal: supporting every data format that can represent tabular data. The library's primary goal is to support the data formats most commonly exchanged between humans and computers.
You can get the library from Maven Central:
<!-- To add support for CSV and TSV files -->
<dependency>
<groupId>com.sigpwned</groupId>
<artifactId>tabular4j-csv</artifactId>
<version>0.0.0-b2</version>
</dependency>
<!-- To add support for Excel XLS and XLSX files -->
<dependency>
<groupId>com.sigpwned</groupId>
<artifactId>tabular4j-excel</artifactId>
<version>0.0.0-b2</version>
</dependency>
The tabular4j library uses the ServiceLoader to discover supported file formats automatically at runtime. This requires each JAR file to include a special META-INF/services/com.sigpwned.tabular4j.SpreadsheetFormatFactory file that lists the file format factory classes the JAR provides. To ensure that your application gets all of the supported file formats, as opposed to just the supported file formats from the last JAR added to your build, use the maven-shade-plugin to merge the META-INF/services/com.sigpwned.tabular4j.SpreadsheetFormatFactory files from all of the JARs in your build:
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.6.0</version> <!-- use current version -->
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
A worksheet is a single spreadsheet containing one rectangular table of data. The first row of a worksheet is interpreted as column names (i.e., headers) and determines the width of the table of data. Each additional row of a worksheet is interpreted as a value in each of the named columns. If a row's actual width is greater than the number of columns, then it is truncated before it is interpreted; if its actual width is less than the number of columns, then it is right-padded with null-valued cells before it is interpreted.
+---------+---------+---------+---------+---------+
| alpha | bravo | charlie | delta | | <-- Column Names/Headers. Width = 4.
+---------+---------+---------+---------+---------+
| a | b | c | d | | <-- Width 4. No transformation.
+---------+---------+---------+---------+---------+
| a | b | c | | | <-- Width 3. Will be right-padded with null.
+---------+---------+---------+---------+---------+
| | b | c | d | | <-- Width 4. First cell is empty.
+---------+---------+---------+---------+---------+
| a | b | c | d | e | <-- Width 5. Will be truncated to width 4.
+---------+---------+---------+---------+---------+
| a | b | c | d | | <-- Width 4. No transformation.
+---------+---------+---------+---------+---------+
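In other words, once the header row is read, every subsequent row is normalized to exactly that width. The sketch below checks this invariant using the push-style consumer API shown later in this README (imports and surrounding setup are omitted, as in the other examples; source is a ByteSource):

Spreadsheets.processTabularWorksheet(source, new TabularWorksheetConsumer() {
    private int width;

    @Override
    public void beginTabularWorksheet(int sheetIndex, String sheetName, List<String> columnNames) {
        // The header row fixes the width of the table
        width = columnNames.size();
    }

    @Override
    public void tabularRow(int rowIndex, List<TabularWorksheetCell> cells) {
        // Short rows were right-padded with nulls and long rows truncated,
        // so every delivered row has exactly `width` cells
        assert cells.size() == width;
    }

    @Override
    public void endTabularWorksheet() {
        // Nothing to clean up in this sketch
    }
});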
A workbook is an ordered list of spreadsheets gathered into a single file. Each worksheet in a workbook has a unique name. One sheet in the workbook is "active". For a workbook created by a human, this is the worksheet that was last edited by the user and/or is shown when the file is opened; for a workbook created by the library, it is simply the worksheet with the active flag set.
Users open an existing workbook file to read using a ByteSource. Note that the spreadsheet format (e.g., CSV, XLSX, etc.) is never given explicitly; rather, the framework detects the format of the spreadsheet and reacts accordingly.
try (TabularWorkbookReader workbook=SpreadsheetFactory.getInstance().readTabularWorkbook(ByteSource.ofFile(file))) {
// Handle workbook here...
}
This approach opens a workbook file and processes its active worksheet only.
First, the user can open the active worksheet and then iterate over its rows manually:
try (TabularWorksheetReader worksheet=SpreadsheetFactory.getInstance().readTabularActiveWorksheet(source)) {
for(TabularWorksheetRow row=worksheet.readRow();row!=null;row=worksheet.readRow()) {
// Handle row here...
}
}
Alternatively, the user can use an Iterator directly. Note that any IOException is rethrown as an UncheckedIOException in this case.
try (TabularWorksheetReader worksheet=SpreadsheetFactory.getInstance().readTabularActiveWorksheet(source)) {
for(TabularWorksheetRow row : worksheet) {
// Handle row here...
}
}
The user can also use a Java 8 Stream to process rows:
try (TabularWorksheetReader worksheet=SpreadsheetFactory.getInstance().readTabularActiveWorksheet(source)) {
worksheet.stream().forEach(c -> {
// Handle row here...
});
}
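Because worksheet.stream() returns a standard Stream, all of the usual stream operations are available. For example, counting the rows returned by the reader (a minimal sketch; source is a ByteSource as above):

try (TabularWorksheetReader worksheet=SpreadsheetFactory.getInstance().readTabularActiveWorksheet(source)) {
    // Count the rows exposed by the reader
    long rowCount=worksheet.stream().count();
    System.out.println("Rows: " + rowCount);
}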
Finally, if the user doesn't need to control how rows are read and processed ("pull" style), then the user can also use a consumer that walks the sheet automatically ("push" style):
Spreadsheets.processTabularWorksheet(source, new TabularWorksheetConsumer() {
    @Override
    public void beginTabularWorksheet(int sheetIndex, String sheetName, List<String> columnNames) {
        // Handle setup here...
    }

    @Override
    public void tabularRow(int rowIndex, List<TabularWorksheetCell> cells) {
        // Handle row here...
    }

    @Override
    public void endTabularWorksheet() {
        // Handle cleanup here...
    }
});
This approach opens a workbook file and processes each of its worksheets in order. It shares much logic with the worksheet examples above.
First, the user can open the file and iterate over its sheets by index:
try (TabularWorkbookReader workbook=SpreadsheetFactory.getInstance().readTabularWorkbook(source)) {
for(int i=0;i<workbook.getWorksheetCount();i++) {
try (TabularWorksheetReader worksheet=workbook.getWorksheet(i)) {
// Handle worksheet here...
}
}
}
Alternatively, the user can open the file and iterate over its sheets by name:
try (TabularWorkbookReader workbook=SpreadsheetFactory.getInstance().readTabularWorkbook(source)) {
for(String worksheetName : workbook.getWorksheetNames()) {
try (TabularWorksheetReader worksheet=workbook.findWorksheetByName(worksheetName).get()) {
// Handle worksheet here...
}
}
}
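Combining the two patterns above, a sketch that visits every row of every worksheet in a workbook might look like this:

try (TabularWorkbookReader workbook=SpreadsheetFactory.getInstance().readTabularWorkbook(source)) {
    for(String worksheetName : workbook.getWorksheetNames()) {
        try (TabularWorksheetReader worksheet=workbook.findWorksheetByName(worksheetName).get()) {
            for(TabularWorksheetRow row : worksheet) {
                // Handle each row of each worksheet here...
            }
        }
    }
}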
Users open a workbook file to write using a ByteSink. In this case, the user gives the desired file format explicitly.
try (TabularWorkbookWriter workbook=SpreadsheetFactory.getInstance().writeTabularWorkbook(ByteSink.ofFile(file), "csv")) {
// Handle workbook here...
}
In this approach, the user simply opens the writer, writes the rows, and closes the writer. This results in a workbook with one sheet.
try (TabularWorksheetRowWriter worksheet=SpreadsheetFactory.getInstance().writeTabularActiveWorksheet(sink, "csv")
.writeHeaders("alpha", "bravo")) {
worksheet.writeValuesRow("a", "b");
worksheet.writeValuesRow("1", "2");
}
In this approach, the user simply opens a writer for the first sheet, writes its rows, and closes the writer. The user may then open and write additional sheets the same way.
try (TabularWorkbookWriter workbook=SpreadsheetFactory.getInstance().writeTabularWorkbook(sink, "csv")) {
try (TabularWorksheetRowWriter worksheet=workbook.getWorksheet("sheet1").writeHeaders("alpha", "bravo")) {
// Write worksheet...
}
try (TabularWorksheetRowWriter worksheet=workbook.getWorksheet("sheet2").writeHeaders("charlie", "delta")) {
// Write worksheet...
}
}
Out of the box, the library supports the following file formats via the following modules:
- tabular4j-excel (xlsx, xls) -- Using the excellent Apache POI library
- tabular4j-csv (csv, tsv) -- Using the csv4j library
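When writing, the output format is chosen by the format identifier passed to the factory, as in the earlier examples, which used "csv". Here is a sketch of writing an Excel workbook instead, assuming the tabular4j-excel module is on the classpath and registers its format under the identifier "xlsx" (verify the identifier against the module):

// Assumes "xlsx" is the format identifier registered by tabular4j-excel
try (TabularWorkbookWriter workbook=SpreadsheetFactory.getInstance().writeTabularWorkbook(sink, "xlsx")) {
    try (TabularWorksheetRowWriter worksheet=workbook.getWorksheet("sheet1").writeHeaders("alpha", "bravo")) {
        worksheet.writeValuesRow("a", "b");
    }
}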
If there is interest, I'd like to support formats like the following:
- jsonl -- One JSON object per line
- orc -- Apache ORC tabular data store for Hadoop
- parquet -- Apache Parquet column-oriented data file
Out of the box, the library supports the following data types:
- All Java primitives (byte, short, int, long, boolean, float, double, char)
- All Java boxed types (Byte, Short, Integer, Long, Boolean, Float, Double, Character)
- Java 8 Time types (Instant, LocalDate, LocalTime, LocalDateTime, OffsetDateTime, ZonedDateTime, ZoneId)
- Internet types (URL, URI, InetAddress)
- Various and sundry others (UUID, BigDecimal, BigInteger, String, byte[], Date, Calendar)
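For example, assuming writeValuesRow accepts these types directly and converts them through the registered value mappers (an assumption here; check the actual method signature), a typed row could be written like this:

// Sketch only: assumes writeValuesRow accepts supported value types directly
try (TabularWorksheetRowWriter worksheet=SpreadsheetFactory.getInstance().writeTabularActiveWorksheet(sink, "csv")
        .writeHeaders("id", "created", "score")) {
    worksheet.writeValuesRow(UUID.randomUUID(), LocalDate.now(), new BigDecimal("3.14"));
}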
One goal of this library is to make it easier to add new data types for processing. For now, to add new types, look at the CoreCsvValueMapperFactory and CoreExcelValueMapperFactory classes. I hope to add an easier way to add types soon.