ParquetSharp.DataFrame

ParquetSharp.DataFrame is a .NET library for reading Apache Parquet files into .NET DataFrames and writing DataFrames to Parquet files, built on ParquetSharp.

Reading Parquet files

Parquet data is read into a DataFrame using the ToDataFrame extension methods on ParquetFileReader, for example:

using ParquetSharp;

using (var parquetReader = new ParquetFileReader(parquet_file_path))
{
    var dataFrame = parquetReader.ToDataFrame();
    parquetReader.Close();
}

Overloads are provided that allow you to read specific columns from the Parquet file, and/or a subset of row groups:

// Read only the named columns:
var selectedColumns = parquetReader.ToDataFrame(columns: new [] {"col_1", "col_2"});
// Read only the first two row groups:
var selectedRowGroups = parquetReader.ToDataFrame(rowGroupIndices: new [] {0, 1});
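
The two options can also be combined in one call, as the "and/or" above suggests. The following is a minimal sketch rather than an excerpt from the library's own documentation. The file location, column names, and values are hypothetical, and a small file is written first with the ToParquet method described in the next section so that the example is self-contained.

```csharp
using System;
using System.IO;
using Microsoft.Data.Analysis;
using ParquetSharp;

// Hypothetical temporary file and sample data, for illustration only.
var examplePath = Path.Combine(Path.GetTempPath(), "read_subset_example.parquet");
var source = new DataFrame(
    new PrimitiveDataFrameColumn<int>("col_1", new[] { 1, 2, 3 }),
    new PrimitiveDataFrameColumn<double>("col_2", new[] { 0.5, 1.5, 2.5 }),
    new PrimitiveDataFrameColumn<int>("col_3", new[] { 10, 20, 30 }));
source.ToParquet(examplePath);

using (var parquetReader = new ParquetFileReader(examplePath))
{
    // Read two of the three columns, and only the first row group.
    var subset = parquetReader.ToDataFrame(
        columns: new[] { "col_1", "col_2" },
        rowGroupIndices: new[] { 0 });
    Console.WriteLine($"{subset.Rows.Count} rows, {subset.Columns.Count} columns");
}
```

With a file this small all rows are expected to land in a single row group, so index 0 covers the whole file. For larger files, the row group indices select contiguous chunks of rows.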

Writing Parquet files

Parquet files are written using the ToParquet extension method on DataFrame:

using ParquetSharp;
using Microsoft.Data.Analysis;

var dataFrame = new DataFrame(columns);
dataFrame.ToParquet(parquet_file_path);
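
The snippet above leaves `columns` and `parquet_file_path` undefined. As a rough sketch of one way to fill them in, the example below builds two columns with the Microsoft.Data.Analysis column types and writes them out. The column names, values, and output location are hypothetical and chosen only for illustration.

```csharp
using System.IO;
using Microsoft.Data.Analysis;
using ParquetSharp;

// Hypothetical column names and values.
var idColumn = new PrimitiveDataFrameColumn<int>("id", new[] { 1, 2, 3 });
var nameColumn = new StringDataFrameColumn("name", new[] { "a", "b", "c" });

// All columns in a DataFrame must have the same length.
var dataFrame = new DataFrame(idColumn, nameColumn);

// Hypothetical output location in the system temp directory.
var parquetFilePath = Path.Combine(Path.GetTempPath(), "write_example.parquet");
dataFrame.ToParquet(parquetFilePath);
```

Each DataFrame column becomes one Parquet column with the same name, so reading the file back with ToDataFrame should give columns named `id` and `name`.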

Parquet writing options can be overridden by providing an instance of WriterProperties:

using (var propertiesBuilder = new WriterPropertiesBuilder())
{
    propertiesBuilder.Compression(Compression.Snappy);
    using (var properties = propertiesBuilder.Build())
    {
        dataFrame.ToParquet(parquet_file_path, properties);
    }
}

The logical type to use when writing a column can optionally be overridden. This is required when writing decimal columns, as you must specify the precision and scale to be used (see the Parquet documentation for more details). This also allows writing an integer column as a Parquet date or time.

using System.Collections.Generic;
using ParquetSharp;

// Keys are DataFrame column names; values are the Parquet logical types to write.
dataFrame.ToParquet(parquet_file_path, logicalTypeOverrides: new Dictionary<string, LogicalType>
{
    {"decimal_column", LogicalType.Decimal(precision: 29, scale: 3)},
    {"date_column", LogicalType.Date()},
    {"time_column", LogicalType.Time(isAdjustedToUtc: true, TimeUnit.Millis)},
});

Contributing

We welcome new contributors! We will happily receive PRs for bug fixes or small changes. If you're contemplating something larger, please get in touch first by opening a GitHub Issue that describes the problem and how you propose to solve it.

License

Copyright 2021 G-Research

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.