ParquetSharp.DataFrame is a .NET library for reading Apache Parquet files into .NET DataFrames and writing DataFrames to Parquet files, using ParquetSharp.
Parquet data is read into a DataFrame using the ToDataFrame extension methods on ParquetFileReader, for example:
    using ParquetSharp;

    using (var parquetReader = new ParquetFileReader(parquet_file_path))
    {
        var dataFrame = parquetReader.ToDataFrame();
        parquetReader.Close();
    }
Overloads are provided that allow you to read specific columns from the Parquet file, and/or a subset of row groups:
    // Read only the named columns
    var dataFrame = parquetReader.ToDataFrame(columns: new [] {"col_1", "col_2"});

    // Or read only the first two row groups
    var dataFrame = parquetReader.ToDataFrame(rowGroupIndices: new [] {0, 1});
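Since the two options can be combined ("and/or"), a single call can restrict both the columns and the row groups. The following is a minimal sketch, not a definitive usage pattern: the file path, column names, and values are hypothetical, and the file is first created with the ToParquet extension method so that the example is self-contained. It assumes the ParquetSharp.DataFrame NuGet package is referenced and that a file this small is written as a single row group.

```csharp
using System;
using System.IO;
using Microsoft.Data.Analysis;
using ParquetSharp;

// Hypothetical file path and column names, for illustration only.
var parquet_file_path = Path.Combine(Path.GetTempPath(), "subset_example.parquet");

// Write a small three-column file so there is something to read back.
var source = new DataFrame(
    new PrimitiveDataFrameColumn<int>("col_1", new[] {1, 2, 3}),
    new PrimitiveDataFrameColumn<double>("col_2", new[] {0.5, 1.5, 2.5}),
    new PrimitiveDataFrameColumn<long>("col_3", new[] {10L, 20L, 30L}));
source.ToParquet(parquet_file_path);

DataFrame subset;
using (var parquetReader = new ParquetFileReader(parquet_file_path))
{
    // Read two of the three columns, from the first row group only.
    subset = parquetReader.ToDataFrame(
        columns: new[] {"col_1", "col_2"},
        rowGroupIndices: new[] {0});
    parquetReader.Close();
}

Console.WriteLine($"{subset.Rows.Count} rows, {subset.Columns.Count} columns");
```

Restricting the columns and row groups at read time avoids loading data that would be discarded anyway, which matters most for wide or large files.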
Parquet files are written using the ToParquet extension method on DataFrame:
    using ParquetSharp;
    using Microsoft.Data.Analysis;

    var dataFrame = new DataFrame(columns);
    dataFrame.ToParquet(parquet_file_path);
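The `columns` variable in the snippet above is not defined there. As a rough sketch of one way it might be built, the example below uses column types from Microsoft.Data.Analysis, writes the DataFrame, and reads it back. The column names, values, and output path are hypothetical and chosen only for illustration.

```csharp
using System;
using System.IO;
using Microsoft.Data.Analysis;
using ParquetSharp;

// Hypothetical column names and values.
var columns = new DataFrameColumn[]
{
    new PrimitiveDataFrameColumn<int>("id", new[] {1, 2, 3}),
    new PrimitiveDataFrameColumn<double>("value", new[] {0.5, 1.5, 2.5}),
};
var dataFrame = new DataFrame(columns);

// Hypothetical output path in the system temp directory.
var parquet_file_path = Path.Combine(Path.GetTempPath(), "write_example.parquet");
dataFrame.ToParquet(parquet_file_path);

// Read the file back to check the round trip.
DataFrame roundTripped;
using (var parquetReader = new ParquetFileReader(parquet_file_path))
{
    roundTripped = parquetReader.ToDataFrame();
    parquetReader.Close();
}

Console.WriteLine($"{roundTripped.Rows.Count} rows, {roundTripped.Columns.Count} columns");
```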
Parquet writing options can be overridden by providing an instance of WriterProperties:
    using (var propertiesBuilder = new WriterPropertiesBuilder())
    {
        propertiesBuilder.Compression(Compression.Snappy);
        using (var properties = propertiesBuilder.Build())
        {
            dataFrame.ToParquet(parquet_file_path, properties);
        }
    }
The logical type to use when writing a column can optionally be overridden. This is required when writing decimal columns, as you must specify the precision and scale to be used (see the Parquet documentation for more details). This also allows writing an integer column as a Parquet date or time.
    dataFrame.ToParquet(parquet_file_path, logicalTypeOverrides: new Dictionary<string, LogicalType>
    {
        {"decimal_column", LogicalType.Decimal(precision: 29, scale: 3)},
        {"date_column", LogicalType.Date()},
        {"time_column", LogicalType.Time(isAdjustedToUtc: true, TimeUnit.Millis)},
    });
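To make the override concrete, here is a hedged sketch of a DataFrame whose columns match two of the overrides above. It assumes that a DecimalDataFrameColumn is the column type paired with the decimal override and that the date column holds days since the Unix epoch in an Int32 column, which is how Parquet stores dates. The values and output path are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.Data.Analysis;
using ParquetSharp;

var dataFrame = new DataFrame(
    new DecimalDataFrameColumn("decimal_column", new[] {1.5m, 2.25m}),
    // Days since 1970-01-01; 18628 corresponds to 2021-01-01.
    new PrimitiveDataFrameColumn<int>("date_column", new[] {18628, 18629}));

// Hypothetical output path in the system temp directory.
var parquet_file_path = Path.Combine(Path.GetTempPath(), "logical_types_example.parquet");

dataFrame.ToParquet(parquet_file_path, logicalTypeOverrides: new Dictionary<string, LogicalType>
{
    // Precision and scale must be stated explicitly for decimal columns.
    {"decimal_column", LogicalType.Decimal(precision: 29, scale: 3)},
    // Write the integer column as a Parquet date rather than a plain integer.
    {"date_column", LogicalType.Date()},
});

Console.WriteLine(File.Exists(parquet_file_path));
```

Columns without an entry in the dictionary are written with their default logical type, so only the columns that need special handling have to be listed.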
We welcome new contributors! We will happily receive PRs for bug fixes or small changes. If you're contemplating something larger, please get in touch first by opening a GitHub Issue describing the problem and how you propose to solve it.
Copyright 2021 G-Research
Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.