Marshmallow Schema generator for Pandas DataFrames

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

marshmallow-dataframe

marshmallow-dataframe is a library that helps you generate marshmallow Schemas for Pandas DataFrames.

Usage

Let's start by creating an example dataframe for which we want to create a Schema. This dataframe has four columns: two of them are of string type, one is a float, and the last one is an integer.

import pandas as pd
import numpy as np
from marshmallow_dataframe import SplitDataFrameSchema

animal_df = pd.DataFrame(
    [
        ("falcon", "bird", 389.0, 2),
        ("parrot", "bird", 24.0, 2),
        ("lion", "mammal", 80.5, 4),
        ("monkey", "mammal", np.nan, 4),
    ],
    columns=["name", "class", "max_speed", "num_legs"],
)

You can then create a marshmallow schema that validates and loads dataframes with the same structure as the one above, serialized by DataFrame.to_json with orient="split". The dtypes attribute of the Meta class is required; other marshmallow Schema options can also be passed as attributes of Meta:

class AnimalSchema(SplitDataFrameSchema):
    """Automatically generated schema for animal dataframe"""

    class Meta:
        dtypes = animal_df.dtypes

When passing a valid payload for a new animal, this schema will validate it and build a dataframe:

animal_schema = AnimalSchema()

new_animal = {
    "data": [("leopard", "mammal", 58.0, 4), ("ant", "insect", 0.288, 6)],
    "columns": ["name", "class", "max_speed", "num_legs"],
    "index": [0, 1],
}

new_animal_df = animal_schema.load(new_animal)

print(type(new_animal_df))
# <class 'pandas.core.frame.DataFrame'>
print(new_animal_df)
#       name   class  max_speed  num_legs
# 0  leopard  mammal     58.000         4
# 1      ant  insect      0.288         6
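The payload above was written by hand, but the same shape can be produced from an existing dataframe. As a sketch using only pandas and the standard library (example_df is an illustrative name), DataFrame.to_json with orient="split" yields the three keys the schema expects:

```python
import json

import pandas as pd

example_df = pd.DataFrame(
    [("leopard", "mammal", 58.0, 4), ("ant", "insect", 0.288, 6)],
    columns=["name", "class", "max_speed", "num_legs"],
)

# orient="split" serializes column labels, index labels and row data separately.
payload = json.loads(example_df.to_json(orient="split"))

print(sorted(payload))
print(payload["data"][0])
```

After the JSON round trip each row is a list rather than a tuple, while the column names and the index are plain lists.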

However, if we pass a payload that doesn't conform to the schema, load raises a marshmallow ValidationError with an informative message describing the errors:

invalid_animal = {
    "data": [("leopard", "mammal", 58.0, "four")],  # num_legs is not an int
    "columns": ["name", "class", "num_legs"],  # missing  max_speed column
    "index": [0],
}

animal_schema.load(invalid_animal)

# Raises:
# marshmallow.exceptions.ValidationError: {
#     'columns': ["Must be equal to ['name', 'class', 'max_speed', 'num_legs']."],
#     'data': {0: {3: ['Not a valid integer.']}}
# }

marshmallow-dataframe can also generate schemas for the orient=records format: follow the steps above, but use marshmallow_dataframe.RecordsDataFrameSchema as the superclass of AnimalSchema.
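For comparison with the split format, this sketch shows what pandas itself emits for orient="records", using only pandas and the standard library (example_df is an illustrative name). How RecordsDataFrameSchema expects these records to be wrapped in a payload is not shown here, so check the library before relying on a particular shape.

```python
import json

import pandas as pd

example_df = pd.DataFrame(
    [("leopard", "mammal", 58.0, 4), ("ant", "insect", 0.288, 6)],
    columns=["name", "class", "max_speed", "num_legs"],
)

# orient="records" serializes each row as a mapping from column name to value.
records = json.loads(example_df.to_json(orient="records"))

print(records[0])
```

Unlike the split format, the column names are repeated in every record and the index is not included.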

Installation

marshmallow-dataframe requires Python >= 3.6 and marshmallow >= 3.0. You can install it with pip:

pip install marshmallow-dataframe

Contributing

Contributions are welcome!

You can report a problem or feature request in the issue tracker. If you feel that you can fix it or implement it, please submit a pull request referencing the issues it solves.

Unit tests are written with the pytest framework and live in the tests directory. They are run with tox on Python 3.6 and 3.7. To run them, first install tox:

pip install tox

Then run tox with no arguments to run the linters and tests for all Python versions, or target a specific Python version with:

tox -e py36

We format the code with black. You can format your checkout before committing it by running:

tox -e black -- .