Primary language: Python. License: MIT.

pyrsona


Text data file validation and structure management using the pydantic and parse Python packages.

Installation

Install using pip install pyrsona.

A Simple Example

For the text file example.txt:

operator name: Jane Smith
country: NZ
year: 2022

ID,Time,Duration (sec),Reading
1,20:04:05,12.2,2098
2,20:05:00,2.35,4328

The following pyrsona file structure model can be defined:

from pyrsona import BaseStructure
from pydantic import BaseModel
from datetime import time


class ExampleStructure(BaseStructure):

    structure = (
        "operator name: {operator_name}\n"
        "country: {country}\n"
        "year: {}\n"
        "\n"
        "ID,Time,Duration (sec),Reading\n"
    )

    class meta_model(BaseModel):
        operator_name: str
        country: str

    class row_model(BaseModel):
        id: int
        time: time
        duration_sec: float
        value: float

The read() method can then be used to read the file, parse its contents and validate the meta data and table rows:

meta, table_rows, structure_id = ExampleStructure.read("example.txt")

print(meta)
#> {'operator_name': 'Jane Smith', 'country': 'NZ'}

print(table_rows)
#> [{'id': 1, 'time': datetime.time(20, 4, 5), 'duration_sec': 12.2,
# 'value': 2098.0}, {'id': 2, 'time': datetime.time(20, 5),
# 'duration_sec': 2.35, 'value': 4328.0}]

print(structure_id)
#> ExampleStructure

What's going on here:

  • The structure class attribute contains a definition of the basic file structure. This definition includes the meta data lines and table header lines. Any variable text of interest is replaced with curly brackets and a field name, e.g. '{operator_name}', while any variable text that should be ignored is replaced with empty curly brackets, e.g. '{}'. The structure definition must contain every space, tab and new line character exactly as it appears in the file, otherwise the file will not match. The named fields in the structure definition will be passed to meta_model.

  • meta_model is simply a pydantic model with field names that match the named fields in the structure definition. All values sent to meta_model will be strings and these will be converted to the field types defined in meta_model. Custom pydantic validators can be included in the meta_model definition as per standard pydantic models.

  • row_model is also a pydantic model. This time the field names do not need to match the header line in the structure definition; however, the row_model fields do need to be provided in the same order as the table columns. This allows the table column names to be customised/standardised where the user does not control the file structure itself. Again, custom pydantic validators can be included in the row_model definition if required.
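The matching described above can be pictured with a small standard-library sketch. This is not pyrsona's implementation (pyrsona relies on the parse package); the helpers `template_to_regex` and `split_file` are hypothetical and only illustrate how named fields, ignored fields and positional row mapping fit together. Unlike parse, this sketch never lets a placeholder span more than one line.

```python
import re

STRUCTURE = (
    "operator name: {operator_name}\n"
    "country: {country}\n"
    "year: {}\n"
    "\n"
    "ID,Time,Duration (sec),Reading\n"
)

TEXT = (
    "operator name: Jane Smith\n"
    "country: NZ\n"
    "year: 2022\n"
    "\n"
    "ID,Time,Duration (sec),Reading\n"
    "1,20:04:05,12.2,2098\n"
    "2,20:05:00,2.35,4328\n"
)

# Field names are assigned by column position, not by header text.
ROW_FIELDS = ["id", "time", "duration_sec", "value"]


def template_to_regex(structure):
    """Hypothetical helper: turn '{name}' and '{}' placeholders into a regex."""
    pattern = ""
    # The capturing group makes re.split keep the placeholders in its output.
    for part in re.split(r"(\{\w*\})", structure):
        if part == "{}":
            pattern += r"[^\n]*"  # ignored text, not captured
        elif re.fullmatch(r"\{\w+\}", part):
            pattern += rf"(?P<{part[1:-1]}>[^\n]*)"  # named field
        else:
            pattern += re.escape(part)  # literal text must match exactly
    return re.compile(pattern)


def split_file(structure, text):
    """Hypothetical helper: return (meta, rows) with every value still a string."""
    match = template_to_regex(structure).match(text)
    if match is None:
        raise ValueError("text does not match the structure definition")
    meta = match.groupdict()
    rows = [
        dict(zip(ROW_FIELDS, line.split(",")))
        for line in text[match.end():].splitlines()
        if line
    ]
    return meta, rows


meta, rows = split_file(STRUCTURE, TEXT)
print(meta)
#> {'operator_name': 'Jane Smith', 'country': 'NZ'}
print(rows[0])
#> {'id': '1', 'time': '20:04:05', 'duration_sec': '12.2', 'value': '2098'}
```

Every value is still a string at this point; in pyrsona the conversion to int, float or time is the job of meta_model and row_model.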

Another Example

Should the file structure change at some point in the future, a new model can be created based on the original model. This is referred to as a sub-model, where the original model is the parent model.

Given the slightly modified file structure of new_example.txt:

operator name: Jane Smith
country: NZ
city: Auckland
year: 2022

ID,Time,Duration (sec),Reading
1,20:04:05,12.2,2098
2,20:05:00,2.35,4328

Attempting to parse this file using the original ExampleStructure model will raise a PyrsonaError due to the addition of the 'city: Auckland' line. To parse the file successfully and capture the new 'city' field, the following sub-model should be defined:

from pyrsona import BaseStructure
from pydantic import BaseModel
from datetime import time


class NewExampleStructure(ExampleStructure):

    structure = (
        "operator name: {operator_name}\n"
        "country: {country}\n"
        "city: {city}\n"
        "year: {}\n"
        "\n"
        "ID,Time,Duration (sec),Reading\n"
    )

    class meta_model(BaseModel):
        operator_name: str
        country: str
        city: str

ExampleStructure is still used as the entry point; however, pyrsona will attempt to parse the file using any sub-models that exist (in this case NewExampleStructure) before using ExampleStructure itself.

meta, table_rows, structure_id = ExampleStructure.read("new_example.txt")

print(meta)
#> {'operator_name': 'Jane Smith', 'country': 'NZ', 'city': 'Auckland'}

print(table_rows)
#> [{'id': 1, 'time': datetime.time(20, 4, 5), 'duration_sec': 12.2,
# 'value': 2098.0}, {'id': 2, 'time': datetime.time(20, 5),
# 'duration_sec': 2.35, 'value': 4328.0}]

print(structure_id)
#> NewExampleStructure

What's going on here:

  • A new pyrsona file structure model is defined based on the original ExampleStructure model. This means that structure, meta_model and row_model will be inherited from ExampleStructure. This also provides a single entry point (i.e. ExampleStructure.read()) when attempting to read the different file versions.

  • structure and meta_model are redefined to include the new "city: Auckland" meta data line. Alternatively, the original meta_model in ExampleStructure could have been updated to include an optional city field.

Post-processors

It is sometimes necessary to modify some of the data following parsing by the meta_model and row_model. Two post-processing methods are available for this purpose.

Using the ExampleStructure class above, meta_postprocessor and table_postprocessor static methods are defined for post-processing the meta data and table_rows, respectively:

class ExampleStructure(BaseStructure):

    # Lines omitted for brevity

    @staticmethod
    def meta_postprocessor(meta):
        meta["version"] = 3
        return meta

    @staticmethod
    def table_postprocessor(table_rows, meta):
        # Add a cumulative total and delete the "id" field:
        total = 0
        for row in table_rows:
            total += row["value"]
            row["total"] = total  # running sum of "value" up to this row
            del row["id"]
        return table_rows

The meta data and table_rows are now run through the post-processing stages before being returned, resulting in the following changes:

  • A new version field is added to the meta data.
  • The id field is deleted from the table_rows and a cumulative total field is added.

meta, table_rows, structure_id = ExampleStructure.read("example.txt")

print(meta)
#> {'operator_name': 'Jane Smith', 'country': 'NZ', 'version': 3}

print(table_rows)
#> [{'time': datetime.time(20, 4, 5), 'duration_sec': 12.2, 'value': 2098.0,
# 'total': 2098.0}, {'time': datetime.time(20, 5), 'duration_sec': 2.35, 'value': 4328.0,
# 'total': 6426.0}]

print(structure_id)
#> ExampleStructure
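Because the post-processors are static methods that take and return plain Python objects, their logic can be tried out on hand-built data without reading a file. The sketch below copies the table_postprocessor logic into a standalone function; the sample rows are typed in by hand to mirror example.txt and are not produced by pyrsona.

```python
from datetime import time


def table_postprocessor(table_rows, meta):
    # Add a cumulative total and delete the "id" field.
    total = 0
    for row in table_rows:
        total += row["value"]
        row["total"] = total
        del row["id"]
    return table_rows


# Hand-built rows in the shape that row_model would produce.
rows = [
    {"id": 1, "time": time(20, 4, 5), "duration_sec": 12.2, "value": 2098.0},
    {"id": 2, "time": time(20, 5), "duration_sec": 2.35, "value": 4328.0},
]

processed = table_postprocessor(rows, meta={})
print([row["total"] for row in processed])
#> [2098.0, 6426.0]
print("id" in processed[0])
#> False
```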

Array data in field

Sometimes the table rows contain array data that is not easily converted to a pydantic model. In this case, the row_model can be omitted and the table_postprocessor method can be used to convert the table rows into a more suitable format.

class ExampleStructure(BaseStructure):

    structure = (
        "operator name: {operator_name}\n"
        "country: {country}\n"
        "year: {}\n"
        "\n"
        "ID,Time,Duration (sec),Reading\n"
    )

    class meta_model(BaseModel):
        operator_name: str
        country: str

    @staticmethod
    def table_postprocessor(table_rows, meta):

        class row_model(BaseModel):
            id: int
            array_data: list[str]

        ids = [row[0] for row in table_rows]
        array_data = [row[1:] for row in table_rows]

        table_rows = [
            row_model(id=row_id, array_data=row_array_data).dict()
            for row_id, row_array_data in zip(ids, array_data)
        ]

        return table_rows

With row_model left undefined, each table row is passed to table_postprocessor as a list of strings. The table_postprocessor method can then convert the rows into a more suitable format using custom logic.

print(table_rows)
#> [{'id': 1, 'array_data': ['20:04:05', '12.2', '2098']},
# {'id': 2, 'array_data': ['20:05:00', '2.35', '4328']}]

Extra details

All meta lines MUST be included

While the parse package allows a wildcard '{}' to be used to ignore several lines, this can cause a named field to be unexpectedly included in the wildcard section. pyrsona therefore checks for the presence of a new line character '\n' in the named field values and fails if one is found.
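To see why this check matters, consider a pattern that leaves out the country line. The sketch below uses a regular expression as a stand-in for the parse package (an assumption made to keep the example dependency-free), and `has_multiline_field` is a hypothetical helper, not a pyrsona function.

```python
import re

TEXT = "operator name: Jane Smith\ncountry: NZ\nyear: 2022\n"

# The "country" line is missing from this pattern, so the named field
# has to stretch across a line break for the text to match.
PATTERN = re.compile(
    r"operator name: (?P<operator_name>.+?)\nyear: .+?\n",
    re.DOTALL,
)


def has_multiline_field(fields):
    """Hypothetical check: True if any named value contains a new line."""
    return any("\n" in value for value in fields.values())


fields = PATTERN.match(TEXT).groupdict()
print(fields)
#> {'operator_name': 'Jane Smith\ncountry: NZ'}
print(has_multiline_field(fields))
#> True
```

The match succeeds, but operator_name has silently absorbed the country line. Rejecting any named value that contains '\n' turns this quiet mismatch into an explicit failure.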

Sub-sub-models

Calling the read() method will first build a list of pyrsona file structure models from the parent model down.

Any sub-models of the parent model will themselves be checked for sub-models, meaning that every model in the tree below the parent model will be used when attempting to parse a file.

Each branch of models will be ordered bottom-up so that the deepest nested model in a branch will be used first. The parent model will be the final model used if all others fail.
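The ordering described above can be illustrated with Python's built-in `__subclasses__()` method. This is a sketch under the assumption that sub-models are discovered through ordinary class inheritance; the `ordered_models` helper and the class names are hypothetical and are not part of pyrsona's API.

```python
class Parent:
    pass


class ChildA(Parent):
    pass


class ChildB(Parent):
    pass


class GrandchildA(ChildA):
    pass


def ordered_models(model):
    """Return every model in the tree, deepest first, with `model` last."""
    ordered = []
    for sub_model in model.__subclasses__():
        # Each branch is expanded before its own root is appended.
        ordered.extend(ordered_models(sub_model))
    ordered.append(model)
    return ordered


names = [m.__name__ for m in ordered_models(Parent)]
print(names)
#> ['GrandchildA', 'ChildA', 'ChildB', 'Parent']
```

A reader built on this list would try each model in turn and stop at the first one whose structure matches the file.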

Model names

The read() method returns a structure_id variable that matches the model name. This structure_id can be useful when creating automated tests that sit alongside the pyrsona models as it provides a mechanism for confirming that a text file was parsed using the expected pyrsona model where multiple sub-models exist.

As the number of sub-models grows, a naming convention becomes more important. One option is to name each sub-model with a random hexadecimal value prefixed with a single underscore (in case the value begins with a digit), e.g. '_a4c15356'. The initial underscore will be removed from the model name when returning the structure_id value.
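One way to generate such names is sketched below using the standard-library secrets module. Both helpers are hypothetical; in particular, `structure_id_from` only mirrors the underscore removal described above and assumes that a single leading underscore is stripped.

```python
import secrets


def random_model_name():
    """Hypothetical helper: '_' followed by 8 random hexadecimal digits."""
    return "_" + secrets.token_hex(4)


def structure_id_from(name):
    # Assumption: only one leading underscore is removed.
    return name[1:] if name.startswith("_") else name


name = random_model_name()
# The underscore keeps the name a valid identifier even when the
# hexadecimal value starts with a digit.
print(name.isidentifier())
#> True
print(structure_id_from("_a4c15356"))
#> a4c15356
```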

parse formats

The parse package allows format specifications to be included alongside the fields, E.g. '{year:d}'. While including these format types in the structure definition is valid, more complex format conversions can be made using meta_model. Keeping all format conversions in meta_model means that all conversions are defined in one place.