This user guide provides an overview of the data transformation program and explains how to use each transformation function. The program is designed to perform various data transformations on a Pandas DataFrame.

Introduction
Pipeline
- Example
- Pipeline Definitions
Processing
- Checks
- Transformations
  - create
  - duplicate
  - split
  - split_pair
  - replace
  - replace_text
  - merge
  - drop
  - filter_records
  - map_value
  - convert_case
  - rename
  - sort

You can use these links to navigate directly to each section in your user guide.

Introduction

ChatGPT: https://chat.openai.com/?model=text-davinci-002-render-sha

The data transformation program provides a collection of functions to manipulate and transform data in a Pandas DataFrame. These functions allow you to split, replace, merge, filter, rename, map values, convert case, copy columns, sort, and perform data type and blank value checks.

Pipeline

Example

- merge:
    input_file1: input1.csv
    input_file2: input2.csv
    output_file: merged.csv
    key_column: SSN

- process:
    input_file: merged.csv
    transformation_file: transformations.yml
    transformations: 
      ...
    output_file: processed.csv

- split:
    input_file: processed.csv
    output_definitions:
      split1.csv: [SSN, Name, Age, Email, ZipCode, Address]
      split2.csv: [SSN, Name, Phone, Sex]

Pipeline Definitions

The Pipeline allows you to perform various transformations on files. It uses the pandas library for data manipulation and a YAML file to define the transformations. This guide will walk you through the usage of the provided functions and demonstrate an example usage.

merge file

Merges two files based on a common key column.

Syntax:

- merge:
    input_file1: <input_file1>
    input_file2: <input_file2>
    output_file: <output_file>
    key_column: <key_column>

Parameters::

input_file1: Path to the first input file.
input_file2: Path to the second input file.
output_file: Path to the output merged file.
key_column: The common key column used for merging.

Example:

- merge:
    input_file1: input1.csv
    input_file2: input2.csv
    output_file: merged.csv
    key_column: SSN

split file

Splits a file into multiple output files based on specified columns.

Syntax:

- split:
    input_file: <input_file>
    output_definitions:
      <output_file>: [<column>, <column>, ...]
      <output_file>: [<column>, <column>, ...]
      ...

Parameters::

input_file: Path to the input file.
output_definitions: A dictionary where keys are output file paths, and values are lists of columns to include in each output file.
- output_file: Path to the output merged file.
- column: list - The list of column names to include in the output file.

Example:

- split:
    input_file: processed.csv
    output_definitions:
      split1.csv: [SSN, Name, Age, Email, ZipCode, Address]
      split2.csv: [SSN, Name, Phone, Sex]

process file

Processes a file using transformations defined in a YAML file.

Syntax:

- process:
    input_file: <input_file>
    transformation_file: <transformation_file>
    transformations: 
      ...
    output_file:  <output_file>

Parameters:

input_file: Path to the input file.
transformation_file: Path to the YAML file containing transformation definitions.
output_file: Path to the output processed file.
transformations: see Processing

Example:

- process:
    input_file: merged.csv
    transformation_file: transformations.yml
    output_file: processed.csv

Processing

Checks

check_data_type

This function checks the data types of specified columns in the DataFrame.

Syntax:

- check_data_type:
    mapping:
      <column>: <data_type>
      ...

Parameters:

mapping: dict - The dictionary specifying the column names as keys and the expected data types as values. The data types can be any valid Pandas data types.
- data_type options: object | int64 | datetime64

Example:

- check_data_type:
    mapping:
      Name: object
      Age: int64
      Email: object
      City: object
      Address: object
      Sex: object

check_not_blank

This function checks if specified columns have any blank values (NaN or empty strings).

Syntax:

- check_not_blank:
    columns: [<column>, <column>, ...]

Parameters:

columns: list - The list of column names to check for blank values.

Example:

- check_not_blank:
    columns: [Name, Age, Phone, City, Address, ZipCode, Sex]

checks (combined)

Syntax:

- checks:
    - type: length
      column: <column>
      max_length: <max_length>
    - type: values
      column: <column>
      valid_values: [<value>, >value>, ...]
    - type: not_blank
      columns: [<column>, <column>, ...]
    - type: range
      column: <column>
      min_value: <min_value>
      max_value: <max_value>

Parameters:

type: length | values | not_blank | range

Example:

- checks:
    - type: length
      column: Name
      max_length: 100
    - type: values
      column: Sex
      valid_values:
        - M
        - F
    - type: not_blank
      columns:
        - Name
    - type: range
      column: Age
      min_value: 18
      max_value: 39

Transformations

create

Syntax:

- create:
    column: <new_column_name>
    data_type: <data_type>
    default_value: <default_value>

Parameters:

column: The name of the new column to create.
data_type: The desired data type for the new column.
default_value (optional): The default value to assign to the new column. If not provided, the column will be left as blank.

Example:

- create:
    column: Income
    data_type: int64
    default_value: 3000

duplicate

This function copies the values from specified columns to new columns with different names.

Syntax:

- duplicate:
    mapping:
      <column>: <new_column_name>
      ...

Parameters:

mapping: dict - The dictionary specifying the original column names as keys and the new column names as values.
- column: column to copy
- new_column_name: name for the new column

Example:

- duplicate:
    mapping:
      Name: Name_new
      Age: Age_new
      Email: Email_new
      Phone: Phone_new
      City: City_new
      Address: Address_new
      ZipCode: ZipCode_new
      Sex: Sex_new

split

This function splits the values in a specified column into mulitple parts based on a separator and creates new columns with the split values. Each new column created will be named '{column}_{index}'.

Syntax:

- split:
    column: <column>
    separator: <separator_text>

Parameters:

column: str - The column name to split. Muliple columns may be created.
separator: str - The separator to split the values.

Example:

- split:
    column: City
    separator: " "

split_pair

This function splits the values in a specified column into two parts based on a separator and creates new columns with the split values.

Syntax:

- split_pair:
    column: <column>
    separator: <separator_text>
    first:  True | False

Parameters:

column: str - The column name to split. At most 2 columns will be created.
separator: str - The separator to split the values.
first: bool (optional) - Indicates whether to split from the first occurrence of the separator. Default is True.

Example:

- split_pair:
    column: Name
    separator: " "
    first: false

replace

This function replaces values in a specified column with a new value.

Syntax:

- replace:
    column: <column>
    match: <match>
    replacement: <replacement_value>

Parameters:

column: str - The column name to replace values.
match: str - The value to match.
replacement: str - The new value to replace with.

Example:

- replace:
    column: Phone_new
    match: "555"
    replacement: "XXX"

replace_text

This function replaces a portion of text in a specified column with a replacement character.

Syntax:

- replace_text:
    column: <column>
    start_position: <start_position>
    end_position: <end_position>
    replacement: <replacement_text>
    start: True | False

Parameters:

column: str - The column name to replace text.
start_position: int - The starting position of the text to replace.
end_position: int - The ending position of the text to replace.
replacement: str - The character to use for replacement.
start: bool (optional) - Indicates whether the start position is counted from the beginning or end of the text. Default is True.

Example:

- replace_text:
    column: Phone_new
    start_position: 1
    end_position: 2
    replacement: "^"
    start: false

merge

This function merges the values of multiple columns into a single column.

Syntax:

- merge:
    columns: [<column>, <column>, ...]
    separator: <separator_text>
    output_column: <new_column_name>

Parameters:

columns: list - The list of column names to merge. You can merge multiple columns.
output_column: str (optional) - The name of the output column. If not provided, the merged column name will be generated.
separator: str (optional) - The separator to use between merged values. Default is a space (' ').

Example:

- merge:
    columns: [Address, City, ZipCode]
    separator: ", "
    output_column: "Address City ZipCode"

drop

This function drop columns from the table.

Syntax:

- drop:
    columns: [<column>, <column>, ...]

Parameters:

columns: list - The list of column names to be dropped.

Example:

- drop:
    columns: [Age, Name]

filter_records

This function filters the DataFrame based on a condition. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html for condition

Syntax:

- filter_records:
    condition: <condition>

Parameters:

condition: The filtering condition written as a string.

Example:

- filter_records:
    condition: "(Age <= 30) and (ZipCode >= 50000)"

map_value

This function maps the values in a specified column to new values based on a mapping dictionary.

- map_value:
    column: <column>
    default_value: <default_value>
    mapping:
      <match>: <replacement>
      ...

Parameters:

column: str - The column name to map values.
default_value: Any (optional) - The default value to fill for unmapped values. If not provided, the unmapped values will be filled with NaN.
mapping: dict - The dictionary mapping the original values to new values.
- match: str - The value to match.
- replacement: str - The new value to replace with.

Example:

- map_value:
    column: City_new
    default_value: Unknown
    mapping:
      New York: NY
      Los Angeles: LA
      Chicago: CHI

convert_case

This function converts the case of values in specified columns.

Syntax:

- convert_case:
    mapping:
      <column>: <case_type>
      ...

Parameters:

mapping: dict - The dictionary specifying the column names and the desired case type. The case types can be 'uppercase', 'lowercase', 'titlecase', or 'sentencecase'.
- column: column of which value to convert
- case_type: lowercase | uppercase | titlecase | sentencecase

Example:

- convert_case:
    mapping:
      Name_1: uppercase
      Name_2: titlecase

rename

This function renames the columns of the DataFrame.

Syntax:

- rename:
    mapping:
      <column>: <new_column_name>
      ...

Parameters:

mapping: dict - The dictionary specifying the column name mappings.
- column: current name
- new_column_name: new name

Example:

- rename:
    mapping:
      Name_2: Last_Name
      Name_1: First_Name

sort

This function sorts the DataFrame based on one or more columns.

Syntax:

- sort:
    mapping:
      <column>: <ascending?>
      ...

Parameters:

mapping: dict - The dictionary specifying the column names as keys and the sort orders (True for ascending, False for descending) as values. You can sort mulitple columns.
- column: column to sort
- ascending: sort ascending for this column?

Example:

- sort:
    mapping:
      Age: false
      Name: true

ongbt/ETL

Table of Contents

Introduction

Pipeline

Example

Pipeline Definitions

merge file

Syntax:

Parameters::

Example:

split file

Syntax:

Parameters::

Example:

process file

Syntax:

Parameters:

Example:

Processing

Checks

check_data_type

Syntax:

Parameters:

Example:

check_not_blank

Syntax:

Parameters:

Example:

checks (combined)

Syntax:

Parameters:

Example:

Transformations

create

Syntax:

Parameters:

Example:

duplicate

Syntax:

Parameters:

Example:

split

Syntax:

Parameters:

Example:

split_pair

Syntax:

Parameters:

Example:

replace

Syntax:

Parameters:

Example:

replace_text

Syntax:

Parameters:

Example:

merge

Syntax:

Parameters:

Example:

drop

Syntax:

Parameters:

Example:

filter_records

Syntax:

Parameters:

Example:

map_value

Parameters:

Example:

convert_case

Syntax:

Parameters:

Example:

rename

Syntax:

Parameters:

Example: