This user guide provides an overview of the data transformation program and explains how to use each transformation function. The program is designed to perform various data transformations on a Pandas DataFrame.
You can use these links to navigate directly to each section in your user guide.
ChatGPT: https://chat.openai.com/?model=text-davinci-002-render-sha
The data transformation program provides a collection of functions to manipulate and transform data in a Pandas DataFrame. These functions allow you to split, replace, merge, filter, rename, map values, convert case, copy columns, sort, and perform data type and blank value checks.
- merge:
input_file1: input1.csv
input_file2: input2.csv
output_file: merged.csv
key_column: SSN
- process:
input_file: merged.csv
transformation_file: transformations.yml
transformations:
...
output_file: processed.csv
- split:
input_file: processed.csv
output_definitions:
split1.csv: [SSN, Name, Age, Email, ZipCode, Address]
split2.csv: [SSN, Name, Phone, Sex]
The Pipeline allows you to perform various transformations on files. It uses the pandas library for data manipulation and a YAML file to define the transformations. This guide will walk you through the usage of the provided functions and demonstrate an example usage.
Merges two files based on a common key column.
- merge:
input_file1: <input_file1>
input_file2: <input_file2>
output_file: <output_file>
key_column: <key_column>
- input_file1: Path to the first input file.
- input_file2: Path to the second input file.
- output_file: Path to the output merged file.
- key_column: The common key column used for merging.
- merge:
input_file1: input1.csv
input_file2: input2.csv
output_file: merged.csv
key_column: SSN
Splits a file into multiple output files based on specified columns.
- split:
input_file: <input_file>
output_definitions:
<output_file>: [<column>, <column>, ...]
<output_file>: [<column>, <column>, ...]
...
- input_file: Path to the input file.
- output_definitions: A dictionary where keys are output file paths, and values are lists of columns to include in each output file.
- output_file: Path to the output merged file.
- column: list - The list of column names to include in the output file.
- split:
input_file: processed.csv
output_definitions:
split1.csv: [SSN, Name, Age, Email, ZipCode, Address]
split2.csv: [SSN, Name, Phone, Sex]
Processes a file using transformations defined in a YAML file.
- process:
input_file: <input_file>
transformation_file: <transformation_file>
transformations:
...
output_file: <output_file>
- input_file: Path to the input file.
- transformation_file: Path to the YAML file containing transformation definitions.
- output_file: Path to the output processed file.
- transformations: see Processing
- process:
input_file: merged.csv
transformation_file: transformations.yml
output_file: processed.csv
This function checks the data types of specified columns in the DataFrame.
- check_data_type:
mapping:
<column>: <data_type>
...
- mapping: dict - The dictionary specifying the column names as keys and the expected data types as values. The data types can be any valid Pandas data types.
- data_type options: object | int64 | datetime64
- check_data_type:
mapping:
Name: object
Age: int64
Email: object
City: object
Address: object
Sex: object
This function checks if specified columns have any blank values (NaN or empty strings).
- check_not_blank:
columns: [<column>, <column>, ...]
- columns: list - The list of column names to check for blank values.
- check_not_blank:
columns: [Name, Age, Phone, City, Address, ZipCode, Sex]
- checks:
- type: length
column: <column>
max_length: <max_length>
- type: values
column: <column>
valid_values: [<value>, >value>, ...]
- type: not_blank
columns: [<column>, <column>, ...]
- type: range
column: <column>
min_value: <min_value>
max_value: <max_value>
- type: length | values | not_blank | range
- checks:
- type: length
column: Name
max_length: 100
- type: values
column: Sex
valid_values:
- M
- F
- type: not_blank
columns:
- Name
- type: range
column: Age
min_value: 18
max_value: 39
- create:
column: <new_column_name>
data_type: <data_type>
default_value: <default_value>
- column: The name of the new column to create.
- data_type: The desired data type for the new column.
- default_value (optional): The default value to assign to the new column. If not provided, the column will be left as blank.
- create:
column: Income
data_type: int64
default_value: 3000
This function copies the values from specified columns to new columns with different names.
- duplicate:
mapping:
<column>: <new_column_name>
...
- mapping: dict - The dictionary specifying the original column names as keys and the new column names as values.
- column: column to copy
- new_column_name: name for the new column
- duplicate:
mapping:
Name: Name_new
Age: Age_new
Email: Email_new
Phone: Phone_new
City: City_new
Address: Address_new
ZipCode: ZipCode_new
Sex: Sex_new
This function splits the values in a specified column into mulitple parts based on a separator and creates new columns with the split values. Each new column created will be named '{column}_{index}'.
- split:
column: <column>
separator: <separator_text>
- column: str - The column name to split. Muliple columns may be created.
- separator: str - The separator to split the values.
- split:
column: City
separator: " "
This function splits the values in a specified column into two parts based on a separator and creates new columns with the split values.
- split_pair:
column: <column>
separator: <separator_text>
first: True | False
- column: str - The column name to split. At most 2 columns will be created.
- separator: str - The separator to split the values.
- first: bool (optional) - Indicates whether to split from the first occurrence of the separator. Default is True.
- split_pair:
column: Name
separator: " "
first: false
This function replaces values in a specified column with a new value.
- replace:
column: <column>
match: <match>
replacement: <replacement_value>
- column: str - The column name to replace values.
- match: str - The value to match.
- replacement: str - The new value to replace with.
- replace:
column: Phone_new
match: "555"
replacement: "XXX"
This function replaces a portion of text in a specified column with a replacement character.
- replace_text:
column: <column>
start_position: <start_position>
end_position: <end_position>
replacement: <replacement_text>
start: True | False
- column: str - The column name to replace text.
- start_position: int - The starting position of the text to replace.
- end_position: int - The ending position of the text to replace.
- replacement: str - The character to use for replacement.
- start: bool (optional) - Indicates whether the start position is counted from the beginning or end of the text. Default is True.
- replace_text:
column: Phone_new
start_position: 1
end_position: 2
replacement: "^"
start: false
This function merges the values of multiple columns into a single column.
- merge:
columns: [<column>, <column>, ...]
separator: <separator_text>
output_column: <new_column_name>
- columns: list - The list of column names to merge. You can merge multiple columns.
- output_column: str (optional) - The name of the output column. If not provided, the merged column name will be generated.
- separator: str (optional) - The separator to use between merged values. Default is a space (' ').
- merge:
columns: [Address, City, ZipCode]
separator: ", "
output_column: "Address City ZipCode"
This function drop columns from the table.
- drop:
columns: [<column>, <column>, ...]
- columns: list - The list of column names to be dropped.
- drop:
columns: [Age, Name]
This function filters the DataFrame based on a condition. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html for condition
- filter_records:
condition: <condition>
- condition: The filtering condition written as a string.
- filter_records:
condition: "(Age <= 30) and (ZipCode >= 50000)"
This function maps the values in a specified column to new values based on a mapping dictionary.
- map_value:
column: <column>
default_value: <default_value>
mapping:
<match>: <replacement>
...
- column: str - The column name to map values.
- default_value: Any (optional) - The default value to fill for unmapped values. If not provided, the unmapped values will be filled with NaN.
- mapping: dict - The dictionary mapping the original values to new values.
- match: str - The value to match.
- replacement: str - The new value to replace with.
- map_value:
column: City_new
default_value: Unknown
mapping:
New York: NY
Los Angeles: LA
Chicago: CHI
This function converts the case of values in specified columns.
- convert_case:
mapping:
<column>: <case_type>
...
- mapping: dict - The dictionary specifying the column names and the desired case type. The case types can be 'uppercase', 'lowercase', 'titlecase', or 'sentencecase'.
- column: column of which value to convert
- case_type: lowercase | uppercase | titlecase | sentencecase
- convert_case:
mapping:
Name_1: uppercase
Name_2: titlecase
This function renames the columns of the DataFrame.
- rename:
mapping:
<column>: <new_column_name>
...
- mapping: dict - The dictionary specifying the column name mappings.
- column: current name
- new_column_name: new name
- rename:
mapping:
Name_2: Last_Name
Name_1: First_Name
This function sorts the DataFrame based on one or more columns.
- sort:
mapping:
<column>: <ascending?>
...
- mapping: dict - The dictionary specifying the column names as keys and the sort orders (True for ascending, False for descending) as values. You can sort mulitple columns.
- column: column to sort
- ascending: sort ascending for this column?
- sort:
mapping:
Age: false
Name: true