Custom dataflows processors and goodtables checks for BCODMO
To run the dpp command locally using the custom processors located in this repository, simply clone this repository and set the environment variable `DPP_PROCESSOR_PATH`. If this repository is located at `$PROCESSOR_REPO`, the environment variable should be `$PROCESSOR_REPO/bcodmo_processors`.
You can add environment variables manually using `export DPP_PROCESSOR_PATH=$PUT_PATH_HERE`, or you can place all of your environment variables in a `.env` file and run the following commands:

    set -a
    source .env
Now, when using `dpp`, it will look inside this repository first when resolving processors.
If you want to get rid of the `bcodmo_pipeline_processors` prefix you can instead set `DPP_PROCESSOR_PATH` to `$PROCESSOR_REPO/bcodmo_processors/bcodmo_pipeline_processors`.
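For illustration only, the processor path determines how a step references these processors in a `pipeline-spec.yaml` (the step below is a hypothetical sketch):

```yaml
# with DPP_PROCESSOR_PATH=$PROCESSOR_REPO/bcodmo_processors
- run: bcodmo_pipeline_processors.load

# with DPP_PROCESSOR_PATH=$PROCESSOR_REPO/bcodmo_processors/bcodmo_pipeline_processors
- run: load
```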
See https://github.com/frictionlessdata/datapackage-pipelines for documentation for standard processors.
Loads data into the package. Similar to the standard processor load.
Parameters:
- All parameters from the standard processor load
- `missing_values` - a list of values that are interpreted as missing data (nd) values. Defaults to `['']`
- `use_filename` - use the filename as the resource name
- `input_seperator` - the string used to separate values in the `from` and `name` parameters for loading multiple resources with one processor. Defaults to `','`
- `remove_empty_rows` - a boolean determining if empty rows (rows where all values are the empty string or None) are deleted. Defaults to false
- `sheet_regex` - a boolean determining if the sheet name from an xlsx/xls file should be processed with a regular expression
Other differences from the standard load:
- `from` and `name` can be a delimiter-separated list of sources/resource names
- `name` is ignored if `use_filename` is set to True
- additional bcodmo-fixedwidth parser that takes in `width` and `infer` parameters
- if `name` is left empty the resource name will default to `res{n}` where n is the number of resources
- if `sheet_regex` is used, `name` will be ignored and the sheet name will be used as the resource name, unless there are multiple `from` values, in which case the name will be `{resource_name}-{sheet_name}`
- `sheet_regex` can only be used with local paths
Additional fixedwidth parser parameters:
- `width` - width between columns
- `infer` - whether the width should be inferred
- `parse_seabird_header` - parse a .cnv seabird file. If infer is true, it will automatically set the field widths to 11 * len(header_row).
- `seabird_capture_skipped_rows` - a list of dictionaries with the keys "column_name" and "regex". The regex is applied to all skipped rows and matched rows are stored as a column
- `seabird_capture_skipped_rows_join` - a boolean determining if multiple matches should be joined or if they should each be in a separate column (must be used with `deduplicate_headers` if false). Defaults to true
- `seabird_capture_skipped_rows_join_string` - the string to use to join multiple matches together (defaults to `;`)
- `fixedwidth_sample_size` - the number of rows to sample to infer the width
See standard processor for examples.
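As an illustration, a pipeline-spec.yaml step for this processor might look like the following sketch (the file paths and values are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.load
  parameters:
    # 'from' is a comma-separated list, so two resources are loaded by one step
    from: 'data/ctd_cast1.csv,data/ctd_cast2.csv'
    format: csv
    use_filename: true          # resource names become ctd_cast1 and ctd_cast2
    missing_values: ['', 'nd']  # treat empty strings and 'nd' as missing data
    remove_empty_rows: true
```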
Concatenate a number of streamed resources into a single resource. Similar to the standard processor concatenate.
Parameters:
- All parameters from the standard processor concatenate
- `include_source_name` - whether or not a source name should be included as a field in the resulting resource. Can be one of False, 'resource', 'path' or 'file'.
  - 'resource' will add the resource name as a field
  - 'path' will add the path or url that was originally used to load the resource
  - 'file' will add the filename from the original resource
- `source_field_name` - the name of the new field created by `include_source_name`. Required if `include_source_name` is not False
See standard processor for examples.
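A sketch of a concatenate step (resource and field names are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.concatenate
  parameters:
    sources: 'ctd_cast*'            # standard concatenate parameters
    target:
      name: all_casts
    fields:
      depth: []
      temperature: []
    include_source_name: file       # add the originating filename to each row
    source_field_name: source_file  # required because include_source_name is not False
```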
Dump data to a path. Similar to the standard processor dump_to_path.
Parameters:
- All parameters from the standard processor dump_to_path
- `save_pipeline_spec` - whether or not the pipeline-spec.yaml file should also be saved in dump_to_path. Note that the entire pipeline's pipeline-spec.yaml file will be saved, regardless of where in the pipeline the dump_to_path processor lives.
- `data_manager` - the name and orcid of the data manager who developed this pipeline. An object with a name key and an orcid key.
Other differences from the standard dump_to_path:
- an attempt is made to change file permissions to 775
- carriage returns (`\r`) at line endings are removed
See standard processor for examples.
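A sketch of a dump step (the output path and data manager values are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.dump_to_path
  parameters:
    out-path: output/datapackage    # standard dump_to_path parameter
    save_pipeline_spec: true        # also write the pipeline-spec.yaml into the dump
    data_manager:
      name: Jane Doe                # hypothetical
      orcid: 0000-0000-0000-0000    # hypothetical
```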
Add a field computed using boolean values.
Parameters:
- `resources` - a list of resources to perform this operation on
- `fields` - a list of new fields
  - `functions` - a list of functions for the new field
    - `boolean` - the boolean string for this function. See notes for details on boolean strings
    - `value` - the value to set if the boolean string evaluates as true
    - `math_operation` - a boolean to determine whether or not the value should be evaluated as a mathematical expression
  - `target` - the name of the new field to be created
  - `type` - the type of the new field to be created
Notes:
- the boolean string is made up of a number of conditions and boolean comparisons. A condition is made up of a comparison term, an operator, and a comparison term. A comparison term can be a date, a number, a variable (contained within curly braces {}), a regular expression (contained within re''), a string (contained within single quotes ''), LINE_NUMBER, or null (can be one of None, NONE, null or NULL). An operator can be one of >=, >, <=, <, != or ==. A boolean comparison can be any one of AND, and, &&, OR, or, ||. All terms and operators must be separated by spaces.
- For example:
  - `{lat} > 50 && {depth} != NULL`
  - `{species} == 's. pecies' OR {species} == NULL`
- functions are evaluated in the order they are passed in, so if function 0 and function 3 both evaluate as true for row 30, the value from function 3 will show up in row 30.
- regular expressions can only be used with the == and != operators and must be compared to a string.
- use curly braces {} to match field names in the row
- if `math_operation` is set to true, the operators +, -, *, / and ^ can be used to set the value to the result of a mathematical operation. Order of operations is as expected and parentheses can be used to group operations.
- values will be interpreted based on the type. If a field of type string looks like '5313' it will not equal the number 5313, but rather only the string '5313'.
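For example, a sketch of a step that flags deep high-latitude rows (the resource and field names are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.boolean_add_computed_field
  parameters:
    resources: [all_casts]
    fields:
      - target: deep_arctic          # new field to create
        type: boolean
        functions:
          - boolean: '{lat} > 50 && {depth} != NULL'
            value: true
```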
Filter rows with a boolean statement
Parameters:
- `resources` - a list of resources to perform this operation on
- `boolean_statement` - a single boolean statement. Only rows that pass the statement will be kept. See `boolean_add_computed_field` for details on boolean syntax.
Notes:
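A sketch of a filtering step (the processor name boolean_filter_rows is assumed here; the resource and field names are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.boolean_filter_rows   # assumed processor name
  parameters:
    resources: [all_casts]
    boolean_statement: '{depth} >= 100 AND {species} != NULL'
```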
Convert any number of fields containing date information into a single date field with display format and timezone options.
Parameters:
- `resources` - a list of resources to perform this operation on
- `fields` - a list of new fields
  - `output_field` - the name of the output field
  - `output_format` - the python datetime format string for the output field
  - `input_type` - the input field type. One of 'python', 'decimalDay', 'decimalYear', 'matlab', or 'excel'. If 'python', evaluate the input field/fields using python format strings. If 'excel', only take in a single input field and evaluate it as an excel date serial number. If 'matlab', also only take in a single input field and evaluate it as a matlab datenum. If 'decimalDay', interpret the input field as a decimal between 0 and 365. Year must also be inputted. If 'decimalYear', interpret the input field as a decimal year (eg. 2015.1234). If 'decimalYear', `decimal_year_start_day` is required.
  - `input_field` - a single input field. Only use if `input_type` is 'excel'. Deprecated if `input_type` is 'python'.
  - `decimal_year_start_day` - the start day to use when interpreting a decimal year. Usually 1 or 0.
  - `boolean_statement` - a single boolean statement. Only rows that pass the statement will be impacted. See `boolean_add_computed_field` for details on boolean syntax.
  - the rest of the parameters are only relevant if `input_type` is 'python':
    - `inputs` - a list of input fields
      - `field` - the input field name
      - `format` - the format of this input field
    - `input_timezone` - the timezone to be passed to the datestring. Required if `input_format` does not have timezone information and timezone is used in the output (either through '%Z' in `output_format` or a value in `output_timezone`), otherwise optional
    - `input_timezone_utc_offset` - UTC offset in seconds. Optional
    - `output_timezone` - the output timezone
    - `output_timezone_utc_offset` - UTC offset for the output timezone. Optional
    - `input_format` - deprecated, for use with `input_field` if `input_type` is 'python'
Notes:
- The output type is string until the date type dump_to_path issue is resolved
- If the `output_field` already exists in the schema, the existing values will be overwritten
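A sketch of a date-conversion step (the processor name convert_date is assumed; field names, formats and timezones are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.convert_date   # assumed processor name
  parameters:
    resources: [all_casts]
    fields:
      - output_field: ISO_DateTime_UTC
        output_format: '%Y-%m-%dT%H:%M:%SZ'
        input_type: python
        inputs:
          - field: date_local
            format: '%m/%d/%Y %H:%M'
        input_timezone: US/Eastern
        output_timezone: UTC
```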
Convert a single field containing coordinate information from degrees-minutes-seconds or degrees-decimal_minutes to decimal_degrees.
Parameters:
- `resources` - a list of resources to perform this operation on
- `fields` - a list of new fields
  - `input_field` - the name of the input field
  - `output_field` - the name of the output field
  - `input_format` - the input format. One of 'degrees-minutes-seconds' or 'degrees-decimal_minutes'
  - `pattern` - the pattern for the input field. See notes for details
  - `directional` - the directional of the coordinate. Must be one of 'N', 'E', 'S', 'W'
  - `boolean_statement` - a single boolean statement. Only rows that pass the statement will be impacted. See `boolean_add_computed_field` for details on boolean syntax.
Notes:
- `pattern` is made up of python regular expression named groups. The possible group names are 'directional', 'degrees', 'minutes', 'seconds', and 'decimal_minutes'. If the `input_format` is 'degrees-minutes-seconds' the groups 'degrees', 'minutes' and 'seconds' are required. If the `input_format` is 'degrees-decimal_minutes' the groups 'degrees' and 'decimal_minutes' are required. The 'directional' group is always optional.
- if 'directional' is passed in both through the `pattern` and the `directional` parameter, the `directional` parameter takes precedence.
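A sketch of a coordinate conversion (the processor name convert_to_decimal_degrees is assumed; the field names and pattern are hypothetical, e.g. for values like "43 30.25 N"):

```yaml
- run: bcodmo_pipeline_processors.convert_to_decimal_degrees   # assumed processor name
  parameters:
    resources: [all_casts]
    fields:
      - input_field: lat_deg_dm
        output_field: lat_decimal
        input_format: degrees-decimal_minutes
        pattern: '(?P<degrees>\d+) (?P<decimal_minutes>[\d.]+) (?P<directional>[NSEW])'
```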
Remove any number of resources from the pipeline
Parameters:
- `resources` - the resources to remove
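A sketch (the processor name remove_resources is assumed; the resource name is hypothetical):

```yaml
- run: bcodmo_pipeline_processors.remove_resources   # assumed processor name
  parameters:
    resources: [intermediate_lookup]
```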
Rename any number of fields
Parameters:
- `resources` - a list of resources to perform this operation on
- `fields` - a list of fields
  - `old_field` - the name of the field before it is renamed
  - `new_field` - the new name for the field
Notes:
- if `new_field` already exists as a field in the resource an error will be thrown
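A sketch (the processor name rename_fields is assumed; field names are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.rename_fields   # assumed processor name
  parameters:
    resources: [all_casts]
    fields:
      - old_field: temp
        new_field: temperature_c
```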
Rename any number of fields using a regular expression
Parameters:
- `resources` - a list of resources to perform this operation on
- `fields` - a list of fields to perform this operation on
- `pattern` - the regular expression patterns to be used
  - `find` - the find pattern for the old field name
  - `replace` - the replace pattern for the new field name
Notes:
- if the field name created by `replace` already exists in the resource an error will be thrown
- regular expressions are always python regular expressions
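A sketch (the processor name rename_fields_regexp is assumed; field names and patterns are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.rename_fields_regexp   # assumed processor name
  parameters:
    resources: [all_casts]
    fields: [temp_raw, sal_raw]
    pattern:
      find: '(.*)_raw'    # python regular expression
      replace: '\1'       # strips the _raw suffix
```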
Rename a resource
Parameters:
- `old_resource` - the old name of the resource
- `new_resource` - the new name of the resource
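A sketch (the processor name rename_resource is assumed; resource names are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.rename_resource   # assumed processor name
  parameters:
    old_resource: res1
    new_resource: ctd_data
```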
Reorder the fields of a resource
Parameters:
- `resources` - a list of resources to perform this operation on
- `fields` - the new order of fields
Notes:
- if a field does not exist in the resource an error will be thrown
- if the number of passed in fields does not match the number of fields in the resource an error will be thrown
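A sketch (the processor name reorder_fields is assumed; every field in the hypothetical resource must be listed, in the desired order):

```yaml
- run: bcodmo_pipeline_processors.reorder_fields   # assumed processor name
  parameters:
    resources: [all_casts]
    fields: [cruise_id, date, lat, lon, depth, temperature]
```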
Round any number of fields
Parameters:
- `resources` - a list of resources to perform this operation on
- `fields` - a list of fields to perform this operation on
  - `name` - the name of the field to round
  - `digits` - the number of digits to round the field to
  - `preserve_trailing_zeros` - whether trailing zeros should be preserved
  - `maximum_precision` - whether values with precision lower than digits should be rounded
  - `convert_to_integer` - whether the field should be converted to an integer
- `boolean_statement` - a single boolean statement. Only rows that pass the statement will be impacted. See `boolean_add_computed_field` for details on boolean syntax.
Notes:
- As of v1.0.3, round_fields works on ONLY number types.
- if used on an incorrect field type an error will be thrown
- the `convert_to_integer` parameter will only work if `digits` is set to 0
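A sketch (resource and field names are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.round_fields
  parameters:
    resources: [all_casts]
    fields:
      - name: temperature            # must be a number field
        digits: 2
        preserve_trailing_zeros: true
```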
Split a field into any number of other fields
Parameters:
- `resources` - a list of resources to perform this operation on
- `fields` - a list of fields to perform this operation on
  - `input_field` - the name of the field to split
  - `output_fields` - the names of the output fields
  - `pattern` - the pattern to match the input_field. Use python regular expression matches (denoted by parentheses) to capture values for `output_fields`. Use pattern or delimiter.
  - `delimiter` - the regex delimiter on which to split the input_field. Use pattern or delimiter.
  - `delete_input` - whether the `input_field` should be deleted after the `output_fields` are created
- `boolean_statement` - a single boolean statement. Only rows that pass the statement will be impacted. See `boolean_add_computed_field` for details on boolean syntax.
Notes:
- all new fields will be typed as strings
- the number of matches in `pattern` must equal the number of output fields in `output_fields`
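A sketch of splitting a combined value such as "ST12-03" into two fields (the processor name split_column is assumed; names and pattern are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.split_column   # assumed processor name
  parameters:
    resources: [all_casts]
    fields:
      - input_field: station_cast
        output_fields: [station, cast]   # one output field per capture group
        pattern: '(\w+)-(\d+)'
        delete_input: true
```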
Find and replace a regular expression within a field. Same as the standard processor except for the boolean statement.
Parameters:
- All parameters from the standard processor find_replace
- `boolean_statement` - a single boolean statement. Only rows that pass the statement will be impacted. See `boolean_add_computed_field` for details on boolean syntax.
Notes:
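A sketch combining the standard find_replace parameters with a boolean statement (field names and patterns are hypothetical):

```yaml
- run: bcodmo_pipeline_processors.find_replace
  parameters:
    fields:
      - name: species
        patterns:
          - find: 'n\.d\.'
            replace: 'nd'
    boolean_statement: '{species} != NULL'
```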
Format a string using python string formatting
Parameters:
- `resources` - a list of resources to perform this operation on
- `fields` - a list of fields to perform this operation on
  - `input_field` - the name of the input field
  - `output_field` - the name of the output field
  - `input_string` - the input string to be used in the python format function
  - `input_fields` - the fields to be passed as arguments in the python format function
- `boolean_statement` - a single boolean statement. Only rows that pass the statement will be impacted. See `boolean_add_computed_field` for details on boolean syntax.
Notes:
- all new fields will be typed as strings
- the input types are important. If the field you're trying to format is an integer, you have to use something like {0:03d}. If it is a float (number), you have to use {0:03f}, etc.
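A sketch (the processor name string_format is assumed; field names are hypothetical; note the {1:03d} specifier because the cast field is assumed to be an integer):

```yaml
- run: bcodmo_pipeline_processors.string_format   # assumed processor name
  parameters:
    resources: [all_casts]
    fields:
      - output_field: station_label
        input_string: 'Station {0} cast {1:03d}'
        input_fields: [station, cast]
```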
Extract nonnumeric values from fields into a new field
Parameters:
- `fields` - a list of strings
- `suffix` - the suffix to be added to create a field where the string is stored
- `boolean_statement` - a single boolean statement. Only rows that pass the statement will be impacted. See `boolean_add_computed_field` for details on boolean syntax.
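A sketch (the processor name extract_nonnumeric is assumed; the field name is hypothetical; with this suffix the nonnumeric values from depth would presumably end up in a field like depth_flag):

```yaml
- run: bcodmo_pipeline_processors.extract_nonnumeric   # assumed processor name
  parameters:
    fields: [depth]
    suffix: '_flag'
```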
Notes: