
Work in progress towards an implementation of the Fix language for Metafacture


Metafacture Fix

Metafacture Fix (Metafix) is a work in progress towards tools and an implementation of the Fix language for Metafacture, intended as an alternative to configuring data transformations with Metamorph. Inspired by Catmandu Fix, Metafix processes metadata not as a continuous data stream but as discrete records. The basic idea is to rebuild constructs from the (Catmandu) Fix language, such as functions, selectors and binds, in Java and to combine them with additional functionality from the Metamorph toolbox.

See also Fix Interest Group for an initiative towards an implementation-independent specification for the Fix Language.

This repo contains the actual implementation of the Fix language as a Metafacture module and related components. It started as an Xtext web project with a Fix grammar, from which a parser, a web editor, and a language server are generated. The repo also contains an extension for VS Code/Codium based on that language server. (The web editor has effectively been replaced by the Metafacture Playground, but remains here for its integration into the language server, which we want to move over to the playground.)

Setup

Build

Note: If you're using Windows, configure Git option core.autocrlf before cloning: git config --global core.autocrlf false.

Clone the Git repository:

git clone https://github.com/metafacture/metafacture-fix.git

Go to the Git repository root:

cd metafacture-fix/

Run the tests (in metafix/src/test/java) and checks (.editorconfig, config/checkstyle/checkstyle.xml):

./gradlew clean check

(To import the projects in Eclipse, choose File > Import > Existing Gradle Project and select the metafacture-fix directory.)

Usage

The repo contains and uses a new Metafix stream module for Metafacture which plays the role of the Metamorph module in Fix-based Metafacture workflows. For the current implementation of the Metafix stream module see the tests in metafix/src/test/java. To play around with some examples, check out the Metafacture Playground. For real-world usage samples see openRub.fix and duepublico.fix. For reference documentation, see Functions and cookbook.

Extension

The project metafix-vsc provides an extension for Visual Studio Code / Codium for Fix via the language server protocol (LSP). In its current state, the extension supports auto-completion, simple syntax highlighting, and auto-closing of brackets and quotes. This project was created using this tutorial and the corresponding example.

Build extension:

  1. Install Visual Studio Code (alternative: VS Codium)
  2. Install Node.js (including npm)
  3. In metafacture-fix/ execute:
     Unix: ./gradlew installServer
     Windows: .\gradlew.bat installServer
  4. In metafix-vsc/ execute (tip: if you use Windows, install Cygwin to execute npm commands): npm install

To start the extension in development mode (starting a second code/codium instance), follow A. To create a vsix file to install the extension permanently follow B.

A) Run in dev mode:

  1. Open metafix-vsc/ in Visual Studio Code / Codium
  2. Launch the extension by pressing F5 (opens a new Visual Studio Code window)
  3. Open a new file (with the file extension .fix) or an existing Fix file (see sample below)

B) Install vsix file:

  1. Install vsce: npm install -g vsce
  2. In metafix-vsc/ execute: vsce package
     (vsce will create a vsix file in the vsc directory, which can be used for installation.)
  3. Open VS Code / Codium
  4. Click the 'Extensions' section
  5. Click the menu bar and choose 'Install from VSIX...'

Web editor

Start the web server:

./gradlew jettyRun

Visit http://localhost:8080/, and paste this into the editor:

# Fix is a macro-language for data transformations

# Simple fixes

add_field(hello,"world")
remove_field(my.deep.nested.junk)
copy_field(stats,output.$append)

# Conditionals

if exists(error)
    set_field(is_valid, no)
    log(error)
elsif exists(warning)
    set_field(is_valid, yes)
    log(warning)
else
    set_field(is_valid, yes)
end

# Loops

do list(path)
    add_field(foo,bar)
end

# Nested expressions

do marc_each()
    if marc_has(f700)
        marc_map(f700a,authors.$append)
    end
end

Content assist is triggered with Ctrl-Space. The input above is also used in FixParsingTest.java.

Run workflows on the web server, passing data, flux, and fix:

http://localhost:8080/xtext-service/run?data='1'{'a': '5', 'z': 10}&flux=as-lines|decode-formeta|fix|encode-formeta(style="multiline")&fix=map(a,b) map(_else)

Functions and cookbook

Best practices and guidelines for working with Metafacture Fix

  • We recommend using double quotation marks for arguments and values in functions, binds and conditionals.
  • If using a list bind with a variable, the var option requires quotation marks (do list(path: "<sourceField>", "var": "<variableName>")).
  • Fix turns repeated fields into arrays internally, but only marked arrays (with [] at the end of the field name) are also emitted as "arrays" (entities with indexed literals); all other arrays are emitted as repeated fields.
  • Every Fix file should end with a final newline.

Glossary

Array wildcards

Array wildcards resemble Catmandu's concept of wildcards.

When working with arrays and repeated fields you can use wildcards instead of an index number to select elements of an array.

Wildcard    Meaning
*           Selects all elements of an array.
$first      Selects only the first element of an array.
$last       Selects only the last element of an array.
$prepend    Selects the position before the first element of an array. Can only be used when adding new elements to an array.
$append     Selects the position after the last element of an array. Can only be used when adding new elements to an array.
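
As a minimal sketch of how these wildcards appear in paths, assume a record with an array field authors[] and two plain fields, main_author and editor. All three field names are hypothetical; the [] suffix follows the note on marked arrays in the best practices above.

```fix
# Append a copy of main_author as a new last element of authors[]
copy_field("main_author", "authors[].$append")
# Insert a copy of editor before the current first element
copy_field("editor", "authors[].$prepend")
# Trim every element of the array
trim("authors[].*")
# Upcase only the first element
upcase("authors[].$first")
```

Note that $append and $prepend appear only in target paths here, in line with the restriction that they can only be used when adding new elements.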

Path wildcards

Path wildcards resemble Metamorph's concept of wildcards. They are not supported in Catmandu (it has specialized Fix functions instead).

You can use path wildcards to select fields matching a pattern. They only match path segments (field names), though, not whole paths of nested fields. These wildcards cannot be used to add new elements.

Wildcard    Meaning
*           Placeholder for zero or more characters.
?           Placeholder for exactly one character.
|           Alternation of multiple patterns.
[...]       Enumeration of characters.
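
The following sketch shows one path per wildcard. The field names are hypothetical, and the choice of field-level functions is an assumption, since this section does not state which functions accept path wildcards.

```fix
# Trim every field whose name starts with "title" (title, title_main, ...)
trim("title*")
# Upcase fields named "lang" plus exactly one more character (lang1, langA, ...)
upcase("lang?")
# Downcase either "note" or "comment"
downcase("note|comment")
# Prepend to "ref1" and "ref2", but not to "ref3"
prepend("ref[12]", "urn:")
```

Because path wildcards match only a single path segment, a pattern such as title* would not reach into nested subfields.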

Functions

Script-level functions

include

Includes a Fix file and executes it as if its statements were written in place of the function call.

Parameters:

  • path (required): Path to Fix file (if the path starts with a ., it is resolved relative to the including file's directory; otherwise, it is resolved relative to the current working directory).

Options:

  • All options are made available as "dynamic" local variables in the included Fix file.

include("<path>"[, <dynamicLocalVariables>...])
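
A minimal sketch of an include with one option, assuming two hypothetical files, main.fix and shared/provenance.fix. It also assumes that dynamic local variables are referenced in the included file with the same $[<variableName>] syntax used for other variables.

```fix
# --- main.fix ---
# The leading "." makes the path relative to the directory of main.fix;
# "source" is passed on as a dynamic local variable
include("./shared/provenance.fix", source: "catalogue A")

# --- shared/provenance.fix ---
# Executed as if written in place of the include call above
add_field("provenance", "$[source]")
```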
nothing

Does nothing. It is used for benchmarking in Catmandu.

nothing()

put_filemap

Defines an external map for lookup from a file.

put_filemap("<sourceFile>", "<mapName>", sep_char: "\t")

The separator (sep_char) will vary depending on the source file, e.g.:

Type    Separator
CSV     , or ;
TSV     \t

put_map

Defines an internal map for lookup from key/value pairs.

put_map("<mapName>",
  "dog": "mammal",
  "parrot": "bird",
  "shark": "fish"
)

put_var

Defines a single global variable that can be referenced with $[<variableName>].

put_var("<variableName>", "<variableValue>")

put_vars

Defines multiple global variables that can be referenced with $[<variableName>].

put_vars(
  "<variableName_1>": "<variableValue_1>",
  "<variableName_2>": "<variableValue_2>"
)
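
A short sketch of defining global variables and referencing them afterwards. The variable names, values and field names are hypothetical.

```fix
put_vars(
  "institution": "Example Library",
  "base_url": "https://example.org/records/"
)
# $[institution] is replaced by the value defined above
add_field("provider", "$[institution]")
# Prefix the (hypothetical) id field with the base URL
prepend("id", "$[base_url]")
```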

Record-level functions

add_field

Creates (or appends to) a field with a defined value.

add_field("<targetFieldName>", "<fieldValue>")

array

Converts a hash/object into an array.

array("<sourceField>")

E.g.:

array("foo")
# {"name":"value"} => ["name", "value"]

call_macro

Calls a named macro, i.e. a list of statements that have been previously defined with the do put_macro bind.

Parameters:

  • name (required): Unique name of the macro.

Options:

  • All options are made available as "dynamic" local variables in the macro.

do put_macro("<macroName>"[, <staticLocalVariables>...])
  ...
end
call_macro("<macroName>"[, <dynamicLocalVariables>...])

copy_field

Copies (or appends to) a field from an existing field.

copy_field("<sourceField>", "<targetField>")

format

Replaces the value with a formatted (sprintf-like) version.

---- TODO: THIS NEEDS MORE CONTENT -----

format("<sourceField>", "<formatString>")

hash

Converts an array into a hash/object.

hash("<sourceField>")

E.g.:

hash("foo")
# ["name", "value"] => {"name":"value"}

move_field

Moves (or appends to) a field from an existing field. Can be used to rename a field.

move_field("<sourceField>", "<targetField>")

parse_text

Parses a text into an array or hash of values.

---- TODO: THIS NEEDS MORE CONTENT -----

parse_text("<sourceField>", "<parsePattern>")

paste

Joins multiple field values into a new field. Can be combined with additional literal strings.

The default join_char is a single space. Literal strings have to start with ~.

paste("<targetField>", "<sourceField_1>"[, ...][, "join_char": ", "])

E.g.:

# a: eeny
# b: meeny
# c: miny
# d: moe
paste("my.string", "~Hi", "a", "~how are you?")
# "my.string": "Hi eeny how are you?"
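
The join_char option can be illustrated with the same four fields. This is a sketch based on the signature above; the target field name is hypothetical.

```fix
# a: eeny
# b: meeny
# c: miny
# d: moe
paste("my.list", "a", "b", "c", "d", "join_char": ", ")
# "my.list": "eeny, meeny, miny, moe"
```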
print_record

Prints the current record as JSON either to standard output or to a file.

Parameters:

  • prefix (optional): Prefix to print before the record; may include format directives for counter and record ID (in that order). (Default: Empty string)

Options:

  • compression (file output only): Compression mode. (Default: auto)
  • destination: Destination to write the record to; may include format directives for counter and record ID (in that order). (Default: stdout)
  • encoding (file output only): Encoding used by the underlying writer. (Default: UTF-8)
  • footer: Footer which is output after the record. (Default: \n)
  • header: Header which is output before the record. (Default: Empty string)
  • id: Field name which contains the record ID; if found, will be available for inclusion in prefix and destination. (Default: _id)
  • internal: Whether to print the record's internal representation instead of JSON. (Default: false)
  • pretty: Whether to use pretty printing. (Default: false)

print_record(["<prefix>"][, <options>...])

E.g.:

print_record("%d) Before transformation: ")
print_record(destination: "record-%2$s.json", id: "001", pretty: "true")
print_record(destination: "record-%03d.json.gz", header: "After transformation: ")

random

Creates (or replaces) a field with a random number (less than the specified maximum).

random("<targetField>", "<maximum>")

remove_field

Removes a field.

remove_field("<sourceField>")

rename

Replaces a regular expression pattern in subfield names of a field. Does not change the name of the source field itself.

rename("<sourceField>", "<regexp>", "<replacement>")

retain

Deletes all fields except the ones listed (incl. subfields).

retain("<sourceField_1>"[, ...])

set_array

Creates a new array (with optional values).

set_array("<targetFieldName>")
set_array("<targetFieldName>", "<value_1>"[, ...])

set_field

Creates (or replaces) a field with a defined value.

set_field("<targetFieldName>", "<fieldValue>")

set_hash

Creates a new hash (with optional values).

set_hash("<targetFieldName>")
set_hash("<targetFieldName>", "subfieldName": "<subfieldValue>"[, ...])

timestamp

Creates (or replaces) a field with the current timestamp.

Options:

timestamp("<targetField>"[, format: "<formatPattern>"][, timezone: "<timezoneCode>"][, language: "<languageCode>"])

vacuum

Deletes empty fields, arrays and objects.

vacuum()

Field-level functions

append

Adds a string at the end of a field value.

append("<sourceField>", "<appendString>")

capitalize

Upcases the first character in a field value.

capitalize("<sourceField>")

count

Counts the number of elements in an array or a hash and replaces the field value with this number.

count("<sourceField>")

downcase

Downcases all characters in a field value.

downcase("<sourceField>")

filter

Only keeps field values that match the regular expression pattern.

filter("<sourceField>", "<regexp>")

flatten

Flattens a nested array field.

flatten("<sourceField>")

from_json

Replaces the string with its JSON deserialization.

Options:

  • error_string: Error message as a placeholder if the JSON couldn't be parsed. (Default: null)

from_json("<sourceField>"[, error_string: "<errorValue>"])

index

Returns the index position of a substring in a field and replaces the field value with this number.

index("<sourceField>", "<substring>")

isbn

Extracts an ISBN and replaces the field value with the normalized ISBN; optionally converts and/or validates the ISBN.

Options:

  • to: ISBN format to convert to (either ISBN10 or ISBN13). (Default: Only normalize ISBN)
  • verify_check_digit: Whether the check digit should be verified. (Default: false)
  • error_string: Error message as a placeholder if the ISBN couldn't be validated. (Default: null)

isbn("<sourceField>"[, to: "<isbnFormat>"][, verify_check_digit: "<boolean>"][, error_string: "<errorValue>"])

join_field

Joins an array of strings into a single string.

join_field("<sourceField>", "<separator>")

lookup

Looks up matching values in a map and replaces the field value with this match. External files as well as internal maps can be used.

lookup("<sourceField>", "<mapFile>", sep_char: ",")
lookup("<sourceField>", "<mapName>")
lookup("<sourceField>", "<mapName>", default: "NA")
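
Combining put_map with lookup might look as follows. This is a sketch; the map name and the field name are hypothetical.

```fix
put_map("animal_classes",
  "dog": "mammal",
  "parrot": "bird",
  "shark": "fish"
)
# animal: parrot
lookup("animal", "animal_classes", default: "NA")
# animal: bird
# (a value without an entry in the map, e.g. "sponge", would become "NA")
```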
prepend

Adds a string at the beginning of a field value.

prepend("<sourceField>", "<prependString>")

replace_all

Replaces a regular expression pattern in field values with a replacement string. Regexp capturing is possible; refer to capturing groups by number ($<number>) or name (${<name>}).

replace_all("<sourceField>", "<regexp>", "<replacement>")
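
A sketch of both ways to refer to capturing groups, using two hypothetical fields that each hold an ISO-style date:

```fix
# date: 2023-05-17
replace_all("date", "([0-9]{4})-([0-9]{2})-([0-9]{2})", "$3.$2.$1")
# date: 17.05.2023

# issued: 2023-05-17
replace_all("issued", "(?<year>[0-9]{4})-.*", "${year}")
# issued: 2023
```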
reverse

Reverses the character order of a string or the element order of an array.

reverse("<sourceField>")

sort_field

Sorts the strings in an array. By default, sorting is alphabetical in ascending order (A-Z); numerical and reverse sorting are available as options.

sort_field("<sourceField>")
sort_field("<sourceField>", reverse: "true")
sort_field("<sourceField>", numeric: "true")

split_field

Splits a string into an array and replaces the field value with this array.

split_field("<sourceField>", "<separator>")
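
split_field and join_field are counterparts, which the following sketch illustrates with a hypothetical keywords field:

```fix
# keywords: "history;art;music"
split_field("keywords", ";")
# keywords: ["history", "art", "music"]
join_field("keywords", ", ")
# keywords: "history, art, music"
```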
substring

Replaces a string with its substring as defined by the start position (offset) and length.

substring("<sourceField>", "<startPosition>", "<length>")

sum

Sums numbers in an array and replaces the field value with this number.

sum("<sourceField>")

to_json

Replaces the value with its JSON serialization.

Options:

  • error_string: Error message as a placeholder if the JSON couldn't be generated. (Default: null)
  • pretty: Whether to use pretty printing. (Default: false)

to_json("<sourceField>"[, pretty: "<boolean>"][, error_string: "<errorValue>"])

trim

Deletes whitespace at the beginning and the end of a field value.

trim("<sourceField>")

uniq

Deletes duplicate values in an array.

uniq("<sourceField>")

upcase

Upcases all characters in a field value.

upcase("<sourceField>")

Selectors

reject

Ignores records that match a condition.

if <condition>
  reject()
end
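
For instance, records that lack a given field could be dropped like this (a sketch; the field name title is hypothetical):

```fix
# Ignore every record that has no title field
unless exists("title")
  reject()
end
```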

Binds

do list

Iterates over each element of an array. In contrast to Catmandu, it can also iterate over a single object or string.

do list(path: "<sourceField>")
  ...
end

Only the current element is accessible in this case (as the root element).

When specifying a variable name for the current element, the record remains accessible as the root element and the current element is accessible through the variable name:

do list(path: "<sourceField>", "var": "<variableName>")
  ...
end
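
A sketch of the variable form, assuming a hypothetical array creators[] whose elements are hashes with a name subfield. The variable name $i is merely illustrative.

```fix
do list(path: "creators[]", "var": "$i")
  # The current element is reached through the variable ...
  trim("$i.name")
  # ... while the record root stays accessible for the target path
  copy_field("$i.name", "creator_names[].$append")
end
```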

do once

Executes the statements only once (when the bind is first encountered), not repeatedly for each record.

do once()
  ...
end

In order to execute multiple blocks only once, tag them with unique identifiers:

do once("maps setup")
  ...
end
do once("vars setup")
  ...
end
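
A typical use is one-time setup, such as defining a map that is then consulted for every record. In this sketch the map name and the field name are hypothetical.

```fix
# Define the map once, when the bind is first encountered
do once("maps setup")
  put_map("access_labels",
    "oa": "open access",
    "ra": "restricted access"
  )
end
# Executed for each record
lookup("access", "access_labels")
```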

do put_macro

Defines a named macro, i.e. a list of statements that can be executed later with the call_macro function.

Variables can be referenced with $[<variableName>], in the following order of precedence:

  1. "dynamic" local variables, passed as options to the call_macro function;
  2. "static" local variables, passed as options to the do put_macro bind;
  3. global variables, defined via put_var/put_vars.

Parameters:

  • name (required): Unique name of the macro.

Options:

  • All options are made available as "static" local variables in the macro.

do put_macro("<macroName>"[, <staticLocalVariables>...])
  ...
end
call_macro("<macroName>"[, <dynamicLocalVariables>...])
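
The order of precedence can be illustrated with a sketch in which the same variable name is defined at all three levels. The macro, variable and field names are hypothetical, and using a variable in the field path position is an assumption.

```fix
put_var("prefix", "global-")

do put_macro("tag", prefix: "static-")
  prepend("$[field]", "$[prefix]")
end

# No dynamic value for prefix: the static local value is used ("static-")
call_macro("tag", field: "id")
# The dynamic local value takes precedence ("dynamic-")
call_macro("tag", field: "parent_id", prefix: "dynamic-")
```

Given the order above, the global value "global-" would only apply if neither the bind nor the call defined prefix.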

Conditionals

Conditionals start with if to execute statements when the condition holds, or with unless to execute them when it does not.

Conditionals require a final end.

Alternative branches can be added with elsif and else.

if <condition(params, ...)>
  ...
end

unless <condition(params, ...)>
  ...
end

if <condition(params, ...)>
  ...
elsif <condition(params, ...)>
  ...
else
  ...
end

contain

all_contain

Executes the functions if/unless the field contains the value. If it is an array or a hash all field values must contain the string.

any_contain

Executes the functions if/unless the field contains the value. If it is an array or a hash one or more field values must contain the string.

none_contain

Executes the functions if/unless the field does not contain the value. If it is an array or a hash none of the field values may contain the string.

str_contain

Executes the functions if/unless the first string contains the second string.

equal

all_equal

Executes the functions if/unless the field value equals the string. If it is an array or a hash all field values must equal the string.

any_equal

Executes the functions if/unless the field value equals the string. If it is an array or a hash one or more field values must equal the string.

none_equal

Executes the functions if/unless the field value does not equal the string. If it is an array or a hash none of the field values may equal the string.

str_equal

Executes the functions if/unless the first string equals the second string.

exists

Executes the functions if/unless the field exists.

if exists("<sourceField>")

in

Executes the functions if/unless the field value is contained in the value of the other field.

Also aliased as is_contained_in.

is_contained_in

Alias for in.

is_array

Executes the functions if/unless the field value is an array.

is_empty

Executes the functions if/unless the field value is empty.

is_false

Executes the functions if/unless the field value equals false or 0.

is_hash

Alias for is_object.

is_number

Executes the functions if/unless the field value is a number.

is_object

Executes the functions if/unless the field value is a hash (object).

Also aliased as is_hash.

is_string

Executes the functions if/unless the field value is a string (and not a number).

is_true

Executes the functions if/unless the field value equals true or 1.

match

all_match

Executes the functions if/unless the field value matches the regular expression pattern. If it is an array or a hash all field values must match the regular expression pattern.

any_match

Executes the functions if/unless the field value matches the regular expression pattern. If it is an array or a hash one or more field values must match the regular expression pattern.

none_match

Executes the functions if/unless the field value does not match the regular expression pattern. If it is an array or a hash none of the field values may match the regular expression pattern.

str_match

Executes the functions if/unless the string matches the regular expression pattern.
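
The conditionals described in this section can be combined with if, elsif, else and unless. The sketch below uses hypothetical field names. The argument order for any_match and any_contain (field first, then pattern or string) is an assumption based on the descriptions above, as only exists is shown with a signature.

```fix
if any_match("isbn", "^97[89].*")
  add_field("isbn_type", "ISBN13")
elsif exists("isbn")
  add_field("isbn_type", "other")
else
  add_field("isbn_type", "none")
end

unless any_contain("title", "draft")
  add_field("status", "final")
end
```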

Xtext

This repo was originally set up with Xtext 2.17.0 and Eclipse for Java 2019-03, following https://www.eclipse.org/Xtext/documentation/104_jvmdomainmodel.html.