afrolidxml

This repo contains a Python-based CLI tool (afrolidxml) that uses AfroLID to identify possible African language usage in a text file and write the results to an XML file. The tool can be used with Archivematica to identify African languages in text files, adding characterization metadata to the Metadata Encoding and Transmission Standard (METS) file of a transfer.

Integration works by adding the tool to Archivematica's Format Policy Registry (FPR) as a characterization command, then adding FPR rules that trigger the tool during the "Characterize and extract metadata" microservice (see "Archivematica configuration" for the FPR setup steps).

```mermaid
flowchart TB
    Characterization-->METS
    subgraph Characterization[Language usage prediction]
    AfroLID[AfroLID tool]-->|XML|Archivematica[Archivematica FPR]
    Archivematica[Archivematica FPR]-->|text|AfroLID[AfroLID FPR command]
    end
    subgraph METS[Storage in transfer METS file]
    AM[Archivematica FPR]-->|XML|METS2[(METS XML)]
    end

style Characterization stroke-dasharray: 5 5;
style METS stroke-dasharray: 5 5;
```

This has been tested with Archivematica 1.15.0 running on Ubuntu 22.04.

Installation

Below are the installation instructions:

Clone this project somewhere on your Archivematica server:

```
git clone https://github.com/artefactual-labs/afrolidxml.git
```

Change into the project directory:
```
cd afrolidxml
```
Create a Python virtual environment:

Install virtualenv if it's not already installed:
```
sudo apt install python3-virtualenv
```
Initialize a new virtual environment:
```
virtualenv -p python3 venv
```
Activate the virtual environment:
```
source venv/bin/activate
```
Install the project's Python dependencies:
```
pip3 install -r requirements/base.txt
```

Download the AfroLID model:

Install wget if it's not already installed:

```
sudo apt install wget
```

Download and extract the model:

```
wget https://demos.dlnlp.ai/afrolid/afrolid_model.tar.gz
tar -xf afrolid_model.tar.gz
```

Ideally this tool would be installed with pipx, which installs CLI tools and automatically creates virtual environments for them. This is not currently possible because of a dependency problem: fairseq 0.12.2 (the latest version at the time of writing) has a dependency with a PEP 440-related issue.

Running the tool manually

Once the tool is installed it can be run manually to make sure it has been installed correctly:

```
./afrolidxml/cli.py -m afrolid_model \
    tests/fixtures/language_use_example.txt output.xml
```

The resulting output, in output.xml, should look similar to this:

```
<?xml version="1.0" encoding="utf-8"?>
<languages sourcetool="AfroLID">
        <language>
                <score>39.95</score>
                <name>Isizulu</name>
                <script>Latin</script>
                <code>zul</code>
        </language>
        <language>
                <score>30.49</score>
                <name>Isixhosa</name>
                <script>Latin</script>
                <code>xho</code>
        </language>
        <language>
                <score>11.4</score>
                <name>IsiNdebele</name>
                <script>Latin</script>
                <code>nbl</code>
        </language>
</languages>
```

An XSD schema for this output format is available at tests/fixtures/languages.xsd in the repository.
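
For scripts that need to read these results, the output format can be parsed with Python's standard library. The following is a minimal sketch and not part of afrolidxml itself: the `parse_languages` helper and the embedded `SAMPLE_XML` string are illustrative assumptions based only on the example output shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the structure of the example output.
SAMPLE_XML = """<?xml version="1.0" encoding="utf-8"?>
<languages sourcetool="AfroLID">
    <language>
        <score>39.95</score>
        <name>Isizulu</name>
        <script>Latin</script>
        <code>zul</code>
    </language>
    <language>
        <score>30.49</score>
        <name>Isixhosa</name>
        <script>Latin</script>
        <code>xho</code>
    </language>
</languages>
"""


def parse_languages(xml_text):
    """Return one dict per <language> element, in document order."""
    # Parsing bytes lets the parser honour the XML encoding declaration.
    root = ET.fromstring(xml_text.encode("utf-8"))
    results = []
    for lang in root.findall("language"):
        results.append(
            {
                "name": lang.findtext("name"),
                "code": lang.findtext("code"),
                "script": lang.findtext("script"),
                "score": float(lang.findtext("score")),
            }
        )
    return results


languages = parse_languages(SAMPLE_XML)
# Pick the entry with the highest score.
best = max(languages, key=lambda entry: entry["score"])
print(f"{best['name']} ({best['code']}): {best['score']}")
```

To read a real result instead of the sample, pass the contents of output.xml to `parse_languages`.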

Archivematica configuration

In the Archivematica dashboard, click "Preservation planning" in the navigation bar to open the web interface that defines which actions Archivematica takes on each file format.

Adding the AfroLID tool as a characterization command

Follow these steps to create a characterization command for the AfroLID tool:

Click "Commands" in the "Characterization" section of the left sidebar.
Click "Create new command".
For "The related tool" select "Archivematica Script".
For "Description" enter "AfroLID".

For "Command" add the following Bash script logic (changing SCRIPT_DIR to the location where you've cloned this project):

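```
set -euo pipefail
# Location of the cloned afrolidxml project (change to match your server).
SCRIPT_DIR="/home/someuser/afrolidxml"
cd "$SCRIPT_DIR"
source venv/bin/activate
# Write the XML to a temporary directory so it can be printed afterwards.
TEMP_DIR=$(mktemp -d "%tmpDirectory%afrolid.XXXXXX")
# Run the tool with its standard output and standard error suppressed.
"$SCRIPT_DIR/afrolidxml/cli.py" -m afrolid_model "%fileFullName%" "$TEMP_DIR/output.xml" >/dev/null 2>&1
cat "$TEMP_DIR/output.xml"
echo
echo
# Remove the whole temporary directory, not only the output file.
rm -r "$TEMP_DIR"
```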

This script suppresses both the standard output and the standard error of the AfroLID tool while it runs, then writes the resulting XML to standard output.

For "Script type" select "Bash script".
For "The related output format" select "Text (Markup): XML: XML (fmt/101)".
Leave "Output location" blank.
For "Command usage" select "Characterization".
Leave "The related verification command" blank.
Leave "The related event detail command" blank.
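
The pattern used by the Bash command in the steps above (run a tool with its console output suppressed, have it write XML to a temporary location, then emit that XML) can also be sketched in Python. This is only an illustration under stated assumptions, not something Archivematica requires: `run_quietly_and_read` and `fake_tool` are hypothetical names, and `fake_tool` is a stand-in child process rather than the real afrolidxml CLI.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_quietly_and_read(make_command):
    """Run a tool that writes to a file, hide its console output, return the file text.

    make_command is a hypothetical callback: given the output path, it
    returns the argument list to execute.
    """
    # TemporaryDirectory deletes the directory and its contents on exit.
    with tempfile.TemporaryDirectory(prefix="afrolid.") as temp_dir:
        output_path = Path(temp_dir) / "output.xml"
        subprocess.run(
            make_command(output_path),
            stdout=subprocess.DEVNULL,  # comparable to 1>/dev/null
            stderr=subprocess.DEVNULL,  # comparable to 2>/dev/null
            check=True,  # raise on a non-zero exit status
        )
        return output_path.read_text(encoding="utf-8")


def fake_tool(output_path):
    """Stand-in for the real tool: prints noise and writes placeholder XML."""
    script = (
        "import sys; print('noisy progress output'); "
        "open(sys.argv[1], 'w', encoding='utf-8').write('<languages/>')"
    )
    return [sys.executable, "-c", script, str(output_path)]


xml_text = run_quietly_and_read(fake_tool)
print(xml_text)
```

Only the XML reaches standard output here; the child process's own progress message is discarded, which is the behaviour the characterization command depends on.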

Adding characterization rules to run the AfroLID command

After adding the characterization command the next step is to create rules for the six file formats that we'd like to use AfroLID to characterize.

Add a rule for the first format using these steps:

1. Click "Rules" in the "Characterization" section of the left sidebar.
2. Click "Create new rule".
3. For "Purpose" select "Characterization".
4. For "The related format" select "Text (Plain): Plain Text: Generic TXT (x-fmt/111)".
5. For "Command" select "AfroLID".
6. Click "Save".

Repeat steps 2 to 6 for each of the following five formats, changing the "The related format" selection in step 4 accordingly:

Text (Plain): Unicode Text File: Unicode Text File (x-fmt/16)
Text (Plain): 7-bit ANSI Text: 7-bit ANSI Text (x-fmt/21)
Text (Plain): 7-bit ASCII Text: 7-bit ASCII Text (x-fmt/22)
Text (Plain): 8-bit ANSI Text: 8-bit ANSI Text (x-fmt/282)
Text (Plain): 8-bit ASCII Text: 8-bit ASCII Text (x-fmt/283)
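
When working through the six rules, it can help to track which formats still lack one. The sketch below is a hypothetical checklist helper, not a query against the FPR: the `AFROLID_FORMATS` mapping simply copies the format identifiers listed above, and the `configured` list is made-up example state.

```python
# Format identifiers copied from the list above; descriptions are shortened.
AFROLID_FORMATS = {
    "x-fmt/111": "Generic TXT",
    "x-fmt/16": "Unicode Text File",
    "x-fmt/21": "7-bit ANSI Text",
    "x-fmt/22": "7-bit ASCII Text",
    "x-fmt/282": "8-bit ANSI Text",
    "x-fmt/283": "8-bit ASCII Text",
}


def missing_rules(configured_puids):
    """Return the listed identifiers that have no rule yet, sorted as strings."""
    return sorted(set(AFROLID_FORMATS) - set(configured_puids))


# Hypothetical state: rules have been created for three formats so far.
configured = ["x-fmt/111", "x-fmt/16", "x-fmt/21"]
for puid in missing_rules(configured):
    print(f"Still needs a rule: {AFROLID_FORMATS[puid]} ({puid})")
```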

Tests

With the virtual environment active, install the project's Python test dependencies:

```
pip3 install -r requirements/test.txt
```

To run the tests:

```
./venv/bin/pytest tests/test.py
```