
Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)

Primary language: Python. License: The Unlicense.

Data Science files exported from HXL (The Humanitarian Exchange Language)

[Proof of concept] Common file formats used for Data Science exported from HXL (The Humanitarian Exchange Language)




HXL-Data-Science-file-formats

In addition to this GitHub repository, check also the EticaAI-Data_HXL-Data-Science-file-formats Google Drive folder.

1. The main focus

1.1 Vocabulary, Taxonomies and URNs

1.1.1 Vocabulary & Taxonomies on HXL

This project either uses explicit HXL +attributes (easy to implement, but more verbose) or infers them from well-known HXLated datasets used in humanitarian areas. To make this work, the main reference is not the software implementation, but the reference tables.
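As a rough illustration of the two strategies above, the sketch below first looks for an explicit +attribute and then falls back to a lookup by base hashtag. Both tables and the hint names are hypothetical placeholders invented for this example; the project's real reference tables are the authoritative source.

```python
import re

# Hypothetical reference-table excerpts. The attribute names, hashtags and
# hint values are illustrative assumptions, not the project's real vocabulary.
HINT_BY_ATTRIBUTE = {"number": "continuous", "code": "discrete", "text": "string"}
HINT_BY_HASHTAG = {"date": "time", "population": "continuous"}

# "#base" followed by zero or more "+attribute" parts (spaces allowed before "+").
HASHTAG = re.compile(
    r"^#([a-z][a-z0-9_]*)((?:\s*\+[a-z][a-z0-9_]*)*)$", re.IGNORECASE
)


def parse_hashtag(spec):
    """Split a hashtag such as '#adm2 +code' into ('adm2', ['code'])."""
    match = HASHTAG.match(spec.strip())
    if match is None:
        raise ValueError(f"not an HXL hashtag: {spec!r}")
    base = match.group(1).lower()
    attributes = [
        part.strip().lower() for part in match.group(2).split("+") if part.strip()
    ]
    return base, attributes


def hint_for(spec, default="string"):
    """Prefer an explicit +attribute; otherwise infer from the base hashtag."""
    base, attributes = parse_hashtag(spec)
    for attribute in attributes:
        if attribute in HINT_BY_ATTRIBUTE:
            return HINT_BY_ATTRIBUTE[attribute]
    return HINT_BY_HASHTAG.get(base, default)


print(hint_for("#adm2+code"))
print(hint_for("#population"))
```

Keeping the mapping in plain tables rather than in code is what lets the tables, not the software, act as the main reference.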

1.1.2 Uniform Resource Name on URN:DATA
Why using URNs to identify resources is more than a naming convention

Finding good URN conventions for the typical datasets used in humanitarian contexts is more complex than for the ISO URN or even the LEX URN (the latter already used in Brazil). One goal of the urnresolver is to accept that most shared data is VERY sensitive and private, and this is the actual challenge. So in addition to converting some well-known public datasets related to HXL, we are already designing it to eventually serve as an abstraction for scripts and tools that would otherwise need access to the real datasets.

By using URNs, in the worst case we are creating documentation and scripts in which a new user needs to replace the URN with the real one for their use case. The ideal case is to allow the exchange of scripts or, when an issue happens in a new region, to let the personnel who prepare the data do so and then also publish it on a private URN listing so others can reuse it.

Note that even if the URN Resolver has links to resources and not just a contact page, the links to download the real data could still require authentication on a case-by-case basis. Also, if you manage to be in contact with several peers, the same URN is likely to resolve to more than one option, especially for datasets that are not already a COD but are often needed.

Deeper integration with CKAN instances and/or awareness of encrypted data is still not implemented in the current version (v0.7.3).

Security (and privacy) considerations (for URN:DATA)

One main goal of URNs is to help with auditing and sharing of scripts, and even with referencing the "best acceptable use" of exchanged data (with special focus on private/sensitive data). The URN:DATA identifiers themselves are meant NOT to be secret and could be published in official documents. The local implementations (that is, how these URNs are resolved/redirected to real data), however, need to take into account that the goals of a "perfect optimization" (think "secure from misuse" vs "protect privacy from legitimate use") are often contradictory.

TODO: add more context

Disclaimer (for URN:DATA)

Note: in addition to CLI tools that convert URNs into something usable ("the implementation"), this project also drafts the logic for constructing potentially useful URNs that are reusable at the international level (what may look like a drafted "standard", think ISO, or a Best Current Practice, think IETF). Please do not take EticaAI/HXL-Data-Science-file-formats... as endorsed by any organization.

Also, authors from @EticaAI / @HXL-CPLP (both past and future ones who cooperate directly with this project) explicitly release both the software and the drafted 'how to implement' under public domain-like licenses. Under ideal circumstances, each global data namespace (the ZZ in urn:data:ZZ:example) may have more specific rules.

1.1.3 Ontologia

See ontologia/

"In computer science and information science, an ontology encompasses a representation, formal naming and definition of the categories, properties and relations between the concepts, data and entities that substantiate one, many, or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of concepts and categories that represent the subject." -- [Wikipedia: Ontology (information science)](https://en.wikipedia.org/wiki/Ontology_(information_science))

The contents of ontologia/ include both some selected datasets and (while not 100% converted) the main parts of what the command line tools and libraries released by this repository use.

Why: focus on abstracting complexity away from users AND allowing reuse by other projects

When feasible, the internal parts of hxlm.core that deal with ontology will be stored in this folder, even if this makes the initial implementation harder or a bit less efficient than dedicated "advanced" strategies with state-of-the-art tools.

This strategy is likely to make it easier for non-developers to update internals, like individuals interested in adding new languages or proposing corrections.

Distribution channels

For production usage, these files are both available via:

1.2 HXL2 Command line tools

1.2.1 hxl2example: create your own exporter/importer

The hxl2example is an example Python script with generic functionality that allows you to create your own custom functions. Feel free to add your name, edit the license, etc.

What it does: hxl2example accepts one HXLated dataset and saves it as .CSV.

Quick examples

### Basic examples

# This will output a local file to stdout (tip: you can disable local files)
hxl2example tests/files/iris_hxlated-csv.csv

# This will save to a local file
hxl2example tests/files/iris_hxlated-csv.csv my-local-file.example

# Since we use libhxl-python, remote HXLated URLs work too!
hxl2example https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/edit#gid=319251406

### Advanced usage (if you need to share work with others)

## Quick ad-hoc web proxy, local usage
# @see https://github.com/hugapi/hug

hug -f bin/hxl2example
# http://localhost:8000/ will show a JSON documentation of hug endpoints. TL;DR:
# http://localhost:8000/hxl2example.csv?source_url=http://example.com/remote-file.csv

## Expose local web proxy to others
# @see https://ngrok.com/
ngrok http 8000

1.2.2 hxl2tab: tab format, focused for compatibility with Orange Data Mining

What it does: hxl2tab uses an already HXLated dataset and then, based on #hashtag+attributes, generates an Orange Data Mining .tab format with extra hints.
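To make the idea concrete, here is a minimal sketch (not the actual hxl2tab code) that turns a hashtag row plus data rows into Orange's tab-delimited layout with its three header lines: names, types and flags. The two mapping tables are hypothetical; the real attribute-to-hint rules live in the project's reference tables.

```python
import csv
import io

# Hypothetical mappings from HXL +attributes to Orange header values.
# These are assumptions for illustration, not the rules used by hxl2tab.
TYPE_BY_ATTRIBUTE = {"number": "continuous", "class": "discrete"}
FLAG_BY_ATTRIBUTE = {"class": "class", "meta": "meta"}


def first_match(attributes, table, default):
    """Return the table value for the first attribute found, else default."""
    return next((table[a] for a in attributes if a in table), default)


def hxl_rows_to_orange_tab(rows):
    """rows[0] is the HXL hashtag row; the remaining rows are data."""
    hashtags, data = rows[0], rows[1:]
    names, types, flags = [], [], []
    for hashtag in hashtags:
        attributes = hashtag.replace(" ", "").lower().split("+")[1:]
        names.append(hashtag)
        types.append(first_match(attributes, TYPE_BY_ATTRIBUTE, "string"))
        flags.append(first_match(attributes, FLAG_BY_ATTRIBUTE, ""))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows([names, types, flags, *data])
    return out.getvalue()


example = [
    ["#item+sepal_length+number", "#item+species+class"],
    ["5.1", "setosa"],
]
print(hxl_rows_to_orange_tab(example))
```

The point of the sketch is only that the hints travel inside the hashtags, so the exporter needs no per-dataset configuration.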

hxl2tab v2.0 has some usable functionality to generate the file through a web interface instead of the CLI. It uses hug 🐨 🤗.

If you want to quickly expose it outside localhost, try ngrok.

Installation

This tool can be installed manually by copying bin/hxl2tab to a directory on your executable path and installing the dependencies yourself.

The automated way is to install the Python PyPI package hdp-toolchain, which places the tool on your path and already includes the extra dependencies:

python3 -m pip install hdp-toolchain[hxl2tab]

# python3 -m pip install hdp-toolchain[full]

1.2.3 hxlquickmeta: output information about local/remote datasets (even non HXLated yet)

What it does: hxlquickmeta outputs information about a local or remote dataset. If the file is already HXLated, it will print even more information.

v1.1.0 added support for giving an overview by default, similar to what users of Python Pandas are used to.

Installation

This tool can be installed manually by copying bin/hxlquickmeta to a directory on your executable path and installing the dependencies yourself.

The automated way is to install the Python PyPI package hdp-toolchain, which places the tool on your path and already includes the extra dependencies:

python3 -m pip install hdp-toolchain[hxlquickmeta]

# python3 -m pip install hdp-toolchain[full]

Quick examples

#### inline result for a hashtag and (optional) value __________________________

hxlquickmeta --hxlquickmeta-hashtag="#adm2+code" --hxlquickmeta-value="BR3106200"
# > get_hashtag_info
# >> hashtag: #adm2+code
# >>> HXLMeta._parse_heading: #adm2+code
# >>> HXLMeta.is_hashtag_base_valid: None
# >>> libhxl_is_token None
# >> value: BR3106200
# >>> libhxl_is_empty False
# >>> libhxl_is_date False
# >>> libhxl_is_number False
# >>> libhxl_is_string True
# >>> libhxl_is_token None
# >>> libhxl_is_truthy False
# >>> libhxl_typeof string

#### Output information for a file, and (if any) HXLated information ___________
# Local file
hxlquickmeta tests/files/iris_hxlated-csv.csv

# Remote file
hxlquickmeta https://docs.google.com/spreadsheets/u/1/d/1l7POf1WPfzgJb-ks4JM86akFSvaZOhAUWqafSJsm3Y4/edit#gid=634938833
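The value checks in the first example above (empty, date, number, string, type) can be approximated with the standard library. The sketch below is a simplified stand-in written for this README, not the libhxl implementation, whose rules are richer (for example for date formats); the function names are made up for the example.

```python
from datetime import date


def is_empty(value):
    """True for None or whitespace-only text."""
    return value is None or str(value).strip() == ""


def is_number(value):
    """True if the value can be read as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def is_date(value):
    """True for ISO dates such as 2021-03-01 (a deliberately narrow rule)."""
    try:
        date.fromisoformat(str(value).strip())
        return True
    except ValueError:
        return False


def type_of(value):
    """Classify a cell as 'empty', 'number', 'date' or 'string'."""
    if is_empty(value):
        return "empty"
    if is_number(value):
        return "number"
    if is_date(value):
        return "date"
    return "string"


print(type_of("BR3106200"))
```

With this simplified logic a P-code such as BR3106200 is classified as a string, which matches the typeof line in the output shown above.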

1.2.4 hxlquickimport: (like the hxltag)

What it does: hxlquickimport is similar to hxltag (one of the CLI tools installed with libhxl), but by default it mostly just tries to slugify whatever was in the old headers and add it as an HXL attribute. Please consider using the HXL-Proxy for serious usage. This quick script is more for internal testing.
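The "slugify the old header and add it as an attribute" step can be sketched as follows. This is an illustration only: the `#item` base hashtag and both function names are assumptions made for the example, not necessarily what hxlquickimport does.

```python
import re
import unicodedata


def slugify(text):
    """Lowercase, strip accents, and join alphanumeric runs with underscores."""
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "_", ascii_text.lower()).strip("_")


def quick_hashtag(header, base="#item"):
    """Build a hashtag from a plain header; '#item' is a hypothetical default."""
    slug = slugify(header)
    return f"{base}+{slug}" if slug else base


for header in ["Sepal Length (cm)", "População total", ""]:
    print(repr(header), "->", quick_hashtag(header))
```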

Installation

This tool can be installed manually by copying bin/hxlquickimport to a directory on your executable path and installing the dependencies yourself.

The automated way is to install the Python PyPI package hdp-toolchain, which places the tool on your path and already includes the extra dependencies:

python3 -m pip install hdp-toolchain[hxlquickimport]

# python3 -m pip install hdp-toolchain[full]

1.3 URN Command line tools

Installation

The automated way to install is using the Python pypi package hdp-toolchain. urnresolver is installed by default.

python3 -m pip install hdp-toolchain

1.3.1 urnresolver: convert Uniform Resource Name of datasets to real IRIs (URLs)

The urnresolver is a proof of concept of a URN resolver (see Uniform Resource Name (URN) on Wikipedia).

Examples (note: early working draft!)

# Basic usage: based on local and (to be implemented) remote listing pages,
# it translates one readable URN to one or more datasets
urnresolver urn:data:xz:hxl:standard:core:hashtag
# https://docs.google.com/spreadsheets/d/1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/pub?gid=319251406&single=true&output=csv

# Now, the more practical example: using it to feed other commands:
hxlselect "$(urnresolver urn:data:xz:hxl:standard:core:hashtag)" --query '#valid_vocab=+v_pcode'
#    Hashtag,Hashtag one-liner,Hashtag long description,Release status,Data type restriction,First release,Default taxonomy,Category,Sample HXL,Sample description
#    #valid_tag,#description+short+en,#description+long+en,#status,#valid_datatype,#meta+release,#valid_vocab+default,#meta+category,#meta+example+hxl,#meta+example+description+en
#    #adm1,Level 1 subnational area,Top-level subnational administrative area (e.g. a governorate in Syria).,Released,,1.0,+v_pcode,1.1. Places,#adm1 +code,administrative level 1 P-code
#    #adm2,Level 2 subnational area,Second-level subnational administrative area (e.g. a subdivision in Bangladesh).,Released,,1.0,+v_pcode,1.1. Places,#adm2 +name,administrative level 2 name
#    #adm3,Level 3 subnational area,Third-level subnational administrative area (e.g. a subdistrict in Afghanistan).,Released,,1.0,+v_pcode,1.1. Places,#adm3 +code,administrative level 3 P-code
#    #adm4,Level 4 subnational area,Fourth-level subnational administrative area (e.g. a barangay in the Philippines).,Released,,1.0,+v_pcode,1.1. Places,#adm4 +name,administrative level 4 name
#    #adm5,Level 5 subnational area,Fifth-level subnational administrative area (e.g. a ward of a city).,Released,,1.0,+v_pcode,1.1. Places,#adm5 +code,administrative level 5 name

hxlselect "$(urnresolver urn:data:xz:hxlcplp:fod:lang)" --query '#vocab+id+v_iso6393_3letter=por'
#    Id,Part2B,Part2T,Part1,Scope,Language_Type,Ref_Name,Comment
#    #vocab+id+v_iso6393_3letter,#vocab+code+v_iso3692_3letter+z_bibliographic,#vocab+code+v_3692_3letter+z_terminology,#vocab+code+v_6391,#status,#vocab+type,#vocab+name,#description+comment+i_en
#    por,por,por,pt,I,L,Portuguese,
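Conceptually, the resolution step used in the examples above is a lookup from a URN to one or more locations. The sketch below shows that idea with an in-memory table; the table layout and the `resolve` function are assumptions for illustration and do not reflect how urnresolver stores or loads its listings. The single entry reuses the URN and URL from the first example.

```python
# Hypothetical local listing: one URN may map to several candidate URLs.
URN_TABLE = {
    "urn:data:xz:hxl:standard:core:hashtag": [
        "https://docs.google.com/spreadsheets/d/"
        "1En9FlmM8PrbTWgl3UHPF_MXnJ6ziVZFhBbojSJzBdLI/"
        "pub?gid=319251406&single=true&output=csv",
    ],
}


def resolve(urn, table=URN_TABLE):
    """Return every known location for a urn:data:... identifier."""
    key = urn.strip()
    if not key.startswith("urn:data:"):
        raise ValueError(f"not a urn:data identifier: {urn!r}")
    if key not in table:
        raise LookupError(f"no local entry for {key}")
    # Copy the list so callers cannot modify the listing by accident.
    return list(table[key])


for url in resolve("urn:data:xz:hxl:standard:core:hashtag"):
    print(url)
```

Because scripts only mention the URN, the same script can be shared while each organization keeps its own private table of real locations.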

1.4 HDP HDP Declarative Programming (early draft)

Installation

The automated way to install is using the Python PyPI package hdp-toolchain. All the relevant parts, including a bare minimal ontologia, are part of the default installation.

python3 -m pip install hdp-toolchain

1.4.1 HDP conventions (The YAML/JSON file structure)
1.4.2 hdpcli (command line interface)
1.4.3 HXLm.HDP (python library subpackage) usage

1.5 HXLTM HXL Trānslātiōnem Memoriam

§ HXLTM

Dedicated documentation at https://hdp.etica.ai/hxltm

The Humanitarian Exchange Language Trānslātiōnem Memoriam (abbreviation: "HXLTM") is a valid HXLated tabular format by HXL-CPLP for storing community-contributed translations and glossaries.

The hxltmcli is an (initial) reference implementation: a public domain Python CLI tool that allows reuse by others interested in exporting HXLTM files to common formats used by professional translators. Software developers interested in promoting use cases of HXL are encouraged to either collaborate on hxltmcli or create other tools.

2. Reasons behind

2.1 Why?

HXL is already used in production, especially in humanitarian areas (see The Humanitarian Data Exchange). With a one-line change it is possible to make most spreadsheet-like data already in use machine readable, without disturbing end users as other alternatives would. One notable implementation (data visualization) powered by HXL is HXLDash (see this HXLDash example video).

The idea of this project is to develop strategies that let already HXLated datasets be used directly in open source desktop tools like Orange Data Mining and WEKA ("The workbench for machine learning"), with the minimum extra explanation on how to convert existing HXL datasets AND with tools that solve known issues that are likely to be found.

2.2 How?

NOTE: it is already possible to use HXLated CSVs with these tools! For those who are learning HXL or using it in production for humanitarian purposes, the HXL-proxy (https://proxy.hxlstandard.org/) with "Strip text headers" can serve live-updated CSV-like files. Other usages can still rely on the HXL CLI tools or run unocha/hxl-proxy with Docker on your own machine or on a private or public server.
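For readers unfamiliar with what stripping text headers means, the sketch below drops every row before the HXL hashtag row, so that the hashtags become the first line of the CSV. It is a standard-library approximation written for this README, not the HXL-proxy code, and its hashtag-row test is stricter than real parsers may be.

```python
import csv
import io


def is_hashtag_row(row):
    """True if every non-empty cell starts with '#' (and at least one does)."""
    cells = [cell.strip() for cell in row if cell.strip()]
    return bool(cells) and all(cell.startswith("#") for cell in cells)


def strip_text_headers(csv_text):
    """Return the CSV starting at the hashtag row, dropping earlier rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    for index, row in enumerate(rows):
        if is_hashtag_row(row):
            out = io.StringIO()
            csv.writer(out, lineterminator="\n").writerows(rows[index:])
            return out.getvalue()
    raise ValueError("no HXL hashtag row found")


sample = "Province,Population\n#adm1+name,#population\nBahia,100\n"
print(strip_text_headers(sample))
```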

One way to implement this is to create minimum usable conversion tools that are able to export already HXLated datasets with additional hints to file formats used by default by their applications.

In practice this goes beyond simple file conversion (like XLSX to CSV), since it includes both "variable type" AND "intent to use (on data mining)". This is why this project also has the taxonomy/vocabulary reference table (and this is actually more important than the implementation itself!). Without some extra step, HXLated datasets work like an average CSV (good, but not great).

But yes, some of the target tools, especially Weka (at least compared to Orange), are stricter about the tabular format they accept, and this can be infuriating EVEN for those who actually know how to debug these issues! But this issue, at least, is more automatable.
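As an example of the kind of strictness that can be automated, the sketch below writes a small ARFF document and quotes every string so that spaces, commas and apostrophes do not break the file. It is a minimal illustration under stated assumptions (only NUMERIC and STRING columns, empty cells written as `?`); it is not this project's exporter, and the function names are invented for the example.

```python
def arff_quote(value):
    """Single-quote a string for ARFF, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def to_arff(relation, attributes, rows):
    """attributes is a list of (name, kind); kind is 'NUMERIC' or 'STRING'."""
    lines = [f"@RELATION {arff_quote(relation)}", ""]
    for name, kind in attributes:
        lines.append(f"@ATTRIBUTE {arff_quote(name)} {kind}")
    lines += ["", "@DATA"]
    for row in rows:
        cells = []
        for (_name, kind), value in zip(attributes, row):
            if value == "":
                cells.append("?")  # ARFF marker for a missing value
            elif kind == "NUMERIC":
                cells.append(value)
            else:
                cells.append(arff_quote(value))
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


print(to_arff(
    "iris",
    [("sepal_length", "NUMERIC"), ("note", "STRING")],
    [["5.1", "it's ok"], ["", "x, y"]],
))
```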

Note: one practical reason to use HXLated files as a base instead of plain CSV or XLSX (beyond obviously being available in humanitarian contexts) is that the grammar of HXL +attributes is flexible enough to export to several different formats, with freedom to choose other aspects of the tagging.

2.3 Non-goals

  • The software implementation for file formats not typically used by easy-to-use desktop applications is a non-goal
    • Yet, as part of the HXL +attributes conversion tables, some of these proposed implementations may already be drafted. These reference tables are released under public domain licenses.
    • Note that people who already use these formats are likely to have the skills to convert manually from CSVs (so they could convert from HXL)
  • The software implementation (at least at the start) will not optimize for speed or low local disk usage
    • but it should work to convert large datasets with reasonably low memory usage
  • The software implementations assume an already HXLated input dataset to keep things simple
    • Note that it is possible to quickly convert already well-formatted CSVs to HXL by changing the header line (first line of the CSV).
  • While it is technically possible to import back (reconstruct the original HXLated file) from exported files, being 100% compatible is a non-goal
    • This applies especially to .arff exports: the default export may need to clean known issues with exported strings.
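The header-line change mentioned in the list above can be sketched with the standard library: given a mapping from existing headers to hashtags, add a hashtag row right after the header row. The function name and the mapping are hypothetical and only illustrate the idea; in practice the hxltag CLI or the HXL-proxy would be the usual way to do this.

```python
import csv
import io


def hxlate(csv_text, hashtag_by_header):
    """Insert an HXL hashtag row after the first (header) row of a CSV."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("empty CSV")
    header = rows[0]
    # Headers without a known hashtag are left untagged (empty cell).
    hashtag_row = [hashtag_by_header.get(name.strip(), "") for name in header]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows([header, hashtag_row, *rows[1:]])
    return out.getvalue()


plain = "Province,Population\nBahia,100\n"
mapping = {"Province": "#adm1+name", "Population": "#population"}
print(hxlate(plain, mapping))
```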

HXLated datasets to test

Production data on The Humanitarian Data Exchange ("HDX")

The Humanitarian Data Exchange ("HDX") contains public datasets, and part of them are already HXLated and ready to test.

PROTIP: on https://proxy.hxlstandard.org/data/source, "Option 2: choose from the cloud" also has an "HDX" icon that can be used. This can be helpful if you are just browsing several datasets.

Files from EticaAI-Data_HXL-Data-Science-file-formats

Both the Google Drive folder and this repository have some test files. The not-so-documented manual tests may also give a quick idea of how it works.

Additional Guides

Note: these additional guides are not part of the main focus of this project

Command line tools for CSV

NOTE: people who work with HXL often simply use the HXL-proxy, including to convert from non-HXLated sources.

Here is a quick overview of different command line tools that are worth at least a mention, especially if you are dealing with raw formats that are not yet HXLated.

Alternatives to preview spreadsheets with over 1,000,000 rows

90% of the time 1,000,000 rows is likely to be enough, even if you are dealing with data science projects. This means there is usually no need to use command line tools or more complex solutions, like importing into a database or paying for enterprise solutions.

This guide is for when you need to go over these limits without changing your tools too much.
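One low-tech option, sketched below with the Python standard library, is to read only the first rows of a very large CSV instead of opening the whole file. The function names are made up for this example; the file is streamed, so memory use stays small regardless of the number of rows.

```python
import csv
from itertools import islice


def preview_rows(lines, limit=10):
    """Parse at most `limit` CSV rows from any iterable of text lines."""
    return list(islice(csv.reader(lines), limit))


def preview_file(path, limit=10, encoding="utf-8"):
    """Preview the first rows of a CSV file without loading it entirely."""
    with open(path, newline="", encoding=encoding) as handle:
        return preview_rows(handle, limit)


def count_rows(lines):
    """Count CSV rows one at a time (works on an open file handle too)."""
    return sum(1 for _ in csv.reader(lines))


sample = ["#adm1+name,#population\n", "Bahia,100\n", "Ceara,200\n"]
print(preview_rows(sample, limit=2))
print(count_rows(sample))
```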

License

Public Domain Dedication

The EticaAI has dedicated the work to the public domain by waiving all of their rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.