EticaAI/HXL-Data-Science-file-formats

`hxl2tab`: tab format, focused for compatibility with Orange Data Mining

fititnt opened this issue · 5 comments

The EticaAI-Data_HXL-Data-Science-file-formats_Tab already has a draft of a table that could be used to build an expert system without the need for full machine learning models.

But for this implementation, I think we can simply implement both the more specific prefixes, like +vt_orange_, and maybe some more generic special attributes to be used with #3, like one to mark the "class" (both Orange and Weka use the term class).

Screenshot from 2021-01-25 23-36-09

OK, interesting. Here is the Orange 'Simplified header' specification:

Screenshot from 2021-01-27 23-05-53


While not ideal, the HXLated output without text headers is actually pretty similar to what Orange expects. The biggest difference is that Orange treats everything after the # as the textual header name, but a few short flags can be added before it.

Screenshot from 2021-01-27 23-01-11
Screenshot from 2021-01-27 23-02-52

hxl2tab https://docs.google.com/spreadsheets/d/1Vqv6-EAdSHMSZvZtE426aXkDiwP8Mdrpft3tiGQ1RH0/edit#gid=0 temp/example-ebola-dataset-1_HXLated+tab.csv

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ head temp/example-ebola-dataset-1_HXLated+tab.csv
#status	#country	#adm1	#adm1+code	#loc	#loc	#org	#loc+type	#affected+dead	#affected+confirmed	#affected+suspected
Pending	Liberia	Margibi	LR09	Kakata 1	Kakata 2 AFL	AFL	ETC	0	0	0
Functional	Guinea	Nzerekore	GN008		Nzerekore	Ailema (?)	ETC	45	56	3
Pending	Liberia	River Gee	LR13	Fishtown	Fishtown ETC	American Red Cross	ETC	0	0	0
Functional	Sierra Leone	Western	SL04	Jui	Sierra Leone-China Friendship Hospital (Jui Hospital)	Chinese CDC	ETC	47	65	17
Pending	Guinea	Nzerekore	GN008			Croix-Rouge française	ETC	0	0	0
Pending	Sierra Leone	Western	SL04	Freetown	Goderich	EMERGENCY	ETC	0	0	0
Functional	Sierra Leone	Western	SL04	Lakka	Lakka Hospital ETU	EMERGENCY Italian NGO	ETC	3	17	11
Functional	Liberia	Margibi	LR09	Firestone	Firestone Medical Center	Firestone Company	ETC	14	29	19
Functional	Liberia	Montserrado	LR11	Monrovia	Monrovia, Congo Town - Old Ministry of Defence ETU 1	FMT	ETC	1	30	6

hxl2tab https://docs.google.com/spreadsheets/d/1Vqv6-EAdSHMSZvZtE426aXkDiwP8Mdrpft3tiGQ1RH0/edit#gid=0 temp/example-ebola-dataset-1_HXLated+tab_hxltabv15.tab

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ head temp/example-ebola-dataset-1_HXLated+tab_hxltabv15.tab
cD#status+vt_categorical+vt_class	D#country+vt_categorical	D#adm1+vt_categorical	D#adm1+code+vt_categorical	D#loc+vt_categorical	D#loc+vt_categorical	D#org+vt_categorical	#loc+type+vt_meta	C#affected+dead+number	C#affected+confirmed+number	C#affected+suspected+number
Pending	Liberia	Margibi	LR09	Kakata 1	Kakata 2 AFL	AFL	ETC	0	0	0
Functional	Guinea	Nzerekore	GN008		Nzerekore	Ailema (?)	ETC	45	56	3
Pending	Liberia	River Gee	LR13	Fishtown	Fishtown ETC	American Red Cross	ETC	0	0	0
Functional	Sierra Leone	Western	SL04	Jui	Sierra Leone-China Friendship Hospital (Jui Hospital)	Chinese CDC	ETC	47	65	17
Pending	Guinea	Nzerekore	GN008			Croix-Rouge française	ETC	0	0	0
Pending	Sierra Leone	Western	SL04	Freetown	Goderich	EMERGENCY	ETC	0	0	0
Functional	Sierra Leone	Western	SL04	Lakka	Lakka Hospital ETU	EMERGENCY Italian NGO	ETC	3	17	11
Functional	Liberia	Margibi	LR09	Firestone	Firestone Medical Center	Firestone Company	ETC	14	29	19
Functional	Liberia	Montserrado	LR11	Monrovia	Monrovia, Congo Town - Old Ministry of Defence ETU 1	FMT	ETC	1	30	6
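
The difference between the two headers above can be sketched in a few lines of Python. This is only an illustration, not the actual bin/hxl2tab code: the function name `orange_header` is hypothetical, and the mapping rules (`+vt_class` to the `c` flag, `+vt_categorical` to `D`, `+number` to `C`) are assumptions inferred from the sample output. Columns without a recognized hint are left unprefixed, as happens with the `+vt_meta` column in the sample.

```python
def orange_header(hxl_hashtag: str) -> str:
    """Prefix an HXL hashtag with Orange simplified-header flags.

    Hypothetical helper: the attribute-to-flag rules are assumptions
    inferred from the sample output in this thread.
    """
    # Everything after the first "+" is an attribute, e.g. "vt_class".
    attributes = set(hxl_hashtag.lstrip("#").split("+")[1:])

    # "c" marks the class (target) column in Orange.
    flag = "c" if "vt_class" in attributes else ""

    # "D" is a discrete (categorical) variable, "C" a continuous one.
    if "vt_categorical" in attributes:
        var_type = "D"
    elif "number" in attributes:
        var_type = "C"
    else:
        var_type = ""  # no recognized hint: leave the column unprefixed

    # Orange reads the text before "#" as flags and the rest as the name.
    return f"{flag}{var_type}{hxl_hashtag}"


def orange_header_row(hxl_row: list[str]) -> str:
    """Join converted hashtags into one tab-separated header line."""
    return "\t".join(orange_header(tag) for tag in hxl_row)
```

For example, `orange_header("#status+vt_categorical+vt_class")` gives `cD#status+vt_categorical+vt_class`, which matches the first column of the .tab output above.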

Screenshot from 2021-02-06 19-25-58

Hmm, from this semi-random Reddit thread I found https://github.com/hugapi/hug. So, in theory, it is possible to expose the CLI interface as a web app in a hackish way. At a bare minimum, this lets us pass Orange a URL (even a local one) instead of manually saving the file with the CLI app.

The post cites other alternatives, but this one requires fewer dependencies and a small number of changes. Also, for quick tests, if there is a need to quickly expose the URL without setting up a remote server, it would be possible to use ngrok (https://ngrok.com/). This may be useful if someone else needs something for a short period: any random person from the community can just send a private URL from their computer and solve the issue until something better comes along.
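
To make the idea concrete, here is a minimal sketch of serving converted output over HTTP so that Orange can load it from a URL. It deliberately does not use hug: it is a plain WSGI app built only on the Python standard library, swapped in to keep the example self-contained. The names `make_app` and `convert`, and the `source_url` query parameter, are hypothetical and are not part of hxl2tab.

```python
from urllib.parse import parse_qs


def make_app(convert):
    """Build a WSGI app around a conversion callable.

    `convert` is a hypothetical function that takes the source URL of an
    HXLated dataset and returns the Orange .tab content as a string.
    """

    def app(environ, start_response):
        # Read ?source_url=... from the request query string.
        query = parse_qs(environ.get("QUERY_STRING", ""))
        source = query.get("source_url", [""])[0]

        if not source:
            start_response(
                "400 Bad Request",
                [("Content-Type", "text/plain; charset=utf-8")],
            )
            return [b"missing source_url parameter\n"]

        body = convert(source).encode("utf-8")
        start_response(
            "200 OK",
            [("Content-Type", "text/tab-separated-values; charset=utf-8")],
        )
        return [body]

    return app


# To serve it locally (assumption: my_convert wraps the real conversion):
#   from wsgiref.simple_server import make_server
#   make_server("127.0.0.1", 8000, make_app(my_convert)).serve_forever()
```

With a server like this running, Orange could be pointed at a local URL carrying a `source_url` parameter instead of a file saved by hand, and ngrok could forward that same port if it needs to be reachable from outside.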

Screenshot from 2021-02-07 10-12-36

A proof of concept has existed since at least v0.8.7.1 and is documented in the main README.md.

This can be used standalone, but it still requires the original dataset to already be valid HXL and to have some tags like +vt_orange_flag_class that work as hints for the export to Orange.

Trivia: hxlquickmeta is one way to automate how a dataset could be tagged for use with hxl2tab (which could be useful for very large datasets with many columns). But the inner parts of bin/hxl2tab still require editing Python code (unlike most other new tools here, which have fully configurable ontologies in YAML).


From the README:

1.2.2 hxl2tab: tab format, focused for compatibility with Orange Data Mining

What it does: hxl2tab uses an already HXLated dataset and then, based on
#hashtag+attributes, generates an Orange Data Mining .tab format with extra
hints.

hxl2tab v2.0 has some usable functionality for generating the file
through a web interface instead of the CLI. It uses hug 🐨 🤗.

If you want to quickly expose it outside localhost, try ngrok.

Installation

This package can be installed by copying
bin/hxl2tab to a place on your executable path and
installing the dependencies manually.

The automated way, which installs it to your path as part of the
Python PyPI package hdp-toolchain
with the extra dependencies already included, is:

python3 -m pip install hdp-toolchain[hxl2tab]

# python3 -m pip install hdp-toolchain[full]