`hxlquickimport`
fititnt opened this issue · 4 comments
Meta
hxl +public | |
---|---|
meta +status | working-draft |
meta +id | EticaAI-Data_HXL-Data-Science-file-formats_hxlquickimport |
meta +discussion+public | #6 |
meta +hxlproxy +url | https://proxy.hxlstandard.org/data?dest=data_view&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY%2Fedit%23gid%3D1097528220 |
meta +description | hxlquickimport is a quick (and wrong) way to import a non-HXL dataset (like a .csv or .xlsx, but it requires the headers to already be on the first row) without human intervention. It will try to slugify each original header and add it as a +attribute to a base hashtag like #meta. The result may be an HXL file with valid syntax (that can be used for automated testing), but for most HXL-powered tools it would still need human review. How does it work? "[Max Power] Kids: there's three ways to do things; the right way, the wrong way and the Max Power way! [Bart Simpson] Isn't that the wrong way? [Max Power] Yeah, but faster!" (via https://www.youtube.com/watch?v=7P0JM3h7IQk) How to do it the right way? Read the documentation on https://hxlstandard.org/. (Tip: both HXL Postcards and the hxl-hashtag-chooser are very helpful!) |
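The description above (slugify each header, then attach it as an attribute to a base hashtag) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actual code of bin/hxlquickimport; the function names `slugify` and `quick_hxlate`, and the `x_` prefix for slugs that start with a digit, are assumptions made for this example.

```python
import csv
import io
import re
import unicodedata


def slugify(header: str) -> str:
    """Reduce a free-text header to lowercase ASCII letters, digits and underscores."""
    ascii_text = (
        unicodedata.normalize("NFKD", header).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "_", ascii_text.lower()).strip("_")


def quick_hxlate(csv_text: str, base_hashtag: str = "#meta") -> str:
    """Insert a naive HXL hashtag row right after the header row (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    hashtags = []
    for header in rows[0]:
        slug = slugify(header)
        # Assumption for this sketch: prefix slugs that start with a digit,
        # since HXL attributes are expected to start with a letter.
        if slug and not slug[0].isalpha():
            slug = "x_" + slug
        hashtags.append(f"{base_hashtag}+{slug}" if slug else base_hashtag)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(rows[0])   # original text headers
    writer.writerow(hashtags)  # generated hashtag row
    writer.writerows(rows[1:])  # data rows, unchanged
    return out.getvalue()


sample = "Sepal Length,Sepal Width,Class\n5.1,3.5,Iris-setosa\n"
print(quick_hxlate(sample, base_hashtag="#item"))
```

The generated hashtags only guarantee valid syntax; they carry no real semantics, which is why the result would still need human review.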
Spreadsheet data
See EticaAI-Data_HXL-Data-Science-file-formats_hxlquickimport (https://docs.google.com/spreadsheets/d/1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY/edit#gid=1097528220) for updated content. This is a snapshot.
Category | Nome | URL | URL source |
---|---|---|---|
#item+category | #item +name | #item +url | #item +source +url |
test-dataset | mx.gob.dados_dataset_informacion-referente-a-casos-covid-19-en-mexico_2020-06-01.csv | https://drive.google.com/file/d/1nQAu6lHvdh2AV7q6aewGBQIxFz7VrCF9/view?usp=sharing | https://github.com/CMedelR/dataCovid19 |
test-dataset | br.einstein_dataset_covid-pacientes-hospital-albert-einstein-anonimizado_2020-03-28_before-HXLate | https://docs.google.com/spreadsheets/d/1GQVrCQGEetx7RmKaZJ8eD5dgsr5i1zNy_UJpX3_AgrE/edit?usp=sharing | https://www.kaggle.com/einsteindata4u/covid19 |
research-paper | data-mining-for-the-study-of-the-epidemic-sars-cov-2-covid-19-algorithm-for-the-identification-of-patients-sars-cov-2-covid-19-in-mexico.pdf | https://drive.google.com/file/d/1WaW2b7bGiSZjvc4OdA0kjrBtRTkKV11N/view?usp=sharing | https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3619549 |
Thanks to @CMedelR!!!
Not only does Ramírez have a research paper called Data mining for the study of the Epidemic (SARS-CoV-2) COVID-19: Algorithm for the identification of patients (SARS-CoV-2) COVID 19 in Mexico, and not only does his repository at https://github.com/CMedelR/dataCovid19 have a backup copy of the (at the moment) offline link at https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico, but his paper also explicitly mentions the use of Orange Data Mining!
While his dataset will be used as an additional test sample (previously the only one was from the Albert Einstein Hospital in São Paulo), we're also adding his paper, since I'm very sure more people would like to find it later!
The hxlquickmeta (cli tool) + HXLMeta (Usable Class) #9, while able to fall back and use Pandas and then Orange Data Mining, still fails with something like hxlquickmeta tests/files/iris.csv.
I think that at least for very basic CSV files, hxlquickmeta could implement the features of hxlquickimport.
fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hxlquickmeta tests/files/iris.csv
> Connection overview
>> TODO: implement raw connection, HTTP headers, etc
>> (this should output debug information even
>> for inputs that would break libhxl)
ERROR! libhxl and/or HXLmeta/HXLMetaExtras failed <HXLException: HXL tags not found in first 25 rows>
Ok. Trying harder now with HXLMetaExtras...
>> HXLMetaExtras: Pandas DataFrame
>>> DataFrame
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
[150 rows x 5 columns]
>>> DataFrame.T
0 1 2 3 4 5 ... 144 145 146 147 148 149
sepallength 5.1 4.9 4.7 4.6 5.0 5.4 ... 6.7 6.7 6.3 6.5 6.2 5.9
sepalwidth 3.5 3.0 3.2 3.1 3.6 3.9 ... 3.3 3.0 2.5 3.0 3.4 3.0
petallength 1.4 1.4 1.3 1.5 1.4 1.7 ... 5.7 5.2 5.0 5.2 5.4 5.1
petalwidth 0.2 0.2 0.2 0.2 0.2 0.4 ... 2.5 2.3 1.9 2.0 2.3 1.8
class Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa ... Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica
[5 rows x 150 columns]
>>> DataFrame.describe
sepallength sepalwidth petallength petalwidth
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
>> HXLMetaExtras: Orange Data Mining
data.domain [sepallength, sepalwidth, petallength, petalwidth, class]
data.columns <Orange.data.table.Columns object at 0x7f416848cd30>
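The HXLException in the output above is raised because no hashtag row is found near the top of the file. A simplified version of that kind of check could look like the sketch below. It is only an approximation for illustration: the regular expression, the function name `find_hashtag_row`, and the "every non-empty cell must be a hashtag" rule are assumptions, and the real libhxl parser may apply different rules.

```python
import csv
import io
import re

# Simplified hashtag pattern: "#tag" followed by optional "+attribute" parts.
HASHTAG = re.compile(r"^\s*#[A-Za-z][A-Za-z0-9_]*(\s*\+[A-Za-z][A-Za-z0-9_]*)*\s*$")


def find_hashtag_row(csv_text: str, max_rows: int = 25):
    """Return the index of the first row that looks like an HXL hashtag row, or None."""
    reader = csv.reader(io.StringIO(csv_text))
    for index, row in enumerate(reader):
        if index >= max_rows:
            break  # give up after scanning the first max_rows rows
        cells = [cell for cell in row if cell.strip()]
        if cells and all(HASHTAG.match(cell) for cell in cells):
            return index
    return None


plain = "sepallength,sepalwidth\n5.1,3.5\n"
hxlated = "sepallength,sepalwidth\n#item+sepallength,#item+sepalwidth\n5.1,3.5\n"
print(find_hashtag_row(plain))
print(find_hashtag_row(hxlated))
```

A plain file such as tests/files/iris.csv has no row of this kind, so a check like this one finds nothing to parse as HXL.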
My last comment can be ignored. Actually, this may not be needed. As long as hxlquickmeta accepts stdin (that is, can be piped into) and all the other tools work with pipes (the standard ones from HXLStandard do!), there is no need to implement this at all.
So instead of hxlquickmeta tests/files/iris.csv
it is just hxlquickimport tests/files/iris.csv | hxlquickmeta
This makes hxlquickmeta fail:
# Non HXLated file
hxlquickmeta tests/files/iris.csv
(...)
ERROR! libhxl and/or HXLmeta/HXLMetaExtras failed <HXLException: HXL tags not found in first 25 rows>
Ok. Trying harder now with HXLMetaExtras...
(...)
This one works (but not for complex Excel files):
# Non HXLated file
hxlquickimport tests/files/iris.csv | hxlquickmeta
## (...)
> lihxl-python overview
>> output.output <_io.TextIOWrapper name='/tmp/tmphdplthem' mode='w' encoding='UTF-8'>
>> source <hxl.io.HXLReader object at 0x7fc33c008820>
> HXLMeta debuginfo
>> HXLMeta.text_headers None
>> HXLMeta.hxl_headers ['#item+sepallength', '#item+sepalwidth', '#item+petallength', '#item+petalwidth', '#item+class']
> get_hashtag_info [ #item+sepallength ] [ None ]
(...)
A potential problem with hxlquickmeta is if it does not work with streams.
I will make this comment on another issue, so that it is kept as a note for the future.
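For reference, the piping strategy discussed above relies on each tool being able to read from stdin when no file is given, and on processing the input incrementally. A minimal sketch of that pattern in Python follows. It is not the code of hxlquickmeta; the names `open_input` and `main`, and the convention that "-" means stdin, are assumptions for this example.

```python
import argparse
import sys


def open_input(path, stdin=None):
    """Return a text stream: the named file, or stdin when no path (or "-") is given."""
    if path is None or path == "-":
        return stdin if stdin is not None else sys.stdin
    return open(path, newline="", encoding="utf-8")


def main(argv=None, stdin=None, stdout=None):
    """Hypothetical stdin-aware command: copies its input to its output."""
    parser = argparse.ArgumentParser(description="Hypothetical stdin-aware reader")
    parser.add_argument("infile", nargs="?", default=None)
    args = parser.parse_args(argv)
    out = stdout if stdout is not None else sys.stdout
    stream = open_input(args.infile, stdin=stdin)
    try:
        # Iterate line by line instead of reading everything at once,
        # so the tool also works when the input is a pipe.
        for line in stream:
            out.write(line)
    finally:
        if args.infile not in (None, "-"):
            stream.close()  # only close files that this function opened
    return 0
```

A script built this way could be called either with a file path or at the end of a pipe; a tool that needs to re-read or seek in its input would have to buffer the stream first.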
The hxlquickimport already has a working proof of concept and, since it is an all-in-one single file, it can work even without [meta issue] hxlm #11 or the [meta] hxlm.core. As long as the libraries it depends on are installed, you just need to put bin/hxlquickimport on the working path.
If needed, this issue could be re-opened, but the current version of bin/hxlquickimport (single file) is mostly an hxltag with implicit defaults, and it could even be something I would propose adding to HXLStandard/libhxl-python.
Eventual point to be done (but not today)
Without actually doing a full refactoring to use something like the hxlm.core (or something more 'pythonic'), maybe bin/hxlquickimport will be installed along with this repository when using
pip install https://github.com/EticaAI/HXL-Data-Science-file-formats
With this, it would at least be more intuitive to explain another strategy for using these tools (and then the Minimal documentation about how to use the command line tools #1 could be solved).