- Analyze open source dynamism in a more sophisticated way than Excel allows.
- Analyze the data with a focus on Japan.
- Convert the survey raw data (xlsx) into JSON data objects.
- Survey rawdata: 2024 World of Open Source: Global Spotlight
- Output JSON data: published on data.world
Unfortunately, some manual work is required, because the question headers (lines 1-3 of the xlsx file) do not all have the same depth. For Q30, Q31, Q32, Q38, Q42, Q43, and Q47, you need to append the line 3 item to the line 2 header.
Example (Q30): the line 2 item
"How often does using OSS deliver the following benefits in your organization? (select one response per row)"
is combined with the line 3 item to give
"How often does using OSS deliver the following benefits in your organization? (select one response per row):Improved software quality"
Then save the file as "input_files/input.csv".
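The manual header edit above could in principle be scripted. The following is a minimal sketch, not part of this repository: `merge_header_rows` is a hypothetical helper, and it assumes that header line 2 holds the question text only in the first column of each question (blank in the following columns), while line 3 holds one sub-item per column.

```python
def merge_header_rows(row2, row3):
    """Combine question text (line 2) with per-column sub-items (line 3).

    Assumption: a blank line 2 cell belongs to the most recent question,
    as happens when merged Excel cells are exported.
    """
    merged = []
    last_question = ""
    for question, item in zip(row2, row3):
        question = question.strip()
        item = item.strip()
        if question:
            last_question = question
        # Append the sub-item with ":" only when line 3 has one for this column.
        merged.append(f"{last_question}:{item}" if item else last_question)
    return merged
```

If the real sheet repeats the question text in every column, the forward-fill does no harm; if the layout differs in other ways, the helper would need adjusting.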
Just run run_pipeline.py.
Example:
curl -s -H "Content-Type: application/x-ndjson" -XPOST "localhost:9200/index_name/_bulk?pretty" --data-binary @output_1l_bulk.json
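For readers who prefer to stay in Python, the same upload can be sketched with the standard library. This is an illustration only: `build_bulk_request` is a hypothetical helper, and the URL assumes an unauthenticated cluster listening on plain HTTP at `localhost:9200`, as in the curl example.

```python
import urllib.request


def build_bulk_request(bulk_path, url="http://localhost:9200/index_name/_bulk"):
    """Build a POST request that carries the NDJSON bulk file as its body."""
    with open(bulk_path, "rb") as f:
        body = f.read()
    # The Bulk API expects the payload to end with a newline.
    if not body.endswith(b"\n"):
        body += b"\n"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )


# Usage (requires a running cluster):
# request = build_bulk_request("output_1l_bulk.json")
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

A cluster with TLS or authentication enabled would need extra handling that this sketch leaves out.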
Configure the settings and analyze the data in OpenSearch Dashboards or Kibana.
- run_pipeline.py: Main script; runs the following scripts and shell commands (bash).
- step1.py: Modifies the CSV file extracted from the original Excel file.
*In: input_files/input.csv *Out: output_files/output_step1.csv
- step2.py: Converts timestamps to ISO format (from 'yyyy-MM-dd HH:mm:ss' to 'yyyy-MM-ddTHH:mm:ss').
*In: output_files/output_step1.csv *Out: output_files/output_step2.csv
- step3.py: Converts the CSV into JSON.
*In: output_files/output_step2.csv *Out: output_files/output_step3.json
- cmd1: Extracts each data object from the array, one per line (
cat *In* | jq -c '.[]' > *Out*
)
*In: output_files/output_step3.json *Out: output_files/output_per_line.json
- cmd2: Inserts a Bulk API action line before each document so the file can be registered in OpenSearch/Elasticsearch (
sed 'i\{ "index" : {} \}' *In* > *Out*
)
*In: output_files/output_per_line.json *Out: output_files/output_bulk.json
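The timestamp conversion and the two shell commands can be illustrated in Python as follows. This is a sketch of the techniques, not the contents of step2.py or run_pipeline.py: the function names are hypothetical, and it assumes the target format is standard ISO 8601 with no space after the `T`.

```python
import json
from datetime import datetime


def to_iso_timestamp(value):
    """Convert 'yyyy-MM-dd HH:mm:ss' to ISO 8601 'yyyy-MM-ddTHH:mm:ss'."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S").isoformat()


def to_bulk_lines(records):
    """Yield a Bulk API action line followed by one compact JSON document.

    Mirrors `jq -c '.[]'` (one object per line) plus the sed step that
    inserts '{ "index" : {} }' before every document.
    """
    for record in records:
        yield '{ "index" : {} }'
        yield json.dumps(record, ensure_ascii=False, separators=(",", ":"))


def write_bulk_file(json_path, bulk_path):
    """Read a JSON array from json_path and write an NDJSON bulk file."""
    with open(json_path, encoding="utf-8") as src:
        records = json.load(src)
    with open(bulk_path, "w", encoding="utf-8") as dst:
        for line in to_bulk_lines(records):
            dst.write(line + "\n")
```

Calling `write_bulk_file("output_files/output_step3.json", "output_files/output_bulk.json")` would then cover cmd1 and cmd2 in one pass, at the cost of not keeping the intermediate per-line file.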
    /
    │
    ├── scripts/
    │   ├── step1.py
    │   ├── step2.py
    │   ├── step3.py
    │   └── run_pipeline.py
    │
    ├── input_files/
    │   └── input.csv
    │
    ├── output_files/
    │   └── output*.*
    │
    └── README.md
- Ubuntu 22.04 LTS
- Python 3.11.0
- bash 5.1.16
- Unlicense