Observing impact of incomplete data on different insight types

For more detail, please read our full paper: "Quality Matters: Understanding the Impact of Incomplete Data on Visualization Recommendation".
We used three types of insights in this work: aggregate-based, correlation-based, and distribution-based insights.
We observed the impact of missing data under different settings, such as different percentages of missing data and different values of k.

Preparation

For the aggregate-based insight, PostgreSQL must be installed first.

$ cd 0_generate_data
$ python import_heart_data_to_postgreSQL.py
$ python generate_missing_data_version_postgreSQL.py

The second command imports the heart disease data (CSV format) into PostgreSQL. The third command generates versions of the heart disease data with missing values.
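As a rough illustration of what a "missing data version" means, the sketch below blanks out a chosen fraction of one column in an in-memory table. It assumes values are removed completely at random; the repository's script may use a different mechanism and works against PostgreSQL rather than Python lists. The function name `inject_missing` and the columns `age` and `chol` are hypothetical and used only for this example.

```python
import random

def inject_missing(rows, column, fraction, seed=0):
    """Return a copy of `rows` with `fraction` of the values in `column` set to None.

    Assumption: values go missing completely at random (hypothetical helper,
    not the repository's implementation).
    """
    rng = random.Random(seed)  # fixed seed so each version is reproducible
    n_missing = round(len(rows) * fraction)
    missing_idx = set(rng.sample(range(len(rows)), n_missing))
    return [
        {**row, column: None} if i in missing_idx else dict(row)
        for i, row in enumerate(rows)
    ]

# Toy stand-in for the heart disease table (made-up values).
rows = [{"age": 40 + i, "chol": 200 + 2 * i} for i in range(10)]

# One version per missing-data percentage.
versions = {pct: inject_missing(rows, "chol", pct / 100) for pct in (10, 30, 50)}
```

Each entry of `versions` can then be fed through the same recommendation pipeline as the complete data so that the two outputs can be compared.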

Aggregate-based Insight

$ cd aggregate_based_insight
$ python a_seedb_main.py
$ python 3_0_div_missing_a_m_vs_ideal.py #Outstanding-based insight - deviation
$ python 3_0_sim_missing_a_m_vs_ideal.py #Similarity-based insight

The second command generates all possible visualizations with their utility scores; the results are stored in an Excel file. The next two commands calculate the Jaccard, RBO, and cumulative distance scores by comparing the visualizations recommended from the incomplete data with those recommended from the ideal (complete) data.
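To make the comparison concrete, here is a minimal sketch of two of these measures applied to a pair of ranked recommendation lists. It assumes RBO stands for rank-biased overlap and uses the simple truncated form (no extrapolation), so two identical lists of length k score 1 - p^k rather than 1. The scripts in this repository may define or normalize the scores differently, and the cumulative score is not covered here. The function names and the view labels `v1` to `v5` are hypothetical.

```python
def jaccard(a, b):
    """Set overlap of two recommendation lists, ignoring rank order."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def rbo_truncated(a, b, p=0.9):
    """Truncated rank-biased overlap of two ranked lists.

    At each depth d, the agreement is the share of items common to both
    top-d prefixes; agreements are weighted by p**(d - 1), so differences
    near the top of the ranking matter more.
    """
    depth = min(len(a), len(b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        agreement = len(seen_a & seen_b) / d
        score += p ** (d - 1) * agreement
    return (1 - p) * score

# Hypothetical top-4 recommendations from the complete and the incomplete data.
ideal = ["v1", "v2", "v3", "v4"]
incomplete = ["v1", "v3", "v5", "v2"]

jaccard_score = jaccard(ideal, incomplete)
rbo_score = rbo_truncated(ideal, incomplete)
```

Jaccard only asks whether the same views were recommended, while the rank-biased score also penalizes views that moved position, which is why both are useful when varying k.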

Other Insight Types

The other insight types do not require PostgreSQL.

$ cd other_insight_types
$ python 0_correlation_insights_random.py #Correlation-based insight
$ python 1_kurt_insight_random.py #Kurtosis-based insight
$ python 2_skewness_insight_random.py #Skewness-based insight
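
For reference, the three statistics behind these scripts can be sketched in plain Python as below. This is only an illustration under stated assumptions: missing values are represented as `None` and simply skipped (pairwise for correlation), and the population (biased) moment formulas are used. The repository's scripts may handle missing values and estimator corrections differently, and the function names here are hypothetical.

```python
import math

def _central_moments(values):
    """Population central moments (2nd, 3rd, 4th), skipping None entries."""
    vals = [v for v in values if v is not None]
    n = len(vals)
    mean = sum(vals) / n
    m2 = sum((v - mean) ** 2 for v in vals) / n
    m3 = sum((v - mean) ** 3 for v in vals) / n
    m4 = sum((v - mean) ** 4 for v in vals) / n
    return m2, m3, m4

def skewness(values):
    """Population skewness: 0 for a symmetric sample."""
    m2, m3, _ = _central_moments(values)
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Population excess kurtosis: 0 for a normal distribution."""
    m2, _, m4 = _central_moments(values)
    return m4 / m2 ** 2 - 3

def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

# Made-up columns; None marks a missing value.
corr = pearson([1, 2, 3, None], [2, 4, 6, 8])
skew = skewness([1, 2, 3, 4, 5])
kurt = excess_kurtosis([1, 2, 3, 4, 5])
```

Because each statistic is computed only from the values that remain, its score can drift as the missing-data percentage grows, which in turn can reorder the recommended insights.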

Plot the result

$ cd plotting
$ python example_plotting_result_impact_missing_data_5_insights_RBO.py

Result:
