Which correlation and why?
TylerKirby opened this issue · 1 comment
Determine which correlation metric you use in https://github.com/jacksteussie/DSCA/blob/main/Data%20Analysis%20with%20Python/medical-data-visualizer/medical_data_visualizer.py and justify your choice. Does your analysis change with different metrics? How should you choose a metric?
I chose the Pearson correlation because the data fit the assumptions it requires to be accurate. Specifically: 1) the data is on an interval or ratio scale, 2) the relationships between variables are mostly linear, 3) we removed the outliers from the dataset, and 4) the data is approximately normally distributed (tested with scipy's normaltest, p < 0.02). The differences between the Spearman and Pearson correlations are very small, but the differences between Kendall's tau and the other two are quite noticeable, which would end up changing the analysis. You should choose a metric based on the assumptions each one makes, since those determine how useful and accurate it is in a given situation: the better your data fits a metric's assumptions, the more trustworthy that correlation is going to be.
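The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked script: the helper name `compare_correlations` and the synthetic `x`/`y` data are assumptions made for the example, and only the `scipy.stats` calls (`pearsonr`, `spearmanr`, `kendalltau`, `normaltest`) are real library functions.

```python
import numpy as np
from scipy import stats


def compare_correlations(x, y):
    """Return Pearson r, Spearman rho, and Kendall tau for two 1-D samples.

    Hypothetical helper for illustration only.
    """
    pearson_r, _ = stats.pearsonr(x, y)      # linear association
    spearman_rho, _ = stats.spearmanr(x, y)  # rank-based, monotonic association
    kendall_tau, _ = stats.kendalltau(x, y)  # based on concordant/discordant pairs
    return {
        "pearson": float(pearson_r),
        "spearman": float(spearman_rho),
        "kendall": float(kendall_tau),
    }


# Synthetic, roughly linear data (an assumption; not the medical dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(scale=0.8, size=500)

coeffs = compare_correlations(x, y)
for name, value in coeffs.items():
    print(f"{name}: {value:.3f}")

# normaltest's null hypothesis is that the sample comes from a normal
# distribution, so a small p-value is evidence against normality.
statistic, p_value = stats.normaltest(x)
print(f"normaltest p-value for x: {p_value:.3f}")
```

On data like this, Kendall's tau usually comes out smaller in magnitude than Pearson's r or Spearman's rho because it is built from pair counts rather than from values or ranks directly. It is worth checking whether the three metrics rank the variable pairs differently before concluding that the analysis itself changes.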