Medical Data Visualizer (Certification Project)

Question

Closed this issue 2 years ago · 10 comments

beaucarnes commented 5 years ago

Create project from the Python data analysis certification.

Answer 1 · 2019-12-20T07:39:38.000Z

Project prototype: https://repl.it/@BeauCarnes/fcc-medical-data-visualizer

Answer 2 · 2020-01-19T01:02:50.000Z

Opening paragraph:

The dataset values were collected during medical examinations.

I would write a little more about the context of the data:

What do the rows represent? (Patients)
What information do the columns contain? (Patients' body measurements, results from various blood tests, and lifestyle choices.)
What is the objective of our analysis? (Exploring the relationship between cardiac disease and body measurements, blood markers and lifestyle choices).

Questions

Convert the data into long format and create a chart that shows the value counts of the categorical features using seaborn's catplot(). The dataset should be split by 'Cardio' so there is one chart for each 'cardio' value. The chart should look like "examples/Figure_1.png".

The stub for this function in medical_data_visualizer.py is draw_cat_plot, but main.py and test_module.py call medical_data_visualizer.draw_factor_plot().
This is a long problem, and it's mostly about creating the plot, so I would highlight that in the wording of the question. I would describe the desired visualization first, and then outline the steps required to build it.
- Create a chart similar to examples/Figure_1.png, where we show the counts of good and bad outcomes for (ac, alco, active, cholesterol, gluc, smoke) for patients with cardio=1 and cardio=0 in different panels.
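A minimal sketch of the "long format" step described above, on a tiny made-up frame. The values and the helper name `to_long_counts` are illustrative assumptions, not the project's solution. It uses `pandas.melt` to turn the categorical columns into variable/value pairs and then counts rows per cardio group; the plotting call is only indicated in a comment because it needs seaborn.

```python
import pandas as pd

# Toy data: made-up values, only to show the reshaping.
df = pd.DataFrame({
    "cardio": [0, 0, 1, 1],
    "smoke":  [0, 1, 0, 0],
    "alco":   [0, 0, 1, 0],
    "active": [1, 1, 0, 1],
})

def to_long_counts(frame, features):
    """Hypothetical helper: melt to long format and count each value per cardio group."""
    long_df = pd.melt(frame, id_vars=["cardio"], value_vars=features)
    counts = (
        long_df.groupby(["cardio", "variable", "value"])
        .size()
        .rename("total")
        .reset_index()
    )
    return counts

counts = to_long_counts(df, ["smoke", "alco", "active"])
print(counts)

# With seaborn available, one panel per cardio value could then be drawn with:
# sns.catplot(data=counts, x="variable", y="total", hue="value", col="cardio", kind="bar")
```

Because each (cardio, variable, value) combination appears exactly once in `counts`, each bar in such a plot would correspond to a single row.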

I tried to follow the instructions in the code comments but got stuck. I don't know seaborn that well, but I'd say that would be quite typical for other campers. I tried to replicate the plot in the example using these references:

- https://stackoverflow.com/questions/35692781/python-plotting-percentage-in-seaborn-bar-plot 
- https://seaborn.pydata.org/generated/seaborn.catplot.html

Here is my solution, which produces the visualization and is a little cleaner than the proposed solution (I think).

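  import matplotlib.pyplot as plt
  import seaborn as sns

  # df is the medical examination DataFrame, already loaded with pandas.
  # Recode cholesterol and gluc: 1 becomes 0, 2 and 3 become 1.
  medical_dict = {1: 0, 2: 1, 3: 1}
  df['cholesterol'] = df['cholesterol'].map(medical_dict)
  df['gluc'] = df['gluc'].map(medical_dict)

  # Count the rows for each combination of values, then melt to long format.
  df_cat = df.groupby(["cardio", "cholesterol", "gluc", "alco", "smoke", "active"]).size().rename("total").reset_index().melt(['total', 'cardio'])

  # One bar chart per cardio value.
  sns.catplot(data=df_cat, x='variable', y='total', hue='value', col='cardio', kind="bar", ci=None)
  plt.savefig('catplot.png')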

The resulting plot looks like the solution but does not pass the tests.

I would rewrite the instructions and the tests to follow this outline. I know it is a big change, and I might be wrong, so I'd like to know what you guys think. :)

Clean the data. Filter out the following patient segments that represent incorrect data.

Maybe the code for cleaning the data should live outside the draw_heat_map() function.
It would be nice to have specific tests for the data cleaning question.
The hints contain too much code. I would give students the names of the methods to use, but nothing more. Also, the suggested code selects rows that should be kept in the result, but the wording suggests that the condition selects the rows to filter out:

diastolic pressure is higher than systolic (df['ap_lo'] <= df['ap_hi'])

This implies that df['ap_lo'] <= df['ap_hi'] selects rows for which diastolic pressure is higher than systolic, but the opposite is true.
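To make the keep-versus-drop distinction concrete, here is a small sketch with made-up numbers. The helper name `clean_data` is hypothetical (it only illustrates the idea of moving the cleaning out of the plotting function); the column names `ap_lo` and `ap_hi` are the ones quoted above.

```python
import pandas as pd

def clean_data(df):
    """Hypothetical helper: keep only rows where diastolic <= systolic pressure."""
    keep = df["ap_lo"] <= df["ap_hi"]  # True for the rows that look valid
    return df[keep]

# Toy data: made-up values. The second row has diastolic > systolic.
df = pd.DataFrame({"ap_hi": [120, 80, 130], "ap_lo": [80, 120, 130]})

cleaned = clean_data(df)
dropped = df[~(df["ap_lo"] <= df["ap_hi"])]  # the rows the wording talks about

print(cleaned)
print(dropped)
```

A standalone function like this would also be easy to cover with its own test, for example by asserting that no remaining row has `ap_lo` greater than `ap_hi`.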

Other comments

For development, you can use main.py to test your functions.

You need to comment out the import medical_data_visualizer line for that to work, because the stubs raise syntax errors. There might be a way around that.

Answer 3 · 2020-01-20T02:48:52.000Z

I posted this in the wrong place initially:

The solution doesn't seem to result in the figure posted.

I am going to submit my results soon, but this seems like maybe an earlier draft of the starter problem?

I peeked at the solution to see how mine compared. I copied the code, and the resulting graph didn't match 'figure 1'. I believe the problem has something to do with the part using "fig, ax" (unless somehow I've made a major mistake).

In addition, the function in the solution doesn't match the name of the function in the problem provided (it changed from 'draw_factor_plot' to 'draw_cat_plot.')

The instructions call for us to include the 'overweight' column we created, but 'Figure_1' does not include the 'overweight' column.

The test also calls for 'draw_factor_plot' - perhaps this is all on me...
Will do a more thorough write up once challenge is completed.

Answer 4 · 2020-01-23T08:00:29.000Z

Thank you for your detailed reviews @rlabuonora and @rayjohnson529. The draw_cat_plot/draw_factor_plot is something we forgot to update in the boilerplate. Also good catch with the wording and df['ap_lo'] <= df['ap_hi'] -- we will have to check the graphs and adjust either the wording or the snippet itself. Also, you're right that Figure_1 doesn't include the overweight column, but Figure_2 does.

@rlabuonora, which syntax errors are you getting? Could you post a link to your solution on Repl.it?

Answer 5 · 2020-01-23T14:55:39.000Z

@rlabuonora, which syntax errors are you getting? Could you post a link to your solution on Repl.it?

I meant the starter code raises syntax errors as is. That's probably because of lines like:

fig, ax = None
sns.catplot(     ) # fill in the parenthesis

This means that you can't use main.py as is for experimenting without commenting out the line that imports the solution.
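As a side note on the first quoted line, taken exactly as written here: it parses without complaint and fails only when executed, so the error is a run-time TypeError, not a syntax error. The full starter file is not shown in this thread, so other lines may well have been genuine syntax errors; the sketch below checks only this one line, using the standard library.

```python
import ast

stub_line = "fig, ax = None"

# The line parses, so it is not a syntax error...
ast.parse(stub_line)

# ...but executing it fails at run time, because None cannot be unpacked.
try:
    exec(stub_line)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)
```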

Answer 6 · 2020-01-24T08:11:27.000Z

@rlabuonora, I see, that's a good point. Thank you for clarifying.

What we could do is have those bits of unfinished code commented out and change the comments to read, "Uncomment and fill in the parenthesis".
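A rough sketch of what such a stub might look like. The `fig = None` placeholder and the exact comment layout are assumptions for illustration, not the actual boilerplate.

```python
def draw_cat_plot():
    # Uncomment and fill in the parenthesis:
    # fig = sns.catplot(     )
    fig = None  # assumed placeholder so the function runs before it is completed
    return fig

# The module now loads and the function can be called, even though it is unfinished.
result = draw_cat_plot()
print(result)
```

With the unfinished call commented out, a file that imports this module can run before the function is completed.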

Answer 7 · 2020-01-24T14:54:36.000Z

Yes, I think that would work.

Answer 8 · 2020-02-11T20:40:21.000Z

I implemented a lot of the suggestions.

@rlabuonora You had some very helpful advice. It would be great if you had a chance to review this again with the changes. I tried to get your solution code to work, but it was changing the value of "total". In the original code, the highest total value is around 30K, but in your code it is around 10K. The truth is I'm kind of new to pandas myself, so I wasn't able to figure out the reason for the discrepancy. While I implemented much of what you suggested, there were some things I couldn't figure out the best way to implement. I'm definitely open to more suggestions. If you want to make your own version of the challenge with tests and description, we may be able to use it.
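One possible explanation for the lower totals, offered as a guess and not a diagnosis: seaborn's bar plots aggregate with the mean by default. When a frame has several rows per bar (as it does after grouping by all columns at once and then melting), each bar shows the average group size instead of the overall count. The toy numbers below are made up and use plain pandas to mimic that aggregation.

```python
import pandas as pd

# Toy data: made-up values.
df = pd.DataFrame({
    "cardio": [0, 0, 0, 0, 0, 0],
    "smoke":  [0, 0, 0, 0, 1, 1],
    "alco":   [0, 0, 0, 1, 0, 1],
})

# Group by every column at once, then melt, as in the solution posted earlier in the thread.
df_cat = (
    df.groupby(["cardio", "smoke", "alco"])
    .size()
    .rename("total")
    .reset_index()
    .melt(id_vars=["total", "cardio"])
)

per_bar = df_cat.groupby(["cardio", "variable", "value"])["total"]
mean_height = per_bar.mean()  # what a bar plot with a mean estimator would draw
sum_height = per_bar.sum()    # the actual number of rows behind each bar

print(pd.DataFrame({"mean": mean_height, "sum": sum_height}))
```

If this is indeed the cause, passing `estimator=sum` to `sns.catplot`, or counting after melting instead of before, would be two ways to get bar heights that equal the counts.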

Answer 9 · 2020-03-05T09:22:48.000Z

Unfortunately I can't move comments from one issue to another. @rayjohnson529 had some thoughts in #287.

Answer 10 · 2020-03-18T20:22:30.000Z

I'm pretty sure the issues brought up by @rayjohnson529 have already been fixed.