freeCodeCamp/CurriculumExpansion

Demographic Data Time Series Analyzer (Certification Project)

Closed this issue · 20 comments

Create project from the Python data analysis certification.

I found the instructions clear and the questions well specified.
I had never used the pandas library before and my Python isn't sharp,
so I apologize in advance for the messy solution.
Here's my forked project
I found some typos in the README.md file, starting on line 25

Hi! This assignment was pretty cool! I learned quite a bit of pandas solving it. Here are a few thoughts:

  • Exploratory data analysis is all about context, so I would add a little bit of information about the data in the opening paragraph (where does it come from? what kind of information does it contain? what kind of questions are we looking to answer from it?). Also, if the rows of the data represent individuals, refer to individuals throughout the assignment (e.g. How many individuals of each race are represented in this dataset? (race column)).

  • The csv file name is strange (maybe it should be adult_data.csv?)

  • The project asks us to round all decimals to the nearest tenth. I would use assertAlmostEqual instead.

  • In the question:

How many of each race are represented in this dataset? (race column)

It is not clear whether the result variable race_count should hold a series or a list. You can figure it out by looking at the tests, but the question should probably hint at it. Also, the order of the list in the result should not matter, so maybe you can cast the result down to a set when testing so this passes the test:

df.groupby('race').size()

  • I would not include the name of each column in EVERY question. For some of the questions, it might be nice to have the student look up the relevant variable in the columns of the data frame.

  • The wording on this question might be misleading:

What percentage of the people with AND without education equal to Bachelors, Masters, or Doctorate also have a salary of >50K? (Every row of data has salary of either '>50K' or '<=50K')

I would ask:

  • What percentage of individuals with advanced education make more than 50K?

  • What percentage of individuals without advanced education make less than 50K?

Hope the feedback helps, cheers!
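The two testing suggestions above (a tolerance instead of exact rounding, and an order-insensitive comparison of the race counts) could look roughly like the sketch below. The DataFrame is made-up toy data and the test class name is hypothetical; neither comes from the project's actual test_module.py.

```python
import unittest

import pandas as pd

# Made-up toy data standing in for the project's CSV (an assumption).
df = pd.DataFrame({
    "race": ["White", "Black", "White", "Asian-Pac-Islander", "White"],
    "age": [39, 50, 38, 53, 28],
})


class ToleranceAndOrderExamples(unittest.TestCase):
    def test_average_age_with_tolerance(self):
        average_age = df["age"].mean()
        # places=1 tolerates small rounding differences in the answer.
        self.assertAlmostEqual(average_age, 41.6, places=1)

    def test_race_count_ignores_order(self):
        by_groupby = df.groupby("race").size()
        by_value_counts = df["race"].value_counts()
        # Comparing label -> count mappings makes the row order irrelevant.
        self.assertEqual(by_groupby.to_dict(), by_value_counts.to_dict())
```

Comparing dictionaries rather than sets keeps each count tied to its race label, so two races with the same count are still told apart.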

Thank you for your reviews @arthigos and @rlabuonora. We'll definitely fix that typo and take all of these points into account for the next draft. All of the suggestions about the wording, filenames, and adjustments to the tests so they're not as brittle are very helpful.

@arthigos & @rlabuonora Thanks! I made your recommended changes.

Completed Project: https://repl.it/@borntofrappe/fcc-demographic-data-analyzer

I worked on a previous version of the project, so I apologize if what I mention has already been fixed (I made sure to check the new version, but you never know).


In README.md:

  • in the description of the starter code, the file name is misspelled as deomgraphic_data_anaylizer. There's also a missing backtick at the end of the word.

  • the penultimate question asks:

    What percentage of the people who work the minimum number of hours per week have a salary of less than 50K?

    In the comments of the Python file, and most importantly in the test, we are instead looking for those who have a salary of more than 50K.


In demographic_data_analyzer.py:

  • around line 20, we set up two variables for the two types of education, but one is misspelled as eduction

    # higher_eduction = None
    higher_education = None
  • around line 27, we introduce the minimum number of hours and then the percentage of people working those hours. Following the previous comment structure, it might be better to move min_work_hours in between the two successive comments

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    
    min_work_hours = None
    
    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
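For readers unsure how the two comments relate, here is a minimal sketch of the calculation they describe, following the >50K wording used in the Python file and the test. The data is a made-up toy frame, and rich_percentage is simply the name that appears in the returned dictionary later in this thread; this is one possible approach, not the project's reference solution.

```python
import pandas as pd

# Made-up toy data; column names follow the ones quoted in this thread.
df = pd.DataFrame({
    "hours-per-week": [1, 40, 1, 60, 1, 1],
    "salary": [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

# Minimum number of hours a person works per week.
min_work_hours = df["hours-per-week"].min()

# Keep only the rows for people who work that minimum.
min_workers = df[df["hours-per-week"] == min_work_hours]

# Share of those rows with a salary of >50K, as a percentage rounded to a tenth.
rich_percentage = round((min_workers["salary"] == ">50K").mean() * 100, 1)

print(min_work_hours, rich_percentage)
```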

I wanted to add a couple of notes on the project itself. Hope you don't mind :)

As a first-timer with the library, I struggled quite a bit to understand how to select specific columns, rows, and values. I went through the official documentation, but found it quite difficult to navigate.

The sheer amount of data was also a tad overwhelming, and I found it really helpful to practice in a separate REPL with a smaller dataset. The challenges leading up to the larger project will be invaluable to make the library more approachable.

I'd like to also point out that the pandas library was updated just recently, and it seems the update was quite important. This might have an impact and I'll gladly add more feedback as I learn more about it.

@borntofrappe Thanks for your comments. I made the changes you recommended. The plan is that the projects leading up to this one (that we still have to develop) will prepare people for this one.

I didn't realize that pandas was just updated. If you learn anything that we should include, please share!

Thanks for the project. I learned quite a lot about Pandas. 👍

I believe there is an error in one of the test cases.
In test_module.py:9

actual = self.data['race_count'].values.tolist().sort(reverse = True)

The README asks for race_count to be a list of integers. I don't think a list of integers would have a property called values. Also, the sort method returns None, so I believe assigning actual to the result of sort is a mistake.
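Both points can be checked in a few lines. The sketch below uses a made-up Series with toy counts (an assumption, not the real dataset) to show that list.sort() sorts in place and returns None, while sorted() returns a new list.

```python
import pandas as pd

# Made-up counts standing in for the real race_count (an assumption).
counts = pd.Series([1, 3, 2], index=["Black", "White", "Other"])

# list.sort() sorts in place and returns None, so this assigns None.
in_place = counts.values.tolist().sort(reverse=True)
assert in_place is None

# sorted() returns a new sorted list, which is what the test needs.
as_sorted = sorted(counts.values.tolist(), reverse=True)
assert as_sorted == [3, 2, 1]

# A plain list of integers has no .values attribute at all.
plain_list = [1, 3, 2]
assert not hasattr(plain_list, "values")
```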

@adamdune Thanks for catching that problem. I updated the instructions, test, and solution code. It no longer asks for a list of integers. Now the instruction is "How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)" Do you think that is clear enough?
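As a rough illustration of what "a Pandas series with race names as the index labels" means, here is a sketch on made-up toy data. Using value_counts() is one way to build such a Series; the thread does not say which method the solution code uses.

```python
import pandas as pd

# Made-up toy data with only the race column (an assumption).
df = pd.DataFrame({"race": ["White", "Black", "White", "Other", "White"]})

# value_counts() returns a Series whose index holds the distinct race names.
race_count = df["race"].value_counts()

print(type(race_count).__name__)  # Series
print(race_count["White"])        # count looked up by its race label
```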

@beaucarnes I suppose it should be clear enough for the campers as long as they have an understanding of Pandas. 😄

Very nice project. I had some idea about Pandas but this project helped me to brush up my skills. Here is the link to my solution.
I installed Jupyter on my local dev environment to help me finish the project. It's the first time I've used it, and it's really awesome!

[Image: Screen Shot 2020-06-15 at 5 25 01 PM]

Hey there... I was working on this project, then spent some days without using it, and now that I've decided to finish it I'm facing this problem, which doesn't allow me to try my answer. Does anyone have any idea how to overcome this?

Hi @robertue1, thanks for reporting this.

I see a lot of 404 errors in the dev console. Installing packages is also an issue in new projects. I think this is a problem with repl.it, unfortunately.

Please feel free to download your current progress and continue coding locally until they're able to resolve the issue.

Had fun completing this project. One thing that could be added is the level to which rounding has to be done. However, it can be checked from the test files.

Link : https://repl.it/@ManshulArora/demographic-data-analyzer#demographic_data_analyzer.py

I've revisited the project in its most recent version, and the instructions are much clearer.

The only thing I found strange is that the README adds the title of the project in the Assignment section

### Assignment

# Demographic Data Analyzer

This is the first time the title is actually included in a Python project. It might be worth considering a common structure.

For the rounding, the most recent README includes the instruction at the end of the Assignment section. You might be referring to a previous version @manshul1807

Use the starter code in the file demographic_data_anaylizer. Update the code so all variables set to "None" are set to the appropriate calculation or code. Round all decimals to the nearest tenth.

I saw an error in the prototype. The last block of code in demographic_data_analyzer.py should read

return {
    'race_count': race_count,
    'average_age_men': average_age_men,
    'percentage_bachelors': percentage_bachelors,
    'higher_education_rich': higher_education_rich,
    'lower_education_rich': lower_education_rich,
    'min_work_hours': min_work_hours,
    'rich_percentage': rich_percentage,
    'highest_earning_country': highest_earning_country,
    'highest_earning_country_percentage': highest_earning_country_percentage,
    'top_IN_occupation': top_IN_occupation
}

You forgot a couple items.

Hello guys!
I want to start off by saying thank you and great job!
I have a question I would like to ask, which is related to the code below:
average_age_men = round(df.loc[df['sex'] == 'Male', 'age'].mean(), 1)

What does the number 1 stand for in this code? Furthermore, since I'm new to coding, I am having difficulties knowing when to use () or [], so if you guys can give some advice I would really appreciate it.
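For anyone with the same question, here is the line above unpacked step by step on made-up toy data (the DataFrame and the intermediate variable names are illustrative assumptions). In short: the second argument to round() is the number of decimal places to keep, square brackets select data, and parentheses call a function or method.

```python
import pandas as pd

# Made-up toy data with the two columns used in the question.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male"],
    "age": [39, 50, 38, 29],
})

# [] selects: df["sex"] picks a column; comparing it gives True/False per row.
is_male = df["sex"] == "Male"

# .loc[rows, column] also uses [] because it is selecting, not calling.
male_ages = df.loc[is_male, "age"]

# () calls: .mean() runs the method and returns a number.
mean_age = male_ages.mean()

# round(value, 1) keeps 1 digit after the decimal point (the nearest tenth).
average_age_men = round(mean_age, 1)

print(average_age_men)  # 35.3
```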

Hello everyone,
I'm trying to learn from the freeCodeCamp data analyst course.
There is an error in my code for the Demographic Data Analyzer project. I would be grateful if someone could help resolve it. Try running it; the link is below:
https://replit.com/@LavishSinghal/boilerplate-demographic-data-analyzer-1#demographic_data_analyzer.py
Thank you

Hi @FlamerJay and @lavish2801.

For help with the coding projects or challenges, please check out the forum at https://forum.freecodecamp.org. The contributors there are very active, and can help answer any questions you might have.