Course ReadMe

To access course notebooks in Binder (temporary coding environment):

Note: These notebooks are available only during class time.

To access and download course materials, including notebooks, at any time:

Day 1: Using Python for advanced data journalism

  • Introduction to course outline and set-up (GitHub, Jupyter notebooks)
  • Overview of useful Python libraries and capabilities for journalists
    • Automation of data cleaning, wrangling and analysis
    • Statistical analysis
    • Visualization
    • ML/AI
    • APIs
    • Other uses
  • Data analysis with the Pandas library
  • Introduction of course group projects

Resources

Day 2: More with Pandas

  • Advanced analysis (grouping, functions, and more)
  • Merging and joining datasets
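
As a rough illustration of the Day 2 topics, the sketch below groups a table, joins it to a second table, and computes a rate. The data, column names (`district`, `complaints`, `population`) and variable names are all invented for this example and are not taken from the course notebooks.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
complaints = pd.DataFrame({
    "district": ["North", "North", "South", "South", "East"],
    "complaints": [12, 8, 5, 7, 3],
})
population = pd.DataFrame({
    "district": ["North", "South", "West"],
    "population": [10000, 6000, 4000],
})

# Grouping: total complaints per district
totals = complaints.groupby("district", as_index=False)["complaints"].sum()

# Merging: a left join keeps every district found in `totals`;
# indicator=True adds a `_merge` column that flags rows with no match
merged = totals.merge(population, on="district", how="left", indicator=True)

# Derived column: complaints per 1,000 residents (NaN where population is missing)
merged["per_1000"] = merged["complaints"] * 1000 / merged["population"]

print(merged)
```

Checking the `_merge` column after a join is a quick way to spot districts that failed to match, which is often a sign of inconsistent spelling in the key column.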

Day 3: Cleaning data and setting up automation

  • How to diagnose dirty data
  • Reformatting and cleaning dirty data
  • Automating data pipelines
  • Documenting your steps for replication
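
One possible way to combine the Day 3 ideas is to put every cleaning step in a single commented function, so the same steps can be rerun on new data. This is a minimal sketch under assumptions: the messy table, its columns (`City `, `Amount`) and the `clean` function are hypothetical and not part of the course materials.

```python
import pandas as pd

# Hypothetical messy data, invented for illustration only
raw = pd.DataFrame({
    "City ": [" beijing", "Shanghai ", "BEIJING", None],
    "Amount": ["1,200", "950", "n/a", "300"],
})

def clean(df):
    """Return a cleaned copy; each step is numbered so it can be replicated."""
    out = df.copy()
    # 1. Standardize column names (strip stray spaces, lowercase)
    out.columns = out.columns.str.strip().str.lower()
    # 2. Trim whitespace and normalize capitalization in the city column
    out["city"] = out["city"].str.strip().str.title()
    # 3. Remove thousands separators and convert to numbers;
    #    values that cannot be parsed (such as "n/a") become NaN
    out["amount"] = pd.to_numeric(
        out["amount"].str.replace(",", "", regex=False), errors="coerce"
    )
    # 4. Drop rows that have no city at all
    out = out.dropna(subset=["city"]).reset_index(drop=True)
    return out

# Diagnose first: column types and missing values
print(raw.dtypes)
print(raw.isna().sum())

cleaned = clean(raw)
print(cleaned)
```

Because `clean` works on a copy and returns a new table, the raw data stays untouched and the whole pipeline can be rerun from the top of the notebook.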

Day 4: Statistical analysis

  • Basic statistical concepts for journalists - looking at the relationship between variables
    • Correlation
    • Regression
    • Scatterplots
  • Examples of statistical analysis in journalism
  • Statistical analysis and plots using Python
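
The relationship between two variables can be explored with a correlation coefficient and a simple linear fit. The sketch below assumes pandas and NumPy are available; the numbers and column names (`median_income`, `test_score`) are made up for illustration, and `np.polyfit` is just one of several ways to fit a straight line.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented for illustration only
df = pd.DataFrame({
    "median_income": [30, 40, 50, 60, 70],
    "test_score": [62, 68, 71, 79, 85],
})

# Correlation: Pearson's r, between -1 and 1
r = df["median_income"].corr(df["test_score"])

# Regression: least-squares straight line, test_score = slope * median_income + intercept
slope, intercept = np.polyfit(
    df["median_income"].to_numpy(), df["test_score"].to_numpy(), deg=1
)

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A strong correlation in a small sample like this one does not show that one variable causes the other; a scatterplot of the two columns is a useful check before reporting either number.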

Day 5: Visualization in Python

  • Introducing the Matplotlib library
    • Making different types of charts such as bar charts, line charts and maps
    • Formatting charts with color and text
  • Adding interactivity using Plotly
  • Examples of visualization in journalism
  • Exporting charts for publication
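
As a hedged example of building, formatting and exporting a chart with Matplotlib, the sketch below draws a bar chart from invented numbers and writes it out as a PNG. It writes to an in-memory buffer so it runs anywhere; passing a file name to `savefig` instead would save to disk. The `Agg` backend line is an assumption for running outside a notebook and can be dropped in Jupyter.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt

# Hypothetical values, invented for illustration only
years = [2017, 2018, 2019, 2020]
values = [14, 18, 11, 21]

fig, ax = plt.subplots(figsize=(6, 4))

# Bar chart with a custom color; years are passed as text labels
ax.bar([str(y) for y in years], values, color="#1f77b4")

# Formatting with text
ax.set_title("Hypothetical incidents per year")
ax.set_xlabel("Year")
ax.set_ylabel("Incidents")

# Export at a print-friendly resolution
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=200, bbox_inches="tight")
plt.close(fig)

png_bytes = buf.getvalue()
print(f"Exported {len(png_bytes)} bytes of PNG data")
```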

Course assignment

Assignment for 2020 Class students: You will work in 9 groups, the same groups you were in for the Data Journalism and Visualization course with Prof. Herzog.

You will use the same data as in the previous course, but this time you will clean, prepare, analyze and visualize the data in Python using Jupyter notebooks. You may also bring additional datasets into your project, such as population or income data, to support deeper analysis.

The Python notebook will be graded on:

  • Reproducibility: Make sure you note your steps and what each one does, and that the steps can be reproduced
  • Deeper analysis: Join/merge additional data, create an automated pipeline, reformat/clean the data, do a statistical analysis, or anything that takes your analysis further than the last time
  • Conclusions/reporting questions: What story could you create from this data? What questions would you try to answer?
  • Challenges: List any challenges you overcame with the data

At the end of the class, each student team must submit its work before 5:00 pm (Beijing Time) on Friday, April 14 to the study principal, who will upload all submissions to Prof. Carol Zhang’s Baidu drive. The submission must include your Python notebook with the above components. Group members’ Qualtrics peer evaluation results must be sent to Prof. Carol Zhang by 12:00 am (midnight) on April 14. Anyone who submits the Qualtrics evaluation late will lose 1 point; anyone who submits it after 10:00 am on April 15, or does not submit it at all, will lose 5 points. Ms. Malan, Ms. Yanchen Liu and Dr. Ernest Zhang will grade each group’s work.

More details:

  • Use the same data from your project in the data journalism class
  • Bring in additional data for context, or perform statistical analysis or create visualizations in Python, to take your analysis deeper than before
  • Do all analysis in a Python notebook and turn in the notebook for grading

Other resources