BIOS 823 is designed for people who want to work in a data science team, especially in a healthcare setting. Hence the course focuses on the skills and knowledge needed to contribute to the main roles in a data science team:
- Making data available and accessible for analysis
- Processing data and performing exploratory data analysis
- Statistical inference and machine learning
- Scaling up to big data and deploying automated workflows
You will learn how to use the essential tools for each of these roles effectively. The emphasis is on conceptual understanding, and we will highlight only selected software packages that exemplify each role. Once you understand the concepts, it should be straightforward to pick up similar packages on your own, and you will have the opportunity to do so in projects. Applications of data science to the analysis of structured, semi-structured and unstructured data, especially from biomedical contexts, will be interleaved throughout the course.
Prerequisites
- You have a background in linear algebra
- You have a background in basic statistical inference
- You have a background in basic machine learning
- You are fluent in Python
- You are familiar with the use of `git` for version control
- You enjoy coding and playing with data
Course repository is at https://github.com/cliburn/bios-823-2021
Due to the wide-ranging nature of the syllabus, there is no course textbook. However, useful books for each major topic will be recommended for those who want to explore it in more depth.
Learning objectives
- Develop competency in data manipulation and analysis
  - Exam
- Demonstrate mastery of individual skills / packages
  - Homework / blogs
- Integration of separate skills / working in teams
  - Final project
- Communication skills / visual storytelling
  - Written - blogs
  - Oral - presentations, class participation
- Build a (biomedical) data science portfolio
  - GitHub repositories
  - Personal landing page
  - Blogs
Grading
- 50% homework
  - Much of the practical self-learning happens through the homework assignments, including exposure to topics and/or alternative packages we do not have time to cover in class. The homework assignments (and final project) will also build up your data science portfolio.
- 20% exams
  - Test basic competency
  - Based on topics covered in lectures
- 20% final project
  - Learn to work as part of a data science team
- 10% class participation
  - Contribute to the intellectual environment
  - Participate in office hours
  - Demonstrate drive to learn and intellectual curiosity
  - Rating of helpfulness by the rest of the class / project team
  - Contributions to peer learning
  - Contributions to team project
Letter grades
- A: 90 - 100
- B: 75 - 89
- C: 60 - 74
- F: 59 and below
Office hours
- Michael Gao michael.gao@duke.edu (TA): ?
- Cliburn Chan cliburn.chan@duke.edu (Instructor): Thursdays 5-6 PM (Room 10050)
- Please email me in advance if you plan to attend my office hours
The basic foundation of data science is the creation of tidy data formats that can then be visualized using a grammar of graphics. We will review data processing and exploratory data visualization, common formats for data sharing, and application programming interfaces (APIs) for the transfer of semi-structured data resources.
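To make the tidy-data idea concrete, here is a minimal sketch (the wide table is invented for illustration; `melt` reshapes it so each row is one observation, which is the form `seaborn`'s grammar-of-graphics interface expects):

```python
import pandas as pd
import seaborn as sns

# A wide table: one column per visit. Convenient for data entry,
# awkward for analysis and plotting.
wide = pd.DataFrame({
    "patient": ["A", "B", "C"],
    "visit_1": [60.1, 75.3, 82.0],
    "visit_2": [59.8, 74.9, 81.2],
})

# Melt to tidy (long) form: one row per (patient, visit, weight) observation.
tidy = wide.melt(id_vars="patient", var_name="visit", value_name="weight")
print(tidy)

# Tidy columns map directly onto plot aesthetics (x, y, hue).
sns.lineplot(data=tidy, x="visit", y="weight", hue="patient")
```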
- Jupyter
- Literate programming
- Magic functions
- Polyglot programming
- Jupyter notebooks and version control
- Programming exercises
- String methods
- String formatting
- Using `re`
- Using `numpy`
- Using `pandas`
- Using `seaborn`
- JSON with `json`
- XML with `ElementTree`
- HDF with `h5py`
- Other formats: `pydata`, `parquet`
- Working with REST APIs (see the sketch below)
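As a taste of the REST + JSON workflow, here is a minimal sketch using the public httpbin.org test service (substitute the API your data actually lives behind); `json_normalize` flattens the semi-structured payload into a tidy table:

```python
import requests       # third-party HTTP client
import pandas as pd

# Query a REST endpoint that returns JSON. httpbin.org is a public
# test service; swap in whatever API you actually need.
resp = requests.get("https://httpbin.org/json", timeout=10)
resp.raise_for_status()        # fail loudly on HTTP errors
payload = resp.json()          # parse the JSON body into Python objects

# The payload nests a list of records under slideshow -> slides;
# json_normalize flattens semi-structured records into a DataFrame.
df = pd.json_normalize(payload["slideshow"]["slides"])
print(df.head())
```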
Most data in healthcare settings still reside in relational databases, especially in enterprise data warehouses that have been created to serve as portals to electronic health records. Hence, an essential skill for a data scientist is the use of SQL, especially for constructing data queries; a short sketch follows the topic list below. We will also examine examples of a few NoSQL databases, and review their comparative advantages and disadvantages relative to relational databases.
- Relational databases are based on set theory
- Tables, rows, columns, keys, relations, constraints
- Database schema and normalization
- CRUD
- SQL queries
- Single table queries
- Window functions
- Joins
- User defined functions
- Common table expressions
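Here is a sketch of these query patterns using the standard library's `sqlite3` module (toy tables, not a real EHR schema; the window function needs SQLite 3.25 or later):

```python
import sqlite3

# Build a tiny in-memory database: two related tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patients (pid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE visits (vid INTEGER PRIMARY KEY,
                     pid INTEGER REFERENCES patients,
                     weight REAL);
INSERT INTO patients VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO visits VALUES (1, 1, 60.5), (2, 1, 61.0), (3, 2, 82.3);
""")

# One query combining a common table expression, a join, and a
# window function: rank each patient's visits by weight.
sql = """
WITH v AS (SELECT pid, weight FROM visits)
SELECT p.name,
       v.weight,
       RANK() OVER (PARTITION BY v.pid ORDER BY v.weight) AS wt_rank
FROM v
JOIN patients AS p ON p.pid = v.pid;
"""
for row in con.execute(sql):
    print(row)
```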
This is mostly for exposure and we will not go into any depth.
- Key-value with `redis`
  - Uses
  - Collections
- Document with `mongodb`
  - Uses
  - Queries
- Graph with `neo4j`
  - Graph concepts with `networkx` (see the sketch below)
  - Uses
  - The Cypher language
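Before the graph database itself, a minimal `networkx` illustration of the basic graph concepts (nodes, edges, degree, paths); the referral network here is made up:

```python
import networkx as nx

# A toy referral network between clinical services.
G = nx.Graph()
G.add_edges_from([
    ("primary care", "cardiology"),
    ("primary care", "endocrinology"),
    ("cardiology", "cardiac surgery"),
    ("endocrinology", "nephrology"),
])

print(G.number_of_nodes(), "nodes;", G.number_of_edges(), "edges")
print("degree of primary care:", G.degree["primary care"])
print("shortest path:",
      nx.shortest_path(G, "primary care", "cardiac surgery"))
```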
Machine learning is an essential skill for any data scientist, so we will review practical issues in classical and deep learning, including feature engineering, imbalanced data, hyper-parameter tuning, and model explainability.
Main package is `scikit-learn`; supplementary package is `yellowbrick` for visualization. A minimal pipeline sketch follows the topic list below.
- Unsupervised learning
- Dimension reduction
- Clustering
- Recommender systems
- Supervised learning
- Basic framework for training and evaluation
- Model families
- Preprocessing and feature engineering
- Hyperparameter tuning with `optuna`
- Model evaluation
- Explaining models
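A minimal sketch of the supervised learning workflow in `scikit-learn` (synthetic data stands in for a real biomedical feature matrix; the pipeline-plus-cross-validation pattern is the point, not the particular model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Preprocessing and model chained in a Pipeline, so scaling is re-fit
# inside each cross-validation fold (no information leakage).
clf = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validated accuracy: the basic train/evaluate framework.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean().round(3), "+/-", scores.std().round(3))
```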
Main package is `tensorflow` with `keras`; a short sketch follows the topic list below.
- Neural network concepts
- Working with data sets
- Effective training
- Explainable AI (XAI) with `shap`
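A tiny sketch of the define/compile/fit cycle in `keras` (random arrays stand in for a real data set; the toy problem is incidental):

```python
import numpy as np
import tensorflow as tf

# Random data standing in for a real data set.
X = np.random.rand(256, 8).astype("float32")
y = (X.sum(axis=1) > 4).astype("float32")   # arbitrary binary target

# Define a small fully connected network.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile (choose optimizer, loss, metrics), then fit.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]
```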
Finally, we consider three aspects of data engineering for "big data": construction of pipelines using functional approaches, scaling of algorithms for performance, and the automation of workflows.
Main packages are `itertools`, `functools`, and `toolz`; a small pipeline sketch follows the topic list below.
- Recursion
- Lazy evaluation and iterators
- Comprehensions and generators
- Anonymous functions
- Partial application and currying
- Higher order functions: map, filter, reduce, decorators
- Pipes and chaining operations
- A digression: `coconut`
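A small sketch of a lazy pipeline built from these pieces, using `toolz.pipe` with curried `map` and `filter` (the "readings" are invented):

```python
from functools import reduce
from toolz import pipe
from toolz.curried import filter, map  # curried versions for pipelines

readings = range(1, 11)  # stands in for a lazy stream of measurements

# Compose steps left to right: keep even values, square them, sum.
total = pipe(
    readings,
    filter(lambda x: x % 2 == 0),  # lazy: nothing computed yet
    map(lambda x: x * x),          # still lazy
    sum,                           # forces evaluation
)
print(total)  # 220

# The same computation with reduce and a generator expression.
print(reduce(lambda acc, x: acc + x * x,
             (x for x in readings if x % 2 == 0), 0))  # 220
```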
Main package is `spark`; a short sketch follows the topic list below.
- Concurrent, parallel, distributed
- Simple multi-core processing
- Spark concepts
- Spark SQL
- Spark ML
- Spark Streaming
- GraphFrames
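A minimal `pyspark` sketch showing the same aggregation in the DataFrame API and in Spark SQL (a local session, so no cluster is needed):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session for experimentation.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("alice", 30)],
    schema=["name", "age"],
)

# DataFrame API: group and aggregate.
df.groupBy("name").agg(F.avg("age").alias("mean_age")).show()

# The equivalent query via Spark SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, AVG(age) AS mean_age FROM people GROUP BY name").show()

spark.stop()
```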
- Literate programming with `Jupyter` and `nbconvert`
- Source code version control with `git`
- Testing code with `pytest`
- Continuous integration with `GitHub Actions`
- Using containers with `Docker`
- Automating complex workflows
- Creating APIs for analysis with `fastapi` (see the sketch below)
- Life in the cloud
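To make the analysis-API idea concrete, a minimal `fastapi` sketch (the `/mean` endpoint and its input model are invented for illustration; the `list[float]` annotation needs Python 3.9+):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Measurements(BaseModel):
    values: list[float]   # hypothetical input for an analysis endpoint

@app.post("/mean")
def mean(m: Measurements) -> dict:
    # Stand-in for a real analysis: return the sample mean.
    return {"mean": sum(m.values) / len(m.values)}
```

Run it locally with `uvicorn app:app --reload` (assuming the code is saved as `app.py`) and POST JSON such as `{"values": [1, 2, 3]}` to `/mean`.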