
This is a collaborative attempt to define what belongs in a data science curriculum in order to advance the field. To contribute, fork this repo and submit a pull request, or open an issue.


Utilities: Shell and POSIX

  • Pipes and directing output
  • Essential utilities
    • Explore (head, tail, more, less, grep)
    • Transform (sed, awk, cut, tr, sort, join)
    • Schedule (cron, watch)
    • Visualize (gnuplot)
  • Regular Expressions
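
As a rough illustration of the explore-and-transform idea combined with regular expressions, the sketch below filters and splits lines the way a `grep` piped into `cut` might. The log lines, the pattern, and the `grep_errors` helper are all made up for this example.

```python
import re

# Hypothetical log lines, invented for illustration only.
LOG_LINES = [
    "2014-01-03 ERROR disk full on /dev/sda1",
    "2014-01-03 INFO backup finished",
    "2014-01-04 ERROR timeout talking to db01",
]

# A date, the literal level ERROR, then the rest of the line as the message.
ERROR_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) ERROR (.+)$")


def grep_errors(lines):
    """Return (date, message) pairs for lines that match ERROR_RE."""
    matches = (ERROR_RE.match(line) for line in lines)
    return [m.groups() for m in matches if m]


errors = grep_errors(LOG_LINES)
for date, message in errors:
    print(date, message)
```

On a real shell the same filter would more likely be a one-line `grep` and `cut` (or `awk`) pipeline; the Python version is only meant to make the regular expression itself easy to experiment with.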

Software Engineering

  • Git and version control
  • Data Structures
    • Dictionaries and Hash Tables
    • Trees (binary, balanced, splay, B)
    • Heaps
    • Stacks and Queues
    • Graphs and Networks
    • Sets
  • Algorithms
    • Search (BFS, DFS, A*, Dijkstra's)
    • Sorting (merge, quick, heap, radix)
    • Selection
  • Performance (Asymptotic Analysis, hardware restrictions, indexing, etc.)
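
To make the link between data structures and search concrete, here is a minimal sketch of breadth-first search over a graph stored as a dictionary of adjacency lists, using a queue. The graph and the `bfs_path` name are illustrative assumptions, not part of the curriculum itself.

```python
from collections import deque

# Hypothetical directed graph as an adjacency list (dict of node -> neighbors).
GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}


def bfs_path(graph, start, goal):
    """Return a path with the fewest edges from start to goal, or None."""
    parents = {start: None}   # also serves as the visited set
    queue = deque([start])    # FIFO queue gives breadth-first order
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


print(bfs_path(GRAPH, "a", "e"))
```

Each node and edge is handled at most once, so the running time is O(V + E). That makes this a small example for the asymptotic analysis item as well.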

Data Acquisition

  • HTTP
  • APIs and REST
  • HTML and XML
  • Parsing (CSS and XPath)
  • Web Scraping
  • PDF parsing
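
A small sketch of XPath-style parsing, assuming a well-formed document and using only the limited XPath subset in Python's standard `xml.etree.ElementTree`. The markup and the file paths in it are invented for the example. Real-world HTML is often not well-formed and usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; the paths are made up.
DOC = """<html><body>
<ul>
<li><a href="/data/2013.csv">2013</a></li>
<li><a href="/data/2014.csv">2014</a></li>
<li><a>no link</a></li>
</ul>
</body></html>"""

root = ET.fromstring(DOC)

# ".//a[@href]" selects every <a> element, at any depth, that has an href attribute.
links = [(a.get("href"), a.text) for a in root.findall(".//a[@href]")]

for href, label in links:
    print(label, "->", href)
```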

Statistics and Probability

  • Descriptive statistics (mean, mode, variance, skew, etc.)
  • Estimation (confidence intervals, sampling, etc.)
  • Correlation (covariance, goodness of fit, causation, etc.)
  • Distributions
    • PMF, PDF, CDF, CMF
    • Histograms and Scatterplots
    • Normal, Binomial, Exponential
    • Probability Plot
    • Central Limit Theorem
  • Significance (Hypothesis testing, p-value, ANOVA, etc.)
  • Conditional Probability
    • Bayesian Statistics
    • Random Variables and Conditional Distributions
    • Monte Carlo Methods
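
As one possible illustration of Monte Carlo methods, the sketch below estimates pi by sampling uniform points in the unit square and counting how many fall inside the quarter circle. The function name, sample size, and seed are arbitrary choices for the example.

```python
import math
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of random points inside the unit quarter circle."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # Quarter-circle area / square area = pi / 4.
    return 4.0 * inside / n_samples


estimate = estimate_pi(100_000)
print(estimate, abs(estimate - math.pi))
```

The error shrinks roughly as one over the square root of the sample count, which ties the example back to the estimation and Central Limit Theorem items.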


  • Sampling
  • Feature Preparation
    • Vectorization (binning, bag of words, tf-idf)
    • Selection (automatic and manual)
    • Normalization
    • Regularization and Smoothing
  • Natural Language Processing
    • N-grams
    • Tokenization
    • Sentiment Analysis
    • Information Retrieval
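
The vectorization items can be sketched in a few lines: whitespace tokenization, a bag-of-words count, and a tf-idf weight. This uses one simple, unsmoothed definition (term frequency times log of inverse document frequency). Libraries typically apply their own smoothing and normalization, so treat the formula and helper names here as assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # bag of words for this document
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus.
DOCS = ["the cat sat", "the dog sat", "the cat ran"]
weights = tf_idf(DOCS)
print(weights[1])
```

A term that occurs in every document, such as "the" here, gets weight zero, while a term unique to one document gets the largest weight.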


  • SQL (Postgres, MySQL)
  • NoSQL (document, graph, key-value)
  • Filesystem and Text
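
A minimal SQL sketch follows. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for Postgres or MySQL, so that it runs without a server. The `visits` table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database: a stand-in for a Postgres or MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT)")

# Parameterized inserts; the rows are made-up example data.
conn.executemany(
    "INSERT INTO visits (visitor, page) VALUES (?, ?)",
    [("ann", "/home"), ("bob", "/home"), ("ann", "/about")],
)

# Aggregate: number of visits per visitor, most active first.
rows = conn.execute(
    "SELECT visitor, COUNT(*) AS n FROM visits "
    "GROUP BY visitor ORDER BY n DESC, visitor"
).fetchall()
conn.close()

print(rows)
```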

Data at Scale

  • MapReduce paradigm (Hadoop)
  • Distributed Datastores (HDFS, Cassandra, HBase)
  • Hadoop Ecosystem (Pig, Hive, HBase, Flume, Sqoop, etc.)
  • Real-Time (Spark, Storm, Shark)
  • Distributed Machine Learning
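
The MapReduce paradigm can be imitated on a single machine to show its three phases. The sketch below is a word count in plain Python, not Hadoop code. The function names are illustrative, and a real framework would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "big compute"]))
```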

Machine Learning

  • Unsupervised
    • Clustering (K-means, Hierarchical, etc.)
    • Association Analysis (FP-Growth, etc.)
    • Dimensionality Reduction (PCA, SVD, MDS, etc.)
  • Supervised
    • Classification (Naive Bayes, kNN, Logistic Regression, etc.)
    • Regression (Linear, Polynomial, Tree, etc.)
  • Recommendation
    • Similarity metrics (Jaccard, Pearson, Euclidean, etc.)
    • Item vs. User vs. Content based
    • Limitations (Cold-start problem, preference collection, performance)
  • Optimization (cost functions, hill climbing, simulated annealing, etc.)
  • Anomaly Detection and Time Series Analysis
  • Evaluation
    • Cross Validation
    • ROC plot
    • Bias vs. Variance
    • Recall vs. Precision
    • Bootstrap
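
Two of the evaluation items can be sketched without any library: precision and recall computed from binary labels, and a simple k-fold split of row indices for cross validation. The labels, predictions, and helper names are invented for the example. The fold assignment shown (every k-th index) is only one of several reasonable schemes.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists; fold i holds rows i, i + k, i + 2k, ..."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Hypothetical labels and predictions.
Y_TRUE = [1, 1, 1, 0, 0, 0, 1, 0]
Y_PRED = [1, 0, 1, 1, 0, 1, 1, 0]

precision, recall = precision_recall(Y_TRUE, Y_PRED)
splits = list(k_fold_indices(6, 3))
print(precision, recall)
print(splits[0])
```

Every row appears in exactly one test fold, so each observation is used for evaluation once and for training k - 1 times.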

Visualize and Present

  • Grammar of Graphics (ggplot2, Bokeh)
  • Interactivity (JavaScript, HTML, D3.js, CSS)
  • Geographic display (i.e. maps)
  • Charts, plots, and layout (Visual Display of Quantitative Information)