/Tetiana

Forked from the OSDSM repo and inspired by Tetiana Ivanova's talk at PyData London 2016. Watch the vid of her talk!

The Unlicense

This is an example of accelerated agile learning or LEAN education by "jumping in at the deep end," "swimming toward the goal" and doing it all in a competent enough fashion to develop good habits, survive and repeat over and over ... transforming your base of knowledge continually throughout life ... a vastly better way to LEARN than standing on the sidelines, racking up debt and carping about how much time a university education takes.

The general method for LEAN education is rather tough, but it is very simple:

  1. Determine the requirements for an "MVP" self-educational product: closing as much ground in your technical knowledge gap as rapidly as possible by eliminating the non-value-added activities;
  2. Develop your own immersive curriculum with the aim of learning enough to fill the need [or, at least, "fake it"] in six months.
  3. Work your plan in agile fashion every single day while adjusting tactically according to a measured PDCA discipline to develop mental toughness, leadership and your own LEAN education habits.
  4. Stick with your general strategic plan for six months before re-evaluating strategically.
  5. Fork the process and do it better as you repeat steps 1-4.

If you don't like the method, re-invent it for yourself. If you LOVE this method, re-invent it for yourself.

Tetiana's Manifesto

The impetus for this project was a simple and absolutely compelling hour-long talk delivered by Tetiana Ivanova at PyData London 2016: How to Become a Data Scientist in 6 Months -- A Hacker’s Approach to Career Planning. What is so thoroughly compelling about Tetiana's Manifesto is that the principles do not just apply to Python or Data Science -- these same principles will apply to lifelong learning and the career transitions that knowledge professionals can expect for the next century or more. It is an absolutely BRILLIANT Manifesto -- if you watch the video of Tetiana's talk, I guarantee that you will find it thoroughly compelling, in many different dimensions. {HINT: The same process of subverting the educational process is doubly applicable for anyone in a Ph.D. program or any other traditional academic program of study. Given the cost of an academic program, being coddled and held back by it is nothing more than a more expensive way of wasting your time. If you are going to keep toiling away in any academic "education factory," at least LEARN TO SEE WASTE.}

This particular repo is a work-in-progress [as almost all repos tend to be]. It was originally forked from the excellent open-source data science masters (OSDSM) curriculum GitHub repository. The OSDSM breaks down the core competencies necessary to making use of data. The original OSDSM is pretty awesome, although it will not be perfect for you -- that is the POINT. YOU will learn more by developing your own curriculum than you can possibly learn by just following someone else's curriculum. It is important not to start from scratch -- it is up to YOU to fork your own repo, aggressively seek ways to compare it against the best alternatives, find ways to accelerate the improvement of your curriculum and work assiduously on your particular career situation and knowledge gaps, steadily making your repo more perfect and a better starting point for someone else.

Make the internet and everything about the world into your oyster -- USE THE DIFFICULTIES -- get tougher by working to perfect your ability to gather intelligence, adjust, learn how to learn and flank your opponent

Free and open source software, "being social" in the GitHub community, leading MeetUps, hosting Slack channels, participating in hackathons, building reputation on fora like StackOverflow or Quora, developing your own YouTube or Vimeo MOOC playlists, editing Wikipedia entries, writing blogs and publishing Amazon ebooks, participating in open source projects with code, documentation, wikis ... all are free and open to people who want to contribute and lead, requiring nothing but ambition and your LEADERSHIP as you step forward, proving your merit to contemporaries with motivations similar to yours. There are also plenty of high-quality, affordable training/content providers such as Udacity or O'Reilly's Safari -- there has never been a better time for you to participate in leading the disruption and transformation of education. What excuse do you have for not taking better advantage of accelerated, non-traditional educational alternatives?

The Motivation

Right NOW, our world needs a LOT more Data Scientists ... that probably will not be true if you waste your next decade or so getting a bachelor's, then a master's degree, then a Ph.D. The NEED is right now or maybe 6 months or a year from now -- but the need will change, just as the needs of industry will change. The nasty secret of academia is that the pace of curriculum development is too slow AND is not responsive enough for the needs of the work that must be done ... and even if society needs people whom we call 'data scientists' in five or ten years, the skills that those professionals will need to bring to the workplace will be far different from the skills that academicians believe are required today. This has always been true to a small degree, but it is orders of magnitude more pronounced for professions like data science that are fundamentally about revolutionizing the way that humans solve new problems and invent new industries.

...by 2018 the United States will experience a shortage of 190,000 skilled data scientists, and 1.5 million managers and analysts capable of reaping actionable insights from the big data deluge. THIS is a great statistic ... but even if traditional education can develop impressive-sounding degree programs, all of academia is still far too hidebound and bureaucratic -- it is certainly far, far too unresponsive to adapt to exactly what SKILLS [especially in the realm of rapidly evolving social skills such as online community building and social networking] will be required by data professionals in 2018.

-- McKinsey Report Highlights the Impending Data Scientist Shortage 23 July 2013

There are little to no Data Scientists with 5 years experience, because the job simply did not exist.

-- David Hardtke "How To Hire A Data Scientist" 13 Nov 2012

An Academic Shortfall

Classic academic conduits cannot provide Data Scientists -- this talent gap will be closed differently, and different adaptations will make sense in different situations, but it will always be true that the closure will happen in the most expeditious, the most accelerated fashion when individuals seize responsibility for closing their own talent gaps. People who need to have their progress affirmed by experts or need to behave as "teacher's pets" cannot close this talent gap ... they can only regurgitate what they learn. Genuine learning is not comfortable or entertaining; actually learning is HARD ... it is necessary to actually struggle with problems and "build learning muscles," stretching abilities in difficult attempts to close the gap in knowledge. Most of the academic learning process is about racking up student loan debt while occupying space as a spectator on the sidelines of genuine learning.

Academic credentials are important but not necessary for high-quality data science. The core aptitudes – curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor, skeptical nature – that distinguish the best data scientists are widely distributed throughout the population.

We’re likely to see more uncredentialed, inexperienced individuals try their hands at data science, bootstrapping their skills on the open-source ecosystem and using the diversity of modeling tools available. Just as data-science platforms and tools are proliferating through the magic of open source, big data’s data-scientist pool will as well.

And there’s yet another trend that will alleviate any talent gap: the democratization of data science. While I agree wholeheartedly with Raden’s statement that “the crème-de-la-crème of data scientists will fill roles in academia, technology vendors, Wall Street, research and government,” I think he’s understating the extent to which autodidacts – the self-taught, uncredentialed, data-passionate people – will come to play a significant role in many organizations’ data science initiatives.

-- James Kobielus, Closing the Talent Gap 17 Jan 2013

Ready?


The Open Source Data Science Curriculum

Your skills in studying and learning how to learn will resemble the skills of those you train with. If you want to be a data scientist, you have to begin thinking like a data scientist when it comes to approaching problems -- seek out different perspectives on institutionalizing your own culture of data science in your life.

Dig in and get started; understand that you will be in "over your head," but get started. Don't worry if some of the topics seem beyond you at first -- the material will make sense as you work with it. Get started, keep going, be patient with your ability to pick up the material, but just stay at it.

Intro to Data Science UW / Coursera

  • Topics: Python NLP on Twitter API, Distributed Computing Paradigm, MapReduce/Hadoop & Pig Script, SQL/NoSQL, Relational Algebra, Experiment design, Statistics, Graphs, Amazon EC2, Visualization.

Data Science / Harvard Video Archive & Course

  • Topics: Data wrangling, data management, exploratory data analysis to generate hypotheses and intuition, prediction based on statistical methods such as regression and classification, communication of results through visualization, stories, and summaries.

Data Science with Open Source Tools Book $27

  • Topics: Visualizing Data, Estimation, Models from Scaling Arguments, Arguments from Probability Models, What you Really Need to Know about Classical Statistics, Data Mining, Clustering, PCA, Map/Reduce, Predictive Analytics
  • Example Code in: R, Python, Sage, C, Gnu Scientific Library

A Note About Direction

This is an introduction geared toward those with at least a minimum understanding of programming, and (perhaps obviously) an interest in the components of Data Science (like statistics and distributed computing). Out of personal preference and need for focus, I geared the original curriculum toward Python tools and resources. R resources can be found here.

Math

★ What are some good resources for learning about numerical analysis? / Quora (http://www.quora.com/What-are-some-good-resources-for-learning-about-numerical-analysis)

Computing

Get your environment up and running with the Data Science Toolbox

  • Programming

  • Take advantage of the fact that the fastest way to learn anything is to tutor someone else ... or to pair-program and tutor each other. Python Tutor is a visualizer/execution window in a web browser for understanding and sharing what happens as the computer executes each line of a program's source code in Python, Java, JavaScript, TypeScript, Ruby, C, and C++ ... no substitute for really learning a language, of course ... but a great supplement to the textbooks, lecture notes, online programming tutorials, fora, etc.

  • Algorithms

  • Algorithms Design & Analysis I Stanford / Coursera

  • Algorithm Design, Kleinberg & Tardos Book $125

  • Distributed Computing Paradigms

  • *See Intro to Data Science UW / Lectures on MapReduce

  • Intro to Hadoop and MapReduce Cloudera / Udacity Course *includes select free excerpts of Hadoop: The Definitive Guide Book $29

  • Databases

  • Introduction to Databases Stanford / Online Course

  • SQL School Mode Analytics / Tutorials

  • SQL Tutorials SQLZOO / Tutorials

  • Data Mining

  • Mining Massive Data Sets / Stanford Coursera & Digital & Book $58

  • Mining The Social Web Book $30

  • Introduction to Information Retrieval / Stanford Digital & Book $56
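
If the MapReduce paradigm listed under Distributed Computing Paradigms above is new to you, it may help to see the idea on a single machine before touching Hadoop. The following is only a minimal sketch in plain Python, not how Hadoop itself is implemented: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of strings."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, purely for illustration.
docs = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(docs)
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the three-step shape is the part worth internalizing.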

OSDSM Specialization: Web Scraping & Crawling

  • Machine Learning

Foundational & Theoretical

Practical

Data Design

  • Visualization

Data Visualization and Communication

Theoretical Design of Information

Applied Design of Information

Theoretical Courses / Design & Visualization

Practical Visualization Resources

OSDSM Specialization: Data Journalism

Python (Learning)

Python (Libraries)

Installing Basic Packages Python, virtualenv, NumPy, SciPy, matplotlib and IPython & Using Python Scientifically

Command Line Install Script for Scientific Python Packages

More Libraries can be found in the "awesome machine learning" repo & in related specializations

  • Data Structures & Analysis Packages

  • Machine Learning Packages

  • Networks Packages

  • Statistical Packages

    • PyMC - Bayesian Inference & Markov Chain Monte Carlo sampling toolkit
    • Statsmodels - Python module that allows users to explore data, estimate statistical models, and perform statistical tests
    • PyMVPA - Multivariate Pattern Analysis in Python
  • Natural Language Processing & Understanding

    • NLTK - Natural Language Toolkit
    • Gensim - Python library for topic modeling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.
  • Data APIs

    • twython - Python wrapper for the Twitter API
  • Visualization Packages

    • matplotlib - well-integrated with analysis and data manipulation packages like numpy and pandas
    • Seaborn - a high-level statistical visualization package built on top of matplotlib
  • IPython Data Science Notebooks

  • Data Science in IPython Notebooks (Linear Regression, Logistic Regression, Random Forests, K-Means Clustering)

  • A Gallery of Interesting IPython Notebooks - Pandas for Data Analysis
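
Before opening the notebooks above, it can be worth working through linear regression once by hand. This is a rough sketch of ordinary least squares for a single feature in plain Python, with no libraries; the fit_line helper and the toy data are assumptions made up for illustration, and the notebooks themselves may well take a different approach (for example, using a library).

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Once this makes sense, the library versions listed above are doing the same thing with more features, better numerics and diagnostics.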

Datasets are now here

R resources are now here

Data Science as a Profession

  • Doing Data Science: Straight Talk from the Frontline O'Reilly / Book $25
  • The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists Book $22

Capstone Project


Resources

Read

Watch

Learn


Notation

Non-Open-Source books, courses, and resources are noted with $.

Contribute

Please Contribute -- this is Open Source!

Follow me on Twitter @clarecorthell