A beginner's introduction to the field of Data Science.

MIT License

Introduction to Data Science

What Is Data Science?

Overview

Data Science is an emerging interdisciplinary field whose goal is to extract knowledge and enable discovery from complex data using a fusion of computation, mathematics, statistics, and machine learning. Datasets are varied; examples include maps of the universe, MRI images, human genomes, medical records, stock market transactions, educational data, historical texts, infrastructure systems, and website clickstream data. Over the coming decades, Data Science is expected to transform the landscape of basic and applied research in the sciences, social sciences, arts, humanities, and engineering as well as impact all sectors of the economy from health care to education, government, transportation, finance, manufacturing, construction, and urban planning. Data Science has the potential to improve individual and community health and education, develop smart communities that enable efficient circulation of people, goods, and services, enable informed decision making in public and private sectors, and enhance environmental sustainability and overall quality of life. Given the wide range of applications and potential benefits, the powerful tools and techniques of Data Science must be used ethically and responsibly.

What Does a Data Scientist Do?

Image: Drew Conway. License: Attribution-NonCommercial

In their book Doing Data Science, authors Cathy O'Neil and Rachel Schutt propose the following:

A data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time in the process of collecting, cleaning, and munging data, because data is never clean. This process requires persistence, statistics, and software engineering skills—skills that are also necessary for understanding biases in the data, and for debugging logging output from code.

Once she gets the data into shape, a crucial part is exploratory data analysis, which combines visualization and data sense. She’ll find patterns, build models, and algorithms—some with the intention of understanding product usage and the overall health of the product, and others to serve as prototypes that ultimately get baked back into the product. She may design experiments, and she is a critical part of data-driven decision making. She’ll communicate with team members, engineers, and leadership in clear language and with data visualizations so that even if her colleagues are not immersed in the data themselves, they will understand the implications.
O’Neil and Schutt, 2013 (as cited by the University of Wisconsin)
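
To make the "cleaning, then exploring" workflow described above a little more concrete, here is a minimal sketch in Python using only the standard library. The dataset, column names, and helper functions (`clean_rows`, `mean_by_city`) are all made up for illustration; real projects typically use larger data and libraries such as pandas, and the right cleaning rules depend on the data at hand.

```python
import csv
import io
import statistics

# Hypothetical raw data, invented for this example. It has typical flaws:
# inconsistent capitalization, stray whitespace, and missing or invalid values.
RAW_CSV = """city,temperature_c
Nashville,21.5
nashville ,23.0
Memphis,
Memphis,25.5
Knoxville,n/a
Knoxville,19.0
"""


def clean_rows(raw_text):
    """Normalize city names and drop rows without a numeric temperature."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        city = row["city"].strip().title()
        try:
            temperature = float(row["temperature_c"])
        except (TypeError, ValueError):
            continue  # skip rows whose temperature is blank or not a number
        cleaned.append({"city": city, "temperature_c": temperature})
    return cleaned


def mean_by_city(rows):
    """Group readings by city and compute the mean temperature for each."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["city"], []).append(row["temperature_c"])
    return {city: statistics.mean(values) for city, values in grouped.items()}


rows = clean_rows(RAW_CSV)
summary = mean_by_city(rows)

print(len(rows), "usable rows")
for city, mean_temp in sorted(summary.items()):
    print(f"{city}: {mean_temp:.2f}")
```

Dropping incomplete rows is only one possible choice; depending on the question being asked, it may be better to fill in missing values or to investigate why they are missing in the first place.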

Resources

Books

  • Wickham, Hadley and Garrett Grolemund. R for Data Science.

    • Those with an O'Reilly Media account (including all Vanderbilt students) can access the book here. The book also has a freely available, open-source webpage here.
    • This book is a good introduction to the R programming language in a Data Science context. It is a helpful resource for both beginners to R and beginners to data science.
  • Bruce, Peter and Andrew Bruce. Practical Statistics for Data Scientists, 2nd Edition.

    • Those with an O'Reilly Media account (including all Vanderbilt students) can access the book here.
    • This book introduces the key statistical concepts and tools that every data scientist should understand.
  • Müller, Andreas C. and Sarah Guido. Introduction to Machine Learning with Python.

    • Those with an O'Reilly Media account (including all Vanderbilt students) can access the book here.
    • This book provides an introduction to the fundamental concepts and practices of machine learning with Python. While previous knowledge of Python is not required, those who have some familiarity (especially with NumPy and Matplotlib) will move through it more quickly.
    • A word of warning: the title says "Python," but the cover of the book actually displays an image of an Alleghany Hellbender Salamander, which is NOT a python, or even a reptile. At this time, we are unaware of any programming language called "Alleghany Hellbender Salamander."
  • Nield, Thomas. Essential Math for Data Science.

    • Those with an O'Reilly Media account (including all Vanderbilt students) can access the book here.
    • This book covers some of the most important mathematical concepts that underlie a data scientist's toolkit. While many of the tools used in data science can be applied without this mathematical background, a solid understanding of the mechanics behind the tools is crucial to maximizing their effectiveness.

Videos and Online Lectures

TODO

Workshops

The Data Science Institute often offers workshops on various important topics in data science. See our webpage for details and examples of past workshops.

AI Deep Dives

In these sessions, held on Fridays from 1-2 p.m., a researcher presents a problem to the gathered data scientists and students, and together we explore approaches to solving it using AI tools. See our webpage for details and examples of past topics.

Undergraduate Data Science Minor

Vanderbilt University offers a trans-institutional undergraduate Data Science Minor, spanning the Blair School of Music, the College of Arts and Science, the School of Engineering, and Peabody College. Students are introduced to the fundamentals (computer programming, statistics, machine learning, and visualization) with attention to the ethical considerations of collecting, curating, analyzing, visualizing, and interpreting data. The minor in Data Science prepares students for advanced coursework in statistics and data analysis, scientific computing and simulation, machine learning and visualization, and high performance computing and big data.

Non-Minors

Undergraduate students who wish to study aspects of data science but do not wish to pursue a minor may find this list of data science courses helpful.

Data Science M.S. Degree

The Vanderbilt Master of Science in Data Science is a four-semester graduate program. The curriculum is organized into three sequences: Computation, focusing on programming, data structures, computer systems, and methods; Data Analysis, focusing on data exploration, analysis, prediction, inference, and algorithms; and Practice, focusing on workplace skills, ethical standards, and awareness of data science to date. The Data Science M.S. program prepares students in the skills of computation, statistics, critical thinking, communication, and domain-specific knowledge that are now essential in a wide variety of quantitative, computational, and scientific disciplines.