ML-AI-Data

It's important to start out with a single question:

What is Data Literacy?

  • How well we...
    • Read Data
    • Interpret Data
    • Communicate Data

How does data literacy factor in? Part of understanding and communicating with data means asking the right questions so that we end up with useful, relevant data.

This means asking the following questions:

  • Do we have sufficient data to answer the question at hand?
  • Can my data answer my exact question?
  • Who participated in the data?
  • Who is left out? Why?
  • Who made the data?

The most important question: What's the takeaway from the data?

Part of an analyst's job is to provide context and clarification to make sure that audiences are not only reading the correct numbers, but also understanding what they mean.


In order to be a good machine learning programmer, we need to know what statistics is and how it helps us do our job:

What is Statistics?

Story Time: The Importance of Data

The Challenger space shuttle carried seven US astronauts who were supposed to deploy a satellite and study Halley’s Comet while they were in orbit. Less than two minutes after takeoff, however, the shuttle exploded, killing all seven crew members. The explosion was caused by the failure of two O-rings: rubber rings that sealed a joint in one of the shuttle’s solid rocket boosters. Before the launch, engineers were concerned about how the low-temperature forecast would affect the O-rings’ ability to make a proper seal. The engineers made their arguments in favor of postponing the launch using, in part, a series of data visualizations that showed launch success rates at various temperatures. Tragically, their arguments did not prevent the launch from proceeding.


Data tells stories. Sometimes those stories are good, and sometimes they are bad. The difference lies in the results the data gives us: the data we get can either help us tremendously or harm us significantly.

Garbage in, garbage out is a data-world phrase that means “our data-driven conclusions are only as strong, robust, and well-supported as the data behind them.”

For example: we have a lot of data on heart attacks, but there’s room for improvement when it comes to data quality. Heart disease is the leading cause of death in women, but as of 2021, women account for only 38% of participants in relevant research studies.

There are key differences between men’s and women’s heart attacks that impact how they’re treated, but our data doesn’t yet adequately outline those differences. This leads ultimately to worse outcomes in treatment and a higher post-heart attack mortality rate for women.

The Objective of ML

To teach computers to learn patterns from data.

  • What patterns can they see?
  • From those patterns, what guesses can we make?
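As a rough illustration of "learn a pattern, then make a guess," here is a minimal sketch in plain Python. The data (hours studied vs. exam score) is made up for illustration, and the function names `fit_line` and `predict` are hypothetical. It fits a straight line with ordinary least squares, which is only one of many ways a computer can learn a pattern.

```python
# Hypothetical toy data (made up for illustration): hours studied -> exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Learn a slope and intercept from the data using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Use the learned pattern to make a guess for a new input."""
    return slope * x + intercept


# Step 1: learn the pattern from the data.
slope, intercept = fit_line(hours, scores)
# Step 2: use the pattern to guess a value we have never seen.
guess = predict(slope, intercept, 6.0)

print(f"pattern: score = {slope:.1f} * hours + {intercept:.1f}")
print(f"guess for 6 hours: {guess:.1f}")
```

The learned slope and intercept are the "pattern"; calling `predict` on an unseen input is the "guess."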

What is a Model?

A model is a simplified representation of the patterns in data, which we can use to make guesses about new data.


Where does data come from?

When talking about data collection, we need to discuss data ethics. Much of the available data comes from individuals who can be identified from the data collected. The acronym for this kind of data is PII (pie): "Personally Identifiable Information." PII includes:

  • Email Addresses
  • Phone Numbers
  • SSN
  • Credit Card Information
  • Medical Records
  • etc...

We are obligated to protect both the source of the data and the data itself.
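One small way to protect PII is to mask it before data is shared or logged. The sketch below is a simplified illustration, not a complete solution: the patterns only cover one email format, US-style `123-456-7890` phone numbers, and `123-45-6789` SSNs, and the names `redact_pii` and `record` are hypothetical.

```python
import re

# Simplified, assumed patterns; real PII detection needs far more care than this.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def redact_pii(text):
    """Replace anything that looks like an email, SSN, or phone number with a tag."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    text = SSN_PATTERN.sub("[SSN]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text


# Hypothetical record (all values are fake).
record = "Contact jane.doe@example.com or 555-123-4567. SSN: 123-45-6789."
print(redact_pii(record))
```

Masking like this only helps with how data is displayed or shared; it does not replace secure storage or consent.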

Ethical issues regarding data collection

  1. Consent: Individuals must be informed before data can be collected.
  2. Ownership: Individuals always retain ownership of their data.
  3. Privacy: Individuals' information must always be kept secure.
  4. Intention: Individuals must be informed of the intended uses of the data.
    1. Why it is being taken.
    2. How it will be stored.
    3. How it will be used.

Keywords

  • Supervised Learning vs. Unsupervised Learning (unsupervised learning looks for data clusters: groups of data points that are more similar to each other than to the rest of the data)
  • Deep Learning (needs lots of high-quality data)
  • Reinforcement Learning
  • Feature Engineering
  • Kitchen Sink Approach
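To make "data cluster" from the keywords above more concrete, here is a minimal sketch of a simple clustering loop (a one-dimensional version of k-means) in plain Python. The ages are made up, the starting centers are an arbitrary assumption, and `k_means_1d` is a hypothetical name. Notice that the data has no labels: the groups are found from the values alone, which is what makes this unsupervised.

```python
def k_means_1d(values, centers, rounds=10):
    """Assign each value to its nearest center, then move each center to its group's mean."""
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical customer ages with no labels attached (made up for illustration).
ages = [18, 19, 21, 22, 20, 61, 63, 65, 64, 62]

# Assumption: we look for two clusters and start the centers far apart.
centers, clusters = k_means_1d(ages, centers=[0.0, 100.0])

for center, cluster in zip(centers, clusters):
    print(f"cluster around {center:.1f}: {cluster}")
```

A supervised approach would instead start from examples that already carry the answer (for instance, each age tagged "student" or "retiree") and learn to reproduce those labels.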

Cause and Effect / Causal Link

  • Proving that one event CAUSES another.

Ideally, data is created in controlled experiments, but when investigating things in the outside world, controlled environments are hard to come by. Data scientists do the best they can to isolate and control variables, and they get comfortable working with some amount of error.

Data Gaps

The ability to separate good, mediocre, and poor-quality data is a crucial data literacy skill. "Garbage in, garbage out": data-driven conclusions are only as strong, robust, and well-supported as the data behind them.

Garbage in, garbage out

The quality of the predictions made during a predictive analysis is deeply dependent on the quality of the data used to generate the predictions.

For example, if a model is trained with mislabeled data, it will produce inaccurate predictions no matter how good the actual algorithm is. This is commonly referred to as “garbage in, garbage out.”
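The mislabeled-data example can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the temperature readings are invented, and the "model" is a deliberately tiny nearest-mean classifier with hypothetical function names. The same algorithm is trained twice, once on clean labels and once on labels where four of the six have been swapped.

```python
def train_nearest_mean(examples):
    """Learn the mean feature value for each label."""
    sums, counts = {}, {}
    for value, label in examples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(means, value):
    """Pick the label whose learned mean is closest to the value."""
    return min(means, key=lambda label: abs(value - means[label]))


def accuracy(means, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for value, label in examples if predict(means, value) == label)
    return correct / len(examples)


# Hypothetical temperature readings labeled "cold" or "hot" (made up for illustration).
clean_train = [(2, "cold"), (4, "cold"), (6, "cold"), (30, "hot"), (32, "hot"), (34, "hot")]
# The same readings, but four of the six labels have been swapped.
garbage_train = [(2, "hot"), (4, "hot"), (6, "cold"), (30, "cold"), (32, "cold"), (34, "hot")]
# Correctly labeled examples the model has not seen.
test_set = [(3, "cold"), (5, "cold"), (31, "hot"), (33, "hot")]

clean_means = train_nearest_mean(clean_train)
garbage_means = train_nearest_mean(garbage_train)

print(f"trained on clean labels:   {accuracy(clean_means, test_set):.0%} correct")
print(f"trained on garbage labels: {accuracy(garbage_means, test_set):.0%} correct")
```

The algorithm is identical in both runs; only the quality of the training labels changes.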

Addressing Bias

Bias in data collection leads to poorer quality data. Recognizing bias in data is a crucial data literacy skill. Some key questions about bias include: “Who made the data?” “Who participated in the data?” “Who is left out of the data?”
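One simple way to start answering "Who is left out of the data?" is to compare each group's share of the participants with the share we would expect. The sketch below is only an illustration: the participant list is invented to mirror the 38% figure cited earlier, the 50/50 expected split is an assumption, and `representation_gaps` is a hypothetical name.

```python
from collections import Counter


def representation_gaps(participants, expected_shares):
    """Return each group's share among participants minus its expected share."""
    counts = Counter(participants)
    total = len(participants)
    return {
        group: counts[group] / total - expected
        for group, expected in expected_shares.items()
    }


# Hypothetical study of 100 participants (made up to mirror the 38% figure).
participants = ["women"] * 38 + ["men"] * 62
# Assumption: both groups should make up about half of the participants.
expected_shares = {"women": 0.5, "men": 0.5}

gaps = representation_gaps(participants, expected_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.0%} relative to expected share")
```

A negative gap flags a group that is underrepresented relative to the assumed baseline; choosing that baseline is itself a judgment call that deserves scrutiny.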

What is Statistics?

Statistics helps us measure whether an event happened by chance or because of a systematic factor or factors.
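As a small worked example of "chance or a systematic factor," suppose a coin lands heads 60 times in 100 flips. The sketch below computes the exact probability that a fair coin would do at least that well by chance alone. The scenario and the function name `chance_of_at_least` are assumptions made for illustration.

```python
from math import comb


def chance_of_at_least(heads, flips):
    """Exact probability that a fair coin shows at least `heads` heads in `flips` flips."""
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips


# Hypothetical observation: 60 heads in 100 flips.
p_value = chance_of_at_least(60, 100)
print(f"P(at least 60 heads in 100 fair flips) = {p_value:.4f}")
```

The probability comes out to roughly 3%, so chance alone rarely produces a result this lopsided. That does not prove the coin is unfair, but it gives a reason to suspect a systematic factor and investigate further.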