/data-science-synonyms

Help wading through the jargon

MIT LicenseMIT

Because many data analysis methods developed relatively independently in statistics, machine learning, biology, physics, chemistry, economics, psychology, engineering fields, and more, there is a lot of overlapping terminology.

This is an an attempt to sort it out.

###NOTES:

  • When terms are "near synonyms," it seems like it is v. educational to describe the difference.
  • It could be interesting to scrape departmental course web pages to see what type of terminology is used in different disciplines.
  • Maybe instead of an "all-encompassing" thesaurus, could write blog posts that talk about specific interesting cases.
  • Writing specific blog posts might be a good strategy even if there is a central tool.
  • [LRS] Have been envisioning very similar things for a while. See bottom for more and sketch.

I don't know the best way to structure this, for now I'm just going to jot some notes here.

  • False Positive (ML) = Type 1 Error (Statistics)
  • False Negative (ML) = Type 2 Error (Statistics)
  • Recall (Information Science) = Sensitivity (Medicine/Biology) = True Positive Rate (ML)
  • Precision (Information Science) = Positive Predictive Value (ML)
  • True Negative Rate (ML) = Specificity (Medicine/Biology)
  • Gaussian Distribution (Physics) = Normal Distribution (Statistics) = Bell Curve (Colloquial)
  • Hypothesis (Statistics) = Model (ML)
  • Marginal Likelihood (Frequentist) = Evidence (Bayesian)

In the above, Precision is not the same as the contrast between Accuracy and Precision in measurement:

  • Precision (Engineering) = Variance (Statitics) = Reliability (Psychometrics) = Variable Error
  • Accuracy (Engineering) = Bias (Statistics) = Validity (Psychometrics) = Constant Error

Lots of good ones for stochastics optimization problems!

  • Rough Landscape (Stochastic Optimization) = NP Complete (Computer Science)
  • Cost Function (Economics) = Loss Function (Statistics) = Utility (Economics) = Objective Function (Operations Management) = Reward Function () = Energy (Physics/Chemistry) = Fitness (Evolutionary Biology)
  • Search Space () = Fitness Landscape (Biology)
  • Reinforcement Learning (ML) = Approximate Dynamic Programming (Operations Management)

####These are not exactly equivalent, but are extremely closely related. NOTE: How could we illustrate that?

  • Structural Equation Model (Statistics)

  • Latent Variable Model () = Hidden Variable Model (Physics) = Bayesian Network (ML)

  • Independent Variables (Statistics) = Regressors (Statistics) = Explanatory Variables (Statistics) = Exogenous Variables (stats/statsmodels) = design (statistics) = Exposure Variable (Reliability Engineering) = Risk Factor (Medicine) = Feature (ML) = Input Variable (Engineering)

  • Dependent Variable (Statistics) = Response Variable (Statistics) = Regressand (Statistics) = Endogenous Variable (stats/statsmodels) = Outcome Variable (Medicine) = Output Variable (Engineering)

  • Hidden Variable () = Latent Variable () = Confounding Variable () --- seems like it's confounding when it's not there on purpose :)

  • Categorical Variable (Statistics) = Enumerated Type (Computer Science)

    • Binary Variable () = Dichotomous Variable ()

chart: Latent Variable Model nice chart on Wikipedia is a superset of all these things:

  • Factor Analysis ~= Principal Component Analysis --- seems like the difference between Factor Analysis and PCA is the assumptions made
  • Latent Trait Analysis = Item Response Theory
  • Latent Profile Analysis = Mixture Model
  • Latent Class Analysis

Mexican Hats

  • Difference of Gaussians = Mexican Hat function (really old, awesome American physicists)
  • Ricker Wavelet = Mexican hat wavelet/function (anyone else who says Mexican Hat function)
    But seriously, people. Let's not bicker over slivers of differences. We aren't launching spacecraft, here.

credit: JustinWick - Originally uploaded to en.wikipedia. Generated using Mathematica 5.0.


###[LRS note] -- This is my related but ginormo-scale thing that's been rolling around in my head for a while. Since I am visual/spatial, I was imagining more of a clicky thing where you could wander around. Maybe feed it some scraped & analyzed stuff, but allow plenty of expert refining, etc. My critical part is that you could add a link between things, and set (or vote on?) the strength or weakness of the links. Links would have a type (or strength) and some text, so you could have a place to explain what the differences are.

Here's my doodle: