A glossary of data science terms and packages

Data Science Glossary

This is a glossary of terms found in data science. Its aim is to provide the reader with a friendly description of concepts, techniques, and packages, and to point them towards resources for further information.

Topics

Contribute

There are many aspects of data science, and help covering them is greatly appreciated! If you would like to add a new term or section, or think an existing description can be improved, then please submit a pull request, or file an issue with a suggestion or question. If you are adding new items, please add only one item per pull request.

Contribution Guidelines

Each topic area listed above has its own page. Within the pages, there are sections for Concepts, Methods, Terms, and Tools. These are loose definitions but are roughly summarised as:

  • Concept - a practical or conceptual object relating to the topic (e.g. unsupervised machine learning).
  • Method - a specific method of implementation relating to a concept (e.g. k-means clustering).
  • Term - a word or phrase that has a specific meaning in data science other than its common meaning (e.g. feature).
  • Tool - typically a piece of software that can be used to implement data science techniques (e.g. scikit-learn).

This guide is written in Markdown, which is easy to pick up. Items added to the glossary should be formatted as follows:

#### Item Title
The description of the item should be a few sentences
long, perhaps including an example, and easy to understand
for a beginner. It should not try to explain the full theory
behind the item. Sources can be included as a comma-separated
list in line with the main text, and formatted with enclosing
square brackets. [[source title 1](source link 1),
[source title 2](source link 2)]

The result should be something like this:


Token

An element of a chopped-up body of text, which could be a word or a group of words. The task of turning a body of text into tokens is called "tokenisation". [[Spacy Tutorial](https://github.com/cytora/pycon-nlp-in-10-lines/blob/master/00_spacy_intro.ipynb)]
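
For anyone who prefers to generate entries rather than type them by hand, the format above could be produced with a small helper. This is only a minimal sketch: the function name `format_entry` and its parameters are hypothetical and not part of this project, and it assumes standard Markdown inline links (no space between the square brackets and the parentheses).

```python
def format_entry(title, description, sources=()):
    """Return a glossary entry as a Markdown string.

    `sources` is an iterable of (source title, source link) pairs.
    This helper is hypothetical and only illustrates the entry format.
    """
    sources = list(sources)
    # Level-four heading for the item title, then the description text.
    entry = f"#### {title}\n{description.strip()}"
    if sources:
        # Comma-separated Markdown links wrapped in enclosing square brackets.
        links = ", ".join(f"[{name}]({url})" for name, url in sources)
        entry += f" [{links}]"
    return entry


entry = format_entry(
    "Token",
    "An element of a chopped-up body of text, which could be a word or a "
    'group of words. The task of turning a body of text into tokens is '
    'called "tokenisation".',
    [(
        "Spacy Tutorial",
        "https://github.com/cytora/pycon-nlp-in-10-lines/blob/master/00_spacy_intro.ipynb",
    )],
)
print(entry)
```

The printed text can be pasted under the relevant section of a topic page.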


There are no hard rules for the glossary, so if you feel that an item could fit into a few different categories, use your best judgement to choose one. In some cases, multiple entries may be justified.

Multiple Entries

Sometimes an item may fit into more than one section because it has different meanings or uses depending on the topic. For example, the tool TensorFlow is a package directly related to Deep Learning, but it also has applications in Natural Language Processing. In this case it would be fine to have two separate entries with domain-specific descriptions. If applicable, a link can be added to an entry to guide the reader back to the "original entry".

Not sure?

Not sure if you have enough experience to contribute? Read this project's impostor syndrome disclaimer!