Data Science Glossary
This data science glossary contains the most important terms, structured into three categories:
- Machine Learning
- Software Engineering
- Statistics/Data Visualization
Machine Learning Glossary
The machine learning glossary is structured into:
- General terms
- Reinforcement-learning terms
- Supervised-learning terms
- Unsupervised-learning terms
General Terms
Data Point | Data Wrangling | Machine Learning |
Model Complexity | Model | Reinforcement Learning |
Supervised Learning | Unsupervised Learning |
Reinforcement Learning
Action | Agent | Environment |
Observation | Policy | Reward |
State |
Supervised Learning
Unsupervised Learning
Clustering | Dimensionality Reduction | k-means |
PCA |
Action
In reinforcement learning, agents try to perform actions that maximize the reward. Each action changes the learning environment and thus yields a new state.
Agent
In reinforcement learning, an agent is the learner that interacts with the environment. Based on a given state, the agent selects an appropriate action by considering past earned rewards. The policy of an agent determines the actions that should be executed for each state.
AUC
AUC stands for area under the curve. When evaluating scoring classifiers, the term AUC usually refers to the ROC (receiver operating characteristic)-AUC. The ROC curve plots the true positive rate against the false positive rate for all cutoffs on the scores. When available, the ROC-AUC is often preferable to cutoff-dependent measures such as sensitivity and specificity because it summarizes the performance over all cutoffs.
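As a rough illustration, the ROC-AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The following minimal sketch computes it this way; the function name `roc_auc` and the toy data are assumptions made for this example.

```python
def roc_auc(labels, scores):
    """Compute the ROC-AUC from binary labels (+1/-1) and classifier scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == -1]
    if not positives or not negatives:
        raise ValueError("Both classes must be present.")
    # Count the pairs in which the positive sample is ranked above the negative one.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: three positive and three negative samples.
labels = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise comparison is quadratic in the number of samples, which is fine for a small example but not for large data sets.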
Categorical Outcome
See Outcome.
Class
See categorical outcomes.
Classifier
Classifiers (classification models) are used for the prediction of categorical outcomes. Classifiers that output quantitative outcomes are called scoring classifiers and are more interpretable than non-scoring classifiers.
Clustering
Clustering, one of the main applications of unsupervised learning, is used to assign each sample to a group of samples. These groups of samples are called clusters. Clustering can be used for the visual exploration of data or for the automated identification of outliers. One of the simplest and most well-known clustering algorithms is k-means.
Confusion Matrix
The confusion matrix is used to evaluate the predictive performance of a classifier. The name confusion matrix stems from the fact that the table illustrates which predictions are confused between the classes. For binary classifiers, which differentiate between a positive (+1) and a negative (-1) class, the confusion matrix is a 2x2 table of the following form:
Predicted Class | Ground Truth: +1 | Ground Truth: -1
---|---|---
+1 | TP | FP
-1 | FN | TN
The entries are defined as follows:
- TP: The number of samples from the positive class that were correctly predicted as positive
- FP: The number of samples from the negative class that were falsely predicted as positive
- FN: The number of samples from the positive class that were falsely predicted as negative
- TN: The number of samples from the negative class that were correctly predicted as negative
From the confusion matrix, one can determine performance metrics such as sensitivity and specificity.
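The four entries can be counted directly from the ground truth and the predictions. The following is a minimal sketch that assumes labels encoded as +1 and -1; the function name `confusion_counts` and the example data are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for labels encoded as +1 and -1."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1  # positive sample, predicted positive
        elif pred == 1 and truth == -1:
            fp += 1  # negative sample, predicted positive
        elif pred == -1 and truth == 1:
            fn += 1  # positive sample, predicted negative
        elif pred == -1 and truth == -1:
            tn += 1  # negative sample, predicted negative
        else:
            raise ValueError("Labels must be +1 or -1.")
    return tp, fp, fn, tn


# Hypothetical ground truth and predictions for eight samples.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, -1, -1, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```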
Cross-Validation
Cross-validation is a strategy for evaluating the predictive performance of a model. In k-fold cross-validation, the data set is split into k folds such that each fold is used for testing once, while the remaining k-1 folds are used for training. Nested cross-validation adds an inner cross-validation loop on the training data, which is used for selecting the model that is then evaluated on the outer test fold.
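To make the splitting concrete, here is a minimal sketch of k-fold index generation in which every fold serves as the test set exactly once. The function name `k_fold_indices` is an assumption for this example, and the data are not shuffled, which one would usually do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples.")
    indices = list(range(n_samples))
    # Distribute the samples as evenly as possible across the k folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_indices(n_samples=10, k=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

A model would be fitted on each `train` subset and evaluated on the corresponding `test` subset; the k performance values are then typically averaged.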
Data Point
See Observation.
Data Wrangling
Data wrangling describes the unpopular task of transforming data into a machine-readable format. For example, data wrangling could entail transforming semi-structured data (e.g. from spreadsheets) to the CSV (comma-separated values) format. Data wrangling is often performed via automated scripts but may also involve manual steps. Note that data wrangling does not involve feature engineering.
Dependent Variable
See Outcome.
Dimensionality Reduction
In dimensionality reduction, data are projected to a low-dimensional subspace. This is either done in order to obtain better data visualizations or during feature engineering in the context of supervised learning. Dimensionality reduction techniques such as PCA are unsupervised methods.
Environment
In reinforcement learning, the environment determines the observable states and the actions that an agent can perform. A popular framework for specifying environments is OpenAI’s Gym.
Estimate
See Prediction.
False Positive Rate
Given a classifier, the false positive rate represents the ratio of false positive predictions among all samples from the negative class:
FPR = FP / (FP + TN)
See also Confusion Matrix.
Feature Matrix
See Features.
Features
Features are the independent variables in the supervised learning scenario. The columns of a feature matrix, \(X \in \mathbb{R}^{n \times p}\), represent the values of the p features. For example, to predict the weather, two possible features are the level of precipitation and the cloudiness.
Feature Engineering
Supervised learning aims at learning the general associations between features and outcomes. However, in their original form, the input data are often not well-suited for this purpose. Feature engineering is concerned with transforming the data such that machine learning models can easily learn from the data.
Forecasting
See Prediction.
Gold Standard
See Ground Truth.
Ground Truth
In order to perform supervised learning, it is necessary that the outcome for each data point is known. The measured outcome should reflect the ground truth. Otherwise, models are optimized with respect to the wrong values, a problem known as garbage in, garbage out.
Independent Variable
See Features.
Inference
See Prediction.
Interpretability
Interpretability describes whether a model is able to produce results that humans can easily interpret. Interpretability is closely tied to model complexity (i.e. the effective number of model parameters). Simple models such as linear models have few parameters and can easily be understood and interpreted. Complex models such as deep neural networks have large numbers of parameters, which makes them hard to understand and interpret.
There are many application scenarios in which it is acceptable to sacrifice some predictive performance in favor of greater interpretability. This is because in machine learning applications such as decision support systems, it is key that human operators can understand the intentions of the model.
k-means
k-means is a simple yet powerful clustering algorithm that identifies k cluster centers in the data. The algorithm terminates when the cluster centers have converged.
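The following minimal sketch shows the standard iteration (often called Lloyd's algorithm): assign each point to its nearest center, move each center to the mean of its cluster, and stop once the centers no longer change. The function names, the random initialization from the data, and the toy points are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize the centers with k data points
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                new_centers.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centers.append(center)  # keep the center of an empty cluster
        if new_centers == centers:  # converged: the centers no longer move
            break
        centers = new_centers
    return centers, clusters


# Hypothetical toy data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, k=2)
print(centers)
```

In general, the result depends on the initialization, which is why implementations commonly restart the algorithm several times and keep the best solution.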
Label
In classification, labels are the values that are used to differentiate between individual classes. For example, one could use Sunny and Cloudy as labels for observations that have been made on sunny and cloudy days, respectively. However, to apply supervised learning algorithms, numeric labels such as +1 and -1 would be necessary.
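Converting textual labels into numeric ones is a one-liner. This sketch assumes a binary problem and a hypothetical helper named `encode_labels`:

```python
def encode_labels(labels, positive_label):
    """Map the positive label to +1 and every other label to -1."""
    return [1 if label == positive_label else -1 for label in labels]


weather = ["Sunny", "Cloudy", "Sunny", "Sunny", "Cloudy"]
encoded = encode_labels(weather, positive_label="Sunny")
print(encoded)  # [1, -1, 1, 1, -1]
```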
Linear Model
See Model.
Machine Learning
Machine learning encompasses artificial intelligence approaches that are concerned with learning from data. There are three machine learning areas: supervised learning, unsupervised learning, and reinforcement learning.
Once a model has been fitted to the data, it is possible to make predictions given new data points (supervised learning), structure data (unsupervised learning), or select optimal actions in a dynamic environment (reinforcement learning).
Model
Models are the mathematical approximation of real-world phenomena. In supervised learning, models are constructed using pairs of input data and observed outcomes. In unsupervised learning, the outcomes are not available such that only the structure of the data is modeled. In reinforcement learning, models are constructed according to states, actions, and rewards.
Besides these machine learning approaches, which use optimization algorithms to fit models to data, there is a host of other models that are useful for specific tasks, for example, hidden Markov models, epidemiological models, and Bayesian models.
It is possible to differentiate between linear and non-linear models. While linear models assume linear relationships between the features and the outcome, non-linear models allow for non-linear relationships.
One should always remember the following famous quote from British statistician George E. P. Box:
All models are wrong but some are useful.
Model Complexity
Model complexity is defined by the effective number of parameters that make up a model. For example, deep learning models with many parameters are more complex than simple models, such as linear models. Complex models should be avoided if insufficient training data are available.
Model Validation
Model validation entails the following steps:
- Fitting the model to a set of training data
- Tuning the hyperparameters of the model using a set of validation data
- Evaluating predictive performance on an independent test data set
The two most popular approaches for validation are:
- Splitting the data into a training, validation, and test set
- Using cross-validation, in which the model is trained on various subsets of the data.
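The first approach can be sketched as follows. The 60/20/20 split, the fixed seed, and the function name `train_validation_test_split` are assumptions chosen for illustration, not recommendations.

```python
import random


def train_validation_test_split(data, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the data and split it into training, validation, and test sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


train, validation, test = train_validation_test_split(range(10))
print(len(train), len(validation), len(test))  # 6 2 2
```

The model is fitted on `train`, its hyperparameters are tuned on `validation`, and `test` is touched only once for the final evaluation.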
Non-Linear Model
See Model.
Outcome
In supervised learning, the outcome is a measurement of the ground truth. Principal types of outcomes are categorical outcomes (class labels) and quantitative outcomes. For example, when predicting the weather, Sunny and Cloudy would be categorical outcomes, while the amount of precipitation would be a quantitative outcome.
The underlying variable associated with the outcome is called the dependent variable.
Observation
In supervised learning, observations are the rows of the feature matrix. Observations are also called data points or samples. The number of observations is usually denoted by n.
For the use of the term observation in reinforcement learning, see State.
PCA
Principal component analysis (PCA) is a standard dimensionality reduction technique. It is based on finding a projection to orthogonal coordinates that maintain as much variance as possible.
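To illustrate the idea, the sketch below estimates only the first principal component, i.e. the direction of maximal variance, by applying power iteration to the sample covariance matrix. Power iteration is merely one simple way to do this and is an assumption of this example; in practice, one would typically rely on an established library.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Estimate the direction of maximal variance via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix (p x p) of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p  # start vector; assumed not orthogonal to the first component
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, centered


# Hypothetical toy data lying exactly on the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, centered = first_principal_component(rows)
# Project each centered observation onto the first principal component.
projected = [sum(a * b for a, b in zip(row, direction)) for row in centered]
print(direction, projected)
```

Because the toy data lie on a line, a single component captures all of the variance, and the two-dimensional data are reduced to one dimension without loss.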
Policy
In reinforcement learning, the policy of an agent is a mapping from states to actions. This means that the policy defines the behavior of the agent in the environment. There are on-policy and off-policy reinforcement learning algorithms.
Performance
In supervised learning, predictive performance is the ability of a model to correctly predict the outcomes of observations. To quantify the predictive performance of classifiers, metrics such as the AUC can be utilized.
Prediction
Prediction is the act of applying a model to a new data point in order to determine the estimated outcome. Inference is often used synonymously, although inference is geared towards learning about the data generation process. Forecasting is a special form of prediction in which time series are used as the input.
The term estimate is a synonym for prediction that is popular in the statistical community because it underlines the fact that predictions are only approximations of reality.
Quantitative Outcome
See Outcome.
Regressor
Regressors (regression models) are used to predict the outcomes for quantitative variables. Compared to classifiers, they allow for fine-grained predictions.
Reinforcement Learning
Reinforcement learning (RL) is an area of machine learning in which one or multiple agents perform actions in an environment, after observing the state. Once an action has been performed, the agent receives a reward. By balancing exploration (finding novel states) and exploitation (reaping rewards), RL agents can learn an optimal policy, which identifies the best action to take for every state.
In recent years, reinforcement learning has gained in popularity due to the emergence of deep RL, in which deep neural networks are used to learn which states are associated with the greatest rewards.
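The interplay of agent, environment, state, action, reward, and policy can be sketched with a tiny example. The corridor environment, the hyperparameters, and the choice of tabular Q-learning with epsilon-greedy exploration are all assumptions made for illustration; they stand in for one of many possible RL algorithms.

```python
import random

ACTIONS = ("left", "right")
GOAL = 4  # terminal state of a hypothetical corridor with states 0..4


def step(state, action):
    """Environment: apply the action and return (next_state, reward, done)."""
    next_state = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def greedy_action(q_values, rng):
    """Exploitation: pick the best-valued action, breaking ties at random."""
    best = max(q_values.values())
    return rng.choice([a for a in ACTIONS if q_values[a] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Agent: learn action values with tabular Q-learning."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(GOAL + 1)}
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # exploration
            else:
                action = greedy_action(q[state], rng)
            next_state, reward, done = step(state, action)
            # Move the estimate towards reward plus discounted best future value.
            target = reward + gamma * max(q[next_state].values())
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q = train()
# The learned policy maps every non-terminal state to its best-valued action.
policy = {s: max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)}
print(policy)
```

Since the only reward is obtained at the right end of the corridor, the learned policy should select `right` in every state.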
Reward
In reinforcement learning, agents obtain rewards after performing an action. Agents adjust their policy in order to maximize the reward.
Sample
For its use in supervised learning, see Observation.
Sensitivity
The sensitivity of a classifier is defined by its true positive rate:
sensitivity = TPR = TP/(TP+FN).
See also Confusion Matrix.
Specificity
Specificity indicates the true negative rate of a classifier:
specificity = TNR = TN / (TN + FP) = 1 - FPR
Since specificity considers the number of false positives (FP), it allows for conclusions about the false positive rate (FPR). See also Confusion Matrix.
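As a small worked example, sensitivity, specificity, and the false positive rate can be computed from the entries of a confusion matrix. The counts below are invented for illustration, and `classification_rates` is a hypothetical helper.

```python
def classification_rates(tp, fp, fn, tn):
    """Return (sensitivity, specificity, false positive rate)."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    fpr = fp / (fp + tn)          # false positive rate
    return sensitivity, specificity, fpr


# Hypothetical counts: 50 positive and 50 negative samples.
sensitivity, specificity, fpr = classification_rates(tp=40, fp=5, fn=10, tn=45)
print(sensitivity, specificity, fpr)
```

Here, 40 of the 50 positive samples and 45 of the 50 negative samples are predicted correctly, and specificity and false positive rate add up to one.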
State
In reinforcement learning, the state indicates the observations that an agent has made at a given point in time. States are usually represented by numeric vectors or matrices. Crafting appropriate states is a form of feature engineering.
Supervised Learning
Supervised learning is an area of machine learning that is concerned with learning from pairs of input data and associated outcomes. Once a model has been trained on a set of training data, it is tuned using a validation data set, and, finally, evaluated on an independent test data set. The application of a supervised learning model on new data is called prediction, inference, or forecasting.
Models that are trained on labeled data (i.e. categorical outcomes) are called classifiers. Models that are trained on quantitative outcomes are called regressors.
Test Data
Test data refers to the set of data that is used for evaluating the predictive performance of a model.
Training Data
Training data refers to the set of data that is used for fitting a model.
True Positive Rate
See Sensitivity.
Unsupervised Learning
Unsupervised learning is an area of machine learning that is concerned with the identification of models that are capable of representing the properties of the data in a condensed manner, which allows for greater interpretability.
Evaluating the performance of unsupervised learning methods is more challenging than for supervised learning because there are no outcomes that provide the ground truth. Popular unsupervised methods include k-means and PCA.
Validation Data
Validation data refers to the set of data that is used for tuning the hyperparameters of a model.
Software Engineering Glossary
The software engineering glossary is structured into:
- DevOps terms
- General terms
- Scrum terms
- Testing terms
DevOps Terms
CI/CD | Continuous Delivery | Continuous Integration |
Deployment | Infrastructure as Code |
General Terms
Acceptance-Test-Driven Development | Behavior-Driven Development | Extreme Programming |
Mob Programming | Pair Programming | Test-Driven Development |
Scrum Terms
Backlog Item | Backlog | Daily |
Development Team | Product Owner | Refinement |
Scrum Master | Sprint Planning | Sprint Retrospective |
Sprint Review |
Testing Terms
End-to-End Test | Integration Test | Types of Test |
Unit Test |
Acceptance-Test-Driven Development
Acceptance-test-driven development (ATDD) is based on the idea that automated acceptance tests should be specified before starting with the implementation of a new feature. Since the acceptance criteria typically reflect the requirements of the business stakeholders, these tests are typically formulated in such a way that they are understandable in layman’s terms, for example, using Cucumber.
Backlog Item
TODO
Behavior-Driven Development
Behavior-driven development (BDD) is related to acceptance-test-driven development (ATDD). BDD places a focus on the fact that the required behavior of the software is made explicit and is available in an easily understood manner, e.g. via Cucumber.
Backlog
The backlog is a collection of backlog items (e.g. features, bugs, enablers) that are planned to be implemented by the development team. The product backlog contains all the backlog items that are relevant for the product, while the sprint backlog contains only those backlog items that are relevant for the current sprint.
CI/CD
CI/CD is the shorthand for continuous integration and continuous delivery. CI/CD is realized through automated pipelines that trigger on code changes in the version control system. Such a pipeline implements CI through automated tests. If the tests are successful, CD is performed by deploying the software on a staging area or even on production instances.
Continuous Delivery
Continuous delivery (CD) refers to the ability to automatically deploy increments of software to a staging area. CD should not be confused with continuous deployment, which goes one step further by automatically deploying the software to the production instances.
Deployment
Deployment refers to the release of software in a non-local environment. Software typically passes through the following environments:
- Development environment (e.g. local machines of developers)
- Staging area (a server similar to the production environment)
- Production environment (the production instances running the software that is available to the customers)
In complex systems, multiple staging areas may be used.
Development Team
The development team engineers the software according to the product backlog.
End-to-End Test
End-to-end tests validate the functionality of full application workflows from a user perspective.
For example, an end-to-end test for an e-commerce business could consist of the following steps:
- User login
- Adding products to the shopping basket
- Ordering the selected items
- Receiving the invoice through a confirmation email
Ideally, end-to-end tests are implemented using a behavior-driven development approach.
Extreme Programming
Extreme programming (XP) refers to a collection of software engineering practices, most notably pair programming. Since its inception, many of these practices have become an integral part of what is considered agile software development today.
Infrastructure as Code
Infrastructure as code (IaC) is a practice in which a technical infrastructure is defined in terms of code such that the infrastructure can be maintained (e.g. reconfigured) by executing the code instead of manual interventions.
Examples of IaC frameworks include Ansible, Chef, and Puppet.
Integration Test
An integration test validates functional correctness across multiple application modules. See Types of Tests for a listing of all types of tests.
Mob Programming
Mob programming extends the concept of pair programming from pairs of developers to the entire team. This means that the entire team collaboratively works on the same piece of code. This coding practice is particularly useful when implementing complex program logic that requires the knowledge from various team members.
Pair Programming
Pair programming is an extreme programming technique in which two developers collaboratively work on the same piece of code. One developer assumes the role of the driver, while the other assumes the role of the navigator.
The developer acting as the driver writes the code, while the developer acting as the navigator guides the implementation efforts. Typically, the roles are switched regularly (e.g. every thirty minutes).
Product Owner
In Scrum, the product owner (PO) is responsible for maximizing the value that is delivered by the development team. To achieve this goal, the PO has to prioritize backlog items by considering the value they offer to the customer as well as the technical complexity of the implementation. The development team informs the PO about new software features in the sprint review meeting.
Scrum
TODO
Sprint
In Scrum, a sprint refers to the period of time after which a new increment of software is produced. The duration of a sprint defines the interval in which the majority of Scrum meetings take place (e.g. planning, review, retrospective) and determines the rhythm according to which software is developed. Typically, sprints have a length of two weeks.
Sprint Review
In Scrum, the sprint review is a meeting in which the development team presents its progress on the sprint backlog to the product owner.
Test-Driven Development
Test-driven development (TDD) is a development practice that ensures that a sufficient number of varied unit tests are implemented during the development of a new functionality. The following pattern can be used for TDD:
- Write up the boilerplate for the functionality you want to test (e.g. class, function) but do not implement the functionality yet.
- Write a failing test.
- Adapt the implementation such that the test is passed successfully.
- Continue with Step 2. Terminate the TDD process if you cannot think of any additional tests that would demonstrate a failure in your implementation.
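The result of a few such TDD iterations might look like the following sketch, which uses Python's built-in `unittest` module. The function `mean` and its tests are hypothetical examples: each test was written first, failed against the empty boilerplate, and then drove the implementation.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        # Added only after test_empty_input_raises had failed.
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTest(unittest.TestCase):
    def test_mean_of_several_values(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            mean([])


# Run the tests programmatically and collect the result.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MeanTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Each test checks a single behavior, which makes it easy to see which requirement is violated when a test fails.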
Types of Tests
Software can be tested using the following types of tests:
- Unit tests
- Integration tests
- End-to-end tests
These types of tests form the test pyramid. Most test suites consist of a large number of unit tests, a smaller number of integration tests, and just a few end-to-end tests.
Unit Test
Unit tests validate the functionality of a unit of application code. The term unit is not formally defined and could be a single, independent function, multiple functions, or a complete software module. A single unit test should test only a single type of functionality, if possible. This allows the precise identification of errors in the application. See Types of Tests for a listing of all types of tests.