Hugo is a tool for creating static websites. How does a static website differ from a dynamic one, you may be wondering? While a static website is delivered to clients in the same form in which it is stored, a dynamic website builds its pages on demand using an application server. As a consequence, static and dynamic web pages have contrasting properties:
| Criterion | Static | Dynamic |
|---|---|---|
| Loading times | Fast | Slow |
| Required expertise for maintenance | High | Low |
| Security | High | Potentially low |
| Potential for user interaction | Low | High |
Let’s take a look at these aspects in more detail.
Linear prediction models assume that there is a linear relationship between the independent variables and the dependent variable. Therefore, these models exhibit high bias and low variance.
The high bias of these models is due to the assumption of linearity. If this assumption does not sufficiently represent the data, then linear models will be inaccurate.
On the other hand, linear models also have low variance. This means that if several linear models were trained on different data sets, they would perform similarly on the same test data set.
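The low variance of linear models can be illustrated with a small simulation. The following sketch (using a hypothetical data-generating process of my own choosing) fits a simple linear model on several independently drawn samples and shows that the estimated slopes barely change:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: y = 2x + 1 + noise.
def sample_data(n=100):
    x = rng.uniform(0, 10, n)
    y = 2 * x + 1 + rng.normal(0, 1, n)
    return x, y

# Fit a linear model on several independent samples and compare the
# estimated slopes: they stay close together, illustrating the low
# variance of linear models.
slopes = []
for _ in range(5):
    x, y = sample_data()
    slope, intercept = np.polyfit(x, y, deg=1)
    slopes.append(slope)

print(np.round(slopes, 2))  # all slopes close to the true value of 2
```

A high-variance model (e.g. a deep decision tree) fitted to the same five samples would show much larger fluctuations in its predictions.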
The benefit of non-parametric tests over parametric tests is that they do not assume that the data follow a specific distribution. Thus, they are well-suited to situations where the assumptions of parametric tests are not met, which is typically the case for small sample sizes.
Popular non-parametric tests
This table gives an overview of popular non-parametric tests:
| Test | Test for what? |
|---|---|
| Wilcoxon rank sum test | Difference in medians |
| Wilcoxon signed-rank test | Difference in paired medians |
| Fisher's exact test | Independence in contingency tables |
| Kruskal-Wallis test | Difference of multiple medians |
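As a brief sketch of how such tests are applied in practice (assuming SciPy is available, where the Wilcoxon rank sum test is implemented as the Mann-Whitney U test), the following example tests for location differences between samples I generated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 30)    # group A
b = rng.normal(1.5, 1.0, 30)    # group B with a shifted location

# Wilcoxon rank sum test (Mann-Whitney U) for a location difference
# between two independent samples:
u_stat, p_rank_sum = stats.mannwhitneyu(a, b)

# The Kruskal-Wallis test generalizes this to more than two groups:
c = rng.normal(3.0, 1.0, 30)
h_stat, p_kruskal = stats.kruskal(a, b, c)

print(p_rank_sum, p_kruskal)  # both p-values are small here
```

Since the groups were deliberately generated with different locations, both tests yield small p-values, indicating significant differences.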
Posts about Non-Parametric Significance Testing
Parametric significance tests assume that the data follow a specific distribution (typically the normal distribution). If their assumptions are met, they have greater power than non-parametric tests. Otherwise, non-parametric tests should be used. Thus, parametric tests should only be used after carefully evaluating whether the assumptions of the test are sufficiently fulfilled.
This table gives an overview of the most popular parametric tests:
| Test | Test for what? |
|---|---|
| Student's t-test, paired Student's t-test | Difference in means, difference in paired means |
| Chi-squared test | Independence of group counts |
| One-way ANOVA | Difference in means of several independent groups |
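As a minimal sketch (assuming SciPy is available), the following example applies Student's t-test to two normally distributed samples that I generated with different means, so the test's normality assumption is met by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two independent, normally distributed samples whose means differ:
a = rng.normal(10.0, 2.0, 50)
b = rng.normal(12.0, 2.0, 50)

# Student's t-test for a difference in means:
t_stat, p_value = stats.ttest_ind(a, b)
print(p_value)  # small p-value: the means differ significantly
```

If the samples were small and clearly non-normal, the Wilcoxon rank sum test would be the safer choice.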
Posts about Parametric Significance Testing
Besides interpretability, predictive performance is the most important property of machine learning models. Here, I provide an overview of available performance measures and discuss under which circumstances they are appropriate.
Performance measures for regression
For regression, the most popular performance measures are R squared and the root mean squared error (RMSE). \(R^2\) has the advantage that it is typically in the interval \([0,1]\), which makes it more interpretable than the RMSE, whose value is on the scale of the outcome.
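Both measures are straightforward to compute from the residuals. The following sketch uses small made-up vectors of observations and predictions purely for illustration:

```python
import numpy as np

# Hypothetical observed outcomes and model predictions:
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))          # on the scale of the outcome
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot                  # typically in [0, 1]

print(rmse, r_squared)
```

Note that \(R^2\) can become negative on test data when a model performs worse than simply predicting the mean, which is why it is only *typically* in \([0,1]\).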
There are many different types of plots for visualizing data.
Basic plots
The following plots are frequently used:
The bar plot shows the extent of values according to the height of bars. If the data are normally distributed, bar plots can display variation by including error bars.
The box plot indicates variation by summarizing the distribution of the measurements in terms of the first, second, and third quartiles.
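The quantities behind these two plots can be computed directly. The following sketch (using simulated measurements of my own choosing) derives the mean and standard deviation that an error bar would show, and the three quartiles that define a box plot:

```python
import numpy as np

rng = np.random.default_rng(3)
measurements = rng.normal(5.0, 1.5, 200)  # simulated, roughly normal data

# Quantities behind a bar plot with error bars (assumes normality):
mean = measurements.mean()
std = measurements.std(ddof=1)  # error bars often show mean +/- one sd

# Quantities behind a box plot: the three quartiles, which delimit
# the middle 50% of the data around the median:
q1, median, q3 = np.percentile(measurements, [25, 50, 75])

print(mean, std, q1, median, q3)
```

These values would then be passed to a plotting library (e.g. the `yerr` argument of a bar-plot function, or a box-plot function that computes the quartiles itself).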
The R logo, licensed under CC-BY-SA 4.0.
All posts with the R tag deal with applications of the statistical programming language R in the data science setting.
Posts about R
Supervised learning is concerned with models for predicting the outcome for new data points.
Models for supervised learning
The following supervised learning models are important:
Linear models: models that assume the existence of a linear relationship between the independent variables and the outcome.
Support vector machines: models that deal with non-linear associations by transforming the data to another space via kernel functions.
Neural networks: models that emulate the interaction of neurons in the nervous system.
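The contrast between linear models and kernel methods can be sketched briefly. Assuming scikit-learn is available, the following example fits a linear classifier and an RBF-kernel support vector machine to data that are not linearly separable:

```python
# A sketch contrasting a linear model with a kernel SVM on the
# non-linearly separable "two moons" toy data set.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear_acc = LogisticRegression().fit(X, y).score(X, y)
svm_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(linear_acc, svm_acc)  # the kernel SVM handles the non-linearity better
```

The RBF kernel implicitly transforms the data to a space in which the two classes become separable, which is exactly the mechanism described above.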
Unsupervised learning is the branch of machine learning concerned with learning from data for which no outcomes are available. The main goal of many unsupervised methods is to improve the interpretability of the data.
Clustering
The goal of clustering is to assign each observation in a data set to a group based on the observed values associated with each observation. Different clustering approaches rely on different target functions and therefore lead to different cluster assignments.
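As a minimal sketch of one such approach, the following implements Lloyd's algorithm for k-means clustering (whose target function is the within-cluster sum of squared distances) on simulated, well-separated groups:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means sketch: alternately assign observations to the
    nearest centroid and recompute centroids as cluster means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated groups of simulated observations:
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels))  # roughly 50 observations per cluster
```

A clustering method with a different target function (e.g. density-based clustering) could assign these same observations to different groups, which is the point made above.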