Hugo is a tool for creating static websites. How does a static website differ from a dynamic one, you may be wondering? While a static website is delivered to clients in the same form in which it is stored, a dynamic website builds its pages on demand using an application server. As a consequence, static and dynamic web pages have contrasting properties:
| Criterion | Static | Dynamic |
|---|---|---|
| Loading times | Fast | Slow |
| Required expertise for maintenance | High | Low |
| Security | High | Potentially low |
| Potential for user interaction | Low | High |
Let’s take a look at these aspects in more detail.
Linear prediction models assume that there is a linear relationship between the independent variables and the dependent variable. Therefore, these models exhibit high bias and low variance.
The high bias of these models is due to the assumption of linearity. If this assumption does not sufficiently represent the data, then linear models will be inaccurate.
On the other hand, linear models also have low variance. This means that if several linear models were trained on different data sets, they would perform similarly on the same test data set.
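The low variance of linear models can be illustrated with a small simulation. The following sketch (using a hypothetical data-generating process of my own choosing) fits a simple linear model on several independently drawn samples and shows that the estimated slopes barely change:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: y = 2x + 1 + noise.
def sample_data(n=100):
    x = rng.uniform(0, 10, n)
    y = 2 * x + 1 + rng.normal(0, 1, n)
    return x, y

# Fit a linear model on several independent samples and compare the
# estimated slopes: they stay close together, illustrating the low
# variance of linear models.
slopes = []
for _ in range(5):
    x, y = sample_data()
    slope, intercept = np.polyfit(x, y, deg=1)
    slopes.append(slope)

print(np.round(slopes, 2))  # all slopes close to the true value of 2
```

A high-variance model (e.g. a deep decision tree) fitted to the same five samples would show much larger fluctuations in its predictions.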
The benefit of non-parametric tests over parametric tests is that they do not assume that the data follow a specific distribution. Thus, they are well-suited to situations where the assumptions of parametric tests are not met, which is typically the case for small sample sizes.
Popular non-parametric tests
This table gives an overview of popular non-parametric tests:
| Test | Test for what? |
|---|---|
| Wilcoxon rank sum test | Difference in medians |
| Wilcoxon signed-rank test | Difference in paired medians |
| Fisher's exact test | Independence in contingency tables |
| Kruskal-Wallis test | Difference of multiple medians |
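As a brief sketch of how such tests are applied in practice (assuming SciPy is available, where the Wilcoxon rank sum test is implemented as the Mann-Whitney U test), the following example tests for location differences between samples I generated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 30)    # group A
b = rng.normal(1.5, 1.0, 30)    # group B with a shifted location

# Wilcoxon rank sum test (Mann-Whitney U) for a location difference
# between two independent samples:
u_stat, p_rank_sum = stats.mannwhitneyu(a, b)

# The Kruskal-Wallis test generalizes this to more than two groups:
c = rng.normal(3.0, 1.0, 30)
h_stat, p_kruskal = stats.kruskal(a, b, c)

print(p_rank_sum, p_kruskal)  # both p-values are small here
```

Since the groups were deliberately generated with different locations, both tests yield small p-values, indicating significant differences.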
Posts about Non-Parametric Significance Testing
Parametric significance tests assume that the data follow a specific distribution (typically the normal distribution). If their assumptions are met, they have greater power than non-parametric tests. Otherwise, non-parametric tests should be used. Thus, parametric tests should only be used after carefully evaluating whether the assumptions of the test are sufficiently fulfilled.
This table gives an overview of the most popular parametric tests:
| Test | Test for what? |
|---|---|
| Student's t-test, paired Student's t-test | Difference in means, difference in paired means |
| Chi-squared test | Independence of group counts |
| One-way ANOVA | Difference in means of several independent groups |
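As a minimal sketch (assuming SciPy is available), the following example applies Student's t-test to two normally distributed samples that I generated with different means, so the test's normality assumption is met by construction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two independent, normally distributed samples whose means differ:
a = rng.normal(10.0, 2.0, 50)
b = rng.normal(12.0, 2.0, 50)

# Student's t-test for a difference in means:
t_stat, p_value = stats.ttest_ind(a, b)
print(p_value)  # small p-value: the means differ significantly
```

If the samples were small and clearly non-normal, the Wilcoxon rank sum test would be the safer choice.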
Posts about Parametric Significance Testing
Besides interpretability, predictive performance is the most important property of machine learning models. Here, I provide an overview of available performance measures and discuss under which circumstances they are appropriate.
Performance measures for regression
For regression, the most popular performance measures are R squared and the root mean squared error (RMSE). \(R^2\) has the advantage that it is typically in the interval \([0,1]\), which makes it more interpretable than the RMSE, whose value is on the scale of the outcome.
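Both measures are straightforward to compute from the residuals. The following sketch uses small made-up vectors of observations and predictions purely for illustration:

```python
import numpy as np

# Hypothetical observed outcomes and model predictions:
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))          # on the scale of the outcome
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot                  # typically in [0, 1]

print(rmse, r_squared)
```

Note that \(R^2\) can become negative on test data when a model performs worse than simply predicting the mean, which is why it is only *typically* in \([0,1]\).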
There are many different types of plots for visualizing data.
Basic plots
The following plots are frequently used:
The bar plot shows the extent of values according to the height of bars. If the data are normally distributed, bar plots can display variation by including error bars.
The box plot indicates variation by summarizing the distribution of the measurements in terms of the first, second, and third quartiles.
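The quantities behind these two plots can be computed directly. The following sketch (using simulated measurements of my own choosing) derives the mean and standard deviation that an error bar would show, and the three quartiles that define a box plot:

```python
import numpy as np

rng = np.random.default_rng(3)
measurements = rng.normal(5.0, 1.5, 200)  # simulated, roughly normal data

# Quantities behind a bar plot with error bars (assumes normality):
mean = measurements.mean()
std = measurements.std(ddof=1)  # error bars often show mean +/- one sd

# Quantities behind a box plot: the three quartiles, which delimit
# the middle 50% of the data around the median:
q1, median, q3 = np.percentile(measurements, [25, 50, 75])

print(mean, std, q1, median, q3)
```

These values would then be passed to a plotting library (e.g. the `yerr` argument of a bar-plot function, or a box-plot function that computes the quartiles itself).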
The R logo, licensed under CC-BY-SA 4.0.
All posts with the R tag deal with applications of the statistical programming language R in the data science setting.
Posts about R
Supervised learning is concerned with models for predicting the outcome for new data points.
Models for supervised learning
The following supervised learning models are important:
Linear models: models that assume the existence of a linear relationship between the independent variables and the outcome.
Support vector machines: models that deal with non-linear associations by transforming the data to another space via kernel functions.
Neural networks: models that emulate the interaction of neurons in the nervous system.
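The contrast between linear models and kernel methods can be sketched briefly. Assuming scikit-learn is available, the following example fits a linear classifier and an RBF-kernel support vector machine to data that are not linearly separable:

```python
# A sketch contrasting a linear model with a kernel SVM on the
# non-linearly separable "two moons" toy data set.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear_acc = LogisticRegression().fit(X, y).score(X, y)
svm_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(linear_acc, svm_acc)  # the kernel SVM handles the non-linearity better
```

The RBF kernel implicitly transforms the data to a space in which the two classes become separable, which is exactly the mechanism described above.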
Unsupervised learning is the branch of machine learning concerned with learning from data for which no outcomes are available. The main goal of many unsupervised methods is to improve the interpretability of the data.
Clustering
The goal of clustering is to assign each observation in a data set to a group based on the observed values associated with each observation. Different clustering approaches rely on different target functions and therefore lead to different cluster assignments.
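As a minimal sketch of one such approach, the following implements Lloyd's algorithm for k-means clustering (whose target function is the within-cluster sum of squared distances) on simulated, well-separated groups:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means sketch: alternately assign observations to the
    nearest centroid and recompute centroids as cluster means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated groups of simulated observations:
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels))  # roughly 50 observations per cluster
```

A clustering method with a different target function (e.g. density-based clustering) could assign these same observations to different groups, which is the point made above.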