We need the following packages:

library(tidyverse)
library(leaps)
library(glmnet)
library(mlbench)
library(doMC)
library(caret)
library(class)
library(reshape2)
library(plotROC)
library(kernlab)
library(bestglm)

Best subset selection

Exercise 1

We consider the ozone data set from the mlbench package

data("Ozone")
summary(Ozone)

We first delete individuals with missing data and the first three variables, which contain the dates of the observations:

donnees <- Ozone %>% na.omit()
donnees <- donnees[,-c(1:3)]
summary(donnees)

We then split the data into:

  • a training set of size 150;
  • a test set of size 53.
set.seed(1234)
perm <- sample(nrow(donnees))
ind.train <- perm[1:150]
ind.test <- perm[-c(1:150)]
train <- donnees[ind.train,]
test <- donnees[ind.test,]

The problem is to explain the Daily maximum one-hour-average ozone reading (V4) by the other variables. We first consider the linear model

linear.model <- lm(V4~.,data=train)
summary(linear.model)

It seems that some variables are not necessary in the model.

  1. Explain the output of the following command
mod.sel <- regsubsets(V4~.,data=train)
summary(mod.sel)
  2. Select the models that optimize Mallows's \(C_p\) and \(BIC\) criteria.
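
A minimal sketch for extracting the selected models from the regsubsets output (the names size.cp and size.bic are introduced here and reused below):

crit <- summary(mod.sel)
# model sizes minimizing Mallows's Cp and BIC on the training set
size.cp <- which.min(crit$cp)
size.bic <- which.min(crit$bic)
# coefficients of the two selected models
coef(mod.sel, size.cp)
coef(mod.sel, size.bic)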

  3. We consider the quadratic risk for the three models: \[E[(Y-\widehat m(X))^2].\] This risk is estimated with the test set according to \[\frac{1}{n_{test}}\sum_{i\in test}(Y_i-\widehat m(X_i))^2.\] Compute the estimated risks for the three linear models (full, \(C_p\)-selected and \(BIC\)-selected).
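
One possible way to compute these estimated risks, assuming size.cp and size.bic from the sketch above (regsubsets has no predict method, so the predictions are rebuilt from the selected coefficients):

# predictions of a regsubsets model of a given size
pred.subsets <- function(object, id, newdata) {
  X <- model.matrix(V4 ~ ., data = newdata)
  beta <- coef(object, id)
  as.vector(X[, names(beta)] %*% beta)
}
prev <- data.frame(full = predict(linear.model, newdata = test),
                   cp = pred.subsets(mod.sel, size.cp, test),
                   bic = pred.subsets(mod.sel, size.bic, test))
# estimated quadratic risks on the test set
apply(prev, 2, function(p) mean((test$V4 - p)^2))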

Exercise 2 (variable selection)

We consider the model \[Y=\beta_0+\beta_1X_1+\dots+\beta_pX_p+\varepsilon\] where \(X_1,\dots,X_p\) are independent variables with distribution \(\mathcal N(0,1)\) and \(\varepsilon\) has distribution \(\mathcal N(0,0.5^2)\) (\(X=(X_1,\dots,X_p)\) and \(\varepsilon\) are independent).

We assume that \(p=105\) and that

  • \(\beta_0=0\), \(\beta_1=\dots=\beta_5=1\)
  • \(\beta_6=\dots=\beta_{105}=0\). So here, only variables \(X_1,\dots,X_5\) are relevant to explain \(Y\).
  1. Generate \(n=1000\) observations \((x_1,y_1),\dots,(x_n,y_n)\) according to this model. Put these observations in a data.frame df with dimension \(1000\times 106\).
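
A minimal sketch for the simulation (the seed is an assumption):

set.seed(5678)  # assumed seed
n <- 1000; p <- 105
X <- matrix(rnorm(n * p), nrow = n)
# beta_0 = 0, beta_1 = ... = beta_5 = 1, the remaining coefficients are 0
Y <- rowSums(X[, 1:5]) + rnorm(n, sd = 0.5)
df <- data.frame(Y, X)
dim(df)  # 1000 x 106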

  2. Fit a linear model (lm function) with df and print the estimates of \(\beta_0,\dots,\beta_{105}\).
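
A sketch (the name mod.full is reused below):

mod.full <- lm(Y ~ ., data = df)
coef(mod.full)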

  3. We propose to perform variable selection with a backward approach using the BIC criterion. You can use nvmax=30 in regsubsets.
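
A sketch with regsubsets (the selected size is the one minimizing BIC):

mod.back <- regsubsets(Y ~ ., data = df, method = "backward", nvmax = 30)
size.sel <- which.min(summary(mod.back)$bic)
coef(mod.back, size.sel)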

  4. Generate a test dataset (df.test) \((x_{n+1},y_{n+1}),\dots,(x_{n+m},y_{n+m})\) with \(m=500\) according to the same model. Compute the estimated mean square error \[\frac{1}{m}\sum_{i\in test}(\widehat y_i-y_i)^2\] for the two models (full and backward).
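
A sketch, assuming mod.full and mod.back/size.sel from the previous questions (as above, the backward predictions are rebuilt from the selected coefficients):

m <- 500
X.test <- matrix(rnorm(m * p), nrow = m)
df.test <- data.frame(Y = rowSums(X.test[, 1:5]) + rnorm(m, sd = 0.5), X.test)
# full model
prev.full <- predict(mod.full, newdata = df.test)
# backward model
beta.sel <- coef(mod.back, size.sel)
X.mat <- model.matrix(Y ~ ., data = df.test)
prev.back <- as.vector(X.mat[, names(beta.sel)] %*% beta.sel)
c(full = mean((prev.full - df.test$Y)^2),
  back = mean((prev.back - df.test$Y)^2))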

Penalized regression

Exercise 3 (Lasso and ridge for ozone data)

We consider the same problem as in exercise 1. We propose to fit ridge and lasso estimates and to compare these models with those of exercise 1.

The glmnet function requires the \(X\) matrix and the \(Y\) vector as inputs (we cannot use a formula). To compute the \(X\) matrix, we use the model.matrix function:

train.X <- model.matrix(V4~.,data=train)  # first column is the intercept
test.X <- model.matrix(V4~.,data=test)
  1. Draw the coefficient paths for ridge and lasso.
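
A minimal sketch (the intercept column of train.X is dropped since glmnet fits its own intercept):

lasso <- glmnet(train.X[, -1], train$V4, alpha = 1)
ridge <- glmnet(train.X[, -1], train$V4, alpha = 0)
par(mfrow = c(1, 2))
plot(lasso, xvar = "lambda")
plot(ridge, xvar = "lambda")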

  2. Select the shrinkage parameter for lasso regression with cv.glmnet.
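
A sketch (the seed is an assumption, for reproducible folds):

set.seed(1234)  # assumed seed
lasso.cv <- cv.glmnet(train.X[, -1], train$V4, alpha = 1)
plot(lasso.cv)
lasso.cv$lambda.min  # lambda minimizing the cross-validation error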

  3. Do the same for ridge regression. Explain the problem and solve it.
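
A sketch of one common issue: the cross-validation curve may attain its minimum at a boundary of the default lambda grid, in which case a user-supplied grid is needed (the bounds below are assumptions to adapt after inspecting the plot):

ridge.cv <- cv.glmnet(train.X[, -1], train$V4, alpha = 0)
plot(ridge.cv)  # check whether the minimum sits at a grid boundary
# if so, supply a wider grid of lambda values (assumed bounds)
ridge.cv <- cv.glmnet(train.X[, -1], train$V4, alpha = 0,
                      lambda = exp(seq(-4, 6, length = 100)))
plot(ridge.cv)
ridge.cv$lambda.min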

  4. Estimate the quadratic error for the selected ridge and lasso models. Compare this error with the errors of the three other models.
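
A sketch, assuming lasso.cv and ridge.cv from the previous questions:

prev.lasso <- predict(lasso.cv, newx = test.X[, -1], s = "lambda.min")
prev.ridge <- predict(ridge.cv, newx = test.X[, -1], s = "lambda.min")
c(lasso = mean((test$V4 - prev.lasso)^2),
  ridge = mean((test$V4 - prev.ridge)^2))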

  5. Conclude.

Exercise 4

We consider the spam dataset from the kernlab package:

data(spam)

We split the data into a training dataset of size 3000 and a test dataset of size 1601.

set.seed(1234)
perm <- sample(nrow(spam))
train <- spam[perm[1:3000],]
test <- spam[perm[3001:4601],]
train.X <- model.matrix(type~.,data=train)
test.X <- model.matrix(type~.,data=test)
  1. Fit a logistic model on the train dataset with all the variables.
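
A minimal sketch (the name full.logit is an assumption reused below):

full.logit <- glm(type ~ ., data = train, family = binomial)
summary(full.logit)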

  2. We propose to perform variable selection with a backward approach using the BIC criterion (this approach will be described in detail in other lectures). You just have to use the step function with the direction="backward" and k=log(nrow(train)) options. We call the result mod.back (it may take a few minutes).
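
A sketch, assuming the full model from question 1 is stored in full.logit:

# BIC-penalized backward selection (k = log(n)); may take a few minutes
mod.back <- step(full.logit, direction = "backward",
                 k = log(nrow(train)), trace = 0)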

  3. Fit a logistic lasso model on the training data (select the shrinkage parameter with cv.glmnet).

  4. Fit a logistic ridge model on the training data (select the shrinkage parameter with cv.glmnet).
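
A sketch for questions 3 and 4 (the intercept column of the design matrices is dropped; the seed is an assumption, for reproducible folds):

set.seed(1234)  # assumed seed
lasso.cv <- cv.glmnet(train.X[, -1], train$type, family = "binomial",
                      alpha = 1)
ridge.cv <- cv.glmnet(train.X[, -1], train$type, family = "binomial",
                      alpha = 0)
c(lasso = lasso.cv$lambda.min, ridge = ridge.cv$lambda.min)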

  5. Fit a nearest neighbor rule on the training data (use caret to select the number of neighbors by 10-fold cross-validation).
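
A sketch with caret (the grid of k values and the seed are assumptions; since doMC is loaded, registerDoMC can parallelize the cross-validation):

# registerDoMC(cores = 4)  # optional parallel backend from doMC
set.seed(1234)  # assumed seed
ctrl <- trainControl(method = "cv", number = 10)
knn.caret <- train(type ~ ., data = train, method = "knn", trControl = ctrl,
                   tuneGrid = data.frame(k = seq(1, 101, by = 10)))
knn.caret$bestTune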

  6. Compare the 5 methods in terms of error probability (estimated on the test dataset).
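
A sketch, assuming the five fitted models above; each method is summarized by its estimated probability of spam on the test set:

prev <- data.frame(
  full = predict(full.logit, newdata = test, type = "response"),
  back = predict(mod.back, newdata = test, type = "response"),
  lasso = as.vector(predict(lasso.cv, newx = test.X[, -1],
                            s = "lambda.min", type = "response")),
  ridge = as.vector(predict(ridge.cv, newx = test.X[, -1],
                            s = "lambda.min", type = "response")),
  knn = predict(knn.caret, newdata = test, type = "prob")[, "spam"]
)
# misclassification rates with the 0.5 threshold
apply(prev, 2, function(p) mean((p > 0.5) != (test$type == "spam")))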

  7. Compare the 5 methods with ROC curves and AUC (estimated on the test dataset).
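
A sketch with plotROC, assuming the prev data.frame from the previous question:

df.roc <- melt(data.frame(prev, obs = as.numeric(test$type == "spam")),
               id.vars = "obs", variable.name = "method",
               value.name = "score")
roc.plot <- ggplot(df.roc, aes(d = obs, m = score, color = method)) +
  geom_roc(n.cuts = 0)
roc.plot
calc_auc(roc.plot)  # one AUC per method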
