We need the following packages:

library(tidyverse)
library(leaps)
library(glmnet)
library(mlbench)
library(rpart)
library(rpart.plot)
library(visNetwork)
library(ISLR)
library(kernlab)
library(randomForest)
library(plotROC)
library(pROC)
library(ranger)
library(caret)

Trees

Exercise 1 (tree regression)

We consider the following dataset

n <- 50
set.seed(1234)
X <- runif(n)
set.seed(5678)
Y <- 1*X*(X<=0.6)+(-1*X+3.2)*(X>0.6)+rnorm(n,sd=0.1)
data1 <- data.frame(X,Y)
ggplot(data1)+aes(x=X,y=Y)+geom_point()+theme_classic()
  1. Fit a tree to explain \(Y\) by \(X\) (use rpart).

  2. Draw the tree with prp and rpart.plot.

  3. Write the tree regression function.

  4. Add on the graph of question 1 the partition defined by the tree and the prediction.
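A possible sketch for these questions (the object name tree1 is hypothetical; it assumes data1 from the chunk above):

```r
library(rpart)
library(rpart.plot)
# fit a regression tree of Y on X
tree1 <- rpart(Y ~ X, data = data1)
prp(tree1)          # draw the tree with prp
rpart.plot(tree1)   # draw the tree with rpart.plot
# add the partition and the predictions on the scatter plot;
# reading the split points from tree1$splits works here because
# X is the only predictor (no competing or surrogate splits)
splits <- sort(tree1$splits[, "index"])
data1$pred <- predict(tree1, newdata = data1)
ggplot(data1) + aes(x = X, y = Y) + geom_point() +
  geom_vline(xintercept = splits, linetype = "dashed") +
  geom_point(aes(y = pred), color = "red") + theme_classic()
```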

Exercise 2 (classification tree)

We consider the following dataset

n <- 50
set.seed(12345)
X1 <- runif(n)
set.seed(5678)
X2 <- runif(n)
Y <- rep(0,n)
set.seed(54321)
Y[X1<=0.45] <- rbinom(sum(X1<=0.45),1,0.85)
set.seed(52432)
Y[X1>0.45] <- rbinom(sum(X1>0.45),1,0.15)
data2 <- data.frame(X1,X2,Y)
library(ggplot2)
ggplot(data2)+aes(x=X1,y=X2,color=Y)+geom_point(size=2)+scale_x_continuous(name="")+scale_y_continuous(name="")+theme_classic()
  1. Fit a tree to explain \(Y\) by \(X_1\) and \(X_2\). Draw the tree. What happens?

  2. Write the classification rule and the score function induced by the tree.

The classification rule is \[\widehat g(x)=\mathbf{1}_{X_1<0.44}.\]

The score function is \[\widehat S(x)=\widehat P(Y=1|X=x)=0.83\mathbf{1}_{X_1<0.44}+0.07\mathbf{1}_{X_1\geq 0.44}.\]

  1. Add on the first graph the partition defined by the tree.
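One possible sketch (hypothetical name tree2, assuming data2 from above). Note that Y is coded 0/1 as a numeric variable, so it must be converted to a factor (or method="class" used) to obtain a classification tree rather than a regression tree:

```r
# classification tree: Y has to be treated as a factor
tree2 <- rpart(factor(Y) ~ X1 + X2, data = data2)
prp(tree2)
# only X1 is used by the tree: add the vertical split on the first graph
s <- tree2$splits[1, "index"]   # primary split at the root node
ggplot(data2) + aes(x = X1, y = X2, color = factor(Y)) +
  geom_point(size = 2) +
  geom_vline(xintercept = s, linetype = "dashed") + theme_classic()
```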

Exercise 3 (categorical input)

We consider the following dataset

n <- 100
X <- factor(rep(c("A","B","C","D"),n))
Y <- rep(0,length(X))   # initialise Y (it is not yet defined in this chunk)
set.seed(1234)
Y[X=="A"] <- rbinom(sum(X=="A"),1,0.9)
Y[X=="B"] <- rbinom(sum(X=="B"),1,0.25)
Y[X=="C"] <- rbinom(sum(X=="C"),1,0.8)
Y[X=="D"] <- rbinom(sum(X=="D"),1,0.2)
Y <- as.factor(Y)
data3 <- data.frame(X,Y)
  1. Fit a tree to explain \(Y\) by \(X\).

  2. Explain how the tree is fitted in this context.
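A possible sketch for the first question (hypothetical name tree3, assuming data3 from above):

```r
tree3 <- rpart(Y ~ X, data = data3)
prp(tree3)
# with a categorical input, rpart searches over groupings of the levels;
# for a binary response the levels can be ordered by the proportion of
# Y = 1 within each level, which reduces the search to ordered splits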

Exercise 4 (pruning)

We consider the dataset Carseats from the ISLR package.

data(Carseats)

The problem is to explain the Sales variable by the other variables.

  1. Fit a tree with rpart to explain Sales by the other variables.

  2. Explain the output of the printcp command.

printcp(tree)
  3. Draw the tree with 8 splits (use prune).
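One way to do this (it assumes the fitted rpart object is named tree, as in the printcp call above, and that its cp table contains a row with nsplit equal to 8):

```r
# pick the cp value of the row of the cp table with 8 splits,
# then prune back to the corresponding subtree
cp8 <- tree$cptable[tree$cptable[, "nsplit"] == 8, "CP"]
tree8 <- prune(tree, cp = cp8)
prp(tree8)
```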

  4. Remark: the visTree function from the visNetwork package allows one to draw interactive graphs

visTree(tree)

A Shiny web application is also available to visualise the sequence of subtrees

visTreeEditor(Carseats)
  5. Split the dataset into a training set of size 250 and a test set of size 150.
n.train <- 250
set.seed(12345)
perm <- sample(nrow(Carseats))
train <- Carseats[perm[1:n.train],]
test <- Carseats[-perm[1:n.train],]
  6. We fit a sequence of trees on the train sample with
set.seed(12345)
tree <- rpart(Sales~.,data=train,cp=0.00000001,minsplit=2)

In this sequence of trees, select

  • a very simple tree (with 2 or 3 splits)
  • a very large tree
  • an optimal tree (with the classical pruning strategy)
  7. Estimate the quadratic error of these 3 trees with the test sample.
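One possible sketch for the selection and the test errors (assuming the rpart object tree and the train/test split from the chunks above; the names cp.small, cp.large, cp.opt and mse are hypothetical):

```r
cptab <- tree$cptable
# a very simple tree: first row of the cp table with 2 or 3 splits
cp.small <- cptab[cptab[, "nsplit"] %in% c(2, 3), "CP"][1]
# a very large tree: last (deepest) row of the table
cp.large <- cptab[nrow(cptab), "CP"]
# the classical pruning strategy: minimise the cross-validated error
cp.opt <- cptab[which.min(cptab[, "xerror"]), "CP"]
# quadratic (mean squared) error of each pruned tree on the test sample
mse <- sapply(c(cp.small, cp.large, cp.opt), function(cp) {
  pruned <- prune(tree, cp = cp)
  mean((test$Sales - predict(pruned, newdata = test))^2)
})
mse
```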

Random Forest

Exercise 5 (random Forest)

We again consider the spam dataset from package kernlab.

data(spam)
  1. Explain the following graph.
rf1 <- randomForest(type~.,data=spam)
plot(rf1)
  2. Fit another random forest with mtry=1. Make a comparison between the two forests.

  3. Split the data into a training set of size 3000 and a test set of size 1601.

  4. Fit 2 random forests on the training set: one with the default value of mtry and one with mtry=1.

  5. Estimate the misclassification error of the 2 RF with the test set.
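A sketch for the split, the two forests and their test errors (the seed and the names spam.train, spam.test, rf.def, rf.m1 are hypothetical choices):

```r
set.seed(1234)                      # an arbitrary seed
perm <- sample(nrow(spam))          # spam has 4601 rows
spam.train <- spam[perm[1:3000], ]
spam.test  <- spam[perm[3001:4601], ]
# default mtry (about sqrt(p) for classification) versus mtry = 1
rf.def <- randomForest(type ~ ., data = spam.train)
rf.m1  <- randomForest(type ~ ., data = spam.train, mtry = 1)
# misclassification rates on the test set
mean(predict(rf.def, spam.test) != spam.test$type)
mean(predict(rf.m1,  spam.test) != spam.test$type)
```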

  6. Use the caret package to select the mtry parameter. You can select this parameter in the grid seq(1,30,by=5).
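A possible caret call (the resampling scheme is one choice among others; it assumes a training sample stored in a data frame named spam.train):

```r
grid <- data.frame(mtry = seq(1, 30, by = 5))
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold CV, for instance
set.seed(1234)
rf.caret <- train(type ~ ., data = spam.train, method = "rf",
                  trControl = ctrl, tuneGrid = grid)
rf.caret$bestTune   # selected value of mtry
```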

  7. Fit a tree on the train sample.

  8. Draw ROC curves and compute AUC for the 2 random forests and the tree.
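A sketch with pROC (it assumes the two forests rf.def and rf.m1, a tree tree.spam fitted on the training sample, and the test sample spam.test; all these names are hypothetical):

```r
# scores: estimated probability of the "spam" class
score.def  <- predict(rf.def,  spam.test, type = "prob")[, "spam"]
score.m1   <- predict(rf.m1,   spam.test, type = "prob")[, "spam"]
score.tree <- predict(tree.spam, spam.test, type = "prob")[, "spam"]
roc.def  <- roc(spam.test$type, score.def)
roc.m1   <- roc(spam.test$type, score.m1)
roc.tree <- roc(spam.test$type, score.tree)
plot(roc.def)
plot(roc.m1,   add = TRUE, col = "red")
plot(roc.tree, add = TRUE, col = "blue")
auc(roc.def); auc(roc.m1); auc(roc.tree)
```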

  9. Represent the 10 most important variables of the best random forest with a bar chart.
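One way to do this (assuming the best forest is stored in rf.def, a hypothetical name):

```r
imp <- importance(rf.def)            # MeanDecreaseGini by default
imp.df <- data.frame(variable = rownames(imp), importance = imp[, 1])
imp.df <- imp.df %>% arrange(desc(importance)) %>% slice(1:10)
ggplot(imp.df) + aes(x = reorder(variable, importance), y = importance) +
  geom_col() + coord_flip() + xlab("") + theme_classic()
```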

  10. Fit a forest on the train dataset with the ranger function of the ranger package. What do you notice?
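A sketch (it assumes a training sample named spam.train; timings will vary by machine, but ranger is generally noticeably faster):

```r
system.time(rf.ranger <- ranger(type ~ ., data = spam.train))
system.time(rf.class  <- randomForest(type ~ ., data = spam.train))
```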

Exercise 6

Make a comparison between random forests, ridge and lasso for the spam dataset. To do that, you will compute the classical risks (error probability, ROC…) by 10-fold cross-validation. Be careful with the selection of the parameters.
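A skeleton for the cross-validation loop, shown here for the random forest only (the ridge and lasso parts would fit cv.glmnet inside the same loop; the seed and object names are hypothetical):

```r
set.seed(1234)
folds <- caret::createFolds(spam$type, k = 10)   # list of 10 test-index sets
err.rf <- sapply(folds, function(idx) {
  rf <- randomForest(type ~ ., data = spam[-idx, ])
  mean(predict(rf, spam[idx, ]) != spam$type[idx])
})
mean(err.rf)   # cross-validated error probability of the forest
```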
