Trees
Exercise 1 (tree regression)
We consider the following dataset
n <-50
set.seed(1234)
X <- runif(n)
set.seed(5678)
Y <- 1*X*(X<=0.6)+(-1*X+3.2)*(X>0.6)+rnorm(n,sd=0.1)
data1 <- data.frame(X,Y)
ggplot(data1)+aes(x=X,y=Y)+geom_point()+theme_classic()
Fit a tree to explain \(Y\) by \(X\) (use rpart).
Draw the tree with prp and rpart.plot.
Write the tree regression function.
Add on the graph of question 1 the partition defined by the tree and the prediction.
Exercise 2 (classification tree)
We consider the following dataset
n <- 50
set.seed(12345)
X1 <- runif(n)
set.seed(5678)
X2 <- runif(n)
Y <- rep(0,n)
set.seed(54321)
Y[X1<=0.45] <- rbinom(sum(X1<=0.45),1,0.85)
set.seed(52432)
Y[X1>0.45] <- rbinom(sum(X1>0.45),1,0.15)
data2 <- data.frame(X1,X2,Y)
library(ggplot2)
ggplot(data2)+aes(x=X1,y=X2,color=Y)+geom_point(size=2)+scale_x_continuous(name="")+scale_y_continuous(name="")+theme_classic()
Fit a tree to explain \(Y\) by \(X_1\) and \(X_2\). Draw the tree. What happens?
Write the classification rule and the score function induced by the tree.
The classification rule is \[\widehat g(x)=\mathbf{1}_{X_1<0.44}.\]
The score function is \[\widehat S(x)=\widehat P(Y=1|X=x)=0.83\mathbf{1}_{X_1<0.44}+0.07\mathbf{1}_{X_1\geq 0.44}.\]
- Add on the first graph the partition defined by the tree.
Exercise 4 (pruning)
We consider the dataset Carseats from the ISLR package.
data(Carseats)
The problem is to explain the Sales variable by the other variables.
Fit a tree with rpart to explain Sales by the other variables.
Explain the output of the printcp command
printcp(tree)
Draw the tree with 8 split (use prune).
Remark : visTree function from visNetwork package allows to draw interactive graphs
visTree(tree)
A shiny web application is also proposed to visualise the sequence of subtrees
visTreeEditor(Carseats)
- Split the dataset into a training set of size 250 and a test set of size 150.
n.train <- 250
set.seed(12345)
perm <- sample(nrow(Carseats))
train <- Carseats[perm[1:n.train],]
test <- Carseats[-perm[1:n.train],]
- We fit a sequence of trees on the train sample with
set.seed(12345)
tree <- rpart(Sales~.,data=train,cp=0.00000001,minsplit=2)
In this sequence of tree, select
- a very simple tree (with 2 or 3 splits)
- a very large tree
- an optimal tree (with the classical pruning strategy)
- Estimate the quadratic error of these 3 trees with the test sample.
Random Forest
Exercise 5 (random Forest)
We again consider the spam dataset from package kernlab.
data(spam)
- Explain the following graph
rf1 <- randomForest(type~.,data=spam)
plot(rf1)
Fit anoter random forest with mtry=1. Make a comparison between the two forests.
Split the data into a training set of size 3000 and a test set of size 1601.
Fit 2 random forest on the training set: one with the default value of mtry and one with mtry=1.
Estimate the misclassification error of the 2 RF with the test set.
Use caret package to select mtry parameter. You can select this parameter in the grid seq(1,30,by=5).
Fit a tree on the train sample.
Draw ROC curves and compute AUC for the 2 random forests and the tree.
Represent the 10 most important variables of the best random forest with a bar chart.
Fit a forest on the train dataset with the ranger function of the ranger package. What do you notice?
Exercice 6
Make a comparison between random Forest, ridge and lasso for the spam dataset. To do that, you will compute the classical risks (error probability, ROC…) by 10-folds cross validation. Be carefull wih the selection of the parameters.
LS0tCnRpdGxlOiAiVHJlZXMgYW5kIHJhbmRvbSBmb3Jlc3QiCm91dHB1dDogCiAgaHRtbF9ub3RlYm9vazogCiAgICBjc3M6IH4vRHJvcGJveC9GSUNISUVSU19TVFlMRS9zdHlsZXMuY3NzCiAgICB0b2M6IHllcwogICAgdG9jX2Zsb2F0OiB5ZXMKLS0tCgpXZSBuZWVkIHRoZSBmb2xsb3dpbmcgcGFja2FnZXM6CgpgYGB7ciBtZXNzYWdlPUZBTFNFLCB3YXJuaW5nPUZBTFNFfQpsaWJyYXJ5KHRpZHl2ZXJzZSkKbGlicmFyeShsZWFwcykKbGlicmFyeShnbG1uZXQpCmxpYnJhcnkobWxiZW5jaCkKbGlicmFyeShycGFydCkKbGlicmFyeShycGFydC5wbG90KQpsaWJyYXJ5KHZpc05ldHdvcmspCmxpYnJhcnkoSVNMUikKbGlicmFyeShrZXJubGFiKQpsaWJyYXJ5KHJhbmRvbUZvcmVzdCkKbGlicmFyeShwbG90Uk9DKQpsaWJyYXJ5KHBST0MpCmxpYnJhcnkocmFuZ2VyKQpsaWJyYXJ5KGNhcmV0KQpgYGAKCgojIFRyZWVzCgojIyBFeGVyY2lzZSAxICh0cmVlIHJlZ3Jlc3Npb24pCgpXZSBjb25zaWRlciB0aGUgZm9sbG93aW5nIGRhdGFzZXQKCmBgYHtyfQpuIDwtNTAKc2V0LnNlZWQoMTIzNCkKWCA8LSBydW5pZihuKQpzZXQuc2VlZCg1Njc4KQpZIDwtIDEqWCooWDw9MC42KSsoLTEqWCszLjIpKihYPjAuNikrcm5vcm0obixzZD0wLjEpCmRhdGExIDwtIGRhdGEuZnJhbWUoWCxZKQpnZ3Bsb3QoZGF0YTEpK2Flcyh4PVgseT1ZKStnZW9tX3BvaW50KCkrdGhlbWVfY2xhc3NpYygpCmBgYAoKMS4gRml0IGEgdHJlZSB0byBleHBsYWluICRZJCBieSAkWCQgKHVzZSAqKnJwYXJ0KiopLgoKCjIuIERyYXcgdGhlIHRyZWUgd2l0aCAqKnBycCoqIGFuZCAqKnJwYXJ0LnBsb3QqKi4KCgozLiBXcml0ZSB0aGUgdHJlZSByZWdyZXNzaW9uIGZ1bmN0aW9uLgoKCjQuIEFkZCBvbiB0aGUgZ3JhcGggb2YgcXVlc3Rpb24gMSB0aGUgcGFydGl0aW9uIGRlZmluZWQgYnkgdGhlIHRyZWUgYW5kIHRoZSBwcmVkaWN0aW9uLgoKCgojIyBFeGVyY2lzZSAyIChjbGFzc2lmaWNhdGlvbiB0cmVlKQoKCldlIGNvbnNpZGVyIHRoZSBmb2xsb3dpbmcgZGF0YXNldAoKYGBge3J9Cm4gPC0gNTAKc2V0LnNlZWQoMTIzNDUpClgxIDwtIHJ1bmlmKG4pCnNldC5zZWVkKDU2NzgpClgyIDwtIHJ1bmlmKG4pClkgPC0gcmVwKDAsbikKc2V0LnNlZWQoNTQzMjEpCllbWDE8PTAuNDVdIDwtIHJiaW5vbShzdW0oWDE8PTAuNDUpLDEsMC44NSkKc2V0LnNlZWQoNTI0MzIpCllbWDE+MC40NV0gPC0gcmJpbm9tKHN1bShYMT4wLjQ1KSwxLDAuMTUpCmRhdGEyIDwtIGRhdGEuZnJhbWUoWDEsWDIsWSkKbGlicmFyeShnZ3Bsb3QyKQpnZ3Bsb3QoZGF0YTIpK2Flcyh4PVgxLHk9WDIsY29sb3I9WSkrZ2VvbV9wb2ludChzaXplPTIpK3NjYWxlX3hfY29udGludW91cyhuYW1lPSIiKStzY2FsZV95X2NvbnRpbnVvdXMobmFtZT0iIikrdGhlbWVfY2xhc3NpYygpCmBgYAoKMS4gRml0IGEgdHJlZSB0byBleHBsYWluICRZJCBieSAkWF8xJCBhbmQgJFhfMiQuIERyYXcgdGhlIHRyZWUuIFdoYXQgaGFwcGVucz8KCgoyLiBXcml0ZSB0aGUgY2xhc3NpZmljYXRpb24gcnVsZSBhbmQgdGhlIHNjb3JlIGZ1bmN0aW9uIGluZHVjZWQgYnkgdGhlIHRyZWUuCgpUaGUgY2xhc3NpZmljYXRpb24gcnVsZSBpcwokJFx3aWRlaGF0IGcoeCk9XG1hdGhiZnsxfV97WF8xPDAuNDR9LiQkCgpUaGUgc2NvcmUgZnVuY3Rpb24gaXMKJCRcd2lkZWhhdCBTKHgpPVx3aWRlaGF0IFAoWT0xfFg9eCk9MC44M1xtYXRoYmZ7MX1fe1hfMTwwLjQ0fSswLjA3XG1hdGhiZnsxfV97WF8xXGdlcSAwLjQ0fS4kJAoKNC4gQWRkIG9uIHRoZSBmaXJzdCBncmFwaCB0aGUgcGFydGl0aW9uIGRlZmluZWQgYnkgdGhlIHRyZWUuCgoKIyMgRXhlcmNpc2UgMyAoY2F0ZWdvcmljYWwgaW5wdXQpCgpXZSBjb25zaWRlciB0aGUgZm9sbG93aW5nIGRhdGFzZXQKCmBgYHtyfQpuIDwtIDEwMApYIDwtIGZhY3RvcihyZXAoYygiQSIsIkIiLCJDIiwiRCIpLG4pKQpzZXQuc2VlZCgxMjM0KQpZW1g9PSJBIl0gPC0gcmJpbm9tKHN1bShYPT0iQSIpLDEsMC45KQpZW1g9PSJCIl0gPC0gcmJpbm9tKHN1bShYPT0iQiIpLDEsMC4yNSkKWVtYPT0iQyJdIDwtIHJiaW5vbShzdW0oWD09IkMiKSwxLDAuOCkKWVtYPT0iRCJdIDwtIHJiaW5vbShzdW0oWD09IkQiKSwxLDAuMikKWSA8LSBhcy5mYWN0b3IoWSkKZGF0YTMgPC0gZGF0YS5mcmFtZShYLFkpCmBgYAoKMS4gRml0IGEgdHJlZSB0byBleHBsYWluICRZJCBieSAkWCQuCgoyLiBFeHBsYWluIGhvdyB0aGUgdHJlZSBpcyBmaXR0ZWQgaW4gdGhpcyBjb250ZXh0LgoKIyMgRXhlcmNpc2UgNCAocHJ1bmluZykKCldlIGNvbnNpZGVyIHRoZSBkYXRhc2V0ICoqQ2Fyc2VhdHMqKiBmcm9tIHRoZSAqKklTTFIqKiBwYWNrYWdlLgoKYGBge3J9CmRhdGEoQ2Fyc2VhdHMpCmBgYAoKVGhlIHByb2JsZW0gaXMgdG8gZXhwbGFpbiB0aGUgKipTYWxlcyoqIHZhcmlhYmxlIGJ5IHRoZSBvdGhlciB2YXJpYWJsZXMuCgoxLiBGaXQgYSB0cmVlIHdpdGggKipycGFydCoqIHRvIGV4cGxhaW4gKipTYWxlcyoqIGJ5IHRoZSBvdGhlciB2YXJpYWJsZXMuCgoKMi4gRXhwbGFpbiB0aGUgb3V0cHV0IG9mIHRoZSAqKnByaW50Y3AqKiBjb21tYW5kCgpgYGB7cn0KcHJpbnRjcCh0cmVlKQpgYGAKCgozLiBEcmF3IHRoZSB0cmVlIHdpdGggOCBzcGxpdCAodXNlICoqcHJ1bmUqKikuCgoKNC4gKipSZW1hcmsqKiA6ICoqdmlzVHJlZSoqIGZ1bmN0aW9uIGZyb20gKip2aXNOZXR3b3JrKiogcGFja2FnZSBhbGxvd3MgdG8gZHJhdyBpbnRlcmFjdGl2ZSBncmFwaHMKYGBge3J9CnZpc1RyZWUodHJlZSkKYGBgCgpBIHNoaW55IHdlYiBhcHBsaWNhdGlvbiBpcyBhbHNvIHByb3Bvc2VkIHRvIHZpc3VhbGlzZSB0aGUgc2VxdWVuY2Ugb2Ygc3VidHJlZXMKCgpgYGB7cixldmFsPUZBTFNFLGluY2x1ZGU9VFJVRX0KdmlzVHJlZUVkaXRvcihDYXJzZWF0cykKYGBgCgoKNS4gU3BsaXQgdGhlIGRhdGFzZXQgaW50byBhIHRyYWluaW5nIHNldCBvZiBzaXplIDI1MCBhbmQgYSB0ZXN0IHNldCBvZiBzaXplIDE1MC4KCmBgYHtyfQpuLnRyYWluIDwtIDI1MApzZXQuc2VlZCgxMjM0NSkKcGVybSA8LSBzYW1wbGUobnJvdyhDYXJzZWF0cykpCnRyYWluIDwtIENhcnNlYXRzW3Blcm1bMTpuLnRyYWluXSxdCnRlc3QgPC0gQ2Fyc2VhdHNbLXBlcm1bMTpuLnRyYWluXSxdCmBgYAoKNi4gV2UgZml0IGEgc2VxdWVuY2Ugb2YgdHJlZXMgb24gdGhlIHRyYWluIHNhbXBsZSB3aXRoCmBgYHtyfQpzZXQuc2VlZCgxMjM0NSkKdHJlZSA8LSBycGFydChTYWxlc34uLGRhdGE9dHJhaW4sY3A9MC4wMDAwMDAwMSxtaW5zcGxpdD0yKQpgYGAKCkluIHRoaXMgc2VxdWVuY2Ugb2YgdHJlZSwgc2VsZWN0CgogIC0gYSB2ZXJ5IHNpbXBsZSB0cmVlICh3aXRoIDIgb3IgMyBzcGxpdHMpCiAgLSBhIHZlcnkgbGFyZ2UgdHJlZQogIC0gYW4gb3B0aW1hbCB0cmVlICh3aXRoIHRoZSBjbGFzc2ljYWwgcHJ1bmluZyBzdHJhdGVneSkKICAKCjcuIEVzdGltYXRlIHRoZSBxdWFkcmF0aWMgZXJyb3Igb2YgdGhlc2UgMyB0cmVlcyB3aXRoIHRoZSB0ZXN0IHNhbXBsZS4KCgojIFJhbmRvbSBGb3Jlc3QKCiMjIEV4ZXJjaXNlIDUgKHJhbmRvbSBGb3Jlc3QpCgpXZSBhZ2FpbiBjb25zaWRlciB0aGUgKipzcGFtKiogZGF0YXNldCBmcm9tIHBhY2thZ2UgKiprZXJubGFiKiouCgpgYGB7cn0KZGF0YShzcGFtKQpgYGAKCjEuIEV4cGxhaW4gdGhlIGZvbGxvd2luZyBncmFwaAoKYGBge3J9CnJmMSA8LSByYW5kb21Gb3Jlc3QodHlwZX4uLGRhdGE9c3BhbSkKcGxvdChyZjEpCmBgYAoKCjIuIEZpdCBhbm90ZXIgcmFuZG9tIGZvcmVzdCB3aXRoICoqbXRyeT0xKiouIE1ha2UgYSBjb21wYXJpc29uIGJldHdlZW4gdGhlIHR3byBmb3Jlc3RzLgoKCjMuIFNwbGl0IHRoZSBkYXRhIGludG8gYSB0cmFpbmluZyBzZXQgb2Ygc2l6ZSAzMDAwIGFuZCBhIHRlc3Qgc2V0IG9mIHNpemUgMTYwMS4KCjQuIEZpdCAyIHJhbmRvbSBmb3Jlc3Qgb24gdGhlIHRyYWluaW5nIHNldDogb25lIHdpdGggdGhlIGRlZmF1bHQgdmFsdWUgb2YgKiptdHJ5KiogYW5kIG9uZSB3aXRoICoqbXRyeT0xKiouCgo1LiBFc3RpbWF0ZSB0aGUgbWlzY2xhc3NpZmljYXRpb24gZXJyb3Igb2YgdGhlIDIgUkYgd2l0aCB0aGUgdGVzdCBzZXQuCgo2LiBVc2UgKipjYXJldCoqIHBhY2thZ2UgdG8gc2VsZWN0ICoqbXRyeSoqIHBhcmFtZXRlci4gWW91IGNhbiBzZWxlY3QgdGhpcyBwYXJhbWV0ZXIgaW4gdGhlIGdyaWQgKipzZXEoMSwzMCxieT01KSoqLgoKNy4gRml0IGEgdHJlZSBvbiB0aGUgdHJhaW4gc2FtcGxlLgoKCgo4LiBEcmF3IFJPQyBjdXJ2ZXMgYW5kIGNvbXB1dGUgQVVDIGZvciB0aGUgMiByYW5kb20gZm9yZXN0cyBhbmQgdGhlIHRyZWUuCgoKOS4gUmVwcmVzZW50IHRoZSAxMCBtb3N0IGltcG9ydGFudCB2YXJpYWJsZXMgb2YgdGhlIGJlc3QgcmFuZG9tIGZvcmVzdCB3aXRoIGEgYmFyIGNoYXJ0LgoKCjEwLiBGaXQgYSBmb3Jlc3Qgb24gdGhlIHRyYWluIGRhdGFzZXQgd2l0aCB0aGUgKipyYW5nZXIqKiBmdW5jdGlvbiBvZiB0aGUgKipyYW5nZXIqKiBwYWNrYWdlLiBXaGF0IGRvIHlvdSBub3RpY2U/CgoKCiMjIEV4ZXJjaWNlIDYKCk1ha2UgYSBjb21wYXJpc29uIGJldHdlZW4gKipyYW5kb20gRm9yZXN0KiosICoqcmlkZ2UqKiBhbmQgKipsYXNzbyoqIGZvciB0aGUgKipzcGFtKiogZGF0YXNldC4gVG8gZG8gdGhhdCwgeW91IHdpbGwgY29tcHV0ZSB0aGUgY2xhc3NpY2FsIHJpc2tzIChlcnJvciBwcm9iYWJpbGl0eSwgUk9DLi4uKSBieSAxMC1mb2xkcyBjcm9zcyB2YWxpZGF0aW9uLiBCZSBjYXJlZnVsbCB3aWggdGhlIHNlbGVjdGlvbiBvZiB0aGUgcGFyYW1ldGVycy4KCg==