Exercise 1

We consider \(n\) independent observations \((X_1,Y_1),\dots,(X_n,Y_n)\) with the same distribution \(\mathbf P\). \(X_1\) takes values in \(\mathbb R^d\) and \(Y_1\) in \(\{-1,1\}\).

  1. Recall the Bayes rule \(g^\star\) and the Bayes error \(L^\star\).

The Bayes rule is defined by \(g^\star(x)=1\) if \(\mathbf P(Y=1|X=x)\geq 1/2\) and \(g^\star(x)=-1\) otherwise. The Bayes error is the error of the Bayes rule: \(L^\star=L(g^\star)=\mathbf P(g^\star(X)\neq Y)\).
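For intuition, here is a minimal Python sketch of the Bayes rule. It assumes the regression function \(\eta(x)=\mathbf P(Y=1|X=x)\) is known, which is never the case in practice, and the logistic choice of \(\eta\) below is purely illustrative.

```python
import numpy as np

def bayes_rule(eta, x):
    """Bayes classifier for labels in {-1, 1}: predict 1 when eta(x) >= 1/2."""
    return np.where(eta(x) >= 0.5, 1, -1)

# Purely illustrative regression function eta(x) = P(Y = 1 | X = x).
eta = lambda x: 1 / (1 + np.exp(-x))
print(bayes_rule(eta, np.array([-1.0, 0.0, 2.0])))   # [-1  1  1]
```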

  2. Let \(g\) be a classification rule. Show that \[\mathbf P(g(X)\neq Y|X=x)=1-(\mathbf 1_{g(x)=1}\eta(x)+\mathbf 1_{g(x)=-1}(1-\eta(x)))\] where \(\eta(x)=\mathbf P(Y=1|X=x)\).

Conditionally on \(X=x\), \(g(X)=g(x)\) is deterministic, so
\[ \begin{align*} \mathbf P(g(X)\neq Y|X=x) & = 1-\mathbf P(g(X)=Y|X=x) \\ & = 1-\left(\mathbf P(g(X)=Y,g(X)=1|X=x)+\mathbf P(g(X)=Y,g(X)=-1|X=x)\right) \\ & = 1-\left(\mathbf 1_{g(x)=1}\mathbf P(Y=1|X=x)+\mathbf 1_{g(x)=-1}\mathbf P(Y=-1|X=x)\right) \\ & = 1-(\mathbf 1_{g(x)=1}\eta(x)+\mathbf 1_{g(x)=-1}(1-\eta(x))). \end{align*} \]

  3. Deduce that, for every classification rule \(g\) and all \(x\in\mathbb R^d\), \[\mathbf P(g(X)\neq Y|X=x)-\mathbf P(g^\star(X)\neq Y|X=x)\geq 0.\]

Using the previous question, we get \[\begin{align*} \mathbf P(g(X)\neq Y|X=x)- & \mathbf P(g^\star(X)\neq Y|X=x) \\ & = \eta(x)\left(\mathbf 1_{g^\star(x)=1}-\mathbf 1_{g(x)=1}\right)+(1-\eta(x)) \left(\mathbf 1_{g^\star(x)=-1}-\mathbf 1_{g(x)=-1}\right) \\ & = (2\eta(x)-1) \left(\mathbf 1_{g^\star(x)=1}-\mathbf 1_{g(x)=1}\right) \\ & \geq 0, \end{align*}\] where the second equality uses \(\mathbf 1_{g(x)=-1}=1-\mathbf 1_{g(x)=1}\), and the last inequality holds by definition of \(g^\star\): if \(\eta(x)\geq 1/2\) then \(g^\star(x)=1\) and both factors are nonnegative, while if \(\eta(x)<1/2\) then \(g^\star(x)=-1\) and both factors are nonpositive.

  4. Deduce that

\[\mathbf P(g(X)\neq Y)\geq \mathbf P(g^\star(X)\neq Y).\] It suffices to integrate the inequality of the previous question with respect to the distribution \(\mu\) of \(X\): \[\mathbf P(g(X)\neq Y)=\int\mathbf P(g(X)\neq Y|X=x)\,\mu(\mathrm dx)\geq\int\mathbf P(g^\star(X)\neq Y|X=x)\,\mu(\mathrm dx)=\mathbf P(g^\star(X)\neq Y).\]

  5. In this question, we assume that \((X,Y)\) takes values in \(\mathbb R\times\{0,1\}\), that \(X\sim\mathcal U[-2,2]\) (\(\mathcal U[a,b]\) stands for the uniform distribution on \([a,b]\)) and that \[\mathbf P(Y=1|X=x)=\begin{cases}1/5 & \text{if } x\leq 0,\\ 9/10 & \text{if } x>0.\end{cases}\]

Calculate the Bayes rule and the Bayes error.

Since \(\mathbf P(Y=1|X=x)=1/5<1/2\) for \(x\leq 0\) and \(\mathbf P(Y=1|X=x)=9/10\geq 1/2\) for \(x>0\), the Bayes rule is \(g^\star(x)=\mathbf 1_{x>0}\). We deduce the Bayes error:

\[ \begin{align*} \mathbf P(g^\star(X)\neq Y)=& \mathbf P(g^\star(X)\neq Y,X\leq 0)+\mathbf P(g^\star(X)\neq Y,X>0) \\ =&\mathbf P(X\leq 0)\mathbf P(g^\star(X)\neq Y|X\leq 0)+\mathbf P(X>0)\mathbf P(g^\star(X)\neq Y|X>0) \\ =&\frac{1}{2}\mathbf P(Y=1|X\leq 0)+\frac{1}{2}\mathbf P(Y=0|X>0) \\ =&\frac{1}{2}\frac{1}{5}+\frac{1}{2}\frac{1}{10}\\ =&\frac{3}{20} \end{align*} \]
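As a sanity check, this value can be recovered by Monte Carlo simulation; the sketch below assumes the conditional distribution stated in the question (\(\mathbf P(Y=1|X=x)=1/5\) for \(x\leq 0\) and \(9/10\) for \(x>0\)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Simulate the model of this question (labels in {0, 1}).
X = rng.uniform(-2, 2, size=n)
eta = np.where(X <= 0, 1 / 5, 9 / 10)   # P(Y = 1 | X = x)
Y = rng.binomial(1, eta)

# Bayes rule: predict 1 exactly when eta(x) >= 1/2, i.e. when x > 0.
g_star = (X > 0).astype(int)
print(np.mean(g_star != Y))             # close to 3/20 = 0.15
```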

Exercise 2 (convergence rate for a simple linear model)

We consider the linear model \[Y_i=\beta x_i+\varepsilon_i,\quad i=1,\dots,n\] where \(x_i=i/n\) and \(\varepsilon_i\) are i.i.d. with \(E[\varepsilon_i]=0\) and \(V[\varepsilon_i]=\sigma^2\).

  1. Calculate the least squares estimate \(\widehat\beta\) of \(\beta\).

The least squares criterion is \[L(\beta)=\sum_{i=1}^n(Y_i-\beta x_i)^2.\] We have \[L'(\beta)=-2\sum_{i=1}^n x_i(Y_i-\beta x_i).\] Setting \(L'(\beta)=0\), we deduce that \[\widehat\beta=\frac{\sum_{i=1}^n x_iY_i}{\sum_{i=1}^n x_i^2}.\]
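As a quick numerical check (a sketch with arbitrary simulated data), the closed-form estimate coincides with the solution returned by `np.linalg.lstsq` for this no-intercept design:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma = 100, 2.0, 0.5
x = np.arange(1, n + 1) / n
y = beta * x + rng.normal(0, sigma, size=n)

beta_hat = np.sum(x * y) / np.sum(x ** 2)                      # closed form above
beta_lstsq = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]  # same fit via lstsq
print(beta_hat, beta_lstsq)                                    # identical up to rounding
```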

  2. Calculate the bias and the variance of \(\widehat\beta\).

We have \[E[\widehat\beta]=\frac{\sum_{i=1}^n x_iE[Y_i]}{\sum_{i=1}^n x_i^2}=\frac{\beta\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2}=\beta,\] so \(\widehat\beta\) is unbiased. Since the \(Y_i\) are independent, \[V[\widehat\beta]=\frac{\sum_{i=1}^n x_i^2V[Y_i]}{(\sum_{i=1}^n x_i^2)^2}=\frac{\sigma^2}{\sum_{i=1}^nx_i^2}.\]

  3. Deduce that the quadratic risk of \(\widehat\beta\) satisfies

\[\mathcal R(\widehat\beta)=E[(\widehat\beta-\beta)^2]=O\left(\frac{1}{n}\right).\] Hint: you can calculate the sum \(\sum_{i=1}^n(i/n)^2\) on the Wolfram Alpha website.

Since \(\widehat\beta\) is unbiased, the quadratic risk reduces to the variance: \[\mathcal R(\widehat\beta)=V[\widehat\beta]=\frac{\sigma^2}{\sum_{i=1}^nx_i^2}.\] Moreover \[\sum_{i=1}^nx_i^2=\sum_{i=1}^n(i/n)^2=\frac{1}{n^2}\sum_{i=1}^ni^2=\frac{(n+1)(2n+1)}{6n}.\]
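A quick numerical check of this closed form (instead of Wolfram Alpha):

```python
import numpy as np

n = 50
x = np.arange(1, n + 1) / n
print(np.sum(x ** 2))                    # direct computation of sum (i/n)^2
print((n + 1) * (2 * n + 1) / (6 * n))   # closed form, same value
```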

Thus, for \(n\) large, \(\sum_{i=1}^nx_i^2\approx n/3\), so \[\frac{1}{\sum_{i=1}^nx_i^2}\approx \frac{3}{n}.\]

We deduce that \(\mathcal R(\widehat\beta)\) tends to 0 at the rate \(1/n\).
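The \(1/n\) rate can also be observed by simulation; the sketch below (with arbitrary values of \(\beta\) and \(\sigma\)) compares the Monte Carlo variance of \(\widehat\beta\) with the exact formula and with the approximation \(3\sigma^2/n\).

```python
import numpy as np

rng = np.random.default_rng(0)
beta, sigma, n, n_rep = 2.0, 1.0, 200, 5000

x = np.arange(1, n + 1) / n
estimates = np.empty(n_rep)
for r in range(n_rep):
    y = beta * x + rng.normal(0, sigma, size=n)
    estimates[r] = np.sum(x * y) / np.sum(x ** 2)   # least squares estimate

print(estimates.var())                  # Monte Carlo estimate of V[beta_hat]
print(sigma ** 2 / np.sum(x ** 2))      # exact variance sigma^2 / sum x_i^2
print(3 * sigma ** 2 / n)               # asymptotic approximation 3 sigma^2 / n
```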

Exercise 3 (rate for kernel estimates)

We consider the regression model \[Y_i=m(x_i)+\varepsilon_i,\quad i=1,\dots,n\] where \(x_1,\dots,x_n\in\mathbb R^d\) are fixed and \(\varepsilon_1,\dots,\varepsilon_n\) are \(n\) i.i.d. random variables with \(E[\varepsilon_i]=0\) and \(V[\varepsilon_i]=\sigma^2<+\infty\). Let \(\|\,.\,\|\) be the Euclidean norm on \(\mathbb R^d\). We define the local constant estimate of \(m\) at \(x\in\mathbb R^d\) by \[\widehat m(x)=\mathop{\mathrm{argmin}}_{a\in\mathbb R}\sum_{i=1}^n(Y_i-a)^2K\left(\frac{\|x_i-x\|}{h}\right)\] where \(h>0\) and, for \(u\in\mathbb R\), \(K(u)=\mathbf 1_{[0,1]}(u)\). We assume that \(\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)>0\).

  1. Calculate \(\widehat m(x)\).

We just have to solve \(\varphi'(a)=0\) with \[\varphi(a)=\sum_{i=1}^n(Y_i-a)^2K\left(\frac{\|x_i-x\|}{h}\right).\]

We have \[\varphi'(a)=-2\sum_{i=1}^n(Y_i-a)K\left(\frac{\|x_i-x\|}{h}\right).\] We deduce that \[\widehat m(x)=\frac{\sum_{i=1}^nY_iK\left(\frac{\|x_i-x\|}{h}\right)}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}.\]
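In words, \(\widehat m(x)\) is simply the average of the \(Y_i\) whose design points \(x_i\) fall within distance \(h\) of \(x\). A minimal Python sketch (the function name and the toy data are illustrative only):

```python
import numpy as np

def local_constant(x0, x, y, h):
    """Local constant (box-kernel) estimate of m at x0.

    x: (n, d) design points, y: (n,) responses, h: bandwidth > 0.
    Assumes at least one x_i lies within distance h of x0.
    """
    in_ball = np.linalg.norm(x - x0, axis=1) <= h   # K(||x_i - x0|| / h) in {0, 1}
    return y[in_ball].mean()

# Toy usage on a one-dimensional design.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0, 0.2, size=200)
print(local_constant(np.array([0.5]), x, y, h=0.1))  # close to sin(pi) = 0
```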

  2. Prove that \[\mathbf V[\widehat m(x)]= \frac{\sigma^2}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}\] and \[\mathbf E[\widehat m(x)]-m(x)=\frac{\sum_{i=1}^n(m(x_i)-m(x))K\left(\frac{\|x_i-x\|}{h}\right)}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}.\]

The weights \(K\left(\frac{\|x_i-x\|}{h}\right)\) are deterministic and satisfy \(K(u)^2=K(u)\) since \(K\) is an indicator. As the \(Y_i\) are independent with \(\mathbf V[Y_i]=\sigma^2\), \[\mathbf V[\widehat m(x)]=\frac{\sigma^2\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)^2}{\left(\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)\right)^2}= \frac{\sigma^2}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}.\] Moreover, since \(\mathbf E[Y_i]=m(x_i)\), \[\mathbf E[\widehat m(x)]-m(x)=\frac{\sum_{i=1}^nm(x_i)K\left(\frac{\|x_i-x\|}{h}\right)}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}-m(x)=\frac{\sum_{i=1}^n(m(x_i)-m(x))K\left(\frac{\|x_i-x\|}{h}\right)}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}.\]

  3. We moreover assume that \(m\) is a Lipschitz function, i.e., there exists \(L\) such that, for all \((u,v)\in\mathbb R^d\times\mathbb R^d\), \[|m(u)-m(v)|\leq L\|u-v\|.\] Show that \[|\textrm{bias}[\widehat m(x)]|\leq Lh.\]

Observe that \[K\left(\frac{\|x_i-x\|}{h}\right)\neq 0\] if and only if \(\|x_i-x\|\leq h\). Therefore, for all \(i=1,\dots,n\), \[|m(x_i)-m(x)|K\left(\frac{\|x_i-x\|}{h}\right)\leq L\|x_i-x\|K\left(\frac{\|x_i-x\|}{h}\right)\leq Lh\,K\left(\frac{\|x_i-x\|}{h}\right).\] Summing over \(i\) and dividing by \(\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)\) in the expression of the bias obtained in the previous question, we deduce that \(|\textrm{bias}[\widehat m(x)]|\leq Lh.\)
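This bound is easy to check numerically with the bias formula of question 2; the sketch below uses the arbitrary choice \(m(x)=\sin(3x)\) (Lipschitz with \(L=3\)) and a regular one-dimensional design.

```python
import numpy as np

L, h, x0, n = 3.0, 0.1, 0.3, 500
xi = np.linspace(0, 1, n)                       # fixed design points
m = lambda t: np.sin(3 * t)                     # Lipschitz with constant L = 3
in_ball = np.abs(xi - x0) <= h                  # box-kernel weights K(|x_i - x0| / h)
bias = np.sum((m(xi) - m(x0)) * in_ball) / np.sum(in_ball)
print(abs(bias), L * h)                         # |bias| is indeed below L * h
```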

  4. Assume that there exists \(C_1\) such that \[C_1\leq\frac{\sum_{i=1}^n\mathbf 1_{B_h}(x_i-x)}{n\textrm{Vol}(B_h)},\] where \(B_h=\{u\in\mathbb R^d:\|u\|\leq h\}\) is the ball with center \(0\) and radius \(h\) in \(\mathbb R^d\) and \(\textrm{Vol}(A)\) stands for the volume of a set \(A\subset\mathbb R^d\). Show that \[\mathbf V[\widehat m(x)]\leq\frac{C_2\sigma^2}{nh^d},\] where \(C_2\) is a constant to specify. Hint: you can use that \(\textrm{Vol}(B_h)=c_dh^d\), where \(c_d\) denotes the volume of the unit ball of \(\mathbb R^d\).

We have

\[\mathbf V[\widehat m(x)]= \frac{\sigma^2}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}=\frac{\sigma^2}{\sum_{i=1}^n\mathbf 1_{B_h}(x_i-x)}.\] Note that \[\sum_{i=1}^n\mathbf 1_{B_h}(x_i-x)\geq C_1n\textrm{Vol}(B_h)= C_1c_dnh^d.\] Thus \[\mathbf V[\widehat m(x)]\leq \frac{\sigma^2}{C_1c_dnh^d},\] so we can take \(C_2=1/(C_1c_d)\).

  5. Deduce an upper bound for the quadratic risk of \(\widehat m(x)\).

Using the bias-variance decomposition \(\mathbf E[(\widehat m(x)-m(x))^2]=\textrm{bias}[\widehat m(x)]^2+\mathbf V[\widehat m(x)]\), we deduce that \[\mathbf E[(\widehat m(x)-m(x))^2]\leq L^2h^2+\frac{C_2\sigma^2}{nh^d}.\]

  6. Calculate \(h_{opt}\), the value of \(h\) which minimises the upper bound, and deduce that, when \(h=h_{opt}\), \[\mathbf E[(\widehat m(x)-m(x))^2]=\mathrm{O}\left(n^{-\frac{2}{d+2}}\right).\]

Denote by \(M(h)=L^2h^2+\frac{C_2\sigma^2}{nh^d}\) the upper bound. We have \[M'(h)=2L^2h-\frac{dC_2\sigma^2}{n}h^{-d-1}.\] This derivative vanishes for \[h_{opt}=\left(\frac{dC_2\sigma^2}{2L^2}\right)^{\frac{1}{d+2}}n^{-\frac{1}{d+2}}.\] When \(h=h_{opt}\), both terms of \(M(h_{opt})\) are of order \(n^{-\frac{2}{d+2}}\), so the quadratic risk satisfies \[\mathbf E[(\widehat m(x)-m(x))^2]=\mathrm{O}\left(n^{-\frac{2}{d+2}}\right).\]
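A short numerical illustration of this rate (with the arbitrary constants \(L=\sigma=C_2=1\) and \(d=2\)): evaluating the upper bound \(M(h)\) at \(h_{opt}\) and dividing by \(n^{-2/(d+2)}\) gives a constant ratio.

```python
import numpy as np

# Evaluate M(h) = L^2 h^2 + C2 sigma^2 / (n h^d) at its minimiser h_opt
# and compare with the claimed rate n^(-2/(d+2)).
L, sigma, C2, d = 1.0, 1.0, 1.0, 2
for n in [10**3, 10**4, 10**5]:
    h_opt = (d * C2 * sigma**2 / (2 * L**2 * n)) ** (1 / (d + 2))
    M = L**2 * h_opt**2 + C2 * sigma**2 / (n * h_opt**d)
    print(n, M / n ** (-2 / (d + 2)))   # ratio stays constant (= 2 here)
```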