We consider \(n\) independent observations \((X_1,Y_1),\dots,(X_n,Y_n)\) with the same distribution \(\mathbf P\). \(X_1\) takes values in \(\mathbb R^d\) and \(Y_1\) in \(\{-1,1\}\).
We write \(\eta(x)=\mathbf P(Y=1|X=x)\). The Bayes rule is defined by \(g^\star(x)=1\) if \(\eta(x)\geq 1/2\) and \(g^\star(x)=-1\) otherwise. The Bayes error is defined by \(L^\star=L(g^\star)=\mathbf P(g^\star(X)\neq Y)\).
We have
\begin{align*}
\mathbf P(g(X)\neq Y|X=x) & = 1-\left(\mathbf P(g(X)=Y,g(X)=1|X=x)+\mathbf P(g(X)=Y,g(X)=-1|X=x)\right) \\
& = 1-\left(\mathbf 1_{g(x)=1}\mathbf P(Y=1|X=x)+\mathbf 1_{g(x)=-1}\mathbf P(Y=-1|X=x)\right) \\
& = 1-\left(\mathbf 1_{g(x)=1}\eta(x)+\mathbf 1_{g(x)=-1}(1-\eta(x))\right).
\end{align*}
Deduce that for all \(x\in\mathcal X\) and every classification rule \(g\), \[\mathbf P(g(X)\neq Y|X=x)-\mathbf P(g^\star(X)\neq Y|X=x)\geq 0.\] From the previous computation,
\begin{align*}
\mathbf P(g(X)\neq Y|X=x)- & \mathbf P(g^\star(X)\neq Y|X=x) \\
& = \eta(x)\left(\mathbf 1_{g^\star(x)=1}-\mathbf 1_{g(x)=1}\right)+(1-\eta(x)) \left(\mathbf 1_{g^\star(x)=-1}-\mathbf 1_{g(x)=-1}\right) \\
& = (2\eta(x)-1) \left(\mathbf 1_{g^\star(x)=1}-\mathbf 1_{g(x)=1}\right) \\
& \geq 0,
\end{align*}
where the second equality uses \(\mathbf 1_{g^\star(x)=-1}-\mathbf 1_{g(x)=-1}=-\left(\mathbf 1_{g^\star(x)=1}-\mathbf 1_{g(x)=1}\right)\), and the last inequality holds by definition of \(g^\star\): both factors are nonnegative when \(\eta(x)\geq 1/2\) and both are nonpositive otherwise.
Deduce that
\[\mathbf P(g(X)\neq Y)\geq \mathbf P(g^\star(X)\neq Y).\] It suffices to integrate the pointwise inequality with respect to the distribution of \(X\), using \[\mathbf P(g(X)\neq Y)=\int\mathbf P(g(X)\neq Y|X=x)\,\mu(dx),\] where \(\mu\) stands for the probability distribution of \(X\).
Calculate the Bayes rule and the Bayes error.
We deduce the Bayes error:
\begin{align*}
\mathbf P(g^\star(X)\neq Y)&= \mathbf P(g^\star(X)\neq Y,X\leq 0)+\mathbf P(g^\star(X)\neq Y,X>0) \\
&=\mathbf P(X\leq 0)\mathbf P(g^\star(X)\neq Y|X\leq 0)+\mathbf P(X>0)\mathbf P(g^\star(X)\neq Y|X>0) \\
&=\frac{1}{2}\mathbf P(Y=1|X\leq 0)+\frac{1}{2}\mathbf P(Y=-1|X>0) \\
&=\frac{1}{2}\cdot\frac{1}{5}+\frac{1}{2}\cdot\frac{1}{10}\\
&=\frac{3}{20}.
\end{align*}
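The value \(3/20\) can be checked by Monte Carlo. The sketch below assumes a model consistent with the probabilities used above: \(X\) symmetric about \(0\) (here uniform on \([-1,1]\), so \(\mathbf P(X\leq 0)=1/2\)), \(\mathbf P(Y=1|X\leq 0)=1/5\) and \(\mathbf P(Y=-1|X>0)=1/10\); the uniform choice for \(X\) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# X symmetric about 0, so P(X <= 0) = P(X > 0) = 1/2 (assumed uniform model).
X = rng.uniform(-1.0, 1.0, size=n)

# Conditional label probabilities matching the computation above:
# P(Y = 1 | X <= 0) = 1/5 and P(Y = -1 | X > 0) = 1/10.
p1 = np.where(X <= 0, 0.2, 0.9)
Y = np.where(rng.uniform(size=n) < p1, 1, -1)

# Bayes rule: predict the majority label on each side of 0.
g_star = np.where(X <= 0, -1, 1)

bayes_error = np.mean(g_star != Y)
print(bayes_error)  # close to 3/20 = 0.15
```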
We consider the linear model \[Y_i=\beta x_i+\varepsilon_i,\quad i=1,\dots,n\] where \(x_i=i/n\) and \(\varepsilon_i\) are i.i.d. with \(E[\varepsilon_i]=0\) and \(V[\varepsilon_i]=\sigma^2\).
The least squares criterion is \[L(\beta)=\sum_{i=1}^n(Y_i-\beta x_i)^2.\] We have \[L'(\beta)=-2\sum_{i=1}^n x_i(Y_i-\beta x_i).\] We deduce that \[\widehat\beta=\frac{\sum_{i=1}^n x_iY_i}{\sum_{i=1}^n x_i^2}.\]
We have \[E[\widehat\beta]=\frac{\sum_{i=1}^n x_iE[Y_i]}{\sum_{i=1}^n x_i^2}=\beta.\] and \[V[\widehat\beta]=\frac{\sum_{i=1}^n x_i^2V[Y_i]}{(\sum_{i=1}^n x_i^2)^2}=\frac{\sigma^2}{\sum_{i=1}^nx_i^2}.\]
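A quick numerical check of the closed-form estimator (a sketch; the values of \(\beta\), \(\sigma\) and \(n\) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, sigma = 500, 2.0, 0.5

x = np.arange(1, n + 1) / n          # design points x_i = i/n
eps = rng.normal(0.0, sigma, size=n)
Y = beta * x + eps

# Closed-form least squares estimate (model without intercept):
beta_hat = np.sum(x * Y) / np.sum(x ** 2)

# Same fit via numpy's general least squares solver, as a cross-check:
beta_lstsq = np.linalg.lstsq(x[:, None], Y, rcond=None)[0][0]
print(beta_hat, beta_lstsq)  # both close to beta = 2.0
```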
Show that \[\mathcal R(\widehat\beta)=E[(\widehat\beta-\beta)^2]=O\left(\frac{1}{n}\right).\] Hint: you can compute the sum \(\sum_{i=1}^n(i/n)^2\) on the Wolfram Alpha website.
We deduce that \[\mathcal R(\widehat\beta)=V[\widehat\beta]=\frac{\sigma^2}{\sum_{i=1}^nx_i^2}.\] Moreover \[\sum_{i=1}^nx_i^2=\sum_{i=1}^n(i/n)^2=\frac{(n+1)(2n+1)}{6n}.\]
Thus, since \(\sum_{i=1}^nx_i^2\approx n/3\) for \(n\) large, we have \[\frac{1}{\sum_{i=1}^nx_i^2}\approx \frac{3}{n}.\]
We deduce that \(\mathcal R(\widehat\beta)\) tends to 0 at the rate \(1/n\).
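The closed form of \(\sum_{i=1}^n(i/n)^2\) and the limit \(n/\sum_{i=1}^nx_i^2\to 3\) can be verified numerically (the chosen values of \(n\) are illustrative):

```python
import numpy as np

for n in [10, 100, 1000, 10000]:
    x = np.arange(1, n + 1) / n
    s = np.sum(x ** 2)
    exact = (n + 1) * (2 * n + 1) / (6 * n)  # closed form from above
    print(n, s, exact, n / s)                # n / s tends to 3
```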
We consider the regression model \[Y_i=m(x_i)+\varepsilon_i,\quad i=1,\dots,n\] where \(x_1,\dots,x_n\in\mathbb R^d\) are fixed and \(\varepsilon_1,\dots,\varepsilon_n\) are \(n\) i.i.d. random variables with \(E[\varepsilon_i]=0\) and \(V[\varepsilon_i]=\sigma^2<+\infty\). Let \(\|\cdot\|\) be the Euclidean norm on \(\mathbb R^d\). We define the local constant estimate of \(m\) at \(x\in\mathbb R^d\) by \[\widehat m(x)=\mathop{\mathrm{argmin}}_{a\in\mathbb R}\sum_{i=1}^n(Y_i-a)^2K\left(\frac{\|x_i-x\|}{h}\right)\] where \(h>0\) and, for \(u\in\mathbb R\), \(K(u)=\mathbf 1_{[0,1]}(u)\). We assume that \(\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)>0\).
We just have to solve \(\varphi'(a)=0\) with \[\varphi(a)=\sum_{i=1}^n(Y_i-a)^2K\left(\frac{\|x_i-x\|}{h}\right).\]
We have \[\varphi'(a)=-2\sum_{i=1}^n(Y_i-a)K\left(\frac{\|x_i-x\|}{h}\right).\] We deduce that \[\widehat m(x)=\frac{\sum_{i=1}^nY_iK\left(\frac{\|x_i-x\|}{h}\right)}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}.\]
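With the box kernel, the estimator is simply the average of the \(Y_i\) whose design points fall within distance \(h\) of \(x\). A direct one-dimensional implementation (a sketch; the regression function \(m\), the random design and all numerical values are illustrative assumptions):

```python
import numpy as np

def local_constant(x0, x, Y, h):
    """Local constant estimate at x0 with the box kernel K(u) = 1_{[0,1]}(u):
    a plain average of the Y_i such that |x_i - x0| <= h."""
    w = (np.abs(x - x0) <= h).astype(float)  # K(|x_i - x0| / h)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(2)
n, h = 1000, 0.05
x = rng.uniform(0.0, 1.0, size=n)
m = np.sin(2 * np.pi * x)                    # illustrative Lipschitz m
Y = m + rng.normal(0.0, 0.1, size=n)

est = local_constant(0.25, x, Y, h)
print(est)  # close to m(0.25) = sin(pi/2) = 1
```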
Prove that \[\mathbf V[\widehat m(x)]= \frac{\sigma^2}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}\] and \[\mathbf E[\widehat m(x)]-m(x)=\frac{\sum_{i=1}^n(m(x_i)-m(x))K\left(\frac{\|x_i-x\|}{h}\right)}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}.\] Since the \(Y_i\) are independent with \(\mathbf V[Y_i]=\sigma^2\) and \(\mathbf E[Y_i]=m(x_i)\), and since \(K\) takes values in \(\{0,1\}\) so that \(K^2=K\), we have \[\mathbf V[\widehat m(x)]=\frac{\sigma^2\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)^2}{\left(\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)\right)^2}= \frac{\sigma^2}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}\] and \[\mathbf E[\widehat m(x)]-m(x)=\frac{\sum_{i=1}^n(m(x_i)-m(x))K\left(\frac{\|x_i-x\|}{h}\right)}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}.\]
We moreover assume that \(m\) is a Lipschitz function, i.e., there exists \(L\) such that \(\forall (x_1,x_2)\in\mathbb R^d\times\mathbb R^d\) \[|m(x_1)-m(x_2)|\leq L\|x_1-x_2\|.\] Show that \[|\textrm{bias}[\widehat m(x)]|\leq Lh.\]
Observe that \[K\left(\frac{\|x_i-x\|}{h}\right)\neq 0\] if and only if \(\|x_i-x\|\leq h\). Therefore, for all \(i=1,\dots,n\), \[L\|x_i-x\|K\left(\frac{\|x_i-x\|}{h}\right)\leq Lh K\left(\frac{\|x_i-x\|}{h}\right).\] Combining this with the Lipschitz bound \(|m(x_i)-m(x)|\leq L\|x_i-x\|\) in the expression for the bias, we deduce that \(|\textrm{bias}[\widehat m(x)]|\leq Lh.\)
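Since the bias is deterministic (it does not depend on the noise), the bound \(|\textrm{bias}[\widehat m(x)]|\leq Lh\) can be checked exactly on a concrete example. The sketch below assumes a one-dimensional equispaced design and \(m(x)=\sin(x)\), which is Lipschitz with \(L=1\); these choices are illustrative.

```python
import numpy as np

n, h, x0 = 1000, 0.1, 0.5
L = 1.0                            # Lipschitz constant of m(x) = sin(x)
x = np.arange(1, n + 1) / n        # equispaced design on (0, 1]

w = np.abs(x - x0) <= h            # box-kernel window around x0
bias = np.mean(np.sin(x[w])) - np.sin(x0)  # E[m_hat(x0)] - m(x0), noise-free
print(abs(bias), L * h)            # |bias| is well below the bound L * h
```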
We have
\[\mathbf V[\widehat m(x)]= \frac{\sigma^2}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}=\frac{\sigma^2}{\sum_{i=1}^n\mathbf 1_{B_h}(x_i-x)}.\] Note that \[\sum_{i=1}^n\mathbf 1_{B_h}(x_i-x)\geq C_1n\textrm{Vol}(B_h)\geq C_1n\sqrt{2}^dh^d.\] Thus \[\mathbf V[\widehat m(x)]\leq \frac{\sigma^2}{C_1\sqrt{2}^dnh^d}.\] The bound therefore holds with \(C_2=1/(C_1\sqrt{2}^d)\).
We deduce that \[\mathbf E[(\widehat m(x)-m(x))^2]\leq L^2h^2+\frac{C_2\sigma^2}{nh^d}.\] Calculate \(h_{opt}\), the value of \(h\) which minimises the upper bound, and deduce that, when \(h=h_{opt}\), we have \[\mathbf E[(\widehat m(x)-m(x))^2]=\mathrm{O}\left(n^{-\frac{2}{d+2}}\right).\]
Denote by \(M(h)\) the upper bound. We have \[M'(h)=2hL^2-\frac{C_2\sigma^2d}{n}h^{-d-1}.\] This derivative equals 0 for \[h_{opt}=\left(\frac{C_2\sigma^2d}{2L^2}\right)^{\frac{1}{d+2}}n^{-\frac{1}{d+2}}.\] When \(h=h_{opt}\), both terms of the bound are of order \(n^{-\frac{2}{d+2}}\), so the quadratic risk satisfies \[\mathbf E[(\widehat m(x)-m(x))^2]=\mathrm{O}\left(n^{-\frac{2}{d+2}}\right).\]
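The minimiser \(h_{opt}=\left(\frac{C_2\sigma^2d}{2L^2 n}\right)^{1/(d+2)}\) and the resulting \(n^{-2/(d+2)}\) rate can be confirmed by a grid search over \(h\) (all constants below are illustrative assumptions):

```python
import numpy as np

# Illustrative constants for the upper bound M(h) = L^2 h^2 + C2 sigma^2 / (n h^d).
L, C2, sigma, d, n = 1.0, 1.0, 1.0, 2, 10_000

def M(h):
    return L**2 * h**2 + C2 * sigma**2 / (n * h**d)

# Closed-form minimiser derived above.
h_opt = (d * C2 * sigma**2 / (2 * L**2 * n)) ** (1 / (d + 2))

# Grid search over h confirms the closed form.
grid = np.linspace(1e-3, 1.0, 100_000)
h_grid = grid[np.argmin(M(grid))]

# At h_opt, M is of order n^(-2/(d+2)) (here equal to 2 * n^(-1/2)).
print(h_opt, h_grid, M(h_opt))
```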