We consider \(n\) independent observations \((X_1,Y_1),\dots,(X_n,Y_n)\) with the same distribution \(\mathbf P\). \(X_1\) takes values in \(\mathbb R^d\) and \(Y_1\) in \(\{-1,1\}\).
We write \(\eta(x)=\mathbf P(Y=1|X=x)\). The Bayes rule is defined by \(g^\star(x)=1\) if \(\eta(x)\geq 1/2\) and \(g^\star(x)=-1\) otherwise. The Bayes error is defined by \(L^\star=L(g^\star)=\mathbf P(g^\star(X)\neq Y)\).
We have
\begin{align*}
\mathbf P(g(X)\neq Y|X=x) & = 1-\left(\mathbf P(g(X)=Y,g(X)=1|X=x)+\mathbf P(g(X)=Y,g(X)=-1|X=x)\right) \\
& = 1-\left(\mathbf 1_{g(x)=1}\mathbf P(Y=1|X=x)+\mathbf 1_{g(x)=-1}\mathbf P(Y=-1|X=x)\right) \\
& = 1-\left(\mathbf 1_{g(x)=1}\eta(x)+\mathbf 1_{g(x)=-1}(1-\eta(x))\right).
\end{align*}
Deduce that for all \(x\in\mathcal X\) and every classification rule \(g\), \[\mathbf P(g(X)\neq Y|X=x)-\mathbf P(g^\star(X)\neq Y|X=x)\geq 0.\] From the previous computation,
\begin{align*}
\mathbf P(g(X)\neq Y|X=x)- & \mathbf P(g^\star(X)\neq Y|X=x) \\
& = \eta(x)\left(\mathbf 1_{g^\star(x)=1}-\mathbf 1_{g(x)=1}\right)+(1-\eta(x)) \left(\mathbf 1_{g^\star(x)=-1}-\mathbf 1_{g(x)=-1}\right) \\
& = (2\eta(x)-1) \left(\mathbf 1_{g^\star(x)=1}-\mathbf 1_{g(x)=1}\right) \\
& \geq 0,
\end{align*}
where the second equality uses \(\mathbf 1_{g^\star(x)=-1}-\mathbf 1_{g(x)=-1}=-\left(\mathbf 1_{g^\star(x)=1}-\mathbf 1_{g(x)=1}\right)\), and the last inequality holds by definition of \(g^\star\): both factors are nonnegative when \(\eta(x)\geq 1/2\) and both are nonpositive otherwise.
Deduce that
\[\mathbf P(g(X)\neq Y)\geq \mathbf P(g^\star(X)\neq Y).\] It suffices to integrate the pointwise inequality with respect to the distribution of \(X\), using \[\mathbf P(g(X)\neq Y)=\int\mathbf P(g(X)\neq Y|X=x)\,\mu(dx),\] where \(\mu\) stands for the probability distribution of \(X\).
Calculate the Bayes rule and the Bayes error.
We deduce the Bayes error:
\begin{align*}
\mathbf P(g^\star(X)\neq Y)&= \mathbf P(g^\star(X)\neq Y,X\leq 0)+\mathbf P(g^\star(X)\neq Y,X>0) \\
&=\mathbf P(X\leq 0)\mathbf P(g^\star(X)\neq Y|X\leq 0)+\mathbf P(X>0)\mathbf P(g^\star(X)\neq Y|X>0) \\
&=\frac{1}{2}\mathbf P(Y=1|X\leq 0)+\frac{1}{2}\mathbf P(Y=-1|X>0) \\
&=\frac{1}{2}\cdot\frac{1}{5}+\frac{1}{2}\cdot\frac{1}{10}\\
&=\frac{3}{20}.
\end{align*}
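The value \(3/20\) can be checked by Monte Carlo. The sketch below assumes a model consistent with the probabilities used above: \(X\) symmetric about \(0\) (here uniform on \([-1,1]\), so \(\mathbf P(X\leq 0)=1/2\)), \(\mathbf P(Y=1|X\leq 0)=1/5\) and \(\mathbf P(Y=-1|X>0)=1/10\); the uniform choice for \(X\) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# X symmetric about 0, so P(X <= 0) = P(X > 0) = 1/2 (assumed uniform model).
X = rng.uniform(-1.0, 1.0, size=n)

# Conditional label probabilities matching the computation above:
# P(Y = 1 | X <= 0) = 1/5 and P(Y = -1 | X > 0) = 1/10.
p1 = np.where(X <= 0, 0.2, 0.9)
Y = np.where(rng.uniform(size=n) < p1, 1, -1)

# Bayes rule: predict the majority label on each side of 0.
g_star = np.where(X <= 0, -1, 1)

bayes_error = np.mean(g_star != Y)
print(bayes_error)  # close to 3/20 = 0.15
```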
We consider the linear model \[Y_i=\beta x_i+\varepsilon_i,\quad i=1,\dots,n\] where \(x_i=i/n\) and \(\varepsilon_i\) are i.i.d. with \(E[\varepsilon_i]=0\) and \(V[\varepsilon_i]=\sigma^2\).
The least squares criterion is \[L(\beta)=\sum_{i=1}^n(Y_i-\beta x_i)^2.\] We have \[L'(\beta)=-2\sum_{i=1}^n x_i(Y_i-\beta x_i).\] We deduce that \[\widehat\beta=\frac{\sum_{i=1}^n x_iY_i}{\sum_{i=1}^n x_i^2}.\]
We have \[E[\widehat\beta]=\frac{\sum_{i=1}^n x_iE[Y_i]}{\sum_{i=1}^n x_i^2}=\beta.\] and \[V[\widehat\beta]=\frac{\sum_{i=1}^n x_i^2V[Y_i]}{(\sum_{i=1}^n x_i^2)^2}=\frac{\sigma^2}{\sum_{i=1}^nx_i^2}.\]
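A quick numerical check of the closed-form estimator (a sketch; the values of \(\beta\), \(\sigma\) and \(n\) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, sigma = 500, 2.0, 0.5

x = np.arange(1, n + 1) / n          # design points x_i = i/n
eps = rng.normal(0.0, sigma, size=n)
Y = beta * x + eps

# Closed-form least squares estimate (model without intercept):
beta_hat = np.sum(x * Y) / np.sum(x ** 2)

# Same fit via numpy's general least squares solver, as a cross-check:
beta_lstsq = np.linalg.lstsq(x[:, None], Y, rcond=None)[0][0]
print(beta_hat, beta_lstsq)  # both close to beta = 2.0
```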
Show that \[\mathcal R(\widehat\beta)=E[(\widehat\beta-\beta)^2]=O\left(\frac{1}{n}\right).\] Hint: you can compute the sum \(\sum_{i=1}^n(i/n)^2\) on the Wolfram Alpha website.
We deduce that \[\mathcal R(\widehat\beta)=V[\widehat\beta]=\frac{\sigma^2}{\sum_{i=1}^nx_i^2}.\] Moreover \[\sum_{i=1}^nx_i^2=\sum_{i=1}^n(i/n)^2=\frac{(n+1)(2n+1)}{6n}.\]
Thus, since \(\sum_{i=1}^nx_i^2\approx n/3\) for \(n\) large, we have \[\frac{1}{\sum_{i=1}^nx_i^2}\approx \frac{3}{n}.\]
We deduce that \(\mathcal R(\widehat\beta)\) tends to 0 at the rate \(1/n\).
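The closed form of \(\sum_{i=1}^n(i/n)^2\) and the limit \(n/\sum_{i=1}^nx_i^2\to 3\) can be verified numerically (the chosen values of \(n\) are illustrative):

```python
import numpy as np

for n in [10, 100, 1000, 10000]:
    x = np.arange(1, n + 1) / n
    s = np.sum(x ** 2)
    exact = (n + 1) * (2 * n + 1) / (6 * n)  # closed form from above
    print(n, s, exact, n / s)                # n / s tends to 3
```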
We consider the regression model \[Y_i=m(x_i)+\varepsilon_i,\quad i=1,\dots,n\] where \(x_1,\dots,x_n\in\mathbb R^d\) are fixed and \(\varepsilon_1,\dots,\varepsilon_n\) are \(n\) i.i.d. random variables with \(E[\varepsilon_i]=0\) and \(V[\varepsilon_i]=\sigma^2<+\infty\). Let \(\|\cdot\|\) be the Euclidean norm on \(\mathbb R^d\). We define the local constant estimate of \(m\) at \(x\in\mathbb R^d\) by \[\widehat m(x)=\mathop{\mathrm{argmin}}_{a\in\mathbb R}\sum_{i=1}^n(Y_i-a)^2K\left(\frac{\|x_i-x\|}{h}\right)\] where \(h>0\) and, for \(u\in\mathbb R\), \(K(u)=\mathbf 1_{[0,1]}(u)\). We assume that \(\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)>0\).
We just have to solve \(\varphi'(a)=0\) with \[\varphi(a)=\sum_{i=1}^n(Y_i-a)^2K\left(\frac{\|x_i-x\|}{h}\right).\]
We have \[\varphi'(a)=-2\sum_{i=1}^n(Y_i-a)K\left(\frac{\|x_i-x\|}{h}\right).\] We deduce that \[\widehat m(x)=\frac{\sum_{i=1}^nY_iK\left(\frac{\|x_i-x\|}{h}\right)}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}.\]
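With the box kernel, the estimator is simply the average of the \(Y_i\) whose design points fall within distance \(h\) of \(x\). A direct one-dimensional implementation (a sketch; the regression function \(m\), the random design and all numerical values are illustrative assumptions):

```python
import numpy as np

def local_constant(x0, x, Y, h):
    """Local constant estimate at x0 with the box kernel K(u) = 1_{[0,1]}(u):
    a plain average of the Y_i such that |x_i - x0| <= h."""
    w = (np.abs(x - x0) <= h).astype(float)  # K(|x_i - x0| / h)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(2)
n, h = 1000, 0.05
x = rng.uniform(0.0, 1.0, size=n)
m = np.sin(2 * np.pi * x)                    # illustrative Lipschitz m
Y = m + rng.normal(0.0, 0.1, size=n)

est = local_constant(0.25, x, Y, h)
print(est)  # close to m(0.25) = sin(pi/2) = 1
```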
Prove that \[\mathbf V[\widehat m(x)]= \frac{\sigma^2}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}\] and \[\mathbf E[\widehat m(x)]-m(x)=\frac{\sum_{i=1}^n(m(x_i)-m(x))K\left(\frac{\|x_i-x\|}{h}\right)}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}.\] Since the \(Y_i\) are independent with \(\mathbf V[Y_i]=\sigma^2\) and \(\mathbf E[Y_i]=m(x_i)\), and since \(K\) takes values in \(\{0,1\}\) so that \(K^2=K\), we have \[\mathbf V[\widehat m(x)]=\frac{\sigma^2\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)^2}{\left(\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)\right)^2}= \frac{\sigma^2}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}\] and \[\mathbf E[\widehat m(x)]-m(x)=\frac{\sum_{i=1}^n(m(x_i)-m(x))K\left(\frac{\|x_i-x\|}{h}\right)}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}.\]
We moreover assume that \(m\) is a Lipschitz function, i.e., there exists \(L\) such that \(\forall (x_1,x_2)\in\mathbb R^d\times\mathbb R^d\) \[|m(x_1)-m(x_2)|\leq L\|x_1-x_2\|.\] Show that \[|\textrm{bias}[\widehat m(x)]|\leq Lh.\]
Observe that \[K\left(\frac{\|x_i-x\|}{h}\right)\neq 0\] if and only if \(\|x_i-x\|\leq h\). Therefore, for all \(i=1,\dots,n\), \[L\|x_i-x\|K\left(\frac{\|x_i-x\|}{h}\right)\leq Lh K\left(\frac{\|x_i-x\|}{h}\right).\] Combining this with the Lipschitz bound \(|m(x_i)-m(x)|\leq L\|x_i-x\|\) in the expression for the bias, we deduce that \(|\textrm{bias}[\widehat m(x)]|\leq Lh.\)
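Since the bias is deterministic (it does not depend on the noise), the bound \(|\textrm{bias}[\widehat m(x)]|\leq Lh\) can be checked exactly on a concrete example. The sketch below assumes a one-dimensional equispaced design and \(m(x)=\sin(x)\), which is Lipschitz with \(L=1\); these choices are illustrative.

```python
import numpy as np

n, h, x0 = 1000, 0.1, 0.5
L = 1.0                            # Lipschitz constant of m(x) = sin(x)
x = np.arange(1, n + 1) / n        # equispaced design on (0, 1]

w = np.abs(x - x0) <= h            # box-kernel window around x0
bias = np.mean(np.sin(x[w])) - np.sin(x0)  # E[m_hat(x0)] - m(x0), noise-free
print(abs(bias), L * h)            # |bias| is well below the bound L * h
```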
We have
\[\mathbf V[\widehat m(x)]= \frac{\sigma^2}{\sum_{i=1}^nK\left(\frac{\|x_i-x\|}{h}\right)}=\frac{\sigma^2}{\sum_{i=1}^n\mathbf 1_{B_h}(x_i-x)}.\] Note that \[\sum_{i=1}^n\mathbf 1_{B_h}(x_i-x)\geq C_1n\textrm{Vol}(B_h)\geq C_1n\sqrt{2}^dh^d.\] Thus \[\mathbf V[\widehat m(x)]\leq \frac{\sigma^2}{C_1\sqrt{2}^dnh^d}.\] The bound therefore holds with \(C_2=1/(C_1\sqrt{2}^d)\).
We deduce that \[\mathbf E[(\widehat m(x)-m(x))^2]\leq L^2h^2+\frac{C_2\sigma^2}{nh^d}.\] Calculate \(h_{opt}\), the value of \(h\) which minimises the upper bound, and deduce that, when \(h=h_{opt}\), we have \[\mathbf E[(\widehat m(x)-m(x))^2]=\mathrm{O}\left(n^{-\frac{2}{d+2}}\right).\]
Denote by \(M(h)\) the upper bound. We have \[M'(h)=2hL^2-\frac{C_2\sigma^2d}{n}h^{-d-1}.\] This derivative equals 0 for \[h_{opt}=\left(\frac{C_2\sigma^2d}{2L^2}\right)^{\frac{1}{d+2}}n^{-\frac{1}{d+2}}.\] When \(h=h_{opt}\), both terms of the bound are of order \(n^{-\frac{2}{d+2}}\), so the quadratic risk satisfies \[\mathbf E[(\widehat m(x)-m(x))^2]=\mathrm{O}\left(n^{-\frac{2}{d+2}}\right).\]
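The minimiser \(h_{opt}=\left(\frac{C_2\sigma^2d}{2L^2 n}\right)^{1/(d+2)}\) and the resulting \(n^{-2/(d+2)}\) rate can be confirmed by a grid search over \(h\) (all constants below are illustrative assumptions):

```python
import numpy as np

# Illustrative constants for the upper bound M(h) = L^2 h^2 + C2 sigma^2 / (n h^d).
L, C2, sigma, d, n = 1.0, 1.0, 1.0, 2, 10_000

def M(h):
    return L**2 * h**2 + C2 * sigma**2 / (n * h**d)

# Closed-form minimiser derived above.
h_opt = (d * C2 * sigma**2 / (2 * L**2 * n)) ** (1 / (d + 2))

# Grid search over h confirms the closed form.
grid = np.linspace(1e-3, 1.0, 100_000)
h_grid = grid[np.argmin(M(grid))]

# At h_opt, M is of order n^(-2/(d+2)) (here equal to 2 * n^(-1/2)).
print(h_opt, h_grid, M(h_opt))
```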