[Learning theory] Bias variance trade-off (Hoeffding inequality)

Notice

Recent Posts

Recent Comments

Link

« 2025/07 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Tags more

Archives

Today

Total

관리 메뉴

[Learning theory] Bias variance trade-off (Hoeffding inequality) 본문

공부/Machine Learning

[Learning theory] Bias variance trade-off (Hoeffding inequality)

mariabeetle 2019. 10. 12. 20:00

유한 개의 Hypothesis set을 가정하고, training error를 가장 적게 하는 hypothesis에 대해서 generalization error bound를 구함.
위에서 구한 generalization error bound를 통해 learning theory 관점에서 bias variance trade-off를 볼 수 있음.

1. Preliminaries

Lemma. (The union bound)

Let $A_1, A_2, ..., A_k$ be $k$ different events(that may not be independent). Then
$$P(A_1 \cup \cdots \cup A_k) \leq P(A_1) + \cdots + P(A_k)$$

Lemma. (Hoeffding inequality) (Chernoff bound)

Let $Z_1, ..., Z_m$ be $m$ independent and indentically distributed(iid) random variables drawn from a Bernoulli($\phi$) distribution(I.e. $P(Z_i = 1) = \phi,$ and $P(Z_i = 0) = 1-\phi$).
Let $\hat{\phi} = \frac{1}{m} \sum_{i=1} ^m Z_i$ be the mean of these random variables, and let any $\gamma > 0$ be fixed. Then
$$P(|\phi - \hat{\phi}| > \gamma) \leq 2\text{exp}(-2\gamma^2 m)$$

Assumption

binary classifiation with the labels are $y \in {0, 1}$
Training set $S = \{(x^{(i)} , y^{(i)}); i = 1, ..., m\}$ of size $m$, drawn iid from some probability distribution $D$.

Training error(empirical risk or empirical error)

$$\hat{\epsilon}(h) = \frac{1}{m} \sum_{i=1}^m1\{h(x)^{(i)} \neq y^{(i)}\}$$

Generalization error

$$\epsilon(h) = P_{(x, y) \sim D}(h(x) \neq y) $$

Hypothesis class $\mathcal{H}$

$H = \{h_\theta: h_\theta(x) = 1\{\theta ^T x \geq 0\}, \theta \in \mathbb{R}^{n+1}\}$
Decision boundary is linear

Empirical risk minimization

$$\hat{h} = \arg \min_{h \in \mathcal{H}} \hat{\epsilon}(h)$$

2. The case of finite $\mathcal{H}$

$\mathcal{H} = {h_1, ..., h_k}$
$\mathcal{H}$ is a set of $k$ functions mapping from $\mathcal{X}$ to {0, 1}
$\hat{h}$ : smallest training error를 보여주는 hypothesis function 안에서의 함수.
목표 : $\hat{h}$에 대해 generalization error를 구해보자.
$h_i \in \mathcal{H}$ hypothesis function $h_i$ 고정.
$Z = 1\{h_i (x) \neq y\}$
- $Z_j = 1\{h_i (x^{(j)} \neq y^{(j)})\}$ : j번째 데이터셋에 대한 $Z_j$ 값
- Training set이 $D$에서 샘플링되었기 때문에, $Z_j$들도 같은 distribution을 가지게 됨.

Training error

$\hat{\epsilon} (h_i) = \frac{1}{m}\sum_{j=1}^m Z_j$
$\hat{\epsilon}(h_i)$는 m개의 random variables $Z_j$에 대한 평균값이고, 각각의 $Z_j$는 평균이 $\epsilon(h_i)$인 Bernoulli distribution에서 sampling되었음.

Apply Hoeffding inequality

$$P(|\epsilon(h_i) - \hat{\epsilon}(h_i)| > \gamma) \leq 2 \text{exp}(-2 \gamma^2 m)$$

$h_i$ 하나에 대해서만 정의하지 않고, 모든 $h \in \mathcal{H}$에 대해 정의해보자.

Uniform convergence

$A_i$ : $|\epsilon(h_i) - \hat{\epsilon(h_i)}| > \gamma$ 인 event
$P(A_i) \leq 2 \text{exp}(-2 \gamma^2 m)$ => 여기에 uniform bound 적용

Three quantities
- $m$ : training data 개수
- $\gamma$ : 평균과의 차이(거리)
- probability of error : $2k \text{exp}(-2\gamma^2 m)$
Sample complexity : training error와 generalization error 차이가 $\gamma$보다 작게 하기 위해서 얼마나 많은 training dataset이 필요할까?
- $\delta \geq 2k\text{exp}(-2 \gamma^2 m)$ => m에 대해서 풀자.
- $m \geq \frac{1}{2\gamma^2}\log\frac{2k}{\delta}$

3. Generalization bound

$\delta = 2k\text{exp}(-2 \gamma^2 m)$ => $\gamma$에 대해서 풀자.
$\sqrt{\frac{1}{2m} \log \frac{2k}{\delta}} = \gamma$
$|\hat{\epsilon}(h) - \epsilon(h)| \leq \gamma \cdots (1)$
$|\hat{\epsilon}(h) - \epsilon(h)| \leq \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}$
$\hat{h} = \arg \min_{h \in \mathcal{H}} \hat{\epsilon}(h)$ ; training error를 가장 작게 해주는 hypothesis function
$h^* = \arg\min_{h \in \mathcal{H}} \epsilon(h)$ : generalization error를 가장 작게 해주는 hypothesis function

$\epsilon(\hat{h}) \leq \hat{\epsilon}(\hat{h}) + \gamma$ ; 위의 부등식 (1), $\hat{h}$의 정의가 $\hat{\epsilon}(\cdot)$ 함수의 최솟값
$\hat{\epsilon}(\hat{h})$는 $\hat{\epsilon}(h^*)$보다 작거나 같음.
$\hat{\epsilon}(h^*) \leq \epsilon(h^*) + \gamma$ ; 부등식 (1), $ h^* $는 $ \epsilon (\cdot) $ 함수의 최솟값.

reference = http://cs229.stanford.edu/

저작자표시 (새창열림)

'공부 > Machine Learning' 카테고리의 다른 글

[ML] Convolution 정리 (1)	2020.09.08
유전알고리즘 시험 내용 요약 (0)	2019.12.06

'공부/Machine Learning' Related Articles

Comments

Archive

[Learning theory] Bias variance trade-off (Hoeffding inequality) 본문

[Learning theory] Bias variance trade-off (Hoeffding inequality)

1. Preliminaries

Lemma. (The union bound)

Lemma. (Hoeffding inequality) (Chernoff bound)

Assumption

Training error(empirical risk or empirical error)

Generalization error

Hypothesis class $\mathcal{H}$

Empirical risk minimization

2. The case of finite $\mathcal{H}$

Training error

Apply Hoeffding inequality

Uniform convergence

3. Generalization bound

'공부 > Machine Learning' 카테고리의 다른 글

티스토리툴바