[Learning theory] Bias variance trade-off (Hoeffding inequality)

mariabeetle 2019. 10. 12. 20:00
  • Assuming a finite hypothesis set, we derive a generalization error bound for the hypothesis that minimizes the training error.
  • Through this generalization error bound, the bias-variance trade-off can be viewed from a learning theory perspective.

1. Preliminaries

Lemma. (The union bound)

  • Let $A_1, A_2, ..., A_k$ be $k$ different events (that may not be independent). Then
    $$P(A_1 \cup \cdots \cup A_k) \leq P(A_1) + \cdots + P(A_k)$$

Lemma. (Hoeffding inequality / Chernoff bound)

  • Let $Z_1, ..., Z_m$ be $m$ independent and identically distributed (iid) random variables drawn from a Bernoulli($\phi$) distribution (i.e. $P(Z_i = 1) = \phi$ and $P(Z_i = 0) = 1-\phi$).
  • Let $\hat{\phi} = \frac{1}{m} \sum_{i=1} ^m Z_i$ be the mean of these random variables, and let any $\gamma > 0$ be fixed. Then
    $$P(|\phi - \hat{\phi}| > \gamma) \leq 2\text{exp}(-2\gamma^2 m)$$
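
As a sanity check, the probability on the left-hand side can be estimated by simulation and compared with the bound on the right. A minimal sketch, assuming NumPy is available (the values of $\phi$, $m$, $\gamma$ below are arbitrary):

```python
import numpy as np

def hoeffding_check(phi=0.3, m=100, gamma=0.1, n_trials=100_000, seed=0):
    """Estimate P(|phi - phi_hat| > gamma) by simulation and compare it
    with the Hoeffding bound 2 * exp(-2 * gamma^2 * m)."""
    rng = np.random.default_rng(seed)
    # n_trials independent samples, each consisting of m Bernoulli(phi) draws.
    draws = rng.random((n_trials, m)) < phi
    phi_hat = draws.mean(axis=1)            # empirical mean of each sample
    empirical = np.mean(np.abs(phi - phi_hat) > gamma)
    bound = 2 * np.exp(-2 * gamma**2 * m)
    return empirical, bound

emp, bound = hoeffding_check()
print(f"empirical P(|phi - phi_hat| > gamma): {emp:.4f}")
print(f"Hoeffding bound:                      {bound:.4f}")
```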

Assumption

  • Binary classification with labels $y \in \{0, 1\}$
  • Training set $S = \{(x^{(i)} , y^{(i)}); i = 1, ..., m\}$ of size $m$, drawn iid from some probability distribution $D$.
Training error (empirical risk or empirical error)

$$\hat{\epsilon}(h) = \frac{1}{m} \sum_{i=1}^m 1\{h(x^{(i)}) \neq y^{(i)}\}$$

Generalization error

$$\epsilon(h) = P_{(x, y) \sim D}(h(x) \neq y) $$

Hypothesis class $\mathcal{H}$
  • $\mathcal{H} = \{h_\theta: h_\theta(x) = 1\{\theta^T x \geq 0\}, \theta \in \mathbb{R}^{n+1}\}$
  • Decision boundary is linear
Empirical risk minimization

$$\hat{h} = \arg \min_{h \in \mathcal{H}} \hat{\epsilon}(h)$$
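
As a concrete sketch of empirical risk minimization over a finite class (illustrative only: the one-dimensional threshold classifiers and the synthetic data below are made-up stand-ins, not the linear hypothesis class above):

```python
import numpy as np

def erm(hypotheses, X, y):
    """Empirical risk minimization over a finite hypothesis class:
    return the hypothesis with the smallest training error."""
    errors = [np.mean(h(X) != y) for h in hypotheses]
    best = int(np.argmin(errors))
    return hypotheses[best], errors[best]

# A toy finite class H: k threshold classifiers on a single feature.
thresholds = np.linspace(-2.0, 2.0, 21)
H = [lambda X, t=t: (X >= t).astype(int) for t in thresholds]

# Toy 1-D training data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = (X + 0.3 * rng.normal(size=200) >= 0.5).astype(int)

h_hat, train_error = erm(H, X, y)
print("training error of h_hat:", train_error)
```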

2. The case of finite $\mathcal{H}$

  • $\mathcal{H} = \{h_1, ..., h_k\}$
  • $\mathcal{H}$ is a set of $k$ functions mapping from $\mathcal{X}$ to $\{0, 1\}$
  • $\hat{h}$ : the hypothesis in $\mathcal{H}$ with the smallest training error.
  • Goal : bound the generalization error of $\hat{h}$.
  • Fix one hypothesis $h_i \in \mathcal{H}$.
  • $Z = 1\{h_i (x) \neq y\}$
    • $Z_j = 1\{h_i (x^{(j)}) \neq y^{(j)}\}$ : the value of $Z$ on the $j$-th training example
    • Since the training set was sampled iid from $D$, the $Z_j$ all share the same distribution as $Z$.

Training error

  • $\hat{\epsilon} (h_i) = \frac{1}{m}\sum_{j=1}^m Z_j$

  • $\hat{\epsilon}(h_i)$ is the average of the $m$ random variables $Z_j$, each of which is drawn from a Bernoulli distribution with mean $\epsilon(h_i)$.

Apply Hoeffding inequality

$$P(|\epsilon(h_i) - \hat{\epsilon}(h_i)| > \gamma) \leq 2 \text{exp}(-2 \gamma^2 m)$$

  • This bound holds for a single fixed $h_i$; we now want a statement that holds for all $h \in \mathcal{H}$ simultaneously.

Uniform convergence

  • Let $A_i$ denote the event that $|\epsilon(h_i) - \hat{\epsilon}(h_i)| > \gamma$.

  • $P(A_i) \leq 2 \text{exp}(-2 \gamma^2 m)$ for each $i$. Applying the union bound, $P(\exists h \in \mathcal{H}.\ |\epsilon(h) - \hat{\epsilon}(h)| > \gamma) \leq \sum_{i=1}^k P(A_i) \leq 2k\,\text{exp}(-2\gamma^2 m)$, so with probability at least $1 - 2k\,\text{exp}(-2\gamma^2 m)$, $|\epsilon(h) - \hat{\epsilon}(h)| \leq \gamma$ holds simultaneously for all $h \in \mathcal{H}$ (uniform convergence).

  • Three quantities

    • $m$ : number of training examples
    • $\gamma$ : allowed gap between training error and generalization error
    • probability of error : $2k\,\text{exp}(-2\gamma^2 m)$, the probability that some hypothesis violates the gap $\gamma$
  • Sample complexity : how many training examples do we need so that, with probability at least $1 - \delta$, the gap between training error and generalization error is at most $\gamma$ for every $h \in \mathcal{H}$? (See the sketch below.)

    • Require $\delta \geq 2k\,\text{exp}(-2 \gamma^2 m)$ and solve for $m$:
    • $m \geq \frac{1}{2\gamma^2}\log\frac{2k}{\delta}$
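
The sample-complexity formula can be evaluated directly; a minimal sketch assuming NumPy, with arbitrary example values of $\gamma$, $\delta$, $k$:

```python
import numpy as np

def sample_complexity(gamma, delta, k):
    """Smallest m such that, with probability at least 1 - delta,
    |training error - generalization error| <= gamma for all k hypotheses."""
    return int(np.ceil(np.log(2 * k / delta) / (2 * gamma**2)))

# Example: gap of at most 0.05, confidence 95%, k = 1000 hypotheses.
print(sample_complexity(gamma=0.05, delta=0.05, k=1000))  # about 2120
```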

3. Generalization bound

  • Set $\delta = 2k\,\text{exp}(-2 \gamma^2 m)$ and solve for $\gamma$:
  • $\gamma = \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}$
  • With probability at least $1 - \delta$, $|\hat{\epsilon}(h) - \epsilon(h)| \leq \gamma$ for all $h \in \mathcal{H}$ $\cdots (1)$
  • $|\hat{\epsilon}(h) - \epsilon(h)| \leq \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}$
  • $\hat{h} = \arg \min_{h \in \mathcal{H}} \hat{\epsilon}(h)$ : the hypothesis that minimizes the training error
  • $h^* = \arg\min_{h \in \mathcal{H}} \epsilon(h)$ : the hypothesis that minimizes the generalization error

  • $\epsilon(\hat{h}) \leq \hat{\epsilon}(\hat{h}) + \gamma$ : by inequality (1)
  • $\hat{\epsilon}(\hat{h}) \leq \hat{\epsilon}(h^*)$ : since $\hat{h}$ is, by definition, the minimizer of $\hat{\epsilon}(\cdot)$
  • $\hat{\epsilon}(h^*) \leq \epsilon(h^*) + \gamma$ : by inequality (1), applied to $h^*$
  • Chaining the three inequalities: with probability at least $1 - \delta$, $\epsilon(\hat{h}) \leq \epsilon(h^*) + 2\gamma = \min_{h \in \mathcal{H}} \epsilon(h) + 2\sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}$. The first term is a bias-like quantity (how well the best hypothesis in $\mathcal{H}$ can do) and the second a variance-like quantity; enlarging $\mathcal{H}$ (bigger $k$) can only decrease the first term but increases the second, which is the bias-variance trade-off seen from the learning theory viewpoint. (A numerical sketch of the second term follows.)
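
The sketch below (assuming NumPy; $\delta$ and the values of $m$, $k$ are arbitrary) evaluates only the variance-like penalty $2\gamma = 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}$: it grows logarithmically with $k$ and shrinks with $m$. The bias-like term $\min_{h \in \mathcal{H}} \epsilon(h)$ depends on the data distribution and cannot be computed from $m$, $k$, $\delta$ alone.

```python
import numpy as np

def variance_term(m, k, delta=0.05):
    """The 2*gamma penalty in epsilon(h_hat) <= min_h epsilon(h) + 2*gamma."""
    return 2 * np.sqrt(np.log(2 * k / delta) / (2 * m))

# The penalty grows only logarithmically with the size k of H ...
for k in (10, 10**3, 10**6):
    print(f"m=1000, k={k:>7}: penalty = {variance_term(1000, k):.3f}")

# ... and shrinks as the number of training examples m grows.
for m in (100, 1000, 10000):
    print(f"k=1000, m={m:>5}: penalty = {variance_term(m, 1000):.3f}")
```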

 

 

Reference: http://cs229.stanford.edu/
