User:Scott.linderman/sandbox

From Wikipedia, the free encyclopedia

In computational learning theory, probably approximately correct learning (PAC learning) is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant.[1]

In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the "probably" part), the selected function will have low generalization error (the "approximately correct" part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples.

The model was later extended to treat noise (misclassified samples).

An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning. In particular, the learner is expected to find efficient functions (time and space requirements bounded to a polynomial of the example size), and the learner itself must implement an efficient procedure (requiring an example count bounded to a polynomial of the concept size, modified by the approximation and likelihood bounds).

Definitions and terminology

In order to define what it means for a concept class to be PAC-learnable, we first have to introduce some terminology.[2][3]

For the following definitions, two examples will be used. The first is the problem of character recognition given an array of bits encoding a binary-valued image. The other example is the problem of finding an interval that will correctly classify points within the interval as positive and points outside of it as negative.

The instance space, $X$, is the set of all possible inputs to the learning algorithm. In the character recognition problem, the instance space is $X = \{0,1\}^n$, where $n$ is the length of the inputs. In the interval problem the instance space is $X = \mathbb{R}$, where $\mathbb{R}$ denotes the set of all real numbers.

A concept is a subset $c \subseteq X$. One concept is the set of all patterns of bits in $X = \{0,1\}^n$ that encode a picture of the letter "P". In the interval learning problem, an example of a concept is an interval $[a, b] \subseteq \mathbb{R}$. A concept class $C$ is a set of concepts over $X$. This could be the set of intervals of length less than one, for example. The goal of the learning problem will be to identify a hypothesis $h$ that is close, in a sense that will be made precise, to the true concept $c$, given a set of labeled samples $\{(x_i, y_i)\}_{i=1}^m$.

Paralleling the concept class $C$ is the hypothesis class $H$, from which the hypothesis $h$ is drawn. In many cases the hypothesis class is equal to the concept class, though this need not be the case. The hypothesis class may be restricted to a smaller set as a form of regularization. Alternatively, the hypothesis class may be infinitely large in order to capture as many potential concepts as possible.

Let $\mathrm{EX}(c, D)$ be a procedure that draws an example, $x$, using a probability distribution $D$ and gives the correct label $c(x)$, that is, 1 if $x \in c$ and 0 otherwise.

Say that there is an algorithm $A$ that, given access to $\mathrm{EX}(c, D)$ and inputs $\epsilon$ and $\delta$, with probability at least $1 - \delta$ outputs a hypothesis $h \in H$ that has error less than or equal to $\epsilon$ on examples drawn from $X$ according to $D$. If there is such an algorithm for every concept $c \in C$, for every distribution $D$ over $X$, and for all $0 < \epsilon < 1/2$ and $0 < \delta < 1/2$, then $C$ is PAC learnable (or distribution-free PAC learnable) by $H$. We can also say that $A$ is a PAC learning algorithm for $C$.

An algorithm runs in time $t$ if it draws at most $t$ examples and requires at most $t$ time steps. A concept class is efficiently PAC learnable if it is PAC learnable by an algorithm that runs in time polynomial in $1/\epsilon$, $1/\delta$, and the instance length.
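To make the example oracle concrete, here is a minimal Python sketch (not part of the original formulation) of $\mathrm{EX}(c, D)$ for the interval problem; the particular concept, distribution, and function names are hypothetical choices for illustration.

```python
import random

def make_example_oracle(concept, distribution):
    """A sketch of EX(c, D): each call draws x ~ D and returns (x, c(x))."""
    def ex():
        x = distribution()           # draw an example from D
        y = 1 if concept(x) else 0   # correct label: 1 if x is in the concept, else 0
        return x, y
    return ex

# Hypothetical true concept: the interval [3, 5] on the real line.
c = lambda x: 3.0 <= x <= 5.0
# Hypothetical distribution D: uniform on [0, 10].
D = lambda: random.uniform(0.0, 10.0)

EX = make_example_oracle(c, D)
print([EX() for _ in range(5)])      # five labeled examples (x, c(x))
```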

Example: Learning intervals

Consider the aforementioned problem of learning intervals on the real line. We are given a set of $m$ labeled points $\{(x_i, y_i)\}_{i=1}^m$, where $x_i \in \mathbb{R}$ and $y_i = 1$ if $x_i \in [a, b]$ and $y_i = 0$ otherwise. The locations of these points are drawn from an unknown distribution $D$. Our goal is to provide an algorithm that, given such a set of points, outputs an interval that, with high probability, will be "close" to the true interval $[a, b]$, in the sense that the expected classification error on new points drawn from $D$ will be small. We will show that the following algorithm satisfies this property.

Algorithm: Given a set of labeled points $\{(x_i, y_i)\}_{i=1}^m$, output the interval $[\hat{a}, \hat{b}]$, where $\hat{a} = \min\{x_i : y_i = 1\}$ and $\hat{b} = \max\{x_i : y_i = 1\}$. To evaluate the performance of this simple algorithm, we prove the following theorem.
Theorem: For all $0 < \epsilon, \delta < 1/2$, if $m \geq \frac{2}{\epsilon} \ln \frac{2}{\delta}$ examples are given, then with probability $1 - \delta$ the interval returned by the above algorithm will be incorrect on less than an $\epsilon$ fraction of examples drawn from $D$. That is, the probability of an error will be less than $\epsilon$.
Proof: First, notice that the algorithm can only underestimate the true interval $[a, b]$ since it always returns the tightest interval containing the positively labeled points. We call an interval $[\hat{a}, \hat{b}] \subseteq [a, b]$ $\epsilon$-good if $\Pr_{x \sim D}[x \in [a, \hat{a})] < \epsilon/2$ and $\Pr_{x \sim D}[x \in (\hat{b}, b]] < \epsilon/2$, where the probability is taken with respect to the distribution $D$. Such a hypothesis will have error less than $\epsilon$ since
$\Pr_{x \sim D}[\text{error}] = \Pr_{x \sim D}\big[x \in [a, \hat{a}) \cup (\hat{b}, b]\big] \leq \Pr_{x \sim D}[x \in [a, \hat{a})] + \Pr_{x \sim D}[x \in (\hat{b}, b]] < \epsilon/2 + \epsilon/2 = \epsilon.$
Assuming that the total probability of the interval $[a, b]$ is greater than $\epsilon$, let $a'$ and $b'$ be the maximum left boundary and the minimum right boundary, respectively, of an $\epsilon$-good interval. By this choice, the regions $[a, a']$ and $[b', b]$ each carry probability mass at least $\epsilon/2$ under $D$.
Now we analyze the probability of returning an interval that is not $\epsilon$-good. This can happen only if none of the $m$ sample points fall in $[a, a']$ or none fall in $[b', b]$. Since the points are independently sampled, the probability of each of these events is at most $(1 - \epsilon/2)^m$, and, by a union bound, the probability of either event occurring is less than the sum of their probabilities, $2(1 - \epsilon/2)^m$.
To achieve our confidence guarantee, we solve for the number of examples $m$ necessary to guarantee that the interval will be $\epsilon$-good with probability at least $1 - \delta$. We want $2(1 - \epsilon/2)^m \leq \delta$, or $(1 - \epsilon/2)^m \leq \delta/2$. Using the inequality $1 - x \leq e^{-x}$, this is satisfied if $e^{-\epsilon m / 2} \leq \delta/2$, or equivalently $m \geq \frac{2}{\epsilon} \ln \frac{2}{\delta}$. This is the lower bound on $m$ required by the theorem, and thus concludes the proof.
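As an illustration of the algorithm and the sample-size bound above, the following Python sketch (not part of the original text) learns the tightest interval from $m \geq \frac{2}{\epsilon}\ln\frac{2}{\delta}$ examples and estimates its error on fresh draws; the true interval $[3, 5]$, the uniform distribution on $[0, 10]$, and the values of $\epsilon$ and $\delta$ are arbitrary choices.

```python
import math
import random

def learn_interval(samples):
    """Return the tightest interval containing the positively labeled points."""
    positives = [x for x, y in samples if y == 1]
    if not positives:                 # no positive examples: return an empty interval
        return 0.0, -1.0
    return min(positives), max(positives)

# Hypothetical setup: true concept [3, 5], D uniform on [0, 10].
a, b = 3.0, 5.0
draw = lambda: random.uniform(0.0, 10.0)
label = lambda x: 1 if a <= x <= b else 0

eps, delta = 0.1, 0.05
m = math.ceil((2.0 / eps) * math.log(2.0 / delta))   # sample size from the theorem

samples = [(x, label(x)) for x in (draw() for _ in range(m))]
a_hat, b_hat = learn_interval(samples)

# Estimate the generalization error of the learned interval on fresh draws from D.
test = [draw() for _ in range(100_000)]
err = sum((a_hat <= x <= b_hat) != (label(x) == 1) for x in test) / len(test)
print(f"m = {m}, learned interval = [{a_hat:.3f}, {b_hat:.3f}], estimated error = {err:.4f}")
```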

Occam's Razor


How does one go about finding an efficient learning algorithm for a given concept class? It turns out that a very simple principle known as Occam's Razor, first articulated by the fourteenth-century philosopher William of Occam, provides a guide. Simply finding a succinct hypothesis that is consistent with the given examples is enough to guarantee that the hypothesis will generalize to new examples as well. This statement can be formalized in terms of the cardinality of the hypothesis class (i.e. the number of potential hypotheses), yielding a general upper bound on the sample complexity of the form,

$m \geq \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right).$
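As a rough illustration of how this bound is used in practice (a sketch with hypothetical numbers, not from the original text), the following computes a sufficient number of examples for a consistent learner over a finite hypothesis class:

```python
import math

def occam_sample_bound(hypothesis_count, eps, delta):
    """Examples sufficient for a consistent learner over a finite class:
    m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((1.0 / eps) * (math.log(hypothesis_count) + math.log(1.0 / delta)))

# Hypothetical example: |H| = 2**20 hypotheses, 5% error, 99% confidence.
print(occam_sample_bound(2 ** 20, eps=0.05, delta=0.01))   # roughly 370 examples
```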

This can also be seen from an information theoretic perspective: if the examples can be compressed and represented by a succinct hypothesis, then in the PAC framework it can be shown that this hypothesis will, with high probability, generalize to new examples with low error. More details can be found in Occam Learning.

Reconsider the interval learning example above. The algorithm we gave shares the spirit of Occam learning and has a very similar sample complexity: we simply returned the smallest interval consistent with the given examples. However, the Occam learning theory does not seem to apply, since there are infinitely many hypotheses (intervals) to choose from; that is, the hypothesis class has infinite cardinality. Still, the concept class of closed intervals on the real line is really quite simple. To extend these ideas to real-valued concepts, we leverage the tools of Vapnik–Chervonenkis theory.

Relationship to Vapnik–Chervonenkis theory

Vapnik–Chervonenkis (VC) theory introduces the notion of the VC dimension of a concept class $C$. The VC dimension is the maximum number $d$ such that there exist $d$ points in $X$ that concepts in $C$ can classify in all $2^d$ possible ways. This can substitute for $\ln |C|$, the log cardinality of the concept class, in the Occam learning bounds discussed above. A similar upper bound of

$m = O\!\left(\frac{1}{\epsilon}\left(d \ln \frac{1}{\epsilon} + \ln \frac{1}{\delta}\right)\right)$

can be shown for the sample complexity of PAC-learning a concept class with VC dimension $d$. This upper bound can be shown to be tight to within a factor of $\ln \frac{1}{\epsilon}$, though the proof is nontrivial. For more details, see [4].
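As a concrete illustration (a sketch not in the original text): closed intervals on the real line have VC dimension 2, since any two distinct points can be labeled in all four possible ways by some interval, while no interval can label three ordered points positive, negative, positive. A brute-force check of the shattering condition:

```python
from itertools import product

def interval_shatters(points):
    """Check whether closed intervals [a, b] can realize every labeling of the points."""
    candidates = sorted(points)
    # Candidate endpoints: the points themselves plus values just outside them
    # (sufficient for closed intervals).
    cuts = [candidates[0] - 1.0] + candidates + [candidates[-1] + 1.0]
    for labeling in product([0, 1], repeat=len(points)):
        realized = any(
            all((1 if a <= x <= b else 0) == y for x, y in zip(points, labeling))
            for a in cuts for b in cuts
        )
        if not realized:
            return False
    return True

print(interval_shatters([1.0, 2.0]))        # True: two points can be shattered
print(interval_shatters([1.0, 2.0, 3.0]))   # False: the labeling (1, 0, 1) is unrealizable
```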

A similar relationship can be shown (under some regularity conditions) when $C$ is a Glivenko–Cantelli class.

Boosting

Boosting addresses the question of whether many weak learners can be combined to produce a strong learner. Concretely, if we have a learning algorithm that is weak in the sense that it returns a concept with error at most $\frac{1}{2} - \gamma$ for some $\gamma > 0$ (i.e. only slightly better than a random guess), then boosting seeks to use this weak learning algorithm to produce a learning algorithm that returns a concept with arbitrarily low error. It turns out that this is indeed possible. The most common means of boosting accuracy is to maintain a set of candidate hypotheses and weight them according to how well they perform on the training examples. These weights can be iteratively updated so that the weighted majority vote of the candidates achieves the desired accuracy, even though each individual candidate has poor accuracy on its own.
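The following is a minimal sketch of one well-known boosting scheme, AdaBoost with threshold (decision-stump) weak learners on one-dimensional data. It is offered only as an illustration of reweighting examples and combining weak hypotheses by a weighted vote, not as the specific construction referenced above; the data and parameters are hypothetical.

```python
import math
import random

def stump_learn(xs, ys, weights):
    """Weak learner: pick the threshold/sign pair with the lowest weighted error."""
    best = None
    for t in sorted(set(xs)):
        for sign in (+1, -1):
            preds = [sign if x <= t else -sign for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best  # (weighted error, threshold, sign)

def adaboost(xs, ys, rounds=20):
    """Combine weak stumps into a weighted-majority classifier."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []                          # list of (alpha, threshold, sign)
    for _ in range(rounds):
        err, t, sign = stump_learn(xs, ys, weights)
        if err >= 0.5:
            break                          # no weak hypothesis better than chance
        err = max(err, 1e-10)              # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # Reweight: increase the weight of misclassified examples, then renormalize.
        preds = [sign if x <= t else -sign for x in xs]
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]

    def classify(x):
        score = sum(alpha_i * (s if x <= thr else -s) for alpha_i, thr, s in ensemble)
        return 1 if score >= 0 else -1
    return classify

# Hypothetical data: labels in {-1, +1}, positive inside two disjoint intervals,
# so no single stump (and no single interval) classifies them well.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [1 if (2 <= x <= 4 or 6 <= x <= 8) else -1 for x in xs]
h = adaboost(xs, ys, rounds=30)
train_err = sum(h(x) != y for x, y in zip(xs, ys)) / len(xs)
print(f"training error of boosted classifier: {train_err:.3f}")
```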

Learning in the presence of noise

The PAC framework described above assumes that examples are labeled according to the true underlying concept. This is, arguably, the least realistic assumption since, in reality, labels are imperfect. When the labels are corrupted by noise, either random or malicious, we place even greater demands upon our learning algorithms. The simplest noise model is the random classification noise model, in which each label is independently flipped with probability $\eta < 1/2$. To formalize the concepts that can be learned in the presence of such noise, Kearns[5] introduced the statistical query model, a strictly weaker model than the PAC-learning model. Any concept class that can be efficiently learned using statistical queries is efficiently PAC learnable in the presence of classification noise.
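The following sketch (an illustration under assumed values, not the construction from the cited paper) shows the basic fact that makes statistical queries noise-tolerant: under random classification noise with known rate $\eta < 1/2$, the observed disagreement between a hypothesis and the noisy labels is $\eta + (1 - 2\eta)\,\mathrm{err}(h)$, so the true error can be recovered by inverting this relation.

```python
import random

def noisy_disagreement(h, c, draw, eta, n=200_000):
    """Estimate P[h(x) != noisy label] when each correct label c(x) is flipped w.p. eta."""
    count = 0
    for _ in range(n):
        x = draw()
        y = c(x) if random.random() >= eta else 1 - c(x)   # random classification noise
        count += (h(x) != y)
    return count / n

# Hypothetical setup: true concept [3, 5], hypothesis [3, 6], D uniform on [0, 10].
c = lambda x: 1 if 3.0 <= x <= 5.0 else 0
h = lambda x: 1 if 3.0 <= x <= 6.0 else 0
draw = lambda: random.uniform(0.0, 10.0)

eta = 0.2
observed = noisy_disagreement(h, c, draw, eta)
# Invert observed = eta + (1 - 2*eta) * err(h) to recover the true error (about 0.1 here).
recovered = (observed - eta) / (1 - 2 * eta)
print(f"observed disagreement = {observed:.3f}, recovered error estimate = {recovered:.3f}")
```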

There are other, more adversarial noise models as well, such as the malicious noise model, in which an adversary is allowed to corrupt each example with some probability $\beta$ in order to make the concept more challenging to learn. Here, the limits of what may be efficiently PAC learned are more severe.

Finally, though it is not noise per se, it is also likely that the true process generating the labels does not belong to our hypothesis class. In this case, the disagreement in labels is due to model misspecification. This problem is studied under the heading of agnostic learning. In this scenario, we attempt to make as few assumptions as possible about the underlying functions or distributions that give rise to our examples, and we seek the concept in our hypothesis class that is closest (according to some loss function) to the underlying concept. This is a particularly challenging learning model. In general, the concepts that can be agnostically learned are closely related to those that have the property of uniform convergence.
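As a small illustration of this setting (a sketch with hypothetical data, not from the original text), the empirical risk minimization approach simply returns the hypothesis in the class with the lowest empirical loss, even when no hypothesis fits the labels exactly; here the hypothesis class is a finite grid of intervals and the labels come from a process outside that class.

```python
import random

def erm_interval(samples, grid):
    """Return the interval (a, b) from a finite grid with the lowest empirical 0-1 loss."""
    def loss(a, b):
        return sum(((a <= x <= b) != (y == 1)) for x, y in samples) / len(samples)
    return min(((a, b) for a in grid for b in grid if a <= b), key=lambda ab: loss(*ab))

# Hypothetical misspecified setting: labels depend on two intervals,
# but the hypothesis class contains only single intervals.
random.seed(1)
data = []
for _ in range(500):
    x = random.uniform(0, 10)
    y = 1 if (2 <= x <= 4 or 7 <= x <= 8) else 0
    data.append((x, y))

grid = [i / 2 for i in range(21)]        # candidate endpoints 0.0, 0.5, ..., 10.0
a_hat, b_hat = erm_interval(data, grid)
print(f"best single interval: [{a_hat}, {b_hat}]")
```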


Biological applications of PAC learning

PAC theory unites the computational and statistical requirements of learning algorithms. Though these are most evident in machine learning, they are equally applicable to biological learning problems. Leslie Valiant has extended the PAC framework to the problems of evolution[6] and neural computation[7]. Though biological systems are not bound by the same constraints as computers, there is no reason to believe they can circumvent the pragmatic constraints of polynomial-time computability. In his book,[8] Valiant argues that the PAC framework provides important constraints for guiding our understanding of these biological processes.

References

  1. ^ Valiant, L. G. "A theory of the learnable." Communications of the ACM, 27(11), 1984, pp. 1134–1142.
  2. ^ Kearns, M. J., and Vazirani, U. V. An Introduction to Computational Learning Theory. MIT Press, 1994, pp. 1–12.
  3. ^ Natarajan, B. K. Machine Learning: A Theoretical Approach. Morgan Kaufmann Publishers, 1991.
  4. ^ Kearns, M. J., and Vazirani, U. V. An Introduction to Computational Learning Theory. MIT Press, 1994, Chapter 3.
  5. ^ Kearns, M. "Efficient noise-tolerant learning from statistical queries." Journal of the ACM, 45(6), 1998, pp. 983–1006.
  6. ^ Valiant, L. G. "Evolvability." Journal of the ACM, 56(1), 2009, Article 3.
  7. ^ Valiant, L. G. Circuits of the Mind. Oxford University Press, 1994.
  8. ^ Valiant, L. G. Probably Approximately Correct. Basic Books, 2013.

Further reading

Category:Computational learning theory