UCSC Machine Learning Research Projects
Worst Case Analysis of On-line Learning by Experts (bibliography)
Worst Case Analysis of On-line Learning by Experts (more complete bibliography in postscript)
UCSC Technical Reports
UCSC Machine Learning FTP site
Some current papers, in postscript form:
Worst case prediction over sequences under log loss
[180K postscript]
Opper, M and D. Haussler (1997)
in The Mathematics of Information Coding, Extraction
and Distribution, Springer Verlag,
Edited by G. Cybenko, D. O'Leary and J. Rissanen.
We consider the game of sequentially assigning probabilities
to future data based on past observations under logarithmic loss.
We are not making probabilistic assumptions about the generation
of the data, but consider a situation where a
player tries to minimize his loss relative to the loss of
the (with hindsight) best distribution
from a target class for the worst sequence of data.
We give bounds on the minimax regret in terms of the
metric entropies of the target class with respect to
suitable distances between distributions.
Metric Entropy and Minimax Risk in Classification
[245k postscript]
D. Haussler and M. Opper (1997),
Lecture Notes in Computer Science: Studies in Logic
and Computer Science, a selection of essays in honor of Andrzej Ehrenfeucht,
Vol. 1261, 212-235 (1997),
Eds. J. Mycielski, G. Rozenberg and A. Salomaa
We apply recent results on the minimax risk in density estimation to the
related problem of pattern classification.
The notion of loss we seek to minimize is an information theoretic
measure of how well we can predict the classification of future examples,
given the classification of previously seen examples. We give an asymptotic
characterization of the minimax risk in terms of the metric entropy
properties of the class of distributions that might be generating
the examples. We then use these results to characterize the minimax
risk in the special case of noisy two-valued classification problems in terms
of the Assouad density and the Vapnik-Chervonenkis dimension.
Scale-sensitive Dimensions, Uniform Convergence, and Learnability
[226k postscript]
,
J. ACM 44 (4) 615-631 (1997) N. Alon, S. Ben-David and N. Cesa-Bianchi, David Haussler
Learnability in Valiant's PAC learning model has been
shown to be strongly related to the existence of uniform laws
of large numbers. These laws define a distribution-free convergence
property of means to expectations uniformly over classes
of random variables. Classes of real-valued functions enjoying
such a property are also known as uniform Glivenko-Cantelli classes.
In this paper we prove, through a generalization of
Sauer's lemma that may be interesting in its own right,
a new characterization of uniform Glivenko-Cantelli classes.
Our characterization yields Dudley, Gin\'e, and Zinn's previous
characterization as a corollary. Furthermore, it is the first based
on a simple combinatorial quantity generalizing the Vapnik-Chervonenkis
dimension. We apply this result to obtain the weakest combinatorial condition
known to imply PAC learnability in the statistical regression (or
``agnostic'') framework.
Furthermore, we show a characterization of learnability in the
probabilistic concept model, solving an open problem posed by Kearns and
Schapire. These results show that the accuracy parameter plays a crucial role
in determining the effective complexity of the learner's hypothesis class.
"Exponentially many local minima for single neurons" [267k postscript]
Peter Auer, Mark Herbster and Manfred Warmuth, Neural Information
Processing Systems 1996
We show that for a single neuron with the
logistic function as the transfer function
the number of local minima of the error function
based on the square loss can grow exponentially
in the dimension.
"Tracking the Best Expert"
Mark Herbster and Manfred Warmuth, Proceeding of Machine Learning 1995
We generalize the recent worst-case loss bounds
for on-line algorithms where the additional loss
of the algorithm on the whole sequence of examples
over the loss of the best expert is bounded.
The generalization allows the sequence to be partitioned
into segments and the goal is
to bound the additional loss of the algorithm
over the sum of the losses of the best experts
of each segment.
This is to model situations
in which the examples change and different experts are best
for certain segments of the sequence of examples.
In the single expert case the additional loss
is proportional to $\log n$, where $n$ is the number of experts
and the constant of proportionality depends on the loss function.
When the number of segments is at most $k+1$ and the sequence of
length $\ell$
then we can bound the additional loss of our algorithm over
the best partitioning by
$O(k \log n + k \log(\ell/k))$. Note that it takes the
same order of bits to denote
the sequence of experts and the boundaries of the segments.
When the loss per trial is bounded by one then we obtain
additional loss bounds that are independent of the length
of the sequence. The bound becomes $O(k\log n+ k \log(L/k))$,
where $L$ is the loss of the best partition into $k+1$ segments.
Our algorithms for tracking the best expert are simple
adaptations of Vovk's original algorithm for the single best expert
case. These algorithms keep one weight per expert and
spend $O(1)$ time per weight in each trial.
Mutual Information, Metric Entropy, and Risk in Estimation of
Probability Distributions [472k postscript] ,
David Haussler and Manfred Opper.
$\{P_{Y|\theta}: \theta \in \Theta\}$ is a set of probability distributions
(with a common dominating measure)
on a complete separable metric space $Y$. A state $\theta^* \in \Theta$ is
chosen by Nature. A statistician gets $n$ independent observations
$Y_1, \ldots, Y_n$ distributed according to $P_{Y|\theta^*}$ and
produces an estimated distribution $\hat{P}$ for $P_{Y|\theta^*}$.
The statistician suffers a loss based on a measure of the distance between
the estimated distribution and the true distribution.
We examine the Bayes and minimax risk of this game for various loss functions,
including the relative entropy, the squared Hellinger distance, and the
$L_1$ distance. We also look at the cumulative
relative entropy risk over the distributions estimated during the first $n$
observations. Here the Bayes risk is the mutual information
between the random parameter $\Theta^*$ and the observations
$Y_1, \ldots, Y_n$.
New bounds on this mutual information are given in terms of the Laplace
transform of the Hellinger distance between $P_{Y|\theta}$ and
$P_{Y|\theta^*}$. From these, bounds on the minimax risk are given
in terms of the metric entropy of $\Theta$
with respect to the Hellinger distance. The assumptions required
for these bounds are very general and do not depend
on the choice of the dominating measure. They apply to both finite and
infinite dimensional $\Theta$. They apply in some cases where $Y$ is
infinite dimensional, in some cases where $Y$ is not
compact, in some cases where the distributions are not smooth,
and in some parametric cases where asymptotic normality of the posterior
distribution fails.
Exponentiated Gradient versus Gradient Descent for Linear Predictors.
[624k postscript] , Jyrki Kivenen and Manfred K. Warmuth.
The Perceptron algorithm vs. Winnow: linear vs. logarithmic mistake
bounds when few input variables are relevant. [417k postscript],
Jyrki Kivenen and Manfred K. Warmuth.
We give an adversary strategy that forces
the Perceptron algorithm to make $(N-k+1)/2$ mistakes
when learning $k$-literal disjunctions over $N$ variables.
Experimentally we see that even for simple random data, the number of
mistakes made by the Perceptron algorithm grows almost
linearly with $N$,
even if the number $k$ of relevant variable remains a small constant.
In contrast, Littlestone's algorithm
Winnow makes at most
%$O(k(1+\log(N/k))$ mistakes for the same problem.
$O(k\log N)$ mistakes for the same problem.
Both algorithms use
thresholded linear functions
as their
hypotheses. However, Winnow does multiplicative updates
to its weight vector instead of the additive updates of the
Perceptron algorithm.
"Efficient Learning with Virtual Threshold Gates," [237k postscript]
By Wolfgang Maass (TU GRaz) and Manfred K. Warmuth (UC Santa Cruz)
We reduce learning simple geometric concept classes to learning
disjunctions over exponentially many variables. We then apply
an on-line algorithm called Winnow whose number of prediction
mistakes grows only logarithmically with the number of variables.
The hypotheses of Winnow are linear threshold functions
with one weight per variable. We find ways to keep the exponentially
many weights of Winnow implicitly so that the time for the algorithm
to compute a prediction and update its ``virtual'' weights is
polynomial.
Our method can be used to learn d-dimensional axis-parallel boxes
when d is variable, and unions of d-dimensional axis-parallel boxes
when d is constant. The worst-case number of mistakes of our
algorithms for the above classes is optimal to within a constant
factor, and our algorithms inherit the noise robustness of Winnow.
We think that other on-line algorithms with multiplicative weight
updates whose loss bounds grow logarithmically with the dimension
are amenable to our methods.
"How to Use Expert Advice"
[542k postscript]
Nicolo Cesa-Bianchi, Yoav Freund, David P. Helmbold, David Haussler,
Robert E. Schapire, and Manfred K. Warmuth. [540k postscript]
We analyze algorithms that predict a binary value by combining the
predictions of several prediction strategies, called {\em experts}.
Our analysis is for worst-case situations, i.e., we make no assumptions
about the way the sequence of bits to be predicted is generated.
We measure the performance of the algorithm by the difference between
the expected number of mistakes it makes on the bit sequence
and the expected number of mistakes made by the best expert on this
sequence, where the expectation is taken with respect to the
randomization in the predictions. We show that the minimum achievable
difference is on the order of the square root of the number of mistakes
of the best expert, and we give efficient algorithms
that achieve this. Our upper and lower bounds have matching
leading constants in most cases. We then show how this leads to
certain kinds of pattern recognition/learning algorithms with
performance bounds that improve on the best results currently
known in this context. We also extend our analysis to the case in
which log loss is used instead of the expected number of mistakes.
"Bounds for Predictive Errors in the Statistical
Mechanics of Supervised Learning", Manfred Opper and David Haussler,
published in Physical Review Letters, number 20, volume 75, 1995,
pp. 3772-3775 [120k postscript]
Within a Bayesian framework, by generalizing inequalities
known from statistical mechanics, we calculate general upper and
lower bounds for a cumulative entropic error, which measures the success
in the supervised learning of an unknown rule from examples.
This performance measure is equivalent to the mutual information between the
data and the parameter that specifies the rule to be learnt.
Both bounds match asymptotically, when the number $m$ of observed data grows
large. Under mild conditions, we find that the information gain
from observing a new example decreases universally like $d/m$.
Here $d$ is a dimension that is defined from the scaling of small
volumes with respect to a suitable distance in the space of rules.
"General Bounds on the Mutual Information between a parameter
and n conditionally independent observations",
David Haussler and Manfred Opper, to appear in the
Proceedings of the 8th Conference
on Computational Learning Theory, Santa Cruz, July, 1995,
Published by ACM Press [208k postscript]
Each parameter $\theta$ in an abstract parameter space $\Theta$ is
associated
with a different probability distribution on a set $Y$. A parameter
$\theta$ is chosen at random from $\Theta$ according to some {\em a priori}
distribution on $\Theta$, and $n$ conditionally independent random variables
$Y^n = Y_1, \ldots Y_n$
are observed with common distribution determined by $\theta$.
We obtain bounds on the mutual information between the random
variable $\Theta$, giving the choice of parameter, and the
random variable $Y^n$, giving the sequence of observations. We also
bound the supremum of the mutual information, over choices of the prior
distribution on $\Theta$.
These quantities have applications in density estimation, computational
learning theory, universal coding, hypothesis testing, and portfolio
selection theory.
The bounds are given in terms of the metric and information
dimensions of the parameter space $\Theta$ with respect to the
Hellinger distance.
"A General Minimax Result
for Relative Entropy", David Haussler, Submitted to IEEE
Transactions on Information Theory [154k postscript]
Suppose nature picks a probability measure
$P_\theta$ on a complete separable metric space $X$ at random from
a measurable set $P_\Theta = \{P_\theta : \theta \in \Theta\}$.
Then, without knowing
$\theta$, a statistician picks a measure $Q$ on $X$.
Finally, the statistician suffers a loss $D(P_\theta||Q)$, the
relative entropy between $P_\theta$ and $Q$.
We show that the minimax and maximin
values of this game are always equal, and there is always a minimax strategy
in the closure of the set of all Bayes strategies.
This generalizes previous results of Gallager, and Davisson and Leon-Garcia.
"Rigorous Learning Curve Bounds
from Statistical Mechanics",
David Haussler, Michael Kearns, and H. Sebastian Seung,
Submitted to Machine Learning [747k postscript]
In this paper we introduce and investigate
a mathematically rigorous theory of learning curves
that is based on ideas from statistical mechanics.
The advantage of our theory over the well-established
Vapnik-Chervonenkis theory is that our bounds can be considerably
tighter in many cases, and are more reflectve of the true behavior
(functional form) of learning curves. This behavior can
often exhibit dramatic properties such as phase transitions,
as well as power law asymptotics not explained by the VC theory.
The disadvantages of our theory are that its application requires
knowledge of the input distribution, and it is limited so far
to finite cardinality function classes. We illustrate
our results with many concrete examples of
learning curve bounds derived from our theory.
Return to ML page
Last modified: Jan 4 1996
maintained by Bruce Sherrod /
sherrod@cse.ucsc.edu