UCSC Machine Learning Research Projects


Worst Case Analysis of On-line Learning by Experts (bibliography)

Worst Case Analysis of On-line Learning by Experts (more complete bibliography in postscript)

UCSC Technical Reports

UCSC Machine Learning FTP site


Some current papers, in postscript form:

Worst case prediction over sequences under log loss [180K postscript]

Opper, M and D. Haussler (1997) in The Mathematics of Information Coding, Extraction and Distribution, Springer Verlag, Edited by G. Cybenko, D. O'Leary and J. Rissanen.

We consider the game of sequentially assigning probabilities to future data based on past observations under logarithmic loss. We are not making probabilistic assumptions about the generation of the data, but consider a situation where a player tries to minimize his loss relative to the loss of the (with hindsight) best distribution from a target class for the worst sequence of data. We give bounds on the minimax regret in terms of the metric entropies of the target class with respect to suitable distances between distributions.

Metric Entropy and Minimax Risk in Classification [245k postscript]

D. Haussler and M. Opper (1997), Lecture Notes in Computer Science: Studies in Logic and Computer Science, a selection of essays in honor of Andrzej Ehrenfeucht, Vol. 1261, 212-235 (1997), Eds. J. Mycielski, G. Rozenberg and A. Salomaa

We apply recent results on the minimax risk in density estimation to the related problem of pattern classification. The notion of loss we seek to minimize is an information theoretic measure of how well we can predict the classification of future examples, given the classification of previously seen examples. We give an asymptotic characterization of the minimax risk in terms of the metric entropy properties of the class of distributions that might be generating the examples. We then use these results to characterize the minimax risk in the special case of noisy two-valued classification problems in terms of the Assouad density and the Vapnik-Chervonenkis dimension.

Scale-sensitive Dimensions, Uniform Convergence, and Learnability [226k postscript] , J. ACM 44 (4) 615-631 (1997) N. Alon, S. Ben-David and N. Cesa-Bianchi, David Haussler

Learnability in Valiant's PAC learning model has been shown to be strongly related to the existence of uniform laws of large numbers. These laws define a distribution-free convergence property of means to expectations uniformly over classes of random variables. Classes of real-valued functions enjoying such a property are also known as uniform Glivenko-Cantelli classes. In this paper we prove, through a generalization of Sauer's lemma that may be interesting in its own right, a new characterization of uniform Glivenko-Cantelli classes. Our characterization yields Dudley, Gin\'e, and Zinn's previous characterization as a corollary. Furthermore, it is the first based on a simple combinatorial quantity generalizing the Vapnik-Chervonenkis dimension. We apply this result to obtain the weakest combinatorial condition known to imply PAC learnability in the statistical regression (or ``agnostic'') framework. Furthermore, we show a characterization of learnability in the probabilistic concept model, solving an open problem posed by Kearns and Schapire. These results show that the accuracy parameter plays a crucial role in determining the effective complexity of the learner's hypothesis class.

"Exponentially many local minima for single neurons" [267k postscript] Peter Auer, Mark Herbster and Manfred Warmuth, Neural Information Processing Systems 1996

We show that for a single neuron with the logistic function as the transfer function the number of local minima of the error function based on the square loss can grow exponentially in the dimension.

"Tracking the Best Expert" Mark Herbster and Manfred Warmuth, Proceeding of Machine Learning 1995

We generalize the recent worst-case loss bounds for on-line algorithms where the additional loss of the algorithm on the whole sequence of examples over the loss of the best expert is bounded. The generalization allows the sequence to be partitioned into segments and the goal is to bound the additional loss of the algorithm over the sum of the losses of the best experts of each segment. This is to model situations in which the examples change and different experts are best for certain segments of the sequence of examples. In the single expert case the additional loss is proportional to $\log n$, where $n$ is the number of experts and the constant of proportionality depends on the loss function. When the number of segments is at most $k+1$ and the sequence of length $\ell$ then we can bound the additional loss of our algorithm over the best partitioning by $O(k \log n + k \log(\ell/k))$. Note that it takes the same order of bits to denote the sequence of experts and the boundaries of the segments. When the loss per trial is bounded by one then we obtain additional loss bounds that are independent of the length of the sequence. The bound becomes $O(k\log n+ k \log(L/k))$, where $L$ is the loss of the best partition into $k+1$ segments. Our algorithms for tracking the best expert are simple adaptations of Vovk's original algorithm for the single best expert case. These algorithms keep one weight per expert and spend $O(1)$ time per weight in each trial.

Mutual Information, Metric Entropy, and Risk in Estimation of Probability Distributions [472k postscript] , David Haussler and Manfred Opper.

$\{P_{Y|\theta}: \theta \in \Theta\}$ is a set of probability distributions (with a common dominating measure) on a complete separable metric space $Y$. A state $\theta^* \in \Theta$ is chosen by Nature. A statistician gets $n$ independent observations $Y_1, \ldots, Y_n$ distributed according to $P_{Y|\theta^*}$ and produces an estimated distribution $\hat{P}$ for $P_{Y|\theta^*}$. The statistician suffers a loss based on a measure of the distance between the estimated distribution and the true distribution. We examine the Bayes and minimax risk of this game for various loss functions, including the relative entropy, the squared Hellinger distance, and the $L_1$ distance. We also look at the cumulative relative entropy risk over the distributions estimated during the first $n$ observations. Here the Bayes risk is the mutual information between the random parameter $\Theta^*$ and the observations $Y_1, \ldots, Y_n$. New bounds on this mutual information are given in terms of the Laplace transform of the Hellinger distance between $P_{Y|\theta}$ and $P_{Y|\theta^*}$. From these, bounds on the minimax risk are given in terms of the metric entropy of $\Theta$ with respect to the Hellinger distance. The assumptions required for these bounds are very general and do not depend on the choice of the dominating measure. They apply to both finite and infinite dimensional $\Theta$. They apply in some cases where $Y$ is infinite dimensional, in some cases where $Y$ is not compact, in some cases where the distributions are not smooth, and in some parametric cases where asymptotic normality of the posterior distribution fails.

Exponentiated Gradient versus Gradient Descent for Linear Predictors. [624k postscript] , Jyrki Kivenen and Manfred K. Warmuth.

The Perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. [417k postscript], Jyrki Kivenen and Manfred K. Warmuth.

We give an adversary strategy that forces the Perceptron algorithm to make $(N-k+1)/2$ mistakes when learning $k$-literal disjunctions over $N$ variables. Experimentally we see that even for simple random data, the number of mistakes made by the Perceptron algorithm grows almost linearly with $N$, even if the number $k$ of relevant variable remains a small constant. In contrast, Littlestone's algorithm Winnow makes at most %$O(k(1+\log(N/k))$ mistakes for the same problem. $O(k\log N)$ mistakes for the same problem. Both algorithms use thresholded linear functions as their hypotheses. However, Winnow does multiplicative updates to its weight vector instead of the additive updates of the Perceptron algorithm.

"Efficient Learning with Virtual Threshold Gates," [237k postscript] By Wolfgang Maass (TU GRaz) and Manfred K. Warmuth (UC Santa Cruz)

We reduce learning simple geometric concept classes to learning disjunctions over exponentially many variables. We then apply an on-line algorithm called Winnow whose number of prediction mistakes grows only logarithmically with the number of variables. The hypotheses of Winnow are linear threshold functions with one weight per variable. We find ways to keep the exponentially many weights of Winnow implicitly so that the time for the algorithm to compute a prediction and update its ``virtual'' weights is polynomial. Our method can be used to learn d-dimensional axis-parallel boxes when d is variable, and unions of d-dimensional axis-parallel boxes when d is constant. The worst-case number of mistakes of our algorithms for the above classes is optimal to within a constant factor, and our algorithms inherit the noise robustness of Winnow. We think that other on-line algorithms with multiplicative weight updates whose loss bounds grow logarithmically with the dimension are amenable to our methods.

"How to Use Expert Advice" [542k postscript] Nicolo Cesa-Bianchi, Yoav Freund, David P. Helmbold, David Haussler, Robert E. Schapire, and Manfred K. Warmuth. [540k postscript]

We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called {\em experts}. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also extend our analysis to the case in which log loss is used instead of the expected number of mistakes.

"Bounds for Predictive Errors in the Statistical Mechanics of Supervised Learning", Manfred Opper and David Haussler, published in Physical Review Letters, number 20, volume 75, 1995, pp. 3772-3775 [120k postscript]

Within a Bayesian framework, by generalizing inequalities known from statistical mechanics, we calculate general upper and lower bounds for a cumulative entropic error, which measures the success in the supervised learning of an unknown rule from examples. This performance measure is equivalent to the mutual information between the data and the parameter that specifies the rule to be learnt. Both bounds match asymptotically, when the number $m$ of observed data grows large. Under mild conditions, we find that the information gain from observing a new example decreases universally like $d/m$. Here $d$ is a dimension that is defined from the scaling of small volumes with respect to a suitable distance in the space of rules.

"General Bounds on the Mutual Information between a parameter and n conditionally independent observations", David Haussler and Manfred Opper, to appear in the Proceedings of the 8th Conference on Computational Learning Theory, Santa Cruz, July, 1995, Published by ACM Press [208k postscript]

Each parameter $\theta$ in an abstract parameter space $\Theta$ is associated with a different probability distribution on a set $Y$. A parameter $\theta$ is chosen at random from $\Theta$ according to some {\em a priori} distribution on $\Theta$, and $n$ conditionally independent random variables $Y^n = Y_1, \ldots Y_n$ are observed with common distribution determined by $\theta$. We obtain bounds on the mutual information between the random variable $\Theta$, giving the choice of parameter, and the random variable $Y^n$, giving the sequence of observations. We also bound the supremum of the mutual information, over choices of the prior distribution on $\Theta$. These quantities have applications in density estimation, computational learning theory, universal coding, hypothesis testing, and portfolio selection theory. The bounds are given in terms of the metric and information dimensions of the parameter space $\Theta$ with respect to the Hellinger distance.

"A General Minimax Result for Relative Entropy", David Haussler, Submitted to IEEE Transactions on Information Theory [154k postscript]

Suppose nature picks a probability measure $P_\theta$ on a complete separable metric space $X$ at random from a measurable set $P_\Theta = \{P_\theta : \theta \in \Theta\}$. Then, without knowing $\theta$, a statistician picks a measure $Q$ on $X$. Finally, the statistician suffers a loss $D(P_\theta||Q)$, the relative entropy between $P_\theta$ and $Q$. We show that the minimax and maximin values of this game are always equal, and there is always a minimax strategy in the closure of the set of all Bayes strategies. This generalizes previous results of Gallager, and Davisson and Leon-Garcia.

"Rigorous Learning Curve Bounds from Statistical Mechanics", David Haussler, Michael Kearns, and H. Sebastian Seung, Submitted to Machine Learning [747k postscript]

In this paper we introduce and investigate a mathematically rigorous theory of learning curves that is based on ideas from statistical mechanics. The advantage of our theory over the well-established Vapnik-Chervonenkis theory is that our bounds can be considerably tighter in many cases, and are more reflectve of the true behavior (functional form) of learning curves. This behavior can often exhibit dramatic properties such as phase transitions, as well as power law asymptotics not explained by the VC theory. The disadvantages of our theory are that its application requires knowledge of the input distribution, and it is limited so far to finite cardinality function classes. We illustrate our results with many concrete examples of learning curve bounds derived from our theory.

Return to ML page


Last modified: Jan 4 1996
maintained by Bruce Sherrod / sherrod@cse.ucsc.edu