Contributions to decision tree induction:
bias/variance tradeoff and time series classification
Pierre Geurts, PhD Thesis, University of Liège, Belgium, May 2002
- The whole thesis in gzipped postscript (1.5M) or PDF (4.1M)
- Abstract
- Table of contents
- Databases
Abstract
Because of the rapid progress of computer and information technology, large amounts of data are nowadays available in almost every field of application. Automatic learning aims at producing synthetic high-level information, or models, from data. Learning algorithms are generally evaluated according to three criteria: interpretability (how well the model helps to understand the data), predictive accuracy (how well the model predicts unseen situations), and computational efficiency (how fast the algorithm is and how well it scales to very large databases). Data mining refers to the application of automatic learning and visualization techniques to help a human expert extract knowledge from large volumes of raw data. Unlike in other applications of automatic learning, in data mining usually only a small fraction of the available data is actually relevant, and there is seldom reliable domain knowledge to help extract relevant features or models. Indeed, the precise objective of data mining is to help create such knowledge from the available data.
Decision tree induction is a supervised learning algorithm that focuses on modeling input/output relationships in the form of if-then rules. It has been claimed that, among the various classes of learning methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining. Indeed, they are quite fast, produce relatively interpretable models, are immune to outliers, resistant to irrelevant variables, insensitive to variable rescaling, and can easily be extended to cope with a large variety of data types (numerical, symbolic, time series...). However, in terms of accuracy they are seldom competitive with other automatic learning algorithms, such as neural networks and nearest neighbors. It is commonly acknowledged that this suboptimality is mostly due to the excessive variance of this method. The variance of an automatic learning method measures the dependence of the models it produces on the random nature of the data set used (sampling and noise). In the context of decision trees, this variance affects all the parameters of the extracted rules and is detrimental in terms of both accuracy and interpretability. Furthermore, this variance tends to increase very strongly in applications such as time series classification, where the input variables correspond to a large number of low-level and, sometimes, very noisy features.
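For readers unfamiliar with the terminology, the classical bias/variance decomposition of the expected squared error in regression gives a precise meaning to "variance" here; the notation below (Bayes model f_B, model \hat{f}_{LS} learned from learning sample LS) is the usual one and is only an illustration, not a formula quoted from the thesis (Chapter 3 also covers the more delicate case of zero-one loss):

```latex
% Expected squared error at a fixed input x, averaged over learning
% samples LS and over the output noise:
\mathbb{E}_{LS,y}\!\left[(y - \hat{f}_{LS}(x))^2\right]
  = \underbrace{\mathbb{E}_{y}\!\left[(y - f_B(x))^2\right]}_{\text{residual (noise) error}}
  + \underbrace{\left(f_B(x) - \mathbb{E}_{LS}[\hat{f}_{LS}(x)]\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_{LS}\!\left[\left(\hat{f}_{LS}(x) - \mathbb{E}_{LS}[\hat{f}_{LS}(x)]\right)^2\right]}_{\text{variance}}
```

where f_B(x) is the Bayes model (the conditional expectation of y given x). The variance term, which vanishes only if the learned model does not depend on the particular learning sample drawn, is what the variance reduction techniques studied in the thesis aim to shrink.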
Within this overall context, this thesis explores two complementary issues: the reduction of the variance of decision tree based models and the problem of learning interpretable and accurate classification rules from time series data. All our developments and analyses are extensively validated empirically, through the estimation of variance, bias, and average accuracy on a diverse set of databases.
We start our investigations with an empirical study which shows quantitatively how large decision tree variance is, i.e. how strongly the induced trees depend on the random nature of the database used to infer them. These experiments confirm that the rules induced for a given problem are indeed highly variable from one learning sample to another, and thus that variance is detrimental not only to accuracy but also to interpretability. With the goal of improving both, we then investigate three complementary variance reduction techniques. First, we propose several methods to stabilize the parameters chosen during tree induction. While these methods succeed in reducing the variability of the parameters, they yield only a slight improvement in accuracy. Next, we consider perturb and combine algorithms, which aggregate the predictions of several models obtained by randomizing the learning process in some way. Among existing methods of this kind, the most popular ones, namely bagging and boosting, strongly improve tree accuracy but jeopardize interpretability and computational efficiency. Thus, motivated by the high variance of tree parameters, we propose a new algorithm based on extremely randomized trees (extra-trees). This method aggregates the predictions of many large trees, each one obtained by choosing most of its parameters fully at random. These extra-trees compare very favorably with both bagging and boosting in terms of variance reduction, accuracy, and computational efficiency. They are thus largely more accurate than standard trees, while remaining competitive with them in terms of computational efficiency. In order to also recover interpretability, we then propose a "dual" perturb and combine algorithm which randomizes input attributes at the prediction stage while using one single model. In combination with decision trees, this method actually yields soft decisions and eventually bridges the gap between single trees and the ensembles of trees produced by perturb and combine methods: it retains most of the interpretability of the former and part of the accuracy improvement of the latter.
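As a purely illustrative sketch, and not the exact procedure developed in Chapter 7, the core idea behind extra-trees can be summarized as follows: at each test node, the split attribute and its cut-point are drawn at random instead of being optimized on the learning sample, and a large ensemble of such unpruned trees is averaged at prediction time. Function names, data layout, and stopping rules below are assumptions made for the sake of the example:

```python
import numpy as np

def _leaf(y):
    """Leaf node: store the class frequencies observed in the node."""
    classes, counts = np.unique(y, return_counts=True)
    return {"leaf": dict(zip(classes.tolist(), (counts / len(y)).tolist()))}

def build_extra_tree(X, y, min_samples=2, rng=None):
    """Grow one extremely randomized tree (illustrative sketch): the
    split attribute and cut-point are drawn fully at random."""
    rng = np.random.default_rng() if rng is None else rng
    if len(y) < min_samples or len(np.unique(y)) == 1:
        return _leaf(y)
    a = int(rng.integers(X.shape[1]))      # random split attribute
    lo, hi = X[:, a].min(), X[:, a].max()
    t = rng.uniform(lo, hi)                # random cut-point in its range
    left = X[:, a] < t
    if left.all() or not left.any():       # degenerate split: stop here
        return _leaf(y)
    return {"attr": a, "thresh": t,
            "left": build_extra_tree(X[left], y[left], min_samples, rng),
            "right": build_extra_tree(X[~left], y[~left], min_samples, rng)}

def ensemble_predict(trees, x):
    """Average the class-probability estimates of all trees for input x
    and return the class with the highest averaged probability."""
    votes = {}
    for tree in trees:
        node = tree
        while "leaf" not in node:
            node = node["left"] if x[node["attr"]] < node["thresh"] else node["right"]
        for c, p in node["leaf"].items():
            votes[c] = votes.get(c, 0.0) + p / len(trees)
    return max(votes, key=votes.get)
```

A typical use would be `trees = [build_extra_tree(X, y) for _ in range(100)]` followed by `ensemble_predict(trees, x)`: because each tree is cheap to grow (no split optimization), the ensemble remains computationally attractive while the averaging reduces the variance of the individual, highly randomized trees.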
The last part of our work is a first attempt to design general algorithms for the specific but multifaceted problem of time series classification. The most direct way to solve this problem is to apply existing learning algorithms to low-level variables corresponding to the values of a time series at several time points. Experiments with the tree-based algorithms studied in the first part of the thesis show that this approach has its limitations. We thus design a variance reduction technique specific to this kind of data, which consists of aggregating the predictions given by a classification model applied to small subsequences of the original time series. This method is very successful in terms of accuracy but does not provide fully interpretable models. We therefore propose a second method which extends decision tree tests by allowing them to detect local, shift-invariant properties, or patterns, in time series. This method, although limited in scope, offers the possibility of extracting interpretable and accurate rules from time series.
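The segment and combine idea (Chapter 11) can be illustrated by the following minimal Python sketch. It assumes a scikit-learn-style base classifier exposing `fit`/`predict_proba`/`classes_`; the fixed window width, the exhaustive enumeration of subsequences, and the function names are simplifying assumptions for the example, not the thesis's exact algorithm:

```python
import numpy as np

def extract_subsequences(series, labels, width):
    """Training step: replace each series by its contiguous subsequences
    of length `width`, each labelled with the class of the full series."""
    X_sub, y_sub = [], []
    for s, c in zip(series, labels):
        for start in range(len(s) - width + 1):
            X_sub.append(s[start:start + width])
            y_sub.append(c)
    return np.array(X_sub), np.array(y_sub)

def classify_series(model, s, width):
    """"Combine" step: average the class probabilities that the
    subsequence classifier assigns to every window of the test series
    and return the class with the highest average probability."""
    windows = np.array([s[i:i + width] for i in range(len(s) - width + 1)])
    avg_proba = model.predict_proba(windows).mean(axis=0)
    return model.classes_[int(np.argmax(avg_proba))]
```

Here `model` is trained on the subsequence data, e.g. `X_sub, y_sub = extract_subsequences(train_series, train_labels, width)` followed by `model.fit(X_sub, y_sub)`; averaging over many overlapping windows is what provides the additional variance reduction on top of the base classifier.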
Table of Contents
- Summary
- Acknowledgements
- 1 Introduction
- 1.1 Context and motivations
- 1.1.1 Automatic learning
- 1.1.2 Supervised learning
- 1.1.3 The bias/variance tradeoff in supervised learning
- 1.1.4 Decision tree induction
- 1.1.5 Temporal data
- 1.2 Objectives of the thesis and our way to solutions
- 1.2.1 Improvements of decision tree induction
- 1.2.2 Treatment of temporal data
- 1.3 Organization of the thesis
- 1.4 Contributions and originality
- PART I Bias and variance in automatic learning
- 2 Supervised automatic learning
- 2.1 Definitions and notation
- 2.1.1 Universe, objects, attributes
- 2.1.2 Learning sample
- 2.1.3 Supervised learning problem
- 2.1.4 Bayes model and residual error
- 2.2 Learning algorithms
- 2.2.1 Hypothesis space
- 2.2.2 Empirical risk minimization
- 2.2.3 Problem specific background knowledge
- 2.3 Main classes of supervised learning algorithms
- 2.3.1 Linear models
- 2.3.2 Artificial neural networks
- 2.3.3 K-Nearest Neighbors
- 2.3.4 Decision and regression trees
- 2.3.5 Naive Bayes and Bayesian networks
- 2.4 Evaluation of learning algorithms
- 2.5 Evaluation of accuracy of models
- 3 Bias and variance in supervised learning
- 3.1 Why more general is not necessarily better
- 3.2 Bias and variance decompositions
- 3.2.1 Regression error
- 3.2.2 Classification error
- 3.2.3 Variance of probability estimates
- 3.2.4 Bias and variance estimation (bootstrap method)
- 3.3 Empirical study of bias and variance
- 3.3.1 Hypothesis space size
- 3.3.2 The learning problem: Bayes model and noise
- 3.3.3 Learning sample size
- 3.3.4 Learning algorithm
- 3.4 Two other kinds of variances
- 3.4.1 Error variance
- 3.4.2 Parameter variance
- 3.5 Analytical results related to bias and variance
- 3.5.1 Linear models
- 3.5.2 K-NN
- 3.5.3 Single hidden layer perceptrons
- 3.5.4 Statistical learning theory
- 3.6 Discussion
- 4 Variance reduction techniques
- 4.1 Introduction
- 4.2 Controlling the bias-variance tradeoff of a method
- 4.2.1 General framework
- 4.2.2 Examples in the context of different learning algorithms
- 4.2.3 Related approaches for model selection
- 4.3 Aggregation techniques
- 4.3.1 Bootstrap aggregating (Bagging)
- 4.3.2 Randomized algorithms
- 4.3.3 General perturb and combine algorithm
- 4.3.4 Examples of P&C algorithms
- 4.3.5 Other aggregation techniques
- 4.4 Conclusion
- PART II Bias and variance in decision tree induction
- 5 Variance in decision tree induction
- 5.1 Introduction
- 5.2 Description of the algorithm
- 5.2.1 General principles of decision tree growing
- 5.2.2 Particular implementation
- 5.2.3 Alternative score measures
- 5.3 Experimental protocol
- 5.4 Prediction variance
- 5.4.1 Percentage of error due to variance
- 5.4.2 Variance as a function of learning set size
- 5.4.3 Variance as a function of tree complexity
- 5.4.4 Discussion
- 5.5 Variance sources in decision tree induction
- 5.6 Parameter variance: node splitting
- 5.6.1 Discretization threshold
- 5.6.2 Attribute choice
- 5.6.3 Global effect of the score measure on decision trees
- 5.7 Conclusion
- 6 Reducing the variance of decision trees
- 6.1 Introduction
- 6.2 Pruning
- 6.2.1 Description of pruning algorithms
- 6.2.2 Validation of pruning
- 6.2.3 Discussion
- 6.3 Discretization threshold stabilization
- 6.3.1 Stabilization methods
- 6.3.2 Validation on threshold variance
- 6.3.3 Validation on global accuracy
- 6.3.4 Attribute selection
- 6.3.5 Discussion
- 6.4 Conclusion
- 7 Perturbing and combining decision trees
- 7.1 Introduction
- 7.2 Bagging
- 7.2.1 Experiments
- 7.2.2 Bias/variance interpretation
- 7.2.3 Geometrical interpretation
- 7.3 Specific P&C methods for decision tree induction
- 7.4 Extra-trees: extremely randomized trees
- 7.4.1 The algorithm
- 7.4.2 Empirical validation
- 7.4.3 Discussion
- 7.5 The question of bias reduction with P\&C algorithms
- 7.6 Conclusion
- 8 Dual perturb and combine or soft decision trees
- 8.1 Introduction
- 8.2 Proposed algorithm
- 8.2.1 Generic dual P&C algorithm
- 8.2.2 Closed-form dual P&C of decision trees
- 8.2.3 Impact of the noise level on bias and variance
- 8.3 Empirical study
- 8.4 Relationship with other methods
- 8.4.1 Connection with model averaging techniques
- 8.4.2 Combination of dual P&C and bagging
- 8.4.3 Other soft tree models
- 8.5 Conclusion
- 9 Closure of Part II
- 9.1 On the classes of methods investigated
- 9.2 On the criteria and their evaluation methodology
- 9.3 On the relative ranking of methods
- 9.3.1 From the viewpoint of bias, variance, and average accuracy
- 9.3.2 From the viewpoint of computational efficiency
- 9.4 About complementary investigations
- PART III Knowledge extraction from time series
- 10 Supervised learning with temporal data
- 10.1 Motivation
- 10.2 Problem statement
- 10.2.1 General framework for temporal supervised learning
- 10.2.2 Classification of temporal supervised learning problems
- 10.2.3 Time series classification
- 10.3 Peculiarities of time series classification problems
- 10.4 Test problems
- 10.5 Existing approaches for time series classification
- 10.6 Conclusion
- 11 Time series classification with low-level attributes
- 11.1 Introduction
- 11.2 Simple sampling approaches
- 11.2.1 Bias/variance study
- 11.2.2 Further experiments
- 11.2.3 Discussion
- 11.3 Segment and combine approach
- 11.3.1 Algorithm
- 11.3.2 Bias/variance study
- 11.3.3 Further experiments
- 11.3.4 Interpretability
- 11.3.5 Related work
- 11.3.6 Discussion
- 11.4 Conclusion
- 12 Time series classification by pattern extraction
- 12.1 Introduction
- 12.2 Proposed algorithm
- 12.2.1 Intuitive presentation
- 12.2.2 Elementary and complex patterns and detection tests
- 12.2.3 Piecewise constant modeling with regression trees
- 12.2.4 Search strategy for candidate tests
- 12.2.5 Computational complexity
- 12.2.6 Discussion
- 12.3 Validation of the method
- 12.3.1 Accuracy
- 12.3.2 Computing times
- 12.3.3 Interpretability
- 12.4 Related work
- 12.5 Conclusions and future research directions
- 13 Conclusions and future work
- 13.1 Conclusions
- 13.2 Future work directions
- 13.2.1 Short term investigations
- 13.2.2 Long term research
- APPENDIX
- A Bias/variance decompositions of zero-one loss functions
- A.1 Bias/Variance decompositions
- A.1.1 Tibshirani [tibshirani-biasvar-96] and James and Hastie [james-1996-biasvar]
- A.1.2 Breiman in [breiman-biasvar-1996]
- A.1.3 Breiman in [breiman-randomoutput-2000]
- A.1.4 Kong and Dietterich [dietterich-biasvar-1995]
- A.1.5 Domingos [domingos-2000-biasvar-icml,domingos-2000-biasvar-aaai]
- A.1.6 Kohavi and Wolpert [kohavi-biasvar-1996]
- A.1.7 Heskes [heskes-1998-biasvar]
- A.2 Discussion
- B Datasets
- B.1 Classification problems
- B.2 Temporal classification problems
- B.2.1 CBF problem
- B.2.2 Cbf-tr
- B.2.3 Two-pat
- B.2.4 CC
- C Experimental results of part II
- C.1 Results of Chapter 5
- C.2 Results of Chapter 6
- C.3 Results of Chapter 7
- C.4 Results of Chapter 8
- D Pattern detection algorithm
- D.1 Problem statement
- D.2 Optimal matching algorithm
- D.2.1 Backwards recursion
- D.2.2 Forward sequential algorithm
- D.2.3 Computational complexity
- Bibliography
- Author Index