Pierre Geurts' home page

Research

I carry out research in machine learning. I'm mainly interested in the design of (computationally and statistically efficient) supervised and semi-supervised learning algorithms in order to exploit structured input and output spaces (sequences, images, time-series, graphs), with applications in bioinformatics, computer vision, and computer networks.

Please find below a (non-exhaustive) list of representative research themes with their main references (more or less in chronological order):

(last update: January, 2010. See here for a more complete list of publications)

Machine learning:
- Decision trees and ensemble methods
  During my PhD thesis and afterwards, I developed several tree-based ensemble methods. Among them are the dual perturb and combine method, which simulates the averaging effect of ensembles with only one model, and the extremely randomized trees algorithm, a random forest-like method that goes further in terms of randomization. The latter method has been used quite extensively by our group and others, notably in the context of image classification (see below).
  Geurts, P. (2002). Contributions to decision tree induction: bias/variance tradeoff and time series classification. Unpublished doctoral thesis, University of Liège, Belgium.
  
  Geurts, P., & Wehenkel, L. (2005). Closed-form dual perturb and combine for tree-based models. Proceedings of the International Conference on Machine Learning (ICML 2005).
  
  Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3-42.
- Time series classification and structured inputs
  During my PhD thesis, I developed several techniques for time series classification, among which the "segment and combine" approach. This latter method has since been generalized to image classification and other structured input problems.
  Geurts, P. (2002). Contributions to decision tree induction: bias/variance tradeoff and time series classification. Unpublished doctoral thesis, University of Liège, Belgium.
  
  Geurts, P. (2001). Pattern extraction for time-series classification. Proceedings of PKDD 2001, 5th European Conference on Principles of Data Mining and Knowledge Discovery (pp. 115-127). Freiburg: Springer-Verlag.
  
  Geurts, P., Marée, R., & Wehenkel, L. (2006). Segment and combine: a generic approach for supervised learning of invariant classifiers from topologically structured data. Proceedings of the Machine Learning Conference of Belgium and The Netherlands (Benelearn) (pp. 15-23).
- Computer vision
  Together with Raphaël Marée, Justus Piater, and Louis Wehenkel, I have developed an original method for image classification based on the extraction of random subwindows from the images and their classification with ensembles of extremely randomized trees:
  Marée, R., Geurts, P., Piater, J., & Wehenkel, L. (2005). Random Subwindows for Robust Image Classification. In C. Schmid, S. Soatto, & C. Tomasi (Eds.), Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2005) (pp. 34-40).
  This method has been subsequently extended for image annotation and image retrieval:
  Dumont, M., Marée, R., Wehenkel, L., & Geurts, P. (2009). Fast Multi-Class Image Annotation with Random Subwindows and Multiple Output Randomized Trees. Proc. International Conference on Computer Vision Theory and Applications (VISAPP) (pp. 196-203).
  
  Marée, R., Geurts, P., & Wehenkel, L. (2009). Content-based Image Retrieval by Indexing Random Subwindows with Randomized Trees. IPSJ Transactions on Computer Vision and Applications, 1.
  
  Marée, R., Denis, P., Wehenkel, L., & Geurts, P. (2010). Incremental Indexing and Distributed Image Search using Shared Randomized Vocabularies. ACM Proceedings MIR 2010.
- Reinforcement learning
  I participated in the development of the fitted Q iteration algorithm that uses extremely randomized trees as function approximators:
  Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503-556.
- Structured outputs
  During a postdoc in Florence d'Alché-Buc's group, we proposed an extension of standard regression trees to handle kernelized output spaces. The approach can be used for learning an approximation of a kernel as a function of some input features, as well as for handling structured output problems.
  Geurts, P., Wehenkel, L., & d'Alché-Buc, F. (2006). Kernelizing the output of tree-based methods. Proceedings of the 23rd International Conference on Machine Learning (pp. 345-352). ACM.
  
  Geurts, P., Wehenkel, L., & d'Alché-Buc, F. (2007). Gradient boosting for kernelized output spaces. ACM International Conference Proceeding Series (Proceedings of the 24th International Conference on Machine Learning) (pp. 289-296).
- Feature selection
  We are currently working on the development of methods for improving the interpretability of feature ranking techniques and hence helping in the determination of a relevance threshold in these rankings.
  Huynh-Thu, V. A., Wehenkel, L., & Geurts, P. (2008). Exploiting tree-based variable importances to selectively identify relevant variables. Proc. of FSDM08, ECML/PKDD Workshop on New challenges for feature selection in data mining and knowledge discovery.
- Parallel and large-scale machine learning
  Since 2010, we have been working on the development of parallel and large-scale machine learning algorithms:
  Louppe, G., & Geurts, P. (2010, December 11). A zealous parallel gradient descent algorithm. Paper presented at NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds, Whistler, Canada.
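To give a rough idea of the extra randomization mentioned under "Decision trees and ensemble methods" above, here is a minimal sketch of how a single node split can be chosen in the extremely randomized way: a few candidate features are drawn at random, one cut-point per feature is drawn uniformly between the local minimum and maximum, and the best of these random splits is kept. This is an illustration only, not the reference implementation: the function and parameter names (`random_split`, `k`) are made up for this example, and the score (variance reduction, for a regression target) is an assumption of the sketch.

```python
import random


def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def random_split(X, y, k, rng):
    """Pick one node split with random features and random cut-points.

    X: list of rows (lists of floats), y: list of float targets,
    k: number of candidate features drawn at random (hypothetical name),
    rng: a random.Random instance.
    Returns (feature_index, threshold, score), or None if no feature varies.
    """
    n_features = len(X[0])
    # Only features that are not constant in this node can be split on.
    candidates = [j for j in range(n_features)
                  if min(row[j] for row in X) < max(row[j] for row in X)]
    if not candidates:
        return None
    best = None
    for j in rng.sample(candidates, min(k, len(candidates))):
        column = [row[j] for row in X]
        # The cut-point is drawn uniformly at random, not optimized.
        threshold = rng.uniform(min(column), max(column))
        left = [t for v, t in zip(column, y) if v < threshold]
        right = [t for v, t in zip(column, y) if v >= threshold]
        # Score: reduction of the output variance achieved by the split.
        score = variance(y) - (len(left) * variance(left)
                               + len(right) * variance(right)) / len(y)
        if best is None or score > best[2]:
            best = (j, threshold, score)
    return best
```

Calling `random_split(X, y, k, random.Random(0))` on a small dataset returns one randomly drawn split. A full tree would apply the same step recursively to the two resulting subsets, and an ensemble would average the predictions of many such trees.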
Bioinformatics
- Mass spectrometry
  We have developed an approach based on tree-based ensemble methods for the identification of protein biomarkers and predictive models from mass spectrometry data. The paper about the methodology:
  Geurts, P., Fillet, M., De Seny, D., Meuwis, M.-A., Malaise, M., Merville, M.-P., & Wehenkel, L. (2005). Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics, 21(14), 3138-45.
  and some biomedical applications:
  De Seny, D., Fillet, M., Meuwis, M.-A., Geurts, P., Lutteri, L., Ribbens, C., Bours, V., Wehenkel, L., Piette, J., Malaise, M., & Merville, M.-P. (2005). Discovery of new rheumatoid arthritis biomarkers using the surface-enhanced laser desorption/ionization time-of-flight mass spectrometry ProteinChip approach. Arthritis and Rheumatism, 52(12), 3801-12.
  
  Meuwis, M.-A., Fillet, M., Geurts, P., De Seny, D., Lutteri, L., Chapelle, J.-P., Bours, V., Wehenkel, L., Belaiche, J., Malaise, M., Louis, E., & Merville, M.-P. (2007). Biomarker discovery for inflammatory bowel disease, using proteomic serum profiling. Biochemical Pharmacology, 73(9), 1422-1433.
- Supervised inference of biological networks
  During a postdoc in Florence d'Alché-Buc's group, we applied the output kernel tree approach to the inference of protein-protein interaction and metabolic networks in yeast. We are currently working on the application of these techniques to other kinds of networks.
  Geurts, P., Touleimat, N., Dutreix, M., & d'Alché-Buc, F. (2007). Inferring biological networks with output kernel trees. BMC Bioinformatics, 8(Suppl. 2), 4.
- Gene regulatory network inference
  We have developed a method called GENIE3 for the inference of gene regulatory networks from expression data. This method was the best performer in the DREAM4 (multifactorial track) and DREAM5 network inference challenges.
  Huynh-Thu, V. A., Irrthum, A., Wehenkel, L., & Geurts, P. (2010). Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. PLoS ONE, 5(9), 12776.
- Genome-wide association studies
  With Vincent Botta and Louis Wehenkel, I am working on extending decision tree-based methods to deal with SNP data in the context of genome-wide association studies:
  Botta, V., Hansoul, S., Geurts, P., & Wehenkel, L. (2008). Raw genotypes vs haplotype blocks for genome wide association studies by random forests. Proc. of MLSB 2008, second workshop on Machine Learning in Systems Biology.
Other applications:
- Networking
  Since 2004, I have collaborated with the RUN team (Prof. Guy Leduc) on the application of machine learning techniques in networking. Two recent references:
  El Khayat, I., Geurts, P., & Leduc, G. (2010). Enhancement of TCP over wired/wireless networks with packet loss classifiers inferred by supervised learning. Wireless Networks, 16(2), 273-290.
  
  Liao, Y., Geurts, P., & Leduc, G. (2010). Network Distance Prediction Based on Decentralized Matrix Factorization. Lecture Notes in Computer Science, 6091, 15-26.
- Power systems
  In the past, I worked on the application of machine learning techniques in power systems:
  Del Angel, A., Geurts, P., Ernst, D., Glavic, M., & Wehenkel, L. (2007). Estimation of rotor angles of synchronous machines using artificial neural networks and local PMU-based quantities. Neurocomputing, 70(16-18), 2668-2678.
  
  Ernst, D., Glavic, M., Geurts, P., & Wehenkel, L. (2005). Approximate value iteration in the reinforcement learning context. Application to electrical power system control. International Journal of Emerging Electrical Power Systems, 3(1).
Review papers
I wrote two review papers: one about the bias/variance tradeoff, as part of a handbook on data mining and knowledge discovery, and one, with Alexandre Irrthum and Louis Wehenkel, about decision tree-based methods and their application in computational and systems biology.
Geurts, P. (2005). Bias vs. variance decomposition for regression and classification. In O. Maimon & L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers.

Geurts, P., Irrthum, A., & Wehenkel, L. (2009). Supervised learning with decision tree-based methods in computational and systems biology. Molecular Biosystems, 5(12), 1593-1605.
