ELEN0062 - Introduction to machine learning (iML)
With too little data, you won’t be able to make any conclusions that you trust. With loads of data you will find relationships that aren’t real… Big data isn’t about bits, it’s about talent.
| / | 18 Sep. 2019 | |
| Assignment | 25 Sep. 2019 | |
| Q&A | 02 Oct. 2019 | |
| Q&A | 09 Oct. 2019 | |
| Q&A | 16 Oct. 2019 | |
| Deadline | 20 Oct. 2019 | Don't forget to submit your first assignment. |
| | 23 Oct. 2019 | |
| Feedback | 13 Nov. 2019 | |
| | 17 Nov. 2019 | Don't forget to submit your second assignment. |
| Assignment | 20 Nov. 2019 | See below for information regarding the third assignment. |
| | 27 Nov. 2019 | [Setup] Find a group, register for the third assignment, register on Kaggle, download the data, make the toy submission. |
| | 13 Dec. 2019 | End of challenge. |
| | 15 Dec. 2019 | Don't forget to submit your report regarding the challenge. |
Third assignment: the challenge
The third project is organized in the form of a challenge, where you will compete against each other. All the relevant information can be found on the Kaggle platform, which hosts the challenge.
The project is divided into four parts. All the deadlines can be found in the schedule section above.
- Setup for the project
- Propose the best model you can before the competition deadline.
- Submit an archive in tar.gz format on the submission platform, containing your source code and a report that describes the different steps of your approach and your main results. Use the same ids as for the Kaggle platform. The report must contain the following information:
  - A detailed description of all the approaches that you used to win the challenge, including the feature engineering you performed.
  - A detailed description of your hyper-parameter optimization approach and your model validation technique.
  - A detailed description of how you proceeded to estimate the AUC of your final models, and a comparison with the actual value.
  - A table summarizing the performance of your different approaches, containing for each approach at least its name, the validation score, and the scores on the public and private leaderboards.
  - Any complementary information or figures that you want to mention.
- Succinctly present your approach to the rest of the class. (More information coming soon)
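The report asks how the AUC of the final models was estimated. One common way, sketched below, is stratified k-fold cross-validation with scikit-learn. The synthetic dataset, the random forest and the number of folds are placeholder assumptions for illustration only; they are not the challenge data or a recommended model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the challenge data (assumption: binary labels).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Placeholder model; any classifier exposing predict_proba or
# decision_function can be scored the same way.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring="roc_auc" computes the AUC on each held-out fold.
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("AUC per fold:", auc_scores.round(3))
print("Estimated AUC: %.3f +/- %.3f" % (auc_scores.mean(), auc_scores.std()))
```

The mean over the folds is the estimate to compare with the leaderboard score, and the spread across folds gives a rough idea of how much that estimate can vary.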
How to present data
Presenting data well is key to efficient communication. Here are a few pointers:
- How to Present Scientific Data
- A few additional thoughts
- A more thorough tour
- How To Present Research Data?
- Principles of data visualization
There are many ways to install Python on a computer and get all the libraries needed. One quick way is to install Anaconda, which comes with all the libraries we will need.
- Get the Anaconda installer for your operating system. Make sure you install a Python 3.5+ version.
- Open a Python console:
  - From a Unix command line, launch the Python interpreter.
  - Or open the spyder IDE, which comes with Anaconda.
  - Or use the ipython interpreter, which is much easier to work with.
- Import the main libraries and print their versions. If there is no error, the installation went fine:
    import numpy as np
    import pandas as pd
    import sklearn
    import scipy

    print(np.__version__)
    print(pd.__version__)
    print(sklearn.__version__)
    print(scipy.__version__)
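The check above stops at the first failing import. A slightly more forgiving variant, sketched here, tries each library in turn and reports which ones are missing. The helper name `check_libraries` and the `REQUIRED` list are illustrative assumptions, not part of the course material.

```python
import importlib

# The four libraries used in the installation check (assumed list).
REQUIRED = ["numpy", "pandas", "sklearn", "scipy"]


def check_libraries(names):
    """Return a dict mapping each name to its version, or None if missing."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The library is not installed in this environment.
            versions[name] = None
        else:
            # Some modules do not expose __version__; fall back to a label.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in check_libraries(REQUIRED).items():
        status = version if version is not None else "MISSING"
        print("{:10s} {}".format(name, status))
```

Any line reporting MISSING points to a library that still has to be installed.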
Cheat sheet for ML in Python
Check out datacamp for more.
Here is a very sparse list of supplementary material related to the field of machine learning. I tend to update this section when I come across interesting stuff, but if you feel like you need more material on some topic, do not hesitate to ask!
Machine learning in general
There is plenty of accessible online material in the domain of machine learning:
- Andrew Ng's online course (Stanford): The most popular online course on ML. Archived from Coursera.
- Pedro Domingos' online course (Washington).
- Reza Shadmehr (Baltimore) and his slides.
- Jeffrey Ullman's course on mining massive datasets (Stanford), based on his reference book. Not everything is related to the course though.
Artificial neural networks
There have been three waves of hype about artificial neural networks (ANNs). The first one was about perceptrons in the 60s, until it was shown that a single perceptron cannot solve the XOR problem. The second started with the discovery of backpropagation, but it soon became clear that large and/or deep neural nets were very hard to train. We are in the midst of the third one right now with "deep learning": neural nets with several (many) hidden layers. As a consequence, the internet is bursting with resources on the topic, from the simplest models (multi-layer perceptrons) to the most advanced architectures (such as GANs), going through more classical ones (such as ConvNets and LSTMs).
- Graham Taylor: An Overview of Deep Learning and Its Challenges for Technical Computing (2014)
- Geoffrey Hinton: Introduction to Deep Learning and Deep Belief Nets (2012)
- Geoffrey Hinton: The Next Generation of Neural Networks (2007)
- Leon Bottou: Multilayer Networks series
- A simplified version of Backprop illustrated.
- An illustrated taxonomy of learning networks.
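To make the XOR limitation mentioned in this section concrete, here is a small pure-Python sketch. It searches a coarse grid of weights for a single threshold unit that reproduces XOR (none is found), then shows a hand-built network with one hidden layer that does. The grid bounds and the hand-picked weights are illustrative assumptions; the grid search is a demonstration, not a proof.

```python
from itertools import product

# XOR truth table: inputs mapped to the target output.
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def step(z):
    """Threshold activation: fires only for strictly positive input."""
    return 1 if z > 0 else 0


def perceptron(x1, x2, w1, w2, b):
    """Single threshold unit with two inputs."""
    return step(w1 * x1 + w2 * x2 + b)


def solves_xor(predict):
    """True if predict matches XOR on all four input pairs."""
    return all(predict(x1, x2) == target for (x1, x2), target in XOR.items())


# Coarse grid of candidate weights and biases: -4.0, -3.5, ..., 4.0.
grid = [i / 2 for i in range(-8, 9)]
single_layer_hits = sum(
    solves_xor(lambda x1, x2: perceptron(x1, x2, w1, w2, b))
    for w1, w2, b in product(grid, repeat=3)
)


def two_layer(x1, x2):
    """One hidden layer: XOR = OR and not AND (hand-picked weights)."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)


print("Single units solving XOR on the grid:", single_layer_hits)
print("Two-layer outputs:", [two_layer(x1, x2) for (x1, x2) in XOR])
```

The single unit fails because XOR is not linearly separable, while the hidden layer builds two intermediate features (OR and AND) that make the final decision linear.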
Learning theory (Bias/Variance...)
Support Vector Machines
- Visualizing the kernel trick
- A couple of videos about constrained optimization (by Khan Academy):
Misc.
There are many YouTube channels about ML. Here are a few:
- sentdex: A bit of everything
- Derek Kane: A bit of everything
- Welch Labs: A few videos about Neural Nets
- Two Minute Papers: Many articles relate to (applications of) ML
- Siraj Raval (this guy is crazy)
- Introductory online course on ML (covers linear/logistic regression, decision trees/random forests, the basics of neural networks, and clustering).
Machine learning requires a solid background in maths, especially in linear algebra, (advanced) probability theory and (multivariable) calculus. There are even more resources on those than on deep learning. Here is a short selection, which emphasizes intuition.
- 3Blue1Brown series on linear algebra
- If you prefer paper (or PDF): Practical Linear Algebra: A Geometry Toolbox, 2nd Edition, by Gerald Farin and Dianne Hansford. A K Peters/CRC Press (2004)