An annotated path to start with Machine Learning

Machine learning is becoming increasingly widespread; day after day, new computer scientists and engineers are jumping into this beautiful world. Unfortunately, the number of theories, algorithms, applications, papers, books, videos, and so forth is so huge to disorient whoever doesn’t have a clear picture of what he wants/needs to learn to improve his/her skills.

In this short post, I wanted to share my experiences, suggesting a feasible path to learn the essential concepts quickly and be ready to go deeper into the most complex topics. Of course, this is only a personal proposal: every student can choose to dedicate more attention to some topics that are more interesting based on his/her experience.

“Do not worry about your difficulties in Mathematics. I can assure you mine are still greater.”

(A. Einstein)

Prerequisites

Machine Learning is based on Mathematics. It’s not an optional, theoretical approach; it’s a fundamental pillar that cannot be discarded. If you are a computer engineer working daily with UML, ORM, Design Patterns, and many other software engineering tools/techniques, close your eyes and, for one second, forget almost everything. This doesn’t mean that all those concepts aren’t necessary. They are! But Machine Learning needs a different approach. One of the reasons why Python has become more and more popular in this field is its “prototyping speed.” In Machine Learning, a language that allows you to model an algorithm with a few lines of code (without classes, interfaces, and all other OO infrastructures) is a must.

Calculus, Probability Theory, and Linear Algebra are necessary mathematical skills for almost any algorithm. You can skip this section if you already have an excellent mathematical background. Otherwise, it’s a good idea to refresh some essential concepts. Considering the number of theories, I discourage starting with big manuals: even if they can also be used when looking for particular concepts, initially, it’s better to focus on a simple subset of topics. Many good online resources (like Coursera, Khan Academy, or Udacity, to name a few) adopt a pragmatic approach suitable for any background. However, I suggest using a brief compendium, where the most important concepts are explained, and searching and studying new elements whenever needed. This isn’t a systematic approach, but the alternative has a dramatic drawback: many concepts can discourage and disorient everyone without a solid academic background.

An acceptable starting “recipe” can be:

- Probability theory
  - Discrete and continuous random variables
  - Main distributions (Bernoulli, Categorical, Binomial, Normal, Exponential, Poisson, Beta, Gamma)
  - Moments
  - Bayes statistics
  - Correlation and Covariance
- Linear algebra
  - Vectors and matrices
  - The determinant of a matrix
  - Eigenvectors and eigenvalues
  - Matrix factorization (like SVD)
- Calculus
  - Real functions
  - Derivatives, Integrals
  - Main numerical methods

There are many free resources on the web, like:

- Grinstead, Snell, Introduction to Probability, Swarthmore and Dartmouth Colleges
- Wise, Gallagher, An Introduction to Linear Algebra (with MATLAB examples), Columbia
- Heinbockel, Introduction to Calculus Vol. I, Old Dominion University
- Bendersky E., Understanding Gradient Descent
- Bonaccorso G., A Brief (and Comprehensive) Guide to Stochastic Gradient Descent Algorithms
- Bonaccorso G., Mastering Machine Learning Algorithms, Packt, 2018

Wikipedia is also an excellent resource; many formulas, theories, and theorems are explained clearly and understandably.

Talking Learning Machine by Mattel in the Readers Digest November 1967, an ancestor of modern Machine Learning systems — Advertising of the Talking Learning Machine by Mattel in the Readers Digest November 1967

A Machine Learning path proposal (for very beginners)

Feature Engineering

The first step to jumping into Machine Learning is understanding how to measure and improve the quality of datasets. Managing categorical and missing features, normalization, and dimensionality reduction (PCA, ICA, NMF) are fundamental techniques that can dramatically improve the performance of every algorithm. It’s also helpful in studying how to split the datasets into training and test sets and how to adopt Cross-Validation instead of classical test methods.

A good tutorial on Principal Component Analysis is:

- Shlens J., A Tutorial On Principal Component Analysis, UCSD

If you’re interested in how it’s possible to exploit the Hebbian Learning for PCA, I’ve published these articles:

- ML Algorithms Addendum: Hebbian Learning
- PCA with Rubner-Tavan Networks

Numpy: the king of Python mathematics!

When working with Python, Numpy is much more than a library. It’s the foundation of almost any machine learning implementation, and it’s essential to know how it works, focusing the attention on the concepts of vectorization and broadcasting. Through these techniques, it’s possible to speed up the learning process of most algorithms, exploiting the power of multithreading and SIMD and MIMD architectures.

The official documentation is complete. However, I also suggest these resources:

- VanderPlas J., Python Data Science Handbook: Essential Tools for Working with Data, O’Reilly
- LangTangen P. H., A Primer on Scientific Programming with Python, Springer

Data visualization

Knowing how to visualize the data sets is essential, even if it’s not a purely Machine Learning topic. Matplotlib is probably the most diffused solution: it’s easy to use and allows plotting different charts. Bokeh and Seaborne offer exciting alternatives. It’s not necessary to have complete knowledge of all packages, but it’s helpful to know the strengths/weaknesses of each of them to be able to pick up the right package when needed.

An excellent resource to learn Matplotlib details is:

- McGreggor D., Mastering Matplotlib, Packt Publishing

Linear Regression

Linear regression is one of the simplest models and can be studied by considering it an optimization problem that can be solved by minimizing the mean squared error. This approach is practical but limits the possibilities that can be exploited. I also suggest studying it as a Bayesian problem, where the parameters are represented using prior probabilities (Gaussian-distributed, for example), and the optimization becomes an MLE (Maximum Likelihood Estimation). Even if it seems more complex, this approach offers a new vision shared by dozens of more complex models.

A handy introduction to Bayesian Statistics is available on Coursera:

- Bayesian Statistics: From Concept to Data Analysis
- Bayesian Statistics: Techniques and Models

I also suggest these books:

- Downey B. A., Think Bayes, O’Reilly
- Davidson-Pilon C., Bayesian Methods for Hackers, Addison-Wesley

Linear Classification

Logistic Regression is typically the best starting point. It’s also an excellent opportunity to study some Information Theory and to understand the power of concepts like entropy, cross-entropy, and mutual information (I’ve also written the post: ML Algorithms addendum: Mutual information in classification tasks). Categorical cross-entropy is the most stable and diffused cost function in deep learning classification, and a simple logistic regression can show how it can speed up the learning process (compared to the mean squared error). Another critical topic is regularization (in particular, Ridge, Lasso, and ElasticNet).

Too often, it is considered an “esoteric” way to improve the accuracy of a model, but its real meaning is much more precise and should be understood with some concrete examples. I also suggest considering a Logistic Regression as a simple neural network, visualizing (for 2D examples) how the weight vector moves during the learning process.

I also suggest to include the Hyperparameter Grid Search methods in this section. Instead of trying different values without complete awareness, Grid Search allows evaluation of the performances of different hyperparameter sets. The engineer can, therefore, focus his/her attention only on the combinations that yield the highest accuracy.

If you’re interested in some important concepts regarding the Fisher Information, I’ve written this post:

- ML Algorithms Addendum: Fisher Information

Support Vector Machines

Support Vector Machines offer a different approach to classification (both linear and nonlinear). The algorithm is straightforward and can be learned by every student with a basic knowledge of geometry. However, it’s beneficial to understand how kernel-SVMs work because their real power is shown in tasks where linear methods fail.

Some helpful free resources:

- Law, A Simple Introduction to Support Vector Machines, Michigan State University
- Kernel methods on Wikipedia
- Linearly Separable? No? For me, it is! A Brief Introduction to Kernel Methods

Decision Trees

Decision Trees offer another approach to classification and regression. In general, they are not the first choice for very complex problems. However, they offer a completely different approach, which can be easily understood even by non-technical people and visualized during meetings or official presentations.

A good tutorial (easy) on Decision Trees is:

- Kak A., Decision Trees: How to construct them and how to use them for classifying new data, Purdue University

Classification metrics

Evaluating the performance of a classifier can be more complex than expected. The overall accuracy is a good measure, but evaluating the behavior with false positives and negatives is often necessary. I suggest dedicating time to studying Precision, Recall, F-Score, and ROC curves. They can dramatically change the way a model is considered acceptable or not. Pay attention to Recall, which measures the impact of false negatives on accuracy. Having good precision but an imperfect recall means that your model generates many false negatives (think about this in a medical environment). F(beta)-Score is a good trade-off between precision and recall.

The majority of Linear Classification algorithms, SVM, and Classification metrics are explained in:

- Bonaccorso G., Machine Learning Algorithms (Second Edition), Packt, 2018

A quick glimpse into Ensemble Learning

After understanding the dynamics of a Decision Tree, it’s helpful to study methods where sets (ensembles) of trees are trained together to improve the overall accuracy. Random Forests, Gradient Tree Boosting, and AdaBoost are powerful algorithms with reasonably low complexity. It’s interesting to compare the learning process of a simple tree and the ones adopted by boosting and bagging methods. Scikit-Learn provides the most common implementations, but if you want to exploit the full power of these approaches, I suggest dedicating some time to study XGBoost, which is a distributed framework that can work both with CPUs and GPUs, speeding up the training process even with massive datasets.

Two valid tutorials on Ensemble Learning are:

- Baskin I., Marcou G., Varnek A., Tutorial on Ensemble Learning, Facultè de Chimie de Strasbourg
- Dietterich T. G., Ensemble Methods in Machine Learning, Oregon State University
- Bonaccorso G., Mastering Machine Learning Algorithms, Packt, 2018

Clustering

When starting with clustering methods, I think the best thing to do is consider the Gaussian Mixture algorithm (based on EM, Expectation-Maximization). Even if K-Means is relatively more straightforward (and must be studied), Gaussian Mixtures offers a pure Bayesian approach, which is helpful for many similar tasks. Other algorithms that must be studied include Hierarchical Clustering, Spectral Clustering, and DBSCAN. It’s also helpful to understand the idea of instance-based learning by studying the k-Nearest Neighbors algorithm (which can be employed for supervised and unsupervised tasks).

An excellent introduction to KMeans is Introduction to K-Means Clustering

A valuable free resource on Spectral Clustering is:

- Von Luxburg U., A tutorial on Spectral Clustering, Max-Planck Institute

Clustering metrics

Clustering metrics are slightly more empirical because their semantics can change with the context. However, I suggest studying the Silhouette plots and some ground truth methods (like the adjusted Rand score). They can provide you with a complete insight into the structure of the clustering process, showing all those situations when hyperparameter tuning is probably necessary.

An exciting resource on cluster stability is:

- Von Luxburg U., Cluster stability: an Overview, arXiv:1007.1075
- Assessing clustering optimality with instability index

The majority of clustering algorithms and metrics are exposed in:

- Bonaccorso G., Machine Learning Algorithms (Second Edition), Packt, 2018

Neural Networks for Beginners

Neural networks are the basis of Deep Learning and should be studied in a separate course. However, I think it’s helpful to understand the concepts of Perceptron, Multi-Layer Perceptron, and the Backpropagation algorithm. Scikit-Learn offers a straightforward implementation of Neural Networks. However, it might be a good idea to explore Keras, a high-level framework based on Tensorflow, PyTorch, or CNTK that allows modeling and training neural networks with a minimum initial effort.

Some good reshttps://wehttps://www.microsoft.com/en-us/cognitive-toolkit/ources to start with Neural Networks are:

- Hassoun M, Fundamentals of Artificial Neural Networks, The MIT Press
- Gulli A., Pal S., Deep Learning with Tensorflow and Keras, Packt Publishing
- Bonaccorso G., Mastering Machine Learning Algorithms, Packt, 2018

The best Deep Learning book (advanced) available on the market is probably:

- Goodfellow I., Bengio Y., Courville A., Deep Learning, The MIT Press

Moreover, I suggest “playing” with the Tensorflow Playground, where it’s possible to simulate a complex neural network, tune its parameters, and observe the resulting output.

Path image copyright by TheRitters.

If you like this post, you can always donate to support my activity! One coffee is enough!

Share this post on: