# An annotated path to start with Machine Learning

*“Do not worry about your difficulties in Mathematics. I can assure you mine are still greater.”*

###### (A. Einstein)

Machine Learning is becoming more and more widespread and, day after day, new computer scientists and engineers begin their long jump into this wonderful world. Unfortunately, the number of theories, algorithms, applications, papers, books, videos and so forth is so huge that it can disorient anyone who doesn’t have a clear picture of what he/she wants or needs to learn to improve his/her skills. In this short post, I want to share my experience, suggesting a feasible path to quickly learn the essential concepts and be ready to go deeper into the most complex topics. Of course, this is only a personal proposal: every student can choose to dedicate more attention to the topics that are most interesting based on his/her experience.

### Prerequisites

Machine Learning is based on Mathematics. This is not an optional, theoretical extra: it’s a fundamental pillar that cannot be discarded. If you are a computer engineer, working daily with UML, ORM, Design Patterns and many other software engineering tools/techniques, close your eyes and, for one second, forget almost everything. This doesn’t mean that all those concepts aren’t important. They are! But Machine Learning needs a different approach. One of the reasons why Python has become more and more popular in this field is its “prototyping speed”. In Machine Learning, a language that allows you to model an algorithm with a few lines of code (without classes, interfaces and all the other OO infrastructure) is absolutely a must.

Calculus, Probability Theory and Linear Algebra are necessary mathematical skills for almost any algorithm. If you already have a good mathematical background, you can skip this section; otherwise, it’s a good idea to refresh some important concepts. Considering the number of theories, I discourage starting with big manuals: even if they can be useful when looking for particular concepts, in the beginning it’s better to focus on a simple subset of topics. There are many good online resources (like Coursera, Khan Academy or Udacity, just to name a few) which adopt a pragmatic approach suitable for any background. However, my suggestion is to use a brief compendium, where the most important concepts are explained, and to go on by searching for and studying new elements whenever they are needed. This isn’t a very systematic approach, but the alternative has a serious drawback: the huge number of concepts can discourage and disorient anyone without a solid academic background.

An acceptable starting “recipe” can be:

**Probability theory**

- Discrete and continuous random variables
- Main distributions (Bernoulli, Categorical, Binomial, Normal, Exponential, Poisson, Beta, Gamma)
- Moments
- Bayesian statistics
- Correlation and Covariance

**Linear algebra**

- Vectors and matrices
- Determinant of a matrix
- Eigenvectors and eigenvalues
- Matrix factorization (like SVD)

**Calculus**

- Real functions
- Derivatives, Integrals
- Main numerical methods

There are many free resources on the web, like:

- Grinstead, Snell, Introduction to Probability, Swarthmore and Dartmouth Colleges
- Wise, Gallagher, An Introduction to Linear Algebra (with MATLAB examples), Columbia
- Heinbockel, Introduction to Calculus Vol. I, Old Dominion University
- Bendersky E., Understanding Gradient Descent
- Bonaccorso G., A Brief (and Comprehensive) Guide to Stochastic Gradient Descent Algorithms
- Bonaccorso G., Mastering Machine Learning Algorithms, Packt, 2018

Wikipedia is also a very good resource: many formulas, theories and theorems are explained there in a clear and comprehensible way.

### A Machine Learning path proposal (for absolute beginners)

#### Feature Engineering

The very first step to jump into Machine Learning is understanding how to measure and improve the quality of datasets. Managing categorical and missing features, normalization and dimensionality reduction (PCA, ICA, NMF) are fundamental techniques that can dramatically improve the performance of many algorithms. It’s also useful to study how to split datasets into training and test sets and how to adopt Cross-Validation instead of classical test methods.
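
As a minimal sketch of how these steps can be chained, the following example uses scikit-learn (my choice for illustration) on a synthetic dataset. The data, the number of PCA components and the classifier are all assumptions, and categorical features are left out for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic dataset (an assumption for illustration): 200 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out about 5% of the values to simulate missing features
X[rng.random(X.shape) < 0.05] = np.nan

# Hold out a test set that is never touched during model selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Impute -> normalize -> reduce dimensionality -> classify
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    PCA(n_components=3),
    LogisticRegression(),
)

# 5-fold cross-validation on the training portion only
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy: %.3f" % test_accuracy)
```

Wrapping the preprocessing steps in a pipeline means that the imputer, the scaler and PCA are refitted inside every fold, so no information from the validation part leaks into the training part.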

A good tutorial on Principal Component Analysis is:

- Shlens J., A Tutorial On Principal Component Analysis, UCSD

If you’re interested in how Hebbian Learning can be exploited for PCA, I’ve published some articles on the topic.

#### NumPy: the king of Python mathematics!

When working with Python, NumPy is much more than a library. It’s the foundation of almost any machine learning implementation and it’s absolutely necessary to know how it works, focusing on the concepts of vectorization and broadcasting. Through these techniques, it’s possible to speed up the learning process of the majority of algorithms, replacing slow Python loops with compiled array operations that can exploit SIMD instructions and, through the underlying linear algebra libraries, multithreading.
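
To make the two concepts concrete, here is a small sketch on random data (the array sizes and the two reference points are arbitrary choices). It standardizes a dataset through broadcasting and computes row norms and pairwise distances without explicit Python loops.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # 1000 samples, 3 features

# Broadcasting: the (3,) mean and std vectors are stretched across all 1000 rows
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Vectorization: squared Euclidean norm of every row, without a Python loop
norms_vec = np.sum(X ** 2, axis=1)

# Equivalent explicit loop, shown only for comparison
norms_loop = np.array([sum(v * v for v in row) for row in X])

# Broadcasting a (1000, 1, 3) array against a (1, 2, 3) array gives (1000, 2, 3):
# the distance of each sample from each of two reference points
centers = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

print(np.allclose(norms_vec, norms_loop))
print(dists.shape)
```

Timing the vectorized expression against the loop version (for example with `timeit`) is a quick way to see why this style matters on larger arrays.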

The official documentation is complete; however, I also suggest these resources:

- VanderPlas J., Python Data Science Handbook: Essential Tools for Working with Data, O’Reilly
- Langtangen H. P., A Primer on Scientific Programming with Python, Springer

#### Data visualization

Even if it’s not a purely Machine Learning topic, it’s important to know how to visualize datasets. Matplotlib is probably the most widespread solution: it’s easy to use and allows plotting different kinds of charts. Very interesting alternatives are offered by Bokeh and Seaborn. It’s not necessary to have a complete knowledge of all packages, but it’s useful to know the strengths and weaknesses of each of them, so as to be able to pick the right package when needed.

A good resource to learn Matplotlib details is:

- McGreggor D., Mastering Matplotlib, Packt Publishing

#### Linear Regression

Linear Regression is one of the simplest models and can be studied by considering it as an optimization problem that can be solved by minimizing the mean squared error. This approach is effective but limits the possibilities that can be exploited. I also suggest studying it as a Bayesian problem, where the parameters are given prior probabilities (Gaussian-distributed, for example) and the optimization becomes a MAP (Maximum A Posteriori) estimation, which reduces to MLE (Maximum Likelihood Estimation) when the prior is uniform. Even if it can seem more complex, this approach offers a new vision that is shared by dozens of other more complex models.
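
The two viewpoints can be compared in a few lines of NumPy. The sketch below uses synthetic data and assumes Gaussian noise: under that assumption, minimizing the squared error gives the maximum-likelihood weights, while adding a zero-mean Gaussian prior on the weights leads to a ridge-like closed form (a MAP estimate). The value of `lam` is a hypothetical prior strength chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=n)   # true slope 2.0, intercept 0.5

# Design matrix with a bias column
A = np.column_stack([x, np.ones(n)])

# Least squares: minimizes the mean squared error (the MLE under Gaussian noise)
w_mle, *_ = np.linalg.lstsq(A, y, rcond=None)

# Zero-mean Gaussian prior on the weights: ridge-style closed form (MAP estimate)
# lam is a hypothetical prior strength (noise variance / prior variance)
lam = 1.0
w_map = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ y)

print("MLE weights:", w_mle)
print("MAP weights:", w_map)
```

The prior pulls the weights towards zero, so the MAP solution never has a larger norm than the least-squares one. The same mechanism reappears later as regularization.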

A very useful introduction to Bayesian Statistics is available on Coursera.

I also suggest these books:

- Downey A. B., Think Bayes, O’Reilly
- Davidson-Pilon C., Bayesian Methods for Hackers, Addison-Wesley

#### Linear Classification

Logistic Regression is normally the best starting point. It’s also a good opportunity to study some Information Theory, to understand the power of concepts like entropy, cross-entropy, and mutual information (I’ve also written the post “ML Algorithms addendum: Mutual information in classification tasks”). Categorical cross-entropy is one of the most stable and widespread cost functions in deep learning classification, and a simple logistic regression can show how it speeds up the learning process (compared to the mean squared error). Another important topic is regularization (in particular, Ridge, Lasso, and ElasticNet). Too often it is considered an “esoteric” way to improve the accuracy of a model, but its real meaning is much more precise and should be understood with some concrete examples. I also suggest considering a Logistic Regression as a simple neural network, visualizing (for 2D examples) how the weight vector moves during the learning process.
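
A possible way to experiment with these ideas is to train a logistic regression by hand. The following NumPy sketch uses two synthetic 2D blobs, plain gradient descent on the cross-entropy and a small Ridge-style (L2) penalty; the learning rate, the penalty strength and the number of iterations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian blobs in 2D (synthetic data, assumed for illustration)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(p, y, eps=1e-12):
    # Clip the probabilities to avoid log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

w = np.zeros(2)
b = 0.0
lr = 0.1     # learning rate (arbitrary choice)
l2 = 0.01    # Ridge-style (L2) regularization strength (arbitrary choice)
losses = []

for _ in range(200):
    p = sigmoid(X @ w + b)
    losses.append(cross_entropy(p, y))
    # Gradient of the cross-entropy w.r.t. w and b, plus the L2 penalty term on w
    grad_w = X.T @ (p - y) / len(y) + l2 * w
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print("first loss: %.3f, last loss: %.3f" % (losses[0], losses[-1]))
print("weights:", w, "bias:", b, "accuracy:", accuracy)
```

Storing `w` at every iteration and plotting the resulting trajectory is an easy way to visualize how the weight vector moves while the loss decreases.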

I also suggest including Hyperparameter Grid Search methods in this section. Instead of trying different values without complete awareness, Grid Search allows evaluating the performance of different sets of hyperparameters. The engineer can, therefore, focus his/her attention only on the combinations that yield the highest accuracy.
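
With scikit-learn, a grid search over the regularization strength of a logistic regression can be sketched as follows. The Iris dataset and the candidate values of `C` are my own choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Candidate values are arbitrary; C is the inverse of the regularization strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,                 # every candidate is scored with 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy: %.3f" % search.best_score_)
```

The attribute `search.cv_results_` contains the scores of all candidates, which is useful to check how sensitive the model is to each hyperparameter.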

If you’re interested in some important concepts regarding the Fisher Information, I’ve written a post on the topic.

#### Support Vector Machines

Support Vector Machines offer a different approach to classification (both linear and nonlinear). The linear algorithm is very simple and can be learned by every student with a basic knowledge of geometry. It’s also very useful to understand how kernel-SVMs work, because their real power is shown in tasks where linear methods fail.
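
A classic way to see this difference is a dataset made of two concentric circles. The sketch below relies on scikit-learn and on synthetic data (both assumptions of mine) and compares a linear SVM with an RBF-kernel SVM through cross-validation.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf", gamma="scale"), X, y, cv=5).mean()

print("Linear SVM accuracy: %.3f" % linear_acc)
print("RBF-kernel SVM accuracy: %.3f" % rbf_acc)
```

The linear model cannot do much better than chance here, while the kernel implicitly maps the samples into a space where a separating hyperplane exists.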

Some useful free resources:

- Law, A Simple Introduction to Support Vector Machines, Michigan State University
- Kernel methods on Wikipedia
- Linearly Separable? No? For me it is! A Brief introduction to Kernel Methods

#### Decision Trees

Another approach to classification and regression is offered by Decision Trees. In general, they are not the first choice for very complex problems, but they offer a completely different approach, which can be easily understood even by non-technical people and can be visualized during meetings or official presentations.
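
To show how readable a tree can be, this short sketch trains a shallow classifier with scikit-learn on the Iris dataset (an arbitrary choice) and prints the learned rules as plain text.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the rule set small enough to read on a slide
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned rules as nested if/else conditions
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

For presentations, `sklearn.tree.plot_tree` produces a graphical version of the same structure.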

A good (and easy) tutorial on Decision Trees is:

- Kak A., Decision Trees: How to construct them and how to use them for classifying new data, Purdue University

#### Classification metrics

Evaluating the performance of a classifier can be more difficult than expected. The overall accuracy is a good measure, but it is often necessary to evaluate the behavior with respect to false positives and false negatives. I suggest dedicating some time to studying Precision, Recall, F-Score and the ROC Curve. They can dramatically change whether a model is considered acceptable or not. Pay particular attention to Recall, the fraction of actual positives that the model detects, which therefore measures the impact of false negatives. Having a good precision but a bad recall means that your model is generating many false negatives (think about this in a medical environment). The F-beta Score combines precision and recall into a single value, with beta controlling how much more weight recall receives than precision.
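
The definitions are simple enough to be implemented in plain Python. In this sketch, the labels describe a hypothetical screening test that I made up to show how a model with a reasonable accuracy and a perfect precision can still miss most of the positive cases.

```python
def classification_metrics(y_true, y_pred, beta=1.0):
    """Precision, recall and F-beta for a binary problem (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta

# Hypothetical screening test: 10 patients, 5 of them actually ill (label 1)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = classification_metrics(y_true, y_pred)
_, _, f2 = classification_metrics(y_true, y_pred, beta=2.0)

print(accuracy, precision, recall)   # 0.7 1.0 0.4
print(round(f1, 3), round(f2, 3))    # 0.571 0.455
```

With beta = 2, recall weighs more than precision, so the score drops further: three ill patients out of five were not detected.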

The majority of Linear Classification algorithms, SVM, and Classification metrics are explained in:

- Bonaccorso G., Machine Learning Algorithms (Second Edition), Packt, 2018

#### A quick glimpse into Ensemble Learning

After having understood the dynamics of a Decision Tree, it’s useful to study methods where sets (ensembles) of trees are trained together to improve the overall accuracy. Random Forests, Gradient Tree Boosting and AdaBoost are powerful algorithms whose complexity is reasonably low. It’s interesting to compare the learning process of a simple tree with the ones adopted by boosting and bagging methods. Scikit-Learn provides the most common implementations, but if you want to exploit the full power of these approaches, I suggest dedicating some time to studying XGBoost, which is a distributed framework that can work with both CPUs and GPUs, speeding up the training process even with very large datasets.
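
A simple comparison can be set up with the Scikit-Learn implementations. In the sketch below, the dataset (breast cancer) and the number of estimators are arbitrary choices; the goal is only to put a single tree and three ensembles side by side under the same cross-validation scheme.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest (bagging)": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Gradient Tree Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validation accuracy of each model
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print("%-26s %.3f" % (name, scores[name]))
```

Repeating the experiment with different values of `n_estimators` helps to see how quickly each method benefits from additional trees.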

Some valid resources on Ensemble Learning are:

- Baskin I., Marcou G., Varnek A., Tutorial on Ensemble Learning, Faculté de Chimie de Strasbourg
- Dietterich T. G., Ensemble Methods in Machine Learning, Oregon State University
- Bonaccorso G., Mastering Machine Learning Algorithms, Packt, 2018

#### Clustering

When starting with clustering methods, in my opinion, the best thing to do is to consider the Gaussian Mixture algorithm (based on EM, Expectation-Maximization). Even if K-Means is much simpler (and must be studied), Gaussian Mixture offers a purely probabilistic approach, which is useful for many other similar tasks. Other algorithms that must be studied include Hierarchical Clustering, Spectral Clustering, and DBSCAN. It’s also useful to understand the idea of instance-based learning, studying the k-Nearest Neighbors algorithm (which can be adopted for both supervised and unsupervised tasks).
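
The main practical difference between the two approaches is the kind of assignment they produce. This sketch, based on scikit-learn and on synthetic blobs (both my assumptions), fits K-Means and a Gaussian Mixture on the same data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# K-Means: hard assignments, each sample belongs to exactly one cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
hard_labels = kmeans.labels_

# Gaussian Mixture (trained with EM): soft assignments, one probability per component
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_probs = gmm.predict_proba(X)

print("K-Means label of first sample:", hard_labels[0])
print("GMM membership probabilities of first sample:", soft_probs[0].round(3))
```

Samples lying between two blobs get intermediate probabilities from the mixture, an uncertainty that K-Means cannot express.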

A good introduction to K-Means is: Introduction to K-Means Clustering

A useful free resource on Spectral Clustering is:

- Von Luxburg U., A tutorial on Spectral Clustering, Max-Planck Institute

#### Clustering metrics

Clustering metrics are a little bit more empirical because their semantics can change with the context. However, I suggest studying Silhouette plots and some ground truth methods (like the adjusted Rand score). They can provide you with a complete insight into the structure of the clustering process, showing all those situations where hyperparameter tuning is probably necessary.
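
As a rough illustration, the following sketch computes the average silhouette score and the adjusted Rand score for K-Means with different numbers of clusters. The dataset is synthetic and its true labels are known only because I generated them; in a real scenario the ground truth is often unavailable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data with 4 known clusters; y_true plays the role of the ground truth
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = (silhouette_score(X, labels), adjusted_rand_score(y_true, labels))
    print("k=%d  silhouette=%.3f  adjusted Rand=%.3f" % (k, *results[k]))
```

The silhouette needs only the data and the assignments, while the adjusted Rand score compares the assignments with the known labels.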

Some very interesting resources on cluster stability are:

- Von Luxburg U., Cluster stability: an Overview, arXiv:1007.1075
- Assessing clustering optimality with instability index

The majority of clustering algorithms and metrics are covered in:

- Bonaccorso G., Machine Learning Algorithms (Second Edition), Packt, 2018

#### Neural Networks for beginners

Neural networks are the basis of Deep Learning and they should be studied in a separate course. However, I think it’s useful to understand the concepts of Perceptron, Multi-Layer Perceptron, and the Backpropagation algorithm. Scikit-Learn offers a very simple implementation of Neural Networks; however, it might be a good idea to start exploring Keras, which is a high-level framework based on TensorFlow, Theano or CNTK that allows modeling and training neural networks with minimal initial effort.
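
Before moving to a framework, it can be instructive to code a Perceptron by hand. The sketch below applies the classic perceptron update rule to the AND gate (a toy problem that I picked because it is linearly separable); the learning rate and the maximum number of epochs are arbitrary.

```python
import numpy as np

# AND gate: linearly separable, so the perceptron rule is guaranteed to converge
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
lr = 1.0  # learning rate (arbitrary choice)

for epoch in range(20):
    errors = 0
    for xi, target in zip(X, y):
        prediction = int(xi @ w + b > 0)
        update = lr * (target - prediction)
        # Weights and bias move only when the sample is misclassified
        w += update * xi
        b += update
        errors += int(update != 0)
    if errors == 0:
        # A full pass without mistakes: training is complete
        break

predictions = (X @ w + b > 0).astype(int)
print("weights:", w, "bias:", b, "predictions:", predictions)
```

Replacing the targets with those of the XOR gate shows why a single Perceptron is not enough and why Multi-Layer Perceptrons trained with Backpropagation are needed.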

Some good resources to start with Neural Networks are:

- Hassoun M., Fundamentals of Artificial Neural Networks, The MIT Press
- Gulli A., Pal S., Deep Learning with Keras, Packt Publishing
- Bonaccorso G., Mastering Machine Learning Algorithms, Packt, 2018

The best Deep Learning book (advanced) available on the market is probably:

- Goodfellow I., Bengio Y., Courville A., Deep Learning, The MIT Press

Moreover, I suggest “playing” with the TensorFlow Playground, where it’s possible to simulate a complex neural network, tune its parameters and observe the resulting output.

Path image copyright by TheRitters.
