Machine learning is becoming increasingly widespread; every day, new computer scientists and engineers jump into this beautiful world. Unfortunately, the number of theories, algorithms, applications, papers, books, videos, and so forth is so large that it can disorient anyone who doesn’t have a clear picture of what they want or need to learn to improve their skills.
In this short post, I wanted to share my experiences, suggesting a feasible path to learn the essential concepts quickly and be ready to go deeper into the most complex topics. Of course, this is only a personal proposal: every student can choose to dedicate more attention to some topics that are more interesting based on his/her experience.
“Do not worry about your difficulties in Mathematics. I can assure you mine are still greater.”
(A. Einstein)
Prerequisites
Machine Learning is based on Mathematics. It’s not an optional, theoretical approach; it’s a fundamental pillar that cannot be discarded. If you are a computer engineer working daily with UML, ORM, Design Patterns, and many other software engineering tools/techniques, close your eyes and, for one second, forget almost everything. This doesn’t mean that all those concepts aren’t necessary. They are! But Machine Learning needs a different approach. One of the reasons why Python has become more and more popular in this field is its “prototyping speed.” In Machine Learning, a language that allows you to model an algorithm with a few lines of code (without classes, interfaces, and all other OO infrastructures) is a must.
Calculus, Probability Theory, and Linear Algebra are necessary mathematical skills for almost any algorithm. You can skip this section if you already have an excellent mathematical background. Otherwise, it’s a good idea to refresh some essential concepts. Given the amount of theory involved, I discourage starting with big manuals: they remain useful as references for particular concepts, but initially it’s better to focus on a small subset of topics. Many good online resources (like Coursera, Khan Academy, or Udacity, to name a few) adopt a pragmatic approach suitable for any background. However, I suggest using a brief compendium that explains the most important concepts, and then searching for and studying new elements whenever needed. This isn’t a systematic approach, but the alternative has a serious drawback: the sheer number of concepts can discourage and disorient anyone without a solid academic background.
An acceptable starting “recipe” can be:

 Probability theory
   Discrete and continuous random variables
   Main distributions (Bernoulli, Categorical, Binomial, Normal, Exponential, Poisson, Beta, Gamma)
   Moments
   Bayesian statistics
   Correlation and Covariance
 Linear algebra
   Vectors and matrices
   The determinant of a matrix
   Eigenvectors and eigenvalues
   Matrix factorization (like SVD)
 Calculus
   Real functions
   Derivatives, Integrals
   Main numerical methods
There are many free resources on the web, like:

 Grinstead, Snell, Introduction to Probability, Swarthmore and Dartmouth Colleges
 Wise, Gallagher, An Introduction to Linear Algebra (with MATLAB examples), Columbia
 Heinbockel, Introduction to Calculus Vol. I, Old Dominion University
 Bendersky E., Understanding Gradient Descent
 Bonaccorso G., A Brief (and Comprehensive) Guide to Stochastic Gradient Descent Algorithms
 Bonaccorso G., Mastering Machine Learning Algorithms, Packt, 2018
Wikipedia is also an excellent resource; many formulas, theories, and theorems are explained clearly and understandably.
A Machine Learning path proposal (for absolute beginners)
Feature Engineering
The first step in jumping into Machine Learning is understanding how to measure and improve the quality of datasets. Managing categorical and missing features, normalization, and dimensionality reduction (PCA, ICA, NMF) are fundamental techniques that can dramatically improve the performance of every algorithm. It’s also helpful to study how to split datasets into training and test sets and how to adopt Cross-Validation instead of classical test methods.
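To make these steps concrete, here is a minimal sketch, assuming Scikit-Learn is installed. The dataset is synthetic and invented for the example (three numeric columns with missing values plus one categorical column stored as an integer code), and the choices of mean imputation, four principal components, and a logistic regression at the end are arbitrary, not recommendations.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n_samples = 300

# Synthetic data (invented for this sketch): columns 0-2 are numeric,
# column 3 is a categorical feature stored as an integer code (0, 1, 2).
X = np.column_stack([rng.normal(size=(n_samples, 3)),
                     rng.integers(0, 3, size=n_samples)]).astype(float)
y = (X[:, 0] + 0.5 * X[:, 1] + (X[:, 3] == 0) > 0.5).astype(int)

# Knock out about 10% of the numeric values to simulate missing features.
missing = rng.random(size=(n_samples, 3)) < 0.1
X[:, :3][missing] = np.nan

# Numeric columns: fill missing values with the mean, then standardize.
numeric = Pipeline([("impute", SimpleImputer(strategy="mean")),
                    ("scale", StandardScaler())])

# The categorical column is one-hot encoded.
preprocess = ColumnTransformer(
    [("num", numeric, [0, 1, 2]),
     ("cat", OneHotEncoder(handle_unknown="ignore"), [3])],
    sparse_threshold=0.0,  # always return a dense array, as PCA expects
)

model = Pipeline([("preprocess", preprocess),
                  ("pca", PCA(n_components=4)),
                  ("clf", LogisticRegression(max_iter=1000))])

# Hold out a test set, then cross-validate on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
test_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

print("CV accuracy per fold:", np.round(cv_scores, 3))
print("Held-out test accuracy:", round(test_accuracy, 3))
```

Keeping every preprocessing step inside the pipeline means it is re-fitted on each training fold, so no information from the validation fold leaks into the imputation, scaling, or PCA.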
A good tutorial on Principal Component Analysis is:

 Shlens J., A Tutorial On Principal Component Analysis, UCSD
If you’re interested in how it’s possible to exploit the Hebbian Learning for PCA, I’ve published these articles:
NumPy: the king of Python mathematics!
When working with Python, NumPy is much more than a library. It’s the foundation of almost any machine learning implementation, and it’s essential to know how it works, focusing on the concepts of vectorization and broadcasting. Through these techniques, it’s possible to speed up the learning process of most algorithms, exploiting the power of multithreading and SIMD and MIMD architectures.
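As a small illustration of vectorization and broadcasting, the sketch below compares an explicit Python loop with the equivalent array expressions. The data and the variable names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # 1000 samples, 3 features
w = np.array([0.5, -1.0, 2.0])   # one weight per feature

# Loop version: one Python-level iteration per sample and per feature (slow).
y_loop = np.empty(X.shape[0])
for i in range(X.shape[0]):
    y_loop[i] = sum(X[i, j] * w[j] for j in range(X.shape[1]))

# Vectorized version: a single matrix-vector product executed in compiled code.
y_vec = X @ w

# Broadcasting: the (3,) vectors of column means and standard deviations are
# stretched along the first axis to match the (1000, 3) matrix.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Broadcasting with new axes: squared distances between 5 reference points and
# all samples. Shapes (5, 1, 3) and (1, 1000, 3) combine into (5, 1000, 3).
centers = X[:5]
sq_dist = ((centers[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

print(np.allclose(y_loop, y_vec))  # both versions compute the same values
print(sq_dist.shape)
```

The loop and the vectorized product give the same result; the difference is that the second one avoids the Python interpreter overhead for every element.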
The official documentation is complete. However, I also suggest these resources:

 VanderPlas J., Python Data Science Handbook: Essential Tools for Working with Data, O’Reilly
 Langtangen H. P., A Primer on Scientific Programming with Python, Springer
Data visualization
Knowing how to visualize datasets is essential, even if it’s not a purely Machine Learning topic. Matplotlib is probably the most widespread solution: it’s easy to use and allows plotting many different kinds of charts. Bokeh and Seaborn offer exciting alternatives. It’s not necessary to have complete knowledge of all packages, but it’s helpful to know the strengths and weaknesses of each of them to be able to pick the right package when needed.
An excellent resource to learn Matplotlib details is:

 McGreggor D., Mastering Matplotlib, Packt Publishing
Linear Regression
Linear regression is one of the simplest models and can be studied as an optimization problem solved by minimizing the mean squared error. This approach is practical but limits the possibilities that can be exploited. I also suggest studying it from a probabilistic point of view: assuming Gaussian noise, minimizing the mean squared error is equivalent to an MLE (Maximum Likelihood Estimation), and when the parameters are given prior probabilities (Gaussian-distributed, for example), the optimization becomes a MAP (Maximum A Posteriori) estimation. Even if it seems more complex, this approach offers a new vision shared by dozens of more complex models.
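The two points of view can be compared with a few lines of NumPy. This is only a sketch under stated assumptions: the data are synthetic, the noise is Gaussian with a known standard deviation, there is no intercept, and the prior on the weights is a zero-mean Gaussian whose standard deviation (prior_std) is a value I picked for the example. Under those assumptions, the MAP estimate has the same closed form as Ridge regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, noise_std, prior_std = 200, 0.5, 1.0

true_w = np.array([2.0, -1.0, 0.5])           # weights used to generate the data
X = rng.normal(size=(n_samples, 3))
y = X @ true_w + rng.normal(scale=noise_std, size=n_samples)

# Minimizing the mean squared error (equivalent to MLE under Gaussian noise).
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# MAP with a zero-mean Gaussian prior on the weights:
#   w_map = (X^T X + lam * I)^(-1) X^T y,  with lam = noise_std^2 / prior_std^2
lam = noise_std ** 2 / prior_std ** 2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("MLE weights:", np.round(w_mle, 3))
print("MAP weights:", np.round(w_map, 3))
print("Norms (MLE, MAP):", np.linalg.norm(w_mle), np.linalg.norm(w_map))
```

The prior pulls the MAP weights slightly toward zero; with many samples the two estimates almost coincide, while with few samples the effect of the prior becomes more visible.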
A handy introduction to Bayesian Statistics is available on Coursera:
I also suggest these books:

 Downey A. B., Think Bayes, O’Reilly
 Davidson-Pilon C., Bayesian Methods for Hackers, Addison-Wesley
Linear Classification
Logistic Regression is typically the best starting point. It’s also an excellent opportunity to study some Information Theory and to understand the power of concepts like entropy, cross-entropy, and mutual information (I’ve also written the post: ML Algorithms addendum: Mutual information in classification tasks). Categorical cross-entropy is the most stable and widespread cost function in deep learning classification, and a simple logistic regression can show how it can speed up the learning process (compared to the mean squared error). Another critical topic is regularization (in particular, Ridge, Lasso, and ElasticNet).
Too often, regularization is considered an “esoteric” way to improve the accuracy of a model, but its real meaning is much more precise and should be understood through some concrete examples. I also suggest considering a Logistic Regression as a simple neural network, visualizing (for 2D examples) how the weight vector moves during the learning process.
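A logistic regression trained with plain gradient descent is short enough to write by hand. The following is a minimal sketch, not a production implementation: the two Gaussian blobs, the learning rate, the number of iterations, and the L2 (Ridge-like) penalty strength are all values I chose for the example. The list named trajectory stores the weight vector at every step, so it can be plotted to see how it moves.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p, eps=1e-12):
    # Binary cross-entropy; clipping avoids log(0).
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

rng = np.random.default_rng(0)

# Two 2D Gaussian blobs, one per class (synthetic data).
X = np.vstack([rng.normal(-1.5, 1.0, size=(100, 2)),
               rng.normal(1.5, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = np.zeros(2), 0.0
learning_rate, l2_strength = 0.1, 0.01
losses, trajectory = [], [w.copy()]

for _ in range(200):
    p = sigmoid(X @ w + b)
    # Gradient of the cross-entropy plus the gradient of the L2 penalty.
    grad_w = X.T @ (p - y) / len(y) + l2_strength * w
    grad_b = np.mean(p - y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    losses.append(cross_entropy(y, sigmoid(X @ w + b)))
    trajectory.append(w.copy())

accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print("final loss:", round(losses[-1], 4), "accuracy:", accuracy)
```

Increasing l2_strength keeps the weight vector shorter, which is a concrete way to see what regularization does: it trades a slightly worse fit on the training data for smaller, more stable parameters.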
I also suggest including Hyperparameter Grid Search methods in this section. Instead of trying different values blindly, Grid Search systematically evaluates the performance of different hyperparameter sets. The engineer can, therefore, focus only on the combinations that yield the highest accuracy.
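A grid search over a logistic regression might look like the sketch below, assuming Scikit-Learn and its bundled breast cancer dataset. The grid values are arbitrary choices for the example; the parameter names are prefixed with the step name that make_pipeline generates automatically.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so it is re-fitted on every CV fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# Every combination of these values is evaluated with 5-fold cross-validation.
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0],   # inverse regularization strength
    "logisticregression__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

The full table of results is available in search.cv_results_, which is useful when several combinations perform almost equally well.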
If you’re interested in some important concepts regarding the Fisher Information, I’ve written this post:
Support Vector Machines
Support Vector Machines offer a different approach to classification (both linear and nonlinear). The algorithm is straightforward and can be learned by every student with a basic knowledge of geometry. However, it’s beneficial to understand how kernel SVMs work because their real power is shown in tasks where linear methods fail.
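A classic way to see where linear methods fail is a dataset made of two concentric circles. The sketch below, assuming Scikit-Learn, trains a linear SVM and an RBF-kernel SVM on the same synthetic data; the dataset parameters are arbitrary choices for the example.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: no straight line can separate the classes.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print("linear kernel accuracy:", round(linear_acc, 3))
print("RBF kernel accuracy:", round(rbf_acc, 3))
```

The RBF kernel implicitly maps the points into a space where the inner and outer circles become separable, which is why its accuracy is much higher than the linear one on this problem.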
Some helpful free resources:

 Law, A Simple Introduction to Support Vector Machines, Michigan State University
 Kernel methods on Wikipedia
 Linearly Separable? No? For me, it is! A Brief Introduction to Kernel Methods
Decision Trees
Decision Trees offer another approach to classification and regression. In general, they are not the first choice for very complex problems. However, they offer a completely different approach, which can be easily understood even by nontechnical people and visualized during meetings or official presentations.
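To show how readable a tree can be, here is a minimal sketch, assuming Scikit-Learn and its bundled Iris dataset. The maximum depth of 2 is an arbitrary choice that keeps the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree is less accurate but much easier to explain.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as nested if/else rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
print("training accuracy:", round(tree.score(iris.data, iris.target), 3))
```

The printed rules are plain threshold comparisons on the features, which is what makes this model easy to present to people without a technical background.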
A good tutorial (easy) on Decision Trees is:

 Kak A., Decision Trees: How to construct them and how to use them for classifying new data, Purdue University
Classification metrics
Evaluating the performance of a classifier can be more complex than expected. The overall accuracy is a good measure, but it’s often necessary to evaluate the behavior with respect to false positives and false negatives. I suggest dedicating time to studying Precision, Recall, F-Score, and ROC curves. They can dramatically change whether a model is considered acceptable. Pay attention to Recall, which measures the fraction of actual positives that the model detects and therefore drops as false negatives increase. Having good precision but poor recall means that your model generates many false negatives (think about this in a medical environment). The F-beta Score is a good trade-off between precision and recall.
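The following sketch computes these metrics by hand on a tiny, invented set of labels, so every number can be checked on paper. The helper names (confusion_counts, f_beta) are mine and not part of any library.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean; beta > 1 gives more weight to recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Invented example: 4 positive cases, the model detects only one of them.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print("accuracy:", accuracy)
print("precision:", precision, "recall:", recall)
print("F1:", f_beta(precision, recall), "F2:", f_beta(precision, recall, beta=2.0))
```

Here the accuracy is 0.7 and the precision is perfect, yet the recall is only 0.25 because three of the four positive cases are missed. The F-scores expose the problem that accuracy alone hides.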
The majority of Linear Classification algorithms, SVM, and Classification metrics are explained in:

 Bonaccorso G., Machine Learning Algorithms (Second Edition), Packt, 2018
A quick glimpse into Ensemble Learning
After understanding the dynamics of a Decision Tree, it’s helpful to study methods where sets (ensembles) of trees are trained together to improve the overall accuracy. Random Forests, Gradient Tree Boosting, and AdaBoost are powerful algorithms with reasonably low complexity. It’s interesting to compare the learning process of a simple tree with the ones adopted by boosting and bagging methods. Scikit-Learn provides the most common implementations, but if you want to exploit the full power of these approaches, I suggest dedicating some time to studying XGBoost, which is a distributed framework that can work both with CPUs and GPUs, speeding up the training process even with massive datasets.
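A quick way to compare a single tree with bagging and boosting methods is to cross-validate them on the same data. This is a minimal sketch, assuming Scikit-Learn and its bundled breast cancer dataset; the number of estimators is an arbitrary choice, and XGBoost is left out to keep the example free of extra dependencies.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest (bagging)": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Results depend on the dataset and on the random seed, so the comparison should be read as an illustration of the workflow rather than as a ranking of the algorithms.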
Two valid tutorials on Ensemble Learning are:

 Baskin I., Marcou G., Varnek A., Tutorial on Ensemble Learning, Faculté de Chimie de Strasbourg
 Dietterich T. G., Ensemble Methods in Machine Learning, Oregon State University
 Bonaccorso G., Mastering Machine Learning Algorithms, Packt, 2018
Clustering
When starting with clustering methods, I think the best thing to do is consider the Gaussian Mixture algorithm (based on EM, Expectation-Maximization). Even if K-Means is relatively more straightforward (and must be studied), Gaussian Mixture offers a fully probabilistic approach, which is helpful for many similar tasks. Other algorithms that must be studied include Hierarchical Clustering, Spectral Clustering, and DBSCAN. It’s also helpful to understand the idea of instance-based learning by studying the k-Nearest Neighbors algorithm (which can be employed for supervised and unsupervised tasks).
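The difference between the two algorithms is easy to see in code. The sketch below, assuming Scikit-Learn, fits K-Means and a Gaussian Mixture on the same synthetic data (two elongated Gaussian blobs that I invented for the example) and compares the hard labels of the former with the soft, probabilistic assignments of the latter.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two elongated, partially overlapping Gaussian blobs (synthetic data).
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=200),
    rng.multivariate_normal([4.0, 0.0], [[1.0, -0.6], [-0.6, 1.0]], size=200),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

hard_labels = kmeans.labels_           # one cluster index per sample
soft_labels = gmm.predict_proba(X)     # one probability per sample and component

print("K-Means centers:\n", np.round(kmeans.cluster_centers_, 2))
print("GMM means:\n", np.round(gmm.means_, 2))
print("GMM memberships of the first sample:", np.round(soft_labels[0], 3))
```

Besides the means, the mixture also estimates a full covariance matrix and a weight for each component (gmm.covariances_ and gmm.weights_), which is what allows it to model elongated clusters.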
An excellent introduction to K-Means is Introduction to K-Means Clustering.
A valuable free resource on Spectral Clustering is:

 Von Luxburg U., A tutorial on Spectral Clustering, Max Planck Institute
Clustering metrics
Clustering metrics are slightly more empirical because their semantics can change with the context. However, I suggest studying the Silhouette plots and some ground truth methods (like the adjusted Rand score). They can provide you with a complete insight into the structure of the clustering process, showing all those situations when hyperparameter tuning is probably necessary.
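As an example of how these metrics can guide hyperparameter tuning, the sketch below runs K-Means with different numbers of clusters and computes both scores. It assumes Scikit-Learn; the three well-separated blob centers are values I chose so that the right answer is known in advance.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three well-separated blobs with known ground-truth labels (synthetic data).
X, y_true = make_blobs(n_samples=300,
                       centers=[(-5, -5), (0, 5), (5, -5)],
                       cluster_std=0.8, random_state=0)

results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = (silhouette_score(X, labels),          # needs no ground truth
                  adjusted_rand_score(y_true, labels))  # needs ground truth

for k, (silhouette, ari) in results.items():
    print(f"k={k}: silhouette={silhouette:.3f}, adjusted Rand={ari:.3f}")

best_k = max(results, key=lambda k: results[k][0])
print("number of clusters with the highest silhouette:", best_k)
```

The silhouette score can be used when no labels are available, while the adjusted Rand score is only applicable when a ground truth exists, for example on a labeled benchmark.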
Some interesting resources on cluster stability are:

 Von Luxburg U., Cluster stability: an Overview, arXiv:1007.1075
 Assessing clustering optimality with instability index
The majority of clustering algorithms and metrics are covered in:

 Bonaccorso G., Machine Learning Algorithms (Second Edition), Packt, 2018
Neural Networks for Beginners
Neural networks are the basis of Deep Learning and should be studied in a separate course. However, I think it’s helpful to understand the concepts of Perceptron, Multi-Layer Perceptron, and the Backpropagation algorithm. Scikit-Learn offers a straightforward implementation of Neural Networks. However, it might be a good idea to explore Keras, a high-level framework based on TensorFlow, PyTorch, or CNTK that allows modeling and training neural networks with minimal initial effort.
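For a first experiment, the Scikit-Learn implementation is enough. The following is a minimal sketch that trains a small Multi-Layer Perceptron on a synthetic, nonlinear dataset; the architecture (two hidden layers of 16 units) and the other settings are arbitrary choices for the example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Two interleaving half-moons: a simple problem that is not linearly separable.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# The weights are learned with backpropagation and the Adam optimizer.
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), activation="relu",
                    solver="adam", max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)

test_accuracy = mlp.score(X_test, y_test)
print("test accuracy:", round(test_accuracy, 3))
print("weight matrix shapes:", [w.shape for w in mlp.coefs_])
```

The shapes of the weight matrices, (2, 16), (16, 16), and (16, 1), mirror the structure of the network: two inputs, two hidden layers, and one output unit.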
Some good resources to start with Neural Networks are:

 Hassoun M., Fundamentals of Artificial Neural Networks, The MIT Press
 Gulli A., Pal S., Deep Learning with Tensorflow and Keras, Packt Publishing
 Bonaccorso G., Mastering Machine Learning Algorithms, Packt, 2018
The best Deep Learning book (advanced) available on the market is probably:

 Goodfellow I., Bengio Y., Courville A., Deep Learning, The MIT Press
Moreover, I suggest “playing” with the TensorFlow Playground, where it’s possible to simulate a complex neural network, tune its parameters, and observe the resulting output.
Path image copyright by TheRitters.