A book that every data scientist should read

I would like to recommend the book Factfulness (by Hans Rosling) to every data scientist and every student who wishes to start this career. It’s not a technical book full of formulas, algorithms, and other complex stuff: it contains only a few plots and a lot of analysis, ideas, and comparisons. Everybody should be aware that we live in a world where facts are often treated as mere alternatives to beliefs, and where biases are so common that they are the first problem to face when a clear insight is needed.

Unfortunately, it’s easy to value a belief more than a factual analysis, and we all know what kind of disasters this can bring about. It’s even worse when a data scientist analyzes a bunch of data starting from a biased viewpoint. Even if there are dozens of powerful models, some tasks require moderation. Predicting the future of a time series is a hard job unless we are 100% sure that nothing can alter what we have discovered.

Such an approach leads to the idea of trends, which are simple to manage but as dangerous as atom bombs. How many measures can only grow or only decrease? In the short term, a trend is an acceptable description, but are we authorized to extend this analysis to the long term? Clearly, the answer is no. That’s why there are periodicities, seasonalities, and saturation. One of the first analyses that caught my attention was the growth of the world population. Almost everybody is led to think that such a value can only get larger and larger, until a catastrophic event occurs.

Luckily, many systems (including a population living in a limited environment) admit fixed points. This means that, believe it or not, they grow up to a threshold and then saturate. Others are unstable and keep oscillating (stock values, for example). There can be a long period characterized by a trend, but at a certain point the internal dynamics force the system to invert it and the process is reset.
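To make the idea of a fixed point more concrete, here is a minimal sketch of logistic growth, dP/dt = r * P * (1 - P / K), integrated with a simple Euler scheme. The logistic equation is only one common textbook choice for a population in a limited environment, and the function name and all the numbers (growth rate, carrying capacity, starting value) are hypothetical values picked for illustration, not real population data.

```python
def logistic_growth(p0, r, k, steps, dt=0.1):
    """Euler integration of dP/dt = r * P * (1 - P / K).

    p0: initial value, r: growth rate, k: carrying capacity (the fixed point).
    All values used below are illustrative assumptions.
    """
    trajectory = [p0]
    p = p0
    for _ in range(steps):
        # Growth is proportional to P, but it is damped as P approaches K
        p += dt * r * p * (1.0 - p / k)
        trajectory.append(p)
    return trajectory


# Hypothetical parameters: the early part looks like an unstoppable trend
trajectory = logistic_growth(p0=1.0, r=0.5, k=100.0, steps=500)

print(f"after  5 time units: {trajectory[50]:.2f}")
print(f"after 10 time units: {trajectory[100]:.2f}")
print(f"after 50 time units: {trajectory[-1]:.2f}")
```

In the first part of the trajectory the values grow almost exponentially, which is exactly where extrapolating the trend would be tempting. Later on, the growth slows down and the value settles on the threshold K.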

A good data scientist should always keep these simple concepts in mind before starting any analysis. I’m perfectly aware that too many CEOs would like to hear that their profit is going to grow indefinitely. Likewise, many politicians would like to see an ever-growing consensus. We know that even the best companies have to change their strategies and that there are no “immortal” political parties.

Moreover, data analysis is totally incompatible with beliefs (I’m not referring to statistical beliefs that are tested in many ways, but to irrational or unprovable statements). A data scientist must always work with facts and run away from any potential belief. It doesn’t matter if 98% of the people think that A is true. The question should always be: do the data confirm this hypothesis? If the answer is yes, it isn’t an actual belief, but common sense derived from observations. Otherwise, it’s better not to go on. The idea of likelihood is the data scientist’s best friend. Once the data have been denoised (sometimes this is a problematic step, but, in general, it can be carried out efficiently), a model (with its underlying hypotheses) must be checked in terms of likelihood.
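As a rough illustration of checking a model in terms of likelihood, the sketch below compares two candidate hypotheses (an exponential trend and a saturating logistic curve) on the same observations. Everything here is an assumption made for the example: the observations are synthetic (a logistic curve plus a small deterministic perturbation), the noise is assumed Gaussian with a known standard deviation, and the function names and parameters are hypothetical.

```python
import math


def gaussian_log_likelihood(observed, predicted, sigma):
    """Log-likelihood of the observations under independent Gaussian noise."""
    n = len(observed)
    squared_error = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return (-0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
            - squared_error / (2.0 * sigma ** 2))


def logistic(t, k=100.0, r=0.5, p0=1.0):
    """Saturating hypothesis (parameters are illustrative assumptions)."""
    return k / (1.0 + (k / p0 - 1.0) * math.exp(-r * t))


def fit_exponential(times, values):
    """Least-squares fit of log(y) = a + b * t (the 'pure trend' hypothesis)."""
    n = len(times)
    logs = [math.log(v) for v in values]
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    b = (sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
         / sum((t - t_mean) ** 2 for t in times))
    a = l_mean - b * t_mean
    return lambda t: math.exp(a + b * t)


# Synthetic observations, NOT real data: logistic curve plus a small wiggle
times = list(range(31))
observed = [logistic(t) + 0.5 * math.sin(t) for t in times]

exponential = fit_exponential(times, observed)
sigma = 1.0  # assumed noise level

ll_exponential = gaussian_log_likelihood(
    observed, [exponential(t) for t in times], sigma)
ll_logistic = gaussian_log_likelihood(
    observed, [logistic(t) for t in times], sigma)

print(f"log-likelihood (exponential trend): {ll_exponential:.1f}")
print(f"log-likelihood (logistic):          {ll_logistic:.1f}")
```

The hypothesis with the higher log-likelihood is the one the data support better. The comparison is deliberately simplified: the logistic parameters are taken as given rather than fitted, so a real analysis would estimate both models in the same way and account for the number of parameters.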

Observing the world population data, it’s possible to see that the growth rate is slowing down and that exponential growth is very unlikely. A data scientist should notice that almost immediately, but a belief could lead them to a misunderstanding. In the following plot provided by Max Roser, you can find the confirmation of the previous statement (compare the growth rate with the cumulative count):

Prey-predator models have fixed points due to a dynamic interaction between opposing factors. But how can this also be true for human beings? The average life expectancy is increasing (but, also in this case, I suppose that nobody thinks of it as a stable trend) and healthcare is becoming more and more efficient and effective. Hence, a natural consideration is: the population must increase until there are no more resources (as in a prey-predator model) and, at that point, starvation and diseases will dramatically reduce the number.
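For readers who have never seen a prey-predator model, the following sketch uses the classic Lotka-Volterra equations, dx/dt = alpha * x - beta * x * y and dy/dt = delta * x * y - gamma * y, as one standard example. The coefficients are hypothetical numbers chosen only to show that there is a pair of population sizes at which both rates of change vanish.

```python
def lotka_volterra_rates(prey, predators, alpha, beta, delta, gamma):
    """Rates of change (d prey / dt, d predators / dt) of the Lotka-Volterra model."""
    d_prey = alpha * prey - beta * prey * predators
    d_predators = delta * prey * predators - gamma * predators
    return d_prey, d_predators


def coexistence_fixed_point(alpha, beta, delta, gamma):
    """Non-trivial fixed point: prey* = gamma / delta, predators* = alpha / beta."""
    return gamma / delta, alpha / beta


# Hypothetical coefficients (illustrative assumptions, not fitted to any data)
params = dict(alpha=1.0, beta=0.1, delta=0.075, gamma=1.5)

prey_star, predators_star = coexistence_fixed_point(**params)
rates_at_fixed_point = lotka_volterra_rates(prey_star, predators_star, **params)

# Away from the fixed point, the two opposing factors no longer balance
rates_elsewhere = lotka_volterra_rates(prey_star * 2.0, predators_star, **params)

print(f"fixed point: prey = {prey_star:.2f}, predators = {predators_star:.2f}")
print(f"rates at the fixed point: {rates_at_fixed_point}")
print(f"rates with twice the prey: {rates_elsewhere}")
```

At the fixed point, the growth of the prey is exactly balanced by predation, and the growth of the predators is exactly balanced by their mortality. With twice as many prey, the predators start to increase, which is the interaction between opposing factors mentioned above.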

Such a scenario is wrong (read the book to know the answer and many common objections!), but the population is saturating and this is the most likely and rational hypothesis. Therefore, whether you are a young or an experienced data scientist, I’m sure you’ll find this book extremely interesting and maybe also very challenging. Maybe I’m wrong, but please don’t consider my opinion a belief! At most, propose alternative viewpoints! An optimal solution can arise only from the interaction of different alternatives based on rock-solid facts!

The plot of world population growth has been provided by Max Roser under a CC license.