Chapter 10
IN THIS CHAPTER
Using the tools of different tribes when learning from data
Discovering how probability benefits AI
Guessing using Naïve Bayes and Bayesian Networks
Partitioning data into branches and leaves by decision trees
Learning has been an important part of AI since the beginning because the capability to learn is what allows AI to mimic human-like intelligence. Reaching a level of mimicry that effectively resembles learning took a long time and a variety of approaches. Today, machine learning can boast a quasi-human level of learning in specific tasks, such as image classification or sound processing, and it’s striving to reach a similar level of learning in many other tasks.
Machine learning isn’t completely automated. You can’t tell a computer to read a book and expect it to understand anything. Automation implies that computers can learn how to program themselves to perform tasks instead of waiting for humans to program them. Currently, automation requires large amounts of human-selected data as well as data analysis and training (again, under human supervision). It’s like taking a child by the hand for those first steps. Moreover, machine learning has other limits, which are dictated by how it learns from data.
Each family of algorithms has specific ways of accomplishing tasks, and this chapter describes those methods. The goal is to understand how AI makes decisions and predictions. Like discovering the man behind the curtain in the Wizard of Oz, you uncover the machinery and the operator behind AI in this chapter. Nevertheless, you still get to enjoy the amazing feeling of seeing the wondrous achievements that machine learning can provide.
Taking Many Different Roads to Learning
Just as human beings have different ways to learn from the world, so the scientists who approached the problem of AI learning took different routes, each one believing in a particular recipe to mimic intelligence. Up to now, no single model has proven superior to any other: the no free lunch theorem, which states that each algorithm provides a benefit only on specific problems, is in full effect. Each of these efforts has proven effective in solving particular problems, but not all of them at one time. Because the algorithms are equivalent in the abstract (see the “No free lunch” sidebar), no one algorithm is superior to the others unless proven better on a specific, practical problem. The following sections provide additional information about this concept of using different methods to learn.
Discovering five main approaches to AI learning
An algorithm is a kind of container. It provides a box for storing a method to solve a particular kind of a problem. Algorithms process data through a series of well-defined states. The states need not be deterministic (in many algorithms finding the right solution is often a matter of chance), but the states are defined nonetheless. The sequence of states defines the range of mathematical solutions that the algorithm is able to grasp (technically referred to as the space of hypothesis). The goal is to create an output that solves a problem. In the supervised approach, the algorithm receives inputs that help define the output, but the focus is always on the output.
NO FREE LUNCH
A common theorem in mathematical folklore is the no free lunch theorem by David Wolpert and William Macready, which states that any two optimization algorithms are equivalent when their performance is averaged across all possible problems. Essentially, no matter which optimization algorithm you use, there won’t be any advantage to using it across all possible problems. To gain an advantage, you must use it on those problems for which the algorithm excels. The paper “Simple explanation of the no free lunch theorem of optimization” by Yu-Chi Ho and David L. Pepyne, at ResearchGate.net, provides an accessible but rigorous explanation of the theorem. It’s also a good idea to review the discussion at http://www.no-free-lunch.org/ for more details about no free lunch theorems; machine learning relies on two of them.
Algorithms must express the transitions between states using a well-defined and formal language that the computer can understand (usually, a computer language). In processing the data and solving the problem, the algorithm defines, refines, and applies a mathematical function. The function is always specific to the kind of problem being addressed by the algorithm.
As described in the “Avoiding AI Hype and Overestimation” section of Chapter 1, each of the five tribes has a different technique and strategy for solving problems that result in unique algorithms. Combining these algorithms should lead eventually to the master algorithm that will be able to solve any given problem. The following sections provide an overview of the five main algorithmic families.
Symbolic reasoning
One of the earliest tribes, the symbologists, believed that knowledge could be obtained by operating on symbols (signs that stand for a certain meaning or event) and deriving rules from them. By putting together complex systems of rules, you could attain a logical deduction of the result you wanted to know; thus, the symbologists shaped their algorithms to produce rules from data. In symbolic reasoning, deduction expands the realm of human knowledge, while induction raises the level of human knowledge. Induction commonly opens new fields of exploration, while deduction explores those fields.
Connections modeled on the brain’s neurons
The connectionists are perhaps the most famous of the five tribes. This tribe strives to reproduce the brain’s functions by using silicon instead of biological neurons. Essentially, each of the algorithmic neurons (created as an algorithm that models the real-world counterpart) solves a small piece of the problem, and using many neurons in parallel solves the problem as a whole.
The use of backpropagation, or backward propagation of errors, seeks to reduce the errors of networks built to resemble human neurons by changing the weights (how much a particular input figures into the result) and biases (constant offsets added to the weighted inputs) of the network. The goal is to continue changing the weights and biases until the actual output matches the target output. At this point, the artificial neuron fires and passes its solution along to the next neuron in line. The solution created by just one neuron is only part of the whole solution. Each neuron passes information to the next neuron in line until the group of neurons creates a final output. This method proved the most effective in human-like tasks such as recognizing objects, understanding written and spoken language, and chatting with humans.
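To make the weight-and-bias update concrete, here is a minimal sketch of a single artificial neuron learning by repeatedly nudging its weights against the error. The logical-AND training data, the starting weights, and the learning rate are illustrative assumptions, not values from the text:

```python
import math

# A toy training set (illustrative): learn the logical AND of two inputs
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1, w2, b = 0.1, -0.1, 0.0   # weights and bias, arbitrary starting values
rate = 0.5                    # learning rate (an assumed value)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for epoch in range(5000):
    for (x1, x2), target in data:
        out = sigmoid(w1 * x1 + w2 * x2 + b)
        # Gradient of the squared error for a sigmoid output
        grad = (out - target) * out * (1 - out)
        w1 -= rate * grad * x1   # adjust each weight...
        w2 -= rate * grad * x2
        b  -= rate * grad        # ...and the bias, until output matches target

print(round(sigmoid(w1 * 1 + w2 * 1 + b)))  # rounds to 1 for input (1, 1)
print(round(sigmoid(w1 * 0 + w2 * 0 + b)))  # rounds to 0 for input (0, 0)
```

A real network chains many such neurons in layers and propagates the error gradient backward through all of them, but the core loop — compute output, measure error, adjust weights and bias — is the same.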
Evolutionary algorithms that test variation
The evolutionaries rely on the principles of evolution to solve problems. In other words, this strategy is based on the survival of the fittest (removing any solutions that don’t match the desired output). A fitness function determines the viability of each function in solving a problem. Using a tree structure, the solution method looks for the best solution based on function output. The winner of each level of evolution gets to build the next-level functions. The idea is that the next level will get closer to solving the problem but may not solve it completely, which means that another level is needed. This particular tribe relies heavily on recursion and languages that strongly support recursion to solve problems. An interesting output of this strategy has been algorithms that evolve: One generation of algorithms actually builds the next generation.
Bayesian inference
A group of scientists called Bayesians perceived that uncertainty was the key aspect to keep an eye on, and that learning wasn’t assured but rather took place as a continuous updating of previous beliefs that grew more and more accurate. This perception led the Bayesians to adopt statistical methods and, in particular, derivations from Bayes’ theorem, which helps you calculate probabilities under specific conditions (for instance, the chance of seeing a card of a certain suit drawn from a deck after three other cards of the same suit have been drawn).
Systems that learn by analogy
The analogyzers use kernel machines to recognize patterns in data. By recognizing the pattern of one set of inputs and comparing it to the pattern of a known output, you can create a problem solution. The goal is to use similarity to determine the best solution to a problem. It’s the kind of reasoning that determines that using a particular solution worked in a given circumstance at some previous time; therefore, using that solution for a similar set of circumstances should also work. One of the most recognizable outputs from this tribe is recommender systems. For example, when you buy a product on Amazon, the recommender system comes up with other, related products that you might also want to buy.
The ultimate goal of machine learning is to combine the technologies and strategies embraced by the five tribes to create a single algorithm (the master algorithm) that can learn anything. Of course, achieving that goal is a long way off. Even so, scientists such as Pedro Domingos (https://homes.cs.washington.edu/~pedrod/) are currently working toward that goal.
Delving into the three most promising AI learning approaches
Later sections in this chapter explore the nuts and bolts of the core algorithms chosen by the Bayesians, symbologists, and connectionists. These tribes represent the present and future frontier of learning from data because any progress toward a human-like AI derives from them, at least until a new breakthrough with new and more incredible and powerful learning algorithms occurs. The machine learning scenery is certainly much larger than these three algorithms, but the focus for this chapter is on these three tribes because of their current role in AI. Here’s a synopsis of the approaches in this chapter:
· Naïve Bayes: This algorithm can be more accurate than a doctor in diagnosing certain diseases. In addition, the same algorithm can detect spam and predict sentiment from text. It’s also widely used in the Internet industry to process large amounts of data easily.
· Bayesian networks (graph form): This graph offers a representation of the complexity of the world in terms of probability.
· Decision trees: The decision tree type of algorithm represents the symbologists best. The decision tree has a long history and indicates how an AI can make decisions because it resembles a series of nested decisions, which you can draw as a tree (hence the name).
The next chapter, “Improving AI with Deep Learning,” introduces neural networks, an exemplary type of algorithm proposed by the connectionists and the real engine behind the AI renaissance. Chapter 11 first discusses how a neural network works and then explains deep learning and why it’s so effective in learning.
All these sections discuss types of algorithms. These algorithm types are further divided into subcategories. For example, decision trees come categorized as regression trees, classification trees, boosted trees, bootstrap aggregated, and rotation forest. You can even drill down into subtypes of the subcategories. A Random Forest classifier is a kind of bootstrap aggregating, and there are even more levels from there. After you get past the levels, you begin to see the actual algorithms, which number into the thousands. In short, this book is giving you an overview of an infinitely more complex topic that could require many volumes to cover in any detail. The takeaway is to grasp the type of algorithm and not to get mired in detail.
Awaiting the next breakthrough
In the 1980s, as expert systems ruled the AI scenery, most scientists and practitioners deemed machine learning to be a minor branch of AI that was focused on learning how to best answer simple predictions from the environment (represented by data) using optimization. Today, machine learning has the upper hand in AI, outweighing expert systems in many applications and research developments, and powering AI applications that scientists previously regarded as impossible at such a level of accuracy and performance. Neural networks, the solution proposed by the connectionists, made the breakthrough possible in the last few years by using a mix of increased hardware capacity, more suitable data, and the efforts of scientists such as Geoffrey Hinton, Yann LeCun, Yoshua Bengio, and many others.
The capabilities offered by neural network algorithms (newly branded deep learning because of their increased complexity) are increasing daily. Frequent news reports recount fresh achievements in audio understanding, image and video recognition, language translation, and even lip reading. (Even though deep learning lacks HAL 9000 performance, it’s approaching human performance; see “Lip-Reading AI is Under Development, Under Watchful Eyes” at aitrends.com, which talks about the Speech Recognition App for the Voice Impaired (SRAVI), a product that may be certified by the time you read this book.) The improvements are the result of intensive funding from large and small companies to engage researchers and of the availability of powerful software, such as Google’s TensorFlow (https://www.tensorflow.org/) and PyTorch, an open source machine learning library primarily developed by Facebook's AI Research lab (FAIR) (see https://pytorch.org/). These types of powerful software give both scientists and practitioners access to the technology.
Look for even more sensational AI innovations in the near future. Of course, researchers could always hit a wall again, as happened in the previous AI winters. No one can know whether AI will reach the human level using the present technology or someone will discover a master algorithm, as Pedro Domingos predicted (see his TEDx talk “The Quest for the Master Algorithm” at YouTube.com), that will solve all AI problems (some of which we have yet to imagine). Nevertheless, machine learning is certainly not a fad driven by hype; it’s here to stay, either in its present, improved form, or in the form of new algorithms to come.
Exploring the Truth in Probabilities
Some websites would have you believe that statistics and machine learning are two completely different approaches. For example, when you read the blog post “Statistics vs. Machine Learning, fight!” by Brendan O’Connor (http://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/), you get the idea that the two methodologies are not only different but also downright hostile toward each other. Statistics, contrary to machine learning, was born in an age of limited computational power (you had to solve calculations by hand at that time), so it relies more on simplified mathematical assumptions that render computations easier. Although statistics takes a more theoretical approach to problems, whereas machine learning is based purely on data, the two fields have a lot in common. Moreover, statistics represents one of the five tribes (schools of thought) that make machine learning feasible.
Statistics often resort to probabilities — which are a way to express uncertainty regarding world events — and so do machine learning and AI (to a larger extent than pure statistics). Not all problems are like the games of chess or Go, which let you take a large but limited number of actions when you decide to take them. If you want to learn how to move a robot in a corridor crowded with people, or have a self-driving car successfully engage in a crossing, you have to consider that plans (such as for moving from point A to point B) don’t always have a single outcome and that many results are possible, each one with a different likelihood. In a sense, probability supports AI systems in their reasoning, providing decision-making support and making what seem to be the best, most rational choices despite uncertainty. Uncertainty can exist for various reasons, and AI should be made aware of the level of uncertainty by an effective use of probability:
1. Some situations can’t offer certainty because they’re random in nature. Such situations are inherently stochastic. For instance, in card games, you can’t be sure what hand you’ll have after the dealer shuffles and deals the cards.
2. Even if a situation isn’t random, not observing all its aspects (incomplete observation) creates uncertainty over how things will turn out. For instance, a robot walking down a corridor crowded with people can’t know the intended direction of each person (it can’t read their minds), but it can formulate a guess based on a partial observation of their behavior. As with any guess, the robot has a chance of being right and of being wrong.
3. Limits in the hardware that records world data (called sensors) and approximations in data processing can render results produced from such data uncertain. Measuring is often subject to errors because of the tools used and how the measuring is done. In addition, humans are often subject to cognitive biases and easily fall prey to illusions or blind spots. Similarly, AI is limited by the quality of the data it receives. Approximations and errors introduce uncertainty into every algorithm.
Determining what probabilities can do
Probability tells you the likelihood of an event, and you express it as a number. For instance, if you throw a coin in the air, you don’t know whether it will land as heads or tails, but you can tell the probability of both outcomes. The probability of an event is measured in the range from 0 (no probability that the event occurs) to 1 (certainty that the event occurs). Intermediate values, such as 0.25, 0.5, and 0.75, say that the event will happen with a certain frequency when tried enough times. If you multiply the probability by the number of trials, you get an estimate of how many times the event should happen on average. For instance, if an event occurs with probability p = 0.25 and you try 100 times, you’re likely to witness that event happen 0.25 * 100 = 25 times.
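In code, the expected-count estimate and a quick simulated check look like this (the random seed is fixed only to make the sketch repeatable):

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
p, trials = 0.25, 100

# Expected count: probability times the number of trials
expected = p * trials            # 25.0

# A quick simulation: the observed frequency hovers near the estimate
hits = sum(1 for _ in range(trials) if random.random() < p)
print(expected, hits)
```

Running the simulation many times shows the count bouncing around 25 — the estimate is an average, not a guarantee for any single run.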
As it happens, the outcome of p = 0.25 is the probability of picking a certain suit when choosing a card randomly from a deck of cards. French playing cards are a classic example for explaining probabilities. The deck contains 52 cards equally divided into four suits: clubs and spades, which are black, and diamonds and hearts, which are red. So if you want to determine the probability of picking an ace, you must consider that there are four aces of different suits. The answer in terms of probability is p = 4/52 = 0.077.
Probabilities are between 0 and 1; no probability can exceed such boundaries. You define probabilities empirically from observations. Simply count the number of times a specific event happens with respect to all the events that interest you. For example, say that you want to calculate the probability of how many times fraud happens when doing banking transactions, or how many times people get a certain disease in a particular country. After witnessing the event, you can estimate the probability associated with it by counting the number of times the event occurs and dividing by the total number of events.
You can count the number of times the fraud or the disease happens by using recorded data (mostly taken from records in databases or from direct observation) and then divide that figure by the total number of generic events or observations available. Therefore, you divide the number of frauds by the number of transactions in a year, or you count the number of people who fell ill during the year with respect to the population of a certain area. The result is a number ranging from 0 to 1, which you can use as your baseline probability for a certain event given certain circumstances.
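As a sketch of that division, using fraud and transaction counts invented purely for illustration (not real figures):

```python
# Illustrative counts, invented for the sketch: yearly totals from a database
frauds = 150
transactions = 1_000_000

# Baseline probability: times the event occurred over all observed events
baseline = frauds / transactions
print(baseline)  # a number between 0 and 1
```

The same two-line recipe works for the disease example: count the people who fell ill, divide by the population observed.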
Counting all the occurrences of an event is not always possible, so you need to know about sampling. By sampling, which is an act based on certain probability expectations, you can observe a small part of a larger set of events or objects, yet be able to infer correct probabilities for an event, as well as exact measures such as quantitative measurements or qualitative classes related to a set of objects. For instance, if you want to track the sales of cars in the United States for the past month, you don’t need to track every sale in the country. By using a sample comprising the sales from a few car sellers around the country, you can determine quantitative measures, such as the average price of a car sold, or qualitative measures, such as the car model sold most often.
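A short simulation makes the point: with an invented population of car sale prices (the price range and sizes are assumptions for the sketch), a sample of just a few hundred sales recovers a quantitative measure such as the average price quite closely:

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# A hypothetical population of 10,000 sale prices (invented data)
population = [random.uniform(15_000, 45_000) for _ in range(10_000)]
true_mean = sum(population) / len(population)

# Estimate the same quantity from a sample of just 200 sales
sample = random.sample(population, 200)
sample_mean = sum(sample) / len(sample)

print(round(true_mean), round(sample_mean))  # the two figures land close together
```

The same idea extends to qualitative measures: counting the most frequent car model in the sample estimates the most frequent model in the population.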
Considering prior knowledge
Probability makes sense in terms of time and space, but some other conditions also influence the probability you measure. The context is important. When you estimate the probability of an event, you may (sometimes wrongly) tend to believe that you can apply the probability that you calculated to each possible situation. The term to express this belief is a priori probability, meaning the general probability of an event.
For example, when you toss a coin, if the coin is fair, the a priori probability of a head is about 50 percent (when you also assume the existence of a tiny likelihood of the coin’s landing on its edge). No matter how many times you toss the coin, when faced with a new toss, the probability for heads is still about 50 percent. However, in some other situations, if you change the context, the a priori probability is not valid anymore because something subtle happened and changed it. In this case, you can express this belief as an a posteriori probability, which is the a priori probability after something happened to modify the count.
The Latin terms a priori and a posteriori derive from the treatise Elements by the Greek mathematician Euclid (https://mathcs.clarku.edu/~djoyce/java/elements/toc.html). It describes a priori as what comes before and a posteriori as what comes after.
For instance, the a priori probability of a rainy day in the place you live during springtime could be roughly 20 percent (the exact figure depends on where you live on Earth). However, such probability may differ drastically if you consider only specific temperature and pressure ranges. For instance, when you notice that air pressure is turning low but the temperature is steadily high, the probability of rain drastically increases, and you have a high probability of experiencing a thunderstorm. Therefore, given a different context, the a posteriori probability differs from the expected a priori one. The following sections help you understand the usefulness of probability in more detail.
Conditional probability and Naïve Bayes
You can view cases such as the weather-related ones mentioned in the previous section as conditional probability, and express it as p(y|x), which you read as the probability of event y happening given that x has happened. Conditional probabilities are a very powerful tool for machine learning and AI. In fact, when the a priori probability changes greatly because of certain circumstances, knowing the possible circumstances can boost your chances of correctly predicting an event by observing examples — which is exactly what machine learning is intended to do. For example, as previously mentioned, the expectation of a rainy day could be low in your location depending on the current season. However, if you observe temperature, humidity, and atmospheric pressure, you find that certain combinations lead to an increased probability of rain. If the percentage of rainy days is very high when certain conditions are met, contrary to the a priori probability, a machine learning algorithm, called Naïve Bayes, can provide a probability estimation from knowing meteorological measurements.
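Estimating a conditional probability from observations boils down to counting within a subgroup. Here is a minimal sketch using a toy log of (pressure, rained) days invented for illustration:

```python
# A toy log of (pressure, rained) observations, invented for illustration
days = [("low", True), ("low", True), ("low", False), ("high", False),
        ("high", False), ("high", True), ("low", True), ("high", False)]

# a priori probability of rain: rainy days over all days
p_rain = sum(rained for _, rained in days) / len(days)

# p(rain | low pressure): count only within the low-pressure subgroup
low_days = [rained for pressure, rained in days if pressure == "low"]
p_rain_given_low = sum(low_days) / len(low_days)

print(p_rain, p_rain_given_low)  # the conditional probability is higher
```

In this toy log, rain occurs on half the days overall, but on three-quarters of the low-pressure days — exactly the kind of boost in predictive power that Naïve Bayes exploits.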
In fact, the Naïve Bayes algorithm takes advantage of boosting the chance of a correct prediction by knowing the circumstances surrounding the prediction. Everything starts with the Reverend Bayes and his revolutionary theorem of probabilities. As noted elsewhere in the book, one of the machine learning tribes is named after him (the Bayesians). Bayesians use various statistical methods to solve problems, all based on observing probabilities of the desired outcome in the right context, before and after observing the outcome itself. Based on these observations, they solve the sunrise problem (estimating the likelihood that the sun will rise tomorrow) by chaining repeated observations and continuously updating their estimate of the probability of the sun rising again in proportion to the number of dawns they have already witnessed. You can read about Bayesian reasoning applied to a newborn baby observing the sun in this article from the Economist at https://www.economist.com/science-and-technology/2000/09/28/in-praise-of-bayes (you may have to supply an email address to obtain a free subscription).
Data scientists have great expectations for the development of advanced algorithms based on Bayesian probability. The article “Can Bayesian Networks provide answers when Machine Learning comes up short?” at diginomica.com describes why Bayes’ theorem is so important. Yet the foundations of Bayes’ theorem (the premises used) aren’t all that complicated (although they may be a bit counterintuitive if, as most people do, you normally consider only the a priori probabilities without considering a posteriori ones).
Considering Bayes’ theorem
Apart from being a Presbyterian minister, the Reverend Thomas Bayes was also a statistician and philosopher who formulated his theorem during the first half of the eighteenth century. The theorem was never published while he was alive. Its publication revolutionized the theory of probability by introducing the idea of conditional probability mentioned in the previous section. Thanks to Bayes’ theorem, predicting the probability of an outcome like having a rainy day given certain conditions becomes easier when you apply his formula. Here’s the formula used by Thomas Bayes:
P(B|E) = P(E|B)*P(B) / P(E)
The Reverend Bayes didn’t devise Naïve Bayes; he only formulated the theorem. In truth, there is no sure attribution of the algorithm. It first appeared in a textbook in 1973 without any reference to its creator and passed unobserved for more than a decade until, in 1990, researchers noticed how it performed incredibly accurate predictions if fed with enough accurate data. Reading the formula using the previous example as input can provide a better understanding of an otherwise counterintuitive formula:
· P(B|E): The probability of a belief (B) given a set of evidence (E) (posterior probability). Read belief as an alternative way to express a hypothesis. In this case, the hypothesis is that it rains, and the evidence is a low measure of atmospheric pressure. Knowing the probability of such a belief given evidence can help to predict the weather with some confidence.
· P(E|B): The probability of having low atmospheric pressure when it rains. This term refers to the probability of the evidence within the subgroup of rainy days, which is itself a conditional probability. In this case, the figure is 90 percent, which translates to a value of 0.9 in the formula (likelihood).
· P(B): The general probability of having a rainy day; that is, the a priori probability of the belief. In this case, the probability is 20 percent, or a value of 0.2 (prior probability).
· P(E): The general probability of measuring low atmospheric pressure. This is another a priori probability, this time related to the observed evidence. In this formula, it is a 25 percent probability, which is a value of 0.25 (evidence).
If you solve the previous problem using the Bayes formula and the values you have singled out, the result is 0.9 * 0.2 / 0.25 = 0.72. That’s a high probability, which leads you to affirm that, given such evidence (low atmospheric pressure), there is a good chance that it will rain soon.
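You can check the arithmetic with a few lines of code; the function simply transcribes the formula, and the three values plugged in are the rain-example figures from the text:

```python
def bayes(p_e_given_b, p_b, p_e):
    """Bayes' theorem: P(B|E) = P(E|B) * P(B) / P(E)."""
    return p_e_given_b * p_b / p_e

# Rain example: P(E|B) = 0.9, P(B) = 0.2, P(E) = 0.25
print(bayes(0.9, 0.2, 0.25))  # roughly 0.72
```

The function is reusable: any belief-and-evidence problem with the same three inputs drops straight into it.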
Another common example, which can raise some eyebrows and is routinely found in textbooks and scientific magazines, is that of the positive medical test. It is quite interesting for a better understanding of how prior and posterior probabilities may indeed change a lot under different circumstances.
Say that you're worried that you have a rare disease experienced by 1 percent of the population. You take the test and the results are positive. Medical tests are never perfectly accurate, and the laboratory tells you that when you are ill, the test is positive in 99 percent of the cases, whereas when you are healthy, the test will be negative in 99 percent of the cases. Now, using these figures, you immediately believe that you’re ill, given the high percentage of positive tests when a person is ill (99 percent). However, the reality is quite different. In this case, the figures to plug into the Bayes’ theorem are as follows:
P(B|E) = P(E|B)*P(B) / P(E)
· 0.99 as P(E|B)
· 0.01 as P(B)
· 0.99 * 0.01 + 0.01 * 0.99 = 0.0198 as P(E) (the positive tests among the ill plus the positive tests among the healthy)
The calculations are then 0.99 * 0.01 / 0.0198 = 0.5, which corresponds to just a 50 percent probability that you’re ill. In the end, your chances of not being ill are higher than you expected. This kind of result is called a false positive paradox, where the indicators seem to point to a positive result, but the math says otherwise. You may wonder how this is possible. The fact is that the people who see a positive response from the test fall into two groups:
· People who are ill and get the correct answer from the test: This group represents the true positives, and it amounts to 99 percent of the 1 percent of the population who have the illness.
· People who aren’t ill and get the wrong answer from the test: This group is the 1 percent of the 99 percent of the population who get a positive response even though they aren’t ill. Again, this is a multiplication of 99 percent and 1 percent. This group corresponds to the false positives.
If you look at the problem using this perspective, the result becomes evident: When limiting the context to people who get a positive response from the test, the probability of being in the group of true positives is the same as that of being in the group of false positives.
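The whole medical-test calculation, using only the figures given in the text, can be sketched in a few lines:

```python
p_ill = 0.01                 # 1 percent of the population has the disease
p_pos_given_ill = 0.99       # positive test when ill, 99 percent
p_pos_given_healthy = 0.01   # positive test when healthy, 1 percent

# P(E): overall probability of a positive test
# = true positives among the ill + false positives among the healthy
p_pos = p_pos_given_ill * p_ill + p_pos_given_healthy * (1 - p_ill)

# Bayes' theorem: probability of being ill given a positive test
p_ill_given_pos = p_pos_given_ill * p_ill / p_pos
print(p_ill_given_pos)  # 0.5
```

Changing `p_ill` to a less rare disease (say, 0.10) and rerunning shows how quickly the posterior climbs — the paradox is a consequence of the disease being rare, not of the test being bad.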
Envisioning the world as a graph
Bayes’ theorem can help you deduce how likely something is to happen in a certain context, based on the general probabilities of the fact itself and the evidence you examine, and combined with the probability of the evidence given the fact. Seldom will a single piece of evidence diminish doubts and provide enough certainty in a prediction to ensure that it will happen. As a true detective, to reach certainty, you have to collect more evidence and make the individual pieces work together in your investigation. Noticing how atmospheric pressure has decreased isn’t enough to determine whether it is going to rain. Adding data about humidity, season, and location could help increase confidence.
The Naïve Bayes algorithm helps you arrange all the evidence you gather and reach a more solid prediction with a higher likelihood of being correct. Gathered evidence considered separately couldn’t save you from the risk of predicting incorrectly, but all evidence summed together can reach a more definitive resolution. The following example shows how things work in a Naïve Bayes classification. This is an old, renowned problem, but it represents the kind of capability that you can expect from an AI. The dataset is from the paper “Induction of Decision Trees” by John Ross Quinlan at dl.acm.org. Quinlan is a computer scientist who contributed to the development of another machine learning algorithm, decision trees, in a fundamental way, but his example works well with any kind of learning algorithm. The problem requires that the AI guess the best conditions to play tennis given the weather conditions. The set of features described by Quinlan is as follows:
· Outlook: Sunny, overcast, or rainy
· Temperature: Cool, mild, or hot
· Humidity: High or normal
· Windy: True or false
The following table contains the database entries used for the example:
| Outlook  | Temperature | Humidity | Windy | Play Tennis |
|----------|-------------|----------|-------|-------------|
| Sunny    | Hot         | High     | False | No          |
| Sunny    | Hot         | High     | True  | No          |
| Overcast | Hot         | High     | False | Yes         |
| Rainy    | Mild        | High     | False | Yes         |
| Rainy    | Cool        | Normal   | False | Yes         |
| Rainy    | Cool        | Normal   | True  | No          |
| Overcast | Cool        | Normal   | True  | Yes         |
| Sunny    | Mild        | High     | False | No          |
| Sunny    | Cool        | Normal   | False | Yes         |
| Rainy    | Mild        | Normal   | False | Yes         |
| Sunny    | Mild        | Normal   | True  | Yes         |
| Overcast | Mild        | High     | True  | Yes         |
| Overcast | Hot         | Normal   | False | Yes         |
| Rainy    | Mild        | High     | True  | No          |
The decision to play tennis depends on the four features shown in Figure 10-1.
FIGURE 10-1: A Naïve Bayes model can retrace evidence to the right outcome.
The result of this AI learning example is a decision as to whether to play tennis, given the weather conditions (the evidence). Using just the outlook (sunny, overcast, or rainy) won’t be enough, because the temperature and humidity could be too high or the wind might be strong. These arguments represent real conditions that have multiple causes, or causes that are interconnected. The Naïve Bayes algorithm is skilled at guessing correctly when multiple causes exist.
The algorithm computes a score, based on the probability of making a particular decision and multiplied by the probabilities of the evidence connected to that decision. For instance, to determine whether to play tennis when the outlook is sunny but the wind is strong, the algorithm computes the score for a positive answer by multiplying the general probability of playing (9 played games out of 14 occurrences) by the probability of the day’s being sunny (2 out of 9 played games) and of having windy conditions when playing tennis (3 out of 9 played games). The same rules apply for the negative case (which has different probabilities for not playing given certain conditions):
likelihood of playing: 9/14 * 2/9 * 3/9 = 0.05
likelihood of not playing: 5/14 * 3/5 * 3/5 = 0.13
Because the score for the likelihood of not playing is higher, the algorithm decides that it’s safer not to play under such conditions. It converts the scores into probabilities by dividing each score by the sum of the two:
probability of playing : 0.05 / (0.05 + 0.13) = 0.278
probability of not playing : 0.13 / (0.05 + 0.13) = 0.722
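The arithmetic above takes only a few lines of Python to reproduce. This is a minimal sketch of the score computation, with the counts taken directly from the play-tennis table; naive_bayes_score is an illustrative helper, not part of any library:

```python
# A minimal sketch of the Naive Bayes score computation, using counts
# taken from the play-tennis table (not a general-purpose library).

def naive_bayes_score(prior, likelihoods):
    """Multiply the class prior by the likelihood of each piece of evidence."""
    score = prior
    for likelihood in likelihoods:
        score *= likelihood
    return score

# Evidence: the outlook is sunny and the wind is strong.
play = naive_bayes_score(9 / 14, [2 / 9, 3 / 9])     # ~0.048
no_play = naive_bayes_score(5 / 14, [3 / 5, 3 / 5])  # ~0.129

# Turn the two scores into probabilities by dividing each by their sum.
total = play + no_play
print(f"probability of playing:     {play / total:.3f}")
print(f"probability of not playing: {no_play / total:.3f}")
```

Note that working with the unrounded scores gives probabilities of about 0.270 and 0.730, slightly different from the figures in the text, which come from the rounded scores 0.05 and 0.13.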
You can further extend Naïve Bayes to represent relationships that are more complex than a series of factors hinting at the likelihood of an outcome by using a Bayesian network, which consists of a graph showing how events affect each other. The graph’s nodes represent the events, and its arcs show which events affect which others, accompanied by tables of conditional probabilities that show how each relationship works in terms of probability. Figure 10-2 shows a famous example of a Bayesian network taken from a 1988 academic paper, “Local computations with probabilities on graphical structures and their application to expert systems,” by Steffen L. Lauritzen and David J. Spiegelhalter, published in the Journal of the Royal Statistical Society (find it on the Wiley Online Library).
FIGURE 10-2: A Bayesian network can support a medical decision.
The depicted network is called Asia. It shows possible patient conditions and what causes what. For instance, if a patient has dyspnea, it could be an effect of tuberculosis, lung cancer, or bronchitis. Knowing whether the patient smokes, has been to Asia, or has anomalous x-ray results (thus giving certainty to specific pieces of evidence, a priori in Bayesian language) helps infer the real (posterior) probabilities of having any of the pathologies in the graph.
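To make the idea of conditional probability tables concrete, here is a hand-rolled sketch of inference in a tiny network shaped like a fragment of Asia: Smoking affects Bronchitis, which affects Dyspnea. All the probability values are invented for illustration; they are not the values from the original paper:

```python
# A toy Bayesian network: Smoking -> Bronchitis -> Dyspnea.
# Every probability below is invented purely for illustration.

P_smoking = {True: 0.3, False: 0.7}
P_bronchitis = {  # P(Bronchitis | Smoking)
    True: {True: 0.6, False: 0.4},
    False: {True: 0.3, False: 0.7},
}
P_dyspnea = {  # P(Dyspnea | Bronchitis)
    True: {True: 0.8, False: 0.2},
    False: {True: 0.1, False: 0.9},
}

def joint(smoking, bronchitis, dyspnea):
    """Joint probability of one complete assignment of the three events."""
    return (P_smoking[smoking]
            * P_bronchitis[smoking][bronchitis]
            * P_dyspnea[bronchitis][dyspnea])

# Posterior P(Bronchitis = True | Dyspnea = True), found by enumerating
# every case compatible with the evidence (the patient has dyspnea).
num = sum(joint(s, True, True) for s in (True, False))
den = sum(joint(s, b, True) for s in (True, False) for b in (True, False))
print(f"P(bronchitis | dyspnea) = {num / den:.3f}")
```

Real Bayesian network libraries use far more efficient algorithms than this brute-force enumeration, but the principle of turning a priori evidence into posterior probabilities is the same.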
Bayesian networks, though intuitive, have complex math behind them, and they’re more powerful than a simple Naïve Bayes algorithm because they mimic the world as a sequence of causes and effects based on probability. Bayesian networks are so effective that you can use them to represent any situation. They have varied applications, such as medical diagnoses, the fusing of uncertain data arriving from multiple sensors, economic modeling, and the monitoring of complex systems such as a car. For instance, because driving in highway traffic may involve complex situations with many vehicles, the Analysis of MassIve Data STreams (AMIDST) consortium, in collaboration with the automaker Daimler, devised a Bayesian network that can recognize maneuvers by other vehicles and increase driving safety. You can read more about this project and see the complex Bayesian network in “Analysis of MassIve Data STreams - AMIDST” at vbn.auu.dk.
Growing Trees that Can Classify
A decision tree is another key type of algorithm in machine learning that contributes to AI implementation and learning. Decision tree algorithms aren’t new; the first algorithm of their kind dates back to the 1970s (with many ensuing variants). When you consider experiments and original research, the use of decision trees goes back even earlier; they are as old as the perceptron, the forerunner of neural networks. As the core symbologist algorithm, decision trees have enjoyed long-lasting popularity because they’re an intuitive type of algorithm: It’s easy to translate the output into rules and therefore make the output easily understood by humans. Decision trees are also extremely easy to use. All these characteristics make them an effective and appealing alternative to models that require complex input data matrix transformations or extremely accurate tuning of hyperparameters.
Symbolism is the AI approach based on logic statements and extensive use of deduction. Deduction expands knowledge from what we know, and induction formulates general rules starting from evidence.
Predicting outcomes by splitting data
If you have a group of measures and want to describe them using a single number, you use the arithmetic mean (summing all the measures and dividing by their number). In a similar fashion, if you have a group of classes or qualities (for instance, a dataset containing records of many breeds of dogs or types of products), you can represent them all using the most frequent class in the group, which is called the mode. Like the mean, the mode is a statistical measure, but it contains the value (a measure or a class) that appears most often. Both the mean and the mode strive to report the number or class that gives you the most confidence in guessing the next group element, because they produce the fewest mistakes. In a sense, they’re predictors that learn the most probable answer from existing data. Decision trees leverage means and modes as predictors by splitting the dataset into smaller sets whose means or modes are the best possible predictors for the problem at hand.
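To see the mean and the mode acting as simple predictors, consider this short sketch using only Python’s standard library (the sample values are made up for illustration):

```python
# The mean and the mode as simple constant predictors: each one is the
# single best guess for its kind of data (numbers or classes).
from statistics import mean, mode

heights = [170, 165, 180, 175, 170]  # numeric measures (made-up values)
breeds = ["beagle", "poodle", "beagle", "terrier", "beagle"]  # classes

print(mean(heights))  # best single guess for the next numeric measure
print(mode(breeds))   # best single guess for the next class
```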
Dividing a problem in order to arrive at a solution easily is also a common strategy in many divide-and-conquer algorithms. As with an enemy army in battle, if you can split your foe and fight it singularly, you can attain an easier victory.
Using a sample of observations as a starting point, the algorithm retraces the rules that generated the output classes (or the numeric values when working through a regression problem) by dividing the input matrix into smaller and smaller partitions until the process triggers a rule for stopping. Such retracing from particular toward general rules is typical of human inverse deduction, as treated by logic and philosophy.
In a machine learning context, such inverse reasoning is achieved by applying a search among all the possible ways to split the training data (the in-sample when discussed in statistics) and decide, regardless of what could happen in the following steps, to use the split that maximizes statistical measurements on the resulting partitions.
The division occurs to enforce a simple principle: Each partition of the initial data must make predicting the target outcome easier because it features a different and more favorable distribution of classes (or values) than the original sample. The algorithm creates partitions by splitting the data. It determines the splits by first evaluating the features and then evaluating the values in the features that could bring the maximum improvement of a special statistical measure, the measure that plays the role of the cost function in a decision tree.
A number of statistical measurements determine how to make the splits in a decision tree. All abide by the idea that a split must improve on the original sample, or on another possible split, when it makes prediction safer. Among the most used measurements are Gini impurity, information gain, and variance reduction (for regression problems). These measurements operate similarly, so this chapter focuses on information gain because it’s the most intuitive measurement and conveys most clearly how a decision tree detects increased predictive ability (or reduced risk) from a certain split. Ross Quinlan created a decision tree algorithm based on information gain, called ID3, in the 1970s; it’s still quite popular thanks to its upgraded version, C4.5. Information gain relies on the formula for informative entropy (devised by Claude Shannon, an American mathematician and engineer known as the father of information theory), a generalized formulation that describes the expected value of the information contained in a message:
Shannon Entropy E = -∑(p(i) * log2(p(i)))
In the formula, you consider all the classes one at a time and sum their contributions. For each class, p(i) is the probability of that class (expressed in the range of 0 to 1) and log2 is the base-2 logarithm. Starting with a sample in which you want to classify two classes having the same probability (a 50/50 distribution), the maximum possible entropy is Entropy = -0.5*log2(0.5) -0.5*log2(0.5) = 1.0. However, when the decision tree algorithm detects a feature that can split the dataset into two partitions, where the distribution of the two classes is 40/60, the average informative entropy diminishes:
Entropy = -0.4*log2(0.4) -0.6*log2(0.6) = 0.97
Note that you sum the entropy contributions of all the classes. Using the 40/60 split, the sum is less than the theoretical maximum of 1, so the entropy has diminished. Think of entropy as a measure of the mess in data: The less mess, the more order, and the easier it is to guess the right class. After a first split, the algorithm tries to split the obtained partitions further using the same logic of reducing entropy. It progressively splits any successive data partition until no more splits are possible, either because a subsample contains a single example or because it has met a stopping rule.
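The entropy computations above can be sketched in a few lines of Python:

```python
# The Shannon entropy formula from the text, applied to the two examples.
from math import log2

def entropy(probabilities):
    """E = -sum(p * log2(p)) over all class probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0: maximum mess for two classes
print(entropy([0.4, 0.6]))  # ~0.971: a 40/60 split is more orderly
```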
Stopping rules are limits to the expansion of a tree. These rules work by considering three aspects of a partition: initial partition size, resulting partition size, and information gain achievable by the split. Stopping rules are important because decision tree algorithms approximate a large number of functions; however, noise and data errors can easily influence this algorithm. Consequently, depending on the sample, the instability and variance of the resulting estimates affect decision tree predictions.
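As a rough illustration of how such stopping rules combine, here is a minimal Python sketch; the threshold values are invented for the example rather than taken from any particular implementation:

```python
# A sketch of two pre-pruning stopping rules: refuse a split when the
# partition is too small or when the information gain is too little.
# Both thresholds are illustrative, not from any specific library.

MIN_SAMPLES_SPLIT = 5  # the partition must hold at least 5 examples
MIN_GAIN = 0.01        # the split must reduce entropy by at least this

def should_split(partition_size, gain):
    if partition_size < MIN_SAMPLES_SPLIT:
        return False   # too few examples: the signal may just be noise
    return gain >= MIN_GAIN

print(should_split(14, 0.247))  # a large, informative split is accepted
print(should_split(3, 0.9))     # a tiny partition is rejected outright
```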
Making decisions based on trees
As an example of decision tree use, this section uses the same Ross Quinlan dataset discussed in the “Envisioning the world as a graph” section, earlier in the chapter. Using this dataset lets you see the ID3 algorithm in action, a special kind of decision tree found in the paper “Induction of Decision Trees,” mentioned previously in this chapter. The dataset is quite simple, consisting of only 14 observations of weather conditions, with results that say whether playing tennis is appropriate.
The example contains four features: outlook, temperature, humidity, and wind, all expressed using qualitative classes instead of measurements (you could express temperature, humidity, and wind strength numerically) to convey a more intuitive understanding of how the weather features relate to the outcome. After the algorithm processes these features, you can represent the dataset using a tree-like schema, as shown in Figure 10-3. As the figure shows, you can inspect and read a set of rules created by splitting the dataset into parts in which prediction is easier because you can rely on the most frequent class (in this case, the outcome, which is whether to play tennis).
FIGURE 10-3: A visualization of the decision tree built from the play-tennis data.
To read the nodes of the tree, just start from the topmost node, which corresponds to the original training data; next, start reading the rules. Note that each node has two derivations: The left branch means that the upper rule is true (stated as yes in a box), and the right one means that it is false (stated as no in a box).
On the right of the first rule, you see an important terminal rule (a terminal leaf), in a circle, stating a positive result, Yes, that you can read as play tennis=True. According to this node, when the outlook isn’t sunny (Sun) or rainy (Rain), it’s possible to play. (The numbers under the terminal leaf show four examples affirming this rule and zero denying it.) Note that you could understand the rule better if the output simply stated that when the outlook is overcast, play is possible. Frequently, decision tree rules aren’t immediately usable, and you need to interpret them before use. However, they are clearly intelligible (and much better than a coefficient vector of values).
On the left, the tree proceeds with other rules related to Humidity. Again, on the left, when humidity is high and outlook is sunny, most terminal leaves are negative, except when the wind isn’t strong. When you explore the branches on the right, you see that the tree reveals that play is always possible when the wind isn’t strong, or when the wind is strong but it doesn’t rain.
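To see why the tree tests the outlook first, the following Python sketch recomputes the information gain of each feature on the play-tennis data. It’s a hand-rolled illustration of ID3’s first step, not Quinlan’s original code:

```python
# ID3's first step on the play-tennis data: pick the feature whose
# split yields the largest information gain (entropy reduction).
from math import log2
from collections import Counter

data = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", False, "no"), ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"), ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
features = ["outlook", "temperature", "humidity", "windy"]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, feature_index):
    """Entropy of the whole sample minus the split's weighted entropy."""
    base = entropy([row[-1] for row in rows])
    remainder = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [row[-1] for row in rows if row[feature_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

gains = {name: information_gain(data, i) for i, name in enumerate(features)}
best = max(gains, key=gains.get)
print(best)  # the feature chosen for the root split
```

Running it shows that outlook yields the largest gain (about 0.25), which is why the tree splits on the outlook first.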
Pruning overgrown trees
Even though the play-tennis dataset in the previous section illustrates the nuts and bolts of a decision tree, it has little probabilistic appeal because it proposes a set of deterministic actions (it has no conflicting instructions). Training with real data usually doesn’t feature such sharp rules, leaving room for ambiguity: Instead of certainty about an outcome, you get its likelihood.
Decision trees show more variability in their estimates because of the noise they pick up from the data during the learning process (an effect of overfitting). To overfit the data less, the example specifies that the minimum split has to involve at least five examples. Because the terminal leaves are numerically larger, the confidence that the tree is picking up the correct signal increases because the quantity of evidence is higher. The example also prunes the tree, which happens after the tree is fully grown.
Starting from the leaves, the example prunes the tree of branches that show little improvement in information gain. By initially letting the tree expand, branches with little improvement are tolerated because they can unlock more interesting branches and leaves. Retracing from leaves to root and keeping only branches that have some predictive value reduces the variance of the model and keeps the resulting rules concise.
For a decision tree, pruning is just like brainstorming. First, the algorithm generates all possible ramifications of the tree (as you do with ideas in a brainstorming session). Second, when the brainstorming concludes, the algorithm keeps only the ideas that really work.
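As a rough sketch of that bottom-up retracing, the following Python example collapses any branch whose split gained too little information into a leaf holding that branch’s majority class. The tree structure and gain values are invented for illustration and don’t follow any specific library’s pruning rule:

```python
# Bottom-up pruning on a toy tree. Each internal node is a dict recording
# the information gain its split achieved; leaves are plain class labels.
# A branch whose gain falls below the threshold collapses into a leaf
# holding its majority class. (All values here are invented.)

def prune(node, min_gain=0.05):
    if not isinstance(node, dict):          # already a leaf: keep it
        return node
    node["branches"] = {value: prune(child, min_gain)
                        for value, child in node["branches"].items()}
    if node["gain"] < min_gain:
        return node["majority"]             # collapse the weak split
    return node

tree = {
    "feature": "outlook", "gain": 0.25, "majority": "yes",
    "branches": {
        "overcast": "yes",
        "sunny": {"feature": "windy", "gain": 0.02, "majority": "no",
                  "branches": {True: "no", False: "no"}},
        "rainy": "yes",
    },
}
pruned = prune(tree)
print(pruned["branches"]["sunny"])  # the weak windy split became a leaf
```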