Chapter 11
IN THIS CHAPTER
Beginning with the limited perceptron
Getting the building blocks of neural networks and backpropagation
Perceiving and detecting objects in images using convolutions
Using sequences and catching them with RNNs
Discovering the creative side of AI thanks to GANs
Newspapers, business magazines, social networks, and nontechnical websites are all saying the same thing: AI is cool stuff that’s going to revolutionize the world because of deep learning. Actually, AI is a far larger field than machine learning, and deep learning is just a small part of machine learning.
It’s important to distinguish between the hype used to lure investors and what this technology can actually do, which is the overall purpose of this chapter. The article at https://tinyurl.com/2n2w4ktv contains a useful comparison of the roles of the three methods of manipulating data into useful output (AI, machine learning, and deep learning), which this chapter describes in detail.
This chapter helps you understand deep learning from a practical and technical point of view and see what it can achieve in the near term by exploring its possibilities and limitations. The chapter begins with the history and basics of neural networks. It then presents the state-of-the-art results from convolutional neural networks and recurrent neural networks (both forms of supervised learning), as well as generative adversarial networks (a kind of unsupervised learning).
Shaping Neural Networks Similar to the Human Brain
The following sections present a family of learning algorithms that derive inspiration from how the brain works. They’re neural networks, the core algorithm of the connectionists’ tribe that best mimics neurons inside human brains at a smaller scale. (See Chapter 1 for an overview of the five tribes of machine learning employed by various scientists.)
Connectionism is the machine learning approach based on neuroscience, as well as on the example of biologically interconnected networks.
Introducing the neuron
Human brains have billions of neurons, which are cells that receive, process, and transmit electric and chemical signals. Each neuron possesses a nucleus; input filaments, the dendrites, that receive signals from other neurons; and a single output filament, the axon, that terminates in synapses devoted to outside communication. Neurons connect to other neurons and transmit information between them using chemicals, whereas information inside the neuron itself is processed electrically. You can read more about neuronal structure in “What’s the Basic Structure of Nerves?” at Dummies.com or in Neuroscience For Dummies, 2nd Edition, by Frank Amthor (Wiley).
Reverse-engineering how a brain processes signals helps the connectionists define neural networks based on biological analogies and their components. Connectionists thus use an abundance of brain terms such as neurons, activation, and connections as names for mathematical operations. Yet, in spite of the biological terms, neural networks resemble nothing more than a series of multiplications and summations when you check their math formulations. These algorithms are extraordinarily effective at solving complex problems such as image and sound recognition or machine language translation; using specialized hardware, they can execute prediction computations quickly.
Starting with the miraculous perceptron
The core of a neural network algorithm is the neuron (also called a unit). Many neurons arranged in an interconnected structure make up a neural network, with each neuron linking to the inputs and outputs of other neurons. Thus, a neuron can input data from examples or transmit the results of other neurons, depending on its location in the neural network.
SEEING DEEP LEARNING AS AUGMENTATION
Chapter 10 discusses Bayesian networks and includes an example of how such networks can provide diagnostic hints to a doctor. To do this, the Bayesian network requires well-prepared probability data. Deep learning can create a bridge between the capability of algorithms to make the best decision possible using all the required data and the data that is actually available, which is never in the best format for machine learning algorithms to understand. Photos, images, sound recording, web data (especially from social networks), and company records all require data analysis to make the data suitable for machine learning purposes.
In contrast to Bayesian networks, deep learning algorithms need very few instructions about the data they are working on. A deep learning algorithm could help doctors by matching extensive knowledge in medicine (using all available sources, including books, white papers, and the latest research from the National Institutes of Health) and patient information. The patient information, in turn, could come from previous diagnoses and medicine prescriptions, or even from social media evidence (so that doctors don’t need to ask whether the patient has been in Asia, for example; the AI will detect it from its photos on Instagram or Facebook). This scenario may sound like sci-fi, but creating such a system is nearly possible today; for instance, a deep learning AI can now detect pneumonia from x-rays at a level exceeding practicing radiologists, thanks to the Stanford Machine Learning Group (https://tinyurl.com/2kzrrjhb).
Deep learning also appears in many applications. You find it in social networks in which images and content are automatically classified; in search engines when queries are retrieved; in online advertising when consumers are targeted; in mobile phones and digital assistants for speech, language understanding, or translation tasks; in self-driving cars for vision detection; and in the game of Go, in which AlphaGo beat a champion. In less widely known applications, deep learning can also power robotics and earthquake predictions. You might also find applications such as TinEye (https://tineye.com/) helpful. In this case, you supply an image, and TinEye finds it for you on the Internet.
Frank Rosenblatt at the Cornell Aeronautical Laboratory created the first example of a neuron of this kind, the perceptron, devising it in 1957 under the sponsorship of the United States Naval Research Laboratory (NRL). Rosenblatt was a psychologist as well as a pioneer in the field of artificial intelligence. Proficient in cognitive science, he wanted to create a computer that could learn by trial and error, just as a human does.
The perceptron was just a smart way to trace a separating line in a simple space made by the input data, as shown in Figure 11-1, in which you have two features (in this case, the size and level of domestication of an animal) to use to distinguish two classes (dogs and cats in this example). The perceptron formulation produces a line in a Cartesian space where the examples divide more or less perfectly into groups. The approach is similar to Naïve Bayes, described in Chapter 10, which sums conditional probabilities multiplied by general ones in order to classify data.
FIGURE 11-1: Example of a perceptron in simple and challenging classification tasks.
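If you're curious about what the perceptron's math looks like in practice, the following short Python sketch (not taken from the book or any particular library) trains a perceptron on a handful of made-up size and domestication values to separate dogs from cats; all the numbers are purely illustrative.

```python
import numpy as np

# Toy data: two features (size, domestication level) for two classes.
# 0 = dog (bigger, less domesticated in this invented example), 1 = cat.
X = np.array([[0.9, 0.3], [0.8, 0.2], [0.7, 0.4],
              [0.2, 0.8], [0.3, 0.9], [0.1, 0.7]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(X.shape[1])  # one weight per feature
b = 0.0                   # bias term
lr = 0.1                  # learning rate

for epoch in range(20):                          # repeat over the data a few times
    for xi, target in zip(X, y):
        prediction = int(np.dot(w, xi) + b > 0)  # step activation: fire or stay silent
        error = target - prediction              # 0 when correct, +1 or -1 when wrong
        w += lr * error * xi                     # nudge the separating line
        b += lr * error

print("weights:", w, "bias:", b)
print("predictions:", [int(np.dot(w, xi) + b > 0) for xi in X])
```

The learning rule simply nudges the separating line whenever the perceptron guesses wrong, which is all the learning a single perceptron can do.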
The perceptron didn’t realize the full expectations of its creator or financial supporters. It soon displayed a limited capacity, even in its image-recognition specialization. The general disappointment ignited the first AI winter and abandonment of connectionism until the 1980s. Yet, some research continued despite the loss of funding. (Dr. Nils J. Nilsson, now retired but formerly a Stanford AI professor, tells more about progress during this time in the article at https://tinyurl.com/47h9j8v2.)
However, the ideas prompted by the perceptron were here to stay. Later on, experts tried to create a more advanced perceptron, and they succeeded. Neurons in a neural network are a further evolution of the perceptron: They are many, they connect to each other, and they imitate our neurons when they activate under a certain stimulus. In observing human brain functionalities, scientists noticed that neurons receive signals but don’t always release a signal of their own. Releasing a signal depends on the amount of signal received. When a neuron acquires enough stimuli, it fires an answer; otherwise, it remains silent. In a similar fashion, algorithmic neurons, after receiving data, sum it and use an activation function to evaluate the result. If the input they receive achieves a certain threshold, the neuron transforms and transmits the input value; otherwise, it simply dies.
Neural networks use special functions called activation functions to fire a result. All you need to know is that they are a key neural network component because they allow the network to solve complex problems. They are like doors, letting the signal pass or stop. They don’t simply let the signal pass, however; they transform it in a useful way. Deep learning, for instance, isn’t possible without efficient activation functions such as the Rectified Linear Unit (ReLU), and thus activation functions are an important aspect of the story.
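To get a concrete feel for what an activation function does, here's a tiny sketch in Python comparing the ReLU mentioned in the text with the classic sigmoid; the input values are arbitrary and serve only to show how each function transforms a signal.

```python
import numpy as np

def relu(z):
    # Rectified Linear Unit: lets positive signals pass unchanged, zeroes out negative ones
    return np.maximum(0, z)

def sigmoid(z):
    # A classic alternative that squashes any value into the range (0, 1)
    return 1 / (1 + np.exp(-z))

signals = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(signals))     # [0.  0.  0.  0.5 2. ]
print(sigmoid(signals))  # values between 0 and 1
```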
Mimicking the Learning Brain
In a neural network, you must consider the architecture first, which is the arrangement of the neural network components. The following sections discuss neural network architectural considerations.
Considering simple neural networks
Contrary to other algorithms, which have a fixed pipeline that determines how they receive and process data, neural networks require that you decide how information flows by fixing the number of units (the neurons) and their distribution in layers, an arrangement called the neural network architecture, as shown in Figure 11-2.
FIGURE 11-2: A neural network architecture, from input to output.
The figure shows a simple neural network architecture. Note how the layers filter and process information in a progressive way. This is a feed-forward network because data feeds in one direction through the network. Connections exclusively link units in one layer with units in the following layer; no connections exist between units in the same layer or with units outside the adjacent layers. Moreover, the information pushes forward (in the figure, from left to right), and processed data never returns to previous neuron layers.
In more advanced neural network applications, you also have to decide on the layer types you need and the large number of parameters that will influence the layers’ behavior. Neural networks are extremely flexible, and that aspect is a double-edged sword: You increase the power of the machine learning tool, but complexity skyrockets.
Using a neural network is like using a stratified filtering system for water: You pour the water from above, and the water is filtered at the bottom. The water has no way to go back up; it just goes forward and straight down, and never laterally. In the same way, neural networks force data features to flow through the network and mix with each other as dictated by the network’s architecture. By using the best architecture to mix features, the neural network creates newly composed features at every layer and helps achieve better predictions. Unfortunately, in spite of the efforts of academics to discover a theoretical rule, you have no way to determine the best architecture without empirically trying different solutions and testing whether output data helps predict your target values after flowing through the network. This need for manual configuration illustrates the no-free-lunch theorem (which you can read about in Chapter 10) in action. The gist of it is that an architecture that works the best on one task won’t necessarily perform successfully on other problems.
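As a rough illustration of what fixing the number of units and their distribution in layers means in code, the following sketch defines a small feed-forward architecture with TensorFlow's Keras API (assuming you have TensorFlow installed); the layer sizes and activations are arbitrary choices for illustration, not a recommended recipe.

```python
import tensorflow as tf

# A minimal feed-forward architecture: data flows from input to output only.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),               # four input features
    tf.keras.layers.Dense(8, activation="relu"),     # first hidden layer
    tf.keras.layers.Dense(8, activation="relu"),     # second hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer for a yes/no answer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()  # prints the layers and the number of weights in each
```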
Sometimes concepts can be understood better if directly tested in reality. Google offers a Neural Network Playground (http://playground.tensorflow.org) in which you can actually test how a neural network works in an intuitive manner, as shown in Figure 11-3. You can see how a neural network behaves as you add or remove layers and change the kinds of activations.
Figuring out the secret is in the weights
Neural networks have different layers, with each one having its own weights. Weights represent the strength of the connection between neurons in the network. When the weight of the connection between two layers is small, the network dampens the values flowing between them, signaling that taking this route isn’t likely to influence the final prediction. Likewise, a large positive or negative value affects the values that the next layer receives, thus determining certain predictions. This approach is analogous to brain cells, which don’t stand alone but connect with other cells. As someone grows in experience, connections between neurons tend to weaken or strengthen to activate or deactivate certain brain network cell regions, causing further processing or an activity (a reaction to a danger, for instance, if the processed information signals a life-threatening situation).
FIGURE 11-3: The Neural Network Playground lets you see how modifying a neural network changes how it works.
Each successive layer of neural network units progressively processes values taken from features, as on a conveyor belt. As the network transmits data, it arrives at each unit as a summed value produced by the values present in the previous layer and weighted by the connections leading into the present layer. When the data received from other neurons exceeds a certain threshold, the activation function increases or modifies the value stored in the unit; otherwise, it extinguishes the signal by reducing or cancelling it. After activation function processing, the result is ready to push forward to the next layer’s connection. These steps repeat for each layer until the values reach the end, and you have a result.
The weights of the connections provide a way to combine the inputs, creating new features by mixing the processed inputs as dictated by the weights and activation functions. Because of the transformation it applies, the activation function also makes the resulting recombination of inputs nonlinear. Both of these neural network components enable the algorithm to learn complex target functions that represent the relationship between the input features and the target outcome.
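The following minimal Python sketch shows that mixing at work: made-up weights combine three input features into new, nonlinear features, which a final layer then turns into a single output. None of the numbers comes from a trained network; they simply illustrate the cycle of multiplying, summing, and activating.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Toy forward pass: three input features flow through two layers of weights.
x = np.array([0.5, -1.2, 3.0])        # one example with three features

W1 = np.array([[ 0.2, -0.4,  0.1],    # weights of the first layer
               [ 0.7,  0.3, -0.5]])   # (2 units x 3 inputs), made-up values
b1 = np.array([0.1, -0.2])

W2 = np.array([[0.6, -0.9]])          # weights of the output layer (1 unit x 2 inputs)
b2 = np.array([0.05])

hidden = relu(W1 @ x + b1)   # weighted sum, then activation: the new, composed features
output = W2 @ hidden + b2    # the final prediction before any last activation
print(hidden, output)
```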
Understanding the role of backpropagation
Learning occurs in a human brain because of the formation and modification of synapses between neurons, based on stimuli received by trial-and-error experience. Neural networks provide a way to replicate this process as a mathematical formulation called backpropagation. Here’s how this architecture of interconnected computing units can solve problems: The units receive an example, and if they don’t guess correctly, they retrace the problem in the system of existing weights using backpropagation and fix it by changing some values. This process goes on for many iterations before a neural network can learn. Iterations in a neural network are called epochs, a name that fits perfectly because a neural network may need days or weeks of training to learn complex tasks.
Backpropagation math is quite advanced and requires knowledge of concepts such as derivatives. You can read a detailed but accessible math description in Machine Learning For Dummies, 2nd Edition, by John Paul Mueller and Luca Massaron (Wiley) and get an overview of the necessary calculations. Backpropagation as a concept is intuitive enough to grasp and convey because it resembles what people do when performing a task using iterated approximate trial and error. Since the appearance of the backpropagation algorithm in the 1970s, developers have fixed it many times and are currently discussing whether to rethink it. (You can read the opinion of Geoffrey Hinton, one of the coauthors of the method, at https://tinyurl.com/rrea42wz.) Backpropagation is at the core of the present AI renaissance. In the past, each neural network learning process improvement resulted in new applications and a renewed interest in the technique. Also, the current deep learning revolution, which involves a revival of neural networks (abandoned at the beginning of the 1990s), resulted from key advances in the way neural networks learn from their errors.
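Backpropagation in a real network involves chains of derivatives, but a stripped-down sketch with a single weight conveys the cycle of guessing, measuring the error, and adjusting. The numbers below are invented purely for illustration.

```python
# One weight, one example: the network guesses, measures its error, and uses
# the derivative of that error to fix the weight, repeating over many epochs.
x, target = 2.0, 8.0      # input 2.0 should map to 8.0
w = 0.5                   # initial guess for the weight
lr = 0.05                 # learning rate

for epoch in range(30):
    prediction = w * x               # forward pass
    error = prediction - target     # how wrong the guess was
    gradient = 2 * error * x        # derivative of the squared error with respect to w
    w -= lr * gradient              # backpropagation step: adjust the weight
print(w)  # approaches 4.0, because 4.0 * 2.0 = 8.0
```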
Introducing Deep Learning
After backpropagation, the next improvement in neural networks led to deep learning. Research continued in spite of the AI winter, and neural networks overcame technical problems, such as the vanishing gradient, which limits how deep a neural network can grow. Developers needed larger neural networks to solve certain problems, networks so large that building them wasn’t feasible in the 1980s. Moreover, researchers started taking advantage of the computational developments in CPUs and GPUs (the graphic processing units better known for their application in gaming).
The vanishing gradient problem occurs when you try to transmit a signal through a neural network and the signal quickly fades to near-zero values; after that, it can’t get through the activation functions anymore. This happens because neural networks are chained multiplications. Each near-zero multiplication decreases the values rapidly, and activation functions need large enough values to let the signal pass. The farther neuron layers are from the output, the higher the likelihood that they’ll get locked out of updates because the signals are too small and the activation functions will stop them. Consequently, your network stops learning as a whole, or it learns at an incredibly slow pace.
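A few lines of Python make the arithmetic of the vanishing gradient visible: multiplying a small per-layer derivative by itself once per layer quickly produces a factor too small to drive learning. The value 0.25 below is simply the maximum derivative of a sigmoid activation, used here as an illustrative stand-in.

```python
# Why deep chains of multiplications starve early layers of signal.
derivative_per_layer = 0.25   # at most 0.25 for a sigmoid activation
for depth in (2, 5, 10, 20):
    gradient = derivative_per_layer ** depth   # one multiplication per layer
    print(f"{depth:2d} layers -> gradient factor {gradient:.2e}")
# At 20 layers the factor is on the order of 1e-12, far too small to update weights.
```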
New solutions help avoid the problem of the vanishing gradient and many other technical problems, allowing larger deep networks in contrast to the simpler shallow networks of the past. Deep networks are possible thanks to the studies of scholars from the University of Toronto in Canada, such as Geoffrey Hinton (https://tinyurl.com/2nwjwzay), who insisted on working on neural networks, even when they seemed to most people to be an old-fashioned machine learning approach.
GPUs are powerful matrix and vector calculation computing units necessary for backpropagation. These technologies make training neural networks achievable in a shorter time and accessible to more people. Research also opened a world of new applications. Neural networks can learn from huge amounts of data and take advantage of big data (images, text, transactions, and social media data), creating models that continuously perform better, depending on the flow of data you feed them.
Big players such as Google, Facebook, Microsoft, and IBM spotted the new trend and have, since 2012, started acquiring companies and hiring experts (Hinton now works with Google; LeCun, the creator of Convolutional Neural Networks, leads Facebook AI research) in the new fields of deep learning. The Google Brain project, run by Andrew Ng and Jeff Dean, put together 16,000 computers to calculate a deep learning network with more than a billion weights, thus enabling unsupervised learning from YouTube videos. The computer network could even determine what a cat is by itself, without any human intervention (as you can read in this article from Wired at https://tinyurl.com/u4ssuh6j).
UNDERSTANDING DEEP LEARNING ISSUES
As things stand now, people have an unrealistic idea of how deep learning can help society as a whole. You see a deep learning application beat someone at chess or Go and think that if it can do that really amazing thing, what other amazing things can it do? The problem is that even its proponents don’t understand deep learning very well. In technical papers about deep learning, the author often describes layers of nebulous processing organized into a network without any sort of discourse as to what really happens in each of those boxes. Recent advances point out that deep learning networks are basically a way to memorize data and then retrieve relevant bits of it using similarity between the actual problem and the memorized one. (You can read an amazing scientific paper on the topic by Pedro Domingos here: https://tinyurl.com/46wfu3mr.) The essential point to remember is that deep learning doesn’t actually understand anything. It uses a massive number of examples to derive statistically based pattern matching using mathematical principles. When an AI wins a game involving a maze, it doesn’t understand the concept of a maze; it simply knows that certain inputs manipulated in specific ways create certain winning outputs.
In contrast to humans, deep learning must rely on a huge number of examples to discover specific relationships between inputs and outputs. If you tell a child that everyone between a certain age is a tween — neither a child nor a teen — the child will be able to recognize anyone fitting the category of a tween with a high percentage of accuracy, even when the other person is a complete unknown. Deep learning would require special training to accomplish the same task, and it would be easy to fool because examples outside its experience wouldn’t register.
Humans can also create hierarchies of knowledge without any sort of training. We know, for example, without much effort that dogs and cats are both animals. In addition, in knowing that dogs and cats are animals, a human can easily make the leap to see other animals as animals, even without specific training. Deep learning would require separate training for each thing that is an animal. In short, deep learning can’t transfer what it knows to other situations as humans can.
Even with these limitations, deep learning is an amazing tool, but it shouldn’t be the only tool in the AI toolbox. Using deep learning to see patterns where humans can’t is the perfect way to apply this technology. Patterns are an essential part of discovering new things. For example, human testing of compounds to battle cancer or fight a coronavirus pandemic could take an immense amount of time. By seeing patterns where humans can’t, deep learning could make serious inroads toward a solution with a lot less effort than humans would require.
Explaining the differences between deep learning and other forms of neural networks
Deep learning may seem to be just a larger neural network that runs on more computers — in other words, just a mathematics and computational power breakthrough that makes larger networks available. However, something inherently qualitative changed in deep learning as compared to shallow neural networks. It’s more than a breakthrough produced by brilliant technologists at work: Deep learning shifts the paradigm in machine learning from feature creation (features that make learning easier and that you have to create using data analysis) to feature learning (complex features automatically created based on the actual data). Such a shift couldn’t be spotted in smaller networks but becomes evident when you use many neural network layers and lots of data.
When you look inside deep learning, you may be surprised to find a lot of old technology, but amazingly, everything works as it never had before. Because researchers finally figured out how to make some simple, good-ol’ solutions work together, deep learning networks can automatically filter, process, and transform big data. For instance, new activations like ReLU aren’t all that new; they’ve been known since the perceptron. Also, the image-recognition abilities that initially made deep learning so popular aren’t new. Initially, deep learning achieved great momentum thanks to Convolutional Neural Networks (CNN). Developed in the 1980s by the French scientist Yann LeCun (whose personal home page is at http://yann.lecun.com/), such networks now bring about astonishing results because they use many neural layers and lots of data. The same goes for technology that allows a machine to understand human speech or translate from one language to another; it’s decades-old technology that researchers revisited and got to work in the new deep learning paradigm.
Of course, part of the difference is also provided by data (more about this later), the increased usage of GPUs, and computer networking. Together with parallelism (more computers put in clusters and operating in parallel), GPUs allow you to create larger networks and successfully train them on more data. In fact, a GPU is estimated to perform certain operations 70 times faster than any CPU, allowing a cut in training times for neural networks from weeks to days or even hours.
For more information about how much a GPU can empower machine learning through the use of a neural network, peruse this technical paper on the topic: https://icml.cc/2009/papers/218.pdf.
GPUs aren’t the only option for building effective deep learning solutions promptly. Special application-specific integrated circuits (ASIC) have made an appearance, and the designers have demonstrated that those circuits perform even better than GPUs. For instance, Google started developing the Tensor Processing Unit (TPU) in 2015. In 2018, Google made TPUs available in its cloud centers. A TPU is a blazing-fast, application-specific integrated circuit to accelerate the calculations involved in deep learning when using Google’s specialized computational library, TensorFlow. See the “Working with Deep Learning Processors (DLPs)” section of Chapter 4 for details on other alternatives.
Finding even smarter solutions
Deep learning influences AI’s effectiveness in solving problems in image recognition, machine translation, and speech recognition that were initially tackled by classic AI and machine learning. In addition, it presents new and advantageous solutions:
· Continuous learning using online learning
· Reusable solutions using transfer learning
· More democratization of AI using open source frameworks
· Simple, straightforward solutions using end-to-end learning
The following sections describe these four new approaches.
Using online learning
Neural networks are more flexible than other machine learning algorithms, and they can continue to train as they work on producing predictions and classifications. This capability comes from optimization algorithms that allow neural networks to learn, which can work repeatedly on small samples of examples (called batch learning) or even on one example at a time (called online learning). Deep learning networks can build their knowledge step by step and be receptive to new information that may arrive (like a baby’s mind, which is always open to new stimuli and to learning experiences). For instance, a deep learning application on a social media website can be trained on cat images. As people post photos of cats, the application recognizes them and tags them with an appropriate label. When people start posting photos of dogs on the social network, the neural network doesn’t need to restart training; it can continue by learning images of dogs as well. This capability is particularly useful for coping with the variability of Internet data. A deep learning network can be open to novelty and adapt its weights to deal with it.
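As a hedged sketch of what online (or small-batch) learning can look like in code, the loop below keeps updating the same Keras model as new batches arrive; random arrays stand in for the stream of fresh photos, so the example illustrates the training pattern rather than a working classifier.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64, 3)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # two classes, e.g., cat or dog
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

for step in range(100):
    # Random placeholder arrays stand in for each new batch of posted photos.
    images = np.random.rand(16, 64, 64, 3).astype("float32")
    labels = np.random.randint(0, 2, size=16)
    loss = model.train_on_batch(images, labels)  # one incremental update, no restart
```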
Using transfer learning
Flexibility comes in handy even after a network completes its training, when you need to reuse it for purposes different from the initial learning. Networks that distinguish objects and correctly classify them require a long time and a lot of computational capacity to learn what to do. Extending a network’s capability to new kinds of images that weren’t part of the previous learning means transferring the knowledge to this new problem (transfer learning).
For instance, you can transfer a network that’s capable of distinguishing between dogs and cats to perform a job that involves spotting dishes of macaroni and cheese. You use the majority of the layers of the network as they are (you freeze them) and then work on the final, output layers (fine-tuning). In a short time, and with fewer examples, the network will apply what it learned in distinguishing dogs and cats to macaroni and cheese. It will perform even better than a neural network trained only to recognize macaroni and cheese.
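Here's a hedged sketch of that freeze-and-fine-tune recipe using a Keras network pretrained on general images (MobileNetV2 is used only as a convenient example of a pretrained base, not the book's choice); the new two-class task and its output layer are hypothetical.

```python
import tensorflow as tf

# Freeze a base network that already knows shapes and textures, then train
# only a small new head for the new task (say, macaroni and cheese or not).
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         input_shape=(160, 160, 3), pooling="avg")
base.trainable = False   # freeze: the pretrained layers keep their weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # the new, trainable output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(new_images, new_labels, epochs=5)  # hypothetical data: few examples, short training
```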
Transfer learning is something new to most machine learning algorithms and opens up a possible market for transferring knowledge from one application to another, from one company to another. Google is already doing that, actually sharing its immense data repository by making public the networks it built on it (as detailed in this post: https://tinyurl.com/448hkhpa). This is a step in democratizing deep learning by allowing everyone to access its potentiality. To make things even better, there is now a lite version of the TensorFlow Object Recognition API for mobile devices, which is described at https://tinyurl.com/yuyznrh9.
Democratization by using open source frameworks
Today, deep learning networks are accessible to everyone, as are the tools for creating them. It’s not just a matter of publicly divulging the scientific papers explaining how deep learning works; it’s a matter of programming. In the early days of deep learning, you had to build every network from scratch as an application developed in a language such as C++, which limited access to a few well-trained specialists. Scripting capabilities today (for instance, using Python; go to https://www.python.org/) are better because of a large array of open source deep learning frameworks, such as TensorFlow by Google (https://www.tensorflow.org/) or PyTorch by Facebook (https://pytorch.org/). These frameworks allow the replication of the most recent advances in deep learning using straightforward commands.
Along with many lights come some shadows. Neural networks need huge amounts of data to work, and data isn’t accessible to everybody because larger organizations hold it. Transfer learning can mitigate the lack of data, but only partially, because certain applications do require actual data. Consequently, the democratization of AI is limited. Moreover, deep learning systems are so complex that their outputs are both hard to explain (allowing bias and discrimination to flourish) and frail because tricks can fool those systems (see https://tinyurl.com/5ua5jw42 for details). Any neural network can be sensitive to adversarial attacks, which are input manipulations devised to deceive the system into giving a wrong response.
Using end-to-end learning
Finally, deep learning allows end-to-end learning, which means that it solves problems in an easier and more straightforward way than previous approaches and might therefore have more impact when solving problems. Say that you want to solve a difficult problem, such as having AI recognize known faces or drive a car. Using the classical AI approach, you would have to split the problem into more manageable sub-problems to achieve an acceptable result in a feasible time. For instance, if you wanted to recognize faces in a photo, previous AI systems arranged the problem into these parts:
1. Find the faces in the photo.
2. Crop the faces from the photo.
3. Process the cropped faces to have a pose similar to an ID card photo.
4. Feed the processed cropped faces as learning examples to a neural network for image recognition.
Today, you can feed the photo to a deep learning architecture and guide it to learn to find faces in the images and then classify them. You can use the same approach for language translation, speech recognition, or even self-driving cars (as discussed in Chapter 14). In all cases, you simply pass the input to a deep learning system and obtain the wanted result.
Detecting Edges and Shapes from Images
Convolutional Neural Networks (also known as ConvNet or CNN) have fueled the recent deep learning renaissance. Practitioners and academics are persuaded that deep learning is a feasible technique because of its results in image-recognition tasks. This success has produced a sort of gold rush, with many people trying to apply the same technology to other problems. The following sections discuss how CNNs help detect image edges and shapes for tasks such as deciphering handwritten text.
Starting with character recognition
CNNs aren’t a new idea. They appeared at the end of the 1980s as the work of Yann LeCun (now director of AI at Facebook) when he worked at AT&T Labs-Research, together with Yoshua Bengio, Leon Bottou, and Patrick Haffner on a network named LeNet5. You can see the network at http://yann.lecun.com/exdb/lenet/ or in this video, in which a younger LeCun himself demonstrates the network: https://tinyurl.com/3rnwr6de. At that time, having a machine able to decipher handwritten numbers was quite a feat, one that assisted the postal service in automating zip code detection and sorting incoming and outgoing mail.
Developers achieved some results earlier by connecting images directly to a detection neural network, with each image pixel connected to a node in the network. The problem with this approach is that the network can’t achieve translation invariance, which is the capability to decipher the number under different conditions of size, distortion, or position in the image, as exemplified in Figure 11-4. Such a neural network could detect only numbers similar to those it had seen before, and it made many mistakes. Transforming the image before feeding it to the neural network partially solved the problem by resizing, moving, and cleaning the pixels and creating special chunks of information for better network processing. This technique, called feature creation, requires both expertise in the necessary image transformations and many computations in terms of data analysis. Image-recognition tasks at that time were more the work of an artisan than a scientist.
FIGURE 11-4: Using translation invariance, a neural network spots the dog and its variations.
Convolutions easily solved the problem of translation invariance because they offer a different image-processing approach inside the neural network. Convolutions are the foundation of LeNet5 and provide the basic building blocks for all current CNNs, which perform the following tasks:
· Image classification: Determining what object appears in an image
· Image detection: Finding where an object is in an image
· Image segmentation: Separating the areas of an image based on their content; for example, in an image of a road, separating the road itself from the cars on it and the pedestrians
Explaining how convolutions work
To understand how convolutions work, you start from the input, which is an image composed of one or more pixel layers, called channels, using values from 0 (the pixel is fully switched off) to 255 (the pixel is fully switched on). For instance, RGB images have individual channels for red, green, and blue colors. Mixing these channels generates the palette of colors as you see them on the screen.
The input data receives simple transformations to rescale the pixel values (for instance, setting the range from zero to one) and then passes those values on. Transforming the data makes the convolutions’ work easier because convolutions are simply multiplication and summation operations, as shown in Figure 11-5. The convolution neural layer takes small portions of the image, multiplies the pixel values inside the portion by a grid of specially devised numbers, sums everything derived from the multiplication, and projects the result into the next neural layer.
Such an operation is flexible because backpropagation sets the numbers inside the convolution grid (see the article at https://tinyurl.com/2e2293b9 for precisely how the convolution step works, including an animation), and the characteristics that the convolution filters out of the image are the ones that matter for the neural network to achieve its classification task. Some convolutions catch only lines, some only curves or special patterns, no matter where they appear in the image (and this is the translation invariance property of convolutions). As the image data passes through various convolutions, it’s transformed, assembled, and rendered in increasingly complex patterns until the convolution produces reference images (for instance, the image of an average cat or dog), which the trained CNN later uses to detect new images.
FIGURE 11-5: A convolution scanning through an image.
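If you want to see the multiply-and-sum operation without any framework in the way, this small Python sketch slides a 3x3 grid over a random image; the kernel values are a classic edge-detecting choice used purely for illustration.

```python
import numpy as np

# Slide a 3x3 grid of numbers (the kernel) over a tiny grayscale image,
# multiplying and summing at each position.
image = np.random.rand(8, 8)                 # a made-up 8x8 single-channel image
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])              # a classic vertical-edge detector

output = np.zeros((6, 6))                    # the result shrinks without padding
for row in range(6):
    for col in range(6):
        patch = image[row:row + 3, col:col + 3]    # the small portion of the image
        output[row, col] = np.sum(patch * kernel)  # multiply and sum

print(output.shape)   # (6, 6): a new feature map highlighting vertical edges
```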
If you want to know more about convolutions, you can check out a visualization created by researchers from Google Research and Google Brain. The visualization shows the inner workings of a 22-layer network developed by scientists at Google called GoogLeNet (see the paper at https://tinyurl.com/6x8zuk5c). In the appendix (https://tinyurl.com/3d77zthf), they show examples from the layers assigned to detect first edges, then textures, then full patterns, then parts, and finally entire objects.
Interestingly, setting basic ConvNet architectures isn’t hard. Just imagine that the more layers you have, the better (up to a certain limit, however). You set the number of convolution layers and some convolution behavior characteristics, like how the grid is made (filter, kernel, or feature detector values), how the grid slides in the image (stride), and how it behaves around the image borders (padding).
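The following hedged Keras sketch shows where those choices (number of filters, kernel size, stride, and padding) appear when you define a basic ConvNet; the specific sizes here are arbitrary, chosen only to make the knobs visible.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, kernel_size=3, strides=1, padding="same",
                           activation="relu", input_shape=(32, 32, 3)),  # 16 filters, 3x3 grid
    tf.keras.layers.MaxPooling2D(pool_size=2),    # shrink the feature maps
    tf.keras.layers.Conv2D(32, kernel_size=3, strides=1, padding="same",
                           activation="relu"),    # deeper layers catch more complex patterns
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # ten output classes (illustrative)
])
model.summary()
```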
Looking at how convolutions work hints at what going deep means in deep learning: Data undergoes deeper transformations than it does under any other machine learning algorithm or a shallow neural network. The more layers there are, the more transformations an image undergoes, and the deeper the learning becomes.
Advancing using image challenges
CNNs are a smart idea. AT&T actually implemented LeNet5 in ATM check readers. However, another AI winter started in the mid-1990s, with many researchers and investors losing faith that neural networks could revolutionize AI. In addition, the data lacked complexity at the time. Researchers were able to achieve results comparable to LeNet5’s by using newer machine learning algorithms called Support Vector Machines (from the Analogizers tribe) and Random Forests, a refinement of decision trees from the symbologists’ tribe (see Chapter 10 for an explanation of that tribe).
Only a handful of researchers, such as Geoffrey Hinton, Yann LeCun, and Yoshua Bengio, kept developing neural network technologies until a new dataset offered a breakthrough and ended the AI winter. Meanwhile, 2006 saw an effort by Fei-Fei Li, then a computer science professor at the University of Illinois Urbana-Champaign (and now chief scientist at Google Cloud as well as professor at Stanford) to provide more real-world datasets to better test algorithms. She started amassing an incredible number of images, representing a large number of object classes. She and her team achieved such a huge task by using Amazon’s Mechanical Turk, a service that you use to ask people to do microtasks for you (like classifying an image) for a small fee.
The resulting dataset, completed in 2009, was called ImageNet and contained 3.2 million labeled images, arranged into 5,247 hierarchically organized categories. You can explore it at https://image-net.org/ or read the original paper that presents the dataset at https://tinyurl.com/yy98efcj. ImageNet soon became the basis of a competition, launched in 2010, in which algorithms proved their capability to correctly classify images arranged into 1,000 classes.
In seven years of competition (the challenge closed definitively in 2017), the winning algorithms raised the accuracy in predicting the images from 71.8 percent to 97.3 percent, which surpasses human capabilities (yes, humans make mistakes in classifying objects). At the beginning, researchers noticed that their algorithms started working better with more data (there was nothing like ImageNet at that time), and then they started testing new ideas and improved neural network architectures. That brought about a host of innovations in the way to process data, build layers, and connect them all together. Striving to achieve better results on the ImageNet competition had favorable impacts on all related fields of deep learning research.
Although the ImageNet competitions don’t take place anymore, researchers are even today developing more CNN architectures, enhancing accuracy or detection capabilities as well as robustness. In fact, many deep learning solutions are still experimental and not yet applied to critical applications, such as banking or security, not just because of difficulties in their interpretability but also because of possible vulnerabilities.
Vulnerabilities come in all forms. Researchers have found that by adding specially devised noise or changing a single pixel in an image, a CNN can radically change its answers, in nontargeted (you just need to fool the CNN) or targeted (you want the CNN to provide a specific answer) attacks. You can investigate more about this matter in the OpenAI tutorial at https://tinyurl.com/amp8w5c. OpenAI is a nonprofit AI research company. The paper entitled “One pixel attack for fooling deep neural networks” (https://tinyurl.com/4std9cc8) is also helpful. The point is that CNNs aren’t a safe technology yet. You can’t simply use them in place of your eyes; you have to use great care with them.
Learning to Imitate Art and Life
CNNs didn’t impact just computer vision tasks (such as vision in self-driving cars) but are important for many other applications as well (for example, they’re necessary for virtual assistant AI technology such as Alexa, Siri, or Google Assistant). CNNs persuaded many researchers to invest time and effort in the deep learning revolution. The consequent research and development sprouted new ideas. Subsequent testing finally brought innovation to AI by helping computers learn to understand spoken language, translate written foreign languages, and create both text and modified images, thus demonstrating how complex computations about statistical distributions can be translated into a kind of artistry, creativity, and imagination. If you talk of deep learning and its possible applications, you also have to mention Recurrent Neural Networks (RNN) and Generative Adversarial Networks (GAN) or you won’t have a clear picture of what deep learning can do for AI.
Memorizing sequences that matter
One of the weaknesses of CNNs is their lack of memory. They do well with understanding a single picture, but understanding a picture in a context, like a frame in a video, requires an ability they don’t have, so they can’t get the right answer to difficult AI challenges that involve sequences. Technically, a CNN can recognize a set of patterns, but without much distinction of how they are spatially arranged (hence their property of translation invariance). When the sequence in which patterns appear does matter, CNNs don’t offer any particular advantage. Many important problems are sequences. If you want to understand a book, you read it page by page. The sequences are nested: Within a page is a sequence of words, and within a word is a sequence of letters. To understand the book, you must understand the sequence of letters, words, and pages. An RNN is the answer because it processes new inputs while tracking past inputs. The input in the network doesn’t just proceed forward as usual in a neural network, but also loops inside it. It’s as if the network hears an echo of itself.
If you feed an RNN a sequence of words, the network will learn that when it sees a word preceded by certain other words, it can determine how to complete the phrase. RNNs aren’t simply a technology that can automate input compilation (as when a browser automatically completes search terms as you type words). In addition, RNNs can take sequences as input and provide a translation as output, such as the overall meaning of a phrase (so now, AI can disambiguate phrases where wording is important) or its translation into another language (again, translation works in a context). This even works with sounds, because it’s possible to interpret certain sound modulations as words. RNNs allow computers and mobile phones to understand, with great precision, not only what you said (it’s the same technology that automatically creates subtitles) but also what you meant to say, opening the door to computer programs that chat with you and to digital assistants such as Siri, Cortana, and Alexa.
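To make the idea of the network hearing an echo of itself concrete, the minimal Python sketch below updates a hidden state at every step of a toy sequence; the weights are random stand-ins rather than trained values, so this only illustrates the recurrent loop itself.

```python
import numpy as np

rng = np.random.default_rng(0)
W_input = rng.normal(size=(4, 3))    # maps a 3-number input into a 4-number state
W_state = rng.normal(size=(4, 4))    # maps the previous state into the new one

hidden = np.zeros(4)                 # the network's memory starts empty
sequence = [rng.normal(size=3) for _ in range(5)]   # five time steps (say, five words)

for x in sequence:
    # New state = current input plus an echo of everything seen so far
    hidden = np.tanh(W_input @ x + W_state @ hidden)

print(hidden)   # a summary of the whole sequence, order included
```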
Discovering the magic of AI conversations
A chatbot is software that can converse with you through two methods: auditory (you speak with it and listen to answers) or textual (you type what you want to say and read the answers). You may have heard of it under other names (conversational agent, chatterbot, talkbot, and others), but the point is that you may already use one on your smartphone, computer, or a special device. Siri, Cortana, and Alexa are all well-known examples. You may also exchange words with a chatbot when you contact a firm’s customer service by web or phone, or through an app on your mobile phone when using Twitter, Slack, Skype, or other applications for conversation.
Chatbots are big business because they help companies save money on customer service operators — maintaining constant customer contact and serving those customers — but the idea isn’t new. Even if the name is recent (devised in 1994 by Michael Mauldin, the inventor of the Lycos search engine), chatbots are considered the pinnacle of AI. According to Alan Turing’s vision, detecting a strong AI by talking with it shouldn’t be possible. Turing devised a famous conversation-based test to determine whether an AI has acquired intelligence equivalent to a human being.
You have a weak AI when the AI shows intelligent behavior but isn’t conscious like a human being. A strong AI occurs when the AI can really think as a human.
The Turing test requires a human judge to interact with two subjects through a computer terminal: one human and one machine. The judge evaluates which one is an AI based on the conversation. Turing asserted that if an AI can trick a human being into thinking that the conversation is with another human being, it’s possible to believe that the AI is at the human level of intelligence. The problem is hard because it’s not just a matter of answering properly and in a grammatically correct way, but also a matter of incorporating the context (place, time, and characteristics of the person the AI is talking with) and displaying a consistent personality (the AI should be like a real persona, both in background and attitude).
Since the 1960s, challenging the Turing test has proved to be motivation for developing chatbots, which are based on the idea of retrieval-based models. That is, a Natural Language Processing (NLP) algorithm parses language that is input by the human interrogator. Certain words or sets of words recall preset answers and feedback from chatbot memory storage.
NLP is data analysis focused on text. The algorithm splits text into tokens (elements of a phrase such as nouns, verbs, and adjectives) and removes any less useful or confounding information. The tokenized text is processed using statistical operations or machine learning. For instance, NLP can help you tag parts of speech and identify words and their meaning, or determine whether one text is similar to another.
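Here's a toy Python illustration of that first NLP step, splitting a phrase into tokens, dropping a few stop words, and counting what remains; the stop-word list is deliberately tiny and the sentence is invented.

```python
# Split text into tokens, drop less useful "stop words", count the rest.
text = "The cat sat on the mat and the dog watched the cat"
stop_words = {"the", "on", "and", "a", "an"}   # a tiny, illustrative stop list

tokens = [word.lower() for word in text.split()]            # split into tokens
filtered = [word for word in tokens if word not in stop_words]

counts = {}                                                 # a bag-of-words style summary
for word in filtered:
    counts[word] = counts.get(word, 0) + 1
print(counts)   # {'cat': 2, 'sat': 1, 'mat': 1, 'dog': 1, 'watched': 1}
```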
Joseph Weizenbaum built the first chatbot of this kind, ELIZA, in 1966 as a form of computer psychological therapist. ELIZA was made of simple heuristics: base phrases adapted to the context, plus keywords that triggered ELIZA to recall an appropriate response from a fixed set of answers. You can try an online version of ELIZA at https://tinyurl.com/3cfrj53y. You might be surprised to read meaningful conversations such as the one produced by ELIZA with its creator: https://tinyurl.com/j3zw42fj.
Retrieval-based models work fine when interrogated using preset topics because they incorporate human knowledge, just as an expert system does (as discussed in Chapter 3), thus they can answer with relevant, grammatically correct phrases. Problems arise when confronted with off-topic questions. The chatbot can try to fend off these questions by bouncing them back in another form (as ELIZA did) and be spotted as an artificial speaker. A solution is to create new phrases, for instance, based on statistical models, machine learning, or even a pretrained RNN, which could be built on neutral speech or could even reflect the personality of a specific person. This approach is called generative-based models and is the frontier of chatbots today because generating language on the fly isn’t easy.
Generative-based models don’t always answer with pertinent and correct phrases, but many researchers have made advances recently, especially in RNNs. As noted in previous sections, the secret is in the sequence: You provide an input sequence in one language and an output sequence in another language, as in a machine translation problem. In this case, you provide both input sequence and output sequence in the same language. The input is a part of a conversation, and the output is the following reaction.
Given the actual state of the art in chatbot building, RNNs work great for short exchanges, although obtaining perfect results for longer or more articulated phrases is more difficult. As with retrieval-based models, RNNs recall information they acquire, but not in an organized way. If the scope of the discourse is limited, these systems can provide good answers, but they degrade when the context is open and general because they would need knowledge comparable to what a human acquires during a lifetime. (Humans are good conversationalists based on experience and knowledge.)
Data for training an RNN is really the key. For instance, Google Smart Reply, a chatbot by Google, offers quick answers to emails. The story at https://tinyurl.com/2d43528f tells more about how this system is supposed to work. In the real world, it tended to answer most conversations with “I love you” because it was trained using biased examples. Something similar happened to Microsoft’s Twitter chatbot Tay, whose ability to learn from interactions with users led it astray because conversations were biased and malicious (https://tinyurl.com/55v6dxuh).
If you want to know the state of the art in the chatbot world, you can keep updated about yearly chatbot competitions in which Turing tests are applied to the current technology. For instance, the Loebner Prize is the most famous one (https://tinyurl.com/cwb2zr8c) and the right place to start. Though still unable to pass the Turing test, the most recent winner of the Loebner Prize at the time of the writing of this book was Mitsuku (for the fourth time in a row), a program that can reason about specific objects proposed during the discourse; it can also play games and even perform magic tricks (https://tinyurl.com/uvmh8cnk).
Going for the state of the pretrained art
RNNs have come a long way in recent years. When researchers and practitioners experienced how much more useful RNNs are than the previous statistical approach of analyzing text as a pool of words (the commonly used technical term is bag of words), they started using them en masse and, as they tested more and more applications, they also discovered limitations that they tried to overcome.
As initially devised, RNNs had limits. In particular, they needed too much data to learn from and they couldn’t really remember information that appeared earlier in a phrase. Moreover, many researchers reported that RNNs were just a look-back algorithm (also called backjumping; scroll down to Chapter 6 at https://tinyurl.com/2snz8rep for more details) when processing text and that sometimes you need to look further into a phrase in order to make sense of what has been said before. Thus, in order to cope with the memory limitations of the RNNs and the multiple relations of words in a phrase, researchers devised the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) neural units, which can both remember and forget previous words in a smarter way. The researchers also made all these neural units read text bi-directionally, so they can pick a word from both the start and the end of the phrase and make sense of everything.
Sepp Hochreiter, a computer scientist who made many contributions to the fields of machine learning, deep learning, and bioinformatics, and Jürgen Schmidhuber, a pioneer in the field of artificial intelligence, invented LSTMs. See: “Long Short-Term Memory” in the MIT Press journal Neural Computation. The GRU first appeared in the paper called “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation” (https://arxiv.org/pdf/1406.1078.pdf).
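As a hedged sketch of how these pieces fit together in code, the Keras model below wraps an LSTM in a Bidirectional layer so that the network reads each sequence from both ends; the vocabulary size, sequence length, and layer widths are arbitrary examples, not values from any published model.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),                         # phrases padded to 100 tokens (arbitrary)
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),   # words become vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),     # read forward and backward
    tf.keras.layers.Dense(1, activation="sigmoid"),              # e.g., a yes/no judgment about the phrase
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```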
Working with word embeddings
Researchers found creating pretrained models useful for dealing with data quantity limitations. Pretrained models for word embeddings are similar to pretrained image models in that they process large amounts of publicly available textual data to provide a means to score a word in a phrase in a meaningful way. The idea is to change words into numbers. The numbers aren’t randomly chosen, but relate to each other in the same way as words relate by meaning. For example, you can transform the names of different foods into columns of numeric values (a matrix) in such a way that the words that denote fruits have similar scores in a particular column. In the same column, vegetables can get different values, but not too far from those of fruit. Finally, the names of meat dishes can be far away in value from fruits and vegetables. The values are similar when the words are synonymous or refer to a similar concept, a property called semantic similarity (with semantic referring to the meaning of words).
These pretrained models are called embeddings. Embeddings aren’t new; they have a long history. The concept of embeddings appeared in statistical multivariate analysis under the name of multivariate correspondence analysis. Starting in the 1970s, Jean-Paul Benzécri, a French statistician and linguist, along with many other researchers from the French School of Data Analysis, discovered how to map a limited set of words into low-dimensional spaces (usually 2-D representations, such as a topographic map). This process turns words into meaningful numbers and projections, a discovery that brought about many applications in linguistics and the social sciences and paved the way for the recent advancements in language processing using deep learning.
The word embedding refers to nothing more than a mapping function that transforms words into numeric sequences that are meaningful for a deep learning algorithm. Popular word embeddings are Google’s Word2Vec, Stanford’s Global Vectors (GloVe), and Facebook’s fastText.
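A toy example makes semantic similarity tangible: below, hand-made three-number vectors stand in for real embeddings (which have hundreds of dimensions and are learned from huge text collections by models such as Word2Vec or GloVe), and cosine similarity scores how close their meanings are.

```python
import numpy as np

# Invented, low-dimensional "embeddings" for illustration only.
embeddings = {
    "apple":  np.array([0.9, 0.8, 0.1]),
    "pear":   np.array([0.8, 0.9, 0.2]),
    "carrot": np.array([0.6, 0.7, 0.3]),
    "steak":  np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # 1.0 means pointing the same way (similar meaning), 0.0 means unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["apple"], embeddings["pear"]))    # high: both fruits
print(cosine_similarity(embeddings["apple"], embeddings["carrot"]))  # lower, but still food
print(cosine_similarity(embeddings["apple"], embeddings["steak"]))   # lowest of the three
```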
Finding out the limits of BERT and GPT-3
Word embeddings such as Word2Vec and others aren’t the only advanced technique that you can use to make deep learning solutions shine with unstructured text. Recently, a series of pretrained networks appeared that make modeling language problems even easier. For instance, one of the most promising is the Google Bidirectional Encoder Representations from Transformers (BERT). Here’s a link to the Google AI blog post describing the technique: https://tinyurl.com/b7vhn8d6. The interesting part of BERT is that it produces even more useful embeddings because it can map the same word into different numbers based on the other words that appear with it in the phrase. Even if embeddings are just numbers, this recent development shows an approach similar to how humans understand the meaning of words based on their context.
Based on the same philosophy, the GPT-3 neural network, created by OpenAI, a San Francisco–based artificial intelligence research laboratory, can achieve even more competitive results than BERT. GPT-3 can answer questions, write and summarize essays, generate adventure games (just try AI Dungeon to get an idea: https://play.aidungeon.io/), translate languages, and even write computer code, as described here: https://tinyurl.com/3w5y53wf. Yet, it’s important to realize that this technology is still far from a real AI. When interacting with a human, deep learning networks such as GPT-3 and BERT, or other, even more complex ones (such as the gigantic one that Google recently trained: https://tinyurl.com/2cnppm3n) can’t really understand the discourse. They can only process the phrases in order to achieve a particular, pre-ordered result.
Making one AI compete against another AI
RNNs and transformers can make a computer converse with you, and if you have no idea that the neural network is reactivating sequences of words that it has previously learned, you get the idea that something related to intelligence is going on behind the scenes. In reality, no thought or reasoning goes on behind it, although the technology doesn’t simply recall preset phrases but is fairly articulated.
Generative Adversarial Networks (GANs) represent another kind of deep learning technology that can give you an even stronger illusion that the AI can display creativity. Again, this technology relies on recalling previous examples and the machine’s understanding that the examples contain rules — rules that the machine can play with as a child plays with toy bricks (technically, the rules are the statistical distributions underlying the examples). Nevertheless, a GAN is an incredible type of technology that has displayed promise for a fairly large number of future applications, in addition to the uses today (see https://tinyurl.com/na74p3uz as an example).
GANs originated in 2014 from the work of a few researchers at the Département d’informatique et de recherche opérationnelle at the University of Montreal, the most notable of whom is Ian Goodfellow (see the white paper at https://tinyurl.com/4r65ca6e). The proposed deep learning approach immediately raised interest, and GANs are now among the most researched technologies, with constant developments and improvements. Yann LeCun found Generative Adversarial Networks to be “the most interesting idea in the last ten years in machine learning” (https://tinyurl.com/y4j7ch6b). In an interview with MIT Technology Review, Ian Goodfellow explains that level of enthusiasm with this intriguing statement: “You can think of generative models as giving artificial intelligence a form of imagination” (https://tinyurl.com/zpzrsdpp).
To see a basic GAN in action (there are now many sophisticated variants, and more are being developed), you need a reference dataset, usually consisting of real-world data, whose examples you would like to use to teach the GAN. For instance, if you have a dog image dataset, you expect the GAN to learn how a dog looks from the dataset. After learning about dogs, the GAN can propose plausible, realistic images of dogs that are different from those in the initial dataset. (They’ll be new images; simply replicating existing images is considered an error for a GAN.)
The dataset is the starting point. You also need two neural networks, each one specializing in a different task and both in competition with each other. One network is called the generator; it takes an arbitrary input (for instance, a sequence of random numbers) and generates an output (for instance, a dog’s image), which is an artifact because it’s artificially created using the generator network. The second network is the discriminator, which must correctly distinguish the products of the generator, the artifacts, from the examples in the training dataset.
When a GAN starts training, both networks try to improve by using backpropagation (explained in the “Understanding the role of backpropagation” section, earlier in this chapter), based on the results of the discriminator. The errors the discriminator makes in distinguishing real images from artifacts propagate back through the discriminator (as with a classification neural network). The discriminator’s correct answers propagate as errors to the generator (because it was unable to make artifacts similar to the images in the dataset, and the discriminator spotted them). Figure 11-6 shows this relationship.
The original analogy Goodfellow chose to explain how a GAN works is that of the art forger and the investigator. The investigator becomes skilled at detecting forged art, but the forger also improves in order to avoid detection by the investigator.
FIGURE 11-6: How a GAN network works, oscillating between generator and discriminator.
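The following is a minimal sketch of that generator/discriminator duel, assuming PyTorch (the chapter doesn’t tie GANs to any framework). To keep the listing short, the “real” data is a set of numbers drawn from a Gaussian distribution rather than dog images, but the two networks and the alternating updates follow the scheme just described.

# A toy GAN in PyTorch: the generator learns to imitate numbers
# drawn from a Gaussian centered on 3.0, guided only by the
# discriminator's feedback.
import torch
from torch import nn

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                              nn.Linear(16, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0     # "real" data: N(3.0, 0.5)
    noise = torch.randn(64, 8)                # arbitrary input for the generator
    fake = generator(noise)                   # the artifacts

    # Train the discriminator: real examples -> 1, artifacts -> 0
    d_opt.zero_grad()
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    d_opt.step()

    # Train the generator: try to fool the discriminator into answering 1
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_loss.backward()
    g_opt.step()

print(generator(torch.randn(5, 8)).detach())  # samples should cluster near 3.0

After enough steps, the generator’s outputs should cluster around 3.0 even though the generator never sees the real samples directly; it learns only from the discriminator’s corrections.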
You may wonder how the generator learns to create the right artifacts if it never sees an original. Only the discriminator sees the original dataset when it tries to distinguish real art from the generator’s artifacts. Even though the generator never examines anything from the original dataset, it receives hints through the work of the discriminator. The hints are slight, and at the beginning the generator proceeds mostly by failed attempts. It’s like learning to paint the Mona Lisa without ever having seen it, with only a friend telling you how close your guess is. The situation is reminiscent of the infinite monkey theorem, with some differences. In that theorem, you expect the monkeys to write Shakespeare’s works by mere luck (see https://tinyurl.com/2t8v5bbr). Here, the generator uses randomness only at the start, and then it’s slowly guided by feedback from the discriminator. With some modifications of this basic idea, GANs have become capable of the following:
· Creating photo-realistic images of objects such as fashion items, as well as interior or industrial design based on a word description (you ask for a yellow and white flower and you get it, as described in this paper: https://tinyurl.com/wu2n8nxn)
· Modifying existing images by applying higher resolution, adding special patterns (for instance, transforming a horse into a zebra: https://tinyurl.com/mbf5rwex), and filling in missing parts (for example, you want to remove a person from a photo, and a GAN replaces the gap with some plausible background, as in this image-completion neural architecture: https://tinyurl.com/3ryvpzy2)
· Supporting many frontier applications, such as generating movement from static photos; creating complex objects such as complete texts (which is called structured prediction because the output is not simply an answer, but rather a set of answers that relate to each other); creating data for supervised machine learning; or even generating powerful cryptography (https://tinyurl.com/yzwhsa8c)
GANs are a deep learning frontier technology, and many new areas of research remain open for their application in AI. If AI ever gains imaginative and creative power, it will probably derive from technologies like GANs. You can get an idea of what’s going on with this technology by reading the pages on GANs from OpenAI, a nonprofit AI research company founded by Greg Brockman, Ilya Sutskever, Elon Musk (PayPal, SpaceX, and Tesla founder), and Sam Altman (https://openai.com/blog/generative-models/).
Pondering reinforcement learning
Deep learning isn’t limited to supervised learning predictions. You also use deep learning for unsupervised learning and reinforcement learning (RL). Unsupervised learning supports a number of established techniques, such as autoencoders and self-organizing maps (SOMs), which can help you segment your data into homogeneous groups or detect anomalies in your variables. Even though scientists are still researching and developing unsupervised learning, reinforcement learning has recently taken the lion’s share of attention, both in academic papers and among practitioners. RL achieves smarter solutions for problems such as parking a car, learning to drive in as little as 20 minutes (as this paper illustrates: https://tinyurl.com/nr5wzvwx), controlling an industrial robot, and more. (This article by Yuxi Li provides a complete list of reinforcement learning applications as of 2019: https://tinyurl.com/e4t2887v.)
RL provides a compact way of learning without gathering large masses of data, but it involves complex interaction with the external world. Because RL begins without any data, it obtains the data it requires by interacting with the external world and receiving feedback. You could use this approach for a robot moving in the physical world or for a bot wandering in a digital one.
In RL, you have an agent (which could be a robot in the real world or a bot in the digital one) interacting with an environment, which could be a virtual or other sort of world with its own rules. The agent receives information from the environment (called the state) and can act on it, sometimes changing it. More important, the agent receives an input from the environment, a positive or negative one, based on its sequence of actions or inactions. This input is called a reward, even when it’s negative. The purpose of RL is to have the agent learn how to behave to maximize the total sum of rewards received during its experience inside the environment.
Understanding how reinforcement learning works
You can determine the relationship between the agent and the environment from Figure 11-7. Note the time subscripts. If you consider the present instant in time as t, the previous instant is t–1. At time t–1, the agent acts and then receives both a state and a reward from the environment. Based on the set of values for the state at time t–1, the action taken at time t–1, and the reward received at time t, an RL algorithm can learn which actions lead to a certain environmental state.
FIGURE 11-7: A schema of how an agent and an environment interact in RL.
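As a concrete (if mindless) example of this loop, here’s a minimal sketch of an agent interacting with an environment, assuming the Gymnasium library and its CartPole environment (an assumption; the chapter doesn’t reference either). The agent acts, then receives a new state and a reward, and accumulates rewards until the episode ends.

# The agent/environment interaction loop with a random agent,
# assuming the gymnasium package is installed.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=42)      # initial state from the environment

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()     # a (random) agent acting
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                 # rewards can be negative in other tasks
    done = terminated or truncated

print("Total reward collected:", total_reward)
env.close()

A real RL algorithm replaces the random choice with a policy that it improves step after step so that the total reward grows.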
Ian Goodfellow, the AI research scientist behind the creation of GANs, believes that better integration between RL and deep learning is among the top priorities for further deep learning advances; better integration leads to smarter robots. Integration is now a hot topic, but RL traditionally had stronger bonds to statistics and algorithms than to neural networks, at least until the Google deep learning research team proved the contrary.
Progressing with Google AI and DeepMind discoveries
At Google DeepMind, a research center in London owned by Google, researchers took a well-known RL technique called Q-learning and made it work with deep learning rather than with the classical computation algorithm. The new variant, named Deep Q-Learning, uses both convolutional and regular dense layers to acquire the problem input and process it. Google used this solution to create a Deep Q-Network (DQN), which has successfully played vintage Atari 2600 games at an expert human level and won (see https://tinyurl.com/t2u3dhf8). The algorithm learned to play in a relatively short time and found clever strategies that only the most skilled game players use.
The idea behind Deep Q-Learning is to approximate the reward an agent can expect after taking a certain action, given the present state of the agent and the environment. In a human sense, the algorithm simply associates states and actions with expected rewards, which it does using a mathematical function. The algorithm, therefore, can’t understand whether it’s playing a particular game; its understanding of the environment is limited to knowledge of the reported states that derive from the actions it takes.
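Here’s a minimal sketch of that idea, assuming PyTorch (this is not DeepMind’s actual DQN code): a small network approximates the expected reward Q(state, action), and each update nudges it toward the target reward + gamma * max Q(next state, action).

# Core of the Deep Q-Learning update on a single transition.
# A full DQN adds replay memory, a target network, and an
# epsilon-greedy policy, all omitted here for brevity.
import torch
from torch import nn

n_state, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(n_state, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_update(state, action, reward, next_state, done):
    """One gradient step toward the Bellman target."""
    q_value = q_net(state)[action]                   # current estimate Q(s, a)
    with torch.no_grad():
        target = reward + gamma * q_net(next_state).max() * (1.0 - done)
    loss = nn.functional.mse_loss(q_value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with made-up numbers for one transition:
s, s2 = torch.randn(n_state), torch.randn(n_state)
q_update(s, action=1, reward=torch.tensor(1.0),
         next_state=s2, done=torch.tensor(0.0))

The network never knows what game it’s playing; it only learns which numbers (expected rewards) go with which state-action pairs, exactly as described above.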
In recent years, the DeepMind team has continued exploring other possible RL solutions for playing Atari games. In particular, the team has tried to understand whether learning a model of the environment (in this case, the rules and characteristics of a specific Atari game) from image inputs could help RL achieve even better results. In collaboration with Google AI and the University of Toronto, it introduced DreamerV2 (you can read more about it here: https://tinyurl.com/3jjkdwne), an RL agent that achieves human-level performance on Atari games by building a world model of the game itself from images of the game. In simple terms, DreamerV2 takes hints from the provided images of the game and, on its own, figures out object positions, trajectories, effects, and so on. Whereas DQN mapped actions to rewards, this RL agent goes beyond that task and re-creates an internal representation of the game in order to understand it even better. This is similar to what humans do when they develop internal images and ideas of the external world.
The dream of scientists is to create a general RL agent that can approach different problems and solve them, in the same spontaneous way as humans do. Recently, though, the most astonishing results have again occurred with task-specific problems that don’t transfer easily to other situations.
Clear examples are the AIs built to beat humans at games such as chess and Go. Chess and Go are both popular board games that share characteristics, such as being played by two players who move in turns, with no random element (no dice are thrown, as in backgammon). Apart from that, they have different game rules and complexity. In chess, each player has 16 pieces to move on the board according to type, and the game ends when the king is checkmated or stalemated, unable to move further. Experts calculate that about 10^123 different chess games are possible, which is a large number when you consider that scientists estimate the number of atoms in the known universe at about 10^80. Yet computers can master chess by determining the possible future moves far enough ahead to have an advantage against any human opponent. In 1997, Deep Blue, an IBM supercomputer designed for playing chess, defeated Garry Kasparov, the world chess champion.
A computer cannot calculate a complete game of chess using brute force (evaluating every possible move from the beginning to the end of the game). It uses heuristics and its ability to look a certain number of moves ahead. Deep Blue was a computer with high computational performance that could anticipate more future moves in the game than any previous computer.
In Go, you have a 19-x-19 grid of lines containing 361 spots on which each player places a stone (black for one player, white for the other) each time a player takes a turn. The purpose of the game is to enclose a larger portion of the board with one’s stones than one’s opponent does. Considering that, on average, each player has about 250 possible moves at each turn, and that a game consists of about 150 moves, a computer would need enough memory to hold 250^150 games, which is on the order of 10^360 board positions. From a resource perspective, Go is more complex than chess, and experts used to believe that no computer software would be able to beat a human Go master within the next decade using the same approach as Deep Blue. Yet a computer system called AlphaGo accomplished it using RL techniques.
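The estimate in the previous paragraph is easy to verify with a couple of lines of Python, using the rough averages cited in the text (about 250 moves per turn over about 150 turns):

# Back-of-the-envelope check of the Go search-space estimate
import math

exponent = 150 * math.log10(250)              # exponent of 250**150 in base 10
print(f"250^150 is about 10^{exponent:.0f}")  # prints roughly 10^360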
DeepMind developed AlphaGo, which in 2016 featured Go-playing skills never attained before by any hardware and software solution. After setting up the system, DeepMind had AlphaGo test itself against the strongest Go player living in Europe, Fan Hui, who had been the European Go champion three times. DeepMind challenged him in a closed-door match, and AlphaGo won all the games, leaving Fan Hui amazed by the style of play displayed by the computer.
Then, after Fan Hui helped refine AlphaGo’s skills, the DeepMind team, led by CEO Demis Hassabis and chief scientist David Silver, challenged Lee Sedol, a South Korean professional Go player ranked at the ninth dan, the highest level a master can attain. AlphaGo won four games in the series against Lee Sedol and lost only one. Apart from the game it lost because of an unexpected move by the champion, it led the other games and amazed the champion by playing unexpected, impactful moves. In fact, both players, Fan Hui and Lee Sedol, felt that playing against AlphaGo was like playing against a contestant from another reality: AlphaGo’s moves resembled nothing they had seen before.
The story behind AlphaGo is so fascinating that someone made a film out of it named AlphaGo. It’s well worth seeing: https://tinyurl.com/58hvfs79.
The DeepMind team that created AlphaGo didn’t stop after the success of its solution; it soon retired AlphaGo and created even more incredible systems. First, the team built AlphaGo Zero, which is AlphaGo trained by playing against itself. Then it created Alpha Zero, a general program that can learn to play chess and shogi, the Japanese chess game, by itself. If AlphaGo demonstrated how to solve a problem deemed impossible for computers, AlphaGo Zero demonstrated that computers can attain super-capabilities using self-learning (which is RL in its essence). In the end, its results were even better than those obtained by starting from human experience: AlphaGo Zero challenged the retired AlphaGo and won 100 games without losing one. Finally, the DeepMind team perfected Alpha Zero by further developing it into MuZero (see https://tinyurl.com/bephn4e8), an AI algorithm that matches Alpha Zero’s results in chess and shogi, improves on them in Go (thus setting a new world standard), and extends to Atari games.
Alpha Zero managed to reach the pinnacle of performance starting with zero data. This capability goes beyond the idea that data is needed to achieve every AI target (as Alon Halevy, Peter Norvig, and Fernando Pereira stated just a few years ago in the white paper at https://tinyurl.com/4et9hktx). Alpha Zero is possible because the generative process behind the game, its rules, is fully known, which allowed DeepMind researchers to re-create a perfect Go environment in which the algorithm could generate its own data by playing against itself.