Reinforcement learning. Deep learning

What is reinforcement learning?

Reinforcement learning is another class of learning, which I find to actually be the “real learning” (at least compared to supervised learning). Indeed, this type of learning uses rules without “supervision” and without data, i.e. without a database previously entered into the system and from which the computer “learns”, as is the case in supervised learning. Here, the computer “learns” by starting to “play” on its own, based on previously defined rules (for example, how we are allowed to move the pieces in chess), starting from random moves, simply by trying out possibilities and rating them depending on their outcome (e.g. win/lose, i.e. a trial-and-error strategy). Actually, it is a bit more complicated than that: behind these rules, there are algorithms written by human beings (actually computer freaks! ;-p).

These algorithms program the computer to evaluate, at each move, the possibilities for the next move and, for each possibility, to estimate its probability of leading to a win. After this evaluation, the computer plays the move with the highest estimated probability. Some programs, such as AlphaGo, can plan (i.e., are programmed to plan) 50 to 60 moves ahead. Even with a simple computer program! The evaluation procedure just described is called a “simulation”. We could compare it to mental thinking in humans, for example when a chess player goes through the same type of evaluation as a computer, trying to plan ahead what the best next move would be; however, probably not 50 to 60 moves ahead… !!
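
To make the idea of a “simulation” a bit more concrete, here is a minimal sketch in Python. Please take it as a toy under my own assumptions, not as what AlphaGo actually does: the game is a tiny version of Nim (players alternately take 1, 2 or 3 stones, and whoever takes the last stone wins), and the names `legal_moves`, `random_playout`, `win_probabilities` and `best_move` are made up for this illustration. For each possible move, the program plays many random games to the end and counts how often it wins; it then picks the move with the highest estimated win probability.

```python
import random


def legal_moves(stones):
    # Toy rules (an assumption for this sketch): take 1, 2 or 3 stones.
    return [m for m in (1, 2, 3) if m <= stones]


def random_playout(stones, my_turn, rng):
    """Play random moves until the pile is empty.

    Returns True if "we" take the last stone (a win), False otherwise.
    """
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return my_turn  # whoever just moved took the last stone
        my_turn = not my_turn


def win_probabilities(stones, n_playouts=500, seed=0):
    """Estimate, for each legal move, the probability that it leads to a win."""
    rng = random.Random(seed)
    probs = {}
    for move in legal_moves(stones):
        remaining = stones - move
        if remaining == 0:
            probs[move] = 1.0  # taking the last stone wins immediately
            continue
        # After our move it is the opponent's turn, hence my_turn=False.
        wins = sum(random_playout(remaining, False, rng) for _ in range(n_playouts))
        probs[move] = wins / n_playouts
    return probs


def best_move(stones, n_playouts=500, seed=0):
    """Pick the move with the highest estimated win probability."""
    probs = win_probabilities(stones, n_playouts, seed)
    return max(probs, key=probs.get)


if __name__ == "__main__":
    print(win_probabilities(10))
    print(best_move(10))
```

Real programs are far more selective than this: they do not play purely random games, and they use a neural network to decide which moves are worth simulating at all.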


In the case of a computer, the system learns whether a randomly chosen trial leads to a win or to a loss and, by self-play, reinforces itself accordingly so as to keep only the “win”, or let’s say optimal, possibilities. To achieve such performance, a self-play reinforcement learning algorithm, on which neural networks (defined below) are trained, is necessary.
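
Here is a small sketch of this “learning by self-play” idea in Python. It is only a toy under my own assumptions: the game is a mini Nim (take 1, 2 or 3 stones; whoever takes the last stone wins), a simple table of numbers stands in for the neural network, and the names `self_play_training` and `greedy_move` as well as the parameter values are invented for the example. The program plays against itself, sometimes trying random moves, and after each game it pushes the value of the winner's moves up and the value of the loser's moves down.

```python
import random
from collections import defaultdict


def self_play_training(start_stones=10, episodes=5000, explore=0.2, seed=0):
    """Learn move values for a toy Nim game purely from self-play."""
    rng = random.Random(seed)
    value = defaultdict(float)  # (stones, move) -> average outcome in [-1, 1]
    visits = defaultdict(int)   # (stones, move) -> number of times tried

    for _ in range(episodes):
        stones = start_stones
        history = [[], []]      # moves made by player 0 and by player 1
        player = 0
        while stones > 0:
            moves = [m for m in (1, 2, 3) if m <= stones]
            if rng.random() < explore:
                move = rng.choice(moves)  # trial: try a random move
            else:
                move = max(moves, key=lambda m: value[(stones, m)])
            history[player].append((stones, move))
            stones -= move
            winner = player     # the last player to move took the last stone
            player = 1 - player

        # Reinforce: winning moves drift towards +1, losing moves towards -1.
        for p in (0, 1):
            outcome = 1.0 if p == winner else -1.0
            for key in history[p]:
                visits[key] += 1
                value[key] += (outcome - value[key]) / visits[key]  # running mean
    return value


def greedy_move(value, stones):
    """Return the move the trained table currently rates highest."""
    moves = [m for m in (1, 2, 3) if m <= stones]
    return max(moves, key=lambda m: value[(stones, m)])


if __name__ == "__main__":
    table = self_play_training()
    for pile in range(1, 11):
        print(pile, greedy_move(table, pile))
```

Nobody tells the program which moves are good; the only feedback it ever receives is who won each game.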

In the case of AlphaGoZero, the self-play reinforcement learning algorithm consists of a neural network, updated as it plays, which guesses (predicts) the next moves, and of a powerful search algorithm which outputs, for each possible move, the probability π of playing that move. (AlphaGo is the famous program developed by the British company Google DeepMind to compete with human players at the game of Go; AlphaGoZero is its latest version, playing with no human input except the basic rules of the game of Go.)

This search algorithm is called “Monte Carlo Tree Search” (MCTS for short), but I won’t go into any details here. I think this is already complicated enough, and I hope I have made it simple for you to follow!

The overall motivation? Cumulative (more and more!) reward! For humans, it is usually money or at least anything to win/achieve at the end. For a computer, it is – for now – defined as the goal of the program/algorithm which drives it to “win” (i.e., achieve the goal it has been programmed for!).
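
As a tiny illustration of what “cumulative reward” means in practice, here is a sketch in Python. The function name `cumulative_reward` and the `discount` parameter are my own additions for the example: a discount below 1 makes rewards that come later count a little less, which is a common convention in reinforcement learning but is not something discussed in this post.

```python
def cumulative_reward(rewards, discount=1.0):
    """Add up a sequence of rewards, optionally discounting later ones.

    With discount=1.0 this is a plain sum; with discount < 1 a reward
    received t steps in the future is multiplied by discount ** t.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


if __name__ == "__main__":
    # A game where the only reward is a win (+1) at the very end.
    print(cumulative_reward([0, 0, 1]))
    # The same three rewards of 1, with later ones counting half as much each step.
    print(cumulative_reward([1, 1, 1], discount=0.5))
```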

What is “Deep Learning”?


Deep Learning is a set of methods brought together to create automated learning. In brief, Deep Learning, a sub-field of Machine Learning, encompasses what we call “convolutional neural networks” and “recurrent neural networks”. Convolutional neural networks are used to scan an image, and are thus commonly used in the branch of computer vision called object recognition. These networks are able to discriminate objects within a scene. Recurrent neural networks are used to remember the past, i.e. the history of experiences and actions, by having access to previously acquired and stored knowledge. These deep neural networks, grouped in the category called Deep Learning, are the closest thing so far to the human brain, trying to reproduce our cognitive abilities in a machine.
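
To give a feeling for what “scanning an image” means, here is a minimal sketch in Python of the basic operation inside a convolutional neural network: a small grid of numbers (a “kernel”) slides over the image and produces a high value wherever the image matches its pattern. The function name `scan_image` and the tiny example image are invented for the illustration, and the kernel is written by hand here, whereas a real network learns its kernel values during training.

```python
def scan_image(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) and return the response map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


# A 3x4 "image": dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# A hand-written kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

if __name__ == "__main__":
    for row in scan_image(image, vertical_edge):
        print(row)  # prints [0, 2, 0] twice: the edge sits in the middle
```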

Some more concrete examples?

Back to AlphaGo: the earlier versions, AlphaGoFan and AlphaGoLee (the version which beat Lee Sedol in 2016), combine supervised learning on games played by human experts with reinforcement learning by self-play. The last version, AlphaGoZero, as mentioned above, only uses reinforcement learning. It also uses one single neural network, whereas its predecessors used separate ones. AlphaGoZero was thus able to “learn on its own”; on that account, we can qualify the system/the machine as being “intelligent” (hence the term “Artificial Intelligence” (AI)). But not yet as intelligent as us, humans… (see below)

Another example, or branch, of unsupervised machine learning algorithms is called “generative adversarial networks” (GANs for short). You do not have to remember the full name. I know it sounds pretty scary (complex) at first !! “GANs” is easier to remember and I find it much cooler !! This specific type of network, the GAN, has been around since 2014 and was invented by Ian Goodfellow and colleagues. I mention it because, so far, it seems to be the most powerful method in machine learning. The concept of GANs is to train 2 neural networks to compete with each other so that each one reinforces the other. This way, they can mutually train themselves and increase their performance without human intervention.


These networks have been extensively applied in the field of computer vision, notably to generate images. For example, the “StackGANs” (for Stacked GANs), an improved version of GANs, generate high-resolution images with what the authors called “photo-realistic details” (I think the term speaks for itself), a technical feat in the field of computer vision. The concept behind these GANs is pretty simple (at least I find it pretty simple !! I hope you will too !! I’ll try to describe it in an easy way and then will give you another example to make the analogy with a real-life case): one neural network (NN), let’s call it NN1, generates images; a second NN, let’s call it NN2, within the same system, decides whether an image it is shown is real or was generated by NN1 (fake). Based on the feedback received from NN2, NN1 improves its performance, i.e. is able to generate better and better images. Consequently, NN2 also has to get better at deciding whether images are real or fake, since it becomes harder and harder to discriminate between real and fake images as NN1 becomes better and better at generating high-quality images. This creates a feedback loop of continuous improvement without human intervention.
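
For the curious, the NN1/NN2 loop can be sketched in a few lines of Python. This is a deliberately tiny toy under my own assumptions, nothing like a real StackGAN: the “images” are single numbers, the real ones are drawn around 4.0, NN1 (the generator) has one parameter `b` that shifts random noise, and NN2 (the discriminator) is a two-parameter formula giving the probability that a number is real. The names `ToyGAN` and `train`, the learning rate and the other values are invented for the example.

```python
import math
import random


def sigmoid(t):
    # Numerically safe logistic function, returns a value in (0, 1).
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


class ToyGAN:
    """NN1 generates numbers b + noise; NN2 rates them with sigmoid(w * x + c)."""

    def __init__(self, lr=0.05):
        self.b = 0.0  # NN1 (generator) parameter
        self.w = 0.0  # NN2 (discriminator) weight
        self.c = 0.0  # NN2 (discriminator) bias
        self.lr = lr

    def generate(self, rng, n):
        return [self.b + rng.gauss(0.0, 1.0) for _ in range(n)]

    def discriminate(self, x):
        """Probability, according to NN2, that x is real."""
        return sigmoid(self.w * x + self.c)

    def train_discriminator(self, real, fake):
        # One gradient step on -log D(real) - log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x in real:
            err = self.discriminate(x) - 1.0  # NN2 should answer 1 for real
            grad_w += err * x
            grad_c += err
        for x in fake:
            err = self.discriminate(x)        # NN2 should answer 0 for fake
            grad_w += err * x
            grad_c += err
        n = len(real) + len(fake)
        self.w -= self.lr * grad_w / n
        self.c -= self.lr * grad_c / n

    def train_generator(self, fake):
        # One gradient step on -log D(fake): NN1 tries to fool NN2.
        grad_b = sum(-(1.0 - self.discriminate(x)) * self.w for x in fake) / len(fake)
        self.b -= self.lr * grad_b


def train(steps=2000, real_mean=4.0, batch=32, seed=0):
    rng = random.Random(seed)
    gan = ToyGAN()
    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        gan.train_discriminator(real, gan.generate(rng, batch))  # NN2's turn
        gan.train_generator(gan.generate(rng, batch))            # NN1's turn
    return gan


if __name__ == "__main__":
    gan = train()
    print("generator shift b:", gan.b)
```

The two training steps simply alternate: NN2 learns to tell real from fake, then NN1 uses NN2's current judgement to shift its output towards what NN2 considers real. Be aware that this kind of training is known to be delicate and can oscillate instead of settling, so do not expect a toy like this to land exactly on the real value.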

Let me now give you another real-life example to make the analogy with what I just explained. Recall this famous and fun board game you have probably played one day with your friends or family during long winter evenings, ideally close to a fireplace !! Well, the game is called Amnesia. Basically, you have a card “gently glued” onto your forehead which indicates to everybody but you who you are in the game (i.e. which celebrity name is written on the card; it could be Cleopatra, Charlie Chaplin, Marilyn Monroe, Elvis Presley, Ben-Hur, Marlon Brando, Lady Gaga, and many more; you get the idea… !!). This game is played in teams. The goal of your team, in order to win, is to succeed in making you guess the celebrity you are in the game by miming (i.e. imitating with gestures, without verbal tips).


Following the same principle as the NN2 network previously, which needed to guess whether images were real or fake, you will have to guess who you are (i.e. which celebrity name is currently “gently glued” onto your forehead) based on your friend’s mimes/gestures. If you cannot guess, your friend (NN1 in the analogy) will try to get better at miming so that you (NN2 in the analogy) can better guess who she is imitating. If she does improve, you will guess better and thus improve your performance too. This way you will both reinforce each other (i.e. you will both improve your performance based on each other’s behavior). Well, that’s exactly the same concept as in the previous example with the two NNs, NN1 and NN2.

Coming back now to the field of image recognition, Google was able to train machines which are now better than humans at finding and discriminating objects in a scene, with an error rate of about 3% compared to about 5% for humans. This was achieved thanks to improved automated learning methods (called Deep Learning, cf. the next post) and increased computational power. Therefore, in the field of image recognition, machines recently became better than humans. But no worries, we are still far from considering them as “intelligent” as us… !!


Last but not least, and for your information, healthcare has become a growing field benefiting a lot from AI, with several projects already funded by the EU. One of them is the MURAB project (MURAB for “MRI and Ultrasound Robotic Assisted Biopsy”), set up to use AI to better diagnose cancers and other diseases.

Here you have reached the end of this post. Hope you enjoyed it. Do not hesitate to share any comment or ask any question! To read more about “What’s really behind AI?” you can click here.
