An introduction to Policy Gradients with Cartpole and Doom

This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.

In the last two articles about Q-learning and Deep Q-learning, we worked with value-based reinforcement learning algorithms. To choose which action to take given a state, we take the action with the highest Q-value (the maximum expected future reward in each state). As a consequence, in value-based learning, a policy exists only because of these action-value estimates.

Today, we'll learn a policy-based reinforcement learning technique called Policy Gradients.

We'll implement two agents. The first learns to keep the pole in balance.

The second will be an agent that learns to survive in a hostile Doom environment by collecting health.

In policy-based methods, instead of learning a value function that tells us what the expected sum of rewards is given a state and an action, we learn directly the policy function that maps state to action (we select actions without using a value function).

It means that we directly try to optimize our policy function π without worrying about a value function. We'll directly parameterize π (select an action without a value function).

Sure, we can use a value function to optimize the policy parameters. But the value function will not be used to select an action.
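To make the contrast concrete, in the usual notation:

    \text{value-based: } a = \arg\max_{a'} Q(s, a')
    \qquad\qquad
    \text{policy-based: } a \sim \pi_\theta(a \mid s)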

In this article, you'll learn:

  • What Policy Gradients are, and their advantages and disadvantages
  • How to implement them in Tensorflow.

Why use policy-based methods?

Two types of policy

A policy can be either deterministic or stochastic.

A deterministic policy is a policy that maps states to actions. You give it a state and the function returns an action to take.

Deterministic policies are used in deterministic environments. These are environments where the actions taken determine the outcome. There is no uncertainty. For instance, when you play chess and you move your pawn from A2 to A3, you're sure that your pawn will move to A3.

On the other hand, a stochastic policy outputs a probability distribution over actions.

It means that instead of being sure of taking action a (for instance, left), there is a probability we'll take a different one (in this case, a 30% chance that we take south).

Stochastic policies are used when the environment is uncertain. We call this process a Partially Observable Markov Decision Process (POMDP).
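Here is a tiny illustration of the difference in Python, assuming we already have a vector of scores for the available actions (all names and numbers below are purely illustrative):

    import numpy as np

    action_scores = np.array([2.0, 1.0, 0.5])   # e.g. scores for [left, right, south]

    # Deterministic policy: always pick the single best-scoring action.
    deterministic_action = int(np.argmax(action_scores))

    # Stochastic policy: turn the scores into probabilities and sample from them.
    probs = np.exp(action_scores) / np.exp(action_scores).sum()
    stochastic_action = int(np.random.choice(len(probs), p=probs))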

Most of the time we'll use this second type of policy.

Advantages

But Deep Q Learning is really great! Why use policy-based reinforcement learning methods?

There are three main advantages in using Policy Gradients.

Convergence

For one thing, policy-based methods have better convergence properties.

The problem with value-based methods is that they can oscillate a lot during training. This is because the choice of action may change dramatically for an arbitrarily small change in the estimated action values.

On the other hand, with policy gradients we just follow the gradient to find the best parameters. We see a smooth update of our policy at each step.

Because we follow the gradient to find the best parameters, we're guaranteed to converge on a local maximum (worst case) or a global maximum (best case).

Policy gradients are more effective in high-dimensional action spaces

The second advantage is that policy gradients are more effective in high-dimensional action spaces, or when using continuous actions.

The problem with Deep Q-learning is that its predictions assign a score (the maximum expected future reward) to each possible action, at each time step, given the current state.

But what if we have an infinite number of possible actions?

For instance, with a self-driving car, in each state you have a (near) infinite choice of actions (turning the wheel by 15°, 17.2°, 19.4°, honking…). We would need to output a Q-value for each possible action!

On the other hand, in policy-based methods you adjust the parameters directly: thanks to that, you start to understand what the maximum will be, rather than computing (estimating) the maximum directly at every step.
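For instance, a common way to handle a continuous action such as a steering angle is to have the policy output the parameters of a distribution and sample from it, instead of scoring every possible angle. This is only a sketch under that assumption (the notebooks below use discrete actions instead), and every name here is illustrative:

    import numpy as np

    def gaussian_policy(state, theta_mean, theta_log_std):
        """Sample a continuous action (e.g. a steering angle) from a Gaussian
        whose mean and log standard deviation are linear functions of the state."""
        mean = state @ theta_mean                # predicted angle
        std = np.exp(state @ theta_log_std)      # always positive
        return np.random.normal(mean, std)       # one sample, no per-action Q-values needed

    # Illustrative usage with made-up dimensions:
    state = np.random.randn(4)
    angle = gaussian_policy(state, theta_mean=np.random.randn(4), theta_log_std=np.zeros(4))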

Policy gradients can learn stochastic policies

A third advantage is that policy gradients can learn a stochastic policy, while value functions can't. This has two consequences.

One of these is that we don't need to implement an exploration/exploitation trade-off by hand. A stochastic policy allows our agent to explore the state space without always taking the same action. This is because it outputs a probability distribution over actions. As a consequence, it handles the exploration/exploitation trade-off without hard coding it.

We also get rid of the problem of perceptual aliasing. Perceptual aliasing is when we have two states that seem to be (or actually are) the same, but need different actions.

Let's take an example. We have an intelligent vacuum cleaner, and its goal is to suck up the dust and avoid killing the hamsters.

Our vacuum cleaner can only perceive where the walls are.

The problem: the two red cases are aliased states, because the agent perceives an upper and a lower wall in both of them.

Under a deterministic policy, the agent will either always move right when in a red state, or always move left. Either way, our agent will get stuck and never suck up the dust.

Under a value-based RL algorithm, we learn a quasi-deterministic policy ("epsilon greedy strategy"). As a consequence, our agent can spend a lot of time before finding the dust.

On the other hand, an optimal stochastic policy will randomly move left or right in these aliased states. As a consequence, it will not get stuck and will reach the goal state with high probability.

Disadvantages

Naturally, Policy gradients have one big disadvantage. A lot of the time, they converge on a local maximum rather than on the global optimum.

Unlike Deep Q-Learning, which always tries to reach the maximum, policy gradients converge more slowly, step by step. They can take longer to train.

However, we’ll see there are solutions to this problem.

Policy Search

We have our policy π that has a parameter θ. This π outputs a probability distribution of actions.

Awesome! But how do we know if our policy is good?

Remember that policy search can be seen as an optimization problem. We must find the best parameters (θ) to maximize a score function, J(θ).

There are two steps:

  • Measure the quality of a π (policy) with a policy score function J(θ)
  • Use policy gradient ascent to find the best parameter θ that improves our π.

The main idea here is that J(θ) will tell us how good our π is. Policy gradient ascent will help us find the best policy parameters, so that we sample good actions more often.

First Step: the Policy Score function J(θ)

To measure how good our policy is, we use a function called the objective function (or Policy Score Function) that calculates the expected reward of the policy.

Three methods work equally well for optimizing policies. The choice depends only on the environment and the objectives you have.

First, in an episodic environment, we can use the start value. Calculate the mean of the return from the first time step (G1). This is the cumulative discounted reward for the entire episode.

The idea is simple. If I always start in some state s1, what’s the total reward I’ll get from that start state until the end?

We want to find the policy that maximizes G1, because it will be the optimal policy. This is due to the reward hypothesis explained in the first article.

For instance, in Breakout, I play a new game, but I lose the ball after destroying 20 bricks (end of the game). New episodes always begin at the same state.

I calculate the score using J1(θ). Hitting 20 bricks is good, but I want to improve the score. To do that, I’ll need to improve the probability distributions of my actions by tuning the parameters. This happens in step 2.

In a continuous environment, we can use the average value, because we can’t rely on a specific start state.

Each state value is now weighted (because some states happen more often than others) by the probability of occurrence of the respective state.

Third, we can use the average reward per time step. The idea here is that we want to get the most reward per time step.
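For reference, these three objectives are commonly written as follows (standard formulations; d^{\pi_\theta}(s) denotes how often state s is visited under the policy):

    J_1(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ G_1 \right] = \mathbb{E}_{\pi_\theta}\!\left[ R_1 + \gamma R_2 + \gamma^2 R_3 + \dots \right]   % start value

    J_{avgV}(\theta) = \sum_s d^{\pi_\theta}(s) \, V^{\pi_\theta}(s)   % average value

    J_{avgR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s) \, \mathcal{R}(s, a)   % average reward per time step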

Second step: Policy gradient ascent

We have a Policy score function that tells us how good our policy is. Now, we want to find a parameter θ that maximizes this score function. Maximizing the score function means finding the optimal policy.

To maximize the score function J(θ), we need to do gradient ascent on policy parameters.

Gradient ascent is the inverse of gradient descent. Remember that the gradient always points in the direction of the steepest change.

In gradient descent, we take the direction of the steepest decrease in the function. In gradient ascent we take the direction of the steepest increase of the function.

Why gradient ascent and not gradient descent? Because we use gradient descent when we have an error function that we want to minimize.

But, the score function is not an error function! It’s a score function, and because we want to maximize the score, we need gradient ascent.

The idea is to compute the gradient of the score with respect to the current policy π's parameters, update the parameters in the direction of the greatest increase, and iterate.
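In symbols, each gradient ascent step is (with α the learning rate):

    \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)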

Okay, now let’s implement that mathematically. This part is a bit hard, but it’s fundamental to understand how we arrive at our gradient formula.

We want to find the best parameters θ* that maximize the score:
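    \theta^{*} = \arg\max_{\theta} \, J(\theta)   % the parameters of the optimal policy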

Our score function can be defined as:
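    J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t} \gamma^{t} r_{t} \right]   % expected cumulative discounted reward under \pi_\theta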

This is the total sum of expected rewards, given the policy.

Now, because we want to do gradient ascent, we need to differentiate our score function J(θ).

Our score function J(θ) can be also defined as:
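    J(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s) \, \mathcal{R}(s, a)   % d^{\pi_\theta}: distribution of states visited under the policy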

We wrote the function in this way to show the problem we face here.

We know that policy parameters change how actions are chosen, and as a consequence, what rewards we get and which states we will see and how often.

So, it can be challenging to change the policy in a way that ensures improvement. This is because the performance depends both on the action selections and on the distribution of states in which those selections are made.

Both of these are affected by policy parameters. The effect of policy parameters on the actions is simple to find, but how do we find the effect of policy on the state distribution? The function of the environment is unknown.

As a consequence, we face a problem: how do we estimate the gradient ∇ with respect to the policy parameters θ, when the gradient depends on the unknown effect of policy changes on the state distribution?

The solution is to use the Policy Gradient Theorem. It provides an analytic expression for the gradient ∇ of J(θ) (the performance) with respect to the policy parameters θ that does not involve the differentiation of the state distribution.

So let’s calculate:
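    \nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau} \pi(\tau ; \theta) \, R(\tau)   % J written as a sum over trajectories \tau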

Remember, we’re in a situation of stochastic policy. This means that our policy outputs a probability distribution π(τ ; θ). It outputs the probability of taking this series of steps (s0, a0, r0…), given our current parameters θ.

But, differentiating a probability function is hard, unless we can transform it into a logarithm. This makes it much simpler to differentiate.

Here we’ll use the likelihood ratio trick, which turns the resulting fraction into a log probability:
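    \nabla_\theta J(\theta) = \sum_{\tau} \pi(\tau ; \theta) \, \frac{\nabla_\theta \pi(\tau ; \theta)}{\pi(\tau ; \theta)} \, R(\tau) = \sum_{\tau} \pi(\tau ; \theta) \, \nabla_\theta \log \pi(\tau ; \theta) \, R(\tau)   % multiply and divide by \pi(\tau;\theta), then use \nabla \log x = \nabla x / x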

Now let’s convert the summation back to an expectation:
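    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi(\tau ; \theta) \, R(\tau) \right]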

As you can see, we only need to compute the derivative of the log policy function.

Now that we’ve done that, and it was a lot, we can conclude about policy gradients:
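    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi(a \mid s ; \theta) \, R(\tau) \right]   % the standard (REINFORCE) form of the policy gradient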

This policy gradient tells us how we should shift the policy distribution, by changing the parameters θ, if we want to achieve a higher score.

R(tau) is like a scalar value score:

  • If R(tau) is high, it means that on average we took actions that lead to high rewards. We want to push up the probabilities of the actions seen (increase the probability of taking these actions).
  • On the other hand, if R(tau) is low, we want to push down the probabilities of the actions seen.

This policy gradient causes the parameters to move most in the direction that favors the actions that have the highest return.

Monte Carlo Policy Gradients

In our notebook, we’ll use this approach to design the policy gradient algorithm. We use Monte Carlo because our tasks can be divided into episodes.

Initialize θ
for each episode τ = (S0, A0, R1, S1, …, ST):
    for t ← 1 to T-1:
        Δθ = α ∇θ (log π(St, At, θ)) Gt
        θ = θ + Δθ
In other words, for each episode, at each time step within that episode, we: compute the log probabilities produced by our policy function, multiply them by the score (the discounted return Gt), and update the weights.
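To make this concrete, here is a minimal, self-contained REINFORCE sketch in plain numpy with a linear softmax policy. It is only an illustration of the update above, not the TensorFlow code from the notebooks; the environment is assumed to follow the classic gym-style reset()/step() interface, and every name below is illustrative.

    import numpy as np

    def softmax(x):
        z = x - np.max(x)              # for numerical stability
        e = np.exp(z)
        return e / e.sum()

    class LinearSoftmaxPolicy:
        """pi(a | s, theta): linear scores per action, turned into probabilities."""
        def __init__(self, n_features, n_actions, lr=0.01, gamma=0.95):
            self.theta = np.zeros((n_features, n_actions))
            self.lr = lr
            self.gamma = gamma

        def action_probs(self, state):
            return softmax(state @ self.theta)

        def act(self, state):
            probs = self.action_probs(state)
            return np.random.choice(len(probs), p=probs)

        def update(self, states, actions, rewards):
            # Monte Carlo return G_t for every time step of the finished episode.
            returns = np.zeros(len(rewards))
            running = 0.0
            for t in reversed(range(len(rewards))):
                running = rewards[t] + self.gamma * running
                returns[t] = running
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # optional variance reduction

            # Gradient ascent: theta += alpha * grad(log pi(a_t | s_t, theta)) * G_t
            for s, a, g in zip(states, actions, returns):
                probs = self.action_probs(s)
                grad_log = -np.outer(s, probs)     # d log softmax / d theta over all actions
                grad_log[:, a] += s                # plus the term for the action actually taken
                self.theta += self.lr * grad_log * g

    def train(env, policy, n_episodes=1000):
        for _ in range(n_episodes):
            states, actions, rewards = [], [], []
            state = env.reset()                    # older gym API assumed
            done = False
            while not done:
                action = policy.act(state)
                next_state, reward, done, _ = env.step(action)
                states.append(state); actions.append(action); rewards.append(reward)
                state = next_state
            policy.update(states, actions, rewards)   # one update per episode (Monte Carlo)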

But we face a problem with this algorithm. Because we only calculate R at the end of the episode, we average over all the actions. Even if some of the actions taken were very bad, if our total score is quite high, we will still consider all the actions as good.

So to have a correct policy, we need a lot of samples… which results in slow learning.

How to improve our Model?

We’ll see in the next articles some improvements:

  • Actor Critic: a hybrid between value-based algorithms and policy-based algorithms.
  • Proximal Policy Gradients: ensures that the deviation from the previous policy stays relatively small.

Let's implement it with Cartpole and Doom

We made a video where we implement a Policy Gradient agent with Tensorflow that learns to play Doom in a Deathmatch environment.

You can access the notebooks directly in the Deep Reinforcement Learning Course repo.

Cartpole:

Doom:

That's all! You've just created an agent that learns to survive in a Doom environment. Awesome!

Don't forget to implement each part of the code by yourself. It's really important to try to modify the code I gave you. Try adding epochs, changing the architecture, changing the learning rate, using a harder environment… and so on. Have fun!

In the next article, I'll discuss the last improvements in Deep Q-learning:

  • Fixed Q-values
  • Prioritized Experience Replay
  • Double DQN
  • Dueling Networks

If you liked my article, please click the 👏 below as many times as you liked the article so other people will see this here on Medium. And don’t forget to follow me!

If you have any thoughts, comments, questions, feel free to comment below or send me an email: [email protected], or tweet me @ThomasSimonini.

Keep Learning, Stay awesome!

Deep Reinforcement Learning Course with Tensorflow

Syllabus

Video version

Part 1: An introduction to Reinforcement Learning

Part 2: Diving deeper into Reinforcement Learning with Q-Learning

Part 3: An introduction to Deep Q-Learning: let’s play Doom

Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

Part 4: An introduction to Policy Gradients with Doom and Cartpole

Part 5: An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!

Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3

Part 7: Curiosity-Driven Learning made easy Part I