Dota 2 with Large Scale Deep Reinforcement Learning

Abstract

On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all of which will become increasingly central to more capable AI systems.

OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.

Team OG and the OpenAI dev team

Relative to previous AI milestones like Chess or Go, complex video games start to capture the
complexity and continuous nature of the real world.

The key ingredient in solving this complex environment was scaling existing reinforcement learning systems to unprecedented levels, utilizing thousands of GPUs over multiple months. To do this, OpenAI built a distributed training system, which was used to train OpenAI Five.

Training System

Playing Dota using AI

The Dota 2 engine runs at 30 frames per second. OpenAI Five acts only on every 4th frame, which is called a timestep. At each timestep, OpenAI Five receives an observation from the game engine encoding all the information a human player would see, such as units' health, position, etc. OpenAI Five then returns a discrete action to the game engine, encoding a desired movement, attack, etc.
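As a rough illustration of this loop, the Python sketch below acts once every fourth engine frame. It is not OpenAI's code: the environment wrapper and its methods (get_observation, submit_action, advance_one_frame) are hypothetical stand-ins for the game engine interface.

```python
# Minimal sketch of an agent that acts on every 4th frame (one "timestep").
# The env object and its methods are hypothetical stand-ins for the bot API.
FRAMES_PER_TIMESTEP = 4   # the engine runs at 30 fps, so the agent acts ~7.5 times/sec

def play_episode(env, policy, hidden_state):
    frame, done = 0, False
    while not done:
        if frame % FRAMES_PER_TIMESTEP == 0:
            obs = env.get_observation()                   # units' health, positions, ...
            action, hidden_state = policy.act(obs, hidden_state)
            env.submit_action(action)                     # desired movement, attack, ...
        done = env.advance_one_frame()
        frame += 1
```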

Some properties of the environment were randomized during training, including the heroes in the game and which items the heroes purchased. Sufficiently diverse training games are necessary to ensure robustness to the wide variety of strategies and situations that arise in games against human opponents.
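A minimal sketch of what such per-game randomisation could look like is shown below; the hero names, item sets, and sampling scheme are placeholders for illustration only, and the real training randomisation covers many more properties.

```python
import random

# Illustrative only: placeholder hero pool and item sets, not the training configuration.
HERO_POOL = ["axe", "sniper", "lich", "viper", "crystal_maiden", "gyrocopter",
             "shadow_fiend", "witch_doctor", "tidehunter", "death_prophet",
             "earthshaker", "queen_of_pain"]
STARTING_ITEM_SETS = [["tango", "branches"], ["boots"], ["magic_wand"]]

def sample_training_game():
    heroes = random.sample(HERO_POOL, k=10)                        # 5 heroes per side
    items = {h: random.choice(STARTING_ITEM_SETS) for h in heroes}
    return {"heroes": heroes, "starting_items": items}
```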

 
OpenAI Five sees the world as a bunch of numbers that it must decipher. It uses the same general-purpose learning code whether those numbers represent the state of a Dota game (about 20,000 numbers) or robotic hand (about 200).

How OpenAI tackled Dota 2

OpenAI Five isn’t able to handle the full game yet — it can only play 18 out of the 115 different heroes, and it can’t use abilities like summons and illusions. And in a somewhat controversial design decision, OpenAI’s engineers opted not to have it read pixels from the game to retrieve information (like human players do).

It uses Dota 2’s bot API instead, obviating the need for it to search the map to check where its team might be, check if a spell is ready, or estimate an enemy’s health or distance.

OpenAI Five’s view from the Dota 2 Battlefield

OpenAI Five consists of five single-layer, 4,096-unit long short-term memory (LSTM) networks. The networks are trained with deep reinforcement learning, using rewards based on in-game signals such as kills, deaths, assists, last hits, and net worth.
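As an illustration of how signals like these can be combined into a scalar reward, the sketch below sums weighted per-timestep changes in game statistics. The weights are made-up placeholders, not the values used to train OpenAI Five.

```python
# Illustrative reward shaping over the signals listed above (placeholder weights).
REWARD_WEIGHTS = {
    "kills": 1.0,
    "deaths": -1.0,
    "assists": 0.5,
    "last_hits": 0.1,
    "net_worth_gain": 0.002,
}

def shaped_reward(stats_delta):
    """stats_delta: change in each statistic since the previous timestep."""
    return sum(w * stats_delta.get(k, 0.0) for k, w in REWARD_WEIGHTS.items())
```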

OpenAI’s training framework consists of two parts: a set of rollout workers, each running a copy of Dota 2 and an LSTM network, and optimiser nodes that perform synchronous gradient descent across a fleet of GPUs. As the rollout workers gain experience, they send it to the optimiser nodes, and a separate set of workers evaluates the trained LSTM networks (agents) against reference agents.
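The following sketch shows the shape of that rollout-worker / optimiser split. A local queue stands in for the real experience pipeline between machines, and collect_trajectory and the policy methods are hypothetical names, not OpenAI's API.

```python
import queue

# Local queue as a stand-in for the experience pipeline between machines.
experience_queue = queue.Queue(maxsize=1024)

def rollout_worker(env, policy_snapshot):
    """Plays Dota 2 games with the latest policy and ships trajectories to optimisers."""
    while True:
        trajectory = collect_trajectory(env, policy_snapshot)   # hypothetical helper
        experience_queue.put(trajectory)

def optimiser_step(policy, batch_size=64):
    """Pulls a batch of experience and applies one synchronous gradient update."""
    batch = [experience_queue.get() for _ in range(batch_size)]
    loss = policy.compute_loss(batch)       # hypothetical: the RL surrogate objective
    loss.backward()
    policy.apply_averaged_gradients()       # hypothetical: gradients averaged across GPUs
```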

OpenAI Five Model Architecture

The complex multi-array observation space is processed into a single vector, which is then passed through a 4096-unit LSTM. The LSTM state is projected to obtain the policy outputs (actions and value function). Each of the five heroes on the team is controlled by a replica of this network with nearly identical inputs, each with its own hidden state.

The networks take different actions because part of the processed observation indicates which of the five heroes is being controlled. The LSTM accounts for 84% of the model’s total parameter count.
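A minimal PyTorch sketch of a per-hero network along these lines is shown below. Apart from the single-layer 4,096-unit LSTM and the roughly 20,000-number observation, the layer sizes and action-space size are assumptions, and the real observation processing and action heads are far more elaborate.

```python
import torch
import torch.nn as nn

class HeroAgent(nn.Module):
    """One per-hero replica; each of the five heroes keeps its own hidden state."""
    def __init__(self, obs_dim=20_000, hidden=4096, n_actions=1000):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden)          # flatten observations into one vector
        self.lstm = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)   # discrete action logits
        self.value_head = nn.Linear(hidden, 1)            # value-function estimate

    def forward(self, obs, state=None):
        # obs: (batch, time, obs_dim); part of it encodes which hero this replica controls
        x = torch.relu(self.encode(obs))
        x, state = self.lstm(x, state)
        return self.policy_head(x), self.value_head(x), state
```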

Training Progress

Over the course of training, OpenAI Five developed a distinct style of play with noticeable similarities and differences to human play styles. As the agents improved, the play style evolved to align closer with human play while still maintaining many of the characteristics learned early on.

 
This graph evaluates all bots on the final game rules (1 courier, patch 7.21, etc.), even those trained under older rules.

Self Play

OpenAI Five is trained without any human gameplay data through a self-improvement process called self-play. This technique was successfully used in prior work to obtain superhuman performance in a variety of multiplayer games, including Backgammon, Go, Chess, Hex, StarCraft 2, and Poker.

In self-play training, OpenAI Five continually pits the current best version of the agent against itself or older versions, optimising for new strategies that can defeat these past and present opponents.
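A sketch of how opponent sampling in such a self-play setup might look is given below; the 80/20 split and the helper names are illustrative assumptions, not the exact training configuration.

```python
import random

past_versions = []   # snapshots of earlier policies, appended periodically during training

def sample_opponent(current_policy, p_latest=0.8):
    """Mostly mirror-match the current policy; sometimes face an older snapshot."""
    if not past_versions or random.random() < p_latest:
        return current_policy
    return random.choice(past_versions)
```

Playing a fraction of games against past versions helps keep the agent robust to strategies it no longer uses itself.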

 

Source: OpenAI

arXiv:1912.06680 [cs.LG]
