Multi-agent reinforcement learning
Multi-agent reinforcement learning (MARL) is a sub-field of reinforcement learning. It focuses on studying the behavior of multiple agents that coexist in a shared environment.[1] Each agent is motivated by its own rewards and takes actions to advance its own interests; these interests may be opposed to the interests of other agents, resulting in complex group dynamics.

Multi-agent reinforcement learning is closely related to game theory, and especially to repeated games. Its study combines the pursuit of algorithms that maximize rewards with a more sociological set of concepts. While research in single-agent reinforcement learning is concerned with finding the algorithm that obtains the highest reward for a single agent, research in multi-agent reinforcement learning evaluates and quantifies social metrics, such as cooperation,[2] reciprocity,[3] equity,[4] social influence,[5] language[6] and discrimination.[7]
Social dilemmas
As in game theory, much of the research in MARL revolves around social dilemmas, such as the prisoner's dilemma,[8] chicken, and the stag hunt.[9]
While game theory research might focus on Nash equilibria and what an ideal policy for an agent would be, MARL research focuses on how the agents would learn these ideal policies through a trial-and-error process. The reinforcement learning algorithms used to train the agents maximize each agent's own reward; the conflict between the needs of the individual agents and the needs of the group is a subject of active research.[10]
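For example, a minimal sketch of this trial-and-error process (an illustration only, not an algorithm taken from the cited works; the payoffs and hyperparameters are assumptions) has two independent learners repeatedly play the prisoner's dilemma, each adjusting its own action values from its own reward:

```python
# Two independent epsilon-greedy learners in a repeated 2x2 matrix game.
import random

# Prisoner's dilemma payoffs, indexed by the joint action
# (agent 0's action, agent 1's action); 0 = cooperate, 1 = defect.
PAYOFFS = {
    (0, 0): (3, 3), (0, 1): (0, 5),
    (1, 0): (5, 0), (1, 1): (1, 1),
}

q = [[0.0, 0.0], [0.0, 0.0]]  # q[agent][action]: each agent's own value estimates
alpha, epsilon = 0.1, 0.1     # learning rate and exploration rate

for _ in range(10_000):
    # Each agent picks an action epsilon-greedily using only its own estimates.
    actions = tuple(
        random.randrange(2) if random.random() < epsilon
        else max((0, 1), key=lambda a: q[i][a])
        for i in range(2)
    )
    rewards = PAYOFFS[actions]
    # Stateless (bandit-style) update: each agent moves its estimate toward the
    # reward it just received; the other agent is simply part of its environment.
    for i in range(2):
        q[i][actions[i]] += alpha * (rewards[i] - q[i][actions[i]])

# With purely selfish rewards, both learners typically come to value defection more.
print(q)
```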
Various techniques have been explored to induce cooperation in agents, including modifying the environment's rules,[11] adding intrinsic rewards,[12] and more.
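As a hypothetical example of such an intrinsic reward, an agent's payoff can be reduced whenever it differs from the payoffs of the other agents, in the spirit of the inequity-aversion approach cited above (the function name and coefficients below are illustrative assumptions, not the exact formulation of that work):

```python
# Illustrative inequity-aversion reward shaping (Fehr-Schmidt style penalties).
def shaped_reward(own_reward, other_rewards, alpha=5.0, beta=0.05):
    """Penalize earning less than others (envy) and earning more than others (guilt)."""
    n = len(other_rewards)
    envy = sum(max(r - own_reward, 0.0) for r in other_rewards)   # disadvantageous inequity
    guilt = sum(max(own_reward - r, 0.0) for r in other_rewards)  # advantageous inequity
    return own_reward - (alpha / n) * envy - (beta / n) * guilt

# An agent that out-earns both partners has its reward reduced by a guilt penalty:
print(shaped_reward(5.0, [0.0, 0.0]))  # 5 - (0.05 / 2) * 10 = 4.75
```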
Sequential social dilemmas
Social dilemmas like the prisoner's dilemma, chicken, and the stag hunt are "matrix games". Each agent takes a single action chosen from two possibilities, and a simple 2x2 matrix describes the reward that each agent receives, given the pair of actions taken.
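For example, a stag hunt can be written as the following 2x2 bimatrix, with the row player's reward listed first in each cell (the numbers are one common illustrative choice, not taken from a specific source):

```latex
\[
\begin{array}{c|cc}
 & \text{Stag} & \text{Hare} \\ \hline
\text{Stag} & (5,\,5) & (0,\,3) \\
\text{Hare} & (3,\,0) & (3,\,3)
\end{array}
\]
```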
In humans and other living creatures, social dilemmas tend to be more complex. Agents take multiple actions over time, and the distinction between cooperating and defecting is not as clear-cut as in matrix games. The concept of a sequential social dilemma (SSD) was introduced in 2017[13] as an attempt to model that complexity. There is ongoing research into defining different kinds of SSDs and showing cooperative behavior in the agents that act in them.[14]
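A toy example (an illustration only, not one of the benchmark environments studied in the cited papers) conveys the flavor of an SSD: agents repeatedly harvest from a shared, regrowing resource, so cooperation and defection emerge from many small choices over time rather than from a single move.

```python
# Toy common-pool resource game: each agent takes a fraction of the stock every step.
def run_episode(harvest_fractions, steps=50, stock=100.0, growth=0.1):
    """harvest_fractions[i] is the share of the current stock agent i takes each step."""
    totals = [0.0] * len(harvest_fractions)
    for _ in range(steps):
        for i, fraction in enumerate(harvest_fractions):
            take = fraction * stock
            totals[i] += take
            stock -= take
        stock *= 1.0 + growth  # the remaining resource regrows
    return totals, stock

# Aggressive harvesting collapses the stock and lowers everyone's long-run return;
# restrained harvesting keeps the resource alive and yields more over the episode.
print(run_episode([0.4, 0.4]))
print(run_episode([0.05, 0.05]))
```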
Autocurricula
An autocurriculum[15] (plural: autocurricula) is a reinforcement learning concept that is salient in multi-agent experiments. As agents improve their performance, they change their environment; this change in the environment affects both them and the other agents. The feedback loop results in several distinct phases of learning, each depending on the previous one. The stacked layers of learning are called an autocurriculum. Autocurricula are especially apparent in adversarial settings,[16] where each group of agents is racing to counter the current strategy of the opposing group.
Autocurricula in reinforcement learning experiments are compared to the stages of the evolution of life on Earth and the development of human culture. A major stage in evolution happened 2-3 billion years ago, when photosynthesizing life forms started to produce massive amounts of oxygen, changing the balance of gases in the atmosphere.[17] In the next stages of evolution, oxygen-breathing life forms evolved, eventually leading to land mammals and human beings. These later stages could only happen after the photosynthesis stage made oxygen widely available. Similarly, human culture could not have gone through the Industrial Revolution in the 18th century without the resources and insights gained from the agricultural revolution around 10,000 BC.[18]
Limitations
There are some inherent difficulties in multi-agent deep reinforcement learning.[19] Because the other agents are learning at the same time, the environment is no longer stationary from any single agent's point of view, so the Markov property is violated: transitions and rewards do not depend only on the current state of an agent, but also on the changing policies of the other agents.
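In assumed notation (not a formula from the cited survey), from the point of view of a single agent i the effective transition probabilities marginalize over the other agents' joint policy, which changes as those agents learn:

```latex
\[
P_i^{(t)}\left(s' \mid s, a_i\right) = \sum_{a_{-i}} \pi_{-i}^{(t)}\left(a_{-i} \mid s\right) \, P\left(s' \mid s, a_i, a_{-i}\right)
\]
```

Because the joint policy of the other agents changes with the learning step t, the induced kernel seen by agent i is not stationary, even though the underlying game dynamics P are.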
Software
There are various tools and frameworks for working with multi-agent reinforcement learning environments.
References
- Albrecht, Stefano; Stone, Peter (2017). "Multiagent Learning: Foundations and Recent Trends. Tutorial". IJCAI-17 conference (PDF).
- Lowe, Ryan; Wu, Yi (2020). "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments". arXiv:1706.02275v4 [cs.LG].
- Baker, Bowen (2020). "Emergent Reciprocity and Team Formation from Randomized Uncertain Social Preferences". NeurIPS 2020 proceedings. arXiv:2011.05373.
- Hughes, Edward; Leibo, Joel Z.; et al. (2018). "Inequity aversion improves cooperation in intertemporal social dilemmas". NeurIPS 2018 proceedings. arXiv:1803.08884.
- Jaques, Natasha; Lazaridou, Angeliki; Hughes, Edward; et al. (2019). "Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning". Proceedings of the 36th International Conference on Machine Learning. arXiv:1810.08647.
- Lazaridou, Angeliki (2017). "Multi-Agent Cooperation and The Emergence of (Natural) Language". ICLR 2017. arXiv:1612.07182.
- Duéñez-Guzmán, Edgar; et al. (2021). "Statistical discrimination in learning agents". arXiv:2110.11404v1 [cs.LG].
- Sandholm, Tuomas W.; Crites, Robert H. (1996). "Multiagent reinforcement learning in the Iterated Prisoner's Dilemma". Biosystems. 37 (1–2): 147–166. doi:10.1016/0303-2647(95)01551-5. PMID 8924633.
- Peysakhovich, Alexander; Lerer, Adam (2018). "Prosocial Learning Agents Solve Generalized Stag Hunts Better than Selfish Ones". AAMAS 2018. arXiv:1709.02865.
- Dafoe, Allan; Hughes, Edward; Bachrach, Yoram; et al. (2020). "Open Problems in Cooperative AI". NeurIPS 2020. arXiv:2012.08630.
- Köster, Raphael; Hadfield-Menell, Dylan; Hadfield, Gillian K.; Leibo, Joel Z. "Silly rules improve the capacity of agents to learn stable enforcement and compliance behaviors". AAMAS 2020. arXiv:2001.09318.
- Hughes, Edward; Leibo, Joel Z.; et al. (2018). "Inequity aversion improves cooperation in intertemporal social dilemmas". NeurIPS 2018 proceedings. arXiv:1803.08884.
- Leibo, Joel Z.; Zambaldi, Vinicius; Lanctot, Marc; Marecki, Janusz; Graepel, Thore (2017). "Multi-agent Reinforcement Learning in Sequential Social Dilemmas". AAMAS 2017. arXiv:1702.03037.
- Badjatiya, Pinkesh; Sarkar, Mausoom (2020). "Inducing Cooperative behaviour in Sequential-Social dilemmas through Multi-Agent Reinforcement Learning using Status-Quo Loss". arXiv:2001.05458 [cs.AI].
- Leibo, Joel Z.; Hughes, Edward; et al. (2019). "Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research". arXiv:1903.00742v2 [cs.AI].
- Baker, Bowen; et al. (2020). "Emergent Tool Use From Multi-Agent Autocurricula". ICLR 2020. arXiv:1909.07528.
- Kasting, James F; Siefert, Janet L (2002). "Life and the evolution of earth's atmosphere". Science. 296 (5570): 1066–1068. Bibcode:2002Sci...296.1066K. doi:10.1126/science.1071184. PMID 12004117. S2CID 37190778.
- Clark, Gregory (2008). A farewell to alms: a brief economic history of the world. Princeton University Press. ISBN 978-0-691-14128-2.
- Hernandez-Leal, Pablo; Kartal, Bilal; Taylor, Matthew E. (2019-11-01). "A survey and critique of multiagent deep reinforcement learning". Autonomous Agents and Multi-Agent Systems. 33 (6): 750–797. arXiv:1810.05587. doi:10.1007/s10458-019-09421-1. ISSN 1573-7454. S2CID 52981002.