Relay Game Update
We ran an experiment where people did programming as a relay. Each player spends ten minutes on a programming problem before passing on their code to a new player who has not seen any of the previous work. We found that people were able to solve some problems using the relay approach, but it was much less efficient than having a single person work on their own. This project was part of Ought’s research on Factored Cognition, where we are testing whether people can solve problems by decomposing them into self-contained sub-problems.
Ought is investigating ways of decomposing tasks into smaller sub-tasks. Task decomposition is not a new idea: it’s widely recognized as fundamental to modern economies. People have worked out ways to decompose complex tasks (e.g. create an electric car) into smaller sub-tasks (create an engine and battery, design steering controls, test the car for safety) which in turn are broken down into yet smaller tasks, and so on. The smallest sub-tasks are carried out by individual humans, who may have a limited understanding of how their work relates to the original task.
Our focus is on the decomposition of cognitive tasks, where the goal is to provide information or to answer a question -- without having to take physical actions. Cognitive tasks include solving a mathematics problem, analyzing a dataset, or summarizing a story. We are testing a bold hypothesis about the decomposability of cognitive tasks:
Factored Cognition Hypothesis: Any cognitive task can be decomposed recursively into self-contained, interpretable sub-tasks that can be solved by humans in ten minutes of work.
Sub-tasks being "self-contained" means they are solvable without knowledge of the task. If the task is a hard physics problem, then self-contained sub-tasks would be solvable for someone who hasn’t seen the physics problem (and need not know the task is about physics). This differs from most real-world examples of collaborative problem solving, where everyone in the team knows what task is being solved.
Testing the Factored Cognition Hypothesis involves the following steps:
- Find cognitive tasks that seem hard to decompose (e.g. because normally one person would spend days or weeks on the same problem rather than 10 minutes).
- Generate a high-level strategy that would plausibly decompose the task.
- Test whether the strategy from (2) works by having a group of people solve (1) under controlled conditions.
For some ideas on high-level strategies for a range of problems, see our Factored Cognition slides and the paper Supervising strong learners by amplifying weak experts (Appendix B). We are working towards experiments that encompass steps (1)-(3) and we have developed a web application, Mosaic, for solving tasks with groups of people. At some point we will report on experiments based on Mosaic. For now we describe some experiments with a simple form of Factored Cognition called the Relay Game.
In the Relay Game participants work on a task sequentially, with each person having ten minutes to help solve the problem before handing off to the next person (see Figure 1). This is similar to real-world situations where one person quits a project and someone else takes over. However, the big difference comes from the ten-minute time limit. If the task is complex, it might take ten minutes just to read and understand the task description. This means that most players in the relay won't have time to both understand the task and make a useful contribution. Instead players must solve sub-tasks that previous people have constructed.
We tested the Relay Game on programming problems from Project Euler. Here is a simple example problem:
How many different ways can one hundred be written as a sum of at least two positive integers?
Solving these problems requires both mathematical insight and a working implementation in code. The problems would take 20-90 minutes for one of our players working alone and we expected the relay approach to be substantially slower.
Players worked on a shared Google doc (for notes) and code editor (see Figure 2). The first player receives only the Project Euler problem and begins making notes in the doc and writing code. After ten minutes, the second player takes over the doc and code editor. The Relay ends when a player computes the correct answer to the problem, which can be automatically verified at Project Euler.
We had 103 programmers volunteer to play Relay. They started 40 questions in total but only 25 had relay chains of more than 5 people. The total amount of work was 48 hours and only four questions were successfully solved. See Table 1 for a breakdown of a subset of the 40 questions. (Note: Most of the 103 players only worked on a few problems. Since much of the work was done by a smaller number of players, they were spread thinly over the 40 problems -- as each player spends just 10 minutes on a problem.)
|Project Euler Problem||Relay Total Time||Solved by Relay?||Est. Solo Time to Solve|
|Prize Strings||30 mins||Y||27 mins|
|Counting Summations||40 mins||Y||24 mins|
|Reversible Numbers||70 mins||Y||30 mins|
|Palindromic Sums||100 mins||Y||60 mins|
|Blancmange||130 mins||N||35 mins|
|Leftovers||130 mins||N||35 mins|
|Anti-Chain||140 mins||N||35 mins|
|Prime Geometric||160 mins||N||35 mins|
|Prime Triplets||260 mins||N||35 mins|
|Pandigital||410 mins||N||80 mins|
Can we conclude from Table 1 that Relay is much less efficient than the usual way of working on problems? Not really. It could be that Relay players would get better with experience by developing strategies for decomposing problems and coordinating work. So the main lesson from our experiment is that Relay with inexperienced players is probably less efficient at Project Euler problems. (We say “probably” because we did not conduct a rigorous comparison of Relay vs the usual way of solving problems).
We are interested in general failure modes for Factored Cognition with humans and in strategies for avoiding them. Our Relay experiment is a first step in this direction. We exhibit concrete examples from our Relay experiment that are suggestive of pitfalls and good practices for Factored Cognition.
Here are five “click-throughs”, which show how ideas and code evolved for particular Project Euler problems. In each case, the problem to be solved is shown on the bottom half of the Google doc on the first slide of the click-through.
In these three attemps on Prize Strings the players quickly build on players previous work and get the correct answer.
Empty Chairs was not solved but significant progress was made (with seven people contributing). The clickthrough demonstrates iterative improvements to a math heavy solution.
Palindromic Sums was solved in 10 attempts.
A Scoop of Blancmange
There were 13 unique attempts on A Scoop of Blangmange. While technically unsolved, the answer was off by only 1e-8. Only attempts that changed the state of the problem are shown.
Sum of a Square and a Cube
Sum of a Square and a Cube was unsolved with seven attempts.
With five attempts 500 went unsolved, but demonstrated interesting use of metadata tags.
Adding Meta-data to Notes
Relay players worked on the mathematical part of the Project Euler problems by writing notes in a Google Doc. A tactic that emerged organically in early rounds was to label contributions with meta-data with specific formatting (ex.using square brackets, strikethroughs). The meta-data was intended to provide a quick way for future players to decide which parts of the Google doc to read and which to ignore.
Summarizing Next Actions
For several problems the Google doc became messy and was probably made it difficult for new players to orient themselves. An example is the problem Prime Triples Geometric Sequence, shown here mid-round, where some of the subsequent rounds were spent cleaning up these notes and formulating clear next steps.
The problem with the longest Relay chain was Pandigital Step Numbers with a chain of 41 players. While substantial progress was made on the problem, there was a one-line bug in the code implementation that persisted to the last player in the chain. Given the size and complexity of the problem, it was difficult for players to locate the bug in only ten minutes.
How does Relay relate to Ought’s other research on Factored Cognition with humans? The Relay experiment had two key features:
- We used existing collaboration tools (Google docs and web-based interpreter) rather than specialized tools for Factored Cognition.
- Participants worked in sequence, building on the work of all previous participants.
We are also running experiments with specialized software called Mosaic. Mosaic facilitates tree-structured decompositions of tasks. The overall task is divided into sub-tasks which can be solved independently of each other, and users only see a sub-task (i.e. node) and not the rest of the tree. If this kind of decomposition turns out to be important in Factored Cognition, then the Relay setup will be less relevant.
The Relay experiment was exploratory and we decided not to continue working on it for now. Nevertheless we would be interested to hear about related work or to collaborate on research related to Relay.
Ben Goldhaber led this project with input from Andreas Stuhlmueller, Owain Evans, and Paul Christiano. BG created the software for Relay and oversaw experiments. BG and OE wrote the blogpost. We thank everyone who participated in Relay games, especially the teams at OpenAI and Ought. The original idea for Relay is due to Buck Shlegeris.
Appendix: Related work
The idea of Factored Cognition is superficially similar to crowdsourcing. Two important differences are:
- In crowdsourcing on Mechanical Turk, the task decomposition is usually fixed ahead of time and does not have to be done by the crowd.
- In crowdsourcing for Wikipedia (and other collaborative projects), contributors can spend much more than ten minutes building expertise in a particular task (Wikipedians who edit a page might spend years building expertise on the topic).
For more discussion of differences between Factored Cognition and crowdsourcing and how they relate to AI alignment, see Ought's Factored Cognition slides and William Saunders' blogpost on the subject. Despite these differences, crowdsourcing is useful source of evidence and insights for Relay and Factored Cognition. See Reinventing Discovery (Michael Nielsen) for an overview for crowdsourcing for science. Three crowdsourced projects especially relevant to Relay are:
- The Polymath Project was an example of leveraging internet scale collaboration to solve research problems in mathematics. Starting in 2009, Timothy Gowers posted a challenging problem to his blog and asked other mathematicians and collaborators to help push it forward to a solution. This was similar to the type of distributed problem solving we aimed for with the Relay Game, with the major difference being in the Relay game there is a ten minute time limit, so a player can’t keep working on a problem.
- The MathWorks Competition is a mathematics modeling competition for high school students. When a student submits an answer to a problem, the code for that solution is immediately made publicly available. The fastest solutions are then often improved upon by other students, and resubmitted.
- Microtask Programming is a project aiming to apply crowdsourcing techniques to software development. The project provides a development enviroment where software engineers can collaborate to complete small self-contained microtasks that are automatically generated by the system.
Transmission of information under constraints:
There is also academic research on problem solving under constraints somewhat similar to Relay.
- Causal understanding is not necessary for the improvement of culturally evolving technology tests how improvements to strategies for using tools can improve in situations where information for conveying these improvements is limited.
- Cumulative Improvements in Iterated Problem Solving. Similar to the relay chain model, with solutions to puzzles passed on to later generations to see how they can build on those solutions.