Relay Game Update

We ran an experiment where people did programming as a relay. Each player spends ten minutes on a programming problem before passing on their code to a new player who has not seen any of the previous work. We found that people were able to solve some problems using the relay approach, but it was much less efficient than having a single person work on their own. This project was part of Ought’s research on Factored Cognition, where we are testing whether people can solve problems by decomposing them into self-contained sub-problems.

Table of Contents

Introduction
Experiment Design
Discussion
Acknowledgements
Appendix: Related work
1. Crowdsourcing:
2. Transmission of information under constraints:

Introduction

Ought is investigating ways of decomposing tasks into smaller sub-tasks. Task decomposition is not a new idea: it’s widely recognized as fundamental to modern economies. People have worked out ways to decompose complex tasks (e.g. create an electric car) into smaller sub-tasks (create an engine and battery, design steering controls, test the car for safety) which in turn are broken down into yet smaller tasks, and so on. The smallest sub-tasks are carried out by individual humans, who may have a limited understanding of how their work relates to the original task.

Our focus is on the decomposition of cognitive tasks, where the goal is to provide information or to answer a question -- without having to take physical actions. Cognitive tasks include solving a mathematics problem, analyzing a dataset, or summarizing a story. We are testing a bold hypothesis about the decomposability of cognitive tasks:

Factored Cognition Hypothesis: Any cognitive task can be decomposed recursively into self-contained, interpretable sub-tasks that can be solved by humans in ten minutes of work.^[1]

Sub-tasks being "self-contained" means they are solvable without knowledge of the task. If the task is a hard physics problem, then self-contained sub-tasks would be solvable for someone who hasn’t seen the physics problem (and need not know the task is about physics). This differs from most real-world examples of collaborative problem solving, where everyone in the team knows what task is being solved.

Testing the Factored Cognition Hypothesis involves the following steps:

Find cognitive tasks that seem hard to decompose (e.g. because normally one person would spend days or weeks on the same problem rather than 10 minutes).
Generate a high-level strategy that would plausibly decompose the task.
Test whether the strategy from (2) works by having a group of people solve (1) under controlled conditions.

For some ideas on high-level strategies for a range of problems, see our Factored Cognition slides and the paper Supervising strong learners by amplifying weak experts (Appendix B). We are working towards experiments that encompass steps (1)-(3) and we have developed a web application, Mosaic, for solving tasks with groups of people. At some point we will report on experiments based on Mosaic. For now we describe some experiments with a simple form of Factored Cognition called the Relay Game.

In the Relay Game participants work on a task sequentially, with each person having ten minutes to help solve the problem before handing off to the next person (see Figure 1). This is similar to real-world situations where one person quits a project and someone else takes over. However, the big difference comes from the ten-minute time limit. If the task is complex, it might take ten minutes just to read and understand the task description. This means that most players in the relay won't have time to both understand the task and make a useful contribution. Instead players must solve sub-tasks that previous people have constructed.

Figure 1: In the Relay Approach (left), each person works on a programming problem for a fixed number of minutes before passing over their notes and code to the next person. Eventually someone in the chain completes the problem. This contrasts with the usual approach (right), where a single person works for an extended period. Note: Our experiments had a time limit of 10 minutes per person (vs. 1 minute in the illustration)

We tested the Relay Game on programming problems from Project Euler. Here is a simple example problem:

How many different ways can one hundred be written as a sum of at least two positive integers?

Solving these problems requires both mathematical insight and a working implementation in code. The problems would take 20-90 minutes for one of our players working alone and we expected the relay approach to be substantially slower.

Experiment Design

Players worked on a shared Google doc (for notes) and code editor (see Figure 2). The first player receives only the Project Euler problem and begins making notes in the doc and writing code. After ten minutes, the second player takes over the doc and code editor. The Relay ends when a player computes the correct answer to the problem, which can be automatically verified at Project Euler.

Figure 2

We had 103 programmers volunteer to play Relay. They started 40 questions in total but only 25 had relay chains of more than 5 people. The total amount of work was 48 hours and only four questions were successfully solved. See Table 1 for a breakdown of a subset of the 40 questions. (Note: Most of the 103 players only worked on a few problems. Since much of the work was done by a smaller number of players, they were spread thinly over the 40 problems -- as each player spends just 10 minutes on a problem.)

Table of Relay Game Problems (10 of 40)

Project Euler Problem	Relay Total Time	Solved by Relay?	Est. Solo Time to Solve
Prize Strings	30 mins	Y	27 mins
Counting Summations	40 mins	Y	24 mins
Reversible Numbers	70 mins	Y	30 mins
Palindromic Sums	100 mins	Y	60 mins
Blancmange	130 mins	N	35 mins
Leftovers	130 mins	N	35 mins
Anti-Chain	140 mins	N	35 mins
Prime Geometric	160 mins	N	35 mins
Prime Triplets	260 mins	N	35 mins
Pandigital	410 mins	N	80 mins

Table 1: The total time spent on each problem for Relay. Note that most problems were not solved by Relay and so would take longer to actually solve. (So solo vs. Relay cannot be directly compared).

Can we conclude from Table 1 that Relay is much less efficient than the usual way of working on problems? Not really. It could be that Relay players would get better with experience by developing strategies for decomposing problems and coordinating work. So the main lesson from our experiment is that Relay with inexperienced players is probably less efficient at Project Euler problems. (We say “probably” because we did not conduct a rigorous comparison of Relay vs the usual way of solving problems).

Clickthrough Examples

We are interested in general failure modes for Factored Cognition with humans and in strategies for avoiding them. Our Relay experiment is a first step in this direction. We exhibit concrete examples from our Relay experiment that are suggestive of pitfalls and good practices for Factored Cognition.

Here are five “click-throughs”, which show how ideas and code evolved for particular Project Euler problems. In each case, the problem to be solved is shown on the bottom half of the Google doc on the first slide of the click-through.

Prize Strings

In these three attemps on Prize Strings the players quickly build on players previous work and get the correct answer.

Player One proposes a solution strategy for the problem and correctly identifies the recursive pattern.

Player two makes some minor modifications to the document and expands the code.

Player three tweaks the code, runs it, and gets the answer.

Empty Chairs

Empty Chairs was not solved but significant progress was made (with seven people contributing). The clickthrough demonstrates iterative improvements to a math heavy solution.

Player One starts by attempting to turn the problem statement into a formal specification.

Player Two adds more structure to the problem, commentary to the existing strategy, and defines the current task.

Player Three improves the first strategy and proposes a second solution strategy.

Player Four focuses on the "dynamic programming" strategy.

Player Five adds more structure and details to both of the existing strategies.

Player Six rejects the "dynamic programming" strategy and adds desiderata for any new strategy.

Player Seven first starts to code an answer.

Palindromic Sums

Palindromic Sums was solved in 10 attempts.

Player One creates an outline of the strategy and tasks for solving the problem.

Player Two adds more detail and formal structure to the strategy.

Player Three adds more detail to one step in the strategy and also starts implementing code.

Player Four adds a utility function.

Player Five notes a mathematical relationship in the problem, and links the two functions.

Player Six made no changes to the state of the problem.

Player Seven added logging to debug the speed of the solution.

Player Eight started to check the strategy and function logic.

Player Nine rewrote the code to use a generator syntax to improve performance.

The code was rerun with no changes but avoiding the timout issues, and the correct answer was generated.

A Scoop of Blancmange

There were 13 unique attempts on A Scoop of Blangmange. While technically unsolved, the answer was off by only 1e-8. Only attempts that changed the state of the problem are shown.

Player One proposed a strategy and wrote initial functions

Player Two pointed out a potential problem in the strategy and proposed a modified approach.

Player Three benchmarked the strategy.

Player Four tested the random_sample function.

Player Five tested the random_sample function.

Player Six proposed a new strategy focusing on a different paradigm using calculus.

Player Seven notes that current approach is too imprecise, and comments on improvements to the Reimann sum strategy.

Player Eight took the idea from the previous player and updated it.

Player Nine selected the strategy to use and added specific subtasks.

Player Thirteen (players 10-12 made no change to the problem state) refined the stategy and implementation.

Sum of a Square and a Cube

Sum of a Square and a Cube was unsolved with seven attempts.

Player One converted the problem into pseudocode, and listed next steps to take.

Player Two estimated the run time.

Player Three brainstormed ways to speed up the code.

Player Four made a small note but was probably cutoff from finishing.

Player Five added to the pseudocode solution.

Player Six brainstormed more ways of speeding up the code.

Player Seven evaluated which of the potential solutions was most promising.

500

With five attempts 500 went unsolved, but demonstrated interesting use of metadata tags.

Player One simplified the problem and started generating tasks.

Player Two analyzed the mathematical properties of the problem.

Player Three started analyzing the primality of the target number (but likely ran out of time before writing anything).

Player Four determined that 500500507 was not prime, and added to next steps and the code solution.

Player Five proposed a tactic for generating the multiplication table.

Adding Meta-data to Notes

Relay players worked on the mathematical part of the Project Euler problems by writing notes in a Google Doc. A tactic that emerged organically in early rounds was to label contributions with meta-data with specific formatting (ex.using square brackets, strikethroughs). The meta-data was intended to provide a quick way for future players to decide which parts of the Google doc to read and which to ignore.

Summarizing Next Actions

For several problems the Google doc became messy and was probably made it difficult for new players to orient themselves. An example is the problem Prime Triples Geometric Sequence, shown here mid-round, where some of the subsequent rounds were spent cleaning up these notes and formulating clear next steps.

Summarizing

Costly Bugs

The problem with the longest Relay chain was Pandigital Step Numbers with a chain of 41 players. While substantial progress was made on the problem, there was a one-line bug in the code implementation that persisted to the last player in the chain. Given the size and complexity of the problem, it was difficult for players to locate the bug in only ten minutes.

Bug

Figure 3. Code window for Pandigital Step Numbers problem. The highlighted line contains the bug that probably contributed to the failure of a 41-person Relay to solve the problem. The code incorrectly sets the first digit of the number in the bitmask as “False”.

Discussion

How does Relay relate to Ought’s other research on Factored Cognition with humans? The Relay experiment had two key features:

We used existing collaboration tools (Google docs and web-based interpreter) rather than specialized tools for Factored Cognition.
Participants worked in sequence, building on the work of all previous participants.

We are also running experiments with specialized software called Mosaic. Mosaic facilitates tree-structured decompositions of tasks. The overall task is divided into sub-tasks which can be solved independently of each other, and users only see a sub-task (i.e. node) and not the rest of the tree. If this kind of decomposition turns out to be important in Factored Cognition, then the Relay setup will be less relevant.

The Relay experiment was exploratory and we decided not to continue working on it for now. Nevertheless we would be interested to hear about related work or to collaborate on research related to Relay.

Acknowledgements

Ben Goldhaber led this project with input from Andreas Stuhlmueller, Owain Evans, and Paul Christiano. BG created the software for Relay and oversaw experiments. BG and OE wrote the blogpost. We thank everyone who participated in Relay games, especially the teams at OpenAI and Ought. The original idea for Relay is due to Buck Shlegeris.

[1]: This is a loose formulation of the hypothesis intended to get the basic idea across. For a discussion of how to make this kind of hypothesis precise, see Paul Christiano's Universality post

Crowdsourcing:

The idea of Factored Cognition is superficially similar to crowdsourcing. Two important differences are:

In crowdsourcing on Mechanical Turk, the task decomposition is usually fixed ahead of time and does not have to be done by the crowd.
In crowdsourcing for Wikipedia (and other collaborative projects), contributors can spend much more than ten minutes building expertise in a particular task (Wikipedians who edit a page might spend years building expertise on the topic).

For more discussion of differences between Factored Cognition and crowdsourcing and how they relate to AI alignment, see Ought's Factored Cognition slides and William Saunders' blogpost on the subject. Despite these differences, crowdsourcing is useful source of evidence and insights for Relay and Factored Cognition. See Reinventing Discovery (Michael Nielsen) for an overview for crowdsourcing for science. Three crowdsourced projects especially relevant to Relay are:

The Polymath Project was an example of leveraging internet scale collaboration to solve research problems in mathematics. Starting in 2009, Timothy Gowers posted a challenging problem to his blog and asked other mathematicians and collaborators to help push it forward to a solution. This was similar to the type of distributed problem solving we aimed for with the Relay Game, with the major difference being in the Relay game there is a ten minute time limit, so a player can’t keep working on a problem.
The MathWorks Competition is a mathematics modeling competition for high school students. When a student submits an answer to a problem, the code for that solution is immediately made publicly available. The fastest solutions are then often improved upon by other students, and resubmitted.
Microtask Programming is a project aiming to apply crowdsourcing techniques to software development. The project provides a development enviroment where software engineers can collaborate to complete small self-contained microtasks that are automatically generated by the system.

Transmission of information under constraints:

There is also academic research on problem solving under constraints somewhat similar to Relay.

Causal understanding is not necessary for the improvement of culturally evolving technology tests how improvements to strategies for using tools can improve in situations where information for conveying these improvements is limited.
Cumulative Improvements in Iterated Problem Solving. Similar to the relay chain model, with solutions to puzzles passed on to later generations to see how they can build on those solutions.

This post was published on March 5, 2019 by Ben Goldhaber.

Comments