EMNLP Demo Experiments
To validate our implementation of recursive multi-agent systems, we ran three benchmarks and compared the performance of several ReDel system configurations. The benchmarks were:
FanOutQA, a multi-hop, multi-document information seeking benchmark with open-domain search
TravelPlanner, a real-world planning benchmark for language agents
WebArena, an autonomous agent benchmark with diverse tasks in a realistic web environment
All of our experiment code is open-source on the demo/emnlp branch of the ReDel repository:
https://github.com/zhudotexe/redel/tree/demo/emnlp
For more information on our experiments, see our paper (link coming soon)!
Reproducing Experiments
In the demo/emnlp branch of the ReDel repository, we include the logs of every single experiment run in
the experiments/ directory. You can load any of these runs in the visualization to view what the ReDel system did!
The experiments directory is broken down into the following
structure: experiments/BENCHMARK_NAME/BENCHMARK_SPLIT/[RUN_ID]/SYSTEM_ID/QUERY_ID, where:
BENCHMARK_NAME is the name of the benchmark (fanoutqa, travelplanner, or webarena)
BENCHMARK_SPLIT is the split of the benchmark we ran (usually the dev/validation split)
RUN_ID is an internal split in the FanOutQA experiment to analyze an edge-case behaviour with respect to parallel function calling and long contexts
SYSTEM_ID is the system under test, configured as in the table below
QUERY_ID is the benchmark-specific ID of a single run (loadable in the visualizer).
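The layout above can be read back programmatically. The following is a minimal sketch, not part of the ReDel codebase: `RunPath` and `parse_run_path` are hypothetical helper names, and the sketch assumes that RUN_ID is the only optional component, so a path has either four or five parts after `experiments/`.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class RunPath(NamedTuple):
    """Components of experiments/BENCHMARK_NAME/BENCHMARK_SPLIT/[RUN_ID]/SYSTEM_ID/QUERY_ID."""
    benchmark_name: str
    benchmark_split: str
    run_id: Optional[str]  # assumed to be the only optional component
    system_id: str
    query_id: str


def parse_run_path(path: str) -> RunPath:
    """Split a run directory path into its named components."""
    parts = PurePosixPath(path).parts
    if "experiments" not in parts:
        raise ValueError(f"not inside an experiments/ directory: {path}")
    rest = parts[parts.index("experiments") + 1:]
    if len(rest) == 4:
        name, split, system_id, query_id = rest
        run_id = None
    elif len(rest) == 5:
        name, split, run_id, system_id, query_id = rest
    else:
        raise ValueError(f"expected 4 or 5 components after experiments/, got {len(rest)}")
    return RunPath(name, split, run_id, system_id, query_id)
```

For example, `parse_run_path("experiments/fanoutqa/dev/trial2/full/some-query")` yields a `run_id` of `trial2` and a `system_id` of `full`; the query ID here is a made-up placeholder.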
To reproduce the experiments included in this repository, we include scripts to run each benchmark.
Follow these steps to set up the environment, then follow the instructions for each benchmark. We recommend setting up a virtual environment for this project.
First, you’ll need to clone this repository and check out the demo/emnlp branch:
git clone -b demo/emnlp https://github.com/zhudotexe/redel
Install the necessary dependencies:
pip install -r requirements.txt
FanOutQA
output path: experiments/fanoutqa/dev/trial2/SYSTEM_ID
Run
python bench_fanoutqa.py <full|root-fc|baseline|small-leaf|small-all|small-baseline|short-context|short-baseline>
This will run the given system on the FanOutQA dev set in the Open Book setting.
Evaluate
Set the FANOUTQA_OPENAI_API_KEY environment variable to a valid OpenAI API key. You can
use export FANOUTQA_OPENAI_API_KEY=$OPENAI_API_KEY to copy an existing API key from environment variables.
python score_fanoutqa.py experiments/fanoutqa/**/results.jsonl
This will output a score.json file in the output path with the final scores.
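Note that in bash the `**` pattern only recurses into subdirectories when the `globstar` option is enabled (`shopt -s globstar`); zsh recurses by default. To check which files the pattern is meant to match, independent of the shell, a short sketch such as the following may help. It assumes that each run writes its `results.jsonl` somewhere below `experiments/fanoutqa`; `find_results` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def find_results(root="experiments/fanoutqa"):
    """Return every results.jsonl below root, at any depth, in sorted order."""
    return sorted(Path(root).glob("**/results.jsonl"))


if __name__ == "__main__":
    for path in find_results():
        print(path)
```

If this prints fewer files than expected, the runs may not have finished or the output path may differ from the one assumed here.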
TravelPlanner
output path: experiments/travelplanner/validation/SYSTEM_ID
Setup
Install the TravelPlanner database:
Download the database from this link
Extract the zip file in redel/tools/travelplanner. This should create a directory named db.
In another directory, clone our fork of the TravelPlanner repository. This will be used for scoring, and includes the fixes discussed in our paper.
git clone https://github.com/zhudotexe/TravelPlanner
Run
python bench_travelplanner.py <full|root-fc|baseline|small-leaf|small-all|small-baseline>
Note: This benchmark does not test the short-context systems (short-context and short-baseline), since it does not have a long-context requirement.
Evaluate
python score_travelplanner.py experiments/travelplanner/**/results.jsonl
This script will write files in the correct format for the TravelPlanner evaluation in the output path, and print the command to run to score the results.
You should now switch to the TravelPlanner repository you cloned in the setup step and run the commands output by this script.
WebArena
output path: experiments/webarena/test/SYSTEM_ID
Setup
We reproduce some of the scripts and data contained in the WebArena repository in this repo under the terms of the
Apache-2.0 license, contained in experiments/webarena/vendor/LICENSE.
First, you’ll need to set up your own WebArena environment. See https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md for instructions.
Next, run the following to set up the WebArena configuration:
# setup env vars (see https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md for env setup)
export SHOPPING="<your_shopping_site_domain>:7770"
export SHOPPING_ADMIN="<your_e_commerce_cms_domain>:7780/admin"
export REDDIT="<your_reddit_domain>:9999"
export GITLAB="<your_gitlab_domain>:8023"
export MAP="<your_map_domain>:3000"
export WIKIPEDIA="<your_wikipedia_domain>:8888/wikipedia_en_all_maxi_2022-05/A/User:The_other_Kiwix_guy/Landing"
export HOMEPAGE="<your_homepage_domain>:4399"
# generate config files
python experiments/webarena/generate_test_data.py
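Before generating the config files, it can be worth confirming that all seven variables are actually set in the current shell. The check below is a minimal sketch: the variable names come from the exports above, while `missing_webarena_vars` is a hypothetical helper and does not reflect how `generate_test_data.py` itself validates its inputs.

```python
import os

# Environment variables exported in the setup step above.
REQUIRED_VARS = [
    "SHOPPING",
    "SHOPPING_ADMIN",
    "REDDIT",
    "GITLAB",
    "MAP",
    "WIKIPEDIA",
    "HOMEPAGE",
]


def missing_webarena_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_webarena_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All WebArena environment variables are set.")
```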
You’ll also need to ensure Playwright is installed:
playwright install chromium
Run
First, make sure you have reset your WebArena environment (see https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md#environment-reset).
Then, launch the WebArena environment.
As the default WebArena script is incompatible with asyncio, ReDel launches a separate process to handle the WebArena environment, which it communicates with over a pipe. This is done automatically.
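To illustrate the general pattern, and not ReDel's actual implementation, the sketch below shows one way for asyncio code to drive blocking code in a child process over pipes. Everything in it is an assumption made for illustration: the JSON-lines message format, the `send_action` helper, and the echoing child, which stands in for the synchronous environment code.

```python
import asyncio
import json
import sys

# Hypothetical stand-in for blocking, non-async environment code.
# It reads one JSON request per line on stdin and answers on stdout.
CHILD_SOURCE = r"""
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    # A real worker would drive the browser environment here.
    response = {"echo": request["action"]}
    sys.stdout.write(json.dumps(response) + "\n")
    sys.stdout.flush()
"""


async def send_action(proc, action: str) -> dict:
    """Send one request down the pipe and await the child's reply."""
    proc.stdin.write((json.dumps({"action": action}) + "\n").encode())
    await proc.stdin.drain()
    line = await proc.stdout.readline()
    return json.loads(line)


async def main() -> dict:
    # Launch the child with pipes attached to its stdin and stdout.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD_SOURCE,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
    )
    try:
        return await send_action(proc, "click")
    finally:
        # Closing stdin ends the child's read loop so it can exit.
        proc.stdin.close()
        await proc.wait()


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The event loop never blocks on the child: writes and reads are awaited, so other agents can keep running while the environment process works.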
Finally, run the bench script:
python bench_webarena.py <full|root-fc|baseline|small-leaf|small-all|small-baseline|short-context|short-baseline>
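The run commands for the three benchmarks differ only in the script name and in which system IDs they accept. As a convenience, the sketch below collects them in one place; the script names and system IDs are taken from the commands in this document, while `BENCHMARKS` and `bench_commands` are hypothetical names that do not exist in the repository.

```python
import sys

# System IDs accepted by every benchmark script.
SYSTEMS = ["full", "root-fc", "baseline", "small-leaf", "small-all", "small-baseline"]
# Additional system IDs accepted only by the FanOutQA and WebArena scripts.
SHORT_CONTEXT_SYSTEMS = ["short-context", "short-baseline"]

BENCHMARKS = {
    "bench_fanoutqa.py": SYSTEMS + SHORT_CONTEXT_SYSTEMS,
    "bench_travelplanner.py": SYSTEMS,
    "bench_webarena.py": SYSTEMS + SHORT_CONTEXT_SYSTEMS,
}


def bench_commands(script: str) -> list:
    """Build one argument list per system for the given benchmark script."""
    return [[sys.executable, script, system] for system in BENCHMARKS[script]]


if __name__ == "__main__":
    for command in bench_commands("bench_travelplanner.py"):
        print(" ".join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` from the repository root. For WebArena, keep in mind that the environment should be reset before a run, as described above, so running all systems back-to-back without a reset may not reproduce the reported setup.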