
Efficient Imitation Without Demonstrations via Value-Penalized Auxiliary Control from Examples

This repository contains data, code, and instructions for reproducing the results from value-penalized auxiliary control from examples (VPACE). VPACE uses the scheduled auxiliary control framework, combined with a novel value penalty, to significantly improve the learning efficiency of example-based control.
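
Roughly speaking, the value penalty discourages value estimates that exceed an upper bound on achievable returns. As a sketch only (the exact formulation is in the paper; the squared hinge and the symbols below are our simplified notation), the critic loss gains a term of the form

$$
\mathcal{L}_{\text{critic}} = \mathcal{L}_{\text{TD}} + \lambda \, \mathbb{E}\!\left[\max\big(Q(s,a) - Q_{\max},\ 0\big)^{2}\right], \qquad Q_{\max} \approx \frac{R_{\max}}{1 - \gamma},
$$

where Q(s, a) is the learned value estimate, R_max is the maximum reward, \gamma is the discount factor, and \lambda corresponds to the --q_over_max_penalty argument described under Algorithm Options below.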

Installation (Algorithm + Simulated Panda Environments)

Our method and code are built on learning from guided play (LfGP). We recommend first setting up a virtual environment (conda, virtualenv, etc.). We have tested Python 3.7-3.11, but recommend Python 3.11. To install (a combined example follows the list):

  1. git clone git@github.com:utiasSTARS/vpace.git && cd vpace
  2. pip install -r reqs/requirements.txt
  3. export VPACE_TOP_DIR=$PWD (you can optionally set this to a different directory, but then you must also move expert_data there).
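
Putting these steps together (a minimal sketch assuming conda; the environment name vpace is arbitrary):

conda create -n vpace python=3.11 -y
conda activate vpace
git clone git@github.com:utiasSTARS/vpace.git && cd vpace
pip install -r reqs/requirements.txt
export VPACE_TOP_DIR=$PWD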

Sawyer and Adroit Hand Baselines

The above installation only allows you to run our code in the Panda environments originally from LfGP. To run Sawyer and Adroit Hand baselines, you must also run:

  1. pip install tf-agents==0.19.0
  2. pip install -r reqs/sawyer_hand_requirements.txt
    1. This will cause pip resolver issues about gym and cloudpickle, which can be safely ignored because we barely use tf-agents or tensorflow-probability.
  3. (If the steps above alone still result in MuJoCo issues) sudo apt install gcc libosmesa6-dev libglew-dev patchelf

Note that the Sawyer and Adroit Hand environments require an older version of mujoco_py, which can sometimes be a hassle to install and run. If you still can't run these environments after following our instructions, try looking up your specific error; someone else has most likely encountered it as well.
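
As a general hint (this is standard mujoco_py setup, not specific to this repository), mujoco_py typically expects MuJoCo 2.1.0 to be extracted to ~/.mujoco/mujoco210, with its bin directory on your library path:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin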

Real World Experiments

We completed experiments on a real-world Franka Emika Panda (FR3). To complete these experiments, we used panda-rl-envs. See that repository for more details on setting up a real-world Panda to work with VPACE.

Running

All experiments from our paper can be reproduced with run_vpace.py and various argument combinations. Results are stored under top_save_path/results/long_env_name/seed/algo/exp_name/date_time. By default, all results are stored under vpace/results, but you can change this with either the VPACE_TOP_DIR environment variable or the --top_save_path argument.
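
For example, to store results somewhere other than vpace/results (the path /data/vpace_results is only a placeholder):

export VPACE_TOP_DIR=/data/vpace_results

or, for a single run, append --top_save_path /data/vpace_results to the run_vpace.py command.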

Quick Start

To train a VPACE-SQIL model for Unstack-Stack, you can use:

python run_vpace.py --main_task unstack_nm --reward_model sqil --exp_name test

For sawyer_box_close:

python run_vpace.py --env_type sawyer --env_name sawyer_box_close --reward_model sqil --exp_name test

For relocate-human-v0-najp-dp:

python run_vpace.py --env_type hand_dapg --env_name relocate-human-v0-najp-dp --reward_model sqil --exp_name test

For SimPandaReach (from panda-rl-envs):

python run_vpace.py --env_type panda_rl_envs --env_name SimPandaReach --reward_model sqil --exp_name test

Task Options

To choose a task to run, you can use combinations of --env_type, --env_name, and --main_task as follows:

| Option | Values | Description |
| --- | --- | --- |
| --env_type | manipulator_learning (default, Panda environment), sawyer, hand_dapg, panda_rl_envs | Environment type |
| --env_name (Sim Panda) | PandaPlayInsertTrayXYZState (default) | Environment name for simulated Panda tasks |
| --env_name (Sawyer) | sawyer_drawer_open, sawyer_drawer_close, sawyer_push, sawyer_lift, sawyer_box_close, sawyer_bin_picking | Environment name for Sawyer tasks |
| --env_name (Adroit Hand) | door-human-v0, hammer-human-v0, relocate-human-v0, door-human-v0-dp, hammer-human-v0-dp, relocate-human-v0-najp-dp | Environment name for Adroit Hand tasks |
| --env_name (Real Panda) | SimPandaReach, SimPandaReachRandInit, SimPandaReachAbs, PandaDoorNoJamAngleLongEp, PandaDrawerLineLongEp | Environment name for real Panda tasks (see footnote 1) |
| --main_task | reach, lift, move, stack_nm, unstack_nm, bring_nm, insert_nb_nm | Sim Panda task (applies to the Sim Panda environment only) |
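
For example, to train on the simulated Panda insertion task (this particular argument combination is just an illustration):

python run_vpace.py --main_task insert_nb_nm --reward_model sqil --exp_name test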

Algorithm Options

Common options you can change to reproduce our main results are:

| Option | Values | Description |
| --- | --- | --- |
| --reward_model | discriminator, sqil, rce, sparse | Reward model |
| --single_task | (add the flag to turn on; otherwise off) | Run without the ACE/LfGP framework |
| --q_over_max_penalty | float (default 10.0) | Strength of value penalization (\lambda from the paper); set to 0.0 to turn VP off |
| --q_regularizer | vp, c2f, cql | Type of value penalization method |
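
For example, to run ACE-SQIL with value penalization turned off (an illustrative combination of the options above; the exp_name values are arbitrary):

python run_vpace.py --main_task unstack_nm --reward_model sqil --q_over_max_penalty 0.0 --exp_name no_vp

or single-task SQIL without the ACE/LfGP framework:

python run_vpace.py --main_task unstack_nm --reward_model sqil --single_task --exp_name single_task_sqil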

Ablation Study Options

To reproduce our ablation study results, you can use the following option combinations:

| Ablation | Options to Add | Description |
| --- | --- | --- |
| +Full Trajectories | --add_default_full_traj | Add full-trajectory expert data, in addition to examples |
| +Full Trajectories & Actions | --add_default_full_traj --expert_data_mode obs_act | Same as above, but include actions as well |
| SAC-X (Sparse Rewards) | --reward_model sparse | Use true sparse rewards, instead of examples |
| \lambda = 1 | --q_over_max_penalty 1.0 | Value penalization strength of 1 |
| \lambda = 100 | --q_over_max_penalty 100.0 | Value penalization strength of 100 |
| 10 Examples | --expert_amounts 10 | Use only 10 examples per task |
| 100 Examples | --expert_amounts 100 | Use only 100 examples per task |
| No Example Augmentation | --expert_randomize_factor 0.0 | Turn off example augmentation |
| 10 Examples, No Ex. Aug. | --expert_amounts 10 --expert_randomize_factor 0.0 | 10 examples with no example augmentation |
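
For example, to run the 10 Examples, No Ex. Aug. ablation on Unstack-Stack (the exp_name is arbitrary):

python run_vpace.py --main_task unstack_nm --reward_model sqil --expert_amounts 10 --expert_randomize_factor 0.0 --exp_name few_ex_no_aug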

Figures

To generate plots and figures, you can use the scripts in the figures and plotting directories. You may want to install some additional dependencies first with:

  1. pip install -r reqs/fig_requirements.txt

Citation

If you find this repository useful for your work, please consider citing VPACE:

@misc{ablett2024efficientimitationwithoutdemonstrationsvia,
      title={Efficient Imitation Without Demonstrations via Value-Penalized Auxiliary Control from Examples}, 
      author={Trevor Ablett and Bryan Chan and Jayce Haoran Wang and Jonathan Kelly},
      year={2024},
      eprint={2407.03311},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2407.03311}, 
}

Footnotes

  1. The real world tasks PandaDoorNoJamAngleLongEp and PandaDrawerLineLongEp were produced using our own environment and shelves. You can generate your own versions of these real tasks following the code and configurations from panda-rl-envs.
