Reach:
1. Minimize the distance between the palm and the object (without moving the latter) while encouraging maximum hand aperture.
2. Minimize the distance between the palm and the object, with an additional bonus for contact between the hand and the object (a sketch of this shaping is given after this list).
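The exact reward terms live in the environment configuration; the snippet below is only a minimal sketch of the reach-phase shaping described above. The function name, arguments, and weights are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def reach_reward(palm_pos, obj_pos, hand_aperture, max_aperture,
                 hand_obj_contact=False, w_aperture=0.1, w_contact=0.5):
    """Illustrative reach-phase shaping (steps 1-2): pull the palm toward
    the static object, reward a wide hand aperture, and (in step 2) add a
    bonus for any hand-object contact."""
    palm_dist = np.linalg.norm(np.asarray(palm_pos) - np.asarray(obj_pos))
    reward = -palm_dist                                    # minimize palm-object distance
    reward += w_aperture * (hand_aperture / max_aperture)  # encourage max. aperture (step 1)
    reward += w_contact * float(hand_obj_contact)          # contact bonus (step 2)
    return reward
```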
Grasp & move:
3. Minimize the distance between the object and the target position (x, y: initial object position; z: 40 cm) while encouraging contact between the fingertips and the object.
4. As step 3, but fixing the z-target position (x, y: initial object position; z: initial object z + 40 cm).
5. As step 4, but training had to be restarted.
6. We changed the target position from the initial object position to the final position, keeping the z-target 40 cm above the z-goal. We modified the target-box hyperparameters (from phase 1 to phase 2). Additionally, we modified the reward by giving more weight to the palm distance than to the fingertip distance and by introducing action regularization (see the sketch after this list).
7. As step 6, but we fixed the key_frame id and trained for a longer time.
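Again as an illustration only (names, signatures, and weights below are assumptions), the shaping used from step 3 onward could look roughly like this: the object is driven toward a target raised 40 cm along z, palm and fingertip distances encourage a stable grasp, and an action-regularization term (introduced in step 6) penalizes large controls.

```python
import numpy as np

def grasp_move_reward(obj_pos, target_xy, target_z, palm_pos, fingertip_pos,
                      action, n_contacts=0,
                      w_palm=1.0, w_tip=0.5, w_contact=0.1, w_act=1e-3):
    """Illustrative grasp-and-move shaping (steps 3-7)."""
    target = np.array([target_xy[0], target_xy[1], target_z])   # z raised by 40 cm
    reward = -np.linalg.norm(np.asarray(obj_pos) - target)      # move object to target
    reward += -w_palm * np.linalg.norm(np.asarray(palm_pos) - np.asarray(obj_pos))
    tip_dists = np.linalg.norm(np.asarray(fingertip_pos) - np.asarray(obj_pos), axis=-1)
    reward += -w_tip * tip_dists.mean()                         # keep fingertips near object
    reward += w_contact * n_contacts                            # fingertip-object contacts
    reward += -w_act * np.square(np.asarray(action)).sum()      # action regularization (step 6)
    return reward
```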
Insert:
8. We included the solved component in the reward.
9. As step 8, but training had to be restarted.
10. We enlarged the object hyperparameter space to obtain a more robust policy (see the sketch after this list).
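For completeness, a hypothetical version of the last two ingredients: the solved bonus added in step 8 and the enlarged object-randomization ranges of step 10. All names and numeric ranges here are placeholders, not the values used in training.

```python
import numpy as np

def final_reward(dense_reward, solved, w_solved=10.0):
    """Step 8: add a sparse bonus on top of the dense shaping when the
    insertion is solved."""
    return dense_reward + w_solved * float(solved)

def sample_object_params(rng=None, scale=1.5):
    """Step 10: draw object properties from enlarged ranges so the policy
    sees more variety during training (placeholder ranges)."""
    rng = rng if rng is not None else np.random.default_rng()
    return {
        "size": rng.uniform(0.02, 0.02 * scale),   # metres
        "mass": rng.uniform(0.05, 0.05 * scale),   # kilograms
    }
```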
All the trained models, environment configurations, main files, and TensorBoard logs are in the output/trained_agents folder.
We use RecurrentPPO from Stable Baselines3 Contrib (sb3-contrib) as our base algorithm, with the following architecture for both the actor and the critic and nothing shared between the two:
obs --> 256 LSTM --> 256 Linear --> 256 Linear --> output
All layers have ReLU activation functions; the output is the value estimate for the critic and the 63-dimensional continuous action for the actor.
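A minimal sketch of how this architecture can be configured through sb3-contrib's RecurrentPPO; here `Pendulum-v1` stands in for the actual task environment and the training budget is illustrative.

```python
import gymnasium as gym
import torch.nn as nn
from sb3_contrib import RecurrentPPO

env = gym.make("Pendulum-v1")  # placeholder for the actual task environment

model = RecurrentPPO(
    "MlpLstmPolicy",
    env,
    policy_kwargs=dict(
        lstm_hidden_size=256,                         # 256-unit LSTM
        n_lstm_layers=1,
        shared_lstm=False,                            # nothing shared between actor and critic
        enable_critic_lstm=True,                      # the critic gets its own LSTM
        net_arch=dict(pi=[256, 256], vf=[256, 256]),  # two 256 Linear layers each
        activation_fn=nn.ReLU,
    ),
    verbose=1,
)
model.learn(total_timesteps=1_000_000)                # illustrative budget
```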