Need help in defining and solving a POMDP: Varying Mountainhike #43

sai-prasanna opened this issue Nov 8, 2024 · 4 comments

sai-prasanna commented Nov 8, 2024

I am working on a POMDP called VaryingMountainHike (which I have redubbed MordorHike 😉). It was introduced here. It is basically a 2D environment with mountains defined by 3 Gaussians. The agent has to get from the bottom left to the top right. The rewards are always negative, and they are least negative on top of the mountains.

I am totally new to Julia and POMDP planners, so bear with me if I have made some mistakes in the implementation.

using POMDPs
using POMCPOW
using POMDPModels
using POMDPTools
using LinearAlgebra
using Distributions
using Random

mutable struct MordorHikePOMDP <: POMDP{Vector{Float64}, Int64, Float64}
    # Constants
    translate_step::Float64
    translate_std::Float64
    obs_std::Float64
    discount::Float64
    
    # Map bounds and positions
    map_lower_bound::Vector{Float64}
    map_upper_bound::Vector{Float64}
    fixed_start_pos::Vector{Float64}
    goal_position::Vector{Float64}
    
    # Mountain parameters
    slope::Vector{Float64}
    mvn_1::MvNormal
    mvn_2::MvNormal
    mvn_3::MvNormal

    # Constructor
    function MordorHikePOMDP()
        new(
            0.1,    # translate_step
            0.05,   # translate_std
            0.1,    # obs_std
            0.99,   # discount
            [-1.0, -1.0],  # map_lower_bound
            [1.0, 1.0],    # map_upper_bound
            [-0.8, -0.8],  # fixed_start_pos
            [0.8, 0.8],    # goal_position
            [0.2, 0.2],    # slope
            MvNormal([0.0, 0.0], [0.005 0.0; 0.0 1.0]),    # mvn_1
            MvNormal([0.0, -0.8], [1.0 0.0; 0.0 0.01]),    # mvn_2
            MvNormal([0.0, 0.8], [1.0 0.0; 0.0 0.01])      # mvn_3
        )
    end
end

# State: [x, y, θ]
# Action: 1=north, 2=south, 3=east, 4=west
# Observation: [altitude]

function POMDPs.actions(pomdp::MordorHikePOMDP)
    return [1, 2, 3, 4]
end

function POMDPs.initialstate(pomdp::MordorHikePOMDP)
    # Random rotation for medium difficulty
    return ImplicitDistribution(rng -> ([pomdp.fixed_start_pos[1], pomdp.fixed_start_pos[2], rand(rng, [0.0, π/2, π, 3π/2])]))
end

function POMDPs.reward(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    if isterminal(pomdp, s)
        return 0.0
    else
        return calculate_altitude(pomdp, sp[1:2])
    end
end
POMDPs.discount(pomdp::MordorHikePOMDP) = pomdp.discount

function POMDPs.isterminal(pomdp::MordorHikePOMDP, s::Vector{Float64})
    return norm(s[1:2] - pomdp.goal_position) <= 2 * pomdp.translate_step
end



function POMDPs.transition(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64)
    return ImplicitDistribution(function(rng)
        directions = Dict(
            1 => [0.0, 1.0],  # North
            2 => [0.0, -1.0], # South
            3 => [1.0, 0.0],  # East
            4 => [-1.0, 0.0]  # West
        )
        
        # Apply movement
        direction = directions[a]
        next_pos = s[1:2] + direction * pomdp.translate_step
        
        # Add noise to position
        next_pos += rand(rng, Normal(0, pomdp.translate_std), 2)
        
        # Clip to bounds
        next_pos = clamp.(next_pos, pomdp.map_lower_bound, pomdp.map_upper_bound)
        
        # Keep the same theta
        return [next_pos[1], next_pos[2], s[3]]
    end)
end



function POMDPs.observation(pomdp::MordorHikePOMDP, a::Int64, sp::Vector{Float64})
    altitude = calculate_altitude(pomdp, sp[1:2])
    return Normal(altitude, pomdp.obs_std)
end

function POMDPs.observation(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    return observation(pomdp, a, sp)
end


function calculate_altitude(pomdp::MordorHikePOMDP, pos::Vector{Float64})
    # Convert pos to a 2-element vector to ensure correct dimensionality
    pos_2d = pos[1:2]  # Take only x,y coordinates if pos is longer
    mountains = [
        pdf(pomdp.mvn_1, pos_2d),
        pdf(pomdp.mvn_2, pos_2d),
        pdf(pomdp.mvn_3, pos_2d)
    ]
    altitude = maximum(mountains)
    return -exp(-altitude) + dot(pos_2d, pomdp.slope) - 0.02
end



pomdp = MordorHikePOMDP()

# Use the POMCPOW solver
solver = POMCPOWSolver(criterion=MaxUCB(20.0))
planner = solve(solver, pomdp)

# Simulate
hr = HistoryRecorder(max_steps=200)
hist = simulate(hr, pomdp, planner)
for (s, b, a, r, sp, o) in eachstep(hist, "(s, b, a, r, sp, o)")
    @show s, a, r, sp
    println(isterminal(pomdp, s))
end

rhist = simulate(hr, pomdp, RandomPolicy(pomdp))
println("""
    Cumulative Discounted Reward (for 1 simulation)
        Random: $(discounted_reward(rhist))
        POMCPOW: $(discounted_reward(hist))
    """)

I need help figuring out reasonable hyperparameters for the solver on this problem. When I used the default number of tree queries, the rollout terminated before max_steps without reaching the terminal state.

himanshugupta1009 commented Nov 18, 2024

Hey Sai,

I can help you with this. Let's start with a few questions I have.

  1. Based on my understanding of your implementation, it doesn't seem like the state variable theta is important, as it remains unchanged throughout the simulation and does not affect anything. Is that intentional?

  2. Most tree solvers require a good rollout policy to guide the search toward high-value actions. Did you experiment with this? If not, I can help you implement one. I think a reasonable rollout policy would be to move directly to the goal location in a straight line (assuming full observability).

Let me know whenever you get a chance!

Edit: Also, I did some debugging, and it seems like the POMCPOW tree is reaching the terminal state during planning. Moreover, the simulation stops before executing max_steps because the agent reaches its goal, which is a terminal state in your implementation.
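
One quick way to confirm this on your end is to look at the last recorded state (just a sketch, assuming the state_hist and n_steps accessors from POMDPTools):

# Sketch: check that the simulation ended because the last state is terminal.
# Assumes `hist` is the SimHistory returned by simulate(hr, pomdp, planner).
last_state = state_hist(hist)[end]
println("steps executed: ", n_steps(hist))
println("last state is terminal: ", isterminal(pomdp, last_state))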


sai-prasanna commented Nov 19, 2024

  1. The state variable theta is important: although it does not change within an episode, it is picked randomly from four choices at the start. Different start angles can produce similar sub-sequences of observations unless the agent takes information-seeking actions that disambiguate them. There was a bug: I intended to make the movement directions depend on theta! Thanks for catching that. Here is the fixed code:
# import Pkg
# Pkg.add("POMDPs")
# Pkg.add("POMCPOW")
# Pkg.add("POMDPModels")
# Pkg.add("POMDPTools")
# Pkg.add("Distributions")
using POMDPs
using POMCPOW
using POMDPModels
using POMDPTools
using LinearAlgebra
using Distributions
using Random

mutable struct MordorHikePOMDP <: POMDP{Vector{Float64}, Int64, Float64}
    # Constants
    translate_step::Float64
    translate_std::Float64
    obs_std::Float64
    discount::Float64
    
    # Map bounds and positions
    map_lower_bound::Vector{Float64}
    map_upper_bound::Vector{Float64}
    fixed_start_pos::Vector{Float64}
    goal_position::Vector{Float64}
    
    # Mountain parameters
    slope::Vector{Float64}
    mvn_1::MvNormal
    mvn_2::MvNormal
    mvn_3::MvNormal

    # Constructor
    function MordorHikePOMDP()
        new(
            0.1,    # translate_step
            0.05,   # translate_std
            0.1,    # obs_std
            0.99,   # discount
            [-1.0, -1.0],  # map_lower_bound
            [1.0, 1.0],    # map_upper_bound
            [-0.8, -0.8],  # fixed_start_pos
            [0.8, 0.8],    # goal_position
            [0.2, 0.2],    # slope
            MvNormal([0.0, 0.0], [0.005 0.0; 0.0 1.0]),    # mvn_1
            MvNormal([0.0, -0.8], [1.0 0.0; 0.0 0.01]),    # mvn_2
            MvNormal([0.0, 0.8], [1.0 0.0; 0.0 0.01])      # mvn_3
        )
    end
end

# State: [x, y, θ]
# Action: 1=north, 2=south, 3=east, 4=west
# Observation: [altitude]

function POMDPs.actions(pomdp::MordorHikePOMDP)
    return [1, 2, 3, 4]
end

function POMDPs.initialstate(pomdp::MordorHikePOMDP)
    # Random rotation for medium difficulty
    return ImplicitDistribution(rng -> ([pomdp.fixed_start_pos[1], pomdp.fixed_start_pos[2], rand(rng, [0.0, π/2, π, 3π/2])]))
end

function POMDPs.reward(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    if isterminal(pomdp, s)
        return 0.0
    else
        return calculate_altitude(pomdp, sp[1:2])
    end
end
POMDPs.discount(pomdp::MordorHikePOMDP) = pomdp.discount

function POMDPs.isterminal(pomdp::MordorHikePOMDP, s::Vector{Float64})
    return norm(s[1:2] - pomdp.goal_position) <= 2 * pomdp.translate_step
end



function POMDPs.transition(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64)
    return ImplicitDistribution(function(rng)
        theta = s[3]
        
        # Create movement vector based on theta
        forward_vector = [cos(theta), sin(theta)]
        lateral_vector = [-sin(theta), cos(theta)]
        
        # Map actions to movement
        movement = if a == 1  # North
            forward_vector
        elseif a == 2  # South 
            -forward_vector
        elseif a == 3  # East
            lateral_vector
        else  # West
            -lateral_vector
        end
        
        next_pos = s[1:2] + movement * pomdp.translate_step
        
        # Add noise to position
        next_pos += rand(rng, Normal(0, pomdp.translate_std), 2)
        
        # Clip to bounds
        next_pos = clamp.(next_pos, pomdp.map_lower_bound, pomdp.map_upper_bound)
        
        # Keep the same theta
        return [next_pos[1], next_pos[2], theta]
    end)
end



function POMDPs.observation(pomdp::MordorHikePOMDP, a::Int64, sp::Vector{Float64})
    altitude = calculate_altitude(pomdp, sp[1:2])
    return Normal(altitude, pomdp.obs_std)
end

function POMDPs.observation(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    return observation(pomdp, a, sp)
end


function calculate_altitude(pomdp::MordorHikePOMDP, pos::Vector{Float64})
    # Convert pos to a 2-element vector to ensure correct dimensionality
    pos_2d = pos[1:2]  # Take only x,y coordinates if pos is longer
    mountains = [
        pdf(pomdp.mvn_1, pos_2d),
        pdf(pomdp.mvn_2, pos_2d),
        pdf(pomdp.mvn_3, pos_2d)
    ]
    altitude = maximum(mountains)
    return -exp(-altitude) + dot(pos_2d, pomdp.slope) - 0.02
end



pomdp = MordorHikePOMDP()

# Use the POMCPOW solver
solver = POMCPOWSolver(criterion=MaxUCB(20.0))
planner = solve(solver, pomdp)

# Simulate
hr = HistoryRecorder(max_steps=200)
hist = simulate(hr, pomdp, planner)
for (s, b, a, r, sp, o) in eachstep(hist, "(s, b, a, r, sp, o)")
    @show s, a, r, sp
    println(norm(s[1:2] - pomdp.goal_position))
end

rhist = simulate(hr, pomdp, RandomPolicy(pomdp))
println("""
    Cumulative Discounted Reward (for 1 simulation)
        Random: $(discounted_reward(rhist))
        POMCPOW: $(discounted_reward(hist))
    """)
  2. Hmm, I can implement the cross-entropy method (a black-box optimization) as a rollout policy, or maybe even a greedy one that goes directly to the goal. Is there an example of how to specify this in the API?

When I ran the above code and checked the trajectory (hist), the final state did not satisfy the isterminal condition of being near [0.8, 0.8]! Do you get a different result?

If I try POMCPOWSolver(criterion=MaxUCB(20.0), tree_queries=10000, max_depth=10), it runs for all 200 steps without solving the task, but performs better than random.
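
To make the comparison less noisy (the numbers above are from a single simulation), I am thinking of averaging the discounted return over a few independent runs, something like this (a sketch using only the objects defined above; mean_return is just a helper name):

using Statistics: mean

# Sketch: average discounted return over several independent simulations
# to compare POMCPOW against the random baseline less noisily.
function mean_return(policy, pomdp; n=10, max_steps=200)
    hr = HistoryRecorder(max_steps=max_steps)
    return mean(discounted_reward(simulate(hr, pomdp, policy)) for _ in 1:n)
end

println("Random:  ", mean_return(RandomPolicy(pomdp), pomdp))
println("POMCPOW: ", mean_return(planner, pomdp))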

sai-prasanna commented

Ok, sorry for so many updates. I reran the solver on the updated code. It solves the task sometimes, and sometimes it does not. I guess a non-random rollout policy would help here? Can you help me define it? I don't understand the API that well.


himanshugupta1009 commented Nov 19, 2024

Yeah, hopefully, a better rollout policy will increase the number of times it solves the task.

This would be the general layout for the rollout policy code. estimate_value is the parameter you use to specify the rollout policy/function.

struct GoToGoalPolicy{P} <: Policy
    pomdp::P
end

function POMDPs.action(p::GoToGoalPolicy, s)
    # ROLLOUT CODE TO RETURN AN ACTION FOR THE INPUT STATE s
end

rollout_policy = FORollout(GoToGoalPolicy(pomdp))

solver = POMCPOWSolver(
    estimate_value=rollout_policy,
    criterion=MaxUCB(20.0),
    tree_queries=10_000,
    max_depth=10
)

Let me know how well it performs after using a better rollout.
