Need help in defining and solving a POMDP: Varying Mountainhike #43

sai-prasanna opened this issue Nov 8, 2024 · 4 comments

sai-prasanna commented Nov 8, 2024

I am working on a POMDP called VaryingMountainHike (which I have redubbed MordorHike 😉). It was introduced here. It is basically a 2D environment with mountains defined by 3 Gaussians. The agent has to get from the bottom left to the top right. The rewards are always negative, and they are least negative on top of the mountains.

I am totally new to Julia and POMDP planners, so bear with me if I have made some mistakes in the implementation.

using POMDPs
using POMCPOW
using POMDPModels
using POMDPTools
using LinearAlgebra
using Distributions
using Random

mutable struct MordorHikePOMDP <: POMDP{Vector{Float64}, Int64, Float64}
    # Constants
    translate_step::Float64
    translate_std::Float64
    obs_std::Float64
    discount::Float64
    
    # Map bounds and positions
    map_lower_bound::Vector{Float64}
    map_upper_bound::Vector{Float64}
    fixed_start_pos::Vector{Float64}
    goal_position::Vector{Float64}
    
    # Mountain parameters
    slope::Vector{Float64}
    mvn_1::MvNormal
    mvn_2::MvNormal
    mvn_3::MvNormal

    # Constructor
    function MordorHikePOMDP()
        new(
            0.1,    # translate_step
            0.05,   # translate_std
            0.1,    # obs_std
            0.99,   # discount
            [-1.0, -1.0],  # map_lower_bound
            [1.0, 1.0],    # map_upper_bound
            [-0.8, -0.8],  # fixed_start_pos
            [0.8, 0.8],    # goal_position
            [0.2, 0.2],    # slope
            MvNormal([0.0, 0.0], [0.005 0.0; 0.0 1.0]),    # mvn_1
            MvNormal([0.0, -0.8], [1.0 0.0; 0.0 0.01]),    # mvn_2
            MvNormal([0.0, 0.8], [1.0 0.0; 0.0 0.01])      # mvn_3
        )
    end
end

# State: [x, y, θ]
# Action: 1=north, 2=south, 3=east, 4=west
# Observation: [altitude]

function POMDPs.actions(pomdp::MordorHikePOMDP)
    return [1, 2, 3, 4]
end

function POMDPs.initialstate(pomdp::MordorHikePOMDP)
    # Random rotation for medium difficulty
    return ImplicitDistribution(rng -> ([pomdp.fixed_start_pos[1], pomdp.fixed_start_pos[2], rand(rng, [0.0, π/2, π, 3π/2])]))
end

function POMDPs.reward(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    if isterminal(pomdp, s)
        return 0.0
    else
        return calculate_altitude(pomdp, sp[1:2])
    end
end
POMDPs.discount(pomdp::MordorHikePOMDP) = pomdp.discount

function POMDPs.isterminal(pomdp::MordorHikePOMDP, s::Vector{Float64})
    return norm(s[1:2] - pomdp.goal_position) <= 2 * pomdp.translate_step
end



function POMDPs.transition(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64)
    return ImplicitDistribution(function(rng)
        directions = Dict(
            1 => [0.0, 1.0],  # North
            2 => [0.0, -1.0], # South
            3 => [1.0, 0.0],  # East
            4 => [-1.0, 0.0]  # West
        )
        
        # Apply movement
        direction = directions[a]
        next_pos = s[1:2] + direction * pomdp.translate_step
        
        # Add noise to position
        next_pos += rand(rng, Normal(0, pomdp.translate_std), 2)
        
        # Clip to bounds
        next_pos = clamp.(next_pos, pomdp.map_lower_bound, pomdp.map_upper_bound)
        
        # Keep the same theta
        return [next_pos[1], next_pos[2], s[3]]
    end)
end



function POMDPs.observation(pomdp::MordorHikePOMDP, a::Int64, sp::Vector{Float64})
    altitude = calculate_altitude(pomdp, sp[1:2])
    return Normal(altitude, pomdp.obs_std)
end

function POMDPs.observation(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    return observation(pomdp, a, sp)
end


function calculate_altitude(pomdp::MordorHikePOMDP, pos::Vector{Float64})
    # Convert pos to a 2-element vector to ensure correct dimensionality
    pos_2d = pos[1:2]  # Take only x,y coordinates if pos is longer
    mountains = [
        pdf(pomdp.mvn_1, pos_2d),
        pdf(pomdp.mvn_2, pos_2d),
        pdf(pomdp.mvn_3, pos_2d)
    ]
    altitude = maximum(mountains)
    return -exp(-altitude) + dot(pos_2d, pomdp.slope) - 0.02
end



pomdp = MordorHikePOMDP()

# Use the POMCPOW solver
solver = POMCPOWSolver(criterion=MaxUCB(20.0))
planner = solve(solver, pomdp)

# Simulate
hr = HistoryRecorder(max_steps=200)
hist = simulate(hr, pomdp, planner)
for (s, b, a, r, sp, o) in eachstep(hist, "(s, b, a, r, sp, o)")
    @show s, a, r, sp
    println(isterminal(pomdp, s))
end

rhist = simulate(hr, pomdp, RandomPolicy(pomdp))
println("""
    Cumulative Discounted Reward (for 1 simulation)
        Random: $(discounted_reward(rhist))
        POMCPOW: $(discounted_reward(hist))
    """)

I need help figuring out reasonable hyperparameters for the solver on this problem. When I used the default number of tree queries, the rollout terminated before max_steps without reaching the terminal state.

himanshugupta1009 commented Nov 18, 2024

Hey Sai,

I can help you with this. Let's start with a few questions I have.

  1. Based on my understanding of your implementation, it doesn't seem like the state variable theta is important, as it remains unchanged throughout the simulation and does not affect anything. Is that intentional?

  2. Most tree solvers require a good rollout policy to guide the search toward high-value actions. Did you experiment with this? If not, I can help you implement one. I think a reasonable rollout policy would be to move directly to the goal location in a straight line (assuming full observability).

Let me know whenever you get a chance!

Edit: Also, I did some debugging, and it seems like the POMCPOW tree is reaching the terminal state during planning. Moreover, the simulation stops before executing max_steps because the agent reaches its goal, which is a terminal state in your implementation.
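
One quick way to confirm this on your end is to look at the last recorded state (just a sketch, assuming the state_hist and n_steps accessors from POMDPTools):

# Sketch: check that the simulation ended because the last state is terminal.
# Assumes `hist` is the SimHistory returned by simulate(hr, pomdp, planner).
last_state = state_hist(hist)[end]
println("steps executed: ", n_steps(hist))
println("last state is terminal: ", isterminal(pomdp, last_state))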


sai-prasanna commented Nov 19, 2024

  1. The state variable theta is important: although it does not change within an episode, it is picked randomly from four choices at the start. Different start angles can produce similar sub-sequences of observations unless the agent takes information-seeking actions that disambiguate them. There was a bug: I intended to make the movement directions depend on theta! Thanks for catching that. Here is the fixed code:
# import Pkg
# Pkg.add("POMDPs")
# Pkg.add("POMCPOW")
# Pkg.add("POMDPModels")
# Pkg.add("POMDPTools")
# Pkg.add("Distributions")
using POMDPs
using POMCPOW
using POMDPModels
using POMDPTools
using LinearAlgebra
using Distributions
using Random

mutable struct MordorHikePOMDP <: POMDP{Vector{Float64}, Int64, Float64}
    # Constants
    translate_step::Float64
    translate_std::Float64
    obs_std::Float64
    discount::Float64
    
    # Map bounds and positions
    map_lower_bound::Vector{Float64}
    map_upper_bound::Vector{Float64}
    fixed_start_pos::Vector{Float64}
    goal_position::Vector{Float64}
    
    # Mountain parameters
    slope::Vector{Float64}
    mvn_1::MvNormal
    mvn_2::MvNormal
    mvn_3::MvNormal

    # Constructor
    function MordorHikePOMDP()
        new(
            0.1,    # translate_step
            0.05,   # translate_std
            0.1,    # obs_std
            0.99,   # discount
            [-1.0, -1.0],  # map_lower_bound
            [1.0, 1.0],    # map_upper_bound
            [-0.8, -0.8],  # fixed_start_pos
            [0.8, 0.8],    # goal_position
            [0.2, 0.2],    # slope
            MvNormal([0.0, 0.0], [0.005 0.0; 0.0 1.0]),    # mvn_1
            MvNormal([0.0, -0.8], [1.0 0.0; 0.0 0.01]),    # mvn_2
            MvNormal([0.0, 0.8], [1.0 0.0; 0.0 0.01])      # mvn_3
        )
    end
end

# State: [x, y, θ]
# Action: 1=north, 2=south, 3=east, 4=west
# Observation: [altitude]

function POMDPs.actions(pomdp::MordorHikePOMDP)
    return [1, 2, 3, 4]
end

function POMDPs.initialstate(pomdp::MordorHikePOMDP)
    # Random rotation for medium difficulty
    return ImplicitDistribution(rng -> ([pomdp.fixed_start_pos[1], pomdp.fixed_start_pos[2], rand(rng, [0.0, π/2, π, 3π/2])]))
end

function POMDPs.reward(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    if isterminal(pomdp, s)
        return 0.0
    else
        return calculate_altitude(pomdp, sp[1:2])
    end
end
POMDPs.discount(pomdp::MordorHikePOMDP) = pomdp.discount

function POMDPs.isterminal(pomdp::MordorHikePOMDP, s::Vector{Float64})
    return norm(s[1:2] - pomdp.goal_position) <= 2 * pomdp.translate_step
end



function POMDPs.transition(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64)
    return ImplicitDistribution(function(rng)
        theta = s[3]
        
        # Create movement vector based on theta
        forward_vector = [cos(theta), sin(theta)]
        lateral_vector = [-sin(theta), cos(theta)]
        
        # Map actions to movement
        movement = if a == 1  # North
            forward_vector
        elseif a == 2  # South 
            -forward_vector
        elseif a == 3  # East
            lateral_vector
        else  # West
            -lateral_vector
        end
        
        next_pos = s[1:2] + movement * pomdp.translate_step
        
        # Add noise to position
        next_pos += rand(rng, Normal(0, pomdp.translate_std), 2)
        
        # Clip to bounds
        next_pos = clamp.(next_pos, pomdp.map_lower_bound, pomdp.map_upper_bound)
        
        # Keep the same theta
        return [next_pos[1], next_pos[2], theta]
    end)
end



function POMDPs.observation(pomdp::MordorHikePOMDP, a::Int64, sp::Vector{Float64})
    altitude = calculate_altitude(pomdp, sp[1:2])
    return Normal(altitude, pomdp.obs_std)
end

function POMDPs.observation(pomdp::MordorHikePOMDP, s::Vector{Float64}, a::Int64, sp::Vector{Float64})
    return observation(pomdp, a, sp)
end


function calculate_altitude(pomdp::MordorHikePOMDP, pos::Vector{Float64})
    # Convert pos to a 2-element vector to ensure correct dimensionality
    pos_2d = pos[1:2]  # Take only x,y coordinates if pos is longer
    mountains = [
        pdf(pomdp.mvn_1, pos_2d),
        pdf(pomdp.mvn_2, pos_2d),
        pdf(pomdp.mvn_3, pos_2d)
    ]
    altitude = maximum(mountains)
    return -exp(-altitude) + dot(pos_2d, pomdp.slope) - 0.02
end



pomdp = MordorHikePOMDP()

# Use the POMCPOW solver
solver = POMCPOWSolver(criterion=MaxUCB(20.0))
planner = solve(solver, pomdp)

# Simulate
hr = HistoryRecorder(max_steps=200)
hist = simulate(hr, pomdp, planner)
for (s, b, a, r, sp, o) in eachstep(hist, "(s, b, a, r, sp, o)")
    @show s, a, r, sp
    println(norm(s[1:2] - pomdp.goal_position))
end

rhist = simulate(hr, pomdp, RandomPolicy(pomdp))
println("""
    Cumulative Discounted Reward (for 1 simulation)
        Random: $(discounted_reward(rhist))
        POMCPOW: $(discounted_reward(hist))
    """)
  2. Hmm, I can implement the cross-entropy method (a black-box optimization) as a rollout policy, or maybe even a greedy one that goes directly to the goal. Is there an example of how to specify this in the API?

When I ran the above code and checked the trajectory (hist), the final state did not satisfy the isterminal condition of being near [0.8, 0.8]! Do you get a different result?

If I try POMCPOWSolver(criterion=MaxUCB(20.0), tree_queries=10000, max_depth=10), it runs for all 200 steps without solving the task, but performs better than random.
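
To make the comparison less noisy (the numbers above are from a single simulation), I am thinking of averaging the discounted return over a few independent runs, something like this (a sketch using only the objects defined above; mean_return is just a helper name):

using Statistics: mean

# Sketch: average discounted return over several independent simulations
# to compare POMCPOW against the random baseline less noisily.
function mean_return(policy, pomdp; n=10, max_steps=200)
    hr = HistoryRecorder(max_steps=max_steps)
    return mean(discounted_reward(simulate(hr, pomdp, policy)) for _ in 1:n)
end

println("Random:  ", mean_return(RandomPolicy(pomdp), pomdp))
println("POMCPOW: ", mean_return(planner, pomdp))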

sai-prasanna commented

Ok, sorry for so many updates. I reran the solver on the updated code. It solves the task sometimes, and sometimes it does not. I guess a non-random rollout policy would help here? Can you help me define it? I don't understand the API that well.


himanshugupta1009 commented Nov 19, 2024

Yeah, hopefully, a better rollout policy will increase the number of times it solves the task.

This would be the general layout for the rollout policy code. estimate_value is the parameter you use to specify the rollout policy/function.

struct GoToGoalPolicy{P} <: Policy
    pomdp::P
end

function POMDPs.action(p::GoToGoalPolicy, s)
    # ROLLOUT CODE TO RETURN AN ACTION FOR THE INPUT STATE s
end

rollout_policy = FORollout(GoToGoalPolicy(pomdp))

solver = POMCPOWSolver(
    estimate_value=rollout_policy,
    criterion=MaxUCB(20.0),
    tree_queries=10_000,
    max_depth=10
)

Let me know how well it performs after using a better rollout.
