Different results for single vs multi-thread #1766

Closed
DanielDoehring opened this issue Dec 3, 2023 · 11 comments
Labels: bug (Something isn't working), parallelization (Related to MPI, threading, tasks etc.)

Comments

@DanielDoehring
Contributor

DanielDoehring commented Dec 3, 2023

Consider this example derived from the MHD rotor:

using OrdinaryDiffEq
using Trixi


###############################################################################
# semidiscretization of the compressible ideal GLM-MHD equations
equations = IdealGlmMhdEquations2D(1.4)

"""
    initial_condition_rotor(x, t, equations::IdealGlmMhdEquations2D)

The classical MHD rotor test case. Here, the setup is taken from
- Dominik Derigs, Gregor J. Gassner, Stefanie Walch & Andrew R. Winters (2018)
  Entropy Stable Finite Volume Approximations for Ideal Magnetohydrodynamics
  [doi: 10.1365/s13291-018-0178-9](https://doi.org/10.1365/s13291-018-0178-9)
"""
function initial_condition_rotor(x, t, equations::IdealGlmMhdEquations2D)
  # setup taken from Derigs et al. DMV article (2018)
  # domain must be [0, 1] x [0, 1], γ = 1.4
  dx = x[1] - 0.5
  dy = x[2] - 0.5
  r = sqrt(dx^2 + dy^2)
  f = (0.115 - r)/0.015
  if r <= 0.1
    rho = 10.0
    v1 = -20.0*dy
    v2 = 20.0*dx
  elseif r >= 0.115
    rho = 1.0
    v1 = 0.0
    v2 = 0.0
  else
    rho = 1.0 + 9.0*f
    v1 = -20.0*f*dy
    v2 = 20.0*f*dx
  end
  v3 = 0.0
  p = 1.0
  B1 = 5.0/sqrt(4.0*pi)
  B2 = 0.0
  B3 = 0.0
  psi = 0.0
  return prim2cons(SVector(rho, v1, v2, v3, p, B1, B2, B3, psi), equations)
end
initial_condition = initial_condition_rotor

surface_flux = (flux_lax_friedrichs, flux_nonconservative_powell)
volume_flux  = (flux_hindenlang_gassner, flux_nonconservative_powell)
polydeg = 4
basis = LobattoLegendreBasis(polydeg)
indicator_sc = IndicatorHennemannGassner(equations, basis,
                                         alpha_max=0.5,
                                         alpha_min=0.001,
                                         alpha_smooth=true,
                                         variable=density_pressure)
volume_integral = VolumeIntegralShockCapturingHG(indicator_sc;
                                                 volume_flux_dg=volume_flux,
                                                 volume_flux_fv=surface_flux)
solver = DGSEM(basis, surface_flux, volume_integral)

coordinates_min = (0.0, 0.0)
coordinates_max = (1.0, 1.0)
mesh = TreeMesh(coordinates_min, coordinates_max,
                initial_refinement_level=4,
                n_cells_max=10_000,
                periodicity=true)

semi = SemidiscretizationHyperbolic(mesh, equations, initial_condition, solver)


###############################################################################
# ODE solvers, callbacks etc.

tspan = (0.0, 0.15)
ode = semidiscretize(semi, tspan)

summary_callback = SummaryCallback()

amr_indicator = IndicatorHennemannGassner(semi,
                                          alpha_max=0.5,
                                          alpha_min=0.001,
                                          alpha_smooth=false,
                                          variable=density_pressure)
# For density_pressure                                          
amr_controller = ControllerThreeLevel(semi, amr_indicator,
                                      base_level=3,
                                      med_level =7, med_threshold=0.0025,
                                      max_level =9, max_threshold=0.25)                                   
amr_callback = AMRCallback(semi, amr_controller,
                           interval = 8,
                           adapt_initial_condition=true,
                           adapt_initial_condition_only_refine=true)

cfl = 0.54 # ParsaniKetchesonDeconinck3S53

stepsize_callback = StepsizeCallback(cfl=cfl)

glm_speed_callback = GlmSpeedCallback(glm_scale=0.5, cfl=cfl)

callbacks = CallbackSet(summary_callback,
                        amr_callback,
                        stepsize_callback,
                        glm_speed_callback)

###############################################################################
# run the simulation
    
sol = solve(ode, ParsaniKetchesonDeconinck3S53(;thread = OrdinaryDiffEq.False());
            dt = 1.0,
            ode_default_options()..., callback=callbacks)
  

summary_callback() # print the timer summary

This does not run when I execute it single-threaded:

ERROR: DomainError with -7.2495041331637795:

And it gives yet another error message when run with 2, 3, or 4 threads:

ERROR: DomainError with -0.17505094228860651:

But it works, e.g., with 5 threads.

I was able to reproduce this on a different machine (Rocky Linux) and on Julia 1.9.3 (besides 1.9.4).

I am not sure what is going on here; most likely something in the interaction of the CFL, GLM, and AMR callbacks is not working right.
If the AMR callback is omitted, things run.
Essentially the same story also applies to different integrators such as our custom ones.

Can someone check whether this is reproducible, ideally on Mac or Windows?
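For reference, reproducing with different thread counts amounts to starting Julia with a different --threads value (a sketch; the elixir file name below is hypothetical):

# Hypothetical invocation (adjust the file name to wherever the elixir above is saved):
#   julia --threads=1 elixir_mhd_rotor_amr.jl   # fails with the DomainError above
#   julia --threads=5 elixir_mhd_rotor_amr.jl   # runs through
# The thread count of the current session can be checked with
println("Running with ", Threads.nthreads(), " thread(s)")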

@DanielDoehring DanielDoehring added the bug (Something isn't working), help wanted (Extra attention is needed), high-priority, and parallelization (Related to MPI, threading, tasks etc.) labels on Dec 3, 2023
@jlchan
Contributor

jlchan commented Dec 3, 2023

I can confirm it crashes with the same error on 1 thread but runs on 5 threads.

I'm running on an M2 ARM Mac with Julia 1.9.4.

@ranocha
Member

ranocha commented Dec 4, 2023

Can you maybe reduce it further?

  • Does it happen without AMR? (A sketch of this check follows below.)
  • Does it happen with fixed time step size?
  • ...

I'm also wondering why you set periodicity=true and pass BCs...
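A minimal sketch of that first check, reusing the names from the elixir above: drop amr_callback from the CallbackSet and rerun.

# Sketch: rerun the elixir above without the AMR callback to isolate its effect.
callbacks_no_amr = CallbackSet(summary_callback,
                               stepsize_callback,
                               glm_speed_callback)

sol = solve(ode, ParsaniKetchesonDeconinck3S53(; thread = OrdinaryDiffEq.False());
            dt = 1.0,
            ode_default_options()..., callback = callbacks_no_amr)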

@DanielDoehring
Contributor Author

No, without AMR things are fine.
As far as I know, GLM-MHD is not possible with a fixed time step since the GlmSpeedCallback always demands a CFL number. (Without the GlmSpeedCallback, the simulation crashes immediately.)

Sorry for the confusion with the BCs; originally I ran an example with outflow BCs but then simplified it.

@DanielDoehring
Contributor Author

DanielDoehring commented Dec 4, 2023

The simulations crash (at least) at the same timestep. If you save a restart file prior to that crashing timestep and then restart the simulations, they finish for some reason.

You can check that for the current simulation, e.g., with the following changes:

cfl = 0.42

save_restart = SaveRestartCallback(interval = 1600,
                                   save_final_restart = false)

sol = solve(ode, SSPRK33(;thread = OrdinaryDiffEq.False());
            dt = 1.0,
            ode_default_options()..., callback=callbacks)
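A sketch of the corresponding restart, mirroring the commented-out restart block in the elixir further below (the restart file name is an example of what SaveRestartCallback writes):

# Sketch: restart from the file written shortly before the crashing step.
restart_filename = "out/restart_001600.h5"  # example name; use the file actually written
mesh = load_mesh(restart_filename)

semi = SemidiscretizationHyperbolic(mesh, equations, initial_condition, solver)

tspan = (load_time(restart_filename), 0.15)
dt = load_dt(restart_filename)
ode = semidiscretize(semi, tspan, restart_filename)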

@DanielDoehring
Contributor Author

I found the place where this behaviour originates, but I am severely confused.

In the implementation of inv_ln_mean,

@inline function inv_ln_mean(x, y)
    epsilon_f2 = 1.0e-4
    f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
    if f2 < epsilon_f2
        return @evalpoly(f2, 2, 2/3, 2/5, 2/7) / (x + y)
    else
        return log(y / x) / (y - x)
    end
end

I added an if clause:

@inline function inv_ln_mean(x, y)
    epsilon_f2 = 1.0e-4
    f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
    if f2 < epsilon_f2
        return @evalpoly(f2, 2, 2/3, 2/5, 2/7) / (x + y)
    else
        if y/x < 0 # New
            println("y/x = ", y/x)
            println("log should crash, evaluating log:")
            println("log(y/x) = ", log(y/x))
        end
        return log(y / x) / (y - x)
    end
end

On one thread, this gives the expected outcome:

y/x = -96.03009844096532
log should crash, evaluating log:
ERROR: DomainError with -96.03009844096532:

On, e.g., 16 threads, however, I receive:

y/x = -96.03009844096532
log should crash, evaluating log:

and the simulation continues, i.e., it seems like the line println("log(y/x) = ", log(y/x)) has been skipped?
The same behaviour is observed when inv_ln_mean is not @inlined.

What is going on here?

The elixir to reproduce this is given below:

using OrdinaryDiffEq
using Trixi


###############################################################################
# semidiscretization of the compressible ideal GLM-MHD equations
equations = IdealGlmMhdEquations2D(1.4)

"""
    initial_condition_rotor(x, t, equations::IdealGlmMhdEquations2D)

The classical MHD rotor test case. Here, the setup is taken from
- Dominik Derigs, Gregor J. Gassner, Stefanie Walch & Andrew R. Winters (2018)
  Entropy Stable Finite Volume Approximations for Ideal Magnetohydrodynamics
  [doi: 10.1365/s13291-018-0178-9](https://doi.org/10.1365/s13291-018-0178-9)
"""
function initial_condition_rotor(x, t, equations::IdealGlmMhdEquations2D)
  # setup taken from Derigs et al. DMV article (2018)
  # domain must be [0, 1] x [0, 1], γ = 1.4
  dx = x[1] - 0.5
  dy = x[2] - 0.5
  r = sqrt(dx^2 + dy^2)
  f = (0.115 - r)/0.015
  if r <= 0.1
    rho = 10.0
    v1 = -20.0*dy
    v2 = 20.0*dx
  elseif r >= 0.115
    rho = 1.0
    v1 = 0.0
    v2 = 0.0
  else
    rho = 1.0 + 9.0*f
    v1 = -20.0*f*dy
    v2 = 20.0*f*dx
  end
  v3 = 0.0
  p = 1.0
  B1 = 5.0/sqrt(4.0*pi)
  B2 = 0.0
  B3 = 0.0
  psi = 0.0

  return prim2cons(SVector(rho, v1, v2, v3, p, B1, B2, B3, psi), equations)
end
initial_condition = initial_condition_rotor

surface_flux = (flux_lax_friedrichs, flux_nonconservative_powell)
volume_flux  = (flux_hindenlang_gassner, flux_nonconservative_powell)
polydeg = 4
basis = LobattoLegendreBasis(polydeg)
indicator_sc = IndicatorHennemannGassner(equations, basis,
                                         alpha_max=0.5,
                                         alpha_min=0.001,
                                         alpha_smooth=true,
                                         variable=density_pressure)
volume_integral = VolumeIntegralShockCapturingHG(indicator_sc;
                                                 volume_flux_dg=volume_flux,
                                                 volume_flux_fv=surface_flux)
solver = DGSEM(basis, surface_flux, volume_integral)

coordinates_min = (0.0, 0.0)
coordinates_max = (1.0, 1.0)
mesh = TreeMesh(coordinates_min, coordinates_max,
                initial_refinement_level=4,
                n_cells_max=10_000,
                periodicity=true)

semi = SemidiscretizationHyperbolic(mesh, equations, initial_condition, solver)

tspan = (0.0, 0.0891)
ode = semidiscretize(semi, tspan)                        

###############################################################################
# ODE solvers, callbacks etc.

#=
restart_filename = "out/restart_001639.h5"
mesh = load_mesh(restart_filename)

semi = SemidiscretizationHyperbolic(mesh, equations, initial_condition, solver)

tspan = (load_time(restart_filename), 0.089)
dt = load_dt(restart_filename)
ode = semidiscretize(semi, tspan, restart_filename);       
=#

summary_callback = SummaryCallback()

amr_indicator = IndicatorHennemannGassner(semi,
                                          alpha_max=0.5,
                                          alpha_min=0.001,
                                          alpha_smooth=false,
                                          variable=density_pressure)
# For density_pressure                                          
amr_controller = ControllerThreeLevel(semi, amr_indicator,
                                      base_level=3,
                                      med_level =7, med_threshold=0.0025,
                                      max_level =9, max_threshold=0.25)      
amr_callback = AMRCallback(semi, amr_controller,
                           interval = 8,
                           adapt_initial_condition=true,
                           adapt_initial_condition_only_refine=true)

cfl = 0.42 # SSPRK33

save_restart = SaveRestartCallback(interval = 1639,
                                   save_final_restart = false)

stepsize_callback = StepsizeCallback(cfl=cfl)

analysis_interval = 1600
analysis_callback = AnalysisCallback(semi, interval = analysis_interval)

glm_speed_callback = GlmSpeedCallback(glm_scale=0.5, cfl=cfl)

callbacks = CallbackSet(summary_callback,
                        amr_callback,
                        stepsize_callback,
                        analysis_callback,
                        #save_restart,
                        glm_speed_callback)

###############################################################################
# run the simulation

sol = solve(ode, SSPRK33(),
            dt=42.0,
            save_everystep=false, callback=callbacks);
  

summary_callback() # print the timer summary

@sloede
Member

sloede commented Dec 8, 2023

it seems like the line println("log(y/x) = ", log(y/x)) has been skipped?

It would seem to me that the line is not necessarily skipped (that would be super-duper weird) but rather that the exception that was thrown somehow got lost and did not properly surface (which is just regular weird).

Can you try running this again, but with regular Julia threads instead of Polyester.jl threads?

@DanielDoehring
Contributor Author

Can you try running this again, but with regular Julia threads instead of Polyester.jl threads?

I.e., Threads.@threads instead of @threaded?

@sloede
Member

sloede commented Dec 8, 2023

Yes. You can achieve this by modifying the @threaded macro to use the currently commented code block

# return esc(quote
#     let
#         if Threads.nthreads() == 1
#             $(expr)
#         else
#             Threads.@threads $(expr)
#         end
#     end
# end)

instead of the code a few lines below that.
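For reference, a rough sketch of the resulting macro with that block active (simplified, not the verbatim Trixi.jl source):

# Rough sketch: @threaded dispatching to regular Julia threads instead of Polyester.jl.
macro threaded(expr)
    return esc(quote
        let
            if Threads.nthreads() == 1
                $(expr)
            else
                Threads.@threads $(expr)
            end
        end
    end)
end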

@DanielDoehring
Contributor Author

Ah nice, yeah, now the program fails with

ERROR: TaskFailedException

    nested task error: DomainError with -96.03009844096532:

I'll check if I can create an MWE for the Polyester.jl guys.
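A minimal sketch of what such an MWE could look like, assuming Polyester.jl's @batch macro is the relevant entry point (this is not the MWE that was actually submitted):

# Sketch: does a DomainError thrown inside a Polyester.jl @batch loop surface,
# or does the computation silently continue?
using Polyester

function batch_with_error(n)
    x = zeros(n)
    @batch for i in 1:n
        # log of a negative argument throws a DomainError on the base thread;
        # the question is whether it also surfaces from @batch.
        x[i] = log(i == n ÷ 2 ? -1.0 : float(i))
    end
    return x
end

batch_with_error(64)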

@DanielDoehring
Contributor Author

For me, using ThreadingUtilities.jl at version v0.5.1 resolves this issue in the sense that at least the thread errors are printed (although the simulation still continues).
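A sketch of how one could pin that version in the active Julia environment, using standard Pkg commands:

# Sketch: make ThreadingUtilities.jl a direct dependency at v0.5.1 and pin it there.
using Pkg
Pkg.add(PackageSpec(name = "ThreadingUtilities", version = "0.5.1"))
Pkg.pin("ThreadingUtilities")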

@DanielDoehring
Contributor Author

I am closing this issue since it is caused by an upstream package and we found a workaround.
