Stan terminating with no error messages on CentOS Linux 7 (Core) cluster, but runs on my personal device (Ubuntu 22.04 LTS) #753

Open
Garren-H opened this issue May 21, 2024 · 0 comments


Description

Hi all. I ran a Stan program on a cluster running CentOS Linux 7 (Core), but Stan terminated without any warning or error message of its own.

The Stan code is:
   functions {
        vector NRTL(vector x, vector T, vector p12, vector p21, real a, matrix map_tij, matrix map_tij_dT) {
            int N = rows(x);
            vector[N] t12 = map_tij * p12;
            vector[N] t21 = map_tij * p21;
            vector[N] dt12_dT = map_tij_dT * p12;
            vector[N] dt21_dT = map_tij_dT * p21;   
            vector[N] at12 = a * t12;
            vector[N] at21 = a * t21;
            vector[N] G12 = exp(-at12);
            vector[N] G21 = exp(-at21);
            vector[N] term1 = ( ( (1-x) .* G12 .* (1 - at12) + x .* square(G12) ) ./ square((1-x) + x .* G12) ) .* dt12_dT;
            vector[N] term2 = ( ( x .* G21 .* (1 - at21) + (1-x) .* square(G21) ) ./ square(x + (1-x) .* G21) ) .* dt21_dT;
            return -8.314 * square(T) .* x .* (1-x) .* ( term1 + term2 );
        }

        real ps_like(array[] int N_slice, int start, int end, vector y, vector x, vector T, array[] matrix U_raw, 
          array[] matrix V_raw, vector v_ARD, vector v, vector scaling, real a, real error, array[] int N_points,
          array[,] int Idx_known, array[] matrix mapping, vector var_data) {
            real all_target = 0;
            for (i in start:end) {
                vector[4] p12_raw;
                vector[4] p21_raw;
                vector[N_points[i]] y_std = sqrt(var_data[sum(N_points[:i-1])+1:sum(N_points[:i])]+v[i]);
                vector[N_points[i]] y_means;

                for (j in 1:4) {
                    p12_raw[j] = dot_product(U_raw[j,:,Idx_known[i,1]] .* v_ARD, V_raw[j,:,Idx_known[i,2]]);
                    p21_raw[j] = dot_product(U_raw[j,:,Idx_known[i,2]] .* v_ARD, V_raw[j,:,Idx_known[i,1]]);
                }

                y_means = NRTL(x[sum(N_points[:i-1])+1:sum(N_points[:i])], 
                                T[sum(N_points[:i-1])+1:sum(N_points[:i])], 
                                p12_raw, p21_raw, a,
                                mapping[1][sum(N_points[:i-1])+1:sum(N_points[:i]),:],
                                mapping[2][sum(N_points[:i-1])+1:sum(N_points[:i]),:]);
                all_target += normal_lpdf(y[sum(N_points[:i-1])+1:sum(N_points[:i])] | y_means, y_std);
            }
            return all_target;
        }
    }


    data {
        int N_known;                    // number of known data points
        array[N_known] int N_points;    // number of data points in each known data set
        vector[sum(N_points)] x;        // mole fraction
        vector[sum(N_points)] T;        // temperature
        vector[sum(N_points)] y;        // excess enthalpy
        vector[4] scaling;              // scaling factor for NRTL parameter
        real a;                         // alpha value for NRTL model
        int grainsize;                  // grainsize for parallelization
        int N;                          // number of compounds
        int D;                          // number of features
        array[N_known,2] int Idx_known; // indices of known data points
        vector<lower=0>[N_known] v;     // known data-model variance
    }

    transformed data {
        real error = 0.01;                      // error in the data (fraction of experimental data)
        vector[sum(N_points)] var_data = square(error*y);    // variance of the data
        array[2] matrix[sum(N_points),4] mapping;           // temperature mapping
        array[N_known] int N_slice;             // slice indices for parallelization
    
        for (i in 1:N_known) {
            N_slice[i] = i;
        }

        mapping[1] = append_col(append_col(append_col(rep_vector(1.0, sum(N_points)), T),
                        1.0 ./ T), log(T));         // mapping for tij
        mapping[1] = mapping[1] .* rep_matrix(scaling', sum(N_points)); // scaling the mapping

        mapping[2] = append_col(append_col(append_col(rep_vector(0.0, sum(N_points)), rep_vector(1.0, sum(N_points))),
                        -1.0 ./ square(T)), 1.0 ./ T);    // mapping for dtij_dT
        mapping[2] = mapping[2] .* rep_matrix(scaling', sum(N_points)); // scaling the mapping
    }

    parameters {
        array[4] matrix[D,N] U_raw;       // feature matrices U
        array[4] matrix[D,N] V_raw;       // feature matrices V
        real<lower=0> scale;              // scale dictating the strength of the ARD effect
        vector<lower=0>[D] v_ARD;         // ARD variances arranged in increasing order with lower bound zero
    }
    
    model {
        // Gamma Prior for scale
        profile("Scale Prior"){
            scale ~ gamma(1e-9, 1e-9);
        }

        // ARD Exponential prior
        profile("ARD Prior"){
            v_ARD ~ exponential(scale);
        }
    
        // Priors for feature matrices
        profile("Feature Matrices"){
            for (i in 1:4) {
                to_vector(U_raw[i]) ~ std_normal();
                to_vector(V_raw[i]) ~ std_normal();
            }
        }
        
        // Likelihood function
        profile("Likelihood"){
            target += reduce_sum(ps_like, N_slice, grainsize, y, x, T, U_raw, 
                                    V_raw, v_ARD, v, scaling, a, error, N_points,
                                    Idx_known, mapping, var_data);
        }
    }
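
For anyone who wants to check the likelihood pieces outside Stan, here is a rough plain-Python sketch of what the `NRTL` function and the two temperature mappings above compute for a single data point, plus the row ranges that the `sum(N_points[:i-1])+1 : sum(N_points[:i])` slices select. The Python names (`tau_and_slope`, `nrtl_excess_enthalpy`, `segment_bounds`) and the numbers in the usage lines are made up for illustration; they are not taken from data.json.

```python
import math
from itertools import accumulate

R = 8.314  # same constant as in the Stan function


def tau_and_slope(T, p, scaling):
    """tau_ij(T) and d(tau_ij)/dT for one temperature.

    Mirrors one row of mapping[1] and mapping[2] times the 4 coefficients p.
    """
    basis = (1.0, T, 1.0 / T, math.log(T))        # columns of mapping[1]
    dbasis = (0.0, 1.0, -1.0 / T**2, 1.0 / T)     # columns of mapping[2]
    tau = sum(s * b * c for s, b, c in zip(scaling, basis, p))
    dtau = sum(s * b * c for s, b, c in zip(scaling, dbasis, p))
    return tau, dtau


def nrtl_excess_enthalpy(x, T, p12, p21, a, scaling=(1.0, 1.0, 1.0, 1.0)):
    """Scalar version of the Stan NRTL function for one (x, T) point."""
    t12, dt12 = tau_and_slope(T, p12, scaling)
    t21, dt21 = tau_and_slope(T, p21, scaling)
    G12 = math.exp(-a * t12)
    G21 = math.exp(-a * t21)
    term1 = ((1 - x) * G12 * (1 - a * t12) + x * G12**2) / ((1 - x) + x * G12) ** 2 * dt12
    term2 = (x * G21 * (1 - a * t21) + (1 - x) * G21**2) / (x + (1 - x) * G21) ** 2 * dt21
    return -R * T**2 * x * (1 - x) * (term1 + term2)


def segment_bounds(n_points):
    """0-based [start, stop) row ranges matching the 1-based Stan slices."""
    stops = list(accumulate(n_points))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


# Illustrative inputs only (hypothetical coefficients, not fitted values).
print(nrtl_excess_enthalpy(0.5, 298.15, (0.5, 0.0, 100.0, 0.0), (0.3, 0.0, -50.0, 0.0), 0.3))
print(segment_bounds([3, 2, 4]))  # [(0, 3), (3, 5), (5, 9)]
```

Comparing a few points from this sketch against the Stan function on both machines would be one way to rule out the arithmetic itself as the source of the difference.
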
The model compiled and even completed the preliminary gradient evaluations. The relevant Python code is:
print('Step1: Sampling sort chain using random initialization')
fit = model.sample(data=f'{path}/data.json', output_dir=output_dir1,
                        refresh=1, iter_warmup=5000, 
                        iter_sampling=1000, chains=chains, parallel_chains=chains, 
                        threads_per_chain=threads_per_chain, max_treedepth=5,
                        metric='dense_e', save_profile=True, sig_figs=18,
                        show_console=True)
The Stan `.txt` output file shows:
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 5000
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000 (Default)
      delta = 0.800000 (Default)
      kappa = 0.750000 (Default)
      t0 = 10.000000 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
      save_metric = 0 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 5
        metric = dense_e
        metric_file =  (Default)
        stepsize = 1.000000 (Default)
        stepsize_jitter = 0.000000 (Default)
    num_chains = 8
id = 1 (Default)
data
  file = Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/data.json
init = 2 (Default)
random
  seed = 96157
output
  file = /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809.csv
  diagnostic_file =  (Default)
  refresh = 1
  sig_figs = 18
  profile_file = /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809-profile.csv
  save_cmdstan_config = 0 (Default)
num_threads = 24 (Default)


Gradient evaluation took 0.009228 seconds
1000 transitions using 10 leapfrog steps per transition would take 92.28 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.00178 seconds
1000 transitions using 10 leapfrog steps per transition would take 17.8 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.001501 seconds
1000 transitions using 10 leapfrog steps per transition would take 15.01 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.001735 seconds
1000 transitions using 10 leapfrog steps per transition would take 17.35 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.001593 seconds
1000 transitions using 10 leapfrog steps per transition would take 15.93 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.001275 seconds
1000 transitions using 10 leapfrog steps per transition would take 12.75 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.001483 seconds
1000 transitions using 10 leapfrog steps per transition would take 14.83 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.00134 seconds
1000 transitions using 10 leapfrog steps per transition would take 13.4 seconds.
Adjust your expectations accordingly!
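
As a rough sanity check on how long this run should have taken, the timing lines above are just gradient time × transitions × leapfrog steps. The sketch below reproduces that arithmetic and, assuming the gradient cost stays near the slowest measured value, gives a ceiling on the gradient work for 6000 iterations at `max_depth = 5` (a NUTS tree of depth d uses at most 2^d - 1 leapfrog steps). The function names are my own, and the estimate ignores adaptation and I/O overhead.

```python
def estimated_seconds(grad_seconds, transitions, leapfrog_steps):
    """Same arithmetic as Stan's 'Gradient evaluation took ...' message."""
    return grad_seconds * transitions * leapfrog_steps


def max_leapfrog_steps(max_depth):
    """Upper limit on leapfrog steps per NUTS transition at a given tree depth."""
    return 2**max_depth - 1


# Reproduces the first timing message in the log.
print(round(estimated_seconds(0.009228, 1000, 10), 2))  # 92.28

# Ceiling for this run: 5000 warmup + 1000 sampling iterations, max_depth = 5.
print(estimated_seconds(0.009228, 5000 + 1000, max_leapfrog_steps(5)))
```

The second figure comes out at roughly 1716 seconds per chain, so a run that stops about a second after starting clearly never got into sampling.
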



And the output from `show_console=True` is:
Evaluating the following conditions for the Hybrid Model:
Include clusters: False
Variance known: True
Lower rank of feature matrices: 1


Step1: Sampling sort chain using random initialization
method = sample (Default)
sample
num_samples = 1000 (Default)
num_warmup = 5000
save_warmup = 0 (Default)
thin = 1 (Default)
adapt
engaged = 1 (Default)
gamma = 0.050000 (Default)
delta = 0.800000 (Default)
kappa = 0.750000 (Default)
t0 = 10.000000 (Default)
init_buffer = 75 (Default)
term_buffer = 50 (Default)
window = 25 (Default)
save_metric = 0 (Default)
algorithm = hmc (Default)
hmc
engine = nuts (Default)
nuts
max_depth = 5
metric = dense_e
metric_file =  (Default)
stepsize = 1.000000 (Default)
stepsize_jitter = 0.000000 (Default)
num_chains = 8
id = 1 (Default)
data
file = Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/data.json
init = 2 (Default)
random
seed = 96157
output
file = /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809.csv
diagnostic_file =  (Default)
refresh = 1
sig_figs = 18
profile_file = /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809-profile.csv
save_cmdstan_config = 0 (Default)
num_threads = 24 (Default)


Gradient evaluation took 0.009228 seconds
1000 transitions using 10 leapfrog steps per transition would take 92.28 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.00178 seconds
1000 transitions using 10 leapfrog steps per transition would take 17.8 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.001501 seconds
1000 transitions using 10 leapfrog steps per transition would take 15.01 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.001735 seconds
1000 transitions using 10 leapfrog steps per transition would take 17.35 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.001593 seconds
1000 transitions using 10 leapfrog steps per transition would take 15.93 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.001275 seconds
1000 transitions using 10 leapfrog steps per transition would take 12.75 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.001483 seconds
1000 transitions using 10 leapfrog steps per transition would take 14.83 seconds.
Adjust your expectations accordingly!



Gradient evaluation took 0.00134 seconds
1000 transitions using 10 leapfrog steps per transition would take 13.4 seconds.
Adjust your expectations accordingly!

The standard error file displays:
21:47:31 - cmdstanpy - INFO - compiling stan file /home/ghermanus/lustre/tmphhck335m/tmpvaov4lbc.stan to exe file /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Hybrid_PMF
21:48:09 - cmdstanpy - INFO - compiled model executable: /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Hybrid_PMF
21:48:09 - cmdstanpy - INFO - CmdStan start processing
21:48:10 - cmdstanpy - INFO - CmdStan done processing
21:48:10 - cmdstanpy - ERROR - CmdStan error: terminated by signal 11 Unknown error -11
Traceback (most recent call last):
  File "/mnt/lustre/users/ghermanus/Hybrid PMF/Hybrid_PMF.py", line 147, in <module>
    fit = model.sample(data=f'{path}/data.json', output_dir=output_dir1,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghermanus/cmdstan_condaforge/lib/python3.12/site-packages/cmdstanpy/model.py", line 1136, in sample
    raise RuntimeError(msg)
RuntimeError: Error during sampling:

Command and output files:
RunSet: chains=8, chain_ids=[1, 2, 3, 4, 5, 6, 7, 8], num_processes=1
 cmd (chain 1):
	['/mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Hybrid_PMF', 'id=1', 'random', 'seed=96157', 'data', 'file=Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/data.json', 'output', 'file=/mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809.csv', 'profile_file=/mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809-profile.csv', 'refresh=1', 'sig_figs=18', 'method=sample', 'num_samples=1000', 'num_warmup=5000', 'algorithm=hmc', 'engine=nuts', 'max_depth=5', 'metric=dense_e', 'adapt', 'engaged=1', 'num_chains=8']
 retcodes=[-11]
 per-chain output files (showing chain 1 only):
 csv_file:
	/mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809_1.csv
 profile_file:
	/mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809-profile_1.csv
 console_msgs (if any):
	/mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809-stdout.txt
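
For reference, `retcodes=[-11]` follows the Python `subprocess` convention in which a negative return code -N means the child process was killed by signal N; signal 11 on Linux is SIGSEGV, a segmentation fault. A small sketch for decoding such codes (the helper name `describe_returncode` is hypothetical, not part of cmdstanpy):

```python
import signal


def describe_returncode(retcode):
    """Describe a subprocess-style return code; negative means killed by a signal."""
    if retcode < 0:
        try:
            name = signal.Signals(-retcode).name
        except ValueError:
            name = "unknown signal"  # number not defined on this platform
        return f"terminated by signal {-retcode} ({name})"
    return f"exited with status {retcode}"


print(describe_returncode(-11))
```

On Linux this prints `terminated by signal 11 (SIGSEGV)`, so the process crashed on its own rather than being stopped by the scheduler.
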

Nothing indicates that PBS terminated the job either. I currently have the same code running on the server with different values of D; the case above is for D=1. The command `qstat -fx <JOBID>` yields the comment

comment = Job run at Tue May 21 at 21:47 on (cnode0897:ncpus=24:mem=15728640kb) and finished

This indicates that neither the admins nor I terminated the job.

Running the job on my own device with the same data (and the same seed that failed on the cluster) does not reproduce the error. I have attached a JSON file with the data used.
data.json

Current Version:

cluster:
cmdstan 2.34.0 hff4ab46_0 conda-forge
cmdstanpy 1.2.2 pyhd8ed1ab_0 conda-forge
