RealTrace
Version: 0.4.4
The following two libraries are needed:
- nlopt (for minimization)
  - see the nlopt documentation for installation details
- Eigen (for linear algebra)
  - see the Eigen documentation for installation details
Make sure the correct paths to the two libraries are set in the Makefile. Currently, both are assumed to be located in the home directory. Then, compile with:

```sh
cd src; make local
```
- Install nlopt
  - The following will install nlopt in the home directory with static linking. You can change that via `-DCMAKE_INSTALL_PREFIX`, but make sure to adjust the Makefile accordingly!
```sh
# On the cluster: reset and load the required modules (ml is the Lmod shorthand for module)
ml purge
ml GCC/8.3.0
ml Eigen/3.3.7
ml CMake/3.15.3-GCCcore-8.3.0
# Download, build, and install nlopt as a static library to ~/nlopt
wget https://github.com/stevengj/nlopt/archive/v2.7.1.tar.gz
tar -xf v2.7.1.tar.gz
cd nlopt-2.7.1/
cmake -DCMAKE_INSTALL_PREFIX=~/nlopt -DBUILD_SHARED_LIBS=OFF .
make
make install
```
- Compile
  - Clone this repository and navigate to the `src` directory within the repository
  - Run `make cluster`. This will run `ml GCC/8.3.0; ml Eigen/3.3.7` as well as the compile command. Note that the modules remain loaded after compilation.
```sh
cd bin
./RealTrace [-options]
```

with the following options:
```
-h, --help                      this help message
-i, --infile                    (required) input data file
-b, --parameter_bounds          (required) file(s) setting the type, step, bounds of the parameters
-c, --csv_config                file that sets the columns that will be used from the input file
-l, --print_level               print level {0,1,2}, default: 0
-o, --outdir                    specify output directory and do not use default
-t, --tolerance_maximization    absolute tolerance of maximization between optimization steps, default: 1e-10
-r, --rel_tolerance_joints      relative tolerance of joint calculation, default: 1e-10
-space, --search_space          search parameter space in {'log'|'linear'} space, default: 'log'
-noise, --noise_model           measurement noise of fp content {'const'|'scaled'}, default: 'const'
-div, --cell_division           cell division model {'gauss'|'binomial'}, default: 'gauss'
-m, --maximize                  run maximization
-s, --scan                      run 1d parameter scan
-p, --predict                   run prediction
-j, --joints                    run calculation of joint probabilities
```
Example: `./RealTrace -c csv_config.txt -b parameters.txt -i data/example.csv -o out/ -l 1 -t 1e-10 -m -p`
- `infile` sets the input file that contains the data, e.g. as given by MOMA (see 2.1.1)
- `parameter_bounds` sets the file(s) that define the parameter space (see 2.1.2)
The input file is assumed to fulfill the following:
- The data points of a cell appear as consecutive rows and are in the correct order with respect to time.
- The data set has to include all columns that are set via the `csv_config` file, i.e. `time_col`, `length_col`, `fp_col`.
- The cells can be uniquely identified via the tags provided via `parent_tags` and `cell_tags`, and each mother cell has at most 2 daughter cells. If that is not the case, the `parent_tags` and `cell_tags` are not sufficient and a warning will be printed.
- In order to estimate the initial covariance matrix, the data set needs to contain at least (!) 2 cells.
- An optional column may be added for the usage of segments, see below for more information.
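For illustration, a minimal input file could look like this (a sketch only; the column names match the `csv_config` example below and are not mandatory):

```csv
cell_id,parent_id,time_min,length_um,GFP
1,0,0.0,2.1,105.0
1,0,6.0,2.3,118.0
2,1,12.0,1.2,61.0
2,1,18.0,1.4,68.0
```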
How the different parameters are treated during the likelihood maximization is defined by the following syntax:
- `free_parameter = init, step`
- `bound_parameter = init, step, lower, upper`
- `fixed_parameter = init`
An example file can look like this:
```
mean_lambda = 0.01, 1e-3
gamma_lambda = 0.01, 1e-3, 1e-4, 0.05
var_lambda = 1e-07
mean_q = 10, 1e-1
gamma_q = 0.01, 1e-3, 1e-4, 0.05
var_q = 1, 1e-2
beta = 5e-2
var_x = 1e-3, 1e-5
var_g = 1, 1e-3
var_dx = 1e-4, 1e-5
var_dg = 1, 1e-2
```
ALL parameters are restricted to positive numbers by default, avoiding unphysical/meaningless parameter ranges. By setting the lower bound in the parameter file, one can overwrite this default lower bound.
During the maximization, the step will be the initial step size. From the nlopt documentation: "For derivative-free local-optimization algorithms, the optimizer must somehow decide on some initial step size to perturb x by when it begins the optimization. This step size should be big enough that the value of the objective changes significantly, but not too big if you want to find the local optimum nearest to x."
To analyze data sets that contain data points that shall be fitted with different sets of underlying parameters, segment indices can be used. For that, a `segment_col` can be specified in the `csv_config` file. This column should contain the segment index, specifying for each data point to which segment it belongs. The segment indices are required to be consecutive and start at index 0.
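For instance, with `segment_col = segment` set in the `csv_config` file (the column name `segment` is hypothetical), the relevant part of the data could look like this:

```csv
time_min,length_um,GFP,segment
0.0,2.1,105.0,0
6.0,2.3,118.0,0
12.0,2.6,131.0,1
18.0,2.9,140.0,1
```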
The likelihood maximization that determines the parameter estimates is run independently for each segment. That means there is no difference between running different segments in separate runs or as part of the same data set. The same behavior is used for 1d scans. However, the predictions as well as the calculation of the joint probabilities that are used for the correlation functions are calculated by iterating through the entire data set. For that, the following scheme is used, where each step involves two calculations: the calculation of the new prior, which depends on the parameters of the biophysical model, and the calculation of the posterior, which depends on the parameters of the measurement noise. Note that the prior calculations to go from time point 2 to 3 and vice versa both take the parameters of the 0th segment.
For each segment in the data set, one parameter file is required; the files are submitted in the order of the segment indices. For example:

```sh
./RealTrace -b parametersA.txt parametersB.txt ...
```

will use the parameters in the file `parametersA.txt` for the segment with index 0, the parameters in the file `parametersB.txt` for the segment with index 1, etc.
- `csv_config` sets the file that contains information on which columns will be used from the input file (see 2.3.1)
- `print_level` (default: 0): 0 suppresses the output of the likelihood calculation, 1 prints every step of the maximization/scan/error bar calculation. This is purely meant for debugging!
- `tolerance_maximization` (default: 1e-10) sets the stopping criterion by setting the tolerance of the maximization: stop when an optimization step changes the function value by less than the tolerance. By setting very low tolerances, one might encounter rounding issues; in that case, the last valid step is taken and a warning is printed to stderr.
- `rel_tolerance_joints` (default: 1e-10) sets the stopping criterion for the joint calculation. The calculation is stopped when the cross-covariances between the two time points are smaller than the product of the corresponding means times the set tolerance: $\frac{\text{Cov}(z_{n+m}, z_n)_{i,j}}{\langle z_{n+m}\rangle_i \langle z_n\rangle_j} < \text{tolerance}$
- `outdir` overwrites the default output directory, which is `dir/example_out/` given the infile `dir/example.csv`
- `search_space` (default: 'log') sets the search space of the parameters to be either in log space or linear space. The parameter file does not need to be changed, as everything is done internally.
- `noise_model` (default: 'const') defines how the measurement noise depends on the content of fluorescence proteins. `const` means that the measurement noise is constant with variance `var_g`. `scaled` means the variance of the measurement scales linearly with the fluorescence protein content; in this case, `var_g` is the prefactor of the scaling (see the sketch below).
- `cell_division` (default: 'gauss') defines the model for cell division. `binomial` splits the FP content according to the cell sizes of the daughter cells and binomial sampling; in this case, the parameter `var_dg` is the conversion factor between the FP input and the physical number of independent molecules that can be distributed across the cells. `gauss` refers to a model where the FP contents of the daughter cells are drawn from a gaussian with variance `var_dg`, centered around half of the mother cell FP content.
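As a sketch, the two noise models correspond to the following variance of the measured FP content $g^{\text{obs}}$ (notation introduced here for illustration, following the descriptions above):

$$
\text{Var}\left[g^{\text{obs}}\right] =
\begin{cases}
\texttt{var\_g} & \text{(noise model 'const')} \\
\texttt{var\_g} \cdot g & \text{(noise model 'scaled')}
\end{cases}
$$

An example `csv_config` file can look like this: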
```
time_col = time_min
rescale_time = 60
length_col = length_um
fp_col = GFP
```
The following settings define how the input file will be interpreted:

- `time_col` (default: time): column from which the time is read
- `rescale_time` (default: 1): the factor by which the time will be divided before anything is run
- `length_col` (default: length): column from which the length of the cell is read
- `length_islog` (default: false): indicates whether the cell length in the data file is in log scale (true) or not (false)
- `fp_col` (default: gfp): column from which the fluorescence protein content is read
- `delm` (default: ','): delimiter between columns, probably ',' or ';'
- `segment_col` (default: not set): column from which the segment index is read. Not setting `segment_col` in the file indicates that segment indices will not be used (see the example below)
- `filter_col` (default: not set): column from which the filter will be read. To INclude a data point, set the entry in this column to `True`, `true`, `TRUE`, or `1`; to EXclude a data point, set the entry in this column to `False`, `false`, `FALSE`, or `0`. Not setting `filter_col` in the file indicates that the input file will not be filtered
- `cell_tags` (default: cell_id): columns that will make up the unique cell id; endings like .0, .00, etc. of numeric values will be removed
- `parent_tags` (default: parent_id): columns that will make up the unique cell id of the parent cell; endings like .0, .00, etc. of numeric values will be removed
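A more complete `csv_config` that makes use of the optional settings might look like this (the column names `segment` and `keep` are hypothetical):

```
time_col = time_min
rescale_time = 60
length_col = length_um
fp_col = GFP
delm = ,
segment_col = segment
filter_col = keep
cell_tags = cell_id
parent_tags = parent_id
```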
The 2 OU processes are described by a mean value (i.e. the mean growth/production rate), a gamma parameter determining how fast the process is driven back towards its mean after a deviation, and a characteristic kick size that scales the noise term (see the sketch after the following list). Thus, we have the following parameters:
- Growth rate fluctuation params:
  - mean_lambda
  - gamma_lambda
  - var_lambda
- GFP fluctuation params:
  - mean_q
  - gamma_q
  - var_q
- Bleaching rate:
  - beta
- Cell division:
  - var_dx
  - var_dg
- Measurement noise:
  - var_x
  - var_g
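As a sketch, the growth rate process then has the following Ornstein-Uhlenbeck form (the notation is introduced here for illustration; the exact relation between the noise amplitude $\sigma_\lambda$ and the kick size `var_lambda` is not spelled out in this README):

$$
d\lambda_t = -\gamma_\lambda \left( \lambda_t - \bar{\lambda} \right) dt + \sigma_\lambda \, dW_t
$$

where $\bar{\lambda}$ corresponds to `mean_lambda` and $\gamma_\lambda$ to `gamma_lambda`. The production rate $q$ follows a process of the same form with `mean_q`, `gamma_q`, and `var_q`.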
The following run modes can be set:
- `-m, --maximize`
- `-s, --scan`
- `-p, --predict`
- `-j, --joints`

The respective output is explained below.
All files that are generated in the maximization and the prediction step are named as follows: `example_f<free>_b<bounds>`, where `<free>` lists the free variables via their indices, as e.g. printed when the code is run, and `<bounds>` lists the bound parameters in the same way. Example: `example_f034_b129`. The ending of the different files then indicates what they contain.
Each file generated starts with a table with the parameter settings that were used to enable reproducibility:
- The first line is a header for the parameter table and the following 11 lines contain the parameter setting and potentially the final estimates from the optimization if available. The prediction files and the auto_correlation file append the parameter settings of the other segments one after another if multiple segments are present.
- The following line is empty
Maximizes the likelihood function and in this way estimates the parameters of the model.
Output:
- Will create 3 files: one for the maximization process (`_iterations.csv`), one for the final estimations (`_final.csv`), and a "parameter file" (`_parameter_file.txt`) that can directly be used as an input for a prediction run (this file is formatted like the input parameter file and does not contain the table with the parameter settings)
- The iterations file contains all likelihood evaluations of the likelihood maximization
- The final file contains the estimated errors for the estimated parameters via a Hessian matrix. The Hessian is calculated using a range of finite differences that are set relative to the value of the respective parameter, i.e. epsilon=1e-2 corresponds to 1% of each parameter used for the Hessian matrix estimation (see the sketch below). Finally, the number of data points, the total log-likelihood, the normalized log-likelihood, the used optimization algorithm, the set tolerance, and the search space (log/linear) are noted.
- The parameter file is in the format of the parameter file that was submitted to the code (see Sec. 2.1.2). It only contains the final estimate of each parameter. Thus, the parameters are treated as "fixed" when the code is run with this parameter file. This file can be used to run the prediction step directly (potentially on a different input file).
- As the maximization is run on the different segments independently, `_segment` followed by the segment index is added to the sample input file name in case there are multiple segments.
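The exact finite-difference scheme is not spelled out in this README; a minimal sketch of a central-difference Hessian with relative step sizes (all names hypothetical, Python used for illustration):

```python
import numpy as np

def hessian_relative_steps(f, theta, epsilon=1e-2):
    # Central finite-difference Hessian of f at theta, with steps set
    # relative to each parameter value (epsilon=1e-2 -> 1% of each parameter).
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    h = epsilon * theta  # per-parameter relative step sizes
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            tpp = theta.copy(); tpp[i] += h[i]; tpp[j] += h[j]
            tpm = theta.copy(); tpm[i] += h[i]; tpm[j] -= h[j]
            tmp = theta.copy(); tmp[i] -= h[i]; tmp[j] += h[j]
            tmm = theta.copy(); tmm[i] -= h[i]; tmm[j] -= h[j]
            H[i, j] = (f(tpp) - f(tpm) - f(tmp) + f(tmm)) / (4 * h[i] * h[j])
    return H

# Error bars on the estimates are then typically obtained from the
# inverse of the negative Hessian of the log-likelihood.
```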
In case the maximization is also run, the final parameter estimate is used for the prediction; otherwise, the init value of each parameter is used.
Output:
- Will create 3 files: combined predictions (`_prediction.csv`), backward only (`_prediction_backward.csv`), and forward only (`_prediction_forward.csv`)
- The predictions consist of:
  - the 4 mean quantities of `x`, `g`, `l` (refers to lambda), and `q`
  - the upper triangle of the covariance matrix, labeled `cov_zi_zj`, where `zi` and `zj` are one of `x`, `g`, `l`, and `q`
  - ... of each time point for each cell, in the same order as the input file
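When reading these files, the parameter table at the top has to be skipped. A sketch (assuming a single segment, i.e. 1 header line, 11 parameter lines, and 1 empty line before the data; the file name is hypothetical):

```python
import pandas as pd

# Skip the parameter table: 1 header line + 11 parameter lines + 1 empty line.
# With multiple segments, more parameter lines are appended and the number
# of skipped rows has to be adjusted accordingly.
pred = pd.read_csv("out/example_f034_b129_prediction.csv", skiprows=13)
print(pred.columns)
```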
Runs the calculation of the joint probabilities of pairs of data points. The 2-point probabilities are needed to calculate correlation functions.
Running the calculation of the joint distributions also runs the prediction part and generates those files. Setting both flags is therefore redundant but also not harmful.
The number of joints that are calculated is controlled by the tolerance `rel_tolerance_joints` that is set (see 2.2.3).
Output:
- Will create a file named `_joints.csv` containing all calculated joints, formatted as an upper triangle matrix. Rows correspond to the earlier time point in the joint, while columns correspond to the later time point, named `cellid_time`, where `cellid` corresponds to whatever is defined as the cell id and `time` is the corresponding time point:
| cell_id | parent_id | time | cell1_0 | cell1_1 | cell1_2 | cell1_3 | cell2_4 | cell2_5 |
|---|---|---|---|---|---|---|---|---|
| cell1 |  | 0 |  | P(z_1, z_0) | P(z_2, z_0) | P(z_3, z_0) | P(z_4, z_0) | P(z_5, z_0) |
| cell1 |  | 1 |  |  | P(z_2, z_1) | P(z_3, z_1) | P(z_4, z_1) | P(z_5, z_1) |
| cell1 |  | 2 |  |  |  | P(z_3, z_2) | P(z_4, z_2) | P(z_5, z_2) |
| cell2 | cell1 | 3 |  |  |  |  | P(z_4, z_3) | P(z_5, z_3) |
| cell2 | cell1 | 4 |  |  |  |  |  | P(z_5, z_4) |
- Each joint probability consists of its means and the upper triangle of its covariance matrix and thus of 8+36=44 values in total.
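A sketch of how one joint could be unpacked into its 8-dimensional mean and its 8x8 covariance matrix (the exact ordering of the 44 values is an assumption here, to be checked against the file header):

```python
import numpy as np

def unpack_joint(values):
    # Split the 44 values of one joint into the 8 means and the symmetric
    # 8x8 covariance matrix stored as its upper triangle (assumed ordering:
    # means first, then the row-wise upper triangle).
    values = np.asarray(values, dtype=float)
    mean = values[:8]
    cov = np.zeros((8, 8))
    iu = np.triu_indices(8)                     # 36 upper-triangle positions
    cov[iu] = values[8:]
    cov = cov + cov.T - np.diag(np.diag(cov))   # symmetrize
    cross_cov = cov[:4, 4:]  # cross-covariance block between the two time points
    return mean, cov, cross_cov
```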
The 1d parameter scans will calculate the likelihood for the 1d ranges set by the `parameter_bounds` file. Note that only "bound" parameters will be scanned.
Output:
- Will create a file for each parameter containing all calculated likelihoods of the scan
- As the scans are run on the different segments independently, `_segment` followed by the segment index is added to the sample input file name in case there are multiple segments.
The code has a number of errors that might be thrown at runtime. Some of them are listed below. In particular, the errors that occur during command line parsing (arg_parser) are not listed here.
- (sc_likelihood) ERROR: Log-likelihood calculation failed: Log-likelihood is Nan: The likelihood of the data is maximized starting from the initial set of parameters. However, for initial parameters far away from the optimum, the evaluation of the likelihood becomes numerically inaccessible. Thus, if Nan occurs during the maximization of the likelihood, the code is stopped and it is necessary to set new (and ideally more accurate) initial parameters. The issue might also occur if the initial guess is fine, but the first step (given by the step in the parameter file) kicks the calculation out of the range in which the likelihood can be evaluated. In this case, the Nans occur in the later entries of the iterations file and one needs to set new initial steps. Note that also the sign of the step matters.
- (minimize_wrapper) ERROR: Log likelihood optimization failed: Something went wrong during the optimization of the likelihood. Just let me know.
- (build_cell_genealogy) ERROR: Both daughter pointers are set: Each parent cell is expected to have 0, 1, or 2 daughter cells; in case there are more daughter cells associated with a parent cell, the code is stopped. The parent_tags and cell_tags in the csv_config file might not be set correctly, meaning the combination of the tags is still ambiguous, pointing to different cells. Revise the parent_tags and cell_tags.
- (get_data) ERROR: (...) is not a column in the input file: The keyword marking the column that is used for getting the time/log_length or fp from the input file is not found in the header. Make sure the keyword matches the name of the column in the input file.
- (get_data) ERROR: Line ... cannot be processed: A line in the input file contains an entry (in time/log_length or fp) that is not a number. Note that Nans are also rejected.
- (set_paramter) ERROR: Parameter settings of ... cannot be processed: The processing of the parameter file failed. The formatting of the parameter file is explained above. In particular, all arguments of the parameter setting need to be numbers. Note that Nans are also rejected.
- (check_if_complete) ERROR: Parameter ... not found in parameter file: Each parameter needs to be set in the parameter file. The code is stopped if one (or more) is missing.
- rescale_time default is 1
- time in the prediction file is also rescaled making the prediction file consistent
- tol default is 1e-3
- prior for the first time point is set according to the parameters, making stationary and use_beta obsolete
- joint calculation and output of joints
- Use `LN_NELDERMEAD`
- introduction of a scaled noise model
- introduction of binomial cell division model
- cell is growing and producing between n and n+1 even if it is dividing; division happened just before n+1
- joint matrix has identical row and column indices
- changed tolerance defaults to 1e-10
- taking care of prior correctly
- only returns full posterior
- use absolute value of scaled error variance to cope with negative input
- prepare for cluster
- write new simulation including asymmetric cell division and tree structure
- use log of parameters
- create a parameter file that can be used as input
- include error messages to the readme
- consistent output files
- check input and give precise error messages
- catch all Nans in optimization
- autocorrelation
- segments
- optional filtering
- calculate joints over cell division
- calculation of joints on tree
- joint output
- stopping criteria for joints -> tolerance
- write (python) script for reading joint file and calculate correlation function on single lineages
- include marginals (i.e. predictions) in the correlation function calculation for Cov < tolerance
- calculate correlations on a tree (downstream of the ggp code)
- include scaled noise to correlation part
- binomial cell division model