This script extracts only the methods used for the MSTreesV2 pipeline, which were originally implemented in the MSTree.py script.
The enhanced script introduces two key concepts: chunks and memory-mapped files (memmap).
-
Chunks: The script now processes data in chunks, which can significantly improve performance when dealing with large datasets. Instead of loading the entire dataset into memory at once, the script loads and processes smaller subsets of the data (chunks), one at a time.
-
Memory-mapped files (memmap): The script uses memory-mapped files, a method that allows a file or a device to be read from and written to as if it were an array in memory. This is achieved by mapping a disk file to a segment of memory. Memory-mapped files can be used for accessing small segments of large files without reading the entire file into memory.
This script is a proof of concept (POC) and is not intended for production use. It is a modified version of the original script, tailored by the GENPAT team and is currently in an unstable state as it is under active development.
- Optimized TSV Matrix Handling: The key upgrade in this version is the optimized handling of large TSV matrix profiles, achieved through the '-p' option.
- Skip Phases: Allow the script to skip phases when precomputed
.npy
files (e.g., profiles, names, or distance matrix) are passed. - Implement Remaining Functions: Test and update the remaining functions from the original script.
- Validate results: Especially when float16 dtype is used.
- Parallelization: Maximize parallelization wherever possible (actually implemented only in
get_distance
).
- Utilizing float16 as the data type inhibits the just-in-time (JIT) compilation of the
contemporary
method. - The Newick tree (nwk) resulting from the use of float16 in
MSTreeV2.py
cannot be directly compared with the Newick tree obtained from the originalMSTree.py
script. - float16 reduces precision, especially in divisions and sorting operations, which may not be as precise.
IMPORTANT: The utilization of float16 data type does not confer any memory-related benefits
The original script can be found at:
To use this script:
git clone https://github.com/achtman-lab/GrapeTree/
cd GrapeTree/module
git clone https://github.com/genpat-it/MSTreesV2_POC
$ python MSTreesV2.py -h
#########################################################
#### !!! DON'T USE IN PRODUCTION !!! ########
#########################################################
[info] MSTreesV2 started at 2024-05-16 17:28:27.278087...
#########################################################
usage: MSTreesV2.py [-h] --profile FNAME --chunk_size CHUNK_SIZE [--n_proc NUMBER_OF_PROCESSES] [--dtype DTYPE] [--sep SEP] [--keep_files]
For details, see "https://github.com/achtman-lab/GrapeTree/blob/master/README.md".
In brief, GrapeTree generates a NEWICK tree to the default output (screen)
or a redirect output, e.g., a file.
options:
-h, --help show this help message and exit
--profile FNAME, -p FNAME
An input filename of a file containing MLST
--chunk_size CHUNK_SIZE, -c CHUNK_SIZE
Chunk size
--n_proc NUMBER_OF_PROCESSES, -n NUMBER_OF_PROCESSES
Number of CPU processes in parallel use. Default is half of available cores.
--dtype DTYPE, -d DTYPE
Data type for numpy arrays in the distance matrix. Provide 16 for np.float16. Default is np.float32.
--sep SEP, -s SEP Separator for the input file. Default is tab.
--keep_files, -k Keep the files generated by the script. Default is True.
This script generates a dummy matrix of size 80000x4000, with a similarity percentage of 30% of the samples and pairwise distances not bigger than 30% of the number of loci.
$ python dummy_generator.py 80000 4000 20 30 80000x4000_30_20.tsv
Execute the script with the default -p
pipeline option, processing in chunks of 10000 rows at a time (-c
), utilizing 120 cores (-n
), with data type set to float16 (-d
), and retaining the temporary files generated during the process (-k
).
$ /usr/bin/time -v python MSTreesV2.py -p 80000x4000_30_20.tsv -c 10000 -n 150 -d 32
#########################################################
#### !!! DON'T USE IN PRODUCTION !!! ########
#########################################################
[info] MSTreesV2 started at 2024-05-16 23:43:13.579159...
#########################################################
[info] 80000x4000_30_20.tsv has 80000 rows and 4000 columns.
[info] The chunk size is set to 10000.
[info] The profile file will be saved in /MSTrees_fork/github/tmpvtqulnja.prof.npy.
[info] The names file will be saved in /MSTrees_fork/github/tmpvtqulnja.names.npy.
[info] The distance file will be saved in /MSTrees_fork/github/tmpvtqulnja.dist.npy.
[info] The distance file for edmonds will be saved in /MSTrees_fork/github/tmpvtqulnja.dist.list
[info] The nwk file will be saved in /MSTrees_fork/github/tmpvtqulnja.nwk
[info] Processing 80000x4000_30_20.tsv in chunks...
Processing chunks: 100%|███████████████████████████████████████████████████████████████████████████████| 8/8 [02:54<00:00, 21.79s/it]
[info] Processing finished in 180.72375059127808 seconds.
[info] nonredundant method started...
[info] nonredundant method finished in 128.9532024860382 seconds.
[info] New shape of profiles: (63966, 4000)
[info] New shape of names: (63966,)
[info] MSTree method started...
[info] get_distance method started...
Calculating distances: 100%|█████████████████████████████████████████████████████████████████████| 150/150 [2:25:10<00:00, 58.07s/it]
[info] get_distance method finished in 8712.560944795609 seconds.
[info] harmonic method started...
[info] harmonic method finished in 126.75932931900024 seconds.
[info] _asymmetric method started...
[info] _asymmetric method finished in 12471.007730722427 seconds.
[info] _branch_recraft method started...
[info] _branch_recraft method finished in 13733.726621627808 seconds.
[info] symmetric_link method started...
[info] symmetric_link method finished in 1.681640863418579 seconds.
[info] _network2tree method started...
[info] _network2tree method finished in 0.701179027557373 seconds.
[info] MSTree method finished in 35047.98455452919 seconds.
[info] Removing temporary files...
[info] Process completed in 35368.38639831543 seconds.
#########################################################
[info] MSTreesV2 finished at 2024-05-17 09:32:41.959877...
#########################################################
Command being timed: "python MSTreesV2.py -p 80000x4000_30_20.tsv -c 10000 -n 150 -d 32"
User time (seconds): 116229.01
System time (seconds): 130015.81
Percent of CPU this job got: 696%
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:49:31
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 682338396
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 547353225
Minor (reclaiming a frame) page faults: 174643040
Voluntary context switches: 597842748
Involuntary context switches: 329760
Swaps: 0
File system inputs: 94246936
File system outputs: 1989345920
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The image is displayed using the SPREAD tool.
The edmonds-linux
program could potentially be the most performance-limiting aspect of the script. The worst-case scenario for Edmonds' algorithm, when applied to a distance matrix of allelic profiles, involves dense graphs with many nodes (few zeros) and uniform weights ( a common situation with synthetic data). This scenario requires extensive cycle detection and contraction, leading to a time complexity of O(N * M) = O(N * N * (N - 1)) = O(N^3)
and substantial memory usage.
Efficient cycle detection methods like Tarjan's algorithm can also help avoid the cubic time complexity in most cases.
In the above testing example, the maximum amount of physical memory (RAM) used by the process is about 682 GB, indicating very high memory usage. However, before the edmonds-linux execution, the RAM usage was around 22 GB. It is very likely that this high memory usage is related to the nature of the synthetic data generated.
$ /usr/bin/time -v python MSTreesV2.py -p real_dataset.tsv -c 10000 -n 120 -d 32 -k
#########################################################
#### !!! DON'T USE IN PRODUCTION !!! ########
#########################################################
[info] MSTreesV2 started at 2024-05-17 10:12:27.318310...
#########################################################
[info] real_dataset.tsv has 54783 rows and 7643 columns.
[info] The chunk size is set to 10000.
[info] The profile file will be saved in /MSTrees_fork/github/tmpvnxecue7.prof.npy.
[info] The names file will be saved in /MSTrees_fork/github/tmpvnxecue7.names.npy.
[info] The distance file will be saved in /MSTrees_fork/github/tmpvnxecue7.dist.npy.
[info] The distance file for edmonds will be saved in /MSTrees_fork/github/tmpvnxecue7.dist.list
[info] The nwk file will be saved in /MSTrees_fork/github/tmpvnxecue7.nwk
[info] Processing real_dataset.tsv in chunks...
Processing chunks: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [03:45<00:00, 37.56s/it]
[info] Processing finished in 234.7057967185974 seconds.
[info] nonredundant method started...
[info] nonredundant method finished in 130.54289841651917 seconds.
[info] New shape of profiles: (27695, 7643)
[info] New shape of names: (27695,)
[info] MSTree method started...
[info] get_distance method started...
Calculating distances: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 120/120 [58:51<00:00, 29.43s/it]
[info] get_distance method finished in 3537.0775895118713 seconds.
[info] harmonic method started...
[info] harmonic method finished in 13.572845697402954 seconds.
[info] _asymmetric method started...
[info] _asymmetric method finished in 42.826945543289185 seconds.
[info] _branch_recraft method started...
[info] _branch_recraft method finished in 109.54710674285889 seconds.
[info] symmetric_link method started...
[info] symmetric_link method finished in 0.9602999687194824 seconds.
[info] _network2tree method started...
[info] _network2tree method finished in 0.40360546112060547 seconds.
[info] MSTree method finished in 3704.8224573135376 seconds.
[info] Process completed in 4075.650115251541 seconds.
#########################################################
[info] MSTreesV2 finished at 2024-05-17 11:20:22.962614...
#########################################################
Command being timed: "python MSTreesV2.py -p real_dataset.tsv -c 10000 -n 120 -d 32 -k"
User time (seconds): 26711.64
System time (seconds): 74722.19
Percent of CPU this job got: 2476%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:08:16
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 38159880
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 291000613
Minor (reclaiming a frame) page faults: 48633918
Voluntary context switches: 164289437
Involuntary context switches: 94208
Swaps: 0
File system inputs: 48054384
File system outputs: 609367192
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Memory used using float32: 38.15 GB