Use the KMC directly from code through the API
Besides the command-line interface (CLI), KMC can also be used directly from C++ code.
To use the API, include the kmc_runner.h header file and link the application against libkmc_core.a.
KMC depends on zlib and bz2, so these libraries must also be linked.
The simplest way to prepare all necessary files is to run:
git clone https://github.com/refresh-bio/KMC/
cd KMC
make bin/libkmc_core.a # -j<n_jobs> recommended for faster compilation
As a result, the files needed to use KMC are in the following locations:
include/kmc_runner.h
bin/libkmc_core.a
Here is a small working example of how to use KMC from C++ code:
#include "include/kmc_runner.h"
#include <iostream>

int main()
{
    try
    {
        KMC::Runner runner;

        KMC::Stage1Params stage1Params;
        stage1Params
            .SetKmerLen(31)
            .SetInputFiles({"test1.fq", "test2.fq"});

        auto stage1Result = runner.RunStage1(stage1Params);

        KMC::Stage2Params stage2Params;
        stage2Params
            .SetOutputFileName("31mers");

        auto stage2Result = runner.RunStage2(stage2Params);

        // print some stats
        std::cout << "total k-mers: " << stage2Result.nTotalKmers << "\n";
        std::cout << "total unique k-mers: " << stage2Result.nUniqueKmers << "\n";
    }
    catch(const std::exception& e)
    {
        std::cerr << e.what() << '\n';
    }
}
Assuming the above code is in the example.cpp file, the following command may be used to compile it:
g++ -O3 example.cpp bin/libkmc_core.a -o example -lbz2 -lz -lpthread
As the simple example above shows, a KMC run is divided into stages 1 and 2. Each stage has its own set of parameters (wrapped in the Stage1Params and Stage2Params types) and its own set of results (wrapped in the Stage1Results and Stage2Results types).
Everything is contained in the KMC namespace.
The execution is split into two parts to allow the API user to set some of the Stage2Params parameters based on the Stage1Results results.
For example, it is possible to determine the memory limit for the second stage based on the estimated number of unique counted k-mers.
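The conversion from an estimated k-mer count to a stage 2 memory limit can be sketched as a small helper. This is only an illustration: RamLimitGB is a hypothetical function, not part of the KMC API, the per-k-mer bit budget is a value the caller picks, and the 2 GB floor follows the minimum used in the larger example at the end of this page.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not part of the KMC API): converts an estimated number
// of unique counted k-mers into a whole-GB RAM limit for stage 2.
// bitsPerKmer is an illustrative budget chosen by the caller.
inline uint32_t RamLimitGB(uint64_t nUniqueKmersEst, double bitsPerKmer, uint32_t minGB = 2)
{
    const double bytes = static_cast<double>(nUniqueKmersEst) * bitsPerKmer / 8.0;
    const uint32_t wholeGB = static_cast<uint32_t>(bytes / 1e9); // truncates toward zero
    return std::max(wholeGB, minGB);                             // never go below the floor
}
```

The returned value is meant to be passed to Stage2Params::SetMaxRamGB.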
There are a couple of interfaces related to the logging process of KMC: IPercentProgressObserver, IProgressObserver and ILogger.
In case of a critical error (for example during input decompression) KMC will throw a std::exception.
Runner is the main class for k-mer counting. It is a default-constructible type. It defines the following methods:
- Stage1Results RunStage1(const Stage1Params& params) - run stage 1 of KMC
- Stage2Results RunStage2(const Stage2Params& params) - run stage 2 of KMC
RunStage2 must be called after RunStage1. RunStage2 may be omitted when it is not needed (for example, if one is interested only in k-mer abundance histogram estimation).
The Stage1Params class allows setting all the parameters needed for the first stage of KMC. All the parameters may be set and read using the appropriate setters and getters. The following setters are available:
- Stage1Params& SetInputFiles(const std::vector<std::string>& inputFiles) - sets the list of input files (all must be in the same format; allowed formats are: fasta, fastq, multi-line fasta, bam, kmc)
- Stage1Params& SetTmpPath(const std::string& tmpPath) - sets the path for intermediate files (default: ".")
- Stage1Params& SetKmerLen(uint32_t kmerLen) - sets the k-mer length
- Stage1Params& SetNThreads(uint32_t nThreads) - sets the number of threads (default: std::thread::hardware_concurrency)
- Stage1Params& SetMaxRamGB(uint32_t maxRamGB) - sets the maximum amount of RAM that KMC is allowed to consume, in GB (default: 12 GB)
- Stage1Params& SetSignatureLen(uint32_t signatureLen) - sets the signature length (allowed range: [5;11], default: 9)
- Stage1Params& SetHomopolymerCompressed(bool homopolymerCompressed) - enable (true)/disable (false) counting of homopolymer-compressed k-mers (approximate and experimental, default: disabled)
- Stage1Params& SetInputFileType(InputFileType inputFileType) - sets the input file type, available values are: InputFileType::FASTQ, InputFileType::FASTA, InputFileType::MULTILINE_FASTA, InputFileType::BAM, InputFileType::KMC (default: InputFileType::FASTQ)
- Stage1Params& SetCanonicalKmers(bool canonicalKmers) - count canonical k-mers (true) or not (false) (default: canonical)
- Stage1Params& SetRamOnlyMode(bool ramOnlyMode) - turn RAM-only mode on (true)/off (false) (default: off)
- Stage1Params& SetNBins(uint32_t nBins) - sets the number of intermediate files (in range: [64, 2000], default: 512)
- Stage1Params& SetNReaders(uint32_t nReaders) - sets the number of reader threads (Warning: only for experienced users)
- Stage1Params& SetNSplitters(uint32_t nSplitters) - sets the number of splitting threads (Warning: only for experienced users)
- Stage1Params& SetVerboseLogger(ILogger* verboseLogger) - sets the verbose logger, see the ILogger interface description (default: verbose logs are ignored)
- Stage1Params& SetPercentProgressObserver(IPercentProgressObserver* percentProgressObserver) - sets the observer of percent progress, see the IPercentProgressObserver interface description (default: print percent progress on std::cerr)
- Stage1Params& SetWarningsLogger(ILogger* warningsLogger) - sets the warnings logger, see the ILogger interface description (default: print logs on std::cerr)
- Stage1Params& SetEstimateHistogramCfg(EstimateHistogramCfg estimateHistogramCfg) - sets the k-mer abundance histogram estimation configuration, available values are EstimateHistogramCfg::DONT_ESTIMATE, EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS and EstimateHistogramCfg::ONLY_ESTIMATE (default: EstimateHistogramCfg::DONT_ESTIMATE). The k-mer abundance histogram is estimated with our implementation of the ntCard algorithm. Detailed explanation of each setting:
  - With EstimateHistogramCfg::DONT_ESTIMATE the k-mer abundance histogram is not estimated. This is the normal mode, because additional histogram estimation may affect performance (although in preliminary experiments the impact was negligible).
  - With EstimateHistogramCfg::ONLY_ESTIMATE only histogram estimation is performed: the intermediate files are not created and the second stage does nothing. Use it when only the histogram estimate is needed.
  - With EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS the histogram is estimated and the rest of the computation is done as usual. This may be used to determine some of the stage 2 parameters based on the histogram.
- Stage1Params& SetProgressObserver(IProgressObserver* progressObserver) - sets the observer of progress other than percentage, see the IProgressObserver interface description (default: print progress on std::cerr)
Each parameter may be read using one of the following getters:
const std::vector<std::string>& GetInputFiles() const noexcept
const std::string& GetTmpPath() const noexcept
uint32_t GetKmerLen() const noexcept
uint32_t GetNThreads() const noexcept
uint32_t GetMaxRamGB() const noexcept
uint32_t GetSignatureLen() const noexcept
bool GetHomopolymerCompressed() const noexcept
InputFileType GetInputFileType() const noexcept
bool GetCanonicalKmers() const noexcept
bool GetRamOnlyMode() const noexcept
uint32_t GetNBins() const noexcept
uint32_t GetNReaders() const noexcept
uint32_t GetNSplitters() const noexcept
ILogger* GetVerboseLogger() const noexcept
IPercentProgressObserver* GetPercentProgressObserver() const noexcept
ILogger* GetWarningsLogger() const noexcept
EstimateHistogramCfg GetEstimateHistogramCfg() const noexcept
IProgressObserver* GetProgressObserver() const noexcept
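The documented ranges for the signature length and the number of intermediate files can be checked before a run. The sketch below is a hypothetical pre-flight helper, not a KMC function; it uses only the ranges listed above ([5;11] and [64, 2000]) and says nothing about how KMC itself reacts to out-of-range values.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical pre-flight check (not part of the KMC API).
// Returns one human-readable message per value outside its documented range.
inline std::vector<std::string> CheckStage1Ranges(uint32_t signatureLen, uint32_t nBins)
{
    std::vector<std::string> problems;
    if (signatureLen < 5 || signatureLen > 11)
        problems.push_back("signature length " + std::to_string(signatureLen) + " is outside [5;11]");
    if (nBins < 64 || nBins > 2000)
        problems.push_back("number of bins " + std::to_string(nBins) + " is outside [64, 2000]");
    return problems;
}
```

With a populated Stage1Params object, the arguments would come from GetSignatureLen() and GetNBins().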
The Stage2Params class allows setting all the parameters needed for the second stage of KMC. All the parameters may be set and read using the appropriate setters and getters. The following setters are available:
- Stage2Params& SetMaxRamGB(uint32_t maxRamGB) - sets the maximum amount of RAM (in GB) that KMC is allowed to consume. KMC may in fact use more memory if that is needed to process the data; if the limit should be strict, use the SetStrictMemoryMode method
- Stage2Params& SetNThreads(uint32_t nThreads) - sets the number of threads (default: std::thread::hardware_concurrency)
- Stage2Params& SetStrictMemoryMode(bool strictMemoryMode) - enable (true)/disable (false) strict memory mode; if enabled, KMC will not consume more RAM than specified with the SetMaxRamGB method, but the computation may take longer (default: disabled)
- Stage2Params& SetCutoffMin(uint64_t cutoffMin) - exclude k-mers occurring fewer than cutoffMin times (default: 2)
- Stage2Params& SetCounterMax(uint64_t counterMax) - sets the maximal value of a counter (default: 255)
- Stage2Params& SetCutoffMax(uint64_t cutoffMax) - exclude k-mers occurring more than cutoffMax times (default: 1e9)
- Stage2Params& SetOutputFileName(const std::string& outputFileName) - sets the path of the output file
- Stage2Params& SetOutputFileType(OutputFileType outputFileType) - sets the format of the output, available values: OutputFileType::KMC and OutputFileType::KFF (default: KMC)
- Stage2Params& SetWithoutOutput(bool withoutOutput) - do not produce (true) an output file (default: false)
- Stage2Params& SetStrictMemoryNSortingThreadsPerSorters(uint32_t strictMemoryNSortingThreadsPerSorters) - sets the number of sorting threads per sorter in strict memory mode (Warning: only for experienced users)
- Stage2Params& SetStrictMemoryNUncompactors(uint32_t strictMemoryNUncompactors) - sets the number of uncompactors in strict memory mode (Warning: only for experienced users)
- Stage2Params& SetStrictMemoryNMergers(uint32_t strictMemoryNMergers) - sets the number of mergers in strict memory mode (Warning: only for experienced users)
Each parameter may be read using one of the following getters:
uint32_t GetMaxRamGB() const noexcept
uint32_t GetNThreads() const noexcept
bool GetStrictMemoryMode() const noexcept
uint64_t GetCutoffMin() const noexcept
uint64_t GetCounterMax() const noexcept
uint64_t GetCutoffMax() const noexcept
const std::string& GetOutputFileName() const noexcept
OutputFileType GetOutputFileType() const noexcept
bool GetWithoutOutput() const noexcept
uint32_t GetStrictMemoryNSortingThreadsPerSorters() const noexcept
uint32_t GetStrictMemoryNUncompactors() const noexcept
uint32_t GetStrictMemoryNMergers() const noexcept
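How the two cutoffs and the counter maximum interact can be modelled with a few lines of code. This is an illustrative model only, not KMC code: StoredCounter is a hypothetical name, and the clamping of large counts to counterMax is an assumption, since the description above only calls it the maximal value of a counter.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative model (not KMC code) of SetCutoffMin, SetCutoffMax and SetCounterMax.
// Returns false when a k-mer with the given number of occurrences would be excluded;
// otherwise returns true and writes the counter value to `stored`.
inline bool StoredCounter(uint64_t occurrences, uint64_t cutoffMin, uint64_t cutoffMax,
                          uint64_t counterMax, uint64_t& stored)
{
    if (occurrences < cutoffMin || occurrences > cutoffMax)
        return false;                           // excluded by one of the cutoffs
    stored = std::min(occurrences, counterMax); // assumption: counter is clamped at counterMax
    return true;
}
```

With the documented defaults (cutoffMin 2, cutoffMax 1e9, counterMax 255), a k-mer seen once is dropped and, under the clamping assumption, a k-mer seen 300 times is stored with the value 255.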
The Stage1Results structure stores the results of the stage 1 run. It contains the following values:
- double time - time spent by KMC executing stage 1
- uint64_t nSeqences - total number of input sequences (usually reads)
- bool wasSmallKOptUsed - true if the small-k optimization was used
- uint64_t nTotalSuperKmers - total number of super-k-mers
- uint64_t tmpSize - total amount of disk space used by KMC for intermediate files
- std::vector<uint64_t> estimatedHistogram - estimated histogram, non-empty only if EstimateHistogramCfg::ONLY_ESTIMATE or EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS was set with the SetEstimateHistogramCfg method. The i-th element of this vector stores the estimated number of k-mers occurring i times.
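Because index i of estimatedHistogram holds the estimated number of k-mers occurring i times, the number of unique k-mers that would survive a pair of cutoffs is a sum over a range of indices. A minimal sketch follows; EstimateCountedUniqueKmers is a hypothetical helper name, not part of the KMC API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the KMC API): sums histogram[i] for all i with
// cutoffMin <= i <= cutoffMax, i.e. the estimated number of unique k-mers
// whose occurrence count lies within the cutoffs.
inline uint64_t EstimateCountedUniqueKmers(const std::vector<uint64_t>& histogram,
                                           uint64_t cutoffMin, uint64_t cutoffMax)
{
    uint64_t total = 0;
    for (std::size_t i = 0; i < histogram.size(); ++i)
        if (i >= cutoffMin && i <= cutoffMax)
            total += histogram[i];
    return total;
}
```

The function returns 0 for an empty histogram, which is what Stage1Results holds when estimation was not requested.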
The Stage2Results structure stores the results of the stage 2 run. It contains the following values:
- double time - time spent by KMC executing stage 2
- double timeStrictMem - time spent by KMC processing bins that are too big to fit into the memory limit (only if strict memory mode is used)
- uint64_t tmpSizeStrictMemory - disk usage related to the strict memory mode
- uint64_t maxDiskUsage - disk usage peak
- uint64_t nBelowCutoffMin - the total number of k-mers below the minimal cutoff
- uint64_t nAboveCutoffMax - the total number of k-mers above the maximal cutoff
- uint64_t nTotalKmers - the total number of k-mers
- uint64_t nUniqueKmers - the total number of unique k-mers
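The number of unique k-mers that were actually counted can be derived from these fields. The sketch below follows the arithmetic used in the larger example at the end of this page, which subtracts both cutoff counters from nUniqueKmers. CountedUniqueKmers is a hypothetical helper; it is written as a template so that it works with any type exposing these three field names.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the KMC API): unique k-mers remaining after
// both cutoffs, computed from fields named as in Stage2Results.
template <typename Results>
uint64_t CountedUniqueKmers(const Results& r)
{
    return r.nUniqueKmers - r.nBelowCutoffMin - r.nAboveCutoffMax;
}
```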
The IPercentProgressObserver interface allows KMC to inform the caller of the percentage progress made during computations. The following methods must be overridden:
- virtual void SetLabel(const std::string& label) - KMC will call this method to set the label of the current percentage execution
- virtual void ProgressChanged(int newValue) - KMC will call this method to inform that the percent progress has changed
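A percent observer that only remembers the latest state, for example to feed a GUI progress bar, might look like the sketch below. PercentObserverStandIn is a stand-in declared here only so that the snippet compiles without KMC; in real code it would be removed and the class would derive from KMC::IPercentProgressObserver. LatestPercent is a hypothetical name, and the class does no locking, which would matter if the callbacks arrive from another thread.

```cpp
#include <string>

// Stand-in with the two methods documented above, so the sketch compiles without KMC.
// In real code, remove it and derive from KMC::IPercentProgressObserver instead.
struct PercentObserverStandIn
{
    virtual void SetLabel(const std::string& label) = 0;
    virtual void ProgressChanged(int newValue) = 0;
    virtual ~PercentObserverStandIn() = default;
};

// Hypothetical observer that keeps only the most recent label and percentage.
class LatestPercent : public PercentObserverStandIn
{
public:
    void SetLabel(const std::string& label) override { label_ = label; }
    void ProgressChanged(int newValue) override { percent_ = newValue; }

    const std::string& Label() const { return label_; }
    int Percent() const { return percent_; }

private:
    std::string label_;
    int percent_ = 0;
};
```

A pointer to such an object would be passed to Stage1Params::SetPercentProgressObserver, so the object has to stay alive for the duration of the run.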
The IProgressObserver interface allows KMC to inform the caller of the progress made during computations. The following methods must be overridden:
- virtual void Start(const std::string& name) - KMC will call this method to inform that it starts the phase labeled as name
- virtual void Step() - KMC will call this method to inform that some progress was made
- virtual void End() - KMC will call this method to inform that the phase is completed
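The Start/Step/End protocol can be captured by an observer that records each phase and counts its steps. As a sketch under stated assumptions: ProgressObserverStandIn only mirrors the three methods documented above so that the snippet compiles on its own (real code would derive from KMC::IProgressObserver), and PhaseRecorder and PhaseRecord are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in with the three methods documented above, so the sketch compiles without KMC.
// In real code, remove it and derive from KMC::IProgressObserver instead.
struct ProgressObserverStandIn
{
    virtual void Start(const std::string& name) = 0;
    virtual void Step() = 0;
    virtual void End() = 0;
    virtual ~ProgressObserverStandIn() = default;
};

// One entry per phase announced through Start().
struct PhaseRecord
{
    std::string name;
    std::size_t steps = 0;
    bool finished = false;
};

// Hypothetical observer that records phases instead of printing them.
class PhaseRecorder : public ProgressObserverStandIn
{
public:
    void Start(const std::string& name) override
    {
        PhaseRecord record;
        record.name = name;
        phases_.push_back(record);
    }
    // Step() and End() apply to the most recently started phase;
    // calls made before any Start() are ignored.
    void Step() override { if (!phases_.empty()) ++phases_.back().steps; }
    void End() override { if (!phases_.empty()) phases_.back().finished = true; }

    const std::vector<PhaseRecord>& Phases() const { return phases_; }

private:
    std::vector<PhaseRecord> phases_;
};
```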
The ILogger interface allows KMC to log messages (e.g. warnings, verbose logs). There is one method to override:
- virtual void Log(const std::string& msg) - KMC will call this method to log the message msg.
In the KMC namespace there are a couple of classes implementing these interfaces. Their purpose is to define the default behavior of KMC, which is writing to std::cerr in most cases, or totally ignoring the KMC messages (e.g. verbose logs). By implementing these interfaces, one can use KMC from a non-command-line environment, e.g. a GUI.
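For a non-command-line environment, a logger can collect the messages instead of printing them, so that they can be shown later in a window or written to a file. The sketch below is a minimal illustration: LoggerStandIn mirrors the single documented method so that the snippet compiles without KMC (real code would derive from KMC::ILogger), and CollectingLogger is a hypothetical name.

```cpp
#include <string>
#include <vector>

// Stand-in with the method documented above, so the sketch compiles without KMC.
// In real code, remove it and derive from KMC::ILogger instead.
struct LoggerStandIn
{
    virtual void Log(const std::string& msg) = 0;
    virtual ~LoggerStandIn() = default;
};

// Hypothetical logger that stores every message in the order received.
class CollectingLogger : public LoggerStandIn
{
public:
    void Log(const std::string& msg) override { messages_.push_back(msg); }
    const std::vector<std::string>& Messages() const { return messages_; }

private:
    std::vector<std::string> messages_;
};
```

Since the setters SetWarningsLogger and SetVerboseLogger take a raw ILogger pointer, the logger object should outlive the KMC run that uses it.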
Below is the code of a more complex example. The idea is that the k-mer abundance histogram is estimated during the first stage and used to estimate the total number of counted k-mers, which in turn is used to compute the RAM limit for the second stage.
#include "include/kmc_runner.h"
#include <cstdint>
#include <exception>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    try
    {
        std::vector<std::string> inputFiles { "test_file1.fq", "test_file2.fq" };

        // stage 1: count k-mers and estimate the k-mer abundance histogram
        KMC::Stage1Params stage1Params;
        stage1Params.SetInputFiles(inputFiles)
            .SetKmerLen(31)
            .SetNThreads(8)
            .SetMaxRamGB(10)
            .SetEstimateHistogramCfg(KMC::EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS);

        KMC::Runner kmc;
        auto stage1Results = kmc.RunStage1(stage1Params);

        // estimate the number of unique k-mers occurring at least cutoffMin times
        uint32_t cutoffMin = 5;
        uint64_t nUniqCountedKmersEst{};
        for (uint32_t i = cutoffMin; i < stage1Results.estimatedHistogram.size(); ++i)
            nUniqCountedKmersEst += stage1Results.estimatedHistogram[i];

        std::cout << "#uniq counted kmers estimate: " << nUniqCountedKmersEst << "\n";

        // now let's assume that the total amount of memory we allow KMC to use depends on the total number of unique k-mers
        // and that for each unique k-mer we want at most bitsPerUniqueKmer bits
        double bitsPerUniqueKmer = 20;
        double ramBits = bitsPerUniqueKmer * nUniqCountedKmersEst;
        double ramBytes = ramBits / 8;
        uint32_t ramForStage2 = static_cast<uint32_t>(ramBytes / 1000 / 1000 / 1000);
        if (ramForStage2 < 2)
            ramForStage2 = 2; // at least 2 GB needed for KMC

        std::cout << "ram for Stage 2 (GB): " << ramForStage2 << "\n";

        // stage 2: use the computed limit and enforce it strictly
        KMC::Stage2Params stage2Params;
        stage2Params.SetNThreads(8)
            .SetMaxRamGB(ramForStage2)
            .SetCutoffMin(cutoffMin)
            .SetOutputFileName("kmers")
            .SetStrictMemoryMode(true);

        auto stage2Results = kmc.RunStage2(stage2Params);

        std::cout << "#total counted k-mers: " << stage2Results.nTotalKmers << "\n";
        std::cout << "#unique k-mers: " << stage2Results.nUniqueKmers << "\n";
        std::cout << "#unique counted k-mers: " << stage2Results.nUniqueKmers - stage2Results.nBelowCutoffMin - stage2Results.nAboveCutoffMax << "\n";
        std::cout << "#sequences: " << stage1Results.nSeqences << "\n";
    }
    catch(const std::exception& err) // KMC reports critical errors by throwing std::exception
    {
        std::cerr << err.what() << "\n";
    }
}