
Use the KMC directly from code through the API

marekkokot edited this page Dec 7, 2021 · 13 revisions

Introduction

Besides the command-line interface (CLI), KMC can be used directly from C++ code. To use the API, include the kmc_runner.h header file and link the application against libkmc_core.a. KMC depends on zlib and bz2, so these libraries must also be linked.

Simple setup

The simplest way to prepare all necessary files is to run:

git clone https://github.com/refresh-bio/KMC/
cd KMC
make bin/libkmc_core.a # -j<n_jobs> recommended for faster compilation

As a result, the files needed to use KMC are in the following locations:

  • include/kmc_runner.h
  • bin/libkmc_core.a

Simple working example

Here is a small working example of how to use KMC from C++ code:

#include "include/kmc_runner.h"
#include <iostream>
int main()
{
    try
    {       
        KMC::Runner runner;

        KMC::Stage1Params stage1Params;
        stage1Params
            .SetKmerLen(31)
            .SetInputFiles({"test1.fq", "test2.fq"});
        
        auto stage1Result = runner.RunStage1(stage1Params);
        
        KMC::Stage2Params stage2Params;

        stage2Params
            .SetOutputFileName("31mers");

        auto stage2Result = runner.RunStage2(stage2Params);

        //print some stats
        std::cout << "total k-mers: " << stage2Result.nTotalKmers << "\n";
        std::cout << "total unique k-mers: " << stage2Result.nUniqueKmers << "\n";
    }
    catch(const std::exception& e)
    {
        std::cerr << e.what() << '\n';
    }
}

Assuming the above code is in the example.cpp file, the following command may be used to compile it:

g++ -O3 example.cpp bin/libkmc_core.a -o example -lbz2 -lz -lpthread

The philosophy of an API

As the simple example above shows, a KMC run is divided into two stages. Each stage has its own set of parameters (wrapped in the Stage1Params and Stage2Params types) and its own set of results (wrapped in the Stage1Results and Stage2Results types). Everything is contained in the KMC namespace. The execution is split into two parts so that the API user can set some of the Stage2Params parameters based on the Stage1Results. For example, it is possible to determine the memory limit for the second stage based on the estimated number of unique counted k-mers.

There are a couple of interfaces related to the logging process of KMC: IPercentProgressObserver, IProgressObserver and ILogger.

In case of a critical error (for example during input decompression) KMC will throw an std::exception.

API documentation

Runner class

This is the main class for k-mer counting. It is a default-constructible type. It defines the following methods:

  • Stage1Results RunStage1(const Stage1Params& params) - run stage 1 of KMC
  • Stage2Results RunStage2(const Stage2Params& params) - run stage 2 of KMC. RunStage2 must be called after RunStage1. RunStage2 may be omitted when it is not needed (for example, if one is interested only in k-mer abundance histogram estimation).

Stage1Params class

This class allows setting all the parameters needed for the first stage of KMC. All the parameters may be set and read using appropriate setters and getters. There are the following setters:

  • Stage1Params& SetInputFiles(const std::vector<std::string>& inputFiles) - sets the list of input files (all must be in the same format, allowed formats are: fasta, fastq, multi-line fasta, bam, kmc)

  • Stage1Params& SetTmpPath(const std::string& tmpPath) - sets a tmp path for intermediate files (default path is ".")

  • Stage1Params& SetKmerLen(uint32_t kmerLen) - sets the k-mer length

  • Stage1Params& SetNThreads(uint32_t nThreads) - sets the number of threads (default: std::thread::hardware_concurrency)

  • Stage1Params& SetMaxRamGB(uint32_t maxRamGB) - sets the maximum amount of RAM that KMC is allowed to consume in GB (default: 12 GB)

  • Stage1Params& SetSignatureLen(uint32_t signatureLen) - sets the signature length (allowed range: [5, 11], default: 9)

  • Stage1Params& SetHomopolymerCompressed(bool homopolymerCompressed) - enable (true)/disable (false) homopolymer compressed k-mers counts (approximate and experimental, default: disabled)

  • Stage1Params& SetInputFileType(InputFileType inputFileType) - sets the input file type, available values are: InputFileType::FASTQ, InputFileType::FASTA, InputFileType::MULTILINE_FASTA, InputFileType::BAM, InputFileType::KMC (default: InputFileType::FASTQ)

  • Stage1Params& SetCanonicalKmers(bool canonicalKmers) - count canonical (true) k-mers or not (false) (default: canonical)

  • Stage1Params& SetRamOnlyMode(bool ramOnlyMode) - turn RAM-only mode on (true)/off (false) (default: off)

  • Stage1Params& SetNBins(uint32_t nBins) - sets the number of intermediate files (in range: [64, 2000], default: 512)

  • Stage1Params& SetNReaders(uint32_t nReaders) - sets the number of reader threads (Warning: only for experienced users)

  • Stage1Params& SetNSplitters(uint32_t nSplitters) - sets the number of splitting threads (Warning: only for experienced users)

  • Stage1Params& SetVerboseLogger(ILogger* verboseLogger) - sets the verbose logger, check the ILogger interface description (default: ignoring verbose logs)

  • Stage1Params& SetPercentProgressObserver(IPercentProgressObserver* percentProgressObserver) - sets the observer of a percent progress, check the IPercentProgressObserver interface description (default: print percent progress on std::cerr)

  • Stage1Params& SetWarningsLogger(ILogger* warningsLogger) - sets the warning logger, check the ILogger interface description (default: print logs on std::cerr)

  • Stage1Params& SetEstimateHistogramCfg(EstimateHistogramCfg estimateHistogramCfg) - sets the configuration of k-mer abundance histogram estimation, available values are EstimateHistogramCfg::DONT_ESTIMATE, EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS and EstimateHistogramCfg::ONLY_ESTIMATE (default: EstimateHistogramCfg::DONT_ESTIMATE). The estimation of the k-mer abundance histogram is performed with our implementation of the ntCard algorithm. Detailed explanation of each setting:

    • EstimateHistogramCfg::DONT_ESTIMATE - the k-mer abundance histogram is not estimated. This is the normal mode, because additional histogram estimation may affect performance (although preliminary experiments suggest the impact is negligible).
    • EstimateHistogramCfg::ONLY_ESTIMATE - only histogram estimation is performed. The intermediate files are not created and the second stage does nothing. Use it when only the histogram estimate is needed.
    • EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS - the histogram is estimated and the rest of the computations are done as usual. This may be used to determine some of the stage 2 parameters based on the histogram.
  • Stage1Params& SetProgressObserver(IProgressObserver* progressObserver) - sets the observer of a progress other than percentage, check the IProgressObserver interface description (default: print progress on std::cerr)
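Two of the setters above document an allowed range. A minimal sketch of checking values against those ranges before passing them to SetSignatureLen and SetNBins could look as follows. The helper names isValidSignatureLen and isValidNBins are hypothetical and not part of the KMC API, and how KMC itself reacts to out-of-range values is not described here.

```cpp
#include <cstdint>

// Hypothetical helpers (not part of the KMC API) that mirror the ranges
// documented for SetSignatureLen and SetNBins.

// Signature length: allowed range [5, 11].
bool isValidSignatureLen(std::uint32_t signatureLen)
{
    return signatureLen >= 5 && signatureLen <= 11;
}

// Number of intermediate files (bins): allowed range [64, 2000].
bool isValidNBins(std::uint32_t nBins)
{
    return nBins >= 64 && nBins <= 2000;
}
```

Such checks are mostly useful when the values come from user input, for example from a configuration file or a GUI form.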

Each parameter may be read using one of the following getters:

  • const std::vector<std::string>& GetInputFiles() const noexcept
  • const std::string& GetTmpPath() const noexcept
  • uint32_t GetKmerLen() const noexcept
  • uint32_t GetNThreads() const noexcept
  • uint32_t GetMaxRamGB() const noexcept
  • uint32_t GetSignatureLen() const noexcept
  • bool GetHomopolymerCompressed() const noexcept
  • InputFileType GetInputFileType() const noexcept
  • bool GetCanonicalKmers() const noexcept
  • bool GetRamOnlyMode() const noexcept
  • uint32_t GetNBins() const noexcept
  • uint32_t GetNReaders() const noexcept
  • uint32_t GetNSplitters() const noexcept
  • ILogger* GetVerboseLogger() const noexcept
  • IPercentProgressObserver* GetPercentProgressObserver() const noexcept
  • ILogger* GetWarningsLogger() const noexcept
  • EstimateHistogramCfg GetEstimateHistogramCfg() const noexcept
  • IProgressObserver* GetProgressObserver() const noexcept

Stage2Params class

This class allows setting all the parameters needed for the second stage of KMC. All the parameters may be set and read using appropriate setters and getters. There are the following setters:

  • Stage2Params& SetMaxRamGB(uint32_t maxRamGB) - sets the maximum amount of RAM (in GB) that KMC is allowed to consume. KMC may still use more memory if that is needed to process the data; to make the limit strict, use the SetStrictMemoryMode method
  • Stage2Params& SetNThreads(uint32_t nThreads) - sets the number of threads (default: std::thread::hardware_concurrency)
  • Stage2Params& SetStrictMemoryMode(bool strictMemoryMode) - enable (true)/disable (false) strict memory mode. If enabled, KMC will not consume more RAM than specified with the SetMaxRamGB method, but the computation may take longer (default: disabled)
  • Stage2Params& SetCutoffMin(uint64_t cutoffMin) - exclude k-mers occurring less than cutoffMin times (default: 2)
  • Stage2Params& SetCounterMax(uint64_t counterMax) - sets maximal value of a counter (default: 255)
  • Stage2Params& SetCutoffMax(uint64_t cutoffMax) - exclude k-mers occurring more than cutoffMax times (default: 1e9)
  • Stage2Params& SetOutputFileName(const std::string& outputFileName) - sets the path of the output file
  • Stage2Params& SetOutputFileType(OutputFileType outputFileType) - sets the format of the output, available values: OutputFileType::KMC and OutputFileType::KFF (default: OutputFileType::KMC)
  • Stage2Params& SetWithoutOutput(bool withoutOutput) - do not produce (true) output file (default: false)
  • Stage2Params& SetStrictMemoryNSortingThreadsPerSorters(uint32_t strictMemoryNSortingThreadsPerSorters) - sets the number of sorting threads per sorter in strict memory mode (Warning: only for experienced users)
  • Stage2Params& SetStrictMemoryNUncompactors(uint32_t strictMemoryNUncompactors) - sets the number of uncompactors in strict memory mode (Warning: only for experienced users)
  • Stage2Params& SetStrictMemoryNMergers(uint32_t strictMemoryNMergers) - sets the number of mergers in strict memory mode (Warning: only for experienced users)
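To illustrate how the cutoffMin, cutoffMax and counterMax parameters described above interact, here is a small model written for this page. It is not KMC code: CountFilter, isCounted and storedCounter are hypothetical names, and the clamping of larger counts to counterMax is an assumption based on the description "maximal value of a counter".

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical model of the stage 2 count filters, using the documented defaults.
struct CountFilter {
    std::uint64_t cutoffMin = 2;           // SetCutoffMin default
    std::uint64_t counterMax = 255;        // SetCounterMax default
    std::uint64_t cutoffMax = 1000000000;  // SetCutoffMax default (1e9)
};

// True if a k-mer occurring `occurrences` times is kept, that is, it is neither
// below cutoffMin nor above cutoffMax.
bool isCounted(std::uint64_t occurrences, const CountFilter& f)
{
    return occurrences >= f.cutoffMin && occurrences <= f.cutoffMax;
}

// Counter value stored for a kept k-mer.
// Assumption: counts larger than counterMax are clamped to counterMax.
std::uint64_t storedCounter(std::uint64_t occurrences, const CountFilter& f)
{
    return std::min(occurrences, f.counterMax);
}
```

With the defaults, a k-mer seen once is dropped, and a k-mer seen 300 times is kept but, under the clamping assumption, stored with a counter of 255.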

Each parameter may be read using one of the following getters:

  • uint32_t GetMaxRamGB() const noexcept
  • uint32_t GetNThreads() const noexcept
  • bool GetStrictMemoryMode() const noexcept
  • uint64_t GetCutoffMin() const noexcept
  • uint64_t GetCounterMax() const noexcept
  • uint64_t GetCutoffMax() const noexcept
  • const std::string& GetOutputFileName() const noexcept
  • OutputFileType GetOutputFileType() const noexcept
  • bool GetWithoutOutput() const noexcept
  • uint32_t GetStrictMemoryNSortingThreadsPerSorters() const noexcept
  • uint32_t GetStrictMemoryNUncompactors() const noexcept
  • uint32_t GetStrictMemoryNMergers() const noexcept

Stage1Results struct

This structure stores the results of the stage 1 run. It contains the following values:

  • double time - time spent by KMC to execute stage 1
  • uint64_t nSeqences - total number of input sequences (usually reads)
  • bool wasSmallKOptUsed - true if small k optimization was used
  • uint64_t nTotalSuperKmers - total number of super-k-mers
  • uint64_t tmpSize - total amount of disk space used by KMC for intermediate files
  • std::vector<uint64_t> estimatedHistogram - estimated histogram, non-empty only if EstimateHistogramCfg::ONLY_ESTIMATE or EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS was set with the SetEstimateHistogramCfg method. The i-th element of this vector stores the estimated number of k-mers occurring i times.
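Because estimatedHistogram is indexed by the number of occurrences, the estimated number of k-mers that would pass a given minimal cutoff is the sum of the histogram from that index onwards. A minimal sketch follows. The function name estimateKmersAtLeast is hypothetical and not part of the KMC API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the KMC API.
// histogram[i] is taken to be the estimated number of k-mers occurring i times,
// as described for Stage1Results::estimatedHistogram.
// Returns the estimated number of k-mers occurring at least cutoffMin times.
std::uint64_t estimateKmersAtLeast(const std::vector<std::uint64_t>& histogram,
                                   std::size_t cutoffMin)
{
    std::uint64_t total = 0;
    // An empty histogram, or a cutoff past its end, simply yields 0.
    for (std::size_t i = cutoffMin; i < histogram.size(); ++i)
        total += histogram[i];
    return total;
}
```

The result can then be used to choose stage 2 parameters such as the memory limit.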

Stage2Results struct

This structure stores the results of the stage 2 run. It contains the following values:

  • double time - time spent by KMC to execute stage 2
  • double timeStrictMem - time spent by KMC processing bins that are too big to fit into the memory limit (only if strict memory mode is used)
  • uint64_t tmpSizeStrictMemory - disk usage related to the strict memory mode
  • uint64_t maxDiskUsage - disk usage peak
  • uint64_t nBelowCutoffMin - the total number of k-mers below minimal cutoff
  • uint64_t nAboveCutoffMax - the total number of k-mers above maximal cutoff
  • uint64_t nTotalKmers - the total number of k-mers
  • uint64_t nUniqueKmers - the total number of unique k-mers

IPercentProgressObserver interface

This interface is used to allow KMC to inform the caller of percentage progress made during computations. There are the following methods to override:

  • virtual void SetLabel(const std::string& label) - KMC will call this method to set the label of current percentage execution
  • virtual void ProgressChanged(int newValue) - KMC will call this method to inform that the percent progress changed
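As an illustration, an observer that records the latest label and percentage instead of printing them might be sketched as below. To keep the snippet compilable without KMC, it declares a stand-in interface in a hypothetical sketch namespace with the two methods listed above. The pure-virtual declarations and the virtual destructor are assumptions. In real code one would include kmc_runner.h and derive from KMC::IPercentProgressObserver instead. RecordingPercentObserver is a hypothetical name.

```cpp
#include <string>

namespace sketch {
// Stand-in for KMC::IPercentProgressObserver (assumed shape, see the text above).
class IPercentProgressObserver {
public:
    virtual void SetLabel(const std::string& label) = 0;
    virtual void ProgressChanged(int newValue) = 0;
    virtual ~IPercentProgressObserver() = default;
};
} // namespace sketch

// Hypothetical observer that stores the most recent label and percentage
// so that the caller can inspect them later.
class RecordingPercentObserver : public sketch::IPercentProgressObserver {
public:
    void SetLabel(const std::string& label) override { label_ = label; }
    void ProgressChanged(int newValue) override { percent_ = newValue; }

    const std::string& label() const { return label_; }
    int percent() const { return percent_; }

private:
    std::string label_;
    int percent_ = 0;
};
```

An instance would be passed by pointer to Stage1Params::SetPercentProgressObserver. Since the setter takes a raw pointer, the object presumably has to outlive the run. This page does not state which thread KMC calls the observer from, so a real implementation may need synchronization.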

IProgressObserver interface

This interface is used to allow KMC to inform the caller of progress made during computations. There are the following methods to override:

  • virtual void Start(const std::string& name) - KMC will call this method to inform that it starts the phase labeled as name
  • virtual void Step() - KMC will call this method to inform that some progress was made
  • virtual void End() - KMC will call this method to inform that the phase is completed
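A sketch of an observer that counts how many steps were reported for each phase is shown below. As before, the interface in the hypothetical sketch namespace is only a stand-in with the three methods listed above (pure-virtual declarations and the virtual destructor are assumptions). Real code would derive from KMC::IProgressObserver. StepCountingObserver is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace sketch {
// Stand-in for KMC::IProgressObserver (assumed shape, see the text above).
class IProgressObserver {
public:
    virtual void Start(const std::string& name) = 0;
    virtual void Step() = 0;
    virtual void End() = 0;
    virtual ~IProgressObserver() = default;
};
} // namespace sketch

// Hypothetical observer that remembers each phase name together with the
// number of Step() calls received while that phase was the most recent one.
class StepCountingObserver : public sketch::IProgressObserver {
public:
    void Start(const std::string& name) override
    {
        phases_.emplace_back(name, std::size_t{0});
    }
    void Step() override
    {
        if (!phases_.empty())
            ++phases_.back().second;
    }
    void End() override {} // nothing to do, the counts are already stored

    const std::vector<std::pair<std::string, std::size_t>>& phases() const
    {
        return phases_;
    }

private:
    std::vector<std::pair<std::string, std::size_t>> phases_;
};
```

An instance would be passed by pointer to Stage1Params::SetProgressObserver.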

ILogger interface

This interface allows KMC to log messages (e.g. warnings, verbose logs). There is the following method to override:

  • virtual void Log(const std::string& msg) - KMC will call this method to log a message msg.

The KMC namespace contains a couple of classes implementing these interfaces. Their purpose is to define the default behavior of KMC, which in most cases is writing to std::cerr, or ignoring the KMC messages entirely (e.g. verbose logs). By implementing these interfaces, one can use KMC from a non-command-line environment, e.g. a GUI.
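For example, a logger that collects messages in memory, so that an application can display them in its own window, might be sketched as follows. The interface in the hypothetical sketch namespace is a stand-in with the single documented method (the pure-virtual declaration and the virtual destructor are assumptions). Real code would derive from KMC::ILogger. CollectingLogger is a hypothetical name.

```cpp
#include <string>
#include <vector>

namespace sketch {
// Stand-in for KMC::ILogger (assumed shape, see the text above).
class ILogger {
public:
    virtual void Log(const std::string& msg) = 0;
    virtual ~ILogger() = default;
};
} // namespace sketch

// Hypothetical logger that keeps every message in the order it was received.
class CollectingLogger : public sketch::ILogger {
public:
    void Log(const std::string& msg) override { messages_.push_back(msg); }

    const std::vector<std::string>& messages() const { return messages_; }

private:
    std::vector<std::string> messages_;
};
```

An instance would be passed by pointer to Stage1Params::SetWarningsLogger or Stage1Params::SetVerboseLogger.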

More complex example

Below is the code of a more complex example. During the first stage the k-mer abundance histogram is estimated in order to estimate the total number of counted k-mers, which in turn is used to compute the RAM limit for the second stage.

#include "include/kmc_runner.h"
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    try
    {
        std::vector<std::string> inputFiles { "test_file1.fq", "test_file2.fq"};
        KMC::Stage1Params stage1Params;

        stage1Params.SetInputFiles(inputFiles)
            .SetKmerLen(31)
            .SetNThreads(8)
            .SetMaxRamGB(10)
            .SetEstimateHistogramCfg(KMC::EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS);

        KMC::Runner kmc;

        auto stage1Results = kmc.RunStage1(stage1Params);

        uint32_t cutoffMin = 5;

        uint64_t nUniqCountedKmersEst{};
        for(uint32_t i = cutoffMin ; i < stage1Results.estimatedHistogram.size() ; ++i)
            nUniqCountedKmersEst += stage1Results.estimatedHistogram[i];
        
        std::cout << "#uniq counted kmers estimate: " << nUniqCountedKmersEst << "\n";

        // Now let's assume that the total amount of memory we allow KMC to use depends on the number of unique k-mers,
        // and that for each unique k-mer we want to spend at most bitsPerUniqueKmer bits.
        double bitsPerUniqueKmer = 20;
        double ramBits = bitsPerUniqueKmer * nUniqCountedKmersEst;
        double ramBytes = ramBits / 8;
        uint32_t ramForStage2 = static_cast<uint32_t>(ramBytes / 1000 / 1000 / 1000); // whole GB, truncated

        if (ramForStage2 < 2)
            ramForStage2 = 2; //at least 2 GB needed for kmc

        std::cout << "ram for Stage 2 (GB): " << ramForStage2 << "\n";

        KMC::Stage2Params stage2Params;

        stage2Params.SetNThreads(8)
            .SetMaxRamGB(ramForStage2)
            .SetCutoffMin(cutoffMin)
            .SetOutputFileName("kmers")
            .SetStrictMemoryMode(true);

        auto stage2Results = kmc.RunStage2(stage2Params);

        std::cout << "#total counted k-mers: " << stage2Results.nTotalKmers << "\n";
        std::cout << "#unique k-mers: " << stage2Results.nUniqueKmers << "\n";
        std::cout << "#unique counted k-mers: " << stage2Results.nUniqueKmers - stage2Results.nBelowCutoffMin - stage2Results.nAboveCutoffMax <<"\n";
        std::cout << "#sequences: " << stage1Results.nSeqences << "\n";


    }
    catch(const std::exception& err)
    {
        std::cerr << err.what() << "\n";
    }
}
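The memory computation from the example above can also be factored into a small function, which makes the arithmetic easier to check in isolation. This is only a sketch: ramForStage2GB is a hypothetical name, and the 2 GB floor simply follows the comment in the example.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper, not part of the KMC API.
// Converts an estimated number of unique counted k-mers and a per-k-mer budget
// in bits into a RAM limit in whole gigabytes (1 GB taken as 10^9 bytes).
std::uint32_t ramForStage2GB(std::uint64_t nUniqueKmersEst, double bitsPerKmer)
{
    const double bytes = bitsPerKmer * static_cast<double>(nUniqueKmersEst) / 8.0;
    // The conversion truncates toward zero, as in the example above.
    const std::uint32_t gb = static_cast<std::uint32_t>(bytes / 1e9);
    // The example treats 2 GB as the minimum that KMC needs.
    return std::max<std::uint32_t>(gb, 2);
}
```

For instance, 4 * 10^9 k-mers at 20 bits each give 8 * 10^10 bits, which is 10^10 bytes, so the function returns 10. Small inputs are raised to the 2 GB floor.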