This is a set of Suffix Array implementations using the SA-IS algorithm and the Skew Algorithm.
LCP Array construction from a Suffix Array is implemented by the Kasai Algorithm and used to find the Longest Common Substring in k-Strings.
Resources to learn about the algorithim, Suffix Arrays, and their applications can be found at the end of this readme.
In order to build the project, you will need to install:
- CMake
- Make
Clone the Repository.
git clone "https://github.com/Tascate/Suffix-Array-Implementation.git"
Compile using CMake.
cmake .
Build using the Makefile.
make
Run to view test code.
./SuffixArray
Given the following code:
#include "Skew.h"
Skew sa;
sa.addString("cabbage");
sa.makeSuffixArray();
sa.printSuffixArray();
This will output the computed Suffix Array using the Skew Algorithm:
Suffix Array : 7 1 4 3 2 0 6 5
The abstract class SuffixArray represents a generic base class that any class can derive from to implement different Suffix Array Construction Algorithms. The classes: Naive, Skew, SAIS, all derive from the SuffixArray class to implement their respective algorithms.
O(n) time, O(n) work
The algorithm is fairly simple to implement, requiring less code overall than the SA-IS implementation. Because of this, it is quite useful for quick testing and usage of Suffix Arrays. This implementation has support for multiple strings which is useful for finding the LCS of k-Strings.
O(n) time, O(n) work
This is an implementation of the SA-IS algorithm for my own understanding as well as improving my C++. The code is largely based on Screwtape's A walk through the SA-IS Suffix Array Construction Algorithm and may not necessarily be optimized. Screwtape's walkthrough is a great resource for learning the SA-IS algorithm!
O(n) time, O(1) work
A simple algorithm to construct the LCP Array from a Suffix Array in linear time. In this code, the LCP Array can be used to find the Longest Common Substring.
O(n) time, O(n) work)
Finding the LCS is based upon the premise that for a sliding window from i to j on the LCP array. The LCP value for that sliding window is the minimum value from i+1 to j provided that the first character for the suffix i and suffix j are the same. Finding the minimum value in our sliding window takes amortized O(1) time using a deque. An algorithim for finding the minimum value of a sliding window can be found here.
We can then use bookkeeping to find the longest common substring.
For coding simplicity, finding out the parent string for a given suffix takes O(logm) time (where m = # of strings) and O(1) space using binary search. However, it is possible to optimize for O(1) time by tracking the parent string for a suffix in memory. (For example, a map or array)
- SA-IS: "Linear Suffix Array Construction by Almost Pure Induced-Sorting" by G. Nong, S. Zhang and W. H. Chan"
- Skew: "Simple Linear Work Suffix Array Construction" by Juha Kärkkäinen and Peter Sanders)
- Kasai: "Linear-Time Longest Common-Prefix Computation in Suffix Arrays and Its Applications" by Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa and Kunsoo Park