A fast substitution to the stdlib's strstr()
sub-string search function.
fast_strstr()
is significantly faster than most sub-string search algorithms
when searching relatively small sub-strings, such as words. We recommend any
user to benchmark the algorithm on their data as it uses the same interface as
strstr()
before discarding other algorithms.
Its worst case complexity (O(n × m)
where n
is the length of the string and
m
the length of the searched sub-string) is the same as the naive brute-force
algorithm but it mostly runs with a linear complexity (O(n)
) on most strings.
Unlike other efficient sub-string search algorithms, it doesn't try to skip characters of the string but reads every character of the string. However, the amount of work done for each character is substantially smaller.
This algorithm is thus faster when the searched sub-string has a small number of characters as other algorithms are not able to skip a large number of characters.
Its starts by computing the sum of the character values of the sub-string.
It then reads the string by shifting a reading window the size of the sub-string. It knows the sum of the character values inside this window and is able to update this sum when shifting it.
Comparing the two sub-string (the one inside the iterating window and the searched one) only occurs when the both sums are equal (actually when sums difference equals zero for performance issues). This trick enable the algorithm to skip a large number of comparisons.
A reference implementation of the algorithm is freely available here.
This algorithm is a special case of the Rabin-Karp algorithm for sub-string search which uses a simple sum as rolling hash function.
The algorithm has been benchmarked together with three other algorithms by searching all occurrences of 100 random Latin words in portions of Commentarii de Bello Gallico by Julius Caesar. Each test was run 100 times on an AMD Phenom II X4 965 and the best time was taken.
strstr()
is from the GNU C Library (version 2.19) and uses the Two Ways
Algorithm, naive strstr()
is a naive brute-force implementation and
Volnitsky's strstr()
is
an algorithm by Leonid Volnitsky
whereas fast_strstr()
is our implementation.
Scores where compared to strstr()
.
Section of 10, 100, 500, 1000, 5000, 10000, 50000 characters of Bello Gallico along with the full text (147277 characters) have been used.
Algorithm \ Size | 10 | 100 | 500 | 1000 |
---|---|---|---|---|
strstr() | 3.6 µs (1×) | 12.4 µs (1×) | 78.3 µs (1×) | 161 µs (1×) |
naive strstr() | 9.7 µs (2.7×) | 89.8 µs (7.2×) | 479 µs (6.1×) | 943 µs (5.8×) |
Volnitsky | 138 µs (39×) | 146 µs (11.8×) | 186 µs (2.4×) | 239 µs (1.5×) |
fast_strstr() | 2.3 µs (0.6×) | 6.8 µs (0.5×) | 48 µs (0.6×) | 102 µs (0.6×) |
Algorithm \ Size | 5000 | 10000 | 50000 | Full text (147277) |
---|---|---|---|---|
strstr() | 841 µs (1×) | 1.7 ms (1×) | 8.5 ms (1×) | 25.2 ms (1×) |
naive strstr() | 4.8 ms (5.6×) | 9.6 ms (5.6×) | 48 ms (5.6×) | 140 ms (5.52×) |
Volnitsky | 635 µs (0.7×) | 1.3 ms (0.8×) | 10 ms (1.2×) | 58.6 ms (2.3×) |
fast_strstr() | 540 µs (0.6×) | 1.1 ms (0.6×) | 5.5 ms (0.6×) | 17.2 ms (0.6×) |
Notice that as the algorithm doesn't require to pre-process the sub-string while other algorithms such as Volnitsky's do, it is also fast on short strings.
Benchmarks were also tried by third parties on newer Intel's Haswell and
Sandy Bridge processors, with less impressive results (strstr()
significantly defeated fast_strstr()
when reading large texts).
We are working on what could give such poorer results. As said in the
introduction, we suggest you to try the algorithm on your data before discarding
strstr()
.
Benchmark scripts are available here. These have been written in Haskell and require the Criterion benchmarking library to run.
If you want to run the benchmarks on your hardware, download and install the
Glasgow Haskell Compiler and its library manager Cabal. Then go in the
benchmark/ directory and ask cabal
to install the required
libraries :
cabal install --only-dependencies
Then compile the benchmark executable using cabal build
. The resulting
executable should be located at benchmark/dist/build/benchmark/benchmark
.
This algorithm is licensed under the open-source BSD3 license :
Copyright (c) 2014, Raphael Javaux All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
-
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
-
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
-
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.