Bryce M. Henson
Matlab code for fast masking/selection of ordered vectors based on binary search.
Status: This code is ready for use in other projects. Testing is implemented and passing.
Selecting the elements of a vector that lie between some limits (herein masking, sometimes known as 'gating') is a widely used analytical tool (e.g. in particle physics) commonly employed in the analysis routines of the He* BEC group (@spicydonkey,@GroundhogState).
The common approach to masking data compares each of the n elements to the upper and lower limits (herein brute compare) and has complexity O(n). The novel contribution of this code is a demonstration of a relatively simple approach that uses a binary search algorithm ( O(log(n)) ) on an ordered vector to achieve superior performance in two use cases:
- Data that is already sorted ( O(log(n)) cf. brute O(n) ).
- When there is a requirement to repeatedly (m times) mask the same data, such that the initial cost of the sort is offset by the increased speed of the mask operation ( O(n·log(n)+m·log(n)) cf. O(n·m) ).
Note that if m is small and you first check that the data is ordered (e.g. issorted(data), which is itself an O(n) pass), you have probably already lost most of the potential speedup.
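To get a feel for when the second use case pays off, the scaling expressions above can be evaluated directly. This is only a rough operation-count sketch: it ignores all constant factors, assumes two binary searches per mask (one per limit), and the values of n and m are made up for illustration. It is not a timing measurement.

```matlab
% Rough operation-count model (constant factors ignored, illustrative only).
n = 1e6;                 % number of data points (assumed value)
m = [1 10 100 1000];     % number of masks applied to the same data (assumed values)

brute_ops  = n .* m;                       % brute compare: O(n) per mask
search_ops = n*log2(n) + 2.*m.*log2(n);    % one sort, then two searches per mask

% ratio > 1 means the sort-then-search route needs fewer operations in this model
ratio = brute_ops ./ search_ops;
disp([m; ratio]);
```

In this simplified model the break-even point sits near m ≈ log2(n), which is about 20 for n = 1e6. Real timings depend heavily on constant factors, so the benchmark scripts in this project are the better guide.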
There are two things that the user may want from this masking operation:
- Returning the number of data points (or counts) in this mask (herein return mask count).
  - This is where the search-based algorithm really shines compared to the brute mask, as it just subtracts the upper and lower index while the brute compare must count up the logical vector (see details).
- Returning the values of the data that make it through this mask (herein return mask values).
  - This has a smaller speedup because copying a subset of a vector (even a contiguous block) is substantially slower than the search.
The code here demonstrates the algorithm for both types (counting and subset) in native MATLAB and provides a number of tests in order to compare the performance. For taking small slices of large (>1e6 elements) sorted vectors, a speedup of 1000x is demonstrated.
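The idea can be sketched with two plain binary searches written inline. This is not the fast_sorted_mask implementation from this project, just a minimal illustration that assumes an ascending sorted vector and strict limits (matching data>lower_lim & data<upper_lim); the variable names first_idx and last_idx are made up for the example.

```matlab
% Minimal sketch of search-based masking on a sorted vector (illustrative only).
data = sort(rand(1e6,1));    % the vector MUST be sorted ascending
lower_lim = 0.25;
upper_lim = 0.26;
n = numel(data);

% Search 1: first index with data(idx) > lower_lim (gives n+1 if there is none).
lo = 1; hi = n + 1;
while lo < hi
    mid = floor((lo + hi)/2);
    if data(mid) > lower_lim
        hi = mid;
    else
        lo = mid + 1;
    end
end
first_idx = lo;

% Search 2: first index with data(idx) >= upper_lim; the last element
% inside the mask is the one just before it.
lo = 1; hi = n + 1;
while lo < hi
    mid = floor((lo + hi)/2);
    if data(mid) >= upper_lim
        hi = mid;
    else
        lo = mid + 1;
    end
end
last_idx = lo - 1;

num_mask    = last_idx - first_idx + 1;   % count: just an index subtraction
subset_mask = data(first_idx:last_idx);   % values: copy of a contiguous block
```

Each search touches about log2(n) elements, so the count needs no pass over the data at all. When nothing lies between the limits, last_idx is first_idx-1, giving a count of zero and an empty subset.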
The brute compare implementation is straightforward using a logical mask vector:
mask=data>lower_lim & data<upper_lim
(If you are not familiar with logical indexing, see the MATLAB documentation.) The number of counts (integer) may then be extracted as
num_mask=sum(mask)
or the subset of data points (vector) as
subset_mask=data(mask)
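As a small worked example of the brute compare, with made-up numbers chosen so the result can be checked by eye:

```matlab
% Brute compare on a tiny vector (illustrative values).
data = [0.1 0.4 0.5 0.7 0.9];
lower_lim = 0.3;
upper_lim = 0.8;

mask = data>lower_lim & data<upper_lim;   % logical vector, one comparison pair per element
num_mask = sum(mask);                     % counts the true entries
subset_mask = data(mask);                 % logical indexing picks the masked values

disp(num_mask);      % 3
disp(subset_mask);   % 0.4 0.5 0.7
```

Because the comparisons are strict, an element exactly equal to either limit is excluded from the mask.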
The equivalent (but faster) operation using fast_sorted_mask on ordered data is:
mask_idx=fast_sorted_mask(data,lower_lim,upper_lim);
num_mask=mask_idx(2)-mask_idx(1)+1;
subset_mask=data(mask_idx(1):mask_idx(2));
WARNING: the data vector MUST be sorted. See figures above for when it is still advantageous to sort unordered data and then use this approach.
See also fast_search_based_histogram, where I apply similar principles to dramatically speed up histogramming (in certain cases).
Contributors welcome! Drop me an email.
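If the data arrives unordered, one way to prepare it is to sort once up front and keep the permutation, so that any companion vector can be reordered to match. A minimal sketch, where labels is a hypothetical companion vector invented for the example:

```matlab
% Sort once, keep the ordering, and reuse it for companion data (illustrative values).
data   = [0.7 0.1 0.9 0.4 0.5];     % unordered measurements
labels = [10 20 30 40 50];          % hypothetical vector aligned with data

[sorted_data, order] = sort(data);  % one-off O(n·log(n)) cost
sorted_labels = labels(order);      % same permutation applied to the companion vector

disp(sorted_data);    % 0.1 0.4 0.5 0.7 0.9
disp(sorted_labels);  % 20 40 50 10 30
```

The sorted vector can then be masked repeatedly without paying the sort cost again.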
- figure out what the bump in the relative time is at n=10^7.05
- Compile to C
  - attempts have not shown any improvement.
- Fast nd histogram to replace histcn & ndhistc
- add to fileexchange
- Fast 1d histogram based on this approach
- consolidate test scripts
- check what the function does to counts that equal the bin edge
This project would not have been possible without the open source tools that it is based on.
- Benjamin Bernard: binary search modified from fileexchange project binary-search-for-closest-value-in-an-array
- Denis Gilbert: M-file Header Template
- Daniel Eaton: "Smart"/Silent Figure