Scope out/write pseudocode for network-based sampling #9

ChesterHuynh · 2021-03-02T18:49:52Z

SUMMARY

Here, we scope out the code to implement any prior biases in multivariate data. Take X = (X_1, X_2, X_3, ...), where
each X_i has dimensionality d_i. Then we might have a priori notions of how "correlated" samples are along each axis.

For example:

if X is an image, then this corresponds to just a contiguous patch.
if X is a multivariate time-series, then this corresponds to a discontiguous patch described in [WIP] An implementation of discontiguous sampling of the SRerf variant; we call it MTORF(?) neurodata/SPORF#353
if we have further intuition of how to "sample" nearby points that are highly correlated and forming that patch at each node of the decision tree, then we might introduce the notion of a "sampling graph"

TODO

Look into SPORF source code:

Sketch out where new methods would need to be added
Sketch out what these new methods would entail

adam2392 · 2021-03-10T17:48:45Z

So, as inputs passed in from the Python side:

Inputs

sampling_graphs = (G1, G2, ...)
sampling_patch_dimensions = (...)  # Analogous to PATCH_HEIGHT/PATCH_WIDTH min/max

-> G1.shape = dimension of variable/axis 0
-> G2.shape = dimension of variable/axis 1
...

C++ pseudo-algorithm

- pick an arbitrary point in the data matrix
- determine which index (i.e. row) in the sampling graphs (G1, G2, ...) this point falls in
- according to the patch dimensions, perform sampling
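
The three steps above can be sketched in Python for a single axis before committing to a C++ version. This is only a rough sketch: the function name `sample_graph_patch` is hypothetical, and it assumes each row of a sampling graph is read as the probability of stepping from that index to another, which the thread does not pin down.

```python
import numpy as np

def sample_graph_patch(graph, patch_min, patch_max, rng):
    """Sample a set of indices along one axis, guided by a sampling graph.

    Assumption: graph is a (d, d) row-stochastic matrix and graph[i, j] is
    the weight of adding index j once index i is already in the patch.
    """
    d = graph.shape[0]
    # Patch size is drawn uniformly from [patch_min, patch_max], capped at d.
    size = min(int(rng.integers(patch_min, patch_max + 1)), d)
    seed = int(rng.integers(d))  # arbitrary starting point on this axis
    patch = [seed]
    while len(patch) < size:
        # Pool the weights of every index already in the patch,
        # then zero out the indices that were already chosen.
        weights = graph[patch].sum(axis=0)
        weights[patch] = 0.0
        total = weights.sum()
        if total == 0:  # nothing reachable is left (disconnected component)
            break
        patch.append(int(rng.choice(d, p=weights / total)))
    return sorted(patch)

rng = np.random.default_rng(0)
connected = np.full((2, 2), 0.5)
disconnected = np.eye(2)
print(sample_graph_patch(connected, 2, 2, rng))
print(sample_graph_patch(disconnected, 2, 2, rng))
```

With this reading, a disconnected graph can never grow a patch past its starting component, even when the requested patch size is larger.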

Summary of Proposed Work

I think to make things backwards compatible for the sake of comparison, for now we can assume we're adding a new projection_matrix variant: G-Rerf (i.e. graph-sampling RerF?) heh.

List of files in Python:

In Python/rerf/rerfClassifer.py: Alter the inputs to include sampling_graphs, sampling_patch_min, and sampling_patch_max. To keep it "backwards compatible" for now, we can keep the image width/height and patch height/width parameters but raise an error if both sets are passed in(?)

Errors and Python backend computations:

The sampling_graphs and sampling_patch_min/max arguments MUST have the same length, or an error is raised.
The graphs in sampling_graphs must all be square and symmetric.
Moreover, they must be row-wise stochastic (rows add up to 1); we can normalize the rows if they are not and raise a warning, or just error altogether and require the user to normalize them (I'm leaning towards erroring and making the user aware).
Moreover, one should be able to derive the "data shape" from the size of the sampling graphs:

shape_arr = []
for graph in sampling_graphs:
    shape_arr.append(graph.shape[0])

These shapes should be checked during fit(X, y) if the projection_matrix is G-Rerf: the entries of shape_arr should sum to the number of columns in X. In addition, we should assume that the columns of X are flattened in the same way the images are currently flattened (note that NumPy's ndarray.flatten defaults to row-major, i.e. C order, not column-wise).
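
A minimal sketch of these Python-side checks, taking the "error rather than normalize" option. The function name `validate_sampling_graphs`, the `n_features` argument, and the tolerance are assumptions made for illustration, not existing rerf API.

```python
import numpy as np

def validate_sampling_graphs(sampling_graphs, patch_min, patch_max,
                             n_features, atol=1e-8):
    """Check the sampling graphs and return the derived shape_arr."""
    if not (len(sampling_graphs) == len(patch_min) == len(patch_max)):
        raise ValueError("sampling_graphs and sampling_patch_min/max "
                         "must have the same length")
    shape_arr = []
    for axis, graph in enumerate(sampling_graphs):
        graph = np.asarray(graph, dtype=float)
        if graph.ndim != 2 or graph.shape[0] != graph.shape[1]:
            raise ValueError(f"graph {axis} is not square")
        if not np.allclose(graph, graph.T, atol=atol):
            raise ValueError(f"graph {axis} is not symmetric")
        if not np.allclose(graph.sum(axis=1), 1.0, atol=atol):
            raise ValueError(f"graph {axis} is not row-stochastic")
        shape_arr.append(graph.shape[0])
    # Per the check described above: the sizes must sum to X.shape[1].
    if sum(shape_arr) != n_features:
        raise ValueError("graph sizes do not match the number of columns in X")
    return shape_arr

G = np.full((2, 2), 0.5)
print(validate_sampling_graphs([G, G], (1, 1), (2, 2), n_features=4))
```

In fit(X, y), `n_features` would presumably be `X.shape[1]`.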

Any data that needs to then be passed down to C++: in RerF.py, it seems, one can do something like forestClass.setParameters("...", X); idk if it accepts numpy datatypes? However, we need to expose the ability to pass this into C++ side.

List of files in C++:

Getting parameters to the fpForest class via setParameter in src/baseFunctions/fpForest.h:73-85:

    inline void setParameter(const std::string& parameterName, const std::string& parameterValue) {
        fpSingleton::getSingleton().setParameter(parameterName, parameterValue);
    }

    inline void setParameter(const std::string& parameterName, const double parameterValue) {
        fpSingleton::getSingleton().setParameter(parameterName, parameterValue);
    }

    inline void setParameter(const std::string& parameterName, const int parameterValue) {
        fpSingleton::getSingleton().setParameter(parameterName, parameterValue);
    }

Once we have the "parameters" successfully set in the fpSingletonClass, we can then hypothetically access it easily. Potentially it can be done via: https://stackoverflow.com/questions/30388170/sending-a-c-array-to-python-and-back-extending-c-with-numpy

Possibly add checks at the C++ level, which should be easier once we get the numbers into a C++ array: packedForest/src/fpSingleton/fpSingleton.h. This might involve checking the formatting of the arrays, datatypes, etc.
Finally implement the new sampling procedure in packedForest/src/forestTypes/binnedTree/processingNodeBin.h, for G-RerF.

Test Scenario

Specify a two-node graph, either connected or disconnected.
The signal is generated as 1's and -1's, where X = (X_1, X_2). The classes are then:

class 1: X_1 = 1, X_2 = -1
class 2: X_1 = -1, X_2 = 1

connected, sampling_dim_min/max = 2/2: should be 50% accuracy
connected, sampling_dim_min/max = 1/2: should get to 100% accuracy
disconnected, sampling_dim_min/max = 1/2: should get to 100% accuracy
disconnected, sampling_dim_min/max = 2/2: should get to 100% accuracy
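
The data for this test is easy to generate, and doing so shows why the first case is expected to sit at chance. The sketch below uses a hypothetical helper `make_two_node_data`, encodes class 1 / class 2 as labels 0 / 1, and assumes a patch projection sums the features it covers with unit weights (an assumption about the projection, not something stated above).

```python
import numpy as np

def make_two_node_data(n_samples, rng):
    """Two features: class 1 is (1, -1) and class 2 is (-1, 1)."""
    y = rng.integers(0, 2, size=n_samples)  # 0 -> class 1, 1 -> class 2
    X = np.where(y[:, None] == 0, [1.0, -1.0], [-1.0, 1.0])
    return X, y

rng = np.random.default_rng(0)
X, y = make_two_node_data(100, rng)

# A patch covering both nodes sums to the same value for every sample,
# so it carries no class information.
both_nodes = X.sum(axis=1)
print(np.unique(both_nodes))

# A patch covering a single node separates the classes exactly.
single_node = X[:, 0]
print(np.all((single_node < 0) == (y == 1)))
```

Under that assumption, forcing a two-node patch on the connected graph leaves only the uninformative projection, while any setting that allows a one-node patch can reach a perfect split.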

Application Scenarios

thresholded Euclidean pairwise distance graphs in channel axis
thresholded nearby points in frequency axis
exponentially decaying weight functions on the Euclidean distance (A_ij = exp(-||x_i - x_j||))

All scenarios that "add" your prior belief about how the data might be correlated
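
As a rough illustration of how such graphs could be built with NumPy, the sketch below constructs a thresholded-distance graph and an exponentially decaying one from point coordinates. All function names, the `radius` parameter, and the choice to zero the diagonal are assumptions for illustration only.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distance between every pair of rows in coords (n, k)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def thresholded_graph(coords, radius):
    """A_ij = 1 if points i and j are within radius of each other, else 0."""
    adj = (pairwise_distances(coords) <= radius).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops (a design choice)
    return adj

def exp_decay_graph(coords):
    """A_ij = exp(-||x_i - x_j||), with the diagonal zeroed."""
    adj = np.exp(-pairwise_distances(coords))
    np.fill_diagonal(adj, 0.0)
    return adj

def row_normalize(adj):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    row_sums = adj.sum(axis=1, keepdims=True)
    return np.divide(adj, row_sums, out=np.zeros_like(adj), where=row_sums > 0)

# Example: three channels on a line at positions 0, 1 and 3.
coords = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
print(thresholded_graph(coords, radius=1.5))
print(row_normalize(exp_decay_graph(coords)))
```

For the frequency axis, `coords` could simply be the frequencies as an (n, 1) column. Row-normalizing a symmetric matrix generally does not keep it symmetric, so the symmetric and row-stochastic requirements may need to be checked at different stages.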

adam2392 · 2021-03-11T01:16:11Z

More info on the topic of passing numpy arrays from Python -> C++ via pybind11:

ChesterHuynh · 2021-03-16T06:10:22Z

Re: Last point of 1. Are we expecting a graph for each time step? I.e. will sampling graphs be a list/array of matrices?

shape_arr = []
for graph in sampling_graphs:
    shape_arr.append(graph.shape[0])

If so, then the above snippet makes sense. Otherwise, need a little more clarification

adam2392 · 2021-03-16T14:00:35Z

Yep a list/array of matrices. So we should expect a graph for each time step, and possibly... we can allow for the "default" SRerf functionality by passing in None? Idk we can play around w/ it as long as it's simple. E.g.

sampling_graphs = (arr_1, None, arr_2) would impose a sampling graph on axes 0 and 2 but use SRerf on axis 1 (i.e. nearest neighbors in a patch).
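
One way to handle the None entries, sketched under assumptions: since None carries no shape, the axis sizes would have to be supplied separately (the `axis_sizes` argument here is hypothetical), and the nearest-neighbor default is modeled as a chain graph linking each index to its immediate neighbors. The helper name `resolve_sampling_graphs` is likewise made up for illustration.

```python
import numpy as np

def resolve_sampling_graphs(sampling_graphs, axis_sizes):
    """Replace None entries with a nearest-neighbor (chain) graph."""
    resolved = []
    for graph, size in zip(sampling_graphs, axis_sizes):
        if graph is None:
            # Chain graph: index i is linked to i - 1 and i + 1 only.
            graph = np.eye(size, k=1) + np.eye(size, k=-1)
        resolved.append(np.asarray(graph, dtype=float))
    return resolved

arr_1 = np.full((2, 2), 0.5)
graphs = resolve_sampling_graphs((arr_1, None, arr_1), axis_sizes=(2, 3, 2))
print(graphs[1])
```

The chain graph here is left unnormalized; whether the default path should go through the same row-stochastic check as user-supplied graphs is an open choice.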

ChesterHuynh added the planning Exploration, research, or sketching pseudocode label Mar 2, 2021

adam2392 assigned adam2392 and ChesterHuynh Mar 10, 2021
