Framework Algs
Summary: Here we explain where an individual algorithm fits into the five key API functions of NEXT.
We will now look at the code for an example algorithm for PoolBasedBinaryClassification. The algorithm code obeys an interface defined in apps/PoolBasedBinaryClassification/algs/Algs.yaml and the actual code is given in apps/PoolBasedBinaryClassification/RandomSamplingLinearLeastSquares/RandomSamplingLinearLeastSquares.py.
Algs.yaml provides a uniform interface that every algorithm has to satisfy. As with PoolBasedBinaryClassification.yaml, the interface is specified in a standard YAML format, described here. Requiring a specific interface makes it easier to know what your algorithm is allowed to take as inputs and return as outputs. In projects where there are multiple algorithms for a given application, this makes it easier for many developers to work together and to spread the task of developing algorithms.
initExp:
  args:
    n:
      type: num
      description: Number of targets available.
  rets:
    type: bool
    description: A boolean indicating if the algorithm initialization succeeded or failed
    values: true
getQuery:
  args:
    participant_uid:
      type: str
      description: Participant unique ID
  rets:
    type: num
    description: The index of the target to ask about
processAnswer:
  args:
    target_index:
      type: num
      description: The ID of the target we are asking about
    target_label:
      type: num
      description: The label assigned to the target
  rets:
    description: Indicates if the algorithm succeeded
    type: bool
    values: true
getModel:
  rets:
    type: dict
    description: The current state of the model
    values:
      weights:
        type: list
        description: The linear model weights
        values:
          type: num
      num_reported_answers:
        type: num
        description: The number of reported answers (for this algorithm)
This interface is best understood when compared to the algorithm file, RandomSamplingLinearLeastSquares.py. The args in each API function correspond to the inputs of that API function in the algorithm. The rets correspond to the outputs of those functions.
import numpy.random

class RandomSamplingLinearLeastSquares:
    def initExp(self, butler, n, d):
        return ...

    def getQuery(self, butler, participant_uid):
        return ...

    def processAnswer(self, butler, target_index, target_label):
        return ...

    def getModel(self, butler):
        return ...

    def full_embedding_update(self, butler, args):
        ...
Ignoring the butler input, we see that the args to initExp should be n and d. These are precisely the inputs to the associated initExp function. Note that these also correspond to the keys in the inputs to the alg call in initExp in the application code. We recommend checking this consistency of inputs across Algs.yaml, the algorithm code, and the application code for each API function.
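One lightweight way to perform such a check is sketched below. This is a hypothetical helper, not part of NEXT; it compares the args declared for each API function in Algs.yaml with the parameters of the corresponding method on the algorithm class (ignoring self and butler):

import inspect
import yaml  # PyYAML

def check_alg_against_yaml(alg_class, algs_yaml_path):
    # Hypothetical helper: report any mismatch between Algs.yaml and the code.
    with open(algs_yaml_path) as f:
        spec = yaml.safe_load(f)
    for api_function, details in spec.items():
        declared = set((details.get('args') or {}).keys())
        params = inspect.signature(getattr(alg_class, api_function)).parameters
        in_code = set(params) - {'self', 'butler'}
        if declared != in_code:
            print('{}: Algs.yaml declares {} but the code takes {}'.format(
                api_function, sorted(declared), sorted(in_code)))

# e.g. check_alg_against_yaml(RandomSamplingLinearLeastSquares,
#                             'apps/PoolBasedBinaryClassification/algs/Algs.yaml')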
We now describe the API functions in RandomSamplingLinearLeastSquares.py in detail.
As we can see above, each API function takes in a butler object. Algorithm-specific variables should be set and retrieved using the butler.algorithms collection. The butler has to be used to ensure that variables are stored in the NEXT database and can be retrieved across different workers and user web sessions. The algorithm's butler is not restricted to butler.algorithms: as the code below shows, it can also read other collections such as butler.participants and butler.targets, and the butler.experiment collection is available as well. Again, the full set of features of the butler is documented in the Butler API.
initExp is run only once, at the very beginning of the experiment.
def initExp(self, butler, n, d):
    # Save the number of targets and the feature dimension to algorithm storage
    butler.algorithms.set(key='n', value=n)
    butler.algorithms.set(key='d', value=d)
    # Initialize the weights to a list of 0's (d features plus a bias term)
    butler.algorithms.set(key='weights', value=[0]*(d+1))
    return True
The initExp function is very simple. It saves n and d, and an all-zeros weights vector (representing the weights in our least squares linear model), in the butler.algorithms collection. You may recall that these values are also stored in the butler.experiment collection. Again, best practice dictates that any variables needed by an algorithm be stored in and retrieved from the butler.algorithms collection directly.
A note about atomicity: several workers may execute these API functions concurrently, so when a stored value can be updated from more than one place at once (such as the answer counter below), use the butler's atomic operations like increment and append rather than a get-modify-set pattern, which can silently lose updates.
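For example, contrast a read-modify-write update of the answer counter with the atomic increment used in processAnswer below (both use only butler methods that appear on this page):

# Race-prone: two workers can read the same value and both write back value + 1,
# silently losing one of the answers.
num = butler.algorithms.get(key='num_reported_answers')
butler.algorithms.set(key='num_reported_answers', value=num + 1)

# Atomic: the database performs the increment itself, so concurrent workers
# cannot clobber each other's updates.
num_reported_answers = butler.algorithms.increment(key='num_reported_answers')

With that in mind, here is getQuery: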
def getQuery(self, butler, participant_uid):
    # Retrieve the number of targets
    n = butler.algorithms.get(key='n')
    # Get the list of queries already answered by this participant
    asked_queries = butler.participants.get(uid=participant_uid, key='asked_queries')
    # If we have asked this participant to label all the targets, return 0
    if len(asked_queries) == n:
        return 0
    # Choose a random target that this participant has not labelled yet
    i = numpy.random.choice(n)
    while i in asked_queries:
        i = numpy.random.choice(n)
    return i
In an active algorithm, the procedure for returning the next active query is at the heart of the algorithm. In this example, our (in)active algorithm is very simple: it just returns a random index between 0 and the number of targets minus 1. This index corresponds to the target_id of the random target that we wish to show to the user. We also want to ensure that the user has not labelled this item previously. It is a good idea to review how this index is used by the application code.
Note that we retrieve the set of answered queries from the butler.participants collection. Our decision to return 0 if we run out of targets is arbitrary. It is up to the developer to decide how to handle that.
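For example, a developer could instead re-ask a random already-labelled target when the pool is exhausted. A minimal, hypothetical variation (not part of the NEXT code) of the check above:

    # Hypothetical alternative: when the participant has already labelled every
    # target, re-ask a random one instead of always returning index 0.
    if len(asked_queries) == n:
        return int(numpy.random.choice(n))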
Note. As we discuss in more depth in widgets, the participant_uid is not associated with a person but rather with a browser session, so refreshing a query page will assign a new participant_uid.
def processAnswer(self, butler, target_index, target_label):
    # S maintains a list of labelled items. Appending to S will create it.
    butler.algorithms.append(key='S', value=(target_index, target_label))
    # Increment the number of reported answers by one.
    num_reported_answers = butler.algorithms.increment(key='num_reported_answers')
    # Run a model update job after every d answers
    d = butler.algorithms.get(key='d')
    if num_reported_answers % int(d) == 0:
        butler.job('full_embedding_update', {}, time_limit=30)
    return True
processAnswer appends the id and the associated label to the S list, the algorithm's internal representation of the set of answered queries. Note that the algorithm could also access the set of queries by calling butler.queries, but this is a much slower operation compared to pulling S, so we recommend against it. The number of reported answers for this algorithm is also incremented.
Finally, an asynchronous job is handed to the butler after every d answers. In our case, the job is full_embedding_update, which uses a least squares model to update our weights. When there are many targets and a large set of answered queries, least squares may be very slow, and it is best not to leave the user waiting for the response from processAnswer. Instead, the model is updated in the background by the butler, and the result can be retrieved and used later.
The downside of this approach is that the weights may be out of date at any given time if the model has not fully updated. This can lead to "stale" queries in algorithms which use the weights to generate active queries, i.e. queries that were not generated from the most up-to-date information. It is up to the application/algorithm developer to manage this tradeoff. We address this issue more carefully in our NIPS paper on NEXT.
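To make this concrete, here is a rough, hypothetical sketch (not part of RandomSamplingLinearLeastSquares) of how an active variant's getQuery might use the possibly stale weights, choosing the unlabelled target whose prediction is closest to the decision boundary (uncertainty sampling). It reuses only butler calls that appear elsewhere on this page and assumes numpy is imported at the top of the file:

def getQuery(self, butler, participant_uid):
    # Hypothetical active variant: uncertainty sampling with possibly stale weights.
    n = butler.algorithms.get(key='n')
    weights = numpy.array(butler.algorithms.get(key='weights'))
    asked_queries = butler.participants.get(uid=participant_uid, key='asked_queries')
    # Pull the targets and their features, sorted by id as in full_embedding_update.
    targets = sorted(butler.targets.get_targetset(butler.exp_uid),
                     key=lambda x: x['target_id'])
    best_i, best_score = 0, float('inf')
    for i in range(n):
        if i in asked_queries:
            continue
        features = numpy.array(targets[i]['meta']['features'] + [1.])  # append bias feature
        score = abs(features.dot(weights))  # distance from the decision boundary
        if score < best_score:
            best_i, best_score = i, score
    # If every target has been asked, this falls back to index 0, as in the code above.
    return best_i

Returning to our actual algorithm, the model update job itself looks like this: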
def full_embedding_update(self, butler, args):
    # Main function to update the model.
    labelled_items = butler.algorithms.get(key='S')
    # Get the list of targets.
    targets = butler.targets.get_targetset(butler.exp_uid)
    # Make sure the targets are sorted by id
    targets = sorted(targets, key=lambda x: x['target_id'])
    # Extract the features from each target and then append a bias feature.
    target_features = [targets[i]['meta']['features'] for i in range(len(targets))]
    for feature_vector in target_features:
        feature_vector.append(1.)
    # Build a list of feature vectors and associated labels.
    X = []
    y = []
    for index, label in labelled_items:
        X.append(target_features[index])
        y.append(label)
    # Convert to numpy arrays and use least squares to find the weights.
    X = numpy.array(X)
    y = numpy.array(y)
    weights = numpy.linalg.lstsq(X, y)[0]
    # Save the weights under the key 'weights'.
    butler.algorithms.set(key='weights', value=weights.tolist())
The embedding update code first pulls the full set of queries asked and answered by this algorithm. The butler.targets collection is then used to extract the associated feature vectors, which are aggregated into a matrix X. The labels are similarly aggregated, and numpy's least squares routine is used to compute the weights and store them.
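As a tiny standalone illustration of the least squares step (the feature vectors and labels here are made up; rcond=None just silences a deprecation warning in newer numpy versions):

import numpy

# Three targets, each with two features plus the appended bias feature of 1.
X = numpy.array([[0., 1., 1.],
                 [1., 0., 1.],
                 [1., 1., 1.]])
# The labels reported for those targets.
y = numpy.array([-1., 1., 1.])

# The same fit performed in full_embedding_update.
weights = numpy.linalg.lstsq(X, y, rcond=None)[0]
print(weights)  # -> approximately [ 2.  0. -1.]

The last API function, getModel, is shown below.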
def getModel(self, butler):
    # The model is simply the vector of weights and a record of the number of reported answers.
    return butler.algorithms.get(key=['weights', 'num_reported_answers'])
getModel is intended to return the data that comprises the classifier. In this case, that is simply the list of weights and the number of reported answers for this algorithm.
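For instance, a consumer of the model could classify a target by the sign of the linear model's output. This is a purely hypothetical sketch: the feature vector is made up, labels are assumed to be coded as -1/+1, and the bias feature of 1. must be appended exactly as full_embedding_update does:

import numpy

# e.g. the dict returned by getModel
model = {'weights': [2.0, 0.0, -1.0], 'num_reported_answers': 3}

features = [0.0, 1.0]        # hypothetical d-dimensional feature vector of a target
features = features + [1.0]  # append the bias feature, matching full_embedding_update

score = numpy.dot(model['weights'], features)
predicted_label = 1 if score >= 0 else -1   # assumes -1/+1 labels
print(score, predicted_label)               # -> -1.0 -1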
The argument checking we use (described here) supports many types (dict, str, num, etc.). It also supports the types "any", "anything" and "stuff", which mean an arbitrary type. When initializing the experiment, you can add a parameter of type "anything" to Algs.yaml and then include it when launching the experiment in an initExp['args']['alg_list'] item.
Of course, some parameters should be changed only by you, the algorithm developer, and not by the user of your algorithm. It's up to you how you want to arrange this; defaults in the appropriate functions might be a good call (globals probably aren't).
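For example, a minimal sketch of the defaults approach, using a purely hypothetical step_size parameter that the Algs.yaml interface never exposes:

def initExp(self, butler, n, d, step_size=0.1):
    # step_size is a hypothetical developer-only parameter: since it is not listed
    # in Algs.yaml, callers never pass it and the default value is what gets used.
    butler.algorithms.set(key='n', value=n)
    butler.algorithms.set(key='d', value=d)
    butler.algorithms.set(key='step_size', value=step_size)
    butler.algorithms.set(key='weights', value=[0]*(d+1))
    return True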