Skip to content

Algorithm Example

Scott Sievert edited this page Feb 17, 2017 · 1 revision

We will implement two algorithms:

  • Round Robin Survey, in which we will start each participant off with a random question and then will cycle that participant through the images in a pre-defined order thereafter. The classifier this algorithm will use will simply be majority vote on the correct label. In particular, this algorithm will essentially be nothing more than a glorified survey.

  • Equal Sampling with Linear Regression, which will present to each participant the image about which the fewest questions have theretofore been answered. If there are multiple such, then we will select one of them at random. The classifier this algorithm will use will be linear regression on the feature vectors.

Round Robin Survey

The code for this algorithm would go in /next/apps/Apps/PoolBasedBinaryClassification/algs/RoundRobinSurvey/RoundRobinSurvey.py, and will contain one class with (at a minimum) the main four functions:

import random, time

class RoundRobinSurvey:
   def initExp(self, butler, ...):
       # ...
   def getQuery(self, butler, ...):
       # ...
   def processAnswer(self, butler, ...):
       # ...
   def getModel(self, butler, ...):
       # ...

initExp:

Because we're going to be cycling through the images, we need in this algorithm to know how many images there are. This function will be the place to get that information and store it for future reference. The mechanism for storing and accessing stored data in NEXT is called the butler.

The full set of features of the butler is documented in the Butler API. To store data that is global to our algorithm, we use the butler.algorithms collection, with its set and get functions.

   def initExp(self, butler, n=0, labels=[]):
      butler.algorithms.set(key='n', value=n)
      butler.algorithms.set(key='results', value=[{l:0 for l in labels} for i in range(n)])

This will store the argument n in a way so that, later, we can access it from elsewhere in the algorithm with butler.algorithms.get(key='n').

You might ask: "Why can I not just save n using self.n = n?" The answer is that NEXT will usually be run with many instances of the same code running in multiple processes or possibly even on multiple computers. So to guarantee future access to data, we have to store it in a database that is shared across all these processes. This is precisely the job of the butler.

getQuery:

getQuery always takes as an argument a dictionary participant_doc, which contains all the information about the participant who is requesting a query. Principally, it contains a participant_uid key, which references a string that uniquely identifies that participant. (NB: this identifier is not associated with a person but with a session, so if they refresh the page they will be assigned a new participant_uid). For our purposes we do not require any further arguments, so the function begins:

   def getQuery(self, butler, participant_doc):
      ...

Now, the logic will be:

  • If we have not seen this participant before, pick a random number up to our stored n and store that as this participant's current image index.

  • Then, regardless of whether this participant is new or not, return their current image index.

We'll store the image index for each participant in the butler.participants collection using something like:

butler.participants.set(uid=participant_uid, key='image_index', value=image_index)

Thus we can tell if a participant exists by checking 'image_index' in participant_doc:

   def getQuery(self, butler, participant_doc):
      participant_uid = participant_doc['participant_uid']
      if not 'image_index' in participant_doc:
         n = butler.algorithms.get(key='n')
	 # we need to have import random, time at the top of the file
	 random.seed(time.time())
	 i = random.randInt(0,n)
	 butler.participants.set(uid=participant_uid, key='image_index', value=i)
      else:
         i = participant_doc['image_index']
      return i

This index will be processed by the framework discussed on Framework-Apps before being sent to the user. That is where the index will be used to look up an actual image.

processAnswer:

When we get a response, the framework discussed on Framework-Apps will handle it first and will pass on to the algorithm whatever arguments are needed by the algorithm. In particular, we can make this function take whatever arguments we need. Certainly, we need to know what participant answered the question, but we don't actually need much more than that.

This function is expected to return a boolean: True if the answer was valid, and False otherwise.

The logic then will then be: As long as image_index is present for this participant, simply increment it modulo n and return True. Otherwise, return False. To wit:

   def processAnswer(self, butler, participant_doc, query_index=-1, label=None):
      n = butler.algorithms.get(key='n')
      if label is None or not (0 <= query_index and query_index < n) or not 'image_index' in participant_doc:
         return False
      participant_uid = participant_doc['participant_uid']
      n = butler.algorithms.get(key='n')
      i = (participant_doc['image_index']+1)%n
      butler.participants.set(uid=participant_uid, key='image_index', value=i)
      results = butler.algorithms.get(key='results')
      results[query_index][label] += 1
      butler.algorithms.set(key='results',value=results)
      return True

getModel:

getModel is intended to return the data that comprises the classifier. In this case, that is simply the list of results:

   def getModel(self, butler):
      return {'results':butler.algorithms.get(key='results')}

Equal Sampling with Linear Regression

The code for this algorithm would go in /next/apps/Apps/PoolBasedBinaryClassification/algs/EqualSamplingLinearRegression/EqualSamplingLinearRegression.py, and will again contain one class with (at a minimum) the main four functions:

import random, time, numpy

class EqualSamplingLinearRegression:
   def initExp(self, butler, ...):
      ...
   def getQuery(self, butler, ...):
      ...
   def processAnswer(self, butler, ...):
      ...
   def getModel(self, butler, ...):
      ...

initExp:

Again, we will only need to receive the number of images n and the list of possible labels in order to initialise this algorithm. However we will also need to create an array for how often we have received data about each image, so we will not only store n, but also an array of n 0's which we shall modify as the experiment progresses. Finally, as we get responses, we will store them in an array called results:

   def initExp(self, butler, n=0):
      butler.algorithms.set(key='n', value=n)
      butler.algorithms.set(key='num_answers', value=[0 for i in range(n)])
      butler.algorithms.set(key='results', value=[])

getQuery:

To get a query, we don't have to care about the participant for this algorithm--we may simply pick a random index out of

   def getQuery(self, butler, participant_doc):
      answers = butler.algorithms.get(key='num_answers')
      smallest = min(answers)
      random.seed(time.time())
      return random.choice([i for i in range(n) if answers[i] = smallest])

processAnswer:

The only thing we'll care about for the answer that comes back is which index it is about. Then we will return True. If the index is outside the valid range, return False

   def processAnswer(self, butler, participant_doc, query_index=-1, label=None):
      n = butler.algorithms.get(key='n')
      if label is None or not (0 <= query_index and query_index < n):
         return False
	 
      # Store the answer we got
      butler.algorithms.append(key='results',value=(query_index,label))

      # Increment the total number of answers received
      butler.algorithms.increment(key='num_reported_answers')
      
      # Increment the number of answers we have received for this query
      answers = butler.algorithms.get(key='num_answers')
      answers[query_index] += 1
      butler.set(key='num_answers',value=answers)
      
      return True

getModel:

Here, the model is our classifier, which we can compute using a standard linear regression. To do this, we will need to get the vectors associated to each image. These are stored in butler.targets, which is a TargetManager object. In particular, we can get all the image data with butler.targets.get_targetset(butler.exp_uid) which will be a dictionary that is populated with the image URLs and metadata that the experiment is initialized with. For example, in our case perhaps the targetset is a list of dictionaries of the form

{'url':image_url, 'meta':{'features':[0,1,4,20.1,-1.9,...]}}

Then we can run

   def getModel(self, butler):
      # Get the pairs of (index, label) that we have stored
      answer_pairs = butler.algorithms.get(key='results')

      # Get the list of feature vectors for each target
      targets = butler.targets.get_targetset(butler.exp_uid)
      targets = sorted(targets,key=lambda x: x['target_id'])
      target_features = [t['meta']['features'] + [1.] for t in targets]

      # Set up for a call to numpy's least squares solver
      X = []
      y = []
      for answer in answer_pairs:
         target_index,target_label = answer
         X.append(target_features[target_index])
         y.append(target_label)
      X = numpy.array(X)
      y = numpy.array(y)

      # Run numpy's least squares routine
      w = numpy.linalg.lstsq(X,y)[0]

      # Return the resulting weights
      return {'weights':value=w.tolist(),'num_reported_answers':butler.algorithms.get(key='num_reported_answers')}

Optimisation:

Notably, this implementation of getModel is going to be very slow, especially as the number of answers we receive grows. Besides, once we compute the weights once, if another call is made to getModel we have to do that whole expensive computation all over again. This may or may not matter for your particular purposes. If it does, there is a way to speed it up: Rather than compute the weights anew every time we run getModel, we say:

"Every 100 answers we receive, recompute the weights and store them. Then getModel can be as fast as simply 'return the stored weights'".

But if processAnswer has to do this computation every time, then won't answering queries be slow? The trick is that the NEXT butler provides a mechanism to run a function in the background in another process, namely the function butler.job. So we will add code t processAnswer to call butler.job to run an auxiliary "update the weights" function every 100 answers. (To avoid clogging up the machine, we put a time limit on each update call of 30 seconds. This parameter may need tuning to ensure best performance):

   def processAnswer(self, butler, participant_doc, query_index=-1, label=None):
      n = butler.algorithms.get(key='n')
      if label is None or not (0 <= query_index and query_index < n):
         return False
	 
      # Store the answer we got
      butler.algorithms.append(key='results',value=(query_index,label))

      # Increment the total number of answers received
      answer_count = butler.algorithms.increment(key='num_reported_answers')
      
      # Increment the number of answers we have received for this query
      answers = butler.algorithms.get(key='num_answers')
      answers[query_index] += 1
      butler.set(key='num_answers',value=answers)

      if (answer_count + 1) % 100 == 0:
         butler.job('update_weights',{},time_limit=30)
      return True
      
   def update_weights(self,butler,args):
      # Get the pairs of (index, label) that we have stored
      answer_pairs = butler.algorithms.get(key='results')

      # Get the list of feature vectors for each target
      targets = butler.targets.get_targetset(butler.exp_uid)
      targets = sorted(targets,key=lambda x: x['target_id'])
      target_features = [t['meta']['features'] + [1.] for t in targets]

      # Set up for a call to numpy's least squares solver
      X = []
      y = []
      for answer in answer_pairs:
         target_index,target_label = answer
         X.append(target_features[target_index])
         y.append(target_label)
      X = numpy.array(X)
      y = numpy.array(y)

      # Run numpy's least squares routine
      w = numpy.linalg.lstsq(X,y)[0]

      # Store the resulting weights
      butler.algorithms.set(key='weights',value=w.tolist())

Then, as promised, getModel becomes a very fast and simple lookup:

   def getModel(self, butler):
      return {'weights':butler.algorithms.get(key='weights'),
              'num_reported_answers':butler.algorithms.get(key='num_reported_answers')}
Clone this wiki locally