# Data Processing
Waterbear now supports loading local files (particularly CSV files) for data processing. What follows is a run-down of these capabilities and how they can be applied.
## Table of Contents

- Basic File Input
  - What is a CSV?
  - Creating arrays from CSVs
- Performing Data Analysis
  - Statistical functions
  - Machine Learning
    1. k-Nearest Neighbors (kNN) Algorithm
    2. kNN in Waterbear
- Example Walkthrough: What Coffee Should I Drink?
  - The Idea
  - Initialize the Data
  - Prompt the User
  - Convey Prediction to User
## Basic File Input

### What is a CSV?

A CSV file is a plain text file storing comma-separated values. These are exactly what they sound like, i.e. lists of numbers and text separated by commas. For instance, a line in a CSV may look like "1,2,3,cat,dog,42". CSVs may contain many lines, and the lines needn't end in commas.
CSVs are often used to store tabular data, such as address books and basketball stat lines. Now users can upload their own CSV files to Waterbear and interact with them in many different ways.
The "new array from CSV" block (show below) includes a file picker: upon clicking "Choose file", the user is prompted to select a file from their machine.
This block only accepts files with the extension ".csv", and will throw the error "File is not a CSV file" to the browser's error console otherwise. If the Waterbear script is run without an input file, the error "File not entered" will be thrown.
In general, when uploading local files to a server, one does not fully know where the data goes. Therefore, the user should be cautious when uploading files with sensitive data, and for this reason the following alert is displayed when a user chooses a file to upload:
Clicking "Ok" allows the file and its contents to be uploaded to the server, and "Cancel" leads to the chosen file being forgotten altogether. On the topic of security, it's worth noting that Firefox, Chrome, and Safari all do not save the entire filepath of the file the user chooses -- it just saves the file name and allows the contents to be read.
Once a CSV is successfully uploaded and its contents are read, it can be saved in one of two forms. If the CSV contains only one line, then the data is saved as a Waterbear "Array", where the entries correspond to the comma-separated values in the order in which they appear in the CSV. Numbers in the raw CSV are treated as numbers, and non-numerical text in the raw CSV is treated as strings. If the CSV contains multiple lines, then the data is saved as an Array of Arrays, a.k.a. a 2-D Array (i.e. the first entry of the Array is an Array containing the entries of the CSV's first line). In the case where the CSV is empty, the "new array from CSV" block just creates an empty array (an array with no entries).
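For illustration, the conversion just described behaves roughly like the following Python sketch (the function name and parsing details are hypothetical, not Waterbear's actual implementation):

```python
def csv_to_array(text):
    """Convert raw CSV text into an array (one line) or a 2-D array (many lines)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []  # an empty CSV yields an empty array

    def parse(value):
        # Numbers in the raw CSV are treated as numbers; other text stays a string.
        try:
            return int(value)
        except ValueError:
            try:
                return float(value)
            except ValueError:
                return value

    rows = [[parse(v) for v in line.split(",")] for line in lines]
    # A single-line CSV yields a flat array; multiple lines yield an array of arrays.
    return rows[0] if len(rows) == 1 else rows

print(csv_to_array("1,2,3,cat,dog,42"))  # [1, 2, 3, 'cat', 'dog', 42]
```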
In the "Array" category of the block menu are blocks for performing statistical functions on Waterbear Arrays. These functions are:
- Returns the sum of the input array's entries
- Returns the mean (a.k.a. average) of the input array's entries
- Returns the standard deviation of the input array's entries
- Returns the variance of the input array's entries (which is the square of the standard deviation)
- Modifies the input array by normalizing its entries, i.e. scaling all its entries so that they sum to 1
Each of these blocks requires the input Array to contain only numbers as entries. If this is not the case, the message "Non-numerical value in array!" will be thrown to the browser's error console upon running the Waterbear script, and the block will return an undefined value. The same will occur if the input Array is empty, but the message will instead be "Array is empty!".
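As a rough sketch of what these blocks compute (assuming population variance; Waterbear's exact conventions may differ, and the error handling above is Waterbear's, not this sketch's):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)  # population variance (assumption)

def std_dev(xs):
    return math.sqrt(variance(xs))  # standard deviation is the square root of variance

def normalize(xs):
    total = sum(xs)
    return [x / total for x in xs]  # scaled entries now sum to 1

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sum(data), mean(data), std_dev(data), variance(data))  # 40 5.0 2.0 4.0
print(normalize([1, 3, 4]))  # [0.125, 0.375, 0.5]
```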
### Machine Learning

For those unfamiliar, machine learning is a branch of artificial intelligence that deals with systems being able to learn autonomously from data. A common example is that of email filtering: given an inbox of emails that a user labels as either "spam" or "not spam", an email service can find patterns amongst the emails labeled "spam" and thus classify new emails as "spam" based on these patterns. There are many algorithms that can be used to "find patterns", each with advantages in different scenarios. These include the k-Nearest Neighbors algorithm, Decision Trees, and Support Vector Machines. Waterbear now has blocks that can perform the k-Nearest Neighbors algorithm.
#### k-Nearest Neighbors (kNN) Algorithm

This may sound intimidating, but it really is quite harmless. In fact, you probably have a lot in common with this algorithm -- when you and your friends try to decide on a movie to watch, and there isn't total agreement, what do you do? If you said "majority vote", then you and the kNN algorithm may become good friends.
The basis of kNN is this: given an unclassified object, look into your training data for the k most similar objects, and take a majority vote amongst those to determine a classification for the object in question. The terms training data and most similar ought to be defined -- they can be explained by example:
Suppose you pick up a flower and would like to identify what type of flower it is. So, you take measurements: stem length, petal width, etc. These measurements together form a vector of attributes for the flower. Also at your disposal is a trove of pre-collected flower data of the form (attribute vector, type of flower). Note that the flowers in this data trove -- the training data -- are already classified, i.e. their type is known.
What can you do to guess precisely what type of flower is in your hand? You can look at the flowers within the training data whose attribute vectors closely match the one you measured, and classify your flower based on the labels of those most similar flowers. The notion of most similar can vary from situation to situation; in this case, it makes sense to use Euclidean distance as the measure, so that the smaller the Euclidean distance between two flowers, the more similar they are (in cases where the attributes are strings and not numbers, Hamming distance may be useful). The Euclidean distance between your flower's attribute vector and a training flower's attribute vector is a similarity score. For this to work, all attribute vectors need to be of the same length, and the attributes need to appear in the same order.
After calculating the similarity scores for all flowers in our training data, you can look at the k flowers with the best similarity scores (k can be chosen arbitrarily, or a "smart" k-value can be experimentally determined via cross-validation, which is outside the scope of this discussion). kNN looks at the types (a.k.a. labels) of these k most similar flowers and picks the mode. So, if k=10 and the types of the most similar flowers include 7 sunflowers, 2 irises, and 1 rose, we would label the flower in your hand as a sunflower (this is a contrived example...I'm not doubting your ability to identify a sunflower, but as it turns out, flower data is commonly used for explanation's sake).
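Putting the pieces together, a minimal Python sketch of unweighted kNN as just described (hypothetical code, not Waterbear's internals) might look like this, where each training row is an attribute vector with its label as the last entry:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(test_point, k, training_set):
    # Each training row is [attr_1, ..., attr_n, label].
    scored = sorted(training_set, key=lambda row: euclidean(test_point, row[:-1]))
    nearest = scored[:math.ceil(k)]               # the k most similar rows
    labels = [row[-1] for row in nearest]
    return Counter(labels).most_common(1)[0][0]   # majority vote (the mode)

training = [[1.0, 1.0, "rose"], [1.1, 0.9, "rose"], [5.0, 5.2, "sunflower"]]
print(knn_classify([1.05, 1.0], 2, training))  # rose
```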
#### kNN in Waterbear

There are two types of kNN blocks in Waterbear: unweighted and weighted. I'll first discuss the former -- the latter only differs by one small change to the algorithm.
The below unweighted kNN block can be found in the "Array" category of the block menu:
The first socket (following "classify test point") calls for a Waterbear array containing the attribute vector of your test point (the unclassified flower in your hand in the aforementioned example). The second socket requires a value for the k parameter; an integer is expected, but floating point numbers are accepted, in which case the value used is the ceiling of the inputted k.
Lastly, the third socket is meant for the training data (called the "training set" in the block itself) as described in the above example. This must be in the form of a Waterbear 2-D Array (defined in the Creating arrays from CSVs section), i.e. an array of vectors where the last entry of each vector is the label, and the preceding entries are the attribute vector. For example, if the attribute vector for a sunflower is (1.2,3.0,6.3,2.2), then this should be represented as (1.2,3.0,6.3,2.2,sunflower). It seems like it'd be a hassle to get the vectors into this specific array format and manually add each such array to an overarching training data array (especially when there are hundreds of training data points) -- how can this be done more easily?
By uploading a CSV, of course! Isn't it great when everything comes together? If you're tackling a machine learning problem with a set of training data, chances are this data is already in CSV format. If the labels are not at the end of the attribute vectors, then it is a trivial transformation to make it so. By uploading this file into the "new array from CSV" block, a 2-D Array is created with precisely the right format -- the rows of training data in the CSV are converted into Arrays, which are then placed into an overarching Array containing all these training data points. There are just a few requirements on the actual data in the CSV:
- The training set mustn't be empty, otherwise the message "Training set is empty!" will be thrown to the browser's error console.
- As mentioned earlier, attribute vectors must be of the same length, and the ordering of the attributes within each vector ought to be the same.
- Attribute values must be numbers in order for the similarity score calculations to work.
- Training point labels must belong to a discrete set so that it makes sense to take a majority vote.
With all sockets filled, the block is ready to be taken for a spin. When evaluated, this block returns the predicted label of the inputted test point.
Finally, as promised, the weighted kNN block (below) shall be discussed.
The only difference between this block and the unweighted kNN block is the way the algorithm takes the majority vote. The weighted kNN takes a weighted majority vote, i.e. the "votes" of each of the k nearest neighbors are scaled by their similarity score. In particular, since larger similarity scores are bad (as they correspond to a greater distance from the test point's attribute vector), the weighted kNN algorithm scales each vote by (1/similarity_score). For example, suppose k=4 and the k nearest neighbors to the flower in your hand are as follows:
- sunflower with similarity score 0.02
- rose with similarity score 2.10
- rose with similarity score 4.30
- rose with similarity score 3.00
In this case, "sunflower" receives a vote of weight (1/0.02)=50, while "rose" receives a vote of weight (1/2.1)+(1/4.3)+(1/3.0)=1.042. So although more of the k nearest neighbors are labeled "rose", the sole nearest neighbor labeled "sunflower" is so close attribute-wise to the flower in your hand that its vote is "worth more", and thus your flower ought to be labeled "sunflower".
## Example Walkthrough: What Coffee Should I Drink?

The aforementioned flower example is quite contrived. We can understand if running to Waterbear isn't your gut reaction when picking up an unidentified flower. But there are plenty of cool things one can do with the CSV uploading and machine learning capabilities of Waterbear. One such example is a script that recommends a specific type of coffee to a user (machine learning is often used in recommendation systems like those of Netflix and Amazon).
### The Idea

The following example is inspired by Buzzfeed quizzes (example here). The format is: the user is asked a few questions -- some silly -- and based on their answers, they are recommended a cup of coffee that suits their personality type (with a message explaining why). A mapping from personality type to coffee type can be found here. The fully functioning example can be found in the "Examples" drop-down of Waterbear under the name "Coffee Quiz".
How can machine learning come into play with a quiz like this? Consider letting the user's quiz answers form an attribute vector. And, suppose we've already surveyed many coffee drinkers, asking them to answer the quiz questions and then specify their go-to type of coffee. The answers from the surveyed people are their attribute vectors, and the specified type of coffee is the label. Altogether, the survey data is our training set, and new users taking the survey correspond to new test points to be classified after completing the quiz. In order to use Waterbear's kNN blocks, we format our questions to require answers on a scale from 1-10, which allows us to use Euclidean distance as a similarity score.
### Initialize the Data

Given the context of the problem, there are several pieces of data we'll need to maintain:
- Questions to ask the user
- All possible coffees
- Messages corresponding to each coffee type
- Training data
These will be tackled sequentially:
The questions used are stored in coffee_questions.csv. They shall simply be stored in an Array, which can be accomplished by loading this file into the "new array from CSV" block described earlier.
Double-clicking the rectangular global array variable that appears allows you to rename it -- upon choosing the name "questions", we get the following:
The training data for this example exists in the file coffee_train.csv. Each row contains others' answers to the five quiz questions (in order) followed by their actual go-to coffee. This can be stored into a 2-D Array "training_data" using the "new array from CSV" block, shown below:
As mentioned before, in reality, gathering training data would require surveying many people and recording 1) their quiz answers and 2) what they specify their go-to coffee to be. But for the sake of example, this training data was generated by a Python script -- reasonable ranges of answers were speculated for each coffee type, and then the script randomly generated, for each coffee, 30 CSV rows using answers within that coffee's ranges.
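That generation script isn't reproduced here, but the approach might look something like the following (the coffee names and answer ranges below are invented for illustration):

```python
import csv
import random

# Hypothetical 1-10 answer ranges for the five quiz questions, per coffee type.
RANGES = {
    "espresso":  [(7, 10), (1, 4), (6, 10), (5, 9), (8, 10)],
    "americano": [(3, 6),  (4, 8), (2, 5),  (1, 5), (4, 7)],
}

with open("coffee_train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for coffee, ranges in RANGES.items():
        for _ in range(30):  # 30 rows per coffee type
            answers = [random.randint(lo, hi) for lo, hi in ranges]
            writer.writerow(answers + [coffee])  # the label goes last
```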
The file coffee_messages.csv is a CSV where the first row is a list of possible coffees, and the second row is a list of messages for each corresponding coffee in the first row (i.e. the messages that explain why a user's personality matches some coffee type). This file can be loaded into a 2-D Array using the "new array from CSV" block -- I've named the resulting 2-D Array "coffee_data".
The first entry of this 2-D array is the Array of coffee types. It would be nice to have this in a separate Array for when we refer to it later in the program. So, we can create a "coffees" Array using the "new array with array" block. The socket of this block is to be filled with the zero-th item of the coffee_data Array -- this can be accomplished via the "array...item" block with its two sockets filled in with the coffee_data Array and 0, respectively. Similarly, we can also create a "messages" Array by grabbing the first item of the coffee_data Array.
To make these additions, the following should be appended to the block you made for the "training_data" Array:
At this point, your workspace should look as follows:
### Prompt the User

The goal now is to ask the user the five questions and store their answers.
From the get-go, we know that we'll need to store the user's quiz answers, so let's create an empty array called "answers" using the "new array" block. Now, for each question in the "questions" Array, we will need to prompt the user and get input, so let's insert an "array...for each" block and use the "questions" Array in its socket. This gives us the following:
To clarify, the "item" local variable that appears inside the for loop corresponds to the question that the loop is currently on, and "index" is the index of that question within the "questions" Array.
Now we must think about what we want to do for each question. For one thing, we'll need to store the user's answer, so we'll create a "user_answer" variable (the block needed to create a new variable can be found in the "Controls" category of the block menu). Don't worry about initializing the value of this variable for now.
To ask the user a question, we add an "ask...and wait" block from the block menu's "Sensing" category, and place the local variable "item" into its socket, because again, "item" is a question to be asked. The local variable "answer ##" can be renamed to "response" -- it is equal to the answer that the user enters, which we want to store into our "user_answer" variable. So inside the "ask...and wait", we add a "set variable...to..." block and place into its two sockets the "user_answer" and "response" variables, respectively. Lastly, after receiving an answer we must append it ("user_answer") to the "answers" array using the "array...append..." block, which gives us:
We require that the user answer the questions on a scale from 1-10, but as it stands they can enter any number. We need to add blocks that ensure that the user's answer is indeed in this range, and re-prompt the user if not.
A simple solution is to wrap the "ask...and wait" block in a "repeat...until" loop, so that the user is asked the same question repeatedly until some condition holds. The condition naturally is "1 <= user_answer AND user_answer <= 10", which can be conveyed using the "and" block from the block menu's "Boolean" category and the "...<=..." block from the "Math" category.
Lastly, we mustn't forget to initialize the "user_answer" variable! Given the condition for our repeat/until loop, what would be a good choice for the initial value? Well, we want to guarantee that we enter the repeat/until loop the first time we reach it (because we need to prompt the user for an answer at least once), and the loop will be skipped over if 1 <= user_answer <= 10, so we need to initialize it to something outside the range 1-10. I'll choose 20. After entering the loop, as soon as the user provides an answer from 1-10, the script breaks out of the repeat/until loop and continues onward.
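In ordinary code, the logic assembled over the last few steps amounts to something like this Python sketch (the question wording is hypothetical; Waterbear's repeat/until is mirrored here by a while loop whose condition is checked before each pass):

```python
questions = ["How adventurous are you?", "How organized are you?"]  # hypothetical wording
answers = []

for item in questions:
    user_answer = 20  # initialized outside 1-10, so the loop is entered at least once
    while not (1 <= user_answer <= 10):
        response = input(item + " (1-10): ")
        try:
            user_answer = float(response)
        except ValueError:
            continue  # non-numeric input: re-prompt
    answers.append(user_answer)
```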
At this point we are done with the innards of the "array [questions] for each" block -- it should look like the following:
### Convey Prediction to User

The only remaining task is to perform kNN and display to the user their coffee recommendation.
The end result is an alert box with a message explaining to the user what coffee type was predicted. Let's first create a variable entitled "alert_message" with the empty string "" as its initial value. Then let's create a variable "coffee_suggestion" that will contain the recommended type of coffee as predicted by kNN.
In the socket of "coffee_suggestion", place one of the kNN blocks (weighted or unweighted) -- set this kNN block's test point to be the "answers" Array, the k value to be 10 (arbitrary, but it should be less than 30 because there are only 30 training data points for each coffee type), and the training data to be the "training_data" 2-D Array we created earlier. At this point you should have the following:
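In terms of the earlier Python kNN sketch, this step amounts to a single call (again assuming the hypothetical knn_classify function and the arrays built above):

```python
# Predict a coffee for the user's quiz answers, using k = 10 nearest neighbors.
coffee_suggestion = knn_classify(answers, 10, training_data)
```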
Now we have a coffee type stored in the coffee_suggestion variable (e.g. "espresso", "americano", etc.). We need to know which message in the "messages" Array corresponds to coffee_suggestion. As mentioned earlier, the first message in "messages" corresponds to the first coffee in "coffees", the second message in "messages" corresponds to the second coffee in "coffees", etc. So, if we iterate over "coffees" until we see a coffee equal to coffee_suggestion, then we know the message at that index in "messages" is precisely what we want. We thus can find the desired message with a for loop as follows:
As discussed, the above blocks iterate over all coffees, and then store in alert_message the message at the index at which "item"=coffee_suggestion. There is guaranteed to be an "item" equalling coffee_suggestion because kNN will only return a label it finds in the training set, and the training set only contains coffee types in "coffees".
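The same lookup, written as a Python sketch with invented coffee and message data:

```python
coffees  = ["espresso", "americano", "latte"]  # from item 0 of coffee_data
messages = ["You're intense.", "You keep it simple.", "You like it smooth."]  # item 1
coffee_suggestion = "americano"

alert_message = ""
for index, item in enumerate(coffees):
    if item == coffee_suggestion:
        alert_message = messages[index]  # the message at the matching index

print(alert_message)  # You keep it simple.
```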
With our prediction and message in tow, we are at last able to display our conclusion to the user. We will use the "alert" block in the block menu's "Strings" category, which displays to the user whatever text is in its socket.
The message we'd like to display is:
You should drink...[coffee_suggestion]!
[alert_message]
An example of what this would actually look like is:
The header here, "http://localhost:8000", just denotes the URL I'm running Waterbear from -- yours will presumably be different.
We can build the desired string by nesting "concatenate...with..." blocks as done below. The "concatenate...with..." block appends the second string to the first string.
The "\n" after the "!" is a newline character -- it ensures that everything that follows comes on a new line.
And with that, the script is done! Go ahead and grab yourself your ideal coffee -- you've earned it.