
Model and Engine


Story Map:

sm1 sm2

Engine Architecture:

Preface:

The engine architecture of Identisound can be divided into three concrete sections: Frontend, Backend (server and database), and Machine Learning methods. We present a full view of the engine architecture below before zooming in on each section to show the interactions between components more clearly.

Full view:

full

Frontend:

frontend

The frontend must be able to record audio and send the recorded data to either:

  • ShazamKit (in the case of the Skeletal version)
  • Backend postAudio endpoint (in the case of the MVP version)

Currently, the ShazamKit version has higher accuracy. However, the discussion below assumes the MVP version, which uses the postAudio endpoint.

To record audio, the frontend uses AudioRecord to generate a byte array of 16-bit mono PCM audio. See here for the AudioRecord code.

After recording audio, the frontend queries the postAudio endpoint with a Multipart Form Data request containing a file of the PCM bytes. See here for the MPFD request code. After receiving a response from postAudio containing the song's label, the frontend populates a MovieView.

The MovieView consists of the song name and every movie in our database in which the song appears. Each movie is displayed with its director, poster, and year. To get this data, the frontend sends a second request to the backend through the getsongs endpoint: as soon as postAudio returns a song, the frontend queries getsongs with that song name.
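The frontend itself is an Android app, but the request flow can be sketched with a small Python test client. The endpoint paths, form field names, and response keys below are assumptions for illustration only; the real requests are built in the linked frontend code.

```python
# Hypothetical Python client sketch of the frontend's two-step flow.
# Endpoint paths, form field names, and response keys are assumptions.
import requests

BASE_URL = "http://<backend-host>:8000"  # placeholder host

def identify_song(pcm_bytes: bytes) -> str:
    """POST the raw 16-bit mono PCM audio to postAudio and return the song label."""
    files = {"audio": ("recording.pcm", pcm_bytes, "application/octet-stream")}
    resp = requests.post(f"{BASE_URL}/postAudio", files=files)
    resp.raise_for_status()
    return resp.json()["song"]          # assumed response key

def movies_for_song(song_name: str) -> list[dict]:
    """Query getsongs for every movie in the database that features the song."""
    resp = requests.get(f"{BASE_URL}/getsongs", params={"song": song_name})
    resp.raise_for_status()
    return resp.json()["movies"]        # assumed: director, poster, year per movie

if __name__ == "__main__":
    with open("recording.pcm", "rb") as f:
        label = identify_song(f.read())
    print(label, movies_for_song(label))
```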

Backend Structure:

backend

The backend consists of a PostgreSQL database and a Python/Django server instance, hosted on AWS EC2 running Ubuntu. The Songs, Songs_to_movies, and Movies tables are all populated with data gleaned through an IMDb web scraper. The code for the data population is linked here.
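For orientation, a minimal Django models sketch consistent with the three tables described above might look like the following; the field names and types are assumptions, and the real schema lives in the linked population code.

```python
# Hypothetical Django models mirroring the three tables described above.
# Field names and types are assumptions; the real schema is in the linked code.
from django.db import models

class Song(models.Model):
    name = models.CharField(max_length=255)

class Movie(models.Model):
    title = models.CharField(max_length=255)
    director = models.CharField(max_length=255)
    poster_url = models.URLField()
    year = models.IntegerField()

class SongToMovie(models.Model):
    # Junction table joining songs to the movies they appear in.
    song = models.ForeignKey(Song, on_delete=models.CASCADE)
    movie = models.ForeignKey(Movie, on_delete=models.CASCADE)
```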

The getsongs endpoint performs a natural join on the three tables and then selects the records associated with the requested song name. The code is linked here. The getmovies endpoint is not used within the app; however, its code is included here for reference and utility. The postAudio endpoint is more complex and involves concepts tied to the ML system, so the discussion of its mechanics is left to the ML section.
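Using those hypothetical models, the getsongs join could be sketched roughly as below; the actual implementation is in the linked code.

```python
# Hypothetical Django view approximating the getsongs join described above.
from django.http import JsonResponse

from myapp.models import SongToMovie  # hypothetical app module for the models sketched earlier

def getsongs(request):
    song_name = request.GET.get("song", "")
    # Join Songs_to_movies against Movies, filtered to the requested song name.
    rows = SongToMovie.objects.filter(song__name=song_name).select_related("movie")
    movies = [
        {
            "title": r.movie.title,
            "director": r.movie.director,
            "poster": r.movie.poster_url,
            "year": r.movie.year,
        }
        for r in rows
    ]
    return JsonResponse({"song": song_name, "movies": movies})
```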

Full Machine Learning View:

Here is an overarching view of the ML system. It is a bit complicated to digest from this vantage point, so we split it into three subsections for easier discussion: Training Data Generation, Machine Learning Model Training, and finally a discussion of the postAudio endpoint.

fullml

Training Data Generation:

Identisound's training set is a dataset of 12-dimensional chroma pitch vectors, each associated with a 10-second range of a song, covering every song in the database. The method for its generation is outlined in the following diagram and explained below.

trainZoom

To generate training data, we used the Spotify Developer API to extract pitch data for every song in our database. At a glance, the training data consists of 10-second pitch vectors, with every song in the database fully covered by its associated pitch vectors in the training set.

More specifically, we begin generating the dataset by requesting an OAuth token from Spotify through a POST request to the token endpoint. This token is passed along with all of our subsequent requests to the Spotify API for authentication.
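In Python, the client-credentials token request looks roughly like the following; the credentials are placeholders.

```python
# Request a client-credentials OAuth token from Spotify's token endpoint.
import requests

CLIENT_ID = "<client-id>"          # placeholder credentials
CLIENT_SECRET = "<client-secret>"

resp = requests.post(
    "https://accounts.spotify.com/api/token",
    data={"grant_type": "client_credentials"},
    auth=(CLIENT_ID, CLIENT_SECRET),
)
resp.raise_for_status()
token = resp.json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}  # attached to every later request
```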

Next, for every song in the database, we take the following actions:

  • Request a search on the song name in Spotify to get its trackId through the search endpoint.
  • Request audio analysis (which is a superset of the pitch analysis we are after) from the audioAnalysis endpoint.

These calls are orchestrated by the following code, and the results are written to a file of unnormalized training data.
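A minimal sketch of the two calls for a single song, with error handling and rate limiting omitted, is given below.

```python
# For one song: look up its Spotify track ID, then pull its audio analysis.
import requests

def fetch_pitch_segments(song_name: str, headers: dict) -> list[dict]:
    # 1. Search for the track to obtain its trackId.
    search = requests.get(
        "https://api.spotify.com/v1/search",
        params={"q": song_name, "type": "track", "limit": 1},
        headers=headers,
    ).json()
    track_id = search["tracks"]["items"][0]["id"]

    # 2. Request the full audio analysis; each segment carries a 12-element
    #    "pitches" chroma vector plus a start time and duration.
    analysis = requests.get(
        f"https://api.spotify.com/v1/audio-analysis/{track_id}",
        headers=headers,
    ).json()
    return analysis["segments"]
```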

Before this phase is complete, however, we must normalize the range of each pitch vector so that it is associated with a 10-second interval. Spotify splits its analysis into "segments of similar audio content", which are generally anywhere from 0.5 to 3 seconds long and not what we want here. After normalizing the time ranges, we are done generating training data.
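One way to perform this normalization is a duration-weighted average of segment chroma vectors per 10-second window, sketched below; the exact aggregation used in the project may differ.

```python
# Collapse Spotify's variable-length segments into fixed 10-second chroma vectors.
# The duration-weighted averaging here is one reasonable choice, not necessarily
# the exact aggregation used in the project.
import numpy as np

def to_ten_second_windows(segments: list[dict], window: float = 10.0) -> np.ndarray:
    end_time = max(s["start"] + s["duration"] for s in segments)
    n_windows = int(np.ceil(end_time / window))
    sums = np.zeros((n_windows, 12))
    weights = np.zeros(n_windows)

    for seg in segments:
        idx = min(int(seg["start"] // window), n_windows - 1)
        sums[idx] += np.array(seg["pitches"]) * seg["duration"]
        weights[idx] += seg["duration"]

    return sums / np.maximum(weights, 1e-9)[:, None]  # one 12-d vector per window
```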

Machine Learning Model Training:

Identisound uses Support Vector Machines (SVMs/SVCs) to classify pitch vectors into song labels. The general process is outlined in the following diagram and walked through below.

mlZoom

To identify the ideal model, we assess approximate RBF kernels, approximate polynomial kernels, and linear kernels. We use approximate kernels to decrease computation time by removing the need for explicit kernel evaluation and by allowing the use of one-vs-rest multiclass decision functions for our classifiers.
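In scikit-learn terms, the three candidate model families can be sketched as the following pipelines; the hyperparameter values shown are placeholders that get swept during cross-validation.

```python
# The three candidate model families as scikit-learn pipelines.
# Hyperparameter values here are placeholders swept during cross-validation.
from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import Nystroem, PolynomialCountSketch
from sklearn.svm import LinearSVC

# Approximate RBF kernel: explicit Nystroem feature map + linear one-vs-rest SVC.
rbf_model = make_pipeline(Nystroem(kernel="rbf", gamma=0.1, n_components=300),
                          LinearSVC(C=1.0))

# Approximate polynomial kernel (degree 2 or 3) via tensor sketching.
poly_model = make_pipeline(PolynomialCountSketch(degree=2, n_components=300),
                           LinearSVC(C=1.0))

# Plain linear kernel: no feature-space projection needed.
linear_model = make_pipeline(LinearSVC(C=1.0))
```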

With this in mind, the main handler must train each of these models over a range of hyperparameters and write the best-performing models to files. We link the driving code for the linear, polynomial, and RBF models here.

These driver blocks of code trigger 5-fold cross-validation over the appropriate ranges of hyperparameters for each model and write the highest-performing models to pickle files.

Generally speaking, the flow by which a given model is trained is outlined below (numbers corresponding to the diagram):

  1. Request 5-fold cross-validation for a given model, which can be linear, quadratic, cubic, or RBF.
  2. Project the training data to the appropriate kernel approximation feature space. If RBF (a), this is done through the Sklearn Nystroem object. If polynomial of any degree (b), this is done through the Sklearn PolynomialCountSketch object. If linear (c), no projection is needed.
  3. Partition the projected training data into 5 folds. Among these folds, designate a portion as the training set. The remainder is the test set on which we validate the model's performance.
  4. For every fold, pass the training set and training labels (X_train, y_train) to the requested model.
  5. After passing the data, fit the model to the training data.
  6. Forward the testing data to the model and have it compute predictions (y_pred).
  7. Calculate the performance of the fitted model by comparing y_pred to y_test.
  8. Return the average performance across all folds as the final performance. Code for steps 3-8 is linked here; a rough sketch of this loop is also given below.
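As a rough sketch of steps 3-8, assuming scikit-learn's KFold and accuracy as the performance metric (the project's actual metric and fold handling live in the linked code):

```python
# Sketch of steps 3-8: 5-fold cross-validation of one already-projected model.
# Accuracy is assumed as the performance metric.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def cross_validate(model, X_proj, y, n_folds: int = 5) -> float:
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True).split(X_proj):
        X_train, y_train = X_proj[train_idx], y[train_idx]    # steps 3-4
        X_test, y_test = X_proj[test_idx], y[test_idx]

        fold_model = clone(model).fit(X_train, y_train)       # step 5
        y_pred = fold_model.predict(X_test)                   # step 6
        scores.append(accuracy_score(y_test, y_pred))         # step 7

    return float(np.mean(scores))                             # step 8
```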

Steps 3-8 are repeated for every valid hyperparameter value associated with a model. The method by which these hyperparameter values are swept is outlined here.

The driver code itself retrains the model with the ideal hyperparameters identified and writes it to a pickle file, as outlined in (9).
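Step (9) then amounts to something like the following, where the hyperparameter grid, the file names, and the use of a plain linear model are placeholders for illustration:

```python
# Sketch of step 9: sweep a hyperparameter grid, retrain the best model, pickle it.
# Grid values, file names, and the plain LinearSVC are placeholders.
import pickle
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# X_proj: training data already projected into the chosen feature space; y: song labels.
X_proj = np.load("projected_training_data.npy")   # placeholder file names
y = np.load("training_labels.npy")

best_score, best_C = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:                   # placeholder hyperparameter grid
    score = cross_val_score(LinearSVC(C=C), X_proj, y, cv=5).mean()
    if score > best_score:
        best_score, best_C = score, C

final_model = LinearSVC(C=best_C).fit(X_proj, y)   # retrain with the best C
with open("linear_model.pkl", "wb") as f:
    pickle.dump(final_model, f)
```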

A closer look at postAudio:

Finally, we have the background to discuss postAudio! The diagram below outlines the general flow of the endpoint. We must convert the PCM file to a 12-dimensional chroma pitch vector, normalize it with respect to the mean and standard deviation of the training data, and then project it into the Nystroem RBF feature space. From there, we can predict its label and return it to the requester.

Note that we project to the RBF feature space because the approximate-RBF model performed best and serves as our final model.

postAudio
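A simplified sketch of that pipeline, assuming 44.1 kHz 16-bit mono PCM, librosa for the chroma extraction, and pickled artifacts on disk (all assumptions; the endpoint's actual feature extraction may differ):

```python
# Hypothetical sketch of the postAudio pipeline: PCM -> 12-d chroma vector ->
# normalize -> Nystroem RBF projection -> predict. librosa, the sample rate,
# and the pickle file names are all assumptions.
import pickle
import numpy as np
import librosa

SAMPLE_RATE = 44100  # assumed recording rate of the 16-bit mono PCM

with open("scaler_stats.pkl", "rb") as f:      # training mean and std
    mean, std = pickle.load(f)
with open("nystroem.pkl", "rb") as f:          # fitted Nystroem RBF feature map
    nystroem = pickle.load(f)
with open("rbf_model.pkl", "rb") as f:         # fitted linear SVC on RBF features
    model = pickle.load(f)

def predict_song(pcm_bytes: bytes) -> str:
    # Decode 16-bit mono PCM into floats in [-1, 1].
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0

    # Average the frame-wise chromagram into a single 12-d pitch vector.
    chroma = librosa.feature.chroma_stft(y=samples, sr=SAMPLE_RATE).mean(axis=1)

    # Normalize with the training statistics, project, and classify.
    features = nystroem.transform(((chroma - mean) / std).reshape(1, -1))
    return model.predict(features)[0]
```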