This project was completed by Marcos Ortiz, Sayantan Roy, Karthik Prabhu, Kristina Knowles, and Diptanil Roy, as part of The Erdös Institute Deep Learning Boot Camp (Spring, 2024).
Our project is detailed below, and you can follow along with the main steps through the demonstration notebooks and data provided in the paw_demo/ directory.
The first four notebooks walk through a demonstration using a small subset of the data. The last notebook provides a summary of our results on the entire dataset.
- Notebook 1 - Loading, Preprocessing, and Embedding Data
- Notebook 2 - Generating Sentiment Data
- Notebook 3 - Attaching Additional Metadata
- Notebook 4 - Indexing and Querying
- Notebook 5 - Evaluating the Results (Full Dataset)
Given an arbitrary user query and a dataset of human-generated content, build an algorithm to identify and rank the relevant content in the dataset, such that the match set can be retrieved quickly and accurately.
The eventual application for the results of our project is a Retrieval-Augmented Generation (RAG) pipeline. This recent survey paper, describing the current state of RAG for Large Language Models (LLMs), helped provide insight into which tools might be a good fit for our particular task and data.
The main steps in RAG are:
- Indexing documents
- Retrieving the most relevant documents, given a user query
- Generating output using a LLM
The raw data provided to us consists of 5,528,298 posts from Reddit, from 34 subreddits. This data was provided in a parquet file, along with a data dictionary.
For this project, we are focused on the first two steps of the RAG process: indexing and retrieval.
Starting with the raw data, we performed some basic cleaning:
- Dropped rows with `reddit_text` values of `"[deleted]"` or `"removed"`.
- Dropped rows that were deemed to be likely bots or memes.
  - This was done by filtering out any rows with `reddit_text` values that were at least 35 characters long and appeared more than 7 times. We did not want to immediately drop shorter common phrases, in case they might be useful later (see Using Engineered Metadata).
- Handled empty `reddit_text` values.
  - Hardly any comments (as opposed to submissions) had empty values; few enough that they could be inspected manually. It appeared that these posts had either been deleted or edited to be empty by the original author. These rows were dropped.
  - A cursory inspection of submissions with empty values revealed that the `reddit_title` was a proxy for the `reddit_text`. So, we replaced the empty `reddit_text` with the `reddit_title` in these instances.
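A minimal sketch of these cleaning steps using Polars (the helper name is ours; column names follow the provided data dictionary, and the 35-character / 7-occurrence thresholds are the ones described above):

```python
import polars as pl

def clean_posts(df: pl.DataFrame) -> pl.DataFrame:
    """Sketch of the cleaning steps described above (helper name is ours)."""
    # 1. Drop rows whose text was deleted or removed.
    df = df.filter(~pl.col("reddit_text").is_in(["[deleted]", "removed"]))

    # 2. Drop likely bots/memes: texts at least 35 characters long that appear more than 7 times.
    repeated = df.group_by("reddit_text").len().filter(
        (pl.col("reddit_text").str.len_chars() >= 35) & (pl.col("len") > 7)
    )["reddit_text"]
    df = df.filter(~pl.col("reddit_text").is_in(repeated))

    # 3. Handle empty text: fall back to the title (the proxy used for submissions),
    #    then drop any rows that are still empty (the handful of empty comments).
    df = df.with_columns(
        pl.when(pl.col("reddit_text").str.strip_chars() == "")
        .then(pl.col("reddit_title"))
        .otherwise(pl.col("reddit_text"))
        .alias("reddit_text")
    )
    return df.filter(pl.col("reddit_text").str.strip_chars() != "")
```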
We used the base version of the General Text Embeddings (GTE) model, which is based on the BERT framework. Documentation on HuggingFace: link.
We chose this model because it seemed to be a reasonable size (0.22GB), it is open source, and it allows embedding of texts up to 512 tokens in length. It performs especially well in clustering and retrieval compared to other open source sentence transformers that have fewer than 250M parameters: link.
Moreover, part of its training was done using Reddit data, which added to its appeal.
We considered experimenting with other models, but due to the high computational cost of embedding the dataset with each new model, we leave this avenue for future work.
We use the Sentence Transformers framework provided by SBERT to implement our embedding model, as well as the Hugging Face embedding tools provided by LangChain.
During embedding we considered the following parameters:
- `chunk_size`: the maximum length of text to embed as a single document chunk.
- `chunk_overlap`: whenever a document needed to be broken into chunks, how much consecutive chunks should overlap.
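As a rough sketch of this step (the `thenlper/gte-base` checkpoint is our assumed Hugging Face id for GTE base, and the parameter values are illustrative rather than a record of the exact configuration):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

# Illustrative values; chunk_size and chunk_overlap were treated as hyperparameters.
# Note: this splitter counts characters; a token-based splitter could be swapped in
# to line up exactly with the model's 512-token limit.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
model = SentenceTransformer("thenlper/gte-base")  # assumed Hugging Face id for GTE base

def embed_post(text: str):
    chunks = splitter.split_text(text)  # break long posts into overlapping chunks
    return chunks, model.encode(chunks, normalize_embeddings=True)
```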
We also experimented with attaching metadata to chunks prior to embedding. To do this, we simply add the subreddit title (or an approximation of it) to the start of a text chunk before embedding. For example, if there is a comment in the FedExers subreddit that says “I really like working here because...”, then we would prepend “FedEx” to the chunk and embed “FedEx \n\n I really like working here because...”
Our intuition was that, in cases where a post does not explicitly include the name of the company being discussed, we might infer that information from the subreddit, and that this might nudge the vector closer to our query. For example, if we ask “Why do employees like working at Disney?” and “Why do employees like working at FedEx?”, our hope is that the addition of metadata makes it more likely that the above comment shows up higher in the results for the FedEx query, and perhaps lower in the results for the Disney query.
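In code, the prepending step is simple; the subreddit-to-company mapping below is hypothetical:

```python
# Hypothetical subreddit -> company-name mapping; the real mapping covered all 34 subreddits.
SUBREDDIT_TO_COMPANY = {"FedExers": "FedEx", "WaltDisneyWorld": "Disney"}

def with_metadata(chunk: str, subreddit: str) -> str:
    """Prepend the (approximate) company name before embedding."""
    company = SUBREDDIT_TO_COMPANY.get(subreddit, subreddit)
    return f"{company} \n\n {chunk}"  # e.g. "FedEx \n\n I really like working here because..."
```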
We used spotlight to visualize the effect on a small sample of our data.
Embedding without metadata:
Embedding with metadata:
We chose LanceDB (link) to handle our vector database needs. LanceDB is an open source option, and it provides integration with both Python and Polars, both of which we are heavily reliant on.
LanceDB provides a combination of an inverted file index (IVF) and product quantization (PQ) to build an approximate nearest neighbors (ANN) index.
Both parts of the IVF-PQ index can be fine-tuned by adjusting the following parameters:
- During indexing:
- The number of partitions in the IVF portion of the index.
- The number of sub-vectors that will be created during PQ.
- During retrieval:
- The number of partitions to probe for results.
- A "refine factor" that expands the number of results that are retrieves in memory, and re-ranked using their full (pre-PQ) vectors.
We fixed the indexing parameters and varied the retrieval parameters; given more time, we would vary both to see how retrieval times and accuracy are impacted.
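A minimal sketch of building and querying the IVF-PQ index with LanceDB; the table name, parameter values, and the `records` and `query_vector` variables are illustrative assumptions:

```python
import lancedb

db = lancedb.connect("./lancedb")

# `records` is assumed to be the embedded chunks: dicts with a "vector" field plus metadata.
table = db.create_table("reddit_chunks", data=records)

# Indexing parameters (fixed in our experiments; values here are illustrative).
table.create_index(
    metric="cosine",
    num_partitions=256,   # IVF: number of partitions
    num_sub_vectors=96,   # PQ: number of sub-vectors
)

# Retrieval parameters (the ones we varied).
results = (
    table.search(query_vector)  # query_vector: embedding of the user query
    .nprobes(20)                # number of IVF partitions to probe
    .refine_factor(10)          # fetch extra candidates, re-rank with full (pre-PQ) vectors
    .limit(20)
    .to_pandas()
)
```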
Besides the query parameters that are built into our ANN index, we varied other pre-retrieval and post-retrieval variables to try and improve our overall results.
While labeling data, we noticed a common type of "related but not relevant" result: a `reddit_text` that posed a question similar to the query itself.
Most of the time, these texts came from a `submission` (as opposed to a `comment`). So, one way to elevate more relevant results might be to omit submissions from the vector search. This is easy enough, given that this information is contained in our original metadata.
Less frequently, but still often enough to be noticed, a `comment` would exhibit this property. To curb their impact, we engineered a metadata column `is_short_question` to identify all `reddit_text` examples that posed short questions (and thus were unlikely to provide useful information for answering those questions) so that they could also be filtered out before the search.
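In LanceDB, this kind of pre-retrieval filtering can be expressed as a `where` clause on the search. The sketch below assumes a boolean `is_short_question` column and a hypothetical `post_type` column distinguishing submissions from comments:

```python
# Pre-filter before the ANN search: skip submissions and engineered short questions.
results = (
    table.search(query_vector)
    .where("post_type != 'submission' AND is_short_question = false", prefilter=True)
    .limit(20)
    .to_pandas()
)
```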
In order to improve the ranking of results after retrieval, we engineered some additional metadata that might allow us to leverage information provided by the content of replies.
We engineered two types of metadata:
- A measure of the `sentiment` of replies, and
- A measure of `agree_distance` (and `disagree_distance`) for replies.
In the case of `reply_sentiment`, we utilized a pre-trained natural language processing model, `mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis`, to gauge the emotional tone behind the texts. This model helped us classify each reply as positive, neutral, or negative. The sentiment scores of all the replies were then aggregated to reflect the overall sentiment towards each original post and the following comments. The underlying assumption here is that posts generating predominantly positive replies are likely to be constructive and informative, thereby serving as a proxy for user endorsement similar to upvotes on Reddit. Our hypothesis was that a post with more positive replies would be more likely to contain useful information.
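A sketch of how reply sentiment could be generated and aggregated with the Hugging Face `pipeline` API; the signed-score averaging shown here is one reasonable aggregation rather than necessarily the exact one we used, and the label names are assumed to be `positive`/`neutral`/`negative`:

```python
from transformers import pipeline

sentiment_model = pipeline(
    "text-classification",
    model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis",
)

def reply_sentiment(replies: list[str]) -> float:
    """Aggregate reply sentiment into a single signed score (illustrative scheme)."""
    if not replies:
        return 0.0
    signs = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}  # assumed label names
    predictions = sentiment_model(replies, truncation=True)
    scores = [signs[p["label"]] * p["score"] for p in predictions]
    return sum(scores) / len(scores)
```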
In the case of `agree_distance`, we measured the distance between each `reddit_text` and a set of "agree statements". Then, whenever a submission or comment had replies, we added the `top_reply_agree_distance` and the `avg_reply_agree_distance`. Our hypothesis was that posts with replies that were closer to "agree" statements would be more likely to contain relevant information. Similarly, posts with replies that were closer to "disagree" statements would be less likely to be relevant.
When re-ranking, results with a lower `avg_reply_agree_distance` were bumped higher, and results with a lower `avg_reply_disagree_distance` were bumped lower.
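A rough sketch of the agree-distance computation and the re-ranking bump; the example "agree statements", the nearest-statement distance, the `weight`, and the column names on `results` are illustrative assumptions (with `model` being the sentence transformer from the embedding step):

```python
import numpy as np

# Illustrative "agree statements"; the real set was hand-picked.
AGREE_STATEMENTS = ["I agree.", "Yes, exactly.", "This is true."]
agree_vecs = model.encode(AGREE_STATEMENTS, normalize_embeddings=True)

def agree_distance(reply_vec: np.ndarray) -> float:
    # Cosine distance to the nearest agree statement (vectors are unit-normalized).
    return float(1.0 - np.max(agree_vecs @ reply_vec))

def rerank(results: list[dict], weight: float = 0.1) -> list[dict]:
    # Lower avg_reply_agree_distance -> bumped higher; lower avg_reply_disagree_distance -> bumped lower.
    return sorted(
        results,
        key=lambda r: r["_distance"]                      # base vector distance from retrieval
        + weight * r["avg_reply_agree_distance"]
        - weight * r["avg_reply_disagree_distance"],
    )
```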
We tested 160 different model configurations. Each configuration included a choice of embedding, a strategy for filtering the texts before performing our vector search, and a strategy for re-ranking the retrieved results.
These hyperparameters are summarized in the image below:
We had two main objectives in mind when evaluating our results:
- We wanted query results to place relevant documents as high as possible in our ranking.
- We wanted query results to be returned in less than a second.
While retrieval time is easy enough to measure, we needed to develop some tools to measure our progress on result ranking.
To establish a baseline for evaluating result ranking, we manually labeled a subset of results to establish an initial metric of relevance. To do this, we created two queries for each of the thirteen datasets in our training set, and labeled the top 20 results retrieved for each query. Results were labeled as:
- Relevant to the query
- Related to the query, but not relevant to the query
- Not related to the query
For each query-result pair, the final label was determined by plurality vote, with ties defaulting to the less relevant label. This manually labeled data was then used to quantify results.
We used three metrics for ranking results. Each is a modified version of a recommender system metric, adapted to our use case where we do not have a clear ground truth, or an established ranking of relevant results from most relevant to least relevant.
This metric gives a score that indicates how close to the top the first known relevant result appears. A perfect score of 1 is achieved if the top result of every query is relevant.
To compute reciprocal rank for a given query, we applied the following formula:

$$ \mathrm{RR}_q = \frac{1}{\mathrm{rank}_q}, $$

where $\mathrm{rank}_q$ is the position of the first known relevant result returned for query $q$.

In standard applications, there is a single known "ground truth" result. In our modified application, we accepted any known relevant result as the ground truth.
We then computed the average of these scores across all of our standard queries to arrive at the Mean Reciprocal Rank.
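As an illustration, a small helper for this modified metric might look like the following, where `labels` is the ordered list of human labels for a query's retrieved results and label 1 means "relevant" (the function names are ours):

```python
def reciprocal_rank(labels: list[int]) -> float:
    """1 / position of the first known relevant result (0 if none appear)."""
    for position, label in enumerate(labels, start=1):
        if label == 1:  # human label 1 = relevant
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(all_labels: list[list[int]]) -> float:
    """Average the per-query reciprocal ranks over our standard queries."""
    return sum(reciprocal_rank(labels) for labels in all_labels) / len(all_labels)
```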
This metric gives a score that indicates how many of our known relevant results appear near the top. A perfect score of 1 is achieved if all known relevant results appear as the top results for every query (with no unlabeled result appearing higher than any known relevant result).
To compute the extended reciprocal rank for a given query, we extended the reciprocal rank formula above to credit every known relevant result, not just the first.
In standard applications, each relevant result has its own rank, and its contribution to the overall score takes this rank into account as its expected position in the results. In our modified application, we gave the same contribution to any known relevant result that appeared above a given position in the results.
We then computed the average of these scores across all of our standard queries to arrive at the Extended Mean Reciprocal Rank.
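For concreteness, one way to formalize this modified score (an assumption consistent with the description above, not necessarily the exact formula we implemented) is

$$ \mathrm{ERR}_q = \frac{1}{R_q} \sum_{i=1}^{R_q} \min\!\left(1,\; \frac{R_q}{\mathrm{rank}_i}\right), $$

where $R_q$ is the number of known relevant results for query $q$ and $\mathrm{rank}_i$ is the retrieved position of the $i$-th known relevant result; any known relevant result appearing in the top $R_q$ positions contributes the full value of 1.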
Discounted cumulative gain (DCG) is often employed as a metric to evaluate the performance of a search engine, and measures how effectively the algorithm places relevant results at the top of the retrieval list. For a list of responses of length $p$,

$$ \mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)}, $$

where $rel_i$ is the relevance score of the result at position $i$.

Since the DCG score is strongly dependent on the length of the retrieval list, we need to normalize it so that scoring is consistent across query retrieval scenarios with a variable number of results. The normalized discounted cumulative gain (NDCG) score at position $p$ is

$$ \mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}, $$

where the ideal discounted cumulative gain ($\mathrm{IDCG}_p$) is the DCG of the same results re-ordered from most to least relevant.

NDCG can take ordinal relevance scores (1 for highly relevant, 2 for somewhat relevant, and so on). We modified the scoring scheme for our case by converting our human labels (1: relevant, 2: related but not relevant, 3: not related) into a binary scoring scheme: results with human label 1 were given a relevance score of 1, and everything else was given a relevance score of 0. This was done to ensure that the best configuration, as dictated by the NDCG score, should only return highly relevant results. We then computed the NDCG score for each of our standard queries and averaged them to obtain the mean NDCG score of a particular configuration. The DCG and IDCG scores were calculated using these binary relevance scores for $rel_i$.
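A compact sketch of this binary-relevance NDCG computation (pure NumPy; `labels` is again the ordered list of human labels for a query's retrieved results):

```python
import numpy as np

def ndcg(labels: list[int]) -> float:
    """NDCG with binary relevance: human label 1 -> rel 1, everything else -> rel 0."""
    rel = np.array([1.0 if label == 1 else 0.0 for label in labels])
    if rel.sum() == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))  # 1/log2(i+1) for i = 1..p
    dcg = float(rel @ discounts)
    idcg = float(np.sort(rel)[::-1] @ discounts)  # ideal ordering: all relevant results first
    return dcg / idcg
```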
We used the following model parameters as our baseline for comparison:
- Chunk size for embedding: 512
- No attached metadata
- No pre-filtering before vector search
- No re-ranking of results
The baseline configuration achieved the following scores across our metrics:
| Metric | Score | Rank (out of 160) |
|---|---|---|
| Mean Reciprocal Rank | 0.626031 | 46 |
| Extended Mean Reciprocal Rank | 0.427189 | 84 |
| Normalized Discounted Cumulative Gain | 0.714459 | 84 |
| Average Overall Rank | | 71.33 |
The configuration that achieved the best overall result (highest average rank across metrics):
- Chunk size for embedding: 512
- With attached metadata
- Pre-filtering short questions before vector search
- Re-ranking retrieved results using reply sentiment
- Re-ranking retrieved results using reply "agree distance"
This configuration achieved the following scores across our metrics:
| Metric | Score | Rank (out of 160) |
|---|---|---|
| Mean Reciprocal Rank | 0.742735 | 7 |
| Extended Mean Reciprocal Rank | 0.599379 | 9 |
| Normalized Discounted Cumulative Gain | 0.806476 | 1 |
| Average Overall Rank | | 5.67 |
Below, we can see the position of the baseline configuration relative to the top overall configuration.
It appeared that decreased chunk size had a generally negative impact on results.
Also, filtering out short questions prior to retrieval had a positive impact regardless of other hyperparameter choices.
If we highlight just those configurations that combine these variations (512 chunk size, added metadata, and filtering out short questions), we can see how well they perform relative to the other configurations.
Some areas of potential future investigation:
- Preprocessing:
- Handling emojis and common abbreviations to better capture sentiments.
- Labeling:
- Label additional data, representing a wider range of query types and targeting a larger subset of the data.
- Augment with automated labeling using an LLM or other means.
  - Rank relevant results (as opposed to simply categorizing them as relevant versus irrelevant) so that we might apply more sensitive metrics that better quantify the impact of our engineered hyperparameters.
- Hyperparameter Engineering:
- Refine short question filter to handle more nuance in what makes a short question.
- Modify agree distances by varying the selection of standard "agree statements" (similarly for disagree distances).
- Experiment with additional "reply distance" variations.
  - Experiment with re-ranking using a different chunk size than the primary embedding.
- Indexing Parameters:
- Test various parameters to see if retrieval times can be improved within top configurations.
- Re-ranking:
- Refine re-ranking implementations and experiment with variations.
- Experiment with the order of re-rankings to understand whether that has an impact on overall results.