15th Jan 2024: Initial data analysis 2: Pairwise distance metric
In the previous report I showed some initial results from a feature extraction + statistical clustering approach. Here I explore a non-statistical approach to outlier detection, based on a similarity metric between time series. This write-up might be rather brief as I don't have much time.
As before we are only considering continuous light treatments, and Y(II) time series.
As always, I'm keeping it as simple as possible to begin with, and one of the simplest distance metrics available is the Euclidean distance. Given a pair of time series (which must be exactly the same length), the Euclidean distance treats each time point as a component of a vector and returns the distance between the two vectors. In the continuous-light case, all Y(II) time series have 42 time points, so the Euclidean distance tells us how far apart a pair of time series are in 42-dimensional space. Before computing distances, I apply a rolling average with a window size of 5 to smooth all time series and reduce the impact of noise on the metric.
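A minimal sketch of the smoothing + distance step. The Y(II) values here are synthetic placeholders, and the `smooth` helper uses a "valid"-mode convolution (which shortens the series by window−1 points); the real analysis may handle the window edges differently.

```python
import numpy as np

def smooth(series, window=5):
    """Rolling average with the given window size (valid region only)."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(series, dtype=float), kernel, mode="valid")

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length time series."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    assert a.shape == b.shape, "series must be the same length"
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Two toy Y(II) series with 42 time points each (illustrative values only).
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 42)
y1 = 0.6 - 0.2 * t + rng.normal(0, 0.02, 42)
y2 = 0.6 - 0.3 * t + rng.normal(0, 0.02, 42)

# Smooth first, then measure how far apart the two series are.
d = euclidean_distance(smooth(y1), smooth(y2))
```

Note that because the distance is a straight sum over time points, longer series accumulate larger distances; with a fixed length of 42 this doesn't matter here, but it would if series lengths varied.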
The first step is to compute all pairwise distances between wells. When we include both the HL and the ML treatments, the resulting pairwise distance matrix has a chessboard appearance, because each HL plate is more similar to the other HL plates than to the ML plates:
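Computing the full pairwise matrix can be sketched with `scipy.spatial.distance.pdist`. The two treatment blocks below are synthetic stand-ins for the real wells, offset from each other so that within-treatment distances are smaller than between-treatment ones, which is what produces the chessboard pattern.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for the real data: rows are wells, columns are the 42
# time points of a (smoothed) Y(II) series. Values are illustrative only.
rng = np.random.default_rng(1)
hl = 0.4 + rng.normal(0, 0.02, size=(10, 42))  # hypothetical HL wells
ml = 0.6 + rng.normal(0, 0.02, size=(10, 42))  # hypothetical ML wells
wells = np.vstack([hl, ml])

# Condensed pairwise Euclidean distances, expanded to a square
# symmetric matrix with zeros on the diagonal.
dist_matrix = squareform(pdist(wells, metric="euclidean"))
```

Plotting `dist_matrix` as a heatmap (e.g. with `matplotlib.pyplot.imshow`) gives the block/chessboard structure described above.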
When we restrict the analysis to just a single light treatment, in this case 20h_ML, the distance matrix looks like:
At this point, note that it might be better to concatenate the time series across light treatments and then run the same pairwise distance analysis, rather than looking at light treatments one at a time. In any case, the distance matrix by itself isn't particularly informative, although it would be a good jumping-off point for a network analysis: the matrix could define edge weights between mutants, and we could compare the resulting graph to biological knowledge graphs in various ways. The next steps describe how I used these pairwise distances to find genes which were self-similar (i.e. different replicates gave similar time series) while also being notably different from the WT time series.
I grouped mutants according to the gene(s) which were knocked out, then computed the mean distance within each group and plotted it against the number of replicates. The red horizontal line is the average distance between WT replicates:
Based on this plot I decided, somewhat arbitrarily, on a distance threshold of 0.5. Any gene group with a mean intra-group distance greater than 0.5 was eliminated from contention as an interesting outlier, on the grounds that its replicates are not consistent enough. Again, I should emphasize that this threshold is arbitrary, and we would need a more principled approach for choosing it if we take this method forward.
For the genes which were consistent enough, I then computed the average distance between their replicates and all WT time series.
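The grouping, thresholding, and WT-comparison steps above can be sketched as follows. The gene names, replicate counts, and series values are all hypothetical, and `euclid` assumes the series have already been smoothed; single-replicate genes would need separate handling.

```python
import numpy as np
from itertools import combinations

def euclid(a, b):
    """Euclidean distance between two equal-length (smoothed) series."""
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

# Hypothetical data: (gene, replicate) -> Y(II) series, plus WT replicates.
rng = np.random.default_rng(2)
series = {
    ("geneA", 0): 0.3 + rng.normal(0, 0.01, 42),   # consistent, far from WT
    ("geneA", 1): 0.3 + rng.normal(0, 0.01, 42),
    ("geneB", 0): 0.55 + rng.normal(0, 0.2, 42),   # noisy, inconsistent
    ("geneB", 1): 0.55 + rng.normal(0, 0.2, 42),
}
wt = [0.6 + rng.normal(0, 0.01, 42) for _ in range(4)]

THRESHOLD = 0.5  # the (arbitrary) self-consistency cutoff from the text

results = {}
for gene in {g for g, _ in series}:
    reps = [s for (g, _), s in series.items() if g == gene]
    # Mean intra-group distance over all replicate pairs.
    intra = np.mean([euclid(a, b) for a, b in combinations(reps, 2)])
    if intra > THRESHOLD:
        continue  # replicates too inconsistent, drop from contention
    # Mean distance between every replicate and every WT series.
    results[gene] = np.mean([euclid(r, w) for r in reps for w in wt])

# Rank surviving genes by distance from WT; the head of this list
# (e.g. the top 16) gives the candidate outliers.
ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
```

Taking the head of `ranked` then corresponds to the top-16 selection used for the final plots.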
With this in hand, we can show the final results.
For each light treatment, I took the top 16 genes whose replicates had the largest average distance from the WT replicates (and met my arbitrary self-consistency threshold). This resulted in these two very interesting plots, one for each light treatment. I plotted the WT replicates in black, with the highlighted "outlier" replicates for a particular gene knockout in red.
20h_ML:
20h_HL:
I don't think the ML treatment results are particularly reliable: if you refer back to the Y(II) normalisation results here, you can see that the time series from plate 99 appears to be biased downwards relative to the other WT replicates for some reason. Therefore, many of these outliers which appear to have a much higher Y(II) time series are probably not outliers at all. Indeed, the HL results (which do not have this bias issue) show that almost all of the top-16 outliers have a much lower Y(II) than WT.
This is an interesting non-statistical approach to outlier detection. It's not completely independent of the previous post: for example, we could extract features from each time series and then apply the same Euclidean distance-based outlier detection to the feature vectors. Euclidean distance is also not the only distance metric available; many other methods could be swapped in. However, some properties commonly cited as disadvantages of the Euclidean distance (sensitivity to outliers, lack of invariance to linear transformations of the data) are, I think, actually advantages in this case.
It would be interesting to get some thoughts from the biologists on whether the candidate "outlier" genes identified here and in the previous post are plausible. How do we evaluate the predictions from this approach?