Hi there!
First of all, I appreciate the team for putting in the work for this research paper. I would like to preface this by saying that my comments here should just be considered points of discussion for possible future research related to this topic, either by the team or by the rest of the community. I am not implying that any significant changes to the current version of the paper are necessary. I hope that this is crystal clear. After all, I am not a formal researcher or reviewer of this paper. 😄
With that said, here are some of my comments on how future research could build on the work done here:
Let's begin by stating that the overall idea of performing NLP using data compression is not new:
There was also some work done in 2021 on text classification by data compression inspired by the AIMA textbook here, albeit not using KNN.
The paper's authors claim that they are "the first to use NCD with kNN for topic classification". Comparing this claim against the aforementioned list of research work, the actual novel component is the following combination of concepts:
- NCD as the distance metric used
- KNN as the classification method used
- Gzip as the compressor used
- Topic classification as the objective
- Text as the data type of interest
The combination above hinges on the fact that "text" is the data type of interest here, which is rather weak since "text" is a subset of binary data (i.e., text can be converted into bytes). The only significant difference between text and other types of binary data, such as images or executable files, would be its structural semantics.
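For concreteness, here is a minimal sketch of the NCD-Gzip-KNN combination in plain Python. The helper names, the space separator used when concatenating the two texts, and the toy data are my own assumptions for illustration, not the paper's actual code:

```python
import gzip
from collections import Counter

def clen(data: bytes) -> int:
    # Compressed length under gzip, used as a stand-in for Kolmogorov complexity.
    return len(gzip.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    cx, cy, cxy = clen(x), clen(y), clen(x + b" " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_classify(test_text: str, train: list[tuple[str, str]], k: int = 3) -> str:
    # Majority vote over the k training samples closest to the test sample under NCD.
    dists = sorted((ncd(test_text.encode(), text.encode()), label) for text, label in train)
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

# Toy usage with made-up data:
train = [
    ("the striker scored twice in the derby", "sports"),
    ("shares fell sharply after the earnings report", "finance"),
    ("the goalkeeper saved a late penalty", "sports"),
    ("the central bank raised interest rates", "finance"),
]
print(knn_classify("the midfielder scored a stunning goal", train, k=3))
```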
Exploring other distance metrics proposed by previous research work, such as LZJD or SHWeL, could be a more meaningful contribution than repeating most of the earlier work with new datasets and data types. After all, the paper mentioned that "the compressor is data-type-agnostic by nature". Redoing a data-type-agnostic approach for different data types seems somewhat redundant (although I suppose that verifying the claim that it is indeed data-type-agnostic could be worthwhile as well).
Exploring modifications to NCD can also provide more meaningful insights. For example, the paper "On normalized compression distance and large malware" (2016) suggested "interleaving" and "NCD-shuffle" as possible adaptations to NCD to improve its performance for large input data. Applying these techniques to text classification might yield interesting results.
The paper's authors highlight which distance metric (NCD), which compression algorithm (Gzip), and which classification method (KNN) are used. Instead of simply stating which selections were made, perhaps exploring the following questions in the future can provide greater insights:
Why were NCD, Gzip, and KNN chosen? What advantages and disadvantages do they have over other distance metrics (LZJD, SHWeL, etc.), compressors (LZW12, zip, etc., on top of the ones already in the paper), compression schemes (Huffman coding, etc.), and classification methods (SVM, K-Means, K-Medoids, etc.) for this specific application of text classification?
A comparison of both the speed and performance of the "NCD-Gzip-KNN" method against other combinations of distance metrics, compressors, and classifiers might also be insightful.
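As a rough illustration of how such a comparison could be set up, the compressor can be treated as a pluggable parameter of NCD so that the same harness times each candidate; the function names and the choice of compressors below are mine, not anything proposed in the paper:

```python
import bz2, gzip, lzma, time

# Each entry maps a name to a "compressed length" function over bytes.
COMPRESSORS = {
    "gzip": lambda b: len(gzip.compress(b)),
    "bz2":  lambda b: len(bz2.compress(b)),
    "lzma": lambda b: len(lzma.compress(b)),
}

def ncd_with(clen, x: bytes, y: bytes) -> float:
    # NCD parameterized by an arbitrary compressed-length function.
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def benchmark(pairs):
    # Time each compressor over the same set of (x, y) byte pairs.
    for name, clen in COMPRESSORS.items():
        start = time.perf_counter()
        dists = [ncd_with(clen, x, y) for x, y in pairs]
        print(f"{name}: {len(dists)} distances in {time.perf_counter() - start:.3f}s")
```

Accuracy would of course also have to be measured by running the full KNN evaluation with each compressor, not just timing the distance computations.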
The paper mentioned that "the compressor is data-type-agnostic by nature and non-parametric methods do not introduce inductive bias during training". While this might be true for the specific method used by the paper, it might not be true for all compressors in general. For example, zstd can build an internal compression dictionary from the training data, which can potentially be reused to perform predictions on the test data. On the other hand, since gzip is based on the DEFLATE algorithm, it simply uses the previous 32K bytes as a source for potential matches. Meanwhile, zlib, another DEFLATE-based compressor, even allows the user to "prime" the compressor with up to 32K bytes of preset dictionary data.
Do the 32K bytes used by gzip really not introduce any bias?
Are compressors really "parameter-free"? Do the internal states of the compressor count as parameters, especially since that state directly helps the compressor to compress more effectively and efficiently?
Would application-specific or domain-specific compressors help with classification? For example, instead of using general LZ77-based compression algorithms, compressing DNA sequences is usually done by using a custom algorithm (see this and this).
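To make the "priming" point above concrete, the standard zlib module exposes exactly this through the zdict argument; the dictionary contents and sample text below are made up for illustration:

```python
import zlib

def compressed_len(data: bytes, zdict: bytes = b"") -> int:
    # DEFLATE-compressed length, optionally primed with a preset dictionary (up to 32K bytes).
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return len(c.compress(data) + c.flush())

# A hypothetical "domain" dictionary assembled from training text.
domain_dict = b"the central bank raised interest rates shares fell earnings quarter market"
sample = b"the bank raised rates again as earnings fell this quarter"

print(compressed_len(sample))               # plain DEFLATE
print(compressed_len(sample, domain_dict))  # primed: matches against the dictionary can shorten the output
```

Whether priming like this counts as introducing "parameters" or bias is exactly the kind of question raised above.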
The main advantage of the approach described by the paper is that it requires far fewer resources to perform as well as, or even better than, more resource-intensive LLMs such as BERT.
More comparisons with other LLMs can further highlight the competitiveness of the compression approach.
Future work can also consider comparisons between this compression approach and other similarly low-cost alternatives, such as bag-of-words, Multinomial Naive Bayes, word embeddings, etc. (on top of the ones already in the paper).
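For reference, one such low-cost baseline takes only a few lines with scikit-learn; the toy data below is made up and not from the paper:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Bag-of-words features + Multinomial Naive Bayes: a classic low-resource text classifier.
texts = [
    "the striker scored twice in the derby",
    "shares fell sharply after the earnings report",
    "the goalkeeper saved a late penalty",
    "the central bank raised interest rates",
]
labels = ["sports", "finance", "sports", "finance"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["the midfielder scored a stunning goal"]))
```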
I hope that the paper's authors as well as other members of the research community can consider my comments and research questions as possible starting points for future research. Cheers and all the best!
Best regards,
James Raphael Tiovalen
Hey @jamestiotio, you may want to look at this https://github.com/cyrilou242/ftcc
It's an order of magnitude faster compression-based method, with similar accuracy, based on zstd.
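For readers unfamiliar with it, my rough reading of the general idea is: train one zstd dictionary per class, then assign a test sample to the class whose dictionary compresses it best. This is a sketch under that assumption, not the actual ftcc code, and the names and parameters are mine:

```python
import zstandard
from collections import defaultdict

def build_class_compressors(train, dict_size=16_384):
    # Train one zstd dictionary per class from that class's training texts.
    # Note: zstd dictionary training expects a reasonably large, varied sample set per class.
    samples_by_class = defaultdict(list)
    for text, label in train:
        samples_by_class[label].append(text.encode())
    return {
        label: zstandard.ZstdCompressor(dict_data=zstandard.train_dictionary(dict_size, samples))
        for label, samples in samples_by_class.items()
    }

def classify(text, compressors):
    # Pick the class whose dictionary yields the shortest compressed output.
    data = text.encode()
    return min(compressors, key=lambda label: len(compressors[label].compress(data)))
```

Since the per-class dictionaries are built once, classification only needs one compression per class per test sample, which is presumably where the speedup over pairwise NCD comes from.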
This looks interesting. It seems similar to DEFLATE, which appears to be better than other algorithms at preserving the semantically important contiguities relevant for text classification (the dictionary coding is definitely important here), even though those algorithms might produce better compression ratios. There's a bit of a No-Free-Lunch situation at play here. It would be nice to somehow bound the potential accuracy gains we could get by optimizing the compression algorithm for the specific problem of text classification; we could then balance that against pragmatic concerns (speed, etc.). Are DEFLATE-style algorithms brushing up against that limit?