We provide a technology intelligence platform that companies use to analyse their competitors, explore their own technology roadmap, and identify technology opportunities, all based on patent data. However, SMEs and startups might not hold any patents, so in real business situations it is very important to be able to classify unstructured technology data when building a technology portfolio.
In these experiments, we use only very small samples to test and select text classification methods, so that training requires little computation time. The goal is to get a sense of how to prepare the data, select suitable classifiers, etc.
- Download the dataset from https://github.com/JasonHoou/USPTO-2M
- Remove invalid class information from the original dataset. Most of these invalid classes contain only one patent.
- Create a balanced dataset for each class (a minimal pandas sketch follows this list).
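A minimal pandas sketch of this preparation step. The file name `uspto_2m.csv` and the column names `abstract` and `label` are assumptions for illustration, not the actual schema of the USPTO-2M release:

```python
import pandas as pd

# Assumed file and column names ("abstract", "label"); adapt to the
# actual USPTO-2M schema.
df = pd.read_csv("uspto_2m.csv")

# Remove invalid classes; most of them contain only a single patent.
counts = df["label"].value_counts()
df = df[df["label"].isin(counts[counts > 1].index)]

# Balance the dataset: downsample every class to the size of the
# smallest remaining class.
n_per_class = df["label"].value_counts().min()
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(n=n_per_class, random_state=42))
)
```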
- Use our own vocabulary, based on domain knowledge of the dataset, to build the feature matrix. The vocabulary contains n-grams generated with spaCy, TextBlob, etc. (see the tf-idf sketch after this list).
- tf-idf vectorization.
- tf-idf vectorization with feature selection (5k~20k features).
- Word embeddings (see the Skip-gram sketch after this list):
  - Train a word embedding model with the Skip-gram architecture on patent abstracts. Plot some cases using t-SNE and PCA.
  - Allow neural networks to learn the embedding layer itself.
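A sketch of the tf-idf variants, assuming `texts` and `labels` come from the balanced dataset above. Extracting the domain vocabulary via TextBlob noun phrases is one plausible reading of the n-gram generation step (spaCy's `noun_chunks` would work the same way), and `k=15000` reflects the 15k features reported as most effective below:

```python
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Candidate n-gram vocabulary built from the corpus itself,
# e.g. noun phrases extracted with TextBlob.
domain_vocab = sorted({
    phrase
    for text in texts
    for phrase in TextBlob(text).noun_phrases
})

# Option 1: tf-idf restricted to the hand-built domain vocabulary.
# ngram_range must cover the multi-word phrases in the vocabulary.
X_vocab = TfidfVectorizer(
    ngram_range=(1, 3), vocabulary=domain_vocab
).fit_transform(texts)

# Option 2: tf-idf over all uni/bi-grams, then chi-squared feature
# selection (15k performed best in the 5k~20k range tried below).
X_full = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
X_selected = SelectKBest(chi2, k=15000).fit_transform(X_full, labels)
```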
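And a sketch of training the Skip-gram embeddings with gensim, assuming `tokenized_abstracts` is a list of token lists (gensim ≥ 4 uses `vector_size`; earlier versions call it `size`). The plotted terms are illustrative only:

```python
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

model = Word2Vec(
    sentences=tokenized_abstracts,
    vector_size=100,   # the 100 dimensions used in the experiments
    sg=1,              # sg=1 selects the Skip-gram architecture
    window=5,
    min_count=5,
)

# Project a few domain terms to 2-D with PCA (t-SNE works the same way).
words = ["battery", "electrode", "sensor", "circuit"]  # illustrative terms
vecs = model.wv[words]
xy = PCA(n_components=2).fit_transform(vecs)
plt.scatter(xy[:, 0], xy[:, 1])
for word, (x, y) in zip(words, xy):
    plt.annotate(word, (x, y))
plt.show()
```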
- Traditional methods, such as Naive Bayes, SVM, and logistic regression (see the scikit-learn sketch after this list).
- Neural networks, such as ANN, CNN, and LSTM, set up as follows (see the Keras sketch after this list):
  - Embedding layer: use pre-trained word embeddings, or let the network learn the embedding itself.
  - Hidden layers: ReLU as the activation function.
  - Output layer: softmax as the activation function (a multi-class classification problem), with categorical cross-entropy as the loss function.
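A scikit-learn sketch of the traditional baselines, reusing `X_selected` and `labels` from the feature step above:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit each classic baseline on the same split and compare accuracy.
for clf in (MultinomialNB(), LinearSVC(), LogisticRegression(max_iter=1000)):
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(type(clf).__name__, f"{acc:.3f}")
```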
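And a minimal Keras sketch of the neural-network setup described above. The vocabulary size, hidden width, and class count are assumptions; to reuse the Skip-gram vectors, pass them through `embeddings_initializer` and freeze the layer, as noted in the comments:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 15000   # ~15k features, matching the neural-network runs
EMBED_DIM = 100      # the 100-dimensional embeddings from the experiments
NUM_CLASSES = 10     # assumed; set to the number of patent classes

model = tf.keras.Sequential([
    # Embedding layer: trainable by default (learned from scratch).
    # To plug in the pre-trained Skip-gram vectors, use
    # embeddings_initializer=tf.keras.initializers.Constant(matrix)
    # and trainable=False.
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.GlobalAveragePooling1D(),
    # One ReLU hidden layer: one layer performed as well as two or three.
    layers.Dense(128, activation="relu"),
    # Softmax output for the multi-class problem.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # expects one-hot labels; use the
                                      # sparse variant for integer labels
    metrics=["accuracy"],
)
```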
- Since we only used patent abstracts for testing, the results of "bag of words" and "word embeddings" do not differ much. Another possible reason is that the technical terms used in patent documents matter most for choosing patent classes: the sequential representation of words might not affect classification much, especially when the classes represent technology domains that are largely characterised by technical keywords.
- In the neural network experiments, a single hidden layer performs well compared to two or three hidden layers.
- A simple multi-layer perceptron taking n-grams as input performs well even with smaller samples and feature sets, compared to SVM, Naive Bayes, or other neural networks. Varying the number of units or the learning rate did not affect performance; only changes to the batch size affected accuracy substantially.
- In feature selection, 15k features seem most effective across the 5k to 20k range.
- Using pre-trained word embeddings does not perform better than using the tf-idf matrix or letting the algorithm learn its own embedding layer. This may be because 1) we only used 100 dimensions, which is very small; 2) the technical terms actually play the key role in patent classification and might be sufficient on their own; and 3) we represented each document as the average of its token vectors (see the sketch after the note below).
*Note that, in the tf-idf vectorization experiments, we used ~630k samples and ~180M features (number of words) to train Naive Bayes, SVM, and the other traditional classifiers, but only ~315k samples and ~15k features to train the neural networks, owing to limits on computing power.
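A sketch of the averaged-token-vector document representation mentioned in point 3); `wv` is a gensim `KeyedVectors` object such as `model.wv` from the Skip-gram sketch:

```python
import numpy as np

def doc_vector(tokens, wv, dim=100):
    """Average the token vectors to get one fixed-size document vector.

    This representation discards word order, which may be one reason
    it did not beat the tf-idf matrix in these experiments.
    """
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Stack one vector per abstract into a feature matrix:
# X_emb = np.vstack([doc_vector(toks, model.wv) for toks in tokenized_abstracts])
```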
- Use the entire patent dataset (6M+ patents from 1976 to 2019, provided by PatentsView), use the descriptions of patent documents as texts, and try different classification systems (e.g., Cooperative Patent Classification, WIPO technology fields).
- Test different variations, e.g., seqCNN, seq2seq with attention, bidirectional RCNN.
- Train doc2vec on patent abstracts and descriptions.
- Use different pre-trained word2vec models from Google and Stanford.
- Use grid search and k-fold cross-validation to optimize hyperparameters (see the sketch after this list).
- Use grid search to fine-tune the hyperparameters of the CNN, and a larger sample set to train the CNN model on a GPU.
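A minimal sketch of the planned hyperparameter search with scikit-learn. The grid shown is a placeholder for an SVM; the same pattern would apply to a CNN wrapped in a scikit-learn-compatible estimator:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Illustrative grid; replace with the parameters actually being tuned.
param_grid = {"C": [0.01, 0.1, 1, 10]}

# cv=5 gives 5-fold cross-validation for every parameter combination.
search = GridSearchCV(LinearSVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```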