Text classification with word embedding regularization & soft similarity measure
Improving text classification through the regularization of word embeddings at training and inference time
Since the seminal work of Mikolov et al. (2013), word embeddings have become the
preferred word representations for many natural language processing tasks.
Document similarity measures extracted from word embeddings, such as the soft
cosine measure (SCM) and the Word Mover's Distance
(WMD), were reported to achieve state-of-the-art performance on
semantic text similarity and text classification tasks.
Despite the strong performance of the WMD on text classification
and semantic text similarity, its super-cubic average time complexity is
impractical. The SCM has quadratic worst-case time complexity,
but its performance on text classification has never been compared with that of
the WMD. Recently, two word embedding regularization techniques were
shown to reduce storage and memory costs, and to improve training speed,
document processing speed, and task performance on word analogy, word
similarity, and semantic text similarity. However, the effect of these
techniques on text classification has not yet been studied.
In our work, we investigate the individual and joint effect of the two word
embedding regularization techniques on the document processing speed and the
task performance of the SCM and the WMD on text
classification. For evaluation, we use the k-nearest-neighbor (kNN) classifier
and six standard datasets: BBCSPORT, TWITTER, OHSUMED,
REUTERS-21578, AMAZON, and 20NEWS.
We show a 39% average test error reduction with regularized word
embeddings compared to non-regularized word embeddings.
We describe a practical procedure for deriving such regularized
embeddings through Cholesky factorization.
We also show that the SCM with regularized word embeddings significantly
outperforms the WMD on text classification and is over 10,000× faster.
Word embeddings are the state-of-the-art word representation for many
natural language processing (NLP) tasks. Most of these tasks
are at the word level, such as word analogy and word similarity, or at the
sentence level, such as question answering, natural language inference,
semantic role labeling, co-reference resolution, named-entity recognition,
sentiment analysis, and semantic text
similarity. On document-level
tasks, such as machine translation, text
classification, and ad-hoc
information retrieval, word
embeddings provide simple and strong baselines.
Document similarity measures, such as the Soft Cosine Measure
(SCM) and the Word Mover's
Distance (WMD), can be extracted from word embeddings. The
SCM achieves state-of-the-art performance on the semantic text
similarity task. The
WMD outperforms standard methods, such as the VSM,
BM25, LSI, LDA, mSDA, and
CCG on the text classification task, and achieves state-of-the-art performance on
nine text classification datasets and 22 semantic text similarity datasets with
better performance on datasets with shorter documents. The SCM is
asymptotically faster than the WMD, but their task performance has
never been compared.
Regularization techniques were reported to improve the task performance of
word embeddings. We use the quantization technique of Lam (2018), which reduces storage and memory cost, and improves
training speed and task
performance on word analogy and word similarity. We also use the
orthogonalization technique of Novotný (2018), which improves the document processing speed and the
task performance of the SCM on semantic text similarity.
However, the effect of these techniques on document-level tasks, such as text
classification, has not been studied.
In our work, we investigate the effect of word embedding regularization on text
classification. The contributions of our work are as follows:
-
We show that word embedding regularization techniques that reduce
storage and memory costs and improve speed can also significantly
improve performance on the text classification task.
-
We show that the SCM with regularized word embeddings
significantly outperforms the slower WMD on the text
classification task.
-
We define orthogonalized word embeddings and prove a relationship
between orthogonalized word embeddings, Cholesky factorization, and the
word embedding regularization technique of Novotný (2018).
The rest of the paper is organized as follows: First, we outline the
related work, and we discuss the document distance and similarity
measures and word embedding regularization
techniques. Then, we describe our experiment,
discuss its results, and present our conclusions.
Word embeddings represent words in a vector space, where syntactically and
semantically similar words are close to each other. Word embeddings can be
extracted from word co-occurrence matrices
and from neural network language models.
Word embeddings extracted from neural network language models have been
shown to be effective on several NLP tasks, but they tend to
suffer from overfitting due to high feature dimensionality. There have been
several works that use word embedding regularization to reduce overfitting.
Labutov and Lipson (2013) introduce a
technique which re-embeds an existing embedding with the end product being
a target embedding. In their technique, they perform optimization of a
convex objective. Their objective is a linear combination of the
log-likelihood of the dataset under a designated target embedding and the
Frobenius norm of a distortion matrix. The Frobenius norm serves as a
regularizer that penalizes the Euclidean distance between the initial and
the designated target embeddings. To enrich the embedding, they further
incorporate an external source embedding into the regularizer of the
supervised objective. Their approach was reported to improve
performance on the sentiment classification task.
The dropout technique was introduced by Srivastava et al. (2014) to mitigate the problem of overfitting in
neural networks by dropping their neurons and links between neurons during
training. During training, dropout samples an exponential number of
thinned networks, and at test time it approximates the effect of averaging
the predictions of these thinned networks by using a single unthinned network
with smaller weights. Learning with dropout involves two major steps:
back-propagation and unsupervised pre-training. Dropout was successfully
applied to a number of tasks in the areas of vision, speech recognition,
document classification, and computational biology.
Sun et al. (2016) introduce a sparse
constraint into Word2Vec in
order to increase its interpretability and performance. They add \( \ell_1 \)
regularization to the loss function of the Continuous
Bag of Words model (CBoW). They apply the technique to online
learning, and to address the shortcomings of stochastic gradient descent in
this setting, they employ an online optimization algorithm for regularized
stochastic learning: the Regularized Dual Averaging (RDA).
In their own work, Peng et al. (2015)
experimented with four regularization techniques: penalizing weights
(embeddings excluded), penalizing embeddings, word re-embedding and
dropout. At the end of their experiments, they concluded the following:
-
regularization techniques do help generalization, but their effect depends
largely on the dataset size,
-
penalizing the norm of embeddings
(\( \ell_2 \) regularization) also improves task performance,
-
incremental hyperparameter tuning achieves similar performance, which
shows that regularization has mostly a local effect,
-
dropout performs slightly worse than \( \ell_2 \)
regularization, and
-
word re-embedding does not improve task performance.
Another approach by Song et al. (2017)
uses pre-learned or external priors as a regularizer for the enhancement of
word embeddings extracted from neural network language models. They
considered two types of embeddings. The first one was extracted from topic
distributions generated from unannotated data using the Latent Dirichlet
Allocation (LDA). The second was based on dictionaries that
were created from human annotations. A novel data structure called the
trajectory softmax was introduced for effective learning with the
regularizers. Song et al. reported improved embedding quality through
learning from prior knowledge with the regularizer.
A different algorithm was presented by Yang et al. (2017). They applied their own regularization to
cross-domain embeddings. In contrast to Sun et al. (2016), who applied their regularization to the
CBoW, the technique of Yang et al. is a regularized skip-gram
model, which allows word embeddings to be learned from different domains.
They reported the effectiveness of their approach with experiments on
entity recognition, sentiment classification and targeted sentiment
analysis.
Berend (2018) investigates the effect of regularized sparse word
embeddings on the identification of multi-word expressions. Berend uses
dictionary learning, which decomposes the original embedding matrix by
solving an optimization problem.
Other works that focus solely on reducing storage and memory costs of word
embeddings include the following:
-
Hinton et al. (2014) use
a distillation compression technique to compress the knowledge in an
ensemble into a single model, which is much easier to deploy.
-
Chen et al. (2015) present
HashedNets, a novel framework to reduce redundancy in neural networks.
Their neural architecture uses a low-cost hash function to randomly
group connection weights into hash buckets, and all connections within the
same hash bucket share a single parameter value.
-
See et al. (2016) employ a
network weight pruning strategy and apply it to Neural Machine
Translation (NMT). They experimented with three
NMT models, namely the class-blind, the class-uniform and
the class-distribution model. The result of their experiments shows that
even strong weight pruning does not reduce task performance.
-
FastText.zip is a compression technique by Joulin et al. (2016), who use
product quantization to mitigate the loss in task performance reported with
earlier quantization techniques.
-
In an attempt to reduce the space and memory costs of word embeddings,
Shu and Nakayama (2017)
experimented with construction of embeddings with a few basis vectors, so
that the composition of the basis vectors is determined by a hash code.
Our technique is based partly on the recent regularization technique by
Lam (2018), in which a quantization
function was introduced into the loss function of the CBoW
with negative sampling to show that high-quality word embeddings using 1--2
bits per parameter can be learned. Lam's technique is based on the work of
Courbariaux et al. (2016), who employ neural networks with
binary weights and activations to reduce space and memory costs. A major
component of their work is the use of bit-wise arithmetic operations during
computation.
The Vector Space Model (VSM) is a distributional semantics model
that is fundamental to a number of text similarity applications including
text classification. The VSM represents documents as
coordinate vectors \( x \) relative to an orthonormal basis \( \{v_i\} \) of a
real inner-product space, where the coordinates correspond to weighted and
normalized word frequencies. In the VSM, a commonly used
measure of similarity for document vectors \( x \) and \( y \)
is the cosine similarity \( \cos(x, y) = \langle x, y\rangle /
(\Vert x\Vert_2 \cdot \Vert y\Vert_2) \), where \( \langle\cdot,\cdot\rangle \)
is the inner product and \( \Vert x\Vert_2 = \sqrt{\langle x, x\rangle} \) is
the \( \ell_2 \) norm of \( x \).
The cosine similarity suffers from synonymy, since distinct
words correspond to mutually orthogonal basis vectors. Therefore, documents
that use different terminology will always be regarded as dissimilar. To
borrow an example from Kusner et al. (2015), the cosine similarity of the documents
“Obama speaks to the media in Illinois” and “the President greets the
press in Chicago” is zero if we disregard stop words.
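To make the example concrete, here is a minimal NumPy sketch (with a hypothetical eight-word vocabulary) showing why the cosine similarity of the two documents is zero once stop words are removed:

```python
import numpy as np

vocabulary = ["obama", "speaks", "media", "illinois",
              "president", "greets", "press", "chicago"]
document_1 = {"obama", "speaks", "media", "illinois"}      # stop words removed
document_2 = {"president", "greets", "press", "chicago"}

# Bag-of-words coordinate vectors: distinct words get orthogonal axes.
x = np.array([float(word in document_1) for word in vocabulary])
y = np.array([float(word in document_2) for word in vocabulary])

cosine_similarity = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(cosine_similarity)  # 0.0: the documents share no terms
```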
The Word Mover's Distance (WMD) and the Soft Cosine Measure
(SCM) are document distance and similarity measures that
address synonymy. In the following subsections, we briefly discuss
the WMD and the SCM.
The Word Mover's Distance (WMD) uses network flows to find the optimal
transport between VSM document vectors. The distance of two
document vectors \( x \) and \( y \)
is the minimum cumulative cost \( \sum_{i,j} f_{ij}\, c_{ij} \) of
a flow \( F = (f_{ij}) \) subject to \( \sum_j f_{ij} = x_i / \Vert x\Vert_1 \) and
\( \sum_i f_{ij} = y_j / \Vert y\Vert_1, \) where the cost \( c_{ij} \)
is the Euclidean distance of the embeddings for words \( i \)
and \( j \), and \( \Vert x\Vert_1 = \sum_i |x_i| \) is the \( \ell_1 \) norm of \( x \).
The Sankey diagram of the network flow \( F \)
between the vector space representations of various sentences.
Semantically similar words \( i \) and \( j \) with
small cost \( c_{ij} \) are connected with green edges, and
words \( i \) and \( j \) with large flow \( f_{ij} \)
are connected with thick edges.
Observe how the Word Mover's Distance
approaches zero for semantically
similar sentences despite different terminology.
We use the implementation in PyEMD with the best known average time
complexity \( O(p^3 \log p) \),
where \( p \) is the number of unique words in \( x \)
and \( y \).
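Our experiments use PyEMD; as an illustration only, recent Gensim versions also expose the WMD through `KeyedVectors.wmdistance`, backed by an optimal-transport solver, so a minimal sketch might look as follows (the embedding file path is hypothetical):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("embeddings.kv")     # hypothetical path
document_1 = "obama speaks to the media in illinois".split()
document_2 = "the president greets the press in chicago".split()

# Word Mover's Distance between the two documents; Gensim drops
# out-of-vocabulary words before solving the transport problem.
print(vectors.wmdistance(document_1, document_2))
```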
In contrast to the standard VSM, which assumes that
the basis is orthonormal, the soft
VSM
assumes that document vectors are represented in a non-orthogonal
normalized basis \( \{v_i\} \). In the soft VSM,
basis vectors of similar words are close, and the cosine similarity of two
document vectors \( x \) and \( y \)
is the Soft Cosine Measure (SCM)
\( \mathrm{SCM}(x, y) = \frac{x^{\mathsf T} S\, y}{\sqrt{x^{\mathsf T} S\, x}\,\cdot\,\sqrt{y^{\mathsf T} S\, y}} \),
where \( s_{ij} = \langle v_i, v_j\rangle \)
and \( S = (s_{ij}) \) is a word similarity matrix.
We define the word similarity matrix like
Charlet and Damnati: \( s_{ij} = \max\big(t, \cos(e_i, e_j)\big)^o \),
where \( e_i \) and \( e_j \) are the embeddings
for words \( i \) and \( j \), and the threshold \( t \)
and the exponent \( o \) are free parameters.
The Sankey diagram of the similarities \( s_{ij} \)
between the vector space representations of various sentences.
Semantically similar words \( i \) and \( j \) with
large \( s_{ij} \) are connected with thick green edges.
Observe how the Soft Cosine Measure
approaches one for semantically similar sentences despite different
terminology.
We use the implementation in the
similarities.termsim module of Gensim. The worst-case time complexity of
the SCM is \( O(p_x \cdot p_y) \),
where \( p_x \) and \( p_y \) are
the numbers of unique words in \( x \) and
\( y \), respectively.
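As an illustration, a minimal sketch of this implementation follows; `threshold` and `exponent` play the roles of the free parameters \( t \) and \( o \) above, `nonzero_limit` bounds how many similarities are stored per matrix column, and the embedding file path is hypothetical (argument names may differ between Gensim versions):

```python
from gensim.corpora import Dictionary
from gensim.models import KeyedVectors
from gensim.similarities import (SparseTermSimilarityMatrix,
                                 WordEmbeddingSimilarityIndex)

vectors = KeyedVectors.load("embeddings.kv")           # hypothetical path
documents = [["obama", "speaks", "media", "illinois"],
             ["president", "greets", "press", "chicago"]]
dictionary = Dictionary(documents)
bows = [dictionary.doc2bow(document) for document in documents]

# threshold and exponent correspond to the free parameters t and o;
# nonzero_limit bounds the number of stored similarities per column.
term_index = WordEmbeddingSimilarityIndex(vectors, threshold=0.0, exponent=2.0)
similarity_matrix = SparseTermSimilarityMatrix(term_index, dictionary,
                                               nonzero_limit=100)

# Soft Cosine Measure between the two example documents.
print(similarity_matrix.inner_product(bows[0], bows[1],
                                      normalized=(True, True)))
```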
The Continuous Bag-of-Words Model (CBoW) is a neural network language model
that predicts the center word from its context words. The CBoW
with negative sampling minimizes the following loss function:
\[ -\sum_{c} \bigg( \log\sigma\big(\langle v_c, \bar u_c\rangle\big) + \sum_{k=1}^{N} \log\sigma\big({-\langle n_k, \bar u_c\rangle}\big) \bigg), \qquad \bar u_c = \frac{1}{2w} \sum_{\substack{-w \le j \le w \\ j \neq 0}} u_{c+j}, \]
where \( v_c \) is the vector of a center word with
corpus position \( c \), \( u_{c+j} \) is the vector of a context
word with corpus position \( c+j \), \( n_k \) is
a random (negative sample) vector, \( \sigma \) is the logistic function,
and the window size \( w \) and the number of
negative samples \( N \) are free parameters.
Word embeddings are the sum of center word vectors and context word
vectors. To improve the properties of word embeddings and the task
performance of the WMD and the SCM, we apply two
regularization techniques to CBoW.
Following the approach of Lam (2018), we
quantize the center word vector \( v_c \) and the
context word vector \( u_{c+j} \) during training using the
quantization function \( q \), applied element-wise:
\[ q(x) = \begin{cases} 1/3 & \text{if } x > 0,\\ -1/3 & \text{otherwise.} \end{cases} \]
The quantized center word vector equals \( q(v_c) \) and the
quantized context word vector equals \( q(u_{c+j}) \). The
quantization function \( q \) is discontinuous at \( x = 0 \), and its
gradient is zero almost everywhere and undefined at \( x = 0 \).
Therefore, we use
Hinton's straight-through estimator as the gradient:
\[ \frac{\partial q(x)}{\partial x} := \frac{\partial\, \mathrm{id}(x)}{\partial x} = 1, \]
where \( \mathrm{id} \) is the identity function.
Lam shows that quantization reduces the storage and memory cost and improves
the performance of word embeddings on the word analogy and word similarity
tasks.
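A minimal NumPy sketch of the idea (illustrative, not Lam's original implementation): the forward pass uses the quantized vectors, while the straight-through estimator passes the gradient back to the full-precision vectors as if \( q \) were the identity.

```python
import numpy as np

def quantize(x):
    """Element-wise 1-bit quantization in the spirit of Lam (2018):
    positive parameters map to +1/3, the rest to -1/3."""
    return np.where(x > 0, 1/3, -1/3)

def quantize_backward(grad_wrt_quantized):
    """Straight-through estimator: the gradient passes through unchanged,
    as if the quantization function were the identity."""
    return grad_wrt_quantized

v = np.array([0.4, -0.2, 0.05])
v_quantized = quantize(v)   # array([ 0.333..., -0.333...,  0.333...])
# The forward pass of the CBoW loss uses v_quantized; the backward pass
# updates the full-precision v with the unchanged gradient.
```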
Novotný (2018) shows that producing a sparse
word similarity matrix \( \tilde S \) that stores at most \( C \)
largest values from every column of \( S \)
reduces the worst-case time complexity of the SCM to
\( O(p_x \cdot C) \), which is linear in \( p_x \), the
number of unique words in a document vector \( x \).
Novotný also claims that \( \tilde S \) improves the performance of
the soft VSM on the question answering task and describes a
greedy algorithm for producing \( \tilde S \), which we will refer to
as the orthogonalization algorithm. The orthogonalization algorithm has
three boolean parameters: Sym, Dom, and Idf. Sym and Dom make \( \tilde S \)
symmetric and strictly diagonally dominant. Idf
processes the columns of \( S \) in descending order of inverse
document frequency:
the inverse document frequency of a word \( i \) equals
\( \log\big(|D| \,/\, |\{d \in D \mid i \in d\}|\big) \), where \( D \) is the set of documents.
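The following is a simplified sketch of such a greedy construction; it illustrates the roles of the \( C \), Idf, Sym, and Dom parameters, but it approximates rather than reproduces Novotný's algorithm.

```python
import numpy as np

def orthogonalize(S, C=100, idf=None, sym=True, dom=True):
    """Greedily keep at most C large off-diagonal values per column of the
    dense word similarity matrix S, optionally preserving symmetry (Sym)
    and strict diagonal dominance (Dom)."""
    n = S.shape[0]
    T = np.eye(n)                        # start from the identity; diagonal stays 1
    slots = np.full(n, C)                # remaining off-diagonal slots per column
    row_sum = np.zeros(n)                # absolute off-diagonal row sums of T
    # Idf: process columns in descending order of inverse document frequency.
    columns = np.argsort(-idf) if idf is not None else np.arange(n)
    for j in columns:
        for i in np.argsort(-S[:, j]):   # most similar words first
            if slots[j] == 0:
                break
            if i == j or (sym and slots[i] == 0):
                continue
            value = S[i, j]
            # Dom: keep every row strictly diagonally dominant (diagonal is 1).
            if dom and (row_sum[i] + abs(value) >= 1
                        or (sym and row_sum[j] + abs(value) >= 1)):
                continue
            T[i, j] = value
            row_sum[i] += abs(value)
            slots[j] -= 1
            if sym:                      # Sym: mirror the value into column i
                T[j, i] = value
                row_sum[j] += abs(value)
                slots[i] -= 1
    return T
```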
In our experiment, we compute the SCM directly from the word
similarity matrix \( \tilde S \). However, actual word embeddings
must be extracted for many NLP tasks. Novotný shows that the
word similarity matrix \( \tilde S \) can be decomposed using
Cholesky factorization. We will now define orthogonalized word embeddings,
and we will show that the Cholesky factors of \( \tilde S \) are in
fact orthogonalized word embeddings.
Definition (Orthogonalized word embeddings)
Let \( W \) and \( V \) be real matrices with \( |T| \)
rows, where \( T \) is a vocabulary of
words. Then \( V \) are orthogonalized word embeddings from
\( W \) iff for all \( i \) and \( j \),
\( \langle V_{i*}, V_{j*}\rangle \neq 0 \) implies
\( \langle V_{i*}, V_{j*}\rangle = \frac{\langle W_{i*}, W_{j*}\rangle}{\Vert W_{i*}\Vert_2 \cdot \Vert W_{j*}\Vert_2} \),
where \( W_{i*} \) and \( V_{i*} \) denote the \( i \)-th rows of
\( W \) and \( V \), respectively.
Theorem
Let \( W \) be a real matrix with \( |T| \) rows,
where \( T \) is a vocabulary of words, and let every row
\( W_{i*} \) be nonzero. Let \( S \) be a word
similarity matrix constructed from \( W \) with the parameter
values \( t = -1 \) and \( o = 1 \) (see
the SCM's definition). Let \( \tilde S \) be a word
similarity matrix produced from \( S \) using the
orthogonalization algorithm with the Sym and Dom parameters. Let
\( V \) be the Cholesky factor of \( \tilde S = V V^{\mathsf T} \). Then
\( V \) are orthogonalized word embeddings from \( W \).
Proof
With the Sym and Dom parameters, \( \tilde S \) is symmetric and
strictly diagonally dominant with a positive diagonal, and therefore also
positive definite. The symmetric positive definite matrix \( \tilde S \) has a unique
Cholesky factorization of the form \( \tilde S = V V^{\mathsf T} \).
Therefore, the Cholesky factor \( V \) exists and is uniquely
determined.
From \( \tilde S = V V^{\mathsf T} \), we have that for all
\( i \) and \( j \) such that the sparse matrix \( \tilde S \)
does not contain the value \( s_{ij} \), it
holds that \( \langle V_{i*}, V_{j*}\rangle = \tilde s_{ij} = 0 \). Since
the implication in the theorem only applies when \( \langle V_{i*}, V_{j*}\rangle \neq 0 \), we do not need to consider this case.
From \( \tilde S = V V^{\mathsf T} \) and
the parameter values \( t = -1 \) and \( o = 1 \), we have that for all
\( i \) and \( j \) such that the sparse matrix \( \tilde S \)
contains the value \( s_{ij} \), it holds
that
\[ \langle V_{i*}, V_{j*}\rangle = \tilde s_{ij} = s_{ij} = \frac{\langle W_{i*}, W_{j*}\rangle}{\Vert W_{i*}\Vert_2 \cdot \Vert W_{j*}\Vert_2}, \]
which is the claim of our theorem. ∎
T-SNE visualizations of non-regularized word embeddings \( W \)
(top left) and orthogonalized word embeddings \( V \)
(top right) for the 40 most common words on the
first 100 MiB of the English Wikipedia, and images of the word similarity
matrices \( S \) (bottom left) and \( \tilde S \)
(bottom right). Word similarity matrix construction uses the parameters
\( t = -1 \) and \( o = 1 \), and orthogonalization
uses the Sym, Dom, and Idf parameters. Notice how the cluster of numerals
from \( W \) is separated in \( V \) due to the
Idf parameter, which makes common words \( i \) and \( j \)
more likely to be mutually orthogonal, i.e.
\( \langle V_{i*}, V_{j*}\rangle = 0 \).
The figure above shows the extraction of orthogonalized word embeddings
\( V \) from \( W \): From \( W \), we
construct the dense word similarity matrix \( S \), and from
\( S \), we produce the sparse word similarity matrix \( \tilde S \)
through orthogonalization. From \( \tilde S \),
we produce the orthogonalized word embeddings \( V \) through
Cholesky factorization.
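A small dense sketch of this pipeline, reusing the `orthogonalize` sketch above and stand-in random embeddings rather than the trained CBoW embeddings:

```python
import numpy as np

# Stand-in embeddings W (one row per vocabulary word).
W = np.random.randn(300, 200)
W_normalized = W / np.linalg.norm(W, axis=1, keepdims=True)

# Dense word similarity matrix S with t = -1 and o = 1 (plain cosines).
S = W_normalized @ W_normalized.T

# Sparse word similarity matrix via the greedy sketch above (Sym and Dom).
S_sparse = orthogonalize(S, C=100, sym=True, dom=True)

# Cholesky factor V of the symmetric positive definite matrix S_sparse:
# the rows of V are the orthogonalized word embeddings.
V = np.linalg.cholesky(S_sparse)

# Sanity check of the theorem: wherever rows of V are not orthogonal,
# their inner product equals the cosine similarity of the rows of W.
G = V @ V.T
nonzero = np.abs(G) > 1e-8
assert np.allclose(G[nonzero], S[nonzero])
```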
With a large vocabulary \( T \), a
dense \( |T| \times |T| \) matrix \( S \) may not fit in the main memory, and we
produce the sparse matrix \( \tilde S \) directly from
\( W \). Similarly, a dense \( |T| \times |T| \)
matrix \( V \) may also not fit in the main memory, and we use
sparse Cholesky factorization to produce a sparse matrix \( V \)
instead. If dense word embeddings are required, we use
dimensionality reduction on \( V \) to produce a
dense \( |T| \times d \) matrix, where \( d \) is the
number of dimensions.
With the Idf parameter, words with small inverse document frequency, i.e.
common words such as numerals, prepositions, and articles, are more likely
to be mutually orthogonal than rare words, i.e.
\( \langle V_{i*}, V_{j*}\rangle = 0 \). This is why, in the figure above,
the numerals form a cluster in the non-regularized word embeddings
\( W \), but are separated in the orthogonalized word
embeddings \( V \).
The experiment was conducted on six standard text classification
datasets, using both the WMD and the SCM
with a k-nearest-neighbor (kNN) classifier and both regularized and
non-regularized word embeddings. First, we briefly describe our datasets,
and then our experimental steps. See also our experimental
code.
In our experiment, we used the following six standard text classification datasets:
BBCSPORT, TWITTER, OHSUMED,
REUTERS-21578, AMAZON, and 20NEWS.
BBCSPORT
The BBCSPORT dataset consists of 737 sport news articles from
the BBC sport website in five topical areas: athletics,
cricket, football, rugby, and tennis. The period of coverage was during
2004--2005.
TWITTER
The TWITTER dataset consists of 5,513 tweets
hand-classified into one of four topics: Apple, Google, Twitter, and
Microsoft. The sentiment of every tweet was also hand-classified as
either Positive, Neutral, Negative, or Irrelevant.
OHSUMED
The OHSUMED dataset is a set of 348,566 references spanning
1987--1991 from the MEDLINE bibliographic database of
important, peer-reviewed medical literature maintained by the National
Library of Medicine (NLM). While the majority of references
are to journal articles, there are also a small number of references to
letters of the editor, conference proceedings, and other reports. Each
reference contains human-assigned subject headings from the 17,000-term
Medical Subject Headings (MeSH) vocabulary.
REUTERS
The documents in the
REUTERS-21578 collection appeared on the
Reuters newswire in 1987. The documents were assembled and indexed with
categories by the personnel of Reuters Ltd. The collection is contained in
22 SGML files. Each of the first 21 files
(reut2-000.sgm through reut2-020.sgm) contains 1,000
documents, while the last one (reut2-021.sgm) contains only 578
documents.
AMAZON
The AMAZON dataset contains 142.8 million product
reviews from Amazon spanning 1996--2014. Each product belongs to one of
24 categories, which include Books, Cell Phones & Accessories, Clothing,
Shoes & Jewelry, Digital Music, and Electronics, among others. The
5-core subset of the AMAZON
dataset contains only those products and users with at least five
reviews.
20NEWS
The 20NEWS dataset is a
collection of 18,828 Usenet newsgroup messages partitioned across 20
newsgroups with different topics. The collection has become popular for
experiments in text classification and text clustering.
For TWITTER, we use 5,116 out of 5,513 tweets
due to unavailability; we restrict our experiment to the Positive, Neutral,
and Negative classes; and we subsample the dataset to 3,108 tweets like
Kusner et al. (2015). For
OHSUMED, we use the 34,389 abstracts related
to cardiovascular diseases out
of 50,216, and we restrict our experiment to abstracts with a single class
label from the first 10 classes out of 23. In the case of REUTERS, we use the R8 subset. For AMAZON, we use 5-core reviews from the Books, CDs
and Vinyl, Electronics, and Home and Kitchen classes.
We preprocess the datasets by lower-casing the text and by tokenizing it into
the longest non-empty sequences of alphanumeric characters that contain at
least one alphabetical character. We do not remove stop words or rare
words, only words without embeddings. We split each dataset into train and
test subsets using either a standard split (for REUTERS and 20NEWS)
or the split sizes of Kusner et al. (2015). See the table below for
statistics of the preprocessed datasets, including the average number of
unique words in a document (\( \bar p \)) and the number of document classes.
| Dataset name | Train set size | Test set size | \( \bar p \) | Classes |
|--------------|---------------:|--------------:|-------------:|--------:|
| BBCSPORT | 517 | 220 | 181.0 | 5 |
| TWITTER | 2,176 | 932 | 13.7 | 3 |
| OHSUMED | 3,999 | 5,153 | 89.4 | 10 |
| REUTERS | 5,485 | 2,189 | 56.0 | 8 |
| AMAZON | 5,600 | 2,400 | 86.3 | 4 |
| 20NEWS | 11,293 | 7,528 | 145.0 | 20 |
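A minimal sketch of the preprocessing described above (assuming ASCII text after lower-casing):

```python
import re

def preprocess(text, vocabulary):
    """Lower-case, tokenize into maximal alphanumeric runs that contain at
    least one alphabetical character, and keep only words with embeddings."""
    tokens = re.findall(r"[0-9a-z]+", text.lower())
    tokens = [token for token in tokens if any(c.isalpha() for c in token)]
    return [token for token in tokens if token in vocabulary]

print(preprocess("Obama speaks to the media in Illinois.",
                 vocabulary={"obama", "speaks", "media", "illinois"}))
```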
Using Word2Vec, we train the CBoW on the first 100 MiB of the
English Wikipedia using the same parameters as Lam (2018, Section
4.3) and 10 training epochs. We use quantized
word embeddings in 1,000 dimensions and non-quantized word embeddings in
200 dimensions to achieve comparable performance on the word analogy task.
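For the non-quantized baseline, training the CBoW with Gensim might look like the following sketch; the corpus path and most hyperparameter values are placeholders rather than the exact settings of Lam (2018, Section 4.3), and quantized training requires Lam's modified word2vec implementation.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One tokenized sentence per line; the file name is a placeholder.
sentences = LineSentence("enwik-100mib-preprocessed.txt")

# sg=0 selects the CBoW architecture; the remaining values are illustrative.
model = Word2Vec(sentences, sg=0, vector_size=200, window=5, negative=5,
                 min_count=5, epochs=10, workers=4)
model.wv.save("embeddings.kv")
```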
We use the VSM with uniform word frequency weighting, also known as the
bag of words (BoW), as our baseline. For the SCM, we use the
double-logarithm inverse collection frequency word weighting (the SMART
dtb weighting scheme) as suggested by Singhal (2011, Table 1).
For the WMD, we use
BoW document vectors like Kusner et al. (2015).
We tune the parameters \( t \) and \( o \) of the SCM,
the slope parameter of the SMART dtb weighting scheme, the number of nearest
neighbors \( k \) of the kNN classifier, and the parameters \( C \), Idf, Sym, and
Dom of the orthogonalization. For each dataset, we hold out 20% of the
train set for validation, and we use grid search to find the optimal
parameter values. To classify each sample in the test set, we use the kNN
classifier.
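A sketch of the kNN classification step, assuming a precomputed matrix of document similarities (e.g. SCM values) between test and train documents:

```python
from collections import Counter

import numpy as np

def knn_predict(similarities, train_labels, k):
    """Majority vote among the k most similar training documents.
    `similarities` is a (num_test x num_train) array of document
    similarities, e.g. Soft Cosine Measure values."""
    predictions = []
    for row in similarities:
        neighbors = np.argsort(-row)[:k]
        votes = Counter(train_labels[i] for i in neighbors)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```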
We use the method of Agresti and Coull (1998) to construct 95% confidence intervals
for the test error. For every dataset, we use
Student's t-test at the 95% confidence level, with p-values computed
for all combinations of document similarity measures and word embedding
regularization techniques, to find significant differences in
test error.
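A sketch of the Agresti-Coull interval for the test error, treating misclassifications as binomial successes (the counts below are hypothetical):

```python
from scipy.stats import norm

def agresti_coull(misclassified, test_set_size, confidence=0.95):
    """Agresti-Coull confidence interval for a binomial proportion,
    here the kNN test error."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    n = test_set_size + z ** 2
    p = (misclassified + z ** 2 / 2) / n
    half_width = z * (p * (1 - p) / n) ** 0.5
    return p - half_width, p + half_width

# Hypothetical example: 225 misclassified documents out of 932.
print(agresti_coull(225, 932))
```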
Our results are shown in the figures and tables below. In the following
subsections, we will discuss the individual results and how they are
related.
The figure below shows 95% interval estimates for the
kNN test error. All differences are
significant, except for the second and fourth results from the left on the
BBCSPORT and TWITTER datasets, the sixth and seventh results
from the left on the BBCSPORT and OHSUMED datasets, and the fourth and sixth results
from the left on the TWITTER dataset.
95% interval estimates for the kNN test
error on six standard text classification datasets.
Although most differences in the kNN test
error are statistically significant on the TWITTER dataset, they are relatively minor
compared to the other datasets. This is for two reasons:
-
TWITTER is a sentiment analysis dataset.
-
Words with opposite sentiment often appear in similar sentences.
Since word embeddings are similar for words that appear in similar
sentences, embeddings for words with opposite sentiment are often similar.
For example, the embeddings for the words good and bad are mutual nearest
neighbors, with cosine similarity 0.58 for non-quantized and 0.4 for
quantized word embeddings. As a result, word embeddings do not help us
separate positive and negative documents. Using a better measure of
sentiment in the word similarity matrix \( S \) and in the flow
cost \( c_{ij} \) would improve the task performance of the soft
VSM and the WMD.
The figures below show confusion matrices and t-SNE document
visualizations for the soft
VSM with non-regularized word embeddings and for the soft
VSM with orthogonalized and quantized word embeddings.
Confusion matrices and t-SNE document visualizations
for the soft VSM with non-regularized (50.71% test error, top) and
regularized (24.14% test error, bottom) word embeddings on the
OHSUMED dataset.
See the interactive three-dimensional t-SNE document
visualisations of the
non-regularized and
regularized word embeddings.
In the figure above, we see that with non-regularized word embeddings, we
predict most documents as class C04, followed by classes C10 and C06. When
a document representation is uniformly random, the most common classes
in a dataset are the most likely to be predicted by the kNN
classifier. In the OHSUMED dataset, 2,835 documents belong to the
most common class C04, 1,711 documents belong to the second most common
class C10, and 1,246 documents belong to the third most common class C06.
In contrast to the almost random representations of the soft
VSM with non-regularized word embeddings, orthogonalized word
embeddings make the true class label much easier to predict. We hypothesize
that this is because OHSUMED is a medical
dataset. Medical terms are highly specific, so when we search for documents
containing similar words, we need to restrict ourselves only to highly
similar words, which is one of the effects of using orthogonalized word
embeddings.
Confusion matrices and t-SNE document
visualizations for the soft VSM with non-regularized
(10.19% test error, top) and regularized (7.22% test error, bottom) word
embeddings on the REUTERS dataset.
See the interactive three-dimensional t-SNE document
visualisations of the
non-regularized and
regularized word embeddings.
In the above figure, we see that with non-regularized word embeddings, most
documents from the class Grain are misclassified as the more common class
Crude. With regularized word embeddings, the classes are separated. We
hypothesize that this is because both grain and crude oil are commodities,
which makes the documents from both classes contain many similar words. The
classes will become indistinguishable unless we restrict ourselves only to
a subset of the similar words, which is one of the effects of using
orthogonalized word embeddings.
Confusion matrices and t-SNE document
visualizations for the soft VSM with non-regularized
(42.53% test error, top) and regularized (29.97% test error, bottom) word
embeddings on the 20NEWS dataset.
Notice how the Usenet hierarchies comp.*, rec.*,
and rec.sport.* form visible clusters in the non-regularized
word embeddings. See the interactive three-dimensional t-SNE
document visualisations of the
non-regularized and
regularized word embeddings.
In the figure above, we see that with non-regularized word embeddings,
messages are often misclassified as a different newsgroup in the same
Usenet hierarchy. We can see that the Usenet hierarchies comp.*,
rec.*, and rec.sport.* form visible clusters in both the
t-SNE document visualization and in the confusion matrix. We
can also see that with regularized word embeddings, the clusters are
separated. We hypothesize that this is because newsgroups in a Usenet
hierarchy share a common topic and similar terminology. The terminology of
the newsgroups will become difficult to separate unless only highly
specific words are allowed to be considered similar, which is one of the
effects of using orthogonalized word embeddings.
Average kNN test error compared to the BoW.
The figure on the side shows the average kNN test error
ratio between the document similarity measures and the BoW baseline. This
ratio is the lowest for the soft VSM with orthogonalized word
embeddings, which achieves 48.86% of the average BoW
kNN test error. The average kNN test error ratio
between the soft VSM with regularized word embeddings and the soft
VSM with non-regularized word embeddings is 60.57%, which is a
39.43% reduction in the average kNN test error.
Unlike the soft VSM, the WMD does not benefit from
word embedding quantization. This is for two reasons:
-
The soft VSM takes into account the similarity between all
words in two documents, whereas the WMD only considers the
most similar word pairs.
-
Non-quantized word embeddings are biased towards positive similarity, see
the figure below.
Histograms of word embedding cosine similarity.
With non-quantized word embeddings, the embeddings of unrelated words have
positive cosine similarity, which makes dissimilar documents less
separable. With quantized embeddings, unrelated words have negative cosine
similarity, which improves separability and reduces the kNN
test error. The WMD is
unaffected by the bias in non-quantized word embeddings, and the reduced
precision of quantized word embeddings increases its kNN
test error.
Average document processing speed on one Intel Xeon X7560 core.
The figure on the side shows the average document processing speed using a single
Intel Xeon X7560 2.26 GHz core. Although the orthogonalization reduces the
worst-case time complexity of the soft VSM from quadratic to
linear, it also makes the word similarity matrix sparse, and performing
sparse instead of dense matrix operations causes a 2.73× slowdown compared
to the soft VSM with non-orthogonalized word embeddings.
Quantization causes a 1.8× slowdown, which is due to the 5× increase in the
word embedding dimensionality, since we use 1000-dimensional quantized word
embeddings and only 200-dimensional non-quantized word embeddings.
The super-cubic average time complexity of the WMD results in
an average 819,496× slowdown compared to the soft
VSM with orthogonalized and quantized word embeddings, and a
1,505,386× slowdown on the 20NEWS dataset,
which has a large average number of unique words in a document. Although
Kusner et al. (2015) report
up to 110× speed-up using an approximate variant of the WMD
(the WCD), this still results in an average 7,450× slowdown,
and a 13,685× slowdown on the 20NEWS dataset.
The table below shows the optimal parameter values for the soft
VSM with orthogonalized word embeddings. The most common parameter
values are shown in bold. The most common value of the Idf parameter shows that
it is important to store the nearest neighbors of rare words in the word
similarity matrix \( \tilde S \). The most common values of the threshold
\( t \), the exponent \( o \), and the number of stored values \( C \), together with
the Sym and Dom parameters, show that strong orthogonalization, which makes
most values in \( \tilde S \) zero or close to zero, gives the best
results.
| Dataset name | \( o \) | \( t \) | Slope | \( C \) | \( k \) | Idf | Sym | Dom |
|--------------|--------:|--------:|------:|--------:|--------:|:---:|:---:|:---:|
| BBCSPORT | 2 | **0** | **−1** | **100** | **1** | ✔ | ✔ | ✔ |
| TWITTER | 2 | **0** | **−1** | 400 | 13 | ✔ | ✔ | |
| OHSUMED | **4** | **0** | **−1** | 200 | 11 | ✔ | | ✔ |
| REUTERS | **4** | **0** | **−1** | **100** | 19 | ✔ | ✔ | ✔ |
| AMAZON | **4** | **0** | **−1** | **100** | 17 | ✔ | ✔ | ✔ |
| 20NEWS | 3 | **0** | **−1** | **100** | **1** | ✔ | | |
Due to the low document processing speed of the WMD and the
number of orthogonalization parameters, using parameter optimization on the
WMD with regularized embeddings is computationally
intractable. Therefore, we do not report results for the WMD
with regularized embeddings in the above figures.
Word embeddings achieve state-of-the-art results on several
NLP tasks, predominantly at the sentence level, but overfitting is a
major issue, especially when applied at the document level with a large number
of words. We have shown that regularization of word embeddings significantly
improves their performance not only on the word analogy, word similarity, and
semantic text similarity tasks, but also on the text classification task.
We further show that the most effective word embedding regularization
technique is orthogonalization and we prove a connection between
orthogonalization, Cholesky factorization and orthogonalized word
embeddings. With word embedding orthogonalization, the task performance of
the soft VSM exceeds that of the WMD, a previously
state-of-the-art document distance measure, while being several orders of
magnitude faster. This is an important step in bringing the application of word
embeddings from supercomputers to mobile and embedded systems.