Sentiment analysis with covariate-assisted word embeddings

Sentiment analysis measures the inclination of textual documents, aiming to extract and quantify their subjective sentiment polarity. In the literature, most sentiment analysis methods first numericalize textual documents through some word embedding framework, and then formulate sentiment analysis as an ordinal regression or classification task. Yet it is often ignored that different people may have different preferences of wording, and thus a uniform word embedding often leads to suboptimal performance. In this article, to accommodate the heterogeneity among individual persons, we propose a covariate-assisted word embedding in a margin-based ordinal regression framework, where covariates are incorporated through scaling factors that adjust the word embeddings. Moreover, we employ a block-wise coordinate descent scheme to tackle the resultant large-scale optimization task, and establish theoretical results quantifying the asymptotic behavior of the proposed method, guaranteeing a fast convergence rate in terms of prediction accuracy. Finally, we demonstrate the advantages of the proposed method over its competitors on both the Yelp Challenge dataset and synthetic datasets. MSC2020 subject classifications: Primary 62H30.


Introduction
Unstructured text data has become increasingly important in recent years, due to the fast advancement of information technology and the evolution of information storage. It typically arises from text-heavy documents, including customer reviews, news stories, and online tweets. One of the central tasks of text data analysis is to extract the subjective sentiment polarity of textual documents, which has become an essential component in modern business analytics and political surveys [1,2].
In the literature, most sentiment analysis methods first convert textual documents into numerical vectors, and then formulate sentiment analysis as a classification task, where sentiment levels are treated as binary or ordinal responses [3,4,5]. The numericalization is often done by using the bag-of-words framework [6], where word presences or frequencies in the textual documents are extracted as the numerical features. The bag-of-words framework is interpretable and easy to implement, but it fails to capture the relationships among meaningful words. Recently, embedding techniques [7] have drawn significant interest from both the statistics and machine learning communities for their flexibility and interpretability in representing textual documents, including Word2Vec [8] and Global Vectors for Word Representation (GloVe) [9]. The key idea of word embeddings is to embed each word into a low-dimensional vector space so that the vectors of relevant words are close in the embedded space. A number of embedding schemes have been proposed from various perspectives, in order to obtain a uniform word embedding for all individual persons to facilitate the subsequent text data analysis.
A uniform word embedding is simple in nature but suffers from some intrinsic limitations, due to the fact that different people may have different preferences of wording. For example, the word "interesting" can be used to express a neutral or even negative sentiment by people who tend not to use negative words, so as to show politeness. Also, it appears very common that people use sarcastic expressions on the internet. One review in the Yelp dataset says "I feel so excited to check out" to express an extremely negative sentiment without using any negative words. The analysis of such textual statements can easily be misled by the presence of positive words with strong polarity [10,11]. In the literature, differences in wording across genders, ages, educational backgrounds and political backgrounds have been widely reported [1,12,13]. It is thus natural to consider adaptive word embeddings that capture the heterogeneity among individual persons, so that their preferences of wording can be incorporated to improve prediction accuracy. Yet only a few attempts have been made in the literature, including time-varying word embeddings [14] and topic-adaptive word embeddings [15,16].
In this paper, we propose a sentiment analysis method based on a novel covariate-assisted word embedding, which integrates covariates into an ordinal regression framework [17] to refine word embeddings for better prediction accuracy. Specifically, a sentiment lexicon and the corresponding word embeddings are employed to construct a covariate-adjusted representation of each textual document. For each covariate level, an adjusting factor is introduced to scale the original word embeddings, which quantifies the deviation of semantics from the pre-trained word embeddings. Furthermore, we develop a scalable block-wise coordinate descent algorithm to tackle the resultant large-scale optimization task. Theoretically, the asymptotic convergence rate of the proposed method is established in terms of the sample size, the number of sentiment levels, the number of covariate levels, and the lexicon size.
The rest of the paper is organized as follows. Section 2 presents the proposed covariate-assisted word embeddings in an ordinal regression framework for sentiment analysis, as well as the block-wise coordinate descent algorithm. Section 3 establishes the asymptotic results for the proposed method, assuring its fast convergence rate in several situations. Section 4 conducts a simulation study to examine the numerical performance of the proposed method on various synthetic datasets, and applies the proposed method to analyze the Yelp challenge dataset. A brief summary is given in Section 5, and the Appendix contains the technical proofs.

Preambles
In sentiment analysis, a training dataset consists of {(t_ij, y_ij); i = 1, ..., u, j = 1, ..., N_i}, where t_ij is the j-th textual document written by the i-th person, and y_ij ∈ {1, ..., K} indicates its sentiment level with ordering 1 ≺ 2 ≺ ··· ≺ K, where ≺ denotes "less positive" in terms of sentiment level. The primary goal of sentiment analysis is to construct a decision function φ(t_ij) that accurately predicts the sentiment level of t_ij, so that the disagreement between φ(t_ij) and y_ij is minimized.
Various disagreement metrics for ordinal regression have been considered in the literature [18], including mean absolute error (MAE), mean zero-one error (MZOE) and mean square error (MSE). However, neither MZOE nor MSE is originally designed for ordinal regression, and both metrics have their own limitations for analyzing ordinal data. In particular, MZOE fails to take the ordinality into account, which is especially undesirable in sentiment analysis, where mis-classifying a positive review as neutral is less severe than mis-classifying it as negative, whereas these two types of misclassification are treated equally by MZOE. As for MSE, it regards the ordinal response as continuous, leading to unnecessary bias, especially when the ordinal responses are only encoded to reflect the ordering but imply no elaboration of the differences among the ordered values. By contrast, MAE appears to be a reasonable choice for ordinal regression, and is widely used in the literature [17]. It can be written as

MAE(φ) = (1/n) Σ_{i,j} |φ(t_ij) − y_ij| = (1/n) Σ_{i,j} Σ_{k=1}^{K−1} I(sgn(y_ij − k) ≠ sgn(φ(t_ij) − k)),   (2.1)

where n is the total number of training documents, I(·) is an indicator function, and sgn(x) = 1 when x > 0, and −1 otherwise. It follows immediately from (2.1) that minimizing MAE is equivalent to solving K − 1 binary classification problems, where sgn(y − k) is treated as the binary class label and sgn(φ(t) − k) denotes the corresponding classification decision function.
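The decomposition of the absolute error into K − 1 binary mistakes can be checked numerically; a minimal sketch (function names are ours):

```python
import numpy as np

def sgn(x):
    # sgn(x) = 1 if x > 0, and -1 otherwise, as in the text
    return np.where(x > 0, 1, -1)

def mae_direct(pred, y):
    return np.mean(np.abs(pred - y))

def mae_binary_decomposition(pred, y, K):
    # |pred - y| equals the number of thresholds k at which
    # sgn(y - k) and sgn(pred - k) disagree, k = 1, ..., K-1
    total = 0.0
    for k in range(1, K):
        total += np.mean(sgn(y - k) != sgn(pred - k))
    return total

rng = np.random.default_rng(0)
K = 5
y = rng.integers(1, K + 1, size=1000)
pred = rng.integers(1, K + 1, size=1000)
assert np.isclose(mae_direct(pred, y), mae_binary_decomposition(pred, y, K))
```

The assertion holds because, for integer-valued labels and predictions in {1, ..., K}, each unit of absolute error corresponds to exactly one threshold k straddled by the pair (y, pred).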
Instead of estimating φ(t) directly, it is common to introduce K − 1 functions f = (f_1, ..., f_{K−1})^T, where sgn(f_k(t)) plays the role of sgn(φ(t) − k), leading to the risk

E[Σ_{k=1}^{K−1} I(sgn(y − k) f_k(t) < 0)].   (2.2)

The indicator function in (2.2) is computationally intractable for optimization, and thus we replace it by a surrogate margin loss. Specifically, an empirical version of (2.2) with a surrogate loss and a regularization term can then be constructed to estimate f,

min_f (1/n) Σ_{i=1}^u Σ_{j=1}^{N_i} Σ_{k=1}^{K−1} V(sgn(y_ij − k) f_k(t_ij)) + λ J(f),   (2.3)

where V(z) is a surrogate margin loss function non-increasing in z, J(f) is a regularization term, and λ is a tuning parameter. Here V(z) can take various forms. For instance, V(·) can be the hinge loss V(u) = (1 − u)_+ [19], the ψ-loss V(u) = min((1 − u)_+, 1) [20], or the logistic loss V(u) = log(1 + exp(−u)) [21]. It is interesting to note that setting V as the hinge loss or the logistic loss in (2.3) resembles the all-threshold method [22] for ordinal regression.
To facilitate the modelling of f, textual documents need to be pre-processed into numerical vectors. In the literature, primitive approaches extract word presence or frequency in t as predictors under the bag-of-words framework [6,23].
Recently, more informative word embedding frameworks have been developed, such as Word2Vec and GloVe [8,9]. In particular, Word2Vec learns word representations via a three-layer neural network, which assumes that words with similar linguistic meanings appear close together in textual documents, resulting in frequent co-occurrences within a fixed-size context window.
Specifically, let D = {ω_1, ω_2, ..., ω_d} be a lexicon of sentiment words and E ∈ R^{p×d} be an embedding matrix, where p is the dimension of the embedded space and each column is the embedding of the corresponding word in D. We further define B(t) = (b_1, ..., b_d)^T to be the frequency vector of t based on D. Then EB(t) is the averaged embedding of the words appearing in t, which can be viewed as the representation of t in the embedded space, and the sentiment function f_k can be formulated as

f_k(t) = β^T E B(t) + β_{0,k},   (2.4)

where β ∈ R^p and β_{0,K−1} ≤ ... ≤ β_{0,1}. Clearly, f_1, ..., f_{K−1} are parallel, and their ordering is inherited from β_{0,k}; k = 1, ..., K − 1. Such structures have been commonly used in the literature to enforce ordering among multiple functions [24,25,26].
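As a concrete sketch of this representation, with a toy lexicon and a random matrix standing in for a pre-trained E, the document representation EB(t) and the parallel sentiment functions can be computed as:

```python
import numpy as np

rng = np.random.default_rng(1)
p, d, K = 4, 6, 3            # embedding dim, lexicon size, sentiment levels
E = rng.normal(size=(p, d))  # columns are word embeddings (pre-trained in practice)

def frequency_vector(tokens, lexicon):
    # B(t): counts of each lexicon word appearing in the document
    return np.array([tokens.count(w) for w in lexicon], dtype=float)

lexicon = ["good", "bad", "great", "awful", "fine", "poor"]
doc = ["good", "great", "good", "fine"]
B = frequency_vector(doc, lexicon)

beta = rng.normal(size=p)
beta0 = np.sort(rng.normal(size=K - 1))[::-1]  # beta_{0,1} >= ... >= beta_{0,K-1}

# f_k(t) = beta^T E B(t) + beta_{0,k}: parallel functions differing only in intercept
f = beta @ E @ B + beta0
assert np.all(np.diff(f) <= 0)  # ordering f_1 >= ... >= f_{K-1} is inherited
```

Since all f_k share the common slope β^T E B(t), the monotonicity of the intercepts alone guarantees the ordering of the K − 1 functions.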

Covariate-assisted word embeddings
In many scenarios, some covariates x_ij = (x_ij1, x_ij2, ..., x_ijL)^T are also available for each observation (t_ij, y_ij), where x_ijl ∈ {1, ..., m_l} denotes the l-th covariate and m_l denotes its number of distinct levels. To incorporate the available covariates into the word embeddings, the proposed covariate-assisted sentiment function f_k is formulated as

f_k(t_ij, x_ij) = β^T E (w^{(1)}_{x_ij1} ∘ w^{(2)}_{x_ij2} ∘ ··· ∘ w^{(L)}_{x_ijL} ∘ B(t_ij)) + β_{0,k}(x_ij),   (2.5)

where ∘ denotes the entry-wise product, w^{(l)}_{x_ijl} ∈ R^d denotes the parameter vector corresponding to x_ijl, and the intercept β_{0,k}(x) is allowed to vary with x. Note that both E and B are pre-trained or pre-specified, and only β, β_{0,k} and w^{(l)}_{x_ijl} are the unknown parameters in (2.5) that need to be estimated. In particular, w^{(l)}_{x_ijl} serves to adjust B(t_ij) in f_k(t_ij, x_ij), and the varying intercept can be viewed as a baseline in predicting sentiment for each covariate level. Furthermore, although β and w^{(l)}_{x_ijl} may not be identifiable, they contribute to f_k(t_ij, x_ij) only through their product, and thus the non-identifiability does not affect the predictability of the proposed method. If interpretability is also of interest, one may fix ‖β‖_2 = 1 to avoid the non-identifiability, which only requires an additional normalization step for β. Additionally, the proposed method in (2.5) is mainly designed for binary and categorical covariates; a direct extension to continuous covariates is to divide the domain of a covariate into exclusive subsets, which are then treated as distinct categorical levels.
Let w_{x_ij} = w^{(1)}_{x_ij1} ∘ ··· ∘ w^{(L)}_{x_ijL} be the overall adjusting effect; then f_k(t_ij, x_ij) can be rewritten as

f_k(t_ij, x_ij) = β^T (E diag(w_{x_ij})) B(t_ij) + β_{0,k}(x_ij),   (2.6)

where diag(w_{x_ij}) denotes the diagonal matrix with w_{x_ij} on its diagonal. This formulation leads to the proposed covariate-assisted word embedding E diag(w_{x_ij}), where w_{x_ij} calibrates the embedding matrix E by multiplying each embedding vector with a scaling factor. It allows for a refined word embedding by incorporating the available covariates, in sharp contrast to the uniform word embeddings in the literature. For example, a positive word can be used to express negative sentiment when its embedding vector is multiplied by a negative scalar. This flexibility is particularly attractive when analyzing sarcastic statements on the internet. Additionally, the overall adjusting effects w_{x_ij} of documents issued by the same person will be close since these documents share common covariates, implying similarity among the wording of different textual documents by the same person. It also allows for similarity in wording between different persons, depending on the levels of their common covariates.
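The scaling view can be verified in a few lines: adjusting the frequency vector entry-wise by the overall effect w is the same as rescaling each embedding column of E by the corresponding entry of w (a minimal sketch with random stand-ins for E and B):

```python
import numpy as np

rng = np.random.default_rng(2)
p, d = 4, 6
E = rng.normal(size=(p, d))
B = rng.poisson(1.0, size=d).astype(float)

# overall adjusting effect: entry-wise product of per-covariate factors (L = 2 here)
w1 = rng.uniform(-1, 1, size=d)   # factor for covariate 1 at its observed level
w2 = rng.uniform(-1, 1, size=d)   # factor for covariate 2 at its observed level
w = w1 * w2

# adjusting the frequency vector vs. rescaling the embedding columns
lhs = E @ (w * B)
rhs = (E * w) @ B   # E * w scales column q of E by w_q, i.e. E diag(w)
assert np.allclose(lhs, rhs)
```

The identity E(w ∘ B) = (E diag(w)) B is what lets one read the same model either as reweighted word counts or as a covariate-specific embedding matrix.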
With the modelling of f_k in (2.6), the proposed method can be organized as

min_{β, W, β_0} (1/n) Σ_{i,j} Σ_{k=1}^{K−1} V(sgn(y_ij − k) f_k(t_ij, x_ij)) + λ_1 J(β) + λ_2 J(W),   (2.7)

where W = [W^{(1)}, W^{(2)}, ..., W^{(L)}] and W^{(l)} = [w^{(l)}_1, ..., w^{(l)}_{m_l}] denotes the adjusting matrix of the l-th categorical covariate. Here the numbers of parameters in β and W are p and d Σ_{l=1}^L m_l, respectively. To control the complexity of the sentiment functions, J(β) and J(W) can be any regularization terms. In the sequel, we illustrate the proposed method by setting V(·) as the hinge loss, J(β) = ‖β‖_2^2, and J(W) = ‖W‖_F^2. Note that the proposed method in (2.7) involves two tuning parameters λ_1 and λ_2, and Lemma 1 shows that they play a similar role in (2.7), which significantly simplifies the tuning process for λ_1 and λ_2.

Lemma 1. The solution to (2.7) remains the same as long as λ_1 λ_2^L stays the same.
Lemma 1 implies that the optimization task in (2.7) with tuning parameters (λ_1, λ_2) has the same solution as that with (λ, λ) satisfying λ^{L+1} = λ_1 λ_2^L. Therefore, we simplify the cost function in (2.7) by setting λ_1 = λ_2 = λ in the sequel. The proof of Lemma 1 and all other technical proofs are provided in a supplementary file [27].
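A sketch of the rescaling argument behind Lemma 1, assuming the squared-norm penalties J(β) = ‖β‖_2^2 and J(W) = ‖W‖_F^2: for any c > 0, reparametrizing β = c^L β̃ and W^{(l)} = c^{−1} W̃^{(l)} leaves every f_k unchanged, since the overall adjusting effect is an entry-wise product of L factors, so the factor c^{−L} picked up by the product cancels the factor c^L on β. The penalty, however, transforms as

```latex
\lambda_1 \|\beta\|_2^2 + \lambda_2 \|W\|_F^2
  = \lambda_1 c^{2L} \|\tilde{\beta}\|_2^2 + \lambda_2 c^{-2} \|\tilde{W}\|_F^2 .
```

Hence the problem with tuning parameters (λ_1, λ_2) is equivalent, after reparametrization, to the problem with (λ_1 c^{2L}, λ_2 c^{−2}), and the combination λ_1 c^{2L} (λ_2 c^{−2})^L = λ_1 λ_2^L is invariant in c, which is exactly the quantity identified in Lemma 1.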

Scalable computation
Note that the optimization task in (2.7) is a bi-convex optimization problem with respect to β and W, and hence we employ a block-wise coordinate descent algorithm to update the blocks sequentially. By introducing slack variables ξ_ijk, (2.7) can be reformulated as

min_{β, W, β_0, ξ} (1/n) Σ_{i,j} Σ_{k=1}^{K−1} ξ_ijk + λ J(β) + λ J(W), subject to sgn(y_ij − k) f_k(t_ij, x_ij) ≥ 1 − ξ_ijk, ξ_ijk ≥ 0.   (2.8)

We then break (2.8) into multiple sub-tasks, and update β, W and β_0 alternately. Specifically, when W and β_0 are fixed, β can be updated by solving

min_{β, ξ} (1/n) Σ_{i,j} Σ_{k=1}^{K−1} ξ_ijk + λ ‖β‖_2^2, subject to sgn(y_ij − k) f_k(t_ij, x_ij) ≥ 1 − ξ_ijk, ξ_ijk ≥ 0.   (2.9)

Note that the optimization task in (2.9) resembles a linear support vector machine (SVM) except for the additional varying intercepts, and such problems can be efficiently solved by Liblinear [28]. We therefore develop a similar optimization scheme based on the dual coordinate descent method in Liblinear to solve (2.9), which is named driftSVM and available in the Python package VarSVM.
When β and β_0 are fixed, our strategy is to use a back-fitting scheme to update W^{(l)}; l = 1, ..., L sequentially. In particular, the columns of W^{(l)} can be optimized in a parallel fashion. That is, each w^{(l)}_q, q = 1, ..., m_l, can be updated by solving

min_{w^{(l)}_q, ξ} (1/n) Σ_{(i,j): x_ijl = q} Σ_{k=1}^{K−1} ξ_ijk + λ ‖w^{(l)}_q‖_2^2, subject to sgn(y_ij − k) f_k(t_ij, x_ij) ≥ 1 − ξ_ijk, ξ_ijk ≥ 0.   (2.10)

The optimization task in (2.10) has exactly the same form as (2.9), and hence can be solved by the identical optimization scheme. When β and W are fixed, β_0 can be updated by solving

min_{β_0, ξ} (1/n) Σ_{i,j} Σ_{k=1}^{K−1} ξ_ijk, subject to sgn(y_ij − k) f_k(t_ij, x_ij) ≥ 1 − ξ_ijk, ξ_ijk ≥ 0.   (2.11)

It is clear that (2.11) is a standard linear programming formulation with respect to β_{0,k}(x_ij) and ξ_ijk, and can be efficiently solved by the popular interior-point algorithm available in the Python package cvxopt [29]. The parallel block-wise coordinate descent algorithm for the proposed method is summarized in Algorithm 1. In essence, it is an implementation of block successive convex minimization, and hence is guaranteed to converge to a stationary point [30].
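The block alternation can be sketched end-to-end in a toy form. The sketch below is only meant to show the block structure: it uses plain subgradient steps with backtracking in place of the dedicated solvers (driftSVM for β and W, linear programming for β_0), restricts to a single covariate (L = 1), and omits the ordering constraint on the intercepts.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, p, K, m = 80, 10, 5, 3, 2   # docs, lexicon size, embedding dim, levels, covariate levels
lam = 0.1

E = rng.normal(size=(p, d))                         # pre-trained embeddings (random stand-in)
Bmat = rng.poisson(1.0, size=(n, d)).astype(float)  # frequency vectors B(t_i)
x = rng.integers(0, m, size=n)                      # one covariate with levels 0..m-1
y = rng.integers(1, K + 1, size=n)                  # sentiment levels 1..K

beta = 0.1 * rng.normal(size=p)
W = np.ones((m, d))                                 # start with no adjustment
beta0 = np.zeros((m, K - 1))                        # varying intercepts beta_{0,k}(x)

def objective(beta, W, beta0):
    # (1/n) sum_{i,k} hinge(sgn(y_i - k) f_k(i)) + lam (||beta||^2 + ||W||_F^2)
    f = (Bmat * W[x]) @ E.T @ beta
    obj = lam * (beta @ beta + np.sum(W * W))
    for k in range(1, K):
        s = np.where(y > k, 1.0, -1.0)
        obj += np.mean(np.maximum(0.0, 1.0 - s * (f + beta0[x, k - 1])))
    return obj

def active_coeffs(beta, W, beta0):
    # c_i = sum over thresholds k of sgn(y_i - k) where the hinge is active
    f = (Bmat * W[x]) @ E.T @ beta
    c = np.zeros(n)
    for k in range(1, K):
        s = np.where(y > k, 1.0, -1.0)
        c += s * (s * (f + beta0[x, k - 1]) < 1.0)
    return c

def step(params, block, grad, lr=0.1):
    # subgradient step on one block, with backtracking so the objective never increases
    base = objective(*params)
    for _ in range(15):
        trial = [v.copy() for v in params]
        trial[block] = params[block] - lr * grad
        if objective(*trial) < base:
            return tuple(trial)
        lr *= 0.5
    return params

obj_start = objective(beta, W, beta0)
for it in range(30):
    c = active_coeffs(beta, W, beta0)               # --- update beta (paper: driftSVM)
    g = 2 * lam * beta - E @ ((c[:, None] * Bmat * W[x]).sum(axis=0)) / n
    beta, W, beta0 = step((beta, W, beta0), 0, g)

    c = active_coeffs(beta, W, beta0)               # --- update W, level by level
    g = 2 * lam * W.copy()
    Eb = E.T @ beta
    for q in range(m):
        g[q] -= (c[x == q, None] * Bmat[x == q] * Eb).sum(axis=0) / n
    beta, W, beta0 = step((beta, W, beta0), 1, g)

    f = (Bmat * W[x]) @ E.T @ beta                  # --- update beta_0 (paper: linear program)
    g = np.zeros((m, K - 1))
    for k in range(1, K):
        s = np.where(y > k, 1.0, -1.0)
        a = (s * (f + beta0[x, k - 1]) < 1.0).astype(float)
        for q in range(m):
            g[q, k - 1] = -np.sum(s * a * (x == q)) / n
    beta, W, beta0 = step((beta, W, beta0), 2, g)
obj_end = objective(beta, W, beta0)
assert obj_end <= obj_start     # block updates are monotone by construction
```

Because each block update is accepted only if it decreases the objective, the loop is monotone, mirroring the descent property of the block successive convex minimization scheme.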

Theory
This section establishes the asymptotic convergence of the proposed method in estimating the ideal sentiment function f_0 [17]. The regret of f is then defined as e(f, f_0) = E[V(y, f(x, t))] − E[V(y, f_0(x, t))], the excess risk of f relative to f_0. Let n = Σ_{i=1}^u N_i denote the total number of observations; two technical assumptions are made to quantify the asymptotic behavior of the proposed method.
Assumption A assures that the approximation error of F in approximating f_0 is governed by ξ_n, which eventually impacts the asymptotic behavior of the proposed sentiment function f̂.

Assumption B.
There exist constants α > 0, 1 ≥ γ ≥ 0 and a_1, a_2 > 0 such that for any sufficiently small δ_n > 0, inequality (3.1) holds. Assumption B implies the local smoothness of e(f, f_0) and of the variance of V(y, f(x, t)) within a neighborhood of f_0. Here α and γ are determined by the joint distribution of (x, t) and the loss function V. Additionally, (3.1) provides a connection between the first and second moments of V(y, f(x, t)), which is essential for establishing the subsequent large deviation inequalities. In fact, Assumption B is a mild assumption and has been verified for various losses and distributions in the literature [31,32]. For example, Assumption B holds for the hinge loss and any distribution P(x, t) with α = 1 and γ = 1 [32].
Clearly, the rate δ_n^2 is governed by both ε_n^2 and ξ_n, where ε_n^2 is determined by the complexity of F_V, which depends on the dimension of the embedded space p, the size of the sentiment lexicon d, and the numbers of levels of all covariates m_l; l = 1, ..., L. Usually, there is a trade-off between the approximation error ξ_n and the complexity of F_V over the choice of f_0, so as to attain the optimal convergence rate δ_n^{2α}.

Numerical experiments
In this section, we conduct a series of numerical experiments on simulated datasets and the Yelp challenge dataset to examine the performance of the proposed method. We compare it against various baseline word embedding methods in the literature, including word embeddings based on Google news trained by the Word2Vec technique [8], word embeddings based on Wikipedia trained by GloVe [9], and random word embeddings generated from a multivariate normal distribution. Here the random word embeddings serve as a baseline to verify the effectiveness of the other two pre-trained embeddings. Moreover, we let Google_p, Wiki_p, and Random_p denote the corresponding covariate-assisted word embeddings, whereas Google, Wiki and Random denote the corresponding baseline embeddings in (2.4), respectively. For each method, the tuning parameters are selected via a grid search over [10^{-6}, 10^3], and the numerical performance is measured by the MAE evaluated on a test set of size n_test.

Yelp challenge
The Yelp challenge dataset consists of four parts, "business", "review", "user" and "check-in", and is publicly available at https://www.yelp.com/dataset/challenge. The "business" part contains location, latitude-longitude, averaged stars, opening hours, review counts and business categories. In "review", each review is composed of a textual comment, stars, the business, the user, and the feedback given by other users. In "user", the personal information associated with each user is given, including the user's social network, starting time and elite experience in the Yelp community; users' behaviors, such as votes and stars, are also provided. In "check-in", the counts of check-ins at each business are provided. We implement the proposed method based on the "review" and "user" parts. The "review" part contains reviews with star ratings from 1 to 5. Due to the class imbalance in the "stars" of reviews, we encode "1" and "2" as 1, "3" as 2, and "4" and "5" as 3. In the pre-processing step, all capitalized words are converted into lower case using the nltk package in Python, stop words and punctuation are removed, and frequency vectors are constructed for each review under the bag-of-words framework against a sentiment lexicon consisting of about 6,800 positive and negative words [33], combined with 1,000 1-gram features extracted based on term frequency-inverse document frequency (TF-IDF). The "user" part provides the personal social network, number of fans, counts of "useful", "cool" and "funny" votes, and elite experience. In particular, elite experience indicates the years in which the user was selected as elite for well-written reviews, high-quality tips, or a detailed personal profile.
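The pre-processing pipeline can be sketched with plain string operations (a simplified stand-in for the nltk-based pipeline; the stop-word set and lexicon here are tiny illustrative samples, not the actual ~6,800-word lexicon):

```python
import re
import numpy as np

# toy stand-ins for the real resources used in the paper
STOP_WORDS = {"the", "a", "an", "to", "is", "was", "and", "not"}
LEXICON = ["good", "bad", "great", "awful", "rude", "excited"]
STAR_TO_LABEL = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}   # the encoding described above

def preprocess(review):
    # lower-case, strip punctuation, drop stop words
    tokens = re.findall(r"[a-z']+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def frequency_vector(review, lexicon=LEXICON):
    # B(t): counts of lexicon words in the cleaned review
    tokens = preprocess(review)
    return np.array([tokens.count(w) for w in lexicon], dtype=float)

v = frequency_vector("The service was RUDE and awful, not great!")
assert v.tolist() == [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
assert STAR_TO_LABEL[2] == 1 and STAR_TO_LABEL[3] == 2 and STAR_TO_LABEL[5] == 3
```

Each review is thereby reduced to a sparse count vector over the lexicon, which is exactly the B(t) fed into the sentiment functions.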
Furthermore, elite users are characterized by pertinent comments and useful tips, with which other users resonate by casting "useful", "funny" and "cool" votes. About 3.25% of the users in the Yelp community have elite experience.
One salient difference between elite and non-elite users is their preference of wording, in particular the frequencies of sentiment words in reviews. For instance, as shown in Figure 1, "reputable", "diligence" and "abnormal" are used much more frequently by non-elite users, whereas "slut" and "catchy" are much more popular among elite users. Also, surprising as it appears, non-elite users tend to use "reputable" in an ironic way to indicate that the service they received did not live up to their expectations, as in "A reputable apartment would try to fix their errors" and "I'm spending my money on an experience from a reputable Salon that has been nothing but rude and unhelpful". In sharp contrast, elite users use "reputable" as a positive comment, leading to 3.34 stars on average among their reviews containing "reputable", compared with only 2.00 among non-elite reviews.
Another interesting difference between elite and non-elite users is their preference in giving "stars". As seen in Figure 2, the distribution of averaged stars given by elite users appears approximately normal, whereas the stars given by non-elite users appear more dispersed, in that they tend to give 1-star or 5-star reviews. Furthermore, the numbers of feedbacks, including "useful", "cool" and "funny", tend to provide useful information about users. Users with a large number of feedbacks are popular for their objective comments, interesting expressions or humorous reviews. In fact, these three covariates are roughly proportional to one another, and hence we only include "useful" in the application, converted to a binary covariate indicating whether the "useful" count of the user is in the top 10 percent.
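The conversion of a count covariate into a binary top-10% indicator can be sketched as follows (the counts here are simulated for illustration; the real covariate is each user's "useful" vote count):

```python
import numpy as np

rng = np.random.default_rng(5)
useful_counts = rng.poisson(3.0, size=1000)      # hypothetical per-user "useful" counts

threshold = np.percentile(useful_counts, 90)     # 90th percentile of the counts
is_top10 = (useful_counts > threshold).astype(int)  # 1 = top 10% of users, 0 = otherwise

# the resulting binary covariate flags at most roughly 10% of users
# (ties at the threshold can make the flagged fraction somewhat smaller)
assert 0.0 <= is_top10.mean() <= 0.15
```

The binary covariate then enters the model exactly like any other categorical covariate, with its own adjusting vector and intercepts.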
In this numerical experiment, 100,000 reviews are sampled from the Yelp challenge dataset, which are then split into three equal-sized sets used for training, validation and testing. The averaged test errors and their standard errors over 100 replications are reported in Table 1. To further evaluate the effectiveness of the proposed method in sentiment analysis, we also include the word embeddings trained by Embeddings from Language Models (ELMo) for comparison [34], which generate word embeddings with contextual information. Specifically, each review is converted to averaged word embeddings of length 1,024 obtained from ELMo based on the 1 Billion Word Benchmark [35]. Table 1 shows that the proposed method is able to improve the performance of Google, Wiki and Random by incorporating covariates, with improvements ranging from 5.7% to 21.3%. Even though the word embeddings by ELMo appear to be more accurate than those trained by Word2Vec and GloVe, the proposed method yields the best performance with the word embeddings based on Google news, showing that the proposed method is capable of learning covariate-varying word embeddings to improve prediction accuracy. Interestingly, when covariates are used to adjust random embeddings, which make no use of semantic information at all, the improvement is much more substantial, and the performance of Random_p is almost comparable to Google_p and Wiki_p, showing that the proposed method is capable of training a word embedding adaptive to the prediction task. To verify the significance of the improvement, we further conduct t-tests between the proposed method on the three word embeddings and their corresponding baselines, as well as between the best performer and ELMo. As shown in Table 2, the improvements over the three baseline word embeddings are statistically significant, showing that a uniform word embedding in sentiment analysis may lead to sub-optimal performance and hence calls for adjusting effects from covariates.
Additionally, Google_p also outperforms ELMo with a statistically significant improvement, suggesting that the proposed framework is competitive in sentiment analysis.

Simulations
We further verify the effectiveness of the proposed method under the assumed model in (2.4) with various numbers of covariates and degrees of covariate effect.
The simulated examples are generated as follows. We first choose 100 words from the sentiment lexicon [33] and obtain the corresponding word embeddings E based on GoogleNews. Then we generate w_j = (1_{100−r}^T, w̃_j^T)^T; j = 1, ..., m, where w̃_j ∈ R^r has each component generated from Unif(−1, 1) and r adjusts the degree of the covariate effect. Then we generate (x_i, b_i); i = 1, ..., n, where b_i ∈ R^100 is a vector of word frequencies with each element generated independently from Pois(1), and x_i is uniformly chosen from {1, ..., m}. The sentiment level y_i is generated via y_i = max{k : β^T E(w_{x_i} ∘ b_i) + β_{0,k−1} ≥ 0}, with the convention β_{0,0} = +∞, where β is generated from N(0, I_300) and β_0 is set to generate K equal-sized classes.
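The generation scheme above can be sketched as follows. A random matrix stands in for the GoogleNews embeddings, and the choice of β_0 "to generate K equal-sized classes" is implemented here, as one plausible reading, by thresholding the scores at their empirical quantiles:

```python
import numpy as np

rng = np.random.default_rng(4)
d, p, K, m, r, n = 100, 300, 5, 4, 40, 2000  # the (n, m, K) = (2000, 4, 5), r = 40 case
E = rng.normal(size=(p, d))                  # stand-in for the GoogleNews embeddings

# covariate effects: first 100 - r entries fixed at 1, last r entries Unif(-1, 1)
W = np.hstack([np.ones((m, d - r)), rng.uniform(-1, 1, size=(m, r))])

beta = rng.normal(size=p)                    # beta ~ N(0, I_300)
Bmat = rng.poisson(1.0, size=(n, d)).astype(float)  # word frequencies, Pois(1)
x = rng.integers(0, m, size=n)               # covariate level, uniform on the m levels

score = (Bmat * W[x]) @ E.T @ beta           # beta^T E (w_{x_i} o b_i)
# beta_0 chosen at empirical quantiles so the K classes are (roughly) equal-sized
cuts = np.quantile(score, np.linspace(0, 1, K + 1)[1:-1])
y = 1 + np.searchsorted(cuts, score)         # y_i in {1, ..., K}

assert y.min() >= 1 and y.max() <= K
assert all(abs(np.mean(y == k) - 1 / K) < 0.05 for k in range(1, K + 1))
```

Larger r makes the rows of W more dissimilar across covariate levels, which is exactly the regime where a single uniform embedding is expected to fall behind.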
Under this data generation scheme, we consider cases with (n, m, K) = (2000, 2, 5), (2000, 4, 5), (4000, 2, 5) and (4000, 4, 5), and r = 20, 40, 60, 80, respectively. In each case, we split the dataset into training, validation and test sets with ratio 1:1:1. The averaged test errors of the various methods over 100 replications are summarized in Table 3. As shown in Table 3, the proposed method outperforms its baseline embedding method under all settings, showing that it is capable of enhancing sentiment prediction by incorporating covariate effects into the word embeddings. The advantage becomes more substantial as r gets larger and the w_j's become more different, showing that employing a homogeneous word embedding may yield poor performance when the embeddings vary with covariates.
To verify the efficiency of the proposed method, we examine the computing time of the three sub-optimization tasks, where the sample size increases from 1,000 to 10,000, or the dictionary size increases from 100 to 500. The averaged computing times over 50 replications of the three sub-optimization tasks under all settings are reported in Figure 3.
As shown in Figure 3, the averaged computing times of the three optimization tasks are all linearly proportional to the sample size. This is due to the fact that the optimization tasks for β and W resemble a linear SVM except for the varying intercepts, and hence can be solved efficiently by a dual coordinate descent algorithm as in Liblinear [28]. Moreover, the computational time for updating W also depends on the size of the dictionary, whereas those for updating β and β_0 appear to be less affected.

Summary
This article proposes a flexible framework for covariate-assisted sentiment analysis, incorporating covariates into word embeddings to improve prediction accuracy. Specifically, the proposed method admits document representations that vary with covariate information, such as gender, education level, and so on. This is equivalent to admitting sentiment functions that vary over the levels of the covariates, which endows the proposed method with the ability to capture distinctions in wording and sentiment derived from covariates. Additionally, we propose a scalable block-wise coordinate descent algorithm to solve the resultant optimization task. We also establish the asymptotic properties of the proposed method, providing a theoretical guarantee of its convergence to the ideal sentiment function. Note that even though the proposed method is formulated under the ordinal regression framework, the key idea of integrating covariates can be employed in other models.
Next we verify (5.1) only for β_{0,k}(x) ≥ C_1(τ) + T + 1, as the other case can be verified similarly, where both inequalities follow from the non-increasing property of V_T(·). When sgn(y − k) = −1, the desired result then follows.
By the definition of F k (τ ), β is bounded by J * τ . It suffices to bound w x 2 and w x −w x 2 respectively. Forw x , by inequality of arithmetic and geometric means, we have Similarly for w x −w x 2 , for any integer 1 ≤ m ≤ L, we have where the last inequality follows by applying similar steps iteratively. This completes the proof.
Proof of Theorem 1. By Assumption B, it suffices to bound P(e_{V_T}(f̂, f_0) ≥ δ_n^2). Since f̂ is a global minimizer of (2.7) of the manuscript, it yields a basic inequality in terms of Ṽ_T(y, f(x, t)) = V̄_T(y, f(x, t)) + λJ(β) + λJ(W). Next we define a scaled empirical process based on Ṽ_T(y, f(x, t)).