Sentiment Classification of Time-Sync Comments: A Semi-Supervised Hierarchical Deep Learning Method

Time-sync comment (TSC) has emerged as a new type of textual comment for real-time user interactions on online video platforms. The sentiment classification of TSCs provides considerable potential for platforms to optimize operation strategies but inevitably faces great challenges due to the TSCs’ often uninformative and informal text. Considering the contextual dependency among TSCs posted within the same video clip, this study posits that contextual TSCs may benefit the sentiment classification of a target TSC. To address the challenges of leveraging contextual TSCs, such as their semantic representation and fusion, we propose a semi-supervised hierarchical deep learning method for the sentiment classification of TSCs. We design a hierarchical architecture to capture the semantics of TSCs at the word, comment, and context levels. Considering the varying importance of words and comments, we also design attention mechanisms to focus on important sentiment information and fuse semantic representations. Empirical evaluation shows that the proposed method outperforms benchmarked sentiment classification methods. This study advances our knowledge of contextual information indicative of TSC sentiment, and contributes to improving the service operation of online video platforms.


Introduction
With the proliferation of the internet and mobile devices, digital video markets and online video platforms have witnessed rapid growth (Chakraborty et al., 2021; Wu & Chiu, 2023). Time-sync comment (TSC), also known as danmaku, is a product of these online video platforms (Li & Guo, 2021b; Xu & Zhang, 2017). This new comment type is the latest innovation in the rapid progression of features that marketing analytics must adapt to. TSC provides a real-time interaction mechanism that allows video viewers to express their ideas and emotions about specific video content, with the posted TSCs appearing immediately alongside the video (Zhou et al., 2019). Compared to other evaluative information such as a satisfaction rating that reflects viewers' evaluation of the whole video, TSCs reflect viewers' attitudes toward certain video clips (i.e., certain sections of a video), and exist as more fine-grained feedback information. Hence, the sentiment classification of TSCs is crucial for refining the operation management of online video platforms, and thus creating added value for various stakeholders. For video viewers, identifying the sentiment of TSCs can help online video platforms understand viewer preferences and implement personalized video clip recommendations, which are critical to improving user experience (Jiang et al., 2020). Viewers are directly pushed to potentially interesting video clips and can quickly find their favorite people and scenes without watching the entire video. For video creators, identifying the TSC sentiments allows them to better understand current market demands, which is a key strategy for content providers (Chong et al., 2016). In a video, viewers' sentiments toward video clips are important guidance as to which clips may be outdated and need to be improved, and which styles should be continued. Moreover, viewer sentiment acts as a quality indicator of video clips and provides an effective foundation for subdividing the management and control of video quality (Tarí et al., 2007). Video platforms can introduce flexible incentive measures to motivate excellent video creators based on video clip quality. The video platform itself can benefit by implementing membership or fee systems for certain high-quality video clips instead of the entire video, which could attract more viewers and expand profitability.
Given its distinctive characteristics, the sentiment classification of TSCs is nontrivial and even more difficult than that of other types of online comments. First, TSCs are generally posted by viewers quickly, and thus the text length is much shorter than that of traditional comments, which means that the semantic information from a certain TSC is limited or even ambiguous (He et al., 2018). Second, TSCs usually contain popular phrases and Internet slang related to particular fields, and this informal expression makes semantic representation intractable (Chen et al., 2022). For example, Figure 1 gives one video comment and three TSCs from the same online video. The figure shows that, compared with the video comment, each TSC contains insufficient semantic information to understand the sentiment. As a comment on real-time video content, the TSC content is highly correlated to the content of the corresponding video clip (Yang et al., 2019a). In this case, TSCs posted in the same period of time (i.e., neighboring TSCs) comment on similar objects, and thus may contain similar sentiment polarities. Moreover, viewers who post TSCs can browse other TSCs synchronously with the video, and thus their opinions and emotions are inevitably influenced by the TSCs already posted (Liao et al., 2020). Conversely, a new TSC may affect subsequent TSCs. Under the influence of nearby TSCs, especially those that strike a responsive chord, viewers may form similar opinions and post a new TSC promptly in response to sympathetic ones. Take the following five successive TSCs from an online music variety show as an example:
TSC 1: Oh my God! The song is sure to be straight fire.
TSC 2: Alas, but the lyrics do not form a whole.
TSC 3: Awesome! The girls' singing sounds unexpected this time.
TSC 4: Yes! All the girls sing very well!
TSC 5: Why the lyrics are completely irrelevant!
Taking the third TSC as the target TSC, the other TSCs can be regarded as contextual TSCs. While the target TSC expresses strong sentiment regarding the aspect of singing using the words "awesome" and "unexpected," its sentiment polarity may be ambiguous; that is, it could express a positive sentiment in a normal manner or a negative sentiment in a sarcastic manner. Fortunately, the first and fourth TSCs talk about similar topics as the target TSC (i.e., the singing) and show positive sentiments. This background information provides clearer evidence to infer that the target TSC has a positive sentiment. In this regard, contextual TSCs have considerable potential to enhance the performance of sentiment classification of the target TSC as a type of auxiliary information.
However, the unique text characteristics and diverse posting scenarios of TSCs bring great challenges for using contextual TSCs, among which we identify two essential ones. First, incorporating contextual information requires not only extracting the semantics of each TSC, but also capturing complex contextual relationships among multiple neighboring TSCs. The challenge lies in capturing various syntactic and semantic relationships from different levels for TSC text representation. Second, with the particular and individual preferences of viewers, different viewers may post TSCs to comment on different objects in the same video clip, and thus the sentiments of contextual TSCs are not always the same as that of the target TSC. This scenario is reflected in the above example, in which the second and fifth TSCs discuss the lyrics and express negative sentiments. Therefore, the challenge also lies in focusing on useful contextual TSCs that enrich our understanding with complementary semantics, and suppressing irrelevant contextual TSCs.
Existing research provides rich insight into machine learning and deep learning methods for the sentiment classification of a TSC (Onan et al., 2016; Wang et al., 2021b). However, while the existing sentiment classification methods can be useful, they rarely specifically address TSCs. In addition, most existing methods focus on learning the features of target comment samples, without the ability to adaptively capture contextual TSC information, and cannot address the characteristics of TSCs (Chen et al., 2019).
To fill this gap, this study proposes a semi-supervised hierarchical deep learning method (called SHDL here) for the sentiment classification of TSCs. SHDL is an innovative operations research application in the marketing domain. Specifically, to capture contextual semantic information, we propose a hierarchical deep learning architecture for the semantic representation of TSCs at the word, comment, and context levels. The word-level and comment-level representations are developed to extract semantics from a single TSC, and the context-level representation is further developed to extract effective contextual semantics from multiple contextual TSCs. Moreover, considering the heterogeneous significance of different words and contextual TSCs in sentiment classification, we design two attention mechanisms for adaptive semantic fusion. The word-level attention mechanism is designed to focus on important words during the comment-level representation, and the comment-level attention mechanism focuses on key contextual information during the context-level representation.
We have evaluated the proposed method using a real TSC dataset collected from a major TSC-enhanced online video platform (i.e., Bilibili). We compared SHDL with eight representative sentiment classification methods from the families of both machine learning and deep learning. Empirical results show that SHDL significantly outperformed all benchmarked methods in terms of all performance metrics. The ablation study also shows that all design artifacts in SHDL have performance-enhancing effects on the sentiment classification of TSCs, and that capturing the temporal correlation among TSCs is particularly significant. Moreover, the representation performance analysis illustrates how SHDL improves the performance of TSC sentiment classification with the ability to effectively use contextual information.
The contributions of this study are fourfold. First, to the best of our knowledge, this is the first study that leverages contextual information (i.e., contextual TSCs) to classify the sentiment of a TSC. In contrast with the existing literature, which merely considers a single TSC sample, we use multiple samples and utilize contextual information in a semi-supervised way. Second, we propose a deep learning-based hierarchical representation architecture for generating the semantics of TSCs at the word, comment, and context levels, which can effectively capture heterogeneous semantics at different levels (i.e., a single word, a single TSC, and a group of neighboring TSCs). Compared to existing structures that extract semantic levels only within comments, we extend the extracted semantic levels beyond the comments. Third, we design two attention mechanisms to adaptively focus on important words and comments for the fusion of the semantic representation at different levels, and the weighting process of the two attention mechanisms is different. Fourth, the proposed TSC sentiment classification method provides significant practical implications for video platforms to refine their operation strategies. The accurate identification of TSC sentiment enables platforms to rapidly show viewers better-suited video clips, enables video creators to achieve more sophisticated video production, and can improve the profit model of the video platforms.
The remainder of this paper is organized as follows. In the next section, we summarize the recent literature on TSC research and sentiment classification. Then in Section 3, we provide a description of the modeling approach. We describe the empirical evaluation in Section 4 before presenting the results in Section 5. Finally, we conclude the study by summarizing our contributions and discussing future research directions in Section 6.

Time-sync comments in online videos
TSCs originated from the Japanese video website Niconico, which initially represented the Animation, Comic, and Game (ACG) subculture (Xi et al., 2021). In contrast with regular comments, TSCs synchronously moved over the ACG videos in the form of subtitles as soon as viewers posted them. This novel and timely comment mechanism became popular with ACG audiences and was quickly accepted and enjoyed by mainstream culture. At present, numerous worldwide mainstream video platforms support the TSC function, and the introduction of TSCs has had profound impacts on various stakeholders, including video viewers, video creators, and video platforms.
From the perspective of video viewers, TSCs serve as a communication medium between viewers and videos, which improves the degree of viewer involvement (Lv et al., 2019). Posting TSCs helps a viewer concentrate on a video and also improves their retention of the video content and sense of fulfillment after watching it (Li et al., 2021a). TSCs can complement the video content, and video viewers might see diverse TSCs, such as several copies of an alert like "dragons ahead" and explanations provided by the expert viewers, which can greatly improve the viewing experience. Meanwhile, existing TSCs affect the TSC-posting behavior of subsequent viewers, as explained by the "herding" effect wherein viewers who are inclined to post TSCs can be influenced by observed TSCs (He et al., 2018). Furthermore, viewers who often post TSCs to express their attitudes toward videos can enjoy more precise video clip push services through personalized video recommendation algorithms based on the sentiment analysis of their TSCs (Bai et al., 2021).
For video creators, TSCs in their videos allow them to observe viewer feedback and better cater to social hotspots and viewing demands in future videos. Considering that TSCs give timely feedback on video clips, which may be positive or negative, careful creators can absorb the opinions and then specifically improve the video production. Moreover, TSCs can intuitively benefit creators during live broadcasting by influencing the viewers' gifting behavior through stimulation and social density (Zhou et al., 2019).
For video platforms, TSC has brought about a new mechanism for video quality assessment, by which a series of operational strategies can be modified. For example, some unruly viewers may spoil the video content through TSCs, so platforms can use automatic detection technologies to block spoilers and penalize offenders (Yang et al., 2019a). Platforms can encourage and award excellent video creators according to video quality, and the awards can be fine-grained (i.e., awards given to specific clips of a video) with the help of approaches such as highlight detection (Liaw & Dai, 2020). Such auxiliary techniques using TSCs can help platforms upgrade their monetization strategies by implementing membership or fee systems for viewers to see certain high-quality video clips instead of whole videos.

Sentiment classification of online comments
The popularity of social media has broadened the number and variety of online channels in which Internet users can express themselves with comments. These comments usually contain abundant opinion and sentiment information, and are valuable for service providers to improve service quality (Meire et al., 2016; Xia et al., 2021). Since TSC is a new type of online comment, the sentiment classification of TSCs is an urgent matter to fully exploit its potential value. However, the existing research on the sentiment classification of TSCs is scant, which is an important motivation for this study. In this section, we expand the discussion of sentiment classification methods from TSC to online comments to provide a more comprehensive overview of existing methods. The task of classifying the sentiments of online comments can be considered as a text classification problem because of the involvement of several operations that ultimately classify a given piece of text to show either a positive or negative sentiment. The methods of this classification can be divided into two categories: lexicon-based and model-based sentiment classification.
Lexicon-based sentiment classification usually builds auxiliary lexical resources that link the words to corresponding sentiment polarities by scoring (Cruz et al., 2014). For example, Deng et al. proposed a method to adapt existing sentiment lexicons for domain-specific sentiment classification (Deng et al., 2017). However, given that several words have multiple meanings and senses, building a pervasive lexical resource is difficult. The lexical resources are usually constructed based on specific domains and scenarios, which largely limit the flexibility of applying lexicon-based methods (Han et al., 2020).
Model-based sentiment classification regards sentiment analysis as a classification problem. Training sets are first built by manually labeling a portion of the comment text, and features are then learned from the training data to construct classification models. Finally, the model obtained from the training period is used to classify the test data with its unknown sentiments. For example, Ghaddar and Naoum-Sawaya (2018) improved the high-dimensional online comment data classification efficiency of SVM, offering benefits for optimal decision-making. However, most of the above methods are based on the bag-of-words model, where words in the comments are independent and their significant sequential dependency is ignored (Tsai & Wang, 2017).
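The limitation of the bag-of-words model can be seen in a minimal sketch. The two sentences below are invented for illustration and are not taken from the dataset; they express different opinions about the singing and the lyrics but yield identical word counts, so any classifier that only sees the counts cannot tell them apart.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens, discarding word order."""
    return Counter(text.lower().split())


# Hypothetical comments: same words, different targets of the negation.
a = "the singing is good but the lyrics are not"
b = "the lyrics are good but the singing is not"

# True: both comments map to the same bag of words.
same = bag_of_words(a) == bag_of_words(b)
```

Sequence models such as the RNN and CNN variants discussed next keep the word order and can therefore separate the two cases.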
As a mainstream branch of machine learning with powerful representation ability, deep learning models with more complex structures can directly learn abstract features from comment text and implement end-to-end sentiment classification (Yang et al., 2022). A recurrent neural network (RNN) is widely used for text mining because of its advantages in capturing the sequential relationships of words owing to the recursive structure (Lai et al., 2015). In particular, two well-known variants of RNN, long short-term memory (LSTM) and gated recurrent unit (GRU), have attracted much attention for text data processing owing to their capacity to handle long series (Kratzwald et al., 2018; Kriebel & Stitz, 2022). Meanwhile, convolutional neural network (CNN) is another popular deep learning model that is used for text classification. With unique convolutional and pooling operations, CNN has relatively low computational costs while providing an excellent feature representation ability. In particular, TextCNN, proposed by Kim, learns the n-gram feature representation from sentences via 1D convolution, which has shown excellent performance in short-text classification (Kim, 2014). Moreover, CNN can be combined with LSTM to use the advantages of both the recurrent structure and convolutional neural models to improve sentiment classification results (Ankita et al., 2022). By combining the recurrent structure and max-pooling layer, Lai et al. (2015) proposed the recurrent convolutional neural network (RCNN), where a recurrent structure is applied to capture sequence information and a max-pooling layer is used to catch the key components in pieces of text. In addition, the attention mechanism in deep learning is currently a research hotspot for natural language processing tasks, which allows a focus on relevant information and ignores irrelevant information with efficient computing (Wang et al., 2021b). For instance, Xu et al. (2022) combined an attention-based model and transfer learning to enhance the performance of aspect-level sentiment classification, where the attention mechanism was developed to extract important features from the sequence according to their weight distributions.

Research gaps
As summarized in Table 1, previous studies have pointed out diverse representative methods for the sentiment classification of online comments, including machine learning and deep learning methods. While existing sentiment classification methods have used semantic representations and attention mechanisms, most of them focus on semantic representation and attention at the word and comment levels. It is important to clarify that in our study, the term "context" specifically refers to the TSCs that appear in proximity to a given target TSC within the corpus. Although in other methods, such as Word2Vec, "context" may refer to the nearby words used for model training, we do not adopt this definition in our study. Instead, we focus on the TSCs surrounding the target TSC as our reference for context analysis. As discussed earlier, contextual TSCs have considerable potential to enhance the performance of the sentiment classification of the target TSC, but how to identify and extract effective semantic information from contextual TSCs is still an open and challenging topic. We strive to bridge this research gap by proposing a TSC sentiment classification method (SHDL) based on semi-supervised hierarchical deep learning.

Proposed sentiment classification method
To address the sentiment classification of TSCs, we propose a semi-supervised hierarchical deep learning method named SHDL, whose overall framework is illustrated in Figure 2. The method consists of word-, comment-, and context-level representations, and sentiment polarity classification. In the word-level representation, we augment the target TSC with its contextual TSCs and generate their word embeddings in parallel. In the comment-level representation, we develop concurrent CNN branches with word-level attention to learn the representation vector for each TSC. In the context-level representation, we develop a bidirectional GRU (BiGRU) with comment-level attention to learn the representation vector for the TSC context. Finally, the obtained context vector representation is used to identify the sentiment of the target TSC. The proposed method is an innovative application of operations research in the marketing domain.
The distinctive characteristics of TSCs, including their brevity and informal expression, pose a critical challenge for sentiment analysis; that is, TSCs have insufficient semantic information. The idea behind the proposed method is to augment the semantic representation of the target TSC by incorporating context information from the surrounding TSCs, and accordingly improve the performance of the sentiment analysis of TSCs. Specifically, we propose a semi-supervised hierarchical learning framework to use contextual TSCs and learn the multilevel semantics of the TSC context. Moreover, we design two attention mechanisms to focus on important information for adaptive semantic representation fusion. Those design artifacts help the proposed method obtain additional contextual information for understanding the target TSC and address the problem of insufficient semantic information of one single TSC.
It should be noted that our proposed method introduces new elements to the marketing analytics domain, namely the semi-supervised hierarchical learning framework and two attention mechanisms. These novel components enhance the existing sentiment analysis methodologies in the field. However, certain elements such as word embedding, CNN, and BiGRU are derived from established methods (Mikolov et al., 2013; Kim, 2014; Cho et al., 2014). The incorporation of these existing elements, alongside the introduction of new components, forms the foundation of our proposed method.

In the word-level representation, the unlabeled contextual TSCs are combined with the labeled target TSC for cooperative optimization in a semi-supervised learning process. The contextual TSCs and target TSC cooperate to complete the forward propagation and jointly obtain the feature embedding that represents the entire input for the sentiment classification of the target TSC. The classification error is then calculated from the predicted value and the label of the target TSC. Through this semi-supervised learning approach, extensive unlabeled data can be used for modeling, and the huge workload caused by labeling massive amounts of data is alleviated.
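The sample augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_context_samples`, the symmetric window of `n` neighbors on each side, and the empty-string placeholder used at video boundaries are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

PAD = ""  # hypothetical placeholder for a missing neighbor at a video boundary


def build_context_samples(
    comments: List[str],
    labels: List[Optional[int]],
    n: int,
) -> List[Tuple[List[str], int]]:
    """Return one (window, label) pair per labeled target TSC.

    comments: TSCs of one video, sorted by playback time.
    labels:   sentiment label per TSC, or None when the TSC is unlabeled.
    n:        number of neighbors taken on each side (assumed symmetric).
    """
    samples = []
    for t, label in enumerate(labels):
        if label is None:
            continue  # unlabeled TSCs only ever serve as context
        window = [
            comments[i] if 0 <= i < len(comments) else PAD
            for i in range(t - n, t + n + 1)
        ]
        samples.append((window, label))
    return samples


tscs = [
    "Oh my God! The song is sure to be straight fire.",
    "Alas, but the lyrics do not form a whole.",
    "Awesome! The girls' singing sounds unexpected this time.",
    "Yes! All the girls sing very well!",
    "Why the lyrics are completely irrelevant!",
]
labels = [None, None, 1, None, None]  # only the third TSC is labeled (1 = positive)
samples = build_context_samples(tscs, labels, n=2)
```

Each sample carries a single label, that of the target TSC in the middle of the window, so the loss is computed on the target while the unlabeled neighbors contribute only through the forward pass.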
After the sample augmentation, the TSCs are all input into the word embedding model to generate the word vectors that map words onto a real-valued vector space. Because TSC text has the characteristics of domain relevance and contains specific expressions, directly using pre-trained word embedding models trained on existing corpora to generate word embeddings is not applicable. There are two potential solutions to address this issue: one is to use a TSC corpus to train conventional word embedding models (such as Word2Vec (Mikolov et al., 2013)), and the other is to use TSC data for fine-tuning on large-scale pre-trained language models (such as BERT (Kenton et al., 2019)). Considering the significant domain differences that exist between TSC data and a pre-trained corpus, as well as the potential computational challenges associated with fine-tuning large-scale pre-trained models, we opted for the former method, in which we train the Word2Vec model using our collected TSC corpus. Word2Vec uses local semantic relationships and is relatively easier to train, but it ignores the relationships between words inside the local window and those outside it, which can be addressed in the subsequent structure of our method. For the training algorithm, we choose the skip-gram of Word2Vec, which uses a central word as the input of a classifier with a continuous projection layer to predict the words in a certain range before and after it (Mikolov et al., 2013).
Given a sequence of training words $w_1, w_2, w_3, \ldots, w_T$ and window size $k$, the skip-gram model aims to maximize the probabilities of generating all background words for any central word by minimizing the following loss function:

$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-k \le j \le k,\ j \ne 0} \log p\left(w_{t+j} \mid w_t\right) \tag{1}$$

Given that TSCs contain numerous words in special fields and Internet slang, we construct a specific dictionary for segmenting the TSC text with a word segmentation tool. Specifically, we enhance the functionality of the Jieba word segmentation tool by supplementing its built-in dictionary with 372 domain-specific words. These additional words are curated based on the collected data, including names of individuals, unique appellations, and Internet catchphrases. Following the expansion of the dictionary, we employ Jieba for word segmentation, using its enhanced capabilities for our analysis.

Then, the segmented text is used to train word embeddings with Word2Vec. The length of the word sequence is fixed, and positions without words are padded with zero. In this way, the word-level representation vectors of TSCs are obtained, and the word vectors of the tth TSC context sample can be denoted as $X_t = [x_{t-n}, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_{t+n}]$, where $n$ is the number of contextual TSCs taken on each side of the target TSC, $x_t = [w_{t,1}, w_{t,2}, w_{t,3}, \ldots, w_{t,L}] \in \mathbb{R}^{L \times S}$, L is the fixed length of the word sequence, and S is the dimension size of the word vectors.
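The two preprocessing steps above, generating skip-gram training pairs within a window of size k and fixing the sequence length by padding, can be sketched in plain Python. The helper names `skipgram_pairs` and `pad_ids` are hypothetical, and the sketch pads integer word indices with a reserved index 0 (assumed to map to the zero vector) rather than padding the vectors themselves.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], k: int) -> List[Tuple[str, str]]:
    """Return (central word, background word) pairs for window size k."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-k, k + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs


def pad_ids(ids: List[int], length: int, pad_id: int = 0) -> List[int]:
    """Truncate or right-pad a list of word indices to a fixed length."""
    return (ids + [pad_id] * length)[:length]


tokens = ["the", "girls", "sing", "very", "well"]  # an already segmented TSC
pairs = skipgram_pairs(tokens, k=1)
padded = pad_ids([4, 7, 2], length=5)
```

Each pair contributes one term $\log p(w_{t+j} \mid w_t)$ to the loss in Equation (1). In practice a library such as gensim would handle the pair generation and training internally; the sketch only makes the windowing explicit.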

Comment-level representation
Given the fast response time of those posting TSCs, the TSC text length is typically short. A TSC is a concise sentence composed of several words and thus can be suitably addressed by CNN, a widely used deep learning model for short text (Chen et al., 2019; Kim, 2014). CNN has powerful modeling capabilities in automatically extracting abstract feature representations from text data with fewer parameters. Moreover, with the convolution filters that are applied to local features, CNN can capture short-distance dependencies in comments. However, ordinary CNN treats every word equally and ignores their different values. In TSC text, words in different positions have different effects on the TSC sentiment; several keywords may play a decisive role, while others may matter little. For instance, emotional adjectives and evaluative words are highly related to the sentiment, while words such as a character's name, personal pronouns, and common conjunctions hardly have an influence. Therefore, we develop a CNN with word-level attention to learn comment-level representation vectors for each TSC, where the designed attention can assess the importance of different words and fuse their representations in different positions.

Figure 3 illustrates this structure. The CNN applies convolution kernels of G different heights, where G is the number of scales. The kernel heights signify the sizes of the word windows. For example, a kernel height of 1 represents the situation in which the convolution operation maps features for the current word, and is a necessary value because several TSCs contain only one word. A kernel height of 3 means that the convolution operation maps features for the current word and the previous two words. The kernels with different scales carry out the convolution operation in parallel, and multiple kernels are used for each scale to capture rich semantic features. In this paper, we use three scales, with sizes of 1, 2, and 3. Meanwhile, padding is used to ensure the consistency of feature dimensions from different kernel sizes. For the word vectors $x_t$, the calculation of CNN can be expressed as follows:

$$f_{t,g} = \mathrm{LeakyReLU}\left(W_g * x_t + b_g\right) \tag{2}$$

where $*$ represents the convolution operation, $W_g$ and $b_g$ are the convolution parameters and bias of the gth scale, respectively, and $f_{t,g}$ is the obtained features. LeakyReLU is the nonlinear activation function used to filter useful information.
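The multi-scale convolution described above can be sketched with NumPy. This is an illustrative sketch under stated assumptions rather than the authors' code: the left zero-padding of (height - 1) rows follows the description that a kernel covers the current word and the previous words, while the LeakyReLU slope of 0.01, the number of kernels per scale (K = 4), and all function names are assumptions.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    """LeakyReLU with an assumed negative slope of 0.01."""
    return np.where(x > 0, x, slope * x)


def conv_one_scale(x, W, b):
    """Convolve word vectors x (L, S) with kernels W (h, S, K); returns (L, K)."""
    h = W.shape[0]
    L = x.shape[0]
    # Left-pad with h - 1 zero rows so each window covers the current word and
    # the h - 1 previous words, and every scale returns exactly L rows.
    padded = np.vstack([np.zeros((h - 1, x.shape[1])), x])
    out = np.empty((L, W.shape[2]))
    for l in range(L):
        window = padded[l:l + h]  # shape (h, S)
        out[l] = np.tensordot(window, W, axes=([0, 1], [0, 1])) + b
    return leaky_relu(out)


def multi_scale_cnn(x, params):
    """Concatenate the features of all scales along the channel axis."""
    return np.concatenate([conv_one_scale(x, W, b) for W, b in params], axis=1)


rng = np.random.default_rng(0)
L, S, K = 6, 8, 4  # words per TSC, embedding size, kernels per scale
x = rng.normal(size=(L, S))
params = [(rng.normal(size=(h, S, K)) * 0.1, np.zeros(K)) for h in (1, 2, 3)]
F = multi_scale_cnn(x, params)  # shape (L, 3 * K)
```

Because every scale yields L rows, the outputs can be concatenated into a single matrix with one row per word, which is the form the word-level attention expects.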
The feature vectors obtained by CNN with multiple scales and channels are concatenated to form the feature matrix $F \in \mathbb{R}^{L \times C}$, where C is the total number of feature channels, each containing semantic features of all words in a TSC. Each word has multiple channels for feature representation. To fuse the features of different words and obtain the representation vector of the entire TSC, the feature matrix F is input to the word-level attention mechanism, with the goal of determining which words need promotion or suppression. The word-level attention mechanism first extracts the global information of each TSC, and then conducts a "squeeze and excitation" operation to obtain the weight of each word. The weights directly work on the word representation for feature fusion, and the weights will be automatically updated in the training process of SHDL. A word with richer semantic information will adaptively get a larger weight; thus, the word-level attention mechanism highlights important emotional words that determine the semantics of the corresponding TSC. Specifically, since we pay attention to the importance of words, global average pooling is first performed along the channel axis to obtain global information of all words, which can be formalized as follows:

$$v_l = \frac{1}{C}\sum_{c=1}^{C} f_l^{c} \tag{3}$$

where l represents the index of words, $f_l^{c}$ represents the feature point of the lth word of the cth channel of the feature matrix, and $v_l$ represents the squeezed point of the lth word. Then, the squeezed points of all words are concatenated to obtain a word descriptor $V \in \mathbb{R}^{L \times 1}$ that contains global information of a TSC.
Based on the word descriptor V, two fully connected layers are used as the excitation operation to generate the word attention weights (Equation 4). The weights are then applied to the word features to fuse them into the representation vector for the corresponding TSC (Equation 5). Thus, the word-level attention mechanism determines where to highlight in the word-level feature vectors and adaptively fuses word representations according to their informativeness, and the comment-level vector representation of each TSC is obtained.
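The squeeze and excitation steps of the word-level attention can be sketched as below. Only the squeeze step (the channel-wise average of Equation 3) is taken directly from the text. The rest is assumed for illustration, following the common squeeze-and-excitation pattern: a ReLU after the first fully connected layer, a sigmoid after the second, a reduction ratio `r`, and fusion by a weighted sum over words. The actual activations and fusion rule of the method may differ.

```python
import numpy as np


def word_attention(F, W1, b1, W2, b2):
    """Return (word weights of shape (L,), fused TSC vector of shape (C,))."""
    v = F.mean(axis=1)                             # squeeze over channels: (L,)
    hidden = np.maximum(0.0, W1 @ v + b1)          # assumed ReLU excitation layer
    a = 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))  # assumed sigmoid gate: (L,)
    fused = (a[:, None] * F).sum(axis=0)           # assumed weighted sum over words
    return a, fused


rng = np.random.default_rng(0)
L, C, r = 6, 12, 2  # words, channels, assumed reduction ratio
F = rng.normal(size=(L, C))
W1, b1 = rng.normal(size=(L // r, L)) * 0.1, np.zeros(L // r)
W2, b2 = rng.normal(size=(L, L // r)) * 0.1, np.zeros(L)
a, fused = word_attention(F, W1, b1, W2, b2)
```

Since the weights are produced from the word descriptor of each TSC, they change from comment to comment and are trained jointly with the rest of the network.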

Context-level representation
Through the above comment-level representation, the obtained vector representation of each TSC in  Considering the temporal correlations among TSCs, RNN and its variants LSTM and GRU, which can memorize historical information, are typically used for temporal data.However, standard RNN may face the problems of gradient vanishing or gradient explosion (Chung et al., 2014).Compared with LSTM, GRU, which introduces the gate mechanism, solves gradient problems with fewer parameters (Chung et al., 2014).
In addition, BiGRU with its bidirectional structure can learn both the forward and backward dependencies of contextual TSCs.Hence, as shown in Figure 4, we employ BiGRU to capture the temporal correlation of contextual TSCs.Taking a sequence of obtained comment-level vector representations as input, BiGRU is used to obtain the hidden state features for each comment-level representation.The conversion functions of the GRU cell are as follows: = (    +   ℎ −1 +   ) (7) where   and   are the update and reset gates, respectively, which are responsible for controlling the selective flow of information.  ,   ,   ,   ,  ℎ , and  ℎ are the weight parameters;   ,   , and  ℎ are the biases;  is the Sigmoid function; ℎ ̃ represents the candidate state of the tth TSC; and ℎ  represents the final hidden state of a GRU cell.Based on the GRU cell, BiGRU carries out a bidirectional calculation and can capture the contextual dependencies among TSCs from both directions.For the tth TSC, the representation vectors are simultaneously input into the forward and backward GRUs, and the forward ℎ ⃗  and backward ℎ ⃖⃗  semantic features are obtained, respectively, which are concatenated as the output.
The concatenated feature vector of the tth TSC can be presented as follows:
$$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$$
Through the temporal feature representation by the above-described BiGRU, the obtained representation sequence of the TSC context is denoted as $H = [h_{t-k}, \ldots, h_{t-1}, h_t, h_{t+1}, \ldots, h_{t+k}]$, where $k$ denotes the number of contextual TSCs taken on each side of the target TSC. Among the contextual TSCs used, the ones with higher semantic similarity to the target TSC contribute more to identifying the target sentiment, while the irrelevant ones are useless and may even become interference.
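The GRU update and the bidirectional concatenation described above can be illustrated with a minimal pure-Python sketch. It assumes the gate formulation of Chung et al. (2014); the helper names (`gru_step`, `run_gru`, `bigru`, `init_params`), the toy dimensions, and the random weights are hypothetical and are not taken from the SHDL implementation.

```python
import math
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(p, c_t, h_prev):
    """One GRU step: update gate z, reset gate r, candidate state, new state."""
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wz"], c_t), matvec(p["Uz"], h_prev), p["bz"])]
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wr"], c_t), matvec(p["Ur"], h_prev), p["br"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b + c) for a, b, c in
            zip(matvec(p["Wh"], c_t), matvec(p["Uh"], gated), p["bh"])]
    # Interpolate between the previous state and the candidate state.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

def run_gru(p, seq, hidden):
    """Run a GRU over a sequence, starting from a zero state."""
    h = [0.0] * hidden
    states = []
    for c_t in seq:
        h = gru_step(p, c_t, h)
        states.append(h)
    return states

def bigru(p_fwd, p_bwd, seq, hidden):
    """Concatenate forward and backward hidden states for every time step."""
    fwd = run_gru(p_fwd, seq, hidden)
    bwd = run_gru(p_bwd, seq[::-1], hidden)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

def init_params(rng, in_dim, hidden):
    """Toy random weights and zero biases (illustrative only)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for g in "zrh":
        p["W" + g] = mat(hidden, in_dim)
        p["U" + g] = mat(hidden, hidden)
        p["b" + g] = [0.0] * hidden
    return p

rng = random.Random(0)
IN_DIM, HIDDEN = 4, 3
# Seven comment-level vectors: three former TSCs, the target, three latter TSCs.
context = [[rng.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(7)]
states = bigru(init_params(rng, IN_DIM, HIDDEN),
               init_params(rng, IN_DIM, HIDDEN), context, HIDDEN)
```

Each element of `states` has twice the hidden size because the forward and backward states are concatenated. Note that some libraries swap the roles of $z_t$ and $1 - z_t$ in the final interpolation; the two conventions are equivalent up to relabeling the gate.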
To distinguish the contributions of different contextual TSCs, comment-level attention is applied on the contextual feature representations. The comment-level attention aims to analyze the semantic correlation between the target and contextual TSCs, and assigns weights according to their semantics. The key of the attention mechanism is transformed from the comment-level contextual features by a dense layer with its own weight and bias. Considering that the target TSC must be dominant to identify its sentiment, we distinguish the target TSC by defining the query of the developed comment-level attention as the state vector of BiGRU in the time step of the target TSC, $\bar{h}_t$.
Thus, an attention weight $\alpha_i$ is calculated for the ith TSC in the augmented sample from its key and the query. In this way, we can compute the similarity of the target and its contextual TSCs, and a more relevant TSC is assigned a larger weight.
Consequently, if the semantics of all contextual TSCs are mainly consistent with that of the target TSC, the target TSC tends to be assigned a smaller weight. Conversely, if the semantics of most contextual TSCs are uncorrelated to the target TSC, the weight of the latter becomes larger. Then, the attention weights are applied to the contextual representation sequence to obtain the fused representation vector. Through the comment-level attention mechanism, we take advantage of the contextual dependency of TSC data by strengthening the role of relevant contextual TSCs and reducing the role of irrelevant ones. In addition, the contextual semantics are adaptively fused. The ultimately fused context vector representation covers the contextual semantics of the whole augmented TSC sample.
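The weighting behavior described above can be illustrated with a small sketch. The exact scoring function is not recoverable from this text, so the sketch assumes dot-product similarity between each key and the query, softmax normalization, and a weighted sum for fusion; the function name `comment_attention` and the optional `key_fn` (a stand-in for the dense layer) are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def comment_attention(states, target_idx, key_fn=None):
    """Weight each BiGRU state by its similarity to the target state and fuse.

    Assumptions (not confirmed by the source): dot-product scoring,
    softmax normalization, weighted-sum fusion.
    """
    keys = states if key_fn is None else [key_fn(h) for h in states]
    query = states[target_idx]
    scores = [sum(k * q for k, q in zip(key, query)) for key in keys]
    weights = softmax(scores)
    dim = len(states[0])
    fused = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return weights, fused

# Context fully consistent with the target (index 1) vs. mostly unrelated context.
consistent = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mixed = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
w_consistent, _ = comment_attention(consistent, target_idx=1)
w_mixed, fused_mixed = comment_attention(mixed, target_idx=1)
```

Under these assumptions the target receives a weight of one third when its neighbors agree with it, and a larger weight when they do not, because normalized weights must sum to one. This matches the qualitative behavior described in the text.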
Finally, the obtained context representation is used to output the sentiment probability by a fully connected layer and activation function.

Objective function
Among viewers who watch online videos and post TSCs, many are interested in the video content or already love the people or objects of the video, whereas those who are not interested in the content or are disgusted by the content seldom watch it. As such, TSCs are normally dominated by positive sentiments, whereas negative sentiments are relatively fewer. This causes an imbalance between the two sentiment classes in TSC data. This problem is alleviated by using a focal loss as the objective function to train our proposed TSC sentiment classification model (Lin et al., 2020). The focal loss was developed in the fields of image analysis and object detection, where it has proven effective in compensating for class imbalance. By increasing the weight of the minority class and reducing that of the majority class, the focal loss allows the model to focus more on samples that are difficult to classify.
The focal loss is constructed on the basis of the traditional binary cross entropy loss function and introduces the weighting factor $\alpha_t$ and the tunable focusing parameter $\gamma$ to account for the class imbalance. Specifically, the loss is computed as follows:
$$\mathrm{Loss}_{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{15}$$
where $p_t = P$ if $y = 1$ and $p_t = 1 - P$ otherwise, P is the calculated probability for the class, and y is the label of the classified sample. By utilizing the focal loss, the contribution of easily classified samples to the loss is down-weighted, so the proposed model pays more attention to the training for difficult negative samples.
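As a worked illustration of Equation (15), the following sketch computes the focal loss for a single sample and compares it with plain binary cross entropy. The values `alpha=0.25` and `gamma=2.0` are the defaults commonly reported for focal loss and are assumptions here; the values used for SHDL are not stated in this section. Which sentiment class is encoded as 1 is likewise an assumption.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample.

    p: predicted probability of class 1; y: true label (0 or 1).
    alpha and gamma are illustrative defaults, not the paper's settings.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p, y):
    """Standard binary cross entropy for one sample."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

# An easy sample (p_t = 0.9) and a hard sample (p_t = 0.1) of the same class.
easy_fl, hard_fl = focal_loss(0.9, 1), focal_loss(0.1, 1)
easy_ce, hard_ce = cross_entropy(0.9, 1), cross_entropy(0.1, 1)
fl_ratio = hard_fl / easy_fl
ce_ratio = hard_ce / easy_ce
```

With $\gamma = 0$ the modulating factor disappears and the loss reduces to class-weighted cross entropy. With $\gamma > 0$ the hard sample dominates the easy one far more strongly than under cross entropy, which is the intended focusing effect.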

Data
We evaluated our proposed sentiment classification method using a real-world TSC dataset collected from a series of video programs on Bilibili, which is one of the largest TSC-enhanced online video platforms.
The dataset contains the text content of TSCs and their posting time in the video timeline. For TSC annotation, we randomly selected 15,000 candidate TSCs from the collected TSC set for labeling. Three domain experts were solicited to label the candidate TSCs, with each TSC being assessed by all three experts. Initially, the annotators performed the labeling independently according to the predetermined annotation guideline (available in Appendix B), in which TSCs with sentiment tendencies were labeled with positive or negative labels, and the neutral TSCs were filtered out. Following the initial labeling process, the inter-rater agreement, as measured by Fleiss' kappa (Fleiss, 1971), reached 0.79, indicating a substantial level of agreement. In the event of initial labeling inconsistencies, the experts engaged in discussions and conducted relabeling to ensure accuracy and consistency. To determine the final sentiment labels, we employed the majority vote mechanism. Consequently, we obtained a labeled dataset consisting of 13,135 target TSCs for empirical evaluation. Among these, 11,090 TSCs were labeled as positive sentiments, while 2,045 TSCs were labeled as negative sentiments. For each labeled TSC, the four closest TSCs before it and four closest TSCs after it (in terms of posting time) were considered as the possible contextual TSCs of the labeled (or target) TSC for contextual information, resulting in 105,080 unlabeled TSCs that were used as contextual TSCs. Overall, our experimental data comprised a total of 118,215 TSCs.
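The construction of contextual TSCs can be sketched as follows. This is a minimal illustration that assumes the TSCs of a video are already sorted by posting time; the function name `context_window` and the toy timeline are hypothetical, and the handling of targets near the start or end of a video is not described in the source (the reported counts imply eight contextual TSCs per target).

```python
def context_window(tscs, target_idx, n_before=4, n_after=4):
    """Return the TSCs posted just before and just after the target.

    tscs must be sorted by posting time; near the boundaries the
    returned lists are simply shorter (an assumption of this sketch).
    """
    before = tscs[max(0, target_idx - n_before):target_idx]
    after = tscs[target_idx + 1:target_idx + 1 + n_after]
    return before, after

# Toy timeline of (posting time in seconds, text) pairs.
timeline = [(0.5 * i, f"tsc_{i}") for i in range(12)]
before, after = context_window(timeline, target_idx=5)

# Dataset sizes reported in the text.
N_TARGETS = 13135
n_context = N_TARGETS * 8          # four former and four latter TSCs per target
n_total = N_TARGETS + n_context
```

The last three lines reproduce the reported totals: 13,135 targets with eight contextual TSCs each give 105,080 contextual TSCs and 118,215 TSCs overall.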

Experimental design
The representative methods of sentiment classification (as summarized in Table 1) were selected as benchmarked methods. Benchmarked machine learning methods include NB, LR, SVM, and RF, which have been commonly used for text mining (Onan et al., 2016; Ghaddar & Naoum-Sawaya, 2018; Parmar et al., 2014). Benchmarked deep learning methods include TextCNN, BiRNN, RCNN, and BiGRU with Attention (BiGRU-A) (Kim, 2014; Kratzwald et al., 2018; Lai et al., 2015; Wang et al., 2021b). For machine learning methods, we calculated word frequency vectors on the basis of the TF-IDF algorithm for document feature extraction, which were used as the input. For deep learning methods, the embedding word vectors of TSCs used as the input were acquired using the skip-gram technique of Word2Vec, with the dimension set to 100 and the sequence length set to 16. All the deep learning methods were trained using the Adam optimizer with a learning rate of 0.005 and a batch size of 32. All the hyperparameters mentioned above were tuned using a validation set in our experiments. Meanwhile, early stopping and dropout were applied to avoid overfitting.
To measure the performance of the sentiment classification of our experimental methods, we adopted three performance metrics: recall, precision, and F1-score. Recall reflects the ability of the model to detect samples of a target category, precision reflects the proportion of samples predicted to be of a category that actually belong to it, and F1-score, the harmonic mean of the two, reflects the trade-off between recall and precision. We calculated the above three performance metrics for each category separately. Typically, a desired classification method is expected to have high values for each performance metric.
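The per-category computation of the three metrics can be sketched as follows. The function name `per_class_metrics`, the label strings, and the toy predictions are illustrative assumptions.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1-score for one category."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy imbalanced example: eight positive and two negative samples.
y_true = ["pos"] * 8 + ["neg"] * 2
y_pred = ["pos"] * 7 + ["neg"] + ["neg", "pos"]
pos_metrics = per_class_metrics(y_true, y_pred, "pos")
neg_metrics = per_class_metrics(y_true, y_pred, "neg")
```

In this toy case one error in each direction leaves the positive class at 0.875 on all three metrics but drops the negative class to 0.5, which illustrates why the metrics are reported separately for each class under class imbalance.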
We evaluated the sentiment classification performance using repeated cross-validation. Specifically, we conducted 10 independent five-fold cross-validations with different random seeds, resulting in 50 performance estimates. This procedure can effectively alleviate the impact of randomness in the training-testing split, and thus has been used in many extant studies (Chen et al., 2023; Jiang et al., 2019; Rodriguez et al., 2009). During each cross-validation, the dataset was divided into five equal-sized subsets (folds), and each fold was used to estimate the performance of the classifier trained on the other four folds. For fairness, the fold splitting was kept identical across all classification methods. Performance results (averages and the 95 percent confidence interval) reported later are all based on the 50 estimates. Moreover, all experiments were implemented in Python 3.8 on an Intel Xeon Gold 5218R CPU and an NVIDIA Tesla TU104GL GPU.
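The repeated cross-validation protocol can be sketched as a generator of index splits. This is a minimal illustration; the function name `repeated_kfold`, the seeding scheme, and the absence of stratification are assumptions, since the source does not specify how the folds were drawn.

```python
import random

def repeated_kfold(n_samples, n_splits=5, n_repeats=10, base_seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    for rep in range(n_repeats):
        idx = list(range(n_samples))
        # A different seed per repetition gives a different shuffle.
        random.Random(base_seed + rep).shuffle(idx)
        folds = [idx[f::n_splits] for f in range(n_splits)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f
                         for i in fold]
            yield rep, f, train_idx, test_idx

# Materialize the splits once so every method is evaluated on the same folds.
splits = list(repeated_kfold(n_samples=100))
```

Generating the splits once and reusing them for every classifier is one simple way to keep the fold splitting identical across methods, as the protocol requires.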

Sentiment classification performance
Table 2 summarizes the performance comparisons of the above-mentioned methods on the sentiment classification of TSCs in terms of recall, precision, and F1-score. The results show that the proposed method outperforms the traditional machine learning and deep learning methods in terms of all performance metrics.
Overall, compared with traditional machine learning methods, deep learning methods achieve better classification performances. The chosen deep learning methods can handle the sequence relationship of words in the text, which enables them to capture richer semantic information. Compared to the deep learning methods (TextCNN, BiRNN, RCNN, and BiGRU-A), SHDL shows a better sentiment classification performance in terms of each performance metric for both positive and negative classes, indicating its superiority and the robustness of the results. We tested the statistical significance of the comparisons between SHDL and benchmarked methods using both non-parametric and parametric tests. For the non-parametric test, we used the Friedman test with a post-hoc procedure (Demšar, 2006). Table 3 summarizes the results of the pairwise comparisons, adjusted by Bonferroni correction, of the nine sentiment classification methods in terms of F1-score for the negative class. The actual p-values in terms of all performance metrics for both positive and negative classes are available in Appendix C. Overall, the differences across the nine sentiment classification methods were statistically significant ($\chi^2 = 377.58$, $p < 0.001$). Further pairwise comparisons verify that SHDL significantly outperformed all benchmarked methods. For the parametric test, we used the repeated measures ANOVA, with the method (SHDL vs. one of the benchmarked methods, respectively) as the main factor. Figure 5 illustrates the effect size (i.e., partial $\eta^2$) of using SHDL in lieu of each benchmarked method in terms of performance improvement. To comprehensively reflect the performance improvement, for each performance metric, we calculated the averages of the two target classes as comparative data. The results show that, except for the partial $\eta^2$ in terms of recall for the comparison between SHDL and BiGRU-A, all the others are over 0.4, indicating that using the proposed method accounts for a conspicuous performance improvement in sentiment classification. The comparison results verify that the proposed method can better identify the sentiment polarity of TSCs and improve sentiment classification performance.
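For readers unfamiliar with the Friedman test, the statistic can be computed from per-estimate ranks as sketched below. This is the textbook form without tie correction; the function name `friedman_statistic` and the toy scores are assumptions, and the sketch does not reproduce the post-hoc procedure or the Bonferroni adjustment.

```python
def friedman_statistic(scores):
    """Friedman chi-square statistic (no tie correction).

    scores[i][j] is the score of method j on performance estimate i.
    Tied values within a row receive their average rank.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            avg_rank = (pos + end) / 2 + 1
            for q in range(pos, end + 1):
                rank_sums[order[q]] += avg_rank
            pos = end + 1
    return (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3.0 * n * (k + 1))

# Three estimates, three methods, identical ordering in every estimate.
stat_ordered = friedman_statistic([[0.1, 0.2, 0.3]] * 3)
# Orderings that cancel out across estimates.
stat_balanced = friedman_statistic([[1, 2, 3], [2, 3, 1], [3, 1, 2]])
```

When every estimate ranks the methods identically, the statistic reaches its maximum of N(k - 1), where N is the number of estimates and k the number of methods. If the 50 performance estimates serve as blocks for the nine methods (an assumption), that maximum is 400, so the reported value of 377.58 indicates a highly consistent ranking across estimates.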

Equal feature space
Considering that the proposed method is fed with information from several TSCs, we conducted extra feature space experiments to verify whether the superiority of our method stems from the additional data considered for each observation or from the method being able to better capture the information in the available data. In this section, the input of the benchmark methods was also contextual TSCs, as in SHDL, to ensure a consistent feature space. Specifically, because the benchmarks do not have hierarchical feature learning capabilities, we combined the target TSC with its contextual TSCs into a TSC document as the input of the benchmarks, and the sequence length of the combined TSC document is equivalent to the sequence length of a single TSC multiplied by the number of TSCs included. The remaining settings were the same as in the above experiments. As summarized in Table 4, in the feature space experiments, RF achieved the best performance in terms of recall for positive sentiment and precision for negative sentiment, while SHDL achieved the best performance in terms of F1-score. Although the benchmark methods were fed additional data, the performance of most benchmarks did not improve and even decreased, which indicates that their structures cannot extract valuable contextual information and that the added TSC data may be a form of interference. The experimental results suggest that the advantage of SHDL over most other methods stems from its information extraction capability rather than from the additional input data.
BiGRU-A 97.11 (96.73-97.48) 77.56 (76.39-78.72) 95.93 (95.62-96.23) 83.72 (82.11-85.34) 96.50 (96.30-96.70)

Ablation study
To verify the contribution of each artifact in SHDL, we also carried out an ablation study. For comparison with the complete model, we removed or replaced one component of SHDL at a time to construct five reduced models (i.e., M1-M5). Specifically, we removed CNN (M1), word-level attention (M2), BiGRU (M3), and comment-level attention (M4), respectively, and used the standard binary cross entropy loss function instead of the focal loss as the objective function for model training (M5). Table 5 shows the results of the ablation experiments, which indicate that removing any component degrades the overall classification performance of SHDL and demonstrate the effectiveness of these key artifacts in SHDL for the semantic representation and sentiment classification of TSCs. Moreover, Figure 6 shows the percentage of performance decrease of the above reduced methods (M1-M5) relative to SHDL in terms of F1-score.
The figure shows that the largest performance decrease is in M3 when BiGRU is removed, with the F1-score for positive and negative classes decreasing by 1.44% and 6.64%, respectively. These results indicate that BiGRU, which captures the temporal correlation among TSCs, is significant for contextual semantic representation.
In SHDL, we use a contextual window to locate contextual TSCs and augment the target TSC for its sentiment classification. The context size, which reflects the number of contextual TSCs used (i.e., the window size), acts as a key hyperparameter of the proposed method, and its effect on the classification performance was analyzed. Considering that the TSCs posted before (former) and after (latter) the target TSC may have different effects, we evaluated the performance of the proposed method with varying numbers of former and latter TSCs. Figure 7 reports the overall performance of SHDL with 1, 2, 3, and 4 former and latter TSCs in terms of average recall, precision, and F1-score. In Figure 7, the horizontal and vertical axes represent the number of former and latter TSCs used. The results show that SHDL achieves optimal performance when the numbers of former and latter TSCs are both 3. Theoretically, as the number of contextual TSCs used increases, SHDL will be able to capture longer distance dependencies among TSCs.
In addition, with an increase in the number of contextual TSCs used, more unlabeled TSC data can be used to provide more abundant semantic information. However, in general, a contextual TSC that is farther from the target TSC in terms of posting time has a weaker dependency on it, and two TSCs that are too far apart may have no correlation. Hence, increasing the number of contextual TSCs used does not necessarily improve the sentiment classification performance of the model. Figure 7 shows that enforcing a proper number of former and latter TSCs (3 in our study) improves the model performance. Given this number, the context size (i.e., the size of the window) is 7: three former TSCs, the target TSC, and three latter TSCs.
This study has several limitations, which may be addressed in future research. First, our proposed method was evaluated on only one dataset with a class imbalance issue. Further research may collect data from various online video platforms to comprehensively evaluate our proposed method. Second, the proposed method used a fixed number of contextual TSCs, whereas the distance of the contextual dependency depends on the density of posted TSCs and is nonstationary. Further research may consider designing a method that could adaptively select context windows to further improve the sentiment classification performance. Third, due to the reliance on TSCs following the target TSC, our proposed method may not be suitable for live scoring. Future research may consider designing artifacts for accommodating live scoring.

Figure 1. A real-life example of a comment and TSCs from the same video
The characteristic contextual dependency of TSCs provides a new pathway for sentiment classification.

Figure 2. Overall SHDL framework
Figure 3 illustrates the detailed structure of the comment-level representation in SHDL.

Figure 3. Detailed structure of comment-level representation in SHDL
Given the word vectors of the augmented TSC context $X = [x_{t-k}, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_{t+k}]$, the word vectors of each TSC are fed into the same CNN with the word-level attention structure to learn their comment-level representation vectors in parallel. As shown in Figure 3, we use multiple convolution kernels with different scales on the word vectors to extract multiscale dependencies within a TSC. Specifically, the width of the convolution kernels is the same as the size of inputs S, which ensures that the kernel slides sequentially in the direction of TSC length and can simultaneously process complete information of one or several words. The convolution kernels have various heights, taking integer values from 1 up to a maximum height.
With the comment-level vector representation of each TSC obtained, the augmented context can be denoted as $C = [c_{t-k}, \ldots, c_{t-1}, c_t, c_{t+1}, \ldots, c_{t+k}]$. Then, the representation sequence is input into the context-level representation to capture the contextual dependency among multiple TSCs. Considering that TSCs appear in sequence according to their posting in the video timeline, capturing the temporal correlation of contextual TSCs is important. Moreover, given the viewers' unique ideas and diverse topics, viewers may express comments and opinions that differ from mainstream views, which indicates that contextual TSCs have different effects on judging the sentiment of the target TSC. Therefore, we propose a BiGRU with comment-level attention to extract the dependency and quantify the contributions of contextual TSCs for obtaining contextual semantic representations. Figure 4 illustrates the detailed structure of the context-level representation in SHDL.

Figure 4. Detailed structure of context-level representation in SHDL

Figure 5. Partial $\eta^2$ of repeated measures ANOVA

Figure 6. Percentage of performance decrease of the comparative methods in terms of F1-score

Effect of context size

Figure 7. Sentiment classification performance with different context sizes

Representation performance analysis
The proposed method hierarchically generates representations of TSC text at word, comment, and context levels. The word- and comment-level representations are both generated from one single TSC, while the context-level representations are generated from the augmented TSC consisting of a target TSC and its contextual TSCs. To intuitively show the superiority of context-level representations on sentiment classification, we used the t-SNE technique to graphically illustrate the learned comment- and context-level representations, with dimensions reduced to two for visualization. For this comparison, Figure 8 shows the reduced 2D features of comment- and context-level representations, where the different colors represent different sentiment categories. Figure 8(a) provides the 2D visual features of comment-level representations extracted from target TSCs. We can observe that different sentiment categories heavily overlap, which indicates that the feature information of a single TSC is hardly differentiable. Figure 8(b) provides the 2D

Figure 8. Feature visualization of the hierarchical representations by t-SNE

Conclusions
As a new type of online comment, TSCs contain significant sentiment information with considerable potential value for the success of online video platforms. The contextual dependency of TSCs provides opportunities for using contextual TSCs to assist in identifying the sentiment of a target TSC. In this study, we identified the challenges in capturing contextual information from neighboring TSCs and fusing contextual semantics for sentiment classification. To address these challenges, we proposed a semi-supervised hierarchical deep learning method for sentiment classification of TSCs with the reasonable usage of contextual TSCs. Specifically, we designed a hierarchical architecture to capture the multilevel semantics of TSCs and developed two attention mechanisms for semantic fusion at different levels. We evaluated our method using a TSC dataset from a popular online video platform. The empirical results show that our SHDL method effectively improved the performance of TSC sentiment classification compared with benchmark methods. The results advance our knowledge of contextual information indicative of TSC sentiment.
This study contributes to both research and practice. First, we propose a novel and effective sentiment classification method, and managers and operators of online video platforms could use the proposed method to analyze the sentiments of the massive numbers of TSCs on their platforms, so as to grasp user demands and enhance their service performance. Second, we use contextual TSC information in a semi-supervised manner, which could save burdensome data annotation work and reduce the model deployment costs of online video platforms. Third, while the proposed method focuses on TSCs, the prescriptive knowledge advanced in this study (e.g., hierarchical semantic representation architecture and attention mechanisms)

For the annotation task, the annotators were three graduate students majoring in Management Science and Engineering at Hefei University of Technology. They have long conducted research in the field of online video and social media commentary, and are familiar with the operation of new online video platforms, such as the TSC mechanism.

Appendix C. The actual p-values of non-parametric full pairwise comparisons
Tables C1 to C6 report the actual p-values of full pairwise comparisons in terms of F1-score, recall, and precision, for positive and negative classes, respectively.

Table B1. The annotation guideline for data annotation
Positive: [...] praise, satisfaction, or comfort toward the objects in the video or toward the video creator. Describing the joy, excitement, or emotional movement of the viewer.
Negative: Expressing criticism, disgust, anger, fear, or disappointment toward the objects in the video or toward the video creator. Describing the sadness, anxiety, or pain of the viewer.