The R package sentometrics to compute, aggregate and predict with textual sentiment

We provide a hands-on introduction to optimized textual sentiment indexation using the R package sentometrics. Textual sentiment analysis is increasingly used to unlock the potential information value of textual data. The sentometrics package implements an intuitive framework to efficiently compute sentiment scores of numerous texts, to aggregate the scores into multiple time series, and to use these time series to predict other variables. The workflow of the package is illustrated with a built-in corpus of news articles from two major U.S. journals to forecast the CBOE Volatility Index.


Introduction
Individuals, companies, and governments continuously consume written material from various sources to improve their decisions. The corpus of texts is typically of a high-dimensional longitudinal nature, requiring statistical tools to extract the relevant information. A key source of information is the sentiment transmitted through texts, called textual sentiment. Algaba, Ardia, Bluteau, Borms, and Boudt (2020) review the notion of sentiment and its applications, mainly in economics and finance. They define sentiment as "the disposition of an entity towards an entity, expressed via a certain medium." The medium in this case is texts. The sentiment expressed through texts may provide valuable insights into the future dynamics of variables related to firms, the economy, political agendas, product satisfaction, and marketing campaigns, for instance. Still, textual sentiment is not equally useful across all applications. Deciphering when, to what degree, and which layers of the sentiment add value is needed to consistently study the full information potential present within qualitative communications. The econometric approach of constructing time series of sentiment by means of optimized selection and weighting of textual sentiment is referred to as sentometrics by Algaba et al. (2020) and Ardia, Bluteau, and Boudt (2019). The term sentometrics is a composition of (textual) sentiment analysis and (time series) econometrics.
The release of the R (R Core Team 2021) text mining infrastructure tm (Feinerer, Hornik, and Meyer 2008) over a decade ago can be considered the starting point of the development and popularization of textual analysis tools in R. A number of successful follow-up attempts at improving the speed and interface of the comprehensive natural language processing capabilities provided by tm have been delivered by the packages openNLP (Hornik 2019), cleanNLP (Arnold 2017), quanteda (Benoit, Watanabe, Wang, Nulty, Obeng, Müller, and Matsuo 2018), tidytext (Silge and Robinson 2016), and qdap (Rinker 2020).
The notable tailor-made packages for sentiment analysis in R are meanr (Schmidt 2019), SentimentAnalysis (Feuerriegel and Proellochs 2021), sentimentr (Rinker 2019b), and syuzhet (Jockers 2020). Many of these packages rely on one of the larger above-mentioned textual analysis infrastructures. The meanr package computes net sentiment scores fastest, but offers no flexibility. 1 The SentimentAnalysis package relies on a similar calculation as used in tm's sentiment scoring function. The package can additionally be used to generate and evaluate sentiment dictionaries. The sentimentr package extends the polarity scoring function from the qdap package to handle more difficult linguistic edge cases, but is therefore slower than packages which do not attempt this. The SentimentAnalysis and syuzhet packages also become comparatively slower for large input corpora. The quanteda and tidytext packages have no explicit sentiment computation function but their toolsets can be used to construct one.
Our R package sentometrics proposes a well-defined modeling workflow, specifically targeted at studying the evolution of textual sentiment and its impact on other quantities. It can be used (i) to compute textual sentiment, (ii) to aggregate fine-grained textual sentiment into various sentiment time series, and (iii) to predict other variables with these sentiment measures. The combination of these three facilities leads to a flexible and computationally efficient framework to exploit the information value of sentiment in texts. The package presented in this paper therefore addresses the present lack of analytical capability to extract time series intelligence about the sentiment transmitted through a large panel of texts.
Furthermore, the sentometrics package positions itself as both integrative and supplementary to the powerful text mining and data science toolboxes in the R universe. It is integrative, as it combines the strengths of quanteda and stringi (Gagolewski 2021) for corpus construction and manipulation. It uses data.table (Dowle and Srinivasan 2021) for fast aggregation of textual sentiment into time series, and glmnet (Friedman, Hastie, and Tibshirani 2010) and caret (Kuhn 2021) for (sparse) model estimation. It is supplementary, given that it easily extends any text mining workflow to compute, aggregate and predict with textual sentiment.
The remainder of the paper is structured as follows. Section 2 introduces the methodology behind the R package sentometrics. Section 3 describes the main control functions and illustrates the package's typical workflow. Section 4 applies the entire framework to forecast the Chicago Board Options Exchange (CBOE) Volatility Index. Section 5 concludes.

Use cases and workflow
The typical use cases of the R package sentometrics are the fast computation and aggregation of textual sentiment, the subsequent time series visualization and manipulation, and the estimation of a sentiment-based prediction model. The use case of building a prediction model out of textual data encompasses the previous ones.
We propose a modular workflow that consists of five main steps, Steps 1-5, from corpus construction to model estimation. All use cases can be addressed by following (a subset of) this workflow. The R package sentometrics takes care of all steps, apart from corpus collection and cleaning. However, various conversion functions and method extensions are made available that allow the user to enter and exit the workflow at different steps. Table 1 summarizes the key functionalities of sentometrics, together with the associated functions and S3 class objects. All steps are explained below. We keep the mathematical details to a minimum to clarify the exposition, and stay close to the actual implementation. Section 3 demonstrates how to use the functions.

Pre-process a selection of texts and generate relevant features (Step 1)
We assume the user has a corpus of texts of any size at their disposal. The data can be scraped from the web, retrieved from news databases, or obtained from any other source. The texts should be cleaned such that graphical and web-related elements (e.g., HTML tags) are removed. To benefit from the full functionality of the sentometrics package, a minimal requirement is that every text has a timestamp and an identifier. This results in a set of documents d_{n,t} for n = 1, ..., N_t and time points t = 1, ..., T, where N_t is the total number of documents at time t. If the user has no interest in an aggregation into time series and only wants the sentiment calculation, the identifiers and especially the timestamps can be dropped. The corpus can also be given a language identifier, for a sentiment analysis across multiple languages at once. The identifier is used to direct the lexicons in the different languages to the right texts.
Secondly, features have to be defined and mapped to the documents. Features can come in many forms: news sources, entities (individuals, companies or countries discussed in the texts), or text topics. The mapping implicitly permits subdividing the corpus into many smaller groups with a common interest. Many data providers enrich their textual data with information that can be used as features. If this is not the case, topic modeling or entity recognition techniques are valid alternatives. Human classification or manual keyword(s) occurrence searches are simpler options. The extraction and inclusion of features is an important part of the analysis and should be related to the variable that is meant to be predicted.
The texts and features have to be structured in a rectangular fashion. Every row represents a document that is mapped to the features through numerical values w^k_{n,t} ∈ [0, 1], where the features are indexed by k = 1, ..., K. The values are indicative of the relevance of a feature to a document. Binary values indicate which documents belong to which feature(s).
This rectangular data structure is turned into a 'sento_corpus' object when passed to the sento_corpus() function. The reason for this separate corpus structure is twofold. It checks that all corpus requirements for further analysis are met (specifically, regarding timestamps and numeric features), and it allows performing operations on the corpus in a more structured way. If no features are of interest to the analysis, a dummy feature valued w^k_{n,t} = 1 throughout is automatically created. The add_features() function is used to add or generate new features, as will be shown in the illustration. Once the corpus is constructed, it is up to the user to decide which texts have to be kept for the actual sentiment analysis.

Sentiment computation and aggregation (Steps 2 and 3)
Overall, in the sentiment computation and aggregation framework, we define three weighting parameters: ω, θ and b. They control respectively the within-document, across-document, and across-time aggregation. Section 3.3 explains how to set the values for these parameters. Appendix B gives an overview of the implemented weighting formulae.
Compute document- or sentence-level textual sentiment (Step 2)

Every document requires at least one sentiment score for further analysis. The sentometrics package can be used to assign sentiment using the lexicon-based approach, possibly augmented with information from valence shifters. The sentiment computation always starts from a corpus of documents. However, the package can also automatically decompose the documents into sentences and return sentence-level sentiment scores. The actual computation of the sentiment follows one of the three approaches explained below. Alternatively, one can align one's own sentiment scores with the sentometrics package making use of the as.sentiment() and merge() functions.

The lexicon-based approach to sentiment calculation is flexible, transparent, and computationally convenient. It looks for words (or unigrams) that are included in a pre-defined word list of polarized (positive and negative) words. The package benefits from built-in word lists in English, French, and Dutch, the latter two mostly as checked web-based translations. The sentometrics package allows for three different ways of doing the lexicon-based sentiment calculation. These procedures, though simple at their core, have proven efficient and powerful in many applications. In increasing complexity, the supported approaches are:

(i) A unigrams approach. The most straightforward method, where the computed sentiment is simply a (weighted) sum of all detected word scores as they appear in the lexicon.
(ii) A valence-shifting bigrams approach. The impact of the word appearing before the detected word is evaluated as well. A common example is "not good", which under the default approach would get a score of 1 ("good"), but now ends up, for example, having a score of −1 due to the presence of the negator "not".
(iii) A valence-shifting clusters approach. Valence shifters can also appear in positions other than right before a certain word. We implement this layer of complexity by searching for valence shifters (and other sentiment-bearing words) in a cluster of at maximum four words before and two words after a detected polarized word.
In the first two approaches, the sentiment score of a document d_{n,t} (d in short) is the sum of the adjusted sentiment scores of all its unigrams. The adjustment comes from applying weights to each unigram based on its position in the document and adjusting for the presence of a valence shifting word. This leads to:

$$s^{l}_{n,t} \equiv \sum_{i=1}^{Q_d} \omega_i \, v_i \, s^{l}_{i,n,t}, \qquad (1)$$

for every lexicon l = 1, ..., L. The total number of unigrams in the document is equal to Q_d. The score s^{l}_{i,n,t} is the sentiment value attached to unigram i from document d_{n,t}, based on lexicon l. It equals zero when the word is not in the lexicon. The impact of a valence shifter is represented by v_i, being the shifting value of the preceding unigram i - 1. No valence shifter or the simple unigrams approach boils down to v_i = 1. If the valence shifter is a negator, typically v_i = -1. The weights ω_i define the within-document aggregation. The values ω_i and v_i are specific to a document d_{n,t}, but we omit the indices n and t for brevity.
The third approach differs in the way it calculates the impact of valence shifters. A document is decomposed into C_d clusters around polarized words, and the total sentiment equals the sum of the sentiment of each cluster. The expression (1) becomes in this case s^{l}_{n,t} ≡ Σ_{c=1}^{C_d} s^{l}_c, where s^{l}_c is the sentiment value of cluster c. Given a detected polarized word, say unigram j, valence shifters are identified in a surrounding cluster of adjacent unigrams J ≡ {J_L, J_U} around this word (irrespective of whether they appear in the same sentence or not), where J_L ≡ {j - 4, j - 3, j - 2, j - 1} and J_U ≡ {j + 1, j + 2}. The resulting sentiment value of the cluster J around associated unigram j combines the lexicon scores of the polarized word and of the other sentiment-bearing words in the cluster, amplified or deamplified according to the valence shifters detected, with the net amplification bounded from below through a max{·, -1} expression. The number of amplifying valence shifters is n_A, those that deamplify are counted by n_D, and n = 1 and n_N = -1 if there is an odd number of negators, else n = 0 and n_N = 1. All n_A, n_D, n and n_N are specific to a cluster J. The unigrams in J are first searched for in the lexicon, and only when there is no match, they are searched for in the valence shifters word list. Clusters are non-overlapping from one polarized word to the other; if another polarized word is detected at position j + 4, then its cluster consists of the unigrams {j + 3, j + 5, j + 6}. This clusters-based approach borrows from how the R package sentimentr does its sentiment calculation. Linguistic intricacies (e.g., sentence boundaries) are better handled in that package, at the expense of being slower.
In case of a clusters-based sentence-level sentiment calculation, we follow the default settings used in sentimentr. This includes, within the scope of a sentence, a cluster of 5 words (not 4 as above) before and 2 words after the polarized word, bounded by any commas that occur. A fourth type of valence shifters, adversative conjunctions (e.g., however), is used to reweight the first expression of max{·, -1} by 1 + 0.25 n_AC, where n_AC is the difference between the number of adversative conjunctions within the cluster before and after the polarized word.
The scores obtained above are subsequently multiplied by the feature weights to spread out the sentiment into lexicon- and feature-specific sentiment scores, as s^{l,k}_{n,t} ≡ s^{l}_{n,t} w^k_{n,t}, with k the index denoting the feature. If the document does not correspond to the feature, the value of s^{l,k}_{n,t} is zero.

In sentometrics, the sento_lexicons() function is used to define the lexicons and the valence shifters. The output is a 'sento_lexicons' object. Any provided lexicon is applied to the corpus. The sentiment calculation is performed with compute_sentiment(). Depending on the input type, this function outputs a data.table with all values for s^{l,k}_{n,t}. When the output can be used as the basis for aggregation into time series in the next step (that is, when it has a "date" column), it becomes a 'sentiment' object. To do the computation at sentence level, the argument do.sentence = TRUE should be used. The as.sentiment() function transforms a properly structured table with sentiment scores into a 'sentiment' object.

Aggregate the sentiment into textual sentiment time series (Step 3)
In this step, the purpose is to aggregate the individual sentiment scores and obtain various representative time series. Two main aggregations are performed. The first, across-document aggregation, collapses all sentiment scores across documents within the same frequency (e.g., day or month, as defined by t) into one score. The weighted sum that does so is:

$$s^{l,k}_{t} \equiv \sum_{n=1}^{N_t} \theta_n \, s^{l,k}_{n,t}. \qquad (2)$$

The weights θ_n define the importance of each document n at time t (for instance, based on the length of the text). The second, across-time aggregation, smooths the newly aggregated sentiment scores over time, as:

$$s^{l,k,b}_{u} \equiv \sum_{\tau=1}^{\text{lag}} b_{\tau} \, s^{l,k}_{t_\tau}, \qquad (3)$$

where t_τ ≡ u - τ + 1. The entire aggregation setup is specified by means of the ctr_agg() function, including the within-document aggregation needed for the sentiment analysis. The sento_measures() function performs both the sentiment calculation (via compute_sentiment()) and the time series aggregation (via aggregate()), outputting a 'sento_measures' object. The obtained sentiment measures in the 'sento_measures' object can be further aggregated across measures, also with the aggregate() function.

Specify regression model and do (out-of-sample) predictions (Step 4)
The sentiment measures are now regular time series variables that can be applied in regressions. In case of a linear regression, the reference equation is:

$$y_{u+h} = \delta + \boldsymbol{\beta}^{\top} \boldsymbol{s}_u + \boldsymbol{\gamma}^{\top} \boldsymbol{x}_u + \epsilon_{u+h}. \qquad (4)$$

The target variable y_{u+h} is often a variable to forecast, that is, h > 0. The vector s_u ≡ (s^1_u, ..., s^P_u) encapsulates all P textual sentiment variables as constructed before, and β ≡ (β_1, ..., β_P) holds the associated coefficients. Other variables are denoted by the vector x_u at time u, γ is the associated parameter vector, δ is an intercept, and ε_{u+h} is an error term. Logistic regression (binomial and multinomial) is available as a generalization of the same underlying linear structure.
The typical large dimensionality of the number of predictors in (4) relative to the number of observations, and the potential multicollinearity, both pose a problem to ordinary least squares (OLS) regression. Instead, estimation and variable selection through a penalized regression relying on the elastic net regularization of Zou and Hastie (2005) is more appropriate. As an example, Joshi, Das, Gimpel, and Smith (2010) and Yogatama, Heilman, O'Connor, Dyer, Routledge, and Smith (2011) use regularization to predict movie revenues, and scientific article downloads and citations, respectively, using many text elements such as words, bigrams, and sentiment scores. Ardia et al. (2019) similarly forecast U.S. industrial production growth based on a large number of sentiment time series extracted from newspaper articles.
Regularization, in short, shrinks the coefficients of the least informative variables towards zero. It consists of optimizing the least squares or likelihood function including a penalty component. The elastic net optimization problem for the specified linear regression is expressed as:

$$\min_{\boldsymbol{\beta}, \boldsymbol{\gamma}} \; \sum_{u} \big( \tilde{y}_{u+h} - \boldsymbol{\beta}^{\top} \tilde{\boldsymbol{s}}_u - \boldsymbol{\gamma}^{\top} \tilde{\boldsymbol{x}}_u \big)^2 + \lambda \Big( \alpha \, \|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2} \, \|\boldsymbol{\beta}\|^2_2 \Big). \qquad (5)$$

The tilde denotes standardized variables, and ||·||_p is the ℓ_p-norm. The standardization is required for the regularization, but the coefficients are rescaled back once estimated. The rescaled estimates of the model coefficients for the textual sentiment indices are collected in β, usually a sparse vector, depending on the severity of the shrinkage. The parameter 0 ≤ α ≤ 1 defines the trade-off between the Ridge (Hoerl and Kennard 1970), ℓ_2, and the LASSO (Tibshirani 1996), ℓ_1, regularization, respectively for α = 0 and α = 1. The λ ≥ 0 parameter defines the level of regularization. When λ = 0, the problem reduces to OLS estimation. The two parameters are calibrated in a data-driven way, such that they are optimal to the regression equation at hand. The sentometrics package allows calibration through cross-validation, or based on an information criterion with the degrees of freedom properly adjusted to the elastic net context according to Tibshirani and Taylor (2012).
A potential analysis of interest is the sequential estimation of a regression model and outof-sample prediction. For a given sample size M < N , a regression is estimated with M observations and used to predict some next observation of the target variable. This procedure is repeated rolling forward from the first to the last M -sized sample, leading to a series of estimates. These are compared with the realized values to assess the (average) out-of-sample prediction performance.
The type of model, the calibration approach, and other modeling decisions are defined via the ctr_model() function. The (iterative) model estimation and calibration is done with the sento_model() function that relies on the R packages glmnet and caret. The user can define here additional (sentiment) values for prediction through the x argument. The output is a 'sento_model' object (one model) or a 'sento_modelIter' object (a collection of iteratively estimated 'sento_model' objects and associated out-of-sample predictions).
A forecaster, however, is not limited to the models provided through the sentometrics package; (s)he is free to take the sentiment variables computed in the previous steps to whichever modeling toolbox is available.

Evaluate prediction performance and sentiment attributions (Step 5)
A 'sento_modelIter' object carries an overview of out-of-sample performance measures relevant to the type of model estimated. Plotting the object returns a time series plot comparing the predicted values with the corresponding observed ones. A more formal way to compare the forecasting performance of different models, sentiment-based or not, is to construct a model confidence set (Hansen, Lunde, and Nason 2011). This set isolates the models that are statistically the best in terms of predictive ability, within a confidence level. To do this analysis, one first calls the function get_loss_data(), which returns a loss data matrix from a collection of 'sento_modelIter' objects, for a chosen loss metric (like squared errors); see ?get_loss_data for more details. This loss data matrix is ready for use by the R package MCS (Catania and Bernardi 2017) to create a model confidence set.
The aggregation into textual sentiment time series is entirely linear. Based on the estimated coefficients β, every underlying dimension's sentiment attribution to a given prediction can thus be computed easily. For example, the attribution of a certain feature k in the forecast of the target variable at a particular date is the weighted sum of the model coefficients and the values of the sentiment measures constructed from k. Attribution can be computed for all features, lexicons, time-weighting schemes, time lags, and individual documents. Through attribution, a prediction is broken down into its respective components. The attribution to documents is useful to pick out the texts with the most impact on a prediction at a certain date. The function attributions() computes all types of possible attributions.

The R package sentometrics
In what follows, several examples show how to put the steps into practice using the sentometrics package. The subsequent sections illustrate the main workflow, using built-in data, focusing on individual aspects of it. Section 3.1 studies corpus management and features generation. Section 3.2 investigates the sentiment computation. Section 3.3 looks at the aggregation into time series (including the control function ctr_agg()). Section 3.4 briefly explains further manipulation of a 'sento_measures' object. Section 3.5 regards the modeling setup (including the control function ctr_model()) and attribution.

Corpus management and features generation
The very first step is to load the R package sentometrics. We also load the data.table package as we use it throughout, but loading it is in general not required.

R> library("sentometrics") R> library("data.table")
We demonstrate the workflow using the usnews built-in dataset, a collection of news articles from The Wall Street Journal and The Washington Post between 1995 and 2014. 3 It has a data.frame structure and thus satisfies the requirement that the input texts have to be structured rectangularly, with every row representing a document. The data is loaded below.
R> data("usnews", package = "sentometrics") R> class(usnews) [1] "data.frame" For conversion to a 'sento_corpus' object, the "id", "date", and "texts" columns have to come in that order. One could also add an optional "language" column for a multilanguage sentiment analysis (see the multi-language sentiment computation example in the next section). All other columns are reserved for features, of type numeric. For this particular corpus, there are four original features. The first two indicate the news source, the latter two the relevance of every document to the U.S. economy. The feature values w k n,t are binary and complementary (when "wsj" is 1, "wapo" is 0; similarly for "economy" and "noneconomy") to subdivide the corpus to create separate time series. To access the texts, one can simply do usnews[["texts"]] (i.e., the third column omitted above). An example of one text is:

R> usnews[["texts"]][2029]
[1] "Dow Jones Newswires NEW YORK --Mortgage rates rose in the past week after Fridays employment report reinforced the perception that the economy is on solid ground, said Freddie Mac in its weekly survey. The average for 30-year fixed mortgage rates for the week ended yesterday, rose to 5.85 from 5.79 a week earlier and 5.41 a year ago. The average for 15-year fixed-rate mortgages this week was 5.38, up from 5.33 a week ago and the year-ago 4.69. The rate for five-year Treasury-indexed hybrid adjustable-rate mortgages, was 5.22, up from the previous weeks average of 5.17. There is no historical information for last year since Freddie Mac began tracking this mortgage rate at the start of 2005." The built-corpus is cleaned for non-alphanumeric characters. To put the texts and features into a corpus structure, call the sento_corpus() function. If you have no features available, the corpus can still be created without any feature columns in the input data.frame, but a dummy feature called "dummyFeature" with a score of 1 for all texts is added to the 'sento_corpus' output object.
R> uscorpus <- sento_corpus(usnews)
R> class(uscorpus)
[1] "sento_corpus" "corpus" "character"

The sento_corpus() function creates a 'sento_corpus' object on top of the quanteda package's 'corpus' object. Hence, many functions from quanteda to manipulate corpora can be applied to a 'sento_corpus' object as well. For instance, quanteda::corpus_subset(uscorpus, date < "2014-01-01") would limit the corpus to all articles before 2014. The presence of the date document variable (the "date" column) and of all other metadata as numeric features valued between 0 and 1 are the two distinguishing aspects between a 'sento_corpus' object and any other corpus-like object in R. Having the date column is a requirement for the later aggregation into time series. The function as.sento_corpus() transforms a quanteda 'corpus' object, a tm 'SimpleCorpus' object or a tm 'VCorpus' object into a 'sento_corpus' object; see ?as.sento_corpus for more details.
To round off Step 1, we add two metadata features using the add_features() function. The features uncertainty and election give a score of 1 to documents in which respectively the word "uncertainty" or "distrust" and the specified regular expression regex appear. Regular expressions provide flexibility to define more complex features, though it can be slow for a large corpus if too complex. Overall, this gives K = 6 features. The add_features() function is most useful when the corpus starts off with no additional metadata, i.e., the sole feature present is the automatically created "dummyFeature" feature. The corpus_summarize() function is useful to numerically and visually display the evolution of various parameters within the corpus.
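A call along the following lines could generate these two features; it is a sketch, and the regular expression shown is a simplified placeholder rather than the one used in the original illustration.

R> regex <- "\\bRepublic[s]?\\b|\\bDemocrat[s]?\\b|\\belection\\b"
R> uscorpus <- add_features(uscorpus,
+   keywords = list(uncertainty = c("uncertainty", "distrust"), election = regex),
+   do.binary = TRUE, do.regex = c(FALSE, TRUE))
R> sum(quanteda::docvars(uscorpus, "uncertainty"))  # number of documents flagged by the feature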
We pack these lexicons together in a named list, and provide it to the sento_lexicons() function, together with an English valence word list. The valenceIn argument dictates the complexity of the sentiment analysis. If valenceIn = NULL (the default), sentiment is computed based on the simplest unigrams approach. If valenceIn is a table with an "x" and a "y" column, the valence-shifting bigrams approach is considered for the sentiment calculation. The values of the "y" column are those used as v_i. If the second column is named "t", it is assumed that this column indicates the type of valence shifter for every word, and it thus forces employing the valence-shifting clusters approach for the sentiment calculation. Three types of valence shifters are supported for the latter method: negators (value of 1, defining n and n_N), amplifiers (value of 2, counted in n_A), and deamplifiers (value of 3, counted in n_D). Adversative conjunctions (value of 4, counted in n_AC) are an additional type only picked up during a sentence-level calculation.
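For concreteness, a construction along these lines, using a few of the package's built-in word lists, would yield the lex object referred to below. The paper's illustration uses nine lexicons in total; the abbreviated selection and the small custom lexicon here are merely placeholders.

R> data("list_lexicons", package = "sentometrics")
R> data("list_valence_shifters", package = "sentometrics")
R> lexiconsIn <- c(list_lexicons[c("LM_en", "HENRY_en", "GI_en")],
+   list(myLexicon = data.table(x = c("nice", "boring"), y = c(1, -1))))
R> lex <- sento_lexicons(lexiconsIn = lexiconsIn,
+   valenceIn = list_valence_shifters[["en"]])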

R> lex[["HENRY_en"]]
x y The individual word lists themselves are data.tables, as displayed above.

Document-level sentiment computation
The simplest way forward is to compute sentiment scores for every text in the corpus. This is handled by the compute_sentiment() function, which works with either a character vector, a 'sento_corpus' object, a quanteda 'corpus' object, a tm 'SimpleCorpus' object, or a tm 'VCorpus' object. The core of the sentiment computation is implemented in C++ through Rcpp (Eddelbuettel and Francois 2011). The compute_sentiment() function has, besides the input corpus and the lexicons, other arguments. The main one is the how argument, to specify the within-document aggregation. In the example below, how = "proportional" divides the net sentiment score by the total number of tokenized words. More details on the contents of these arguments are provided in Section 3.3, when the ctr_agg() function is discussed. See below for a brief usage and output example.
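A minimal call, assuming the uscorpus and lex objects constructed earlier, could look as follows; the resulting 'sentiment' object is a data.table holding one score column per lexicon-feature combination.

R> sentScores <- compute_sentiment(uscorpus, lexicons = lex, how = "proportional")
R> head(sentScores[, 1:6])  # identifier, date, word count, and the first score columns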

Sentence-level sentiment computation
A sentiment calculation at the sentence level instead of at the given corpus unit level requires setting do.sentence = TRUE in the compute_sentiment() function. As an example, the aggregated sentiment scores for the document with identifier "830981846" are a simple average of the sentiment scores of its ten sentences.
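A sketch of such a call, inspecting the sentence scores of the document mentioned above, might be (only the id column is guaranteed here; the example is illustrative):

R> sentSentences <- compute_sentiment(uscorpus, lexicons = lex,
+   how = "proportional", do.sentence = TRUE)
R> sentSentences[id == "830981846"]  # the sentence-level scores of that document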

Multi-language sentiment computation
To run the sentiment analysis for multiple languages, the 'sento_corpus' object needs to have a character "language" identifier column. The names should map to a named list of 'sento_lexicons' objects to be applied to the different texts. The language information should be expressed in the different unique lexicon names. The values for the columns pertaining to a lexicon in another language than the document are set to zero.
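A stylized example, artificially assigning languages to a few documents and assuming the built-in English and translated French General Inquirer word lists are named "GI_en" and "GI_fr_tr", could be:

R> usnewsLang <- usnews[1:10, 1:3]
R> usnewsLang[["language"]] <- rep(c("en", "fr"), 5)  # artificial language labels
R> corpusLang <- sento_corpus(usnewsLang)
R> lexLang <- list(en = sento_lexicons(list_lexicons["GI_en"]),
+   fr = sento_lexicons(list_lexicons["GI_fr_tr"]))
R> sentLang <- compute_sentiment(corpusLang, lexicons = lexLang, how = "proportional")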

Creation of sentiment measures
To create sentiment time series, one needs a well-specified aggregation setup defined via the control function ctr_agg(). To compute the measures in one go, the sento_measures() function is to be used. Sentiment time series unlock the entire scope of the package. We focus the explanation on the control function's central arguments and options, and integrate the other arguments in their discussion:

• howWithin: This argument defines how sentiment is aggregated within the same document (or sentence), setting the weights ω_i in (1). It is passed on to the how argument of the compute_sentiment() function. For binary lexicons and the simple unigrams matching case, the "counts" option gives sentiment scores as the difference between the number of positive and negative words. Two common normalization schemes are dividing the sentiment score by the total number of words ("proportional") or by the number of polarized words ("proportionalPol") in the document (or sentence). A wide range of other weighting schemes is available. They are, together with those for the next two arguments, summarized in Appendix B.
• howDocs: This argument defines how sentiment is aggregated across all documents at the same date (or frequency), that is, it sets the weights θ_n in (2). The time frequency at which the time series have to be aggregated is chosen via the by argument, and can be set to daily ("day"), weekly ("week"), monthly ("month") or yearly ("year"). The option "equal_weight" gives the same weight to every document, while the option "proportional" gives higher weights to documents with more words, relative to the document population at a given date. The do.ignoreZeros argument forces ignoring documents with zero sentiment in the computation of the across-document weights. By default these documents are overlooked. This avoids the incorporation of documents not relevant to a particular feature (as in those cases s^{l,k}_{n,t} is exactly zero, because w^k_{n,t} = 0), which could lead to a bias of sentiment towards zero. 7 When applicable, this argument also defines the aggregation across sentences within the same document.
• howTime: This argument defines how sentiment is aggregated across dates, to smoothen the time series and to acknowledge that sentiment at a given point is at least partly based on sentiment and information from the past. The lag argument takes the role of the maximum lag τ, dictating how far to go back. In the implementation, lag = 1 means no time aggregation and thus b_1 = 1. The "equal_weight" option is similar to a simple weighted moving average, "linear" and "exponential" are two options which give weights to the observations according to a linear or an exponential curve, "almon" does so based on Almon polynomials, and "beta" based on the Beta weighting curve from Ghysels, Sinko, and Valkanov (2007). The last three curves have respective arguments to define their shape(s), being alphasExp and do.inverseExp, ordersAlm and do.inverseAlm, and aBeta and bBeta. These weighting schemes are always normalized to unity. If desired, user-constructed weights can be supplied via weights as a named data.frame. All the weighting schemes define the different b_τ values in (3). The fill argument is of sizeable importance here. It is used to add in dates for which not a single document was available. These added, originally missing, dates are given a value of 0 ("zero") or the most recent value ("latest"). The option "none" accords to not filling up the date sequence at all. Adding in dates (or not) impacts the time aggregation by respectively combining the latest consecutive dates, or the latest available dates.
• nCore: The nCore argument can help to speed up the sentiment calculation when dealing with a large corpus. It expects a positive integer passed on to the setThreadOptions() function from the RcppParallel package (Allaire, Francois, Ushey, Vandenbrouck, Geelnard, and Intel 2021), and parallelizes the sentiment computation across texts. By default, nCore = 1, which indicates no parallelization. Parallelization is expected to improve the speed of the sentiment computation only for sufficiently large corpora, or when using many lexicons.
• tokens: Our unigram tokenization is done with the R package stringi; it transforms all tokens to lowercase, strips punctuation marks and strips numeric characters (see the internal function sentometrics:::tokenize_texts()). If wanted, the texts could be tokenized separately from the sentometrics package, using any desired tokenization setup, and then passed to the tokens argument. This way, the tokenization can be tailor-made (e.g., stemmed 8 ) and reused for different sentiment computation function calls, for example to compare the impact of several normalization or aggregation choices for the same tokenized corpus. Doing the tokenization once for multiple subsequent computation calls is more efficient. In case of a document-level calculation, the input should be a list of unigrams per document. If at sentence-level (do.sentence = TRUE), it should be a list of tokenized sentences as a list of the respective tokenized unigrams.
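As an illustration of supplying an external tokenization, a sketch using quanteda (the pre-processing choices are arbitrary and only serve as an example) could be:

R> tks <- as.list(quanteda::tokens_tolower(quanteda::tokens(uscorpus,
+   remove_punct = TRUE, remove_numbers = TRUE)))
R> sentCustom <- compute_sentiment(uscorpus, lexicons = lex, how = "counts",
+   tokens = tks)  # reuses the same tokenization across calls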
In the example code below, we aggregate sentiment at a weekly frequency, choose a counts-based within-document aggregation, and weight the documents for across-document aggregation proportionally to the number of words in the document. The resulting time series are smoothed according to an equally-weighted and an exponential time aggregation scheme (B = 2), using a lag of 30 weeks. We ignore documents with zero sentiment for across-document aggregation, and fill missing dates with zero before the across-time aggregation, as per default.
R> ctrAgg <- ctr_agg(howWithin = "counts", howDocs = "proportional",
+   howTime = c("exponential", "equal_weight"), do.ignoreZeros = TRUE,
+   by = "week", fill = "zero", lag = 30, alphasExp = 0.2)

The sento_measures() function performs both the sentiment calculation in Step 2 and the aggregation in Step 3, and results in a 'sento_measures' output object. The generic summary() displays a brief overview of the composition of the sentiment time series. A 'sento_measures' object is a list with as most important elements "measures" (the textual sentiment time series), "sentiment" (the original sentiment scores per document) and "stats" (a selection of summary statistics). Alternatively, the same output can be obtained by applying the aggregate() function on the output of the compute_sentiment() function, if the latter is computed from a 'sento_corpus' object.
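Under this control setup, the corresponding call could be written as follows (the object name sentMeas is the one used in the remainder of the illustration):

R> sentMeas <- sento_measures(uscorpus, lexicons = lex, ctr = ctrAgg)
R> summary(sentMeas)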
R> as.data.table(sentMeas)

A 'sento_measures' object is easily plotted across each of its dimensions. For example, Figure 2 shows a time series of average sentiment for both time weighting schemes involved. A display of averages across lexicons and features is achieved by altering the group argument of the plot() method to "lexicons" and "features", respectively.

Manipulation of a 'sento_measures' object
There are a number of methods and functions implemented to facilitate the manipulation of a 'sento_measures' object. Useful methods are subset(), diff(), and scale(). The select and delete arguments in the subset() function indicate which combinations of sentiment measures to extract or delete. Here, the subset() function call returns a new object without all sentiment measures created from the "LM_en" lexicon and without the single time series consisting of the specified combination "SENTICNET" (lexicon), "economy" (feature), and "equal_weight" (time weighting). 10

R> subset(sentMeas, 1:600, delete = list(c("LM_en"),
+   c("SENTICNET", "economy", "equal_weight")))
A sento_measures object (95 textual sentiment time series, 600 observations).
10 The new number of sentiment measures is not necessarily equal to L × K × B anymore once a 'sento_measures' object is modified.
Subsetting across rows is done with the subset() function without specifying an exact argument. A typical example is to subset only a time series range by specifying the dates part of that range, as below, where 50 dates are kept. One can also condition on specific sentiment measures being above, below, or between certain values, or directly indicate the row indices (as shown above).
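A hypothetical date-based subset keeping 50 dates could be written as:

R> dates50 <- get_dates(sentMeas)[51:100]  # an arbitrary window of 50 dates
R> sentMeasSub <- subset(sentMeas, date %in% dates50)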
To ex-post fill the time series date-wise, the measures_fill() function can be used. The function at minimum fills in missing dates within the existing date range at the prevailing frequency. Dates before the earliest and after the most recent date can be added too. The argument fill = "zero" sets all added dates to zero, whereas fill = "latest" takes the most recent known value. This function is applied internally depending on the fill parameter from the ctr_agg() function. The example below pads the time series with dates before the sample start, taking the first value that occurs.
R> sentMeasFill <- measures_fill(sentMeas, fill = "latest",
+   dateBefore = "1995-07-01")
R> head(as.data.table(sentMeasFill))

The sentiment visualized using the plot() function may give a distorted image due to the averaging when there are many different lexicons, features, and time weighting schemes. To obtain a more nuanced picture of the differences in one particular dimension, one can ignore the other two dimensions. For example, corpusPlain below has only the dummy feature, and there is no time aggregation involved (lag = 1). This leaves the lexicons as the sole distinguishing dimension between the sentiment time series.
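The corpusPlain setup could be created along these lines; the aggregation choices other than lag = 1 are placeholders.

R> corpusPlain <- sento_corpus(usnews[, 1:3])  # id, date and texts only, so the dummy feature is added
R> ctrAggLex <- ctr_agg(howWithin = "proportionalPol", howDocs = "equal_weight",
+   by = "week", fill = "none", lag = 1)
R> sentMeasLex <- sento_measures(corpusPlain, lexicons = lex, ctr = ctrAggLex)
R> plot(sentMeasLex, group = "lexicons")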

Sparse regression using the sentiment measures
Step 4 consists of the regression modeling. The sentometrics package offers an adapted interface to sparse regression. Other model frameworks can be explored using as input the sentiment measures extracted through the as.data.table() (or as.data.frame()) function. For example, one could transform the computed sentiment time series into a 'zoo' object from the zoo package (Zeileis and Grothendieck 2005), use any of zoo's functionalities thereafter (e.g., dealing with an irregular time series structure), or run a simple linear regression (here on the first six sentiment variables) as follows:

R> y <- rnorm(nobs(sentMeas))
R> dt <- as.data.table(sentMeas)
R> reg <- lm(y ~ ., data = dt[, 2:7])  # column 1 is the date column

We proceed by explaining the available modeling setup in the sentometrics package. The ctr_model() function defines the modeling setup. The main arguments are itemized; all others are reviewed within the discussion:

• model: The model argument can take "gaussian" (for linear regression), and "binomial" or "multinomial" (both for logistic regression). The argument do.intercept = TRUE fits an intercept.
• type: The type specifies the calibration procedure to find the most appropriate α and λ in (5). The options are "cv" (cross-validation) or one of three information criteria ("BIC", "AIC" or "Cp"). The information criterion approach is available in case of a linear regression only. The argument alphas can be altered to change the possible values for alpha, and similarly so for the lambdas argument. If lambdas = NULL, the possible values for λ are generated internally by the glmnet() function from the R package glmnet. If lambdas = 0, the regression procedure is OLS. The arguments trainWindow, testWindow and oos are needed when model calibration is performed through cross-validation, that is, when type = "cv". The cross-validation implemented is based on the "rolling forecasting origin" principle, considering we are working with time series. 12 The argument do.progress = TRUE prints calibration progress statements. The do.shrinkage.x argument is a logical vector to indicate on which external explanatory variables to impose regularization. These variables, x u , are added through the x argument of the sento_model() function.
• h: The integer argument h shifts the response variable up to y_{u+h} and aligns the explanatory variables in accordance with (4). 13 If h = 0 (by default), no adjustments are made. The logical do.difference argument, if TRUE, can be used to difference the target variable y supplied to the sento_model() function, if it is a continuous variable (i.e., model = "gaussian"). The lag taken is the absolute value of the h argument (given |h| > 0). For example, if h = 2, and assuming the y variable is aligned timewise with all explanatory variables, denoted by X here for the sake of the illustration, the regression performed is of y_{t+2} - y_t on X_t. If h = -2, the regression fitted is y_{t+2} - y_t on X_{t+2}.
• do.iter: To enact an iterative model estimation and a one-step ahead out-of-sample analysis, set do.iter = TRUE. To perform a one-off in-sample estimation, set do.iter = FALSE. The arguments nSample, start and nCore are used for iterative modeling, thus, when do.iter = TRUE. The first argument is M, that is, the size of the sample to re-estimate the model with each time. The second argument can be used to only run a later subset of the iterations (start = 1 by default runs all iterations). The total number of iterations is equal to length(y) - nSample - abs(h) - oos, with y the response variable as a vector. The oos argument partly specifies, as explained above, the cross-validation, but also provides flexibility in defining the out-of-sample exercise. For instance, given t, the one-step ahead out-of-sample prediction is computed at t + oos + 1. As per default, oos = 0. If nCore > 1, the %dopar% construct from the R package foreach (Microsoft and Weston 2020) is utilized to speed up the out-of-sample analysis.

12 As an example, take 120 observations in total, trainWindow = 80, testWindow = 10 and oos = 5. In the first round of cross-validation, a model is estimated for a certain α and λ combination with the first 80 observations, then 5 observations are skipped, and predictions are generated for observations 86 to 95. The next round does the same but with all observations moved one step forward. This is done until the end of the total sample is reached, and repeated for all possible parameter combinations, relying on the train() function from the R package caret. The optimal (α, λ) couple is the one that induces the lowest average prediction error (measured by the root mean squared error for linear models, and overall accuracy for logistic models).

13 If the input response variable is not aligned time-wise with the sentiment measures and the other explanatory variables, h cannot be interpreted as the exact prediction horizon. In other words, h only shifts the input variables as they are provided.
To enhance the intuition about attribution, we estimate a contemporaneous in-sample model and compute the attribution decompositions.
R> ctrInSample <- ctr_model(model = "gaussian",
+   h = 0, type = "BIC", alphas = 0, do.iter = FALSE)
R> fit <- sento_model(sentMeas, y, ctr = ctrInSample)

The attributions() function takes the 'sento_model' object and the related sentiment measures object as inputs, and generates by default attributions for all in-sample dates at the level of individual documents, lags, lexicons, features, and time weighting schemes. The function can be applied to a 'sento_modelIter' object as well, for any specific dates using the refDates argument. If do.normalize = TRUE, the values are normalized between -1 and 1 through division by the ℓ_2-norm of the attributions at a given date. The output is an 'attributions' object. Attribution decomposes a prediction into the different sentiment components along a given dimension, for example, lexicons. The sum of the individual sentiment attributions per date, the constant, and other non-sentiment measures are thus equal to the prediction. Indeed, one can verify in code that the difference between the prediction and the summed attributions plus the constant is zero throughout.
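A minimal sketch in that spirit is given below; rather than extracting the model constant from the underlying glmnet fit, it checks the implied property that two different attribution dimensions sum to the same per-date value. The leading "date" column of each attribution table is an assumption about the 'attributions' layout.

R> attrFit <- attributions(fit, sentMeas)
R> sumFeatures <- rowSums(attrFit[["features"]][, -1, with = FALSE])
R> sumLexicons <- rowSums(attrFit[["lexicons"]][, -1, with = FALSE])
R> all.equal(sumFeatures, sumLexicons)  # both equal the prediction minus the constant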

Application to predicting the CBOE Volatility Index
A noteworthy amount of finance research has pointed out the impact of sentiment expressed through various corpora on stock returns and trading volume, including Heston and Sinha (2017), Jegadeesh and Wu (2013), Tetlock, Saar-Tsechansky, and Macskassy (2008), and Antweiler and Frank (2004). Caporin and Poli (2017) create lexicon-based news measures to improve daily realized volatility forecasts. Manela and Moreira (2017) explicitly construct a news-based measure closely related to the CBOE Volatility Index (VIX) and a good proxy for uncertainty. A more widely used proxy for uncertainty is the Economic Policy Uncertainty (EPU) index (Baker, Bloom, and Davis 2016). This indicator is a normalized text-based index of the number of news articles discussing economic policy uncertainty, from ten large U.S. newspapers. A relationship between political uncertainty and market volatility is found by Pástor and Veronesi (2013).
The VIX measures the annualized option-implied volatility on the S&P 500 stock market index over the next 30 days. It is natural to expect that media sentiment and political uncertainty partly explain the expected volatility measured by the VIX. In this section, we test this using the EPU index and sentiment variables constructed from the usnews dataset. We analyze whether our textual sentiment approach is more helpful than the EPU index in an out-of-sample exercise of predicting the end-of-month VIX in six months. The prediction specifications we are interested in are summarized as follows. The target variable VIX_u is the most recent available end-of-month daily VIX value. We run the predictive analysis for h = 6 months. The sentiment time series are in s_u and define the sentiment-based model (M_s). As primary benchmark, we exchange the sentiment variables for EPU_{u-1}, the level of the U.S. economic policy uncertainty index in month u - 1, which we know is fully available by month u (M_epu). We also consider a simple autoregressive specification (M_ar).
We use the built-in U.S. news corpus of around 4145 documents, in the uscorpus object. Likewise, we proceed with the nine lexicons and the valence shifters list from the lex object used in previous examples. To infer textual features from scratch, we use a structural topic modeling approach as implemented by the R package stm (Roberts, Stewart, and Tingley 2019). This is a prime example of integrating a distinct text mining workflow with our textual sentiment analysis workflow. The stm package works with a quanteda document-term matrix as an input. We perform a fairly standard cleaning of the document-term matrix, and use the default parameters of the stm() function. We group the corpus into eight topics, used as features.
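The topic model referred to above could be estimated roughly as follows; the pre-processing choices and the object name topicModel are assumptions made for illustration.

R> dfmCorp <- quanteda::dfm(quanteda::tokens(uscorpus, remove_punct = TRUE,
+   remove_numbers = TRUE), tolower = TRUE)
R> dfmCorp <- quanteda::dfm_remove(dfmCorp, quanteda::stopwords("en"))
R> dfmCorp <- quanteda::dfm_trim(dfmCorp, min_termfreq = 5)
R> stmIn <- quanteda::convert(dfmCorp, to = "stm")
R> topicModel <- stm::stm(stmIn[["documents"]], stmIn[["vocab"]], K = 8,
+   verbose = FALSE)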
R> topTerms <- t(stm::labelTopics(topicModel, n = 5)[["prob"]])
R> keywords <- lapply(1:ncol(topTerms), function(i) topTerms[, i])
R> names(keywords) <- paste0("TOPIC_", 1:length(keywords))

We use the add_features() function to generate the features based on the occurrences of these keywords in a document, scaling the feature values between 0 and 1 by setting do.binary = FALSE. We also delete all current features. Alternatively, one could use the predicted topics per text as a feature and set do.binary = TRUE, to avoid documents sharing mutual features, instead of relying on the generated keywords. We see a relatively even distribution of the corpus across the generated features.
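The feature generation and the removal of the original features could then be sketched as follows; the quanteda::docvars() route for dropping the old features is an assumption.

R> uscorpus <- add_features(uscorpus, keywords = keywords,
+   do.binary = FALSE, do.regex = FALSE)
R> quanteda::docvars(uscorpus, c("uncertainty", "election", "economy",
+   "noneconomy", "wsj", "wapo")) <- NULL  # keep only the topic-based features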
The package includes the EPU index as the dataset epu. We consider the EPU index values one month before all other variables and use the lubridate package (Grolemund and Wickham 2011) to do this operation. We ensure that the length of the dependent variable is equal to the number of observations in the sentiment measures by selecting based on the proper monthly dates. The pre-processed VIX data is represented by the vix variable.

Figure 4: Textual sentiment time series across latent topic features.
R> data("epu", package = "sentometrics") R> sentMeasIn <-subset(sentMeasPred, date %in% vix$date) R> datesIn <-get_dates(sentMeasIn) R> datesEPU <-lubridate::floor_date(datesIn, "month") %m-% months(1) R> xEPU <-epu[epu$date %in% datesEPU, "index"] R> y <-vix[vix$date %in% datesIn, "value"] R> x <-data.frame(lag = y, epu = xEPU) We apply the iterative rolling forward analysis (do.iter = TRUE) for our 6-month prediction horizon. The target variable is aligned with the sentiment measures in sentMeasIn, such that h = 6 in the modeling control means forecasting the monthly averaged VIX value in six months. The calibration of the sparse linear regression is based on a Bayesian-like information criterion (type = "BIC") proposed by Tibshirani and Taylor (2012). We configure a sample size of M = 60 months for a total sample of N = 232 observations. Our out-of-sample setup is nonoverlapping; oos = h -1 means that for an in-sample estimation at time u, the last available explanatory variable dates from time u − h, and the out-of-sample prediction is performed at time u as well, not at time u − h + 1. We consider a range of alpha values that allows any of the Ridge, LASSO, and pure elastic net regularization objective functions. 15 15 When a sentiment measure is a duplicate of another or when at least 50% of the series observations are equal to zero, it is automatically discarded from the analysis. Discarded measures are put under the "discarded" element of a 'sento_model' object.
R> preds <- predsAR <- rep(NA, nrow(out[["performance"]][["raw"]]))

A simple plot to visualize the out-of-sample fit of any 'sento_modelIter' object can be produced using plot(). We display in Figure 5 the realized values and the different predictions.

R> plot(out) +
+   geom_line(data = melt(data.table(date = names(out$models), "M-epu" = preds,
+     "M-ar" = predsAR, check.names = FALSE), id.vars = "date"))

Table 2 reports two common out-of-sample prediction performance measures, decomposed into a pre-crisis period (spanning up to June 2007, from the point of view of the prediction date), a crisis period (spanning up to December 2009), and a post-crisis period. It appears that sentiment adds predictive power during the crisis period. The flexibility of the elastic net avoids that predictive power is too seriously compromised when adding sentiment to the regression, even when it has no added value.
The last step is to perform a post-modeling attribution analysis. For a 'sento_modelIter' object, the attributions() function generates sentiment attributions for all out-of-sample dates. To study the evolution of the prediction attribution, the attributions can be visualized with the plot() function applied to the 'attributions' output object. This can be done according to any of the dimensions, except for individual documents. Figure 6 shows two types of attributions in separate panels. The attributions are displayed stacked on top of each other, per date. The y-axis represents the attribution to the prediction of the target variable. The third topic was most impactful during the crisis, and the first topic received the most negative post-crisis weight. Likewise, the lexicons attribution conveys an increasing influence of the SO-CAL lexicon on the predictions. Finally, it can be concluded that the predictive role of sentiment is least present before the crisis.

This illustration shows that the sentometrics package provides useful insights into predicting variables like the VIX starting from a corpus of texts. Results could be improved by expanding the corpus, or by optimizing the features generation. For a larger application of the entire workflow, we refer to Ardia et al. (2019). They find that the incorporation of textual sentiment indices results in better predictions of the U.S. industrial production growth rate compared to using a panel of typical macroeconomic indices only.

Conclusion and future development
The R package sentometrics provides a framework to calculate sentiment for texts, to aggregate textual sentiment scores into many time series at a desired frequency, and to use these in a flexible prediction modeling setup. It can be deployed to quantify a qualitative corpus of texts, relate it to a target variable, and retrieve which type of sentiment is most informative through visualization and attribution analysis. The main priorities for further development are integrating better prediction tools, enhancing the complexity of the sentiment engine, allowing user-defined weighting schemes, and adding intra-day aggregation.
If you use R or sentometrics, please cite the software in publications. In case of the latter, use citation("sentometrics").
Additional code examples can be found in the regularly updated "Examples" section at https://sborms.github.io/sentometrics.

Computational details
The results in this paper were obtained using R and, among other packages, tm (Feinerer et al. 2008) and zoo version 1.8.9 (Zeileis and Grothendieck 2005). Computations were performed on a Windows 10 Pro machine, x86_64-w64-mingw32/x64 (64-bit) with Intel(R) Core(TM) i7-7700HQ CPU 2x 2.80 GHz. The code used in the main paper is available in the R script run_vignette.R located in the examples folder on the dedicated sentometrics GitHub repository at https://github.com/SentometricsResearch/sentometrics. R, sentometrics, and all other packages are available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org. Any version under development will be available on our GitHub repository. Additional resources related to sentometrics can be found at https://sentometrics-research.com.

A. Package methods overview
This appendix provides an overview of the R methods made available in the sentometrics package, as also highlighted in Table 1. The S3 class objects from the sentometrics package are created using their function counterpart with the same name, except for the 'sentiment' object (created with the compute_sentiment() function) and the 'sento_modelIter' object (created with the sento_model(..., ctr = ctr_model(..., do.iter = TRUE)) function).
Most of the methods are individually documented; to access the help files do ?method.object (e.g., ?aggregate.sento_measures).
Standard methods

plot() Classes: 'attributions', 'sento_measures', 'sento_modelIter'.
Plots, all in a similar ggplot2 style, respectively, the computed sentiment attributions of an estimated regression model, the constructed sentiment measures, and the target variable versus the predicted outcomes of an iteratively run regression model. The first two can be grouped according to a specific dimension (e.g., by "features").

summary() Classes: 'sento_measures', 'sento_model', 'sento_modelIter'.
Provides a short description of the contents of the respective object. The print() method simply displays the object class; it is also supported for a 'sento_corpus' object and prints like in quanteda.

nobs() Classes: 'sento_measures'.
Gives the number of data points (i.e., rows) in the sentiment measures. The number of sentiment measures can be obtained with the nmeasures() function.

predict() Classes: 'sento_model'.
Generates predictions from the model object for a data 'matrix' of values for the explanatory sentiment measures and other variables.

scale() Classes: 'sento_measures'.
Returns a 'sento_measures' object with scaled sentiment measures. One can also use the center and scale arguments to define values to add to the sentiment measures or divide them by.

as.sento_corpus() Classes: 'corpus', 'SimpleCorpus', 'VCorpus'.
Transforms the given corpus input object into a 'sento_corpus' object, integrating available metadata, where possible, into corpus features.

subset() Classes: 'sento_measures'.
Can be used to do three things: subset the rows (either by index or by a condition), select certain sentiment measures, or delete certain sentiment measures. The selection and deletion is based on the names of the sentiment measures along the features, lexicons, and time-weighting schemes dimensions.

B. Aggregation weighting schemes
This appendix presents the formulas that define the weights used in the different sentiment aggregation schemes available in the package. The constant c indicates a normalization factor that makes sure the considered weights sum up to 1. When not specified, arguments referred to are from the ctr_agg() function.

Within-document and within-sentence weighting
We outline here the different options available for the howWithin argument of the ctr_agg() function and the how argument of the compute_sentiment() function, for the sentiment calculation in (1). The weight ω_i is associated with the unigram at the i-th position in a document (resp. sentence) d_{n,t}, where d serves as a notational shorthand. The number of unigrams in a document (resp. sentence) is Q_d, the number of unigrams in a document (resp. sentence) that appear in the lexicon is n_pol, N is the total number of documents (resp. sentences) in the corpus, and q_i is the number of documents (resp. sentences) across the entire corpus containing unigram i.

Across-document and across-sentence weighting
We outline here the different options available for the howDocs argument of the ctr_agg() function, for the aggregation in (2). The weight θ_n values a document (resp. sentence) d_{n,t} (again, we use d as a shorthand) in the aggregation window (per date for across-document, and per document for across-sentence aggregation). Recall that N_t is the total number of documents at time t, or, abusing the notation, it can similarly represent the number of sentences within a document. The total number of unigrams of all documents (resp. sentences) included in the aggregation window is z.

For a given document (resp. sentence) d_{n,t}, the weight can be one of: