Subjective? Emotional? Emotive?: Language Combinatorics based Automatic Detection of Emotionally Loaded Sentences

This paper presents our research on automatic detection of emotionally loaded, or emotive, sentences. We define the problem from a linguistic point of view, assuming that emotive sentences stand out both lexically and grammatically. To verify this assumption we prepared a text classification experiment in which we applied a language combinatorics approach to automatically extract emotive patterns from training sentences. The applied approach allows automatic extraction not only of widely used unigrams (tokens) or n-grams, but also of more sophisticated patterns with disjointed elements. The results of the experiments are evaluated with standard Precision, Recall and balanced F-score. The algorithm also provides a refined list of the most frequent sophisticated patterns typical for both emotive and non-emotive contexts. The method reached results comparable to the state of the art, while the fact that it is fully automatic makes it more efficient and language independent.


Introduction
Among recent developments in Natural Language Processing (NLP) research, one that has attracted increasing interest is sentiment analysis (SA). The goal of SA research is to distinguish between sentences loaded with positive and negative attitudes, and many methods have been tried for this task. However, little research has focused on a more generic task, namely, discriminating whether a sentence is loaded with emotional content at all. The difficulty of the task is indicated by three facts. Firstly, the task has not been widely undertaken. Secondly, in research which addresses the challenge, the definition of the task is usually based on subjective ad hoc assumptions. Thirdly, in research which does tackle the problem in a systematic way, the results are usually unsatisfactory, and satisfactory results can be obtained only with a large workload.
To approach the problem from a systematic perspective we focused on emotive sentences, which in linguistics are defined as fulfilling the emotive function of language. We also assumed that such sentences contain repetitive patterns representative of content (sentences, utterances, etc.) produced with an emotional attitude, in contrast to content produced with an emotionless attitude. Then, we used a novel algorithm based on the idea of language combinatorics to verify the above assumptions. The method proposed in this paper not only achieved an F-score comparable to the previous handcrafted state of the art while achieving a much higher Recall rate, but also minimized the human effort needed to construct a list of such distinctive emotive patterns.
The outline of the paper is as follows. Firstly, we present the background for this research and define the problem in Section 2. We also present a general literature review discussing the emotive aspects of language, and describe previous research that tries to deal with the problem of discriminating between emotive and non-emotive sentences. Section 3 describes the language combinatorics approach which we used to compare emotive and non-emotive sentences. In Section 4 we describe our dataset and experiment settings. The results of the experiments, together with a discussion, are presented in Section 4.3. Finally, the paper is concluded in Section 5 with several remarks regarding future work.

Background
In normal circumstances, for example during a natural face-to-face conversation, but also in modern means of communication, such as Internet chat, tweet exchange on Twitter, or other online discussion environments, interlocutors are equipped with a number of tools to inform one another that they are in an emotional state. Such tools include both linguistic and paralinguistic means. Traditional linguistics distinguishes several means used particularly to express the emotional, or "emotive" meaning. These include such verbal and lexical means as exclamations [3,14], hypocoristicons (endearments) [8], vulgar language [5] or, for example in Japanese, mimetic expressions (gitaigo) [2]. A key role in expressing emotions is also played by the lexicon of words describing the states of emotions [12]. These might appear in sentences separately, or in combinations. However, when they appear, the recipient (reader/listener) is immediately informed that the speaker/writer has produced their sentence in some kind of emotional state. What exactly the state was might sometimes be ambiguous and context-dependent, but the fact that something emotionally loaded has been conveyed is unquestionable.
The analysis of elements of language such as intonation or tone of voice as well as nonverbal elements, like gestures or facial expressions, is also important in the task of detecting emotional load in a sentence. However, in our research we limit the realization of language (communication channel) to the transmission of lexical symbols, while all nonverbal information is represented by its textual manifestations, like exclamation marks, ellipsis, or emoticons.
The function of language covering the above emotive elements is called the emotive function of language. It was first distinguished by Bühler in his Sprachtheorie [4] as one of three basic functions of language, the other two being the descriptive and the impressive functions. Bühler's theory was picked up further by Jakobson [7], who distinguished three other functions, providing the basis for structural linguistics and communication studies. The realization of the emotive function in language enriches the uttered language with a feature called emotiveness. This feature was widely discussed by Stevenson (1963) [26], who defined it as a strong and persistent linguistic tendency used to inform interlocutors about the speaker's emotions and to evoke corresponding emotions in those to whom the speaker's remarks are addressed. Also, Bahameed [27] argues, after Shunnaq [28], that emotiveness is the speaker's emotive intention embedded in the text through specific language procedures or strategies, some of which convey neutral/objective meaning, whereas others convey emotive/subjective meaning.
To grasp the general view on how emotive meaning is realized within language, we performed a literature review on the general subject of studying emotions from the linguistic, socio-linguistic and cognitive linguistic perspectives. The summary of this literature review is presented in Section 2.1.

Literature review
Research on emotions from a linguistic point of view, although still a young discipline, has already been done to some extent. For example, works of Wierzbicka [24] mark out a fresh track in research on the cognitive linguistics of emotions among different cultures. Fussell [6] approached emotions from a wide cross-disciplinary perspective, trying to investigate the emotion phenomena from three broad areas: background theory of emotions, figurative language use, and social/cultural aspects of emotional communication. Weigand [23] tried to formulate a model of emotions in dialogic interactions, proposing an attempt to explain emotions from the point of view of communication research. As for the Japanese language, which this research focuses on, Ptaszynski [16] made an attempt to explain both the communicative and semiotic functions of emotive expressions, with a specific focus on expressions in Japanese.
Apart from research generalizing about emotions, there is also a wide range of study in the expressions of particular emotion types, or specific expressions of emotions. As for the former, a lifetime work in lexicography performed by Nakamura [12] resulted in the creation of a dictionary devoted particularly to the expressions describing states of emotions in Japanese. As for the latter, for example, Baba [2] studied Japanese mimetics in spoken discourse, Ono [14] studied emphatic functions of Japanese particle -da, and Sasai [20] examined nanto-type exclamatory sentences.
Unfortunately, there has been little linguistic research on more sophisticated emotive patterns in language. For example, a sentence "Oh, what a pleasant weather it is today, isn't it?" contains such emotive elements as the interjection "oh", the exclamatory sentence marker "what a", and the emotive interrogative phrase "isn't it?". However, these emotive elements should rather be perceived as one pattern, like "oh, what a * isn't it?" (we discuss this pattern in more detail in Section 2.3). In fact, this is one of the typical patterns of wh-type exclamative sentences [3]. However, although there are linguistic works investigating such emotive patterns, there has been no research experimentally confirming the existence of such patterns, nor attempts to systematically and automatically extract them from larger text collections. The lack of such research is most likely caused by the limitations of the typical linguistic approach, in which the analysis is usually performed manually. Great help here could be offered by computer-supported corpus analysis.
There have been a number of studies in Computational Linguistics (CL) and Natural Language Processing (NLP) focusing on the task of recognizing whether a sentence is emotionally loaded or not. We describe them in Section 2.2.

Previous research in Emotional Sentence Detection
Detecting whether sentences are loaded with emotional content has been undertaken by a number of researchers, most often as an additional task in either sentiment analysis (SA) or affect analysis (AA). SA, in great simplification, focuses on determining whether a language entity (sentence, document) was written with a positive or negative attitude toward its topic. AA, on the other hand, focuses on specifying which exact emotion type (joy, anger, etc.) has been conveyed. The fact that the task was usually undertaken as a subtask influenced the way it was formulated. Below we present some of the most influential works on the topic, each formulating it in slightly different terms.
Emotional vs. Neutral: Discriminating whether a sentence is emotional or neutral is to answer the question of whether it can be interpreted as produced in an emotional state. This way the task was studied by Minato et al. (2006) [11], Aman and Szpakowicz (2007) [1] or Neviarouskaya et al. (2011) [13].
Subjective vs. Objective: Discriminating between subjective and objective sentences is to say whether the speaker presented the sentence contents from a first-person-centric perspective or from no specific perspective. Research formulating the problem this way includes, e.g., Wiebe et al. (1999) [31], who classified the subjectivity of sentences using a naive Bayes classifier, and a later improvement of this research by Wilson and Wiebe (2005) [25].
In other research Yu and Hatzivassiloglou (2003) [29] used supervised learning to detect subjectivity and Hatzivassiloglou and Wiebe (2000) [30] studied the effect of gradable adjectives on sentence subjectivity.
Emotive vs. Non-emotive: Saying that a sentence is emotive means specifying the linguistic features of language which were used to produce a sentence uttered with emphasis. Research that formulated and tackled the problem this way was done by, e.g., Ptaszynski et al. (2009) [17].
Each of the above formulations implies similar, though slightly different, assumptions. For example, a sentence produced without any emotive characteristics (non-emotive) could still imply an emotional state in some situations. Also, Liu and Zhang (2012) [32] notice that "not all subjective sentences express opinions and those that do are a subgroup of opinionated sentences." A comparison of the scopes and overlaps of the different formulations is represented in Figure 1. In this research we formulate the problem similarly to Ptaszynski et al. (2009) [17]. We also used their system to compare with our method.

Problem definition
The task of discriminating between emotive and non-emotive sentences can be considered a kind of automated text classification, which is a standard task in NLP. Approaches to modeling language in text (or document) classification include Bag-of-Words (BOW) and n-grams. In the BOW model, a text or document is perceived as an unordered set of words; BOW thus disregards both grammar and word order. An approach in which word order is retained is the n-gram approach. First proposed by Shannon [22] over half a century ago, this approach perceives a given sentence as a set of n-long ordered sub-sequences of words. This allows matching the words while retaining the sentence word order. However, the n-gram approach allows only simple sequence matching, while disregarding the structure of the sentence. Although instead of words one could represent a sentence with parts of speech (POS) or dependency structure, the n-gram approach still does not allow extraction or matching of patterns more sophisticated than subsequent strings of elements. An example of such a pattern, more sophisticated than an n-gram, can be explained as follows. A sentence in Japanese, Kyō wa nante kimochi ii hi nanda! (What a pleasant day it is today!), contains the pattern nante * nanda !. Similar cases can easily be found in other languages, for instance in English or Spanish. An exclamative sentence "Oh, she is so pretty, isn't she?" contains the pattern "Oh * is so * isn't *?". In Colombian Spanish, the sentences "¡Qué majo está carro!" (What a nice car!) and "¡Qué majo está chica!" (What a nice girl!) contain a common pattern "¡Qué majo está * !" (What a nice * !). With another sentence, like "¡Qué porquería de película!" (What a crappy movie!), we can obtain a higher-level generalization of this pattern, namely "¡Qué * !" (What a * !), which is a typical wh-exclamative sentence pattern [3,15]. The existence of such patterns in language is common and well recognized. However, it is not possible to discover such subtle patterns using only the n-gram approach.
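To illustrate, matching such a gapped pattern against a sentence is straightforward once the pattern exists. The sketch below is our own illustration (not the authors' implementation): it converts a pattern whose "*" wildcards stand for arbitrary runs of elements into a regular expression.

```python
import re

def pattern_to_regex(pattern):
    """Convert a gapped pattern ('*' stands for any run of elements)
    into a compiled regular expression."""
    parts = [re.escape(p.strip()) for p in pattern.split("*")]
    return re.compile(".*".join(parts))

rx = pattern_to_regex("Oh, * is so * isn't *?")
print(rx.search("Oh, she is so pretty, isn't she?") is not None)  # matches
print(rx.search("It is a nice day.") is not None)                 # no match
```

Note that matching patterns is cheap; the hard part, addressed by the method below, is discovering which gapped patterns are worth matching in the first place.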
In our research we aimed to contribute to dealing with the above problems. To do this we applied a method for language modeling and for extracting frequent sophisticated patterns from unrestricted text. We also performed a text classification task with the use of such patterns. The method is based on the language combinatorics (LC) idea developed by Ptaszynski et al. (2011) [19].

Pattern-based language modelling method
To deal with the sophisticated patterns mentioned in Section 2.3 we applied a language modeling method based on the idea of language combinatorics [19]. This idea assumes that linguistic entities, such as sentences, can be perceived as bundles of ordered, non-repeated combinations of elements (words, punctuation marks, etc.). Furthermore, the most frequent combinations appearing across different sentences from one collection can be perceived as patterns specific to this collection.
This idea does not limit the notion of a pattern to n-grams, and assumes that sophisticated patterns with disjointed elements will provide better results than the usual bag-of-words or n-gram approaches. Defining sentence patterns this way allows their automatic extraction by generating all ordered combinations of sentence elements and verifying their occurrences within a specified corpus.
Algorithms using a combinatorial approach first generate a massive number of combinations, the potential answers to a given problem. For this reason such algorithms are sometimes called brute-force search algorithms. The brute-force approach often faces the problem of exponential and rapid growth of function values during combinatorial manipulations, a phenomenon known as combinatorial explosion [9]. Since this phenomenon often results in very long processing times, combinatorial approaches have often been disregarded. We assumed, however, that combinatorial explosion can be contained on modern hardware to the extent needed in our research. Moreover, optimizing the combinatorial algorithm to the problem requirements should shorten the processing time, making it advantageous in language processing tasks. Ptaszynski et al. [19] in their preliminary experiments verified the number of generated patterns in comparison to n-grams, and evaluated their validity using SPEC, a generic sentence pattern extraction architecture. According to that evaluation, in language processing tasks it is not necessary to generate patterns of all lengths, since the most useful ones usually appear among patterns 2 to 5 elements long.
Based on the above assumptions we propose a method for automatic extraction of frequent sentence patterns distinctive for a corpus, and perform a sentence classification experiment by training a classifier on the extracted patterns. Firstly, ordered non-repeated combinations are generated from all elements of a sentence. In every n-element sentence there are k-element combination clusters, such that 1 ≤ k ≤ n, where each cluster contains all k-element combinations of the n elements. The number of combinations generated for one k-element group of combinations is equal to the binomial coefficient, as in equation 1:

nCk = n! / (k! (n - k)!)   (1)

In this procedure the system creates all combinations for all values of k from the range {1, ..., n}. Therefore the number of all combinations is equal to the sum of combinations from all k-element combination groups, as in equation 2:

sum_{k=1}^{n} nCk = 2^n - 1   (2)
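The generation step can be sketched as follows (a minimal illustration with our own names and a toy sentence, not the authors' code); the assertion checks the total count against equation 2:

```python
from itertools import combinations
from math import comb

def all_ordered_combinations(elements):
    """Yield every ordered, non-repeated combination of sentence elements,
    for all lengths k in {1, ..., n}."""
    n = len(elements)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):  # index tuples preserve word order
            yield tuple(elements[i] for i in idx)

sentence = ["Kyo", "wa", "nante", "kimochi", "ii", "hi", "nanda", "!"]
combos = list(all_ordered_combinations(sentence))
n = len(sentence)
# the total equals the sum of binomial coefficients, i.e. 2^n - 1
assert len(combos) == sum(comb(n, k) for k in range(1, n + 1)) == 2**n - 1
```

For the 8-element toy sentence this yields 255 combinations, which shows how quickly the search space grows with sentence length.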
Next, all non-subsequent elements are separated with an asterisk ("*"). For all patterns generated this way, their occurrences O are used to calculate a normalized weight w_j. In the task presented in this paper we apply the method to distinguish between sentences containing emotive patterns and sentences containing non-emotive patterns. Therefore the weight is calculated as the ratio of all occurrences in one corpus, O_pos, to the sum of all occurrences in both corpora, O_pos + O_neg, and normalized to the range from +1 (representing purely emotive patterns) to -1 (representing purely non-emotive patterns) by subtracting 0.5 from the initial ratio and multiplying the result by 2, as in equation 3:

w_j = (O_pos / (O_pos + O_neg) - 0.5) × 2   (3)

The score of a sentence is calculated as the sum of the weights of all patterns found in the sentence, as in equation 4:

score = Σ_j w_j   (4)
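The asterisk separation and the weight normalization of equation 3 can be sketched as below (an illustration under our own naming, not the authors' implementation):

```python
def to_pattern(indices, elements):
    """Join the chosen elements, inserting '*' wherever the chosen
    positions are not adjacent in the original sentence."""
    out = [elements[indices[0]]]
    for prev, cur in zip(indices, indices[1:]):
        if cur - prev > 1:
            out.append("*")
        out.append(elements[cur])
    return " ".join(out)

def weight(o_pos, o_neg):
    """Normalized weight (equation 3): +1 = purely emotive,
    -1 = purely non-emotive."""
    return (o_pos / (o_pos + o_neg) - 0.5) * 2

sentence = ["Kyo", "wa", "nante", "kimochi", "ii", "hi", "nanda", "!"]
print(to_pattern((2, 6, 7), sentence))  # nante * nanda !
print(weight(10, 0), weight(0, 10), weight(5, 5))  # 1.0 -1.0 0.0
```

The example indices (2, 6, 7) recover the nante * nanda ! pattern from Section 2.3: positions 2 and 6 are non-adjacent, so an asterisk is inserted between them, while 6 and 7 are adjacent.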
The weight can be later modified in several ways. Two features are important in weight calculation: a pattern is the more representative of a corpus the longer it is (length k), and the more often it appears in the corpus (occurrence O). Thus the weight can be modified by:
• awarding length,
• awarding both length and occurrence.
The list of frequent patterns generated in the process of pattern generation and extraction can also be further modified. When two collections of sentences of opposite features (such as "positive vs. negative", or "emotive vs. non-emotive") are compared, the generated list will contain patterns that appear uniquely on only one of the sides (e.g., uniquely positive patterns and uniquely negative patterns) or on both (ambiguous patterns). Therefore the pattern list can be further modified by deleting:
• all ambiguous patterns (whose weight is not +1 or -1, but somewhere in between),
• only those ambiguous patterns which appear in the same number on both sides (later called "zero patterns", since their normalized weight is equal to 0).
Moreover, a list of patterns will contain both the sophisticated patterns (with disjointed elements) and the more common n-grams. Therefore the experiments could be performed on either all patterns, or n-grams only.
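The list modifications above amount to simple filters over the pattern-weight list. A minimal sketch (function and mode names are ours, chosen for illustration):

```python
def prune(pattern_weights, mode):
    """pattern_weights maps pattern -> normalized weight in [-1, +1]."""
    if mode == "drop_ambiguous":   # keep only purely one-sided patterns
        return {p: w for p, w in pattern_weights.items() if abs(w) == 1.0}
    if mode == "drop_zero":        # drop only 'zero patterns' (weight 0)
        return {p: w for p, w in pattern_weights.items() if w != 0.0}
    return dict(pattern_weights)   # unmodified list

weights = {"nante * nanda !": 1.0, "kyo wa": 0.2, "wa *": 0.0, "desu .": -1.0}
print(prune(weights, "drop_ambiguous"))  # keeps only the +1 / -1 patterns
print(prune(weights, "drop_zero"))       # removes only 'wa *'
```

The two modes correspond to the two deletion strategies listed above; the unmodified list is kept as a baseline setting.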
Furthermore, if the initial collection of sentences was biased toward one of the sides (e.g., more sentences of one kind, or longer sentences, etc.), there will be more patterns of a certain sort. Thus, to avoid bias in the results, instead of applying a rule of thumb, the classification threshold is optimized automatically.
The above settings are automatically verified in the process of evaluation (10-fold cross-validation) to choose the best model. The metrics used in evaluation are standard Precision (P), Recall (R) and balanced F-score (F). Finally, to deal with the combinatorial explosion mentioned at the beginning of this section we applied two heuristic rules. In their preliminary experiments Ptaszynski et al. [19] found that the most valuable patterns in language usually contain no more than six elements. Therefore we limited the scope to k ≤ 6. Thus the procedure of pattern generation will (1) generate up to 6-element patterns, or (2) terminate at the point where no more frequent patterns are found. A diagram of the whole system is represented in Figure 2.
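The k ≤ 6 cap can be applied directly in the generation loop; the cap value comes from the paper, while the code itself is our own sketch (the early-termination rule, which needs corpus frequency counts, is omitted here):

```python
from itertools import combinations
from math import comb

MAX_K = 6  # heuristic cap: useful patterns rarely exceed six elements

def generate_capped(n):
    """Yield index combinations of up to MAX_K elements
    for an n-element sentence."""
    for k in range(1, min(n, MAX_K) + 1):
        yield from combinations(range(n), k)

# for an 8-element sentence: 246 combinations instead of 2^8 - 1 = 255
assert sum(1 for _ in generate_capped(8)) == sum(comb(8, k) for k in range(1, 7))
```

The saving is modest for short sentences but grows quickly, since the dropped terms are the largest binomial coefficients for long sentences.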

Dataset preparation
In the experiments we used a dataset developed by Ptaszynski et al. (2009) [17] for the needs of evaluating their affect analysis system ML-Ask for Japanese language. The dataset contains 50 emotive and 41 non-emotive sentences. It was created in the following way.
Ptaszynski et al. performed an anonymous survey of thirty participants of different age and social groups. Each of them was asked to imagine or remember a conversation with any person or persons they know and write three sentences from that conversation: one free, one emotive, and one non-emotive. Additionally, the participants were asked to make the emotive and non-emotive sentences as close in content as possible, so that the only perceivable difference was whether a sentence was loaded with emotion or not. After that the participants also tagged the free utterances they had written as to whether or not they were emotive. Some examples from the dataset are represented in Table 1.
The system takes as an input sentences separated into elements (words, tokens, etc.). Therefore we needed to preprocess the dataset and make the sentences separable into elements. We did this in five ways to check how the preprocessing influences the results.
We used MeCab 3, a morphological analyzer for Japanese, to preprocess the sentences from the dataset in the five following ways:
• Tokenization: all words, punctuation marks, etc. are separated by spaces.
• Lemmatization: like the above, but the words are represented in their generic (dictionary) forms (lemmas); for example, "went" is normalized to "go".
• Parts of speech (POS): words are replaced with their part-of-speech labels.
• Tokens with POS: tokens are accompanied by their part-of-speech labels.
• Lemmas with POS: lemmas are accompanied by their part-of-speech labels.

Figure 2. Diagram of the whole system.

Examples of the preprocessing are represented in Table 2. In theory, the more generalized a corpus is, the fewer unique patterns it will produce, but the produced patterns will be more frequent. This can be explained by comparing a tokenized sentence with its POS representation. For example, in the sentence from Table 2 we can see that a simple phrase kimochi ii ("feeling good / pleasant") can be represented by a POS pattern N ADJ. We can easily assume that there will be more N ADJ patterns than kimochi ii, because many word combinations can be represented as N ADJ. Since there are more words in the dictionary than POS labels, the POS patterns will come in less variety but with higher occurrence frequency. By comparing the results of classification using the different preprocessing methods, we can find out whether it is better to represent sentences as more generalized or as more specific.
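The five representations can be sketched over a single morphologically analyzed token list. The triples below are simplified illustrations of what an analyzer such as MeCab provides, not actual MeCab output, and the mode names are ours:

```python
# (surface, lemma, POS) triples, as a morphological analyzer such as
# MeCab would provide for Japanese (values here are simplified examples)
analyzed = [("kimochi", "kimochi", "N"), ("ii", "ii", "ADJ"), ("hi", "hi", "N")]

def preprocess(tokens, mode):
    """Return the sentence elements under one of the five representations."""
    if mode == "tokens":
        return [s for s, _, _ in tokens]
    if mode == "lemmas":
        return [l for _, l, _ in tokens]
    if mode == "pos":
        return [p for _, _, p in tokens]
    if mode == "tokens+pos":
        return [f"{s}/{p}" for s, _, p in tokens]
    if mode == "lemmas+pos":
        return [f"{l}/{p}" for _, l, p in tokens]
    raise ValueError(mode)

print(preprocess(analyzed, "pos"))         # ['N', 'ADJ', 'N']
print(preprocess(analyzed, "tokens+pos"))  # ['kimochi/N', 'ii/ADJ', 'hi/N']
```

Each mode feeds the same pattern-generation procedure, so only the element vocabulary changes between the five experiments.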

Experiment setup
The preprocessed dataset provides five separate datasets for the experiment. The experiment was performed five times, once for each kind of preprocessing. For each version of the dataset a 10-fold cross-validation was performed and the results were calculated using Precision, Recall, balanced F-score, and Accuracy for the whole threshold span. There were two winning conditions. Firstly, we looked at which version of the algorithm achieves the top score within the threshold span, checking the best F-score and Accuracy separately. However, theoretically, an algorithm could achieve its best score for only one particular threshold, while for others it could perform poorly. Therefore we also wanted to know which version of the algorithm achieves the highest score over the longest threshold span. We calculated this as the sum of scores over all thresholds, which shows whether the algorithm is balanced within the threshold span. Additionally, we also checked which version obtained the highest Break-Even Point (BEP) of Precision and Recall. Finally, we checked the statistical significance of the results using a paired Student's t-test, since each classification could take only one of two values (emotive or non-emotive). To choose the best version of the algorithm we compared the results achieved by each group of modifications: "different pattern weight calculations", "pattern list modifications" and "patterns vs. n-grams". All classifier version abbreviations are listed in the Appendix at the end of this paper. We also compared the performance to the state of the art, namely the affect analysis system ML-Ask developed by Ptaszynski et al. (2009) [17].
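The threshold-span evaluation can be sketched as a scan over thresholds, computing Precision, Recall and F-score at each point (our illustration; the toy scores below are invented, not taken from the dataset):

```python
def prf(tp, fp, fn):
    """Standard Precision, Recall and balanced F-score."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def threshold_sweep(scores, labels, thresholds):
    """Evaluate the classifier at each threshold of the span."""
    results = []
    for t in thresholds:
        preds = [s >= t for s in scores]
        tp = sum(pr and y for pr, y in zip(preds, labels))
        fp = sum(pr and not y for pr, y in zip(preds, labels))
        fn = sum(not pr and y for pr, y in zip(preds, labels))
        results.append((t, *prf(tp, fp, fn)))
    return results

# toy data: sentence scores from equation 4, True = emotive
scores = [0.9, 0.4, -0.2, -0.8]
labels = [True, True, False, False]
for t, p, r, f in threshold_sweep(scores, labels, [0.5, 0.0, -0.5]):
    print(t, p, r, f)
```

Summing the per-threshold F-scores over the span then gives the "balance" criterion described above, and the threshold where Precision and Recall cross gives the BEP.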

Experiment results
One of the main questions when using the language combinatorics approach is whether it is necessary at all to use the sophisticated patterns in classification. It could happen that the usual n-gram based approach is equally effective. Moreover, if the n-gram based approach were sufficient, it would be not only equally good but also advisable to reject the combinatorial approach, since the processing time needed to learn all patterns is incomparably longer.

Tokenized dataset
At first we checked the version of the algorithm using only tokenized sentences. The F-score results for tokenized sentences were not unequivocal. Usually patterns achieved higher scores for higher thresholds, while for lower thresholds the results were similar, or n-grams scored higher than patterns. However, in all situations where n-grams achieved visibly better results, the differences were not statistically significant. The highest score optimized for F-score was F = 0.754 with P = 0.605 and R = 1 for n-grams, and F = 0.739 with P = 0.599 and R = 0.963 for patterns. The algorithm usually reached its optimal F-score around 0.73-0.74. An example of an F-score comparison between n-grams and patterns is represented in Figure 5. All F-scores for the tokenized dataset are represented in Figure 3. The highest score optimized for Accuracy was achieved by patterns, with A = 0.649, F = 0.72, P = 0.663, and R = 0.787. All scores optimized for F-score and Accuracy are represented in Table 3.
When it comes to Precision, there was always at least one threshold for which n-grams achieved a better Precision score than patterns. On the other hand, the Precision scores for patterns were more balanced, starting with a high score and slowly decreasing along the threshold span (from 1 to -1), while n-grams, although they did achieve better results for several thresholds, always started from a lower position and for lower thresholds more or less equaled their scores with patterns.
Recall scores were better for patterns within most of the threshold span, with the results converging as the threshold decreases. However, the differences were not evident and rarely statistically significant.

Lemmatized dataset
Next, we tried a different preprocessing, namely, using sentences lemmatized (tokens converted to their original dictionary form, or lemmas). In theory this allows generation of a smaller number of patterns, but with higher occurrence frequency, since, e.g., different declensions of an adjective or conjugations of a verb become represented the same way.
By changing the preprocessing from tokens to lemmas the results became more straightforward. Patterns were usually better, especially for higher thresholds, while for lower thresholds the results were similar, with n-grams occasionally scoring slightly higher. The results were in most cases statistically significant (p<0.05), often very significant (p<0.01), or extremely significant (p<0.001). Similarly to tokenized sentences, the algorithm reached its optimum at around 0.73-0.74 of F-score. The highest score optimized for F-score was F = 0.744 with P = 0.666 and R = 0.843, or P = 0.646 and R = 0.877, for patterns. The highest score optimized for Accuracy was A = 0.661, with F = 0.744, P = 0.666, and R = 0.843. All scores optimized for F-score and Accuracy are represented in Table 4.

Table 3. Comparison of best F-scores and Accuracies for the tokenized dataset within the threshold span for each version of the classifier. The best classifier version within each preprocessing kind is highlighted in bold type font.
An example of F-score comparison between n-grams and patterns for this dataset is represented in Figure 7.
In all versions of the algorithm for this preprocessing, patterns achieved the highest Precision scores in comparison with n-grams, though not over the whole threshold span. For many thresholds the results for patterns and n-grams crossed each other, making the differences less significant. The highest overall Precision scores were P = 0.767 for patterns and P = 0.727 for n-grams.
When it comes to Recall, the pattern-based approach achieved significantly higher Recall scores across the board for all compared cases, over nearly the whole threshold span. Thus, since patterns catch much more of the data (higher Recall) while reaching the highest Precision scores, above n-grams, it can be said that using patterns is much more effective for this type of sentence preprocessing.

Parts-of-speech dataset
Next, we verified the performance using sentences preprocessed to represent Parts-of-speech (POS) information (nouns, verbs, etc.). In theory this type of preprocessing should provide more generalized patterns than tokens or lemmas, with smaller number of patterns but with higher occurrence frequencies.
Interestingly, F-scores for the algorithm with POS-preprocessed sentences revealed less constancy than for tokens. For most cases n-grams scored higher than patterns, but only very few results reached statistical significance. The highest F-scores were F = 0.774 with P = 0.704 and R = 0.86 for both n-grams and patterns. Similarly to tokens, the algorithm was usually optimized at an F-score around 0.73-0.74.
Slightly lower scores for patterns in this case could suggest that the algorithm itself works better with less abstracted, more specific preprocessing.
The highest score optimized for F-score was F = 0.774 with P = 0.704 and R = 0.86 for n-grams. This was also the highest score optimized for Accuracy. All scores optimized for F-score and Accuracy are represented in Table 5.
Results for Precision were ambiguous. For some versions of the algorithm (e.g., unmodified, zero pattern deleted) it was better for patterns, while for others (e.g., length awarded) n-grams scored higher. The highest achieved Precision for patterns was 0.723, while for n-grams 0.706.
Results for Recall confirm the results for tokens. Patterns achieved significantly higher Recall across the board.

Tokens-POS dataset
Next we used sentences preprocessed so that they included both tokens and POS information. While in the previous preprocessing the elements were more abstract (POS), the token-POS preprocessing makes the elements more specific, thus allowing the extraction of a larger number of less frequent patterns.
For most cases the pattern-based approach achieved significantly better results, with the difference between n-grams and patterns being in most cases very or extremely significant (p < 0.01 or p < 0.001, respectively).
The highest score optimized for F-score was F = 0.769 with P = 0.647 and R = 0.947 for n-grams, and F = 0.764 with P = 0.656 and R = 0.913 for patterns. The algorithm usually reached its optimal F-score values around 0.75-0.76. As for the scores optimized for Accuracy, the highest achieved Accuracy was 0.676, with F = 0.762, P = 0.674, and R = 0.877. All scores optimized for F-score and Accuracy are represented in Table 6.
A comparison of F-scores for all experiment settings for the two datasets (tokenized and tokens with POS) is presented in Figures 3 and 4. An additional graph showing both Precision and Recall with the break-even point (BEP) for this F-score is presented in Figure 6.
The results for Precision were not as straightforward as for F-score. In many cases patterns scored higher, but not across the whole threshold span. However, the highest Precision was achieved by patterns, with P = 0.867 and R = 0.397.
Recall was usually better for patterns, with the scores getting closer as the threshold decreased.

Lemma-POS dataset
For the version of the algorithm using sentences preprocessed to contain both lemma and POS information, patterns reached better F-score results than n-grams in almost all cases and for most thresholds (from 1 to -1). Patterns were especially better for higher thresholds, while for lower thresholds the results were similar, with n-grams occasionally scoring slightly higher. The differences were nearly always statistically significant (p < 0.05), often very significant (p < 0.01) or extremely significant (p < 0.001).
The highest score optimized for F-score was F = 0.746 with P = 0.595 and R = 1 for patterns. As for the scores optimized for Accuracy, the highest achieved Accuracy was 0.656 with F = 0.659, P = 0.745, and R = 0.59. All scores optimized for F-score and Accuracy are presented in Table 7. An example of F-score comparison between n-grams and patterns is presented in Figure 8.
When it comes to Precision alone, the results were not as straightforward as for F-score. Most often Precision for patterns and n-grams was similar, with n-grams often achieving a slightly higher peak score than patterns. However, the pattern-based algorithm did not lose Precision across the whole threshold span the way the n-gram-based algorithm did, which means the pattern-based algorithm can be considered more balanced. Furthermore, for n-grams, Precision in higher thresholds usually begins from a very low score.

Table 4. Comparison of best F-scores and Accuracies within the threshold span for the lemmatized dataset, for each version of the classifier. The best classifier version within each preprocessing kind is highlighted in bold type font.

While Precision was often comparable between patterns and n-grams, when it comes to Recall, the pattern-based approach achieved significantly higher Recall scores across the board. Thus, since patterns catch much more of the data (higher Recall) while retaining Precision, we can conclude that using patterns is much more effective. This is also reflected in the F-score values.

Break-Even Point Analysis
The results so far indicated the following. Patterns usually achieve better scores in general: for the same classifier settings at the same threshold, the pattern-based classifier is usually much better. The improvement is most visible in Recall, which influences the overall F-score. This can especially be noticed by comparing the results in Figure 5 with those in Figures 7 and 8. However, in specific cases the n-gram based classifier also scored higher. Therefore we performed an additional analysis of the results by comparing the Break-Even Points (BEP) for all classifier versions. The BEP is the point where Precision and Recall cross, meaning the values of P and R, as well as the F-score, are equal. In theory, a higher BEP means the classifier is more balanced: it extracts more relevant cases and classifies them correctly. A comparison of the BEPs for all classifier versions and experiment settings is presented in Table 8.
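A BEP can be located programmatically as the point on the precision-recall curve where the two values are closest. The sketch below uses invented (threshold, Precision, Recall) triples, not values from the experiment:

```python
# Invented precision-recall curve over the classifier's threshold span.
curve = [
    (1.0, 0.90, 0.40),
    (0.5, 0.80, 0.60),
    (0.0, 0.72, 0.72),   # Precision equals Recall here
    (-0.5, 0.60, 0.85),
]

# The BEP is where Precision and Recall cross (are closest).
threshold, p, r = min(curve, key=lambda row: abs(row[1] - row[2]))
bep = (p + r) / 2
print(threshold, bep)  # 0.0 0.72 -- at the BEP, F-score equals P and R
```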
The comparison indicated, similarly to the previous analysis, that the dataset containing both tokens and POS performed best in almost all cases, achieving the highest BEP. This confirms the suggestion that the algorithm works better on more specific, less generalized features. The best BEP of all, with P = R = F = 0.723, was achieved by the n-gram based classifier awarding pattern length in weight calculation.

Table 5. Comparison of best F-scores and Accuracies within the threshold span for the POS-annotated dataset, for each version of the classifier. The best classifier version within each preprocessing kind is highlighted in bold type font.

Analysis of Significance Test Results
We also compared the statistical significance of differences between the results. We used Student's paired t-test, since each result could represent only one of two classes (emotive or non-emotive). All results of the statistical significance tests are presented in Table 11 for F-score and Table 12 for Accuracy.
As a result, especially for F-scores, differences between pattern based classifiers were almost always significant, meaning that when an improvement appeared it was usually reliable. Differences among n-gram based classifiers were less often significant. When it comes to differences for the same parameter settings between n-gram and pattern based classifiers, the majority, but not all, of the differences were statistically significant. The two classifier settings which achieved the highest BEP, namely NGR-LA and NGR-LA-0P for the token-POS dataset, were significant when compared to most of the remaining settings.
In general, statistical significance was better for datasets with more specific features (token-POS, lemmatized, lemma-POS).

Table 6. Comparison of best F-scores and Accuracies within the threshold span for the tokenized dataset with POS, for each version of the classifier. The best classifier version within each preprocessing kind is highlighted in bold type font.

Detailed analysis of learned patterns
Among the most frequently appearing emotive patterns were, for example: ! (exclamation mark), n*yo, cha (emotive verb modification), yo (exclamative sentence ending particle), ga*yo, n*!, n desu, and naa (interjection). Some examples of sentences containing those patterns are given below (patterns underlined). Interestingly, most of those patterns appear in the hand-crafted databases of ML-Ask developed by Ptaszynski et al. (2009) [17], although in single word form. This suggests that it could be possible to improve ML-Ask performance by extracting additional patterns with SPEC.
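To illustrate how a pattern with a disjointed element (e.g., n*yo) differs from a plain n-gram, the sketch below checks for an ordered occurrence of pattern elements with arbitrary gaps between them. The helper name and the example sentence are ours for illustration, not part of the SPEC implementation:

```python
def contains_pattern(elements, sentence):
    """True if the elements occur in the sentence in the given order,
    allowing arbitrary gaps between them (a disjointed pattern).
    A plain n-gram would additionally require the elements to be adjacent."""
    pos = 0
    for elem in elements:
        try:
            # Find the element at or after the current position.
            pos = sentence.index(elem, pos) + 1
        except ValueError:
            return False
    return True

sent = "sore wa ii n desu yo".split()
print(contains_pattern(["n", "yo"], sent))  # True: "n ... yo" with a gap
print(contains_pattern(["yo", "n"], sent))  # False: elements out of order
```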

Comparison to State-of-the-Art
The affect analysis system ML-Ask, developed by Ptaszynski et al. [17], reached the following results on the same dataset: F-score = 0.79, Precision = 0.8 and Recall = 0.78. The results were generally comparable, though slightly higher for ML-Ask in overall F-score and Precision; Recall was always better for the proposed method. However, ML-Ask is a system developed mostly manually over several years and is based specifically on linguistic knowledge concerning the emotive function of language. The proposed method, on the other hand, is fully automatic and does not need any particular preparations. Therefore, when performing a similar task for other languages, it would be more efficient to use our method rather than ML-Ask, since it simply learns the patterns from data, while ML-Ask would require laborious preparation of appropriate databases.

Conclusions and future work
We presented a method for automatic extraction of patterns from emotive sentences. We assumed emotive sentences stand out both lexically and grammatically and performed experiments to verify this assumption. In the experiments we used a set of emotive and non-emotive sentences.
The patterns extracted from those sentences were applied to recognize emotionally loaded and non-emotional sentences. We applied different preprocessing techniques (tokenization, POS, lemmatization, and combinations of those) to find the best version of the algorithm.
The algorithm reached its optimal F-score of 0.75-0.76 for preprocessed sentences containing both token and POS information, with Precision equal to 0.64 and Recall to 0.95. Precision for patterns, when compared to n-grams, was balanced, while for n-grams, although occasionally achieving high scores, Precision decreased quickly. Recall scores were almost always better for patterns within most of the threshold span. From the fact that the results for sentences represented as POS alone were lower than the rest, we conclude that the algorithm works better with less abstracted, more specific elements.
The results of the proposed method were compared to state-of-the-art affect analysis system ML-Ask and Support Vector Machine classifier. The results for SVM were usually lower, while results of the proposed method and ML-Ask were comparable.
ML-Ask achieved better Precision, but lower Recall. However, since our method is fully automatic, it would be more efficient to use it for other languages. Moreover, many of the automatically extracted patterns appear in the hand-crafted databases of ML-Ask, which suggests that ML-Ask performance could be improved by extracting additional patterns automatically with our method. Finally, the method is language independent, while ML-Ask has been developed only for Japanese. In the near future we plan to perform experiments on datasets in other languages, as well as on larger datasets, to analyze the scalability of the algorithm.