Grammaticalization degrees in Catalan anar vs. estar + adjective in the 19th and 20th centuries: A language contact, corpus-based distributional approach
Marc Gandarillas. Isogloss 2020, 6/7

The present study constitutes an exploratory analysis of the use of the Catalan constructions anar ('to go') + adjective and estar ('to be') + adjective in writing during the 19th and 20th centuries. In modern Catalan, anar and estar can be used before an adjective as copulas, with an identical or similar meaning (e.g., en Joan va begut, en Joan està begut 'John is drunk'). Both constructions show evidence of being highly grammaticalized and are apparently found in free variation rather than complementary distribution. However, closer examination seems to reveal certain underlying linguistic factors that might inform this contrast, specifically within the semantic-pragmatic sphere. This is explored here through the narrow definition of grammaticalization (Hopper & Traugott 2003): a multistage process by which lexical (i.e., content) items eventually become functional (i.e., grammatical) over time. A corpus-based search was conducted using the electronic version of the Corpus Textual Informatitzat de la Llengua Catalana (CTILC), which comprises literary and non-literary texts published between 1833 and 1988. Anar + adjective and estar + adjective tokens were obtained from a sample of texts, then sorted into four different groups corresponding to century halves. The search results from the corpus were randomly sorted in terms of genre and date and, from these, the top 250 tokens on the list of results for each construction were selected and subsequently analyzed. Whereas few instances of both constructions are found overall in the first half of the 19th century, estar already prevails over anar when followed by an adjective. During the first half of the 20th century, however, this trend starts to be reversed. This suggests the possibility that anar and estar might have traditionally been found in free variation in a number of contexts, with 19th century texts showing a clear preference for estar.


Introduction
The present study constitutes an exploratory analysis of the use of the Catalan constructions anar ('to go') + adjective and estar ('to be') + adjective in writing during the 19th and 20th centuries. In modern Catalan, anar and estar can be used before an adjective as copulas, with an identical or at least very similar meaning (e.g., en Joan va begut, en Joan està begut 'John is drunk'). Both constructions show evidence of being highly grammaticalized and are apparently found in free variation rather than complementary distribution. However, closer examination seems to reveal certain underlying linguistic factors that might inform this contrast, specifically within the semantic/pragmatic sphere. This is explored here using grammaticalization theory as the overarching framework.
Specifically, I adopt the usage-based definition of grammaticalization provided by Hopper and Traugott (2003: Ch. 1), according to which this phenomenon can be defined as the multistage process by which a lexical (i.e., content) item eventually becomes functional (i.e., grammatical) over time. According to this definition, interaction is key in promoting language change. Consequently, Hopper and Traugott (2003) approach grammaticalization from a diachronic perspective, as they focus on every potential factor underlying the conversion of semantic substance into (more) grammatical meaning. In such a process, all changes within the language (i.e., at the semantic, functional, morphosyntactic, or phonological levels) appear to be consistently interrelated to some degree, which leads to diverse, salient outcomes such as, most commonly, generalizations of meaning (and, relatedly, semantic bleaching) and/or contexts of use, phonological reductions (either substantial or temporary), syntactic fixation, specialization (e.g., Eng. will vs. shall, Fr. pas vs. point), or an increase in frequency. By adopting this conception of grammaticalization, in which grammar is epiphenomenal (i.e., a byproduct of the circumstances surrounding communication)¹, I am keeping my distance from some studies that admit a broader scope of the phenomenon (e.g., by embracing certain instances of what other researchers have more precisely referred to as "pragmaticalization" or "discoursivization").

Research goals
Even though grammaticalization, being a gradual process, is intrinsically diachronic, it has occasionally been referred to as "panchronic" in the sense that it needs to be studied from a comprehensive (both diachronic and synchronic) perspective. Taking this into account, in this study I will mostly focus on the outcomes of the grammaticalization process rather than the process itself.
As mentioned earlier, I expect this grammaticalization-based approach to help me establish relevant comparisons between the Catalan constructions anar + adjective and estar + adjective, which might in turn shed light on the degree of grammaticalization and distribution (i.e., free, complementary, or otherwise) of these forms, both synchronically and diachronically. Even though I am not tracing here the evolution of these expressions from their earliest documentation, I still expect to encounter some enlightening changes over the course of the 19th and 20th centuries.
Consequently, my exploratory analysis pursues a threefold goal: (1) determine whether or not consistent variation can be found between the Catalan expressions anar + adjective and estar + adjective; (2) if so, tackle variation in terms of relative frequency and distributional patterns; (3) based on the previous goal, establish a series of linguistic (e.g., morphosyntactic, pragmatic) and/or sociolinguistic factors that appear to be playing a role in the aforementioned distributional patterns.

Literature review
Grammaticalization, both in Romance languages and otherwise, has received abundant and increasing scholarly attention over the course of the last two decades (Antolí 2012, Pons 1958, 1978, Romà i Roura 1987). The study of this phenomenon has traditionally focused on the gradual development of a particular lexical item (e.g., noun or verb bearing a full semantic load) into a more functional, semantically deprived one (e.g., morph expressing futurity). The process in between can be explained based on a multistage cline whose delimitation is highly contingent on the researcher's own theoretical framework, the availability of supporting evidence, or the unique characteristics of the specific process under study (Heine 2002, Montserrat i Buendia 2004, Pérez-Saldanya 1998). As a general trend, pragmatic factors (cf. metonymy, reanalysis), psychosocial characteristics of those involved (cf. subjectification; vid. Traugott 1995), as well as constant readjustments to new sociolinguistic demands all appear to play a significant role in both the development and the outcomes of the grammaticalization process.
1 Even though this may be hard to prove, it is not inconceivable that Pompeu Fabra's standardization of Catalan (Qüestions de gramàtica catalana, 1911) contributed to favoring anar over estar in certain contexts in which the use of the latter was (rightly or wrongly) viewed as a result of language contact with Spanish. This might have led to a number of hypercorrected instances.
As mentioned earlier (vid. Section 1), for the scope of this study only Hopper and Traugott's (2003) strict, narrow definition of grammaticalization is considered. However, I was able to locate a number of publications that, though differing to an extent from Hopper and Traugott's (2003) definition, provide potentially interesting theoretical and methodological insights into the matter. Among these, Alturo and Chodorowska-Pilch (2009) take a corpus-based, quantitative stance on how the Catalan protasis si us plau (lit. 'if it pleases you') eventually becomes a discourse marker, which commonly undergoes articulatory reduction to sisplau 'please' in informal speech (the authors, however, label this process grammaticalization rather than pragmaticalization). Along the same lines, and resorting to similar terminology, Cuenca and Massip (2005) investigate the cline followed by certain conjunctive phrases (in Catalan, the authors refer to these as connectors; e.g., encara que 'although') prior to reaching their current status. This whole process is usually explained in terms of syntactic reanalysis based on a need for higher degrees of subjectification or "intersubjectification" through pragmatic processes.
The study of grammaticalization in Catalan appears to have received significantly less scholarly attention than in some of the surrounding Romance languages (e.g., Spanish, French, Italian). There appears to be, however, one outstanding exception to the rule: the case of anar + infinitive specializing in the expression of the past perfective, which has received significant attention both at the Catalan studies level (Alturo 2017, Pérez-Saldanya & Hualde 2003, Segura 2012, Vallduví 1989) and more globally (Juge 2005). In this respect, Catalan appears to follow the path opposite to that of other Romance languages, in which the present tense of 'to go' + infinitive has eventually specialized to convey an idea of near and/or intentional future (cf. Sp. voy a comer, Fr. je vais manger 'I am going to eat' vs. Cat. vaig menjar 'I ate').
In order to explain the grammaticalization process of anar + infinitive, Alturo (2017) qualitatively compares different hypotheses that have to date been proposed within the sociolinguistics field (e.g., reanalysis, syncretism, contact with Occitan). Though acknowledging that extensive pragmatic research has been conducted on the matter, Juge (2005) finds some gaps in research that affect the morphological sphere, as well as certain inconsistencies in terms of diachrony. The author proposes that the present/preterit syncretism found in Old Catalan in the first-person plural, anam (still preserved in the Balearic varieties of the language), at some point allowed for a reanalysis of the construction. This view contrasts with previous proposals that support an explanation along the lines of the so-called "narrative present" (Vallduví 1989). Pérez-Saldanya and Hualde (2003) hypothesize that the contexts in which the anar + infinitive construction is first attested might be key to understanding such an apparently "highly anomalous" development. Based on an ad hoc literary corpus comprising narrative texts, the authors conclude that all these contexts are actually narrative contexts in which the infinitive corresponds to a verb of accomplishment and, for this reason, there is a subsequent action expressed in the past tense.
Even though no previous research has been found on the relationship between anar + adjective and estar + adjective, the aforementioned references are useful to guide my methodology, inform the interpretation of my own results, enrich subsequent discussions, and shed light on potential further implications and next steps related to my study. This holds especially true regarding the development of the grammaticalization process and, more specifically, its underpinnings, ranging from formal language factors (i.e., morphology, syntax) to more pragmatic or discourse-related ones (e.g., sociolinguistic, psychosocial components).

Research questions and hypotheses
Based on the literature reviewed, I established the following research questions, each followed by my initial assumptions:
1. Can significant variation be found in terms of grammaticalization degrees in the use of anar + adjective vs. estar + adjective in written texts from the 19th and 20th centuries? Based on the grammaticalization model proposed by Hopper and Traugott (2003), not only do I expect these two expressions to differ synchronically in terms of grammaticalization (even when both show evidence of high grammaticalization), but I also expect them to show signs of change, both individually and relative to each other, over the course of the two centuries.
2. If so, how can relative frequency and distributional patterns inform the above variation? First and foremost, I anticipate that the expressions anar + adjective and estar + adjective will show a tendency to appear, at least to an extent, in complementary distribution. Additionally, I would initially expect to encounter inversely proportional frequencies between expressions.
3. Which linguistic (e.g., morphosyntactic, pragmatic) and sociolinguistic factors could be playing an active role in establishing the aforementioned distributional patterns? Firstly, based on the reviewed literature, a tendency of these two expressions to appear in complementary distribution might be explained based on pragmatic factors (e.g., either verb might be showing a tendency to co-occur with adjectives that are semantically similar). Secondly, if found, inversely proportional frequencies between anar + adjective and estar + adjective might help shed some light on the matter in terms of the sociolinguistic sphere, particularly in connection with the steadily increasing contact between Catalan and Spanish, which is well documented for the 20th century.

Materials and procedure
A corpus-based search was conducted using the electronic version of the Corpus Textual Informatitzat de la Llengua Catalana (CTILC²), designed by the Institute for Catalan Studies³. This corpus comprises over 52 million words that were originally written in Catalan between 1833 and 1988. The corpus contains both literary (i.e., narrative, theater, poetry, essays) and non-literary texts (e.g., cross-disciplinary treatises/manuals, specialized and informative articles, legal texts, periodical publications, pamphlets).
Anar + adjective and estar + adjective tokens were obtained from a sample of texts⁴ and then sorted into four different groups, corresponding to the first (19a) and second (19b) halves of the 19th century, as well as the first (20a) and second (20b) halves of the 20th century. The search results from the corpus were randomly sorted in terms of genre and date and, from these, the top 250 tokens on the list of results for each construction were selected and subsequently analyzed.
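The grouping and sampling steps just described can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure actually run on the CTILC interface: the function names `period_label` and `sample_tokens`, the fixed random seed, and the choice to assign years ending in 50 to the second half of a century are all hypothetical.

```python
import random


def period_label(year: int) -> str:
    """Map a publication year to one of the four groups: 19a, 19b, 20a, 20b.

    Assumption: years xx00-xx49 count as the first half ("a") and
    xx50-xx99 as the second half ("b"); the study does not state the cutoff.
    """
    if not 1800 <= year <= 1999:
        raise ValueError(f"year {year} falls outside the 19th and 20th centuries")
    century = "19" if year < 1900 else "20"
    half = "a" if year % 100 < 50 else "b"
    return century + half


def sample_tokens(hits, n=250, seed=0):
    """Shuffle the search results and keep the first n of them.

    Assumption: `hits` is any iterable of search results; the seed is only
    there to make this sketch reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(hits)
    rng.shuffle(shuffled)
    return shuffled[:n]


# Usage with years taken from the examples cited in this section.
for year in (1860, 1893, 1911, 1963):
    print(year, period_label(year))
```

Keeping the labeling rule in a single function makes the half-century cutoff explicit, which matters because the corpus only covers 1833 to 1988 and the first and last groups are therefore shorter than fifty years.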
Tokens in which the adjective was modified by a preceding adverb were included for further analysis:
(1) Està ja revisat.
    to be-3sPI already to revise-PPms
    'It is already revised.' (Pompeu Fabra, Cartes a Joaquim Casas-Carbó, 1910)
(2) Anava molt més rerassagada.
    to go-3sII much more to lag behind-PPfs
    'She was way behind.' (Josep Maria Poblet, Aribau i abans i després, 1963)
This equally applied to all adverb types, including the emphatic negative particle pas:
(3) No estarà pas enllestida a temps.
    not-AUX to be-3sF no way to get ready-PPfs in time
    'There is no way she will be ready in time.' (Pompeu Fabra, Cartes a Joaquim Casas-Carbó i a J. Massó i Torrents, 1911)
Also, some (rare) instances were found in which a subject pronoun is inserted between the verb and the adjective:
(4) Estich jo malaltís.
    to be-1sPI I unhealthy-ms
    'I am feeling unwell.' (Josep Yxart, Cartes a Joan Sardà, 1893)
Past participles preceded by anar or estar were considered functional adjectives and, consequently, also included for further analysis⁵:
On the other hand, exclusions mainly comprised instances in which the adjective appeared prior to the verb:
    donat lo delicat que estic
    to give-PPms the-ns delicate-ms that to be-1sPI
    'given how weak I am feeling' (Josep Yxart, Cartes a Joan Sardà, 1894)
Present participles were also excluded from the list of potential adjectives:
(8) lo teló va arriantse magestuosament
    the-ms curtain-ms to go-3sPI to lower-GER majestically
    'the curtain is being lowered majestically' (Fèlix Socias i Urgellés, ¡Ja hi han entrat!, 1860)

Data management and analysis
Firstly, in order to comparatively track the evolution of anar + adjective and estar + adjective (and to be able to interpret this further in the light of pragmatic and sociolinguistic factors), raw frequencies for all tokens (N=500, 250 per construction) were obtained and sorted into one of the four time-period groups (i.e., 19a, 19b, 20a, 20b) along with their respective percentages. Secondly, every token for either construction was classified according to its particular degree of grammaticalization. Specifically, three degrees were established: non-grammaticalized (NG), semi-grammaticalized (SG)⁶, and grammaticalized (G).
In the case of anar ('to go'), tokens classified as non-grammaticalized were those in which verb internal arguments were other than the following adjectives:
(9) anava sola a abeurar sos ulls
    to go-3sII alone-fs to to soak-INF POSS-mp eye-mp
    'she would go [somewhere] all by herself to soak her eyes' (Llorenç Riber i Campins, Al sol alt, 1931)
Semi-grammaticalized tokens were those in which, although the verb internal argument indicating the destination was not explicit (and, probably, also completely irrelevant), the idea of directionality and/or movement had nonetheless been preserved. Here, several different patterns were encountered, such as anar vestit/-ida, anar nu/-a, anar coix/-a, anar junts, anar descalç/-a (respectively and literally, 'to go dressed/naked/limping/together/barefoot'):
(10) Els pagesos, quan van descalços, la temen molt.
    the-mp peasant-mp when-CONJ to go-3pPI barefoot-mp ACC-3sf to be afraid-3pPI a lot
    'When they go barefoot, peasants are very afraid of it.' (Josep Calicó, Apunts de la flora medicinal de Catalunya, 1921)
Finally, tokens classified as grammaticalized were those in which the notion of movement originally conveyed by anar was no longer a possible interpretation. In fact, in multiple instances of anar + adjective the association between the verb and the adjective is strong (both in terms of frequency and semantic bleaching). Additionally, the semantic characteristics of the adjective appear to prevent anar from deploying its original notion of movement (e.g., anar errat/-ada 'to be (lit. go) wrong,' anar ple/-ena 'to be (lit. go) full'):
(11) Però, si no vaig molt errat, hom aconsella sovint […]
    but if not-AUX to go-1sPI very wrong-ms someone to advise-3sPI often
    'But, unless I am too wrong, it is oftentimes advised […]' (Joan F. Mira i Casterà, A la recerca de la història cultural, 1974)
On the other hand, for estar + adjective no instances were found of non-grammaticalization (i.e., the original meaning of the verb, 'to stand,' from the Latin etymon STARE⁷). However, if we consider the original meaning of the verb, occurrences appear to be considerably split between semi-grammaticalized and non-grammaticalized instances. The former group includes multiple occurrences of estar + past participle that seem to approach the notion of the so-called middle voice (or even, in some cases, the passive voice itself), in which the 'verb + participle' bigram appears to have undergone significant semantic bleaching and increased its strength in terms of mutual association (most likely through increased frequency):
Other highly frequent grammaticalized constructions from the corpus are estar ben vist ('to be well thought of'), estar content/-a de ('to be happy/glad to'), estar decidit/-ida a ('to be determined to'), estar pròxim/-a a ('to be about to'), all followed by an infinitive.⁸ Last but not least, distributional patterns were searched for every construction in order to support or reject the complementary distribution hypothesis.
5 In Catalan, as is the case in other world languages, it is commonplace to find instances of adjectives that originated in a past participle (e.g., estic esgotat 'I am exhausted'; la noia està cansada 'the girl is tired'; vas vestit tot elegant 'you are all dressed up').
6 The semi-grammaticalized label here is consistent with Heine's (2002) concept of "bridging context," since it appears to account for the transition from grammaticalized contexts to non-grammaticalized ones (vid. Heine 2002: 83-101).
7 In the first entry of the lemma estar, the Diccionari català-valencià-balear (a purely descriptive lexicographical work accounting for most varieties of the language) provides the following information: "Trobar-se dempeus i aturat (per oposició a anar o caminar)" ['To stand still on one's feet (as opposed to going or walking)']. The example provided dates back to the 13th century (Ramon Llull, Cont., 315, 22: "S'esforsa... com pusca esser vertuós en son parlar e en son callar e en son anar e en son estar" ['He strives to be virtuous not only when he speaks, but also when he is silent, when he walks, and when he stands']). This could suggest that estar completed its own grammaticalization process well before anar, which would in turn explain why no instances of non-grammaticalization were found in my own selection of the CTILC corpus for estar + adjective constructions during the 19th or 20th centuries.
8 However, it must be noted that the differentiation between grammaticalized and semi-grammaticalized is usually contingent, not on the 'estar + adjective' bigram, but on pragmatic and/or contextual factors. For instance, estar content/-a in the literal sense was considered as semi-grammaticalized, whereas estar content/-a de + infinitive was classified as grammaticalized (cf. Eng. I am happy (because I won the lottery) vs. I am happy to inform you that you have been selected).

Results
In connection with my first and second research questions, Table 1 below provides an overview of how the constructions anar + adjective and estar + adjective quantitatively evolve over the course of the 19th and 20th centuries in written language. Whereas few instances of both constructions are found overall in the first half of the 19th century⁹, estar already prevails over anar when followed by an adjective. Such prevalence continues well into the second half of the 19th century, in which estar accounts for 70.18% of the tokens from both constructions. However, data from the first half of the 20th century show that this trend starts to be reversed at this point. This becomes especially noticeable during the second half of the 20th century, with estar + adjective constructions accounting for only 37.50% of the total number of tokens.
9 In this respect, it might be relevant to emphasize that the earliest materials found in the CTILC corpus date back to 1833 (not 1800). Along the same lines, the most recent texts within the corpus belong to the year 1988 (not 1999).
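The per-period percentages reported above are each construction's token count divided by the combined count of both constructions in that period. A minimal Python sketch of that tabulation follows, assuming the annotated tokens are available as (construction, period) pairs. The function name `construction_shares` and the toy counts are hypothetical and are not the CTILC figures.

```python
from collections import Counter

PERIODS = ("19a", "19b", "20a", "20b")
CONSTRUCTIONS = ("anar", "estar")


def construction_shares(tokens):
    """Return {period: {construction: percentage}} from (construction, period) pairs.

    Percentages are computed within each period and rounded to two decimals.
    Periods with no tokens at all are left out of the result.
    """
    counts = Counter(tokens)
    shares = {}
    for period in PERIODS:
        total = sum(counts[(c, period)] for c in CONSTRUCTIONS)
        if total == 0:
            continue  # nothing attested in this period
        shares[period] = {
            c: round(100 * counts[(c, period)] / total, 2) for c in CONSTRUCTIONS
        }
    return shares


# Toy input (hypothetical counts, NOT the study's data):
# one anar token and three estar tokens, all from period 19b.
toy = [("anar", "19b")] + [("estar", "19b")] * 3
print(construction_shares(toy))
```

The same counting pattern would extend to the degree annotation (NG, SG, G) by adding the degree as a third element of each key, again as an assumption about how the annotations are stored.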
The data presented above suggest the possibility that anar and estar might have traditionally been found in free variation (at least in certain contexts), with 19th century texts showing a clear preference for estar. The figures for anar + adjective and estar + adjective eventually become virtually even (still with a slight precedence of estar) during the first half of the 20th century. Finally, during the second half of the 20th century the initial trend is reversed, with anar + adjective bigrams accounting for 62.50% of the total number of tokens. With Pompeu Fabra (considered the father of Catalan standardization) having published his most influential works by the mid-20th century (notably, the Diccionari General de la Llengua Catalana, 1932), the increase in the use of anar + adjective in written Catalan might arguably be connected with the clandestine standardization efforts at a time when the language was being banned from the public sphere by the Francoist dictatorship¹⁰.
On the other hand, as indicated and exemplified earlier in Section 4.2 ('Data management and analysis'), every token of anar + adjective and estar + adjective was annotated according to the degree of grammaticalization shown within its own context. This information was summarized in a table for further analysis and comparison. Table 2 below contains both the raw frequencies and percentages accounting for the tokens of anar/estar + adjective in each time period, considering their respective degrees of grammaticalization.
10 Without readily available access to materials published in Catalan (e.g., literary works from previous centuries), some purist currents at the time held the belief that the most correct choices in terms of linguistic structures or lexical items were those most differing from their Spanish counterparts. According to some sources, Pompeu Fabra earned the reputation of being a purist, even in the words of some of his colleagues at the Institute for Catalan Studies (Gracia Alonso, F. (2011)
Table 2 shows that, by the early 1800s, both anar + adjective and estar + adjective already appear to be highly grammaticalized. In fact, for the latter construction, no instances of non-grammaticalization are found, which indicates that the grammaticalization process had reached full completion well before the beginning of the 1800s¹¹. Non-grammaticalized tokens of anar (i.e., with the verb indicating movement toward an established destination) are actually widespread in Catalan. However, when non-grammaticalization is present, the anar + adjective structure is possible, yet not as common, since the internal argument of the verb is forced to appear after the adjective, not immediately following the verb (e.g., en Joan va content a la feina 'John goes happily (lit. happy) to work'). On the other hand, tokens of semi-grammaticalized estar + adjective, in which estar basically represents a transient condition or state, remain stable through all the time periods studied, by far outnumbering their anar counterparts. Turning to the grammaticalized occurrences of anar vs. estar + adjective, these appear to be virtually and consistently even throughout the 19th century, whereas grammaticalized instances of anar + adjective considerably outnumber their estar + adjective counterparts throughout the 20th century. This might align well with the 1900s trend of switching to anar in constructions (usually fixed expressions) that might traditionally have resorted to estar (e.g., anar cansat/-ada 'to be (lit. go) tired'). As mentioned earlier, I believe that standardization efforts of the language within an adverse sociopolitical and sociolinguistic context should not be excluded as a possible factor.
Additionally, in the instances of anar + adjective analyzed, fewer examples of idiomatic expressions are found (e.g., anar errat/-ada 'to be (lit. go) wrong' (10 tokens)). At the same time, these appear to be more frequent than in the case of estar + adjective, in which variety consistently outweighs frequency (e.g., estar cansat/-ada 'to be tired,' estar obligat/-ada 'to be obliged,' estar content/-a 'to be happy,' estar decidit/-ida 'to be determined,' estar segur/-a 'to be sure,' estar disposat/-ada 'to be willing'). This can be explained based on the fact that estar, unlike anar, shows evidence of having fully completed its grammaticalization process (or, at least, multiple uses appear to have been grammaticalized for considerably longer periods of time).
11 Like in Spanish, the original use of estar 'to stand' (< Lat. STARE) has not been preserved in Catalan.
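One way to make the frequency vs. variety contrast concrete is to compare, for each verb, the number of adjective tokens with the number of distinct adjective types. The Python sketch below is only an illustration of that idea and is not a measure used in the study: the function names are hypothetical, the adjectives are assumed to be already lemmatized, and the toy lists are invented for the example.

```python
from collections import Counter


def variety_profile(adjectives):
    """Return (tokens, types, type/token ratio) for a list of adjective lemmas.

    A lower ratio means a few adjectives recur often (frequency dominates);
    a higher ratio means many different adjectives occur (variety dominates).
    """
    counts = Counter(adjectives)
    tokens = sum(counts.values())
    types = len(counts)
    ratio = types / tokens if tokens else 0.0
    return tokens, types, ratio


def shared_adjectives(anar_adjectives, estar_adjectives):
    """Return the set of adjective lemmas attested with both verbs."""
    return set(anar_adjectives) & set(estar_adjectives)


# Toy lemma lists (hypothetical, not drawn from the corpus sample).
anar_toy = ["errat", "errat", "errat", "ple"]
estar_toy = ["cansat", "content", "segur", "ple"]
print(variety_profile(anar_toy))
print(variety_profile(estar_toy))
print(shared_adjectives(anar_toy, estar_toy))
```

The overlap set is also a simple first check on the complementary distribution hypothesis: an empty overlap would be consistent with it, while a large overlap would point toward free variation.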
Finally, the answer to my third research question seems to involve greater complexity. Given that in many contexts the two constructions under study seem to appear in free variation (excluding idiomatic expressions such as the ones mentioned earlier), certain factors (likely, language standardization efforts) must account for the precedence of anar throughout the 1900s. This would also relate to the fact that anar emerges as being used in more grammaticalized contexts in the 1900s.

Conclusion and discussion
The results obtained are interesting in the light of the theoretical framework proposed by Hopper and Traugott (2003), as they appear to point to certain directions for prospective research. Still, more research is needed to refine and validate my results. For instance, it could be enlightening to conduct a follow-up study based on the contrastive analysis of anar vs. estar + adjective across both literary and non-literary genres (also with an emphasis on oral speech; vid. Romà i Roura 1987). A step in the right direction would probably be to explore how this is reflected in letters, which, despite obvious reservations, arguably constitute the genre closest to popular, spontaneous speech. Should language standardization efforts have actually played an important role in the expansion of anar + adjective contexts, data from the CTILC corpus could be compared to spontaneous speech samples prior to venturing hurried assumptions on the matter. Another possibility consists in delving into the semi-linking (Cat. semicopulatiu) characteristics of anar and estar, as well as the divergences shown by these verbs in this respect (Bosque & Demonte 1999, RAE & ASALE 2009-2011, Solà 2002). A strict differentiation (whenever possible) of the nature of intervening factors (i.e., sociolinguistic vs. discourse-pragmatic factors) might also lead to supplementary insights into the matter. As mentioned earlier, a factor that is not to be completely dismissed lies in the influence of Pompeu Fabra's standardization of the language, which might have contributed to accelerating the grammaticalization process of anar in those instances in which it was consistently preferred over estar (based on hypercorrection or otherwise).
Another interesting clue is the frequency vs. variety contrast that is clearly observed between grammaticalized samples of anar + adjective and estar + adjective, with the former being more frequent in 20th century writing, yet not presenting as many possible combinations as the latter. It would be relevant to further explore these results in the light of grammaticalization theory, since the fact that estar + adjective has been grammaticalized for longer seems to have allowed for a larger variety of combinations. In order to shed more light on the matter and help develop further perspectives, I believe it could be enlightening to trace the grammaticalization of estar and anar back to the earliest documentation of the language.