Understanding environment textually and linguistically

Coordinator: Marie-Claude L’Homme, Linguistique et traduction, Université de Montréal

In collaboration with:

This project is supported by the Social Sciences and Humanities Research Council (SSHRC), 2013-2018

The field of environment is very complex, since it integrates concepts related to meteorology, climatology, geology, economy, etc., and it is dealt with in a wide variety of publications (reports written by experts, newspaper articles, ideological leaflets, popularization publications, etc.). The issues raised by the field are also extremely important, and new words are created in order to convey them. It thus becomes very difficult, for experts and non-experts alike, to keep track of all the changes that occur in the field.

The objective of this project is to develop methods for characterizing the contents of texts on two different levels: textual (using methods and techniques derived from corpus linguistics and text mining) and linguistic (based on lexical semantics models). The project combines theories, methods and techniques used in linguistics, information science and terminology.

First, we will develop a text typology characterizing environmental texts from two perspectives: (1) the topic dealt with (e.g., climatic change, recycling, sustainable development); and (2) the level of specialization (expert to expert, expert to initiate, expert to layperson, etc.). The typology will be elaborated for texts written in English, French and Spanish and will be based on work on text genres by Biber (1988) and Swales (1990) and on communicative situations by Pearson (1998).

Then, descriptive text mining methods will be applied in order to identify important or new topics in different texts (e.g., greenhouse effect, climate warming, shale gas). This part of the project will be carried out in two steps: first, unsupervised classification algorithms will be applied to texts (to group texts dealing with comparable topics); then, terms that are representative of the groups of texts produced by the classification algorithms will be extracted. This procedure will allow us to discover the thematic structure that appears in corpora. We hypothesize that topics identified by text mining techniques correspond to conceptual clusters that are important in the field of environment.

These topics will serve as the starting point for a linguistic description of the specialized lexicon that appears in environmental texts. The linguistic description will be based on a lexical semantics theory, namely Frame Semantics (Fillmore 1982). Frames are conceptual scenarios that lexical units (LUs) evoke from different perspectives. For instance, we can hypothesize that the LUs change (n.), change (v.), fluctuate, fluctuation, vary and variation evoke the same frame in the field. The identification of LUs that are likely to belong to the same frame will be carried out semi-automatically with TermoStat (Drouin 2003), a term extractor that offers different viewpoints on the lexicon contained in a corpus.
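The two-step text mining procedure described above (unsupervised grouping of texts, then extraction of terms representative of each group) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the project does not specify which algorithms are used, so TF-IDF weighting and a simple centroid-based agglomerative clustering stand in for them, and the four one-sentence "texts" are invented. It does not reproduce TermoStat or any tool used in the project.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and keep purely alphabetic tokens (a deliberately naive tokenizer)."""
    return [w for w in text.lower().split() if w.isalpha()]

def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def centroid(vectors):
    """Average a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_texts(vectors, k):
    """Step 1: merge the two most similar groups until only k groups remain."""
    groups = [[i] for i in range(len(vectors))]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                sim = cosine(centroid([vectors[x] for x in groups[i]]),
                             centroid([vectors[x] for x in groups[j]]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = groups.pop(j)  # j > i, so index i stays valid
        groups[i].extend(merged)
    return groups

def top_terms(vectors, group, n):
    """Step 2: terms with the highest average TF-IDF weight in a group."""
    c = centroid([vectors[i] for i in group])
    return [t for t, _ in sorted(c.items(), key=lambda item: (-item[1], item[0]))[:n]]

# Invented toy corpus (not project data).
docs = [
    "greenhouse gas emissions drive climate warming",
    "climate warming increases greenhouse gas concentration",
    "recycling plastic waste reduces landfill",
    "plastic recycling programs cut landfill waste",
]
vectors = tfidf_vectors(docs)
groups = cluster_texts(vectors, k=2)
for g in groups:
    print(sorted(g), top_terms(vectors, g, n=4))
```

On a real corpus, the number of groups would not be known in advance, and the terms ranked highest for each group would only be candidates to be reviewed before serving as topics.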
Once the frames and their lexical units have been identified, we will describe them using the methodology developed within the FrameNet project (Ruppenhofer et al. 2010). The method includes a step that consists in annotating LUs and their participants in corpora. This annotation can then lead to a better characterization of LUs used in texts.
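As a rough illustration of what such an annotation records, the sketch below models a frame, the LUs that evoke it, and annotated sentences in which the participants of an LU are labelled. It is a hypothetical data model, not the FrameNet format or the project's actual schema: the frame name "Change", the participant labels (Entity, Duration, Place) and the two example sentences are all invented; only the list of LUs comes from the description above.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Frame:
    name: str
    lexical_units: set[str]  # LUs written as "lemma.pos"

@dataclass
class Annotation:
    sentence: str
    target: str  # the LU being annotated
    frame: str   # name of the frame the LU evokes
    elements: dict[str, str] = field(default_factory=dict)  # participant label -> text span

def is_valid(annotation, frame):
    """Check that the target belongs to the frame and every span occurs in the sentence."""
    return (annotation.frame == frame.name
            and annotation.target in frame.lexical_units
            and all(span in annotation.sentence for span in annotation.elements.values()))

def participant_patterns(annotations):
    """For each LU, collect the sets of participant labels observed in context."""
    patterns = {}
    for a in annotations:
        patterns.setdefault(a.target, set()).add(tuple(sorted(a.elements)))
    return patterns

# Hypothetical frame name; the LUs are those mentioned in the project description.
change_frame = Frame("Change", {"change.n", "change.v", "fluctuate.v",
                                "fluctuation.n", "vary.v", "variation.n"})

# Invented example sentences and labels.
annotations = [
    Annotation("Temperatures fluctuate over decades.", "fluctuate.v", "Change",
               {"Entity": "Temperatures", "Duration": "over decades"}),
    Annotation("Precipitation varies from region to region.", "vary.v", "Change",
               {"Entity": "Precipitation", "Place": "from region to region"}),
]

for a in annotations:
    print(a.target, is_valid(a, change_frame), sorted(a.elements))
```

Grouping the observed participant patterns by LU, as `participant_patterns` does, is one simple way in which annotated contexts could feed the characterization of LUs mentioned above.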

The main expected outcome of the project is a method that will serve to improve the management of information related to the field of environment. More specifically, it will allow us to test automatic classification methods and to adapt them to a very complex field of knowledge. Finally, the lexical descriptions will be placed online in a freely available resource. Terminological, lexicographical and pedagogical applications could then be derived from them.