Non classé Publications

Just published: Automatic creation of a dictionary of French verb regimes. (Master Thesis)

Hassert, Naïma (2023) Automatic creation of a dictionary of French verb regimes. Département de linguistique et de traduction, Université de Montréal. [PDF (610Kb)].

Valency dictionaries are useful for many tasks in automatic language processing. However, quality dictionaries of this type are created at least in part manually; they are therefore resource-intensive and difficult to update. In addition, many of these resources do not take into account the different meanings of lemmas, which are important because the arguments selected tend to vary according to the meaning of the verb. In this thesis, we automatically create a French verb valency dictionary that takes polysemy into account. We extract 20 000 example sentences for each of the 2 000 most frequent French verbs. We then obtain the lexical embeddings of these verbs in context using a monolingual and two multilingual language models. Then, we use clustering algorithms to induce the different meanings of these verbs. Finally, we automatically parse the sentences using different parsers to find their arguments. We determine that the combination of the French language model CamemBERT and an agglomerative clustering algorithm offers the best results in the sense induction task (58.19% of F1 B3), and that for syntactic parsing, Stanza is the tool with the best performance (83.29% of F1). By filtering the syntactic frames obtained using maximum likelihood estimation, a very simple statistical method for finding the most likely parameters of a probability model that explains our data, we build a valency dictionary that almost completely dispenses with human intervention. Our procedure is used here for French, but can be used for any other language for which sufficient written data exists.
Keywords : word sense induction, valency, computational lexicography.