Lázaro: An Extractor of Emergent Anglicisms in Spanish Newswire

Abstract

The use of lexical borrowings from English (often called anglicisms) in the Spanish press evokes great interest, both in the Hispanic linguistics community and among the general public. Anglicism usage in the Spanish language has previously been studied within the field of corpus linguistics, but prior work has traditionally relied on manual inspection of corpora, which limits the scale of such studies. This thesis proposes a model for automatic extraction of unadapted anglicisms in Spanish newswire and introduces: (1) a corpus of 21,570 newspaper headlines (325,665 tokens) written in European Spanish and annotated with unadapted anglicisms, and (2) two sequence-labeling models that perform the extraction automatically: a conditional random field (CRF) model with handcrafted features and a BiLSTM-CRF model with word and character embeddings. The best results are obtained by the CRF model, with an F1 score of 89.60 on the development set and 87.82 on the test set. Finally, a practical application of the CRF model is presented: an automatic pipeline that performs daily extraction of anglicisms from the main national newspapers of Spain.
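
To illustrate how anglicism extraction can be framed as sequence labeling, the following minimal Python sketch turns per-token tags into anglicism spans. The abstract does not state the tag scheme or any identifiers, so the BIO encoding, the `B-ENG`/`I-ENG` labels, the function name `extract_spans`, and the example headline are all assumptions made for illustration, not the thesis's actual implementation.

```python
from typing import List


def extract_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Join tokens tagged B-ENG/I-ENG into anglicism spans.

    Assumes a BIO scheme (hypothetical labels): B-ENG opens a span,
    I-ENG continues it, and any other tag closes it. An I-ENG with
    no open span is ignored.
    """
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-ENG":
            # A new span starts; flush the previous one if still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-ENG" and current:
            current.append(token)
        else:
            # Any other tag (or a stray I-ENG) ends the open span.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Invented example headline with hand-assigned tags.
tokens = ["Las", "fake", "news", "llegan", "al", "streaming"]
tags = ["O", "B-ENG", "I-ENG", "O", "O", "B-ENG"]
spans = extract_spans(tokens, tags)
```

In this framing, a sequence labeler such as a CRF would predict the `tags` list for each headline, and a span-joining step like the one above would recover multiword borrowings as single units.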

Publication
Computational Linguistics MS thesis, Department of Computer Science, Brandeis University