Journal of Computer-Assisted Linguistic Research https://www.polipapers.upv.es/index.php/jclr <p style="text-align: justify; text-justify: inter-ideograph; margin: 0cm 0cm 6.0pt 0cm;"><strong>Journal of Computer-Assisted Linguistic Research (JCLR)</strong> is a double-blind peer-reviewed journal that publishes high-quality scientific articles on linguistic studies where computer tools or techniques play a major role. JCLR aims to promote the integration of computers into linguistic research. In particular, articles in JCLR make a clear contribution to research in which software plays a key role in representing and processing written or spoken data. Contributions submitted to JCLR must be in English or Spanish, but we welcome works about the study of any language. Topics of interest include computational linguistics, text mining, natural language processing, discourse analysis, and language-resource construction, among many others.</p> en-US <p><a href="http://creativecommons.org/licenses/by-nc-nd/4.0/" rel="license"><img src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" alt="Creative Commons License" /></a></p> <p>This journal is licensed under <a href="http://creativecommons.org/licenses/by-nc-nd/4.0/" rel="license">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a></p> jopepas3@upv.es (Carlos Periñán-Pascual) polipapers@upv.es (Administrador PoliPapers) Tue, 12 Dec 2023 12:09:50 +0100 OJS 3.3.0.8 http://blogs.law.harvard.edu/tech/rss 60 Computer-based Reading Recall on Sociolinguistic Research https://www.polipapers.upv.es/index.php/jclr/article/view/18547 <p>Global bilingual communities are a fascinating phenomenon that has received constant attention from different angles and disciplines. Sociolinguistic research has turned its attention to what motivates change in these globalized settings, while psycholinguistic research has focused on the cognitive aspects of L2 speakers. 
With the widespread use of computer-based methods, it seems natural to add them to contemporary research as a way of understanding variation and change at a deeper level. Drawing on the data I have collected, I discuss in this article the importance of including computer-based tests as part of traditional variationist research. I argue that the traditional separation of methods and data collection has influenced the research process to a point where some new behaviors could be overlooked. In this article I report the relationship between cognitive adaptation and social experiences in the Colombian bilingual community in Philadelphia, whose members become more proficient not only because of age and time of L2 learning, but also because of how welcoming their social circles are and how diverse their friendships and workplaces are.</p> Camila Franco Rodriguez Copyright (c) 2023 Journal of Computer-Assisted Linguistic Research https://creativecommons.org/licenses/by-nc-nd/4.0 https://www.polipapers.upv.es/index.php/jclr/article/view/18547 Tue, 12 Dec 2023 00:00:00 +0100 On Methods of Data Standardization of German Social Media Comments https://www.polipapers.upv.es/index.php/jclr/article/view/19907 <p>This article is part of a larger project aiming at identifying discursive strategies in social media discourses revolving around the topic of gender diversity, for which roughly 350,000 comments were scraped from the comments sections below YouTube videos relating to the topic in question. This article focuses on different methods of standardizing social media data in order to enhance further processing. More specifically, the data are corrected in terms of casing, spelling, and punctuation. Different tools and models (LanguageTool, T5, seq2seq, GPT-2) were tested. 
The best outcome was achieved by the German GPT-2 model: it scored highest on all of the applied metrics (ROUGE, GLEU, BLEU), making it the best model for the task of Grammatical Error Correction in German social media data.</p> Lidiia Melnyk, Linda Feld Copyright (c) 2023 Journal of Computer-Assisted Linguistic Research https://creativecommons.org/licenses/by-nc-nd/4.0 https://www.polipapers.upv.es/index.php/jclr/article/view/19907 Tue, 12 Dec 2023 00:00:00 +0100 A Lightweight Statistical Method for Terminology Extraction https://www.polipapers.upv.es/index.php/jclr/article/view/20427 <p>We propose a method for the task of automatic terminology extraction in the context of a larger project devoted to the automation of part of the tasks involved in the production of terminological databases. Terminology extraction is the key to drafting the macrostructure of a terminological resource (i.e., the list of entries), to which grammatical or semantic information can later be added at the microstructural level. To this end, we developed a statistical method that is conceptually simple compared to modern neural network approaches. It is a lightweight method because it is based on term dispersion and co-occurrence statistics that can be computed with basic hardware. For the evaluation, we experimented with English and Spanish corpora of lexicography and linguistics of ca. 66 million tokens. Results improve on baselines by almost 20%.</p> Rogelio Nazar, Nicolás Acosta Copyright (c) 2023 Journal of Computer-Assisted Linguistic Research https://creativecommons.org/licenses/by-nc-nd/4.0 https://www.polipapers.upv.es/index.php/jclr/article/view/20427 Tue, 12 Dec 2023 00:00:00 +0100 Self-supervision of Hallucinations in Large Language Models: LLteaM https://www.polipapers.upv.es/index.php/jclr/article/view/20408 <p>Large language models like GPT and Claude have revolutionized the tech industry over the past year. 
However, as generative artificial intelligence, they are prone to hallucinations. A large language model hallucinates when it generates false or nonsensical text. As these models improve, these hallucinations become less obvious and more dangerous for users. This research explores the phenomenon in the context of automated email response for customer service. First, it proposes a taxonomy of hallucinations in large language models based on their linguistic nature, and second, a multi-agent system that allows for the self-supervision of such hallucinations. This system generates email responses but prevents their delivery if hallucinations are detected, thus reducing the risks of generative AI in production environments. Experiments with various state-of-the-art language models reveal that the operating costs of the only successful model currently exceed those viable for operational deployment. Moreover, a drastic performance drop after a recent update to GPT-3.5-turbo suggests likely shortcomings in industrial applications driven by retrieval-augmented generation. Overall, the research advocates for a Machine Linguistics to analyze the outputs of large language models, suggesting that such a collaboration between Linguistics and Artificial Intelligence could help mitigate the social risks of hallucination.</p> Sofía Correa Busquets, Lucas Maccarini Llorens Copyright (c) 2023 Journal of Computer-Assisted Linguistic Research https://creativecommons.org/licenses/by-nc-nd/4.0 https://www.polipapers.upv.es/index.php/jclr/article/view/20408 Tue, 12 Dec 2023 00:00:00 +0100