Comparing methods for measuring dialect similarity in Norwegian

Article from the Proceedings of the 12th Language Resources and Evaluation Conference by A. Kåsen, J. Bondi Johannessen, K. Hagen, A. Nøklestad, & J. Priestley.

Introduction


This paper explores whether existing dialect transcriptions from a recently completed corpus project can be used to automatically generate dialect areas. The phonetic transcriptions are quite coarse-grained, which suggests that an automatic method should be able to generalise over them. We test two methods: the Levenshtein method, which uses edit distance to compute distances between dialects, and a long short-term memory (LSTM) autoencoder, a neural machine learning method. The resulting maps show that the transcriptions can indeed be used to automatically generate dialect maps, but while the neural network method needs a large dataset, the Levenshtein method achieves very good results with a small dataset, too. The paper is structured as follows: Section 2 describes the LIA project from which the transcriptions are taken, as well as the transcriptions themselves and traditional dialect maps; Section 3 describes the datasets we use; Sections 4 and 5 present the results of applying the two methods to the two datasets. Section 6 concludes the paper, and references are given in Sections 7 and 8.
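As an illustration of the first method (a generic sketch, not the authors' implementation), the Levenshtein distance between two phonetic transcriptions can be computed with standard dynamic programming; the example word forms below are hypothetical:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

# Hypothetical coarse transcriptions of the same word in two dialects
print(levenshtein("ikke", "ittje"))  # → 3
```

Averaging such distances over many word pairs shared between two recording sites yields a dialect-to-dialect distance, which can then be clustered and mapped.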

The full article can be downloaded on DUO Research Archive.

Published June 1, 2021 2:48 PM - Last modified May 2, 2024 10:44 AM