Anders Nøklestad

Senior Engineer - Research Section

Email anders.noklestad@iln.uio.no

Room HW 240

Visiting address Humit, Henrik Wergelands hus, 2nd floor, Niels Henrik Abels vei 36, N-0313 OSLO

Postal address Postboks 1079 Blindern, 0316 Oslo

I work at Humit, where I develop resources and tools for language researchers and students, primarily databases, computational linguistics tools, and web interfaces for searching corpora (grammatically analysed text or speech). I am also responsible for operating one of Humit's (formerly the Text Laboratory's) servers.

I hold a hovedfag degree (master's level) in linguistics and a PhD in computational linguistics/language technology, both from UiO.

Keywords: Language technology, Corpora, Humanities computing, Digital humanities, Linguistics, Norwegian language, IT support for research

Honkapohja, Alpo; Thaisen, Jacob & Nøklestad, Anders (2024). A search tool based on language modelling developed for The Index of Middle English Prose. Open Research Europe. ISSN 2732-5121. 3. doi: 10.12688/openreseurope.16590.1.
Haug, Dag Trygve Truslew; Yildirim, Ahmet; Hagen, Kristin & Nøklestad, Anders (2023). Rules and neural nets for morphological tagging of Norwegian - Results and challenges. In Alumäe, Tanel & Fishel, Mark (Eds.), Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa). University of Tartu. ISBN 978-99-1621-999-7. pp. 425–435.
Mæhlum, Petter; Haug, Dag Trygve Truslew; Jørgensen, Tollef Emil; Kåsen, Andre; Nøklestad, Anders & Rønningstad, Egil [9 authors in total] (2022). NARC – Norwegian Anaphora Resolution Corpus. International Conference on Computational Linguistics (ICCL) (COLING). ISSN 1525-2477. 29(7), pp. 48–60.
Published in: Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC): https://aclanthology.org/venues/coling/. We present the Norwegian Anaphora Resolution Corpus (NARC), the first publicly available corpus annotated with anaphoric relations between noun phrases for Norwegian. The paper describes the annotated data for 326 documents in Norwegian Bokmål, together with inter-annotator agreement and discussions of relevant statistics. We also present preliminary modelling results which are comparable to existing corpora for other languages, and discuss relevant problems in relation to both modelling and the annotations themselves.
Kåsen, Andre; Hagen, Kristin; Nøklestad, Anders; Priestley, Joel; Solberg, Per Erik & Haug, Dag Trygve Truslew (2022). The Norwegian Dialect Corpus Treebank. In Calzolari, Nicoletta; Béchet, Frédéric; Blache, Philippe; Choukri, Khalid; Cieri, Christopher; Declerck, Thierry; Goggi, Sara; Isahara, Hitoshi; Maegaard, Bente; Mariani, Joseph; Mazo, Hélène; Odijk, Jan & Piperidis, Stelios (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference. European Language Resources Association. ISBN 979-10-95546-72-6. pp. 4827–4832.
This paper presents the NDC Treebank of spoken Norwegian dialects in the Bokmål variety of Norwegian. It consists of dialect recordings made between 2006 and 2012 which have been digitised, segmented, transcribed and subsequently annotated with morphological and syntactic analysis. The nature of the spoken data gives rise to various challenges both in segmentation and annotation. We follow earlier efforts for Norwegian, in particular the LIA Treebank of spoken dialects transcribed in the Nynorsk variety of Norwegian, in the annotation principles to ensure interusability of the resources. We have developed a spoken language parser on the basis of the annotated material and report on its accuracy both on a test set across the dialects and by holding out single dialects.
Lane, Pia; Hagen, Kristin; Nøklestad, Anders & Priestley, Joel (2022). Creating a corpus for Kven, a minority language in Norway. Nordlyd. ISSN 0332-7531. 46(1), pp. 159–170. doi: 10.7557/12.6345.
Language documentation, including the development and use of corpora, is frequently linked to revitalisation. This is also the case for the Kven language, a Finnic minoritised language, traditionally spoken in the two northernmost counties of Norway. Kven is a recognised minority language in Norway, protected by the European Charter for Regional or Minority Languages. This status led to increased efforts to document Kven, including the development of the Ruija Corpus, consisting of recordings of interviews in Kven. The corpus was an important tool for the standardisation of Kven. In this article we describe how the corpus was developed and account for search functions, including a discussion of the limitations of the corpus. We also discuss the role of corpora and other online tools for language revitalisation, with a particular focus on the standardisation of Kven and conclude by reflecting on how expertise also resides with the speakers of an endangered language and that they have a right to be involved in efforts of language documentation and revitalisation.
Borthen, Kaja; Søfteland, Åshild; Kveen, Perlaug Marie; Karagjosova, Elena & Nøklestad, Anders (2021). Finalpartiklar i norske talemål: Ei undersøking av variasjon knytt til geografi, alder og kjønn. Maal og Minne. ISSN 0024-855X. 113(1), pp. 1–63.
Søfteland, Åshild; Nøklestad, Anders; Priestley, Joel & Hagen, Kristin (2020). Glossa som forskningsverktøy. Hva folk søker etter og hva resultatene brukes til. Oslo Studies in Language (OSLa). ISSN 1890-9639. 11(2), pp. 449–464. doi: 10.5617/osla.8512.
Kåsen, Andre; Hagen, Kristin; Johannessen, Janne Bondi; Nøklestad, Anders & Priestley, Joel (2020). Comparing methods for measuring dialect similarity in Norwegian. In Calzolari, Nicoletta; Béchet, Frédéric; Blache, Philippe; Choukri, Khalid; Cieri, Christopher; Declerck, Thierry; Goggi, Sara; Isahara, Hitoshi; Maegaard, Bente; Mariani, Joseph; Mazo, Hélène; Moreno, Asuncion; Odijk, Jan & Piperidis, Stelios (Eds.), Proceedings of The 12th Language Resources and Evaluation Conference. European Language Resources Association. ISBN 979-10-95546-34-4. pp. 5343–5350.
Fjeld, Ruth E. Vatvedt; Nøklestad, Anders & Hagen, Kristin (2020). Leksikografisk Bokmålskorpus (LBK) – Bakgrunn og bruk. In Johannessen, Janne Bondi & Hagen, Kristin (Eds.), Leksikografi og korpus. En hyllest til Ruth Vatvedt Fjeld. Universitetet i Oslo. ISBN 9788291398129. pp. 101–124.
Kåsen, Andre; Hagen, Kristin; Nøklestad, Anders & Priestley, Joel (2019). Tagging a Norwegian Dialect Corpus. In Hartmann, Mareike & Plank, Barbara (Eds.), Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa). Linköping University Electronic Press. ISBN 978-91-7929-995-8. pp. 350–355.
This paper describes an evaluation of five data-driven part-of-speech (PoS) taggers for spoken Norwegian. The taggers all rely on different machine learning mechanisms: decision trees, hidden Markov models (HMMs), conditional random fields (CRFs), long-short term memory networks (LSTMs), and convolutional neural networks (CNNs). We go into some of the challenges posed by the task of tagging spoken, as opposed to written, language, and in particular a wide range of dialects as is found in the recordings of the LIA (Language Infrastructure made Accessible) project. The results show that the taggers based on either conditional random fields or neural networks perform much better than the rest, with the LSTM tagger getting the highest score.
Lundquist, Bjørn; Larsson, Ida; Westendorp, Maud; Tengesdal, Eirik & Nøklestad, Anders (2019). Nordic Word Order Database: Motivations, methods, material and infrastructure. Nordic Atlas of Language Structures (NALS) Journal. ISSN 2387-2667. 4(1), pp. 1–33. doi: 10.5617/nals.7529.
In this article, we present the Nordic Word Order Database (NWD), with a focus on the rationale behind it, the methods used in data elicitation, data analysis and the empirical scope of the database. NWD is an online database with a user-friendly search interface, hosted by The Text Laboratory at the University of Oslo, launched in April 2019 (https://tekstlab.uio.no/nwd). It contains elicited production data from speakers of all of the North Germanic languages, including several different dialects. So far, 7 fieldtrips have been conducted, and data from altogether around 250 participants (age 16–60) have been collected (approx. 55 000 sentences in total). The data elicitation is carried out through a carefully controlled production experiment that targets core syntactic phenomena that are known to show variation within and/or between the North Germanic languages, e.g., subject placement, object placement, particle placement and verb placement. In this article, we present the motivations and research questions behind the database, as well as a description of the experiment, the data collection procedure, and the structure of the database.
Øvrelid, Lilja; Kåsen, Andre; Hagen, Kristin; Solberg, Per Erik; Johannessen, Janne Bondi & Nøklestad, Anders (2018). The LIA Treebank of Spoken Norwegian Dialects. In Calzolari, Nicoletta; Choukri, Khalid; Cieri, Christopher; Declerck, Thierry; Goggi, Sara; Hasida, Koiti; Isahara, Hitoshi; Maegaard, Bente; Mariani, Joseph; Mazo, Hélène; Moreno, Asuncion; Odijk, Jan; Piperidis, Stelios & Tokunaga, Takenobu (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation. European Language Resources Association. ISBN 979-10-95546-00-9. pp. 4482–4488.
This article presents the LIA treebank of transcribed spoken Norwegian dialects. It consists of dialect recordings made between 1950 and 1990, which have been digitised, transcribed, and subsequently annotated with morphological and dependency-style syntactic analysis as part of the LIA (Language Infrastructure made Accessible) project at the University of Oslo. In this article, we describe the LIA material of dialect recordings and its transcription, transliteration and further morphosyntactic annotation. We focus in particular on the extension of the native NDT annotation scheme to spoken language phenomena, such as pauses and various types of disfluencies, and present the subsequent conversion of the treebank to the Universal Dependencies scheme. The treebank currently consists of 13,608 tokens, distributed over 1396 segments taken from three different dialects of spoken Norwegian. The LIA treebank annotation is an on-going effort and future releases will extend on the current data set.
Nøklestad, Anders; Hagen, Kristin; Johannessen, Janne Bondi; Kosek, Michał & Priestley, Joel (2017). A modernised version of the Glossa corpus search system. In Tiedemann, Jörg (Ed.), Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa). Linköping University Electronic Press. ISBN 978-91-7685-601-7. pp. 251–254.
This paper presents and describes a modernised version of Glossa, a corpus search and results visualisation system with a user-friendly interface. The system is open source and can be easily installed on servers or even laptops for use with suitably prepared corpora. It handles parallel corpora as well as monolingual written and spoken corpora. For spoken corpora, the search results can be linked to audio/video, and spectrographic analysis and visualised geographical distributions can be provided. We will demonstrate the range of search options and result visualisations that Glossa provides.
Bick, Eckhard; Hagen, Kristin & Nøklestad, Anders (2015). Optimizing the Oslo-Bergen Tagger. In Bick, Eckhard & Hagen, Kristin (Eds.), Proceedings of the Workshop on “Constraint Grammar - methods, tools and applications” at NODALIDA 2015, May 11-13, 2015, Institute of the Lithuanian Language, Vilnius, Lithuania. Linköping University Electronic Press. ISBN 978-91-7519-037-2. pp. 11–17.
Kosek, Michał; Nøklestad, Anders; Priestley, Joel; Hagen, Kristin & Johannessen, Janne Bondi (2015). Visualisation in Speech Corpora: Maps and Waves in the Glossa System. In Grigonytė, Gintarė; Clematide, Simon; Utka, Andrius & Volk, Martin (Eds.), Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015. Linköping University Electronic Press. ISBN 978-91-7519-035-8. pp. 23–31.
We present the Glossa web-based system for corpus search and results handling, focussing on two modes of visualisation implemented in the system. First, we describe the use of maps to show the geographical distribution of search results and its utility for exploring dialectal variation and discovering new isoglosses. Secondly, we present a functionality for speech visualisation, yielding dynamically generated representations of spectrograms, pitch and formants. The analyses are accompanied by the ability to replay selected parts of the waveform, as well as export and compare maximum, minimum and average values of the parameters for different selections. Among other things, this can be used to explore in more detail the set of spoken variants revealed by the geographical map view.
Kapociute-Dzikiene, Jurgita; Nøklestad, Anders; Johannessen, Janne Bondi & Krupavicius, Algis (2013). Exploring Features for Named Entity Recognition in Lithuanian Text Corpus. In Oepen, Stephan; Hagen, Kristin & Johannessen, Janne Bondi (Eds.), Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013). Linköping University Electronic Press. ISBN 978-91-7519-589-6. pp. 73–88.
Johannessen, Janne Bondi; Priestley, Joel; Hagen, Kristin; Nøklestad, Anders & Lynum, Andre (2012). The Nordic Dialect Corpus. In Calzolari, Nicoletta; Choukri, Khalid; Declerck, Thierry; Ugur Dogan, Mehmet; Maegaard, Bente; Mariani, Joseph & Odijk, Jan (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation. European Language Resources Association. ISBN 978-2-9517408-7-7. pp. 3388–3391.
Johannessen, Janne Bondi; Hagen, Kristin; Lynum, André & Nøklestad, Anders (2012). OBT-stat A combined rule-based and statistical tagger. In Andersen, Gisle (Ed.), Exploring newspaper language : using the web to create and investigate a large corpus of modern Norwegian. John Benjamins Publishing Company. ISBN 978-90-272-0354-0. pp. 51–65. doi: 10.1075/scl.49.03joh.
Lynum, Andre; Hagen, Kristin; Johannessen, Janne Bondi & Nøklestad, Anders (2011). OBT+Stat: Evaluation of a combined CG and statistical tagger. NEALT Proceedings Series. ISSN 1736-8197. 14, pp. 26–34.
Hagen, Kristin & Nøklestad, Anders (2010). Bruk av et norsk leksikon til tagging og andre språkteknologiske formål. LexicoNordica. ISSN 0805-2735. pp. 55–72.
Johannessen, Janne Bondi; Hagen, Kristin; Nøklestad, Anders & Priestley, Joel (2010). Enhancing Language Resources with Maps. In Calzolari, Nicoletta; Choukri, Khalid; Maegaard, Bente; Mariani, Joseph; Odijk, Jan; Piperidis, Stelios; Rosner, Mike & Tapias, Daniel (Eds.), Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10). European Language Resources Association. ISBN 2-9517408-6-7. pp. 1081–1088.
Johannessen, Janne Bondi; Priestley, Joel & Nøklestad, Anders (2010). A Multilingual Speech Resource: The Nordic Dialect Corpus. In Otoguro, Ryo; Ishikawa, Kiyoshi; Umemoto, Hiroshi; Yoshimoto, Kei & Harada, Yasunari (Eds.), Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation. Waseda University. ISBN 978-4-905166-00-9. pp. 749–758.
Lindstad, Arne Martinus; Nøklestad, Anders; Johannessen, Janne Bondi & Vangsnes, Øystein Alexander (2009). The Nordic Dialect Database: Mapping Microsyntactic Variation in the Scandinavian Languages. NEALT Proceedings Series. ISSN 1736-8197. 4.
Johannessen, Janne Bondi; Nygaard, Lars; Priestley, Joel & Nøklestad, Anders (2008). Glossa: a Multilingual, Multimodal, Configurable User Interface. In Calzolari, Nicoletta (Ed.), Proceedings of the 6th International Conference on Language Resources and Evaluation. European Language Resources Association. ISBN 2-9517408-4-0.
Søfteland, Åshild & Nøklestad, Anders (2008). "Manuell morfologisk tagging av NoTa-materialet med støtte fra en statistisk tagger". In Johannessen, Janne Bondi & Hagen, Kristin (Eds.), Språk i Oslo. Ny forskning omkring talespråk. Novus Forlag. ISBN 978-82-7099-471-7. pp. 226–234.
Nøklestad, Anders & Søfteland, Åshild (2007). "Tagging a Norwegian speech corpus". In Nivre, Joakim; Kaalep, Heiki-Jaan; Muischnek, Kadri & Koit, Mare (Eds.), NODALIDA 2007 PROCEEDINGS. University of Tartu. ISBN 978-9985-4-0513-0. pp. 245–248.
Johannessen, Janne Bondi; Hagen, Kristin; Haaland, Åsne; Nøklestad, Anders; Jónsdottir, Andra Björk & Kokkinakis, Dimitris [9 authors in total] (2005). Named Entity Recognition for the Mainland Scandinavian Languages. Literary & Linguistic Computing. ISSN 0268-1145. 20(1), pp. 91–102.
Nøklestad, Anders (2005). Semi-unsupervised PP attachment disambiguation for Norwegian. Archives of Control Sciences. ISSN 0004-072X. 15(1), pp. 385–396.
Nøklestad, Anders (2001). A Connectionist Model of Past Tense Acquisition in Norwegian. In A Cognitive Approach to the Verb: Morphological and Constructional Perspectives. De Gruyter Mouton. ISBN 3-11-017031-0. pp. 165–187.
In cognitive linguistics theories and in connectionist models, both regular (or weak) and irregular (or strong) inflectional morphology are handled by the same kind of mechanism, a mechanism that is sensitive to the phonological form and the type and token frequencies of items in the linguistic input. In this paper I present a connectionist model of past tense acquisition in Norwegian and compare the model's behaviour to results from past tense elicitation experiments with Norwegian speakers. These comparisons, as well as investigations of the model's treatment of novel verbs, show that the model reflects many aspects of past tense inflection among Norwegian speakers, thus lending support to the single mechanism view on the acquisition and processing of inflectional morphology.

Haug, Dag Trygve Truslew; Yildirim, Ahmet; Nøklestad, Anders & Hagen, Kristin (2023). Rules and neural nets for morphological tagging of Norwegian - Results and challenges.
Larsson, Ida; Lundquist, Bjørn; Westendorp, Maud; Nøklestad, Anders & Tengesdal, Eirik (2020). The Nordic Word Order Database.
Golden, Anne; Nøklestad, Anders & Johansson, Sofie (2019). The vocabulary of the Nordic heritage speakers in the US. An attempt of categorization.
Lundquist, Bjørn; Larsson, Ida; Westendorp, Maud; Tengesdal, Eirik & Nøklestad, Anders (2019). Presenting the Nordic Word Order Database.
Borthen, Kaja; Nøklestad, Anders; Søfteland, Åshild; Kveen, Perlaug Marie & Karagjosova, Elena (2018). Database for etterstilte småord i norsk (Småord-databasen).
Johannessen, Janne Bondi; Askeland, Anne Renette; Hagen, Kristin; Håberg, Live; Jensen, Bård Uri & Nøklestad, Anders [8 authors in total] (2018). Utfordringa med fonetisk transkripsjon av dialekter i den digitale tidsalderen: Oslo-translitteratoren.
Nøklestad, Anders; Hagen, Kristin; Johannessen, Janne Bondi; Kosek, Michał & Priestley, Joel (2017). A Modernised Version of the Glossa Corpus Search System.
Johannessen, Janne Bondi; Askeland, Anne Renette; Hagen, Kristin; Håberg, Live; Kristoffersen, Gjert & Nøklestad, Anders [9 authors in total] (2017). Hundre nye dialektordlister.
Søfteland, Åshild & Nøklestad, Anders (2016). Korpus-workshop.
Johannessen, Janne Bondi; Vangsnes, Øystein A; Lundquist, Bjørn; Larsson, Ida; Bentzen, Kristine & Garbacz, Piotr [10 authors in total] (2014). Nye isoglosser illustrert i det nye nettstedet for nordisk språk: NALS – Nordic Atlas of Language Structures (Online).
Priestley, Joel; Johannessen, Janne Bondi; Hagen, Kristin; Nøklestad, Anders & Lynum, André (2012). Maps as a central linguistic research tool.
Johannessen, Janne Bondi; Priestley, Joel; Hagen, Kristin; Nøklestad, Anders & Lynum, André (2012). The Nordic Dialect Corpus.
Nøklestad, Anders; Johannessen, Janne Bondi & Vangsnes, Øystein A (2011). The Nordic Syntactic Judgment Database: informantvurderinger av syntaktiske konstruksjoner i skandinaviske dialekter.
Johannessen, Janne Bondi & Nøklestad, Anders (2010). Recent developments in the Nordic Dialect Corpus and the Nordic Syntactic Judgments Database.
Johannessen, Janne Bondi; Hagen, Kristin; Nøklestad, Anders & Priestley, Joel (2010). Enhancing Language Resources with Maps.
Nøklestad, Anders (2010). Bruk av et norsk leksikon til tagging og andre språkteknologiske formål.
Johannessen, Janne Bondi; Nøklestad, Anders & Priestley, Joel (2007). Developing multi-Scandinavian word-lists for multi-Scandinavian texts.
Johannessen, Janne Bondi; Hagen, Kristin; Laake, Signe; Lindstad, Arne Martinus; Vangsnes, Øystein A. & Åfarli, Tor A. [7 authors in total] (2007). Dialektkorpus - presentasjon av prosjekt, metode, innsamling og materiale.
Søfteland, Åshild & Nøklestad, Anders (2006). "Manuell morfologisk tagging av NoTa-materialet med støtte fra en statistisk tagger".
Nøklestad, Anders (2005). Memory-based PP Attachment Disambiguation for Norwegian.
Samdal, Gunn Inger Lyse & Nøklestad, Anders (2005). A distributional or a translational basis for data-driven sense discrimination?
Nøklestad, Anders; Johansson, Christer & van den Bosch, Antal (2004). Pronominal anaphora resolution in Norwegian using TiMBL and z-scores.
Nøklestad, Anders (2004). Memory-based Classification of Proper Names in Norwegian.
Johannessen, Janne Bondi; Nøklestad, Anders; Hagen, Kristin & Lindstad, Arne Martinus (2000). Det åpne laboratoriet. [Newspaper]. Apollon.
Johannessen, Janne Bondi; Nøklestad, Anders & Hagen, Kristin (2000). A Web-Based Advanced and User Friendly System: The Oslo Corpus of Tagged Norwegian Texts.
A general purpose text corpus meant for linguists and lexicographers needs to satisfy quality criteria on at least four different levels. The first two criteria are fairly well established; the corpus should have a wide variety of texts and be tagged according to a fine-grained system. The last two criteria are much less widely appreciated, unfortunately. One has to do with variety of search criteria: the user should be allowed to search for any information contained in the corpus, and with any combination possible. In addition, the search results should be presented in a choice of ways. The fourth criterion has to do with accessibility. It is a rather surprising fact that while user interfaces tend to be simple and self explanatory in most areas of life represented electronically, corpus interfaces are still extremely user unfriendly. In this paper, we present a corpus whose interface we have given a lot of thought, and likewise the possible search options, viz. the Oslo Corpus of Tagged Norwegian Texts.
Johannessen, Janne Bondi & Nøklestad, Anders (1999). Tavle-analyse ut - data-analyse inn. [Newspaper]. Aftenposten.
Johannessen, Janne Bondi & Nøklestad, Anders (1999). Mot et maksimalt brukervennlig korpus.
Hagen, Kristin; Johannessen, Janne Bondi & Nøklestad, Anders (1999). The shortcomings of a tagger.
The tagger used for the Oslo Corpus of Tagged Norwegian Texts has very good statistical results. In spite of this, it makes mistakes. In this paper we take a closer look at some of them. Although some mistakes are of a kind that would disappear if we improved the tagger, many are impossible or very difficult to do anything about. They are due to errors in the corpus (spelling errors, foreign words, non-standard spellings), to elliptic sentences, such as headlines, and to structural ambiguity, which abounds to a surprising extent. Proofreading the corpus would have removed the first kind of problems, but the other two types cannot be resolved in any obvious way.
Johannessen, Janne Bondi & Nøklestad, Anders (1999). Oslo-korpuset av taggede, norske tekster.
Hagen, Kristin; Nøklestad, Anders & Johannessen, Janne Bondi (1998). A Constraint-based Tagger for Norwegian.
Disambiguating morphosyntactic taggers are computer programs which provide the words in a text with grammatical information and which are able to pick the correct reading for ambiguous words based on linguistic context. We describe such a tagger for Norwegian BOKMÅL and NYNORSK which is based on the Constraint Grammar formalism (Karlsson et al. 1995). The tagger disambiguates through the use of linguistic constraints that operate only at the level of individual words, which means that no phrase structure is established. We show how it is possible to perform morphological and syntactic disambiguation of Norwegian texts without having recourse to a phrasal level.
Nøklestad, Anders (2009). A Machine Learning Approach to Anaphora Resolution Including Named Entity Recognition, PP Attachment Disambiguation, and Animacy Detection. Unipub forlag. ISSN 0806-3222.
The thesis describes an automatic anaphora resolution (AR) system for Norwegian, focussing on the resolution of pronominal anaphora in fiction material. The system relies primarily on machine learning (ML) methods, and is the first Norwegian AR system to use machine learning. A set of linguistically motivated filters remove incompatible antecedent candidates before the remaining ones are classified as either antecedent or non-antecedent. The closest candidate classified as a suitable antecedent (if any) is selected as the antecedent of the pronoun. For the classifier, three different machine learning methods are evaluated and compared: memory-based learning (MBL), maximum entropy modelling (MaxEnt), and support vector machines (SVMs). The methods are tested with default as well as automatically optimized parameter settings. Different pronouns are handled by separate classifiers. Two other knowledge-poor approaches, a factor/indicator-based approach and a Centering Theory approach, are compared to the machine learning methods. The best machine learning approaches perform significantly better than the non-ML approaches and significantly better than the only previously existing Norwegian AR system. The thesis also describes the development and evaluation of three support modules providing information to the AR system: a named entity recognizer, a PP attachment disambiguator, and an animacy detector. Various machine learning methods are tested and compared with respect to the first two modules. The PP module introduces a novel kind of semi-supervised learning, while the animacy detector employs two different procedures for using the World Wide Web to obtain animacy information for nouns. The three support modules are evaluated both as standalone NLP tools and as information sources for the AR system. In almost all experiments described in this thesis, MBL performs better than or equally well as MaxEnt, while the performance of the SVMs is significantly worse.

Published 31 May 2010 13:29 - Last modified 17 Nov. 2023 20:27

Projects

BærUt! Kompetansehub for bærekraftige digitale vitenskapelige utgaver (Competence hub for sustainable digital scholarly editions)
