An introduction to The English-Norwegian Parallel Corpus (ENPC)

The corpus is intended as a general research tool, available for applied and theoretical linguistic research.

It started out as a research project at the Department of British and American Studies, University of Oslo, in 1994. The corpus was completed in 1997.

In the period 1997-2001 the corpus was extended to include more languages (German, Dutch, Portuguese), and the English and the Norwegian original texts were tagged for part of speech.

The manual was completed in 1999 and revised in 2002.

Fiction and non-fiction

The focus has been on novels and fairly general non-fictional books. In order to include material by a range of authors and translators, the texts of the corpus are limited to text extracts (chunks of 10,000-15,000 words).

The fiction part of the corpus contains 30 original text extracts in each language and their translations, whereas the non-fiction part contains 20 in each direction.
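
The figures above imply a rough size for the corpus. The short calculation below is only a back-of-the-envelope sketch: it assumes that each original extract has exactly one translation, and it applies the 10,000-15,000 word range to the originals only, since the length of the translations is not stated here. The variable names are illustrative.

```python
# Figures taken from the description above.
fiction_originals_per_language = 30
non_fiction_originals_per_language = 20
languages = 2  # English and Norwegian
words_per_extract = (10_000, 15_000)  # lower and upper bound per extract

# Original extracts across both languages.
originals = languages * (
    fiction_originals_per_language + non_fiction_originals_per_language
)

# Assumption: one translation per original, so the text count doubles.
texts = originals * 2

# Word range covered by the original extracts alone.
word_range = (originals * words_per_extract[0], originals * words_per_extract[1])

print(originals, texts, word_range)
```

Under these assumptions the corpus holds 100 original extracts and 200 texts in all, with the originals alone accounting for between 1 and 1.5 million words.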

Text-based contrastive studies

The comparison of languages is of great interest from a theoretical as well as an applied perspective.

It reveals what is general and what is language-specific, and is therefore important both for understanding language in general and for the study of the individual languages compared.

The analysis has applications within lexicography, language teaching, and translation studies.

Recently there has been a revival of interest in contrastive studies, partially due to the increasing internationalization of society and the growing need for advanced bilingual and multilingual competence.

At the same time, linguistics has become increasingly concerned with the study of language in context, with the emergence of fields like text linguistics, discourse analysis, and pragmatics. 

Text-based contrastive studies can benefit from the progress in computer processing of texts, which has been a major area of research at the Department of British and American Studies, University of Oslo, and the Norwegian Computing Centre for the Humanities, University of Bergen.

The present project extends this work to computer processing of parallel texts.

Aim

The aim of the project was to:

1. Compile a parallel corpus of English and Norwegian texts for computer processing

2. Develop tools for analysing parallel texts

3. Carry out studies of the structure and communicative use of the two languages on the basis of the corpus

Areas to be studied included:

  • Presentative constructions in English and Norwegian (Jarle Ebeling).
  • Word order and information structure in English and Norwegian (Hilde Hasselgård).
  • Lexical comparison of English and Norwegian (Kay Wikberg).

Examples of more general questions to be addressed are:

  • To what extent are there parallel differences in text genres across languages?
  • In what respects do translated texts differ from comparable original texts in the same language?
  • Are there any features in common among translated texts in different languages (and, if so, what are these features)?

The aim of studying translated texts was not to reveal translation mistakes, but rather to use the work of translators as a resource for contrastive analysis and the study of translation problems.

General research tool

The parallel corpus is planned as an open text bank and will be expanded as resources allow.

It is intended as a general research tool, available beyond the present project for applied and theoretical linguistic research.

The process of compiling the corpus has taken four years. A lot of work has gone into the development of software and into the preparation of the texts.

Text Encoding Initiative (TEI)

The coding system used to mark up the ENPC follows the suggestions of the Text Encoding Initiative (TEI) as presented in Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen & Burnard, 1994).

The texts are marked up with start-tags (<..>) and end-tags (</..>). The most important tags mark paragraphs (<p>...</p>) and sentence boundaries (<s>...</s>):

<p><s>These are the myths of beginnings.</s> <s>These are stories and moods deep in those who are seeded in rich lands, who still believe in mysteries.</s></p>
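
As an illustration of how such mark-up can be processed, the sketch below pulls the sentences out of the paragraph shown above using Python's standard library. It assumes the fragment is well-formed XML, which this sample happens to be; the corpus files themselves may carry further tags and attributes and may need an SGML-aware parser. The function name sentences is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample paragraph quoted above, with its <p> and <s> tags.
sample = (
    "<p><s>These are the myths of beginnings.</s> "
    "<s>These are stories and moods deep in those who are seeded in rich lands, "
    "who still believe in mysteries.</s></p>"
)


def sentences(paragraph_xml):
    """Return the text of every <s> element inside one <p> element."""
    paragraph = ET.fromstring(paragraph_xml)
    return ["".join(s.itertext()).strip() for s in paragraph.iter("s")]


sents = sentences(sample)
for number, sentence in enumerate(sents, start=1):
    print(number, sentence)
```

Because sentence boundaries are tagged explicitly, no sentence splitting has to be guessed from punctuation; each <s> element is one unit for searching and alignment.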

After the texts have been scanned, coded, and proofread, they are aligned, i.e. the original text extract is linked to the translated text extract on the sentence level.

The alignment is done automatically by a program developed by Knut Hofland, followed by a manual proofreading stage.
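
To give an idea of what automatic sentence alignment involves, the sketch below links two sentence lists with a simple length-based dynamic programme. It is not the program mentioned above, whose method is not described here; the link types, the penalty values, and the function name align are assumptions chosen for illustration, and the example sentences are invented rather than taken from the corpus.

```python
def align(source, target):
    """Link source and target sentences by comparing character lengths.

    Returns a list of (source_indices, target_indices) pairs.
    """
    # Allowed link types: (source sentences, target sentences, fixed penalty).
    # The penalties are illustrative and favour one-to-one links.
    beads = [(1, 1, 0), (1, 2, 10), (2, 1, 10), (1, 0, 30), (0, 1, 30)]
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in source[i:ni])
                tgt_len = sum(len(t) for t in target[j:nj])
                candidate = cost[i][j] + abs(src_len - tgt_len) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Follow the back-pointers from the end to recover the cheapest path.
    links = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        links.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    links.reverse()
    return links


# Invented example: one English sentence is rendered as two Norwegian ones.
english = ["It was raining.", "She stayed at home and read a book."]
norwegian = ["Det regnet.", "Hun ble hjemme.", "Hun leste en bok."]
print(align(english, norwegian))  # [([0], [0]), ([1], [1, 2])]
```

A purely length-based method like this makes mistakes when translations diverge strongly in length, which is one reason an automatic pass is normally followed by manual proofreading.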

The texts are stored in a database and made searchable in the Translation Corpus Explorer, a browser developed by Jarle Ebeling.

Publications

Publications based on the ENPC, the Oslo Multilingual Corpus and the Multiple-Translation Corpus (Eng-Nor)

Team members

Main researchers

Name Title

Stig Johansson (project leader, language)

Professor of English language at the Department of British and American Studies, University of Oslo

Knut Hofland (project leader, programming)

System manager at the Norwegian Computing Centre for the Humanities, Bergen

Jarle Ebeling

Research assistant at the Department of British and American Studies, University of Oslo (1994 - 1995), later research fellow

Signe Oksefjell

Research assistant at the Department of British and American Studies, University of Oslo (1995 - 1998)

Associate researchers

Name Title

Cathrine Fabricius-Hansen 

Professor of German language at the Department of Germanic Studies, University of Oslo

Hilde Hasselgård

Associate professor of English language at the Department of British and American Studies, University of Oslo

Kay Wikberg

Professor of English language at the Department of British and American Studies, University of Oslo

Cooperation

The project was carried out in cooperation with a research group in Sweden (headed by Bengt Altenberg and Karin Aijmer) and with a similar research team in Finland (University of Jyväskylä).

The aim of the cooperation with other contrastive teams was to facilitate multilingual comparison. There were also important gains in corpus compilation.

The tagging of the original English texts was done by Atro Voutilainen, researcher at the Research Unit for Multilingual Language Technology, Department of General Linguistics, University of Helsinki.

The dialogue marking in the original fiction texts in both languages was carried out by Berit Løken.

Published Dec. 5, 2022 12:46 PM - Last modified Feb. 2, 2023 1:42 PM