Parallel Corpora


CLUVI - Linguistic Corpus of the University of Vigo
Developed by: SLI (Computational Linguistics Group of the University of Vigo)
Size:   The main sections contain ~ 8 million words.
Contents:   heterogeneous; spoken and written. The main corpus contains several corpora consisting of literary, legal and scientific texts in Galician.
Access:    
Notes:    

 

Compara
Developed by: Ana Frankenberg-Garcia; Diana Santos
Size:   1 million words and growing
Contents:   "open-ended collection of Portuguese-English and English-Portuguese source texts and translations"
Access:   free
Notes:   Corpus Manual

 

CRATER: Multilingual Aligned Annotated Corpus
Developed by: Computing Department at Lancaster University
Size:   ~ 1 million words
Contents:   trilingual: English, French and Spanish - telecommunications texts
Access:   online access; the text files can also be downloaded via FTP
Notes:   aligned at the sentence level; POS tagged in all three languages

 

ENPC - English Norwegian Parallel Corpus
Developed by: University of Oslo, Norway
Size:   100 original texts and 100 translated texts, amounting to some 2.6 million words in all
Contents:   fictional and non-fictional texts
Access:   Access to the corpus is restricted to the staff and students of the University of Oslo.
Notes:    

 

ESPC - English Swedish Parallel Corpus
Developed by: Bengt Altenberg; Karin Aijmer; Mikael Svensson
Size:   64 English text samples and translations; 72 Swedish text samples and translations; total corpus size 2.8 million words
Contents:   "With few exceptions, the samples have been taken from texts published since 1980. Most major regional varieties of English are represented (British, American, Canadian, Irish, South African) but no attempt has been made to achieve a systematic or 'representative' distribution of these. Only written texts are represented. A number of prepared speeches have been included but they have their origin in writing and do not reflect genuine speech. Other categories that are missing in the corpus are, for example, newspaper text, private letters and business correspondence."
Access:   Restricted to researchers and students at the Universities of Lund and Göteborg
Notes:   Corpus manual available

 

IJS ELAN - Slovene-English Parallel Corpus
Developed by: Dept. of Intelligent Systems, Institute Jozef Stefan
Size:   1 million words from 15 parallel Slovene-English / English-Slovene texts
Contents:    
Access:   free; the corpus can be downloaded or queried via the project's online concordancer
Notes:   "the corpus is tokenised, sentence segmented and aligned; encoded as a translation memory in SGML TEI P3"

 

Oslo Multilingual Corpus
Developed by: Interdisciplinary research project Languages in Contrast (SPRIK) at the University of Oslo
Size:   The OMC is made up of more than 10 sub-corpora of varying sizes.
Contents:   "The Oslo Multilingual Corpus (OMC) is a collection of text corpora comprising original texts and translations from several languages. The various sub-corpora differ in that they contain a different number of languages or a different combination of languages. The OMC provides unique research material for use in contrastive studies and translation studies, as well as in theoretical and applied linguistics."
Access:   Access available for research purposes; application form provided online
Notes:    