
../ BOKR - The Russian Reference Corpus
Developed by:  
Size:   approx. 100 million words
Contents:   Contemporary Russian - this corpus is designed to be the Russian equivalent of the British National Corpus
Access:   Access to a pilot version is available (Russian only), and also through Leeds CQP (English interface)
Notes:    

../ BoLC - The Bononia Legal Corpus
Developed by: Centre of Theoretical and Applied Linguistics 'L. Heilmann'
(Rema Rossini Favretti, Fabio Tamburini)
Size:   Italian subcorpus: 33.5 million words / English subcorpus: 21 million words
Contents:   The Bononia Legal Corpus is a multilingual comparable legal corpus
Access:   Online corpus query for authorized users only.
Notes:    

../ Corpus di Italiano Scritto - CORIS
Developed by: CILTA - Centro Interfacoltà di Linguistica Teorica e Applicata
Size:   approx. 100 million words
Contents:   "CORIS contains 100 million words and will be updated every two years by means of a built-in monitor corpus. It consists of a collection of authentic and commonly occurring texts in electronic format chosen by virtue of their representativeness of modern Italian."
Access:   The Corpus Query Form is available to authorised users only. You can request access for research purposes.
Notes:   There is also a DEMO version available.

../ CRATOR Multilingual Aligned Annotated Corpus
Developed by: Lancaster University (Computing Department)
Size:   approx. 1 million words
Contents:   Trilingual: English, French and Spanish - telecommunications texts
Access:   Online access; the text files can also be downloaded via FTP
Notes:   Aligned at sentence level; POS tagged in all three languages

../ Czech National Corpus - CNC
Developed by: Institute of the Czech National Corpus (Faculty of Arts, Charles University, Prague)
Size:   Synchronous (written): 100 million words (SYN2000)
Synchronous (spoken): 800K words (Prague Spoken Corpus)
Diachronic: the bank of diachronic Czech contains 2 million words of transcribed texts, 100K of transliterated texts and 200K of dialect texts
Contents:   "The aim of the project Czech National Corpus is to build up the Czech language corpora and, subsequently, to retrieve information from them. The CNC consists of two main parts: synchronous and diachronic."
Access:   SYN2000 is available free of charge after registration; the 20-million-word SYN2000 PUBLIC can be searched online.
Notes:    

../ ECI MCI - European Corpus Initiative Multilingual Corpus I
Developed by: Distributed by ELSNET
Size:   Several corpora ranging from 4 to 34 million words
Contents:   French, Spanish, Dutch, German and English texts
Access:   Available on CD-ROM
Notes:    

../ ET 10-63 Corpus (bilingual; parallel)
Developed by:  
Size:   1.25 million words in each language
Contents:   English and French official documents on telecommunications
Access:    
Notes:   POS tagged and lemmatized

../ French Corpus
Developed by: Cambridge University Press/Cornell University
Size:   35,303 words
Contents:   "The French corpus is currently comprised of 51 hours of spoken French recorded in Paris, Grenoble, Montpellier and Avignon." (Late 1990s)
Access:    
Notes:    

../ Hellenic National Corpus - HNC/ILSP Corpus
Developed by: Institute of Language and Speech Processing
Size:   46 million words
Contents:   Written modern Greek; published after 1976, mostly after 1990
Access:   Available free of charge through an online interface in English and Greek
Notes:   User guide available

../ Hungarian National Corpus - HNC
Developed by: Department of Corpus Linguistics of the Research Institute for Linguistics of the Hungarian Academy of Sciences (HAS)
Size:   187.6 million words (07/2006)
Contents:   The Hungarian National Corpus is divided into five subcorpora (text genres/regional variety)
Access:   Free of charge (Registration required)
Notes:   This is a balanced reference corpus of written contemporary Hungarian

../ Korpus 2000: Danish Corpus Project
Developed by: The Society for Danish Language and Literature
Size:   approx. 28 million words
Contents:   Various texts written from 1998 to 2002
Access:   Freely available to the public
Notes:   "It is also possible to search the Korpus 90 (1988-1992) which is similar to the Korpus 2000 in its composition and size and hence serves as an older comparative corpus for the Korpus 2000."

../ National Corpus of Irish - Corpas Náisiúnta na Gaeilge
Developed by: Institiúid Teangeolaíochta Éireann
Size:   approx. 30 million words (8 million SGML tagged)
Contents:   Contemporary books, newspapers, periodicals and dialogue
Access:   Corpus available for purchase (€50 for research purposes)
Notes:    

../ Oslo Corpus of Bosnian Texts
Developed by: IMS - Institut für Maschinelle Sprachverarbeitung
Size:   1.5 million words
Contents:   "[It] comprises several different genres: fiction (novels and short stories), essays, children's stories, folklore, islamic texts, legal texts, and newspapers and journals. The texts, written by authors from Bosnia and Herzegovina, have for the most part been published in the 1990s."
Access:   Freely available for non-commercial academic research.
Notes:    

../ PELCRA Reference Corpus of Polish
Developed by: PELCRA
Size:   currently ~ 100 million words; 81,000 texts
Contents:    
Access:   Access possible through a guest account
Notes:   This project is still under development

../ Sejong Balanced Corpus
Developed by: National Academy of the Korean Language
Size:   On-going project
Contents:   The Sejong corpus was built to match the British National Corpus
Access:   Access available (registration required)
Notes:    

../ Slovak National Corpus
Developed by: Jazykovedný ústav Ľ. Štúra SAV
Size:   294 087 581 tokens
Contents:   60.6% journalistic texts, 17.5% fiction, 11.6% specialized texts, 10.3% others (contemporary Slovak language texts)
Access:   Available free of charge: via the WWW interface, or with full access for research (registration required)
Notes:    

../ Uppsala Corpus of Russian
Developed by: Slaviska Institutionen, Uppsala Universitet
Size:   ~ 1 million words
Contents:   "600 Russian texts with a total of one million running words (word tokens), equally divided between informative and literary prose (1960-1989)."
Access:   Online search access (Cyrillic or Latin transliteration) at Tübingen
Notes:    