Spoken Corpora

ANDOSL       CSLU      LLC
BAS          DCPSE     LSA
BASE         ELISA     Lucy
BNC Spoken   EUSTACE   MICASE
Cancode      FRED      MARSEC
Christine    IViE      NECTE
COLT         LCIE      TRAINS
CPSA         LEAP      WCSNZE

../ ANDOSL - Australian National Database of Spoken Language
Developed by: School of Electrical Engineering (Sydney University), the National Acoustic Laboratories, the Speech, Hearing and Language Research Centre (Macquarie University), and the Computer Sciences Laboratory (Australian National University)
Size:    
Contents:   "The spoken language data which comprises the first phase of the ANDOSL project was elicited either by written material which was read aloud or by graphical material which was discussed by two speakers thereby generating spontaneous speech."
Access:   Data is available for purchase
Notes:   Last update was in 1998.

../ BAS corpora
Developed by: Bavarian Archive for Speech Signals (BAS), University of Munich
Size:    
Contents:   The BAS corpora vary in size and content. They include read speech, spontaneous speech, and telephone speech, among other types.
Access:   Listen to example audio files
Notes:    

../ BASE - British Academic Spoken English Corpus
Developed by: Universities of Warwick and Reading, UK
Size:   160 lecture and 40 seminar recordings (info recorded Jan. 2005)
Contents:   Digital Video Recordings of seminars and lectures across different disciplines
Access:   the project is still under development
Notes:   BASE is developed as a companion to the MICASE corpus

../ BNC Spoken Corpus
Developed by: BNC Consortium
Size:   approx. 10 million words
Contents:    
Access:    
Notes:   The spoken part makes up roughly 10% of the British National Corpus.

../ CANCODE - Cambridge and Nottingham Corpus of Discourse in English
Developed by: Cambridge University Press
Size:   5 million words
Contents:   spoken materials; spontaneous speech only; recordings collected between 1995 and 2000
Access:   Access is currently restricted to members of Cambridge University Press
Notes:   "[A]ll the recordings have been coded according to the relationship between the speakers."

../ CHRISTINE Corpus
Developed by: Geoffrey Sampson
Size:    
Contents:   spontaneous spoken British English
Access:    
Notes:   Structurally annotated - CHRISTINE is a treebank

../ COLT - Bergen Corpus of London Teenage Language
Developed by: Department of English, University of Bergen, Norway
Size:   500,000 words; pilot version consists of 151 texts
Contents:   transcripts of spoken 'London Teenage Language'; materials were collected in 1993
Access:   available on the ICAME CD-ROM, check out the online sample
Notes:   COLT Manual (doc), (pdf)

../ CPSA - Corpus of Spoken Professional American English
Developed by: Michael Barlow, Athelstan
Size:   2 main sub-corpora, 1 million words each
Contents:   short interchanges by 400 speakers - professional activities broadly tied to academics and politics; materials recorded between 1994 and 1998
Access:   Registered users only ($49 for the individual license)
Notes:   The CPSA is also available tagged.

../ CSLU Spoken Corpora
Developed by: Center for Spoken Language Understanding
Size:    
Contents:   The CSLU hosts a number of different spoken corpora (English and other languages) ranging from telephone conversations to spontaneous utterances from children.
Access:   Available for ordering
Notes:    

../ DCPSE - Diachronic Corpus of Present-day Spoken English
Developed by: Department of English (Survey of English Usage), University College London
Size:    
Contents:   "The project aims to construct a fully parsed and searchable diachronic corpus of spontaneous spoken English, containing carefully selected and directly comparable texts from the LLC and ICE-GB corpora."
Access:   Available for purchase (Student Licence 25 GBP)
Notes:    

../ ELISA - English Language Interview Corpus as a Second-Language Learning Application
Developed by: Department of Applied English Linguistics, University of Tübingen
Size:   currently ~ 60,000 words
Contents:   spoken English; different varieties of English
Access:   Demo version is available online.
Notes:   "The corpus is intended as a resource for the creation of learning materials as well as for autonomous exploitation by learners."

../ EUSTACE - Edinburgh University Speech Timing Archive and Corpus of English
Developed by: Centre for Speech Technology Research, University of Edinburgh
Size:   4608 spoken sentences
Contents:   Spoken sentences recorded at the Department of Theoretical and Applied Linguistics, University of Edinburgh
Access:   Free of charge; Available for download
Notes:   The EUSTACE website includes detailed documentation.

../ FRED - Freiburg English Dialect Corpus
Developed by: Bernd Kortmann, English Department, University of Freiburg
Size:   2.5 million words, 300 hours of recorded speech
Contents:   "... 372 interviews with male and female speakers from 163 different locations in 43 different counties in 9 major dialect areas."
Access:   Sample texts and audio files available online
Notes:   Detailed documentation available.

../ IViE Corpus - English Intonation in the British Isles
Developed by: Phonetics Laboratory, University of Oxford
Size:   36 hours of speech data
Contents:   Modern or mainstream dialects; Recordings from London, Cambridge, Cardiff, Liverpool, Bradford, Leeds, Newcastle, Belfast in Northern Ireland and Dublin in the Republic of Ireland
Access:   Available for download free of charge; Online search through web interface
Notes:    

../ LCIE - Limerick Corpus of Irish-English
Developed by: University of Limerick in conjunction with Mary Immaculate College, Limerick
Size:   1 million words; 375 transcripts
Contents:   Recorded conversations: casual, professional, transactional, pedagogical conversations
Access:   Under development
Notes:    

../ LEAP - Learning the Prosody of a Foreign Language
Developed by: University of Bielefeld, Germany
Size:   359 recordings, each between 2 and 20 minutes long
Contents:   Read speech, prepared speech, free speech, nonsense word lists
Access:   Speech samples available online
Notes:    

../ LLC - London-Lund Corpus of Spoken English
Developed by: Department of English, Lund University, Sweden
Size:   ~ 500,000 words
Contents:   spoken British English; 100 texts; materials date from 1959 to 1975
Access:   available on the ICAME CD-ROM; check out the online sample
Notes:   The LLC is the result of two projects: the Survey of English Usage (SEU), begun in 1959 at University College London, and the Survey of Spoken English (SSE), begun in 1975 at Lund University.

../ Longman Spoken American Corpus
Developed by: Longman Corpus Network
Size:   5 million words
Contents:   "everyday conversations of more than 1000 Americans of various age groups, levels of education, and ethnicity, and includes speakers from over 30 US States"
Access:   Currently restricted to members of Longman Corpus Network
Notes:    

../ LUCY - Structure in Written English in the UK
Developed by: Geoffrey Sampson
Size:   165,000 words
Contents:   written British English ('polished', young adult and child writing - imaginative and informative); 239 text files, each sample ~2000 words
Access:   Free download
Notes:   Structurally annotated - LUCY is a treebank

../ MICASE
Developed by: English Language Institute at University of Michigan, U.S.
Size:   190 hours recorded (152 transcripts, approx. 1.7 million words)
Contents:   Academic speech (e.g. lectures, colloquia, study groups, etc.); The recordings were made during the period 1997-2001.
Access:   Free access to the online version of the corpus; offline versions of the texts, both tagged and untagged, are available for sale; 70 sound files are available online as RealAudio; MP3 files on CD-ROM will be available for purchase soon.
Notes:    

../ MARSEC - Machine-Readable Spoken English Corpus
Developed by: School of Linguistics, Reading University
Size:    
Contents:   "The Marsec corpus of spoken standard southern British English is a development of the Lancaster/IBM spoken English corpus (SEC)." The acoustic recordings and word-level time alignments are available.
Access:   Available for 200 GBP.
Notes:    

../ NECTE - Newcastle Electronic Corpus of Tyneside English
Developed by: NECTE Project Team
Size:    
Contents:   Based on two pre-existing corpora: Tyneside Linguistic Survey (TLS) and Phonological Variation and Change in Contemporary Spoken English (PVC)
Access:   Free of charge for non-commercial use
Notes:    

../ TRAINS Dialogue Corpus
Developed by: Conversational Interaction and Spoken Dialogue Research Group
Size:    
Contents:   Problem solving dialogues
Access:   Transcriptions of the dialogues are available
Notes:    

../ Wellington Corpus of Spoken New Zealand English
Developed by: School of Linguistics and Applied Language Studies, Victoria University of Wellington
Size:   1 million words
Contents:   materials collected 1988-1994; different proportions of formal, semi-formal and informal speech
Access:   available on the ICAME CD-ROM; sample available online
Notes:    