corpora and databases
British National Corpus Corpora page
Information about the BNC, and links to other English corpora sites.
Corpora, collections, data archives
A large selection of links to corpora of written and spoken languages (chiefly English), from the University of Lancaster.
Child Language Data Exchange System (CHILDES)
The CHILDES system provides tools for studying conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitised audio and video.
UCL Speech Data database
A page with links to the UCL Speaker Database, SCRIBE, EUROM, and the UCL Dysfluency Database.
EUSTACE (Edinburgh University Speech Timing Archive and Corpus of English)
4608 sentences of spoken English provided online by Edinburgh's Centre for Speech Technology Research.
The IViE (Intonational Variability in English) project
Homepage for an Oxford University-based project investigating intonational variability in British and Irish English.
Saarbrücken Corpus of Spoken English (SCoSE)
Recordings of (i) jokes, (ii) stories and (iii) interviews carried out by researchers at Northern Illinois University and the University of the Saarland, Germany, and downloadable as .pdf files. To view them you'll need the Adobe Acrobat Reader, which can be downloaded via the link on the SCoSE homepage.
UCLA Phonetics Lab Language Archive
For over half a century, the UCLA Phonetics Laboratory has collected recordings of hundreds of languages from around the world, providing source materials for phonetic and phonological research. This website, funded by the US National Science Foundation, aims to make the Lab's materials more easily accessible, serving the interests of scholars, speakers, and language learners everywhere.
ToBI corpus
Dozens of sound files (.wav format, available by anonymous FTP) to accompany the guidelines for the use of Ohio State's ToBI (Tones and Break Indices) intonational labelling system. See also the ToBI homepage.
Fromkin Speech Error Database
The 2000 version of the database, provided in XML format by the Max Planck Institute for Psycholinguistics, Nijmegen.
WebCorp
The University of Liverpool's WebCorp is a suite of tools that allows access to the World Wide Web as a corpus. It can be used by anyone with an interest in language and in how particular words and phrases are used, especially those too new or too rare to appear in any dictionary or standard corpus.
The Rosetta Project
'A global collaboration of language specialists and native speakers [whose] goal is a meaningful survey and near permanent archive of 1,000 languages. Our intention is to create a unique platform for comparative linguistic research and education as well as a functional linguistic tool that might help in the recovery or revitalization of lost languages in unknown futures.'
IDEA (International Dialects of English Archive)
Created in 1998 as a resource for actors, this archive comprises recordings of native speakers of English from various parts of the world, and of English spoken in various non-native accents.
Current Corpora at CSLU
Long list of corpora provided by Oregon Health and Science University's Center for Spoken Language Understanding.
W-3 Corpora
Web access to linguistic corpora provided by the University of Essex (site under development).
Legal Language Corpora: summary
Information on and links to corpora made up of legal texts.
Corpus of late 18th C prose
c. 300,000 words of north-western English letters on practical subjects (1761-89), collected by the University of Manchester.
The Indo-European Database (TIED)
The aim of this database is to provide those interested in the language, history and culture of various Indo-European peoples with access to reliable sources of information, including libraries, research centers and academic institutions.