From final report submitted to Engineering and Physical Sciences Research Council, February 2004

1. Background/Context

Scotland contains a rich variety of languages. A recent survey of 300 respondents revealed that over 30 different languages are regularly spoken at home and at work (ILS Report 2003). In addition to the Scottish English spoken by the majority, there are substantial numbers of speakers of the other indigenous languages, with an estimated 58,650 Gaelic speakers at the 2001 census, and around half the population claiming knowledge of Scots (Macafee 2000). Non-indigenous languages include Arabic, Bengali, Cantonese, Dutch, Hindi, Italian, Kurdish, Polish, Panjabi, Romany and Urdu. British Sign Language has a flourishing Scottish variety. Such multilingualism within a single country is an increasingly common phenomenon around the world, making the results of our research generalisable to other contexts through development of appropriate language engineering tools.

There is an interest in Scots language and culture in many parts of the world, with research being done in places as far apart as North America, Finland, Germany, Holland and Japan. More generally, there is an interest among linguists in the relationship between standard and non-standard languages; the Scottish-English to Broad Scots continuum is a case study of such relationships. As a language without an accepted standard written form, Broad Scots presents particular problems for corpus building, notably in transcribing spoken data and in lemmatizing under search-words, as does Gaelic. Such problems are increasingly encountered world-wide as the race goes on to develop written forms of languages or record endangered languages before they vanish.

2. Key Advances and Supporting Methodology

We would identify three major achievements in Glasgow:
(a) A powerful database has been developed, combining administrative functions with storage of metadata and the texts themselves.
(b) The metadata elicited for the texts provides a sophisticated tool for linguistic research, more extensive than in many corpora of standard English.
(c) We have collected and processed around one million words of written and spoken text.

The SCOTS database is a multi-user, scalable system. It contains the corpus and its metadata and manages the administrative functions of the project. These comprise storage of contact details (contributors, copyright holders, audio/video participants, authors, other interested parties); automated generation of letters etc. containing information relevant to specific individuals; management of mailing lists; tracking of correspondence, emails and phone calls; and tracking of legal agreements (copyright, data protection, parent/teacher permission for minors, performance rights). These administrative functions are fully integrated with the metadata and text storage functions of the database, which has proved invaluable in making the best use of staff time. The metadata elicitation forms have been replicated electronically for ease of data entry. The information thus stored provides the basis of the final corpus web output. Data storage goes beyond simple flat files, and multiple concurrent users are fully supported, which aids large-scale data entry. Document contents are accessed via the unified interface, removing the need for file naming and directory organisation.
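As an illustration of how administrative tracking and document storage can share one database, the following minimal sketch links documents to the legal agreements tracked for them. It is not the project's actual schema: the table names, column names and the cleared_documents function are assumptions made for this example, and Python's built-in sqlite3 module stands in for the project's database server.

```python
import sqlite3

# Hypothetical, much-simplified schema; the real database holds far more detail.
SCHEMA = """
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE document (
    document_id    INTEGER PRIMARY KEY,
    title          TEXT NOT NULL,
    contributor_id INTEGER NOT NULL REFERENCES person(person_id)
);
CREATE TABLE agreement (
    agreement_id INTEGER PRIMARY KEY,
    document_id  INTEGER NOT NULL REFERENCES document(document_id),
    kind         TEXT NOT NULL,              -- e.g. 'copyright', 'data protection'
    received     INTEGER NOT NULL DEFAULT 0  -- 1 once the signed form is back
);
"""


def open_database():
    """Create an in-memory database with the illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def cleared_documents(conn):
    """Titles of documents with at least one agreement and none outstanding."""
    rows = conn.execute(
        """
        SELECT d.title
        FROM document AS d
        JOIN agreement AS a ON a.document_id = d.document_id
        GROUP BY d.document_id
        HAVING MIN(a.received) = 1
        ORDER BY d.document_id
        """
    ).fetchall()
    return [title for (title,) in rows]
```

Because the agreements sit beside the documents they govern, a single query can answer an administrative question ("which permissions are still outstanding?") and an editorial one ("which texts may be released?") without cross-referencing separate files.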

The database currently exports all documents that have satisfied legal requirements to a separate database solely for use by the online search system. Searches are performed directly by end-users via forms on the website; various fields and combinations can be used as search criteria. These field-based searches can be supplemented by more advanced searches on the text or transcription of the documents, including variant spelling checks or the use of regular expressions. Results are rendered primarily as HTML and delivered back to the user. Optionally the user can request the data set returned as plain text. Our website makes full use of web standards (XHTML, CSS etc.); this aids accessibility and re-use. A ‘high-contrast’ option is available for users with visual impairments. A pilot version has been tested by Glasgow staff and students and proved fast and user-friendly. An expanded version will be made public in autumn 2004.
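A text search with variant spelling checks and optional regular expressions might work along the following lines. This is a sketch under stated assumptions rather than the search system's actual code: the VARIANTS table, its entries and the function names build_pattern and search are all hypothetical.

```python
import re

# Hypothetical variant table mapping a search term to spellings that may occur
# in the texts. The entries are illustrative only.
VARIANTS = {
    "house": ["house", "hoose", "hous"],
    "good": ["good", "guid", "gude"],
}


def build_pattern(term, use_variants=True, is_regex=False):
    """Compile a case-insensitive pattern for one search term."""
    if is_regex:
        # The user supplied a regular expression; use it as given.
        return re.compile(term, re.IGNORECASE)
    forms = VARIANTS.get(term.lower(), [term]) if use_variants else [term]
    # Escape each form and require word boundaries so that a short form
    # does not match inside a longer word.
    alternation = "|".join(
        re.escape(form) for form in sorted(forms, key=len, reverse=True)
    )
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def search(documents, term, **options):
    """Return (document id, matched text) pairs in document order."""
    pattern = build_pattern(term, **options)
    hits = []
    for doc_id, text in documents.items():
        for match in pattern.finditer(text):
            hits.append((doc_id, match.group(0)))
    return hits
```

Expanding the term into an alternation at query time leaves the stored texts exactly as transcribed, which suits a corpus where the variant spellings are themselves of research interest.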

Forms completed by participants collect demographic, geographical and social information, currently requiring 500 fields. Categories include resource type, text type, setting, medium, audience, text details, author/speaker details and copyright information. The author/speaker category contains information on gender, age, geographic region, education, occupation, religious background, languages used, etc., as well as information on the subject’s parents. Standards described by the TEI Guidelines and Dublin Core have informed the design, and the metadata categories were decided after exhaustive consideration of both linguistic and legal requirements. Material will be published anonymously except where the respondent chooses otherwise (e.g. in the case of creative writers).
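The relationship between a document's metadata, its author/speaker details and the anonymity rule can be pictured with a small data structure. The sketch below is illustrative only: it covers a handful of the categories named above, and the class names, field names and the public_view function are assumptions for this example, not the project's actual design.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Participant:
    """A few of the author/speaker details collected on the forms."""
    name: str
    gender: Optional[str] = None
    age: Optional[int] = None
    region: Optional[str] = None
    occupation: Optional[str] = None
    languages: List[str] = field(default_factory=list)
    wishes_to_be_named: bool = False  # anonymous unless the respondent opts in


@dataclass
class DocumentRecord:
    """A few of the document-level categories."""
    doc_id: int
    resource_type: str
    text_type: str
    medium: str
    audience: Optional[str] = None
    participants: List[Participant] = field(default_factory=list)


def public_view(record):
    """Return the record as a dictionary with names withheld unless permitted."""
    data = asdict(record)
    for participant in data["participants"]:
        # Drop the internal flag and blank the name where no consent was given.
        if not participant.pop("wishes_to_be_named"):
            participant["name"] = None
    return data
```

Keeping the consent flag on the participant rather than on the document allows a named creative writer and an anonymous interviewee to appear in the same record.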

The corpus aims ultimately to provide an electronic home for materials on all the languages of Scotland, but the initial concentration is on Scots and Scottish English. Our appeal for data at the launch of the project produced an overwhelming response. Offers came from 113 people, and included written, spoken and visual materials from a range of genres such as conversation, interviews, correspondence, poetry, fiction and prose. The materials show an expected imbalance overall, with much of the Scots material being poetry or literary prose. We will address this imbalance more thoroughly in Phase 2 of the project. Otherwise, it is clear that Scots survives primarily in speech, and that there should therefore be a concentration on collecting oral data. A style-sheet has been developed for transcribers in consultation with the Newcastle Electronic Corpus of Tyneside English, which is experienced in the orthographic transcription of non-standard data, and will be refined as the project progresses. There is a need for ongoing work on headword lemmatization, both to establish search-words and to create a standard for orthographic transcription. This is currently being done by reference to electronic dictionaries and headword listings provided by Scottish Language Dictionaries, which is also supplying preferred spellings as a first move towards an agreed spelling system.
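The role of a headword listing in lemmatization can be sketched as a lookup from each attested spelling to its headword, with unlisted forms set aside for review. The listing below is a hypothetical illustration, not data supplied by Scottish Language Dictionaries, and the function names are assumptions for this example.

```python
# Hypothetical headword listing: headword -> spellings filed under it.
HEADWORD_LISTING = {
    "hame": ["hame", "haim"],
    "guid": ["guid", "gude", "gweed"],
}


def build_spelling_index(listing):
    """Invert the listing so that each spelling points to its headword."""
    index = {}
    for headword, spellings in listing.items():
        for spelling in spellings:
            index[spelling.lower()] = headword
    return index


def lemmatize(tokens, index):
    """Pair each token with its headword and collect forms the listing lacks."""
    lemmatized = []
    unlisted = []
    for token in tokens:
        headword = index.get(token.lower())
        lemmatized.append((token, headword))
        if headword is None and token.lower() not in unlisted:
            unlisted.append(token.lower())
    return lemmatized, unlisted
```

The list of unlisted forms is the useful by-product here: it shows where the headword listing needs extending as new transcriptions arrive.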

Decisions remain to be made about software tools. As Arts-based corpus users, we have found that by far the most useful tools are concordancers, which retrieve specified words and phrases, their contexts, frequencies, etc., and parsers, which add tags to particular forms. We have also found that we usually prefer to start with plain text, and will therefore not impose any structures on the data but will simply offer appropriate tools of both kinds, taking account of work done elsewhere.
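To show what a concordance tool does with plain text, here is a minimal keyword-in-context sketch. It is not one of the tools under consideration: the tokenizer is deliberately simple, and the function names and the width parameter are assumptions for this example.

```python
import re
from collections import Counter

# A word is a run of letters, optionally joined by apostrophes or hyphens.
TOKEN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def tokenize(text):
    """Split plain text into word tokens, dropping punctuation and digits."""
    return TOKEN.findall(text)


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = tokenize(text)
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def frequencies(text):
    """Count each word form, ignoring case."""
    return Counter(token.lower() for token in tokenize(text))
```

Because the functions take plain text as input, they need no prior markup of the data, in keeping with the preference stated above.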

3. Research Impact and Benefits to Society

The SCOTS corpus is timely both in its relevance to present day Scotland and in the general contribution it can make to developments in places with comparable multilingual situations. We expect it to generate considerable linguistic and literary research in future. We also feel that we have contributed to a general renaissance of interest in linguistic matters in Scotland, as evidenced by the formation of Scottish Language Dictionaries, the electronic Dictionary of the Scots Language, plans for a new Gaelic dictionary and updating of the Linguistic Survey of Scotland, and other projects. A corpus is a necessary and primary part of a complete inventory of a language.

Our efforts in developing the database structure have generated interest in the wider corpus community. The structures we have created for managing the different parts of the administration of the project (a not inconsiderable task, and one which is often overlooked in the literature) form the basis for a reusable generic package for scholars building an online corpus for new linguistic materials. As a first step we will develop an abstracted model of our system, which will allow standard data objects and types to be used freely within it. As open source software is used (MySQL RDBMS, Apache and PHP), the package will be exportable to different software and hardware platforms. We are participating in a pilot study with the University of Oulu, Finland, which requires such a system for a corpus of transitional dialects (Saami, Finnish and Scandinavian). Anderson and Beavan gave a paper on this project to the ALLC/ACH conference in June 2004. Beavan and Kay gave a paper at Sociolinguistics Symposium 15 at Newcastle University in April 2004, in a special colloquium on Models and Methods in the Handling of Electronic Megacorpora.

Throughout the project we have attached great importance to maintaining a high public profile, mainly through meetings and conference papers rather than published work. The launch of the project in January 2001 attracted a lot of media interest, with newspaper articles and radio appearances by members of the team. We have used opportunities such as focus groups for the proposed Institute for the Languages of Scotland to publicise the project, and have been active in the Scottish Parliament’s Cross-Party Language Group. We have also forged links with comparable academic projects.

Members of the team have spoken about SCOTS at national and international conferences such as the Forum for Research into the Languages of Scotland and Ulster, the Association for Scottish Literary Studies, the Scots Language Society, Scottish CILT, the Association for Literary and Linguistic Computing/Association for Computing in the Humanities, Digital Resources in the Humanities, and ICAME (International Computer Archive of Medieval and Modern English). We have held discussions about the structure of the corpus and problems of dealing with non-standard languages with colleagues at other institutions, such as Newcastle University (NECTE project), the Oxford Text Archive, the Refugee Studies project and the British National Corpus.

4. Future Plans

In November 2003 we were awarded a Resource Enhancement Grant of £309,944 from the Arts and Humanities Research Board, which enabled us to start Phase 2 of the project in April 2004.

The project team comprises:
Dr John Corbett, Principal Investigator
Dr Wendy Anderson, Research Assistant
David Beavan, Computing Manager
Flora Edmonds, Louise Edmonds, Cerwyss O’Hare
Jean Anderson, Professor Christian Kay, Dr Jane Stuart-Smith.

Under this grant we will expand the content of the project, with a target of 800 texts / 4 million words, 20% of which will be spoken, and tackle the problem of genre balance. Linguistic work on Scots will be continued by project members, principally Corbett and Stuart-Smith. Through an Outreach Officer funded by the Scottish Arts Council for SLD and ASLS, we will take the corpus to a wider audience, such as schools and writers’ groups. We will develop plans to attract more students to work on the data and will continue to support moves for an Institute for the Languages of Scotland.


