SCOTS - Corpus Details

Content

The SCOTS corpus currently contains over 1300 written and spoken texts, totalling over 4.5 million words of running text. 77% of this total is made up of written texts and 23% is made up of spoken texts, which are provided in the form of an orthographic transcription, synchronised with the source audio or video.

Our policy has been to include whole texts where possible, as this makes a corpus more valuable for the study of discourse features which may occur predominantly in certain parts of a text (e.g. greeting formulae at the beginnings and ends of personal correspondence). In a few cases, owing to copyright permissions, only an extract is included: these texts can be identified by titles which include the word “extract(s)”.

SCOTS aims to cover the period from 1945 to the present day (the latest documents in the corpus currently date from 2011). Owing to text availability, however, the majority of texts are from the latter part of this period. In particular, most of the spoken texts have been recorded since 2000, specifically for the SCOTS project.

SCOTS has sought to do justice to the wide range of texts in varieties of Scots and Scottish English today: texts of different language varieties, genres and registers; speakers and writers from as wide a range of geographical locations as possible; speakers and writers of different backgrounds, ages, genders, occupations, and so on. Nevertheless, SCOTS is not a truly representative corpus. Issues of permissions, copyright and availability meant that certain types of texts were very difficult to obtain (for example, newspaper articles, personal diaries, business correspondence). In addition to Scots and Scottish English, there are a small number of texts in Scottish Gaelic: these and other texts on Gaelic subject matter can be found by searching for texts with the word ‘Gaelic’ in their title.

Users interested in statistical information, such as the relative frequencies of certain words in different genres, can exploit the SCOTS data in various ways. The Advanced search allows search criteria to be defined so that a word search is limited to texts of a particular mode (spoken or written), genre (correspondence, fiction, interview), or other set of criteria (conversations after 2000, published texts only, texts written for specialists, and so on). In this way the results of different searches may be compared.
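
When comparing searches across subsets of different sizes, raw hit counts are usually normalised first. The following is a minimal sketch of that step in Python, not part of the SCOTS tools. The subset sizes are rough estimates derived from the figures quoted above (77% written and 23% spoken of roughly 4.5 million words), and the hit counts are hypothetical values chosen purely for illustration.

```python
def per_million(hits: int, total_words: int) -> float:
    """Normalise a raw hit count to occurrences per million words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return hits * 1_000_000 / total_words

# Approximate subset sizes, estimated from the percentages quoted in the text.
TOTAL_WORDS = 4_500_000
sizes = {
    "written": TOTAL_WORDS * 77 // 100,
    "spoken": TOTAL_WORDS * 23 // 100,
}

# Hypothetical hit counts for one search word; NOT real SCOTS results.
hits = {"written": 693, "spoken": 414}

rates = {mode: per_million(hits[mode], sizes[mode]) for mode in sizes}
ratio = rates["spoken"] / rates["written"]

for mode, rate in rates.items():
    print(f"{mode}: {rate:.1f} per million words")
```

In practice the denominators would be the word totals of the texts actually matched by each set of search criteria, rather than these estimates.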

Depending on the user’s aims, groups of texts might be used as research data in themselves. Glancing down the complete list of documents ordered by title (or by author) will reveal a number of texts which fall into natural groupings and may suit the research purpose in mind. These include:

a collection of letters by members of one family dating from 1945 to 1989 (titles begin “Biggam Collection Letter…”);
a collection of 59 conversations between mothers and their young children in Buckie, Moray (recorded and donated to SCOTS by Dr Jennifer Smith, titles begin “Conversation: Buckie…”);
a series of letters written by a postgraduate student during a year studying in Canada to his family back home (titles begin “Correspondence from Canada…”);
four personal diaries written between 1952 and 1972 (“Diaries of William Young…”);
a collection of 62 poems and prose pieces by James Begg and John Reid, some of which are accompanied by audio recordings (titles begin “Dipper…”);
a number of interviews with people talking about language issues in Scotland (“Interview…”);
a substantial set of documents of different genres from the Scottish Parliament, some of which are written in Scots or include short sections of transcribed Scots (“Scottish Parliament…”).

Content labels

As a resource for linguistic research, SCOTS aims to represent language as it is actually used. Consequently, some documents contain language which some users may find offensive. We have attempted to indicate such documents by means of Content labels, in order that users have enough information to make an informed decision regarding the suitability of the material. The two content labels are as follows:

Content label: lesser - This document contains language which some may find offensive
Content label: serious - This document contains strong or offensive language

Annotation

In its current form, the SCOTS Corpus is not grammatically annotated. Some mark-up is used in transcriptions of spoken material (see the section on Transcriptions below), and to mark where personal information has been censored in written and spoken texts.

Analysis tools

The Advanced search provides greater search flexibility than the Standard search, basic statistical information, a concordancer and a map on which results are plotted (see the help page for more information).

Documents may also be downloaded for personal research. Documents may not be used for commercial purposes: copyright restrictions apply to all documents in SCOTS, and copyright remains with the original holder. Please see the Terms and Conditions for more information.

Spelling and variation

The SCOTS Corpus contains documents in Scottish Standard English, documents in different varieties of Scots, and documents which may be described as lying somewhere between Scots and Scottish Standard English. While Scottish Standard English has a standard written form, Scots does not. This means that the corpus contains a wide range of variation in spelling. We hope to offer a means of searching for all of the variant spellings automatically in the future. In the meantime, we recommend the online Dictionary of the Scots Language as an excellent source of possible variants.
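
Until automatic variant searching is available, users working with downloaded documents could search for several spellings in one pass. The sketch below is one possible approach in Python, not a SCOTS feature. The variant list and the sample sentence are illustrative assumptions; a real list should be compiled from the dictionary entry for the word in question.

```python
import re

def variant_pattern(variants: list[str]) -> re.Pattern:
    """Build a case-insensitive pattern matching any of the given spellings as whole words."""
    alternatives = "|".join(re.escape(v) for v in variants)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

# Illustrative variant list (assumption): check it against the dictionary before use.
pattern = variant_pattern(["guid", "gude"])

# Invented sample sentence, not taken from the corpus.
sample = "a guid day and a gude night"
matches = pattern.findall(sample)
print(matches)
```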

Transcriptions

Transcriptions of audio recordings were made using the Praat software, which enabled them to be time-stamped and synchronised with the recording.

Spelling and variation

The general notes above on spelling and variation also apply to transcriptions of spoken material. In addition, with transcribed spoken documents, it was necessary to decide upon conventions in order to make the transcriptions as consistent as possible. Accordingly, where words are clearly Scots forms, rather than Scottish Standard English, we have used the Scots School Dictionary (eds. Iseabail MacLeod and Pauline Cairns, Scottish National Dictionary Association, 1999) as our guide. Where the dictionary offers alternative spellings for a word, the one closest to the speaker’s pronunciation is selected.

Overlap

Stretches where more than one participant is speaking at the same time are marked in the transcription by double slashes ( // ) surrounding the words which overlap:

Speaker 1: …although it might come across //as being arrogant.//
Speaker 2: //[laugh]//

Here Speaker 2’s laugh overlaps with the final three words of Speaker 1’s utterance.
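
For users processing downloaded transcriptions, overlapping stretches can be pulled out with a simple pattern match. This is a minimal Python sketch, assuming the double-slash markers always occur in pairs within a single utterance, as in the example above. The function name is hypothetical.

```python
import re

# Non-greedy match of everything between a pair of double slashes.
OVERLAP = re.compile(r"//(.*?)//")

def overlapping_stretches(utterance: str) -> list[str]:
    """Return the stretches of an utterance that are marked as overlapping."""
    return [stretch.strip() for stretch in OVERLAP.findall(utterance)]

speaker_1 = "…although it might come across //as being arrogant.//"
speaker_2 = "//[laugh]//"

print(overlapping_stretches(speaker_1))
print(overlapping_stretches(speaker_2))
```

Text that contains double slashes for other reasons (a web address, for instance) would confuse this pattern and would need separate handling.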

Mark-up

Transcribers have used the following tags:

Censored

Sometimes words or sequences of words have been censored from documents, principally so that individuals may not be identified. Where this has been done, a Censored tag indicates what has been removed as follows:

“Don't put your fingers in it though, [CENSORED: forename]. Cause you'll be a mucky pup.”

Other items which may be censored include postal addresses, email addresses, place names, phone numbers, and company names.

Where this applies to an audio transcription, the corresponding section of the audio file has been replaced by a beep (or, in certain circumstances, silence). Censoring of personal or sensitive information has occasionally been necessary in written documents too, and is marked in the same way as above.
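
As a rough illustration, the Censored tags in a downloaded document can be tallied by the kind of item removed. The Python sketch below assumes that every tag follows the “[CENSORED: …]” shape shown in the example. Only the “forename” label is attested above, so any other label text is an assumption.

```python
import re
from collections import Counter

# Captures the description after "CENSORED:" up to the closing bracket.
CENSORED = re.compile(r"\[CENSORED:\s*([^\]]+)\]")

def censored_kinds(text: str) -> Counter:
    """Count Censored tags in a text, grouped by what was removed."""
    return Counter(kind.strip() for kind in CENSORED.findall(text))

line = "Don't put your fingers in it though, [CENSORED: forename]. Cause you'll be a mucky pup."
kinds = censored_kinds(line)
print(kinds)
```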

Inaudible

Words or longer stretches which the transcriber has not been able to hear or understand appear as follows:

“Yeah, what kind of cup was this [inaudible]?”

Unclear

Parts of the transcription about which the transcriber and checker are unsure are enclosed in question-mark tags, [?]…[/?]:

“they’re chaffin away [?]crattlin[/?] these toy cups”

Words marked as unclear are not indexed, and do not contribute to the word count.
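
A user reproducing this behaviour on a downloaded transcription might remove the unclear spans before counting words. The following Python sketch shows one way to do so. It assumes that the tags are never nested and that words are separated by whitespace, which is a simplification of whatever tokenisation the corpus itself uses.

```python
import re

# Matches an opening [?] tag, the unclear text, and the closing [/?] tag.
UNCLEAR = re.compile(r"\[\?\].*?\[/\?\]")

def drop_unclear(text: str) -> str:
    """Replace each unclear span, tags included, with a single space."""
    return UNCLEAR.sub(" ", text)

line = "they’re chaffin away [?]crattlin[/?] these toy cups"
words = drop_unclear(line).split()
print(len(words), words)
```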

False starts and truncation

False starts, stammering and truncated words are tagged and appear in the transcription followed by a hyphen:

“nineteen f-f-f- fifty-nine Triumph.”
“Everybody got a Chri- the whole class got a Christmas present.”

These are not indexed, and do not contribute to the word count.
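
The same exclusion can be approximated on the visible transcription text by skipping any token that ends in a hyphen. This Python sketch is an assumption-laden simplification: it relies only on the trailing hyphen, not on the underlying tags, and the function names are hypothetical.

```python
# Punctuation that may follow a token and should be ignored when checking for a hyphen.
TRAILING_PUNCTUATION = ".,;:!?\"'”’"

def is_truncated(token: str) -> bool:
    """True for tokens that end in a hyphen, such as 'Chri-' or 'f-f-f-'."""
    return token.rstrip(TRAILING_PUNCTUATION).endswith("-")

def countable_words(line: str) -> list[str]:
    """Whitespace-separated tokens, minus false starts and truncated words."""
    return [token for token in line.split() if not is_truncated(token)]

examples = [
    "nineteen f-f-f- fifty-nine Triumph.",
    "Everybody got a Chri- the whole class got a Christmas present.",
]
counts = [len(countable_words(example)) for example in examples]
print(counts)
```

Hyphenated words such as “fifty-nine” are kept, because only a hyphen at the end of a token is treated as a truncation marker.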

Semi-lexical items

Semi-lexical items (‘mmhm’, ‘erm’, ‘uh-huh’ etc.) appear unmarked in the transcription, but are tagged in the underlying form.

Speaker 1: er not until I was kind of older
Speaker 2: uh-huh

Non-lexical items

Non-lexical sounds (coughs, sneezes, laughter, yawns etc.) appear between square brackets. These are not indexed and therefore will not be found using the Standard search or Advanced search features. Such items can instead be located on a document page using the web browser’s find function.

“Five years till I’d done my apprenticeship [cough]”
“Yeah, I simply don’t [laugh] really remember.”
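
For downloaded documents, a short script can stand in for the browser’s find function. The Python sketch below collects bracketed items while skipping the other bracketed mark-up described above (Censored, inaudible and unclear tags). The exclusion list is an assumption based only on the tags documented on this page, and bracketed non-linguistic events would also be returned.

```python
import re

# Any bracketed item that contains no further brackets.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")

# Bracketed mark-up that is not a non-lexical sound (assumed list).
MARKUP = {"?", "/?", "inaudible"}

def bracketed_sounds(text: str) -> list[str]:
    """Return bracketed items that are not Censored, inaudible or unclear tags."""
    items = (item.strip() for item in BRACKETED.findall(text))
    return [i for i in items if i not in MARKUP and not i.startswith("CENSORED")]

lines = [
    "Five years till I’d done my apprenticeship [cough]",
    "Yeah, I simply don’t [laugh] really remember.",
]
found = [sound for line in lines for sound in bracketed_sounds(line)]
print(found)
```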

Non-linguistic events

Non-linguistic “events” which may have an impact on language also appear between square brackets. Audible background “events” which do not affect the language used have not been transcribed.