Dealing With Variation: Spelling
Dawn Archer, University of Central Lancashire
Paul Rayson, University of Lancaster
Spelling issues tend to create relatively minor (though still complex) problems for corpus linguistics and natural language processing projects that use ‘standard’ or modern varieties of English. For example, in corpus annotation, we have to decide how to deal with tokenisation issues such as whether (i) full stops mark sentence boundaries or indicate acronyms and (ii) apostrophes are quotation marks or mark contractions (Grefenstette and Tapanainen, 1994; Grefenstette, 1999). The issue of spelling variation becomes more problematic when utilising corpus linguistic techniques on non-standard varieties of English (e.g. Tyneside English), different standards of English (e.g. American English as opposed to British English) or historical varieties of English (e.g. Early Modern English), not least because variation can be due to different spelling conventions, transcription practices and morpho-syntactic customs, etc., as well as “misspelling”. Each of these sources of variation may require a different procedure to rectify, as becomes clear when we consider studies that have explored:
- Varieties such as Scottish English (Anderson et al., forthcoming), and dialects such as Tyneside English (Allen et al., forthcoming)
- Early Modern English (Archer and Rayson, 2004; Culpeper and Kytö, 2005)
- Emerging varieties such as SMS or CMC in weblogs (Ooi et al., forthcoming)
In this talk, we will focus on (spelling) variation in Early Modern English. We will present the particular problems that this variety poses for corpus linguistic techniques such as frequency profiling, corpus annotation, concordancing and n-gram clustering, and offer some solutions that we have developed, including the creation of a variant spelling detector (Rayson et al., 2005).
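To make the idea of variant spelling detection concrete, the following is a minimal sketch in Python. It is not the detector described in Rayson et al. (2005): the toy lexicon, the rewrite rules (u/v and i/j interchange, "vv" for "w", "ie" for "y", loss of final "e") and all function names are assumptions chosen purely for illustration. A token absent from the modern lexicon is flagged, and modern candidates are proposed if they can be reached by applying at most two rules.

```python
# Hypothetical toy lexicon of modern forms; a real system would use a large word list.
MODERN_LEXICON = {"love", "unto", "justice", "go", "always", "the", "king"}

# Illustrative letter-replacement rules (assumed for this sketch only).
RULES = [("u", "v"), ("v", "u"), ("i", "j"), ("vv", "w"), ("ie", "y")]


def one_step(word):
    """Return every form reachable from `word` by applying one rule once."""
    out = set()
    for old, new in RULES:
        start = word.find(old)
        while start != -1:
            out.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)
    # Also try dropping a final "e" (e.g. "goe" -> "go").
    if word.endswith("e") and len(word) > 2:
        out.add(word[:-1])
    return out


def suggest(word, lexicon=MODERN_LEXICON, max_steps=2):
    """Propose modern forms for a suspected variant; empty list if none found."""
    word = word.lower()
    if word in lexicon:
        return []  # already a modern form, nothing to suggest
    seen = {word}
    frontier = {word}
    for _ in range(max_steps):
        frontier = {v for w in frontier for v in one_step(w)} - seen
        seen |= frontier
        hits = sorted(frontier & lexicon)
        if hits:
            return hits  # prefer candidates needing the fewest rewrites
    return []


def detect_variants(tokens, lexicon=MODERN_LEXICON):
    """Map each token not found in the lexicon to its suggested modern forms."""
    return {t: suggest(t, lexicon) for t in tokens if t.lower() not in lexicon}


if __name__ == "__main__":
    print(detect_variants(["The", "king", "loue", "vnto", "iustice", "alwaies"]))
```

A sketch of this kind also shows why the problem is hard: a form such as "kynge" is flagged as unknown but receives no suggestion, because no rule here covers the y/i interchange, and a historical spelling that coincides with a different modern word would not be flagged at all.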
1. Allen, W., Beal, J.C., Corrigan, K.P., Maguire, W. and Moisl, H. (forthcoming) ‘Taming Unconventional Digital Voices: The Newcastle Electronic Corpus of Tyneside English’, in Beal, J.C., Corrigan, K.P. and Moisl, H. (eds.) Using Unconventional Digital Language Corpora. Houndmills: Palgrave Macmillan.
2. Anderson, J., Beavan, D. and Kay, C. (forthcoming) ‘The Scottish Corpus of Texts and Speech’, in Beal, J.C., Corrigan, K.P. and Moisl, H. (eds.) Models and Methods in the Handling of Unconventional Digital Corpora. Houndmills: Palgrave Macmillan.
3. Archer, D. and Rayson, P. (2004) Using an historical semantic tagger as a diagnostic tool for variation in spelling. Presented at Thirteenth International Conference on English Historical Linguistics (ICEHL 13) University of Vienna, Austria 23-29 August, 2004.
4. Culpeper, J. and Kytö, M. (2005). Exploring speech-related Early Modern English texts: lexical bundles re-visited. Presented at the 26th conference of ICAME (International Computer Archive of Modern and Medieval English), University of Michigan, USA, May 2005.
5. Grefenstette, G. (1999) Tokenization. In van Halteren, H. (ed.) Syntactic wordclass tagging, Kluwer, The Netherlands, pp. 117-133.
6. Grefenstette, G. and Tapanainen, P. (1994) What is a Word, What is a Sentence? Problems of Tokenization. In Proceedings of the 3rd conference on Computational Lexicography and Text Research (COMPLEX’94), Budapest, July 7-10, 1994, pp. 79-87.
7. Ooi, V. B. Y., Tan, P. K. W. and Chiang, A. K. L. (forthcoming) Analysing weblogs in a speech community using the WMatrix approach. To be presented at the 27th conference of the International Computer Archive of Modern and Medieval English (ICAME), University of Helsinki, Finland, 24-28 May, 2006.
8. Rayson, P., Archer, D. and Smith, N. (2005) VARD versus Word: A comparison of the UCREL variant detector and modern spell checkers on English historical corpora. In Proceedings of the Corpus Linguistics 2005 conference, July 14-17, Birmingham, UK.