This paper will examine the process of creating a digital corpus based on the Würzburg glosses, the earliest large collection of glosses written in the Irish language. Modern editorial standards applied in publications of these glosses can alter spelling, punctuation, and even the semantic meaning of a sentence where one word is used in place of another. Therefore, an understanding of the original orthography utilised by Old Irish scribes is important in determining the orthography which should be utilised in a modern digital corpus. This paper will outline why the text of the Würzburg glosses as it appears in Thesaurus Palaeohibernicus is the best candidate for digitisation. The automated digitisation and proofing process of the corpus will be outlined, and details will be given of a tag-set utilised within the digital corpus in order to preserve information present in Thesaurus Palaeohibernicus as metadata.
@InProceedings{DOYLE18.20, author = {Adrian Doyle ,John McCrae and Clodagh Downey}, title = {Preservation of Original Orthography in the Construction of an Old Irish Corpus.}, booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)}, year = {2018}, month = {may}, date = {7-12}, location = {Miyazaki, Japan}, editor = {Claudia Soria and Laurent Besacier and Laurette Pretorius}, publisher = {European Language Resources Association (ELRA)}, address = {Paris, France}, isbn = {979-10-95546-22-1}, language = {english} }