Title |
Diachronic Changes in Text Complexity in 20th Century English Language: An NLP Approach |
Authors |
Sanja Štajner and Ruslan Mitkov |
Abstract |
A syntactically complex text may pose a problem both for human comprehension and for various NLP tasks. A large number of studies in text simplification address this problem, aiming to transform a given text into a simplified form that is accessible to a wider audience. In this study, we investigated the natural tendency of texts in 20th century English. Are they becoming syntactically more complex over the years, requiring a higher literacy level and greater effort from readers, or are they becoming simpler and easier to read? We examined several factors of text complexity (average sentence length, Automated Readability Index, sentence complexity and passive voice) in the 20th century for the two main English language varieties, British and American, using the 'Brown family' of corpora. In British English, we compared the complexity of texts published in 1931, 1961 and 1991, while in American English we compared the complexity of texts published in 1961 and 1992. Furthermore, we demonstrated how state-of-the-art NLP tools can be used for the automatic extraction of some complex features from the raw text version of the corpora. |
Topics |
Tools, systems, applications, Corpus (creation, annotation, etc.), Information Extraction, Information Retrieval |
Full paper |
Diachronic Changes in Text Complexity in 20th Century English Language: An NLP Approach |
Bibtex |
@InProceedings{TAJNER12.355,
  author = {Sanja Štajner and Ruslan Mitkov},
  title = {Diachronic Changes in Text Complexity in 20th Century English Language: An NLP Approach},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
} |
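
Two of the complexity factors named in the abstract, average sentence length and the Automated Readability Index (ARI), can be illustrated with a minimal sketch. The ARI formula used here is the standard one (4.71 x characters per word + 0.5 x words per sentence - 21.43). The regex-based sentence splitter and tokenizer are simplifying assumptions for illustration only, not the tools used in the paper, and all function names are hypothetical.

```python
import re


def split_sentences(text):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    # Assumption: a word is a run of letters, digits or apostrophes.
    return re.findall(r"[A-Za-z0-9']+", sentence)


def _counts(text):
    # Return (characters, words, sentences) for the given text.
    sentences = split_sentences(text)
    words = [w for s in sentences for w in tokenize(s)]
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    # ARI counts letters and digits only, so apostrophes are skipped.
    chars = sum(1 for w in words for ch in w if ch.isalnum())
    return chars, len(words), len(sentences)


def average_sentence_length(text):
    # Mean number of words per sentence.
    _, n_words, n_sentences = _counts(text)
    return n_words / n_sentences


def automated_readability_index(text):
    # Standard ARI formula applied to the naive counts above.
    chars, n_words, n_sentences = _counts(text)
    return 4.71 * chars / n_words + 0.5 * n_words / n_sentences - 21.43


if __name__ == "__main__":
    sample = "The cat sat. The dog ran away quickly."
    print(average_sentence_length(sample))
    print(automated_readability_index(sample))
```

Applied separately to samples from each corpus year, these two functions would give per-year values that can then be compared. Sentence complexity and passive voice require syntactic analysis and are not covered by this sketch.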