Title |
STeP-1: A Set of Fundamental Tools for Persian Text Processing |
Authors |
Mehrnoush Shamsfard, Hoda Sadat Jafari and Mahdi Ilbeygi |
Abstract |
Many NLP applications need fundamental tools to convert the input text into an appropriate form or format and to extract the primary linguistic knowledge of words and sentences. These tools segment text into sentences, words and phrases, check and correct spelling, and perform lexical and morphological analysis, POS tagging and so on. Persian is among the languages with complex preprocessing tasks. Differing writing prescriptions, spacing between or within words, character encodings and spellings are some of the difficulties and challenges in converting various texts into a standard one. The lack of fundamental text processing tools, such as a morphological analyser (especially for derivational morphology) and a POS tagger, is another problem in Persian text processing. This paper introduces a set of fundamental tools for Persian text processing in the STeP-1 package. STeP-1 (Standard Text Preparation for Persian language) performs a combination of tokenization, spell checking, morphological analysis and POS tagging. It also turns all Persian texts with different prescribed forms of writing into a series of tokens in the standard style introduced by the Academy of Persian Language and Literature (APLL). Experimental results show high performance. |
Topics |
Tools, systems, applications, Morphology, Part of speech tagging |
Full paper |
STeP-1: A Set of Fundamental Tools for Persian Text Processing |
Slides |
- |
Bibtex |
@InProceedings{SHAMSFARD10.809,
  author = {Mehrnoush Shamsfard and Hoda Sadat Jafari and Mahdi Ilbeygi},
  title = {STeP-1: A Set of Fundamental Tools for Persian Text Processing},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year = {2010},
  month = {may},
  date = {19-21},
  address = {Valletta, Malta},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {2-9517408-6-7},
  language = {english}
} |