Title |
Partial Parsing of Spontaneous Spoken French |
Authors |
Olivier Blanc, Matthieu Constant, Anne Dister and Patrick Watrin |
Abstract |
This paper describes the process and the resources used to automatically annotate a French corpus of spontaneous speech transcriptions with super-chunks. Super-chunks are enhanced chunks that can contain lexical multiword units. This partial parsing relies on a preprocessing stage that reformats and tags utterances that break the syntactic structure of the text, such as disfluencies. Spoken specificities were formalized through a systematic linguistic study of a 40-hour-long speech transcription corpus. The chunker uses large-coverage and fine-grained language resources for general written language that have been augmented with resources specific to spoken French. It iteratively applies finite-state lexical and syntactic resources and outputs a finite automaton representing all possible chunk analyses. The best path is then selected by a hybrid disambiguation stage. We show that our system reaches scores that are comparable with state-of-the-art results in the field. |
Topics |
Parsing, Speech resource/database, MultiWord Expressions & Collocations |
Full paper |
Partial Parsing of Spontaneous Spoken French |
Slides |
- |
Bibtex |
@InProceedings{BLANC10.554,
  author    = {Olivier Blanc and Matthieu Constant and Anne Dister and Patrick Watrin},
  title     = {Partial Parsing of Spontaneous Spoken French},
  booktitle = {Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year      = {2010},
  month     = {may},
  date      = {19-21},
  address   = {Valletta, Malta},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {2-9517408-6-7},
  language  = {english}
} |