We report on the high success rates of our new, scalable, signer-independent, computational approach to sign recognition from monocular video, exploiting linguistically annotated ASL datasets. We recognize signs using a hybrid framework that combines state-of-the-art learning methods with features based on what is known about the linguistic composition of lexical signs. We model and recognize the sub-components of sign production, with attention to hand shape, orientation, location, and motion trajectories, as well as facial features, and we combine these within a Conditional Random Field (CRF) framework. The effect is to make the sign recognition problem robust, scalable, and feasible with smaller datasets than are required for purely data-driven methods. On a 350-sign vocabulary of isolated, citation-form lexical signs from the American Sign Language Lexicon Video Dataset (ASLLVD), including both 1- and 2-handed signs, we achieve a top-1 accuracy of 93.6% and a top-5 accuracy of 97.9%. The high probability with which we can produce a set of 5 sign candidates that contains the correct result opens the door to potential applications: for example, a sign lookup tool could present the user with 5 possible signs, in decreasing order of likelihood, and ask the user to select the desired one.
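The abstract does not spell out implementation details of the CRF combination step. As a rough illustration only, and not the authors' implementation, the following minimal Python sketch shows one plausible way to combine per-frame sub-component features (hand shape, orientation, location, motion, facial/non-manual cues) in a linear-chain CRF and to produce a ranked top-5 list of sign candidates. The feature field names, the frame_features/to_sequence helpers, and the use of the off-the-shelf sklearn-crfsuite package are all assumptions made for this sketch.

# Illustrative sketch only; assumes per-frame feature dictionaries have
# already been extracted upstream, and uses sklearn-crfsuite as a stand-in
# for the paper's CRF framework.
import sklearn_crfsuite


def frame_features(frame):
    """Map one video frame to a flat feature dict (hypothetical fields)."""
    return {
        "handshape": frame["handshape"],          # handshape class label
        "orientation": frame["orientation"],      # palm/finger orientation bin
        "location": frame["location"],            # coarse location on the body
        "motion": frame["motion"],                # quantized motion direction
        "nonmanual": frame.get("nonmanual", "-")  # facial-feature descriptor
    }


def to_sequence(video_frames, sign_label=None):
    """Turn one sign production into a CRF observation (and label) sequence."""
    X = [frame_features(f) for f in video_frames]
    y = [sign_label] * len(video_frames) if sign_label is not None else None
    return X, y


def train(videos, glosses):
    """videos: list of frame lists; glosses: the sign label for each video."""
    pairs = [to_sequence(v, g) for v, g in zip(videos, glosses)]
    X = [x for x, _ in pairs]
    y = [labels for _, labels in pairs]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=200)
    crf.fit(X, y)
    return crf


def top_k_signs(crf, video_frames, k=5):
    """Average per-frame marginals and return the k most likely glosses."""
    X, _ = to_sequence(video_frames)
    marginals = crf.predict_marginals([X])[0]     # one {gloss: prob} per frame
    scores = {}
    for frame_probs in marginals:
        for gloss, p in frame_probs.items():
            scores[gloss] = scores.get(gloss, 0.0) + p / len(marginals)
    return sorted(scores, key=scores.get, reverse=True)[:k]

In this sketch, the top_k_signs function returns the ranked list of 5 candidates that the abstract envisions for a sign lookup application.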
@InProceedings{METAXAS18.18005,
  author    = {Dimitri Metaxas and Mark Dilsizian and Carol Neidle},
  title     = {Scalable ASL Sign Recognition using Model-based Machine Learning and Linguistically Annotated Corpora},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year      = {2018},
  month     = {May},
  date      = {7-12},
  location  = {Miyazaki, Japan},
  editor    = {Mayumi Bono and Eleni Efthimiou and Stavroula-Evita Fotinea and Thomas Hanke and Julie Hochgesang and Jette Kristoffersen and Johanna Mesch and Yutaka Osugi},
  publisher = {European Language Resources Association (ELRA)},
  address   = {Paris, France},
  isbn      = {979-10-95546-01-6},
  language  = {english}
}