Authors
Milutin Bačlija¹, Jelena Vasiljević¹ and Dhinaharan Nagamalai²
¹University Union, Serbia; ²Wireilla, Australia
Abstract
We present SRPhon2Word, a retrieval-oriented pipeline for isolated-word recognition over large finite lexicons. Given an input recording, a pretrained wav2vec2 model produces frame-level grapheme probabilities, which are matched directly against written lexicon entries in a coarse-to-fine retrieval stage. The lexicon remains editable as plain text, so new candidates can be added without retraining the acoustic model. To keep search practical, lightweight character filtering removes most candidates, probabilistic DTW scores the survivors through indexed probability lookups, and early stopping prunes uncompetitive candidates. In an unbatched and unsharded setting, the system searches a 222,000-form lexicon in 3-5 s per query. We evaluate it on 918 clean recordings from 8 speakers, with noise variants and a dense 39-form morphological partition. Results show a strong improvement over greedy CTC decoding and highlight the role of lexicon construction and candidate ranking.
Keywords
speech recognition, isolated-word recognition, lexical retrieval, dynamic time warping, wav2vec2, grapheme probabilities, phonetic orthography, probabilistic matching, morphologically rich languages, cross-language transfer
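To make the coarse-to-fine search described in the abstract concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that the local DTW cost is the negative log probability of a grapheme at a frame, that each grapheme occupies one or more consecutive frames, and that the coarse filter keeps a word only if every one of its characters reaches some peak probability. The names `prob_dtw` and `retrieve`, the `presence_threshold` parameter, and the omission of the CTC blank token are all illustrative choices.

```python
import math

def prob_dtw(word, probs, index, best_so_far=math.inf):
    """Align `word` to frame-level grapheme probabilities with DTW.

    Assumed local cost: -log P(char | frame). Each character covers one or
    more consecutive frames. Returns math.inf once the candidate can no
    longer beat `best_so_far` (early stopping).
    """
    cols = [index[c] for c in word]          # indexed probability lookups
    prev = [math.inf] * len(word)
    for t, frame in enumerate(probs):
        cur = [math.inf] * len(word)
        for j, col in enumerate(cols):
            local = -math.log(max(frame[col], 1e-12))
            if t == 0:
                before = 0.0 if j == 0 else math.inf
            else:
                before = min(prev[j], prev[j - 1] if j > 0 else math.inf)
            cur[j] = local + before
        # Local costs are non-negative, so the row minimum is a lower bound
        # on the final cost; stop if it already matches or exceeds the best.
        if min(cur) >= best_so_far:
            return math.inf
        prev = cur
    return prev[-1]

def retrieve(lexicon, probs, alphabet, presence_threshold=0.1):
    """Coarse character filter followed by DTW scoring of the survivors."""
    index = {c: i for i, c in enumerate(alphabet)}
    # Characters whose peak probability over all frames passes the threshold.
    present = {c for c in alphabet
               if max(f[index[c]] for f in probs) >= presence_threshold}
    best_word, best_cost = None, math.inf
    for word in lexicon:
        if not word or not set(word) <= present:
            continue                          # coarse stage: cheap rejection
        cost = prob_dtw(word, probs, index, best_cost)
        if cost < best_cost:
            best_word, best_cost = word, cost
    return best_word, best_cost
```

In this sketch the lexicon is just an iterable of strings, which mirrors the point that entries can be added as plain text without touching the acoustic model. A real system would also need to handle the blank token emitted by a CTC-trained wav2vec2 model, which is left out here.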