Discriminating poetry and prose using syllable statistics
Gediminas Murauskas
Vilnius University
https://orcid.org/0009-0008-9602-8345
Marijus Radavičius
Vilnius University
Published 2022-12-20
https://doi.org/10.15388/LJS.2022.31988

Keywords

logistic regression
automatic syllabification
cross-validation
training
classification error

How to Cite

Murauskas, G. and Radavičius, M. (2022) “Discriminating poetry and prose using syllable statistics”, Lithuanian Journal of Statistics, 61, pp. 32–45. doi:10.15388/LJS.2022.31988.

Abstract

The aim of the paper is to construct a universal classifier that discriminates short Lithuanian text excerpts of poetry from those of prose. Here, universality means that the classifier is relatively insensitive to text content and to the author's style. Since syllables represent phonetic properties and are less sensitive to text content than words are, classifier training is based on the frequencies of syllables in the texts to be classified. The text data are taken from the digitized library http://ebiblioteka.mkp.emokykla.lt. The error rate of the trained classifier applied to test excerpts of 100 words is less than 5%.
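
To make the idea of syllable-frequency features concrete, the following is a minimal Python sketch. It is not the authors' syllabification procedure, which the abstract does not describe. The functions naive_syllables and syllable_frequencies, the VOWELS set, and the rule of cutting after each vowel run are all assumptions made for illustration; in particular, the rule ignores Lithuanian diphthongs and consonant-cluster conventions.

```python
import re
from collections import Counter

# Assumption: a simplified Lithuanian vowel set, used only for this illustration.
VOWELS = "aąeęėiįyouųū"

def naive_syllables(word):
    """Crudely split a word: each chunk is optional consonants plus one vowel run.

    Trailing consonants are attached to the last chunk. This is a hypothetical
    stand-in for a real automatic syllabifier.
    """
    word = word.lower()
    chunks = re.findall(rf"[^{VOWELS}]*[{VOWELS}]+", word)
    if not chunks:
        # A word with no vowels is kept as a single unit.
        return [word] if word else []
    # Chunks are contiguous from the start, so the remainder is the tail.
    tail = word[len("".join(chunks)):]
    chunks[-1] += tail
    return chunks

def syllable_frequencies(text):
    """Return relative frequencies of the naive syllables found in the text."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    counts = Counter(s for w in words for s in naive_syllables(w))
    total = sum(counts.values())
    return {syl: n / total for syl, n in counts.items()} if total else {}

freqs = syllable_frequencies("Namas namas")
print(freqs)
```

Frequency vectors of this kind, computed for each 100-word excerpt, could plausibly serve as inputs to the logistic regression listed among the keywords, although the exact feature construction used in the paper may differ.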
