Regex to count syllables

Rubies:

Just a tricky one. I need a function that counts syllables in English words
with, say, 99.5% accuracy for common words. The deal is a “syllable” seems
to be “a group of one to three vowels, always including Y”; the exceptions
seen so far are “a trailing ‘ing’ is always a syllable”, and “silent E on
the end not preceded by L is not a syllable.”

We’ll think of other rules, but I’m just curious how to fold my brute-force
reckoning into a single regex. Right now I use a regex and a couple of ‘if’
statements.

Note that the function needn’t accurately locate the syllables; it only needs
to count them:

Tokenizer = /([aeiouy]{1,3})/

def syllableCounter word
    len = 0

    if word[-3..-1] == 'ing' then
        len += 1
        word = word[0...-3]
    end

    got = word.scan(Tokenizer)
    len += got.size()

    # the next exception to the regex is the
    # trailing silent ‘e’ (which used to be a syllable)
    # and the exception to the exception is ‘le’.

    if got.size() > 1 and got[-1] == ['e'] and
                  word[-1].chr() == 'e' and
                  word[-2].chr() != 'l' then
        len -= 1
    end

    return len
end

            assert 1 == syllableCounter('a')
            assert 1 == syllableCounter('aa')
            assert 1 == syllableCounter('aaa')
            assert 2 == syllableCounter('aaay')
            assert 1 == syllableCounter('rat')
            assert 1 == syllableCounter('elf')
            assert 2 == syllableCounter('emu')
            assert 2 == syllableCounter('elven')
            assert 2 == syllableCounter('queuing')
            assert 2 == syllableCounter('kozmic')
            assert 2 == syllableCounter('throughout')
            assert 2 == syllableCounter('vicious')
            assert 2 == syllableCounter('aenid')
            assert 1 == syllableCounter('way')
            assert 2 == syllableCounter('parking')
            assert 3 == syllableCounter('sylabus')
            assert 4 == syllableCounter('invigorate')
            assert 4 == syllableCounter('kleptomania')
            assert 3 == syllableCounter('diastrophism')
            assert 3 == syllableCounter('elegy')
            assert 4 == syllableCounter('nymphomaniac')
            assert 5 == syllableCounter('hippopotamus')
            assert 2 == syllableCounter('filthy')
            assert 3 == syllableCounter('venison')
            assert 1 == syllableCounter('frog')
            assert 1 == syllableCounter('fly')
            assert 2 == syllableCounter('eagle')
            assert 4 == syllableCounter('Liberiencis')
            assert 3 == syllableCounter('behemoth')
            assert 2 == syllableCounter('river')
            assert 2 == syllableCounter('zeekoe')
            assert 2 == syllableCounter('yarrow')
            assert 1 == syllableCounter('yeah')
            assert 2 == syllableCounter('being')
            assert 0 == syllableCounter('')

So how do I merge those exceptions into the main regex?


Phlip
http://c2.com/cgi/wiki?DevNull
– It’s a small Web, after all… –

I think that syllabication rules in words are not regular sets. I
suspect that at the very least the complete set of rules needs a CFL,
given how messy English syllabication rules are. Note the word
‘syllable’ violates the simple rule you’ve given: ‘syl-la-ble’ (the
first syllable contains NO vowels). Maybe it would be better to encode
the syllabication rules and use them recursively. I believe programs
like TeX and MS Word actually have extensive databases of syllabication
rules that do this job with perfect accuracy (I have the feeling that
most if not all open source programs that do hyphenation have code
that’s ultimately derived from TeX…perhaps even some closed-source
programs do, given the nature of TeX’s licensing).

On Mon, Aug 05, 2002 at 11:58:18AM +0900, Phlip wrote:

Rubies:

Just a tricky one. I need a function that counts syllables in English words
with, say, 99.5% accuracy for common words. The deal is a “syllable” seems
to be “a group of one to three vowels, always including Y”; the exceptions
seen so far are “a trailing ‘ing’ is always a syllable”, and “silent E on
the end not preceded by L is not a syllable.”


Rafael R. Sevilla +63(2)8123151
Software Developer, Imperium Technology Inc. +63(917)4458925

Rafael ‘Dido’ Sevilla wrote:

I think that syllabication rules in words are not regular sets. I
suspect that at the very least the complete set of rules needs a CFL,
given how messy English syllabication rules are. Note the word
‘syllable’ violates the simple rule you’ve given: ‘syl-la-ble’ (the
first syllable contains NO vowels). Maybe it would be better to encode
the syllabication rules and use them recursively. I believe programs
like TeX and MS Word actually have extensive databases of syllabication
rules that do this job with perfect accuracy (I have the feeling that
most if not all open source programs that do hyphenation have code
that’s ultimately derived from TeX…perhaps even some closed-source
programs do, given the nature of TeX’s licensing).

I’m sorry. I should have made it clearer that I was not asking for high-level
thinking about grammar, or even a 99.9997% accurate counter. (But I did
cover that Y is always a vowel.)

Further, you are discussing the word-break rules. I mean the phonetic
pronunciation rules, and all I need is a rough count. I don’t need to
locate the actual breaks; this would require the deep business rule set you
describe.

The existing code handles “syllable” fine:

            assert 3 == syllableCounter('syllable')

I’m only asking for tutorial help refactoring my function to put the ‘if’
statements inside the regular expression. Because I supplied the tests
needed to show the refactor’s safe, the refactor should add or remove no
grammar rules.
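
One way to make that safety check concrete is a small table-driven harness
that accepts any candidate counter. This is only a sketch: the names EXPECTED,
failures, and original are made up for illustration, the word list is a subset
of the assertions posted above, and the lambda restates the posted
‘if’-statement logic rather than copying it verbatim.

```ruby
# Hypothetical harness: a candidate is any callable mapping a word to a count.
# The expected values are taken from the assertions earlier in the thread.
EXPECTED = {
  'a' => 1, 'rat' => 1, 'emu' => 2, 'queuing' => 2, 'parking' => 2,
  'invigorate' => 4, 'eagle' => 2, 'zeekoe' => 2, 'syllable' => 3, '' => 0
}

# Returns the words for which the candidate disagrees with the expected count.
def failures(counter, expected = EXPECTED)
  expected.reject { |word, count| counter.call(word) == count }.keys
end

# The regex-plus-'if' logic from the original post, restated as a lambda.
original = lambda do |word|
  count = 0
  if word.end_with?('ing')        # a trailing 'ing' is always one syllable
    count += 1
    word = word[0...-3]
  end
  groups = word.scan(/[aeiouy]{1,3}/)
  count += groups.size
  # A trailing silent 'e' is not a syllable, unless it is part of 'le'.
  silent_e = groups.size > 1 && groups[-1] == 'e' &&
             word.end_with?('e') && !word.end_with?('le')
  count -= 1 if silent_e
  count
end

p failures(original)   # an empty array means every listed word matched
```

Any single-regex candidate can then be passed to failures in the same way, so
the two versions are judged against the same table.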


Phlip
http://www.greencheese.org/PeaceAndCalm
– In the future everyone will be Andy Warhol for 15 minutes –

On 5 Aug 2002, at 12:58, Phlip asked about a regex suited to
estimate syllable count.

Phlip, the following regex passed your tests:

Tokenizer = /((a|e(?!$)|i(?!ng$)|o|u|y){1,3}|le$|ing$)/

The first part matches one to three vowels, but not an “e” or “ing”
at the end. The other parts count “le” and “ing” at the end as
syllables.
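
As a rough illustration of how that pattern could replace the whole function,
here is a minimal sketch. The names PIT_TOKENIZER and count_syllables are
hypothetical; only the regex itself comes from this message.

```ruby
# The pattern above, wrapped in a counting helper (helper name is illustrative).
PIT_TOKENIZER = /((a|e(?!$)|i(?!ng$)|o|u|y){1,3}|le$|ing$)/

def count_syllables(word)
  # scan returns one entry per non-overlapping match, so its size is the count.
  word.scan(PIT_TOKENIZER).size
end

%w[rat eagle queuing invigorate syllable being].each do |word|
  puts "#{word}: #{count_syllables(word)}"
end
```

Two caveats. In Ruby, $ matches at the end of a line, so \z would be the
stricter anchor if the input might contain newlines. Also, for a word whose
only vowel is a trailing “e”, such as “the”, this pattern yields 0, whereas
the original function’s got.size() > 1 guard yields 1; none of the posted
assertions covers that case.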

Regards,
Pit