Anyway, my problem arises from a simple problem: counting words in a
file.
That was my problem also (analysing the text structure, not meaning).
Try this:
WORDS_RE = /# words including apostrophe: can't
([^\d\s"'():,.]+[']?[^\d\s"'():,.]+)
|
# double-quoted sentence: "a few words"
"([^\d():,.]+)"
|
# single-quoted single letter: 'n'
('[^\d():,.]')
|
# single-quoted sentence: 'a few words'
'([^\d():,.]+)'
/x
It should strip quote marks (" and ') and parentheses (brackets should
be easy to add), leaving you with ‘tokens’ that are ‘words’. These tokens
may contain spaces if they’re quoted.
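To make that concrete, here is a small self-contained sketch of what scanning one line gives you. It restates the pattern with plain ASCII quotes under a made-up name (WORDS_RE_DEMO), and the sample sentence is invented for illustration.

```ruby
# Illustrative copy of the pattern, written with ASCII quotes.
# WORDS_RE_DEMO is a hypothetical name used only for this example.
WORDS_RE_DEMO = /
  # words including apostrophe: can't
  ([^\d\s"'():,.]+[']?[^\d\s"'():,.]+)
  |
  # double-quoted sentence: "a few words"
  "([^\d():,.]+)"
  |
  # single-quoted single letter: 'n'
  ('[^\d():,.]')
  |
  # single-quoted sentence: 'a few words'
  '([^\d():,.]+)'
/x

line = %q{He said "a few words" and can't stop (really).}

# scan returns one array of capture groups per match; only the group
# belonging to the alternative that matched is non-nil, so
# flatten.compact leaves just the tokens.
tokens = line.scan(WORDS_RE_DEMO).flatten.compact
p tokens
# => ["He", "said", "a few words", "and", "can't", "stop", "really"]
```

One thing to watch, as far as I can tell from the pattern: a bare one-letter word such as "I" is skipped, because the first alternative needs at least two characters.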
Here’s how I use it (with ‘iconv/io’ plus a few modifications):
# Word Splitter #{{{
# Reads a text file, splits each line into individual words,
# and writes (by default appends) all the words to the output file.
# Each word is written on its own line.
# Does not transform the words.
# Ignores digits and non-word chars.
def split(infilename, outfilename = nil, outmode = "ab+")
  words = []
  #File.open(infilename, "r") { |inf|
  Iconv::Reader.open(infilename, "utf-16", "utf-8") { |inf|
    while (line = inf.gets)
      next if line =~ /^#/   # skip comment lines
      line.scan(WORDS_RE) { |w| words << w }
    end
  }
  # scan yields one array of capture groups per match; keep only the hits
  words = words.flatten.compact
  if outfilename
    #File.open(outfilename, outmode) { |out|
    Iconv::Writer.mode_open(outfilename, outmode, "utf-16", "utf-8") { |out|
      out.puts words
    }
  end
  words
end #}}}
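
If all you need is the word count from the original question, a tally can be built on the same idea. The following is only a sketch, not what I actually run: it assumes the input file is already UTF-8 and uses plain File I/O instead of iconv/io. The names TOKEN_RE and count_words are made up for this example.

```ruby
# Hypothetical pattern name; same alternatives as above, ASCII quotes.
TOKEN_RE = /
  ([^\d\s"'():,.]+[']?[^\d\s"'():,.]+)   # words, including can't
  |
  "([^\d():,.]+)"                        # "a few words"
  |
  ('[^\d():,.]')                         # 'n'
  |
  '([^\d():,.]+)'                        # 'a few words'
/x

# Hypothetical helper: returns a Hash mapping each token to the number
# of times it occurs in the file. Assumes UTF-8 input by default.
def count_words(path, encoding = "UTF-8")
  counts = Hash.new(0)
  File.foreach(path, encoding: encoding) do |line|
    next if line.start_with?("#")   # skip comment lines, like split does
    # Each match yields its capture groups; exactly one is non-nil.
    line.scan(TOKEN_RE) { |groups| counts[groups.compact.first] += 1 }
  end
  counts
end
```

Calling count_words("some_file.txt").values.sum then gives the total number of tokens, and the hash itself gives per-word frequencies.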