Help needed for a new release of text-hyphen

I've had folks asking me for a release of text-hyphen that works with
Ruby 1.9, and while I've got something that passes the tests that I've
created and added for MRI 1.9, it *loses* compatibility with Ruby
1.8.7 (and does so loudly in the tests) and JRuby (in either 1.8 or
1.9 mode, it appears). I need some help to get the last bits ready,
because I'm not ready to drop Ruby 1.8 entirely (at least one more
version).

- You can find the source on GitHub: https://github.com/halostatue/text-hyphen/
- You will need hoe as a development dependency to assist with this if
you want to use the Rakefile; otherwise, you can run the test files in
test/ directly.
- Only one of the tests fails, but there's a good chance that new
tests along the same lines would fail as well.

I have tested against most Ruby environments, and it only succeeds
against MRI 1.9.2; even JRuby in 1.9 mode fails in the same way as
JRuby in 1.8 mode.

This issue is blocking the next release of text-hyphen. If you can
help, I need it, as I don't have time to investigate and fix it myself
(I've got another project that's taking all of my time).

After this release, this project will probably be put into maintenance
mode (the hyphenation files, aside from an update to UTF-8 encoding
where they weren't already such, have not been updated since the
original release). I will then look at implementing a new version that
works only under Ruby 1.9 (probably under a new name). It will use the
same basic engine but read .tex hyphenation files from the texhyphen
project rather than depending on the hand-converted hyphenation files
I have, which will also simplify the licensing of the successor
project.

-a
[1] No, I won't remove it as it helps with release management.

···

--
Austin Ziegler • halostatue@gmail.com • austin@halostatue.ca
http://www.halostatue.ca/ • http://twitter.com/halostatue

Hi Austin,

Running with the debugger on for 1.8.7 brings up this discrepancy:

The "letters" array for 1.8.7 is this:
["d", "a", "m", "p", "f", "s", "c", "h", "i", "f", "f", "f", "a", "h", "r", "t", "s", "k", "a", "p", "i", "t", "\303", "\244", "n", "s", "m", "\303", "\274", "t", "z", "e", "n", "h", "a", "l", "t", "e", "r", "h", "e", "r", "s", "t", "e", "l", "l", "e", "r"]

Now, "\303", "\244" is the UTF-8 encoding of a-umlaut (ä). In your 1.8 German
hyphenation file, you encode the ä in itä with the Latin-1 encoding \344.

Your input text is UTF-8, but the library searches for the Latin-1 encoding. Changing
the input to \344 for ä and \374 for ü made the test pass for me on 1.8.7.
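
For anyone who wants to reproduce the byte difference, here is a minimal sketch. It assumes Ruby 1.9 or later (for String#encode), and the helper name octal_bytes is made up for this example; it is not part of text-hyphen.

```ruby
# Hypothetical helper: render each byte of a string as a backslash octal
# escape, the way Ruby 1.8 prints them in the "letters" array.
def octal_bytes(str)
  str.bytes.map { |b| format("\\%03o", b) }
end

utf8_a   = "\u00e4"                     # ä as UTF-8
latin1_a = utf8_a.encode("ISO-8859-1")  # the same character as Latin-1

puts octal_bytes(utf8_a).join(" ")    # \303 \244 (two bytes)
puts octal_bytes(latin1_a).join(" ")  # \344 (one byte)
```

The same holds for ü: two bytes (\303 \274) in UTF-8, one byte (\374) in Latin-1.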

Michael Edgar
adgar@carboni.ca
http://carboni.ca/

···

On Jul 15, 2011, at 12:45 AM, Austin Ziegler wrote:

I've had folks asking me for a release of text-hyphen that works with
Ruby 1.9, and while I've got something that passes the tests that I've
created and added for MRI 1.9, it *loses* compatibility with Ruby
1.8.7 (and does so loudly in the tests) and JRuby (in either 1.8 or
1.9 mode, it appears). I need some help to get the last bits ready,
because I'm not ready to drop Ruby 1.8 entirely (at least one more
version).

Is this the error I should see for JRuby?

gist:1084324 · GitHub

If so...yes, it could be something simple, but there's obviously a bug
here. Perhaps I could bother you to formally file a bug at
http://bugs.jruby.org, so we can track it off-list?

- Charlie


It was the umlauts

Man ... Ruby 1.9.x hates umlauts.

*hugs his 1.8.7 install*

···

--
Posted via http://www.ruby-forum.com/.

That's the same error I saw, and I fixed it by using a Latin-1 input case
instead of a UTF-8 one.

Michael Edgar
adgar@carboni.ca
http://carboni.ca/

···

On Jul 15, 2011, at 4:38 AM, Charles Oliver Nutter wrote:

Is this the error I should see for JRuby?

gist:1084324 · GitHub

If so...yes, it could be something simple, but there's obviously a bug
here. Perhaps I could bother you to formally file a bug at
http://bugs.jruby.org, so we can track it off-list?

- Charlie

Running with the debugger on for 1.8.7 brings up this discrepancy:

The "letters" array for 1.8.7 is this:
["d", "a", "m", "p", "f", "s", "c", "h", "i", "f", "f", "f", "a", "h", "r", "t", "s", "k", "a", "p", "i", "t", "\303", "\244", "n", "s", "m", "\303", "\274", "t", "z", "e", "n", "h", "a", "l", "t", "e", "r", "h", "e", "r", "s", "t", "e", "l", "l", "e", "r"]

Now, "\303", "\244" is the UTF-8 encoding of a-umlaut (ä). In your 1.8 German
hyphenation file, you encode the ä in itä with the Latin-1 encoding \344.

Your input text is UTF-8, but the library searches for the Latin-1 encoding. Changing
the input to \344 for ä and \374 for ü made the test pass for me on 1.8.7.

I second that analysis. It seems that to use text-hyphen in Ruby 1.8 with languages other than English (any language that uses characters outside ASCII), you have to make sure that your input is in the same character encoding as the language file. In the case of German, this is Latin-1. So opening and changing the file in your text editor has probably converted the file to UTF-8, Austin.

Fixing the 1.8 version in the general case (any input, any language file encoding) will be hard... and useless, since you would be programming for a use case that should go extinct.
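
The encoding constraint can be illustrated with a small sketch. This is not Text::Hyphen's code: the pattern list, the helper names, and the sample word are invented for the example, and it relies on String#encode and String#force_encoding from Ruby 1.9+ (on 1.8, Iconv from the standard library would have to do the transcoding).

```ruby
# Hypothetical helper: view a string as raw bytes, which is how
# Ruby 1.8 treats every string.
def raw(str)
  str.dup.force_encoding(Encoding::BINARY)
end

# Hypothetical stand-in for a Latin-1 pattern file: the bytes of "itä".
LATIN1_PATTERNS = [raw("it\xE4")]

# Byte-wise lookup of any known pattern inside the word.
def pattern_found?(word_bytes)
  LATIN1_PATTERNS.any? { |pattern| word_bytes.include?(pattern) }
end

utf8_word = "kapit\u00e4n" # UTF-8 input, like the failing test word

# Raw UTF-8 bytes never match the Latin-1 pattern bytes.
puts pattern_found?(raw(utf8_word))

# Transcoding the input to the pattern file's encoding first makes it match.
puts pattern_found?(raw(utf8_word.encode("ISO-8859-1")))
```

The first lookup fails and the second succeeds, which is the same mismatch the test hit on 1.8.7.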

More than one solution offers itself :wink:

a) convert the file test_bugs.rb back to Latin-1 (-> bad, will soon break again)

b) dig back through the old version history (I am sure you have it ;)) and try to see if [1] was specifically about German umlauts, or if it was just the German and the size of the word that tripped the bug. If it was the latter, then remove those damn umlauts from the word (ä -> ae, ü -> ue) and use the new test expectations that derive from that. This would make the file ASCII again, and less sensitive to editor conversion.

c) The solution you say you don't want: dropping 1.8 support from newer gems. With bundler & rvm this is increasingly simple to manage - I'll just limit my old projects to an old version of text-hyphen.

Considering the impossible (aka: very laborious and rather beside the point) nature of the bug in 1.8, I would choose c) or (if need be) b).

best regards,
kaspar

[1] http://rubyforge.org/tracker/index.php?func=detail&aid=9807&group_id=294&atid=1195

Yes. But does jruby fake out mvm in this case? Because while Rake is
being run with 1.9, I'm not sure that the tests are:

~/projects/text-hyphen $ jruby --1.9 -S rake test
rake/rdoctask is deprecated. Use rdoc/task instead (in RDoc 2.4.2+)
Couldn't read /Users/headius/.rubyforge/user-config.yml. Run `rubyforge setup`.
/Users/headius/projects/jruby/bin/jruby -w -Ilib:bin:test:. -e
'require "rubygems"; require "test/unit"; require "test/test_bugs.rb";
require "test/test_text_hyphen.rb"' --

The tests claim to be running "jruby -w ..." and not "jruby --1.9 -w
...". It doesn't matter, though, because of the failure shown in the gist
"JRuby failure (in 1.9 mode) on text-hyphen test/test_bugs.rb" on GitHub.

I've filed JRUBY-5927 about this; if my interpretation of what's
happening with "jruby --1.9 -S rake test" is correct, I can file a
separate enhancement request about that (it's a problem, but not a bug
per se). I think Michael Edgar is correct about the other case.

-a

--
Austin Ziegler • halostatue@gmail.com • austin@halostatue.ca
http://www.halostatue.ca/ • http://twitter.com/halostatue

I think you're right. Now to figure out how to fix it properly in this case.

-a

···

On Fri, Jul 15, 2011 at 1:46 AM, Michael Edgar <adgar@carboni.ca> wrote:

Running with the debugger on for 1.8.7 brings up this discrepancy:

The "letters" array for 1.8.7 is this:
["d", "a", "m", "p", "f", "s", "c", "h", "i", "f", "f", "f", "a", "h", "r", "t", "s", "k", "a", "p", "i", "t", "\303", "\244", "n", "s", "m", "\303", "\274", "t", "z", "e", "n", "h", "a", "l", "t", "e", "r", "h", "e", "r", "s", "t", "e", "l", "l", "e", "r"]

Now, "\303", "\244" is the UTF-8 encoding of a-umlaut (ä). In your 1.8 German
hyphenation file, you encode the ä in itä with the Latin-1 encoding \344.

Your input text is UTF-8, but the library searches for the Latin-1 encoding. Changing
the input to \344 for ä and \374 for ü made the test pass for me on 1.8.7.

--
Austin Ziegler • halostatue@gmail.com • austin@halostatue.ca
http://www.halostatue.ca/ • http://twitter.com/halostatue

I second that analysis. It seems that to use text-hyphen in Ruby 1.8 with
languages other than English (any language that uses characters outside
ASCII), you have to make sure that your input is in the same character
encoding as the language file. In the case of German, this is Latin-1. So
opening and changing the file in your text editor has probably converted the
file to UTF-8, Austin.

Fixing the 1.8 version in the general case (any input, any language file
encoding) will be hard... and useless, since you would be programming for a
use case that should go extinct.

I'm not so much looking for the general case, but this specific case,
since it's a bug about a word that you filed four years ago (yes, the
one you linked) :wink:

Text::Hyphen under Ruby 1.8 has always said you need to match the
encoding of the input to the encoding of the hyphenation file (and
that'll still be true under Ruby 1.9, but at least there it'll be a
*consistent* UTF-8 encoding for all hyphenation files). I just forgot
that for this particular test.

More than one solution offers itself :wink:

a) convert the file test_bugs.rb back to Latin-1 (-> bad, will soon break
again)

Doing that would cause Ruby 1.9 to fail. If I'm willing to split the
test into 1.8 and 1.9 versions (and use load) for the specific failing
bug, then I can make this work for this release.
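
A rough sketch of that split, with assumed file names (the real layout in the repository may differ): pick the data file whose encoding matches what the running interpreter expects, then load it.

```ruby
# Hypothetical file names; the real test data files may be named differently.
DATA_FILES = {
  :latin1 => "test/data_bug_latin1.rb",
  :utf8   => "test/data_bug_utf8.rb"
}

# Ruby 1.9+ strings know their encoding; Ruby 1.8 strings are plain bytes,
# so String#encoding does not exist there.
def bug_data_file(encoding_aware = "".respond_to?(:encoding))
  encoding_aware ? DATA_FILES[:utf8] : DATA_FILES[:latin1]
end

puts bug_data_file
# In the test file this would be followed by something like:
#   load bug_data_file
```

Checking for String#encoding rather than comparing RUBY_VERSION strings keeps the switch tied to the feature that actually matters here.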

b) dig back through the old version history (I am sure you have it ;)) and
try to see if [1] was specifically about German umlauts, or if it was just
the German and the size of the word that tripped the bug. If it was the
latter, then remove those damn umlauts from the word (ä -> ae, ü -> ue)
and use the new test expectations that derive from that. This would make the
file ASCII again, and less sensitive to editor conversion.

It was the umlauts, and (ahem) you filed the bug with the umlauts. :wink:

c) The solution you say you don't want: dropping 1.8 support from newer
gems. With bundler & rvm this is increasingly simple to manage - I'll just
limit my old projects to an old version of text-hyphen.

Considering the impossible (aka: very laborious and rather beside the point)
nature of the bug in 1.8, I would choose c) or (if need be) b).

I'm trying to get out *one more* release of 1.8—this one—and then
Text::Hyphen (or its successor) will happily be 1.9 only. This is a
"final 1.8" release and then I'm going to bump the major version if I
keep the project name (which is a good one) and put "ruby >= 1.9.2" in
the gemspec. This is the transitional release only.
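
For illustration, the gemspec constraint would look roughly like this. The version number and summary are placeholders, not the actual release metadata; required_ruby_version is the standard RubyGems attribute.

```ruby
require "rubygems"

# Hypothetical gemspec fragment; version and summary are placeholders.
spec = Gem::Specification.new do |s|
  s.name    = "text-hyphen"
  s.version = "2.0.0"
  s.summary = "Hyphenates words using TeX hyphenation patterns"
  s.required_ruby_version = ">= 1.9.2"
end

requirement = spec.required_ruby_version
puts requirement.satisfied_by?(Gem::Version.new("1.8.7")) # false
puts requirement.satisfied_by?(Gem::Version.new("1.9.2")) # true
```

With that in place, RubyGems refuses to install the new major version on 1.8, and older projects keep resolving to the last 1.8-compatible release.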

···

On Fri, Jul 15, 2011 at 8:18 AM, Kaspar Schiess <eule@space.ch> wrote:

[1]
http://rubyforge.org/tracker/index.php?func=detail&aid=9807&group_id=294&atid=1195

--
Austin Ziegler • halostatue@gmail.com • austin@halostatue.ca
http://www.halostatue.ca/ • http://twitter.com/halostatue

Thanks everyone for the comments received. I've taken the approach
that I mentioned in my last message in response to Kaspar. You can see
the latest test code (where I have two data files; one latin1 and one
UTF-8). I'll be preparing a release this weekend.

Sadly, JRuby in 1.9 mode won't work because of an apparent bug in
JRuby itself, and "jruby --1.9 -S rake test" only looks like it works
because the test actually runs JRuby again in 1.8 mode. A bug has been
filed for the former case, but an improvement has not yet been filed
for the latter case.

-a

···

On Fri, Jul 15, 2011 at 9:06 AM, Austin Ziegler <halostatue@gmail.com> wrote:

--
Austin Ziegler • halostatue@gmail.com • austin@halostatue.ca
http://www.halostatue.ca/ • http://twitter.com/halostatue

Ok, I see your bugs. We'll have a look into it.

FWIW, you can specify JRUBY_OPTS=--1.9 and it will pass through to the
child JRuby instances too. But I agree, we need a dotfile or similar
to force it.

- Charlie

···

On Fri, Jul 15, 2011 at 8:37 AM, Austin Ziegler <halostatue@gmail.com> wrote:

Sadly, JRuby in 1.9 mode won't work because of an apparent bug in
JRuby itself, and "jruby --1.9 -S rake test" only looks like it works
because the test actually runs JRuby again in 1.8 mode. A bug has been
filed for the former case, but an improvement has not yet been filed
for the latter case.

I think it's a little more subtle than that, as I noted in my last comment on the --1.9 improvement request. When JRuby starts with --1.9 (whether through an arg, an opt, or a dotfile), it should essentially do:

ENV["JRUBY_OPTS"]="--1.9"

Of course, it should be a bit smarter than that, preserving other values, but this way you get the same expected behaviour that you get when MRI spawns another instance of MRI based on RbConfig::CONFIG["ruby_install_name"].
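
A sketch of the "bit smarter" version, purely illustrative and not JRuby's actual code; merge_jruby_opts is a made-up name, and the existing -J option is just sample data.

```ruby
# Hypothetical helper: add a flag to an existing JRUBY_OPTS value
# without dropping or duplicating what is already there.
def merge_jruby_opts(existing, flag)
  opts = existing.to_s.split
  opts << flag unless opts.include?(flag)
  opts.join(" ")
end

# Use a plain hash here instead of ENV so the example has no side effects.
env = { "JRUBY_OPTS" => "-J-Xmx512m" }
env["JRUBY_OPTS"] = merge_jruby_opts(env["JRUBY_OPTS"], "--1.9")
puts env["JRUBY_OPTS"] # -J-Xmx512m --1.9
```

Any child interpreter started with that environment would then come up in the same mode as its parent.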

-a « from my iPad

···

On 2011-07-15, at 18:40, Charles Oliver Nutter <headius@headius.com> wrote:

Ok, I see your bugs. We'll have a look into it.

FWIW, you can specify JRUBY_OPTS=--1.9 and it will pass through to the
child JRuby instances too. But I agree, we need a dotfile or similar
to force it.