Unicode challenges for an I18N Instiki

Hi there,

I’m preparing an I18N version of Instiki, but I’m running into a few
issues that are holding me back. First of all, even though constructs
like \w correctly recognize Unicode characters, [:upper:] does not
include capital Unicode letters.

That’s somewhat of a problem if you want to allow wiki words like
ÆbletærtenVarMåskeØensBedste (“the apple pie was perhaps the island’s
best” in Danish).

I’ve currently implemented a hack: a hard-coded list of the capital
Unicode letters that I know are needed for Danish (“ÅØÆ”). A complete
list can probably be found on some Unicode site (links for that would
be great!), but I was wondering whether there is a cleaner way.
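
The hard-coded list can be sketched roughly as follows. This is a minimal illustration, not Instiki's actual code: the constant and method names are made up, the definition of a wiki word (two or more runs of one capital followed by lowercase letters) is an assumption, and it is written for a current Ruby with UTF-8 strings rather than Ruby 1.8.

```ruby
# Hypothetical names throughout; only the idea of an explicit list of
# capitals comes from the post.
DANISH_UPPER = "A-ZÆØÅ"   # ASCII capitals plus the three Danish ones
DANISH_LOWER = "a-zæøå"   # the matching lowercase letters

# Assumed definition of a wiki word: two or more "humps", each one
# capital letter followed by at least one lowercase letter.
WIKI_WORD = /\A(?:[#{DANISH_UPPER}][#{DANISH_LOWER}]+){2,}\z/

def wiki_word?(text)
  WIKI_WORD.match?(text)
end
```

With this pattern, `wiki_word?("ÆbletærtenVarMåskeØensBedste")` is true, while a single capitalized word or an all-lowercase word is rejected. The drawback is the one raised above: every additional language means extending both lists by hand.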

Also, the URI parser in WEBrick seems to break down on URL-encoded
Unicode characters. “DuÆlskerLegetøj” (“you love toys” in Danish)
breaks down like this:

  • -> /wiki/show/Du%C3%86lskerLeget%C3%B8j
    [2004-04-24 21:45:15] ERROR URI::InvalidURIError: bad URI(is not URI?):
    /wiki/new/DuÆlskerJoLegetøj
    /usr/local/lib/ruby/1.8/uri/common.rb:345:in `split'
    /usr/local/lib/ruby/1.8/uri/common.rb:368:in `parse'
    /usr/local/lib/ruby/1.8/uri/generic.rb:840:in `merge0'
    /usr/local/lib/ruby/1.8/uri/generic.rb:799:in `merge'
    /usr/local/lib/ruby/1.8/webrick/httpresponse.rb:146:in `setup_header'
    /usr/local/lib/ruby/1.8/webrick/httpresponse.rb:84:in `send_response'
    /usr/local/lib/ruby/1.8/webrick/httpserver.rb:67:in `run'

I’d be very grateful for any tips on how to handle this. The faster I
get it solved, the faster a new Instiki release will see the light of
day :)

P.S.: As a treat, I can tell you that the new release has a new in-wiki
configuration page that allows you to:

  • switch between markup languages (test to find what you like best
    without starting/stopping Instiki)
  • make additions to the stylesheet (easily tweak the entire look of
    Instiki)
  • rename/move the entire web
  • add/remove password protection


David Heinemeier Hansson,
http://www.instiki.org/ – A No-Step-Three Wiki in Ruby
http://www.basecamphq.com/ – Web-based Project Management
http://www.loudthinking.com/ – Broadcasting Brain

Hi there,

> I’m preparing an I18N version of Instiki, but I’m running into a few
> issues that are holding me back. First of all, even though constructs
> like \w correctly recognize Unicode characters, [:upper:] does
> not include capital Unicode letters.

It turned out that just keeping a list of capital letters in Latin,
Greek, and Cyrillic worked out great. It would of course be nice if
[:upper:] could do the same, but it’s not that big of a problem.
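
For comparison, later Ruby versions (1.9 and up) let a regular expression ask for Unicode categories directly, which avoids a hand-kept list. The following is only a sketch under assumptions: it is not what Instiki does, the method name is made up, the definition of a wiki word (two or more runs of one capital followed by lowercase letters) is a guess, and it needs UTF-8 strings.

```ruby
# \p{Lu} matches any Unicode uppercase letter and \p{Ll} any lowercase
# letter, so Latin, Greek, and Cyrillic are covered without listing them.
UNICODE_WIKI_WORD = /\A(?:\p{Lu}\p{Ll}+){2,}\z/

# Hypothetical helper name, for illustration only.
def unicode_wiki_word?(text)
  UNICODE_WIKI_WORD.match?(text)
end
```

Under these assumptions the same pattern accepts "DuÆlskerLegetøj", "ΑθήναΕλλάδα", and "МоскваРоссия", and rejects an all-lowercase word.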

> Also, the URI parser in WEBrick seems to break down on url encoded
> unicode characters. “DuÆlskerLegetøj” (“you love toys” in Danish)
> breaks down like this:

I was a fool and didn’t escape properly.
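
To illustrate what escaping properly means here, this is a minimal sketch using Ruby's standard uri library. The helper names and the /wiki/show/ route layout are assumptions modelled on the log output, not Instiki's real code, and it assumes page names contain no spaces (this particular encoder turns a space into "+").

```ruby
require "uri"

# Hypothetical helper: percent-encode a page name so it can be used as
# one path segment in a URL or a redirect header.
def escape_page_name(name)
  URI.encode_www_form_component(name)
end

# Hypothetical helper: turn the escaped form back into the UTF-8 name.
def unescape_page_name(escaped)
  URI.decode_www_form_component(escaped)
end

# Assumed route layout, modelled on the request path in the log.
def show_path(name)
  "/wiki/show/#{escape_page_name(name)}"
end
```

Here `show_path("DuÆlskerLegetøj")` gives /wiki/show/Du%C3%86lskerLeget%C3%B8j, which contains only ASCII and so can be handed to a URI parser, whereas the raw name in the error message above cannot.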

So yes, Instiki with I18N (Latin, Greek, and Cyrillic) wiki words is
forthcoming. Oh yeah, [[wiki link]] and [[c]] work now too.


