[newbie] lower to upper first letter of a word

I recently got a vintage list (more than 500 items) with poor capitalization. For
example, I have:

Côte de beaune-villages

instead of :

Côte de Beaune-Villages

Crémant d’alsace

instead of :

Crémant d’Alsace

I’m wondering how to change lower case to upper case, and whether

a regex could do the trick.

Something like:

every letter following a " ", “-” or “’” should be upper case, unless the word
belongs to a blacklist:

black_list = %w{d de du la le sec sur entre}  # etc…

···


Yvon

string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

-Mark
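
A quick sketch of how this one-liner behaves. It assumes the blacklist holds only the words named in the question (the "etc…" is left out) and uses accent-free sample names picked for illustration, since accented input is where the thread runs into trouble later on:

```ruby
# Words that should stay lower case (only the entries named in the question).
black_list = %w{d de du la le sec sur entre}

# Illustrative, accent-free sample names (an assumption, not from the list).
names = ["clos de la roche", "beaune-villages sec", "cotes d'auvergne"]

fixed = names.map do |name|
  # \b[a-z]+ grabs each run of ASCII lower-case letters that starts a word.
  name.gsub(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }
end

puts fixed
# Clos de la Roche
# Beaune-Villages sec
# Cotes d'Auvergne
```

Using the non-bang gsub inside map keeps the original strings untouched, which makes it easier to compare before and after on a 500-item list.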

···

On Tue, Sep 23, 2003 at 06:29:58PM +0200, Yvon Thoraval wrote:

[snip]

You might adapt the English language ‘titlecase’ program, which can be
found here:

http://zem.novylen.net/ruby/titlecase.rb

Regards,

Mark

···

On Tuesday, September 23, 2003, at 12:34 PM, Yvon Thoraval wrote:

[snip]

Thanks a lot °;)

···

Mark J. Reed markjreed@mail.com wrote:

string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }


Yvon

Hi –

···

On Wed, 24 Sep 2003, Yvon Thoraval wrote:

Yvon Thoraval <yvon.thoravallist@-SUPPRIMEZ-free.fr.invalid> wrote:

string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Thanks a lot °;)

It seems it’s a little trickier, because accented characters are
treated as word boundaries (\b). For example:

Vosne-romanée
becomes :
Vosne-RomanéE

I believe the /s modifier to the regex will help you here by changing
the encoding, though I’m having character-rendering issues which make
it hard for me to test… But try this, in the hope that I’m right
even though I can’t see the characters:

str.gsub!(/\b[a-z]+/s) {|w| black_list.include?(w) ? w : w.capitalize}

David


David Alan Black
home: dblack@superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

Yes, thanks. That way I could more easily change the rules depending on the
region of the vintage…

···

Mark Wilson mwilson13@cox.net wrote:

You might adapt the English language ‘titlecase’ program, which can be
found here:

http://zem.novylen.net/ruby/titlecase.rb


Yvon

It seems it’s a little trickier, because accented characters are
treated as word boundaries (\b). For example:

Vosne-romanée
becomes:
Vosne-RomanéE

So instead of \b I would have to exclude a list of characters:
[à|ä|â|é|è|ê|î|ö|ô|ü|ù]

···

Yvon Thoraval <yvon.thoravallist@-SUPPRIMEZ-free.fr.invalid> wrote:

string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Thanks a lot °;)


Yvon

Thanks. I don’t remember (from Perl): what does this “s” mean?

···

dblack@superlink.net wrote:

I believe the /s modifier to the regex will help you here by changing
the encoding, though I’m having character-rendering issues which make
it hard for me to test… But try this, in the hope that I’m right
even though I can’t see the characters:

str.gsub!(/\b[a-z]+/s) {|w| black_list.include?(w) ? w : w.capitalize}


Yvon

It seems it’s a little trickier, because accented characters are
treated as word boundaries (\b)

Really? That’s arguably a bug. What character encoding are you using?

Accented letters should be in \w, not \W, and therefore the
space between one and an adjacent letter should not match \b.
But Ruby regexes may be ASCII-only, and even if not, they’re probably
Latin-1-only. So, for instance, they wouldn’t work on UTF-8 strings.

Vosne-romanée
becomes :
Vosne-RomanéE

So instead of \b I would have to exclude a list of characters:
[à|ä|â|é|è|ê|î|ö|ô|ü|ù]

First, you don’t need the pipes (|'s) there. Pipes are for
alternation without the […]; basically, [abc] is short for
(a|b|c). The pipe form is most useful when the alternatives are
not all single characters, for instance (alfa|bravo|charlie).

I’m not sure whether the exclude-list or the include-list would
be shorter. You could do (^|[- ']) to match “beginning of string or
dash or space or apostrophe”, but then that character would be included
in the resulting string. Which means that it would be, for instance,
" d" or “-d” or “'d” instead of “d”, and therefore won’t be in the
blacklist and won’t capitalize properly (since String#capitalize operates
on the first character, which will be the space or dash or apostrophe).
The block has to compensate for that. Something like this:

string.gsub!(/(^|[- '])([a-z]+)/) { $1 + $2.capitalize }

Except that [a-z] won’t match accented characters, so it’s more like this:

string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { $1 + $2.capitalize }

And if the names aren’t limited to French, then even more special characters
creep in . . .

-Mark
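
On the point about pipes inside [...], a minimal sketch (assuming Ruby 2.4 or later for Regexp#match?; the two-letter classes are shortened stand-ins for the longer list above):

```ruby
# Inside [...], a pipe is just one more literal character in the set.
with_pipes    = /[à|é]/
without_pipes = /[àé]/

p with_pipes.match?("|")      # true: the pipe itself matches
p without_pipes.match?("|")   # false
p without_pipes.match?("é")   # true

# Alternation between multi-character options needs | outside of brackets.
picked = "bravo"[/(alfa|bravo|charlie)/]
p picked                      # "bravo"
```

So the version with pipes would also treat a literal | in the data as one of the "accented" characters, which is harmless here but not what was meant.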

···

On Tue, Sep 23, 2003 at 07:23:52PM +0200, Yvon Thoraval wrote:

Hi –

···

On Wed, 24 Sep 2003, Yvon Thoraval wrote:

dblack@superlink.net wrote:

I believe the /s modifier to the regex will help you here by changing
the encoding, though I’m having character-rendering issues which make
it hard for me to test… But try this, in the hope that I’m right
even though I can’t see the characters:

str.gsub!(/\b[a-z]+/s) {|w| black_list.include?(w) ? w : w.capitalize}

Thanks. I don’t remember (from Perl): what does this “s” mean?

It’s different in Perl and Ruby. In Perl, it means: treat the string
as a single line, so that ‘.’ matches newline. In Ruby, it affects
the encoding… I wish I could give a more knowledgeable account,
but I’ve never actually used it myself and can’t seem to dig up
documentation.

David



Left off the blacklist check, which should be applied to $2:

string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { 
    black_list.include?($2) ? $1 + $2 : $1 + $2.capitalize 
}

-Mark
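
For later readers, a sketch of the same capture-group idea on a current Ruby. Several things here are assumptions and not from the thread: Ruby 2.4 or later (where String#capitalize handles non-ASCII letters), UTF-8 source and data, \p{L} standing in for the hand-written accent list, and the typographic apostrophe ’ added to the separator class. The method name fix_case is made up for the example.

```ruby
BLACK_LIST = %w{d de du la le sec sur entre}

# Hypothetical helper: capitalize every word that follows the start of the
# string, a space, a dash or an apostrophe, unless it is blacklisted.
def fix_case(name)
  name.gsub(/(^|[- '’])(\p{L}+)/) do
    sep, word = $1, $2
    sep + (BLACK_LIST.include?(word) ? word : word.capitalize)
  end
end

puts fix_case("Côte de beaune-villages")   # Côte de Beaune-Villages
puts fix_case("Crémant d’alsace")          # Crémant d’Alsace
puts fix_case("Vosne-romanée")             # Vosne-Romanée
puts fix_case("Mâcon supérieur")           # Mâcon Supérieur
```

One caveat of capitalize: it also lower-cases the rest of the word, so an all-caps token such as "AOC" would come out as "Aoc" unless it is handled separately.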

···

On Tue, Sep 23, 2003 at 05:49:24PM +0000, Mark J. Reed wrote:

string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { $1 + $2.capitalize }

Really? That’s arguably a bug. What character encoding are you using?

I’m (more or less) sure about that, because even if I use:
l.gsub!(/\b[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

I get:
MâCon SupéRieur
when the input was:
Mâcon supérieur

Accented letters should be in \w, not \W, and therefore the
space between one and an adjacent letter should not match \b.
But Ruby regexes may be ASCII-only, and even if not, they’re probably
Latin-1-only. So, for instance, they wouldn’t work on UTF-8 strings.

Precisely, I’m using UTF-8 °;)
However, I can give ISO-8859-1 a try: my text editor (Pepper
on Mac OS X) can transcode from UTF to ISO in two clicks plus one
cut-and-paste…
That sounds strange to me, because Ruby comes from Japan, where “special”
characters are everyday characters?

[snip]

The block has to compensate for that. Something like this:

  string.gsub!(/(^|[- '])([a-z]+)/) { $1 + $2.capitalize }

Except that [a-z] won’t match accented characters, so it’s more like this:

  string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { $1 + $2.capitalize }

And if the names aren’t limited to French, then even more special characters
creep in . . .

Yes, right. For the time being I only need French and German
accented characters…

However, because the vintages are classified by area, I might have to change
the regex per region…

···

Mark J. Reed markjreed@mail.com wrote:

Yvon

According to the Pickaxe, or at least the online version thereof
(my dead-trees version is at home), /s means to use the SJIS
(Shift JIS) multibyte
text encoding. Similarly, /e means to use EUC, and /u means to use
UTF-8. So /u is probably a better bet than /s for Yvon.

http://www.rubycentral.com/book/ref_c_regexp.html#Regexp.new

-Mark
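
A small check of what those flag letters do, assuming a Ruby from 1.9 onwards, where Regexp#encoding and Regexp#fixed_encoding? are available (the 1.8 interpreter current in this thread predates them). The pattern /a/ is only a placeholder:

```ruby
# Each flag letter pins the regexp to one fixed encoding.
flags = { "u" => /a/u, "e" => /a/e, "s" => /a/s }
flags.each do |flag, re|
  puts "/#{flag} -> #{re.encoding.name}"
end

# Without a flag, an ASCII-only pattern is not tied to a single encoding.
plain_fixed = /a/.fixed_encoding?
utf8_fixed  = /a/u.fixed_encoding?
p plain_fixed   # false
p utf8_fixed    # true
```

On such versions the /u and /e regexps report UTF-8 and EUC-JP; for /s, expect a Shift_JIS family name rather than anything to do with Perl's single-line mode.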

···

On Wed, Sep 24, 2003 at 06:07:08AM +0900, dblack@superlink.net wrote:

On Wed, 24 Sep 2003, Yvon Thoraval wrote:

Thanks. I don’t remember (from Perl): what does this “s” mean?
It’s different in Perl and Ruby. In Perl, it means: treat the string
as a single line, so that ‘.’ matches newline. In Ruby, it affects
the encoding… I wish I could give a more knowledgeable account,
but I’ve never actually used it myself and can’t seem to dig up
documentation.

“Yvon Thoraval” <yvon.thoravallist@-SUPPRIMEZ-free.fr.invalid> wrote in
message
news:1g1r8u8.1hzv3mvupjeizN%yvon.thoravallist@-SUPPRIMEZ-free.fr.invalid…

Really? That’s arguably a bug. What character encoding are you
using?

I’m (more or less) sure about that, because even if I use:
l.gsub!(/\b[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

I’d omit the “\b” at the beginning, since with it “é” is still treated as a
word boundary:

l.gsub!(/[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Alternatively:

l.gsub!(/[^\s!?.;:-]+/) {|w| black_list.include?(w) ? w : w.capitalize }

Regards

robert
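
One thing to watch with the second pattern, sketched here assuming Ruby 2.4 or later and UTF-8 data: the apostrophe is not in the excluded set, so “d’alsace” is handled as a single word. Adding both apostrophe forms to the class is one possible adjustment (a variation for illustration, not the pattern as posted):

```ruby
black_list = %w{d de du la le sec sur entre}
name = "crémant d’alsace"

# Pattern as posted: the apostrophe does not separate words.
as_posted = name.gsub(/[^\s!?.;:-]+/) { |w| black_list.include?(w) ? w : w.capitalize }

# Variation: also treat ' and ’ as separators.
adjusted = name.gsub(/[^\s!?.;:'’-]+/) { |w| black_list.include?(w) ? w : w.capitalize }

puts as_posted   # Crémant D’alsace
puts adjusted    # Crémant d’Alsace
```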
···

Mark J. Reed markjreed@mail.com wrote:

I’d omit the “\b” at the beginning, since with it “é” is still treated as a
word boundary:

l.gsub!(/[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Yes, fine. I also discovered that capitalization doesn’t work on
accented characters (such as é),

so I added another step for those “special” characters when they are the first
letter of a word.

Alternatively:

l.gsub!(/[^\s!?.;:-]+/) {|w| black_list.include?(w) ? w : w.capitalize }

OK, however my list has no punctuation such as ?!;:…, only " " and “-”.

···

Robert Klemme bob.news@gmx.net wrote:

Yvon

Yes, fine. I also discovered that capitalization doesn’t work on
accented characters (such as é)

You can use an old library named unicode:

irb(main):001:0> $KCODE="u"
=> "u"
irb(main):002:0> require "unicode"
=> true
irb(main):003:0> Unicode.capitalize("àëíôů")
=> "Àëíôů"

http://raa.ruby-lang.org/list.rhtml?name=unicode
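
For anyone reading this later: assuming Ruby 2.4 or newer and UTF-8 strings (not the 1.8 setup used in this thread), the built-in case methods handle non-ASCII letters themselves, so the extra library should not be needed for this particular job. A minimal sketch:

```ruby
# Assumes Ruby >= 2.4; the strings are UTF-8 literals.
word = "àëíôů"
capitalized = word.capitalize
puts capitalized            # Àëíôů

puts "élan".capitalize      # Élan
puts "É".downcase           # é
```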

Thanks for everything!

···

Carlos angus@quovadis.com.ar wrote:

You can use an old library named unicode:

irb(main):001:0> $KCODE="u"
=> "u"
irb(main):002:0> require "unicode"
=> true
irb(main):003:0> Unicode.capitalize("àëíôů")
=> "Àëíôů"

http://raa.ruby-lang.org/list.rhtml?name=unicode


Yvon

I would appreciate it if someone could give me a regexp that would
split the following, for example:
“clonidine300 mg” into “clonidine 300 mg”

I have a bunch of drug data where the dose had been typed together.

Thanks

There’s probably more than one way to do this. Here’s one way:

irb(main):001:0> s="clonidine300 mg"
=> "clonidine300 mg"
irb(main):005:0> s.scan(/[a-zA-Z]+|\d+/) { |i| p i }
"clonidine"
"300"
"mg"
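
To get from those pieces back to a single string, two possible follow-ups are sketched below. Neither is from the original reply, and the variable names are made up:

```ruby
s = "clonidine300 mg"

# Option 1: rejoin the pieces found by scan with single spaces.
joined = s.scan(/[a-zA-Z]+|\d+/).join(" ")
puts joined      # clonidine 300 mg

# Option 2: insert a space wherever a letter is directly followed by a digit.
spaced = s.gsub(/(?<=[a-zA-Z])(?=\d)/, " ")
puts spaced      # clonidine 300 mg
```

Option 1 silently drops anything that is neither a letter nor a digit (a decimal point in a dose, for instance), while option 2 leaves the rest of the string untouched, so the second is probably the safer one for real drug data.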

···

On Saturday, 27 September 2003 at 12:17:19 +0900, Thomas A. Reilly wrote:

I would appreciate it if someone could give me a regexp that would
split the following, for example:
“clonidine300 mg” into “clonidine 300 mg”

I have a bunch of drug data where the dose had been typed together.


Jim Freeze

Anybody can win, unless there happens to be a second entry.

/([^\d]+)(\d+)\s*mg/
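
A sketch of how that pattern might be applied with String#sub and backreferences. The variable name and the second sample string are assumptions for illustration:

```ruby
dose_re = /([^\d]+)(\d+)\s*mg/

samples = ["clonidine300 mg", "clonidine300mg"]
results = samples.map do |s|
  # \1 is the drug name, \2 the number; the unit is written back normalized.
  s.sub(dose_re, '\1 \2 mg')
end

puts results
# clonidine 300 mg
# clonidine 300 mg
```

Note that this only covers doses in mg, and that a name already followed by a space would keep that space in \1 and end up with two.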

···

On Saturday 27 September 2003 4:17 am, Thomas A. Reilly wrote:

I would appreciate it if someone could give me a regexp that would
split the following, for example:
“clonidine300 mg” into “clonidine 300 mg”


SuSE Linux 8.2 (i586)
Linux 2.4.20-4GB-athlon
ruby 1.8.0 (2003-09-10) [i686-linux]