[newbie] upper to lower first letter of a word

Yvon_Thoravallist · 23 September 2003 16:34

I recently received a list of vintages (more than 500 items) with poor
capitalization. For example, I have:

Côte de beaune-villages

instead of:

Côte de Beaune-Villages

and:

Crémant d’alsace

instead of:

Crémant d’Alsace

I am looking for a way to change lower case to upper case, and for
a regex able to do the trick.

Something like:

the first letter following a " ", “-” or “’” should be upper case unless
the word belongs to a black list of words:

black_list = %w{d de du la le sec sur entre etc…}

···

–
Yvon

Mark_J_Reed · 23 September 2003 16:54

string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

-Mark
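
A minimal runnable sketch of the one-liner above, assuming ASCII-only input for now (accented letters come up later in the thread). The method name `titlecase_ascii` and the sample strings are illustrative only, and the black list holds just the words spelled out in the question, whose tail is elided.

```ruby
# Words that must stay lower case (only those spelled out in the question).
BLACK_LIST = %w[d de du la le sec sur entre].freeze

# Capitalize every run of ASCII lower-case letters that starts at a word
# boundary, unless the run is a black-listed word.
def titlecase_ascii(name)
  name.gsub(/\b[a-z]+/) { |w| BLACK_LIST.include?(w) ? w : w.capitalize }
end

puts titlecase_ascii("vin de pays")      # Vin de Pays
puts titlecase_ascii("beaune-villages")  # Beaune-Villages
```

A non-destructive `gsub` is used here instead of `gsub!` so that the method can be called on string literals without mutating them.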


Mark_Wilson · 23 September 2003 18:24

You might adapt the English language ‘titlecase’ program, which can be
found here:

http://zem.novylen.net/ruby/titlecase.rb

Regards,

Mark


Yvon_Thoravallist · 23 September 2003 17:14

Thanks a lot °;)

···

Mark J. Reed markjreed@mail.com wrote:

string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

–
Yvon

David_A_Black2 · 23 September 2003 18:26

Hi –

···

On Wed, 24 Sep 2003, Yvon Thoraval wrote:

Yvon Thoraval <yvon.thoravallist@-SUPPRIMEZ-free.fr.invalid> wrote:

string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Thanks a lot °;)

It seems to be a little trickier, because accented characters are
treated as \b. For example:

Vosne-romanée
becomes :
Vosne-RomanéE

I believe the /s modifier to the regex will help you here by changing
the encoding, though I’m having character-rendering issues which make
it hard for me to test… But try this, in the hope that I’m right
even though I can’t see the characters:

str.gsub!(/\b[a-z]+/s) {|w| black_list.include?(w) ? w : w.capitalize}

David

–
David Alan Black
home: dblack@superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

Yvon_Thoravallist · 23 September 2003 18:54

Yes, thanks. That way I could more easily change the rules according to
the region of the vintage…

···

Mark Wilson mwilson13@cox.net wrote:

You might adapt the English language ‘titlecase’ program, which can be
found here:

http://zem.novylen.net/ruby/titlecase.rb

–
Yvon

Yvon_Thoravallist · 23 September 2003 17:34

It seems to be a little trickier, because accented characters are
treated as \b. For example:

Vosne-romanée
becomes:
Vosne-RomanéE

So instead of \b I would have to exclude a list of characters:
[àäâéèêîöôüù]
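
One plausible reading of this symptom, offered as an assumption rather than a diagnosis: the regex engine sees the UTF-8 string byte by byte, so the two bytes of “é” count as non-word bytes and a word boundary appears on either side of them. The sketch below imitates that on a current Ruby by reinterpreting the string as binary; `capitalize_bytewise` is a hypothetical helper name.

```ruby
# Reinterpret a UTF-8 string as raw bytes, run the original regex on it,
# then label the result as UTF-8 again. This imitates (it does not prove)
# what a byte-oriented regex engine would do.
def capitalize_bytewise(name)
  name.b.gsub(/\b[a-z]+/) { |w| w.capitalize }.force_encoding(Encoding::UTF_8)
end

puts "é".bytes.length                        # 2 (two bytes in UTF-8)
puts capitalize_bytewise("Vosne-romanée")    # Vosne-RomanéE
puts capitalize_bytewise("Mâcon supérieur")  # MâCon SupéRieur
```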

···

Yvon Thoraval <yvon.thoravallist@-SUPPRIMEZ-free.fr.invalid> wrote:

string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Thanks a lot °;)

–
Yvon

Yvon_Thoravallist · 23 September 2003 18:54

Thanks. I don’t remember (from Perl): what is the meaning of this “s”?

···

dblack@superlink.net wrote:

I believe the /s modifier to the regex will help you here by changing
the encoding, though I’m having character-rendering issues which make
it hard for me to test… But try this, in the hope that I’m right
even though I can’t see the characters:

str.gsub!(/\b[a-z]+/s) {|w| black_list.include?(w) ? w : w.capitalize}

–
Yvon

Mark_J_Reed · 23 September 2003 17:54

It seems to be a little trickier, because accented characters are
treated as \b

Really? That’s arguably a bug. What character encoding are you using?

Accented letters should be in \w, not \W, and therefore the
position between one and an adjacent letter should not match \b.
But Ruby regexes may be ASCII-only, and even if not, they’re probably
Latin-1-only. So, for instance, they wouldn’t work on UTF-8 strings.

Vosne-romanée
becomes :
Vosne-RomanéE

So instead of \b I would have to exclude a list of characters:
[àäâéèêîöôüù]

I’m not sure whether the exclude-list or the include-list would
be shorter. You could do (^|[- ']) to match “beginning of string or
dash or space or apostrophe”, but then that character would be included
in the resulting string. Which means that it would be, for instance,
" d" or “-d” or “'d” instead of “d”, and therefore won’t be in the
blacklist and won’t capitalize properly (since String#capitalize operates
on the first character, which will be the space or dash or apostrophe).
The block has to compensate for that. Something like this:

string.gsub!(/(^|[- '])([a-z]+)/) { $1 + $2.capitalize }

Except that [a-z] won’t match accented characters, so it’s more like this:

string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { $1 + $2.capitalize }

And if the names aren’t limited to French, then even more special characters
creep in . . .

-Mark
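
The point about String#capitalize can be checked directly. A small sketch, using a hypothetical two-word black list and illustrative sample strings, of why the delimiter has to be captured separately from the word:

```ruby
black_list = %w[d de]

# A match that still carries its leading delimiter is not found in the
# black list, and capitalize leaves it unchanged because the first
# character is the delimiter, not a letter.
p black_list.include?("-de")  # false
p "-villages".capitalize      # "-villages"

# Capturing the delimiter ($1) and the word ($2) separately avoids both
# problems: only the word is capitalized, and the delimiter is put back.
result = "beaune-villages".gsub(/(^|[- '])([a-z]+)/) { $1 + $2.capitalize }
p result                      # "Beaune-Villages"
```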


David_A_Black2 · 23 September 2003 21:07

Hi –

···

On Wed, 24 Sep 2003, Yvon Thoraval wrote:

dblack@superlink.net wrote:

I believe the /s modifier to the regex will help you here by changing
the encoding, though I’m having character-rendering issues which make
it hard for me to test… But try this, in the hope that I’m right
even though I can’t see the characters:

str.gsub!(/\b[a-z]+/s) {|w| black_list.include?(w) ? w : w.capitalize}

Thanks. I don’t remember (from Perl): what is the meaning of this “s”?

It’s different in Perl and Ruby. In Perl, it means: treat the string
as a single line, so that ‘.’ matches newline. In Ruby, it affects
the encoding… I wish I could give a more knowledgeable account,
but I’ve never actually used it myself and can’t seem to dig up
documentation.

David

–
David Alan Black
home: dblack@superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

Mark_J_Reed · 23 September 2003 17:54

Left off the blacklist check, which should be applied to $2:

string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { 
    black_list.include?($2) ? $1 + $2 : $1 + $2.capitalize 
}

-Mark
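
Put together as a runnable sketch. It assumes a current Ruby (2.4 or later), where strings are encoding-aware and String#capitalize handles accented letters, which is not what the posters in this thread observed. The method name `titlecase` is illustrative, the typographic apostrophe is added to the delimiter class as an assumption about the data, and the black list holds only the words spelled out in the question.

```ruby
BLACK_LIST = %w[d de du la le sec sur entre].freeze

# Group 1 is the delimiter (start of string, dash, space or apostrophe),
# group 2 is the lower-case word that follows it.
WORD_AFTER_DELIMITER = /(^|[- '’])([a-záàâçéèêíìîóòôúùû]+)/

def titlecase(name)
  name.gsub(WORD_AFTER_DELIMITER) do
    delimiter, word = $1, $2
    BLACK_LIST.include?(word) ? delimiter + word : delimiter + word.capitalize
  end
end

puts titlecase("Côte de beaune-villages")  # Côte de Beaune-Villages
puts titlecase("Crémant d’alsace")         # Crémant d’Alsace
puts titlecase("Vosne-romanée")            # Vosne-Romanée
```

Words that already start with an upper-case letter are left alone, because the second group only matches lower-case runs.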

···

On Tue, Sep 23, 2003 at 05:49:24PM +0000, Mark J. Reed wrote:

string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { $1 + $2.capitalize }

Yvon_Thoravallist · 23 September 2003 18:34

Really? That’s arguably a bug. What character encoding are you using?

I’m (more or less) sure about that, because even if I use:
l.gsub!(/\b[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

I get:
MâCon SupéRieur
when the input was:
Mâcon supérieur

Accented letters should be in \w, not \W, and therefore the
space between one and an adjacent letter should not match \b.
But Ruby regexes may be ASCII-only, and even if not, they’re probably
Latin-1-only. So, for instance, they wouldn’t work on UTF-8 strings.

Precisely, I’m using UTF-8 °;)
However, I can try with ISO-8859-1: my text editor (Pepper
on Mac OS X) can transcode from UTF-8 to ISO-8859-1 in two clicks
plus one cut and paste…
This sounds strange to me, because Ruby comes from Japan, where “special”
characters are everyday characters???

[snip]

The block has to compensate for that. Something like this:
  string.gsub!(/(^|[- '])([a-z]+)/) { $1 + $2.capitalize }
Except that [a-z] won’t match accented characters, so it’s more like this:
  string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { $1 + $2.capitalize }
And if the names aren’t limited to French, then even more special characters
creep in . . .

Yes, right. For the time being I only have to deal with French and German
accented characters…

However, because the vintages are classified by area, I might have to
change the regex depending on the region…

···

Mark J. Reed markjreed@mail.com wrote:

Yvon

Mark_J_Reed · 23 September 2003 21:36

According to the Pickaxe, or at least the online version thereof
(my dead-trees version is at home), /s means to use the SJIS
(Shift JIS, i.e. Shift Japanese Industrial Standards) multibyte
text encoding. Similarly, /e means to use EUC, and /u means to use
UTF-8. So /u is probably a better bet than /s for Yvon.

http://www.rubycentral.com/book/ref_c_regexp.html#Regexp.new

-Mark
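
For readers on a current Ruby (1.9 or later, so newer than this thread), these flags are still accepted on regex literals, where they fix the encoding of the regex itself. A quick check, assuming a stock interpreter; the constant name `FLAG_ENCODINGS` is just for illustration.

```ruby
# Map each encoding flag to the encoding Ruby assigns to a regex literal
# carrying that flag.
FLAG_ENCODINGS = {
  "s" => /x/s.encoding,  # Windows-31J (a Shift_JIS variant)
  "e" => /x/e.encoding,  # EUC-JP
  "u" => /x/u.encoding   # UTF-8
}.freeze

FLAG_ENCODINGS.each { |flag, enc| puts "/#{flag} -> #{enc.name}" }
```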

···

On Wed, Sep 24, 2003 at 06:07:08AM +0900, dblack@superlink.net wrote:

On Wed, 24 Sep 2003, Yvon Thoraval wrote:

Thanks. I don’t remember (from Perl): what is the meaning of this “s”?
It’s different in Perl and Ruby. In Perl, it means: treat the string
as a single line, so that ‘.’ matches newline. In Ruby, it affects
the encoding… I wish I could give a more knowledgeable account,
but I’ve never actually used it myself and can’t seem to dig up
documentation.

Robert · 24 September 2003 09:58

“Yvon Thoraval” <yvon.thoravallist@-SUPPRIMEZ-free.fr.invalid> wrote in
message
news:1g1r8u8.1hzv3mvupjeizN%yvon.thoravallist@-SUPPRIMEZ-free.fr.invalid…

Really? That’s arguably a bug. What character encoding are you
using?

I’m (more or less) sure about that, because even if I use:
l.gsub!(/\b[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

I’d omit the “\b” at the beginning since “é” then still matches a word
boundary:

l.gsub!(/[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Alternatively:

l.gsub!(/[^\s!?.;:-]+/) {|w| black_list.include?(w) ? w : w.capitalize }

Regards

robert
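
A sketch of the second pattern on a current Ruby (2.4 or later is assumed, so that capitalize handles accents). One thing it makes visible: the apostrophe is not among the excluded characters, so “d'alsace” is treated as a single word. Adding the apostrophe to the class is a variation that is not part of the post; the names `AS_POSTED`, `WITH_APOSTROPHE` and `titlecase_with` are illustrative.

```ruby
BLACK_LIST = %w[d de du la le sec sur entre].freeze

# Pattern from the post: a word is any run of characters that are not
# whitespace or one of ! ? . ; : -
AS_POSTED = /[^\s!?.;:-]+/
# Variation (an assumption, not from the post): also split on apostrophes.
WITH_APOSTROPHE = /[^\s!?.;:'’-]+/

def titlecase_with(pattern, name)
  name.gsub(pattern) { |w| BLACK_LIST.include?(w) ? w : w.capitalize }
end

puts titlecase_with(AS_POSTED, "vosne-romanée")           # Vosne-Romanée
puts titlecase_with(AS_POSTED, "crémant d'alsace")        # Crémant D'alsace
puts titlecase_with(WITH_APOSTROPHE, "crémant d'alsace")  # Crémant d'Alsace
```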

···

Mark J. Reed markjreed@mail.com wrote:

Yvon_Thoravallist · 24 September 2003 13:59

I’d omit the “\b” at the beginning since “é” then still matches a word
boundary:

l.gsub!(/[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Yes, fine. I also discovered that capitalization doesn’t work on
accented characters (such as é),

so I added another step for those “special” characters when they are the
first letter of a word.

Alternatively:

l.gsub!(/[^\s!?.;:-]+/) {|w| black_list.include?(w) ? w : w.capitalize }

OK. However, my list contains no punctuation such as ?!;:… only " " and “-”

···

Robert Klemme bob.news@gmx.net wrote:

Yvon

Carlos · 26 September 2003 16:28

Yes, fine. I also discovered that capitalization doesn’t work on
accented characters (such as é)

You can use an old library named unicode:

irb(main):001:0> $KCODE="u"
=> "u"
irb(main):002:0> require "unicode"
=> true
irb(main):003:0> Unicode.capitalize("àëíôů")
=> "Àëíôů"

http://raa.ruby-lang.org/list.rhtml?name=unicode
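
On a current Ruby the extra library may no longer be needed: since Ruby 2.4 the built-in case methods apply Unicode case mapping by default. A short check, assuming Ruby 2.4 or later (the sample word “élevé” is illustrative):

```ruby
puts "àëíôů".capitalize        # Àëíôů
puts "élevé".capitalize        # Élevé

# The :ascii option restricts case mapping to A-Z/a-z, which roughly
# mimics the behaviour complained about above.
puts "élevé".capitalize(:ascii)  # élevé
```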

Yvon_Thoravallist · 26 September 2003 17:06

Thanks for everything!


–
Yvon

Thomas_A_Reilly · 27 September 2003 03:17

I would appreciate it if someone could give me a regexp that would
split the following, for example:
“clonidine300 mg” into “clonidine 300 mg”

I have a bunch of drug data where the drug name and the dose have been
typed together.

Thanks

Jim_Freeze2 · 27 September 2003 03:44

There’s probably more than one way to do this. Here’s one way:

irb(main):001:0> s="clonidine300 mg"
=> "clonidine300 mg"
irb(main):005:0> s.scan(/[a-zA-Z]+|\d+/) { |i| p i }
"clonidine"
"300"
"mg"
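
To get the single string that was asked for, the scanned pieces can be joined with spaces. A small sketch; `split_dose` is a hypothetical helper name, and like the pattern above it keeps only ASCII letters and digits, so punctuation such as a decimal point would be dropped.

```ruby
# Collect runs of letters and runs of digits, then join them with spaces.
def split_dose(text)
  text.scan(/[a-zA-Z]+|\d+/).join(" ")
end

puts split_dose("clonidine300 mg")  # clonidine 300 mg
```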


–
Jim Freeze

Anybody can win, unless there happens to be a second entry.

Jonathan_Lim · 27 September 2003 03:47

/([^\d]+)(\d+)\s*mg/
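
One way this pattern could be applied, sketched with String#sub and the two capture groups; `space_before_dose` is a hypothetical name. The zero-width alternative below it is a variation that is not from the post and does not depend on the unit being “mg”.

```ruby
# Rebuild the string from the two captures, with single spaces between
# name, number and unit.
def space_before_dose(text)
  text.sub(/([^\d]+)(\d+)\s*mg/) { "#{$1} #{$2} mg" }
end

# Variation: insert a space at every letter-to-digit transition, using
# lookbehind and lookahead so that no characters are consumed.
def space_before_digits(text)
  text.gsub(/(?<=[a-zA-Z])(?=\d)/, " ")
end

puts space_before_dose("clonidine300 mg")    # clonidine 300 mg
puts space_before_digits("clonidine300 mg")  # clonidine 300 mg
```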


–
SuSE Linux 8.2 (i586)
Linux 2.4.20-4GB-athlon
ruby 1.8.0 (2003-09-10) [i686-linux]
