UTF-8 aware chop for 1.8?

Ammar_Ali · 3 November 2010 14:08

Hello,

Is there an easy way to chop (as in String#chop) a string that can
potentially contain UTF-8 in ruby 1.8? Or should I roll my own?

Thanks,
Ammar

Ammar_Ali · 3 November 2010 15:38

Ended up making my own. Posting it here for the benefit of others, and
maybe some feedback.

https://gist.github.com/661217

Regards,
Ammar

JEG2 · 3 November 2010 15:57

Well, it should be this simple:

str.gsub(/.\z/mu, "")

James Edward Gray II
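
A minimal sketch of the one-liner above on a string that ends in a multi-byte character. The sample string is an assumption for illustration, and sub is used instead of gsub because a pattern anchored with \z can match at most once.

```ruby
# Hypothetical sample: "caf" followed by the two UTF-8 bytes of "é".
str = "caf\xC3\xA9"

# /u tells the regex engine to read the string as UTF-8, so "." consumes
# a whole character instead of one byte; \z anchors at the very end.
chopped = str.sub(/.\z/mu, "")

puts chopped
```

The original string is left untouched; sub returns a new string.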

···

On Nov 3, 2010, at 9:08 AM, Ammar Ali wrote:

Is there an easy way to chop (as in String#chop) a string that can
potentially contain UTF-8 in ruby 1.8? Or should I roll my own?

Adam_Prescott1 · 3 November 2010 16:04

I was going to say

$KCODE="U"

=> "U"

s = "one two three"

=> "one two three"

s.gsub(/^(.+)./u) { $1 }

=> "one two thre"

I guess I overthought it, huh!
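
One difference between the two patterns shows up on multi-line input. The sketch below uses a made-up two-line string to compare them; it is an illustration, not something posted in the thread.

```ruby
multi = "ab\ncd"   # hypothetical two-line sample

# "^" matches at the start of every line, and without /m "." stops at "\n",
# so gsub trims the last character of each line.
per_line = multi.gsub(/^(.+)./u) { $1 }

# Anchoring with \z touches only the final character of the whole string.
at_end = multi.sub(/.\z/mu, "")

p per_line
p at_end
```

For a chop replacement, the \z version is the one that matches String#chop on strings with embedded newlines.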

···

On Wed, Nov 3, 2010 at 3:38 PM, Ammar Ali <ammarabuali@gmail.com> wrote:

Ended up making my own. Posting it here for the benefit of others, and
maybe some feedback.

https://gist.github.com/661217

Regards,
Ammar

Ammar_Ali · 3 November 2010 16:33

Beautiful. Thank you both.

It was a good exercise for me, so I don't necessarily feel that I
wasted 30 minutes of my life.

By the way, the m option seems superfluous in James' version. I get
the same results without it.

Thanks again,
Ammar

···

On Wed, Nov 3, 2010 at 5:57 PM, James Edward Gray II <james@graysoftinc.com> wrote:

Well, it should be this simple:

str.gsub(/.\z/mu, "")

On Wed, Nov 3, 2010 at 6:04 PM, Adam Prescott <mentionuse@gmail.com> wrote:

s.gsub(/^(.+)./u) { $1 }

=> "one two thre"

JEG2 · 3 November 2010 16:38

Ammar Ali wrote:

By the way, the m option seems superfluous in James' version. I get
the same results without it.

It's not:

"\n".sub(/.\z/u, "")

=> "\n"

"\n".sub(/.\z/mu, "")

=> ""

Using gsub() over sub() was a dumb mistake on my part though. sub() is all you need, since it can only match once.

James Edward Gray II
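
The /m flag matters because String#chop itself removes a trailing newline. One further difference, as far as I know: chop treats a trailing "\r\n" as a single unit, while /.\z/mu removes only the "\n". The sketch below compares them; the alternation pattern is my own assumption about how to mirror that behavior, not something from the thread.

```ruby
crlf = "abc\r\n"   # hypothetical sample with a Windows line ending

# With /m, "." matches "\n", so only the final "\n" is removed.
plain = crlf.sub(/.\z/mu, "")

# Trying "\r\n" first removes the pair as one unit, like String#chop.
chop_like = crlf.sub(/(?:\r\n|.)\z/mu, "")

# Built-in chop, for comparison on this ASCII-only input.
builtin = crlf.chop

p plain
p chop_like
p builtin
```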


Brian_Candler · 4 November 2010 14:37

Ammar Ali wrote in post #959047:

By the way, the m option seems superfluous in James' version. I get
the same results without it.

foo = "abc\n"

=> "abc\n"

foo.sub(/.\z/mu, '')

=> "abc"

foo.sub(/.\z/u, '')

=> "abc\n"


Ammar_Ali · 3 November 2010 16:56

Thanks for the clarification.

My method now looks like:

def chop_utf8(s)
  return unless s

  lead = s.sub(/.\z/mu, "")
  last = s.scan(/.\z/mu).first
  last = '' unless last

  [lead, last]
end

Short and sweet.

Cheers,
Ammar

···

On Wed, Nov 3, 2010 at 6:38 PM, James Edward Gray II <james@graysoftinc.com> wrote:

On Nov 3, 2010, at 11:33 AM, Ammar Ali wrote:

By the way, the m option seems superfluous in James' version. I get
the same results without it.

It's not:

"\n".sub(/.\z/u, "")

=> "\n"

"\n".sub(/.\z/mu, "")

=> ""

Using gsub() over sub() was a dumb mistake on my part though. sub() is all you need, since it can only match once.

Ammar_Ali · 4 November 2010 14:53

James clarified this earlier. But thanks for chiming in nonetheless.

Cheers,
Ammar

···

On Thu, Nov 4, 2010 at 4:37 PM, Brian Candler <b.candler@pobox.com> wrote:

Ammar Ali wrote in post #959047:

By the way, the m option seems superfluous in James' version. I get
the same results without it.

foo = "abc\n"

=> "abc\n"

foo.sub(/.\z/mu, '')

=> "abc"

foo.sub(/.\z/u, '')

=> "abc\n"

JEG2 · 3 November 2010 17:00

Ammar Ali wrote:

My method now looks like:

def chop_utf8(s)
  return unless s

  lead = s.sub(/.\z/mu, "")
  last = s.scan(/.\z/mu).first
  last = '' unless last

The two lines above can be replaced with the more efficient:

  last = s[/.\z/mu] || ''

  [lead, last]
end

James Edward Gray II
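
Putting the suggestion together with the rest of the method, the result would look roughly like the sketch below. The sample strings in the usage lines are assumptions added for illustration.

```ruby
# Returns [everything but the last character, the last character],
# or nil when given nil.
def chop_utf8(s)
  return unless s

  lead = s.sub(/.\z/mu, "")     # string without its final character
  last = s[/.\z/mu] || ''       # final character, or '' if s is empty

  [lead, last]
end

# Hypothetical inputs: "\xC3\xA9" is the UTF-8 encoding of "é".
p chop_utf8("caf\xC3\xA9")
p chop_utf8("")
p chop_utf8(nil)
```

String#[] with a regex returns the matched text or nil, which is why the `|| ''` fallback covers the empty-string case.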


Ammar_Ali · 3 November 2010 17:25

At this rate the method is going to disappear.

I updated the gist accordingly:

UTF-8 aware string chop (the first gist was posted as anonymous) · GitHub

Thanks again,
Ammar

···

On Wed, Nov 3, 2010 at 7:00 PM, James Edward Gray II <james@graysoftinc.com> wrote:

On Nov 3, 2010, at 11:56 AM, Ammar Ali wrote:

My method now looks like:

def chop_utf8(s)
return unless s

lead = s.sub(/.\z/mu, "")
last = s.scan(/.\z/mu).first
last = '' unless last

The two lines above can be replaced with the more efficient:

last = s[/.\z/mu] || ''

botp1 · 4 November 2010 02:18

Can we make that a single pass?

str =~ /.\z/mu
[$`,$&]

best regards -botp
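
A sketch of the single-pass idea wrapped in a method. It uses MatchData#pre_match and m[0], which hold the same values as the two special variables above, and adds a fallback for when nothing matches. The method name and the fallback are my assumptions, not part of the suggestion.

```ruby
# Hypothetical one-pass variant: a single regex match yields both parts.
def chop_utf8_one_pass(str)
  return unless str

  m = /.\z/mu.match(str)
  return [str, ""] unless m    # no match means the string is empty

  [m.pre_match, m[0]]          # text before the match, then the match itself
end

# "\xC3\xA9" is the UTF-8 encoding of "é" (sample input is an assumption).
p chop_utf8_one_pass("caf\xC3\xA9")
p chop_utf8_one_pass("")
```

Without the fallback, an empty string would produce [nil, nil], since both special variables are nil after a failed match.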

···

On Thu, Nov 4, 2010 at 1:25 AM, Ammar Ali <ammarabuali@gmail.com> wrote:

On Wed, Nov 3, 2010 at 7:00 PM, James Edward Gray II

last = s[/.\z/mu] || ''

I updated the gist accordingly:
UTF-8 aware string chop (the first gist was posted as anonymous) · GitHub
