UTF-8 aware chop for 1.8?

Hello,

Is there an easy way to chop (as in String#chop) a string that can
potentially contain UTF-8 in ruby 1.8? Or should I roll my own?

Thanks,
Ammar

Ended up making my own. Posting it here for the benefit of others, and
maybe some feedback.

  https://gist.github.com/661217

Regards,
Ammar

Well, it should be this simple:

  str.gsub(/.\z/mu, "")

James Edward Gray II
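
For anyone trying the one-liner above, here is a minimal sketch applied to a string that ends in a multibyte character. The helper name utf8_chop and the sample string are made up for illustration; the é is written as its two UTF-8 bytes so the byte count is easy to see.

```ruby
# Hypothetical helper wrapping the one-liner from the post above.
def utf8_chop(str)
  # /u treats the string as UTF-8, /m lets "." also match a newline,
  # and \z anchors at the very end of the string.
  str.gsub(/.\z/mu, "")
end

word    = "caf\xC3\xA9"   # "café": 4 characters, 5 bytes
chopped = utf8_chop(word)

puts chopped                        # prints "caf"
puts chopped.unpack("C*").length    # prints 3 (both bytes of the é are gone)
```

As far as I know, String#chop on 1.8 works on bytes, so it would remove only the trailing \xA9 and leave a broken character behind; that is the problem the regexp avoids.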

···

On Nov 3, 2010, at 9:08 AM, Ammar Ali wrote:

Is there an easy way to chop (as in String#chop) a string that can
potentially contain UTF-8 in ruby 1.8? Or should I roll my own?

I was going to say

$KCODE="U"

=> "U"

s = "one two three"

=> "one two three"

s.gsub(/^(.+)./u) { $1 }

=> "one two thre"

I guess I overthought it, huh!
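
A quick comparison of the two patterns on edge cases, as a sketch. The method names chop_anchored and chop_captured are invented for this note. The capture-based version needs at least two characters on a line before it can match, and because ^ matches at the start of every line, gsub trims each line rather than only the end of the string.

```ruby
# Hypothetical names for the two approaches posted above.
def chop_anchored(str)
  # Matches exactly one character (newline included) at the end of the string.
  str.gsub(/.\z/mu, "")
end

def chop_captured(str)
  # Keeps everything on a line except its last character.
  str.gsub(/^(.+)./u) { $1 }
end

["one two three", "x", "ab\ncd"].each do |s|
  puts "#{s.inspect}: #{chop_anchored(s).inspect} vs #{chop_captured(s).inspect}"
end
```

The two agree on "one two three", but differ on the one-character string (the captured version leaves "x" untouched) and on the two-line string (the captured version shortens both lines).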

···

On Wed, Nov 3, 2010 at 3:38 PM, Ammar Ali <ammarabuali@gmail.com> wrote:

Ended up making my own. Posting it here for the benefit of others, and
maybe some feedback.

https://gist.github.com/661217

Regards,
Ammar

Beautiful. Thank you both.

It was a good exercise for me, so I don't necessarily feel that I
wasted 30 minutes of my life :)

By the way, the m option seems superfluous in James' version. I get
the same results without it.

Thanks again,
Ammar

···

On Wed, Nov 3, 2010 at 5:57 PM, James Edward Gray II <james@graysoftinc.com> wrote:

Well, it should be this simple:

str.gsub(/.\z/mu, "")

On Wed, Nov 3, 2010 at 6:04 PM, Adam Prescott <mentionuse@gmail.com> wrote:

s.gsub(/^(.+)./u) { $1 }

=> "one two thre"

Well, it should be this simple:

str.gsub(/.\z/mu, "")

s.gsub(/^(.+)./u) { $1 }

=> "one two thre"

Beautiful. Thank you both.

It was a good exercise for me, so I don't necessarily feel that I
wasted 30 minutes of my life :)

By the way, the m option seems superfluous in James' version. I get
the same results without it.

It's not:

"\n".sub(/.\z/u, "")

=> "\n"

"\n".sub(/.\z/mu, "")

=> ""

Using gsub() over sub() was a dumb mistake on my part though. sub() is all you need, since it can only match once.

James Edward Gray II

···

On Nov 3, 2010, at 11:33 AM, Ammar Ali wrote:

On Wed, Nov 3, 2010 at 5:57 PM, James Edward Gray II <james@graysoftinc.com> wrote:
On Wed, Nov 3, 2010 at 6:04 PM, Adam Prescott <mentionuse@gmail.com> wrote:

Ammar Ali wrote in post #959047:

By the way, the m option seems superfluous in James' version. I get
the same results without it.

foo = "abc\n"

=> "abc\n"

foo.sub(/.\z/mu, '')

=> "abc"

foo.sub(/.\z/u, '')

=> "abc\n"

···

--
Posted via http://www.ruby-forum.com/.

Thanks for the clarification.

My method now looks like:

def chop_utf8(s)
  return unless s

  lead = s.sub(/.\z/mu, "")
  last = s.scan(/.\z/mu).first
  last = '' unless last

  [lead, last]
end

Short and sweet.

Cheers,
Ammar
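
A short usage sketch for the method above, repeated here so the snippet runs on its own. The sample strings are made up. Note that it returns a [lead, last] pair rather than a single string, and nil for nil input.

```ruby
def chop_utf8(s)
  return unless s

  lead = s.sub(/.\z/mu, "")
  last = s.scan(/.\z/mu).first
  last = '' unless last

  [lead, last]
end

# "naïve€": the ï is 2 bytes and the € is 3 bytes in UTF-8.
lead, last = chop_utf8("na\xC3\xAFve\xE2\x82\xAC")
puts lead    # prints "naïve"
puts last    # prints "€"

p chop_utf8("")     # ["", ""]
p chop_utf8(nil)    # nil
```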

···

On Wed, Nov 3, 2010 at 6:38 PM, James Edward Gray II <james@graysoftinc.com> wrote:

On Nov 3, 2010, at 11:33 AM, Ammar Ali wrote:

By the way, the m option seems superfluous in James' version. I get
the same results without it.

It's not:

"\n".sub(/.\z/u, "")

=> "\n"

"\n".sub(/.\z/mu, "")

=> ""

Using gsub() over sub() was a dumb mistake on my part though. sub() is all you need, since it can only match once.

James clarified this earlier. But thanks for chiming in nonetheless.

Cheers,
Ammar

···

On Thu, Nov 4, 2010 at 4:37 PM, Brian Candler <b.candler@pobox.com> wrote:

Ammar Ali wrote in post #959047:

By the way, the m option seems superfluous in James' version. I get
the same results without it.

foo = "abc\n"

=> "abc\n"

foo.sub(/.\z/mu, '')

=> "abc"

foo.sub(/.\z/u, '')

=> "abc\n"

My method now looks like:

def chop_utf8(s)
  return unless s

  lead = s.sub(/.\z/mu, "")
  last = s.scan(/.\z/mu).first
  last = '' unless last

The two lines above can be replaced with the more efficient:

  last = s[/.\z/mu] || ''

  [lead, last]
end

James Edward Gray II
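
The replacement relies on String#[] accepting a regexp: it returns the matched text, or nil when nothing matches, which is why the || '' fallback is enough. A sketch of the shortened method, with made-up sample strings:

```ruby
def chop_utf8(s)
  return unless s

  lead = s.sub(/.\z/mu, "")
  last = s[/.\z/mu] || ''   # nil on an empty string, so fall back to ''

  [lead, last]
end

p "abc"[/.\z/mu]       # "c"
p ""[/.\z/mu]          # nil
p chop_utf8("abc")     # ["ab", "c"]
p chop_utf8("")        # ["", ""]
```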

···

On Nov 3, 2010, at 11:56 AM, Ammar Ali wrote:

At this rate the method is going to disappear. :)

I updated the gist accordingly:

  UTF-8 aware string chop. (the first gist was posted as anonymous) · GitHub

Thanks again,
Ammar

···

On Wed, Nov 3, 2010 at 7:00 PM, James Edward Gray II <james@graysoftinc.com> wrote:

On Nov 3, 2010, at 11:56 AM, Ammar Ali wrote:

My method now looks like:

def chop_utf8(s)
  return unless s

  lead = s.sub(/.\z/mu, "")
  last = s.scan(/.\z/mu).first
  last = '' unless last

The two lines above can be replaced with the more efficient:

  last = s[/.\z/mu] || ''

Can we do that in one pass?

str =~ /.\z/mu
[$`,$&]

best regards -botp
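
One caveat with the one-pass idea: when the string is empty the match fails, and $` and $& are both nil. Below is a sketch that keeps the single match but guards that case, using a MatchData object instead of the globals. The method name chop_utf8_one_pass is invented here.

```ruby
def chop_utf8_one_pass(s)
  return unless s

  m = /.\z/mu.match(s)
  # pre_match is everything before the last character, m[0] is the character.
  m ? [m.pre_match, m[0]] : [s, '']
end

p chop_utf8_one_pass("abc\n")   # ["abc", "\n"]
p chop_utf8_one_pass("")        # ["", ""]
```

This returns the same [lead, last] pair as the two-step version while running the regexp only once.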

···

On Thu, Nov 4, 2010 at 1:25 AM, Ammar Ali <ammarabuali@gmail.com> wrote:

On Wed, Nov 3, 2010 at 7:00 PM, James Edward Gray II

last = s[/.\z/mu] || ''

I updated the gist accordingly:
UTF-8 aware string chop. (the first gist was posted as anonymous) · GitHub