Cutting a piece of text

Zdebel · 12 February 2006 16:18

Hello!
I've started to learn Ruby and I'm amazed by it. Now I have a problem
that I can't solve. If I have a string like this:
"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>" how can I
cut the " artist=XXX album=XXX title=XXX" part, so it would look like:
"<lyrics> Lalalalala </lyrics>"? Could you please help me?

--
Posted via http://www.ruby-forum.com/.

James_Edward_Gray_II · 12 February 2006 16:57

You can do it with a regular expression like the following, but I must stress that this isn't very robust:

>> "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".sub(/<(\w+)[^>]+>/, "<\\1>")
=> "<lyrics> Lalalalala </lyrics>"
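
To make the "isn't very robust" caveat concrete, here is a small sketch of two inputs where the substitution shown above goes wrong. The method name strip_attributes and the extra sample strings are made up for illustration; they are not from the original question.

```ruby
# Hypothetical helper wrapping the substitution shown above.
def strip_attributes(text)
  text.sub(/<(\w+)[^>]+>/, "<\\1>")
end

# Works for the posted example.
p strip_attributes("<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>")

# A tag with no attributes loses the last letter of its name, because
# [^>]+ needs at least one character and takes it from \w+.
p strip_attributes("<lyrics> Lalalalala </lyrics>")

# A ">" inside an attribute value ends the match too early.
p strip_attributes('<lyrics title="a > b"> Lalalalala </lyrics>')
```

The second call prints "<lyric> Lalalalala </lyrics>" and the third leaves part of the attribute behind, which is the kind of breakage a real parser avoids.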

Hope that helps.

James Edward Gray II

David_Vallner · 12 February 2006 17:05

The very geeky, and most probably least error-prone way would be whacking the
string with a DOM parser, clearing the attributes, and then printing it out
again. Unfortunately, I haven't been doing any DOM manipulation in Ruby, so I
can't provide code.

David Vallner

Sammyo · 12 February 2006 17:18

Learn regular expressions. Here's a not-so-great example:

a = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"
b = a.gsub(/\w*=\w*/ , "")
c = b.gsub(/\s/, "")
print c, "\n"

<lyrics>Lalalalala</lyrics>

A slightly (yes very slightly) more realistic example:

a = '<lyrics artist="Prince" album="purplerain" title="computerblue">
Lalalalala </lyrics>'
b = a.gsub(/\w*="\w*"/ , "")
c = b.gsub(/\s/, "")
print c, "\n"

<lyrics>Lalalalala</lyrics>

And what if there are spaces in an attribute value:

a = '<lyrics artist="Prince" album="purplerain" title="Computer Blue">
Lalalalala </lyrics>'
b = a.gsub(/\w*=".*"/ , "")
c = b.gsub(/\s/, "")
print c, "\n"

<lyrics>Lalalalala</lyrics>
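
Note that the examples above also delete the spaces inside the lyrics themselves. A hedged variation that removes only the attributes and leaves other whitespace alone might look like the following. The ATTRIBUTE pattern and the strip_attrs name are illustrative assumptions, and the pattern would also eat anything shaped like name=value in the text between the tags.

```ruby
# Matches leading whitespace, an attribute name, "=", and a value that is
# double-quoted, single-quoted, or a bare run of non-space characters.
ATTRIBUTE = /\s+\w+=(?:"[^"]*"|'[^']*'|[^\s>]+)/

# Hypothetical helper: drop every attribute-shaped chunk from the text.
def strip_attrs(text)
  text.gsub(ATTRIBUTE, "")
end

puts strip_attrs('<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>')
puts strip_attrs('<lyrics artist="Prince" title="Computer Blue"> Lalalalala </lyrics>')
```

Both calls print "<lyrics> Lalalalala </lyrics>" with the spaces around the lyrics intact.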

W_James · 13 February 2006 07:58

p " <lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".
sub(/\s+[^<>]*(?=>)/, '' )

p " <lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".
scan( /\G ( [^<]+ ) | \G ( < \S* ) [^>]* ( > ) /x ).
flatten.compact.join

Zdebel · 12 February 2006 17:05

:O, wow, it works. I wish I knew how this (/<(\w+)[^>]+>/, "<\\1>")
regular expression works :). Anyway, thank you, you helped me very much.

James_Edward_Gray_II · 12 February 2006 17:14

On Feb 12, 2006, at 11:05 AM, David Vallner wrote:

> The very geeky, and most probably least error-prone way would be whacking the
> string with a DOM parser, clearing the attributes, and then printing it out
> again. Unfortunately, I haven't been doing any DOM manipulation in Ruby, so I
> can't provide code.

The following is how you do it for valid XML, though the posted example wasn't quite valid (its attribute values are unquoted):

#!/usr/local/bin/ruby -w

require "rexml/document"

doc = "<lyrics artist='XXX' album='XXX' title='XXX'> Lalalalala </lyrics>"

xml = REXML::Document.new(doc)
xml.root.attributes.clear
xml.write
puts

__END__

James Edward Gray II

Marcin_Mielzynski · 12 February 2006 18:08

James Edward Gray II wrote:

>> "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".sub(/<(\w+)[^>]+>/, "<\\1>")
=> "<lyrics> Lalalalala </lyrics>"

A reluctant quantifier would be a bit faster:

p "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".gsub(/<(\w+).*?>/, "<\\1>")

lopex

James_Edward_Gray_II · 12 February 2006 17:20

On Feb 12, 2006, at 11:05 AM, Zdebel wrote:

> I wish I knew how this (/<(\w+)[^>]+>/, "<\\1>")
> regular expression works :).

It reads:

/ < # find a < character
   ( # capture this next part into $1 (\\1 in the replacement string)
     \w+ # followed by one or more word characters
   ) # end capture
   [^>]+ # followed by one or more non > characters
   > # and finally a > character
/x

The replacement just restores the <\w+> and leaves out the [^>]+ part (the space and attributes).
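
The commented pattern can be run as it stands, since the x flag makes Ruby ignore whitespace and # comments inside the regular expression. A minimal sketch, with TAG_WITH_ATTRIBUTES as a made-up constant name:

```ruby
TAG_WITH_ATTRIBUTES = /
  <        # find a < character
  (        # capture this next part into group 1
    \w+    # one or more word characters (the tag name)
  )        # end capture
  [^>]+    # one or more non-> characters (the space and attributes)
  >        # and finally a > character
/x

data = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"

# "<\\1>" puts back the captured tag name between fresh angle brackets.
puts data.sub(TAG_WITH_ATTRIBUTES, "<\\1>")
```

This prints "<lyrics> Lalalalala </lyrics>", the same result as the one-line version.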

Hope that helps.

James Edward Gray II

James_Edward_Gray_II · 12 February 2006 18:30

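On Feb 12, 2006, at 12:08 PM, Marcin Mielżyński wrote:

> reluctant would be a bit faster:
>
> p "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".gsub(/<(\w+).*?>/, "<\\1>")

Are you sure?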

$ ruby regexp_time.rb
Rehearsal -------------------------------------------------
/<(w+)[^>]+>/   7.210000   0.030000   7.240000 (  7.266166)
/<(w+).*?>/     7.710000   0.020000   7.730000 (  7.757304)
--------------------------------------- total: 14.970000sec

                    user     system      total        real
/<(w+)[^>]+>/   7.170000   0.030000   7.200000 (  7.227075)
/<(w+).*?>/     7.730000   0.020000   7.750000 (  7.777196)
$ cat regexp_time.rb
#!/usr/local/bin/ruby -w

require "benchmark"

tests = 1000000
data = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"

Benchmark.bmbm do |x|
   x.report("/<(\w+)[^>]+>/") do
     tests.times { data.sub(/<(\w+)[^>]+>/, "<\\1>") }
   end
   x.report("/<(\w+).*?>/") do
     tests.times { data.sub(/<(\w+).*?>/, "<\\1>") }
   end
end

__END__

James Edward Gray II

Zdebel · 12 February 2006 17:33

A big thank you to all of you for such a response. This helped me
a lot and my script is working, but I will practice more using your
advice.

David_Vallner · 12 February 2006 20:21

The nongreedy match has to "back up" and retry on every character after the
tag name, whereas James' [^>]+ never has to back up. In fact, even a
greedy .* would probably be faster than a nongreedy one in this case.
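
That guess about a greedy .* is easy to check. Below is a sketch of the same benchmark with a third pattern added. The PATTERNS hash, the constant names, and the lower iteration count are assumptions for illustration, and timings will vary by machine and Ruby version, so none are shown here.

```ruby
require "benchmark"

# Hypothetical set of patterns to compare: negated class, lazy dot, greedy dot.
PATTERNS = {
  "negated class" => /<(\w+)[^>]+>/,
  "lazy dot"      => /<(\w+).*?>/,
  "greedy dot"    => /<(\w+).*>/
}

INPUT = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"
TESTS = 100_000

Benchmark.bmbm do |x|
  PATTERNS.each do |label, pattern|
    x.report(label) { TESTS.times { INPUT.sub(pattern, "<\\1>") } }
  end
end
```

Keep in mind that the greedy .* runs on to the last > in the string, so it replaces the whole input with "<lyrics>". It is only interesting here as a timing comparison, not as a fix.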

Gotta love the black art that is optimizing regexps.

David Vallner

Marcin_Mielzynski · 12 February 2006 20:38

Oops, you are right!

But as I understand it, greedy quantifiers do backtrack as well (just not in the case above).

/a+aa/ =~ "aaaaa"
will backtrack two characters.

Only a possessive quantifier (in Oniguruma, for example) consumes in a truly greedy way, never giving characters back.

So
/a++aa/ =~ "aaaaa"
won't match.
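
A quick way to see the difference, assuming Ruby 1.9 or later, where the Oniguruma-style ++ quantifier is available. The atomic group (?>...) is included as an equivalent spelling.

```ruby
text = "aaaaa"

# Greedy: a+ first takes all five characters, then gives two back
# so that the trailing "aa" can match.
p(/a+aa/ =~ text)      # => 0

# Possessive: a++ takes all five characters and never gives any back,
# so there is nothing left for "aa".
p(/a++aa/ =~ text)     # => nil

# An atomic group behaves the same way as the possessive quantifier.
p(/(?>a+)aa/ =~ text)  # => nil
```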

lopex

David_Vallner · 12 February 2006 21:35

Yes, they do backtrack. The point is to use the one that you expect to
backtrack less.

Since in this case we knew very well that there would be quite a few
characters after the first word, the nongreedy quantifier was slower.

The REAL black magic is whether a greedy or a possessive quantification of
the [^>] variant would be faster. *snicker* Is anyone running 1.9 able to
benchmark this?
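
For anyone who wants to try, here is a rough sketch of that comparison, assuming Ruby 1.9 or later so that ++ is understood. The constant names and the iteration count are made up, and no timings are claimed here.

```ruby
require "benchmark"

SAMPLE = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"
RUNS   = 100_000

GREEDY     = /<(\w+)[^>]+>/    # may give characters back if the match fails
POSSESSIVE = /<(\w+)[^>]++>/   # never gives characters back

Benchmark.bmbm do |x|
  x.report("greedy")     { RUNS.times { SAMPLE.sub(GREEDY, "<\\1>") } }
  x.report("possessive") { RUNS.times { SAMPLE.sub(POSSESSIVE, "<\\1>") } }
end
```

Both patterns produce the same result on this input, so any difference would be purely in how much work the engine does on the way.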

David Vallner
