Regular expressions - Again

J_mp · 6 March 2007 00:00

I'm really bad with these things called regular expressions, so I'm
looking for help again.

Now, if I have a String like
"some string some content <title>this I want</title>"

And I want to use the scan function to extract what is between <title>
and </title>, how can I build my regular expression? The final result
should be:
this I want

Thanks

--
Posted via http://www.ruby-forum.com/.

Gavin_Kistner2 · 6 March 2007 00:20

Now, if I have a String like
"some string some content <title>this I want</title>"

And I want to use the scan function to extract what is between <title>
and </title> how can I build my regular expression. The final result
should be:
this I want

irb(main):001:0> str = "This is <title>what I
irb(main):002:0" want</title> and no more"
=> "This is <title>what I\nwant</title> and no more"
irb(main):003:0> str[ %r{<title>(.+?)</title>}, 1 ]
=> nil
irb(main):004:0> str[ %r{<title>(.+?)</title>}m, 1 ]
=> "what I\nwant"

Note the use of 'm' to match across multiple lines, in case your
title tag spans them.
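
Since the original question asked about scan specifically, here is a minimal sketch of the same pattern used with String#scan. The variable names are only for illustration.

```ruby
# Sketch: the same pattern with String#scan, as the original question asked.
str = "some string some content <title>this I want</title>"

# With one capture group, scan returns an array of one-element arrays,
# so flatten gives a plain list of the captured titles.
titles = str.scan(%r{<title>(.+?)</title>}m).flatten
p titles   # => ["this I want"]

# Without /m, "." does not match "\n", so a title spanning lines is missed.
multiline = "This is <title>what I\nwant</title> and no more"
without_m = multiline.scan(%r{<title>(.+?)</title>}).flatten
with_m    = multiline.scan(%r{<title>(.+?)</title>}m).flatten
p without_m   # => []
p with_m      # => ["what I\nwant"]
```

scan is mostly useful when the string may contain several matches; for a single title, str[regex, 1] as shown above does the same job.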

Note that this will fail if you have "<title>This is <title>nested</title>
content</title>", and will result in "This is <title>nested".
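
The nested case can be reproduced with a short sketch. It compares the non-greedy pattern from this post with a greedy variant (my own addition, not something suggested in the thread) to show that neither one really understands nesting.

```ruby
nested = "<title>This is <title>nested</title> content</title>"

# Non-greedy: stops at the first </title>, cutting the outer title short.
lazy = nested[%r{<title>(.+?)</title>}m, 1]
p lazy     # => "This is <title>nested"

# Greedy: runs to the last </title>. That happens to cover this input,
# but it would over-match a string holding two separate titles.
greedy = nested[%r{<title>(.+)</title>}m, 1]
p greedy   # => "This is <title>nested</title> content"
```

A regular expression of this kind cannot count balanced tags, so input that may really nest is better handled by a parser.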

Jenda_Krynicky · 7 March 2007 13:48

J. mp wrote:

I'm really bad with these things called regular expressions, so I'm
looking for help again.

Now, if I have a String like
"some string some content <title>this I want</title>"

And I want to use the scan function to extract what is between <title>
and </title>, how can I build my regular expression? The final result
should be:
this I want

Thanks

You generally want to use an HTML parser ... provided that Wuby has one.

You may be lucky with <title> since it's likely to not include any
attributes, but still there might be some whitespace INSIDE the tags,
there may be a comment inside the <title>...</title> that you may or may
not want, there may be a <title> or </title> inside a comment etc. etc.
etc.
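
One of these pitfalls, a <title> inside a comment, can be shown with a small sketch. The HTML snippet is made up for illustration, and stripping comments first is only a partial workaround that I am assuming for the example, not a substitute for a parser.

```ruby
html = <<~HTML
  <head>
    <!-- <title>old draft title</title> -->
    <title>Real title</title>
  </head>
HTML

# The naive pattern happily matches the commented-out title first.
naive = html[%r{<title>(.*?)</title>}m, 1]
p naive      # => "old draft title"

# Partial workaround (an assumption of this sketch): drop comments first.
without_comments = html.gsub(/<!--.*?-->/m, "")
stripped = without_comments[%r{<title>(.*?)</title>}m, 1]
p stripped   # => "Real title"
```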

In (censored) I'd use HTML::Parser from CPAN, but shhhh ... this is a
Wuby site, we don't speak of such things here.

Jenda

J_mp · 7 March 2007 13:59

You generally want to use an HTML parser ... provided that Wuby has one.

You may be lucky with <title> since it's likely to not include any
attributes, but still there might be some whitespace INSIDE the tags,
there may be a comment inside the <title>...</title> that you may or may
not want, there may be a <title> or </title> inside a comment etc. etc.
etc.

In (censored) I'd use HTML::Parser from CPAN, but shhhh ... this is a
Wuby site, we don't speak of such things here.

Jenda

I ended up with Hpricot; it's working fine in the few tests I've run so
far.

Harry4 · 7 March 2007 14:05

You may be lucky with <title> since it's likely to not include any
attributes, but still there might be some whitespace INSIDE the tags,

Jenda

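str = "This is <title> what I\n\n\n\n \n want </title> and no more"
p str

# /m lets . match newlines; .*? is non-greedy, so it stops at the first </title>
str =~ /<title>(.*?)<\/title>/m
# \s already matches \n, so /\s+/ is enough to collapse runs of whitespace
p $1.gsub(/\s+/, " ").strip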

--

Japanese Ruby List Subjects in English

Alex_Young · 7 March 2007 14:08

Jenda Krynicky wrote:

J. mp wrote:

I'm really bad with these things called regular expressions, so I'm
looking for help again.

Now, if I have a String like
"some string some content <title>this I want</title>"

And I want to use the scan function to extract what is between <title>
and </title>, how can I build my regular expression? The final result
should be:
this I want

Thanks

You generally want to use an HTML parser ... provided that Wuby has one.

I wonder what the first hit from googling "ruby html parser" is? Ah yes, hpricot. A perfectly valid approach.

Personally, in the past I've libtidy'd html to xml and used REXML's stream parser. This has the rather wonderful benefit of actually being able to fix some fairly broken html, and failing early if it can't.
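
For reference, the REXML half of that approach might look roughly like the sketch below. It assumes the input has already been cleaned into well-formed XML (the tidy step is not shown), and TitleListener is a hypothetical class name invented for this example.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener that collects the text inside <title> elements.
class TitleListener
  include REXML::StreamListener
  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  # Called for every opening tag.
  def tag_start(name, _attrs)
    return unless name == "title"
    @in_title = true
    @titles << +""   # start a new, mutable buffer for this title
  end

  # Called for text nodes; only keep text seen inside a <title>.
  def text(data)
    @titles.last << data if @in_title
  end

  # Called for every closing tag.
  def tag_end(name)
    @in_title = false if name == "title"
  end
end

xml = "<html><head><title>this I want</title></head><body/></html>"
listener = TitleListener.new
REXML::Document.parse_stream(xml, listener)
p listener.titles   # => ["this I want"]
```

The stream parser reports events as it reads, so nothing is built in memory beyond what the listener chooses to keep.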

> You may be lucky with <title> since it's likely to not include any
> attributes, but still there might be some whitespace INSIDE the tags,
> there may be a comment inside the <title>...</title> that you may or may
> not want, there may be a <title> or </title> inside a comment etc. etc.
> etc.
>
> In (censored) I'd use HTML::Parser from CPAN, but shhhh ... this is a
> Wuby site, we don't speak of such things here.
>
It's a mailing list, not a site... Easy to confuse, possibly, but the mailing list is the primary interface.

--
Alex

Alex_Young · 7 March 2007 14:08

Harry wrote:

You may be lucky with <title> since it's likely to not include any
attributes, but still there might be some whitespace INSIDE the tags,

Jenda

str = "This is <title> what I\n\n\n\n \n want </title> and no more"
p str

str =~ /<title>(.*?)<\/title>/m
p $1.gsub(/(\n|\s)+/, " ").strip

I think he meant:

str = "This is <title >what I want</title> and no more"

but we don't know if the problem requires handling anything more complex than simple tags.
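
If whitespace inside the tag itself does need handling, a slightly more tolerant pattern is one option. This is only a sketch: the \s* and the case-insensitive flag are my assumptions about what "simple tags" might look like in practice, and it still ignores attributes, comments and nesting.

```ruby
str = "This is <title >what I want</title> and no more"

# The strict pattern misses the tag because of the space before ">".
strict = str[%r{<title>(.*?)</title>}m, 1]
p strict     # => nil

# Allow optional whitespace before ">" and ignore case (e.g. <TITLE>).
tolerant_re = %r{<title\s*>(.*?)</title\s*>}mi
tolerant = str[tolerant_re, 1]
p tolerant   # => "what I want"
```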

--
Alex

Jenda_Krynicky · 7 March 2007 15:49

Alex Young wrote:

Jenda Krynicky wrote:

this I want

Thanks

You generally want to use an HTML parser ... provided that Wuby has one.

I wonder what the first hit from googling "ruby html parser" is? Ah
yes, hpricot. A perfectly valid approach.

Hpricot? How come the name does not surprise me? It's a perfectly clear
name specifying exactly what and how it does.

Jenda
module Enumerable
alias foldl inject # inventing names in a foreign language huh?
end

Harry4 · 7 March 2007 15:14

I think he meant:

str = "This is <title >what I want</title> and no more"

--
Alex

Oh, that's quite different.
Never mind.

Emily Litella

J_mp · 7 March 2007 15:57

You generally want to use an HTML parser ... provided that Wuby has one.

I wonder what the first hit from googling "ruby html parser" is? Ah
yes, hpricot. A perfectly valid approach.

Hpricot? How come the name does not surprise me? It's a perfectly clear
name specifying exactly what and how it does.

Jenda
module Enumerable
alias foldl inject # inventing names in a foreign language huh?
end

Why isn't Hpricot a good approach? Any other suggestions?

Jenda_Krynicky · 7 March 2007 16:05

J. mp wrote:

You generally want to use an HTML parser ... provided that Wuby has one.

I wonder what the first hit from googling "ruby html parser" is? Ah
yes, hpricot. A perfectly valid approach.

Hpricot? How come the name does not surprise me? It's a perfectly clear
name specifying exactly what and how it does.

Jenda
module Enumerable
alias foldl inject # inventing names in a foreign language huh?
end

Why isn't Hpricot a good approach? Any other suggestions?

No, it most likely is a good approach. It's just that the name is a bit
... wuby. Which is to be expected.

Jenda

Albert_Ng · 8 March 2007 11:21

one does not expect less from the creator of chunky bacon...

On 3/7/07, Jenda Krynicky <jenda@cpan.org> wrote:

J. mp wrote:
>
>>>> You generally want to use an HTML parser ... provided that Wuby has one.
>>>>
>>> I wonder what the first hit from googling "ruby html parser" is? Ah
>>> yes, hpricot. A perfectly valid approach.
>>
>> Hpricot? How come the name does not surprise me? It's a perfectly clear
>> name specifying exactly what and how it does.
>>
>> Jenda
>> module Enumerable
>> alias foldl inject # inventing names in a foreign language huh?
>> end
>
> Why isn't Hpricot a good approach? Any other suggestions?

No, it most likely is a good approach. It's just that the name is a bit
... wuby. Which is to be expected.

Jenda

J_mp · 8 March 2007 11:24

Albert Ng wrote:

one does not expect less from the creator of chunky bacon...

Can you tell me the whole story? What is chunky bacon?

Alex_Young · 8 March 2007 11:46

J. mp wrote:

Albert Ng wrote:

one does not expect less from the creator of chunky bacon...

Can you tell me the whole story? What is chunky bacon?

http://poignantguide.net/ruby/

Your brain will never be the same again...

--
Alex
