Regular Expression interesting problem

Arun_Kumar2 · 28 March 2009 09:07

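Hi,
I'm learning about regular expressions right now for an HTML scraping
assignment, and I've run into a problem. Given below are two
different HTML tags.

<link
href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/rss/rss.xml"
rel="alternate" type="application/rss+xml" title="BBC NEWS | Help | RSS"
/>

<link rel="alternate" type="application/rss+xml" title="YouTube - Top
Favorites Today"
href="http://gdata.youtube.com/feeds/base/standardfeeds/top_favorites?client=ytapi-youtube-index&time=today&v=2">

What I want is to capture the href URL if the type is
"application/rss+xml". It seems simple, but the position of the 'type'
attribute creates the problem: in the first tag 'type' comes after
href, and in the second it comes before. It seems like an
interesting problem to me, but I need help solving it. Please help me.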

Regards
Arun Kumar

···

--
Posted via http://www.ruby-forum.com/.

Eric_Hodel1 · 28 March 2009 10:00

I suggest you use Nokogiri.

Barring that, don't use regular expressions, use something more appropriate like StringScanner from strscan.rb. `ri StringScanner` will get you started.

PS: You'll probably want to do something like scan for <, then scan for a tag name, then scan for attributes, then scan for >, etc.

PPS: There's no need to post twice.
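
The scanning order described in the PS (find `<`, then the tag name, then the attributes, then `>`) could look roughly like the sketch below. It is only an illustration under assumptions: `feed_hrefs` is a made-up helper name, attribute values are assumed to be quoted, and real-world HTML will break it in many ways.

```ruby
require 'strscan'

# Hypothetical helper (not from the thread): walks the document with
# StringScanner and returns the href of every <link> tag whose type
# attribute equals wanted_type. Assumes attribute values are quoted.
def feed_hrefs(html, wanted_type = 'application/rss+xml')
  scanner = StringScanner.new(html)
  hrefs = []

  # Jump to the start of each <link ...> tag in turn.
  while scanner.skip_until(/<link\b/i)
    attrs = {}

    # Read name="value" pairs until something else (usually > or />) appears.
    while scanner.scan(/\s*([a-z][-\w:]*)\s*=\s*(["'])(.*?)\2/im)
      attrs[scanner[1].downcase] = scanner[3]
    end

    scanner.skip(/\s*\/?>/) # consume the end of the tag, if it is there
    hrefs << attrs['href'] if attrs['type'] == wanted_type && attrs['href']
  end

  hrefs
end
```

Because every attribute is collected into a Hash before anything is checked, it no longer matters whether `type` comes before or after `href`.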

···

On Mar 28, 2009, at 02:07, Arun Kumar wrote:

Hi,
I'm learning about regular expressions right now for a html scraping
based assignment. But now I've reached a problem. Given below are two
different html tags.

<link
href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/rss/rss.xml"
rel="alternate" type="application/rss+xml" title="BBC NEWS | Help | RSS"
/>

<link rel="alternate" type="application/rss+xml" title="YouTube - Top
Favorites Today"
href="http://gdata.youtube.com/feeds/base/standardfeeds/top_favorites?client=ytapi-youtube-index&time=today&v=2">

Now what i want is to capture the href-url if the type =
"application/rss+xml". It seems to be simple but it is the position of
the 'type' that creates the problem. In first tag the 'type' is after
href and in the second the 'type' is before it. It seems to me as an
interesting problem, but i need help for solving it. Please help me.

Sean_O_Halpin · 28 March 2009 11:32

In your last post you were telling us about a strict 'boss' who
wouldn't let you use REXML or any XML parsing libraries. I take it
this 'boss' is your teacher.

This isn't an interesting problem. Do your own homework and don't lie
to try to get others to do it for you.


Arun_Kumar2 · 28 March 2009 10:08

Eric Hodel wrote:

> I suggest you use Nokogiri.
>
> Barring that, don't use regular expressions, use something more
> appropriate like StringScanner from strscan.rb. `ri StringScanner`
> will get you started.

Nokogiri is a good option, but I have to use net/http for my assignment;
it is compulsory.

> PS: You'll probably want to do something like scan for <, then scan
> for a tag name, then scan for attributes, then scan for >, etc.

As you said, I have to check for the tag, i.e. '<link', first and then
check the attributes. But the position of the type attribute is still the
problem.
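
For what it's worth, the attribute-order issue can also be sidestepped inside a single regex with a lookahead: the lookahead checks that the wanted `type` occurs somewhere in the tag without consuming anything, and the rest of the pattern then picks up `href` wherever it sits. The sketch below is not from any reply in this thread; `FEED_LINK` and `rss_hrefs` are made-up names, and it assumes quoted attribute values with no `>` inside them.

```ruby
# A sketch, not a robust HTML parser: assumes quoted attribute values
# and no ">" inside them.
FEED_LINK = %r{
  <link\b
  (?=[^>]*\btype\s*=\s*["']application/rss\+xml["'])  # type anywhere in the tag
  [^>]*?\bhref\s*=\s*["']([^"']+)["']                  # capture the href value
}xi

# Hypothetical helper: returns the href of every matching <link> tag.
def rss_hrefs(html)
  html.scan(FEED_LINK).flatten
end
```

Calling `rss_hrefs(page_body)` on a page containing both tags from the original post should return both URLs, regardless of which attribute comes first.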

Thanks for your quick reply.

Regards
Arun Kumar


Robert_K1 · 28 March 2009 10:55

I'd probably rather scan for each <link> tag and then analyze it, i.e.

doc.scan %r{<link[^>]*>}i do |link|
   if %r{(?i:type)=["']application/rss\+xml["']} =~ link
     ...
   end
end

Note that the scanning RX is weak.

But I agree, rather use the proper tool for the job.
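
One possible way to fill in the two-stage idea above: grab each `<link>` tag first, test it for the wanted `type`, and only then pull out the `href`. This is a sketch under assumptions, not Robert's actual code; the sample `doc` string and the `feeds` variable are invented for illustration, and the same weakness of the scanning regex applies.

```ruby
# Hypothetical sample input; in practice doc would be the fetched page body.
doc = <<~HTML
  <link href="http://example.com/a.xml" rel="alternate" type="application/rss+xml" />
  <link rel="stylesheet" type="text/css" href="/style.css">
  <link rel="alternate" type="application/rss+xml" href="http://example.com/b.xml">
HTML

feeds = []
doc.scan(%r{<link[^>]*>}i) do |link|
  # Stage 1: skip tags that are not RSS links, wherever type appears.
  next unless %r{type=["']application/rss\+xml["']}i =~ link

  # Stage 2: extract the href from this one tag only.
  href = link[%r{href=["']([^"']*)["']}i, 1]
  feeds << href if href
end

p feeds # prints ["http://example.com/a.xml", "http://example.com/b.xml"]
```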

Cheers

robert


Arun_Kumar2 · 28 March 2009 11:44

Sean O'halpin wrote:

> In your last post you were telling us about a strict 'boss' who
> wouldn't let you use REXML or any XML parsing libraries. I take it
> this 'boss' is your teacher.
>
> This isn't an interesting problem. Do your own homework and don't lie
> to try to get others to do it for you.

Hi,
You have completely misunderstood me. I'm working as a software engineer
trainee right now. The first problem that I had has been solved; this
is a new assignment. To be frank, I have just 2 weeks of experience
in Ruby and there is nobody here who has any knowledge of Ruby.
That is why I'm asking a favour of this community. I'm sorry if I'm
troubling you guys so much.

Regards
Arun Kumar . C. M.


7stud · 28 March 2009 12:55

Sean O'halpin wrote:

> In your last post you were telling us about a strict 'boss' who
> wouldn't let you use REXML or any XML parsing libraries. I take it
> this 'boss' is your teacher.

That was my first thought when I read the other post.

Anyway, split() rules the world--not regexes.

James_Coglan2 · 28 March 2009 10:22

Looks like you will need to parse in stages -- I can't get String#scan to
capture everything using a single regex, though there's every chance I've
screwed up the expression somehow:

'<link type="application" href="http://google.com" rel="alternate" />'.scan(/<([^\s]+)(?:\s+([^\s]+)="([^"]*)")*\s*\/?>/i)
#=> [['link', 'rel', 'alternate']]
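
The result above is probably not a mistake in the expression: a capture group that sits inside a repeated group is overwritten on every repetition, so only the last attribute survives. The sketch below, which reuses the tag from this post and introduces the illustrative names `last_only`, `name` and `attrs`, shows that behaviour and then the staged alternative of reading the tag name and the attributes in separate passes. It assumes double-quoted attribute values.

```ruby
tag = '<link type="application" href="http://google.com" rel="alternate" />'

# A group inside (...)* is overwritten on each repetition, so only the
# final attribute is left in the captures.
m = tag.match(/<(\S+)(?:\s+(\S+)="([^"]*)")*\s*\/?>/)
last_only = [m[1], m[2], m[3]]
# last_only is ["link", "rel", "alternate"]

# Stage 1: take the tag name. Stage 2: scan the attributes separately.
name  = tag[/<(\S+)/, 1]
attrs = tag.scan(/(\S+)="([^"]*)"/).to_h
# attrs["type"] is "application", attrs["href"] is "http://google.com"
```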


Sean_O_Halpin · 28 March 2009 13:06

If I have misrepresented you, then you have my sincerest apologies.
However, you have not really represented your own position terribly
well. Now that we know you are a trainee with little experience who is
currently specifically being trained in regular expressions, it makes
more sense that you cannot use REXML, etc. But this was not clear from
your previous posts.

By the way, you are more likely to get a positive response if you at
least show how far you have got with the problem yourself before
coming to the list.

And to make up for my grouchy mood this morning, here's my contribution:

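require 'pp'

# `data` is assumed to hold the HTML source as a String.
hashes = []
data.scan(/<link[^>]+?>/) do |link|
  # Turn each tag's name="value" pairs into a Hash of attributes.
  hashes << Hash[*link.scan(/([a-z]+)=["']?([^"]+)["']?/).flatten]
end
pp hashes.select { |hash| hash["type"] == "application/rss+xml" }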

But I have no idea if this will meet the requirements of your
assignment or if you will understand it.

Regards,
Sean

