Regular Expression interesting problem

Hi,
I'm learning about regular expressions right now for an HTML-scraping
assignment, and I've run into a problem. Below are two different HTML
tags.

<link
href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/rss/rss.xml"
rel="alternate" type="application/rss+xml" title="BBC NEWS | Help | RSS"
/>

<link rel="alternate" type="application/rss+xml" title="YouTube - Top
Favorites Today"
href="http://gdata.youtube.com/feeds/base/standardfeeds/top_favorites?client=ytapi-youtube-index&time=today&v=2">

What I want is to capture the href URL if the type is
"application/rss+xml". It seems simple, but the position of the 'type'
attribute creates the problem: in the first tag 'type' comes after
href, and in the second it comes before. It seems like an interesting
problem to me, but I need help solving it. Please help me.

Regards
Arun Kumar

···

--
Posted via http://www.ruby-forum.com/.

I suggest you use Nokogiri.

Barring that, don't use regular expressions, use something more appropriate like StringScanner from strscan.rb. `ri StringScanner` will get you started.

PS: You'll probably want to do something like scan for <, then scan for a tag name, then scan for attributes, then scan for >, etc.

PPS: There's no need to post twice.
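
The scanning steps in the PS above could look roughly like this with StringScanner. It is only a sketch: `rss_links` is a hypothetical helper name, the example.com URLs are made up, and it assumes every attribute has a quoted value that does not contain its own delimiting quote character.

```ruby
require 'strscan'

# Hypothetical helper (not from the thread): walks the document with
# StringScanner and returns the href of each <link> tag whose type is
# application/rss+xml, whatever order the attributes appear in.
def rss_links(html)
  scanner = StringScanner.new(html)
  hrefs = []
  # skip_until returns nil once no further "<link" remains
  while scanner.skip_until(/<link\b/i)
    attrs = {}
    # read name="value" pairs one at a time until none is left;
    # /m lets a value span several lines
    while scanner.scan(/\s+([-\w:]+)\s*=\s*(["'])(.*?)\2/m)
      attrs[scanner[1].downcase] = scanner[3]
    end
    # whatever follows the attributes should be the end of the tag
    scanner.scan(%r{\s*/?>})
    if attrs['type'] == 'application/rss+xml' && attrs['href']
      hrefs << attrs['href']
    end
  end
  hrefs
end

# Made-up sample data, for illustration only
sample = <<~'HTML'
  <link href="http://example.com/feed.xml" rel="alternate" type="application/rss+xml" />
  <link rel="stylesheet" type="text/css" href="http://example.com/site.css">
HTML
puts rss_links(sample)
```

Because the attributes are collected into a Hash before anything is tested, it makes no difference whether type comes before or after href.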

···

On Mar 28, 2009, at 02:07, Arun Kumar wrote:

> Hi,
> I'm learning about regular expressions right now for an HTML-scraping
> assignment, and I've run into a problem. Below are two different HTML
> tags.
>
> <link
> href="http://newsrss.bbc.co.uk/rss/newsonline_world_edition/help/rss/rss.xml"
> rel="alternate" type="application/rss+xml" title="BBC NEWS | Help | RSS"
> />
>
> <link rel="alternate" type="application/rss+xml" title="YouTube - Top
> Favorites Today"
> href="http://gdata.youtube.com/feeds/base/standardfeeds/top_favorites?client=ytapi-youtube-index&time=today&v=2">
>
> What I want is to capture the href URL if the type is
> "application/rss+xml". It seems simple, but the position of the 'type'
> attribute creates the problem: in the first tag 'type' comes after
> href, and in the second it comes before. It seems like an interesting
> problem to me, but I need help solving it. Please help me.

In your last post you were telling us about a strict 'boss' who
wouldn't let you use REXML or any XML parsing libraries. I take it
this 'boss' is your teacher.

This isn't an interesting problem. Do your own homework and don't lie
to try to get others to do it for you.

···

On Sat, Mar 28, 2009 at 9:07 AM, Arun Kumar <arunkumar@innovaturelabs.com> wrote:

Eric Hodel wrote:

> I suggest you use Nokogiri.
>
> Barring that, don't use regular expressions, use something more
> appropriate like StringScanner from strscan.rb. `ri StringScanner`
> will get you started.

Nokogiri is a good option. But I want to use net/http for my assignment,
and it is compulsory.

> PS: You'll probably want to do something like scan for <, then scan
> for a tag name, then scan for attributes, then scan for >, etc.

As you said, I have to check for the tag (i.e. '<link') first and then
check the attributes. But the position of the type attribute is still
the problem.

Thanks for your quick reply.

Regards
Arun Kumar

···

I'd probably rather scan for each <link> tag and then analyze it, i.e.

doc.scan %r{<link[^>]*>}i do |link|
   if %r{(?i:type)=["']application/rss\+xml["']} =~ link
     ...
   end
end

Note that the scanning RX is weak.
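
To complete the idea above: once a <link> tag has been isolated, type and href can each be matched with their own pattern, so their order inside the tag no longer matters. A minimal sketch, where `rss_hrefs` is a hypothetical name and attribute values are assumed to be quoted. The tag-scanning regex is just as weak as the one above.

```ruby
# Hypothetical helper, not from the thread: returns the href of every
# <link> tag whose type attribute is application/rss+xml.
def rss_hrefs(doc)
  hrefs = []
  # Stage 1: isolate each <link ...> tag ([^>] also matches newlines).
  doc.scan(%r{<link\b[^>]*>}i) do |link|
    # Stage 2: test type and extract href independently of each other.
    next unless link =~ %r{\btype\s*=\s*["']application/rss\+xml["']}i
    href = link[/\bhref\s*=\s*["']([^"']*)["']/i, 1]
    hrefs << href if href
  end
  hrefs
end
```

Calling `rss_hrefs(doc)` on a page's HTML string returns an Array of URLs, empty if no matching tag is found.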

But I agree, rather use the proper tool for the job.

Cheers

  robert

···

On 28.03.2009 11:00, Eric Hodel wrote:

> PS: You'll probably want to do something like scan for <, then scan for a tag name, then scan for attributes, then scan for >, etc.

Sean O'halpin wrote:

> In your last post you were telling us about a strict 'boss' who
> wouldn't let you use REXML or any XML parsing libraries. I take it
> this 'boss' is your teacher.
>
> This isn't an interesting problem. Do your own homework and don't lie
> to try to get others to do it for you.

Hi,
You have completely misunderstood me. I'm working as a software engineer
trainee right now. The first problem I had has been solved; this is a new
assignment. To be frank, I have just 2 weeks of experience in Ruby, and
nobody here has any knowledge of Ruby. That is why I'm asking this
community for a favour. I'm sorry if I'm troubling you guys so much.

Regards
Arun Kumar . C. M.

···

Sean O'halpin wrote:

> In your last post you were telling us about a strict 'boss' who
> wouldn't let you use REXML or any XML parsing libraries. I take it
> this 'boss' is your teacher.

That was my first thought when I read the other post.

Anyway, split() rules the world--not regexes.
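
For comparison, a split-only approach might look roughly like the sketch below. It rests on strong assumptions that real pages may not satisfy: the tag is written exactly as lowercase `<link`, every attribute value is wrapped in double quotes, and no value contains `"` or `>`. Both method names are hypothetical.

```ruby
# Hypothetical helper (not from the thread): parses the attributes of one
# tag using only String#split.
def attributes_by_split(tag)
  attrs = {}
  # Splitting on '"' alternates between ' name=' chunks and values.
  tag.split('"').each_slice(2) do |name_part, value|
    next if value.nil? # leftover chunk such as a trailing '/'
    name = name_part.split.last.to_s.chomp('=')
    attrs[name.downcase] = value
  end
  attrs
end

# Hypothetical helper: hrefs of all <link> tags typed application/rss+xml.
def rss_hrefs_by_split(html)
  # Text before the first '<link' is dropped; every remaining chunk
  # starts with the attributes of one <link> tag, which end at '>'.
  html.split('<link').drop(1)
      .map { |chunk| attributes_by_split(chunk.split('>').first.to_s) }
      .select { |attrs| attrs['type'] == 'application/rss+xml' }
      .map { |attrs| attrs['href'] }
      .compact
end
```

As with the regex versions, the attributes end up in a Hash first, so the relative position of type and href does not matter.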

Looks like you will need to parse in stages -- I can't get String#scan to
capture everything using a single regex, though there's every chance I've
screwed up the expression somehow:

'<link type="application" href="http://google.com" rel="alternate" />'.scan(
  /<([^\s]+)(?:\s+([^\s]+)="([^"]*)")*\s*\/?>/i
)
#=> [['link', 'rel', 'alternate']]

···

2009/3/28 Arun Kumar <arunkumar@innovaturelabs.com>

Eric Hodel wrote:
> On Mar 28, 2009, at 02:07, Arun Kumar wrote:
>> />
>> interesting problem, but i need help for solving it. Please help me.
> I suggest you use Nokogiri.
>
> Barring that, don't use regular expressions, use something more
> appropriate like StringScanner from strscan.rb. `ri StringScanner`
> will get you started.
>
Nokogiri is a good option. But i want to use net/http for my assignment
and it is compulsory.

> PS: You'll probably want to do something like scan for <, then scan
> for a tag name, then scan for attributes, then scan for >, etc.

As you said I have to check the tag ie'<link' first and then check for
the attributes. But still the position of the type attribute is the
problem.

If I have misrepresented you, then you have my sincerest apologies.
However, you have not really represented your own position terribly
well. Now that we know you are a trainee with little experience who is
currently specifically being trained in regular expressions, it makes
more sense that you cannot use REXML, etc. But this was not clear from
your previous posts.

By the way, you are more likely to get a positive response if you at
least show how far you have got with the problem yourself before
coming to the list.

And to make up for my grouchy mood this morning, here's my contribution:

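require 'pp'

# data is assumed to hold the HTML document as a String
hashes = []
data.scan(/<link[^>]+?>/) do |link|
  # turn the name="value" pairs of each tag into a Hash
  hashes << Hash[*link.scan(/([a-z]+)=["']?([^"]+)["']?/).flatten]
end
pp hashes.select { |hash| hash["type"] == "application/rss+xml" }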

But I have no idea if this will meet the requirements of your
assignment or if you will understand it.

Regards,
Sean

···

On Sat, Mar 28, 2009 at 11:44 AM, Arun Kumar <arunkumar@innovaturelabs.com> wrote:

> Hi,
> You have completely misunderstood me. I'm working as a software engineer
> trainee right now. The first problem I had has been solved; this is a new
> assignment. To be frank, I have just 2 weeks of experience in Ruby, and
> nobody here has any knowledge of Ruby. That is why I'm asking this
> community for a favour. I'm sorry if I'm troubling you guys so much.