Regex extraction

Scott_Rubin · 15 December 2004 14:47

Hello,

I'm writing an application that parses log files, specifically gaim html log files, extracts any links it finds and creates an RSS feed of those links. I have a working program that's about 60 lines of ruby, but it is far from perfect. Most of the necessary fixes and improvements are things I know how to do, but just take time. But there are a couple things I need help with.

First, in ruby, how do I extract parts of a regex? Let's use the example from my program. Normally I could use an expression like the following

href\s*=\s*(?:\"(?<url>[^\"]*)\")

And this would allow me to get the <url> out of the expression. But this doesn't seem to work in ruby, or at least I don't know how to make it work in ruby. What I would really like to do is match the entire <a href tag structure. I would want to extract the protocol (ftp, http), the url (www.website.com), and the text which appears between the <a> and the </a> into three string variables. And I have to extract this entire structure from any random line of text, in which the structure may or may not exist. I'm guaranteed that it won't be partial, i.e. an <a> without a </a>.

The other thing I don't know how to do is replace things like &amp; with &. Is there anything in the ruby standard library, maybe in rexml, that automatically takes care of all those standard entities for me? I looked, but I couldn't find one.

Thanks a lot,

Scott Rubin

Robert · 15 December 2004 15:02

"Scott Rubin" <slr2777@cs.rit.edu> wrote in message
news:41c04ca2$1@buckaroo.cs.rit.edu...

> Hello,
>
> I'm writing an application that parses log files, specifically gaim html log files, extracts any links it finds and creates an RSS feed of those links. I have a working program that's about 60 lines of ruby, but it is far from perfect. Most of the necessary fixes and improvements are things I know how to do, but just take time. But there are a couple things I need help with.
>
> First, in ruby, how do I extract parts of a regex? Let's use the example from my program. Normally I could use an expression like the following
>
> href\s*=\s*(?:\"(?<url>[^\"]*)\")
>
> And this would allow me to get the <url> out of the expression. But this doesn't seem to work in ruby, or at least I don't know how to make it work in ruby. What I would really like to do is match the entire <a href tag structure. I would want to extract the protocol (ftp, http), the url (www.website.com), and the text which appears between the <a> and the </a> into three string variables. And I have to extract this entire structure from any random line of text, in which the structure may or may not exist. I'm guaranteed that it won't be partial, i.e. an <a> without a </a>.

You need grouping. As a first shot:

if %r{<a\s+href="(\w+)://([^"]+)"[^>]*>([^<]*)</a>}i =~ text
  # use link_text so the input string in text is not overwritten
  proto, url, link_text = $1, $2, $3
end
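
Since a log line may contain several links, or none, the same pattern can be handed to String#scan instead of =~. The following is only a sketch under that assumption; the names LINK_RE and extract_links are made up for illustration and are not from the original program.

```ruby
# Hypothetical helper: collect every link on a line.
# LINK_RE is the pattern from the reply above, stored in a constant.
LINK_RE = %r{<a\s+href="(\w+)://([^"]+)"[^>]*>([^<]*)</a>}i

def extract_links(line)
  # String#scan yields one [proto, url, link_text] array per match
  # and returns an empty array when the line contains no link.
  line.scan(LINK_RE).map do |proto, url, link_text|
    { proto: proto, url: url, text: link_text }
  end
end

line = 'see <a href="http://www.website.com">my site</a> ' \
       'and <a href="ftp://ftp.example.org/pub">files</a>'
extract_links(line).each do |link|
  puts "#{link[:proto]} | #{link[:url]} | #{link[:text]}"
end
```

Like the original pattern, this only handles double-quoted href values that come first in the tag, and link text without nested markup.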

> The other thing I don't know how to do is replace things like &amp; with &. Is there anything in the ruby standard library, maybe in rexml, that automatically takes care of all those standard entities for me? I looked, but I couldn't find one.

Dunno. But you can easily create that on your own:

ENT = {
  "amp" => "&",
  "gt" => ">",
  # ...
}

text.gsub!(%r{&(\w+);}i) {|m| ENT[$1] || m}

Kind regards

robert

Awushu · 16 December 2004 00:57

Scott Rubin wrote:

> First, in ruby, how do I extract parts of a regex?

> The other thing I don't know how to do is replace things like &amp; with &.

require 'cgi'
CGI.unescapeHTML("&amp;") # => "&"

For extracting parts of the match, try
a, b, c = /(.)(.)(.)/.match("abc").captures
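
The (?<url>...) syntax from the original question is a named group. Ruby has supported named groups since 1.9, so on a newer interpreter something like the following should work. This is a sketch only; HREF_RE and href_url are illustrative names, and the pattern assumes double-quoted attribute values.

```ruby
# Hypothetical example of a named capture group (Ruby 1.9 or later).
HREF_RE = /href\s*=\s*"(?<url>[^"]*)"/i

def href_url(line)
  m = HREF_RE.match(line) # Regexp#match returns nil when nothing matches
  m && m[:url]            # named groups are read from MatchData by name
end

puts href_url('<a href="http://www.website.com">site</a>')
p href_url("plain text without a link")
```

Checking for nil matters here: calling captures directly on the result of match raises NoMethodError for lines that do not match.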

-awu

Craig_Moran · 15 December 2004 15:34

> The other thing I don't know how to do is replace things like &amp; with &. Is there anything in the ruby standard library, maybe in rexml, that automatically takes care of all those standard entities for me? I looked, but I couldn't find one.

On replacing &amp;: this code does what you want and also replaces a few
other entities, i.e. items that begin with & (ampersand) and end with
; (semicolon). Anything it does not recognise is removed.

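# Match only entity-shaped tokens, so text between a bare "&" and a later ";"
# is left alone.
text.gsub!(/&#?\w+;/) { |entity|
  case entity
  when "&amp;"
    "&"
  when "&nbsp;", "&copy;", "&reg;"
    ""
  when "&#145;"   # assumed: the entity name was lost in the archived page
    "`"
  when "&#146;"   # assumed, as above
    "'"
  when "&middot;", "&#183;"   # second spelling assumed
    "-"
  when "&mdash;"
    "--"
  else
    ""   # any other entity is dropped
  end # case
}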
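
If dropping unknown entities is not wanted, a table lookup that leaves unrecognised names untouched is one alternative. The sketch below makes that assumption and also handles decimal numeric references; ENTITIES and decode_entities are hypothetical names, and the table is deliberately incomplete.

```ruby
# Hypothetical table-driven decoder; extend ENTITIES as needed.
ENTITIES = {
  "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"', "nbsp" => " "
}.freeze

def decode_entities(text)
  # Non-destructive gsub: the block receives the whole matched entity.
  text.gsub(/&(#\d+|\w+);/) do |whole|
    name = $1
    if name.start_with?("#")
      [name[1..-1].to_i].pack("U")   # decimal reference, e.g. &#38;
    else
      ENTITIES.fetch(name, whole)    # unknown names are left as they are
    end
  end
end

puts decode_entities("Tom &amp; Jerry &#38; friends &bogus;")
```

For the five predefined entities and numeric references, CGI.unescapeHTML from the standard library, mentioned earlier in the thread, already covers the common cases.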
