Ruby global regex question

knohr · 19 November 2008 00:06

For the life of me, I can't figure out a Ruby equivalent to Perl's /g modifier.

Basically, I want to do the following:

while htmlSource=~m/<table>(.*?)<\/table>/g do
   tableSource=$1
   tableSource=~m/Index (\d+)/
   indexNumber=$1

   while tableSource=~m/<tr>(.*?)<\/tr>/g do
      tableRowSource=$1
      doSomethingWith(tableRowSource, indexNumber)
   end#while tableSource

end#while htmlSource

I will actually need to pull multiple variables, not just a single one,
from each regex match.
The outer loop will run an unknown number of times per document (0-20),
and the inner loop an unknown number of times (0-29).

Thread safety would be a plus.

Any suggestions?

Alan_Johnson · 19 November 2008 00:37

I think this does what you want, although I don't think gsub was really made
for this purpose.

def doSomethingWith(s)
  print s, "\n"
end

htmlSource = '<table><tr>1,1</tr><tr>1,2</tr></table>'
htmlSource << '<table><tr>2,1</tr><tr>1,2</tr></table>'

# gsub is used only to iterate over the matches; its return value is discarded.
htmlSource.gsub(/<table>(.*?)<\/table>/) do |t|
  tableSource = $1
  tableSource.gsub(/<tr>(.*?)<\/tr>/) do |r|
    doSomethingWith $1
  end
end

--
Alan

Peter_Szinek3 · 19 November 2008 02:00

While I can't answer your original question, I could possibly help you with the scraping if you are willing to reveal the page you are trying to scrape and the data bits on it which should be scraped.

Cheers,
Peter

___
http://www.rubyrailways.com
http://scrubyt.org

Mark_Thomas · 19 November 2008 03:16

Would fast be a plus? No nested loop?

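require 'nokogiri'
doc = Nokogiri::HTML(htmlSource)
doc.search('//tr').each do |row|
  # Elements directly under the enclosing table whose text contains "Index"
  index = row.xpath('ancestor::table/*[contains(., "Index")]')
  # Pull the first run of digits out of that text as the index number
  doSomethingWith(row.text, index.text[/\d+/])
end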

The location of the element containing the index may have to be
modified.

-- Mark.

Gustavo_Carvalho · 19 November 2008 12:53

I use this as an equivalent to global match:

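class Regexp
  # Calls the block with the MatchData of each match, one match at a time.
  # Each match is deleted from a working copy before searching again, so
  # offsets in the MatchData refer to the shrinking copy, not the original.
  # Returns true if at least one match was removed, nil otherwise.
  def global_match(str, &block)
    retval = nil
    loop do
      res = str.sub(self) do |m|
        block.call($~) # pass the MatchData object
        ''
      end
      break retval if res == str
      str = res
      retval ||= true
    end
  end
end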

re = /.../
re.global_match(...) do |m|
...
end
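
A non-destructive variant is also possible. The following is only a sketch, and each_match is a hypothetical helper name, not part of Ruby's core API. It walks the string with Regexp#match(str, pos) and yields each MatchData, so several captures can be read per match without modifying the string and without touching $1:

```ruby
# Hypothetical helper (not part of Ruby's core API): yields a MatchData
# for every non-overlapping match of +re+ in +str+, left to right.
def each_match(re, str)
  pos = 0
  while pos <= str.length && (m = re.match(str, pos))
    yield m
    # Continue after the match; step one extra character after an
    # empty match so the loop cannot stall.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
end

pairs = []
each_match(/(\w+)=(\d+)/, "a=1, b=22, c=333") do |m|
  pairs << [m[1], m[2].to_i]
end
p pairs  # prints [["a", 1], ["b", 22], ["c", 333]]
```

Because the MatchData is passed explicitly, nested calls with different patterns do not interfere with each other.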

Einar_Boson · 19 November 2008 06:08

That is pretty much how you do it, except that globals are hardly thread safe, I think. Use scan instead of gsub.
Here's something I wrote to extract information from data structured like this:

- tablename
+ field1
+ field2:string

- table2name
+field1 : string
+field2

Table = Struct.new(:name, :fields)
Field = Struct.new(:name, :type)

  def extract_db_spec(file)
    tables = []
    doc = File.read(file)
    table_name = /\- (\w*)\s*?\n/
    # Whitespace around "+" is optional, so "+ field1" and "+field1" both match.
    field_name = /(\s*\+\s*(\w+)\s*(\:\s*(\w*))?\n)/
    doc.scan(/#{table_name}(#{field_name}+)/) do |tablename, fields|
      t = Table.new(tablename, [])
      fields.scan(field_name) do |junk, fieldname, junk2, type|
        # Untyped fields default to "int" for *_id names, "string" otherwise.
        if type.nil? || type == ""
          if /\w+_id/ === fieldname
            type = "int"
          else
            type = "string"
          end
        end

        t.fields << Field.new(fieldname, type)

      end
      tables << t
    end
    tables
  end

einarmagnus

Robert_K1 · 19 November 2008 07:46

Einar wrote: "That is pretty much how, except globals are hardly thread safe I think."

$1 and the like are thread safe, because they are local to the current thread:

robert@fussel ~
$ ruby -e '2.times{|i|Thread.new(i){|ii|4.times{/(\d+)/=~ii.to_s;puts $1;sleep 1}}};sleep 5'
0
1
0
1
0
1
0

robert@fussel ~
$
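
The same point can be checked in a slightly more explicit form. This is just a sketch with made-up input strings: each thread runs its own match, pauses so the other threads can run theirs, and only then reads $1 and $2.

```ruby
# Each thread matches a different string, waits, and then reads $1/$2.
# If the match variables were shared between threads, the captures
# would get mixed up.
threads = %w[alpha-1 beta-2 gamma-3].map do |input|
  Thread.new do
    input =~ /(\w+)-(\d)/
    sleep 0.05 # give the other threads time to run their own matches
    [input, $1, $2]
  end
end

# Thread#value joins the thread and returns the block's result.
results = threads.map(&:value)
results.each { |row| p row }
```

Every row pairs an input with its own captures, for example "beta-2" with "beta" and "2".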

On "Use scan instead of gsub":

Right, as far as I can see no replacements need to be done, just read-only access.

html_source.scan %r{<table>(.*?)</table>}i do
  table_source = $1
  index_number = table_source[%r{Index\s+(\d+)}, 1].to_i

  table_source.scan %r{<tr>(.*?)</tr>}i do
    do_something_with $1, index_number
  end
end

But a proper HTML parser is probably much better.
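
For the "multiple vars" part of the question, scan passes the capture groups to the block as arguments, so $1 is not needed at all. A minimal self-contained sketch, assuming invented sample markup in which each table starts with "Index N" and each row holds two comma-separated words:

```ruby
# Sample input invented for illustration; real pages will differ.
html_source = <<~HTML
  <table>Index 7<tr>a,b</tr><tr>c,d</tr></table>
  <table>Index 12<tr>e,f</tr></table>
HTML

rows = []
# When the pattern has groups, scan yields them as block arguments.
# The m flag lets . match newlines inside a table.
html_source.scan(%r{<table>\s*Index\s+(\d+)(.*?)</table>}mi) do |index, body|
  body.scan(%r{<tr>(\w+),(\w+)</tr>}i) do |left, right|
    rows << [index.to_i, left, right]
  end
end
p rows  # prints [[7, "a", "b"], [7, "c", "d"], [12, "e", "f"]]
```

Note that a pattern with a single group yields a one-element array, so the block parameter then has to be written as |(value)| to unpack it.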

Kind regards

robert
