Ruby global regex question

For the life of me, I can't figure out a Ruby equivalent to Perl's /g.

Basically, I want to do the following:

while htmlSource=~m/<table>(.*?)<\/table>/g do
   tableSource=$1
   tableSource=~m/Index (\d+)/
   indexNumber=$1

   while tableSource=~m/<tr>(.*?)<\/tr>/g do
      tableRowSource=$1
      doSomethingWith(tableRowSource, indexNumber)
   end#while tableSource

end#while htmlSource

I will actually need to pull multiple variables, not just a single one,
from the regex.
I will need to run the outer loop an unknown number of times per
document (0-20) and the inner loop an unknown number of times (0-29).

Thread safety would be a plus.

Any suggestions?

I think this does what you want, although I don't think gsub was really made
for this purpose.

def doSomethingWith(s)
    print s, "\n"
end

htmlSource = '<table><tr>1,1</tr><tr>1,2</tr></table>'
htmlSource << '<table><tr>2,1</tr><tr>2,2</tr></table>'

htmlSource.gsub(/<table>(.*?)<\/table>/) do |t|
    tableSource = $1
    tableSource.gsub(/<tr>(.*?)<\/tr>/) do |r|
        doSomethingWith $1
    end
end

···

On Tue, Nov 18, 2008 at 4:06 PM, knohr <just_a_techie200x@yahoo.com> wrote:

[snip]

--
Alan

While I can't answer your original question, I could possibly help you with the scraping if you are willing to reveal the page you are trying to scrape and the data bits on it which should be scraped.

Cheers,
Peter

···

On 2008.11.19., at 1:06, knohr wrote:

[snip]

___
http://www.rubyrailways.com
http://scrubyt.org

Would fast be a plus? No nested loop?

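require 'nokogiri'
doc = Nokogiri::HTML(htmlSource)
doc.search('//tr').each do |row|
  # XPath contains() takes the string to search first, then the substring
  index = row.xpath('ancestor::table/*[contains(., "Index")]')
  # index is a NodeSet, so take its text before applying the regex
  doSomethingWith(row.text, index.text[/\d+/])
end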

The location of the element containing the index may have to be
modified.

-- Mark.

···

On Nov 18, 7:08 pm, knohr <just_a_techie2...@yahoo.com> wrote:

[snip]

I use this as an equivalent to global match:

class Regexp
  def global_match(str, &proc)
    retval = nil
    loop do
      res = str.sub(self) do |m|
        proc.call($~) # pass MatchData obj
        ''
      end
      break retval if res == str
      str = res
      retval ||= true
    end
  end
end

re = /.../
re.global_match(...) do |m|
    ...
end
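
For comparison, here is a minimal sketch of the same idea that leaves the string untouched and does not reopen Regexp. It relies on Regexp#match accepting a start position. The helper name each_match and the sample text are made up for illustration; they are not from the posts above.

```ruby
# each_match is a hypothetical helper name, not part of Ruby's core API.
# It yields one MatchData object per match, like Perl's /g in a while loop.
def each_match(str, re)
  pos = 0
  while (m = re.match(str, pos))
    yield m
    # Continue after this match; step one extra character after an empty
    # match so the loop cannot stall on the same position.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
end

# Made-up sample text, for illustration only.
text = "Index 3, Index 14, Index 15"
numbers = []
each_match(text, /Index (\d+)/) { |m| numbers << m[1].to_i }
p numbers
```

Because the block receives a MatchData, every capture group is available as m[1], m[2], and so on, which should cover the "multiple vars" requirement. Calling Regexp.last_match inside a String#scan block is another way to get hold of the MatchData.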

···

On Tue, Nov 18, 2008 at 9:06 PM, knohr <just_a_techie200x@yahoo.com> wrote:

[snip]

That is pretty much how to do it, except that globals are hardly thread safe, I think. Use scan instead of gsub.
Here's something I wrote to extract information from data structured like this:

- tablename
     + field1
     + field2:string

- table2name
     +field1 : string
     +field2

Table = Struct.new(:name, :fields)
Field = Struct.new(:name, :type)

  def extract_db_spec(file)
    tables = []
    doc = open(file, File::RDONLY) {|f|f.read}
    table_name = /\- (\w*)\s*?\n/
    field_name = /(\s+\+ (\w+)\s*(\:\s*(\w*))?\n)/
    doc.scan /#{table_name}(#{field_name}+)/ do |tablename, fields|
      t = Table.new tablename, []
      fields.scan field_name do |junk, fieldname, junk2, type|
        if type.nil? || type == ""
          if /\w+_id/ === fieldname
            type = "int"
          else
            type = "string"
          end
        end
      
        t.fields << Field.new(fieldname, type)
      
      end
      tables << t
    end
    tables
  end

einarmagnus

···

On 19.11.2008, at 00:37 , Alan Johnson wrote:

On Tue, Nov 18, 2008 at 4:06 PM, knohr <just_a_techie200x@yahoo.com> wrote:

[snip]

That is pretty much how, except globals are hardly thread safe I think.

$1 and the like are thread safe:

robert@fussel ~
$ ruby -e '2.times{|i|Thread.new(i){|ii|4.times{/(\d+)/=~ii.to_s;puts $1;sleep 1}}};sleep 5'
0
1
0
1
0
1
0

robert@fussel ~
$
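
A slightly more explicit version of the same check, as a rough sketch: each thread runs its own match, sleeps so the other threads can run theirs, and only then reads $1. The thread count and the strings are arbitrary choices for illustration.

```ruby
results = Array.new(4)

threads = results.each_index.map do |i|
  Thread.new(i) do |n|
    "thread #{n}" =~ /(\d+)/
    sleep 0.05        # let the other threads perform their own matches
    results[n] = $1   # read the capture only after the others have matched
  end
end
threads.each(&:join)

p results
```

If the match variables were shared between threads, some slots would be expected to hold another thread's number. Passing captures around as block arguments or MatchData objects avoids the question altogether.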

Use scan instead of gsub:

Right, as far as I can see no replacements should be done. Just read only access.

html_source.scan %r{<table>(.*?)</table>}i do
   table_source = $1
   index_number = table_source[%r{Index\s+(\d+)}, 1].to_i

   table_source.scan %r{<tr>(.*?)</tr>}i do
     do_something_with $1, index_number
   end
end
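
Since more than one capture per row is needed, here is a small self-contained sketch of the same nested scan that takes the captures as block parameters instead of reading $1. The sample markup, the caption holding the index, and the two-cell row layout are assumptions made up for illustration; the real page was never posted.

```ruby
# Hypothetical sample markup; the real page was never posted in the thread.
html_source = <<~HTML
  <table><caption>Index 7</caption><tr><td>a</td><td>1</td></tr><tr><td>b</td><td>2</td></tr></table>
  <table><caption>Index 12</caption><tr><td>c</td><td>3</td></tr></table>
HTML

rows = []

# One capture group: scan yields a one-element array, unpacked by |(table_source)|.
html_source.scan(%r{<table>(.*?)</table>}mi) do |(table_source)|
  # String#[] with a regex and a group number returns that capture (or nil).
  index_number = table_source[/Index\s+(\d+)/, 1].to_i

  # Two capture groups: the block parameters receive them in order.
  table_source.scan(%r{<tr><td>(.*?)</td><td>(.*?)</td></tr>}mi) do |name, value|
    rows << [index_number, name, value.to_i]
  end
end

rows.each { |row| p row }
```

The /m flag lets .*? run across line breaks, which matters once a table spans several lines. Nothing here touches the match globals, so the loop body can be moved into a method or a thread without further changes.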

But a proper HTML parser is probably much better. :-)

Kind regards

  robert

···

On 19.11.2008 07:08, Einar Magnús Boson wrote: