Sed -> Ruby : .. and

Keith_Fahlgren1 · 28 July 2005 16:18

Hi all,

What's the deal with all the different versions of
Pattern Matching on the
web? I've found that to be the most complete and most helpful document
for this sort of Ruby regular expression tutelage.

It seems like documentation on adapting sed to Ruby would be helpful.
[I'll write something if I ever understand it.]

Surprise: I'm trying to adapt some sed scripts into Ruby programs. I'd
like feedback on how to make them more idiomatic/Rubylicious and help
on making the second one work.

Here's my essential program for a line of sed that seems to work
correctly. I dunno if there's a better way to express it:

---------------------------------------------------

sed:
s/MATCH/GLOBAL REPLACEMENT/g

ruby:
#!/usr/bin/env ruby

files = ARGV

files.each do |arg|
  f = File.open(arg)
  puts "\nOpening file #{f}"
  working_file = f.read

  working_file.gsub!(/MATCH/,'GLOBAL REPLACEMENT')

  puts "\nDoing ACTION in #{arg}"
  f = File.new(arg, "w")
  puts "\nWriting #{f} now"
  f.print(working_file)
  f.close
end

---------------------------------------------------

Here's the one I have problems with (we had a working Perl equivalent
but are trying to abandon Perl).

sed:
/BEGIN RANGE/,/END RANGE/{
s/MATCH/REPLACEMENT/g
}

ruby:
[same beginning]
  if working_file =~ /BEGIN RANGE/ .. working_file =~ /END RANGE/
    # INCLUSIVE use ... if you want non inclusive
    working_file.sub!(/MATCH/,'REPLACEMENT')
  end
[same ending]
-----------------

The above works fine on the first occurrence of MATCH but won't do any of
the subsequent matches in the document. If I substitute 'gsub', it
just affects every place in the document, ignoring my if
statement. I suspected a lack of understanding of the way the range
works and found an old thread on the mailing list
( http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/73674 )
but don't really follow it.

I've come up with the following as a tentative replacement but it's not
setting 'aline' in 'working_file' after it substitutes it.

[same beginning]
  working_file.each do |aline| # Read each line, right?
    if aline =~ /BEGIN RANGE/ .. aline =~ /# end of CellContent/
      aline.sub!(/Cell/,'CellHeading2')
    end
  end
[same ending]
-----------------

[Running up against my own ignorance now of the range thingie]
Here's my failing test:
#!/usr/bin/env ruby
files = ARGV
files.each do |arg|
  f = File.open(arg)
  puts "\nOpening file #{f}"
  working_file = f.read
  puts "Subsitutions occuring now"
  working_file.each do |aline|
    if aline =~ /BEGIN RANGE/ .. aline =~ /END RANGE/
      aline.sub!(/MATCH/,'REPLACEMENT')
      puts "#{aline}"
    end
  end
  puts "\nOutput of working_file"
  puts "#{working_file}"
  puts "\nDoing ACTION in #{arg}"
  f = File.new(arg, "w")
  puts "\nWriting #{f} now"
  f.print(working_file)
  f.close
end

-----------------
Here's the test file:
a lonely line
BEGIN RANGE
  MATCH
END RANGE
another lonely line
BEGIN RANGE
  MATCH
  nothing here
  MATCH
END RANGE
don't change MATCH

-----------------
Here's the output:

,rtest2 ,test.txt

Opening file #<File:0xff3f0>
Subsitutions occuring now
BEGIN RANGE
  REPLACEMENT
END RANGE
BEGIN RANGE
  REPLACEMENT
  nothing here
  REPLACEMENT
END RANGE

Output of working_file
a lonely line
BEGIN RANGE
  MATCH
END RANGE
another lonely line
BEGIN RANGE
  MATCH
  nothing here
  MATCH
END RANGE
don't change MATCH

Doing ACTION in ,test.txt

Writing #<File:0xfeef8> now

Thanks for your help,
Keith

Joel_VanderWerf1 · 28 July 2005 16:25

Keith Fahlgren wrote:

sed:
s/MATCH/GLOBAL REPLACEMENT/g

ruby:

$ cat test
foo
bar
123
$ ruby -p -e 'gsub(/bar/, "BAR")' test
foo
BAR
123

Also, you can use the ARGF constant to access all of the input files (or
stdin if there are none) as a single IO.

ARGF.each do |line|
...
end

W_James · 28 July 2005 20:11

Keith Fahlgren wrote:

Here's the one I have problems with (we had a working Perl equivalent
but are trying to abandon Perl).

sed:
/BEGIN RANGE/,/END RANGE/{
        s/MATCH/REPLACEMENT/g
}

ruby:
[same beginning]
  if working_file =~ /BEGIN RANGE/ .. working_file =~ /END RANGE/
    # INCLUSIVE use ... if you want non inclusive
    working_file.sub!(/MATCH/,'REPLACEMENT')
  end
[same ending]

ruby -pe 'gsub(/e/, "-") if $_ =~ /START/ .. $_ =~ /END/' infile >outfile

mathew · 28 July 2005 22:46

Keith Fahlgren wrote:

ruby:
#!/usr/bin/env ruby

files = ARGV

files.each do |arg|
  f = File.open(arg)
  puts "\nOpening file #{f}"
  working_file = f.read

  working_file.gsub!(/MATCH/,'GLOBAL REPLACEMENT')

  puts "\nDoing ACTION in #{arg}"
  f = File.new(arg, "w")
  puts "\nWriting #{f} now"
  f.print(working_file)
  f.close
end

One problem with this code is that if you break out of it, you might find that it deleted your data.

A better approach is to rename the data file to a temporary file, write the data, then unlink the temporary file.

The best approach is to write the new data to temporary file #1, rename the input file to a different temporary file #2 and then immediately rename #1 to the input filename, and finally delete temporary file #2. That minimizes the window in which the state on disk is not what you want it to be. It ensures that the worst possible case is that you have a temporary file left behind in the data directory, and the input file contains either the data from before processing, or the data after processing.

Unfortunately, implementing the best approach is non-trivial, because you have to worry about rename not working across filesystem boundaries... But you should at least make sure you don't delete the user's data.
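As a rough illustration of the temporary-file idea, here is a minimal sketch. It assumes a POSIX system, where File.rename replaces the target in one step, so it collapses the two-temporary-file dance described above into a single rename. The helper name rewrite_file and the ".tmp.<pid>" suffix are made up for this example. The temporary file is created next to the original so the rename should not cross a filesystem boundary.

```ruby
# Hypothetical helper (not from the thread): pass the old contents of `path`
# to the block and swap the block's result into place via a temporary file.
def rewrite_file(path)
  original = File.read(path)
  updated  = yield(original)

  # Assumption: writing next to the original keeps the rename on one filesystem.
  tmp_path = "#{path}.tmp.#{Process.pid}"
  File.open(tmp_path, "w") { |tmp| tmp.write(updated) }

  # On POSIX systems this replaces `path` in a single step, so the file holds
  # either the old data or the new data, never a half-written mix.
  File.rename(tmp_path, path)
  updated
ensure
  # If anything failed before the rename, do not leave the temporary file behind.
  File.delete(tmp_path) if tmp_path && File.exist?(tmp_path)
end
```

It could be called as rewrite_file(arg) { |text| text.gsub(/MATCH/, 'GLOBAL REPLACEMENT') } inside the files.each loop.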

Here's the one I have problems with (we had a working Perl equivalent but are trying to abandon Perl).

sed:
/BEGIN RANGE/,/END RANGE/{
s/MATCH/REPLACEMENT/g
}

working_file.gsub!(/(?=BEGIN RANGE)(.*?)(?=END RANGE)/m) {||
$1.gsub(/MATCH/m, 'REPLACEMENT')
}

(?= ) is a zero-width assertion; the regexp engine matches the BEGIN RANGE and END RANGE, but then forgets about them when it comes to removing and replacing, so they're still left there in the final string.

(.*?) is a non-greedy match, which ensures that we get the shortest possible match between a BEGIN RANGE and an END RANGE; otherwise, if you had

    BEGIN RANGE
     MATCH
    END RANGE
     MATCH
    BEGIN RANGE
     MATCH
    END RANGE

all three MATCHes would be replaced.

The block just does a normal global search and replace on the character sequence.

Don't forget to comment what those three lines do, for the benefit of the person who has to maintain the code...
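To make the greedy versus non-greedy difference concrete, here is a small sketch run against the sample above. The method name replace_in_ranges and the constant RANGE_SAMPLE are invented for this illustration. Only the quantifier changes between the two variants.

```ruby
RANGE_SAMPLE = <<~TEXT
  BEGIN RANGE
   MATCH
  END RANGE
   MATCH
  BEGIN RANGE
   MATCH
  END RANGE
TEXT

# Hypothetical helper: run the substitution inside each BEGIN/END span.
# greedy: false uses (.*?), the shortest span; greedy: true uses (.*).
def replace_in_ranges(text, greedy: false)
  pattern = if greedy
              /(?=BEGIN RANGE)(.*)(?=END RANGE)/m
            else
              /(?=BEGIN RANGE)(.*?)(?=END RANGE)/m
            end
  text.gsub(pattern) { $1.gsub(/MATCH/, "REPLACEMENT") }
end

puts replace_in_ranges(RANGE_SAMPLE)                # MATCH between the two ranges survives
puts replace_in_ranges(RANGE_SAMPLE, greedy: true)  # every MATCH before the last END RANGE is replaced
```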

mathew

--
<URL:http://www.pobox.com/~meta/>
WE HAVE TACOS

W_James · 29 July 2005 01:01

Keith Fahlgren wrote:

sed:
s/MATCH/GLOBAL REPLACEMENT/g

ruby:
#!/usr/bin/env ruby

files = ARGV

files.each do |arg|
  f = File.open(arg)
  puts "\nOpening file #{f}"
  working_file = f.read

  working_file.gsub!(/MATCH/,'GLOBAL REPLACEMENT')

  puts "\nDoing ACTION in #{arg}"
  f = File.new(arg, "w")
  puts "\nWriting #{f} now"
  f.print(working_file)
  f.close
end

Let Ruby change the file "in place". A backup file with ".bak"
appended to the filename will be created.

#! ruby -i.bak -pl
gsub(/MATCH/,"REPLACEMENT")

Keith_Fahlgren1 · 29 July 2005 13:34

Hi all,

Thanks for the helpful responses. Many corrected me, rightly, for
writing a whole program rather than just using a one-liner and for
modifying the file in place rather than setting up temporary files. I
was concentrating on my task, making a template to replace thousands of
lines of sed files rather than doing a single replacement.

My larger problem still revolved around doing a bunch of matches inside a
range.

working_file.gsub!(/(?=BEGIN RANGE)(.*?)(?=END RANGE)/m) {||
$1.gsub(/MATCH/m, 'REPLACEMENT')
}

Matthew pointed out an easier way to get around it but I had rejected
that method from the start because I wanted to be able to nest ranges
(like I can in sed).

I think the problem with my initial attempts was assuming .read was
returning an array rather than a string. Slapping .to_a onto the end
solved that bit and the program now works as expected, I think.

Here's my solution (though I'd still love comments):

#!/usr/bin/env ruby
files = ARGV

files.each do |arg|
  puts "\nOpening file #{arg}"
  f = File.open(arg)
  working_file = f.read.to_a
  f.close
  working_file.each do |aline|
    if aline =~ /BEGIN RANGE/ .. aline =~ /END RANGE/
      # nest if lines like the one above as many times as needed for nested ranges
      aline.sub!(/MATCH/,'REPLACEMENT')
    end
  end

  puts "\nDoing ACTION in #{arg}"
  fnew = File.new("#{arg}.tmp", "w")
  puts "\nWriting #{fnew} now"
  fnew.print(working_file)
  fnew.close
end

Thanks,
Keith
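For anyone trying this on Ruby 1.9 or later: String#to_a no longer exists there, so the lines have to come from each_line (or File.readlines). Below is a minimal sketch of the same line-by-line idea under that assumption. The method name substitute_in_range is made up here. It uses gsub rather than sub! so that it mirrors the /g in the sed script, and it builds a new string instead of mutating the lines in place.

```ruby
# Hypothetical helper: apply the substitution only to lines from a
# BEGIN RANGE line through the next END RANGE line (both included), like
#   /BEGIN RANGE/,/END RANGE/{ s/MATCH/REPLACEMENT/g }
def substitute_in_range(text)
  text.each_line.map do |line|
    # Two conditions joined by .. in an if form a flip-flop: it becomes true
    # on a line matching the first pattern and stays true up to and including
    # the line that matches the second pattern.
    if line =~ /BEGIN RANGE/ .. line =~ /END RANGE/
      line.gsub(/MATCH/, "REPLACEMENT")
    else
      line
    end
  end.join
end
```

The result of substitute_in_range(File.read(arg)) could then be written to "#{arg}.tmp" as in the program above.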

Keith_Fahlgren1 · 28 July 2005 16:35

Ah, I should have mentioned that I'm trying to replace 4000+ lines of
sed rather than just one. So, the one-liner approach is probably not
the best in this case.

Thanks for the reminder on ARGF.each,
Keith

On Thursday 28 July 2005 11:25 am, Joel VanderWerf wrote:

$ ruby -p -e 'gsub(/bar/, "BAR")' test

Brian_Schroder1 · 29 July 2005 14:39

My approach would be something along these lines:

bschroed@black:~/svn/projekte/ruby-things$ cat test
test
Range1
test1
Range2
test2
EndRange2
test3
EndRange1
bschroed@black:~/svn/projekte/ruby-things$ cat ranges.rb
ARGV.each do | filename |
  result = []
  File.read(filename).split("\n").each do | line |
    if line =~ /^Range1/ .. line =~ /^EndRange1/
      if line =~ /^Range2/ .. line =~ /^EndRange2/
        result << "    - " + line
      else
        result << "  - " + line
      end
    else
      result << "- " + line
    end
  end

  puts filename
  puts "-" * filename.length
  puts result
  puts "-" * filename.length
  puts
end

bschroed@black:~/svn/projekte/ruby-things$ ruby ranges.rb test
test
----
- test
  - Range1
  - test1
    - Range2
    - test2
    - EndRange2
  - test3
  - EndRange1
----

regards,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

mathew · 29 July 2005 20:06

Keith Fahlgren wrote:

working_file.gsub!(/(?=BEGIN RANGE)(.*?)(?=END RANGE)/m) {||
$1.gsub(/MATCH/m, 'REPLACEMENT')
}

Matthew pointed out an easier way to get around it but I had rejected that method from the start because I wanted to be able to nest ranges (like I can in sed).

Oh, well, in that case, let's make my 3 lines into an iterator:

class String
   # Execute a code block on the substring starting with startstring and
   # ending with endstring, and return the result
   def in_range(startstring, endstring)
     return gsub(/(?=#{startstring})(.*?)(?=#{endstring})/m) {||
       yield($1)
     }
   end
end

Example usage:

puts data.in_range('BEGIN RANGE', 'END RANGE') {|subdata|
   subdata.in_range('NESTED RANGE', 'END OF NESTED RANGE') {|x|
     x.gsub('MATCH', 'REPLACEMENT')
   }
}
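For experimenting without reopening String, the same idea can be sketched as a plain method. This is only an illustration under a few assumptions: the marker names BEGIN INNER and END INNER and the constant NESTED_SAMPLE are invented, and the markers are passed through Regexp.escape so they are treated as literal text (the version above interpolates them unescaped).

```ruby
# Hypothetical free-function variant of in_range. Yields each span that
# starts at start_marker and stops just before end_marker, and substitutes
# the block's return value for that span.
def in_range(text, start_marker, end_marker)
  pattern = /(?=#{Regexp.escape(start_marker)})(.*?)(?=#{Regexp.escape(end_marker)})/m
  text.gsub(pattern) { yield Regexp.last_match(1) }
end

NESTED_SAMPLE = <<~TEXT
  MATCH outside
  BEGIN RANGE
  MATCH in outer only
  BEGIN INNER
  MATCH in inner
  END INNER
  END RANGE
TEXT

# Only the MATCH inside the inner range, itself inside the outer range, changes.
puts(in_range(NESTED_SAMPLE, "BEGIN RANGE", "END RANGE") do |outer|
  in_range(outer, "BEGIN INNER", "END INNER") do |inner|
    inner.gsub("MATCH", "REPLACEMENT")
  end
end)
```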

mathew

--
<URL:http://www.pobox.com/~meta/>
WE HAVE TACOS
