Using gsub to remove embedded newlines in HTML file

Wes_Gamble · 2 August 2006 22:59

I have an HTML file that is in a string.

I want to use gsub! to recursively remove any embedded newlines and
whitespace within two known delimiters.

Given a string that includes this kind of string:

~^LNK:http://slashdot.org/login.pl?op=newuserform~
Create a new account
^~

I want to replace the above with:

~^LNK:http://slashdot.org/login.pl?op=newuserform~Create a new account^~

(stripping out the newlines and whitespace)

Having trouble writing the regex for this.

I think I want something like:

/~\^LNK:.*?([\s\r\n])+.*?\^~/

that I could use in:

str.gsub!(/~\^LNK:.*?([\s\r\n])+.*?\^~/, '')

to replace all of the whitespace and any newline characters with
empty strings.

But I don't think this will work because I really need to loop _within_
each substring of my large HTML string. The thing about gsub is that it
will substitute the entire matched string.

Do I need to scan out the ~^LNK.*?^~, operate on those and then put them
back into the larger string?

I'm not sure I'm asking this very well, so I apologize if that's the
case.

Thanks,
Wes

--
Posted via http://www.ruby-forum.com/.

Wes_Gamble · 2 August 2006 23:04

Something like:

    @html.scan(/~\^LNK:.*?\^~/mi).each do |link_line|
      new_link_line = link_line.gsub(/[\s\r\n]/, '')
      @html.gsub!(/#{link_line}/mi, new_link_line)
    end


Wes_Gamble · 2 August 2006 23:40

This seems to work well:

    @html.scan(/~\^LNK:.*?\^~/mi).each do |link_line|
      new_link_line = link_line.gsub(/[\t\r\n]/, '')
      @html.gsub!(/#{Regexp.escape(link_line)}/mi, new_link_line) if link_line != new_link_line
    end

I wonder if I could have done this with one @html.gsub!() command, but
this is much more understandable to me anyway so I'll stick with this.
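
For anyone wondering why the Regexp.escape call makes the difference between the two versions, here is a minimal sketch. It assumes a sample string modelled on the first post; the `link` and `html` names are made up for illustration.

```ruby
# Sample data modelled on the first post; `link` and `html` are illustrative names.
link = "~^LNK:http://slashdot.org/login.pl?op=newuserform~\nCreate a new account\n^~"
html = "<p>before</p>\n#{link}\n<p>after</p>"

# Interpolated raw, the "^", "." and "?" in the text act as regex
# metacharacters, so the pattern fails to match the text it came from.
unescaped_hit = !(html =~ /#{link}/mi).nil?

# Regexp.escape makes every character of the text match literally.
escaped_hit = !(html =~ /#{Regexp.escape(link)}/mi).nil?

puts "unescaped matches: #{unescaped_hit}"
puts "escaped matches:   #{escaped_hit}"
```

With this input the unescaped pattern should not match at all (the "^" right after "~" is read as a start-of-line anchor), while the escaped one does.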

Thanks,
Wes


Carlos · 3 August 2006 02:51

You can use a block with gsub:

    @html.gsub!(/~\^LNK:.*?\^~/mi) { |s| s.gsub(/[\t\r\n]/, '') }

or something like that.

Good luck.
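
A quick way to try the block form on the example from the first post. This is only a sketch: the surrounding `<p>` lines and the `html`/`cleaned` names are assumptions added for illustration, and it strips tabs and line breaks only, so the spaces inside the link text survive.

```ruby
# Sample input; only the ~^LNK:...^~ span comes from the original question.
html = "<p>Not a member?</p>\n" \
       "~^LNK:http://slashdot.org/login.pl?op=newuserform~\n" \
       "Create a new account\n" \
       "^~\n" \
       "<p>After</p>\n"

# The outer gsub finds each delimited span (/m lets "." cross newlines);
# the block's return value replaces that span, so the inner gsub only
# ever touches text between the delimiters.
cleaned = html.gsub(/~\^LNK:.*?\^~/m) { |link| link.gsub(/[\t\r\n]/, '') }

puts cleaned
```

Newlines outside the delimiters are left alone because the block only ever sees the matched span.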


Wes_Gamble · 3 August 2006 20:49

Thanks. That is the _Ruby_ way to do it, and that's what I wanted to
know :).

I've used blocks with gsub but I keep forgetting that I can put anything
in there - so far I've only used backrefs to pull out pieces of the
matching regex to rearrange things.
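
To make the contrast concrete, here is a rough sketch of both styles on the already-cleaned link string. The two-group pattern and the output formats are invented for illustration and are not from the thread.

```ruby
link = "~^LNK:http://slashdot.org/login.pl?op=newuserform~Create a new account^~"

# Hypothetical pattern: group 1 captures the URL, group 2 the link text.
pattern = /~\^LNK:(.*?)~(.*?)\^~/m

# Backreferences in a replacement string can only rearrange captured text.
rearranged = link.gsub(pattern, '\2 (\1)')

# A block can run any Ruby code on each match; the captures are
# available as $1 and $2 inside it.
transformed = link.gsub(pattern) { "#{$2.upcase} (#{$1})" }

puts rearranged
puts transformed
```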

Wes

