Need help with a regexp

rpheath · 8 December 2006 03:15

I'm trying to write a regular expression to replace a <pre>...</pre>
block or a <blockquote><p>...</p></blockquote> block with a blank ('').
I can only get the <pre>...</pre> to work correctly. Here's what I
have:

text.gsub(/^<pre>[^<]*<\/pre>$|^<blockquote><p>(.*?)<\/p><\/blockquote>$/,'')

Can someone help me figure out why the blockquote is still showing
up??? Thanks in advance.

Daniel_Finnie1 · 8 December 2006 03:46

Why are you doing a gsub but then anchoring the Regexp to the start & ends? Use a normal sub or take out all the ^s and $s (except for the character class definitions, i.e., the ones in square brackets).
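For reference, a minimal sketch of how the anchors behave in Ruby, where ^ and $ match at the start and end of each line (\A and \z anchor the whole string). The sample string and the variable names below are made up for illustration. It also suggests one likely reason the blockquote alternative fails on multi-line input: the pattern allows no whitespace between <blockquote> and <p>, and '.' does not cross newlines without the 'm' flag.

```ruby
text = "<p>intro</p>\n<blockquote>\n<p>This is a quote</p>\n</blockquote>\n<p>outro</p>"

# ^ and $ are line anchors in Ruby, so they can match many times under scan/gsub.
line_anchored = text.scan(/^<p>.*<\/p>$/)
p line_anchored    # => ["<p>intro</p>", "<p>This is a quote</p>", "<p>outro</p>"]

# \A and \z anchor the whole string; '.' cannot cross the newlines, so no match.
string_anchored = text.scan(/\A<p>.*<\/p>\z/)
p string_anchored  # => []

# The blockquote alternative from the question: the newline after
# <blockquote> is not allowed for, so it never matches this input.
original = /^<blockquote><p>(.*?)<\/p><\/blockquote>$/
p(original =~ text)  # => nil

# A relaxed variant (an assumption, not the poster's code): optional
# whitespace between the tags, plus 'm' so '.' also matches newlines.
relaxed = /^<blockquote>\s*<p>(.*?)<\/p>\s*<\/blockquote>$/m
p text.gsub(relaxed, '')  # => "<p>intro</p>\n\n<p>outro</p>"
```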

Please post some sample text, not of what you would like to remove but of what you would like to remove it from.

Dan

rpheath · 8 December 2006 03:55

Thanks for the reply. I'm relatively new to regular expressions, and
misinterpreted the ^s and $s. I thought they applied to each specific
check, so it was either the first string "|" (or) the second
string.

Here's sample text that would be passed into it.

-----------------------
<p>This is the first sentence. Now I'll post a code snippet:</p>

<pre>
def strip_blocks(text)
text.gsub([regex],'')
end
</pre>

<p>This is another sentence before the block quote.</p>

<blockquote>
<p>This is a quote</p>
</blockquote>

<p>This is one more sentence</p>
----------------------

What I would like to have left is this:

----------------------
<p>This is the first sentence. Now I'll post a code snippet:</p>

<p>This is another sentence before the block quote.</p>

<p>This is one more sentence</p>
----------------------

Hopefully that helps. Sorry the question is not organized and kind of
basic, but I'm new to this. Thanks again for any help.

Edwin_Fine · 8 December 2006 05:17

Try this. It uses the "non-greedy" quantifier '*?' together with the
'i' (case-insensitive) and 'm' (multiline, so '.' also matches newlines)
flags. Without the non-greedy quantifier, the match would run from the
first opening tag to the last closing tag of the same name, gobbling up
everything in between. This is probably not what you would want.

def remove_tag_block(tag, text)
  text.gsub(/<#{tag}>.*?<\/#{tag}>/im, '')
end

irb(main):054:0> text
=> "<p>This is the first sentence. Now I'll post a code
snippet:</p>\n\n<pre>\ndef strip_blocks(text)\n
text.gsub([regex],'')\nend\n</pre>\n\n<p>This is another sentence before
the block quote.</p>\n\n<blockquote>\n <p>This is a
quote</p>\n</blockquote>\n\n<p>This is one more sentence</p>"

irb(main):055:0> t=remove_tag_block("pre", text)

=> "<p>This is the first sentence. Now I'll post a code
snippet:</p>\n\n\n\n<p>This is another sentence before the block
quote.</p>\n\n<blockquote>\n <p>This is a
quote</p>\n</blockquote>\n\n<p>This is one more sentence</p>"

irb(main):056:0> remove_tag_block("blockquote", t)

=> "<p>This is the first sentence. Now I'll post a code
snippet:</p>\n\n\n\n<p>This is another sentence before the block
quote.</p>\n\n\n\n<p>This is one more sentence</p>"
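The two calls above can also be folded into a single pass. The following is only a sketch under a few assumptions: the method name strip_blocks is borrowed from the sample text, the tags parameter is hypothetical, the tags carry no attributes, and blocks are not nested. It additionally swallows the whitespace after each removed block, so the result has the single blank lines asked for rather than the "\n\n\n\n" runs shown above.

```ruby
# Remove every <tag>...</tag> block for the given tag names in one gsub.
# \1 refers back to whichever tag name the opening tag matched, and the
# trailing \s* drops the blank lines left behind by the removed block.
def strip_blocks(text, tags = %w[pre blockquote])
  names = Regexp.union(tags).source   # e.g. "pre|blockquote"
  text.gsub(/<(#{names})>.*?<\/\1>\s*/im, '')
end

sample = "<p>One.</p>\n\n<pre>\ndef strip_blocks(text)\nend\n</pre>\n\n" \
         "<p>Two.</p>\n\n<blockquote>\n<p>This is a quote</p>\n</blockquote>\n\n" \
         "<p>Three.</p>"

puts strip_blocks(sample)
```

Like remove_tag_block, this still stops at the first matching closing tag, so it has the same limitation with nested tags of the same name.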

The problem is that this won't work with nested tags, e.g.

<table><tr><td><table>stuff</table></td></tr></table>

irb(main):065:0> x="<table><tr><td><table>stuff</table></td></tr></table>"
=> "<table><tr><td><table>stuff</table></td></tr></table>"
irb(main):066:0> remove_tag_block("table", x)
=> "</td></tr></table>"

This is because *regular* regular expressions can't match nested
pairs, such as "((()(())()))". I think I read somewhere that regexps
can't count. You have to use *recursive* regular expressions,
which are found in PCRE (Perl RE), but AFAIK not in the current Ruby
regexp engine. Maybe Oniguruma has it - I dunno. I saw a PCRE extension
for Ruby somewhere, but I don't know anything about it.

The Perl RE for matching nested parentheses is apparently as follows
(from "The Joy of Regular Expressions [1]", SitePoint):

\(((?>[^()]+)|(?R))*\)
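For readers on a later Ruby: the Oniguruma-based engine used from Ruby 1.9 onward supports subexpression calls (\g<name>), which allow a group to call itself. The sketch below assumes such a Ruby. The constant and method names are hypothetical, and the tag version assumes tags without attributes.

```ruby
# A named group that calls itself via \g<paren> to match balanced parentheses.
NESTED_PARENS = /(?<paren>\((?:(?>[^()]+)|\g<paren>)*\))/

p "x ((()(())())) y"[NESTED_PARENS]   # => "((()(())()))"

# The same idea applied to tags: a block is an opening tag, then any mix of
# nested blocks and characters that do not start an opening or closing tag
# of that name, then the closing tag.
def remove_nested_tag_block(tag, text)
  t = Regexp.escape(tag)
  block = /(?<block><#{t}>(?:(?!<\/?#{t}>).|\g<block>)*<\/#{t}>)/im
  text.gsub(block, '')
end

x = "<table><tr><td><table>stuff</table></td></tr></table>"
p remove_nested_tag_block("table", x)   # => ""
```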

I believe that to do this correctly without PCRE, you have to resort to
some text parsing or use a SAX parser or similar. Maybe some Ruby guru
(i.e. not me) will be able to pull out an RE or some easy way to do
this.
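The "text parsing" route can be sketched with StringScanner from the standard library: walk through the opening and closing tags, keep a depth counter, and copy text to the output only while the depth is zero. This is an illustration only, not a real HTML parser. The method name is hypothetical, the tags are assumed to have no attributes, and an unclosed block is simply dropped up to the end of the input.

```ruby
require 'strscan'

# Remove <tag>...</tag> blocks, including nested ones, by counting depth.
def remove_nested_blocks(tag, text)
  tag_re  = /<(\/?)#{Regexp.escape(tag)}>/i   # group 1 is "/" for a closing tag
  scanner = StringScanner.new(text)
  output  = String.new
  depth   = 0

  until scanner.eos?
    chunk = scanner.scan_until(tag_re)
    if chunk.nil?
      # No more tags: keep the remainder only if we are outside any block.
      output << scanner.rest if depth.zero?
      break
    end

    # Text that preceded the tag we just found.
    before = chunk[0, chunk.length - scanner.matched.length]
    output << before if depth.zero?

    if scanner[1] == '/'
      if depth.zero?
        output << scanner.matched   # stray closing tag: leave it in place
      else
        depth -= 1
      end
    else
      depth += 1
    end
  end

  output
end

x = "a<table><tr><td><table>stuff</table></td></tr></table>b"
p remove_nested_blocks("table", x)   # => "ab"
```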

--
Posted via http://www.ruby-forum.com/.

greg · 8 December 2006 05:24

You are missing the 'm' flag, which allows '.' to match newlines.

pre_match = /<pre>.*?<\/pre>/m
block_match = /<blockquote>.*?<p>.*?<\/p>.*?<\/blockquote>/m
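A quick check of what the 'm' flag changes, using a made-up sample string: without it, '.' stops at the first newline, so a block that spans several lines never matches.

```ruby
html = "<p>before</p>\n<pre>\ncode\n</pre>\n<p>after</p>"

without_m = html =~ /<pre>.*?<\/pre>/    # '.' cannot cross "\n"
with_m    = html =~ /<pre>.*?<\/pre>/m   # '.' matches "\n" too

p without_m   # => nil
p with_m      # => 14 (the index where "<pre>" starts)

p html.gsub(/<pre>.*?<\/pre>\n/m, '')   # => "<p>before</p>\n<p>after</p>"
```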

Rob_Biedenharn1 · 8 December 2006 14:39

http://groups.google.com/group/rubyonrails-talk/browse_frm/thread/6c75d5d4df368186/2743494eb303014c#2743494eb303014c

And might I suggest picking ONE mailing list on which to ask your questions (Ruby is actually the better one for this question about regular expressions), and then JUST ASK ONCE.

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com

