Odd Regexp Issue

Kyle_Heck · 23 May 2007 20:23

I'm writing a web crawler, and in that crawler I want to remove all
scripts in the pages I crawl.

I should be able to do a simple gsub!(/<!--.*-->/,"") right? Well, I do
that and unfortunately it doesn't remove some scripts. Take google for
instance. It removes the first script, but not the second. I'm really
confused. Since google has two scripts, 
so it's not like the full regexp should ever fail to be triggered.

Any insight on the issue would be GREAT?!

Thanks,
Kyle Heck

--
Posted via http://www.ruby-forum.com/.

Luis1 · 23 May 2007 20:32

gsub(/<!--.*?-->/m,"")

If there are new lines inside the string you need to use the m
modifier to make the dot (.) include new lines as well.
And the ? is to make the match non-greedy. Without it it would match
the start of the first script and the end of the last script.
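
A minimal sketch of the two effects described above. The `page` string is made up for illustration and is not google's actual markup:

```ruby
# Hypothetical page with two comment-wrapped blocks, one spanning several lines.
page = "<p>a</p><!-- one --><p>b</p><!--\ntwo\n--><p>c</p>"

# No /m: the dot stops at newlines, so the multi-line comment survives.
no_m = page.gsub(/<!--.*?-->/, "")

# /m but greedy: a single match runs from the first <!-- to the last -->,
# taking the <p>b</p> in between with it.
greedy = page.gsub(/<!--.*-->/m, "")

# /m and non-greedy: each comment is matched and removed on its own.
lazy = page.gsub(/<!--.*?-->/m, "")

puts no_m, greedy, lazy
```

Only the last form removes both comments while keeping the text between them.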

--
Luis Parravicini
http://ktulu.com.ar/blog/

Joel_VanderWerf1 · 23 May 2007 20:35

Try multiline mode: gsub!(/<!--.*-->/m,"")

--
vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Jenda_Krynicky · 25 May 2007 13:35

I'm not sure what you are after, exactly, but apart from the <script>
tags Rob mentioned, you might need to remove the onClick, onMouseOver
and other handlers. And since the handlers can be within almost any tag,
it would be very hard to find and remove them correctly with just a few
regexps. You should use a real HTML parser (the preferred Ruby one seems
to be called hpricot ... I guess the author wanted to be funny). If this
is meant to make the display of the pages secure, you should also
"keep only the tags and attributes that are safe" rather than "remove stuff
that's not safe". You might easily overlook something.
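
A small illustration of the handler problem, assuming a made-up snippet. The `html` string and both patterns are hypothetical, and the handler check is a rough heuristic rather than a real parser:

```ruby
# Hypothetical snippet: one <script> element plus an inline onclick handler.
html = %q{<a href="/" onclick="track()">home</a><script>track()</script>}

# Remove <script>...</script> elements (case-insensitive, dot matches newlines).
stripped = html.gsub(/<script\b[^>]*>.*?<\/script>/mi, "")

# Anything that looks like an event-handler attribute is still in the markup.
leftover = stripped.scan(/\bon\w+\s*=/i)

puts stripped
p leftover
```

The script element is gone, but the onclick attribute is untouched, which is why a parser with a list of allowed tags and attributes is the safer route.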

If you happened to use the-language-that-mustn't-be-named, you'd just use
HTML::TagFilter
(http://search.cpan.org/~wross/HTML-TagFilter-1.03/TagFilter.pm). Good
luck.

Jenda

Joel_VanderWerf1 · 23 May 2007 20:44

Joel VanderWerf wrote:
...

Try multiline mode: gsub!(/<!--.*-->/m,"")

Luis is right, it needs to be non-greedy, as well:

gsub!(/<!--.*?-->/m,"")

--
vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Rob_Biedenharn1 · 23 May 2007 21:23

Of course, you do realize that you're saying "scripts" but you're removing "comments" with this regexp.

I can have:
     <script type="text/javascript">
       //<![CDATA[
       <%= yield :page_scripts %>
       //]]>
     </script>

in a page with not a <!-- in sight!
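
To make that concrete, here is a rough sketch with a made-up page. The `script_re` pattern is an assumption for illustration; it is not from the thread and will not cope with every way a script tag can be written:

```ruby
# Hypothetical page: a script element that contains no HTML comment at all.
page = <<~HTML
  <p>before</p>
  <script type="text/javascript">
    //<![CDATA[
    alert("hi");
    //]]>
  </script>
  <p>after</p>
HTML

comment_re = /<!--.*?-->/m                     # matches HTML comments only
script_re  = /<script\b[^>]*>.*?<\/script>/mi  # matches whole script elements

unchanged      = page.gsub(comment_re, "")
without_script = page.gsub(script_re, "")

puts unchanged == page
puts without_script
```

The comment pattern leaves the page exactly as it was, while a pattern aimed at the script element itself removes the block.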

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com

Kyle_Heck · 23 May 2007 21:10

Well, that seemed to do the trick

Thanks a lot, I didn't know that regexp only applied to one line by
default, HRM!

Thanks,
Kyle Heck

Rob_Biedenharn1 · 23 May 2007 21:31

Actually, it is the . expression that doesn't match a newline without the 'm' option. That option just changes '.' from matching "any character except a newline" to matching "any character". The Regular Expression section of chapter 22 in the Pickaxe covers all this (pp. 324-328).
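
A quick way to see this in irb, using a made-up two-line string:

```ruby
text = "a\nb"

without_m = (text =~ /a.b/)   # the dot refuses to match "\n"
with_m    = (text =~ /a.b/m)  # with /m the dot matches "\n" as well

# ^ and $ already match at line boundaries in Ruby, with or without /m,
# so the regexp as a whole is not limited to the first line.
second_line = (text =~ /^b$/)

p without_m, with_m, second_line
```

The first match fails, the second succeeds at offset 0, and the third finds "b" on the second line even though no 'm' option is given.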

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com
