Trying to use regex

merrittr · 20 June 2007 08:15

Hi, I am trying to strip out the text between the body tags, but when I run it I
get:

rob@rob-laptop:~/ruby$ ./html2.rb
./html2.rb:14: unknown regexp options - bdy
./html2.rb:14: unterminated string meets end of file
./html2.rb:14: parse error, unexpected tSTRING_END, expecting
tSTRING_CONTENT or tREGEXP_END or tSTRING_DBEG or tSTRING_DVAR

#! /usr/bin/ruby

@h = File.open "test.html"
@response = @h.gets

text = @response.scan(/<body[^>]*>(.+?)</body>/)[0]
puts text

Alex_Gutteridge · 20 June 2007 08:41

You need to escape the '/' in your regexp, and unless your HTML file is a single line you may also need to add the multiline option:

text = @response.scan(/<body[^>]*>(.+?)<\/body>/m)[0]
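
For what it's worth, the "unknown regexp options - bdy" message seems to come from the unescaped slash: Ruby ends the regexp at the '/' in </body> and then reads "body" as option letters, of which only 'o' is valid. Also note that @h.gets returns just the first line of the file, so the /m option only helps once the whole file is in one string (File.read would do that). A minimal sketch, with a made-up HTML string standing in for test.html and a hypothetical variable name body_text:

```ruby
# Hypothetical sample standing in for the contents of test.html.
# In the real script, File.read("test.html") would return the whole file
# as one string, whereas @h.gets returns only its first line.
html = "<html>\n<body class=\"main\">\n<p>Hello</p>\n</body>\n</html>\n"

# The escaped \/ keeps the slash inside the pattern instead of ending it;
# the m option lets . match newlines as well.
matches = html.scan(/<body[^>]*>(.+?)<\/body>/m)

# scan returns one array of captures per match, so take the first capture
# of the first match (or nil when there is no body tag at all).
body_text = matches.empty? ? nil : matches[0][0]
puts body_text
```

Because scan returns an array of capture arrays, [0] alone gives a one-element array; the extra [0] pulls out the string itself.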

Alex Gutteridge

Bioinformatics Center
Kyoto University

Drew_Olson · 20 June 2007 14:42

HTML parsing can get quite complicated. Why not use a library? I've
heard great things about Hpricot: http://code.whytheluckystiff.net/hpricot/

Rob_Biedenharn1 · 20 June 2007 14:36

Or you can use the %r{} form of a Regexp literal:

text = @response.scan(%r{<body\b.*?>(.*?)</body>}mi)[0]

\b matches a "word boundary"
m is the multi-line option that causes . to match newlines, too
i is the case insensitive option (so BODY would also be matched)
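
A quick way to see the three pieces at work, as a rough sketch only: the HTML strings and the names body_re, first and no_match are made up for illustration.

```ruby
# Hypothetical input: an upper-case BODY tag whose content spans several lines.
html = "<HTML>\n<BODY bgcolor=\"white\">\nline one\nline two\n</BODY>\n</HTML>"

# With %r{} the slash in </body> needs no escaping.
body_re = %r{<body\b.*?>(.*?)</body>}mi

# i lets "BODY" match; m lets (.*?) run across the newlines.
first = html.scan(body_re)[0]
puts first[0] if first

# \b requires the tag name to end right after "body", so a longer
# name such as <bodyguard> is not treated as a body tag.
no_match = ("<bodyguard>x</body>" =~ body_re).nil?
puts no_match
```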

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com
