Img (regular expressions)

Newb_Newb · 21 August 2008 06:28

Hi all,
I need to extract img tags from an HTML page using regular expressions:
<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1
Would this code be OK?
If so, how can it be implemented?
Any ideas?

--
Posted via http://www.ruby-forum.com/.

Robert_K1 · 21 August 2008 07:45

> I need to extract img tags from an HTML page using regular expressions:
> <\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1
> Would this code be OK?

I would choose a different regexp.

> If so, how can it be implemented?

What exactly?

> Any ideas?

http://code.whytheluckystiff.net/hpricot/

Cheers

robert

2008/8/21 Newb Newb <hema@angleritech.com>:

--
use.inject do |as, often| as.you_can - without end

Phlip1 · 21 August 2008 08:11

Newb Newb wrote:

> I need to extract img tags from an HTML page using regular expressions:
> <\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1
> Would this code be OK?
> If so, how can it be implemented?
> Any ideas?

A regexp is not a parser; regular expressions cope poorly with nested, structured syntax such as HTML.
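
To make that concrete, here is a small sketch that applies the pattern from the question with String#scan. The sample HTML and the `img_sources` helper name are made up for illustration, and the redundant backslashes in the pattern are dropped. On this input it picks up an img that is commented out and misses one whose src is unquoted, which is the kind of case a real parser handles for you.

```ruby
# The pattern from the question: capture 1 is the quote character,
# capture 2 is the src value.
IMG_SRC = /<\s*img [^>]*src\s*=\s*(["'])(.*?)\1/

# Hypothetical helper: return every src value the pattern finds.
def img_sources(html)
  html.scan(IMG_SRC).map { |_quote, src| src }
end

html = '<p><img src="a.png" alt="A"> <!-- <img src="old.png"> --> <img src=b.png></p>'
p img_sources(html)
# => ["a.png", "old.png"]
```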

You need to write unit tests so you can "see" what you are doing. They will feed samples of input to your parser, and assert the output contains no <img tags.
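
A rough sketch of what such tests could look like with Minitest. The `strip_img_tags` function is a hypothetical name, and its body here is only a deliberately naive placeholder so the file runs on its own; the point is the test cases, which stay the same when you swap in a parser-based scrubber.

```ruby
require 'minitest/autorun'

# Placeholder scrubber (an assumption, not a recommendation):
# replace the body with your real parser-based implementation.
def strip_img_tags(html)
  html.gsub(/<img\b[^>]*>/i, '')
end

class StripImgTagsTest < Minitest::Test
  # Sample inputs: plain, upper-case with single quotes, and multi-line.
  SAMPLES = [
    '<p><img src="a.png"></p>',
    "<IMG SRC='b.png' />",
    "<img\n  src=\"c.png\" alt=\"multi-line\">"
  ].freeze

  def test_output_contains_no_img_tags
    SAMPLES.each do |html|
      refute_match(/<img/i, strip_img_tags(html), "img tag survived in: #{html}")
    end
  end

  def test_other_markup_is_left_alone
    assert_equal '<p>text</p>', strip_img_tags('<p>text<img src="a.png"></p>')
  end
end
```

Each new kind of input that slips through becomes one more entry in SAMPLES.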

I would load these strings into libxml-ruby or Hpricot documents, then use XPath to seek '//img', then delete their nodes from the document, then write the documents back. But note HTML supports several other ways to inject images, including CSS styles, <object> tags, etc.
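
A minimal sketch of the parse, XPath, delete, write-back steps. It swaps in REXML instead of libxml-ruby or Hpricot, purely so it runs without installing extra gems; REXML only accepts well-formed XML, so the input is assumed to be XHTML-style. The `remove_images` name and the sample markup are made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: parse the markup, drop every <img> element
# found by XPath, and serialize the document again.
def remove_images(xhtml)
  doc = REXML::Document.new(xhtml)
  REXML::XPath.match(doc, '//img').each(&:remove)
  doc.to_s
end

puts remove_images('<div><p>Before <img src="a.png" alt="A"/> after</p></div>')
```

This only covers img elements; CSS backgrounds, object tags and the like would each need their own rule.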

You need to consult with your client about how clean the HTML needs to be. If they say to only allow , , , or tags, for example, you could use XPath to seek '//*', meaning all nodes, then replace their tag names with , delete all their attributes, and write the document back.
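
One possible reading of that, sketched with REXML so it runs without extra gems. The allowed tag names and the replacement tag below are assumptions chosen only for illustration, not a recommended policy, and `scrub` is a hypothetical name. Elements outside the list are renamed, and attributes are stripped from every element.

```ruby
require 'rexml/document'

ALLOWED_TAGS = %w[div p b i].freeze  # assumption: agree the real list with your client
REPLACEMENT_TAG = 'span'             # assumption: any harmless tag would do

# Hypothetical helper: visit all elements via '//*', rename the ones
# that are not allowed, and delete every attribute.
def scrub(xhtml)
  doc = REXML::Document.new(xhtml)
  REXML::XPath.match(doc, '//*').each do |el|
    el.name = REPLACEMENT_TAG unless ALLOWED_TAGS.include?(el.name)
    el.attributes.keys.each { |attr_name| el.delete_attribute(attr_name) }
  end
  doc.to_s
end

puts scrub('<div><script type="t">x</script><b onclick="y">bold</b></div>')
# => <div><span>x</span><b>bold</b></div>
```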

There may also be gems or plugins out there that already do this, so you could google for [rails scrub html] to find one, then either raid its source or install and use it.

--
Phlip

Phlip1 · 21 August 2008 08:16

> You need to consult with your client about how clean the HTML needs to be. If they say to only allow , , , or tags, for example, you could use XPath to seek '//*', meaning all nodes, then replace their tag names with , delete all their attributes, and write the document back.

Another way to scrub input is to not allow raw HTML at all. Only allow a wiki-style markup, such as Textile via RedCloth. Some wikis allow ''italic'' and '''bold''' content, and very little else. Then you don't need to scrub it; you simply let the wiki engine convert it to harmless read-only HTML.
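
A toy sketch of the idea, not RedCloth's API: everything the user types is escaped first, and only the two wiki markers are turned back into tags. The `wiki_to_html` name and the choice of b and i as output tags are assumptions for illustration.

```ruby
# Characters that must never reach the page unescaped.
HTML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Hypothetical toy converter: escape all input, then re-introduce
# only <b> and <i> from the '''bold''' and ''italic'' markers.
def wiki_to_html(text)
  escaped = text.gsub(/[&<>"]/, HTML_ESCAPES)
  escaped
    .gsub(/'''(.+?)'''/, '<b>\1</b>') # bold first, so ''' is not read as '' plus '
    .gsub(/''(.+?)''/, '<i>\1</i>')
end

puts wiki_to_html(%q{'''Bold''', ''italic'', and <img src="x.png">})
# => <b>Bold</b>, <i>italic</i>, and &lt;img src=&quot;x.png&quot;&gt;
```

Any raw HTML in the input comes out as visible text instead of markup, so there is nothing left to scrub.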
