Hi all,
I need to extract img tags from an HTML page using regular expressions:
<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1
Would this pattern work?
If so, how can it be implemented?
Any ideas?
--
Posted via http://www.ruby-forum.com/.
> I Need to Extract Img tag Using Regular Expressions From The Html Page
> <\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1
> Is This Code Would be ok
I would choose a different regexp.
> if So How it can Be Implemented?
What exactly?
> Any Ideas??
http://code.whytheluckystiff.net/hpricot/
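On the "how can it be implemented" part, here is a minimal sketch that runs the pattern from the question through String#scan in plain Ruby. The helper name extract_img_srcs and the sample HTML are made up for illustration, and the /i flag is my addition so that <IMG ...> matches too. Treat it as a quick experiment rather than a robust solution: it only sees quoted src values, which is one reason a real parser is the safer route.

```ruby
# Pattern from the question; the /i flag is an added assumption.
IMG_SRC = /<\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1/i

# Hypothetical helper: collects the src value of every <img> tag
# whose src attribute is wrapped in single or double quotes.
def extract_img_srcs(html)
  # The pattern has two groups, so scan yields [quote_char, src] pairs.
  html.scan(IMG_SRC).map { |_quote, src| src }
end

html = %q{<p><img src="a.png" alt="x"><IMG class="c" src='b.jpg'><img src=c.gif></p>}
p extract_img_srcs(html)  # the unquoted c.gif is not picked up
```

Note that the third tag is skipped silently, so a page with unquoted attributes would lose images without any error.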
Cheers
robert
2008/8/21 Newb Newb <hema@angleritech.com>:
--
use.inject do |as, often| as.you_can - without end
Newb Newb wrote:
> I Need to Extract Img tag Using Regular Expressions From The Html Page
> <\s*img [^\>]*src\s*=\s*(["\'])(.*?)\1
> Is This Code Would be ok
> if So How it can Be Implemented?
> Any Ideas??
A regexp is not a parser; it copes poorly with nested, well-formed syntax such as HTML.
You need to write unit tests so you can "see" what you are doing. They will feed samples of input to your parser and assert that the output contains no <img> tags.
I would load these strings into libxml-ruby or Hpricot documents, then use XPath to seek '//img', then delete their nodes from the document, then write the documents back. But note HTML supports several other ways to inject images, including CSS styles, <object> tags, etc.
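The parse, XPath, delete, write-back steps could look roughly like this. It is only a sketch: REXML stands in here for libxml-ruby or Hpricot because it ships with Ruby, it needs well-formed (XHTML-style) input, and the method name strip_images is hypothetical.

```ruby
require 'rexml/document'

# Hypothetical helper: removes every <img> element from a well-formed
# XHTML fragment and returns the remaining markup as a string.
def strip_images(xhtml)
  doc = REXML::Document.new(xhtml)
  # XPath.match returns an array, so removing nodes while iterating is safe.
  REXML::XPath.match(doc, '//img').each { |img| img.remove }
  doc.to_s
end

sample = '<div><p>Hello</p><img src="a.png"/><p>Bye <img src="b.png"/></p></div>'
puts strip_images(sample)
```

Real-world HTML is often not well-formed, so a forgiving HTML parser would be needed in practice; the XPath step stays the same.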
You need to consult with your client about how clean your HTML needs to be. If they say to only allow <i>, <em>, <b>, or <strong> tags, for example, you could use XPath to seek '//*', meaning all nodes, then replace their tag names with <span>, delete all their attributes, and write the document back.
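A rough sketch of that whitelist idea, again assuming REXML and well-formed input. The names ALLOWED_TAGS and scrub_tags are made up for illustration, and I am reading "delete all their attributes" as applying to every element, allowed or not.

```ruby
require 'rexml/document'

# Assumed whitelist, taken from the example tags named above.
ALLOWED_TAGS = %w[i em b strong].freeze

# Hypothetical helper: renames every non-whitelisted element to <span>
# and strips all attributes from every element.
def scrub_tags(xhtml)
  doc = REXML::Document.new(xhtml)
  REXML::XPath.match(doc, '//*').each do |el|
    el.name = 'span' unless ALLOWED_TAGS.include?(el.name)
    # to_a takes a snapshot so attributes can be deleted while iterating.
    el.attributes.to_a.each { |attr| el.delete_attribute(attr.expanded_name) }
  end
  doc.to_s
end

dirty = '<div class="x"><b onclick="evil()">bold</b> <a href="http://example.com/">link</a></div>'
puts scrub_tags(dirty)
```

Renaming to <span> keeps the text content in place, so nothing the user typed disappears; only the markup around it is neutralised.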
Next, there might be gems out there to do this (or plugins), so you could google for [rails scrub html], to just find one, and either raid its source, or install and use it.
--
Phlip
> You need to consult with your client how clean you need your HTML. If they say to only allow <i>, <em>, <b>, or <strong> tags, for example, you could use XPath to seek '//*', meaning all nodes, then replace their tag names with <span>, delete all their attributes, and write the document back.
Another way to scrub input is to not allow raw HTML at all. Only allow a wiki markup, such as RedCloth. Some wikis allow ''italic'' and '''bold''' content, and very little else. Then you don't need to scrub it; you simply let the wiki engine convert it to harmless read-only HTML.
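To make the idea concrete, here is a toy converter for just the ''italic'' and '''bold''' markup mentioned above. It is not RedCloth and not any real wiki engine; the name wiki_to_html and the ESCAPES table are invented for this sketch. The point is the order of operations: escape everything first, then generate only the tags you trust.

```ruby
# Characters that must never reach the page unescaped.
ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Hypothetical toy converter: escapes all raw HTML, then turns
# '''bold''' and ''italic'' wiki markup into <b> and <i> tags.
def wiki_to_html(text)
  safe = text.gsub(/[&<>"]/, ESCAPES)
  # Handle the three-quote form first so it is not mistaken for italic.
  safe = safe.gsub(/'''(.+?)'''/, '<b>\1</b>')
  safe.gsub(/''(.+?)''/, '<i>\1</i>')
end

puts wiki_to_html(%q{''hi'' and '''yo''' <img src="x">})
```

Any <img> the user types comes out as visible, inert text instead of a live tag.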