Converting the string

Vlad_Smith · 1 July 2009 16:25

Hi everybody! I'm sorry for asking a silly question. I hope you'll
find some time to assist.

While searching through an HTML body I get this string:
<td class=blk11 ><img
src='https://sc.omniture.com/sc13_5/reports/chart.php?id=CPRIY_6NvJ_kj7s&s=www461.sj2&type=GIF'
width=624 height=280 border=0 align=absmiddle
usemap=#imIY_6NvJ_kj7s></td>

How can I modify it to get just the plain link, without the tags and params?
https://sc.omniture.com/sc13_5/reports/chart.php?id=CPRIY_6NvJ_kj7s&s=www461.sj2&type=GIF

--
Posted via http://www.ruby-forum.com/.

Greg_Willits · 1 July 2009 17:22

You'll have to expand this to take care of all possible scenarios, but
here's an example:

x = "<img src=\"http://some_url_to_scrape\">"
y = x.scan(/src=\"([\S\s]+?)\"/)

That will return an array, and you'll have to fish the string out of the
array with y[0][0].
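
To make the return shape concrete, here is a small sketch (the variable names first_via_scan and first_via_index are only illustrative). With one capture group, String#scan yields an array of one-element arrays; String#[] with a regex and a group index is one possible shortcut when only the first match is wanted.

```ruby
x = "<img src=\"http://some_url_to_scrape\">"

# scan with one capture group returns an array of one-element arrays
y = x.scan(/src="([\S\s]+?)"/)
first_via_scan = y[0][0]

# String#[] with a regex and a group index returns the first capture
# directly, or nil when nothing matches
first_via_index = x[/src="([\S\s]+?)"/, 1]

p y
puts first_via_scan
puts first_via_index
```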

That works specifically with double-quoted attribute values, whereas your
example above uses single quotes, but you can either use
multiple expressions or build a more complex one to cover the various
cases of quoting, href, src, other attribute names, etc.
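
As a rough sketch of what "a more complex one" might look like, the pattern below accepts double-quoted, single-quoted, and unquoted src values. The helper name extract_srcs and the example.com URLs are made up for illustration; this is one possible pattern, not a complete treatment of HTML attribute syntax.

```ruby
# Hypothetical helper: returns every src value found in the given string.
def extract_srcs(html)
  # Three alternatives, one capture group each:
  # "double-quoted", 'single-quoted', or an unquoted run up to whitespace or >
  pattern = /\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i

  # scan yields one three-element array per match; only the group that
  # matched is non-nil, so compact.first picks it out
  html.scan(pattern).map { |groups| groups.compact.first }
end

samples = [
  "<img src=\"http://example.com/a.gif\">",
  "<img src='http://example.com/b.gif' width=10>",
  "<img src=http://example.com/c.gif>"
]
samples.each { |s| puts extract_srcs(s).first }
```

Extending the same idea to href or other attribute names would mean replacing the literal src with an alternation, at which point a real HTML parser is likely the simpler route.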

There are other ways to do it; this is just a small example to give you
some ideas.

-- gw

Vlad_Smith · 1 July 2009 20:29

Thanks! That worked!

I also accidentally noticed a great feature taken from Perl that works
too:

x = <<-eohtml
<td class=blk11 ><img
src='https://sc.omniture.com/sc13_5/reports/chart.php?id=CPRIY_6NvJ_kj7s&s=www461.sj2&type=GIF'
width=624 height=280 border=0 align=absmiddle
usemap=#imIY_6NvJ_kj7s></td>
eohtml
x = $1 if x =~ /.*(https.*GIF).*/
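
One thing worth checking with a pattern like this is how the greedy .* behaves once a line holds more than one URL. The sketch below uses two made-up image tags (the example.com URLs are assumptions, not from the original string) to show which part of the line each variant captures.

```ruby
# Two made-up image tags on one line, to show how greedy matching behaves.
html = "<img src='https://example.com/one.GIF'><img src='https://example.com/two.GIF'>"

# The leading .* is greedy, so the capture starts at the LAST "https".
greedy = html[/.*(https.*GIF).*/, 1]

# Without the leading .*, the capture starts at the first "https" and the
# inner greedy .* runs through to the last "GIF", swallowing both tags.
spanning = html[/(https.*GIF)/, 1]

# A character class that stops at quotes, whitespace, and > stays inside
# a single attribute value, so scan returns each URL separately.
bounded = html.scan(/https:[^'"\s>]+/)

puts greedy
puts spanning
p bounded
```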

Aaron_Patterson1 · 1 July 2009 21:20

Please don't do this. Every time you parse HTML with a regular
expression, a kitten dies.

Instead, try using an HTML parsing library:

require 'nokogiri'

x = <<-eohtml
  <td class=blk11 ><img
  src='https://sc.omniture.com/sc13_5/reports/chart.php?id=CPRIY_6NvJ_kj7s&s=www461.sj2&type=GIF'
  width=624 height=280 border=0 align=absmiddle
  usemap=#imIY_6NvJ_kj7s></td>
eohtml

puts Nokogiri::HTML(x).at('img')['src']

--
Aaron Patterson
http://tenderlovemaking.com/
