Weird error using String#[]

Jason_Mcdonald · 13 January 2011 02:19

Please check out the attached file. I am writing a script to notify me
when a few select items become available. It hits a web page then parses
the information in order to determine whether the item is available or
not.

When I parse the values out, I start seeing some really weird results
from String#[]. What is even weirder is that when I print these results
with something like puts "val: #{weird_val}", the output also overwrites
part of the literal text being printed, "val: ".

Example:

ret = res[spos, 90]
puts "ret: #{ret}"
# ^^^^^
# Expected live result (works in baseline):
# ret: id="ProdAvailability"><span style="font-weight: bold; color:
# #000;">Availability:</span>Ou
#
# Actual live result (missing Ou on end, r in pos 0 replaced with O):
# Oet: id="ProdAvailability"><span style="font-weight: bold; color:
# #000;">Availability:</span>

If I pull the contents from the web site, it doesn't work. If I pull the
contents from a string saved in the script (denoted as baseline in the
file), it works fine.

I have been spinning my wheels for 2 days now and am pretty sure that I
am overlooking something obvious.

Anyone have any idea what is causing this?

Attachments:
http://www.ruby-forum.com/attachment/5731/availability_watcher.rb

--
Posted via http://www.ruby-forum.com/.

Kedar_Mhaswade · 13 January 2011 02:28

This seems like an encoding issue. But are you hand-scraping an HTML
page? Shouldn't you use something like Nokogiri or REXML or Hpricot?

-Kedar

Jason_Mcdonald · 13 January 2011 02:38

I probably should. I'm still relatively new to Ruby / RoR, so I tend to do
things "by hand" a lot just so I can learn how they work. 9 times out of 10 I
go back afterwards and replace it with something tried and true. I've
seen Nokogiri a lot and it is already on my "to research" list.

I thought encoding might be the culprit here, but I haven't gotten as far
as figuring out how to change the encoding. I tried to use the encode()
method but got a "no method found" error. I assume I'd want to change to
whatever the standard is for OS X (UTF-8?). How would I do this?
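
For reference, a minimal sketch of the two ways to change a string's encoding, assuming Ruby 1.9 or later (String#encode does not exist on 1.8, which may be why the call failed). The sample bytes are made up for illustration and are not taken from the page being scraped.

```ruby
# Hypothetical bytes as they might arrive from a socket, tagged as binary.
raw = "caf\xC3\xA9".dup.force_encoding(Encoding::BINARY)
puts raw.length            # counts bytes, because the string is binary

# force_encoding relabels the same bytes without changing them.
utf8 = raw.dup.force_encoding(Encoding::UTF_8)
puts utf8.valid_encoding?  # checks that the bytes are well-formed UTF-8
puts utf8.length           # counts characters now

# encode transcodes: the bytes themselves are rewritten for the target encoding.
latin1 = utf8.encode(Encoding::ISO_8859_1)
puts latin1.bytesize
```

force_encoding is the right tool when the bytes are already correct but mislabeled; encode is for converting between encodings.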

Jason_Mcdonald · 13 January 2011 03:35

Nokogiri *is* easier... (see below)

I would still like to know what exactly is causing the weird behavior in
my original post though, if anyone knows. I can understand why encoding
would result in incorrect parsing, but I don't understand why it would
also mess up the hard-coded portion of the string passed to puts.

Working Nokogiri example:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# On Ruby 2.5 and later, open-uri is called through URI.open; the original
# 2011 post used plain open(...), which open-uri patched at the time.
doc = Nokogiri::HTML(URI.open("http://www.pennstateind.com/store/PKPARK-MAG.html"))
#puts doc
ret = doc.at("div#ProdAvailability")
puts "ret: #{ret}"

# Output:
# ret: <div id="ProdAvailability">
# Outof Stock / Eta Mid January <a
# href="http://www.pennstateind.com/mm5/merchant.mvc?Screen=shippingdelivery&Product_Code=PKPARK-MAG"
# onclick="link_popup(this,'width=500,height=600,toolbar=no,scrollbars=yes');
# return false;">See Shipping Details</a><br>
# </div>

Jason_Mcdonald · 13 January 2011 17:01

Thanks, Robert. The original post has the script with both expected and
unexpected outcomes. What you show with the encoding screwing up the
offsets makes total sense.

What I'm at a loss for is why it affects the hard coded portion of the
string passed to puts:

Example:
puts "ret: #{ret}"

Output:
Oet: [part but not all of the expected string - 2 chars too short]

At this point I plan on using Nokogiri, but I am really curious what is
causing what I describe above. This is a quirk in how strings / puts
work that I'd like to understand and keep in mind going forward.

Thanks!

Jason_Mcdonald · 13 January 2011 17:54

Robert,

That example shows the same behavior in my console as you show above. So
it is the \r that is causing it, it seems. I suppose the console sees
the \r and tries to create a new line, can't, and overwrites what is
there? The reason that it is one character too short is because the \r
would count as 1.

Thanks for the example!

Robert_K1 · 13 January 2011 16:13

> Nokogiri *is* easier... (see below)

Certainly!

> I would still like to know what exactly is causing the weird behavior in
> my original post though, if anyone knows. I can understand why encoding
> would result in incorrect parsing, but I don't understand why the
> encoding would mess up the hard coded portion of the call to puts still.

Can you provide a small program that exhibits the effect you are
seeing? It is especially important to see how you calculate indexes.

Maybe this can help to illustrate a possible scenario:

Ruby version 1.9.2
irb(main):001:0> s = "aä"
=> "aä"
irb(main):002:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> x = s.dup
=> "aä"
irb(main):004:0> x.encoding
=> #<Encoding:UTF-8>
irb(main):005:0> x.force_encoding "BINARY"
=> "a\xC3\xA4"
irb(main):006:0> x.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):007:0> x[1,1]
=> "\xC3"
irb(main):008:0> s[1,1]
=> "ä"
irb(main):009:0>
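
A sketch of how this could bite when computing offsets, assuming Ruby 1.9.3 or later: a position found in the binary view of a string is a byte offset, and applying it to the UTF-8 view as a character offset lands in the wrong place. The sample text is invented for illustration.

```ruby
s = "\u00E4: Out of Stock"                       # one two-byte character up front
bytes = s.dup.force_encoding(Encoding::BINARY)   # same bytes, viewed as binary

byte_pos = bytes.index("Out")   # offset counted in bytes
char_pos = s.index("Out")       # offset counted in characters

p s[byte_pos, 3]                # byte offset used as a character offset: off by one
p s[char_pos, 3]                # character offset used correctly
p s.byteslice(byte_pos, 3)      # byte offset used with a byte-based slice
```

Each extra multi-byte character before the match shifts the byte offset further away from the character offset.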

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Robert_K1 · 13 January 2011 17:12

> Thanks, Robert. The original post has the script with both expected and
> unexpected outcomes.

I was thinking more of a small script that does not need a network
connection and works with static text instead.

> What you show with the encoding screwing up the
> offsets makes total sense.
>
> What I'm at a loss for is why it affects the hard coded portion of the
> string passed to puts:
>
> Example:
> puts "ret: #{ret}"
>
> Output:
> Oet: [part but not all of the expected string - 2 chars too short]

Well, that's easy:

irb(main):014:0> s = "\rA\tB"
=> "\rA\tB"
irb(main):015:0> puts "ret: #{s}"
Aet: B
=> nil
irb(main):016:0> p "ret: #{s}"
"ret: \rA\tB"
=> "ret: \rA\tB"
irb(main):017:0> s = "\rAet: B"
=> "\rAet: B"
irb(main):018:0> puts "ret: #{s}"
Aet: B
=> nil
irb(main):019:0> p "ret: #{s}"
"ret: \rAet: B"
=> "ret: \rAet: B"

To debug you should use p and not puts.

> At this point I plan on using Nokogiri but I am really curious what is
> causing what I describe above. This is a weirdness for how strings /
> puts works that I'd like to understand and keep in mind going forward.

It's probably more about how your terminal works than how strings work.

Kind regards

robert

J-H_Johansen · 13 January 2011 17:15

Hi there,

Normally when I see similar behaviour it's because of "hidden" characters.

Do you have a hidden \r (0x0D, decimal 13) in the text you're reading?
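
One way to check, as a rough sketch: look for the carriage return directly and strip it before printing. The string below is a made-up stand-in, not the actual page content.

```ruby
ret = "\rOut of Stock"    # hypothetical fragment with a leading carriage return

cr_at = ret.index("\r")   # nil when the string contains no carriage return
puts "carriage return at index #{cr_at}" if cr_at

clean = ret.delete("\r")  # drop every carriage return
puts "ret: #{clean}"      # prints "ret: Out of Stock"
```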

--
Jens-Harald Johansen
--
There are 10 kinds of people in the world: Those who understand binary and
those who don't...

Robert_K1 · 18 January 2011 07:50

> That example shows the same behavior in my console as you show above. So
> it is the \r that is causing it, it seems. I suppose the console sees
> the \r and tries to create a new line, can't, and overwrites what is
> there?

No, \n is newline, \r is carriage return which simply positions the
cursor at the beginning of the line.
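
To make that concrete, here is a deliberately simplified model of a single terminal line. It is only a sketch: render_line is a hypothetical helper, and real terminals handle many more control characters.

```ruby
# Simplified model of one terminal line: "\r" moves the cursor back to
# column 0, and every other character overwrites the cell under the cursor.
def render_line(text)
  cells = []
  col = 0
  text.each_char do |ch|
    if ch == "\r"
      col = 0           # carriage return: back to the start, nothing erased
    else
      cells[col] = ch   # overwrite whatever was printed in this column
      col += 1
    end
  end
  cells.join
end

puts render_line("ret: \rO")   # prints "Oet: " (the O lands on top of the r)
```

The string itself still holds every character; only the display overwrites the start of the line.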

> The reason that it is one character too short is because the \r
> would count as 1.

To see what's really in the string you should use p or #inspect.

> Thanks for the example!

You're welcome!

Kind regards

robert
