SocketError when letting Ruby open URLs from a text file

Hi all,

I have written a script that opens all the URLs in a text file one by one,
parses each page, and finally saves the results to an Excel file.

When I run the script on a text file with just a few URLs, it works
perfectly.
When I run it on a text file with many thousands of URLs, I get an
error ("in `initialize': getaddrinfo: Name or service not known
(SocketError)"). What might be causing the issue?

CODE:
require 'nokogiri'
require 'open-uri'
require 'rubygems'
require 'writeexcel'

workbook = WriteExcel.new('parseresult.xlsx')
worksheet = workbook.add_worksheet
row = 0

File.foreach("websites.txt") do |line| # loop over the URLs in the text file

  searchablefile = Nokogiri::HTML(open(line)) # open each URL

  # look up the cells that follow the "Referentie" and "Status" labels
  referentieid = searchablefile.at_xpath("//td/strong[contains(text(), 'Referentie')]/parent::*/following-sibling::*")
  status = searchablefile.at_xpath("//td/strong[contains(text(), 'Status')]/parent::*/following-sibling::*")

  unless referentieid.nil?
    worksheet.write(row, 1, referentieid.content)
  end
  unless status.nil?
    worksheet.write(row, 2, status.content)
  end
  row += 1 # next row for the next URL
end
workbook.close

ERROR:
wadiem@wadiem-TECRA-A2:~$ ruby directerubyparsewoningmarkt.rb
/home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/net/http.rb:644:in `initialize': getaddrinfo: Name or service not known (SocketError)
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/net/http.rb:644:in `open'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/net/http.rb:644:in `block in connect'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/timeout.rb:44:in `timeout'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/timeout.rb:89:in `timeout'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/net/http.rb:644:in `connect'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/net/http.rb:637:in `do_start'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/net/http.rb:626:in `start'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:306:in `open_http'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:769:in `buffer_open'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:203:in `block in open_loop'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:201:in `catch'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:201:in `open_loop'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:146:in `open_uri'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:671:in `open'
  from /home/wadiem/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/open-uri.rb:33:in `open'
  from directerubyparsewoningmarkt.rb:12:in `block in <main>'
  from directerubyparsewoningmarkt.rb:10:in `foreach'
  from directerubyparsewoningmarkt.rb:10:in `<main>'

Thanks a bunch.

--
Posted via http://www.ruby-forum.com/.

1.9.2p290 :001 > require 'open-uri'
=> true
1.9.2p290 :002 > open("http://ldfmldmflasfmkdfm")
SocketError: getaddrinfo: Name or service not known

There's probably a wrong URL in that file. Can you print it before opening it?
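
A minimal sketch of that check, assuming the file holds one URL per line. Rather than printing every URL, it reports only the lines that do not parse as an http(s) URL. The helper names `plausible_url?` and `suspicious_lines` are made up for this example. It is a syntax check only, so a well-formed URL whose host does not resolve (like the one in the irb session above) will still pass.

```ruby
require 'uri'

# Returns true when +text+ looks like a fetchable http(s) URL.
# Syntax check only: it does not prove that the host name resolves.
def plausible_url?(text)
  uri = URI.parse(text.strip)
  # URI::HTTPS is a subclass of URI::HTTP, so both schemes pass here.
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  false
end

# Collects [line_number, raw_line] pairs for the lines that fail the check.
# +lines+ can be any enumerable of strings, e.g. File.foreach("websites.txt").
def suspicious_lines(lines)
  bad = []
  lines.each_with_index do |line, index|
    bad << [index + 1, line] unless plausible_url?(line)
  end
  bad
end

# Usage against the file from the original post:
#   suspicious_lines(File.foreach("websites.txt")).each do |number, line|
#     puts "line #{number}: #{line.inspect}"
#   end
```

Printing with `inspect` makes blank lines and stray whitespace visible, which a plain `puts` would hide.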

Jesus.

On Tue, Oct 9, 2012 at 6:45 PM, Sybren Kooistra <lists@ruby-forum.com> wrote:

> [original post quoted in full; see the first message above]

It's half a million files, so I don't know if printing the file first
would help (or what exactly do you mean?).

Is there a way to work around possible bad URLs?

It works, perfect!

Jesus, thanks for all the help!

> It's half a million files, so I don't know if printing the file first
> would help (or what exactly do you mean?).

Yes, I meant that there are probably some bad (invalid) URIs, or maybe
garbage, in the file.

> Is there a way to work around possible bad URLs?

Catch the exception and log the line, for example:

File.foreach("websites.txt") do |line| # loop over the URLs in the text file
  begin
    searchablefile = Nokogiri::HTML(open(line)) # open each URL
    # the rest of the logic
  rescue
    # a bare rescue catches StandardError, which includes SocketError
    puts "there was an error opening #{line}"
  end
end
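
With half a million lines it may be more practical to collect the failures than to print them as they happen. The following is a rough sketch along those lines, not a drop-in replacement. The method name `each_page` and the swappable `fetcher` argument are assumptions made for this example, and the list of rescued error classes is a guess at what a bad URL typically raises. `URI.open` needs Ruby 2.5 or newer; on the Ruby 1.9.2 used in this thread, plain `open(url)` plays the same role.

```ruby
require 'open-uri'
require 'socket'
require 'timeout'

# Fetches every URL in +urls+ and yields (url, body) for each success.
# Returns an array of [url, error description] pairs for the failures,
# so one bad line no longer aborts the whole run.
def each_page(urls, fetcher = lambda { |url| URI.open(url).read })
  failures = []
  urls.each do |raw|
    url = raw.strip          # drop the trailing newline from the file
    next if url.empty?       # skip blank lines
    begin
      body = fetcher.call(url)
    rescue SocketError, OpenURI::HTTPError, URI::InvalidURIError,
           Timeout::Error, SystemCallError => e
      failures << [url, "#{e.class}: #{e.message}"]
      next
    end
    yield url, body
  end
  failures
end

# Usage with the file from the original post:
#   failed = each_page(File.foreach("websites.txt")) do |url, body|
#     searchablefile = Nokogiri::HTML(body)
#     # the rest of the logic
#   end
#   File.write("failed_urls.txt", failed.map { |f| f.join("\t") }.join("\n"))
```

Because the block only runs for pages that were fetched, a row counter incremented inside it advances only on success, so failed URLs leave no empty rows in the spreadsheet.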

Jesus.

On Tue, Oct 9, 2012 at 7:58 PM, Sybren Kooistra <lists@ruby-forum.com> wrote: