Nokogiri/Ruby and troublesome characters in URL

I'm very new to using Ruby, and I can't seem to figure something out
(that is probably quite basic). Any help is much appreciated!

When using Nokogiri and open-uri in Ruby, I define a constant containing
a partial URL (INITIAL_URL =
"https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten")
so that I can append to the URL for repeated use (I have added
the full code below).

However, I keep running into errors: "syntax error, unexpected tLABEL",
"unknown regexp options - zk" and "syntax error, unexpected '?'".

How can I fix this?

Here's the full code:

irb
require ‘Nokogiri’
require ‘open-uri’

def get_search_result_links(n_page)

links = n_page.css('.linker-kolom li a')
puts "** There were #{links.length} links found"
links.each do |link|
    href = link['href']
    inner_url = 'https://zoek.officielebekendmakingen.nl' + href
puts "\n\n\nFetching page at #{File.basename(inner_url).split('?')[0]}"

datalezer = open(inner_url).read
lokalenieuwefilenaam = href + “.html”
lokalenieuwefile = open(lokalenieuwefilenaam, “w”)
lokalenieuwefile.write(datalezer)
lokalenieuwefile.close
end
end

INITIAL_URL =
'https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten'
initial_page = Nokogiri::HTML(open(INITIAL_URL))
pagination_links = initial_page.css('.paginering.beneden a')
last_page_link = pagination_links[-2]
last_page_number = last_page_link.text.to_i
(5..last_page_number).each do |page_num|
puts "\n\n\n***** Getting page #{page_num}"
results_page_url = "#{INITIAL_URL}&_page=#{page_num}"
results_page = Nokogiri::HTML(open(results_page_url))
get_search_result_links(results_page)
end
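
For the "add onto the URL" step, here is a minimal sketch of one way to append the page parameter with Ruby's standard uri library instead of string interpolation. The helper name page_url is made up for illustration; the _page parameter is taken from the loop above. Inside ordinary ASCII quotes, characters such as ? and & are simply part of the string.

```ruby
require 'uri'

# Base search URL from the post, in plain ASCII single quotes.
INITIAL_URL = 'https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten'

# page_url is a hypothetical helper, not part of the original script.
# It parses the base URL, appends a _page parameter to the existing
# query string, and returns the resulting URL as a String.
def page_url(base, page_num)
  uri = URI.parse(base)
  params = URI.decode_www_form(uri.query.to_s)
  params << ['_page', page_num.to_s]
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

puts page_url(INITIAL_URL, 5)
```

Going through URI keeps the query string well-formed even if the base URL later gains or loses parameters.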

--
Posted via http://www.ruby-forum.com/.

(In my setup) the line...

pagination_links = initial_page.css('.paginering.beneden a')

returns an empty Nokogiri::XML::NodeSet: => []

Which part of your HTML are you trying to select?

Something I found by googling: "Parsing HTML with Nokogiri" (The Bastards Book of Ruby)

Abinoam Jr.

On Tue, Aug 28, 2012 at 4:19 PM, Sybren Kooistra <lists@ruby-forum.com> wrote:

> [original post quoted in full]

Thanks for the reply Abinoam.

With pagination_links = initial_page.css('.paginering.beneden a') I'm
trying to select <div class="paginering beneden"> and then the <a
href="..."> elements inside it, which are the links to all the pages. So
apparently something is going wrong here as well?

The bigger problem I'm dealing with is that Ruby seems to treat the
characters following the question mark (in INITIAL_URL =
"https://zoek.officielebekendmakingen.nl/zoeken/resultaat.?zkt=Uitgebreid&pst=ParlementaireDocumenten")
as code instead of as part of the string.
So I get an error when simply trying to define INITIAL_URL with a
URL string, because some of the characters in the URL are interpreted as
code.

--

Are you testing your code by inserting an href by hand, something like
this:

    inner_url = 'https://zoek.officielebekendmakingen.nl' +
/something/zk?x=10&y=5

That produces the error:

   unknown regexp options - zk

The reason for that error is that /something/ is the syntax for a regex
literal.
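
To illustrate the point about regex literals, here is a minimal sketch that asks Ruby whether two one-line snippets compile, without running them. The helper name parses? is made up for this example, and RubyVM::InstructionSequence is specific to the standard (MRI) interpreter. The same kind of error would be expected whenever the URL reaches the parser without ASCII quotes around it, for instance if typographic quotes like those on the require lines were used around it (an assumption, since the post does not show the exact line that failed).

```ruby
# parses? is a hypothetical helper: it reports whether a snippet of Ruby
# source compiles. The snippet is only compiled, never executed.
def parses?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

# The path is inside quotes, so "/", "?" and "&" are plain characters.
quoted = "inner_url = 'https://zoek.officielebekendmakingen.nl' + '/something/zk?x=10&y=5'"

# The path is not quoted, so /something/ is read as a regex literal and
# the letters after it as (unknown) regex options.
unquoted = "inner_url = 'https://zoek.officielebekendmakingen.nl' + /something/zk?x=10&y=5"

puts "quoted compiles:   #{parses?(quoted)}"
puts "unquoted compiles: #{parses?(unquoted)}"
```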

--

Dear Sybren,

I've indented your code and fixed some of the quotes.

It runs, but there's no "paginering beneden" in the HTML it retrieves.
So the code fails at "pagination_links =
initial_page.css('.paginering.beneden a')".

Look:

initial_page.css('.paginering') => []
initial_page.css('.beneden') => []

But, as an example...
initial_page.css('.tekst-kleiner')
initial_page.css('a.tekst-kleiner')
initial_page.css('header').css('a.tekst-kleiner')

all return...
=> [#<Nokogiri::XML::Element:0x109e3d0 name="a"
attributes=[#<Nokogiri::XML::Attr:0x109e358 name="href"
value="https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten&grootte=2">,
#<Nokogiri::XML::Attr:0x109e344 name="class" value="tekst-kleiner">,
#<Nokogiri::XML::Attr:0x109e308 name="title" value="Schermteksten
verkleinen">] children=[#<Nokogiri::XML::Text:0x10a2854 "—">]>]

Look at the HTML source of your URL and you will see it.

Best regards,
Abinoam Jr.
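
A side note on the empty result described above: when the selector matches nothing, pagination_links[-2] is nil, so the later .text call raises NoMethodError. A minimal sketch of a guard follows. The helper name last_page_number and the Link struct are made up for illustration; Link stands in for a Nokogiri node so the example runs without fetching the page, and the link texts are invented sample data.

```ruby
# Link is a stand-in for a Nokogiri node; only #text is needed here.
Link = Struct.new(:text)

# last_page_number is a hypothetical helper. It returns nil instead of
# raising NoMethodError when fewer than two links were matched.
def last_page_number(pagination_links)
  link = pagination_links[-2]
  return nil if link.nil?

  link.text.to_i
end

# Nothing matched, as in the run described above.
p last_page_number([])

# Invented sample data: the second-to-last link holds the page count.
p last_page_number([Link.new('1'), Link.new('37'), Link.new('Volgende')])
```

With a guard like this the script can print a clear message and stop when the pagination block is missing, rather than failing inside the loop setup.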

On Wed, Aug 29, 2012 at 7:16 AM, Sybren Kooistra <lists@ruby-forum.com> wrote:

> [previous message quoted in full]