7stud2
(7stud --)
28 August 2012 20:19
1
I'm very new to Ruby, and I can't figure out something that is probably
quite basic. Any help is much appreciated!
When using nokogiri and open-uri, I define a constant containing a
partial URL (INITIAL_URL =
"https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten")
so that I can append to it for each request (the full code is below).
However, I keep running into errors: "syntax error, unexpected tLABEL",
"unknown regexp options - zk" and "syntax error, unexpected '?'".
How can I fix this?
Here's the full code:
irb
require 'nokogiri'
require 'open-uri'

# Saves every search result linked from one results page to a local .html file.
def get_search_result_links(n_page)
  links = n_page.css('.linker-kolom li a')
  puts "** There were #{links.length} links found"
  links.each do |link|
    href = link['href']
    inner_url = 'https://zoek.officielebekendmakingen.nl' + href
    puts "\n\n\nFetching page at #{File.basename(inner_url).split('?')[0]}"
    # On Ruby 3.0+, open-uri needs URI.open(inner_url) instead of open(inner_url).
    datalezer = open(inner_url).read
    lokalenieuwefilenaam = href + '.html'
    lokalenieuwefile = open(lokalenieuwefilenaam, 'w')
    lokalenieuwefile.write(datalezer)
    lokalenieuwefile.close
  end
end

INITIAL_URL = 'https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten'
initial_page = Nokogiri::HTML(open(INITIAL_URL))
pagination_links = initial_page.css('.paginering.beneden a')
last_page_link = pagination_links[-2]
last_page_number = last_page_link.text.to_i

(5..last_page_number).each do |page_num|
  puts "\n\n\n***** Getting page #{page_num}"
  results_page_url = "#{INITIAL_URL}&_page=#{page_num}"
  results_page = Nokogiri::HTML(open(results_page_url))
  get_search_result_links(results_page)
end
--
Posted via http://www.ruby-forum.com/ .
abinoam
(Abinoam Praxedes Marques Jr.)
29 August 2012 00:59
2
(In my setup) the line...
pagination_links = initial_page.css('.paginering.beneden a')
returns an empty Nokogiri::XML::NodeSet => []
What part of your HTML are you trying to select?
Something I googled that may help: "Parsing HTML with Nokogiri" (The Bastards Book of Ruby).
Abinoam Jr.
7stud2
(7stud --)
29 August 2012 11:16
3
Thanks for the reply, Abinoam.
With pagination_links = initial_page.css('.paginering.beneden a') I'm
trying to select <div class="paginering beneden"> and then the <a
href="..."> elements inside it, which are the page links. So apparently
something is going wrong here as well?
The bigger problem I'm dealing with is that Ruby seems to treat the
characters following the question mark (in INITIAL_URL =
"https://zoek.officielebekendmakingen.nl/zoeken/resultaat.?zkt=Uitgebreid&pst=ParlementaireDocumenten")
as code instead of as part of the string.
So I get an error when simply trying to define INITIAL_URL with a
URL string, because some of the characters in the URL are interpreted as
code.
7stud2
(7stud --)
7 September 2012 20:51
4
Are you testing your code by inserting an href by hand, something like
this:
inner_url = 'https://zoek.officielebekendmakingen.nl' +
/something/zk?x=10&y=5
That produces the error:
unknown regexp options - zk
The reason for that error is that /something/ is the syntax for a regex
literal.
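To make the difference concrete, here is a minimal sketch. The build_inner_url helper and the /something/zk path are only illustrations, not part of the original script. It shows that the same characters are harmless once they sit inside quotes, while a bare /something/ is parsed as a Regexp:

```ruby
BASE_URL = 'https://zoek.officielebekendmakingen.nl'

# Hypothetical helper: joins the site root and a relative href.
def build_inner_url(href)
  BASE_URL + href
end

# Quoted, the path is plain string data, so ?, & and = have no special meaning.
puts build_inner_url('/something/zk?x=10&y=5')

# Unquoted, /something/ is a Regexp literal, and any letters right after the
# closing slash are read as regexp option flags (i is a valid one, z and k are not).
puts(/something/i.class)
```

When the href comes from link['href'] it is already a String, so this error should only show up when a path is typed into the source without quotes.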
abinoam
(Abinoam Praxedes Marques Jr.)
5 September 2012 01:39
5
Dear Sybren,
I've indented and fixed some quotes on your code.
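About the quotes: the posted code used typographic (curly) quotes around some strings, and Ruby only accepts the plain ASCII ' and " as string delimiters. If an editor keeps inserting them, something like the sketch below can clean up a pasted source file. The straighten_quotes name is made up for this example.

```ruby
# Typographic quotes mapped to the ASCII quotes Ruby expects.
SMART_QUOTES = {
  "\u2018" => "'", "\u2019" => "'", # left and right single curly quotes
  "\u201C" => '"', "\u201D" => '"'  # left and right double curly quotes
}.freeze

# Hypothetical helper: returns the text with curly quotes replaced.
def straighten_quotes(source)
  source.gsub(/[\u2018\u2019\u201C\u201D]/, SMART_QUOTES)
end

puts straighten_quotes("require \u2018open-uri\u2019")
```

This only swaps the four characters listed in the hash, so review the result before running it on anything where curly quotes are meant to appear inside a string.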
It runs, but there's no "paginering beneden" on the html retrieved by it.
So, the code fails at "pagination_links =
initial_page.css('.paginering.beneden a')"
Look:
initial_page.css('.paginering') => []
initial_page.css('.beneden') => []
But, as an example...
initial_page.css('.tekst-kleiner')
initial_page.css('a.tekst-kleiner')
initial_page.css('header').css('a.tekst-kleiner')
all return:
=> [#<Nokogiri::XML::Element:0x109e3d0 name="a"
attributes=[#<Nokogiri::XML::Attr:0x109e358 name="href"
value="https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=ParlementaireDocumenten&grootte=2">,
#<Nokogiri::XML::Attr:0x109e344 name="class" value="tekst-kleiner">,
#<Nokogiri::XML::Attr:0x109e308 name="title" value="Schermteksten
verkleinen">] children=[#<Nokogiri::XML::Text:0x10a2854 "—">]>]
Look at the HTML source of your URL and you will see it.
Best regards,
Abinoam Jr.