Parsing Newb Help

7stud2 · 4 September 2012 22:40

Hey guys,

I'm pretty new to Ruby, and programming in general, and am having
massive trouble parsing some HTML pages I scraped from Yellow Pages.

So far, I've been using the link below as my template

I am trying to compile a list of restaurants in San Francisco, with the
price, ambiance and neighbourhood attributes. I want to import this list
into Excel. Does anyone have an idea how to adapt the script in the
template for YP?

I have successfully scraped the source code, but when it comes to
parsing, I'm having trouble inputting the right parameters.

Any help would be appreciated!

http://www.yellowpages.com/san-francisco-ca/restaurants?page=1


--
Posted via http://www.ruby-forum.com/.

7stud2 · 4 September 2012 23:56

Benedict Wong wrote in post #1074721:

I have successfully scraped the source code, but when it comes to
parsing, I'm having trouble inputting the right parameters.

What have you tried?


7stud2 · 5 September 2012 01:40

I'm not at all clear what the *specific* things are that you want to
extract from the website.

In any case, you need to click on View/Source in your browser and
examine the raw html to figure out what tags you need to extract and how
to identify them. Look at the web page in your browser then use Find or
Search to locate the same text in the raw html.

Then read some basic xpath tutorials starting here:

http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/

Here is an example of how to get the names of the restaurants:

require 'nokogiri'

#require 'open-uri'
#doc = Nokogiri::HTML(URI.open("http://www.threescompany.com/"))

html = <<MY_HTML
<html>
<head>
<title>Stuff</title>
</head>
<body>

<h3 class="title fn org">
    <a href="http://blah_blah_blah"
    class="no-tracks url "
    rel="nofollow"
    title="Fishermen's Grotto">Fishermen's Grotto</a>
</h3>

<h3 class="title fn org">
    <a href="http://blah_blah"
    rel="nofollow"
    title="Marnee Thai Restaurant">Marnee Thai Restaurant</a>
</h3>

</body>
</html>
MY_HTML

doc = Nokogiri::HTML(html)

# Select the first <a> inside every <h3> whose class is exactly "title fn org".
doc.xpath('//h3[@class="title fn org"]/a[1]').each do |node|
  puts node.text
end

--output:--
Fishermen's Grotto
Marnee Thai Restaurant
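
Once the names are extracted, getting them into Excel (the goal stated in the original question) is mostly a matter of writing a CSV file, which Excel can open. The following is only a minimal sketch using Ruby's csv library. The file name and the single "name" column are assumptions made for illustration; price, ambiance and neighbourhood would become extra columns once they have been located in the html.

```ruby
require 'csv'

# Names as printed by the example above. Further columns would be added
# to each row once those values have been extracted as well.
names = ["Fishermen's Grotto", 'Marnee Thai Restaurant']

# OUTPUT_FILE is a hypothetical file name chosen for this sketch.
OUTPUT_FILE = 'restaurants.csv'

CSV.open(OUTPUT_FILE, 'w') do |csv|
  csv << ['name']                      # header row
  names.each { |name| csv << [name] }  # one row per restaurant
end

puts File.read(OUTPUT_FILE)
```

CSV.open takes care of quoting, so names that contain commas or quotes still end up in a single cell.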

Parsing html requires a good understanding of html structure, e.g.
parents, children, siblings, etc., and css, e.g. classes, ids, etc. As
a beginner it is better to take baby steps, not jump in the deep end of
the pool, so this project may be too hard for you.
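
To make the terms parent, child, sibling and class concrete, here is a small sketch. It deliberately uses REXML, which ships with Ruby, instead of Nokogiri so that it runs without installing a gem. The markup is made up and well-formed, and REXML only copes with well-formed XML, so Nokogiri remains the tool for real pages. The XPath expression is the one from the example above without the [1], which there restricts the match to the first link in each h3.

```ruby
require 'rexml/document'

# Made-up, well-formed markup modelled on the sample in this thread.
markup = <<~MARKUP
  <div id="results">
    <h3 class="title fn org">
      <a href="http://blah_blah_blah" title="Fishermen's Grotto">Fishermen's Grotto</a>
    </h3>
    <h3 class="title fn org">
      <a href="http://blah_blah" title="Marnee Thai Restaurant">Marnee Thai Restaurant</a>
    </h3>
    <h3 class="other">
      <a href="http://blah" title="Not a listing">Not a listing</a>
    </h3>
  </div>
MARKUP

doc = REXML::Document.new(markup)

# Class: the attribute predicate keeps only <h3> tags with this exact class.
# Child: "/a" then steps down to the <a> tags directly inside each match.
links = REXML::XPath.match(doc, '//h3[@class="title fn org"]/a')
names = links.map(&:text)

# Parent: step back up from the first <a> to the <h3> that contains it.
first_h3 = links.first.parent

# Sibling: the next element at the same level as that <h3>.
second_h3 = first_h3.next_element

puts names
puts first_h3.name
puts second_h3.attributes['class']
```

The third h3 is skipped because its class does not match, which is the same mechanism that picks out the restaurant names in the Nokogiri example.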


7stud2 · 5 September 2012 00:32

7stud -- wrote in post #1074725:

Benedict Wong wrote in post #1074721:

I have successfully scraped the source code, but when it comes to
parsing, I'm having trouble inputting the right parameters.

What have you tried?

require 'open-uri'

BASE_LIST_URL =
  'http://www.yellowpages.com/san-francisco-ca/restaurants?page='

LAST_PAGE_NUMBER = 157

LIST_PAGES_SUBDIR = 'yp-list-pages'

Dir.mkdir(LIST_PAGES_SUBDIR) unless File.exist?(LIST_PAGES_SUBDIR)

for page_number in 1..LAST_PAGE_NUMBER
  # URI.open is needed on Ruby 3.0+, where plain open no longer accepts URLs
  page = URI.open("#{BASE_LIST_URL}#{page_number}")

  file = File.open("#{LIST_PAGES_SUBDIR}/yp-list-page-#{page_number}.html", 'w')

  # read returns the page as one string; on Ruby 1.9+ writing the array
  # from readlines would save its inspect output instead of the html
  file.write(page.read)

  file.close

  puts "Copied page #{page_number}"

  # pause between requests
  sleep 4
end

This copied all of the web pages onto my hard drive as .html files.

Then I downloaded and installed the Nokogiri gem.

Next lines of code:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = 'http://www.yellowpages.com/san-francisco-ca/restaurants?page=1'
page = Nokogiri::HTML(URI.open(url))

links = page.css('a')

puts links.length

(this printed out the number 982)

Then typed:

hrefs = links.map { |link| link['href'] }

doc_hrefs = hrefs.select { |href|
  href.match('title') != nil
}
doc_hrefs = doc_hrefs.uniq

After this point, I got kind of lost.
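
A note on the filtering step above, offered as a sketch rather than a fix: link['href'] returns nil for an anchor that has no href attribute, and calling match on nil raises NoMethodError. The plain-Ruby example below uses a made-up array in place of the real hrefs to show one way to guard against that. Note also that in the sample markup shown in this thread, title is a separate attribute of the a tag and not part of the URL, so filtering the hrefs for 'title' may not find the restaurant names at all.

```ruby
# Hypothetical stand-in for the array returned by
# links.map { |link| link['href'] }.
hrefs = [
  'http://blah_blah_blah',
  nil,                          # an <a> tag with no href attribute
  'http://blah_blah/title/1',
  'http://blah_blah/title/1'    # duplicate link
]

# compact drops the nils before any string method is called on them;
# include? is a plain substring test, which is all the filter needs.
doc_hrefs = hrefs.compact.select { |href| href.include?('title') }.uniq

puts doc_hrefs
```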


Robert_K1 · 5 September 2012 08:11

On Wed, Sep 5, 2012 at 3:40 AM, 7stud -- <lists@ruby-forum.com> wrote:

I'm not at all clear what the *specific* things are that you want to
extract from the website.

In any case, you need to click on View/Source in your browser and
examine the raw html to figure out what tags you need to extract and how
to identify them. Look at the web page in your browser then use Find or
Search to locate the same text in the raw html.

Then read some basic xpath tutorials starting here:

http://www.engineyard.com/blog/2010/getting-started-with-nokogiri/

More at
http://www.w3schools.com/xpath/
http://www.zvon.org/xxl/XPathTutorial/General/examples.html

Parsing html requires a good understanding of html structure, e.g.
parents, children, siblings, etc., and css, e.g. classes, ids, etc. As
a beginner it is better to take baby steps, not jump in the deep end of
the pool, so this project may be too hard for you.

When using Firefox there are some useful extensions for XPath testing, namely
https://code.google.com/p/xpathchecker/
Firefinder - Robert's talk (needs Firebug)

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
