HTML/XML Parsing


(Ruby Tuesday) #1

I’m wondering if anyone has ever come across an example of how to parse an HTML
table (with images) using either XSLT or Ruby scripts.

I’d like to be able to extract all the data and put it in a
database (MySQL, SQLite, etc.).

There’s a twist, though: some of the image cells have 2 or more JPEG images
instead of one. Since I’m not an expert database designer, how do you handle
that?

Table fields:

xnum text(40) unique
image jpeg image (may have none, or 1+ images)
desc memo(256)
loc memo(256)

Thanks.


(dhtapp) #2

My first suggestion would be to

  1. make a to-many relationship to the image records,
  2. store any IMG attributes parsed from the HTML in the image records
    themselves (with maybe an ordering attribute within the to-many set, in case
    sliced images are bumped up against each other for positioning), and
  3. decide whether to store the images themselves on a filesystem with
    pathnames in the records, or to store the image data as BLOBs within the
    records themselves.
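
As a rough sketch of points 1 and 2, here is one way the rows and their images could be pulled apart in Ruby. It uses REXML, so it assumes the page is well-formed XHTML (real-world HTML often is not, and would need a more forgiving parser). The sample markup, the column order (xnum, images, desc, loc), and the names `extract_rows` and `position` are all made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical sample input; it must be well-formed XML for REXML to accept it.
SAMPLE_XHTML = <<~MARKUP
  <table>
    <tr>
      <td>X-001</td>
      <td><img src="a1.jpg" width="40"/><img src="a2.jpg" width="60"/></td>
      <td>first item</td>
      <td>shelf 3</td>
    </tr>
    <tr>
      <td>X-002</td>
      <td></td>
      <td>second item</td>
      <td>shelf 7</td>
    </tr>
  </table>
MARKUP

# Returns [cells, images]: one hash per table row, plus one hash per <img>.
# Each image hash points back at its row through :xnum and keeps its order
# within the cell in :position.
def extract_rows(markup)
  doc = REXML::Document.new(markup)
  cells = []
  images = []
  REXML::XPath.each(doc, '//tr') do |tr|
    tds = tr.get_elements('td')
    next if tds.size < 4 # skip header rows or rows with missing columns

    xnum = tds[0].text.to_s.strip
    cells << { xnum: xnum,
               desc: tds[2].text.to_s.strip,
               loc: tds[3].text.to_s.strip }

    tds[1].get_elements('.//img').each_with_index do |img, index|
      attrs = {}
      img.attributes.each { |name, value| attrs[name] = value }
      images << { xnum: xnum, position: index, attrs: attrs }
    end
  end
  [cells, images]
end

cells, images = extract_rows(SAMPLE_XHTML)
puts "#{cells.size} rows, #{images.size} images"
```

A row with no images simply contributes nothing to `images`, which is the "may have none, or 1+" case from the original question.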

If you need to retrieve the images through a non-HTTP pipeline into another
process, then BLOBs may be the way to go. If I were simply going to generate
dynamic HTML, then I’d probably go ahead and put the images out on a
filesystem where both the database and the Webserver could get to 'em.
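
To make the two options concrete, here is a small Ruby sketch of what an image record might hold in each case. The helper names (`store_on_filesystem`, `store_as_blob`), the hash keys, and the directory layout are hypothetical; a real setup would write the resulting record into the database instead of returning a hash.

```ruby
require 'tmpdir'
require 'fileutils'

# Filesystem option: write the bytes under image_root and return a record
# that holds only the pathname.
def store_on_filesystem(image_root, filename, bytes)
  FileUtils.mkdir_p(image_root)
  path = File.join(image_root, File.basename(filename))
  File.binwrite(path, bytes)
  { path: path }
end

# BLOB option: the record carries the raw bytes itself.
def store_as_blob(bytes)
  { data: bytes.b }
end

Dir.mktmpdir do |dir|
  fake_jpeg = "\xFF\xD8\xFF\xE0 not a real image".b # placeholder bytes
  on_disk = store_on_filesystem(File.join(dir, 'images'), 'a1.jpg', fake_jpeg)
  in_row  = store_as_blob(fake_jpeg)

  puts File.binread(on_disk[:path]) == fake_jpeg # the file holds the same bytes
  puts in_row[:data].bytesize                    # the record holds them directly
end
```

With the pathname approach the Webserver can serve the file directly, while the BLOB approach keeps everything reachable through the database connection alone.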

- dan


(Mark Hubbart) #3

It’s been a while since I worked with databases, but perhaps something
like this:

table 1: “cells”

  • id int autoincrement primary_key
  • xnum text(40) unique
  • desc …
  • loc …

table 2: “images”

  • id int autoincrement primary_key
  • cell_id int index
  • image blob

That way, more than one image can be linked to the same cell_id. Then:

SELECT image, xnum FROM images, cells WHERE cell_id = cells.id;

…to select a list of records containing two fields: the image data and the cell
number (assuming that’s what the xnum is).
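
To see what that query returns without setting up a database, here is a plain-Ruby sketch that mimics the two tables with arrays of hashes and performs the same join in memory. The sample rows and the method name `images_with_xnum` are invented for illustration; this is not database code, just the same lookup spelled out.

```ruby
# Hypothetical rows standing in for the "cells" and "images" tables.
CELLS = [
  { id: 1, xnum: 'X-001' },
  { id: 2, xnum: 'X-002' } # this cell has no images
].freeze

IMAGES = [
  { id: 1, cell_id: 1, image: 'bytes of a1.jpg' },
  { id: 2, cell_id: 1, image: 'bytes of a2.jpg' }
].freeze

# Mimics: SELECT image, xnum FROM images, cells WHERE cell_id = cells.id;
def images_with_xnum(images, cells)
  cells_by_id = cells.to_h { |cell| [cell[:id], cell] }
  images.filter_map do |img|
    cell = cells_by_id[img[:cell_id]]
    { image: img[:image], xnum: cell[:xnum] } if cell
  end
end

images_with_xnum(IMAGES, CELLS).each do |row|
  puts "#{row[:xnum]}: #{row[:image]}"
end
```

Both images come back paired with X-001, and X-002 does not appear at all, since this kind of join only returns cells that have at least one image.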

Alternatively, you could forgo the ids, and link images via xnums. But
I understand that using ids is the “right” way, whatever that means. :)