Reading XML to relational tables

Hi everyone,

I need to build 3 relational tables from an XML text. In these tables, I
need to keep track of the words wrapped in <emph> and <strong> tags,
along with how many times each word occurs in its <p> tag. This is
easier to illustrate with an example:

I need to take this text:

<p> My name is <strong>Ted</strong>, and I like <emph>coffee</emph>.
<strong>Ted</strong> does not like tea. </p>
<p> I have a brother who likes <emph>tea</emph> but does not like
<emph>coffee</emph> </p>

and turn it into 3 normalized tables like this:

...p_table...
p_id  desc
1     My name is....
2     I have a ....

...p_to_emph_table...
p_id  e_id  count
1     2     1
2     1     1
2     2     1

...emph_table...
e_id  emph_word
1     Tea
2     Coffee

I am not sure what the best approach would be to parse this XML with
Ruby, or what tool could help me do this efficiently.

Any ideas appreciated,

Ted.

···

--
Posted via http://www.ruby-forum.com/.

What I'd do is parse the XML (use Nokogiri, for example) and get all p
elements. For each p element, insert it into p_table if not present
and get its id. Look at all emph inside the p element, and for each of
them:
- Check if the word is already in emph_table and get its id, or
- Insert it into emph_table and get the new id

With that id, insert a row into p_to_emph_table for that p id and
word id, or increment its count if the row already exists.
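
The steps above can be sketched in memory before any database is involved. This is only an illustration under a few assumptions: it uses REXML (bundled with Ruby) instead of Nokogiri so it runs without extra gems, it wraps the sample text in a made-up <doc> root because an XML document needs a single root element, it keeps the tables as plain hashes, and the names build_tables and full_text are hypothetical. Ids are handed out in first-seen order, so they will not match the numbering in the example tables.

```ruby
require 'rexml/document'

# Sample text from the question, wrapped in an assumed <doc> root element.
XML_TEXT = <<~XML
  <doc>
  <p> My name is <strong>Ted</strong>, and I like <emph>coffee</emph>.
  <strong>Ted</strong> does not like tea. </p>
  <p> I have a brother who likes <emph>tea</emph> but does not like
  <emph>coffee</emph> </p>
  </doc>
XML

# Concatenates the text of an element and all of its descendants.
def full_text(element)
  element.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then full_text(child)
    else ''
    end
  end.join
end

# Returns three hashes standing in for the tables:
#   p_table    => { p_id => paragraph text }
#   emph_table => { word => e_id }
#   p_to_emph  => { [p_id, e_id] => count }
def build_tables(xml_text)
  doc = REXML::Document.new(xml_text)
  p_table = {}
  emph_table = {}
  p_to_emph = Hash.new(0)

  REXML::XPath.match(doc, '//p').each_with_index do |p_el, index|
    p_id = index + 1
    p_table[p_id] = full_text(p_el).split.join(' ') # collapse whitespace

    REXML::XPath.match(p_el, './/emph').each do |emph|
      word = emph.text.to_s.strip.downcase
      # Reuse the id if the word is known, otherwise assign the next one.
      e_id = (emph_table[word] ||= emph_table.size + 1)
      # "Insert or update": a missing row starts at 0 and is incremented.
      p_to_emph[[p_id, e_id]] += 1
    end
  end

  [p_table, emph_table, p_to_emph]
end

p_table, emph_table, p_to_emph = build_tables(XML_TEXT)
p p_table
p emph_table
p p_to_emph
```

With a real database, each hash lookup would become a SELECT and each assignment an INSERT or UPDATE, but the flow stays the same.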

This is a straightforward approach that should work. Give it a try (ask
about anything that blocks you) and let us know how it goes.

Jesus.

···

Hi Jesus,

Thank you for your help. Right now I am stuck trying to traverse the
elements inside a single Nokogiri::XML::Element. I know I can use the
elements method to list the child elements, but I am not sure how I can
traverse them and get their contents individually.

require 'nokogiri'

xml = File.read('translateXML.xml')
doc = Nokogiri::XML(xml)

# split into paragraphs (<p> elements) first
arr = doc.search('p')

# prints the child elements of the first <p>
puts arr[0].elements

···


Try something like:

require 'nokogiri'

doc = Nokogiri::XML(File.read("p.xml"))
doc.search("p").each do |p_element|
  puts "---------"
  puts p_element.text
  p_element.css("emph,strong").each do |emph|
    puts "Highlighted: #{emph.text}"
  end
end
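
To get from this traversal to the per-paragraph counts asked for in the question, one option is to tally each highlighted child as it is visited. The following is a rough sketch rather than a drop-in solution: it uses REXML (bundled with Ruby) in place of Nokogiri so it runs without extra gems, it assumes the paragraphs sit under a single made-up <doc> root, it only looks at direct children of each <p>, and the name highlighted_counts is hypothetical.

```ruby
require 'rexml/document'

# Sample text from the question, wrapped in an assumed <doc> root element.
SAMPLE_XML = <<~XML
  <doc>
  <p> My name is <strong>Ted</strong>, and I like <emph>coffee</emph>.
  <strong>Ted</strong> does not like tea. </p>
  <p> I have a brother who likes <emph>tea</emph> but does not like
  <emph>coffee</emph> </p>
  </doc>
XML

# For each <p>, returns a hash of { [tag name, word] => occurrences }
# built from the direct child elements whose tag is listed in +tags+.
def highlighted_counts(xml_text, tags = %w[emph strong])
  doc = REXML::Document.new(xml_text)
  REXML::XPath.match(doc, '//p').map do |p_el|
    p_el.elements.to_a.each_with_object(Hash.new(0)) do |child, counts|
      next unless tags.include?(child.name)
      counts[[child.name, child.text.to_s.strip]] += 1
    end
  end
end

highlighted_counts(SAMPLE_XML).each_with_index do |counts, index|
  puts "Paragraph #{index + 1}:"
  counts.each { |(tag, word), n| puts "  <#{tag}> #{word}: #{n}" }
end
```

Each child element exposes its tag through name and its contents through text, which is what lets the words be handled individually.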

Jesus.
