Parsing xml

Hi,
Is there any way in Ruby to parse an xml file without using REXML or any
other libraries.

Regards
Arun Kumar

--
Posted via http://www.ruby-forum.com/.

Quoting "Arun Kumar" <arunkumar@innovaturelabs.com>:

Hi,
Is there any way in Ruby to parse an xml file without using REXML or any
other libraries.

Of course. You can write a finite state machine, read XML from file and parse as you want.

--
   WBR, Peter Zotov

Peter Zotov wrote:

Quoting "Arun Kumar" <arunkumar@innovaturelabs.com>:

Hi,
Is there any way in Ruby to parse an xml file without using REXML or any
other libraries.

Of course. You can write a finite state machine, read XML from file
and parse as you want.

Thanks. Can u please give me details of it.

Regards
Arun Kumar

--
Posted via http://www.ruby-forum.com/.

Quoting "Arun Kumar" <arunkumar@innovaturelabs.com>:

Peter Zotov wrote:

Quoting "Arun Kumar" <arunkumar@innovaturelabs.com>:

Hi,
Is there any way in Ruby to parse an xml file without using REXML or any
other libraries.

Of course. You can write a finite state machine, read XML from file
and parse as you want.

Thanks. Can u please give me details of it.

It is described nicely in Wikipedia:
Finite-state machine - Wikipedia

As a clue, I can recommend you define the following states: "text", "opening tag", "tag attribute", "tag attribute value", "closing tag". E.g. when you are in the "text" state and get a "<" symbol in the input sequence, you change state to "opening tag" or "closing tag"...
I still have one question: why don't you use REXML?
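
For illustration, the states suggested above could be sketched roughly like this. It is a simplified sketch, not a real XML parser: it knows only three states (text, opening tag, closing tag), leaves out the two attribute states, and does not handle comments, CDATA, entities or the <?xml ...?> declaration. The method name tokenize and the event format are made up for this example.

```ruby
# Illustrative state machine: walks the input one character at a time and
# emits [type, value] events. Handles only <name>, </name> and plain text.
def tokenize(xml)
  events = []     # collected [type, value] pairs
  state  = :text  # current state of the machine
  buffer = +""    # characters gathered while in the current state

  xml.each_char do |ch|
    case state
    when :text
      if ch == "<"
        # Leaving text: emit what was gathered, skipping pure whitespace.
        events << [:text, buffer] unless buffer.strip.empty?
        buffer = +""
        state  = :opening_tag # provisional; a leading "/" means closing tag
      else
        buffer << ch
      end
    when :opening_tag
      if ch == "/" && buffer.empty?
        state = :closing_tag
      elsif ch == ">"
        events << [:open, buffer]
        buffer = +""
        state  = :text
      else
        buffer << ch
      end
    when :closing_tag
      if ch == ">"
        events << [:close, buffer]
        buffer = +""
        state  = :text
      else
        buffer << ch
      end
    end
  end
  events
end

events = tokenize("<root><some-tag>some text</some-tag></root>")
# => [[:open, "root"], [:open, "some-tag"], [:text, "some text"],
#     [:close, "some-tag"], [:close, "root"]]
```

Even this toy version needs a fair amount of code before it can deal with attributes, which is part of why the replies here keep pointing back to REXML.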

--
   WBR, Peter Zotov

Arun Kumar wrote:

Peter Zotov wrote:

Quoting "Arun Kumar" <arunkumar@innovaturelabs.com>:

Hi,
Is there any way in Ruby to parse an xml file without using REXML or any
other libraries.

Of course. You can write a finite state machine, read XML from file
and parse as you want.

Thanks. Can u please give me details of it.

He told you to write a parser. That's the same as using REXML or any other library.

Why can't you use a library? REXML comes for free with Ruby, and is good enough in a pinch.

If the XML input is very stable, and it never changes, you can parse some of it with Regexp. That will break very easily, but it might be good enough for your needs.

Peter Zotov wrote:

Quoting "Arun Kumar" <arunkumar@innovaturelabs.com>:

Thanks. Can u please give me details of it.

It is described nicely in Wikipedia:
Finite-state machine - Wikipedia

As a clue, I can recommend you define following states: "text",
"opening tag", "tag attribute", "tag attribute value", "closing tag".
E. g. when you are in "text" state and get "<" symbol at input
sequence, you change state to "opening tag" or "closing tag"...
I still have one question: why you don't use REXML?

Hi,

The problem is that my boss does not want me to use any libraries to parse
xml. He also said to use regular expressions to extract the contents of
an xml tag. Can u please tell me how to do it. I'll be really grateful.

Regards
Arun Kumar

--
Posted via http://www.ruby-forum.com/.

Better question: Why *wouldn't* you want to use an existing library?
You'd have to spend months on your own before it even starts to make
sense to use such a custom solution over an existing, tested, and
heavily used library like libxml or Nokogiri (and to be fair,
Hpricot::XML, though it's more for HTML parsing than XML).

Jason

On Wed, Mar 25, 2009 at 1:52 PM, Peter Zotov <whitequark@whitequark.ru> wrote:

Quoting "Arun Kumar" <arunkumar@innovaturelabs.com>:

Peter Zotov wrote:

Quoting "Arun Kumar" <arunkumar@innovaturelabs.com>:

Hi,
Is there any way in Ruby to parse an xml file without using REXML or any
other libraries.

Of course. You can write a finite state machine, read XML from file
and parse as you want.

Thanks. Can u please give me details of it.

It is described nicely in Wikipedia:
Finite-state machine - Wikipedia

As a clue, I can recommend you define following states: "text", "opening
tag", "tag attribute", "tag attribute value", "closing tag". E. g. when you
are in "text" state and get "<" symbol at input sequence, you change state
to "opening tag" or "closing tag"...
I still have one question: why you don't use REXML?

--
WBR, Peter Zotov

Jason Roelofs wrote:

Of course. You can write a finite state machine, read XML from file

to "opening tag" or "closing tag"...
I still have one question: why you don't use REXML?

--
WBR, Peter Zotov

Better question: Why *wouldn't* you want to use an existing library?
You'd have to spend months on your own before it even starts to make
sense to use such a custom solution over an existing, tested, and
heavily used library like libxml or Nokogiri (and to be fair,
Hpricot::XML, though it's more for HTML parsing than XML).

Jason

Hi,
One problem is compatibility. I'm developing an application that
extracts the xml tags from a url like 'http://www.shoe-g.com/index.rdf'
and displays the contents within it. So compatibility is an issue. My
boss is strict about not using any complex libraries. Can u please help me.
Thanks once again

Regards
Arun Kumar

On Wed, Mar 25, 2009 at 1:52 PM, Peter Zotov <whitequark@whitequark.ru> wrote:

--
Posted via http://www.ruby-forum.com/.

Quoting "Arun Kumar" <arunkumar@innovaturelabs.com>:

Peter Zotov wrote:

Quoting "Arun Kumar" <arunkumar@innovaturelabs.com>:

Thanks. Can u please give me details of it.

It is described nicely in Wikipedia:
Finite-state machine - Wikipedia

As a clue, I can recommend you define following states: "text",
"opening tag", "tag attribute", "tag attribute value", "closing tag".
E. g. when you are in "text" state and get "<" symbol at input
sequence, you change state to "opening tag" or "closing tag"...
I still have one question: why you don't use REXML?

Hi,

The problem is that my boss does not want me to use any libraries to parse
xml. He also said to use regular expressions to extract the contents of
an xml tag. Can u please tell me how to do it. I'll be really grateful.

If you have, for example, this document:
----8<----
<?xml version="1.0" encoding="utf-8"?>
<root>
  <some-tag>some text</some-tag>
</root>
----8<----

you can extract the contents of the "some-tag" tag with this (the code assumes that the document is in the "document" variable):

document.match(/<some-tag>(.+?)<\/some-tag>/)[1]

But this will fail when a "some-tag" is nested inside another "some-tag", and when the tag has attributes. Of course, these variants can be anticipated and added to the regexp too, but this will make it very complicated.
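
As a rough sketch of what that looks like in practice, here is the pattern above next to a slightly more tolerant one. The second pattern and the extra line in the sample document are assumptions added for illustration, not something from the thread.

```ruby
# Sample document: the one above plus one tag carrying an attribute
# (the lang="en" line is an addition for this example).
document = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <root>
    <some-tag>some text</some-tag>
    <some-tag lang="en">more text</some-tag>
  </root>
XML

# The pattern from the post: matches only a bare <some-tag> without attributes.
plain = document.scan(/<some-tag>(.+?)<\/some-tag>/).flatten
# => ["some text"]

# A more tolerant variant: [^>]* skips over attributes, and the /m flag
# lets "." match line breaks inside the tag content.
tolerant = document.scan(/<some-tag\b[^>]*>(.*?)<\/some-tag>/m).flatten
# => ["some text", "more text"]
```

The tolerant pattern is still fragile: an attribute value containing ">", a comment, or a nested "some-tag" will give wrong results.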

Anyway, REXML is not _external_ library to Ruby. It's in stdlib!

--
   WBR, Peter Zotov

Arun Kumar wrote:

The problem is that my boss does not want me to use any libraries to parse xml. He also said to use regular expressions to extract the contents of an xml tag. Can u please tell me how to do it. I'll be really grateful.

Your boss is micromanaging you, and does not understand the relationship between Ruby, its libraries, and its programmers. Bosses generally should not prohibit valid techniques for bogus reasons.

That said, you could use "malicious compliance", and show her or him how fragile regular expressions are. (Write unit tests that fail for the wrong XML, for example.)

Or you could explain that REXML is not an _external_ library. It comes with Ruby, so it's "free" to use. You never need to download and install it...

One problem is compatibility. I'm developing an application that extracts the xml tags from a url like 'http://www.shoe-g.com/index.rdf' and displays the contents within it. So compatibility is an issue. My boss is strict about not using any complex libraries. Can u please help me. Thanks once again

I had a boss once who wouldn't let us use keyboards, because we might use them to type bugs in.

Sheesh...

Peter Zotov wrote:

But this will fail when a "some-tag" is nested inside another "some-tag", and
when the tag has attributes. Of course, these variants can be anticipated
and added to the regexp too

I don't see what you could add to the regexp to handle nested tags. You can't
really handle nested structures with regular expressions.
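
A small sketch of the nesting problem, with made-up sample strings. The helper outer_content is hypothetical and only tracks one tag name without attributes. It shows that some state (here a depth counter) is needed on top of the pattern to find the matching close tag.

```ruby
# Hypothetical helper: returns the content of the first outermost <tag>...</tag>
# by counting nesting depth. Only bare tags without attributes are recognised.
def outer_content(text, tag)
  pattern = /<(\/?)#{Regexp.escape(tag)}>/
  depth = 0
  start = nil
  pos   = 0
  while (m = pattern.match(text, pos))
    if m[1].empty?                    # opening tag
      start = m.end(0) if depth.zero?
      depth += 1
    else                              # closing tag
      depth -= 1
      return nil if depth.negative?   # stray closing tag
      return text[start...m.begin(0)] if depth.zero?
    end
    pos = m.end(0)
  end
  nil # unbalanced input, or the tag does not occur
end

nested   = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# Non-greedy stops at the first </div>, cutting the outer element short.
lazy = nested[/<div>(.+?)<\/div>/, 1]
# => "outer <div>inner"

# Greedy runs to the last </div>, swallowing the sibling element.
greedy = siblings[/<div>(.+)<\/div>/, 1]
# => "a</div><div>b"

# Counting depth finds the close tag that belongs to the first open tag.
balanced = outer_content(nested, "div")
# => "outer <div>inner</div> tail"
```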

Or you could explain that REXML is not an _external_ library. It comes with Ruby, so it's "free" to use. You never need to download and install it...

Next, REXML will break if you point it at nearly any website.

Elaborate sigh...

Can you demo Nokogiri, Hpricot, and Regexps to this boss??

Arun Kumar wrote:

Jason Roelofs wrote:
  

Better question: Why *wouldn't* you want to use an existing library?
You'd have to spend months on your own before it even starts to make
sense to use such a custom solution over an existing, tested, and
heavily used library like libxml or Nokogiri (and to be fair,
Hpricot::XML, though it's more for HTML parsing than XML).

Jason
    
Hi,
One problem is compatibility.

Compatibility with what?

I'm developing an application that extracts the xml tags from a url like 'http://www.shoe-g.com/index.rdf'
  
Yes, Nokogiri can read that. I'll bet Hpricot can, too -- maybe even REXML.

Maybe you can find an example for me of an XML document that Nokogiri (libxml) can't read?

My boss is strict about not using any complex libraries.

Either this is some sort of test or interview question, to make sure you understand regular expressions...

...or, your boss doesn't know what he's talking about. The whole reason to use Ruby is to save yourself work. Suppose you want the contents of each title tag, just as an example:

require 'mechanize'
# Mechanize releases of that era name the class WWW::Mechanize; later ones use plain Mechanize.
mech = WWW::Mechanize.new
mech.get 'http://www.shoe-g.com/index.rdf'
doc = Nokogiri(mech.page.body)
titles = (doc / 'title').map(&:text)

Ask your boss if it's really worth it to spend days or months trying to get it right, when you could be using five lines to download and parse it much more simply and accurately than a regular expression would allow.

And if your boss insists, even after seeing this, you might want to start looking for a new job -- that one won't last long.

One problem is compatibility. I'm developing an application that extracts the xml tags from a url like 'http://www.shoe-g.com/index.rdf' and displays the contents within it. So compatibility is an issue.

Between what and what? REXML is pure Ruby so it runs on all platforms.

My boss is strict about not using any complex libraries.

This has some implications

  - You can never tackle complex problems, because you always have to write everything from scratch - and you'll be late on *any* project plan with this approach.

  - Apparently your boss judges without knowing the facts (REXML and others are _not_ complex to _use_ as has been demonstrated).

  - Also it seems your boss's understanding of software engineering needs some serious improvement. Picking the right tool for a job is a significant part of it and has already saved tons of working hours all over the world. You build applications by plugging together self written and externally obtained components - that's the only economically viable way.

  - You cost him nothing or he does not care about how you spend your time.

This is really one of the most ridiculous things I have read in years. I could understand it if he argued about a steep learning curve or expensive commercial software - but a strict rejection of "complex libraries"?

Can u please help me.

Yes, update your resume and run for a better place to work.

Really, you would also be wasting everybody else's time by trying to extract all the details on how to do something manually which has been built already and which you get for free (i.e. with no extra charge or installation hassle).

Good luck!

  robert

On 25.03.2009 19:02, Arun Kumar wrote:

Quoting "Sebastian Hungerecker" <sepp2k@googlemail.com>:

Peter Zotov wrote:

But this will fail when a "some-tag" is nested inside another "some-tag", and
when the tag has attributes. Of course, these variants can be anticipated
and added to the regexp too

I don't see what you could add to the regexp to handle nested tags. You can't
really handle nested structures with regular expressions.

I think that you can use backreferences and combine the results afterwards from their split state, but I'm not sure.

--
   WBR, Peter Zotov

Regex is not stateful, thus you can't use it to parse XML. Oh there
are ways to hack yourself around some limitations and get some
results, but you are going to spend a TON of time making very
unreadable Regex that will die at the presence of the slightest malformed
XML. Your boss obviously has no idea what he's talking about. If
anything, use REXML because, as another poster said, it's a part of
Ruby and not an external library.

If your boss makes these kinds of requests often, you should probably
go looking for another job, IMO

Jason

Quoting Phlip <phlip2005@gmail.com>:

Or you could explain that REXML is not an _external_ library. It comes with Ruby, so it's "free" to use. You never need to download and install it...

Next, REXML will break if you point it at nearly any website.

Yes, the average HTML page will break it, but he needs to parse RDF (or probably Atom/RSS). These are almost always valid.

--
   WBR, Peter Zotov

Phlip wrote:

Or you could explain that REXML is not an _external_ library. It comes with Ruby, so it's "free" to use. You never need to download and install it...

Next, REXML will break if you point it at nearly any website.

Except the question was about XML, not HTML. And the document indicated:

http://www.shoe-g.com/index.rdf

seems to parse fine with rexml:

irb(main):001:0> require 'open-uri'
=> true
irb(main):002:0> require 'rexml/document'
=> true
irb(main):003:0> open('http://www.shoe-g.com/index.rdf'){|f| REXML::Document.new f}
=> <UNDEFINED> ... </>
irb(main):004:0> _.root
=> <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:dc='http://purl.org/dc/elements/1.1/' xmlns:sy='http://purl.org/rss/1.0/modules/syndication/' xmlns:admin='http://webns.net/mvcb/' xmlns:cc='http://web.resource.org/cc/' xmlns='http://purl.org/rss/1.0/'> ... </>

I'll second Nokogiri, but I think we agree on the main point: Use a library. Any library. The world does not need another hacked-together, home-grown, broken XML parser.

Here's a similar solution using REXML:

require 'open-uri'
require 'rexml/document'

# URI.open is the current open-uri entry point; plain open() no longer accepts URLs since Ruby 3.0.
body = URI.open('http://www.shoe-g.com/index.rdf').read
doc = REXML::Document.new body
titles = doc.elements.to_enum(:each, '//title').map(&:text)

Five lines as well...
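
For trying this without network access, here is a sketch of the same idea on an inline string. The sample document is a made-up stand-in for the feed, not its real contents, and the lookup uses REXML::XPath.match instead of the enumerator form above.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the downloaded feed body.
body = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <feed>
    <channel>
      <title>Example feed</title>
    </channel>
    <item>
      <title lang="en">First &amp; second</title>
    </item>
  </feed>
XML

doc = REXML::Document.new(body)

# Collect the text of every <title> element, in document order.
titles = REXML::XPath.match(doc, '//title').map(&:text)
# => ["Example feed", "First & second"]
```

Note that the attribute on the second title and the &amp; entity are handled by the parser, which is exactly the part a hand-written regexp would have to reimplement.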

Cheers

  robert

On 25.03.2009 20:10, David Masover wrote:

require 'mechanize'
mech = WWW::Mechanize.new
mech.get 'http://www.shoe-g.com/index.rdf'
doc = Nokogiri(mech.page.body)
titles = (doc / 'title').map(&:text)