Then when I traverse the document and query Element#raw, it does say
'true' for these tags, but it still appears that they have been parsed,
and I can't get at the raw text.
e.get_text.value
returns the same thing as
e.text
Is there another way one is supposed to use to get at the raw text?
Thanks,
T.
Sigh, I just realized I misunderstood what raw meant -- it just
relates to entity parsing. Why am I getting the feeling that there is
no way to prevent parsing of the body of an element? I pray this is not
the case, b/c it means back to the drawing board for something like the
13th time :-(. But if it is the case, can anyone recommend another XML
parser that can do this?
None that I know of. The problem is this: where do you continue from,
and how do you know if not by parsing?
Ari
(In reply to trans.'s message above, sent Sat, 2005-01-22 at 09:55 +0900.)
Well, I just want to specify a tag and have anything in that tag
left verbatim. That's all really. I'm trying to find info on libxml
bindings (rather difficult to find, it seems), though I have a feeling
that won't work either.
If worst comes to worst, I'll whip out old trusty Tagiter and see if that
will do. Otherwise I'll have to roll my own. Just what I need: more
work!
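For what it's worth, here is a minimal sketch of the "leave this tag verbatim" idea that sidesteps the parser entirely. It also gives one possible answer to "where do you continue from": scan forward to the first literal closing tag of the same name and treat everything in between as opaque text. The method name verbatim_bodies is made up for illustration and is not part of REXML or any other library. The sketch assumes the tag is never nested inside itself, is not self-closing, and has no ">" inside its attribute values.

```ruby
# Hypothetical helper (not a REXML API): return the untouched text between
# <tag ...> and the first following </tag>, once per occurrence of the tag.
# Assumptions: the tag is not nested inside itself, is not self-closing,
# and no attribute value contains ">".
def verbatim_bodies(xml, tag)
  name = Regexp.escape(tag)
  # (?:\s[^>]*)? allows optional attributes; /m lets "." cross line breaks.
  pattern = /<#{name}(?:\s[^>]*)?>(.*?)<\/#{name}\s*>/m
  xml.scan(pattern).flatten
end

sample = '<doc><p>parsed</p><code a="1">if x < 1 && <b>raw</b></code></doc>'
p verbatim_bodies(sample, "code")
```

The body comes back exactly as written, including the bare "<" and "&&" that a conforming XML parser would reject. The rest of the document could then be handed to a real parser with those bodies swapped for placeholders.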
REXML should be pretty easy to manipulate or add functions to. Why roll your own when you can just add a new behavior?
Thanks Ari.
Here's a micro xml-parser (posted via Google, so the
indentation has been removed):
# Produces array of nonmatching and matching
# substrings. The size of the array will
# always be an odd number. The first and the
# last item will always be nonmatching.
def shatter( s, re )
s.gsub( re, "\1"+'\&'+"\1" ).split("\1", -1)
end
def get_attr( s )
h = Hash.new
while s =~ /(\w+)="([^"]*)"/
h[$1] = $2
s = $'
end
h
end
def tag_name( s )
if ( s =~ /^<(\S+)(\s|>)/ )
$1
else
nil
end
end
s = ''
$<.each_line {|x| s=s+x}
all = shatter( s, /<[^>]*>/ )
all.each {|x|
x.chomp!
if x.size > 0
print x
tname = tag_name(x)
print " | " + tname if tname
print "\n"
attr = get_attr( x )
if attr.size > 0
attr.each_pair {|key,val| puts "....#{key}-->#{val}" }
end
end
}
With this input:
<?xml version="1.0" encoding="UTF-8"?>
<tv><programme start="20041218204000 +1000"
stop="20041218225000+1000" channel="Network TEN Brisbane">
<title>The Frighteners</title>
<sub-title/><desc>A psychic private detective, who
consorts with deceased souls, becomes engaged in a mystery as members
of the town community begin dying mysteriously.</desc>
<rating system="ABA"><value>M</value></rating><length
units="minutes">130</length><category>Horror</category></programme>
the output is:
<?xml version="1.0" encoding="UTF-8"?> | ?xml
.....encoding-->UTF-8
.....version-->1.0
<tv> | tv
<programme start="20041218204000 +1000"
stop="20041218225000+1000" channel="Network TEN Brisbane"> | programme
.....stop-->20041218225000+1000
.....start-->20041218204000 +1000
.....channel-->Network TEN Brisbane
<title> | title
The Frighteners
</title> | /title
<sub-title/> | sub-title/
<desc> | desc
A psychic private detective, who
consorts with deceased souls, becomes engaged in a mystery as members
of the town community begin dying mysteriously.
</desc> | /desc
<rating system="ABA"> | rating
.....system-->ABA
<value> | value
M
</value> | /value
</rating> | /rating
<length
units="minutes"> | length
.....units-->minutes
130
</length> | /length
<category> | category
Horror
</category> | /category
</programme> | /programme
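To make the shatter idea concrete, here is a small sketch of the same splitting done with String#split and a capture group, which keeps the matched delimiters in the result. The name shatter_split is made up for this example, and it assumes the regex passed in has no capture groups of its own. One thing to be aware of: String#split without a limit drops trailing empty strings, so the "always an odd number" guarantee only holds in every case when a negative limit is passed.

```ruby
# Hypothetical variant of shatter: split on the pattern but keep the matches.
# Wrapping the pattern in a capture group makes split return the delimiters;
# the limit of -1 keeps a trailing empty string, so the result always
# alternates non-match, match, non-match, ... and has an odd length.
# Assumes `re` contains no capture groups of its own.
def shatter_split(s, re)
  s.split(/(#{re})/, -1)
end

TAG = /<[^>]*>/

p shatter_split("a<b>c</b>d", TAG)  # ["a", "<b>", "c", "</b>", "d"]
p shatter_split("<a>x</a>", TAG)    # ["", "<a>", "x", "</a>", ""]
```

Items at even indexes are text and items at odd indexes are tags, so a caller can tell them apart by position instead of re-testing each piece.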
Hey, thanks! Not sure if I'll end up using it, since I just spent last night
writing a general-purpose stack-based parser. But I'll keep it for
reference.
Love the method name #shatter, BTW.
T.
P.S. FYI, I figured out that you can just use a "margin" character in
order to preserve indentation. For example, I'm using Google Groups now
too:
: class A
:   def shatter
:     # ...
:   end
: end
As to which character you like best, that's your call ;-).
Also, I know there is a way to set a Google group to fixed-font
mode (I manage a group, and there is that option), but I don't know who
manages this group and would thus be able to set it.
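For anyone pasting margin-prefixed code back into an editor, a rough sketch of undoing the trick might look like this. The helper name strip_margin is invented for the example. It assumes every line starts with the same marker, followed by at most one space that belongs to the margin rather than to the code.

```ruby
# Hypothetical helper: remove a leading margin marker (such as ":" or "..")
# and at most one following space from each line, keeping whatever
# indentation comes after it.
def strip_margin(text, marker)
  prefix = /^#{Regexp.escape(marker)} ?/
  text.lines.map { |line| line.sub(prefix, "") }.join
end

quoted = ": class A\n:   def shatter\n:     # ...\n:   end\n: end\n"
puts strip_margin(quoted, ":")
```

Lines that do not start with the marker pass through unchanged, so running it over a whole message leaves the surrounding prose alone.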
.. class String
..   # Produces array of nonmatching and matching
..   # substrings. The size of the array will
..   # always be an odd number. The first and the
..   # last item will always be nonmatching.
..   def shatter( re )
..     self.gsub( re, "\1"+'\&'+"\1" ).split("\1", -1)
..   end
..   def xml_parse
..     self.shatter( /<[^>]*>/ )
..   end
..   def get_attr
..     s = self
..     while s =~ /(\w+)="([^"]*)"/m
..       yield( $1, $2 )
..       s = $'
..     end
..   end
..   def tag_name
..     if ( self =~ /^<(\S+)(\s|>)/ )
..       $1
..     else
..       nil
..     end
..   end
..   def span( tagname )
..     s = self
..     while (s =~ Regexp.new(
..       '(<'+tagname+'.*?>)(.*?)</'+tagname+'>',
..       Regexp::MULTILINE))
..       yield( $1, $2 )
..       s = $'
..     end
..   end
.. end
..
.. s = ''
.. $<.each_line {|x| s=s+x}
.. s.span('programme') { |tag,string|
..   string.span('title') {|junk,title|
..     puts 'Title: ' + title.chomp.gsub(/\n/,' ')
..   }
..   string.span('length') {|tag,len|
..     tag.get_attr {|key,val| @units=val }
..     puts "Length: #{len.chomp} #{@units}"
..   }
.. }
With the input
<?xml version="1.0" encoding="UTF-8"?>
<tv><programme start="20041218204000 +1000" stop="20041218225000
+1000" channel="Network TEN Brisbane"><title>The
Frighteners</title><sub-title/><desc>A psychic private detective, who
consorts with deceased souls, becomes engaged in a mystery as members
of the town community begin dying mysteriously.</desc><rating
system="ABA"><value>M</value></rating><length
units="minutes">130</length><category>Horror</category></programme><programme
start="20041218080000 +1000" stop="20041218083000 +1000"
channel="Network TEN Brisbane"><title>Worst Best
Friends</title><sub-title>Better Than Glen</sub-title><desc>Life's
like that for Roger Thesaurus - two of his best friends are also his
worst enemies!</desc><rating
system="ABA"><value>C</value></rating><length
units="minutes">30</length><category>Children</category></programme>
the output is
Title: The Frighteners
Length: 130 minutes
Title: Worst Best Friends
Length: 30 minutes
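If the extracted values are needed as data rather than printed, the same regex-per-tag approach can be sketched to return one hash per programme. This is only an illustration under the same assumptions as the script above (well-behaved input, no nested tags of the same name). The method name programmes and the hash keys are made up here, and the sample input is a cut-down version of the listing.

```ruby
# Hypothetical sketch: collect title, length and units for each <programme>
# into a hash instead of printing them. Not a general XML parser; it assumes
# flat, well-behaved input like the listing above.
def programmes(xml)
  xml.scan(/<programme\b[^>]*>(.*?)<\/programme>/m).map do |(body)|
    title  = body[/<title>(.*?)<\/title>/m, 1]
    length = body[/<length\b[^>]*>(.*?)<\/length>/m, 1]
    units  = body[/<length\b[^>]*\bunits="([^"]*)"/m, 1]
    # Collapse the line breaks introduced by wrapping inside the title.
    { title: title && title.gsub(/\s+/, " ").strip, length: length, units: units }
  end
end

xml = <<~XML
  <tv><programme start="1"><title>The
  Frighteners</title><sub-title/><length
  units="minutes">130</length></programme><programme
  start="2"><title>Worst Best Friends</title><length units="minutes">30</length></programme></tv>
XML

programmes(xml).each do |prog|
  puts "Title: #{prog[:title]}"
  puts "Length: #{prog[:length]} #{prog[:units]}"
end
```

Because "<title>" is matched with the "<" directly in front of it, "<sub-title>" is not picked up by mistake.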