I have a problem. I'm trying to use Ruby to parse some JMeter test results
that are stored in XML files. Unfortunately, while they should be UTF-8,
some of them aren't, probably because some of the database data isn't. In
any case, this breaks other tools, such as XSLT transformations and
anything else that relies on the XML files being UTF-8.
Does anyone know how to get rid of such characters? When opened in an
editor like Kate, they show up as a white question mark in a black square.
I don't care much about the data: if it's missing some characters, nobody
will mind. The point is to keep the XML structure intact so that the other
tools can work. Any help will be greatly appreciated.
Look at Vectoring Ruby On Rails: Encoding problems,
particularly the "iconvert" method which attempts conversion to UTF-8,
but in the case where the string cannot be converted to UTF-8 (e.g.
double-byte chars) then it replaces the chars with "?".
-- Mark.
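For readers on a current Ruby, the replace-with-"?" behaviour described above can be approximated without Iconv. This is only a sketch, not the linked article's "iconvert" method: it assumes Ruby 2.1 or later, where String#scrub exists (it did not when this thread was written), and replace_invalid_utf8 is a hypothetical helper name.

```ruby
# Hypothetical helper, not from the linked article: treat the input as UTF-8
# and replace every invalid byte sequence with a marker (default "?").
# Assumes Ruby 2.1+ for String#scrub.
def replace_invalid_utf8(raw, marker = "?")
  raw.dup.force_encoding(Encoding::UTF_8).scrub(marker)
end

# "\xE9" is a lone Latin-1 byte, which is not valid UTF-8 on its own.
xml = "<sample>caf\xE9</sample>"
puts replace_invalid_utf8(xml)   # prints "<sample>caf?</sample>"
```

Because only invalid byte sequences are touched, the ASCII markup around the text nodes is left alone, which is the property the original question asks for.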
···
On Sep 17, 5:07 am, Krzysieq <krzys...@gazeta.pl> wrote:
Hi,
I have a problem. I'm trying to parse with ruby some test results from
jmeter, that are stored in xml files. Unfortunately, while they should be
utf-8, some of them aren't. Probably because some db data isn't. In any
case, this makes other toys break down, like xslt transformation and
anything else that relies on the xml files being utf-8.
You can have bytes in the 0x80-0xFF range as part of a well-formed UTF-8 byte sequence (as lead or continuation bytes). They just can't stand alone as a single-byte character. It's just not that simple.
This is the approach we have taken in some of our code; basically, we wanted
to replicate the 'iconv -c' behavior. Does TRANSLIT do this? I've never used
that mode before.
require 'iconv'
require 'stringio'

module UTF8
  module Cleanable
···
On Wed, Sep 17, 2008 at 09:44:23PM +0900, James Gray wrote:
On Sep 17, 2008, at 4:07 AM, Krzysieq wrote:
I have a problem. I'm trying to parse with ruby some test results from
jmeter, that are stored in xml files. Unfortunately, while they should be
utf-8, some of them aren't. Probably because some db data isn't. In any
case, this makes other toys break down, like xslt transformation and
anything else that relies on the xml files being utf-8.
Does anyone know, how to get rid of such characters?
If you can figure out the encoding they are actually in, I recommend using
Iconv's transliterate mode:
    #
    # Converts the string representation of this class to a utf8 clean
    # string. This assumes that #to_s on the object will result in a utf8
    # string. All chars that are not valid utf8 char sequences will be
    # silently dropped.
    #
    def utf8_clean
      Iconv.open( "UTF-8", "UTF-8" ) do |iconv|
        output = StringIO.new
        working = self.to_s
        loop do
          begin
            output.print iconv.iconv( working )
            break
          rescue Iconv::IllegalSequence => is
            output.print is.success
            working = is.failed[1..-1]
          end
        end
        return output.string
      end
    end
  end
end
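Iconv is no longer part of Ruby's standard library, so the module above will not load on a current interpreter without the separate iconv gem. A rough equivalent of the same "drop what is invalid" idea might look like the sketch below. It assumes Ruby 2.1+ for String#scrub; ScrubCleanable and Note are hypothetical names used only for illustration.

```ruby
module UTF8
  # Sketch of a Cleanable variant without Iconv; assumes Ruby 2.1+.
  module ScrubCleanable
    # Returns #to_s as a UTF-8 string with invalid byte sequences dropped.
    def utf8_clean
      to_s.dup.force_encoding(Encoding::UTF_8).scrub("")
    end
  end
end

# Hypothetical class used only to exercise the mixin.
Note = Struct.new(:text) do
  include UTF8::ScrubCleanable

  def to_s
    text
  end
end

puts Note.new("ab\xFFcd").utf8_clean   # prints "abcd"
```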
Unfortunately, there's no way of telling the original encoding. I would
rather go for some method of removing or substituting the characters that
don't belong there, but the method first suggested by Brian doesn't seem to
work for some reason. Does anyone have another option? I'm investigating the
reasons for the failure and will write more when I know more. Thanks for all
the help anyway.
Cheers,
Chris
···
2008/9/17 James Gray <james@grayproductions.net>
On Sep 17, 2008, at 4:07 AM, Krzysieq wrote:
I have a problem. I'm trying to parse with ruby some test results from
jmeter, that are stored in xml files. Unfortunately, while they should be
utf-8, some of them aren't. Probably because some db data isn't. In any
case, this makes other toys break down, like xslt transformation and
anything else that relies on the xml files being utf-8.
Does anyone know, how to get rid of such characters?
If you can figure out the encoding they are actually in, I recommend using
Iconv's transliterate mode:
Thanks for the input. So do you have another suggestion?
Cheers,
Chris
···
2008/9/17 Rob Biedenharn <Rob@agileconsultingllc.com>
On Sep 17, 2008, at 5:15 AM, Brian Candler wrote:
If you really don't care about the content:
str.gsub(/[\x80-\xff]/,'?')
--
You can have bytes in that range as the first byte of a well-formed UTF-8
Byte Sequence. They just can't represent a single byte. It's just not that
simple.
If you really don't care about the content:
str.gsub(/[\x80-\xff]/,'?')
--
You can have bytes in that range as the first byte of a well-formed
UTF-8 Byte Sequence. They just can't represent a single byte. It's
just not that simple.
That's why I said "if you really don't care" ... it strips all valid
non-ASCII UTF8 as well as invalid.
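The trade-off between the two positions above can be seen in a small sketch. It assumes Ruby 2.0 or later (for String#b), where the pattern has to be applied to a binary copy of the string; mask_high_bytes is a hypothetical name.

```ruby
# Byte-level version of the gsub idea: every byte >= 0x80 becomes "?".
# Assumes Ruby 2.0+ for String#b; the /n flag makes the regexp binary.
def mask_high_bytes(str)
  str.b.gsub(/[\x80-\xff]/n, "?").force_encoding(Encoding::UTF_8)
end

valid   = "caf\u00E9"   # well-formed UTF-8: the last character is bytes C3 A9
invalid = "caf\xE9"     # a lone Latin-1 byte, not valid UTF-8

puts mask_high_bytes(valid)     # prints "caf??"
puts mask_high_bytes(invalid)   # prints "caf?"
```

The result is always plain ASCII, so it is valid UTF-8, but every legitimate non-ASCII character is lost along with the broken ones.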
TRANSLIT just works a little harder and tries to convert your
characters into a series of UTF-8 chars if possible.
I'm not sure if it drops chars that can't be transliterated...
-greg
···
On Wed, Sep 17, 2008 at 12:47 PM, Jeremy Hinegardner <jeremy@hinegardner.org> wrote:
On Wed, Sep 17, 2008 at 09:44:23PM +0900, James Gray wrote:
On Sep 17, 2008, at 4:07 AM, Krzysieq wrote:
I have a problem. I'm trying to parse with ruby some test results from
jmeter, that are stored in xml files. Unfortunately, while they should be
utf-8, some of them aren't. Probably because some db data isn't. In any
case, this makes other toys break down, like xslt transformation and
anything else that relies on the xml files being utf-8.
Does anyone know, how to get rid of such characters?
If you can figure out the encoding they are actually in, I recommend using
Iconv's transliterate mode:
This is the approach we have take on some of our code, basically we wanted to
replicate the 'iconv -c' behavior. Does TRANSLIT do this ? I've never used
that mode before.
module UTF8
module Cleanable
#
# Converts the string representation of this class to a utf8 clean
# string. This assumes that #to_s on the object will result in a utf8
# string. All chars that are not valid utf8 char sequences will be
# silently dropped.
//TRANSLIT is better than that. It tries to transliterate the characters, so
a UTF-8 ellipsis would become three periods if converted to ISO-8859-1 with
//TRANSLIT.
You can mimic -c, though: just use //IGNORE instead of //TRANSLIT. You can
even use //TRANSLIT//IGNORE, which transliterates what it can and discards
the rest.
James Edward Gray II
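On a Ruby without Iconv, something similar to the two modes can be sketched with String#encode. This is an approximation under stated assumptions, not Iconv's behaviour: Ruby has no built-in transliteration table, so TRANSLIT_TABLE below is a tiny hypothetical one, and both helper names are made up for illustration.

```ruby
# Hypothetical, deliberately tiny table; iconv's //TRANSLIT has its own rules.
TRANSLIT_TABLE = {
  "\u2026" => "...",  # horizontal ellipsis
  "\u201C" => '"',    # left double quotation mark
  "\u201D" => '"'     # right double quotation mark
}.freeze

# Roughly like //TRANSLIT: characters missing from ISO-8859-1 go through the
# table, and anything not in the table becomes "?".
def to_latin1_translit(str)
  str.encode(Encoding::ISO_8859_1,
             fallback: ->(ch) { TRANSLIT_TABLE.fetch(ch, "?") })
end

# Roughly like //IGNORE: characters missing from ISO-8859-1 are dropped.
def to_latin1_ignore(str)
  str.encode(Encoding::ISO_8859_1, undef: :replace, replace: "")
end

puts to_latin1_translit("Wait\u2026")   # prints "Wait..."
puts to_latin1_ignore("Wait\u2026")     # prints "Wait"
```

Both helpers expect well-formed UTF-8 input; as written, a string containing invalid bytes raises Encoding::InvalidByteSequenceError unless `invalid: :replace` is added as well.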
···
On Sep 17, 2008, at 11:47 AM, Jeremy Hinegardner wrote:
On Wed, Sep 17, 2008 at 09:44:23PM +0900, James Gray wrote:
On Sep 17, 2008, at 4:07 AM, Krzysieq wrote:
I have a problem. I'm trying to parse with ruby some test results from
jmeter, that are stored in xml files. Unfortunately, while they should be
utf-8, some of them aren't. Probably because some db data isn't. In any
case, this makes other toys break down, like xslt transformation and
anything else that relies on the xml files being utf-8.
Does anyone know, how to get rid of such characters?
If you can figure out the encoding they are actually in, I recommend using
Iconv's transliterate mode:
This is the approach we have take on some of our code, basically we wanted to
replicate the 'iconv -c' behavior. Does TRANSLIT do this ? I've never used
that mode before.
If there is no way of telling the original encoding, the input data
may not have valid unicode in it at all, right?
-greg
···
On Thu, Sep 18, 2008 at 9:25 AM, Krzysieq <krzysieq@gazeta.pl> wrote:
Unfortunately, there's no way telling the original encoding. I would rather
go for some method of removing / substituting the chars that don't belong
there, but the method first suggested by Brian doesn't seem to work for some
reason. Does anyone have another option? I'm investigating the reasons of
failure, I will write more when I know something more. Thanks for all help
anyways
Try the iconv solutions with latin-1 (iso-8859-1) as the From. That's
as close as you can get to a one-byte "anything-goes" encoding.
-Mark.
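A variation on this suggestion, sketched for a current Ruby rather than Iconv: keep whatever is already valid UTF-8 and reinterpret only the invalid bytes as ISO-8859-1. This is an assumption about what would suit mixed data like the files described here, not something proposed in the thread; latin1_fallback is a hypothetical name, and String#scrub needs Ruby 2.1+.

```ruby
# Keep valid UTF-8 as is; treat each invalid byte run as ISO-8859-1 and
# convert just that run to UTF-8. Assumes Ruby 2.1+ for String#scrub.
def latin1_fallback(raw)
  raw.dup.force_encoding(Encoding::UTF_8).scrub do |bad|
    bad.encode(Encoding::UTF_8, Encoding::ISO_8859_1)
  end
end

# First "e acute" is a lone Latin-1 byte (E9), second is proper UTF-8 (C3 A9).
mixed = "caf\xE9 \xC3\xA9"
fixed = latin1_fallback(mixed)
puts fixed.valid_encoding?   # prints "true"
```

Converting the whole file from ISO-8859-1 instead would always produce valid UTF-8, but it would also turn each correctly encoded multi-byte character into two or more wrong ones.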
···
On Sep 18, 9:25 am, Krzysieq <krzys...@gazeta.pl> wrote:
Unfortunately, there's no way telling the original encoding. I would rather
go for some method of removing / substituting the chars that don't belong
there, but the method first suggested by Brian doesn't seem to work for some
reason. Does anyone have another option?
OK, I tried all the previous suggestions, and none of them worked (the gsub
idea, TRANSLIT, IGNORE, or the one from the link posted by Mark Thomas). In
fact, the last two don't seem to have done anything, while gsub seems to do
too much: it appears to have damaged the XML structure in some way, which
seems very strange to me. I don't really care about the data inside, but I
need the XML to remain valid.
@Gregory - that's true, it may not. However, the places where I found the
funny characters are text nodes inside XML documents, and there aren't that
many of them. Admittedly, one is enough to break the whole thing, but
typically there are very few, and it looks more like corrupted database
data. I think they store newspaper articles or pieces of news there. I
learned from the team that maintains that database in their app that it
should typically all be ISO-8859-1, but for some reason that's not always
the case. Hence the corrupted-data idea seems quite likely.
Thanks for any help you can provide.
Cheers,
Chris
···
2008/9/18 Mark Thomas <mark@thomaszone.com>
On Sep 18, 9:25 am, Krzysieq <krzys...@gazeta.pl> wrote:
> Unfortunately, there's no way telling the original encoding. I would rather
> go for some method of removing / substituting the chars that don't belong
> there, but the method first suggested by Brian doesn't seem to work for some
> reason. Does anyone have another option?
Try the iconv solutions with latin-1 (iso-8859-1) as the From. That's
as close as you can get to a one-byte "anything-goes" encoding.
How is the XML file created? If you know in advance which parts of the
XML come from the database, wrap those sections in CDATA blocks and
your XML will remain valid.
Silly answer, but what is $KCODE? I'm relatively new to Ruby, so this tells
me nothing... And as you might have guessed, no, I haven't set it. What does
it do?
Cheers,
Chris
···
2008/9/19 Gregory Brown <gregory.t.brown@gmail.com>
On Fri, Sep 19, 2008 at 7:07 AM, Krzysieq <krzysieq@gazeta.pl> wrote:
> Thanks for any help You can provide me with
Silly question, but did you set $KCODE = "U" while processing your data?
On Fri, Sep 19, 2008 at 9:00 AM, Krzysieq <krzysieq@gazeta.pl> wrote:
Sill answer, but what is $KCODE ?? I'm relatively new to Ruby, so this tells
me nothing... And as You might have guessed, no, I haven't set it. What's it
do?