How to clean XML files of non-UTF-8 chars?

Krzysieq · 17 September 2008 09:07

Hi,

I have a problem. I'm trying to parse with ruby some test results from
jmeter, that are stored in xml files. Unfortunately, while they should be
utf-8, some of them aren't. Probably because some db data isn't. In any
case, this makes other toys break down, like xslt transformation and
anything else that relies on the xml files being utf-8.

Does anyone know how to get rid of such characters? When opened in an
editor like Kate, they show up as a white question mark in a black square.
I don't care much about the data - if it's missing some chars, nobody
will mind. The point is not to destroy the xml structure, so that the other
tools can do their work. Any help will be greatly appreciated.

Cheers,
Chris

Brian_Candler · 17 September 2008 09:15

If you really don't care about the content:
str.gsub(/[\x80-\xff]/,'?')
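A sketch of what this one-liner amounts to on a current Ruby (1.9 or later), where every string carries an encoding: the data has to be treated as raw bytes first, and the regexp needs the /n flag. The helper name ascii_only is made up for illustration. Note that it flattens valid multi-byte UTF-8 characters as well, one "?" per byte.

```ruby
# Hypothetical helper, not from the thread: replace every byte >= 0x80 with "?".
def ascii_only(data)
  # String#b returns a binary (ASCII-8BIT) copy, so the byte range is well defined.
  data.b.gsub(/[\x80-\xff]/n, "?").force_encoding(Encoding::UTF_8)
end

ascii_only("caf\xC3\xA9")   # the two bytes of a valid "é" are replaced too
```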


Mark_Thomas · 17 September 2008 12:42

Look at Vectoring Ruby On Rails: Encoding problems,
particularly the "iconvert" method which attempts conversion to UTF-8,
but in the case where the string cannot be converted to UTF-8 (e.g.
double-byte chars) then it replaces the chars with "?".

-- Mark.


James_Edward_Gray_II · 17 September 2008 12:44

If you can figure out the encoding they are actually in, I recommend using Iconv's transliterate mode:

require "iconv"
Iconv.conv("UTF-8//TRANSLIT", old_encoding_name, data)

James Edward Gray II
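Iconv left Ruby's standard library in 2.0 (it lives on as the iconv gem). On a current Ruby, String#encode covers the same case when the real source encoding is known. A minimal sketch: to_utf8 is a hypothetical helper, and ISO-8859-1 in the usage line is only an example source encoding. String#encode does not transliterate the way //TRANSLIT does; here anything that cannot be converted simply becomes "?".

```ruby
# Hypothetical helper: convert data from a known encoding to UTF-8.
def to_utf8(data, old_encoding_name)
  # The second argument tells encode how to interpret the bytes,
  # regardless of the encoding the string is currently tagged with.
  data.encode(Encoding::UTF_8, old_encoding_name,
              invalid: :replace, undef: :replace, replace: "?")
end

to_utf8("caf\xE9".b, "ISO-8859-1")   # Latin-1 byte 0xE9 becomes a UTF-8 "é"
```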


Rob_Biedenharn1 · 17 September 2008 12:26

You can have bytes in that range as the lead or continuation bytes of a well-formed UTF-8 byte sequence. They just can't stand alone as a single-byte character. It's just not that simple.

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com

···

On Sep 17, 2008, at 5:15 AM, Brian Candler wrote:

If you really don't care about the content:
str.gsub(/[\x80-\xff]/,'?')
--

Jeremy_Hinegardner · 17 September 2008 16:47

This is the approach we have taken in some of our code; basically we wanted to
replicate the 'iconv -c' behavior. Does TRANSLIT do this? I've never used
that mode before.

  # Iconv ships with Ruby 1.8/1.9; from Ruby 2.0 on it is a separate gem.
  require "iconv"
  require "stringio"

  module UTF8
    module Cleanable
      #
      # Converts the string representation of this class to a utf8 clean
      # string. This assumes that #to_s on the object will result in a utf8
      # string. All chars that are not valid utf8 char sequences will be
      # silently dropped.
      #
      def utf8_clean
        Iconv.open( "UTF-8", "UTF-8" ) do |iconv|
          output = StringIO.new
          working = self.to_s
          loop do
            begin
              # Succeeds only when the rest of the input is clean.
              output.print iconv.iconv( working )
              break
            rescue Iconv::IllegalSequence => is
              # Keep what converted so far, then skip the first offending
              # byte (Ruby 1.8 indexes strings by byte) and try again.
              output.print is.success
              working = is.failed[1..-1]
            end
          end
          return output.string
        end
      end
    end
  end

  class String
    include UTF8::Cleanable
  end

enjoy,

-jeremy

--

Jeremy Hinegardner jeremy@hinegardner.org
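On Ruby 2.1 and later the same 'iconv -c' behaviour is available without Iconv through String#scrub. A sketch, assuming the input is meant to be UTF-8; utf8_clean is written here as a standalone function rather than as the mixin from this post.

```ruby
# Drop every byte sequence that is not valid UTF-8 (assumes UTF-8 was intended).
def utf8_clean(text)
  # dup so the caller's string is not retagged; scrub("") deletes invalid bytes.
  text.to_s.dup.force_encoding(Encoding::UTF_8).scrub("")
end

utf8_clean("ab\xFFcd".b)   # the stray 0xFF byte is removed
```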

Krzysieq · 18 September 2008 13:25

Unfortunately, there's no way of telling the original encoding. I would rather
go for some method of removing / substituting the chars that don't belong
there, but the method first suggested by Brian doesn't seem to work for some
reason. Does anyone have another option? I'm investigating the reasons for the
failure and will write more when I know more. Thanks for all the help
anyway.

Cheers,
Chris


Krzysieq · 17 September 2008 12:44

Hey,

Thanks for the input. So do you have another suggestion?

Cheers,
Chris

···

2008/9/17 Rob Biedenharn <Rob@agileconsultingllc.com>

On Sep 17, 2008, at 5:15 AM, Brian Candler wrote:

If you really don't care about the content:

str.gsub(/[\x80-\xff]/,'?')
--

You can have bytes in that range as the first byte of a well-formed UTF-8
Byte Sequence. They just can't represent a single byte. It's just not that
simple.

-Rob

Rob Biedenharn http://agileconsultingllc.com
Rob@AgileConsultingLLC.com

Brian_Candler · 17 September 2008 13:33

Rob Biedenharn wrote:

···

On Sep 17, 2008, at 5:15 AM, Brian Candler wrote:

If you really don't care about the content:
str.gsub(/[\x80-\xff]/,'?')
--

You can have bytes in that range as the first byte of a well-formed
UTF-8 Byte Sequence. They just can't represent a single byte. It's
just not that simple.

That's why I said "if you really don't care" ... it strips all valid
non-ASCII UTF8 as well as invalid.

There is a nice table at UTF-8 - Wikipedia which would
let you build something more accurate. Ruby quiz perhaps?
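A sketch of what "something more accurate" could look like, built from the usual table of well-formed UTF-8 byte sequences (the ranges below follow the Unicode standard's table). It keeps every well-formed character and replaces only stray bytes. The constant and method names are made up for illustration, and the code assumes Ruby 1.9 or later.

```ruby
# Group 1 matches one well-formed UTF-8 character; otherwise "." eats one stray byte.
UTF8_CHAR_OR_STRAY_BYTE = /
  (
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
  )
  | .
/nx

# Hypothetical helper: replace only the bytes that are not part of valid UTF-8.
def replace_invalid_utf8(data, replacement = "?")
  cleaned = data.b.gsub(UTF8_CHAR_OR_STRAY_BYTE) do
    Regexp.last_match(1) || replacement.b
  end
  cleaned.force_encoding(Encoding::UTF_8)
end

replace_invalid_utf8("caf\xC3\xA9 ab\xFFcd".b)   # "é" survives, 0xFF becomes "?"
```

Unlike the blanket byte-range gsub, this leaves valid non-ASCII text alone, and overlong forms and encoded surrogates count as invalid.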

Greg_Brown1 · 17 September 2008 18:31

To silently drop chars with Iconv, you'd want to do:

Iconv.conv("UTF-8//IGNORE", old_encoding_name, data)

TRANSLIT just works a little harder and tries to convert your
characters into a series of UTF-8 chars if possible.
I'm not sure if it drops chars that can't be transliterated...

-greg


James_Edward_Gray_II · 17 September 2008 18:35

//TRANSLIT is better than that. It tries to translate the characters. Thus a UTF-8 ellipsis would become three periods if converted to ISO-8859-1 with //TRANSLIT.

You can mimic -c though, just use //IGNORE instead of //TRANSLIT. You can even do //TRANSLIT//IGNORE which transliterates what it can and discards the rest.

James Edward Gray II
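String#encode on a current Ruby has no built-in transliteration, but its fallback: option allows a rough imitation of //TRANSLIT//IGNORE-style behaviour. A sketch only: the substitution table and the helper name are invented for illustration and cover far fewer characters than iconv does. Characters missing from the table become "?" here rather than being discarded.

```ruby
# Hypothetical, tiny substitution table; iconv's real tables are much larger.
ASCII_FALLBACKS = {
  "\u2026" => "...",   # horizontal ellipsis
  "\u201C" => '"',     # left double quotation mark
  "\u201D" => '"'      # right double quotation mark
}.freeze

def translit_to_latin1(text)
  # The fallback is called once per character that has no ISO-8859-1 equivalent.
  text.encode(Encoding::ISO_8859_1,
              fallback: ->(char) { ASCII_FALLBACKS.fetch(char, "?") })
end

translit_to_latin1("Wait\u2026")   # the ellipsis turns into three periods
```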


Greg_Brown1 · 18 September 2008 15:52

If there is no way of telling the original encoding, the input data
may not have valid unicode in it at all, right?

-greg


Mark_Thomas · 18 September 2008 17:12

Try the iconv solutions with latin-1 (iso-8859-1) as the From. That's
as close as you can get to a one-byte "anything-goes" encoding.

-Mark.
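A sketch of this suggestion with String#encode on a current Ruby (the helper name is made up). Since every byte is a valid ISO-8859-1 character, the conversion cannot fail; the price is that bytes which were already correct UTF-8 come out as two or more Latin-1 characters.

```ruby
# Hypothetical helper: read the bytes as ISO-8859-1 and re-encode as UTF-8.
def latin1_to_utf8(data)
  data.b.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

latin1_to_utf8("caf\xE9".b)      # genuine Latin-1 input converts cleanly
latin1_to_utf8("caf\xC3\xA9".b)  # UTF-8 input is double-encoded (0xC3 and 0xA9 become two chars)
```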


Krzysieq · 19 September 2008 11:07

Ok, I tried all the previous suggestions and none of them worked (the gsub idea,
TRANSLIT, IGNORE or the one from the link posted by Mark Thomas). In fact, the last
two don't seem to have done anything, while gsub seems to do too much -
it looks like it has damaged the xml structure in some way, which is very
strange to me. I don't really care about the data inside, but I need the xml
to remain valid.

@Gregory - that's true, it may not. However, the places where I found the
funny characters are text nodes inside xml documents, and there aren't that
many of them. Of course, one is enough to break the whole thing, but
typically there are very few and it looks more like corrupted database data. I
think they store newspaper articles or pieces of news there. I learned
from the team who maintain that database in their app that it
should all be ISO-8859-1, but for some reason that's not always the case.
Hence the idea of corrupted data seems quite likely.

Thanks for any help You can provide me with
Cheers,
Chris
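For the situation described here (mostly UTF-8, with occasional bytes that are probably ISO-8859-1), one option on Ruby 2.1 or later is to keep everything that is already valid UTF-8 and reinterpret only the invalid bytes as Latin-1. A sketch under that assumption; repair_mixed is a hypothetical name, and Latin-1 text that happens to look like valid UTF-8 would be left as it is.

```ruby
# Keep valid UTF-8 as is; treat each invalid byte run as ISO-8859-1.
def repair_mixed(data)
  data.dup.force_encoding(Encoding::UTF_8).scrub do |bad_bytes|
    # scrub hands the block each invalid byte sequence it finds.
    bad_bytes.b.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  end
end

repair_mixed("caf\xE9 and caf\xC3\xA9".b)   # both words end up as a proper "café"
```

Markup characters such as < and > are plain ASCII, so they pass through untouched and the XML structure is not affected.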


Greg_Brown1 · 19 September 2008 12:57

Silly question, but did you set $KCODE = "U" while processing your data?

-greg


Mark_Thomas · 19 September 2008 13:02

How is the XML file created? If you know in advance which parts of the
XML come from the database, wrap those sections in CDATA blocks and
your XML will remain valid.
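One caveat: as far as the XML spec goes, a CDATA section only stops characters such as < and & from being read as markup. The bytes inside still have to decode in the document's declared encoding, so stray bytes need cleaning before wrapping. A sketch assuming Ruby 2.1 or later; cdata_section is a made-up helper name.

```ruby
# Hypothetical helper: scrub invalid UTF-8, then wrap the text in CDATA.
def cdata_section(text)
  clean = text.dup.force_encoding(Encoding::UTF_8).scrub("?")
  # "]]>" may not appear inside CDATA, so split it across two sections.
  "<![CDATA[" + clean.gsub("]]>", "]]]]><![CDATA[>") + "]]>"
end

cdata_section("caf\xE9 <b>".b)   # markup is left alone, the stray byte becomes "?"
```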

Krzysieq · 19 September 2008 13:00

Silly answer, but what is $KCODE?? I'm relatively new to Ruby, so this tells
me nothing... And as you might have guessed, no, I haven't set it. What does
it do?

Cheers,
Chris


Greg_Brown1 · 19 September 2008 13:11

It tells Ruby that you are working with UTF-8

-greg


James_Edward_Gray_II · 19 September 2008 13:11

Silly answer, but what is $KCODE ??

It's a global variable that affects how Ruby 1.8 handles characters.

And as You might have guessed, no, I haven't set it.

Does your code run inside of a recent version of Rails? I'm just asking because it sets $KCODE for you.

James Edward Gray II
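$KCODE only matters on Ruby 1.8. From 1.9 on it is ignored and every string carries its own encoding, which makes it easy to find out where a file is broken before deciding how to clean it. A sketch; the helper name is an assumption for illustration.

```ruby
# Hypothetical helper: 1-based numbers of the lines that are not valid UTF-8.
def lines_with_invalid_utf8(data)
  data.b.each_line.with_index(1)
      .reject { |line, _number| line.force_encoding(Encoding::UTF_8).valid_encoding? }
      .map { |_line, number| number }
end

lines_with_invalid_utf8("ok\nbad\xFF\ncaf\xC3\xA9\n".b)   # only the second line is reported
```

For a file on disk, File.binread(path) returns the raw bytes to pass in.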

