Ruby method to strip out XML codes?

Michael_W_Ryder1 · 6 December 2007 01:15

I am trying to process an XML file that includes various codes. The problem I am running into is that some of these codes are inserted into the middle of an encrypted string. If I display the file in a browser, these codes do not show up, and copying and pasting the string works fine. The problem occurs when I try to pull the string out in a program and these "extraneous" XML codes are included. This of course makes the decryption routine crash.
What I am looking for is a simple way to read through the file and remove all the XML codes, leaving just plain text. I could probably write a series of regular expressions to remove each code that I can find in my text, but I am afraid I might miss some and that it will come back to haunt me later.

Gavin_Kistner3 · 6 December 2007 03:20

str.gsub(/<\/?[^>]+>/, '')

This will only be a problem if your XML file is legal and has a CDATA
section which has a literal < character (not &lt;), like:

for ( var i=0, len=a.length; i<len; ++i )

In that case you likely want a proper XML parser (like REXML) and to
use it.
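
A minimal sketch of the parser route, assuming the REXML library that ships with Ruby (a bundled gem on recent versions). The sample document and the element names `record`, `script`, and `key` are made up for illustration. The parser hands back the CDATA content intact and decodes character references such as `&#xD;` for you:

```ruby
require 'rexml/document'

# Hypothetical sample: a CDATA section containing a literal "<", plus a
# field split by the character references for CR and LF.
xml = <<~XML
  <record>
    <script><![CDATA[for ( var i=0, len=a.length; i<len; ++i )]]></script>
    <key>QUJD&#xD;&#xA;REVG</key>
  </record>
XML

doc = REXML::Document.new(xml)

# Element#text returns the decoded text of the element's first text node,
# so no markup or entity syntax is left in the result.
script = doc.root.elements['script'].text
key    = doc.root.elements['key'].text

puts script
p key
```

Note that the decoded `key` still contains the line break itself; the parser only removes the XML notation, not the characters it stands for.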

Do you really want to remove the XML, or would it suffice to just:

  str.gsub! '&', '&amp;'
  str.gsub! '<', '&lt;'
  str.gsub! '>', '&gt;'
(and maybe even)
  str.gsub! '"', '&quot;'
  str.gsub! "'", '&#39;'

to make your string valid and escaped for use in an HTML context?
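
With the step-by-step substitutions above, order matters: the `&` replacement has to run first, or it will re-escape the ampersands produced by the later ones. Also keep in mind that `gsub!` returns nil when nothing was replaced, so its result should not be chained. A possible single-pass variant is sketched below; the names `ESCAPES` and `escape_markup` are made up for this example. Ruby's standard library also has `CGI.escapeHTML`, which covers the same five characters.

```ruby
# Map each special character to its escaped form (hypothetical constant name).
ESCAPES = {
  '&' => '&amp;',
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;',
  "'" => '&#39;'
}.freeze

# Escape all five characters in one pass. Passing a Hash to gsub looks up
# each match as a key, so no replacement is ever escaped a second time.
def escape_markup(str)
  str.gsub(/[&<>"']/, ESCAPES)
end

puts escape_markup(%q{a < b & "c" > 'd'})
```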


Michael_W_Ryder1 · 6 December 2007 07:55

My problem is that the XML file includes &#xD;&#xA; in the middle of a couple of fields, especially in the encrypted fields. If I just pull out the encrypted field and try to decrypt it, the program crashes because the key is invalid. I have to remove the "bad" character strings before sending it to my decryption program. I would prefer to do this removal before sending the file to my programs so that I don't have to deal with these codes.
I assume that the string I am seeing is XML's way of saying CR/LF, as D and A in hex are CR and LF, and the output in a browser shows the field being broken at that point. The problem is that these are only the ones that I have noticed, and there may be others hiding in the data. The XML file is being parsed for conversion to our accounts.
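
One way to catch the codes not yet noticed is to handle numeric character references generically rather than one at a time. The sketch below decodes any decimal or hex reference and then drops all whitespace from the field. It assumes the encrypted value is something like Base64 or hex, where whitespace carries no meaning; `CHAR_REF` and `clean_field` are hypothetical names.

```ruby
# Matches a numeric character reference such as &#xD; or &#10;.
CHAR_REF = /&#(?:x([0-9a-fA-F]+)|(\d+));/

# Decode every numeric character reference, then remove all whitespace.
# Assumption: whitespace is never a meaningful part of the field.
def clean_field(field)
  decoded = field.gsub(CHAR_REF) do
    code = $1 ? $1.hex : $2.to_i   # hex form or decimal form
    [code].pack('U')               # code point -> UTF-8 character
  end
  decoded.gsub(/\s+/, '')
end

puts clean_field('QUJD&#xD;&#xA;REVG')
```

Named entities such as `&amp;` are left untouched by this pattern, so it is not a substitute for a full parser if the fields can contain those.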
