NEWBIE: Ruby & XML

Hello.

I am working with some XML logs coming from a network simulator.
My aim is to strip out the transient information concerning any given
variable.

Here is some sample data:

string = <<EOF
<seqexml version="1.0">
<primitive name='PHY_DATA_IND' time="00:00:40.450" CFN="206"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">24</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.470" CFN="208"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">21</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive><primitive name='PHY_DATA_IND' time="00:00:40.470" CFN="208"
sap='NW_MAC_SAP' direction='uplink' bts_unit='1:1'
channel_name='HS_DPCCH' channel_number='0' >
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive>
</seqexml>
EOF

For instance, I may want to extract all the "CQI" values and their
timestamps to get:

23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470

Question: These files can be very large, so keeping the computer
resource overhead low is important. I've looked at other threads on this
forum to decide which method of extracting data against timestamps would
be the quickest, but the information has been conflicting.

I understand that stream parsing is faster than DOM parsing. I also
understand that libxml is faster than REXML, but that libxml streaming
uses DOM. So is it safe to assume that REXML streaming is faster than
libxml streaming?

I also need to consider which approach would be easier to implement.


--
Posted via http://www.ruby-forum.com/.

I make no claim about what might be best :) but Nokogiri seems to be
the leading Ruby XML library at the moment. I quickly adapted an old
REXML pull parser to work with your sample data:

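require 'rexml/parsers/pullparser'

# Yields one hash per <primitive>: its attributes plus one entry per
# <parameter> child, keyed by the parameter's name.
def parse(stream)
  raise "BlockRequired" unless block_given?

  parser = REXML::Parsers::PullParser.new(stream)

  row = {}
  col = nil # name of the <parameter> currently being read, if any

  while parser.has_next?
    event = parser.pull

    case event.event_type
    when :start_element
      case event[0]
      when 'primitive'
        # Start a new row from the element's attributes (time, CFN, ...).
        row = event[1]; col = nil
      when 'parameter'
        col = event[1]["name"]
      end

      row[col] ||= String.new if col

    when :end_element
      col = nil

      case event[0]
      when 'primitive'
        yield(row)
      else
        # ignore
      end

    when :text
      # Collect text only while inside a <parameter>.
      row[col] << event[0].chomp if col

    else
      # ignore
    end
  end
end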

parse(string) do |row|
  #p row
  puts "#{row["CQI"]}, #{row["time"]}"
end

ruby x.rb

23, 00:00:40.450
22, 00:00:40.460
22, 00:00:40.460
23, 00:00:40.460
22, 00:00:40.460
24, 00:00:40.460
21, 00:00:40.470
22, 00:00:40.470

The original program I lifted that from was processing XML files of up
to several gigabytes; particularly on the largest files, we saw much
better performance running under JRuby than under MRI (1.8.5 or so).
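
Since REXML streaming came up in the question, here is a minimal sketch of
the same extraction written against REXML's stream (SAX-style) listener
instead of the pull parser. The class name CqiListener and the trimmed
sample data are made up for illustration, and I have not benchmarked it
against the pull parser or libxml, so treat it as a starting point rather
than a recommendation.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical listener (the name is illustrative): collects [CQI, time] pairs.
class CqiListener
  include REXML::StreamListener

  attr_reader :rows

  def initialize
    @rows = []
    @time = nil
    @in_cqi = false
    @buffer = nil
  end

  # Called for every opening tag; attrs is a Hash of attribute name => value.
  def tag_start(name, attrs)
    case name
    when 'primitive'
      @time = attrs['time']
    when 'parameter'
      if attrs['name'] == 'CQI'
        @in_cqi = true
        @buffer = String.new
      end
    end
  end

  # Called for text nodes; keep text only while inside the CQI parameter.
  def text(data)
    @buffer << data if @in_cqi
  end

  # Called for every closing tag; emit a row when the CQI parameter ends.
  def tag_end(name)
    return unless name == 'parameter' && @in_cqi
    @rows << [@buffer.strip, @time]
    @in_cqi = false
  end
end

# Trimmed-down sample in the same shape as the log above (assumption:
# the dropped attributes are not needed for this extraction).
sample = <<XML
<seqexml version="1.0">
<primitive name='PHY_DATA_IND' time="00:00:40.450" CFN="206">
<parameter name="CQI">23</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive>
<primitive name='PHY_DATA_IND' time="00:00:40.460" CFN="207">
<parameter name="CQI">22</parameter>
<parameter name="H-ARQ Status">DTX</parameter>
</primitive>
</seqexml>
XML

listener = CqiListener.new
# For a real log, pass an open File instead of the StringIO.
REXML::Document.parse_stream(StringIO.new(sample), listener)
listener.rows.each { |cqi, time| puts "#{cqi}, #{time}" }
```

Only one row's worth of state is held at a time, apart from the rows array
itself; for very large logs you would probably print or write each row in
tag_end rather than accumulate them.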


On Tue, Aug 10, 2010 at 9:53 AM, Jerome David Sallinger <imran.nazir@yahoo.co.uk> wrote: