Help requested -- regexp

I have a text file with each line representing a 'record'. The line
has tab-separated 'field=value' values. *Note*: Not all fields are
mandatory in all records.

This text file is rather large, and is auto-dumped from another
application.

Now, I am trying to provide a quick interface to my users to query
this file. The interface I am looking at is something along the lines of

     (field1 > 4) and ((field4 < 2.25) or (field3 > 8.0))

to be typed in the query shell.

I had first written something similar to:

   class Field

     ...

     def gen_regexp(cond)
       # Capture the whole value, i.e. everything up to the next tab;
       # a trailing non-greedy (.+?) would capture only one character.
       regexp = "(md = /#{@name}=([^\\t]+)/.match(_line); md and "
       # \b stops e.g. 'field1' from also matching inside 'field10'.
       regexp += cond.gsub(/\b#{@name}\b/, 'md[1].' + @converter_method) # 'to_i'/'to_f'
       regexp += ')'
     end

     ...

   end

This generated snippet (Ruby code wrapping a regexp match) would then
be substituted in place of the corresponding parenthesized 'condition'.
Once all such substitutions are complete, a query block is built as:

    query_blk = eval("lambda { |_line| #{final_regexp} }")

This query block is then used in a conventional 'select' on the lines.
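For illustration, after all substitutions the eval'd block for the sample
query would expand to roughly the following (a sketch with made-up records;
the pattern here assumes each value runs up to the next tab):

```ruby
# Roughly what eval sees for:
#   (field1 > 4) and ((field4 < 2.25) or (field3 > 8.0))
query_blk = lambda do |_line|
  (md = /field1=([^\t]+)/.match(_line); md and md[1].to_i > 4) and
    ((md = /field4=([^\t]+)/.match(_line); md and md[1].to_f < 2.25) or
     (md = /field3=([^\t]+)/.match(_line); md and md[1].to_f > 8.0))
end

lines = ["field1=5\tfield3=9.5", "field1=3\tfield4=1.0"]
hits  = lines.select { |l| query_blk.call(l) }
```

Note the `md and ...` guard: a record that lacks a referenced field makes
that condition false instead of raising.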

It worked for the likes of the example above, but started having
problems for clauses with multiple fields in each 'condition', like:

     (field1 > 4) and ((field4 < 2.25) or (field3 + field8 > 8.0))

since the above substitution logic is at an individual field level.

Suggestions please. Thanks!

Best regards,

JS

> I have a text file with each line representing a 'record'. The line
> has tab-separated 'field=value' values. *Note*: Not all fields are
> mandatory in all records.

Well, that's crying out to be a Hash, right?

    Hash[*line.split("\t").map { |f| f.split("=") }.flatten]
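That one-liner, with '=' as the separator (the records are 'field=value'),
behaves like this:

```ruby
line = "field1=4\tfield3=9.5\tfield4=1.25"

# Split the record on tabs and each field on '=', then build a Hash
# from the flattened key/value list. Values stay strings here; the
# to_i/to_f conversion is still up to the query code.
record = Hash[*line.split("\t").map { |f| f.split("=") }.flatten]
```

A field that is absent from a line then simply comes back as nil from the
Hash, which suits the sparse records.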

> This text file is rather large, and is auto-dumped from another
> application.

You may not want to preload it then, if the users don't mind waiting for query() (or whatever) to check it line by line.

> Now, I am trying to provide a quick interface to my users to query
> this file. The interface I am looking at is something along the lines of
>
>     (field1 > 4) and ((field4 < 2.25) or (field3 > 8.0))
>
> to be typed in the query shell.

You could use irb for the "shell" and Ruby's block syntax for the query. If you want the fields to work as you show them above, use a little method_missing magic.
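A minimal sketch of that idea (names like Record and query here are
hypothetical, and field values are assumed numeric): parse each line into a
Hash, and let method_missing resolve bare field names inside an
instance_eval'd query block:

```ruby
# Hypothetical sketch: the query block is plain Ruby, evaluated against
# a per-line record whose method_missing resolves bare field names.
class Record
  def initialize(line)
    @fields = Hash[*line.split("\t").map { |f| f.split("=") }.flatten]
  end

  # Look the field up and convert it: integer-looking values via to_i,
  # everything else via to_f. Unknown fields raise.
  def method_missing(name, *args)
    value = @fields[name.to_s] or raise NameError, "no field #{name}"
    value =~ /\A-?\d+\z/ ? value.to_i : value.to_f
  end
end

def query(lines, &blk)
  # A record lacking a referenced field raises above and simply
  # fails to match.
  lines.select { |l| Record.new(l).instance_eval(&blk) rescue false }
end

lines = ["field1=5\tfield3=6.5\tfield8=2.0",
         "field1=3\tfield4=1.0"]
hits = query(lines) { field1 > 4 and field3 + field8 > 8.0 }
```

Since the block is ordinary Ruby, multi-field conditions such as
field3 + field8 > 8.0 need no special per-field substitution.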

> I had first written something similar to:
>
>   class Field
>
>     ...
>
>     def gen_regexp(cond)
>       # Capture the whole value, i.e. everything up to the next tab;
>       # a trailing non-greedy (.+?) would capture only one character.
>       regexp = "(md = /#{@name}=([^\\t]+)/.match(_line); md and "
>       # \b stops e.g. 'field1' from also matching inside 'field10'.
>       regexp += cond.gsub(/\b#{@name}\b/, 'md[1].' + @converter_method) # 'to_i'/'to_f'
>       regexp += ')'
>     end
>
>     ...
>
>   end
>
> This generated snippet (Ruby code wrapping a regexp match) would then
> be substituted in place of the corresponding parenthesized 'condition'.
> Once all such substitutions are complete, a query block is built as:
>
>    query_blk = eval("lambda { |_line| #{final_regexp} }")
>
> This query block is then used in a conventional 'select' on the lines.
>
> It worked for the likes of the example above, but started having
> problems for clauses with multiple fields in each 'condition', like:
>
>     (field1 > 4) and ((field4 < 2.25) or (field3 + field8 > 8.0))
>
> since the above substitution logic is at an individual field level.

If you want to express complete relationships, just use Ruby code as described above.

> Suggestions please. Thanks!

Those are my best thoughts. Hope it helps.

James Edward Gray II


On Nov 5, 2005, at 11:53 PM, Srinivas Jonnalagadda wrote:

Personally, and maybe because I'm a product of different experiences, I'd
approach this problem differently. I see your problem and it just screams to
me "database".

Is there a reason you couldn't parse the text file and upload it to some
sort of SQL database and then let your users query that?

William Ramirez wrote:

> [...] I see your problem and it just screams to me "database".

I hear these screams :-)

daz

William Ramirez wrote:

> Personally, and maybe because I'm a product of different experiences, I'd
> approach this problem differently. I see your problem and it just screams to
> me "database".
>
> Is there a reason you couldn't parse the text file and upload it to some
> sort of SQL database and then let your users query that?

Yes. Each dump file is between 800 MB and 1.4 GB. I had indeed set up
a database to hold this data. Here are a few reasons why I tried the
approach that I did:

1. The 'load' (with all relational constraints turned off) was still
    taking an enormous amount of time (on the order of 3-4 hours per
    dump file).

2. The dump file's schema changes rather frequently. This introduced
    frequent DBA overhead to keep the database schema synchronized.

3. The sparse nature of the data (not all fields being mandatory in
    all records) has resulted in a high storage overhead (about 2.5X).

4. SQL queries on the resulting database are not a serious option,
    since my users do not know SQL and would not be able to interpret
    any error diagnostics.

5. When wrapped with objects in Ruby, the same queries take 6-10X
    longer (compared to the regular-expression approach). Profiling
    shows this to be mostly due to object creation/initialization
    overhead.

    And, to answer James' question -- it was indeed a hash that was
    employed in each object.

So, I sought a solution that worked faster, and ended up with the
current regular expression approach.

Hope that answers your question.

Best regards,

JS