String generalization

Hello all,

I am having the following problem: I have to implement a method
which accepts a string and returns a generalized form of the string.
Some examples:

'12345' -> \d+
'ABCDE' -> [A-Z]+
'john.smith@my.super.server.rb' -> [a-z]+\.[a-z]+@([a-z]+\.)+[a-z]
'123-45-678-90' -> (\d+-)+\d+
'item:' -> [a-z]+:
'Peter, SZINEK': -> [A-Z][a-z]+, [A-Z]+
'http://www.google.com' -> [a-z]+://([a-z]+\.)+[a-z]
'jd£;d:L348kddd3' -> .*

So, given an example, the function should generate a generic regexp which is then used to match the instances of the same class (i.e. based on an e-mail, you create a regexp which can be used to match other emails).
Of course this problem is not solvable in general. You would need more examples (both positive and negative ones, as it is proven that just using positive examples you can not generalize a pattern (generate a regular grammar desribibg it) and then machine learn/induce a grammar/etc based on that. Unfortunately i have only one positive example. So i do not need a perfect solution (which is not possible anyway) just a mostly working one.

I have implemented it in java, and it works pretty well but it is ugly as hell (as any java code dealing with stuff where you need regexps, slicing, maps etc). Any ideas for a nice Ruby code solving this? Then i would call it via JRuby (and killed by the other colleagues using java but not Ruby ;-).

thx,
Peter

As you said for yourself this is quite a *general* problem. So my code is
not to show off a clever automaton, it is completely stupid, but rather to
give you the idea how to hide it :wink:
Please note that it is not even complete (end handling has to be done) I
just wanted to give you ideas.

Just an idea: Maybe you should make some more effort in the definition of
the problem and post it to Ruby Quiz!
----------------------------------------- 8<

···

On 4/6/06, Peter Szinek <peter@rt.sk> wrote:

Hello all,

I am having the following problem: I have to implement a method
which accepts a string and returns a generalized form of the string.
Some examples:

'12345' -> \d+
'ABCDE' -> [A-Z]+
'john.smith@my.super.server.rb' -> [a-z]+\.[a-z]+@([a-z]+\.)+[a-z]
'123-45-678-90' -> (\d+-)+\d+
'item:' -> [a-z]+:
'Peter, SZINEK': -> [A-Z][a-z]+, [A-Z]+
'http://www.google.com' -> [a-z]+://([a-z]+\.)+[a-z]
'jd£;d:L348kddd3' -> .*

So, given an example, the function should generate a generic regexp
which is then used to match the instances of the same class (i.e. based
on an e-mail, you create a regexp which can be used to match other
emails).
Of course this problem is not solvable in general. You would need more
examples (both positive and negative ones, as it is proven that just
using positive examples you can not generalize a pattern (generate a
regular grammar desribibg it) and then machine learn/induce a
grammar/etc based on that. Unfortunately i have only one positive
example. So i do not need a perfect solution (which is not possible
anyway) just a mostly working one.

I have implemented it in java, and it works pretty well but it is ugly
as hell (as any java code dealing with stuff where you need regexps,
slicing, maps etc). Any ideas for a nice Ruby code solving this? Then i
would call it via JRuby (and killed by the other colleagues using java
but not Ruby ;-).

thx,
Peter

---------------------------------------------
cat gen1.rb; ruby gen1.rb
#!/usr/bin/env ruby
#

class Generalize < Regexp

# Do not use a Hash, order is crucial!
Generalisations = [ '\w', '\d', '@', '\.', '.' ]

        def initialize( str )
                @rep = generalize(str)

                super @rep
        end

        def to_s
                s = super
                "<" + s + ">:" + self.class.to_s + "\"" + @rep +"\""
        end

        private
        #
        # This is the real tough part which might become ugly, but at least
it is private :slight_smile:
        def generalize( str )
                r = ""
                old = nil
                star= ""
                str.each_byte do
                        >b>
                        c = b.chr
                        # Much more intelligence to be put here, maybe also
more context than the
                        # single string we have at our disposal.
                        Generalisations.each do
                                > lit |
                                reg = Regexp.new( lit )
                                if reg === c then
                                        if lit == old then
                                                star = "+"
                                        else
                                                if old then
                                                        r << old
                                                        r << star
                                                        star = ""
                                                end
                                                old = lit
                                        end
                                        break
                                end
                        end
                end
                r
        end

end

r = Generalize.new("robert.dober@gmail.com")
puts r.to_s
<(?-mix:\w+\.\w+@\w+\.)>:Generalize"\w+\.\w+@\w+\."
------------------------------- >8 -----------------------------------------
Cheers
Robert
--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein

Peter Szinek schrieb:

Hello all,

I am having the following problem: I have to implement a method
which accepts a string and returns a generalized form of the string.
Some examples:

'12345' -> \d+
'ABCDE' -> [A-Z]+
'john.smith@my.super.server.rb' -> [a-z]+\.[a-z]+@([a-z]+\.)+[a-z]
'123-45-678-90' -> (\d+-)+\d+
'item:' -> [a-z]+:
'Peter, SZINEK': -> [A-Z][a-z]+, [A-Z]+
'http://www.google.com' -> [a-z]+://([a-z]+\.)+[a-z]
'jd£;d:L348kddd3' -> .*

Peter, thanks for the quiz. I needed it right now :slight_smile:

I'm not sure I understand the rules. I had expected:

   '//' -> /+
   'jd£;d:L348kddd3' -> [a-z]+£;[a-z]:[A-Z]\d+[a-z]+\d

Anyway, here's a first implementation for the use cases you've shown:

   def generalize( str )
     str.gsub( /\d+|[a-z]+|[A-Z]+|\./ ) do |match|
       case match
       when /\d{2,}/: "\\d+"
       when /\d/: "\\d"
       when /[a-z]{2,}/: "[a-z]+"
       when /[a-z]/: "[a-z]"
       when /[A-Z]{2,}/: "[A-Z]+"
       when /[A-Z]/: "[A-Z]"
       when /\./: "\\."
       end
     end.gsub( /(.+)\1+/, "(\\1)+" )
   end

You notice some repetitions, so you can generalize it further. It isn't perfect, but maybe gets you going.

Regards,
Pit

Here is my shot at the problem:

class String

   # from strict to less strict
   PATTERNS_TO_GENERALIZE = [
     /\d/, /\d+/,
     /[a-z]/, /[a-z]+/,
     /[A-Z]/, /[A-Z]+/,
     /\w/, /\w+/
   ]
   def generalize_to_regexp
     parts = split(/(\W+)/).map { |x|
       case x
       when ""
         nil
       when /\W+/
         Regexp.escape(x)
       else
         PATTERNS_TO_GENERALIZE.find { |p|
           (p =~ x) && x == $&
         }.source
       end
     }.compact
     parts2 = # will contain the final parts
     size = 0 # size of the window we are currently checking
     parts.size.times { |idx|
       size += 1
       next if size < 4 # window to small
       if size % 2 == 0
         if parts[idx] != parts[idx-2] || parts[idx-1] != parts[idx-3]
           if size > 4
             parts2 << "(#{parts[idx-3]}#{parts[idx-2]})+"
             size = 2
           else # size == 4
             parts2 << parts[idx-3]
             size -= 1
           end
         end
       end
     }
     # handle the remaining window
     if size >= 4
       size %= 2
       parts2 << "(#{parts[-2-size]}#{parts[-1-size]})+"
     end
     parts2.concat(parts[-size, size]) if size > 0
     Regexp.new(parts2.join(""))
   end

end

if $0 == __FILE__
   ['12345', 'ABCDE', 'john.smith@my.super.server.rb',
    '123-45-678-90', 'item:', 'Peter, SZINEK',
    'http://www.google.com', 'jd£;d:L348kddd3'].each { |s|
     puts "#{s} => #{s.generalize_to_regexp.inspect}"
   }
end

The part that finds the repeating patterns is a bit ugly, but it should work. Here are the results:

$ ruby generalize_rx.rb
12345 => /\d+/
ABCDE => /[A-Z]+/
john.smith@my.super.server.rb => /[a-z]+\.[a-z]+@([a-z]+\.)+[a-z]+/
123-45-678-90 => /(\d+\-)+\d+/
item: => /[a-z]+:/
Peter, SZINEK => /\w+,\ [A-Z]+/
http://www.google.com => /[a-z]+:\/\/([a-z]+\.)+[a-z]+/
jd£;d:L348kddd3 => /[a-z]+\243;[a-z]:\w+/

Dominik

···

On Thu, 06 Apr 2006 14:11:10 +0200, Peter Szinek <peter@rt.sk> wrote:

Hello all,

I am having the following problem: I have to implement a method
which accepts a string and returns a generalized form of the string.
Some examples:

'12345' -> \d+
'ABCDE' -> [A-Z]+
'john.smith@my.super.server.rb' -> [a-z]+\.[a-z]+@([a-z]+\.)+[a-z]
'123-45-678-90' -> (\d+-)+\d+
'item:' -> [a-z]+:
'Peter, SZINEK': -> [A-Z][a-z]+, [A-Z]+
'http://www.google.com' -> [a-z]+://([a-z]+\.)+[a-z]
'jd£;d:L348kddd3' -> .*

As you said for yourself this is quite a *general* problem. So my code is
not to show off a clever automaton, it is completely stupid, but rather to
give you the idea how to hide it :wink:
Please note that it is not even complete (end handling has to be done) I
just wanted to give you ideas.

Wow! That's what i like about Ruby. My solution in java (although more complete,
and dealing with special cases etc) is about 10x bigger in size and much more
ugly/obscure/unmaintainable etc. Thx for the code!

Just an idea: Maybe you should make some more effort in the definition of
the problem and post it to Ruby Quiz!

Really? (I am new here, begun to learn Ruby 2 weeks ago) - i have seen Ruby Quiz but did not know that anybody can post problems... (And if it gets solved, this way "outsorce" a task from his full-time job tasklist :wink:

Best wishes,
Peter

cat gen1.rb; ruby gen1.rb
#!/usr/bin/env ruby
#

class Generalize < Regexp

The only part i am missing is the one that gave me te most headache in the java version - the second step of generalization.
Example:

r = Generalize.new("a.b@c.d.e.f.g")
You would return:
\w+.\w+@\w+.\w+.\w+.\w+.\w+ (first step)

But i want

\w+.\w+@(\w+\.)+\w+ (second step)

instead...

i.e. to find the repeating strings in the pattern and + them.
I mean it's ok to find the repeating strings, since you can do this with one back-referencing regexp. But to decide which of the repeating parts to + etc. is a real PITB... At least for me it was.

bw,
Peter

Hi,

Peter, thanks for the quiz. I needed it right now :slight_smile:

Actually it was not ment as quiz (its a task at hand at my workplace), but i like the mentionality of the Ruby ML that everybody started to solve it as a quiz :wink: Good work James!

I'm not sure I understand the rules. I had expected:

  '//' -> /+

Yeah, you re right with this one.

  'jd£;d:L348kddd3' -> [a-z]+£;[a-z]:[A-Z]\d+[a-z]+\d

The idea is, that if the input string is a mixture of this and that
without some 'sensible' pattern, it should be just generalized to .*
Well, it is not very straightforward to pin down whether a string is a mixture or not, and when there is a sensible pattern in it. (And what is sensible in this context) ATM i am using some crazy heuristics wich works very nicely in the practice, but no one knows (not even me) what it is really doing :wink: It just works.

To put it another way: a human user can tell that 'jd£;d:L348kddd3' is some crap and not a repeating pattern of something (like a tel #, an email, a name) in any way. So the rule is, that if you, as a human user think it is a crap rather than something meaningful, the computer should say this as well ;-). And you produced a chunk of a code which can be wired into the first machine passing the Turing test...

bw,
Peter

The part that finds the repeating patterns is a bit ugly, but it should work. Here are the results:

You did not see the java version... Now THAT'S ugly :wink:

$ ruby generalize_rx.rb
12345 => /\d+/
ABCDE => /[A-Z]+/
john.smith@my.super.server.rb => /[a-z]+\.[a-z]+@([a-z]+\.)+[a-z]+/
123-45-678-90 => /(\d+\-)+\d+/
item: => /[a-z]+:/
Peter, SZINEK => /\w+,\ [A-Z]+/
http://www.google.com => /[a-z]+:\/\/([a-z]+\.)+[a-z]+/
jd£;d:L348kddd3 => /[a-z]+\243;[a-z]:\w+/

Wow! Thanks man! Nice solution. Ruby is awesome.

Cheers,
Peter

[SNIP]

The part that finds the repeating patterns is a bit ugly, but it should
work. Here are the results:

$ ruby generalize_rx.rb
12345 => /\d+/
ABCDE => /[A-Z]+/
john.smith@my.super.server.rb => /[a-z]+\.[a-z]+@([a-z]+\.)+[a-z]+/
123-45-678-90 => /(\d+\-)+\d+/
item: => /[a-z]+:/
Peter, SZINEK => /\w+,\ [A-Z]+/
http://www.google.com => /[a-z]+:\/\/([a-z]+\.)+[a-z]+/
jd£;d:L348kddd3 => /[a-z]+\243;[a-z]:\w+/

Well that is quite impressive, but I could not figure out why
"12a".you_know_the_name_of_the_method gives /\w+/, it clearly should give
/\d+[a-z]+/, no?

Cheers
Robert

Dominik

···

On 4/6/06, Dominik Bathon <dbatml@gmx.de> wrote:

--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein

SNIP

···

On 4/6/06, Peter Szinek <peter@rt.sk> wrote:
----------------

Really? (I am new here, begun to learn Ruby 2 weeks ago) - i have seen

Ruby Quiz but did not know that anybody can post problems... (And if it
gets solved, this way "outsorce" a task from his full-time job tasklist
:wink:

Best wishes,
Peter

Well James reads this list frequently, he will tell us, I guess :wink:

But my idea is the following.

* Give some constraints about how different sets of example strings would
influence each other.
* Define a test suite f.e.

I think along these lines
If the program has only one string "acccd" it will generalize to ".+"
if you give a second one
[ "addff", "" ] it will generalize to ".*"
if you give the following data
[ "129", "0x1ef" ], ["add", ""] it shall create "\d+|0x[0-9a-f]+", ".*"
etc. etc.

Believe me this is more than challanging and maybe a *real* good definition
of the problem is more
difficult than the solution.

Cheers
Robert

--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein

Sure can. Ruby Quiz is a resource. Use away. :wink:

However, I should warn that I'm getting stricter and stricter about making most submitters solve the problem before we run it. Have to make sure it's a reasonable problem.

James Edward Gray II

···

On Apr 6, 2006, at 8:08 AM, Peter Szinek wrote:

Really? (I am new here, begun to learn Ruby 2 weeks ago) - i have seen Ruby Quiz but did not know that anybody can post problems... (And if it gets solved, this way "outsorce" a task from his full-time job tasklist :wink:

split(/(\W+)/) splits the string into alphanumerical and non-alphanumerical parts:

irb(main):001:0> "12a".split(/(\W+)/)
=> ["12a"]
irb(main):002:0> ".12a.iu-23..4234".split(/(\W+)/)
=> ["", ".", "12a", ".", "iu", "-", "23", "..", "4234"]

Then my implementation escapes the non-alphanumeric parts and tries to find a pattern for the alphanumeric parts (from PATTERNS_TO_GENERALIZE).
The first pattern that matches "12a" is /\w+/ so it gets choosen.

If you want /\d+[a-z]+/ you can either add more patterns to PATTERNS_TO_GENERALIZE, or try to further split the alphanumeric parts.

But what would you do with "1b23xy3sa"? Should that be /\w+/ or /\d[a-z]\d+[a-z]+\d[a-z]+/?

Dominik

···

On Thu, 06 Apr 2006 19:16:44 +0200, Robert Dober <robert.dober@gmail.com> wrote:

On 4/6/06, Dominik Bathon <dbatml@gmx.de> wrote:

[SNIP]

The part that finds the repeating patterns is a bit ugly, but it should
work. Here are the results:

$ ruby generalize_rx.rb
12345 => /\d+/
ABCDE => /[A-Z]+/
john.smith@my.super.server.rb => /[a-z]+\.[a-z]+@([a-z]+\.)+[a-z]+/
123-45-678-90 => /(\d+\-)+\d+/
item: => /[a-z]+:/
Peter, SZINEK => /\w+,\ [A-Z]+/
http://www.google.com => /[a-z]+:\/\/([a-z]+\.)+[a-z]+/
jd£;d:L348kddd3 => /[a-z]+\243;[a-z]:\w+/

Well that is quite impressive, but I could not figure out why
"12a".you_know_the_name_of_the_method gives /\w+/, it clearly should give
/\d+[a-z]+/, no?

* Give some constraints about how different sets of example strings would
influence each other.
* Define a test suite f.e.

I think along these lines
If the program has only one string "acccd" it will generalize to ".+"
if you give a second one
[ "addff", "" ] it will generalize to ".*"
if you give the following data
[ "129", "0x1ef" ], ["add", ""] it shall create "\d+|0x[0-9a-f]+", ".*"
etc. etc.

Uh-oh. I definitely like the idea to do something like this (i.e. to define
a quiz based on the above problem). But imho the
problem as you described it is in some way a machine learning problem - i.e. a quite hard one. I am really new here, so i am not sure about the usual level of the Ruby Quiz puzzles - but this seems to me a master thesis level problem rather than a quiz question :wink:

bw,
Peter

However, I should warn that I'm getting stricter and stricter about
making most submitters solve the problem before we run it. Have to
make sure it's a reasonable problem.

And you do well, doing so.
But I think, coming up with a reasonably understandable specification of the
problem and submitting it to Ruby Quiz has its own rewards.
(a) By understanding the problem, one might be very close to the solution.
      Actually it is a Conditio Sine Qua Non, but sometimes we are not aware
of it.
(b) Discussing the thing on-list before submission will get some useful
feadback.
(c) If you are out of ruby-quizzes (and that it may never happen) it might
be good to have some ideas on the shelf.
One shall not be disappointed if you reject it, but as I stated above it
would be for a good, well explained reason, so it should not be taken
personally.

Robert

···

On 4/6/06, James Edward Gray II <james@grayproductions.net> wrote:

On Apr 6, 2006, at 8:08 AM, Peter Szinek wrote:

--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein

That is exactly the question one has to ask himself.
Given the original post it can be generalized to /.*/.
To generalize one string does not make any sense, to generalize sets of sets
of strings each set defining an entity, that makes sense...
... and is increadibly difficult.

Cheers
Robert

···

On 4/6/06, Dominik Bathon <dbatml@gmx.de> wrote:

But what would you do with "1b23xy3sa"? Should that be /\w+/ or
/\d[a-z]\d+[a-z]+\d[a-z]+/?

Dominik

--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein