Perl regexp to ruby one conversion?

Pere_Noel1 · 23 March 2006 12:43

I have a Perl regexp:

$field =~
  m/^(
     [\x09\x0A\x0D\x20-\x7E] # ASCII
   | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
   | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
   | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
   | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
   | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
  )*$/x;

which detects whether $field consists of UTF-8 chars or not, and I'd like to
convert it into a Ruby regexp.

How do I do that?

···

--
une bévue

James_Edward_Gray_II · 23 March 2006 14:00

The expression looks fine to me. Did you try using it?

James Edward Gray II


Pere_Noel1 · 23 March 2006 14:38

The expression looks fine to me. Did you try using it?

Yes, but without the correct result. Here is my code:

field='&é§è!çàîûtybvn€'
utf8rgx=Regexp.new('m/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x')

the test :

flag=(field === utf8rgx)
p "flag = #{flag}"

the result being :
"flag = false"

I'm sure my encoding is UTF-8...

Maybe I've misunderstood "==="?

Because when trying:

truc = 'toto'
rgx=Regexp.new('^toto$')
flag=(truc === rgx)
p "flag = #{flag}"

I get:
# => "flag = false" (seems NOT OK to me)

flag=(truc =~ rgx)
p "flag = #{flag}"
# => "flag = 0" (seems OK to me)

···

James Edward Gray II <james@grayproductions.net> wrote:

--
une bévue

Ross_Bamford4 · 23 March 2006 14:50

You'll need to switch those around, as I showed in my response to your
other thread. flag will then be true, but unfortunately I think too
often:

utf8rgx === "onlyascii"
# => true

I think to do that kind of test you'd have to remove the first line
(matching ASCII chars) and not anchor the regexp with ^ and $.
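On the operand order mentioned above: a minimal sketch of why swapping the two sides of === changes the result. The variable names regexp_first, string_first and position are only illustrative. Regexp#=== runs a match, while String#=== is an equality test, so a String is never === to a Regexp.

```ruby
# Same pattern and string as in the 'toto' test from the question.
rgx  = Regexp.new('^toto$')
truc = 'toto'

regexp_first = (rgx === truc)   # Regexp#===: does the pattern match the string?
string_first = (truc === rgx)   # String#===: is the regexp equal to the string?
position     = (truc =~ rgx)    # index of the first match, or nil

p regexp_first   # true
p string_first   # false
p position       # 0
```

So in a case/when or a direct === test, the regexp goes on the left.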

Incidentally, I believe that the regexp above is best translated to Ruby
like this:

utf8rgx = /^(.)*$/u

You should also look into $KCODE (specifically $KCODE = 'u').

(Caveat to the above: I'm not much of an encoding expert at all).


--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

James_Edward_Gray_II · 23 March 2006 14:55

Try changing the Regexp.new('m/^( ... )*$/x') call to a regexp literal:

utf8rgx = / ... /x
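To make the difference concrete, here is a small sketch using a shortened, made-up pattern (lowercase letters only) in place of the full UTF-8 expression. The names wrong, literal and built are just for illustration. Inside a string passed to Regexp.new, Perl's m/ and /x are ordinary characters of the pattern, not delimiters and flags.

```ruby
# Perl-style delimiters inside the string become part of the pattern:
# this regexp looks for a literal "m/" and a literal "/x".
wrong = Regexp.new('m/^(
  [a-z]   # lowercase letters
)*$/x')

# Regexp literal: the x (extended) flag goes after the closing slash.
literal = /^(
  [a-z]   # lowercase letters
)*$/x

# Equivalent form with Regexp.new and an explicit option.
built = Regexp.new('^(
  [a-z]   # lowercase letters
)*$', Regexp::EXTENDED)

wrong_result   = (wrong === 'toto')
literal_result = (literal === 'toto')
built_result   = (built === 'toto')

p wrong_result     # false
p literal_result   # true
p built_result     # true
```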

Hope that helps.

James Edward Gray II


Pere_Noel1 · 23 March 2006 15:13

OK, thanks to all. Maybe it would be better to strip out all of the
HTML tags and keep only part of what's in the <body/>...


--
une bévue

Pere_Noel1 · 23 March 2006 15:13

OK, thanks, I see what you mean!


--
une bévue

Pere_Noel1 · 23 March 2006 16:38

The above regexp doesn't work as expected with Ruby. I've compared the
output for the same files with Perl and Ruby: Ruby always says "yes it
is UTF-8", where Perl says NO for an ISO-8859-1 encoded file... (even
after removing the first line, the first ^ and the last $).

So, for the time being, I'll call the Perl script from Ruby via the
command line...


--
une bévue

ts1 · 23 March 2006 16:47

The above regexp doesn't work as expected with Ruby. I've compared the
output for the same files with Perl and Ruby: Ruby always says "yes it
is UTF-8", where Perl says NO for an ISO-8859-1 encoded file... (even
after removing the first line, the first ^ and the last $).

moulon% cat b.rb
field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)

p utf8rgx =~ field
moulon%

moulon% file b.rb
b.rb: ISO-8859 text
moulon%

moulon% ruby b.rb
nil
moulon%

Guy Decoux

Pere_Noel1 · 23 March 2006 17:13

I don't understand your post )))

My rb file is UTF-8 encoded. At best, the answer I get from this
script is the reverse of what is wanted )))

Otherwise I always get true...


--
une bévue

ts1 · 23 March 2006 17:20

i don't understand your post )))

moulon% file b.rb
b.rb: ISO-8859 text
moulon%

my file is ISO-8859 encoded

moulon% ruby b.rb
nil
moulon%

and ruby says NO

output for the same files with perl and ruby, ruby says always "yes it

^^^^^^^

is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even

^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^

Guy Decoux


Pere_Noel1 · 23 March 2006 18:13

my file is ISO-8859 encoded

OK, I've made a "biso.rb", ISO encoded, and the result is OK:

ruby biso.rb

nil
"false"

with :
field='&éèàçôîûêâöïü'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)
p utf8rgx =~ field
p (utf8rgx === field).to_s

>> moulon% ruby b.rb
>> nil
>> moulon%

and ruby say NO

> output for the same files with perl and ruby, ruby says always "yes it
^^^^^^^
> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^

BUT, in "butf.rb" (a UTF-8 encoded file) I do:
field='&é§è!çàîûtybvn€'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)

p utf8rgx =~ field
p (utf8rgx === field).to_s

str=""
File.open("tut_exceptions.html").each { |l| str << l}

p utf8rgx =~ str
p (utf8rgx === str).to_s

and get :

ruby butf.rb

0
"true"
0
"true"

This file comes from:
<http://www.rubycentral.com/book/tut_exceptions.html>

with the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Note that Firefox agrees with "iso-8859-1", and so does my text
editor.

So it is seen as a UTF-8 file but isn't. Maybe this is due to the HTML
tags, so I stripped them out, saving tut_exceptions.html as
tut_exceptions.txt with no tags left, not even a single < or >, and
retried on that file:

ruby butf.rb
0
"true"
0
"true"

(I've only changed:
File.open("tut_exceptions.html").each { |l| str << l}

to:
File.open("tut_exceptions.txt").each { |l| str << l}
--------------------------^^^
)

However:

file tut_exceptions.txt

tut_exceptions.txt: UTF-8 Unicode English text

Maybe this isn't a good example, because most of the chars are US-ASCII
anyway, the file being written in English.

Over at:
<http://www.linux-france.org/>
which declares:
<meta http-equiv="Content-type" content="text/html;
charset=iso-8859-15"/>

and Firefox also agrees with that, the regexp gives:

ruby butf.rb

0
"true"
0
"true"

....

···

ts <decoux@moulon.inra.fr> wrote:
--
une bévue

Dominik_Bathon · 23 March 2006 21:37

Hi,

utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)

As I understand it, utf8rgx matches any string that is UTF-8, which includes pure ASCII strings (see the first line).
So it should match http://www.rubycentral.com/book/tut_exceptions.html.

First, here is a working version:

$ cat utf8tst.rb
utf8rgx = /\A(
    [\x09\x0A\x0D\x20-\x7E] # ASCII
  | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x

p utf8rgx === ARGF.read
$ curl -s http://www.linux-france.org/ | ruby utf8tst.rb
false
$ curl -s http://www.rubycentral.com/book/tut_exceptions.html | ruby utf8tst.rb
true
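A side note for readers on Ruby 1.9 or later, where strings carry an encoding: a rough sketch of a related check using String#valid_encoding?. This is not the method used in this thread, and it is not the same test byte for byte (the regexp also rejects most control characters). The helper name utf8_bytes? and the sample strings are made up for illustration.

```ruby
# Assumption: Ruby 1.9+ (String#force_encoding, String#valid_encoding?)
# and Ruby 2.0+ for String#b.
def utf8_bytes?(bytes)
  # dup so the caller's string keeps its own encoding tag
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

utf8_sample   = "caf\u00E9"     # "café" in UTF-8 (é is the bytes C3 A9)
latin1_sample = "caf\xE9".b     # the same text as ISO-8859-1 bytes
ascii_sample  = "plain ascii"   # pure ASCII is also valid UTF-8

utf8_ok   = utf8_bytes?(utf8_sample)
latin1_ok = utf8_bytes?(latin1_sample)
ascii_ok  = utf8_bytes?(ascii_sample)

p utf8_ok     # true
p latin1_ok   # false
p ascii_ok    # true
```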

Your problem was that in Perl, ^ and $ only match at the beginning and end of the string (Perl's $ also allows one trailing newline), but in Ruby they also match at the beginning and end of every line. So if a string contains, for example, a single empty line, it always matches:

irb(main):001:0> a = "xxx\n\nyyyy"
=> "xxx\n\nyyyy"
irb(main):002:0> a =~ /^(w)*$/
=> 4

So for beginning and end of string in ruby you need \A and \z:

irb(main):003:0> a =~ /\A(w)*\z/
=> nil
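The difference between the two pairs of anchors can also be checked in a script. A minimal sketch, with a made-up pattern and sample text that are not from the thread: the middle line contains digits the pattern does not allow, yet the ^...$ version still reports a match because one line on its own satisfies it.

```ruby
line_anchored   = /^[a-z]*$/     # ^ and $ match at every line boundary
string_anchored = /\A[a-z]*\z/   # \A and \z match only at the string's ends

text  = "abc\n123\ndef"          # second line is not lowercase letters
clean = "abc"

line_hit    = (line_anchored =~ text)     # first line alone satisfies the pattern
string_hit  = (string_anchored =~ text)   # the whole string must satisfy it
clean_match = (string_anchored =~ clean)

p line_hit      # 0
p string_hit    # nil
p clean_match   # 0
```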

Hope that helps,
Dominik

···

On Thu, 23 Mar 2006 19:13:51 +0100, "Une bévue" <pere.noel@laponie.com.invalid> wrote:

Pere_Noel1 · 23 March 2006 21:53

Hope that helps,

Fine, thanks a lot, it works. You explained very well why the Ruby version
works on a string like string="&éçàôûîêäë" but NOT on files, because of
the \n... Here is a script to compare the Perl output with the Ruby one:
def isFileUtf8Encoded(fileName)
  utf8rgx = /\A(
      [\x09\x0A\x0D\x20-\x7E] # ASCII
    | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
  )*\z/x
  str=""
  File.open("#{fileName}").each { |l| str << l}
  return (utf8rgx === str)
end

p isFileUtf8Encoded("lutte-ouvriere.html") # => false
p isFileUtf8Encoded("l_harmatan.html") # => false
p isFileUtf8Encoded("tut_exceptions.html") # => false
p isFileUtf8Encoded("butf.rb") # => true
p isFileUtf8Encoded("biso.rb") # => false

p `perl IsUTF-8.pl "lutte-ouvriere.html"` # => "0"
p `perl IsUTF-8.pl "l_harmatan.html"` # => "0"
p `perl IsUTF-8.pl "tut_exceptions.html"` # => "0"
p `perl IsUTF-8.pl "butf.rb"` # => "1"
p `perl IsUTF-8.pl "biso.rb"` # => "0"

p $KCODE # => "UTF8"

the perl script being (called from the ruby one) :

#!/usr/bin/perl

sub isFileUtf8Encoded
{
        my ($fn) = @_;
        $string='';
        # Report the name actually passed in ($fn) if the file cannot be opened
        open (F, $fn) || die "Unable to open file $fn : $!";
        while ($line = <F>) {
                $string.=$line;
        }
        close F;
        # True when the whole content is a sequence of well-formed UTF-8 chars
        $flag = ($string =~
          m/^(
             [\x09\x0A\x0D\x20-\x7E] # ASCII
           | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
           | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
           | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
           | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
           | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
           | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
           | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
          )*$/x);
        # A failed match yields an empty value; print 0 instead
        if( $flag != 1 )
        {
           return 0;
        }
        return $flag;
}
# $ARGV[0] is the first command-line argument (a scalar, not a slice)
print isFileUtf8Encoded($ARGV[0])

···

Dominik Bathon <dbatml@gmx.de> wrote:

--
une bévue
