Whitespace string only

Hi --

On Thu, 23 Sep 2004, trans. (T. Onoma) wrote:

So which method is fastest?

Considering how common this can be, one would think it were a built-in String
method (encoded in c) already.

Not *everything* can be a core method :-) Also, the regex engine is
written in C.

David

--
David A. Black
dblack@wobblini.net

"trans. (T. Onoma)" <transami@runbox.com> writes:

So which method is fastest?

Considering how common this can be, one would think it were a built-in String
method (encoded in c) already.

T.

$ ruby whitespace.rb
            user     system      total        real
henrik  0.860000   0.110000   0.970000 (  0.977667)
evan    8.240000   2.220000  10.460000 ( 10.524390)
mikael  0.010000   0.000000   0.010000 (  0.014141)
tonoma  0.040000   0.000000   0.040000 (  0.041485)

Here's the benchmark:

require 'benchmark'

n = 50

$whitespace = " \n" * 1000

$nonwhitespace = $whitespace
$nonwhitespace[-2] = 'a'

class String
  def henrik
    strings_to_test = split("\n")
    whitespace = /^\s+$/
    is_whitespace_only = true
    strings_to_test.each{ |str|
      unless whitespace.match(str) or str.empty?
        is_whitespace_only = false
        break
      end
    }
    is_whitespace_only
  end

  def evan
    each_byte { |b| return false unless [9,10,32].include?(b) }
    true
  end

  def mikael
    self !~ /[^\s]/
  end

  def tonoma
    strip.length == 0
  end
end

Benchmark::bm do |x|
  test_algorithm = lambda do |id|
    x.report id.to_s do
      whitespace_tester = $whitespace.method id
      nonwhitespace_tester = $nonwhitespace.method id
      n.times { whitespace_tester.call }
      n.times { nonwhitespace_tester.call }
    end
  end
  
  test_algorithm.call :henrik
  test_algorithm.call :evan
  test_algorithm.call :mikael
  test_algorithm.call :tonoma
end

   rep = %r/^\s*$/o

/o is useless and you make something too complex for the regexp engine

Guy Decoux
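
For readers wondering about the /o remark: as far as I know, /o only tells Ruby to perform #{...} interpolation once for that regexp literal, so on a literal with nothing to interpolate it changes nothing. A minimal sketch of that behaviour follows; the two method names are made up for illustration and are not from the thread.

```ruby
# With /o the #{...} interpolation happens only the first time this
# literal is evaluated; later calls reuse that first regexp.
def match_with_o?(pattern, str)
  !(str =~ /#{pattern}/o).nil?
end

# Without /o the regexp is rebuilt from `pattern` on every call.
def match_without_o?(pattern, str)
  !(str =~ /#{pattern}/).nil?
end

p match_with_o?("a", "abc")     # first call fixes the pattern as /a/
p match_with_o?("z", "abc")     # still tests /a/; the "z" is ignored
p match_without_o?("z", "abc")  # really tests /z/
```

On a literal such as %r/^\s*$/ there is no interpolation, so the flag has nothing to cache.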

<Ara.T.Howard@noaa.gov> schrieb im Newsbeitrag
news:Pine.LNX.4.60.0409230933170.2168@harp.ngdc.noaa.gov...

>
> "ts" <decoux@moulon.inra.fr> schrieb im Newsbeitrag
> news:200409231451.i8NEphE08333@moulon.inra.fr...

>>
>> > if s.strip.empty?
>> > # the string is whitespace only
>>
>> svg% ruby -e 'a = " \000\000"; p "OK" if a.strip.empty?'
>> "OK"
>> svg%
>>
>> svg% ruby -e 'a = " \000\000 "; p "OK" if a.strip.empty?'
>> svg%
>
> Also I'd say the disadvantage of "a.strip.empty?" is that it creates a copy
> of the string (=> a new instance) which is generally slower than a simple
> regexp check.

i assumed you were correct - but this is surprising:

I have different results:

                      user     system      total        real
rx =~ s           0.031000   0.000000   0.031000 (  0.023000)
rx =~ bs          0.016000   0.000000   0.016000 (  0.022000)
rx !~ s           0.031000   0.000000   0.031000 (  0.025000)
rx !~ bs          0.016000   0.000000   0.016000 (  0.027000)
RX1 =~ s          0.031000   0.000000   0.031000 (  0.030000)
RX1 =~ bs         0.031000   0.000000   0.031000 (  0.030000)
RX2 !~ s          0.047000   0.000000   0.047000 (  0.039000)
RX2 !~ bs         0.032000   0.000000   0.032000 (  0.033000)
s =~ rx           0.031000   0.000000   0.031000 (  0.024000)
bs =~ rx          0.015000   0.000000   0.015000 (  0.024000)
s !~ rx           0.032000   0.000000   0.032000 (  0.026000)
bs !~ rx          0.031000   0.000000   0.031000 (  0.026000)
s =~ RX1          0.031000   0.000000   0.031000 (  0.031000)
bs =~ RX1         0.031000   0.000000   0.031000 (  0.031000)
s !~ RX2          0.032000   0.000000   0.032000 (  0.030000)
bs !~ RX2         0.031000   0.000000   0.031000 (  0.034000)
s.strip.empty?    0.062000   0.000000   0.062000 (  0.054000)
bs.strip.empty?   0.047000   0.000000   0.047000 (  0.050000)
                      user     system      total        real
rx =~ s           0.032000   0.000000   0.032000 (  0.022000)
rx =~ bs          0.015000   0.000000   0.015000 (  0.023000)
rx !~ s           0.031000   0.000000   0.031000 (  0.024000)
rx !~ bs          0.016000   0.000000   0.016000 (  0.025000)
RX1 =~ s          0.031000   0.000000   0.031000 (  0.030000)
RX1 =~ bs         0.032000   0.000000   0.032000 (  0.031000)
RX2 !~ s          0.031000   0.000000   0.031000 (  0.031000)
RX2 !~ bs         0.031000   0.000000   0.031000 (  0.032000)
s =~ rx           0.031000   0.000000   0.031000 (  0.025000)
bs =~ rx          0.032000   0.000000   0.032000 (  0.024000)
s !~ rx           0.015000   0.000000   0.015000 (  0.028000)
bs !~ rx          0.031000   0.000000   0.031000 (  0.026000)
s =~ RX1          0.032000   0.000000   0.032000 (  0.032000)
bs =~ RX1         0.015000   0.000000   0.015000 (  0.031000)
s !~ RX2          0.032000   0.000000   0.032000 (  0.031000)
bs !~ RX2         0.031000   0.000000   0.031000 (  0.033000)
s.strip.empty?    0.062000   0.000000   0.062000 (  0.051000)
bs.strip.empty?   0.047000   0.000000   0.047000 (  0.048000)
18:05:36 [ruby]:

Regards

    robert

empty-test.rb (1.3 KB)

On Thu, 23 Sep 2004, Robert Klemme wrote:

from man

   man isspace
   ...
          isspace()
                 checks for white-space characters. In the "C" and "POSIX"
                 locales, these are: space, form-feed ('\f'), newline ('\n'),
                 carriage return ('\r'), horizontal tab ('\t'), and vertical tab
                 ('\v').
   ...

from wikipedia

   In computer science, a whitespace (or a whitespace character) is any
   character which does not display itself but does take up space. For example,
   the character symbol " ", which is a blank space. Whitespaces are generated by
   the space bar or the Tab key; depending on context, a line-break generated by
   the Return key (Enter key) may be considered whitespace as well.

   Whitespace can also refer to a series of whitespace characters. Within source
   code, the size of whitespace is generally ignored by free-form languages. In
   the Python programming language whitespace and indentation are used for
   syntactical purposes.

   In many programming languages abundant use of whitespace, especially trailing
   whitespace at the end of lines, is considered a nuisance.

   [ \t]+ is a regular expression that matches whitespace.

   The term whitespace is based on the assumption that the background color used
   for text is white, and is thus confusing if it is not.

there is a long standing precedent for the meaning of whitespace. it does not
include non-printables since they do not take up any __space__. for that there
is isgraph(3)

-a
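
To make the isspace(3) definition quoted above concrete, here is a small sketch of a predicate that accepts only the six characters the man page lists for the "C"/"POSIX" locale. The constant and method names are hypothetical, and treating the empty string as whitespace-only is a choice made for this sketch, not something the thread settles.

```ruby
# The six characters isspace(3) accepts in the "C"/"POSIX" locale.
POSIX_SPACE = [" ", "\f", "\n", "\r", "\t", "\v"].freeze

# Hypothetical helper: true when every character of str is in POSIX_SPACE.
# An empty string passes, because `all?` is true for an empty collection.
def posix_whitespace_only?(str)
  str.each_char.all? { |c| POSIX_SPACE.include?(c) }
end

p posix_whitespace_only?(" \t\n")      # only listed characters
p posix_whitespace_only?(" \000\000")  # NUL is not in the list
```

Because the list is explicit, the result does not depend on how a given Ruby version's strip or \s treats NUL or vertical tab.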

On Fri, 24 Sep 2004, Markus wrote:

On Thu, 2004-09-23 at 08:34, Ara.T.Howard@noaa.gov wrote:

since when is NUL whitespace!? definitely against POLS.

???

Since when _isn't_ NUL whitespace? Despite the fact that it is sometimes
used as a delimiter (which is true for all the other whitespace characters
as well), it has no meaning, no glyph, does not show up when printed--it
doesn't even move the cursor/printhead. How much more "whitespace" can you
get?

--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

> $whitespace = " \n" * 1000
> $nonwhitespace = $whitespace
> $nonwhitespace[-2] = 'a'

doesn't this make the strings the same? Isn't $nonwhitespace just a reference to $whitespace?

Hi --

On Thu, 23 Sep 2004, Mikael Brockman wrote:

> def mikael
> self !~ /[^\s]/
> end

And you can even shave a few characters off:

  self !~ /\S/

with (by my measure) no ill effects.

David

--
David A. Black
dblack@wobblini.net
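
Written as a standalone predicate, the shortened test might look like the sketch below. The method name is an assumption for illustration; what counts as whitespace is whatever \S means to the regexp engine of the Ruby in use.

```ruby
# Hypothetical wrapper around the `self !~ /\S/` idea from this thread:
# true when no non-whitespace character can be found in str.
def whitespace_only?(str)
  str !~ /\S/
end

p whitespace_only?(" \n\t ")
p whitespace_only?(" a ")
p whitespace_only?("")
```

No copy of the string is made, and the search can stop at the first non-whitespace character.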

why useless? you mean because there's nothing to interpolate here - that's
true in this case...

what do you mean by complex? seems very simple?

-a

On Fri, 24 Sep 2004, ts wrote:

> rep = %r/^\s*$/o

/o is useless and you make something too complex for the regexp engine


[abundant online rebuttal snipped]

      Gosh. So I guess your answer to my question is "since the late
1970's or so", which, coincidentally, is most likely the last time I
checked. Back in the RS-232 paper tape and teletype days (ASCII)
whitespace was the same as non-printing, e.g. everything <= 040 and
sometimes 0177, while the EBCDIC definition was a little murkier.

      *laugh* I wonder if anything else has changed in the last thirty
years?

      Thanks,

          -- Markus

On Thu, 2004-09-23 at 09:44, Ara.T.Howard@noaa.gov wrote:

On Fri, 24 Sep 2004, Markus wrote:

> On Thu, 2004-09-23 at 08:34, Ara.T.Howard@noaa.gov wrote:
>
>> since when is NUL whitespace!? definitely against POLS.
>
> ???
>
> Since when _isn't_ NUL whitespace?

Jani Monoses <jani@iv.ro> writes:

> > $whitespace = " \n" * 1000
> > $nonwhitespace = $whitespace
> > $nonwhitespace[-2] = 'a'

doesn't this make the strings the same? Isn't $nonwhitespace just a
reference to $whitespace?

Er, yes. Duh. I haven't really awoken yet. Duping whitespace doesn't
make any real difference, though. More interesting results are found
when whitespace[0] = 'a'. With n = 10000 and $nonwhitespace ignored
entirely:

             user     system      total        real
henrik  14.140000   0.060000  14.200000 ( 14.684509)
evan     0.110000   0.030000   0.140000 (  0.146974)
mikael   0.040000   0.020000   0.060000 (  0.060418)
tonoma   3.840000   0.040000   3.880000 (  4.163754)
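
The aliasing problem pointed out above can be avoided by duplicating the string before modifying it. A minimal sketch of such a setup, using local variables and a hypothetical helper name rather than the globals of the original benchmark:

```ruby
# Build the two benchmark inputs as independent String objects.
# Plain assignment would only create a second name for the same object,
# so the modification below would change both.
def build_test_strings
  whitespace = " \n" * 1000
  nonwhitespace = whitespace.dup   # independent copy
  nonwhitespace[-2] = 'a'          # touches the copy only
  [whitespace, nonwhitespace]
end

ws, nws = build_test_strings
p ws.equal?(nws)   # are they the same object?
p ws !~ /\S/       # whitespace-only check on the untouched string
p nws !~ /\S/      # same check on the modified copy
```

Placing the non-whitespace character near the end or at the start of the string exercises very different paths, which is why the two sets of timings in this thread differ so much.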

> self !~ /[^\s]/

svg% ruby -rjj -e '/[^\s]/.dump'
Regexp /[^\s]/
  0 charset_not \011-\015 (0)
  1 end
svg%

  self !~ /\S/

svg% ruby -rjj -e '/\S/.dump'
Regexp /\S/
  0 charset_not \011-\012\014-\015 (0)
  1 end
svg%

Guy Decoux

what do you mean by complex? seems very simple?

For you, not for this "poor" regexp engine :-)

svg% ruby -rjj -e '" ".match(/^\s*$/)'
Regexp /^\s*$/
  0 begline
  1 on_failure_jump ==> 4
  2 charset \011-\012\014-\015 (0)
  3 maybe_finalize_jump ==> 1
  4 endline
  5 end
Fastmap supplied : \011-\012\014-\015

String << >> pos=0

  0 begline | |
  1 on_failure_jump | | >4[0]
  2 charset | |
  3 maybe_finalize_jump | |
  1 on_failure_jump | | >4[1]
  2 charset | |
  3 jump | |
  1 on_failure_jump | | >4[2]
  2 charset | |
  3 jump | |
  1 on_failure_jump | | >4[3]
  2 charset | | F4[3]
  4 endline | | SUCCESS
svg%

it really prefers to do this

svg% ruby -rjj -e '" ".match(/[^\S]/)'
Regexp /[^\S]/
  0 charset_not \000-\010\016-\037!-\377 (0)
  1 end
Fastmap supplied : \011-\015

String << >> pos=0

  0 charset_not | | SUCCESS
svg%

Guy Decoux
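
A side note that goes slightly beyond the dumps above: in Ruby, ^ and $ anchor at line boundaries, not at the ends of the string, so /^\s*$/ also succeeds on a string that merely contains one empty or blank line. The sketch below compares it with string anchors (\A and \z) and with the single-charset test; the helper names are made up for illustration.

```ruby
# Line-anchored pattern from the thread: succeeds if ANY line is blank.
def blank_line_anchored?(str)
  !(str =~ /^\s*$/).nil?
end

# String-anchored variant: the WHOLE string must be whitespace.
def blank_string_anchored?(str)
  !(str =~ /\A\s*\z/).nil?
end

# Single charset search, as preferred by the regexp engine above.
def blank_by_charset?(str)
  str !~ /\S/
end

sample = "abc\n\ndef"               # text with an empty line in the middle
p blank_line_anchored?(sample)      # the empty middle line satisfies ^\s*$
p blank_string_anchored?(sample)
p blank_by_charset?(sample)
```

So besides being more work for the engine, the line-anchored form answers a different question on multi-line input.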

... ask one simple question and you end up being involved in regexp black magic and a computer history discussion :-)

What is this 'jj' file that is being used to analyze these regexps?

I wasn't able to find it and it looks very interesting. Any hints?

regards,

Henrik

lol.

perhaps i was a bit too aggressive in that reply - your sarcasm gave me a good
laugh at myself. no harm intended.

kind regards.

-a

On Fri, 24 Sep 2004, Markus wrote:

On Thu, 2004-09-23 at 09:44, Ara.T.Howard@noaa.gov wrote:

On Fri, 24 Sep 2004, Markus wrote:

On Thu, 2004-09-23 at 08:34, Ara.T.Howard@noaa.gov wrote:

since when is NUL whitespace!? definitely against POLS.

???

Since when _isn't_ NUL whitespace?

[abundant online rebuttal snipped]

     Gosh. So I guess your answer to my question is "since the late
1970's or so", which, coincidentally, is most likely the last time I
checked. Back in the RS-232 paper tape and teletype days (ASCII)
whitespace was the same as non-printing, e.g. everything <= 040 and
sometimes 0177, while the EBCDIC definition was a little murkier.

     *laugh* I wonder if anything else has changed in the last thirty
years?


Maybe I've missed something most important :-)
Where can I find jj.rb?

regards
Karl-Heinz

In message "whitespace string only" on 23.09.2004, ts <decoux@moulon.inra.fr> writes:

svg% ruby -rjj -e '/\S/.dump'
Regexp /\S/
  0 charset_not \011-\012\014-\015 (0)
  1 end
svg%

Guy Decoux

Hi --

>> > self !~ /[^\s]/

svg% ruby -rjj -e '/[^\s]/.dump'
Regexp /[^\s]/
  0 charset_not \011-\015 (0)
  1 end
svg%

> self !~ /\S/

svg% ruby -rjj -e '/\S/.dump'
Regexp /\S/
  0 charset_not \011-\012\014-\015 (0)
  1 end

Ugh. So \013 (vertical tab) is defined as whitespace:

  irb(main):008:0> /[\s]/.match("\013")
  => #<MatchData:0x401d75a0>

and non-whitespace:

  irb(main):007:0> /\S/.match("\013")
  => #<MatchData:0x401d9c38>

Rather hard to deduce....

David

On Thu, 23 Sep 2004, ts wrote:

--
David A. Black
dblack@wobblini.net
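
To check on any given interpreter whether \s, [\s] and \S agree with each other, something like the sketch below can be used. The character list and names are assumptions chosen for illustration. On the interpreter used in this thread the vertical tab row would show the inconsistency described above; other versions may classify it differently, so no output is shown here.

```ruby
# Characters whose classification has been debated in this thread.
CANDIDATES = {
  "space"           => " ",
  "tab"             => "\t",
  "newline"         => "\n",
  "vertical tab"    => "\v",
  "form feed"       => "\f",
  "carriage return" => "\r",
  "NUL"             => "\0"
}.freeze

# Report how one character is treated by \s, [\s] and \S.
def classify(char)
  {
    s:         !(char =~ /\s/).nil?,
    bracket_s: !(char =~ /[\s]/).nil?,
    big_s:     !(char =~ /\S/).nil?
  }
end

CANDIDATES.each do |name, char|
  puts format("%-16s %s", name, classify(char).inspect)
end
```

A consistent engine reports the same value for \s and [\s], and the opposite value for \S, on every row.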

fascinating!

-a


"ts" <decoux@moulon.inra.fr> schrieb im Newsbeitrag
news:200409231619.i8NGJsO12622@moulon.inra.fr...

> what do you mean by complex? seems very simple?

For you, not for this "poor" regexp engine :-)

svg% ruby -rjj -e '" ".match(/^\s*$/)'
Regexp /^\s*$/
  0 begline
  1 on_failure_jump ==> 4
  2 charset \011-\012\014-\015 (0)
  3 maybe_finalize_jump ==> 1
  4 endline
  5 end
Fastmap supplied : \011-\012\014-\015

String << >> pos=0

  0 begline | |
  1 on_failure_jump | | >4[0]
  2 charset | |
  3 maybe_finalize_jump | |
  1 on_failure_jump | | >4[1]
  2 charset | |
  3 jump | |
  1 on_failure_jump | | >4[2]
  2 charset | |
  3 jump | |
  1 on_failure_jump | | >4[3]
  2 charset | | F4[3]
  4 endline | | SUCCESS
svg%

it really prefers to do this

svg% ruby -rjj -e '" ".match(/[^\S]/)'

Isn't this the same as /\s/? I mean /[^\S]/ means not not whitespace,
doesn't it?

Regexp /[^\S]/
  0 charset_not \000-\010\016-\037!-\377 (0)
  1 end
Fastmap supplied : \011-\015

String << >> pos=0

  0 charset_not | | SUCCESS
svg%

Err... Those two regexps you present do not yield an equivalent result.
Or did I miss something? Or did you mean to use /\S/ for the second one?

Regards

    robert

Err... Those two regexps you present do not yield an equivalent result.
Or did I miss something? Or did you mean to use /\S/ for the second one?

In this case the result is not important : this is just to show that
internally it can make something completely different.

Guy Decoux