String#split converts string args to regexes --?

Hi –

···

On Wed, 10 Jul 2002, Yukihiro Matsumoto wrote:

In message “Re: String#split converts string args to regexes – ?” on 02/07/10, David Alan Black dblack@candle.superlink.net writes:

(c) prohibit string pattern longer than 1.

Why the one-character exception? I’m still not getting that.

AWKish behavior is used too much to omit, and useful as well.

I’m probably OT here, but I’m very curious about this… What is it
in AWK that behaves this way (one-char string as string, multi-char
converted to regex)? I think there’s some piece of the puzzle I’m not
seeing. (But I still vote for b/d :-)

David


David Alan Black
home: dblack@candle.superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

Hi –

Hallo,

Regexp.new_const(s) = Regexp.new(Regexp.quote(s))

I’ve forgotten about this, sorry.

So I vote for a-c, with an error message saying something like

“use either Regexp.new(s) or Regexp.new(Regexp.quote(s)) to clarify”
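For readers following along, the difference between the two spellings in the suggested error message can be sketched roughly as below. This is only an illustration: the sample strings are made up, and Regexp.quote is the standard synonym of Regexp.escape.

```ruby
s = "error.log"

as_pattern = Regexp.new(s)                # "." matches any character
as_literal = Regexp.new(Regexp.quote(s))  # "." matches only a literal dot

puts as_pattern.source   # error.log
puts as_literal.source   # error\.log

puts as_pattern.match?("errorXlog")   # true
puts as_literal.match?("errorXlog")   # false
puts as_literal.match?("error.log")   # true
```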

Quick reminder, for reference :-)

sub/gsub

(a) prohibit string pattern
(b) string patterns as strings, without converting into regex.

split

(c) prohibit string pattern longer than 1.
(d) string patterns as strings, without converting into regex.

How many times do people use “error.log” or ‘error.log’ instead of
“error\.log” or /error\.log/ ?

Well… something a lot like that is what started this thread :-)
It’s partly a matter of documentation stating that String#split takes
a string or a regex, without (anywhere I’ve seen) noting that the
string has to be only one character long or else has to follow regex
syntax (because it’s really a regex).

I’m not sure what the disadvantage of allowing a string argument is.
It’s not mandatory – you can always use a regex (or even
Regexp.new(s) :-) – so allowing strings (without auto-conversion to
regex) doesn’t take away any functionality.
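As a rough sketch of what is being debated, the rule described above (one-character string taken literally, longer string compiled as a regex) can be emulated and compared with literal string splitting, which is how String#split treats string patterns in current Ruby versions. The helper name legacy_split is hypothetical and is not how the interpreter implements anything.

```ruby
# Hypothetical helper emulating the rule described in this thread:
# a one-character string is taken literally, a longer string is
# compiled as a regex. Not Ruby's actual implementation.
def legacy_split(str, pattern)
  pattern = Regexp.new(pattern) if pattern.is_a?(String) && pattern.length > 1
  str.split(pattern)
end

text = "a..b..c"

p legacy_split(text, "..")    # ".." acts as /../, i.e. any two characters
p text.split("..")            # literal "..": ["a", "b", "c"]
p legacy_split("a.b.c", ".")  # one character, literal: ["a", "b", "c"]
```

The first two calls give different answers for the same argument, which is the surprise the thread started from.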

David

···

On Wed, 10 Jul 2002, Stepan Kasal wrote:

On Wed, 10 Jul 2002 09:01:11 GMT, Yukihiro Matsumoto matz@ruby-lang.org wrote:


David Alan Black

Hi,

···

On Wed, 10 Jul 2002 12:41:08 GMT, David Alan Black wrote:

I’m probably OT here, but I’m very curious about this… What is it
in AWK that behaves this way (one-char string as string, multi-char
converted to regex)?

the special variable FS (field separator), the third parameter to the
function split() and, in awks which have it (e.g. GNU awk), the record
separator (RS). See “info gawk” for details.

Feel free to contact me by e-mail, or post a followup to this post.
( Followup-To: comp.lang.awk )

Stepan Kasal

awk being typeless can’t distinguish between a regex vs. a string,
so a very common split on | in pipe-delimited files would be denoted
as: split($0, fields, “|”) … to have to escape | as \| was probably
a nuisance.

Generally, when you’re splitting on a single character it’s not meant
to be a regular expression. When you’re splitting on more than a
single character, the odds are it’s a pattern – if it’s not, simply
escape the regex metachars …

This “odd” behavior comes from the most useful case, not the most
logical or predictable case. Perhaps the idea is “inconvenience the
least amount of uses of split() at the cost of some edge-case
confusion.”

I do agree, though, that in a language like Ruby where you can
differentiate between a String and a Regex that this “magic” is
unnecessary.
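The pipe-delimited case above translates to Ruby roughly as follows. This is a small sketch with a made-up record; it only shows why an unescaped | is awkward once the separator is read as a regex.

```ruby
record = "alice|42|admin"

p record.split("|")    # one-character string, taken literally
p record.split(/\|/)   # regex with the pipe escaped, same fields
p record.split(/|/)    # unescaped: an empty alternation that matches
                       # the empty string, so every character is split off
```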

– Dossy



Dossy Shiobara mail: dossy@panoptic.com
Panoptic Computer Network web: http://www.panoptic.com/
“He realized the fastest way to change is to laugh at your own
folly – then you can let go and quickly move on.” (p. 70)

I’m not sure what the disadvantage of allowing a string argument is.
It’s not mandatory – you can always use a regex (or even
Regexp.new(s) :-) – so allowing strings (without auto-conversion to
regex) doesn’t take away any functionality.

and, if implemented without regexps, could easily boost performance,
which IMHO is an important bonus considering speed is probably the
primary site against ruby in comparison to perl and python.

~transami



~transami

“They that can give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety.”
– Benjamin Franklin

Hello,
the reason is simple: the one-char separator was there long before
they thought that regex separator might be useful. So the idea is
“backward compatibility” or “for a historical reason”.

Stepan

···

On Wed, 10 Jul 2002 14:15:03 GMT, Dossy dossy@panoptic.com wrote:

On 2002.07.10, David Alan Black dblack@candle.superlink.net wrote:

I’m probably OT here, but I’m very curious about this… What is it
in AWK that behaves this way (one-char string as string, multi-char
converted to regex)?

This “odd” behavior comes from the most useful case, not the most
logical or predictable case. Perhaps the idea is “inconvenience the
least amount of uses of split() at the cost of some edge-case
confusion.”

Hallo,

Quick reminder, for reference :-)

sub/gsub

(a) prohibit string pattern
(b) string patterns as strings, without converting into regex.

split

(c) prohibit string pattern longer than 1.
(d) string patterns as strings, without converting into regex.

So I vote for a-c, with an error message saying something like
“use either Regexp.new(s) or Regexp.new(Regexp.quote(s)) to clarify”

[…]

It’s partly a matter of documentation stating that String#split takes
a string or a regex, without (anywhere I’ve seen) noting that the
string has to be only one character long or else has to follow regex
syntax (because it’s really a regex).

I’m not sure what the disadvantage of allowing a string argument is.

The disadvantage is that people will often write “error[.]log”, hoping
that this will match the string “error.log”. The reason for writing this
is not only the habit inherited from other languages, it may be simply
a program which used to work in a previous version of Ruby (though it
should not, according to the docs ;-).
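A small sketch of the “error[.]log” concern, using a made-up input line: the same argument splits the line when it is read as a regex and leaves it untouched when it is read as a plain string, and no error is raised either way.

```ruby
line = "see error.log for details"
pattern = "error[.]log"

as_regex  = line.split(Regexp.new(pattern))  # [.] is a character class
as_string = line.split(pattern)              # brackets are literal text

p as_regex    # ["see ", " for details"]
p as_string   # ["see error.log for details"]
```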

How would you like if a review of a new version of Ruby said this:

We experienced non-complete backward compatibility with this
version, at least.  Our old programs just stopped working, though
no error message appeared.  We traced this down to a
(mis-)behaviour of the split() function.   [...]
We recommend reviewing all your programs carefully before upgrading
Ruby, especially in a production environment.

I know, similar reasoning applies to my solution (a-c with an error message),
at least to an extent, because the error is detected in runtime, so it may
last some time till the surprise appears. But I think crashing programs
is better than changing the semantics of old programs in a weird way.

It’s not mandatory – you can always use a regex (or even
Regexp.new(s) :-) – so allowing strings (without auto-conversion to
regex) doesn’t take away any functionality.

I think that arguments like this one lead to non-elegant non-readable
languages.

Stepan

···

On Wed, 10 Jul 2002, Stepan Kasal wrote:
On Wed, 10 Jul 2002 12:35:54 GMT, David Alan Black wrote:

Hi,

···

In message “Re: String#split converts string args to regexes – ?” on 02/07/10, Tom Sawyer transami@transami.net writes:

and, if implemented without regexps, could easily boost performance,
which IMHO is an important bonus considering speed is probably the
primary site against ruby in comparison to perl and python.

regex is highly optimized for searching. search using regex
(e.g. str.index(/pat/)) is almost always faster than search using
strings (e.g. str.index(“pat”)).

						matz.

stepan,

i think you should read this:

Imagine this…

There is a 16compat.rb file in the ruby1.8 lib directory. By requiring
it, all language features that have been changed (not added) in 1.8
assume 1.6 behaviour. You could put a require ‘compat16’ in your
script, and you would be sure that, if it ran under 1.6, it will run
for you. You could even set that system-wide to be sure (e.g. ‘ruby’
is really a script containing ‘ruby -rcompat1.6’) and require
‘compat18’ for new projects. Most important, by doing so on the new
project, whatever change will happen in ruby2.0 you already know your
script will still run because compat18.rb in ruby2.0 will take care of
it…

Just a thought.

Massimiliano

~transami



~transami


Hello –

I’m not sure what the disadvantage of allowing a string argument is.

The disadvantage is that people will often write “error[.]log”, hoping
that this will match the string “error.log”. The reason for writing this
is not only the habit inherited from other languages, it may be simply
a program which used to work in a previous version of Ruby (though it
should not, according to the docs ;-).

How would you like if a review of a new version of Ruby said this:

We experienced non-complete backward compatibility with this
version, at least. Our old programs just stopped working, though
no error message appeared. We traced this down to a
(mis-)behaviour of the split() function. […]
We recommend reviewing all your programs carefully before upgrading
Ruby, especially in a production environment.

Ummm, yes: there is (self-evidently) a backward compatibility issue.
But matz has indicated that he’s considering having these methods take
strings arguments (of any length), and I think it’s safe to assume
that he’s on top of the compatibility question :-)

It’s not mandatory – you can always use a regex (or even
Regexp.new(s) :-) – so allowing strings (without auto-conversion to
regex) doesn’t take away any functionality.

I think that arguments like this one lead to non-elegant non-readable
languages.

That’s a bit of a conversation-stopper… :-) But anyway: I’m not sure
what’s non-readable about string.split(“blah”) where “blah” is a
string. I’d actually tend to argue that having to recognize “blah” as
a regex (despite the presence of a string constructor) is a
readability hurdle.

David

···

On Thu, 11 Jul 2002, Stepan Kasal wrote:

On Wed, 10 Jul 2002 12:35:54 GMT, David Alan Black wrote:


David Alan Black

Hi,

···

In message “Re: String#split converts string args to regexes – ?” on 02/07/11, Stepan Kasal kasal@matsrv.math.cas.cz writes:

How would you like if a review of a new version of Ruby said this:

We experienced non-complete backward compatibility with this
version, at least. Our old programs just stopped working, though
no error message appeared. We traced this down to a
(mis-)behaviour of the split() function. […]
We recommend reviewing all your programs carefully before upgrading
Ruby, especially in a production environment.

I know, similar reasoning applies to my solution (a-c with an error message),
at least to an extent, because the error is detected in runtime, so it may
last some time till the surprise appears. But I think crashing programs
is better than changing the semantics of old programs in a weird way.

You will be warned if you specify a string that contains regexp
metacharacters. So you will not see a silent behavior change without
notice.

						matz.
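One way to picture the warning described above is the sketch below. It is an assumption-laden illustration, not the interpreter's code: the helper name split_with_warning, the metacharacter list and the message text are all hypothetical.

```ruby
# Characters treated as regexp metacharacters for this sketch (an assumption).
REGEX_METACHARS = /[.*+?^$|\\()\[\]{}]/

# Hypothetical wrapper: warn when a string pattern looks like a regex,
# then split on the pattern as given.
def split_with_warning(str, pattern)
  if pattern.is_a?(String) && pattern.match?(REGEX_METACHARS)
    warn "string pattern #{pattern.inspect} contains regexp metacharacters"
  end
  str.split(pattern)
end

p split_with_warning("error.log backup", "error.log")  # warns on stderr
p split_with_warning("a blah b", "blah")               # no warning
```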

matz,

Example #1:

s = "xyz" * 1000000
a1 = s.split(/y/)

Example #2:

s = "xyz" * 1000000
a1 = s.split('y')

on my system:

Example #1: 3.69s cumulative
Example #2: 0.73s cumulative
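To repeat this comparison on another machine, something like the following sketch could be used. It relies on the standard Benchmark module; the timings will differ by Ruby version and hardware, so no particular numbers should be expected.

```ruby
require "benchmark"

s = "xyz" * 1_000_000
with_regex = with_string = nil

t_regex  = Benchmark.realtime { with_regex  = s.split(/y/) }
t_string = Benchmark.realtime { with_string = s.split("y") }

puts format("split(/y/): %.3fs", t_regex)
puts format("split('y'): %.3fs", t_string)
puts "same result: #{with_regex == with_string}"
```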

~transami



~transami


Hi,

How would you like if a review of a new version of Ruby said this:

We experienced non-complete backward compatibility with this
version, at least.  Our old programs just stopped working, though

[…]

Ummm, yes: there is (self-evidently) a backward compatibility issue.
But matz […] he’s on top of the compatibility question :-)

I agree that writing such a long story about such a trivial thing was not
very bright. I apologize to matz and all for wasting their time.

It’s not mandatory – you can always use a regex (or even
Regexp.new(s) :-) – so allowing strings (without auto-conversion to
regex) doesn’t take away any functionality.

I think that arguments like this one lead to non-elegant non-readable
languages.

That’s a bit of a conversation-stopper… :-)
But anyway: I’m not sure what’s non-readable about string.split(“blah”)
where “blah” is a string.

:-) Well, it’s impossible to disagree :-)
You’re right and I apologize again.

I’d actually tend to argue that having to recognize “blah” as
a regex (despite the presence of a string constructor) is a
readability hurdle.

I agree. This variant is out of the question; it wasn’t even in Matz’s
(a) (b) (c) (d) list.

Stepan

···

On Thu, 11 Jul 2002 11:25:46 GMT, David Alan Black wrote:

On Thu, 2002-07-11 at 02:30, Stepan Kasal wrote:

Hallo,

You will be warned if you specify a string that contains regexp
metacharacters. So you will not see silent behavior change without
notices.

sorry, Matz, for bothering you again :-(

At first, this seems right. But imagine:
When I experience such a warning, I usually search for a way to avoid it.
Thus I resort to Regexp.new(Regexp.quote(s)) because I want to get rid
of the warning.

The only situations in which one can safely use split(string,…) are those
in which they are sure that the string may not contain metacharacters. But
this almost always means that the string is entered literally in the program.
And in these situations one can usually write /blah/ instead of “blah”.

So I think that the warning may actually discourage use of the
split(string, …) form.

For these reasons, I’d suggest that the warning is issued only in $VERBOSE
or $DEBUG mode; i.e. -w or -d. (I’m not sure which one, but it’s probably
clear to most people here. It would be “awk --lint” in GNU awk.)

And, of course, I like the idea of ‘require “compat1.6”’, as presented
in ruby-talk:43938 and pushed by Tom Sawyer transami@transami.net several
times.

Stepan

···

On Thu, 11 Jul 2002 15:31:56 GMT, Matz wrote:

  s = "xyz" * 1000000
  a1 = s.split('y')

It doesn't use String#index in this case

···

On Wed, 2002-07-10 at 09:24, Yukihiro Matsumoto wrote:

Hi,

In message "Re: String#split converts string args to regexes -- ?" on 02/07/10, Tom Sawyer <transami@transami.net> writes:

>and, if implemented without regexps, could easily boost performance,
>which IMHO is an important bonus considering speed is probably the
>primary site against ruby in comparison to perl and python.

regex is highly optimized for searching. search using regex
(e.g. str.index(/pat/)) is almost always faster than search using
strings (e.g. str.index("pat")).

            ^^^^^^^^^^^^^^^^^^^^^

Guy Decoux

Hi,

sorry, Matz, for bothering you again :-(

Not at all.

For these reasons, I’d suggest that the warning is issued only in $VERBOSE
or $DEBUG mode; ie. -w or -d. (I’m not sure which one, but it’s probably
clear to most people here. It would be “awk --lint” in GNU awk.)

You will always see the warning for a while (during 1.7.x). This is
a transient state.

						matz.
···

In message “Re: String#split converts string args to regexes – ?” on 02/07/12, Stepan Kasal kasal@matsrv.math.cas.cz writes:

ummm…the thread is called String#split converts…

but i am surprised with my #index results: it is about 4x as fast with a
regexp versus a string.

how is this possible? split is around 4x faster with a string, but
index is 4x as slow?

even if regexp is highly optimized, it must still entail some overhead
that a pure string can do without, but perhaps the difference is quite
small. yet, the split example makes me wonder.

what’s going on here?

~transami



~transami


how is this possible? split is around 4x faster with a string, but
index is 4x as slow?

Your string has len == 1

Guy Decoux

Hi,

···

In message “Re: String#split converts string args to regexes – ?” on 02/07/11, Tom Sawyer transami@transami.net writes:

but i am surprised with my #index results: it is about 4x as fast with a
regexp versus a string.

how is this possible? split is around 4x faster with a string, but
index is 4x as slow?

Remember that String#split only searches literally for a string one
character long, so in that case “split” only needs to find and match
the first byte.

regex match uses Boyer Moore search for exact string match (if it’s
possible), whereas string match uses simple linear search.
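To make the contrast concrete, here is a minimal sketch of the two search styles in plain Ruby. It is not the interpreter's C code: the skip-table search shown is the Horspool simplification of Boyer-Moore, it works on bytes, and both function names are made up for the example.

```ruby
# Simple linear search: try every starting offset in turn.
def naive_index(text, pat)
  m = pat.bytesize
  (0..text.bytesize - m).each do |i|
    return i if text.byteslice(i, m) == pat
  end
  nil
end

# Boyer-Moore-Horspool: compare from the end of the window and use a
# skip table to jump ahead by more than one byte where possible.
def horspool_index(text, pat)
  m = pat.bytesize
  n = text.bytesize
  return 0 if m.zero?
  return nil if m > n

  tb = text.bytes
  pb = pat.bytes
  # For each pattern byte except the last: distance from its last
  # occurrence to the end of the pattern. Unknown bytes skip m.
  shift = Hash.new(m)
  pb[0...-1].each_with_index { |b, i| shift[b] = m - 1 - i }

  i = 0
  while i <= n - m
    j = m - 1
    j -= 1 while j >= 0 && tb[i + j] == pb[j]
    return i if j < 0
    i += shift[tb[i + m - 1]]
  end
  nil
end

p naive_index("hello world", "world")     # 6
p horspool_index("hello world", "world")  # 6
```

With a one-byte pattern the skip table never allows a jump of more than one byte, so the skipping search has no advantage over the linear one.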

						matz.

Your string has len == 1

yes, quite. that’s the only way to get split to not convert to a regular
expression, and thus see the difference between the two.



~transami
