String#split converts string args to regexes --?

Hello –

[This comes from some peripheral playing around having to do with the
various String#each threads, but (I promise! :-) it’s not directly on
that topic.]

This is something I’ve been discussing and investigating on #ruby-lang
with Martin Chase, Holden Glova, Michael Granger.

According to the docs I’ve seen, String#split can take either a string
or a regex as the separator/delimiter argument. However – very
surprisingly to me – it turns out that if you provide a string:

str.split(aString) …

and if aString is longer than one character, then aString is
automatically converted to a regex. Examples:

One-char strings, treated as strings:

irb(main):001:0> "abc.+def".split("e")
["abc.+d", "f"]
irb(main):002:0> "abc.+def".split(".")
["abc", "+def"]

strings of >1 char, converted to regexes (!)

irb(main):003:0> "abc.+def".split(".e")
["abc.+", "f"]
irb(main):004:0> "abc.+def".split(".+")
[]

This means also that strings without any regex special characters are
really “splitting on a string” only by coincidence. They’re really
splitting on a regex which happens to provide the results one would
have expected from splitting on a string. Thus, for example:

irb(main):003:0> "here there and everywhere".split("er")
["h", "e th", "e and ev", "ywh", "e"]

is really treating the string arg as a regex, as shown by:

irb(main):005:0> "here there and everywhere".split(".r")
["h", "e th", "e and ev", "ywh", "e"]

producing the same results.

Any insights on why #split does this? I found it quite surprising
when I discovered it, and I don’t know of anywhere where it’s
documented as working this way.
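
One way to check which interpretation a given interpreter applies is to compare the plain call against the two unambiguous Regexp spellings. This is only a sketch: `split_mode` is a hypothetical helper name, not part of Ruby, and it can only tell the two readings apart when they produce different results.

```ruby
# Hypothetical helper (not part of Ruby): reports how String#split treated
# a string separator by comparing against two explicit Regexp spellings.
def split_mode(str, sep)
  as_regex   = str.split(Regexp.new(sep))                 # metacharacters active
  as_literal = str.split(Regexp.new(Regexp.escape(sep)))  # metacharacters escaped
  actual     = str.split(sep)

  return :ambiguous if as_regex == as_literal  # separator has no special meaning
  return :regex     if actual == as_regex
  return :literal   if actual == as_literal
  :other
end

p split_mode("abc.+def", ".+")
p split_mode("here there and everywhere", "er")  # prints :ambiguous
```

On the interpreter discussed in this thread the first call would be expected to report :regex. Current Ruby documentation describes string separators as literal, so :literal would be expected there.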

David



David Alan Black
home: dblack@candle.superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

surprisingly to me – it turns out that if you provide a string:

str.split(aString) …

and if aString is longer than one character, then aString is
automatically converted to a regex.

yikes! off the cuff, i would think this would cause some significant
performance loss. yea? nay? not to mention some confusion.

and does #index work the same way?

~transami


“They that can give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety.”
– Benjamin Franklin

Hi,


In message “String#split converts string args to regexes – ?” on 02/07/09, David Alan Black dblack@candle.superlink.net writes:

Any insights on why #split does this? I found it quite surprising
when I discovered it, and I don’t know of anywhere where it’s
documented as working this way.

Insights? It’s inherited from Perl. Try:

% perl -le 'print join(":", split(".b", "abcabc"))'
:c:c
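
For comparison, a rough Ruby counterpart of that one-liner, with the pattern written as a Regexp so the result does not depend on how string separators are treated. `split_and_join` is a hypothetical helper name used only for this illustration.

```ruby
# Hypothetical helper: split on an explicit pattern and join the fields.
def split_and_join(str, pattern, glue = ":")
  str.split(pattern).join(glue)
end

# /.b/ matches "ab" twice, leaving an empty leading field and two "c" fields.
puts split_and_join("abcabc", /.b/)  # prints ":c:c"
```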

But if it turns out to be a bad inheritance (and I admit I’m starting
to feel so), I’m open to a new RCR.

						matz.

Hi –

Hi,

Any insights on why #split does this? I found it quite surprising
when I discovered it, and I don’t know of anywhere where it’s
documented as working this way.

Insights? It’s inherited from Perl. Try:

% perl -le 'print join(":", split(".b", "abcabc"))'
:c:c

But the distinction between 1-char and multi-char string arguments
doesn’t come from Perl, I think, and that’s the part I find so
unexpected.

But if it turns out to be a bad inheritance (and I admit I’m starting
to feel so), I’m open to a new RCR.

Changing it would make a lot of documentation instantly up-to-date,
which is the opposite of what usually happens with language changes
:-)

I would certainly advocate making it consistent: split(/re/) or
split("string"), leaving the choice to the programmer and not doing
any automatic conversion of strings.

David


On Tue, 9 Jul 2002, Yukihiro Matsumoto wrote:

In message “String#split converts string args to regexes – ?” on 02/07/09, David Alan Black dblack@candle.superlink.net writes:


David Alan Black
home: dblack@candle.superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

Hallo,

Insights? It’s inherited from Perl. Try:

% perl -le 'print join(":", split(".b", "abcabc"))'
:c:c

I guess Perl inherited it from awk. awk is a fairly simple language,
but I think it is at least consistent. Let me explain why it works this
way in awk.

awk doesn’t have types: not only variables but also values are untyped.
So it’s impossible to have a variable whose value is a regexp.

That’s why in some situations strings are interpreted as regular
expressions. So you write gsub("c+",…) and it’s the same as writing
gsub(/c+/,…). In these situations a regular expression is expected, and
if a string is found, it is converted to a regexp.

But in some situations, like the field separator parameter to split(),
it’s impossible to convert all strings to regular expressions, since
traditionally one-character separators were used. So one-char strings
have to retain their original meaning but, OTOH, there is no way to pass
a real regular expression value in awk.

That’s why, in these situations, a new rule was introduced:
a one-char string means a one-char field separator; longer strings mean regex
field separators.
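
The rule described above can be sketched in Ruby roughly as follows. This is an illustration under assumptions, not awk's or Ruby's actual implementation: `awk_style_split` is a hypothetical name, and the single-space case (split on runs of whitespace) is added here because it is awk's default field-separator behaviour; it is not mentioned in the message above.

```ruby
# Hypothetical sketch of the awk-style separator rule.
def awk_style_split(str, sep)
  if sep == " "
    # A single space means "split on runs of whitespace" (assumption added here).
    str.split(" ")
  elsif sep.length == 1
    # Any other one-character separator is taken literally.
    str.split(Regexp.new(Regexp.escape(sep)))
  else
    # Longer separators are compiled as regular expressions.
    str.split(Regexp.new(sep))
  end
end

p awk_style_split("a.b.c", ".")      # prints ["a", "b", "c"]
p awk_style_split("abc.+def", ".e")  # prints ["abc.+", "f"]
p awk_style_split("  a  b ", " ")    # prints ["a", "b"]
```

Every branch builds an explicit Regexp (or uses the whitespace mode), so the sketch behaves the same whether or not the underlying split converts strings on its own.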

But if it turns out to be a bad inheritance (and I admit I’m starting
to feel so), I’m open to a new RCR.

Well, thus I’m speaking about “indirect inheritance from awk” or about
“consistency with awk”.

  1. gsub() … the parameter has to be a regex, so I see no reason for
    accepting (and automatically converting) strings.
    As one cannot write "abc1bc".gsub(1,"X") and has to use at least
    "abc1bc".gsub(1.to_s,"X"), I’d propose that

    "abcabc".gsub("a","X")

    simply won’t work, requiring the programmer to use this:

    "abcabc".gsub(Regexp.new("a"),"X")

This also encourages writing more efficient programs, since it encourages
storing a compiled Regexp.

Another plus: compiling regexps from strings is often a source of
errors when the regexp contains backslashes. Thus encouraging the use of
/regexp/ instead of "regexp" is a good thing.

Is this possible, or will it break too many old programs?
Matz will decide. :-)

  2. split() … one-character strings and regular expressions are
    absolutely necessary. Automatic conversion of anything to a regexp obfuscates
    split(), I think. So I’d suggest either interpreting long strings as strings
    or forbidding them completely.

No doubt the current situation with split() is confusing.
But if you change split() to interpret longer strings literally and leave
gsub/sub as they are, the situation will be confusing again, I’m afraid:
gsub translates strings to regexps, while split doesn’t.

Thus I think either split() should be changed in a fairly restrictive manner
(accept only one-char strings or a Regexp) or gsub should not automatically
convert strings to regexps. I vote for the latter alternative.
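
The point made earlier in this message about backslashes can be illustrated with a short sketch. The constant names are made up for the example; the behaviour shown relies only on how double-quoted string escapes and Regexp.new work.

```ruby
# In a double-quoted string "\d" is not a recognised escape, so the
# backslash is dropped before Regexp.new ever sees the text.
FROM_STRING  = Regexp.new("\d+")   # compiles the pattern d+
FROM_ESCAPED = Regexp.new("\\d+")  # backslash doubled: compiles \d+
FROM_LITERAL = /\d+/               # regex literal: no double escaping needed

p FROM_STRING.source            # prints "d+"
p FROM_ESCAPED == FROM_LITERAL  # prints true
p "abc123"[FROM_LITERAL]        # prints "123"
p "abc123"[FROM_STRING]         # prints nil (there is no "d" in the input)
```

Keeping the compiled pattern in a constant, as done here, is one way to follow the "store a compiled Regexp" advice.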

Looking forward to comments,
Stepan


On Tue, 09 Jul 2002 05:20:12 GMT, Yukihiro Matsumoto wrote:

Tue, 9 Jul 2002 23:06:58 +0900, Stepan Kasal kasal@matsrv.math.cas.cz writes:

  1. gsub() … the parameter has to be regex, so I see no reason for
    accepting (and automatically converting) strings.

It’s a pity that “convert a string to regexp” means “compile it”.
I would expect it to mean “make a regexp recognizing that string”.
Now I must explicitly quote it. If I wanted to have a regexp, I would
have used // instead of “”…

Quoting and unquoting is ugly and error-prone - look at tcl or sh.
I don’t know tcl much but I saw results of poorly written scripts
caused by quoting.

  2. split() … one-character strings and regular expressions are
    absolutely necessary. Automatic conversion of anything to a regexp obfuscates
    split(), I think. So I’d suggest either interpreting long strings as strings
    or forbidding them completely.

Me too; preferably interpreting them as strings. A pity that we can’t
change gsub behavior.

I recently used s.gsub(/{ident}/) {ident} instead of the more
straightforward s.gsub('{ident}') {ident}.

And a pity that I can’t say s.gsub('{ident}', ident) because it will
work almost always, but break when ident contains backslashes.
Another quoting case that I don’t like, although I understand the
reasons… I’m just using the block version unless the replacement is
a constant.
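
A minimal sketch of the replacement-string problem described above. The helper names and the sample value of ident are invented for the illustration; the pattern is written as an escaped Regexp so the example does not depend on how string patterns are treated.

```ruby
# Replacement passed as a string: backslash sequences such as \0 are
# interpreted as references to the match.
def fill_with_string(template, ident)
  template.gsub(/\{ident\}/, ident)
end

# Replacement returned from a block: the value is inserted as-is.
def fill_with_block(template, ident)
  template.gsub(/\{ident\}/) { ident }
end

ident = 'a\0b'  # four characters: a, backslash, 0, b

puts fill_with_string("x {ident} y", ident)  # prints "x a{ident}b y"
puts fill_with_block("x {ident} y", ident)   # prints "x a\0b y"
```

In the string form, \0 is expanded to the matched text, which is why the block form is the safer choice when the replacement comes from data.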



__("< Marcin Kowalczyk
__/ qrczak@knm.org.pl
^^ Blog człowieka poczciwego.

David Alan Black wrote:

I would certainly advocate making it consistent: split(/re/) or
split(“string”), leaving the choice to the programmer and not doing
any automatic conversion of strings.

For consistency, what about gsub('string') and gsub(/re/) ? The former
also converts to regex.

Hi,

Thus I think either split() should be changed in a fairly restrictive manner
(accept only one-char strings or a Regexp) or gsub should not automatically
convert strings to regexps. I vote for the latter alternative.

sub/gsub

(a) prohibit string pattern
(b) string patterns as strings, without converting into regex.

split

(c) prohibit string pattern longer than 1.
(d) string patterns as strings, without converting into regex.

The meaningful consistent combination must be a-c and b-d. Which do
you prefer?

						matz.

In message “Re: String#split converts string args to regexes – ?” on 02/07/09, Stepan Kasal kasal@matsrv.math.cas.cz writes:

Hi –

David Alan Black wrote:

I would certainly advocate making it consistent: split(/re/) or
split(“string”), leaving the choice to the programmer and not doing
any automatic conversion of strings.

For consistency, what about gsub('string') and gsub(/re/) ? The former
also converts to regex.

I agree that it makes sense, pretty generally, for methods that take
string-or-regex arguments to make some distinction between them, and
not auto-convert the string. Hey, I’ve gone to all this trouble to
make my brain see this:

"abc."

as very different from this:

/abc./

so I want to make use of it :-)

At least, I don’t want auto-conversion without auto-escaping of regex
special characters. If I do this:

str.split("abc.")

I don’t actually care if, internally, it gets turned into:

str.split(/abc\./)

and, indeed, my expectation would be that these two would always
produce the same results. But I don’t want "abc." to behave like:

str.split(/abc./)

because (for me) that’s too “magic”.

Here’s a little drop-in implementation of String#split that addresses
this:

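class String
  # Keep the built-in implementation reachable under another name.
  alias :oldsplit :split

  def split(sep=$;, &block)
    begin
      # Escape regex metacharacters so a string separator matches literally.
      sep = Regexp.escape(sep)
    rescue TypeError
      # sep is a Regexp (or nil): pass it through unchanged.
    end
    oldsplit(sep, &block)
  end
end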

Mind you, my biggest priority in bringing all this up is still the
one-char vs. multi-char differential behavior of #split… and maybe
the more general thoughts would lead to great code breakage… but I
do think it’s interesting to think about it all.

David


On Tue, 9 Jul 2002, Joel VanderWerf wrote:


David Alan Black
home: dblack@candle.superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

Hi –

sub/gsub

(a) prohibit string pattern
(b) string patterns as strings, without converting into regex.

split

(c) prohibit string pattern longer than 1.

Why the one-character exception? I’m still not getting that.

(d) string patterns as strings, without converting into regex.

The meaningful consistent combination must be a-c and b-d. Which do
you prefer?

I prefer b-d.

David


On Wed, 10 Jul 2002, Yukihiro Matsumoto wrote:


David Alan Black
home: dblack@candle.superlink.net
work: blackdav@shu.edu
Web: http://pirate.shu.edu/~blackdav

– Nikodemus


On Wed, 10 Jul 2002, Yukihiro Matsumoto wrote:

(b) string patterns as strings, without converting into regex.
(d) string patterns as strings, without converting into regex.

b-d, IMHO (more versatile)


On Tue, 2002-07-09 at 10:30, Yukihiro Matsumoto wrote:

Hi,

In message “Re: String#split converts string args to regexes – ?” on 02/07/09, Stepan Kasal kasal@matsrv.math.cas.cz writes:

Thus I think either split() should be changed in a fairly restrictive manner
(accept only one-char strings or a Regexp) or gsub should not automatically
convert strings to regexps. I vote for the latter alternative.

sub/gsub

(a) prohibit string pattern
(b) string patterns as strings, without converting into regex.

split

(c) prohibit string pattern longer than 1.
(d) string patterns as strings, without converting into regex.

The meaningful consistent combination must be a-c and b-d. Which do
you prefer?

  					matz.


~transami

“They that can give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety.”
– Benjamin Franklin

Hi,


At Wed, 10 Jul 2002 01:30:25 +0900, Yukihiro Matsumoto wrote:

The meaningful consistent combination must be a-c and b-d. Which do
you prefer?

I prefer b-d too.


Nobu Nakada

Hallo Matz,

sub/gsub

(a) prohibit string pattern
(b) string patterns as strings, without converting into regex.

split

(c) prohibit string pattern longer than 1.
(d) string patterns as strings, without converting into regex.

The meaningful consistent combination must be a-c and b-d. Which do
you prefer?

[BTW: I wasn’t able to express my ideas so simply and clearly, thanks!]

I’ve also considered a-d. I wouldn’t say it’s inconsistent: sub/gsub require
a regexp, split accepts either a regexp or a string.
(And I was afraid that you’d refuse (b) because it changes the behaviour of
existing correct programs, which is worse than making them incorrect.)

But there were voices saying that (b) would be useful, since escaping may
be complicated. But alternatively, you could provide something like
Regexp.new_const(s) which would create a regexp matching the string s
(i.e. no character is interpreted as a metacharacter).
(I don’t know whether `new_const' is a good name, perhaps `new_string' or
even the shorter `new_s'.)

The above would enable a-d or even a-c. Sensitively written error messages
would direct people to express exactly what they want. It will annoy more
people than switching to b-d but it’ll annoy them less: it might be
frustrating to discover (after hours of debugging code which used to work)
that Matz has changed the definition of sub/gsub.
In other words, error messages are better than weird behaviour, even in an
interpreted language.

So, there are IMHO three possibilities:

(1) a-c with Regexp.new_const
(2) a-d with Regexp.new_const
(3) b-d

I’d be happy with any of these but I prefer (1).
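
As a rough illustration of what option (c) with a helpful error message might look like, here is a sketch. `strict_split` is a hypothetical name and the message text is invented; nothing here describes how Ruby actually behaves.

```ruby
# Hypothetical sketch of option (c): refuse string separators longer than
# one character and tell the caller how to say what they mean.
def strict_split(str, sep)
  if sep.is_a?(String) && sep.length > 1
    raise ArgumentError,
          "ambiguous separator #{sep.inspect}: " \
          "use Regexp.new(s) or Regexp.new(Regexp.quote(s))"
  end
  str.split(sep)
end

p strict_split("a,b,c", ",")      # prints ["a", "b", "c"]
p strict_split("abcabc", /.b/)    # prints ["", "c", "c"]

begin
  strict_split("abcabc", ".b")
rescue ArgumentError => e
  puts e.message
end
```

The caller gets an immediate error at the call site instead of a silently different result, which is the trade-off argued for above.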

Stepan

On Tue, 09 Jul 2002 16:35:03 GMT, Yukihiro Matsumoto matz@ruby-lang.org wrote:

sub/gsub

(a) prohibit string pattern
(b) string patterns as strings, without converting into regex.

split

(c) prohibit string pattern longer than 1.
(d) string patterns as strings, without converting into regex.

The meaningful consistent combination must be a-c and b-d. Which do
you prefer?

I’d go for (b) and (d) as well, but notes in the docs about
Regexp.new(Regexp.quote(s))
would be a useful cross-reference to have in the split and g?sub
sections, as well.
If there were another way to quote backslashes
http://www.rubycentral.com/faq/rubyfaq-9.html#ss9.18
I’d vote for that, too!

  					matz.
    Thank you,
    Hugh

On Wed, 10 Jul 2002, Yukihiro Matsumoto wrote:

sub/gsub

(a) prohibit string pattern
(b) string patterns as strings, without converting into regex.

split

(c) prohibit string pattern longer than 1.

Why the one-character exception? I’m still not getting that.

(d) string patterns as strings, without converting into regex.

The meaningful consistent combination must be a-c and b-d. Which do
you prefer?

I prefer b-d.

Me too, for what that’s worth.


=====

Use your computer to help find a cure for cancer: http://members.ud.com/projects/cancer/

Yahoo IM: michael_s_campbell



Nikodemus Siivola wrote:


On Wed, 10 Jul 2002, Yukihiro Matsumoto wrote:

(b) string patterns as strings, without converting into regex.
(d) string patterns as strings, without converting into regex.

One more vote for (b) (d).

Hi,


In message “Re: String#split converts string args to regexes – ?” on 02/07/10, David Alan Black dblack@candle.superlink.net writes:

(c) prohibit string pattern longer than 1.

Why the one-character exception? I’m still not getting that.

AWKish behavior is used too much to omit, and useful as well.

						matz.

Hi,


In message “Re: String#split converts string args to regexes – ?” on 02/07/10, Stepan Kasal kasal@matsrv.math.cas.cz writes:

But there were voices saying that (b) would be useful, since escaping may
be complicated. But alternatively, you could provide something like
Regexp.new_const(s) which would create a regexp matching string s
(ie. no character is interpreted as meta character).

I think you can use Regexp.quote for that purpose, i.e.

Regexp.new_const(s) = Regexp.new(Regexp.quote(s))
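
Spelled out as runnable Ruby, that equivalence might look like the sketch below. `literal_regexp` is a hypothetical stand-in for the proposed Regexp.new_const, which does not exist; Regexp.quote and Regexp.new are the real methods being combined.

```ruby
# Hypothetical stand-in for the proposed Regexp.new_const(s):
# a Regexp that matches the string s with no metacharacters active.
def literal_regexp(s)
  Regexp.new(Regexp.quote(s))
end

p literal_regexp(".+").source                # prints "\\.\\+"
p "abc.+def".split(literal_regexp(".+"))     # prints ["abc", "def"]
p(literal_regexp("error.log") =~ "error_log")  # prints nil
p(Regexp.new("error.log") =~ "error_log")      # prints 0
```

The last two lines show the difference in practice: the quoted pattern only matches a literal dot, while the unquoted one lets the dot match any character.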

						matz.

Hallo,


On Wed, 10 Jul 2002 09:01:11 GMT, Yukihiro Matsumoto matz@ruby-lang.org wrote:

Regexp.new_const(s) = Regexp.new(Regexp.quote(s))

I’ve forgotten about this, sorry.

So I vote for a-c, with an error message saying something like

“use either Regexp.new(s) or Regexp.new(Regexp.quote(s)) to clarify”

How many times do people use "error.log" or "error.log" instead of
"error\\.log" or /error\.log/ ?

Stepan