Regexp question

Mark_Probert3 · 30 September 2004 21:15

Hi, Rubyists.

What is the best way to split fields on ';' when the separator can be
escaped with a backslash, as in:

  s = 'a;b;c\;;d;'
  s.split(/???;/)
  => ["a", "b", "c\;", "d"]

Or is it best to use s.each_byte and do it by hand?

--
-mark. (probertm @ acm dot org)

Simon_Strandgaard1 · 30 September 2004 21:29

How about something ala

irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
=> ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]

On Thursday 30 September 2004 23:15, Mark Probert wrote:

Hi, Rubyists.

What is the best way of attacking field split on ';' when the string looks
like:

  s = 'a;b;c\;;d;'
  s.split(/???;/)
  => ["a", "b", "c\;", "d"]

Or is it best to use s.each_byte and do it by hand?

--
Simon Strandgaard

Brian_SchrA_der · 30 September 2004 21:34

Mark Probert wrote:

Hi, Rubyists.

What is the best way of attacking field split on ';' when the string looks like:

  s = 'a;b;c\;;d;'
  s.split(/???;/)
  => ["a", "b", "c\;", "d"]

Or is it best to use s.each_byte and do it by hand?

Normally this would call for a fixed-width negative lookbehind,

/(?<!\\);/

but as far as I know it's not included in the Ruby regexp engine.

But for further clarification:
How should 'a;b\\;;c' be split?
If backslashes can be escaped (and you'd want that, because otherwise you can't have a field "b\"), it's more difficult.
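
For readers on Ruby 1.9 or later: the regexp engine there does support negative lookbehind, so the pattern above can be tried directly. The following is only a minimal sketch using the sample strings from this thread; `split_unescaped` is a hypothetical helper name. It also illustrates the escaped-backslash problem, since the lookbehind only inspects the single character before the ';'.

```ruby
# Hypothetical helper: split on ';' unless it is directly preceded by a backslash.
# Needs a regexp engine with lookbehind support (Ruby 1.9+).
def split_unescaped(str)
  str.split(/(?<!\\);/)
end

# The original example; String#split drops the trailing empty field.
p split_unescaped('a;b;c\;;d;')   # => ["a", "b", "c\\;", "d"]

# Raw text a;b\\;;c (an escaped backslash followed by two separators).
# In a single-quoted literal each \\ is one backslash, hence four here.
p split_unescaped('a;b\\\\;;c')   # => ["a", "b\\\\;", "c"]
```

The second call glues the escaped backslash and the following ';' into one field, which is exactly the ambiguity raised above: a one-character lookbehind cannot tell an escaping backslash from an escaped one.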

And maybe the CSV library can help you here.

regards,

Brian

--
Brian Schröder
http://ruby.brian-schroeder.de/

Florian_Gross · 30 September 2004 23:10

Mark Probert wrote:

Hi, Rubyists.

Moin!

What is the best way of attacking field split on ';' when the string looks like:

  s = 'a;b;c\;;d;'
  s.split(/???;/)
  => ["a", "b", "c\;", "d"]

Or is it best to use s.each_byte and do it by hand?

This works (even with escaped escape characters), but you might be better off doing it by hand to keep complexity low:

irb(main):025:0> str = "hello;world;foo\\;bar;no escape\\\\;blar"; puts str
hello;world;foo\;bar;no escape\\;blar
=> nil
irb(main):026:0> str.scan(/(?:(?!\\).(?:\\{2})*\\;|[^;])+/).map { |str| str.gsub(/\\(.)/, '\1') }
=> ["hello", "world", "foo;bar", "no escape\\", "blar"]

Regards,
Florian Gross
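
As a rough illustration of the "by hand" route mentioned above, here is a minimal sketch of a character-by-character splitter. The name `split_escaped` and its default arguments are assumptions made for this example, not something from the thread. It treats a backslash followed by any character as that literal character and keeps empty fields.

```ruby
# Hypothetical helper (not from the thread): split +str+ on +sep+,
# treating +esc+ followed by any character as that literal character.
def split_escaped(str, sep = ';', esc = '\\')
  fields = [''.dup]          # start with one empty, mutable field
  escaped = false
  str.each_char do |ch|
    if escaped
      fields.last << ch      # escaped character is taken literally
      escaped = false
    elsif ch == esc
      escaped = true         # drop the escape, remember it for the next char
    elsif ch == sep
      fields << ''.dup       # unescaped separator starts a new field
    else
      fields.last << ch
    end
  end
  fields.last << esc if escaped  # keep a dangling escape at end of input
  fields
end

p split_escaped("hello;world;foo\\;bar;no escape\\\\;blar")
# => ["hello", "world", "foo;bar", "no escape\\", "blar"]
```

Unlike String#split, this version keeps a trailing empty field, so 'a;b;c\;;d;' yields five fields with an empty string last.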

Simon_Strandgaard1 · 30 September 2004 21:42

Maybe this one is better?

irb(main):001:0> "aa;bbb\\;;abc;;d\\\\;e;f".scan(/(?:\A|;)((?:\\[^.]|[^;])*)/) { p $1 }
"aa"
"bbb\\;"
"abc"
""
"d\\\\"
"e"
"f"
=> "aa;bbb\\;;abc;;d\\\\;e;f"
irb(main):002:0>

On Thursday 30 September 2004 23:29, Simon Strandgaard wrote:

On Thursday 30 September 2004 23:15, Mark Probert wrote:
> Hi, Rubyists.
>
> What is the best way of attacking field split on ';' when the string
> looks like:
>
> s = 'a;b;c\;;d;'
> s.split(/???;/)
> => ["a", "b", "c\;", "d"]
>
> Or is it best to use s.each_byte and do it by hand?

How about something ala

irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
=> ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]

--
Simon Strandgaard

Mark_Probert3 · 30 September 2004 21:50

Hi ..

How about something ala

irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
=> ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]

Thanks! That is close enough:

irb(main):019:0> s.scan(/(?:\\[^.]|[^;])*/).each do |it|
irb(main):020:1*   next if it.empty?
irb(main):021:1>   puts " --> #{it}"
irb(main):022:1> end
 --> a is a word
 --> b is too
 --> c\; for fun
 --> d -- forget it
=> ["a is a word", "", "b is too", "", "c\\; for fun", "", "d -- forget it", "", ""]

Simon Strandgaard <neoneye@adslhome.dk> wrote:

--
-mark. (probertm @ acm dot org)

Dany_Cayouette1 · 30 September 2004 22:10

But for further clarification:
How should 'a;b\\;;c' be split?

My guess is that it should be
["a", "b\", nil, "c"]

The characters escaped by a backslash are semicolon, colon, and backslash, i.e.

  ; => \;
  : => \:
  \ => \\

If backslashes can be escaped (and you'd want that, because otherwise you
can't have a field "b\"), it's more difficult.

And maybe the CSV library can help you here.

thanks,
Dany

Dany_Cayouette1 · 30 September 2004 22:25

> But for further clarification:
> How should 'a;b\\;;c' be split?
My guess is that it should be
["a", "b\", nil, "c"]

Sorry... I meant
["a", "b\\", nil, "c"] where b\\ would ultimately become b\ when the escape chars are processed in the data portion

The characters escaped by a backslash are semicolon, colon, and backslash, i.e.

  ; => \;
  : => \:
  \ => \\

> If backslashes can be escaped (and you'd want that, because otherwise you
> can't have a field "b\"), it's more difficult.
>

Didn't think about that one... I thought this was simple and the problem was my lack of programming experience...
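
To make the escaping rules described above concrete, here is a minimal sketch of an escape/unescape pair for a single field. The helper names `escape_field` and `unescape_field` are hypothetical, and the sketch assumes that only ';', ':' and '\' are ever escaped.

```ruby
# Hypothetical helpers illustrating the rules  ; => \;   : => \:   \ => \\

# Prefix every ';', ':' and '\' in the field with a backslash.
def escape_field(field)
  field.gsub(/[;:\\]/) { |ch| "\\#{ch}" }
end

# Remove the backslash in front of an escaped ';', ':' or '\'.
# gsub scans left to right, so an escaped backslash is consumed as a pair.
def unescape_field(field)
  field.gsub(/\\([;:\\])/) { $1 }
end

p escape_field('a;b:c')                  # => "a\\;b\\:c"
p escape_field('b\\')                    # => "b\\\\"  (b\ becomes b\\)
p unescape_field(escape_field('b\\'))    # => "b\\"    (back to b\)
```

With escaping done per field before joining on ';', the field b\ is written as b\\, which is why the splitter has to treat a doubled backslash as data rather than as an escape for the following ';'.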

Dany

On Thu, 30 Sep 2004 17:57:19 -0400 Dany Cayouette <danyc@nortelnetworks.com> wrote:

Robert · 1 October 2004 07:45

"Mark Probert" <probertm@nospam-acm.org> wrote in message
news:Xns95749654816D0probertmnospamtelusn@198.161.157.145...

Hi ..

>
> How about something ala
>
> irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
> => ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]
>

Thanks! That is close enough:

irb(main):019:0> s.scan(/(?:\\[^.]|[^;])*/).each do |it|
irb(main):020:1*   next if it.empty?
irb(main):021:1>   puts " --> #{it}"
irb(main):022:1> end
 --> a is a word
 --> b is too
 --> c\; for fun
 --> d -- forget it
=> ["a is a word", "", "b is too", "", "c\\; for fun", "", "d -- forget it", "", ""]

s = "aa;bbb\\;;abc;;d\\\\;e;"

=> "aa;bbb\\;;abc;;d\\\\;e;"

s.scan /(?:\\.|[^\\;])+/

=> ["aa", "bbb\\;", "abc", "d\\\\", "e"]

Regards

robert

Simon Strandgaard <neoneye@adslhome.dk> wrote:

Simon_Strandgaard1 · 1 October 2004 16:33

[snip]

>> s = "aa;bbb\\;;abc;;d\\\\;e;"
=> "aa;bbb\\;;abc;;d\\\\;e;"
>> s.scan /(?:\\.|[^\\;])+/
=> ["aa", "bbb\\;", "abc", "d\\\\", "e"]

If it's a CSV file, shouldn't the output then be:

["aa", "bbb\\;", "abc", "", "d\\\\", "e", ""]

On Friday 01 October 2004 09:45, Robert Klemme wrote:

--
Simon Strandgaard

Robert · 1 October 2004 21:44

"Simon Strandgaard" <neoneye@adslhome.dk> wrote in message news:200410012022.59526.neoneye@adslhome.dk...

On Friday 01 October 2004 09:45, Robert Klemme wrote:
[snip]

>> s = "aa;bbb\\;;abc;;d\\\\;e;"
=> "aa;bbb\\;;abc;;d\\\\;e;"
>> s.scan /(?:\\.|[^\\;])+/
=> ["aa", "bbb\\;", "abc", "d\\\\", "e"]

If it's a CSV file, shouldn't the output then be:

["aa", "bbb\\;", "abc", "", "d\\\\", "e", ""]

Darn! You're right. Unfortunately using "*" instead of "+" is not sufficient: far too many empty strings are found that way.
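
One possible way around the empty-string problem, offered only as a sketch: append a separator so that every field, including empty ones, is terminated by a ';', then scan for "field followed by ';'". The helper name `split_keeping_empty` is hypothetical, and the sketch assumes the input does not end in a dangling backslash.

```ruby
# Hypothetical helper: keep empty fields by making every field end in ';'.
# Assumes the input has no dangling backslash at the very end.
def split_keeping_empty(str)
  # Each match is one field (escaped pairs or plain chars) plus its ';'.
  # scan returns one-element arrays for the capture group, hence flatten.
  (str + ';').scan(/((?:\\.|[^\\;])*);/m).flatten
end

s = "aa;bbb\\;;abc;;d\\\\;e;"
p split_keeping_empty(s)
# => ["aa", "bbb\\;", "abc", "", "d\\\\", "e", ""]
```

Because each match consumes its terminating ';', no match is zero-width, so scan does not produce the extra empty strings that the "*" variant finds between fields.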

robert
