String splitting and inclusion

Stuart_Clarke · 21 July 2009 15:52

Hi all,

I am having trouble working out the logic for my problem. I have a
long string (320 characters) that I want to split into smaller strings
no longer than 50 characters each. At present I have the following
regex:

data = "big long string"

puts data.scan(/{50}/)

This breaks up the string nicely, but there are a few problems with
it:

It only outputs 50-character chunks, so when it gets to the end and
only 20 characters remain, it leaves them out of the output (it outputs
six 50-character strings and ignores the remaining 20).

This regex also splits up words, which I don't want. I want the script
to count to 50 and, when it gets there, go backwards to find some white
space and split at that point, so that no word is broken up. As a
result, a number of substrings of various sizes will be created, none
longer than 50 characters.

I hope this makes sense. To summarise, I want to break up a string into
chunks of at most 50 characters without breaking up words.
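
The "count to 50, then back up to the nearest white space" idea can be sketched directly with String#rindex. This is only a minimal sketch: the method name chunk_backwards and its limit parameter are made up for illustration, it treats only a plain space as white space, and it cuts a word in the middle when the word alone is longer than the limit.

```ruby
# Hypothetical helper: repeatedly cut at the last space found at or
# before index `limit`, so every piece is at most `limit` characters.
def chunk_backwards(text, limit = 50)
  rest = text.strip
  chunks = []
  while rest.length > limit
    # Search backwards for a space, starting at index `limit`.
    cut = rest.rindex(" ", limit)
    # No usable space means a single word is longer than `limit`:
    # fall back to cutting in the middle of that word.
    cut = limit if cut.nil? || cut.zero?
    chunks << rest[0, cut].rstrip
    rest = rest[cut..-1].lstrip
  end
  chunks << rest unless rest.empty?
  chunks
end

p chunk_backwards("Some words are made of letters! Some are not!", 10)
# => ["Some words", "are made", "of", "letters!", "Some are", "not!"]
```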

Thanks in advance

Stuart

--
Posted via http://www.ruby-forum.com/.

Forum · 21 July 2009 16:11

s = "a bad day in the office today, " * 3
puts "Attention, some backtracking here:"
puts s.scan( /.{,20}\b/ )
puts "I cannot come up with a non backtracking solution right now :("

HTH
Robert

--
Toutes les grandes personnes ont d’abord été des enfants, mais peu
d’entre elles s’en souviennent.

All adults have been children first, but not many remember.

[Antoine de Saint-Exupéry]

7stud · 21 July 2009 20:16

Stuart Clarke wrote:

puts data.scan(/{50}/)

error: invalid regular expression; there's no previous pattern, to which
'{' would define cardinality

data =<<ENDOFSTRING
Hello world. Hello moon.
Goodbye world. Goodbye moon.

Hello world. Hello moon.
Goodbye world. Goodbye moon.
The end.
ENDOFSTRING

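chunks = []      # finished chunks
curr_chunk = []  # pieces of the chunk currently being built
curr_length = 0

# /.+?\b/m lazily matches up to the next word boundary, so each piece
# is a word or the run of non-word characters that follows it.
data.scan(/.+?\b/m) do |word|
  wlen = word.length

  if curr_length + wlen <= 50
    curr_chunk << word
    curr_length += wlen
  else
    # Adding this piece would exceed 50 characters: start a new chunk.
    chunks << curr_chunk.join
    curr_chunk = [word]
    curr_length = wlen
  end
end

# Flush the last, partially filled chunk.
if curr_chunk.length > 0
  chunks << curr_chunk.join
end

p chunks

chunks.each do |chunk|
  puts chunk.length
end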


Harry3 · 22 July 2009 00:41

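str = "I did not test this completely so you may need to make some " \
      "adjustments to this, but give it a try. This cuts on twenty instead of " \
      "fifty characters."

# Keep cutting while more than 20 characters remain. (A fixed count of
# (str.length/20).times can stop before the remainder is short enough.)
while str.length > 20
  arr = str.split(//)
  ess = arr.zip((0...arr.length).to_a)  # [character, index] pairs
  # Last space at index 20 or earlier.
  tee = ess.reverse.detect{|y| y[0] == " " and y[1] <= 20}
  break if tee.nil?                     # no space to cut on
  p str.slice!(0..tee[1]).strip
end
p str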

Harry

--
A Look into Japanese Ruby List in English

David_A_Black1 · 22 July 2009 01:00

Hi --

Try this. I don't guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)

David

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2)
Training! Intro to Ruby, with Black & Kastner, September 14-17
(More info: http://rubyurl.com/vmzN)

7stud · 21 July 2009 20:34

--output:--
["Hello world. Hello moon.\nGoodbye world. Goodbye ", "moon.\n\nHello
world. Hello moon.\nGoodbye world. ", "Goodbye moon.\nThe end"]
49
48
21

Hmmm...I'm having a problem getting the ending period while using the
word boundary in the regex. I guess that's because there is no start of
a word after the ending period for the regex to match. \s works:

data.scan(/.+?\s/m) do |word|
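
One caveat with /.+?\s/m: it only keeps the last word if the string ends in white space, as the heredoc here does. A sketch that does not depend on that, assuming it is acceptable to collapse every run of white space into a single space (wrap_words and width are names invented for this example):

```ruby
# Hypothetical helper: pack whole words into chunks of at most `width`
# characters, joining the words of a chunk with single spaces.
def wrap_words(text, width = 50)
  chunks = []
  current = ""
  # /\S+/ yields each run of non-whitespace, so the final word is
  # returned even when nothing follows it.
  text.scan(/\S+/) do |word|
    if current.empty?
      current = word.dup
    elsif current.length + 1 + word.length <= width
      current << " " << word
    else
      chunks << current
      current = word.dup
    end
  end
  chunks << current unless current.empty?
  chunks
end

data = "Hello world. Hello moon.\nGoodbye world. Goodbye moon.\nThe end."
p wrap_words(data, 50)
```

A single word longer than width still ends up in a chunk of its own that exceeds the limit; this sketch does not try to split it.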


7stud · 22 July 2009 04:59

David A. Black wrote:

Try this. I don't guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)

Oh my.


Forum · 22 July 2009 08:13

Hmm, the \b at the end of my solution might be a problem in some edge
cases. However, I would suggest using \z instead of $ and the m
switch. I fail to see why you put a \b at the beginning, David; would
you mind explaining?

In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
might lead to the most elegant solution:

/.{,50}(?!\B)/

BTW it seems that {n,m} does not have a "non greedy" and "possessive"
variant, or did I miss it?

Cheers
Robert


Forum · 22 July 2009 08:55

> In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
> might lead to the most elegant solution:
>
> /.{,50}(?!\B)/

Nahh, that leaves us with spaces at the beginning of each line. Of
course we could do
scan(...).map( &:lstrip ), but that hurts my regex pride.

This seems to work (but does not really):

s = "Some words are made of letters! Some are not!"
puts s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ )

Replace the puts with p and you will see trailing whitespace now :(.

This is a little bastard of a problem indeed. Simplest I could come up
with so far:

s = "Some words are made of letters! Some are not!"
p s.scan( /.{,10}\p{Graph}(?:\P{Graph}|\z)/ ).map( &:strip )

HTH
Robert

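> BTW it seems that {n,m} does not have a "non greedy" and "possessive"
> variant, or did I miss it?

Yes I did, at least in part: the non-greedy {n,m}? is there, sorry. As
far as I can tell, {n,m}+ is not possessive in Ruby's Oniguruma syntax,
though; it is read as the {n,m} quantifier followed by +, and only ?+,
*+ and ++ are possessive.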


David_A_Black1 · 22 July 2009 10:49

Hi --

Robert Dober wrote:

> Hmm my \b at the end of my solution might have been a problem in some
> edge cases, however I would suggest the usage of \z instead of $ and
> the m switch. I fail to see why you put a \b at the beginning David,
> would you mind to explain?

The idea was to start every scan at a \b. It's definitely not an
all-purpose solution to the problem anyway. For one thing, it doesn't
handle words of more than 50 characters -- which probably doesn't
matter, unless you're using it with a number less than 50:

str

=> "this is a string and i intend to split it up into little strings"

str.scan(/\b.{0,5}(?:$|\b)/m)

=> ["this ", "is a ", "", " and ", "i ", "", " to ", "split", " it ",
"up ", "into ", "", " ", "", ""]

Without the first \b you get:

["this ", "is a ", "", "tring", " and ", "i ", "", "ntend", " to ",
"split", " it ", "up ", "into ", "", "ittle", " ", "", "rings", ""]

So... further tweaking required
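
One possible tweak for the long-word case, as a rough sketch rather than a tested solution: try to match a chunk that ends right before white space, and fall back to cutting exactly width characters out of a word when that is impossible. The helper name wrap_hard is hypothetical, and the pattern assumes width is at least 2.

```ruby
# Hypothetical helper, not from the thread.
def wrap_hard(text, width)
  raise ArgumentError, "width must be at least 2" if width < 2
  # First alternative: a chunk that starts and ends on a non-space
  # character, is at most `width` long, and is followed by white space
  # or the end of the string.
  # Second alternative: no such chunk exists because the current word
  # is too long, so take exactly `width` characters of it.
  text.scan(/\S(?:.{0,#{width - 2}}\S)?(?=\s|\z)|\S{#{width}}/)
end

str = "this is a string and i intend to split it up into little strings"
p wrap_hard(str, 5)
```

With the string above and a width of 5, "string" comes out as "strin" followed by "g and", so nothing is dropped and no chunk exceeds the width.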

David


David_A_Black1 · 22 July 2009 10:49

Hi --

On Wed, 22 Jul 2009, 7stud -- wrote:

David A. Black wrote:

Try this. I don't guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)

Oh my.

It's got some problems; see the message I just posted about word
length.

David


7stud · 22 July 2009 10:51

Robert Dober wrote:

On 7/22/09, David A. Black <dblack@rubypal.com> wrote:

Try this. I don't guarantee robustness.

str.scan(/\b.{0,50}(?:$|\b)/m)

I fail to see why you put a \b at the beginning David,
would you mind to explain?

Yes. Please explain that. Also please explain why you don't have
{1,50}?

Or, will you claim the 5th under the robustness disclaimer?


Forum · 22 July 2009 09:05

s.split( /(.{,10}\S)\s/ ).reject( &:empty? )

David_A_Black1 · 22 July 2009 10:58

I won't claim anything. Feel free to experiment with the code, which
I've already said repeatedly isn't a full solution, and see what you
come up with.

David


Harry3 · 23 July 2009 09:13

> In Ruby 1.9 (or Oniguruma that is) the negative lookahead assertion
> might lead to the most elegant solution:
>
> /.{,50}(?!\B)/
>
> Nahh that leaves us with spaces at the beginning of the line

p str.scan(/\s*(.{1,50})(?!\S)/)

Or, if there are consecutive spaces between words,
squeeze them out first.

p str.squeeze(" ").scan(/\s*(.{1,50})(?!\S)/)
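
Because the pattern contains a capture group, scan returns one single-element array per match, so a flatten is needed to get plain strings. A small sketch of that, with the width made a parameter (wrap_scan is a made-up name, not an existing method):

```ruby
# Hypothetical wrapper around the pattern above.
def wrap_scan(text, width = 50)
  # \s* skips the white space between chunks, the group captures up to
  # `width` characters, and (?!\S) refuses to stop in mid-word.
  text.squeeze(" ").scan(/\s*(.{1,#{width}})(?!\S)/).flatten
end

p wrap_scan("Some words are made of letters! Some are not!", 10)
# => ["Some words", "are made", "of", "letters!", "Some are", "not!"]
```

One thing to watch for: a word longer than the width cannot be matched from its first character, so scan moves on and the word silently loses its leading characters.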

Harry


Forum · 22 July 2009 09:39

Good enough? Certainly not:
s.split( /(.{,10}\S)\s+/ ).reject( &:empty? )
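
The reason this is still not good enough can be seen by running it: .{,10} followed by \S allows up to 11 characters per chunk. A hedged sketch with the width as a parameter and the count adjusted by one (wrap_split is a hypothetical name; words longer than the width are not handled properly):

```ruby
s = "Some words are made of letters! Some are not!"

# The pattern as posted: .{,10} plus one \S allows 11 characters.
posted = s.split(/(.{,10}\S)\s+/).reject(&:empty?)
p posted.map(&:length).max
# => 11

# Hypothetical helper: same idea, but at most `width` characters.
def wrap_split(text, width = 50)
  # split keeps the captured group in its result; the empty strings
  # between consecutive matches are dropped by reject.
  text.split(/(.{0,#{width - 1}}\S)\s+/).reject(&:empty?)
end

p wrap_split(s, 10)
# => ["Some words", "are made", "of", "letters!", "Some are", "not!"]
```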

Robert


Forum · 22 July 2009 11:59

Indeed this is very tricky. I had some doubts about your leading \b
example, but I experimented with lots of solutions and they were
covering it up. Thanks for explaining. Unless the OP says what he
really wants, I shall stop so as not to make too much noise; e.g. there
is the issue of more than one space and, of course, punctuation.
R.
