Newbie: what's the Ruby idiom for word-by-word input?

Hi,

What is the Ruby idiom for reading input word-by-word? In other words,
how do I process input while skipping all of the whitespace?

I would use this code in C++:

//std::istream& stream;
while (!stream.eof())
{
    std::string str;
    stream >> str;
    ...
}

Or something like this in C:

char str[100]; // please no flame on fixed buffer size :slight_smile:
while (!feof(file))
{
    fscanf(file, "%99s", str);
    ...
}

I have tried to use Ruby's scanf("%s") but found it severely broken
(as per my task)--it discards the rest of the input up to the
newline. I currently use the following approach, which I find ugly:

while not $stdin.eof? do
  words = $stdin.gets().scan(/[^\s]+/)
  words.each do |w|
     ...
  end
end

It forces me to write an inner loop, and it reads the entire line into
memory and then splits it into an array of words... Duh!

My application is not time- or memory-critical in any way, nor have
I measured anything to find a performance bottleneck... however, I desire
enlightenment. :slight_smile:

Please show me the Ruby way!

Cheers,
Alex
PS: Yes, I've searched the web, tutorials, FAQs, cookbooks, etc.
before posting this. No luck.

http://ec1.images-amazon.com/images/I/41X2833B8TL.jpg

I believe that should be enough to show you the ruby way.

Maybe you could try this:

a = gets.chomp #=> "My name is Ari"
words = a.split(/ /) #=> ["My", "name", "is", "Ari"]

Tadah! I REALLY hope that's what you're looking for.

-------------------------------------------------------|
~ Ari
crap my sig won't fit

···

On Sep 16, 2007, at 3:10 PM, Alex Shulgin wrote:

Please show me the Ruby way!

If you do not need to handle each word before the rest of its line has been read, you could do this:

$stdin.each do |line|
   line.scan /\w+/ do |word|
     puts word
   end
end

If your definition of "word" is different (e.g. runs of non-whitespace characters), you need a different regexp (for example /\S+/).

If you want to read only up to word boundaries, it becomes more difficult.
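
To make the difference between the two patterns concrete, here is a small sketch. The sample line is made up for illustration; only the two regexps come from the text above.

```ruby
# Compare the two "word" definitions on a made-up sample line.
line = "foo-bar  baz_1,\tqux!\n"

word_chars     = line.scan(/\w+/)  # runs of letters, digits and underscores
non_whitespace = line.scan(/\S+/)  # runs of anything that is not whitespace

p word_chars
p non_whitespace
```

With /\w+/ the punctuation acts as a separator, while /\S+/ keeps it attached to the neighbouring word, which is closer to what the C++ and C examples in the question do.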

Kind regards

  robert

···

On 16.09.2007 21:08, Alex Shulgin wrote:


This is more or less the same code as I use, maybe a bit more
readable, though. :slight_smile:

So there is no way in Ruby to read words w/o reading the whole line of
input and then splitting/scanning the line (which takes us two nested
loops anyway)? Looks very odd to me...

Alex

···

On Sep 17, 12:19 am, Robert Klemme <shortcut...@googlemail.com> wrote:

If you do not need to treat every word before it is read from the input
you could do this:

$stdin.each do |line|
   line.scan /\w+/ do |word|
     puts word
   end
end

If your definition of "word" is different (i.e. non whitespace
characters) you need a different regexp (for example /\S+/).

If you want to read to word boundaries only it becomes more difficult.

Hi,

Please show me the Ruby way!

Maybe you could try this:

a = gets.chomp #=> "My name is Ari"
words = a.split(/ /) #=> ["My", "name", "is", "Ari"]

This isn't really robust, as it doesn't handle tabs
or runs of whitespace. It is also longer than

  words = gets.split

With no argument (or nil), String#split falls back to $;, which
is normally nil as well. In that case split separates on runs of
whitespace, something like %r/[ \t\n\r\v\f]+/, and ignores leading
whitespace.

Reading whole lines is easy; they shouldn't become too
long. Reading the first word before the user has pressed Enter
would require tweaking terminal settings. Not worth the effort
in most cases.
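
A quick sketch of the difference, using a made-up line with a tab and repeated spaces in place of real input from gets:

```ruby
# A made-up input line containing double spaces and a tab.
line = "My  name\tis   Ari\n"

by_single_space = line.chomp.split(/ /)  # splits at every single space
by_whitespace   = line.split             # no argument: runs of whitespace

p by_single_space
p by_whitespace
```

The / / version produces empty strings between adjacent spaces and leaves the tab inside a field, while the bare split returns just the four words.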

Bertram

···

Am Montag, 17. Sep 2007, 04:19:53 +0900 schrieb Ari Brown:

On Sep 16, 2007, at 3:10 PM, Alex Shulgin wrote:

--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de

Well, you can use #getc and implement the word matching logic
yourself. But that is more tedious, and it's questionable whether
it will be as efficient. And since a line break is a word boundary
anyway, the nested loop approach yields the same result (i.e. the same
sequence of words) as the other approach. So why bother to create a word
iterating solution just to get rid of one level of loop nesting?
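
If the nesting itself is the annoyance, one possible way to hide it (a sketch only, not something proposed in this thread) is to flatten the lines into a stream of words with a lazy enumerator. The helper name words_of is made up, and StringIO stands in for $stdin so the example is self-contained; it still reads line by line underneath.

```ruby
require "stringio"

# Hypothetical helper: turns any object with each_line into a lazy
# stream of whitespace-separated words.
def words_of(io)
  io.each_line.lazy.flat_map { |line| line.split }
end

# StringIO stands in for $stdin here.
input = StringIO.new("one two\n  three\n\nfour five\n")
words_of(input).each { |w| puts w }
```

The caller then sees a single loop over words, and nothing beyond the current line is held in memory.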

Kind regards

robert

···

2007/9/17, Alex Shulgin <alex.shulgin@gmail.com>:


Awk is a very popular tool for text processing, but there is no
way to make it treat a sequence of whitespace characters as a
record-separator. So in awk, as in Ruby, text is almost always
read a line at a time.
Gawk added the ability to set the record-separator to a
regular expression:

gawk 'BEGIN{RS="[ \t\n]+"} 1'
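
For comparison, the closest Ruby counterpart to RS that I am aware of is the separator argument of each_line (and gets). As far as I know it is a fixed string rather than a pattern, so a sketch like the following still needs a split to deal with newlines and runs of blanks. The sample data is made up.

```ruby
require "stringio"

# Made-up sample input; StringIO stands in for a real file or $stdin.
io = StringIO.new("a b  c\nd")

# Read "records" terminated by a single space instead of a newline.
records = io.each_line(" ").to_a

# The records still contain newlines and empty runs, so split each one.
words = records.flat_map(&:split)

p records
p words
```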

···

On Sep 17, 4:30 am, Alex Shulgin <alex.shul...@gmail.com> wrote:


I thought Ruby was not just a text processing tool, but a general
purpose programming language. Anyway, it would be nice to have a
solution for this problem as compact and flexible as the C++ example I
provided. If only scanf() didn't discard the rest of the line... but
it is too late to fix that now. :-/

Regards,
Alex

···

On Sep 17, 6:19 pm, William James <w_a_x_...@yahoo.com> wrote:

Awk is a very popular tool for text processing, but there is no
way to make it treat a sequence of whitespace characters as a
record-separator. So in awk, as in Ruby, text is almost always
read a line at a time.

You thought correctly. But when you talk about reading a word at
a time from a text file, you're talking about text processing.
The point is that languages (including Ruby) that were designed
to be very good at processing text usually read a line at a time,
not a word at a time. (A language that is very good at processing
text can still be a general purpose language.) Reading a word at
a time seems to me to be odd and unnecessary, and I do a lot of
text processing. However, here's one way to do it. (It would be
a lot more efficient to read by lines.)

class IO
  def get_word
    word = nil
    while c = self.read(1)
      if c =~ /\s/
        break if word
      else
        word ||= ""
        word << c
      end
    end
    word
  end
end

File.open('data'){|file|
  while w = file.get_word
    p w
  end
}

···

On Sep 17, 1:00 pm, Alex Shulgin <alex.shul...@gmail.com> wrote:

On Sep 17, 6:19 pm, William James <w_a_x_...@yahoo.com> wrote:

> Awk is a very popular tool for text processing, but there is no
> way to make it treat a sequence of whitespace characters as a
> record-separator. So in awk, as in Ruby, text is almost always
> read a line at a time.

I thought Ruby is not just a text processing tool, but a general
purpose programming language.

I'd probably encapsulate the word reading in a module so the implementation can be reused and exchanged if necessary:

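module WordIO
   # each_line exists on IO, ARGF and String alike
   def each_word(&b)
     each_line do |line|
       line.scan(/\w+/, &b)
     end
   end
end

class IO
   include WordIO

   def self.readwords(file)
     words = []
     # File.open takes a path; a bare open would resolve to IO.open
     # when called as IO.readwords, and that expects a file descriptor
     File.open(file) {|io| io.each_word {|wd| words << wd}}
     words
   end
end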

ARGF.extend WordIO

# additional goody
class String
   include WordIO
end
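
One possible refinement, sketched here under a different and made-up module name (WordEnum) so it does not clash with the code above: take the pattern as a parameter and return an Enumerator when no block is given, so callers can chain first, map and friends. StringIO is used only to keep the example self-contained.

```ruby
require "stringio"

# Hypothetical variant of the module above; the name, the pattern
# parameter and its default are assumptions made for illustration.
module WordEnum
  def each_word(pattern = /\S+/, &block)
    # Without a block, hand back an Enumerator over the same words.
    return enum_for(:each_word, pattern) unless block

    each_line { |line| line.scan(pattern, &block) }
    self
  end
end

io = StringIO.new("alpha beta\ngamma\n").extend(WordEnum)
p io.each_word.first(2)
```

Because the definition of "word" is an argument, swapping /\S+/ for /\w+/ or anything else does not require touching the module.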

:slight_smile:

Kind regards

  robert

···

On 17.09.2007 21:49, William James wrote:


Very sophisticated.

Since the o.p. wants whitespace as the word-separator,
the reg.exp. should be changed to /\S+/.

But, dang it all, I'm gonna say you're cheating because
you're still reading lines behind the scenes!
Reading lines and breaking them into words is a lot
easier than reading characters and constructing words.

···

On Sep 17, 4:13 pm, Robert Klemme <shortcut...@googlemail.com> wrote:


Hi,

···

Am Dienstag, 18. Sep 2007, 06:15:05 +0900 schrieb Robert Klemme:

module WordIO
  def each_word(&b)
    each do |line|
      line.scan(/\w+/, &b)

Loath to criticize it, but

  irb(main):001:0> "tränenüberströmt".scan /\w+/
  => ["tr", "nen", "berstr", "mt"]
  irb(main):002:0>

Sigh!

Bertram

--
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de

Very sophisticated.

Since the o.p. wants whitespace as the word-separator,
the reg.exp. should be changed to /\S+/.

See also Bertram's remark. Btw, that's probably also the reason why
this is not in the standard library: there is probably no one-size-fits-all
definition of "word". We have seen at least two so far and I reckon
there are more. :slight_smile:

But, dang it all, I'm gonna say you're cheating because
you're still reading lines behind the scenes!

:wink: But I said the implementation can be exchanged.

Reading lines and breaking them into words is a lot
easier than reading characters and constructing words.

Correct. But just a bit:

module WordIO
  def wchar?(c)
    /\A\w\z/ =~ c.chr
  end

  def each_word
    word = nil
    while ( c = getc )
      if wchar? c
         (word ||= "") << c
      else
        yield word if word
        word = nil
      end
    end
    # flush the last word when the input ends without a separator
    yield word if word
    self
  end
end

Kind regards

robert

···

2007/9/18, William James <w_a_x_man@yahoo.com>:

On Sep 17, 4:13 pm, Robert Klemme <shortcut...@googlemail.com> wrote:
> On 17.09.2007 21:49, William James wrote:
> > On Sep 17, 1:00 pm, Alex Shulgin <alex.shul...@gmail.com> wrote:
> >> On Sep 17, 6:19 pm, William James <w_a_x_...@yahoo.com> wrote:

Hi,

module WordIO
  def each_word(&b)
    each do |line|
      line.scan(/\w+/, &b)

Loath to criticize it, but

  irb(main):001:0> "tränenüberströmt".scan /\w+/
  => ["tr", "nen", "berstr", "mt"]
  irb(main):002:0>

Sigh!

$ irb -Ku
>> "tränenüberströmt".scan /\w+/
=> ["tränenüberströmt"]
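
The -Ku switch above is the Ruby 1.8 answer. On Ruby 1.9 and later, as far as I know, \w stays ASCII-only even for UTF-8 strings, and the usual alternatives are a POSIX bracket class or a Unicode property. A small sketch, assuming the source file is UTF-8 (the default on current Rubies):

```ruby
word = "tränenüberströmt"

ascii_only = word.scan(/\w+/)           # \w covers only [a-zA-Z0-9_]
posix      = word.scan(/[[:alpha:]]+/)  # POSIX bracket, Unicode-aware
property   = word.scan(/\p{L}+/)        # Unicode "letter" property

p ascii_only
p posix
p property
```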

James Edward Gray II

···

On Sep 18, 2007, at 1:22 AM, Bertram Scharpf wrote:

Am Dienstag, 18. Sep 2007, 06:15:05 +0900 schrieb Robert Klemme:

Yeah, that is my point. I only see a way to do this efficiently (w/o
reading the whole lines) by writing the routine in C and then using it
in Ruby.

Anyway, I probably won't bother, since there is no real problem--just
curiosity of mine. :wink:

Thanks all for discussing,
Alex

···

On Sep 18, 3:30 am, William James <w_a_x_...@yahoo.com> wrote:

But, dang it all, I'm gonna say you're cheating because
you're still reading lines behind the scenes!
Reading lines and breaking them into words is a lot
easier than reading characters and constructing words.

But, dang it all, I'm gonna say you're cheating because
you're still reading lines behind the scenes!
Reading lines and breaking them into words is a lot
easier than reading characters and constructing words.

Yeah, that is my point. I only see a way to do this efficiently (w/o
reading the whole lines) by writing the routine in C and then using it
in Ruby.

Why do you think Ruby solutions are inefficient? If you fear that reading individual characters is slow in Ruby, note that even with #getc Ruby will do buffered IO (I'm not sure about $stdin though).
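
For what it's worth, a pure Ruby reader that neither depends on line boundaries nor goes character by character is also possible. The following is only a sketch: the helper name each_word_chunked and the 4096-byte default are made up, and because read with a length returns binary strings, the yielded words would need force_encoding if the input contains non-ASCII text.

```ruby
require "stringio"

# Hypothetical helper: yields whitespace-separated words while reading
# the stream in fixed-size chunks rather than line by line.
def each_word_chunked(io, chunk_size = 4096)
  return enum_for(:each_word_chunked, io, chunk_size) unless block_given?

  pending = +""                       # unfinished word carried across chunks
  while (chunk = io.read(chunk_size)) # read(n) returns nil at end of input
    pending << chunk
    words = pending.split(/\s+/, -1)  # -1 keeps a trailing empty field
    pending = words.pop || +""        # the last field may be incomplete
    words.each { |w| yield w unless w.empty? }
  end
  yield pending unless pending.empty?
end

each_word_chunked(StringIO.new("  foo bar\n\tbaz"), 3) { |w| puts w }
```

The tiny chunk size in the usage line is only there to show that words spanning a chunk boundary are reassembled correctly.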

Anyway, I probably won't bother, since there is no real problem--just
curiosity of mine. :wink:

If you are curious, why not just take the suggested implementations and benchmark them? Benchmarking is actually pretty easy in Ruby because the Benchmark module is already there (plus some more advanced variants).
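
A minimal sketch of such a comparison with the standard Benchmark module. The two counting methods are simplified stand-ins for the line-based and getc-based approaches from this thread, and the sample data is made up; no timings are claimed here, since they depend on the machine and Ruby version.

```ruby
require "benchmark"
require "stringio"

# Made-up sample input: 20,000 identical lines.
data = "lorem ipsum dolor sit amet\n" * 20_000

# Line-based: read a line, then scan it for non-whitespace runs.
def count_by_lines(io)
  n = 0
  io.each_line { |line| line.scan(/\S+/) { n += 1 } }
  n
end

# Character-based: count transitions from whitespace into a word.
def count_by_chars(io)
  n = 0
  in_word = false
  while (c = io.getc)
    if c =~ /\s/
      in_word = false
    elsif !in_word
      in_word = true
      n += 1
    end
  end
  n
end

Benchmark.bm(8) do |x|
  x.report("lines:") { count_by_lines(StringIO.new(data)) }
  x.report("getc:")  { count_by_chars(StringIO.new(data)) }
end
```

Swapping StringIO for File.open on a real file would show whether IO buffering changes the picture.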

Thanks all for discussing,

Thank you for bringing up interesting subjects!

Kind regards

  robert

···

On 18.09.2007 20:13, Alex Shulgin wrote:

On Sep 18, 3:30 am, William James <w_a_x_...@yahoo.com> wrote: