1.8.0-previewX rb_sys_fail() on socket instead of an Exception

Hi all,

Find some code below to reliably cause a rb_sys_fail() for server.rb,
whenever it is run by ruby 1.8.0-previewX on a Linux system (2.2 kernel RH
6.something, 2.4.21-smp kernel RH 7.3, 2.5 kernel debian unstable). The
client can be run with 1.6.8 and it still causes rb_sys_fail().

I get normal exceptions/nil-from-gets for Ruby 1.6.8, or when running on
HP-UX. If I run the server with 1.6.8 and the client with 1.8.0-pX, all is
fine, too.

The code below is simplified from my real codebase at work, where the
rb_sys_fail() means the server crashes, without possibility of recovery.

However, I cannot take any more code out of these snippets, or the
behaviour reverts to the normal, desired one: an exception or nil from gets.

Please find the code below, or here:
http://httpd.chello.nl/k.vangelder/ruby/broken_socket.zip

Bye,
Kero.

server.rb


require 'socket'

server = TCPServer.new('localhost', 1357)
while socket = server.accept()
  t = Thread.new {
    loop {
      begin
        line = socket.gets
        puts line
      rescue
      end
    }
  }

  i = 0
  loop {
    socket.puts("hello")
    sleep 1
    i += 1
  }
end

client.rb

require 'socket'
require 'duty'

socket = TCPSocket.new('localhost', 1357)
while line = socket.gets()
  puts line

  Heavy::duty()                        # client can be killed (^C at command line) ...
  socket.puts("world #{nonexistant}")  # ... or will die here,
                                       # either causes crash at server.rb
end

duty.c

#include <ruby.h>

VALUE rb_duty(VALUE module) {
    int i, j, k;
    for (i=0; i<1000; i++) {
        for (j=0; j<1500; j++) { /* lower numbers on slower systems... */
            for (k=0; k<2000; k++) {
                int nr = i*j+k;
            }
        }
    }
    return Qnil;
}

void Init_duty() {
    VALUE rb_mHeavy = rb_define_module("Heavy");
    rb_define_module_function(rb_mHeavy, "duty", rb_duty, 0);
}

extconf.rb

require 'mkmf'
create_makefile("duty")

rb_sys_fail should just raise an exception:

rb_sys_fail(mesg)
    const char *mesg;
{
    extern int errno;
    int n = errno;
    VALUE arg;

    errno = 0;
    if (n == 0) {
        rb_bug("rb_sys_fail() - errno == 0");
    }

    arg = mesg ? rb_str_new2(mesg) : Qnil;
    rb_exc_raise(rb_class_new_instance(1, &arg, get_syserr(n)));
}
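
Seen from the Ruby side, the mapping that rb_sys_fail() performs (errno value in, matching Errno exception out) can be sketched roughly as below. This is only an illustration on a current Ruby, not the interpreter's code path; the message string "server.rb" and the variable names are made up for the example.

```ruby
# Sketch only: models the errno -> exception mapping from Ruby code.
# "server.rb" is just an illustrative message, not taken from io.c.
errno_value = Errno::EPIPE::Errno            # this platform's numeric EPIPE

# SystemCallError.new picks the Errno subclass that matches a known errno value.
err = SystemCallError.new("server.rb", errno_value)

puts err.class                               # the Errno subclass chosen for that value
puts err.errno == errno_value

rescued = nil
begin
  raise err
rescue SystemCallError => e                  # an ordinary rescue catches it
  rescued = e
end
puts rescued.message
```

The point of the thread is that this normal path needs a non-zero errno; with errno == 0 the C function calls rb_bug() instead and the interpreter aborts.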

Are you saying that the ruby interpreter is dying at this point? Do you get
a core dump? (In which case gdb can be used to interpret it)

Do you get the message “rb_sys_fail() - errno == 0” printed?

I tried your code under FreeBSD-4.8; from server.rb I get a zillion ‘nils’
printed (from the infinite gets/prints loop), followed by an EPIPE when the
other server thread tries to write to the closed socket: is that what you
get with the working platforms?


nil
nil
nil
server.rb:17:in `write': Broken pipe (Errno::EPIPE)
        from server.rb:17:in `puts'
        from server.rb:17
        from server.rb:16:in `loop'
        from server.rb:20

One thing to try might be adding
    trap('PIPE') { }
to the top of server.rb, just to see if SIGPIPE is interacting in some way. Just
a thought.

Cheers,

Brian.
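
For reference, the EPIPE behaviour described above can be reproduced without a network connection. The following is a small sketch on a current Ruby, assuming a POSIX system; it uses IO.pipe in place of the TCP socket, and the variable names are only illustrative.

```ruby
# Sketch: writing to a pipe whose reading end is gone raises Errno::EPIPE.
# IO.pipe stands in for the TCP connection of server.rb (an assumption
# made to keep the example self-contained).
trap('PIPE', 'IGNORE')          # make sure SIGPIPE cannot kill the process

reader, writer = IO.pipe
reader.close                    # the "client" has gone away

result = begin
  writer.write("hello\n")
  writer.flush
  :written
rescue Errno::EPIPE
  :peer_gone                    # the normal, rescuable outcome
end

puts result
```

This is the outcome reported for FreeBSD and for Ruby 1.6.8: an ordinary exception that the script can rescue.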

···

On Sat, Aug 02, 2003 at 06:37:40PM +0900, Kero van Gelder wrote:

Find some code below to reliably cause a rb_sys_fail() for server.rb,
whenever it is run by ruby 1.8.0-previewX on a Linux system (2.2 kernel RH
6.something, 2.4.21-smp kernel RH 7.3, 2.5 kernel debian unstable). The
client can be ran with 1.6.8, still causes rb_sys_fail().

I get normal exceptions/nil-from-gets for Ruby 1.6.8, or when running on
HP-UX. If I run the server with 1.6.8 and the client with 1.8.0-pX, all is
fine, too.

The code below is simplified from my real codebase at work, where the
rb_sys_fail() means the server crashes, without possibility of recovery.


The problem is that appendline() uses ferror() whereas rb_sys_fail() uses
errno. On Linux, the man page for ferror says:

   These functions should not fail and do not set the external variable
                                          ^^^^^^^^^^
   errno. (However, in case fileno detects that its argument is not a
   valid stream, it must return -1 and set errno to EBADF.)

Guy Decoux

Interesting. The FreeBSD manpage for ferror says exactly the same, although
omits the sentence in brackets.

However, ‘man getc’ says:

RETURN VALUES
   If successful, these routines return the next requested object from the
   stream. Character values are returned as an unsigned char converted to
   an int. If the stream is at end-of-file or a read error occurs, the
   routines return EOF. The routines feof(3) and ferror(3) must be used to
   distinguish between end-of-file and error. If an error occurs, the
                                              ^^^^^^^^^^^^^^^^^^^^^^^
   global variable errno is set to indicate the error. The end-of-file
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   condition is remembered, even on a terminal, and all subsequent attempts
   to read will return EOF until the condition is cleared with clearerr(3).

Does this mean that Linux’s ferror is actually clearing errno? Or is there
some circumstance where ferror() returns true but errno is unset?

Regards,

Brian.

···

On Sat, Aug 02, 2003 at 08:53:03PM +0900, ts wrote:

The problem is that appendline() uses ferror() whereas rb_sys_fail() uses
errno. On Linux I have

   These functions should not fail and do not set the external variable
                                       ^^^^^^^^^^
   errno. (However, in case fileno detects that its argument is not a
   valid stream, it must return -1 and set errno to EBADF.)

Are you saying that the ruby interpreter is dying at this point? Do you get
a core dump? (In which case gdb can be used to interpret it)

yup.

#3  0x080b7d40 in rb_bug (fmt=0x80d35d8 "rb_sys_fail() - errno == 0") at error.c:193
193         abort();

I should probably read a good book about gdb, to find out how it got
there (shame on me, CS Master’s, lots of Linux experience :-)

Do you get the message “rb_sys_fail() - errno == 0” printed?

yup.

I tried your code under FreeBSD-4.8; from server.rb I get a zillion ‘nils’
printed (from the infinite gets/prints loop), followed by an EPIPE when the
other server thread tries to write to the closed socket: is that what you
get with the working platforms?

yup.

My real code deals properly with gets returning nil, but I took it out in
the example.

Bye,
Kero.

Does this mean that Linux's ferror is actually *clearing* errno?

No, ruby has cleared errno (see the source of server.rb). The problem is
in ruby.

Or is there some circumstance where ferror() returns true but errno is unset?

yes,

Guy Decoux

Hello,

No, ruby has cleared errno (see the source of server.rb). The problem is
in ruby.

I want to fix this. But I still don’t understand the problem.

  • server.rb reads the socket (by getc(3) in appendline).
  • getc returned EOF, which means end-of-file or error.
  • appendline called ferror(3) to determine whether it’s an error, not EOF.
  • but errno is zero, so rb_sys_fail() failed.

Right? So the question is in what situation

  • getc(3) returns EOF
  • ferror(3) returns true
  • yet errno == 0

and what I can do in a situation like this.

						matz.
···

In message “Re: 1.8.0-previewX rb_sys_fail() on socket instead of an Exception.” on 03/08/02, ts decoux@moulon.inra.fr writes:

I want to fix this. But I still don't understand the problem.

  * server.rb read the socket (by getc(3) in appendline).
  * getc returned EOF, which means end-of-file or error.
  * appendline called ferror(3) to determine if it's error, not EOF.
  * but errno is zero, so that rb_sys_fail() failed.

Well, server.rb is

    loop {
      begin
        line = socket.gets
        puts line
      rescue
      end
    }

When the client dies: the first #gets returns EOF, ruby calls rb_sys_fail()
but the error is discarded because there is a rescue.

When ruby calls #gets for the second time, it returns EOF and errno is not
set, but ferror() still reports an error.

Guy Decoux

Hi,

Well, server.rb is

    loop {
      begin
        line = socket.gets
        puts line
      rescue
      end
    }

When the client dies: the first #gets returns EOF, ruby calls rb_sys_fail()
but the error is discarded because there is a rescue.

When ruby calls #gets for the second time, it returns EOF and errno is not
set, but ferror() still reports an error.

Aha, I understand. If I move clearerr() before rb_sys_fail(), will it
solve the problem?

						matz.

--- io.c 1 Aug 2003 07:23:00 -0000 1.227
+++ io.c 2 Aug 2003 17:59:02 -0000
@@ -910,5 +910,5 @@ appendline(fptr, delim, strp)
         if (ferror(f)) {
+            clearerr(f);
             if (!rb_io_wait_readable(fileno(f)))
                 rb_sys_fail(fptr->path);
-            clearerr(f);
             continue;

···

In message “Re: 1.8.0-previewX rb_sys_fail() on socket instead of an Exception.” on 03/08/03, ts decoux@moulon.inra.fr writes:

Aha, I understand. If I move clearerr() before rb_sys_fail(), will it
solve the problem?

  					matz.

--- io.c 1 Aug 2003 07:23:00 -0000 1.227
+++ io.c 2 Aug 2003 17:59:02 -0000
@@ -910,5 +910,5 @@ appendline(fptr, delim, strp)
         if (ferror(f)) {
+            clearerr(f);
             if (!rb_io_wait_readable(fileno(f)))
                 rb_sys_fail(fptr->path);
-            clearerr(f);
             continue;


Interestingly, the program now never stops printing ‘nil’ (as if the
thread is never switched; the second loop in server.rb, the one with
socket.puts in it, isn’t processed anymore).

Even if I put a small sleep directly after gets, so that the CPU usage drops
from 100% to negligible, the second loop is still not scheduled.

Thanks,
Kero.

Aha, I understand. If I move clearerr() before rb_sys_fail(), will it
solve the problem?

Well, the problem is that there are many calls to ferror() in io.c (for
example in #getc)

Guy Decoux

Interestingly, the program now never stops with printing 'nil' (as if the
Thread is never switched; the second loop in server.rb (with socket.puts
in it) isn't processed anymore).

Well, I hope that you have understood that there is a problem in your
script: you are trying to read from an IO which is in error, and you discard
the error with rescue

Guy Decoux

Interestingly, the program now never stops with printing ‘nil’ (as if the
Thread is never switched; the second loop in server.rb (with socket.puts
in it) isn’t processed anymore).

Well, I hope that you have understood that there is a problem in your
script : you are trying to read from an IO which is in error and discard
the error with rescue

Yes, I understand :-)
If I break out of the first loop with the fix, everything works as
expected. But I can’t help wondering why the second loop isn’t scheduled,
when I give it “opportunity” to be scheduled. There’s probably a good
reason, but I can’t think of one.

Thanks,
Kero.
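
Breaking out of the read loop, as described above, could look roughly like the sketch below. It is written for a current Ruby and uses IO.pipe instead of the TCP socket so that it runs on its own; the helper name read_until_eof is hypothetical and not part of the original scripts.

```ruby
# Sketch: a read loop that stops at EOF or on an I/O error instead of
# swallowing the failure with a bare rescue and calling gets again.
# read_until_eof is a hypothetical helper, not code from server.rb.
def read_until_eof(io)
  lines = []
  loop do
    begin
      line = io.gets
    rescue SystemCallError, IOError => e
      warn "read failed: #{e.class}"
      break                     # the IO is in error: stop reading from it
    end
    break if line.nil?          # nil from gets means EOF: the peer is gone
    lines << line.chomp
  end
  lines
end

reader, writer = IO.pipe
writer.puts "hello"
writer.puts "world"
writer.close                    # simulate the client going away

received = read_until_eof(reader)
reader.close
p received
```

In server.rb the same idea would apply inside the Thread block: leave the loop once gets returns nil or raises, rather than reading again from a dead socket.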

If I break out of the first loop with the fix, everything works as
expected. But I can't help wondering why the second loop isn't scheduled,
when I give it "opportunity" to be scheduled. There's probably a good
reason, but I can't think of one.

For me, it's scheduled. Here is an example:

svg% ruby server.rb
[...]
nil
nil
nil
nil
server.rb:19:in `write': Interrupt
        from server.rb:19:in `puts'
        from server.rb:19
        from server.rb:18:in `loop'
        from server.rb:22
svg%

The program was stopped with Ctrl-C and it was interrupted in the second
loop

Guy Decoux