Does Ruby need a "line separator" class?

I've run into a problem where Ruby can't handle newlines on Windows
because the regexp is explicitly looking for \n and not \r\n.

In the Java world, there is a system property to represent line
separator so that you can write code that is cross-platform with respect
to line separation on Unix/Windows/Mac. Is there an equivalent
abstraction of the newline character in Ruby? If not, where does it
belong?

For some reason, I thought I read somewhere that sometimes the "\n"
character is overloaded in this way (to represent a "newline" regardless
of platform), but not sure if I'm misremembering.

Thanks,
Wes

···

--
Posted via http://www.ruby-forum.com/.

It shouldn't look for CRLFs. The rules of the game in languages that inherit the newline normalization approach from C (those include C++, and Perl, for instance, but not Java) are that if you work in text mode and the text file follows runtime conventions, you only read and print "\n"s.

That's because there's an intermediate IO layer that transforms CRLF into LF in CRLF platforms on reading, and LF back to CRLF on writing.

In Java this is handled in a different way: "\n" is not portable in Java, and portable Java code uses method calls like println. But in Ruby, a portable regexp that assumes text mode and data following the runtime platform's newline conventions has to use "\n", because no CR ever gets into the string.
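
A minimal sketch of that translation, for illustration only. It emulates the read and write sides of a text-mode layer on a CRLF platform with plain string substitution instead of real file I/O, so it behaves the same anywhere. The helper names `read_like_text_mode` and `write_like_text_mode` are hypothetical and not part of Ruby.

```ruby
# Emulation of a text-mode I/O layer on a CRLF platform.
# Both helper names are made up for this example.

# Reading: CRLF on disk becomes "\n" in the string the program sees.
def read_like_text_mode(bytes_on_disk)
  bytes_on_disk.gsub("\r\n", "\n")
end

# Writing: "\n" in the string becomes CRLF on disk.
def write_like_text_mode(string)
  string.gsub("\n", "\r\n")
end

on_disk   = "first\r\nsecond\r\n"
in_memory = read_like_text_mode(on_disk)

# The program only ever sees "\n", so a regexp written against "\n" matches.
matches    = in_memory.scan(/^(\w+)\n/).flatten
round_trip = write_like_text_mode(in_memory)
```

Under these assumptions the regexp never needs to mention "\r": the CR is removed before the string reaches the program and restored when the string is written back.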

-- fxn

···

On Jul 31, 2006, at 5:40 PM, Wes Gamble wrote:

I've run into a problem where Ruby can't handle newlines on Windows
because the regexp is explicitly looking for \n and not \r\n.

This has come up in the JRuby project fairly frequently since Java wants to
normalize line-terminators internally to the underlying platform, rather
than normalizing to \n and handling conversion on read-write. Xavier, are
you saying that Ruby has in its IO layer code to convert from CRLF to LF on
input/output, and this is the primary means of normalizing newlines? We have
had in our bug tracker a patch that resolves JRuby's newline issues in a
similar way, but had not committed it pending research into whether this
would be appropriate and sufficient.

···

On 7/31/06, Xavier Noria <fxn@hashref.com> wrote:

On Jul 31, 2006, at 5:40 PM, Wes Gamble wrote:

> I've run into a problem where Ruby can't handle newlines on Windows
> because the regexp is explicitly looking for \n and not \r\n.

It shouldn't look for CRLFs. The rules of the game in languages that
inherit the newline normalization approach from C (those include C++,
and Perl, for instance, but not Java) are that if you work in text
mode and the text file follows runtime conventions, you only read and
print "\n"s.

That's because there's an intermediate IO layer that transforms CRLF
into LF in CRLF platforms on reading, and LF back to CRLF on writing.

--
Contribute to RubySpec! @ Welcome to headius.com
Charles Oliver Nutter @ headius.blogspot.com
Ruby User @ ruby.mn
JRuby Developer @ www.jruby.org
Application Architect @ www.ventera.com

Xavier,

That's interesting.

In a pure Ruby (Rails) app, I've had to modify regexps to handle the
\r\n sequence so that my regexps will work in a Windows environment.
I'm guessing that this is related to the "file follows runtime
conventions" in your post. Meaning that the file that I'm processing
(which is actually sourced externally) did not conform to C runtime
conventions when it was written.

In general, this seems simple enough to handle: you just allow for
optional \r and \n combinations in your regexp (assuming you set the
multiline flag for the regexp), like so:

[^\r\n]*
[\r\n]*
(\r*\n*)
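
As a hedged illustration of patterns like these, here is a small Ruby sketch run against made-up sample data. It uses `\r?\n` as one possible way to accept either terminator; that choice is an assumption for the example, not something prescribed above.

```ruby
# Made-up sample mixing Windows (CRLF) and Unix (LF) line endings.
data = "alpha\r\nbeta\ngamma\r\n"

# \r?\n accepts either terminator, one line break at a time.
lines = data.split(/\r?\n/)

# [^\r\n]* captures a line's content and stops before any terminator.
first_line = data[/\A[^\r\n]*/]

# [\r\n]* is greedier: it consumes a whole run of terminators at once,
# so consecutive blank lines collapse into a single match.
run = "a\r\n\r\nb"[/[\r\n]+/]
```

Whether collapsing runs of terminators is acceptable depends on whether blank lines carry meaning in the data being parsed.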

Wes

···

FWIW, I'm pursuing this question because of the JRuby issue.

···

If I am not mistaken, in Ruby that is delegated to stdio. After a quick code inspection I think the exact point where that is done is in the call to write():

   r = write(fileno(f), RSTRING(str)->ptr+offset, l);

That's in the function io_fwrite(), line 455 of io.c in Ruby 1.8.4.

In Perl that was delegated to stdio as well until 5.8.0, where the I/O layer was replaced with PerlIO, which is now responsible for that filtering on CRLF platforms.

-- fxn

···

On Jul 31, 2006, at 6:15 PM, Charles O Nutter wrote:

This has come up in the JRuby project fairly frequently since Java wants to
normalize line-terminators internally to the underlying platform, rather
than normalizing to \n and handling conversion on read-write. Xavier, are
you saying that Ruby has in its IO layer code to convert from CRLF to LF on
input/output, and this is the primary means of normalizing newlines? We have
had in our bug tracker a patch that resolves JRuby's newline issues in a
similar way, but had not committed it pending research into whether this
would be appropriate and sufficient.

Yes, that is an important point.

When we talk about portability as far as newlines are concerned, we are assuming that the newline conventions of the platform and of the data match. A portable line-oriented script might fail if it is running on Linux and processing text files from a FAT32 partition that were generated by some Windows program. There are a lot of common situations in which conventions may not match. A portable line-oriented script is not supposed to handle those situations; a robust line-oriented script should do something sensible with foreign conventions.

Web programming is one of them, because you cannot assume anything about input that comes from a text area or an uploaded text file, for instance. In that case you had better normalize first (written on the fly):

   normalized_text_area = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")
   # Now normalized_text_area contains only "\n" line endings and all
   # standard line-oriented idioms will work on it.

In Ruby we are done, because "\n" is "\012" everywhere. In Perl it gets slightly more complicated, because "\n" is "\015" on pre-X Mac OS. But you see the idea and why you do that.
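
To make the effect concrete, here is a small sketch applying the same two-pass normalization to a made-up string that mixes all three conventions. The sample input and variable names are assumptions for illustration.

```ruby
# Made-up input as it might arrive from a form: CRLF, bare CR, and LF mixed.
text_area = "one\r\ntwo\rthree\nfour"

# Before normalizing, each_line splits on "\n" only, so the bare CR
# does not end a line and a stray "\r" stays attached to "one".
raw_line_count = text_area.each_line.count

normalized = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")

# After normalizing, line-oriented idioms see four clean lines.
lines = normalized.each_line.map(&:chomp)
```

The CRLF pass has to run first; if bare CRs were replaced first, every CRLF would turn into two newlines.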

-- fxn (<-- whose article about newlines for O'Reilly is about to appear)

···

On Jul 31, 2006, at 6:23 PM, Wes Gamble wrote:

In a pure Ruby (Rails) app, I've had to modify regexps to handle the
\r\n sequence so that my regexps will work in a Windows environment.
I'm guessing that this is related to the "file follows runtime
conventions" in your post. Meaning that the file that I'm processing
(which is actually sourced externally) did not conform to C runtime
conventions when it was written.

I figured as much :) From what Xavier says, we may be closer (with Ola's
patch) than previously thought...

···

On 7/31/06, Wes Gamble <weyus@att.net> wrote:

FWIW, I'm pursuing this question because of the JRuby issue.

A large part of our problem is that we currently tend to normalize
everything to \n....all the time. That has the effect of also writing out \n
to the filesystem for newlines, which as you describe above causes problems
when trying to re-read. So for the case in question, we run Rails...it
generates files with newlines...we normalize those newlines to \n and write
such to disk...and then future use of those files (in this case, ERB
templates) fails because the newlines aren't handled correctly (i.e. we
can't normalize \r\n to \n again because they're already \n on disk).

So it seems the IO approach may do well for us, where newlines are read from
platform-specific and written to platform-specific.
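
A rough sketch of that idea in Ruby, under stated assumptions: the module `NewlineBoundary`, its method names, and the `host_os` check are all hypothetical and are not JRuby's actual code. The point is only that translation happens at the I/O boundary and nowhere else.

```ruby
require "rbconfig"

# Hypothetical helper: decide the platform's line terminator once,
# then translate only when text crosses the I/O boundary.
module NewlineBoundary
  WINDOWS      = /mswin|mingw/.match?(RbConfig::CONFIG["host_os"])
  PLATFORM_EOL = WINDOWS ? "\r\n" : "\n"

  # Platform form -> internal form ("\n" only).
  def self.from_platform(text, eol = PLATFORM_EOL)
    eol == "\n" ? text : text.gsub(eol, "\n")
  end

  # Internal form -> platform form.
  def self.to_platform(text, eol = PLATFORM_EOL)
    eol == "\n" ? text : text.gsub("\n", eol)
  end
end

# Pass the terminator explicitly to simulate a CRLF platform anywhere.
internal = NewlineBoundary.from_platform("<p>\r\nhi\r\n</p>\r\n", "\r\n")
on_disk  = NewlineBoundary.to_platform(internal, "\r\n")
```

With this shape, a template written to disk and read back later goes through the same pair of conversions, so the file on disk always follows the platform's convention.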

···

On 7/31/06, Xavier Noria <fxn@hashref.com> wrote:

On Jul 31, 2006, at 6:23 PM, Wes Gamble wrote:

> In a pure Ruby (Rails) app, I've had to modify regexps to handle the
> \r\n sequence so that my regexps will work in a Windows environment.
> I'm guessing that this is related to the "file follows runtime
> conventions" in your post. Meaning that the file that I'm processing
> (which is actually sourced externally) did not conform to C runtime
> conventions when it was written.

Yes, that is an important point.

When we talk about portability as far as newlines are concerned, we are
assuming that the newline conventions of the platform and of the data
match. A portable line-oriented script might fail if it is running on
Linux and processing text files from a FAT32 partition that were
generated by some Windows program. There are a lot of common situations
in which conventions may not match. A portable line-oriented script is
not supposed to handle those situations; a robust line-oriented script
should do something sensible with foreign conventions.

Just for the archives, this normalizes in Ruby with only one pass:

   normalized_text_area = text_area.gsub(/\015\012?/, "\n")

though it is less explicit. While we are at it, let me add that Unicode text may contain a few more newline code points. All in all this is a PITA, like character encodings, but it is what we've got for historical reasons.
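
A quick sketch checking that the one-pass and two-pass versions agree on a handful of made-up inputs, followed by one possible extension for Unicode text. Which code points to treat as line breaks is an assumption here: NEL (U+0085), LINE SEPARATOR (U+2028), and PARAGRAPH SEPARATOR (U+2029) are used as examples, and the constant name is hypothetical.

```ruby
two_pass = ->(s) { s.gsub(/\015\012/, "\n").gsub(/\015/, "\n") }
one_pass = ->(s) { s.gsub(/\015\012?/, "\n") }

# Made-up edge cases: CRLF, bare CR, bare LF, CR before CRLF, LF before CR.
samples = ["a\r\nb", "a\rb", "a\nb", "a\r\r\nb", "a\n\rb"]
same = samples.all? { |s| two_pass.call(s) == one_pass.call(s) }

# Hypothetical extension for Unicode input (assumed set of code points).
UNICODE_NEWLINE = /\r\n?|\u0085|\u2028|\u2029/
unicode_normalized = "a\u2028b\u0085c\r\nd".gsub(UNICODE_NEWLINE, "\n")
```

The extended pattern assumes the string is already valid UTF-8; it says nothing about text that arrives in another encoding.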

-- fxn

···

On Jul 31, 2006, at 6:54 PM, Xavier Noria wrote:

  normalized_text_area = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")

If those files are only handled by that application there is no problem because \ns are precisely what the script should see.

For instance, if you pass a Unix text file to a line-oriented script running on Windows, the script will work as long as it only reads. That's because LFs not following a CR are left untouched by the I/O layer, and by a happy coincidence LF is what readline expects. So everything works; by chance, but it works.

The problem is that the application generates text files that do not follow the conventions of the platform, and other programs may assume they do.
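
Both halves of that observation can be sketched with plain strings. This is an emulation under stated assumptions: the read-side translation of a CRLF platform is imitated with `gsub`, and the "other program" is a hypothetical consumer that splits strictly on CRLF.

```ruby
# A file written with Unix conventions (made-up content).
unix_file = "x\ny\n"

# Read side on a CRLF platform: only CRLF pairs are translated,
# so LF-only data passes through unchanged and line idioms still work.
seen_by_script = unix_file.gsub("\r\n", "\n")
lines = seen_by_script.each_line.map(&:chomp)

# A hypothetical consumer that assumes platform conventions and splits
# strictly on CRLF finds no terminator and sees one long line.
strict_consumer_lines = unix_file.split("\r\n")
```

The reading script is fine by coincidence, while the strict consumer is the one that breaks.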

-- fxn

···

On Jul 31, 2006, at 7:27 PM, Charles O Nutter wrote:

A large part of our problem is that we currently tend to normalize
everything to \n....all the time. That has the effect of also writing out \n
to the filesystem for newlines, which as you describe above causes problems
when trying to re-read. So for the case in question, we run Rails...it
generates files with newlines...we normalize those newlines to \n and write
such to disk...and then future use of those files (in this case, ERB
templates) fails because the newlines aren't handled correctly (i.e. we
can't normalize \r\n to \n again because they're already \n on disk).

I was thinking about this a little more.

Why wouldn't JRuby just take advantage of the Java runtime's
normalization facility in this case, using the JVM's notion of "newline"
on the particular platform to handle I/O?

Is the JRuby issue that only _some_ of the code that is doing I/O is
pure Java and some other set of the code is Ruby so that trying to
always use the JVM "line separator" concept won't work?

Wes

···

The issues get complicated, but the biggest underlying issue is that we
can't easily look like unix on unix and windows on windows because Java
looks basically the same everywhere...that is except for crap specific to
unix and windows. If we pretend to be one or the other all the time, then
the other platform breaks. If we try to emulate both, we run into things
where we simply can't do it...we can act like both unix and windows for some
things but not others. Ultimately we try to normalize things to some
amorphous "java" platform, but then Ruby has no idea what we're talking
about and falls back on either windows or unix behavior.

We've mostly been able to trick Ruby into doing the right things on
different platforms, and this will probably work the same way. It's just a
matter of figuring out where newlines get normalized, normalizing them
ourselves to something appropriate internally for Java on platform X, and
then handling the conversion of that normalized format back out to the
platform again. Figuring out exactly what happens to \r\n everywhere it's
encountered within Windows-based C Ruby will help us figure out where the
in/out has to happen.

···

On 7/31/06, Wes Gamble <weyus@att.net> wrote:

I was thinking about this a little more.

Why wouldn't JRuby just take advantage of the Java runtime's
normalization facility in this case, using the JVM's notion of "newline"
on the particular platform to handle I/O?

Is the JRuby issue that only _some_ of the code that is doing I/O is
pure Java and some other set of the code is Ruby so that trying to
always use the JVM "line separator" concept won't work?

In this particular case, could
java.lang.System.getProperty("line.separator") be used to handle
platform-specific reading/writing? That way, you get to piggyback on
the multiplatform support built into Java. If the low-level I/O code is
centralized, it seems like this would be the way to go.

Are there performance implications for this approach? Seems like you
could just grab all of the system specific newline properties from the
System object upon the initialization of the JRuby interpreter and just
refer to them later.

Wes

···

It's a bit more complicated than that...bring this up on the JRuby dev list
and others can chime in there as well.

···

On 7/31/06, Wes Gamble <weyus@att.net> wrote:

In this particular case, could
java.lang.System.getProperty("line.separator") be used to handle
platform-specific reading/writing? That way, you get to piggyback on
the multiplatform support built into Java. If the low-level I/O code is
centralized, it seems like this would be the way to go.

Are there performance implications for this approach? Seems like you
could just grab all of the system specific newline properties from the
System object upon the initialization of the JRuby interpreter and just
refer to them later.

Wes
