File.seek and unused bytes

Ruby 1.8.6

(sorry this one takes some setup to explain the context of the question)

I'm using file seek to create equal fixed length rows in a disk file.
The row data itself is not fixed length, but I am forcing the seek
position to be at fixed intervals for the start of each row. So, on disk
it might look something like this:

aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000000
bbbbbbbbbbbbbbX000000000000000000000
ccccccccccccccccccccccccccccccccccccccccX00

I'm really only writing the "aaaa..." and "bbbb...." portions of the
rows with an EOL (an X here so it's easy to see).
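
In code, the write side is roughly this (simplified; the row width and
sample data are made up for illustration, not the actual code):

ROW_WIDTH = 40                       # illustrative fixed row width

rows = ["a" * 25, "b" * 14, "c" * 33]

File.open("rows.dat", "wb") do |f|
  rows.each_with_index do |data, i|
    f.seek(i * ROW_WIDTH)            # jump to the fixed start of this row
    f.write(data + 10.chr)           # write the data plus the EOL marker
  end
end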

I have one operation step which uses tail to grab the last line of the
file. When I do that, I get something like this:

000000000000000000000cccccccccccccccccccccccccccccccccccccccc

which is the empty bytes past the EOL of the line "bbb..." plus the
valid data of the following line.

After some fiddling, it became apparent that I can test each byte
against zero (if byte_data == 0) to know whether it holds data, and trim
off that leading run of zeros -- but I'm not certain those empty bytes
will always be zero.

And finally my question.....

So, my question is about those zeros. Does advancing the seek position
(do we still call that the "cursor"?) intentionally and proactively fill
the unused bytes with what apparently equates to zero? Or am I just
getting lucky that my test files have used virgin disk space which yields
zeros, and the seek position just skips bytes which potentially would
contain garbage from previously used disk sectors?

Can I count on those unused bytes always being zero?

-- gw


Greg Willits wrote:

aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000000
bbbbbbbbbbbbbbX000000000000000000000
ccccccccccccccccccccccccccccccccccccccccX00

Argh. Those should look equal length in a non-proportional font.

aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000
bbbbbbbbbbbbbbX000000000000000000000
cccccccccccccccccccccccccccccccccX00

-- gw


Unfortunately you're getting lucky. A seek adjusts the file pointer but doesn't write anything to disk, so while your 'unused' bytes won't change value as a result of writing data to the file (unless you write the full record), you can't rely on them being zero if you don't.

Also you have to consider that zero may itself be a valid data value within a record :-)

Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net


----
raise ArgumentError unless @reality.responds_to? :reason

Does advancing the seek position intentionally and proactively fill
the unused bytes with what apparently equates to zero? Or am I just
getting lucky?

you're getting lucky

Thanks, I figured as much (just thought I'd see if Ruby was any
different). I've gone ahead and filled empty positions using
string.ljust.
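
For the record, the padding amounts to something like this (the width
and fill are made up here; use whatever the file format calls for):

ROW_WIDTH = 40                        # illustrative fixed row width
row = ("a" * 25) + 10.chr             # row data plus the EOL marker
padded = row.ljust(ROW_WIDTH)         # pad with spaces (or any known fill)
# write `padded` at the row's offset instead of leaving a hole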

tail works fine. It's all text data with 10.chr EOLs, and yeah I would
know whether a 0 is a valid piece of data or not based on the file
format.

Thanks.


I don't think anyone has answered this question directly but on POSIX-like file systems a seek past the end of the file and a subsequent write will cause the intervening bytes (which have never been written) to read as zeros. Whether those 'holes' occupy disk space or not is implementation dependent.
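
It's easy to see for yourself (the file name is arbitrary):

File.open("hole_test.dat", "wb") do |f|
  f.write("start")
  f.seek(100)              # seek well past the current end of file
  f.write("end")           # this write extends the file, creating a hole
end

data = File.open("hole_test.dat", "rb") { |f| f.read }
p data.length              # => 103
p data[5, 10]              # => "\000\000\000\000\000\000\000\000\000\000"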

Gary Wright


So, my question is about those zeros. Does advancing the seek position
(do we still call that the "cursor"?) intentionally and proactively fill
the unused bytes with what apparently equates to zero? Or am I just
getting lucky that my test files have used virgin disk space which yields
zeros, and the seek position just skips bytes which potentially would
contain garbage from previously used disk sectors?

Can I count on those unused bytes always being zero?

Unfortunately you're getting lucky. A seek adjusts the file pointer but doesn't write anything to disk so whilst your 'unused' bytes won't be changing value as a result of writing data to the file unless you write the full record, you can't rely on them not having a value other than zero if you don't.

Actually it should not matter what those bytes are. Your record format should ensure that you know exactly how long an individual record is - as Ellie pointed out:

Also you have to consider that zero may itself be a valid data value within a record :-)

Oh, the details. :-)

Here's another one: if your filesystem supports sparse files and the holes are big enough (at least spanning more than one complete cluster or whatever the smallest allocation unit of the filesystem is called) those bytes might not really come from disk, in which case - when read - they are usually zeroed. But I guess this is also implementation dependent.

One closing remark: using "tail" to look at a binary file is probably a bad idea in itself. "tail" makes certain assumptions about what a line is (it needs to in order to give you N last lines). Those assumptions are usually incorrect when using binary files.

Kind regards

  robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Eleanor McHugh wrote:

Can I count on those unused bytes always being zero?

Unfortunately you're getting lucky. A seek adjusts the file pointer
but doesn't write anything to disk so whilst your 'unused' bytes won't
be changing value as a result of writing data to the file unless you
write the full record, you can't rely on them not having a value other
than zero if you don't.

I don't believe that's the case today. If it were, then you would have a
very easy way to examine the contents of unused sectors on the disk -
which would allow you to see other people's deleted files, passwords
etc.

It was possible on old mainframe systems in the 80's though :-)

But today, if you extend a file using seek, you should always read
zeros.


Gary Wright wrote:


On Jul 3, 2009, at 6:06 PM, Greg Willits wrote:

So, my question is about those zeros. Does advancing the seek position
(do we still call that the "cursor"?) intentionally and proactively fill
the unused bytes with what apparently equates to zero?

I don't think anyone has answered this question directly but on POSIX-
like file systems a seek past the end of the file and a subsequent
write will cause the intervening bytes (which have never been written)
to read as zeros. Whether those 'holes' occupy disk space or not is
implementation dependent.

If indeed this is a fact (and it's consistent with my observation),
then I'd say it's worth taking advantage of. I can't find a definitive
reference to cite though (Pickaxe, The Ruby Way).

-- gw

Brian Candler wrote:

Eleanor McHugh wrote:

Can I count on those unused bytes always being zero?

Unfortunately you're getting lucky. A seek adjusts the file pointer
but doesn't write anything to disk so whilst your 'unused' bytes won't
be changing value as a result of writing data to the file unless you
write the full record, you can't rely on them not having a value other
than zero if you don't.

I don't believe that's the case today. If it were, then you would have a
very easy way to examine the contents of unused sectors on the disk -
which would allow you to see other people's deleted files, passwords
etc.

It was possible on old mainframe systems in the 80's though :-)

80's micros too with BASIC :P

But today, if you extend a file using seek, you should always read
zeros.

That makes a great deal of sense, and would be consistent with what I
was seeing. I was wondering why the values being returned were zeros
instead of nil or something else.

Either way, I know it's a better practice to pack the rows, but I had a
moment of laziness because I'm dealing with a couple million rows and
figured if there was some processing time to be saved, I'd take
advantage of it.

I would have experimented, but I don't know how to ensure that the
various file contents are in fact being written to the exact same disk
space.

-- gw


Greg Willits wrote:

I don't think anyone has answered this question directly but on POSIX-
like file systems a seek past the end of the file and a subsequent
write will cause the intervening bytes (which have never been written)
to read as zeros. Whether those 'holes' occupy disk space or not is
implementation dependent.

If indeed this is a fact (and it's consistent with my observation),
then I'd say it's worth taking advantage of. I can't find a definitive
reference to cite though (Pickaxe, The Ruby Way).

Well, those aren't POSIX references. But from "Advanced Programming in
the UNIX Environment" by the late great Richard Stevens, pub.
Addison-Wesley, p53:

"`lseek` only records the current file offset within the kernel - it
does not cause any I/O to take place. This offset is then used by the
next read or write operation.

The file's offset can get greater than the file's current size, in which
case the next `write` to the file will extend the file. This is referred
to as creating a hole in a file and is allowed. Any bytes in a file that
have not been written are read back as 0."


Brian Candler wrote:

Greg Willits wrote:

I don't think anyone has answered this question directly but on POSIX-
like file systems a seek past the end of the file and a subsequent
write will cause the intervening bytes (which have never been written)
to read as zeros. Whether those 'holes' occupy disk space or not is
implementation dependent.

If indeed this is a fact (and it's consistent with my observation),
then I'd say it's worth taking advantage of. I can't find a definitive
reference to cite though (Pickaxe, The Ruby Way).

Well, those aren't POSIX references. But from "Advanced Programming in
the UNIX Environment" by the late great Richard Stevens, pub.
Addison-Wesley, p53:

"`lseek` only records the current file offset within the kernel - it
does not cause any I/O to take place. This offset is then used by the
next read or write operation.

The file's offset can get greater than the file's current size, in which
case the next `write` to the file will extend the file. This is referred
to as creating a hole in a file and is allowed. Any bytes in a file that
have not been written are read back as 0."

I see, you guys are saying it's an OS-level detail, not a Ruby-specific
detail.

It seems though that any hole in the file must be written to. Otherwise
the file format itself must keep track of every byte that it has written
to or not in order to have a write-nothing / read-as-zero capability.
This would seem to be very inefficient overhead.

Hmm... duh, I can bust out the hex editor and have a look.

<pause>

OK, well, empty bytes created by extending the filesize of a new file
are 0.chr, not an ASCII zero character (well, at least according to the
hex editor app). That could simply be the absence of data from virgin
disk space. I suppose that absence of data could be interpreted however
the app wants, so the hex editor says it is 0.chr and the POSIX code
says it is 48.chr.

Still though, since the file isn't being filled with the data that is
provided by the read-back, that still confuses me. How does the read
know to convert those particular NULL values into ASCII zeros vs a NULL
byte I write on purpose? And it still doesn't really confirm what would
happen when non-virgin disk space is being written to.

Hrrmmm. :-\

Thanks for the discussion so far.

-- gw


Greg Willits wrote:

It seems though that any hole in the file must be written to. Otherwise
the file format itself must keep track of every byte that it has written
to or not in order to have a write-nothing / read-as-zero capability.

Unless you seek over entire blocks, in which case the filesystem can
create a "sparse" file with entirely missing blocks (i.e. the disk usage
reported by du can be much less than the file size)

When you read any of these blocks, you will see all zero bytes.
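
A rough way to check whether a file is sparse is to compare its logical
size with the blocks actually allocated (File::Stat#blocks is in
512-byte units on most Unix systems, and may be nil where unsupported);
the file name and sizes here are arbitrary:

File.open("sparse.dat", "wb") do |f|
  f.seek(10_000_000)       # skip far past the end, spanning many blocks
  f.write("x")
end

st = File.stat("sparse.dat")
puts "logical size: #{st.size} bytes"
puts "allocated:    #{st.blocks ? st.blocks * 512 : 'unknown'} bytes"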

Hmm... duh, I can bust out the hex editor and have a look.

<pause>

OK, well, empty bytes created by extending the filesize of a new file
are 0.chr not an ASCII zero character (well, at least according to the
hex editor app). That could simply be the absence of data from virgin
disk space. I suppose, that absence of data could be interpreted however
the app wants, so the hex editor says it is 0.chr and the POSIX code
says it is 48.chr.

No, POSIX says it is a zero byte (character \0, \x00, byte value 0,
binary 00000000, ASCII NUL, however you want to think of it)


Brian Candler wrote:

It seems though that any hole in the file must be written to. Otherwise
the file format itself must keep track of every byte that it has written
to or not in order to have a write-nothing / read-as-zero capability.

Unless you seek over entire blocks, in which case the filesystem can
create a "sparse" file with entirely missing blocks (i.e. the disk usage
reported by du can be much less than the file size)

When you read any of these blocks, you will see all zero bytes.

OK. But the file system doesn't keep track of anything smaller than the
block, right? So, it's not keeping track of the misc individual holes
created by each extension of the seek (?).

No, POSIX says it is a zero byte (character \0, \x00, byte value 0,
binary 00000000, ASCII NUL, however you want to think of it)

Doh! My zeros are coming from a step in my process (which I was
forgetting) that converts this particular data chunk to integers. And
nil.to_i will generate a zero. So, my bad; that detail
is cleared up.
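
Spelled out (1.8 semantics, since that's the version in play here):

p nil.to_i      # => 0        -- where my misleading zeros came from
p 0.chr         # => "\000"   -- what the bytes in a hole actually read back as
p ?0            # => 48       -- the ASCII digit '0', a different animal entirely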

The only thing I'm still not real clear on is....

- file X gets written to disk block 999 -- the data is a stream of 200
contiguous "A" characters

- file X gets deleted (which AFAIK only deletes the directory entry,
and does not null-out the file data unless the OS has been told to do
just that with a "secure delete" operation)

- file Y gets written to disk block 999 -- the data has holes in it
from extending the seek position

Generally, I wouldn't read in the holes, but I have this one little step
that does end up with some holes, and I know it. What I don't know is
what to expect in those holes: null values, or garbage "A" characters
left over from file X.

Logically I would expect garbage data, but the literal impact of
paragraphs quoted earlier from the Unix book above indicates I should
expect null values. I can't think of any tools I have that would enable
me to test this.

Because I don't know, I've gone ahead and packed the holes with a known
character. However, I'd like to avoid that if I can, because it sucks up
some time on large files, but it's not super critical.

At this point I'm more curious than anything. I appreciate the dialog.

-- gw


Generally, I wouldn't read in the holes, but I have this one little step
that does end up with some holes, and I know it. What I don't know is
what to expect in those holes: null values, or garbage "A" characters
left over from file X.

You should expect null bytes (at least on Posix-like file systems). I'm not sure why you are doubting this.

From the Open Group Base Specification description of lseek:
<http://www.opengroup.org/onlinepubs/009695399/functions/lseek.html>

The lseek() function shall allow the file offset to be set beyond the end of the existing data in the file. If data is later written at this point, subsequent reads of data in the gap shall return bytes with the value 0 until data is actually written into the gap.

Gary Wright


On Jul 5, 2009, at 6:13 PM, Greg Willits wrote:

Generally, I wouldn't read in the holes, but I have this one little step that does end up with some holes, and I know it. What I don't know is what to expect in those holes: null values, or garbage "A" characters left over from file X.

Logically I would expect garbage data, but the literal impact of paragraphs quoted earlier from the Unix book above indicates I should expect null values. I can't think of any tools I have that would enable me to test this.

I would not rely on anything in particular being in those bytes, for the simple reason that doing so reduces the portability of your program. If anything, the whole discussion has shown that there are (or were) different approaches to handling this (including returning old data, which should no longer happen nowadays).

Because I don't know, I've gone ahead and packed the holes with a known character. However, if I can avoid that I want to because it sucks up some time I'd like to avoid in large files, but it's not super critical.

At this point I'm more curious than anything. I appreciate the dialog.

I stick to the point I made earlier: if you need particular data to be present in the slack of your records you need to make sure it's there. Since your IO is done in whole blocks, and you probably aligned your offsets with block boundaries anyway, there should not be a noticeable difference in IO. You probably need a bit more CPU time to generate that data, but that's probably negligible in light of the disk IO overhead.

If you want to save yourself that effort you should probably make sure that your record format allows for easy separation of the data and slack area. There are various well established practices, for example preceding the data area with a length indicator or terminating data with a special marker byte.
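
For example, a minimal length-prefix scheme could look like this (the
4-byte big-endian header is just an illustrative choice):

def write_record(io, data)
  io.write([data.length].pack("N"))   # 4-byte big-endian length prefix
  io.write(data)
end

def read_record(io)
  header = io.read(4)
  return nil unless header && header.length == 4   # end of file
  io.read(header.unpack("N").first)
end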

My 0.02 EUR.

Kind regards

  robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Gary Wright wrote:

Generally, I wouldn't read in the holes, but I have this one little step
that does end up with some holes, and I know it. What I don't know is
what to expect in those holes: null values, or garbage "A" characters
left over from file X.

You should expect null bytes (at least on Posix-like file systems).
I'm not sure why you are doubting this.

I wasn't separating the read from the write. The spec talks about
reading zeros but doesn't talk about writing them. I wasn't trusting
that the nulls were getting written. I think I get it now that the read
is what matters. Whether a null/zero got written, or whether the gaps
are accounted for in some other way, is independent of the data the read
returns.

I still don't see where the nulls come from (if they're not being
written), but if the rules allow me to expect nulls/zeros, and those
gaps are being accounted for somewhere/somehow then that's what matters.

-- gw


Robert Klemme wrote:

Generally, I wouldn't read in the holes, but I have this one little step
that does end up with some holes, and I know it. What I don't know is
what to expect in those holes: null values, or garbage "A" characters
left over from file X.

Logically I would expect garbage data, but the literal impact of
paragraphs quoted earlier from the Unix book above indicates I should
expect null values. I can't think of any tools I have that would enable
me to test this.

I would not expect anything in those bytes for the simple reason that
this reduces portability of your program.

Understood. In this case, I'm making a conscious decision to go with
whatever is faster. I've already written the code so that it is easy to
add back in the packing if it's ever needed.

We're working with large data sets for aggregation, which takes a long
time to run. Second only to the ease and clarity of the top-level
DSL is the speed of the aggregation process itself, so we can afford to
do more analysis.

Because I don't know, I've gone ahead and packed the holes with a known
character. However, if I can avoid that I want to because it sucks up
some time I'd like to avoid in large files, but it's not super critical.

At this point I'm more curious than anything. I appreciate the dialog.

should probably make sure
that your record format allows for easy separation of the data and slack
area. There are various well established practices, for example
preceding the data area with a length indicator or terminating data with
a special marker byte.

Yep, already done that. Where this 'holes' business comes in is that to
stay below the 4GB limit, the data has to be processed and the file
written out in chunks. Each chunk may have a unique line length. So, we
find the longest line of the chunk, and write records at that interval
using seek. Each record terminates with a line feed.

Since we don't know the standard length of each chunk until processing
is done (and the file has already been started), a set of the lengths is
added to the end of the file instead of the beginning.

When reading data, the fastest way to get the last line, which has my
line lengths, is to use tail. This returns a string starting from the
last record's EOL marker to the EOF. This "line" has the potential
(likelihood) to include the empty bytes of the last record in front of
the actual data I want because of how tail interprets "lines" between
EOL markers. I need to strip those empty bytes from the start of the
line before I get to the line lengths data.
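
Which amounts to something like this, where last_line stands in for
whatever tail returned and the fill byte is assumed to be a NUL (the
real pad character and trailer parsing depend on the file format):

trailer = last_line.sub(/\A\000+/, "")   # drop the leading fill bytes
# then parse the line lengths out of `trailer` per the file format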

Every other aspect of the file uses the common approach of lines with
#00 between fields and #10 at the end of the data, followed by zero or
more fill characters to make each row an equal length of bytes.

-- gw


Greg Willits wrote:

I still don't see where the nulls come from (if they're not being
written)

All disk I/O is done in terms of whole blocks (typically 1K)

Whenever the filesystem adds a new block to a file, instead of
reading the existing contents into the VFS cache it just zero-fills a
block in the VFS cache. A write to an offset then updates that block and
marks it 'dirty'. The entire block will then at some point get written
back to disk, including of course any of the zeros which were not
overwritten with user data.


Greg Willits wrote:

Yep, already done that. Where this 'holes' business comes in, is that to
stay below the 4GB limit, the data has to be processed and the file
written out in chunks. Each chunk may have a unique line length. So, we
find the longest line of the chunk, and write records at that interval
using seek. Each record terminates with a line feed.

To me, this approach smells. For example, it could have *really* bad
disk usage if one record in your file is much larger than all the
others.

Is the reason for this fixed-space padding just so that you can jump
directly to record number N in the file, by calculating its offset?

If so, it sounds to me like what you really want is cdb:
http://cr.yp.to/cdb.html

You emit key/value records of the form

+1,50:1->(50 byte record)
+1,70:2->(70 byte record)
+1,60:3->(60 byte record)
...
+2,500:10->(500 byte record)
... etc

then pipe it into cdbmake. The resulting file is built, in a single
pass, with a hash index, allowing you to jump to record with key 'x'
instantly.
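
If you'd rather not use a binding at all, generating cdbmake's input
from Ruby is trivial (the records and file names here are made up):

records = { "1" => "a" * 50, "2" => "b" * 70, "3" => "c" * 60 }

File.open("records.cdbmake", "wb") do |f|
  records.each do |key, value|
    f.write("+#{key.length},#{value.length}:#{key}->#{value}\n")
  end
  f.write("\n")                       # cdbmake input ends with a blank line
end

# then: cdbmake records.cdb records.tmp < records.cdbmake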

There's a nice and simple ruby-cdb library available, which wraps djb's
cdb library.

Of course, with cdb you're not limited to integers as the key to locate
the records, nor do they have to be in sequence. Any unique key string
will do - consider it like an on-disk frozen Hash. (The key doesn't have
to be unique actually, but then when you search for key K you would ask
for all records matching this key)


Robert Klemme wrote:

We're working with large data sets for aggregation which takes a long
time to run, and second only to the ease and clarity of the top level
DSL, is the speed of the aggregation process itself so we can afford to
do more analysis.

Did you actually measure significant differences in time or are you
just assuming there is a significant impact because you write less and
have to do less processing?

Because I don't know, I've gone ahead and packed the holes with a known
character. However, if I can avoid that I want to because it sucks up
some time I'd like to avoid in large files, but it's not super critical.

At this point I'm more curious than anything. I appreciate the dialog.

should probably make sure
that your record format allows for easy separation of the data and slack
area. There are various well established practices, for example
preceding the data area with a length indicator or terminating data with
a special marker byte.

Yep, already done that. Where this 'holes' business comes in, is that to
stay below the 4GB limit, the data has to be processed and the file
written out in chunks. Each chunk may have a unique line length. So, we
find the longest line of the chunk, and write records at that interval
using seek. Each record terminates with a line feed.

Errr, I am not sure I fully understand your approach. What you write
sounds as if you end up with a file containing multiple sections,
each of which has lines of identical length. So a file with two
sections could look like this:

aaN0000
aaaN000
aN00000
aaaaaaN
aaaaN00
bN00
bbbN
bbN0
bbN0

Basically you are combining two approaches in one file: fixed length
records and variable length records with termination marker. That
sounds odd to me. If file size matters then I do not understand why
you do not just write out the file like a regular text file, i.e. only
use the line termination approach.

Since we don't know the standard length of each chunk until processing
is done (and the file has already been started), a set of the lengths is
added to the end of the file instead of the beginning.

When reading data, the fastest way to get the last line which has my
line lengths, is to use tail.

Why don't you open the file, seek to N bytes before the end and read
them? You do not need tail for this and you also have all the file
handling in your Ruby program.
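
Something along these lines, for example (the file name and the amount
to read back are illustrative; read enough to be sure the whole last
line is included):

TRAILER_BYTES = 4096

File.open("rows.dat", "rb") do |f|
  offset = [TRAILER_BYTES, File.size("rows.dat")].min
  f.seek(-offset, IO::SEEK_END)            # position close to the end
  last_line = f.read.split(10.chr).last    # the trailer with the line lengths
end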

Every other aspect of the file uses the common approach of lines with
#00 between fields and #10 at the end of the data, followed by zero or
more fill characters to make each row an equal length of bytes.

It seems either I am missing something or you are doing something
weird for which I do not understand the reason. Can you shed some
more light on the nature of the processing and why you follow this
approach? That would be a wonderful completion of the discussion.

Kind regards

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/