String#unpack and null-terminated strings

Hi,

How can I unpack two or more consecutive C-strings with the
String#unpack method? Like this:

"abc\000def\000".unpack("??") # => ["abc", "def"]

Currently, this seems not to be possible. Any chance to get the
following patch applied, which implements exactly this?

Regards,

Michael

Index: pack.c

===================================================================
RCS file: /src/ruby/pack.c,v
retrieving revision 1.69
diff -r1.69 pack.c
1287a1288,1290
>  *   T    | String  | read zero-terminated string
>  *        |         | (with null char removed)
>  * -------+---------+-----------------------------------------
1389a1393,1408
>         break;
>
>       case 'T':
>         /* read until end of string or until a null character occurs */
>         {
>             char *start = s;
>
>             while (s < send) {     /* don't read more than the whole string */
>               if (*s == '\000') break;
>               s++;
>             }
>
>             rb_ary_push(ary, infected_str_new(start, s-start, str));
>
>             if (s < send && *s == '\000') s++; /* skip null character */
>         }

Michael Neumann mneumann@ntecs.de writes:

Hi,

How can I unpack two or more consecutive C-strings with the
String#unpack method? Like this:

"abc\000def\000".unpack("??") # => ["abc", "def"]

Currently, this seems not to be possible. Any chance to get the
following patch applied, which implements exactly this?

You could use String#split e.g.

irb(main):001:0> "abc\000def\000".split(/\0/)
=> ["abc", "def"]

I know it’s not String#unpack, but hope it helps.
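
A small sketch of where String#split fits and where it falls short. The sample strings are made up for illustration; the point is that split drops trailing empty fields unless it is given a negative limit, so an empty C-string at the end of the buffer is lost by default.

```ruby
# Sketch: splitting a buffer of consecutive C-strings on NUL.
# The sample data is made up for illustration.
data = "abc\000def\000"

plain = data.split("\0")       # trailing empty field is dropped
kept  = data.split("\0", -1)   # a negative limit keeps trailing empties

# An empty C-string at the end of the buffer disappears without the limit.
with_empty      = "abc\000\000".split("\0")
with_empty_kept = "abc\000\000".split("\0", -1)

p plain            # ["abc", "def"]
p kept             # ["abc", "def", ""]
p with_empty       # ["abc"]
p with_empty_kept  # ["abc", "", ""]
```

This only helps when the whole buffer consists of C-strings; it does not combine with other unpack directives.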

Mike

mike@stok.co.uk | The “`Stok’ disclaimers” apply.
http://www.stok.co.uk/~mike/ | GPG PGP Key 1024D/059913DA
mike@exegenix.com | Fingerprint 0570 71CD 6790 7C28 3D60
http://www.exegenix.com/ | 75D2 9EC4 C1C0 0599 13DA

Michael Neumann mneumann@ntecs.de wrote in message news:20040424221220.GA6199@miya.intranet.ntecs.de

Hi,

How can I unpack two or more consecutive C-strings with the
String#unpack method? Like this:

"abc\000def\000".unpack("??") # => ["abc", "def"]

Currently, this seems not to be possible. Any chance to get the
following patch applied, which implements exactly this?

Regards,

Michael

"abc\000def\000".unpack("A3xA3") # => ["abc","def"]

Using the example you later posted…

"\100String\000\100".unpack("CA6xC") # => [64,"String",64]
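
One caveat with this approach, shown as a small sketch with made-up data: the fixed-width form only works when the string lengths are known in advance. If the data does not match the widths in the format, fields are silently misread.

```ruby
# Sketch: fixed-width directives need the lengths up front.
fmt = "A3xA3"

ok  = "abc\000def\000".unpack(fmt)   # widths match the data
bad = "ab\000cdef\000".unpack(fmt)   # widths do not match the data

p ok   # ["abc", "def"]
p bad  # ["ab", "def"]  (the "c" is consumed by the x directive)
```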

Regards,

Dan

Michael Neumann wrote:

Hi,

How can I unpack two or more consecutive C-strings with the
String#unpack method? Like this:

"abc\000def\000".unpack("??") # => ["abc", "def"]

Currently, this seems not to be possible. Any chance to get the
following patch applied, which implements exactly this?

[snip] diff -r1.69 pack.c

At the risk of being told to clear off and write my own spec,
I think that an ambiguity has intruded into the designer’s mind.

The A and Z string field formats should IMO be recovered from
left to right. Doesn’t the term “string” relate here to a
string element within a packed field? The packed field just
happens to be a Ruby String.

If this is going to break code, I wish that it could happen
from 1.9.
As it is now, A and Z are behaving the way I would expect
A* and Z* to (i.e. * uses all remaining elements).

There’s String#rstrip for removing spaces and nulls from the
end of a String.

Unpack is very useful for decoding structures but with the
current behaviour if a structure were to contain a null-
terminated string element it would break the flow …
… as Michael has highlighted.

Please, Matz.

daz

Sure this works. But I want to mix it with other data-types like:

"\100String\000\100".unpack("CTC") # T=null-term string

# => [64, "String", 64]

Otherwise I have to write:

str = "\100String\000\100"
a, str = str.unpack("Ca*")
b, str = str.split("\000", 2)
c, _ = str.unpack("Ca*")

p [a, b, c] # => [64, "String", 64]

Which is a bit ugly :-)
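
As a possible alternative to re-slicing the string by hand, here is a minimal sketch that reads the same fields sequentially through StringIO from the standard library. This is not what the patch does; it is only another way to mix a byte, a null-terminated string and another byte, and it assumes every string field really is terminated (gets returns nil at end of input).

```ruby
require "stringio"

# Sketch: read mixed binary fields one after another.
io = StringIO.new("\100String\000\100".b)

a = io.getbyte                  # one unsigned byte
b = io.gets("\0").chomp("\0")   # read up to and including NUL, then drop it
c = io.getbyte                  # the byte after the terminator

p [a, b, c]  # [64, "String", 64]
```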

Python’s struct.unpack has an “s” format specifier which does exactly what
I want. Perl and Ruby don’t have this.

http://www.python.org/doc/current/lib/module-struct.html

Regards,

Michael

On Sun, Apr 25, 2004 at 07:44:05AM +0900, Mike Stok wrote:

Michael Neumann mneumann@ntecs.de writes:

Hi,

How can I unpack two or more consecutive C-strings with the
String#unpack method? Like this:

"abc\000def\000".unpack("??") # => ["abc", "def"]

Currently, this seems not to be possible. Any chance to get the
following patch applied, which implements exactly this?

You could use String#split e.g.

irb(main):001:0> "abc\000def\000".split(/\0/)
=> ["abc", "def"]

Hi,

At Sun, 25 Apr 2004 14:34:03 +0900,
daz wrote in [ruby-talk:98298]:

The A and Z string field formats should IMO be recovered from
left to right. Doesn’t the term “string” relate here to a
string element within a packed field? The packed field just
happens to be a Ruby String.

Sounds nice.

Index: pack.c
===================================================================
RCS file: /cvs/ruby/src/ruby/pack.c,v
retrieving revision 1.69
diff -u -2 -p -r1.69 pack.c
--- pack.c 18 Apr 2004 23:19:45 -0000 1.69
+++ pack.c 25 Apr 2004 06:39:33 -0000
@@ -435,5 +435,5 @@ static unsigned long utf8_to_uv _((char*
  *   X     |  Back up a byte
  *   x     |  Null byte
- *   Z     |  Same as ``A''
+ *   Z     |  Same as ``a'', except that null is added with *
  */

@@ -524,6 +524,9 @@ pack_pack(ary, fmt)
       case 'A':     /* ASCII string (space padded) */
       case 'Z':     /* null terminated ASCII string */
-        if (plen >= len)
+        if (plen >= len) {
             rb_str_buf_cat(res, ptr, len);
+            if (p[-1] == '*' && type == 'Z')
+                rb_str_buf_cat(res, nul10, 1);
+        }
         else {
             rb_str_buf_cat(res, ptr, plen);
@@ -1174,4 +1177,5 @@ infected_str_new(ptr, len, str)
  *     "abc \0\0abc \0\0".unpack('A6Z6')   #=> ["abc", "abc "]
  *     "abc \0\0".unpack('a3a3')           #=> ["abc", " \000\000"]
+ *     "abc \0abc \0".unpack('Z*Z*')       #=> ["abc ", "abc "]
  *     "aa".unpack('b8B8')                 #=> ["10000110", "01100001"]
  *     "aaa".unpack('h2H2c')               #=> ["16", "61", 97]
@@ -1285,4 +1289,5 @@ infected_str_new(ptr, len, str)
  *   -------+---------+-----------------------------------------
  *     Z    | String  | with trailing nulls removed
+ *          |         | upto first null with *
  *   -------+---------+-----------------------------------------
  *     @    | ---     | skip to the offset given by the
@@ -1377,5 +1382,13 @@ pack_unpack(str, fmt)
       case 'Z':
         if (len > send - s) len = send - s;
-        {
+        if (star) {
+            char *t = s;
+            while (t < send && *t) t++;
+            rb_ary_push(ary, infected_str_new(s, t - s, str));
+            if (t < send) t++;
+            s = t;
+        }
+        else {
             long end = len;
             char *t = s + len - 1;


Nobu Nakada

That’s exactly how I expected Z to behave. Thanks!
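
For reference, a short sketch of how this reads in a Ruby that includes the change (current Ruby releases behave this way, as far as I can tell): on unpack, Z* stops at the first null and skips it, and on pack it appends a terminating null. The inputs are the examples used earlier in this thread.

```ruby
# Sketch: 'Z*' with the patched semantics.
strings = "abc\000def\000".unpack("Z*Z*")
mixed   = "\100String\000\100".unpack("CZ*C")
packed  = ["abc", "def"].pack("Z*Z*")   # pack appends the terminating NUL

p strings                        # ["abc", "def"]
p mixed                          # [64, "String", 64]
p packed == "abc\000def\000"     # true
```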

Regards,

Michael

On Sun, Apr 25, 2004 at 03:39:53PM +0900, nobu.nokada@softhome.net wrote:

Hi,

At Sun, 25 Apr 2004 14:34:03 +0900,
daz wrote in [ruby-talk:98298]:

The A and Z string field formats should IMO be recovered from
left to right. Doesn’t the term “string” relate here to a
string element within a packed field? The packed field just
happens to be a Ruby String.

Sounds nice.

[patch]

Nobu patched:

--- pack.c 18 Apr 2004 23:19:45 -0000 1.69
+++ pack.c 25 Apr 2004 06:39:33 -0000

[…]

       case 'Z':
         if (len > send - s) len = send - s;
-        {
+        if (star) {
+            char *t = s;
+            while (t < send && *t) t++;
+            rb_ary_push(ary, infected_str_new(s, t - s, str));
+            if (t < send) t++;
+            s = t;
+        }
+        else {

Combining that with recognition of the length specifier:

===============================

case 'Z':
{
    char *t = s;

    if (len > send-s) len = send-s;
    while (t < s+len && *t) t++;
    rb_ary_push(ary, infected_str_new(s, t-s, str));
    if (t < send) t++;
    s = star ? t : s+len;
}
break;

===============================

s = "abc\0def\0\0jkl\0"

s.unpack('Z2Z*Z*') #-> ["ab", "c", "def"]
s.unpack('Z6Z*Z*') #-> ["abc", "f", ""]
s.unpack('Z7Z*Z*') #-> ["abc", "", ""]
s.unpack('Z8Z*Z*') #-> ["abc", "", "jkl"]
s.unpack('Z9Z*Z*') #-> ["abc", "jkl", ""]
s.unpack('Z*Z42') #-> ["abc", "def"]
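
To make the rule in the C fragment easier to follow, here is a rough pure-Ruby model of it. The helper unpack_z is hypothetical (it is not part of Ruby), and it only handles 'Z' fields: each entry in counts is either an Integer width or :star, standing for '*'.

```ruby
# Sketch: a pure-Ruby model of the proposed 'Z' rule.
# unpack_z is a hypothetical helper for illustration only.
def unpack_z(str, counts)
  bytes = str.b   # work on raw bytes
  pos = 0
  counts.map do |count|
    star = (count == :star)
    remaining = bytes.bytesize - pos
    len = star ? remaining : [count, remaining].min
    field = bytes.byteslice(pos, len)
    nul = field.index("\0")
    value = nul ? field.byteslice(0, nul) : field
    # '*' consumes the string plus its terminator (if there is one);
    # a fixed width consumes the whole field, junk after the NUL included.
    pos += star ? [value.bytesize + 1, remaining].min : len
    value
  end
end

s = "abc\0def\0\0jkl\0"
p unpack_z(s, [2, :star, :star])  # ["ab", "c", "def"]
p unpack_z(s, [6, :star, :star])  # ["abc", "f", ""]
p unpack_z(s, [8, :star, :star])  # ["abc", "", "jkl"]
```

The two branches of the pos update correspond to the last assignment in the C code, s = star ? t : s+len.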

daz

Hi,

At Mon, 26 Apr 2004 16:19:04 +0900,
daz wrote in [ruby-talk:98364]:

Combining that with recognition of the length specifier:

===============================

case 'Z':
{
    char *t = s;

    if (len > send-s) len = send-s;
    while (t < s+len && *t) t++;
    rb_ary_push(ary, infected_str_new(s, t-s, str));
    if (t < send) t++;
    s = star ? t : s+len;
}
break;

===============================

I had also considered that, but

s = "abc\0def\0\0jkl\0"

s.unpack('Z6Z*Z*') #-> ["abc", "f", ""]

It can’t round trip with Array#pack, so I discarded this plan.

Nobu Nakada

Nobu wrote:

daz wrote in [ruby-talk:98364]:

Combining that with recognition of the length specifier:

I had also considered that, but

s = "abc\0def\0\0jkl\0"

s.unpack('Z6Z*Z*') #-> ["abc", "f", ""]

It can’t round trip with Array#pack, so I discarded this plan.

But the user has specified that the first field is
fixed-width(6) and null-terminated so:

"abc\000de" == "abc\000\000\000" ==> "abc"

Everything from “\000” to the end of the field is junk
because the user told us so by using ‘Z’.

We don’t need to apologise that pack didn’t replace the
exact junk that was there before :-?

Round trip:

s = "abc\000def\000\000jkl\000"
zf = 'Z6Z*Z*'

s.unpack(zf) #-> ["abc", "f", ""]
s.unpack(zf).pack(zf) #-> "abc\000\000\000f\000\000"
s.unpack(zf).pack(zf).unpack(zf) #-> ["abc", "f", ""]

The fixed width consumes the added zero padding bytes so
it doesn’t create bogus extra fields.
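
Put another way, and assuming a Ruby where 'Z' follows the rule proposed above: unpack followed by pack does not reproduce the original bytes, but it yields a normalized string that is stable from then on. A minimal sketch of that check:

```ruby
# Sketch: unpack + pack normalizes the junk after the first NUL in a
# fixed-width 'Z' field, and the normalized form is a fixed point.
s  = "abc\000def\000\000jkl\000"
zf = "Z6Z*Z*"

normalized = s.unpack(zf).pack(zf)

p normalized == s                               # false: the bytes differ
p normalized.unpack(zf) == s.unpack(zf)         # true: same field values
p normalized.unpack(zf).pack(zf) == normalized  # true: stable from here on
```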

To me, the result below seems not to do what was requested:

s.unpack('Z6Z*Z*') #-> ["abc\000de", "f", ""]

I’m probably missing a crucial point here?

daz

Hi,

At Tue, 27 Apr 2004 06:54:03 +0900,
daz wrote in [ruby-talk:98456]:

s = "abc\0def\0\0jkl\0"

s.unpack('Z6Z*Z*') #-> ["abc", "f", ""]

It can’t round trip with Array#pack, so I discarded this plan.

But the user has specified that the first field is
fixed-width(6) and null-terminated so:

"abc\000de" == "abc\000\000\000" ==> "abc"

Everything from “\000” to the end of the field is junk
because the user told us so by using ‘Z’.

We don’t need to apologise that pack didn’t replace the
exact junk that was there before :-?

Hmmm, sounds reasonable.

Nobu Nakada

Hi,

In message “Re: String#unpack and null-terminated strings” on 04/04/27, nobu.nokada@softhome.net writes:

Hmmm, sounds reasonable.

I finally got time to consider this issue. Perl seems to work the way
Daz described in [ruby-talk:98364]. Could you commit the changes, Nobu?

						matz.

Hi,

At Mon, 10 May 2004 17:53:35 +0900,
Yukihiro Matsumoto wrote in [ruby-talk:99719]:

I finally got time to consider this issue. Perl seems to work the way
Daz described in [ruby-talk:98364]. Could you commit the changes, Nobu?

What about 1.8?

Nobu Nakada

Hi.

In message “Re: String#unpack and null-terminated strings” on 04/05/12, nobu.nokada@softhome.net writes:

At Mon, 10 May 2004 17:53:35 +0900,
Yukihiro Matsumoto wrote in [ruby-talk:99719]:

I finally got time to consider this issue. Perl seems to work the way
Daz described in [ruby-talk:98364]. Could you commit the changes, Nobu?

What about 1.8?

Hmm. Go ahead. I now think it’s the only reasonable behavior for “Z”
with NUL containing strings.

						matz.

Yukihiro Matsumoto wrote:

Hi.

At Mon, 10 May 2004 17:53:35 +0900,
Yukihiro Matsumoto wrote in [ruby-talk:99719]:

I finally got time to consider this issue. Perl seems to work the way
Daz described in [ruby-talk:98364]. Could you commit the changes, Nobu?

What about 1.8?

Hmm. Go ahead. I now think it’s the only reasonable behavior for “Z”
with NUL containing strings.

matz.

Thanks, Matz.

The plea below is now wasted :))

In message “Re: String#unpack and null-terminated strings” on 04/05/12, nobu.nokada@softhome.net writes:

===============================================================

Hi Nobu,

Good to see your return, as always.

As the change only affects ‘Z’-types in Strings with embedded null(s),
the impact should be extremely low.

I’m trying to think of any kind of string which might contain
significant nulls but also has a null as terminator.

I’ve seen some where null delimits fields and double-null terminates
but that rare case might be the only one to break iff a
programmer had decided that the best method to use on that type of string
was unpack(‘Z*’).
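
A small sketch of that rare case, with made-up sample data, assuming a Ruby where Z* stops at the first null: unpack('Z*') returns only the first field of such a list, so code that wants every field would need something like split instead.

```ruby
# Sketch: a NUL-delimited list ended by a double NUL (made-up sample data).
multi = "foo\000bar\000\000"

first_only = multi.unpack("Z*")   # stops at the first NUL
all_fields = multi.split("\0")    # trailing empty fields are dropped

p first_only  # ["foo"]
p all_fields  # ["foo", "bar"]
```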

Embedded nulls are common when reading from binary files
(e.g. encoded characters) but I feel that it would never be a good idea
to strip trailing nulls in that context.

Voting +1 for inclusion in 1.8, also. Much more usable :-)

Thanks,

daz

===============================================================