Hi,
Hi Kubo,
I tried proceeding with the above-mentioned APIs; however, I am seeing
some interesting behaviour and I am not sure whether I am using the right
constructs. Below is the Ruby script I am using:
======================================
#encoding: utf-8
puts "Results in C extension"
puts "----------------------"
require 'ibm_db'
str = "insert into woods (name) values ('GÜHRING文')"
conn = IBM_DB.connect 'DRIVER={IBM DB2 ODBC DRIVER};DATABASE=devdb;HOSTNAME=9.124.159.74;PORT=50000;PROTOCOL=TCPIP;UID=db2admin;PWD=db2admin;','',''
stmt = IBM_DB.exec conn, str
IBM_DB.close conn
print "----------------------\n\n"
puts "Results in Ruby script"
puts "----------------------"
puts "str.length is :#{str.length}"
puts "str.bytesize: #{str.bytesize}"
puts "**Forcing encoding**"
str1 = str.force_encoding("UTF-16LE")
puts "str.length is :#{str1.length}"
puts "str.bytesize: #{str1.bytesize}"
In the script above, IBM_DB is the C extension module. The database call
itself has nothing to do with the Unicode API usage; I have just reused
the module to try out the Unicode support.
The snippet in the C extension that uses the Unicode functions is as
follows:
======================================
VALUE ibm_db_exec(int argc, VALUE *argv, VALUE self){
    VALUE connection, stmt, options, stmt_ucs2;

    rb_scan_args(argc, argv, "21", &connection, &stmt, &options);
    if (!NIL_P(stmt)) {
        rb_encoding *enc_received;
        rb_encoding *ucs2_enc = rb_enc_find("UTF-16LE");
        rb_encoding *ucs4_enc = rb_enc_find("UTF-32LE");

        enc_received = rb_enc_from_index(ENCODING_GET(stmt));
        printf("\nString in received format: %s\n",RSTRING_PTR(stmt));
        printf("\nrb_str_length is: %d\n",rb_str_length(stmt));
        printf("\nRSTRING_LEN is: %d\n",RSTRING_LEN(stmt));
        printf("\nEncoding format received: %s\n",enc_received->name);

        stmt_ucs2 = rb_str_export_to_enc(stmt,ucs2_enc);
        printf("\nString in utf16 format: %s\n",RSTRING_PTR(stmt_ucs2));
        printf("\nrb_str_length is: %d\n",rb_str_length(stmt_ucs2));
        printf("\nRSTRING_LEN is: %d\n",RSTRING_LEN(stmt_ucs2));
        printf("\nEncoding after conversion: %s\n",ucs2_enc->name);
    }
    return Qnil;
}
======================================
Running the above Ruby script produces the following output:
======================================
Results in C extension
----------------------
String in received format: insert into woods (name) values
('GÃHRINGæ')
rb_str_length is: 89
RSTRING_LEN is: 47
Encoding format received: UTF-8
String in utf16 format: i   #Expected, because printf stops at the first NUL byte in UTF-16LE
rb_str_length is: 89
RSTRING_LEN is: 88
Encoding after conversion: UTF-16LE
----------------------
Results in Ruby script
----------------------
str.length is :44
str.bytesize: 47
**Forcing encoding**
str.length is :24
str.bytesize: 47
======================================
I am not sure why there is a difference between the length of the
original string [44] (UTF-8) and the length after changing the encoding
[24] (UTF-16LE). The same thing happens with the output from the C
extension: the length and the byte size come out almost the same
(differing by one), and the length differs between the two encodings.
89 is not an integer but a VALUE. A VALUE of 89 corresponds to the integer 44.
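For example (a minimal sketch, assuming MRI/CRuby; the module and
function names below are hypothetical): on a typical build a Fixnum
VALUE is the integer shifted left one bit with the low tag bit set, so a
length of 44 is carried around as the tagged VALUE 89, and NUM2INT
recovers the plain C integer:
======================================
#include <stdio.h>
#include <ruby.h>

/* Print both the raw VALUE returned by rb_str_length and the C integer
   obtained with NUM2INT. For a character length of 44, the raw VALUE
   prints as 89. */
static VALUE demo_value_vs_int(VALUE self, VALUE str)
{
    VALUE len_value = rb_str_length(str);             /* tagged Fixnum, e.g. 89 */
    printf("raw VALUE  : %ld\n", (long)len_value);
    printf("NUM2INT    : %d\n",  NUM2INT(len_value)); /* e.g. 44 */
    printf("RSTRING_LEN: %ld\n", (long)RSTRING_LEN(str)); /* byte count */
    return Qnil;
}

void Init_value_demo(void)
{
    VALUE mod = rb_define_module("ValueDemo");        /* hypothetical name */
    rb_define_module_function(mod, "value_vs_int", demo_value_vs_int, 1);
}
======================================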
Could you tell me what it is that I am doing wrong?
You should use String#encode instead of String#force_encoding, like this:
puts "**Converting encoding**"
str1 = str.encode("UTF-16LE")
puts "str.length is :#{str1.length}"
puts "str.bytesize: #{str1.bytesize}"
Along with this, in the C extension is there any API I can call to check
whether a given string is in a particular encoding, or should I use
rb_enc_from_index, read the name member of the struct it returns, and
determine the encoding myself in the extension I write?
Using rb_enc_get is simpler than rb_enc_from_index, like this:
enc_received = rb_enc_get(stmt);
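If you only need to test for one specific encoding, comparing the result
of rb_enc_get is enough, since each encoding is registered exactly once.
A minimal sketch under that assumption:
======================================
#include <ruby.h>
#include <ruby/encoding.h>

/* Returns non-zero if str is tagged with the UTF-16LE encoding. */
static int str_is_utf16le(VALUE str)
{
    return rb_enc_get(str) == rb_enc_find("UTF-16LE");
    /* equivalently:
       rb_enc_get_index(str) == rb_enc_to_index(rb_enc_find("UTF-16LE")) */
}
======================================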
Also, rb_str_length returns not an integer but a VALUE, so you should
use NUM2INT, like this:
printf("\nrb_str_length is: %d\n",NUM2INT(rb_str_length(stmt)));
Regards,
Park Heesob
2010/2/16 Praveen <praveendevarao@gmail.com>: