CGI uses file size to distinguish between regular values and files

I’ve been having a ton of problems handling file uploads with CGI.rb
and after diving into the code, it looks like my woes can be narrowed
down to these lines in the read_multipart method:

    if 10240 < content_length
      require "tempfile"
      body = Tempfile.new("CGI")
    else
      begin
        require "stringio"
        body = StringIO.new
      rescue LoadError
        require "tempfile"
        body = Tempfile.new("CGI")
      end
    end

It sure looks like the sole differentiator on whether a request should
be treated with StringIOs or Tempfiles is the size of the request. If
this understanding is correct, I have no trouble understanding why I’ve
been pulling out my hair in frustration the last couple or many hours.

If I upload just one file that brings the content_length above 10240,
ALL fields in the request are treated as Tempfiles. If I’m uploading a
small file that keeps the content_length below 10240, all the fields –
including the file – are treated as StringIOs.
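
The effect can be reproduced in isolation. The following is only a sketch that mimics the size check quoted above; the method name buffer_for and the THRESHOLD constant are made up for illustration and are not part of cgi.rb.

```ruby
require "stringio"
require "tempfile"

# Hypothetical stand-in for the branch in read_multipart: the buffer class
# depends only on the total request size, not on what each field contains.
THRESHOLD = 10240

def buffer_for(content_length)
  if THRESHOLD < content_length
    Tempfile.new("CGI")
  else
    StringIO.new
  end
end

small = buffer_for(2_000)    # e.g. a request carrying one small file
large = buffer_for(50_000)   # e.g. a request carrying one large file

puts small.class
puts large.class
large.close!                 # remove the temporary file again
```

Under this assumption every field in the same request ends up in the same kind of object, which matches the behaviour described above.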

For starters, how would one move a small file, treated as StringIO, to
another location? With Tempfile, you just do Tempfile#local_path, and
move away. But it’s not so straightforward with StringIO.

It also seems very tedious that the application has to know whether the
file upload was big or small and act accordingly.

Maybe I’m just missing something. Like that StringIO and Tempfile are
supposed to act the same and they just don’t on my system for some
reason. Or this is a bug.

Since my last cries for help haven’t been overly successful, I gather
that most consider CGI.rb to be a black box as well. So perhaps I
should get in touch with the original author. Does anyone know if Wakou
Aoyama is still actively maintaining CGI.rb?

···


David Heinemeier Hansson.
http://www.loudthinking.com/ – Broadcasting Brain

Does anyone know if there is a DBI driver for the Firebird database, or
even a non-DBI interface to the database available someplace? I’m curious
about testing Firebird for some client applications in my development
environment.

Thanks,

Kirk Haines

Hi David,

I haven’t used Ruby’s file upload facility, but since no-one else has
replied I’ll chip in with my $0.02…

I suspect that the fact that tempfiles are used by this module is a
private implementation detail, ie something that users are not supposed
to be aware of, or depend upon.

Are you sure there isn’t a public API provided to access the
uploaded/downloaded data “correctly”, and that you aren’t sneaking
around the back way by trying to directly access the temp file created?

If your code depends upon “private” implementation details of any
library, then you are asking for trouble as the implementer is perfectly
within their rights to change private implementation details in any
point release; only “public” APIs can be relied upon to remain stable
across releases. Yes, you may be able to optimise your code by accessing
private APIs of a module, but then you can’t expect the module author to
care if your code breaks.

The kind of optimisation you describe is fairly common when dealing with
data flowing across networks; process it in-memory if it is small else
use the filesystem as a data cache. And this is a private
implementation detail that should not be exposed to the user of the
module.

Of course I haven’t looked at CGI.rb’s interface and may be off-track
here. And because Ruby’s documentation standards are generally poor, it
may not immediately be obvious which are the public APIs and which the
private ones if the author has been a little slack about using the
public/protected/private keywords…

Regards,

Simon

···


IMHO the driver for interbase should work with Firebird although I have not
tried it.

Regards,

Dalibor

···

On Mon, Nov 03, 2003 at 05:12:59AM +0900, Kirk Haines wrote:

Does anyone know if there is a DBI driver for the Firebird database, or
even a non-DBI interface to the database available someplace? I’m curious
about testing Firebird for some client applications in my development
environment.


Dalibor Sramek insula.cz | In the eyes of cats,
dalibor.sramek@insula.cz | all things belong to cats.

Why, you can always check if the parameter you’ve got is a Tempfile,
and, if it is not, assume that it is small enough to be handled in core.

Adapted from Samizdat source:

    # File.syscopy is provided by the ftools library
    if Tempfile === file then
      File.syscopy(file.path, upload)
    else # StringIO
      File.open(upload, 'w') {|f| f.write(file.read) }
    end

Did I miss something?

···

On Mon, Nov 03, 2003 at 02:29:08PM +0900, Simon Kitching wrote:

I suspect that the fact that tempfiles are used by this module is a
private implementation detail, ie something that users are not
supposed to be aware of, or depend upon.


Dmitry Borodaenko

Dalibor Sramek said:

···

On Mon, Nov 03, 2003 at 05:12:59AM +0900, Kirk Haines wrote:

Does anyone know if there is a DBI driver for the Firebird database, or
even a non-DBI interface to the database available someplace? I’m curious
about testing Firebird for some client applications in my development
environment.

IMHO the driver for interbase should work with Firebird although I have not
tried it.

Thanks. I should have thought of that. I’ll give it a whirl.

Kirk Haines

Why should I have to make that distinction in my code? I’ve got a Perl app
that I plan to port over to Ruby at some point, and it allows file uploads
(up to a couple megabytes in size). Indeed, the way that I deal with the
file uploads never touches the filesystem from the perspective of my
application (the file “string” goes directly into a MySQL database). If I
have to step into my own filehandling, then I may not port this app to Ruby
even though it’s a fine candidate for it otherwise.

The library should provide me the raw data. If I decide something is a file
and want to take some shortcuts with it, then the library can provide me
additional information (e.g., CGI.tempfile?(parameter_index)) and
functionality to access that implementation detail, but I should never
have to make the distinction in my application code. Ever.

-austin

···

On Mon, 3 Nov 2003 20:12:06 +0900, Dmitry Borodaenko wrote:

On Mon, Nov 03, 2003 at 02:29:08PM +0900, Simon Kitching wrote:

I suspect that the fact that tempfiles are used by this module is a
private implementation detail, ie something that users are not supposed
to be aware of, or depend upon.

Why, you can always check if the parameter you’ve got is a Tempfile,
and, if it is not, assume that it is small enough to be handled in core.

Adapted from Samizdat source:

    if Tempfile === file then
      File.syscopy(file.path, upload)
    else # StringIO
      File.open(upload, 'w') {|f| f.write(file.read) }
    end

Did I miss something?


austin ziegler * austin@halostatue.ca * Toronto, ON, Canada
software designer * pragmatic programmer * 2003.11.03
* 09.35.55

Why should I have to make that distinction in my code? I’ve got a
Perl app that I plan to port over to Ruby at some point, and it allows
file uploads (up to a couple megabytes in size). Indeed, the way that
I deal with the file uploads never touches the filesystem from the
perspective of my application (the file “string” goes directly into a
MySQL database). If I have to step into my own filehandling, then I
may not port this app to Ruby even though it’s a fine candidate for it
otherwise.

You’ve missed the question in the beginning of this thread:

For starters, how would one move a small file, treated as StringIO,
to another location? With Tempfile, you just do Tempfile#local_path,
and move away. But it’s not so straightforward with StringIO.

Obviously, asker wants his own file handling.

Since I do that in Samizdat, I know at least one reason for that: I
don’t want to add two extra layers of indirection (Ruby and SQL) between
Apache (or rsync or Gnutella) and a hundred-megabyte video file on disk.

The library should provide me the raw data. If I decide something is a file
and want to take some shortcuts with it, then the library can provide me
additional information (e.g., CGI.tempfile?(parameter_index)) and
functionality to access that implementation detail, but I should never
have to make the distinction in my application code. Ever.

YMMV. I’d rather learn the difference between StringIO and Tempfile
once, than learn special API for handling this difference in each new
library I use. And introspection API is in Ruby for a reason, too ;)

Not that I object to this behaviour of CGI module being more explicitly
documented at least in comments in cgi.rb.

···

On Mon, Nov 03, 2003 at 11:40:48PM +0900, Austin Ziegler wrote:


Dmitry Borodaenko

[snip]

You’ve missed the question in the beginning of this thread:

For starters, how would one move a small file, treated as
StringIO, to another location? With Tempfile, you just do
Tempfile#local_path, and move away. But it’s not so
straightforward with StringIO.

Obviously, asker wants his own file handling.

I didn’t miss that. What’s wrong, though, is that the CGI library is
returning different values for parameters based on the total size of
the data. I’m sorry, but that’s broken. Remember – one of the
complaints was based on the size of the upload:

If I upload just one file that brings the content_length above
10240, ALL fields in the request are treated as Tempfiles. If I’m
uploading a small file that keeps the content_length below 10240,
all the fields – including the file – are treated as StringIOs.

The problem is that this means that access to the data is much
harder to deal with if you’re dealing with a file upload. Like I
said; I don’t like the Perl implementation, but it’s consistent.
This appears to be highly inconsistent.

Since I do that in Samizdat, I know at least one reason for that:
I don’t want to add two extra layers of indirection (Ruby and SQL)
between Apache (or rsync or Gnutella) and a hundred-megabyte video
file on disk.

I think you’re misunderstanding. I want to be able to deal with
the data in either way. I’m looking at this from the perspective of
two different applications: Ruwiki and Bug Traction. Ruwiki may
support file uploads at some point. If this happens, there will be
multiple fields plus the file itself. The fields, I want to
access as pure data. The file, on the other hand, will be handled
either as a file (by the flatfiles backend) or as raw data (by a
putative database backend). Bug Traction, when I port it to Ruby, by
default has a database backend, so I only want to deal with the
data in a raw data mode.

Ideally, I should be able to do something like:

    cgi['file']     # => returns the data of the file
    cgi['file'].IO? # => +true+

I realize that this isn’t really the best way, but maybe an
alternative means might be:

    cgi['file']          # => returns the data of the file
    cgi['file'].tempfile # => returns – or creates to return – an
                         #    associated tempfile for the data

IMO, this would be much better than the current situation. I should
not have to make my program work differently based on whether I’ve
uploaded a large enough file or not – or even if I’m uploading a
file or not.

-austin

···

On Tue, 4 Nov 2003 04:37:19 +0900, Dmitry Borodaenko wrote:

On Mon, Nov 03, 2003 at 11:40:48PM +0900, Austin Ziegler wrote:

austin ziegler * austin@halostatue.ca * Toronto, ON, Canada
software designer * pragmatic programmer * 2003.11.03
* 15.59.28

Is it really returning “different values for parameters”?

A Tempfile is a File which is an IO.
And a StringIO is an IO.

So this module is returning an IO object, and all the methods of an IO
object are always available.
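
As a small illustration of that point: in practice the shared interface comes from duck typing rather than inheritance, since StringIO is not a subclass of IO and Tempfile delegates to a File rather than inheriting from it. The sketch below assumes nothing about cgi.rb; the helper name slurp is made up.

```ruby
require "stringio"
require "tempfile"

# Both objects answer the same reading methods, so code that sticks to
# rewind/read does not need to know which one it was handed.
def slurp(io)
  io.rewind
  io.read
end

string_io = StringIO.new("hello")

tempfile = Tempfile.new("CGI")
tempfile.write("hello")
tempfile.flush

from_string = slurp(string_io)
from_file   = slurp(tempfile)
tempfile.close!

puts from_string == from_file
```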

If you think of how this would be implemented in a strictly-typed
language like Java or C++, the return type would be “IO”. If the user
then manually downcasted the object to type Tempfile or File or StringIO
then they would be asking for trouble. Isn’t this “downcasting”
implicitly what you are doing when trying to treat the returned object
as a File or StringIO?

Yes, dealing with the returned value as an IO object might be
“suboptimal”, but as I said in my original posting, if you sneak around
accessing what are essentially private implementation details of a
library in order to improve your apps’ performance then you can expect
problems.

Note: I’m not familiar with cgi.rb; just basing this argument on what I
have seen of the API on this thread. Sorry if I’m off-course.

Regards,

Simon

···

On Tue, 2003-11-04 at 10:29, Austin Ziegler wrote:

On Tue, 4 Nov 2003 04:37:19 +0900, Dmitry Borodaenko wrote:

On Mon, Nov 03, 2003 at 11:40:48PM +0900, Austin Ziegler wrote:
[snip]
You’ve missed the question in the beginning of this thread:

For starters, how would one move a small file, treated as
StringIO, to another location? With Tempfile, you just do
Tempfile#local_path, and move away. But it’s not so
straightforward with StringIO.

Obviously, asker wants his own file handling.

I didn’t miss that. What’s wrong, though, is that the CGI library is
returning different values for parameters based on the total size of
the data.

As I’m understanding it – I haven’t yet put together tests for this
– there is a difference between the return for a multipart form
(e.g., a file upload) which does Tempfile/StringIO and a single part
form which does a String. If this is indeed the case, it’s
problematic, IMO. Ideally, one should not have to know that you’re
dealing with an IO object of any sort. It gets worse: WEBrick deals
with such things only as strings (well, FormData). So if I want to
make Ruwiki or Bug Traction handle multipart data, I have to deal
with three different modes of operation based on (1) whether or not
CGI is dealing with a multipart form, (2) a single part form, or (3)
WEBrick.

IMO, CGI does this wrong.

-austin

···

On Tue, 4 Nov 2003 06:51:43 +0900, Simon Kitching wrote:

On Tue, 2003-11-04 at 10:29, Austin Ziegler wrote:

I didn’t miss that. What’s wrong, though, is that the CGI library
is returning different values for parameters based on the total
size of the data.

Is it really returning “different values for parameters”?


austin ziegler * austin@halostatue.ca * Toronto, ON, Canada
software designer * pragmatic programmer * 2003.11.03
* 17.11.48

Hi,

I understand the problem, but have not yet thought of the best solution.
Any concrete ideas?

						matz.
···

In message “Re: CGI uses file size to distinguish between regular values and files” on 03/11/04, Austin Ziegler austin@halostatue.ca writes:

As I’m understanding it – I haven’t yet put together tests for this
– there is a difference between the return for a multipart form
(e.g., a file upload) which does Tempfile/StringIO and a single part
form which does a String. If this is indeed the case, it’s
problematic, IMO. Ideally, one should not have to know that you’re
dealing with an IO object of any sort. It gets worse: WEBrick deals
with such things only as strings (well, FormData). So if I want to
make Ruwiki or Bug Traction handle multipart data, I have to deal
with three different modes of operation based on (1) whether or not
CGI is dealing with a multipart form, (2) a single part form, or (3)
WEBrick.

IMO, CGI does this wrong.

-austin

austin ziegler * austin@halostatue.ca * Toronto, ON, Canada
software designer * pragmatic programmer * 2003.11.03
* 17.11.48

I think that it might be worth considering a variation on what
WEBrick does, which is WEBrick::HTTPUtils::FormData. Obviously,
that’s not an entirely useful solution, as it would break how people
are currently using CGI right now. But if CGI were to return a
FormData like object that has the ability to duck-type as an IO or a
String (I guess; maybe the methods added during the current
multipart form data handling), I think that this would probably be
ideal. Alternatively, still return something like a FormData but
have a #tempfile method that will return the associated Tempfile or
create one as necessary.
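
A rough sketch of what such an object could look like, purely as an assumption for discussion: the class name FieldValue and its methods are hypothetical and exist neither in cgi.rb nor in WEBrick.

```ruby
require "tempfile"

# Hypothetical value object: it behaves as a String and only creates a
# Tempfile when one is explicitly asked for.
class FieldValue < String
  attr_reader :original_filename

  def initialize(data, original_filename = nil)
    super(data)
    @original_filename = original_filename
    @tempfile = nil
  end

  # True when the field was submitted as a file upload.
  def file?
    !@original_filename.nil?
  end

  # Return the associated Tempfile, creating it on first use.
  def tempfile
    return @tempfile if @tempfile
    @tempfile = Tempfile.new("CGI")
    @tempfile.binmode
    @tempfile.write(self)
    @tempfile.rewind
    @tempfile
  end
end

value = FieldValue.new("raw bytes of the upload", "photo.jpg")
puts value.length          # usable as a plain String
puts value.tempfile.read   # or as a file when one is needed
value.tempfile.close!
```

Code that only wants the raw data never touches the filesystem, while code that wants a file asks for one explicitly.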

As I said, in most of my applications, I need to deal with the raw
data, not the tempfiles, but I want the ability to deal with the
tempfiles (or StringIO) as I need to (e.g., potential features in
Ruwiki).

-austin

···

On Tue, 4 Nov 2003 09:42:12 +0900, Yukihiro Matsumoto wrote:

I understand the problem, but have not yet thought of the best solution.
Any concrete ideas?


austin ziegler * austin@halostatue.ca * Toronto, ON, Canada
software designer * pragmatic programmer * 2003.11.03
* 23.22.39

I was thinking about doing some code for this, but I really don’t think I’ll
have time for about two weeks. If I get time before then, I’ll look at doing
something to possibly contribute that fixes the problem from my perspective
and provides a way to keep things “as they are” for those people who don’t
want to do it differently.

BTW, I would also love to see the request, response, and HTML output
portions of CGI separated, at least conceptually. The main thing that I
think I’d love to see separated is the HTML output stuff. I don’t think that
it really belongs in CGI at all.

-austin

···

On Tue, 4 Nov 2003 09:42:12 +0900, Yukihiro Matsumoto wrote:

I understand the problem, but have not yet thought of the best solution.
Any concrete ideas?


austin ziegler * austin@halostatue.ca * Toronto, ON, Canada
software designer * pragmatic programmer * 2003.11.03
* 23.37.19

In article 1067898096.626601.19785.nullmailer@picachu.netlab.jp,
matz@ruby-lang.org (Yukihiro Matsumoto) writes:

I understand the problem, but have not yet thought of the best solution.
Any concrete ideas?

I think

    File.open(upload, 'w') {|f| f.write(file.read) }

is good enough, except that it reads the whole file into memory at once.
That is a problem for big files.

So I propose IO#copy and StringIO#copy, to be able to write:

    File.open(upload, 'w') {|f| file.copy(f) }

The implementation is assumed as follows.

    class IO
      def copy(destination)
        while buf = self.read(8192)
          destination.write buf
        end
      end
    end

    class StringIO
      def copy(destination)
        destination.write self.read
      end
    end

Note that IO#copy may use sendfile(2) if destination is a socket.
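
For anyone who wants to try the chunked approach without reopening IO and StringIO, here is a minimal sketch as a plain method. The name copy_in_chunks is made up, and the method assumes only that the source responds to read and the destination to write.

```ruby
require "stringio"

# Copy src to dest in fixed-size chunks so that only one chunk is held in
# memory at a time. Returns the total number of bytes written.
def copy_in_chunks(src, dest, chunk_size = 8192)
  total = 0
  while (buf = src.read(chunk_size))
    total += dest.write(buf)
  end
  total
end

source = StringIO.new("x" * 20_000)
target = StringIO.new
copied = copy_in_chunks(source, target)
puts copied
```

The same method works unchanged when the source is a Tempfile and the destination is a File opened for writing.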

···


Tanaka Akira

Hi,

···

In message “Re: CGI uses file size to distinguish between regular values and files” on 03/11/04, Yukihiro Matsumoto matz@ruby-lang.org writes:

I understand the problem, but have not yet thought of the best solution.
Any concrete ideas?

How about moving to another CGI library, for example Narf?

						matz.

Hi,

···

In message “Re: CGI uses file size to distinguish between regular values and files” on 03/11/04, Tanaka Akira akr@m17n.org writes:

I understand the problem, but have not yet thought of the best solution.
Any concrete ideas?

I think

    File.open(upload, 'w') {|f| f.write(file.read) }

is good enough, except that it reads the whole file into memory at once.
That is a problem for big files.

So I propose IO#copy and StringIO#copy, to be able to write:

    File.open(upload, 'w') {|f| file.copy(f) }

How about

    require 'fileutils'
    File.open(upload, 'w') {|f| FileUtils.copy_stream(file, f) }

? It’s a bit longer, but it works now without any modifications.

						matz.

I was thinking about doing some code for this, but I really don’t think I’ll
have time for about two weeks. If I get time before then, I’ll look at doing
something to possibly contribute that fixes the problem from my perspective
and provides a way to keep things “as they are” for those people who don’t
want to do it differently.

I’m thinking this shouldn’t be too hard an issue to address, though.
Since CGI.rb knows when it’s handling a multipart request, I reckon
that form field access could easily be presented transparently, the
same way as for a normal request.

cgi['value'] would just do an implicit cgi['value'].read in case of a
multipart request. And perhaps we could also have code to ease the
extraction of files. Like cgi['value'].file? and
cgi['value'].save(file_name). Or something similar.
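
To make the idea concrete, here is a sketch of a wrapper with that shape. Everything in it is hypothetical: MultipartValue, file?, to_s and save are illustration names and none of them exist in cgi.rb.

```ruby
require "stringio"
require "tmpdir"

# Hypothetical wrapper around whatever IO-like object backs a form field.
class MultipartValue
  def initialize(io, filename = nil)
    @io = io
    @filename = filename
  end

  # True when the field was submitted with a filename.
  def file?
    !(@filename.nil? || @filename.empty?)
  end

  # The field content as a String, regardless of the backing object.
  def to_s
    @io.rewind
    @io.read
  end

  # Write the content to file_name and return the number of bytes copied.
  def save(file_name)
    @io.rewind
    File.open(file_name, "wb") { |f| IO.copy_stream(@io, f) }
  end
end

field  = MultipartValue.new(StringIO.new("David"))
upload = MultipartValue.new(StringIO.new("file body"), "notes.txt")

path = File.join(Dir.tmpdir, "multipart_value_demo.txt")
puts field.to_s
puts upload.save(path) if upload.file?
File.delete(path)
```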

BTW, I would also love to see the request, response, and HTML output
portions of CGI separated, at least conceptually. The main thing that I
think I’d love to see separated is the HTML output stuff. I don’t think that
it really belongs in CGI at all.

Hear, hear…

···


David Heinemeier Hansson.
http://www.loudthinking.com/ – Broadcasting Brain

I will look at narf to give my opinion on it. I would like to see something
like WEBrick’s FormData.

-austin

···

On Thu, 6 Nov 2003 11:10:19 +0900, Yukihiro Matsumoto wrote:

In message “Re: CGI uses file size to distinguish between regular values and files” on 03/11/04, Yukihiro Matsumoto matz@ruby-lang.org writes:

I understand the problem, but have not yet thought of the best solution.
Any concrete ideas?

How about moving to another CGI library, for example Narf?


austin ziegler * austin@halostatue.ca * Toronto, ON, Canada
software designer * pragmatic programmer * 2003.11.05
* 23.01.33

How about

    require 'fileutils'
    File.open(upload, 'w') {|f| FileUtils.copy_stream(file, f) }

? It’s a bit longer, but it works now without any modifications.

I think this will best serve my needs within the current constraints of
CGI.rb’s multipart handling.

Coming from a PHP background, I was initially thrown by the inability to
access the actual file – not really considering that I needed to turn
the stream into a file myself. And since CGI.rb uses tempfile itself
for larger file uploads, this made it even more confusing.

So yes, this is probably largely a documentation issue. It would be
great to have an example somewhere that uses the solution by matz to
demonstrate how to handle it.
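
Something along these lines might serve as that example. It is only a sketch: the helper name save_upload is made up, and the StringIO and Tempfile objects below stand in for what a multipart request would hand back.

```ruby
require "fileutils"
require "stringio"
require "tempfile"
require "tmpdir"

# Save an uploaded field to dest_path. The field may be a StringIO or a
# Tempfile; copy_stream only needs an object it can read from.
def save_upload(field, dest_path)
  field.rewind
  File.open(dest_path, "wb") { |f| FileUtils.copy_stream(field, f) }
  dest_path
end

small = StringIO.new("small upload")   # stands in for a small request
large = Tempfile.new("CGI")            # stands in for a large request
large.write("large upload")
large.flush

dir = Dir.mktmpdir
puts File.read(save_upload(small, File.join(dir, "small.txt")))
puts File.read(save_upload(large, File.join(dir, "large.txt")))

large.close!
FileUtils.remove_entry(dir)
```

The application code is the same for both cases, so it no longer needs to know whether the upload was big or small.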

That said, I strongly agree with Austin that this distinction is
conceptually broken. I have other infrastructure code that relies on
the regular CGI.rb interface for accessing form fields. It’s a great
pain to add kludges for accessing form fields that are part of a
multipart request. It offloads what I perceive as plumbing details to
the application. It just feels wrong and un-Ruby-like.

Again, this could be my bias coming from PHP where multipart requests
are exposed equal to normal requests. But it sounds like Austin
expected the same thing coming from Perl.

Ruby is all about consistent behaviour and the principle of least
surprise. This is inconsistent and highly surprising behaviour.

···


David Heinemeier Hansson.
http://www.loudthinking.com/ – Broadcasting Brain