Bug in URI.parse?

I have looked through the archives for the mailing list, but didn't see
this issue addressed. So here goes.

My local machine is named 3beers-wrk. I achieved this by putting an
entry in my local hosts file (since I'm on Windows, this would be in
\windows\system32\drivers\etc\hosts). However, I have also seen the
issue below reproduce with a machine whose name is similar, say
12345-server.

Below is a capture of my session with the shell, then with irb:

H:\>ping 3beers-wrk

Pinging 3beers-wrk [127.0.0.1] with 32 bytes of data:

Reply from 127.0.0.1: bytes=32 time<1ms TTL=128
Reply from 127.0.0.1: bytes=32 time<1ms TTL=128
...

H:\>irb
irb(main):001:0> require 'uri'
=> true
irb(main):002:0> URI.parse("http://3beers-wrk.tsi.lan")
=> #<URI::HTTP:0x1612dc0 URL:http://3beers-wrk.tsi.lan>
irb(main):003:0> URI.parse("http://3beers-wrk")
URI::InvalidURIError: the scheme http does not accept registry part: 3beers-wrk (or bad hostname?)
        from c:/p4/workgroup-1.0/workgroup-support/ruby/lib/ruby/1.8/uri/generic.rb:194:in `initialize'
        from c:/p4/workgroup-1.0/workgroup-support/ruby/lib/ruby/1.8/uri/http.rb:46:in `initialize'
        from c:/p4/workgroup-1.0/workgroup-support/ruby/lib/ruby/1.8/uri/common.rb:484:in `new'
        from c:/p4/workgroup-1.0/workgroup-support/ruby/lib/ruby/1.8/uri/common.rb:484:in `parse'
        from (irb):3
        from c:/p4/workgroup-1.0/workgroup-support/ruby/lib/ruby/1.8/uri/http.rb:57

URI.parse() is not properly parsing the non-qualified name 3beers-wrk
(though it does properly parse a fully-qualified hostname), which seems
to be in line with the grammar set forth in RFC2396. However, the RFC
makes the following comment: "In practice, however, the host component
may be a local domain literal" (section 3.2). This suggests that the
above URI is entirely valid. Further, this URI is acceptable to Web
browsers.

Is this a bug? How can I handle this seemingly valid URI?

Thanks!

Andrew

Good question! I've yet to figure out a good way to handle InvalidURIError exceptions myself.
That makes URI.parse pretty useless unless you have a way to handle the errors the URI class raises.
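One common way to keep a bad URI from ending the program is to rescue the exception around the call. The following is only a minimal sketch: parse_or_nil is a hypothetical helper name, not part of the uri library, and which strings get rejected depends on the Ruby version (the 1.8 series shown in this thread rejects "http://3beers-wrk"; other versions may not).

```ruby
require 'uri'

# Hypothetical helper (not part of the uri library): returns the parsed URI,
# or nil when URI.parse rejects the string.
def parse_or_nil(str)
  URI.parse(str)
rescue URI::InvalidURIError => e
  # Report the problem on stderr instead of letting the exception propagate.
  warn "could not parse #{str.inspect}: #{e.message}"
  nil
end

uri = parse_or_nil("http://3beers-wrk.tsi.lan")
puts uri.host if uri
```

Returning nil keeps the caller simple; code that needs the reason for the failure could return the exception message instead.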

It looks like URI.parse doesn't like the leading number:

irb(main):001:0> require 'uri'
irb(main):003:0> URI.parse("http://xshare")
=> #<URI::HTTP:0x16fd906 URL:http://xshare>
irb(main):004:0> URI.parse("http://xshare-foo")
=> #<URI::HTTP:0x16fc498 URL:http://xshare-foo>
irb(main):006:0> URI.parse("http://3qshare")
URI::InvalidURIError: the scheme http does not accept registry part:
3qshare (or bad hostname?)
        from C:/ruby/lib/ruby/1.8/uri/generic.rb:195:in `initialize'
        from C:/ruby/lib/ruby/1.8/uri/http.rb:78:in `initialize'
        from C:/ruby/lib/ruby/1.8/uri/common.rb:488:in `new'
        from C:/ruby/lib/ruby/1.8/uri/common.rb:488:in `parse'
        from (irb):6

I couldn't tell you what the proper behavior is.

Regards,

Dan

That is true, and due to the following regular expressions from
uri/common.rb:

# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
# toplabel = alpha | alpha *( alphanum | "-" ) alphanum
TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
# hostname = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"

So a valid hostname will consist of optional DOMLABELs in front of a
TOPLABEL. The TOPLABEL must start with a letter and end in a letter or
digit, with only letters, digits and hyphens in between the two.

That is consistent with RFC 1035 (DOMAIN NAMES - IMPLEMENTATION AND
SPECIFICATION) [http://www.ietf.org/rfc/rfc1035.txt]:
The labels must follow the rules for ARPANET host names. They must
start with a letter, end with a letter or digit, and have as interior
characters only letters, digits, and hyphen. There are also some
restrictions on the length. Labels must be 63 characters or less.
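The effect of these patterns can be checked in isolation. This is a sketch under stated assumptions: the three patterns are copied from the excerpt above, but ALPHA and ALNUM are not shown there, so the definitions below are guesses, and HostnameSketch and hostname? are names made up for the illustration.

```ruby
# Patterns copied from the uri/common.rb excerpt above. ALPHA and ALNUM are
# assumptions: the excerpt does not show how they are defined.
module HostnameSketch
  ALPHA    = "a-zA-Z"
  ALNUM    = "#{ALPHA}\\d"
  DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"

  # Anchor the pattern so the whole string has to be a hostname.
  HOSTNAME_RE = Regexp.new("\\A#{HOSTNAME}\\z")

  def self.hostname?(name)
    !!(HOSTNAME_RE =~ name)
  end
end

%w[xshare 3beers-wrk 3beers-wrk.tsi.lan 401k.com www.example.4bad].each do |name|
  puts "#{name}: #{HostnameSketch.hostname?(name)}"
end
```

Under these assumptions only the names whose last label starts with a letter pass, so 3beers-wrk on its own fails while 3beers-wrk.tsi.lan is accepted.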

The error thrown by URI.parse is a little odd in this context, but explained
as follows:

In the URI.parse chain, the URI is checked against a longer regular
expression that matches not just the hostname but also the other URI parts
(such as the userinfo, the scheme, etc.). The hostname part doesn't match
here because it's dealing with an invalid hostname. The URI registry part
_does_ match your invalid hostname, so the string is passed on as the
registry in the array of matched URI parts.
This array is then checked in Generic.new. That constructor finds the string
passed for the registry, but the class is hard-coded not to use registries:

USE_REGISTRY = false

#
# DOC: FIXME!
#
def self.use_registry
  self::USE_REGISTRY
end

And in the constructor:

if @registry && !self.class.use_registry
  raise InvalidURIError,
    "the scheme #{@scheme} does not accept registry part: #{@registry} (or bad hostname?)"
end
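
The fall-through from "host" to "registry" described above can be imitated with a toy model. This is not the library's code: RegistrySketch, REG_NAME (a much simplified stand-in for the registry rule), AUTHORITY_RE, Rejected and host_of are all invented for the illustration, and ALPHA/ALNUM are assumed definitions.

```ruby
# Toy model only: the real URI regexp and constructor are far more involved.
module RegistrySketch
  ALPHA    = "a-zA-Z"       # assumed definition
  ALNUM    = "#{ALPHA}\\d"  # assumed definition
  DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"
  REG_NAME = "[-.#{ALNUM}]+" # simplified stand-in for the registry rule
  USE_REGISTRY = false

  # Group 1 captures a valid host; group 2 captures anything that only fits
  # the looser registry rule.
  AUTHORITY_RE = Regexp.new("\\Ahttp://(?:(#{HOSTNAME})|(#{REG_NAME}))\\z")

  class Rejected < StandardError; end

  def self.host_of(uri)
    m = AUTHORITY_RE.match(uri)
    raise Rejected, "no match at all: #{uri}" unless m
    host, registry = m[1], m[2]
    if registry && !USE_REGISTRY
      raise Rejected,
            "the scheme http does not accept registry part: #{registry} (or bad hostname?)"
    end
    host
  end
end

puts RegistrySketch.host_of("http://xshare")
begin
  RegistrySketch.host_of("http://3qshare")
rescue RegistrySketch::Rejected => e
  puts e.message
end
```

In this model a name that fails the hostname branch is still captured by the second group, which is why the message talks about a registry even though the real problem is the hostname.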

To sum up: a hostname of 3beers-wrk is invalid as an ARPANET host according
to the RFC, so the correct solution would be to rename the host.

Hope that helps,

Felix

Excuse me: 3beers-wrk is invalid as a top-level label.

Felix

While I believe that Felix's analysis is valid, the problem is that
there are valid, real domains that start with numbers, and URI should
parse those, and in fact, it generally does.

irb(main):002:0> require 'uri'
=> true
irb(main):003:0> URI.parse('http://slashdot.org')
=> #<URI::HTTP:0x2fee3c URL:http://slashdot.org>
irb(main):004:0> URI.parse('http://401k.com')
=> #<URI::HTTP:0x2fca24 URL:http://401k.com>
irb(main):006:0> URI.parse('http://www.3com.com')
=> #<URI::HTTP:0x2f7b64 URL:http://www.3com.com>
irb(main):007:0> URI.parse('https://401k.fidelity.com')
=> #<URI::HTTPS:0x2f5364 URL:https://401k.fidelity.com>

All of these are real domains for real websites, and thus, the
suggestion of "rename the host" would not work very well.

The problem is probably better illustrated by this example:

irb(main):005:0> URI.parse('http://www.example.4bad')
URI::InvalidURIError: the scheme http does not accept registry part: www.example.4bad (or bad hostname?)
        from /usr/local/lib/ruby/1.8/uri/generic.rb:195:in `initialize'
        from /usr/local/lib/ruby/1.8/uri/http.rb:78:in `initialize'
        from /usr/local/lib/ruby/1.8/uri/common.rb:488:in `new'
        from /usr/local/lib/ruby/1.8/uri/common.rb:488:in `parse'
        from (irb):5

Here, the top-level domain starts with a digit, and _that_ is not
allowed. And we will most likely never see such a beast out in the
world. So the work-around for Dan's original problem would be to
specify the domain name with the hostname: 3qshare.<your-domain>

But, I would contend that this _is_ a bug in URI. My suggestion would
be that the regex for HOSTNAME be:
  HOSTNAME = "#{DOMLABEL}(?:(?:\\.#{DOMLABEL})*\\.#{TOPLABEL}\\.?)"
(I'll admit I'm not that familiar with this regex notation, so I'm
winging it; apologies for any mistakes.) The point is that the
hostname may not be specified with a domain, and if so, must still be
parsed. If the hostname is either a fully qualified hostname or just
a domain name, then the format of a top-level domain must be checked
and enforced, with optional (sub)domains in between.

Of course, I'm working from what Felix gave above; I haven't gone
through uri/common.rb to any significant extent, so there may be other
things that this suggestion would cause to break.
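
The idea can be tried outside the library. This sketch makes two assumptions that are not in the post above: it uses the TOPLABEL constant from Felix's excerpt for the top-level part, and it adds a trailing "?" so that the domain part becomes optional, which seems to be the intent (as posted, the group is mandatory and a bare hostname would still fail). CoeySketch and hostname? are made-up names, and ALPHA/ALNUM are assumed definitions.

```ruby
module CoeySketch
  ALPHA    = "a-zA-Z"       # assumed definition
  ALNUM    = "#{ALPHA}\\d"  # assumed definition
  DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"

  # Suggested pattern, with a trailing "?" added (an assumption about the
  # intent) so that a single unqualified label is accepted too.
  HOSTNAME = "#{DOMLABEL}(?:(?:\\.#{DOMLABEL})*\\.#{TOPLABEL}\\.?)?"
  HOSTNAME_RE = Regexp.new("\\A(?:#{HOSTNAME})\\z")

  def self.hostname?(name)
    !!(HOSTNAME_RE =~ name)
  end
end

%w[3qshare 3qshare.example.com 401k.com www.example.4bad].each do |name|
  puts "#{name}: #{CoeySketch.hostname?(name)}"
end
```

With the optional group, a single label may start with a digit, while a dotted name still has to end in a label that starts with a letter.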

Coey


I wouldn't call it a bug exactly; it does what it is written to do.
Instead, let's just say that URI.parse isn't very robust.
It doesn't handle lots of real-world situations in the ways you would expect.
You would expect some sort of message saying the TLD (top-level domain) is missing or bad, but you would not expect this to end your program abruptly.

A good URI parser will also accept IP addresses, since those are also valid, at least in the sense that they are real, do exist, and are likely to be used or entered by users.
Another problem is the way it handles URLs missing the www, http://, or https://.
Strictly speaking these should be required, but that clearly is not the reality of URLs in the world or of how humans use them. People have become accustomed to using what are officially partial or bad URLs.

Most web browsers will accept a simple string and attempt to find it, even if it means adding a TLD.
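
For input typed by people, one possible approach is to add a default scheme before parsing. This is only a sketch: normalize_url is a hypothetical helper, "http://" as the default is an arbitrary choice, and no attempt is made to guess a missing TLD the way a browser might.

```ruby
require 'uri'

# Hypothetical helper: prepend "http://" when the input has no scheme, then
# parse. Returns nil if URI.parse still rejects the result.
def normalize_url(input)
  text = input.strip
  text = "http://#{text}" unless text =~ %r{\A[a-z][a-z0-9+.-]*://}i
  URI.parse(text)
rescue URI::InvalidURIError
  nil
end

["example.com/path", "https://example.com", "192.168.0.1", "not a url"].each do |input|
  uri = normalize_url(input)
  puts "#{input.inspect} -> #{uri ? uri.to_s : 'rejected'}"
end
```

Checking for an existing scheme first avoids producing strings like "http://https://example.com".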

ARPANET is pretty pointless now.

I've begun my own script to check whether a URL is correct, but only for the human-readable variety.
One of the biggest problems is the transitory nature of URLs. They can change or disappear without notice.
Another problem is the path after the TLD. The path can be nearly anything, and can only be identified as whatever follows the first single / after the apparent TLD.

If it is a bug, change TOPLABEL in common.rb to this:

TOPLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"

Thanks to my friendly MySQL admin.

Stephen Becker IV
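
The trade-off of this change can be seen by testing the relaxed pattern on its own. A sketch only: RelaxedSketch and hostname? are invented names, ALPHA/ALNUM are assumed definitions, and the other patterns are taken from the excerpt quoted earlier in the thread.

```ruby
module RelaxedSketch
  ALPHA    = "a-zA-Z"       # assumed definition
  ALNUM    = "#{ALPHA}\\d"  # assumed definition
  DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  # Relaxed TOPLABEL from the post above: the first character may be a digit.
  TOPLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"
  HOSTNAME_RE = Regexp.new("\\A#{HOSTNAME}\\z")

  def self.hostname?(name)
    !!(HOSTNAME_RE =~ name)
  end
end

%w[3qshare www.example.4bad 1.2.3.4 -bad].each do |name|
  puts "#{name}: #{RelaxedSketch.hostname?(name)}"
end
```

The relaxed pattern accepts 3qshare, but it no longer distinguishes a top-level label from any other label, so www.example.4bad and all-numeric dotted names pass the hostname rule as well.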

Ok lots of good responses, thanks! A few comments:

Felix: while URI.parse() is behaving according to the two cited RFCs, I
think it is missing an important use case. In "http://3beers-wrk",
"3beers-wrk" isn't a domain name, is it? It is an unqualified host name
(I assume we'd pick the host name up from context). Now, the RFC also
suggests that a host name must follow these rules (starting with a letter,
etc.), and furthermore, that all components of a domain name must follow
this convention, which suggests that the regexp in common.rb is also
incorrect. :)

Also, the solution of "rename the host" is a non-solution when dealing
with customers, who are using an otherwise perfectly acceptable hostname
(I haven't found a tool yet that will balk at a hostname beginning with
a number).

Now, I'm not sure if the RFCs have been replaced by newer versions -
that would take some digging.

So, John, I'd say that this is a bug in URI.parse, since it follows
neither the published RFCs nor the practical implementation of them
today (as Coey points out). And if it follows neither, it's really not
a very good general purpose function in the Ruby library and so should
be fixed.

Andrew

> From: Andrew Beers [mailto:beers@tableausoftware.com]
> Sent: Wednesday, August 29, 2007 1:49 PM
> To: ruby-talk ML
> Subject: Re: Bug in URI.parse?
>
> Ok lots of good responses, thanks! A few comments:
>
> Felix: while URI.parse() is behaving according to the two cited RFCs, I
> think it is missing an important use case. In "http://3beers-wrk",
> "3beers-wrk" isn't a domain name, is it? It is an unqualified host name

That's fair - it does mention that single unqualified hostnames should work.
I don't have enough time right now at work to look at the RFC for those
(I'm not even sure there is one for them) and at what it defines as naming
standards, but that might be worth investigating.

> (I assume we'd pick the host name up from context). Now, the RFC also
> suggests that a host name must follow these rules (starting with a letter,
> etc.), and furthermore, that all components of a domain name must follow
> this convention, which suggests that the regexp in common.rb is also
> incorrect. :)

I think it does act correctly for qualified domain names, which is
important.

> Also, the solution of "rename the host" is a non-solution when dealing
> with customers, who are using an otherwise perfectly acceptable hostname
> (I haven't found a tool yet that will balk at a hostname beginning with
> a number).

That's true :o)

> Now, I'm not sure if the RFCs have been replaced by newer versions -
> that would take some digging.

I'm relatively certain they have not.

> So, John, I'd say that this is a bug in URI.parse, since it follows
> neither the published RFCs nor the practical implementation of them
> today (as Coey points out). And if it follows neither, it's really not
> a very good general purpose function in the Ruby library and so should
> be fixed.
>
> Andrew

Together with:

> From: RubyTalk@gmail.com [mailto:rubytalk@gmail.com]
> Sent: Wednesday, August 29, 2007 1:17 PM
> To: ruby-talk ML
> Subject: Re: Bug in URI.parse?
>
> If it is a bug change toplabel in common.rb to this
>
> TOPLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
>
> Thanks to my friendly MySQL admin.
>
> Stephen Becker IV

If it is a bug (maybe you should file it on the core mailing list and
enquire?), here's a better fix:

$ ruby -v
ruby 1.8.5 (2006-08-25) [i486-linux]

diff for uri/common.rb:
56c56
<       HOSTNAME = "(?:(?:#{DOMLABEL}\\.)+#{TOPLABEL}\\.?)|(?:#{DOMLABEL}?)"
---
>       HOSTNAME = "(?:#{DOMLABEL}\\.)*#{TOPLABEL}\\.?"

If it's a qualified domain name, enforce things as they were. If there is
only a single label, with no domain or top-level domain after it, accept it
as an unqualified hostname under the sub-domain naming standards (so it can
start with a number).

With that change:

irb(main):001:0> require 'uri'
=> true
irb(main):002:0> URI.parse('http://www.example.com')
=> #<URI::HTTP:0xfdbdf1726 URL:http://www.example.com>
irb(main):003:0> URI.parse('http://2.example.com')
=> #<URI::HTTP:0xfdbdf03ee URL:http://2.example.com>
irb(main):004:0> URI.parse('http://2test')
=> #<URI::HTTP:0xfdbdef250 URL:http://2test>
irb(main):005:0> URI.parse('http://2test.4bad')
URI::InvalidURIError: the scheme http does not accept registry part:
2test.4bad (or bad hostname?)
        from /usr/lib/ruby/1.8/uri/generic.rb:194:in `initialize'
        from /usr/lib/ruby/1.8/uri/http.rb:46:in `initialize'
        from /usr/lib/ruby/1.8/uri/common.rb:484:in `new'
        from /usr/lib/ruby/1.8/uri/common.rb:484:in `parse'
        from (irb):5
        from :0
irb(main):006:0>

Which should make everyone happy.
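
The patched pattern can also be exercised on its own, without editing the library. A sketch under assumptions: the pattern tested is the one containing the alternation from the diff above (which appears to be the patched side), PatchedSketch and hostname? are invented names, and ALPHA/ALNUM are assumed definitions. Because the pattern has a top-level "|", it is wrapped in a group before anchoring here.

```ruby
module PatchedSketch
  ALPHA    = "a-zA-Z"       # assumed definition
  ALNUM    = "#{ALPHA}\\d"  # assumed definition
  DOMLABEL = "(?:[#{ALNUM}](?:[-#{ALNUM}]*[#{ALNUM}])?)"
  TOPLABEL = "(?:[#{ALPHA}](?:[-#{ALNUM}]*[#{ALNUM}])?)"

  # Either a dotted name ending in a TOPLABEL, or a single (optional) DOMLABEL.
  HOSTNAME = "(?:(?:#{DOMLABEL}\\.)+#{TOPLABEL}\\.?)|(?:#{DOMLABEL}?)"

  # Wrap in a group so the anchors apply to both alternatives.
  HOSTNAME_RE = Regexp.new("\\A(?:#{HOSTNAME})\\z")

  def self.hostname?(name)
    !!(HOSTNAME_RE =~ name)
  end
end

%w[www.example.com 2.example.com 2test 2test.4bad].each do |name|
  puts "#{name}: #{PatchedSketch.hostname?(name)}"
end
```

In this isolated test the first three names pass and 2test.4bad fails, matching the irb session above. One side effect worth knowing: since the single label is optional, the empty string also satisfies the pattern.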

Unfortunately, you will have to edit your uri/common.rb file directly for
that. Since these are declared as constants, you _can_ override them by
redeclaring all the modules involved (you'll have to redeclare several patterns
and regular expressions), but you will trigger warnings that way.
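
Why several constants have to be redeclared can be shown on a toy module. This is a general Ruby illustration, not something tested against uri/common.rb: ToyPatterns, LABEL and HOST are made-up names. It uses Module#remove_const followed by const_set, which replaces a constant without the "already initialized constant" warning.

```ruby
# Toy stand-in for a module that builds one pattern string out of another.
module ToyPatterns
  LABEL = "[a-z]+"
  HOST  = "(?:#{LABEL}\\.)*#{LABEL}" # LABEL is interpolated once, at load time
end

# Replace LABEL; remove_const is private, hence the send.
ToyPatterns.send(:remove_const, :LABEL)
ToyPatterns.const_set(:LABEL, "[a-z0-9]+")

# HOST still embeds the old LABEL text, so it has to be rebuilt as well.
stale_host = ToyPatterns::HOST
ToyPatterns.send(:remove_const, :HOST)
ToyPatterns.const_set(:HOST, "(?:#{ToyPatterns::LABEL}\\.)*#{ToyPatterns::LABEL}")

puts Regexp.new("\\A#{stale_host}\\z").match?("2test")        # old pattern
puts Regexp.new("\\A#{ToyPatterns::HOST}\\z").match?("2test") # rebuilt pattern
```

Every constant derived from the one being changed needs the same treatment, which is what makes patching this way tedious compared with editing the file.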

Hope that helps,

Felix