URL normalization

Hi,

I have a set of URLs that I want to normalize, but I can't find a regex
to do that. Here is a sample URL:
http://www.example.com/index.php?/topic/something/page__st__20__s__99590dc581fe8e7386051d6dfgdfg4eca4c/
When I open it in a web browser, I find that this URL is equivalent to the
following:
http://www.example.com/index.php?/topic/something/page__st__20
It is clear that the last part is a checksum, but how can I detect that
automatically?

Best regards

--
Posted via http://www.ruby-forum.com/.

Can you work with something like this?

url_re = /^http:\/\/.*((?<=__s__)[a-g0-9]{32,64})\/$/

I've assumed your checksum is between 32 and 64 characters long, which may or may not be correct.
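
A minimal sketch of how a pattern like that could be used to strip the suffix, assuming the trailing part always has the form `__s__<token>` followed by an optional slash. The names `SESSION_SUFFIX` and `strip_session_token` are made up for illustration, and the character class and length range are the same guesses as above.

```ruby
# Hypothetical helper: removes a trailing "__s__<token>" segment (plus an
# optional trailing slash) from a URL. The [a-g0-9] class and the 32-64
# length range are assumptions based on the single sample URL.
SESSION_SUFFIX = /__s__[a-g0-9]{32,64}\/?\z/

def strip_session_token(url)
  # sub returns the URL unchanged when the pattern does not match.
  url.sub(SESSION_SUFFIX, "")
end

url = "http://www.example.com/index.php?/topic/something/page__st__20__s__99590dc581fe8e7386051d6dfgdfg4eca4c/"
puts strip_session_token(url)
# prints http://www.example.com/index.php?/topic/something/page__st__20
```

URLs without a matching suffix pass through untouched, so it should be safe to run this over the whole set.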

Sam

I think if you pay close attention you'll see that your browser goes to the first URL and then gets redirected by the server to the second URL. The proper thing to do would be to actually follow the redirect, not munge the URL directly.
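
A rough sketch of following the redirect with Net::HTTP from the standard library, assuming the server really does answer the first URL with a 3xx response and a Location header. `HTTP_FETCH` and `resolve_redirects` are hypothetical names, and the limit of 5 hops is an arbitrary choice.

```ruby
require "net/http"
require "uri"

# Default fetcher: issues a GET and returns [status_code, location_header].
HTTP_FETCH = lambda do |url|
  response = Net::HTTP.get_response(URI(url))
  [response.code.to_i, response["location"]]
end

# Hypothetical helper: follows redirects starting at url and returns the
# final URL. The fetch argument is injectable so the logic can be
# exercised without network access.
def resolve_redirects(url, limit: 5, fetch: HTTP_FETCH)
  limit.times do
    status, location = fetch.call(url)
    return url unless (300..399).cover?(status) && location
    # Location may be relative, so resolve it against the current URL.
    url = URI.join(url, location).to_s
  end
  raise "too many redirects (limit: #{limit})"
end

# Usage (performs real HTTP requests):
#   puts resolve_redirects("http://www.example.com/index.php?/topic/something/...")
```

This costs one request per URL, but it relies on what the server actually does instead of guessing at the URL format.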

There are no generic semantics defined for the path and query parameters
of a URL. Semantics are only defined for the leading parts (protocol,
host, port, etc.). How do you expect any mechanism to know that the
last part is a checksum (of what, by the way)? I mean, completely
independent of the technical question of parsing: how would a piece of
software detect the checksum just by looking at the URL?

For specifically formatted URLs it's a different story (see Sam's suggestion).
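
To illustrate the point, here is a small sketch using Ruby's standard `uri` library on the sample URL. The `parts` hash is only there for display. The parser can split off the scheme, host, port, path and query, but the query comes back as one opaque string, so anything that treats its tail as a checksum has to rely on knowledge of this particular site.

```ruby
require "uri"

url = "http://www.example.com/index.php?/topic/something/page__st__20__s__99590dc581fe8e7386051d6dfgdfg4eca4c/"
uri = URI.parse(url)

# These components have generic, well-defined meaning.
parts = {
  scheme: uri.scheme,
  host:   uri.host,
  port:   uri.port,
  path:   uri.path,
  # Everything after "?" is returned as a single uninterpreted string.
  query:  uri.query
}

parts.each { |name, value| puts "#{name}: #{value}" }
```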

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

-----Original message-----

From: Ryan Davis [mailto:ryand-ruby@zenspider.com]
Sent: Wednesday, 23 November 2011 23:21
To: ruby-talk ML
Subject: Re: url normalization

I think if you pay close attention you'll see that your browser goes to the
first URL and then gets redirected by the server to the second URL. The
proper thing to do would be to actually follow the redirect, not munge the
URL directly.
