Regex problem

Roelof_Wobben · 2 June 2014 17:51

Hello,

I am trying to find a regex which finds only the domain name, without the .com or .nl.

I tried:

\w{3+}

But on http://www.tamarawobben.nl/testhier it finds http, tamarawobben and testhier.

How can I solve this?

Roelof

Andrew_Vit · 2 June 2014 19:18

You need to limit your expression more. Your example will find ANY 3 "word" characters, which includes digits and underscores as well as letters, even if there are MORE word characters around them.
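
To illustrate the point about \w, here is a minimal sketch. It assumes the intended quantifier was {3,} ("three or more"); the {3+} in the original post is not interval syntax. The helper name word_runs is made up for this example.

```ruby
# \w is shorthand for [A-Za-z0-9_]: letters, digits and the underscore.
# Without anchors, scan returns every run of 3 or more such characters.
def word_runs(text)
  text.scan(/\w{3,}/)
end

p word_runs("http://www.tamarawobben.nl/testhier")
# => ["http", "www", "tamarawobben", "testhier"]

p word_runs("abc_123")
# => ["abc_123"]
```

Nothing in the pattern says where in the URL the match has to sit, which is why every part of the URL that is long enough is reported.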

You could do this with a capture group (here, named "tld"):

     regex = %r{
       \. # one dot,
       (?<tld> # capture as "tld":
         [a-z]{2,} # 2+ alpha characters, so .nl also matches (note: not \w)
       )
       $ # at the end of a line/string
     }x
     "http://example.com".match(regex)[:tld]

Or with a positive look-behind:

     regex = %r{
       (?<=\.) # lookbehind for one dot,
       [a-z]{2,} # match 2+ alpha characters
       $ # at the end of a line/string
     }x
     "http://example.com".match(regex)[0]

Another approach is using the URI library:

     require "uri"
     URI.parse("http://example.com/").host.split(".").last

Andrew Vit

Tamara_Temple1 · 2 June 2014 21:17

OP seems to want the domain name without the TLD or subdomains. Andrew's
URI solution actually seems quite the best (really no sense in rewriting
well-written regexps), but instead of the last part of the host, you'll
want the penultimate part. Several ways you can get that. Here's one:

URI.parse("http://www.tamarawobben.nl/testhier").host.split(".")[-2] #=> "tamarawobben"
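
A small sketch of the penultimate-part idea, wrapped in a helper whose name (second_level_label) is made up here. It assumes the top-level domain is a single label such as .nl or .com; the second call shows what happens when that assumption does not hold.

```ruby
require "uri"

# Hypothetical helper: returns the second-to-last label of the host.
# Assumption: the TLD is one label (.nl, .com), not two (.co.uk).
def second_level_label(url)
  URI.parse(url).host.split(".")[-2]
end

p second_level_label("http://www.tamarawobben.nl/testhier") # => "tamarawobben"
p second_level_label("http://www.example.co.uk/")           # => "co"
```

For hosts under multi-label suffixes a plain index is not enough; some list of known suffixes would be needed, which is outside the scope of this thread.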

Roelof_Wobben · 2 June 2014 19:32

Andrew Vit wrote on 2-6-2014 21:18:

regex =

Thanks,

But when I try all three in an online Ruby interpreter, they do not give any answer.

Roelof
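
One possible reason for seeing no answer (an assumption, since the thread does not say which interpreter was used) is that some online interpreters only show what is explicitly printed. A minimal sketch that prints each result with p; the constant name TLD_PATTERN and the 2+ letter quantifier are choices made for this example:

```ruby
require "uri"

# Assumed pattern: one dot, then 2 or more lowercase letters at the end.
TLD_PATTERN = /\.(?<tld>[a-z]{2,})$/

tld_from_regex = "http://example.com".match(TLD_PATTERN)[:tld]
tld_from_uri   = URI.parse("http://example.com/").host.split(".").last

p tld_from_regex # prints "com"
p tld_from_uri   # prints "com"
```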

Robert_K1 · 3 June 2014 06:41

That depends on your input. Do you want to find those domain names in
a larger text? Are you trying to parse URIs? Do you have fully qualified
domain names from which you want to extract a portion?

Kind regards

robert

--
[guy, jim].each {|him| remember.him do |as, often| as.you_can - without end}
http://blog.rubybestpractices.com/

Doug1 · 2 June 2014 21:15

rubular.com is a great site for testing regexes. Here is one for the last
regex given by Andrew Vit:

Good luck

Roelof_Wobben · 3 June 2014 06:55

I'm a little bit further.
I have this: (?<=\.)(.*?)(?=\.)

It seems to work, except that I have to say that the .*? must not include the /.
And in the (?<=\.) I have to find a way to include the //.

When I do (<?=/[.|/]) or (<?=/[.|//]) I see a message that I have to escape the /.

Roelof
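
A hedged sketch of how the two open points could be handled in Ruby; it is one possible pattern, not a full solution to the exercise, and the constant name LABEL_BEFORE_DOT is made up. Inside a character class the | is a literal character, so [./] already means "a dot or a slash", and writing the regex as %r{...} avoids having to escape / (inside /.../ it would be written \/).

```ruby
# (?<=[./]) : preceded by a dot or a slash
# [^./]+    : one or more characters that are neither a dot nor a slash
# (?=\.)    : followed by a dot
LABEL_BEFORE_DOT = %r{(?<=[./])[^./]+(?=\.)}

p "http://www.tamarawobben.nl/testhier".scan(LABEL_BEFORE_DOT)
# => ["www", "tamarawobben"]

p "http://tamarawobben.nl/index.html".scan(LABEL_BEFORE_DOT)
# => ["tamarawobben", "index"]
```

The second result shows the limit of this approach: a path segment such as index.html also satisfies the pattern, so the host still has to be isolated first.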

Andrew_Vit · 3 June 2014 16:36

URI can extract from larger texts (URI.extract), parse URIs (URI.parse), and after that it's easy to split the domain parts from the fully-qualified hostnames. Really, I don't think there's any point in reinventing this using a Regexp... unless it's just a learning exercise.

Andrew Vit
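
A minimal sketch of that combination. The helper name hosts_in and the sample text are made up for this example. Note that URI.extract is marked obsolete in newer Ruby versions and may print a warning in verbose mode.

```ruby
require "uri"

# Hypothetical helper: hosts of all http(s) URLs found in free text.
def hosts_in(text)
  URI.extract(text, %w[http https]).map { |url| URI.parse(url).host }
end

text = "See http://www.tamarawobben.nl/testhier and http://example.com/ for details."

p hosts_in(text)
# => ["www.tamarawobben.nl", "example.com"]

# Splitting each host on "." gives the individual domain parts.
p hosts_in(text).map { |host| host.split(".") }
# => [["www", "tamarawobben", "nl"], ["example", "com"]]
```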

Roelof_Wobben · 3 June 2014 08:11

I tried this one: (?<=\[.|\//\)(.*?)(?=\.)
but I still get the error message that there are unescaped backslashes.

Roelof

Roelof_Wobben · 3 June 2014 16:50

This is a learning exercise from Codewars.

But I think I will use a regex for finding the full domain and then use split to find only the part before the .com and so on.

I tried, and I think it's very difficult to find a regex which can handle all these cases:

http://www.tamarawobben.nl/index.html

http://tamarawobben.nl/index.html

http://<subdomain>.tamarawobben.nl/index.html

In all three, tamarawobben.nl must be found.

Roelof

Robert_K1 · 3 June 2014 11:28

Roelof Wobben wrote on 3-6-2014 8:55:

Robert Klemme wrote on 3-6-2014 8:41:

...

I tried this one: (?<=\[.|\//\)(.*?)(?=\.)
but I still get the error message that there are unescaped backslashes.

Please stop fullquoting - especially if you are not referring in any
way to the quoted text. Thank you.

Regards

robert

Jesus_Gabriel_y_Gala · 3 June 2014 17:14

I tried to do it with a single regexp and couldn't come up with anything
useful, so I did it in two steps: first a regexp to extract the part
between the slashes (between http:// and the following /), then
split on "." applied to the result. This way is much simpler. I'm not going
to give you the solution, so that you can try this approach yourself,
as this is a learning exercise.

Let me know if you get stuck.

Jesus.
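
For readers who want to compare their own attempt afterwards, here is one possible sketch of the two-step approach described above. It is not the poster's solution: the names HOST_BETWEEN_SLASHES and registered_domain are made up, "blog" stands in for an arbitrary subdomain, and the code assumes a single-label TLD and a URL without port or user info.

```ruby
# Step 1: the host is whatever sits between "//" and the next "/".
HOST_BETWEEN_SLASHES = %r{//([^/]+)}

# Hypothetical helper; assumes a single-label TLD such as .nl or .com.
def registered_domain(url)
  host = url[HOST_BETWEEN_SLASHES, 1]
  return nil if host.nil?

  # Step 2: split on "." and keep only the last two labels.
  host.split(".").last(2).join(".")
end

urls = [
  "http://www.tamarawobben.nl/index.html",
  "http://tamarawobben.nl/index.html",
  "http://blog.tamarawobben.nl/index.html" # "blog" is a made-up subdomain
]

urls.each { |url| p registered_domain(url) }
# prints "tamarawobben.nl" three times
```

Taking split(".")[-2] on the host instead of the last two labels gives the name without the TLD.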
