Extract domain names out of URLs

Ron Rosenfeld

...
[reformatted]
re.Pattern = "\b((https?|ftp)://)?([\-A-Z0-9.]+)" & _
"(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?"
...

Why so verbose?

re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*"
ExtrURL = re.Replace(str, "$2")

It's a pattern (from a library) that captures the different URL parts into
different backreferences, so does more than what the OP requested.

But I did test against all the content mentioned in the thread.

Running a quick test, using
============================
Function Extr(str As String) As String
    Dim re As Object
    Set re = CreateObject("vbscript.regexp")
    re.IgnoreCase = True
    re.Global = True
    re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*"
    If re.Test(str) = True Then
        Extr = re.Replace(str, "$2")
    End If
End Function
================================


Your pattern doesn't seem to match:

http://excelusergroup.org
www.techdirt.com/articles/20080408/223932792.shtml

www.BagelsAndLox.com
BagelsAndLox.com
www.Ted.BagelsAndLox.com
Alice.BagelsAndLox.com

and won't extract the URL from
.AddItem GetCompanyName("www.BagelsAndLox.com")
.AddItem GetCompanyName("BagelsAndLox.com")
.AddItem GetCompanyName("www.Ted.BagelsAndLox.com")
.AddItem GetCompanyName("Alice.BagelsAndLox.com")


Granted, these kinds of examples were not all in the OP's specifications.
--ron
 
Harlan Grove

Ron Rosenfeld said:
But I did test against all the content mentioned in the thread. ....
Your pattern doesn't seem to match:

http://excelusergroup.org
www.techdirt.com/articles/20080408/223932792.shtml

The domain name in the url above should be

techdirt.com

and that's what my approach returns.
www.BagelsAndLox.com
BagelsAndLox.com
www.Ted.BagelsAndLox.com
Alice.BagelsAndLox.com
....

I don't consider these urls. They're missing a protocol specifier
(http, https, ftp, or mailto, news, gopher, etc.) All depends on how we
define urls, but there could be substrings in arbitrary text that
match \b[^. ]+\.[^. ]\b that aren't urls, e.g., section numbers like
2.34.5. How would one distinguish these from urls without making the
protocol specifiers mandatory?
 
Ron Rosenfeld

Harlan Grove said:
I don't consider these urls. They're missing a protocol specifier
(http, https, ftp, or mailto, news, gopher, etc.) All depends on how we
define urls, but there could be substrings in arbitrary text that
match \b[^. ]+\.[^. ]\b that aren't urls, e.g., section numbers like
2.34.5. How would one distinguish these from urls without making the
protocol specifiers mandatory?

A valid objection.

Of course, we could just go back to the OP's original request:
I'm struggling to find the correct algorithm to locate the starting
point (after the http:// or after the http://www.) and the ending
point (the first /) for my Mid function.

which can be easily handled with a worksheet function, and given his
description of "having a list of URL's" are probably not embedded in text.

--ron
 
Harlan Grove

Ron Rosenfeld said:
Of course, we could just go back to the OP's original request:


which can be easily handled with a worksheet function, and given his
description of "having a list of URL's" are probably not embedded in
text.

OP's often don't provide comprehensive examples, as you know. If the
urls always have protocol specifiers, and there's always 2 slashes
just after the protocol specifier and colon, then the domain name will
appear between :// and the subsequent /, but such urls *can* also
contain port number specifiers. For example,

http://www.foo.com:80/bar/

which your approach chokes on but mine parses as foo.com. Then there
are mailto: and protocol specifiers that aren't followed by two
slashes, but they're perhaps a digression.

The domain name will be the last 2 or 3 period-separated tokens
between the first colon, possibly followed by 2 slashes, and the first
subsequent colon or slash. The only characters you need to check for
as delimiters are colons and slashes. The domain name will contain 1
or 2 periods separating any other characters.
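
The description above can be tried outside Excel. The following is a minimal sketch in Python rather than the thread's VBA; the pattern is the one posted earlier in the thread, while the names `HARLAN` and `extract_domain` are made up for illustration. It assumes Python's `re` treats this pattern the same way the VBScript engine does.

```python
import re

# Pattern as posted in the thread. Only the flag and the replacement
# syntax ("$2" in VBScript, r"\2" here) are adapted to Python.
HARLAN = re.compile(
    r"[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*",
    re.IGNORECASE,
)


def extract_domain(url: str) -> str:
    """Return capture group 2; the input comes back unchanged if nothing matches."""
    return HARLAN.sub(r"\2", url, count=1)


if __name__ == "__main__":
    for url in (
        "http://www.foo.com:80/bar/",
        "http://www.stats.ox.ac.uk/pub/MASS4/",
        "www.BagelsAndLox.com",  # no colon, so the pattern cannot match
    ):
        print(url, "->", extract_domain(url))
```

Because the pattern requires a colon, strings without a protocol specifier are returned as they came in, which is the behaviour under discussion here.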
 
Howard Kaikow

The inclusion of the whatever:// is irrelevant to the issue of extracting
the domain.
Proper code will work either way.
And do not forget about country codes at the end of the string/URL.
 
Harlan Grove

Howard Kaikow said:
The inclusion of the whatever:// is irrelevant to the issue of extracting
the domain.
Proper code will work either way.
And do not forget about country codes at the end of the string/URL.

Really? What code would handle all the following?

http://linuxtoday.com/
http://www.firstmonday.dk/issues/issue3_3/raymond/
http://www.ace.net.nz/tech/TechFileFormat.html#s
http://www.ifi.unizh.ch/richter/people/pilz/links/index.html
http://www.insurance.ca.gov/docs/index.html
http://www.tdi.state.tx.us/wc/indexwc.html
http://xcell05.free.fr/pages/prog/api-c.htm
http://www.science.uva.nl/research/air/wiki/ShellStartupFiles
http://en-US.www.mozilla.com/en-US/firefox/help/
http://xxx.lanl.gov/
http://www.stats.ox.ac.uk/pub/MASS4/
http://caml.inria.fr/
http://www.er.uqam.ca/nobel/r10735/linux.html
http://gd.tuwien.ac.at/opsys/linux/RPM/
http://perso.wanadoo.es/antlarr/kalamaris.html

where the domain names should be

linuxtoday.com
firstmonday.dk
ace.net.nz
unizh.ch
ca.gov
state.tx.us
free.fr
uva.nl
mozilla.com
lanl.gov
ox.ac.uk
inria.fr
uqam.ca
tuwien.ac.at
wanadoo.es

It seems country top-level domains (.uk, .ca, .es, .dk, .fr, etc)
don't have to have US-like top-level domains
(.com, .net, .org, .gov, .edu, etc), but they can have optional
alternatives (.ac for .edu, .co for .com). But the presence of .??.us
where the ?? are 2-char abbreviations for US states or territories
really screws up simple rules.
 
Ron Rosenfeld

Harlan Grove said:
OP's often don't provide comprehensive examples, as you know. If the
urls always have protocol specifiers, and there's always 2 slashes
just after the protocol specifier and colon, then the domain name will
appear between :// and the subsequent /, but such urls *can* also
contain port number specifiers. For example,

http://www.foo.com:80/bar/

which your approach chokes on but mine parses as foo.com. Then there
are mailto: and protocol specifiers that aren't followed by two
slashes, but they're perhaps a digression.

The domain name will be the last 2 or 3 period-separated tokens
between the first colon, possibly followed by 2 slashes, and the first
subsequent colon or slash. The only characters you need to check for
as delimiters are colons and slashes. The domain name will contain 1
or 2 periods separating any other characters.

Actually, my VBA regex approach parses out port specifiers OK. But I think
there is confusion, for me and others, about what constitutes a "domain name".
(I'm not particularly knowledgeable here).

But I see definitions for URL; domain name; registered domain name; hostname;
as well as various types of Top Level Domains (generic, country specific);
second level domains; and various levels of subdomains.

And the specifications are changing, including allowing the use of non-ASCII
characters both in country-level TLDs and in legitimate domain names.

In any event, the OP said he had a list of URL's; wanted to extract the domain
name; and remove the www. if present.

So I have simplified my original regex and VBA routine to do that. I start
matching at the first ":", with an optional "//"; capture the (www.) into a
group which I will ignore, and return the subsequent string that includes
letters, digits, underscore, hyphens and dots.

re.Pattern = ":(//)?(www\.)?([-\w.]+)"

This returns the domains and all the subdomains, with the exception of the
"www."

There are some differences in what we return in some of the URL's you listed.
I'm not sure what the OP would want. For some of them, he might want the
leftmost subdomain, and for others not.

URL                                                 Ron                    Harlan
http://www.firstmonday.dk/issues/issue3_3/raymond/  firstmonday.dk         www.firstmonday.dk
http://www.insurance.ca.gov/docs/index.html         insurance.ca.gov       ca.gov
http://www.tdi.state.tx.us/wc/indexwc.html          tdi.state.tx.us        state.tx.us
http://en-US.www.mozilla.com/en-US/firefox/help/    en-US.www.mozilla.com  mozilla.com
http://xxx.lanl.gov/                                xxx.lanl.gov           lanl.gov
http://www.stats.ox.ac.uk/pub/MASS4/                stats.ox.ac.uk         ox.ac.uk
http://gd.tuwien.ac.at/opsys/linux/RPM/             gd.tuwien.ac.at        tuwien.ac.at


I can "correct" the entry with mozilla.com by making a small change in my
regex:

":(//)?([-\w.]*www\.)?([-\w.]+)"

and that works on the samples you provided. But I don't know if it would work
in all cases.
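
For anyone who wants to check the modified pattern outside Excel, here is a rough Python sketch of the same idea. The names `RON` and `extr_url` are hypothetical, and `re.ASCII` is an assumption added so that `\w` stays limited to ASCII letters, digits and underscore, which is how the VBScript engine is understood to behave.

```python
import re

# The modified pattern from the post above. Group 3 is the part that is
# returned (SubMatches(2) in the VBA version, since that index is 0-based).
RON = re.compile(r":(//)?([-\w.]*www\.)?([-\w.]+)", re.IGNORECASE | re.ASCII)


def extr_url(url: str) -> str:
    """Return the host with everything up to and including 'www.' dropped."""
    m = RON.search(url)
    # Mirror the VBA function, which returns an empty string when nothing matches.
    return m.group(3) if m else ""


if __name__ == "__main__":
    for url in (
        "http://en-US.www.mozilla.com/en-US/firefox/help/",
        "http://www.stats.ox.ac.uk/pub/MASS4/",
        "http://www.foo.com:80/bar/",
    ):
        print(url, "->", extr_url(url))
```

The character class stops at the first colon or slash after the host, which is why a port specifier such as `:80` is not included in the result.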

In addition, as you know, JavaScript does not match Unicode characters, so that
causes another set of problems :-(

Enough for now -- I've got some errands to do. Below is the VBA code I used:

Ron:
====================================
Function ExtrURL(str As String) As String
    Dim re As Object, mc As Object
    Set re = CreateObject("vbscript.regexp")
    re.IgnoreCase = True
    re.Global = False
    're.Pattern = ":(//)?(www\.)?([-\w.]+)"
    re.Pattern = ":(//)?([-\w.]*www\.)?([-\w.]+)"
    If re.test(str) = True Then
        Set mc = re.Execute(str)
        ExtrURL = mc(mc.Count - 1).submatches(2)
    End If
End Function

'Harlan--------------------------------------------------------

Function ExtrURLH(str As String) As String
    Dim re As Object, mc As Object
    Set re = CreateObject("vbscript.regexp")
    re.IgnoreCase = True
    re.Global = True
    re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*"
    ExtrURLH = re.Replace(str, "$2")
End Function
=======================================

Best,
--ron
 
Dave Mills

First you will need to answer how a human can tell what the domain part is for
these examples. The only way I could think of would be to query "Who Is" and
look at the registrant data.

The problem is that com, nz, uk etc. are all domains, and so are net.nz and
ace.net.nz. ace.net.nz is a sub domain of net.nz but then net
is a sub domain of nz. Since the domain itself can have an IP and be used to
point to a web server there is no way you can extract what you have defined as
the domain part from the string programmatically. The solution needs additional
data about what you consider is the boundary point in each string.
 
Harlan Grove

Dave Mills said:
First you will need to answer how a human can tell what the domain
part is for these examples. . . . ....
The problem is com, nz, uk etc. are all domains
so are net.nz and ace.net.nz. ace.net.nz is a sub domain of net.nz
but then net is a sub domain of nz. Since the domain itself can have
an IP and be used to point to a web server there is no way you can
extract what you have defined as the domain part from the string
programmatically. The solution needs additional data about what you
consider is the boundary point in each string.
....

There are some rules. Maybe not complete, but they'll cover most
situations. Domains should be parsed right to left by token, and
tokens are period-delimited strings.

If the rightmost token is 2 chars,
it's a country top-level domain, and presumably more tokens wanted.
If the next token going left is also 2 chars or a common generic
top-level domain name (net, org, etc.), then it's presumably also
a higher level domain. Otherwise, the 2nd token from the right
would complete the domain name.
If the rightmost token is us, the next is 2 chars and the next is
k12, we'd need the 4th token from the right too; otherwise, the 3rd
token from the right would complete the domain name. Any further
tokens going left would be hostnames within domain.
Else (the rightmost token is 3 or more chars) the 2nd token from the
right would complete the domain name.
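
One possible reading of the rules above, as a short Python sketch rather than VBA. It takes a bare host name (no protocol, port or path). The function name `domain_by_rules` and the contents of `GENERIC` are assumptions made for illustration, since the post only says "net, org, etc."; the nesting of the conditions is also an interpretation.

```python
# Assumed set of "common generic top-level domain names"; the post does
# not give a full list, so this one is illustrative only.
GENERIC = {"com", "net", "org", "gov", "edu"}


def domain_by_rules(host: str) -> str:
    """Apply the right-to-left token rules to a bare host name."""
    tokens = host.lower().split(".")
    keep = 2  # default: the 2nd token from the right completes the domain
    if len(tokens[-1]) == 2 and len(tokens) >= 2:
        # Rightmost token is treated as a country top-level domain.
        second = tokens[-2]
        if len(second) == 2 or second in GENERIC:
            # The next token is presumably a higher-level domain as well.
            keep = 3
            if (tokens[-1] == "us" and len(second) == 2
                    and len(tokens) >= 3 and tokens[-3] == "k12"):
                keep = 4  # the k12.<state>.us special case
    return ".".join(tokens[-keep:])


if __name__ == "__main__":
    for host in ("www.ace.net.nz", "www.tdi.state.tx.us",
                 "www.stats.ox.ac.uk", "www.insurance.ca.gov"):
        print(host, "->", domain_by_rules(host))
```

Under this reading the sketch reproduces the expected domain names for the sample URLs posted earlier in the thread, once the protocol and path have been stripped.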

These rules would fail if www.foobar.museum.ru were a valid url, in
which case the domain name should be foobar.museum.ru. Perhaps what's
needed is a complete list of accepted top-level domain names, then the
domain name would stop at the first token going right to left that
isn't an accepted top-level domain name.
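
The list-based idea in the last paragraph might look roughly like this in Python. The set `ACCEPTED_TOKENS` is a tiny made-up sample, not a real or complete list, and `domain_by_accepted_tokens` is a hypothetical name.

```python
# Illustrative sample only; a real list would have to be far longer
# and kept up to date.
ACCEPTED_TOKENS = {
    "com", "net", "org", "gov", "edu", "museum",
    "uk", "ru", "nz", "at", "ac", "co",
}


def domain_by_accepted_tokens(host: str, accepted=ACCEPTED_TOKENS) -> str:
    """Walk tokens right to left; stop at the first one not in `accepted`."""
    tokens = host.lower().split(".")
    i = len(tokens) - 1
    # Skip over accepted tokens, but never walk past the leftmost token.
    while i > 0 and tokens[i] in accepted:
        i -= 1
    return ".".join(tokens[i:])


if __name__ == "__main__":
    for host in ("www.foobar.museum.ru", "www.stats.ox.ac.uk", "linuxtoday.com"):
        print(host, "->", domain_by_accepted_tokens(host))
```

With this sample list the hypothetical www.foobar.museum.ru comes out as foobar.museum.ru. The result is only as good as the list, and a flat set of tokens cannot express that a token is a higher-level domain under one country code but an ordinary name under another.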

The joker in the set of urls I posted before was stats.ox.ac.uk.
 
Dave Mills

...

Harlan Grove said:
There are some rules. Maybe not complete, but they'll cover most
situations. Domains should be parsed right to left by token, and
tokens are period-delimited strings.

If the rightmost token is 2 chars,
it's a country top-level domain, and presumably more tokens wanted.
If the next token going left is also 2 chars or a common generic
top-level domain name (net, org, etc.), then it's presumably also
a higher level domain. Otherwise, the 2nd token from the right
would complete the domain name.
If the rightmost token is us, the next is 2 chars and the next is
k12, we'd need the 4th token from the right too; otherwise, the 3rd
token from the right would complete the domain name. Any further
tokens going left would be hostnames within domain.
Else (the rightmost token is 3 or more chars) the 2nd token from the
right would complete the domain name.

Most UK schools have domain names like
school.localauthority.sch.uk
This breaks your assumption.
The problem is that you have assumed that there is some sort of convention about
the depth of a DNS name, whereas once I own a domain I can create as many
sub-domains as I like, nested to any depth I like. Hence the need for some
additional info to determine the boundary.
 
