Extract domain names out of URLs

M

MikeB

Hi, I have a list of URLs and I would like to find a way to extract
the domain name from the URL.

The URLs can be prefixed with www in some cases and I'd also like to
strip that off.

So, for the following URLs
http://www.wiseclerk.com/group-news/tag/libor
http://www.wiseclerk.com/group-news/2008/04/
http://www.wiseclerk.com/group-news/
http://www.tiscali.co.uk/search/results.php?section=&from=&query=setana
sports
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/
http://www.techdirt.com/
http://techdirt.com/index.php
http://techdirt.com/blog.php?tag=loans



I'd like to extract

wiseclerk.com
tiscali.co.uk
techdirt.com

I'm struggling to find the correct algorithm to locate the starting
point (after the http:// or after the http://www.) and the ending
point (the first /) for my Mid function.

Any help appreciated.

I'm thinking that I might have to do a find/search function for either
of the two starting strings and then have the "max" value (for the
longer of the two strings) but that's a lot of extra columns in my
spreadsheet.

Is there perhaps a better way to do this?
 
R

Rick Rothstein \(MVP - VB\)

I think this formula does what you want and handles all the various
possibilities...

=IF(LEFT(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1),4)="www.",MID(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1),5,256),LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1))

Rick
 
M

MikeB

Mike

Try something like

=IF(ISERR(FIND("www",A1,1)),MID(A1,8,FIND("/",A1,8)-8),MID(A1,12,FIND("/",A1,12)-12))

this works on the theory that there is a trailing / after the domain
somewhere, it will need amending if there is not, e.ghttp://excelusergroup.org

--
HTH
Nick Hodge
Microsoft MVP - Excel
Southampton, England
(e-mail address removed)
web:www.excelusergroup.org
web:www.nickhodge.co.uk

Hey Nick! Some years ago I had the pleasure of living in Romsey near
Southampton. I just saw your signature and that brought back fond
memories.
 
H

Howard Kaikow

Rick Rothstein (MVP - VB) said:
I think this formula does what you want and handles all the various
possibilities...
=IF(LEFT(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","
")&"/")-1),4)="www.",MID(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUT
E(A1,"http://","")&"/")-1),5,256),LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",
SUBSTITUTE(A1,"http://","")&"/")-1))

That does not work for URLs that start with, say, ftp.

The type of the URL is relevant, e.g., what about ftp.xyz.com and
www.xyz.com, both should result in same doughmain.

In this case, a worksheet function may be better..
 
R

Rick Rothstein \(MVP - VB\)

Well, I could modify my formula to account for the FTP protocol (and, by
changing the approach, maybe so others even), but there seems to be so many
protocols possible... even if you wrote a function to handle them, which
protocols would you select to handle?

Rick
 
H

Howard Kaikow

Rick Rothstein (MVP - VB) said:
Well, I could modify my formula to account for the FTP protocol (and, by
changing the approach, maybe so others even), but there seems to be so many
protocols possible... even if you wrote a function to handle them, which
protocols would you select to handle?

There is no simple answer.
It's up to the programmer.

A function may be easier because there are likely Windows APIs for parsing
URLs.
 
R

Rick Rothstein \(MVP - VB\)

Well, I could modify my formula to account for the FTP protocol (and, by
There is no simple answer.
It's up to the programmer.

A function may be easier because there are likely Windows APIs for parsing
URLs.

Fair enough. Here is a possible Function solution to this question which
makes use of the UrlGetPart API to get to the "company name". For those
following this thread, add a Module to the project (Insert/Module from the
VBA editor) and copy/paste the code after my signature into its code window.
Inside the GetCompanyName function, I have seeded the Protocols constant
with "www" and "ftp"... you can add any other protocols you think necessary
in this constant, just make sure each protocol is separated by a dot. You
would use this function just like any other worksheet function. So, for
example, if A1 contains the URL, then put this formula in the cell you want
to extract the company name to...

=GetCompanyName(A1)

It is that simple (assuming I didn't screw up my code that is<g>).

Rick

Private Declare Function UrlGetPart Lib "shlwapi" _
Alias "UrlGetPartA" _
(ByVal pszIn As String, _
ByVal pszOut As String, _
pcchOut As Long, _
ByVal dwPart As Long, _
ByVal Flags As Long) As Long

Private Const MAX_PATH As Long = 260
Private Const URL_PART_HOSTNAME As Long = 2

Public Function GetCompanyName(URL As String) As String
Dim dwPart As Long
Dim Flags As Long
Dim Part As String
Dim Size As Long
Dim Host As String
' Protocols is a **dot** delimited string
Const Protocols As String = "www.ftp"
If Len(URL) > 0 Then
If InStr(URL, "//") = 0 Then URL = "http://" & URL
Part = Space$(MAX_PATH)
Size = Len(Part)
If UrlGetPart(URL, Part, Size, URL_PART_HOSTNAME, 0&) = 0 Then
GetCompanyName = Left$(Part, Size)
End If
Host = Left(GetCompanyName, InStr(GetCompanyName, "."))
If InStr("." & Protocols & ".", "." & Host) Then
GetCompanyName = Mid(GetCompanyName, Len(Host) + 1)
End If
End If
End Function
 
R

Ron Rosenfeld

Hi, I have a list of URLs and I would like to find a way to extract
the domain name from the URL.

The URLs can be prefixed with www in some cases and I'd also like to
strip that off.

So, for the following URLs
http://www.wiseclerk.com/group-news/tag/libor
http://www.wiseclerk.com/group-news/2008/04/
http://www.wiseclerk.com/group-news/
http://www.tiscali.co.uk/search/results.php?section=&from=&query=setana
sports
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/
http://www.techdirt.com/
http://techdirt.com/index.php
http://techdirt.com/blog.php?tag=loans



I'd like to extract

wiseclerk.com
tiscali.co.uk
techdirt.com

I'm struggling to find the correct algorithm to locate the starting
point (after the http:// or after the http://www.) and the ending
point (the first /) for my Mid function.

Any help appreciated.

I'm thinking that I might have to do a find/search function for either
of the two starting strings and then have the "max" value (for the
longer of the two strings) but that's a lot of extra columns in my
spreadsheet.

Is there perhaps a better way to do this?


Since you indicated that these are URL's, I will assume they have a normal
structure, so the domain name will be the part between the first // and the
next /. Then we just remove the www. if it exists.


=SUBSTITUTE(MID(A1,FIND("/",A1)+2,FIND(CHAR(1),
SUBSTITUTE(A1,"/",CHAR(1),3))-FIND("/",A1)-2),"www.","")

--ron
 
R

Ron Rosenfeld

Hi, I have a list of URLs and I would like to find a way to extract
the domain name from the URL.

The URLs can be prefixed with www in some cases and I'd also like to
strip that off.

So, for the following URLs
http://www.wiseclerk.com/group-news/tag/libor
http://www.wiseclerk.com/group-news/2008/04/
http://www.wiseclerk.com/group-news/
http://www.tiscali.co.uk/search/results.php?section=&from=&query=setana
sports
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/
http://www.techdirt.com/
http://techdirt.com/index.php
http://techdirt.com/blog.php?tag=loans



I'd like to extract

wiseclerk.com
tiscali.co.uk
techdirt.com

I'm struggling to find the correct algorithm to locate the starting
point (after the http:// or after the http://www.) and the ending
point (the first /) for my Mid function.

Any help appreciated.

I'm thinking that I might have to do a find/search function for either
of the two starting strings and then have the "max" value (for the
longer of the two strings) but that's a lot of extra columns in my
spreadsheet.

Is there perhaps a better way to do this?

Missed something -- the case where there is no final "/".

So use this instead:

=SUBSTITUTE(MID(A1,FIND("/",A1)+2,IF((LEN(A1)-LEN(SUBSTITUTE(A1,"/","")))=2,
255,FIND(CHAR(1),SUBSTITUTE(A1,"/",CHAR(1),3))-FIND("/",A1)-2)),"www.","")

--ron
 
H

Howard Kaikow

Rick Rothstein (MVP - VB) said:
Fair enough. Here is a possible Function solution to this question which
makes use of the UrlGetPart API to get to the "company name". For those
following this thread, add a Module to the project (Insert/Module from the
VBA editor) and copy/paste the code after my signature into its code window.
Inside the GetCompanyName function, I have seeded the Protocols constant
with "www" and "ftp"... you can add any other protocols you think necessary
in this constant, just make sure each protocol is separated by a dot. You
would use this function just like any other worksheet function. So, for
example, if A1 contains the URL, then put this formula in the cell you want
to extract the company name to...

It is almost always better to use the Unicode version of APIs.
 
R

Rick Rothstein \(MVP - VB\)

Fair enough. Here is a possible Function solution to this question which
It is almost always better to use the Unicode version of APIs.

You are probably right; but I have zero experience with international
programming issues (all my programming efforts for the past 27 years have
been US based), whether involving fonts or not. The end result is I have no
idea if there is anything special that needs to be accounted for or not
when dealing with them. My gut feeling is simply using UrlGetPartW in place
of UrlGetPart is not the whole answer.

Rick
 
H

Howard Kaikow

I really tied up now, but the following will give you an idea on how to
Convert to Unicode.
Apparently, the code is still not parsing as desired.

Option Explicit

Private Declare Function UrlGetPart Lib "shlwapi" _
Alias "UrlGetPartW" _
(ByVal pszIn As Long, _
ByVal pszOut As Long, _
pcchOut As Long, _
ByVal dwPart As Long, _
ByVal Flags As Long) As Long

Private Const MAX_PATH As Long = 260
Private Const URL_PART_HOSTNAME As Long = 2

Public Function GetCompanyName(URL As String) As String
Dim dwPart As Long
Dim Flags As Long
Dim Part As String
Dim Size As Long
Dim Host As String
' Protocols is a **dot** delimited string
Const Protocols As String = "www.ftp"
If Len(URL) > 0 Then
If InStr(URL, "//") = 0 Then URL = "http://" & URL
Part = Space$(MAX_PATH)
Size = Len(Part)
If UrlGetPart(StrPtr(URL), StrPtr(Part), Size, URL_PART_HOSTNAME, 0&)
= 0 Then
GetCompanyName = Left$(Part, Size)
End If
Host = Left(GetCompanyName, InStr(GetCompanyName, "."))
If InStr("." & Protocols & ".", "." & Host) Then
GetCompanyName = Mid(GetCompanyName, Len(Host) + 1)
End If
End If
End Function

Private Sub btnByeBye_Click()
Unload Me
End Sub

Private Sub btnRunMe_Click()
With lstResult
.AddItem GetCompanyName("http://www.BagelsAndLox.com/")
.AddItem GetCompanyName("http://BagelsAndLox.com/")
.AddItem GetCompanyName("www.BagelsAndLox.com")
.AddItem GetCompanyName("BagelsAndLox.com")
.AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/")
.AddItem GetCompanyName("http://Carol.BagelsAndLox.com/")
.AddItem GetCompanyName("www.Ted.BagelsAndLox.com")
.AddItem GetCompanyName("Alice.BagelsAndLox.com")
End With
End Sub
 
H

Howard Kaikow

A betta test is:


Private Sub btnRunMe_Click()
With lstResult
.AddItem GetCompanyName("http://www.BagelsAndLox.com/")
.AddItem GetCompanyName("http://BagelsAndLox.com/")
.AddItem GetCompanyName("www.BagelsAndLox.com")
.AddItem GetCompanyName("BagelsAndLox.com")
.AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/")
.AddItem GetCompanyName("http://Carol.BagelsAndLox.com/")
.AddItem GetCompanyName("www.Ted.BagelsAndLox.com")
.AddItem GetCompanyName("Alice.BagelsAndLox.com")

.AddItem GetCompanyName("http://www.BagelsAndLox.com/Alpha.htm")
.AddItem GetCompanyName("http://BagelsAndLox.com/Beta.html")
.AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/Gamma.doc")
.AddItem GetCompanyName("http://Carol.BagelsAndLox.com/Delta.jpg")
End With
End Sub
 
R

Ron Rosenfeld

Hi, I have a list of URLs and I would like to find a way to extract
the domain name from the URL.

The URLs can be prefixed with www in some cases and I'd also like to
strip that off.

So, for the following URLs
http://www.wiseclerk.com/group-news/tag/libor
http://www.wiseclerk.com/group-news/2008/04/
http://www.wiseclerk.com/group-news/
http://www.tiscali.co.uk/search/results.php?section=&from=&query=setana
sports
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/
http://www.techdirt.com/
http://techdirt.com/index.php
http://techdirt.com/blog.php?tag=loans



I'd like to extract

wiseclerk.com
tiscali.co.uk
techdirt.com

I'm struggling to find the correct algorithm to locate the starting
point (after the http:// or after the http://www.) and the ending
point (the first /) for my Mid function.

Any help appreciated.

I'm thinking that I might have to do a find/search function for either
of the two starting strings and then have the "max" value (for the
longer of the two strings) but that's a lot of extra columns in my
spreadsheet.

Is there perhaps a better way to do this?

Here is a UDF that will probably do a better job, given all the variables that
have been proposed in this thread.

Note that the line that starts with re.Pattern = and the subsequent line(s)
within quotation marks should be all on one line.

======================================
Option Explicit
Function ExtrURL(str As String) As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
re.Pattern =
"\b((https?|ftp)://)?([\-A-Z0-9.]+)(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?"
If re.test(str) = True Then
Set mc = re.Execute(str)
ExtrURL = mc(mc.Count - 1).submatches(2)
ExtrURL = Replace(ExtrURL, "www.", "")
If InStr(1, ExtrURL, ".") = 0 Then ExtrURL = ""
End If
End Function
==================================

--ron
 
R

Rick Rothstein \(MVP - VB\)

re.Pattern =
"\b((https?|ftp)://)?([\-A-Z0-9.]+)(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?"

Now that is what I miss about Regular Expressions from my days many years
ago working with them in the UNIX world... their clarity and readability.<g>

Rick
 
R

Ron Rosenfeld

re.Pattern =
"\b((https?|ftp)://)?([\-A-Z0-9.]+)(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?"

Now that is what I miss about Regular Expressions from my days many years
ago working with them in the UNIX world... their clarity and readability.<g>

Rick

<ggg>

And even when you write out the explanation:

===============================
URL capturing

\b((https?|ftp)://)?([-A-Z0-9.]+)(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?

Options: case insensitive

Assert position at a word boundary «\b»
Match the regular expression below and capture its match into backreference
number 1 «((https?|ftp)://)?»
Between zero and one times, as many times as possible, giving back as needed
(greedy) «?»
Match the regular expression below and capture its match into backreference
number 2 «(https?|ftp)»
Match either the regular expression below (attempting the next
alternative only if this one fails) «https?»
Match the characters “http” literally «http»
Match the character “s” literally «s?»
Between zero and one times, as many times as possible, giving back
as needed (greedy) «?»
Or match regular expression number 2 below (the entire group fails if
this one fails to match) «ftp»
Match the characters “ftp” literally «ftp»
Match the characters “://” literally «://»
Match the regular expression below and capture its match into backreference
number 3 «([-A-Z0-9.]+)»
Match a single character present in the list below «[-A-Z0-9.]+»
Between one and unlimited times, as many times as possible, giving back
as needed (greedy) «+»
The character “-” «-»
A character in the range between “A” and “Z” «A-Z»
A character in the range between “0” and “9” «0-9»
The character “.” «.»
Match the regular expression below and capture its match into backreference
number 4 «(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?»
Between zero and one times, as many times as possible, giving back as needed
(greedy) «?»
Match the character “/” literally «/»
Match a single character present in the list below
«[-A-Z0-9+&@#/%=~_|!:,.;]*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
The character “-” «-»
A character in the range between “A” and “Z” «A-Z»
A character in the range between “0” and “9” «0-9»
One of the characters “+&@#/%=~_|!:,.;” «+&@#/%=~_|!:,.;»
Match the regular expression below and capture its match into backreference
number 5 «(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?»
Between zero and one times, as many times as possible, giving back as needed
(greedy) «?»
Match the character “?” literally «\?»
Match a single character present in the list below
«[-A-Z0-9+&@#/%=~_|!:,.;]*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
The character “-” «-»
A character in the range between “A” and “Z” «A-Z»
A character in the range between “0” and “9” «0-9»
One of the characters “+&@#/%=~_|!:,.;” «+&@#/%=~_|!:,.;»


Created with RegexBuddy
======================================
--ron
 
H

Harlan Grove

....
[reformatted]
re.Pattern = "\b((https?|ftp)://)?([\-A-Z0-9.]+)" & _
"(/[\-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[\-A-Z0-9+&@#/%=~_|!:,.;]*)?"
....

Why so verbose?

re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*"
ExtrURL = re.Replace(str, "$2")
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top