Extract data from website code.

  • Thread starter mattwatson.mail
  • Start date
M

mattwatson.mail

There seems to be many ways to skin this cat. I'm most proficient with Excel, but could pick up and run with most anything.

My goal:

Take HTML source of website.
Search for item listings in HTML. (Panasonic TV - 42" - LCD)
Return 3 values for each listing. (Model, size, style)


In looking at the HTML I'm not worried that the data will be too hard to identify. I have found all the desired info quite easily. What would be the best way to grab and separate this data? I feel like I'd need a combination of word/excel to get the raw code into something I could separate by column? I don't really have any macro experience, but if that's an option I coulddefinitely give that a go.

Thanks!
 
C

Claus Busch

Hi,

Am Fri, 8 Nov 2013 10:39:21 -0800 (PST) schrieb
(e-mail address removed):
Take HTML source of website.
Search for item listings in HTML. (Panasonic TV - 42" - LCD)
Return 3 values for each listing. (Model, size, style)

if you have Panasonic TV - 42" - LCD in A1 then
Data => TextToColumns => Delimited => Delimiter = Hyphen => Finish


Regards
Claus B.
 
M

mattwatson.mail

I apologize, I probably should have given specifics.

<h2><a class="thm-hglight-text_color" href="/auto/new-2014-nissan-altima-25_s/664277/">New 2014 Nissan Altima 2.5 S </a></h2>
I need the "New 2014 Nissan Altima 2.5 S"

<dd data-price="23247" class="vehicle_price price_tp-msrp price_strike">$23,247</dd>
I need the "$23247"

<dd data-price="19277" class="vehicle_price price_tp-selling ">$19,277</dd>
I need the "19277"

This is buried in about 25 lines of code per vehicle on this page then 80 vehicles per page. I think the FIRST obstacle would be to JUST grab those 3 lines I want. I can't seem to think of the best way to do that. Is there a way to select and delete lines that do/do not include specific words/characters?

If I only got the lines I needed I could use TextToColumns to get everything after a ">". From there I could use the =Left(6) minus 1 to get that price/vehicle I needed.
 
M

mattwatson.mail

Halfway there. I filtered by lines containing "vehicle_price price_tp-msrp price_strike" and got at least one line I needed then used TextToColumns toget the number between < and >. Unfortunately via this method I have to dothis 3 times and then get the data sets paired up. I appreciate the idea of using TextToColumns.
 
R

Ron Rosenfeld

I apologize, I probably should have given specifics.

<h2><a class="thm-hglight-text_color" href="/auto/new-2014-nissan-altima-25_s/664277/">New 2014 Nissan Altima 2.5 S </a></h2>
I need the "New 2014 Nissan Altima 2.5 S"

<dd data-price="23247" class="vehicle_price price_tp-msrp price_strike">$23,247</dd>
I need the "$23247"

<dd data-price="19277" class="vehicle_price price_tp-selling ">$19,277</dd>
I need the "19277"

This is buried in about 25 lines of code per vehicle on this page then 80 vehicles per page. I think the FIRST obstacle would be to JUST grab those 3 lines I want. I can't seem to think of the best way to do that. Is there a way to select and delete lines that do/do not include specific words/characters?

If I only got the lines I needed I could use TextToColumns to get everything after a ">". From there I could use the =Left(6) minus 1 to get that price/vehicle I needed.

This can be done fairly easily using an HTMLParser and, if you don't have deal with multiple nestings of the same tags, using Regular Expressions.
(Since I am not familiar with HTML Parsers, except for having read about their capabilities, and since Regular Expressions are implemented in VBA, I will discuss those).
Exactly how best to do this depends on things like the location of the html text, how you want the results laid out, etc.

Further, it seems that you are reading in the HTML "line by line" with each line of HTML going into a separate row. I do not believe you can guarantee that the entire desired text will all reside on a single line. If it does not, that will mess up your idea of filtering lines by content.

For example, given that the
html text is located in A1
each item is listed between <a and </a>
each item is followed by two prices listed between <dd and </dd>

Then the macro below would extract the data and place it adjacent columns (and three rows per entry).

Of course, this may not be ideal.
Depending on the size of the page, the entire HTML text might not fit in one cell. However, you could combine multiple cells into a single string in VBA, which has a much larger size limit, >10^9 if I recall correctly.
You may need the results output in a different manner.
You may need to iterate through a bunch of cells or worksheets
There may be relevant information about the HTML source that you have not mentioned.

To enter this Macro (Sub), <alt-F11> opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code below into the window that opens.

To use this Macro (Sub), <alt-F8> opens the macro dialog box. Select the macro by name, and <RUN>.

================================
Option Explicit
Sub ExtractAuto()
Dim re As Object, mc As Object, m As Object
Dim i As Long
Dim c As Range
Dim s As String

Set c = Range("A1")
s = c.Text

Set re = CreateObject("vbscript.regexp")
With re
.Global = True
.ignorecase = True
.Pattern = "<a[^>]*>([^<]+)[\s\S]+?<dd[^>]+>([^<]+)[\s\S]+?<dd[^>]+>([^<]+)"
End With

If re.test(s) = True Then
Set mc = re.Execute(s)
For i = 0 To mc.Count - 1
c(1, i + 2) = mc(i).submatches(0)
c(2, i + 2) = mc(i).submatches(1)
c(3, i + 2) = mc(i).submatches(2)
Next i
End If
End Sub
==================================
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top