Stuck at Trying to Extract Data from a Website using JSP

H

HC

Hello,

There is this page that lists out the past transaction records of houses:
http://proptx.midland.com.hk/unit/index.jsp?est_id=E00005 (it's in Chinese)

If you click on a particular house, the past transaction records of that
house is shown.

I want to be able to extract the past transaction data and make charts to
visualise the price trend of the houses. Now, I'm only able to use the
"stupid" method of clicking on all the houses and typing in manually the
transaction records in Excel and then chart the data.

I wish to be able to extract the data to Excel automatically. I have
studied the underlying jsp pages and it seems that the site uses
http://proptx.midland.com.hk/unit/unit_tx.jsp to show the data.

I have tried typing in
http://proptx.midland.com.hk/unit/unit_tx.jsp?unit_id=U000146982 to see any
information will come up, but there's nothing in the page.

I'm totally stuck. As I want to monitor the trend of a number of
developments, it will be very tedious to type up all the transactions in
Excel. It seems the website has sort of exposed the data, but I just can't
find a way to get the data out, at least one house at a time.

I want to do webquery in Excel and then extract the data to a proper table.

Hope some experts can point me in the right direction.

Regards and thanks in advance,

HC
 
J

Joel

Here is code to get you started. I don't know chinese and not sure what
data you are looking for. I belive the ID=E0005 the part of the query that
extracts a particular house. Not sure how to lookup the id's. You can
create a string to get the URL like this:

----------------------------------------------------------------
ID = "E00005"

URL = "http://proptx.midland.com.hk/unit/index.jsp
Request = URL & "?est_id=" & ID
IE.Navigate2 URL
------------------------------------------------------------------
Below is code to dump the info from the house you had listed.


Sub GetHouse()

URL = "http://proptx.midland.com.hk/unit/index.jsp?est_id=E00005"


Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True


'get web page
IE.Navigate2 URL
Do While IE.readystate <> 4 Or _
IE.Busy = True
DoEvents
Loop

'test code
'With Sheets("Sheet3")
' RowCount = 1
' For Each itm In IE.document.all
' .Range("A" & RowCount) = itm.tagname
' .Range("B" & RowCount) = itm.ID
' .Range("C" & RowCount) = itm.classname
' .Range("D" & RowCount) = Left(itm.innertext, 1024)

' RowCount = RowCount + 1
' Next itm
'End With




Set Table = IE.document.getElementsByTagname("Table")

RowCount = 1
For Each Row In Table(7).Rows
Colcount = 1
For Each cell In Row.Cells
Cells(RowCount, Colcount) = cell.innertext
Colcount = Colcount + 1
Next cell

RowCount = RowCount + 1
Next Row
End Sub
 
J

Joel

Here is something I'm working on. The webpage I dumped has a magnifier glass
which brings you to a search page. The search page is a form with a number
of drop down boxes. I don't understand chineese but I'm able to go to my IE
explorer and from the menu on thie IE explorer go to View - Source. the HTML
is in english and has some useful comments that I'm looking at right now.
Let you know what I find.
 
H

HC

Hello Joel,

Thank you for your help. Maybe I need to elaborate what I want to achieve
more.

On the page: http://proptx.midland.com.hk/unit/index.jsp?est_id=E00005, it
shows there are Flat A to Flat H on each floor. The number is the size of
that particular flat. When you click on, say "753'" in the cell denoting
25/F Flat A, an orange box pops up. Inside the box, the first row of purple
text denotes the location of the flat, the second row of purple text states
again the area of the flat, and the third row of purple text states that
there were previously 3 transactions for this particular flat. First column
is the date of transaction, second column is the sold price, and the third
column is cost per square feet. I want to be able to progammatically
extract these transaction records to Excel for charting.

I am looking to achieve either of the following:

1. Find out the way to directly access the popup box. After studying the
codes, I know that the ID of each flat (unit) is in the saved html file and
that won't be too difficult to extract (I've done similar thing before, and
I guess I will be able to figure that out). After I've got all the unit_id,
I want to be able to repeatedly open the corresponding jsp page and extract
the transaction records for the flats (units).

2. As you have enlightened me in your post, or maybe I can just start an
instant of IE and then copy the data from IE to Excel. So do you know how I
can access the value displayed in the popped up orange box?

After some studying of the codes, it seems that the jsp page that shows the
orange popped up box is http://proptx.midland.com.hk/unit/unit_tx.jsp, and
it uses unit_id to denote each unit. However, I can't figure out further
how to get to the data.

Hope I've made myself clear. I'm totally stuck.

Hope you are able to help.

Thank you very much indeed.

HC
 
J

Joel

I'm having problems naviaging across pages. this code will work for 1 page.
I found the english language page for this website. I found a method to get
all the houses but only can span the 1st page. I used a 2nd IE to get the
details. Found when the tagname was A and the innertext was "Details" I was
able to get the URL of the details page. See if this helps.



Sub GetHouses()

Dim Districts() As Variant

URL = "http://cybersearch.midland.com.hk/cybersearch/?lang=en"

Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True

Set IE2 = CreateObject("InternetExplorer.Application")
IE2.Visible = True


'get web page
IE.Navigate2 URL
Do While IE.readystate <> 4 Or _
IE.Busy = True
DoEvents
Loop


a = 1
Set navigate = IE.document.getelementbyid("estatebox")
Pages = navigate.all.Length - 1
For pagecount = 1 To Pages
If pagecount <> 1 Then

Set navigate = IE.document.getelementbyid("estatebox")
Set Form = IE.document.getelementsbytagname("Form")

'I keep on getting the 1st page with this code.
'I can put the page number in the box but the submit gets
'back to page one.
'change to next page
navigate.all(0).Value = pagecount
navigate.all(pagecount).Click
Form(1).submit

Do While IE.readystate <> 4 Or _
IE.Busy = True

DoEvents
Loop
End If

Set Body = IE.document.getelementbyid("csBody")
'get table
Set Listings = Body.all(2)

For Each itm In Listings.all
If itm.tagname = "A" And _
itm.innertext = "Details" Then

'naviagate to detains in 2nd explorer
URL2 = itm.href
IE2.Navigate2 URL2
Do While IE2.readystate <> 4 Or _
IE2.Busy = True
DoEvents
Loop

End If
Next itm

Next pagecount


End Sub
 
J

John B. Matthews

There is this page that lists out the past transaction records of
houses: http://proptx.midland.com.hk/unit/index.jsp?est_id=E00005
(it's in Chinese)

If you click on a particular house, the past transaction records of
that house is shown.

I want to be able to extract the past transaction data and make
charts to visualise the price trend of the houses. Now, I'm only
able to use the "stupid" method of clicking on all the houses and
typing in manually the transaction records in Excel and then chart
the data.

I wish to be able to extract the data to Excel automatically. I have
studied the underlying jsp pages and it seems that the site uses
http://proptx.midland.com.hk/unit/unit_tx.jsp to show the data.

I have tried typing in
http://proptx.midland.com.hk/unit/unit_tx.jsp?unit_id=U000146982 to
see any information will come up, but there's nothing in the page.

Well, there are eight empty lines, delimited by CR/LF. Possibly, they
don't want strangers scraping their data.
I'm totally stuck. As I want to monitor the trend of a number of
developments, it will be very tedious to type up all the transactions
in Excel. It seems the website has sort of exposed the data, but I
just can't find a way to get the data out, at least one house at a
time.

You could ask them for the data. Alternatively, you might be able to
interpret the JavaScript usefully.
I want to do webquery in Excel and then extract the data to a proper
table.

In my locale, the tax authority makes similar data available for
download in convenient CSV format.
 
R

Roedy Green

I wish to be able to extract the data to Excel automatically. I have
studied the underlying jsp pages and it seems that the site uses
http://proptx.midland.com.hk/unit/unit_tx.jsp to show the data.

I got a temporarily unavailable on that URL.

I wrote a simple screenscrape to go to Oanda.com to pick up the daily
exchange rates. To my surprise I got a rude letter from their lawyers
and they blocked me from the site. They were willing to give the
information freely to the public, so long as it was not actually used.

So, I warn you, the same thing may happen to you if you succeed in
screenscraping that website.

See http://mindprod.com/jgloss/screenscraping.html

If you think they won't mind you screenscraping, perhaps they might be
willing to provide the data in XML, CSV, SOAP or other
computer-friendly format.
--
Roedy Green Canadian Mind Products
http://mindprod.com

"The industrial civilisation is based on the consumption of energy resources that are inherently limited in quantity, and that are about to become scarce. When they do, competition for what remains will trigger dramatic economic and geopolitical events; in the end, it may be impossible for even a single nation to sustain industrialism as we have know it in the twentieth century."
~ Richard Heinberg, The Party’s Over: Oil, War, and the Fate of Industrial Societies
 
R

ron

HC...The following url produces the original webpage and pop-up box
for flat 26/F A

http://proptx.midland.com.hk/unit/index.jsp?est_id=E00005&unit_id=U00....

If you programatically select all and copy / paste as text into an
Excel spreadsheet, the pop-up box info that you want will be found
near the top of the excel sheet...Ron

So just to be clear, all of the flats that have a pop-up window have a
unit ID which can be extracted from the source code (div id=) at

http://proptx.midland.com.hk/unit/index.jsp?est_id=E00005

and assigned to an array. It sounds like HC has already collected all
of these unit IDs.

They can then be programatically added to the base url to create the
link I attached in my earlier post

my_url = base url + unit ID
= "http://proptx.midland.com.hk/unit/index.jsp?
est_id=E00005&" _
ID_array(J)

One can then step through these urls one at a time and open the web
page plus pop-up window for each url; scrape each window plus pop-up,
paste the contents into excel and extract the desired information that
was contained in the pop-up window...Ron
 
H

Herbert Chan

Thank you for your attempt indeed.

For this particular website, there is actually no corresponding English
version of the same data. The data provided under
http://cybersearch.midland.com.hk/cybersearch/?lang=en is not the same as
the one on http://proptx.midland.com.hk/unit/index.jsp?est_id=E00005.

ron has pointed out that I can actually use a url such as
http://proptx.midland.com.hk/unit/index.jsp?est_id=E00005&unit_id=U000146982
to get to the pop up directly. That's a big piece of information for me.

After opening the webpage with the pop up, is there a way to directly
reference to the pop up box without having to copy and paste the whole page
back to Excel? Is the pop up frame one of the tables in the page?

Now that a way is found, I just wonder if there is a more elegant way to get
to the data.

HC
 
J

Joel

The data is in different tables. The pop up window is displaying specific
data from specific tables. The dollar amount of the bids are shown in the
code below. I included a debug tool that I use which is in the sub DUMP. I
usually run this code when I working with a webpage.

I also add watch variables when I writing my code. Select variable like
TABLE and right click variable to add watch. I thedn single step through the
code using F8.

The table vairable will have 18 tables. The index to the table starts at
zero so item 1 in the watch is table(0). You will see a property in the
watch window for tables call sourceindex. The source index starts at zero
and is the same data as the row number from dump (offset by 1). The data you
need to get is the innertext property.

Sub GetHouse()
'URL = "http://proptx.midland.com.hk/unit/index.jsp?est_id=E00005"
URL = "http://proptx.midland.com.hk/unit/index.jsp"
Unit = "?est_id=E00005&unit_id=U000146982"

Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True


'get web page
IE.Navigate2 URL & Unit
Do While IE.readystate <> 4 Or _
IE.Busy = True
DoEvents
Loop

'test code
'With Sheets("Sheet3")
' RowCount = 1
' For Each itm In IE.document.all
' .Range("A" & RowCount) = itm.tagname
' .Range("B" & RowCount) = itm.ID
' .Range("C" & RowCount) = itm.classname
' .Range("D" & RowCount) = Left(itm.innertext, 1024)

' RowCount = RowCount + 1
' Next itm
'End With



Call dump(IE)
Set Table = IE.document.getelementsbytagname("Table")
a = 1
Set PopupWin = Table(2)
b = 1
Set PopupWin = Table(3)
b = 1

'code for extracting table
RowCount = 1
For Each Row In PopupWin.Rows
Colcount = 1
For Each cell In Row.Cells
Cells(RowCount, Colcount) = cell.innertext
Colcount = Colcount + 1
Next cell '
RowCount = RowCount + 1
Next Row
End Sub
Sub dump(IE)

'test code
With Sheets("Sheet6")
.Cells.ClearContents
RowCount = 1
For Each itm In IE.document.all
.Range("A" & RowCount) = itm.tagname
.Range("B" & RowCount) = itm.ID
.Range("C" & RowCount) = itm.classname
.Range("D" & RowCount) = Left(itm.innertext, 1024)

RowCount = RowCount + 1
Next itm
End With
End Sub
 
J

Joel

If the data is on the webpage and not visible it is still in public domain.
You can view the source code in IE by going to menu view - Source. If the
data is on the page then it is in puiblic domain. If you have to request the
data by using a java script then it is not in public domain. In this case
the data is on on webpage and is public.
 
R

ron

Now that a way is found, I just wonder if there is a more elegant way to get
to the data.

HC...I agree it is not a very elegant approach. Opening IE and moving
from url to url takes time. I prefer to use the GET method whenever
possible as it does not require IE to be open. I have a macro that
checks real estate data at 800 different addresses. Using IE to move
from page to page, the maco takes 45 minutes to run; using the GET
method and parsing the source code behind each web page that is
assigned to a variable only requires 14 minutes. However when I tried
the GET method with the url I used above, the source code did not
contain the pop-up box information.

my_url = "http://proptx.midland.com.hk/unit/index.jsp?
est_id=E00005&unit_id=U000146982"
Set my_object = CreateObject("MSXML2.XMLHTTP")
my_object.Open "GET", my_url, False
my_object.send
my_var = RL.responsetext
Set my_object = Nothing

An alternative would be to still run IE, scrape the data to the
clipboard, then, rather than paste the data into a spreadsheet, the
clipboard contents can be transferred to a variable which could then
be parsed (instr, mid, left, etc.) and the desired information
extracted. I have tried this method and it works for the flats on the
web page.

Set my_object = CreateObject("htmlfile")
my_var = my_object.ParentWindow.ClipboardData.GetData("text")

....Ron
 
R

ron

HC...I agree it is not a very elegant approach.  Opening IE and moving
from url to url takes time.  I prefer to use the GET method whenever
possible as it does not require IE to be open.  I have a macro that
checks real estate data at 800 different addresses.  Using IE to move
from page to page, the maco takes 45 minutes to run; using the GET
method and parsing the source code behind each web page that is
assigned to a variable only requires 14 minutes.  However when I tried
the GET method with the url I used above, the source code did not
contain the pop-up box information.

    my_url = "http://proptx.midland.com.hk/unit/index.jsp?
est_id=E00005&unit_id=U000146982"
    Set my_object = CreateObject("MSXML2.XMLHTTP")
    my_object.Open "GET", my_url, False
    my_object.send
    my_var = RL.responsetext
    Set my_object = Nothing

An alternative would be to still run IE, scrape the data to the
clipboard, then, rather than paste the data into a spreadsheet, the
clipboard contents can be transferred to a variable which could then
be parsed (instr, mid, left, etc.) and the desired information
extracted.  I have tried this method and it works for the flats on the
web page.

Set my_object = CreateObject("htmlfile")
my_var = my_object.ParentWindow.ClipboardData.GetData("text")

...Ron

PS...to use the clipboard transfer method the Microsoft Forms 2.0
Object Library must be selected...Ron
 
H

Herbert Chan

I've finally got the time to sit down and study what you've written.

I've run your macro, and it seems that the tagname "html" contains all the
text on the page, including the popup box.

I know nothing about how to control IE. How can I access the content of the
tagname "html" on the page?

I've tried:
Sheets("Sheet2").Range("a1").Value = IE.document.HTML

But the above gives me an error.

Hope you are still following the thread.

Thanks.

Herbert
 
H

Herbert Chan

How can I make a select all on the IE page and do the copy to clipboard?

URL = "http://proptx.midland.com.hk/unit/index.jsp"
Unit = "?est_id=E00005&unit_id=U000146982"
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
IE.Navigate2 URL & Unit

How to do the copy after the above?

Herbert

"ron" <[email protected]> ¼¶¼g©ó¶l¥ó·s»D:[email protected]...
HC...I agree it is not a very elegant approach. Opening IE and moving
from url to url takes time. I prefer to use the GET method whenever
possible as it does not require IE to be open. I have a macro that
checks real estate data at 800 different addresses. Using IE to move
from page to page, the maco takes 45 minutes to run; using the GET
method and parsing the source code behind each web page that is
assigned to a variable only requires 14 minutes. However when I tried
the GET method with the url I used above, the source code did not
contain the pop-up box information.

my_url = "http://proptx.midland.com.hk/unit/index.jsp?
est_id=E00005&unit_id=U000146982"
Set my_object = CreateObject("MSXML2.XMLHTTP")
my_object.Open "GET", my_url, False
my_object.send
my_var = RL.responsetext
Set my_object = Nothing

An alternative would be to still run IE, scrape the data to the
clipboard, then, rather than paste the data into a spreadsheet, the
clipboard contents can be transferred to a variable which could then
be parsed (instr, mid, left, etc.) and the desired information
extracted. I have tried this method and it works for the flats on the
web page.

Set my_object = CreateObject("htmlfile")
my_var = my_object.ParentWindow.ClipboardData.GetData("text")

...Ron

PS...to use the clipboard transfer method the Microsoft Forms 2.0
Object Library must be selected...Ron
 
R

ron

How can I make a select all on the IE page and do the copy to clipboard?

URL = "http://proptx.midland.com.hk/unit/index.jsp"
Unit = "?est_id=E00005&unit_id=U000146982"
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = True
IE.Navigate2 URL & Unit

How to do the copy after the above?

Herbert

"ron" <[email protected]> ¼¶¼g©ó¶l¥ó·s»D:[email protected]...







PS...to use the clipboard transfer method the Microsoft Forms 2.0
Object Library must be selected...Ron- Hide quoted text -

- Show quoted text -

Herbert...Try the following to select the entire page and then copy to
the clipboard

ie.ExecWB OLECMDID_SELECTALL, OLECMDEXECOPT_DONTPROMPTUSER

ie.ExecWB OLECMDID_COPY, OLECMDEXECOPT_DODEFAULT

A reference to Microsoft Internet Controls must be set...Ron
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top