#162324 - 2006-05-23 08:42 PM Web scraping in KiX
pearly Offline
Getting the hang of it
*****

Registered: 2004-02-04
Posts: 92
I googled web scraping and got a lot of hits for Python, VBScript, Twill, etc. What I want to do is pull the HTML from web pages so I can parse it for the data I need to validate testing.

A former co-worker used COM in a VBScript:

Code:

' Find the Internet Explorer window whose handle matches varHwnd
Set objWindowsShell = CreateObject("Shell.Application")
For varObjectIndex = 0 To objWindowsShell.Windows.Count - 1
    If objWindowsShell.Windows(varObjectIndex).HWND = varHwnd Then
        Set objIe = objWindowsShell.Windows(varObjectIndex)
        Exit For
    End If
Next

' Use the top-level document, or the document of the named frame
If varFrameContext = "" Then
    Set objDocument = objIe.Document
Else
    Set objDocument = objIe.Document.Frames(varFrameContext).Document
End If

' Pick the requested TABLE element out of the page
Set objTables = objDocument.All.Tags("TABLE")
Set objTable = objTables.Item(varItemId)

' Read one cell, reporting an error if the row or column does not exist
If Not objTable.Rows.Length < varRow Then
    If Not objTable.Rows(varRowIndex).Cells.Length < varCol Then
        Select Case varPropertyName
            Case "innerText"
                varHtmlTableCell = Trim(objTable.Rows(varRowIndex).Cells(varColIndex).innerText)
            Case "Image.Name"
                varHtmlTableCell = objTable.Rows(varRowIndex).Cells(varColIndex).Images(0).name
        End Select
    Else
        If IsNumeric(varItemId) Then varItemId = varItemId + 1
        varErrorDetail = """" & "Type=HTMLTable;Index=" & varItemId & """" & ", " & """" & "Row=" & varRow & ";Col=" & varCol & """" & vbCrLf & "Col not found."
        If Not ErrorMessagePersist_True(ErrorMessage, constrProcedureName, mconlngAutomationError, varErrorDetail) Then Exit Function
        Exit Function
    End If
Else
    If IsNumeric(varItemId) Then varItemId = varItemId + 1
    varErrorDetail = """" & "Type=HTMLTable;Index=" & varItemId & """" & ", " & """" & "Row=" & varRow & ";Col=" & varCol & """" & vbCrLf & "Row not found."
    If Not ErrorMessagePersist_True(ErrorMessage, constrProcedureName, mconlngAutomationError, varErrorDetail) Then Exit Function
    Exit Function
End If

' Release the COM references
Set objArguments = Nothing
Set objWindowsShell = Nothing
Set objIe = Nothing
Set objDocument = Nothing
Set objTables = Nothing



I had a really tough time finding out the properties and methods of the objects in use. Can someone tell me whether the above code works for what I need, or tell me the best way to parse HTML using KiXtart? Thanks!

#162325 - 2006-05-23 08:44 PM Re: Web scraping in KiX
Lonkero Administrator Offline
KiX Master Guru
*****

Registered: 2001-06-05
Posts: 22346
Loc: OK
what do you really want?
that code above is not the right way if you want to get the whole HTML of a page.
for that, a simple 3-line KiXtart script will do just fine.
_________________________
!

download KiXnet

#162326 - 2006-05-23 09:01 PM Re: Web scraping in KiX
pearly Offline
Can you show me the three lines of code?

#162327 - 2006-05-23 09:09 PM Re: Web scraping in KiX
Les Offline
KiX Master
*****

Registered: 2001-06-11
Posts: 12734
Loc: fortfrances.on.ca
$rc=SetOption('WrapAtEOL','on')
$http=CreateObject("microsoft.xmlhttp")
$http.Open("GET","http://www.kixtart.org/",Not 1)
$http.send
$value=$http.responsebody
$value ?
_________________________
Give a man a fish and he will be back for more. Slap him with a fish and he will go away forever.

#162328 - 2006-05-23 09:10 PM Re: Web scraping in KiX
Lonkero Administrator Offline
sorry, it takes four:
Code:

$http=CreateObject("microsoft.xmlhttp")
$http.Open("GET","http://www.kixtart.org/",Not 1)
$http.send
$http.responsebody ?



thanks les.
#162329 - 2006-05-23 09:15 PM Re: Web scraping in KiX
pearly Offline
Thanks guys. How do you set the proxy configuration?
#162330 - 2006-05-23 09:21 PM Re: Web scraping in KiX
Lonkero Administrator Offline
it was a single registry value...
just can't remember which one.
#162331 - 2006-05-23 09:24 PM Re: Web scraping in KiX
pearly Offline
Oh, so it can't be done by setting a property on the xmlhttp object?

#162332 - 2006-05-23 09:32 PM Re: Web scraping in KiX
Lonkero Administrator Offline
no.
but you can avoid the registry value by always pulling a different URL.
that is, add something extra to the end, like "?some=fake&values=here"
so the above example becomes:
Code:

$http=CreateObject("microsoft.xmlhttp")
$http.Open("GET","http://www.kixtart.org/?some=fake&values=here",Not 1)
$http.send
$http.responsebody ?



anyway, the registry setting is the best choice.
it's a per-user setting and you can always reset it once you don't need it anymore.
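
The fake-parameter trick above can be automated so that the URL differs on each run. This is only a minimal sketch: the helper name UniqueUrl and the parameter name "nocache" are made up for illustration and do not come from this thread. @TICKS is the KiXtart macro that returns the milliseconds elapsed since system start. The trick only sidesteps a cached copy, and a server may react to the extra parameter.

```kixtart
; Sketch: make the request URL unique so a cached copy is not reused.
; UniqueUrl and the "nocache" parameter name are hypothetical.
Function UniqueUrl($url)
    Dim $sep
    $sep = "?"
    ; If the URL already carries a query string, extend it instead of starting one
    If InStr($url, "?")
        $sep = "&"
    EndIf
    ; @TICKS is milliseconds since system start, so the value differs between runs
    $UniqueUrl = $url + $sep + "nocache=" + @TICKS
EndFunction

$http = CreateObject("microsoft.xmlhttp")
$http.Open("GET", UniqueUrl("http://www.kixtart.org/"), Not 1)
$http.Send
$http.ResponseText ?
```

For a URL that already ends in a query string, the helper appends with "&" rather than adding a second "?".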
#162333 - 2006-05-23 10:17 PM Re: Web scraping in KiX
pearly Offline
I tried adding fake values, but it didn't work. I think I may have a special case: the URL contains a DLL reference.

http://[ipaddress]/[navigation]/[dllname]?Login

ex: http://10.10.10.1/kix/kix64.dll?Login


Edited by pearly (2006-05-23 10:18 PM)

#162334 - 2006-05-23 10:40 PM Re: Web scraping in KiX
Lonkero Administrator Offline
think you'd best go with the registry fix.
what the fake does is force the update of the page.
it does not override the proxy settings.
you can disable the proxy for the script execution time and get rid of the issue.
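
Disabling the proxy for the script's run time might look like the sketch below. It assumes the per-user ProxyEnable value (REG_DWORD, 1 = on, 0 = off) under the Internet Settings key. Whether an xmlhttp object created right after the change actually honours it is an assumption to verify on your own machine.

```kixtart
; Sketch: switch the per-user IE proxy off, fetch a page, then restore the setting.
$key = "HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings"
$old = ReadValue($key, "ProxyEnable")   ; "1" = proxy on, "0" = proxy off
$rc = WriteValue($key, "ProxyEnable", "0", "REG_DWORD")

$http = CreateObject("microsoft.xmlhttp")
$http.Open("GET", "http://www.kixtart.org/", Not 1)
$http.Send
$http.ResponseText ?

; Put the original setting back only if there was one to read
If $old <> ""
    $rc = WriteValue($key, "ProxyEnable", $old, "REG_DWORD")
EndIf
```

Since this is a per-user value, the change affects other programs run by the same user until it is restored.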
#162335 - 2006-05-23 10:45 PM Re: Web scraping in KiX
pearly Offline
Can you give me an idea of how to go about writing the registry fix?

Thanks for helping!

#162336 - 2006-05-23 11:19 PM Re: Web scraping in KiX
Lonkero Administrator Offline
k, had to crawl a huge amount of historical pages, back to around 2003, but here is what I found:
Code:

$xK="HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings"
$cache=ReadValue($xK,"SyncMode5")
$=WriteValue($xK,"SyncMode5","3","reg_dword")



not sure what the values are for disabling the proxy and so on, but I guess you can google with the SyncMode5 keyword.
#162337 - 2006-05-24 01:57 AM Re: Web scraping in KiX
pearly Offline
Does the xmlhttp object use the proxy settings found in Internet Explorer's Internet Settings? Either way, I ran a test with the registry hack and without it, but $http.responsebody still returns a null value.
#162338 - 2006-05-24 07:13 AM Re: Web scraping in KiX
Lonkero Administrator Offline
yes, xmlhttp is part of IE.
and what registry value did you use?
I know the above setting does not bypass the proxy; it just forces a check for new data every time.
as I said, you can go googling for the correct value yourself.
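
On the earlier question of setting the proxy on the object itself: a different COM object, MSXML2.ServerXMLHTTP, has a setProxy method, where a setting of 1 (SXH_PROXY_SET_DIRECT) means connect without a proxy. This is a swapped-in technique, not what the thread uses. The sketch assumes MSXML 6.0 is installed and that KiXtart passes the arguments through to COM as written.

```kixtart
; Sketch: ServerXMLHTTP is a different object from microsoft.xmlhttp.
; Assumes the MSXML 6.0 ProgID is registered on this machine.
$http = CreateObject("MSXML2.ServerXMLHTTP.6.0")
If @ERROR
    "CreateObject failed: " + @SERROR ?
    Exit 1
EndIf

; 1 = SXH_PROXY_SET_DIRECT: connect directly; the two empty strings are the
; proxy server and bypass list, which do not apply to a direct connection
$http.setProxy(1, "", "")
$http.Open("GET", "http://10.10.10.1/kix/kix64.dll?Login", Not 1)
$http.Send
"Status: " + $http.Status + " " + $http.StatusText ?
$http.ResponseText ?
```

Unlike the registry approach, this changes nothing for other programs, because the setting lives on the one object.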
#162339 - 2006-05-25 09:24 PM Re: Web scraping in KiX
pearly Offline
According to the Scripting Guy, SyncMode5 can be set to one of four possible values:

Every visit to the page: 3
Every time you start Internet Explorer: 2
Automatically: 4
Never: 0

I've tried all values, but none of them work. Here is my code:

Code:

Break On
GetPage("http://10.10.10.1/kix/kix64.dll?Login") ?
Sleep 10

; Fetch $URL and return its HTML, or the HTTP status on failure
Function GetPage($URL)
    Dim $HTML, $IECacheKey, $IECacheVal, $nul
    $IECacheKey = "HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings"
    ; Remember the current cache mode; 3 = check for a new page on every visit
    $IECacheVal = ReadValue($IECacheKey, "SyncMode5")
    $IECacheVal ?
    If $IECacheVal <> 3
        $nul = WriteValue($IECacheKey, "SyncMode5", "3", "REG_DWORD")
    EndIf
    $HTML = CreateObject("microsoft.XMLhttp")
    $HTML.Open("GET", $URL, Not 1) ; Not 1 = 0, a synchronous request
    $HTML.Send
    ; Restore the saved cache mode first, so the error path below restores it too
    $nul = WriteValue($IECacheKey, "SyncMode5", $IECacheVal, "REG_DWORD")
    If $HTML.Status = 200
        $GetPage = $HTML.ResponseText ;or ResponseBody
    Else
        $GetPage = "HTTP Status Code: " + $HTML.Status + " (" + $HTML.StatusText + ")"
        Exit 1
    EndIf
EndFunction


#162340 - 2006-05-25 10:05 PM Re: Web scraping in KiX
Lonkero Administrator Offline
ok, let me ask, what is this dll?
if the dll is custom made there, why do you need to pull out the html it produces?
why can't you directly give it the info it wants or why don't you ask it directly?
#162341 - 2006-05-25 10:34 PM Re: Web scraping in KiX
pearly Offline
Hmmm, good question. Unfortunately I have no idea what's inside the dll. I'm new on the QA team. The testing tool we use can capture the HTML content, but I need KiX or some other third-party tool to do it, so I can run it w/o installing the testing tool and parse the HTML content to pull the version spec.

Is there any way to programmatically mimic the feature for viewing the source in IE (View > Source)?
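
Once the HTML is in a string, pulling out a value such as a version number can be done with plain string functions. This is a minimal sketch: the helper name Between, the sample HTML and the "Version: " marker are all made up for illustration, since the real page layout is not shown in this thread.

```kixtart
; Sketch: return the text between two marker strings, or "" if either is missing.
; Between, the sample HTML and the markers are hypothetical.
Function Between($text, $begin, $finish)
    Dim $pos, $first, $rest, $stop
    $Between = ""
    $pos = InStr($text, $begin)
    If $pos
        ; Keep only what follows the begin marker
        $first = $pos + Len($begin)
        $rest = SubStr($text, $first, Len($text) - $first + 1)
        $stop = InStr($rest, $finish)
        If $stop
            $Between = SubStr($rest, 1, $stop - 1)
        EndIf
    EndIf
EndFunction

$html = "<html><body>Version: 1.2.3<br></body></html>"
Between($html, "Version: ", "<") ?
```

In a real script, $html would be the ResponseText returned by the request instead of a literal string.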


Edited by pearly (2006-05-25 10:35 PM)

#162342 - 2006-05-26 10:03 AM Re: Web scraping in KiX
Lonkero Administrator Offline
the responsebody is exactly it.
but I thought your problem was the proxy?
if you disable the proxy in IE and try your script, does it still fail to get the source?
#162343 - 2006-05-26 11:33 PM Re: Web scraping in KiX
pearly Offline
i currently have a proxy set for internet access. it looks like the website i want to scrape is an intranet that i can access w/o the proxy setup. so this is a different issue?

i've tried disabling the proxy and tried all SyncMode5 values, but still i'm getting nothing back.

anything special i need to do for intranet sites that use dll?
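
When nothing comes back, it may help to see what the request actually returned before looking at the proxy or cache settings. The sketch below is a troubleshooting aid under that assumption, not a fix. It uses the same microsoft.xmlhttp object as the rest of the thread and prints the COM error state, the HTTP status, the length of the text and the response headers.

```kixtart
; Sketch: print what the request returned, to narrow down an empty result.
$http = CreateObject("microsoft.xmlhttp")
If @ERROR
    "CreateObject failed: " + @ERROR + " " + @SERROR ?
    Exit 1
EndIf

$http.Open("GET", "http://10.10.10.1/kix/kix64.dll?Login", Not 1)
$http.Send
; A non-zero @ERROR here points at the request itself (name lookup, connection)
"After Send, @ERROR = " + @ERROR + " (" + @SERROR + ")" ?

; The status shows whether the server answered with a page, a redirect or an error
"Status : " + $http.Status + " " + $http.StatusText ?
"Length : " + Len($http.ResponseText) ?
"Headers:" ?
$http.getAllResponseHeaders ?
```

A status other than 200, for example a login challenge or a redirect, would explain an empty body even when the proxy and cache settings are correct.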
