#338781 - 02/16/11 07:58 PM Downloading HTML files
Kruncher Offline
devotee

Registered: 10/05/06
Posts: 484
Loc: Maple Ridge, BC
OK, MS operating system gurus, I've got an odd one for you.

I've Google'd until I'm blue in the face, and have read all manner of seemingly related stuff that naturally turned out to be unrelated.

Here's the deal. I'm trying to do some number Krunchin' by scraping data from web sites. I was supplied with two HTML files which were downloaded from the same web page: one using IE8 on Win7, the other using IE8 on XP.

The XP derived file is about 30k. The Win7 file is about 50k, and is full of stuff that makes it practically impossible to get the data out for mere mortals.

Does anyone have any idea why they're different and, more importantly, how to get the usable XP format out of a Win 7 box? That's key to keeping this analysis system running in the future.

I've spent hours on this over the last few days, and would really appreciate any pointers to concrete answers, or of course, answers themselves.

Switching browsers doesn't seem to help, BTW. Tried that. Several times.

#338784 - 02/16/11 08:04 PM Re: Downloading HTML files [Re: Kruncher]
ClubNeon Offline
connoisseur

Registered: 02/06/09
Posts: 3466
Loc: Western Maryland, USA
When you're in the Save As... dialog box, change the Save as type to: Webpage, HTML only.
_________________________
Pioneer PDP-5020FD, Marantz SR6011
Axiom M5HP, VP160HP, QS8
Sony PS4, surround backs
-Chris

#338785 - 02/16/11 08:04 PM Re: Downloading HTML files [Re: Kruncher]
Ken.C Offline
shareholder in the making

Registered: 05/03/03
Posts: 18044
Loc: NoVA
What command was used to get the file out of the browser? Edit->View Source or File->Save? If Save, which of the two html options?
_________________________
I am the Doctor, and THIS... is my SPOON!

#338787 - 02/16/11 08:10 PM Re: Downloading HTML files [Re: Ken.C]
Kruncher Offline
devotee

Registered: 10/05/06
Posts: 484
Loc: Maple Ridge, BC
File, Save as, Webpage HTML only. I believe.

I got two HTM files, no other folders.

#338796 - 02/16/11 08:22 PM Re: Downloading HTML files [Re: Kruncher]
SirQuack Offline
shareholder in the making

Registered: 01/29/04
Posts: 13571
Loc: Iowa
Often for file type you will have an option for HTML Webpage Complete, or just HTML Only. The complete one will have more content. You should have a drop down when you do a save-as to select various options like mht, html, etc.
_________________________
M80s-VP180-4xM22ow-4xM3ic-EP600-2xEP350
Anthem AVM60 Outlaw 7700 Emotiva A500 Epson 5040UB



#338832 - 02/16/11 10:38 PM Re: Downloading HTML files [Re: SirQuack]
Kruncher Offline
devotee

Registered: 10/05/06
Posts: 484
Loc: Maple Ridge, BC
I've confirmed the save process with the sender:
File, Save as, Webpage HTML only.

Further, my own testing on a Win 7 system generated a Webpage Complete file of 59,033 bytes while the "HTML only" save version was 59,258 bytes for the same page. No, that's not a typo or reversed values.

This is the strangest thing...

#338838 - 02/16/11 11:02 PM Re: Downloading HTML files [Re: Kruncher]
SirQuack Offline
shareholder in the making

Registered: 01/29/04
Posts: 13571
Loc: Iowa
Not sure this would help, but have you tried the MHT option?

"Saving in this format allows users to save a web page and its resources as a single MHTML file called a "Web Archive", where all images and linked files will be saved as a single entity."
_________________________
M80s-VP180-4xM22ow-4xM3ic-EP600-2xEP350
Anthem AVM60 Outlaw 7700 Emotiva A500 Epson 5040UB



#338839 - 02/16/11 11:16 PM Re: Downloading HTML files [Re: SirQuack]
Kruncher Offline
devotee

Registered: 10/05/06
Posts: 484
Loc: Maple Ridge, BC
Unfortunately, that format is not really an option, Randy.

The end goal is to use software, either from iOpus or something similar in function, to scrape data from literally hundreds of web pages to create a mini data mart. Other software will be used to capture the specific data values from the individual downloaded pages, and MHT doesn't play nicely with that. For example, a title of "2008-09" appears in the browser, but the string "2008-09" can't be found in the MHT file.
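
One plausible explanation for the missing string (an assumption, not something confirmed in this thread) is that an MHT file is a MIME container, so the HTML part may be stored base64 or quoted-printable encoded rather than as plain text. A minimal Python sketch of pulling the decoded HTML back out, where `html_from_mht` is a hypothetical helper name:

```python
import email
from email import policy


def html_from_mht(raw: bytes) -> str:
    """Return the decoded text of the first text/html part in an MHT file.

    Assumes the file is a well-formed MIME message, which is how MHT
    archives are normally structured.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (base64 or
            # quoted-printable) and decodes the bytes using the declared charset.
            return part.get_content()
    raise ValueError("no text/html part found")
```

Reading a saved archive with `open(path, "rb").read()` and passing the bytes to this function should give back HTML that an ordinary text search can work on, if the encoding really is the cause.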

#338842 - 02/16/11 11:24 PM Re: Downloading HTML files [Re: SirQuack]
pmbuko Offline
shareholder in the making

Registered: 04/02/03
Posts: 16437
Loc: Ben Lomond, California
MHT would be more or less useless for scraping data. What you want is a simple file readable by any text editor.

Have you tried using Firefox? It has a very convenient option in the Save As box that allows you to save the web page as a text file. This will strip out all the scripts and non-visible portions of the web page and give you only what's actually displayed on the screen. For an example page I just visited, the file sizes for raw html (full page source code) and the text file version were 229K and 82K, respectively.
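
If the browser route doesn't pan out, roughly the same stripping can be done on an already-saved .htm file. This is only a sketch using Python's standard `html.parser`; it approximates the idea and is not a reproduction of what Firefox actually writes out. The names `VisibleTextExtractor` and `visible_text` are made up for illustration:

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes while ignoring the contents of script and style tags."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the non-script, non-style text of an HTML document, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

It does not know about CSS-hidden elements, so anything hidden by a stylesheet will still show up in the output.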

May I ask what you want to use to scrape these files?
_________________________
I can explain it to you but I can't understand it for you.

#338847 - 02/16/11 11:34 PM Re: Downloading HTML files [Re: pmbuko]
Kruncher Offline
devotee

Registered: 10/05/06
Posts: 484
Loc: Maple Ridge, BC
I tried FF with the same poor result a few days ago, Peter. I know that the browser must play a role; after all, it's the software that's being used to create the file.

But when I used FF 3.6 to download from http://some-arbitrary-site.com, I got a larger .htm file using a Win 7 system than I did when I used FF 3.6 **for the exact same page** on my Vista box. WTF? Seriously! I know... I didn't buy it either when it was presented to me as the challenge in the first place.

As to the data scraper to be used on the downloaded files, Monarch is the tool of choice.

EDIT: Hold that thought. I didn't realize that I'd been using IE7 on my Vista box. Still, why would the HTML file downloaded with one version of a browser, any browser (but in this case say IE), be different from another? Shouldn't the HTML be dictated purely by the source - the site in question? Or are content management systems sending out entirely different HTML based on the browser in use at the client side? I think I'm getting warm now...
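
One way to test that hunch would be to request the same page twice while claiming to be two different browsers, then compare what comes back. A rough Python sketch follows; the two User-Agent strings are illustrative examples of what IE7 on Vista and IE8 on Win 7 typically send, not values captured from the machines in question, and the function names are hypothetical:

```python
import difflib
import urllib.request

# Illustrative User-Agent strings (assumptions); substitute the real ones
# your browsers send if you want an exact comparison.
UA_IE7 = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
UA_IE8 = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)"


def build_request(url, user_agent):
    """Build a GET request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10):
    """Download the raw bytes the server returns for this User-Agent."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        return resp.read()


def summarize_difference(a, b):
    """Report the byte sizes of two downloads and how many lines differ."""
    a_lines = a.decode("utf-8", errors="replace").splitlines()
    b_lines = b.decode("utf-8", errors="replace").splitlines()
    diff = list(difflib.unified_diff(a_lines, b_lines, lineterm="", n=0))
    # The first two entries are the ---/+++ file headers; skip them.
    changed = [line for line in diff[2:] if line.startswith(("+", "-"))]
    return {"size_a": len(a), "size_b": len(b), "changed_lines": len(changed)}
```

Calling `summarize_difference(fetch(url, UA_IE7), fetch(url, UA_IE8))` against the page in question would show whether the server itself varies the markup. If both downloads come back identical, the difference is more likely introduced by the browser when it saves the page.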


Edited by Kruncher (02/16/11 11:49 PM)
Edit Reason: As noted
