Downloading HTML files #338781 02/17/11 12:58 AM
OP Kruncher devotee Joined: Oct 2006 Posts: 484 Maple Ridge, BC	OK, MS operating system gurus, I've got an odd one for you. I've Google'd until I'm blue in the face, and have read all manner of seemingly related stuff that naturally turned out to be unrelated. Here's the deal. I'm trying to do some number Krunchin' by scraping data from web sites. I was supplied with two html files which were downloaded from the same web page: one using IE8 on Win7, the other using IE8 on XP. The XP-derived file is about 30k. The Win7 file is about 50k, and is full of stuff that makes it practically impossible for mere mortals to get the data out. Does anyone have any idea why they're different, and more importantly, how to get the usable XP format out of a Win 7 box? That's key to keeping this analysis system running in the future. I've spent hours on this over the last few days, and would really appreciate any pointers to concrete answers, or of course, answers themselves. Switching browsers doesn't seem to help, BTW. Tried that. Several times.

Re: Downloading HTML files Kruncher #338784 02/17/11 01:04 AM
ClubNeon connoisseur Joined: Feb 2009 Posts: 3,466 Western Maryland, USA	When you're in the Save As... dialog box, change the Save as type to: Webpage, HTML only. Pioneer PDP-5020FD, Marantz SR6011 Axiom M5HP, VP160HP, QS8 Sony PS4, surround backs -Chris

Re: Downloading HTML files Kruncher #338785 02/17/11 01:04 AM
Ken.C shareholder in the making Joined: May 2003 Posts: 18,044 NoVA	What command was used to get the file out of the browser? Edit->View Source or File->Save? If Save, which of the two html options? I am the Doctor, and THIS... is my SPOON!

Re: Downloading HTML files Ken.C #338787 02/17/11 01:10 AM
OP Kruncher devotee Joined: Oct 2006 Posts: 484 Maple Ridge, BC	File, Save as, Webpage HTML only. I believe. I got two HTM files, no other folders.

Re: Downloading HTML files Kruncher #338796 02/17/11 01:22 AM
SirQuack shareholder in the making Joined: Jan 2004 Posts: 13,840 Likes: 13 Iowa	Often for file type you will have an option for HTML Webpage Complete, or just HTML Only. The complete one will have more content. You should have a drop down when you do a save-as to select various options like mht, html, etc. M80s VP180 4xM22ow 4xM3ic EP600 2xEP350 AnthemAVM60 Outlaw7700 EmoA500 Epson5040UB FluanceRT85

Re: Downloading HTML files SirQuack #338832 02/17/11 03:38 AM
OP Kruncher devotee Joined: Oct 2006 Posts: 484 Maple Ridge, BC	I've confirmed the save process with the sender: File, Save as, Webpage HTML only. Further, my own testing on a Win 7 system generated a Webpage Complete file of 59,033 bytes while the "HTML only" save version was 59,258 bytes for the same page. No, that's not a typo or reversed values. This is the strangest thing...

Re: Downloading HTML files Kruncher #338838 02/17/11 04:02 AM
SirQuack shareholder in the making Joined: Jan 2004 Posts: 13,840 Likes: 13 Iowa	Not sure this would help, but have you tried the MHT option? "Saving in this format allows users to save a web page and its resources as a single MHTML file called a "Web Archive", where all images and linked files will be saved as a single entity." M80s VP180 4xM22ow 4xM3ic EP600 2xEP350 AnthemAVM60 Outlaw7700 EmoA500 Epson5040UB FluanceRT85

Re: Downloading HTML files SirQuack #338839 02/17/11 04:16 AM
OP Kruncher devotee Joined: Oct 2006 Posts: 484 Maple Ridge, BC	Unfortunately that format is not really an option Randy. The end goal is to use software either from Iopus or similar in function to it, to scrape data from literally hundreds of web pages to create a mini data mart. Other software will be used to capture the specific data values from the individual downloaded pages, and mht doesn't play nicely with that. For example, there will be a title of "2008-09" that appears in the browser, but the string "2008-09" can't be found in the mht file.
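
One likely reason the string cannot be found: an MHT file is a MIME archive, and its HTML part is usually stored with a transfer encoding such as quoted-printable or base64, so the raw bytes do not match what the browser shows. A minimal Python sketch of decoding it follows, assuming the archive contains a text/html part. `SAMPLE_MHT` and the function name `html_from_mht` are made up for illustration and do not come from the thread.

```python
import email
from email import policy

# Hypothetical miniature MHT file. The title "2008-09" is split by a
# quoted-printable soft line break ("=" at the end of a line), so a plain
# search of the raw bytes cannot find it.
SAMPLE_MHT = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<html><body><h1>2008=\r\n"
    b"-09</h1></body></html>\r\n"
    b"--BOUNDARY--\r\n"
)


def html_from_mht(raw: bytes) -> str:
    """Return the decoded text of the first text/html part in an MHT archive."""
    message = email.message_from_bytes(raw, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and the charset.
            return part.get_content()
    raise ValueError("no text/html part found")


if __name__ == "__main__":
    print("raw search finds title:", b"2008-09" in SAMPLE_MHT)
    print("decoded search finds title:", "2008-09" in html_from_mht(SAMPLE_MHT))
```

Decoding first and scraping second would add a step to the pipeline, so a plain HTML download is probably still the simpler route.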

Re: Downloading HTML files SirQuack #338842 02/17/11 04:24 AM
pmbuko shareholder in the making Joined: Apr 2003 Posts: 16,441 Felton, California	MHT would be more or less useless for scraping data. What you want is a simple file readable by any text editor. Have you tried using Firefox? It has a very convenient option in the Save As box that allows you to save the web page as a text file. This will strip out all the scripts and non-visible portions of the web page and give you only what's actually displayed on the screen. For an example page I just visited, the file sizes for raw html (full page source code) and the text file version were 229K and 82K, respectively. May I ask what you want to use to scrape these files?
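
For files that are already saved as raw HTML, roughly the same "displayed text only" result can be approximated outside the browser. The Python sketch below is not what Firefox does internally. It only drops the contents of script and style tags and keeps the remaining text nodes. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text nodes of an HTML document, one per line."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `visible_text(open("page.htm", encoding="utf-8").read())` on a saved page (the file name is a placeholder) gives one text chunk per line, which a report-mining tool may find easier to digest than the full markup.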

Re: Downloading HTML files pmbuko #338847 02/17/11 04:34 AM
OP Kruncher devotee Joined: Oct 2006 Posts: 484 Maple Ridge, BC	I tried FF with the same poor result a few days ago, Peter. I know that the browser must play a role; after all, it's the software that's being used to create the file. But when I used FF 3.6 to download from http://some-arbitrary-site.com, I got a larger .htm file using a Win 7 system than I did when I used FF 3.6 for the exact same page on my Vista box. WTF? Seriously! I know... I didn't buy it either when it was presented to me as the challenge in the first place. As to the data scraper to be used on the downloaded files, Monarch is the tool of choice. EDIT: Hold that thought. I didn't realize that I'd been using IE7 on my Vista box. Still, why would the HTML file come out different when it is downloaded with one version of a browser, any browser (but in this case say IE), rather than another? Shouldn't the HTML be dictated purely by the source - the site in question? Or are content management systems sending out entirely different HTML based on the browser in use at the client side? I think I'm getting warm now... Last edited by Kruncher; 02/17/11 04:49 AM. Reason: As noted
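
One way to test the "server sends different HTML per browser" theory is to request the same page twice with different User-Agent headers and compare the sizes. The Python sketch below does that. The User-Agent strings are abbreviated, assumed examples of what IE8 reports on XP (Windows NT 5.1) and Win 7 (Windows NT 6.1), and the function names are illustrative. A difference in size would support the theory, while identical sizes would point back at the browser's save routine.

```python
import urllib.request

# Assumed, abbreviated User-Agent strings, for illustration only.
USER_AGENTS = {
    "IE8 on XP": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)",
    "IE8 on Win7": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)",
}


def build_request(url, user_agent):
    """Create a GET request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def page_sizes(url):
    """Fetch url once per User-Agent and return {label: size in bytes}."""
    sizes = {}
    for label, user_agent in USER_AGENTS.items():
        request = build_request(url, user_agent)
        with urllib.request.urlopen(request, timeout=30) as response:
            sizes[label] = len(response.read())
    return sizes


if __name__ == "__main__":
    # Placeholder URL taken from the post above.
    print(page_sizes("http://some-arbitrary-site.com"))
```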

Re: Downloading HTML files Kruncher #338851 02/17/11 04:57 AM
pmbuko shareholder in the making Joined: Apr 2003 Posts: 16,441 Felton, California	To avoid any sort of browser "infection", you could try curl for Windows. It's built in on most UNIX/linux OSes and very easy to use. E.g., from the command prompt: curl http://www.somesite.com/stuff.html -o somesite-stuff.html This would download the stuff.html file from somesite.com and save it locally as somesite-stuff.html. Since it's a command-line utility, you could batch a bunch of sites together.
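
The batching idea can also be sketched in Python for anyone who would rather not script the command prompt. This is a rough equivalent under assumptions, not a replacement for curl: the naming scheme in `local_name` (host name plus path) is invented here, and the URL list is a placeholder.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def local_name(url):
    """Derive a flat local file name from a URL (hypothetical naming scheme)."""
    parts = urlparse(url)
    # Fall back to index.html when the URL has no file name part.
    path = parts.path.strip("/").replace("/", "-") or "index.html"
    return f"{parts.hostname}-{path}"


def download_all(urls, dest_dir="."):
    """Download each URL into dest_dir and return the list of saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = dest / local_name(url)
        with urllib.request.urlopen(url, timeout=30) as response:
            target.write_bytes(response.read())
        saved.append(target)
    return saved


if __name__ == "__main__":
    # Placeholder list; in practice this could be read from a text file.
    for path in download_all(["http://www.somesite.com/stuff.html"], "pages"):
        print("saved", path)
```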

Re: Downloading HTML files Kruncher #338855 02/17/11 05:02 AM
OP Kruncher devotee Joined: Oct 2006 Posts: 484 Maple Ridge, BC	The plot thickens. The fellow who approached me with this in the first place maintains that he was using IE8 on both a Win 7 box and an XP box. Good, usable results files downloaded with the XP system, and bloated unmanageable files on the Win 7 box. He bought a Win 7 box to progress with his project, but instead it's stopped the project dead in its tracks. He's still got the XP box, but the future is with Win 7, so that's what he'd prefer to use. Understandable, I believe.

Re: Downloading HTML files pmbuko #338856 02/17/11 05:07 AM
OP Kruncher devotee Joined: Oct 2006 Posts: 484 Maple Ridge, BC	That sounds like a great plan Peter. Thanks very much for that information. Really top notch. I left my AIX/Unix days behind me in the '90s, and it's easy to forget just how useful utilities built for those OS's are. I'll pass that along and will try to post back his feedback here this week.

Re: Downloading HTML files Kruncher #338860 02/17/11 05:25 AM
ClubNeon connoisseur Joined: Feb 2009 Posts: 3,466 Western Maryland, USA	wget may be easier to use than curl. It's my tool of choice. Pioneer PDP-5020FD, Marantz SR6011 Axiom M5HP, VP160HP, QS8 Sony PS4, surround backs -Chris

Re: Downloading HTML files ClubNeon #338912 02/17/11 07:00 PM
pmbuko shareholder in the making Joined: Apr 2003 Posts: 16,441 Felton, California	True. wget is a bit more powerful and I'd use it instead if you need to grab a bunch of different files from a web server and want to filter out anything other than .htm or .html files. Here's an example I used recently: I have a server that holds install and configuration files for the linux desktops I deploy and manage. In one of my automated installs, I need to grab the latest NVIDIA driver from my install server. The name of this file is not constant, but it always ends with a .run extension, so I use the following command to grab it: wget -r -nH -np -nd -A run http://yum1:8080/nvidia/ The options basically say "look at all the files in the nvidia directory on that web server but only grab the ones that have a '.run' file extension." This works since I only ever keep one in there.
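
The filtering step in that wget command (keep only links with a given extension) can be illustrated in Python as well. This sketch assumes the web server returns a plain HTML directory listing, and it only selects the matching URLs without downloading them. The file names in the usage note are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_with_suffix(listing_html, base_url, suffix):
    """Return absolute URLs of links in listing_html whose href ends with suffix."""
    collector = LinkCollector()
    collector.feed(listing_html)
    collector.close()
    return [
        urljoin(base_url, href)
        for href in collector.hrefs
        if href.lower().endswith(suffix)
    ]
```

For a listing that links to `README.txt` and `NVIDIA-driver-example.run` (both hypothetical), `links_with_suffix(listing, "http://yum1:8080/nvidia/", ".run")` would keep only the second. Passing `".htm"` or `".html"` instead gives the HTML-only filter mentioned at the top of the post.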

Re: Downloading HTML files pmbuko #338922 02/17/11 07:54 PM
ClubNeon connoisseur Joined: Feb 2009 Posts: 3,466 Western Maryland, USA	wget can also be as simple as: wget "http://www.axiomaudio.com/" That'll create a file named "index.html" in your current directory. So that is easier than curl for getting a single document. You have to at least tell curl what name to save the file with, or it'll just write to the screen. Of course you can tell wget to save with a different name by just giving it the "-O filename.html" option too (capital O; a lowercase -o sets wget's log file instead). Pioneer PDP-5020FD, Marantz SR6011 Axiom M5HP, VP160HP, QS8 Sony PS4, surround backs -Chris

Re: Downloading HTML files ClubNeon #338925 02/17/11 08:16 PM
pmbuko shareholder in the making Joined: Apr 2003 Posts: 16,441 Felton, California	curl -O http://the.url.com/index.html will also save an index.html in your current directory. The -O option takes the file name from the end of the URL, so the URL needs to include one.

Re: Downloading HTML files pmbuko #338927 02/17/11 08:18 PM
ClubNeon connoisseur Joined: Feb 2009 Posts: 3,466 Western Maryland, USA	Still too much typing. Pioneer PDP-5020FD, Marantz SR6011 Axiom M5HP, VP160HP, QS8 Sony PS4, surround backs -Chris

Re: Downloading HTML files ClubNeon #338940 02/17/11 10:08 PM
pmbuko shareholder in the making Joined: Apr 2003 Posts: 16,441 Felton, California	A spurious criticism for a board regular to make.

Re: Downloading HTML files pmbuko #338949 02/17/11 10:50 PM
ClubNeon connoisseur Joined: Feb 2009 Posts: 3,466 Western Maryland, USA	If I was typing -O every time I wanted to save a file, how would I have time to spend here? Pioneer PDP-5020FD, Marantz SR6011 Axiom M5HP, VP160HP, QS8 Sony PS4, surround backs -Chris

Re: Downloading HTML files ClubNeon #338952 02/17/11 10:52 PM
Ken.C shareholder in the making Joined: May 2003 Posts: 18,044 NoVA	Isn't that why you alias curl to curl -O? I am the Doctor, and THIS... is my SPOON!

Re: Downloading HTML files Ken.C #338955 02/17/11 10:56 PM
ClubNeon connoisseur Joined: Feb 2009 Posts: 3,466 Western Maryland, USA	I don't even have curl on my machine. Keeping the disk space free for other things. I think having that alias around would also be a waste of RAM. Pioneer PDP-5020FD, Marantz SR6011 Axiom M5HP, VP160HP, QS8 Sony PS4, surround backs -Chris

Re: Downloading HTML files ClubNeon #338957 02/17/11 10:59 PM
Ken.C shareholder in the making Joined: May 2003 Posts: 18,044 NoVA	Yeah, we know your computer is quite limited on both disk space and RAM. I am the Doctor, and THIS... is my SPOON!

Re: Downloading HTML files Ken.C #338958 02/17/11 11:01 PM
ClubNeon connoisseur Joined: Feb 2009 Posts: 3,466 Western Maryland, USA	I'm like the billionaire that picks up pennies off the ground. Pioneer PDP-5020FD, Marantz SR6011 Axiom M5HP, VP160HP, QS8 Sony PS4, surround backs -Chris

Re: Downloading HTML files ClubNeon #338960 02/17/11 11:05 PM
BobKay connoisseur Joined: Mar 2010 Posts: 3,596 Likes: 1 Massachusetts Badlands	Originally Posted By: ClubNeon: "I'm like the billionaire that picks up pennies off the ground." To whip them at poor people. Always call the place you live a house. When you're old, everyone else will call it a home.

Re: Downloading HTML files ClubNeon #338981 02/18/11 02:17 AM
pmbuko shareholder in the making Joined: Apr 2003 Posts: 16,441 Felton, California	Lemme guess... you've compiled your own kernel to save space, too.

Re: Downloading HTML files pmbuko #338986 02/18/11 03:03 AM
ClubNeon connoisseur Joined: Feb 2009 Posts: 3,466 Western Maryland, USA	Kernel? I compiled my whole operating system, without -g. Pioneer PDP-5020FD, Marantz SR6011 Axiom M5HP, VP160HP, QS8 Sony PS4, surround backs -Chris
