Manfred Kuechler
Hunter College

Version: Feb 16, 2007
 

How to save Web Documents -- for Documentation in Internet  Research Papers

 
This is a supplement to the "Guidelines for attribution and citation in Internet Research papers" dealing in more detail with the technical issues involved. As discussed in the main document, many web pages change frequently. So, just saving the web address (URL) may not be sufficient to document the source of the information used in the paper. When in doubt, save/download the document (or the crucial part) to your local computer. This seems easy enough, as all major browsers (including Mozilla, Netscape, Firefox, MS IE) offer a "Save (Page) as ..." option on the "File" menu. And in many cases, this is all it takes. However, there can be complications, and these are discussed below. Dealing with such complications varies with the browser you use, and I will cover primarily the current versions of Firefox (2.0) and MS IE 7 on the Windows platform. When I refer to "all browsers" below, this is shorthand for these two browsers as they work on the Windows platform. I strongly discourage the use of the AOL browser. Even if you use AOL as your Internet Service Provider (ISP), you can still use one of the two browsers discussed in this and related technical advice documents.

Finally, many tasks can be accomplished in different ways, e.g., either by using pull-down menus from the toolbar at the top of your browser window or by using the right-click menu. My goal is to describe one way to accomplish each of the various tasks, not to describe all possibilities. Most of the time, there are other ways as well and you are encouraged to explore and experiment. My personal preferences may not match yours.

Format of Web Documents

HTML. The standard format for web documents is HTML; such documents are identified by an ending like .htm, .html, or .shtml in the web address (URL). Here are three properties of HTML web documents that are important to understand for the purpose of saving/downloading them:
  1. An HTML document consists of a main (text) file plus any number of associated files (mostly files containing images like photos, charts, and graphs); each image -- regardless of size -- is stored in a separate file. Using Mozilla, Netscape, or Firefox, you can get detailed information about the components by right-clicking on the web document and selecting "View Page Info".
  2. The main file consists of the actual (visible) text plus "markup" instructions or "tags" which are enclosed in angle brackets (<.....tag......>).  All browsers let you look at the "source" (text plus tags) by right-clicking and selecting "View [Page] Source". By default, MS IE displays the source in Notepad; the other browsers show the source in a new browser window. In general, you do not need to look at the source, but if things don't go as expected this is the place to look -- assuming that you have at least some basic knowledge of html tags.
  3. Many HTML documents use "frames", meaning that the browser window is divided into several parts (frames) and each frame displays a separate document. Dealing with frames is easy in Mozilla, Netscape, and Firefox, but tricky in MS IE (more below). Blackboard pages are a prime example of web pages using frames, but the existence of frames cannot be determined just by the appearance of a page (e.g., the Hunter library web site does not use frames).
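
The three properties above can be checked on the page source itself. Here is a minimal sketch in Python (my choice for illustration; any scripting language would do) that lists the image files a page refers to and reports whether the source contains frame tags. The class name, function name, and the sample page are made up for this example; it relies only on Python's standard html.parser module and is not a substitute for "View Page Info".

```python
from html.parser import HTMLParser


class PageInfo(HTMLParser):
    """Collect image references and note whether the source uses frames."""

    def __init__(self):
        super().__init__()
        self.images = []          # values of <img src="...">
        self.uses_frames = False  # True once a frame-related tag is seen

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)
        elif tag in ("frameset", "frame", "iframe"):
            self.uses_frames = True


def page_info(html_source):
    """Return (list of image files, uses_frames flag) for an HTML source."""
    parser = PageInfo()
    parser.feed(html_source)
    parser.close()
    return parser.images, parser.uses_frames


# Hypothetical page source, for illustration only.
sample = """
<html><body>
<img src="logo.gif"><p>Some text</p><img src="charts/fig1.png">
</body></html>
"""
images, uses_frames = page_info(sample)
print(images)
print(uses_frames)
```

Each entry in the image list corresponds to one separate file that a "complete" save has to download along with the main file.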
One major drawback of the html format is that it offers only limited control of the page layout and appearance. The appearance of a web document is contingent upon specifics of the user's computer including hardware (size/quality of monitor), screen resolution settings, size of window, and browser configuration. Therefore, many web sites use alternative formats which browsers cannot display directly, so that other software is needed in addition. But, typically, there are "plugins" available and -- these days -- the most common plugins are installed automatically with the browser. Some of these plugins display the document within the browser window, often also adding a plugin-specific toolbar; others open a new window. Most plugins deal with multimedia files like video clips or animations (e.g., RealPlayer, Quicktime, Flash), but others do deal with documents in the traditional sense. And when handling such documents, you should use the (plugin) toolbar associated with this specific software rather than the browser toolbar.

To see which plugins are installed (and to add more plugins), go to "Help"/"About:plugins" (Mozilla/Netscape) or type "about:plugins" (without the quotation marks) into the address box in Firefox. As MS IE is integrated into the Windows operating system, there are no "plugins" here; rather, a different interface (ActiveX) is used, but the technical details are not relevant for the purpose of this document. When your browser encounters a document/file with a format it cannot (yet) handle, you are usually offered a choice to download and install the necessary software (often a more current version of what you have installed already). During the installation process, you may be given a choice of which browsers (assuming you have more than one browser installed on your computer) you want to enable to handle this type of document/file. If you want to be proactive, visit the Firefox plugin site and download/install what you may need.

PDF. Files in this format ("portable document format") are the most common alternative to html documents on web sites -- especially for longer documents and for documents converted from hard copy (print) using a scanner. There are various ways of converting print documents into .pdf and the resulting documents can either be "searchable" (i.e., you can search for specific text strings in the document) or show just an image of the original printed text. But .pdf files can be created from any file which can be printed (including documents created and saved by word processing software like documents in .doc or .wpd format and even web pages in .htm format). All .pdf documents have a fixed layout complete with page numbers and the user can only change the "viewing scale" or "zoom level" when displaying it. In addition, the creator of a pdf document can add several security features to a document; the most extreme security setting prevents a viewer from even printing the document, let alone making any changes to it. To view a .pdf document, you need to have the (free) Adobe Reader -- formerly known as the Acrobat Reader -- and the related browser plugins installed. The most recent versions (7 and 8) of Acrobat, the most commonly used software to create .pdf files, even allow multimedia objects to be included in .pdf files.

Other formats. Occasionally, you may find documents in other formats on web sites, like documents in proprietary word processing formats (.doc, .wpd), presentations in original Powerpoint format (.ppt), Excel spreadsheets (.xls), and more. In these cases, the browser may present you with an option to either "download/save" the file or "open" it (if your browser already knows what software should be used for this file type and you have this software installed -- this information is part of your browser configuration). So downloading is not a problem here, but you may not have the matching software (application) installed on your computer, and then the download may not do you much good. However, some applications offer limited versions for free, so that you can at least view a document (but not make any changes). An example is the free Powerpoint Viewer from Microsoft.

As web site visitors may not be able to view documents in formats requiring protected (licensed) software without free readers or viewers, it is considered bad web site design to use such documents -- but you may still encounter them.


Details of Saving Web documents

HTML documents

These documents can be saved in either their native format or as pdf files. The latter has significant advantages (for documentation purposes), but requires the full version of Acrobat (or some other software capable of creating pdf files like the free PDFCreator).

Saving in native format. Click "File"/"Save (Page) as" in your browser toolbar. In the pop-up window, select the location (folder) on your computer where you want to have the document stored as well as a file name. The browser will typically suggest using the same file name as on the web site. However, it is preferable to change the file name to something which is more descriptive of the contents. E.g., the file name on the web site may simply be "06shiites.html" (this would be the browser suggestion), but you may want to change this to something like "NYT_article_020505_Iraq_constitution.html". When perusing your folder at a later point in time, such descriptive file names are helpful in locating specific documents, as they indicate the contents of a file and can save you the time of opening it. In addition, you have several choices as to how you want to save an html document. These are selected from the pull-down menu in the "Save as type" box below the "File name" box.
The exact wording varies between browsers, especially when using obsolete (= non-current) browser versions.

If images (and other additional components) are irrelevant -- and often these are just standard logos, other embellishments, or ads with no bearing on the substantive contents -- use the "HTML only" option, as it helps to reduce clutter on your local computer and you would not want to include such fluff with your Internet research paper anyway. However, images (photos, charts) may also be essential components of the web document, in which case you want to make sure to save them as well by saving the complete web page. Dealing with complete web pages saved this way is cumbersome, though, when moving/transferring such saved documents (see companion piece).

The "web archive" (or more technically speaking: an .mht file) offered by MS IE is an attractive alternative. This file format was developed independently of MS as a general Internet standard, but -- in contrast to Microsoft -- hardly any other browser (Mozilla, Firefox, Netscape) supports this format (though I have not checked Opera and other minor web browsers). Consequently, to display these files correctly you need to use MS software; when opening such files in other browsers, you will likely see a garbled document. How bad it looks depends on the specific browser you use and the complexity of the saved web document. So, although attractive in principle, this format should be avoided for Internet research papers (to be submitted to an instructor and possibly shared with other students) due to these display problems. If you use MS IE as your (principal) browser anyway, you may consider this option for documents which you don't plan to share.


Saving as pdf document.  Another (better) alternative to save complete web pages as one file is to use  Acrobat (and even older versions like version 5 let you do this) and create a .pdf file from the web document. 
Note that the (free) Adobe Reader is not sufficient to perform this task; you need at least "Acrobat Standard" (comparison of different Acrobat products). Fortunately, the full Adobe software is available in the Hunter College computer labs. CUNY/Hunter currently has no "volume purchasing" arrangement for Acrobat which would allow students to buy Acrobat (and other Adobe products) at a great discount. However, you can purchase the software at an individual educational discount at various web vendors like Campus Tech (they currently offer Acrobat 8 Standard for $89.85; the regular price is $299).

Saving a web (html) page as a pdf document can be done in several different ways:
Acrobat only: Start Acrobat, go to "File"/"Create PDF"/"From Web Page", enter the web address (URL) of the web document/page you want to save, e.g., the Hunter College Homepage at http://www.hunter.cuny.edu, then click the "create" button. After the web page displays in Acrobat, select "File"/"Save as" and save the web document as a .pdf file. Here is this document as downloaded and saved on Feb 16, 2007. Note that the content of the Hunter College home page changes frequently. In addition to the saved web page, the pdf file also contains the URL of the web page/site captured and the download date/time. (This description is based on Acrobat 7; I don't have version 8 yet.)

Print from Browser: With the web page of interest displayed by the browser, go to "File"/"Print", then select  an Adobe or Acrobat "printer". What choices you will be offered depends on the version of Windows and the version of Acrobat -- or any other pdf software like PDFCreator -- you have installed. On older systems, you will see "Acrobat Distiller" and "Acrobat PDF Writer", on newer systems something like "Adobe PDF" or "PDF Creator". Nothing will be sent to a physical printer, so don't get confused. Instead, you will be prompted to select a name and a location for the web page to be saved as a .pdf file. (Logically, this operation is much more like "save as ..." where you select a non-default file type, but you don't get there via the "save as ..." menu.)

Make sure to use a suitable "page setup" (from the "File" menu in your browser) so that both the URL of the web page and the date and time of the "print" (visit of the web site) are printed. In Mozilla/Firefox you have great control over the "headers" and "footers", and I suggest having the URL printed in the upper left corner (and nothing else in the header, to have maximum space for even long URLs; if the space is insufficient, the URL will get truncated) and date/time in the lower left corner.

If you are using MS IE, you can also display an "Acrobat toolbar" ("View"/"Toolbars", then check the entry for Acrobat/Adobe). Clicking the Adobe/Acrobat button on this toolbar will automatically start a save of the displayed web page as a .pdf file. However, using this approach, URL and date/time may not be included -- and this is a serious disadvantage for documentation purposes.

Finally, if you run into trouble with saving a web page as pdf, view the 10-minute screen movie on this topic. (Note that the screen movie link only works if you are logged in to Bb6 -- at least as a guest.)


PDF Documents

After you have displayed the document in your browser, click the save icon (looks like a floppy diskette) on the left of your Acrobat toolbar. Alternatively, if you have the full Acrobat software and the security settings of the document allow it, you can also save selected pages only -- either by extracting the pages you want or by deleting the pages you don't want, and then saving the extracted pages or the slimmed-down document. To do this, go to the "Documents" pull-down menu from the toolbar on top. If the security settings do not allow such changes, the corresponding menu items are grayed out.

Complications

Protected documents. If a document is protected against regular download (via either the "File"/"Save as" menu item in your browser or an equivalent menu item of the "plugin" software) and/or the document uses a format for which you don't have and cannot obtain (without cost) the matching software, there is still one recourse: take a screen shot. The details are discussed in a companion document.

You may also opt for the screen shot approach if the document (no matter in what format) is rather long and you only need to document a small part.

Bypassing the display of the web document. Note that you can usually bypass the display of the document and instead right-click on the link to the document on the referring page and select "Save Target/Link as". However, this approach only works if the referring page uses a standard link (a regular html tag) to point to the document of interest. Some web sites (including the Bb6 course web sites) use "scripts" to link to certain (types of) documents and in these cases you may save something different from the actual document. After saving/downloading a file, always verify that you have actually saved what you intended to save  by opening/displaying the downloaded version and scrolling all the way to the end of the document (to make sure that the whole document was downloaded).
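
As a first, rough version of this verification step, here is a minimal Python sketch. It checks whether a saved file at least looks like what its ending promises: a .pdf file should begin with the "%PDF-" marker and carry an "%%EOF" marker near its end, and a saved html file normally contains a closing html tag. The function name is made up for this example, and the checks are assumptions about typical files, not guarantees. A file that passes still needs to be opened and scrolled through as described above.

```python
from pathlib import Path


def looks_complete(path):
    """Rough plausibility check for a saved .pdf or .htm/.html/.shtml file."""
    data = Path(path).read_bytes()
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # PDF files begin with "%PDF-" and carry an "%%EOF" marker near the end.
        # A login or error page saved under a .pdf name fails the first test.
        return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]
    if suffix in (".htm", ".html", ".shtml"):
        # A fully downloaded HTML page normally contains a closing </html> tag.
        return b"</html>" in data.lower()
    raise ValueError("no check defined for files ending in " + repr(suffix))
```

Calling looks_complete on a file saved via "Save Target/Link as" returns False in the typical failure case where a script-based link delivered an html page instead of the expected .pdf document.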

"Bypassing" is an option worth considering when you are dealing with documents large in file size (some .pdf documents are quite large even though they consist of relatively few pages only) and you have a slow (telephone modem rather than DSL or cable) connection to the Internet. As some web sites prevent the browser from "caching" documents (storing a temporary version on your local computer), first displaying and then saving a (.pdf) document may require downloading the same document twice -- doubling your download time.

Web pages with frames. Often there is no need to save the complete web page (i.e., all frames of a page); rather, the document of interest is displayed in one specific frame. Sometimes it is easy to tell that a site uses frames, as the web address (URL) itself points to it, like in Bb6 course sites: http://hc.bbprod.cuny.edu/webapps/portal/frameset.jsp. Often "banners" (top part of displayed page) and "navigation bars" (often on the left of a displayed page) are displayed in separate frames, but this is not necessarily the case. As many web designers do not like frames (for good reasons, but we will not go into this here), they rather repeat such components on each page of the site. And using "dynamic" pages (also referred to as "active server pages"), this is much less onerous than it may appear. E.g., the Hunter College Homepage has both a banner on top and a navigation bar on the left, but it does not use frames.

So, how do we find out whether a web page uses frames?

Documenting your download. All citation styles (like MLA, APA, Chicago, etc.) require that you record the web address (URL) and the date of the download. And while the latter does not help much in cross-checking your sources if the page has changed in the meantime, it is still a good idea to do this. Recording the URL, however, is much more important, if only to enable the reader of an Internet research paper to check whether or not the page has changed in the meantime. If it has not changed (and many web sites keep their documents available and in the same location for years), then a reader has a chance to double check the (primary) source. (The problem of non-persistent URLs is discussed in more detail in the main document.)
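
If you save many documents, it helps to keep a running list that ties each saved file to its URL and download date. Here is a minimal Python sketch of such a note. The function name, the layout of the line, the file name "Hunter_homepage.pdf", and the time of day are made up for illustration; the URL and the date are the ones used in the Acrobat example above.

```python
from datetime import datetime


def download_record(url, saved_as, when=None):
    """Return a one-line note tying a saved file to its URL and download time."""
    when = when or datetime.now()  # default: the moment the note is written
    return "{} | {} | downloaded {}".format(
        saved_as, url, when.strftime("%Y-%m-%d %H:%M")
    )


# Hypothetical file name and time of day, for illustration only.
note = download_record(
    "http://www.hunter.cuny.edu",
    "Hunter_homepage.pdf",
    datetime(2007, 2, 16, 10, 30),
)
print(note)
# prints: Hunter_homepage.pdf | http://www.hunter.cuny.edu | downloaded 2007-02-16 10:30
```

Such a line can be pasted into a plain text file kept in the same folder as the saved documents.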

Pages with frames, again, present a complication, as the address shown in the address/location box of the browser may not change while perusing such a web site -- creating a mismatch between the URL shown in the location box of the browser and the actual document shown in (one frame of) the page. Simply copying the contents of the address box and using this as the URL for the saved document will then lead to an inadequate attribution.
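
To illustrate the mismatch, here is a minimal Python sketch that reads the source of a frames page and works out the full address of each document shown in a frame. Frame tags usually give addresses relative to the page that contains them, so each one is combined with the address from the location box using the standard urljoin function. The class and function names and the example.org addresses are hypothetical and serve only as an illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameFinder(HTMLParser):
    """Collect the full URLs of all documents shown in frames of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # the address shown in the location box
        self.frame_urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Resolve a relative frame address against the page address.
                self.frame_urls.append(urljoin(self.base_url, src))


def frame_urls(page_url, html_source):
    finder = FrameFinder(page_url)
    finder.feed(html_source)
    finder.close()
    return finder.frame_urls


# Hypothetical frames page, for illustration only.
source = """
<frameset cols="20%,80%">
  <frame src="nav.html">
  <frame src="docs/syllabus.html">
</frameset>
"""
urls = frame_urls("http://www.example.org/course/index.html", source)
print(urls)
# prints: ['http://www.example.org/course/nav.html', 'http://www.example.org/course/docs/syllabus.html']
```

In this made-up example, the location box would show only the index.html address, while the document to be cited is the second one in the list.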