Manfred Kuechler
Hunter College
Version: Feb 16, 2007
How to save Web Documents -- for
Documentation in Internet Research Papers
This is a supplement to the "Guidelines
for attribution and citation in Internet Research papers" dealing
in more detail with the technical issues involved. As discussed in the
main document, many web pages change frequently. So, just saving the
web
address (URL) may not be sufficient to document the source of the
information used in the paper. When in doubt, save/download
the document (or the crucial part) to your local computer. This seems
easy enough, as all major browsers (including Mozilla, Netscape,
Firefox, MS IE) offer a "Save (Page) as ..." option on the "File"
menu. And
in many cases, this is all it takes. However, there can be
complications and these are discussed below. Dealing with such
complications varies with the browser you use, and I will cover primarily the
current versions of Firefox (2.0) and
MS IE 7 on the Windows platform. When I refer to "all browsers" below,
this is shorthand for these two browsers as they work on the Windows
platform. I strongly discourage the use of the AOL browser. Even if you
use AOL as your Internet Service Provider (ISP), you can still use one
of the two browsers discussed in this and related technical advice
documents.
Finally, many tasks can be accomplished in different ways, e.g., either
by
using pull-down menus from the toolbar at the top of your browser
window or
by using the right-click menu. My goal is to describe one way to accomplish each of the
various tasks, not to describe all possibilities. Most of the time,
there are other ways as well and you are
encouraged to explore and experiment. My personal preferences may not
match yours.
Format of Web Documents
HTML. The standard format for web
documents is HTML; such documents are identified by an ending like
.htm, .html, or .shtml in the web address (URL). Three properties of HTML
web documents
are important to understand for the purpose of saving/downloading
them:
- An HTML document consists of a main
(text)
file plus
any number of associated files (mostly files containing images like
photos,
charts, and graphs); each image -- regardless of size -- is stored in a
separate file. Using Mozilla, Netscape, or Firefox, you can get
detailed information about the components by right-clicking on the web
document and selecting "View Page Info".
- The main file consists of the actual
(visible) text
plus "markup" instructions or "tags" which are enclosed in angle
brackets
(<.....tag......>). All browsers let you look at the
"source" (text plus tags) by right-clicking and selecting "View [Page]
Source". By default, MS IE displays the source in Notepad; the other
browsers show the source in a new browser window. In general, you do
not need to look at the source, but if things don't go as expected this
is the place to look -- assuming that you have at least some basic
knowledge of html tags.
- Many HTML documents use "frames"
meaning that the browser window is divided into several parts (frames)
and each frame displays a separate document. Dealing with frames is
easy in Mozilla, Netscape, and Firefox, but tricky in MS IE (more
below). Blackboard pages are a prime example of web pages using frames,
but the existence of frames cannot be determined by just the appearance
of a page (e.g., the Hunter
library web site does not
use frames).
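To make the first property concrete, here is a minimal Python sketch that lists the associated files named in the main (text) file of a saved page. It is an illustration only, not something the browsers do in this form: the class name, the function name, and the choice of which tags to watch are my assumptions, and the list of tags is not complete.

```python
from html.parser import HTMLParser


class AssociatedFileLister(HTMLParser):
    """Collect the addresses of files a page pulls in besides its main text."""

    # Tag/attribute pairs that usually point to separately stored files.
    # This selection is an illustrative assumption, not a complete list.
    WATCHED = {("img", "src"), ("script", "src"), ("link", "href")}

    def __init__(self):
        super().__init__()
        self.files = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        for name, value in attrs:
            if (tag, name) in self.WATCHED and value:
                self.files.append(value)


def list_associated_files(html_text):
    """Return the addresses of associated files in the order they appear."""
    parser = AssociatedFileLister()
    parser.feed(html_text)
    return parser.files
```

Feeding the function the text shown by "View [Page] Source" gives a rough idea of how many separate files a "complete" save would have to fetch; ordinary links to other pages are deliberately ignored.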
One major
drawback of the html format is that it offers only limited control of
the page layout and appearance. The appearance of a web document is
contingent upon specifics of the user's computer including hardware
(size/quality of monitor), screen resolution settings, size of window,
and browser configuration. Therefore, many web sites use alternative
formats which browsers cannot display directly, so that other software
is needed in addition. Typically, though, there
are "plugins" available and -- these
days -- the
most common plugins are installed automatically with the browser. Some
of these plugins display the document within the browser
window, often adding a plugin-specific toolbar; others open a new
window. Most plugins deal with multimedia
files like video clips or animations (e.g., RealPlayer, Quicktime,
Flash), but others deal with documents in the traditional sense. When handling such documents, you should use the
(plugin) toolbar associated with this specific software rather than the
browser toolbar.
To see which plugins are installed (and to add more plugins), go to
"Help"/"About:plugins" (Mozilla/Netscape) or type "about:plugins"
(without the quotation marks) into the address box in Firefox. As MS IE
is integrated into the Windows operating system, there are no "plugins"
here, rather some other interface (DirectX) is used, but the technical
details are not relevant for the purpose of this document. When
your browser encounters a document/file with a format it cannot
(yet) handle, you are usually offered a choice to download and install
the necessary software (often a more current version of what you have
installed already). During the installation process, you may be asked
which browsers (assuming you have more than one browser
installed on your computer) should be enabled to handle this type
of document/file. If you want to be proactive, visit the Firefox plugin site
and download/install what you may need.
PDF. Files in this format
("Portable Document Format") are the most common alternative to html
documents on web sites -- especially for longer documents and for
documents converted from hard copy (print) using a scanner. There are
various ways of converting print documents into .pdf and the
resulting documents can either be "searchable"
(i.e., you can search for specific
text strings in the document) or show just an image of the
original printed text. But .pdf files can be created from
any file which can be printed (including documents created and saved by
word
processing software like documents in .doc or .wpd format and even web
pages in .htm format). All .pdf
documents have a fixed layout complete with page numbers and the user
can only change the "viewing scale" or "zoom level" when displaying it.
In addition, the creator of a pdf document can add several security
features to a document; the most extreme security setting prevents a
viewer from even printing the document, let alone making any changes to
it. To view a .pdf document, you need to have the (free) Adobe
Reader -- formerly known as the Acrobat Reader -- and the related
browser plugins installed. The most recent versions 7 and 8 of Acrobat
(the most
commonly used software to create .pdf files) even allow
multimedia objects to be included in .pdf files.
Other formats. Occasionally,
you may find documents in other formats on web sites, such as documents in
proprietary word processing formats (.doc, .wpd), presentations in
original Powerpoint format (.ppt), Excel spreadsheets (.xls), and
more. In these cases, the browser may present you with an option
to either "download/save" the file or "open" it (if your browser
already knows what software should be used for this file type and you
have this software installed -- this information is part of your
browser configuration). So downloading is not a problem here, but you
may not have the matching software (application) installed on your
computer, and then the download may not do you much good. However, some
applications offer limited versions for free, so that you can at least
view a document (but not make any changes). An example is the free
Powerpoint Viewer from Microsoft.
As web site visitors may not be able to view documents in formats
requiring protected (licensed) software without free readers or
viewers, it is considered bad web site design to use such documents --
but
you may still encounter them.
Details of Saving Web documents
HTML
documents
These documents can be saved in
either their native format or as pdf files. The latter has significant
advantages (for documentation purposes), but requires the full version
of Acrobat (or some other software capable of creating pdf files like the free PDFCreator).
Saving in native format. Click
"File"/"Save (Page) as" in your browser toolbar.
In the pop up window select the location (folder) on your computer
where you want to have the document stored as well as a file name. The
browser will typically suggest using the same file name as on the web
site. However, it is preferable to change the file name to something
more descriptive of the contents. For example, the file name
on the web site may simply be "06shiites.html" (this would be the
browser suggestion), but you may want to change this to something like
"NYT_article_020505_Iraq_constitution.html". When perusing your
folder at a later point in time, such descriptive file names are
helpful in locating specific documents as they indicate the contents of
a file and can save you the time of opening it. In addition, you have several
choices as to how you want to save an
html document. These are to be selected from the pull-down menu in the
"Save as type" box below the "File name" box:
- Web page, complete -- saves all files making up the web document,
the name you select is for the main file; all associated files are put
in a folder with a matching name (adding "_files" to the name of the
main file)
- Web page, HTML only -- saves the main text file only, no
images and/or objects belonging to the web document
- Web archive, single file (MS IE only)
The exact wording varies between browsers, especially when using
obsolete (non-current) browser versions.
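The naming conventions just described can be sketched in a few lines of Python. This is only an illustration under my own assumptions: both function names are made up, and I assume (as is my experience with the browsers covered here) that the companion folder of a "complete" save drops the file extension before adding "_files".

```python
import re
from pathlib import Path


def descriptive_name(*parts, extension=".html"):
    """Join descriptive parts into one file name without spaces."""
    # Anything other than letters, digits, and hyphens becomes an underscore.
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "_", part).strip("_") for part in parts]
    return "_".join(part for part in cleaned if part) + extension


def companion_folder(main_file):
    """Folder holding the associated files of a "Web page, complete" save."""
    path = Path(main_file)
    # Assumption: the extension is dropped before "_files" is appended.
    return path.with_name(path.stem + "_files")
```

Calling descriptive_name("NYT article", "020505", "Iraq constitution") produces the file name used as an example above, and companion_folder applied to that name gives the folder you would have to move along with the main file.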
If images (and other additional components) are irrelevant -- and often
these are just standard logos, other embellishments, or ads with no
bearing on the substantive contents -- use the "HTML only" option as it
helps to reduce clutter on your local computer and you would not want
to include such fluff with your Internet research paper anyway. However,
images (photos, charts) may also be essential components of the web
document, and then you want to make sure to save them as well. Unfortunately,
dealing with complete web pages saved this way is cumbersome when
moving/transferring such saved documents (see companion
piece).
The "web
archive" (or more technically speaking: an .mht file) offered by MS IE is an attractive alternative. This
file
format was developed independently of MS as a general Internet
standard,
but, unlike MS IE, hardly any other browser (Mozilla,
Firefox, Netscape) supports this format (though I have not
checked Opera and other minor web browsers). Consequently, to display
these files correctly you need to use MS software; when opening such files in
other browsers, you will likely see a garbled document. How bad it
looks depends on the specific browser you use and the complexity of the
saved web document. So, attractive in principle, this format should be
avoided for Internet research papers (to be submitted to an instructor
and possibly shared with other students) due to these display problems,
but if you use MS IE as your (principal)
browser anyway, you may consider this
option for documents which you don't plan to share.
Saving as pdf document.
Another (better) alternative
for saving a complete web page as one file is to use Acrobat (even
older versions like version 5 let you do this) and create a .pdf
file from the web document.
Note that the (free) Adobe Reader is not sufficient to perform this task;
you need at least "Acrobat Standard" (comparison
of different Acrobat products). Fortunately, the full Adobe software is
available in the Hunter College computer labs. CUNY/Hunter currently
has no "volume purchasing" arrangement for Acrobat which would allow
students to buy Acrobat (and other Adobe products) at a great discount.
However, you can purchase the software at an individual educational
discount at various web vendors like Campus Tech (they currently
offer Acrobat 8 Standard for $89.85; the regular
price is $299.)
Saving a web (html) page as a pdf document can be done in several
different ways:
- using Acrobat only
(requires that the page has a persistent URL; does not work, e.g., for
pages within the Lexis-Nexis data base)
- using a browser and then "printing"
to Adobe/Acrobat (works with basically all web pages)
Acrobat only:
Start Acrobat,
go to "File"/"Create PDF"/"From Web Page", enter the web address (URL) of the web
document/page you want to save, e.g., the Hunter College Homepage at http://www.hunter.cuny.edu, then
click the "create" button. After the web page displays in Acrobat,
select "File"/"Save as" and save the web document as a .pdf file. Here
is this document
as
downloaded and saved on Feb 16, 2007. Note that the content of the
Hunter College home
page changes
frequently. In addition to the saved web page, the pdf file also
contains the URL of the web page/site captured and the download
date/time. (This description is based on Acrobat 7; I don't have version 8 yet.)
Print from Browser: With the
web page of interest displayed by the browser, go to "File"/"Print",
then select an Adobe or Acrobat "printer". What choices you will
be offered depends on the version of Windows and the version of Acrobat
-- or any other pdf software like PDFCreator -- you have installed. On
older systems, you will see "Acrobat Distiller"
and "Acrobat PDF Writer", on newer systems something like "Adobe PDF"
or "PDF Creator".
Nothing will be sent to a physical printer, so don't get confused.
Instead, you will be prompted to select a name and a location for the
web page to be saved as a .pdf file. (Logically, this operation is much
more like "save as ..." where you select a non-default file type, but
you don't get there via the "save as ..." menu.)
Make sure to use an efficient "page setup" (from the "File" menu
in your browser) so that both the URL of the web page and the date
and time of the "print" (visit of the web site) are
printed. In Mozilla/Firefox you have great control over the
"headers" and "footers", and I suggest having the URL printed in the upper
left corner (and nothing else in the header, to have maximum
space for even long URLs; if the space is insufficient the URL
will get truncated) and date/time in the lower left corner.
If you are using MS IE, you can also display an "Acrobat toolbar"
("View"/"Toolbars", then check entry for Acrobat/Adobe). Clicking this
Adobe/Acrobat button on the toolbar will automatically start a save of
the displayed web page as a .pdf file. However, using this approach,
URL and date/time may not be included -- and this is a serious
disadvantage for documentation purposes.
Finally, if you run into trouble with saving a web page as pdf, view
the 10-minute
screen movie on this topic. (Note that the screen movie link only
works if you are logged
in to Bb6 -- at least as a guest.)
PDF Documents
After you
have displayed the document in your browser, click
the save icon (looks like a floppy diskette) to the left of your
Acrobat toolbar. Alternatively, if you have the full Acrobat
software and the security settings of the document allow it, you can
also save selected pages only -- either by extracting the pages you
want or by deleting the pages you don't want and then saving the
extracted pages or the slimmed down document. To do this, go to the
"Documents" pull-down menu from the toolbar on top. If the security
settings do not allow such changes, the corresponding menu items are
grayed out.
Complications
Protected
documents. If a document is protected against
regular download (via either the "File"/"Save as" menu item in your
browser or an equivalent menu item of the "plugin" software) and/or the
document uses a format for which you don't have and cannot obtain
(without cost) the matching software, there is still one recourse: take
a screen shot. The details are discussed in a companion
document.
You may also
opt for the screen shot approach if the document (no matter in what
format) is rather long and you only need to document a small part.
Bypassing the display of the web
document. Note that you can usually bypass the display of the
document and
instead right-click on the link to the document on the referring page
and select "Save Target/Link as". However, this approach only works if
the referring page uses a standard link (a regular html tag) to point
to the document of interest. Some web sites (including the Bb6 course
web sites) use "scripts" to link to certain (types of) documents and in
these cases you may save something different from the actual document.
After saving/downloading a file, always verify that you have actually
saved what you intended to save by opening/displaying the
downloaded
version and scrolling all the way to the end of the document (to make
sure that the whole document was downloaded).
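Opening the file and scrolling to the end remains the real test, but a rough automated check can catch obviously truncated downloads. The following Python sketch is an illustration under stated assumptions: the function name is made up, it relies on the fact that a well-formed .pdf file starts with "%PDF-" and carries an "%%EOF" marker near its end, and it assumes that a saved html file closes with an "</html>" tag (which is common, but not strictly required).

```python
from pathlib import Path


def looks_complete(path):
    """Rough check that a saved .pdf or .html file was not cut off."""
    data = Path(path).read_bytes()
    # Only the last part of the file matters for the end markers.
    tail = data[-1024:].lower()
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # A well-formed PDF starts with "%PDF-" and has "%%EOF" near its end.
        return data.startswith(b"%PDF-") and b"%%eof" in tail
    if suffix in (".htm", ".html", ".shtml"):
        # Assumption: the page closes its <html> element explicitly.
        return b"</html>" in tail
    raise ValueError(f"no check defined for {suffix!r} files")
```

A result of False does not prove the download failed, and True does not prove you saved the document you intended (a script-based link may have delivered a perfectly complete, but different, page), so this supplements the visual check rather than replacing it.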
"Bypassing" is an option worthwhile to consider when you are dealing
with documents large in file size (some .pdf documents are quite large
even though they consist of relatively few pages only) and you
have a slow (telephone modem rather than DSL or cable) connection to
the Internet. As some web sites prevent the browser from
"caching" documents (storing a temporary version on your local
computer), first displaying and then saving a (.pdf) document may
require downloading the same document twice -- doubling your download
time.
Web pages with frames. Often
there is no need to save the complete web page (i.e., all frames of a
page), rather the
document of interest is displayed in one specific frame. Sometimes it is
easy to tell that a site uses frames as the web address (URL) itself
points to it like in Bb6 course sites: http://hc.bbprod.cuny.edu/webapps/portal/frameset.jsp.
Often "banners" (top part of displayed page) and "navigation bars"
(often on the left of a displayed page) are displayed in separate
frames, but this is not necessarily the case. As many web designers do
not like frames (for good reasons, but we will not go into
this here), they rather repeat such components on each page of the
site. And using "dynamic" pages (also referred to as "active server
pages") this is much less onerous than it may appear. E.g., the Hunter College Homepage has
both a banner on top and a navigation bar on the left, but it does not
use frames.
So, how do we find out whether a web page uses frames?
- In MS IE, go to "File"/"Print" -- "Options" tab (in MS IE 7); if the page does not have frames,
this part dealing with frames will be grayed out. Otherwise, it will
give you options as to how to print only one frame, all frames
separately, or all frames together. But apart from printing, MS IE does
not offer much support with handling frames.
- In Mozilla, Netscape, Firefox simply right-click on the web page;
if
the page uses frames, you will see a menu item "This Frame ..."
and
selecting this item leads to a submenu dealing with the specific frame
only. One of the items here is to "Save Frame as". Note that "this frame"
is determined by where you click on the page; so make sure to click
somewhere on the document itself, not on a banner or navigation part of
the page.
Documenting your download.
All citation styles (like MLA, APA, Chicago, etc.) require that you
record the web address (URL)
and the date of the download.
And while the latter does not help much in cross-checking your sources
if the page has changed in the meantime, it is still a good idea to do
this. Recording the URL, however, is much more important,
if only to enable the reader of an Internet research paper to check
whether or not the page has changed in the meantime. If it has not
changed (and many web sites keep their documents available and in the
same location for years), then a reader has a chance to double check
the (primary) source. (The problem of non-persistent
URLs is discussed in more detail in the main
document.)
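If you save documents in native format, the URL and the download date are not stored with the file, so you have to record them yourself. One way to do this, sketched here in Python, is to keep a small text note next to each saved file. The function name, the ".source.txt" ending, and the layout of the note are my own assumptions for illustration, not a convention required by any citation style.

```python
from datetime import datetime
from pathlib import Path


def write_download_note(saved_file, url, when=None):
    """Write '<saved file>.source.txt' recording the URL and download time."""
    # Use the current date/time unless the caller supplies one.
    when = when or datetime.now()
    note = Path(str(saved_file) + ".source.txt")
    note.write_text(
        f"File: {Path(saved_file).name}\n"
        f"URL: {url}\n"
        f"Downloaded: {when.strftime('%Y-%m-%d %H:%M')}\n",
        encoding="utf-8",
    )
    return note
```

The note travels with the saved document when you move or submit it, and it gives you the two items every citation style asks for.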
Pages with frames, again, present a complication
as the address shown in the address/location box of the browser may not
change while perusing such a web site -- creating a mismatch between
the URL shown in the location box of the browser and the actual document
shown in (one frame of) the page. Simply copying the contents of the address box
and using this as the URL for the saved document will then lead to an
inadequate attribution.
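To find the address of the document actually shown in a frame, you can look at the source of the framing page: each frame names its own document. The Python sketch below is a minimal illustration, assuming the page uses ordinary "frame" or "iframe" tags with a "src" attribute; the class and function names are made up, and pages that fill their frames via scripts will not be handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameFinder(HTMLParser):
    """Collect the full addresses of the documents shown in frames."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.frames = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Frame addresses are often relative to the framing page.
                self.frames.append(urljoin(self.page_url, src))


def frame_addresses(page_url, html_text):
    """Return one full URL per frame; an empty list means no frames found."""
    finder = FrameFinder(page_url)
    finder.feed(html_text)
    return finder.frames
```

An empty result suggests that the page does not use frames, in which case the URL in the address box is the one to record; otherwise, record the address of the frame that holds the document you saved.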