Manfred Kuechler
Hunter College

Version: 6 Jan, 2006
 

How to transfer HTML documents ("Complete Web Pages")

 
Recent versions of commonly used word processing software (like MS Word) allow you to save documents in HTML format (as "web pages") without any knowledge of HTML (Hyper Text Markup Language). For most purposes (like student papers) this automatic conversion works well enough; however, when transferring such documents (like submitting them as an attachment to an e-mail message or via the Blackboard (Bb) drop box), special issues arise. These are discussed in this document, which provides both step-by-step "how to" instructions for various software constellations and some explanation of the technical issues involved. Having at least some conceptual understanding of why certain steps are necessary is important for anyone who wants to become "information literate". A companion piece ("How to produce simple html (web) documents with WP or MS Word") discusses the steps of producing such documents in the first place.
As file extensions are very important for this discussion, I strongly suggest that you have Windows Explorer show file extensions (by default, it does not). To do this, go to "Tools"/"Folder Options", select the "View" tab and make sure that the box labeled "Hide file extensions for known file types" is not checked.

The structure of HTML documents

Here are three properties of HTML documents that are important to understand:
  1. An HTML document consists of a main (text) file plus any number of associated files (mostly files containing images like photos, charts, and graphs); each image -- regardless of size -- is stored in a separate file
  2. The main file consists of the actual (visible) text plus "markup" instructions or "tags" which are enclosed in angle brackets (<.....tag......>); the complete file (text plus tags) can be viewed in any text editor (such as Notepad) and is referred to as the "source" or the "source code". Web browsers (Netscape, MS IE, etc.) display only the actual text, formatted according to the instructions contained in the tags.
  3. The most important "tag" allows you to turn a specific string of text into a "link" (typically displayed in blue and underlined); clicking on the link leads to the display of another document (external link) or another part of the same document (internal link).
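
To make the distinction between tags, internal links, and external links concrete, here is a minimal sketch in Python (an assumption; the original document uses no programming language). It reads a small piece of HTML source and sorts the link targets into internal ones (starting with "#") and all others. The names LinkCollector and collect_links and the sample source are hypothetical and serve for illustration only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, split into internal and external."""

    def __init__(self):
        super().__init__()
        self.internal = []  # links to another part of the same document
        self.external = []  # links to another document

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return  # e.g. <a name="note1"> marks a target, it is not a link
        if href.startswith("#"):
            self.internal.append(href)
        else:
            self.external.append(href)


def collect_links(source):
    """Return (internal, external) link targets found in the HTML source."""
    parser = LinkCollector()
    parser.feed(source)
    return parser.internal, parser.external


# Hypothetical sample source: one internal link, one external link, one target.
sample = (
    '<p>See <a href="#note1">note 1</a> and '
    '<a href="http://www.example.org/">this page</a>.</p>'
    '<p><a name="note1">Note 1</a></p>'
)
internal, external = collect_links(sample)
print("internal:", internal)
print("external:", external)
```

This simple rule treats every link that does not start with "#" as external, which is sufficient for the kind of documents discussed here.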

Issues with the transfer of HTML documents

We will discuss transfers as an e-mail attachment, via the Bb drop box, and on a device like a diskette or CD-RW. In theory, it is possible to transfer all files associated with one HTML document separately, but there is a good chance that
  1. some files are overlooked and
  2. the placement (in subfolders) relative to the location of the main file is not maintained (current versions of MS Office applications do not place the associated files in the same folder as the main file).
Therefore, it is preferable to "package" all files belonging to an HTML document into one file and transfer this (consolidated, compressed, archive) file. There are two options to achieve such "packaging": the MHT approach and the ZIP approach. Obviously, for simple HTML documents with no associated files, such packaging is not necessary. However, check the section on e-mail transfers below as some e-mail programs require extra measures.
 

The MHT approach

The most convenient approach is using what Microsoft calls a "web archive" or, more technically speaking, an .mht file. This file format was developed independently of MS as a general Internet standard, but hardly any browser other than MS IE (Mozilla, Firefox, Netscape) supports this format (though I have not checked Opera and other minor web browsers).

MHT files can be created and opened via several MS software products, including MS Office XP/2002 or better, the MS Office 2000 web archive add-in, and MS Internet Explorer, but they don't all work the same -- as some have glitches and may produce flawed .mht files. The crucial aspects are:

MS Office XP/2002 (Service Pack 1) or better. This is the most convenient and glitch-free solution. After completing the document and saving it either as a .doc or .htm file, simply re-save it as "Web Archive (.mht)" by selecting this "type" from the pull-down menu (see screen shot below). If saving it as an .htm file first, do not select "Web Page, Filtered" -- or image files may not be included in the resulting .mht file.

The convenience comes at a price, though. As MS Word 2002/3 tries to establish full equivalence between its proprietary .doc format and the .htm format, all images are stored in two formats, the .jpg format traditionally used on the web and the newer .png (portable network graphics) format which is not yet supported by all web browsers. So, the resulting .mht file is larger than necessary. Also, MS Word 2002/3 (as well as MS Word 2000) does not really generate HTML code, but rather more complex XML code which can lead to other problems.
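
To check what a given .mht file actually contains (for example, whether images are stored twice), the file can be inspected as a MIME message, since the underlying Internet standard packages the main file and the associated files as parts of one multipart message. The following is a minimal sketch in Python using only the standard library. The function name list_mht_parts, the boundary "XX", and the file locations in the sample are hypothetical; a real .mht file produced by MS Word will have more parts and different headers.

```python
from email import message_from_bytes


def list_mht_parts(raw):
    """Return (content type, Content-Location, decoded size) for each part."""
    message = message_from_bytes(raw)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the container; only the leaf parts hold files
        payload = part.get_payload(decode=True) or b""
        parts.append(
            (part.get_content_type(), part.get("Content-Location"), len(payload))
        )
    return parts


# Hypothetical, hand-written stand-in for a tiny .mht file.
sample_mht = b"\r\n".join([
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/related; boundary="XX"; type="text/html"',
    b"",
    b"--XX",
    b'Content-Type: text/html; charset="us-ascii"',
    b"Content-Location: file:///C:/demo/miller_paper.htm",
    b"",
    b"<html><body>Hello</body></html>",
    b"--XX",
    b"Content-Type: image/png",
    b"Content-Transfer-Encoding: base64",
    b"Content-Location: file:///C:/demo/miller_paper_files/image001.png",
    b"",
    b"aGVsbG8=",  # placeholder bytes, not a real image
    b"--XX--",
    b"",
])

for content_type, location, size in list_mht_parts(sample_mht):
    print(content_type, location, size)
```

For a real file, the bytes would be read with open(path, "rb").read() and passed to the same function.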

MS Office 2000 web archive add-in. This "add-in" can be downloaded for free from Microsoft. The term "add-in" is somewhat misleading as it is a separate program. If it installs correctly, the right-click menu (when used with a file list in Windows Explorer or the "Open" Window in MS Word 2000) will show an additional entry "Save as web archive" when an htm file is selected (highlighted) and an additional entry "Unpack Web Archive" when an .mht file is selected. Simply selecting these options will execute the operation (and additional files will be added to the folder). As with using MS Office XP/2002/2003, do not "filter" the initial htm file.

Note that you must have MS IE as well as MS Outlook Express installed in order for this add-in to work. You do not have to actively use MS Outlook Express. Also, it is installed automatically with newer versions of MS IE -- unless you take special steps to prevent this (via a "custom installation"). More information in the MS Knowledge Base.

MS Internet Explorer (MS IE). At this point, I do not recommend using MS IE to create .mht files for HTML documents created on your computer, e.g., by opening a local .htm file and then saving it as "web archive". MS IE converts "relative" links into "absolute" links, which creates problems for "internal" links. The "tag" for an internal link is typically of the form <a href="#note1"> and is interpreted by a browser relative to the current document. Converted to an "absolute" link, the tag becomes something like <a href="file:///c:\My Documents\GSR\miller_paper.htm#note1">. Once transferred, the HTML document is not likely to be in a folder called "c:\My Documents\GSR\" again, and thus the target cannot be found.

Use MS IE and the "web archive" (mht file) feature only to archive external web pages (downloaded from the Internet), e.g., to preserve evidence from a web page likely to change. Note that the same problem with internal links occurs. However, as long as the page stays the same on the external web server and viewing takes place with a live connection to the Internet, this flaw will not become apparent.

If you don't have access to MS Word XP/2002/2003 and you can't get the "MS Office 2000 web archive add-in" to work, and thus MS IE is your only option, make sure that you save HTML documents created with MS Office as "filtered" or images may not be included. (Note that this advice is the exact opposite of what you should do when creating .mht files with either MS Word 2002/3 or the add-in.) The only way to make .mht files created with MS IE fully functional is to edit the source code afterwards and change all internal links back to relative links. This can be done with any text editor, but -- if you are not somewhat familiar with HTML source code -- you may make some unintended changes as well. So, this is a last resort only.
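
The change from absolute back to relative internal links can also be scripted, which reduces the risk of unintended changes. The following Python sketch illustrates the idea on plain HTML source text; it is not a complete tool. The function name restore_internal_links is hypothetical, and the sketch assumes that the links appear in the source exactly as in the example above. Inside an .mht file the HTML part may be stored in an encoded form (with "=" written as "=3D" and long lines wrapped), in which case the pattern would need to be adapted.

```python
import re


def restore_internal_links(source, main_file):
    """Turn href="file:///...<main_file>#target" back into href="#target".

    Only links that point into main_file itself are changed; links to
    other documents are left alone.
    """
    pattern = re.compile(
        r'href="file:///[^"#]*' + re.escape(main_file) + r'#([^"]*)"',
        re.IGNORECASE,
    )
    return pattern.sub(r'href="#\1"', source)


# The absolute link from the example above, and an external link for contrast.
absolute = r'<a href="file:///c:\My Documents\GSR\miller_paper.htm#note1">note 1</a>'
external = '<a href="http://www.example.org/page.htm#top">elsewhere</a>'

print(restore_internal_links(absolute, "miller_paper.htm"))
print(restore_internal_links(external, "miller_paper.htm"))
```

Restricting the replacement to links that end in the name of the main file is a deliberate choice: it keeps links to other local documents untouched, even though they will be broken after the transfer anyway.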
 

The ZIP Approach

Here, all files are packaged using the .zip format. There are several disadvantages to this method: you have to identify all files belonging to the document, you have to zip them with relative path information, and you need a zip utility. The last point is not much different from having to download the MS Office 2000 add-in. WinZip is a very popular utility, but it is "shareware" ($29), and while the free demo version continues to work, you get constant reminders that you are supposed to chip in. However, there are totally free alternatives.

The first task ("identifying the files") is easy to achieve when MS Word is used to create the HTML document. If the main file is called "miller_paper.htm" then all associated files are stored in a subfolder called "miller_paper_files" which is located in the same folder as the main file.

Note that you cannot simply move the associated files to the same folder as the main file, as the specific folder location is used in the underlying HTML source code. So, you must "zip with relative path information" or -- after transfer -- the unzipped document will not display correctly.
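
For readers without a zip utility, the same packaging can be sketched in Python with the standard zipfile module. This is an illustration under assumptions, not the procedure shown elsewhere in this document: the function name zip_web_page is hypothetical, and the demonstration builds a throwaway folder with a fake image file instead of a real MS Word document. The important detail is that each associated file is stored under its path relative to the folder of the main file, so the subfolder is recreated when the archive is unzipped.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_web_page(main_file, archive):
    """Zip main_file and its "<name>_files" subfolder with relative paths."""
    main_file = Path(main_file)
    base = main_file.parent
    files_dir = base / (main_file.stem + "_files")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(main_file, main_file.name)
        if files_dir.is_dir():
            for path in sorted(files_dir.rglob("*")):
                if path.is_file():
                    # Store the path relative to the main file's folder.
                    zf.write(path, path.relative_to(base).as_posix())
    return archive


# Demonstration with throwaway files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "miller_paper.htm").write_text(
        '<img src="miller_paper_files/image001.jpg">'
    )
    (root / "miller_paper_files").mkdir()
    (root / "miller_paper_files" / "image001.jpg").write_bytes(b"not a real image")

    archive = zip_web_page(root / "miller_paper.htm", root / "miller_paper.zip")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```

The recipient can unpack such an archive with any zip utility, or with zipfile.ZipFile(...).extractall() in Python, and then open the main file in a browser.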

This 4:23 min "screen movie" (a .wmv file requiring a recent version of the Windows Media Player to view) demonstrates the process in great detail -- assuming that you have WinZip installed on your computer.



Issues with E-Mail Programs (Clients)

When submitting a zip or mht file via the Bb drop box or via a device like a diskette or CD-RW, no further precautions are necessary. The recipient simply needs a zip utility (to unzip) and MS IE to view an .mht file. As to the Bb drop box, the recipient should download the file to a local computer first (via right click menu) as there is no way to unzip a zip file in the drop box on the server and .mht files may be displayed as plain text rather than as a web page (depending on Bb server configuration). But that's it.

When using e-mail, however, the sender should make sure that the file is sent as a separate attachment -- not displayed in the body of the message. Many mail programs (clients) offer a choice of keeping attachments separate or having them included with the body of your message. Some, however, (most notably AOL mail) incorporate htm (and mht) files in the body of the text, offering no choice. When using such an e-mail program, it is necessary to use a workaround by adding an extension like ".bin" or ".exe" to the actual file name. This will fool such mail programs into assuming that the file cannot be displayed in the body of the message and thus it will be sent as a genuine attachment. The recipient, then, simply has to remove the extra (fake) extension. Example: If the file "miller_paper.mht" is supposed to be sent via AOL or a similar mail program, name it "miller_paper.mht.bin" first and then add it to the message as an attachment. This, however, can create other problems as an e-mail client may be set up to refuse "executables" as attachments (and the extensions .bin and .exe are normally used for executables). So, you may want to add a fake .pdf (this will pass), but the recipient needs to be warned that this is a fake extension only and needs to be removed upon receipt. The file will not open in the Acrobat/Adobe reader. (Changing the extension does not change the internal structure of a file!).
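
The renaming step on both ends can be sketched in a few lines of Python. The helper names add_fake_extension and remove_fake_extension are hypothetical; they only manipulate the file name as text, which mirrors the point that the content of the file is left unchanged. To rename an actual file on disk, the resulting name could be passed to os.rename.

```python
def add_fake_extension(name, fake=".pdf"):
    """Sender side: append a fake extension to the real file name."""
    return name + fake


def remove_fake_extension(name, fake=".pdf"):
    """Recipient side: strip the fake extension again, if it is present."""
    if not fake or not name.lower().endswith(fake.lower()):
        return name  # nothing to remove
    return name[: -len(fake)]


sent_as = add_fake_extension("miller_paper.mht", ".bin")
restored = remove_fake_extension(sent_as, ".bin")
print(sent_as)
print(restored)
```

Sender and recipient have to agree on the fake extension in advance, since the function only removes the extension it is told to look for.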