Saving and Opening HTML Files

When a Microsoft Office program opens an HTML file, the document is formatted according to the elements, attributes, and styles in the file. When the document is saved as a Web page, details such as white space and the order of elements might differ from the original file.

Styles

In Microsoft Word, Office CSS styles are used for text formatting and layout. If the user changes a style that Word created, the new style is defined in the Style element of the document.

Example

In this example, the user defined the strong style as red, Times New Roman text. 


<style><!--
strong {
color: red;
font-family: "Times New Roman";
}
...
--></style>

In Microsoft Excel, text formatting is applied to table elements.

In Microsoft PowerPoint, formatting is applied to textual content within elements. Some types of formatting, such as first line indents and margin adjustment, cannot be represented by PowerPoint. When PowerPoint encounters HTML elements that it does not use, it therefore applies only the formatting that it can represent.

id attributes

Like other attributes that are not used by Office, id attributes are preserved.

In Microsoft Word, if an element has more than one instance of the id attribute when the document is saved as a Web page, only the first instance is specified.

Nested tags

When character formatting elements contain paragraph/block-level elements, the character formatting elements are repeated for each paragraph instead.

If the HTML in the following example is opened,


<strong>
<p>Here is a paragraph.</p>
<p>Here is another.</p>
<p>Wait! There's another.</p>
<table><tr>
<td>One cell</td>
<td>Two cells</td>
</tr></table>
</strong>

the Office program specifies the following HTML when the document is saved as a Web page:


<p><b>Here is a paragraph.<o:p></o:p></b></p>
<p><b>Here is another.<o:p></o:p></b></p>
<p><b>Wait! There's another.<o:p></o:p></b></p>
<table border=0 cellpadding=0 style='mso-cellspacing:1.5pt'>
 <tr>
  <td style='padding:.75pt .75pt .75pt .75pt'>
  <p class=MsoNormal>One cell</p>
  </td>
  <td style='padding:.75pt .75pt .75pt .75pt'>
  <p class=MsoNormal>Two cells</p>
  </td>
 </tr>
</table>

Nested block-level elements

If an Office program encounters a block-level element inside another block-level element, the current element is closed before starting the new one. When the new element is closed, the parent element contains the remainder of the text span.

If the HTML in the following example is opened,


<p>This is a paragraph.
<address>This is an address inside of the paragraph.</address>
And this is the rest of the paragraph.</p>

the Office program specifies the following HTML when the document is saved as a Web page:


<p>This is a paragraph. </p>
<address>This is an address inside of the paragraph.</address>
<p class=MsoNormal>And this is the rest of the paragraph.</p>

Improperly nested elements

If the file contains elements that are improperly nested, the elements are saved in the proper nesting order. The following example shows an I start tag inside a B element.

<b>Here are <i>some HTML</b> elements</i>

When the document is saved, an I end tag is specified within the parent B element to close the I start tag, and a new I start tag is specified outside the B element to match the original I end tag.

<b>Here are <i>some HTML</i></b><i> elements</i>

If the improperly nested element has an id attribute, the identifier is preserved. This example shows an I element with an id attribute.

<b>Here are <i id=id001>some HTML</b> elements</i>

When the document is saved, the id attribute is specified.

<b>Here are <i id=id001>some HTML</i></b><i> elements</i>
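The repair described above can be approximated in a few lines. The following Python sketch is illustrative only and is not the code Office uses: the function name renest and the regular expression are hypothetical, it handles only simple inline tags, and it assumes that a reopened element carries no attributes, which matches the id example above.

```python
import re

# Matches a start tag, an end tag, or a run of text (simplified; hypothetical).
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>|([^<]+)")

def renest(html):
    """Close and reopen inline elements so that every tag nests properly."""
    out = []
    open_tags = []  # names of currently open elements, outermost first
    for m in TOKEN.finditer(html):
        closing, name, _attrs, text = m.groups()
        if text is not None:
            out.append(text)
        elif not closing:
            open_tags.append(name)
            out.append(m.group(0))  # keep the start tag and its attributes
        elif name in open_tags:
            # Locate the innermost open element with this name.
            idx = len(open_tags) - 1 - open_tags[::-1].index(name)
            inner = open_tags[idx + 1:]
            del open_tags[idx:]
            for tag in reversed(inner):   # close elements opened inside it
                out.append(f"</{tag}>")
            out.append(f"</{name}>")
            for tag in inner:             # reopen them, without attributes
                out.append(f"<{tag}>")
                open_tags.append(tag)
        # A stray end tag with no matching start tag is dropped in this sketch.
    return "".join(out)

print(renest("<b>Here are <i id=id001>some HTML</b> elements</i>"))
```

Run on the two examples in this section, the sketch reproduces the saved markup shown above, including the id attribute staying on the first I start tag only.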

HTML elements and styles

When a Web page is opened in Microsoft Word, a Word style is created. Character styles use the underlying font and the specified formatting. Paragraph styles use the Normal style and the specified formatting. Microsoft Excel and PowerPoint apply direct formatting.

When the document is saved, the style is specified using the appropriate element and style attribute.

The following table describes what a Microsoft Office program does when it encounters an element while opening a Web page, and how some Web browsers display the elements saved by Office.

Element Comments
Address Apply italics to the text contained within the element and apply a paragraph break before and after the text.
Basefont Sets a default font size for a section of the document. Only the size attribute is allowed in this element, but it is not required. If the size is not specified, the default font size is 3. If the close tag is not used, the base font size stays in effect until the end of the document is reached or another Basefont start tag is encountered.
Big The text contained in this element increases in size by one HTML font size increment.
Blockquote The text contained in this element is indented 0.25 inches on the left and right. Each nested Blockquote element adds 0.25 inches to the existing indent. Blockquote start tags increase the indent and closing Blockquote tags reduce the indent.
Center Center alignment is applied to the text and elements contained in this element. No style is created. When the document is saved, direct alignment formatting is used.
Cite The text contained in this element appears italicized. The size of the text depends on the size of surrounding text.
Code The size of the text contained in this element depends on the size of surrounding text. Note that Microsoft Internet Explorer 3.0 and Netscape Navigator 3.0 and 4.0 display slightly smaller text of the same face and color. Internet Explorer 4.0 displays Courier New text that is slightly smaller than the surrounding text.
Comment The text contained in this element does not appear in Web browsers. The text is not displayed in the Office program, but is preserved.
Dfn Not supported by Netscape Navigator 3.0 and 4.0. Microsoft Internet Explorer 3.0 and 4.0 display the text using an italic font variant.
Dd Used to create a definition list. Text contained in a Dd element nested inside a Dl element is indented 0.25 inches from the left margin and has a first line indent of 0.25 inches. The text is directly formatted and no style is created. Text contained in a Dt element is aligned with the left margin. When a Dd element appears after a Dt element, a line break is used, but a new paragraph is not created. Text contained in a Dd element that is not nested in a Dl element is rendered as a paragraph with a first line indent of 0.25 inches; it is also directly formatted, and no style is created. All other combinations of definition list tags are displayed as regular paragraphs. When the document is saved, all formatting is applied as direct margin or first line indent formatting. Note that the compact attribute is not preserved.
Dl Used in a definition list. Contains the Dd and Dt elements. Text located inside the Dl element but outside the Dd and Dt elements is displayed as if it were inside a P element.
Dt Used in a definition list. Text within a Dt element that is not nested within a Dl element is displayed as if it were inside a P element.
Iframe The text contained in this element appears in a separate window. The text is not displayed in the Office program, but it is preserved.
Italics The text contained in this element is italicized.
Kbd The size of the text contained in this element depends on the size of surrounding text. Note that Microsoft Internet Explorer 3.0 and Netscape Navigator 3.0 and 4.0 display slightly smaller text of the same face and color. Internet Explorer 4.0 displays Courier New text that is slightly smaller than the surrounding text.
Listing The text contained in this element is displayed as code and does not wrap as the window is sized. See the description of the Pre element.
Marquee Text scrolls across the screen.
Nobr Suppresses line breaks within a section of text.
Plaintext The text and elements contained in this element are displayed using Courier font. See the description of the Pre element. The closing tag is not required.
Pre The text contained in this element is displayed as code and does not wrap as the window is sized. Note that in Microsoft Internet Explorer 3.0 and Netscape Navigator 3.0 and 4.0, the text is displayed using the current font but in a slightly smaller font size. In Internet Explorer 4.0, the text is displayed using Courier font.
Q The quotation text contained in this HTML 4.0 element is indented.
Samp Text is displayed using a slightly smaller font than the current font. Internet Explorer 4.0 displays the text using Courier New in a slightly smaller font size.
Small The text contained in this element is decreased in size by one HTML font increment.
Spacer Causes a blank space to appear in the browser window.
Tt The text contained in this element is displayed using a slightly smaller font. In Internet Explorer 4.0, the text is displayed using Courier font in a size slightly smaller than the current font size.
Unknown elements The contents of the element are displayed as a paragraph in Normal style.
Var The text contained in this element is displayed as italic in the current font. In Internet Explorer 3.0, the text is displayed using the same font but in a slightly smaller size.
Wbr Specifies where a line should break. This element is typically used within a Nobr element and does not have a closing tag.
Xmp The text contained in this element is displayed as code and does not wrap as the window is sized. See the description of the Pre element.

Note that formatting is applied to the rest of the document as direct formatting. If an element has been redefined in the style sheet using CSS style attributes and the document is saved as a Web page, the element is not used and the formatting is specified as direct formatting.

The Nobr and Wbr elements are treated in the same way as unknown HTML elements in Microsoft Word and PowerPoint. In Microsoft Excel, the Nobr element disables wordwrap within a cell and Wbr enables wrapping. If there is more than one Nobr and Wbr element within a cell, the last one is used. When the document is saved in Microsoft Word or PowerPoint, the original syntax of the element is specified. In Excel, direct formatting for suppressing wordwrap is specified and the elements are not used.

For information about table elements, see the Excel Worksheets and Tables topics.

Missing close tags

The elements in the following table have optional close tags. If a close tag is missing when the file is opened, the end tag is specified when the document is saved, immediately before the first of the listed tags that is encountered.

If the start tag is    The end tag is specified before
<DD>                   <DD>, <DT>, </DL>
<DT>                   <DD>, <DT>, </DL>
<LI>                   <LI>, </UL>, </OL>, </Menu>, </Dir>
<Option>               <Option>, </Select>
<TD>                   <TD>, <TH>, <TR>, </TR>, </Table>, </Select>
<TH>                   <TD>, <TH>, <TR>, </TR>, </Table>, </Menu>, </Dir>
<TR>                   <TR>, </Table>
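As a rough illustration, the table can be treated as a lookup from an open element to the tags that force its end tag. The Python sketch below copies the table entries as listed, in lowercase. The names IMPLIED_END and closes_open_element are hypothetical, and the function expects a bare tag without attributes.

```python
# Lowercased copy of the table above: start tag -> tags that trigger its end tag.
IMPLIED_END = {
    "dd": {"<dd>", "<dt>", "</dl>"},
    "dt": {"<dd>", "<dt>", "</dl>"},
    "li": {"<li>", "</ul>", "</ol>", "</menu>", "</dir>"},
    "option": {"<option>", "</select>"},
    "td": {"<td>", "<th>", "<tr>", "</tr>", "</table>", "</select>"},
    "th": {"<td>", "<th>", "<tr>", "</tr>", "</table>", "</menu>", "</dir>"},
    "tr": {"<tr>", "</table>"},
}

def closes_open_element(open_name, next_tag):
    """Return True if next_tag forces an end tag for the open element."""
    return next_tag.lower() in IMPLIED_END.get(open_name.lower(), set())

print(closes_open_element("LI", "<LI>"))   # a new list item ends the open one
print(closes_open_element("tr", "<td>"))   # a cell does not end its row
```

A serializer built on this lookup would emit the end tag for the open element just before writing next_tag whenever the function returns True.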

If a Table element contains text outside of cell elements, cell elements are not automatically added when the document is saved. In the following example, the elements required to make the text appear inside the table are not specified. Depending on the Web browser used, the text appears outside the table.

<table border>
outside
</table> 

White space, margins, and missing paragraph end tags

The following table shows the lines of white space around block-level elements.

Element                                                    Lines of white space
Headings (H1...H6), P, Blockquote, Pre, IsIndex, Form      1
UL, OL, Dir, Menu, DL                                      1 for top-level lists, or 0 for lists inside lists
DT, DD, BR, HR, LI, Address                                0
NoFrames, NoScript, Div, Center, Fieldset                  0

The paragraph end tag is optional. However, the bottom margin of the paragraph is determined by where the end tag is used. If the paragraph end tag is not used, the default margin of a paragraph is 1 em above and 0 em below. The margin style attribute specifies the margin. In this example, the paragraph has a two-line margin.

<p style="margin: 2em">A paragraph with 2 lines around it

If the bottom margin of a block-level element and the top margin of an adjacent block-level element are positive, the white space displayed in the browser is the greater of the two. Otherwise, the white space is the sum of the two margins.
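The rule in the preceding paragraph can be written as a small function. This Python sketch is illustrative: the name collapsed_gap is hypothetical, and the margins are plain numbers in any single unit, such as em.

```python
def collapsed_gap(bottom_margin, top_margin):
    """White space between two adjacent block-level elements."""
    if bottom_margin > 0 and top_margin > 0:
        # Both margins positive: the larger one wins.
        return max(bottom_margin, top_margin)
    # Otherwise the margins are added together.
    return bottom_margin + top_margin

print(collapsed_gap(1, 2))     # both positive
print(collapsed_gap(2, -0.5))  # one negative
```

Two paragraphs with margins of 1 em and 2 em are therefore separated by 2 em, not 3 em.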

When the document is saved, paragraph end tags are used. If the end tag is missing from the original document, an end tag is added before the start tag of any block-level elements in the preceding table.

Missing close tags for block-level elements in table cells

If a table cell contains block-level elements whose close tags are missing, the document is saved with the end tags applied before the end tag of the cell.

Empty elements

To reduce file size, the following empty elements are not specified when the document is saved: B, Big, Blink, Center, Cite, Code, Comment, DFN, EM, Font, I, KBD, Pre, Q, S, Samp, Small, Span, Strike, Strong, Sub, Sup, TT, U, Var, and Xmp.

An element is empty if it contains no text, spaces, or other non-empty elements. An empty element can have attributes.
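The definition is recursive, which a short sketch makes explicit. The Python below is a hypothetical illustration that parses a well-formed fragment with the standard xml.etree.ElementTree module. The function name is_empty is an assumption, and real HTML would need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

def is_empty(el):
    """True if the element holds no text, no spaces, and only empty children."""
    if el.text:                      # any text, including a single space
        return False
    for child in el:
        if child.tail or not is_empty(child):
            return False
    return True                      # attributes do not affect the result

print(is_empty(ET.fromstring('<b id="x"></b>')))
print(is_empty(ET.fromstring('<b> </b>')))
```

The first element is empty even though it has an id attribute. The second is not, because it contains a space.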

Order of Body and Head elements

In an HTML file, the Head element can be specified after the Body element. However, when the document is saved, the head always precedes the body.

Uppercase HTML tags

When a document is saved, all HTML tags and attributes are specified in lowercase characters. For elements and attributes that are not used, the capitalization is preserved.

Line spacing

Office automatically uses one of six line-spacing styles when it specifies an element.

Style 1: Text and elements are in the same line.

....here is some <b>bold</b> text... 

Style 2: One or more instances of an element appear in the parent element.

<ul> 
<li>here is a list element</li> 
<li>and here is another</li>
</ul>

Style 3: Blank lines appear between elements.

...example above.</p> 

<h1>In Conclusion</h1> 

<p>So, we see that... 

Style 4: Various elements appear in the parent element.

<table> 
<tr> 
<td>a cell</td> 
</tr> 
</table> 

Style 5: Blank lines appear around the parent element.

...Office 9.</p> 

<table> 
<tr> 
<td>a cell</td> 
</tr> 
</table> 

<p>In the table above... 

Style 6: No line break before the BR tag, but one after.

...Office 9.</p> 

<p>This is a paragraph<br> 
that has several breaks<br> 
in the middle</p> 

<p>In conclusion...

The following table shows the style used to specify an element.

Element Style
A, Acronym, B, Big, Button, Cite, Code, Comment, Del, DFN, Em, Font, I, Ignore, Img, Input, Ins, KBD, Listing, Nobr, Pre, Q, S, Samp, Small, Span, Strike, Strong, Sub, Sup, Textarea, TT, U, Var, Wbr, Xml, Xmp, any tags not used by Microsoft Office 1
Area, Base, Basefont, Caption, Col, Colgroup, DD, DT, Frame, Legend, Li, Link, Meta, Option, Plaintext, TD, TH, Title 2
Address, Bgsound, Blockquote, Center, comments (<!--...-->), Div, H1...H6, Marquee, P, Spacer 3
Html, Noframes, TR 4
Body, Dir, DL, Fieldset, Form, Frameset, Head, HR, Iframe, Layer, Map, Menu, Object, OL, Script, Select, Style, Table, UL 5
BR 6

If two style 3 or style 5 elements are next to each other, only one blank line is specified between them, not two.

If contained in other elements, style 5 elements are implemented the same way as style 4 elements.

Wordwrap

A line feed is specified if the text is longer than 80 characters and if the line can be broken. The line feed is specified between words in the content. If the line must be broken inside an element, the line feed is specified between attributes or in spaces within the attribute values. The following example shows two line feeds specified in an attribute value.

<td style="border-top: 0.0in none rgb(0,0,0); border-left: 
0.0in none rgb(0,0,0); border-bottom: 0.0in none rgb(0,0,0); 
border-right: 0.0in none rgb(0,0,0)"> 
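The wrapping rule can be approximated with Python's standard textwrap module. This is a sketch, not the Office implementation: the function name wrap_saved_line is hypothetical, and unlike the example above, textwrap drops the space at each break instead of leaving it at the end of the line.

```python
import textwrap

def wrap_saved_line(line, width=80):
    """Break a line at spaces so that no piece exceeds width, where possible."""
    pieces = textwrap.wrap(
        line,
        width=width,
        break_long_words=False,   # a run with no spaces is left unbroken
        break_on_hyphens=False,   # break only at white space
    )
    return "\n".join(pieces)

style = " ".join(["border-top: 0.0in none rgb(0,0,0);"] * 4)
print(wrap_saved_line(style))
```

A run of text with no spaces that is longer than 80 characters is left on one line, which corresponds to the condition that the line can be broken.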

Indenting lines

The following elements are indented one space from the current indent level: TR, Dir, DL, Form, Frameset, Layer, Map, Menu, Object, OL, Select, Table, and UL. A line can be indented up to 40 characters.

In the following example, the table elements are indented.

<table>
 <tr>
  <td>First cell</td> 
  <td>Second Cell</td>
 </tr>
</table>

Indenting list items

Each line after the first line of a list element is indented 4 spaces from the current indent level. In this example, the list item is indented.

<li><font size="2" face="Verdana">Changes to HTML pages we 
    automatically generate (the table of contents pane, for 
    example)</font></li>
<li><font size="2" face="Verdana">Other changes</font></li>

Formatting and indenting CSS definitions

If the definition has only one style attribute and value, it is specified on one line. A space is added after the name, opening brace, colon, semicolon, and closing brace. The following examples show style definitions.

h1 { color: blue; } 
body { background: url(texture.gif) white; } 
.pastoral { color: green; } 

If the definition specifies more than one style attribute and value, each pair is specified on a line. This example shows a paragraph formatting definition.

p { 
    font-size: 12pt; 
    float: left; 
} 

The semicolon is always specified after the last attribute and value pair.
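The layout rules in this section can be sketched as a small formatter. The Python below is illustrative only: the function name format_rule is hypothetical, and it omits the trailing spaces that the rules place after braces and semicolons at the end of a line.

```python
def format_rule(selector, declarations):
    """Format one CSS definition from a list of (attribute, value) pairs."""
    if len(declarations) == 1:
        name, value = declarations[0]
        return f"{selector} {{ {name}: {value}; }}"   # single pair: one line
    lines = [f"{selector} {{"]
    # Several pairs: one per line, each ending with a semicolon.
    lines += [f"    {name}: {value};" for name, value in declarations]
    lines.append("}")
    return "\n".join(lines)

print(format_rule("h1", [("color", "blue")]))
print(format_rule("p", [("font-size", "12pt"), ("float", "left")]))
```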

Capitalization of CSS style definitions

The definitions, CSS style attribute names, and values are specified in lowercase. The case of class names, URLs, anything in quotes, and anything not used by Microsoft Office programs is preserved. The definitions inside the Style element are placed in a comment.

Quoting values

A CSS value is put in quotation marks if it has a space in the value or if it is longer than 256 characters.

Double quotation marks in the attribute value are specified as "\0022", and single quotation marks are specified as "\0027". In this example, double quotation marks surround the date value.

msoffice-field-code: "DATE \0022MM\/DD\/YY\0022" 

The forward slash is specified as a backward slash followed by a forward slash. 
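The quoting and escaping rules can be combined in one helper. This Python sketch rests on assumptions that the text does not settle: the escapes are applied whether or not the value ends up quoted, and the 256-character limit is measured before escaping. The function name quote_css_value is hypothetical.

```python
def quote_css_value(value):
    """Escape quotation marks and forward slashes, then quote if required."""
    escaped = (
        value.replace('"', r"\0022")   # double quotation mark
             .replace("'", r"\0027")   # single quotation mark
             .replace("/", r"\/")      # forward slash
    )
    if " " in value or len(value) > 256:
        return f'"{escaped}"'
    return escaped

print(quote_css_value('DATE "MM/DD/YY"'))
```

For the date field shown above, the sketch produces the same quoted value as the example.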

style attribute

Unlike other attribute values, which are specified in double quotation marks, inline CSS styles are specified in single quotation marks. This allows a double quotation mark to appear inside the attribute value as a single character instead of "\0022".

For a single attribute, one space is added after the colon.

<span style='border: solid'> 

If more than one attribute is specified, one space is added after each colon and semicolon.

<span style='border: solid; font: arial'> 
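The spacing rules for the style attribute amount to a simple join. The following Python sketch is a hypothetical helper (the name inline_style is an assumption) that takes (attribute, value) pairs and produces the single-quoted attribute.

```python
def inline_style(declarations):
    """Build a single-quoted style attribute from (attribute, value) pairs."""
    # One space after each colon, and one after each semicolon between pairs.
    body = "; ".join(f"{name}: {value}" for name, value in declarations)
    return f"style='{body}'"

print(inline_style([("border", "solid"), ("font", "arial")]))
```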

XML, script, and preformatted text

Unlike HTML, white space is significant in XML. Blocks of XML that are not used by Microsoft Office programs and the white space in those blocks are preserved. Likewise, the formatting of script blocks and the content of Pre elements are also preserved.

Styles not used by Microsoft Office

Style names and attributes that are not placed by Microsoft Office programs are preserved. When a document is saved, these styles are specified after the styles used by Office programs. Unused @ keywords are placed after the @ keywords that must start the block of style definitions. The style definitions used by Office follow, and then the unused style definitions.

If an invalid style value is encountered when the file is opened, the value is either preserved or the value used by the Office program is specified. If a priority is specified after a style attribute value, the priority is not preserved when the document is saved. Comments in the style element are also not preserved.

For more information about how Web browsers handle CSS properties and values, see the CSS Level 1 Recommendation.

Attributes not used by Microsoft Office

Attributes that are not used by Microsoft Office programs are preserved as long as the parent element is specified.

In Microsoft PowerPoint and Microsoft Word, the attributes in consecutive elements are combined if the elements are combined. In the following example, the two B elements have id attributes.

...some text <b id=1><i><b id=2>hi there!</b></i></b>... 

When the document is saved, the attributes are specified in a single B element.

...some text <b id=1 id=2><i>hi there!</i></b>... 

If an invalid attribute value is encountered when the file is opened, the value is either preserved or the value used by the Office program is specified. 

Tags not used by Microsoft Office

Tags and comments that are not used by Microsoft Office programs are preserved as long as the parent element can contain elements. Unused tags are usually written at the end of the parent element.

Any content that appears before or after the Html element is preserved. If the Html start and end tags are absent, the Html element is assumed to contain the entire document. If the Head and Body elements are absent, the Body element is assumed to contain everything in the Html element. If the Head element has been specified but not the Body element, everything in the Html element after the Head element is assumed to be in a Body element.

If a Body or Frameset element has been specified, but the Head element is absent, everything in the Html element before the Body or Frameset element is assumed to be in a Head element. Content within the Html element but outside the Head, Body, and Frameset elements is preserved.

If unused elements have been specified in a Frameset, Head, or Map element, those elements are preserved and placed at the end of the parent element when the document is saved. If Head is the parent element, the textual content of those elements is not preserved.

Images

Images are included in a page using the Img element and an associated image file. When a document is opened, the image file is linked. If the image is a hyperlink, the hyperlink is preserved.

src attribute

The src attribute can contain a URL, UNC, or local file path and file name. The path can be absolute or relative. If the path is relative, the URL specified in the Base element is used. If the Base element is absent, the URL of the document is used as the base path.
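For URLs, the base-path rule matches standard relative URL resolution. The Python sketch below uses urljoin from the standard library. The function name resolve_src and the example.com addresses are hypothetical, and UNC and local file paths are not handled.

```python
from urllib.parse import urljoin

def resolve_src(src, document_url, base_href=None):
    """Resolve an image src against the Base element URL, else the document URL."""
    return urljoin(base_href or document_url, src)

page = "http://example.com/docs/page.htm"
print(resolve_src("img/a.gif", page))
print(resolve_src("a.gif", page, base_href="http://example.com/images/"))
```

An absolute src value is returned unchanged, because urljoin ignores the base when the second argument is already absolute.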

If the src attribute is absent, a picture placeholder is created when the document is opened in a Microsoft Office program.

height and width attributes

If both height and width attribute values have been specified, the image is scaled. If both values are absent, the actual size of the image is used. If only one value has been specified, the other value is calculated so that the aspect ratio of the picture is preserved. Both height and width attributes are specified when the document is saved.
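The sizing rules can be sketched as follows. The function name display_size is hypothetical, sizes are in pixels, and rounding the calculated value to the nearest pixel is an assumption.

```python
def display_size(natural_w, natural_h, width=None, height=None):
    """Return the (width, height) at which an image is displayed."""
    if width is not None and height is not None:
        return width, height                                   # scaled as given
    if width is not None:
        return width, round(natural_h * width / natural_w)     # keep aspect ratio
    if height is not None:
        return round(natural_w * height / natural_h), height   # keep aspect ratio
    return natural_w, natural_h                                # actual image size

print(display_size(200, 100, width=100))
```

For a 200 by 100 pixel image with only width=100 specified, the calculated height is 50, and both values would be written when the document is saved.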

alt attribute

The alternate text for an image is preserved.

usemap and ismap attributes

A mapping table is created containing all the Map elements, and a shape is created for each Area element in a Map element. Different images can reference the same Map element; however, a separate Map element is specified for each image when the document is saved. The name and any id attributes of the Map element are not preserved. 

For floating images, the ismap attribute is preserved. If both ismap and usemap attributes have been specified, the usemap attribute is used.

align attribute

In Microsoft Word, images that are aligned to the left or right are opened as floating images, while other images are opened as inline images. Attribute values other than left and right are preserved. Microsoft Excel and Microsoft PowerPoint do not support inline images or text wrapping around images. In Excel and PowerPoint, an image is opened as a floating picture object and the surrounding text does not wrap around the picture.

hspace and vspace attributes

In Word, the hspace and vspace attributes specify the spacing around the image in pixels. Excel and PowerPoint do not support text wrapping around images.

dynsrc, controls, loop, and start attributes

In Word, images that have a dynsrc attribute are opened using an ActiveX control instead of being imported as a picture. If the controls, loop, or start attributes have been specified without a dynsrc attribute, they are preserved. These attributes are not supported by the other Microsoft Office programs.

Area element

When a document containing an image map is opened, an Office Art shape is created for each Area element. The shape has a hyperlink and is sized and positioned on top of the picture object to represent an image map. The shape has no border or fill, so it is not visible.

shape and coords attributes

The shape attribute contains one of the following string constants specifying the shape: rect, rectangle, circ, circle, poly, or polygon. The value of the coords attribute depends on the type of shape. For rectangles, the coords attribute contains left, right, top, and bottom values. For circles, the coords attribute contains centerx and centery values. For polygons, the coords attribute contains x1, y1, x2, and y2 values. The values are specified in pixels.

In Microsoft Word, shapes are anchored immediately before the image by means of character-relative anchoring.

href and nohref attributes

If the href attribute has been specified, a hyperlink is created for the shape. If both href and nohref are specified, the href attribute is used and the nohref attribute is ignored.

title attribute

If the title attribute is specified, a ScreenTip is created for the shape.

Background color and images

The bgcolor attribute of the Body element specifies the background color of a page, sheet, or slide, and the fill color of Office Art background shapes on the page. If the background-color style attribute is specified in the Body element, the style value is used instead of the bgcolor attribute value, and when the document is saved, the bgcolor attribute is not preserved.

The background attribute of the Body element specifies the image used as the background texture fill of a page, sheet, or slide. In Microsoft Excel, a device-dependent bitmap is created from the image. If the background style attribute is specified in the Body element, the style value is used instead of the background attribute value, and when the document is saved, the background attribute is not preserved.

In Excel, if both bgcolor and background attributes are specified, the background attribute is used and the bgcolor attribute is ignored. If the bgcolor attribute is specified but not the background attribute, the fill color of all cells is set to the value of the bgcolor attribute. If the background attribute is specified but not the bgcolor attribute, the specified image is embedded and tiled across the sheet background.
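The Excel precedence rule can be summarized in a short decision function. This Python sketch is illustrative: the function name and the returned labels are hypothetical and do not correspond to any Office API.

```python
def excel_sheet_background(bgcolor=None, background=None):
    """Decide which Body attribute determines the Excel sheet background."""
    if background:
        # The image wins whenever it is present; bgcolor is ignored.
        return ("tiled-image", background)
    if bgcolor:
        # Only a color: it becomes the fill color of all cells.
        return ("cell-fill", bgcolor)
    return None

print(excel_sheet_background(bgcolor="#ffffcc", background="texture.gif"))
```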

The background, background-color, and background-tile style attribute values take precedence over the corresponding element attribute values. When the document is saved, the element attributes are specified, not the style attributes. The color of the default master cell is specified in the bgcolor attribute. The mso-ignore: bgcolor style is also specified, along with a style class designating the default master cell style. If the background is tiled, the bgcolor attribute specifies the color white and the background attribute contains the name of the image file. The mso-ignore:bgcolor style is also specified.

HR tags

In Microsoft Word, the HR tag produces a horizontal line object. In Excel and PowerPoint, the HR tag is ignored and not preserved.