HTML? XHTML? Standards and compatibility

Pass the 10 most popular sites to a markup validator and see the results:

Page                     Document type           HTTP Content-Type  Errors  Warnings
Google                   HTML5                   text/html          48      2
Facebook                 XHTML 1.0 Strict        text/html          39      0
YouTube                  HTML 4.01 Transitional  text/html          154     52
Yahoo!                   HTML 4.01 Strict        text/html          143     32
Windows Live             XHTML 1.0 Transitional  text/html          0       0
Wikipedia                HTML5                   text/html          1       2
(name missing)           HTML 4.01 Strict        text/html          23      0
(name missing)           HTML 4.01 Transitional  text/html          3       4
Microsoft Network (MSN)  XHTML 1.0 Strict        text/html          1       0
Yahoo!カテゴリ           HTML 4.01 Transitional  text/html          26      27

Out of these 10 pages, only 1 gets its mark-up right. All the others are invalid (X)HTML. Also, among the 4 pages that are XHTML, every one is sent with the Content-Type text/html, which means browsers will use an HTML parser to parse those pages instead of an XML parser.

So, what is the reason to write valid (X)HTML? The answer is simple: compatibility. A valid HTML 4.01 page is guaranteed to be rendered correctly on all user agents that support HTML 4.01. If you find a user agent that claims to support HTML 4.01 but renders a valid page incorrectly, you should immediately file a bug report. However, it is not a bug for a user agent to render invalid pages incorrectly, or not to render them at all.

Let’s move on to the next topic: HTML or XHTML? The only difference between HTML 4.01 and XHTML 1.0 is the base language: HTML 4.01 is based on SGML and XHTML 1.0 is based on XML. There are lots of advantages to using XHTML over HTML. The stricter rules of XML make an XML parser simpler than an HTML parser. In particular, HTML is like “tag soup”, whereas XML emphasises elements. In HTML, tags are not the same as elements: the closing tag, and sometimes the opening tag, can be omitted. Therefore, an HTML parser needs to determine, for every tag it processes, whether something needs to be opened or closed automatically. In XML, things are different. Tags delimit elements, and the processor is required to choke on even the simplest well-formedness error. No omissions are possible. Also, you can easily mix any other XML application into an XHTML file. When you mix MathML into XHTML, you get wonderful mathematical formulæ in a web page when it is opened by a user agent that supports both XHTML and MathML.
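
To make the contrast concrete, here is a minimal sketch using only Python's standard library. The fragment and the helper names (`TagCollector`, `html_events`, `xml_is_well_formed`) are my own illustrations. Note that `html.parser` is only a lenient tokenizer, not a full browser-grade HTML parser: it does not infer the implied `</p>` tags, it simply does not reject the input, whereas the XML parser refuses it outright.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# A fragment that omits the closing </p> tags, which HTML 4.01 permits.
FRAGMENT = "<div><p>first<p>second</div>"


class TagCollector(HTMLParser):
    """Records every start and end tag the lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(markup):
    """Tokenize markup leniently and return the list of tag events."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


def xml_is_well_formed(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # The HTML tokenizer accepts the fragment without complaint.
    print(html_events(FRAGMENT))
    # The XML parser chokes on the unclosed <p> elements.
    print(xml_is_well_formed(FRAGMENT))                                # False
    print(xml_is_well_formed("<div><p>first</p><p>second</p></div>"))  # True
```

The lenient side never raises an error, so any repair work is left to whoever consumes the tag events. The strict side gives a yes-or-no answer before anything is rendered.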

As of 2010, it is safe to assume that all browsers render XHTML correctly except the following:

  • Trident (Internet Explorer and browsers built on the MSHTML API)
  • KHTML (KDE applications like Konqueror and Amarok)
  • Lynx

The correct MIME type for sending an HTML document is text/html, and that for sending an XHTML document is application/xhtml+xml (preferred), application/xml or text/xml. When an XHTML file is sent with different MIME types, different browsers respond differently. Worst of all, when Trident receives an application/xhtml+xml document, it opens a “Save File” dialogue because it does not recognise the MIME type. Sending it as application/xml or text/xml does not help either, due to the buggy MSXML. For KHTML, sending an XHTML document as application/xhtml+xml puts it into HTML mode, which is certainly a bug in KDE, though sending it as application/xml or text/xml puts it in XML mode.
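
As a rough sketch of the standards-side rule only (not of any particular browser's quirks), the mapping from Content-Type to parser mode can be written as a small Python function. The names `XML_TYPES` and `parser_mode` are hypothetical, and the function covers just the four MIME types mentioned above.

```python
# MIME types that should trigger XML parsing of an XHTML document.
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def parser_mode(content_type):
    """Return "html" or "xml" for a Content-Type header value.

    Returns None when the type is neither an HTML nor an XML type.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case,
    # since MIME type names are case-insensitive.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    if mime in XML_TYPES:
        return "xml"
    return None


if __name__ == "__main__":
    print(parser_mode("text/html; charset=UTF-8"))  # html
    print(parser_mode("application/xhtml+xml"))     # xml
    print(parser_mode("image/png"))                 # None
```

The doctype plays no part in this decision. An XHTML 1.0 Strict page served as text/html ends up in the "html" branch, which is what happens to the XHTML pages in the table above.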

Although an XHTML document can be sent as text/html, doing so puts the browser into HTML mode, in which the advantages of XHTML are lost. Therefore, it is recommended that all XHTML documents be sent as application/xhtml+xml unless one of the above user agents is detected. In that case, either send the document as text/html if it is HTML-compatible, or generate an HTTP error and deny the request of the non-standards-compliant agent. My site is written in standard XHTML 1.1, which is not HTML-compatible, except for the error pages; therefore, I must not send the pages as text/html. Upon getting a request from one of these non-standards-compliant agents, my PHP scripts generate an error page written in HTML 4.01 telling the user to change browser. It has already been 11 years since the release of XHTML 1.0; go and get a standards-compliant browser.
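
The policy described above could be sketched roughly as follows. This is an illustration in Python, not the PHP that runs on my site. The User-Agent tokens, the function name `choose_response`, and the choice of status 406 for the error page are all assumptions made for the example. Note that the sketch matches "Konqueror" rather than "KHTML", because WebKit-based browsers also carry "KHTML, like Gecko" in their User-Agent strings.

```python
# Substrings assumed to identify the engines listed above. These are
# illustrative guesses and would need checking against real traffic.
LEGACY_TOKENS = ("MSIE", "Konqueror", "Lynx")


def choose_response(user_agent, html_compatible):
    """Return (status, content_type) for a request.

    html_compatible says whether the XHTML page can also be
    processed safely as text/html.
    """
    is_legacy = any(token in user_agent for token in LEGACY_TOKENS)
    if not is_legacy:
        # Standards-compliant agent: serve real XHTML.
        return 200, "application/xhtml+xml"
    if html_compatible:
        # Fall back to HTML mode for agents that cannot handle XHTML.
        return 200, "text/html"
    # Otherwise refuse, with an error page written in HTML 4.01.
    # 406 is an assumed status code; the exact error is a design choice.
    return 406, "text/html"


if __name__ == "__main__":
    firefox = "Mozilla/5.0 (X11; Linux x86_64; rv:1.9.2) Gecko/20100101 Firefox/3.6"
    ie8 = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)"
    print(choose_response(firefox, html_compatible=False))  # (200, 'application/xhtml+xml')
    print(choose_response(ie8, html_compatible=True))       # (200, 'text/html')
    print(choose_response(ie8, html_compatible=False))      # (406, 'text/html')
```

User-Agent sniffing is fragile, so the token list should be treated as something to maintain rather than a fixed fact.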