HTML? XHTML? Standards and compatibility

Run the 10 most popular sites through http://validator.w3.org/ and look at the results:

Page | Document type | HTTP Content-Type | Errors | Warnings
Google | HTML5 | text/html | 48 | 2
Facebook | XHTML 1.0 Strict | text/html | 39 | 0
YouTube | HTML 4.01 Transitional | text/html | 154 | 52
Yahoo! | HTML 4.01 Strict | text/html | 143 | 32
Windows Live | XHTML 1.0 Transitional | text/html | 0 | 0
Wikipedia | HTML5 | text/html | 1 | 2
Blogger.com | HTML 4.01 Strict | text/html | 23 | 0
Baidu.com | HTML 4.01 Transitional | text/html | 3 | 4
Microsoft Network (MSN) | XHTML 1.0 Strict | text/html | 1 | 0
Yahoo!カテゴリ | HTML 4.01 Transitional | text/html | 26 | 27

Out of these 10 pages, only 1 (Windows Live) gets its mark-up right. All the others are invalid (X)HTML. Also, the 3 pages that declare an XHTML document type are all sent with the Content-Type text/html, which means browsers will use an HTML parser to parse them instead of an XML parser.

So, what is the reason to write valid (X)HTML? The answer is simple: compatibility. A valid HTML 4.01 page is guaranteed to be rendered correctly on all user agents that support HTML 4.01. If you find a user agent that claims to support HTML 4.01 but renders a valid page incorrectly, you should immediately file a bug report. However, it is not a bug for a user agent to render invalid pages incorrectly, or not to render them at all.

Let’s move on to the next topic: HTML or XHTML? The only difference between HTML 4.01 and XHTML 1.0 is the base language: HTML 4.01 is based on SGML and XHTML 1.0 is based on XML. There are lots of advantages to using XHTML over HTML. The stricter rules of XML make an XML parser simpler than an HTML parser. In particular, HTML is like “tag soup”, whereas XML emphasises elements. In HTML, tags are not the same as elements: the closing tag, and sometimes the opening tag, can be omitted. Therefore, an HTML parser needs to determine, for every tag it processes, whether something needs to be opened or closed automatically. In XML, things are different. Tags delimit elements, no omissions are possible, and the processor is required to choke on even the simplest error. Also, you can easily mix any other XML application into an XHTML file. When you mix MathML into XHTML, you get wonderful mathematical formulæ inside the web page when it is opened by a user agent that supports both XHTML and MathML.
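
The difference in strictness can be seen with a small experiment. The following is only a sketch using Python's standard library, not the parser of any browser: `html.parser` is a lenient tokenizer that reports tags as it sees them, while `xml.etree.ElementTree` is a real XML parser that rejects input that is not well-formed. The names `TagCollector`, `html_events`, `xml_is_well_formed` and `SNIPPET` are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start and end tags in the order the HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(markup):
    """Feed markup to the lenient HTML tokenizer and return the tag events."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


def xml_is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False if it chokes."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Valid HTML 4.01 list mark-up: the closing </li> tags are omitted.
SNIPPET = "<ul><li>one<li>two</ul>"

if __name__ == "__main__":
    print(html_events(SNIPPET))         # the HTML side accepts the omissions
    print(xml_is_well_formed(SNIPPET))  # the XML side refuses the same input
```

The HTML side never complains about the missing `</li>` tags, and a full HTML parser would additionally have to work out where each `li` element ends. The XML side refuses the same input outright and only accepts it once every element is explicitly closed.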

As of 2010, it is safe to assume that all browsers render XHTML correctly except the following:

  • Trident (Internet Explorer and browsers built on the MSHTML API)
  • KHTML (KDE applications like Konqueror and Amarok)
  • Lynx

The correct MIME type for sending an HTML document is text/html, and that for sending an XHTML document is application/xhtml+xml (preferred), application/xml or text/xml. When an XHTML file is sent with different MIME types, different browsers respond differently. Worst of all, when Trident receives an application/xhtml+xml document, it opens a “Save File” dialogue because it does not recognise the MIME type. Sending the document as application/xml or text/xml does not help either, because of bugs in MSXML. KHTML goes into HTML mode when it receives an XHTML document as application/xhtml+xml, which is certainly a bug in KDE, although sending the document as application/xml or text/xml does put it into XML mode.

Although an XHTML document can be sent as text/html, doing so puts the browser into HTML mode, in which the advantages of XHTML are lost. Therefore, it is recommended that all XHTML documents be sent as application/xhtml+xml unless one of the above user agents is detected. In that case, either send the document as text/html if it is HTML-compatible, or generate an HTTP error and deny the request from the non-standards-compliant agent. Except for the error pages, my site is written in standard XHTML 1.1 that is not HTML-compatible, so I must not send the pages as text/html. When one of these non-standards-compliant agents makes a request, my PHP scripts generate an error page written in HTML 4.01 telling the user to change browser. It has already been 11 years since the release of XHTML 1.0; go and get a standards-compliant browser.
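
The decision described above can be sketched as a small function. This is a Python illustration of the logic only, not the PHP that runs on this site. The function name `choose_response`, the `NON_XHTML_TOKENS` list and the 406 status code are assumptions made for the example: the text above only says that "an HTTP error" is generated, and matching substrings of the User-Agent header is a rough heuristic.

```python
from typing import Tuple

# Substrings assumed to identify the user agents listed above (illustrative).
# "Konqueror" is matched rather than "KHTML", because WebKit-based browsers
# also send "KHTML, like Gecko" in their User-Agent strings.
NON_XHTML_TOKENS = ("MSIE", "Trident", "Konqueror", "Lynx")

XHTML_TYPE = "application/xhtml+xml; charset=UTF-8"
HTML_TYPE = "text/html; charset=UTF-8"


def choose_response(user_agent: str, html_compatible: bool) -> Tuple[int, str]:
    """Return (HTTP status, Content-Type) for a request for an XHTML page.

    html_compatible says whether the page was written so that it can also
    be parsed as HTML.
    """
    lacks_xhtml = any(token in user_agent for token in NON_XHTML_TOKENS)
    if not lacks_xhtml:
        # Standards-compliant agent: send the page as real XHTML.
        return 200, XHTML_TYPE
    if html_compatible:
        # Fall back to HTML mode and lose the XML advantages.
        return 200, HTML_TYPE
    # The page cannot be served as text/html, so refuse the request.
    # 406 is an assumed status; the error page itself is plain HTML.
    return 406, HTML_TYPE


if __name__ == "__main__":
    ie8 = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)"
    print(choose_response(ie8, html_compatible=True))
    print(choose_response(ie8, html_compatible=False))
```

A real deployment would probably also send a `Vary` header, so that caches do not hand the XHTML variant to an agent that cannot display it.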

4 Comments


  1. I am trying to use the following PHP script to deal with IE’s problem, and it works on Internet Explorer 8 and Chrome, as I write my site in an HTML-compatible way.
    I tried to follow the rules at http://www.w3.org/TR/xhtml1/#guidelines, which give guidelines for writing an HTML-compatible document.

    And in the <head> element, I added <meta http-equiv="Content-Type" content="; charset=UTF-8" />.

    On the other hand, I can’t see how your site is HTML-incompatible; one of the reasons is that the page is too long~~

    Do you think adding this to the <head> element is a wise thing to do?

    1. One of the reasons that my page is HTML-incompatible is that I use XML tools to process my XHTML documents, and the XML tools output XHTML in an HTML-incompatible way, such as <script type="text/javascript" src="function.js"/> or <td/> (an empty table cell).

      As for the meta tag, it is good to include it in HTML documents, but it is not necessary if the XHTML is only ever sent as application/xhtml+xml.

      1. Thanks.

        Now I’m playing with the funny PHP code and altering it to detect whether the browser accepts XHTML or not. Instead of checking for MSIE, I check whether the browser accepts application/xhtml+xml.

        Hope that works~

        // Serve XHTML only to browsers that list application/xhtml+xml in their Accept header.
        if (isset($_SERVER['HTTP_ACCEPT']) && strpos($_SERVER['HTTP_ACCEPT'], 'application/xhtml+xml') !== false) {
            header('Content-Type: application/xhtml+xml; charset=UTF-8');
        } else {
            header('Content-Type: text/html; charset=UTF-8');
        }