Characters on the web

The web standards require every document to declare the character set it is written in. The browser needs this to interpret the document and to know what to display when presented with any particular code. This requirement is often misunderstood. What if my page needs characters that are not in the set: do I have to specify a different charset? What if I need a wide variety of characters: do I have to get a special editor?

There are three different but related things to understand here: Character Set, Character Encoding and Font.

  • The Character Set used throughout the web is Unicode (also known as UCS or ISO-10646). It is rich enough to contain the majority of characters required by the languages of the world, which means it contains many thousands of characters. If you need characters that are not in this set then you will need to look at other methods such as inline images.
  • The Character Encoding is how the character set is represented in the document sent to the browser. There are a number of encodings designed for various language groups, e.g. ISO-8859-1 contains a subset suitable for most Western European languages, ISO-8859-5 covers Cyrillic and EUC-JP is suitable for Japanese. Some encodings are compact, using a single byte per character; others use multiple bytes and are thus able to encode a larger number of characters at the expense of bulkier documents.

    BEWARE: Many computer systems (e.g. MS Windows and Apple Mac) use non-standard encodings that place characters at positions which other web users' systems cannot interpret. Watch particularly for “smart quotes”, some of the lesser-used punctuation and accented characters.

  • The Font specifies how the characters look, i.e. how they are represented on the screen or page. Some fonts are very basic and only allow for a small range of characters; others are quite comprehensive.
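
The difference between a character and its encoding can be seen with a few lines of Python. This is only an illustrative sketch: the sample characters are arbitrary choices, and `cp1252` is Python's name for the Windows-1252 encoding, used here as one example of the "smart quotes" trap mentioned in the warning.

```python
# One character, several encodings: the bytes differ, the character does not.
ch = "\u00e9"                                  # é
latin1_bytes = ch.encode("iso-8859-1")         # b'\xe9'      (one byte)
utf8_bytes = ch.encode("utf-8")                # b'\xc3\xa9'  (two bytes)

# A Cyrillic letter is outside ISO-8859-1 but inside ISO-8859-5.
cyrillic = "\u0416"                            # Ж
try:
    cyrillic.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
cyrillic_bytes = cyrillic.encode("iso-8859-5")  # a single byte

# The "smart quotes" trap: bytes 0x93 and 0x94 are curly quotes in
# Windows-1252 but unprintable control codes in ISO-8859-1.
raw = b"\x93quoted\x94"
as_windows = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

print(latin1_bytes, utf8_bytes, fits_latin1)
print(repr(as_windows), repr(as_latin1))
```

The same two bytes give readable quotes under one encoding and control codes under the other, which is why the declared encoding has to match what the editor actually saved.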

The chosen font and the character encoding may not encompass the same subset of characters. To fill this gap, the (X)HTML language allows for Character Entities or References. These can be symbolic, e.g. &eacute; (é) or &mdash; (—). There is a complete list in Character entity references in HTML 4, and modern browsers recognize most of them. For characters not included in the list, or for acceptance by older browsers, numeric references can be used, e.g. &#8230; (ellipsis …). The number refers to the absolute position of the character in the Unicode set (in decimal or hex).
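
To see how symbolic and numeric references relate to Unicode positions, here is a small sketch using Python's standard `html` module. The helper name `to_numeric_reference` is made up for this illustration.

```python
import html
from html.entities import name2codepoint

# Symbolic references resolve to ordinary Unicode characters.
symbolic = html.unescape("&eacute; &mdash;")

# Numeric references give the Unicode position in decimal or in hex.
decimal_form = html.unescape("&#8230;")
hex_form = html.unescape("&#x2026;")

# The symbolic name is just a label for a Unicode position.
mdash_position = name2codepoint["mdash"]


def to_numeric_reference(ch):
    """Build a decimal numeric reference for one character (hypothetical helper)."""
    return f"&#{ord(ch)};"


# Python can also do this for every character an encoding cannot hold.
ascii_safe = "caf\u00e9\u2026".encode("ascii", "xmlcharrefreplace")

print(symbolic, decimal_form, hex_form, mdash_position)
print(to_numeric_reference("\u2026"), ascii_safe)
```

The `xmlcharrefreplace` error handler is handy when a document must stay in a small encoding: anything that does not fit is written as a numeric reference instead of raising an error.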

All of these are in the character set (Unicode), but it is the responsibility of the author to specify a font that contains all the characters in the document, whether written natively or as entities, and to make sure that the user is going to have that font available. The generic serif and sans-serif fonts generally allow for the widest variety of characters, but not necessarily all.

When choosing a character encoding for your document, it is best to choose one that natively includes the majority of the characters you need, so that as few special characters as possible have to be represented by entities, and one that is supported by the editor that you use to create the document.
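
One rough way to compare candidate encodings is to count which characters of a sample text would have to fall back to entities in each. The sketch below assumes a made-up sample string and a hypothetical helper, `chars_needing_entities`; it is not a complete selection procedure.

```python
def chars_needing_entities(text, encoding):
    """Return the set of characters in text that the encoding cannot hold natively."""
    missing = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.add(ch)
    return missing


# Sample text: accented letter, curly quotes, ellipsis and a Cyrillic letter.
sample = "Caf\u00e9 \u201cquoted\u201d \u2026 \u0416"

for candidate in ("iso-8859-1", "iso-8859-5", "utf-8"):
    missing = chars_needing_entities(sample, candidate)
    print(candidate, len(missing), sorted(missing))
```

For this sample, each of the two single-byte encodings leaves four characters to be written as entities, while UTF-8 holds everything natively.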

Confusingly, the server tells the browser which encoding is in use through a parameter named charset in the HTTP headers, e.g.

Content-Type: text/html; charset=ISO-8859-1
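
On the receiving side, the charset value has to be pulled out of that header before the body bytes can be decoded. A minimal sketch, using Python's standard `email.message` parser to read the parameter; the function name `charset_from_content_type` and the sample body bytes are assumptions for this example.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


charset = charset_from_content_type("text/html; charset=ISO-8859-1")
no_charset = charset_from_content_type("text/html")

# Decode the body bytes with whatever the header declared.
body = b"caf\xe9"
text = body.decode(charset)

print(charset, no_charset, text)
```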

This can often be set by adjusting the .htaccess file on the server (with Apache, for example through the AddDefaultCharset directive), but if that is not possible you will need to include a meta tag very early in the data stream of every document (before any content that requires encoding), e.g.

<html>
<head>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />
<title …
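
Why the meta tag must come early becomes clearer from the reader's side: the encoding has to be found before the bytes can be decoded, so only the raw start of the document can be scanned. The sketch below is a simplification, not how any particular browser does it; the function name `sniff_meta_charset`, the regular expression and the 1024-byte limit are all assumptions made for illustration.

```python
import re

# Matches a charset declaration inside a <meta ...> tag, in raw bytes.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(data, limit=1024):
    """Look for a meta charset in the first `limit` bytes (assumed limit)."""
    match = META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii") if match else None


page = (
    b"<html>\n<head>\n"
    b"<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />\n"
    b"<title>Example</title>\n"
)
late_page = b"<html><head>" + b" " * 2000 + page

print(sniff_meta_charset(page))
print(sniff_meta_charset(late_page))
```

A declaration pushed too far down the file is simply never seen by this scan, which is the practical reason for placing the tag first inside the head.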

More detail about all this can be found at HTML Document Representation.
