Archive for the ‘Technical’ Category

Characters on the web

10 Oct 2005 16:27 by Rick

The web standards require every document to declare which character set it is written in. This is necessary for the browser to understand the document and to know what to display when presented with any particular code. The requirement is often misunderstood. What if my page needs characters that are not in the set? Do I have to specify a different charset? What if I need a wide variety of characters? Do I have to get a special editor?

There are three different but related things to understand here: the Character Set, the Character Encoding and the Font.

  • The Character Set used on all the web is Unicode (also known as UCS or ISO-10646). This is sufficiently rich to contain the majority of characters required by the languages of the world. However, being so rich, it contains many thousands of characters. If you need characters that are not in this set then you will need to look at other methods such as inline images.
  • The Character Encoding is how the character set is represented in the document sent to the browser. There are a number of encodings designed for various language groups, e.g. ISO-8859-1 contains a subset suitable for most Western European languages, ISO-8859-5 for Cyrillic, and EUC-JP is suitable for Japanese. Some encodings are compact, using a single byte per character; others use multiple bytes and so can encode a larger number of characters, at the expense of bulkier documents.

    BEWARE: Many computer systems (MS Windows and Apple Mac) use non-standard encodings that place characters at positions which cannot be understood by other web users. Particularly watch “smart quotes,” some of the lesser-used punctuation marks and accented characters.

  • The Font specifies how the characters look: how they are represented on the screen or page. Some fonts are very basic and only allow for a small range of characters; others are quite comprehensive.
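
To see the size trade-off between encodings in practice, here is a small Python sketch (my illustration, not from the original post) encoding the same text in two of the encodings mentioned:

```python
# The same four characters take different numbers of bytes
# depending on the encoding chosen.
text = "café"

latin1 = text.encode("iso-8859-1")   # one byte per character
utf8 = text.encode("utf-8")          # "é" needs two bytes in UTF-8

print(len(latin1))  # 4
print(len(utf8))    # 5

# Characters outside an encoding's repertoire simply cannot be encoded:
try:
    "Здравствуйте".encode("iso-8859-1")
except UnicodeEncodeError:
    print("not representable in ISO-8859-1")
```

This is why a document full of, say, Cyrillic text is better served by an encoding that contains those characters natively than by ISO-8859-1 plus thousands of entities.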

The chosen font and the character encoding may not encompass the same subset of characters. To fill this gap, the (X)HTML language allows for Character Entities or References. These can be symbolic, e.g. &eacute; (é) or &mdash; (—). There is a complete list in Character entity references in HTML 4 and modern browsers recognize most of them. For ones not included in the list, or for acceptance by older browsers, numeric entities can be used, e.g. &#8230; (ellipsis …). The number refers to the absolute position of the character in the Unicode set (in decimal or hex).
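
The numeric entities map directly onto Unicode code points, which a quick Python sketch (an illustration, not part of the original post) can demonstrate:

```python
import html

# html.unescape converts entity references to the characters they name.
print(html.unescape("caf&eacute;"))   # café (symbolic entity)
print(html.unescape("&#8230;"))       # … (decimal code point 8230)
print(html.unescape("&#x2026;"))      # the same character, in hex

# The number in a numeric entity is just the Unicode code point:
print(ord("…"))  # 8230
```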

All of these are in the character set (Unicode), but it is the responsibility of the author to specify a font which contains all the characters in his document, whether as native encodings or as entities, AND to remember that the user must have that font available. The generic serif and sans-serif fonts generally allow for the widest variety of characters, but not necessarily all.

When choosing a character encoding for your document, it is best to choose one that natively includes the majority of the characters you need, so that the minimum number of special characters have to be represented by entities, and also to choose one that is supported by the editor you use to create the document.

Confusingly, this encoding is specified by the server to the browser using the charset value in the HTTP headers e.g.

Content-Type: text/html; charset=ISO-8859-1

This can often be set by adjusting the .htaccess file on the server (with Apache) but if that is not possible then you will need to include a meta tag very early in the data stream of every document (before any content that requires encoding) e.g.

<html>
<head>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8' />
<title …
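
For the .htaccess route mentioned above, the usual Apache directive is AddDefaultCharset (a sketch only; whether it is permitted in .htaccess depends on your host's AllowOverride settings):

```apache
# Send "Content-Type: text/html; charset=UTF-8" on text responses.
AddDefaultCharset UTF-8
```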

More detail about all this can be found at HTML Document Representation.

Where have all the Pages gone

7 Oct 2005 12:57 by Rick

A couple of weeks ago I ran a long-overdue “broken link check” on West Penwith Resources. I do this using the excellent Xenu program, which spiders the site and interprets all the HTTP error codes. I know it was over a year since I last did it, but I was still horrified to discover 25% of the external links broken. There were a few internal bugettes, typos etc., but in the process of fixing the others I came up with these types…

  • Web site totally gone, no trace anywhere. I suppose funding may have dried up, companies gone to the wall and, not uncommon among genealogists, owner gone to be with their ancestors.
  • Web site moved to another domain. This is understandable if moving up a gear from a freebie ISP host to a “proper job” custom domain name, though a redirect, at least on the home page, would be a good idea. I moved about 3 years ago, I think, and only recently have links to the old site stopped turning up. My redirect stayed up for most of that time. Big sites change domain when their department name changes or they are taken over. If they must do it, they ought to have the know-how and resources to map each page to its equivalent on the new site.

    Moving from one ISP site to another makes less sense and makes you wonder if the makers are serious about their site at all. Even if you don’t go to the expense of a personal domain name, there are plenty of free hosts available, especially in the genealogy world.

  • Individual pages moved and/or changed name. It is not just the amateurs who do this; some big corporate and government sites do it as well. Now listen up, site owners: this is a BAD IDEA. Deep linking to internal pages is a fact of life on the web, so think out your structure well in advance and stick with it. We all make mistakes, and I have a few pages in odd places myself, but you need to think very carefully before moving anything.
  • A final curse—those sites that die and are taken over by resellers, link farms and other even less desirable content. Good though it is, Xenu can’t always spot these if they don’t use a redirect. The only way to find them is to actually visit your links occasionally. I am fortunate that my site is a working resource so I am using it every day.
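
As an aside, the idea behind a link checker is simple enough to sketch. The following Python fragment (an illustration only, nothing to do with Xenu’s actual code; `example.com` is just a placeholder) extracts links from a page and reports their HTTP status:

```python
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(page_html):
    parser = LinkExtractor()
    parser.feed(page_html)
    return parser.links

def check_link(url, timeout=10):
    """Return the HTTP status code for a URL, or None if unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as e:
        return e.code          # e.g. 404 for a broken link
    except URLError:
        return None            # site totally gone

print(extract_links('<p><a href="http://example.com/">x</a></p>'))
```

Note the last failure mode in the list above: a page that has been taken over by a link farm still answers 200, which is exactly why a status-code check alone cannot catch it.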

What time is it

5 Oct 2005 10:12 by Rick

I am posting this entry at exactly 10:00 BST = 09:00 UTC, from the Office, where I know the clocks are accurate and reliable (I wrote the piece much earlier). As you will observe from the time of posting above (or below), the Web Host server clocks are far from accurate. Not as far out as I posted on Monday, because I misread the time zone charts (PDT is only seven hours behind UTC, not eight as I thought), but still pretty bad.

Why should this matter? Well, for a start, it looks very silly if the clock on your screen is not right, and you miss appointments. It is also inconvenient: for example, you get emails arriving before they are sent, and FTP “upload if newer” features don’t work properly.

But there are much more important reasons. If you are trying to diagnose a problem then you have to be able to rely on the time stamps in system logs, particularly for network-related issues where more than one machine is involved. I am trying to work with the service provider to solve a problem with some emails not getting through. I can send some test mails through at a particular time and they need to look at their logs to see what happened to them. If the time stamps are not right they do not stand a chance of locating my test mail among the many thousands passing through their server. If they ever have to use their logs in a legal case, e.g. prosecuting a cracker, then the evidence will be (or should be) thrown out of court if they can’t demonstrate that the clocks were accurate. Perhaps (sly grin) this is why they keep it wrong, so they can’t be forced to produce email evidence for criminal cases?

There is really no excuse for incorrect clocks: there is a perfectly good, free-to-use system called NTP which can synchronise to any number of publicly available reference systems and, for an internal network, it is straightforward to set up a cascade so that even the humble desktop can be kept within a few milliseconds of the mythical “True Time.” For Windows systems, if you should be so afflicted, there are compatible implementations available.
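
At the client end the protocol is simple. Here is a minimal SNTP query sketch in Python (an illustration of the idea only; `pool.ntp.org` is a commonly used public server pool, given here as an example):

```python
import socket
import struct
import time

# NTP counts seconds from 1900-01-01; Unix time from 1970-01-01.
NTP_EPOCH_OFFSET = 2208988800

def ntp_to_unix(ntp_seconds):
    """Convert an NTP timestamp (seconds since 1900) to Unix seconds."""
    return ntp_seconds - NTP_EPOCH_OFFSET

def sntp_time(server="pool.ntp.org", port=123, timeout=5):
    """Ask an NTP server for the current time; returns Unix seconds."""
    # First byte 0x1B = leap indicator 0, version 3, mode 3 (client).
    packet = b"\x1b" + 47 * b"\x00"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, port))
        data, _ = sock.recvfrom(512)
    # Transmit timestamp: whole-seconds field is bytes 40-43 of the reply.
    ntp_seconds = struct.unpack("!I", data[40:44])[0]
    return ntp_to_unix(ntp_seconds)

# Example usage (needs network access, so commented out here):
# print(time.ctime(sntp_time()))
```

A real NTP daemon does considerably more (round-trip compensation, filtering several servers, gradual slewing of the clock), but the wire format is no more complicated than this.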

Whilst investigating this I came across a related problem which I thought had died out years ago. When setting up WordPress for this blog it asked for my offset from UTC (Coordinated Universal Time, en Français) so that the posts would be stamped with my local time (UK). Except that the time it reported as UTC … wasn’t. It was an hour out due to Summer Time (called Daylight Saving Time in some parts of the world). Now I don’t know if WordPress was reporting it wrong, or PHP was returning it incorrectly (unlikely), or the server had been set up badly, but it certainly was not right.

When you set the clock on a machine it has to be set to UTC, with any necessary time zone and DST offsets applied on top. Many systems rely on this being right; for example, email time stamps have the UTC (GMT) offset included in them so that at the receiving end they can be converted back to UTC and then to the new local time, so the time makes sense to the recipient. NTP sorts all this out automatically because it has to work in UTC internally, but if you are relying on the operator’s wrist watch then he is going to set the clock to what he sees, and if the time zone has been left at the Zulu factory setting then everything is going to be haywire.
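
The email round trip described above can be sketched in a few lines of Python (my illustration; the header date is a made-up example in the format email actually uses):

```python
from email.utils import parsedate_to_datetime
from datetime import timezone

# An email Date header carries the sender's local time plus its UTC offset.
sent = parsedate_to_datetime("Wed, 05 Oct 2005 10:00:00 +0100")  # 10:00 BST

# The receiver strips the offset back to UTC, then applies its own zone.
as_utc = sent.astimezone(timezone.utc)
print(as_utc.strftime("%H:%M %Z"))  # 09:00 UTC
```

If the sending machine's clock was set to local time but its zone left at the factory "Zulu" setting, the +0100 would be missing and every recipient's conversion would be an hour out, which is exactly the failure described above.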

Late Note: I have tested WordPress date reporting and it is OK; PHP function date("H:i T I") returns 10:00 GMT 0 (i.e. no DST) so it is the server that is set up wrongly!

First Draft

4 Oct 2005 22:50 by Rick

Well, that was hard work, but I’m quite pleased with the result. The apparent simplicity hides some rather complex stuff going on underneath. I can’t imagine any non-geek being able to customise their blog like this, but perhaps they use a simpler, less flexible system. Even so, I’m glad that PHP is a pretty obvious programming language, not a huge step from sh or perl. The tricky bit is the CSS: it is still very much a black art to me.

I have just commented out the unwanted stuff from the default template for the moment, but once I am happy with it I will cut out the redundant code altogether, which should speed it up a bit.

Immediate Reaction

3 Oct 2005 21:47 by Rick

It looks like I am going to have to work at this a bit. Even if I wanted to keep the style as the default, which I won’t, there are clearly some things wrong which need to be addressed.

  • There is no author name on the posts. Come to think of it, I didn’t see how to create multiple authors.
  • It would be nice to have the time on the posts, even if the server clock is wrong.
  • After asking me to specify the format for the date, it has ignored it.
  • The Comments link, which shouldn’t be there in the first place, goes to a 404.
  • I need a link back to my family page, the West Penwith Resources home, the One Name Study home etc.

Time to RTFM I think.

First Post

21:07 by Rick

Well, what’s all this blogging about, then? I hope that it will make it a bit easier to get some of my random thoughts down so that they don’t disappear forever. It’s the age, you know; the memory is not what it was.

My prime thinking time is when soaking in the bath, hence the title; fewer distractions there. The drawback is that I have often forgotten them by the time I get out, rather like last night’s dream.

The subjects will vary wildly—Web creation to Family History, Naive Politics to C programming etc. It doesn’t have to be interesting as I don’t expect any readers, it is largely for my own benefit, but if others enjoy it then all the better. Some of my earlier material has attracted some interest via the search engines.

It will quickly become apparent that I can’t spell.

One thing that is immediately obvious is that the time on this server is over an hour slow!
