Soundex - the True Story

There are very many pages on the web that discuss the Soundex system but, as far as I can discover, very few tell the whole story. None that I have found describes the algorithm in a way that is easy to code or to work through by hand, and quite a few are plain wrong!

History

On 2 Apr 1918, Robert C. Russell of Pittsburgh, Pennsylvania obtained a patent on a method for indexing which was based on the way a name was pronounced rather than how it was spelled. He did this by coding 8 phonetic sound types with a few additional rules. Together with Margaret K. Odell he obtained a second patent in 1922 with some variations. This they sold to various commercial and governmental organisations. It was taken up in a modified form in the 1930s by the Social Security Administration under a work creation scheme to extract certain data from the US Census and to index its records. It has also been used for immigration records and, more recently, for indexing, search engines and spell checkers. In fact, as will be seen later, a number of variants of the original design have been used over the years.

The algorithm

The system really became popular when Donald E. Knuth published his book "The Art of Computer Programming" where, in Volume 3 "Sorting and Searching" (Addison-Wesley), he describes an algorithm to encode names using the simplified form of the method described below. What it does is to convert each name into a four character index (one letter and three digits) which represents its sound when spoken. By this means names which sound the same but are spelt differently, either through error or choice, have the same code and are grouped together in any index using it. It is by no means perfect, but it is widely used and hard to better. For a study of its effectiveness compared to alternative schemes see A. J. Lait & B. Randell "An Assessment of Name Matching Algorithms" Dept of Computing Science, University of Newcastle upon Tyne.

I have here re-ordered the commonly described algorithm to make it easier to both code and calculate by hand as a series of steps in a systematic way. Starting with the required surname:-

A. Remember the initial letter.
B. Convert each letter (including the first) according to the following table. Ignore punctuation such as apostrophes, spaces and hyphens.
   0 = AEIOUWYH
   1 = BPFV
   2 = CSKGJQXZ
   3 = DT
   4 = L
   5 = MN
   6 = R
C. Change all consecutive duplicate digits to a single example, e.g. change 22 to 2.
D. Replace the first digit by the letter remembered in step A.
E. Delete all zeros.
F. Adjust to four characters by truncating or padding to the right with zeros.
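
For readers who prefer to see the hand method spelled out, here is a minimal sketch in C that follows the steps above literally, one after another, rather than folding them into a single pass. The names soundex_digit and soundex_by_steps, the 64-letter working buffer and the -1 return for a name with no letters are my own assumptions for illustration and are not part of any published routine.

```c
#include <ctype.h>
#include <string.h>

/* Step B table: the position in the array is the code digit. */
static const char *const soundex_groups[] = {
    "AEIOUWYH", "BPFV", "CSKGJQXZ", "DT", "L", "MN", "R"
};

/* Return the code digit for an upper-case letter, or NUL if it has none. */
static char soundex_digit(char letter)
{
    size_t g;
    for (g = 0; g < sizeof soundex_groups / sizeof soundex_groups[0]; g++)
        if (letter != '\0' && strchr(soundex_groups[g], letter) != NULL)
            return (char)('0' + g);
    return '\0';
}

/* Hypothetical helper: writes the 4-character code and a terminator to out.
   Returns 0 on success, or -1 if name contains no letter that can be coded.
   Assumption: only the first 64 letters of the name are considered. */
int soundex_by_steps(const char *name, char out[5])
{
    char digits[64];
    char first = '\0';
    size_t n = 0, m, i;

    /* Steps A and B: remember the first letter, convert every letter. */
    for (i = 0; name[i] != '\0' && n < sizeof digits; i++) {
        char letter = (char)toupper((unsigned char)name[i]);
        char d = soundex_digit(letter);
        if (d == '\0')
            continue;               /* punctuation, spaces, hyphens */
        if (first == '\0')
            first = letter;
        digits[n++] = d;
    }
    if (n == 0)
        return -1;

    /* Step C: reduce each run of the same digit to one digit. */
    m = 1;
    for (i = 1; i < n; i++)
        if (digits[i] != digits[m - 1])
            digits[m++] = digits[i];

    /* Step D: the first digit gives way to the remembered letter. */
    out[0] = first;

    /* Steps E and F: drop zeros, then truncate or pad to 4 characters. */
    n = 1;
    for (i = 1; i < m && n < 4; i++)
        if (digits[i] != '0')
            out[n++] = digits[i];
    while (n < 4)
        out[n++] = '0';
    out[4] = '\0';
    return 0;
}
```

Called as soundex_by_steps("LLOYD", out) with a five-byte buffer, this leaves L300 in out. Keeping the intermediate digit string in its own buffer is wasteful, but it makes each step easy to check against a calculation done on paper.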

The resulting 4 character code is the Simplified (or, incorrectly, Russell) Soundex for that name and is, for example, the one used by the online calculator on the Rootsweb site. For names with a prefix e.g. ST. AUBYN, it is worth coding both with and without the prefix i.e. AUBYN and STAUBYN. Similarly it is worth coding both halves of double-barrelled names separately.

Common errors with published algorithms (and some of the online Soundex Calculators) are to only eliminate double letters rather than pairs of letters with the same code and also not to include the first letter in this calculation. I have even seen one that totally ignores vowels from the beginning which compresses syllables into an unpronounceable mush.

Examples

Step           A)   B)             C)            D)            E)       F)

WILLIAMS    -> W -> 00440052    -> 04052      -> W4052      -> W452  -> W452
BARAGWANATH -> B -> 10602005030 -> 1060205030 -> B060205030 -> B6253 -> B625
DONNELL     -> D -> 3055044     -> 30504      -> D0504      -> D54   -> D540
LLOYD       -> L -> 44003       -> 403        -> L03        -> L3    -> L300
WOOLCOCK    -> W -> 00042022    -> 04202      -> W4202      -> W422  -> W422

The last two are good for spotting errors in other algorithms.
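
One plausible reason these two names make good tests can be shown with a small sketch. It is not taken from any real calculator: the function soundex_with_flaw and the two flaw settings are hypothetical, made up here to imitate two of the mistakes described earlier. Leaving the first letter out of the duplicate check turns LLOYD into L430, and discarding the vowels before merging duplicates turns WOOLCOCK into W420.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical switches that deliberately reproduce two common mistakes. */
enum soundex_flaw {
    FLAW_NONE,
    FLAW_SKIP_FIRST_LETTER,  /* first letter left out of duplicate removal */
    FLAW_DROP_VOWELS_FIRST   /* zeros thrown away before duplicates merge */
};

/* Return the code digit for an upper-case letter, or NUL if it has none. */
static char code_of(char letter)
{
    static const char *const groups[] = {
        "AEIOUWYH", "BPFV", "CSKGJQXZ", "DT", "L", "MN", "R"
    };
    size_t g;
    for (g = 0; g < sizeof groups / sizeof groups[0]; g++)
        if (letter != '\0' && strchr(groups[g], letter) != NULL)
            return (char)('0' + g);
    return '\0';
}

/* Simplified Soundex, optionally with one deliberate flaw.
   A name with no codable letters yields "0000" (an assumption). */
void soundex_with_flaw(const char *name, enum soundex_flaw flaw, char out[5])
{
    char prev = '\0';            /* last digit seen; NUL means none yet */
    size_t n = 0, i;

    for (i = 0; name[i] != '\0' && n < 4; i++) {
        char letter = (char)toupper((unsigned char)name[i]);
        char d = code_of(letter);
        if (d == '\0')
            continue;            /* not a letter */
        if (n == 0) {
            out[n++] = letter;
            /* The correct method remembers the first letter's digit. */
            prev = (flaw == FLAW_SKIP_FIRST_LETTER) ? '\0' : d;
            continue;
        }
        if (d == '0' && flaw == FLAW_DROP_VOWELS_FIRST)
            continue;            /* vowel vanishes without separating digits */
        if (d != prev && d != '0')
            out[n++] = d;
        prev = d;
    }
    while (n < 4)
        out[n++] = '0';
    out[4] = '\0';
}
```

With FLAW_NONE both names give the codes in the table above. Each flaw is caught by only one of the two names, which is why it pays to try both.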

Variations

The variation that was used for the US 1920 Census, known as American Soundex or Miracode, is to ignore "H" and "W" after the first letter so that consonants with the same code separated by either of those letters are coded as only one. e.g. ASHCROFT which would normally be A226 becomes A261. This is the version that is described on the NARA (U.S. Archives) site and was occasionally (but not consistently) used for other census years. Ironically, the online calculator which was on the same site (now gone) used Simplified Soundex.

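The algorithm described can be simply modified to allow for this case. In step B, remove "H" and "W" from the list for code 0 and add 7 = HW. Code 7 is used for the first letter only; an "H" or "W" anywhere else in the name is skipped altogether, so that the letters on either side of it are treated as adjacent.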
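
As a rough illustration of that modification, the sketch below codes a name in one pass using the altered table. The function names american_code and american_soundex are hypothetical, and returning "0000" for a name with no letters is my own assumption.

```c
#include <ctype.h>
#include <string.h>

/* Table for the American variation: H and W move from code 0 to code 7. */
static char american_code(char letter)
{
    static const char *const groups[] = {
        "AEIOUY", "BPFV", "CSKGJQXZ", "DT", "L", "MN", "R", "HW"
    };
    size_t g;
    for (g = 0; g < sizeof groups / sizeof groups[0]; g++)
        if (letter != '\0' && strchr(groups[g], letter) != NULL)
            return (char)('0' + g);
    return '\0';
}

/* Hypothetical helper: American (NARA) Soundex of name, written to out. */
void american_soundex(const char *name, char out[5])
{
    char prev = '\0';            /* last digit seen; NUL means none yet */
    size_t n = 0, i;

    for (i = 0; name[i] != '\0' && n < 4; i++) {
        char letter = (char)toupper((unsigned char)name[i]);
        char d = american_code(letter);
        if (d == '\0')
            continue;            /* punctuation, spaces, hyphens */
        if (n == 0) {
            out[n++] = letter;   /* the first letter is kept as itself */
            prev = d;            /* may be 7 here, and only here */
            continue;
        }
        if (d == '7')
            continue;            /* later H or W: skipped, prev unchanged */
        if (d != prev && d != '0')
            out[n++] = d;
        prev = d;
    }
    while (n < 4)
        out[n++] = '0';
    out[4] = '\0';
}
```

For ASHCROFT the H is skipped without disturbing the remembered 2 from the S, so the following C counts as a duplicate and the result is A261 rather than A226.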

There is one calculator online (“Yet Another Soundex Calculator” by Mohr and Whalen) that half implements this scheme (ignoring the "H" but not "W"). There is an article, code and online calculator which does both and some other variations on the Creativyst Software site but they disagree about the form of the normal algorithm.

Another variation, used by some spell checkers and search engines, is to code the first letter in the same way as the remainder which eliminates the poor effect of incorrect initial letters (e.g. Kernow for Curnow). This is not widely used in published indexes. Soundex is very English language biased and there are other variations for Germanic, Yiddish and Middle European names which otherwise fare particularly badly. The best known is called the Daitch-Mokotoff Code and a web search will reveal references.
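
The first of these variations, coding the initial letter like all the others, might look something like the sketch below. The function name soundex_all_digits is hypothetical, and the choice of a four digit result padded with zeros is my assumption; spell checkers and search engines that use the idea may well format their codes differently.

```c
#include <ctype.h>
#include <string.h>

/* Return the code digit for an upper-case letter, or NUL if it has none. */
static char digit_for(char letter)
{
    static const char *const groups[] = {
        "AEIOUWYH", "BPFV", "CSKGJQXZ", "DT", "L", "MN", "R"
    };
    size_t g;
    for (g = 0; g < sizeof groups / sizeof groups[0]; g++)
        if (letter != '\0' && strchr(groups[g], letter) != NULL)
            return (char)('0' + g);
    return '\0';
}

/* Hypothetical helper: every letter, including the first, becomes a digit.
   Assumption: the result is four digits, padded on the right with zeros. */
void soundex_all_digits(const char *name, char out[5])
{
    char prev = '\0';            /* last digit seen; NUL means none yet */
    size_t n = 0, i;

    for (i = 0; name[i] != '\0' && n < 4; i++) {
        char d = digit_for((char)toupper((unsigned char)name[i]));
        if (d == '\0')
            continue;            /* not a letter */
        if (d != prev && d != '0')
            out[n++] = d;        /* new consonant code */
        prev = d;
    }
    while (n < 4)
        out[n++] = '0';
    out[4] = '\0';
}
```

Under these assumptions Kernow and Curnow both come out as 2650, whereas the usual method separates them as K650 and C650.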

The moral of this is to look carefully at what encoding has been used before discarding negative results. The US Census probably used the NARA system but some batches certainly used the Simplified and other coders just misread the algorithm altogether and ended up with something else. The Immigration and Naturalisation records tended to use the simplified system but, again, it is worth checking and trying alternatives.

Program code

For those that know the computer language "C", here is a function that will return the Simplified SOUNDEX code for any given surname and, with modification as shown, the American or NARA Soundex. It is reasonably quick, and is code, locale and machine independent.

#include <ctype.h>  /* toupper */
#include <string.h> /* strchr */
#define GROUPS 7
const char * group[GROUPS] =
        { "AEIOUHWY", "BFPV", "CGKJQSXZ", "DT", "L", "MN", "R" };
/* to implement the NARA variation use
#define GROUPS 8
const char * group[GROUPS] =
        { "AEIOUY", "BFPV", "CGKJQSXZ", "DT", "L", "MN", "R", "HW" };
#define HW '7'
*/
const char * digit = { "0123456789" };
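/* Returns the soundex equivalent to name. The result is held in a static
   buffer which is overwritten by the next call, so copy it if it is needed
   for longer. */
char * Soundex(const char * name)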
{
        int i, j, k;
        char prev = digit[GROUPS]; /* value not used for a code */
        static char out[5];
        char c;
        for (i = 0, j = 0; name[i] != '\0' && j < 4; i++)
        {
                c = toupper((unsigned char)name[i]);
                /* decode the character */
                for (k = 0; k < GROUPS; k++)
                        if (strchr(group[k], c) != (char *)NULL)
                        {
                                c = digit[k];
                                break;
                        }
                /* if not found then ignore the character */
                if (k == GROUPS)
                        continue;
                /* to implement the NARA variation include
                if (j != 0 && c == HW)
                        continue;
                */
                /* ignore duplicates */
                if (c != prev)
                {
                        prev = c;
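                        /* replace the first digit by the letter it came from,
                           in upper case; name[i], not name[0], in case the
                           name starts with punctuation */
                        if (j == 0)
                                c = toupper((unsigned char)name[i]);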
                        /* ignore "vowels" */
                        if (c != '0')
                                out[j++] = c;
                }
        }
        /* pad out to 4 chars */
        for ( ; j < 4; j++)
                out[j] = '0';
        /* terminate the string */
        out[4] = '\0';
        return out;
}