I wrote the first draft of this article around three years ago. I found it in my drafts and touched it up for this week's post.
I was presented with a problem at work recently. I work at an advertising agency, and we produce emails and paper mailings that include the prospective customer's name. Presenting the name properly goes a long way towards improving their attitude towards the campaign itself.
The names came to us from several sources, and not all of them were in presentable form. Some had the name entirely uppercased (JOHN SMITH) and some had only the last name uppercased (John SMITH). Some had the name titlecased (John Smith), which works in the vast majority of cases but falls flat for names like Miis van der Ruhe and Ronald McDonald.
So the task I was given was to propercase the names (propercase means the same thing as titlecase to some of you; in this article I'm going to use it to mean the correct casing of the name).
The first thing I needed was a dataset with a bunch of weird edge cases in it. Ideally it would also contain the propercased version of each name, but I didn't (and still don't) have access to such a corpus. If you've got one, I'd love to hear about it.
I could have used our prospective customer list, but I figured that there would not be as much variation in that list as I was looking for (as it turns out, this was not a good assumption). So I set about locating a dataset.
I started with the census. Conveniently, the 2000 census provides a list of 151,671 surnames in CSV, and the 1990 census provides a list of 88,799 surnames in TSV. I wasn't satisfied with these, though, because they contain mostly simple surnames that wouldn't really challenge any particular method of propercasing. So I kept looking and located the National Grigsby Family Society's list of unusual names, and wrote a simple shell script to generate the dataset from that website.
First approach: Titlecase
As mentioned, the simplest approach is to use titlecase. In Python:
>>> 'VANDERRUHE'.title()
'Vanderruhe'
>>> 'MCDONALD'.title()
'Mcdonald'
Obviously this can be improved. For very little effort we can instead use the excellent titlecase module. It handles the McDonald case:
>>> from titlecase import titlecase
>>> titlecase('MCDONALD')
'McDonald'
but not the others:
>>> from titlecase import titlecase
>>> titlecase('VANDERRUHE')
'Vanderruhe'
>>> titlecase('MACDONALD')
'Macdonald'
This approach is documented in v1-titlecase.py.
Second approach: Dictionary
The second approach I tried was to collect a list of prefixes and the adjustments that would need to be made for each. I split the prefixes into two categories: proper prefixes, which are attached to the name (Mac, l', D', etc.) and particles (van, de, der, etc.). This made sense since there can only be one prefix, but many particles. This approach is documented in v2-titlecase-exceptions.py.
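A minimal sketch of the idea, with abbreviated prefix and particle tables and one canonical casing chosen per entry (the real tables and logic live in v2-titlecase-exceptions.py; `propercase` here is my stand-in for that routine):

```python
# Abbreviated tables; the real lists are much longer.
PREFIXES = {'mac': 'Mac', 'mc': 'Mc', "l'": "L'", "d'": "D'"}
PARTICLES = {'van': 'van', 'de': 'de', 'der': 'der'}

def propercase(name):
    """Titlecase a name, fixing known attached prefixes and
    free-standing particles."""
    out = []
    for word in name.lower().split():
        if word in PARTICLES:
            # Detached particle: use its canonical casing.
            out.append(PARTICLES[word])
            continue
        for prefix, cased in PREFIXES.items():
            if word.startswith(prefix) and len(word) > len(prefix):
                # Attached prefix: recase it, then titlecase the rest.
                out.append(cased + word[len(prefix):].title())
                break
        else:
            out.append(word.title())
    return ' '.join(out)
```

Note that this handles detached particles ('VAN DER BEEK') but, as discussed below, does nothing for attached ones ('VANDERBEEK').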
A problem I solved in this iteration was the dashed last name (Smythington-Smythe-Smith). I handled it by titlecasing each of the parts; in retrospect, it would have been better to run each part separately through the propercasing routine.
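In code, that recursive treatment would look something like this (`propercase_dashed` is my name for it, and `str.title` is just a placeholder for whatever single-name routine is in use):

```python
def propercase_dashed(name, propercase=str.title):
    """Propercase each hyphen-separated part of a name
    independently, then rejoin them with hyphens."""
    return '-'.join(propercase(part) for part in name.split('-'))
```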
One of the issues I found early with this approach is that, with my dataset, 'VANDERBEEK' would be handled the same way as 'VANDROSS', giving us either James van der Beek and Luther van Dross, or Luther Vandross and James Vanderbeek, which is not optimal. Similarly it handles 'VANETTI' the same way as 'VANBEEK' (Vanetti being Italian and particle-free, while Van Beek is Dutch and has a particle).
I originally solved this by only considering a particle a true particle if it was already detached before I began processing. Since that didn't handle the dataset I created, I figured I would need a new solution. I supposed that if a vowel follows the particle, it shouldn't be considered a particle (e.g. Vanetti). This does not handle 'VANEYCK' (Van Eyck), so that wouldn't work either. I then hypothesized that you could use the stress of the next syllable to determine whether or not there is a particle, but since this can't be determined without a pronunciation guide, it was a non-starter. (If you're interested in this sort of thing for regular English, the CMU Pronouncing Dictionary is the corpus you're looking for; NLTK has a wrapper for it.)
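The vowel heuristic is small enough to sketch, which makes its failure mode easy to see (the function name is mine, not from the v2 code):

```python
def looks_like_particle(name, particle='van'):
    """Guess whether `name` begins with a detached particle.
    Heuristic: a vowel immediately after the particle suggests it
    is really part of the name (Vanetti), not a particle."""
    lowered = name.lower()
    if not lowered.startswith(particle):
        return False
    return lowered[len(particle):len(particle) + 1] not in 'aeiou'
```

It classifies 'VANDERBEEK' and 'VANETTI' correctly, but calls 'VANEYCK' particle-free, which is exactly the failure described above.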
This approach also requires that a canonical capitalization be defined for each prefix and particle, which may not be the case across family names (or across countries). For example, Alex and Eddie Van Halen titlecase the particle, but Anton van Leeuwenhoek does not. Since this is a matter of preference, I selected one for each and called it good enough.
Third approach: Checking the Internet
I presented this problem to my good friend Max Buck, who suggested that you merely Google the name and take the first result. He looked into it and found that Google is very aggressive about their API limits, and that Wikipedia does not limit your query rate (and may even return a better result).
So in the final approach, v3-wikipedia.py, I hit the Wikipedia API and pulled page titles returned for the name. I then had to select the best one, which is more of a problem than I had expected.
I began by ordering the results by a weighted average of case-sensitive and case-insensitive Levenshtein distance from the original input string, normalized by string length (since Levenshtein distance counts the single-character edits needed to convert one string into the other, longer strings naturally accumulate larger distances).
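A sketch of that scoring, with a pure-Python edit distance and an illustrative 50/50 weight (the actual weighting in v3-wikipedia.py may differ):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def score(candidate, query, weight=0.5):
    """Weighted average of case-sensitive and case-insensitive
    distance, normalized by query length; lower is better."""
    exact = levenshtein(candidate, query)
    folded = levenshtein(candidate.lower(), query.lower())
    return (weight * exact + (1 - weight) * folded) / max(len(query), 1)

def best_title(titles, query):
    """Pick the candidate page title with the lowest score."""
    return min(titles, key=lambda t: score(t, query))
```

The case-insensitive term rewards titles that spell the name the same way; the case-sensitive term breaks ties in favor of titles whose casing is already close.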
I then did quite a bit of error checking and postprocessing.
- Verify that the best match does in fact contain the name (handling an artifact of Wikipedia's search engine)
- Split the page title on spaces & find the name
- Add it to the final string
- If it's too short, move backwards in the title and add the previous token to the beginning of the final string (to handle particles)
- Verify that the final string is sane (not all upper, or all lower, or clearly incorrect)
- If at any time an error is found, the library falls back to titlecase.
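The overall shape, much simplified, looks something like this. The opensearch endpoint is a real MediaWiki API, but the function names and the sanity check are my paraphrases of what v3-wikipedia.py does, and the fallback here is plain str.title rather than the titlecase module:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = 'https://en.wikipedia.org/w/api.php'

def wikipedia_titles(name, limit=5):
    """Fetch candidate page titles from MediaWiki's opensearch
    endpoint, which returns [query, [titles], [descs], [urls]]."""
    params = urlencode({'action': 'opensearch', 'search': name,
                        'limit': limit, 'format': 'json'})
    with urlopen(f'{API}?{params}') as resp:
        return json.load(resp)[1]

def is_sane(candidate):
    """Reject results that are empty, all-upper, or all-lower."""
    return bool(candidate) and candidate != candidate.upper() \
        and candidate != candidate.lower()

def propercase_via_wikipedia(name):
    """Take the first sane page title containing the name; on any
    failure fall back to titlecasing (the real code uses the
    titlecase module for the fallback)."""
    try:
        for title in wikipedia_titles(name):
            if name.lower() in title.lower() and is_sane(title):
                return title
    except OSError:
        pass  # network trouble: fall through to the fallback
    return name.title()
```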
That was the approach I settled on. It could be improved by:
- Loading a processed version of Wikipedia into a local database rather than hitting the API, which accounts for the bulk of the running time
- Using a finer tokenization approach for the page title in the postprocessing phase (rather than just splitting on spaces)