Alphabetical sorting is hard
November 24, 2019
It really was a great idea. As usual, I took a problem that I randomly ran into and wanted to find a solution to. I was not sure if I could implement the solution, but I knew I could describe it. It happened when I was working on a project where I got some data about municipalities in Hungary. I thought it would be handy to sort them alphabetically so that the user of the final product can easily find the place they are interested in. However, due to the unusual letters in the Hungarian alphabet, such as ő or gy, collation is not trivial. But I knew I could find a way to solve this!
First, I realized that a standard algorithm might not know how to correctly sort o, ó, ö, and ő. In Hungarian, these are not just diacritics on a letter, but are considered as completely separate letters (they sound different as well), so when sorted, o < ő. Although this is a complication, it is not a big one: I can provide a correct alphabetical ordering to my algorithm, which will then know how to do this.
Second, I realized that the situation is more complicated with the digraphs (cs, dz, gy, ly, ny, sz, ty, zs) and the trigraph dzs. These are considered to be one letter (even though displayed using two glyphs) and they come after the single-glyph letters in an alphabetically ordered list. Therefore, cudar comes before csak. Okay, no problem: find all instances of these, turn them into some symbol, and add them to the custom list. So, for example, we will have cudar and $ak, which we can sort if we know that c < $ < d.
And we are done!
Well, of course not.
The glyphs of a digraph next to each other might not mean the digraph itself. For example, s and z do not only occur next to each other when they stand for sz. This can easily happen in two situations. First, in compound words: vaszár (a compound word of iron+lock) consists of vas + zár, so s and z are two separate letters, not sz. Second, since Hungarian is an agglutinative language, there is also a chance for false digraphs where prefixes and suffixes meet the word (or each other). An example for this is malacság, (malac + -ság, literally: piglet+ness, meaning smut (“obscene or lascivious talk, writing, or pictures.”)).
And this is just a start of our woes! Gemination, or the lengthening of consonants is another feature of the Hungarian language. In case you are not familiar with it, think about the Italian expression pizza quattro formaggi – all of these three words have a geminated consonant, whose pronunciation is indeed longer than a single one. In Hungarian, digraphs can also be geminated, but only the first glyph is written twice. For example, gy + gy = ggy, like in meggy (sour cherry; not the same as megy, which means goes).
Substituting with tokens does not work in these situations. While you could try to replace ccs in meccs (football match) with two of the token $ we introduced above, it is not enough. For example, vasszekér, vas + szekér, a horse-drawn carriage made of iron (literally: iron+cart), has ssz, but it is not sz + sz, just s + sz. Or think about the previous example of vaszár, which has s + z ≠ sz. A similar situation appears in vízszerelő, víz + szerelő, plumber (literally: water+repairman), where there is also potential ambiguity: is it zs + z or z + sz? Another great example is the two words meggyez (literally: to put/get sour cherry on something; “összemeggyezte az ingét” = “he got a sour cherry stain on his shirt”) and meggyőz (meg- + győz, to convince someone). A simple algorithm looking at each glyph would think that meggyez comes first and meggyőz comes second. This is not true, since meggyez is sorted as [m e gy gy e z], therefore it comes after [m e g gy ő z].
At this point I had to realize that there is no way to properly sort Hungarian words without a dictionary to split up compound words. And even if we have such a list, it might not even solve all the problems that are caused by prefixes/suffixes.
Well, the next step was clear as the sky: I have to implement an algorithm in Python where you pass a list of words and it returns a correctly collated list thereof. What a great personal project would it be to showcase on my website!
Then, two things happened.
First, I realized that this, simply put, is not an important problem. Or, at least, not important enough to me to spend the required amount of time on it. How frequent it is that one has to collate a list, where orthographic precision is extremely important and a manual check is not feasible? Not very.
Second, after giving up on programming anything for this topic, I decided to just write an article about this discovery. To gather the relevant vocabulary in English, I went to the Wikipedia page for the Hungarian alphabet. The last sentence of the Alphabetical ordering (collation) section is the following:
These rules [regarding digraphs] make Hungarian alphabetic ordering algorithmically difficult (one has to know the correct segmentation of a word to sort it correctly), which was a problem for computer software development.
Well, there goes my Big Discovery.
Cover image: a page of Magyar Értelmező Késziszótár, showing the entry for betűrend (alphabetical ordering). Photo by the Author, 2019. Creative Commons CC BY-NC-SA 4.0