Zurückspulen and a mild case of Zalgo

What’s the difference between “Zurückspulen” and “Zurückspulen”? Just because you don’t see any difference, doesn’t mean they are the same.

I was planning to watch the new season of Rozen Maiden, the series that is obsessed with dolls and the German language. Looked around a bit, decided to go with EveTaku and EveSenshi releases.

I use my own application to search for torrents, which is able to match the files to their corresponding anime and download the ones that I’m interested in. It usually works pretty well. But this time, it somehow failed to match “Rozen Maiden Zurückspulen” to “Rozen Maiden: Zurückspulen”. It ignores punctuation, so I knew that the colon wasn’t the issue… Then, why? Thankfully, I had some prior experience in encoding hell.

The “ü” character in my database entry is 0xC3 0xBC in UTF-8, a single precomposed code point (U+00FC), whereas the “ü” in those files is 0x75 0xCC 0x88 and is actually composed of two code points: The first one is a regular “u” (U+0075), and the second is a “combining diaeresis” (U+0308). We don’t see any difference because operating systems, web browsers and most text editors render them the same, but the underlying strings are different, so a plain comparison fails.
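To make the difference visible, here is a small Python sketch that builds both forms from explicit escapes and prints their UTF-8 bytes. The variable names are only for illustration and have nothing to do with the application mentioned above.

```python
import unicodedata

# Both forms are written as escapes so the source file cannot hide the difference.
precomposed = "\u00fc"   # one code point: LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # two code points: "u" + COMBINING DIAERESIS

print(precomposed.encode("utf-8").hex(" "))  # c3 bc
print(decomposed.encode("utf-8").hex(" "))   # 75 cc 88

# They render the same, but they are not equal as strings.
print(precomposed == decomposed)             # False
print(len(precomposed), len(decomposed))     # 1 2

# Names of the code points that make up the decomposed form.
print([unicodedata.name(c) for c in decomposed])
```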

One could say that they’re canonically equivalent and it’s the application’s responsibility to treat them as such, perhaps by normalizing the text into the fully composed form. It’s true, but is there any reason to distribute files with combining characters when an equivalent precomposed character exists? It’s kinda like writing “I.” instead of “!” for an exclamation mark and hoping that whoever is reading it gets it.
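For the application side of that argument, normalizing into the composed form (NFC) before comparing could look roughly like this in Python. This is a minimal sketch, not the actual matching code: `match_key` is a hypothetical helper, and dropping punctuation by Unicode category plus case folding are assumptions about what "ignores punctuation" might mean.

```python
import unicodedata

def match_key(title: str) -> str:
    """Hypothetical helper: reduce a title to a comparison key."""
    # NFC turns "u" + COMBINING DIAERESIS into the single precomposed code point.
    composed = unicodedata.normalize("NFC", title)
    # Assumption: "punctuation" means any Unicode category starting with "P".
    kept = "".join(c for c in composed
                   if not unicodedata.category(c).startswith("P"))
    # Fold case and collapse runs of whitespace.
    return " ".join(kept.casefold().split())

db_title = "Rozen Maiden: Zur\u00fcckspulen"      # precomposed, with a colon
file_title = "Rozen Maiden Zuru\u0308ckspulen"    # decomposed, no colon

print(db_title == file_title)                        # False
print(match_key(db_title) == match_key(file_title))  # True
```

Without the `normalize` call the two keys would still differ, even though the colon has already been stripped.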

So here’s my plea to fansub groups: When you’re naming your files, don’t use w̥͎̗̯͙̟̥͑ͤ͋̎ē̞͍͗́̎̔́î̟r͚̠̤͚̹͑̋̌ͮ̈́͋d͙̠̦̜̈͞ ̞̼̝͕̩̤͐̽͗̃̒s͖̲̬̦̎͛̆͌ṱ̝̾̾ͣͭ̔ͧ̚u̢͙̼͈̎ͥͨf̛̞͙͍ͨ̿̊̒̅ͤf̛ͦ̔. Thanks.