wordbotch.com

Replacing Unicode chars

First blog post, let's see how long this lasts :-)

Over the last couple of weeks, I released Mars 3D and Moon 3D. I got the location data for them from planetarynames.wr.usgs.gov.

The data is excellent - all downloadable in CSV format. In fact, it's better than anything I could find for locations on Earth (Earth 3D hopefully coming soon).

One problem, though, was that it contained lots of Unicode characters. Many features on Mars and the Moon are named after people and places, and some of those names (e.g. Cádiz, a crater on Mars) contain non-ASCII characters in their native language.

JME3 (jMonkeyEngine 3) draws text from a bitmap font, so every character has to exist in that bitmap. The JME3 API includes a way of generating these bitmaps, but I don't think it's possible to add Unicode characters. In any case, it's just plain easier to stick to ASCII.

At first I thought of searching for non-ASCII characters and replacing them by hand. As well as being tedious, though, that gets tricky: it's hard to know what the replacement for a given character should be. For instance, the conventional German transliteration of 'ü' is 'ue', not just 'u'.
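For illustration, here is a minimal sketch of what that manual approach might look like using only Python's standard library. It is not what I ended up using. The function name to_ascii_naive and the OVERRIDES table are made up for this example; the table stands in for the language-specific choices you would have to make yourself.

```python
import unicodedata

# Hypothetical override table: language-specific choices that a generic
# decomposition cannot make for us.
OVERRIDES = {"ü": "ue", "ß": "ss", "ø": "o"}

def to_ascii_naive(text, overrides=OVERRIDES):
    """Apply manual overrides first, then strip accents via NFKD."""
    out = []
    for ch in text:
        if ch in overrides:
            out.append(overrides[ch])
            continue
        # NFKD splits 'á' into 'a' plus a combining acute accent.
        decomposed = unicodedata.normalize("NFKD", ch)
        # Drop the combining marks and keep the base character.
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

print(to_ascii_naive("Cádiz"))   # Cadiz
print(to_ascii_naive("Müller"))  # Mueller
```

The catch is visible straight away: without an override, 'ü' becomes a plain 'u', and characters such as 'ø' or 'ß' have no decomposition at all, so they pass through unchanged unless the table covers them. An override that is right for German may also be wrong for a name from another language.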

After googling around for a bit I discovered Unidecode, a Python module made for exactly this. I gather it's a port of an old Perl module (Text::Unidecode).

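from unidecode import unidecode

# Read the CSV as UTF-8 and write an ASCII-only copy.
# The with block closes both files, even if something goes wrong.
with open('in.csv', encoding='utf-8') as f_in, \
        open('out.csv', 'w', encoding='ascii') as f_out:
    for line in f_in:
        # unidecode() replaces each non-ASCII character with an ASCII approximation
        f_out.write(unidecode(line))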

Worked like a charm.
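
If you want to double-check a result like this, a small helper along the following lines could list any non-ASCII characters still left in a piece of text. This is only a sketch using the standard library; the name non_ascii_chars and the sample data are made up for the example.

```python
from collections import Counter

def non_ascii_chars(text):
    """Count each non-ASCII character (code point above 127) in text."""
    return Counter(ch for ch in text if ord(ch) > 127)

# Made-up sample rows in the spirit of the CSV data.
sample = "Cádiz,crater,Mars\nOlympus Mons,mons,Mars\n"
print(non_ascii_chars(sample))  # Counter({'á': 1})
```

Run on the contents of in.csv, it shows which characters need handling. Run on out.csv, it should come back empty.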