2008-09-29

Mature programming languages and internationalization

Have you ever tried to implement an application that uses a character encoding other than Latin 1? What is the big deal? If you are using a modern development platform like Java or .NET, it just works. (Let's leave aside problems caused by poor internationalization support in a particular framework or library.) The multilingual application scenario is simple: use a Unicode (UTF) encoding internally and transcode at the API level when talking to external systems (I/O operations on text files, databases, etc.). Your application will then be able to write messages in most standardized character sets, including Russian and Chinese ones.
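To make the scenario above concrete, here is a minimal sketch in Python 3. The function names and the choice of KOI8-R as the external encoding are my own assumptions for illustration; the point is only that bytes get decoded on the way in and encoded on the way out, and everything in between is Unicode.

```python
def read_external(raw: bytes, source_encoding: str) -> str:
    """Decode bytes from an external system into an internal Unicode string."""
    return raw.decode(source_encoding)


def write_external(text: str, target_encoding: str) -> bytes:
    """Encode an internal Unicode string for an external system."""
    return text.encode(target_encoding)


# Russian text arriving from a legacy system as KOI8-R bytes (simulated here).
incoming = "Привет".encode("koi8_r")

# Inside the application we only ever handle Unicode strings.
message = read_external(incoming, "koi8_r")

# On the way out, pick whatever encoding the receiving system expects.
outgoing = write_external(message, "utf-8")
```

Keeping the encoding names at the edges means the rest of the code never needs to know which character set a given file or database uses.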

What about other popular programming languages? Take Python 2.5, for example, a recent version that can be used in a production environment. It supports UTF string processing and character transcoding, but be aware that there are glitches and unimplemented UTF features that cause incorrect UTF processing. To get it to work you have to use (sometimes ugly) workarounds. Here is a nice presentation about these issues. There is hope that developers from the open source community will fix that by, let's say, version 3.0.
Ruby is in a better situation: as a younger language, it addressed internationalization issues more thoroughly at the design stage. I suspect there is still more work to do.

What about the old workhorse of web application development, PHP, in its newest stable version 5.5.2? It has come a long way, but it still has problems caused by its legacy of separate, non-integrated libraries. It has a long list of encoding options, such as the internal encoding, the HTTP input and output encodings, and many more that affect individual libraries.

So what about using the mbstring module (multibyte string support), setting UTF as the internal encoding, and sticking to it? The situation is worse than in Python: many string functions, even simple ones, can't process multibyte strings properly. The solution is to write your own workarounds or use third-party libraries like PHP utf8. Here is a good list of PHP internationalization issues. Last week I checked XML import support in an old PHP application, and it had problems with the Latin2 encoding. The reason was simple: the XML SAX parser supports only the ISO-8859-1, UTF-8 and US-ASCII encodings.
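Both problems are easy to illustrate. The sketch below is in Python 3 rather than PHP, and it is not the code from the application mentioned above; the helper names and the sample sentence are assumptions made for the example. The first part shows why functions that work on bytes give wrong answers for UTF-8 text. The second part shows one possible workaround for a parser that does not accept Latin2: transcode the document to UTF-8 and adjust its XML declaration before parsing.

```python
import xml.etree.ElementTree as ET

# Polish sample text; it needs Latin2 or Unicode, Latin 1 is not enough.
text = "Zażółć gęślą jaźń"
utf8_bytes = text.encode("utf-8")

# A byte-oriented length function counts bytes, not characters.
char_count = len(text)
byte_count = len(utf8_bytes)


def byte_prefix_is_valid(data: bytes, n: int) -> bool:
    """Return True if cutting after n bytes leaves valid UTF-8."""
    try:
        data[:n].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin2_xml_to_utf8(doc: bytes) -> bytes:
    """Transcode a Latin2 XML document to UTF-8 and fix its declaration."""
    decoded = doc.decode("iso8859_2")
    decoded = decoded.replace('encoding="ISO-8859-2"', 'encoding="UTF-8"', 1)
    return decoded.encode("utf-8")


# Simulate a Latin2 document coming from an external system.
latin2_doc = (
    '<?xml version="1.0" encoding="ISO-8859-2"?>'
    "<msg>" + text + "</msg>"
).encode("iso8859_2")

# After transcoding, a parser limited to UTF-8 can read the document.
root = ET.fromstring(latin2_xml_to_utf8(latin2_doc))
```

Here `char_count` and `byte_count` differ because each accented Polish letter takes two bytes in UTF-8, and `byte_prefix_is_valid(utf8_bytes, 3)` is false because the cut lands in the middle of the letter "ż". That is exactly the kind of damage a byte-oriented substring function does.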

Such low-level problems can take a lot of precious project time. So if your next application will use a character set other than Latin 1, make good internationalization support a criterion when choosing your development and production platform.
