Unicode & Python, handling UnicodeDecodeError
By: Hasanat Kazmi

Unicode handling in Python 2.x is a little un-Pythonic: it isn’t as automatic as most other things in Python. You might get this sort of error while dealing with Unicode:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position XXX: ordinal not in range(128)
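
The error is easy to reproduce. Here is a minimal sketch; the two bytes below are a made-up sample (the UTF-8 form of a single accented letter), not data from any real file:

```python
# Two bytes that are valid UTF-8 (U+0101, "a" with a macron) but not valid ASCII.
raw = b"\xc4\x81"

try:
    raw.decode("ascii")  # ASCII only covers byte values 0-127
except UnicodeDecodeError as err:
    message = str(err)
    print(message)
```

Python 2.x runs this kind of ASCII decode implicitly whenever a byte string is mixed with a unicode string, which is why the error often seems to come out of nowhere.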

If you see it, you probably haven’t yet worked out how Python wants Unicode to be handled. It’s not entirely your fault; perhaps the Python dev team didn’t anticipate that the non-English-speaking world would use it too. They made ASCII the default encoding in Python 2.x, and that is what causes all this fuss. (Does anyone know a better reason why they did so?) In Python 3.x, after realizing this, they made all strings Unicode and UTF-8 the default source encoding. So the idea is: after you get/read/fetch any text, decode it to Unicode before performing any operation on it, for example with myString = myString.decode("utf-8").


(UTF-8 is the most popular encoding of Unicode and is fully supported by Python.)
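
As a rough sketch of the decoding step, assuming the incoming bytes are UTF-8 (the sample bytes are hypothetical and stand in for whatever you read from a file or the network):

```python
# Hypothetical input: the UTF-8 bytes of a four-letter word in Arabic script.
raw = b"\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85"

myString = raw.decode("utf-8")  # bytes -> Unicode text

print(len(raw))       # counts bytes
print(len(myString))  # counts characters
```

The two lengths differ because each of these letters takes two bytes in UTF-8. Once decoded, operations such as slicing and splitting work on characters rather than on raw bytes.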

After you have decoded it to Unicode, the code you wrote shouldn’t break: you should be able to read the text, process it, and so on. But remember that not all Python modules are ‘Unicode safe’, so whenever you have to pass data (myString in this case) back to such a module, encode it back to a byte string first. For example, if you write myString to a file without encoding it, you will get an awkward UnicodeEncodeError, because Python 2.x tries to encode it as ASCII for you. Encoding to ASCII yourself fails in the same way when the text contains non-ASCII characters, so use an encoding that can represent them, such as UTF-8, e.g. myOutFile.write(myString.encode("utf-8")).
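
The encoding step might look like the sketch below. It assumes UTF-8 as the output encoding, since ASCII cannot represent non-English letters; the sample word and the file name out.txt are hypothetical:

```python
import os
import tempfile

myString = u"\u0633\u0644\u0627\u0645"  # Unicode text (hypothetical sample word)

# ASCII cannot hold these characters, so encoding to it fails.
try:
    myString.encode("ascii")
    asciiWorked = True
except UnicodeEncodeError:
    asciiWorked = False

encoded = myString.encode("utf-8")  # Unicode text -> bytes

# Write the bytes in binary mode so nothing is re-encoded implicitly.
path = os.path.join(tempfile.mkdtemp(), "out.txt")  # hypothetical file name
with open(path, "wb") as myOutFile:
    myOutFile.write(encoded)

# Reading the bytes back and decoding them restores the original text.
with open(path, "rb") as check:
    roundTrip = check.read().decode("utf-8")
```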


To cut it short: if you have non-ASCII data, decode it early, process it the way you want, and then encode it back. This is very much like Java (it's Java-ish, not Pythonic). Here is the snippet:

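# Python 2.x: stuff which I downloaded from the internet; it has lots of non-ASCII characters
myInFile = open("myUrduWords.txt")
myString = myInFile.read().decode("utf-8")  # bytes -> unicode; assumes the file is UTF-8
myInFile.close()
for word in myString.split(" "):
    print word.encode("utf-8")  # or whatever you want to do
myOutFile = open("myUrduWordsOutFile.txt", "w")
# unicode -> bytes; sys.getdefaultencoding() (ascii under Python 2.x) would fail on this text
myOutFile.write(myString.encode("utf-8"))
myOutFile.close()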
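
The same decode-early, encode-late pattern can also be written with io.open, which exists in Python 2.6+ and 3.x and does the decoding and encoding at the file boundary. This is only a sketch: it assumes the data is UTF-8, and it creates its own small sample file in a temporary directory so that it can run on its own.

```python
import io
import os
import tempfile

workDir = tempfile.mkdtemp()
inPath = os.path.join(workDir, "myUrduWords.txt")
outPath = os.path.join(workDir, "myUrduWordsOutFile.txt")

# Create a small UTF-8 input file (two hypothetical words) for the sketch.
with io.open(inPath, "w", encoding="utf-8") as f:
    f.write(u"\u0627\u0631\u062f\u0648 \u0632\u0628\u0627\u0646")

# Decode early: io.open decodes the bytes to Unicode while reading.
with io.open(inPath, encoding="utf-8") as myInFile:
    myString = myInFile.read()

words = myString.split(" ")  # process the text as Unicode

# Encode late: io.open encodes the Unicode text to UTF-8 while writing.
with io.open(outPath, "w", encoding="utf-8") as myOutFile:
    myOutFile.write(u" ".join(words))
```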

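If you know exactly which language you are dealing with, you can use an encoding specific to that language (a legacy code page, for example) instead of UTF-8 when decoding and encoding. Unicode is a superset of the character sets those encodings cover, so nothing is lost, and this can help in certain cases, such as when the data was saved in a legacy encoding in the first place.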
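
As an illustration, here is a small sketch using cp1256, Python's name for the Windows-1256 Arabic code page. The choice of code page and the sample word are assumptions made for this example; use whichever encoding your data was actually saved in.

```python
text = u"\u0633\u0644\u0627\u0645"  # hypothetical sample word in Arabic script

legacy = text.encode("cp1256")  # one byte per character in this code page
utf8 = text.encode("utf-8")     # two bytes per character for these letters

# Decoding must use the same codec the bytes were encoded with.
assert legacy.decode("cp1256") == text
assert utf8.decode("utf-8") == text

print(len(legacy), len(utf8))
```

Decoding with the wrong codec either raises an error or silently produces the wrong characters, so this only helps when you are sure of the encoding.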

About the author
I'm Hasanat Kazmi, a computer science student, self-styled computer scientist and entrepreneur. I also like discussing politics and religion.