Excerpt from https://www.cnblogs.com/kingcat/archive/2012/10/16/2726334.html:
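"So we can also understand it this way: Unicode uses the numbers 0 through 65535 to represent all characters, and the 128 numbers from 0 through 127 still represent exactly the same characters as ASCII. (65536 is 2 to the 16th power.) That is the first step. The second step is how to turn the numbers 0 through 65535 into strings of 0s and 1s that can be stored in the computer. There are naturally different ways to store them, and so the UTFs (Unicode Transformation Formats) appeared, such as UTF-8 and UTF-16."
Note that the range 0 to 65535 only covers the Basic Multilingual Plane; Unicode code points now run up to U+10FFFF.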
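The two steps described in the excerpt above (character to number, then number to bits) can be sketched in Python 3 roughly as follows. The character and the helper name `to_bits` are my own choices for illustration, not something from the quoted post.

```python
# Step 1: a character maps to a number, its Unicode code point.
ch = "\u4e2d"                 # the character U+4E2D (chosen for illustration)
code_point = ord(ch)          # 0x4E2D

# Step 2: the same number is stored as different bytes
# depending on which UTF is chosen.
utf8_bytes = ch.encode("utf-8")        # three bytes in UTF-8
utf16_bytes = ch.encode("utf-16-be")   # two bytes in big-endian UTF-16


def to_bits(data):
    """Hypothetical helper: render bytes as a space-separated 0/1 string."""
    return " ".join(format(b, "08b") for b in data)


print(hex(code_point))
print(to_bits(utf8_bytes))
print(to_bits(utf16_bytes))

# Code points 0 to 127 keep their ASCII meaning, and UTF-8 stores them
# as the same single byte that ASCII would.
ascii_bytes = "A".encode("utf-8")
print(ascii_bytes)
```

The code point is identical in both cases; only the stored byte sequence differs, which is why the encoding has to be known when bytes are read back.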
Many problems can be caused by encoding, so it is a must to know how different languages handle encoding issues:
Java
Java uses Unicode internally (strings are stored as UTF-16), but it still seems necessary to set the encoding explicitly on the surface, for example when reading or writing files and streams.
// set and print the encoding type
Python
- In Python 3, all strings are sequences of Unicode characters. There is a `bytes` type that holds raw bytes.
- In Python 2, a string may be of type `str` or of type `unicode`. You can tell which using code something like this: `isinstance(s, str)` or `isinstance(s, unicode)`.
# check the encoding type
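As a minimal sketch of the check described in the list above, here is the Python 3 version, where the question becomes "text or raw bytes?" rather than `str` or `unicode`. The function name `describe` and the sample string are hypothetical, chosen only for this example.

```python
def describe(value):
    """Hypothetical helper: say whether value is text or raw bytes."""
    if isinstance(value, str):
        return "text (str)"
    if isinstance(value, (bytes, bytearray)):
        return "raw bytes"
    return "neither"


text = "caf\u00e9"            # a str: a sequence of Unicode characters
raw = text.encode("utf-8")    # a bytes object: the UTF-8 encoded form

print(describe(text))
print(describe(raw))

# Decoding with the same encoding gives back the original text.
print(raw.decode("utf-8") == text)
```

In Python 2 the same idea would test against `str` and `unicode` instead, as in the `isinstance` calls mentioned above.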