[Notes] Unicode

Post author: Mingxiang Cai
Post link: https://marcopolocai.github.io/2018/03/14/Notes-Unicode/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/) unless stated otherwise.

Excerpt from https://www.cnblogs.com/kingcat/archive/2012/10/16/2726334.html:

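So we can also understand it this way: Unicode uses the numbers 0 through 65535 to represent all characters, and the 128 numbers from 0 through 127 still represent exactly the same characters as in ASCII. (65536 is 2 to the 16th power.) That is the first step. The second step is how to turn the numbers 0 through 65535 into strings of 0s and 1s that can be stored in a computer. There are naturally different ways to store them, which is why UTF (Unicode Transformation Format) appeared, with UTF-8 and UTF-16.

Note: the range 0 through 65535 covers only the Basic Multilingual Plane. Unicode today defines code points up to U+10FFFF.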
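
To make the two steps concrete, here is a small sketch in Python 3. It takes a few characters, looks up the number Unicode assigns to each (step one), and then shows the bytes that UTF-8 and UTF-16 produce for that number (step two). The sample characters and the helper name `describe` are my own choices for illustration, not part of the excerpt.

```python
# Step 1: every character has a code point (a number).
# Step 2: an encoding (UTF-8, UTF-16, ...) turns that number into bytes.
samples = ["A", "é", "你"]


def describe(ch):
    """Return (code point, UTF-8 bytes, UTF-16 big-endian bytes) for one character."""
    return ord(ch), ch.encode("utf-8"), ch.encode("utf-16-be")


for ch in samples:
    code_point, utf8, utf16 = describe(ch)
    # Print the bytes as hex so the different lengths are easy to compare.
    print(f"U+{code_point:04X}  utf-8={utf8.hex()}  utf-16-be={utf16.hex()}")
```

The ASCII character takes one byte in UTF-8 but two in UTF-16, while the Chinese character takes three bytes in UTF-8 and two in UTF-16. Same number, different ways of storing it.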

Encoding can cause many problems, so it is essential to know how different languages handle it:

Java

Java stores strings internally as Unicode in UTF-16, but the encoding used for input and output apparently still has to be set explicitly.

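// Set and print the encoding type. The JVM reads file.encoding at startup,
// so a runtime change may not affect the default charset.
System.setProperty("file.encoding", "UTF-16");
String a = System.getProperty("file.encoding");
System.out.println(a);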

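// Conversion (requires: import java.io.UnsupportedEncodingException;)
try {
    // Encode the string (UTF-16 internally) as UTF-8 bytes
    String string = "abc\u5639\u563b";
    byte[] utf8 = string.getBytes("UTF-8");
    // Decode the UTF-8 bytes back into a string
    string = new String(utf8, "UTF-8");
} catch (UnsupportedEncodingException e) {
    // UTF-8 is supported by every JVM, so this should not happen
    throw new AssertionError(e);
}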

Python

In Python 3, all strings are sequences of Unicode characters, and a separate bytes type holds raw bytes.
In Python 2, a string may be of type str (raw bytes) or of type unicode. You can tell which with isinstance(s, str) or isinstance(s, unicode).
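
A quick sketch of the Python 3 distinction between text and raw bytes. The helper `kind` and the sample string are made up for this note, and the code assumes Python 3 (in Python 2 the check would be against `str` and `unicode` instead).

```python
text = "你好"               # str: a sequence of Unicode characters
raw = text.encode("utf-8")  # bytes: the UTF-8 encoding of those characters


def kind(value):
    """Return 'text' for str, 'bytes' for bytes, and 'other' for anything else."""
    if isinstance(value, str):
        return "text"
    if isinstance(value, bytes):
        return "bytes"
    return "other"


print(kind(text), len(text))  # len() of a str counts characters
print(kind(raw), len(raw))    # len() of bytes counts bytes
```

The two lengths differ because each of these characters needs several bytes in UTF-8.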

# check the default encoding
import sys
print(sys.getdefaultencoding())

# declare the source file encoding (must be on the first or second line of the file)
# coding=utf-8

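# conversion (codecs include "utf-8", "utf-16", "unicode-escape")
s = '你好'  # in Python 2, write u'你好' so that s is a unicode object
ec = s.encode("utf-8")   # text -> bytes
dc = ec.decode("utf-8")  # bytes -> text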
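
The other two codecs mentioned above work the same way. This is a minimal Python 3 sketch, with variable names of my own choosing. Note that the plain "utf-16" codec puts a byte order mark (BOM) in front of the data, in the byte order of the machine running the code.

```python
s = "你好"

# "utf-16" prepends a 2-byte BOM, then uses 2 bytes per character here.
utf16 = s.encode("utf-16")
bom = utf16[:2]

# "unicode-escape" produces ASCII-only bytes containing \uXXXX escapes.
escaped = s.encode("unicode-escape")

print(bom.hex(), len(utf16))
print(escaped)

# Decoding with the same codec gives back the original string.
assert utf16.decode("utf-16") == s
assert escaped.decode("unicode-escape") == s
```

Decoding with a different codec than the one used for encoding is where most of the garbled-text problems come from.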