Java Modified UTF-8 Encoding
Synopsis
This module provides a Python codec for Java's Modified UTF-8 encoding, which is used for strings in JNI calls. The encoding differs slightly from standard UTF-8.
The differences are:
- The null character ‘\u0000’ is encoded in 2 bytes rather than 1, so that the encoded string never contains an embedded zero byte.
- Only the 1-byte, 2-byte, and 3-byte formats are used.
- Supplementary characters are represented as surrogate pairs, with each surrogate encoded in 3 bytes, for 6 bytes in total.
This gives us the following mapping:
Number of bytes | First code point | Last code point | Bits | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
2 | U+0000 | U+0000 | – | 11000000 | 10000000 | | | | |
1 | U+0001 | U+007F | 7 | 0xxxxxxx | | | | | |
2 | U+0080 | U+07FF | 11 | 110xxxxx | 10xxxxxx | | | | |
3 | U+0800 | U+FFFF | 16 | 1110xxxx | 10xxxxxx | 10xxxxxx | | | |
6 | U+10000 | U+10FFFF | 20 | 11101101 | 1010xxxx | 10xxxxxx | 11101101 | 1011xxxx | 10xxxxxx |
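The mapping in the table can be sketched as a small encoder. This is a minimal illustration, assuming Python 3; the function name `mutf8_encode` is hypothetical and is not necessarily how this module implements it.

```python
def mutf8_encode(text):
    """Encode a str to Modified UTF-8 bytes, row by row as in the table."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 uses the 2-byte form so no zero byte appears in the output.
            out += b'\xc0\x80'
        elif cp <= 0x7F:
            # 1 byte: 0xxxxxxx
            out.append(cp)
        elif cp <= 0x7FF:
            # 2 bytes: 110xxxxx 10xxxxxx
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp <= 0xFFFF:
            # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            # Supplementary character: split the 20-bit offset into a
            # surrogate pair, then encode each surrogate in 3 bytes.
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            for surrogate in (high, low):
                out.append(0xED)
                out.append(0x80 | ((surrogate >> 6) & 0x3F))
                out.append(0x80 | (surrogate & 0x3F))
    return bytes(out)


print(mutf8_encode('A\x00'))        # b'A\xc0\x80'
print(mutf8_encode('\U0001F600'))   # b'\xed\xa0\xbd\xed\xb8\x80'
```

Standard UTF-8 would encode the second example in 4 bytes (`b'\xf0\x9f\x98\x80'`); here it takes 6.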
To implement this as a Python codec, all that is needed is an encode function and a decode function. The codec is registered by passing a search function to codecs.register(); the search function is called with a codec name and returns the two functions in a CodecInfo object if it recognizes the name, or None otherwise.
This encoding is sometimes referred to as CESU-8 (Compatibility Encoding Scheme for UTF-16: 8-bit), but it differs from CESU-8 in the way zero bytes (‘\x00’) are encoded. There doesn’t seem to be an official designation for this encoding, and a request to add it officially to Python was rejected, so I’ll just use “mutf8” or “mutf-8” for my implementation.
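As a rough sketch of that registration pattern, assuming Python 3: the codec name `mutf8_sketch` and the functions below are hypothetical stand-ins chosen to avoid clashing with the real module, and the encode and decode bodies lean on the standard library's `surrogatepass` error handler rather than reproducing this module's code.

```python
import codecs
import struct

NAME = 'mutf8_sketch'  # hypothetical name, not the one this module registers


def encode(text, errors='strict'):
    # Split the text into UTF-16 code units so that supplementary
    # characters become surrogate pairs, then encode each unit on its own.
    data = text.encode('utf-16-le', 'surrogatepass')
    units = struct.unpack('<%dH' % (len(data) // 2), data)
    out = bytearray()
    for unit in units:
        if unit == 0:
            out += b'\xc0\x80'  # 2-byte form of U+0000
        else:
            out += chr(unit).encode('utf-8', 'surrogatepass')
    return bytes(out), len(text)


def decode(data, errors='strict'):
    # 0xC0 is never a lead byte in well-formed input except in the
    # 2-byte null, so a plain replace is enough for this sketch.
    raw = bytes(data).replace(b'\xc0\x80', b'\x00')
    text = raw.decode('utf-8', 'surrogatepass')
    # Round-trip through UTF-16 to join surrogate pairs back together.
    text = text.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
    return text, len(data)


def search(name):
    # Called by the codecs machinery for each lookup; return None for
    # names this function does not handle.
    if name.lower().replace('-', '_') == NAME:
        return codecs.CodecInfo(encode=encode, decode=decode, name=NAME)
    return None


codecs.register(search)

print(codecs.encode('a\x00b', NAME))        # b'a\xc0\x80b'
print(codecs.decode(b'a\xc0\x80b', NAME))   # prints 'a', a NUL character, then 'b'
```

The decoder here is lenient: it would also accept ordinary 4-byte UTF-8 sequences, which a strict Modified UTF-8 decoder would likely reject.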
Usage
To use this encoding, you could do this:
import codecs
import py2jdbc.mutf8
codecs.register(py2jdbc.mutf8.info)
codecs.encode(u'a string', 'mutf8')
codecs.encode(u'a string', 'mutf-8')
codecs.encode(u'a string', py2jdbc.mutf8.NAME)
The JNI Interface module already imports and registers this module and maps it to jni.encode() and jni.decode(), so you could also use it with:
from py2jdbc.jni import encode, decode
encode(u'a string')
decode(b'a string')
JNI will do this automatically, though, for any call that takes a character pointer argument or returns a character pointer result.