Java Modified UTF-8 Encoding

Synopsis

This module provides a Python codecs interface for the Java Modified UTF-8 encoding, for use in JNI interface calls. The encoding differs slightly from standard UTF-8.

The differences are:

The null character ‘\u0000’ is encoded in 2 bytes rather than 1, so that the encoded string never contains an embedded zero byte.
Only the 1-byte, 2-byte, and 3-byte formats are used.
Supplementary characters are represented as surrogate pairs, which take 6 bytes.
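
To make the first and third differences concrete, here is a small sketch that compares the output of Python's built-in utf-8 codec with what the rules above call for. The helper name to_surrogate_pair is made up for this illustration and is not part of py2jdbc.

```python
# What standard UTF-8 (Python's built-in codec) produces, for comparison.
standard_nul = "\u0000".encode("utf-8")        # a single zero byte
standard_emoji = "\U0001F600".encode("utf-8")  # the 4-byte format

# Modified UTF-8 writes U+0000 as a 2-byte sequence instead,
# so the encoded data never contains a zero byte.
modified_nul = b"\xc0\x80"


def to_surrogate_pair(code_point):
    """Split a supplementary code point (above U+FFFF) into two UTF-16 surrogates.

    Hypothetical helper, for illustration only.
    """
    offset = code_point - 0x10000      # leaves 20 significant bits
    high = 0xD800 | (offset >> 10)     # upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)    # lower 10 bits
    return high, low


# U+1F600 splits into the pair 0xD83D, 0xDE00.
high, low = to_surrogate_pair(0x1F600)
```

Each surrogate is then written in the 3-byte format, which is where the 6 bytes per supplementary character come from.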

This gives us the following mapping:

Number of bytes	First code point	Last code point	Bits	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
2	U+0000	U+0000	–	11000000	10000000
1	U+0001	U+007F	7	0xxxxxxx
2	U+0080	U+07FF	11	110xxxxx	10xxxxxx
3	U+0800	U+FFFF	16	1110xxxx	10xxxxxx	10xxxxxx
6	U+10000	U+10FFFF	20	11101101	1010xxxx	10xxxxxx	11101101	1011xxxx	10xxxxxx
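
The mapping in the table can be sketched directly as a pair of functions. This is a minimal illustration written from the table, not the code that py2jdbc ships. The names mutf8_encode, mutf8_decode and _three_bytes are assumptions, and the decoder does only light validation (for example, it accepts a raw zero byte, which a strict decoder might reject).

```python
def _three_bytes(unit):
    """Encode one 16-bit value in the 3-byte format (1110xxxx 10xxxxxx 10xxxxxx)."""
    return bytes((0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)))


def mutf8_encode(text):
    """Hypothetical encoder: str -> bytes, following the table row by row."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"                 # U+0000 takes 2 bytes
        elif cp < 0x80:
            out.append(cp)                     # 1-byte format
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))       # 2-byte format
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out += _three_bytes(cp)            # 3-byte format
        else:
            offset = cp - 0x10000              # 20 bits, split into a surrogate pair
            out += _three_bytes(0xD800 | (offset >> 10))
            out += _three_bytes(0xDC00 | (offset & 0x3FF))
    return bytes(out)


def mutf8_decode(data):
    """Hypothetical decoder: bytes -> str, with minimal error checking."""
    units = []                                 # 16-bit values, possibly surrogates
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            size, value = 1, b0
        elif b0 & 0xE0 == 0xC0:
            size, value = 2, b0 & 0x1F
        elif b0 & 0xF0 == 0xE0:
            size, value = 3, b0 & 0x0F
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + size > n:
            raise ValueError("truncated sequence at offset %d" % i)
        for b in data[i + 1:i + size]:
            if b & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte near offset %d" % i)
            value = (value << 6) | (b & 0x3F)
        units.append(value)
        i += size

    chars = []
    j = 0
    while j < len(units):
        u = units[j]
        nxt = units[j + 1] if j + 1 < len(units) else None
        if 0xD800 <= u < 0xDC00 and nxt is not None and 0xDC00 <= nxt < 0xE000:
            # Merge a high/low surrogate pair back into one supplementary character.
            chars.append(chr(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)))
            j += 2
        else:
            chars.append(chr(u))
            j += 1
    return "".join(chars)
```

For example, mutf8_encode("\x00") gives the two bytes C0 80, and a string that mixes all five table rows survives a round trip through mutf8_decode(mutf8_encode(s)).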

To implement this as a Python codec, all that is needed is an encode function and a decode function. The codec is registered by passing codecs.register() a search function, which can serve several codec names and which returns the two functions wrapped in a CodecInfo object when it recognizes the requested name.
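
As a rough sketch of that registration step, the block below registers a stand-in codec under the made-up name mutf8_sketch, chosen so it cannot clash with the real one. The names sketch_encode, sketch_decode and search are assumptions for this example. The two converters are compact stand-ins built on Python's built-in codecs, and they ignore the errors argument.

```python
import codecs

SKETCH_NAME = "mutf8_sketch"  # hypothetical name, not the one py2jdbc registers


def sketch_encode(text, errors="strict"):
    """Return (encoded bytes, number of characters consumed)."""
    # Go through UTF-16 so supplementary characters become surrogate pairs.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    units = "".join(
        chr(int.from_bytes(utf16[i:i + 2], "big"))
        for i in range(0, len(utf16), 2)
    )
    # Encode each 16-bit unit in 1 to 3 bytes, then rewrite the zero byte.
    data = units.encode("utf-8", "surrogatepass").replace(b"\x00", b"\xc0\x80")
    return data, len(text)


def sketch_decode(data, errors="strict"):
    """Return (decoded text, number of bytes consumed)."""
    raw = bytes(data)
    units = raw.replace(b"\xc0\x80", b"\x00").decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 to merge surrogate pairs into single characters.
    utf16 = units.encode("utf-16-be", "surrogatepass")
    return utf16.decode("utf-16-be", "surrogatepass"), len(raw)


def search(name):
    """Search function: return a CodecInfo for our name, None for any other."""
    if name == SKETCH_NAME:
        return codecs.CodecInfo(
            encode=sketch_encode,
            decode=sketch_decode,
            name=SKETCH_NAME,
        )
    return None


codecs.register(search)

encoded = "a\x00b".encode(SKETCH_NAME)
decoded = encoded.decode(SKETCH_NAME)
```

Returning None for unknown names matters: Python calls every registered search function in turn, so each one should claim only the names it actually implements.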

This encoding is sometimes referred to as CESU-8 (Compatibility Encoding Scheme for UTF-16: 8-bit), but unlike CESU-8 it changes the way the null character (‘\x00’) is encoded. There doesn’t seem to be an official designation for this encoding, and a request to officially add it to Python was rejected, so I’ll just use “mutf8” or “mutf-8” for my implementation.

Usage

To use this encoding, you could do this:

import codecs
import py2jdbc.mutf8
codecs.register(py2jdbc.mutf8.info)

codecs.encode(u'a string', 'mutf8')
codecs.encode(u'a string', 'mutf-8')
codecs.encode(u'a string', py2jdbc.mutf8.NAME)

The JNI Interface module already imports and registers this codec and exposes it as jni.encode() and jni.decode(), so you could also use it with:

from py2jdbc.jni import encode, decode

encode(u'a string')
decode(b'a string')

In practice, the JNI module does this automatically for any call that takes a character pointer argument or returns a character pointer result.
