4.9 KiB

Raw Blame History

go-utf8

Package utf8 implements encoding and decoding of UTF-8, for the Go programming language.

This package is meant to be a replacement for Go's built-in "unicode/utf8" package.

Documention

Online documentation, which includes examples, can be found at: http://godoc.org/sourcecode.social/reiver/go-utf8

Reading a Single UTF-8 Character

This is the simplest way of reading a single UTF-8 character.

var reader io.Reader

// ...

r, n, err := utf8.ReadRune(reader)

Write a Single UTF-8 Character

This is the simplest way of writing a single UTF-8 character.

var writer io.Writer

// ...

var r rune

// ...

n, err := utf8.WriteRune(w, r)

io.RuneReader

This is how you can create an io.RuneReader:

var reader io.Reader

// ...

var runeReader io.RuneReader = utf8.RuneReaderWrap(reader)

// ...

r, n, err := runeReader.ReadRune()

io.RuneScanner

This is how you can create an io.RuneScanner:

var reader io.Reader

// ...

var runeScanner io.RuneScanner := utf8.RuneScannerWrap(reader)

// ...

r, n, err := runeScanner.ReadRune()

// ...

err = runeScanner.UnreadRune()

UTF-8

UTF-8 is a variable length encoding of Unicode. An encoding of a single Unicode code point can be from 1 to 4 bytes longs.

Some examples of UTF-8 encoding of Unicode code points are:

UTF-8 encoding				value	code point	decimal	binary	name
byte 1	byte 2	byte 3	byte 4	value	code point	decimal	binary	name
`0b0,1000001`				A	U+0041	65	`0b0000,0000,0100,0001`	LATIN CAPITAL LETTER A
`0b0,1110010`				r	U+0072	114	`0b0000,0000,0111,0010`	LATIN SMALL LETTER R
`0b110,00010`	`0b10,100001`			¡	U+00A1	161	`0b0000,0000,1010,0001`	INVERTED EXCLAMATION MARK
`0b110,11011`	`0b10,110101`			۵	U+06F5	1781	`0b0000,0110,1111,0101`	EXTENDED ARABIC-INDIC DIGIT FIVE
`0b1110,0010`	`0b10,000000`	`0b10,110001`		‱	U+2031	8241	`0b0010,0000,0011,0001`	PER TEN THOUSAND SIGN
`0b1110,0010`	`0b10,001001`	`0b10,100001`		≡	U+2261	8801	`0b0010,0010,0110,0001`	IDENTICAL TO
`0b11110,000`	`0b10,010000`	`0b10,001111`	`0b10,010101`	𐏕	U+000103D5	66517	`b0001,0000,0011,1101,0101`	OLD PERSIAN NUMBER HUNDRED
`0b11110,000`	`0b10,011111`	`0b10,011001`	`0b10,000010`	🙂	U+0001F642	128578	`0b0001,1111,0110,0100,0010`	SLIGHTLY SMILING FACE

UTF-8 Versus ASCII

UTF-8 was (partially) designed to be backwards compatible with 7-bit ASCII.

Thus, all 7-bit ASCII is valid UTF-8.

UTF-8 Encoding

Since, at least as of 2003, Unicode fits into 21 bits, and thus UTF-8 was designed to support at most 21 bits of information.

This is done as described in the following table:

# of bytes	# bits for code point	1st code point	last code point	byte 1	byte 2	byte 3	byte 4
1	7	U+000000	U+00007F	`0xxxxxxx`
2	11	U+000080	U+0007FF	`110xxxxx`	`10xxxxxx`
3	16	U+000800	U+00FFFF	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
4	21	U+010000	U+10FFFF	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`

4.9 KiB Raw Blame History Unescape Escape