
Charles Iliya Krempeaux 5dd7c5557d sourcecode.social -> github.com		2024-08-06 11:59:36 -07:00
LICENSE	changed author URL	2024-03-31 09:17:48 -07:00
README.md	sourcecode.social -> github.com	2024-08-06 11:59:36 -07:00
errors.go	"Complainer" -> "Error"	2023-08-18 05:57:56 -07:00
format.go	"utf8s" -> "utf8"	2022-07-18 16:36:02 -07:00
format_test.go	"utf8s" -> "utf8"	2022-07-18 16:36:02 -07:00
invalidutf8.go	"Complainer" -> "Error"	2023-08-18 05:57:56 -07:00
nilreader.go	"Complainer" -> "Error"	2023-08-18 05:57:56 -07:00
nilwriter.go	"Complainer" -> "Error"	2023-08-18 05:57:56 -07:00
readrune.go	made it so utf8.ReadRune() returns the size-of-the-rune rather than the number-of-bytes-read	2024-03-31 09:34:54 -07:00
readrune_test.go	added more tests for utf8.ReadRune()	2024-03-31 09:21:44 -07:00
runeerror.go	"utf8s" -> "utf8"	2022-07-18 16:36:02 -07:00
runelength.go	Len() -> RuneLength()	2022-07-18 18:46:19 -07:00
runelength_test.go	Len() -> RuneLength()	2022-07-18 18:48:37 -07:00
runereader.go	wrap, new	2023-08-18 06:37:04 -07:00
runereader_test.go	wrap, new	2023-08-18 06:37:04 -07:00
runescanner.go	utf8.RuneScanner.Buffered()	2023-11-30 07:24:56 -08:00
runescanner_buffered_test.go	sourcecode.social -> github.com	2024-08-06 11:59:36 -07:00
runescanner_test.go	wrap, new	2023-08-18 06:37:04 -07:00
runewriter.go	wrap, new	2023-08-18 06:37:04 -07:00
runewriter_test.go	sourcecode.social -> github.com	2024-08-06 11:59:36 -07:00
writerune.go	"Complainer" -> "Error"	2023-08-18 05:57:56 -07:00
writerune_test.go	"utf8s" -> "utf8"	2022-07-18 16:36:02 -07:00

README.md

go-utf8

Package utf8 implements encoding and decoding of UTF-8, for the Go programming language.

This package is meant to be a replacement for Go's built-in "unicode/utf8" package.

Documentation

Online documentation, which includes examples, can be found at: http://godoc.org/github.com/reiver/go-utf8

Reading a Single UTF-8 Character

This is the simplest way of reading a single UTF-8 character.

var reader io.Reader

// ...

r, n, err := utf8.ReadRune(reader)

Writing a Single UTF-8 Character

This is the simplest way of writing a single UTF-8 character.

var writer io.Writer

// ...

var r rune

// ...

n, err := utf8.WriteRune(writer, r)

io.RuneReader

This is how you can create an io.RuneReader:

var reader io.Reader

// ...

var runeReader io.RuneReader = utf8.NewRuneReader(reader)

// ...

r, n, err := runeReader.ReadRune()

io.RuneScanner

This is how you can create an io.RuneScanner:

var reader io.Reader

// ...

var runeScanner io.RuneScanner = utf8.NewRuneScanner(reader)

// ...

r, n, err := runeScanner.ReadRune()

// ...

err = runeScanner.UnreadRune()

UTF-8

UTF-8 is a variable-length encoding of Unicode. The encoding of a single Unicode code point is from 1 to 4 bytes long.

Some examples of UTF-8 encoding of Unicode code points are:

UTF-8 byte 1	UTF-8 byte 2	UTF-8 byte 3	UTF-8 byte 4	value	code point	decimal	binary	name
`0b0,1000001`				A	U+0041	65	`0b0000,0000,0100,0001`	LATIN CAPITAL LETTER A
`0b0,1110010`				r	U+0072	114	`0b0000,0000,0111,0010`	LATIN SMALL LETTER R
`0b110,00010`	`0b10,100001`			¡	U+00A1	161	`0b0000,0000,1010,0001`	INVERTED EXCLAMATION MARK
`0b110,11011`	`0b10,110101`			۵	U+06F5	1781	`0b0000,0110,1111,0101`	EXTENDED ARABIC-INDIC DIGIT FIVE
`0b1110,0010`	`0b10,000000`	`0b10,110001`		‱	U+2031	8241	`0b0010,0000,0011,0001`	PER TEN THOUSAND SIGN
`0b1110,0010`	`0b10,001001`	`0b10,100001`		≡	U+2261	8801	`0b0010,0010,0110,0001`	IDENTICAL TO
`0b11110,000`	`0b10,010000`	`0b10,001111`	`0b10,010101`	𐏕	U+000103D5	66517	`0b0001,0000,0011,1101,0101`	OLD PERSIAN NUMBER HUNDRED
`0b11110,000`	`0b10,011111`	`0b10,011001`	`0b10,000010`	🙂	U+0001F642	128578	`0b0001,1111,0110,0100,0010`	SLIGHTLY SMILING FACE
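
The byte columns in the table above can be reproduced with a few lines of Go. This sketch relies only on Go's built-in string conversion, which encodes a rune as UTF-8; `utf8Binary` is a hypothetical helper name.

```go
package main

import "fmt"

// utf8Binary returns the UTF-8 encoding of r, one 8-digit binary string
// per byte.
func utf8Binary(r rune) []string {
	encoded := []byte(string(r)) // string(r) is the UTF-8 encoding of r
	out := make([]string, len(encoded))
	for i, b := range encoded {
		out[i] = fmt.Sprintf("%08b", b)
	}
	return out
}

func main() {
	for _, r := range []rune{'A', 'r', '¡', '۵', '‱', '≡', '𐏕', '🙂'} {
		fmt.Printf("U+%04X %v\n", r, utf8Binary(r))
	}
}
```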

UTF-8 Versus ASCII

UTF-8 was (partially) designed to be backwards compatible with 7-bit ASCII.

Thus, all 7-bit ASCII is valid UTF-8.

UTF-8 Encoding

As of at least 2003, Unicode fits into 21 bits, so UTF-8 was designed to support at most 21 bits of information.

This is done as described in the following table:

# of bytes	# bits for code point	1st code point	last code point	byte 1	byte 2	byte 3	byte 4
1	7	U+000000	U+00007F	`0xxxxxxx`
2	11	U+000080	U+0007FF	`110xxxxx`	`10xxxxxx`
3	16	U+000800	U+00FFFF	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
4	21	U+010000	U+10FFFF	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`
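
The table above translates almost directly into code. The following is a minimal sketch of an encoder that follows the table row by row. It is not this package's implementation: `encodeRune` is a hypothetical name, and for brevity it does not reject surrogate code points (U+D800 to U+DFFF), which a production encoder should do.

```go
package main

import "fmt"

// encodeRune returns the UTF-8 encoding of r following the table above,
// or nil if r is outside U+0000..U+10FFFF.
// Assumption: surrogate code points are not checked in this sketch.
func encodeRune(r rune) []byte {
	switch {
	case r < 0 || r > 0x10FFFF:
		return nil
	case r <= 0x7F: // 1 byte, 7 bits: 0xxxxxxx
		return []byte{byte(r)}
	case r <= 0x7FF: // 2 bytes, 11 bits: 110xxxxx 10xxxxxx
		return []byte{
			0xC0 | byte(r>>6),
			0x80 | (byte(r) & 0x3F),
		}
	case r <= 0xFFFF: // 3 bytes, 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | (byte(r>>6) & 0x3F),
			0x80 | (byte(r) & 0x3F),
		}
	default: // 4 bytes, 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | (byte(r>>12) & 0x3F),
			0x80 | (byte(r>>6) & 0x3F),
			0x80 | (byte(r) & 0x3F),
		}
	}
}

func main() {
	fmt.Printf("% X\n", encodeRune('≡'))  // prints: E2 89 A1
	fmt.Printf("% X\n", encodeRune('🙂')) // prints: F0 9F 99 82
}
```

Each row contributes the payload bits marked x: 7 for one byte, 5+6 for two, 4+6+6 for three, and 3+6+6+6 = 21 for four, which matches the "# bits for code point" column.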
