Saturday, September 9, 2017

UTF-8 character encoding

UTF-8 is a character encoding capable of encoding all 1,112,064[1] valid code points in Unicode using one to four 8-bit bytes.[2] The encoding is defined by the Unicode standard, and was originally designed by Ken Thompson and Rob Pike.[3][4] The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.[5]
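The one-to-four-byte lengths can be seen directly by encoding a code point from each of the four ranges. A minimal sketch in Python (the sample characters are illustrative choices, not from the original text):

```python
# Byte lengths of UTF-8 across the four code point ranges.
examples = {
    "A": "U+0041",    # 1 byte  (U+0000..U+007F, the ASCII range)
    "é": "U+00E9",    # 2 bytes (U+0080..U+07FF)
    "€": "U+20AC",    # 3 bytes (U+0800..U+FFFF)
    "😀": "U+1F600",  # 4 bytes (U+10000..U+10FFFF)
}
for ch, cp in examples.items():
    encoded = ch.encode("utf-8")
    print(cp, len(encoded), encoded.hex(" "))
```

Running this prints the code point, the number of bytes, and the raw UTF-8 bytes in hex, e.g. `U+20AC 3 e2 82 ac` for the euro sign.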
It was designed for backward compatibility with ASCII. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well. Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as "/" in filenames, "\" in escape sequences, and "%" in printf.
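This ASCII transparency holds because every byte in a multi-byte UTF-8 sequence has its high bit set (value 0x80 or above), so a byte in the ASCII range can only ever be a literal ASCII character. A small sketch, using an arbitrary sample string:

```python
# Bytes below 0x80 in UTF-8 output are always literal ASCII characters;
# the two bytes encoding 'ï' (0xC3 0xAF) are both >= 0x80.
text = "naïve/path"
encoded = text.encode("utf-8")

# Keep only bytes in the ASCII range (< 0x80).
ascii_only = bytes(b for b in encoded if b < 0x80)
print(ascii_only)  # b'nave/path' — 'ï' contributes no ASCII bytes
```

This is why a parser scanning for "/" or "%" byte-by-byte cannot be confused by the middle of a multi-byte character.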
[Figure: Usage of the main character encodings on the web from 2001 to 2012 as recorded by Google,[6] with UTF-8 overtaking all others in 2008 and nearing 50% of the web in 2012. The ASCII-only figure includes web pages with any declared header if they are restricted to ASCII characters.]
UTF-8 has been the dominant character encoding for the World Wide Web since 2009, and as of August 2017 accounts for 89.7% of all web pages (the next-most-popular multibyte encodings, Shift JIS and GB 2312, account for 0.9% and 0.7% respectively).[7][8][6] The Internet Mail Consortium (IMC) recommended that all e-mail programs be able to display and create mail using UTF-8,[9] and the W3C recommends UTF-8 as the default encoding in XML and HTML.[10]
