Popularity of text encodings
There are many methods of translating text into digital data, such as Baudot code, EBCDIC, and UTF-8. Their relative usage levels can provide insight into their usability, and historical trends can show the progress of new methods.
Exact measurements are not possible. Counts of the number of documents differ from counts weighted by the actual use or visibility of those documents. Encoding popularity also varies with the language of a document, the locale it comes from, and its purpose. Text may be ambiguous as to which encoding it is in; for instance, pure ASCII text is equally valid as ASCII, ISO-8859-1, CP1252, or UTF-8. "Tags" may indicate a document's encoding, but when a tag is incorrect it may be silently corrected by display software (for instance, the HTML spec says that a tag declaring ISO-8859-1 should be treated as CP1252), so counts of tags may not be accurate.
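The ambiguity of pure ASCII text can be demonstrated directly: the same bytes decode identically under all four encodings, and only a non-ASCII byte distinguishes them. A minimal sketch in Python (the byte values and strings are illustrative):

```python
# Pure ASCII bytes decode identically under ASCII, ISO-8859-1 (latin-1),
# CP1252, and UTF-8, so the encoding of such a file cannot be determined.
data = b"Hello, world!"
decoded = {enc: data.decode(enc) for enc in ("ascii", "latin-1", "cp1252", "utf-8")}
assert len(set(decoded.values())) == 1  # all four decodings agree

# A non-ASCII byte removes the ambiguity: 0xE9 is "e with acute" in
# ISO-8859-1 and CP1252, but on its own it is not valid UTF-8.
assert b"\xe9".decode("latin-1") == "\u00e9"
try:
    b"\xe9".decode("utf-8")
except UnicodeDecodeError:
    pass  # a lone 0xE9 byte is an invalid UTF-8 sequence
```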
Popularity on the World Wide Web

UTF-8 has been the most common encoding for the World Wide Web since 2008.[2] As of May 2022, UTF-8 accounts for an average of 97.7% of all web pages (and 986 of the top 1,000 highest-ranked web pages; the next most popular encoding, ISO-8859-1, is used by 4 of those sites).[3] UTF-8 includes ASCII as a subset; almost no websites declare that only ASCII is used.[4]
All countries and all of the tracked languages have less than 10% use of alternative encodings on the web.
In locales where UTF-8 is used alongside another encoding, the latter is typically more efficient for the associated language. GB 18030, a Chinese Unicode Transformation Format, (effectively[5]) has a 6.7% share of websites in China and territories[6][7][8] and a 0.2% share worldwide. Big5 is another popular Chinese encoding with less than 0.1% share worldwide, but it is popular in Hong Kong and more than twice as popular in Taiwan, where it has a 4.2% share.[9] The single-byte Windows-1251 is twice as efficient as UTF-8 for the Cyrillic script and is used for 7.4% of Russian websites.[10] The Greek and Hebrew legacy encodings are likewise twice as efficient, yet those languages still have over 98% use of UTF-8.[11][12]

South Korea has relatively low UTF-8 use compared to most other countries, at 93.7%, with the remaining websites mainly using EUC-KR, which is more efficient for Korean text. Japanese-language websites have somewhat higher UTF-8 use, with the legacy Shift JIS and EUC-JP encodings holding a combined 6.5% share of Japanese websites (the more popular Shift JIS has a 0.1% global share).[13][14][1] With the exception of GB 18030 (and UTF-16 and UTF-8), these legacy encodings were designed for specific languages and do not support all Unicode characters.

As of May 2022, Breton has the lowest UTF-8 use on the web of any tracked language, at 90%.[15] Over a third of the tracked languages have 100.0% use of UTF-8 on the web, including Punjabi, Tagalog, Lao, Marathi, Kannada, Kurdish, Pashto, Javanese, Greenlandic (Kalaallisut), the Iranian languages[16][17] and sign languages.[18]
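The efficiency claim can be checked by encoding a short sample string both ways: every Cyrillic letter takes two bytes in UTF-8 but one byte in Windows-1251. A small sketch (the sample phrase is illustrative):

```python
# Compare encoded sizes for Cyrillic text: Windows-1251 is roughly twice
# as compact as UTF-8 for Russian, as single-byte encodings generally are
# for their target scripts.
text = "Привет, мир"  # "Hello, world" in Russian: 9 Cyrillic letters + ", "

utf8_len = len(text.encode("utf-8"))      # 9 letters * 2 bytes + 2 ASCII = 20
cp1251_len = len(text.encode("cp1251"))   # 11 characters * 1 byte = 11

print(utf8_len, cp1251_len)  # → 20 11
```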
Popularity for local text files
Local storage on computers makes considerably more use of "legacy" single-byte encodings than the web does. Attempts to update to UTF-8 have been blocked by editors that do not display or write UTF-8 unless the first character in a file is a byte order mark, making it impossible for other software to use UTF-8 without being rewritten to ignore the byte order mark on input and add it on output. UTF-16 files are also fairly common on Windows, but not on other systems.[19][20]
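The byte-order-mark problem described above can be sketched with Python's codecs: a plain UTF-8 decoder passes the BOM (U+FEFF) through as a stray character, while the `utf-8-sig` codec strips it on input and re-adds it on output, which is the behavior such editors expect.

```python
# Some editors prepend the bytes EF BB BF (the UTF-8 encoding of U+FEFF)
# to UTF-8 files. Software that does not expect a BOM sees a stray character.
bom_file = b"\xef\xbb\xbfhello"  # UTF-8 BOM followed by "hello"

assert bom_file.decode("utf-8") == "\ufeffhello"  # plain UTF-8 keeps the BOM
assert bom_file.decode("utf-8-sig") == "hello"    # utf-8-sig strips it

# Writing with utf-8-sig re-adds the BOM for editors that require one.
assert "hello".encode("utf-8-sig") == b"\xef\xbb\xbfhello"
```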
Popularity internally in software
In the memory of a computer program, usage of UTF-8 is even lower than in local disk files. UTF-16 is very common, particularly on Windows but also in JavaScript, Python,[21] Qt, and many other cross-platform software libraries. Compatibility with the Windows API, which at one point did not support UTF-8, is a major factor.
Recently it has become clear that the overhead of translating to and from UTF-8 on input and output, and of dealing with potential encoding errors in the input UTF-8, vastly outweighs any savings UTF-16 may offer, so newer software systems are starting to use UTF-8. International Components for Unicode (ICU) has historically used UTF-16, and still does so only for Java; for C/C++, UTF-8 is now supported as the "Default Charset",[22] including the correct handling of "illegal UTF-8".[23] The default string primitives of newer programming languages, such as Go,[24] Julia, Rust and Swift 5,[25] assume UTF-8 encoding, which is also used for their source code. PyPy also uses UTF-8 for its strings.[26] Microsoft now recommends the use of UTF-8 for applications using the Windows API, while continuing to maintain a legacy "Unicode" (meaning UTF-16) interface.[27]
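The transcoding and error handling mentioned above can be sketched in Python: software with UTF-16 internals must convert UTF-8 input and decide what to do with invalid byte sequences, and replacing each bad byte with U+FFFD is the widely recommended behavior (the input bytes here are illustrative):

```python
# UTF-8 input containing two bytes that can never appear in valid UTF-8.
raw = b"ok \xff\xfe bytes"

# Decode, substituting U+FFFD (the replacement character) for each bad byte,
# rather than aborting on the first error.
text = raw.decode("utf-8", errors="replace")
assert text == "ok \ufffd\ufffd bytes"

# Transcode to a UTF-16 internal representation; every character in this
# sample is in the Basic Multilingual Plane, so each takes one 16-bit unit.
utf16 = text.encode("utf-16-le")
assert len(utf16) == 2 * len(text)
```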
References
- Davis, Mark (2012-02-03). "Unicode over 60 percent of the web". Official Google Blog. Archived from the original on 2018-08-09. Retrieved 2020-07-24.
- Davis, Mark (2008-05-05). "Moving to Unicode 5.1". Retrieved 2021-02-19.
- "Usage Survey of Character Encodings broken down by Ranking". w3techs.com. Retrieved 2022-05-02.
- "Usage Statistics and Market Share of US-ASCII for Websites, August 2021". w3techs.com. Retrieved 2020-08-24.
- The Chinese standard GB 2312 and its extension GBK (both of which are interpreted by web browsers as GB 18030, which supports the same letters as UTF-8)
- "Distribution of Character Encodings among websites that use China and territories". w3techs.com. Retrieved 2022-05-02.
- "Distribution of Character Encodings among websites that use .cn". w3techs.com. Retrieved 2021-11-01.
- "Distribution of Character Encodings among websites that use Chinese". w3techs.com. Retrieved 2021-11-01.
- "Distribution of Character Encodings among websites that use Taiwan". w3techs.com. Retrieved 2022-04-08.
- "Distribution of Character Encodings among websites that use .ru". w3techs.com. Retrieved 2022-05-02.
- "Distribution of Character Encodings among websites that use Greek". w3techs.com. Retrieved 2021-05-15.
- "Distribution of Character Encodings among websites that use Hebrew". w3techs.com. Retrieved 2021-05-15.
- "Historical trends in the usage of character encodings". Retrieved 2022-03-30.
- "UTF-8 Usage Statistics". BuiltWith. Retrieved 2011-03-28.
- "Usage Report of UTF-8 broken down by Content Languages". w3techs.com. Retrieved 2022-05-02.
- "Distribution of Character Encodings among websites that use Bengali". w3techs.com. Retrieved 2021-02-24.
- "Distribution of Character Encodings among websites that use Iranian languages". w3techs.com. Retrieved 2018-12-03.
- "Distribution of Character Encodings among websites that use Sign Languages". w3techs.com. Retrieved 2018-12-03.
- "Charset". Android Developers. Retrieved 2021-01-02.
Android note: The Android platform default is always UTF-8.
- Galloway, Matt. "Character encoding for iOS developers. Or UTF-8 what now?". www.galloway.me.uk. Retrieved 2021-01-02.
in reality, you usually just assume UTF-8 since that is by far the most common encoding.
- "PEP 623 -- Remove wstr from Unicode". Python.org. Retrieved 2020-11-21.
Until we drop legacy Unicode object, it is very hard to try other Unicode implementation like UTF-8 based implementation in PyPy
- "UTF-8 - ICU User Guide". userguide.icu-project.org. Retrieved 2018-04-03.
- "#13311 (change illegal-UTF-8 handling to Unicode "best practice")". bugs.icu-project.org. Retrieved 2018-04-03.
- "The Go Programming Language Specification". Retrieved 2021-02-10.
- Tsai, Michael J. "Michael Tsai - Blog - UTF-8 String in Swift 5". Retrieved 2021-03-15.
- Mattip (2019-03-24). "PyPy Status Blog: PyPy v7.1 released; now uses utf-8 internally for unicode strings". PyPy Status Blog. Retrieved 2020-11-21.
- "Use the Windows UTF-8 code page". UWP applications. docs.microsoft.com. Retrieved 2020-06-06.