Character Encoding

When working with web pages, it is easy to forget that the resulting computer file doesn't consist of letters, numbers, and symbols, but of 0's and 1's.

As noted in the Wikipedia article on Character Encoding (https://en.wikipedia.org/wiki/Character_encoding), a variety of character encoding systems have been used in computers. ASCII and EBCDIC were developed in the 1960s and could represent the letters of the Latin alphabet, the digits 0 through 9, and some symbols. These required no more than 8 bits (1 byte). Representing characters in other languages, however, required more bits. The problem was that for the vast majority of computer users, who did not need these additional characters, the extra bits were a wasteful use of what were at that time expensive computer resources. Eventually, Unicode was developed, which in its latest version can represent 136,755 characters.

There are various character encodings that can implement Unicode (UTF-8, UCS-2, UTF-32), and the same character can take up a different number of bytes under each one, as the sketch below shows.
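
The following is a minimal sketch, assuming a Python 3 interpreter (Python is used here purely for illustration and is not part of the original text). It encodes one non-ASCII character under three Unicode encodings and prints the byte count and hex bytes for each. UTF-16 stands in for UCS-2, since Python does not expose a UCS-2 codec; the two agree for characters in the Basic Multilingual Plane. The little-endian ("-le") codec variants are chosen so Python does not prepend a byte-order mark.

    # One character, three encodings: same code point, different bytes.
    text = "é"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE (illustrative choice)

    for name in ("utf-8", "utf-16-le", "utf-32-le"):
        data = text.encode(name)
        print(name, len(data), "byte(s):", data.hex())

    # Prints:
    #   utf-8 2 byte(s): c3a9
    #   utf-16-le 2 byte(s): e900
    #   utf-32-le 4 byte(s): e9000000

Note that UTF-8 uses as few bytes as possible per character, while UTF-32 always uses four; this is the trade-off the earlier paragraph describes between compactness and coverage.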

The W3Schools page https://www.w3schools.com/charsets/ref_html_utf8.asp states:

The Difference Between Unicode and UTF-8

"Unicode is a character set. UTF-8 is encoding.

Unicode is a list of characters with unique decimal numbers (code points). A = 65, B = 66, C = 67, ....

This list of decimal numbers represent the string "hello": 104 101 108 108 111

Encoding is how these numbers are translated into binary numbers to be stored in a computer:

UTF-8 encoding will store "hello" like this (binary): 01101000 01100101 01101100 01101100 01101111

Encoding translates numbers into binary. Character sets translates characters to numbers."
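
The quoted distinction can be checked directly. In this minimal sketch (again Python, assumed only for illustration), ord() recovers the code points of "hello" from the character set, and .encode("utf-8") produces the stored bytes, matching the numbers and binary shown above.

    word = "hello"

    # Character set: characters -> numbers (code points)
    print([ord(ch) for ch in word])
    # [104, 101, 108, 108, 111]

    # Encoding: numbers -> binary, as stored in the computer file
    print([format(b, "08b") for b in word.encode("utf-8")])
    # ['01101000', '01100101', '01101100', '01101100', '01101111']

For these ASCII-range characters, each code point fits in a single UTF-8 byte, which is why the binary values correspond one-to-one with the decimal code points.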

Revised: September 13, 2017