In order to store or transport data as text, some special characters have to be encoded. These can be characters that cannot be displayed on the screen or sent over text-based protocols such as HTTP. Some written languages also use special signs that are translated into Unicode representations. It is also common to encode code written in languages such as JavaScript. Encoding is supported by most programming languages, such as PHP and Java.
Numerous ways to encode data exist, including:
- Unicode
- UTF-8
- Base64
- Hex Encoding
Unicode
The Unicode character set is in wide use today. One textual form of it uses a "%u" prefix followed by the two-byte (16-bit) Unicode representation of the encoded character in hexadecimal format. Unicode represents the characters it supports via numbers called code points. Code points range from 0x0 to 0x10FFFF, which is 17 planes of 2^16 (65,536) code points each.
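As a rough illustration of code points and the "%u" form, the following Python sketch may help. The helper name percent_u_encode is hypothetical and, unlike real-world encoders, it encodes every character (including plain ASCII) to keep the example short.

```python
def percent_u_encode(text):
    """Hypothetical helper: write each 16-bit unit of the text as %uXXXX."""
    units = text.encode("utf-16-be")  # two bytes per 16-bit unit, big-endian
    return "".join(
        f"%u{units[i]:02X}{units[i + 1]:02X}" for i in range(0, len(units), 2)
    )

euro = "\u20ac"  # the euro sign, code point U+20AC

print(hex(ord("A")))           # code point of "A"
print(hex(ord(euro)))          # code point of the euro sign
print(percent_u_encode(euro))  # %u20AC

# 17 planes of 2**16 code points cover exactly 0x0 .. 0x10FFFF
print(17 * 2**16 == 0x10FFFF + 1)
```

Characters above U+FFFF do not fit in a single 16-bit unit, so this sketch emits two "%u" groups for them.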
UTF-8
A character in UTF-8 encoding can be from 1 to 4 bytes long. UTF-8 can represent any character in the Unicode standard and is backwards compatible with ASCII. The first 128 characters of Unicode (which correspond one-to-one with ASCII) are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.
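The variable length and the ASCII compatibility can be checked with a short Python sketch. The sample characters and the helper name utf8_length are chosen here purely for illustration.

```python
def utf8_length(char):
    """Return how many bytes one character occupies in UTF-8."""
    return len(char.encode("utf-8"))

# One sample per length: "A", e with acute accent, euro sign, an emoji
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")

# Pure ASCII text yields identical bytes under both encodings
ascii_text = "GET /index.html"
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))
```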
Base64
Base64 translates binary data into text, representing the data using only printable ASCII characters. One Base64 digit represents 6 bits. Input usually arrives as 8-bit bytes, so every three bytes (24 bits) map to four Base64 digits. When the input length is not a multiple of three bytes, the output is padded so that the 6-bit and 8-bit groupings line up. The padding is usually one or two "=" characters, which makes Base64 easily recognizable.
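The padding behaviour can be seen with Python's standard base64 module. This is only a small sketch; the inputs and the helper name b64_text are arbitrary.

```python
import base64

def b64_text(raw):
    """Base64-encode bytes and return the result as an ASCII string."""
    return base64.b64encode(raw).decode("ascii")

# 3 bytes fit exactly into 4 digits; 2 bytes need one "="; 1 byte needs two
for raw in (b"abc", b"ab", b"a"):
    print(raw, "->", b64_text(raw))

# Decoding restores the original bytes
print(base64.b64decode("YWI=") == b"ab")
```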
Hex Encoding
Hex encoding is a very popular way to transport data using a hexadecimal (base 16) representation. One reason is that any value can be displayed or printed as hexadecimal digits, even when its ASCII representation cannot be printed on the screen. Each byte is represented by two 4-bit nibbles, each written as a hexadecimal digit from 0 to 9 or a to f. Depending on where the data is used, it can carry a prefix such as "\x" or "0x", or other variations, and sometimes no prefix at all.
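A brief Python sketch of the same bytes written with no prefix, with a "\x" prefix per byte, and with a "0x" prefix follows. The sample bytes and the helper name backslash_x are illustrative only.

```python
def backslash_x(data):
    """Write each byte as \\xNN, two hex digits (nibbles) per byte."""
    return "".join(f"\\x{b:02x}" for b in data)

# Includes bytes that have no printable ASCII representation
data = bytes([0x00, 0x7F, 0xFF])

print(data.hex())         # no prefix
print(backslash_x(data))  # "\x" prefix on every byte
print(hex(data[2]))       # "0x" prefix on a single value

# Hex decoding restores the original bytes
print(bytes.fromhex("007fff") == data)
```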
Character sets translate characters into numbers, whereas encodings translate those numbers into binary. Note that many alternative character sets and encoding techniques exist that are not mentioned here. Those mentioned above also exist in different versions and variations, such as UTF-16 and Base32.
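To make the distinction concrete, the Python sketch below takes one character, shows its single code point, and then shows how different encodings turn that number into different bytes. The choice of the euro sign and the helper name encode_hex are arbitrary.

```python
import base64

def encode_hex(char, encoding):
    """Encode one character and return the resulting bytes as hex digits."""
    return char.encode(encoding).hex()

euro = "\u20ac"

# Character set: one character maps to one number (the code point)
print(hex(ord(euro)))

# Encodings: the same number becomes different byte sequences
for name in ("utf-8", "utf-16-be", "utf-32-be"):
    print(name, encode_hex(euro, name))

# Base32 is a sibling of Base64 that uses 5 bits per digit
print(base64.b32encode(b"a").decode("ascii"))
```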
Further Reading
- Unicode http://www.unicode.org/consortium/consort.html
- UTF-8 https://en.wikipedia.org/wiki/UTF-8
- Base64 https://en.wikipedia.org/wiki/Base64
- Hex https://en.wikipedia.org/wiki/Hexadecimal