Practical aspects of Encoding for Programmers
Every time you hear about a String or Text, ask this question: what's the encoding? Phrases like "default encoding", "plain text" or "no encoding" might sound practical, but they have no meaning in reality. Whether you are designing a microservice or just facing an interviewer, start by clarifying the character set (i.e. charset) or encoding of the strings. This matters because computers store every character as a sequence of bits, and the encoding decides which bits stand for which character. If you want to go through encoding at a more fundamental level, I recommend this classic article. In this post, I will focus at a high level on some of the most commonly used encodings.
ASCII Character Set
ASCII is a charset that defines 128 characters. It includes the English letters (both lower case and upper case), the digits 0–9, punctuation and some control characters. ASCII represents every character using a number between 0 and 127: codes 0–31 and 127 are control characters, and 32–126 are printable. So just 7 bits (2⁷ = 128) are sufficient to represent ASCII characters. Many languages like French and German, and even individual companies, use the values from 128 to 255 to define their own character sets, known as Extended ASCII. So we need at most 8 bits to represent an Extended ASCII character set.
//Java
String input = "AbcHi";
int ascii = (int) input.charAt(0); // ascii = 65
char asciiToChar = (char) ascii;   // asciiToChar = 'A'
// More apt way, if you are NOT sure it's ASCII
int codePoint = String.valueOf('A').codePointAt(0); // codePoint = 65
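When you are not sure whether a string is pure ASCII, it can help to check before casting. The sketch below is one possible way to do that; the class name AsciiCheck and the helper isAscii are made up for this illustration.

```java
final class AsciiCheck {
    // Hypothetical helper: true when every char is in the 7-bit ASCII range (0-127).
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        System.out.println(isAscii("AbcHi"));     // true
        System.out.println(isAscii("caf\u00E9")); // false, U+00E9 is outside ASCII
    }
}
```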
Unicode Encoding
Most content on the Internet and over HTTP is encoded with UTF-8, a Unicode encoding. When your service responds to an HTTP request, it returns a Content-Type header, something like the one below:
Content-Type: text/html; charset=utf-8
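On the receiving side, the charset parameter tells you how to decode the body bytes. Here is a minimal sketch of reading it from a header value. The class ContentTypeCharset and the helper charsetOf are hypothetical, and the fallback charset is an assumption the caller has to supply; a real HTTP client library would normally do this parsing for you.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ContentTypeCharset {
    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or the caller-supplied fallback when there is no charset parameter.
    // Charset.forName throws if the name is not a supported charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                return Charset.forName(name);
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("text/html; charset=utf-8", StandardCharsets.ISO_8859_1);
        System.out.println(cs); // UTF-8
        byte[] body = {'O', 'K'};
        System.out.println(new String(body, cs)); // OK
    }
}
```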
Unicode defines a single character set that covers every reasonable writing system on this planet. In Unicode, a letter maps to a code point: the Unicode Consortium assigns a magic number to every letter.
A = U+0041
U+ means Unicode, and the digits that follow are hexadecimal. Code points range from U+0000 to U+10FFFF, so Unicode has room for 1,114,112 code points, well beyond what 2 bytes can hold (i.e. 65,536 values). So, NOT every Unicode character can be expressed in 2 bytes.
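Java makes the 2-byte boundary visible, because a Java char is 16 bits wide and a code point above U+FFFF takes two chars (a surrogate pair). The sketch below illustrates that with U+1F600; the class CodePointDemo and the helper fromCodePoint are names invented for this example.

```java
final class CodePointDemo {
    // Hypothetical helper: builds a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String a = fromCodePoint(0x41);        // U+0041, fits in one char
        String emoji = fromCodePoint(0x1F600); // U+1F600, needs two chars

        System.out.println(a.length());                                 // 1
        System.out.println(emoji.length());                             // 2
        System.out.println(emoji.codePointCount(0, emoji.length()));    // 1
        System.out.println(String.format("U+%04X", emoji.codePointAt(0))); // U+1F600
    }
}
```

This is why length() and charAt() can mislead on text outside the first 65,536 code points, and why codePointAt and codePointCount are the safer calls there.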
UTF-8: Every code point from 0–127 is stored in a single byte, so ASCII text is also valid UTF-8. Code points 128 and above are stored using 2 to 4 bytes. So, UTF-8 doesn't mean every character uses 8 bits.
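To see the variable width in practice, the following sketch counts the UTF-8 bytes for a few code points. The class Utf8Length and the helper utf8Length are illustrative names only, and the sample code points are arbitrary picks from each width class.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Length {
    // Hypothetical helper: number of bytes UTF-8 needs for one code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x41));    // 1 (U+0041, A)
        System.out.println(utf8Length(0xE9));    // 2 (U+00E9, e with acute accent)
        System.out.println(utf8Length(0x20AC));  // 3 (U+20AC, euro sign)
        System.out.println(utf8Length(0x1F600)); // 4 (U+1F600, an emoji)
    }
}
```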
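//Java
import java.nio.charset.StandardCharsets;

byte[] bytes = new byte[]{'A', 'B', 'C'};
// Always name the charset when turning bytes into a String
String byteToUnicode = new String(bytes, StandardCharsets.UTF_8);
System.out.println(byteToUnicode); // ABC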
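The charset argument matters as soon as the bytes go beyond ASCII. As a rough illustration of what happens when the wrong one is used, the sketch below encodes a word as UTF-8 and decodes it twice, once correctly and once as ISO-8859-1. The class WrongCharsetDemo and the helper decode are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongCharsetDemo {
    // Hypothetical helper: decodes bytes with an explicitly named charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9; UTF-8 stores U+00E9 as the two bytes 0xC3 0xA9
        byte[] bytes = "caf\u00E9".getBytes(StandardCharsets.UTF_8);

        String right = decode(bytes, StandardCharsets.UTF_8);
        String wrong = decode(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals("caf\u00E9"));       // true
        // ISO-8859-1 maps each byte to one character, so U+00E9 becomes U+00C3 U+00A9
        System.out.println(wrong.equals("caf\u00C3\u00A9")); // true
        System.out.println(right.length() + " vs " + wrong.length()); // 4 vs 5
    }
}
```

The bytes are identical in both cases; only the charset used to read them differs, which is exactly why "what's the encoding?" has to be asked first.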
Base 62 Encoding
Base62 is a binary-to-text encoding scheme (i.e. it takes binary data and converts it into text) that represents the data as an ASCII string by translating it into a radix-62 representation. Base62 has 62 characters: 26 upper-case letters from A to Z, 26 lower-case letters from a to z, and 10 digits from 0 to 9. Base62 is commonly used in URL shortening, where a Base 10 integer is converted to its Base62 value and back. Base64 uses the same 62 characters plus two more (typically + and /) and is used to carry binary data in text-based formats such as e-mail and HTML form data.
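For Base64, Java ships an implementation in java.util.Base64, so there is usually no need to write one by hand. Note that Base64 works on groups of 6 bits taken from a byte sequence and pads the output with =, rather than converting one integer to another radix. A minimal sketch follows; the class Base64Demo and its two helpers are illustrative names, and the use of UTF-8 for the text is an assumption of this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64Demo {
    // Hypothetical helper: text -> UTF-8 bytes -> Base64 string.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Hypothetical helper: Base64 string -> bytes -> text, assuming UTF-8.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("Hi"));   // SGk=
        System.out.println(decode("SGk=")); // Hi
    }
}
```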
The Java code below shows how to convert a long value to its Base62 encoding and then back from Base62 to the original value.
String charSet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

public String encodeBase62(long valueToBeEncoded) {
    if (valueToBeEncoded < 0) {
        throw new IllegalArgumentException("can't be negative");
    }
    if (valueToBeEncoded == 0) {
        // Without this case the loop never runs and 0 would encode to ""
        return String.valueOf(charSet.charAt(0));
    }
    StringBuilder encodedVal = new StringBuilder();
    while (valueToBeEncoded > 0) {
        encodedVal.append(charSet.charAt((int) (valueToBeEncoded % 62)));
        valueToBeEncoded /= 62;
    }
    // Digits were collected least significant first, so reverse them
    return encodedVal.reverse().toString();
}

public long decodeBase62(String b62) {
    long val = 0;
    for (char character : b62.toCharArray()) {
        int digit = charSet.indexOf(character);
        if (digit < 0) {
            throw new IllegalArgumentException("not a Base62 character: " + character);
        }
        val = val * 62 + digit; // most significant digit comes first
    }
    return val;
}

encodeBase62(987650); // EI50
decodeBase62("EI50"); // 987650
The approach used above is the same one we use to convert an int to binary and back; only the radix changes. Decimal numbers are a radix-10 representation.
toBinary(4) = 100
toInt(100) = 4
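The toBinary and toInt calls above can be sketched in Java as follows. toBinary is written by hand with the same divide-and-take-remainder loop as the Base62 encoder, while toInt leans on Integer.parseInt with radix 2. The class name RadixDemo is invented for this example, and the sketch assumes non-negative input.

```java
final class RadixDemo {
    // Same loop as the Base62 encoder, with radix 2 instead of 62.
    // Assumes value >= 0.
    static String toBinary(int value) {
        if (value == 0) {
            return "0";
        }
        StringBuilder bits = new StringBuilder();
        while (value > 0) {
            bits.append(value % 2); // least significant bit first
            value /= 2;
        }
        return bits.reverse().toString();
    }

    // The standard library already parses digits in any radix from 2 to 36.
    static int toInt(String bits) {
        return Integer.parseInt(bits, 2);
    }

    public static void main(String[] args) {
        System.out.println(toBinary(4)); // 100
        System.out.println(toInt("100")); // 4
    }
}
```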
This post doesn’t cover all possible encodings. It covers only the practical aspects of some of the commonly used ones.
yppaH gnidocne