I wanted to make a cheat sheet for myself as a reference for the things I use when working with Unicode in Vim, Python, Julia and Rust.

First some basics:

  1. Unicode code points (https://unicode.org/glossary/#code_point) are unique mappings from integers, usually written in hexadecimal, to an abstract character, concept or graphical representation. These graphical representations may look visually similar but can represent different “ideas”. For example, A, Α, А, A are all different Unicode code points.


    The Unicode Consortium defines a grapheme (https://unicode.org/glossary/#grapheme) as “what a user thinks of as a character”. Multiple code points may be used to represent a single grapheme. For example, my name in Devanagari and Tamil can be written as 3 graphemes, but it consists of 4 and 5 code points respectively in these scripts:

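    • DEVANAGARI: दीपक
      • द : U+0926 DEVANAGARI LETTER DA
      • ी : U+0940 DEVANAGARI VOWEL SIGN II
      • प : U+092A DEVANAGARI LETTER PA
      • क : U+0915 DEVANAGARI LETTER KA
    • TAMIL: தீபக்
      • த : U+0BA4 TAMIL LETTER TA
      • ீ : U+0BC0 TAMIL VOWEL SIGN II
      • ப : U+0BAA TAMIL LETTER PA
      • க : U+0B95 TAMIL LETTER KA
      • ் : U+0BCD TAMIL SIGN VIRAMA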

    Additionally, multiple “ideas” may be defined as a single code point. For example, the grapheme ﷺ translates to “peace be upon him” and is defined as the single code point U+FDFA.


    To make matters more complicated, graphemes and visual representations of code points may be wider than a single column, even in monospaced fonts. The code point U+FDFD (﷽) is one example.


    Code points fall into different categories: normal, pictographic, spacer, zero-width joiners, controls, etc.

  2. The same “idea”, i.e. the same code point, can be encoded into different bits when it has to be represented on a machine. The bits used depend on the encoding chosen. An encoding is a map or transformation of a code point into bits or bytes. For example, in Python the code point for 🐉 can be encoded as UTF-8, UTF-16 or UTF-32, and each encoding produces a different byte sequence.

    Python prints bytes as human-readable characters when they are printable ASCII characters. ASCII defines 128 characters, half of the 256 values a single 8-bit byte can take. Valid ASCII byte strings are also valid UTF-8 byte strings.

  3. When receiving or reading data, we must know the encoding that was used in order to interpret it correctly. Encoded data is not guaranteed to contain any information about its own encoding. Different encodings exist for efficiency, performance and backward compatibility. UTF-8 is a good pick for an encoding in the general case.
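
The three basics above can be explored from a Python prompt. The following is a minimal sketch using only the standard library; the helper `code_points` is a hypothetical name introduced for illustration, and the explicit `-le` encodings are chosen here only so that no byte order mark is added to the output.

```python
import unicodedata


def code_points(text):
    """Return (label, general category, name) for each code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# 1. One grapheme can span several code points, and one code point can
#    stand for a whole phrase. Escapes keep the exact code points explicit.
devanagari = "\u0926\u0940\u092a\u0915"     # दीपक
tamil = "\u0ba4\u0bc0\u0baa\u0b95\u0bcd"    # தீபக்
for text in (devanagari, tamil, "\ufdfa", "\u200d"):
    print(len(text), "code point(s)")
    for row in code_points(text):
        print("  ", *row)

# 2. The same code point becomes different bytes under different encodings.
dragon = "\U0001F409"  # 🐉
encoded = {enc: dragon.encode(enc) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
for enc, raw in encoded.items():
    print(enc, raw)

# 3. Decoding only round-trips when the right encoding is used.
raw_pi = "\u03c0".encode("utf-8")        # π
print(repr(raw_pi.decode("utf-8")))      # 'π'
print(repr(raw_pi.decode("latin-1")))    # 'Ï\x80', a wrong guess gives mojibake
```

Note that the UTF-16 and UTF-32 output contains `=` and `\t`: those bytes happen to be printable or escapable ASCII, so Python shows them as characters rather than as hex escapes.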

In Vim's insert mode, we can type Ctrl-v (see :help i_CTRL-V_digit for more information) followed by either:

  • a decimal number [0-255]. Ctrl-v255 will insert ÿ.
  • the letter o and then an octal number [0-377]. Ctrl-vo377 will insert ÿ.
  • the letter x and then a hex number [00-ff]. Ctrl-vxff will insert ÿ.
  • the letter u and then a 4-hexchar Unicode sequence. Ctrl-vu03C0 will insert π.
  • the letter U and then an 8-hexchar Unicode sequence. Ctrl-vU0001F409 will insert 🐉.
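
The numbers used in the Vim examples above can be cross-checked outside the editor. This is a small sketch in Python, assuming only the built-in `chr` and the standard `unicodedata` module; it is not part of Vim itself.

```python
import unicodedata

# Decimal 255, octal 377 and hex ff all name the same code point.
y_diaeresis = chr(255)
same_code_point = chr(255) == chr(0o377) == chr(0xFF)

# The digits typed after Ctrl-v u and Ctrl-v U are plain hexadecimal code points.
pi = chr(0x03C0)
dragon = chr(0x0001F409)

for ch in (y_diaeresis, pi, dragon):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```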

Using unicode.vim, we can use :UnicodeName to get the Unicode number of the code point under the cursor. With unicode.vim and fzf installed, you can even fuzzy find Unicode symbols.

Since Python 3.3, the Unicode string type has supported a “flexible string representation”. This means that one of several internal representations is used, depending on the largest Unicode ordinal in the string (1, 2, or 4 bytes per code point).

In the common case, a string used in the English-speaking world may only contain ASCII characters and is stored with one byte per code point, as in Latin-1. If the string contains code points beyond Latin-1 but within the Basic Multilingual Plane, it is stored as UCS-2, and if it contains any code point outside the Basic Multilingual Plane, it is stored as UCS-4.

In each of these cases, the internal representation uses the same number of bytes for every code point. This allows efficient indexing into a Python Unicode string, but indexing returns a code point and not a grapheme. The length of a Unicode string is defined as the number of code points in the string.
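
The effect of the flexible representation can be observed indirectly. The sketch below assumes CPython, where `sys.getsizeof` reports the in-memory size of a string object; `bytes_per_code_point` is a hypothetical helper that subtracts two sizes so the fixed object header cancels out. Other Python implementations may report different numbers.

```python
import sys


def bytes_per_code_point(ch, n=1000):
    """Estimate per-code-point storage by comparing two string sizes (CPython detail)."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n


# ASCII, Latin-1, Basic Multilingual Plane, and beyond the BMP.
for ch in ("a", "\u00ff", "\u03c0", "\U0001F409"):
    print(f"U+{ord(ch):04X}: {bytes_per_code_point(ch)} byte(s) per code point")
```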

As an example, let’s take this emoji: 🤦🏼‍♂️ [1]. This emoji actually consists of 5 code points (we can view this breakdown using uniview; in Vim, we can use :UnicodeName):

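  • 🤦 : U+1F926 FACE PALM
  • 🏼 : U+1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3
  • (no visible glyph) : U+200D ZERO WIDTH JOINER
  • ♂ : U+2642 MALE SIGN (Ml)
  • (no visible glyph) : U+FE0F VARIATION SELECTOR-16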

In Python, a string that contains just this emoji has length equal to 5.
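
A minimal sketch of that length check, building the emoji from its code point numbers with `chr` so that the invisible joiner and variation selector are explicit:

```python
import unicodedata

# FACE PALM + skin tone modifier + ZERO WIDTH JOINER + MALE SIGN + VARIATION SELECTOR-16
facepalm = "".join(chr(cp) for cp in (0x1F926, 0x1F3FC, 0x200D, 0x2642, 0xFE0F))

print(len(facepalm))  # 5
names = [unicodedata.name(ch) for ch in facepalm]
for ch, name in zip(facepalm, names):
    print(f"U+{ord(ch):04X} {name}")
```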

If we want to keep a Python file pure ASCII but still use Unicode in string literals, we can use the \U escape sequence (8 hex digits), along with \u (4 hex digits) and \N{...} (character name).
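
For example, the following source text is pure ASCII but produces non-ASCII strings at runtime. This is a sketch using Python's standard string escapes; `ascii()` is used to show the escaped form of a string.

```python
# \U takes exactly 8 hex digits, \u exactly 4, \N{...} a Unicode character name.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
pi = "\N{GREEK SMALL LETTER PI}"

print(len(facepalm))    # 5
print(pi == "\u03c0")   # True
print(ascii(pi))        # '\u03c0'
print(ascii(facepalm))
```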

As mentioned earlier, indexing into a Python Unicode string gives us the code point at that location.

Iterating over a Python string gives us the code points as well.
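
Both behaviours can be sketched with the same emoji, written with escapes so the individual code points are visible in the source:

```python
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

# Indexing returns a one-code-point string, not the whole emoji.
first, last = s[0], s[-1]
print(f"U+{ord(first):04X}", f"U+{ord(last):04X}")

# Iterating yields the same code points one at a time.
labels = [f"U+{ord(ch):04X}" for ch in s]
print(labels)
```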

However, in practice, indexing into a string may not be what we want or may not be useful. More often, we are either interested in:

  1. indexing into the byte string representation or
  2. indexing into the graphemes.

We can use the s.encode('utf-8') method to get a Python byte string representation of the Python Unicode string in s.
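
A short sketch of what that looks like for the emoji used earlier. Indexing a `bytes` object returns an integer, and slicing returns `bytes` that can be decoded again as long as the slice falls on code point boundaries.

```python
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
encoded = s.encode("utf-8")

print(len(s), len(encoded))          # 5 code points, 17 bytes
print(encoded[0])                    # 240, i.e. 0xF0
print(encoded[:4].decode("utf-8"))   # the first code point on its own
```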

If we are interested in the number of graphemes, we can use the grapheme package.
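
A minimal sketch, assuming the third-party `grapheme` package from PyPI is installed (`pip install grapheme`); the import is guarded so the snippet still runs without it. The result depends on the Unicode data bundled with the installed version, so no expected output is shown.

```python
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

try:
    import grapheme  # third-party package, not part of the standard library
except ImportError:
    grapheme = None

print(len(s))  # number of code points
if grapheme is not None:
    print(grapheme.length(s))             # number of grapheme clusters
    print(list(grapheme.graphemes(s)))    # the clusters themselves
else:
    print("grapheme is not installed")
```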

For historical reasons, Unicode allows the same set of characters to be represented by different sequences of code points.

We can use the unicodedata module from the standard library to normalize Python Unicode strings.
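
As a sketch, here is the letter é written two ways: as one precomposed code point and as a base letter followed by a combining accent. The two strings compare unequal until they are normalized to the same form.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(len(composed), len(decomposed))                        # 1 2
```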

It is best practice to start Python files that you expect to run as scripts with two lines: a shebang line followed by an encoding pragma.
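
A common form of these two lines is sketched below. It assumes a Unix-like system with `python3` on the PATH; the interpreter path and the `GREETING` variable are assumptions made for illustration.

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# With the pragma above, non-ASCII literals in this file are read as UTF-8.
GREETING = "π"
print(GREETING)
```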

If your Python files are part of a package, just adding the second line is sufficient. I recommend using pre-commit hooks to ensure that the encoding pragma of Python files is fixed before making a git commit.

Let’s take a look at how Julia handles strings.

Printing the length of a string that contains the same emoji returns 5 in Julia. As we saw earlier, this is the number of code points in the Unicode string.

Julia String literals are encoded using the UTF-8 encoding. In Python, indexing into a string returns the code point at that position. In Julia, string indices refer to code units (https://unicode.org/glossary/#code_unit), which for the default String type are bytes. Indexing returns the Char that starts at that byte index, and an index that falls in the middle of a character raises a StringIndexError.

If we want each code point in a Julia String, we can use eachindex to get the valid indices (see the Julia manual's strings documentation for more information: https://docs.julialang.org/en/v1/manual/strings/).

And finally, we can use the Unicode module that is built into the standard library to get the number of graphemes.

If we wish to encode a Julia string as UTF-8, we can use transcode (as of Julia v1.5.0, only conversion to/from UTF-8 is supported: https://docs.julialang.org/en/v1/base/strings/#Base.transcode).

Let’s also take a look at Rust. We can create a simple main.rs file:

And compile and run it like so:

