How to correctly use Unicode character codes

Hi all,

With very little understanding of unicode in general, I’m trying to understand how to add unicode character codes to a font map, as I’m working on a project that requires a few specific unicode characters. At the moment I’m fiddling with gl::TextureFont, as I see that it has some unicode characters in its default char set that looks like:

static std::string  defaultChars() { return "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890().?!,:;'\"&*=+-/\\@#_[]<>%^llflfiphrids\303\251\303\241\303\250\303\240"; }

Does anyone know where these last 8 character codes come from? I tried drawing them in the TextureFont sample by changing the input str given to mTextureFont->drawStringWrapped() to the following:

string str = "aaaa\303\251\303\241\303\250\303\240";

The result I got looks like (I switched to Arial font):

So first off, it looks like only four of the Unicode chars are being drawn. Next, I looked up the code for é using this handy web app (index 112), where I see that the Unicode value is 00E9, which doesn’t appear to be one of the values that I put into the sample. I take it that the value on the website is in hex, which corresponds to 233 in decimal, still not one of the values I inserted. So I’m still trying to find out what corresponds to what here.


Alternate question: I see in the content that I’m provided values like \u2022 for a bullet (can be found at index 135 in the web app I linked above) - how would I add the correct unicode character so that TextureFont can render this glyph?

Thanks all for the help. cheers,
Rich

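\303\251 etc. are octal-escaped UTF-8 bytes rather than code points. Each of those accented characters takes two bytes in UTF-8, so the eight escapes encode four characters (\303\251 is é, U+00E9), which is why only four glyphs are drawn. Hex is an equally valid choice.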
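To make the mapping from a code point like 00E9 to those escapes concrete, here is a minimal sketch of the standard UTF-8 encoding rules. The helper names encodeUtf8 and toOctalEscapes are made up for illustration and are not part of Cinder; the sketch also assumes the input is a valid code point and does no error checking.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not a Cinder API): encode one Unicode code point
// (assumed valid, up to U+10FFFF) as a UTF-8 byte sequence.
std::string encodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                      // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes: 11110xxx + three continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: show each byte as a C++ octal escape such as \303.
std::string toOctalEscapes(const std::string &bytes)
{
    std::string out;
    char buf[8];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(b));
        out += buf;
    }
    return out;
}
```

With this, toOctalEscapes(encodeUtf8(0xE9)) gives \303\251, matching the first pair in defaultChars(), and encodeUtf8(0x2022) gives the three bytes 0xE2 0x80 0xA2.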

To take your • example (\u2022), using the appropriate entry on a similar site and scrolling to the UTF-8 (hex) line, the multibyte sequence is 0xE2 0x80 0xA2. I’ll use hex here instead of octal, so the string might become

string str = "example: \xE2\x80\xA2";

And running that through the TextureFont sample (after modifying the gl::TextureFont construction appropriately) produces:

With one or two obscure exceptions Cinder uses UTF-8 throughout, and this hex escaping should work everywhere. If you don’t care about MSW, you can actually just embed \u2022 directly in your string. However things get subtle when you care about cross-platform, so personally I stick with octal or hex. This post describes some of the nuances.
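One subtlety with the hex form, sketched below: a C++ hex escape has no length limit, so if the next character in the literal happens to be a hex digit (0-9, a-f, A-F) the compiler treats it as part of the escape. Splitting the literal avoids that. Octal escapes stop after three digits, so they don't have this problem. The variable names here are only for illustration.

```cpp
#include <string>

// "\xA2" followed directly by 'b' would be parsed as the single escape \xA2b,
// so the literal is split in two; the compiler concatenates adjacent literals.
const std::string hexBullet = "\xE2\x80\xA2" "bullet";

// An octal escape consumes at most three digits, so no split is needed here.
const std::string octalBullet = "\342\200\242bullet";
```

Both strings hold the same nine bytes: the three UTF-8 bytes of the bullet followed by the six ASCII letters.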


Ah, that makes complete sense. Passing either octal or hex escapes works for me here as well, even though the JSON content that I’m rendering uses the UTF-16 (\u2022) values. jsoncpp must be doing the conversion internally, which is great. Thanks for the explanation, Andrew!

It’ll be nice once we’re all using VS 2015 / vc140, as it looks like it has much better support for UTF-16 and Unicode string literals.