[Solved] Handling invalid UTF-8 characters

Gazoo · June 19, 2018, 10:00am

Hey Embers,

TL;DR version:

I’d like to use Cinder’s toUtf16() function on a string like this “ñêQRV”, but this throws an invalid UTF-8 exception. I’m guessing I have to convert the characters to their Unicode values first, as the bytes in the string aren’t valid UTF-8 to begin with? Any tips/hints on doing this?

Long version:

I’m reading in text from MP3 ID3 tags and apparently they occasionally use a (now) deprecated encoding. Specifically, “UCS-2 encoded Unicode with BOM” according to the Wikipedia page. I spent at least an hour trying to find the ‘right way’ to properly decode it, until I eventually gave up and settled for a solution provided by a kind soul on the GitHub page for the ID3 decode library. I’m left with an std::string which will occasionally contain what are apparently invalid UTF-8 sequences, because when I use toUtf16() on it, I’m dealt an exception highlighting this.
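For reference, decoding “UCS-2/UTF-16 with BOM” by hand is not much code. The following is only a rough sketch, assuming you can get at the raw text bytes of the frame (the part after the ID3 encoding byte) as an std::string. The names appendUtf8 and utf16WithBomToUtf8 are made up for illustration and are not part of Cinder or any ID3 library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out`, encoded as UTF-8.
static void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: decode raw UTF-16/UCS-2 bytes that start with a BOM
// into UTF-8. Returns an empty string if no BOM is present (this sketch
// does not try to guess the byte order).
std::string utf16WithBomToUtf8(const std::string& bytes) {
    std::string out;
    if (bytes.size() < 2) return out;

    auto byteAt = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    bool littleEndian;
    if (byteAt(0) == 0xFF && byteAt(1) == 0xFE) littleEndian = true;
    else if (byteAt(0) == 0xFE && byteAt(1) == 0xFF) littleEndian = false;
    else return out;

    // Read the 16-bit code unit that starts at byte offset i.
    auto unitAt = [&](std::size_t i) -> std::uint32_t {
        return littleEndian ? (byteAt(i) | (byteAt(i + 1) << 8))
                            : ((byteAt(i) << 8) | byteAt(i + 1));
    };

    for (std::size_t i = 2; i + 1 < bytes.size(); i += 2) {
        std::uint32_t cp = unitAt(i);
        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < bytes.size()) {
            const std::uint32_t low = unitAt(i + 2);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        // A surrogate left on its own is not a character; substitute U+FFFD.
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;
        appendUtf8(out, cp);
    }
    return out;
}
```

The result is valid UTF-8, so it should be safe to hand to toUtf16() afterwards. A trailing odd byte is silently ignored in this sketch.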

I know UTF-16 is generally to be avoided as per the ever helpful @paul.houx, but since I only plan to use UTF-16 internally on the Windows platform for now, I’ve opted to make an exception.

I use it primarily so I can remove a single character at a time from a string and be sure it is a ‘complete’ character rather than a portion of one.
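As a side note, removing one whole character at a time is also possible directly on the UTF-8 string, without a UTF-16 round trip. A minimal sketch, assuming the string already holds valid UTF-8 (popLastCodePoint is a hypothetical name, not a Cinder function):

```cpp
#include <string>

// Hypothetical helper: remove the last code point from a valid UTF-8 string.
void popLastCodePoint(std::string& s) {
    // Drop trailing continuation bytes (bit pattern 10xxxxxx)...
    while (!s.empty() && (static_cast<unsigned char>(s.back()) & 0xC0) == 0x80) {
        s.pop_back();
    }
    // ...then the lead byte (or the lone ASCII byte) that started the sequence.
    if (!s.empty()) {
        s.pop_back();
    }
}
```

Keep in mind that UTF-16 has the same kind of problem in a milder form: code points above U+FFFF take two code units (a surrogate pair), so dropping a single char16_t is not always a complete character either.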

So I’m wondering what I’m to do - perhaps convert the characters to proper UTF-8 using toUtf8()? But if that’s the approach, how do I do this when these characters are in an std::string to begin with?

Thanks in advance,

Gazoo

lithium · June 19, 2018, 1:45pm

Since you’re on Windows only, you might be able to do the MultiByteToWideChar + std::wstring / std::u16string dance. Apologies in advance.

benjaminbojko · June 19, 2018, 1:55pm

Speaking a bit out of my comfort zone here, but on Windows we’ve had good luck with wstring_convert and bundled things in these conversion methods: https://github.com/bluecadet/Cinder-BluecadetText/blob/develop/src/bluecadet/text/Text.h#L270-L288. Perhaps there’s something helpful there.
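For anyone who can’t follow the link, a generic wstring_convert round trip looks roughly like the sketch below. This is not a copy of the linked code; tryUtf8ToUtf16 is a name invented here. Note that std::wstring_convert and <codecvt> are deprecated since C++17, so compilers may warn about it.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert UTF-8 to UTF-16.
// Returns false instead of throwing when the input is not valid UTF-8.
bool tryUtf8ToUtf16(const std::string& utf8, std::u16string& out) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    try {
        out = converter.from_bytes(utf8);
        return true;
    } catch (const std::range_error&) {
        // from_bytes throws std::range_error on a failed conversion.
        return false;
    }
}
```

A failed conversion here would be a hint that the std::string holds something other than UTF-8, which is the same situation that makes toUtf16() throw.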

xumo · June 19, 2018, 2:06pm

I use a Latin Windows install (Spanish and Portuguese), so when trying to draw system messages I use this replace method:

#include <iterator>
#include <string>
#include "utf8cpp/checked.h"

    std::string text = "Texto en español ";
    std::string temp;
    // Copy text into temp, replacing any invalid UTF-8 sequences
    utf8::replace_invalid(text.begin(), text.end(), std::back_inserter(temp));

Then I can pass temp to something like a ci::TextBox.
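One thing to be aware of: utf8::replace_invalid swaps each invalid sequence for a replacement marker (U+FFFD by default), so the ñ itself is lost. If the bytes are actually ISO-8859-1 (Latin-1), which is what ID3v2 uses for text encoding byte $00, each byte maps directly to the code point with the same value and can be re-encoded losslessly. A minimal sketch under that assumption (latin1ToUtf8 is a made-up name):

```cpp
#include <string>

// Hypothetical helper: treat every byte as an ISO-8859-1 code point
// and re-encode it as UTF-8.
std::string latin1ToUtf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size() * 2);
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            // ASCII is identical in both encodings.
            out += static_cast<char>(c);
        } else {
            // U+0080..U+00FF always needs exactly two UTF-8 bytes.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Only apply this when the input fails validation (utf8::is_valid from the same utf8cpp header can check that), otherwise text that is already UTF-8 gets encoded twice.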

Gazoo · June 20, 2018, 9:29am

Much appreciate all of the answers. It would seem to me that @xumo’s suggestion is the most cross-platform friendly. So I opted for that approach.

Honestly, I wasn’t even aware that Cinder had this functionality available.

Thanks everybody!

Cheers,

Gazoo
