[Solved] Handling invalid UTF-8 characters


#1

Hey Embers,

TL;DR version:

I’d like to use Cinder’s toUtf16() function on a string like this “ñêQRV”, but this leads to an invalid UTF-8 exception. I’m guessing I have to convert the characters to ones representing their Unicode value, as the bytes themselves aren’t valid UTF-8? Any tips/hints on doing this?

Long version:

I’m reading in text from MP3 ID3 tags, and apparently they occasionally use a (now) deprecated encoding: specifically, “UCS-2 encoded Unicode with BOM” according to the Wikipedia page. I spent at least an hour trying to find the ‘right way’ to decode it, until I eventually gave up and settled for a solution provided by a kind soul on the GitHub page for the ID3 decoding library. I’m left with an std::string that occasionally contains what are apparently invalid UTF-8 characters, because when I call toUtf16() on it, I get an exception saying so.
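For reference, decoding “UCS-2 with BOM” by hand is not much code. The following is a minimal sketch, assuming the raw frame bytes (BOM included) are available in a std::string; `ucs2WithBomToUtf8` is a hypothetical helper name, not part of Cinder or of any ID3 library, and it does not strip a trailing 00 00 terminator if one is present.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (assumption, not a Cinder or ID3 library call):
// decodes a UCS-2 byte buffer that starts with a BOM into UTF-8.
// Returns an empty string if the buffer is too short or has no BOM.
std::string ucs2WithBomToUtf8(const std::string& bytes)
{
    if (bytes.size() < 2) return {};

    const auto b0 = static_cast<unsigned char>(bytes[0]);
    const auto b1 = static_cast<unsigned char>(bytes[1]);

    bool littleEndian;
    if (b0 == 0xFF && b1 == 0xFE)      littleEndian = true;   // BOM FF FE
    else if (b0 == 0xFE && b1 == 0xFF) littleEndian = false;  // BOM FE FF
    else return {};

    std::string out;
    // Walk the payload two bytes at a time; a dangling odd byte is ignored.
    for (std::size_t i = 2; i + 1 < bytes.size(); i += 2) {
        const auto lo = static_cast<unsigned char>(bytes[littleEndian ? i : i + 1]);
        const auto hi = static_cast<unsigned char>(bytes[littleEndian ? i + 1 : i]);
        std::uint32_t cp = (static_cast<std::uint32_t>(hi) << 8) | lo;

        // UCS-2 has no surrogate pairs; substitute U+FFFD for stray surrogates.
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;

        // Encode the code point (at most U+FFFF) as 1 to 3 UTF-8 bytes.
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result is ordinary UTF-8, so it should be safe to hand to functions that expect valid UTF-8 input.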

I know UTF-16 is generally to be avoided as per the ever helpful @paul.houx, but since I only plan to use UTF-16 internally on the Windows platform for now, I’ve opted to make an exception.

I use it primarily so I can remove a single character at a time from a string and be sure it is a ‘complete’ character rather than a portion of one.
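Removing one whole code point at a time can also be done directly on UTF-8, without a round trip through UTF-16. A minimal sketch, assuming the string is already valid UTF-8; `popBackUtf8` is a hypothetical helper name, not a Cinder function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (assumption): removes the last code point from a
// string that is assumed to be valid UTF-8. Does nothing on an empty string.
void popBackUtf8(std::string& s)
{
    if (s.empty()) return;

    std::size_t i = s.size() - 1;
    // Continuation bytes have the bit pattern 10xxxxxx; step back over
    // them until the lead byte of the final sequence is reached.
    while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
        --i;
    }
    s.erase(i);  // erase from the lead byte to the end
}
```

Note that a code point is not always what a user sees as one character (combining accents, for example), and UTF-16 has the same caveat, plus surrogate pairs for code points above U+FFFF.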

So I’m wondering what I’m to do - perhaps convert the characters to proper UTF-8 using toUtf8()? But if that’s the approach, how do I do this when these characters are in an std::string to begin with?
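If the offending bytes turn out to be Latin-1 (ISO-8859-1), which is only a guess based on the “ñêQRV” example, then each byte maps directly to the Unicode code point of the same value and can be re-encoded as UTF-8. A rough sketch under that assumption; `latin1ToUtf8` is a hypothetical helper, and applying it to text that is already valid UTF-8 would double-encode it.

```cpp
#include <string>

// Hypothetical helper (assumption: the input really is Latin-1).
// Each Latin-1 byte equals its Unicode code point, so bytes below 0x80
// are copied and bytes 0x80..0xFF become a two-byte UTF-8 sequence.
std::string latin1ToUtf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return out;
}
```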

Thanks in advance,

Gazoo


#2

Since you’re on Windows only, you might be able to do the MultiByteToWideChar + std::wstring / std::u16string dance. Apologies in advance.


#3

Speaking a bit out of my comfort zone here, but on Windows we’ve had good luck with wstring_convert and bundled things in these conversion methods: https://github.com/bluecadet/Cinder-BluecadetText/blob/develop/src/bluecadet/text/Text.h#L270-L288. Perhaps there’s something helpful there.


#4

I use Windows with Latin locales (Spanish and Portuguese), so when trying to draw system messages I use this replace method:

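    #include <iterator>  // std::back_inserter
    #include <string>
    #include "utf8cpp/checked.h"

    std::string text = "Texto en español ";
    std::string temp;
    // Copy text into temp, replacing any invalid UTF-8 sequences
    // with the replacement character U+FFFD.
    utf8::replace_invalid(text.begin(), text.end(), std::back_inserter(temp));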

Then I can pass it to something like a ci::TextBox.
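To make clear what this kind of sanitizing does, here is a self-contained sketch of the idea: well-formed sequences are copied through and every offending byte becomes U+FFFD, so the original character is lost rather than recovered. `replaceInvalidUtf8` is a hypothetical stand-in written for illustration; it is not utfcpp’s implementation, and the number of replacement characters utfcpp emits for a given bad sequence may differ.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in (assumption, not utfcpp code): returns a copy of
// `in` where each byte that does not start a well-formed UTF-8 sequence
// is replaced by U+FFFD.
std::string replaceInvalidUtf8(const std::string& in)
{
    static const char kReplacement[] = "\xEF\xBF\xBD";  // U+FFFD as UTF-8
    // Smallest code point allowed for each sequence length (overlong check).
    static const unsigned long kMin[] = {0, 0, 0x80, 0x800, 0x10000};

    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // stays 0 if c cannot be a lead byte
        unsigned long cp = 0;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }

        // The sequence must fit in the input and use only 10xxxxxx bytes.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(in[i + k]);
            if ((cc & 0xC0) != 0x80) ok = false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        // Reject overlong encodings, surrogates and values above U+10FFFF.
        if (ok && (cp < kMin[len] || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }

        if (ok) { out.append(in, i, len); i += len; }
        else    { out += kReplacement;    i += 1;   }
    }
    return out;
}
```

With Latin-1 input such as the byte F1 for “ñ”, the output is displayable but shows a replacement character in place of the “ñ”.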


#5

I much appreciate all of the answers. It seems to me that @xumo’s suggestion is the most cross-platform friendly, so I opted for that approach.

Honestly, I wasn’t even aware that Cinder had this functionality available.

Thanks everybody!

Cheers,

Gazoo