[redland-dev] unsigned char* to std::wstring?

Michael Stahl Michael.Stahl at Sun.com
Thu Jun 3 11:49:59 CEST 2010


On 03/06/2010 10:23, Joe Andrieu wrote:
> I understand that Redland uses UTF-8 internally and in its API.
> 
> However, it isn't clear to me the right way to convert from those 
> unsigned char* strings to std::wstring or std::string classes, which is 
> what my program uses internally.
> 
> I have the code "working" using std::string, but I would bet that my 
> code wouldn't handle non-ascii characters properly.
> 
> Can anyone provide some guidance?
> 
> -j

AFAIK the C++ standard library does not provide a string type that can
reliably store Unicode.

presumably this is because it's difficult to implement random-access
operations on variable-length encodings like UTF-8.
[but of course in practice nobody really wants to randomly access
individual characters in a string; what people want is to iterate over the
string, yielding a Unicode character each step, and there's not much of a
problem with that]

basically you can put a UTF-8 encoded string into std::string, but:
- all the methods work on the individual bytes
- there's no support for accessing the individual UTF-8 characters
- it's far too easy to shoot yourself in the foot

so this does not look like an approach that yields reliable programs.

then there is std::wstring, which uses wchar_t, but it's not usable
either, because on lots of C++ implementations wchar_t is just 16 bits in
size, which is not sufficient to represent all Unicode characters.

if you use a 16-bit wchar_t with UTF-16 encoding, then you get the same
disadvantages as when storing UTF-8 in std::string; they're just more
difficult to detect, because they only occur with characters that are
seldom used.

we have some experience with that particular problem in OOo; the "default"
one of our ~6 different string classes (::rtl::OUString) uses UTF-16
encoding and exposes this encoding to client code, and a non-exhaustive
list of problems that result from that choice can be found here:
http://qa.openoffice.org/issues/show_bug.cgi?id=102943

imho if there is a reliable and portable way to store Unicode in a
standard C++ string, then it is to instantiate the basic_string template
with a 32-bit integer type, and use that.
but i've never tried it, so i don't know what disadvantages that approach
may have.

regards,
 michael

-- 
"I believe in Spinoza's God who reveals himself in the orderly harmony
 of what exists, not in a God who concerns himself with the fates and
 actions of human beings." -- Albert Einstein
