Unicode string indexing in C++

I come from Python, where you can use 'string[10]' to access a character in a sequence, and if the string is encoded in Unicode it gives me the expected results. However, when I use indexing on a string in C++, it works as long as the characters are ASCII, but when I use a Unicode character inside the string and index into it, the output is an octal representation like /201. For example:

    string ramp = "ÐðŁłŠšÝýÞþŽž";
    cout << ramp << "\n";
    cout << ramp[5] << "\n";

Output:

    ÐðŁłŠšÝýÞþŽž
    /201

Why is this happening, and how can I access that character in its string representation, or how can I convert the octal representation to the actual character?

Standard C++ is not equipped for proper handling of Unicode, which gives you problems like the one you observed.

The problem here is that C++ predates Unicode by a comfortable margin. This means that even that string literal of yours will be interpreted in an implementation-defined manner, because those characters are not defined in the basic source character set (which is, basically, the ASCII-7 characters minus @, $, and the backtick).
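
To see where the 201 in the question comes from, here is a minimal sketch. It assumes the literal ended up stored as UTF-8, which is a common setup but not something the standard guarantees; to stay independent of the source file's encoding, the bytes are spelled out explicitly. The names octal_escape and ramp_utf8 are made up for this illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// First six characters of the ramp, "ÐðŁłŠš", written as explicit UTF-8
// bytes (assumption: UTF-8 is the encoding the compiler would have chosen).
const std::string ramp_utf8 = "\xC3\x90\xC3\xB0\xC5\x81\xC5\x82\xC5\xA0\xC5\xA1";

// Formats the byte at `index` as a three-digit octal escape such as "\201".
std::string octal_escape(const std::string& s, std::size_t index) {
    char buf[8];
    const unsigned value = static_cast<unsigned char>(s[index]);
    std::snprintf(buf, sizeof buf, "\\%03o", value);
    return buf;
}
```

Here ramp_utf8.size() is 12, not 6, because std::string counts bytes. octal_escape(ramp_utf8, 5) gives \201: index 5 is the second byte of 'Ł' (0xC5 0x81), and 0x81 is 201 in octal. A lone continuation byte is not a printable character, so the terminal shows it as an escape.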

C++98 does not mention Unicode at all. It mentions wchar_t, and wstring being based on it, specifying wchar_t as being capable of "representing any character in the current locale". But that did more damage than good...

Microsoft defined wchar_t as 16 bit, which was enough for the Unicode code points at that time. However, Unicode has since been extended beyond the 16-bit range, and Windows' 16-bit wchar_t is not "wide" anymore, because you need two of them to represent characters beyond the BMP. The Microsoft docs are also notoriously ambiguous as to whether wchar_t means UTF-16 (a multibyte encoding with surrogate pairs) or UCS-2 (a wide encoding with no support for characters beyond the BMP).

All the while, a Linux wchar_t is 32 bit, which is wide enough for UTF-32...

C++11 made significant improvements on the subject, adding char16_t and char32_t, including their associated string variants, to remove the ambiguity, but it is still not fully equipped for Unicode operations.
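
As a small illustration of why the width matters, the sketch below stores one character from beyond the BMP (U+1F600, chosen arbitrarily) in both C++11 string types. The names grin16, grin32 and is_surrogate are made up for this example.

```cpp
#include <string>

// U+1F600 lies outside the BMP. The \U escape keeps the example independent
// of the source file's encoding.
const std::u16string grin16 = u"\U0001F600";  // UTF-16: one surrogate pair
const std::u32string grin32 = U"\U0001F600";  // UTF-32: one code unit

// True if a UTF-16 code unit is one half of a surrogate pair.
constexpr bool is_surrogate(char16_t unit) {
    return unit >= 0xD800 && unit <= 0xDFFF;
}
```

grin16.size() is 2 and grin16[0] is the high surrogate 0xD83D, so indexing a UTF-16 string can land in the middle of a character. grin32.size() is 1 and grin32[0] is the code point 0x1F600 itself.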

Just as one example, try to convert e.g. German "Fuß" to uppercase and you will see what I mean. (The single letter 'ß' would need to expand to 'SS', which the standard functions, handling one character in, one character out at a time, cannot do.)
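
A rough sketch of that limitation, using only the standard library. The helper name upper_per_char is made up here, and what std::towupper does with 'ß' depends on the active locale; the point is only that it returns exactly one character for each character it is given.

```cpp
#include <cwctype>
#include <string>

// Uppercases a wide string the only way the standard character functions
// allow: one character in, one character out.
std::wstring upper_per_char(const std::wstring& in) {
    std::wstring out;
    out.reserve(in.size());
    for (wchar_t c : in) {
        out.push_back(static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(c))));
    }
    return out;
}
```

upper_per_char(L"Fu\u00DF") always has three characters, so it can never be the four-character "FUSS", whatever the locale.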

However, there is help. The International Components for Unicode (ICU) library is fully equipped to handle Unicode in C++. As for specifying special characters in source code, you will have to use u8"", u"", and U"" to enforce interpretation of the string literal as UTF-8, UTF-16, and UTF-32 respectively, either using octal / hexadecimal escapes or relying on your compiler implementation to handle non-ASCII-7 encodings appropriately.
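
A minimal sketch of the three prefixes applied to the ramp from the question. The characters are written as universal character names (\uXXXX) so that nothing depends on how the compiler reads the source file; the helper code_units and the constant names are made up for this example.

```cpp
#include <cstddef>

// Number of code units in a string literal, not counting the terminating
// null. Works for char, char8_t (C++20), char16_t and char32_t arrays.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// "ÐðŁłŠšÝýÞþŽž" in each encoding.
constexpr std::size_t utf8_units = code_units(
    u8"\u00D0\u00F0\u0141\u0142\u0160\u0161\u00DD\u00FD\u00DE\u00FE\u017D\u017E");
constexpr std::size_t utf16_units = code_units(
    u"\u00D0\u00F0\u0141\u0142\u0160\u0161\u00DD\u00FD\u00DE\u00FE\u017D\u017E");
constexpr std::size_t utf32_units = code_units(
    U"\u00D0\u00F0\u0141\u0142\u0160\u0161\u00DD\u00FD\u00DE\u00FE\u017D\u017E");

// Indexing the UTF-32 literal yields a whole code point: U+0161 is 'š'.
constexpr char32_t sixth =
    U"\u00D0\u00F0\u0141\u0142\u0160\u0161\u00DD\u00FD\u00DE\u00FE\u017D\u017E"[5];
```

The same twelve characters take 24 code units in UTF-8 and 12 in UTF-16 and UTF-32. Only in the last two does index 5 correspond to the sixth character, and for UTF-16 only because every character in this particular string is inside the BMP.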

And even then you will get an integer value for std::cout << ramp[5], because for C++, a character is just an integer with semantic meaning. ICU's ustream.h provides operator<< overloads for the icu::UnicodeString class, but ramp[5] is just a 16-bit unsigned integer (1), and people would look askance at you if their unsigned short were suddenly interpreted as a character. You need the C-API u_fputs() / u_printf() / u_fprintf() functions for that.

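    #include <unicode/unistr.h>
    #include <unicode/ustream.h>
    #include <unicode/ustdio.h>

    #include <iostream>

    int main()
    {
        // Make sure your source file is UTF-8 encoded...
        icu::UnicodeString ramp( icu::UnicodeString::fromUTF8( "ÐðŁłŠšÝýÞþŽž" ) );
        // operator<< from ustream.h prints the whole string.
        std::cout << ramp << "\n";
        // A single code unit is just an integer to iostreams.
        std::cout << ramp[5] << "\n";
        // %C formats a UChar (UTF-16 code unit) as a character.
        u_printf( "%C\n", ramp[5] );
    }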

Compiled with g++ -std=c++11 testme.cpp -licuio -licuuc, this prints:

    ÐðŁłŠšÝýÞþŽž
    353
    š

(1) ICU uses UTF-16 internally, and UnicodeString::operator[] returns a code unit, not a code point, so you might end up with one half of a surrogate pair. Look up the API docs for the various other ways to index a Unicode string.
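
If pulling in ICU is not an option and all you need is indexing by code point, one rough alternative is to decode the UTF-8 bytes into a std::u32string first. The sketch below is hand-rolled and not part of ICU or the standard library; decode_utf8 is a made-up name, and it assumes the input is well-formed UTF-8 (malformed sequences are not detected).

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into UTF-32 code points.
// Assumption: `bytes` is well-formed UTF-8; no validation is performed.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size();) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // The lead byte tells how many bytes the sequence occupies.
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        char32_t cp = lead;
        if (len > 1) {
            cp &= 0xFFu >> (len + 1);  // keep only the payload bits of the lead byte
        }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len && i + k < bytes.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3Fu);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After decoding, index 5 of the ramp is the code point U+0161 ('š') rather than a stray byte. This still indexes code points, not user-perceived characters, so combining sequences remain a job for a full Unicode library.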
