I come from Python, where you can use 'string[10]' to access a character in a sequence, and if the string is encoded in Unicode it gives me the expected results. However, when I use indexing on a string in C++, it works as long as the characters are ASCII, but when I use a Unicode character inside the string and use indexing, the output is an octal representation like /201. For example:

    string ramp = "ÐðŁłŠšÝýÞþŽž";
    cout << ramp << "\n";
    cout << ramp[5] << "\n";

Output:

    ÐðŁłŠšÝýÞþŽž
    /201

Why is this happening, and how can I access that character in the string representation, or how can I convert the octal representation to the actual character?
Standard C++ is not equipped for proper handling of Unicode, which gives you problems like the one you observed.
The problem here is that C++ predates Unicode by a comfortable margin. This means that even that string literal of yours is interpreted in an implementation-defined manner, because those characters are not defined in the basic source character set (which is, basically, the ASCII-7 characters minus @, $, and the backtick).
C++98 does not mention Unicode at all. It mentions wchar_t, and wstring being based on it, specifying wchar_t as being capable of "representing any character in the current locale". But that did more damage than good...
Microsoft defined wchar_t as 16 bits, which was enough for the Unicode code points at that time. However, Unicode has since been extended beyond the 16-bit range... and Windows' 16-bit wchar_t is not "wide" anymore, because you need two of them to represent characters beyond the BMP -- and the Microsoft docs are notoriously ambiguous as to whether wchar_t means UTF-16 (multibyte encoding with surrogate pairs) or UCS-2 (wide encoding with no support for characters beyond the BMP).
All the while, wchar_t on Linux is 32 bits, which is wide enough for UTF-32...
C++11 made significant improvements on the subject, adding char16_t and char32_t, including their associated string variants, to remove the ambiguity, but it is still not fully equipped for Unicode operations.
Just as one example, try to convert e.g. German "Fuß" to uppercase and you will see what I mean. (The single letter 'ß' would need to expand to 'SS', which the standard functions -- handling one character in, one character out at a time -- cannot do.)
However, there is help. The International Components for Unicode (ICU) library is fully equipped to handle Unicode in C++. As for specifying special characters in source code, you will have to use u8"", u"", and U"" to enforce interpretation of the string literal as UTF-8, UTF-16, and UTF-32 respectively, using octal / hexadecimal escapes or relying on your compiler implementation to handle non-ASCII-7 encodings appropriately.
And even then you will get an integer value for std::cout << ramp[5], because for C++, a character is just an integer with semantic meaning. ICU's ustream.h provides operator<< overloads for the icu::UnicodeString class, but ramp[5] is just a 16-bit unsigned integer (1), and people would look askance at you if their unsigned short were suddenly interpreted as characters. You need the C-API u_fputs() / u_printf() / u_fprintf() functions for that.
    #include <unicode/unistr.h>
    #include <unicode/ustream.h>
    #include <unicode/ustdio.h>

    #include <iostream>

    int main()
    {
        // make sure your source file is UTF-8 encoded...
        icu::UnicodeString ramp( icu::UnicodeString::fromUTF8( "ÐðŁłŠšÝýÞþŽž" ) );
        std::cout << ramp << "\n";      // whole string, via ustream.h
        std::cout << ramp[5] << "\n";   // a single code unit, printed as an integer
        u_printf( "%C\n", ramp[5] );    // the same code unit, printed as a character
    }

Compiled with g++ -std=c++11 testme.cpp -licuio -licuuc.

    ÐðŁłŠšÝýÞþŽž
    353
    š

(1) ICU uses UTF-16 internally, and UnicodeString::operator[] returns a code unit, not a code point, so you might end up with one half of a surrogate pair. Look up the API docs for the various other ways to index a Unicode string.