c++ – std::wstring VS std::string

c++ – std::wstring VS std::string

string? wstring?

std::string is a basic_string templated on a char, and std::wstring on a wchar_t.

char vs. wchar_t

char is supposed to hold a character, usually an 8-bit character.
wchar_t is supposed to hold a wide character, and then, things get tricky:
On Linux, a wchar_t is 4 bytes, while on Windows, its 2 bytes.

What about Unicode, then?

The problem is that neither char nor wchar_t is directly tied to unicode.

On Linux?

Lets take a Linux OS: My Ubuntu system is already unicode aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. Unicode string of chars). The following code:

#include <cstring>
#include <iostream>

int main()
{
    const char text[] = olé;


    std::cout << sizeof(char)    :  << sizeof(char) << n;
    std::cout << text            :  << text << n;
    std::cout << sizeof(text)    :  << sizeof(text) << n;
    std::cout << strlen(text)    :  << strlen(text) << n;

    std::cout << text(ordinals)  :;

    for(size_t i = 0, iMax = strlen(text); i < iMax; ++i)
    {
        unsigned char c = static_cast<unsigned_char>(text[i]);
        std::cout <<   << static_cast<unsigned int>(c);
    }

    std::cout << nn;

    // - - -

    const wchar_t wtext[] = Lolé ;

    std::cout << sizeof(wchar_t) :  << sizeof(wchar_t) << n;
    //std::cout << wtext           :  << wtext << n; <- error
    std::cout << wtext           : UNABLE TO CONVERT NATIVELY. << n;
    std::wcout << Lwtext           :  << wtext << n;

    std::cout << sizeof(wtext)   :  << sizeof(wtext) << n;
    std::cout << wcslen(wtext)   :  << wcslen(wtext) << n;

    std::cout << wtext(ordinals) :;

    for(size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
    {
        unsigned short wc = static_cast<unsigned short>(wtext[i]);
        std::cout <<   << static_cast<unsigned int>(wc);
    }

    std::cout << nn;
}

outputs the following text:

sizeof(char)    : 1
text            : olé
sizeof(text)    : 5
strlen(text)    : 4
text(ordinals)  : 111 108 195 169

sizeof(wchar_t) : 4
wtext           : UNABLE TO CONVERT NATIVELY.
wtext           : ol�
sizeof(wtext)   : 16
wcslen(wtext)   : 3
wtext(ordinals) : 111 108 233

Youll see the olé text in char is really constructed by four chars: 110, 108, 195 and 169 (not counting the trailing zero). (Ill let you study the wchar_t code as an exercise)

So, when working with a char on Linux, you should usually end up using Unicode without even knowing it. And as std::string works with char, so std::string is already unicode-ready.

Note that std::string, like the C string API, will consider the olé string to have 4 characters, not three. So you should be cautious when truncating/playing with unicode chars because some combination of chars is forbidden in UTF-8.

On Windows?

On Windows, this is a bit different. Win32 had to support a lot of application working with char and on different charsets/codepages produced in all the world, before the advent of Unicode.

So their solution was an interesting one: If an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine, which could not be UTF-8 for a long time. For example, olé would be olé in a French-localized Windows, but would be something different on an cyrillic-localized Windows (olй if you use Windows-1251). Thus, historical apps will usually still work the same old way.

For Unicode based applications, Windows uses wchar_t, which is 2-bytes wide, and is encoded in UTF-16, which is Unicode encoded on 2-bytes characters (or at the very least, UCS-2, which just lacks surrogate-pairs and thus characters outside the BMP (>= 64K)).

Applications using char are said multibyte (because each glyph is composed of one or more chars), while applications using wchar_t are said widechar (because each glyph is composed of one or two wchar_t. See MultiByteToWideChar and WideCharToMultiByte Win32 conversion API for more info.

Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK or QT…). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted in wchar_t when using API like SetWindowText() (low level API function to set the label on a Win32 GUI).

Memory issues?

UTF-32 is 4 bytes per characters, so there is no much to add, if only that a UTF-8 text and UTF-16 text will always use less or the same amount of memory than an UTF-32 text (and usually less).

If there is a memory issue, then you should know than for most western languages, UTF-8 text will use less memory than the same UTF-16 one.

Still, for other languages (chinese, japanese, etc.), the memory used will be either the same, or slightly larger for UTF-8 than for UTF-16.

All in all, UTF-16 will mostly use 2 and occassionally 4 bytes per characters (unless youre dealing with some kind of esoteric language glyphs (Klingon? Elvish?), while UTF-8 will spend from 1 to 4 bytes.

See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.

Conclusion

  1. When I should use std::wstring over std::string?

    On Linux? Almost never (§).
    On Windows? Almost always (§).
    On cross-platform code? Depends on your toolkit…

    (§) : unless you use a toolkit/framework saying otherwise

  2. Can std::string hold all the ASCII character set including special characters?

    Notice: A std::string is suitable for holding a binary buffer, where a std::wstring is not!

    On Linux? Yes.
    On Windows? Only special characters available for the current locale of the Windows user.

    Edit (After a comment from Johann Gerell):
    a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:

    1. ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
    2. a char from 0 to 127 will be held correctly
    3. a char from 128 to 255 will have a signification depending on your encoding (unicode, non-unicode, etc.), but it will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
  3. Is std::wstring supported by almost all popular C++ compilers?

    Mostly, with the exception of GCC based compilers that are ported to Windows.
    It works on my g++ 4.3.2 (under Linux), and I used Unicode API on Win32 since Visual C++ 6.

  4. What is exactly a wide character?

    On C/C++, its a character type written wchar_t which is larger than the simple char character type. It is supposed to be used to put inside characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending…).

I recommend avoiding std::wstring on Windows or elsewhere, except when required by the interface, or anywhere near Windows API calls and respective encoding conversions as a syntactic sugar.

My view is summarized in http://utf8everywhere.org of which I am a co-author.

Unless your application is API-call-centric, e.g. mainly UI application, the suggestion is to store Unicode strings in std::string and encoded in UTF-8, performing conversion near API calls. The benefits outlined in the article outweigh the apparent annoyance of conversion, especially in complex applications. This is doubly so for multi-platform and library development.

And now, answering your questions:

  1. A few weak reasons. It exists for historical reasons, where widechars were believed to be the proper way of supporting Unicode. It is now used to interface APIs that prefer UTF-16 strings. I use them only in the direct vicinity of such API calls.
  2. This has nothing to do with std::string. It can hold whatever encoding you put in it. The only question is how You treat its content. My recommendation is UTF-8, so it will be able to hold all Unicode characters correctly. Its a common practice on Linux, but I think Windows programs should do it also.
  3. No.
  4. Wide character is a confusing name. In the early days of Unicode, there was a belief that a character can be encoded in two bytes, hence the name. Today, it stands for any part of the character that is two bytes long. UTF-16 is seen as a sequence of such byte pairs (aka Wide characters). A character in UTF-16 takes either one or two pairs.

c++ – std::wstring VS std::string

So, every reader here now should have a clear understanding about the facts, the situation. If not, then you must read paercebals outstandingly comprehensive answer [btw: thanks!].

My pragmatical conclusion is shockingly simple: all that C++ (and STL) character encoding stuff is substantially broken and useless. Blame it on Microsoft or not, that will not help anyway.

My solution, after in-depth investigation, much frustration and the consequential experiences is the following:

  1. accept, that you have to be responsible on your own for the encoding and conversion stuff (and you will see that much of it is rather trivial)

  2. use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String)

  3. accept that such an UTF8String object is just a dumb, but cheap container. Do never ever access and/or manipulate characters in it directly (no search, replace, and so on). You could, but you really just really, really do not want to waste your time writing text manipulation algorithms for multi-byte strings! Even if other people already did such stupid things, dont do that! Let it be! (Well, there are scenarios where it makes sense… just use the ICU library for those).

  4. use std::wstring for UCS-2 encoded strings (typedef std::wstring UCS2String) – this is a compromise, and a concession to the mess that the WIN32 API introduced). UCS-2 is sufficient for most of us (more on that later…).

  5. use UCS2String instances whenever a character-by-character access is required (read, manipulate, and so on). Any character-based processing should be done in a NON-multibyte-representation. It is simple, fast, easy.

  6. add two utility functions to convert back & forth between UTF-8 and UCS-2:

    UCS2String ConvertToUCS2( const UTF8String &str );
    UTF8String ConvertToUTF8( const UCS2String &str );
    

The conversions are straightforward, google should help here …

Thats it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String wherever the string must be parsed and/or manipulated. You can convert between those two representations any time.

Alternatives & Improvements

  • conversions from & to single-byte character encodings (e.g. ISO-8859-1) can be realized with help of plain translation tables, e.g. const wchar_t tt_iso88951[256] = {0,1,2,...}; and appropriate code for conversion to & from UCS2.

  • if UCS-2 is not sufficient, than switch to UCS-4 (typedef std::basic_string<uint32_t> UCS2String)

ICU or other unicode libraries?

For advanced stuff.

Leave a Reply

Your email address will not be published. Required fields are marked *