Solution

Without any additional set-up, wifstream will use the C locale, which is probably not what you want.

There are multiple possible solutions:

1. Using the default locale

To read files using the default locale, imbue the std::wifstream with std::locale(""), which selects the user's preferred locale from the environment. This is how most Linux programs work.

Example:

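#include <fstream>
#include <iostream>
#include <locale>
#include <ranges>

int main()
{
    std::wifstream stream("sample.txt");
    if (!stream)
    {
        std::cerr << "Failed to open file\n";
        return 1;
    }

    // Decode the file using the user's preferred locale (taken from the environment).
    stream.imbue(std::locale(""));

    // Note: views::istream extracts with operator>>, so whitespace is skipped.
    for (auto c : std::views::istream<wchar_t>(stream))
    {
        // Print the numeric value of each decoded character in hexadecimal.
        std::cout << std::hex << static_cast<int>(c) << std::endl;
    }

    if (stream.bad())
    {
        std::cerr << "Failed to extract character\n";
    }
}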

Most Linux distros default to UTF-8. Windows can be configured to use UTF-8 too, although I believe that's not the default.
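Since the question is about getting the contents into a wstring, here is a minimal sketch of a helper that reads a whole file at once. The name read_file_to_wstring and its signature are my own, not something from the standard library, and it takes the locale as a parameter so the same code can be used with any of the options in this answer.

```cpp
#include <fstream>
#include <iterator>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a standard function): reads the whole file at
// `path` into a std::wstring, decoding its bytes with the codecvt facet of `loc`.
std::wstring read_file_to_wstring(const std::string& path, const std::locale& loc)
{
    std::wifstream stream;
    stream.imbue(loc);   // imbue before opening, so the facet is in place for the first read
    stream.open(path);
    if (!stream)
    {
        throw std::runtime_error("Failed to open file: " + path);
    }

    // istreambuf_iterator reads unformatted characters, so whitespace is preserved.
    return std::wstring(std::istreambuf_iterator<wchar_t>(stream),
                        std::istreambuf_iterator<wchar_t>());
}
```

For option #1 you would call it as read_file_to_wstring("sample.txt", std::locale("")). Keep in mind that this sketch does not distinguish a decoding error from the end of the file; if that matters, check the stream state before returning.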

2. Using a UTF-8 locale

While there isn't really a fully portable way to do this, the following seems to work both on Windows and Linux (as long as the en_US.utf-8 locale is available):

stream.imbue(std::locale("en_US.utf-8"));
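Because the std::locale constructor throws std::runtime_error when the named locale isn't available, it may be worth trying a few names and falling back gracefully. The following is only a sketch: first_available_locale is a hypothetical helper, and which names to try (for example "en_US.utf-8", "en_US.UTF-8" or "C.UTF-8") is an assumption that depends on what is installed on the target system.

```cpp
#include <locale>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the first locale from `candidates` that can be
// constructed on this system, or std::nullopt if none of them is available.
std::optional<std::locale> first_available_locale(const std::vector<std::string>& candidates)
{
    for (const auto& name : candidates)
    {
        try
        {
            return std::locale(name);
        }
        catch (const std::runtime_error&)
        {
            // This locale name is not available here; try the next candidate.
        }
    }
    return std::nullopt;
}
```

You would then do something like auto loc = first_available_locale({"en_US.utf-8", "C.UTF-8"}); and only call stream.imbue(*loc) if loc has a value, reporting an error otherwise.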

3. Using the deprecated (C++17) / removed (C++26) `codecvt_utf8_utf16`

WARNING: Do NOT use in production. The facilities in <codecvt> have issues related to error handling and are set to be removed from the standard. I also wouldn't trust them to properly handle malformed (or potentially malicious) input.

// Requires #include <codecvt>
stream.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8_utf16<wchar_t>));

4. Using a library

This is realistically the best option for production code that can't simply rely on the default locale.

So which approach should you choose? It depends:

If you just want to do some quick testing, option #2 is probably the simplest. Just don't use it in production;
If it's acceptable (or even desirable, if you're targeting Linux) for your program to use the default locale, use option #1;
Otherwise, use #4.

2024-07-06

LHLaurini

How can I read files into wstring?
