Question

How can I read files into wstring?

I'm trying to open a file using wifstream:

#include <fstream>
#include <sstream>
#include <string>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}

It reads a file that contains English and Russian characters and places the text into a wstring.

Then I tried to print all of it to the console, and every Russian character was replaced with a different one. The English characters were correct.

I found that the character codes (like \u...) of the replaced symbols are all correct. They were also outside the ASCII table, like å - U+00E5 (en.wikipedia.org/wiki/List_of_Unicode_characters)

I think there's a problem with decoding the file as UTF-8.

1 Jan 1970

Solution


Without any additional set-up, wifstream will use the C locale, which is probably not what you want.
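
A quick way to check this, as a minimal sketch: a freshly constructed std::wifstream takes a copy of the global locale, which is the classic "C" locale unless the program has changed it. The helper name below is hypothetical and only serves the illustration.

```cpp
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: returns the name of the locale that a freshly
// constructed std::wifstream starts out with (a copy of the global locale).
std::string default_wifstream_locale_name()
{
    std::wifstream stream;
    return stream.getloc().name();
}
```

As long as nothing has called std::locale::global beforehand, this should return "C", which is why the bytes of the Russian text are not decoded as UTF-8.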

There are multiple possible solutions:


1. Using the default locale

To read files using the system's default locale (the user's preferred locale, taken from the environment), imbue the std::wifstream with std::locale(""). This is how most Linux programs work.

Example:

#include <fstream>
#include <iostream>
#include <locale>
#include <ranges>

int main()
{
    std::wifstream stream("sample.txt");
    if (!stream)
    {
        std::cerr << "Failed to open file\n";
        return 1;
    }

    stream.imbue(std::locale(""));

    for (auto c : std::views::istream<wchar_t>(stream))
    {
        std::cout << std::hex << static_cast<int>(c) << std::endl;
    }

    if (stream.bad())
    {
        std::cerr << "Failed to extract character\n";
    }
}

Most Linux distros default to UTF-8. Windows can be configured to use UTF-8 too, although I believe that's not the default.


2. Using a UTF-8 locale

While there isn't really a fully portable way to do this, the following seems to work both on Windows and Linux (as long as the en_US.utf-8 locale is available):

stream.imbue(std::locale("en_US.utf-8"));
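
Applied to the readFile function from the question, options 1 and 2 could look roughly like the sketch below. The function name read_file_as and its locale_name parameter are hypothetical. Passing "" selects the user's preferred locale (option 1), while a name such as "en_US.utf-8" selects a specific one (option 2), assuming that locale is installed. The sketch does not report decoding errors.

```cpp
#include <fstream>
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads a whole file into a std::wstring, decoding it
// with the named locale. std::locale's constructor throws std::runtime_error
// if the requested locale is not available on the system.
std::wstring read_file_as(const std::string& path, const std::string& locale_name)
{
    std::wifstream stream;
    // Imbue before opening so the locale is in place before any byte is read.
    stream.imbue(std::locale(locale_name));
    stream.open(path);
    if (!stream)
    {
        throw std::runtime_error("Failed to open " + path);
    }

    // Same approach as in the question: copy the stream buffer in one go.
    std::wstringstream buffer;
    buffer << stream.rdbuf();
    return buffer.str();
}
```

For example, read_file_as("sample.txt", "") would use the default locale, and read_file_as("sample.txt", "en_US.utf-8") would request the UTF-8 locale explicitly.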

3. Using the deprecated (C++17) / removed (C++26) codecvt_utf8_utf16

WARNING: Do NOT use in production. The facilities in <codecvt> have issues related to error handling and are set to be removed from the standard. I also wouldn't trust them to properly handle malformed (or potentially malicious) input.

stream.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8_utf16<wchar_t>));

4. Using a library

This is realistically the best option for production code that can't simply rely on the default locale.


So which approach should you choose? It depends:

  • If you just want to do some quick testing, option #2 is probably the simplest. Just don't use it in production;
  • If it's acceptable (or even desirable, if you're targeting Linux) for your program to use the default locale, use option #1;
  • Otherwise, use #4.
2024-07-06
LHLaurini