Question

fwscanf failing to read UTF-8 CSV file correctly in C

The program may only use the C standard library.

I'm trying to read a UTF-8 encoded CSV file in C using fwscanf, but I'm encountering issues with the reading process. The file contains rows with a string and a float value separated by a comma. Here's a minimal example demonstrating the problem:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

#define MAX_STRING_LENGTH 31

int main() {
    setlocale(LC_ALL, "en_US.UTF-8");
    FILE *file = fopen("input.csv", "r, ccs=UTF-8");
    if (file == NULL) {
        fwprintf(stderr, L"Error opening file.\n");
        return 1;
    }

    wchar_t string[MAX_STRING_LENGTH];
    float frequency;
    int row = 0;

    while (!feof(file)) {
        row++;
        int result = fwscanf(file, L"%30[^,],%f,", string, &frequency);
        
        if (result == 2) {
            wprintf(L"Row %d: String = '%ls', Frequency = %.4f\n", row, string, frequency);
        } else if (result == 1) {
            wprintf(L"Row %d: String = '%ls', Frequency not read\n", row, string);
        } else if (result == EOF) {
            break;
        } else {
            wprintf(L"Error reading row %d\n", row);
            wchar_t c;
            // Skip the rest of the line
            while ((c = fgetwc(file)) != L'\n' && c != WEOF);
        }
    }

    fclose(file);
    return 0;
}

Sample input.csv:

hello,1.0000
world,0.5000
how,0.7500
are,0.2500
you,1.0000
?,0.5000

Expected output:

Row 1: String = 'hello', Frequency = 1.0000
Row 2: String = 'world', Frequency = 0.5000
Row 3: String = 'how', Frequency = 0.7500
Row 4: String = 'are', Frequency = 0.2500
Row 5: String = 'you', Frequency = 1.0000
Row 6: String = '?', Frequency = 0.5000

The issue I'm facing is that fwscanf is not reading the file correctly. It either reads incorrect values or fails to read at all. I've tried using different locale settings and file opening modes, but the problem persists.



Solution

The argument string does not match the format string L"%30[^,],%f,". Without the l length modifier, %[ expects a pointer to a char array, which receives the wide characters read from the stream after conversion to their multibyte representation.

You want to perform the opposite task: convert the UTF-8 encoded input byte stream into a wide string, i.e. an array of wchar_t. Add the l length modifier and use fscanf(file, "%30l[^,],%f,", string, &frequency) for this instead.

Unless you need to use wide strings in the rest of the program, converting from UTF-8 seems unnecessary as this encoding is fully compatible with the CSV syntax and all its variants.

2024-07-03

chqrlie

Solution

The wide-oriented I/O functions such as fwscanf() are inappropriate for your use case. They expect input as a sequence of wide characters (where implementations have some latitude to define what that means), but UTF-8 input is not that. Implementations may vary, but it is probable that your fwscanf calls are attempting to read the file as if it were encoded in UCS-2 (which, for this purpose, is functionally equivalent to UTF-16). Similarly, your wprintf calls are probably not producing output encoded in a way consistent with your terminal configuration.

C has a sense of "multibyte characters", separate and distinct from "wide characters". The former are made up of two or more char units, and they are most naturally stored in arrays of char, possibly interspersed with single-byte characters. The latter are made up of a single wchar_t each, and are most naturally stored in arrays of wchar_t, in which case they cannot be interspersed with single-byte characters.

Your UTF-8 input best matches the former, and the byte-oriented I/O functions are best suited for reading and writing them. (And a terminal or other display device is responsible for interpreting the code sequences, so as to present corresponding graphic representations.) As a side note, C has had UTF-8 literals since C11, and these correspond to arrays of char.

So you are going to unneeded extra effort. Use the narrow I/O functions and regular strings instead of the wide-oriented I/O functions and wide strings.

Additionally:

- Consider not using fscanf (or fwscanf), as these are deceptively difficult to use correctly. One alternative is to read a line at a time with fgets(), then parse each line with sscanf().
- while (!feof(file)) is always wrong. feof() only reports end-of-file after a read has already failed, so test the return value of the read function itself.

2024-07-03

John Bollinger