Question
Python pandas read_sas with chunksize option fails with ValueError on index mismatch
I have a very large SAS file that won't fit in my server's memory. I simply need to convert it to a Parquet file. To do so, I am reading it in chunks using the chunksize option of the read_sas method in pandas. It mostly works, but after a while it fails with the error below. This particular SAS file has 79,422,642 rows of data, and it is not clear why it fails in the middle.
import pandas as pd
filename = 'mysasfile.sas7bdat'
SAS_CHUNK_SIZE = 2000000
sas_chunks = pd.read_sas(filename, chunksize=SAS_CHUNK_SIZE, iterator=True)
for sasDf in sas_chunks:
    print(sasDf.shape)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 340, in __next__
    da = self.read(nrows=self.chunksize or 1)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 742, in read
    rslt = self._chunk_to_dataframe()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 795, in _chunk_to_dataframe
    rslt[name] = pd.Series(self._string_chunk[js, :], index=ix)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/series.py", line 461, in __init__
    com.require_length_match(data, index)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
    raise ValueError(
ValueError: Length of values (2000000) does not match length of index (1179974)
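For reference, the conversion step I'm ultimately aiming for looks roughly like this (a minimal sketch assuming pyarrow is installed; the output filename mysasfile.parquet is just illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

filename = 'mysasfile.sas7bdat'
SAS_CHUNK_SIZE = 2000000

writer = None
sas_chunks = pd.read_sas(filename, chunksize=SAS_CHUNK_SIZE, iterator=True)
for sasDf in sas_chunks:
    # Convert the pandas chunk to an Arrow table
    table = pa.Table.from_pandas(sasDf)
    if writer is None:
        # Open the writer lazily so it inherits the schema of the first chunk
        writer = pq.ParquetWriter('mysasfile.parquet', table.schema)
    # Append the chunk to the same Parquet file
    writer.write_table(table)
if writer is not None:
    writer.close()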
I just tested the same chunked-read logic on a smaller SAS file with fewer rows, using a smaller chunk size, as follows. It works fine without any errors and also handles the last chunk, which is smaller than the chunk size parameter:
filename = 'mysmallersasfile.sas7bdat'
SAS_CHUNK_SIZE = 1000
sas_chunks = pd.read_sas(filename, chunksize=SAS_CHUNK_SIZE, iterator=True)
for sasDf in sas_chunks:
    print(sasDf.shape)
(1000, 5)
(1000, 5)
(1000, 5)
(1000, 5)
(983, 5)