Question

Python pandas read_sas with chunk size option fails with value error on index mismatch

I have a very large SAS file that won't fit in the memory of my server. I simply need to convert it to a Parquet file. To do so, I am reading it in chunks using the chunksize option of the read_sas method in pandas. It mostly works / does its job, but it fails with the following error after a while.

This particular SAS file has 79422642 rows of data. It is not clear why it fails in the middle.

import pandas as pd

filename = 'mysasfile.sas7bdat'
SAS_CHUNK_SIZE = 2000000

sas_chunks = pd.read_sas(filename, chunksize = SAS_CHUNK_SIZE, iterator = True)
for sasDf in sas_chunks:
    print(sasDf.shape)


(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 340, in __next__
    da = self.read(nrows=self.chunksize or 1)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 742, in read
    rslt = self._chunk_to_dataframe()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 795, in _chunk_to_dataframe
    rslt[name] = pd.Series(self._string_chunk[js, :], index=ix)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/series.py", line 461, in __init__
    com.require_length_match(data, index)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
    raise ValueError(
ValueError: Length of values (2000000) does not match length of index (1179974)

I just tested the same logic on a smaller SAS file with fewer rows, using a smaller chunk size, as follows. It works fine without any errors and also handles the last remaining chunk, which is smaller than the chunk size parameter:

filename = 'mysmallersasfile.sas7bdat'
SAS_CHUNK_SIZE = 1000

sas_chunks = pd.read_sas(filename, chunksize = SAS_CHUNK_SIZE, iterator = True)
for sasDf in sas_chunks:
    print(sasDf.shape)

(1000, 5)
(1000, 5)
(1000, 5)
(1000, 5)
(983, 5)
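
To narrow down where it breaks, something like the following could report the row count stored in the file header and how many rows actually come back before the error (a rough diagnostic sketch; row_count is an internal attribute of the SAS7BDAT reader, so it may differ across pandas versions):

import pandas as pd

filename = 'mysasfile.sas7bdat'
SAS_CHUNK_SIZE = 2000000

reader = pd.read_sas(filename, chunksize=SAS_CHUNK_SIZE, iterator=True)

# row_count comes from the sas7bdat header; treat it as a best-effort check
print('rows reported in the header:', getattr(reader, 'row_count', 'unknown'))

rows_read = 0
try:
    for chunk in reader:
        rows_read += len(chunk)
except ValueError as err:
    print(f'failed after reading {rows_read} rows: {err}')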


Solution


Perhaps try this code:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

filename = 'mysasfile.sas7bdat'
output_filename = 'output.parquet'
SAS_CHUNK_SIZE = 2000000

writer = None  # initialize writer

sas_chunks = pd.read_sas(filename, chunksize=SAS_CHUNK_SIZE, iterator=True)

for i, sasDf in enumerate(sas_chunks):
    print(f"Processing chunk {i+1} with shape {sasDf.shape}")
    
    table = pa.Table.from_pandas(sasDf) # convert pandas DF to Arrow table
        
    if writer is None:
        # Create new Parquet file with 1st chunk
        writer = pq.ParquetWriter(output_filename, table.schema)
    
    writer.write_table(table)  # write Arrow Table to Parquet file

if writer:
    writer.close()
  1. It reads the SAS file in chunks using the pd.read_sas function.
  2. pyarrow.parquet.ParquetWriter writes the data to a Parquet file while allowing chunks to be appended, which is suitable for such large datasets.
  3. Each chunk is converted to a pyarrow.Table and written to the Parquet file (a slightly more defensive variant is sketched below).
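
One possible refinement (a sketch along the same lines, not tested on your file): freeze the schema from the first chunk and cast each later chunk to it, so the writer does not fail if pandas infers slightly different dtypes for a later chunk, and close the writer in a finally block so the Parquet footer is always written:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

filename = 'mysasfile.sas7bdat'
output_filename = 'output.parquet'
SAS_CHUNK_SIZE = 2000000

writer = None
schema = None

try:
    for sasDf in pd.read_sas(filename, chunksize=SAS_CHUNK_SIZE, iterator=True):
        table = pa.Table.from_pandas(sasDf)
        if writer is None:
            schema = table.schema              # lock the schema to the first chunk
            writer = pq.ParquetWriter(output_filename, schema)
        else:
            table = table.cast(schema)         # align later chunks to that schema
        writer.write_table(table)
finally:
    if writer is not None:
        writer.close()                         # finalizes the Parquet file footer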

el_pazzu, 2024-07-22