cast_column(..., Audio) fails with load_dataset("csv",) #7970

@jstangroome

Describe the bug

Loading a dataset from a CSV file that has a single audio column, containing one row with a path to an audio file, fails when the column is cast to Audio. The exact same dataset created from a dictionary casts successfully.

Steps to reproduce the bug

  1. Have any valid audio file audio.wav
  2. Have a csv file named audio.csv with the following content:
"audio"
"audio.wav"
  3. Attempt to execute the following Python code:
from datasets import Audio, Dataset, load_dataset

dataset = Dataset.from_dict({"audio": ["audio.wav"]})
dataset = dataset.cast_column("audio", Audio())
print(dataset[0]["audio"])
# ^^ succeeds with output: <datasets.features._torchcodec.AudioDecoder object at 0x7a32b341a3c0>

dataset = load_dataset("csv", data_files="audio.csv")
dataset = dataset.cast_column("audio", Audio())
# ^^ errors and terminates
print(dataset["train"][0]["audio"])
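
For anyone reproducing this, the fixture files from steps 1 and 2 can be generated with the standard library alone. This is only a convenience sketch: the sample rate and the one second of silence are arbitrary choices, not part of the original report, and any valid audio file should behave the same.

```python
import csv
import wave

# Step 1: write one second of 16-bit mono silence as a valid WAV file.
SAMPLE_RATE = 16000  # arbitrary value chosen for this sketch
with wave.open("audio.wav", "wb") as wav:
    wav.setnchannels(1)   # mono
    wav.setsampwidth(2)   # 2 bytes per sample (16-bit PCM)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(b"\x00\x00" * SAMPLE_RATE)

# Step 2: write audio.csv with a quoted header and a single quoted path,
# matching the two-line file shown in the steps above.
with open("audio.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["audio"])
    writer.writerow(["audio.wav"])
```

Run it in the same directory as the reproduction script so that the relative path in the CSV resolves.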

The error is:

Traceback (most recent call last):
  File "~/datasets-bug/explore.py", line 8, in <module>
    dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/dataset_dict.py", line 337, in cast_column
    return DatasetDict({k: dataset.cast_column(column=column, feature=feature) for k, dataset in self.items()})
                           ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/fingerprint.py", line 468, in wrapper
    out = func(dataset, *args, **kwargs)
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/arrow_dataset.py", line 2201, in cast_column
    dataset._data = dataset._data.cast(dataset.features.arrow_schema)
                    ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1124, in cast
    return MemoryMappedTable(table_cast(self.table, *args, **kwargs), self.path, replays)
                             ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 2272, in table_cast
    return cast_table_to_schema(table, schema)
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 2224, in cast_table_to_schema
    cast_array_to_feature(
    ~~~~~~~~~~~~~~~~~~~~~^
        table[name] if name in table_column_names else pa.array([None] * len(table), type=schema.field(name).type),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        feature,
        ^^^^^^^^
    )
    ^
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1795, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
                             ~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1995, in cast_array_to_feature
    return feature.cast_storage(array)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/features/audio.py", line 272, in cast_storage
    return array_cast(storage, self.pa_type)
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1949, in array_cast
    return array.cast(pa_type)
           ~~~~~~~~~~^^^^^^^^^
  File "pyarrow/array.pxi", line 1147, in pyarrow.lib.Array.cast
  File "~/datasets-bug/.venv/lib/python3.14/site-packages/pyarrow/compute.py", line 412, in cast
    return call_function("cast", [arr], options, memory_pool)
  File "pyarrow/_compute.pyx", line 604, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 399, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from large_string to struct using function cast_struct

Expected behavior

An audio column of file paths loaded from a CSV file should be convertible to AudioDecoder objects, just as it is for an identical dataset created from a dict.

Environment info

datasets 4.3.0 and 4.5.0; Ubuntu 24.04 amd64; Python 3.13.11 and 3.14.2
