Open
Description
Describe the bug
Loading a dataset from a csv file that has a single audio column with one row containing a path to an audio file fails when the column is cast to Audio. The exact same dataset created from a dictionary casts and decodes successfully.
Steps to reproduce the bug
- Have any valid audio file audio.wav
- Have a csv file named audio.csv with the following content:
"audio"
"audio.wav"
- Attempt to execute the following python code:
from datasets import load_dataset, Audio, Dataset
dataset = Dataset.from_dict({"audio": ["audio.wav"]})
dataset = dataset.cast_column("audio", Audio())
print(dataset[0]["audio"])
# ^^ succeeds with output: <datasets.features._torchcodec.AudioDecoder object at 0x7a32b341a3c0>
dataset = load_dataset("csv", data_files="audio.csv")
dataset = dataset.cast_column("audio", Audio())
# ^^ errors and terminates
print(dataset[0]["audio"])
The error is:
Traceback (most recent call last):
File "~/datasets-bug/explore.py", line 8, in <module>
dataset = dataset.cast_column("audio", Audio(sampling_rate=24000))
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/dataset_dict.py", line 337, in cast_column
return DatasetDict({k: dataset.cast_column(column=column, feature=feature) for k, dataset in self.items()})
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/fingerprint.py", line 468, in wrapper
out = func(dataset, *args, **kwargs)
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/arrow_dataset.py", line 2201, in cast_column
dataset._data = dataset._data.cast(dataset.features.arrow_schema)
~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1124, in cast
return MemoryMappedTable(table_cast(self.table, *args, **kwargs), self.path, replays)
~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 2272, in table_cast
return cast_table_to_schema(table, schema)
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 2224, in cast_table_to_schema
cast_array_to_feature(
~~~~~~~~~~~~~~~~~~~~~^
table[name] if name in table_column_names else pa.array([None] * len(table), type=schema.field(name).type),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
feature,
^^^^^^^^
)
^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1795, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
~~~~^^^^^^^^^^^^^^^^^^^^^^^^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1995, in cast_array_to_feature
return feature.cast_storage(array)
~~~~~~~~~~~~~~~~~~~~^^^^^^^
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/features/audio.py", line 272, in cast_storage
return array_cast(storage, self.pa_type)
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1797, in wrapper
return func(array, *args, **kwargs)
File "~/datasets-bug/.venv/lib/python3.14/site-packages/datasets/table.py", line 1949, in array_cast
return array.cast(pa_type)
~~~~~~~~~~^^^^^^^^^
File "pyarrow/array.pxi", line 1147, in pyarrow.lib.Array.cast
File "~/datasets-bug/.venv/lib/python3.14/site-packages/pyarrow/compute.py", line 412, in cast
return call_function("cast", [arr], options, memory_pool)
File "pyarrow/_compute.pyx", line 604, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 399, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from large_string to struct using function cast_struct
Expected behavior
An audio column of file paths loaded from a csv should cast to Audio and yield AudioDecoder objects, the same as an identical dataset created from a dict.
Environment info
datasets 4.3.0 and 4.5.0, Ubuntu 24.04 amd64, python 3.13.11 and 3.14.2
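To make the reproduction self-contained, the two fixture files from the steps above could be generated with the standard library alone. This is only a sketch: the helper name write_fixtures, the 16 kHz sample rate, and the one-second 440 Hz tone are my own assumptions, since any valid audio file should trigger the same behavior.

```python
import csv
import math
import struct
import wave

SAMPLE_RATE = 16000  # assumed value; any valid rate should do
DURATION_S = 1.0     # assumed length of the test tone
FREQ_HZ = 440.0      # assumed pitch of the test tone


def write_fixtures(wav_path="audio.wav", csv_path="audio.csv"):
    """Write a short mono 16-bit WAV and a one-row csv that points at it."""
    n_frames = int(SAMPLE_RATE * DURATION_S)
    # 16-bit little-endian PCM samples of a sine wave at 30% amplitude
    frames = b"".join(
        struct.pack(
            "<h",
            int(0.3 * 32767 * math.sin(2 * math.pi * FREQ_HZ * i / SAMPLE_RATE)),
        )
        for i in range(n_frames)
    )
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)

    # Quote every field so the file matches the csv shown in the steps
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerow(["audio"])
        writer.writerow([wav_path])
    return wav_path, csv_path


if __name__ == "__main__":
    write_fixtures()
```

Running this in an empty directory and then running the reproduction script from the same directory should be enough to hit the error.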