Replies: 3 comments 4 replies
Hello, thank you for trying this out. Before addressing the specific issues, I want to emphasize from my personal perspective that I prefer the lerobot dataset over the rlds dataset. This is mainly because TensorFlow introduces many uncertainties and uncontrollable factors, and the rlds format conversion is very slow (it serializes image sequences into byte streams). Therefore, if possible, please try to avoid using the rlds dataset. Now, let’s address the specific problems, although some may not have immediate solutions: |
Thank you very much for your answer. I agree with you: the LeRobot dataset format is indeed much better than RLDS. However, this conversion is a task assigned by my superiors, so I still have to complete it. I will try the solution you provided and report back.
Hi @xliu0105, I forgot to mention that you can currently specify the image key you want to use by changing the code in any4lerobot/lerobot2rlds/lerobot2rlds.py, lines 33 to 40 and lines 66 to 72 (commit e3487cf). Make sure both places use the same image key :)
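As an illustration only (the actual field names live in the linked lines of lerobot2rlds.py, which are not reproduced here), the general pattern is that a feature schema and the per-step extraction code each name an image key, and the two must match; `IMAGE_KEY` below is a hypothetical name:

```python
# Hypothetical sketch: the schema and the per-step extraction must refer to the
# same image key, or the converter will miss the camera stream. The key name
# here is an assumption; use whatever key your LeRobot dataset actually has.
IMAGE_KEY = "observation.images.top"  # assumed name, place it in BOTH spots

feature_spec = {IMAGE_KEY: "Image(h, w, 3)"}   # spot 1: the dataset schema

def extract_step(frame):                        # spot 2: per-step data mapping
    return {IMAGE_KEY: frame[IMAGE_KEY]}

step = extract_step({IMAGE_KEY: b"frame-bytes"})
print(IMAGE_KEY in step and IMAGE_KEY in feature_spec)  # True
```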
Dear author, thank you very much for open-sourcing this repository; it makes converting imitation-learning datasets very convenient for us.
We are using your open-source code to convert the AgiBotWorld dataset to the LeRobot format, and then hope to use lerobot2rlds to convert the resulting LeRobot data to RLDS.
However, when converting LeRobot to RLDS: if Beam is not enabled, the whole conversion runs correctly but is extremely slow, and that time cost is unbearable for us. If Beam is enabled, we get various errors and warnings, and the job eventually terminates. I list the errors and warnings below, hoping for your help.
ValueError: Message tensorflow_copy.Example exceeds maximum protobuf size of 2GB: 5589367504 [while running 'train_write/Serialize']

2025-07-03 13:16:30,923 - INFO - Detected input queue delay longer than 300 seconds. Waiting to receive elements in input queue for instruction: bundle_64 for 11422.22 seconds.

2025-07-03 13:45:56,022 - WARNING - Worker: severity: WARN timestamp { seconds: 1751521551 nanos: 293240547 } message: "Data output stream buffer size 1835560355 exceeds 536870912 bytes. This is likely due to a large element in a PCollection. Large elements increase pipeline RAM requirements and can cause runtime errors. Prefer multiple small elements over single large elements in PCollections. If needed, store large blobs in external storage systems, and use PCollections to pass their metadata, or use a custom coder that reduces the element's size." instruction_id: "bundle_90" transform_id: "train_write/Serialize" log_location: "/personal/liuxu/miniforge3/envs/any4lerobot/lib/python3.10/site-packages/apache_beam/runners/worker/data_plane.py:169" thread: "Thread-13"

2025-07-03 13:46:03,876 - INFO - Worker: severity: INFO timestamp { seconds: 1751521563 nanos: 821125984 } message: "processing episode 89" instruction_id: "bundle_90" transform_id: "train/Map(functools.partial(<function DatasetBuilder._generate_examples.<locals>._generate_examples_beam at 0x7fc8b49e9360>, raw_dir=PosixPath('/robby/share/robbyvla/liuxu/datasets/lerobot/AgiBotWorld-Alpha-Lerobot2.1/agibotworld/task_327')))" log_location: "/personal/liuxu/utils/any4lerobot/lerobot2rlds/lerobot2rlds.py:120" thread: "Thread-13"

ValueError: Buffer size 2221960756 exceeds GRPC limit 2147483548. This is likely due to a single element that is too large. To resolve, prefer multiple small elements over single large elements in PCollections. If needed, store large blobs in external storage systems, and use PCollections to pass their metadata, or use a custom coder that reduces the element's size. [while running 'train_write/Serialize']

We used the following command to set up the environment; after installing Beam there are version conflicts:
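Both `ValueError`s above are per-element limits: Beam serializes each PCollection element as one unit, so a long episode with every frame inline overflows the 2 GB protobuf and GRPC caps. A hedged workaround sketch (not the repo's actual code, and the `episode` schema here is assumed) is to split each episode into bounded chunks upstream of the write stage, following the "prefer multiple small elements" advice in the warning:

```python
# Hypothetical workaround sketch: split each oversized episode into chunks of
# at most CHUNK_STEPS steps, so no single Beam element approaches the 2 GB
# serialization limits. The episode dict layout is an assumption.
CHUNK_STEPS = 50  # illustrative bound; tune so each chunk stays well under ~512 MB

def split_episode(episode, chunk_steps=CHUNK_STEPS):
    """Yield chunk dicts of at most `chunk_steps` steps each."""
    steps = episode["steps"]
    for start in range(0, len(steps), chunk_steps):
        yield {
            "episode_id": episode["episode_id"],
            "chunk_index": start // chunk_steps,
            "steps": steps[start:start + chunk_steps],
        }

# A toy 120-step episode becomes three small elements instead of one big one.
chunks = list(split_episode({"episode_id": 0, "steps": list(range(120))}))
print(len(chunks))  # 3
```

In a Beam pipeline this would run via `beam.FlatMap(split_episode)` before the `train_write/Serialize` stage; whether the downstream RLDS writer accepts chunked episodes depends on its schema, so treat this purely as a sketch of the technique, not a drop-in fix for lerobot2rlds.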
pip install apache-beam==2.65.0 \
    datasets==3.6.0 \
    datatrove==0.5.0 \
    dill==0.4.0 \
    multiprocess==0.70.18 \
    protobuf==3.20.3 \
    pyarrow==16.1.0 \
    rerun-sdk==0.22.1 \
    tensorboard==2.18.0 \
    tensorboard-data-server==0.7.2 \
    tensorflow==2.18.0 \
    tensorflow-addons==0.23.0 \
    tensorflow-datasets==4.9.7 \
    tensorflow-graphics==2021.12.3 \
    tensorflow-io-gcs-filesystem==0.37.1 \
    tensorflow-metadata==1.16.1 \
    torch==2.6.0 \
    TorchCodec==0.2.1 \
    torchvision==0.21.0 --no-deps