
Removing ground truth data from uniad_data influences the generated output trajectory #31

@TheNeeloy

Hello,

I am attempting to run your model on other datasets and in simulators without providing ground-truth labels to the model at test time.

By debugging the code, I've found that the generate function of the LlavaQwenForCausalLM class takes in uniad_data. This is a dictionary with keys like:

['img_metas', 'img', 
'timestamp', 
'l2g_r_mat', 'l2g_t', 
'gt_lane_labels', 'gt_lane_bboxes', 'gt_lane_masks', 'gt_segmentation', 'gt_instance', 'gt_centerness', 'gt_offset', 'gt_flow', 'gt_backward_flow', 'gt_occ_has_invalid_frame', 'gt_occ_img_is_valid', 
'sdc_planning', 'sdc_planning_mask', 
'command']
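
For reference, something like the following is enough to dump each entry's shape when poking at the dict (a rough sketch; it assumes the values are tensors or lists/DataContainer-style wrappers around tensors, which is how these pipelines usually package batches):

    import torch

    def describe(value):
        # Unwrap the usual nesting (lists, DataContainer-style wrappers) down to shapes.
        if torch.is_tensor(value):
            return tuple(value.shape)
        if isinstance(value, (list, tuple)):
            return [describe(v) for v in value]
        if hasattr(value, 'data'):              # e.g. an mmcv DataContainer
            return describe(value.data)
        return type(value).__name__

    for key, value in uniad_data.items():
        print(key, describe(value))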

I believe that at inference time, gt entries such as gt_lane_labels, gt_lane_bboxes, and gt_lane_masks are only used to compute metrics like IoU; I found this to be the case in the forward_test function of the PansegformerHead class. I also don't believe that keys like sdc_planning and sdc_planning_mask are used for anything meaningful at inference time, since the planning head is never called in the forward_test function of the UniAD class (and neither are the motion or occ heads).

As such, I began by commenting out the only places I found where the code uses these gt values, namely the IoU computation in the segmentation head:

    def forward_test(self,
                    pts_feats=None,
                    gt_lane_labels=None,
                    gt_lane_masks=None,
                    img_metas=None,
                    rescale=False):
        bbox_list = [dict() for i in range(len(img_metas))]

        pred_seg_dict = self(pts_feats)
        results = self.get_bboxes(pred_seg_dict['outputs_classes'],
                                           pred_seg_dict['outputs_coords'],
                                           pred_seg_dict['enc_outputs_class'],
                                           pred_seg_dict['enc_outputs_coord'],
                                           pred_seg_dict['args_tuple'],
                                           pred_seg_dict['reference'],
                                           img_metas,
                                           rescale=rescale)

        # with torch.no_grad():
        #     drivable_pred = results[0]['drivable']
        #     drivable_gt = gt_lane_masks[0][0, -1]
        #     drivable_iou, drivable_intersection, drivable_union = IOU(drivable_pred.view(1, -1), drivable_gt.view(1, -1))

        #     lane_pred = results[0]['lane']
        #     lanes_pred = (results[0]['lane'].sum(0) > 0).int()
        #     lanes_gt = (gt_lane_masks[0][0][:-1].sum(0) > 0).int()
        #     lanes_iou, lanes_intersection, lanes_union = IOU(lanes_pred.view(1, -1), lanes_gt.view(1, -1))

        #     divider_gt = (gt_lane_masks[0][0][gt_lane_labels[0][0] == 0].sum(0) > 0).int()
        #     crossing_gt = (gt_lane_masks[0][0][gt_lane_labels[0][0] == 1].sum(0) > 0).int()
        #     contour_gt = (gt_lane_masks[0][0][gt_lane_labels[0][0] == 2].sum(0) > 0).int()
        #     divider_iou, divider_intersection, divider_union = IOU(lane_pred[0].view(1, -1), divider_gt.view(1, -1))
        #     crossing_iou, crossing_intersection, crossing_union = IOU(lane_pred[1].view(1, -1), crossing_gt.view(1, -1))
        #     contour_iou, contour_intersection, contour_union = IOU(lane_pred[2].view(1, -1), contour_gt.view(1, -1))


        #     ret_iou = {'drivable_intersection': drivable_intersection,
        #                'drivable_union': drivable_union,
        #                'lanes_intersection': lanes_intersection,
        #                'lanes_union': lanes_union,
        #                'divider_intersection': divider_intersection,
        #                'divider_union': divider_union,
        #                'crossing_intersection': crossing_intersection,
        #                'crossing_union': crossing_union,
        #                'contour_intersection': contour_intersection,
        #                'contour_union': contour_union,
        #                'drivable_iou': drivable_iou,
        #                'lanes_iou': lanes_iou,
        #                'divider_iou': divider_iou,
        #                'crossing_iou': crossing_iou,
        #                'contour_iou': contour_iou}
        for result_dict, pts_bbox in zip(bbox_list, results):
            result_dict['pts_bbox'] = pts_bbox
            # result_dict['ret_iou'] = ret_iou
            result_dict['args_tuple'] = pred_seg_dict['args_tuple']
            result_dict['output_query_things'] = pts_bbox['output_query_things']
            result_dict['output_query_stuff'] = pts_bbox['output_query_stuff']
            result_dict['chosen_output_query_things'] = pts_bbox['chosen_output_query_things']
        return bbox_list
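
Instead of deleting the block outright, I think the same effect could be achieved by guarding the metric computation on whether the gt arguments are actually passed in, so the stock evaluation path keeps working when labels are available. A minimal sketch of what I mean (only the drivable IoU is shown; the other categories would follow the same pattern):

    # Only compute IoU metrics when ground-truth lane labels/masks are provided.
    ret_iou = None
    if gt_lane_labels is not None and gt_lane_masks is not None:
        with torch.no_grad():
            drivable_pred = results[0]['drivable']
            drivable_gt = gt_lane_masks[0][0, -1]
            drivable_iou, drivable_intersection, drivable_union = IOU(
                drivable_pred.view(1, -1), drivable_gt.view(1, -1))
            # ...same pattern for lanes, divider, crossing, and contour...
            ret_iou = {'drivable_iou': drivable_iou,
                       'drivable_intersection': drivable_intersection,
                       'drivable_union': drivable_union}

    for result_dict, pts_bbox in zip(bbox_list, results):
        result_dict['pts_bbox'] = pts_bbox
        if ret_iou is not None:
            result_dict['ret_iou'] = ret_iou
        result_dict['args_tuple'] = pred_seg_dict['args_tuple']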

When I did this and ran the evaluation on one data point, I got the same predicted trajectory as when the code was uncommented:

Generated content tokens:
tensor([[151668,     58,   4080,     15,     13,     16,     19,     11,     19,
             13,     21,     20,    701,   4080,     15,     13,     19,     22,
             11,     24,     13,     17,     23,    701,   4080,     16,     13,
             15,     18,     11,     16,     18,     13,     24,     15,    701,
           4080,     16,     13,     22,     23,     11,     16,     23,     13,
             20,     16,    701,   4080,     17,     13,     21,     22,     11,
             17,     18,     13,     15,     23,    701,   4080,     18,     13,
             21,     22,     11,     17,     22,     13,     21,     15,   7252,
         151669, 151645]], device='cuda:0')
'Decoded answer:'
['[(-0.14,4.65),(-0.47,9.28),(-1.03,13.90),(-1.78,18.51),(-2.67,23.08),(-3.67,27.60)]']

I then went a step further and set every key in uniad_data that I believed was unnecessary for inference to [None]. I did this by adding the following code to the generate function of the LlavaQwenForCausalLM class, just before the call to prepare_inputs_labels_for_multimodal_uniad_vlm:

            for k in uniad_data:
                if 'timestamp' in k or 'gt_' in k or 'sdc_' in k or 'command' in k:
                    uniad_data[k] = [None]
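
A slightly cleaner variant of the same idea, in case it helps anyone reproducing this, is to build a stripped copy instead of mutating the dict in place, so the original uniad_data stays available for a side-by-side run (again a sketch; the prefix tuple is just my guess at which keys are supervision-only):

    # Keys I believe are only needed for losses/metrics, not for inference.
    SUPERVISION_PREFIXES = ('gt_', 'sdc_', 'timestamp', 'command')

    def strip_supervision(uniad_data):
        stripped = dict(uniad_data)  # shallow copy; the original dict is left untouched
        for key in stripped:
            if key.startswith(SUPERVISION_PREFIXES):
                stripped[key] = [None]
        return stripped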

When I reran inference on the same data point, I got a slightly different output trajectory:

Generated content tokens:
tensor([[151668,     58,   4080,     15,     13,     16,     19,     11,     19,
             13,     21,     22,    701,   4080,     15,     13,     20,     17,
             11,     24,     13,     18,     23,    701,   4080,     16,     13,
             16,     15,     11,     16,     19,     13,     15,     23,    701,
           4080,     16,     13,     23,     21,     11,     16,     23,     13,
             22,     16,    701,   4080,     17,     13,     22,     19,     11,
             17,     18,     13,     17,     22,    701,   4080,     18,     13,
             21,     22,     11,     17,     22,     13,     22,     17,   7252,
         151669, 151645]], device='cuda:0')
'Decoded answer:'
['[(-0.14,4.67),(-0.52,9.38),(-1.10,14.08),(-1.86,18.71),(-2.74,23.27),(-3.67,27.72)]']

I even tried removing only the keys that start with gt_, but I still got the latter output trajectory.

I tried to debug why this might be the case by printing the hidden states of vision_tower_result["result_track"], vision_tower_result["result_seg"], scene_feature, track_feature, and map_feature in the prepare_inputs_labels_for_multimodal_uniad_vlm function of the LlavaQwenForCausalLM class. The strange thing is that sometimes the hidden-state values were identical with and without the gt data removed, and sometimes they were different (I couldn't pin down when they changed). But no matter what, I always got only one of the two output trajectories printed above.
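
In case it is useful for reproducing this, a more systematic check than eyeballing printed values would be to dump the intermediate tensors from each run and diff them afterwards. A rough sketch (feats here is a hypothetical dict collecting scene_feature, track_feature, map_feature, etc. inside prepare_inputs_labels_for_multimodal_uniad_vlm):

    import torch

    # Each run dumps its features once, e.g.:
    #   torch.save({k: v.detach().cpu() for k, v in feats.items()}, 'feats_with_gt.pt')
    #   torch.save({k: v.detach().cpu() for k, v in feats.items()}, 'feats_no_gt.pt')

    with_gt = torch.load('feats_with_gt.pt')
    no_gt = torch.load('feats_no_gt.pt')
    for name in with_gt:
        max_diff = (with_gt[name] - no_gt[name]).abs().max().item()
        print(f'{name}: identical={torch.equal(with_gt[name], no_gt[name])}, '
              f'max_abs_diff={max_diff:.3e}')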

A few questions:

  1. Do you have a function that lets us generate trajectories directly without requiring gt data as input? This would help when applying the model to other environments. I understand that data such as the l2g transformations would still be required for each multi-view camera image.
  2. If not, do you know how I should go about evaluating your model on other simulators and datasets that do not follow the same exact NuScenes format?
  3. Do you have a function for visualizing the generated trajectory on the input images? (A rough sketch of the kind of projection I mean is below.)
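
For question 3, the kind of thing I have in mind is roughly the following, assuming img_metas provides a 4x4 lidar2img projection matrix per camera (as in the usual mmdet3d/NuScenes pipelines) and that the generated waypoints are (x, y) positions in the ego/LiDAR frame; both of those are assumptions on my side:

    import numpy as np
    import matplotlib.pyplot as plt

    def draw_trajectory(image, waypoints_xy, lidar2img, z=0.0):
        """Project ego-frame (x, y) waypoints onto one camera image and draw them.

        image:        HxWx3 array for a single camera view
        waypoints_xy: list of (x, y) points, e.g. the decoded answer above
        lidar2img:    4x4 projection matrix for that camera (assumed to be in img_metas)
        z:            assumed height of the waypoints in the ego/LiDAR frame
        """
        lidar2img = np.asarray(lidar2img)
        pts = np.array([[x, y, z, 1.0] for x, y in waypoints_xy])  # homogeneous coordinates
        cam = (lidar2img @ pts.T).T                                # project into the image plane
        cam = cam[cam[:, 2] > 1e-3]                                # keep points in front of the camera
        uv = cam[:, :2] / cam[:, 2:3]                              # perspective divide

        plt.imshow(image)
        plt.plot(uv[:, 0], uv[:, 1], 'o-', color='lime')
        plt.axis('off')
        plt.show()

    # e.g. with the decoded answer from above and the front camera:
    # draw_trajectory(front_cam_img,
    #                 [(-0.14, 4.65), (-0.47, 9.28), (-1.03, 13.90)],
    #                 img_metas[0]['lidar2img'][0])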

Thanks for your time!
