
Removing ground truth data from uniad_data influences the generated output trajectory #31

@TheNeeloy

Hello,

I am attempting to run your model on other datasets and in simulators without providing ground-truth labels to the model at test time.

By debugging the code, I've found that the generate function of the LlavaQwenForCausalLM class takes in uniad_data. This is a dictionary with keys like:

['img_metas', 'img', 
'timestamp', 
'l2g_r_mat', 'l2g_t', 
'gt_lane_labels', 'gt_lane_bboxes', 'gt_lane_masks', 'gt_segmentation', 'gt_instance', 'gt_centerness', 'gt_offset', 'gt_flow', 'gt_backward_flow', 'gt_occ_has_invalid_frame', 'gt_occ_img_is_valid', 
'sdc_planning', 'sdc_planning_mask', 
'command']
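
For reference, something like the following is enough to dump each entry's shape when poking at the dict (a rough sketch; it assumes the values are tensors or lists/DataContainer-style wrappers around tensors, which is how these pipelines usually package batches):

    import torch

    def describe(value):
        # Unwrap the usual nesting (lists, DataContainer-style wrappers) down to shapes.
        if torch.is_tensor(value):
            return tuple(value.shape)
        if isinstance(value, (list, tuple)):
            return [describe(v) for v in value]
        if hasattr(value, 'data'):              # e.g. an mmcv DataContainer
            return describe(value.data)
        return type(value).__name__

    for key, value in uniad_data.items():
        print(key, describe(value))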

I believe that at inference time, gt entries such as gt_lane_labels, gt_lane_bboxes, and gt_lane_masks are only used to compute metrics like IoU; I found this to be the case in the forward_test function of the PansegformerHead class. I also don't believe that keys like sdc_planning and sdc_planning_mask are used for anything meaningful at inference time, since the planning head is never called in the forward_test function of the UniAD class (and neither are the motion or occ heads).

As such, I began by commenting out the only places I found where the code uses these gt values, namely the IoU computation in the segmentation head:

    def forward_test(self,
                    pts_feats=None,
                    gt_lane_labels=None,
                    gt_lane_masks=None,
                    img_metas=None,
                    rescale=False):
        bbox_list = [dict() for i in range(len(img_metas))]

        pred_seg_dict = self(pts_feats)
        results = self.get_bboxes(pred_seg_dict['outputs_classes'],
                                           pred_seg_dict['outputs_coords'],
                                           pred_seg_dict['enc_outputs_class'],
                                           pred_seg_dict['enc_outputs_coord'],
                                           pred_seg_dict['args_tuple'],
                                           pred_seg_dict['reference'],
                                           img_metas,
                                           rescale=rescale)

        # with torch.no_grad():
        #     drivable_pred = results[0]['drivable']
        #     drivable_gt = gt_lane_masks[0][0, -1]
        #     drivable_iou, drivable_intersection, drivable_union = IOU(drivable_pred.view(1, -1), drivable_gt.view(1, -1))

        #     lane_pred = results[0]['lane']
        #     lanes_pred = (results[0]['lane'].sum(0) > 0).int()
        #     lanes_gt = (gt_lane_masks[0][0][:-1].sum(0) > 0).int()
        #     lanes_iou, lanes_intersection, lanes_union = IOU(lanes_pred.view(1, -1), lanes_gt.view(1, -1))

        #     divider_gt = (gt_lane_masks[0][0][gt_lane_labels[0][0] == 0].sum(0) > 0).int()
        #     crossing_gt = (gt_lane_masks[0][0][gt_lane_labels[0][0] == 1].sum(0) > 0).int()
        #     contour_gt = (gt_lane_masks[0][0][gt_lane_labels[0][0] == 2].sum(0) > 0).int()
        #     divider_iou, divider_intersection, divider_union = IOU(lane_pred[0].view(1, -1), divider_gt.view(1, -1))
        #     crossing_iou, crossing_intersection, crossing_union = IOU(lane_pred[1].view(1, -1), crossing_gt.view(1, -1))
        #     contour_iou, contour_intersection, contour_union = IOU(lane_pred[2].view(1, -1), contour_gt.view(1, -1))


        #     ret_iou = {'drivable_intersection': drivable_intersection,
        #                'drivable_union': drivable_union,
        #                'lanes_intersection': lanes_intersection,
        #                'lanes_union': lanes_union,
        #                'divider_intersection': divider_intersection,
        #                'divider_union': divider_union,
        #                'crossing_intersection': crossing_intersection,
        #                'crossing_union': crossing_union,
        #                'contour_intersection': contour_intersection,
        #                'contour_union': contour_union,
        #                'drivable_iou': drivable_iou,
        #                'lanes_iou': lanes_iou,
        #                'divider_iou': divider_iou,
        #                'crossing_iou': crossing_iou,
        #                'contour_iou': contour_iou}
        for result_dict, pts_bbox in zip(bbox_list, results):
            result_dict['pts_bbox'] = pts_bbox
            # result_dict['ret_iou'] = ret_iou
            result_dict['args_tuple'] = pred_seg_dict['args_tuple']
            result_dict['output_query_things'] = pts_bbox['output_query_things']
            result_dict['output_query_stuff'] = pts_bbox['output_query_stuff']
            result_dict['chosen_output_query_things'] = pts_bbox['chosen_output_query_things']
        return bbox_list
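
Instead of deleting the block outright, I think the same effect could be achieved by guarding the metric computation on whether the gt arguments are actually passed in, so the stock evaluation path keeps working when labels are available. A minimal sketch of what I mean (only the drivable IoU is shown; the other categories would follow the same pattern):

    # Only compute IoU metrics when ground-truth lane labels/masks are provided.
    ret_iou = None
    if gt_lane_labels is not None and gt_lane_masks is not None:
        with torch.no_grad():
            drivable_pred = results[0]['drivable']
            drivable_gt = gt_lane_masks[0][0, -1]
            drivable_iou, drivable_intersection, drivable_union = IOU(
                drivable_pred.view(1, -1), drivable_gt.view(1, -1))
            # ...same pattern for lanes, divider, crossing, and contour...
            ret_iou = {'drivable_iou': drivable_iou,
                       'drivable_intersection': drivable_intersection,
                       'drivable_union': drivable_union}

    for result_dict, pts_bbox in zip(bbox_list, results):
        result_dict['pts_bbox'] = pts_bbox
        if ret_iou is not None:
            result_dict['ret_iou'] = ret_iou
        result_dict['args_tuple'] = pred_seg_dict['args_tuple']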

When I did this and ran the evaluation on one data point, I got the same predicted trajectory as when the code was uncommented:

Generated content tokens:
tensor([[151668,     58,   4080,     15,     13,     16,     19,     11,     19,
             13,     21,     20,    701,   4080,     15,     13,     19,     22,
             11,     24,     13,     17,     23,    701,   4080,     16,     13,
             15,     18,     11,     16,     18,     13,     24,     15,    701,
           4080,     16,     13,     22,     23,     11,     16,     23,     13,
             20,     16,    701,   4080,     17,     13,     21,     22,     11,
             17,     18,     13,     15,     23,    701,   4080,     18,     13,
             21,     22,     11,     17,     22,     13,     21,     15,   7252,
         151669, 151645]], device='cuda:0')
'Decoded answer:'
['[(-0.14,4.65),(-0.47,9.28),(-1.03,13.90),(-1.78,18.51),(-2.67,23.08),(-3.67,27.60)]']

I then went a step further and set every key in uniad_data that I believed was unnecessary for inference to [None]. I did this by adding the following code to the generate function of the LlavaQwenForCausalLM class, just before the call to prepare_inputs_labels_for_multimodal_uniad_vlm:

            for k in uniad_data:
                if 'timestamp' in k or 'gt_' in k or 'sdc_' in k or 'command' in k:
                    uniad_data[k] = [None]
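
A slightly cleaner variant of the same idea, in case it helps anyone reproducing this, is to build a stripped copy instead of mutating the dict in place, so the original uniad_data stays available for a side-by-side run (again a sketch; the prefix tuple is just my guess at which keys are supervision-only):

    # Keys I believe are only needed for losses/metrics, not for inference.
    SUPERVISION_PREFIXES = ('gt_', 'sdc_', 'timestamp', 'command')

    def strip_supervision(uniad_data):
        stripped = dict(uniad_data)  # shallow copy; the original dict is left untouched
        for key in stripped:
            if key.startswith(SUPERVISION_PREFIXES):
                stripped[key] = [None]
        return stripped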

When I reran inference on the same data point, I got a slightly different output trajectory:

Generated content tokens:
tensor([[151668,     58,   4080,     15,     13,     16,     19,     11,     19,
             13,     21,     22,    701,   4080,     15,     13,     20,     17,
             11,     24,     13,     18,     23,    701,   4080,     16,     13,
             16,     15,     11,     16,     19,     13,     15,     23,    701,
           4080,     16,     13,     23,     21,     11,     16,     23,     13,
             22,     16,    701,   4080,     17,     13,     22,     19,     11,
             17,     18,     13,     17,     22,    701,   4080,     18,     13,
             21,     22,     11,     17,     22,     13,     22,     17,   7252,
         151669, 151645]], device='cuda:0')
'Decoded answer:'
['[(-0.14,4.67),(-0.52,9.38),(-1.10,14.08),(-1.86,18.71),(-2.74,23.27),(-3.67,27.72)]']

I even tried removing only the keys that start with gt_, but I still got the latter output trajectory.

I tried to debug why this might be the case by printing the hidden states of vision_tower_result["result_track"], vision_tower_result["result_seg"], scene_feature, track_feature, and map_feature in the prepare_inputs_labels_for_multimodal_uniad_vlm function of the LlavaQwenForCausalLM class. The strange thing is that sometimes the hidden-state values were identical with and without the gt data removed, and sometimes they were different (I couldn't pin down when they changed). But no matter what, I always got only one of the two output trajectories printed above.
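
In case it is useful for reproducing this, a more systematic check than eyeballing printed values would be to dump the intermediate tensors from each run and diff them afterwards. A rough sketch (feats here is a hypothetical dict collecting scene_feature, track_feature, map_feature, etc. inside prepare_inputs_labels_for_multimodal_uniad_vlm):

    import torch

    # Each run dumps its features once, e.g.:
    #   torch.save({k: v.detach().cpu() for k, v in feats.items()}, 'feats_with_gt.pt')
    #   torch.save({k: v.detach().cpu() for k, v in feats.items()}, 'feats_no_gt.pt')

    with_gt = torch.load('feats_with_gt.pt')
    no_gt = torch.load('feats_no_gt.pt')
    for name in with_gt:
        max_diff = (with_gt[name] - no_gt[name]).abs().max().item()
        print(f'{name}: identical={torch.equal(with_gt[name], no_gt[name])}, '
              f'max_abs_diff={max_diff:.3e}')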

A few questions:

  1. Do you have a function that lets us generate trajectories directly without requiring gt data as input? This would help when applying the model to other environments. I understand that data such as the l2g transformations would still be required for each multi-view camera image.
  2. If not, do you know how I should go about evaluating your model on other simulators and datasets that do not follow the same exact NuScenes format?
  3. Do you have a function for visualizing the generated trajectory on the input images? (A rough sketch of the kind of projection I mean is below.)
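
For question 3, the kind of thing I have in mind is roughly the following, assuming img_metas provides a 4x4 lidar2img projection matrix per camera (as in the usual mmdet3d/NuScenes pipelines) and that the generated waypoints are (x, y) positions in the ego/LiDAR frame; both of those are assumptions on my side:

    import numpy as np
    import matplotlib.pyplot as plt

    def draw_trajectory(image, waypoints_xy, lidar2img, z=0.0):
        """Project ego-frame (x, y) waypoints onto one camera image and draw them.

        image:        HxWx3 array for a single camera view
        waypoints_xy: list of (x, y) points, e.g. the decoded answer above
        lidar2img:    4x4 projection matrix for that camera (assumed to be in img_metas)
        z:            assumed height of the waypoints in the ego/LiDAR frame
        """
        lidar2img = np.asarray(lidar2img)
        pts = np.array([[x, y, z, 1.0] for x, y in waypoints_xy])  # homogeneous coordinates
        cam = (lidar2img @ pts.T).T                                # project into the image plane
        cam = cam[cam[:, 2] > 1e-3]                                # keep points in front of the camera
        uv = cam[:, :2] / cam[:, 2:3]                              # perspective divide

        plt.imshow(image)
        plt.plot(uv[:, 0], uv[:, 1], 'o-', color='lime')
        plt.axis('off')
        plt.show()

    # e.g. with the decoded answer from above and the front camera:
    # draw_trajectory(front_cam_img,
    #                 [(-0.14, 4.65), (-0.47, 9.28), (-1.03, 13.90)],
    #                 img_metas[0]['lidar2img'][0])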

Thanks for your time!
