@eric-xw @zzxslp
Currently, each video is represented by a NumPy array of shape (1, num_of_segments, 1024).
Since many of the original videos are no longer available, would it be possible for you to provide a pooled/global feature of shape (1, D) for each video?
Such a pooled representation is widely used in image-guided NMT on datasets such as Multi30K, and I believe it would also benefit research in VMT.
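
For reference, a global feature can be approximated from the released segment-level arrays by pooling over the segment axis. The snippet below is only a minimal sketch under that assumption: `pool_video_feature` is a hypothetical helper name, the input is random data standing in for a real feature file, and mean/max pooling is not necessarily equivalent to a global feature extracted directly from the original video.

```python
import numpy as np


def pool_video_feature(segment_features: np.ndarray, mode: str = "mean") -> np.ndarray:
    """Collapse a (1, num_of_segments, D) array to (1, D) by pooling over segments.

    Hypothetical helper for illustration; not part of the dataset's tooling.
    """
    if segment_features.ndim != 3 or segment_features.shape[0] != 1:
        raise ValueError(
            f"expected shape (1, num_of_segments, D), got {segment_features.shape}"
        )
    if mode == "mean":
        # Average every feature dimension across the segment axis.
        return segment_features.mean(axis=1)
    if mode == "max":
        # Keep the strongest activation per feature dimension.
        return segment_features.max(axis=1)
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Random data standing in for one video's features (32 segments is an assumption).
rng = np.random.default_rng(0)
features = rng.standard_normal((1, 32, 1024)).astype(np.float32)

pooled = pool_video_feature(features)
print(pooled.shape)  # (1, 1024)
```

This works as a stopgap, but an officially released pooled feature would still be preferable so that everyone compares against the same representation.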