I have read this paper and found it very interesting. I assume that images of full scenes are input to the model, but I couldn't find the relevant details. All I see is that objects in the full scenes are extracted as images of single objects. How does the model know the relative positions and distances between objects? Thank you very much.
Thanks for your interest in our project. For the object-centric representation, as mentioned in Sec. 4 (Tokenization), we also encode the bounding box coordinates. These features are then fused with the objects' image features to produce the object tokens.
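A minimal sketch of the object-token construction described above: each object's bounding box coordinates are encoded and fused with its image feature to form one token. The layer shapes, dimensions, and random-weight projections here are illustrative assumptions, not the actual VIMA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IMG = 8    # per-object image feature dim (illustrative)
D_BBOX = 4   # bounding box as (x_min, y_min, x_max, y_max), normalized
D_TOK = 16   # fused object-token dim (illustrative)

# Hypothetical learned projections, stubbed with random weights.
W_bbox = rng.standard_normal((D_BBOX, D_IMG))
W_fuse = rng.standard_normal((2 * D_IMG, D_TOK))

def object_token(img_feat: np.ndarray, bbox: np.ndarray) -> np.ndarray:
    """Fuse an object's image feature with its encoded bbox coordinates."""
    bbox_feat = bbox @ W_bbox                  # encode box coordinates
    fused = np.concatenate([img_feat, bbox_feat])
    return fused @ W_fuse                      # one token per object

# Two objects cropped from the same scene: even though each crop loses
# global context, their normalized bounding boxes carry the relative
# position and scale information.
tok_a = object_token(rng.standard_normal(D_IMG), np.array([0.1, 0.1, 0.3, 0.3]))
tok_b = object_token(rng.standard_normal(D_IMG), np.array([0.6, 0.5, 0.9, 0.8]))
print(tok_a.shape, tok_b.shape)  # (16,) (16,)
```

Because the box coordinates enter each token directly, the downstream transformer can infer spatial relations between objects from the tokens alone, without seeing the full scene image.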
Hi, I have a question: why does VIMA need both frontal and top-down views for the observation space? Can't you just give the top-down view?
Hi there. For certain tasks, supplying only the top-down view might be suboptimal, such as "Follow Order", where one object is stacked on another. Additionally, for legacy reasons, we used to have tasks where the frontal view is necessary to provide enough information for reasoning.
References:
Some questions about the input observation · Issue #38 · vimalabs/VIMA (github.com)