I have read this paper and found it very interesting. I assume that images of full scenes are input to the model, but I couldn't find the relevant details. All I see is that objects in the full scenes are extracted as images of single objects. How does the model know the relative positions and distances between objects? Thank you very much.
Thanks for your interest in our project. For the object-centric representation, as mentioned in Sec. 4 (Tokenization), we also encode the bounding box coordinates. These features are then fused with the objects' image features to produce the object tokens.
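A minimal sketch of the object-token construction described above: each object's bounding box coordinates are encoded and fused with its image feature to form one token. The layer shapes, dimensions, and random-weight projections here are illustrative assumptions, not the actual VIMA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IMG = 8    # per-object image feature dim (illustrative)
D_BBOX = 4   # bounding box as (x_min, y_min, x_max, y_max), normalized
D_TOK = 16   # fused object-token dim (illustrative)

# Hypothetical learned projections, stubbed with random weights.
W_bbox = rng.standard_normal((D_BBOX, D_IMG))
W_fuse = rng.standard_normal((2 * D_IMG, D_TOK))

def object_token(img_feat: np.ndarray, bbox: np.ndarray) -> np.ndarray:
    """Fuse an object's image feature with its encoded bbox coordinates."""
    bbox_feat = bbox @ W_bbox                  # encode box coordinates
    fused = np.concatenate([img_feat, bbox_feat])
    return fused @ W_fuse                      # one token per object

# Two objects cropped from the same scene: even though each crop loses
# global context, their normalized bounding boxes carry the relative
# position and scale information.
tok_a = object_token(rng.standard_normal(D_IMG), np.array([0.1, 0.1, 0.3, 0.3]))
tok_b = object_token(rng.standard_normal(D_IMG), np.array([0.6, 0.5, 0.9, 0.8]))
print(tok_a.shape, tok_b.shape)  # (16,) (16,)
```

Because the box coordinates enter each token directly, the downstream transformer can infer spatial relations between objects from the tokens alone, without seeing the full scene image.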
Hi, I have a question: why does VIMA need both frontal and top-down views for the observation space? Can't you just give the top-down view?
Hi there. For certain tasks, supplying only the top-down view might be suboptimal, such as "Follow Order", where one object is stacked on another. Additionally, for legacy reasons, we used to have tasks where the frontal view is necessary to provide enough information for reasoning.
References:
Some questions about the input observation · Issue #38 · vimalabs/VIMA (github.com)