Abstract: Endowing Large Multimodal Models (LMMs) with visual grounding capability can significantly enhance AIs’ understanding of the visual world and their interaction with humans. However, existing ...