: The agent must understand spatial relationships and object semantics, such as distinguishing a "wooden table" from a "marble counter".
VLN is a "multi-modal" task that requires an AI to process both visual input (what it sees) and linguistic input (what it is told to do) to reach a destination. VLN-155zip
: Modern VLN models, such as those using Volumetric Environment Representation (VER), require large files to store the learned parameters that allow them to predict 3D occupancy and room layouts. : The agent must understand spatial relationships and
the file into a designated data/ or weights/ directory. VLN-155zip