Visual Odometry is All You Need for Realistic PointGoal Navigation

What scientific questions do we (try to) answer?

Is a visual odometry module all you need to close the gap?

Will the PointNav v2 agent ("realistic"), which navigates fully visually/perceptually, transfer from simulation to reality?

Can the gap in Success and SPL between PointNav v1 and PointNav v2 be closed?

What is Visual Odometry?

What is PointGoal Navigation?

Why do we need VO?
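One way to frame the answer: in PointNav v2 there is no GPS+Compass sensor, so a VO module can predict per-step egomotion from consecutive observations and the agent dead-reckons its pose by integrating those estimates. A minimal sketch of the integration step (the egomotion values here are made up; the VO predictor itself is assumed):

```python
import math

# Hypothetical sketch: a VO module predicts per-step egomotion
# (dx, dz, dyaw) from consecutive RGB-D frames; the agent integrates
# these estimates to track its pose instead of reading GPS+Compass.

def integrate_egomotion(pose, egomotion):
    """Compose the current (x, z, yaw) pose with a relative motion."""
    x, z, yaw = pose
    dx, dz, dyaw = egomotion
    # Rotate the relative translation into the world frame.
    x += dx * math.cos(yaw) - dz * math.sin(yaw)
    z += dx * math.sin(yaw) + dz * math.cos(yaw)
    return (x, z, (yaw + dyaw) % (2 * math.pi))

# Made-up trajectory: forward 0.25 m, turn 90 degrees, forward 0.25 m.
pose = (0.0, 0.0, 0.0)
for step_motion in [(0.25, 0.0, 0.0), (0.0, 0.0, math.pi / 2), (0.25, 0.0, 0.0)]:
    pose = integrate_egomotion(pose, step_motion)
print(pose)
```

Any per-step VO error compounds along the episode, which is why the hit VO causes relative to GT localization is the central quantity in the notes below.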

PointNav v2 (Realistic)

PointNav v1 (already solved)

What are approaches to solve this setting?

Sim2Real experiment

MP3D transfer

VO ablations

HC 2021 benchmark results

VO consistency gives only a 0.04 SPL hit to any GT nav policy

How this method is different from Datta et al.?

Map-free

Map-based

How this method is different from Zhao et al.?

Single model

Fewer parameters

Action embedding

Dropout is not applied

Concatenated to every FC
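A sketch of what "action embedding concatenated to every FC" could look like (a NumPy stand-in, not the actual architecture; the dimensions, layer count, and random initialization are all made up):

```python
import numpy as np

# Illustrative sketch: the discrete action taken between the two frames
# is embedded, and the embedding is concatenated to the input of every
# fully connected layer, so each stage of the VO head is conditioned on
# the action. All sizes and weights below are hypothetical.

rng = np.random.default_rng(0)
EMBED_DIM, HIDDEN = 8, 32

# One embedding per action: MOVE_FORWARD, TURN_LEFT, TURN_RIGHT.
action_embeddings = rng.normal(size=(3, EMBED_DIM))

def fc(x, out_dim):
    """A randomly initialized fully connected layer with ReLU."""
    w = rng.normal(size=(x.shape[-1], out_dim)) * 0.01
    return np.maximum(x @ w, 0.0)

def vo_head(visual_features, action_id):
    a = action_embeddings[action_id]
    h = fc(np.concatenate([visual_features, a]), HIDDEN)  # action joins layer 1
    h = fc(np.concatenate([h, a]), HIDDEN)                # ...and layer 2
    return fc(np.concatenate([h, a]), 3)                  # predicts (dx, dz, dyaw)

pred = vo_head(rng.normal(size=64), action_id=0)
print(pred.shape)
```

Feeding the action at every layer, rather than only at the input, keeps the conditioning signal from being washed out by the intermediate transformations.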

Zhao et al.'s code is open-sourced (https://github.com/Xiaoming-Zhao/PointNav-VO). Should we plug in Depth Discretization/TopDownMap as they did and check the results?

Datta et al. previously had less significant results, and the anticipation-map approach dominated

Much smaller hit to nav metrics on Gibson val between policy+VO and policy+GT localization

Engineering contribution

Distributed dataset generation (image resizing and format chosen to save space)

Future work: Where to move next?

Open-sourced code and checkpoints to push Embodied AI forward

Distributed (large scale) VO training

Online training from HSim?

Estimate that the PointNav v2 maximum is 0.84 SPL, not 0.99 SPL

Camera is cheap

GPS & Compass are too noisy in indoor environments

Why not use noisy egomotion from wheels?

Properties and (potential) applications

Is it correct to state: "ensembling via data augmentations"?
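One reading of "ensembling via data augmentations": at test time the frame pair is also fed through a horizontal flip, the flipped prediction is mapped back into the original frame (negating the lateral translation and yaw), and the two estimates are averaged. A sketch under that assumption (`vo_model` is a stand-in, not a real network):

```python
import numpy as np

# Hypothetical test-time-augmentation ensembling for a VO model that
# predicts (dx, dz, dyaw). A horizontal flip of both frames mirrors the
# motion, so the flipped prediction is un-mirrored before averaging.

def vo_model(obs):
    # Placeholder "network": any deterministic function of the observation.
    return np.array([obs.mean(), obs.std(), obs.max() * 0.01])

def predict_with_tta(obs):
    plain = vo_model(obs)
    flipped = vo_model(obs[:, ::-1])  # horizontally flipped observation
    flipped[0] *= -1.0                # un-flip lateral translation
    flipped[2] *= -1.0                # un-flip yaw
    return (plain + flipped) / 2.0

obs = np.arange(12.0).reshape(3, 4)
avg_pred = predict_with_tta(obs)
print(avg_pred)
```

Averaging the two views acts as a cheap two-member ensemble, which is where the "ensembling" phrasing comes from.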

Significantly outperforms map-based

Reuse as a fully visual navigator for Rearrangement (or other tasks that require GT localization via GPS & Compass)?

VO is trained on policy-agnostic data (shortest-path-follower trajectories)

Plug and play usability: no need to fine-tune

Mention that the 2.5 SPL reward brings +2 SPL?

nav-policy

Reward how close you’re to the center

Reduce success in the zone

Stop on a range

Take a look at the distance to goal at failure
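The reward-shaping ideas above could be sketched like this (every constant is an assumption: the 2.5 terminal weight echoes the "2.5 SPL reward" note, and 0.36 m is the assumed PointNav success radius):

```python
# Hypothetical sketch of the nav-policy reward-shaping ideas: on top of
# the usual distance-to-goal progress term, scale the terminal bonus by
# closeness to the goal-zone center, so the policy does not settle for
# barely entering the success radius. All coefficients are assumptions.

SUCCESS_RADIUS = 0.36  # assumed success radius in meters

def shaped_reward(prev_dist, dist, called_stop, success_weight=2.5):
    reward = prev_dist - dist                     # progress toward the goal
    if called_stop and dist < SUCCESS_RADIUS:
        # Terminal bonus grows as the agent stops closer to the center.
        reward += success_weight * (1.0 - dist / SUCCESS_RADIUS)
    # Stopping outside the zone earns no terminal bonus.
    return reward

print(shaped_reward(1.0, 0.0, called_stop=True))   # stop exactly on the goal
print(shaped_reward(1.0, 0.5, called_stop=False))  # plain progress step
```

Rewarding closeness to the center inside the zone is one way to "reduce success in the zone" into a graded signal rather than a flat bonus.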

Relatively small dataset size

Should we retrain our approach in the HC2020 setting to compare with Zhao et al.?

HM3D transfer?

Introduction flow:

2) Robots still use SLAM for PointNav even though PointNav v1 has been solved. Solving PointNav v2 could change that, but it is challenging.

3) Can a map-free approach with general-purpose models solve PointNav v2 the way PointNav v1 was solved?

4) First, we investigate what "solved" means: 0.84 SPL, 0.98 Success, because of actuation noise on short episodes.

5) We show that even with GT localization, 0.84 SPL is hard to reach at the same scale as PointNav v1; we reach 0.80 SPL vs. 0.71 SPL in Zhao et al.

6) We focus on decreasing the hit from switching from GT localization to VO-estimated localization, reaching only a 0.0x SPL hit.

7) We learned that action embedding, test-time augmentation, and the scale of the dataset and model are key ingredients for boosting VO robustness.

8) All of that sets a new SOTA of XX, leaving a difference of Y before the task is solved.

Action embedding: +8 Success, +6 SPL

Train/test augmentations: +5 Success, +4 SPL

Larger dataset (0.5M → 1.5M): +8 Success, +6 SPL

More powerful encoder: +3 Success, +3 SPL

1) The PointGoal task is foundational for robots (e.g., Boston Dynamics, Roomba) and for Embodied AI tasks.