FALCON-VLA/FALCON

FALCON_logo

FALCON | From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors (ICLR 2026)

arXiv Website HF Paper: FALCON HF Model: FALCON HF Dataset: CALVIN-3D
Python 3.8 PyTorch

Zhengshen Zhang   Hao Li   Yalun Dai   Zhengbang Zhu   Lei Zhou  
Chenchen Liu   Dong Wang   Francis E. H. Tay   Sijin Chen  
Ziwei Liu   Yuxiao Liu*†   Xinghang Li*   Pan Zhou*  

*Corresponding Author  †Project Lead


ByteDance Seed
National University of Singapore   Nanyang Technological University
Tsinghua University   Singapore Management University


FALCON_teaser

Updates 🚀🚀🚀

  • [18/05/2026] Released the pre-training and post-training code for the FALCON series, and the preprocessed point cloud data & camera parameters for CALVIN ABC & CALVIN ABCD. Welcome to check it out and build on top of it!

  • [25/03/2026] Released the inference code of FALCON and the relevant weights on CALVIN & SimplerEnv. Please feel free to try our model!

  • [26/01/2026] 🎊 Thrilled to share that our paper has been accepted to ICLR 2026! Code will be open-sourced soon. Stay tuned!

  • [20/10/2025] Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head of a VLA model, enabling robust spatial understanding and state-of-the-art performance across diverse manipulation tasks without disrupting vision-language alignment. See our paper here.

🔧 Installation

# Clone the repository
git clone https://github.com/FALCON-VLA/FALCON.git
cd FALCON

💡 For now, we support CALVIN and SimplerEnv; you can follow their guidance to download the training/validation data. We suggest creating separate virtual environments for different benchmarks to prevent conflicts. We also provide easy-setup scripts to help you set up environments that are compatible with our codebase for running these benchmarks:

CALVIN Benchmark

# If you want to run training/eval on CALVIN
conda create -n falcon_calvin python=3.8.20 -y
conda activate falcon_calvin
pip install -e .

# For CALVIN Installation
bash scripts/setup_calvin.sh

SimplerEnv Experiments

# If you want to run training/eval on SimplerEnv
conda create -n falcon_oxe python=3.10 -y
conda activate falcon_oxe
pip install -e .

# For training on OXE dataset, using our fork of openvla
cd ..
git clone https://github.com/lixinghang12/openvla
cd openvla
pip install -e .

cd ../FALCON
# For SimplerEnv Installation
bash scripts/setup_simplerenv.sh

# Making soft link to assets of SimplerEnv for eval
sudo ln -s ./SimplerEnv/ManiSkill2_real2sim/data/real_inpainting real_inpainting

To verify that CALVIN/SimplerEnv is installed correctly, run the following commands:

# For CALVIN simulation Verification
python eval/calvin/env_test.py

# For SimplerEnv simulation Verification
python eval/simpler/env_test.py

💪 Benchmark Performance Comparison

CALVIN Benchmark

calvin

SimplerEnv WidowX Robot Experiments

simpler

SimplerEnv Google Robot Experiments

simpler

Real-World Experiments

real-world

💡 For more sim/real-world benchmark results, please refer to our paper.

🤗 Model Zoo

We provide the following model weights and config files used in our paper:

| Model Name | VLA Model | Embodied Spatial Model | Note |
| --- | --- | --- | --- |
| FALCON-FC-CALVIN-ABC | falcon-esm-fc-calvin-abc-pt | esm-1b | finetune on calvin-abc with RGB inputs to ESM, Tab. 4 and 5. |
| FALCON-FC-CALVIN-ABC-WDepth | falcon-esm-fc-calvin-abc-wdepth-pt | esm-1b | finetune on calvin-abc with RGB-D inputs to ESM, Tab. 5. |
| FALCON-3DPC-FC-CALVIN-ABC | falcon-3dpc-fc-calvin-abc-pt | improved DP3 encoder | finetune on calvin-abc with point cloud inputs to idp3 encoder, Tab. 5-Kosmos-VLA (w/ rgb-d). |
| FALCON-LSTM-CALVIN-ABC | falcon-lstm-calvin-abc-pt | esm-1b | finetune on calvin-abc with RGB inputs to ESM, Tab. 1. |
| FALCON-LSTM-CALVIN-ABCD | falcon-lstm-calvin-abcd-pt | esm-1b | finetune on calvin-abcd with RGB inputs to ESM, Tab. 1. |
| FALCON-FC-SimplerEnv-Bridge | falcon-fc-simpler-bridge-pt | esm-1b | pretrained on oxe then finetune on bridge dataset with RGB inputs to ESM, Tab. 2. |
| FALCON-FC-SimplerEnv-Fractal | falcon-fc-simpler-fractal-pt | esm-1b | pretrained on oxe then finetune on fractal dataset with RGB inputs to ESM, Tab. 3. |

🏋️ Training

Training Configuration File

The configuration file comprises five main sections:

Basic VLA Model Configurations

Define the basic configurations of the VLA model:

"robovlm_name": "RoboKosMos", # Name of the registered VLA
"model": "kosmos", # Name of the VLM; used to resolve paths and for model-specific operations such as initialization and prompting
"model_url": "https://huggingface.co/microsoft/kosmos-2-patch14-224", # Hugging Face URL of the VLM; it is downloaded automatically before training starts
"image_size": 224, # Input image size
"window_size": 1, # VLA history length (the provided configs use 1 for the fc head and 16 for the lstm head)
"fwd_pred_next_n": 10, # Length of the predicted action chunk
"batch_size": 4, # Batch size on each GPU
"optimizer": "adamw", # Optimizer type
"learning_rate": 1e-4, # Learning rate
"weight_decay": 0.0, # Weight decay
"output_root": "/path/to/ur/checkpoints",
"log_root": "/path/to/ur/logs",
"cache_root": "/path/to/ur/cache",
"model_load_path": "/path/to/ur/pretrained/checkpoints",
"model_load_source": "torch",  # Options: `torch`, `deepspeed`

💡 During training, the model checkpoint and running configuration will be saved at the paths specified by the output_root and log_root in the config file.
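A minimal sketch of sanity-checking the basic fields before launching a run. The key names are taken from the snippet above; the helper `missing_basic_keys` is hypothetical and not part of the FALCON codebase, and it assumes the config file itself is plain JSON (the `#` comments above are explanatory only).

```python
import json

# Top-level keys taken from the basic-configuration snippet above.
REQUIRED_KEYS = (
    "robovlm_name", "model", "model_url", "image_size", "window_size",
    "fwd_pred_next_n", "batch_size", "optimizer", "learning_rate",
    "weight_decay", "output_root", "log_root", "cache_root",
)


def missing_basic_keys(config_text):
    """Return the required top-level keys absent from a JSON config string."""
    config = json.loads(config_text)
    return [key for key in REQUIRED_KEYS if key not in config]


# Deliberately incomplete example config (hypothetical values).
example = json.dumps({
    "robovlm_name": "RoboKosMos",
    "model": "kosmos",
    "image_size": 224,
    "window_size": 1,
    "fwd_pred_next_n": 10,
})
print(missing_basic_keys(example))
```

Running such a check first can surface a forgotten `output_root` or `log_root` before any GPU time is spent.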

Training Setup Configurations

Specify the detailed training parameters. You can choose which part of the model to train, and whether to freeze some parts:

"train_setup": {
    "precision": "bf16",  # Options: `bf16`, `fp32`
    "predict_action": true,
    "predict_forward": false,
    "predict_forward_hand": false,
    "predict_caption": false,
    "train_vision": true,
    "bits": -1,
    "freeze_mm_mlp_adapter": false,
    "freeze_backbone": false,
    "freeze_resampler": false,
    "tune_mm_mlp_adapter": false,
    "mm_use_im_start_end": false,
    "mm_use_im_patch_token": false,
    "gradient_checkpointing": false,
    "lora_enable": false,
    "mm_projector_lr": 0.0001,
    "lora_r": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "lora_bias": "none",
    "train_text_embedding": true,
    "train_act_head": false,  # false means freeze action head
    "train_spatial_injector": true,
    "train_decoder_layers": -1
},

Action Head Configurations

Specify the parameters of the action head, taking FCDecoder_ESM as an example:

"act_head": {
    "type": "FCDecoder_ESM",
    "hidden_size": 1024,
    "action_dim": 7,
    "down_sample": "none",
    "latent": 1,
    "fwd_pred_next_n": 1,  # inherited from the top-level `fwd_pred_next_n`
    "window_size": 1,  # inherited from the top-level `window_size`
    "action_space": "continuous",
    "with_history": true,
    "history_type": "post",
    "use_spatial_injector": true,
    "esm_use_hand_rgb": false,
    "camera_gt_pt": 0.0,  # set to 0.0 if not using camera poses as ESM input
    "depth_gt_pt": 1.0,  # set to 1.0 if using depth maps as ESM input
    "esm_ckpt_path": "/path/to/ur/esm/checkpoints"
},

VLM Configurations

Specify the tokenizer type, VLM type, and the paths to the pretrained models. If you do not download and specify a pretrained VLM, our script will download it automatically from the specified model_url:

"tokenizer": {
    "type": "AutoProcessor",
    "pretrained_model_name_or_path": ".vlms/kosmos-2-patch14-224",  # Downloaded automatically from `model_url` if the path does not exist
    "tokenizer_type": "kosmos",
    "max_text_len": 256,
    "additional_special_tokens": null
},
"vlm": {
    "type": "AutoModelForVision2Seq",
    "name": "kosmos",
    "pretrained_model_name_or_path": ".vlms/kosmos-2-patch14-224"
},

Dataset Configurations

For now, we support the CALVIN and Open X-Embodiment datasets, as well as custom datasets. Example dataset configurations are as follows:

CALVIN Dataset
"train_dataset": {
    "type": "DiskCalvinDataset_ESM",
    "data_dir": "calvin/dataset/task_ABC_D/training",
    "cam_params_dir": "calvin_cam_params/packaged_ABC_D/training",
    "shift_first": false,
    "model_name": "kosmos",   # Same as 'model' in configs
    "rgb_pad": 10,            # Random shift size for static images
    "gripper_pad": 4,         # Random shift size for gripper images
    "few_shot": false
},
"val_dataset": {
    "type": "DiskCalvinDataset_ESM",
    "data_dir": "calvin/dataset/task_ABC_D/validation",
    "cam_params_dir": "calvin_cam_params/packaged_ABC_D/validation",
    "model_name": "kosmos"   # Same as 'model' in configs
}

Note

We also provide the preprocessed point cloud data and camera parameters for CALVIN ABC & CALVIN ABCD. Please download them from the following link: FALCON_CALVIN-3D dataset.

SimplerEnv Dataset
"train_dataset": {
    "type": "OpenVLADataset",
    "data_root_dir": "openvla/datasets/open-x-embodiment",
    "model_name": "kosmos",   # Same as 'model' in configs
    "image_aug": true,
    "mode": "train",
    "data_mix": "bridge",   # Options: `bridge`, `rt_1`, `oxe_magic_soup` and other data mixtures
    "window_sample": "sliding",
    "organize_type": "interleave",
    "shuffle_buffer_size": 51200,
    "train": true
},
"val_dataset": {
    "type": "OpenVLADataset",
    "data_root_dir": "openvla/datasets/open-x-embodiment",
    "model_name": "kosmos",   # Same as 'model' in configs
    "mode": "train",
    "data_mix": "bridge",
    "window_sample": "sliding",
    "organize_type": "interleave",
    "shuffle_buffer_size": 10000,
    "train": false
}

Additionally, you can define your own custom dataset in the following format:

"rgb": image_tensors,           # Shape: [Batch Size, Window Size, Channel, Width, Height]
"hand_rgb": gripper_tensors,    # Shape: [Batch Size, Window Size, Channel, Width, Height]
"action": action_tensors,       # Shape: [Batch Size, Window Size, Action Dim]
"text": text_tensors,           # Shape: [Batch Size, Max Text Len]
"text_mask": attention_mask,    # Shape: [Batch Size, Max Text Len]
"action_chunk": action_chunk,   # Shape: [Batch Size, Window Size, Chunk Size, Action Dim]
"chunk_mask": action_mask,      # Mask for valid action chunks
"instr_and_action_ids": instr_and_action_ids,  # Input for auto-regressive next token prediction
"instr_and_action_labels": instr_and_action_labels,  # Label for auto-regressive next token prediction
"instr_and_action_mask": instr_and_action_mask,  # Mask for auto-regressive next token prediction
"raw_text": raw_text,           # Raw list of language instructions for each action chunk
"data_source": data_source      # Task type string (e.g., calvin_action, must involve 'action' for action prediction)
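As a rough illustration of the format above, the sketch below checks that a batch dictionary has mutually consistent shapes. The function `check_batch_shapes` and the stand-in `fake` helper are hypothetical and not part of the codebase; the check works on any object exposing a `.shape` attribute (for example a PyTorch tensor), and it only covers the keys whose shapes are documented above.

```python
from types import SimpleNamespace


def check_batch_shapes(batch, window_size, chunk_size, action_dim):
    """Return a list of human-readable shape problems (empty if none)."""
    problems = []
    batch_size = batch["rgb"].shape[0]
    expected = {
        "action": (batch_size, window_size, action_dim),
        "action_chunk": (batch_size, window_size, chunk_size, action_dim),
    }
    for key, shape in expected.items():
        actual = tuple(batch[key].shape)
        if actual != shape:
            problems.append(f"{key}: expected {shape}, got {actual}")
    # Image tensors are 5-D: [batch, window, channel, width, height].
    for key in ("rgb", "hand_rgb"):
        actual = tuple(batch[key].shape)
        if len(actual) != 5 or actual[:2] != (batch_size, window_size):
            problems.append(f"{key}: unexpected shape {actual}")
    # Token ids and their attention mask must line up element by element.
    if tuple(batch["text"].shape) != tuple(batch["text_mask"].shape):
        problems.append("text and text_mask shapes differ")
    return problems


def fake(*shape):
    """Stand-in for a tensor: only carries a shape (illustration only)."""
    return SimpleNamespace(shape=shape)


batch = {
    "rgb": fake(2, 1, 3, 224, 224),
    "hand_rgb": fake(2, 1, 3, 224, 224),
    "action": fake(2, 1, 7),
    "action_chunk": fake(2, 1, 10, 7),
    "text": fake(2, 256),
    "text_mask": fake(2, 256),
}
print(check_batch_shapes(batch, window_size=1, chunk_size=10, action_dim=7))
```

Calling a check like this on the first batch of a custom dataset can catch a transposed axis before it turns into an opaque error inside the model.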

After defining the dataset, wrap it with a custom collater and register it in data/__init__.py as follows:

from .custom_dataset import CustomDataset
__all__.append('CustomDataset')

Then, add your custom dataset to the config file:

Custom Dataset
"train_dataset": {
    "type": "CustomDataset",
    "data_dir": "path/to/custom_data",
    "shift_first": false,
    "model_name": "kosmos",
    "rgb_pad": 10,            # Random shift size for RGB
    "gripper_pad": 4         # Random shift size for gripper
},
"val_dataset": {
    "type": "CustomDataset",
    "data_dir": "path/to/custom_data",
    "model_name": "kosmos"
}

Config Management

The training configuration files automatically inherit parameters like window_size, ensuring consistency across datasets. You can easily switch between datasets by updating the train_dataset and val_dataset sections in your config file.
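The inheritance described here can be pictured with a short sketch. This is not the repository's implementation: the function `propagate_shared_fields`, the list of inherited fields, and the list of sections that receive them are all assumptions made for illustration.

```python
import copy

# Assumed for illustration: which fields are shared and where they are copied.
SHARED_FIELDS = ("window_size", "fwd_pred_next_n")
SECTIONS = ("act_head", "train_dataset", "val_dataset")


def propagate_shared_fields(config):
    """Copy top-level shared fields into each nested section.

    Returns a new dict; the input config is left untouched.
    """
    resolved = copy.deepcopy(config)
    for section in SECTIONS:
        if section not in resolved:
            continue
        for field in SHARED_FIELDS:
            if field in resolved:
                resolved[section][field] = resolved[field]
    return resolved


config = {
    "window_size": 16,
    "fwd_pred_next_n": 10,
    "act_head": {"type": "FCDecoder_ESM", "window_size": 1, "fwd_pred_next_n": 1},
    "train_dataset": {"type": "DiskCalvinDataset_ESM"},
}
resolved = propagate_shared_fields(config)
print(resolved["act_head"]["window_size"], resolved["train_dataset"]["window_size"])
```

Under this reading, the placeholder values inside `act_head` do not matter: the top-level entries are the single source of truth.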

🚀 Pre-train from Scratch

FALCON is pre-trained on 1.1 million real-robot demonstrations from the OXE and CALVIN datasets separately, using 32 A100 GPUs for approximately five days with a batch size of 512. You can pre-train the model from scratch using the following commands. Before running the script, please download the Open X-Embodiment dataset (it must first be converted to the RLDS format; see moojink/rlds_dataset_builder for more info) and, optionally, the CALVIN dataset.

# pretrain on oxe
bash scripts/run.sh configs/oxe_training/finetune_kosmos_cont-fc-post_full-ft_text_vision_wd-0_ws-1_act-5_oxe_pretrain_oxe-magic-soup.json

# pretrain on calvin
bash scripts/run.sh configs/calvin_finetune/finetune_kosmos_cont-fc-post_full-ft_text_vision_wd-0_use-hand_ws-1_act-10.json
bash scripts/run.sh configs/calvin_finetune/finetune_kosmos_cont-lstm-post_full-ft_text_vision_wd-0_use-hand_ws-16_act-10.json

💡 To start the training process, use scripts/run.sh followed by the path to the desired config file:

bash scripts/run.sh path/to/config.json
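The config filenames used above appear to encode a few hyperparameters (for example `cont-lstm`, `ws-16`, `act-10`). The sketch below reads those tokens back out of a path. The helper `parse_config_name` is hypothetical, and the mapping of `ws` to window size, `act` to action chunk length, and `wd` to weight decay is an assumption based on the values shown in the configuration section, not something the repository documents.

```python
import re
from pathlib import PurePosixPath


def parse_config_name(config_path):
    """Extract the tokens that a config filename appears to encode."""
    stem = PurePosixPath(config_path).stem
    patterns = {
        "action_head": r"cont-([a-z]+)-",   # e.g. fc or lstm
        "window_size": r"_ws-(\d+)",
        "chunk_length": r"_act-(\d+)",
        "weight_decay": r"_wd-(\d+)",
    }
    parsed = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, stem)
        if match is None:
            continue  # token not present in this filename
        value = match.group(1)
        parsed[name] = int(value) if value.isdigit() else value
    return parsed


path = (
    "configs/calvin_finetune/"
    "finetune_kosmos_cont-lstm-post_full-ft_text_vision_wd-0_use-hand_ws-16_act-10.json"
)
print(parse_config_name(path))
```

The values inside the JSON file remain authoritative; a helper like this is only convenient for labeling runs or logs.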

🚀 Progressive Post-train

To integrate spatial awareness into the VLA model while preserving the pre-trained components’ capabilities, we design a two-stage post-training pipeline; please refer to our paper for more details. Most of our post-training experiments are conducted using 32 A100 GPUs.

# sft on oxe
bash scripts/run.sh configs/oxe_training/finetune_kosmos_cont-fc-post_esm_wd-0_ws-1_act-5_bridge_finetune.json
bash scripts/run.sh configs/oxe_training/finetune_kosmos_cont-fc-post_esm_wd-0_ws-1_act-5_gr_finetune.json

# sft on calvin
bash scripts/run.sh configs/calvin_finetune/finetune_kosmos_cont-fc-post_esm_wd-0_use-hand_ws-1_act-10.json
bash scripts/run.sh configs/calvin_finetune/finetune_kosmos_cont-lstm-post_esm_wd-0_use-hand_ws-16_act-10.json

# sft FALCON with point cloud input on calvin
bash scripts/run.sh configs/calvin_finetune/finetune_kosmos_cont-fc-post_pcd_wd-0_use-hand_ws-1_act-10.json

🏃 Evaluation

Evaluation on CALVIN

💡 Add the paths to your model checkpoint and configuration files to the ckpt_paths list in the CALVIN eval script eval/calvin/eval_ckpts.py, as shown below:

ckpt_paths = [
    ("path/to/VLA-Checkpoint-{epoch}-{steps}.ckpt/or/VLA-Checkpoint.pt", 
    "path/to/VLA-Checkpoint-config.json")
]

We recommend running the eval script in parallel on multiple GPUs (e.g., 8 GPUs):

python eval/calvin/eval_ckpts.py

After the eval script finishes, the results are saved in the eval/calvin/logs directory. Use the following script to gather the results from each GPU:

python3 tools/merge_multi_rank_res.py

Note

For FALCON with fc action head, add --act_chunk in scripts/run_eval_raw_ddp_torchrun.sh for action chunking rollouts. For FALCON with lstm action head, remove --act_chunk for single-step rollout with action ensemble.
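To make the difference between the two rollout modes concrete, here is a generic sketch. It is not the repository's implementation: `ActionEnsembler` and `chunked_rollout` are hypothetical names, and uniform averaging is only one common way to ensemble overlapping predictions; the weighting FALCON actually uses is not stated here.

```python
from collections import deque


def chunked_rollout(chunk):
    """Action chunking: execute every action of one predicted chunk in order."""
    return list(chunk)


class ActionEnsembler:
    """Single-step rollout: average all predictions that target the current step."""

    def __init__(self, chunk_length):
        # Chunks older than `chunk_length` steps no longer cover the current step.
        self.history = deque(maxlen=chunk_length)

    def step(self, chunk):
        """`chunk[k]` is the action predicted for k steps after the current one."""
        self.history.append(chunk)
        # The newest chunk has age 0, the one before it age 1, and so on.
        candidates = [
            past[age]
            for age, past in enumerate(reversed(self.history))
            if age < len(past)
        ]
        dims = len(candidates[0])
        return [sum(c[d] for c in candidates) / len(candidates) for d in range(dims)]


ensembler = ActionEnsembler(chunk_length=2)
first = ensembler.step([[1.0], [3.0]])   # only one prediction exists for this step
second = ensembler.step([[5.0], [7.0]])  # averages 5.0 (new) with 3.0 (previous chunk)
print(first, second)
```

With chunking the policy is queried once per chunk, while the ensemble variant queries it at every step and executes a single smoothed action.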

Evaluation on SimplerEnv

💡 Before running, make sure that you have the right path to the SimplerEnv Real-Sim image assets. We recommend making a soft link to SimplerEnv/ManiSkill2_real2sim/data/real_inpainting to run the provided eval scripts.

Add the paths to your model checkpoint, configuration files, and the eval output logs in the ckpt_paths list for SimplerEnv eval scripts as shown below:

ckpt_paths = [
    ("path/to/VLA-Checkpoint-{epoch}-{steps}.ckpt/or/VLA-Checkpoint.pt", 
    "path/to/VLA-Checkpoint-config.json",
    "path/to/SimplerEnv-eval-logs")
]

To evaluate the model on the Google Robot/Fractal environment, use the following commands. We recommend using multiple GPUs (e.g., 4 GPUs) to accelerate the evaluation:

# For single gpu eval
python eval/simpler/eval_ckpts_google_robot.py
# For multi-gpus eval
python eval/simpler/eval_ckpts_google_robot_parallel.py

For evaluation on Bridge environment, run:

python eval/simpler/eval_ckpts_bridge.py

After the eval script finishes, the log files are saved in the path/to/SimplerEnv-eval-logs directory that you set. Use the following scripts to obtain the final results:

# For bridge results summary
python3 tools/summary_bridge_results.py <PATH-TO-UR-LOG-FILE>
# For gr results summary
python3 tools/summary_gr_results.py <PATH-TO-UR-LOGS-FOLDER>
# For gr/bridge sub-task results summary
python3 tools/get_simpler_results.py

Note

Please make sure that the paths to the model checkpoints and configuration files are correct and match the setup of your environment before running the benchmark evaluation scripts.
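One way to perform this check up front is sketched below. The helper `find_missing_paths` is hypothetical and not shipped with the repository; it assumes each `ckpt_paths` entry starts with the checkpoint path followed by the config path, as in the examples above, and it ignores the optional log directory because that may be created later.

```python
from pathlib import Path


def find_missing_paths(ckpt_paths):
    """Return every checkpoint or config path in `ckpt_paths` that does not exist."""
    missing = []
    for entry in ckpt_paths:
        # Only the first two items (checkpoint, config) are checked.
        ckpt, config = entry[0], entry[1]
        for path in (ckpt, config):
            if not Path(path).exists():
                missing.append(path)
    return missing


ckpt_paths = [
    ("path/to/VLA-Checkpoint.pt",
     "path/to/VLA-Checkpoint-config.json",
     "path/to/SimplerEnv-eval-logs"),
]
print(find_missing_paths(ckpt_paths))
```

Running it before a multi-GPU evaluation avoids discovering a mistyped path only after the workers have started.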

🗒️ TODO List

  • Release the code, model of FALCON.
  • Release the CALVIN & SimplerEnv evaluation code and model weights for FALCON series.
  • Release pre-training / post-training code for FALCON series.
  • Release the preprocessed point cloud data and camera parameters for CALVIN ABC & CALVIN ABCD.
  • Release the code for real-world deployment of FALCON via ManiUniCon.

🤗 FAQs

If you encounter any issues, feel free to open an issue on GitHub or reach out through discussions. We appreciate your feedback and contributions! 🚀

🖊️ Citation

If you find this project useful in your research, please consider citing:

@article{zhang2025spatial,
  title={From spatial to actions: Grounding vision-language-action model in spatial foundation priors},
  author={Zhang, Zhengshen and Li, Hao and Dai, Yalun and Zhu, Zhengbang and Zhou, Lei and Liu, Chenchen and Wang, Dong and Tay, Francis EH and Chen, Sijin and Liu, Ziwei and others},
  journal={arXiv preprint arXiv:2510.17439},
  year={2025}
}

🪪 License

This project is licensed under the Apache-2.0 License.

❤️ Acknowledgement

FALCON is built with reference to the code of the following projects: RoboVLMs, Microsoft Kosmos-2, VGGT, and ManiUniCon. Thanks for their awesome work!
