Synthesizing Consistent Novel Views Via 3D Epipolar Attention Without Re-Training Botao Ye, Sifei Liu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang Proceedings 2025 International Conference on 3D Vision 3DV 2025, 2025 Large diffusion models demonstrate remarkable zero-shot capabilities in novel view synthesis from a single image. However, these models often face challenges in maintaining consistency across novel and reference views. A crucial factor leading to this issue is the limited utilization of contextual information from reference views. Specifically, when there is an overlap in the viewing frustum between two views, it is essential to ensure that the corresponding regions maintain consistency in both geometry and appearance. This observation leads to a simple yet effective approach, where we propose to use epipolar geometry to locate and retrieve overlapping information from the input view. This information is then incorporated into the generation of target views, eliminating the need for training or fine-tuning, as the process requires no learnable parameters. Furthermore, to enhance the overall consistency of generated views, we extend the utilization of epipolar attention to a multi-view setting, allowing retrieval of overlapping information from the input view and other target views. Qualitative and quantitative experimental results demonstrate the effectiveness of our method in significantly improving the consistency of synthesized views without the need for any fine-tuning. Moreover, this enhancement also boosts the performance of downstream applications such as 3D reconstruction. The code is available at https://github.com/botaoye/ConsisSyn.
M3: 3D-Spatial Multimodal Memory 13th International Conference on Learning Representations ICLR 2025, 2025
No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images 13th International Conference on Learning Representations ICLR 2025, 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2025 We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
Parallel Sequence Modeling via Generalized Spatial Propagation Network Hongjun Wang, Wonmin Byeon, Jiarui Xu, Jinwei Gu, Ka Chun Cheung, Xiaolong Wang, Kai Han, Jan Kautz, Sifei Liu Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2025 We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multidimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, long-context propagation across 2D sequences and reduces the effective sequence length to $\sqrt{N}$ for a square map with $N$ elements, which significantly enhances computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over 84× when generating 16K images. Project page: https://whj363636.github.io/GSPN/
Scaling Vision Pre-Training to 4K Resolution Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, Hongxu Yin Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2025 High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378×378 pixels) due to the quadratic cost of processing larger images. We introduce PS3, which scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representation, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When applying PS3 to a multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training such as AnyRes and S² while using up to 4.3× fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to the state of the art, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than the latest token pruning approaches. Finally, we find current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and 2.96× speedup over Qwen2-VL.
NVILA: Efficient Frontier Visual Language Models Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Haotian Tang, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Jinyi Hu, Sifei Liu, Ranjay Krishna, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2025 Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 1.9-5.1×, prefilling latency by 1.6-2.2×, and decoding latency by 1.2-2.8×.
COLMAP-Free 3D Gaussian Splatting Yang Fu, Xiaolong Wang, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2024
AGG: Amortized Generative 3D Gaussians for Single Image to 3D Transactions on Machine Learning Research, 2024
3D Reconstruction with Generalizable Neural Fields Using Scene Priors 12th International Conference on Learning Representations ICLR 2024, 2024
RegionGPT: Towards Region Understanding Vision Language Model Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, Sifei Liu Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2024
SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models Advances in Neural Information Processing Systems, 2024
Zero-shot Pose Transfer for Unrigged Stylized 3D Characters Jiashun Wang, Xueting Li, Sifei Liu, Shalini De Mello, Orazio Gallo, Xiaolong Wang, Jan Kautz Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2023
Affordance Diffusion: Synthesizing Hand-Object Interactions Yufei Ye, Xueting Li, Abhinav Gupta, Shalini De Mello, Stan Birchfield, Jiaming Song, Shubham Tulsiani, Sifei Liu Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2023
Autoregressive 3D Shape Generation via Canonical Mapping An-Chieh Cheng, Xueting Li, Sifei Liu, Min Sun, Ming-Hsuan Yang Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2022
Learning Continuous Environment Fields via Implicit Functions ICLR 2022 10th International Conference on Learning Representations, 2022
CoordGAN: Self-Supervised Dense Correspondences Emerge from GANs Jiteng Mu, Shalini De Mello, Zhiding Yu, Nuno Vasconcelos, Xiaolong Wang, Jan Kautz, Sifei Liu Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022
GroupViT: Semantic Segmentation Emerges from Text Supervision Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022
Learning to Track Instances without Video Annotations Yang Fu, Sifei Liu, Umar Iqbal, Shalini De Mello, Humphrey Shi, J. Kautz Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2021
Hierarchical Contrastive Motion Learning for Video Action Recognition 32nd British Machine Vision Conference BMVC 2021, 2021
Contrastive Syn-to-Real Generalization ICLR 2021 9th International Conference on Learning Representations, 2021
Coupled Segmentation and Edge Learning via Dynamic Graph Propagation Advances in Neural Information Processing Systems, 2021
Self-Supervised Object Detection via Generative Image Synthesis Siva Karthik Mustikovela, Shalini De Mello, Aayush Prakash, Umar Iqbal, Sifei Liu, Thu Nguyen-Phuoc, Carsten Rother, Jan Kautz Proceedings of the IEEE International Conference on Computer Vision, 2021
Self-Supervised Viewpoint Learning from Image Collections Siva Karthik Mustikovela, V. Jampani, Shalini De Mello, Sifei Liu, Umar Iqbal, C. Rother, J. Kautz Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020
Online adaptation for consistent mesh reconstruction in the wild Advances in Neural Information Processing Systems, 2020
Self-supervised Single-View 3D Reconstruction via Semantic Consistency Xueting Li, Sifei Liu, Kihwan Kim, Shalini De Mello, V. Jampani, Ming-Hsuan Yang, J. Kautz Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2020
SCOPS: Self-supervised co-part segmentation W. Hung, V. Jampani, Sifei Liu, Pavlo Molchanov, Ming-Hsuan Yang, J. Kautz Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019
Joint-task self-supervised learning for temporal correspondence Advances in Neural Information Processing Systems, 2019
Learning Dual Convolutional Neural Networks for Low-Level Vision Jinshan Pan, Sifei Liu, Deqing Sun, Jiawei Zhang, Yang Liu, Jimmy S. J. Ren, Zechao Li, Jinhui Tang, Huchuan Lu, Yu-Wing Tai, Ming-Hsuan Yang Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018
Context-aware synthesis and placement of object instances Advances in Neural Information Processing Systems, 2018
Rendering portraitures from monocular camera and beyond Xiangyu Xu, Deqing Sun, Sifei Liu, Wenqi Ren, Yujin Zhang, Ming-Hsuan Yang, Jian Sun Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2018
Switchable temporal propagation network Sifei Liu, Guangyu Zhong, Shalini De Mello, Jinwei Gu, Ming-Hsuan Yang, J. Kautz Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2018
Deep cascaded Bi-network for face hallucination Shizhan Zhu, Sifei Liu, Chen Change Loy, Xiaoou Tang Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2016
Compressed face hallucination Sifei Liu, Ming-Hsuan Yang 2014 IEEE International Conference on Image Processing ICIP 2014, 2014
Structured face hallucination Chih-Yuan Yang, Sifei Liu, Ming-Hsuan Yang Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2013
A face antispoofing database with diverse attacks Zhiwei Zhang, Junjie Yan, Sifei Liu, Zhen Lei, Dong Yi, Stan Z. Li Proceedings 2012 5th IAPR International Conference on Biometrics ICB 2012, 2012
Novel method for fire smoke recognition based on Gabor wavelet Yi Qi Yi Biao Xue Bao (Chinese Journal of Scientific Instrument), 2010
RECENT SCHOLAR PUBLICATIONS
Context-aware synthesis and placement of object instances D Lee, S Liu, J Gu, MY Liu, J Kautz US Patent App. 19/433,543, 2026
Scaling RL to long videos Y Chen, W Huang, B Shi, Q Hu, H Ye, L Zhu, Z Liu, P Molchanov, J Kautz, ... Advances in Neural Information Processing Systems 38, 172842-172870, 2026. Citations: 72
Diffusion-based open-vocabulary segmentation J Xu, S De Mello, S Liu, A Vahdat, W Byeon US Patent 12,586,199, 2026. Citations: 8
Compositional 3D-consistent freeview image generation with 3D blobs C Liu, W Nie, S Liu, AH Badki, H Su, M Mardani, BD Eckart, A Vahdat US Patent App. 19/227,222, 2026
Techniques for fine-tuning a machine learning model to reconstruct a three-dimensional scene Y Fu, S Liu, J Kautz, X Li, S De Mello, A Kulkarni, M Naphade US Patent 12,548,234, 2026. Citations: 2
Techniques for training a machine learning model to reconstruct different three-dimensional scenes Y Fu, S Liu, J Kautz, X Li, S De Mello, A Kulkarni, M Naphade US Patent 12,548,258, 2026
Learnable Fourier series for image restoration S Liu, S De Mello, J Kautz US Patent App. 18/975,124, 2026
Training and inferencing using a neural network to predict orientations of objects in images SK Mustikovela, V Jampani, S De Mello, S Liu, U Iqbal, J Kautz US Patent App. 19/094,621, 2025
Context-aware synthesis and placement of object instances D Lee, S Liu, J Gu, MY Liu, J Kautz US Patent 12,462,453, 2025. Citations: 1
Segmentation using an unsupervised neural network training technique V Jampani, WC Hung, S Liu, P Molchanov, J Kautz US Patent 12,450,748, 2025
Token-Efficient VLM: High-Resolution Image Understanding Via Dynamic Region Proposal Y Jiang, J Gu, T Xue, KC Cheung, P Molchanov, H Yin, S Liu 2025 IEEE/CVF International Conference on Computer Vision (ICCV), 24147-24158, 2025. Citations: 5
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM H Ye, CHH Yang, A Goel, W Huang, L Zhu, Y Su, S Lin, AC Cheng, Z Wan, ... arXiv preprint arXiv:2510.15870, 2025. Citations: 10
QeRL: Beyond Efficiency--Quantization-enhanced Reinforcement Learning for LLMs W Huang, Y Ge, S Yang, Y Xiao, H Mao, Y Lin, H Ye, S Liu, KC Cheung, ... arXiv preprint arXiv:2510.11696, 2025. Citations: 7
Compositional text-to-image generation with dense blob representations W Nie, S Liu, MM Korani, C Liu, BD Eckart, A Vahdat US Patent App. 18/889,975, 2025
3D aware region prompted vision language model AC Cheng, Y Fu, Y Chen, Z Liu, X Li, S Radhakrishnan, S Han, Y Lu, ... arXiv preprint arXiv:2509.13317, 2025. Citations: 19
Region-aware vision language processor Q Guo, S De Mello, H Yin, W Byeon, KC Cheung, SCW See, J Kautz, ... US Patent App. 19/065,367, 2025
Machine learning framework applied in a semi-supervised setting to perform instance tracking in a sequence of image frames Y Fu, S Liu, U Iqbal, S De Mello, J Kautz US Patent 12,400,341, 2025. Citations: 1
SSE: Multimodal semantic data selection and enrichment for industrial-scale data assimilation M Shen, N Chang, S Liu, JM Alvarez Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and …, 2025. Citations: 4
EgoVLA: Learning vision-language-action models from egocentric human videos R Yang, Q Yu, Y Wu, R Yan, B Li, AC Cheng, X Zou, Y Fang, X Cheng, ... arXiv preprint arXiv:2507.12440, 2025. Citations: 72
View synthesis using camera poses learned from a video Y Fu, S Liu, A Kulkarni, J Kautz US Patent App. 18/963,075, 2025. Citations: 1
MOST CITED SCHOLAR PUBLICATIONS
Learning continuous image representation with local implicit image function Y Chen, S Liu, X Wang Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2021. Citations: 1182
A face antispoofing database with diverse attacks Z Zhang, J Yan, S Liu, Z Lei, D Yi, SZ Li 2012 5th IAPR international conference on Biometrics (ICB), 26-31, 2012. Citations: 1120
GroupViT: Semantic segmentation emerges from text supervision J Xu, S De Mello, S Liu, W Byeon, T Breuel, J Kautz, X Wang Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2022. Citations: 868
Generative face completion Y Li, S Liu, J Yang, MH Yang Proceedings of the IEEE conference on computer vision and pattern …, 2017. Citations: 849
Open-vocabulary panoptic segmentation with text-to-image diffusion models J Xu, S Liu, A Vahdat, W Byeon, X Wang, S De Mello Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2023. Citations: 752
Low-light image enhancement via a deep hybrid network W Ren, S Liu, L Ma, Q Xu, X Xu, X Cao, J Du, MH Yang IEEE Transactions on Image Processing 28 (9), 4364-4375, 2019. Citations: 592
SpatialRGPT: Grounded spatial reasoning in vision-language models AC Cheng, H Yin, Y Fu, Q Guo, R Yang, J Kautz, X Wang, S Liu Advances in Neural Information Processing Systems 37, 135062-135093, 2024. Citations: 431
Learning affinity via spatial propagation networks S Liu, S De Mello, J Gu, G Zhong, MH Yang, J Kautz Advances in Neural Information Processing Systems 30, 2017. Citations: 372
Learning linear transformations for fast image and video style transfer X Li, S Liu, J Kautz, MH Yang Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2019. Citations: 338
COLMAP-Free 3D Gaussian Splatting Y Fu, S Liu, A Kulkarni, J Kautz, AA Efros, X Wang Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2024. Citations: 307
Self-supervised single-view 3D reconstruction via semantic consistency X Li, S Liu, K Kim, S De Mello, V Jampani, MH Yang, J Kautz European Conference on Computer Vision, 677-693, 2020. Citations: 307
Deep cascaded bi-network for face hallucination S Zhu, S Liu, CC Loy, X Tang European conference on computer vision, 614-630, 2016. Citations: 297
Learning dual convolutional neural networks for low-level vision J Pan, S Liu, D Sun, J Zhang, Y Liu, J Ren, Z Li, J Tang, H Lu, YW Tai, ... Proceedings of the IEEE conference on computer vision and pattern …, 2018. Citations: 266
Semi-supervised 3D hand-object poses estimation with interactions in time S Liu, H Jiang, J Xu, S Liu, X Wang Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2021. Citations: 256
Learning recursive filters for low-level vision via a hybrid neural network S Liu, J Pan, MH Yang European conference on computer vision, 560-576, 2016. Citations: 211
Joint-task self-supervised learning for temporal correspondence X Li, S Liu, S De Mello, X Wang, J Kautz, MH Yang Advances in Neural Information Processing Systems 32, 2019. Citations: 209
SCOPS: Self-supervised co-part segmentation WC Hung, V Jampani, S Liu, P Molchanov, MH Yang, J Kautz Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2019. Citations: 204
No pose, no problem: Surprisingly simple 3D Gaussian splats from sparse unposed images B Ye, S Liu, H Xu, X Li, M Pollefeys, MH Yang, S Peng International Conference on Learning Representations 2025, 54009-54033, 2025. Citations: 194
Synthesizing long-term 3D human motion and interaction in 3D scenes J Wang, H Xu, J Xu, S Liu, X Wang Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2021. Citations: 192
NVILA: Efficient frontier visual language models Z Liu, L Zhu, B Shi, Z Zhang, Y Lou, S Yang, H Xi, S Cao, Y Gu, D Li, X Li, ... Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2025. Citations: 190