Sifei Liu

Scopus Publications

SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation
Maying Shen, Nadine Chang, Sifei Liu, Jose M. Alvarez
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2025
Synthesizing Consistent Novel Views Via 3D Epipolar Attention Without Re-Training
Botao Ye, Sifei Liu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang
Proceedings 2025 International Conference on 3D Vision, 3DV 2025, 2025
Large diffusion models demonstrate remarkable zero-shot capabilities in novel view synthesis from a single image. However, these models often face challenges in maintaining consistency across novel and reference views. A crucial factor leading to this issue is the limited utilization of contextual information from reference views. Specifically, when there is an overlap in the viewing frustum between two views, it is essential to ensure that the corresponding regions maintain consistency in both geometry and appearance. This observation leads to a simple yet effective approach, where we propose to use epipolar geometry to locate and retrieve overlapping information from the input view. This information is then incorporated into the generation of target views, eliminating the need for training or fine-tuning, as the process requires no learnable parameters. Furthermore, to enhance the overall consistency of generated views, we extend the utilization of epipolar attention to a multi-view setting, allowing retrieval of overlapping information from the input view and other target views. Qualitative and quantitative experimental results demonstrate the effectiveness of our method in significantly improving the consistency of synthesized views without the need for any fine-tuning. Moreover, this enhancement also boosts the performance of downstream applications such as 3D reconstruction. The code is available at https://github.com/botaoye/ConsisSyn.
M3: 3D-Spatial Multimodal Memory
13th International Conference on Learning Representations, ICLR 2025, 2025
No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images
13th International Conference on Learning Representations, ICLR 2025, 2025
Compositional Text-to-Image Generation with Feedforward Layout Generation
Sifei Liu, Weili Nie, An-Chieh Cheng, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat
Lecture Notes in Computer Science, 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2025
We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
Parallel Sequence Modeling via Generalized Spatial Propagation Network
Hongjun Wang, Wonmin Byeon, Jiarui Xu, Jinwei Gu, Ka Chun Cheung, Xiaolong Wang, Kai Han, Jan Kautz, Sifei Liu
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2025
We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multidimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, long-context propagation across 2D sequences and reduces the effective sequence length to $\sqrt{N}$ for a square map with $N$ elements, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over 84× when generating 16K images. Project page: https://whj363636.github.io/GSPN/
Scaling Vision Pre-Training to 4K Resolution
Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, Hongxu Yin
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2025
High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 378×378 pixels) due to the quadratic cost of processing larger images. We introduce PS3 that scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. Instead of contrastive learning on global image representation, PS3 is pre-trained by selectively processing local regions and contrasting them with local detailed captions, enabling high-resolution representation learning with greatly reduced computational overhead. The pre-trained PS3 is able to both encode the global image at low resolution and selectively process local high-resolution regions based on their saliency or relevance to a text prompt. When applying PS3 to a multi-modal LLM (MLLM), the resulting model, named VILA-HD, significantly improves high-resolution visual perception compared to baselines without high-resolution vision pre-training such as AnyRes and S² while using up to 4.3× fewer tokens. PS3 also unlocks appealing scaling properties of VILA-HD, including scaling up resolution for free and scaling up test-time compute for better performance. Compared to the state of the art, VILA-HD outperforms previous MLLMs such as NVILA and Qwen2-VL across multiple benchmarks and achieves better efficiency than the latest token pruning approaches. Finally, we find current benchmarks do not require 4K-resolution perception, which motivates us to propose 4KPro, a new benchmark of image QA at 4K resolution, on which VILA-HD outperforms all previous MLLMs, including a 14.5% improvement over GPT-4o, and a 3.2% improvement and 2.96× speedup over Qwen2-VL.
NVILA: Efficient Frontier Visual Language Models
Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Haotian Tang, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Jinyi Hu, Sifei Liu, Ranjay Krishna, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2025
Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 1.9-5.1×, prefilling latency by 1.6-2.2×, and decoding latency by 1.2-2.8×.
BlobGEN-3D: Compositional 3D-Consistent Freeview Image Generation with 3D Blobs
Chao Liu, Weili Nie, Sifei Liu, Abhishek Badki, Hang Su, Morteza Mardani, Benjamin Eckart, Arash Vahdat
Proceedings SIGGRAPH Asia 2024 Conference Papers, SA 2024, 2024
CosAE: Learnable Fourier Series for Image Restoration
Advances in Neural Information Processing Systems, 2024
Physics-based Indirect Illumination for Inverse Rendering
Youming Deng, Xueting Li, Sifei Liu, Ming-Hsuan Yang
Proceedings 2024 International Conference on 3D Vision, 3DV 2024, 2024
RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos
Hongchi Xia, Yang Fu, Sifei Liu, Xiaolong Wang
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2024
TUVF: Learning Generalizable Texture UV Radiance Fields
12th International Conference on Learning Representations, ICLR 2024, 2024
A Unified Approach for Text-and Image-Guided 4D Scene Generation
Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Otmar Hilliges, Shalini De Mello
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2024
Compositional Text-to-Image Generation with Dense Blob Representations
Proceedings of Machine Learning Research, 2024
HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data
Mengqi Zhang, Yang Fu, Zheng Ding, Sifei Liu, Zhuowen Tu, Xiaolong Wang
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2024
COLMAP-Free 3D Gaussian Splatting
Yang Fu, Xiaolong Wang, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2024
AGG: Amortized Generative 3D Gaussians for Single Image to 3D
Transactions on Machine Learning Research, 2024
3D Reconstruction with Generalizable Neural Fields Using Scene Priors
12th International Conference on Learning Representations, ICLR 2024, 2024
RegionGPT: Towards Region Understanding Vision Language Model
Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, Sifei Liu
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2024
SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
Advances in Neural Information Processing Systems, 2024
Self-Supervised Super-Plane for Neural 3D Reconstruction
Botao Ye, Sifei Liu, Xueting Li, Ming-Hsuan Yang
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2023
Generalizable One-shot 3D Neural Head Avatar
Advances in Neural Information Processing Systems, 2023
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, Shalini De Mello
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2023
Zero-shot Pose Transfer for Unrigged Stylized 3D Characters
Jiashun Wang, Xueting Li, Sifei Liu, Shalini De Mello, Orazio Gallo, Xiaolong Wang, Jan Kautz
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2023
Affordance Diffusion: Synthesizing Hand-Object Interactions
Yufei Ye, Xueting Li, Abhinav Gupta, Shalini De Mello, Stan Birchfield, Jiaming Song, Shubham Tulsiani, Sifei Liu
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2023
Deblurring Dynamic Scenes via Spatially Varying Recurrent Neural Networks
Wenqi Ren, Jiawei Zhang, Jinshan Pan, Sifei Liu, Jimmy S. J. Ren, Junping Du, Xiaochun Cao, Ming-Hsuan Yang
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022
Correction to: Learning Contrastive Representation for Semantic Correspondence (International Journal of Computer Vision, (2022), 130, 5, (1293-1309), 10.1007/s11263-022-01602-y)
Taihong Xiao, Sifei Liu, Shalini De Mello, Zhiding Yu, Jan Kautz, Ming-Hsuan Yang
International Journal of Computer Vision, 2022
Learning Contrastive Representation for Semantic Correspondence
Taihong Xiao, Sifei Liu, Shalini De Mello, Zhiding Yu, Jan Kautz, Ming-Hsuan Yang
International Journal of Computer Vision, 2022
Autoregressive 3D Shape Generation via Canonical Mapping
An-Chieh Cheng, Xueting Li, Sifei Liu, Min Sun, Ming-Hsuan Yang
Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2022
Scraping Textures from Natural Images for Synthesis and Editing
Xueting Li, Xiaolong Wang, Ming-Hsuan Yang, Alexei A. Efros, Sifei Liu
Lecture Notes in Computer Science, 2022
Learning Continuous Environment Fields via Implicit Functions
ICLR 2022, 10th International Conference on Learning Representations, 2022
CoordGAN: Self-Supervised Dense Correspondences Emerge from GANs
Jiteng Mu, Shalini De Mello, Zhiding Yu, Nuno Vasconcelos, Xiaolong Wang, Jan Kautz, Sifei Liu
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022
GroupViT: Semantic Segmentation Emerges from Text Supervision
Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2022
Learning Continuous Image Representation with Local Implicit Image Function
Yinbo Chen, Sifei Liu, Xiaolong Wang
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2021
Video Autoencoder: self-supervised disentanglement of static 3D structure and motion
Zihang Lai, Sifei Liu, Alexei A. Efros, Xiaolong Wang
Proceedings of the IEEE International Conference on Computer Vision, 2021
Video Matting via Consistency-Regularized Graph Neural Networks
Tiantian Wang, Sifei Liu, Yapeng Tian, Kai Li, Ming-Hsuan Yang
Proceedings of the IEEE International Conference on Computer Vision, 2021
Learning 3D Dense Correspondence via Canonical Point Autoencoder
Advances in Neural Information Processing Systems, 2021
Semi-supervised 3D hand-object poses estimation with interactions in time
Shaowei Liu, Hanwen Jiang, Jiarui Xu, Sifei Liu, Xiaolong Wang
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2021
Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes
Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, X. Wang
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2021
Regularizing Meta-learning via Gradient Dropout
Hung-Yu Tseng, Yi-Wen Chen, Yi-Hsuan Tsai, Sifei Liu, Yen-Yu Lin, Ming-Hsuan Yang
Lecture Notes in Computer Science, 2021
Learning to Track Instances without Video Annotations
Yang Fu, Sifei Liu, Umar Iqbal, Shalini De Mello, Humphrey Shi, J. Kautz
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2021
Hierarchical Contrastive Motion Learning for Video Action Recognition
32nd British Machine Vision Conference, BMVC 2021, 2021
Contrastive Syn-to-Real Generalization
ICLR 2021, 9th International Conference on Learning Representations, 2021
Coupled Segmentation and Edge Learning via Dynamic Graph Propagation
Advances in Neural Information Processing Systems, 2021
Self-Supervised Object Detection via Generative Image Synthesis
Siva Karthik Mustikovela, Shalini De Mello, Aayush Prakash, Umar Iqbal, Sifei Liu, Thu Nguyen-Phuoc, Carsten Rother, Jan Kautz
Proceedings of the IEEE International Conference on Computer Vision, 2021
Weakly-Supervised Semantic Segmentation by Iterative Affinity Learning
Xiang Wang, Sifei Liu, Huimin Ma, Ming-Hsuan Yang
International Journal of Computer Vision, 2020
Few-shot viewpoint estimation
30th British Machine Vision Conference 2019, BMVC 2019, 2020
Self-Supervised Viewpoint Learning from Image Collections
Siva Karthik Mustikovela, V. Jampani, Shalini De Mello, Sifei Liu, Umar Iqbal, C. Rother, J. Kautz
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020
Online adaptation for consistent mesh reconstruction in the wild
Advances in Neural Information Processing Systems, 2020
Self-supervised Single-View 3D Reconstruction via Semantic Consistency
Xueting Li, Sifei Liu, Kihwan Kim, Shalini De Mello, V. Jampani, Ming-Hsuan Yang, J. Kautz
Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2020
Learning propagation for arbitrarily-structured data
Sifei Liu, Xueting Li, V. Jampani, Shalini De Mello, J. Kautz
Proceedings of the IEEE International Conference on Computer Vision, 2019
Low-Light Image Enhancement via a Deep Hybrid Network
Wenqi Ren, Sifei Liu, Lin Ma, Qianqian Xu, Xiangyu Xu, Xiaochun Cao, Junping Du, Ming-Hsuan Yang
IEEE Transactions on Image Processing, 2019
Learning linear transformations for fast image and video style transfer
Xueting Li, Sifei Liu, J. Kautz, Ming-Hsuan Yang
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019
Putting humans in a scene: Learning affordance in 3D indoor environments
Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, J. Kautz
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019
SCOPS: Self-supervised co-part segmentation
W. Hung, V. Jampani, Sifei Liu, Pavlo Molchanov, Ming-Hsuan Yang, J. Kautz
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019
Joint-task self-supervised learning for temporal correspondence
Advances in Neural Information Processing Systems, 2019
Learning Dual Convolutional Neural Networks for Low-Level Vision
Jinshan Pan, Sifei Liu, Deqing Sun, Jiawei Zhang, Yang Liu, Jimmy S. J. Ren, Zechao Li, Jinhui Tang, Huchuan Lu, Yu-Wing Tai, Ming-Hsuan Yang
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018
Hallucinating Compressed Face Images
Chih-Yuan Yang, Sifei Liu, Ming-Hsuan Yang
International Journal of Computer Vision, 2018
Learning video-story composition via recurrent neural network
Guangyu Zhong, Yi-Hsuan Tsai, Sifei Liu, Zhixun Su, Ming-Hsuan Yang
Proceedings 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, 2018
Context-aware synthesis and placement of object instances
Advances in Neural Information Processing Systems, 2018
Rendering portraitures from monocular camera and beyond
Xiangyu Xu, Deqing Sun, Sifei Liu, Wenqi Ren, Yujin Zhang, Ming-Hsuan Yang, Jian Sun
Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2018
Switchable temporal propagation network
Sifei Liu, Guangyu Zhong, Shalini De Mello, Jinwei Gu, Ming-Hsuan Yang, J. Kautz
Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2018
Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos
Kihyuk Sohn, Sifei Liu, Guangyu Zhong, Xiang Yu, Ming-Hsuan Yang, Manmohan Chandraker
Proceedings of the IEEE International Conference on Computer Vision, 2017
Generative face completion
Yijun Li, Sifei Liu, Jimei Yang, Ming-Hsuan Yang
Proceedings 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017
Face parsing via recurrent propagation
Sifei Liu, Jianping Shi, Liang Ji, Ming-Hsuan Yang
British Machine Vision Conference 2017, BMVC 2017, 2017
Learning affinity via spatial propagation networks
Advances in Neural Information Processing Systems, 2017
Learning recursive filters for low-level vision via a hybrid neural network
Sifei Liu, Jinshan Pan, Ming-Hsuan Yang
Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2016
Deep cascaded Bi-network for face hallucination
Shizhan Zhu, Sifei Liu, Chen Change Loy, Xiaoou Tang
Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics, 2016
Multi-objective convolutional learning for face labeling
Sifei Liu, Jimei Yang, Chang Huang, Ming-Hsuan Yang
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015
Compressed face hallucination
Sifei Liu, Ming-Hsuan Yang
2014 IEEE International Conference on Image Processing, ICIP 2014, 2014
Structured face hallucination
Chih-Yuan Yang, Sifei Liu, Ming-Hsuan Yang
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2013
Heterogeneous face image matching using multi-scale features
Sifei Liu, Dong Yi, Zhen Lei, Stan Z. Li
Proceedings 2012 5th IAPR International Conference on Biometrics, ICB 2012, 2012
Discriminant analysis with Gabor phase for robust face recognition
Jianfei Zhu, Dong Cao, Sifei Liu, Zhen Lei, Stan Z. Li
Proceedings 2012 5th IAPR International Conference on Biometrics, ICB 2012, 2012
A face antispoofing database with diverse attacks
Zhiwei Zhang, Junjie Yan, Sifei Liu, Zhen Lei, Dong Yi, Stan Z. Li
Proceedings 2012 5th IAPR International Conference on Biometrics, ICB 2012, 2012
Face alignment under partial occlusion in near infrared images
Sifei Liu, Dong Yi, Bin Li, Stan Z. Li
2010 Chinese Conference on Pattern Recognition, CCPR 2010, Proceedings, 2010
Novel method for fire smoke recognition based on Gabor wavelet
Yi Qi Yi Biao Xue Bao/Chinese Journal of Scientific Instrument, 2010