Wenhao Wu (吴文灏)  

Zhihu GitHub Google Scholar LinkedIn X Semantic Scholar ORCID

 

Ph.D. Student

School of Computer Science, Faculty of Engineering

The University of Sydney

Personal Email: whwu.ucas (at) gmail.com

Short Biography

Wenhao is a third-year Computer Science Ph.D. student at MMLab, The University of Sydney, supervised by Prof. Wanli Ouyang. Prior to this, he was a full-time Senior Researcher for three years at Baidu VIS, where he worked closely with Chief Scientist Dr. Jingdong Wang (IEEE Fellow). Before that, he received his M.S.E. degree from the University of Chinese Academy of Sciences (UCAS), supervised by Prof. Shifeng Chen and Prof. Yu Qiao.
He has spent 8 years (2016-present) conducting AI research and development in both industry and academia. He is fortunate to have over 6 years of industrial experience at Amazon AI Labs, Baidu, iQIYI, SenseTime, and Samsung Research. Additionally, he is or has been a member of academic institutions such as MMLab@USYD, MMLab@CUHK, and MMLab@SIAT-CAS.
He is honored to have been awarded the Baidu PhD Fellowship (2023).

If interested in collaboration or discussion, please email me.

I am entering the job market now (graduating in Spring 2025) and am actively looking for post-doctoral scholar and full-time research scientist positions in the US. Feel free to schedule a casual chat if our research interests match :)

Research Interest

My research interests lie broadly in Computer Vision and Deep Learning. I have dedicated myself to advanced AI research across these fields, leading to publications in top-tier conferences. As my technical expertise has deepened, I now focus less on paper quantity and more on rethinking problems and offering simple, effective solutions.

Updates

  • New 09/2024: [2/2] Dense Connector and AMP are accepted by NeurIPS 2024! Dense Connector was cited by Apple MM1.5.
  • New 05/2024: We explore visual signals, RLHF, and zero-shot image-to-video extension for MLLMs: (1) We introduce the Dense Connector, a simple, effective, plug-and-play vision-language connector that enhances existing MLLMs by leveraging multi-layer visual features with minimal computational overhead. (2) We present an Automated Multi-level Preference (AMP) framework for RLHF that replaces binary preference learning. It generates high-quality multi-level preference datasets without human/AI annotators and employs the multi-level DPO (MDPO) algorithm. (3) I release FreeVA, a plug-and-play, simple yet effective study of using existing image MLLMs as video conversational models in a training-free manner. ⚡The core code can be just one line!
  • New 05/2024: The extension of Cap4Video has been accepted by TPAMI.
  • New 01/2024: 🎖 I'm honored to be among the 10 PhD students globally awarded the 11th Baidu Scholarship, a prestigious fellowship in Artificial Intelligence that provides 200,000 RMB (about $30,000) to each recipient, selected from thousands of applicants. [11th Baidu Scholarship announced: all 10 recipients worldwide were born after 1995]
  • New 11/2023: We release GPT4Vis, which provides a quantitative evaluation of GPT-4 (😭running it once costs roughly 💰$4000💰) for visual understanding across images, videos, and point clouds, spanning 16 datasets.
  • New 11/2023: We release Side4Video, a Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning, which significantly reduces the training memory cost for action recognition (↓75%) and text-video retrieval (↓30%).
  • New 08/2023: The extension of Text4Vis has been accepted by IJCV.
  • 07/2023: Two first-author papers (Temporal Modeling: ATM, Cross-Modal Retrieval: UA) are accepted by ICCV 2023.
  • 02/2023: Two first-author papers for video understanding (BIKE, Cap4Video) are accepted by CVPR 2023. Cap4Video, which uses GPT to enhance text-video learning, was selected as a Highlight paper (Top 2.5%).
  • 11/2022: Two papers (Video Recognition: Text4Vis, Style Transfer: AdaCM) are accepted by AAAI 2023.
  • 07/2022: Three papers (Video Sampling: NSNet, TSQNet, Cross-Modal Learning: CODER) are accepted by ECCV 2022.
  • 06/2022: Our MaMiCo, a new video self-supervised learning work, is accepted by ACMMM 2022 (Oral Presentation).
  • 03/2022: Two low-level vision papers (MSPC, BAIRNet) are accepted by CVPR 2022.
  • 12/2021: Our BCNet, a general temporal localization framework, is accepted by AAAI 2022.
  • 07/2021: Our ASCNet, a self-supervised video representation learning framework, is accepted by ICCV 2021.
  • 07/2021: Two papers (Video Recognition, Crowd Counting) are accepted by ACMMM 2021.
  • 04/2021: Our paper presenting a novel task, Weakly-Supervised Spatio-Temporal Anomaly Detection, is accepted by IJCAI 2021.
  • 04/2021: Winner in the Traffic Anomaly Detection Track of the CVPR 2021 AI CITY CHALLENGE.
  • 12/2020: Our MVFNet, an efficient temporal module, is accepted by AAAI 2021.
  • 07/2020: Our ADD-GCN for multi-label image recognition is accepted by ECCV 2020.
  • 05/2020: One dynamic video inference paper is accepted for Oral Presentation at the CVPR 2020 EDLCV workshop.
  • 07/2019: My first paper MARL, a novel video sampler, is accepted as an Oral Presentation (Top 4%) at ICCV 2019.
  • 09/2017: Recommended to the University of Chinese Academy of Sciences for a Master's degree with exam exemption (recommended admission).
  • 06/2017: Graduated from Central South University with Outstanding Graduate Honor.
  • 10/2016: Joined MMLab@SIAT as a research intern. Started working on Computer Vision.

Selected Publications [ Full List ]

(*Co-first Author, *Correspondence)
Dense Connector for MLLMs
Huanjin Yao, Wenhao Wu**, Taojiannan Yang, Yuxin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang
Conference on Neural Information Processing Systems (NeurIPS), 2024
[ PDF ] [ Code ] [ My Blog (Chinese) ]
A universal plug-and-play module to enhance Multimodal-LLM.
Automated Multi-level Preference for MLLMs
Mengxi Zhang, Wenhao Wu, Yu Lu, Yuxin Song, Kang Rong, Huanjin Yao, Jianbo Zhang, Fanglong Liu, Yifan Sun, Haocheng Feng, Jingdong Wang
Conference on Neural Information Processing Systems (NeurIPS), 2024
[ PDF ] [ Code ] [ My Blog (Chinese) ]
We present an AMP framework for RLHF, replacing binary preference learning. It generates high-quality multi-level preference datasets without human/AI annotators.
FreeVA: Offline MLLM as Training-Free Video Assistant
Wenhao Wu
Technical Report, 2024
[ PDF ] [ Code ]
FreeVA - a plug-and-play, simple yet effective study exploring the utilization of existing image MLLMs as video conversational models in a training-free manner.
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang, Jingdong Wang
Technical Report, 2023
[ PDF ] [ Code ] [ My Blog (Chinese) ]
We provide a quantitative evaluation of GPT-4 for visual understanding across images, videos, and point clouds, spanning 16 popular datasets.
(😭Running all tests once roughly costs 💰$4000+💰)
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
Huanjin Yao, Wenhao Wu**, Zhiheng Li
Technical Report, 2023
[ PDF ] [ Code ]
Side4Video significantly reduces the training memory cost for action recognition (↓75%) and text-video retrieval (↓30%).
What Can Simple Arithmetic Operations Do for Temporal Modeling?
Wenhao Wu, Yuxin Song, Zhun Sun, Jingdong Wang, Chang Xu, Wanli Ouyang
IEEE International Conference on Computer Vision (ICCV), 2023
[ PDF ] [ Code ]
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
[Highlight, Top 2.5% of 9155 submissions] [ PDF ] [ Code ]

Cap4Video++: Enhancing Video Understanding with Auxiliary Captions
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Impact factor: 23.6
[ PDF ]
Cap4Video leverages auxiliary captions generated by GPT to enhance cross-modal learning.
UATVR: Uncertainty-Adaptive Text-Video Retrieval
Bo Fang*, Wenhao Wu*, Chang Liu, Yu Zhou, Yuxin Song, Weiping Wang, Xiangbo Shu, Xiangyang Ji, Jingdong Wang
IEEE International Conference on Computer Vision (ICCV), 2023
[ PDF ] [ Code ]
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
[ PDF ] [ Code ]
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
Wenhao Wu, Zhun Sun, Wanli Ouyang
The AAAI Conference on Artificial Intelligence (AAAI), 2023
[ PDF ] [ Code ] [ Poster ] [ Slides ] [ Video ]

Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective
Wenhao Wu, Zhun Sun, Yuxin Song, Jingdong Wang, Wanli Ouyang
International Journal of Computer Vision (IJCV), 2023
Impact factor: 19.5
[ PDF ]
We revisit the classifier with the textual embeddings, and achieve SOTA performance on Full-supervision/Few-shot/Zero-shot recognition.
NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition
Boyang Xia*, Wenhao Wu**, Haoran Wang, Rui Su, Dongliang He, Haosen Yang, Xiaoran Fan, Wanli Ouyang
European Conference on Computer Vision (ECCV), 2022
[ PDF ] [ Project ]
A sampler with a 4x faster practical speed than SOTA methods.
Temporal Saliency Query Network for Efficient Video Recognition
Boyang Xia*, Zhihao Wang*, Wenhao Wu*, Haoran Wang, Jungong Han
European Conference on Computer Vision (ECCV), 2022
[ PDF ] [ Project ]
TSQNet, the first work to model temporal sampling as a query-response task.
MaMiCo: Macro-to-Micro Semantic Correspondence for Self-supervised Video Representation Learning
Bo Fang*, Wenhao Wu*, Chang Liu*, Yu Zhou, Dongliang He, Weiping Wang
ACM International Conference on Multimedia (ACMMM), 2022
[Oral, 5.0% acceptance rate] [ PDF ] [ Project ]
MaMiCo, a self-supervised Macro-to-Micro Semantic Correspondence learning framework for video representation learning.
Temporal Action Proposal Generation with Background Constraint
Haosen Yang*, Wenhao Wu*, Lining Wang, Sheng Jin, Boyang Xia, Hongxun Yao, Hujie Huang
The AAAI Conference on Artificial Intelligence (AAAI), 2022
[15% acceptance rate] [ PDF ] [ Code ]
BCNet, a general framework for effective Temporal Action Proposal Generation.
ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency
Deng Huang*, Wenhao Wu*, Weiwen Hu, Xu Liu, Dongliang He, Zhihua Wu, Xiangmiao Wu, Mingkui Tan, Errui Ding
IEEE International Conference on Computer Vision (ICCV), 2021
[ PDF ] [ Poster ] [ Slides ] [ Video ] [ Code ]
An effective self-supervised video representation learning framework.
DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning
Wenhao Wu*, Yuxiang Zhao*, Yanwu Xu, Xiao Tan, Dongliang He, Zhikang Zou, Jin Ye, Yingying Li, Mingde Yao, Zichao Dong, Yifeng Shi
ACM International Conference on Multimedia (ACMMM), 2021
[ PDF ] [ Poster ] [ Slides ] [ Code ]
An efficient plug-and-play module for effective video-level representation learning.
MVFNet: Multi-View Fusion Network for Efficient Video Recognition
Wenhao Wu, Dongliang He, Tianwei Lin, Fu Li, Chuang Gan, Errui Ding
The AAAI Conference on Artificial Intelligence (AAAI), 2021
[ PDF ] [ Poster ] [ Slides ] [ Code ] [ Bibtex ]
@inproceedings{wu2021mvfnet,
    title={Mvfnet: Multi-view fusion network for efficient video recognition},
    author={Wu, Wenhao and He, Dongliang and Lin, Tianwei and Li, Fu and Gan, Chuang and Ding, Errui},
    booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
    volume={35},
    number={4},
    pages={2943--2951},
    year={2021}
}
An efficient architecture for video recognition based on 2D CNN.
Good Practices and A Strong Baseline for Traffic Anomaly Detection
Yuxiang Zhao*, Wenhao Wu*, Yue He, Yingying Li, Xiao Tan, Shifeng Chen
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) - 5th AI City Challenge (AICity), 2021
[ PDF ] [ Code ] [ Bibtex ]
@inproceedings{zhao2021good,
    title={Good Practices and A Strong Baseline for Traffic Anomaly Detection},
    author={Zhao, Yuxiang and Wu, Wenhao and He, Yue and Li, Yingying and Tan, Xiao and Chen, Shifeng},
    booktitle={Proceedings of CVPR Workshops},
    year={2021}
}
Winner of the AI City Challenge for traffic anomaly detection.
Dynamic Inference: A New Approach Toward Efficient Video Action Recognition
Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, Yi Yang, Shilei Wen
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) - Joint Workshop on Efficient Deep Learning in Computer Vision (EDLCV), 2020
[Oral] [ PDF ] [ Slides ] [ Bibtex ]
@inproceedings{wu2020dynamic,
    title={Dynamic Inference: A New Approach Toward Efficient Video Action Recognition},
    author={Wu, Wenhao and He, Dongliang and Tan, Xiao and Chen, Shifeng and Yang, Yi and Wen, Shilei},
    booktitle={Proceedings of CVPR Workshops},
    pages={676--677},
    year={2020}
}
Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition
Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, Shilei Wen
IEEE International Conference on Computer Vision (ICCV), 2019
[Oral, 4.3% acceptance rate] [ PDF ] [ Poster ] [ Slides ] [ Bibtex ]
@inproceedings{wu2019multi,
    title={Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition},
    author={Wu, Wenhao and He, Dongliang and Tan, Xiao and Chen, Shifeng and Wen, Shilei},
    booktitle={Proceedings of the IEEE International Conference on Computer Vision},
    pages={6222--6231},
    year={2019}
}

Education & Visiting

The University of Sydney, Australia

Doctor of Philosophy (Ph.D.) Candidate in Computer Science
Team: Multimedia Laboratory (MMLab@USYD)
Advisors: Prof. Wanli Ouyang, Prof. Chang Xu
2022 - 2025

The Chinese University of Hong Kong, Hong Kong

Honorary Research Assistant in Multimedia Laboratory (MMLab@CUHK)
Advisor: Prof. Wanli Ouyang
2023 - 2024

University of Chinese Academy of Sciences, China

Master of Science in Engineering in Pattern Recognition & Intelligent System
Team: Multimedia Laboratory, SIAT, CAS (MMLab@SIAT-CAS)
Advisors: Prof. Shifeng Chen, Prof. Yu Qiao
2017 - 2020, admitted with exam exemption (recommended admission)

Central South University, China

Bachelor of Engineering in Automation
Advisors: Prof. Shifeng Chen, Prof. Yu Qiao (while visiting CAS, Oct. 2016 - Jun. 2017)
2013 - 2017

Industrial Experiences

Amazon AI

Applied Scientist Intern in AWS AI Labs
worked with Dr. Shuai Zhang, Dr. Taojiannan Yang, Dr. Boran Han, Dr. Bernie Wang
Jun. 2024 - Sep. 2024. Santa Clara, USA

Baidu VIS

Intern → Senior Researcher (Full-time, 3 Years) → Intern on Video Understanding & AIGC
worked with Dr. Jingdong Wang (IEEE Fellow) and Dr. Errui Ding
Oct. 2018 - Present. Shenzhen / Beijing / Hybrid

SenseTime Research

Research Intern in OpenMMLab Team
worked with Dr. Kai Chen
Jan. 2020 - Feb. 2020. Shenzhen, China

iQIYI

Research Intern in Video Analysis Group
hosted by Qiyue Liu
Jun. 2018 - Oct. 2018. Beijing, China

Samsung Research China

Research Intern in Machine Learning Lab
hosted by Zhenbo Luo
Mar. 2018 - Jun. 2018. Beijing, China

Contests

  • CVPR2021 AI CITY Challenge: Traffic Anomaly Detection, Winner Award, 2021
  • CVPR2021 NTIRE Challenge on Image Deblurring: Track 2 JPEG Artifacts, Runner-Up Award, 2021
  • Meritorious Winner (First Prize), American Mathematical Contest in Modeling (MCM), 2016
  • First-class Prize, National Undergraduate Mechanical Innovation Design Competition, 2016
  • Second-class Prize, China Freescale Cup Intelligent Car Competition (South China Region), 2015
  • Second-class Prize, Smart Car Racing Competition of Hunan Province, 2015

Awards

PhD:

  • Chinese Government Award for Outstanding Self-financed Students Abroad (The highest award granted by the Chinese government to Chinese students overseas, 650 recipients per year), 2024
  • Baidu PhD Fellowship (10 awardees globally, 200,000 RMB), 2023
  • Australian Good Design Award, 2022
  • Faculty of Engineering Research Scholarship (Full Scholarship), 2022-2025
Master:

  • Outstanding Student of University of Chinese Academy of Sciences, 2020
  • Outstanding Intern of Baidu, 2019
  • Scholarship for Academic Excellence of SIAT, CAS, 2018
  • Postgraduate Scholarship of UCAS (Full Scholarship), 2017-2020
Undergraduate:

  • Graduate with Honour of Central South University (Top 5%), 2017
  • National Inspirational Scholarship (Top 5%) awarded by the Ministry of Education, 2016
  • Excellent National College Students Innovation and Entrepreneurship Project (20,000 RMB), 2016
  • Outstanding Student Leader, Outstanding Student of Central South University, 2013-2017
  • Scholarship for Academic Excellence of Central South University, 2013-2017
Academic Activities

Journal Reviewer

  • IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
  • IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
  • IEEE Transactions on Image Processing (TIP)
  • IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
  • IEEE Transactions on Multimedia (TMM)
  • Computer Vision and Image Understanding (CVIU)
  • IEEE Transactions on Biomedical Engineering (TBME)
  • Knowledge-Based Systems
  • International Journal of Multimedia Information Retrieval (IJMIR)
  • IEEE Transactions on Intelligent Transportation Systems (TITS)

Conference PC Member/Reviewer

  • Reviewer, The Conference on Computer Vision and Pattern Recognition (CVPR), 2021, 2022, 2023, 2024
  • Reviewer, International Conference on Computer Vision (ICCV), 2023
  • Reviewer, European Conference on Computer Vision (ECCV), 2022, 2024
  • Reviewer, Conference on Neural Information Processing Systems (NeurIPS), 2024
  • PC Member, International Joint Conference on Artificial Intelligence (IJCAI), 2021
  • PC Member, The AAAI Conference on Artificial Intelligence (AAAI), 2021, 2022
  • Reviewer, ACM International Conference on Multimedia (ACMMM), 2023, 2024
  • Reviewer, Winter Conference on Applications of Computer Vision (WACV), 2022
  • Reviewer, International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2022
  • Member of IEEE, ACM, AAAI and CVF

Off-Campus Mentor of Tsinghua University

Current/Past Mentoring

I am fortunate to have provided help to talented mentees (at Baidu/CAS) including:


  • Huanjin Yao (Master, Tsinghua University), 2023 ↝ Now
  • Mengxi Zhang (Master, Tianjin University), 2023 ↝ Now
  • Bo Fang (Master, CAS → PhD, CityU), 2021 ↝ Now
  • Haipeng Luo (Master, CAS → PhD, Tsinghua University), 2022 ↝ 2023
  • Yuguo Wang (Master, Duke University), 2022
  • Zhihao Wang (Master, CAS), 2022
  • Boyang Xia (Master, CAS → Kuaishou), 2021 ↝ 2022
  • Haosen Yang (PhD, University of Surrey), 2021
  • Deng Huang (Master, SCUT → AutoX), 2020 ↝ 2021
  • Hengyuan Zhao (PhD, NUS), 2021
  • Yuxiang Zhao (Master, CAS → PhD, Peking University), 2020 ↝ 2021

Collaborators & Friends

Xiaohan Wang (Stanford University), Xiao Tan (Baidu), Dongliang He (ByteDance), Tianwei Lin (Horizon), Yanwu Xu (Boston University), Jie Wu (ByteDance), Jin Ye (Monash University), Chuang Gan (MIT-IBM Watson Lab), Yihao Liu (Shanghai Lab), Chang Liu (Tsinghua University), Zhun Sun (Tencent), Mingde Yao (CUHK), Min Yang (ByteDance)

Last Updated on 15th August, 2024