

Core contributor to Amazon Nova. I work on video understanding and multimodal LLMs — from foundation models to post-training.
I'm an Applied Scientist at Amazon AGI. Before Amazon, I spent nearly seven years (2018–2025) at Baidu VIS with Chief Scientist Dr. Jingdong Wang (IEEE Fellow), progressing from research intern to Senior/Staff Researcher across multiple large-scale computer-vision and multimodal projects.
I earned my Ph.D. from MMLab, The University of Sydney, advised by Prof. Wanli Ouyang, and my M.S.E. from the University of Chinese Academy of Sciences (UCAS) under Prof. Shifeng Chen and Prof. Yu Qiao. I've also worked at AWS AI Labs, SenseTime, Samsung Research and iQIYI AI, and was a visiting scholar at MMLab@CUHK and MMLab@SIAT-CAS.
I'm a recipient of the Baidu PhD Fellowship (2023, one of 10 recipients worldwide) and the DAAD AInet Fellowship (2025).
In this era of rapidly advancing AGI, I've come to believe that incremental papers matter far less than what actually moves the needle — rigorous scaling validation and the unglamorous, careful engineering behind it, like getting data quality right. That's where I've focused my energy since 2025. If this resonates and you'd like to collaborate or chat, feel free to email me.
A few works honored by top venues as Oral / Highlight / Spotlight (top few %) — from the good old days of chasing papers. Feel free to skip; these days I'd rather build things that matter. Still, the full list is here if you're curious.
Empowering MLLMs with o1-like reasoning and reflection via collective Monte-Carlo tree search.
Generating distinctive audio descriptions for long-form video story understanding.
Auxiliary captions for text-video retrieval. Journal extension in IEEE TPAMI.
Revisiting classifier: transferring vision-language knowledge for video recognition. Journal extension in IJCV.
Macro-to-micro contrastive learning for self-supervised video representation.
Multi-agent reinforcement learning for frame sampling in untrimmed video — accepted as an ICCV Oral on my first submission.










Conference Reviewer / PC Member: ICML (2025–2026), NeurIPS (2024–2026), CVPR (2021–2026), ICCV (2023, 2025), ECCV (2022, 2024, 2026), IJCAI (2021), AAAI (2021–2023), ACMMM (2023–2024), WACV (2022).
Journal Reviewer: TPAMI, IJCV, TNNLS, TIP, TCSVT, TMM, CVIU, TBME, KBS, IJMIR, TITS, TOMM.
Member of IEEE, ACM, AAAI and CVF · Off-Campus Mentor at Tsinghua University (2023–2025).
I'm honestly a pretty low-key person. Fittingly for someone who works on video understanding, most of my downtime goes to… watching videos 📺: variety and stand-up shows, Chinese TV dramas (iQIYI / Youku / Tencent Video), movies, and anime, from the classics (Detective Conan, Dragon Ball) to Chinese ones like Biao Ren (镖人) and A Record of a Mortal's Journey to Immortality (凡人修仙传). Both of those update painfully slowly, so I'm perpetually waiting for the next episode. 😅 Add to that an unhealthy amount of Douyin and Bilibili, and yes, I pay for memberships on basically every streaming platform.
Back in China I loved eating my way through the food 🍜 and traveled across most of its provinces. In the US, with good Chinese food harder to find, my evenings are mostly videos, and my mornings often start with the stock market 📈 (a firm believer in the LEAPS + sell-put strategy, with returns that swing more wildly than my training loss curves). My Switch and PS5 🎮 are gathering dust these days — mostly I just take after-dinner walks in the park with my wife. 🌳 By the standards of AI and academia, I married and started a family early — a baby before 30. 👶 I'm also pretty lazy and never work out, yet somehow stay slim thanks to good genes (my dad's the same). 🤷