Ji Qi 齐济

Postdoc Fellow @ Zhipu&Tsinghua

Bio

I am a Postdoctoral Fellow jointly affiliated with Tsinghua University and Zhipu AI, where I am fortunate to work with Prof. Jie Tang.

Previously, I received my Ph.D. from the Knowledge Engineering Group (KEG), Department of Computer Science and Technology, Tsinghua University in 2025, advised by Prof. Bin Xu and Prof. Juanzi Li.

I was fortunate to be a visiting student at the School of Computing, National University of Singapore, advised by Prof. Tat-Seng Chua.

Currently, I work on the foundations and development of large multimodal models (LMMs).

New Research on Large Multimodal Models
April 2026
We recently released GLM-5V-Turbo, the first multimodal coding foundation model, built for vision-based coding tasks.
New Research on Large Multimodal Models
July 2025
We recently released GLM-4.1V and GLM-4.5V, two foundational and powerful large multimodal models.
New Research on Multimodal Video Understanding
April 2025
We recently released Quicksviewer, an LMM for efficient long video understanding via reinforced compression of video cubes.

Selected Papers

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

GLM-V Team

Preprint

We present GLM-4.1V-Thinking and GLM-4.5V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V.
@article{vteam2025glm45vglm41vthinkingversatilemultimodal,
  title = {GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning},
  preview = {glmv-logo.svg},
  bibtex_show = {true},
  journal = {Preprint},
  code = {https://github.com/zai-org/GLM-V/tree/main},
  selected = {true},
  author = {Team, GLM-V},
  year = {2025},
  arxiv = {2507.01006},
  html = {https://arxiv.org/abs/2507.01006},
  pdf = {https://arxiv.org/pdf/2507.01006}
}

An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes

Ji Qi, Yuan Yao, Yushi Bai, Bin Xu, Juanzi Li, Zhiyuan Liu, and Tat-Seng Chua

Preprint

Large Multimodal Models (LMMs) uniformly perceive video frames, creating computational inefficiency for videos with inherently varying temporal information density. This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax, followed by a unified resampling for each cube to achieve efficient video understanding. This simple and intuitive approach dynamically compresses video online based on its temporal density, significantly reducing spatiotemporal redundancy (overall 45× compression rate), while enabling efficient training with a large receptive field. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420s at 1fps thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by a maximum of 8.72 in accuracy, demonstrating its effectiveness. On Video-MME, Quicksviewer achieves SOTA under modest sequence lengths using at most 5% of the tokens per frame required by baselines. With this paradigm, scaling up the number of input frames reveals a clear power law of the model capabilities. It is also empirically verified that the segments generated by the cubing network can help analyze continuous events in videos.
@article{qi2025lmm,
  title = {An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes},
  preview = {quicksviewer-logo.jpg},
  bibtex_show = {true},
  journal = {Preprint},
  code = {https://github.com/quicksviewer/quicksviewer},
  selected = {true},
  author = {Qi, Ji and Yao, Yuan and Bai, Yushi and Xu, Bin and Li, Juanzi and Liu, Zhiyuan and Chua, Tat-Seng},
  arxiv = {2504.15270},
  html = {https://arxiv.org/abs/2504.15270},
  pdf = {https://arxiv.org/pdf/2504.15270},
  year = {2025}
}

CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning

Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, and Jie Tang

ICLR 2024

Vision-Language Models (VLMs) have demonstrated their broad effectiveness thanks to extensive training in aligning visual instructions to responses. However, such training of conclusive alignment leads models to ignore essential visual reasoning, further resulting in failures on meticulous visual problems and unfaithful responses. Drawing inspiration from human cognition in solving visual problems (e.g., marking, zooming in), this paper introduces Chain of Manipulations, a mechanism that enables VLMs to solve problems step by step with evidence. After training, models can solve various visual problems by actively eliciting intrinsic manipulations (e.g., grounding, zooming in) with results (e.g., boxes, images) without involving external tools, while also allowing users to trace error causes. We study the roadmap to implement this mechanism, including (1) a flexible design of manipulations based on extensive analysis, (2) an efficient automated data generation pipeline, (3) a compatible VLM architecture capable of multi-turn, multi-image processing, and (4) a model training process for versatile capabilities. With this design, we also manually annotate 6K high-quality samples for challenging graphical mathematical problems. Our trained model, CogCoM, equipped with this mechanism and 17B parameters, achieves state-of-the-art performance across 9 benchmarks from 4 categories, demonstrating its effectiveness while preserving interpretability. Our code, model weights, and collected data are publicly available.
@inproceedings{qi2025cogcom,
  title = {CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning},
  preview = {cogcom-logo.png},
  bibtex_show = {true},
  booktitle = {ICLR},
  code = {https://github.com/zai-org/CogCoM},
  selected = {true},
  author = {Qi, Ji and Ding, Ming and Wang, Weihan and Bai, Yushi and Lv, Qingsong and Hong, Wenyi and Xu, Bin and Hou, Lei and Li, Juanzi and Dong, Yuxiao and Tang, Jie},
  arxiv = {2402.04236},
  html = {https://arxiv.org/abs/2402.04236},
  pdf = {https://arxiv.org/pdf/2402.04236},
  year = {2024}
}

Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction

Ji Qi, Chuchun Zhang, Xiaozhi Wang, Kaisheng Zeng, Jifan Yu, Jinxin Liu, Lei Hou, Juanzi Li, and Bin Xu

EMNLP 2023 Outstanding Paper Award

Robustness to distribution changes ensures that NLP models can be successfully applied in the real world, especially for information extraction tasks. However, most prior evaluation benchmarks have been devoted to validating pairwise matching correctness, ignoring the crucial validation of robustness. In this paper, we present the first benchmark that simulates the evaluation of open information extraction models in the real world, where the syntactic and expressive distributions under the same knowledge meaning may drift in various ways. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique that consists of sentences with structured knowledge of the same meaning but with different syntactic and expressive forms. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate across all cliques. We perform experiments on typical models published in the last decade as well as a representative large language model, and the results show that existing successful models exhibit a frustrating degradation, with a maximum drop of 23.43 in F1 score.
@inproceedings{qi2023preserving,
  title = {Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction},
  preview = {robust-logo.jpg},
  bibtex_show = {true},
  booktitle = {EMNLP},
  code = {https://github.com/qijimrc/ROBUST},
  selected = {true},
  author = {Qi, Ji and Zhang, Chuchun and Wang, Xiaozhi and Zeng, Kaisheng and Yu, Jifan and Liu, Jinxin and Hou, Lei and Li, Juanzi and Xu, Bin},
  arxiv = {2305.13981},
  html = {https://arxiv.org/abs/2305.13981},
  pdf = {https://arxiv.org/pdf/2305.13981},
  award = {<b><font color="BB0A21"> Outstanding Paper Award </font></b>},
  year = {2023}
}

GOAL: A challenging knowledge-grounded video captioning benchmark for real-time soccer commentary generation

Ji Qi, Jifan Yu, Teng Tu, Kunyu Gao, Yifan Xu, Xinyu Guan, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li, and Jie Tang

CIKM 2023

Despite the recent emergence of video captioning models, generating vivid, fine-grained video descriptions based on background knowledge (i.e., long and informative commentary about domain-specific scenes with appropriate reasoning) is still far from solved, even though it has valuable applications such as automatic sports narration. In this paper, we present GOAL, a benchmark of over 8.9k soccer video clips, 22k sentences, and 42k knowledge triples, and propose a challenging new task setting, Knowledge-grounded Video Captioning (KGVC). Moreover, we experimentally adapt existing methods to show the difficulty of this valuable and applicable task and potential directions for solving it. Our data and code are publicly available.
@inproceedings{qi2023goal,
  title = {GOAL: A challenging knowledge-grounded video captioning benchmark for real-time soccer commentary generation},
  preview = {goal-logo.jpg},
  bibtex_show = {true},
  booktitle = {CIKM},
  code = {https://github.com/THU-KEG/goal},
  selected = {true},
  author = {Qi, Ji and Yu, Jifan and Tu, Teng and Gao, Kunyu and Xu, Yifan and Guan, Xinyu and Wang, Xiaozhi and Xu, Bin and Hou, Lei and Li, Juanzi and Tang, Jie},
  arxiv = {2303.14655},
  html = {https://arxiv.org/abs/2303.14655},
  pdf = {https://arxiv.org/pdf/2303.14655},
  year = {2023}
}

Syntactically robust training on partially-observed data for open information extraction

Ji Qi, Yuxiang Chen, Lei Hou, Juanzi Li, and Bin Xu

EMNLP 2022

Open Information Extraction models have shown promising results with sufficient supervision. However, these models face a fundamental challenge: the syntactic distribution of the training data is only partially observed compared with that of the real world. In this paper, we propose a syntactically robust training framework that enables models to be trained on a syntactically abundant distribution based on diverse paraphrase generation. To tackle the intrinsic problem of knowledge deformation in paraphrasing, two algorithms based on semantic similarity matching and syntactic tree walking are used to restore the expressionally transformed knowledge. The training framework can be generally applied to other syntactically partially observable domains. Based on the proposed framework, we build a new evaluation set called CaRB-AutoPara, a syntactically diverse dataset consistent with the real-world setting for validating the robustness of the models. Experiments including a thorough analysis show that the performance of the model degrades as the difference in syntactic distribution increases, while our framework gives a robust boundary. The source code is publicly available.
@inproceedings{qi2023syntactically,
  title = {Syntactically robust training on partially-observed data for open information extraction},
  preview = {robustoie-logo.png},
  bibtex_show = {true},
  booktitle = {EMNLP},
  code = {https://github.com/qijimrc/RobustOIE},
  selected = {true},
  author = {Qi, Ji and Chen, Yuxiang and Hou, Lei and Li, Juanzi and Xu, Bin},
  arxiv = {2301.06841},
  html = {https://arxiv.org/abs/2301.06841},
  pdf = {https://arxiv.org/pdf/2301.06841},
  year = {2022}
}

Service

NeurIPS 2022~2025
ICML 2022~2025
ICLR 2022~2025
CVPR 2022~2025
ICCV 2022~2025
ACL 2022~2025
EMNLP 2022~2025