About Chiyue Wei
Chiyue Wei is a Ph.D. student in Electrical and Computer Engineering at Duke University, working under the supervision of Professor Yiran Chen. His research interests lie at the intersection of computer architecture and deep learning. Prior to Duke, he earned his Bachelor’s degree in Electronic Engineering from Tsinghua University in 2023, where he conducted research with Professor Yuan Xie and Professor Yu Wang.
News
- [2026/06] Excited to share our latest work on Diffusion LLM serving, Optimus: Elastic Decoding for Efficient Diffusion LLM Serving, delivering up to a 6.1× throughput gain over autoregressive decoding. Check out the paper and code.
- [2026/05] Excited to start my research internship in the AI-System Co-Design team at Meta, where I work on MTIA, Meta’s in-house AI accelerator.
- [2026/03] Our work EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture is accepted to ISCA 2026! Check out the paper and code.
- [2025/11] Two papers accepted to HPCA 2026! One of them features our work Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models, now a Best Paper Nominee! Check out the paper and code.
- [2025/08] Check out our work DPad, a training-free acceleration method for Diffusion LLMs, now available on arXiv!
- [2025/08] Wrapped up my internship at NVIDIA, where I worked on the FlashInfer project. I developed high-performance and customizable attention kernels with CuTe DSL, optimized for Blackwell GPUs.
- [2025/06] Honored to be named a DAC 2025 Young Fellow.
- [2025/06] Excited that our works Phi, Transitive Array, and Ecco were presented at ISCA 2025. Check out the slides for Phi.
- [2025/05] I’m excited to start my summer internship at NVIDIA, focusing on LLM inference framework optimization within the Deep Learning Frameworks team.
- [2025/03] Three papers accepted by ISCA 2025! Topics include acceleration for Spiking Neural Networks, General Matrix Multiplications, and Large Language Models.
- [2025/03] I presented Prosperity at HPCA 2025 in Las Vegas! Check out the presentation slides and video.
- [2024/11] Our paper “Prosperity: Accelerating Spiking Neural Networks via Product Sparsity” is accepted by HPCA 2025.
Selected Publications
- [HPCA 2026, Best Paper Nominee] Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models, Chiyue Wei, Cong Guo, Junyao Zhang, Haoxuan Shan, Yifan Xu, Ziyue Zhang, Yudong Liu, Qinsi Wang, Changchun Zhou, Hai “Helen” Li, Yiran Chen
- [Preprint] Optimus: Elastic Decoding for Efficient Diffusion LLM Serving, Chiyue Wei, Cong Guo, Bowen Duan, Junyao Zhang, Haoxuan Shan, Yifei Wang, Yangjie Zhou, Hai “Helen” Li, Danyang Zhuo, Yiran Chen
- [ISCA 2026] EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture, Bowen Duan, Cong Guo, Chiyue Wei, Haoxuan Shan, Yuzhe Fu, Xinhua Chen, Yifan Xu, Ziyue Zhang, Changchun Zhou, Hai Li, Yiran Chen
- [ISCA 2025] Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks, Chiyue Wei, Bowen Duan, Cong Guo, Jingyang Zhang, Qingyue Song, Hai Li, Yiran Chen
- [ISCA 2025] Transitive Array: An Efficient GEMM Accelerator with Result Reuse, Cong Guo, Chiyue Wei, Jiaming Tang, Bowen Duan, Song Han, Hai Li, Yiran Chen
- [ISCA 2025] Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression, Feng Cheng, Cong Guo, Chiyue Wei, Junyao Zhang, Changchun Zhou, Edward Hanson, Jiaqi Zhang, Xiaoxiao Liu, Hai Li, Yiran Chen
- [HPCA 2025] Prosperity: Accelerating Spiking Neural Networks via Product Sparsity, Chiyue Wei, Cong Guo, Feng Cheng, Shiyu Li, Hao Yang, Hai Li, Yiran Chen
- [DATE 2023] CLAP: Locality Aware and Parallel Triangle Counting with Content Addressable Memory, Tianyu Fu, Chiyue Wei, Zhenhua Zhu, Shang Yang, Zhongming Yu, Guohao Dai, Huazhong Yang, Yu Wang
- [ISCA 2022] DIMMining: Pruning-Efficient and Parallel Graph Mining on Near-Memory-Computing, Guohao Dai, Zhenhua Zhu, Tianyu Fu, Chiyue Wei, Bangyan Wang, Xiangyu Li, Yuan Xie, Huazhong Yang, Yu Wang
