Zhou, Yaoyang

Architect of LLM DSA; Maintainer of the u-arch simulator for Xiangshan; PhD in Computer Architecture

Beijing Institute of Open Source Chip

Biography

I am interested in LLM inference and CPU micro-architecture.

For LLM inference, I am interested in

CPU-style LLM inference architecture, such as XSAI (Xiangshan + AI)
Modeling LLM kernels, chips, and clusters, such as a first-order Softmax model and a Deepseek V3 model
Speculative decoding

From Oct. 2024 to Oct. 2025, I worked on XSAI (XSAI slides here, XSAI repo here). We hope to provide hardware support for modern LLM kernels in a CPU paradigm on Xiangshan, and to hide memory latency automatically with out-of-order execution and prefetching. See XSAI’s roadmap here.

For CPU performance, I am experienced in

Prefetchers
Workload characterization
Performance counter architecture
Performance evaluation framework

From 2022 to 2024, I led the performance analysis and modeling team for the Xiangshan processor at the Beijing Institute of Open Source Chip (BOSC). Our team played a significant role in the design of the 3rd-generation architecture of the Xiangshan processor, which achieved a SPECint2k6 score of 15/GHz on both the C++ simulator and RTL.

My hobbies include badminton and investing. I obtained my Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, and my B.Sc. degree from Nanjing University.

Interests

LLM inference
CPU micro-architecture
Investment
Badminton

Education

PhD in Computer Architecture, 2017 - 2023
Institute of Computing Technology, CAS
BSc in Computer Science, 2013 - 2017
Nanjing University

Recent Posts

The myth of decoding large language models

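In discussions of large language model (LLM) performance, one long-standing claim is that "the Attention operation during decoding is memory bound." The view is so deeply rooted that many optimization discussions take it as a premise. However, as model architectures evolve and decoding strategies innovate, this myth is being broken.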

Zhou, Yaoyang

Sep 21, 2025 3 min read Work

How to estimate per-card throughput for EP deployment of Deepseek on chips of different specifications, V1.0

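Introduction: Unlike the previous post, this one mainly introduces the modeling method and how to search for the configuration that maximizes throughput. TL;DR: data for H800, H20, A100, and L20 is attached at the end (this is not advice on which card to buy). Throughput calculation: The estimate used in this post works as follows. First, assume an average context length of 5K (the 5K figure follows shen han's article: https://zhuanlan.zhihu.com/p/29841050824). Then, with DRAM capacity as the constraint, compute the maximum batch size per card. Next, estimate the latency of a single token to obtain tokens per second. Finally, per-card throughput = batch size per card * tokens per second.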

Zhou, Yaoyang

Last updated on Mar 16, 2025 11 min read Interests
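
The four-step estimate summarized in the post above can be sketched in a few lines of Python. This is only a minimal illustration, not the post's actual model: the `Card` fields, the function names, and every number in the usage note are hypothetical, and the per-token latency is taken as an input rather than modeled.

```python
from dataclasses import dataclass


@dataclass
class Card:
    """Hypothetical per-card parameters (all names are assumptions)."""
    dram_bytes: int          # total DRAM capacity of one card
    weight_bytes: int        # model weights resident on this card
    kv_bytes_per_token: int  # KV cache bytes needed per context token


def max_batch_size(card: Card, avg_context_tokens: int = 5000) -> int:
    """Largest batch per card that fits in DRAM at the assumed context length."""
    free_bytes = card.dram_bytes - card.weight_bytes
    if free_bytes <= 0:
        return 0
    bytes_per_request = card.kv_bytes_per_token * avg_context_tokens
    return free_bytes // bytes_per_request


def card_throughput(batch_size: int, token_latency_s: float) -> float:
    """Per-card throughput = batch size per card * tokens per second."""
    tokens_per_second = 1.0 / token_latency_s
    return batch_size * tokens_per_second
```

For example, with made-up values of 80 GB of DRAM, 30 GB of weights, and 100 KB of KV cache per token, `max_batch_size` gives 100 requests per card; at an assumed 50 ms per token, `card_throughput(100, 0.05)` is about 2000 tokens per second.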

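How people without an AI background can get started with large models (Part 1)

An introductory AI reading list written by someone without an AI background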

Zhou, Yaoyang

Aug 11, 2024 2 min read Work

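Simulating RISC-V vector instructions in a vectorized way

Although several ISA simulators already implement RISC-V Vector 1.0, such as Spike, QEMU, and NEMU, these implementations are not yet fast enough for efficient software-hardware co-design. To bring the performance feedback time after software or compiler changes down to under one day, we optimized NEMU's RISC-V Vector implementation……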

Zhou, Yaoyang

Last updated on Aug 10, 2024 2 min read Work

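Video tutorial on generating Xiangshan full-system workloads and checkpoints

Recently, we realized that getting the simulation environment and workload programs for the Xiangshan processor up and running is very challenging for users. To help users from different backgrounds build SPEC CPU 2006 workloads and checkpoints more smoothly, we made a video tutorial.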

Zhou, Yaoyang

Mar 21, 2024 1 min read Work


Publications


Zhou, Yaoyang, Zihao Yu, Chuanqi Zhang, Yinan Xu, Huizhe Wang, Sa Wang, Ninghui Sun, Yungang Bao (2021). Omegaflow: a high-performance dependency-based architecture. In ICS 2021.


Xin Jin, Zhou, Yaoyang, Bowen Huang, Zihao Yu, Xusheng Zhan, Huizhe Wang, Sa Wang, Ningmei Yu, Ninghui Sun, Yungang Bao (2019). QoSMT: supporting precise performance control for simultaneous multithreading architecture. In ICS 2019.

