# LLM Knowledge Base
Large Language Model knowledge: papers, interview questions, and core topics.
## Interview Questions
- 2026-04-24 Interview: How would you design an LLM inference cluster that handles heterogeneous requests (different lengths, different models)?
- 2026-04-23 Interview: How is the bubble ratio in Pipeline Parallelism calculated, and what methods reduce the bubble?
- 2026-04-22 Interview: Where does AllReduce communication occur in Tensor Parallelism, and how much does it affect latency?
- 2026-04-21 Interview: How do you extend a 4K-context model to 32K without retraining? What is the difference between YaRN and NTK-aware scaling?
- 2026-04-20 Interview: How is KV Cache memory managed for long-context (100K+) models, and what compression techniques exist?
- 2026-04-19 Interview: How is structured output (JSON Mode) implemented, and how large is the compute overhead of constrained decoding?
- 2026-04-18 Interview: How do the sampling parameters Top-k, Top-p, and Temperature interact, and what happens when they are set badly?
- 2026-04-17 Interview: What is the trade-off between First Token Latency and Throughput in model serving, and how do you balance them?
- 2026-04-16 Interview: Where is the theoretical throughput bottleneck for FP16 inference of a 7B model: compute or memory bandwidth?
- 2026-04-15 Interview: How much computation can Prefix Caching save in multi-turn dialogue, and what are its preconditions?
- 2026-04-14 Interview: What advantages does Continuous Batching have over Static Batching, and what are the implementation challenges?
- 2026-04-13 Interview: How is the correctness of Speculative Decoding guaranteed mathematically, and when does it work best?
- 2026-04-12 Interview: What problem does vLLM's PagedAttention solve, and how much efficiency does it gain over traditional static memory allocation?
- 2026-04-11 Interview: What is the core difference between GPTQ and AWQ, and why is AWQ called "activation-aware"?
- 2026-04-10 Interview: Which scenarios suit INT4 vs FP8 quantization, and why can some layers not be quantized?
- 2026-04-09 Interview: An in-depth analysis of the evaluation blind spots of MT-Bench and AlpacaEval
- 2026-04-08 Interview: The problem of too large a Teacher-Student capability gap in knowledge distillation, and how to address it
- 2026-04-07 Interview: How Model Merging works, and the core ideas of TIES-Merging and DARE
- 2026-04-06 Interview: How over-alignment manifests and how to detect it
- 2026-04-05 Interview: Rejection Sampling vs Best-of-N in alignment: strengths and weaknesses
- 2026-04-04 Interview: An in-depth analysis of loss-computation strategies in multi-turn dialogue training
- 2026-04-03 Interview: Why Chat Template design matters, and the problems caused by template incompatibility
- 2026-04-02 Interview: The nature of the Alignment Tax, how to quantify it, and mitigation strategies
- 2026-04-01 Interview: The self-critique mechanism of Constitutional AI: how it works and its limitations
- 2026-03-31 Interview: The real pros and cons of DPO vs PPO, and why DeepSeek-R1 returned to PPO-style RL
- 2026-03-31 Interview: Why PPO is hard to train for LLM alignment, and the central role of the KL-divergence penalty
- 2026-03-28 Interview: Theoretical guidance for choosing LoRA rank, and rank sensitivity across different tasks
- 2026-03-27 Interview: The diversity-vs-quality trade-off in instruction data, and methods for quantifying data quality
- 2026-03-26 Interview: Where the LIMA paper's conclusion that "1,000 examples suffice for SFT" applies, and where it breaks down
- 2026-03-25 Interview: Why the SFT learning rate is far lower than in pre-training, and common SFT pitfalls
- 2026-03-24 Interview: An in-depth comparison of when to use Tensor Parallelism vs Pipeline Parallelism
- 2026-03-23 Interview: The time-memory trade-off ratio of Gradient Checkpointing, and strategies for choosing which layers to checkpoint
- 2026-03-22 Interview: What ZeRO Stages 1/2/3 each shard, and an analysis of Stage 3 communication volume
- 2026-03-21 Interview: Computing AllReduce communication volume in distributed training, and Ring vs Tree AllReduce
- 2026-03-20 Interview: The full workflow and caveats of extending a Tokenizer vocabulary for a specific domain
- 2026-03-19 Interview: The root cause of catastrophic forgetting, and the limits of classical remedies in the LLM setting
- 2026-03-18 Interview: Mixing ratios of domain data vs general data in continued pre-training (CPT)
- 2026-03-17 Interview: Why is data deduplication so important? What goes wrong with no deduplication, and with over-deduplication?
- 2026-03-16 Interview: How does the share of code data in pre-training affect a model's reasoning ability, and what experimental evidence exists?
- 2026-03-15 Interview: Why is the Chinchilla law so often violated in industry, and what justifies over-training?
- 2026-03-14 Interview: Scaling Laws say loss falls as a power law in compute, but does this rule ever break down?
- 2026-03-13 Interview: A loss spike suddenly appears during pre-training: what is your debugging process and response strategy?
- 2026-03-12 Interview: What is the fundamental reason BF16 suits LLM training better than FP16 in mixed-precision training?
- 2026-03-11 Interview: Adam vs AdamW is more than a naming difference: explain the mathematical difference in how the two optimizers apply weight decay.
- 2026-03-10 Interview: How should Batch Size and Learning Rate be adjusted together during pre-training, and what are the limits of the Linear Scaling Rule?
- 2026-03-09 Interview: What does label smoothing in the cross-entropy loss do for LLMs, and when should (or shouldn't) it be used?
- 2026-03-08 Interview: How does the BPE merge strategy affect model performance, and what special considerations apply to tokenizing Chinese?
- 2026-03-07 Interview: If you designed a 7B-parameter LLM architecture from scratch, how would you allocate layer count, hidden dimension, and head count?
- 2026-03-06 Interview: Is the Transformer's compute bottleneck in Attention or the FFN, and how does this differ between training and inference?
- 2026-03-05 Interview: What does RMSNorm drop compared with LayerNorm, and why does removing mean-centering actually work better?
- 2026-03-04 Interview: What makes SwiGLU better than ReLU/GELU, and why have almost all modern LLMs switched to it?
- 2026-03-03 Interview: Why is Router load balancing a hard problem in MoE architectures, and how did DeepSeek-V2 solve it?
- 2026-03-02 Interview: How does the KV Cache work at inference time? What is its memory-usage formula, and which factor matters most?
- 2026-03-01 Interview: What do GQA and MQA give up relative to standard MHA, and what do they gain? Why did LLaMA-2 70B choose GQA?
- 2026-02-28 Interview: Flash Attention does not change the mathematical result, so why does it achieve a 2-4x speedup? Where is the real bottleneck?
- 2026-02-27 Interview: Why did the Decoder-only architecture win at large-scale pre-training? Is Encoder-Decoder really unworkable?
- 2026-02-26 Interview: What does the FFN in a Transformer actually do? Some research argues the FFN is the main store of knowledge; what is your view?
- 2026-02-25 Interview: How does RoPE compare with absolute position encoding and ALiBi, and why can RoPE extrapolate to longer contexts?
- 2026-02-24 Interview: Why do modern LLMs use Pre-Norm rather than Post-Norm? Does Post-Norm have any advantages?
- 2026-02-23 Interview: Multi-Head Attention has exactly the same parameter count as Single-Head, so where does the advantage of multiple heads really come from?
- 2026-02-23 Interview: Why does Self-Attention need the three matrices Q, K, and V? Would a single shared matrix work?
- 2026-02-23 Interview: Why does the Transformer use scaled dot-product attention rather than additive attention, and what is the mathematical intuition behind the 1/√d_k scaling factor?
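One question above asks for the KV Cache memory-usage formula; it reduces to a single multiplication over the cache shape. A minimal Python sketch, assuming LLaMA-2-7B-like dimensions (32 layers, 32 KV heads, head_dim 128, FP16); the function name is illustrative:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes used by the KV Cache: a K tensor and a V tensor, each of shape
    [batch, n_kv_heads, seq_len, head_dim], stored per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# LLaMA-2-7B-style MHA in FP16: 0.5 MiB of cache per generated token.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch=1)    # 524288 bytes
# A full 4K context at batch 1 costs 2 GiB of KV Cache alone.
full_mha = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1)  # 2 GiB
# GQA with 8 KV heads instead of 32 shrinks it by the head-count ratio (4x).
full_gqa = kv_cache_bytes(32, 8, 128, seq_len=4096, batch=1)   # 512 MiB
```

Sequence length and batch size enter linearly, which is why long-context serving is dominated by KV memory and why GQA/MQA reduce it exactly by the ratio of query heads to KV heads.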
## Topics
- 2026-04-24 A panorama of inference optimization
- 2026-04-23 Knowledge Distillation
- 2026-04-22 Model Merging techniques
- 2026-04-21 The Over-alignment problem
- 2026-04-20 Human evaluation vs automatic evaluation
- 2026-04-19 MT-Bench / AlpacaEval evaluation methods
- 2026-04-18 Safety alignment and Red Teaming
- 2026-04-17 SPIN: self-play fine-tuning
- 2026-04-16 Self-Play training
- 2026-04-15 Rejection Sampling
- 2026-04-14 Multi-turn dialogue training techniques
- 2026-04-13 System Prompt engineering and best practices
- 2026-04-12 Chat Templates and dialogue formats
- 2026-04-11 Alignment Tax
- 2026-04-10 The data-quality vs data-quantity trade-off
- 2026-04-09 Constitutional AI: AI self-constraint
- 2026-04-08 ORPO/SimPO/KTO: newer alignment algorithms
- 2026-04-07 DPO: Direct Preference Optimization
- 2026-04-06 The PPO algorithm in LLM applications
- 2026-04-05 Reward Model training in detail
- 2026-04-04 RLHF overview: learning from human feedback
- 2026-04-03 Full Fine-tuning vs LoRA
- 2026-04-02 QLoRA: 4-bit quantized fine-tuning
- 2026-04-01 LoRA: low-rank adaptation principles
- 2026-03-31 A methodology for constructing instruction data
- 2026-03-31 SFT (supervised fine-tuning) in detail
- 2026-03-28 The SwiGLU activation function
- 2026-03-28 RMSNorm: more efficient normalization
- 2026-03-28 The KV Cache mechanism
- 2026-03-27 Flash Attention: principles and implementation
- 2026-03-26 GQA/MQA attention optimizations
- 2026-03-25 RoPE rotary position embedding
- 2026-03-24 A deep dive into the LLaMA architecture
- 2026-03-23 The evolution of the GPT series (GPT-1 to GPT-4)
- 2026-03-22 The Chinchilla law: compute-optimal training configurations
- 2026-03-21 Scaling Laws: the science of model scale
- 2026-03-20 Tokenizer training: building your vocabulary
- 2026-03-19 Data deduplication and quality-filtering techniques
- 2026-03-18 Pre-training data cleaning and quality control
- 2026-03-17 Continual Pre-training
- 2026-03-16 Masked language modeling (Masked LM)
- 2026-03-15 Autoregressive language modeling (Causal LM)
- 2026-03-14 An overview of Pre-training
- 2026-03-13 Estimating model parameter counts and compute (FLOPs)
- 2026-03-12 Analyzing and computing GPU memory usage
- 2026-03-11 Distributed-training basics (DP/DDP)
- 2026-03-10 Mixed-precision training (FP16/BF16)
- 2026-03-09 Overfitting and regularization strategies
- 2026-03-08 Gradient descent and optimizers (Adam/AdamW)
- 2026-03-07 The relationship between Batch Size and Learning Rate
- 2026-03-06 Perplexity: measuring how good a language model is
- 2026-03-05 The cross-entropy loss in LLMs
- 2026-03-04 BPE/WordPiece/SentencePiece tokenization algorithms
- 2026-03-03 The Softmax function and the temperature parameter
- 2026-03-02 Layer Normalization and residual connections
- 2026-03-01 Encoder vs Decoder: differences and connections
- 2026-02-28 The Transformer architecture, end to end
- 2026-02-27 Positional Encoding
- 2026-02-26 Multi-Head Attention
- 2026-02-25 Self-Attention in detail
- 2026-02-24 The essence of the Attention mechanism
- 2026-02-23 Embeddings: from discrete symbols to continuous vectors
- 2026-02-23 Tokens and tokenization: how LLMs read text
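Several topics above (the Softmax function and temperature, sampling-related questions) hinge on how temperature reshapes the output distribution before sampling. A minimal pure-Python sketch; the function name is illustrative:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Scale logits by 1/T before the softmax: T < 1 sharpens the
    # distribution toward the argmax, T > 1 flattens it toward uniform.
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
p_sharp = softmax_with_temperature(logits, temperature=0.5)
p_flat = softmax_with_temperature(logits, temperature=2.0)
# The top token's probability grows as temperature shrinks:
# p_sharp[0] > p(T=1)[0] > p_flat[0], while each vector still sums to 1.
```

In the limit T → 0 this recovers greedy decoding; combined with Top-k/Top-p it is applied before the truncation step in most sampling pipelines.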
## Papers
- 2026-04-24 Paper: Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5)
- 2026-04-23 Paper: Gorilla: Large Language Model Connected with Massive APIs
- 2026-04-22 Paper: Toolformer: Language Models Can Teach Themselves to Use Tools
- 2026-04-21 Paper: Active Retrieval Augmented Generation (FLARE)
- 2026-04-20 Paper: Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity
- 2026-04-19 Paper: GraphRAG: Unlocking LLM Discovery on Narrative Private Data
- 2026-04-18 Paper: Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE)
- 2026-04-17 Paper: Corrective Retrieval Augmented Generation (CRAG)
- 2026-04-16 Paper: Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
- 2026-04-15 Paper: REALM: Retrieval-Augmented Language Model Pre-Training
- 2026-04-14 Paper: ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction
- 2026-04-13 Paper: Dense Passage Retrieval for Open-Domain Question Answering
- 2026-04-12 Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- 2026-04-11 Paper: RETRO: Improving Language Models by Retrieving from Trillions of Tokens
- 2026-04-10 Paper: Model Soups: Averaging Weights of Multiple Fine-tuned Models
- 2026-04-09 Paper: NEFTune: Noisy Embeddings Improve Instruction Finetuning
- 2026-04-08 Paper: DoRA: Weight-Decomposed Low-Rank Adaptation
- 2026-04-07 Paper: QLoRA: Efficient Finetuning of Quantized LLMs
- 2026-04-06 Paper: LoRA: Low-Rank Adaptation of Large Language Models
- 2026-04-05 Paper: Scaling Data-Constrained Language Models
- 2026-04-04 Paper: Curriculum Learning for LLMs
- 2026-04-03 Paper: Deduplication and data quality
- 2026-04-02 Paper: Textbooks Are All You Need II: phi-1.5
- 2026-04-01 Paper: Code Llama: Open Foundation Models for Code
- 2026-03-31 Paper: DeepSeek-Coder: When the Large Language Model Meets Programming
- 2026-03-31 Paper: StarCoder: May the Source Be with You
- 2026-03-28 Paper: Rejection Sampling and Best-of-N in alignment
- 2026-03-27 Paper: UltraFeedback: Boosting Language Models with High-quality Feedback
- 2026-03-26 Paper: Zephyr: Direct Distillation of LM Alignment
- 2026-03-25 Paper: Orca: Progressive Learning from Complex Explanation Traces
- 2026-03-24 Paper: WizardLM: Empowering LLMs to Follow Complex Instructions (Evol-Instruct)
- 2026-03-23 Paper: SPIN: Self-Play Fine-Tuning
- 2026-03-22 Paper: Proximal Policy Optimization Algorithms (PPO)
- 2026-03-21 Paper: KTO: Model Alignment as Prospect Theoretic Optimization
- 2026-03-20 Paper: ORPO: Monolithic Preference Optimization without Reference Model
- 2026-03-19 Paper: Direct Preference Optimization (DPO)
- 2026-03-18 Paper: Constitutional AI: Harmlessness from AI Feedback
- 2026-03-17 Paper: LIMA: Less Is More for Alignment
- 2026-03-16 Paper: Stanford Alpaca: An Instruction-following LLaMA Model
- 2026-03-15 Paper: Self-Instruct: Aligning Language Models with Self-Generated Instructions
- 2026-03-14 Paper: Training language models to follow instructions with human feedback
- 2026-03-13 Paper: Scaling Laws for Neural Language Models
- 2026-03-12 Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding
- 2026-03-11 Paper: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- 2026-03-10 Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- 2026-03-09 Paper: PaLM: Scaling Language Modeling with Pathways
- 2026-03-08 Paper: DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- 2026-03-07 Paper: Textbooks Are All You Need
- 2026-03-06 Paper: RWKV: Reinventing RNNs for the Transformer Era
- 2026-03-05 Paper: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- 2026-03-04 Paper: Mistral 7B
- 2026-03-03 Paper: LLaMA: Open and Efficient Foundation Language Models
- 2026-03-02 Paper: Training Compute-Optimal Large Language Models
- 2026-03-01 Paper: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- 2026-02-28 Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- 2026-02-27 Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- 2026-02-26 Paper: Language Models are Few-Shot Learners
- 2026-02-25 Paper: Language Models are Unsupervised Multitask Learners
- 2026-02-24 Paper: Improving Language Understanding by Generative Pre-Training
- 2026-02-23 Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- 2026-02-23 Paper: Attention Is All You Need
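The list closes with "Attention Is All You Need", whose central operation, scaled dot-product attention with the 1/√d_k factor, fits in a few lines. A dependency-free Python sketch (names illustrative), ignoring masking, batching, and multiple heads:

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Q, K: lists of query/key vectors of dim d_k; V: list of value vectors.
    Returns softmax(Q K^T / sqrt(d_k)) V, row by row."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)  # keeps dot-product variance ~1 so softmax doesn't saturate
    out = []
    for q in Q:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        m = max(scores)                      # stable softmax over the scores
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# A query matching the first key almost exactly attends almost entirely
# to the first value row.
K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
attn = scaled_dot_product_attention([[10.0, 0.0]], K, V)
```

Without the `scale` factor the scores here would be 100 vs 0 instead of ~70.7 vs 0; at realistic d_k the unscaled logits grow with d_k, pushing the softmax into near-one-hot territory with vanishing gradients, which is the intuition behind dividing by √d_k.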