LLM Knowledge Base
Large Language Model knowledge: papers, interview questions, and core topics.
Interview Questions
- 2026-04-10 Interview: Which scenarios suit INT4 vs. FP8 in quantization, and why can't some layers be quantized?
- 2026-04-09 Interview: A deep dive into the evaluation blind spots of MT-Bench and AlpacaEval
- 2026-04-08 Interview: Knowledge distillation when the teacher-student capability gap is too large: problems and remedies
- 2026-04-07 Interview: How model merging works, and the core ideas behind TIES-Merging and DARE
- 2026-04-06 Interview: Over-alignment: how it manifests and how to detect it
- 2026-04-05 Interview: Rejection sampling vs. Best-of-N for alignment: strengths and weaknesses
- 2026-04-04 Interview: A deep dive into loss computation strategies for multi-turn dialogue training
- 2026-04-03 Interview: Why chat template design matters, and the problems caused by incompatible templates
- 2026-04-02 Interview: The alignment tax: its nature, how to quantify it, and mitigation strategies
- 2026-04-01 Interview: How Constitutional AI's self-critique mechanism works, and its limitations
- 2026-03-31 Interview: The real trade-offs between DPO and PPO, and why DeepSeek-R1 returned to PPO
- 2026-03-31 Interview: Why PPO is hard to train for LLM alignment, and the central role of the KL-divergence penalty
- 2026-03-28 Interview: Theoretical guidance for choosing LoRA rank, and rank sensitivity across different tasks
- 2026-03-27 Interview: The diversity-quality trade-off in instruction data, and how to quantify data quality
- 2026-03-26 Interview: Where LIMA's conclusion that "1,000 examples suffice for SFT" does and does not apply
- 2026-03-25 Interview: Why the SFT learning rate is far lower than in pre-training, plus common SFT pitfalls
- 2026-03-24 Interview: Tensor parallelism vs. pipeline parallelism: an in-depth comparison of when to use each
- 2026-03-23 Interview: Gradient checkpointing's time-memory trade-off, and strategies for choosing which layers to checkpoint
- 2026-03-22 Interview: What ZeRO Stages 1/2/3 each shard, and a communication-volume analysis of Stage 3
- 2026-03-21 Interview: Computing AllReduce communication volume in distributed training, and Ring vs. Tree AllReduce
- 2026-03-20 Interview: A complete workflow and caveats for extending a tokenizer vocabulary for a specific domain
- 2026-03-19 Interview: The root causes of catastrophic forgetting, and the limits of classic remedies in the LLM setting
- 2026-03-18 Interview: Mixing strategies for domain vs. general data during continual pre-training (CPT)
- 2026-03-17 Interview: Why is data deduplication so important? What goes wrong with no deduplication vs. over-deduplication?
- 2026-03-16 Interview: How does the share of code in pre-training data affect reasoning ability, and what experimental evidence is there?
- 2026-03-15 Interview: Why is the Chinchilla law so often violated in industry, and when is over-training justified?
- 2026-03-14 Interview: Scaling laws say loss falls as a power law in compute, but does that rule ever break down?
- 2026-03-13 Interview: A sudden loss spike appears during pre-training: what is your debugging approach and response strategy?
- 2026-03-12 Interview: What is the fundamental reason BF16 suits LLM training better than FP16 in mixed-precision training?
- 2026-03-11 Interview: Adam vs. AdamW is more than a naming difference: explain the mathematical difference in how each applies weight decay.
- 2026-03-10 Interview: How should batch size and learning rate be tuned together in pre-training, and what are the limits of the linear scaling rule?
- 2026-03-09 Interview: What does label smoothing of the cross-entropy loss do for LLMs, and when should it (not) be used?
- 2026-03-08 Interview: How does the BPE merge strategy affect model performance, and what special considerations apply to Chinese tokenization?
- 2026-03-07 Interview: If you designed a 7B-parameter LLM architecture from scratch, how would you allocate layer count, hidden dimension, and head count?
- 2026-03-06 Interview: Is the Transformer's compute bottleneck in attention or the FFN, and how does it differ between training and inference?
- 2026-03-05 Interview: What does RMSNorm remove relative to LayerNorm, and why does dropping mean-centering actually work better?
- 2026-03-04 Interview: Where is SwiGLU better than ReLU/GELU, and why have nearly all modern LLMs switched to it?
- 2026-03-03 Interview: Why is router load balancing a hard problem in MoE architectures, and how does DeepSeek-V2 solve it?
- 2026-03-02 Interview: How does the KV cache work at inference time, what is its memory-footprint formula, and which factor matters most? (A memory-estimate sketch follows this list.)
- 2026-03-01 Interview: What do GQA and MQA sacrifice, and what do they gain, relative to standard MHA? Why did LLaMA-2 70B choose GQA?
- 2026-02-28 Interview: Flash Attention does not change the mathematical result, so why is it 2-4x faster? Where is the real bottleneck?
- 2026-02-27 Interview: Why did decoder-only architectures win at large-scale pre-training? Is encoder-decoder really a dead end?
- 2026-02-26 Interview: What does the FFN in a Transformer actually do? Some research holds that the FFN is the main carrier of stored knowledge: what is your view?
- 2026-02-25 Interview: What are RoPE's pros and cons versus absolute position encoding and ALiBi, and why does RoPE support length extrapolation?
- 2026-02-24 Interview: Why do modern LLMs use Pre-Norm rather than Post-Norm, and does Post-Norm have any advantages?
- 2026-02-23 Interview: Multi-Head Attention has exactly the same parameter count as Single-Head, so where does the multi-head advantage really come from?
- 2026-02-23 Interview: Why does Self-Attention need separate Q, K, and V matrices? Would a single shared matrix work?
- 2026-02-23 Interview: Why does the Transformer use scaled dot-product attention rather than additive attention, and what is the mathematical intuition behind the 1/√d_k scaling factor? (A minimal sketch follows this list.)
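The last three questions above all reduce to the same core computation. As a reference point, here is a minimal NumPy sketch of single-head scaled dot-product attention; the toy shapes and random inputs are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    The 1/sqrt(d_k) factor keeps the dot-product scores at roughly
    unit variance (for i.i.d. unit-variance inputs), so softmax does
    not saturate into near-one-hot outputs with vanishing gradients.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (seq_q, d_v)

# Toy example: 4 query and 4 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```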
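And for the KV cache question (2026-03-02), a back-of-the-envelope estimator for the memory formula it asks about. The formula itself (2 tensors x layers x KV heads x head dim x sequence length x batch x bytes per element) is standard; the LLaMA-2-70B-like numbers plugged in below are an assumption for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """KV cache memory: one K and one V tensor cached per layer.

    2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes.
    Sequence length and batch size scale it linearly, which is why long
    contexts dominate; GQA shrinks it by reducing num_kv_heads.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative LLaMA-2-70B-like config: 80 layers, 8 KV heads (GQA),
# head_dim 128, 4k context, batch 1, FP16 (2 bytes per element).
gb = kv_cache_bytes(80, 8, 128, 4096, 1) / 2**30
print(f"{gb:.2f} GiB")  # 1.25 GiB
```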
Topics
- 2026-04-10 The trade-off between data quality and data quantity
- 2026-04-09 Constitutional AI: AI constraining itself
- 2026-04-08 ORPO/SimPO/KTO: newer alignment algorithms
- 2026-04-07 DPO: Direct Preference Optimization
- 2026-04-06 The PPO algorithm as applied to LLMs
- 2026-04-05 Reward model training in detail
- 2026-04-04 RLHF overview: learning from human feedback
- 2026-04-03 Full fine-tuning vs. LoRA
- 2026-04-02 QLoRA: 4-bit quantized fine-tuning
- 2026-04-01 LoRA: the principle of low-rank adaptation
- 2026-03-31 A methodology for constructing instruction data
- 2026-03-31 SFT: supervised fine-tuning in detail
- 2026-03-28 The SwiGLU activation function
- 2026-03-28 RMSNorm: more efficient normalization
- 2026-03-28 The KV cache mechanism
- 2026-03-27 Flash Attention: principles and implementation
- 2026-03-26 GQA/MQA attention optimizations
- 2026-03-25 RoPE: rotary position embedding
- 2026-03-24 A deep dive into the LLaMA architecture
- 2026-03-23 Architectural evolution of the GPT series (GPT-1 to GPT-4)
- 2026-03-22 The Chinchilla law: compute-optimal training configurations
- 2026-03-21 Scaling laws: the science of model scale
- 2026-03-20 Tokenizer training: building your vocabulary
- 2026-03-19 Data deduplication and quality-filtering techniques
- 2026-03-18 Pre-training data cleaning and quality control
- 2026-03-17 Continual pre-training
- 2026-03-16 Masked language modeling (Masked LM)
- 2026-03-15 Autoregressive language modeling (Causal LM)
- 2026-03-14 An overview of pre-training
- 2026-03-13 Estimating model parameter counts and compute (FLOPs)
- 2026-03-12 Analyzing and calculating GPU memory usage
- 2026-03-11 Distributed training basics (DP/DDP)
- 2026-03-10 Mixed-precision training (FP16/BF16)
- 2026-03-09 Overfitting and regularization strategies
- 2026-03-08 Gradient descent and optimizers (Adam/AdamW)
- 2026-03-07 The relationship between batch size and learning rate
- 2026-03-06 Perplexity: measuring language model quality (a worked sketch follows this list)
- 2026-03-05 The cross-entropy loss as used in LLMs
- 2026-03-04 BPE/WordPiece/SentencePiece tokenization algorithms
- 2026-03-03 The softmax function and the temperature parameter (a sampling sketch follows this list)
- 2026-03-02 Layer Normalization and residual connections
- 2026-03-01 Encoders vs. decoders: differences and connections
- 2026-02-28 The Transformer architecture, end to end
- 2026-02-27 Positional encoding
- 2026-02-26 Multi-Head Attention
- 2026-02-25 Self-Attention in detail
- 2026-02-24 The essence of the attention mechanism
- 2026-02-23 Embeddings: from discrete tokens to continuous vectors
- 2026-02-23 Tokens and tokenization: how LLMs read text
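The perplexity topic (2026-03-06) is defined directly in terms of the cross-entropy topic next to it, so a minimal sketch of that relationship may help, assuming per-token log-probabilities are already in hand:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token).

    A model that assigns every token probability 1/k uniformly has
    perplexity exactly k, hence the "effective branching factor"
    reading of the metric.
    """
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Toy example: a 4-token sequence where each token received
# probability 0.25 comes out at perplexity 4.0.
print(perplexity([math.log(0.25)] * 4))  # 4.0
```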
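Similarly, for the softmax-and-temperature topic (2026-03-03), a small sketch of how the temperature parameter reshapes a sampling distribution; the logits below are made up purely for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """softmax(z / T): T < 1 sharpens the distribution (greedier),
    T > 1 flattens it (more diverse sampling)."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]              # made-up logits for three tokens
for T in (0.5, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T).round(3))
```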
Papers
- 2026-04-10 Paper: Model Soups: Averaging Weights of Multiple Fine-tuned Models
- 2026-04-09 Paper: NEFTune: Noisy Embeddings Improve Instruction Finetuning
- 2026-04-08 Paper: DoRA: Weight-Decomposed Low-Rank Adaptation
- 2026-04-07 Paper: QLoRA: Efficient Finetuning of Quantized LLMs
- 2026-04-06 Paper: LoRA: Low-Rank Adaptation of Large Language Models
- 2026-04-05 Paper: Scaling Data-Constrained Language Models
- 2026-04-04 Paper: Curriculum Learning for LLMs
- 2026-04-03 Paper: Deduplication and Data Quality
- 2026-04-02 Paper: Textbooks Are All You Need II: phi-1.5
- 2026-04-01 Paper: Code Llama: Open Foundation Models for Code
- 2026-03-31 Paper: DeepSeek-Coder: When the Large Language Model Meets Programming
- 2026-03-31 Paper: StarCoder: May the Source Be with You
- 2026-03-28 Paper: Rejection Sampling and Best-of-N in Alignment
- 2026-03-27 Paper: UltraFeedback: Boosting Language Models with High-quality Feedback
- 2026-03-26 Paper: Zephyr: Direct Distillation of LM Alignment
- 2026-03-25 Paper: Orca: Progressive Learning from Complex Explanation Traces
- 2026-03-24 Paper: WizardLM: Empowering LLMs to Follow Complex Instructions (Evol-Instruct)
- 2026-03-23 Paper: SPIN: Self-Play Fine-Tuning
- 2026-03-22 Paper: Proximal Policy Optimization Algorithms (PPO)
- 2026-03-21 Paper: KTO: Model Alignment as Prospect Theoretic Optimization
- 2026-03-20 Paper: ORPO: Monolithic Preference Optimization without Reference Model
- 2026-03-19 Paper: Direct Preference Optimization (DPO)
- 2026-03-18 Paper: Constitutional AI: Harmlessness from AI Feedback
- 2026-03-17 Paper: LIMA: Less Is More for Alignment
- 2026-03-16 Paper: Stanford Alpaca: An Instruction-following LLaMA Model
- 2026-03-15 Paper: Self-Instruct: Aligning Language Models with Self-Generated Instructions
- 2026-03-14 Paper: Training language models to follow instructions with human feedback
- 2026-03-13 Paper: Scaling Laws for Neural Language Models
- 2026-03-12 Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding
- 2026-03-11 Paper: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- 2026-03-10 Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- 2026-03-09 Paper: PaLM: Scaling Language Modeling with Pathways
- 2026-03-08 Paper: DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- 2026-03-07 Paper: Textbooks Are All You Need
- 2026-03-06 Paper: RWKV: Reinventing RNNs for the Transformer Era
- 2026-03-05 Paper: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- 2026-03-04 Paper: Mistral 7B
- 2026-03-03 Paper: LLaMA: Open and Efficient Foundation Language Models
- 2026-03-02 Paper: Training Compute-Optimal Large Language Models
- 2026-03-01 Paper: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- 2026-02-28 Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- 2026-02-27 Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- 2026-02-26 Paper: Language Models are Few-Shot Learners
- 2026-02-25 Paper: Language Models are Unsupervised Multitask Learners
- 2026-02-24 Paper: Improving Language Understanding by Generative Pre-Training
- 2026-02-23 Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- 2026-02-23 Paper: Attention Is All You Need