This repository includes 100+ LLM interview questions with answers.

Join the free 🚀 AIxFunda newsletter to get the latest updates and interesting tutorials on Generative AI, LLMs, Agents, and RAG.
- ✨ Weekly GenAI updates
- 📄 Weekly LLM, Agents and RAG paper updates
- 📝 1 fresh blog post on an interesting topic every week
- 📗 RAG Interview Questions and Answers Hub - 100+ RAG interview questions and answers.
- 🚀 Prompt Engineering Techniques Hub - 25+ prompt engineering techniques with LangChain implementations.
- 👨🏻‍💻 LLM Engineer Toolkit - Category-wise collection of 120+ LLM, RAG, and Agent-related libraries.
- 🩸 LLM, RAG and Agents Survey Papers Collection - Category-wise collection of 200+ survey papers.
Crack modern LLM and Generative AI interviews with this comprehensive, interview-focused guide designed for ML Engineers, AI Engineers, Data Scientists, and Software Engineers.
This book features 100+ carefully curated LLM interview questions, each paired with a clear answer and an in-depth explanation, so you truly understand the concepts interviewers care about. Get the book here.
Use the coupon code LLMQA25 for an exclusive 50% discount on the book (available for a limited time only).
| # | Question | Answer |
|---|---|---|
| Q1 | CNNs and RNNs don’t use positional embeddings. Why do transformers use positional embeddings? | Answer |
| Q2 | Tell me the basic steps involved in running an inference query on an LLM. | Answer |
| Q3 | Explain how KV Cache accelerates LLM inference. | Answer |
| Q4 | How does quantization affect inference speed and memory requirements? | Answer |
| Q5 | How do you handle the large memory requirements of KV cache in LLM inference? | Answer |
| Q6 | After tokenization, how are tokens converted into embeddings in the Transformer model? | Answer |
| Q7 | Explain why subword tokenization is preferred over word-level tokenization in the Transformer model. | Answer |
| Q8 | Explain the trade-offs in using a large vocabulary in LLMs. | Answer |
| Q9 | Explain how self-attention is computed in the Transformer model step by step. | Answer |
| Q10 | What is the computational complexity of self-attention in the Transformer model? | Answer |
| Q11 | How do Transformer models address the vanishing gradient problem? | Answer |
| Q12 | What is tokenization, and why is it necessary in LLMs? | Answer |
| Q13 | Explain the role of token embeddings in the Transformer model. | Answer |
| Q14 | Explain the working of the embedding layer in the Transformer model. | Answer |
| Q15 | What is the role of self-attention in the Transformer model, and why is it called “self-attention”? | Answer |
| Q16 | What is the purpose of the encoder in a Transformer model? | Answer |
| Q17 | What is the purpose of the decoder in a Transformer model? | Answer |
| Q18 | How does the encoder-decoder structure work at a high level in the Transformer model? | Answer |
| Q19 | What is the purpose of scaling in the self-attention mechanism in the Transformer model? | Answer |
| Q20 | Why does the Transformer model use multiple self-attention heads instead of a single self-attention head? | Answer |
| Q21 | How are the outputs of multiple heads combined and projected back in the multi-head attention in the Transformer model? | Answer |
| Q22 | How does masked self-attention differ from regular self-attention, and where is it used in a Transformer? | Answer |
| Q23 | Discuss the pros and cons of the self-attention mechanism in the Transformer model. | Answer |
| Q24 | What is the purpose of masked self-attention in the Transformer decoder? | Answer |
| Q25 | Explain how masking works in masked self-attention in Transformer. | Answer |
| Q26 | Explain why self-attention in the decoder is referred to as cross-attention. How does it differ from self-attention in the encoder? | Answer |
| Q27 | What is the softmax function, and where is it applied in Transformers? | Answer |
| Q28 | What is the purpose of residual (skip) connections in Transformer layers? | Answer |
| Q29 | Why is layer normalization used, and where is it applied in Transformers? | Answer |
| Q30 | What is cross-entropy loss, and how is it applied during Transformer training? | Answer |
| Q31 | Compare Transformers and RNNs in terms of handling long-range dependencies. | Answer |
| Q32 | What are the fundamental limitations of the Transformer model? | Answer |
| Q33 | How do Transformers address the limitations of CNNs and RNNs? | Answer |
| Q34 | How do Transformer models address the vanishing gradient problem? | Answer |
| Q35 | What is the purpose of the position-wise feed-forward sublayer? | Answer |
| Q36 | Can you briefly explain the difference between LLM training and inference? | Answer |
| Q37 | What is latency in LLM inference, and why is it important? | Answer |
| Q38 | What is batch inference, and how does it differ from single-query inference? | Answer |
| Q39 | How does batching generally help with LLM inference efficiency? | Answer |
| Q40 | Explain the trade-offs between batching and latency in LLM serving. | Answer |
| Q41 | How can techniques like mixture-of-experts (MoE) optimize inference efficiency? | Answer |
| Q42 | Explain the role of decoding strategy in LLM text generation. | Answer |
| Q43 | What are the different decoding strategies in LLMs? | Answer |
| Q44 | Explain the impact of the decoding strategy on LLM-generated output quality and latency. | Answer |
| Q45 | Explain the greedy search decoding strategy and its main drawback. | Answer |
| Q46 | How does Beam Search improve upon Greedy Search, and what is the role of the beam width parameter? | Answer |
| Q47 | When is a deterministic strategy (like Beam Search) preferable to a stochastic (sampling) strategy? Provide a specific use case. | Answer |
| Q48 | Discuss the primary trade-off between the computational cost and the output quality when comparing Greedy Search and Beam Search. | Answer |
| Q49 | When you set the temperature to 0.0, which decoding strategy are you using? | Answer |
| Q50 | How is Beam Search fundamentally different from a Breadth-First Search (BFS) or Depth-First Search (DFS)? | Answer |
| Q51 | Explain the criteria for choosing different decoding strategies. | Answer |
| Q52 | Compare deterministic and stochastic decoding methods in LLMs. | Answer |
| Q53 | What is the role of the context window during LLM inference? | Answer |
| Q54 | Explain the pros and cons of large and small context windows in LLM inference. | Answer |
| Q55 | What is the purpose of temperature in LLM inference, and how does it affect the output? | Answer |
| Q56 | What is autoregressive generation in the context of LLMs? | Answer |
| Q57 | Explain the strengths and limitations of autoregressive text generation in LLMs. | Answer |
| Q58 | Explain how diffusion language models (DLMs) differ from Large Language Models (LLMs). | Answer |
| Q59 | Do you prefer DLMs or LLMs for latency-sensitive applications? | Answer |
| Q60 | Explain the concept of token streaming during inference. | Answer |
| Q61 | What is speculative decoding, and when would you use it? | Answer |
| Q62 | What are the challenges in performing distributed inference across multiple GPUs? | Answer |
| Q63 | How would you design a scalable LLM inference system for real-time applications? | Answer |
| Q64 | Explain the role of Flash Attention in reducing memory bottlenecks. | Answer |
| Q65 | What is continuous batching, and how does it differ from static batching? | Answer |
| Q66 | What is mixed precision, and why is it used during inference? | Answer |
| Q67 | Differentiate between online and offline LLM inference deployment scenarios and discuss their respective requirements. | Answer |
| Q68 | Explain the throughput vs latency trade-off in LLM inference. | Answer |
| Q69 | What are the various bottlenecks in a typical LLM inference pipeline when running on a modern GPU? | Answer |
| Q70 | How do you measure LLM inference performance? | Answer |
| Q71 | What are the different LLM inference engines available? Which one do you prefer? | Answer |
| Q72 | What are the challenges in LLM inference? | Answer |
| Q73 | What are the possible options for accelerating LLM inference? | Answer |
| Q74 | What is Chain-of-Thought prompting, and when is it useful? | Answer |
| Q75 | Explain the reason behind the effectiveness of Chain-of-Thought (CoT) prompting. | Answer |
| Q76 | Explain the trade-offs in using CoT prompting. | Answer |
| Q77 | What is prompt engineering, and why is it important for LLMs? | Answer |
| Q78 | What is the difference between zero-shot and few-shot prompting? | Answer |
| Q79 | What are the different approaches for choosing examples for few-shot prompting? | Answer |
| Q80 | Why is context length important when designing prompts for LLMs? | Answer |
| Q81 | What is a system prompt, and how does it differ from a user prompt? | Answer |
| Q82 | What is In-Context Learning (ICL), and how is few-shot prompting related? | Answer |
| Q83 | What is self-consistency prompting, and how does it improve reasoning? | Answer |
| Q84 | Why is context important in prompt design? | Answer |
| Q85 | Describe a strategy for reducing hallucinations via prompt design. | Answer |
| Q86 | How would you structure a prompt to ensure the LLM output is in a specific format, like JSON? | Answer |
| Q87 | Explain the purpose of ReAct prompting in AI agents. | Answer |
| Q88 | What are the different phases in LLM development? | Answer |
| Q89 | What are the different types of LLM fine-tuning? | Answer |
| Q90 | What role does instruction tuning play in improving an LLM’s usability? | Answer |
| Q91 | What role does alignment tuning play in improving an LLM's usability? | Answer |
| Q92 | How do you prevent overfitting during fine-tuning? | Answer |
| Q93 | What is catastrophic forgetting, and why is it a concern in fine-tuning? | Answer |
| Q94 | What are the strengths and limitations of full fine-tuning? | Answer |
| Q95 | Explain how parameter-efficient fine-tuning addresses the limitations of full fine-tuning. | Answer |
| Q96 | When might prompt engineering be preferred over task-specific fine-tuning? | Answer |
| Q97 | When should you use fine-tuning vs RAG? | Answer |
| Q98 | What are the limitations of using RAG over fine-tuning? | Answer |
| Q99 | What are the limitations of fine-tuning compared to RAG? | Answer |
| Q100 | When should you prefer task-specific fine-tuning over prompt engineering? | Answer |
| Q101 | What is LoRA, and how does it work? | Answer |
| Q102 | Explain the key ingredient behind the effectiveness of the LoRA technique. | Answer |
| Q103 | What is QLoRA, and how does it differ from LoRA? | Answer |
| Q104 | When would you use QLoRA instead of standard LoRA? | Answer |
| Q105 | How would you handle LLM fine-tuning on consumer hardware with limited GPU memory? | Answer |
| Q106 | Explain different preference alignment methods and their trade-offs. | Answer |
| Q107 | What is gradient accumulation, and how does it help with fine-tuning large models? | Answer |
| Q108 | What are the possible options to speed up LLM fine-tuning? | Answer |
| Q109 | Explain the pretraining objective used in LLM pretraining. | Answer |
| Q110 | What is the difference between causal language modeling and masked language modeling? | Answer |
| Q111 | How do LLMs handle out-of-vocabulary (OOV) words? | Answer |
| Q112 | In the context of LLM pretraining, what are scaling laws? | Answer |
| Q113 | Explain the concept of Mixture-of-Experts (MoE) architecture and its role in LLM pretraining. | Answer |
| Q114 | What is model parallelism, and how is it used in LLM pre-training? | Answer |
| Q115 | What is the significance of self-supervised learning in LLM pretraining? | Answer |
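For quick revision, here is a minimal, framework-free sketch of scaled dot-product self-attention with a causal mask, touching on Q9, Q19, and Q24-Q25. It is an illustrative toy (NumPy only, random weights, made-up shapes), not the implementation discussed in the book's answers.

```python
# Minimal sketch: scaled dot-product self-attention with a causal mask.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)            # scaling keeps softmax well-behaved (Q19)
    mask = np.triu(np.ones_like(scores), k=1)     # 1s above the diagonal mark future tokens
    scores = np.where(mask == 1, -1e9, scores)    # masked self-attention (Q24-Q25)
    weights = softmax(scores)                     # each row sums to 1
    return weights @ V                            # weighted sum of value vectors (Q9)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(X, Wq, Wk, Wv)        # shape (4, 8)
```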
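In the same spirit, a toy sketch of how the decoding questions fit together: greedy search versus temperature sampling over a single step of logits (Q45, Q49, Q55). The vocabulary size and scores are invented for illustration.

```python
# Minimal sketch: greedy decoding vs. temperature sampling for one step.
import numpy as np

def softmax(x):
    x = x - x.max()                          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def next_token(logits, temperature, rng):
    if temperature <= 1e-6:                  # temperature 0.0 collapses to
        return int(np.argmax(logits))        # greedy search (Q45, Q49)
    probs = softmax(logits / temperature)    # higher temperature flattens the distribution (Q55)
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.3, -1.0])     # toy scores for a 4-token vocabulary
print(next_token(logits, temperature=0.0, rng=rng))   # always picks index 0
print(next_token(logits, temperature=1.2, rng=rng))   # stochastic, more diverse output
```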
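Finally, a sketch of the LoRA idea behind Q101-Q102: the pretrained weight stays frozen while a low-rank update B @ A is learned instead. The dimensions, rank, and scaling convention shown are illustrative assumptions, not values from the book.

```python
# Minimal sketch: LoRA's low-rank update on top of a frozen weight matrix.
import numpy as np

d_in, d_out, r, alpha = 512, 512, 8, 16       # illustrative sizes; r << d
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01         # trainable, initialised small
B = np.zeros((d_out, r))                      # trainable, zero-initialised so the
                                              # update starts as a no-op

def lora_forward(x):
    """Original path plus the scaled low-rank correction."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
y = lora_forward(x)                           # identical to W @ x before any training
# Trainable parameters: r * (d_in + d_out) = 8,192 vs. 262,144 for full fine-tuning.
```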
If you find this repository useful, please consider giving it a star.
