🚀 LLM Interview Questions and Answers Hub

This repository includes 100+ LLM interview questions with answers.

AIxFunda Newsletter

Stay Updated with Generative AI, LLMs, Agents and RAG.

Join the 🚀 AIxFunda free newsletter to get the latest updates and interesting tutorials on Generative AI, LLMs, Agents and RAG.

  • ✨ Weekly GenAI updates
  • 📄 Weekly LLM, Agents and RAG paper updates
  • 📝 1 fresh blog post on an interesting topic every week

Related Repositories

🚀 LLM Interview Questions and Answers Book

Crack modern LLM and Generative AI interviews with this comprehensive, interview-focused guide designed specifically for ML Engineers, AI Engineers, Data Scientists and Software Engineers.

This book features 100+ carefully curated LLM interview questions, each paired with a clear answer and an in-depth explanation so you truly understand the concepts interviewers care about. Get the book here.

Use the coupon code LLMQA25 for an exclusive 50% discount on the book (available for a limited time only).

LLM Interview Questions and Answers Book by Kalyan KS

| # | Question | Answer |
|---|----------|--------|
| Q1 | CNNs and RNNs don’t use positional embeddings. Why do transformers use positional embeddings? | Answer |
| Q2 | Tell me the basic steps involved in running an inference query on an LLM. | Answer |
| Q3 | Explain how KV Cache accelerates LLM inference. | Answer |
| Q4 | How does quantization affect inference speed and memory requirements? | Answer |
| Q5 | How do you handle the large memory requirements of KV cache in LLM inference? | Answer |
| Q6 | After tokenization, how are tokens converted into embeddings in the Transformer model? | Answer |
| Q7 | Explain why subword tokenization is preferred over word-level tokenization in the Transformer model. | Answer |
| Q8 | Explain the trade-offs in using a large vocabulary in LLMs. | Answer |
| Q9 | Explain how self-attention is computed in the Transformer model step by step. (A short sketch follows this table.) | Answer |
| Q10 | What is the computational complexity of self-attention in the Transformer model? | Answer |
| Q11 | How do Transformer models address the vanishing gradient problem? | Answer |
| Q12 | What is tokenization, and why is it necessary in LLMs? | Answer |
| Q13 | Explain the role of token embeddings in the Transformer model. | Answer |
| Q14 | Explain the working of the embedding layer in the Transformer model. | Answer |
| Q15 | What is the role of self-attention in the Transformer model, and why is it called “self-attention”? | Answer |
| Q16 | What is the purpose of the encoder in a Transformer model? | Answer |
| Q17 | What is the purpose of the decoder in a Transformer model? | Answer |
| Q18 | How does the encoder-decoder structure work at a high level in the Transformer model? | Answer |
| Q19 | What is the purpose of scaling in the self-attention mechanism in the Transformer model? | Answer |
| Q20 | Why does the Transformer model use multiple self-attention heads instead of a single self-attention head? | Answer |
| Q21 | How are the outputs of multiple heads combined and projected back in the multi-head attention in the Transformer model? | Answer |
| Q22 | How does masked self-attention differ from regular self-attention, and where is it used in a Transformer? | Answer |
| Q23 | Discuss the pros and cons of the self-attention mechanism in the Transformer model. | Answer |
| Q24 | What is the purpose of masked self-attention in the Transformer decoder? | Answer |
| Q25 | Explain how masking works in masked self-attention in Transformer. | Answer |
| Q26 | Explain why self-attention in the decoder is referred to as cross-attention. How does it differ from self-attention in the encoder? | Answer |
| Q27 | What is the softmax function, and where is it applied in Transformers? | Answer |
| Q28 | What is the purpose of residual (skip) connections in Transformer layers? | Answer |
| Q29 | Why is layer normalization used, and where is it applied in Transformers? | Answer |
| Q30 | What is cross-entropy loss, and how is it applied during Transformer training? | Answer |
| Q31 | Compare Transformers and RNNs in terms of handling long-range dependencies. | Answer |
| Q32 | What are the fundamental limitations of the Transformer model? | Answer |
| Q33 | How do Transformers address the limitations of CNNs and RNNs? | Answer |
| Q34 | How do Transformer models address the vanishing gradient problem? | Answer |
| Q35 | What is the purpose of the position-wise feed-forward sublayer? | Answer |
| Q36 | Can you briefly explain the difference between LLM training and inference? | Answer |
| Q37 | What is latency in LLM inference, and why is it important? | Answer |
| Q38 | What is batch inference, and how does it differ from single-query inference? | Answer |
| Q39 | How does batching generally help with LLM inference efficiency? | Answer |
| Q40 | Explain the trade-offs between batching and latency in LLM serving. | Answer |
| Q41 | How can techniques like mixture-of-experts (MoE) optimize inference efficiency? | Answer |
| Q42 | Explain the role of decoding strategy in LLM text generation. | Answer |
| Q43 | What are the different decoding strategies in LLMs? | Answer |
| Q44 | Explain the impact of the decoding strategy on LLM-generated output quality and latency. | Answer |
| Q45 | Explain the greedy search decoding strategy and its main drawback. | Answer |
| Q46 | How does Beam Search improve upon Greedy Search, and what is the role of the beam width parameter? | Answer |
| Q47 | When is a deterministic strategy (like Beam Search) preferable to a stochastic (sampling) strategy? Provide a specific use case. | Answer |
| Q48 | Discuss the primary trade-off between the computational cost and the output quality when comparing Greedy Search and Beam Search. | Answer |
| Q49 | When you set the temperature to 0.0, which decoding strategy are you using? | Answer |
| Q50 | How is Beam Search fundamentally different from a Breadth-First Search (BFS) or Depth-First Search (DFS)? | Answer |
| Q51 | Explain the criteria for choosing different decoding strategies. | Answer |
| Q52 | Compare deterministic and stochastic decoding methods in LLMs. | Answer |
| Q53 | What is the role of the context window during LLM inference? | Answer |
| Q54 | Explain the pros and cons of large and small context windows in LLM inference. | Answer |
| Q55 | What is the purpose of temperature in LLM inference, and how does it affect the output? (A short sketch follows this table.) | Answer |
| Q56 | What is autoregressive generation in the context of LLMs? | Answer |
| Q57 | Explain the strengths and limitations of autoregressive text generation in LLMs. | Answer |
| Q58 | Explain how diffusion language models (DLMs) differ from Large Language Models (LLMs). | Answer |
| Q59 | Do you prefer DLMs or LLMs for latency-sensitive applications? | Answer |
| Q60 | Explain the concept of token streaming during inference. | Answer |
| Q61 | What is speculative decoding, and when would you use it? | Answer |
| Q62 | What are the challenges in performing distributed inference across multiple GPUs? | Answer |
| Q63 | How would you design a scalable LLM inference system for real-time applications? | Answer |
| Q64 | Explain the role of Flash Attention in reducing memory bottlenecks. | Answer |
| Q65 | What is continuous batching, and how does it differ from static batching? | Answer |
| Q66 | What is mixed precision, and why is it used during inference? | Answer |
| Q67 | Differentiate between online and offline LLM inference deployment scenarios and discuss their respective requirements. | Answer |
| Q68 | Explain the throughput vs latency trade-off in LLM inference. | Answer |
| Q69 | What are the various bottlenecks in a typical LLM inference pipeline when running on a modern GPU? | Answer |
| Q70 | How do you measure LLM inference performance? | Answer |
| Q71 | What are the different LLM inference engines available? Which one do you prefer? | Answer |
| Q72 | What are the challenges in LLM inference? | Answer |
| Q73 | What are the possible options for accelerating LLM inference? | Answer |
| Q74 | What is Chain-of-Thought prompting, and when is it useful? | Answer |
| Q75 | Explain the reason behind the effectiveness of Chain-of-Thought (CoT) prompting. | Answer |
| Q76 | Explain the trade-offs in using CoT prompting. | Answer |
| Q77 | What is prompt engineering, and why is it important for LLMs? | Answer |
| Q78 | What is the difference between zero-shot and few-shot prompting? | Answer |
| Q79 | What are the different approaches for choosing examples for few-shot prompting? | Answer |
| Q80 | Why is context length important when designing prompts for LLMs? | Answer |
| Q81 | What is a system prompt, and how does it differ from a user prompt? | Answer |
| Q82 | What is In-Context Learning (ICL), and how is few-shot prompting related? | Answer |
| Q83 | What is self-consistency prompting, and how does it improve reasoning? | Answer |
| Q84 | Why is context important in prompt design? | Answer |
| Q85 | Describe a strategy for reducing hallucinations via prompt design. | Answer |
| Q86 | How would you structure a prompt to ensure the LLM output is in a specific format, like JSON? | Answer |
| Q87 | Explain the purpose of ReAct prompting in AI agents. | Answer |
| Q88 | What are the different phases in LLM development? | Answer |
| Q89 | What are the different types of LLM fine-tuning? | Answer |
| Q90 | What role does instruction tuning play in improving an LLM’s usability? | Answer |
| Q91 | What role does alignment tuning play in improving an LLM's usability? | Answer |
| Q92 | How do you prevent overfitting during fine-tuning? | Answer |
| Q93 | What is catastrophic forgetting, and why is it a concern in fine-tuning? | Answer |
| Q94 | What are the strengths and limitations of full fine-tuning? | Answer |
| Q95 | Explain how parameter-efficient fine-tuning addresses the limitations of full fine-tuning. | Answer |
| Q96 | When might prompt engineering be preferred over task-specific fine-tuning? | Answer |
| Q97 | When should you use fine-tuning vs RAG? | Answer |
| Q98 | What are the limitations of using RAG over fine-tuning? | Answer |
| Q99 | What are the limitations of fine-tuning compared to RAG? | Answer |
| Q100 | When should you prefer task-specific fine-tuning over prompt engineering? | Answer |
| Q101 | What is LoRA, and how does it work? | Answer |
| Q102 | Explain the key ingredient behind the effectiveness of the LoRA technique. | Answer |
| Q103 | What is QLoRA, and how does it differ from LoRA? | Answer |
| Q104 | When would you use QLoRA instead of standard LoRA? | Answer |
| Q105 | How would you handle LLM fine-tuning on consumer hardware with limited GPU memory? | Answer |
| Q106 | Explain different preference alignment methods and their trade-offs. | Answer |
| Q107 | What is gradient accumulation, and how does it help with fine-tuning large models? | Answer |
| Q108 | What are the possible options to speed up LLM fine-tuning? | Answer |
| Q109 | Explain the pretraining objective used in LLM pretraining. | Answer |
| Q110 | What is the difference between causal language modeling and masked language modeling? | Answer |
| Q111 | How do LLMs handle out-of-vocabulary (OOV) words? | Answer |
| Q112 | In the context of LLM pretraining, what is scaling law? | Answer |
| Q113 | Explain the concept of Mixture-of-Experts (MoE) architecture and its role in LLM pretraining. | Answer |
| Q114 | What is model parallelism, and how is it used in LLM pre-training? | Answer |
| Q115 | What is the significance of self-supervised learning in LLM pretraining? | Answer |
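
For a few of the topics above, a small runnable illustration can help while you prepare. Below is a minimal NumPy sketch of single-head scaled dot-product self-attention (Q9, with the scaling of Q19 and the quadratic cost of Q10). The matrix sizes, random inputs, and function names are illustrative assumptions, not the implementation of any particular library or model.

```python
# Minimal sketch of single-head scaled dot-product self-attention (Q9, Q10, Q19).
# Shapes and values are illustrative assumptions only.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token embeddings; W_q, W_k, W_v: learned projections."""
    Q = X @ W_q                       # queries
    K = X @ W_k                       # keys
    V = X @ W_v                       # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len) scaled dot products
    weights = softmax(scores)         # attention weights over all tokens
    return weights @ V                # weighted sum of value vectors

# Toy example: 4 tokens, d_model = d_k = d_v = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one contextualized vector per token
```

The (seq_len, seq_len) score matrix is also where the quadratic complexity asked about in Q10 comes from.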
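
Likewise, here is a minimal sketch of how temperature shapes next-token sampling (Q45, Q49, Q55), assuming made-up logits over a tiny vocabulary: dividing the logits by a temperature below 1 sharpens the distribution, above 1 flattens it, and a temperature of 0 falls back to greedy (argmax) decoding.

```python
# Minimal sketch of temperature in next-token sampling (Q45, Q49, Q55).
# The logits are made-up numbers, not real model outputs.
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature == 0.0:
        return int(np.argmax(logits))          # greedy: always pick the top logit
    scaled = logits / temperature              # <1 sharpens, >1 flattens the distribution
    scaled = scaled - scaled.max()             # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = [2.0, 1.5, 0.3, -1.0]  # hypothetical scores over a 4-token vocabulary
print(sample_next_token(logits, temperature=0.0))  # deterministic: index 0
print(sample_next_token(logits, temperature=0.7))  # usually index 0 or 1
print(sample_next_token(logits, temperature=1.5))  # more diverse choices
```

These sketches are only study aids; the linked answers cover the concepts in full.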

⭐️ Star History

Star History Chart

Please consider giving this repository a star if you find it useful.
