1. General & Average Questions (DS & LLMs)

These are common "litmus test" questions for interviews or project sanity checks.

Data Science Fundamentals

- Bias-Variance Tradeoff: Why does a model that performs perfectly on training data often fail in production? (Answer: Overfitting/high variance. The model has memorized noise in the training set rather than patterns that generalize.)
- Feature Engineering vs. Representation Learning: How does the way we "feed" data to a Random Forest differ from how we feed it to a Transformer?
- Evaluation Metrics: When would you prefer F1-Score over Accuracy? (Answer: Imbalanced datasets, where a model can reach high accuracy by always predicting the majority class.)

LLM Basics

- Pre-training vs. Fine-tuning: What is the difference between teaching a model "how to speak" (Pre-training) and "how to be a medical assistant" (Fine-tuning)?
- Hallucinations: Why do LLMs confidently state facts that are wrong? (Answer: They are probabilistic token predictors, not database query engines.)
- Tokenization: Why do we use sub-word tokenization (like BPE) instead of just word-level tokenization?

2. In-Depth Project Review: The Checklist

When reviewing a project, don't just look at the code. Scrutinize the design decisions.

- Problem Framing: Did you actually need an LLM? Could a simple RegEx or Logistic Regression have solved 80% of the problem cheaper?
- Data Quality & Cleaning: How did you handle "garbage" in your training/prompting data? Did you use deduplication or toxicity filters?
- Architecture Choice: Why BERT (Encoder) vs. GPT (Decoder) vs. T5 (Encoder-Decoder)?
  - BERT: Better for understanding/classification.
  - GPT: Better for generation/creative writing.
  - T5: Suited to sequence-to-sequence tasks such as translation and summarization.
- The "So What?" (Metrics): How did you measure success? Did you use an intrinsic metric like Perplexity, reference-based overlap metrics like ROUGE/BLEU, or an LLM-as-judge method like G-Eval? Did you validate any of them against human judgment?

3. Transformer Architecture: In-Depth

The Transformer moved us away from processing words one by one (RNNs) to processing them all at once (Parallelization).

The Core Components

Input Embedding & Positional Encoding:
Since Transformers process all tokens simultaneously, self-attention on its own has no notion of word order.
We add a positional "signal" to the embeddings (sine/cosine waves in the original Transformer) so the model knows "The cat sat on the mat" is different from "The mat sat on the cat."

Self-Attention Mechanism (The "Secret Sauce"):
This allows every word in a sentence to "look at" every other word to find context.

- Query (Q): What am I looking for?
- Key (K): What do I contain?
- Value (V): What information do I provide?

The attention output is calculated using the formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products from growing so large that the softmax saturates.

Multi-Head Attention:
Instead of one attention "view," the model has multiple (heads). One head might focus on grammar, another on the relationship between names, and another on the emotional tone.

Feed-Forward Networks (FFN):
After attention gathers context, the FFN processes each token's information independently to refine the representation.

Residual Connections & Layer Norm:
Residual connections act like "highways" that let the original signal pass through without getting lost, which helps prevent vanishing gradients. Layer normalization rescales the activations at each layer, keeping the training stable.
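
To make the positional signal and the attention formula concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how any production library implements these steps: the function names, the toy 4-dimensional embeddings, and the shortcut of using the same matrix for Q, K, and V (a real layer applies learned projections first) are all hypothetical choices made for readability.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal position signal: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Paired dimensions (2k, 2k+1) share one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V, one query row at a time."""
    d_k = len(K[0])
    output, all_weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        row = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        output.append(row)
        all_weights.append(weights)
    return output, all_weights


# Toy input (assumption): 3 tokens with made-up 4-dimensional embeddings.
# Tokens 0 and 2 are deliberately identical, like a repeated word.
embeddings = [[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0]]
pe = positional_encoding(3, 4)
x = [[e + p for e, p in zip(emb_row, pe_row)]
     for emb_row, pe_row in zip(embeddings, pe)]

# Simplification (assumption): Q = K = V = x, with no learned projections.
out, weights = attention(x, x, x)
```

Tokens 0 and 2 start with the same embedding, but after the positional signal is added their rows in `x` differ, which is what lets attention tell a repeated word's two positions apart. Each row of `weights` sums to 1, so every row of `out` is a weighted average of the value vectors.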