Statistics

UPSC Statistics 2021

All 16 questions from the 2021 Civil Services Mains Statistics paper across 2 papers — 800 marks in total. Each question comes with a detailed evaluation rubric, directive word analysis, and model answer points.

16 Questions
800 Total marks
2 Papers
2021 Exam year

Paper I

8 questions · 400 marks
Q1
50M Compulsory solve Probability theory and statistical inference

(a) A production unit manufacturing surgical masks is concerned about the quality of their masks. A random sample of n masks are inspected to estimate 'p', the probability of manufacturing a defective mask. How large a sample is required so that the estimate of p lies in the range p ± 0.1 with probability 0.95 ? (10 marks) (b) An insurance company studies a sample of 150 policy-holders. There are three categories of policies : auto, home and medical. The following results are obtained about the policies held by the policy-holders : (i) 30 have only home insurance (ii) 10 have only medical insurance (iii) 98 have auto insurance, but not all three types of insurance (iv) 27 have medical insurance, but not all three types of insurance (v) 13 have auto and medical insurance Given that a policy-holder has medical insurance, calculate the probability that he has home insurance. (10 marks) (c) Let X and Y be independent and identically distributed exponential random variables with mean λ > 0. Define $$Z = \begin{cases} 1, & \text{if} \quad X < Y \\ 0, & \text{if} \quad X \geq Y \end{cases}$$ Find E[X|Z = 1] + E[X|Z = 0]. (10 marks) (d) Let X₁, X₂, ..., Xₙ be a random sample from $$f(x, \theta) = \frac{\log(\theta)}{\theta - 1}\theta^x; \quad 0 < x < 1, \quad \theta > 1$$ Is there a function of θ, say g(θ), for which there exists an unbiased estimator whose variance attains the C-R lower bound ? If yes, find it. If not, show why not. (10 marks) (e) Let f(x, θ) be the Cauchy pdf $$f(x, \theta) = \frac{\theta}{\pi} \frac{1}{\theta^2 + x^2}; -\infty < x < \infty, \theta > 0$$ (i) Show that this family does not have Monotone Likelihood Ratio (MLR). (ii) If X is one observation from f(x, θ), show that |X| is sufficient for θ and hence the distribution of |X| does have an MLR. (5+5 marks)

Answer approach & key points

Solve each sub-part systematically with clear mathematical derivations. For (a), apply normal approximation to binomial for sample size determination; for (b), use set theory and conditional probability with Venn diagram analysis; for (c), exploit memoryless property of exponential distribution and symmetry arguments; for (d), verify regularity conditions and apply Cramér-Rao inequality; for (e), construct likelihood ratio and apply factorization theorem. Allocate approximately 15% time to (a), 15% to (b), 20% to (c), 25% to (d), and 25% to (e) given their analytical complexity.

  • (a) Sample size formula n = z²₀.₀₂₅ × p(1-p)/d² with d = 0.1, z₀.₀₂₅ = 1.96 and the conservative choice p = 0.5 (p is unknown), giving n ≥ 96.04, i.e. n = 97
  • (b) Complete Venn diagram construction, assuming every policy-holder holds at least one policy: all three = 4, home∩medical only = 8, auto∩medical only = 9, so 31 hold medical insurance and the conditional probability is (8 + 4)/31 = 12/31
  • (c) E[X|Z=1] = E[X|X<Y] = λ/2 by memoryless property and E[X|Z=0] = λ + λ/2 = 3λ/2, sum = 2λ
  • (d) Verification of regularity conditions, Fisher information per observation I(θ) = [(θ-1)² - θ(log θ)²]/[θ²(θ-1)²(log θ)²], and proof that the C-R bound is attained only for linear functions of g(θ) = E(X) = θ/(θ-1) - 1/log θ, with X̄ as the estimator
  • (e)(i) Counterexample showing f(x, θ₂)/f(x, θ₁) is not monotone in x for θ₂ > θ₁: the ratio equals θ₁/θ₂ < 1 at x = 0 and tends to θ₂/θ₁ > 1 as x → ±∞, so it falls on x < 0 and rises on x > 0
  • (e)(ii) Factorization theorem application showing |X| sufficient, and proof that g(|X|;θ) = 2θ/[π(θ²+x²)] for x > 0 has MLR in |X|
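The arithmetic behind parts (a) and (b) can be checked with a short script. This is a minimal sketch, not a model answer: the variable names are illustrative, and part (b) assumes that each of the 150 policy-holders holds at least one of the three policies.

```python
import math

# (a) Conservative sample size: n >= z^2 * p(1-p) / d^2 with p = 0.5
z, d, p = 1.96, 0.1, 0.5          # z is the 97.5% standard normal quantile
n_required = math.ceil(z**2 * p * (1 - p) / d**2)

# (b) Venn-diagram counts (assumption: everyone holds at least one policy)
total, only_home, only_med = 150, 30, 10
auto_not_all, med_not_all, auto_and_med = 98, 27, 13

# total       = only_home + only_med + auto_not_all + (home&med only) + all_three
# med_not_all = only_med + (auto_and_med - all_three) + (home&med only)
hm_plus_all = total - only_home - only_med - auto_not_all   # (home&med only) + all_three
hm_minus_all = med_not_all - only_med - auto_and_med        # (home&med only) - all_three
all_three = (hm_plus_all - hm_minus_all) // 2
home_med_only = hm_plus_all - all_three
medical = med_not_all + all_three
p_home_given_med = (home_med_only + all_three) / medical

print(n_required, all_three, home_med_only, medical, round(p_home_given_med, 4))
```

Solving the two linear relations for the unknown cell counts avoids guessing the size of the triple intersection.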
Q2
50M derive Limit theorems and characteristic functions

(a) Let Y₁, Y₂, Y₃, ... be independent and identical Poisson random variables with parameter 1. Use central limit theorem to establish $$n! \simeq \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$$ for large value of positive integer n. (20 marks) (b) Let X₁, X₂, ..., Xₙ be a random sample such that log Xᵢ ~ N(θ, θ) distribution with θ > 0 unknown. Show that one of the solutions of the likelihood equation is the unique MLE of θ. Obtain asymptotic distribution of MLE of θ. (15 marks) (c) (i) State the sufficient conditions for a function φ(t) to be a characteristic function. (ii) Investigate if the following functions are characteristic functions : 1. e⁻ᵗ⁴ 2. [1 + |t|]⁻¹ Justify your answer. (5+10 marks)

Answer approach & key points

Derive the Stirling approximation in part (a) by applying CLT to Poisson sums and carefully manipulating the resulting normal approximation. For part (b), derive the likelihood equation, verify the MLE solution, and obtain its asymptotic normality via Fisher information. In part (c), state Bochner's theorem precisely, then investigate the two functions using properties of positive definiteness and Polya's criteria. Allocate approximately 40% time to (a), 30% to (b), and 30% to (c), ensuring rigorous justification at each step.

  • Part (a): Define Sₙ = Y₁ + ... + Yₙ ~ Poisson(n), apply CLT to (Sₙ - n)/√n → N(0,1), and use P(Sₙ = n) with Stirling's manipulation
  • Part (a): Equate Poisson pmf at n to normal density approximation and solve for n! to obtain √(2πn)(n/e)ⁿ
  • Part (b): Construct log-likelihood l(θ) = -n/2 log(2πθ) - 1/(2θ)Σ(log Xᵢ - θ)², derive score function and likelihood equation
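  • Part (b): The likelihood equation reduces to θ² + θ - (1/n)Σ(log Xᵢ)² = 0, whose only positive root θ̂ = [-1 + √(1 + 4Σ(log Xᵢ)²/n)]/2 is the MLE; confirm with the second-order condition (negative second derivative at θ̂), then apply standard asymptotic theory: √n(θ̂ - θ) → N(0, I(θ)⁻¹) with I(θ) = (2θ+1)/(2θ²)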
  • Part (c)(i): State Bochner's theorem: φ(0)=1, continuous at 0, positive definite (non-negative definite matrices from φ(tᵢ-tⱼ))
  • Part (c)(ii): Show e⁻ᵗ⁴ fails: φ″(0) = 0 would give E[X²] = 0, so X = 0 almost surely and φ ≡ 1, a contradiction (alternatively, exhibit a failure of positive definiteness)
  • Part (c)(ii): Verify [1+|t|]⁻¹ satisfies Polya's criteria (convex on t>0, φ(0)=1, even, continuous, φ(∞)=0) hence is characteristic function
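The Stirling argument in part (a) can be sanity-checked numerically. The sketch below is only an illustration with hypothetical helper names; it compares n! with √(2πn)(n/e)ⁿ and the Poisson(n) probability at n with the normal density value 1/√(2πn) suggested by the CLT.

```python
import math

def stirling(n: int) -> float:
    """Stirling's approximation sqrt(2*pi*n) * (n/e)^n."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

def poisson_pmf_at_mean(n: int) -> float:
    """P(S_n = n) for S_n ~ Poisson(n), computed on the log scale."""
    return math.exp(-n + n * math.log(n) - math.lgamma(n + 1))

for n in (5, 20, 100):
    ratio = math.factorial(n) / stirling(n)           # should approach 1
    clt_density = 1 / math.sqrt(2 * math.pi * n)      # N(n, n) density at its mean
    print(n, round(ratio, 4), round(poisson_pmf_at_mean(n) / clt_density, 4))
```

Both ratios drift towards 1 as n grows, which is the content of the approximation; the script is a check, not a proof.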
Q3
50M construct Sequential probability ratio test and convergence

(a) Let X and Y be two independent random variables following exponential distribution with mean $\frac{1}{\lambda}$ and $\frac{1}{\mu}$ respectively, $\lambda > 0$, $\mu > 0$. Suppose that $(X_1, X_2, ..., X_n)$ and $(Y_1, Y_2, ..., Y_n)$ are sequences of observations on X and Y respectively. A random variable $U_i$ is defined as $$U_i = \begin{cases} 1, & \text{if} \quad X_i \geq Y_i, \quad i = 1, 2, ..., n \\ 0, & \text{otherwise} \end{cases}$$ Construct Wald's SPRT procedure based on $U_i$'s for testing H : $\lambda = \mu$ versus K : $\lambda = 2\mu$ with strength $(\alpha, \beta)$. (20 marks) (b) Let $Y_i$, $i \geq 1$ be independent and identical $U(-1, 1)$ random variables. Determine if the following sequences converge in probability : (i) $\left\{\frac{Y_i}{i}\right\}$ (ii) $\left\{(Y_i)^i\right\}$ (5+10 marks) (c) Let X₁, X₂, ..., Xₙ be a random sample from uniform distribution U(− θ, θ), θ > 0. Find the complete sufficient statistic for θ. Hence, obtain the best unbiased estimator of θ. (15 marks)

Answer approach & key points

Construct the Wald's SPRT procedure for part (a) by deriving the likelihood ratio for Bernoulli outcomes, then determine convergence properties for sequences in part (b) using appropriate limit theorems, and finally derive the complete sufficient statistic and MVUE for part (c). Allocate approximately 40% time to part (a) given its 20 marks, 30% to part (c) for its 15 marks, and 30% to part (b) for its 10 marks. Structure with clear headings for each sub-part, showing derivations step-by-step and concluding with explicit final answers.

  • For (a): Derive P(X_i ≥ Y_i) = μ/(λ+μ) under H and K, showing U_i ~ Bernoulli with p = 1/2 under H and p = 1/3 under K
  • For (a): Construct Wald's SPRT with likelihood ratio Λ_n = (2/3)^T_n × (4/3)^(n-T_n) where T_n = ΣU_i, and specify continuation region with bounds A ≈ (1-β)/α and B ≈ β/(1-α)
  • For (b)(i): Show Y_i/i → 0 in probability using Chebyshev's inequality or direct calculation of P(|Y_i/i| > ε)
  • For (b)(ii): Analyze (Y_i)^i convergence by considering cases Y_i ∈ (-1,1), showing convergence to 0 in probability
  • For (c): Identify T = max(|X_(1)|, |X_(n)|) or equivalently max(-X_(1), X_(n)) as complete sufficient statistic using factorization theorem and completeness of uniform family
  • For (c): Derive E[T] = nθ/(n+1) and construct unbiased estimator θ̂ = (n+1)T/n, verifying it is the UMVUE via Lehmann-Scheffé theorem
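The SPRT in part (a) can be sketched as a stopping rule on the cumulative log-likelihood ratio. The function name and the sample inputs are hypothetical; the boundaries use Wald's approximations A ≈ (1-β)/α and B ≈ β/(1-α) quoted in the key points.

```python
import math

def sprt_bernoulli(us, alpha, beta, p0=0.5, p1=1 / 3):
    """Wald SPRT for H: p = p0 against K: p = p1 using 0/1 observations U_i."""
    log_a = math.log((1 - beta) / alpha)      # upper boundary: reject H
    log_b = math.log(beta / (1 - alpha))      # lower boundary: accept H
    llr = 0.0
    for n, u in enumerate(us, start=1):
        # U_i = 1 contributes log(p1/p0), U_i = 0 contributes log((1-p1)/(1-p0))
        llr += math.log(p1 / p0) if u == 1 else math.log((1 - p1) / (1 - p0))
        if llr >= log_a:
            return "reject H", n
        if llr <= log_b:
            return "accept H", n
    return "continue sampling", len(us)

print(sprt_bernoulli([0] * 20, alpha=0.05, beta=0.05))   # a run of zeros favours K
print(sprt_bernoulli([1] * 20, alpha=0.05, beta=0.05))   # a run of ones favours H
```

With p0 = 1/2 and p1 = 1/3 each zero adds log(4/3) and each one adds log(2/3), matching Λ_n = (2/3)^T_n × (4/3)^(n-T_n).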
Q4
50M prove UMVUE, joint distributions and non-parametric tests

(a) Let X₁, X₂, ..., Xₙ be a random sample from Poisson distribution with mean λ > 0. Define a statistic W = (1 − 1/n)^T, T = Σᵢ₌₁ⁿ Xᵢ (i) Show that T is complete sufficient statistic. (ii) Show that W is unbiased for e^(−λ). (iii) Show that even though W is UMVUE, it does not attain the CRLB for g(λ) = e^(−λ). (20 marks) (b) Let $$f(x, y) = \frac{e^{\frac{-yx^2}{2}} y^{3/2} e^{-y}}{\sqrt{2\pi}}, \quad -\infty < x < \infty, \quad y > 0.$$ (i) Obtain the marginal distribution of Y and conditional distribution of X given Y. (ii) Find E(Y), V(Y), E(X|Y), V(X|Y). (iii) Use (ii) to find E(X), V(X). (5+5+5 marks) (c) A company's trainees are randomly assigned to groups which are taught a certain industrial inspection procedure by three different methods. At the end of the instructing period they are tested for inspection performance quality. The following are their scores : Method A : 80 83 79 85 90 68 Method B : 82 84 60 72 86 67 91 Method C : 93 65 77 78 88 Using the appropriate non-parametric test, determine at 0·05 level of significance whether the three methods are equally effective. (15 marks)

Answer approach & key points

Prove all required results systematically, spending approximately 40% of time on part (a) given its 20 marks, 30% on part (b) for 15 marks, and 30% on part (c) for 15 marks. Structure as: (a) establish completeness via exponential family, sufficiency via factorization, unbiasedness via expectation calculation, and CRLB non-attainment via variance comparison; (b) integrate to obtain marginal Gamma distribution, derive conditional Normal, then apply law of total expectation/variance; (c) state Kruskal-Wallis test assumptions, compute ranks, calculate H-statistic, and compare with χ² critical value.

  • Part (a)(i): Apply factorization theorem to show T is sufficient; use completeness property of Poisson exponential family with natural parameter space containing an open set
  • Part (a)(ii): Calculate E[W] = E[(1-1/n)^T] using the Poisson probability generating function E[s^T] = exp{nλ(s-1)} at s = 1 - 1/n to verify unbiasedness for e^(-λ)
  • Part (a)(iii): Compute Var(W), derive CRLB for g(λ)=e^(-λ), and explicitly show strict inequality Var(W) > CRLB
  • Part (b): Integrate out x to get f(y) = y e^(-y), so Y ~ Gamma(2, 1) with E(Y) = V(Y) = 2; X|Y ~ N(0, 1/Y); apply E(X)=E[E(X|Y)] = 0 and V(X)=E[V(X|Y)]+V[E(X|Y)] = E(1/Y) = 1
  • Part (c): Apply Kruskal-Wallis H-test: pool and rank all 18 observations, compute rank sums per method, calculate H = [12/N(N+1)]Σ(Ri²/ni) - 3(N+1), compare with χ²₂,₀.₀₅ = 5.991
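The Kruskal-Wallis computation in part (c) can be reproduced with a few lines of plain Python. This is a sketch under the stated formula only: the function name is hypothetical and no tie correction is applied, which is harmless here because the 18 scores are all distinct.

```python
def kruskal_wallis_h(groups):
    """Return (rank sums, H) using midranks and no tie correction."""
    pooled = sorted(x for g in groups for x in g)
    rank = {}
    for v in set(pooled):
        first = pooled.index(v) + 1            # first 1-based position of v
        count = pooled.count(v)
        rank[v] = first + (count - 1) / 2      # midrank if v is repeated
    n_total = len(pooled)
    rank_sums = [sum(rank[x] for x in g) for g in groups]
    weighted = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return rank_sums, h

method_a = [80, 83, 79, 85, 90, 68]
method_b = [82, 84, 60, 72, 86, 67, 91]
method_c = [93, 65, 77, 78, 88]
rank_sums, h_stat = kruskal_wallis_h([method_a, method_b, method_c])
print(rank_sums, round(h_stat, 3))   # compare h_stat with the chi-square value 5.991
```

The rank sums are 61, 62 and 48 and H is about 0.197, far below 5.991, so the hypothesis of equally effective methods is not rejected at the 0.05 level.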
Q5
50M Compulsory derive Linear regression, experimental designs, sampling theory

(a) For a simple linear regression model Yᵢ = β₀ + β₁Xᵢ + εᵢ, i = 1, ..., n (i) Derive the least square estimators of β₀ and β₁, clearly stating the conditions assumed. (ii) For eᵢ = Yᵢ - Ŷᵢ where Ŷᵢ is the fitted value, show that 1. Σᵢ₌₁ⁿ eᵢ = 0 2. Σᵢ₌₁ⁿ Yᵢ = Σᵢ₌₁ⁿ Ŷᵢ 3. Σᵢ₌₁ⁿ Xᵢeᵢ = 0 4. Σᵢ₌₁ⁿ Ŷᵢeᵢ = 0 5. The regression line passes through (X̄, Ȳ). (5+5 marks) (b) In usual notations, if v, b, r, k and λ are the parameters of a Balanced Incomplete Block Design, then show that : (i) b ≥ r + 1 ≥ λ + 2 (ii) v ≤ b ≤ (r² - 1)/λ (10 marks) (c) For the multiple linear regression model with two predictor variables X₁ and X₂, show that the estimate of regression coefficient of X₁ is unchanged when X₂ is added to the regression model, whenever X₁ and X₂ are uncorrelated. (10 marks) (d) A sample of size n is drawn from a population having N units by simple random sampling without replacement. A sub-sample of n₁ units is drawn from the n units by simple random sampling without replacement. Let ȳ₁ denote the mean based on n₁ units and ȳ₂, the mean based on n₂ = n - n₁ units. Consider the estimator of the population mean Ȳₙ given by : Ŷₙ = wȳ₁ + (1-w)ȳ₂ ; 0 < w < 1 Show that E(Ŷₙ) = Ȳₙ, and obtain its variance. (10 marks) (e) How is the efficiency of a design measured ? Derive the expression to measure the efficiency of a Randomised Block Design over a Completely Randomised Design. (10 marks)

Answer approach & key points

Derive requires rigorous step-by-step mathematical proofs with clear logical progression. Allocate time proportionally: ~20% for (a)(i)-(ii) on SLR properties, ~20% for (b) on BIBD inequalities, ~20% for (c) on multiple regression orthogonality, ~20% for (d) on two-phase sampling variance, and ~20% for (e) on design efficiency. Begin each sub-part by stating assumptions, proceed with systematic derivation, and conclude with the required result clearly boxed.

  • (a)(i) Correct setup of normal equations minimizing Σ(Yᵢ - β₀ - β₁Xᵢ)²; explicit statement of Gauss-Markov conditions (E(εᵢ)=0, Var(εᵢ)=σ², Cov(εᵢ,εⱼ)=0)
  • (a)(ii) All five residual properties proved using normal equations: Σeᵢ=0 from first normal equation; ΣXᵢeᵢ=0 from second; ΣŶᵢeᵢ=0 via substitution; (X̄,Ȳ) on regression line verified
  • (b) BIBD parameter relationships: bk=vr, r(k-1)=λ(v-1) used to prove b≥v (Fisher's inequality) and hence b≥r+1≥λ+2; second inequality using r(k-1)=λ(v-1) and k≤v
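  • (c) Multiple regression, writing Y as variable 0: β̂₁ = (S₁₀S₂₂ - S₂₀S₁₂)/(S₁₁S₂₂ - S₁₂²) or equivalent; when S₁₂=0, β̂₁ reduces to S₁₀/S₁₁ = simple regression coefficient
  • (d) Two-phase sampling: E(ȳ₁)=Ȳ, E(ȳ₂)=Ȳ shown, since each sub-sample is itself a simple random sample from the population; E(Ŷₙ)=wȲ+(1-w)Ȳ=Ȳ; variance derived using Var(ȳⱼ)=(1/nⱼ - 1/N)S² and Cov(ȳ₁,ȳ₂) = -S²/N (the two means are not independent), giving V(Ŷₙ) = S²[w²/n₁ + (1-w)²/n₂ - 1/N]
  • (e) Efficiency defined as ratio of error variances (or precision); E = Var(CRD)/Var(RBD), with the CRD error variance estimated from the RBD analysis as [(r-1)MSB + r(t-1)MSE]/(rt-1) for t treatments in r blocks, so E = [(r-1)MSB + r(t-1)MSE]/[(rt-1)MSE]; derivation using expected mean squares for both designs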
  • Proper mathematical notation throughout: summation limits, subscripts, expectation and variance operators clearly distinguished
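The residual identities in part (a)(ii) follow from the normal equations, and a quick numerical check can make them concrete. The data below are hypothetical and chosen only for illustration; the closed-form estimators are the usual b₁ = Sxy/Sxx and b₀ = Ȳ - b₁X̄.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]       # hypothetical responses

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = sxy / sxx                        # slope from the normal equations
b0 = y_bar - b1 * x_bar               # intercept: line passes through (x_bar, y_bar)

fitted = [b0 + b1 * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

checks = {
    "sum e_i": sum(resid),
    "sum Y_i - sum Yhat_i": sum(ys) - sum(fitted),
    "sum X_i e_i": sum(x * e for x, e in zip(xs, resid)),
    "sum Yhat_i e_i": sum(f * e for f, e in zip(fitted, resid)),
    "line at x_bar minus y_bar": b0 + b1 * x_bar - y_bar,
}
for name, value in checks.items():
    print(f"{name}: {value:.2e}")     # each should be zero up to rounding error
```

A numerical check like this does not replace the algebraic proofs, but it is a useful guard against sign errors in the derivation.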
Q6
50M derive Multivariate analysis, correlation, cluster sampling, multivariate normal distribution

(a) For a multiple linear regression model with three covariates X₁, X₂ and X₃, let rᵢⱼ denote the correlation coefficient between Xᵢ and Xⱼ. For a data, it was found r₁₂ = 0·77, r₂₃ = 0·52, r₁₃ = 0·72. (i) Check the consistency of the above data. (ii) If r₁₃ is unknown, obtain the limits within which r₁₃ lies given the above values for r₁₂ and r₂₃. (20 marks) (b) In cluster sampling with equal size clusters, obtain the unbiased estimate of population mean. Also obtain its sampling variance as V(ȳ̄) = (1-f)(NM-1)S²{1+(M-1)ρcl}/[M²(N-1)n], where notations have their usual meanings. (15 marks) (c) Let Z₃ₓ₁ = (X₁ₓ₁, Y₂ₓ₁)ᵀ ~ N₃((0, 0, 1)ᵀ, [[1, 2, 1], [2, 5, 2], [1, 2, 2]]). Show that conditional on X₁ₓ₁, the two components of Y₂ₓ₁ are independent but marginally they are not. (15 marks)

Answer approach & key points

Derive the required mathematical results systematically across all three parts. For part (a)(i)-(ii), apply correlation matrix properties and determinant conditions first, then use partial correlation bounds. For part (b), build the cluster sampling theory from first principles with ANOVA decomposition. For part (c), partition the multivariate normal distribution and derive conditional distributions. Allocate approximately 40% time to part (a) given its 20 marks, 30% each to parts (b) and (c). Structure as: direct derivations without lengthy introductions, clear theorem statements, step-by-step proofs, and boxed final expressions.

  • For (a)(i): Verify positive semi-definiteness of correlation matrix by checking det(R) ≥ 0 or all principal minors non-negative; compute 1 - r₁₂² - r₂₃² - r₁₃² + 2r₁₂r₂₃r₁₃ ≥ 0
  • For (a)(ii): Derive bounds using r₁₃ = r₁₂r₂₃ ± √[(1-r₁₂²)(1-r₂₃²)]; obtain 0.4004 ± 0.5450, i.e. the numerical interval [-0.145, 0.945] or equivalent
  • For (b): Define cluster sampling estimator ȳ̄ = (1/nM)ΣᵢΣⱼ yᵢⱼ; prove unbiasedness E(ȳ̄) = Ȳ; derive variance via between-cluster and within-cluster SS decomposition
  • For (b): Express variance in ICC form using ρcl = (S_b² - S_w²)/(S_b² + (M-1)S_w²) or equivalent definition; manipulate to reach target formula
  • For (c): Partition covariance matrix Σ = [[Σ_XX, Σ_XY], [Σ_YX, Σ_YY]]; derive conditional distribution Y|X ~ N(μ_Y + Σ_YXΣ_XX⁻¹(X-μ_X), Σ_YY - Σ_YXΣ_XX⁻¹Σ_XY)
  • For (c): Show conditional covariance matrix is diagonal (implying independence given X₁) while marginal covariance Σ_YY is not diagonal
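The numerical steps in parts (a) and (c) are short enough to verify directly. The sketch below uses hypothetical function names; it evaluates the determinant condition, the limits for r₁₃, and the conditional covariance Σ_YY - Σ_YXΣ_XX⁻¹Σ_XY for the covariance matrix given in the question.

```python
import math

def corr_determinant(r12, r23, r13):
    """Determinant of the 3x3 correlation matrix."""
    return 1 - r12**2 - r23**2 - r13**2 + 2 * r12 * r23 * r13

def r13_limits(r12, r23):
    """Range of r13 that keeps the correlation matrix non-negative definite."""
    half_width = math.sqrt((1 - r12**2) * (1 - r23**2))
    return r12 * r23 - half_width, r12 * r23 + half_width

det = corr_determinant(0.77, 0.52, 0.72)
low, high = r13_limits(0.77, 0.52)
print(round(det, 6), round(low, 4), round(high, 4))

# (c) Partition of the covariance matrix: X is the first component, Y the last two
s_xx = 1.0
s_xy = [2.0, 1.0]
s_yy = [[5.0, 2.0], [2.0, 2.0]]
cond_cov = [[s_yy[i][j] - s_xy[i] * s_xy[j] / s_xx for j in range(2)]
            for i in range(2)]
print(cond_cov)   # zero off-diagonal entries mean independence given X
```

The determinant is positive, so the three correlations are consistent, and the observed r₁₃ = 0.72 lies inside the computed limits.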
Q7
50M derive Factorial experiments, principal components, regression estimator

(a) (i) What is confounding in factorial experiments ? (ii) A $2^6$ factorial experiment is conducted in blocks of size $2^3$. Write the confounded effects such that no main effect or two factor interaction are confounded. Give the list of independent and generalised interactions confounded along with the elements of key block only. (iii) Give the break-up of degrees of freedom for a $2^n$ factorial experiment in $2^k$ blocks. (b) What are principal components ? Describe how to compute the principal components of the vectors X₁ = $\begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}$ and X₂ = $\begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}$. Give X₁ and X₂ in terms of the principal components. (c) Define Regression estimator. Show bias = – Cov ($\bar{x}$, b). Under what conditions is bias negligible ? Find the mean square error of the estimator to first degree of approximation. Give comparison of Regression estimator with Ratio estimator.

Answer approach & key points

Derive the required expressions systematically across all five sub-parts. For (a)(i)-(iii), allocate ~35% time covering confounding definition, the specific 2^6 construction in blocks of size 2^3 (8 blocks, hence 7 confounded effects, for example ABCD, ABEF, ACE and their generalised interactions), and the general df breakdown. For (b), spend ~25% time on PCA computation: construct data matrix, find covariance, eigenvalues (3, 1, 0), eigenvectors, and express X₁, X₂ in PC terms. For (c), allocate ~40% time deriving regression estimator bias, MSE approximation, and comparison with ratio estimator via Cochran's approach. Begin with definitions, proceed through step-by-step derivations, and conclude with clear interpretations.

  • (a)(i) Define confounding as mixing of treatment effects with block effects; distinguish complete vs partial confounding
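  • (a)(ii) Blocks of size 2³ give 2⁶/2³ = 8 blocks, so 7 effects are confounded: 3 independent and 4 generalised. One valid choice: independent ABCD, ABEF, ACE; generalised CDEF, BDE, BCF, ADF; verify no main effect or 2-factor interaction is confounded; key block (1), acf, ade, bce, bdf, abcd, abef, cdef
  • (a)(iii) State df breakdown, assuming r replicates with the same 2^k - 1 effects confounded in each: blocks (r·2^k - 1), unconfounded treatment effects (2^n - 2^k), error ((r-1)(2^n - 2^k)), total (r·2^n - 1)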
  • (b) Define PCs as uncorrelated linear combinations maximizing variance; compute the cross-product matrix XᵀX = [2 -1; -1 2] (the components of each vector already sum to zero, so this is proportional to the covariance matrix), eigenvalues λ₁=3, λ₂=1, eigenvectors [1/√2, -1/√2]ᵀ and [1/√2, 1/√2]ᵀ; express X₁ = (1/√2)PC₁ + (1/√2)PC₂, X₂ = (-1/√2)PC₁ + (1/√2)PC₂
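  • (c) Define regression estimator ȳ_lr = ȳ + b(X̄ - x̄); derive bias = -Cov(x̄, b); state that the bias is negligible when n is large or the regression of y on x is linear; derive MSE ≈ (1/n - 1/N)S²_y(1-ρ²); compare: to this order the regression estimator is never less precise than the ratio estimator, with equality only when the regression line passes through the origin (β = R, i.e. ρ = C_x/C_y)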
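A confounding scheme for part (a)(ii) can be checked mechanically. The sketch below assumes one particular choice of independent generators, ABCD, ABEF and ACE (other valid choices exist), and uses hypothetical helper names to list the generalised interactions and the key block.

```python
from itertools import combinations

FACTORS = "ABCDEF"

def product(e1, e2):
    """Generalised interaction: letters appearing in exactly one of the two effects."""
    return "".join(sorted(set(e1) ^ set(e2)))

def confounded_set(generators):
    """All non-trivial products of the independent generators."""
    effects = set()
    for r in range(1, len(generators) + 1):
        for combo in combinations(generators, r):
            word = ""
            for g in combo:
                word = product(word, g)
            effects.add(word)
    return sorted(effects, key=lambda w: (len(w), w))

def key_block(generators):
    """Treatments sharing an even number of letters with every generator."""
    block = []
    for r in range(len(FACTORS) + 1):
        for letters in combinations(FACTORS, r):
            if all(len(set(letters) & set(g)) % 2 == 0 for g in generators):
                block.append("".join(letters).lower() or "(1)")
    return block

gens = ["ABCD", "ABEF", "ACE"]        # assumed choice of independent interactions
confounded = confounded_set(gens)
block = key_block(gens)
print(confounded)
print(block)
```

Every confounded effect has at least three letters, so no main effect or two-factor interaction is lost, and the key block has 2³ = 8 treatment combinations as required.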
Q8
50M solve Stratified sampling, polynomial regression, split-plot designs

(a) (i) In stratified sampling under optimum allocation, how will you proceed to select units from different strata, if one or more nᵢ's happens to be greater than Nᵢ (i ≥ 2) ? (ii) A sample survey was conducted in a certain district of Himachal Pradesh. Four strata A, B, C and D of villages were formed according to the acreage of fruit trees as obtained from revenue records. A random sample of villages was selected from each stratum and the number of apple orchards in each selected village was noted. The data are shown below : | Stratum | Total number of villages (Nᵢ) | Number of villages in sample (nᵢ) | Number of orchards in the selected villages | |---------|------------------------------|-----------------------------------|---------------------------------------------| | A (0 – 3 acres) | 275 | 15 | 2, 5, 1, 9, 6, 7, 0, 4, 7, 0, 5, 0, 0, 3, 0 | | B (3 – 6 acres) | 146 | 10 | 21, 11, 7, 5, 6, 19, 5, 24, 30, 24 | | C (6 – 15 acres) | 93 | 12 | 3, 10, 4, 11, 38, 11, 4, 46, 4, 18, 1, 39 | | D (15 acres and above) | 62 | 11 | 30, 42, 20, 38, 29, 22, 31, 28, 66, 14, 15 | Estimate the number of orchards in the district. (b) (i) For a second order polynomial model with one predictor variable, derive the least squares normal equations clearly stating the conditions assumed. How will you interpret the parameters in this model ? (ii) Describe why it is recommended to work with predictor variables centred around the mean. Comment on fitted values of the response variable in this case. Prove your claim. (c) What are split-plot designs ? When do you recommend the use of such designs ? If e₁ and e₂ are the main plot and sub-plot errors respectively, both estimated in units of a single sub-plot, explain why e₁ is expected to be larger than e₂.

Answer approach & key points

This multi-part question demands solving numerical problems alongside theoretical derivations and explanations. Allocate approximately 35% effort to part (a) combining optimum allocation adjustment and stratified estimation with Himachal Pradesh data; 35% to part (b) covering polynomial regression derivation, centering benefits, and proof; and 30% to part (c) explaining split-plot designs with error comparison. Structure as: brief theoretical setup → step-by-step calculations/derivations → interpretation of results in context.

  • For (a)(i): Explain the iterative adjustment procedure when nᵢ > Nᵢ in optimum allocation—set nᵢ = Nᵢ for such strata, recompute allocation for remaining strata using revised formula, and repeat until all nᵢ ≤ Nᵢ
  • For (a)(ii): Calculate stratum means ȳᵢ (49/15 ≈ 3.27, 15.2, 15.75, 335/11 ≈ 30.45), compute the stratified estimate Ŷ = ΣNᵢȳᵢ ≈ 6470 orchards in the district, and report its standard error from ΣNᵢ(Nᵢ - nᵢ)sᵢ²/nᵢ
  • For (b)(i): Derive normal equations for Y = β₀ + β₁X + β₂X² by minimizing Σ(Yᵢ - β₀ - β₁Xᵢ - β₂Xᵢ²)²; interpret β₀ as response at X=0, β₁ as linear rate of change, β₂ as curvature/acceleration
  • For (b)(ii): Explain that centering (X - X̄) eliminates correlation between linear and quadratic terms, stabilizes variance-covariance matrix; prove fitted values remain identical using algebraic expansion showing predicted Y unchanged
  • For (c): Define split-plot designs as experiments with two sizes of experimental units where whole plots receive one factor and sub-plots receive another; recommend when one factor is harder/costlier to change; explain e₁ > e₂ due to additional whole-plot error component from main plot-to-main plot variation
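The stratified estimate in part (a)(ii) is a weighted sum of stratum means, which is easy to verify. This sketch uses the data from the question; the dictionary layout and variable names are illustrative, and the variance line uses the standard without-replacement estimator ΣNᵢ(Nᵢ - nᵢ)sᵢ²/nᵢ.

```python
from statistics import mean, variance

# stratum -> (N_i, sampled orchard counts)
strata = {
    "A": (275, [2, 5, 1, 9, 6, 7, 0, 4, 7, 0, 5, 0, 0, 3, 0]),
    "B": (146, [21, 11, 7, 5, 6, 19, 5, 24, 30, 24]),
    "C": (93, [3, 10, 4, 11, 38, 11, 4, 46, 4, 18, 1, 39]),
    "D": (62, [30, 42, 20, 38, 29, 22, 31, 28, 66, 14, 15]),
}

# Estimated total: sum of N_i * (stratum sample mean)
estimate = sum(N * mean(y) for N, y in strata.values())

# Estimated variance of the total under SRSWOR within each stratum
var_hat = sum(N * (N - len(y)) * variance(y) / len(y) for N, y in strata.values())
std_error = var_hat ** 0.5

print(round(estimate, 1), round(std_error, 1))
```

The estimate comes to roughly 6470 orchards; the standard error printed alongside indicates how much sampling variation to attach to that figure.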

Paper II

8 questions · 400 marks
Q1
50M Compulsory solve Quality control, sampling, game theory, Markov chains, reliability

Explain Single Sampling Plan with the help of an example. Also, write the importance of an Operating Characteristics Curve in a sampling plan. (10 marks)

Solve the above assignment problem. (10 marks)

| Town | Depot I | Depot II | Depot III | Depot IV | Depot V |
|------|---------|----------|-----------|----------|---------|
| A | 160 | 130 | 175 | 190 | 200 |
| B | 135 | 120 | 130 | 160 | 175 |
| C | 140 | 110 | 145 | 170 | 185 |
| D | 50 | 50 | 80 | 80 | 110 |
| E | 55 | 35 | 80 | 80 | 105 |

Use algebraic method to solve the above game. (10 marks)

| Player A \ Player B | B₁ | B₂ | B₃ | B₄ |
|---------------------|----|----|----|----|
| A₁ | 0·25 | 0·20 | 0·14 | 0·30 |
| A₂ | 0·27 | 0·16 | 0·12 | 0·14 |
| A₃ | 0·35 | 0·08 | 0·15 | 0·19 |
| A₄ | −0·02 | 0·08 | 0·13 | 0·00 |

Consider the Markov Chain with states 0, 1, 2 and transition probability matrix $$P = \begin{pmatrix} 0 & 1 & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & 1 & 0 \end{pmatrix}$$ Show that the states are periodic and persistent non-null. (10 marks)

State the importance of the hazard function. If the hazard rate of a component is given by: $$h(t) = \begin{cases} 0.015, & t \leq 200 \\ 0.025, & t > 200 \end{cases}$$ then find an expression for the reliability function of the component. (10 marks)

Answer approach & key points

The directive 'solve' requires demonstrating complete analytical solutions for all five numerical problems. Structure your answer by addressing each sub-question sequentially: (1) Single Sampling Plan with OC curve illustration, (2) Assignment problem using Hungarian method, (3) Game theory problem via algebraic method for mixed strategies, (4) Markov chain periodicity and persistence proof, and (5) Reliability function derivation from piecewise hazard rate. Each solution must show method, calculation, and final interpretation.

  • Single Sampling Plan: Define n (sample size) and c (acceptance number) with concrete example; sketch OC curve showing P(A) vs p with points at p=0, p=AQL, p=LTPD, p=1
  • Assignment Problem: Apply Hungarian method—row reduction, column reduction, minimum lines to cover zeros, optimality check, and final assignment with minimum cost
  • Game Theory: Verify no saddle point exists, formulate as LPP or use algebraic method for 2×2 subgames, find mixed strategy probabilities and game value
  • Markov Chain: Compute P² to show period d=2, verify irreducibility, calculate stationary distribution to confirm persistent non-null states
  • Reliability: Derive R(t) = exp(-∫h(u)du) giving R(t)=exp(-0.015t) for t≤200 and R(t)=exp(-3)exp(-0.025(t-200)) for t>200 with continuity at t=200
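Two of the results above lend themselves to a quick numerical check: the piecewise reliability function and the period-2 structure of the Markov chain. The sketch below uses hypothetical function names and plain lists for the matrix arithmetic.

```python
import math

def reliability(t: float) -> float:
    """R(t) = exp(-cumulative hazard) for h = 0.015 up to t = 200, 0.025 after."""
    if t <= 200:
        return math.exp(-0.015 * t)
    return math.exp(-3.0 - 0.025 * (t - 200))   # 0.015 * 200 = 3 accumulated by t = 200

def mat_mul(a, b):
    """Product of two square matrices given as nested lists."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
P2 = mat_mul(P, P)

# Candidate stationary distribution; pi * P should reproduce pi
pi = [0.25, 0.5, 0.25]
pi_next = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

print(reliability(200), reliability(300))
print(P2)        # diagonal entries are positive, while those of P are all zero
print(pi_next)
```

Returns to any state are possible only in an even number of steps, which gives period 2, and the existence of a stationary distribution with all entries positive is what confirms the states are persistent non-null.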
Q2
50M solve Reliability, control charts, acceptance sampling

A manufacturer finds that on the average, a television set is used 1.8 hours per day. A one year warranty is offered on the picture tube having a mean time to failure (MTTF) of 2000 hours. If the distribution of time to failure is exponential, then determine the percentage of tubes failing during the warranty period. (15 marks) The number of defects on 20 items were recorded as given above: | Item No. | No. of defects | Item No. | No. of defects | |----------|----------------|----------|----------------| | 1 | 2 | 11 | 6 | | 2 | 0 | 12 | 0 | | 3 | 4 | 13 | 2 | | 4 | 1 | 14 | 1 | | 5 | 0 | 15 | 0 | | 6 | 8 | 16 | 3 | | 7 | 0 | 17 | 2 | | 8 | 1 | 18 | 1 | | 9 | 2 | 19 | 0 | | 10 | 0 | 20 | 2 | Use a suitable control chart to identify whether the process is in control or not? (15 marks) Explain the concepts of producer's and consumer's risks. It has been decided to sample 100 items at random from each large batch. We reject the batch if more than 2 defectives are found. If the acceptable quality level is 1% and the unacceptable quality level is 5%, then find the producer's and consumer's risks. (20 marks)

Answer approach & key points

Solve this three-part numerical problem by first calculating the warranty failure probability using exponential distribution properties, then constructing and interpreting a c-chart for defect data with proper control limits, and finally computing producer's and consumer's risks using binomial distribution for the given sampling plan. Present each part sequentially with clear headings, showing all formulas, substitutions, and final interpretations.

  • Part 1: Calculate total warranty period as 1.8 × 365 = 657 hours and use P(T ≤ 657) = 1 - e^(-657/2000) for exponential failure probability
  • Part 2: Compute c-bar = 35/20 = 1.75, then UCL = 1.75 + 3√1.75 ≈ 5.72 and LCL = 0 (the computed value is negative), identifying Item 6 (8 defects) and Item 11 (6 defects) as out of control
  • Part 3: Define producer's risk α = P(reject | p=0.01) and consumer's risk β = P(accept | p=0.05) using binomial or Poisson approximation
  • Correct application of Poisson approximation with λ₁ = 1 for AQL and λ₂ = 5 for LTPD to find α = 1 - P(X≤2; λ=1) and β = P(X≤2; λ=5)
  • Numerical values: α ≈ 0.080 or 8% and β ≈ 0.125 or 12.5% (or precise binomial equivalents)
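The three computations in this question can be reproduced in a few lines. This is a sketch with illustrative names: it evaluates the warranty failure probability, the c-chart limits from the 20 recorded defect counts, and the two risks under both the Poisson approximation and the exact binomial model.

```python
import math

# Part 1: exponential lifetime, MTTF 2000 h, warranty usage 1.8 h/day for 365 days
hours = 1.8 * 365
p_fail = 1 - math.exp(-hours / 2000)

# Part 2: c-chart from the recorded defect counts (items 1 to 20)
defects = [2, 0, 4, 1, 0, 8, 0, 1, 2, 0, 6, 0, 2, 1, 0, 3, 2, 1, 0, 2]
c_bar = sum(defects) / len(defects)
ucl = c_bar + 3 * math.sqrt(c_bar)
lcl = max(0.0, c_bar - 3 * math.sqrt(c_bar))
out_of_control = [i for i, c in enumerate(defects, start=1) if c > ucl or c < lcl]

# Part 3: accept the batch if at most 2 defectives are found in a sample of 100
def poisson_cdf(k, lam):
    return sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k + 1))

def binom_cdf(k, n, p):
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

alpha_pois = 1 - poisson_cdf(2, 100 * 0.01)    # producer's risk at AQL = 1%
beta_pois = poisson_cdf(2, 100 * 0.05)         # consumer's risk at 5% defective
alpha_binom = 1 - binom_cdf(2, 100, 0.01)
beta_binom = binom_cdf(2, 100, 0.05)

print(round(p_fail, 3), c_bar, round(ucl, 2), out_of_control)
print(round(alpha_pois, 4), round(beta_pois, 4), round(alpha_binom, 4), round(beta_binom, 4))
```

Printing the binomial values next to the Poisson ones shows how much the approximation shifts the two risks for n = 100.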
Q3
50M solve Queuing systems and linear programming

(a) Explain M|G|1 queuing system. Obtain Pollaczek-Khinchine formula. (15 marks) (b) Use MODI method to solve the above transportation problem: Store I II III IV A 4 6 8 13 B 13 11 10 8 C 14 4 10 13 D 9 11 13 8 Supply 50 70 30 50 Demand 25 35 105 20 (15 marks) (c) Use two-phase method to solve: Maximize z = 2x₁ + x₂ + x₃ subject to the constraints 4x₁ + 6x₂ + 3x₃ ≤ 8 3x₁ - 6x₂ - 4x₃ ≤ 1 2x₁ + 3x₂ - 5x₃ ≥ 4 and x₁, x₂, x₃ ≥ 0. (20 marks)

Answer approach & key points

Solve this three-part numerical problem by allocating approximately 30% time to part (a) for deriving the Pollaczek-Khinchine formula, 30% to part (b) for the MODI method transportation problem, and 40% to part (c) for the two-phase simplex method. Begin each part with clear problem setup, show all computational steps systematically, and conclude with verified final answers. For (a), explain M|G|1 characteristics before derivation; for (b), ensure initial basic feasible solution before MODI optimization; for (c), complete Phase I before proceeding to Phase II.

  • Part (a): Correct explanation of M|G|1 queuing system components (Poisson arrivals, General service time, single server) and derivation of Pollaczek-Khinchine formula for mean queue length Lq = λ²E(S²)/[2(1-ρ)] or equivalent forms
  • Part (b): Correct initial basic feasible solution using VAM or NWCR method, followed by MODI (UV method) iterations with proper stepping stone paths until optimality is reached with minimum transportation cost
  • Part (c): Proper conversion of inequalities to equations using slack, surplus and artificial variables; successful completion of Phase I to eliminate artificial variables; Phase II optimization yielding optimal solution
  • Check of supply-demand balance in part (b) before solving: with the figures as printed, total supply is 200 and total demand is 185, so a dummy destination with demand 15 and zero cost is added; correct handling of ≥ constraint in part (c) with surplus and artificial variables
  • Clear presentation of all simplex tableaus for part (c) showing entering and leaving variables, pivot operations, and final optimal value of objective function
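The Pollaczek-Khinchine formula in part (a) can be wrapped in a small helper to see how the service-time variability enters. The function name and the numerical inputs are hypothetical; the remaining measures follow from Little's law.

```python
def pk_measures(lam: float, mean_s: float, second_moment_s: float) -> dict:
    """M/G/1 performance measures from the Pollaczek-Khinchine formula.

    lam: arrival rate; mean_s: E(S); second_moment_s: E(S^2).
    """
    rho = lam * mean_s
    if rho >= 1:
        raise ValueError("traffic intensity must be below 1 for a steady state")
    lq = lam**2 * second_moment_s / (2 * (1 - rho))   # mean number waiting
    wq = lq / lam                                     # mean wait in queue (Little's law)
    return {"rho": rho, "Lq": lq, "L": lq + rho, "Wq": wq, "W": wq + mean_s}

# Hypothetical rates: arrivals 2 per unit time, mean service time 0.25
exponential = pk_measures(2.0, 0.25, 2 * 0.25**2)    # M/M/1: E(S^2) = 2/mu^2
deterministic = pk_measures(2.0, 0.25, 0.25**2)      # M/D/1: E(S^2) = E(S)^2
print(exponential["Lq"], deterministic["Lq"])
```

With exponential service the result agrees with the M/M/1 value ρ²/(1-ρ), and constant service halves the mean queue length, which is a convenient check on the derivation.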
Q4
50M solve Linear programming and inventory control

(a) Solve the following linear programming problem: Maximize z = 3x₁ + 5x₂ subject to the constraints 3x₁ + 2x₂ ≤ 18 x₁ ≤ 4 x₂ ≤ 6 and x₁, x₂ ≥ 0. Discuss the change in Cⱼ on the optimality of the optimal basic feasible solution. (15 marks) (b) A manufacturer has to supply his customers with 600 units of his product per year. Shortages are not allowed and storage amounts to 60 paise per unit per year. The set up cost per run is Rs. 80. Find (i) economic order quantity (ii) optimum period of supply per optimum order and (iii) increase in the total cost associated with ordering 20 per cent more and 40% less. (15 marks) (c) A machine is set to deliver the packets of a given weight. Ten samples of size 5 each were examined and the following results were obtained: | Sample No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |---|---|---|---|---|---|---|---|---|---|---| | Mean | 43 | 49 | 37 | 44 | 45 | 37 | 51 | 46 | 43 | 47 | | Range | 5 | 6 | 5 | 7 | 7 | 4 | 8 | 6 | 4 | 6 | Use mean and range charts to check whether process is under control. (Given for n = 5, d₂ = 2·326 and d₃ = 0·864) (20 marks)

Answer approach & key points

Begin with the directive 'solve' for part (a), applying the simplex method or graphical method to find the optimal solution, then analyze sensitivity of Cⱼ coefficients. For part (b), apply the EOQ formula with given parameters (D=600, C₀=₹80, Cₕ=₹0.60) and calculate percentage cost variations. For part (c), construct X̄ and R control charts using given constants d₂=2.326 and d₃=0.864, computing center lines and control limits to assess process stability. Allocate approximately 30% time to (a), 25% to (b), and 45% to (c) given their mark distribution and computational complexity.

  • Part (a): Correct identification of feasible region vertices (0,0), (4,0), (4,3), (2,6), (0,6) and optimal solution at (2,6) with Z=36
  • Part (a): Sensitivity analysis showing range for C₁ as [0, 7.5] (with C₂ = 5) and C₂ as [2, ∞) (with C₁ = 3) maintaining optimality of current basis, from the condition 0 ≤ C₁/C₂ ≤ 3/2
  • Part (b): EOQ calculation as √(2×600×80/0.6) = 400 units; optimum period = 400/600 × 12 = 8 months
  • Part (b): Total cost at EOQ = ₹240; cost at 480 units = ₹244 (1.67% increase); cost at 240 units = ₹264 (10% increase)
  • Part (c): Grand mean X̄̄ = 44.2, average range R̄ = 5.8; X̄ chart limits: 44.2 ± 3×5.8/(2.326×√5) = 44.2 ± 3.35, i.e. LCL = 40.85 and UCL = 47.55
  • Part (c): R chart limits: UCL = 5.8×(1+3×0.864/2.326) = 12.26; LCL = 5.8×(1-3×0.864/2.326) = -0.66, which is negative and is therefore taken as 0
  • Part (c): Correct conclusion that all sample ranges lie within the R chart limits, but Samples 3 and 6 (mean = 37) fall below the X̄ chart LCL and Samples 2 (mean = 49) and 7 (mean = 51) fall above the UCL, indicating the process is not under statistical control
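
The arithmetic in the key points above can be cross-checked with a short script. This is only a sketch using the figures given in the question; the helper names (`corner_points`, `best_vertex`, `total_cost`) are illustrative choices, and the LP is solved by simple corner-point enumeration, not by the simplex method.

```python
from itertools import combinations
from math import sqrt

# Part (a): each constraint is a*x1 + b*x2 <= rhs (non-negativity as -x <= 0).
constraints = [(3, 2, 18), (1, 0, 4), (0, 1, 6), (-1, 0, 0), (0, -1, 0)]

def corner_points(cons, tol=1e-9):
    """Intersect every pair of boundary lines and keep the feasible points."""
    points = set()
    for (a1, b1, r1), (a2, b2, r2) in combinations(cons, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue  # parallel boundaries have no unique intersection
        x = (r1 * b2 - r2 * b1) / det
        y = (a1 * r2 - a2 * r1) / det
        if all(a * x + b * y <= r + tol for a, b, r in cons):
            points.add((round(x, 6), round(y, 6)))
    return points

def best_vertex(c1, c2):
    """Corner point maximising c1*x1 + c2*x2."""
    return max(corner_points(constraints), key=lambda p: c1 * p[0] + c2 * p[1])

print(best_vertex(3, 5))     # (2.0, 6.0), where z = 36
print(best_vertex(3, 1.9))   # (4.0, 3.0): optimum moves once C2 drops below 2
print(best_vertex(7.6, 5))   # (4.0, 3.0): optimum moves once C1 exceeds 7.5

# Part (b): EOQ model with D = 600, set-up cost 80, holding cost 0.60.
def total_cost(q, demand=600, setup=80.0, holding=0.60):
    """Annual set-up cost plus annual holding cost for lot size q."""
    return demand / q * setup + q / 2 * holding

eoq = sqrt(2 * 600 * 80 / 0.60)
for q in (eoq, 1.2 * eoq, 0.6 * eoq):
    print(round(q), round(total_cost(q), 2))

# Part (c): X-bar and R chart limits from d2 and d3 for n = 5.
means = [43, 49, 37, 44, 45, 37, 51, 46, 43, 47]
ranges = [5, 6, 5, 7, 7, 4, 8, 6, 4, 6]
n, d2, d3 = 5, 2.326, 0.864

grand_mean = sum(means) / len(means)
r_bar = sum(ranges) / len(ranges)
half_width = 3 * r_bar / (d2 * sqrt(n))
xbar_lcl, xbar_ucl = grand_mean - half_width, grand_mean + half_width
r_ucl = r_bar * (1 + 3 * d3 / d2)
r_lcl = max(0.0, r_bar * (1 - 3 * d3 / d2))  # a negative limit is set to 0

out_of_control = [i + 1 for i, m in enumerate(means)
                  if not xbar_lcl <= m <= xbar_ucl]
print(round(xbar_lcl, 2), round(xbar_ucl, 2), out_of_control)
```

Re-running `best_vertex` with perturbed coefficients is a quick way to confirm a sensitivity range obtained analytically from the slopes of the binding constraints.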
Q5
50M Compulsory explain Econometrics and vital statistics

(a) Explain Zellner's seemingly unrelated regression model and the feasible generalized least squares method of estimating the model. (10 marks) (b) Explain the functions of N.S.S.O. (10 marks) (c) Obtain the generalized least squares estimators in the two-variable model Yᵢ = β₁ + β₂Xᵢ + uᵢ assuming the heteroscedastic variances are known and obtain their variances. (10 marks) (d) Why is it considered desirable to convert gross scores to some standard scores? Define 'standard scores' and 'normalised scores' and describe how they are derived. (10 marks) (e) Fill in blanks which are marked with a query in the above skeleton life table and explain the meaning of the symbols at the heads of the columns. (10 marks)

Answer approach & key points

The directive 'explain' demands clear exposition with theoretical foundations and derivations. Allocate approximately 20% (10 marks) to each sub-part equally. For (a), present SURE model structure and FGLS estimation steps; for (b), enumerate NSSO functions with Indian statistical system context; for (c), derive GLS estimators with matrix algebra; for (d), clarify standardization rationale with formulae; for (e), complete life table calculations and interpret actuarial symbols. Structure: brief introduction, systematic part-wise treatment with equations, and concluding synthesis on statistical estimation methods.

  • Part (a): SURE model specification with contemporaneous correlation across equations, disturbance covariance matrix structure, and FGLS two-step estimation (estimated Ω used for feasible estimator)
  • Part (b): NSSO functions—conducting large-scale sample surveys (NSS, ASI), data dissemination, methodological research, coordination with state DES, and international reporting (SDGs)
  • Part (c): GLS transformation with known heteroscedastic variances σᵢ², weighted least squares derivation, variance-covariance matrix of estimators, and efficiency comparison with OLS
  • Part (d): Desirability of standard scores (comparability, norm-referenced interpretation), z-score formula, normalized scores (T-scores, stanines), and derivation steps
  • Part (e): Life table completion—calculating nqx, npx, nLx, Tx, ex from given lx and ndx columns; explanation of q (mortality), p (survival), L (person-years), T (total), e (expectation)
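
For part (c), the GLS estimators reduce to weighted least squares with weights wᵢ = 1/σᵢ². The sketch below implements the closed-form expressions; the function name `gls_two_variable` and the sample data are hypothetical and serve only to illustrate the formulas.

```python
def gls_two_variable(x, y, sigma2):
    """GLS (weighted least squares) for Y = b1 + b2*X + u with known Var(u_i).

    Returns (b1, b2, var_b1, var_b2), using weights w_i = 1 / sigma_i^2.
    """
    w = [1.0 / s for s in sigma2]
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))

    # Determinant of X' Omega^{-1} X for the two-variable model.
    det = sw * swxx - swx ** 2
    b2 = (sw * swxy - swx * swy) / det
    b1 = (swy - b2 * swx) / sw
    var_b1 = swxx / det
    var_b2 = sw / det
    return b1, b2, var_b1, var_b2

# Hypothetical data lying exactly on Y = 1 + 2X, with variances growing in X.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
sigma2 = [1, 4, 9, 16]
print(gls_two_variable(x, y, sigma2))
```

With all σᵢ² equal to 1 the variances collapse to the familiar OLS expressions, which gives a convenient check on the algebra.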
Q6
50M explain Time series analysis and index numbers

(a) Explain Box-Jenkins methodology to build ARIMA models. (15 marks) (b) Prepare the cost of living index for 2006 on the basis of 2005 from the above data by (i) aggregative method and (ii) method of weighted relatives and comment. (15 marks) (c) Explain price statistics relating to 'Price Quotations'. Elucidate publications of data concerning foreign trade of India. (20 marks)

Answer approach & key points

The directive 'explain' demands clear exposition with logical flow and appropriate technical depth. Allocate approximately 30% time/words to part (a) on Box-Jenkins methodology, 30% to part (b) on index number calculations with explicit working, and 40% to part (c) covering price quotations and foreign trade publications. Structure: brief conceptual introduction for each part, detailed methodological explanation with formulae, worked calculations for (b), and specific Indian statistical system references for (c).

  • Part (a): Identification of ARIMA model components (AR, I, MA), stationarity testing via ADF or KPSS, differencing procedures, ACF/PACF analysis for order identification, parameter estimation via maximum likelihood, diagnostic checking with Ljung-Box test, and forecasting with confidence intervals
  • Part (b): Correct application of the aggregative (aggregate expenditure) method, a Laspeyres-type formula Σp₁q₀/Σp₀q₀ × 100 with base-year quantities as weights, and the weighted relatives (family budget) method, ΣPW/ΣW with price relatives P = p₁/p₀ × 100 and base-year expenditure weights W = p₀q₀; proper identification of missing data items from 'above data' reference, and meaningful economic interpretation of cost of living changes
  • Part (c): Explanation of price quotation system (selection of representative items, specification of quality, choice of markets, timing of collection), role of NSSO and CSO in price data collection, and detailed coverage of DGCI&S publications (Monthly Statistics of Foreign Trade, Annual Report, commodity-wise/country-wise trade data, ITC-HS classification)
  • Integration of Indian context: citing RBI's use of ARIMA for monetary policy forecasting, CPI-AL/RL and CPI-IW construction by the Labour Bureau (with CPI-Urban/Rural/Combined compiled by the NSO), and Ministry of Commerce's trade statistics dissemination
  • Critical commentary in (b) on limitations of index numbers (substitution bias, quality changes, new goods) and in (c) on challenges in price quotation representativeness and trade data timeliness
  • Correct mathematical notation: ∇^d for differencing, φ(B) and θ(B) operators, p/d/q orders, and index number formulae with proper subscripts
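
The two index-number methods in part (b) can be sketched as follows. The question's data table is not reproduced on this page, so the `budget` figures below are entirely hypothetical and only illustrate the formulas; the function names are likewise illustrative.

```python
# Hypothetical data: item -> (base-year quantity q0, 2005 price p0, 2006 price p1)
budget = {
    "food": (10, 20.0, 25.0),
    "clothing": (2, 50.0, 60.0),
    "fuel": (5, 10.0, 12.0),
    "rent": (1, 100.0, 110.0),
}

def aggregate_expenditure_index(data):
    """Aggregative method: 100 * sum(p1*q0) / sum(p0*q0)."""
    current = sum(q0 * p1 for q0, p0, p1 in data.values())
    base = sum(q0 * p0 for q0, p0, p1 in data.values())
    return 100 * current / base

def weighted_relatives_index(data):
    """Weighted relatives: sum(P*W) / sum(W), P = 100*p1/p0, W = p0*q0."""
    weighted = sum((100 * p1 / p0) * (p0 * q0) for q0, p0, p1 in data.values())
    weights = sum(p0 * q0 for q0, p0, p1 in data.values())
    return weighted / weights

print(aggregate_expenditure_index(budget))   # 120.0
print(round(weighted_relatives_index(budget), 6))
```

When the weights are base-year expenditures p₀q₀, the two methods are algebraically identical, so agreement between the two printed values is expected and worth noting in the comment part of the answer.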
Q7
50M derive Demography, psychometrics, and life table analysis

(a) If c(x, t) denote observed proportion of females in the age group (x, x+t) and f(x, t) is the observed proportion of females giving birth to female children in the age group (x, x+t) at time t. Let us assume that X is uniformly distributed in (α, β). Then show that $$ \hat{B}_f(t)=\left[\hat{r}_{c,f|t} \hat{\sigma}_c \hat{\sigma}_f (\beta-\alpha) + \frac{[\hat{T}_f(t)]^2}{(\beta-\alpha)} \frac{1}{\hat{G}_f(t)}\right], $$ where $\hat{T}_f(t)$ is the estimated total fertility. $\hat{B}_f(t)$ is the estimated female birthrate at time t. $\hat{G}_f(t)$ is the estimated General Fertility rate. $\hat{r}_{c,f|t}$ represents product moment correlation coefficient between c and f given t. $\hat{\sigma}_c, \hat{\sigma}_f$ are observed standard deviations of c and f respectively. (15 marks) (b) What do you mean by Intelligence Quotient (I.Q.) ? Describe the procedure and test of measuring I.Q. How does an aptitude test differ from an Intelligence Test ? The reliability coefficient of a test of 60 items is 0·65. How much the test should be lengthened to raise the self correlation to 0·95 ? What effect will the doubling and tripling the test's length have upon the reliability coefficients ? What is the reliability of a test having 135 comparable items ? (15 marks) (c) Define instantaneous force of mortality (μₓ). Show that qₓ = (1/lₓ) ∫₀¹ μₓ₊ₜ lₓ₊ₜ dt where qₓ is the probability of dying within one year following the attainment of age x. Also prove that μₓ = (1/eₓ⁰) [1 + (deₓ⁰/dx)] where eₓ⁰ is the complete expectation of life. (20 marks)

Answer approach & key points

The directive 'derive' demands rigorous mathematical proofs and derivations. Allocate approximately 30% time to part (a) on female birthrate estimation using correlation structure, 30% to part (b) covering IQ definition, measurement procedures, aptitude-intelligence distinction, and reliability calculations using Spearman-Brown prophecy, and 40% to part (c) on force of mortality derivations and life table relationships. Structure with clear section headings, state assumptions explicitly, show step-by-step derivations, and conclude with precise final expressions.

  • Part (a): Derivation of female birthrate formula using uniform distribution assumption, correlation structure between c(x,t) and f(x,t), and proper substitution of T_f(t) and G_f(t) with algebraic manipulation of (β-α) terms
  • Part (b): Precise definition of IQ (Mental Age/Chronological Age × 100 or deviation IQ), Stanford-Binet and Wechsler procedures, distinction between aptitude (specific potential) and intelligence (general ability) tests
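  • Part (b): Application of Spearman-Brown prophecy formula rₙ = n·r₁/[1 + (n−1)·r₁], inverted as n = r₂(1−r₁)/[r₁(1−r₂)], to find the required lengthening for reliability 0.95 (n ≈ 10.23, i.e. about 614 items), and calculation of new reliabilities for doubled (≈ 0.788) and tripled (≈ 0.848) lengths and for 135 items (n = 2.25, ≈ 0.807)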
  • Part (c): Definition of μₓ as instantaneous death rate and derivation of qₓ = (1/lₓ)∫₀¹ μₓ₊ₜ lₓ₊ₜ dt using relationship between force of mortality and survival function
  • Part (c): Proof of μₓ = (1/eₓ⁰)[1 + (deₓ⁰/dx)] using complete expectation of life definition eₓ⁰ = Tₓ/lₓ and differentiation with respect to age
  • Correct handling of Indian demographic context: mention of SRS (Sample Registration System) data for fertility estimation and applicability to Indian population studies
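
The reliability calculations in part (b) follow directly from the Spearman-Brown formula. A minimal sketch with the question's figures (60 items, reliability 0.65) is given below; the function names are illustrative.

```python
def stepped_up_reliability(r, n):
    """Spearman-Brown: reliability of a test lengthened n times."""
    return n * r / (1 + (n - 1) * r)

def lengthening_factor(r_old, r_new):
    """Factor n by which the test must be lengthened to reach r_new."""
    return r_new * (1 - r_old) / (r_old * (1 - r_new))

items, r = 60, 0.65

n_needed = lengthening_factor(r, 0.95)
print(round(n_needed, 2), round(n_needed * items))    # 10.23 614

print(round(stepped_up_reliability(r, 2), 4))         # doubled: 0.7879
print(round(stepped_up_reliability(r, 3), 4))         # tripled: 0.8478
print(round(stepped_up_reliability(r, 135 / items), 4))  # 135 items: 0.8069
```

Substituting the computed factor back into `stepped_up_reliability` should return 0.95, which is a useful self-check before writing the final answer.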
Q8
50M explain Econometrics and population projection

(a) What is autocorrelation ? What are its consequences ? Explain the Goldfeld-Quandt test and Glejser test for heteroscedasticity. (20 marks) (b) Check the identifiability of the following two-equation system : β₁₁y₁ₜ + β₁₂y₂ₜ + γ₁₁x₁ₜ + γ₁₂x₂ₜ = u₁ₜ, β₂₁y₁ₜ + β₂₂y₂ₜ + γ₂₁x₁ₜ + γ₂₂x₂ₜ = u₂ₜ. Given the restrictions (i) γ₁₂ = 0, γ₂₁ = 0 and (ii) γ₁₁ = 0, γ₁₂ = 0 (15 marks) (c) Describe Leslie matrix and describe Leslie Matrix Technique for the population projection. (15 marks)

Answer approach & key points

The directive 'explain' demands clear exposition with causal reasoning and illustrative examples. Allocate approximately 40% of time/words to part (a) given its 20 marks weight, covering autocorrelation definition, consequences, and both heteroscedasticity tests with step-wise procedures. Devote roughly 30% each to parts (b) and (c): for (b), construct the identification analysis using order and rank conditions for both restriction sets; for (c), explain Leslie matrix structure, fertility/survival parameters, and iterative projection mechanics with Indian demographic application. Structure as: definitional clarity → methodological exposition → worked application → concluding synthesis.

  • Part (a): Precise definition of autocorrelation (correlation between error terms across observations) and distinction from heteroscedasticity; enumeration of consequences including inflated t-statistics, inefficient OLS estimates, and misleading R²
  • Part (a): Goldfeld-Quandt test: correct procedure of ordering observations, splitting samples, computing F-ratio of residual variances, and interpretation against critical values
  • Part (a): Glejser test: auxiliary regression of absolute residuals on functions of the explanatory variable (e.g. X, √X, 1/X), significance test of the slope coefficient, and comparison with Goldfeld-Quandt in terms of power and applicability
  • Part (b): Application of order condition (K-k ≥ m-1) and rank condition for identification under restriction set (i) γ₁₂=0, γ₂₁=0, showing both equations are identified
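  • Part (b): Analysis under restriction set (ii) γ₁₁=0, γ₁₂=0, where both restrictions fall on equation 1: equation 1 is over-identified (K-k = 2 > m-1 = 1, with the rank condition met provided γ₂₁ or γ₂₂ is non-zero), while equation 2 carries no restriction (K-k = 0 < m-1 = 1) and is therefore not identified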
  • Part (c): Leslie matrix structure: age-specific fertility rates (Fᵢ) in first row, survival probabilities (Pᵢ) on sub-diagonal, zeros elsewhere; matrix dimensions matching age classes
  • Part (c): Population projection technique: iterative multiplication n(t+1) = L × n(t), stable population properties, intrinsic growth rate extraction, and application to Indian census projections
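
The projection step n(t+1) = L × n(t) in part (c) can be sketched in a few lines. The fertility rates, survival probabilities, and starting population below are hypothetical values for a three-age-class population, chosen only to show the mechanics.

```python
def leslie_matrix(fertility, survival):
    """Build a Leslie matrix: fertilities in row 0, survivals on the sub-diagonal."""
    k = len(fertility)
    matrix = [[0.0] * k for _ in range(k)]
    matrix[0] = [float(f) for f in fertility]
    for i, p in enumerate(survival):   # survival has k - 1 entries
        matrix[i + 1][i] = float(p)
    return matrix

def project(matrix, population, steps=1):
    """Apply n(t+1) = L n(t) repeatedly and return the final age vector."""
    for _ in range(steps):
        population = [sum(a * b for a, b in zip(row, population))
                      for row in matrix]
    return population

# Hypothetical inputs: three age classes of females.
L = leslie_matrix(fertility=[0.0, 1.5, 0.5], survival=[0.8, 0.5])
n0 = [100.0, 50.0, 20.0]

print(project(L, n0, steps=1))   # [85.0, 80.0, 25.0]
print(project(L, n0, steps=2))   # [132.5, 68.0, 40.0]
```

Iterating for many steps and taking the ratio of successive population totals gives a numerical approximation to the dominant eigenvalue, from which the intrinsic growth rate can be read off.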
