Statistics

UPSC Statistics 2022

All 16 questions from the 2022 Civil Services Mains Statistics examination, across 2 papers and 800 marks in total. Each question comes with a detailed evaluation rubric, directive word analysis, and model answer points.

16 Questions
800 Total marks
2 Papers
2022 Exam year

Paper I

8 questions · 400 marks
Q1
50 marks · Compulsory · Directive: prove · Topic: Probability distributions and statistical inference

(a) Let X and Y be independent random variables with exponential distribution having respective means $\frac{1}{\lambda_1}$ and $\frac{1}{\lambda_2}$, $\lambda_1 > 0, \lambda_2 > 0$. Find E [max (X, Y)]. (10 marks) (b) Using Central Limit Theorem, show that $$\lim_{n \to \infty} e^{-n} \sum_{k=0}^{n} \frac{n^k}{k!} = \frac{1}{2}$$ (10 marks) (c) An unbiased six-sided die is thrown twice. Let X denote the smaller of the scores obtained. Then show that the probability mass function (p.m.f.) of X is given by : $$p_X(x) = \frac{13-2x}{36}, \quad x = 1, 2, ..., 6$$ $$= 0, \quad \text{otherwise.}$$ (10 marks) (d) Let T₁ and T₂ be two unbiased estimators of θ with Var(T₁) = Var(T₂), then show that Corr(T₁, T₂) ≥ 2e – 1, where e is the efficiency of each estimator. (10 marks) (e) An urn contains 5 marbles of which θ are white and the others black. In order to test null hypothesis H₀ : θ = 3 versus alternative hypothesis H₁ : θ = 4, two marbles are drawn at random. H₀ is rejected if both the marbles are white, otherwise H₀ is accepted. Show that probability of type I error in case of without replacement and with replacement schemes, both are less than 0·40, but power of the test under with replacement is higher than that of under without replacement scheme. (10 marks)

Answer approach & key points

Prove each of the five results systematically, allocating approximately 2 minutes per mark (about 20 minutes per 10-mark part). For (a), use the identity E[max(X,Y)] = E[X] + E[Y] - E[min(X,Y)] or direct integration; for (b), recognize the Poisson sum and apply the CLT to a sum of i.i.d. Poisson(1) variables; for (c), enumerate favorable outcomes for the minimum value; for (d), compare the variance of the averaged estimator (T₁+T₂)/2 with the minimum attainable variance; for (e), compute hypergeometric vs binomial probabilities. Present each proof with a clear statement of assumptions, step-by-step derivation, and boxed final result.

  • (a) Correct setup using E[max(X,Y)] = ∫∫ max(x,y)f_X(x)f_Y(y)dxdy or equivalent identity with min(X,Y) ~ Exp(λ₁+λ₂)
  • (b) Identification of e⁻ⁿ Σₖ₌₀ⁿ nᵏ/k! as P(Sₙ ≤ n) for Sₙ ~ Poisson(n), a sum of n i.i.d. Poisson(1) variables, and application of the CLT to show P(Sₙ ≤ n) = P((Sₙ − n)/√n ≤ 0) → Φ(0) = 1/2
  • (c) Enumeration of outcomes where min equals k: (k,k), (k,j) for j>k, (i,k) for i>k, yielding count (13-2k) for each k = 1,...,6
  • (d) Use of Var((T₁+T₂)/2) ≥ Var(T*), where T* is the minimum-variance unbiased estimator, together with the efficiency definition e = Var(T*)/Var(T₁), to obtain (1 + ρ)/2 ≥ e and hence Corr(T₁,T₂) = ρ ≥ 2e-1
  • (e) Type I error: P(reject|H₀) = C(3,2)/C(5,2) = 0.3 (without replacement) vs (3/5)² = 0.36 (with replacement); Power: P(reject|H₁) = C(4,2)/C(5,2) = 0.6 vs (4/5)² = 0.64, showing higher power with replacement
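
The enumeration in (c) and the error probabilities in (e) can be checked with exact fractions. The sketch below is only a verification aid, not part of a written proof; the helper name `reject_prob` is hypothetical and introduced here for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

# (c) p.m.f. of the smaller score on two fair dice: enumerate all 36 outcomes.
counts = {x: 0 for x in range(1, 7)}
for a, b in product(range(1, 7), repeat=2):
    counts[min(a, b)] += 1
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

# (e) Probability of rejecting H0 (both drawn marbles white) when the urn
# holds `theta` white marbles out of `total`. Hypothetical helper.
def reject_prob(theta, with_replacement, total=5):
    if with_replacement:
        return Fraction(theta, total) ** 2                  # binomial scheme
    return Fraction(comb(theta, 2), comb(total, 2))         # hypergeometric scheme

alpha_wor = reject_prob(3, with_replacement=False)  # type I error, without replacement
alpha_wr = reject_prob(3, with_replacement=True)    # type I error, with replacement
power_wor = reject_prob(4, with_replacement=False)  # power, without replacement
power_wr = reject_prob(4, with_replacement=True)    # power, with replacement

print(pmf)
print(alpha_wor, alpha_wr, power_wor, power_wr)
```

Both type I error probabilities stay below 0·40, and the with-replacement power exceeds the without-replacement power, as the question asks to show.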
Q2
50 marks · Directive: construct · Topic: Sequential probability ratio test and order statistics

(a) Let a random variable X have exponential distribution with mean 1/θ, θ > 0. To test H₀ : θ = 3 against H₁ : θ = 2, construct sequential probability ratio test. Show that probability of terminating the test at the first stage when null hypothesis is true is 1 – 8/27 ((A–B)/AB), where B and A, B < A, are stopping bounds. (20 marks) (b) Each Sunday a fisherman visits one of three possible locations near his home : he goes to the sea with probability 1/2, to a river with probability 1/4, or to a lake with probability 1/4. If he goes to the sea there is an 80% chance that he will catch fish; corresponding figures for the river and the lake are 40% and 60% respectively. (i) Find the probability that, on a given Sunday, he catches fish. (ii) If, on a particular Sunday, he comes home without catching anything, determine the most likely place that he has been to. (5+10=15 marks) (c) Let X₁ < X₂ < X₃ be the order statistics from uniform population having probability density function f(x; θ) = 1/θ, 0 < x < θ. Show that 4X₁ is an unbiased estimator of θ. (15 marks)

Answer approach & key points

Construct the sequential probability ratio test for part (a) by deriving the likelihood ratio and identifying stopping bounds, allocating approximately 40% of effort given its 20 marks. For part (b), apply Bayes' theorem to solve the probability and posterior location problem, spending ~30% of time. For part (c), derive the distribution of the first order statistic and verify unbiasedness, using the remaining ~30%. Present derivations step-by-step with clear probabilistic reasoning throughout.

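  • Part (a): Derive the likelihood ratio Λₙ = f₁/f₀ = (2/3)ⁿ exp(∑Xᵢ) for the SPRT (density θe^(−θx), θ = 3 under H₀ and θ = 2 under H₁) with stopping bounds B < A, and show that the termination probability at the first stage under H₀ equals 1 − P(B < (2/3)exp(X₁) < A)
  • Part (a): Evaluate P(termination at stage 1 | H₀) = 1 − [exp(−3 ln(3B/2)) − exp(−3 ln(3A/2))] using the exponential CDF with rate 3, which simplifies to 1 − (8/27)((A³ − B³)/(A³B³)) for B ≥ 2/3; the cubes on A and B come from the rate θ = 3 under H₀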
  • Part (b)(i): Apply total probability theorem: P(catch) = (1/2)(0.8) + (1/4)(0.4) + (1/4)(0.6) = 0.65
  • Part (b)(ii): Use Bayes' theorem to find P(sea|no catch) = 0.1/0.35, P(river|no catch) = 0.15/0.35, P(lake|no catch) = 0.1/0.35; identify river as most likely location
  • Part (c): Derive PDF of X₍₁₎ as f₍₁₎(x) = 3(θ−x)²/θ³ for 0 < x < θ, compute E(X₍₁₎) = θ/4, and conclude E(4X₍₁₎) = θ proving unbiasedness
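
The total-probability and Bayes steps in part (b) can be reproduced with exact arithmetic. This is a minimal sketch; the dictionary names are hypothetical and chosen only for readability.

```python
from fractions import Fraction

# Prior probability of visiting each location and chance of a catch there.
prior = {"sea": Fraction(1, 2), "river": Fraction(1, 4), "lake": Fraction(1, 4)}
p_catch = {"sea": Fraction(4, 5), "river": Fraction(2, 5), "lake": Fraction(3, 5)}

# (i) Total probability of catching fish on a given Sunday.
p_fish = sum(prior[k] * p_catch[k] for k in prior)

# (ii) Posterior over locations given that nothing was caught (Bayes' theorem).
joint_no_catch = {k: prior[k] * (1 - p_catch[k]) for k in prior}
p_no_catch = sum(joint_no_catch.values())
posterior = {k: v / p_no_catch for k, v in joint_no_catch.items()}
most_likely = max(posterior, key=posterior.get)

print(p_fish, posterior, most_likely)
```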
Q3
50 marks · Directive: solve · Topic: Probability theory and statistical inference

(a) (i) How large a sample must be taken in order that the probability will be at least 0·90 that the sample mean will be within 0·4 – neighbourhood of the population mean, provided the population standard deviation is 2 ? (8 marks) (ii) Examine whether the weak law of large numbers holds for the sequence {Xₖ} of independent random variables defined as follows : $$P(X_k = -1 - \frac{1}{k}) = \frac{1}{2}\left\{1 - \left(1 - \frac{1}{k^2}\right)^{1/2}\right\},$$ $$P(X_k = 1 + \frac{1}{k}) = \frac{1}{2}\left\{1 + \left(1 - \frac{1}{k^2}\right)^{1/2}\right\}.$$ (7 marks) (b) Theoretical probabilities in the four cells of a multinomial distribution are $\frac{2+\theta}{4}$, $\frac{1-\theta}{4}$, $\frac{1-\theta}{4}$ and $\frac{\theta}{4}$, whereas the observed frequencies are 108, 27, 30 and 8 respectively, then estimate θ by maximum likelihood method. Also, obtain the standard error of the estimate. (20 marks) (c) If X is a random variable with characteristic function $$\varphi(t) = \begin{cases} 1-|t|, & |t| \leq 1 \\ 0, & \text{otherwise}, \end{cases}$$ then obtain the corresponding probability density function. (15 marks)

Answer approach & key points

Solve this multi-part numerical problem by allocating approximately 15 minutes to part (a)(i) on sample size determination using CLT, 15 minutes to part (a)(ii) on verifying WLLN conditions, 25 minutes to part (b) on MLE estimation and standard error computation for multinomial data, and 20 minutes to part (c) on deriving PDF from characteristic function via Fourier inversion. Begin each part with clear statement of the statistical principle being applied, show all computational steps explicitly, and conclude with precise numerical answers or definitive conclusions.

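  • Part (a)(i): Require P(|X̄ − μ| ≤ 0.4) ≥ 0.90 and apply the Central Limit Theorem with the two-sided 90% normal quantile z₀.₀₅ = 1.645 to obtain n ≥ (1.645 × 2/0.4)² = 67.65 → n = 68 (Chebyshev's inequality, which needs no normality, gives the more conservative n ≥ σ²/(0.10 × 0.4²) = 250)
  • Part (a)(ii): Verify E(Xₖ) = (1 + 1/k)√(1 − 1/k²) and Var(Xₖ) = (1 + 1/k)²/k² ≤ 4/k², so that Var(X₁ + ... + Xₙ)/n² → 0 and the WLLN holds by Chebyshev's inequality (Markov's condition)
  • Part (b): Formulate the multinomial likelihood L(θ) ∝ (2+θ)¹⁰⁸(1−θ)⁵⁷θ⁸, take the log-likelihood, and solve dℓ/dθ = 108/(2+θ) − 57/(1−θ) + 8/θ = 0, i.e. 173θ² + 14θ − 16 = 0, to get θ̂ ≈ 0.266; then use the Fisher information I(θ) = (n/4)[1/(2+θ) + 2/(1−θ) + 1/θ] with n = 173 to obtain SE(θ̂) = 1/√I(θ̂) ≈ 0.058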
  • Part (c): Apply Fourier inversion formula f(x) = (1/2π)∫₋₁¹ (1-|t|)e⁻ⁱᵗˣ dt, evaluate to obtain f(x) = (1/πx²)(1 - cos x) = (1/2π)sinc²(x/2) for x ≠ 0, with f(0) = 1/2π
  • Demonstrate understanding that characteristic function φ(t) = (1-|t|)₊ corresponds to triangular distribution on [-1,1] in frequency domain, yielding Fejér kernel/sinc² in density domain
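
For part (b), the score equation for the observed frequencies reduces to a quadratic in θ, which is easy to check numerically. The sketch below assumes the standard expected-information formula for this four-cell model; the variable names are illustrative.

```python
from math import sqrt

# Observed frequencies for cell probabilities (2+t)/4, (1-t)/4, (1-t)/4, t/4.
f1, f2, f3, f4 = 108, 27, 30, 8
n = f1 + f2 + f3 + f4

def score(t):
    """Derivative of the log-likelihood with respect to theta."""
    return f1 / (2 + t) - (f2 + f3) / (1 - t) + f4 / t

# Clearing denominators in score(t) = 0 gives a*t^2 + b*t + c = 0.
a = n
b = -(f1 - 2 * (f2 + f3) - f4)
c = -2 * f4
theta_hat = (-b + sqrt(b * b - 4 * a * c)) / (2 * a)  # the root lying in (0, 1)

# Expected Fisher information for the whole sample, evaluated at the MLE.
info = (n / 4) * (1 / (2 + theta_hat) + 2 / (1 - theta_hat) + 1 / theta_hat)
se = 1 / sqrt(info)

print(round(theta_hat, 4), round(se, 4))
```

The quadratic here is 173θ² + 14θ − 16 = 0; only its positive root is admissible because θ must lie in (0, 1).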
Q4
50 marks · Directive: discuss · Topic: Statistical estimation and hypothesis testing

(a) Consider Poisson distribution $$P_{\theta}(X = j) = \frac{e^{-\theta} \theta^{j}}{j!} = p_{j}, j = 0, 1, 2, ....$$ Let $f_{j}$ be the frequency for X = j and $E(f_{j}) = m_{j} = np_{j}$. Discuss how you obtain minimum chi-square estimate for $\theta$. Does minimum chi-square method necessarily yield a sufficient statistic even if it exists ? (20 marks) (b) (i) Let the joint probability density function of X and Y be $$f(x, y) = C . \exp \{-(4x^{2} + 9y^{2} - xy)\},$$ where C is a constant. Find E(X), V(X), E(Y), V(Y) and the correlation coefficient between X and Y. (10 marks) (ii) If $X_{1}, X_{2}, ..., X_{6}$ are independent random variables such that $$P(X_{i} = -1) = P(X_{i} = 1) = \frac{1}{2}, i = 1, 2, ..., 6,$$ then obtain the value of $$P\left[\sum_{i=1}^{6} X_{i} = 4\right].$$ (5 marks) (c) The following data present the time (in minutes), that a commuter had to wait to catch a bus to reach his destination : Use the sign-test at 0·05 level of significance to test the claim of the bus operators that commuters do not have to wait for more than 15 minutes before the bus is made available to them. [Given Z₍₀.₀₂₅₎ = 1·96, Z₍₀.₀₅₎ = 1·645] (15 marks)

Answer approach & key points

The directive 'discuss' in part (a) requires a balanced analytical treatment with derivation and critical evaluation, while parts (b) and (c) are primarily computational. Allocate approximately 40% of effort to part (a) given its 20 marks and theoretical depth, 30% to part (b) covering both (i) bivariate normal properties and (ii) probability calculation, and 30% to part (c) for the non-parametric test. Structure as: brief theoretical exposition for (a), systematic derivations for (b), and complete hypothesis testing procedure for (c).

  • For (a): Derivation of minimum chi-square estimator for Poisson parameter by minimizing Σ(f_j - np_j)²/(np_j) with respect to θ, leading to the estimating equation
  • For (a): Critical discussion that minimum chi-square method does NOT necessarily yield sufficient statistics—contrast with MLE which preserves sufficiency via factorization theorem; cite example where MCS estimator differs from sufficient statistic
  • For (b)(i): Recognition of bivariate normal form with zero means (μ_x = μ_y = 0); matching the exponent 4x² + 9y² − xy to ½z′Σ⁻¹z gives Σ⁻¹ = [[8, −1], [−1, 18]], hence variances σ_x² = 18/143, σ_y² = 8/143, covariance 1/143 and ρ = 1/12
  • For (b)(ii): Identification that ΣX_i follows distribution of (number of +1's) - (number of -1's), equivalent to 2×Binomial(6,½) - 6, yielding P(ΣX_i=4) = P(5 successes) = 6/64 = 3/32
  • For (c): Correct application of sign test with null hypothesis H₀: median ≤ 15 vs H₁: median > 15, counting positive signs (values > 15), using normal approximation with continuity correction, and proper conclusion based on Z = 1.645 critical value
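
For (b)(i) and (b)(ii), the moments follow from inverting the 2×2 precision matrix read off the exponent, and the probability from a binomial count. A minimal sketch with exact fractions, where `P` denotes the matrix such that the exponent equals −½ z′Pz (a name introduced here for illustration):

```python
from fractions import Fraction
from math import comb, sqrt

# Exponent -(4x^2 + 9y^2 - xy) = -1/2 * z' P z with z = (x, y).
P = [[Fraction(8), Fraction(-1)],
     [Fraction(-1), Fraction(18)]]

# Covariance matrix is the inverse of the precision matrix P.
det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
cov = [[P[1][1] / det, -P[0][1] / det],
       [-P[1][0] / det, P[0][0] / det]]

var_x, var_y, cov_xy = cov[0][0], cov[1][1], cov[0][1]
rho = float(cov_xy) / sqrt(float(var_x) * float(var_y))

# (b)(ii) Sum of six +/-1 variables equals 4 exactly when five of them are +1.
p_sum_4 = Fraction(comb(6, 5), 2 ** 6)

print(var_x, var_y, cov_xy, rho, p_sum_4)
```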
Q5
50 marks · Compulsory · Directive: solve · Topic: Linear models, multivariate normal, experimental design, sampling

(a) Define general linear model with usual assumptions. If y₁ = β₁ + u₁, y₂ = –β₁ + β₂ + u₂, y₃ = –β₂ + u₃, where u₁, u₂, u₃ are mutually independent random variables with mean zero and variance σ², then find the least square estimators of β₁ and β₂. (10 marks) (b) Given X ~ N₃(μ, Σ), where μ = (2, 4, 3)' and $$\Sigma = \begin{pmatrix} 8 & 2 & 3 \\ 2 & 4 & 1 \\ 3 & 1 & 3 \end{pmatrix},$$ (i) find the regression function of X₁ on X₂ and X₃, and (ii) compute the conditional variance of X₁ given X₂ and X₃. (10 marks) (c) What is a uniformity trial ? Explain how it can be used to determine optimum shape and size. (10 marks) (d) In a 2⁶ – factorial experiment, the key block is given as : (1), ab, cd, ef, ace, abef, abcd, bce, cdef, acf, ade, abcdef, bde, bcf, adf, bdf. Identify the confounded effects. (10 marks) (e) If the coefficients of variation of x and y are equal and the correlation coefficient between x and y is ρ = 2/3, compute the efficiency of ratio estimator relative to the mean of a simple random sample. (10 marks)

Answer approach & key points

This is a computational-cum-descriptive question requiring precise derivations and calculations across five sub-parts. Allocate approximately 20% time to part (a) for matrix formulation of GLM and LSE derivation, 20% to part (b) for multivariate normal conditional distributions, 15% to part (c) for explaining uniformity trials with agricultural field trial context, 25% to part (d) for systematic identification of confounded effects in 2⁶ factorial, and 20% to part (e) for ratio estimator efficiency computation. Begin each part with clear statement of method, show all computational steps, and conclude with boxed final answers.

  • Part (a): Correct matrix formulation of GLM y = Xβ + u with assumptions E(u)=0, Var(u)=σ²I; proper construction of the design matrix X with rows (1, 0), (−1, 1), (0, −1), so that X'X = [[2, −1], [−1, 2]] and X'y = (y₁ − y₂, y₂ − y₃)'; derivation of LSE β̂ = (X'X)⁻¹X'y yielding β̂₁ = (2y₁ − y₂ − y₃)/3 and β̂₂ = (y₁ + y₂ − 2y₃)/3
  • Part (b)(i): Correct partitioning of Σ into Σ₁₁, Σ₁₂, Σ₂₁, Σ₂₂ and computation of regression coefficients β = Σ₁₂Σ₂₂⁻¹ for E(X₁|X₂,X₃) = μ₁ + Σ₁₂Σ₂₂⁻¹(x₂-μ₂, x₃-μ₃)'
  • Part (b)(ii): Computation of conditional variance Var(X₁|X₂,X₃) = Σ₁₁ - Σ₁₂Σ₂₂⁻¹Σ₂₁ using Schur complement
  • Part (c): Definition of uniformity trial as trial with uniform treatment to assess field variability; explanation of how coefficient of variation and soil heterogeneity index guide selection of plot shape (long narrow for fertility gradient) and size (balancing variance reduction vs cost)
  • Part (d): Systematic identification of confounded effects as those interactions having an even number of letters in common with every treatment combination of the key block; with 16 combinations per block there are 2⁶/16 = 4 blocks, so three effects are confounded: ABCD, ABEF and their generalized interaction CDEF
  • Part (e): Application of ratio estimator efficiency formula RE = 1/(1 + Cₓ²/Cᵧ² − 2ρCₓ/Cᵧ), which with Cₓ = Cᵧ reduces to 1/(2(1 − ρ)); for ρ = 2/3 the efficiency is 3/2, i.e. 150%
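
The even-overlap rule for part (d) is mechanical enough to check by brute force over all 63 factorial effects. The following is a sketch under that rule; the function name `is_confounded` is hypothetical.

```python
from itertools import combinations

# Key (principal) block of the 2^6 experiment; "" stands for treatment (1).
key_block = ["", "ab", "cd", "ef", "ace", "abef", "abcd", "bce", "cdef",
             "acf", "ade", "abcdef", "bde", "bcf", "adf", "bdf"]
factors = "abcdef"

def is_confounded(effect):
    """An effect is confounded with blocks if it shares an even number of
    letters with every treatment combination in the key block."""
    return all(len(set(effect) & set(tc)) % 2 == 0 for tc in key_block)

confounded = []
for size in range(1, len(factors) + 1):
    for combo in combinations(factors, size):
        effect = "".join(combo)
        if is_confounded(effect):
            confounded.append(effect.upper())

print(confounded)
```

Because the key block has 16 = 2⁴ members, exactly 2² − 1 = 3 effects pass the test, and the third is the generalized interaction of the other two.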
Q6
50 marks · Directive: derive · Topic: ANOVA, sampling techniques, multivariate analysis

(a) In a set of two-way classified data according to k levels of factor A and r levels of factor B, there is one observation in each cell. Show that the total number of error contrasts is (r – 1) (k – 1). (15 marks) (b) Describe with examples the technique of two-stage sampling. Obtain the variance of the sample mean under two-stage sampling without replacement. Hence, deduce the variance of the sample mean under : (i) Stratified random sampling, and (ii) Cluster sampling (20 marks) (c) (i) If X₁ = Y₁ + Y₂, X₂ = Y₂ + Y₃, X₃ = Y₃ + Y₁, where Y₁, Y₂ and Y₃ are uncorrelated random variables and each of which has zero mean and unit standard deviation, find the multiple correlation coefficient between X₃ and X₁, X₂. (ii) Let X be a 3-dimensional random vector with dispersion matrix $$\Sigma = \begin{pmatrix} 9 & 3 & 3 \\ 3 & 9 & 3 \\ 3 & 3 & 9 \end{pmatrix}.$$ Determine the first principal component and the proportion of the total variability that it explains. (7+8=15 marks)

Answer approach & key points

Derive the required results systematically across all sub-parts. For (a), establish the linear model and count constraints; for (b), describe two-stage sampling with Indian census/NSSO examples, then derive variance formula and deduce special cases; for (c)(i), compute multiple correlation using matrix algebra; for (c)(ii), find eigenvalues and eigenvectors for PCA. Allocate approximately 30% time to (a), 40% to (b), 15% each to (c)(i) and (c)(ii), ensuring all derivations show complete steps with proper justification.

  • For (a): Define the two-way ANOVA model with one observation per cell, identify total contrasts (rk-1), subtract treatment contrasts (k-1 for factor A, r-1 for factor B), and show error contrasts = (r-1)(k-1) using degrees of freedom partition
  • For (b): Describe two-stage sampling with NSSO household survey or agricultural census example; derive variance of sample mean under SRSWOR at both stages; deduce stratified random sampling variance by selecting every first-stage unit (n = N), so that the between-unit component vanishes and the first-stage units act as strata
  • For (b) continued: Deduce cluster sampling variance by enumerating every second-stage unit within each selected first-stage unit (m = M), so that the within-unit component vanishes, showing how the general formula collapses to known special cases
  • For (c)(i): Compute Var(Xᵢ) = 2 and Cov(Xᵢ,Xⱼ) = 1 for all pairs, so every simple correlation is 1/2; set up multiple regression of X₃ on X₁,X₂; calculate R² = (r₃₁² + r₃₂² − 2r₃₁r₃₂r₁₂)/(1 − r₁₂²) = 1/3 and multiple correlation coefficient R₃.₁₂ = 1/√3 ≈ 0.577
  • For (c)(ii): Find eigenvalues of Σ (15, 6, 6), identify first principal component as (1/√3)(1,1,1)′X corresponding to λ=15, and compute proportion of variability as 15/27 = 5/9 ≈ 55.6%
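
Both results in part (c) can be confirmed without a linear-algebra library, since the dispersion matrix has the equicorrelated form 6I + 3J. The sketch below checks the eigenpairs by direct multiplication; `matvec` is a hypothetical helper.

```python
from fractions import Fraction

sigma = [[9, 3, 3],
         [3, 9, 3],
         [3, 3, 9]]

def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

# Candidate eigenvectors: (1,1,1) for the largest root, contrasts for the rest.
v1 = [1, 1, 1]
v2 = [1, -1, 0]
lam1 = matvec(sigma, v1)[0] // v1[0]   # sigma @ v1 = lam1 * v1
lam2 = matvec(sigma, v2)[0] // v2[0]   # sigma @ v2 = lam2 * v2

trace = sum(sigma[i][i] for i in range(3))
proportion = Fraction(lam1, trace)     # share of total variance on the first PC

# (c)(i): all pairwise correlations among X1, X2, X3 equal 1/2.
r12 = r13 = r23 = Fraction(1, 2)
R_squared = (r13 ** 2 + r23 ** 2 - 2 * r13 * r23 * r12) / (1 - r12 ** 2)

print(lam1, lam2, proportion, R_squared)
```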
Q7
50 marks · Directive: analyse · Topic: Design of experiments and multivariate analysis

(a) Consider the following data given for a BIBD with v = b = 4, r = k = 3, λ = 2 and N = 12 : Analyse the design. [Given that : F₃,₅ (0·05) = 5·41] 15 (b) (i) The data matrix of a random sample of size n = 3 from a bivariate normal population BVN (μ₁, μ₂, σ₁², σ₂², ρ) is X = [6 10; 10 6; 8 2]. Test the null hypothesis H₀ : μ = μ₀ against H₁ : μ ≠ μ₀, where μ₀' = (8, 5), at 10% level of significance. [You are given : F₀.₁₀; ₂, ₁ = 49·5, F₀.₁₀; ₁, ₂ = 8·53] (ii) Suppose n₁ = 11 and n₂ = 12, observations are made on two random vectors X₁ and X₂ which are assumed to have bivariate normal distribution with a common covariance matrix Σ, but possibly different mean vectors μ₁ and μ₂. The sample mean vectors and pooled covariance matrix are X̄₁ = (-1, -1)', X̄₂ = (2, 1)', S_pooled = (7 -1; -1 5). Obtain Mahalanobis sample distance D² and Fisher's linear discriminant function. Assign the observation X₀ = (0, 1)' to either population Π₁ or Π₂. 10+10=20 (c) A sample of size n is drawn with equal probability and without replacement from a population with size N. Let Ŷ_N = Σᵣ₌₁ⁿ aᵣ yᵣ be any linear estimate of the population mean Ȳ_N, where aᵣ are constants and yᵣ denotes the value of the unit included in the sample at the rᵗʰ draw. (i) Show that Ŷ_N is an unbiased estimate of Ȳ_N if and only if Σᵣ₌₁ⁿ aᵣ = 1 (ii) Under above condition V(Ŷ_N) = (S²/N)[NΣᵣ₌₁ⁿ aᵣ² - 1] (iii) If aᵣ = 1/n, for what value of n may this variance of the sample mean in simple random sampling without replacement be exactly half the variance of the mean of a random sample of the same size taken with replacement ? 15

Answer approach & key points

The directive 'analyse' demands systematic examination with computational rigour across all sub-parts. Allocate approximately 30% time to part (a) BIBD analysis, 40% to part (b) multivariate tests and discriminant analysis, and 30% to part (c) sampling theory proofs. Structure as: brief identification of appropriate statistical methods for each sub-part → step-by-step computational working with formulae stated → interpretation of results in context → final conclusions with statistical significance statements.

  • Part (a): Verify BIBD parameters satisfy λ(v-1) = r(k-1), construct ANOVA table with SST, SSB, SStr, SSE, compute F-ratio and compare with critical value 5.41 for treatment significance
  • Part (b)(i): Compute sample mean vector, sample covariance matrix S, Hotelling's T² statistic, convert to F-statistic using F = (n-p)/((n-1)p) × T² with p=2, compare with given critical value
  • Part (b)(ii): Calculate Mahalanobis D² = (X̄₁-X̄₂)'S_pooled⁻¹(X̄₁-X̄₂), derive Fisher's linear discriminant function Z = a'X where a = S_pooled⁻¹(X̄₁-X̄₂), compute discriminant scores and classify X₀
  • Part (c)(i): Prove unbiasedness by showing E(Ŷ_N) = Ȳ_N requires Σaᵣ = 1 using linearity of expectation and equal probability sampling properties
  • Part (c)(ii): Derive variance expression using V(yᵣ) = σ² and Cov(yᵣ, yₛ) = -σ²/(N-1) for r≠s, expand V(Σaᵣyᵣ) and simplify
  • Part (c)(iii): Set V(SRSWOR) = ½ V(SRSWR), i.e., (N-n)/(Nn) × S² = ½ × (N-1)S²/(Nn), since the with-replacement variance is σ²/n with σ² = (N-1)S²/N; solve to get n = (N+1)/2
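
The arithmetic for part (b) is small enough to carry out exactly. The sketch below assumes the usual conventions (divisor n − 1 in S, and assignment to Π₁ when the discriminant score is at least the midpoint); `inv2` and `quad` are hypothetical helpers for 2×2 matrices.

```python
from fractions import Fraction

def inv2(m):
    """Inverse of a 2x2 matrix given as a list of rows."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(u, m, v):
    """Bilinear form u' m v for 2-vectors and a 2x2 matrix."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))

# (b)(i) Hotelling's T^2 for H0: mu = (8, 5)'.
X = [[6, 10], [10, 6], [8, 2]]
n, p = len(X), 2
mean = [Fraction(sum(row[j] for row in X), n) for j in range(p)]
S = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in X) / (n - 1)
      for j in range(p)] for i in range(p)]
mu0 = [8, 5]
dev = [mean[j] - mu0[j] for j in range(p)]
T2 = n * quad(dev, inv2(S), dev)
F_stat = Fraction(n - p, (n - 1) * p) * T2   # compare with F(0.10; 2, 1) = 49.5

# (b)(ii) Mahalanobis D^2 and Fisher's linear discriminant.
m1 = [Fraction(-1), Fraction(-1)]
m2 = [Fraction(2), Fraction(1)]
Sp = [[Fraction(7), Fraction(-1)], [Fraction(-1), Fraction(5)]]
diff = [m1[i] - m2[i] for i in range(2)]
Sp_inv = inv2(Sp)
coef = [sum(Sp_inv[i][j] * diff[j] for j in range(2)) for i in range(2)]
D2 = sum(diff[i] * coef[i] for i in range(2))
cutoff = sum(coef[i] * (m1[i] + m2[i]) for i in range(2)) / 2
x0 = [0, 1]
score = sum(coef[i] * x0[i] for i in range(2))
assigned = "Pi1" if score >= cutoff else "Pi2"

print(T2, F_stat, D2, coef, cutoff, score, assigned)
```

Here the F statistic is far below 49·5, so H₀ is not rejected at the 10% level, and the discriminant score of X₀ falls on the Π₂ side of the midpoint.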
Q8
50 marks · Directive: derive · Topic: Regression, sampling and experimental design

(a) (i) What are orthogonal polynomials ? How do you fit an orthogonal polynomial of degree 'p' ? (ii) For the model Y_(n×1) = X_(n×k) β_(k×1) + u_(n×1), E(uu') = σ² I_n, where X_(n×k) is a matrix of rank k (k < n), find out the value of E[Y'(I_n - X(X'X)⁻¹X')Y]. 10+10=20 (b) Consider an artificial population of three farms. Their selection probabilities and the wheat production (in '000 tons) are as follows : Farm unit (i) : 1 2 3; Selection probability (pᵢ) : 0·3 0·2 0·5; Wheat production (yᵢ) : 11 6 25. Draw all possible samples of size 2 with replacement (order is to be considered). Show that Horvitz-Thompson estimator of total wheat production is unbiased. 15 (c) What is a missing plot technique ? Derive the missing value formula for a Latin Square Design. How would you proceed to analyse such a design ? 15

Answer approach & key points

Begin with (a)(i) defining orthogonal polynomials with the orthogonality condition Σφᵢ(x)φⱼ(x)=0 for i≠j, then describe the recurrence relation method for fitting degree p. For (a)(ii), recognize the residual sum of squares form and apply E[u'Mu]=σ²tr(M) to obtain (n-k)σ². In (b), enumerate all 9 ordered samples with replacement, compute πᵢ=1-(1-pᵢ)² for inclusion probabilities, verify E[Ŷ_HT]=Y. For (c), derive the Latin Square missing value formula ŷ=(t(R+C+T)-2G)/((t-1)(t-2)) and outline the adjusted ANOVA procedure. Allocate ~40% time to (a), ~30% each to (b) and (c).

  • (a)(i) Definition: orthogonal polynomials satisfy Σφᵢ(x)φⱼ(x)=0 for i≠j over the point set; fitting uses recurrence φᵣ₊₁(x)=(x-aᵣ)φᵣ(x)-bᵣφᵣ₋₁(x) with specific coefficient formulas
  • (a)(ii) Recognition that Iₙ-X(X'X)⁻¹X' is the residual maker matrix M; E[Y'MY]=E[u'Mu]=σ²tr(M)=σ²(n-k) using tr(Iₙ)=n and tr(X(X'X)⁻¹X')=k
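  • (b) Enumeration of 9 ordered samples: (1,1),(1,2),(1,3),(2,1),(2,2),(2,3),(3,1),(3,2),(3,3) with their probabilities pᵢpⱼ; calculation of first-order inclusion probabilities πᵢ=1-(1-pᵢ)², i.e. 0·51, 0·36 and 0·75; verification that Ŷ_HT, the sum of yᵢ/πᵢ over the distinct units in the sample, has expectation Σπᵢ(yᵢ/πᵢ) = Σyᵢ = 42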
  • (c) Missing plot technique: Yates' method for estimating missing observations by minimizing error sum of squares; derivation using ∂SSE/∂y=0 for Latin Square layout
  • (c) Latin Square missing value formula: ŷ = (tRᵢ + tCⱼ + tTₖ - 2G) / [(t-1)(t-2)] where R,C,T are respective totals and G is grand total; analysis proceeds with reduced degrees of freedom and bias correction in treatment SS
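
The unbiasedness check in (b) amounts to a weighted average over the nine ordered samples. A minimal sketch with exact fractions, assuming the Horvitz-Thompson estimator sums yᵢ/πᵢ over the distinct units of each sample:

```python
from fractions import Fraction
from itertools import product

# Selection probabilities and wheat production ('000 tons) for the three farms.
p = {1: Fraction(3, 10), 2: Fraction(1, 5), 3: Fraction(1, 2)}
y = {1: 11, 2: 6, 3: 25}

# Inclusion probability: unit i appears in at least one of the two draws.
pi = {i: 1 - (1 - p[i]) ** 2 for i in p}

expected = Fraction(0)
rows = []
for sample in product(p, repeat=2):              # 9 ordered samples
    prob = p[sample[0]] * p[sample[1]]
    estimate = sum(Fraction(y[i]) / pi[i] for i in set(sample))
    rows.append((sample, prob, estimate))
    expected += prob * estimate

true_total = sum(y.values())
print(expected, true_total)
```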

Paper II

8 questions · 400 marks
Q1
50 marks · Compulsory · Directive: distinguish · Topic: Statistical Quality Control and Operations Research

(a) Distinguish between process control and product control. Explain the various sources of variation encountered in a process control study. Suggest how they can be eliminated from the process. 10 marks

(b) The management of ABC company is considering the question of marketing a new product. The fixed cost required in the project is ₹ 4,000. Three factors are uncertain, viz., selling price, variable cost and annual sales volume. The product has life of only one year. The management has the data on three factors as under:

| Selling Price (₹) | Probability | Variable Cost (₹) | Probability | Sales Volume (units) | Probability |
|---|---|---|---|---|---|
| 3 | 0·2 | 1 | 0·3 | 2000 | 0·3 |
| 4 | 0·5 | 2 | 0·6 | 3000 | 0·3 |
| 5 | 0·3 | 3 | 0·1 | 5000 | 0·4 |

Consider the sequence of thirty random numbers 81, 32, 60, 04, 46, 31, 67, 25, 24, 10, 40, 02, 39, 68, 08, 59, 66, 90, 12, 64, 79, 31, 86, 68, 82, 89, 25, 11, 98, 16 and using the sequence (first 3 random numbers for the first trial, etc.), simulate the average profit for the above project on the basis of 10 trials. 10 marks

(c) If N(t) is a Poisson process and s < t, find P(N(s) = k | N(t) = n) and comment. 10 marks

(d) What are the assumptions made in the theory of games? Describe the maximin principle and minimax principle. Explain the algebraic method for games without saddle point. 10 marks

(e) What are the importances of censoring in life-testing experiments? Discuss the estimation of parameters involved in exponential distribution with mean θ, using type-2 censored sample. 10 marks

Answer approach & key points

Distinguish requires clear differentiation followed by explanation. Allocate approximately 20% time to part (a) on process vs product control with variation sources, 25% to part (b) simulation with proper random number mapping, 15% to part (c) conditional probability derivation for Poisson, 20% to part (d) game theory assumptions and algebraic method, and 20% to part (e) censoring importance and MLE for exponential Type-2. Begin with definitions, proceed to analytical derivations or computational steps, and conclude with interpretations.

  • Part (a): Clear distinction between process control (monitoring during production) and product control (acceptance sampling); identification of chance causes (random, inherent) and assignable causes (special, identifiable) of variation; remedial measures for each source
  • Part (b): Correct probability interval mapping for selling price (00-19, 20-69, 70-99), variable cost (00-29, 30-89, 90-99), and sales volume (00-29, 30-59, 60-99); proper profit calculation as (Price - Variable Cost) × Volume - Fixed Cost; accurate simulation table with 10 trials and average profit computation
  • Part (c): Derivation of P(N(s)=k|N(t)=n) using independent increments property resulting in Binomial(n, s/t) distribution; recognition that conditional distribution depends only on ratio s/t not on rate parameter λ
  • Part (d): Assumptions: two players, finite strategies, zero-sum, simultaneous moves, complete information; maximin (player A's security level) and minimax (player B's security level) principles; algebraic method using mixed strategies with probability variables p and q, solving simultaneous equations for value of game
  • Part (e): Importance of censoring: time/cost efficiency, ethical considerations, handling heavy-tailed distributions; Type-2 censoring with r failures out of n items; MLE derivation for θ with estimator T/r where T is total time on test, showing unbiasedness and variance properties
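
The simulation in part (b) is deterministic once the random-number intervals are fixed, so it can be replayed directly. The sketch below uses the interval mapping from the key points; the table layout and the `lookup` helper are hypothetical.

```python
# Thirty two-digit random numbers from the question, three per trial.
random_numbers = [81, 32, 60, 4, 46, 31, 67, 25, 24, 10,
                  40, 2, 39, 68, 8, 59, 66, 90, 12, 64,
                  79, 31, 86, 68, 82, 89, 25, 11, 98, 16]
FIXED_COST = 4000

# Each table lists (inclusive upper bound of the random-number interval, value).
price_table = [(19, 3), (69, 4), (99, 5)]
cost_table = [(29, 1), (89, 2), (99, 3)]
volume_table = [(29, 2000), (59, 3000), (99, 5000)]

def lookup(rn, table):
    """Map a random number in 00-99 to the value whose interval contains it."""
    for upper, value in table:
        if rn <= upper:
            return value
    raise ValueError(f"random number out of range: {rn}")

profits = []
for trial in range(10):
    r_price, r_cost, r_volume = random_numbers[3 * trial: 3 * trial + 3]
    price = lookup(r_price, price_table)
    cost = lookup(r_cost, cost_table)
    volume = lookup(r_volume, volume_table)
    profits.append((price - cost) * volume - FIXED_COST)

average_profit = sum(profits) / len(profits)
print(profits, average_profit)
```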
Q2
50 marks · Directive: solve · Topic: Statistical Quality Control and Reliability Theory

(a) Samples of size n = 5 units are taken from a process every hour. The x̄ and R̄ values for a particular quality characteristic are determined. After 25 samples have been collected, we obtain x̄̄ = 20 and R̄ = 4·56. (i) What are the three-sigma control limits for x̄ and R? (ii) Estimate the process standard deviation if both the charts exhibit control. (iii) Assume that the process output is normally distributed. If the specifications are 19 ± 5, what are your conclusions regarding the process capability? (iv) If the process mean shifts to 24, what is the probability of not detecting this shift on the first subsequent sample? (d₂ = 2·326, D₁ = 0, D₂ = 4·918, D₃ = 0, D₄ = 2·114, A = 1·342, A₂ = 0·577, A₃ = 1·427, C₄ = 0·940, B₃ = 0, B₄ = 2·089) 15 marks (b) Define a Weibull distribution with scale parameter α and shape parameter β. Obtain the hazard function and reliability function of the model. Show also that the distribution satisfies increasing, constant and decreasing failure rate based on suitable choice of the shape parameter. 15 marks (c) A company uses the following acceptance-sampling procedure—A sample equal to 10% of the lot is taken. If 2% or less of the items in the sample are defective, the lot is accepted, otherwise it is rejected. If the submitted lot varies in size from 5000 units to 10000 units, what can you say about the protection by this plan? If 0·05 is the LTPD, does this scheme offer reasonable protection to the consumer? 20 marks

Answer approach & key points

This is a numerical problem requiring systematic calculation across three parts. Allocate approximately 35% time to part (a) with its four sub-parts on control charts (15 marks), 30% to part (b) on Weibull distribution derivations (15 marks), and 35% to part (c) on acceptance sampling analysis (20 marks). Begin each part with clear identification of given parameters, show all formulas before substitution, and conclude with explicit interpretation of results.

  • Part (a)(i): Correct application of x̄ chart limits using A₂R̄ and R chart limits using D₃R̄, D₄R̄ with n=5
  • Part (a)(ii): Estimation of process standard deviation using σ̂ = R̄/d₂ = 4.56/2.326
  • Part (a)(iii): Calculation of Cp and Cpk indices comparing process capability with specification limits 19±5
  • Part (a)(iv): Calculation of β-risk (Type II error) using normal distribution for shifted mean μ=24
  • Part (b): Derivation of Weibull hazard function h(t) = (β/α)(t/α)^(β-1) and reliability R(t) = exp[-(t/α)^β], with IFR/CFR/DFR classification based on β
  • Part (c): Analysis of variable sample size plan, calculation of acceptance probability using binomial/Poisson approximation, and evaluation against LTPD=0.05 for consumer protection
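
The four computations in part (a) chain together, so a short script helps keep the numbers consistent. This is a sketch using the constants supplied in the question and assuming specification limits of 14 and 24 from 19 ± 5; `NormalDist` comes from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

n, xbarbar, rbar = 5, 20.0, 4.56
A2, D3, D4, d2 = 0.577, 0.0, 2.114, 2.326

# (i) Three-sigma limits for the x-bar and R charts.
ucl_x = xbarbar + A2 * rbar
lcl_x = xbarbar - A2 * rbar
ucl_r = D4 * rbar
lcl_r = D3 * rbar

# (ii) Process standard deviation estimated from the average range.
sigma_hat = rbar / d2

# (iii) Capability indices against the specification 19 +/- 5.
usl, lsl = 24.0, 14.0
cp = (usl - lsl) / (6 * sigma_hat)
cpk = min(usl - xbarbar, xbarbar - lsl) / (3 * sigma_hat)

# (iv) Probability that the first sample mean after a shift to 24 still
# falls inside the x-bar chart limits (the shift goes undetected).
shifted = NormalDist(mu=24.0, sigma=sigma_hat / sqrt(n))
beta = shifted.cdf(ucl_x) - shifted.cdf(lcl_x)

print(round(lcl_x, 3), round(ucl_x, 3), round(ucl_r, 3))
print(round(sigma_hat, 3), round(cp, 3), round(cpk, 3), round(beta, 3))
```

Both indices come out below 1, which suggests the process is not capable of meeting the specification and is also off-centre relative to the target of 19.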
Q3
50 marks · Directive: solve · Topic: Manpower planning, queuing theory and inventory management

(a) It is planned to raise a research team to a strength of 50 chemists, which is to be maintained. The wastage of recruits depends on their length of service which is as follows:

| Year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Total percentage who have left by the end of year | 5 | 36 | 55 | 63 | 68 | 73 | 79 | 87 | 97 | 100 |

What is the required number of recruitments per year necessary to maintain the required strength? There are 8 senior posts for which the length of service is the main criterion. What is the average length of service after which the next entrant expects promotion to one of these posts? (20 marks)

(b) Explain the structure of a queuing system. Explain M/M/1 queuing system and obtain steady-state solution. Also calculate busy period distribution. (15 marks)

(c) A company that operates for 50 weeks in a year is concerned about its stocks of copper cable. This costs ₹ 240 a metre and there is a demand for 8000 metres a week. Each replenishment costs ₹ 1,050 for administration and ₹ 1,650 for delivery, while holding costs are estimated at 25 percent of value held a year. Assuming that no shortages are allowed, what is the optimal inventory policy for the company? How would this analysis differ if the company wants to maximize its profits rather than minimize cost? What is the gross profit if the company sells the cable for ₹ 360 a metre? (15 marks)

Answer approach & key points

This is a multi-part numerical problem requiring you to solve three distinct operations research scenarios. Allocate approximately 40% of time to part (a) given its 20 marks, and 30% each to parts (b) and (c). Begin with clear problem identification for each sub-part, show all working steps with proper formulae, and conclude with precise numerical answers with units. For part (b), balance theoretical explanation with mathematical derivation.

  • Part (a): Calculate annual wastage rates from cumulative percentages, determine survival probabilities, compute required annual recruitment using renewal equation, and find average service length for promotion using weighted probability distribution
  • Part (a): Correctly interpret 'total percentage who have left' as cumulative distribution and derive conditional probabilities of leaving in each specific year
  • Part (b): Define queuing system components (arrival process, service mechanism, queue discipline) and derive steady-state probabilities for M/M/1 using balance equations with ρ = λ/μ < 1
  • Part (b): Obtain explicit formulas for P₀ = 1-ρ, Pₙ = ρⁿ(1-ρ), and derive busy period distribution using Takács formula or generating function approach
  • Part (c): Apply EOQ model with D = 400,000 metres/year, C₀ = ₹2,700, Cₕ = ₹60/metre/year, calculate optimal Q*, cycle time, and total minimum cost
  • Part (c): Distinguish cost minimization from profit maximization by incorporating revenue function, show that optimal quantity remains unchanged under constant price, and compute gross profit as revenue minus purchase and inventory costs: 360 × 400,000 − (240 × 400,000 + 360,000) = ₹47,640,000 a year
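
The EOQ figures for part (c) follow from the classic square-root formula. The sketch below assumes the basic model (constant demand, no shortages, instantaneous replenishment) and treats gross profit as revenue minus purchase, reorder and holding costs; variable names are illustrative.

```python
from math import sqrt

weekly_demand = 8000            # metres per week
weeks_per_year = 50
unit_cost = 240                 # rupees per metre
selling_price = 360             # rupees per metre
reorder_cost = 1050 + 1650      # administration + delivery, per order
holding_cost = 0.25 * unit_cost # rupees per metre per year

annual_demand = weekly_demand * weeks_per_year

# Economic order quantity and the policy it implies.
eoq = sqrt(2 * annual_demand * reorder_cost / holding_cost)
cycle_weeks = eoq / weekly_demand
orders_per_year = annual_demand / eoq

# Annual costs: reorder and holding components are equal at the optimum.
variable_cost = reorder_cost * orders_per_year + holding_cost * eoq / 2
total_cost = unit_cost * annual_demand + variable_cost

gross_profit = selling_price * annual_demand - total_cost

print(eoq, cycle_weeks, variable_cost, total_cost, gross_profit)
```

Since revenue does not depend on the order quantity when price and demand are fixed, maximizing profit leads to the same order quantity as minimizing cost.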
Q4
50 marks · Directive: solve · Topic: Linear programming, crew scheduling and sequential sampling

(a) Use penalty method to solve the following linear programming problem :

Maximize Z = x₁ + 2x₂ + 3x₃ - x₄

subject to the constraints

x₁ + 2x₂ + 3x₃ = 15
2x₁ + x₂ + 5x₃ = 20
x₁ + 2x₂ + x₃ + x₄ = 10
x₁, x₂, x₃, x₄ ≥ 0

(20 marks)

(b) An airline that operates seven days a week has the time-table shown below. Crew must have a minimum layover of 5 hours between flights. Obtain the pairing of flights that minimizes layover time away from home. For any given pairing, crew will be based at the city that results in the smaller layover :

Delhi-Jaipur

| Flight No. | Departure | Arrival |
|---|---|---|
| 1 | 7:00 AM | 8:00 AM |
| 2 | 8:00 AM | 9:00 AM |
| 3 | 1:30 PM | 2:30 PM |
| 4 | 6:30 PM | 7:30 PM |

Jaipur-Delhi

| Flight No. | Departure | Arrival |
|---|---|---|
| 101 | 8:00 AM | 9:15 AM |
| 102 | 8:30 AM | 9:45 AM |
| 103 | 12 Noon | 1:15 PM |
| 104 | 5:30 PM | 6:45 PM |

For each pair, also mention the city where the crew should be based. (15 marks)

(c) What are sequential sampling plans? Suggest a sequential sampling plan for which p₁ = 0·01, α = 0·05, p₂ = 0·06 and β = 0·10. (15 marks)

Answer approach & key points

Begin with the directive 'solve' for part (a), applying the Big-M penalty method to convert equality constraints and maximize the objective. Allocate approximately 40% of time to part (a) given its 20 marks, 30% to part (b) for crew scheduling optimization, and 30% to part (c) for sequential sampling theory and design. Structure as: (a) complete LP solution with simplex iterations, (b) layover time matrix and optimal pairing, (c) definition followed by ASN and OC curve construction.

  • Part (a): Convert to standard form using Big-M penalty for equality constraints; introduce artificial variables A₁, A₂ with -M coefficient in objective for the first two constraints (x₄ can serve as the initial basic variable in the third)
  • Part (a): Execute simplex iterations showing entering and leaving variables until optimality is reached at x₁ = x₂ = x₃ = 2.5, x₄ = 0 with Z_max = 15; verify by substituting into all three constraints
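  • Part (b): Construct the 4×4 layover matrices for Delhi-based and Jaipur-based crews, pushing any connection shorter than 5 hours to the next day; take the element-wise minimum and record which base gives it
  • Part (b): Solve the resulting assignment problem (Hungarian method); the minimum total layover is 52.5 hours, e.g. 1-102 (Jaipur base, 21.25 h), 2-104 (Delhi, 8.5 h), 3-101 (Delhi, 17.5 h), 4-103 (Jaipur, 5.25 h); the pairing 1-103, 2-104, 3-101, 4-102 is an alternate optimum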
  • Part (c): Define sequential sampling as item-by-item inspection with decision boundaries; state Wald's SPRT principles
  • Part (c): Calculate decision parameters h₁ ≈ 1.22, h₂ ≈ 1.57, s ≈ 0.028 and construct the acceptance line d = -1.22 + 0.028n and rejection line d = 1.57 + 0.028n; provide ASN (about 60 at p₁ and about 40 at p₂) and OC curve characteristics
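
The layover calculation in part (b) is easy to get wrong by hand, so a brute-force check may help. The sketch below is not the Hungarian method an examiner would expect; it simply enumerates all 24 pairings, which is feasible for a 4×4 problem. Times are entered as decimal hours from the timetable, and the helper names (`layover`, `best_layover`, `optimal_pairings`) are illustrative.

```python
from itertools import permutations

MIN_LAYOVER = 5.0  # hours

# (departure, arrival) in hours on a 24-hour clock, taken from the timetable.
DELHI_JAIPUR = {1: (7.0, 8.0), 2: (8.0, 9.0), 3: (13.5, 14.5), 4: (18.5, 19.5)}
JAIPUR_DELHI = {101: (8.0, 9.25), 102: (8.5, 9.75), 103: (12.0, 13.25), 104: (17.5, 18.75)}

def layover(arrival, departure):
    """Wait from arrival to the next usable departure (at least MIN_LAYOVER)."""
    gap = (departure - arrival) % 24
    return gap if gap >= MIN_LAYOVER else gap + 24

def best_layover(out_flight, back_flight):
    """Smaller of the two layovers for a pair, with the base city that achieves it."""
    out_dep, out_arr = DELHI_JAIPUR[out_flight]
    back_dep, back_arr = JAIPUR_DELHI[back_flight]
    delhi_based = layover(out_arr, back_dep)   # Delhi crew waits in Jaipur
    jaipur_based = layover(back_arr, out_dep)  # Jaipur crew waits in Delhi
    if delhi_based <= jaipur_based:
        return delhi_based, "Delhi"
    return jaipur_based, "Jaipur"

def optimal_pairings():
    """Enumerate all 4! assignments and keep those with minimum total layover."""
    outs, backs = sorted(DELHI_JAIPUR), sorted(JAIPUR_DELHI)
    best_total, best = float("inf"), []
    for perm in permutations(backs):
        total = sum(best_layover(o, b)[0] for o, b in zip(outs, perm))
        if total < best_total - 1e-9:
            best_total, best = total, [perm]
        elif abs(total - best_total) <= 1e-9:
            best.append(perm)
    return best_total, [list(zip(outs, p)) for p in best]

total, solutions = optimal_pairings()
print("minimum total layover (hours):", total)
for pairing in solutions:
    print([(o, b, *best_layover(o, b)) for o, b in pairing])
```

With the timetable as given, the enumeration finds a minimum of 52.5 hours shared by two pairings, which is a useful cross-check on the reduced matrix obtained by the Hungarian method.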
Q5
50M Compulsory calculate Time series, econometrics, life tables, psychometrics

(a) Apply the method of link relatives to the following data and calculate the seasonal indices :

Price of Rice (in ₹ per 10 kg)

| Quarter | 2001 | 2002 | 2003 | 2004 |
|---------|------|------|------|------|
| 1 | 75 | 86 | 90 | 100 |
| 2 | 60 | 65 | 72 | 78 |
| 3 | 54 | 63 | 66 | 72 |
| 4 | 59 | 80 | 82 | 93 |

10 marks

(b) Derive the means and variances of the sampling distributions of the OLS estimates of α and β in the two-variable linear model Y = α + βX + u. 10 marks

(c) Consider, in the usual notations, the equation y = Y₁β + X₁γ + u, where y is an (n × 1) vector, Y₁ is an (n × (g-1)) matrix, X₁ is an (n × k) matrix. Derive the equations for the two-stage least square method of estimation. 10 marks

(d) If the survivorship function l(x) in life table is linear between x and x+1, and complete expectations of life at ages 40 and 41 for a particular group of persons are 21·39 years and 20·91 years respectively and l(40) = 41176, find the number of persons that attain the age 41. 10 marks

(e) Compute the T-scores corresponding to test score x for the following frequency distribution :

| x | 1 | 2 | 3 | 4 | 5 |
|-----|---|---|---|---|---|
| f | 2 | 3 | 8 | 6 | 1 |

(Cumulative Normal Distribution Table is given in Page No. 9) 10 marks

Answer approach & key points

This multi-part question requires precise calculation and derivation across five distinct statistical domains. Allocate approximately 20% time to each sub-part: (a) construct link relatives table and seasonal indices using chain relatives method; (b) derive OLS estimators' properties using Gauss-Markov assumptions; (c) set up 2SLS normal equations showing projection onto instruments; (d) apply linear survivorship assumption to solve for l(41); (e) compute percentile ranks then transform to T-scores using given normal table. Present each part clearly with step-by-step working.

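  • For (a): Calculate link relatives by expressing each quarter's value as a percentage of the preceding quarter, average them quarter-wise, convert the averages to chain relatives (first quarter = 100), apply the trend correction, and normalize so the seasonal indices total 400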
  • For (b): Derive E(α̂) = α and E(β̂) = β showing unbiasedness, then derive Var(α̂) = σ²ΣX²/(nΣx²) and Var(β̂) = σ²/Σx², where x = X - X̄, using matrix or scalar algebra
  • For (c): State first stage projection Ŷ₁ = X(X'X)⁻¹X'Y₁ = PₓY₁, where X holds all predetermined variables of the system (including X₁), then second stage OLS of y on Ŷ₁ and X₁ to obtain the 2SLS estimator (β̂', γ̂')' = (Z'PₓZ)⁻¹Z'Pₓy where Z = [Y₁|X₁]
  • For (d): Use linearity of l(x), so that L(40) = ½[l(40) + l(41)] and T(40) = L(40) + T(41), to establish e°₄₀ = ½ + [l(41)/l(40)](e°₄₁ + ½), then solve l(41) = l(40)(e°₄₀ - ½)/(e°₄₁ + ½) = 41176 × 20.89/21.41 ≈ 40,176
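  • For (e): Compute cumulative frequencies, midpoint percentile ranks P = (100/N)(C - f/2) where C is the cumulative frequency up to and including the score and f is its frequency, find corresponding z-scores from normal table, then T = 50 + 10z for each score value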
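
The T-score steps for part (e) can be sketched as follows. This is one common convention (midpoint percentile ranks mapped through the inverse normal CDF); the exam expects table look-ups, so small rounding differences are possible. The function name `t_scores` is illustrative, and `statistics.NormalDist` stands in for the printed normal table.

```python
from statistics import NormalDist

def t_scores(scores, freqs):
    """Normalized T-scores: midpoint percentile rank -> normal deviate -> 50 + 10 z."""
    n = sum(freqs)
    cumulative = 0
    result = {}
    for x, f in zip(scores, freqs):
        cumulative += f
        # Proportion of cases below the midpoint of this score's frequency block.
        p = (cumulative - f / 2) / n
        z = NormalDist().inv_cdf(p)
        result[x] = 50 + 10 * z
    return result

table = t_scores(scores=[1, 2, 3, 4, 5], freqs=[2, 3, 8, 6, 1])
for x, t in table.items():
    print(x, round(t, 1))
```

For the given distribution (N = 20) the midpoint proportions are 0.05, 0.175, 0.45, 0.80 and 0.975, which lead to T-scores of roughly 34, 41, 49, 58 and 70.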
Q6
50M explain Time series analysis, econometrics, industrial statistics

(a) Explain Akaike information criterion for order selection in an ARMA (p, q) process. 15 marks (b) Define autocorrelation coefficient. What are its consequences for ordinary least squares? Discuss the maximum likelihood estimation of the model, in the usual notations, Y = Xβ + u with AR (autoregressive)(1) disturbance. 20 marks (c) Explain the method of collection of industrial data. Describe the (i) official publications for data collection and (ii) statistics collected by the various official agencies pertaining to industrial production. 15 marks

Answer approach & key points

Explain the theoretical foundations across all three parts with appropriate mathematical derivations where required. Allocate approximately 30% effort to part (a) on AIC and ARMA order selection, 40% to part (b) on autocorrelation and MLE estimation given its higher marks, and 30% to part (c) on industrial data collection methods. Structure with clear sectional headings, begin each part with precise definitions, develop through step-by-step reasoning, and conclude with practical implications or limitations.

  • Part (a): Definition of AIC as -2log(L) + 2k where L is the maximized likelihood and k is the number of estimated parameters (e.g. p + q + 1 when the innovation variance is counted); trade-off between goodness-of-fit and model complexity; comparison with BIC/AICc; application to ARMA(p,q) via minimization over candidate orders
  • Part (b): Autocorrelation coefficient ρ_k = Cov(u_t, u_{t-k})/Var(u_t); consequences for OLS: coefficient estimates remain unbiased (with exogenous regressors) but are inefficient, the usual standard errors are biased, and t/F tests are invalid; MLE derivation for AR(1) errors with disturbance covariance matrix σ²Ω, the Prais-Winsten transformation, concentrated likelihood, and iterative estimation
  • Part (c): Methods: census vs sample surveys, ASI (Annual Survey of Industries) schedule, establishment surveys; Official publications: ASI Summary Results, Index of Industrial Production (IIP), Economic Census; Agencies: CSO (now NSO), DIPP (now DPIIT), Labour Bureau, RBI industrial data
  • Mathematical rigor: Proper likelihood functions, matrix notation for GLS transformation, stationarity conditions for AR(1) parameter |ρ| < 1
  • Applied context: Indian industrial statistics system, ASI coverage of registered manufacturing, limitations of informal sector data
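
For part (b), the exact log-likelihood under AR(1) disturbances can be written out explicitly. The notation here is assumed rather than taken from the question: $u_t = \rho u_{t-1} + \varepsilon_t$ with $\varepsilon_t \sim N(0, \sigma_\varepsilon^2)$ independent, $|\rho| < 1$, and $x_t'$ denoting the $t$-th row of $X$.

$$\ln L(\beta, \rho, \sigma_\varepsilon^2) = -\frac{n}{2}\ln(2\pi\sigma_\varepsilon^2) + \frac{1}{2}\ln(1-\rho^2) - \frac{1}{2\sigma_\varepsilon^2}\left[(1-\rho^2)(y_1 - x_1'\beta)^2 + \sum_{t=2}^{n}\left\{(y_t - \rho y_{t-1}) - (x_t - \rho x_{t-1})'\beta\right\}^2\right]$$

For a fixed ρ, maximizing over β and σ_ε² amounts to least squares on the Prais-Winsten transformed data, so the likelihood can be concentrated into a function of ρ alone and maximized by a grid search or by iteration.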
Q7
50M explain Demography and vital statistics

(a) What are the various indices of mortality measure? Explain the purpose and procedure for standardizing them. (20 marks) (b) With usual notations, obtain logistic curve as given by P(t) = L / (1 + e^(r(β-t))) ; t > 0, β > 0, r > 0 for population growth model. Also discuss its any three properties. (15 marks) (c) In what way do total fertility rate (TFR), gross reproduction rate (GRR) and net reproduction rate (NRR) differ from one another as a measure of reproduction? (15 marks)

Answer approach & key points

The directive 'explain' demands conceptual clarity with logical exposition. For part (a) carrying 20 marks, allocate ~40% effort covering CDR, ASDR, IMR, MMR with direct and indirect standardization procedures; for (b) with 15 marks, spend ~30% on deriving the logistic curve from differential equation dP/dt = rP(1-P/L) and discussing properties like inflection point, asymptotic behavior, and symmetry; for (c) with 15 marks, devote remaining ~30% to contrasting TFR, GRR, NRR through formulas, assumptions, and replacement-level interpretations. Structure: brief intro → systematic part-wise treatment → integrated conclusion on demographic measurement evolution.

  • Part (a): Lists at least 5 mortality indices (CDR, ASDR, IMR, MMR, U5MR) with formulas; explains purpose of standardization (eliminating age-structure bias for inter-population/temporal comparison); describes direct standardization (applying standard population weights) and indirect standardization (applying standard rates to study population) with step-wise procedure
  • Part (a): Cites Indian context: Sample Registration System (SRS) mortality estimates, or NFHS standardized mortality ratios for interstate comparisons
  • Part (b): Derives logistic curve by solving dP/dt = rP(1-P/L) with initial condition P(0) = P₀; shows integration steps, substitution, and algebraic manipulation to reach given form with β = (1/r)ln[(L-P₀)/P₀]
  • Part (b): Discusses three properties—(i) sigmoid/S-shaped curve with inflection point at P=L/2, t=β; (ii) upper asymptote L (carrying capacity); (iii) growth rate parameter r determining steepness; may add symmetry or point of diminishing returns
  • Part (c): Distinguishes TFR (age-specific fertility rates summed, no mortality adjustment, both sexes), GRR (TFR × proportion female births, no mortality, female births only), NRR (GRR adjusted by survival probabilities lₓ to reproductive age, female generation replacement measure); notes NRR=1 indicates exact replacement, TFR≈2.1 replacement level for India
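  • Part (c): Clarifies that all three are normally computed from period age-specific rates for a synthetic cohort; TFR counts births per woman, whereas GRR/NRR count daughters per woman and are read as generation-replacement measures; NRR most complete for population projection while TFR most commonly reported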
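
The derivation asked for in part (b) can be laid out in three lines. This is a sketch assuming the growth law $dP/dt = rP(1 - P/L)$ with $0 < P < L$; the constant of integration $c$ is introduced here for illustration.

$$\frac{dP}{dt} = rP\left(1-\frac{P}{L}\right) \;\Rightarrow\; \left(\frac{1}{P} + \frac{1}{L-P}\right)dP = r\,dt \;\Rightarrow\; \ln\frac{P}{L-P} = rt + c$$

$$\frac{L-P}{P} = e^{-(rt+c)} \;\Rightarrow\; P(t) = \frac{L}{1+e^{-(rt+c)}} = \frac{L}{1+e^{r(\beta-t)}}, \qquad \beta = -\frac{c}{r}$$

$$P(0) = P_0 \;\Rightarrow\; \beta = \frac{1}{r}\ln\frac{L-P_0}{P_0}, \qquad \frac{d^2P}{dt^2} = r\,\frac{dP}{dt}\left(1-\frac{2P}{L}\right) = 0 \;\text{ at }\; P = \frac{L}{2},\; t = \beta$$

The last line gives the inflection point, and letting $t \to \infty$ in the second line gives the upper asymptote $L$.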
Q8
50M explain Index numbers and queuing theory

(a) How does the concept of wholesale price index work? Describe the major components of wholesale price index. Explain the methodology of index numbers of area, production and yield in agriculture. (15 marks) (b) Explain G/M/1 model and show that the steady-state arrival point system has a geometric distribution. (20 marks) (c) If e(x) is the average number of complete years of life lived by each of l(x) persons in life table population after attaining age x, and q(x) is the probability of dying within one year following the attainment of age x, prove that q(x) = (1 - (e(x) - e(x+1))) / (1 + e(x+1)) (15 marks)

Answer approach & key points

The directive 'explain' demands clear exposition with logical reasoning and supporting evidence. Structure your answer with: (a) ~30% time/space (15 marks): define WPI, list its three major components (primary articles, fuel & power, manufactured products), then detail the Laspeyres/Fisher methodology for agricultural index numbers with base year selection; (b) ~40% time/space (20 marks): define G/M/1 assumptions, derive the embedded Markov chain, use generating functions or recursive relations to prove the geometric steady-state distribution πₙ = (1-σ)σⁿ; (c) ~30% time/space (15 marks): start with life table definitions, noting that e(x) here is the curtate expectation (complete years only), express e(x) and e(x+1) as sums of l(x+k) divided by l(x) and l(x+1) respectively, then manipulate the identities algebraically to reach the required expression. Conclude each part with a brief summary of significance.

  • For (a): WPI definition as base-weighted index measuring wholesale price movements; three commodity groups with their weights (primary articles ~23%, fuel & power ~13%, manufactured products ~64% in India's WPI with base 2011-12)
  • For (a): Agricultural index methodology: fixed base vs chain base, Laspeyres formula P₀₁ = (Σp₁q₀/Σp₀q₀) × 100, area and yield indices as relatives with geometric/arithmetic mean aggregation
  • For (b): G/M/1 model specification — general inter-arrival distribution, exponential service (rate μ), single server; embedded Markov chain at arrival epochs with transition probabilities
  • For (b): Proof of geometric distribution — derive root σ of generating equation A*(μ(1-z)) = z where σ ∈ (0,1), show πⱼ = (1-σ)σʲ satisfies balance equations and normalization
  • For (c): Life table relationships: e(x) is the curtate expectation, e(x) = [l(x+1) + l(x+2) + …]/l(x), which gives the recursion l(x)e(x) = l(x+1)[1 + e(x+1)]; substituting l(x+1) = l(x)(1-q(x)) yields 1 - q(x) = e(x)/(1 + e(x+1)) and hence the identity
  • For (c): Verification that the derived expression behaves sensibly at the end of the table (e(x+1) = 0 gives q(x) = 1 - e(x)) and is consistent with the complete expectation e°(x) ≈ e(x) + ½
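
For part (b), the root σ of σ = A*(μ(1-σ)) rarely has a closed form, so a numerical illustration may help. The sketch below uses plain fixed-point iteration started from 0, which climbs to the smallest root in (0,1) when the traffic intensity is below 1. The rates `lam` and `mu` are hypothetical values, and the function names are illustrative.

```python
import math

def solve_sigma(lst, mu, tol=1e-12, max_iter=100_000):
    """Smallest root in (0, 1) of sigma = A*(mu * (1 - sigma)) by fixed-point iteration.

    lst: Laplace-Stieltjes transform A*(s) of the inter-arrival distribution.
    mu:  exponential service rate.
    """
    sigma = 0.0
    for _ in range(max_iter):
        nxt = lst(mu * (1.0 - sigma))
        if abs(nxt - sigma) < tol:
            return nxt
        sigma = nxt
    raise RuntimeError("fixed-point iteration did not converge")

def arrival_point_distribution(sigma, n_max):
    """Geometric probabilities pi_j = (1 - sigma) * sigma**j for j = 0..n_max."""
    return [(1.0 - sigma) * sigma ** j for j in range(n_max + 1)]

lam, mu = 2.0, 3.0  # hypothetical arrival and service rates (rho = 2/3)

# Exponential inter-arrivals (M/M/1): A*(s) = lam / (lam + s), so sigma should equal rho.
sigma_mm1 = solve_sigma(lambda s: lam / (lam + s), mu)

# Deterministic inter-arrivals (D/M/1): A*(s) = exp(-s / lam).
sigma_dm1 = solve_sigma(lambda s: math.exp(-s / lam), mu)

print("M/M/1 sigma:", sigma_mm1)
print("D/M/1 sigma:", sigma_dm1)
print("D/M/1 pi_0..pi_4:", arrival_point_distribution(sigma_dm1, 4))
```

The exponential case recovers σ = ρ, which is a handy sanity check to mention in an answer; the deterministic case gives a smaller σ, reflecting less congestion when arrivals are regular.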
