Statistics

UPSC Statistics 2023

All 16 questions from the 2023 Civil Services Mains Statistics paper across 2 papers — 800 marks in total. Each question comes with a detailed evaluation rubric, directive word analysis, and model answer points.

16 Questions
800 Total marks
2 Papers
2023 Exam year

Paper I

8 questions · 400 marks
Q1
50 marks · Compulsory · solve · Probability theory and distributions

(a) Out of 1000 persons born, only 900 reach the age of 15 years, and out of every 1000 who reach the age of 15 years, 950 reach the age of 50 years. Out of every 1000 who reach the age of 50 years, 40 die in one year. Accordingly, what is the probability that a person would attain the age of 51 years ? (10 marks) (b) Let X be a continuous random variable with probability density function : $$ f(x) = \begin{cases} \frac{x}{2}, & 0 \leq x < 1 \\ \frac{1}{2}, & 1 \leq x < 2 \\ \frac{3-x}{2}, & 2 \leq x < 3 \\ 0, & \text{elsewhere} \end{cases} $$ Obtain the cumulative distribution function of X and hence find the value of $P\left(X > \frac{3}{2}\right)$. (10 marks) (c) Let {Xₙ, n ≥ 1} be a sequence of mutually independent random variables such that P(Xₙ = nᵅ) = P(Xₙ = – nᵅ) = 0·5, for any α > 0. Derive the condition on α under which the sequence {Xₙ, n ≥ 1} obeys WLLNs. (10 marks) (d) Apply Run Test to test the randomness of the following sequence of H and T at 5% level of significance : HHHHHHTHHHHHTHTHHHH TTHHHHTHHHTTHHHHHH THHTTHHTHHH Given : Z₍₀·₀₂₅₎ = 1·96 Z₍₀·₀₅₎ = 1·645 (10 marks) (e) Differentiate between prior and posterior distributions. In case of squared error loss function, find out the Bayes estimator for unknown parameter. (10 marks)

Answer approach & key points

Solve each sub-part systematically with clear mathematical working. For (a), apply conditional probability using survival data; (b) integrate piecewise to find CDF and evaluate tail probability; (c) apply the variance-based (Chebyshev/Markov) sufficient condition for the WLLN, since Khinchin's theorem needs identically distributed variables; (d) count runs and apply normal approximation for hypothesis testing; (e) state Bayes theorem and minimize posterior expected loss. Allocate approximately 2 minutes per mark, presenting each solution with clear labeling and logical flow from given information to final answer.

  • (a) Correct application of chain rule for conditional probability: P(age 51) = P(survive to 15) × P(survive to 50 | 15) × P(survive 50-51 | 50) = 0.9 × 0.95 × 0.96 = 0.8208
  • (b) Proper piecewise integration of f(x) to obtain F(x) = x²/4 on [0, 1), (2x − 1)/4 on [1, 2) and 1 − (3 − x)²/4 on [2, 3), with continuity checks at x=1 and x=2, then P(X > 3/2) = 1 - F(3/2) = 1 - 1/2 = 1/2 (as expected, since the density is symmetric about 3/2)
  • (c) Derivation that E(Xₙ) = 0, Var(Xₙ) = n^(2α), and application of the Chebyshev (Markov) sufficient condition (1/n²)ΣVar(Xᵢ) → 0, yielding condition α < 1/2
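  • (d) Correct counting of runs (r = 19 for the sequence as printed, with n₁ = 36 heads and n₂ = 12 tails), expected runs μᵣ = 2n₁n₂/(n₁+n₂) + 1 = 19, variance σᵣ² = 2n₁n₂(2n₁n₂ − n₁ − n₂)/[(n₁+n₂)²(n₁+n₂−1)] ≈ 6.51, and Z-test showing |Z| = 0 < 1.96 so randomness not rejected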
  • (e) Clear distinction: prior π(θ) represents pre-sample belief, posterior π(θ|x) ∝ L(x|θ)π(θ) updates belief; Bayes estimator under squared error loss is posterior mean E[θ|x]
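
The run count in (d) is easy to get wrong by hand, so a short check can help. The following is a minimal Python sketch, not an official solution; the function name `runs_test` is an illustrative assumption, and the sequence is taken exactly as printed in the question with spaces ignored.

```python
import math

def runs_test(sequence: str) -> dict:
    """Runs test with the normal approximation for a two-symbol sequence."""
    symbols = [c for c in sequence if not c.isspace()]
    kinds = sorted(set(symbols))
    if len(kinds) != 2:
        raise ValueError("sequence must contain exactly two distinct symbols")
    n1 = symbols.count(kinds[0])
    n2 = symbols.count(kinds[1])
    n = n1 + n2
    # A new run starts wherever two neighbouring symbols differ.
    runs = 1 + sum(1 for a, b in zip(symbols, symbols[1:]) if a != b)
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    return {"counts": {kinds[0]: n1, kinds[1]: n2}, "runs": runs,
            "mean": mean, "var": var, "z": z}

# Sequence as printed in Q1(d)
result = runs_test("HHHHHHTHHHHHTHTHHHH TTHHHHTHHHTTHHHHHH THHTTHHTHHH")
random_at_5_percent = abs(result["z"]) < 1.96
```

For the printed sequence this gives 36 heads, 12 tails and 19 runs, equal to the expected number of runs, so the hypothesis of randomness is not rejected.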
Q2
50 marks · derive · Joint distributions and convergence of random variables

(a) Let X, Y, Z be three mutually independent standard exponential variates and W₁ = X + Y + Z, W₂ = (X + Y)/(X + Y + Z), W₃ = X/(X + Y). Then (i) determine the joint distribution of W₁, W₂ and W₃. (ii) find out the marginal probability density functions of W₁, W₂ and W₃. (iii) examine the mutual independence of W₁, W₂ and W₃, and give your comment. (10+6+4=20 marks) (b) Give an example to prove or disprove the following : P(lim sup Aₙ) = 0 ⇒ Σₖ₌₁^∞ P(Aₖ) < ∞, for any sequence {Aₙ, n ≥ 1} of events defined on a probability space (Ω, 𝓐, P). (15 marks) (c) Let {Yₙ, n ≥ 1} be a sequence of random variables and Y be a degenerate random variable. Examine whether 'Yₙ converges in distribution to Y' implies 'Yₙ converges in probability to Y'. (15 marks)

Answer approach & key points

Derive the joint and marginal distributions systematically using transformation of variables for part (a), spending approximately 40% of time on this highest-weighted section. For (b), disprove the implication with a counterexample built from dependent (nested) events, e.g. Aₙ = (0, 1/n) under Lebesgue measure on (0, 1), so that P(Aₙ) = 1/n; independent events cannot serve here, because the second Borel-Cantelli lemma would then force P(lim sup Aₙ) = 1. For (c), prove the equivalence of convergence in distribution and probability for degenerate limits using the definition of degenerate distribution and properties of weak convergence. Structure: direct derivations for (a)(i)-(iii), counterexample construction with verification for (b), and rigorous proof with necessary lemmas for (c).

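  • For (a)(i): Apply Jacobian transformation from (X,Y,Z) to (W₁,W₂,W₃) via x = w₁w₂w₃, y = w₁w₂(1 − w₃), z = w₁(1 − w₂), correctly computing the Jacobian determinant |J| = w₁²w₂ and establishing support w₁ > 0, 0 < w₂ < 1, 0 < w₃ < 1 (a product region, with no ordering between w₂ and w₃), so that the joint density is w₁²w₂e^(−w₁)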
  • For (a)(ii): Integrate appropriately to obtain Gamma(3,1) for W₁, Beta(2,1) for W₂, and Beta(1,1)=Uniform(0,1) for W₃
  • For (a)(iii): Verify factorization of joint PDF and conclude mutual independence of W₁, W₂, W₃ with proper statistical interpretation
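  • For (b): Construct valid counterexample where P(Aₙ) = 1/n (or similar) with nested events such as Aₙ = (0, 1/n) on (0, 1) under Lebesgue measure, verify directly that lim sup Aₙ = ∩ₙ Aₙ = ∅ so P(lim sup Aₙ) = 0, yet ΣP(Aₙ) diverges (the first Borel-Cantelli lemma gives only the reverse implication)
  • For (c): Prove that for degenerate Y = c, Yₙ →ᵈ Y implies P(Yₙ ≤ y) → 0 for y < c and → 1 for y > c, hence P(|Yₙ − c| > ε) ≤ P(Yₙ ≤ c − ε) + 1 − P(Yₙ ≤ c + ε) → 0, i.e. Yₙ →ᵖ Y by the definition of convergence in probability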
  • For (c): Establish the converse is trivial, giving complete equivalence for degenerate limits
Q3
50 marks · prove · Probability theory and statistical inference

(a) (i) If X is a random variable with finite variance, show that $\lim_{n \to \infty} n^2 P\{|X| > n\} = 0$. (10 marks) (ii) In a certain recruitment test, there are multiple choice questions. There are four possible options to each question, out of which one is correct. The probability of knowing correct option for an intelligent student is 90%, while it is 20% for a weaker student. An intelligent student ticks the correct option. What is the probability that he was guessing ? (10 marks) (b) Determine whether the sequence of mutually independent random variables {Xₙ, n ≥ 1}, in which P(Xₙ = ± n^λ) = 1/(2n^(2λ)), P(Xₙ = 0) = 1 - 1/n^(2λ) (λ < 1/2), obeys Central Limit Theorem (CLT) or not. (15 marks) (c) Define Sequential Probability Ratio Test (SPRT) along with its operating characteristic function and average sample number. Determine SPRT for testing H₀ : θ = 4 against H₁ : θ = 5 in N(θ, 1) with α = 0·5 and β = 0·2. (15 marks)

Answer approach & key points

Prove the limit result in (a)(i) using Markov/Chebyshev inequalities; solve (a)(ii) using Bayes' theorem with clear event definitions; for (b), verify Lindeberg condition or Lyapunov's theorem to establish CLT validity; for (c), define SPRT components then derive boundaries A, B and continuation region. Allocate ~20% time to (a)(i), ~15% to (a)(ii), ~30% to (b), and ~35% to (c) given mark distribution. Structure: state definitions → apply methods → derive results → interpret findings.

  • (a)(i): Application of Markov's inequality or direct variance bound to show n²P{|X|>n} ≤ E[X²I(|X|>n)] → 0
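  • (a)(ii): Bayes' theorem setup with events K (knows), G (guesses), C (correct); calculation of P(G|C) = P(C|G)P(G)/P(C) = (0.25 × 0.1)/(0.9 + 0.25 × 0.1) = 1/37 ≈ 0.027 for the intelligent student, taking a guess to be correct with probability 1/4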
  • (b): Verification of Lindeberg condition or checking variance of sum → ∞; showing standardized sum converges to N(0,1)
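  • (b): Explicit computation that E(Xₙ) = 0 and Var(Xₙ) = n^(2λ) · n^(−2λ) = 1, so Var(Sₙ) = n → ∞; Lyapunov's condition (with δ = 2) holds because ΣE|Xₖ|⁴/n² = Σk^(2λ)/n² is of order n^(2λ−1) → 0 for λ < 1/2, so the CLT holds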
  • (c): Definition of SPRT, OC function L(θ), and ASN Eθ(N); derivation of Wald's boundaries A ≈ (1-β)/α, B ≈ β/(1-α)
  • (c): Specific SPRT construction for N(θ,1): continuation region as sum(Xi - 4.5) between adjusted log-boundaries
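
The SPRT construction in (c) can be sketched in code. This is a minimal illustration under stated assumptions: the function name `sprt_normal_mean` and the sample values are hypothetical, Wald's approximate boundaries are used, and α = 0.5, β = 0.2 are taken as printed in the question.

```python
import math

def sprt_normal_mean(xs, theta0=4.0, theta1=5.0, alpha=0.5, beta=0.2):
    """Wald's SPRT for H0: theta = theta0 vs H1: theta = theta1 in N(theta, 1).

    Returns (decision, number of observations used).
    """
    log_a = math.log((1 - beta) / alpha)   # upper boundary: reject H0
    log_b = math.log(beta / (1 - alpha))   # lower boundary: accept H0
    llr = 0.0
    for n, x in enumerate(xs, start=1):
        # log f1(x)/f0(x) = (theta1 - theta0) * (x - (theta0 + theta1) / 2)
        llr += (theta1 - theta0) * (x - (theta0 + theta1) / 2)
        if llr >= log_a:
            return "reject H0", n
        if llr <= log_b:
            return "accept H0", n
    return "continue sampling", len(xs)

# Hypothetical observations, for illustration only
print(sprt_normal_mean([4.0, 4.0]))
print(sprt_normal_mean([5.0]))
```

With θ₀ = 4 and θ₁ = 5 the log-likelihood ratio reduces to Σ(xᵢ − 4.5), so sampling continues while ln(0.4) < Σ(xᵢ − 4.5) < ln(1.6).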
Q4
50 marks · explain · Statistical inference and hypothesis testing

(a) What is the role of properties of completeness and sufficiency in Statistical Inference ? Explain. In U (0, θ), find out Uniformly Minimum Variance Unbiased Estimator (UMVUE) of θ. (20 marks) (b) A survey of 400 families with four children each have the following distribution :

| Number of boys | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Number of families | 16 | 89 | 145 | 118 | 32 |

Is this result consistent with the hypothesis that male and female births are equally probable at 5% level of significance ? It is given that χ²_(.05) for 4 degrees of freedom = 9·488 and χ²_(.05) for 5 degrees of freedom = 11·070. (c) Define Likelihood Ratio Test. In N(θ, σ²), where σ² is unknown, find out LR test for testing H₀ : θ = θ₀ against H₁ : θ ∈ (Ω – θ₀), where Ω is the parametric space for θ. α is the size of the test.

Answer approach & key points

Explain the theoretical foundations and derive the required estimators and tests across all three parts. Spend approximately 40% of effort on part (a) covering completeness, sufficiency and UMVUE derivation; 30% on part (b) for chi-square goodness-of-fit test with correct degrees of freedom and conclusion; and 30% on part (c) for likelihood ratio test definition and derivation in normal distribution with unknown variance. Structure with clear headings for each sub-part, stating definitions first, then derivations, and ending with explicit final answers.

  • Part (a): Define completeness and sufficiency; explain their roles in reducing data without loss of information and enabling unbiased estimation via Rao-Blackwell and Lehmann-Scheffé theorems
  • Part (a): For U(0,θ), identify sufficient statistic T = X_(n), prove completeness, and derive UMVUE of θ as ((n+1)/n)X_(n) with proper justification
  • Part (b): Set up H₀: p = 0.5 (equal probability), calculate expected frequencies under Binomial(4, 0.5), compute chi-square statistic correctly
  • Part (b): Use correct degrees of freedom = 4 (not 5), compare with critical value 9.488, and state conclusion about hypothesis
  • Part (c): Define Likelihood Ratio Test (LRT) as λ(x) = sup_{θ∈Θ₀}L(θ)/sup_{θ∈Θ}L(θ)
  • Part (c): Derive LRT statistic for N(θ,σ²) with unknown σ², showing it reduces to t-test with rejection region |t| > t_{α/2,n-1}
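
The goodness-of-fit arithmetic for part (b) can be checked with a few lines of Python. This is a sketch using only the figures printed in the question; the variable names are illustrative.

```python
from math import comb

observed = [16, 89, 145, 118, 32]        # families with 0..4 boys
n_children = 4
total = sum(observed)                     # 400 families

# Expected frequencies under H0: number of boys ~ Binomial(4, 0.5)
expected = [total * comb(n_children, k) * 0.5 ** n_children
            for k in range(n_children + 1)]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                    # no parameter estimated, so 5 - 1 = 4
critical_value = 9.488                    # given in the question for 4 d.f.
reject_h0 = chi_sq > critical_value
```

The expected frequencies are 25, 100, 150, 100, 25 and the statistic is about 9.82, which exceeds 9.488, so on these figures the hypothesis of equally probable male and female births is rejected at the 5% level.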
Q5
50 marks · Compulsory · solve · Multivariate normal distribution and linear models

(a) (i) If **X** = (X₁ X₂ X₃)' is distributed as N₃ (μ, Σ), find the distribution of [(X₁ – X₂) (X₂ – X₃)]'. (5 marks) (ii) Suppose that **X** = (X₁ X₂ X₃)' ~ N₃ (**0**, Σ), where Σ = $\begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & \rho \\ 0 & \rho & 1 \end{pmatrix}$. Is there a value of ρ for which (X₁ + X₂ + X₃) and (X₁ – X₂ – X₃) are independent ? (5 marks) (b) Show that **X** = (X₁, X₂, ..., Xₚ)' has p-variate normal distribution if and only if every linear combination (l₁X₁ + l₂X₂ + ... + lₚXₚ) of **X** follows a univariate normal distribution. (10 marks) (c) Let x₁, x₂, ..., xₙ be n given observations, and suppose that Yᵢ = β₀ + β₁xᵢ + eᵢ; i = 1, 2, ..., n, where β₀, β₁ are unknown parameters and eᵢ are mutually independent normal random variables with E(eᵢ) = 0 and V(eᵢ) = σ², i = 1, 2, ..., n. Also, σ² is assumed to be unknown. Test the null hypothesis H₀ : β₀ = β₁ = 0. (10 marks) (d) Complete the following analysis of variance table of a design and examine whether there is a significant difference between the treatments at 5% level of significance: | Source of Variation | Degrees of Freedom | Sum of Squares | Mean Sum of Squares | Variance Ratio | |---------------------|-------------------|----------------|---------------------|----------------| | Blocks | — | 21 | 4·2 | — | | Treatments | — | — | 5·0 | — | | Error | 15 | 12 | — | | | Total | — | — | | | Given that F_{·05}(3, 15) = 8·70, F_{·05}(5, 15) = 4·62 (10 marks) (e) Define regression estimator used for the estimation of population mean. Obtain its bias and Mean Square Error (MSE) to the first order of approximation. (10 marks)

Answer approach & key points

Solve this multi-part numerical problem by allocating time proportionally to marks: spend ~20% on (a)(i)-(ii) combined, ~20% on (b), ~20% on (c), ~20% on (d), and ~20% on (e). Begin each sub-part by stating the relevant theorem or formula, show complete derivation/calculation steps, and conclude with precise final answers. For (d), complete the ANOVA table systematically before hypothesis testing. For (e), clearly define the estimator before deriving bias and MSE.

  • (a)(i) Apply linear transformation theorem: if Y = AX, then Y ~ N₂(Aμ, AΣA') with correct matrix A = [[1,-1,0],[0,1,-1]]
  • (a)(ii) Use independence condition Cov(X₁+X₂+X₃, X₁-X₂-X₃) = 0; solve for ρ = -1/2 and verify validity
  • (b) Prove both directions: (⇒) by definition of MVN, (⇐) using characteristic functions or Cramér-Wold theorem
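  • (c) Set up F-test for H₀: β₀=β₁=0 using extra sum of squares; compute F = [(ΣYᵢ² − SSE)/2]/[SSE/(n-2)], where ΣYᵢ² − SSE is the uncorrected regression sum of squares (H₀ restricts the intercept as well), and reject H₀ when F > F_α(2, n-2)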
  • (d) Complete ANOVA table: Blocks df=5, Treatments df=3 and SS=15, Total df=23, Total SS=21+15+12=48, Error MS=0.8; compute F_Treatments=5.0/0.8=6.25 and compare with the given critical value F_{·05}(3, 15)
  • (e) Define regression estimator Ŷ_reg = ȳ + b(X̄ - x̄); derive bias = −Cov(x̄, b), which is of order 1/n and negligible in large samples, and MSE ≈ (1-f)S²_y(1-ρ²)/n to first order
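
The table completion in (d) follows mechanically from the given entries. The sketch below assumes the design is a randomised block design, so that error d.f. = (blocks d.f.) × (treatments d.f.); the question only says "a design", so this is an assumption, and the variable names are illustrative.

```python
# Entries given in the question
blocks_ss, blocks_ms = 21.0, 4.2
treatments_ms = 5.0
error_df, error_ss = 15, 12.0

blocks_df = round(blocks_ss / blocks_ms)          # SS / MS = 5
treatments_df = error_df // blocks_df             # RBD assumption: 15 / 5 = 3
treatments_ss = treatments_ms * treatments_df     # 15
error_ms = error_ss / error_df                    # 0.8

total_df = blocks_df + treatments_df + error_df
total_ss = blocks_ss + treatments_ss + error_ss

f_treatments = treatments_ms / error_ms
f_blocks = blocks_ms / error_ms

# Critical value exactly as printed in the question for (3, 15) d.f.
f_critical_given = 8.70
treatments_significant = f_treatments > f_critical_given
```

This yields total d.f. 23 and total SS 48, with F for treatments equal to 6.25 on (3, 15) d.f.; against the critical value 8·70 printed in the question, the treatment differences would not be declared significant.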
Q6
50 marks · solve · Multivariate analysis and principal components

(a) Let **X** = (X₁ X₂ X₃)' be distributed as N₃ (μ, Σ), where μ = (2 −1 3)' and Σ = $\begin{pmatrix} 4 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 3 \end{pmatrix}$. Find (i) the conditional distribution of (X₁ X₂)' given X₃ = 2. (ii) partial correlation coefficient ρ₁₂.₃ and multiple correlation coefficient R₁.₂₃ (8+7 marks) (b) (i) Describe the complete analysis of two-way classified data with multiple (but equal) observations per cell, clearly stating the assumptions used. Also state two examples where such type of analysis is used. (ii) Let three mutually independent variables Y₁, Y₂ and Y₃ having common variance σ² and E(Y₁) = β₁ + β₂, E(Y₂) = β₁ + β₃, E(Y₃) = β₁ + β₂ be given. Show that the linear parametric function p₁β₁ + p₂β₂ + p₃β₃ is estimable if and only if p₁ = p₂ + p₃, clearly stating the assumptions used, if any. (5 marks) (c) (i) State briefly three reasons why an analyst may wish to perform a principal component analysis. (6 marks) (ii) Define canonical correlations and give two examples of their application. Describe the procedure of working out canonical correlations and canonical variates. (9 marks)

Answer approach & key points

Solve this multi-part numerical and theoretical question by allocating approximately 35% time to part (a) due to its 15 marks and computational complexity, 25% to part (b) covering ANOVA and estimability, and 40% to part (c) on PCA and canonical correlations. Begin with clear problem identification for each sub-part, show all computational steps with matrix operations for (a), present structured ANOVA decomposition for (b)(i) and rigorous linear algebra proof for (b)(ii), and provide conceptual clarity with real-world Indian examples for (c). Conclude each part with precise final answers and interpretations.

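  • Part (a)(i): Correctly partition Σ into Σ₁₁, Σ₁₂, Σ₂₁, Σ₂₂ (with Σ₂₂ = σ₃₃ = 3) and apply conditional distribution formula N₂(μ₁ + Σ₁₂Σ₂₂⁻¹(x₃-μ₃), Σ₁₁ - Σ₁₂Σ₂₂⁻¹Σ₂₁) with x₃=2, giving mean (2, −4/3)' and covariance matrix with rows (4, 1) and (1, 5/3)
  • Part (a)(ii): Compute partial correlation ρ₁₂.₃ = (σ₁₂ - σ₁₃σ₂₃/σ₃₃)/√[(σ₁₁-σ₁₃²/σ₃₃)(σ₂₂-σ₂₃²/σ₃₃)] and multiple correlation R₁.₂₃ = √[σ₁'Σ₍₂₃₎⁻¹σ₁/σ₁₁] where σ₁' = (σ₁₂, σ₁₃) and Σ₍₂₃₎ is the covariance matrix of (X₂, X₃), not the Σ₂₂ of part (i); for the given Σ both equal √(3/20) ≈ 0.387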
  • Part (b)(i): Describe two-way ANOVA with replication: model yᵢⱼₖ = μ + αᵢ + βⱼ + (αβ)ᵢⱼ + εᵢⱼₖ, assumptions (normality, homoscedasticity, independence), ANOVA table with SS_T, SS_A, SS_B, SS_AB, SS_E, and examples like agricultural field trials (ICRISAT crop studies) or industrial quality control
  • Part (b)(ii): Set up design matrix X, show rank deficiency, derive condition for estimability via Cβ where C = (p₁,p₂,p₃), prove p₁ = p₂ + p₃ using linear independence of rows and estimability condition C = LX for some L
  • Part (c)(i): Three reasons for PCA: dimensionality reduction (e.g., reducing NSSO household survey variables), multicollinearity remediation in regression, and data visualization/pattern detection in large datasets
  • Part (c)(ii): Define canonical correlations as correlations between linear combinations u=a'X and v=b'Y maximizing correlation; examples: relationship between economic indicators and social development indices, or agricultural inputs vs outputs; describe eigenvalue solution of Σ₁₁⁻¹Σ₁₂Σ₂₂⁻¹Σ₂₁ and extraction of canonical variates
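
The numerical part (a) can be verified with exact arithmetic. The following is a small sketch using only the standard library; the variable names are illustrative, and the formulas are the standard conditional-normal, partial and multiple correlation expressions quoted in the key points.

```python
from fractions import Fraction as F
import math

mu = [F(2), F(-1), F(3)]
S = [[F(4), F(1), F(0)],
     [F(1), F(2), F(1)],
     [F(0), F(1), F(3)]]
x3 = F(2)

# (i) Conditional distribution of (X1, X2) given X3 = x3
cond_mean = [mu[i] + S[i][2] / S[2][2] * (x3 - mu[2]) for i in range(2)]
cond_cov = [[S[i][j] - S[i][2] * S[2][j] / S[2][2] for j in range(2)]
            for i in range(2)]

# (ii) Partial correlation of X1 and X2 given X3, from the conditional covariance
partial_12_3 = float(cond_cov[0][1]) / math.sqrt(float(cond_cov[0][0] * cond_cov[1][1]))

# (ii) Multiple correlation of X1 on (X2, X3): R^2 = s1' inv(S_23) s1 / s11
det = S[1][1] * S[2][2] - S[1][2] * S[2][1]
inv_23 = [[S[2][2] / det, -S[1][2] / det],
          [-S[2][1] / det, S[1][1] / det]]
s1 = [S[0][1], S[0][2]]
r_squared = sum(s1[i] * inv_23[i][j] * s1[j]
                for i in range(2) for j in range(2)) / S[0][0]
multiple_r = math.sqrt(float(r_squared))
```

Because σ₁₃ = 0 here, the partial and multiple correlations coincide at √(3/20).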
Q7
50 marks · discuss · Sampling methods and stratified random sampling

(a) Discuss the difference between sampling for variables and sampling for attributes with examples. For a qualitative characteristic, find an unbiased estimator of population proportion along with its variance when sample is drawn by simple random sampling without replacement. Also obtain an unbiased estimator of this variance. (20 marks) (b) The table given below gives the population and sample sizes, stratum means and variance of a stratified random sample of size 50. Symbols used have their usual meanings.

| Stratum Number | Nᵢ | nᵢ | ȳᵢ | sᵢ² |
|---|---|---|---|---|
| 1 | 30 | 5 | 35 | 36 |
| 2 | 50 | 10 | 40 | 49 |
| 3 | 60 | 15 | 40 | 81 |
| 4 | 60 | 20 | 55 | 144 |

Verify that the existing allocation is optimum for given 4 strata. Also calculate the estimate of population variance under this allocation. (15 marks) (c) Differentiate between Simple Random Sampling and Probability Proportional to Size Sampling. How will you draw a PPS sample of size n from a population of size N (n < N) by (i) Cumulative Total Method and (ii) Lahiri's Method ? Explain. (15 marks)

Answer approach & key points

Begin with a clear conceptual distinction in part (a) between variables (quantitative) and attributes (qualitative) with Indian examples like agricultural yield vs literacy status. Derive the unbiased estimator p̂ = n'/n for the population proportion P (n' being the number of sampled units possessing the attribute) and its variance V(p̂) = (N-n)/(N-1) · P(1-P)/n, then obtain the unbiased estimator v(p̂). For part (b), verify Neyman optimum allocation by checking if nᵢ ∝ NᵢSᵢ/√cᵢ (assuming equal costs), then compute V(ȳ_st). For part (c), contrast SRS with PPS on selection probability basis, then detail both Cumulative Total and Lahiri's methods with numerical illustration. Allocate approximately 40% time to part (a), 30% each to (b) and (c) based on marks distribution.

  • Part (a): Clear distinction between sampling for variables (measurable quantities like income, yield) vs attributes (dichotomous characteristics like employment status, disease presence) with appropriate Indian examples
  • Part (a): Derivation of unbiased estimator p̂ = n'/n for population proportion P, its variance V(p̂) = (N-n)/(N-1) · P(1-P)/n under SRSWOR, and unbiased estimator of variance v(p̂) = (N-n)/N · p̂(1-p̂)/(n-1)
  • Part (b): Verification of Neyman optimum allocation condition nᵢ/n = NᵢSᵢ/ΣNⱼSⱼ using given data; calculation showing existing allocation matches or approximates this ratio
  • Part (b): Computation of stratified mean estimate ȳ_st = ΣWᵢȳᵢ where Wᵢ = Nᵢ/N, and population variance estimate V(ȳ_st) = ΣWᵢ²(Nᵢ-nᵢ)/(Nᵢnᵢ) · sᵢ²
  • Part (c): Systematic comparison of SRS (equal probability) vs PPS (probability ∝ size) on grounds of efficiency, especially for skewed populations like industrial output or agricultural holdings
  • Part (c): Step-wise description of Cumulative Total Method: list cumulative totals, select random numbers between 1 and ΣXᵢ, identify selected units
  • Part (c): Step-wise description of Lahiri's Method: select random number i from 1 to N and random number j from 1 to M (M=max size), accept unit i if j ≤ Xᵢ, else reject the pair and repeat
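
The allocation check and variance estimate in part (b) can be reproduced numerically. This is a minimal sketch with illustrative variable names; it assumes equal sampling costs across strata and uses the sample values sᵢ in place of the unknown Sᵢ when computing the Neyman allocation.

```python
import math

# (N_i, n_i, ybar_i, s_i^2) for the four strata, as given in the question
strata = [(30, 5, 35, 36), (50, 10, 40, 49), (60, 15, 40, 81), (60, 20, 55, 144)]

N = sum(Ni for Ni, _, _, _ in strata)
n = sum(ni for _, ni, _, _ in strata)

# Neyman allocation: n_i proportional to N_i * S_i
ns_products = [Ni * math.sqrt(s2) for Ni, _, _, s2 in strata]
neyman = [n * w / sum(ns_products) for w in ns_products]
neyman_rounded = [round(v) for v in neyman]

# Stratified mean and its estimated variance (with finite population correction)
y_st = sum(Ni * ybar for Ni, _, ybar, _ in strata) / N
var_y_st = sum((Ni / N) ** 2 * (1 - ni / Ni) * s2 / ni
               for Ni, ni, _, s2 in strata)
```

The Neyman sizes come out near 5.03, 9.78, 15.08 and 20.11, which round to the existing allocation (5, 10, 15, 20); the stratified mean is 43.75 with estimated variance about 1.18.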
Q8
50 marks · differentiate · Experimental design and statistical models

(a) Differentiate between randomised block design and balanced incomplete block design. In usual notations, for a balanced incomplete block design, prove that (i) bk = vr (ii) λ(v – 1) = r(k – 1) and (iii) b ≥ v. (20 marks) (b) Explain the concept of confounding in design of experiment. In an experiment with three factors A, B and C, each at two levels, three replicates are divided in two blocks, each of four units. How will you confound ABC in the first, AC in the second and BC in the third replication ? (15 marks) (c) Differentiate among fixed, random and mixed effect models with examples. How are the three basic principles of design fulfilled in randomised block design ? Explain. (15 marks)

Answer approach & key points

Begin with a structured comparison of RBD vs BIBD in part (a), then rigorously prove all three BIBD parameters using standard notation with clear algebraic steps. For part (b), first define confounding with factorial design context, then explicitly construct the three replication schemes showing which treatment combinations go to which block. For part (c), use tabular comparison for model types with agricultural/industrial examples, then explain how RBD satisfies randomization, replication, and local control. Allocate approximately 40% time to part (a) given its 20 marks and proof demands, 30% each to parts (b) and (c).

  • Part (a): Clear distinction between RBD (complete blocks, all treatments per block) and BIBD (incomplete blocks, not all treatments appear in each block) with structural conditions
  • Part (a): Correct proofs of bk = vr, λ(v–1) = r(k–1), and b ≥ v using incidence matrix properties or combinatorial counting with λ defined as pairwise concurrence
  • Part (b): Accurate definition of confounding as sacrificing higher-order interaction information to achieve block homogeneity, with distinction between complete and partial confounding
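  • Part (b): Correct construction of three replications in Yates notation: Rep I confounds ABC with blocks {(1), ab, ac, bc} and {a, b, c, abc}; Rep II confounds AC with blocks {(1), b, ac, abc} and {a, c, ab, bc}; Rep III confounds BC with blocks {(1), a, bc, abc} and {b, c, ab, ac}; since a different effect is confounded in each replicate, this is partial confounding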
  • Part (c): Precise differentiation of fixed (levels specifically chosen, inference only to those levels), random (levels random sample from population, variance component estimation), and mixed models with appropriate examples like crop varieties vs fertilizer doses
  • Part (c): Explanation of how RBD achieves randomization (random allocation within blocks), replication (multiple blocks), and local control (homogeneous blocks reducing experimental error)
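
The block contents for part (b) follow from the even/odd rule: a treatment combination goes to the principal block when it shares an even number of letters with the confounded effect. A short sketch of that rule is given below; the helper names `label` and `confounded_blocks` are illustrative assumptions.

```python
from itertools import product

FACTORS = "abc"

def label(levels):
    """Yates label for a tuple of 0/1 factor levels, e.g. (1, 0, 1) -> 'ac'."""
    name = "".join(f for f, high in zip(FACTORS, levels) if high)
    return name or "(1)"

def confounded_blocks(effect):
    """Split the 2^3 treatment combinations into two blocks confounding `effect`."""
    idx = [FACTORS.index(ch) for ch in effect.lower()]
    principal, other = [], []
    for levels in product((0, 1), repeat=len(FACTORS)):
        shared = sum(levels[i] for i in idx)   # letters in common with the effect
        (principal if shared % 2 == 0 else other).append(label(levels))
    return principal, other

# One replicate per confounded interaction, as asked in the question
replicates = {effect: confounded_blocks(effect) for effect in ("ABC", "AC", "BC")}
```

Within each replicate the treatments would still be allotted at random to the four units of their block.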

Paper II

8 questions · 400 marks
Q1
50 marks · Compulsory · solve · Statistical Quality Control, Reliability, Linear Programming, Game Theory, Markov Chains

(a) What do you understand by Statistical Quality Control (SQC)? Discuss briefly its need and utility in Industry. Discuss the causes of variation in quality. (10 marks) (b) Consider an item with failure rate $Z(t) = \frac{t}{t+1}$. Write down the survivor function $R(t)$ and hence evaluate Mean Time To Failure (MTTF). Also obtain the conditional survival function and Mean Residual Life (MRL). (10 marks) (c) Solve the following linear programming problem by using graphical approach: Minimize $4x_1 + 5x_2 + 6x_3$ Subject to $x_1 + x_2 \geq 11$ $x_1 - x_2 \leq 5$ $x_3 - x_1 - x_2 = 0$ $7x_1 + 12x_2 \geq 35$ $x_1 \geq 0, x_2 \geq 0, x_3 \geq 0$ (10 marks) (d) In a two-person zero-sum game, write the payoff matrix in general notation. Consider the two-person zero-sum game where each player tosses an unbiased coin simultaneously. Player B pays ₹7 to A if {H, H} occurs or {T, T} occurs otherwise player A pays ₹3 to B. Write down A's payoff matrix. Explain the Max Min criterion for player A and hence define the saddle point. (10 marks) (e) Let Xₜ be the state of a flea at time t Find the transition Matrix P. Also obtain Pᵣ[X₂ = 3 | X₀ = 1]. (10 marks)

Answer approach & key points

This multi-part question requires solving five distinct problems: (a) discuss SQC concepts with industrial applications, (b) derive reliability functions from given failure rate, (c) solve LP graphically, (d) construct and analyze game matrix, and (e) compute Markov chain probabilities. Allocate approximately 15-18 minutes per part, with extra attention to (c) and (e) where computational errors are common. Begin each part clearly labeled, show all derivation steps, and conclude with boxed final answers.

  • Part (a): Define SQC as statistical methods for maintaining quality standards; explain need (mass production complexity, cost reduction, customer satisfaction) and utility (process control, acceptance sampling); classify variations into chance causes (random, inherent) and assignable causes (identifiable, correctable)
  • Part (b): Derive R(t) = exp(-∫₀ᵗ Z(u)du) = exp(-t + ln(1+t)) = (t+1)e^(-t); compute MTTF = ∫₀^∞ R(t)dt = 2; obtain conditional survival R(x|t) = R(t+x)/R(t) = (1+t+x)e^(-x)/(1+t) and MRL(t) = ∫₀^∞ R(x|t)dx = (t+2)/(t+1)
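  • Part (c): Use constraint x₃ = x₁ + x₂ to reduce to 2-variable problem; minimize 10x₁ + 11x₂; identify feasible region vertices from intersection of x₁+x₂≥11, x₁-x₂≤5, 7x₁+12x₂≥35 (the last constraint is redundant), namely (0,11) with value 121 and (8,3) with value 113; optimal solution at x₁ = 8, x₂ = 3, x₃ = 11 with minimum value 113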
  • Part (d): General payoff matrix [aᵢⱼ] where i=1,...,m strategies for A, j=1,...,n for B; specific matrix with entries +7 for (H,H) and (T,T), -3 otherwise; apply Max Min: maximize minimum row payoff; saddle point exists if Max Min = Min Max (here Max Min = -3 and Min Max = 7, so this game has no saddle point in pure strategies)
  • Part (e): Construct transition matrix P from flea movement probabilities (typically given in diagram); compute P² and extract P[X₂=3|X₀=1] = (P²)₁₃ using Chapman-Kolmogorov equations
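
For part (c), the graphical method amounts to evaluating the objective at the corner points of the feasible region. The sketch below enumerates pairwise intersections of the boundary lines with exact arithmetic; the helper names are illustrative, and it relies on the objective being bounded below on the feasible region (all cost coefficients are positive and the variables are non-negative).

```python
from fractions import Fraction
from itertools import combinations

# Boundary lines a*x1 + b*x2 = c of the reduced problem (x3 = x1 + x2 substituted)
LINES = [(1, 1, 11), (1, -1, 5), (7, 12, 35), (1, 0, 0), (0, 1, 0)]

def feasible(x1, x2):
    return (x1 + x2 >= 11 and x1 - x2 <= 5 and 7 * x1 + 12 * x2 >= 35
            and x1 >= 0 and x2 >= 0)

def intersection(l1, l2):
    """Solve two line equations by Cramer's rule; None if the lines are parallel."""
    (a1, b1, c1), (a2, b2, c2) = l1, l2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return Fraction(c1 * b2 - c2 * b1, det), Fraction(a1 * c2 - a2 * c1, det)

def objective(x1, x2):
    return 4 * x1 + 5 * x2 + 6 * (x1 + x2)   # equals 10*x1 + 11*x2

points = {intersection(p, q) for p, q in combinations(LINES, 2)}
vertices = sorted(pt for pt in points if pt is not None and feasible(*pt))
best = min(vertices, key=lambda pt: objective(*pt))
best_value = objective(*best)
```

The feasible corner points are (0, 11) and (8, 3), and the minimum is attained at (8, 3), giving x₃ = 11.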
Q2
50 marks · explain · Control Charts for Fraction Defective and Replacement Policy

(a) What do you understand by control chart for fraction defective? Explain its construction. Give the theoretical distribution on which the control limits are based. (15 marks) (b) Each day a sample of 50 items from the production process was examined. The number of defectives found in each sample was as follows:

| Day | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|-----|---|---|---|---|---|---|---|---|---|----|----|----|
| No. of Defectives | 6 | 2 | 5 | 1 | 2 | 2 | 3 | 5 | 3 | 4 | 12 | 4 |

| Day | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|-----|----|----|----|----|----|----|----|----|----|----|----|----|
| No. of Defectives | 4 | 1 | 3 | 5 | 4 | 1 | 4 | 3 | 5 | 4 | 2 | 3 |

Draw a suitable control chart and check for control. What control limits would you suggest for subsequent use? (15 marks) (c) A factory has 1000 bulbs installed. The cost of individual replacement is US $3 per bulb, while the cost of group replacement is US $1 per bulb. It is decided to replace all the bulbs simultaneously at fixed intervals and also to replace individually the bulbs that fail in between. Determine the optimum replacement policy. Failure probabilities are given below:

| Week | 1 | 2 | 3 | 4 | 5 |
|------|-----|------|------|------|------|
| Failure probability(p) | 0·10 | 0·25 | 0·50 | 0·70 | 1·00 |

(20 marks)

Answer approach & key points

Explain the theoretical foundations of p-charts in part (a), then solve the numerical problems in (b) and (c) with systematic working. Allocate approximately 25-30% time to (a) as it requires conceptual elaboration, 30-35% to (b) for control chart construction and interpretation, and 35-40% to (c) as it carries the highest marks and involves multi-step replacement policy optimization. Present calculations in tabular format where possible and conclude with clear managerial recommendations.

  • Part (a): Definition of control chart for fraction defective (p-chart), construction steps using sample proportion p̂ = d/n, and identification of Binomial distribution as the theoretical basis with Normal approximation for large samples
  • Part (b): Calculation of center line (CL = p̄), control limits UCL/LCL = p̄ ± 3√[p̄(1-p̄)/n], plotting of 24 sample points, identification of Day 11 as out-of-control, and revised limits after removing assignable cause
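  • Part (c): Computation of expected failures Nᵢ in week i through the recursion N₁ = N₀p₁, N₂ = N₀p₂ + N₁p₁, N₃ = N₀p₃ + N₁p₂ + N₂p₁, etc., where the pᵢ are weekly failure probabilities obtained by differencing the tabulated values (treated as cumulative); individual replacement cost, group replacement cost for each policy period, and determination of optimal replacement interval at minimum average cost per week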
  • Correct handling of variable control limits when sample sizes differ (though here n=50 constant), and recognition that p-chart is appropriate for attribute data with varying sample sizes
  • Economic interpretation: trade-off between individual replacement flexibility and group replacement economies of scale, with explicit cost comparison across weeks 1-5
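
The calculations for parts (b) and (c) can be laid out as follows. This is a sketch under stated assumptions, not an official solution: the names are illustrative, 3-sigma limits are used for the p-chart, and the tabulated failure probabilities in (c) are read as cumulative because they rise to 1·00.

```python
import math

# Part (b): p-chart with n = 50 items per day, counts as printed in the question
defectives = [6, 2, 5, 1, 2, 2, 3, 5, 3, 4, 12, 4,
              4, 1, 3, 5, 4, 1, 4, 3, 5, 4, 2, 3]
SAMPLE_SIZE = 50

def p_chart_limits(counts, n):
    """Return (LCL, CL, UCL) with 3-sigma limits; a negative LCL is set to 0."""
    p_bar = sum(counts) / (len(counts) * n)
    half_width = 3 * math.sqrt(p_bar * (1 - p_bar) / n)
    return max(0.0, p_bar - half_width), p_bar, p_bar + half_width

lcl, cl, ucl = p_chart_limits(defectives, SAMPLE_SIZE)
out_of_control = [day for day, d in enumerate(defectives, start=1)
                  if not lcl <= d / SAMPLE_SIZE <= ucl]
kept = [d for day, d in enumerate(defectives, start=1) if day not in out_of_control]
revised_lcl, revised_cl, revised_ucl = p_chart_limits(kept, SAMPLE_SIZE)

# Part (c): group versus individual replacement
# Assumption: the tabulated probabilities are cumulative, so difference them.
cumulative = [0.10, 0.25, 0.50, 0.70, 1.00]
per_week = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
N_BULBS, COST_INDIVIDUAL, COST_GROUP = 1000, 3.0, 1.0

failures = []  # expected replacements in weeks 1..5
for t in range(len(per_week)):
    original = N_BULBS * per_week[t]                      # bulbs installed at time 0
    from_replacements = sum(failures[s] * per_week[t - 1 - s] for s in range(t))
    failures.append(original + from_replacements)

# Average weekly cost if all bulbs are group-replaced every t weeks
avg_group_cost = [
    (N_BULBS * COST_GROUP + COST_INDIVIDUAL * sum(failures[:t])) / t
    for t in range(1, len(per_week) + 1)
]
mean_life = sum((t + 1) * p for t, p in enumerate(per_week))
individual_only_cost = COST_INDIVIDUAL * N_BULBS / mean_life
best_interval = 1 + avg_group_cost.index(min(avg_group_cost))
```

Under these assumptions only Day 11 falls outside the limits, the revised chart has centre line about 0.066 and upper limit about 0.171 (lower limit 0), and group replacement every 4 weeks costs about $863.58 per week against roughly $869.57 per week for individual replacement alone.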
Q3
50 marks · solve · Linear programming, assignment problem, EOQ model

(a) Solve the following Linear Programming problem using Two Phase method : Maximize Z = 3x₁ - x₂ Subject to 2x₁ + x₂ ≥ 2, x₁ + 3x₂ ≤ 2, x₂ ≤ 4, x₁ ≥ 0, x₂ ≥ 0 (b)(i) Solve the above assignment problem. Cell values represent cost of assigning job A, B, C and D to the machines I, II, III and IV. (b)(ii) Write down the dual for the given primal problem. Max Z = 6x₁ - 5x₂ + 7x₃ + x₄ Subject to 2x₁ + 4x₂ - x₃ + x₄ ≤ 4, x₁ - x₂ + 6x₃ + 7x₄ ≥ 5, 2x₁ + 2x₂ + 4x₃ + 5x₄ = 6, x₁ + 8x₂ + x₃ = 7; x₁ and x₄ unrestricted, x₂ ≥ 0, x₃ ≥ 0 (c) What is a basic Economic Order Quantity (EOQ) model in Inventory Control and state the assumption made. A Company estimates that it will sell 12000 units of its products for the forthcoming year. The ordering cost is ₹100 per order and the carrying cost per year is 20% of the purchase price per unit. The purchase price per unit is ₹50. Find (i) EOQ (ii) Number of orders per year (iii) Time between successive orders.

Answer approach & key points

Solve this multi-part numerical problem by allocating approximately 40% time to part (a) Two-Phase method as it requires extensive tableau iterations, 35% to part (c) EOQ calculations with clear formula application, and 25% to part (b) covering both assignment problem and dual formulation. Begin with clear problem identification for each part, show complete step-by-step working with proper tableaus for (a), cost matrix reduction for (b)(i), systematic dual conversion rules for (b)(ii), and standard EOQ model derivation followed by substitution for (c). Conclude each part with boxed final answers and appropriate units.

  • Part (a): Convert LPP to standard form by introducing surplus, slack and artificial variables; set up Phase I objective to minimize artificial variable; execute simplex iterations till feasibility; proceed to Phase II with original objective; identify optimal solution at x₁ = 2, x₂ = 0, Z = 6
  • Part (b)(i): Apply Hungarian algorithm to 4×4 cost matrix — row reduction, column reduction, minimum lines to cover zeros, adjust matrix, make optimal assignments; state final assignment with minimum total cost
  • Part (b)(ii): Convert primal to dual by transforming maximization to minimization, reversing inequality signs for ≥ constraints, handling equality with unrestricted dual variables, and noting primal unrestricted variables become dual equality constraints
  • Part (c): Define EOQ as optimal order quantity minimizing total inventory cost; list assumptions: constant demand, instantaneous replenishment, no stockouts, fixed ordering cost, constant carrying cost percentage
  • Calculate EOQ = √[(2×12000×100)/(0.20×50)] = √240000 ≈ 490 units; number of orders = 12000/490 ≈ 24.5 per year; time between orders = 365/24.5 ≈ 14.9 days
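
The EOQ arithmetic can be checked directly. This is a minimal sketch of the basic (Wilson) formula with the figures from the question; the function name `eoq` and the 365-day year are assumptions made for illustration.

```python
import math

def eoq(annual_demand, ordering_cost, holding_cost_per_unit):
    """Basic economic order quantity: sqrt(2 * D * K / h)."""
    return math.sqrt(2 * annual_demand * ordering_cost / holding_cost_per_unit)

demand = 12000                 # units per year
ordering_cost = 100            # rupees per order
holding_cost = 0.20 * 50       # 20% of the purchase price of 50 rupees

order_quantity = eoq(demand, ordering_cost, holding_cost)
orders_per_year = demand / order_quantity
days_between_orders = 365 / orders_per_year   # assumes a 365-day year
```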
Q4
50 marks · solve · Transportation problem, normal distribution, queuing theory

(a) A Company ships truckloads of grain from three silos to four mills. The supply (in truckloads) and the demand (also in truckloads) together with the unit transportation costs per truckload on the different routes are summarized in the following table : Purpose is to find the minimum-cost shipping schedule between the silos and the mills. Use any method. Obtain the starting basic feasible solution. (b)(i) Suppose that the life in hours of an electric Gadget manufactured by a certain process is normally distributed with parameters μ = 160 hours and some σ. What would be the maximum allowable value of σ if the life X of the gadget is to have a probability 0.80 of being between 120 hours and 200 hours ? (Normal distribution Table is given at the end). (b)(ii) Let the compressive strength X of concrete be log-normally distributed with parameters μY = 3 MPa and σY = 0.2 MPa where Y = logeX. What is the probability that the strength is less than or equal to 10 MPa ? (Normal distribution Table is given at the end) (c) A departmental store operates with three checkout counters. To determine the number of counters in operation based on the number of customers, the manager uses the following schedule :

| Number of customers in store | Number of counters in operation |
|---|---|
| 1 to 3 | 1 |
| 4 to 6 | 2 |
| More than 6 | 3 |

Customers arrive in the counter(s) according to a Poisson distribution with a mean rate of 10 customers/hour. The average checkout time per customer is exponential with mean 12 minutes. Determine the steady state probability pn of n customers in the checkout area.

Answer approach & key points

Solve this multi-part numerical problem by allocating approximately 40% time to part (a) transportation problem as it requires complete solution methodology, 35% to part (b) probability calculations involving normal and log-normal distributions, and 25% to part (c) queuing theory steady-state probabilities. Begin with clear problem setup for each part, show all computational steps with proper formulae, and conclude with interpreted final answers in correct units.

  • For (a): Correctly set up the balanced transportation problem (check if supply equals demand, add dummy if needed), apply Vogel's Approximation Method or Least Cost Method to obtain degenerate/non-degenerate basic feasible solution with (m+n-1) allocations
  • For (b)(i): Set up P(120 < X < 200) = 0.80, convert to standard normal Z-scores, use symmetry property to find z₀ such that P(-z₀ < Z < z₀) = 0.80, hence Φ(z₀) = 0.90, interpolate from table to find z₀ ≈ 1.28, then solve σ = 40/1.28 = 31.25 hours
  • For (b)(ii): Transform log-normal to normal: P(X ≤ 10) = P(Y ≤ ln10) = P(Y ≤ 2.3026), calculate Z = (2.3026-3)/0.2 = -3.487, use table to find Φ(-3.49) ≈ 0.0002 or precise interpolation
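  • For (c): Identify this as a Poisson-arrival, exponential-service queue whose service rate depends on the state through the counter schedule: μₙ = μ for n = 1 to 3, 2μ for n = 4 to 6, and 3μ for n ≥ 7; use λ = 10/hr and μ = 5/hr, so the tail ratio is ρ = λ/3μ = 2/3 < 1; apply the birth-death balance equations pₙ = (λ/μₙ)pₙ₋₁ to get pₙ = 2ⁿp₀ for n ≤ 3, pₙ = 8p₀ for n = 4 to 6, and pₙ = 8(2/3)ⁿ⁻⁶p₀ for n ≥ 7, with p₀ = 1/55 from normalisation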
  • Verify all calculations: check transportation cost arithmetic, confirm normal table reading with interpolation, validate queuing traffic intensity ρ < 1 for steady state existence, and ensure probability sum equals 1
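
As a cross-check on parts (b) and (c), the following Python sketch redoes the normal-table steps with the standard library and solves the birth-death balance equations using the counter schedule stated in the question. The function names (`counters_open`, `weight`, `p`) are hypothetical helpers, and the closed-form tail sum assumes the geometric ratio λ/3μ = 2/3 for n ≥ 7.

```python
import math
from fractions import Fraction
from statistics import NormalDist

# --- Part (b)(i): largest sigma with P(120 < X < 200) = 0.80, mu = 160 ---
z0 = NormalDist().inv_cdf(0.90)        # Phi(z0) = 0.90 by symmetry
sigma_max = (200 - 160) / z0           # hours

# --- Part (b)(ii): Y = ln X ~ N(3, 0.2^2), so P(X <= 10) = P(Y <= ln 10) ---
p_strength = NormalDist(mu=3, sigma=0.2).cdf(math.log(10))

# --- Part (c): birth-death queue with the counter schedule from the question ---
LAM = 10   # arrivals per hour
MU = 5     # services per hour per counter (mean checkout time 12 minutes)

def counters_open(n):
    """Counters in operation with n customers present (n >= 1)."""
    if n <= 3:
        return 1
    if n <= 6:
        return 2
    return 3

def weight(n):
    """Unnormalised probability p_n / p_0: product of LAM / mu_k for k = 1..n."""
    w = Fraction(1)
    for k in range(1, n + 1):
        w *= Fraction(LAM, counters_open(k) * MU)
    return w

# States 0..6 are summed directly; from n = 7 on the weights are geometric
# with ratio LAM / (3 * MU) = 2/3, so the tail has a closed form.
tail_ratio = Fraction(LAM, 3 * MU)
normaliser = sum(weight(n) for n in range(7)) + weight(7) / (1 - tail_ratio)
p0 = 1 / normaliser

def p(n):
    """Steady-state probability of n customers in the checkout area."""
    return weight(n) * p0

print(round(sigma_max, 2), round(p_strength, 5), p0, [p(n) for n in range(1, 8)])
```

The exact quantile gives a σ slightly below the 31.25 hours obtained from the rounded table value z₀ ≈ 1.28; either is acceptable if the table reading is shown.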
Q5
50M Compulsory explain Psychometrics, population growth, epidemiological rates

(c) What do you mean by reliability and validity of tests ? What is the difference between reliability and validity of a test ? If the reliability of a test is raised from 0·80 to 0·90 by lengthening the test, a validity coefficient of 0·60 for this test would be expected to increase to what value ? 10 marks (d) The rate of increase of a population at time t is r(t) = 0·01 + 0·0001 t². If the population totals 1,000,000 at time t = 0, what is the population at t = 30 ? 10 marks (e) Suggest which of the two measures : Morbidity Incidence rate (MIR) and Morbidity Prevalence rate (MPR) should be used to decide on the amount of medicine to be sent to a Malaria affected area. Cite an example where the other rate can be useful. 10 marks

Answer approach & key points

This question requires explaining three distinct statistical concepts across psychometrics, demography, and epidemiology. Allocate approximately 35% time to part (c) covering reliability, validity definitions, their distinction, and the Spearman-Brown prophecy formula application; 35% to part (d) setting up and solving the differential equation for population growth; and 30% to part (e) comparing MIR and MPR with practical Indian public health examples. Begin with clear conceptual definitions, proceed to mathematical derivations where required, and conclude with contextual interpretations.

  • Part (c): Define reliability (consistency/stability of test scores) and validity (extent to which the test measures what it claims to measure); distinguish reliability as necessary but not sufficient for validity; use the Spearman-Brown prophecy formula to find the lengthening factor n = 2.25 that raises reliability from 0.80 to 0.90
  • Part (c): Correctly identify that the validity coefficient increases in proportion to the square root of the reliability ratio: r_new = r_old × √(0.90/0.80) = 0.60 × 1.0607 ≈ 0.636, i.e. about 0.64
  • Part (d): Set up differential equation dP/dt = P×r(t) = P(0.01 + 0.0001t²); integrate ln(P) = ∫(0.01 + 0.0001t²)dt = 0.01t + 0.0001t³/3 + C
  • Part (d): Apply initial condition P(0) = 1,000,000 to find C = ln(10⁶); compute P(30) = 10⁶ × exp[0.01(30) + 0.0001(27000)/3] = 10⁶ × e^1.2 ≈ 3,320,117
  • Part (e): Recommend MIR (incidence rate) for medicine allocation as it measures new cases over time, directly indicating current disease burden and transmission dynamics requiring immediate intervention
  • Part (e): Cite MPR usefulness for chronic disease planning like diabetes or hypertension prevalence studies in India where total existing cases matter for long-term healthcare infrastructure and resource allocation
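
The numerical parts (c) and (d) can be verified with a few lines of Python. This is a sketch under the usual textbook assumptions: the test is lengthened with parallel items (Spearman-Brown), and the population follows dP/dt = r(t)·P. The variable and function names are illustrative.

```python
import math

# --- Part (c): validity after lengthening the test ---
r_old, r_new = 0.80, 0.90      # reliability before and after lengthening
validity_old = 0.60

# Spearman-Brown: r_new = n * r_old / (1 + (n - 1) * r_old), solved for n.
n = r_new * (1 - r_old) / (r_old * (1 - r_new))

# Validity of a test lengthened n times (criterion unchanged).
validity_new = validity_old * math.sqrt(n) / math.sqrt(1 + (n - 1) * r_old)

# Shortcut used in the key points: scale by sqrt(r_new / r_old).
validity_shortcut = validity_old * math.sqrt(r_new / r_old)

# --- Part (d): population under a time-varying growth rate ---
def cumulative_rate(t):
    """Integral of r(s) = 0.01 + 0.0001 s^2 from 0 to t."""
    return 0.01 * t + 0.0001 * t**3 / 3

population_0 = 1_000_000
population_30 = population_0 * math.exp(cumulative_rate(30))

print(round(n, 2), round(validity_new, 4), round(population_30))
```

Both routes to the new validity coefficient agree, which is a useful consistency check to mention in the answer.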
Q6
50M explain Time series analysis, least squares, heteroscedasticity

(a) Explain the principle of least squares. How it is used in fitting trend in time series analysis ? Explain the fitting of trend for the curve $y=ab^tc^{t^2}$. 15 marks (b) Define stationary time series. How would you test the stationarity of the given time series ? Write the importance of stationary time series. Check the following time series for stationarity. (i) $Y_t = Y_{t-1} + U_t$ (ii) $Y_t = \delta + Y_{t-1} + U_t$ (iii) $Y_t = \delta Y_{t-1} + U_t$ ; $-1 \leq \delta \leq 1$ 15 marks (c) State the different methods of detecting the presence of heteroscedasticity. Explain in brief the Goldfeld-Quandt Test for detecting the presence of heteroscedasticity. Also write the assumption required to apply this test. For a data on consumption expenditure in relation to income for a cross section of 30 families, after dropping the middle 4 observations, the OLS regression based on the first 13 and the last 13 observations and their associated residual sum of squares are as follows : Regression based on the first 13 observations : $\hat{Y}_i = 3.4094 + 0.6968 X_i$ $(r^2 = 0.8887, RSS_1 = 377.17, df = 11)$ Regression based on the last 13 observations : $\hat{Y}_i = -28.0272 + 0.7941 X_i$ $(r^2 = 0.7681, RSS_2 = 1536.8, df = 11)$ Check the presence of heteroscedasticity for the above given results and write your conclusion. $(F_{(11, 11, 5\%)} = 2.82, F_{(11, 11, 1\%)} = 4.46, F_{(13, 13, 5\%)} = 2.53, F_{(13, 13, 1\%)} = 3.82)$ 20 marks

Answer approach & key points

Explain the theoretical foundations first, then demonstrate computational application. Allocate ~30% time to part (a) on least squares and trend fitting, ~30% to part (b) on stationarity concepts and testing the three given models, and ~40% to part (c) on heteroscedasticity detection with complete Goldfeld-Quandt test execution. Structure as: theoretical exposition → mathematical derivation → numerical computation → statistical inference.

  • Part (a): Principle of least squares (minimizing sum of squared residuals), its application in linear and non-linear trend fitting, and complete working for y=ab^tc^{t^2} using logarithmic transformation to linear form
  • Part (b): Formal definition of weak/strong stationarity (constant mean and variance, autocovariance depending only on the lag), Dickey-Fuller or graphical methods for testing, importance for valid inference, and classification of (i) random walk (non-stationary), (ii) random walk with drift (non-stationary), (iii) AR(1) process (stationary when |δ|<1, non-stationary at the boundary values δ = ±1)
  • Part (c): Listing detection methods (graphical, Park test, Glejser test, White test, Goldfeld-Quandt test), complete Goldfeld-Quandt procedure with assumptions (normality, homoscedasticity under null, increasing/decreasing variance pattern)
  • Correct computation of F-statistic = RSS2/RSS1 = 1536.8/377.17 = 4.075 with proper degrees of freedom (11,11)
  • Proper hypothesis testing conclusion: F_calculated (4.075) > F_critical at 5% (2.82), so reject the null of homoscedasticity at the 5% level; at the 1% level, 4.075 < 4.46, so the null is not rejected, meaning heteroscedasticity is significant at 5% but not at 1%
  • Recognition that RSS2 > RSS1 indicates increasing variance with income, confirming heteroscedasticity in consumption expenditure data
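
The Goldfeld-Quandt computation for part (c) is short enough to script. The sketch below uses the RSS values, degrees of freedom, and critical values given in the question; the variable names are illustrative.

```python
# Residual sums of squares and degrees of freedom from the two sub-regressions.
rss_low, rss_high = 377.17, 1536.8   # first 13 and last 13 observations
df_low = df_high = 11                # 13 observations minus 2 parameters each

# Goldfeld-Quandt statistic: ratio of the two residual variances.
f_stat = (rss_high / df_high) / (rss_low / df_low)

# Critical values of F(11, 11) supplied in the question.
critical = {"5%": 2.82, "1%": 4.46}

for level, value in critical.items():
    verdict = "reject" if f_stat > value else "do not reject"
    print(f"{level}: F = {f_stat:.3f} vs {value} -> {verdict} homoscedasticity")
```

Because both sub-samples have the same degrees of freedom, the statistic reduces to RSS₂/RSS₁, which is the form used in the key points.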
Q7
50M derive Life table functions and standard scores

7.(a) Derive, by starting from a suitable functional form for $l_x$, the formula (i) $L_x = \dfrac{l_x + l_{x+1}}{2}$ and (ii) $L_x = \dfrac{l_x - l_{x+1}}{(\log l_x - \log l_{x+1})} = -\dfrac{d_x}{\log p_x}$ (iii) $e_x^0 = \dfrac{1}{2} + \sum\limits_{i=1}^{\infty} \dfrac{i d_{x+i}}{l_x}$ where $l_x$ = members of the cohort alive at age $x$ $L_x$ = number of years lived, in the aggregate, by the cohort of $l_0$ persons between age $x$ and $(x+1)$ $d_x$ = number of persons dying between age $x$ and $(x+1)$ $= l_x - l_{x+1}$ $p_x$ = probability that a person of age $x$ will survive till age $(x+1)$ $e_x^0$ = expectation of life at age $x$ 7.(b) (i) 400 students are given a test. The average is 60 and the standard deviation is 12. Obtain the Z-score and the standard scores equivalent to raw scores. The raw scores are given by | Raw scores | 84 | 78 | 72 | 66 | 60 | 54 | 48 | 42 | 36 | (ii) Convert the ten scores 1, 2, ..., 10 into standard scores with mean 50 and standard deviation 10. 7.(c) On the life table with $l_x = \dfrac{100-x}{190}$, $5 \leq x \leq 100$, Find (i) the chance that a child who has reached age 5 will live to age 60. (ii) the chance that a man of age 30 will live until age 80. (iii) the probability of dying within 5 years for a man aged 40. (iv) the expectation of life at age 40. (v) the chance that of the three men aged 30 at least one survives till age 80.

Answer approach & key points

Begin with clear statement of assumptions for each derivation in 7(a), showing step-by-step integration for L_x formulas and summation manipulation for e_x^0. For 7(b), apply Z-score formula z = (X-μ)/σ systematically, then demonstrate linear transformation for standard scores with mean 50, SD 10. For 7(c), substitute given l_x = (100-x)/190 into survival probabilities, death probabilities, and life expectancy formulas, computing each numerical value with proper fraction handling. Allocate approximately 35% time to derivations in (a), 25% to standard score calculations in (b), and 40% to life table computations in (c) given its five sub-parts.

  • 7(a)(i): Assume l_x linear in [x, x+1], integrate l(t)dt from x to x+1 to obtain (l_x + l_{x+1})/2
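  • 7(a)(ii): Assume l_x exponential over [x, x+1] (constant force of mortality μ), use l_{x+t} = l_x·e^{-μt} for 0 ≤ t ≤ 1 and integrate to derive the logarithmic-mean form (l_x − l_{x+1})/(log l_x − log l_{x+1}) = −d_x/log(p_x)
  • 7(a)(iii): Express T_x = Σ L_{x+i} using the linear assumption, substitute L_x = (l_x + l_{x+1})/2 to get T_x = l_x/2 + Σ_{i≥1} l_{x+i}, then write l_{x+i} = Σ_{j≥i} d_{x+j} and swap the order of summation to obtain e_x^0 = T_x/l_x = 1/2 + Σ i·d_{x+i}/l_x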
  • 7(b): Calculate Z-scores as (X-60)/12 for each raw score, then apply linear transformation 50 + 10·z for standard scores; for 1-10 scores, first find mean=5.5, SD=√8.25, then transform
  • 7(c): Compute survival probabilities as l_60/l_5, l_80/l_30; death probability as 1-l_45/l_40; e_40^0 = T_40/l_40 with T_40 = Σ L_{40+i}; binomial probability for at least one survivor among three men
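
The numerical answers for 7(b) and 7(c) can be checked with exact fractions in Python. This is a sketch: `l` is a hypothetical helper implementing the given l_x = (100 − x)/190. The standard scores use the 50 + 10·z transformation named in the key points, and the ten scores use the population standard deviation.

```python
from fractions import Fraction
from statistics import mean, pstdev

# --- 7(b): Z-scores and standard scores (mean 50, SD 10) ---
raw = [84, 78, 72, 66, 60, 54, 48, 42, 36]
z_scores = [(x - 60) / 12 for x in raw]
standard = [50 + 10 * z for z in z_scores]

ten = list(range(1, 11))
m, s = mean(ten), pstdev(ten)            # 5.5 and sqrt(8.25)
ten_standard = [50 + 10 * (x - m) / s for x in ten]

# --- 7(c): life table with l_x = (100 - x) / 190 ---
def l(x):
    """Survivors at exact age x, as an exact fraction."""
    return Fraction(100 - x, 190)

p_5_to_60 = l(60) / l(5)                 # (i) survive from 5 to 60
p_30_to_80 = l(80) / l(30)               # (ii) survive from 30 to 80
q_40_within_5 = 1 - l(45) / l(40)        # (iii) die between 40 and 45

# (iv) e_40 = T_40 / l_40 with L_x = (l_x + l_{x+1}) / 2 up to age 100.
T_40 = sum((l(x) + l(x + 1)) / 2 for x in range(40, 100))
e_40 = T_40 / l(40)

# (v) at least one of three independent lives aged 30 reaches 80.
at_least_one = 1 - (1 - p_30_to_80) ** 3

print(standard, [round(t, 2) for t in ten_standard])
print(p_5_to_60, p_30_to_80, q_40_within_5, e_40, at_least_one)
```

Working in `Fraction` keeps every life table answer in the exact form an examiner would expect, before any decimal conversion.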
Q8
50M explain Agricultural statistics, fertility rates, and econometric model identification

8.(a) Explain the method of collection of agriculture data. Describe the (i) official publications for data collection and (ii) statistics collected by the various official agencies pertaining to agriculture production. 8.(b) Distinguish between GFR and TFR. What is meant by TFR = 3.29 ? Discuss the merits and demerits of TFR. Construct the relationship between GRR and TFR. Interpret GRR when GRR >1, <1 or =1. 8.(c) State the order and rank conditions to check the identifiability of the given system of simultaneous equations. Consider the following extended Keynesian model of income determination : Consumption function : $C_t = \beta_1 + \beta_2 Y_t - \beta_3 T_t + U_{1t}$ Investment function : $I_t = \alpha_0 + \alpha_1 Y_{t-1} + U_{2t}$ Taxation function : $T_t = \gamma_0 + \gamma_1 Y_t + U_{3t}$ Income Identity : $Y_t = C_t + I_t + G_t$ Where C = Consumption expenditure Y = Income I = Investment T = Taxes G = Government expenditure and U's = the disturbance terms. In the model the endogenous variables are C, I, T and Y and the predetermined variables are G and $Y_{t-1}$. By applying the order condition, check the identifiability of each of the equations in the system and of the system as a whole. Write your conclusion.

Answer approach & key points

The question demands explanation across three distinct domains: agricultural data systems, fertility measures, and econometric identification. Allocate approximately 35% (15-18 marks) to part (a) covering data collection methods and official publications; 35% (15-18 marks) to part (b) distinguishing GFR/TFR with mathematical relationships and demographic interpretation; and 30% (12-15 marks) to part (c) applying order/rank conditions to the Keynesian model. Structure with clear sectional headings, begin each part with definitions, proceed to methodological details, and conclude with synthesis—ensuring the econometric section explicitly shows matrix calculations for identifiability.

  • Part (a): Enumeration of agricultural data collection methods (census, sample surveys, administrative records) with specific Indian examples—Land Records, Agricultural Census, NSSO rounds; identification of official publications (Agricultural Statistics at a Glance, State Statistical Abstracts, FAO reports, DES publications)
  • Part (a)(ii): Classification of statistics by agency—MoA&FW (crop area, yield, production), CSO (national income from agriculture), RBI (agricultural credit), NABARD (rural credit, WDRA data), State Directorates of Economics and Statistics
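  • Part (b): Precise distinction between GFR (live births in a year per 1000 women of reproductive age, usually 15-49) and TFR (average number of children per woman who passes through the reproductive span at current age-specific fertility rates); interpretation of TFR = 3.29 as about 3.29 children per woman on average, to be compared with the replacement level of roughly 2.1; merits/demerits covering data requirements, period sensitivity, and cross-population comparability
  • Part (b): TFR computed from age-specific fertility rates, TFR = 5 × Σ ASFR for five-year age groups (Σ ASFR for single years of age); GRR-TFR relationship via the sex ratio at birth, GRR ≈ TFR × (female births / total births) when that proportion is the same at every age; GRR interpretation with population stability implications: GRR>1 (growing), GRR<1 (declining), GRR=1 (stationary, ignoring mortality)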
  • Part (c): Correct statement of order condition (K ≥ M-1, where K=excluded predetermined, M=endogenous in equation) and rank condition; systematic application to four-equation Keynesian model identifying endogenous (C,I,T,Y) and predetermined (G,Yt-1) variables; construction of coefficient matrix and explicit identifiability verdict for each equation
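
The order condition in part (c) is a counting rule, so it can be sketched in a few lines of Python. The dictionary layout, the label `Y_lag` for Y_{t-1}, and the function name are assumptions for illustration. The income identity is left out because it has no unknown coefficients, and the order condition is only necessary, so the rank condition still has to be checked by hand.

```python
# Predetermined variables of the whole model (Y_lag stands for Y_{t-1}).
PREDETERMINED = {"G", "Y_lag"}

# Variables appearing in each behavioural equation, read off the question.
equations = {
    "consumption": {"endogenous": {"C", "Y", "T"}, "predetermined": set()},
    "investment": {"endogenous": {"I"}, "predetermined": {"Y_lag"}},
    "taxation": {"endogenous": {"T", "Y"}, "predetermined": set()},
}

def order_condition(endogenous, predetermined):
    """Compare excluded predetermined variables with (included endogenous - 1)."""
    excluded = len(PREDETERMINED - predetermined)
    required = len(endogenous) - 1
    if excluded < required:
        return "under-identified"
    if excluded == required:
        return "exactly identified"
    return "over-identified"

results = {
    name: order_condition(eq["endogenous"], eq["predetermined"])
    for name, eq in equations.items()
}

for name, verdict in results.items():
    print(f"{name}: {verdict}")
```

Since no equation fails the order condition, the system as a whole passes it, subject to the rank condition.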
