Q5 50M Compulsory derive Linear regression, experimental designs, sampling theory
(a) For a simple linear regression model Yᵢ = β₀ + β₁Xᵢ + εᵢ, i = 1, ..., n
(i) Derive the least square estimators of β₀ and β₁, clearly stating the conditions assumed.
(ii) For eᵢ = Yᵢ - Ŷᵢ where Ŷᵢ is the fitted value, show that
1. Σᵢ₌₁ⁿ eᵢ = 0
2. Σᵢ₌₁ⁿ Yᵢ = Σᵢ₌₁ⁿ Ŷᵢ
3. Σᵢ₌₁ⁿ Xᵢeᵢ = 0
4. Σᵢ₌₁ⁿ Ŷᵢeᵢ = 0
5. The regression line passes through (X̄, Ȳ). 5+5
(b) In usual notations, if v, b, r, k and λ are the parameters of a Balanced Incomplete Block Design, then show that :
(i) b ≥ r + 1 ≥ λ + 2
(ii) v ≤ b ≤ (r² - 1)/λ
10
(c) For the multiple linear regression model with two predictor variables X₁ and X₂, show that the estimate of regression coefficient of X₁ is unchanged when X₂ is added to the regression model, whenever X₁ and X₂ are uncorrelated.
10
(d) A sample of size n is drawn from a population having N units by simple random sampling without replacement. A sub-sample of n₁ units is drawn from the n units by simple random sampling without replacement. Let ȳ₁ denote the mean based on n₁ units and ȳ₂, the mean based on n₂ = n - n₁ units. Consider the estimator of the population mean Ȳₙ given by :
Ŷₙ = wȳ₁ + (1-w)ȳ₂ ; 0 < w < 1
Show that E(Ŷₙ) = Ȳₙ, and obtain its variance.
10
(e) How is the efficiency of a design measured ? Derive the expression to measure the efficiency of a Randomised Block Design over a Completely Randomised Design. 10
Answer approach & key points
The directive "derive" calls for rigorous step-by-step mathematical proofs with clear logical progression. Allocate time in proportion to the marks: ~20% for (a)(i)-(ii) on SLR properties, ~20% for (b) on BIBD inequalities, ~20% for (c) on multiple regression with uncorrelated predictors, ~20% for (d) on the sub-sampling estimator and its variance, and ~20% for (e) on design efficiency. Begin each sub-part by stating assumptions, proceed with a systematic derivation, and conclude with the required result clearly boxed.
- (a)(i) Correct setup of normal equations minimizing Σ(Yᵢ - β₀ - β₁Xᵢ)²; explicit statement of Gauss-Markov conditions (E(εᵢ)=0, Var(εᵢ)=σ², Cov(εᵢ,εⱼ)=0)
- (a)(ii) All five residual properties proved using normal equations: Σeᵢ=0 from first normal equation; ΣXᵢeᵢ=0 from second; ΣŶᵢeᵢ=0 via substitution; (X̄,Ȳ) on regression line verified
- (b) BIBD parameter relationships: bk=vr and r(k-1)=λ(v-1) with k<v give b>r and r>λ, hence b≥r+1≥λ+2; Fisher's inequality gives v≤b (equivalently r≥k); upper bound from λb = r² - r(r-λ)/k, where r(r-λ)≥k because r≥k and r-λ≥1, so b≤(r²-1)/λ
- (c) Multiple regression: β̂₁ = (S₁₀S₂₂ - S₁₂S₂₀)/(S₁₁S₂₂ - S₁₂²) or equivalent, where subscript 0 denotes Y and Sⱼₖ is the corrected sum of products; when S₁₂=0, β̂₁ reduces to S₁₀/S₁₁ = simple regression coefficient
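- (d) Sub-sampling within an SRSWOR sample: the n₁ units and the remaining n₂ units are each an SRSWOR sample from the population, so E(ȳ₁)=Ȳ, E(ȳ₂)=Ȳ and E(Ŷₙ)=wȲ+(1-w)Ȳ=Ȳ; variance derived using Var(ȳ₁)=(1/n₁ - 1/N)S², Var(ȳ₂)=(1/n₂ - 1/N)S² and Cov(ȳ₁,ȳ₂)=-S²/N (the two means are not independent), giving Var(Ŷₙ)=S²[w²/n₁ + (1-w)²/n₂ - 1/N]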
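- (e) Efficiency measured by the reciprocal of the error variance per unit, so relative efficiency is a ratio of error variances; for r blocks and t treatments, efficiency of RBD over CRD = σ²(CRD)/σ²(RBD), estimated by [(r-1)MSB + r(t-1)MSE]/[(rt-1)MSE], where MSB and MSE are the block and error mean squares of the RBD analysis; derivation by estimating the error mean square a CRD would have given on the same units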
- Proper mathematical notation throughout: summation limits, subscripts, expectation and variance operators clearly distinguished
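The residual identities in (a)(ii) and the claim in (c) can be checked numerically before writing out the algebra. The sketch below is only an illustration: the data values and all variable names are hypothetical, and X2 is chosen so that its corrected sum of products with X1 is exactly zero.

```python
# Hypothetical data; x2 = x1**2 - 2 makes X1 and X2 exactly uncorrelated here.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [2.0, -1.0, -2.0, -1.0, 2.0]
y = [3.1, 1.9, 4.2, 6.8, 9.5]


def mean(v):
    return sum(v) / len(v)


def s(u, v):
    """Corrected sum of products: sum of (u_i - ubar)(v_i - vbar)."""
    ub, vb = mean(u), mean(v)
    return sum((a - ub) * (b - vb) for a, b in zip(u, v))


# Simple linear regression of Y on X1 from the normal equations.
b1 = s(x1, y) / s(x1, x1)
b0 = mean(y) - b1 * mean(x1)
fitted = [b0 + b1 * x for x in x1]
resid = [yi - fi for yi, fi in zip(y, fitted)]

# Quantities that parts (a)(ii) 1, 3 and 4 say are zero.
sum_e = sum(resid)
sum_xe = sum(x * e for x, e in zip(x1, resid))
sum_fe = sum(f * e for f, e in zip(fitted, resid))

# Coefficient of X1 in the two-predictor regression of Y on X1 and X2.
s11, s22, s12 = s(x1, x1), s(x2, x2), s(x1, x2)
s1y, s2y = s(x1, y), s(x2, y)
b1_multiple = (s1y * s22 - s12 * s2y) / (s11 * s22 - s12 ** 2)

print(sum_e, sum_xe, sum_fe)   # all zero up to rounding error
print(b1, b1_multiple)         # equal because s12 == 0
```

A numerical check of this kind does not replace the proofs; it only confirms that the normal equations have been set up correctly.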
Q6 50M derive Multivariate analysis, correlation, cluster sampling, multivariate normal distribution
(a) For a multiple linear regression model with three covariates X₁, X₂ and X₃, let rᵢⱼ denote the correlation coefficient between Xᵢ and Xⱼ. For a data, it was found r₁₂ = 0·77, r₂₃ = 0·52, r₁₃ = 0·72.
(i) Check the consistency of the above data.
(ii) If r₁₃ is unknown, obtain the limits within which r₁₃ lies given the above values for r₁₂ and r₂₃. 20
(b) In cluster sampling with equal size clusters, obtain the unbiased estimate of population mean. Also obtain its sampling variance as
V(ȳ̄) = (1-f)(NM-1)S²{1+(M-1)ρcl}/[M²(N-1)n],
where notations have their usual meanings. 15
(c) Let Z₃ₓ₁ = (X₁ₓ₁, Y₂ₓ₁)ᵀ ~ N₃((0, 0, 1)ᵀ, [[1, 2, 1], [2, 5, 2], [1, 2, 2]]).
Show that conditional on X₁ₓ₁, the two components of Y₂ₓ₁ are independent but marginally they are not.
15
Answer approach & key points
Derive the required mathematical results systematically across all three parts. For part (a)(i)-(ii), apply correlation matrix properties and determinant conditions first, then use partial correlation bounds. For part (b), build the cluster sampling theory from first principles with ANOVA decomposition. For part (c), partition the multivariate normal distribution and derive conditional distributions. Allocate approximately 40% time to part (a) given its 20 marks, 30% each to parts (b) and (c). Structure as: direct derivations without lengthy introductions, clear theorem statements, step-by-step proofs, and boxed final expressions.
- For (a)(i): Verify positive semi-definiteness of correlation matrix by checking det(R) ≥ 0 or all principal minors non-negative; compute 1 - r₁₂² - r₂₃² - r₁₃² + 2r₁₂r₂₃r₁₃ ≥ 0
- For (a)(ii): Derive bounds r₁₂r₂₃ - √[(1-r₁₂²)(1-r₂₃²)] ≤ r₁₃ ≤ r₁₂r₂₃ + √[(1-r₁₂²)(1-r₂₃²)]; with r₁₂r₂₃ = 0.4004 and √[(1-r₁₂²)(1-r₂₃²)] ≈ 0.545, obtain numerical interval [-0.145, 0.945] or equivalent
- For (b): Define cluster sampling estimator ȳ̄ = (1/nM)ΣᵢΣⱼ yᵢⱼ; prove unbiasedness E(ȳ̄) = Ȳ; derive variance via between-cluster and within-cluster SS decomposition
- For (b): Express variance in ICC form using ρcl = (S_b² - S_w²)/(S_b² + (M-1)S_w²) or equivalent definition; manipulate to reach target formula
- For (c): Partition covariance matrix Σ = [[Σ_XX, Σ_XY], [Σ_YX, Σ_YY]]; derive conditional distribution Y|X ~ N(μ_Y + Σ_YXΣ_XX⁻¹(X-μ_X), Σ_YY - Σ_YXΣ_XX⁻¹Σ_XY)
- For (c): Show conditional covariance matrix is diagonal (implying independence given X₁) while marginal covariance Σ_YY is not diagonal
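The arithmetic behind (a) and (c) is short enough to verify directly. The following is a minimal sketch using only the numbers given in the question; the variable names are illustrative, and the conditional covariance uses the standard partitioned-normal formula quoted in the key points.

```python
import math

r12, r23, r13 = 0.77, 0.52, 0.72

# (a)(i): determinant of the 3x3 correlation matrix must be non-negative.
det_r = 1 - r12**2 - r23**2 - r13**2 + 2 * r12 * r23 * r13
consistent = det_r >= 0

# (a)(ii): admissible range of r13 for the given r12 and r23.
half_width = math.sqrt((1 - r12**2) * (1 - r23**2))
lower = r12 * r23 - half_width
upper = r12 * r23 + half_width

# (c): partition Sigma with X as the first component and Y as the last two.
sigma = [[1, 2, 1], [2, 5, 2], [1, 2, 2]]
s_xx = sigma[0][0]
s_xy = sigma[0][1:]
s_yy = [row[1:] for row in sigma[1:]]

# Conditional covariance of Y given X: S_yy - S_yx S_xx^{-1} S_xy.
cond_cov = [
    [s_yy[i][j] - s_xy[i] * s_xy[j] / s_xx for j in range(2)]
    for i in range(2)
]

print(consistent, round(det_r, 6))
print(round(lower, 4), round(upper, 4))
print(cond_cov)       # diagonal: the components of Y are independent given X
print(s_yy[0][1])     # non-zero: they are not independent marginally
```

For a multivariate normal vector, zero covariance is equivalent to independence, which is why inspecting the off-diagonal entries is sufficient in (c).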
Q7 50M derive Factorial experiments, principal components, regression estimator
(a) (i) What is confounding in factorial experiments ?
(ii) A $2^6$ factorial experiment is conducted in blocks of size $2^3$. Write the confounded effects such that no main effect or two factor interaction are confounded. Give the list of independent and generalised interactions confounded along with the elements of key block only.
(iii) Give the break-up of degrees of freedom for a $2^n$ factorial experiment in $2^k$ blocks.
(b) What are principal components ? Describe how to compute the principal components of the vectors X₁ = $\begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}$ and X₂ = $\begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix}$. Give X₁ and X₂ in terms of the principal components.
(c) Define Regression estimator. Show bias = – Cov ($\bar{x}$, b). Under what conditions is bias negligible ? Find the mean square error of the estimator to first degree of approximation. Give comparison of Regression estimator with Ratio estimator.
Answer approach & key points
Derive the required expressions systematically across all five sub-parts. For (a)(i)-(iii), allocate ~35% time covering the definition of confounding, the specific 2^6 construction in 2^3 = 8 blocks of 2^3 plots each (seven confounded effects: three independent interactions such as ACE, ABCD, ABEF plus their four generalised interactions), and the general df breakdown. For (b), spend ~25% time on PCA computation: construct the data matrix, find the cross-product (covariance) matrix, its eigenvalues (3 and 1) and eigenvectors, and express X₁, X₂ in PC terms. For (c), allocate ~40% time deriving regression estimator bias, MSE approximation, and comparison with ratio estimator via Cochran's approach. Begin with definitions, proceed through step-by-step derivations, and conclude with clear interpretations.
- (a)(i) Define confounding as mixing of treatment effects with block effects; distinguish complete vs partial confounding
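- (a)(ii) With 2^6/2^3 = 8 blocks, 2^3 - 1 = 7 effects are confounded: three independent interactions, e.g. ACE, ABCD, ABEF, and their four generalised interactions BDE, BCF, CDEF, ADF; verify that none is a main effect or 2-factor interaction; construct the key block from the treatments having an even number of letters in common with each independent interaction: (1), ade, bdf, acf, bce, abcd, abef, cdef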
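- (a)(iii) State df breakdown for r replicates with the same 2^k - 1 effects confounded in each: replicates (r - 1), blocks within replicates r(2^k - 1), unconfounded treatment effects (2^n - 2^k), error (r - 1)(2^n - 2^k), total (r·2^n - 1); for a single replicate, blocks take 2^k - 1 df and the remaining 2^n - 2^k df are split between the effects of interest and the high-order interactions pooled as error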
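- (b) Define PCs as uncorrelated linear combinations of the variables with successively maximal variance; both vectors have mean zero, so compute the cross-product matrix [2 -1; -1 2] (the covariance matrix is this divided by n - 1 = 2 and has the same eigenvectors), eigenvalues λ₁=3, λ₂=1, eigenvectors [1/√2, -1/√2]ᵀ and [1/√2, 1/√2]ᵀ; hence PC₁ = (X₁ - X₂)/√2 and PC₂ = (X₁ + X₂)/√2, and inverting gives X₁ = (1/√2)PC₁ + (1/√2)PC₂, X₂ = (-1/√2)PC₁ + (1/√2)PC₂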
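- (c) Define regression estimator ȳ_lr = ȳ + b(X̄ - x̄), with b the sample regression coefficient of y on x; derive bias = -Cov(x̄, b) from E(ȳ) = Ȳ and E(x̄) = X̄; state that the bias is of order 1/n, so it is negligible for large n, and that it vanishes when the regression of y on x is exactly linear; derive MSE ≈ ((1 - f)/n) S²_y(1 - ρ²) with f = n/N; compare: to the first degree of approximation V(ratio) - V(regression) = ((1 - f)/n)(R S_x - ρ S_y)² ≥ 0, so the regression estimator is never less precise than the ratio estimator, with equality only when the regression line passes through the origin (ρ = C_x/C_y)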
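For (a)(ii), the generalised interactions and the key block can be generated mechanically, which is a useful check on a hand-written answer. The sketch below assumes one admissible choice of independent interactions (ACE, ABCD, ABEF); other valid choices exist, and the function and variable names are illustrative.

```python
from itertools import combinations, product

FACTORS = "ABCDEF"
# Assumed choice of three independent interactions; not the only valid one.
INDEPENDENT = ["ACE", "ABCD", "ABEF"]


def multiply(*effects):
    """Generalised interaction: keep letters that occur an odd number of times."""
    letters = set()
    for effect in effects:
        letters ^= set(effect)
    return "".join(sorted(letters))


# All 2^3 - 1 = 7 confounded effects: products of every non-empty subset.
confounded = set()
for size in (1, 2, 3):
    for combo in combinations(INDEPENDENT, size):
        confounded.add(multiply(*combo))
confounded = sorted(confounded, key=lambda e: (len(e), e))

# Key block: treatments sharing an even number of letters with each
# independent interaction (this automatically holds for the others too).
key_block = []
for levels in product((0, 1), repeat=len(FACTORS)):
    treatment = "".join(f.lower() for f, lv in zip(FACTORS, levels) if lv)
    if all(len(set(treatment.upper()) & set(e)) % 2 == 0 for e in INDEPENDENT):
        key_block.append(treatment or "(1)")

print(confounded)
print(key_block)
```

The remaining seven blocks follow by multiplying the key block by any treatment not yet used, reducing squared letters to unity.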
Q8 50M solve Stratified sampling, polynomial regression, split-plot designs
(a) (i) In stratified sampling under optimum allocation, how will you proceed to select units from different strata, if one or more nᵢ's happens to be greater than Nᵢ (i ≥ 2) ?
(ii) A sample survey was conducted in a certain district of Himachal Pradesh. Four strata A, B, C and D of villages were formed according to the acreage of fruit trees as obtained from revenue records. A random sample of villages was selected from each stratum and the number of apple orchards in each selected village was noted. The data are shown below :
| Stratum | Total number of villages (Nᵢ) | Number of villages in sample (nᵢ) | Number of orchards in the selected villages |
|---------|------------------------------|-----------------------------------|---------------------------------------------|
| A (0 – 3 acres) | 275 | 15 | 2, 5, 1, 9, 6, 7, 0, 4, 7, 0, 5, 0, 0, 3, 0 |
| B (3 – 6 acres) | 146 | 10 | 21, 11, 7, 5, 6, 19, 5, 24, 30, 24 |
| C (6 – 15 acres) | 93 | 12 | 3, 10, 4, 11, 38, 11, 4, 46, 4, 18, 1, 39 |
| D (15 acres and above) | 62 | 11 | 30, 42, 20, 38, 29, 22, 31, 28, 66, 14, 15 |
Estimate the number of orchards in the district.
(b) (i) For a second order polynomial model with one predictor variable, derive the least squares normal equations clearly stating the conditions assumed. How will you interpret the parameters in this model ?
(ii) Describe why it is recommended to work with predictor variables centred around the mean. Comment on fitted values of the response variable in this case. Prove your claim.
(c) What are split-plot designs ? When do you recommend the use of such designs ? If e₁ and e₂ are the main plot and sub-plot errors respectively, both estimated in units of a single sub-plot, explain why e₁ is expected to be larger than e₂.
Answer approach & key points
This multi-part question demands solving numerical problems alongside theoretical derivations and explanations. Allocate approximately 35% effort to part (a) combining optimum allocation adjustment and stratified estimation with Himachal Pradesh data; 35% to part (b) covering polynomial regression derivation, centering benefits, and proof; and 30% to part (c) explaining split-plot designs with error comparison. Structure as: brief theoretical setup → step-by-step calculations/derivations → interpretation of results in context.
- For (a)(i): Explain the iterative adjustment procedure when nᵢ > Nᵢ in optimum allocation—set nᵢ = Nᵢ for such strata, recompute allocation for remaining strata using revised formula, and repeat until all nᵢ ≤ Nᵢ
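- For (a)(ii): Calculate each stratum mean ȳᵢ from the sample (the allocation is already fixed by the given nᵢ), compute the stratified estimate of the total Ŷ = ΣNᵢȳᵢ (about 6470 orchards for these data), attach its standard error from ΣNᵢ²(1 - nᵢ/Nᵢ)sᵢ²/nᵢ, and present the final estimate of total orchards in the district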
- For (b)(i): Derive normal equations for Y = β₀ + β₁X + β₂X² by minimizing Σ(Yᵢ - β₀ - β₁Xᵢ - β₂Xᵢ²)²; interpret β₀ as the mean response at X=0, β₁ as the slope of the curve at X=0 (the slope at a general X is β₁ + 2β₂X), β₂ as curvature/acceleration
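- For (b)(ii): Explain that centering (X - X̄) greatly reduces the correlation between the linear and quadratic terms (and removes it entirely when the X values are symmetric about X̄), which stabilizes the variance-covariance matrix of the estimates; prove fitted values remain identical by expanding β₀* + β₁*(X - X̄) + β₂*(X - X̄)² and showing it spans the same quadratic in X, so predicted Y is unchanged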
- For (c): Define split-plot designs as experiments with two sizes of experimental units where whole plots receive one factor and sub-plots receive another; recommend when one factor is harder/costlier to change; explain e₁ > e₂ due to additional whole-plot error component from main plot-to-main plot variation
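The numerical part (a)(ii) reduces to a few sums per stratum. The sketch below uses the data from the question table with the usual stratified estimator of a total and its estimated variance under simple random sampling without replacement within strata; the dictionary layout and variable names are illustrative.

```python
import math

# stratum: (N_i, orchards observed in the n_i sampled villages)
strata = {
    "A": (275, [2, 5, 1, 9, 6, 7, 0, 4, 7, 0, 5, 0, 0, 3, 0]),
    "B": (146, [21, 11, 7, 5, 6, 19, 5, 24, 30, 24]),
    "C": (93, [3, 10, 4, 11, 38, 11, 4, 46, 4, 18, 1, 39]),
    "D": (62, [30, 42, 20, 38, 29, 22, 31, 28, 66, 14, 15]),
}

total_hat = 0.0
var_hat = 0.0
for name, (N_i, ys) in strata.items():
    n_i = len(ys)
    mean_i = sum(ys) / n_i
    # Sample variance with divisor n_i - 1.
    s2_i = sum((v - mean_i) ** 2 for v in ys) / (n_i - 1)
    total_hat += N_i * mean_i
    # Contribution N_i^2 (1 - f_i) s_i^2 / n_i with f_i = n_i / N_i.
    var_hat += N_i ** 2 * (1 - n_i / N_i) * s2_i / n_i

se_hat = math.sqrt(var_hat)
print(round(total_hat, 2))  # estimated number of orchards in the district
print(round(se_hat, 2))     # estimated standard error of that total
```

In a written answer the same figures would be laid out as a table of nᵢ, ȳᵢ, sᵢ² and Nᵢȳᵢ by stratum before summing.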