Equivalent Q Values Within Statistical Variance and Linear Regression Framework
Source:vignettes/pdrm.rmd
pdrm.rmd
1. Q Values
\[ q = 1 - \frac{\sum\limits_{h = 1}^{L}N_h\sigma_h^2}{N\sigma^2} = 1 - \frac{\sum\limits_{h = 1}^{L}\sum\limits_{i = 1}^{N_h}\left({y_{hi} - \overline{y}_h}^2\right)}{\sum\limits_{i = 1}^{N} \left({y_i - \overline{y}}^2\right)} \]
2. Linear Regression Model with Dummy Variables
Consider a linear regression model where the response variable \(y\) depends on \(k\) dummy variables \(D_1, D_2, \dots, D_k\). Each dummy variable \(D_j\) indicates membership in a particular category or group, taking the value 1 if an observation belongs to category \(j\) and 0 otherwise. The model is expressed as:
\[ y_i = \beta_0 + \beta_1 D_{i1} + \beta_2 D_{i2} + \cdots + \beta_k D_{ik} + \epsilon_i \]
where: \(y_i\) is the response variable for the \(i\)-th observation, \(\beta_0\) is the intercept term, \(\beta_j\) is the coefficient associated with dummy variable \(D_j\), \(D_{ij}\) is the value of dummy variable \(D_j\) for the \(i\)-th observation, which is 1 if the \(i\)-th observation belongs to category \(j\) and 0 otherwise, \(\epsilon_i\) is the error term.
3. Behavior of Dummy Variables
Each observation belongs to exactly one of the categories represented
by the dummy variables. If the \(i\)-th
observation belongs to category \(j\),
then \(D_{ij} = 1\) and all other dummy
variables are 0
for that observation. Thus, the regression
equation for observation \(i\)
belonging to category \(j\) simplifies
to:
\[ y_i = \beta_0 + \beta_j + \epsilon_i \]
This equation holds for all observations \(i\) that belong to category \(j\).
4. Predicted Value for a Category
The predicted value \(\widehat{y}_j\) for an observation in
category \(j\) is the expected value of
\(y_i\), given that \(D_{ij} = 1\) and the other dummy variables
are 0
. Since the error term \(\epsilon_i\) has an expected value of 0,
the predicted value for category \(j\)
is:
\[ \widehat{y}_j = \mathbb{E}(y_i | D_{ij} = 1) = \beta_0 + \beta_j \]
5. Ordinary Least Squares (OLS) Estimation
In linear regression, the coefficients \(\beta_0, \beta_1, \dots, \beta_k\) are estimated using the Ordinary Least Squares (OLS) method, which minimizes the sum of squared residuals:
\[ \text{Minimize } \sum_{i=1}^n (y_i - \widehat{y}_i)^2 \]
The OLS solution leads to the estimate of \(\beta_j\) for each dummy variable such that the predicted value \(\widehat{y}_j = \beta_0 + \beta_j\) represents the mean value of \(y_i\) for all observations in category \(j\). Specifically, OLS ensures that:
\[ \widehat{y}_j = \frac{1}{N_j} \sum_{i: D_{ij} = 1} y_i = \bar{y}_j \]
where \(N_j\) is the number of observations in category \(j\), and \(\bar{y}_j\) is the mean response for category \(j\).
6. Conclusion
\[ q = 1 - \frac{\sum\limits_{h = 1}^{L}N_h\sigma_h^2}{N\sigma^2} = 1 - \frac{\sum\limits_{h = 1}^{L}\sum\limits_{i = 1}^{N_h}\left({y_{hi} - \overline{y}_h}^2\right)}{\sum\limits_{i = 1}^{N} \left({y_i - \overline{y}}^2\right)} = \frac{\sum\limits_{i = 1}^{N}\left({y_i - \widehat{y}_i}^2\right)}{\sum\limits_{i = 1}^{N}\left({y_i - \overline{y}}^2\right)} = R^2 \]
7. Real-world example
library(sf)
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(sesp)
library(gdverse)
NTDs = sf::st_as_sf(gdverse::NTDs, coords = c('X','Y'))
system.time({
go1 = sesp(incidence ~ ., data = NTDs, discvar = 'none',
model = 'ols', overlay = 'intersection')
})
## user system elapsed
## 0.22 0.03 0.65
system.time({
go2 = gd(incidence ~ ., data = NTDs,
type = c('factor','interaction'))
})
## user system elapsed
## 0.04 0.00 0.08
Factor detector
go1$factor
## # A tibble: 3 × 5
## Variable Qvalue AIC BIC LogLik
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 watershed 0.638 -10.0 -10.0 15.0
## 2 elevation 0.607 1.18 1.18 7.41
## 3 soiltype 0.386 79.7 79.7 -33.8
go2$factor
## # A tibble: 3 × 3
## variable `Q-statistic` `P-value`
## <chr> <dbl> <dbl>
## 1 watershed 0.638 0.000129
## 2 elevation 0.607 0.0434
## 3 soiltype 0.386 0.372
Interaction detector
go1$interaction
## # A tibble: 3 × 7
## Variable Interaction Qv1 Qv2 Qv12 Variable1 Variable2
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 watershed ∩ elevation Enhance, bi- 0.638 0.607 0.714 watershed elevation
## 2 watershed ∩ soiltype Enhance, bi- 0.638 0.386 0.736 watershed soiltype
## 3 elevation ∩ soiltype Enhance, bi- 0.607 0.386 0.664 elevation soiltype
go2$interaction
## # A tibble: 3 × 6
## variable1 variable2 Interaction `Variable1 Q-statistics` `Variable2 Q-statistics`
## <chr> <chr> <chr> <dbl> <dbl>
## 1 watershed elevation Enhance, bi- 0.638 0.607
## 2 watershed soiltype Enhance, bi- 0.638 0.386
## 3 elevation soiltype Enhance, bi- 0.607 0.386
## # ℹ 1 more variable: `Variable1 and Variable2 interact Q-statistics` <dbl>