Equivalent Q Values Within Statistical Variance and Linear Regression Framework • sesp

1. Q Values

2. Linear Regression Model with Dummy Variables

Consider a linear regression model where the response variable \(y\) depends on \(k\) dummy variables \(D_1, D_2, \dots, D_k\). Each dummy variable \(D_j\) indicates membership in a particular category or group, taking the value 1 if an observation belongs to category \(j\) and 0 otherwise. The model is expressed as:

\[ y_i = \beta_0 + \beta_1 D_{i1} + \beta_2 D_{i2} + \cdots + \beta_k D_{ik} + \epsilon_i \]

where: \(y_i\) is the response variable for the \(i\)-th observation, \(\beta_0\) is the intercept term, \(\beta_j\) is the coefficient associated with dummy variable \(D_j\), \(D_{ij}\) is the value of dummy variable \(D_j\) for the \(i\)-th observation, which is 1 if the \(i\)-th observation belongs to category \(j\) and 0 otherwise, \(\epsilon_i\) is the error term.

3. Behavior of Dummy Variables

Each observation belongs to exactly one of the categories represented by the dummy variables. If the \(i\)-th observation belongs to category \(j\), then \(D_{ij} = 1\) and all other dummy variables are 0 for that observation. Thus, the regression equation for observation \(i\) belonging to category \(j\) simplifies to:

\[ y_i = \beta_0 + \beta_j + \epsilon_i \]

This equation holds for all observations \(i\) that belong to category \(j\).

4. Predicted Value for a Category

The predicted value \(\widehat{y}_j\) for an observation in category \(j\) is the expected value of \(y_i\), given that \(D_{ij} = 1\) and the other dummy variables are 0. Since the error term \(\epsilon_i\) has an expected value of 0, the predicted value for category \(j\) is:

\[ \widehat{y}_j = \mathbb{E}(y_i | D_{ij} = 1) = \beta_0 + \beta_j \]

5. Ordinary Least Squares (OLS) Estimation

In linear regression, the coefficients \(\beta_0, \beta_1, \dots, \beta_k\) are estimated using the Ordinary Least Squares (OLS) method, which minimizes the sum of squared residuals:

\[ \text{Minimize } \sum_{i=1}^n (y_i - \widehat{y}_i)^2 \]

The OLS solution leads to the estimate of \(\beta_j\) for each dummy variable such that the predicted value \(\widehat{y}_j = \beta_0 + \beta_j\) represents the mean value of \(y_i\) for all observations in category \(j\). Specifically, OLS ensures that:

\[ \widehat{y}_j = \frac{1}{N_j} \sum_{i: D_{ij} = 1} y_i = \bar{y}_j \]

where \(N_j\) is the number of observations in category \(j\), and \(\bar{y}_j\) is the mean response for category \(j\).

6. Conclusion

\[ q = 1 - \frac{\sum\limits_{h = 1}^{L}N_h\sigma_h^2}{N\sigma^2} = 1 - \frac{\sum\limits_{h = 1}^{L}\sum\limits_{i = 1}^{N_h}\left({y_{hi} - \overline{y}_h}^2\right)}{\sum\limits_{i = 1}^{N} \left({y_i - \overline{y}}^2\right)} = \frac{\sum\limits_{i = 1}^{N}\left({y_i - \widehat{y}_i}^2\right)}{\sum\limits_{i = 1}^{N}\left({y_i - \overline{y}}^2\right)} = R^2 \]

7. Real-world example

library(sf)
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(sesp)
library(gdverse)

NTDs = sf::st_as_sf(gdverse::NTDs, coords = c('X','Y'))

system.time({
go1 = sesp(incidence ~ ., data = NTDs, discvar = 'none',
           model = 'ols', overlay = 'intersection')
})
##    user  system elapsed 
##    0.22    0.03    0.65

system.time({
go2 = gd(incidence ~ ., data = NTDs,
         type = c('factor','interaction'))
})
##    user  system elapsed 
##    0.04    0.00    0.08

Factor detector

go1$factor
## # A tibble: 3 × 5
##   Variable  Qvalue    AIC    BIC LogLik
##   <chr>      <dbl>  <dbl>  <dbl>  <dbl>
## 1 watershed  0.638 -10.0  -10.0   15.0 
## 2 elevation  0.607   1.18   1.18   7.41
## 3 soiltype   0.386  79.7   79.7  -33.8
go2$factor
## # A tibble: 3 × 3
##   variable  `Q-statistic` `P-value`
##   <chr>             <dbl>     <dbl>
## 1 watershed         0.638  0.000129
## 2 elevation         0.607  0.0434  
## 3 soiltype          0.386  0.372

Interaction detector

go1$interaction
## # A tibble: 3 × 7
##   Variable              Interaction    Qv1   Qv2  Qv12 Variable1 Variable2
##   <chr>                 <chr>        <dbl> <dbl> <dbl> <chr>     <chr>    
## 1 watershed ∩ elevation Enhance, bi- 0.638 0.607 0.714 watershed elevation
## 2 watershed ∩ soiltype  Enhance, bi- 0.638 0.386 0.736 watershed soiltype 
## 3 elevation ∩ soiltype  Enhance, bi- 0.607 0.386 0.664 elevation soiltype
go2$interaction
## # A tibble: 3 × 6
##   variable1 variable2 Interaction  `Variable1 Q-statistics` `Variable2 Q-statistics`
##   <chr>     <chr>     <chr>                           <dbl>                    <dbl>
## 1 watershed elevation Enhance, bi-                    0.638                    0.607
## 2 watershed soiltype  Enhance, bi-                    0.638                    0.386
## 3 elevation soiltype  Enhance, bi-                    0.607                    0.386
## # ℹ 1 more variable: `Variable1 and Variable2 interact Q-statistics` <dbl>