| Title: | Conditional Expectation Function Estimation with K-Conditional-Means |
|---|---|
| Description: | Implementation of the KCMeans regression estimator studied by Wiemann (2023) <arXiv:2311.17021> for estimating expectation functions conditional on categorical variables. Computation leverages the unconditional one-dimensional KMeans implementation based on the dynamic programming algorithm of Wang and Song (2011) <doi:10.32614/RJ-2011-015>, allowing for global solutions in time polynomial in the number of observed categories. |
| Authors: | Thomas Wiemann [aut, cre] |
| Maintainer: | Thomas Wiemann <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.0.9000 |
| Built: | 2026-05-08 06:08:24 UTC |
| Source: | https://github.com/thomaswiemann/kcmeans |
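The dynamic programming idea mentioned in the Description can be illustrated with a short base-R sketch. This is not the package's code and not the Ckmeans.1d.dp implementation: the function `kmeans_1d_dp` and its arguments are hypothetical names, and the plain recursion below runs in time quadratic in the number of points per support point. It relies on the fact that optimal one-dimensional clusters are contiguous once the points are sorted.

```r
# Illustrative sketch only: weighted 1D k-means solved exactly by dynamic
# programming. `kmeans_1d_dp` is a hypothetical helper, not a package function.
kmeans_1d_dp <- function(m, w = rep(1, length(m)), K = 2) {
  stopifnot(length(m) == length(w), K >= 1, K <= length(m), all(w > 0))
  ord <- order(m)
  m_s <- m[ord]
  w_s <- w[ord]
  J <- length(m_s)
  # Prefix sums give the weighted sum of squares of any sorted block in O(1)
  cw <- c(0, cumsum(w_s))
  cwm <- c(0, cumsum(w_s * m_s))
  cwm2 <- c(0, cumsum(w_s * m_s^2))
  block_cost <- function(j, i) {
    sw <- cw[i + 1] - cw[j]
    swm <- cwm[i + 1] - cwm[j]
    swm2 <- cwm2[i + 1] - cwm2[j]
    max(swm2 - swm^2 / sw, 0)
  }
  D <- matrix(Inf, J, K) # D[i, k]: best cost of the first i points in k blocks
  B <- matrix(1L, J, K)  # B[i, k]: start index of the last block
  for (i in seq_len(J)) D[i, 1] <- block_cost(1, i)
  for (k in seq_len(K)[-1]) {
    for (i in k:J) {
      for (j in k:i) {
        cand <- D[j - 1, k - 1] + block_cost(j, i)
        if (cand < D[i, k]) {
          D[i, k] <- cand
          B[i, k] <- j
        }
      }
    }
  }
  # Backtrack from the last block to recover cluster labels in sorted order
  cl_s <- integer(J)
  i <- J
  for (k in K:1) {
    j <- B[i, k]
    cl_s[j:i] <- k
    i <- j - 1
  }
  cluster <- integer(J)
  cluster[ord] <- cl_s # map labels back to the input order
  centers <- as.numeric(tapply(w * m, cluster, sum) / tapply(w, cluster, sum))
  list(cluster = cluster, centers = centers, cost = D[J, K])
}

# Six category means with sample-share-like weights, three support points
m <- c(0.1, 2.9, 1.0, 0.0, 3.1, 1.1)
w <- c(1, 1, 2, 1, 1, 2)
fit <- kmeans_1d_dp(m, w, K = 3)
print(fit$cluster) # 1 3 2 1 3 2
print(fit$centers) # 0.05 1.05 3.00
```

Because the recursion considers every contiguous split, the result is a global minimizer of the weighted within-cluster sum of squares, unlike heuristic multi-start k-means.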
Implementation of the K-Conditional-Means estimator.
kcmeans(y, X, which_is_cat = 1, K = 2)
| Argument | Description |
|---|---|
| y | The outcome variable, a numerical vector. |
| X | A (sparse) feature matrix where one column is the categorical predictor. |
| which_is_cat | An integer indicating which column of X is the categorical predictor. |
| K | The number of support points, an integer of at least 2 (the default). |
kcmeans returns an object of S3 class kcmeans. An
object of class kcmeans is a list containing the following
components:
| Component | Description |
|---|---|
| cluster_map | A matrix that characterizes the estimated predictor of the residualized outcome $\tilde{Y} \equiv Y - X_{2:}^\top \hat{\pi}$. The first column x holds the value of the categorical variable; the remaining columns hold the corresponding unrestricted sample mean mean_x of $\tilde{Y}$, the sample share p_x, the estimated cluster cluster_x, and the estimated restricted sample mean mean_xK of $\tilde{Y}$ with just K support points. |
| mean_y | The unconditional sample mean of $\tilde{Y}$. |
| pi | The best linear prediction coefficients of $Y$ on $X$ corresponding to the non-categorical predictors $X_{2:}$. |
| which_is_cat, K | Passthrough of user-provided arguments. See above for details. |
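To make the columns of cluster_map concrete, the following base-R sketch builds a matrix with the same column names by hand. It is an illustration under stated assumptions, not the package's code: the residualization step (regressing y on category dummies and the continuous predictor) is assumed, and `stats::kmeans` is swapped in as a heuristic, unweighted stand-in for the globally optimal weighted clustering the package relies on.

```r
set.seed(1)
n <- 800
X <- rnorm(n)                          # continuous predictor
Z <- sample(1:20, n, replace = TRUE)   # categorical predictor
y <- (Z %% 4) + X + rnorm(n)           # outcome
K <- 4

# ASSUMPTION: residualize y using the coefficient on X from a regression
# that also includes one dummy per category
ols_fit <- lm(y ~ 0 + factor(Z) + X)
pi_hat <- unname(coef(ols_fit)["X"])
y_tilde <- y - X * pi_hat

# Unrestricted per-category means and sample shares
x <- sort(unique(Z))
mean_x <- as.numeric(tapply(y_tilde, Z, mean))
p_x <- as.numeric(table(Z)) / n

# Stand-in clustering step (heuristic and unweighted, unlike the package)
km <- kmeans(mean_x, centers = K, nstart = 25)
cluster_x <- km$cluster

# Restricted means: share-weighted average of mean_x within each cluster
num <- tapply(p_x * mean_x, cluster_x, sum)
den <- tapply(p_x, cluster_x, sum)
mean_xK <- as.numeric((num / den)[as.character(cluster_x)])

cluster_map <- cbind(x, mean_x, p_x, cluster_x, mean_xK)
print(head(cluster_map))
```

Each row describes one observed category, so the matrix has as many rows as there are categories and at most K distinct values in mean_xK.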
Wang H and Song M (2011). "Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming." The R Journal 3(2), 29–33.
Wiemann T (2023). "Optimal Categorical Instruments." https://arxiv.org/abs/2311.17021
# Simulate simple dataset with n=800 observations
X <- rnorm(800) # continuous predictor
Z <- sample(1:20, 800, replace = TRUE) # categorical predictor
Z0 <- Z %% 4 # lower-dimensional latent categorical variable
y <- Z0 + X + rnorm(800) # outcome
# Compute kcmeans with four support points
kcmeans_fit <- kcmeans(y, cbind(Z, X), K = 4)
# Print the estimated support points of the categorical predictor
print(unique(kcmeans_fit$cluster_map[, "mean_xK"]))
Prediction method for the K-Conditional-Means estimator.
## S3 method for class 'kcmeans'
predict(object, newdata, clusters = FALSE, ...)
| Argument | Description |
|---|---|
| object | An object of class kcmeans. |
| newdata | A (sparse) feature matrix where the first column corresponds to the categorical predictor. |
| clusters | A boolean indicating whether estimated clusters should be returned. |
| ... | Currently unused. |
A numerical vector with predicted values (if clusters = FALSE)
or predicted clusters (if clusters = TRUE).
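One way to picture the two return modes is a lookup in cluster_map followed by adding back the linear part. The sketch below is an assumption about how such a prediction could work, not the package's predict method: `predict_sketch` and the hand-built `fit` list are hypothetical, and the fallback to mean_y for categories not seen during fitting is assumed rather than documented here.

```r
# Hypothetical fitted object with the documented component names
fit <- list(
  cluster_map = cbind(
    x = c(1, 2, 3, 4),
    mean_x = c(0.1, 1.9, 0.3, 2.1),
    p_x = c(0.25, 0.25, 0.25, 0.25),
    cluster_x = c(1, 2, 1, 2),
    mean_xK = c(0.2, 2.0, 0.2, 2.0)
  ),
  mean_y = 1.1,
  pi = 0.5
)

predict_sketch <- function(fit, newdata, clusters = FALSE) {
  # Locate each new category in the first column of cluster_map
  row_id <- match(newdata[, 1], fit$cluster_map[, "x"])
  if (clusters) {
    return(unname(fit$cluster_map[row_id, "cluster_x"]))
  }
  fitted <- fit$cluster_map[row_id, "mean_xK"]
  # ASSUMPTION: unseen categories fall back to the unconditional mean
  fitted[is.na(row_id)] <- fit$mean_y
  # Add the linear contribution of the non-categorical predictors, if any
  X_rest <- newdata[, -1, drop = FALSE]
  if (ncol(X_rest) > 0) {
    fitted <- fitted + as.numeric(X_rest %*% fit$pi)
  }
  unname(fitted)
}

newdata <- cbind(Z = c(2, 3, 9), X = c(1, -1, 0))
print(predict_sketch(fit, newdata))                  # 2.5 -0.3 1.1
print(predict_sketch(fit, newdata, clusters = TRUE)) # 2 1 NA
```

In this sketch, category 9 does not appear in cluster_map, so its predicted value is mean_y and its cluster is NA.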
Wiemann T (2023). "Optimal Categorical Instruments." https://arxiv.org/abs/2311.17021
# Simulate simple dataset with n=800 observations
X <- rnorm(800) # continuous predictor
Z <- sample(1:20, 800, replace = TRUE) # categorical predictor
Z0 <- Z %% 4 # lower-dimensional latent categorical variable
y <- Z0 + X + rnorm(800) # outcome
# Compute kcmeans with four support points
kcmeans_fit <- kcmeans(y, cbind(Z, X), K = 4)
# Calculate in-sample predictions
fitted_values <- predict(kcmeans_fit, cbind(Z, X))
# Print sample share of estimated clusters
clusters <- predict(kcmeans_fit, cbind(Z, X), clusters = TRUE)
table(clusters)