-
Notifications
You must be signed in to change notification settings - Fork 0
/
030-bigsimr-package.Rmd
207 lines (132 loc) · 8.88 KB
/
030-bigsimr-package.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
# The `bigsimr` R package {#package}
```{r ch030-basic-example-setup, cache=FALSE}
box::use(
patchwork[...],
dplyr[...],
ggplot2[...],
tidyr[drop_na],
bigsimr[bigsimr_setup, distributions_setup],
JuliaCall[julia_call]
)
Sys.setenv(JULIA_NUM_THREADS = parallel::detectCores())
bs <- bigsimr_setup(pkg_check = FALSE)
dist <- distributions_setup()
```
This section describes a low-dimensional (2D) random vector simulation workflow via the `bigsimr` `R` package (https://github.com/SchisslerGroup/r-bigsimr). The package `bigsimr` provides an interface to the native code written in Julia, registered as the `Bigsimr` Julia package (https://github.com/adknudson/Bigsimr.jl). In addition to the native Julia `Bigsimr` package and `R` interface `bigsimr`, we also provide a python interface `bigsimr` (https://github.com/SchisslerGroup/python-bigsimr/) that interfaces with the Julia `Bigsimr` package. The Julia package provides a high-performance implementation of our proposed random vector generation algorithm and associated functions (see Section \@ref(algorithms)).
The subsections below describe the basic use of `bigsimr` by stepping through an example workflow using the data set `airquality` that contains daily air quality measurements in New York, May to September 1973 [@Chambers1983]. This workflow proceeds from setting up the computing environment, to data wrangling, estimation, simulation configuration, random vector generation, and, finally, result visualization.
## Bivariate example description
We illustrate the use of `bigsimr` motivated by the New York air quality data set (`airquality`) included in the R `datasets` package. First, we load the `bigsimr` library and a few other convenient data science packages, including the syntactically-elegant `tidyverse` suite of `R` packages. The code chunk below prepares the computing environment:
```{r ch030-computeEnvironmentSetup, eval=FALSE, echo=TRUE}
library("tidyverse")
library("bigsimr")
# Activate multithreading in Julia
Sys.setenv(JULIA_NUM_THREADS = parallel::detectCores())
# Load the Bigsimr and Distributions Julia packages
bs <- bigsimr_setup()
dist <- distributions_setup()
```
For simplicity and to provide a minimal working example, we consider bivariate simulation of the two `airquality` variables `Temperature`, in degrees Fahrenheit, and `Ozone` level, in parts per billion.
```{r ch030-air-quality-data, echo=TRUE}
df <- airquality %>% select(Temp, Ozone) %>% drop_na()
```
```{r ch030-aq-glimpse}
glimpse(df)
```
Figure \@ref(fig:ch030-aq-joint-dist) visualizes the bivariate relationship between Ozone and Temperature. We aim to simulate random two-component vectors mimicking this structure. The margins are not normally distributed; particularly the Ozone level exhibits a strong positive skew.
```{r ch030-aq-joint-dist, cache=FALSE, fig.width= 8, fig.cap="Bivariate scatterplot of Ozone vs. Temp with estimated marginal densities. We model the Ozone data as marginally log-normal and the Temperature data as normal."}
p0 <- ggplot(df, aes(Temp, Ozone)) +
geom_point(size = 1) +
theme(legend.position = "none") +
labs(x = "Temperature")
pTemp <- ggplot(df, aes(Temp)) +
geom_density() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
pOzone <- ggplot(df, aes(Ozone)) +
geom_density() +
theme(axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank()) +
coord_flip()
pTemp + plot_spacer() + p0 + pOzone +
plot_layout(widths = c(3,1), heights = c(1, 3))
```
Next, we specify the marginal distributions and correlation coefficient (both type and magnitude). Here the analyst is free to be creative. For this example, we avoid goodness-of-fit considerations to determine the marginal distributions. But it seems sensible without domain knowledge to estimate these quantities from the data, and `bigsimr` contains fast functions designed for this task.
## Specifying marginal distributions
Based on the estimated densities in Figure \@ref(fig:ch030-aq-joint-dist), we assume `Temp` is normally distributed and `Ozone` is log-normally distributed, as the latter values are positive and skewed. We use the well-known, unbiased estimators for the normal distribution's parameters and maximum likelihood estimators for the log-normal parameters:
```{r ch030-aq-temp-pars, echo=TRUE}
df %>% select(Temp) %>%
summarise_all(.funs = c(mean = mean, sd = sd))
```
```{r ch030-aq-ozone-pars, echo=TRUE}
mle_mean <- function(x) mean(log(x))
mle_sd <- function(x) mean( sqrt( (log(x) - mean(log(x)))^2 ) )
df %>%
select(Ozone) %>%
summarise_all(.funs = c(meanlog = mle_mean, sdlog = mle_sd))
```
Next, we configure the input marginals for later input into `rvec`. The marginal distributions are specified using `Julia`'s `Distributions` package and stored in a vector.
```{r ch030-margins-alist, echo=TRUE}
margins <- c(dist$Normal(mean(df$Temp), sd(df$Temp)),
dist$LogNormal(mle_mean(df$Ozone), mle_sd(df$Ozone)))
```
## Specifying correlation
As mentioned, the user must decide how to describe correlation based on the particulars of the problem. For non-normal data and for improved simulation accuracy/scalability in our scheme, we advocate the use of Spearman's $\rho$ correlation matrix $R_S$ and Kendall's $\tau$ correlation matrix $R_K$. We also support Pearson correlation coefficient matching, while cautioning the user to check the performance for their parametric multivariate model (see [Monte Carlo evaluations](#simulations) below for evaluation strategies and guidance). These estimation methods are classical approaches, not designed for high-dimensional correlation estimation (see the [Conclusion and Discussion]({#discussion) sections for more on this).
```{r ch030-aq-cor, echo=TRUE}
(R_S <- bs$cor(as.matrix(df), bs$Spearman))
```
## Checking target correlation matrix admissibility
First we use `cor_bounds` to ensure that the pairwise target correlation values are valid prior to the mapping step which constructs $R_X$ for the MVN input into NORTA (see Section \@ref(algorithms)). `cor_bounds` estimates the pairwise lower and upper correlation bounds using the Generate, Sort, and Correlate algorithm of @DH2011.
```{r ch030-cor-bounds, echo=TRUE}
bounds <- bs$cor_bounds(margins)
bounds$lower
bounds$upper
```
Since our single estimated Spearman correlation is within the theoretical bounds, the correlation is valid as input to `rvec`. But even if the 2-dimensional bounds are satisfied for each pair of margins, this does not guarantee the feasibility of a $d-$variate distribution [@BF17].
To provide higher dimensional feasibility/admissibility checking, we begin by mapping using `cor_convert` and check admissibility.
```{r ch030-cor-convert, echo=TRUE}
# Step 1. Mapping
(R_X <- bs$cor_convert(R_S, bs$Spearman, bs$Pearson))
# Step 2. Check admissibility
bs$iscorrelation(R_X)
```
The bounds on the Pearson correlation coefficient $R_P$ between these margins are restricted as seen below in our MC estimated correlation bounds:
```{r ch030-cor-bounds-pearson, echo=TRUE}
bs$cor_bounds(margins[1], margins[2], bs$Pearson, n_samples = 1e6)
```
Our MC estimate of the bounds slightly overestimates the theoretic bounds of $(-0.881, 0.881)$. An analytic derivation is presented as an Appendix.
## Simulating random vectors
Finally, we arrive at the main function of `bigsimr`: `rvec`. We now simulate $B=10,000$ random vectors from the assumed joint distribution of Ozone levels and Temp.
```{r ch030-sim-margins, echo=TRUE}
x <- bs$rvec(10000, R_X, margins)
df_sim <- as.data.frame(x)
colnames(df_sim) <- colnames(df)
```
Figure \@ref(fig:ch030-plot-sim) plots the 10,000 simulated points.
```{r ch030-plot-sim, fig.cap="Contour plot and marginal densities for the simulated bivariate distribution of Air Quality Temperatures and Ozone levels. The simulated points mimic the observed data with respect to both the marginal characteristics and bivariate association."}
p1 <- df_sim %>%
ggplot(aes(Temp, Ozone)) +
geom_density_2d_filled() +
theme(legend.position = "none") +
labs(x = "Simulated Temperature", y = "Simulated Ozone") +
scale_y_continuous(limits = c(0, max(df$Ozone))) +
scale_x_continuous(limits = range(df$Temp))
p1Temp <- ggplot(df_sim, aes(Temp)) +
geom_density(alpha = 0.5, fill = "lightseagreen") +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
p1Ozone <- ggplot(df_sim, aes(Ozone)) +
geom_density(alpha = 0.5, fill = "lightseagreen") +
theme(axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank()) +
coord_flip()
p1Temp + plot_spacer() + p1 + p1Ozone +
plot_layout(widths = c(3,1), heights = c(1, 3))
```