030-bigsimr-package.Rmd

# The `bigsimr` R package {#package}


```{r ch030-basic-example-setup, cache=FALSE}
box::use(
  patchwork[...],
  dplyr[...],
  ggplot2[...],
  tidyr[drop_na],
  bigsimr[bigsimr_setup, distributions_setup],
  JuliaCall[julia_call]
)

Sys.setenv(JULIA_NUM_THREADS = parallel::detectCores())
bs <- bigsimr_setup(pkg_check = FALSE)
dist <- distributions_setup()
```


This section describes a low-dimensional (2D) random vector simulation workflow via the `bigsimr` `R` package (https://github.com/SchisslerGroup/r-bigsimr). The package `bigsimr` provides an interface to the native code written in Julia, registered as the `Bigsimr` Julia package (https://github.com/adknudson/Bigsimr.jl). In addition to the native Julia `Bigsimr` package and `R` interface `bigsimr`, we also provide a python interface `bigsimr` (https://github.com/SchisslerGroup/python-bigsimr/) that interfaces with the Julia `Bigsimr` package. The Julia package provides a high-performance implementation of our proposed random vector generation algorithm and associated functions (see Section \@ref(algorithms)).


The subsections below describe the basic use of `bigsimr` by stepping through an example workflow using the data set `airquality` that contains daily air quality measurements in New York, May to September 1973 [@Chambers1983]. This workflow proceeds from setting up the computing environment, to data wrangling, estimation, simulation configuration, random vector generation, and, finally, result visualization.


## Bivariate example description


We illustrate the use of `bigsimr` motivated by the New York air quality data set (`airquality`) included in the R `datasets` package. First, we load the `bigsimr` library and a few other convenient data science packages, including the syntactically-elegant `tidyverse` suite of `R` packages. The code chunk below prepares the computing environment:


```{r ch030-computeEnvironmentSetup, eval=FALSE, echo=TRUE}
library("tidyverse")
library("bigsimr")
# Activate multithreading in Julia
Sys.setenv(JULIA_NUM_THREADS = parallel::detectCores())
# Load the Bigsimr and Distributions Julia packages
bs <- bigsimr_setup()
dist <- distributions_setup()
```


For simplicity and to provide a minimal working example, we consider bivariate simulation of the two `airquality` variables `Temperature`, in degrees Fahrenheit, and `Ozone` level, in parts per billion.


```{r ch030-air-quality-data, echo=TRUE}
df <- airquality %>% select(Temp, Ozone) %>% drop_na()
```


```{r ch030-aq-glimpse}
glimpse(df)
```


Figure \@ref(fig:ch030-aq-joint-dist) visualizes the bivariate relationship between Ozone and Temperature. We aim to simulate random two-component vectors mimicking this structure. The margins are not normally distributed; particularly the Ozone level exhibits a strong positive skew.


```{r ch030-aq-joint-dist, cache=FALSE, fig.width= 8, fig.cap="Bivariate scatterplot of Ozone vs. Temp with estimated marginal densities. We model the Ozone data as marginally log-normal and the Temperature data as normal."}
p0 <- ggplot(df, aes(Temp, Ozone)) +
  geom_point(size = 1) +
  theme(legend.position = "none") + 
  labs(x = "Temperature")

pTemp <- ggplot(df, aes(Temp)) + 
  geom_density() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())

pOzone <- ggplot(df, aes(Ozone)) + 
    geom_density() +
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank()) +
  coord_flip()

pTemp + plot_spacer() + p0 + pOzone + 
  plot_layout(widths = c(3,1), heights = c(1, 3)) 
```


Next, we specify the marginal distributions and correlation coefficient (both type and magnitude). Here the analyst is free to be creative. For this example, we avoid goodness-of-fit considerations to determine the marginal distributions. But it seems sensible without domain knowledge to estimate these quantities from the data, and `bigsimr` contains fast functions designed for this task.


## Specifying marginal distributions 

Based on the estimated densities in Figure \@ref(fig:ch030-aq-joint-dist), we assume `Temp` is normally distributed and `Ozone` is log-normally distributed, as the latter values are positive and skewed. We use the well-known, unbiased estimators for the normal distribution's parameters and maximum likelihood estimators for the log-normal parameters:


```{r ch030-aq-temp-pars, echo=TRUE}
df %>% select(Temp) %>% 
  summarise_all(.funs = c(mean = mean, sd = sd))
```


```{r ch030-aq-ozone-pars, echo=TRUE}
mle_mean <- function(x) mean(log(x))
mle_sd <- function(x) mean( sqrt( (log(x) - mean(log(x)))^2 ) )
df %>% 
  select(Ozone) %>% 
  summarise_all(.funs = c(meanlog = mle_mean, sdlog = mle_sd))
```


Next, we configure the input marginals for later input into `rvec`. The marginal distributions are specified using `Julia`'s `Distributions` package and stored in a vector.


```{r ch030-margins-alist, echo=TRUE}
margins <- c(dist$Normal(mean(df$Temp), sd(df$Temp)),
             dist$LogNormal(mle_mean(df$Ozone), mle_sd(df$Ozone)))
```


## Specifying correlation

As mentioned, the user must decide how to describe correlation based on the particulars of the problem. For non-normal data and for improved simulation accuracy/scalability in our scheme, we advocate the use of Spearman's $\rho$ correlation matrix $R_S$ and Kendall's $\tau$ correlation matrix $R_K$. We also support Pearson correlation coefficient matching, while cautioning the user to check the performance for their parametric multivariate model (see [Monte Carlo evaluations](#simulations) below for evaluation strategies and guidance). These estimation methods are classical approaches, not designed for high-dimensional correlation estimation (see the [Conclusion and Discussion]({#discussion) sections for more on this).


```{r ch030-aq-cor, echo=TRUE}
(R_S <- bs$cor(as.matrix(df), bs$Spearman))
```


## Checking target correlation matrix admissibility


First we use `cor_bounds` to ensure that the pairwise target correlation values are valid prior to the mapping step which constructs $R_X$ for the MVN input into NORTA (see Section \@ref(algorithms)). `cor_bounds` estimates the pairwise lower and upper correlation bounds using the Generate, Sort, and Correlate algorithm of @DH2011.


```{r ch030-cor-bounds, echo=TRUE}
bounds <- bs$cor_bounds(margins)
bounds$lower
bounds$upper
```


Since our single estimated Spearman correlation is within the theoretical bounds, the correlation is valid as input to `rvec`. But even if the 2-dimensional bounds are satisfied for each pair of margins, this does not guarantee the feasibility of a $d-$variate distribution [@BF17].


To provide higher dimensional feasibility/admissibility checking, we begin by mapping using `cor_convert` and check admissibility.


```{r ch030-cor-convert, echo=TRUE}
# Step 1. Mapping
(R_X <- bs$cor_convert(R_S, bs$Spearman, bs$Pearson))
# Step 2. Check admissibility
bs$iscorrelation(R_X)
```


The bounds on the Pearson correlation coefficient $R_P$ between these margins are restricted as seen below in our MC estimated correlation bounds:


```{r ch030-cor-bounds-pearson, echo=TRUE}
bs$cor_bounds(margins[1], margins[2], bs$Pearson, n_samples = 1e6)
```


Our MC estimate of the bounds slightly overestimates the theoretic bounds of $(-0.881, 0.881)$. An analytic derivation is presented as an Appendix.


## Simulating random vectors


Finally, we arrive at the main function of `bigsimr`: `rvec`. We now simulate $B=10,000$ random vectors from the assumed joint distribution of Ozone levels and Temp.


```{r ch030-sim-margins, echo=TRUE}
x <- bs$rvec(10000, R_X, margins)
df_sim <- as.data.frame(x)
colnames(df_sim) <- colnames(df)
```


Figure \@ref(fig:ch030-plot-sim) plots the 10,000 simulated points.


```{r ch030-plot-sim, fig.cap="Contour plot and marginal densities for the simulated bivariate distribution of Air Quality Temperatures and Ozone levels. The simulated points mimic the observed data with respect to both the marginal characteristics and bivariate association."}
p1 <- df_sim %>%
  ggplot(aes(Temp, Ozone)) +
  geom_density_2d_filled() +
  theme(legend.position = "none") +
  labs(x = "Simulated Temperature", y = "Simulated Ozone") +
  scale_y_continuous(limits = c(0, max(df$Ozone))) +
  scale_x_continuous(limits = range(df$Temp))

p1Temp <- ggplot(df_sim, aes(Temp)) + 
  geom_density(alpha = 0.5, fill = "lightseagreen") +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())

p1Ozone <- ggplot(df_sim, aes(Ozone)) + 
  geom_density(alpha = 0.5, fill = "lightseagreen") + 
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank()) +
  coord_flip() 

p1Temp + plot_spacer() + p1 + p1Ozone + 
  plot_layout(widths = c(3,1), heights = c(1, 3))
```