Skip to contents

Purpose

The simulations here are meant to be portable, toy-examples of eSVD2, with the intent to demonstrate that an installation of eSVD2 was successful.

A successful installation of eSVD2 is required for this. See the last section of this README of all the system/package dependencies used when creating these simulations. Both simulations should complete in less than 2 minutes.

Overview

Broadly speaking, eSVD-DE (the name of the method, which is implemented in the eSVD2 package) is a pipeline of 6 different function calls. The procedure starts with primarily 3 inputs,

  • dat, either matrix or dgCMatrix, with n rows (for n cells) and p columns (for p genes). We advise making sure each row/column of either matrix has non-zero variance prior to using this pipeline.
  • covariates, a matrix with n rows with the same rownames as dat where the columns represent the different covariates Notably, this should contain only numerical columns (i.e., all categorical variables should have already been split into numerous indicator variables), and all the columns in covariates will (strictly speaking) be included in the eSVD matrix factorization model.
  • metadata_individual, a factor vector with length n which denotes which cell originates from which individual.

Simulating data

We create a simple dataset that has no differentially expressed genes via the eSVD2::generate_null() function.

set.seed(10)
res <- eSVD2::generate_null()
covariates <- res$covariates
metadata_individual <- res$metadata_individual
obs_mat <- res$obs_mat

If you have Seurat also installed, you can run the following lines as well,

set.seed(10)
meta_data <- data.frame(covariates[, c("CC", "Sex", "Age")])
meta_data$Individual <- metadata_individual
seurat_obj <- Seurat::CreateSeuratObject(counts = Matrix::t(obs_mat), meta.data = meta_data)
Seurat::VariableFeatures(seurat_obj) <- rownames(seurat_obj)
seurat_obj <- Seurat::NormalizeData(seurat_obj)
seurat_obj <- Seurat::ScaleData(seurat_obj)
seurat_obj <- Seurat::RunPCA(seurat_obj, verbose = F)
seurat_obj <- Seurat::RunUMAP(seurat_obj, dims = 1:5)

Seurat::DimPlot(seurat_obj, reduction = "umap", group.by = "CC")

The separation between case and control cells is purely determined by the individual covariates of Age and Sex, not the case-control status.

Using eSVD-DE

Empirically, we have found it useful to set omitted_variables as the Log_UMI and the batch variables. (Here, there are no batch variables.) This means that when eSVD2 reparameterizes the variables, it does not fiddle with the coefficient for Log_UMI. We found this beneficial since the coefficient for specifically the covariate Log_UMI bears a special meaning (as it impacts the normalization of the sequencing depth most directly), and we do not wish to “mix” covariates correlated with Log_UMI too early. (Overall, this is not a strict recommendation, however, we have found this practice to be beneficial in almost all instances we have used eSVD2, so we ourselves have not tried deviating from this convention.)

This typically looks like the following.

set.seed(10)
eSVD_obj <- eSVD2::initialize_esvd(
  dat = obs_mat,
  covariates = covariates,
  metadata_individual = metadata_individual,
  case_control_variable = "CC",
  bool_intercept = T,
  k = 5,
  lambda = 0.1,
  verbose = 0
)
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_Init",
  omitted_variables = "Log_UMI"
)

names(eSVD_obj)
#> [1] "dat"          "covariates"   "param"        "fit_Init"     "latest_Fit"  
#> [6] "case_control" "individual"
  • Step 2 (Fitting the eSVD object): We now fit the eSVD. Typically, we recommend fitting twice (i.e., calling eSVD2::opt_esvd() twice) due to the non-convex nature of the objective function. In the first call of eSVD2::opt_esvd(), hold all the coefficients to all the covariates aside from the case-control covariate (here, denoted as "CC" so we remove as much of the case-control covariate effect as possible. Then, in the second call of eSVD2::opt_esvd(), we allow all the coefficients to be optimized.

Note that in the last fit of eSVD2::opt_esvd(), we do not set any omitted_variables. This is our typical recommendation – it is good to allow all coefficients to be reparameterized right before the optimization is all done.

set.seed(10)
eSVD_obj <- eSVD2::opt_esvd(
  input_obj = eSVD_obj,
  l2pen = 0.1,
  max_iter = 100,
  offset_variables = setdiff(colnames(eSVD_obj$covariates), "CC"),
  tol = 1e-6,
  verbose = 0,
  fit_name = "fit_First",
  fit_previous = "fit_Init"
)
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_First",
  omitted_variables = "Log_UMI"
)

set.seed(10)
eSVD_obj <- eSVD2::opt_esvd(
  input_obj = eSVD_obj,
  l2pen = 0.1,
  max_iter = 100,
  offset_variables = NULL,
  tol = 1e-6,
  verbose = 0,
  fit_name = "fit_Second",
  fit_previous = "fit_First"
)

eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_Second",
  omitted_variables = NULL
)

names(eSVD_obj)
#> [1] "dat"          "covariates"   "param"        "fit_Init"     "latest_Fit"  
#> [6] "case_control" "individual"   "fit_First"    "fit_Second"
  • Step 3 (Estimating the nuisance (i.e., over-dispersion) parameters): After fitting the eSVD, we need to estimate the nuisance parameters. Since we are modeling the data via a negative binomial, this nuisance parameter is the over-dispersion parameter. This is done via the eSVD2::estimate_nuisance() function.
set.seed(10)
eSVD_obj <- eSVD2::estimate_nuisance(
  input_obj = eSVD_obj,
  bool_covariates_as_library = TRUE,
  bool_library_includes_interept = TRUE,
  bool_use_log = FALSE,
  min_val = 1e-4,
  verbose = 0
)

names(eSVD_obj)
#> [1] "dat"          "covariates"   "param"        "fit_Init"     "latest_Fit"  
#> [6] "case_control" "individual"   "fit_First"    "fit_Second"
  • Step 4 (Computing the posterior): With the over-disperison parameter estimated, we are now ready to compute the posterior mean and variance of each cell’s gene expression value. This posterior distribution is to balance out the fitted eSVD denoised value against the covariate-adjusted library size. This step ensures that even though we pooled all the genes to denoise the single-cell expression matrix, we do not later inflate the Type-1 error. This is done via the eSVD2::compute_posterior() function.
set.seed(10)
eSVD_obj <- eSVD2::compute_posterior(
  input_obj = eSVD_obj,
  bool_adjust_covariates = FALSE,
  alpha_max = 100,
  bool_covariates_as_library = TRUE,
  bool_stabilize_underdispersion = TRUE,
  library_min = 1,
  pseudocount = 0
)

names(eSVD_obj)
#> [1] "dat"          "covariates"   "param"        "fit_Init"     "latest_Fit"  
#> [6] "case_control" "individual"   "fit_First"    "fit_Second"
set.seed(10)
eSVD_obj <- eSVD2::compute_test_statistic(input_obj = eSVD_obj, verbose = 0)

names(eSVD_obj)
#>  [1] "dat"          "covariates"   "param"        "fit_Init"     "latest_Fit"  
#>  [6] "case_control" "individual"   "fit_First"    "fit_Second"   "teststat_vec"
#> [11] "case_mean"    "control_mean"
set.seed(10)
eSVD_obj <- eSVD2::compute_pvalue(input_obj = eSVD_obj)

names(eSVD_obj)
#>  [1] "dat"          "covariates"   "param"        "fit_Init"     "latest_Fit"  
#>  [6] "case_control" "individual"   "fit_First"    "fit_Second"   "teststat_vec"
#> [11] "case_mean"    "control_mean" "pvalue_list"

To visualize the p-values, we can plot the theoretical quantiles against the observed quantiles. We see that despite the UMAP above showing strong differences between the case and control cells, the p-values are all uniformly distributed. This shows that we do not inflate the Type-1 error.

pvalue_vec <- 10 ^ (-eSVD_obj$pvalue_list$log10pvalue)
graphics::plot(
  x = sort(pvalue_vec),
  y = seq(0, 1, length.out = length(pvalue_vec)),
  asp = TRUE,
  pch = 16,
  xlab = "Observed quantile",
  ylab = "Theoretical quantile"
)
graphics::lines(
  x = c(0, 1),
  y = c(0, 1),
  col = "red",
  lty = 2,
  lwd = 2
)

Setup

The following shows the suggested package versions that the developer (GitHub username: linnykos) used when developing the eSVD2 package.

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14)
#>  os       Ubuntu 22.04.4 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       UTC
#>  date     2024-07-13
#>  pandoc   3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package          * version  date (UTC) lib source
#>  abind              1.4-5    2016-07-21 [1] RSPM
#>  bslib              0.7.0    2024-03-29 [1] RSPM
#>  cachem             1.1.0    2024-05-16 [1] RSPM
#>  cli                3.6.3    2024-06-21 [1] RSPM
#>  cluster            2.1.6    2023-12-01 [3] CRAN (R 4.4.1)
#>  codetools          0.2-20   2024-03-31 [3] CRAN (R 4.4.1)
#>  colorspace         2.1-0    2023-01-23 [1] RSPM
#>  cowplot            1.1.3    2024-01-22 [1] RSPM
#>  data.table         1.15.4   2024-03-30 [1] RSPM
#>  deldir             2.0-4    2024-02-28 [1] RSPM
#>  desc               1.4.3    2023-12-10 [1] RSPM
#>  devtools           2.4.5    2022-10-11 [1] RSPM
#>  digest             0.6.36   2024-06-23 [1] RSPM
#>  dotCall64          1.1-1    2023-11-28 [1] RSPM
#>  dplyr              1.1.4    2023-11-17 [1] RSPM
#>  ellipsis           0.3.2    2021-04-29 [1] RSPM
#>  eSVD2            * 1.0.1.01 2024-07-13 [1] local
#>  evaluate           0.24.0   2024-06-10 [1] RSPM
#>  fansi              1.0.6    2023-12-08 [1] RSPM
#>  farver             2.1.2    2024-05-13 [1] RSPM
#>  fastDummies        1.7.3    2023-07-06 [1] RSPM
#>  fastmap            1.2.0    2024-05-15 [1] RSPM
#>  fitdistrplus       1.2-1    2024-07-12 [1] RSPM
#>  foreach            1.5.2    2022-02-02 [1] RSPM
#>  fs                 1.6.4    2024-04-25 [1] RSPM
#>  future             1.33.2   2024-03-26 [1] RSPM
#>  future.apply       1.11.2   2024-03-28 [1] RSPM
#>  generics           0.1.3    2022-07-05 [1] RSPM
#>  ggplot2            3.5.1    2024-04-23 [1] RSPM
#>  ggrepel            0.9.5    2024-01-10 [1] RSPM
#>  ggridges           0.5.6    2024-01-23 [1] RSPM
#>  glmnet             4.1-8    2023-08-22 [1] RSPM
#>  globals            0.16.3   2024-03-08 [1] RSPM
#>  glue               1.7.0    2024-01-09 [1] RSPM
#>  gmp                0.7-4    2024-01-15 [1] RSPM
#>  goftest            1.2-3    2021-10-07 [1] RSPM
#>  gridExtra          2.3      2017-09-09 [1] RSPM
#>  gtable             0.3.5    2024-04-22 [1] RSPM
#>  highr              0.11     2024-05-26 [1] RSPM
#>  htmltools          0.5.8.1  2024-04-04 [1] RSPM
#>  htmlwidgets        1.6.4    2023-12-06 [1] RSPM
#>  httpuv             1.6.15   2024-03-26 [1] RSPM
#>  httr               1.4.7    2023-08-15 [1] RSPM
#>  ica                1.0-3    2022-07-08 [1] RSPM
#>  igraph             2.0.3    2024-03-13 [1] RSPM
#>  irlba              2.3.5.1  2022-10-03 [1] RSPM
#>  iterators          1.0.14   2022-02-05 [1] RSPM
#>  jquerylib          0.1.4    2021-04-26 [1] RSPM
#>  jsonlite           1.8.8    2023-12-04 [1] RSPM
#>  KernSmooth         2.23-24  2024-05-17 [3] CRAN (R 4.4.1)
#>  knitr              1.48     2024-07-07 [1] RSPM
#>  labeling           0.4.3    2023-08-29 [1] RSPM
#>  later              1.3.2    2023-12-06 [1] RSPM
#>  lattice            0.22-6   2024-03-20 [3] CRAN (R 4.4.1)
#>  lazyeval           0.2.2    2019-03-15 [1] RSPM
#>  leiden             0.4.3.1  2023-11-17 [1] RSPM
#>  lifecycle          1.0.4    2023-11-07 [1] RSPM
#>  listenv            0.9.1    2024-01-29 [1] RSPM
#>  lmtest             0.9-40   2022-03-21 [1] RSPM
#>  locfdr             1.1-8    2015-07-15 [1] RSPM
#>  magrittr           2.0.3    2022-03-30 [1] RSPM
#>  MASS               7.3-60.2 2024-04-26 [3] CRAN (R 4.4.1)
#>  Matrix             1.7-0    2024-04-26 [3] CRAN (R 4.4.1)
#>  matrixStats        1.3.0    2024-04-11 [1] RSPM
#>  memoise            2.0.1    2021-11-26 [1] RSPM
#>  mime               0.12     2021-09-28 [1] RSPM
#>  miniUI             0.1.1.1  2018-05-18 [1] RSPM
#>  munsell            0.5.1    2024-04-01 [1] RSPM
#>  nlme               3.1-164  2023-11-27 [3] CRAN (R 4.4.1)
#>  parallelly         1.37.1   2024-02-29 [1] RSPM
#>  patchwork          1.2.0    2024-01-08 [1] RSPM
#>  pbapply            1.7-2    2023-06-27 [1] RSPM
#>  pillar             1.9.0    2023-03-22 [1] RSPM
#>  pkgbuild           1.4.4    2024-03-17 [1] RSPM
#>  pkgconfig          2.0.3    2019-09-22 [1] RSPM
#>  pkgdown            2.1.0    2024-07-06 [1] RSPM
#>  pkgload            1.4.0    2024-06-28 [1] RSPM
#>  plotly             4.10.4   2024-01-13 [1] RSPM
#>  plyr               1.8.9    2023-10-02 [1] RSPM
#>  png                0.1-8    2022-11-29 [1] RSPM
#>  polyclip           1.10-6   2023-09-27 [1] RSPM
#>  profvis            0.3.8    2023-05-02 [1] RSPM
#>  progressr          0.14.0   2023-08-10 [1] RSPM
#>  promises           1.3.0    2024-04-05 [1] RSPM
#>  purrr              1.0.2    2023-08-10 [1] RSPM
#>  R6                 2.5.1    2021-08-19 [1] RSPM
#>  ragg               1.3.2    2024-05-15 [1] RSPM
#>  RANN               2.6.1    2019-01-08 [1] RSPM
#>  RColorBrewer       1.1-3    2022-04-03 [1] RSPM
#>  Rcpp             * 1.0.12   2024-01-09 [1] RSPM
#>  RcppAnnoy          0.0.22   2024-01-23 [1] RSPM
#>  RcppHNSW           0.6.0    2024-02-04 [1] RSPM
#>  remotes            2.5.0    2024-03-17 [1] RSPM
#>  reshape2           1.4.4    2020-04-09 [1] RSPM
#>  reticulate         1.38.0   2024-06-19 [1] RSPM
#>  rlang              1.1.4    2024-06-04 [1] RSPM
#>  rmarkdown          2.27     2024-05-17 [1] RSPM
#>  Rmpfr              0.9-5    2024-01-21 [1] RSPM
#>  ROCR               1.0-11   2020-05-02 [1] RSPM
#>  RSpectra           0.16-1   2022-04-24 [1] RSPM
#>  Rtsne              0.17     2023-12-07 [1] RSPM
#>  sass               0.4.9    2024-03-15 [1] RSPM
#>  scales             1.3.0    2023-11-28 [1] RSPM
#>  scattermore        1.2      2023-06-12 [1] RSPM
#>  sctransform        0.4.1    2023-10-19 [1] RSPM
#>  sessioninfo        1.2.2    2021-12-06 [1] RSPM
#>  Seurat           * 5.1.0    2024-05-10 [1] RSPM
#>  SeuratObject     * 5.0.2    2024-05-08 [1] RSPM
#>  shape              1.4.6.1  2024-02-23 [1] RSPM
#>  shiny              1.8.1.1  2024-04-02 [1] RSPM
#>  sp               * 2.1-4    2024-04-30 [1] RSPM
#>  spam               2.10-0   2023-10-23 [1] RSPM
#>  spatstat.data      3.1-2    2024-06-21 [1] RSPM
#>  spatstat.explore   3.2-7    2024-03-21 [1] RSPM
#>  spatstat.geom      3.2-9    2024-02-28 [1] RSPM
#>  spatstat.random    3.2-3    2024-02-29 [1] RSPM
#>  spatstat.sparse    3.1-0    2024-06-21 [1] RSPM
#>  spatstat.utils     3.0-5    2024-06-17 [1] RSPM
#>  stringi            1.8.4    2024-05-06 [1] RSPM
#>  stringr            1.5.1    2023-11-14 [1] RSPM
#>  survival           3.6-4    2024-04-24 [3] CRAN (R 4.4.1)
#>  systemfonts        1.1.0    2024-05-15 [1] RSPM
#>  tensor             1.5      2012-05-05 [1] RSPM
#>  textshaping        0.4.0    2024-05-24 [1] RSPM
#>  tibble             3.2.1    2023-03-20 [1] RSPM
#>  tidyr              1.3.1    2024-01-24 [1] RSPM
#>  tidyselect         1.2.1    2024-03-11 [1] RSPM
#>  urlchecker         1.0.1    2021-11-30 [1] RSPM
#>  usethis            2.2.3    2024-02-19 [1] RSPM
#>  utf8               1.2.4    2023-10-22 [1] RSPM
#>  uwot               0.2.2    2024-04-21 [1] RSPM
#>  vctrs              0.6.5    2023-12-01 [1] RSPM
#>  viridisLite        0.4.2    2023-05-02 [1] RSPM
#>  withr              3.0.0    2024-01-16 [1] RSPM
#>  xfun               0.45     2024-06-16 [1] RSPM
#>  xtable             1.8-4    2019-04-21 [1] RSPM
#>  yaml               2.3.9    2024-07-05 [1] RSPM
#>  zoo                1.8-12   2023-04-13 [1] RSPM
#> 
#>  [1] /home/runner/work/_temp/Library
#>  [2] /opt/R/4.4.1/lib/R/site-library
#>  [3] /opt/R/4.4.1/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────