
I am a tenure-track assistant professor at UW Biostatistics, starting in Fall 2023 (August 2023). Previously, I received my Ph.D. from the Department of Statistics & Data Science at Carnegie Mellon University, under the supervision of Dr. Kathryn Roeder and Dr. Jing Lei, and completed my post-doctoral training with Dr. Nancy R. Zhang.
My research mainly focuses on investigating regulation among biological systems (such as in the developing brain to understand the autism spectrum or in immune cells to understand acquired resistance to immunotherapy) through single-cell data. This entails developing statistical methods and leveraging recent statistical-theoretical developments in areas such as matrix factorizations and network analysis. Recently, I have been focused on developing methods tailored for paired multiomic data, where we can directly observe the relationships between two modalities. The code for my research is available on git, and my Google Scholar page is here.
When taking a break from work, I am a fanatic fan of zumba, cooking, Yu-Gi-Oh, and poetry. Follow my adventures on Instagram! (Fun fact: The "Z." in "Kevin Z. Lin" is simply for publishing purposes, and is not an official middle name.)

In genomics, we often observe a sequence of graphs over some notion of time, such as the gene network ordered by the cells' developmental age. However, there is a lack of concrete statistical theory for this setting whose assumptions are amenable to single-cell analyses. In this work, we develop an estimator designed for models where the underlying stochastic block model changes smoothly with time, and prove convergence rates for the clustering and for the estimates of the connectivity matrix under assumptions more general than those existing in the literature. We apply this to study the dynamics of gene co-expression networks among oligodendrocytes.

As single-cell RNA-seq (scRNA) transitions from studying cell-lines, clonal mice or a few individuals into studying cohorts of individuals, there is a need to create new differential expression (DE) methods to investigate DE among individuals instead of among cells. Towards this end, we equip the eSVD, formerly a dimension-reduction tool, with a downstream pipeline to perform DE in multi-individual scRNA data, where we both design our test statistics to account for variability among and within individuals and incorporate the individual-level covariates into the eSVD framework. We show that this can recover reproducible signals across different datasets and achieve higher power thanks to its dimension-reduction framework.

Dependency graphs encode complex pairwise patterns that are often statistically estimated, but are hard to diagnose with visualizations due to the quadratic number of scatter plots. In this paper, we develop an interactive system in R that learns what the data scientist visually interprets as dependency. Then, this system applies the learned classifier to infer a dependency graph that can be compared against the estimated graph. This paper won honorable mention for the Student Paper Award in the ASA Section: Statistical Computing and Statistical Graphics (2018).
Interactive dependency visualizer: Understanding pairwise variable relationships, applied to single-cell RNA-seq data
In preparation

Different sequencing technologies offer complementary insights into a biological system, but it becomes difficult to integrate two datasets when they differ in both the specific cells sequenced and the features measured. Hence, using an iterative co-embedding based on CCA, data smoothing, and cell matching, we match cells in one dataset (i.e., one modality) to the other. The effectiveness of our method is demonstrated by integrating scRNA-seq and surface-antibody data, as well as scRNA-seq data with spatial proteomic data (such as CODEX) and scATAC-seq.
Integration of spatial and single-cell data across modalities with weak linkage
Nature Biotechnology (To appear, 2023). (biorxiv)

Paired multiomic single-cell data provides ample opportunities for biologists to study the relations between molecular modalities (such as gene expression and chromatin peaks), but it is difficult to assess what kind of signals are shared between the two modalities. We developed the Tilted-CCA, which shows the "intersection of information" between the two modalities, in contrast with many existing dimension-reduction methods that show the "union of information." In particular, the matrix decomposition provided by Tilted-CCA allows us to design targeted antibody panels for RNA and surface-antibody technologies, and to uncover the terminal cell-states as well as the relation between chromatin open/closed cell-states and gene expression in developmental systems.
Quantifying common and distinct information in single-cell multimodal data with Tilted-CCA
Proceedings of the National Academy of Sciences (PNAS) (To appear, 2023). (link) (biorxiv)

There are different signals a researcher can extract from single-cell ATAC-seq, and each has its own biological relevance for downstream analysis. However, one might want to perform computational tasks that aggregate all such signals together. We propose a computational framework to aggregate across such signals for downstream visualization, cell-clustering and trajectory inference.
Destin2: Integrative and cross-modality analysis of single-cell chromatin accessibility data
Frontiers in Genetics 14 (2023), (link) (biorxiv)

While clustering of nodes is now theoretically well-understood for a single network through the lens of spectral embeddings, clustering in multi-layer networks is less well-understood. In this paper, when the clustering structure is shared across all the layers, we develop a simple way to aggregate information across layers via squaring the adjacency matrix with an appropriate bias-adjustment. Our theoretical analysis shows we can both allow for networks with dissociative structure (i.e., negative eigenvalues) and provably obtain consistent clustering even if each individual network is extremely sparse. We apply our method to study clusters of genes in developing monkey brains.
Bias-adjusted spectral clustering in multi-layer stochastic block models
Journal of the American Statistical Association (JASA) (To appear, 2022) (link) (arxiv) (git)
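As a rough illustration of the aggregation idea in the entry above, here is a minimal Python sketch. It assumes the bias-adjustment amounts to subtracting each layer's degree diagonal from its squared adjacency matrix. The function names, the plain k-means step, and the simulated two-community example are illustrative assumptions, not the code released with the paper.

```python
import numpy as np

def bias_adjusted_sum_of_squares(adjacency_layers):
    """Sum the squared adjacency matrices after removing each layer's degree diagonal."""
    n = adjacency_layers[0].shape[0]
    total = np.zeros((n, n))
    for A in adjacency_layers:
        # For a symmetric 0/1 matrix with zero diagonal, diag(A @ A) equals the
        # node degrees, so subtracting them removes the diagonal added by squaring.
        total += A @ A - np.diag(A.sum(axis=1))
    return total

def spectral_embedding(S, k):
    """Return the k eigenvectors of S with the largest absolute eigenvalues."""
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(np.abs(vals))[::-1][:k]
    return vecs[:, top]

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def simulate_layers(labels, block_probs, rng):
    """Draw one symmetric 0/1 adjacency matrix per block-probability matrix."""
    layers = []
    for B in block_probs:
        P = B[np.ix_(labels, labels)]
        upper = np.triu(rng.random(P.shape) < P, k=1).astype(float)
        layers.append(upper + upper.T)
    return layers

# Hypothetical example: two communities, half of the layers assortative and
# half dissociative, so simply summing the adjacency matrices would cancel the signal.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 150)
assortative = np.array([[0.12, 0.03], [0.03, 0.12]])
dissociative = np.array([[0.03, 0.12], [0.12, 0.03]])
layers = simulate_layers(truth, [assortative, dissociative] * 15, rng)

S = bias_adjusted_sum_of_squares(layers)
estimated = kmeans(spectral_embedding(S, 2), 2)
agreement = np.mean(estimated == truth)
accuracy = max(agreement, 1 - agreement)  # cluster labels are only defined up to swapping
print(f"clustering accuracy: {accuracy:.3f}")
```

Squaring each layer before summing keeps dissociative layers from cancelling assortative ones, since negative eigenvalues become positive. The diagonal subtraction is there because the diagonal of a squared adjacency matrix carries the node degrees rather than community signal.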

Methods to detect multiple changepoints in regression coefficients typically have a localization rate (i.e., the rate at which the locations of the changepoints are estimated) that scales inversely with the square root of the number of samples. In this paper, we develop VPWBS, a method that appropriately transforms the regression problem into a one-dimensional problem upon which we apply WBS. We prove the convergence rate scales inversely with the number of samples, a substantial improvement over existing rates.

Common pipelines to estimate cell developmental trajectories based on single-cell data typically first embed each cell into a lower-dimensional space, but these embeddings typically assume statistical models that do not model single-cell data well. In this paper, we develop an embedding based on a hierarchical model where the inner product between two latent low-dimensional vectors is the natural parameter of an exponential-family-distributed random variable, and prove identifiability and convergence. When studying oligodendrocytes in fetal mouse brains, we find that oligodendrocytes mature into various cell types.

Changepoint detection methods such as binary segmentation are often used in CGH analyses for copy number variation detection, but these methods lack proper downstream statistical inference. In this paper, we develop post-selection hypothesis tests for various changepoint detection methods, prove our sampling strategies' validity, and provide substantial practical guidelines based on simulation.
Valid post-selection inference for segmentation methods with application to copy number variation data
Biometrics 77.3 (2021): 1037-1049 (link) (arxiv) (pdf) (git 1, git 2)

Microarray samples from brain tissue are hard to collect and vary substantially depending on the tissue's brain region and the developmental age of its subject, so it is hard to collect enough samples for statistical analysis. In this paper, we develop a sample selection method to find additional microarray samples that are statistically similar to the samples of our desired spatio-temporal brain tissue. We demonstrate that after applying an existing analysis pipeline to our selected samples, we detect a higher percentage of autism risk genes.

We discuss "Network cross-validation by edge sampling" by Li, Levina, and Zhu (Biometrika, 2020), where we empirically explore the possibilities of combining the authors' ideas with ideas for providing confidence sets for cross-validated parameters, as well as extending the authors' ideas into the network tensor setting.
Discussion of 'Network cross-validation by edge sampling'
Biometrika 107.2 (2020): 285-287. (link) (pdf) (git)

While de novo mutations within the protein-coding portion of the genome have been thoroughly studied, these mutations in the noncoding portions, which comprise 98.5% of the genome, have been less well understood. In this paper, we use a bioinformatics framework that builds upon (among other things) sparse PCA and DAWN based on simulated null datasets to analyze 1902 autism quartets via WGS and find that the strongest signals arose from promoters -- noncoding regions that control gene transcription.
Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder
Science 362.6420 (2018). (link) (pdf) (git)

Changepoint estimators have statistical theory for how well they estimate the mean function and how well they estimate the changepoints, but existing theory often analyzes these properties separately. In this paper, we prove a near-optimal estimation rate for the fused lasso, which in turn directly proves a changepoint detection rate that is near the detection limit. We extend this logic to other estimators and settings.
A sharp error analysis for the fused Lasso, with application to approximate changepoint screening
Advances in Neural Information Processing Systems (NeurIPS) (2017). (link) (arxiv) (git)

Social biases can influence Wikipedia articles of individuals across different genders and ethnicities, resulting in disproportionate article lengths, number of languages written about said individuals, etc., but these differences can be difficult to quantify appropriately. In this paper, we develop a matching method to find an appropriate comparison set of articles for each target group of interest (e.g., articles of African American individuals) based on the articles' categories. We show the differing covariate distributions across the different target groups, and uncover quantitative results that reinforce existing social theories. This paper won the Wikimedia Foundation Research Award of the Year in 2023.
Controlled Analyses of Social Biases in Wikipedia Bios
Proceedings of the ACM Web Conference 2022, (link) (arxiv) (git)

Many compressed sensing algorithms are developed to be as generic as possible, but have shortcomings in specialized settings where modern optimization theory can deliver a substantial boost in computational efficiency. In this paper, we develop two compressed sensing algorithms, one specialized for extremely sparse signals and another specialized for Kronecker-structured sensing matrices. We numerically demonstrate a nearly 10-fold reduction in computation time compared to other state-of-the-art methods.
Revisiting compressed sensing: Exploiting the efficiency of simplex and sparsification methods
Mathematical Programming Computation 8.3 (2016): 253-269. (link) (pdf) (git)
- eSVD-DE: Cohort-level differential expression in single-cell data via matrix factorization (talk).
  2023 Joint Statistical Meetings (JSM), Toronto, Canada.
- Tilted-CCA: Quantifying common and distinct information in jointly-sequenced multiomic single-cell data (poster).
  2023 New Researcher's Conference (NRC), Toronto, Canada.
- Spectral clustering for heterophilic stochastic block models with dynamic node memberships (talk).
  2023 International Chinese Statistical Association (ICSA) Applied Statistics Symposium, Ann Arbor, MI.
- Tilted-CCA: Quantifying common and distinct information in jointly-sequenced multiomic single-cell data (talk).
  2022 Joint Statistical Meetings (JSM), Washington DC.
- Spectral clustering for multi-layer stochastic block models: Theoretical analysis of static and dynamic settings for heterophilic networks (talk).
  2022 Symposium on Data Science and Statistics (SDSS), Pittsburgh, PA.
- Exponential-family embedding for single-cell data with applications to developmental trajectory and differential expression (talk).
  2021 UCLA Department of Statistics: Seminar Series. (Remote)
- Time-varying stochastic block models, with application to understanding the dynamics of gene co-expression (talk).
  2021 StatScale Seminar. (Remote)
- Exponential-family embedding with application to cell developmental trajectories for single-cell RNA-Seq data (poster).
  2020 American Society of Human Genetics (ASHG). (Remote)
- Exponential-family embedding with application to cell developmental trajectories for single-cell RNA-Seq data (talk).
  Joint talk with Professor Kathryn Roeder
  2020 Joint Statistical Meetings (JSM), Philadelphia, Pennsylvania. (Remote)
- Time-varying stochastic block models via kernel smoothing, with application to RNA-Seq data and cell development (talk).
  2020 Joint Statistical Meetings (JSM), Philadelphia, Pennsylvania. (Remote)
- Exponential-family embedding with application to cell developmental trajectories for single-cell RNA-Seq data (talk).
  2019 Joint Statistical Meetings (JSM), Denver, Colorado.
- Dependency diagnostic: Visually understanding pairwise variable relationships (talk).
  2018 Joint Statistical Meetings (JSM), Vancouver, Canada.
- A sharp error analysis for the fused lasso, with application to approximate changepoint screening (poster).
  2017 Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA.
- Hypothesis testing for simultaneous variable clustering and correlation network estimation, with applications to gene coexpression networks (talk).
  2017 Joint Statistical Meetings (JSM), Baltimore, MD.
- Longitudinal Gaussian graphical model for autism risk gene detection (talk).
  2016 Joint Statistical Meetings (JSM), Chicago, IL.
- Longitudinal Gaussian graphical model integrating gene expression and sequencing data for autism risk gene detection (poster).
  2015 American Society of Human Genetics (ASHG), Baltimore, MD.
- Optimization for compressed sensing: New insights and alternatives (talk).
  2014 Modeling and Optimization: Theory and Applications, Bethlehem, PA.
- PhD TAs of the year
  (For the Spring 2020 semester, one of two recipients)
  Carnegie Mellon University, May 2020
- Honorable mention in student paper competition
  (For article "Dependency diagnostic: Visually understanding pairwise variable relationships")
  ASA section: Statistical Computing and Statistical Graphics, January 2018
- Winner of Statistical excellence for early-career writing
  (For article "We, the millennials: The statistical significance of political significance")
  Significance magazine in partnership with Young Statisticians Section of Royal Statistical Society, June 2017
- Teaching assistant award recipient
  (For "Statistical Computing" in Fall 2016)
  Carnegie Mellon University, May 2017
- Award recipient of Kenneth H. Condit Prize
  (For excellence in service to department)
  Princeton University, May 2014
- (2022 January) Section on Statistics in Genetics and Genomics: An Introduction to Deep Learning in Omics (TA for a virtual short course under W. Sun and N. R. Zhang)
- (2020 Spring) CMU: 36-469 Statistical Genomics and High Dimensional Inference (co-instructor with K. Roeder)
- (2019 Spring) CMU: 36-490 Undergraduate Research (Data Science Initiative Project Fellow under R. Nugent and P. Freeman)
- (2018 Summer) CMU: 36-350 Statistical Computing (Instructor)
- (2018 Spring) CMU: 36-350 Statistical Computing (Assistant Instructor with R. J. Tibshirani)
- (2017 Fall) CMU: 36-350 Statistical Computing (TA under P. Freeman)
- (2015 Fall, 2016 Fall) CMU: 36-350 Statistical Computing (TA under R. J. Tibshirani)
- (2015 Spring) CMU: 36-217 Probability Theory and Random Processes (TA under A. Rinaldo)
- (2014 Fall) CMU: 46-921 Financial Data Analysis I and 46-923 Financial Data Analysis II (TA under C. Schafer)
- (2014 Spring, 2013 Spring, 2012 Spring) Princeton: ORF 350 Analysis of Big Data (Course designer with H. Liu)
- Discussant for a reading group for the International Biometric Society Journal club (for "A stochastic block Ising model for multi-layer networks with inter-layer dependence" presented by J. Zhang, organized by G. Li, August 2023, virtual)
- Chaired an invited session at JSM titled "Addressing statistical challenges in precision medicine with single cell and spatial omics data" (organized by Z. Li, August 2023, Toronto, Canada)
- Certified by Mental Health First Aid USA (September 2020)
- Certified by CMU's Eberly Center's Future Faculty Program (Fall 2019 to Summer 2020)
- Certified via QPR (Question, Persuade, Refer) gatekeeper certificate by the QPR suicide prevention gatekeeper program (February 2020)
- Founding member of CMU Statistics department wellness network (2018-2020)
- Reviewer for Annals of Applied Statistics, Annals of Statistics, Bayesian Analysis, Biometrics, Biometrika, Cell Press, Electronic Journal of Statistics, IEEE Transactions on Network Science and Engineering, Journal of Machine Learning Research, Journal of Molecular Biology, Journal of the American Statistical Association, Nature Neuroscience, Statistica Sinica, Statistics and Probability Letters, Statistics in Medicine, Technometrics
Last Updated: July 31, 2023