I am a tenure-track assistant professor at UW Biostatistics, and started in Fall 2023. Previously, I received my Ph.D. from the Department of Statistics & Data Science at Carnegie Mellon University, under the supervision of Dr. Kathryn Roeder and Dr. Jing Lei, and completed my post-doctoral training with Dr. Nancy R. Zhang.
My research mainly focuses on investigating regulation among biological systems (such as in the aging brain to understand dementia and resilience or in immune cells to understand acquired resistance to immunotherapy) through single-cell data. This entails developing statistical methods, and leveraging recent statistical-theoretical developments in areas such as matrix factorizations and network analysis. I mainly focus on projects that are driven first-and-foremost by the science. What can current statistical and computational methods achieve, and why are those not enough to uncover the fascinating cell biology mechanisms that urgently need to be studied? Is it because current methods are not statistically robust, is it because the current frameworks are not satisfactory, or do new technologies offer new perspectives on how to model the biology?
When taking a break from work, I am a fanatic fan of zumba and cooking. Follow my adventures on Instagram! (Fun fact: The "Z." in "Kevin Z. Lin" is simply for publishing purposes, and is not an official middle name.)
Mental Health and Wellness sub-committee: I am the coordinator for the Mental Health and Wellness sub-committee in our Biostat department (within the EDI committee). If you are part of UW and would like to collaborate on mental health/wellness resources, would love to promote a new mental health/wellness initiative or training, or have related ideas for our department, please contact me!
Note: There are multiple people named "Kevin Lin" at the University of Washington. Be sure to contact me at the correct email (ID: kzlin)!
In static lineage-barcoding single-cell sequencing, we measure both each cell's gene expression and its ancestry in order to gain insights into a cell's fate. However, there is a lack of methods to identify which genes separate the cells by their lineages. In this work, we develop a deep-learning architecture based on contrastive learning to isolate the information in the gene expression that recapitulates cells' lineages. We use the lineage barcode as the premise of our data augmentation mechanism. This strategy enables us to better identify the genes driving fate commitment and cells' fate boundaries.
LCL: Contrastive Learning for Lineage Barcoded scRNA-seq Data
(biorxiv) (github)
In genomics, we often observe a sequence of graphs over some notion of time, such as the gene network ordered by the cells' developmental age. However, there is a lack of concrete statistical theory for this setting whose assumptions are amenable to single-cell analyses. In this work, we develop an estimator designed for models where the underlying stochastic block model is smoothly changing with time, and prove the convergence rates of the clustering and estimates of the connectivity matrix under assumptions more general than those existing in the literature. We apply this to study the dynamics of gene co-expression networks among oligodendrocytes.
Computational single-cell analyses have become quite complex. While there are many tools for any particular analysis, researchers new to single-cell analyses might often feel overwhelmed and not immediately understand the logical flow that connects one computational method to another. This review paper highlights the underlying logic of how to craft a computational workflow, tailored for the biological context of studying glial cells. It starts with thinking about the experimental design, and spotlights common pitfalls when planning your own computational workflow.
All the single cells: Single-cell transcriptomics/epigenomics experimental design and analysis considerations for glial biologists
Glia To appear (2024): (arxiv)
As single-cell RNA-seq (scRNA) transitions from studying cell-lines, clonal mice or a few individuals into studying cohorts of individuals, there is a need to create new differential expression (DE) methods to investigate DE among individuals instead of among cells. Towards this end, we tailor the eSVD, formerly a dimension-reduction tool, and equip it with a downstream pipeline to perform DE in multi-individual scRNA data. We design our test statistics to account for variability both among and within individuals, and incorporate individual-level covariates into the eSVD-DE framework. We show that this can recover reproducible signals across different datasets and achieves higher power thanks to its dimension-reduction framework.
eSVD-DE: Cohort-wide differential expression in single-cell RNA-seq data using exponential-family embeddings
BMC Bioinformatics 25.1 (2024): (link) (biorxiv) (code: method) (code: analysis) (code: tutorials)
Different sequencing technologies offer complementary insights into a biological system, but it becomes difficult to integrate two datasets when they differ both in the specific cells sequenced and in which features are measured. Hence, using an iterative co-embedding based on CCA, data smoothing, and cell matching, we match cells in one dataset (i.e., one modality) to the other. The effectiveness of our method is demonstrated by integrating scRNA-seq and surface-antibody data, as well as scRNA-seq data with spatial proteomic data (such as CODEX) and scATAC-seq.
Integration of spatial and single-cell data across modalities with weak linkage
Nature Biotechnology 42 (2023): (link) (biorxiv) (code: method) (code: tutorials)
Paired multiomic single-cell data provides ample opportunities for biologists to study the relations between molecular modalities (such as gene expression and chromatin peaks), but it is difficult to assess what kind of signals are shared between the two modalities. We developed the Tilted-CCA, which shows the "intersection of information" between the two modalities, in contrast with many existing dimension-reduction methods that show the "union of information." In particular, the matrix decomposition provided by Tilted-CCA allows us to design targeted antibody panels for RNA and surface-antibody technologies, and to uncover the terminal cell-states and the relation between chromatin open/closed cell-states and gene expression in developmental systems.
Quantifying common and distinct information in single-cell multimodal data with Tilted-CCA
Proceedings of the National Academy of Sciences (PNAS) 120.23 (2023): (link) (biorxiv) (code: method) (code: analysis) (code: tutorials)
There are different signals a researcher can extract from single-cell ATAC-seq, and each has its own biological relevance for downstream analysis. However, one might want to perform computational tasks that aggregate all such signals together. We propose a computational framework to aggregate across such signals for downstream visualization, cell-clustering and trajectory inference.
Destin2: Integrative and cross-modality analysis of single-cell chromatin accessibility data
Frontiers in Genetics 14 (2023): (link) (biorxiv)
While clustering of nodes is now theoretically well-understood for a single network through the lens of spectral embeddings, clustering in multi-layer networks is less well-understood. In this paper, when the clustering structure is shared across all the layers, we develop a simple way to aggregate information across layers via squaring the adjacency matrix with an appropriate bias-adjustment. Our theoretical analysis shows we can both allow for networks with dissociative structure (i.e., negative eigenvalues) and provably obtain consistent clustering even if each individual network is extremely sparse. We apply our method to study clusters of genes in developing monkey brains.
Bias-adjusted spectral clustering in multi-layer stochastic block models
Journal of the American Statistical Association (JASA) To appear (2022): (link) (arxiv) (git)
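As a rough illustration of the aggregation idea (not the paper's code), the sketch below sums the squared adjacency matrices across layers and subtracts each layer's diagonal degree matrix. Treating the bias-adjustment as subtraction of the degree diagonal is an assumption made for this sketch, and the function names and toy layers are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bias_adjusted_sum_of_squares(layers):
    """Sum A^2 over layers, subtracting each layer's node degrees on the diagonal.

    Assumption for this sketch: each layer is a symmetric 0/1 adjacency matrix
    without self-loops, so the diagonal of A^2 equals the node degrees.
    """
    n = len(layers[0])
    total = [[0] * n for _ in range(n)]
    for adj in layers:
        squared = matmul(adj, adj)
        degrees = [sum(row) for row in adj]
        for i in range(n):
            for j in range(n):
                total[i][j] += squared[i][j]
            total[i][i] -= degrees[i]  # remove the diagonal bias
    return total


# Toy example with two groups, {0, 1} and {2, 3} (hypothetical data).
layers = [
    # Layer with edges only within groups.
    [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]],
    # Dissociative layer with edges only between groups.
    [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]],
]
aggregated = bias_adjusted_sum_of_squares(layers)
for row in aggregated:
    print(row)
```

In this toy example, the dissociative layer contributes positive within-group entries after squaring, and the adjusted diagonal is zero. A spectral clustering step would then be applied to the aggregated matrix; that step is omitted here.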
Methods to detect multiple changepoints in regression coefficients typically have localization rates (i.e., rates to estimate the location of the changepoints) that scale inversely with the square root of the number of samples. In this paper, we develop VPWBS, a method that appropriately transforms the regression problem into a one-dimensional problem upon which we apply WBS. We prove the convergence rate scales inversely with the number of samples, a substantial improvement over existing rates.
Common pipelines to estimate the cell developmental trajectories based on single-cell data typically first embed each cell into a lower-dimensional space, but these embeddings typically assume statistical models that do not model single-cell data well. In this paper, we develop an embedding for a hierarchical model where the inner product between two latent low-dimensional vectors is the natural parameter of an exponential-family-distributed random variable, and prove identifiability and convergence. When studying oligodendrocytes in fetal mouse brains, we find that oligodendrocytes mature into various cell types.
Changepoint detection methods such as binary segmentation are often used in CGH analyses for copy number variation detection, but these methods lack proper downstream statistical inference. In this paper, we develop post-selection hypothesis tests for various changepoint detection methods, prove our sampling strategies' validity, and provide substantial practical guidelines based on simulation.
Valid post-selection inference for segmentation methods with application to copy number variation data
Biometrics 77.3 (2021): 1037-1049 (link) (arxiv) (pdf) (git 1, git 2)
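For readers unfamiliar with binary segmentation, here is a minimal sketch of the selection step only, using the standard CUSUM statistic on a one-dimensional sequence. It does not implement the post-selection tests from the paper; the function names, the threshold, and the toy signal are all illustrative assumptions.

```python
import math


def cusum(y, start, end, split):
    """CUSUM statistic comparing the means of y[start:split] and y[split:end]."""
    n_left, n_right = split - start, end - split
    mean_left = sum(y[start:split]) / n_left
    mean_right = sum(y[split:end]) / n_right
    scale = math.sqrt(n_left * n_right / (n_left + n_right))
    return scale * (mean_right - mean_left)


def binary_segmentation(y, threshold, start=0, end=None):
    """Recursively split y[start:end] wherever the largest |CUSUM| reaches the threshold."""
    if end is None:
        end = len(y)
    if end - start < 2:
        return []
    best = max(range(start + 1, end), key=lambda b: abs(cusum(y, start, end, b)))
    if abs(cusum(y, start, end, best)) < threshold:
        return []
    left = binary_segmentation(y, threshold, start, best)
    right = binary_segmentation(y, threshold, best, end)
    return left + [best] + right


# Noiseless toy signal with jumps at indices 5 and 10 (hypothetical data).
signal = [0.0] * 5 + [3.0] * 5 + [0.0] * 5
changepoints = binary_segmentation(signal, threshold=2.0)
print(changepoints)
```

Because the changepoints are chosen by looking at the data, naive tests on the resulting segments are not valid, which is the gap that post-selection inference addresses.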
Microarray samples from brain tissue are hard to collect, and also vary substantially depending on the tissue's brain region and the developmental age of its subject, hence it is hard to collect enough samples for the statistical analysis. In this paper, we develop a sample selection method to find additional microarray samples that are statistically similar to the samples of our desired spatio-temporal brain tissue. We demonstrate that after applying an existing analysis pipeline to our selected samples, we detect a higher percentage of autism risk genes.
We discuss "Network cross-validation by edge sampling" by Li, Levina, and Zhu (Biometrika, 2020), where we empirically explore the possibilities of combining the ideas of the authors with ideas providing confidence sets for cross-validated parameters, as well as extending the ideas of the authors into the network tensor setting.
Discussion of 'Network cross-validation by edge sampling'
Biometrika 107.2 (2020): 285-287. (link) (pdf) (git)
While de novo mutations within the protein-coding portion of the genome have been thoroughly studied, these mutations in the noncoding portions, which comprise 98.5% of the genome, have been less well understood. In this paper, we use a bioinformatics framework that builds upon (among other things) sparse PCA and DAWN based on simulated null datasets to analyze 1902 autism quartets via WGS and find that the strongest signals arose from promoters -- noncoding regions that control gene transcription.
Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder
Science 362.6420 (2018): (link) (pdf) (git)
Changepoint estimators have statistical theory for how well they estimate the mean function and how well they estimate the changepoints, but existing theory often analyzes these properties separately. In this paper, we prove a near-optimal estimation rate for the fused lasso, which in turn directly proves a changepoint detection rate that is near the detection limit. We extend this logic to other estimators and settings.
A sharp error analysis for the fused Lasso, with application to approximate changepoint screening
Advances in Neural Information Processing Systems (NeurIPS) (2017): (link) (arxiv) (git)
Social biases can influence Wikipedia articles of individuals across different genders and ethnicities, resulting in disproportionate article lengths, number of languages written about said individuals, etc., but these differences can be difficult to quantify appropriately. In this paper, we develop a matching method to find an appropriate comparison set of articles for each target group of interest (e.g., articles of African American individuals) based on the articles' categories. We show the differing covariate distributions across the different target groups, and uncover quantitative results that reinforce existing social theories. This paper won the Wikimedia Foundation Research Award of the Year in 2023.
Controlled Analyses of Social Biases in Wikipedia Bios
Proceedings of the ACM Web Conference 2022 (2022): (link) (arxiv) (git)
Dependency graphs encode complex pairwise patterns that are often statistically estimated, but are often hard to diagnose with visualizations due to the quadratic number of scatter plots. In this paper, we develop an interactive system in R that learns what the data scientist visually interprets as dependency. Then, this system applies the learned classifier to infer a dependency graph that can be compared against the estimated graph. This paper won honorable mention for the Student Paper Award in the ASA Section: Statistical Computing and Statistical Graphics (2018).
Interactive dependency visualizer: Understanding pairwise variable relationships, applied to single-cell RNA-seq data
Many compressed sensing algorithms are developed to be as generic as possible, but have shortcomings in specialized settings where modern optimization theory can deliver a substantial boost in computational efficiency. In this paper, we develop two compressed sensing algorithms, one specialized for extremely sparse signals and another specialized for Kronecker-structured sensing matrices. We numerically demonstrate a nearly 10-fold reduction in computation time compared to other state-of-the-art methods.
Revisiting compressed sensing: Exploiting the efficiency of simplex and sparsification methods
Mathematical Programming Computation 8.3 (2016): 253-269. (link) (pdf) (git)
Last Updated: January 30, 2024