Calvin Chi

calvin.chi at berkeley dot edu



I am an applied scientist at Amazon Advertising, working on machine learning model development and deployment, from research papers to production code. I was a major contributor to the launch of Amazon Demand-side Platform's first deep learning system. My areas of applied research include semi-supervised learning, multitask learning, and modeled uncertainty.

I completed my Ph.D. in Computational Biology at UC Berkeley in 2020, advised by Haiyan Huang and Lisa Barcellos, and supported by the NSF Graduate Research Fellowship. My graduate work was at the intersection of applied machine learning and epidemiology/pharmacogenomics, which is summarized by this dissertation.

My graduate coursework was focused on the areas of statistics and computer science, where I studied optimization, algorithms, theoretical statistics, machine learning, computer vision, NLP, and reinforcement learning.


"Nature cannot be fooled." - Richard Feynman

Biography

I grew up in Taichung, Taiwan as a third culture kid and always had an interest in STEM. Unfortunately, because my high school education focused on the liberal arts and natural sciences, I was not aware of engineering as a career until my senior year of high school. In college, I was pre-med because I thought medicine was the best way to pursue my interest in STEM. However, after trying out a few courses in statistics and programming, and going on medical school interviews, I realized I was more interested in a career in computing, engineering, and data science. This led me to turn down my medical school admission offer and pursue a Ph.D. in Computational Biology, where I fell in love with computing and found my eventual career path. If you happen to find my Rate My Professors profile, it is a joke started by college classmates back in organic chemistry. The map indicates cities I have been to.

Education

Case Western Reserve University

BS, Biochemistry, summa cum laude
2011 - 2015

University of California, Berkeley

PhD, Computational Biology
2015 - 2020

Industry

Amazon

Applied Scientist Intern, summer 2019
Applied Scientist, 2020 - present

Publications

Identification of Sjögren’s syndrome patient subgroups by clustering of labial salivary gland DNA methylation profiles

Calvin Chi, Olivia Solomon, Caroline Shiboski, Kimberly E. Taylor, Hong Quach, Diana Quach, Lisa F. Barcellos, Lindsey A. Criswell

PLoS ONE 18(3) 2023



Abstract Heterogeneity in Sjögren’s syndrome (SS), increasingly called Sjögren’s disease, suggests the presence of disease subtypes, which poses a major challenge for the diagnosis, management, and treatment of this autoimmune disorder. Previous work distinguished patient subgroups based on clinical symptoms, but it is not clear to what extent symptoms reflect underlying pathobiology. The purpose of this study was to discover clinically meaningful subtypes of SS based on genome-wide DNA methylation data. We performed a cluster analysis of genome-wide DNA methylation data from labial salivary gland (LSG) tissue collected from 64 SS cases and 67 non-cases. Specifically, hierarchical clustering was performed on low-dimensional embeddings of DNA methylation data extracted from a variational autoencoder to uncover unknown heterogeneity. Clustering revealed clinically severe and mild subgroups of SS. Differential methylation analysis revealed that hypomethylation at the MHC and hypermethylation at other genome regions characterize the epigenetic differences between these SS subgroups. Epigenetic profiling of LSGs in SS yields new insights into mechanisms underlying disease heterogeneity. The methylation patterns at differentially methylated CpGs are different in SS subgroups and support the role of epigenetic contributions to the heterogeneity in SS. Biomarker data derived from epigenetic profiling could be explored in future iterations of the classification criteria for defining SS subgroups.
Paper

Hypomethylation mediates genetic association with the major histocompatibility complex genes in Sjögren's syndrome

Calvin Chi, Kimberly E. Taylor, Hong Quach, Diana Quach, Lindsey A. Criswell, Lisa F. Barcellos

PLoS ONE 16(4) 2021



Abstract Differential methylation of immune genes has been a consistent theme observed in Sjögren’s syndrome (SS) in CD4+ T cells, CD19+ B cells, whole blood, and labial salivary glands (LSGs). Multiple studies have found associations supporting genetic control of DNA methylation in SS, which, in the absence of reverse causation, has positive implications for the potential of epigenetic therapy. However, a formal study of the causal relationship between genetic variation, DNA methylation, and disease status is lacking. We performed a causal mediation analysis of DNA methylation as a mediator of nearby genetic association with SS using LSGs and genotype data collected from 131 female members of the Sjögren’s International Collaborative Clinical Alliance registry, comprising 64 SS cases and 67 non-cases. Bumphunter was used to first identify differentially methylated regions (DMRs), then the causal inference test (CIT) was applied to identify DMRs mediating the association of nearby methylation quantitative trait loci (meQTLs) with SS. Bumphunter discovered 215 DMRs, with the majority located in the major histocompatibility complex (MHC) on chromosome 6p21.3. Consistent with previous findings, regions hypomethylated in SS cases were enriched for gene sets associated with immune processes. Using the CIT, we observed a total of 19 DMR-meQTL pairs that exhibited strong evidence for a causal mediation relationship. Close to half of these DMRs reside in the MHC and their corresponding meQTLs are in the region spanning the HLA-DQA1, HLA-DQB1, and HLA-DQA2 loci. The risk of SS conferred by these corresponding meQTLs in the MHC was further substantiated by previous genome-wide association study results, with modest evidence for independent effects. By validating the presence of causal mediation, our findings suggest both genetic and epigenetic factors contribute to disease susceptibility, and inform the development of targeted epigenetic modification as a therapeutic approach for SS.
Paper

Bipartite graph-based approach for clustering of cell lines by gene expression-drug response associations

Calvin Chi, Yuting Ye, Bin Chen, Haiyan Huang

Bioinformatics 2021



Abstract In pharmacogenomic studies, the biological context of cell lines influences the predictive ability of drug-response models and the discovery of biomarkers. Thus, similar cell lines are often studied together based on prior knowledge of biological annotations. However, this selection approach is not scalable with the number of annotations, and the relationship between gene-drug association patterns and biological context may not be obvious. We present a procedure to compare cell lines based on their gene-drug association patterns. Starting with a grouping of cell lines from biological annotation, we model gene-drug association patterns for each group as a bipartite graph between genes and drugs. This is accomplished by applying sparse canonical correlation analysis (SCCA) to extract the gene-drug associations, and using the canonical vectors to construct the edge weights. Then, we introduce a nuclear norm-based dissimilarity measure to compare the bipartite graphs. Accompanying our procedure is a permutation test to evaluate the significance of similarity of cell line groups in terms of gene-drug associations. In the pharmacogenomics datasets CTRP2, GDSC2, and CCLE, hierarchical clustering of carcinoma groups based on this dissimilarity measure uniquely reveals clustering patterns driven by carcinoma subtype rather than primary site. Next, we show that the top associated drugs or genes from SCCA can be used to characterize the clustering patterns of haematopoietic and lymphoid malignancies. Finally, we confirm by simulation that when drug responses are linearly-dependent on expression, our approach is the only one that can effectively infer the true hierarchy compared to existing approaches.
Paper Software Code

Admixture mapping reveals evidence of differential multiple sclerosis risk by genetic ancestry

Calvin Chi, Xiaorong Shao, Brooke Rhead, Evangelina Gonzalez, Jessica B. Smith, Anny H. Xiang, Jennifer Graves, Amy Waldman, Timothy Lotze, Teri Schreiner, Bianca Weinstock-Guttman, Gregory Aaen, Jan-Mendelt Tillema, Jayne Ness, et al.

PLoS Genetics 15(1) 2019



Abstract Multiple sclerosis (MS) is an autoimmune disease with high prevalence among populations of northern European ancestry. Past studies have shown that exposure to ultraviolet radiation could explain the difference in MS prevalence across the globe. In this study, we investigate whether the difference in MS prevalence could be explained by European genetic risk factors. We characterized the ancestry of MS-associated alleles using RFMix, a conditional random field parameterized by random forests, to estimate their local ancestry in the largest assembled admixed population to date, with 3,692 African Americans, 4,915 Asian Americans, and 3,777 Hispanics. The majority of MS-associated human leukocyte antigen (HLA) alleles, including the prominent HLA-DRB1*15:01 risk allele, exhibited cosmopolitan ancestry. Ancestry-specific MS-associated HLA alleles were also identified. Analysis of the HLA-DRB1*15:01 risk allele in African Americans revealed that alleles on the European haplotype conferred three times the disease risk compared to those on the African haplotype. Furthermore, we found evidence that the European and African HLA-DRB1*15:01 alleles exhibit single nucleotide polymorphism (SNP) differences in regions encoding the HLA-DRB1 antigen-binding heterodimer. Additional evidence for increased risk of MS conferred by the European haplotype was found for HLA-B*07:02 and HLA-A*03:01 in African Americans. Most of the 200 non-HLA MS SNPs previously established in European populations were not significantly associated with MS in admixed populations, nor were they ancestrally more European in cases compared to controls. Lastly, a genome-wide search of association between European ancestry and MS revealed a region of interest close to the ZNF596 gene on chromosome 8 in Hispanics; cases had a significantly higher proportion of European ancestry compared to controls. In conclusion, our study established that the genetic ancestry of MS-associated alleles is complex and suggested that differences in MS prevalence could be explained by the ancestry of MS-associated alleles.
Poster Paper

Projects

HLA Allele Imputation with Multitask Deep Convolutional Neural Network


Research
Computational Biology
2020

  • Developed multitask convolutional neural network for HLA imputation from phased genotype data.
  • On the T1DGC test dataset, achieved 97.6% imputation accuracy, which is comparable to state-of-the-art performance from programs such as HIBAG, HLA*IMP:02, and SNP2HLA.

Embedding-Augmented Deep CNN for PubMed Journal Recommendation


Class Project
NLP
2018

  • Journal detection from PubMed abstracts, using 415,381 programmatically collected abstracts
  • Compared multitask and embedding-augmented CNNs with an output space of 1,548 journals
  • Best performance when CNN input was augmented with topic and impact factor embeddings, reaching 23.7% accuracy, with 90% of true journals in the top 60 recommendations
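
The "top 60 recommendations" figure is a top-k accuracy: the fraction of abstracts whose true journal appears among the k highest-scoring journals. A minimal sketch of that metric in plain Python follows; the function name, the toy scores, and the labels are illustrative assumptions, not the project's code or data.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, true_labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += label in top_k
    return hits / len(true_labels)


# Toy example (made-up numbers): 3 abstracts scored against 4 journals.
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
]
true_labels = [1, 1, 0]

top1 = top_k_accuracy(scores, true_labels, k=1)
top2 = top_k_accuracy(scores, true_labels, k=2)
print(top1, top2)
```

With k=1 this reduces to ordinary accuracy, which is why a model can have low accuracy over 1,548 classes and still place most true journals in a short recommendation list.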

Data Augmentation using GAN for Breast Cancer Classification


Class Project
Computer Vision
2018

  • Investigated whether augmenting data with GANs could improve histology breast cancer classification with ResNet-18 re-trained on 5,547 breast histology images. Dataset provided by Kaggle.
  • DCGAN most effective at generating realistic histology images when kernel size is divisible by stride length in the generator
  • Augmentation with ~400 DCGAN images improved accuracy and precision by 5% and 12%, respectively, but recall decreased by nearly 15%
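
One commonly cited explanation for the kernel-size observation, offered here as an assumption rather than something the project notes state, is uneven overlap in transposed convolutions: when the kernel size is not divisible by the stride, some output positions receive contributions from more input positions than others, which can show up as checkerboard artifacts. The sketch below counts that overlap in one dimension with no padding; the helper names are hypothetical.

```python
def overlap_counts(n_inputs, kernel_size, stride):
    """Count how many input positions write to each output position
    of a 1D transposed convolution with no padding."""
    out_len = (n_inputs - 1) * stride + kernel_size
    counts = [0] * out_len
    for i in range(n_inputs):
        start = i * stride  # input i writes to [start, start + kernel_size)
        for offset in range(kernel_size):
            counts[start + offset] += 1
    return counts


def interior(counts, kernel_size):
    """Drop the borders, where counts ramp up and down regardless of stride."""
    return counts[kernel_size - 1 : len(counts) - kernel_size + 1]


even = interior(overlap_counts(6, kernel_size=4, stride=2), 4)
uneven = interior(overlap_counts(6, kernel_size=3, stride=2), 3)
print(even)    # uniform overlap when 4 % 2 == 0
print(uneven)  # alternating overlap when 3 % 2 != 0
```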

Analyzing the Effect of Salary on Employee Attrition


Class Project
Causal Inference
2017

  • Causal inference of effect of salary on employee attrition for simulated Kaggle dataset
  • Variable importance ranking of features for employee attrition
  • Causal inference with DAGs, ensemble learning, and TMLE

Bearmaps


Class Project
Data Structures
2016

  • Mapping application for Berkeley, CA; implemented in Java
  • Rasterization with a quadtree
  • Routing via A* algorithm; location search autocompletion with trie
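
As an illustration of trie-based autocompletion, here is a minimal Python sketch. The original project was written in Java, so this is not its code; the class names and the example place names are assumptions chosen for the illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # True if a stored name ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def with_prefix(self, prefix):
        """Return all stored names that start with `prefix`, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no stored name has this prefix
        results = []
        stack = [(node, prefix)]
        while stack:  # depth-first walk of the subtree below the prefix
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for name in ["telegraph ave", "telegraph hill", "shattuck ave"]:
    trie.insert(name)
print(trie.with_prefix("tele"))
```

Lookup cost depends on the prefix length and the size of the matching subtree, not on the total number of stored names, which is what makes a trie a natural fit for search-as-you-type.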

Resources

During graduate school, I found it helpful to actively take notes on what I learned, in the same spirit as the Feynman technique. These notes include mathematical derivations and re-implementation of well-known algorithms, and are useful when I find existing resources inadequate in providing a good understanding. I continue to write notes as a life-long learner, and am posting these notes in case they are helpful. Feedback and corrections via email welcome.

Classical Machine Learning

  • Principal component analysis
  • Support vector machine
  • Locally-weighted logistic regression
  • Collaborative filtering
  • t-SNE

Statistics

  • Expectation-maximization algorithm
  • Multiple hypothesis testing

Deep Learning

  • Neural networks
  • AlexNet
  • ResNet
  • Generative adversarial network

Natural Language Processing

  • word2vec

Algorithms

  • Optimal stopping

Last update: April 2023