Calvin Chi

calvin.chi at berkeley dot edu

I am an applied scientist at Amazon Advertising, where I have worked on semi-supervised learning, multitask learning, and uncertainty modeling for multi-armed bandit problems.

I completed my Ph.D. in Computational Biology at UC Berkeley in 2020, advised by Haiyan Huang and Lisa Barcellos, and supported by the NSF Graduate Research Fellowship. My graduate work was at the intersection of applied machine learning and epidemiology/pharmacogenomics, which is summarized by this dissertation.

My academic interests include the foundational topics of optimization, statistics, and machine learning; application areas such as vision, language, and biomedicine; and frameworks such as reinforcement learning and metalearning.

"Nature cannot be fooled." - Richard Feynman


I grew up in Taichung, Taiwan as a third culture kid, and always had an interest in STEM. Due to my high school's emphasis on the liberal arts, I did not know about engineering as a career until my senior year. Since I thought medicine to be the most practical application of STEM, I committed my college career to earning medical school admission. It was not until the interviews that I fully realized medicine was the wrong path, which led me to make my riskiest career move by turning down my admissions offer. I later pursued a Ph.D. in Computational Biology to connect with statistics, computer science, and math, and finally realigned my direction with my passions. If you happen to find my Rate My Professors profile, it is a joke started by college classmates back in organic chemistry. The map indicates cities I have been to.


Case Western Reserve University

BS, Biochemistry, summa cum laude
2011 - 2015

University of California, Berkeley

PhD, Computational Biology
2015 - 2020


Applied Scientist Intern, summer 2019
Applied Scientist, 2020 - present

Hypomethylation mediates genetic association with the major histocompatibility complex genes in Sjögren's syndrome

Calvin Chi, Kimberly E. Taylor, Hong Quach, Diana Quach, Lindsey A. Criswell, Lisa F. Barcellos

PLoS ONE 16(4) 2021

Abstract Differential methylation of immune genes has been a consistent theme observed in Sjögren’s syndrome (SS) in CD4+ T cells, CD19+ B cells, whole blood, and labial salivary glands (LSGs). Multiple studies have found associations supporting genetic control of DNA methylation in SS, which in the absence of reverse causation, has positive implications for the potential of epigenetic therapy. However, a formal study of the causal relationship between genetic variation, DNA methylation, and disease status is lacking. We performed a causal mediation analysis of DNA methylation as a mediator of nearby genetic association with SS using LSGs and genotype data collected from 131 female members of the Sjögren’s International Collaborative Clinical Alliance registry, comprising of 64 SS cases and 67 non-cases. Bumphunter was used to first identify differentially-methylated regions (DMRs), then the causal inference test (CIT) was applied to identify DMRs mediating the association of nearby methylation quantitative trait loci (MeQTL) with SS. Bumphunter discovered 215 DMRs, with the majority located in the major histocompatibility complex (MHC) on chromosome 6p21.3. Consistent with previous findings, regions hypomethylated in SS cases were enriched for gene sets associated with immune processes. Using the CIT, we observed a total of 19 DMR-MeQTL pairs that exhibited strong evidence for a causal mediation relationship. Close to half of these DMRs reside in the MHC and their corresponding meQTLs are in the region spanning the HLA-DQA1, HLA-DQB1, and HLA-DQA2 loci. The risk of SS conferred by these corresponding MeQTLs in the MHC was further substantiated by previous genome-wide association study results, with modest evidence for independent effects. 
By validating the presence of causal mediation, our findings suggest both genetic and epigenetic factors contribute to disease susceptibility, and inform the development of targeted epigenetic modification as a therapeutic approach for SS.

Bipartite graph-based approach for clustering of cell lines by gene expression-drug response associations

Calvin Chi, Yuting Ye, Bin Chen, Haiyan Huang

Bioinformatics 2021

Abstract In pharmacogenomic studies, the biological context of cell lines influences the predictive ability of drug-response models and the discovery of biomarkers. Thus, similar cell lines are often studied together based on prior knowledge of biological annotations. However, this selection approach is not scalable with the number of annotations, and the relationship between gene-drug association patterns and biological context may not be obvious. We present a procedure to compare cell lines based on their gene-drug association patterns. Starting with a grouping of cell lines from biological annotation, we model gene-drug association patterns for each group as a bipartite graph between genes and drugs. This is accomplished by applying sparse canonical correlation analysis (SCCA) to extract the gene-drug associations, and using the canonical vectors to construct the edge weights. Then, we introduce a nuclear norm-based dissimilarity measure to compare the bipartite graphs. Accompanying our procedure is a permutation test to evaluate the significance of similarity of cell line groups in terms of gene-drug associations. In the pharmacogenomics datasets CTRP2, GDSC2, and CCLE, hierarchical clustering of carcinoma groups based on this dissimilarity measure uniquely reveals clustering patterns driven by carcinoma subtype rather than primary site. Next, we show that the top associated drugs or genes from SCCA can be used to characterize the clustering patterns of haematopoietic and lymphoid malignancies. Finally, we confirm by simulation that when drug responses are linearly-dependent on expression, our approach is the only one that can effectively infer the true hierarchy compared to existing approaches.
Paper Software Code

Admixture mapping reveals evidence of differential multiple sclerosis risk by genetic ancestry

Calvin Chi, Xiaorong Shao, Brooke Rhead, Evangelina Gonzalez, Jessica B. Smith, Anny H. Xiang, Jennifer Graves, Amy Waldman, Timothy Lotze, Teri Schreiner, Bianca Weinstock-Guttman, Gregory Aaen, Jan-Mendelt Tillema, Jayne Ness, et al.

PLoS Genetics 15(1) 2019

Abstract Multiple sclerosis (MS) is an autoimmune disease with high prevalence among populations of northern European ancestry. Past studies have shown that exposure to ultraviolet radiation could explain the difference in MS prevalence across the globe. In this study, we investigate whether the difference in MS prevalence could be explained by European genetic risk factors. We characterized the ancestry of MS-associated alleles using RFMix, a conditional random field parameterized by random forests, to estimate their local ancestry in the largest assembled admixed population to date, with 3,692 African Americans, 4,915 Asian Americans, and 3,777 Hispanics. The majority of MS-associated human leukocyte antigen (HLA) alleles, including the prominent HLA-DRB1*15:01 risk allele, exhibited cosmopolitan ancestry. Ancestry-specific MS-associated HLA alleles were also identified. Analysis of the HLA-DRB1*15:01 risk allele in African Americans revealed that alleles on the European haplotype conferred three times the disease risk compared to those on the African haplotype. Furthermore, we found evidence that the European and African HLA-DRB1*15:01 alleles exhibit single nucleotide polymorphism (SNP) differences in regions encoding the HLA-DRB1 antigen-binding heterodimer. Additional evidence for increased risk of MS conferred by the European haplotype were found for HLA-B*07:02 and HLA-A*03:01 in African Americans. Most of the 200 non-HLA MS SNPs previously established in European populations were not significantly associated with MS in admixed populations, nor were they ancestrally more European in cases compared to controls. Lastly, a genome-wide search of association between European ancestry and MS revealed a region of interest close to the ZNF596 gene on chromosome 8 in Hispanics; cases had a significantly higher proportion of European ancestry compared to controls. 
In conclusion, our study established that the genetic ancestry of MS-associated alleles is complex and implicated that difference in MS prevalence could be explained by the ancestry of MS-associated alleles.
Poster Paper

HLA Allele Imputation with Multitask Deep Convolutional Neural Network

Computational Biology

  • Developed multitask convolutional neural network for HLA imputation from phased genotype data.
  • On the T1DGC test dataset, achieved 97.6% imputation accuracy, which is comparable to state-of-the-art performance from programs such as HIBAG, HLA*IMP:02, and SNP2HLA.

Embedding-Augmented Deep CNN for PubMed Journal Recommendation

Class Project

  • Journal detection from PubMed abstract with 415,381 programmatically-collected abstracts
  • Compared multitask and embedding-augmented CNNs with output space of 1,548 journals
  • Best performance when CNN input was augmented with topic and impact factor embeddings: 23.7% accuracy, with 90% of true journals in the top 60 recommendations

Data Augmentation using GAN for Breast Cancer Classification

Class Project
Computer Vision

  • Investigated whether augmenting data with GANs could improve histology breast cancer classification with ResNet-18 re-trained on 5,547 breast histology images. Dataset provided by Kaggle.
  • DCGAN most effective at generating realistic histology images when kernel size is divisible by stride length in the generator
  • Augmentation with ~400 DCGAN images improved accuracy and precision by 5% and 12% respectively, but recall decreased by nearly 15%
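
A commonly cited explanation for the kernel size observation (an assumption here, not something verified in the project itself) is that a transposed convolution whose kernel size is not a multiple of its stride covers output positions unevenly, which tends to produce checkerboard artifacts. The sketch below is a minimal 1-D illustration with no padding; the function names `overlap_counts` and `interior` are hypothetical and this is not the project's code.

```python
def overlap_counts(num_inputs, kernel_size, stride):
    """Count how many kernel taps land on each output position of a
    1-D transposed convolution without padding."""
    out_len = (num_inputs - 1) * stride + kernel_size
    counts = [0] * out_len
    for i in range(num_inputs):
        start = i * stride  # each input writes a kernel-sized patch
        for k in range(kernel_size):
            counts[start + k] += 1
    return counts


def interior(counts, kernel_size):
    """Drop the border positions, where coverage is always lower."""
    return counts[kernel_size - 1 : len(counts) - (kernel_size - 1)]


if __name__ == "__main__":
    # Kernel size 4 is divisible by stride 2: interior coverage is even.
    print(interior(overlap_counts(5, 4, 2), 4))
    # Kernel size 3 is not divisible by stride 2: coverage alternates.
    print(interior(overlap_counts(5, 3, 2), 3))
```

With a divisible kernel size every interior output position receives the same number of contributions, while the non-divisible case alternates between positions, which is one plausible reason the generator produced more realistic images in the divisible setting.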

Analyzing the Effect of Salary on Employee Attrition

Class Project
Causal Inference

  • Causal inference of effect of salary on employee attrition for simulated Kaggle dataset
  • Variable importance ranking of features for employee attrition
  • Causal inference with DAGs, ensemble learning, and TMLE

Bearmaps

Class Project
Data Structures

  • Mapping application for Berkeley, CA; implemented in Java
  • Rastering with quad tree
  • Routing via A* algorithm; location search autocompletion with trie
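
As a rough illustration of trie-based autocompletion, here is a minimal sketch. It is written in Python rather than the project's Java, it is not the project's code, and the class names and the sample place names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if a stored name ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Add a name to the trie, one character per level."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def complete(self, prefix):
        """Return all stored names that start with prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]  # depth-first walk below the prefix
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


if __name__ == "__main__":
    trie = Trie()
    for name in ["telegraph ave", "telegraph hill", "tilden park", "sather gate"]:
        trie.insert(name)
    print(trie.complete("tele"))
```

Reaching the prefix node takes time proportional to the prefix length, regardless of how many names are stored, which is the main appeal of a trie for search-as-you-type.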

Resources

During graduate school, I found it helpful to actively take notes on what I learned, in the same spirit as the Feynman technique. These notes include mathematical derivations and re-implementation of well-known algorithms, and are useful when I find existing resources inadequate in providing a good understanding. I continue to write notes as a life-long learner, and am posting these notes in case they are helpful. Feedback and corrections via email welcome.

Classical Machine Learning

  • Principal component analysis
  • Support vector machine
  • Locally-weighted logistic regression
  • Collaborative filtering
  • t-SNE

Statistics

  • Expectation-maximization algorithm
  • Multiple hypothesis testing

Deep Learning

  • Neural networks
  • AlexNet
  • ResNet
  • Generative adversarial network

Natural Language Processing

  • word2vec

Algorithms

  • Optimal Stopping

Last update: Jun. 2021