hierBipartite Vignette

1. Introduction

Pharmacogenomic studies are often interested in the relationship between a set of genomic features and a set of drug responses, for purposes such as biomarker discovery or building drug-response models. We propose a framework for clustering pre-defined cell line groups in terms of gene-drug relationships, which we refer to as bipartite graph-based hierarchical clustering. This enables applications such as visualization of group similarities or determining which cell line groups to include for downstream analysis.

The \(\texttt{hierBipartite}\) R package implements the bipartite graph-based hierarchical clustering method detailed in the paper. The “bipartite graph” describes the association relationship between the set of genes and the set of drugs. The method starts by creating a dissimilarity matrix for the provided starting cell line groups by (1) extracting gene-drug association patterns for each group using sparse canonical correlation analysis (SCCA) and (2) using a nuclear norm-based dissimilarity measure to compare groups based on the extracted association patterns. The \(\texttt{hierBipartite}\) package applies hierarchical clustering to determine the hierarchical relationship between the cell line groups. A few optional procedures are implemented to enhance this framework:

2. Load Data

The \(\texttt{hierBipartite}\) R package contains a test dataset based on gene expression dataset from the Cancer Cell Line Encyclopedia (CCLE) resource and drug sensitivity dataset from the Cancer Therapeutics Response Portal (CTRP2) resource (Barretina et al. 2012; Seashore-Ludlow et al. 2015). In the paper, this dataset is referred to as “CTRP2”, named after the drug sensitivity resource.

The CTRP2 dataset has been processed exactly as described in the paper, with the exception of selecting the top 1,000 transcripts instead of 5,000 transcripts by correlation with drug sensitivity values. This is to abide by the Bioconductor memory constraint of less than or equal to 5 MB for individual files. However, the purpose of this test dataset is to demonstrate how to use the \(\texttt{hierBipartite}\) package. For additional details on the CTRP2 dataset, execute ?ctrp2. The data structure of the CTRP2 dataset is as follows:

First load this test dataset and extract the above three components.

library(hierBipartite)
data(ctrp2)

# gene expression
X = ctrp2[["X"]]
# drug sensitivity
Y = ctrp2[["Y"]]
# starting cell line groups
groups = ctrp2[["groups"]]

List the 10 cell line groups present in this dataset. The cell lines are grouped by carcinoma subtype (e.g. adenocarcinoma) and primary site (e.g. lung NSC).

names(groups)
##  [1] "adenocarcinoma_colorectal"                   "adenocarcinoma_endometrium"                 
##  [3] "adenocarcinoma_ovary"                        "adenocarcinoma_stomach"                     
##  [5] "adenocarcinoma_lung_NSC"                     "ductal_carcinoma_breast"                    
##  [7] "ductal_carcinoma_pancreas"                   "squamous_cell_carcinoma_esophagus"          
##  [9] "squamous_cell_carcinoma_upper_aerodigestive" "squamous_cell_carcinoma_lung_NSC"

Show each group is represented by vector of row indices in \(X, Y\)

# Row indices of samples in X, Y from adenocarcinoma, lung NSC
groups[["adenocarcinoma_lung_NSC"]]
##  [1]  87  89  95 101 103 104 105 106 112 113 117 118 119 121 123 124 125 126 127 132 133 136 139 140 144 146 148 149 150 151
## [31] 158 159 163

3. Bipartite Graph-based Hierarchical Clustering

3.1 Using \(\texttt{hierBipartite}\)

The bipartite graph-based hierarchical clustering method is applied using the hierBipartite() main function, which takes two data matrices X and Y as input and a groups list indicating how samples are grouped. The link parameter determines the link function to use for hierarchical clustering, which can be one of “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median”, or “centroid”. If applying the optional procedures described in the Introduction:

Although the subsampling procedures and the permutation tests are computationally intensive, each subsample step or permutation test is independent of each other. Thus, we can parallelize these steps using the parallel package, by toggling parallel = TRUE.

Now we apply hierBipartite() method on the test data with the permutation test, but without performing the subsampling procedure to save runtime. This process could take 1 - 2 hours (for 260 cell lines, 1,000 genes, 133 drugs, and 10 groups).

3.2 Examining Results

The hierBipartite() method outputs a list containing

To view the resulting dendrogram

plot of chunk dendrogram_uncolored

We now illustrate how to answer some common questions about the results after the method is finished. To get the groups in the third merge and corresponding p-value:

## [1] "ductal_carcinoma_pancreas"        "squamous_cell_carcinoma_lung_NSC"
## [1] 0

Suppose we are interested in selecting the expression and drug sensitivity data from cell lines in the third merge for downstream analysis.

Get SCCA coefficients using \(\texttt{SCCA()}\) for genes and drugs from cell lines in cluster 3. These coefficients can be used to rank genes and drugs in terms of association with each other.

## Computng Component number  1

plot of chunk SCCA_coefficients

Color branches of dendrogram corresponding to merges with p-value less than or equal to threshold of 0.10.

plot of chunk dendrogram_colored

Finally, the custom function getSignificantMergedGroups() selects only clusters with a merge p-value at or below a threshold.

## $`1`
## [1] 0
## 
## $`2`
## [1] 0
## 
## $`3`
## [1] 0
## 
## $`4`
## [1] 0.03

Here, each cluster of the list is named after the merge order.

References

Barretina, Jordi, Giordano Caponigro, Nicolas Stransky, Kavitha Venkatesan, Adam A Margolin, Sungjoon Kim, Christopher J Wilson, et al. 2012. “The Cancer Cell Line Encyclopedia Enables Predictive Modelling of Anticancer Drug Sensitivity.” Nature 483 (7391). Nature Publishing Group: 603–7.

Seashore-Ludlow, Brinton, Matthew G Rees, Jaime H Cheah, Murat Cokol, Edmund V Price, Matthew E Coletti, Victor Jones, et al. 2015. “Harnessing Connectivity in a Large-Scale Small-Molecule Sensitivity Dataset.” Cancer Discovery 5 (11). AACR: 1210–23.


  1. \(TPM\) stands for transcripts per million, a normalized unit of transcript expression.