Identify biological functions for genes in copy number variation regions

 

Summary

This recipe provides an outline of one method to identify biological functions for genes lying in copy number variation (CNV) regions. CNVs are large alterations to genomes, such as amplification or deletion of large segments of a chromosome. They can range in size from a focal aberration in a single gene to aberrations covering entire chromosome arms. These variations in the genome have been associated with different conditions, such as cancer. Many genomic analyses produce a set of genes which are assumed to be relevant to an underlying biological mechanism or phenotype. Thus, an investigator often has additional questions about the function or relatedness of these genes: Are they part of the same pathway? Do the gene products interact physically? Do the gene products localize to a specific part of the cell? Are the genes associated with certain stages of development? These questions, and others like them, can be answered by performing functional annotation of gene lists to better understand the underlying connections between genes.

In this particular example, we imagine a scenario in which an investigator identifies CNV regions that are amplified or deleted in glioblastoma multiforme (GBM) tumor samples, using a method called Genomic Identification of Significant Targets in Cancer (GISTIC, Mermel et al. (2011) Genome Biol.). Given a set of CNV regions, the goal is to infer which biological functions (e.g., metabolic and regulatory pathways, chemical perturbation signatures, etc.) are overrepresented in the set of reference genes that overlap with these regions. In particular, this recipe uses several Galaxy tools to find the overlap between CNV regions and reference genes obtained from the UCSC Table Browser. Then it uses the Molecular Signatures Database (MSigDB) to identify biological functions of the overlapped genes.

How can I use this recipe? This recipe may be modified to analyze CNV regions derived from any organism for which an annotated reference genome exists. The recipe also does not depend on the algorithm used to identify these regions (here, GISTIC); any source of CNV data can be used. Once investigators have pinpointed the CNV regions believed to influence the phenotype under study (disease state, cell type, etc.), they can use this recipe to identify functional pathways that may be affected by these copy number changes and move closer to understanding the mechanisms behind these CNV effects.

Input

To complete this recipe, we will need a file describing the locations of CNV regions in a specific condition, and a reference genome to compare against. In this example, we use CNV regions identified in primary glioblastoma multiforme (GBM) tumor samples using the GISTIC algorithm. For this particular recipe, we will need the following datasets, which can be downloaded from GenomeSpace's Public folder:

GISTIC_CNV_deleted.txt: This file is a standard output file of the GISTIC2 tool. It lists narrow- and wide-peak regions of significant deletion in the genome, organized by chromosome, as well as start and end positions in the genome for each of these CNV regions.
GISTIC_CNV_amplified.txt: This file is a standard output file of the GISTIC2 tool. It lists narrow- and wide-peak regions of significant amplification in the genome, organized by chromosome, as well as start and end positions in the genome for each of these CNV regions.
Note: These data are modified files downloaded from FireBrowse, the TCGA data browser. The reference study for the TCGA glioblastoma dataset is McLendon et al. (2008) Nature.

Getting Data

  1. Navigate to the following Public data folder: Public > RecipeData > GenomicFeatureData_bed gtf
  2. The following files are used in this recipe:
    1. GISTIC_CNV_amplified.txt
    2. GISTIC_CNV_deleted.txt

Overview

  1. Obtain reference gene annotations from the UCSC Table Browser.
  2. Identify genes lying in CNV regions by checking the overlap between the CNV regions and the reference gene annotations, using Galaxy:
    1. Preprocess the CNV regions data files and convert to BED format in preparation for subsequent analysis (Text Manipulation: Add Columns, Merge Columns).
    2. Identify the overlap between the CNV regions and the reference gene annotations (Operate on Genomic Intervals: Intersect).
    3. Extract a list of reference genes for further analysis (Text Manipulation: Cut).
  3. Test for significant overlap between reference genes in the CNV regions and different gene sets with unique biological functions or pathways, using MSigDB.
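The preprocessing in step 2.1 can be pictured with a small Python sketch. This is not the Galaxy workflow itself; it only illustrates the idea of turning CNV region rows into three-column BED lines. The input layout (chromosome, start, end), the function name cnv_rows_to_bed, and the example coordinates are all assumptions made for illustration, and the sketch leaves the coordinates unchanged rather than guessing at a 0-based or 1-based convention.

```python
def cnv_rows_to_bed(rows):
    """Turn (chromosome, start, end) rows into tab-separated BED lines.

    Hypothetical helper: the real GISTIC-derived files may use a
    different column layout, so adapt the unpacking below as needed.
    """
    bed_lines = []
    for chrom, start, end in rows:
        chrom = str(chrom)
        # BED files from UCSC use "chr"-prefixed names (e.g. chr7),
        # so add the prefix when the input only gives "7".
        if not chrom.startswith("chr"):
            chrom = "chr" + chrom
        bed_lines.append(f"{chrom}\t{int(start)}\t{int(end)}")
    return bed_lines


# Made-up coordinates, for illustration only.
example_rows = [("7", 54000000, 56000000), ("chr12", 57000000, 58500000)]
for line in cnv_rows_to_bed(example_rows):
    print(line)
```

The chromosome names must match those in the reference gene annotation file exactly, otherwise the later intersection step will find no overlaps.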

UCSC Table Browser

Getting reference gene annotations

In this step, we will use the UCSC Table Browser to retrieve reference gene annotations corresponding to the reference genome for our example data. If you are using your own data, you may already have a reference gene annotation file, or you may need to search for one matching your reference genome here.

Screenshot: Example input to UCSC Table Browser.
Video: Obtaining a reference genome from the UCSC Table Browser (BED files).

Galaxy

NOTE: If you have not yet associated your GenomeSpace account with your Galaxy account, you will be asked to do so. If you do not yet have a Galaxy account, you can automatically generate a new account that will be associated with your GenomeSpace account.

Loading data into Galaxy

Load the files into Galaxy using one of the following methods:

  1. Click on the file(s) (e.g., hg19.RefSeq.bed) in GenomeSpace, then use the Galaxy context menu and click Launch on File.
  2. Click on the file(s) (e.g., hg19.RefSeq.bed) in GenomeSpace, then drag it to the Galaxy icon to launch.
  3. Open Galaxy from GenomeSpace, navigate to the Get Data tool, then click on GenomeSpace import from file browser, then navigate to your personal directory.

Make sure to load hg19.RefSeq.bed, GISTIC_CNV_amplified.txt and GISTIC_CNV_deleted.txt into Galaxy, either from your personal GenomeSpace folder, or from the GenomeSpace Public folder.

Video: Loading data into Galaxy from GenomeSpace.

 

Identify genes in CNV regions

We will use a pre-built GenomeSpace workflow to identify genes that are located in CNV regions. This workflow uses Operate on Genomic Intervals to find the overlap between two datasets, one of which is processed using the Text Manipulation tool.

Screenshot: Importing and running a workflow in Galaxy.
REPEAT: In this recipe we are investigating both amplified and deleted CNV regions. To find the overlap between the reference gene annotations (hg19.RefSeq.bed) and the amplified CNV regions (GISTIC_CNV_amplified.txt), repeat the above steps, substituting GISTIC_CNV_amplified.txt for GISTIC_CNV_deleted.txt.
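For readers who want to see what the intersection step computes, the following Python sketch finds genes that overlap at least one CNV region. It is a simplified stand-in for the Galaxy Intersect tool, not its actual implementation. It assumes half-open BED-style intervals and counts any overlap of at least one base; the function name and the toy intervals are hypothetical.

```python
def genes_overlapping_regions(genes, regions):
    """Return names of genes that overlap at least one region.

    genes:   iterable of (chrom, start, end, name)
    regions: iterable of (chrom, start, end)
    Intervals are treated as half-open [start, end), as in BED.
    """
    # Group regions by chromosome so each gene is only compared
    # against regions on its own chromosome.
    regions_by_chrom = {}
    for chrom, start, end in regions:
        regions_by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    seen = set()
    for chrom, g_start, g_end, name in genes:
        if name in seen:
            continue  # report each gene name once
        candidates = regions_by_chrom.get(chrom, [])
        if any(g_start < r_end and r_start < g_end for r_start, r_end in candidates):
            seen.add(name)
            hits.append(name)
    return hits


# Toy data, not taken from the recipe's files.
toy_genes = [
    ("chr1", 100, 200, "GENE_A"),
    ("chr1", 500, 600, "GENE_B"),
    ("chr2", 100, 200, "GENE_C"),
]
toy_regions = [("chr1", 150, 300)]
print(genes_overlapping_regions(toy_genes, toy_regions))
```

This pairwise comparison is adequate for a few thousand regions; for larger datasets, sorted intervals or an interval tree would be the usual choice.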

MSigDB

NOTE: If you have not yet associated your GenomeSpace account with your MSigDB account, you will be asked to do so. If you do not yet have an MSigDB account, you can automatically generate a new account that will be associated with your GenomeSpace account.

Functional Annotation

In this step, we search for the biological functions and pathways that are represented in the set of reference genes located in CNV regions. We compute the overlap between our gene list and the pre-compiled gene sets in MSigDB. In this recipe, we select collections C1, C2, and C3 to compare against our dataset. The gene set collections used here are described below:

  • C1: positional gene sets: Gene sets corresponding to each human chromosome and each cytogenetic band that has at least one gene. (Cytogenetic locations were parsed from HUGO, October 2006, and Unigene, build 197. When there were conflicts, the Unigene entry was used.) These gene sets are helpful in identifying effects related to chromosomal deletions or amplifications, dosage compensation, epigenetic silencing, and other regional effects.
  • C2: curated gene sets: Gene sets collected from various sources such as online pathway databases, publications in PubMed, and knowledge of domain experts. The gene set page for each gene set lists its source.
  • C3: motif gene sets: Gene sets that contain genes that share a cis-regulatory motif that is conserved across the human, mouse, rat, and dog genomes. The motifs are catalogued (Xie et al. 2005) and represent known or likely regulatory elements in promoters and 3'-UTRs. These gene sets make it possible to link changes in a microarray experiment to a conserved, putative cis-regulatory element.
Screenshot: Example input to MSigDB.
REPEAT: To complete functional enrichment analysis on the amplified CNV regions (GISTIC_CNV_amplified.txt), repeat the above steps, substituting GISTIC_CNV_amplified.txt for GISTIC_CNV_deleted.txt.
Video: Search for significant overlap with known genesets using MSigDB.
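The significance of an overlap between a gene list and a gene set is commonly assessed with a hypergeometric test. The sketch below shows that calculation in plain Python; it is an illustration of the general technique and should not be read as MSigDB's exact procedure. The function name, its parameters, and in particular the size of the gene universe are assumptions that the analyst must choose.

```python
from math import comb


def overlap_p_value(universe_size, gene_set_size, query_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe_size: total number of genes considered (an assumption
                   the analyst must supply)
    gene_set_size: number of genes in the gene set
    query_size:    number of genes in the query list
    overlap:       number of query genes found in the gene set
    """
    upper = min(gene_set_size, query_size)
    # Count the ways to draw query_size genes so that at least
    # `overlap` of them fall in the gene set.
    favorable = sum(
        comb(gene_set_size, k) * comb(universe_size - gene_set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / comb(universe_size, query_size)


# Toy numbers: 10 genes in total, a gene set of 4, a query list of 3,
# and 2 of the query genes in the gene set.
print(overlap_p_value(10, 4, 3, 2))
```

Exact integer arithmetic is used so that very small probabilities are not lost to rounding before the final division.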

Interpretation of the Results

This is an example interpretation of the results from this recipe. First, we identified the overlap between the reference gene annotations (RefSeq) and the copy number variation (CNV) regions using Galaxy. This produced a list of annotated genes located in the CNV regions; there may be more genes in these regions that are not properly annotated and were therefore missed in the analysis. In this example, we find roughly 1300 genes in amplified CNV regions and roughly 8800 genes in deleted CNV regions. Next, we wanted to know what functional annotation, if any, these genes have: are specific gene functions being duplicated in CNV regions? Are the gene products in these regions connected functionally?

We used MSigDB to probe our dataset for functional annotation. In this case, we used only three collections: C1, C2 and C3. In this example we are most interested in knowing whether our genes are related to chromosomal deletions or amplifications (C1: positional gene set), whether our genes have functions that are reviewed in the literature (C2: curated gene set), and whether our genes share any cis-regulatory motifs (C3: motif gene set).

Our first result lists the gene set name and description, the number of our genes which overlap with the gene set, and measures of significance (p-values and q-values). For example, we see that 28 genes out of the ~1300 amplified genes fall into the "chr3q27" category, which has 83 genes total. This result is significant (p-value = 1.72e-44). This suggests that the CNV regions associated with glioblastoma are enriched for genes duplicated on chromosomal region chr3q27. Similarly, 222 genes out of the ~8800 deleted genes fall into the "chr19q13" category, suggesting that glioblastoma is associated with deletions in this chromosomal region (p = 3.78e-135). This is just one example of a possible interpretation of these results.

Screenshot: Example output of gene set overlap, including significance values, for amplified CNV regions.
Screenshot: Example output of gene set overlap, including significance values, for deleted CNV regions.
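Because many gene sets are tested at once, p-values are usually adjusted for multiple testing, which is what the q-values in the output reflect. One widely used adjustment is the Benjamini-Hochberg procedure, sketched below in Python. Whether the tool applies exactly this procedure is treated here as an assumption; the function name and the example p-values are illustrative only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values).

    The output list is in the same order as the input list.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that each q-value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Illustrative p-values, not taken from the recipe's results.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A gene set is then typically reported as significant when its q-value falls below a chosen false discovery rate threshold, such as 0.05.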

Our second result lists each gene by ID and symbol, then highlights which of the top categories it belongs to. For example, the amplified gene CDK4 overlaps with 4 categories: TCGA_GLIOBLASTOMA_COPY_NUMBER_UP, NIKOLSKY_BREAST_CANCER_12Q13_Q21_AMPLICON, LOCKWOOD_AMPLIFIED_IN_LUNG_CANCER, and PUJANA_BRCA1_PCC_NETWORK. Taken together, these categories suggest that CDK4 is amplified in glioblastoma, breast cancer, and lung cancer, and that CDK4 is part of the BRCA1 regulatory network. Similarly, the deleted gene NUP62 overlaps with 5 categories: chr19q13, CAGGTG_V$E12_Q6, GGGCGGR_V$SP1_Q6, PUJANA_BRCA1_PCC_NETWORK, and DANG_BOUND_BY_MYC. This suggests that NUP62 is in the chr19q13 chromosomal region, that it contains the motifs CAGGTG and GGGCGGR, that it is associated with the BRCA1 regulatory network, and that its promoter is bound by Myc.

Screenshot: Example output of annotation for each amplified gene.
Screenshot: Example output of annotation for each deleted gene.
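The per-gene view described above amounts to looking up, for each gene, which gene sets contain it. A minimal Python sketch follows. The function name is hypothetical, and the toy gene sets contain only the two genes discussed in this example (real MSigDB gene sets have many more members).

```python
def gene_set_membership(genes, gene_sets):
    """Map each gene to the sorted names of the gene sets containing it.

    genes:     iterable of gene symbols
    gene_sets: dict of gene set name -> set of member gene symbols
    """
    return {
        gene: sorted(name for name, members in gene_sets.items() if gene in members)
        for gene in genes
    }


# Toy gene sets restricted to the two genes from the text.
toy_gene_sets = {
    "TCGA_GLIOBLASTOMA_COPY_NUMBER_UP": {"CDK4"},
    "PUJANA_BRCA1_PCC_NETWORK": {"CDK4", "NUP62"},
    "chr19q13": {"NUP62"},
}
for gene, sets in gene_set_membership(["CDK4", "NUP62"], toy_gene_sets).items():
    print(gene, sets)
```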

These results suggest that our gene list is enriched for specific chromosomal regions and specific promoter region motifs, among other functional annotations. This in turn indicates that functionally related genes are being duplicated in CNV regions in a cancer phenotype. However, the results in this example are not necessarily significant and serve only as a simple illustration of possible results.