BioC 2016: Where Software and Biology Connect
June 25-26, 2016 (Developer Day: June 24)
Stanford University, Stanford, CA
This conference highlights current developments within and beyond Bioconductor. Morning scientific talks and afternoon workshops provide conference participants with insights and tools required for the analysis and comprehension of high-throughput genomic data. ‘Developer Day’ precedes the main conference on June 24, providing developers and would-be developers an opportunity to gain insights into project direction and software development best practices. The conference is immediately before useR! 2016.
To launch an Amazon Machine Image (AMI) for this conference:
- Create an Amazon Web Services (AWS) Account if you don’t already have one.
- Start the instance ami-ee35f883; See the documentation for this. Make sure your security group has port 80 accessible.
- Paste the Public DNS name of your instance into a web browser.
- Log in to RStudio with username ubuntu and password bioc .
- Be sure and terminate your instance when you are done using it, in order to avoid excessive charges.
Schedule
Developer Day
Friday June 24
-
9:00 - 10:00 Introductions; Project and Other Updates
-
10:00 - 11:15 Lightning talks 1
- Edward Lee. Obtaining T cell receptor pairs from high-throughput sequencing data using the alphabetr package [slides]
- Rene Welch. A Quality Control pipeline for ChIP-exo and ChIP-nexus [slides]
- Caleb Lareau. Uncovering and Visualizing Differential Topological Domains in DNA [slides]
- Thomas Girke. Experience teaching R / Bioconductor in graduate classes. [slides]
- Richie Cotton. readat: Read and work with SomaLogic ADAT files [slides]
- Benjamin Haibe-Kains. Reproducible research, algorithms, and data [slides]
-
11:15 - 11:30 BREAK
-
11:30 - 12:30 Lightning talks 2
- Michael Lawrence. Hello, Ranges [slides]
- Michael Steinbaugh. Sharing genome-wide screening and RNA-seq experiments in reproducible data packages using R/Bioconductor. [slides]
- Marcin Kosiński. Integrating TCGA Data - RTCGA Workflow [slides]
- Lucas Schiffer. Curated TCGA Data - A MultiAssayExperiment Teaser [slides]
- Peter Hickey. New stuff in bsseq for analysing large whole genome bisulfite-sequencing datasets [slides]
- Wolfgang Huber. Updated BiocStyle for PDF and HTML vignettes [slides]
- Karim Chine. Bioconductor on the rosettaHUB community platform [slides]
-
12:30 - 1:30 LUNCH
- 1:30 - 1:45 Facilitated Discussion: Approaches to Data Modeling
-
1:45 - 2:00 Facilitated Discussion: Interactive Visualization
-
2:00 - 2:45 Lightning workshops 1
- Making Interactive Packages (Sean Davis) [github] [slides]
- Accessing ISB Cancer Genomics Cloud Resources (Vince Carey) [setup notes] [ cgcR package or Vince’s fork of cgcR ] [strategies]
-
2:45 - 3:15 BREAK
-
3:15 - 4:00 Lightning workshops 2
- GenomicRanges for developers (Hervé Pagès) [slides]
- Unit tests and other approaches to writing robust packages (Jim Hester) [Bioconductor Testing Guidelines] [testthat]
- Common pitfalls in R, and how to avoid them (Martin Morgan) [github]
-
4:00 - 5:00 Developer Day Keynote Address. Dr. Robert Gentleman, 23andMe. Distributed Content Generation and Creating Pipelines for Genomic Analysis.
- 7:00 - ??? Interactive Visualization Design and Hackathon. Statistics Dept., Sequoia Hall (Room Seqouia 200)
Main Conference
Saturday, June 25
-
8:00 - 8:30. REGISTRATION AND CONTINENTAL BREAKFAST
-
Speakers
- 8:30 - 9:15. Sandrine Dudoit, University of California, Berkeley. Identification of Novel Cell Types in the Brain Using Single-Cell Transcriptome Sequencing. [slides]
- 9:15 - 10:00. Susan Holmes. Stanford University. Multicomponent data integration for the Human Microbiome. [slides]
- 10:00 - 10:30 BREAK
- 10:30 - 11:15. Jenny Bryan, University of British Columbia. Spreadsheets: the Data Format we Love to Hate.
-
Community Speakers
- 11:15 - 11:30. Michael Love, Harvard TH Chan School of Public Health. Bioconductor Workflows Following Fast, Lightweight RNA Transcript Quantifiers. [slides]
- 11:30 - 11:45. Ramnath Vaidya, Alteryx. HtmlWidgets: The Power of Javascript in R!
- 11:45 - 12:00. Benjamin Haibe-Kains, Princess Margaret Cancer Center / University of Toronto. Data Sharing and Research Reproducibility: Why and How. [slides]
-
12:00 - 1:00 LUNCH
-
1:00 - 2:50. Afternoon workshops, session 1.
-
Getting to know R and Bioconductor. Valerie Obenchain, Lori Shepherd (Roswell Park Cancer Institute). Beginner. [slides] [github]
-
Building and running automated NGS analysis workflows. Thomas Girke (UC Riverside). Intermediate. [slides] [github]
-
Introduction to ImmuneSpaceR. Renan Sauteraud, Lev Dashevskiy, Greg Finak, Raphael Gottardo (Fred Hutchinson Cancer Research Center). Intermediate. [slides] [github]
-
Analysis of single-cell RNA-seq data with R and Bioconductor. Davide Risso, Kelly Street, Michael Cole (UC Berkeley). Intermediate. [slides] [github]
-
Writing efficient, scalable code. Martin Morgan (Roswell Park Cancer Institute). Intermediate. [slides] [github]
-
-
2:50 - 3:10 BREAK
-
3:10 - 5:00. Afternoon workshops, session 2.
-
Annotating high throughput data using Bioconductor resources. James MacDonald (University of Washington) Intermediate. [slides] [github]
-
Interactive visualization with epiviz. Héctor Corrada Bravo, Jayaram Kancherla, Justin Wagner, Deok Park (University of Maryland, College Park). Intermediate. [slides] [github]
-
Single Cell Differential Expression and Gene Set Enrichment with MAST. Andrew McDavid, Raphael Gottardo, Greg Finak (Fred Hutchinson Cancer Research). Intermediate. [slides] [github]
-
Managing big biological sequence data with Biostrings and DECIPHER. Erik Wright (UW, Madison). Intermediate. [slides] [github]
-
-
5:00 - 7:00 RECEPTION / SOCIAL HOUR
Sunday, June 26
-
8:00 - 8:30. REGISTRATION AND CONTINENTAL BREAKFAST
-
Speakers
- 8:30 - 9:15. Rob Tibshirani, Stanford University. Some Recent Advances in Post-selection Statistical Inference. [slides]
- 9:15 - 10:00. William Greenleaf, Department of Genetics and, by courtesy, Applied Physics, Stanford University. ATAC-ing Open Chromatin Data Analysis.
- 10:00 - 10:30 BREAK.
- 10:30 - 11:15. Stephen Montgomery, Stanford University. Identifying Gene-by-Environment Variants in Studies of Gene Expression.
-
Community Speakers
- 11:15 - 11:30. Jim Hester, RStudio. Using Devtools, Travis and Git for Bioconductor Package Development. [slides]
- 11:30 - 11:45. Johannes Rainer, Center for Biomedicine, European Academy of Bozen/Bolzano (EURAC), Bolzano, Italy. Building and Using Ensembl Based Annotation Packages with ensembldb. [slides] [github]
- 11:45 - 12:00. Xiuwen Zheng, University of Washington, Seattle, WA, USA. SeqArray: Data Management of Whole-Genome Sequence Variant Calls with Thousands of Individuals. [slides]
-
12:00 - 1:00 LUNCH
-
1:00 - 2:50. Afternoon workshops, session 3
-
Low-level exploratory data analysis and methods development for RNA-seq. Michael Love (Harvard TH Chan School of Public Health). Intermediate. [slides] [github]
-
Making your packages accessible to non-programmer collaborators using the VisRseq platform. Hamid Younesy (Simon Fraser University), Torsten Möller (University of Vienna) Mohammad M. Karimi (University of British Columbia). Beginner. [slides] [github] [software]
-
The MultiAssayExperiment class for analysis of multi-omics experiments. Levi Waldron, Marcel Ramos (presenter) (CUNY School of Public Health), Vince Carey (Brigham and Women’s), Kasper Hansen (Johns Hopkins University), Martin Morgan (Roswell Park Cancer Institute). Intermediate. [slides] [github]
-
Analyzing splice events from RNA-seq data with SGSeq. Leonard Goldstein (Genentech, Inc.). Intermediate. [slides] [github]
-
Hello Ranges: An Introduction to Analyzing Genomic Ranges in R. Michael Lawrence (Genentech, Inc.). Intermediate. [slides] [github]
-
-
2:50 - 3:10 BREAK
-
3:10 - 5:00. Afternoon workshops, session 4
-
Wrapping Your R tools to Analyze National-Scale Cancer Genomics in the Cloud. Tengfei Yin, Rohit Reja, Jing Zhao, Marko Zecevic, Noel Namai, Kristina Clemens (Seven Bridges Genomics). Beginner. [slides] [github]
-
Analysing DNA methylation data with Bioconductor. Peter Hickey, Kasper Hansen (Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health). Intermediate. [slides] [github]
-
Introduction to Bayesian Inference using Stan with Applications to Cancer Genomics. Jacqueline Buros (Icahn School of Medicine at Mount Sinai), Benjamin Goodrich (Columbia University), Eric Novik (Stan Group Inc.). Intermediate. [slides] [github]
-
R / Bioconductor tools for mass spectrometry-based proteomics. Laurent Gatto (University of Cambridge). Intermediate. [slides] [github]
-
-
5:00 - 7:00 RECEPTION / SOCIAL HOURS
Workshop Abstracts
Session 1
-
Getting to know R and Bioconductor. Valerie Obenchain, Lori Shepherd (Roswell Park Cancer Institute). Beginner.
This workshop introduces R and Bioconductor, and is oriented toward the novice Bioconductor user. Participants can expect to learn the essentials of R, including understanding basic data structures and navigating the help system. Participants will also be introduced to the breadth of packages available in Bioconductor, how to find packages that are relevant to their particular needs, and how to benefit from the particular resources (vignettes, work flows, online support forum) that characterize the Bioconductor community.
-
Building and running automated NGS analysis workflows. Thomas Girke (UC Riverside). Intermediate.
The proposed workshop will introduce systemPipeR as an environment for building and running complex analysis workflows for most NGS applications. This includes support for both command-line and R/Bioconductor software as well as resources for parallel evaluations on computer clusters along with automated generation of analysis reports in PDF and HTML formats. The event will cover the basic usage of so-called workflow design modules, construction of custom workflows, integration of external command-line software and automation routines for running NGS workflows from start to finish with a single command. The last part will demonstrate how to run a relatively complex sample workflow (e.g. Ribo-Seq and/or smallRNA-Seq) on large numbers of input samples (here FASTQ files) on both a single machine and a computer cluster with a scheduler.
-
Introduction to ImmuneSpaceR. Renan Sauteraud, Lev Dashevskiy, Greg Finak, Raphael Gottardo (Fred Hutchinson Cancer Research Center). Intermediate.
ImmuneSpace is the data portal of the Human Immunology Project consortium (HIPC). The HIPC program, funded by the NIH, is a multi-center collaborative effort to characterize the status of the immune system in different populations under diverse stimulations and disease states. This ongoing effort has generated large amounts of varied high-throughput, high-dimensional biological data (flow cytometry, CyTOF, RNA-Seq, Luminex, among others). All data generated to date by HIPC, along with other selected datasets generated by other NIAID funded projects, have been made publicly available through ImmuneSpace and are ready to be explored using visualization and analysis tools built in ImmuneSpace. ImmuneSpaceR is a recently releasead package that aims to provide a fast and easy to use interface to download, manipulate and analyze data. In this workshop, we propose an introduction to the ImmuneSpaceR package, from instantiating connection to ImmuneSpace, to the visualization of normalized cross-study data. At the end of this workshop, the users will be able to use the portal to get data of interest, combine ImmuneSpace data with other resources and write report that can be submitted to the ImmuneSpace team for addition to the website. The participants are expected to have a basic knowledge of R and BioConductor’s common data structures. Interested participants are encouraged to try the web portal first at www.immunespace.org
-
Analysis of single-cell RNA-seq data with R and Bioconductor. Davide Risso, Kelly Street, Michael Cole (UC Berkeley). Intermediate.
In this workshop, we will show how to perform statistical analyses of single-cell RNA-seq data using R and Bioconductor. Starting from gene-level read counts, we will emphasize three important aspects of a typical workflow: (i) quality-control (QC) and normalization; (ii) cluster analysis and discovery of cell sub-populations; (iii) lineage analysis and pseudo-time cell ordering. The participants will learn how to compute common metrics of QC that can be used for gene and sample filtering and normalization; how to discover new cell populations using the new package clusterExperiment (available on Github at http://github.com/epurdom/clusterExperiment and about to be submitted to Bioconductor) that takes full advantage of the SummarizedExperiment container; and how to order the cells in pseudo-time. The participants are expected to have some familiarity with RNA-seq and with the concepts of normalization and clustering, as well as with R and Bioconductor software.
-
Writing efficient, scalable code. Martin Morgan (Roswell Park Cancer Institute). Intermediate.
This workshop reinforces common patterns that lead to effcient and scalable R and Bioconductor code. We spend the first part discussing vectorization, measuring basic performance, and ensuring our results are consistent and correct. The second part explores approaches to parallel evaluation using the BiocParallel package, including strategies for recovering from and debugging errors during parallel evaluation.
Session 2
-
Annotating high throughput data using Bioconductor resources. James MacDonald (University of Washington) Intermediate.
When analyzing high throughput data, there is usually a single (often cryptic) identifier for each thing being measured. When you want to present your results to collaborators, you will usually want to map that ID to something more familiar, such as a HUGO gene symbol, Entrez Gene ID, or similar. You may also need to map IDs to things like KeGG or Gene Ontology IDs for additional analyses. The Bioconductor project has various annotation packages that are intended to simplify this process. In this workshop we will learn how to use these annotation packages using real-world examples.
-
Interactive visualization with epiviz. Héctor Corrada Bravo, Jayaram Kancherla, Justin Wagner, Deok Park (University of Maryland, College Park). Intermediate.
This workshop provides an overview of interactive visualization tools available through epivizr. It will be centered around analysis of epigenetic data using Bioconductor tools and annotation packages. After completion participants will be expected to be able to setup interactive visualization sessions through R, load data from diverse R / Bioconductor objects, including those provided by annotation infrastructure, interact with data using patterns provided by the epiviz user interface, and write scripts that can reproduce interactive visualization analyses. Some familiarity with Bioconductor infrastructure recommended.
-
Single Cell Differential Expression and Gene Set Enrichment with MAST. Andrew McDavid, Raphael Gottardo, Greg Finak (Fred Hutchinson Cancer Research). Intermediate.
We will learn how to use the package MAST to analyze single cell gene expression experiments. Starting from a matrix of counts of transcripts (cells by transcripts), we will discuss the preliminary steps of quality control, filtering, and exploratory data analysis. Once we are satisfied that we have high-quality expression, we will consider tests for differential expression and ways to visualize results. It is often helpful to synthesize from gene-level into module-level statements. Therefore, we will learn how to use MAST to test for gene set enrichment. An example data set of Mucosal Associated Invariant T cells (MAITs) that have been cytokine stimulated will be provided. Participants are also encouraged to bring their own data. Prerequisites for this class will include basic R syntax, plotting and manipulation of data using
data.table
andggplot2
. Background with linear models may be useful as well. -
Managing big biological sequence data with Biostrings and DECIPHER. Erik Wright (UW, Madison). Intermediate.
This workshop provides an introduction to using the Biostrings and DECIPHER packages to manage big biological sequence data. Participants will learn how to build a database of sequences, curate the database, extract sequences, manipulate sequences, run large-scale analyses in pieces, and perform other tasks with these two R packages.
Session 3
-
Low-level exploratory data analysis and methods development for RNA-seq. Michael Love (Harvard TH Chan School of Public Health). Intermediate.
Interested in exploring RNA-seq alignments in Bioconductor? Interested in developing RNA-seq methods in Bioconductor? In this workshop, we will cover the basic classes that are useful for working with RNA-seq data: paired alignments, exons, transcripts and genes, etc. We will also cover basic functionality useful for manipulating RNA-seq data: finding compatible overlaps between alignments and transcripts, mapping between genomic and transcriptomic coordinates, calculating coverage, visualizing gene structure, etc. As a motivating example, we will investigate the various kinds of bias that can be observed in RNA-seq data: random hexamer priming bias, positional bias, and GC content bias. By “low-level”, I mean that this workshop will not cover the basic workflow and statistical packages for analyzing summarized RNA-seq data, for example, count matrices.
-
Making your packages accessible to non-programmer collaborators using the VisRseq platform. Hamid Younesy (Simon Fraser University), Torsten Möller (University of Vienna) Mohammad M. Karimi (University of British Columbia). Beginner.
The goal of this workshop is to introduce the VisRseq platform and walk the participants through the quick process of creating modules called R-apps from their R packages. I expect this to be mostly useful for bioinformaticians and package developers that develop R-based analysis tools and would like to make them accessible to their non-programmer collaborators or to the public without having to spend time on creating extensive graphical user interfaces. I will walk the participants through several examples of creating diverse types of apps, from simple plotting (e.g. ggplot) to intermediate (e.g. clustering) to more advanced (e.g. edgeR and DEseq) packages. I will also show how several R-apps can be linked together to create more complex workflows. Participants will require having beginner/intermediate knowledge of R and a machine with R and Java installation.
-
The MultiAssayExperiment class for analysis of multi-omics experiments. Levi Waldron, Marcel Ramos (presenter) (CUNY School of Public Health), Vince Carey (Brigham and Women’s), Kasper Hansen (Johns Hopkins University), Martin Morgan (Roswell Park Cancer Institute). Intermediate.
This workshop will introduce the MultiAssayExperiment data class for the analysis of multi-assay, or multi-omics, experiments. This class provides a single data object to store and link both ID-based datasets (ie matrix and ExpressionSet) and range-based datasets (ie RangedSummarizedExperiment and GRangesList), with each other and with over-arching experiment-level metadata. These data objects are easy to create, and allow harmonized subsetting across genomic features, samples, and assays, using standard bracket notation and providing a consistent interface to diverse base data classes. Subsetting can be formed using gene/probeset identifiers or genomic ranges. This workshop will be accessible to beginner Bioconductor users, although some familiarity with GenomicRanges will be helpful. The workshop will cover how to create and manipulate a MultiAssayExperiment for various kinds of -omics data, and how to incorporate new data classes.
-
Analyzing splice events from RNA-seq data with SGSeq. Leonard Goldstein (Genentech, Inc.). Intermediate.
SGSeq is an R/Bioconductor package for analyzing annotated or previously uncharacterized splice events from short read RNA-seq data. Input data are sequence reads mapped to a reference genome in BAM format. Gene models are represented as a genome-wide splice graph, which can be obtained from transcript annotation or predicted from the data. Splice events are identified from the graph and are quantified locally using structurally compatible reads that span event boundaries. The workshop introduces SGSeq data structures and illustrates typical analysis workflows, including splice event prediction, quantification, visualization and interpretation.
-
Hello Ranges: An Introduction to Analyzing Genomic Ranges in R. Michael Lawrence (Genentech, Inc.). Intermediate.
Genomic ranges are central to the analysis of high-throughput genomic data, and Bioconductor provides an integrated platform for computing and annotating data like read alignments, variant calls, transcript models, coverage vectors, genomic assays, etc. Outside of Bioconductor, high-level applications like bedtools and bedops provide convenient interfaces to simple, common operations on genomic intervals. The Bioconductor Ranges API is comparatively more flexible, and more complex. Although simple operations often have simple solutions, those solutions are hidden within the complexity. A new package, HelloRanges, attempts to solve this problem by compiling bedtools invocations to R/Bioconductor code. The compiler outputs code that aims to be correct, reasonably efficient, and, importantly, pedagogical. The code promotes best practices and makes explicit some assumptions in the API that may be surprising. This tutorial will walk through common workflows in genomic data processing from the bedtools perspective, while students use HelloRanges to learn Bioconductor equivalents. Basic familiarity with data in BED, BAM and VCF files is assumed.
Session 4
-
Wrapping Your R tools to Analyze National-Scale Cancer Genomics in the Cloud. Tengfei Yin, Rohit Reja, Jing Zhao, Marko Zecevic, Noel Namai, Kristina Clemens (Seven Bridges Genomics). Beginner.
The Cancer Genomics Cloud (CGC), built by Seven Bridges and funded by the National Cancer Institute hosts The Cancer Genome Atlas (TCGA), that is one of the world’s largest cancer genomics data collections. Computational resources and optimized, portable bioinformatics tools are provided to analyze the cancer data at any scale immediately, collaboratively, and reproducibly. Seven Bridges platform is not only available on AWS but also available on google cloud as well. With Docker and Common Workflow Language open standard, wrapping a tool in any programming language into the cloud and compute on petabyte of data has never been so easy. Open source R/Bioconductor package sevenbridges is developed to provide full API support to Seven Bridges Platforms including CGC, supporting flexible operations on project, task, file, billing, apps etc, users could easily develop fully automatic workflow within R to do an end-to-end data analysis in the cloud, from raw data to report. What’s most important, sevenbridges packages also provides interface to describe your tools in R and make it portable to CWL format in JSON and YAML, that you can share easily with collaborators, execute it in different environment locally or in the cloud, everything is fully reproducible. Combined with the R API client functionality, users will be able to create a CWL tool in R and execute it in the cancer genomics cloud to analyze the huge amount of cancer data at scale.
In this workshop, you will learn the best practice to create command line interface for your R cwl tool and docker container, put bioconductor workflow in cwl and execute it in the cloud, use API client to do different operations for your account, and use the R interface to describe CWL tool/workflows.
-
Analysing DNA methylation data with Bioconductor. Peter Hickey, Kasper Hansen (Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health). Intermediate.
DNA methylation is an epigenetic modification of DNA that is involved in the regulation of gene expression. Popular assays for studying DNA methylation are the Illumina 450k and EPIC microarrays, reduced-representation bisulfite-sequencing, and whole-genome bisulfite-sequencing, each of which has its own set of bioinformatic challenges. However, there are also common statistical themes, such as the strong spatial correlation of DNA methylation along the genome and that measurements from these assays are aggregates from a population of heterogeneous cells. The Bioconductor project currently includes 50 packages for analysing DNA methylation data. This workshop will introduce some of these packages and help users identify appropriate tools and methodology for their own analyses. The workshop presenters are developers of and contributors to several popular Bioconductor packages, including minfi for analysing Illumina methylation microarrays and bsseq for analysing whole-genome bisulfite-sequencing data.
-
Introduction to Bayesian Inference using Stan with Applications to Cancer Genomics. Jacqueline Buros (Icahn School of Medicine at Mount Sinai), Benjamin Goodrich (Columbia University), Eric Novik (Stan Group Inc.). Intermediate.
This workshop will provide a background and introduction to Bayesian analysis using Stan and R interfaces to Stan. We will then work through 1-2 example applications in the area of cancer genomics. Stan is a modern probabilistic programming language implementing full Bayesian statistical inference with MCMC sampling (NUTS, HMC), approximate Bayesian inference with variational inference (ADVI), and penalized maximum likelihood estimation with optimization (L-BFGS). Potential applications include: * A model for mutation signature detection (cancer genomics) * A clonality deconvolution model (cancer genomics) * Evaluating (predictive) biomarkers for response to treatment (cancer genomics) * Analysis of RNA expression data (genomics) * Gene set enrichment analysis (genomics) Users should be familiar with basic probability theory and statistical methods. Prior experience with R is essential. Knowledge of genomic data and cancer biology will help but is not at all required. Following this workshop users will have a basic familiarity with Stan and RStan. They will be able to code simple models in Stan and be familiar with the process for specifying and fitting complex models in Stan.
-
R / Bioconductor tools for mass spectrometry-based proteomics. Laurent Gatto (University of Cambridge). Intermediate.
In this workshop, we will use R/Bioconductor packages to explore, process, visualise and understand mass spectrometry-based proteomics data, starting with raw data, and proceeding with identification and quantitation data, discussing some of their peculiarities compared to sequencing data along the way. The participants will gain a general overview of Bioconductor packages for mass spectrometry and proteomics, and learn how to navigate raw data and reconstruct quantitative data. The workshop is aimed at a beginner to intermediate level, such as, for example, seasoned R users who want to get started with mass spectrometry and proteomics, or proteomics practitioners who want to familiarise themselves with R and Bioconductor infrastructure.