Bioconductor in the cloud
Obtain an Amazon Web Services account and start the AMI. Additional instructions below.
Note: Preconfigured Bioconductor AMIs are deprecated and no longer provided as of Bioconductor 3.13. Related resources include the Bioconductor Docker images and the AnVIL project.
Contents
- Overview
- Preloaded AMI
- First-Time Steps
- Launching The AMI
- Connecting to your AMI using SSH
- Connecting to your AMI using HTTP and RStudio
- AMI IDs
- Scenarios for using your Bioconductor instance
- Using Rgraphviz
- Parallelization using the parallel package
- Using the AMI as a cluster
- Installing StarCluster
- Configuring StarCluster
- Starting a Cluster
- Connecting to a Cluster
- Terminating the Cluster
- Cluster Scenarios
- Using BiocParallel with Sun Grid Engine
- Using SSH as the back end
- Using MPI as the back end
- Creating a custom version of the Bioconductor AMI
- Provisioning a virtual or physical machine for use with Bioconductor
- Moving data to and from your Bioconductor AMI instance
- Questions
Overview
We have developed an Amazon Machine Image (AMI) that is optimized for running Bioconductor in the Amazon Elastic Compute Cloud (or EC2) for sequencing tasks.
Here are a few reasons you could use it:
- You do not want to install Bioconductor on your own machine.
- You have a long-running task and you don’t want it to tie up the CPU on your own machine.
- You have a parallelizable task and would like to run it (either on multiple CPUs on a single machine, or in a cluster of many machines).
- You want to run R in your web browser (using RStudio Server).
- The AMI comes with many packages preinstalled that can be difficult to install and configure yourself.
See below for more specific scenarios.
Preloaded AMI
The AMI comes pre-loaded with the latest release version of R and the top 80 Bioconductor software packages plus the following categories of annotation packages:
- org.*
- BSgenome.*
- TxDb.*
How To Use It
First-time steps
First you will need an Amazon Web Services (AWS) account if you do not already have one. Sign up for AWS and then sign up for the EC2 service. This will require that you provide credit card information; however, you will only be charged for services used. Some AWS services are free.
That’s all that is required if you want to use RStudio Server to connect to your AMI with a web browser. If you also want to connect to it with SSH, create a keypair as follows:
Creating a Key Pair
Launch the AWS Console. Click on the Key Pairs link in the lower left-hand corner of the page. Click the “Create Key Pair” button. When prompted, supply a name. We suggest that the name be a combination of “bioconductor”, your first name, and your machine name (this will avoid conflicts with other people who share your AWS account, or possibly your own account on another machine). For example, if your name is Bob and your personal computer is named “mylaptop”, your key pair name could be “bioconductor-bob-mylaptop”. Download the resulting .pem file and keep it in a safe place.
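On Linux and macOS, ssh will refuse to use a private key file that is readable by other users, so it is worth locking down the .pem file's permissions right after downloading it. A minimal sketch, using the example key name from above (a placeholder file stands in for the real downloaded key):

```shell
# Placeholder standing in for the downloaded key file.
touch bioconductor-bob-mylaptop.pem
# Make the key readable only by its owner; ssh rejects
# private keys with more permissive modes.
chmod 400 bioconductor-bob-mylaptop.pem
stat -c '%a' bioconductor-bob-mylaptop.pem   # prints 400
```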
Launching the AMI
Once you have created an AWS account, you can launch one of the pre-loaded AMIs.
To launch the AMI you’ll step through several screens:
- Choose AMI: choose the version of Bioconductor you want to run.
- Choose Instance Type: the current AMIs were created from instances with 4 cores and 16 GiB of memory.
- Configure Instance: defaults are usually fine.
- Add Storage: only necessary if you want to store large files with the instance.
- Tag Instance: specify any tags (key-value pairs) you want to assign to the image.
- Configure Security Group: specify port 22 for SSH (default) and add port 80 for HTTP (RStudio) access.
- Review and Launch
View the progress of the launching instance by going to the EC2 dashboard and clicking ‘Running Instances’. Once the instance is ‘running’, it’s ready to go.
Important Note: When you are finished using the AMI, be sure to stop or terminate the instance. Click on the instance and select the desired action from ‘Actions’ -> ‘Instance State’. If you stop the instance, you are still charged for storage. Termination eliminates all further charges but also loses any changes you’ve made; any new session must be started fresh from one of the pre-loaded AMIs.
Connecting to your AMI using SSH
Start one of the pre-loaded AMIs.
Follow the same steps as above. On the ‘Running Instances’ page select the instance and get the public IP from the ‘Description’ tab at the bottom.
If the public IP is 50.16.120.30, the basic ssh command is:
ssh -i bioconductor-bob-mylaptop.pem ubuntu@ec2-50-16-120-30.compute-1.amazonaws.com
You can vary this command. If you want to use programs or R packages that use X11, be sure to add a -X flag, making the command something like this:
ssh -X -i bioconductor-bob-mylaptop.pem ubuntu@ec2-50-16-120-30.compute-1.amazonaws.com
Now you can paste your command line into a terminal or Command Prompt. Make sure you are in the same directory as your key pair file.
Windows Users: You will need to install a version of the ssh and scp commands. Graphical programs like PuTTY and WinSCP will work. Our examples, however, will use the command-line versions of these programs, which you can obtain by installing Cygwin (be sure to install the openssh package).
Once you have pasted this command into your Terminal or Command Prompt window (and pressed Enter) you should be connected to your Amazon EC2 instance.
Connecting to your AMI using HTTP and RStudio
Each instance that receives a public IP address is also given an external DNS hostname. The public IP is part of the hostname: for example, if the public IP is 50.16.120.30, the external DNS hostname is ec2-50-16-120-30.compute-1.amazonaws.com. Note the dashes in the DNS hostname versus the dots in the IP.
Paste the hostname into your browser and it should take you to the RStudio Server login page. Log in with the username ubuntu and the password bioc.
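The translation from public IP to external DNS hostname is mechanical: replace each dot in the IP with a dash and wrap the result in the ec2-….compute-1.amazonaws.com template. A small shell sketch using the example IP from above:

```shell
IP=50.16.120.30
# Swap the dots for dashes and build the EC2 hostname.
HOST="ec2-$(echo "$IP" | tr . -).compute-1.amazonaws.com"
echo "$HOST"   # prints ec2-50-16-120-30.compute-1.amazonaws.com
```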
AMI IDs
As of Bioconductor 3.13, Bioconductor has stopped providing preconfigured AMIs. AMIs for previous releases have the following IDs.
Bioconductor Version | R Version | AMI ID |
---|---|---|
3.13 | 4.1.0 | ami-0e7efd11a6eab85a6 |
3.12 | 4.0.3 | ami-04c69d122c1cf7e81 |
3.11 | 4.0.0 | ami-071b80cf0d8ca085c |
3.10 | 3.6.3 | ami-0c5ab50ca03a54468 |
3.9 | 3.6.1 | ami-0f5d1990d8c571cdf |
3.8 | 3.5.3 | ami-0565362d8bfb9cbed |
3.7 | 3.5.1 | ami-01bcd08e357360496 |
3.6 | 3.4.2 | ami-ac5df1d3 |
3.5 | 3.4.0 | ami-279a315d |
3.4 | 3.3.2 | ami-8946709f |
3.3 | 3.3.0 | ami-abd0b3bc |
3.2 | 3.2.0 | ami-85d88de0 |
3.1 | 3.2.0 | ami-a3d126c8 |
3.0 | 3.1.0 | ami-be7917d6 |
2.14 | 3.1.0 | ami-9c25fff4 |
2.13 | 3.0.2 | ami-4a25ff22 |
2.12 | 3.0 | ami-7224fe1a |
2.11 | 2.15 | ami-f827fd90 |
2.10 | 2.15 | ami-5621fb3e |
2.9 | 2.14 | ami-2623f94e |
2.8 | 2.13 | ami-3a2ef452 |
Please note that AMI IDs may change over time as we update the underlying AMI. Refer to this page for the most
current AMI IDs. These AMIs live in the US-East-1 region.
For administrative and funding reasons, Bioconductor keeps track of each time a Bioconductor AMI is launched. No identifying information is kept. By using the AMI, you consent to this tracking.
Scenarios for using your Bioconductor instance
Using Rgraphviz
Make sure you have connected to your instance either with a web browser, or using the -X flag of the ssh command. Something like:
ssh -X -i bioconductor-bob-mylaptop.pem ubuntu@ec2-50-16-120-30.compute-1.amazonaws.com
Then, from within R on the remote instance:
library("Rgraphviz")
set.seed(123)
V <- letters[1:10]
M <- 1:4
g1 <- randomGraph(V, M, 0.2)
plot(g1)
This should start a graphics device on your local computer displaying a simple graph.
Parallelization using the parallel package
This works best if you have selected a high-CPU instance type to run.
This trivial example runs the rnorm() function, but any function would work. Consult the parallel documentation for more information.
library(parallel)
mclapply(1:30, rnorm)
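By default mclapply uses only two worker processes (controlled by the mc.cores option); to see how many cores the instance actually exposes, and hence a sensible value for mc.cores, you can run nproc in a shell on the instance:

```shell
# Print the number of processing units available to this process.
nproc
```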
Using the AMI as a cluster
You can also use the AMI as a cluster, wherein the machines communicate with each other via one of the following mechanisms:
- SSH
- MPI
- Sun Grid Engine
In order to use the Bioconductor AMI in one of these scenarios, you need to install StarCluster, which is a software package designed to automate and simplify the process of building, configuring, and managing clusters of virtual machines on Amazon’s EC2 cloud.
StarCluster takes care of the details of making sure that the machines can communicate with each other, including:
- Passwordless SSH
- Shared disk space (using NFS)
- Convenient aliases for host names (such as master and node001)
- Configuration of job scheduler (Sun Grid Engine)
Note: Using the Bioconductor AMI for cluster-related tasks is only supported for the Bioconductor AMI version 2.14 and higher.
Installing StarCluster
Install StarCluster by following the StarCluster Installation Guide. This is a simple and fast process.
If you are on Windows, go directly to the Windows section.
Before continuing, it’s worth watching the Quick Start Screencast or following the Quick-Start Tutorial.
Configuring StarCluster
Before we can use StarCluster with the Bioconductor AMI, we need to configure it by editing its config file.
You can create the file by issuing the command:
starcluster help
This will give you three options:
Options:
--------
[1] Show the StarCluster config template
[2] Write config template to /home/user/.starcluster/config
[q] Quit
Choose option 2 and note the location of the config file (it will be different from what is shown above).
On Unix systems (including Linux and Mac OS X), this file is found at ~/.starcluster/config. On Windows systems, the .starcluster folder should be located in your home directory.
Open the config file in your favorite text editor, and edit it as follows:
AWS Credentials and Connection Settings section
You need to fill in values for AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. If you don’t know these values, go to the Security Credentials page of the AWS Console and expand the “Access Keys” section. You can view or create your access keys here. Be sure to store these credentials in a safe place (in addition to your StarCluster config file).
The value of AWS_USER_ID can also be found on the Security Credentials page, by expanding the “Account Identifiers” section. Fill in AWS_USER_ID with the number shown as your “AWS Account ID” (this should be a 12-digit number with hyphens).
Defining EC2 Keypairs section
If you haven’t already created a keypair in EC2, please do so now by reading the keypairs section.
You can also create a keypair with StarCluster; run the command
starcluster createkey --help
…for instructions.
Remember the name that you assigned to your keypair. Change the line
[key mykey]
so that mykey is replaced by the name you assigned to your keypair in EC2, and change the following line
KEY_LOCATION=~/.ssh/mykey.rsa
so that the value of KEY_LOCATION is the full path to the private key downloaded from EC2 (it probably has a .pem extension).
Defining Cluster Templates section
StarCluster allows you to define multiple clusters in the config file. For now let’s just modify the cluster defined as smallcluster.
- Change the value of KEYNAME to the name of your key pair (see the keypair section above).
- Optionally change CLUSTER_SIZE to the number of machines you want to launch. This number includes the master node, so the default value of 2 means one master and one worker. We recommend starting with 2 until you have familiarized yourself with using StarCluster and Bioconductor.
- Change CLUSTER_USER to ubuntu.
- Uncomment the line DNS_PREFIX = True. This makes your cluster instances easier to recognize when using the AWS Console or command-line tools.
- Change NODE_IMAGE_ID to the AMI ID of the AMI you want to use. This will be listed in the AMI IDs section of this document. Note that StarCluster only works with AMIs for Bioconductor version 2.14 and higher.
- Optionally change NODE_INSTANCE_TYPE to another instance type. See the Instance Types page for more information.
- Under the line reading #PERMISSIONS = ssh, http, add the line permissions = http (note lowercase). This is related to security group permissions (more about this below).
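Putting the edits above together, the relevant parts of the config file might look like the sketch below; the key name, key location, and AMI ID are placeholders to be replaced with your own values:

```ini
[key bioconductor-bob-mylaptop]
KEY_LOCATION = ~/.ssh/bioconductor-bob-mylaptop.pem

[cluster smallcluster]
KEYNAME = bioconductor-bob-mylaptop
CLUSTER_SIZE = 2
CLUSTER_USER = ubuntu
DNS_PREFIX = True
NODE_IMAGE_ID = ami-0e7efd11a6eab85a6
NODE_INSTANCE_TYPE = m1.small
permissions = http
```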
You can make additional changes to this section if you want to further customize your configuration. Refer to the StarCluster documentation for more information.
Configuring Security Group Permissions section
Remove the comments (# symbol) from the four lines starting with [permission http] so that you end up with:
[permission http]
IP_PROTOCOL = tcp
FROM_PORT = 80
TO_PORT = 80
This allows port 80 on the cluster instances to be open to the world, allowing us to use RStudio Server on that port.
Starting a Cluster
Assuming you have completed the steps above, you can create a cluster with the command:
starcluster start smallcluster
After a few moments, the cluster should be available.
Connecting to the cluster
There are two ways to connect to the cluster’s master node: RStudio Server and SSH. Unless you have a special need to use SSH, we recommend using RStudio Server.
Connecting using RStudio Server
First, get the hostname of the master node by issuing the command:
starcluster listclusters
You can also abbreviate this:
starcluster lc
This will produce output like the following:
-----------------------------------------------
smallcluster (security group: @sc-smallcluster)
-----------------------------------------------
Launch time: 2014-06-16 09:57:54
Uptime: 0 days, 02:19:56
Zone: us-east-1b
Keypair: bioc-default
EBS volumes: N/A
Cluster nodes:
smallcluster-master running i-46a76c6d ec2-54-91-23-93.compute-1.amazonaws.com
smallcluster-node001 running i-47a76c6c ec2-54-224-6-153.compute-1.amazonaws.com
Total nodes: 2
The line that starts with smallcluster-master ends with a hostname (in this case it’s ec2-54-91-23-93.compute-1.amazonaws.com; in your case it will be something different but similar). You can paste this host name into a web browser (depending on the browser, you may need to put http:// in front of the host name). This should bring you to the RStudio Server login page. You can log in with the username ubuntu and the password bioc.
Connecting using SSH
To connect to the master node using ssh, simply issue the command
starcluster sshmaster --user=ubuntu smallcluster
Terminating the Cluster **IMPORTANT!!**
When you are done, you MUST terminate your cluster or you will continue to be charged money by Amazon Web Services. To terminate the cluster, do this:
starcluster terminate smallcluster
This command will prompt you to confirm that you really want to terminate the cluster.
Cluster Scenarios
The following scenarios assume that you have started up a cluster and that you are connected to the master node.
Using BiocParallel with Sun Grid Engine
When you start a cluster with StarCluster, it’s automatically configured to use the BiocParallel and BatchJobs packages with Sun Grid Engine as the back end. You can demonstrate this by loading BatchJobs:
library(BatchJobs)
Among other output, this will say
cluster functions: SGE
indicating that Sun Grid Engine is the back end.
Here’s how to send a simple job to the cluster:
library(BatchJobs)
library(BiocParallel)
param <- BatchJobsParam(2, resources=list(ncpus=1))
register(param)
FUN <- function(i) system("hostname", intern=TRUE)
xx <- bplapply(1:100, FUN)
table(unlist(xx))
This will produce:
smallcluster-master smallcluster-node001
50 50
…indicating that SGE ran half the jobs on the master and the other half on the worker node.
This presentation outlines how to develop a full-scale analysis (identification of cis-dsQTL) using the kind of cluster we’ve just created.
Using SSH as the back end
Here is the same example as above, except using SSH instead of Sun Grid Engine as the back end:
library(BatchJobs)
library(BiocParallel)
cluster.functions <- makeClusterFunctionsSSH(
makeSSHWorker(nodename="smallcluster-master"),
makeSSHWorker(nodename="smallcluster-node001")
)
param2 <- BatchJobsParam(2, resources=list(ncpus=1),
cluster.functions=cluster.functions)
register(param2)
FUN <- function(i) system("hostname", intern=TRUE)
xx <- bplapply(1:10, FUN)
table(unlist(xx))
You should see results like this:
smallcluster-master smallcluster-node001
5 5
Using MPI as the back end
When you start a cluster using the above steps, R is automatically aware of the cluster, as shown in the following example:
library(Rmpi)
mpi.universe.size()
With the default cluster configuration, this should return 2, which makes sense, since our cluster consists of two machines (smallcluster-master and smallcluster-node001), each of type m1.small, which have one core each.
Again using BiocParallel, you can run a simple function on your MPI cluster:
FUN <- function(i) system("hostname", intern=TRUE)
Create a SnowParam instance with the number of nodes equal to the size of the MPI universe minus 1 (let one node dispatch jobs to workers), and register this instance as the default:
param3 <- SnowParam(mpi.universe.size() - 1, "MPI")
register(param3)
Evaluate the work in parallel and process the results:
xx <- bplapply(1:10, FUN)
table(unlist(xx))
Creating a custom version of the Bioconductor AMI
Note: If you make changes to the running Bioconductor AMI, and then terminate the AMI, your changes will be lost. Use the steps described here to ensure that your changes are persistent.
If the AMI is missing some packages or features you think it should have, please let us know.
If you want to customize the AMI for your own purposes, it is simple. Just go ahead and customize your running instance as you see fit. Typically this will involve installing R packages with BiocManager::install(), and software packages (at the operating system level) with the Ubuntu package manager apt-get.
You may also want to change the password of the “ubuntu” user, since the default password (used for RStudio Server logins) is publicly known, with the command:
passwd ubuntu
Now use the AWS Console to Stop your instance. Important: do NOT “Terminate” your instance; use the Stop command (under Instance Actions) instead.
Then choose “Create Image (EBS AMI)” under the Instance Actions menu. You will be prompted for a name for your AMI. After entering the name, your AMI will be created and given a unique AMI ID. You can then launch instances of this AMI using the steps above, being sure to substitute the ID of your own AMI. Your AMI will be private, accessible only to your AWS account, unless you decide to make it more widely accessible.
Now you should Terminate the Stopped instance of the Bioconductor AMI.
Provisioning a virtual or physical machine for use with Bioconductor
The Bioconductor AMI was created using Vagrant and Chef. The same scripts that were used to create these AMIs can also be used to provision virtual machines (Virtualbox or VMWare) or physical machines.
For more information, see the scripts’ github repository.
Moving data to and from your Bioconductor AMI instance
If you are using RStudio Server, you can upload and download files using the Files pane in RStudio Server.
If you are connected via ssh, the scp command is the most efficient way to move data to and from your EC2 instances.
To copy a file from your computer to a running Bioconductor AMI instance:
- Open a Terminal or Command Prompt window on your computer
- Using “cd”, change to the directory where your key pair (.pem) file lives
- Issue a command like the following (your key pair name and the hostname of the AMI instance will be different; you can determine the correct values by clicking on your running instance in the AWS Console):
scp -i bioconductor-bob-mylaptop.pem /path/to/myfile ubuntu@ec2-50-16-120-30.compute-1.amazonaws.com:~
That will copy the file at “/path/to/myfile” on your local computer to ubuntu’s home directory on the remote instance. To copy a file from a running instance to your local computer, do something like this (still at your local computer):
scp -i bioconductor-bob-mylaptop.pem ubuntu@ec2-50-16-120-30.compute-1.amazonaws.com:~/myfile /some/directory
That will copy the file ~/myfile from the running instance to /some/directory on your local machine.
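After moving large files it can be worth confirming that the copy arrived intact by comparing checksums on both ends. A local sketch of the idea (in practice the second checksum would be computed on the instance, for example over ssh with sha256sum ~/myfile):

```shell
# A sample file standing in for data to be transferred.
echo "example data" > myfile
# A local copy stands in for the scp transfer.
cp myfile myfile.copy
# Matching checksums mean the contents are identical.
sha256sum myfile myfile.copy
```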
Reminder: Files created on a running EC2 instance are not persisted unless you take special steps. So if you are generating output files with Bioconductor, you must copy them to your local machine before terminating your instance, or your files will be lost. If you have a lot of data to move back and forth, you may want to look into Elastic Block Storage.
Questions
If you have questions about the Bioconductor AMI, please contact us through the Bioconductor support site.