This is a walkthrough demonstrating the initial setup of a workspace on the Terra cloud platform for viral sequencing work.
Beyond the aspects described below, further information about Terra can be found in the official Terra documentation.
Terra is tested with, and intended to be accessed using, the Google Chrome browser.
Next, a new workspace will be created. In Terra, a workspace is a way of grouping together tabular data, compute jobs, output data, and interactive notebooks. Workspaces are a good way to scope data by project, and each is associated with a billing project and a list of users or groups allowed to access the workspace data.
When creating the new workspace, select the billing project to use (e.g., pathogen-genomic-surveillance).

Workspaces, their data, and stored output from compute jobs exist on either Google Cloud Platform or Microsoft Azure. Each billing project is associated with a particular cloud backend. The cloud backend used for a workspace is determined by the billing project selected when the workspace is created, and it cannot be changed for an existing workspace.
Within a workspace, data are organized into files and tables.
Each workspace created in Terra has its own cloud bucket3 for storing file data. Paths to these files can be stored in data tables and used as workflow inputs. File outputs from compute jobs are stored in the same workspace bucket. Access to the file data of a workspace is controlled by the sharing settings of the workspace as a whole.
Compute jobs can also use data stored in external buckets, provided the user’s proxy account has read access to the data4.
Data can be transferred to or from a workspace bucket using a web browser, or from the command line via the gsutil or gcloud storage CLI (Google Cloud Platform).
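As a minimal command-line sketch (the gs://fc-xxxxxxxx bucket name below is a placeholder; the actual bucket path for a workspace is shown on its Dashboard page):

```sh
# Upload a local file to the workspace bucket with gsutil
gsutil cp reference_genomes/ref-RSVA-KY654518.1.fasta gs://fc-xxxxxxxx/reference_genomes/

# The same transfer using the newer gcloud storage CLI
gcloud storage cp reference_genomes/ref-RSVA-KY654518.1.fasta gs://fc-xxxxxxxx/reference_genomes/

# Download a file from the workspace bucket to the current directory
gsutil cp gs://fc-xxxxxxxx/reference_genomes/ref-RSVA-KY654518.1.fasta .
```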
Upload the files provided to the workspace:

- reference_genomes/ref-RSVA-KY654518.1.fasta
- reference_genomes/ref-RSVB-MZ516105.1.fasta
There are two main types of table: the Workspace Data table, which holds individual key–value pairs that apply to the whole workspace, and data tables, in which each row describes an entity such as a sample or flowcell. Cell values can be strings, numbers, booleans (true/false), references to other table rows, or a list with multiple items of one of these types.
The values in the rows of these tables can be used as inputs to compute jobs, and the same rows also store the corresponding outputs from those jobs.
When a compute workflow is configured, it can use data from an individual table as input, and execute multiple jobs in parallel, one per row selected from the chosen table.
When each job finishes, its output will be stored in columns of the same row that was used as input for the job.
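For illustration only (the input and output names below are hypothetical, not taken from this walkthrough), a workflow configuration can reference per-row values with this. and workspace-level values with workspace., roughly as follows:

```
# Inputs (set on the workflow configuration screen):
#   reads_fastq    -> this.reads_fastq            # read from the selected row of the chosen table
#   reference      -> workspace.ref_RSV_A_fasta   # read from the Workspace Data table
# Outputs (written back when the job finishes):
#   assembly_fasta -> this.assembly_fasta         # stored in a column of the same row
```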
Each workspace can have multiple tables, to aid organization and also to describe relationships between rows in different tables5. A cell in one table can reference one or more rows in another table; for example, a table representing samples may list rows with sample names and have a column that references one or more sequencing libraries for each sample, with the actual data for each sequencing library stored in a second table.
```mermaid
%%{ init: { 'flowchart': { 'curve': 'basis' } } }%%
flowchart LR
subgraph flowcell
flowcell1:::fc-nostroke
flowcell2:::fc-nostroke
flowcell3:::fc-nostroke
end
subgraph library
flowcell1-->sample1.l1:::entity1
flowcell1-->sample2.l1:::entity2
flowcell1-->sample3.l1:::entity3
flowcell2-->sample1.l2:::entity1
flowcell2-->sample2.l2:::entity2
flowcell2-->sample3.l2:::entity3
flowcell3-->sample4.l1:::entity4
flowcell3-->sample5.l1:::entity5
flowcell3-->sample6.l1:::entity6
end
subgraph sample
sample1.l1-->sample1:::set_entity1
sample1.l2-->sample1:::set_entity1
sample2.l1-->sample2:::set_entity2
sample2.l2-->sample2:::set_entity2
sample3.l1-->sample3:::set_entity3
sample3.l2-->sample3:::set_entity3
sample4.l1-->sample4:::set_entity4
sample5.l1-->sample5:::set_entity5
sample6.l1-->sample6:::set_entity6
end
classDef fc-nostroke fill:green, color:#fff, stroke-width:0px
classDef set_entity1 fill:red,color:#fff,stroke:red,stroke-width:2px
classDef set_entity2 fill:yellow,color:#000,stroke:yellow,stroke-width:2px
classDef set_entity3 fill:blue,color:#fff,stroke:blue,stroke-width:2px
classDef set_entity4 fill:#888,color:#fff,stroke:#888,stroke-width:2px
classDef set_entity5 fill:#666,color:#fff,stroke:#666,stroke-width:2px
classDef set_entity6 fill:#444,color:#fff,stroke:#444,stroke-width:2px
classDef entity1 stroke:red,color:#000,fill:#fff,stroke-width:3px
classDef entity2 stroke:yellow,color:#000,fill:#fff,stroke-width:3px
classDef entity3 stroke:blue,color:#000,fill:#fff,stroke-width:3px
classDef entity4 stroke:#888,color:#000,fill:#fff,stroke-width:3px
classDef entity5 stroke:#666,color:#000,fill:#fff,stroke-width:3px
classDef entity6 stroke:#444,color:#000,fill:#fff,stroke-width:3px
style flowcell fill:#eee,stroke:#333,stroke-width:0px
style library fill:#eee,stroke:#333,stroke-width:0px
style sample fill:#eee,stroke:#333,stroke-width:0px
```
First, add a few common fields to the Workspace Data table from the TSV file provided:
Navigate to the Data tab of the workspace, click Workspace Data, and then drag and drop the file tabular_inputs/workspace-attributes.tsv onto the browser window to upload the TSV.
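The exact contents of tabular_inputs/workspace-attributes.tsv are not reproduced here, but a workspace-attributes TSV generally has the following shape, with the first column header prefixed by workspace: and columns separated by tabs (the attribute names and values below are hypothetical):

```tsv
workspace:attribute_name_1	attribute_name_2
value_1	value_2
```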
TO DO: add screenshots and a detailed description of how to fix the bugs with String Lists, specifically blastDbs: remove the quotes and brackets but not the commas, switch the column type from String to String List, and verify that it works by previewing the files and seeing a non-zero file size.
Copy the full bucket paths for the two reference genomes to the corresponding cells in the Workspace Data table:

- ref_RSV_A_fasta: (path to: ref-RSVA-KY654518.1.fasta in workspace files)
- ref_RSV_B_fasta: (path to: reference_genomes/ref-RSVB-MZ516105.1.fasta in workspace files)

Add a table to store information for sequencing runs, where each row in the table corresponds to an individual flowcell:
Navigate to the Data tab of the workspace, click Import Data, and then click Upload TSV. Drag and drop the file provided, tabular_inputs/flowcell_data.tsv, and click Start Import Job.
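The layout of tabular_inputs/flowcell_data.tsv is not reproduced here, but a TSV that creates a table named flowcell follows Terra's convention of naming the first column entity:flowcell_id; the remaining column names and values below are hypothetical:

```tsv
entity:flowcell_id	column_1	column_2
flowcell1	value	value
flowcell2	value	value
flowcell3	value	value
```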
The content of Terra data tables can be exported to TSV files for viewing and manipulation in external tools. The same data can also be copied to the user’s clipboard.
TSV data can also be imported to create a new data table or modify the values of an existing table, as shown above.
Each workspace can have its own interactive Jupyter notebooks.
The notebooks are stored as *.ipynb files within the notebooks/ sub-path of a workspace file storage bucket, and can be moved or copied between workspaces as long as the *.ipynb files reside within notebooks/.
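For example, a notebook can be copied from one workspace bucket to another from the command line (both bucket names below are placeholders):

```sh
# Copy a notebook between workspace buckets; it will appear in the destination
# workspace as long as it lands inside the notebooks/ folder
gsutil cp gs://fc-source-bucket/notebooks/create_data_table_tsv.ipynb \
    gs://fc-destination-bucket/notebooks/
```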
To create or use a notebook, a virtual compute instance containing Jupyter must be created or re-used. These instances can be created or accessed via the Analyses tab of a Terra workspace.
In the Data tab, click Files on the left-hand pane. If a folder called notebooks/ does not exist, click New folder and create a folder called notebooks6.
Click the notebooks/ folder to view the content.
Upload the create_data_table_tsv.ipynb file to notebooks/. The *.ipynb file will appear once fully uploaded.
Pop-up blocking may need to be disabled in your browser if the authentication pop-up window does not appear after clicking “Sign In” ↩
Do not select an Authorization Domain; doing so will complicate access and sharing of workspace data. ↩
File data are stored in Blob Storage on Microsoft Azure. ↩
The identifier for a user’s proxy account—formatted as an e-mail address—can be found on the Terra Profile Information page. ↩
Internally, Terra stores data in a relational database and conceptualizes one-to-one, one-to-many, and many-to-many relationships similarly. ↩
The Analyses tab only lists notebooks stored in the notebooks/ folder; if the folder name does not match this exactly, the notebooks will not appear under Analyses. ↩