viral-workshops

Terra workspace initial setup

This is a walkthrough demonstrating the initial setup of a workspace on the Terra cloud platform for viral sequencing work.

Contents

  1. Terra workspace initial setup
    1. Terra documentation
    2. Sign in to Terra
    3. Create a new workspace
      1. Selecting Microsoft Azure as the cloud backend for a Terra workspace
    4. Organizing data in Terra
      1. File data
      2. Tabular data
        1. Adding common Workspace Data
        2. Adding a table for sequencing runs
      3. Exporting tabular data
    5. Interactive Notebooks in Terra
      1. Upload the notebook

Terra documentation

Beyond the aspects of Terra described below, more information can be found in the official Terra documentation.

Terra is tested with, and intended to be accessed from, the Google Chrome browser.

Sign in to Terra

  1. Navigate to https://app.terra.bio.
  2. Click the icon in the upper left corner to expand the main menu.
  3. Click “Sign In”. A popup window will appear[1].
  4. Click “Sign in with Google”, and authenticate using the credentials for your Google account.

Create a new workspace

Next, create a new workspace. In Terra, a workspace groups together tabular data, compute jobs, output data, and interactive notebooks. Workspaces are a good way to scope data by project, and each is associated with a billing project and a list of users or groups allowed to access the workspace data.

  1. Navigate to view the list of workspaces you can access: https://app.terra.bio/#workspaces.
  2. Click the “+” icon to start configuring a new workspace.
  3. Enter a unique workspace name, and optionally a description.
  4. Select a billing project (pathogen-genomic-surveillance).
  5. After selecting a billing project, a few more input fields will appear. Leave “Bucket location” set to its default value. DO NOT select “Workspace will have protected data”. DO NOT select any value for the “Authorization domain”[2].
  6. Click “Create workspace”; you will be redirected to the main dashboard for the newly created workspace.

Selecting Microsoft Azure as the cloud backend for a Terra workspace

Workspaces, their data, and stored output from compute jobs exist on either Google Cloud Platform or Microsoft Azure. Each billing project is associated with a particular cloud backend. The cloud backend for a workspace is determined by the billing project selected when the workspace is created, and it cannot be changed for an existing workspace.

Organizing data in Terra

Within a workspace, data are organized into files and tables.

File data

Each workspace created in Terra has its own cloud bucket[3] for storing file data. Paths to these files can be stored in data tables and used as workflow inputs. File outputs from compute jobs are stored in the same workspace bucket. Access to the file data of a workspace is controlled according to the sharing settings of the workspace as a whole.

Compute jobs can also use data stored in external buckets, provided the user’s proxy account has read access to the data[4].

Data can be transferred to or from a workspace bucket using a web browser, or from the command line via the gsutil or gcloud storage CLI (Google Cloud Platform).
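
As an illustration of the command-line route, the following minimal sketch assembles a `gcloud storage cp` invocation for copying a local file into a workspace bucket. The bucket name `fc-example-bucket`, the local file `my_reference.fasta`, and the destination folder `references` are hypothetical placeholders, not values from this workshop; the helper `build_upload_command` is likewise only illustrative.

```python
import shlex
from typing import List


def build_upload_command(local_path: str, bucket: str, dest_prefix: str = "") -> List[str]:
    """Build the argument list for `gcloud storage cp LOCAL gs://BUCKET/PREFIX/`."""
    if not bucket.startswith("gs://"):
        bucket = "gs://" + bucket
    dest = bucket.rstrip("/") + "/" + dest_prefix.strip("/")
    if not dest.endswith("/"):
        dest += "/"  # trailing slash: copy *into* the folder
    return ["gcloud", "storage", "cp", local_path, dest]


# Hypothetical names; substitute your own workspace bucket and file.
cmd = build_upload_command("my_reference.fasta", "fc-example-bucket", "references")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the Google Cloud CLI is installed and authenticated. Building the command as a list, rather than a single string, avoids shell-quoting problems with unusual file names.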

Upload the files provided to the workspace:

Tabular data

There are two main types of table:

A cell in one table can reference one or more rows in another table; for example, a table representing samples may list rows with sample names, and have a column that references one or more sequencing libraries for each sample, with the actual data for each sequencing library stored in a second table.

%%{ init: { 'flowchart': { 'curve': 'basis' } } }%%
flowchart LR
    subgraph flowcell
    flowcell1:::fc-nostroke
    flowcell2:::fc-nostroke
    flowcell3:::fc-nostroke
    end
    subgraph library
    flowcell1-->sample1.l1:::entity1
    flowcell1-->sample2.l1:::entity2
    flowcell1-->sample3.l1:::entity3
    flowcell2-->sample1.l2:::entity1
    flowcell2-->sample2.l2:::entity2
    flowcell2-->sample3.l2:::entity3
    flowcell3-->sample4.l1:::entity4
    flowcell3-->sample5.l1:::entity5
    flowcell3-->sample6.l1:::entity6
    end
    subgraph sample
    sample1.l1-->sample1:::set_entity1
    sample1.l2-->sample1:::set_entity1
    sample2.l1-->sample2:::set_entity2
    sample2.l2-->sample2:::set_entity2
    sample3.l1-->sample3:::set_entity3
    sample3.l2-->sample3:::set_entity3
    sample4.l1-->sample4:::set_entity4
    sample5.l1-->sample5:::set_entity5
    sample6.l1-->sample6:::set_entity6
    end
    classDef fc-nostroke fill:green, color:#fff, stroke-width:0px
    classDef set_entity1 fill:red,color:#fff,stroke:red,stroke-width:2px
    classDef set_entity2 fill:yellow,color:#000,stroke:yellow,stroke-width:2px
    classDef set_entity3 fill:blue,color:#fff,stroke:blue,stroke-width:2px
    classDef set_entity4 fill:#888,color:#fff,stroke:#888,stroke-width:2px
    classDef set_entity5 fill:#666,color:#fff,stroke:#666,stroke-width:2px
    classDef set_entity6 fill:#444,color:#fff,stroke:#444,stroke-width:2px
    
    classDef entity1 stroke:red,color:#000,fill:#fff,stroke-width:3px
    classDef entity2 stroke:yellow,color:#000,fill:#fff,stroke-width:3px
    classDef entity3 stroke:blue,color:#000,fill:#fff,stroke-width:3px

    classDef entity4 stroke:#888,color:#000,fill:#fff,stroke-width:3px
    classDef entity5 stroke:#666,color:#000,fill:#fff,stroke-width:3px
    classDef entity6 stroke:#444,color:#000,fill:#fff,stroke-width:3px
    style flowcell fill:#eee,stroke:#333,stroke-width:0px
    style library fill:#eee,stroke:#333,stroke-width:0px
    style sample fill:#eee,stroke:#333,stroke-width:0px

Adding common Workspace Data

First, add a few common fields to the Workspace Data table from the TSV file provided: Navigate to the Data tab of the workspace, click Workspace Data, and then drag and drop the file tabular_inputs/workspace-attributes.tsv onto the browser window (upload TSV).

TO DO: screenshots and detailed description of how to fix the bugs with String Lists, specifically blastDbs, remove quotes and brackets but not commas, switch String to String List, verify it works by previewing the files and seeing a non-zero file size

Copy the full bucket paths for the two reference genomes to the corresponding cells in the Workspace Data table:

Adding a table for sequencing runs

Add a table to store information for sequencing runs, where each row in the table corresponds to an individual flowcell: Navigate to the Data tab of the workspace, click Import Data, and then click Upload TSV. Drag and drop the file provided, tabular_inputs/flowcell_data.tsv, and click Start Import Job.
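
For readers who want to build a similar file themselves, here is a minimal sketch of writing a table TSV in the layout Terra expects for uploads, where the first column header takes the form `entity:<table name>_id`. The column names `instrument` and `run_notes`, the row values, and the helper `make_entity_tsv` are hypothetical; the actual columns in flowcell_data.tsv may differ.

```python
import csv
import io


def make_entity_tsv(table_name, rows):
    """Render {row_id: {column: value}} as a tab-separated table for upload."""
    # Collect every column used by any row, in a stable (sorted) order.
    column_names = sorted({col for cols in rows.values() for col in cols})
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # The first header cell names the table and marks its ID column.
    writer.writerow([f"entity:{table_name}_id"] + column_names)
    for row_id, cols in rows.items():
        writer.writerow([row_id] + [cols.get(col, "") for col in column_names])
    return buf.getvalue()


# Hypothetical rows for illustration only.
tsv_text = make_entity_tsv(
    "flowcell",
    {
        "flowcell1": {"instrument": "example-sequencer", "run_notes": "first run"},
        "flowcell2": {"instrument": "example-sequencer"},
    },
)
print(tsv_text)
```

Missing values are written as empty cells, so every row has the same number of columns as the header.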

Exporting tabular data

The content of Terra data tables can be exported to TSV files for viewing and manipulation in external tools. The same data can also be copied to the user’s clipboard.

TSV data can also be imported to create a new data table or modify the values of an existing table, as shown above.
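
Once exported, a table TSV can be read with standard tools. The sketch below parses TSV text into a dictionary keyed by row ID, assuming the first column holds the row ID; the sample text, its column name `instrument`, and the helper `read_table_tsv` are hypothetical and stand in for a real exported file.

```python
import csv
import io


def read_table_tsv(text):
    """Parse TSV text into {row_id: {column: value}}, treating column 1 as the ID."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    id_column = reader.fieldnames[0]
    return {
        row[id_column]: {k: v for k, v in row.items() if k != id_column}
        for row in reader
    }


# Hypothetical exported content for illustration only.
exported = "entity:flowcell_id\tinstrument\nflowcell1\texample-sequencer\nflowcell2\texample-sequencer\n"
table = read_table_tsv(exported)
print(sorted(table))  # ['flowcell1', 'flowcell2']
```

To read from a downloaded file instead, pass the result of `open(path, newline="").read()` to the same function.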

Interactive Notebooks in Terra

Each workspace can have its own interactive Jupyter notebooks.

The notebooks are stored as *.ipynb files within the notebooks/ sub-path of a workspace file storage bucket, and can be moved or copied between workspaces as long as the *.ipynb files reside within notebooks/.
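
The path requirement can be expressed as a small check. This sketch tests whether an object path, given relative to the root of a workspace bucket, is an *.ipynb file inside a top-level folder named exactly notebooks; the function name `is_workspace_notebook` is hypothetical, and the case-sensitive comparison is an assumption based on the exact-match requirement described in this walkthrough.

```python
NOTEBOOK_FOLDER = "notebooks"


def is_workspace_notebook(object_path):
    """True if object_path is an .ipynb file under the top-level notebooks/ folder."""
    parts = object_path.strip("/").split("/")
    return (
        len(parts) >= 2
        and parts[0] == NOTEBOOK_FOLDER  # exact, case-sensitive match
        and parts[-1].endswith(".ipynb")
    )


print(is_workspace_notebook("notebooks/create_data_table_tsv.ipynb"))  # True
print(is_workspace_notebook("Notebooks/create_data_table_tsv.ipynb"))  # False
print(is_workspace_notebook("create_data_table_tsv.ipynb"))            # False
```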

To create or use a notebook, a virtual compute instance containing Jupyter must be created or re-used. These instances can be created or accessed via the Analyses tab of a Terra workspace.

Upload the notebook

In the Data tab, click Files on the left-hand pane. If a folder called notebooks/ does not exist, click New folder and create a folder called notebooks[6].

Click the notebooks/ folder to view the content.

Upload the create_data_table_tsv.ipynb file to notebooks/. The *.ipynb file will appear once fully uploaded.

  1. Pop-up blocking may need to be disabled in your browser if the authentication pop-up window does not appear after clicking “Sign In”.

  2. Do not select an Authorization Domain; doing so will complicate access and sharing of workspace data. 

  3. File data are stored in Blob Storage on Microsoft Azure. 

  4. The identifier for a user’s proxy account, formatted as an e-mail address, can be found on the Terra Profile Information page.

  5. Internally, Terra stores data in a relational database and conceptualizes one-to-one, one-to-many, and many-to-many relationships similarly. 

  6. The Analyses tab only lists notebooks stored in the notebooks/ folder; if the folder name does not match this exactly, the notebooks will not appear under Analyses.