Bioinformatics examples

Last updated on 2024-11-19 | Edit this page

Estimated time: 12 minutes

Overview

Questions

  • How can we work with files specific to the biological sciences in Unix?

Objectives

  • Explain how to use common Unix tools (eg. grep, sed etc) to extract information from a specialized biological file format (example: GFF )
  • Demonstrate how to use a specialized Unix tool (eg. h5ls) to inspect biological data in a specialized file format (example: AnnData with single cell RNAseq data)

Unix/Linux provides many small tools and utilities that can easily interact with each other using their built-in input and output streams, allowing you to perform complex data manipulations on the command line without writing a formal program or script.

This bonus section provides some examples using Unix command line tools for biology-specific tasks.

Data wrangling using grep, pipes and file redirection


Background: The CHESS gene catalog curates the set of all human genes and their transcripts through analysis of GTEx RNA sequencing data and manual assessment of RefSeq and GENCODE annotations and overlap. This catalog reports a different count of protein-coding genes in the human genome compared to RefSeq and GENCODE. To identify the overlap of genes across all three annotations sets, we need the total list of protein-coding gene names by extracting them from the annotation files.

Task: generate the unique list of protein-coding gene names from the CHESS annotation file (GFF)

Download the CHESS GFF file (chess3.1.3.GRCh38.assembly.gff.gz) to your home directory and unpack it.

BASH

$ cd
$ wget https://github.com/jlchang/cb-unix-shell-lesson-template/raw/refs/heads/main/learners/files/chess3.1.3.GRCh38.assembly.gff.gz
$ gunzip chess3.1.3.GRCh38.assembly.gff.gz

Let’s look at the file, can we figure out how a protein-coding gene is represented?

BASH

$ head chess3.1.3.GRCh38.assembly.gff

OUTPUT

##gff-version 3
#!gff-spec-version 1.21
chr1	CHESS	gene	11874	14409	.	+	.	ID=CHS.1;gene_name=DDX11L1;gene_type=transcribed_pseudogene
chr1	RefSeq	transcript	11874	14409	.	+	.	ID=CHS.1.1;Parent=CHS.1;gene_name=DDX11L1;gene_type=transcribed_pseudogene;db_xref=RefSeq:NR_046018.2,GENCODE:ENST00000456328.2;assembly_id=NA;cds_adjustment_status=0;exon_adjustment_status=0;original_source=BestRefSeq
chr1	RefSeq	exon	11874	12227	.	+	.	Parent=CHS.1.1;gene_name=DDX11L1
chr1	RefSeq	exon	12613	12721	.	+	.	Parent=CHS.1.1;gene_name=DDX11L1
chr1	RefSeq	exon	13221	14409	.	+	.	Parent=CHS.1.1;gene_name=DDX11L1
chr1	CHESS	gene	14362	29370	.	-	.	ID=CHS.2;gene_name=WASH7P;gene_type=transcribed_pseudogene
chr1	RefSeq	transcript	14362	29370	.	-	.	ID=CHS.2.1;Parent=CHS.2;gene_name=WASH7P;gene_type=transcribed_pseudogene;db_xref=RefSeq:NR_024540.1;assembly_id=NA;cds_adjustment_status=0;exon_adjustment_status=0;original_source=BestRefSeq
chr1	RefSeq	exon	14362	14829	.	-	.	Parent=CHS.2.1;gene_name=WASH7P

In the file we see from the 2nd column that there are lines for features identified by CHESS and RefSeq. To get a better sense of all the different possible values, we can use Unix tools to extract and summarize for us.

BASH

$ cut -f 2 chess3.1.3.GRCh38.assembly.gff

We don’t want to sift through 2.6 million values so let’s use Unix’ sort command to tell us all the unique values.

BASH

$ cut -f 2 chess3.1.3.GRCh38.assembly.gff | sort -u

OUTPUT

#!gff-spec-version 1.21
##gff-version 3
CHESS
GENCODE
MANE
RefSeq

Exercise

Look up how the sort and uniq tools work. How would you find out how many lines of each type are in the file?

BASH

$ cut -f 2 chess3.1.3.GRCh38.assembly.gff | sort | uniq -c | sort -r

OUTPUT

61921 GENCODE
618423 CHESS
413966 MANE
1517005 RefSeq
   1 ##gff-version 3
   1 #!gff-spec-version 1.21

After examining the file a bit more, we realize we’re probably looking for lines that have “gene_type=protein_coding” in the ninth field.

BASH

grep gene_type=protein_coding chess3.1.3.GRCh38.assembly.gff | less

BUT we only want the genes identified by CHESS… so we can use grep to first filter for just the CHESS lines, then look for just the protein_coding lines

BASH

grep "CHESS\tgene" chess3.1.3.GRCh38.assembly.gff | grep gene_type=protein_coding | wc -l 

It seems protein coding gene names are listed in the file with the following syntax:

OUTPUT

;gene_name=DNASE1;gene_type=protein_coding

… where the gene “DNASE1” could be replaced with any gene name.

Alternative - use regular expressions to extract the gene names

To identify such a string with any gene name inserted, we recognize that a gene name may contain uppercase letters, lowercase letters, numbers, or hyphens. We therefore turn to https://regex101.com/ to see if we can devise an appropriate regular expression to identify the string above containing any gene name

grep the file for these lines, using the regular expression so it only reports the exact matching sequence (-o), and save the results to a text file

BASH

 grep -o ';gene_name=[a-zA-Z0-9-]*;gene_type=protein_coding' chess3.1.3.CHM13.assembly.gff | uniq > chess3.1.3_genes.txt
 
 sed 's/^.*gene_name=//' test | sed 's/;gene_type.*$//'

Isolate individual gene names from the matching lines

BASH

grep -o "[A-Z][a-zA-Z0-9-]*" chess3.1.3_genes_temp.txt | sort >chess3.1.3_genes.txt

Removing the parts of the file that you don’t want can be done with sed and wildcards

(This part is still rough, ran out of time. Sorry! Will fix soon.)

BASH

grep -o ';gene_name=[a-zA-Z0-9-]*;gene_type=protein_coding' chess3.1.3.CHM13.assembly.gff | uniq > newfile.txt
sed 's/;gene_name=//' newfile.txt > newnewfile.text
sed 's/;gene_type=protein_coding//' newnewfile.text > justthegenes.txt

Next time, remember that you can also look up how a file format like GFF format is structured.

Introducing h5ls, a Unix tool to peek into AnnData objects

Let’s get an example h5ad file and put it in our home directory

BASH

$ cd
$ wget https://github.com/jlchang/cb-unix-shell-lesson-template/raw/refs/heads/main/learners/files/example.h5ad

If you have an AnnData file locally AND you have the hdf5 library installed locally (which you do if you have the AnnData or ScanPy library installed), you can use the hdf5 command line tool h5ls to quickly peek inside an AnnData file. HDF5 files have two types of objects, Datasets and Groups. Example:

BASH

> h5ls example.h5ad

OUTPUT

X                        Dataset {700, 765}
layers                   Group
obs                      Group
obsm                     Group
obsp                     Group
raw                      Group
uns                      Group
var                      Group
varm                     Group
varp                     Group

Go deeper by appending a Group name to the file:

BASH

> h5ls example.h5ad/obs

OUTPUT

G2M_score                Dataset {700}
S_score                  Dataset {700}
bulk_labels              Group
index                    Dataset {700}
louvain                  Group
n_counts                 Dataset {700}
n_genes                  Dataset {700}
percent_mito             Dataset {700}
phase                    Group

Too focused? Get the entire directory tree for the file with h5ls -r

BASH

> h5ls -r example.h5ad

Too much information? Do a more focused recursive h5ls by specifying a deeper path in the object (in this example, the deeper path is ‘/obs’):

OUTPUT

> h5ls -r example.h5ad/obs
/G2M_score               Dataset {700}
/S_score                 Dataset {700}
/bulk_labels             Group
/bulk_labels/categories  Dataset {10}
/bulk_labels/codes       Dataset {700}
/index                   Dataset {700}
/louvain                 Group
/louvain/categories      Dataset {11}
/louvain/codes           Dataset {700}
/n_counts                Dataset {700}
/n_genes                 Dataset {700}
/percent_mito            Dataset {700}
/phase                   Group
/phase/categories        Dataset {3}
/phase/codes             Dataset {700}

Notice that for AnnData objects, there are some Group objects that have nested datasets named categories and codes these are often Pandas’ categorical data. You can get the unique list of categorical labels with h5dump -d:

BASH

> h5dump -d "obs/phase/categories" example.h5ad

OUTPUT

HDF5 "example.h5ad" {
DATASET "obs/phase/categories" {
   DATATYPE  H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_UTF8;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
   DATA {
   (0): "G1", "G2M", "S"
   }
   ATTRIBUTE "encoding-type" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "string-array"
      }
   }
   ATTRIBUTE "encoding-version" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "0.2.0"
      }
   }
}
}

Key Points

  • Use common Unix tools to work with specialized file formats for biological data
  • Look for less common tools that may be more efficient or more quickly accessible to increase your productivity.