Bioinformatics examples
Last updated on 2024-11-19 | Edit this page
Overview
Questions
- How can we work with files specific to the biological sciences in Unix?
Objectives
Unix/Linux provides many small tools and utilities that can easily interact with each other using their built-in input and output streams, allowing you to perform complex data manipulations on the command line without writing a formal program or script.
This bonus section provides some examples using Unix command line tools for biology-specific tasks.
Data wrangling using grep, pipes and file redirection
Background: The CHESS gene catalog curates the set of all human genes and their transcripts through analysis of GTEx RNA sequencing data and manual assessment of RefSeq and GENCODE annotations and overlap. This catalog reports a different count of protein-coding genes in the human genome compared to RefSeq and GENCODE. To identify the overlap of genes across all three annotations sets, we need the total list of protein-coding gene names by extracting them from the annotation files.
Task: generate the unique list of protein-coding gene names from the CHESS annotation file (GFF)
Download the CHESS GFF file (chess3.1.3.GRCh38.assembly.gff.gz) to your home directory and unpack it.
BASH
$ cd
$ wget https://github.com/jlchang/cb-unix-shell-lesson-template/raw/refs/heads/main/learners/files/chess3.1.3.GRCh38.assembly.gff.gz
$ gunzip chess3.1.3.GRCh38.assembly.gff.gz
Let’s look at the file, can we figure out how a protein-coding gene is represented?
OUTPUT
##gff-version 3
#!gff-spec-version 1.21
chr1 CHESS gene 11874 14409 . + . ID=CHS.1;gene_name=DDX11L1;gene_type=transcribed_pseudogene
chr1 RefSeq transcript 11874 14409 . + . ID=CHS.1.1;Parent=CHS.1;gene_name=DDX11L1;gene_type=transcribed_pseudogene;db_xref=RefSeq:NR_046018.2,GENCODE:ENST00000456328.2;assembly_id=NA;cds_adjustment_status=0;exon_adjustment_status=0;original_source=BestRefSeq
chr1 RefSeq exon 11874 12227 . + . Parent=CHS.1.1;gene_name=DDX11L1
chr1 RefSeq exon 12613 12721 . + . Parent=CHS.1.1;gene_name=DDX11L1
chr1 RefSeq exon 13221 14409 . + . Parent=CHS.1.1;gene_name=DDX11L1
chr1 CHESS gene 14362 29370 . - . ID=CHS.2;gene_name=WASH7P;gene_type=transcribed_pseudogene
chr1 RefSeq transcript 14362 29370 . - . ID=CHS.2.1;Parent=CHS.2;gene_name=WASH7P;gene_type=transcribed_pseudogene;db_xref=RefSeq:NR_024540.1;assembly_id=NA;cds_adjustment_status=0;exon_adjustment_status=0;original_source=BestRefSeq
chr1 RefSeq exon 14362 14829 . - . Parent=CHS.2.1;gene_name=WASH7P
In the file we see from the 2nd column that there are lines for features identified by CHESS and RefSeq. To get a better sense of all the different possible values, we can use Unix tools to extract and summarize for us.
We don’t want to sift through 2.6 million values so let’s use Unix’ sort command to tell us all the unique values.
OUTPUT
#!gff-spec-version 1.21
##gff-version 3
CHESS
GENCODE
MANE
RefSeq
Exercise
Look up how the sort and uniq tools work. How would you find out how many lines of each type are in the file?
After examining the file a bit more, we realize we’re probably looking for lines that have “gene_type=protein_coding” in the ninth field.
BUT we only want the genes identified by CHESS… so we can use grep to first filter for just the CHESS lines, then look for just the protein_coding lines
It seems protein coding gene names are listed in the file with the following syntax:
OUTPUT
;gene_name=DNASE1;gene_type=protein_coding
… where the gene “DNASE1” could be replaced with any gene name.
Alternative - use regular expressions to extract the gene names
To identify such a string with any gene name inserted, we recognize that a gene name may contain uppercase letters, lowercase letters, numbers, or hyphens. We therefore turn to https://regex101.com/ to see if we can devise an appropriate regular expression to identify the string above containing any gene name
grep
the file for these lines, using the regular
expression so it only reports the exact matching sequence (-o), and save
the results to a text file
BASH
grep -o ';gene_name=[a-zA-Z0-9-]*;gene_type=protein_coding' chess3.1.3.CHM13.assembly.gff | uniq > chess3.1.3_genes.txt
sed 's/^.*gene_name=//' test | sed 's/;gene_type.*$//'
Isolate individual gene names from the matching lines
Removing the parts of the file that you don’t want can be done with sed and wildcards
(This part is still rough, ran out of time. Sorry! Will fix soon.)
BASH
grep -o ';gene_name=[a-zA-Z0-9-]*;gene_type=protein_coding' chess3.1.3.CHM13.assembly.gff | uniq > newfile.txt
sed 's/;gene_name=//' newfile.txt > newnewfile.text
sed 's/;gene_type=protein_coding//' newnewfile.text > justthegenes.txt
Next time, remember that you can also look up how a file format like GFF format is structured.
Introducing h5ls, a Unix tool to peek into AnnData objects
Let’s get an example h5ad file and put it in our home directory
BASH
$ cd
$ wget https://github.com/jlchang/cb-unix-shell-lesson-template/raw/refs/heads/main/learners/files/example.h5ad
If you have an AnnData file locally AND you have the hdf5 library installed locally (which you do if you have the AnnData or ScanPy library installed), you can use the hdf5 command line tool h5ls to quickly peek inside an AnnData file. HDF5 files have two types of objects, Datasets and Groups. Example:
OUTPUT
X Dataset {700, 765}
layers Group
obs Group
obsm Group
obsp Group
raw Group
uns Group
var Group
varm Group
varp Group
Go deeper by appending a Group name to the file:
OUTPUT
G2M_score Dataset {700}
S_score Dataset {700}
bulk_labels Group
index Dataset {700}
louvain Group
n_counts Dataset {700}
n_genes Dataset {700}
percent_mito Dataset {700}
phase Group
Too focused? Get the entire directory tree for the file with h5ls -r
Too much information? Do a more focused recursive h5ls by specifying a deeper path in the object (in this example, the deeper path is ‘/obs’):
OUTPUT
> h5ls -r example.h5ad/obs
/G2M_score Dataset {700}
/S_score Dataset {700}
/bulk_labels Group
/bulk_labels/categories Dataset {10}
/bulk_labels/codes Dataset {700}
/index Dataset {700}
/louvain Group
/louvain/categories Dataset {11}
/louvain/codes Dataset {700}
/n_counts Dataset {700}
/n_genes Dataset {700}
/percent_mito Dataset {700}
/phase Group
/phase/categories Dataset {3}
/phase/codes Dataset {700}
Notice that for AnnData objects, there are some Group objects that have nested datasets named categories and codes these are often Pandas’ categorical data. You can get the unique list of categorical labels with h5dump -d:
OUTPUT
HDF5 "example.h5ad" {
DATASET "obs/phase/categories" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): "G1", "G2M", "S"
}
ATTRIBUTE "encoding-type" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "string-array"
}
}
ATTRIBUTE "encoding-version" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "0.2.0"
}
}
}
}
Key Points
- Use common Unix tools to work with specialized file formats for biological data
- Look for less common tools that may be more efficient or more quickly accessible to increase your productivity.