3 Understanding Sequencing Raw Data
3.1 Class Environment
3.1.1 Getting into AWS Instance
Another Physalia course provides a clear guide, with instructions for different operating systems, on accessing AWS. It is called "Connection to the Amazon EC2 service". Follow it to connect to the AWS instance where you will run Docker.
3.2 Shell and Unix commands
3.2.1 Common Linux Commands
3.2.1.1 Lab 1a
- check your present working directory
- check history
- pipe history to grep to search for the cd command
- put history into a history.txt file
- make a directory called data
- change into data directory
- move history.txt file into data directory
- check manual page of wget command
- redirect wget manual page output into a file called wget.txt
- return the lines that contain the word "output" in the wget.txt file
- compress wget.txt file
- view compressed file
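One possible walk-through of Lab 1a is sketched below. It is an illustration rather than the official solution: `history` only returns useful output in an interactive bash session, `man wget` is piped through `head` here so it does not open the pager, and the `-p` and `-f` flags are additions that make the sketch safe to re-run.

```shell
# Lab 1a sketch: run these one at a time in an interactive bash session.
pwd                            # check your present working directory
history                        # check history
history | grep "cd"            # search the history for the cd command
history > history.txt          # put history into history.txt
mkdir -p data                  # make the data directory (-p: no error if it exists)
cd data                        # change into the data directory
mv ../history.txt .            # move history.txt into the data directory
man wget | head -n 5           # peek at the wget manual (plain `man wget` opens a pager, q quits)
man wget > wget.txt            # redirect the manual page into wget.txt
grep "output" wget.txt         # return the lines that contain "output"
gzip -f wget.txt               # compress to wget.txt.gz (-f: overwrite an older copy)
gzip -dc wget.txt.gz | head    # view the compressed file without unpacking it (like zcat)
cd ..                          # return to the starting directory
```

After running it, `data/` should hold `history.txt` and `wget.txt.gz`.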
3.2.1.2 Git Commands
Git is a distributed version-control system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files. Its goals include speed, data integrity, and support for distributed, non-linear workflows.
Go to your user directory and clone the course repository with git. This creates a directory containing all the course material inside your user directory. Once cloning is done, change into the 2020_scWorkshop directory, where the course material is.
3.3 File formats
- bcl
- fastq
- bam
- mtx, tsv
- hdf5 (.h5, .h5ad)
3.3.1 View FASTQ Files
3.3.1.1 Viewing entire file
3.3.1.2 Viewing first 10 lines
3.3.1.3 Stream Viewing with less command
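The three ways of viewing a FASTQ file named in the headings above can be tried on a toy file. The sketch below writes a small `example.fastq` with made-up reads (the file name and its contents are assumptions for illustration, not course data) and then views it.

```shell
# Write a tiny three-read FASTQ file so the commands have something to show.
# Each record is four lines: @read id, bases, "+" separator, per-base qualities.
cat > example.fastq <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTGACCGATA
+
IIIIHHHHGG
@read3
GGCATTAGCC
+
FFFFFFFFFF
EOF

cat example.fastq          # view the entire file (only sensible for small files)
head example.fastq         # view the first 10 lines (the default for head)
head -n 4 example.fastq    # view exactly one four-line record
```

For stream viewing, `less -S example.fastq` pages through the file one screen at a time without wrapping long lines; press q to quit. A gzipped FASTQ can be streamed the same way with `gzip -dc example.fastq.gz | less -S`.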
3.3.2 View BAM Files
3.3.2.1 Viewing first 10 lines
3.3.2.2 Stream Viewing with less command
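BAM files are compressed binary, so they need to be decoded to text before `head` or `less` can show anything useful. The sketch below assumes `samtools` is installed (the text here does not name the tool used in class); the helper names `bam_head` and `bam_header` are hypothetical and exist only for this illustration.

```shell
# bam_head FILE: print the first 10 alignment records of a BAM file.
# samtools view decodes the binary BAM into SAM text on standard output.
bam_head() {
  samtools view "$1" | head -n 10
}

# bam_header FILE: print only the header lines (@HD, @SQ, ...).
bam_header() {
  samtools view -H "$1"
}
```

With a BAM file at hand, `bam_head sample.bam` shows the first ten records (`sample.bam` is a placeholder name). For stream viewing, `samtools view -h sample.bam | less -S` pages through the header and the records together; press q to quit.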
3.4 Public data repositories
3.4.1 Cellranger/10x
3.4.1.1 Lab 1b
10x PBMC data are hosted at https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
- get 10x PBMC data
- unpack data (it is a gzipped tar archive)
- explore directory
- explore files
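A possible approach to Lab 1b is sketched below, assuming `wget` and `tar` are available and the instance has network access. The `--tries` and `--timeout` values are arbitrary choices so a failed download gives up quickly, and the top-level directory name is read from the archive rather than assumed.

```shell
# URL from the lab text; the archive file name is derived from it.
url="https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz"
archive=$(basename "$url")

# get 10x PBMC data (stop after 2 tries, 20 s without a response)
wget --tries=2 --timeout=20 "$url"

# unpack data: -t lists the contents, -x extracts, -z handles gzip, -f names the file
tar -tzf "$archive"
tar -xzf "$archive"

# explore directory: take the top-level directory name from the archive listing
top=$(tar -tzf "$archive" | head -n 1 | cut -d/ -f1)
ls -lhR "$top"

# explore files: first 5 lines of every extracted file, each with a file-name banner
find "$top" -type f -exec head -n 5 {} +
```

The extracted files include the `mtx` and `tsv` formats listed under File formats, so `head` is enough to read them as plain text.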
3.4.2 GEO
3.4.2.1 Lab 1c
Get GEO Data:
- ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE81nnn/GSE81905/matrix/GSE81905-GPL19057_series_matrix.txt.gz
- ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE81nnn/GSE81905/matrix/GSE81905-GPL17021_series_matrix.txt.gz
- go into that directory
- get files and place them in the directory
- view files (try keeping them in compressed format and viewing them that way)
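Lab 1c could be approached as in the sketch below. The directory name `geo_data` is an assumption, since the lab does not specify one, and the download step needs network access with FTP allowed. The files are viewed with `gzip -dc`, which decompresses to the terminal and leaves the `.gz` files untouched.

```shell
geo_dir="geo_data"   # hypothetical directory name, not given in the lab
base="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE81nnn/GSE81905/matrix"

mkdir -p "$geo_dir"
cd "$geo_dir"        # go into that directory

# get files and place them in the directory
for f in GSE81905-GPL19057_series_matrix.txt.gz GSE81905-GPL17021_series_matrix.txt.gz; do
  wget --tries=2 --timeout=20 "$base/$f"
done

# view files while keeping them compressed
for f in *_series_matrix.txt.gz; do
  [ -f "$f" ] || continue        # skip if nothing was downloaded
  echo "== $f"
  gzip -dc "$f" | head -n 20     # decompress to stdout only
done

cd ..                # return to the starting directory
```

For interactive browsing, `zless` followed by the file name pages through a gzipped file directly; press q to quit.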
3.4.3 Single Cell Portal
- https://portals.broadinstitute.org/single_cell
- Study: Salk Institute - Single-cell Methylome Sequencing Identifies Distinct Neuronal Populations in Mouse Frontal Cortex
3.4.3.1 Lab 1d
- Get R2 fastq file from the Salk Institute study
- Look at files
3.4.3.2 Lab 1e
- Install Docker on your local computer
- Explore Single Cell Portal
- Explore GEO
3.5 Docker Commands
Docker provides a consistent compute environment, ensuring that all the software you need is installed on the machine and ready to use. It gives you the versions you need and helps reduce software conflicts that may arise.
- make sure you are in the cloned repository directory
- run the following command to start the docker script
The full command inside the script is below, with an explanation of each part for your reference.
## if your user can run docker directly (for example, you are the super user on your computer)
docker run --rm -it -v "$PWD":/home/rstudio/materials kdgosik/2020scworkshop bash
## if you need elevated permissions to run the command
sudo docker run --rm -it -v "$PWD":/home/rstudio/materials kdgosik/2020scworkshop bash
Explanation of commands
- docker: command to run docker
- run: asking docker to run a container
- --rm: flag to remove the container when you exit from it
  - nothing stored inside the container will be saved from your session to access again later (files written to the mounted materials directory remain on your computer)
  - this flag can be removed to keep the container
- -it: flag to run the container interactively
  - this will keep all session output displaying on the terminal
  - to stop the container, type exit (or press Ctrl+D) at the bash prompt; Ctrl+C only interrupts the command currently running
- -v $PWD:/home/rstudio/materials: mount your current working directory ($PWD) at /home/rstudio/materials inside the docker container
- kdgosik/2020scworkshop: the image to run. Docker downloads it from Docker Hub if it is not already on your computer, then starts a container from it
  - [image link](https://hub.docker.com/r/kdgosik/2020scworkshop)
- bash: the command run when the container starts, so you begin on the bash command line