bioinformatics-one-liners

Useful bioinformatics one-line commands.

Indexing files

Indexing gtf files with tabix can be tricky. (from Book Bioinformatics Data Skills)

(zgrep -v "^#" gencode.v32.annotation.gtf.gz | sort -k1,1 -k4,4n) | bgzip > gencode.v32.annotation.gtf.bgz \
&& tabix -p gff gencode.v32.annotation.gtf.bgz

Fetching annotations

get CpG Island annotation in BED format (ref, BEDOPS needed)

wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cpgIslandExt.txt.gz \
  | gunzip -c \
  | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }' \
  | sort-bed - \
  > cpgIslandExt.hg38.bed

Dealing with genomic annotations

extract genomic intervals of genes in BED format from GENCODE annotation

cat gencode.v32lift37.annotation.gtf \
  | grep -v '^#' \
  | grep '^chr' \
  | awk 'BEGIN {FS="\t"; OFS="\t"} {if ($3 == "gene") print $1,$4,$5,$9,$6,$7}' \
  | bedtools sort -i stdin \
  > gencode.v32lift37.genes.bed

annotate peaks (e.g. ChIP-seq) with closest 5 genes that are at most 3kb away

bedtools closest -a peaks.bed -b gencode.v32lift37.genes.bed -d -k 5 \
  | awk 'BEGIN {FS="\t"; OFS="\t"} {if (-3000 <= $17 && $17 <= 3000) print $1,$2,$3,$14,$16,$17} \
  > peaks.annotated.bed

E-utilities

Return a list of all Entrez database names

curl https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi

GDC API

Map GDC file uuid to case barcode

curl 'https://api.gdc.cancer.gov/files/{file_uuid}?fields=cases.submitter_id&pretty=true'

samtools (doc)

sort sam/bam files

samtools sort SRR1234567/SRR1234567.bam -o SRR1234567/SRR1234567.sorted.bam --threads 4

index sam/bam files (this creates SRR1234567.sorted.bam.bai file)

samtools index SRR1234567/SRR1234567.sorted.bam

get alignments on chromosome 20 only

samtools view SRR1234567/SRR1234567.bam chr20 -b > SRR1234567/SRR1234567_chr20.bam

count the number of reads in sam/bam files

samtools view -c SRR1234567/SRR1234567.bam

count the number of mapped reads in sam/bam files

samtools view -F 4 -c SRR1234567/SRR1234567.bam

count the number of unmapped reads in sam/bam files

samtools view -f 4 -c SRR1234567/SRR1234567.bam

show statistics of sam/bam files

samtools stats SRR1234567/SRR1234567.bam

check whether my alignment file is sorted lexicographically (list all the chromosomes)

samtools view SRR1234567/SRR1234567.bam | cut -f3 | uniq

subsample aligned reads (e.g. use -s 42.1 to use random seed 42 to subsample 0.1 (10%) of reads)

samtools view -s [SEED].[PROPORTION] SRR1234567/SRR1234567.bam

Dealing with fastq files

count the number of reads in fastq files

cat SRR1234567/SRR1234567.fastq | echo $((`wc -l` / 4))

count the number of reads in gzipped fastq files

zcat SRR1234567/SRR1234567.fastq.gz | echo $((`wc -l` / 4))

concatenate multiple fastq.gz files (just use cat!)

cat sample1.fastq.gz sample2.fastq.gz sample3.fastq.gz > concatenated.fastq.gz

fastq-dump best practice

fastq-dump --split-3 --skip-technical --gzip --readids --clip --read-filter pass <SRR->

use ascp to prefetch sra files (aspera-connect required, and use absolute path for aspera)

prefetch --ascp-path \
'/path/to/home/.aspera/connect/bin/ascp|/path/to/home/.aspera/connect/etc/asperaweb_id_dsa.openssh' \
--ascp-options -l150M SRR1234567

grep

print files in current directory which contain a pattern

grep -l <pattern> *

discard alternate contigs from bed files (i.e. drop entries at chromosome named chr~_alt or chr~_random, ...)

cat my_intervals.bed | grep -E '^chr[0-9XY]+\s' > my_intervals.filtered.bed

sed

extract desired part of each line using substitution

cat file | sed -e 's/.*\([0-9]*\).*/\1/'

join (tutorial)

join lines of two files on a common field in the first column

join file1.txt file2.txt > out.txt

comm (mnemonics: `common`)

print only lines common to both files

comm -12 <(sort file1) <(sort file2)

print only lines unique to first file

comm -23 <(sort file1) <(sort file2)

print only lines unique to second file

comm -13 <(sort file1) <(sort file2)

print only lines common to three files

comm -12 <(sort file1) <(sort file2) | comm -12 <(sort file3) -

tr (mnemonics: `translate`)

get complementary DNA sequence

echo ATGCTGTAGTC | rev | tr 'ACGT' 'TGCA'

cut

print only the second column of the tab-delimited file

cat file.txt | cut -f 2

print only the first column of the comma-delimited file

cat file.txt | cut -d',' -f 1

print the first and second columns of the comma-delimited file

cat file.txt | cut -d',' -f 1,2

convert tsv to csv

cat file.tsv | cut -f - --output-delimiter , > file.csv

print unique values in second column

cat file.tsv | cut -f2 | uniq

Linux basics

show the size of subdirectories

du -h

only show the size of current directory

du -sh

print a file except first line

cat file.txt | tail -n +2

prepend a line to a file

sed '1i header' file.txt > prepended_file.txt

iterate over files in a directory

for f in *.sra; do fastq-dump $f; done

show from the a-th line to the b-th line of a file

head file -n $b | tail -n $(($b-$a+1))

Statistics

Multiple testing correction

from statsmodels.stats import multitest
reject, pvals_corrected, alphacSidak, alphacBonf = multitest.multipletests(p_values, method='fdr_bh')

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

bioinformatics-one-liners

Indexing files

Fetching annotations

Dealing with genomic annotations

E-utilities

GDC API

samtools (doc)

Dealing with fastq files

grep

sed

join (tutorial)

comm (mnemonics: `common`)

tr (mnemonics: `translate`)

cut

Linux basics

Statistics

About

Releases

Packages

dohlee/bioinformatics-one-liners

Folders and files

Latest commit

History

Repository files navigation

bioinformatics-one-liners

Indexing files

Fetching annotations

Dealing with genomic annotations

E-utilities

GDC API

samtools (doc)

Dealing with fastq files

grep

sed

join (tutorial)

comm (mnemonics: common)

tr (mnemonics: translate)

cut

Linux basics

Statistics

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

comm (mnemonics: `common`)

tr (mnemonics: `translate`)

Packages