Useful bioinformatics one-line commands.
Indexing gtf files with tabix can be tricky. (from Book Bioinformatics Data Skills)
(zgrep -v "^#" gencode.v32.annotation.gtf.gz | sort -k1,1 -k4,4n) | bgzip > gencode.v32.annotation.gtf.bgz \
&& tabix -p gff gencode.v32.annotation.gtf.bgz
get CpG Island annotation in BED format (ref, BEDOPS needed)
wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cpgIslandExt.txt.gz \
| gunzip -c \
| awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, substr($0, index($0, $7)); }' \
| sort-bed - \
> cpgIslandExt.hg38.bed
extract genomic intervals of genes in BED format from GENCODE annotation
cat gencode.v32lift37.annotation.gtf \
| grep -v '^#' \
| grep '^chr' \
| awk 'BEGIN {FS="\t"; OFS="\t"} {if ($3 == "gene") print $1,$4,$5,$9,$6,$7}' \
| bedtools sort -i stdin \
> gencode.v32lift37.genes.bed
annotate peaks (e.g. ChIP-seq) with closest 5 genes that are at most 3kb away
bedtools closest -a peaks.bed -b gencode.v32lift37.genes.bed -d -k 5 \
| awk 'BEGIN {FS="\t"; OFS="\t"} {if (-3000 <= $17 && $17 <= 3000) print $1,$2,$3,$14,$16,$17} \
> peaks.annotated.bed
Return a list of all Entrez database names
curl https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi
Map GDC file uuid to case barcode
curl 'https://api.gdc.cancer.gov/files/{file_uuid}?fields=cases.submitter_id&pretty=true'
samtools (doc)
sort sam/bam files
samtools sort SRR1234567/SRR1234567.bam -o SRR1234567/SRR1234567.sorted.bam --threads 4
index sam/bam files (this creates SRR1234567.sorted.bam.bai file)
samtools index SRR1234567/SRR1234567.sorted.bam
get alignments on chromosome 20 only
samtools view SRR1234567/SRR1234567.bam chr20 -b > SRR1234567/SRR1234567_chr20.bam
count the number of reads in sam/bam files
samtools view -c SRR1234567/SRR1234567.bam
count the number of mapped reads in sam/bam files
samtools view -F 4 -c SRR1234567/SRR1234567.bam
count the number of unmapped reads in sam/bam files
samtools view -f 4 -c SRR1234567/SRR1234567.bam
show statistics of sam/bam files
samtools stats SRR1234567/SRR1234567.bam
check whether my alignment file is sorted lexicographically (list all the chromosomes)
samtools view SRR1234567/SRR1234567.bam | cut -f3 | uniq
subsample aligned reads (e.g. use -s 42.1
to use random seed 42 to subsample 0.1 (10%) of reads)
samtools view -s [SEED].[PROPORTION] SRR1234567/SRR1234567.bam
count the number of reads in fastq files
cat SRR1234567/SRR1234567.fastq | echo $((`wc -l` / 4))
count the number of reads in gzipped fastq files
zcat SRR1234567/SRR1234567.fastq.gz | echo $((`wc -l` / 4))
concatenate multiple fastq.gz files (just use cat!)
cat sample1.fastq.gz sample2.fastq.gz sample3.fastq.gz > concatenated.fastq.gz
fastq-dump best practice
fastq-dump --split-3 --skip-technical --gzip --readids --clip --read-filter pass <SRR->
use ascp to prefetch sra files (aspera-connect required, and use absolute path for aspera)
prefetch --ascp-path \
'/path/to/home/.aspera/connect/bin/ascp|/path/to/home/.aspera/connect/etc/asperaweb_id_dsa.openssh' \
--ascp-options -l150M SRR1234567
print files in current directory which contain a pattern
grep -l <pattern> *
discard alternate contigs from bed files (i.e. drop entries at chromosome named chr~_alt or chr~_random, ...)
cat my_intervals.bed | grep -E '^chr[0-9XY]+\s' > my_intervals.filtered.bed
extract desired part of each line using substitution
cat file | sed -e 's/.*\([0-9]*\).*/\1/'
join (tutorial)
join lines of two files on a common field in the first column
join file1.txt file2.txt > out.txt
print only lines common to both files
comm -12 <(sort file1) <(sort file2)
print only lines unique to first file
comm -23 <(sort file1) <(sort file2)
print only lines unique to second file
comm -13 <(sort file1) <(sort file2)
print only lines common to three files
comm -12 <(sort file1) <(sort file2) | comm -12 <(sort file3) -
get complementary DNA sequence
echo ATGCTGTAGTC | rev | tr 'ACGT' 'TGCA'
print only the second column of the tab-delimited file
cat file.txt | cut -f 2
print only the first column of the comma-delimited file
cat file.txt | cut -d',' -f 1
print the first and second columns of the comma-delimited file
cat file.txt | cut -d',' -f 1,2
convert tsv to csv
cat file.tsv | cut -f - --output-delimiter , > file.csv
print unique values in second column
cat file.tsv | cut -f2 | uniq
show the size of subdirectories
du -h
only show the size of current directory
du -sh
print a file except first line
cat file.txt | tail -n +2
prepend a line to a file
sed '1i header' file.txt > prepended_file.txt
iterate over files in a directory
for f in *.sra; do fastq-dump $f; done
show from the a-th line to the b-th line of a file
head file -n $b | tail -n $(($b-$a+1))
Multiple testing correction
from statsmodels.stats import multitest
reject, pvals_corrected, alphacSidak, alphacBonf = multitest.multipletests(p_values, method='fdr_bh')