Skip to content

Latest commit

 

History

History
executable file
·
89 lines (47 loc) · 3.76 KB

Notebook.md

File metadata and controls

executable file
·
89 lines (47 loc) · 3.76 KB

PROJECT NOTEBOOK

TO DO LIST

  • Decide method for negative control

  • Generate negative control dataset

  • Run all features for all 1000 sequences

  • Manuscript

  • Review copy-number justification as feature

Estefi

  • Negative control analysis
  • Incorporate epigenetics marks analysis to the workflow
  • intrinsic feature: di-nucleotide analysis

Dani

  • Review correlated features
  • Finish readme files and project setup
  • Conda env file
  • Add filter for protein-coding for high-express genes used for RNA:RNA interactions
  • Intrinsic feature: sequence low complexity analysis --> Good preliminary results --> Incude it as a intrinsic feature
  • change CPC2 for tcode --> tcode uses a 200 nt window --> removed
  • add "distance" label to short-ncrna-negative-control

IDEAS

Dani

  • snakemake script for analysis
  • Extract gnomad data directly from a server ? (902G)

PROGRESS

  • 2024/04/05 - Dani:

                  Divided sequence-processing.sh script into 3 (one for each type of RNA) so they can be run in parallel. 
    
  • 2024/04/05 - Dani:

                  Short-ncRNA dataset had many repeated sequences due to duplicated, but different ID, sequences in RNAcentral dataset --> Removed them from RNAcentral dataset and extracted new sequences 
    
  • 2024/04/09 - Dani:

                  1000 functional sequences with negative controls ready to analyse  --> new order of the dataset: by chr, start, end   
    
  • 2024/04/10 - Dani:

                  Remove Fickett score (CPC2): we're already calculating the coding potential with RNAcode --> DJ work showed better results with RNAcode 
                  
                  Remove Accessibility (RNApfold): feature showed perfect spearman correlation with MFE (rho=-1) --> doesn't meet feature criteria: redundant
                  
                  Remove phasCons score: redundant with phyloP
                  
                  Add low complexity sequence density feature to intrinsic-features.sh script 
    
  • 2024/04/12 - Dani:

                  update population script to remove the vcf files for each sequences after using it, since it's the process of extracting them from the local database is quite quick and it's not worth to store them in case of needed again (already storing the database)
    
  • 2024/04/19 - Dani:

                  update the way the gene_complement was being calculated --> now Gencode includes not only genes. 
    
                                                                              Gencode and RNAcentral are being merged before finding the complement (instead of finding complement of both separately and join afterwards)
                  change the steps used to sample the negative control: 
                      
                      The distance between "negative control exons" now is set at 100 (arbitrary) instead of the distance between the "functional exons" --> This was done because many exons distances were quite big respect the size of the "complement regions" --> many of this regions were not used reducing considerably the number of negative samples 
    
  • 2024/05/27 - Dani:

      			change how to extract the sequences from the genome: before --> grep on .csv; now bedtools getfasta --> much faster. Adapted scripts. Now the strand is considered as well: takes inverse complementary for negative strand. 
      			
      			The genome sequence now is downloaded from UCSC --> fasta file header chr1 which matches to the chr column we were using --> requirement for bedtools getfasta 
    
  • 2024/05/28 - Dani:

      			add filter to exclude ncRNA which are included in the curated database use to calculate the interaction RNA:RNA (lncRNA and shor-ncRNA sequences script)