A reproducible, end-to-end bioinformatics resource for the 2025 dFORCE manuscript available on bioRxiv.
The code here is not production software; it serves as an archive of the pre-processing and analysis performed for dFORCE.
Paths in the Nextflow pipeline and bash scripts will need to be updated for your specific environment.
rna-biogenesis-maps_publication/
├── ROGUE1_minimal/ # utility for classifying splicing state and handling pre-mRNA sequencing data
│   └── R1.py
│
├── dFORCE_nextflow/ # Nextflow pipeline to process raw dFORCE POD5 data using Dorado, minimap2 and ROGUE1
│   ├── basecall.nf
│   ├── align_genome.nf # align dFORCE reads to a genome using minimap2
│   ├── run_ROGUE1*.nf # run R1 in first-pass or second-pass mode
│   ├── main_workflow.nf
│   └── nextflow.config
│
├── downstream_analysis/ # commands to filter annotations and run Nextflow pipelines (in .sh scripts) and run figure-specific analysis (in .ipynb)
│   ├── 0_dFORCE_preprocess_reads_*.sh
│   ├── dFORCE_fig{1..6}_*.ipynb
│   ├── dFORCE_fig3_m6A_preprocessing.sh
│   ├── dFORCE_fig3_m6A_isoform_plots.R
│   ├── dFORCE_fig6_m6A_preprocessing.sh
│   └── … (extra notebooks & helpers)
│
└── igv/ # IGV to SVG utility for making publication-grade genome browser plots
    └── IGV_to_svg.bat
- The dFORCE Nextflow pipeline basecalls, aligns and summarises the data in multiple 'R1' read summary files, which are used for subsequent analysis.
- The workflow for these steps is included in `downstream_analysis/0_dFORCE_preprocess_reads_*.sh` for the relevant species/sample.
- R1 first runs in first-pass mode, which classifies RNA processing relative to a filtered GTF annotation.
- dFORCE uses these data to curate a 'second-pass' isoform index, which enables R1 to be run in second-pass mode.
- Most dFORCE analysis is based on these second-pass outputs, except the initial biotype classification and PCA analysis.
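
As a rough illustration of how a per-gene isoform index could be curated from first-pass read assignments, here is a minimal Python sketch. It is not the repository's implementation. The input shape (one `(gene_id, transcript_id)` pair per read), the function name `select_best_isoforms`, and the "most reads wins" criterion are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def select_best_isoforms(assignments):
    """Pick the most-supported transcript for each gene.

    `assignments` is an iterable of (gene_id, transcript_id) pairs, one per
    read. This input shape is a hypothetical simplification, not the actual
    R1 output format.
    """
    support = defaultdict(Counter)
    for gene_id, transcript_id in assignments:
        support[gene_id][transcript_id] += 1

    best = {}
    for gene_id, counts in support.items():
        # Highest read count wins (assumed criterion); ties are broken by
        # transcript ID so the result is deterministic.
        best[gene_id] = min(counts, key=lambda t: (-counts[t], t))
    return best


# Toy data: identifiers are invented for illustration only.
reads = [
    ("geneA", "txA1"),
    ("geneA", "txA2"),
    ("geneA", "txA2"),
    ("geneB", "txB1"),
]
print(select_best_isoforms(reads))
```

A real selection step would probably also weigh alignment quality or read coverage; the count-based rule here only shows the shape of the idea.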
The `0_dFORCE_preprocess_reads_*.sh` scripts implement a two-pass indexing workflow:
| Stage | Purpose | Key outputs |
|---|---|---|
| (a) GTF filtering | Remove non-Ensembl/HAVANA transcripts; retain mirBase genes | annotation/filtered_annotation.gtf |
| (b) First-pass Nextflow | GPU base-call & align every replicate to genome; run ROGUE-1 once | first_pass/<sample>/*.bam, *_first_pass.txt |
| (c) Merge & re-score | Merge total-RNA BAMs; select best isoform per gene with filter_gtf_using_totalRNA_alignments.py | second_pass_index/filter_gtf/new_anno.gtf |
| (d) Poly(A) clustering | Cluster pre-mRNA-A sites (cluster_polyA_sites.py) and adjust 3′ UTRs | BED + GTF cluster files in annotation_with_clusters/ |
| (e) Second-pass Nextflow | Re-run each sample with --R1_index <annotation_with_clusters> to refine read classification | second_pass/*second_pass.txt, updated BAMs |
| (f) Optional “unfiltered” run | Repeat first-pass using the vanilla Ensembl 113 GTF for biotype plots | ens113_unfiltered_first_pass/ |
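
Stage (a) can be pictured as a filter on the GTF source column (column 2 of the tab-separated format). The following is a minimal Python sketch under stated assumptions, not the filtering code used in the preprocessing scripts. The function name `filter_gtf_lines` is hypothetical, as is the exact set of source labels treated as "Ensembl/HAVANA" and "mirBase".

```python
# Assumed source labels; the real scripts may use a different set or criterion.
ALLOWED_SOURCES = {"ensembl", "havana", "ensembl_havana", "mirbase"}


def filter_gtf_lines(lines, allowed_sources=ALLOWED_SOURCES):
    """Yield header lines and records whose source column is allowed."""
    for line in lines:
        if line.startswith("#"):
            # Keep comment/header lines untouched.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            # A GTF record has nine tab-separated columns; skip anything else.
            continue
        if fields[1].lower() in allowed_sources:
            yield line


# Toy records: coordinates and gene IDs are invented for illustration only.
example = [
    "#!genome-build GRCh38\n",
    '1\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n',
    '1\tinsdc\tgene\t300\t400\t.\t+\t.\tgene_id "G2";\n',
    '1\tmirbase\tgene\t500\t600\t.\t-\t.\tgene_id "G3";\n',
]
kept = list(filter_gtf_lines(example))
print("".join(kept), end="")
```

Because the function is a generator over lines, it can be pointed at an open file handle and streamed to an output file without loading the whole annotation into memory.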
Most figure notebooks read the TSV/BED/BAM artefacts produced by the human or mouse preprocessing scripts.
- Absolute HPC paths are left in the example scripts so reviewers can replay our December 2024 production runs. Change them to your own scratch/project space as needed; the first ~20 lines of every shell script contain all path variables.
All original code is released under the MIT Licence; third-party tools retain their own licences. Cite this repository as:
Sethi A.J. et al. (2025) dFORCE https://github.com/compRNA/dFORCE