AN ATLAS OF THE PRIMATE-SPECIFIC RNA LAYER

AutoAlu · PRE-PUB · UNDER DEV

A working atlas of human Alu-derived long non-coding RNAs, built from the primate-specific layer of the transcriptome that standard RNA-Seq pipelines silently discard. Short-read DESeq2 at three resolutions, long-read confirmation, locus-level review at scale — under active development from Nordlandssykehuset, Bodø.

FIG 00 · Alu colonisation of the primate genome
~1.14 M elements · RepeatMasker · hg38
LINEAGE · ABUNDANCE · AGE
AluJ · ~280k · 25% · silent ~30 Mya: AluJb 155k, AluJo 90k, AluJr 35k
AluS · ~660k · 58% · peak ~25 Mya: AluSx 315k (largest subfamily), AluSx1 130k, AluSp 75k, AluSq 68k, AluSc 50k, AluSg 45k
AluY · ~145k · 13% · active, still inserting in living human genomes: AluY 90k, AluYa5 12k
FIG 00. Three waves. AluJ colonised the primate lineage around 65 Mya and went largely silent by ~30 Mya. AluS exploded between ~35–20 Mya, leaving AluSx the single most abundant Alu subfamily in hg38. AluY is the youngest and still inserting — many AluY elements are polymorphic among individuals today. Counts are RepeatMasker / hg38, rounded.
[Figure: genomic ribbon of Alu density. Each tick is one annotated Alu element (RepeatMasker). Schematic: density is typical of real data, positions are stylised, scale is arbitrary.]
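The percentages quoted in FIG 00 follow from the rounded family totals. A minimal sketch of that arithmetic, using only the numbers given in the figure (the function name `family_shares` is illustrative, not part of the project code):

```python
# Family totals as quoted in FIG 00 (RepeatMasker / hg38, rounded).
TOTAL_ALU = 1_140_000
FAMILY_COUNTS = {"AluJ": 280_000, "AluS": 660_000, "AluY": 145_000}


def family_shares(counts, total):
    """Return each family's share of the total as a whole-number percentage."""
    return {name: round(100 * n / total) for name, n in counts.items()}


shares = family_shares(FAMILY_COUNTS, TOTAL_ALU)

# Elements in the ~1.14 M inventory that the figure does not attribute
# to one of the three families.
unassigned = TOTAL_ALU - sum(FAMILY_COUNTS.values())

print(shares)
print(unassigned)
```

The three families sum to about 1.085 M, so roughly 55k of the ~1.14 M inventory is not attributed to AluJ, AluS or AluY in this figure.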
§ 01

The project

scope · motivation · roadmap

Alu elements are the most abundant short interspersed nuclear element in the human genome and they are not silent. A growing literature — STAIR, JARID2-NBDY, SINEUPs, IRAlus, NEAT1 — shows that Alu insertions can become functional cis- or trans-acting elements of host transcripts and of stand-alone long non-coding RNAs. Standard RNA-Seq pipelines, built before this view was dominant, routinely discard multi-mapping repeat reads — exactly the reads that would tell you whether a particular Alu is being transcribed.

AutoAlu is our attempt to map that layer back in, with enough numerical care that the resulting calls survive inspection at the per-locus level. The atlas is built by reconciling independent counts at three resolutions — host gene, intron, single Alu — and validating short-read calls with long-read Nanopore reads where available.

From inventory to confirmed call § 01.1

FIG 02 · The candidate-narrowing funnel
schematic widths · proportions stylised
01 INVENTORY · ~1.14M annotated Alus · RepeatMasker · hg38 · every element as a candidate · genome-fact
02 DETECTED · subset expressed in cohort · short-read DESeq2, 3 layers reconciled · cohort-specific
03 TOPOLOGY · STAIR-like candidate lncRNAs · topology rules
04 CONF · long-read spanning
Transitions: filter (expression) → filter (topology) → confirm (ONT)
FIG 02. The atlas is built as a narrowing funnel — every annotated Alu enters as a candidate, and three filters carry the survivors forward. Expression (DESeq2 across three feature layers) prunes silent loci. Topology rules drawn from the published Alu-lncRNA literature flag STAIR-like candidates. Long-read confirmation (Oxford Nanopore direct-RNA) promotes a candidate to confirmed. Specific exit counts at columns 02 → 04 will land here as the analyses stabilise — proportions in this figure are schematic.
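The funnel can be sketched as three filters applied in sequence. This is an illustration under stated assumptions, not the project's code: the `Candidate` fields are hypothetical boolean flags standing in for the real expression, topology and long-read tests, and the toy records are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    alu_id: str
    expressed: bool     # hypothetical flag: passed the expression filter
    stair_like: bool    # hypothetical flag: passed the topology rules
    ont_spanning: bool  # hypothetical flag: has a spanning long read


def funnel(candidates):
    """Apply the three filters in order; return the survivors at each stage."""
    inventory = list(candidates)
    detected = [c for c in inventory if c.expressed]
    topology = [c for c in detected if c.stair_like]
    confirmed = [c for c in topology if c.ont_spanning]
    return {
        "inventory": inventory,
        "detected": detected,
        "topology": topology,
        "confirmed": confirmed,
    }


# Invented toy records, for illustration only.
toy = [
    Candidate("alu_a", expressed=False, stair_like=False, ont_spanning=False),
    Candidate("alu_b", expressed=True, stair_like=False, ont_spanning=True),
    Candidate("alu_c", expressed=True, stair_like=True, ont_spanning=False),
    Candidate("alu_d", expressed=True, stair_like=True, ont_spanning=True),
]
stages = funnel(toy)
stage_sizes = {name: len(members) for name, members in stages.items()}
print(stage_sizes)
```

Because each stage filters only the survivors of the previous one, a locus with long-read support but no STAIR-like topology (`alu_b` here) never reaches the confirmed set, which matches the confirmation-not-discovery role of the long reads.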

Three layers, one locus § 01.2

FIG 01 · Counting an Alu at three resolutions
chr?:?,???,???–?,???,??? · ~5 kb window
01 GENE layer · exon, intron, exon
02 INTRON layer · Alu peak ⇢ intronic Alu
03 ALU layer · AluSx, ~300 nt
x-axis: position within the window, 0 to 5k
FIG 01. The same locus counted at three resolutions. Layer 01 — host gene, exon-intron-exon structure with broad coverage. Layer 02 — the intron, where a peak that moves out of step with the surrounding exons exposes a candidate intronic Alu. Layer 03 — the Alu itself, ~300 nt, treated as its own feature with reads piled at sub-element resolution. Reconciling the three is most of what makes the calls usable downstream.
Status. This is v0 of the public site, stood up 2026-05-19 alongside the DNS record for autoalu.deepsek.no. Content fills in as the underlying analyses stabilise and the manuscript progresses. Specific locus counts are deliberately sparse here — the working repo changes daily and a quoted figure today will not match tomorrow's.
§ 02

Atlas

layers · scope · browse

The full Alu inventory in hg38 is on the order of 1.14 million elements. Only a minority are transcribed in any given cellular context, and a smaller minority still show topology consistent with stand-alone lncRNA function. The pipeline tracks every Alu as a candidate and lets the combination of short-read coverage, intronic context, and long-read confirmation prune the list down.

Distribution across the genome § 02.1

FIG 03 · Alu counts per chromosome · hg38
RepeatMasker · 24 sequences · ~1.14 M total
[Bar chart. x-axis: CHROMOSOME · hg38 (1–22, X, Y). y-axis: ALU COUNT (0 to 120k). Dashed line: μ ≈ 47k. Annotation: chr19 ↑ highest density / Mb.]
FIG 03. Absolute Alu counts per chromosome, sorted by chromosome number. chr1 and chr2 carry the most simply because they are the largest sequences; chr19 (highlighted) is the anomaly — physically small but gene- and GC-rich, with the highest density per Mb of any chromosome. The dashed line is the per-chromosome mean (~47k). Counts are RepeatMasker / hg38, Alu family only, rounded to the nearest 1k.
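The difference between absolute count and density per Mb is a single division. The sketch below uses invented placeholder sequences (`large_seq`, `small_seq`) rather than real per-chromosome values, which are not quoted on this page; only the ~1.14 M total and the 24 sequences come from the figure.

```python
def alu_density_per_mb(alu_count: int, length_bp: int) -> float:
    """Alu elements per megabase of sequence."""
    return alu_count / (length_bp / 1_000_000)


# Placeholder values for illustration only, NOT real hg38 figures.
toy = {
    "large_seq": {"alu_count": 100_000, "length_bp": 200_000_000},
    "small_seq": {"alu_count": 60_000, "length_bp": 50_000_000},
}

density = {name: alu_density_per_mb(**v) for name, v in toy.items()}
top_by_count = max(toy, key=lambda name: toy[name]["alu_count"])
top_by_density = max(density, key=density.get)

# Per-sequence mean implied by the totals quoted in FIG 03.
mean_per_sequence = 1_140_000 / 24

print(density, top_by_count, top_by_density, mean_per_sequence)
```

A large sequence can lead on count while a small one leads on density, which is the pattern the caption describes for chr1 and chr2 versus chr19.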

Why three layers § 02.2

A naive single-feature count loses information in both directions: a busy host gene drowns out a quietly transcribed Alu, and a single Alu count can't tell whether the signal is element-specific or part of an intronic envelope. Reconciling the three layers is most of what makes the calls usable downstream.
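One way to picture the reconciliation is as a small decision rule over the three per-layer differential-expression results. The rule below is a hypothetical sketch, not the project's ruleset: the labels, the `alpha` cutoff and the same-direction test are all assumptions made for illustration.

```python
from typing import NamedTuple


class LayerResult(NamedTuple):
    log2fc: float  # log2 fold change reported for this layer
    padj: float    # adjusted p-value reported for this layer


def _significant(result: LayerResult, alpha: float) -> bool:
    # A NaN padj compares as False, so such rows count as not significant.
    return result.padj < alpha


def _same_direction(a: LayerResult, b: LayerResult) -> bool:
    return a.log2fc * b.log2fc > 0


def reconcile(gene: LayerResult, intron: LayerResult, alu: LayerResult,
              alpha: float = 0.05) -> str:
    """Label one locus from its three per-layer results (hypothetical rule)."""
    if not _significant(alu, alpha):
        return "no_alu_signal"
    if _significant(gene, alpha) and _same_direction(gene, alu):
        return "host_driven"        # the Alu moves with a busy host gene
    if _significant(intron, alpha) and _same_direction(intron, alu):
        return "intronic_envelope"  # the Alu moves with its intron
    return "element_specific"       # the Alu moves on its own


quiet = LayerResult(0.1, 0.8)
up = LayerResult(2.0, 0.001)
print(reconcile(gene=quiet, intron=quiet, alu=up))
```

The point of keeping three tables is visible in the branches: the same significant Alu-layer result is read differently depending on what the gene and intron layers show.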

Browse-ready content. Locus-level pages are not yet on this site. The current working database, IGV-atlas builds, and per-locus screenshots live in the private working repo and will appear here as a curated slice once the calling has stabilised.
§ 03

The driver system

complement · inflammation · innate immunity

The driving biology for the first wave of analyses is the complement system, specifically the C5a fragment, one of the strongest known short-acting drivers of innate inflammation. Paired control and complement-inhibited samples give us a defined perturbation against which to ask which Alus respond.

Worked examples § 03.1

The headline candidate locus from these comparisons sits in the IL6 neighbourhood, where an Alu-derived intronic element has the right topology to act as a cis-regulatory lncRNA on the host transcript. A second worked thread runs through JARID2 and its lncRNA neighbourhood. Both will land here as their own pages once the corresponding manuscripts move.

Why this system § 03.2

  • Clean perturbation. Pharmacological complement inhibition gives a well-defined contrast — no transfection artefacts, no chronic-disease confounders.
  • Independent priors. The complement / IL6 axis is well-characterised at the protein and cytokine level, so Alu-layer findings can be tested against existing literature rather than built from scratch.
  • Clinical proximity. The host PI runs a clinical lab; findings that hold up here can move toward bedside-relevant questions without leaving the same dataset.

Open mechanistic threads § 03.3

A handful of additional loci show the right topology and are being developed in parallel. Pages will land as the analyses harden.

§ 04

Methods

pipeline · tools · reconciliation

Short-read RNA-Seq is processed with a repeat-aware mapping configuration so multi-mappers are retained and resolved against the RepeatMasker Alu inventory rather than discarded. Counts are produced at three feature layers and analysed with DESeq2. Long-read (ONT direct-RNA) data is mapped and used for locus-level confirmation, not for primary counting.
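This page does not say how retained multi-mappers are resolved. One common strategy, shown here purely as an assumption, is fractional assignment: a read that overlaps n distinct features contributes 1/n to each. The function name and the input shape (pairs of read ID and feature ID) are hypothetical.

```python
from collections import defaultdict


def fractional_counts(alignments):
    """Split each read evenly across the distinct features it aligns to.

    alignments: iterable of (read_id, feature_id) pairs, one pair per
    reported alignment that overlaps an annotated feature.
    """
    features_by_read = defaultdict(set)
    for read_id, feature_id in alignments:
        features_by_read[read_id].add(feature_id)

    counts = defaultdict(float)
    for features in features_by_read.values():
        weight = 1.0 / len(features)
        for feature_id in features:
            counts[feature_id] += weight
    return dict(counts)


# Invented toy alignments: r1 maps uniquely, r2 maps to two Alus.
toy_alignments = [
    ("r1", "alu_a"),
    ("r2", "alu_a"),
    ("r2", "alu_b"),
]
counts = fractional_counts(toy_alignments)
print(counts)
```

DESeq2 expects integer counts, so fractional totals like these would need rounding, or a different resolution scheme, before the differential-expression step.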

Pipeline stages § 04.1

FIG 04 · From FASTQ to topology call
6 stages · 1 sibling confirmation track
SHORT-READ PIPELINE
01 INPUT · FASTQ · paired short-read + ONT direct-RNA
02 QC · fastp · adapter trim, quality filter
03 ALIGN · STAR · repeat-aware, multi-mapper
04 COUNT · HOMER · gene ∥ intron ∥ Alu · 3 independent tables
05 DE · DESeq2 · per layer, reconcile
06 CALL · STAIR rules · topology call, candidate set
CONFIRMATION · ONT direct-RNA · spanning long-read evidence promotes candidate → confirmed
FIG 04. Six main stages from raw FASTQ to topology call. Stage 04 (HOMER) produces three independent count tables — one per feature layer — which are reconciled rather than merged at the counting step. The dashed teal track is the Nanopore direct-RNA confirmation sibling: long reads feed the terminal topology call as a confirmation channel, not as a discovery channel.

Reference set § 04.2

Genome: hg38 (GRCh38, primary assembly)
Repeat set: RepeatMasker · Alu family only
Gene model: GENCODE (current release)
Short-read: Public RNA-Seq + paired complement-inhibition cohort
Long-read: Oxford Nanopore direct-RNA (in-lab)

Topology calls § 04.3

A locus is flagged STAIR-topology when its per-Alu coverage, surrounding intron coverage, and host-gene coverage jointly satisfy a ruleset drawn from the published Alu-lncRNA literature. The exact ruleset is being tightened locus by locus through visual IGV review of an initial sample of the call list; the audit log lives in the working repo and informs the next round of automated re-classification.
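The ruleset itself is not published here. As a sketch of what a joint test over the three coverages might look like, the example below uses hypothetical thresholds and field names throughout; none of the numbers are the project's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocusCoverage:
    alu: float     # mean coverage over the Alu element
    intron: float  # mean coverage over the surrounding intron
    gene: float    # mean coverage over the host gene


@dataclass(frozen=True)
class Ruleset:
    # All thresholds are hypothetical placeholders.
    min_alu_cov: float = 5.0         # the Alu must be visibly covered
    min_alu_to_intron: float = 3.0   # the Alu must stand out of the intron
    max_gene_to_alu: float = 10.0    # the host gene must not swamp the Alu


def is_stair_like(cov: LocusCoverage, rules: Ruleset = Ruleset()) -> bool:
    """Return True when all three coverage conditions hold together."""
    if cov.alu < rules.min_alu_cov:
        return False
    # Guard against a zero-coverage intron; cov.alu is positive here.
    alu_to_intron = cov.alu / max(cov.intron, 1e-9)
    gene_to_alu = cov.gene / cov.alu
    return (alu_to_intron >= rules.min_alu_to_intron
            and gene_to_alu <= rules.max_gene_to_alu)


print(is_stair_like(LocusCoverage(alu=30.0, intron=5.0, gene=60.0)))
```

Keeping the thresholds in a separate `Ruleset` object makes it straightforward to re-run classification after each round of IGV review changes a cutoff.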

Confirmation channel § 04.4

Where Nanopore direct-RNA reads are available, called Alus are checked for spanning long reads. This is used as a confirmation, not a discovery, channel — the short-read call list is the inventory; the long reads decide which entries to promote from candidate to confirmed.
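A spanning check reduces to interval containment. The sketch below assumes 0-based half-open coordinates, ignores strand and splice gaps, and uses hypothetical `flank` and `min_reads` parameters; the toy intervals are invented.

```python
from typing import Dict, Iterable, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def spans(read: Interval, alu: Interval, flank: int = 0) -> bool:
    """True if the read covers the whole Alu plus `flank` bases on each side."""
    r_chrom, r_start, r_end = read
    a_chrom, a_start, a_end = alu
    return (r_chrom == a_chrom
            and r_start <= a_start - flank
            and r_end >= a_end + flank)


def promote(candidates: Dict[str, Interval], reads: Iterable[Interval],
            flank: int = 50, min_reads: int = 1) -> Dict[str, str]:
    """Label each called Alu 'confirmed' or 'candidate' from long-read support."""
    reads = list(reads)
    status = {}
    for alu_id, alu in candidates.items():
        n_spanning = sum(spans(read, alu, flank) for read in reads)
        status[alu_id] = "confirmed" if n_spanning >= min_reads else "candidate"
    return status


# Invented toy intervals, for illustration only.
called = {"alu_1": ("chr1", 1000, 1300), "alu_2": ("chr1", 5000, 5300)}
long_reads = [("chr1", 800, 1600), ("chr1", 5100, 5900)]
status = promote(called, long_reads)
print(status)
```

The loop runs over the short-read call list, not over the reads, so a long read that covers no called Alu changes nothing: the long reads can promote entries but cannot add them.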

§ 05

Working notes

newest first · curated extracts

Curated extracts from the working repo (bok1/alu_lncrna, private during pre-publication). The full audit trail is internal until first manuscript.

2026-05-19
Public site stood up
DNS A record for autoalu.deepsek.no registered; vhost scaffolded in the deepsek.no fleet repo. Public landing page (this site) goes live as the shared Let's Encrypt cert is expanded to cover the new subdomain. Content fills in as the underlying analyses stabilise.
2026-04-27
STAIR isoform inventory
Per-locus inventory of STAIR-topology Alu transcripts. Novel E1alt isoform identified; class-wide Alu-exon switch observed across the curated set. Locus-level pages will draw from this once the inventory hardens.
2026-04-14
STAIR audit at the IGV atlas
Visual audit across the curated locus atlas. Calling rules tightened in response: failing to distinguish topology that survives close inspection from coat-tailing signal is the dominant source of false positives.
2026-04-10
DESeq2 across three layers
Differential expression run independently at the gene, intron, and per-Alu layers under complement inhibition. Reconciliation downstream rather than at the counting step.
§ 06

People & contact

project home · collaborators · siblings

AutoAlu is led from the bioinformatics group at Nordlandssykehuset (Bodø, Norway). The project is part of a broader programme on host responses to inflammation and complement activation, with clinical context provided by the surrounding hospital research environment.

Project home § 06.1

Institution: Nordlandssykehuset, Bodø, Norway
Group: Bioinformatics / inflammation programme
Funding: Helse Nord (institutional)
Status: Pre-publication; under active development

Collaborators § 06.2

Groups working on innate-immune regulation in primary cells, complement-targeted therapeutics, and long-read RNA biology. Listed when the corresponding manuscripts surface.

Sibling sites § 06.3

AutoAlu shares infrastructure with sibling sites in the deepsek.no fleet — independent projects, shared design language.

Contact § 06.4

For correspondence regarding this project — data access, collaboration, pre-publication questions — please reach out via institutional channels at Nordlandssykehuset. Public release of the working database, analysis code, and locus-level browser will follow the first manuscript.

autoalu.deepsek.no · v0 · 2026-05-19
deepsek.no fleet · no tracking · no cookies