AutoAlu
PRE-PUB · UNDER DEV
A working atlas of human Alu-derived long non-coding RNAs, built from the primate-specific layer of the transcriptome that standard RNA-Seq pipelines silently discard. Short-read DESeq2 at three resolutions, long-read confirmation, locus-level review at scale — under active development from Nordlandssykehuset, Bodø.
AluSx is the single most abundant Alu subfamily in hg38.
AluY is the youngest and still inserting; many
AluY elements are polymorphic among individuals today.
Counts are RepeatMasker / hg38, rounded.
The project
Alu elements are the most abundant short interspersed nuclear
elements (SINEs) in the human genome, and they are not silent.
A growing literature — STAIR, JARID2-NBDY,
SINEUPs, IRAlus, NEAT1 —
shows that Alu insertions can become functional
cis- or trans-acting elements of host transcripts
and of stand-alone long non-coding RNAs. Standard RNA-Seq
pipelines, built before this view was dominant, routinely
discard multi-mapping repeat reads — exactly the reads that
would tell you whether a particular Alu is being transcribed.
AutoAlu is our attempt to map that layer back in, with enough numerical care that the resulting calls survive inspection at the per-locus level. The atlas is built by reconciling independent counts at three resolutions — host gene, intron, single Alu — and validating short-read calls with long-read Nanopore reads where available.
From inventory to confirmed call § 01.1
… STAIR-like
candidates. Long-read confirmation (Oxford Nanopore direct-RNA)
promotes a candidate to confirmed. Specific exit
counts at columns 02 → 04 will land here as the analyses
stabilise; proportions in this figure are schematic.
Three layers, one locus § 01.2
… autoalu.deepsek.no. Content fills in as the
underlying analyses stabilise and the manuscript progresses.
Specific locus counts are deliberately sparse here — the
working repo changes daily and a quoted figure today will
not match tomorrow's.
Atlas
The full Alu inventory in hg38 is on the order of 1.14 million elements. Only a minority are transcribed in any given cellular context, and a smaller minority still show topology consistent with stand-alone lncRNA function. The pipeline tracks every Alu as a candidate and lets the combination of short-read coverage, intronic context, and long-read confirmation prune the list down.
Distribution across the genome § 02.1
chr1 and chr2 carry the most simply
because they are the largest sequences;
chr19 (highlighted) is the anomaly — physically small
but gene- and GC-rich, with the highest density per Mb
of any chromosome. The dashed line is the per-chromosome
mean (~47k). Counts are RepeatMasker / hg38, Alu family only,
rounded to the nearest 1k.
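The difference between raw count and density per Mb can be made concrete with a small sketch. The function name and the toy inputs below are illustrative assumptions, not values from the atlas; the point is only that a short sequence can outrank a long one once counts are normalised by length.

```python
def alu_density_per_mb(counts, lengths_bp):
    """Return {chrom: Alu elements per megabase} for chromosomes present in both inputs."""
    density = {}
    for chrom, n in counts.items():
        length = lengths_bp.get(chrom)
        if not length:
            continue  # skip chromosomes with unknown or zero length
        density[chrom] = n / (length / 1_000_000)
    return density


# Toy inputs (hypothetical, chosen only to show the effect):
# a large sparse sequence and a small dense one.
toy_counts = {"big": 100_000, "small": 60_000}
toy_lengths = {"big": 250_000_000, "small": 60_000_000}

density = alu_density_per_mb(toy_counts, toy_lengths)
ranked = sorted(density, key=density.get, reverse=True)
print(ranked)  # the small sequence ranks first by density despite its lower count
```

With real inputs, `counts` would come from the RepeatMasker Alu annotation and `lengths_bp` from the hg38 chromosome sizes.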
Why three layers § 02.2
A naive single-feature count loses information in both directions: a busy host gene drowns out a quietly transcribed Alu, and a single Alu count can't tell whether the signal is element-specific or part of an intronic envelope. Reconciling the three layers is most of what makes the calls usable downstream.
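As a rough illustration of what reconciling the three layers could look like, the sketch below compares log2 fold changes at the host-gene, intron, and single-Alu levels. The class, the labels, and the `min_effect` threshold are all hypothetical; the project's actual rules are not published here, and a real implementation would also weigh adjusted p-values and coverage.

```python
from dataclasses import dataclass


@dataclass
class LayerLFC:
    """Log2 fold changes for one locus at the three feature layers."""
    host_gene: float
    intron: float
    alu: float


def same_direction(a, b, min_effect):
    """True if both effects pass the threshold and share a sign."""
    return abs(a) >= min_effect and abs(b) >= min_effect and (a > 0) == (b > 0)


def classify(lfc, min_effect=1.0):
    """Label a locus by which layer best explains the Alu-level signal.

    min_effect is an assumed threshold, not a value from the project.
    """
    if abs(lfc.alu) < min_effect:
        return "no_alu_response"
    if same_direction(lfc.alu, lfc.host_gene, min_effect):
        return "host_driven"        # the Alu moves with the whole host gene
    if same_direction(lfc.alu, lfc.intron, min_effect):
        return "intronic_envelope"  # the Alu moves with its surrounding intron
    return "element_specific"       # the Alu moves on its own


print(classify(LayerLFC(host_gene=0.1, intron=0.2, alu=2.0)))
print(classify(LayerLFC(host_gene=2.5, intron=2.0, alu=2.2)))
```

The ordering of the checks is a design choice in this sketch: host-gene agreement is tested first because a host-wide change also moves the intron, so it is the broader explanation.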
The driver system
The driving biology for the first wave of analyses is the complement system, specifically the C5a fragment, one of the strongest known short-acting drivers of innate inflammation. Paired control and complement-inhibited samples give us a defined perturbation against which to ask: which Alus respond?
Worked examples § 03.1
The headline candidate locus from these comparisons sits in
the IL6 neighbourhood, where an Alu-derived
intronic element has the right topology to act as a
cis-regulatory lncRNA on the host transcript.
A second worked thread runs through JARID2 and
its lncRNA neighbourhood. Both will land here as their own
pages once the corresponding manuscripts move.
Why this system § 03.2
- Clean perturbation. Pharmacological complement inhibition gives a well-defined contrast — no transfection artefacts, no chronic-disease confounders.
- Independent priors. The complement / IL6 axis is well-characterised at the protein and cytokine level, so Alu-layer findings can be tested against existing literature rather than built from scratch.
- Clinical proximity. The host PI runs a clinical lab; findings that hold up here can move toward bedside-relevant questions without leaving the same dataset.
Open mechanistic threads § 03.3
A handful of additional loci show the right topology and are being developed in parallel. Pages will land as the analyses harden.
Methods
Short-read RNA-Seq is processed with a repeat-aware mapping
configuration so multi-mappers are retained and resolved
against the RepeatMasker Alu inventory rather than discarded.
Counts are produced at three feature layers and analysed with
DESeq2. Long-read (ONT direct-RNA) data is mapped
and used for locus-level confirmation, not for primary
counting.
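The text above does not say how multi-mappers are resolved. One widely used option is fractional counting, in which a read aligned to n candidate elements contributes 1/n to each. The sketch below illustrates that idea only and should not be read as the AutoAlu method; the input format (read ID mapped to a list of overlapping Alu IDs) and all identifiers are hypothetical.

```python
from collections import defaultdict


def fractional_counts(read_to_alus):
    """Split each read's unit weight equally across the Alus it maps to."""
    counts = defaultdict(float)
    for read_id, alu_ids in read_to_alus.items():
        unique_ids = set(alu_ids)  # guard against duplicate hits on one element
        if not unique_ids:
            continue  # read overlaps no annotated Alu
        weight = 1.0 / len(unique_ids)
        for alu_id in unique_ids:
            counts[alu_id] += weight
    return dict(counts)


# Toy alignments (hypothetical): one unique read, two multi-mappers, one miss.
toy = {
    "read1": ["alu_A"],
    "read2": ["alu_A", "alu_B"],
    "read3": ["alu_A", "alu_B", "alu_C", "alu_D"],
    "read4": [],
}
counts = fractional_counts(toy)
print(counts)
```

DESeq2 expects integer counts, so fractional totals like these would need rounding, or a different assignment scheme, before the differential-expression step.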
Pipeline stages § 04.1
… HOMER) produces three independent count tables,
one per feature layer, which are reconciled rather than
merged at the counting step. The dashed teal track is the
Nanopore direct-RNA confirmation sibling: long reads feed the
terminal topology call as a confirmation channel, not as a
discovery channel.
Reference set § 04.2
- Genome
- hg38 (GRCh38, primary assembly)
- Repeat set
- RepeatMasker · Alu family only
- Gene model
- GENCODE (current release)
- Short-read
- Public RNA-Seq + paired complement-inhibition cohort
- Long-read
- Oxford Nanopore direct-RNA (in-lab)
Topology calls § 04.3
A locus is flagged STAIR-topology when its per-Alu coverage, surrounding intron coverage, and host-gene coverage jointly satisfy a ruleset drawn from the published Alu-lncRNA literature. The exact ruleset is being tightened locus by locus through visual IGV review of an initial sample of the call list; the audit log lives in the working repo and informs the next round of automated re-classification.
Confirmation channel § 04.4
Where Nanopore direct-RNA reads are available, called Alus are checked for spanning long reads. This is used as a confirmation, not a discovery, channel — the short-read call list is the inventory; the long reads decide which entries to promote from candidate to confirmed.
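The promotion step can be sketched as an interval check: a long read spans an Alu when it covers the element end to end, optionally with some flanking sequence on each side. The `min_spanning` and `flank` parameters, the type names, and the toy coordinates below are assumptions for illustration; strand is ignored for brevity, and the project's actual criteria may differ.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def spans(read, alu, flank=0):
    """True if the read covers the whole Alu plus `flank` bases on each side."""
    return (
        read.chrom == alu.chrom
        and read.start <= alu.start - flank
        and read.end >= alu.end + flank
    )


def promote(alu, reads, min_spanning=2, flank=0):
    """Return 'confirmed' if enough long reads span the Alu, else 'candidate'."""
    n_spanning = sum(spans(read, alu, flank) for read in reads)
    return "confirmed" if n_spanning >= min_spanning else "candidate"


# Toy coordinates (hypothetical).
alu = Interval("chrT", 1000, 1300)
reads = [
    Interval("chrT", 900, 1500),   # spans with generous flanks
    Interval("chrT", 1100, 1600),  # starts inside the Alu
    Interval("chrT", 950, 1300),   # spans exactly to the 3' boundary
    Interval("chrU", 900, 1500),   # different chromosome
]
print(promote(alu, reads))            # two spanning reads
print(promote(alu, reads, flank=60))  # only one read keeps 60 bp on both sides
```

Keeping the short-read call list as the only input to `promote` mirrors the confirmation-only role described above: long reads never add entries, they only change the status of entries that already exist.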
Working notes
Curated extracts from the working repo (bok1/alu_lncrna,
private during pre-publication). The full audit trail is
internal until first manuscript.
- autoalu.deepsek.no registered; vhost scaffolded in the deepsek.no fleet repo. Public landing page (this site) goes live as the shared Let's Encrypt cert is expanded to cover the new subdomain. Content fills in as the underlying analyses stabilise.
- E1alt isoform identified; class-wide Alu-exon switch observed across the curated set. Locus-level pages will draw from this once the inventory hardens.
People & contact
AutoAlu is led from the bioinformatics group at Nordlandssykehuset (Bodø, Norway). The project is part of a broader programme on host responses to inflammation and complement activation, with clinical context provided by the surrounding hospital research environment.
Project home § 06.1
- Institution
- Nordlandssykehuset, Bodø, Norway
- Group
- Bioinformatics / inflammation programme
- Funding
- Helse Nord (institutional)
- Status
- Pre-publication; under active development
Collaborators § 06.2
Collaborating groups work on innate-immune regulation in primary cells, complement-targeted therapeutics, and long-read RNA biology. They will be listed when the corresponding manuscripts surface.
Sibling sites § 06.3
AutoAlu shares infrastructure with sibling sites in the
deepsek.no fleet — independent projects, shared
design language.
Contact § 06.4
For correspondence regarding this project — data access, collaboration, pre-publication questions — please reach out via institutional channels at Nordlandssykehuset. Public release of the working database, analysis code, and locus-level browser will follow the first manuscript.