This workflow provides two complementary ways to perform genomic distribution analysis across multiple annotation layers. Users may either run a single wrapper function to execute all selected analyses in one step, or call each annotation function individually for inspection of intermediate results.
Integrated workflow (recommended):
The genomic_distribution() wrapper function executes one or
more distribution analyses within a single, unified pipeline. According
to the specified distributions argument, it internally
invokes the four distribution functions in sequence, while automatically
handling shared parameters and organizing all outputs in a common
directory.
Step-by-step workflow:
The individual functions genic_distribution(),
ccre_distribution(), chromhmm_distribution(),
and repeat_distribution() can be run independently. This
approach allows users to customize parameters for each analysis and to
inspect intermediate results with greater flexibility.
The genomic_distribution() function performs genomic
distribution analyses across multiple annotation types.
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRanges or GRangesList | Yes | Genomic regions to be annotated. Each element in a GRangesList is treated as an independent sample or cluster. | query = my_peaks |
out_dir |
character | Yes | Output directory for all result tables and plots. The directory will be created if it does not exist. | out_dir = "./distribution_results" |
distributions |
character | No (default: c("genic", "ccre")) |
Annotation layers to run. Supported values include
"genic", "ccre", "chromhmm", and
"repeat". Multiple analyses can be selected
simultaneously. |
distributions = c("genic", "repeat") |
ref_genome |
character | No (default: "hg38") |
Reference genome used for annotation. Supported values are
"hg38" and "mm10". |
ref_genome = "mm10" |
ref_source |
character | No (default: "knownGene") |
Gene annotation source used for genic annotation.
"knownGene" uses UCSC knownGene annotations via TxDb
packages, while "GENCODE" uses GENCODE reference
annotations (Release 49 for hg38, vM35 for mm10). |
ref_source = "GENCODE" |
mode |
character | No (default: "nearest") |
Assignment mode for overlapping features. "nearest"
assigns each query region to the closest annotation feature, while
"weighted" assigns features proportionally based on
overlap. |
mode = "weighted" |
plot |
logical | No (default: TRUE) |
Whether to generate stacked bar plots in addition to summary tables. | plot = FALSE |
library(epigenomeR)
genomic_distribution(
query = my_grl,
out_dir = "./genomic_distribution_results",
distributions = c("genic", "ccre", "chromhmm", "repeat"),
ref_genome = "hg38",
ref_source = "knownGene",
mode = "nearest",
plot = TRUE
)
The genic_distribution() function annotates genomic
regions with gene structural features, calculates
overlap-based feature proportions, and generates a summary table with an
optional visualization. The genic categories are defined as follows:
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRangesList or GRanges | Yes | Query regions to annotate with gene structures | query = my_peaks |
out_dir |
character | No (default: “./”) | Output directory for results. Created if doesn’t exist | out_dir = "./genic_results" |
ref_genome |
character | No (default: “hg38”) | Reference genome: “hg38” or “mm10” | ref_genome = "mm10" |
ref_source |
character | No (default: “knownGene”) | Gene annotation source used to define gene models. “knownGene” uses UCSC’s manually-curated gene models (via TxDb packages), “GENCODE” uses comprehensive reference annotations from GENCODE (Release 49 for hg38, vM35 for mm10) | ref_source = "GENCODE" |
mode |
character | No (default: “nearest”) | Assignment mode: “nearest” (closest feature) or “weighted” (proportional by overlap) | mode = "weighted" |
plot |
logical | No (default: TRUE) | Generate stacked barplot of genic feature distributions | plot = FALSE |
The function generates the following output files in the specified
out_dir:
genic_distribution.tsv
genic_distribution.pdf (if plot = TRUE)
The ccre_distribution() function annotates genomic
regions with candidate cis-regulatory elements (cCREs)
using RepeatMasker data, calculates overlap-based repeat class
proportions, and outputs a summary table with an optional visualization.
The cCRE categories are defined as follows:
Promoter (Promoter-like signature):
cCREs located within 200 bp of an annotated or experimentally supported
TSS. They show high chromatin accessibility together with strong H3K4me3
enrichment, consistent with active transcription initiation
sites.
Enhancer (Enhancer-like signature):
cCREs with high chromatin accessibility and strong H3K27ac signals.
Elements within 2 kb of a TSS are classified as TSS-proximal enhancers,
while those farther away are classified as TSS-distal enhancers. Regions
within 200 bp of a TSS must have low H3K4me3 to avoid promoter
assignment.
CA-H3K4me3 (Chromatin accessibility with
H3K4me3):
cCREs showing high chromatin accessibility and H3K4me3 enrichment, low
H3K27ac, and located outside the 200 bp TSS window. These regions
resemble promoter-like chromatin states but are not directly associated
with annotated TSSs.
CA-CTCF (Chromatin accessibility with
CTCF):
cCREs characterized by high chromatin accessibility and CTCF binding,
with low H3K4me3 and H3K27ac signals. They are typically associated with
chromatin insulation and higher-order genome organization.
CA-TF (Chromatin accessibility with
transcription factor binding):
cCREs with high chromatin accessibility that overlap transcription
factor binding clusters, while showing low H3K4me3, H3K27ac, and CTCF
signals. These regions likely represent TF-driven regulatory
sites.
CA (Chromatin accessibility only):
cCREs marked by high chromatin accessibility but lacking H3K4me3,
H3K27ac, and CTCF signals. They may represent open chromatin regions
with potential or context-dependent regulatory activity.
TF (Transcription factor binding):
cCREs overlapping transcription factor binding clusters but showing low
chromatin accessibility and low levels of H3K4me3, H3K27ac, and CTCF
signals. These sites may reflect transient or condition-specific TF
occupancy.

| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRangesList or GRanges | Yes | Query regions to annotate with cCRE elements | query = my_peaks |
out_dir |
character | No (default: “./”) | Output directory for results. Created if doesn’t exist | out_dir = "./ccre_results" |
ref_genome |
character | No (default: “hg38”) | Reference genome: “hg38” or “mm10” | ref_genome = "mm10" |
mode |
character | No (default: “nearest”) | Assignment mode: “nearest” (closest category) or “weighted” (proportional by overlap) | mode = "weighted" |
plot |
logical | No (default: TRUE) | Generate stacked barplot of cCRE distributions | plot = FALSE |
The function generates the following output files in the specified
out_dir:
ccre_distribution.tsv
ccre_distribution.pdf (if plot = TRUE)
The chromhmm_distribution() function annotates genomic
regions with chromatin states using ChromHMM data,
calculates overlap-based chromatin state proportions, and produces a
summary table with an optional visualization.
Acet: Acetylation-associated active regulatory
regions
EnhWk: Weak enhancers
EnhA: Active enhancers
PromF: Promoter flanking regions
TSS: Transcription start sites
TxWk: Weak transcription
TxEx: Transcription with exon signals
Tx: Transcription
OpenC: Open chromatin
TxEnh: Transcribed enhancers
ReprPCopenC: Polycomb-repressed regions with open chromatin
BivProm: Bivalent promoters
ZNF: Zinc finger gene clusters
ReprPC: Polycomb-repressed chromatin
HET: Constitutive heterochromatin
GapArtf: Assembly gaps and artifacts
Quies: Quiescent background
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRangesList or GRanges | Yes | Query regions to annotate with cCRE elements | query = my_peaks |
out_dir |
character | No (default: “./”) | Output directory for results. Created if doesn’t exist | out_dir = "./ccre_results" |
ref_genome |
character | No (default: “hg38”) | Reference genome: “hg38” or “mm10” | ref_genome = "mm10" |
mode |
character | No (default: “nearest”) | Assignment mode: “nearest” (closest category) or “weighted” (proportional by overlap) | mode = "weighted" |
plot |
logical | No (default: TRUE) | Generate stacked barplot of cCRE distributions | plot = FALSE |
The function generates the following output files in the specified
out_dir:
chromhmm_distribution.tsv
chromhmm_distribution.pdf (if plot = TRUE)
The repeat_distribution() function anotates genomic
regions with chromatin states using ChromHMM data, computes
overlap-based chromatin state percentages, and generates a summary table
along with an optional visualization.
SINE: Short interspersed nuclear elements
LINE: Long interspersed nuclear elements
LTR: Long terminal repeat retrotransposons
Retroposon: Non-LTR retrotransposon-derived
elements
RC: Rolling-circle transposons
DNA: DNA transposons
Satellite: Satellite repeats
Simple_repeat: Simple sequence repeats
Low_complexity: Low-complexity sequences
rRNA: Ribosomal RNA repeats
tRNA: Transfer RNA repeats
snRNA: Small nuclear RNA repeats
scRNA: Small cytoplasmic RNA repeats
srpRNA: Signal recognition particle RNA
repeats
RNA: Other RNA-related repeats
Unknown: Unclassified repeat elements
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
query |
GRangesList or GRanges | Yes | Query regions to annotate with cCRE elements | query = my_peaks |
out_dir |
character | No (default: “./”) | Output directory for results. Created if doesn’t exist | out_dir = "./ccre_results" |
ref_genome |
character | No (default: “hg38”) | Reference genome: “hg38” or “mm10” | ref_genome = "mm10" |
mode |
character | No (default: “nearest”) | Assignment mode: “nearest” (closest feature) or “weighted” (proportional by overlap) | mode = "weighted" |
plot |
logical | No (default: TRUE) | Generate stacked barplot of cCRE distributions | plot = FALSE |
The function generates the following output files in the specified
out_dir:
repeat_distribution.tsv
repeat_distribution.pdf (if plot = TRUE)
The get_matched_control() function generates background
control regions by randomly selecting genes with similar length to the
nearest gene of each target peak, then placing control regions at the
same TSS-relative distance on the selected genes. This approach
preserves the relationship between peaks and gene structure while
avoiding biases from using the same genomic neighborhoods.
What this function does:
Identifies the nearest gene for each target genomic region
Finds candidate genes with similar length (within tolerance threshold)
Randomly selects control genes from candidates, excluding the original gene
Places control regions at identical TSS-relative distances on control genes
Validates control regions are within chromosome boundaries
Generates multiple control regions per target (configurable replicates)
Automatically relaxes length tolerance if insufficient candidates found
Returns GRanges object with all generated control regions
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
target_gr |
GRanges | Yes | GRanges object containing target peak regions for which controls will be generated | target_gr = my_peaks |
ref_genome |
character | No (default: “hg38”) | Reference genome version: “hg38” (Human GRCh38) or “mm10” (Mouse GRCm38) | ref_genome = "hg38" |
style |
character | No (default: “UCSC”) | Chromosome naming style (e.g., “UCSC” uses “chr1”, “Ensembl” uses “1”) | style = "UCSC" |
n_rep |
integer | No (default: 1) | Number of control regions to generate per target peak | n_rep = 3 |
regions |
integer | No (default: 800) | Width of control regions in base pairs (centered on calculated position) | regions = 500 |
seed |
integer | No (default: 42) | Random seed for reproducible control region selection | seed = 12345 |
length_tolerance |
numeric | No (default: 0.2) | Initial tolerance for gene length matching (0.2 = ±20%). Automatically relaxed up to ±100% if insufficient candidates | length_tolerance = 0.3 |
n_rep * 2), tolerance
increases by 10% incrementsThe function returns a GRanges object containing all successfully generated control regions.
Object type: GRanges (GenomicRanges package)
Number of regions: Up to
n_rep × length(target_gr) regions
n_rep control
regionsRegion characteristics:
regions parameterstyle parameter
(e.g., “chr1” for UCSC)library(epigenomeR)
library(GenomicRanges)
# Load target peaks
target_peaks <- rtracklayer::import("CTCF_peaks.bed")
# Basic usage - 1 control per target
control_regions <- get_matched_control(
target_gr = target_peaks,
ref_genome = "hg38"
)
# Generate multiple controls per target
control_regions <- get_matched_control(
target_gr = target_peaks,
ref_genome = "hg38",
n_rep = 5,
regions = 500
)
# Mouse genome with relaxed length tolerance
control_regions <- get_matched_control(
target_gr = target_peaks,
ref_genome = "mm10",
n_rep = 3,
length_tolerance = 0.3,
seed = 123
)
The calculate_TFBS_enrichment() function performs motif
enrichment analysis comparing a set of target genomic regions against
control regions using Fisher’s exact test to identify significantly
enriched transcription factor binding motifs.
What this function does:
Compares target genomic regions (GRanges) against background/control regions (GRanges)
Works directly with GRanges objects without file I/O for region inputs
Overlaps regions with a comprehensive motif library (stored as RDS files)
Optionally filters motif sites using functional region annotations (GRanges)
Counts motif occurrences in target vs control regions using
countOverlaps()
Performs Fisher’s exact test for each motif with pseudocount correction
Calculates odds ratios with standard errors
Adjusts p-values for multiple testing using Benjamini-Hochberg FDR
Identifies transcription factors with significantly enriched binding sites
Provides quantitative enrichment statistics for downstream interpretation
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
target_region |
GRanges | Yes | GRanges object containing genomic regions of interest | target_region = my_peaks |
control_region |
GRanges | Yes | GRanges object containing background/control regions for comparison | control_region = background_peaks |
functional_region |
GRanges or NULL | No (default: NULL) | Optional GRanges object with functional regions for filtering motif sites. Only motifs overlapping these regions are considered | functional_region = open_chromatin |
out_path |
character | No (default: “./TFBS_enrichment.tsv”) | File path to save enrichment results table (TSV format) | out_path = "./results/motif_enrichment.tsv" |
regions |
integer or NULL | No (default: NULL) | Size in base pairs to resize all regions around their center point. If NULL, regions are used as-is | regions = 500 |
ref_genome |
character | No (default: “hg38”) | Reference genome version: “hg38” or “mm10” | ref_genome = "hg38" |
style |
character or NULL | No (default: NULL) | Chromosome naming style (e.g., “UCSC”, “Ensembl”). If NULL, auto-detects from target_region | style = "UCSC" |
style is NULL, uses the style from
target_regionregions
parameterregions = NULL (default): Regions are used with
their original widthsregions is specified (e.g., 500): Both target and
control regions are resized to the specified width, centered on each
region’s midpoint.The function generates the following output file at the specified
output_path:
<output_path> (TSV format)
| feature | odds_ratio | pvalue | odds_ratio_se | FDR | target_hit | control_hit |
|---|---|---|---|---|---|---|
| (CRISPR targeting AFF4)AFF4 | 16.14696 | 2.348799e-123 | 0.1066921 | 1.327293e-122 | 194 | 768 |
| (CRISPR targeting ARID3B)ARID3B | 35.14312 | 1.320588e-76 | 0.1620585 | 3.858694e-76 | 78 | 95 |
| (CRISPR targeting ARID4B)ARID4B | 34.39295 | 7.107417e-223 | 0.1229221 | 4.257343e-221 | 308 | 1241 |
| (CRISPR targeting ATF1)ATF1 | 18.37715 | 3.968721e-74 | 0.1327106 | 1.132030e-73 | 98 | 242 |
| (CRISPR targeting ATF2)ATF2 | 18.67509 | 6.632538e-33 | 0.1978719 | 1.119124e-32 | 40 | 83 |
| (CRISPR targeting ATF3)ATF3 | 23.37675 | 9.848921e-136 | 0.1117167 | 7.023218e-135 | 174 | 444 |
library(epigenomeR)
target_region <- rtracklayer::import("target_peaks.bed")
control_region <- rtracklayer::import("control_peaks.bed")
# Basic enrichment analysis (using original region widths)
calculate_TFBS_enrichment(
target_region = target_region,
control_region = control_region,
out_path = "./TFBS_enrichment.tsv",
ref_genome = "hg38"
)
# With uniform region resizing (200bp centered windows)
calculate_TFBS_enrichment(
target_region = target_region,
control_region = control_region,
regions = 200,
out_path = "./TFBS_enrichment_200bp.tsv",
ref_genome = "hg38"
)
# With functional region filtering (e.g., open chromatin)
functional_region <- rtracklayer::import("ATAC_peaks.bed")
calculate_TFBS_enrichment(
target_region = target_region,
control_region = control_region,
regions = 500,
functional_region = functional_region,
out_path = "./TFBS_enrichment_filtered.tsv",
ref_genome = "hg38"
)
The TFBS_enrichment_heatmap() function generates
heatmaps to visualize transcription factor binding site (TFBS)
enrichment patterns across multiple genomic clusters or experimental
conditions, using results from TFBS_enrichment()
analysis.
What this function does:
Reads multiple TFBS enrichment result files from
TFBS_enrichment() output
Combines odds ratios, FDR values, and target hits across all samples/clusters
Filters TFBS based on significance (FDR ≤ 0.05), effect size (odds ratio ≥ 2), and coverage criteria
Allows filtering for specific transcription factors of interest via pattern matching
Applies log2 transformation to odds ratios for symmetric visualization
Generates a comprehensive heatmap showing all TFBS passing filtering criteria
(Optional) Creates a focused heatmap displaying the union of top N most enriched TFBS per sample
(Optional) Supports optional hierarchical clustering of samples for pattern discovery
Produces publication-ready visualizations with consistent formatting and scalable dimensions
Exports log2-transformed odds ratio and FDR matrices for downstream analysis
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
tsv_path |
character vector | Yes | Vector of file paths to TFBS enrichment result TSV files from TFBS_enrichment() analysis | tsv_path = c("cluster1.tsv", "cluster2.tsv") |
label |
character vector | Yes | Vector of sample/cluster labels corresponding to tsv_path. Must have same length as tsv_path | label = c("Cluster1", "Cluster2", "Cluster3") |
out_dir |
character | Yes | Output directory to save heatmaps and data matrices | out_dir = "./heatmaps" |
top_n |
integer | No (default: 20) | Number of top enriched TFBS per sample to display in second heatmap. If NULL, only the full heatmap is generated | top_n = 50 |
selected_tfs |
character vector | No (default: NULL) | Vector of transcription factor names or substrings to filter rows (TFBS). Matching is case-insensitive. If NULL, all TFBS retained | selected_tfs = c("CTCF", "AP1", "STAT") |
apply_cluster |
logical | No (default: FALSE) | Whether to apply hierarchical clustering to columns and reorder them. If TRUE, columns are clustered and reversed for visualization | apply_cluster = TRUE |
TFBS_enrichment() functionodds_ratio, FDR,
target_hitapply_cluster = FALSE): Columns
ordered by sorted labelsapply_cluster = TRUE):
The function generates the following output files in the specified
out_dir:
TFBS_heatmap_all.pdf

TFBS_heatmap_top<n>.pdf (if top_n
provided)
apply_cluster = TRUE
TFBS_odds_ratio_log2.csv

TFBS_FDR.csv

library(epigenomeR)
# Example 1: Basic usage - generate heatmap for all filtered TFBS
TFBS_enrichment_heatmap(
tsv_path = c(
"output/TFBS_enrichment_Cluster1.tsv",
"output/TFBS_enrichment_Cluster2.tsv",
"output/TFBS_enrichment_Cluster3.tsv"
),
label = c("Cluster1", "Cluster2", "Cluster3"),
out_dir = "heatmaps/all_tfbs"
)
# Example 2: With top N selection - focus on most enriched TFBS
TFBS_enrichment_heatmap(
tsv_path = c(
"output/TFBS_enrichment_Cluster1.tsv",
"output/TFBS_enrichment_Cluster2.tsv",
"output/TFBS_enrichment_Cluster3.tsv"
),
label = c("Cluster1", "Cluster2", "Cluster3"),
out_dir = "heatmaps/top10",
top_n = 10
)
# Example 3: Filter for specific transcription factors
TFBS_enrichment_heatmap(
tsv_path = c(
"output/TFBS_enrichment_TypeA.tsv",
"output/TFBS_enrichment_TypeB.tsv",
"output/TFBS_enrichment_TypeC.tsv"
),
label = c("TypeA", "TypeB", "TypeC"),
out_dir = "heatmaps/selected_tfs",
top_n = 15,
selected_tfs = c("CTCF", "AP1", "STAT", "NRF", "FOXO")
)