1. Genomic Element Distribution Analysis

This workflow provides two complementary ways to perform genomic distribution analysis across multiple annotation layers. Users may either run a single wrapper function to execute all selected analyses in one step, or call each annotation function individually for inspection of intermediate results.

  • Integrated workflow (recommended):
    The genomic_distribution() wrapper function executes one or more distribution analyses within a single, unified pipeline. According to the specified distributions argument, it internally invokes the four distribution functions in sequence, while automatically handling shared parameters and organizing all outputs in a common directory.

  • Step-by-step workflow:
    The individual functions genic_distribution(), ccre_distribution(), chromhmm_distribution(), and repeat_distribution() can be run independently. This approach allows users to customize parameters for each analysis and to inspect intermediate results with greater flexibility.

Wrapper Function Overview

The genomic_distribution() function performs genomic distribution analyses across multiple annotation types.

Parameters

Parameter Type Required Description Example
query GRanges or GRangesList Yes Genomic regions to be annotated. Each element in a GRangesList is treated as an independent sample or cluster. query = my_peaks
out_dir character Yes Output directory for all result tables and plots. The directory will be created if it does not exist. out_dir = "./distribution_results"
distributions character No (default: c("genic", "ccre")) Annotation layers to run. Supported values include "genic", "ccre", "chromhmm", and "repeat". Multiple analyses can be selected simultaneously. distributions = c("genic", "repeat")
ref_genome character No (default: "hg38") Reference genome used for annotation. Supported values are "hg38" and "mm10". ref_genome = "mm10"
ref_source character No (default: "knownGene") Gene annotation source used for genic annotation. "knownGene" uses UCSC knownGene annotations via TxDb packages, while "GENCODE" uses GENCODE reference annotations (Release 49 for hg38, vM35 for mm10). ref_source = "GENCODE"
mode character No (default: "nearest") Assignment mode for overlapping features. "nearest" assigns each query region to the closest annotation feature, while "weighted" assigns features proportionally based on overlap. mode = "weighted"
plot logical No (default: TRUE) Whether to generate stacked bar plots in addition to summary tables. plot = FALSE

Example Usage

library(epigenomeR)

genomic_distribution(
  query = my_grl,
  out_dir = "./genomic_distribution_results",
  distributions = c("genic", "ccre", "chromhmm", "repeat"),
  ref_genome = "hg38",
  ref_source = "knownGene",
  mode = "nearest",
  plot = TRUE
)

1A. Genic Distribution Analysis

The genic_distribution() function annotates genomic regions with gene structural features, calculates overlap-based feature proportions, and generates a summary table with an optional visualization. The genic categories are defined as follows:

  • Promoter: Regions defined as the 1,000 bp upstream of transcription start sites (TSS), representing promoter-proximal regulatory sequences associated with transcription initiation.
  • 5’ UTR: Untranslated regions at the 5’ end of transcripts.
  • Exon: Exonic regions corresponding to coding sequences only. Since untranslated regions are excluded, this category represents protein-coding portions of transcripts.
  • Intron: Non-coding regions spliced out during transcript processing, often enriched for regulatory elements.
  • 3’ UTR: Untranslated regions at the 3’ end of transcripts.

Parameters

Parameter Type Required Description Example
query GRangesList or GRanges Yes Query regions to annotate with gene structures query = my_peaks
out_dir character No (default: “./”) Output directory for results. Created if doesn’t exist out_dir = "./genic_results"
ref_genome character No (default: “hg38”) Reference genome: “hg38” or “mm10” ref_genome = "mm10"
ref_source character No (default: “knownGene”) Gene annotation source used to define gene models. “knownGene” uses UCSC’s manually-curated gene models (via TxDb packages), “GENCODE” uses comprehensive reference annotations from GENCODE (Release 49 for hg38, vM35 for mm10) ref_source = "GENCODE"
mode character No (default: “nearest”) Assignment mode: “nearest” (closest feature) or “weighted” (proportional by overlap) mode = "weighted"
plot logical No (default: TRUE) Generate stacked barplot of genic feature distributions plot = FALSE

Output Files

The function generates the following output files in the specified out_dir:

  1. cCRE annotation table - genic_distribution.tsv
    • Tab-delimited table of genic category percentages
    • Rows: samples/clusters from input
    • Columns: genic categories
    • Values: percentage of each category class per sample/cluster
  2. cCRE distribution plot - genic_distribution.pdf (if plot = TRUE)
    • Stacked bar plot showing genic composition across samples or clusters
    • X-axis: percentage (0-100%)
    • Y-axis: samples/clusters

1B. cCRE Distribution Analysis

The ccre_distribution() function annotates genomic regions with candidate cis-regulatory elements (cCREs) using RepeatMasker data, calculates overlap-based repeat class proportions, and outputs a summary table with an optional visualization. The cCRE categories are defined as follows:

  • Promoter (Promoter-like signature):
    cCREs located within 200 bp of an annotated or experimentally supported TSS. They show high chromatin accessibility together with strong H3K4me3 enrichment, consistent with active transcription initiation sites.

  • Enhancer (Enhancer-like signature):
    cCREs with high chromatin accessibility and strong H3K27ac signals. Elements within 2 kb of a TSS are classified as TSS-proximal enhancers, while those farther away are classified as TSS-distal enhancers. Regions within 200 bp of a TSS must have low H3K4me3 to avoid promoter assignment.

  • CA-H3K4me3 (Chromatin accessibility with H3K4me3):
    cCREs showing high chromatin accessibility and H3K4me3 enrichment, low H3K27ac, and located outside the 200 bp TSS window. These regions resemble promoter-like chromatin states but are not directly associated with annotated TSSs.

  • CA-CTCF (Chromatin accessibility with CTCF):
    cCREs characterized by high chromatin accessibility and CTCF binding, with low H3K4me3 and H3K27ac signals. They are typically associated with chromatin insulation and higher-order genome organization.

  • CA-TF (Chromatin accessibility with transcription factor binding):
    cCREs with high chromatin accessibility that overlap transcription factor binding clusters, while showing low H3K4me3, H3K27ac, and CTCF signals. These regions likely represent TF-driven regulatory sites.

  • CA (Chromatin accessibility only):
    cCREs marked by high chromatin accessibility but lacking H3K4me3, H3K27ac, and CTCF signals. They may represent open chromatin regions with potential or context-dependent regulatory activity.

  • TF (Transcription factor binding):
    cCREs overlapping transcription factor binding clusters but showing low chromatin accessibility and low levels of H3K4me3, H3K27ac, and CTCF signals. These sites may reflect transient or condition-specific TF occupancy.

This figure summarizes how candidate cis-regulatory elements (cCREs) are classified based on chromatin accessibility, histone marks, CTCF binding, and transcription factor occupancy. Source: https://screen.wenglab.org/about

Parameters

Parameter Type Required Description Example
query GRangesList or GRanges Yes Query regions to annotate with cCRE elements query = my_peaks
out_dir character No (default: “./”) Output directory for results. Created if doesn’t exist out_dir = "./ccre_results"
ref_genome character No (default: “hg38”) Reference genome: “hg38” or “mm10” ref_genome = "mm10"
mode character No (default: “nearest”) Assignment mode: “nearest” (closest category) or “weighted” (proportional by overlap) mode = "weighted"
plot logical No (default: TRUE) Generate stacked barplot of cCRE distributions plot = FALSE

Output Files

The function generates the following output files in the specified out_dir:

  1. cCRE annotation table - ccre_distribution.tsv
    • Tab-delimited table of cCRE category percentages
    • Rows: samples/clusters from input
    • Columns: cCRE categories
    • Values: percentage of each category class per sample/cluster
  2. cCRE distribution plot - ccre_distribution.pdf (if plot = TRUE)
    • Stacked bar plot showing cCRE composition across samples or clusters
    • X-axis: percentage (0-100%)
    • Y-axis: samples/clusters

1C. ChromHMM Distribution Analysis

The chromhmm_distribution() function annotates genomic regions with chromatin states using ChromHMM data, calculates overlap-based chromatin state proportions, and produces a summary table with an optional visualization.

  • Acet: Acetylation-associated active regulatory regions

  • EnhWk: Weak enhancers

  • EnhA: Active enhancers

  • PromF: Promoter flanking regions

  • TSS: Transcription start sites

  • TxWk: Weak transcription

  • TxEx: Transcription with exon signals

  • Tx: Transcription

  • OpenC: Open chromatin

  • TxEnh: Transcribed enhancers

  • ReprPCopenC: Polycomb-repressed regions with open chromatin

  • BivProm: Bivalent promoters

  • ZNF: Zinc finger gene clusters

  • ReprPC: Polycomb-repressed chromatin

  • HET: Constitutive heterochromatin

  • GapArtf: Assembly gaps and artifacts

  • Quies: Quiescent background

Parameters

Parameter Type Required Description Example
query GRangesList or GRanges Yes Query regions to annotate with cCRE elements query = my_peaks
out_dir character No (default: “./”) Output directory for results. Created if doesn’t exist out_dir = "./ccre_results"
ref_genome character No (default: “hg38”) Reference genome: “hg38” or “mm10” ref_genome = "mm10"
mode character No (default: “nearest”) Assignment mode: “nearest” (closest category) or “weighted” (proportional by overlap) mode = "weighted"
plot logical No (default: TRUE) Generate stacked barplot of cCRE distributions plot = FALSE

Output Files

The function generates the following output files in the specified out_dir:

  1. ChromHMM annotation table - chromhmm_distribution.tsv
    • Tab-delimited table of ChromHMM state percentages
    • Rows: samples/clusters from input
    • Columns: cCRE categories
    • Values: percentage of each category class per sample/cluster
  2. ChromHMM distribution plot - chromhmm_distribution.pdf (if plot = TRUE)
    • Stacked bar plot showing repeat class composition across samples or clusters
    • X-axis: percentage (0-100%)
    • Y-axis: samples/clusters

1D. Repeat Distribution Analysis

The repeat_distribution() function anotates genomic regions with chromatin states using ChromHMM data, computes overlap-based chromatin state percentages, and generates a summary table along with an optional visualization.

  • SINE: Short interspersed nuclear elements

  • LINE: Long interspersed nuclear elements

  • LTR: Long terminal repeat retrotransposons

  • Retroposon: Non-LTR retrotransposon-derived elements

  • RC: Rolling-circle transposons

  • DNA: DNA transposons

  • Satellite: Satellite repeats

  • Simple_repeat: Simple sequence repeats

  • Low_complexity: Low-complexity sequences

  • rRNA: Ribosomal RNA repeats

  • tRNA: Transfer RNA repeats

  • snRNA: Small nuclear RNA repeats

  • scRNA: Small cytoplasmic RNA repeats

  • srpRNA: Signal recognition particle RNA repeats

  • RNA: Other RNA-related repeats

  • Unknown: Unclassified repeat elements

Parameters

Parameter Type Required Description Example
query GRangesList or GRanges Yes Query regions to annotate with cCRE elements query = my_peaks
out_dir character No (default: “./”) Output directory for results. Created if doesn’t exist out_dir = "./ccre_results"
ref_genome character No (default: “hg38”) Reference genome: “hg38” or “mm10” ref_genome = "mm10"
mode character No (default: “nearest”) Assignment mode: “nearest” (closest feature) or “weighted” (proportional by overlap) mode = "weighted"
plot logical No (default: TRUE) Generate stacked barplot of cCRE distributions plot = FALSE

Output Files

The function generates the following output files in the specified out_dir:

  1. Repeat annotation table - repeat_distribution.tsv
    • Tab-delimited table of repeat class percentages
    • Rows: samples/clusters from input GRangesList
    • Columns: repeat element categories ordered as:
      • Transposable elements: SINE, LINE, LTR, Retroposon, RC, DNA
      • Structural variants: Satellite, Simple_repeat, Low_complexity
      • RNA genes: rRNA, tRNA, snRNA, scRNA, srpRNA, RNA
      • Unknown
    • Values: percentage of each repeat class per sample
  2. Repeat distribution plot - repeat_distribution.pdf (if plot = TRUE)
    • Stacked barplot visualization
    • X-axis: percentage (0-100%)
    • Y-axis: samples/clusters (reversed order)
    • Colors: Dark2 palette from RColorBrewer
    • Dimensions: 6 inches (width) × 5 inches (height)
    • Features: Minimal theme with horizontal legend at right

2. TFBS Enrichment Analysis

2A. Generate matched control regions

The get_matched_control() function generates background control regions by randomly selecting genes with similar length to the nearest gene of each target peak, then placing control regions at the same TSS-relative distance on the selected genes. This approach preserves the relationship between peaks and gene structure while avoiding biases from using the same genomic neighborhoods.

What this function does:

  • Identifies the nearest gene for each target genomic region

  • Finds candidate genes with similar length (within tolerance threshold)

  • Randomly selects control genes from candidates, excluding the original gene

  • Places control regions at identical TSS-relative distances on control genes

  • Validates control regions are within chromosome boundaries

  • Generates multiple control regions per target (configurable replicates)

  • Automatically relaxes length tolerance if insufficient candidates found

  • Returns GRanges object with all generated control regions

Parameters

Parameter Type Required Description Example
target_gr GRanges Yes GRanges object containing target peak regions for which controls will be generated target_gr = my_peaks
ref_genome character No (default: “hg38”) Reference genome version: “hg38” (Human GRCh38) or “mm10” (Mouse GRCm38) ref_genome = "hg38"
style character No (default: “UCSC”) Chromosome naming style (e.g., “UCSC” uses “chr1”, “Ensembl” uses “1”) style = "UCSC"
n_rep integer No (default: 1) Number of control regions to generate per target peak n_rep = 3
regions integer No (default: 800) Width of control regions in base pairs (centered on calculated position) regions = 500
seed integer No (default: 42) Random seed for reproducible control region selection seed = 12345
length_tolerance numeric No (default: 0.2) Initial tolerance for gene length matching (0.2 = ±20%). Automatically relaxed up to ±100% if insufficient candidates length_tolerance = 0.3
  • Chromosome handling:
    • Only uses standard chromosomes (excludes mitochondrial DNA)
    • For hg38: chr1-chr22, chrX, chrY
    • For mm10: chr1-chr19, chrX, chrY
    • Control regions must fit within chromosome boundaries
    • Invalid positions (out of bounds) are skipped
  • TSS-relative positioning:
    • Preserves signed distance from target peak to nearest gene TSS
    • Negative distance: peak is upstream of TSS
    • Positive distance: peak is downstream of TSS
    • Accounts for gene strand orientation (+ or -)
    • Control center = Control gene TSS + original TSS distance
  • Gene length matching:
    • Initial tolerance: ±20% of original gene length (default)
    • If insufficient candidates (< n_rep * 2), tolerance increases by 10% increments
    • Maximum automatic tolerance: ±100% (doubled or halved gene length)
    • If still insufficient, uses all genes except the original
    • Ensures control genes have comparable structural features

Output

The function returns a GRanges object containing all successfully generated control regions.

  • Object type: GRanges (GenomicRanges package)

  • Number of regions: Up to n_rep × length(target_gr) regions

    • Actual count may be lower if some controls fail boundary validation
    • Each target peak ideally generates n_rep control regions
  • Region characteristics:

    • Width: All regions have uniform width specified by regions parameter
    • Positioning: Centered on calculated TSS-relative positions
    • Chromosome style: Follows the specified style parameter (e.g., “chr1” for UCSC)
    • Genome info: Includes seqlengths from reference genome

Usage Examples

library(epigenomeR)
library(GenomicRanges)

# Load target peaks
target_peaks <- rtracklayer::import("CTCF_peaks.bed")

# Basic usage - 1 control per target
control_regions <- get_matched_control(
  target_gr = target_peaks,
  ref_genome = "hg38"
)

# Generate multiple controls per target
control_regions <- get_matched_control(
  target_gr = target_peaks,
  ref_genome = "hg38",
  n_rep = 5,
  regions = 500
)

# Mouse genome with relaxed length tolerance
control_regions <- get_matched_control(
  target_gr = target_peaks,
  ref_genome = "mm10",
  n_rep = 3,
  length_tolerance = 0.3,
  seed = 123
)

2B. TFBS Enrichment Analysis

The calculate_TFBS_enrichment() function performs motif enrichment analysis comparing a set of target genomic regions against control regions using Fisher’s exact test to identify significantly enriched transcription factor binding motifs.

What this function does:

  • Compares target genomic regions (GRanges) against background/control regions (GRanges)

  • Works directly with GRanges objects without file I/O for region inputs

  • Overlaps regions with a comprehensive motif library (stored as RDS files)

  • Optionally filters motif sites using functional region annotations (GRanges)

  • Counts motif occurrences in target vs control regions using countOverlaps()

  • Performs Fisher’s exact test for each motif with pseudocount correction

  • Calculates odds ratios with standard errors

  • Adjusts p-values for multiple testing using Benjamini-Hochberg FDR

  • Identifies transcription factors with significantly enriched binding sites

  • Provides quantitative enrichment statistics for downstream interpretation

Parameters

Parameter Type Required Description Example
target_region GRanges Yes GRanges object containing genomic regions of interest target_region = my_peaks
control_region GRanges Yes GRanges object containing background/control regions for comparison control_region = background_peaks
functional_region GRanges or NULL No (default: NULL) Optional GRanges object with functional regions for filtering motif sites. Only motifs overlapping these regions are considered functional_region = open_chromatin
out_path character No (default: “./TFBS_enrichment.tsv”) File path to save enrichment results table (TSV format) out_path = "./results/motif_enrichment.tsv"
regions integer or NULL No (default: NULL) Size in base pairs to resize all regions around their center point. If NULL, regions are used as-is regions = 500
ref_genome character No (default: “hg38”) Reference genome version: “hg38” or “mm10” ref_genome = "hg38"
style character or NULL No (default: NULL) Chromosome naming style (e.g., “UCSC”, “Ensembl”). If NULL, auto-detects from target_region style = "UCSC"
  • Input format:
    • Function expects GRanges objects directly (not file paths)
    • All chromosome naming styles are harmonized automatically
    • If style is NULL, uses the style from target_region
  • Region resizing:
    • Optional: Controlled by the regions parameter
    • If regions = NULL (default): Regions are used with their original widths
    • If regions is specified (e.g., 500): Both target and control regions are resized to the specified width, centered on each region’s midpoint.
  • Statistical method:
    • Adds pseudocount (+1) to all counts to avoid division by zero
    • Uses one-sided Fisher’s exact test (alternative = “greater”)
    • Tests enrichment hypothesis (motif more frequent in target)

Output Files

The function generates the following output file at the specified output_path:

  1. Motif enrichment table - <output_path> (TSV format)
    • Tab-separated table with motif enrichment statistics
    • Rows: Individual transcription factor motifs from JASPAR database
    • Columns:
      • Row names: Motif identifiers (e.g., “MA0139.1” for CTCF)
      • odds_ratio: Odds ratio from Fisher’s exact test (enrichment magnitude)
        • Value > 1: Motif enriched in target regions
        • Value < 1: Motif depleted in target regions
        • Higher values indicate stronger enrichment
      • pvalue: Raw p-value from Fisher’s exact test (one-sided, greater)
        • Tests if motif is significantly more frequent in target vs control
        • Lower values indicate stronger statistical significance
      • odds_ratio_se: Standard error of log odds ratio
        • Measure of uncertainty in odds ratio estimate
        • Useful for calculating confidence intervals
      • FDR: False Discovery Rate (Benjamini-Hochberg adjusted p-value)
        • Corrects for multiple testing across all motifs
        • Typical threshold: FDR < 0.05 or 0.01
      • target_hit: Number of target regions overlapping motif sites
        • Count of target regions containing the motif
      • control_hit: Number of control regions overlapping motif sites
        • Count of control regions containing the motif
    • Sorted by row names (motif IDs)
    • Can be sorted by FDR or odds_ratio for prioritization
    feature odds_ratio pvalue odds_ratio_se FDR target_hit control_hit
    (CRISPR targeting AFF4)AFF4 16.14696 2.348799e-123 0.1066921 1.327293e-122 194 768
    (CRISPR targeting ARID3B)ARID3B 35.14312 1.320588e-76 0.1620585 3.858694e-76 78 95
    (CRISPR targeting ARID4B)ARID4B 34.39295 7.107417e-223 0.1229221 4.257343e-221 308 1241
    (CRISPR targeting ATF1)ATF1 18.37715 3.968721e-74 0.1327106 1.132030e-73 98 242
    (CRISPR targeting ATF2)ATF2 18.67509 6.632538e-33 0.1978719 1.119124e-32 40 83
    (CRISPR targeting ATF3)ATF3 23.37675 9.848921e-136 0.1117167 7.023218e-135 174 444

Example Usage

library(epigenomeR)
target_region <- rtracklayer::import("target_peaks.bed")
control_region <- rtracklayer::import("control_peaks.bed")

# Basic enrichment analysis (using original region widths)
calculate_TFBS_enrichment(
  target_region = target_region,
  control_region = control_region,
  out_path = "./TFBS_enrichment.tsv",
  ref_genome = "hg38"
)

# With uniform region resizing (200bp centered windows)
calculate_TFBS_enrichment(
  target_region = target_region,
  control_region = control_region,
  regions = 200,
  out_path = "./TFBS_enrichment_200bp.tsv",
  ref_genome = "hg38"
)

# With functional region filtering (e.g., open chromatin)
functional_region <- rtracklayer::import("ATAC_peaks.bed")
calculate_TFBS_enrichment(
  target_region = target_region,
  control_region = control_region,
  regions = 500,
  functional_region = functional_region,
  out_path = "./TFBS_enrichment_filtered.tsv",
  ref_genome = "hg38"
)

2C. TFBS Enrichment Heatmap Visualization

The TFBS_enrichment_heatmap() function generates heatmaps to visualize transcription factor binding site (TFBS) enrichment patterns across multiple genomic clusters or experimental conditions, using results from TFBS_enrichment() analysis.

What this function does:

  • Reads multiple TFBS enrichment result files from TFBS_enrichment() output

  • Combines odds ratios, FDR values, and target hits across all samples/clusters

  • Filters TFBS based on significance (FDR ≤ 0.05), effect size (odds ratio ≥ 2), and coverage criteria

  • Allows filtering for specific transcription factors of interest via pattern matching

  • Applies log2 transformation to odds ratios for symmetric visualization

  • Generates a comprehensive heatmap showing all TFBS passing filtering criteria

  • (Optional) Creates a focused heatmap displaying the union of top N most enriched TFBS per sample

  • (Optional) Supports optional hierarchical clustering of samples for pattern discovery

  • Produces publication-ready visualizations with consistent formatting and scalable dimensions

  • Exports log2-transformed odds ratio and FDR matrices for downstream analysis

Parameters

Parameter Type Required Description Example
tsv_path character vector Yes Vector of file paths to TFBS enrichment result TSV files from TFBS_enrichment() analysis tsv_path = c("cluster1.tsv", "cluster2.tsv")
label character vector Yes Vector of sample/cluster labels corresponding to tsv_path. Must have same length as tsv_path label = c("Cluster1", "Cluster2", "Cluster3")
out_dir character Yes Output directory to save heatmaps and data matrices out_dir = "./heatmaps"
top_n integer No (default: 20) Number of top enriched TFBS per sample to display in second heatmap. If NULL, only the full heatmap is generated top_n = 50
selected_tfs character vector No (default: NULL) Vector of transcription factor names or substrings to filter rows (TFBS). Matching is case-insensitive. If NULL, all TFBS retained selected_tfs = c("CTCF", "AP1", "STAT")
apply_cluster logical No (default: FALSE) Whether to apply hierarchical clustering to columns and reorder them. If TRUE, columns are clustered and reversed for visualization apply_cluster = TRUE
  • Input file requirements:
    • Files must be TSV format with tab-separated values
    • Must be output from TFBS_enrichment() function
    • Required columns: odds_ratio, FDR, target_hit
    • Row names: TFBS/motif IDs (e.g., JASPAR identifiers)
    • All files should have consistent TFBS IDs (mismatches will trigger warnings)
  • Filtering logic:
    • Stringent default criteria (OR ≥ 2, FDR ≤ 0.05)
    • Requires meeting ALL criteria in at least ONE sample
    • Target hit threshold: 10th percentile prevents low-coverage artifacts
    • Only non-NA rows/columns retained
    • Filters applied sequentially: NA removal → biological criteria → selected_tfs
  • Column clustering:
    • Without clustering (apply_cluster = FALSE): Columns ordered by sorted labels
    • With clustering (apply_cluster = TRUE):
      • Hierarchical clustering applied (Euclidean distance, complete linkage)
      • Column order extracted and reversed for better visualization
      • Final heatmap drawn with fixed order (no dendrogram shown)
    • Clustering applied identically to both full and top N heatmaps if both generated

Output Files

The function generates the following output files in the specified out_dir:

  1. Full matrix heatmap - TFBS_heatmap_all.pdf
    • Heatmap showing all motifs passing filtering criteria
    • Rows: Transcription factor motifs (TF IDs)
    • Columns: Clusters or experimental conditions
    • Color scale: Blue (low enrichment) → White (neutral) → Red (high enrichment)
    • Symmetric color scale centered at zero
    • Log2-transformed odds ratios for visualization
    • No row/column clustering by default unless “cluster” transformation specified
  2. Top N motifs heatmap - TFBS_heatmap_top<n>.pdf (if top_n provided)
    • Heatmap showing union of top N most enriched TFBS per sample
    • Selects top N TFBS from each sample independently, then takes union
    • Optional column clustering if apply_cluster = TRUE
  3. Log2 odds ratio matrix - TFBS_odds_ratio_log2.csv
    • CSV file containing log2-transformed odds ratios
    • Rows: TFBS IDs
    • Columns: Cluster/condition names
    • Values: log2(odds_ratio) from enrichment analysis
    • Filtered to significant TFBS only (odds_ratio ≥ 2, FDR ≤ 0.05, target_hit ≥ 10th percentile)
    • Reflects the exact data shown in heatmaps
  4. FDR matrix - TFBS_FDR.csv
    • CSV file containing FDR-adjusted p-values
    • Rows: TFBS IDs matching odds_ratio_log2.csv
    • Columns: Sample/cluster names (same order as odds_ratio_log2.csv)
    • Values: FDR (False Discovery Rate) from enrichment analysis
    • Useful for identifying significant enrichments

Example Usage

library(epigenomeR)

# Example 1: Basic usage - generate heatmap for all filtered TFBS
TFBS_enrichment_heatmap(
  tsv_path = c(
    "output/TFBS_enrichment_Cluster1.tsv",
    "output/TFBS_enrichment_Cluster2.tsv",
    "output/TFBS_enrichment_Cluster3.tsv"
  ),
  label = c("Cluster1", "Cluster2", "Cluster3"),
  out_dir = "heatmaps/all_tfbs"
)

# Example 2: With top N selection - focus on most enriched TFBS
TFBS_enrichment_heatmap(
  tsv_path = c(
    "output/TFBS_enrichment_Cluster1.tsv",
    "output/TFBS_enrichment_Cluster2.tsv",
    "output/TFBS_enrichment_Cluster3.tsv"
  ),
  label = c("Cluster1", "Cluster2", "Cluster3"),
  out_dir = "heatmaps/top10",
  top_n = 10
)

# Example 3: Filter for specific transcription factors
TFBS_enrichment_heatmap(
  tsv_path = c(
    "output/TFBS_enrichment_TypeA.tsv",
    "output/TFBS_enrichment_TypeB.tsv",
    "output/TFBS_enrichment_TypeC.tsv"
  ),
  label = c("TypeA", "TypeB", "TypeC"),
  out_dir = "heatmaps/selected_tfs",
  top_n = 15,
  selected_tfs = c("CTCF", "AP1", "STAT", "NRF", "FOXO")
)