1. Highly Variable Regions

This workflow provides two complementary approaches for identifying highly variable genomic regions:

  • Integrated workflow (recommended): The detect_hvr() function combines mean-variance modeling and feature selection in a single step, internally calling mav_screen() and filter_hvr() in sequence.

    library(epigenomeR)
    path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
    detect_hvr(transformed_cm_path = path, out_dir = "./count_matrix/")
  • Step-by-step workflow: For users who need fine-grained control or want to inspect intermediate results, the individual functions mav_screen() and filter_hvr() can be called separately.

Function Overview:

  • detect_hvr() - Complete pipeline for detecting highly variable regions (wrapper function)

  • mav_screen() - Step 1: Model mean-variance relationship and calculate hypervariance metrics

  • filter_hvr() - Step 2: Select top-ranked regions based on hypervariance values


1A. Model mean-variance relationship and calculate hypervariance metrics

The mav_screen() function models the mean-variance relationship in transformed count data and normalizes features by their expected variance. It computes hypervariance metrics that quantify each feature’s deviation from the expected mean-variance trend.

What this function does:

  • (Model mean-variance relationship) Fits a regression model (GAM or LOESS) to characterize the relationship between feature mean and variance in log2 space across all genomic regions. This establishes the baseline mean-variance trend expected under technical variation alone.

  • (Calculate expected variance) For each feature, computes the expected variance based on its mean expression level using the fitted model. This serves as the null expectation against which observed variance is compared.

  • (Normalize feature values) Centers each feature on its observed mean and scales by the expected standard deviation (square root of expected variance) derived from the model. This normalization removes the mean-dependent component of variance, allowing fair comparison across features with different expression levels.

  • (Quantify overdispersion) Computes hypervariance metrics (sum of squared normalized values divided by degrees of freedom) for each feature. Higher hypervariance values indicate greater variability than expected from the mean-variance trend, signaling biological overdispersion rather than technical noise.

  • (Generate comprehensive statistics) Saves all statistics including observed mean/variance, expected variance, normalized values, and hypervariance metrics to a feather file for downstream feature selection and quality control.

Parameters

Parameter Type Required Description Example
transformed_cm_path character Yes Path to transformed count matrix file (.feather format with “pos” column) transformed_cm_path = "data_transformed.feather"
out_dir character No (default: “./”) Output directory path for results out_dir = "./mav_screen_results"
fitting_model character No (default: “gam”) Model to fit mean-variance relationship: “gam” or “loess” fitting_model = "loess"
k integer No (default: 40) Number of basis functions for GAM spline fitting (only used when fitting_model = “gam”) k = 30
span numeric No (default: 0.5) Smoothing span for LOESS fitting (only used when fitting_model = “loess”) span = 0.3
seed integer No (default: 42) Random seed for reproducibility in sampling seed = 123
font_size numeric No (default: 10) Font size for plot labels and axes font_size = 12
nrow_sample_per numeric No (default: 0.2) Proportion of rows (0-1) to randomly sample for density plot nrow_sample_per = 0.1
plot logical No (default: FALSE) Whether to generate and save diagnostic plots (.png files) plot = TRUE
  • Input data requirements:

    The transformed_cm_path parameter refers directly to a pre-processed count matrix stored as a .feather file. This file must not contain raw counts. Instead, the matrix should already be:

    • Pre-normalized (e.g., quantile normalization, TPM, RPKM, or other appropriate normalization)
    • Log-transformed if necessary for your analysis
    • Batch-corrected if applicable
    • Quality-filtered to remove low-quality features

    Additionally, the .feather file must have the first column named “pos” containing feature identifiers (e.g., chr1_9601_10400).

  • Model selection:

    • Use GAM (default) for smooth, flexible fits with automatic complexity control
    • Use LOESS for local fitting, better for non-linear patterns
  • Memory considerations: For very large datasets, adjust nrow_sample_per to reduce memory usage for density plots.

Output Files

The function generates the following output files in the specified out_dir (default: “./”):

  1. Mean-Variance Statistics File - <input_name>_mav_screen.feather
    • Feather format dataframe with mean-variance statistics for each genomic region
    • Contains the following columns:
      • pos: Original region identifier
      • mean: Mean expression across all samples
      • var: Variance across all samples
      • var_expect: Expected variance from fitted mean-variance model
      • norm_mean: Mean of normalized values (should be ~0)
      • norm_var: Variance of normalized values (should be ~1)
      • hypervar: Hypervariance, sum of squared normalized values divided by (n-1)
      • log2_mean: Log2-transformed mean expression
      • log2_var: Log2-transformed variance
      • log2_var_expected: Log2-transformed expected variance from fitted model
    pos mean var var_expect norm_mean norm_var hypervar
    chr1_9601_10400 0.60482747 3.2718638368 2.999402e+00 2.999402e+00 -2.400220e-17 1.0908388
    chr1_10401_11200 0.06557750 0.1646858526 1.620011e-01 1.620011e-01 -6.149439e-18 1.0165723
    chr1_12801_13600 0.03007577 0.0003917656 6.413126e-04 6.413126e-04 -1.840330e-18 0.6108808
    chr1_14401_15200 0.02905376 0.0001423452 1.494525e-04 1.494525e-04 9.551020e-17 0.9524448
    chr1_15201_16000 0.02994487 0.0007778492 5.396820e-04 5.396820e-04 -4.221334e-17 1.4413104
    chr1_16001_16800 0.02875908 0.0001076822 9.580701e-05 9.580701e-05 -1.361187e-16 1.1239494
    log2_mean log2_var log2_var_expected
    -0.7254044 1.710113 1.585175
    -3.9306552 -2.602211 -2.625862
    -5.0552546 -11.317722 -10.603762
    -5.1051311 -12.778318 -12.731808
    -5.0615474 -10.328222 -10.873420
    -5.1198388 -13.180932 -13.352449
  2. Main diagnostic plots - <input_name>_mean_variance.png (if plot = TRUE)
    • Combined two-panel figure showing:
      • Panel A: Log2(mean) vs Log2(variance) with fitted trend line in red
      • Panel B: Log2(mean) vs Hypervariance
    • Points shown with transparency to visualize density
    • Theme: black and white with customizable font size
    • Saved as PNG format
  3. Density plot - <input_name>_fit_density.png (if plot = TRUE)
    • Point density visualization of mean-variance relationship
    • Uses viridis color scale to show point density
    • Red points show expected variance from fitted model
    • Based on randomly sampled subset of data (controlled by nrow_sample_per and seed)
    • Useful for visualizing patterns in large datasets
    • Saved as PNG format

Example Usage

library(epigenomeR)
path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
mav_screen(transformed_cm_path = path, out_dir = "./count_matrix/", plot=TRUE)

1B. Select top-ranked regions based on hypervariance values

The filter_hvr() function identifies the most informative genomic regions by combining hypervariance-based selection with stratified sampling across expression levels.

What this function does:

  • (Stratified feature selection) Divides the log2(mean) expression range into equal-sized bins to ensure balanced representation across low, medium, and high expression levels.

  • (Hypervariance filtering) Selects top hypervariant regions from each bin, prioritizing regions with biological variability (hypervariance > 1 indicates overdispersion).

  • (Expression threshold) Applies a percentile-based cutoff to retain only highly expressed regions, removing lowly expressed features that may be noisy.

  • (Count matrix extraction) Retrieves corresponding count data for selected informative regions from the full transformed count matrix.

  • (Quality control visualization) Generates optional diagnostic plots showing:

    • Mean-variance relationship with selected regions highlighted
    • Hypervariance distribution across expression levels with selection boundaries
    • Visual confirmation of balanced sampling strategy
  • (Efficient processing) Uses vectorized operations and selective column loading for fast processing of large genomic datasets.

Parameters

Parameter Type Required Description Example
transformed_cm_path character Yes Path to transformed count matrix file (.feather format with “pos” column as region identifier) transformed_cm_path = "normalized_counts.feather"
mav_stats_path character Yes Path to mean-variance statistics file (.feather format) generated from upstream MAV screening analysis mav_stats_path = "mav_screen/data_mav_screen.feather"
out_dir character No (default: “./”) Output directory path for saving filtered results and optional plots out_dir = "./informative_results"
n_bins integer No (default: 100) Number of bins to divide log2(mean) range for stratified sampling, ensuring representation across expression levels n_bins = 50
keep_percent numeric No (default: 0.01) Fraction of total regions to retain (0-1). Selected regions are distributed equally across all bins keep_percent = 0.05
log2mean_quantile_thres numeric No (default: 0.99) Percentile threshold (0-1) for log2(mean) expression. Only regions above this threshold are retained in final selection log2mean_quantile_thres = 0.95
plot logical No (default: FALSE) Whether to generate diagnostic scatter plots visualizing the selection process plot = TRUE
  • Input file requirements:

    • mav_stats_path: Must be the output .feather file from mav_screen() containing columns: pos, log2_mean, log2_var, and hypervar. The function will automatically select only these required columns during loading for memory efficiency.

    • transformed_cm_path: Must be a pre-processed count matrix in .feather format. This should be the same normalized matrix used as input to mav_screen().

  • Feature filtering behavior:

    • Regions with hypervar ≤ 1 are automatically excluded as they represent features with expected or reduced variance (not informative for downstream analysis)
    • The stratified binning ensures balanced selection across expression levels, preventing bias toward highly expressed features
    • The final log2mean_quantile_thres threshold removes lowly expressed regions that may be dominated by technical noise
  • Parameter tuning recommendations: If your analysis yields too few selected regions, this typically occurs when you’re analyzing a targeted genomic subset (e.g., promoter regions, CpG islands, specific gene loci) rather than genome-wide bins. The combination of default keep_percent = 0.01 (1%) and log2mean_quantile_thres = 0.99 (99th percentile) is too stringen.

    Scenario keep_percent log2mean_quantile_thres Expected output
    Genome-wide bins (default) 0.01 (1%) 0.99 (99th) ~1,000-5,000 regions
    Targeted regions (promoters, peaks) 0.05-0.10 (5-10%) 0.90-0.95 (90-95th) ~2,000-10,000 regions
    Very sparse data 0.10-0.20 (10-20%) 0.85-0.90 (85-90th) ~5,000-20,000 regions
    Large datasets (>100K input regions) Reduce n_bins to 50 Keep defaults Faster computation

Output Files

The function generates the following output files in the specified out_dir (default: “./”):

  1. Filtered Count Matrix - <input_name>_filtered_regions.feather
    • Feather format file containing count data for filtered regions
    • First column “pos” contains region identifiers
    • Subsequent columns contain count values for each sample
    • Only includes regions passing all selection criteria:
      • Hypervariance > 1 (overdispersed)
      • Top hypervariant features within each log2(mean) bin
      • log2(mean) above the specified quantile threshold
    • Dimensions: n_selected_regions × (n_samples + 1)
pos H3K36me3-H3K36me3 H3K36me3-H3K4me1 H3K27ac-H3K36me3 H3K36me3-H3K9me2
chr1_9601_10400 0.03135214 0.02224862 0.03659935 0.03001337
chr1_10401_11200 0.02891244 0.01988311 0.03451229 0.02833411
chr1_11201_12000 0.02750192 0.01844290 0.03329854 0.02701144
chr1_12001_12800 0.02933155 0.02015533 0.03510177 0.02985421
chr1_12801_13600 0.02684492 0.01791241 0.03166388 0.02622109
chr1_13601_14400 0.02599817 0.01762288 0.03088455 0.02548933
chr1_14401_15200 0.02744561 0.01819877 0.03255291 0.02680714
chr1_15201_16000 0.02819945 0.01888922 0.03394107 0.02790365
  1. Diagnostic Plots - <input_name>_filtered_regions.png (if plot = TRUE)

This combined figure contains two complementary visualizations of the selection process:

  • Panel A: Mean-Variance Relationship
    • Scatter plot of log2(mean) vs log2(variance)
    • Shows how selected regions relate to overall mean-variance distribution
    • Highlights selected regions:
      • Light Green points: All regions
      • Light Orange points: Final selected highly-variable regions
  • Panel B: Hypervariance vs Expression
    • Scatter plot of log2(mean) vs hypervariance
    • Visualizes the selection process across expression levels
    • Three layers of points:
      • Blue points: All regions from hypervar summary
      • Red points: Regions selected from binned filtering
      • Green points: Final selected highly-variable regions after quantile filtering

Example Usage

library(epigenomeR)
mav_path <- "./count_matrix/Count_Matrix_800_merged_transformed_mav_screen.feather"
path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
informative_regions(transformed_cm_pat = path, mav_stats_path = mav_path, out_dir = "./count_matrix/", plot=TRUE)

2. Combinatory Landscape of Co-localized CRF Pairs

2A. Automated Clustering and Heatmap Generation

The biclustering() function performs bidirectional k-means clustering on genomic count matrices and generates cluster assignment files along with publication-ready heatmap visualizations.

What this function does:

  • (Consensus k-means clustering) Applies k-means clustering independently to rows (genomic features) and columns (samples). For robustness, consensus clustering aggregates results from multiple k-means runs (controlled by row_repeats and col_repeats in the underlying algorithm) to identify stable cluster assignments.

  • (Hierarchical cluster ordering) After initial k-means assignment, clusters are reordered hierarchically to optimize visual interpretation in heatmaps. For each dimension (rows/columns), cluster centroids (mean profiles) are calculated and hierarchically clustered using specified distance metrics and linkage methods. The resulting dendrogram is reordered by centroid weights to place similar clusters adjacent to each other.

  • (Within-cluster feature ordering) Within each cluster, individual features are reordered using hierarchical clustering to reveal internal structure and gradual transitions between expression patterns. This two-level organization (between-cluster + within-cluster) ensures both global pattern recognition and local detail preservation.

  • (Integrated heatmap generation) Automatically creates publication-ready heatmaps with the clustered and ordered matrix, using the generated cluster assignment files to add annotation tracks. Heatmap aesthetics (color ranges, font sizes, column name display) are fully customizable.

Parameters

Parameter Type Required Description Example
cm_path character Yes Path to count matrix .feather file with ‘pos’ column (rows=genomic positions, cols=samples) cm_path = "normalized_counts.feather"
row_km integer Yes Number of k-means clusters for rows (genomic features/regions) row_km = 15
col_km integer Yes Number of k-means clusters for columns (samples/conditions) col_km = 16
out_dir character Yes Directory to save cluster assignment files and heatmap output out_dir = "./clustering_results"
seed integer No (default: 123) Random seed for reproducible clustering results seed = 42
plot logical No (default: TRUE) Whether to generate and save heatmap visualization plot = FALSE
show_column_names logical No (default: FALSE) Whether to display column names at the bottom of the heatmap show_column_names = TRUE
lower_range numeric No (default: NULL) Lower bound for heatmap color scale. If NULL, uses minimum value from data lower_range = 0
upper_range numeric No (default: NULL) Upper bound for heatmap color scale. If NULL, uses maximum value from data upper_range = 10
row_title_fontsize numeric No (default: NULL) Font size for row cluster titles (A, B, C, etc.) row_title_fontsize = 40
col_title_fontsize numeric No (default: NULL) Font size for column cluster titles (1, 2, 3, etc.) col_title_fontsize = 22
legend_title_fontsize numeric No (default: NULL) Font size for legend title legend_title_fontsize = 40
legend_label_fontsize numeric No (default: NULL) Font size for legend tick labels legend_label_fontsize = 30
  • Input requirements:

    • Count matrix must be in .feather format
    • First column must be named “pos” containing feature identifiers
    • Matrix must contain numeric values only
  • Reproducibility: Always set seed parameter for reproducible results, as k-means involves random initialization.

  • Color scale interpretation:

    • Default: Auto-scales to data range
    • Custom: Use lower_range and upper_range for consistent scales across analyses
    • Blue = Low signal, White = Medium, Red = High signal

Output Files

The function generates the following output files in the specified out_dir:

  1. Row cluster assignments - row_table.tsv
    • Tab-separated file with row cluster assignments
    • Two columns:
      • feature: Genomic region identifier (e.g., “chr1_1000_2000”)
      • label: Cluster label as letter (A, B, C, D, etc.)
    • Sorted by cluster label for easy inspection
    • Rows within each cluster are ordered by hierarchical clustering
    • Used for annotating genomic regions by cluster membership
      A tibble: 20 × 2
      feature label
      <chr> <chr>
      1 chr16_89995201_89996000 A
      2 chr3_10401_11200 A
      3 chrX_156029601_156030400 A
      4 chr21_8234401_8235200 A
      5 chr7_64849601_64850400 A
      6 chr2_85993601_85994400 A
      7 chr7_106568801_106569600 B
      8 chr19_4374401_4375200 B
      9 chr10_72320801_72321600 B
      10 chr1_30768801_30769600 B
      11 chr1_28259201_28260000 C
      12 chr16_1379201_1380000 C
      13 chr6_43771201_43772000 C
      14 chr16_28824001_28824800 C
      15 chr11_65497601_65498400 C
      16 chr6_31958401_31959200 D
      17 chr18_3448001_3448800 D
      18 chr19_1044801_1045600 D
      19 chr17_75046401_75047200 D
      20 chr2_85602401_85603200 D
  2. Column cluster assignments - col_table.tsv
    • Tab-separated file with column (sample) cluster assignments
    • Two columns:
      • feature: Sample identifier (e.g., “YY1-cJun”)
      • label: Cluster label as number (1, 2, 3, etc.)
    • Sorted by cluster label
    • Columns within each cluster are ordered by hierarchical clustering
    • Used for grouping samples by similarity
      A tibble: 12 × 2
      feature label
      <chr> <dbl>
      1 H3K9me3-H3K9me3 1
      2 H3K27ac-H3K4me3 2
      3 H3K4me1-H3K9me3 2
      4 H3K9me2-H3K9me3 2
      5 H3K27ac-H3K4me1 2
      6 H3K27ac-H3K27me3 2
      7 H3K27ac-H3K9me3 2
      8 H3K27ac-H3K27ac 2
      9 H3K4me1-H3K4me3 3
      10 H3K27me3-H3K4me3 3
      11 H3K4me3-H3K9me3 3
      12 H3K4me3-H3K4me3 4
  3. Clustered heatmap - figures/biclustering_heatmap.pdf
    • Publication-ready heatmap in PDF format
    • Features:
      • Rows organized by k-means clusters (labeled A, B, C, etc.)
      • Columns organized by k-means clusters (labeled 1, 2, 3, etc.)
      • Color scale: Blue (low) → White (middle) → Red (high)
      • Horizontal legend positioned at bottom
      • White gaps between clusters for visual separation (3mm)
      • Customizable font sizes for titles and labels
      • Transparent background for publication flexibility
    • Dimensions automatically calculated based on matrix size (5mm per feature/sample)

Example Usage

library(epigenomeR)
path <- "/dcs05/hongkai/data/next_cutntag/bulk/wgc/mixed/800/V1V2_mixed_800_colQC-all-qc_libnorm_noAllZero_log2_qnorm_gam-40_per-0.01_mean-0.99.feather"
biclustering(cm_path = path, row_km = 15, col_km = 16, out_dir = "/users/yhu1/next_cutntag/mav_screen/figures/test_plots")

2B. (Optional) Add Non-Informative Regions Back to Clusters

The add_regions_back_to_cluster() function assigns cluster labels to genomic regions that were excluded from the highly variable feature set by correlating them with existing cluster signatures.

What this function does:

This function recovers regions that were filtered out during feature selection and assigns them to the row clusters generated by the biclustering() function.

  • (Filter non-highly-variable regions) Starting with the original count matrix, the function identifies non-highly-variable regions (those excluded from clustering) and filters them based on a minimum non-zero count threshold at the raw count level. Specifically, regions must have non-zero counts in more than a specified number of samples (controlled by cutoff_non_zero) to be considered for cluster assignment.

  • (Correlation-based cluster assignment) For each row cluster from biclustering, the function calculates a signature profile by averaging the transformed expression values of highly variable regions assigned to that cluster. Non-highly-variable regions are then correlated against these cluster signatures using their transformed values. Regions with maximum correlation values exceeding the specified quantile threshold (controlled by quantile_threshold) are assigned to their best-matching cluster.

  • (Priority-based label assignment) The final output combines all regions into a priority-based classification:

    1. Original cluster assignments from biclustering (highest priority)
    2. Correlation-based assignments for high-correlation regions
    3. CRF_specific for low-correlation candidates that passed the non-zero filter
    4. Background for all remaining regions with non-zero counts

Parameters

Parameter Type Required Description Example
orig_cm_path character Yes Path to feather file containing the original count matrix from build_count_matrix (all regions, untransformed) with ‘pos’ column as first column orig_cm_path = "count_matrix.feather"
transformed_cm_path character Yes Path to feather file with transformed count data (e.g., log-normalized, scaled) for correlation analysis transformed_cm_path = "qnorm_counts.feather"
filtered_cm_path character Yes Path to feather file containing informative/significant regions used for original clustering filtered_cm_path = "informative_regions.feather"
row_cluster_path character Yes Path to TSV file with ‘feature’ and ‘label’ columns defining cluster assignments for informative regions row_cluster_path = "row_table.tsv"
out_dir character Yes Directory to save output files (tables and optional plots) out_dir = "./results"
cutoff_non_zero integer No (default: 10) Minimum number of non-zero samples required per region. Regions with MORE than this many non-zero values are kept cutoff_non_zero = 15
quantile_threshold numeric No (default: 0.75) Quantile threshold (0-1) for filtering high-correlation regions. Only regions above this quantile are assigned cluster labels quantile_threshold = 0.80
plot logical No (default: FALSE) Whether to generate and save correlation distribution histogram plot = TRUE
  • Input file requirements:

    • All feather files must have ‘pos’ column as first column
    • cluster_path TSV must have ‘feature’ and ‘label’ columns
    • Feature names must be consistent across all input files
  • Non-zero filtering: The cutoff_non_zero = 10 means regions must have MORE than 10 non-zero samples (not equal to 10).

  • Quantile threshold interpretation: quantile_threshold = 0.75 means only regions with correlation in the top 25% receive cluster assignments.

Output Files

The function generates the following output files in out_dir:

  1. Complete feature-label table - row_table_all.tsv
    • Comprehensive table containing all genomic regions with assigned labels
    • Two columns:
      • feature: Genomic region identifier (chromosome coordinates)
      • label: Assigned cluster label or category
    • Label priority hierarchy (highest to lowest):
      • Original cluster assignments: From row_cluster_path (informative region clusters)
      • Correlation-based assignments: High-correlation regions matched to cluster signatures
      • CRF_specific: Regions passing non-zero filter but below correlation threshold
      • Background: All other regions with non-zero counts
    • Tab-delimited format (.tsv)
    • Includes all non-zero regions from the genome
    feature label
    chr1_9601_10400 J
    chr1_10401_11200 CRF_specific
    chr1_12801_13600 Background
    chr1_14401_15200 Background
    chr1_15201_16000 Background
    chr1_16001_16800 Background
  2. Filtered feature-label table - row_table_clean.tsv
    • Contains only cluster-assigned regions (excludes “Background” and “CRF_specific” labels)
    • Same two-column format as complete table
    • Tab-delimited format (.tsv)
    • Useful for downstream analyses focusing on clustered regions only
  3. Correlation distribution histogram - correlation_histogram.png (if plot = TRUE)
    • PNG format visualization showing distribution of maximum correlation values
    • X-axis: Maximum correlation between regions and cluster signatures
    • Y-axis: Frequency of regions
    • Red dashed line: Quantile threshold for filtering
    • Legend: Shows quantile threshold value
    • Resolution: 800 × 600 pixels
    • Helps visualize correlation quality and threshold selection
    • Color scheme: Light blue bars with black borders

Example Usage

library(epigenomeR)

# define input file paths
orig_cm_path <- "./count_matrix.feather"
transformed_cm_path <- "./count_matrix_transformed.feather"
filtered_cm_path <- "./count_matrix_mav_screen_filtered_regions.feather"
row_cluster_path <- "./row_table.tsv"
out_dir <- "./add_regions_results"

# run the function
add_regions_back_to_cluster(
  orig_cm_path = orig_cm_path,
  transformed_cm_path = transformed_cm_path,
  filtered_cm_path = filtered_cm_path,
  row_cluster_path = row_cluster_path,
  out_dir = out_dir,
  cutoff_non_zero = 10,
  quantile_threshold = 0.75,
  plot = TRUE
)

# check output file
row_table_all <- read.table(file.path(out_dir, "row_table_all.tsv"), 
                             header = TRUE, sep = "\t")
row_table_clean <- read.table(file.path(out_dir, "row_table_clean.tsv"), 
                               header = TRUE, sep = "\t")

2C. (Optional) Apply Pre-Computed Clusters to Generate Heatmaps

After adding non-highly-variable regions back to clusters, the biclustering_heatmap() function can be used to generate heatmaps using the expanded cluster assignments (typically from row_table_clean.tsv). This function creates publication-ready visualizations without re-running the clustering algorithm. Note that this function is also called internally by the biclustering() function to generate the initial heatmap after performing bidirectional k-means clustering.

What this function does:

  • (Load and validate cluster assignments) Reads row and column cluster assignment files and validates that features in the cluster files match those present in the input count matrix. Typically uses the clean cluster assignments (row_table_clean.tsv) that exclude “Background” and “CRF_specific” labels.

  • (Order matrix by clusters) Reorders the count matrix rows and columns according to cluster assignments, ensuring features within the same cluster are grouped together for visualization.

  • (Configure color scaling) Sets up a diverging color scheme (blue → white → red) with customizable value ranges. If ranges are not specified, automatically determines appropriate bounds from the data.

  • (Calculate optimal dimensions) Automatically computes cell sizes and heatmap dimensions based on the number of clusters and their sizes, ensuring cluster labels are readable and properly positioned. Adjusts legend font sizes to fit within the available space.

  • (Generate publication-ready heatmap) Creates a PDF heatmap with:

    • Rows split by cluster assignments (labeled A, B, C, …)
    • Columns split by cluster assignments (labeled 1, 2, 3, …)
    • White gaps between clusters for visual separation
    • Horizontal legend at bottom showing color scale
    • Transparent background for publication flexibility
    • High-resolution rasterized cells for efficient rendering

Parameters

Parameter Type Required Description Example
mat matrix Yes Count matrix with genomic regions as rows and samples as columns. Must have row names (region IDs) and column names (sample IDs) mat = as.matrix(count_data)
row_cluster_file_path character Yes Path to TSV file with ‘feature’ and ‘label’ columns defining row cluster assignments. Typically uses row_table_clean.tsv from add_regions_back_to_cluster() row_cluster_file_path = "row_table_clean.tsv"
col_cluster_file_path character Yes Path to TSV file with ‘feature’ and ‘label’ columns defining column cluster assignments col_cluster_file_path = "col_table.tsv"
out_dir character No (default: “./”) Directory to save heatmap output out_dir = "./heatmaps"
show_column_names logical No (default: FALSE) Whether to display sample names along the column axis show_column_names = TRUE
lower_range numeric No (default: NULL) Lower bound for heatmap color scale. If NULL, uses minimum value in matrix lower_range = 0
upper_range numeric No (default: NULL) Upper bound for heatmap color scale. If NULL, uses maximum value in matrix upper_range = 4.5
row_title_fontsize numeric No (default: NULL) Font size for row cluster titles (A, B, C, …). If NULL, uses 20 row_title_fontsize = 25
col_title_fontsize numeric No (default: NULL) Font size for column cluster titles (1, 2, 3, …). If NULL, uses 20 col_title_fontsize = 25
legend_title_fontsize numeric No (default: NULL) Font size for legend title. If NULL, uses 15 (auto-adjusted to fit) legend_title_fontsize = 18
legend_label_fontsize numeric No (default: NULL) Font size for legend tick labels. If NULL, uses 15 legend_label_fontsize = 15

Note: The input matrix is typically the complete transformed count matrix (e.g., from transformed_cm_path), not a pre-filtered subset. The function will automatically subset the matrix to include only features present in both the matrix and the cluster assignment files. Features in the matrix but not in cluster files will be excluded from visualization; features in cluster files but not in the matrix will be skipped.

Output Files

The function generates the following output files in the specified out_dir:

  1. Clustered heatmap(s) - biclustering_heatmap.pdf
    • Publication-ready heatmap(s) in PDF format
    • One PDF file per input count matrix
    • Features:
      • Rows ordered and split by cluster assignments from row_cluster_file_path
      • Columns ordered and split by cluster assignments from col_cluster_file_path
      • Row clusters labeled with letters (A, B, C, etc.)
      • Column clusters labeled with numbers (1, 2, 3, etc.)
      • Color scale: Blue (#3155C3) → White → Red (#AF0525)
      • Horizontal legend positioned at bottom with title “Normalized Read Counts”
      • White gaps between clusters (3mm) for visual separation
      • Transparent background for publication flexibility
      • No dendrograms (unless show_dend_boolean = TRUE shows column names)
    • Dimensions automatically calculated based on matrix size (5mm per feature/sample)

Example Usage

library(epigenomeR)

# Add non-highly-variable regions back to clusters
orig_cm_path <- "./count_matrix.feather"
transformed_cm_path <- "./count_matrix_log2_qnorm.feather"
filtered_cm_path <- "./highly_variable_regions.feather"
row_cluster_path <- "./biclustering_results/row_table.tsv"
out_dir <- "./add_regions_results"

add_regions_back_to_cluster(
  orig_cm_path = orig_cm_path,
  transformed_cm_path = transformed_cm_path,
  filtered_cm_path = filtered_cm_path,
  row_cluster_path = row_cluster_path,
  out_dir = out_dir,
  cutoff_non_zero = 10,
  quantile_threshold = 0.75,
  plot = TRUE
)

# Load the transformed count matrix
library(arrow)
library(tibble)
mat <- as.matrix(column_to_rownames(read_feather(transformed_cm_path), var = "pos"))

# Generate heatmap with expanded cluster assignments
row_table_clean_path <- file.path(out_dir, "row_table_clean.tsv")
col_table_path <- "./biclustering_results/col_table.tsv"

biclustering_heatmap(
  mat = mat,
  row_cluster_file_path = row_table_clean_path,
  col_cluster_file_path = col_table_path,
  out_dir = out_dir
)

3. (Optional) Cluster-based Annotation

3A. Annotate and Plot Regulatory Element Composition for Clustered Regions

The biclustering_genomic_distribution() function performs post-clustering genomic annotation analysis by quantifying the regulatory element composition of clustered genomic regions. Clustered regions are overlapped with multiple external annotation resources, including cCREs, ChromHMM chromatin states, and repeat elements, followed by comparative visualization across clusters.

Parameters

Parameter Type Required Default Description Example
row_cluster_file_path character Yes - Path to TSV file with cluster assignments. Must contain columns: feature (genomic coordinates as chr_start_end) and label (cluster ID). Typically the row_table_clean.tsv from biclustering output "./row_table_clean.tsv"
out_dir character No "./" Output directory for annotation results. Subdirectories are automatically created for each annotation type "./distribution_annotation"
distributions character vector No c("genic", "ccre") Annotation types to perform. Valid options: "genic" (gene features), "ccre" (cCRE elements), "chromhmm" (chromatin states), "repeat" (repeat elements). Can specify any combination c("genic", "ccre", "repeat")
ref_genome character No "hg38" Reference genome version. Options: "hg38" (Human GRCh38) or "mm10" (Mouse GRCm38) "mm10"
ref_source character No "knownGene" Gene annotation source for cCRE/genic annotation. Options: "knownGene" (UCSC) or "GENCODE". Only used when "genic" is in distributions "GENCODE"
mode character No "nearest" Annotation assignment method. Options: "nearest" (assign to closest feature) or "weighted" (proportional by overlap length) "weighted"
plot logical No TRUE Generate stacked barplot visualizations for each annotation type FALSE
  • Annotation types available:
    • "genic": Gene structural features - Promoter, 5’ UTR, Exon, Intron, 3’ UTR
    • "ccre": Candidate cis-Regulatory Elements - dELS, pELS, PLS, CA-H3K4me3, CA-CTCF, CA-TF, TF, CA
    • "chromhmm": Chromatin states - Acet, EnhWk, EnhA, PromF, TSS, TxWk, TxEx, Tx, OpenC, TxEnh, ReprPCopenC, BivProm, ZNF, ReprPC, HET, GapArtf, Quies
    • "repeat": Repetitive elements - SINE, LINE, LTR, Retroposon, RC, DNA, Satellite, Simple_repeat, Low_complexity, rRNA, tRNA, snRNA, scRNA, srpRNA, RNA, Unknown
    • For detailed description of each annotation category, see the Annotation page
  • Annotation assignment methods:
    • "nearest": Each genomic region is assigned to its closest feature (by distance to feature center)
    • "weighted": Each region is proportionally assigned to overlapping features based on overlap length
    • nearest mode is faster and simpler; weighted mode provides more accurate representation for regions spanning multiple features

Output Files

The function generates the following output files in the specified out_dir:

  1. Composition tables - {genic/ccre/chromhmm/repeat}_distribution.pdf
  • .tsv tables summarizing the percentage composition of various annotations
  • Row: cluster lables
  • Col: annotations states
  • Values represent the proportion (%) of genomic regions assigned to each state within a cluster
  • Genic distribution
    Promoter 5’ UTR Exon Intron 3’ UTR other
    A 7.37327188940092 0 2.76497695852535 13.3640552995392 0 76.4976958525346
    B 59.8802395209581 11.9760479041916 8.98203592814371 16.7664670658683 1.79640718562874 0.598802395209581
    C 54.2253521126761 11.9718309859155 13.3802816901408 14.0845070422535 0 6.33802816901408
    D 8.82352941176471 0 10.2941176470588 54.4117647058823 2.94117647058824 23.5294117647059
    E 43.3823529411765 10.2941176470588 10.2941176470588 27.2058823529412 1.47058823529412 7.35294117647059
    ……
    O 60.427807486631 10.6951871657754 13.903743315508 12.8342245989305 0 2.13903743315508
  • cCRE distribution
    dELS pELS PLS CA-H3K4me3
    A 26.7281105990783 7.37327188940092 2.76497695852535 9.67741935483871
    B 5.38922155688623 34.7305389221557 59.8802395209581 0
    C 4.22535211267606 26.7605633802817 62.6760563380282 3.52112676056338
    D 79.4117647058823 13.2352941176471 2.94117647058824 0
    E 19.8529411764706 41.9117647058824 38.2352941176471 0
    ……
    O 4.81283422459893 28.8770053475936 66.3101604278075 0
    Cluster CA-CTCF CA-TF TF CA other
    A 0.921658986175115 0 4.14746543778802 1.84331797235023 46.5437788018433
    B 0 0 0 0 0
    C 0 0 1.40845070422535 0 1.40845070422535
    D 0 0 2.94117647058824 0 1.47058823529412
    E 0 0 0 0 0
    ……
    O 0 0 0 0 0
  • chromHMM distribution
    Cluster Acet EnhWk EnhA PromF TSS OpenC
    A 3.2258064516129 0.921658986175115 2.76497695852535 5.06912442396313 0.921658986175115 0
    B 0 0 0.598802395209581 43.7125748502994 50.8982035928144 0.598802395209581
    C 1.40845070422535 0 1.40845070422535 50.7042253521127 35.9154929577465 2.11267605633803
    D 16.1764705882353 0 36.7647058823529 16.1764705882353 0 2.94117647058824
    E 1.47058823529412 0.735294117647059 4.41176470588235 38.2352941176471 38.2352941176471 2.20588235294118
    ……
    O 0.53475935828877 0 2.67379679144385 45.4545454545455 49.7326203208556 0
    TxEnh BivProm TxWk TxEx Tx
    A 0.921658986175115 0.921658986175115 0 0 0
    B 0 1.19760479041916 0 0 0
    C 0.704225352112676 0 0 0 0
    D 11.7647058823529 0 1.47058823529412 10.2941176470588 1.47058823529412
    E 0.735294117647059 10.2941176470588 0 0 0
    ……
    O 0 1.06951871657754 0 0 0
    ZNF ReprPC HET GapArtf Quies other
    A 0.921658986175115 3.2258064516129 13.3640552995392 38.2488479262673 1.84331797235023 27.6497695852535
    B 0 0 1.19760479041916 0 0 1.79640718562874
    C 0 0 1.40845070422535 2.11267605633803 0 4.22535211267606
    D 0 0 0 1.47058823529412 0 1.47058823529412
    E 0 2.94117647058824 0.735294117647059 0 0 0
    ……
    O 0 0 0 0 0 0.53475935828877
  • Repeat distribution
  1. Cmposition bar plot - {genic/ccre/chromhmm/repeat}_distribution.pdf
    • Stacked horizontal bar plot showing ChromHMM chromatin state distribution across clusters
    • X-axis: Percentage (0-100%)
    • Y-axis: Cluster labels (top to bottom)

Example Usage

library(epigenomeR)
row_cluster_file_path <- "./bicluster_row_cluster.tsv"
out_dir <- "output/figures"

annotation_ccre_hmm(row_cluster_file_path = row_cluster_file_path, out_dir = out_dir)

3B. Biclustering TFBS Annotation Pipeline

The biclustering_TFBS_enrichment() function provides a complete, automated workflow for analyzing transcription factor binding site (TFBS) enrichment across multiple clusters of genomic regions. It handles the entire pipeline from control region generation to enrichment testing and visualization.

What this function does:

  • Reads a file containing clustered genomic regions (e.g., from biclustering analysis)

  • Generates matched control regions for each cluster using gene-based matching

  • Performs TFBS enrichment analysis comparing each cluster against controls

  • Creates heatmap visualizations showing enrichment patterns across clusters

  • Saves all intermediate and final results to organized output files

Parameters

Parameter Type Required Description Example
row_cluster_file_path character Yes Path to tab-delimited file with clustered regions. Must have columns: feature (format: chr_start_end) and label (cluster ID) "bicluster_results.tsv"
out_dir character No (default: “./”) Output directory where all results will be saved "./TFBS_results/"
ref_genome character No (default: “hg38”) Reference genome version. Options: “hg38”, “hg19”, “mm10”, “mm39” ref_genome = "mm10"
control_rep integer No (default: 1) Multiplier for control region generation. Determines the ratio of control to target regions. E.g., control_rep = 2 generates 2x control regions; control_rep = 5 generates 5x control regions control_rep = 3
regions integer No (default: 800) Size in base pairs to resize all regions (centered on original midpoint) regions = 500
plot logical No (default: TRUE) Whether to generate heatmap visualizations. If FALSE, only performs enrichment analysis plot = FALSE

Input File Format:

The cluster file must be tab-delimited with at least these columns:

feature              label
chr1_1234567_1235567 cluster_1
chr2_9876543_9877543 cluster_1
chr3_5555555_5556555 cluster_2
...
  • feature: Genomic coordinates in format “chr_start_end” (underscore-separated)
  • label: Cluster assignment (e.g., “cluster_1”, “group_A”, “bicluster_2”)

Output Files

All output files are saved to the specified out_dir:

File Description
all_controls.bed BED file containing all matched control regions (combined and de-duplicated)
TFBS_enrichment_cluster_<label>.tsv One TSV per cluster with enrichment statistics (odds ratio, p-value, FDR, hit counts)
TFBS_heatmap_all.pdf Heatmap visualization of enriched motifs across all clusters (if plot = TRUE)
odds_ratio_log2.csv Matrix of log2 odds ratios for all TFs × clusters (if plot = TRUE)
FDR.csv Matrix of FDR values for all TFs × clusters (if plot = TRUE)

Usage Examples

library(epigenomeR)

# Basic usage - complete pipeline with visualization
biclustering_TFBS_enrichment(
  row_cluster_file_path = "NMF_clusters.tsv",
  out_dir = "./TFBS_analysis/",
  ref_genome = "hg38"
)

# Custom region size for enhancer analysis
biclustering_TFBS_enrichment(
  row_cluster_file_path = "enhancer_clusters.tsv",
  out_dir = "./enhancer_TFBS/",
  ref_genome = "hg38",
  regions = 1000
)

# Mouse genome analysis
biclustering_TFBS_enrichment(
  row_cluster_file_path = "mouse_peaks_clustered.tsv",
  out_dir = "./mouse_TFBS/",
  ref_genome = "mm10",
  regions = 500
)

# Generate 3x more control regions for more robust statistics
biclustering_TFBS_enrichment(
  row_cluster_file_path = "ATAC_peaks_clustered.tsv",
  out_dir = "./ATAC_TFBS/",
  ref_genome = "hg38",
  control_rep = 3,
  regions = 800
)

# Enrichment analysis only (no heatmap)
biclustering_TFBS_enrichment(
  row_cluster_file_path = "clusters.tsv",
  out_dir = "./TFBS_tables/",
  ref_genome = "hg38",
  plot = FALSE
)