This workflow provides two complementary approaches for identifying highly variable genomic regions:
Integrated workflow (recommended): The
detect_hvr() function combines mean-variance modeling and
feature selection in a single step, internally calling
mav_screen() and filter_hvr() in sequence.
library(epigenomeR)
path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
detect_hvr(transformed_cm_path = path, out_dir = "./count_matrix/")Step-by-step workflow: For users who need
fine-grained control or want to inspect intermediate results, the
individual functions mav_screen() and
filter_hvr() can be called separately.
Function Overview:
detect_hvr() - Complete pipeline
for detecting highly variable regions (wrapper function)
mav_screen() - Step 1: Model
mean-variance relationship and calculate hypervariance metrics
filter_hvr() - Step 2: Select
top-ranked regions based on hypervariance values
The mav_screen() function models the mean-variance
relationship in transformed count data and normalizes features by their
expected variance. It computes hypervariance metrics that quantify each
feature’s deviation from the expected mean-variance trend.
What this function does:
(Model mean-variance relationship) Fits a regression model (GAM or LOESS) to characterize the relationship between feature mean and variance in log2 space across all genomic regions. This establishes the baseline mean-variance trend expected under technical variation alone.
(Calculate expected variance) For each feature, computes the expected variance based on its mean expression level using the fitted model. This serves as the null expectation against which observed variance is compared.
(Normalize feature values) Centers each feature on its observed mean and scales by the expected standard deviation (square root of expected variance) derived from the model. This normalization removes the mean-dependent component of variance, allowing fair comparison across features with different expression levels.
(Quantify overdispersion) Computes hypervariance metrics (sum of squared normalized values divided by degrees of freedom) for each feature. Higher hypervariance values indicate greater variability than expected from the mean-variance trend, signaling biological overdispersion rather than technical noise.
(Generate comprehensive statistics) Saves all statistics including observed mean/variance, expected variance, normalized values, and hypervariance metrics to a feather file for downstream feature selection and quality control.
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
transformed_cm_path |
character | Yes | Path to transformed count matrix file (.feather format with “pos” column) | transformed_cm_path = "data_transformed.feather" |
out_dir |
character | No (default: “./”) | Output directory path for results | out_dir = "./mav_screen_results" |
fitting_model |
character | No (default: “gam”) | Model to fit mean-variance relationship: “gam” or “loess” | fitting_model = "loess" |
k |
integer | No (default: 40) | Number of basis functions for GAM spline fitting (only used when fitting_model = “gam”) | k = 30 |
span |
numeric | No (default: 0.5) | Smoothing span for LOESS fitting (only used when fitting_model = “loess”) | span = 0.3 |
seed |
integer | No (default: 42) | Random seed for reproducibility in sampling | seed = 123 |
font_size |
numeric | No (default: 10) | Font size for plot labels and axes | font_size = 12 |
nrow_sample_per |
numeric | No (default: 0.2) | Proportion of rows (0-1) to randomly sample for density plot | nrow_sample_per = 0.1 |
plot |
logical | No (default: FALSE) | Whether to generate and save diagnostic plots (.png files) | plot = TRUE |
Input data requirements:
The transformed_cm_path parameter refers directly to a
pre-processed count matrix stored as a .feather file. This
file must not contain raw counts. Instead, the matrix should already
be:
Additionally, the .feather file must have the first
column named “pos” containing feature identifiers (e.g.,
chr1_9601_10400).
Model selection:
Memory considerations: For very large datasets,
adjust nrow_sample_per to reduce memory usage for density
plots.
The function generates the following output files in the specified
out_dir (default: “./”):
<input_name>_mav_screen.feather
pos: Original region identifiermean: Mean expression across all samplesvar: Variance across all samplesvar_expect: Expected variance from fitted mean-variance
modelnorm_mean: Mean of normalized values (should be
~0)norm_var: Variance of normalized values (should be
~1)hypervar: Hypervariance, sum of squared normalized
values divided by (n-1)log2_mean: Log2-transformed mean expressionlog2_var: Log2-transformed variancelog2_var_expected: Log2-transformed expected variance
from fitted model| pos | mean | var | var_expect | norm_mean | norm_var | hypervar |
|---|---|---|---|---|---|---|
| chr1_9601_10400 | 0.60482747 | 3.2718638368 | 2.999402e+00 | 2.999402e+00 | -2.400220e-17 | 1.0908388 |
| chr1_10401_11200 | 0.06557750 | 0.1646858526 | 1.620011e-01 | 1.620011e-01 | -6.149439e-18 | 1.0165723 |
| chr1_12801_13600 | 0.03007577 | 0.0003917656 | 6.413126e-04 | 6.413126e-04 | -1.840330e-18 | 0.6108808 |
| chr1_14401_15200 | 0.02905376 | 0.0001423452 | 1.494525e-04 | 1.494525e-04 | 9.551020e-17 | 0.9524448 |
| chr1_15201_16000 | 0.02994487 | 0.0007778492 | 5.396820e-04 | 5.396820e-04 | -4.221334e-17 | 1.4413104 |
| chr1_16001_16800 | 0.02875908 | 0.0001076822 | 9.580701e-05 | 9.580701e-05 | -1.361187e-16 | 1.1239494 |
| log2_mean | log2_var | log2_var_expected |
|---|---|---|
| -0.7254044 | 1.710113 | 1.585175 |
| -3.9306552 | -2.602211 | -2.625862 |
| -5.0552546 | -11.317722 | -10.603762 |
| -5.1051311 | -12.778318 | -12.731808 |
| -5.0615474 | -10.328222 | -10.873420 |
| -5.1198388 | -13.180932 | -13.352449 |
<input_name>_mean_variance.png (if
plot = TRUE)

<input_name>_fit_density.png (if
plot = TRUE)
nrow_sample_per and seed)
library(epigenomeR)
path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
mav_screen(transformed_cm_path = path, out_dir = "./count_matrix/", plot=TRUE)
The filter_hvr() function identifies the most
informative genomic regions by combining hypervariance-based selection
with stratified sampling across expression levels.
What this function does:
(Stratified feature selection) Divides the log2(mean) expression range into equal-sized bins to ensure balanced representation across low, medium, and high expression levels.
(Hypervariance filtering) Selects top hypervariant regions from each bin, prioritizing regions with biological variability (hypervariance > 1 indicates overdispersion).
(Expression threshold) Applies a percentile-based cutoff to retain only highly expressed regions, removing lowly expressed features that may be noisy.
(Count matrix extraction) Retrieves corresponding count data for selected informative regions from the full transformed count matrix.
(Quality control visualization) Generates optional diagnostic plots showing:
(Efficient processing) Uses vectorized operations and selective column loading for fast processing of large genomic datasets.
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
transformed_cm_path |
character | Yes | Path to transformed count matrix file (.feather format with “pos” column as region identifier) | transformed_cm_path = "normalized_counts.feather" |
mav_stats_path |
character | Yes | Path to mean-variance statistics file (.feather format) generated from upstream MAV screening analysis | mav_stats_path = "mav_screen/data_mav_screen.feather" |
out_dir |
character | No (default: “./”) | Output directory path for saving filtered results and optional plots | out_dir = "./informative_results" |
n_bins |
integer | No (default: 100) | Number of bins to divide log2(mean) range for stratified sampling, ensuring representation across expression levels | n_bins = 50 |
keep_percent |
numeric | No (default: 0.01) | Fraction of total regions to retain (0-1). Selected regions are distributed equally across all bins | keep_percent = 0.05 |
log2mean_quantile_thres |
numeric | No (default: 0.99) | Percentile threshold (0-1) for log2(mean) expression. Only regions above this threshold are retained in final selection | log2mean_quantile_thres = 0.95 |
plot |
logical | No (default: FALSE) | Whether to generate diagnostic scatter plots visualizing the selection process | plot = TRUE |
Input file requirements:
mav_stats_path: Must be the output
.feather file from mav_screen() containing
columns: pos, log2_mean,
log2_var, and hypervar. The function will
automatically select only these required columns during loading for
memory efficiency.
transformed_cm_path: Must be a
pre-processed count matrix in .feather format. This should
be the same normalized matrix used as input to
mav_screen().
Feature filtering behavior:
hypervar ≤ 1 are automatically excluded as
they represent features with expected or reduced variance (not
informative for downstream analysis)log2mean_quantile_thres threshold removes
lowly expressed regions that may be dominated by technical noiseParameter tuning recommendations: If your
analysis yields too few selected regions, this typically occurs when
you’re analyzing a targeted genomic subset (e.g.,
promoter regions, CpG islands, specific gene loci) rather than
genome-wide bins. The combination of default
keep_percent = 0.01 (1%) and
log2mean_quantile_thres = 0.99 (99th percentile) is too
stringen.
| Scenario | keep_percent |
log2mean_quantile_thres |
Expected output |
|---|---|---|---|
| Genome-wide bins (default) | 0.01 (1%) | 0.99 (99th) | ~1,000-5,000 regions |
| Targeted regions (promoters, peaks) | 0.05-0.10 (5-10%) | 0.90-0.95 (90-95th) | ~2,000-10,000 regions |
| Very sparse data | 0.10-0.20 (10-20%) | 0.85-0.90 (85-90th) | ~5,000-20,000 regions |
| Large datasets (>100K input regions) | Reduce n_bins to 50 |
Keep defaults | Faster computation |
The function generates the following output files in the specified
out_dir (default: “./”):
<input_name>_filtered_regions.feather
| pos | H3K36me3-H3K36me3 | H3K36me3-H3K4me1 | H3K27ac-H3K36me3 | H3K36me3-H3K9me2 |
|---|---|---|---|---|
| chr1_9601_10400 | 0.03135214 | 0.02224862 | 0.03659935 | 0.03001337 |
| chr1_10401_11200 | 0.02891244 | 0.01988311 | 0.03451229 | 0.02833411 |
| chr1_11201_12000 | 0.02750192 | 0.01844290 | 0.03329854 | 0.02701144 |
| chr1_12001_12800 | 0.02933155 | 0.02015533 | 0.03510177 | 0.02985421 |
| chr1_12801_13600 | 0.02684492 | 0.01791241 | 0.03166388 | 0.02622109 |
| chr1_13601_14400 | 0.02599817 | 0.01762288 | 0.03088455 | 0.02548933 |
| chr1_14401_15200 | 0.02744561 | 0.01819877 | 0.03255291 | 0.02680714 |
| chr1_15201_16000 | 0.02819945 | 0.01888922 | 0.03394107 | 0.02790365 |
<input_name>_filtered_regions.png (if
plot = TRUE)This combined figure contains two complementary visualizations of the selection process:

library(epigenomeR)
mav_path <- "./count_matrix/Count_Matrix_800_merged_transformed_mav_screen.feather"
path <- "./count_matrix/Count_Matrix_800_merged_transformed.feather"
informative_regions(transformed_cm_pat = path, mav_stats_path = mav_path, out_dir = "./count_matrix/", plot=TRUE)
The biclustering() function performs bidirectional
k-means clustering on genomic count matrices and generates cluster
assignment files along with publication-ready heatmap
visualizations.
What this function does:
(Consensus k-means clustering) Applies k-means
clustering independently to rows (genomic features) and columns
(samples). For robustness, consensus clustering aggregates results from
multiple k-means runs (controlled by row_repeats and
col_repeats in the underlying algorithm) to identify stable
cluster assignments.
(Hierarchical cluster ordering) After initial k-means assignment, clusters are reordered hierarchically to optimize visual interpretation in heatmaps. For each dimension (rows/columns), cluster centroids (mean profiles) are calculated and hierarchically clustered using specified distance metrics and linkage methods. The resulting dendrogram is reordered by centroid weights to place similar clusters adjacent to each other.
(Within-cluster feature ordering) Within each cluster, individual features are reordered using hierarchical clustering to reveal internal structure and gradual transitions between expression patterns. This two-level organization (between-cluster + within-cluster) ensures both global pattern recognition and local detail preservation.
(Integrated heatmap generation) Automatically creates publication-ready heatmaps with the clustered and ordered matrix, using the generated cluster assignment files to add annotation tracks. Heatmap aesthetics (color ranges, font sizes, column name display) are fully customizable.
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
cm_path |
character | Yes | Path to count matrix .feather file with ‘pos’ column (rows=genomic positions, cols=samples) | cm_path = "normalized_counts.feather" |
row_km |
integer | Yes | Number of k-means clusters for rows (genomic features/regions) | row_km = 15 |
col_km |
integer | Yes | Number of k-means clusters for columns (samples/conditions) | col_km = 16 |
out_dir |
character | Yes | Directory to save cluster assignment files and heatmap output | out_dir = "./clustering_results" |
seed |
integer | No (default: 123) | Random seed for reproducible clustering results | seed = 42 |
plot |
logical | No (default: TRUE) | Whether to generate and save heatmap visualization | plot = FALSE |
show_column_names |
logical | No (default: FALSE) | Whether to display column names at the bottom of the heatmap | show_column_names = TRUE |
lower_range |
numeric | No (default: NULL) | Lower bound for heatmap color scale. If NULL, uses minimum value from data | lower_range = 0 |
upper_range |
numeric | No (default: NULL) | Upper bound for heatmap color scale. If NULL, uses maximum value from data | upper_range = 10 |
row_title_fontsize |
numeric | No (default: NULL) | Font size for row cluster titles (A, B, C, etc.) | row_title_fontsize = 40 |
col_title_fontsize |
numeric | No (default: NULL) | Font size for column cluster titles (1, 2, 3, etc.) | col_title_fontsize = 22 |
legend_title_fontsize |
numeric | No (default: NULL) | Font size for legend title | legend_title_fontsize = 40 |
legend_label_fontsize |
numeric | No (default: NULL) | Font size for legend tick labels | legend_label_fontsize = 30 |
Input requirements:
Reproducibility: Always set seed
parameter for reproducible results, as k-means involves random
initialization.
Color scale interpretation:
lower_range and upper_range
for consistent scales across analysesThe function generates the following output files in the specified
out_dir:
row_table.tsv
feature: Genomic region identifier (e.g.,
“chr1_1000_2000”)label: Cluster label as letter (A, B, C, D, etc.)| feature | label | |
|---|---|---|
| <chr> | <chr> | |
| 1 | chr16_89995201_89996000 | A |
| 2 | chr3_10401_11200 | A |
| 3 | chrX_156029601_156030400 | A |
| 4 | chr21_8234401_8235200 | A |
| 5 | chr7_64849601_64850400 | A |
| 6 | chr2_85993601_85994400 | A |
| 7 | chr7_106568801_106569600 | B |
| 8 | chr19_4374401_4375200 | B |
| 9 | chr10_72320801_72321600 | B |
| 10 | chr1_30768801_30769600 | B |
| 11 | chr1_28259201_28260000 | C |
| 12 | chr16_1379201_1380000 | C |
| 13 | chr6_43771201_43772000 | C |
| 14 | chr16_28824001_28824800 | C |
| 15 | chr11_65497601_65498400 | C |
| 16 | chr6_31958401_31959200 | D |
| 17 | chr18_3448001_3448800 | D |
| 18 | chr19_1044801_1045600 | D |
| 19 | chr17_75046401_75047200 | D |
| 20 | chr2_85602401_85603200 | D |
col_table.tsv
feature: Sample identifier (e.g., “YY1-cJun”)label: Cluster label as number (1, 2, 3, etc.)| feature | label | |
|---|---|---|
| <chr> | <dbl> | |
| 1 | H3K9me3-H3K9me3 | 1 |
| 2 | H3K27ac-H3K4me3 | 2 |
| 3 | H3K4me1-H3K9me3 | 2 |
| 4 | H3K9me2-H3K9me3 | 2 |
| 5 | H3K27ac-H3K4me1 | 2 |
| 6 | H3K27ac-H3K27me3 | 2 |
| 7 | H3K27ac-H3K9me3 | 2 |
| 8 | H3K27ac-H3K27ac | 2 |
| 9 | H3K4me1-H3K4me3 | 3 |
| 10 | H3K27me3-H3K4me3 | 3 |
| 11 | H3K4me3-H3K9me3 | 3 |
| 12 | H3K4me3-H3K4me3 | 4 |
figures/biclustering_heatmap.pdf

library(epigenomeR)
path <- "/dcs05/hongkai/data/next_cutntag/bulk/wgc/mixed/800/V1V2_mixed_800_colQC-all-qc_libnorm_noAllZero_log2_qnorm_gam-40_per-0.01_mean-0.99.feather"
biclustering(cm_path = path, row_km = 15, col_km = 16, out_dir = "/users/yhu1/next_cutntag/mav_screen/figures/test_plots")
The add_regions_back_to_cluster() function assigns
cluster labels to genomic regions that were excluded from the highly
variable feature set by correlating them with existing cluster
signatures.
What this function does:
This function recovers regions that were filtered out during feature
selection and assigns them to the row clusters generated by the
biclustering() function.
(Filter non-highly-variable regions) Starting
with the original count matrix, the function identifies
non-highly-variable regions (those excluded from clustering) and filters
them based on a minimum non-zero count threshold at the raw count level.
Specifically, regions must have non-zero counts in more than a specified
number of samples (controlled by cutoff_non_zero) to be
considered for cluster assignment.
(Correlation-based cluster assignment) For each
row cluster from biclustering, the function calculates a signature
profile by averaging the transformed expression values of highly
variable regions assigned to that cluster. Non-highly-variable regions
are then correlated against these cluster signatures using their
transformed values. Regions with maximum correlation values exceeding
the specified quantile threshold (controlled by
quantile_threshold) are assigned to their best-matching
cluster.
(Priority-based label assignment) The final output combines all regions into a priority-based classification:
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
orig_cm_path |
character | Yes | Path to feather file containing the original count matrix from build_count_matrix (all regions, untransformed) with ‘pos’ column as first column | orig_cm_path = "count_matrix.feather" |
transformed_cm_path |
character | Yes | Path to feather file with transformed count data (e.g., log-normalized, scaled) for correlation analysis | transformed_cm_path = "qnorm_counts.feather" |
filtered_cm_path |
character | Yes | Path to feather file containing informative/significant regions used for original clustering | filtered_cm_path = "informative_regions.feather" |
row_cluster_path |
character | Yes | Path to TSV file with ‘feature’ and ‘label’ columns defining cluster assignments for informative regions | row_cluster_path = "row_table.tsv" |
out_dir |
character | Yes | Directory to save output files (tables and optional plots) | out_dir = "./results" |
cutoff_non_zero |
integer | No (default: 10) | Minimum number of non-zero samples required per region. Regions with MORE than this many non-zero values are kept | cutoff_non_zero = 15 |
quantile_threshold |
numeric | No (default: 0.75) | Quantile threshold (0-1) for filtering high-correlation regions. Only regions above this quantile are assigned cluster labels | quantile_threshold = 0.80 |
plot |
logical | No (default: FALSE) | Whether to generate and save correlation distribution histogram | plot = TRUE |
Input file requirements:
cluster_path TSV must have ‘feature’ and ‘label’
columnsNon-zero filtering: The
cutoff_non_zero = 10 means regions must have MORE than 10
non-zero samples (not equal to 10).
Quantile threshold interpretation:
quantile_threshold = 0.75 means only regions with
correlation in the top 25% receive cluster assignments.
The function generates the following output files in
out_dir:
row_table_all.tsv
feature: Genomic region identifier (chromosome
coordinates)label: Assigned cluster label or categoryrow_cluster_path (informative region clusters)| feature | label |
|---|---|
| chr1_9601_10400 | J |
| chr1_10401_11200 | CRF_specific |
| chr1_12801_13600 | Background |
| chr1_14401_15200 | Background |
| chr1_15201_16000 | Background |
| chr1_16001_16800 | Background |
row_table_clean.tsv
correlation_histogram.png (if plot = TRUE)

library(epigenomeR)
# define input file paths
orig_cm_path <- "./count_matrix.feather"
transformed_cm_path <- "./count_matrix_transformed.feather"
filtered_cm_path <- "./count_matrix_mav_screen_filtered_regions.feather"
row_cluster_path <- "./row_table.tsv"
out_dir <- "./add_regions_results"
# run the function
add_regions_back_to_cluster(
orig_cm_path = orig_cm_path,
transformed_cm_path = transformed_cm_path,
filtered_cm_path = filtered_cm_path,
row_cluster_path = row_cluster_path,
out_dir = out_dir,
cutoff_non_zero = 10,
quantile_threshold = 0.75,
plot = TRUE
)
# check output file
row_table_all <- read.table(file.path(out_dir, "row_table_all.tsv"),
header = TRUE, sep = "\t")
row_table_clean <- read.table(file.path(out_dir, "row_table_clean.tsv"),
header = TRUE, sep = "\t")
After adding non-highly-variable regions back to clusters, the
biclustering_heatmap() function can be used to generate
heatmaps using the expanded cluster assignments (typically from
row_table_clean.tsv). This function creates
publication-ready visualizations without re-running the clustering
algorithm. Note that this function is also called internally by the
biclustering() function to generate the initial heatmap
after performing bidirectional k-means clustering.
What this function does:
(Load and validate cluster assignments) Reads
row and column cluster assignment files and validates that features in
the cluster files match those present in the input count matrix.
Typically uses the clean cluster assignments
(row_table_clean.tsv) that exclude “Background” and
“CRF_specific” labels.
(Order matrix by clusters) Reorders the count matrix rows and columns according to cluster assignments, ensuring features within the same cluster are grouped together for visualization.
(Configure color scaling) Sets up a diverging color scheme (blue → white → red) with customizable value ranges. If ranges are not specified, automatically determines appropriate bounds from the data.
(Calculate optimal dimensions) Automatically computes cell sizes and heatmap dimensions based on the number of clusters and their sizes, ensuring cluster labels are readable and properly positioned. Adjusts legend font sizes to fit within the available space.
(Generate publication-ready heatmap) Creates a PDF heatmap with:
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
mat |
matrix | Yes | Count matrix with genomic regions as rows and samples as columns. Must have row names (region IDs) and column names (sample IDs) | mat = as.matrix(count_data) |
row_cluster_file_path |
character | Yes | Path to TSV file with ‘feature’ and ‘label’ columns defining row
cluster assignments. Typically uses row_table_clean.tsv
from add_regions_back_to_cluster() |
row_cluster_file_path = "row_table_clean.tsv" |
col_cluster_file_path |
character | Yes | Path to TSV file with ‘feature’ and ‘label’ columns defining column cluster assignments | col_cluster_file_path = "col_table.tsv" |
out_dir |
character | No (default: “./”) | Directory to save heatmap output | out_dir = "./heatmaps" |
show_column_names |
logical | No (default: FALSE) | Whether to display sample names along the column axis | show_column_names = TRUE |
lower_range |
numeric | No (default: NULL) | Lower bound for heatmap color scale. If NULL, uses minimum value in matrix | lower_range = 0 |
upper_range |
numeric | No (default: NULL) | Upper bound for heatmap color scale. If NULL, uses maximum value in matrix | upper_range = 4.5 |
row_title_fontsize |
numeric | No (default: NULL) | Font size for row cluster titles (A, B, C, …). If NULL, uses 20 | row_title_fontsize = 25 |
col_title_fontsize |
numeric | No (default: NULL) | Font size for column cluster titles (1, 2, 3, …). If NULL, uses 20 | col_title_fontsize = 25 |
legend_title_fontsize |
numeric | No (default: NULL) | Font size for legend title. If NULL, uses 15 (auto-adjusted to fit) | legend_title_fontsize = 18 |
legend_label_fontsize |
numeric | No (default: NULL) | Font size for legend tick labels. If NULL, uses 15 | legend_label_fontsize = 15 |
Note: The input matrix is typically the
complete transformed count matrix (e.g., from
transformed_cm_path), not a pre-filtered subset. The
function will automatically subset the matrix to include only features
present in both the matrix and the cluster assignment files. Features in
the matrix but not in cluster files will be excluded from visualization;
features in cluster files but not in the matrix will be skipped.
The function generates the following output files in the specified
out_dir:
biclustering_heatmap.pdf
row_cluster_file_pathcol_cluster_file_pathlibrary(epigenomeR)
# Add non-highly-variable regions back to clusters
orig_cm_path <- "./count_matrix.feather"
transformed_cm_path <- "./count_matrix_log2_qnorm.feather"
filtered_cm_path <- "./highly_variable_regions.feather"
row_cluster_path <- "./biclustering_results/row_table.tsv"
out_dir <- "./add_regions_results"
add_regions_back_to_cluster(
orig_cm_path = orig_cm_path,
transformed_cm_path = transformed_cm_path,
filtered_cm_path = filtered_cm_path,
row_cluster_path = row_cluster_path,
out_dir = out_dir,
cutoff_non_zero = 10,
quantile_threshold = 0.75,
plot = TRUE
)
# Load the transformed count matrix
library(arrow)
library(tibble)
mat <- as.matrix(column_to_rownames(read_feather(transformed_cm_path), var = "pos"))
# Generate heatmap with expanded cluster assignments
row_table_clean_path <- file.path(out_dir, "row_table_clean.tsv")
col_table_path <- "./biclustering_results/col_table.tsv"
biclustering_heatmap(
mat = mat,
row_cluster_file_path = row_table_clean_path,
col_cluster_file_path = col_table_path,
out_dir = out_dir
)
The biclustering_genomic_distribution() function
performs post-clustering genomic annotation analysis by quantifying the
regulatory element composition of clustered genomic regions. Clustered
regions are overlapped with multiple external annotation resources,
including cCREs, ChromHMM chromatin states, and repeat elements,
followed by comparative visualization across clusters.
| Parameter | Type | Required | Default | Description | Example |
|---|---|---|---|---|---|
row_cluster_file_path |
character | Yes | - | Path to TSV file with cluster assignments. Must contain columns:
feature (genomic coordinates as chr_start_end) and
label (cluster ID). Typically the row_table_clean.tsv from
biclustering output |
"./row_table_clean.tsv" |
out_dir |
character | No | "./" |
Output directory for annotation results. Subdirectories are automatically created for each annotation type | "./distribution_annotation" |
distributions |
character vector | No | c("genic", "ccre") |
Annotation types to perform. Valid options: "genic"
(gene features), "ccre" (cCRE elements),
"chromhmm" (chromatin states), "repeat"
(repeat elements). Can specify any combination |
c("genic", "ccre", "repeat") |
ref_genome |
character | No | "hg38" |
Reference genome version. Options: "hg38" (Human
GRCh38) or "mm10" (Mouse GRCm38) |
"mm10" |
ref_source |
character | No | "knownGene" |
Gene annotation source for cCRE/genic annotation. Options:
"knownGene" (UCSC) or "GENCODE". Only used
when "genic" is in distributions |
"GENCODE" |
mode |
character | No | "nearest" |
Annotation assignment method. Options: "nearest"
(assign to closest feature) or "weighted" (proportional by
overlap length) |
"weighted" |
plot |
logical | No | TRUE |
Generate stacked barplot visualizations for each annotation type | FALSE |
"genic": Gene structural features - Promoter, 5’ UTR,
Exon, Intron, 3’ UTR"ccre": Candidate cis-Regulatory Elements - dELS, pELS,
PLS, CA-H3K4me3, CA-CTCF, CA-TF, TF, CA"chromhmm": Chromatin states - Acet, EnhWk, EnhA,
PromF, TSS, TxWk, TxEx, Tx, OpenC, TxEnh, ReprPCopenC, BivProm, ZNF,
ReprPC, HET, GapArtf, Quies"repeat": Repetitive elements - SINE, LINE, LTR,
Retroposon, RC, DNA, Satellite, Simple_repeat, Low_complexity, rRNA,
tRNA, snRNA, scRNA, srpRNA, RNA, Unknown"nearest": Each genomic region is assigned to its
closest feature (by distance to feature center)"weighted": Each region is proportionally assigned to
overlapping features based on overlap lengthnearest mode is faster and simpler;
weighted mode provides more accurate representation for
regions spanning multiple featuresThe function generates the following output files in the specified
out_dir:
{genic/ccre/chromhmm/repeat}_distribution.pdf| Promoter | 5’ UTR | Exon | Intron | 3’ UTR | other | |
|---|---|---|---|---|---|---|
| A | 7.37327188940092 | 0 | 2.76497695852535 | 13.3640552995392 | 0 | 76.4976958525346 |
| B | 59.8802395209581 | 11.9760479041916 | 8.98203592814371 | 16.7664670658683 | 1.79640718562874 | 0.598802395209581 |
| C | 54.2253521126761 | 11.9718309859155 | 13.3802816901408 | 14.0845070422535 | 0 | 6.33802816901408 |
| D | 8.82352941176471 | 0 | 10.2941176470588 | 54.4117647058823 | 2.94117647058824 | 23.5294117647059 |
| E | 43.3823529411765 | 10.2941176470588 | 10.2941176470588 | 27.2058823529412 | 1.47058823529412 | 7.35294117647059 |
| …… | ||||||
| O | 60.427807486631 | 10.6951871657754 | 13.903743315508 | 12.8342245989305 | 0 | 2.13903743315508 |
| dELS | pELS | PLS | CA-H3K4me3 | ||
|---|---|---|---|---|---|
| A | 26.7281105990783 | 7.37327188940092 | 2.76497695852535 | 9.67741935483871 | |
| B | 5.38922155688623 | 34.7305389221557 | 59.8802395209581 | 0 | |
| C | 4.22535211267606 | 26.7605633802817 | 62.6760563380282 | 3.52112676056338 | |
| D | 79.4117647058823 | 13.2352941176471 | 2.94117647058824 | 0 | |
| E | 19.8529411764706 | 41.9117647058824 | 38.2352941176471 | 0 | |
| …… | |||||
| O | 4.81283422459893 | 28.8770053475936 | 66.3101604278075 | 0 | |
| Cluster | CA-CTCF | CA-TF | TF | CA | other |
|---|---|---|---|---|---|
| A | 0.921658986175115 | 0 | 4.14746543778802 | 1.84331797235023 | 46.5437788018433 |
| B | 0 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 1.40845070422535 | 0 | 1.40845070422535 |
| D | 0 | 0 | 2.94117647058824 | 0 | 1.47058823529412 |
| E | 0 | 0 | 0 | 0 | 0 |
| …… | |||||
| O | 0 | 0 | 0 | 0 | 0 |
| Cluster | Acet | EnhWk | EnhA | PromF | TSS | OpenC |
|---|---|---|---|---|---|---|
| A | 3.2258064516129 | 0.921658986175115 | 2.76497695852535 | 5.06912442396313 | 0.921658986175115 | 0 |
| B | 0 | 0 | 0.598802395209581 | 43.7125748502994 | 50.8982035928144 | 0.598802395209581 |
| C | 1.40845070422535 | 0 | 1.40845070422535 | 50.7042253521127 | 35.9154929577465 | 2.11267605633803 |
| D | 16.1764705882353 | 0 | 36.7647058823529 | 16.1764705882353 | 0 | 2.94117647058824 |
| E | 1.47058823529412 | 0.735294117647059 | 4.41176470588235 | 38.2352941176471 | 38.2352941176471 | 2.20588235294118 |
| …… | ||||||
| O | 0.53475935828877 | 0 | 2.67379679144385 | 45.4545454545455 | 49.7326203208556 | 0 |
| TxEnh | BivProm | TxWk | TxEx | Tx | |
|---|---|---|---|---|---|
| A | 0.921658986175115 | 0.921658986175115 | 0 | 0 | 0 |
| B | 0 | 1.19760479041916 | 0 | 0 | 0 |
| C | 0.704225352112676 | 0 | 0 | 0 | 0 |
| D | 11.7647058823529 | 0 | 1.47058823529412 | 10.2941176470588 | 1.47058823529412 |
| E | 0.735294117647059 | 10.2941176470588 | 0 | 0 | 0 |
| …… | |||||
| O | 0 | 1.06951871657754 | 0 | 0 | 0 |
| ZNF | ReprPC | HET | GapArtf | Quies | other | |
|---|---|---|---|---|---|---|
| A | 0.921658986175115 | 3.2258064516129 | 13.3640552995392 | 38.2488479262673 | 1.84331797235023 | 27.6497695852535 |
| B | 0 | 0 | 1.19760479041916 | 0 | 0 | 1.79640718562874 |
| C | 0 | 0 | 1.40845070422535 | 2.11267605633803 | 0 | 4.22535211267606 |
| D | 0 | 0 | 0 | 1.47058823529412 | 0 | 1.47058823529412 |
| E | 0 | 2.94117647058824 | 0.735294117647059 | 0 | 0 | 0 |
| …… | ||||||
| O | 0 | 0 | 0 | 0 | 0 | 0.53475935828877 |
{genic/ccre/chromhmm/repeat}_distribution.pdf




library(epigenomeR)
row_cluster_file_path <- "./bicluster_row_cluster.tsv"
out_dir <- "output/figures"
annotation_ccre_hmm(row_cluster_file_path = row_cluster_file_path, out_dir = out_dir)
The biclustering_TFBS_enrichment() function provides a
complete, automated workflow for analyzing transcription factor binding
site (TFBS) enrichment across multiple clusters of genomic regions. It
handles the entire pipeline from control region generation to enrichment
testing and visualization.
What this function does:
Reads a file containing clustered genomic regions (e.g., from biclustering analysis)
Generates matched control regions for each cluster using gene-based matching
Performs TFBS enrichment analysis comparing each cluster against controls
Creates heatmap visualizations showing enrichment patterns across clusters
Saves all intermediate and final results to organized output files
| Parameter | Type | Required | Description | Example |
|---|---|---|---|---|
row_cluster_file_path |
character | Yes | Path to tab-delimited file with clustered regions. Must have
columns: feature (format: chr_start_end) and
label (cluster ID) |
"bicluster_results.tsv" |
out_dir |
character | No (default: “./”) | Output directory where all results will be saved | "./TFBS_results/" |
ref_genome |
character | No (default: “hg38”) | Reference genome version. Options: “hg38”, “hg19”, “mm10”, “mm39” | ref_genome = "mm10" |
control_rep |
integer | No (default: 1) | Multiplier for control region generation. Determines the ratio of
control to target regions. E.g., control_rep = 2 generates
2x control regions; control_rep = 5 generates 5x control
regions |
control_rep = 3 |
regions |
integer | No (default: 800) | Size in base pairs to resize all regions (centered on original midpoint) | regions = 500 |
plot |
logical | No (default: TRUE) | Whether to generate heatmap visualizations. If FALSE, only performs enrichment analysis | plot = FALSE |
Input File Format:
The cluster file must be tab-delimited with at least these columns:
feature label
chr1_1234567_1235567 cluster_1
chr2_9876543_9877543 cluster_1
chr3_5555555_5556555 cluster_2
...
feature: Genomic coordinates in format “chr_start_end”
(underscore-separated)label: Cluster assignment (e.g., “cluster_1”,
“group_A”, “bicluster_2”)All output files are saved to the specified out_dir:
| File | Description |
|---|---|
all_controls.bed |
BED file containing all matched control regions (combined and de-duplicated) |
TFBS_enrichment_cluster_<label>.tsv |
One TSV per cluster with enrichment statistics (odds ratio, p-value, FDR, hit counts) |
TFBS_heatmap_all.pdf |
Heatmap visualization of enriched motifs across all clusters (if
plot = TRUE) |
odds_ratio_log2.csv |
Matrix of log2 odds ratios for all TFs × clusters (if
plot = TRUE) |
FDR.csv |
Matrix of FDR values for all TFs × clusters (if
plot = TRUE) |
library(epigenomeR)
# Basic usage - complete pipeline with visualization
biclustering_TFBS_enrichment(
row_cluster_file_path = "NMF_clusters.tsv",
out_dir = "./TFBS_analysis/",
ref_genome = "hg38"
)
# Custom region size for enhancer analysis
biclustering_TFBS_enrichment(
row_cluster_file_path = "enhancer_clusters.tsv",
out_dir = "./enhancer_TFBS/",
ref_genome = "hg38",
regions = 1000
)
# Mouse genome analysis
biclustering_TFBS_enrichment(
row_cluster_file_path = "mouse_peaks_clustered.tsv",
out_dir = "./mouse_TFBS/",
ref_genome = "mm10",
regions = 500
)
# Generate 3x more control regions for more robust statistics
biclustering_TFBS_enrichment(
row_cluster_file_path = "ATAC_peaks_clustered.tsv",
out_dir = "./ATAC_TFBS/",
ref_genome = "hg38",
control_rep = 3,
regions = 800
)
# Enrichment analysis only (no heatmap)
biclustering_TFBS_enrichment(
row_cluster_file_path = "clusters.tsv",
out_dir = "./TFBS_tables/",
ref_genome = "hg38",
plot = FALSE
)