1. Read-Level QC

The qc_by_percentile() function performs quality control on BAM files and generates a log2-transformed read count heatmap for CRF-CRF pairs. This function can also accept BED files as input for visualization purposes without filtering (set filtered_percentile = 0 and plot = TRUE).

What this function does: This function implements a percentile-based quality control strategy for CRF pair data. It calculates read counts for all CRF pairs from input BAM files, then filters out pairs with the lowest read counts based on a user-defined percentile threshold. For example, setting filtered_percentile = 0.25 removes the bottom 25% of CRF pairs ranked by read count, retaining only pairs with sufficient sequencing depth for downstream analysis.
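As a rough illustration, the filtering step amounts to ranking pairs by read count and dropping everything below the chosen quantile. The sketch below uses base R only and example counts from the output table further down; it is not the package implementation.

# Minimal sketch of percentile-based filtering (not the package internals).
read_counts <- c("H3K4me3-H3K4me3"  = 940240,
                 "H3K27me3-H3K4me3" = 574643,
                 "H3K4me1-H3K4me3"  = 432292,
                 "H3K4me3-H3K9me3"  = 415540,
                 "H3K9me2-H3K9me3"  = 201788,
                 "H3K27ac-H3K4me3"  = 60573)

filtered_percentile <- 0.25
threshold <- quantile(read_counts, probs = filtered_percentile)

kept_pairs     <- read_counts[read_counts >= threshold]  # retained for downstream analysis
filtered_pairs <- read_counts[read_counts <  threshold]  # shown in grey on the heatmap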

The function generates a lower-triangle heatmap displaying log2-transformed read counts for all CRF pairs. Pairs that pass the quality control threshold are shown with a blue-white-red color gradient, while filtered pairs are displayed in grey to clearly distinguish them from high-quality data. Optional CRF grouping allows users to organize and annotate CRFs by biological categories (e.g., histone modifications, transcription factors), which are displayed as colored blocks along the heatmap margins. All QC results, including read count tables and visualizations, are automatically saved to the specified output directory.
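The pair-to-matrix bookkeeping behind the heatmap is straightforward: each file prefix encodes two CRF names joined by split_pair_by, and counts are log2(count + 1)-transformed before plotting. A minimal sketch of that step, continuing from the read_counts vector above (again base R only, not the package code):

# Sketch: build a symmetric CRF x CRF matrix of log2(count + 1) values
# from pair names such as "H3K27ac-H3K4me3" (delimiter = split_pair_by).
split_pair_by <- "-"
pairs <- strsplit(names(read_counts), split_pair_by, fixed = TRUE)
crfs  <- sort(unique(unlist(pairs)))
mat   <- matrix(NA_real_, length(crfs), length(crfs), dimnames = list(crfs, crfs))

for (i in seq_along(pairs)) {
  a <- pairs[[i]][1]
  b <- pairs[[i]][2]
  mat[a, b] <- mat[b, a] <- log2(read_counts[i] + 1)
}
mat[upper.tri(mat)] <- NA  # only the lower triangle is drawn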

Parameters

| Parameter | Type | Required | Description | Example |
| --- | --- | --- | --- | --- |
| file_paths | character vector | Yes | Vector of BAM or BED file paths for QC and visualization | file_paths = c("sample1.bam", "sample2.bam") |
| out_dir | character | No (default: "./") | Directory to store output TSV files and PDF | out_dir = "./results" |
| filtered_percentile | numeric | No (default: 0.25) | Numeric value between 0 and 1 to filter the lowest-read-count samples | filtered_percentile = 0.25 |
| plot | logical | No (default: TRUE) | If TRUE, generates heatmap visualization | plot = TRUE |
| split_pair_by | character | No (default: "-") | Delimiter used to split CRF pairs into individual CRF names | split_pair_by = "-" |
| group_csv | character | No (default: NULL) | Optional file path to CSV defining CRF groupings and categories for annotation | group_csv = "crf_groups.csv" |
| crf_col | character | No (default: "crf") | Column name in group_csv for CRF identifiers | crf_col = "crf" |
| category_col | character | No (default: "category") | Column name in group_csv for group categories | category_col = "category" |

Output Files

The function generates the following output files in the specified out_dir (a loading sketch follows the list):

  1. All read counts TSV - all_read_count.tsv
    • Contains read counts for all input files
    • Two columns: file prefix and read count
    • Example content:

      file                read_count
      H3K27ac-H3K4me3     60573
      H3K4me3-H3K4me3     940240
      H3K4me1-H3K4me3     432292
      H3K4me3-H3K9me3     415540
      H3K27me3-H3K4me3    574643
      H3K9me2-H3K9me3     201788

  2. Filtered read counts TSV - filtered_read_count.tsv
    • Contains read counts for files passing the percentile filter
    • Excludes samples in the lowest percentile, as set by the filtered_percentile parameter
    • Two columns: file name and read count
  3. Read count heatmap - qc_heatmap.pdf (if plot = TRUE)
    • PDF visualization of log2-transformed read counts for CRF-CRF pairs
    • Lower-triangle matrix showing read counts between CRF pairs
    • Filtered pairs (below the read-count threshold) displayed in grey
    • If group_csv is provided, includes categorical annotations shown as colored blocks
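The count tables are plain tab-delimited files, so they can be pulled straight back into R; a small usage sketch, assuming the header rows described above and the out_dir from the example below:

# Read the QC tables back into R (tab-delimited, header row included).
all_counts      <- read.delim(file.path("output/qc_results", "all_read_count.tsv"))
filtered_counts <- read.delim(file.path("output/qc_results", "filtered_read_count.tsv"))
head(all_counts)  # columns: file, read_count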

Example Usage

# BAM files (typical use case for QC):
input_path <- "./dir_for_bam_files"
bam_files <- list.files(path = input_path, pattern = "\\.bam$", recursive = TRUE, full.names = TRUE)

# Optional: Create grouping CSV
crfs <- c("H3K4me1", "H3K4me3", "H3K9ac", "H3K27ac", "H3K9me2", "H3K9me3", "H3K27me3", "H3K36me3")
categories <- c(rep("Active", 4), rep("Repressive", 3), "Other")
group_df <- data.frame(crf = crfs, category = categories)
write.csv(group_df, "crf_groups.csv", row.names = FALSE)

# Run QC with grouping
library(multiEpiCore)
qc_by_percentile(
  file_paths = bam_files, 
  out_dir = "output/qc_results",
  filtered_percentile = 0.25,
  plot = TRUE,
  group_csv = "crf_groups.csv"
)

Notes

  • The function automatically detects whether input files are BAM or BED format based on file extensions.

  • The filtered_percentile parameter removes samples with the lowest read counts (e.g., 0.25 removes the bottom 25%).

  • If filtered_path is provided, it overrides the percentile-based filtering for determining significant pairs in the heatmap.

  • The heatmap displays only the lower triangle of the matrix, with filtered pairs shown in grey.

  • Read counts are log2-transformed (log2(count + 1)) for visualization.

  • When group_csv is provided, CRF labels are sorted by category and displayed with colored block annotations.

  • The function requires the following R packages: ComplexHeatmap, latex2exp, circlize, and grid.

  • All file paths in file_paths (and in filtered_path, if used) must exist, or the function will fail.

  • If out_dir is NULL, outputs are saved to the directory containing the first input file.

2. Fragment Length Analysis

The frag_decomposition() function runs a complete fragment length analysis pipeline on paired-end BAM files, including quality control, valley detection, and visualization for nucleosome positioning.

What this function does:

  • Performs quality control filtering on BAM files based on coverage percentile

  • Computes fragment lengths from filtered paired-end BAM files

  • Detects two local-minimum valleys in the fragment length distribution for nucleosome fragment classification (see the sketch after this list)

  • Generates fragment length histograms with valley cutoff lines

  • Creates summary statistics report (valley positions, min/max fragment lengths)

  • Produces fragment decomposition data with percentages

  • Generates bar plots showing fragment distribution across samples
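Conceptually, the valley detection step estimates a kernel density over the pooled fragment lengths and takes the local minimum of the density inside each user-supplied search window. The sketch below illustrates that idea in base R; it is not the package implementation, and frag_lens stands for a numeric vector of fragment lengths (bp).

# Sketch of valley detection: kernel density over fragment lengths,
# then the minimum of the density curve inside each search window.
find_valley <- function(frag_lens, range_bp,
                        density_kernel = "gaussian", dens_reso = 2^15) {
  d   <- density(frag_lens, kernel = density_kernel, n = dens_reso)
  idx <- which(d$x >= range_bp[1] & d$x <= range_bp[2])
  d$x[idx][which.min(d$y[idx])]  # position (bp) of the local minimum
}

# With the documented default search ranges:
# local_min1 <- find_valley(frag_lens, c(73, 221))
# local_min2 <- find_valley(frag_lens, c(222, 368))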

Parameters

| Parameter | Type | Required | Description | Example |
| --- | --- | --- | --- | --- |
| file_path | character vector | Yes | Paths to BAM files to analyze | file_path = c("sample1.bam", "sample2.bam") |
| save_dir | character | Yes | Directory path to save all output files and plots | save_dir = "./fragment_analysis" |
| frag_decomp_file | character | No | Path to generated fragment decomposition file (returned by function) | Auto-generated |
| filtered_percentile | numeric | No | Percentile threshold for quality control filtering (default: 0.25, removes bottom 25%) | filtered_percentile = 0.25 |
| dens_reso | numeric | No | Resolution for kernel density estimation in valley detection (default: 2^15 = 32768) | dens_reso = 2^15 |
| target_pair_mapping_df | data frame | No | Optional data frame for mapping sample names to display names (default: NULL) | target_pair_mapping_df = df |
| density_kernel | character | No | Kernel type for density estimation (default: "gaussian") | density_kernel = "gaussian" |
| valley1_range | numeric vector | No | Search range for first valley in bp (default: c(73, 221)) | valley1_range = c(73, 221) |
| valley2_range | numeric vector | No | Search range for second valley in bp (default: c(222, 368)) | valley2_range = c(222, 368) |

Output Files

The function generates the following output files in the specified save_dir (a loading sketch follows the list):

  1. Fragment length data - premerge_all-qc_frag_lens.RData
    • Cached fragment length data from all filtered samples
    • Numeric vector of fragment lengths
  2. Initial histogram - premerge_frag_hist.pdf
    • Fragment length distribution histogram without valley cutoffs
    • Shows overall fragment length distribution
  3. Summary report - summary_report.csv
    • Contains key metrics:
      • local_min1: First valley position (bp)
      • local_min2: Second valley position (bp)
      • min_frag_len: Minimum fragment length (bp)
      • max_frag_len: Maximum fragment length (bp)
  4. Histogram with cutoffs - premerge_frag_hist_with_cutoff.pdf
    • Fragment length distribution with vertical lines at valley positions
    • Used to visualize nucleosome fragment classification
  5. Fragment decomposition data - *_frag_decomp_with_perc.tsv
    • Per-sample fragment counts and percentages
    • Classified by valley thresholds: subnucleo, monomer, dimer+
  6. Bar plots - save_dir/barplots/*.pdf
    • Visual representation of fragment distribution across samples
    • Separate plots for different fragment categories
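A small usage sketch for loading the tabular outputs back into R (file names as listed above; header rows are assumed):

# Load the summary report and the per-sample decomposition table(s).
save_dir <- "output/frag_decomposition"
summary_report <- read.csv(file.path(save_dir, "summary_report.csv"))

decomp_files <- list.files(save_dir, pattern = "_frag_decomp_with_perc\\.tsv$", full.names = TRUE)
decomp <- read.delim(decomp_files[1])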

Example Usage

library(multiEpiCore)
bam_files <- list.files("./dir_for_bam_files", pattern = "\\.bam$", recursive = TRUE, full.names = TRUE)
frag_decomposition(file_path = bam_files, save_dir = "output/frag_decomposition")

3. Split BAM Files in Bash

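After frag_decomposition() has identified the valley positions, each mixed BAM file can be split into sub-nucleosomal, mono-nucleosomal, and di-nucleosomal fractions with samtools by filtering on the insert size (TLEN) field. The SLURM array script below does this for every CRF pair; the length cutoffs used here (24-126, 127-297, 298-800 bp) are example values that presumably correspond to the detected valleys, so adjust them to the values reported in your summary_report.csv.
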
#!/bin/bash
## Submit as a SLURM array job; set --array to 0..(length of scen_list - 1)
#SBATCH --array=0-1

module load samtools

# setup (change)
thres_name="valley-V-qc"
scen_list=( "T" "V" ) # input dir name
scen_simple_name_list=( "T" "V" ) # output name
target_list=( "IgG_control" "H3K36me3" "H3K4me1" )  # ... add the remaining CRF targets here
load_root_dir="./data_align" # directory path for scen_list
save_root_raw_dir="output/frag_decomposition/split_bam"

# main function
run_frag_decomposition() {
    scen=${scen_list[${SLURM_ARRAY_TASK_ID}]}
    scen_simple_name=${scen_simple_name_list[${SLURM_ARRAY_TASK_ID}]}

    save_root_dir=${save_root_raw_dir}/${thres_name}
    mkdir -p "${save_root_dir}"

    load_dir=${load_root_dir}/${scen}
    mixed_dir=${save_root_dir}/${scen_simple_name}_mixed
    subnucleo_dir=${save_root_dir}/${scen_simple_name}_sub
    mononucleo_dir=${save_root_dir}/${scen_simple_name}_mono
    dinucleo_dir=${save_root_dir}/${scen_simple_name}_di

    mkdir -p "${mixed_dir}" "${subnucleo_dir}" "${mononucleo_dir}" "${dinucleo_dir}"
    cp -r ${load_dir}/* "${mixed_dir}/"

    for target_1 in "${target_list[@]}"; do
        for target_2 in "${target_list[@]}"; do
            if [ "${target_1}" \< "${target_2}" ] || [ "${target_1}" = "${target_2}" ]; then
                mixed_filename="${target_1}-${target_2}.bam"
                mixed_dir_filename=${load_dir}/${mixed_filename}

                if [[ ! -f "${mixed_dir_filename}" ]]; then
                    echo "Warning: ${mixed_dir_filename} not found, skip."
                    continue
                fi
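                # Split by insert size (TLEN, awk field $9) into sub-, mono-, and di-nucleosomal fractions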
                samtools view -h "${mixed_dir_filename}" | awk 'substr($0,1,1)=="@" || ($9>=24 && $9<=126) || ($9<=-24 && $9>=-126)' | samtools view -b > "${subnucleo_dir}/${mixed_filename}"
                samtools view -h "${mixed_dir_filename}" | awk 'substr($0,1,1)=="@" || ($9>=127 && $9<=297) || ($9<=-127 && $9>=-297)' | samtools view -b > "${mononucleo_dir}/${mixed_filename}"
                samtools view -h "${mixed_dir_filename}" | awk 'substr($0,1,1)=="@" || ($9>=298 && $9<=800) || ($9<=-298 && $9>=-800)' | samtools view -b > "${dinucleo_dir}/${mixed_filename}"
            fi
        done
    done
}

run_frag_decomposition
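
If the script is saved as, say, split_bam.sh (a hypothetical name), submit it with sbatch split_bam.sh; SLURM_ARRAY_TASK_ID then selects one scenario from scen_list per array task.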