Post

Calculating BED total feature length with awk

This post shows how we can readily use command-line tools such as awk to perform bioinformatics data analysis and reveal meaningful biological insights without much effort.

Calculating BED total feature length with awk

TL;DR: In the absence of dedicated tools, shell scripting with readily available command-line tools such as awk can help reveal meaningful biological insights without much effort.

When working with genomic data, especially in BED format, it’s often useful to quickly summarize regions of interest.

What is a BED file?

A BED file (Browser Extensible Data) is a tab-delimited text file that defines genomic regions (more info here). Each line in a standard BED file has at least three columns:

  1. chrom – the chromosome name (e.g., chr1)
  2. start – the starting position of the feature
  3. end – the ending position of the feature

These coordinates define a region on the genome.

It is possible that popular tools such as bedtools that are made to work with BED files does not provide the functionality we need out of the box. In such cases, we can consider employing command-line tools such as awk.

What is awk?

awk is a powerful command-line tool for pattern scanning and processing. It’s especially useful for manipulating structured, columnar text data like BED files. It operates on a per-line basis and automatically splits each line into fields (more info here).

For instance, we may be interested to calculate the total length of all features in a BED file. For example, you might want to compare chromatin accessibility across two conditions: healthy vs. cancer.

Here’s a compact way to do that using awk:

1
awk 'NF >= 3  {sum += $3 - $2} END {print sum}' file.bed

This command works as follows:

  • NF >= 3 ensures we only process lines with at least 3 fields (valid BED format).
  • $3 - $2 calculates the length of each genomic region (end - start).
  • sum += ... accumulates the total length.
  • END {print sum} prints the final result after all lines have been processed.

Important Note: For accurate results, ensure your BED file is sorted and merged (using bedtools sort and bedtools merge) if it contains overlapping regions, to avoid double-counting the same genomic positions (more info here).

Making it reusable by creating a shell function

If you expect to use this command frequently, consider turning it into a reusable shell function. Shell functions are a way to group commands for later execution using a single name for the group. They are executed just like a “regular” command (more info here). Add the following snippet to your ~/.bashrc (or ~/.bash_profile on macOS):

1
2
3
4
5
6
7
8
bed_total_length() {
  if [[ ! -f "$1" ]]; then
    echo "Error: File '$1' not found" >&2
    return 1
  fi

  awk 'NF >= 3 {sum += $3 - $2} END {print sum}' "$1"
}

Then, activate the function by running:

1
source ~/.bashrc # or source ~/.bash_profile on macOS

Now you can call it like this:

1
bed_total_length file.bed

Example application: Comparing chromatin accessibility across conditions

Imagine you have two sets of consensus peaks obtained from an assay such as ATAC-seq that is used to detect accessible chromatin in a genome-wide manner. They correspond to two different biological conditions: normal and cancer and they are stored in two files, normal_peaks.bed and cancer_peaks.bed, contain regions of open chromatin.

Comparing the total length of accessible chromatin regions provides a global view of epigenetic changes between conditions. While this is a preliminary analysis (more detailed approaches would examine peak overlap, intensity, and functional annotations), it can reveal overall shifts in chromatin landscape.

To compare overall accessibility per condition, calculate the total peak lengths in each file:

1
2
3
4
5
bed_total_length normal_peaks.bed
# Output: 15747392 (bp)

bed_total_length cancer_peaks.bed
# Output: 21405847 (bp)

If the total length in the cancer sample is greater, this may suggest increased chromatin accessibility, potentially due to epigenetic deregulation. One possible mechanism involves the MLL protein (i.e., KMT2A gene), a histone methyltransferase that catalyzes H3K4 methylation, a modification associated with active chromatin (Flavahan, Gaskell & Bernstein, 2017). Increased MLL activity in disease states can make chromatin aberrantly permissive which creates a state of “epigenetic plasticity,” activating oncogene expression or cell fate changes that drive cancer development (Flavahan, Gaskell & Bernstein, 2017).


Conclusion: In the absence of dedicated tools, one can consider using command-line tools such as awk to perform bioinformatics data analysis and reveal meaningful biological insights. This simple approach demonstrates how basic calculations on genomic coordinates can provide initial insights into complex biological phenomena.

This post is licensed under CC BY 4.0 by the author.