NIPTeR package

This vignette is an explanation on how to use NIPTeR. It starts with a short introduction, the second chapter explains the two most important objects in this package, NIPTSample and NIPTControlgroup. The third chapter explains all the analysis functions, the arguments the function take and their output. The last chapter shows some workflow examples.

Introduction

Non-Invasive Prenatal Testing (NIPT) is a method based on analysis of cell-free fetal (cff) DNA in maternal blood using next-generation sequencing. The basic approach is relatively straightforward: cell-free DNA is isolated from the mothers blood plasma and the obtained DNA fragments are sequenced. Subsequently the number of DNA fragments originating from the different chromosomes is counted, since in case of a fetal trisomy a relative increase of the fraction of reads of the trisomic chromosome is expected. Various analysis methods for trisomy prediction, variation reduction and quality control have been described in the literature. These methods can be combined to improve the sensitivity of the analysis. The NIPTeR package makes these methods available and allows for flexibility in workflow design. NIPTeR uses BAM files as input and calculates chromosomal fractions, which are used for trisomy prediction. These fractions are based on read counts of the different chromosomes. The chromosomes are divided in bins to allow for read count correction.

Package

The atomic unit of analysis in this package is a NIPTSample object

A NIPTSample object is a named list of length 5, having the following fields:

  1. autosomal_chromosome_reads A list containing the bin count matrices for the autosomal chromosomes. If forward and reverse reads are counted separately the list contains two matrices, if the reads are counted together it contains one. In these matrices the columns represent bins and the 22 rows represent the chromosomes. For a small excerpt of such a matrix see Table 1.

  2. correction_status_autosomal_chromosomes Correction status of the autosomal chromosomes. Default is Uncorrected

  3. sex_chromosome_reads List containing bin count matrices for the sex chromosomes. If forward and reverse reads are counted separately the list contains two matrices, if the reads are counted together it contains one. In these matrices the columns represent bins and the two represent rows the chromosomes. For a small excerpt of such a matrix see Table 2.

  4. correction_status_sex_chromosome Correction status of the sex chromosomes. Default is Uncorrected

  5. sample_name The name of the sample. Default is the filename of the .bam it originates from with the path prefix and .bam suffix

Table 1: Example data bin counts. The rows represent chromosomes and the columns bins

  1 2 3 4 5 6 7 8
1 308 303 270 157 148 99 146 236
2 243 263 262 269 225 225 285 259
3 0 189 287 245 264 251 258 242
4 624 637 257 240 229 241 251 216
5 448 249 274 262 239 205 241 244
6 0 174 213 142 231 324 355 383
7 247 266 234 217 142 88 264 230
8 163 216 244 209 243 255 259 261
Table 2: Example data bin counts sex chromosomes. The rows represent chromosomes and the columns bins
  1 2 3 4 5 6 7 8
X 0 160 6 266 127 85 182 235
Y 0 0 0 0 0 0 0 0

The panel of control samples to which the sample of interest can be compared is also modelled in an object, NIPTControlGroup. A control group is a named list and composed of NIPTSample objects. It has the following fields:

  1. samples The NIPTSample objects

  2. correction_status_autosomal_chromosomes Correction status of the autosomal chromosomes of the samples. Default is Uncorrected

  3. correction_status_sex_chromosomes Correction status of the sex chromosomes of the samples. Default is Uncorrected

  4. Description Description of the control group. This can either be Generic control group or Fitted to sample name

Functions

Loading BAM files

To load a bam file, use the bin_bam_sample function. The bam file does not need to be sorted, but its recommended to use pre-sorted bam files

bin_bam_sample(bam_filepath, do_sort = F, separate_strands = F, custom_name = NULL)

The arguments bin_bam_sample takes:

  1. bam_filepath The location and filename on the file system where the bam file is stored.

  2. do_sort Sort the bam file? If the bam is unsorted set to true, but the use of pre-sorted bam files is recommended.

  3. separate_strands If set to true, reads from forward and reverse strands are counted and stored separately. This option should be used if you are planning on using regression, since this doubles the number of predictors (F+R) and distributes predictive power more equally over prediction sets since F and R strand from the same chromosome cannot be both in one prediction set.

  4. custom_name The sample name. Default samplename is the filename of the bam file without the .bam suffix and filepath prefix.

Variation reduction functions

GC correction

GC content bias is the correlation between the number of reads mapped to a specific genomic region and the GC content of this region. In NIPTeR, two GC bias correction algorithms have been implemented, the LOESS based method introduced by Chen et al. (2011) and the bin weight based method described by Fan and Quake (2010).

gc_correct(nipt_object, method, include_XY = F, span = 0.75)
  1. nipt_object The object that will be corrected. This can either be a NIPTSample or a NIPTControlGroup object.

  2. method To select the LOESS based method use “LOESS”, to select the bin weights based method use “bin”.

  3. include_XY Also perform correction on the X and Y chromosomes?

  4. span The span for the LOESS fit. Only applicable when LOESS method is used.

  5. ref_genome The reference genome used. Either “hg37” or “hg38” default = “hg37

The output of the gc_correct function is a similar object as the input; when the input is a NIPTSample the output will also be a (corrected) NIPTSample, and the same applies for a NIPTControlGroup object. . After correction the correction status of the object is set to GC_corrected.

Code snippit:

#Correct NIPTSample object using LOESS method
loess_corrected_sample <- gc_correct(nipt_object = sample_of_interest, method = "LOESS",
                                     include_XY = F, span = 0.75)
#Correct NIPTControlGroup object using bin method
gc_bin_corrected_control_group <- gc_correct(nipt_object = control_group, method = "bin", 
                                             include_XY = T)

Match control group

The matchcontrolgroup function determines how well an NIPTSample fits within the NIPTControlGroup and, if needed, makes a subset NIPTControlGroup of length n.

matchcontrolgroup(nipt_sample, nipt_control_group, mode, n_of_samples,
  include_chromosomes = NULL, exclude_chromosomes = NULL)

Explanation of the arguments matchcontrolgroup takes:

  1. nipt_sample The NIPTSample object that is the focus of the analysis.

  2. nipt_control_group The NIPTControlGroup object used in the analysis.

  3. mode The function mode. This can either be “subset” or “report”. Mode “subset” means the return value will be a new NIPTControlGroup object containing n samples. When mode “report” is used the output is a matrix containing the sum of squares score of the differences between the chromosomal fractions of the sample and the control for every control sample, sorted in increasing score.

  4. n_of_samples The length of the resulting NIPTControlGroup. Only applicable if mode “subset” is used.

  5. include_chromosomes Include potential trisomic chromosomes into the comparison? Default = NULL, meaning chromosomes 13, 18 and 21 are not included.

  6. exclude_chromosomes Exclude other autosomal chromosomes besides chromosomes 13, 18 and 21? Default = NULL.

The output for mode subset is a new NIPTControlGroup composed of n samples. The output for mode report is a matrix with a single column containing the sum of squares in ascending order.

Table 3: Example match_control_group mode ‘subset’
  Sum_of_squares
Sample 6 6.766e-07
Sample 8 2.159e-06
Sample 1 3.101e-06
Sample 10 3.292e-06
Sample 9 3.588e-06
Sample 3 3.839e-06
Sample 5 4.432e-06
Sample 2 7.892e-06
Sample 7 9.737e-06
Sample 4 1.126e-05

Code snippit:

#Run matchcontrolgroup in mode "report"
scores_control_group <- matchcontrolgroup(nipt_sample = gc_LOESS_corrected_sample, 
                                          nipt_control_group = gc_LOESS_corrected_control_group, 
                                          mode = "report", include_chromosomes = c(13,18))
#Run matchcontrolgroup in mode "subset" and select 50 best matching samples
subset_loess_corrected_control_group <- matchcontrolgroup(nipt_sample = gc_LOESS_corrected_sample, 
                                                          nipt_control_group = loess_corrected_control_group, 
                                                          mode = "subset", n_of_samples = 50)

Chi-squared based variation reduction

The chi-squared based variation reduction identifies overdispersed bins within the control group and corrects these bins in both the sample of interest and the control group. The function takes in a NIPTSample and a NIPTControlGroup object, both to be corrected. For every corresponding bin in the control group a chi-squared score is calculated and this total score is converted to a normal distribution. Corresponding bins with a normalized score above chi_cutoff (default 3.5) are corrected by dividing the number of reads by the total chi-squared score divided by degrees of freedom

chi_correct(nipt_sample, nipt_control_group, chi_cutoff = 3.5, include_XY = F)
  1. nipt_sample The NIPTSample object that is the focus of the analysis.

  2. nipt_control_group The NIPTControlGroup object used in the analysis.

  3. chi_cutoff The normalized cut-off threshold.

  4. include_XY Also perform correction on the X and Y chromosomes?

The output of this function is a named list containing two fields, the first field is the corrected sample ($sample) and the second the corrected control group ($control_group).

Code snippit:

#Apply chi-squared based variation reduction method
chi_corrected_data <- chicorrect(nipt_sample = gc_LOESS_corrected_sample, 
                                 nipt_control_group = subset_loess_corrected_control_group)
#Extract sample and control group
loess_chi_corrected_sample <- chi_corrected_data$sample
subset_loess_chi_corrected_control_group <- chi_corrected_data$control_group

Trisomy prediction functions

Z-score

In the Z-score approach, introduced by Chiu et al in 2008, the chromosomal fraction of interest of a sample is compared to the chromosomal fractions of interest of the reference samples, the NIPTControlGroup object.

calculate_z_score(nipt_sample, nipt_control_group, chromo_focus)


This function takes three arguments:

  1. nipt_sample The NIPTSample object that is the focus of the analysis.

  2. nipt_control_group The NIPTControlGroup object used in the analysis.

  3. chromo_focus The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.


The output of the function is an object of class ZscoreResult. It is a named list containing seven fields:

  1. sample_Zscore A numeric, The Z score for the sample of interest for the sample of interest.

  2. control_group_statistics Named num of length 3, the first field being the mean (name mean), the second field is the standard deviation (name SD) and the third field is the P value of the Shapiro-Wilk test (name Shapiro_P_value).

  3. control_group_Zscores Matrix containing the Z scores of the chromosome of interest for all used control samples.

  4. focus_chromosome The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.

  5. control_group_sample_names The sample names of all control group samples used in the analysis.

  6. correction_status The correction status of the control group.

  7. sample_name The sample_name of the sample of interest.

Code snippit:

#Calculate Z score for chromosome 13
z_score_result_13 <- calculate_z_score(nipt_sample = loess_chi_corrected_sample, 
                                       nipt_control_group = subset_loess_chi_corrected_control_group,
                                       chromo_focus = 13)

Regression based Z-score

The regression based Z-score builds n models with m predictors using stepwise regression with forward selection. The models are used to predict the chromosomal fraction of interest, for the sample and for the control group. The observed fractions are then divided by the expected fraction, and Z-scores are calculated over these ratios. The Z-score is calculated by subtracting one from the ratio of the sample and dividing this result by the coefficient of variation. The coefficient of variation (CV) can either be the Practical or Theoretical CV. The Theoretical CV is the standard error multiplied by the overdispersion. Theoretically, the CV cannot be lower than the standard error of the mean. If it is case the CV is lower than Theoretical CV, then the Theoretical CV is used.

perform_regression(nipt_sample, nipt_control_group, chromo_focus,
  n_models = 4, n_predictors = 4, exclude_chromosomes = NULL,
  include_chromosomes = NULL, use_test_train_set = T,
  size_of_train_set = 0.6, overdispersion_rate = 1.15,
  force_practical_vc = F)
  1. nipt_sample The NIPTSample object that is the focus of the analysis.

  2. nipt_control_group The NIPTControlGroup object used in the analysis.

  3. chromo_focus The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.

  4. n_models Number of linear models to be made. Default setting is 4 models.

  5. n_predictors The number of predictors each model contains. Default is 4.

  6. exclude_chromosomes Exclude which autosomal chromosomes as potential predictors? Default potential trisomic chromosomes 13, 18 and 21 are exluded.

  7. include_chromosomes Include potential trisomic chromosomes? Options are: chromosomes 13, 18 and 21.

  8. use_test_train_set Use a test and train set to build the models? Default is TRUE.

  9. size_of_train_set The size of the train set expressed in a decimal. Default is 0.6 (60% of the control samples).

  10. overdispersion_rate The standard error of the mean is multiplied by this factor.

  11. force_practical_vc Ignore the theoretical CV and always use the practical CV.

The output of this function is an object of type RegressionResult, a named list containing:

  1. prediction_statistics A dataframe with 7 rows and a column for every model. Table 3 is an example of such a dataframe.
Table 2: Example regression results and statistics
  Prediction_set_1 Prediction_set_2
Z_score_sample -0.171248675025602 -0.799303637284865
CV 0.00615122686158584 0.00599131304613039
cv_types Practical_CV Practical_CV
P_value_shapiro 0.583158688871931 0.503865450973465
Predictor_chromosomes 11 2 5 12 10 19 17 16
Mean_test_set 1.00050469190991 1.00011356353248
CV_train_set 0.00335713985926308 0.00394053718934269
  1. control_group_Zscores A matrix containing the regression based Z-scores for the control sample

  2. focus_chromosome The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.

  3. correction_status The correction status of the control group autosomes.

  4. control_group_sample_names The sample names of the test set group.

  5. models List of the summary.lm() output for every model.

  6. potential_predictors The total pool of chromosomes where the predictors are selected from.

  7. all_control_group_Z_scores Z-scores for every sample using theoretical and practical VCs.

  8. additional_statistics Statistics for both the practical and theoretical CVs for every prediction set.

Normalized Chromosome Value


The Normalized Chromosome Value or NCV (Sehnert et al., 2011) method selects a subset of chromosomes to calculate the chromosomal fractions. The ‘best’ subset is the set which yields the lowest coefficient of variation for the chromosomal fractions of the chromosome of interest in the control group. Because a brute force approach is used to determine the best subset, which can be computationally intensive,this method is divided into two functions, prepare_ncv and calculate_ncv. prepare_ncv returns a template object (NCVTemplate) for a given chromosome of interest and the control group used. This template can be used for any number of analyses. If the control group or chromosome of interest changes, a new template must be made.
calculate_ncv uses a NCVTemplate and a NIPTSample to calculate the NCV score for the NIPTSample.

prepare_ncv

prepare_ncv(nipt_control_group, chr_focus, max_elements,
  exclude_chromosomes = NULL, include_chromosomes = NULL,
  use_test_train_set = T, size_of_train_set = 0.6)
  1. nipt_control_group The NIPTControlGroup object used in the analysis.

  2. chr_focus The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.

  3. max_elements The maximum number of denominator chromosomes.

  4. exclude_chromosomes Exclude which autosomal chromosomes as potential predictors? Default potential trisomic chromosomes 13, 18 and 21 are exluded.

  5. include_chromosomes Include potential trisomic chromosomes? Options are: chromosomes 13, 18 and 21.

  6. use_test_train_set Use a test and train set to to build the models? Default is TRUE.

  7. size_of_train_set The size of the train set expressed in a decimal. Default is 0.6 (60% of the control group samples).

The output of this function is an object of class NCVTemplate, a named list.

  1. denominators The set of ‘best’ denominators.

  2. focus_chromosome The chromosome of interest used for this NCVTemplate object.

  3. control_group_sample_names The sample names of the test set samples.

  4. correction_status The correction status of the NIPTControlGroup used.

  5. control_group_Zscores The NCV scores for the test set samples.

  6. potential_denominators The total pool of denominators the best denominators are selected from.

  7. control_group_statistics Named num of length 3, the first field being the mean (name mean), the second field is the standard deviation (name SD) and the third field is the P value of the Shapiro-Wilk test (name Shapiro_P_value).

  8. sample_names_train_set The names of the samples used in the train set.

  9. train_set_statistics Named num of length 3, the first field being the mean of the train set (name mean), the second field is the standard deviation of the train set (name SD) and the third field is the P value of the Shapiro-Wilk test of the train set (name Shapiro_P_value).

  10. train_set_Zscores Matrix containing the Z scores for the train set samples.


calculate_ncv

calculate_ncv_score(nipt_sample, ncv_template)
  1. nipt_sample The NIPTSample object that is the focus of the analysis.

  2. ncv_template An object of class NCVTemplate that is produced by function prepare_ncv

The output of this function is an object of class NCVResult, which is basically the same as a NCVTemplate object, the only difference being the NCV score of the sample appended to the list.

Control group functions

as_control_group

This functions converts a list of NIPTSample objects to a NIPTControlGroup object and return this object.

as_control_group(nipt_samples)

1 nipt_samples A list of NIPTSample objects.

This functions adds NIPTSample objects to an existing control group and returns a new NIPTControlGroup object.

add_samples_controlgroup

add_samples_controlgroup(nipt_control_group, samples_to_add)
  1. nipt_control_group The NIPTControlGroup to add the samples to.

  2. samples_to_add A list with sample(s) to add. This always needs to be a list.


remove_duplicates_controlgroup

This functions removes duplicate samples from the control group based on name. It returns a new NIPTControlGroup object.

remove_duplicates_controlgroup(nipt_control_group)
  1. nipt_control_group The NIPTControlGroup object to remove duplicates from.

remove_sample_controlgroup

This function removes a sample from the NIPTControlGroup object by name. Note that this function uses a regular expression, and if more sample_names satisfy the regular expression, they will also be removed. It returns a new NIPTControlGroup object.

remove_sample_controlgroup(samplename, nipt_control_group)
  1. samplename samplename Regular expression string. All matching samplenames are removed from the control group.

  2. nipt_control_group NIPTControlGroup object where the samples are removed from.

diagnose_control_group

This function computes a regular Z-score for every chromosome of every sample in the NIPTControlGroup object. It returns a named list with diagnostics information.

diagnose_control_group(nipt_control_group)
  1. nipt_control_group The NIPTControlGroup to diagnose.


The function returns a named list with 3 fields:

  1. Z_scores A matrix containing Z-scores for every sample and every chromosome.

  2. abberant_scores Dataframe with samplename and chromosome of Z-scores outside -3 3 range.

  3. control_group_statistics Matrix with mean, standard deviation and P value of Shapiro-Wilk test.

Examples


Preparation


Control group

Since NIPT is based on the comparison of the sample of interest to a set of samples with no known aneuploidies, the first thing that should be done before any samples can be analyzed is making this NIPTControlGroup. A NIPTControlGroup is an object and is composed of NIPTSample objects. First, all bam file paths have to be collected in a vector. This tutorial assumes all .bam files are in the same directory. To collect all bam files the following command can be used:

bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)

Subsequently, the filepaths must be fed to the function bin_bam_sample. This function loads bam files and converts them to NIPTSample objects. For example:

list_NIPT_samples <- lapply(X = bam_filepaths, bin_bam_sample, do_sort = FALSE, 
                             separate_strands = T, custom_name = NULL)

In this example we count the forward and reverse reads separately and a pre-sorted bam file is used, so the argument do_sort is ommited since the default is F.

Using the above command all samples will be loaded with the same default settings. To load the files with for instance custom names, store the filenames in a vector of the same length as the bam filepath vector and use mapply. For example:

list_NIPT_samples <- mapply(FUN = bin_bam_sample, bam_filepaths, name = names, SIMPLIFY = F)

To convert the list of NIPTSamples to a NIPTControlGroup object use:

control_group <- as_control_group(nipt_samples = list_NIPT_samples)

of course, the fastest and simplest way is to wrap the as_control_group function around the loading of samples like this:

control_group  <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample, 
                                                         do_sort = F, separate_strands = T))


or when applying custom names:

control_group  <- as_control_group(nipt_samples = mapply(FUN = bin_bam_sample, bam_filepaths, 
                                                         custom_name = names, SIMPLIFY = FALSE))


To store the control group on disk for later use:

saveRDS(object = control_group, file = "/Path/to/directory/control_group.rds")

So, in conclusion, the data control group preparation script might look something like this:

library(NIPTeR)
#Retrieve filenames
bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)
#Load files and convert to control group
control_group  <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample, 
                                                         do_sort = F, separate_strands = FALSE))
#Save control group for later
saveRDS(object = control_group, file = "/Path/to/directory/control_group.rds")

Assuming a NIPTControlGroup object has been prepared and stored on disk we can analyse our first sample. The script starts with loading the NIPTeR package:

library(NIPTeR)

After that, we can load a sample of interest with the bin_bam_sample function:

sample_of_interest <- bin_bam_sample(bam_filepath = "/path/to/sample.bam", separate_strands = T)


Figure 1: An example of a control group workflow.

A possible workflow to create a control group is shown in figure 1. In this example first a list of samples to include in the control group is created (1a) and bin counts are created for each of those samples (1b). Secondly, all samples in the control group are LOESS gc corrected (2a). Thirdly, the resulting control group is checked for samples having aberrant scores (3a) and if present, those samples are removed (3b). The resulting control group is than saved.

library(NIPTeR)
#Gather all bam filepaths in a vector. Corresponds to 1a in figure
bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)
#Load all bam files using lapply and feed the results to function as_control_group,
#converting the NIPTSamples to a NIPTControlGroup object. Corresponds to 1b in figure
control_group  <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample,
                                                         do_sort = F, separate_strands = F))
#apply a gc LOESS correction to all samples. Since this can take up to 30 seconds
#sample, doing this once for a control group and store the results can be rewarding
#in terms of analysis time. Corresponds to 2a in figure
gc_control_group <- gc_correct(nipt_object = control_group, method = "LOESS")
#Retrieve control group diagnostics. Corresponds with 3a in figure
control_group_diagnostics <- diagnose_control_group(nipt_control_group = control_group)
#Retrieve samplenames with an abberant Z score for any chromosome and remove these samples 
#from the control group. Corresponds with 3b in figure
abberant_sample_names <- unique(control_group_diagnostics$abberant_scores$Sample_name)
for (i in 1:length(abberant_sample_names)){
  control_group <- remove_sample_controlgroup(samplename = abberant_sample_names[i], 
                                              nipt_control_group = control_group)
}
#Save the gc_corrected control groups to disk. Corresponds to 4a in figure
saveRDS(object = control_group, file = "/path/to/controlgroup.rds")

Figure 2: An example of a simple workflow.

An example of a simple workflow, including a chi-squared based variation reduction and a trisomy prediction using the regression-based Z-score is shown in figure 2. Firstly, bin counts are made for the analysed sample (1a) and a relevant control group is loaded (1b). Secondly, using the control group, both the analysed sample and the control samples are chi_corrected (2a). Thirdly, the corrected control group is used to create a regression model for a trisomy 21 prediction of the NIPT sample (3a). Results can be gathered from the workspace.

library(NIPTeR)
#Load sample. Corresponds with 1a in figure
sample_of_interest <- bin_bam_sample(bam_filepath = "/Path/to/bamfile.bam", separate_strands = T)
#Load control group. Corresponds with 1b in figure
control_group <- readRDS("/Path/to/control_group.rds")
#Perform a chi-square based variation reduction on new trimmed control group and sample and
#extract data. Corresponds with 2a in figure
chi_data <- chi_correct(nipt_sample = sample_of_interest, nipt_control_group = control_group)
sample_of_interest <- chi_data$sample
control_group <- chi_data$control_group
#Perform regression for chromosome 21 with default settings, so:
#Create 4 models with 4 predictors each
#All chromosomes are potential predictors except the potential trisomic chromosomes 13, 18 and 21
#Use a test and train set where the size of the train set is 60% of the control group
#Assume at least 15% overdispersion in the data
#Dont force practical CV, so if the CV is below 1.15 * standard error of the mean use this as VC
#Corresponds with 3a in figure
regression_score_21 <- perform_regression(nipt_sample = sample_of_interest, 
                                          nipt_control_group = control_group, chromo_focus = 21)


###       Gather relevant data from the objects on the workspace     #######