This vignette is an explanation on how to use NIPTeR. It starts with a short introduction, the second chapter explains the two most important objects in this package, NIPTSample
and NIPTControlgroup
. The third chapter explains all the analysis functions, the arguments the function take and their output. The last chapter shows some workflow examples.
Non-Invasive Prenatal Testing (NIPT) is a method based on analysis of cell-free fetal (cff) DNA in maternal blood using next-generation sequencing. The basic approach is relatively straightforward: cell-free DNA is isolated from the mothers blood plasma and the obtained DNA fragments are sequenced. Subsequently the number of DNA fragments originating from the different chromosomes is counted, since in case of a fetal trisomy a relative increase of the fraction of reads of the trisomic chromosome is expected. Various analysis methods for trisomy prediction, variation reduction and quality control have been described in the literature. These methods can be combined to improve the sensitivity of the analysis. The NIPTeR package makes these methods available and allows for flexibility in workflow design. NIPTeR uses BAM files as input and calculates chromosomal fractions, which are used for trisomy prediction. These fractions are based on read counts of the different chromosomes. The chromosomes are divided in bins to allow for read count correction.
The atomic unit of analysis in this package is a NIPTSample
object
A NIPTSample
object is a named list of length 5, having the following fields:
autosomal_chromosome_reads
A list containing the bin count matrices for the autosomal chromosomes. If forward and reverse reads are counted separately the list contains two matrices, if the reads are counted together it contains one. In these matrices the columns represent bins and the 22 rows represent the chromosomes. For a small excerpt of such a matrix see Table 1.
correction_status_autosomal_chromosomes
Correction status of the autosomal chromosomes. Default is Uncorrected
sex_chromosome_reads
List containing bin count matrices for the sex chromosomes. If forward and reverse reads are counted separately the list contains two matrices, if the reads are counted together it contains one. In these matrices the columns represent bins and the two represent rows the chromosomes. For a small excerpt of such a matrix see Table 2.
correction_status_sex_chromosome
Correction status of the sex chromosomes. Default is Uncorrected
sample_name
The name of the sample. Default is the filename of the .bam it originates from with the path prefix and .bam suffix
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|
1 | 308 | 303 | 270 | 157 | 148 | 99 | 146 | 236 |
2 | 243 | 263 | 262 | 269 | 225 | 225 | 285 | 259 |
3 | 0 | 189 | 287 | 245 | 264 | 251 | 258 | 242 |
4 | 624 | 637 | 257 | 240 | 229 | 241 | 251 | 216 |
5 | 448 | 249 | 274 | 262 | 239 | 205 | 241 | 244 |
6 | 0 | 174 | 213 | 142 | 231 | 324 | 355 | 383 |
7 | 247 | 266 | 234 | 217 | 142 | 88 | 264 | 230 |
8 | 163 | 216 | 244 | 209 | 243 | 255 | 259 | 261 |
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|
X | 0 | 160 | 6 | 266 | 127 | 85 | 182 | 235 |
Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The panel of control samples to which the sample of interest can be compared is also modelled in an object, NIPTControlGroup
. A control group is a named list and composed of NIPTSample
objects. It has the following fields:
samples
The NIPTSample
objects
correction_status_autosomal_chromosomes
Correction status of the autosomal chromosomes of the samples. Default is Uncorrected
correction_status_sex_chromosomes
Correction status of the sex chromosomes of the samples. Default is Uncorrected
Description
Description of the control group. This can either be Generic control group or Fitted to sample name
To load a bam file, use the bin_bam_sample
function. The bam file does not need to be sorted, but its recommended to use pre-sorted bam files
bin_bam_sample(bam_filepath, do_sort = F, separate_strands = F, custom_name = NULL)
The arguments bin_bam_sample
takes:
bam_filepath
The location and filename on the file system where the bam file is stored.
do_sort
Sort the bam file? If the bam is unsorted set to true, but the use of pre-sorted bam files is recommended.
separate_strands
If set to true, reads from forward and reverse strands are counted and stored separately. This option should be used if you are planning on using regression, since this doubles the number of predictors (F+R) and distributes predictive power more equally over prediction sets since F and R strand from the same chromosome cannot be both in one prediction set.
custom_name
The sample name. Default samplename is the filename of the bam file without the .bam suffix and filepath prefix.
GC content bias is the correlation between the number of reads mapped to a specific genomic region and the GC content of this region. In NIPTeR, two GC bias correction algorithms have been implemented, the LOESS based method introduced by Chen et al. (2011) and the bin weight based method described by Fan and Quake (2010).
gc_correct(nipt_object, method, include_XY = F, span = 0.75)
nipt_object
The object that will be corrected. This can either be a NIPTSample
or a NIPTControlGroup
object.
method
To select the LOESS based method use “LOESS”, to select the bin weights based method use “bin”.
include_XY
Also perform correction on the X and Y chromosomes?
span
The span for the LOESS fit. Only applicable when LOESS method is used.
ref_genome
The reference genome used. Either “hg37” or “hg38” default = “hg37”
The output of the gc_correct function is a similar object as the input; when the input is a NIPTSample
the output will also be a (corrected) NIPTSample
, and the same applies for a NIPTControlGroup
object. . After correction the correction status of the object is set to GC_corrected.
Code snippit:
#Correct NIPTSample object using LOESS method
loess_corrected_sample <- gc_correct(nipt_object = sample_of_interest, method = "LOESS",
include_XY = F, span = 0.75)
#Correct NIPTControlGroup object using bin method
gc_bin_corrected_control_group <- gc_correct(nipt_object = control_group, method = "bin",
include_XY = T)
The matchcontrolgroup
function determines how well an NIPTSample
fits within the NIPTControlGroup
and, if needed, makes a subset NIPTControlGroup
of length n.
matchcontrolgroup(nipt_sample, nipt_control_group, mode, n_of_samples,
include_chromosomes = NULL, exclude_chromosomes = NULL)
Explanation of the arguments matchcontrolgroup
takes:
nipt_sample
The NIPTSample
object that is the focus of the analysis.
nipt_control_group
The NIPTControlGroup
object used in the analysis.
mode
The function mode. This can either be “subset” or “report”. Mode “subset” means the return value will be a new NIPTControlGroup
object containing n samples. When mode “report” is used the output is a matrix containing the sum of squares score of the differences between the chromosomal fractions of the sample and the control for every control sample, sorted in increasing score.
n_of_samples
The length of the resulting NIPTControlGroup. Only applicable if mode “subset” is used.
include_chromosomes
Include potential trisomic chromosomes into the comparison? Default = NULL, meaning chromosomes 13, 18 and 21 are not included.
exclude_chromosomes
Exclude other autosomal chromosomes besides chromosomes 13, 18 and 21? Default = NULL.
The output for mode subset is a new NIPTControlGroup
composed of n samples. The output for mode report is a matrix with a single column containing the sum of squares in ascending order.
Sum_of_squares | |
---|---|
Sample 6 | 6.766e-07 |
Sample 8 | 2.159e-06 |
Sample 1 | 3.101e-06 |
Sample 10 | 3.292e-06 |
Sample 9 | 3.588e-06 |
Sample 3 | 3.839e-06 |
Sample 5 | 4.432e-06 |
Sample 2 | 7.892e-06 |
Sample 7 | 9.737e-06 |
Sample 4 | 1.126e-05 |
Code snippit:
#Run matchcontrolgroup in mode "report"
scores_control_group <- matchcontrolgroup(nipt_sample = gc_LOESS_corrected_sample,
nipt_control_group = gc_LOESS_corrected_control_group,
mode = "report", include_chromosomes = c(13,18))
#Run matchcontrolgroup in mode "subset" and select 50 best matching samples
subset_loess_corrected_control_group <- matchcontrolgroup(nipt_sample = gc_LOESS_corrected_sample,
nipt_control_group = loess_corrected_control_group,
mode = "subset", n_of_samples = 50)
The chi-squared based variation reduction identifies overdispersed bins within the control group and corrects these bins in both the sample of interest and the control group. The function takes in a NIPTSample
and a NIPTControlGroup
object, both to be corrected. For every corresponding bin in the control group a chi-squared score is calculated and this total score is converted to a normal distribution. Corresponding bins with a normalized score above chi_cutoff (default 3.5) are corrected by dividing the number of reads by the total chi-squared score divided by degrees of freedom
chi_correct(nipt_sample, nipt_control_group, chi_cutoff = 3.5, include_XY = F)
nipt_sample
The NIPTSample
object that is the focus of the analysis.
nipt_control_group
The NIPTControlGroup
object used in the analysis.
chi_cutoff
The normalized cut-off threshold.
include_XY
Also perform correction on the X and Y chromosomes?
The output of this function is a named list containing two fields, the first field is the corrected sample ($sample
) and the second the corrected control group ($control_group
).
Code snippit:
#Apply chi-squared based variation reduction method
chi_corrected_data <- chicorrect(nipt_sample = gc_LOESS_corrected_sample,
nipt_control_group = subset_loess_corrected_control_group)
#Extract sample and control group
loess_chi_corrected_sample <- chi_corrected_data$sample
subset_loess_chi_corrected_control_group <- chi_corrected_data$control_group
In the Z-score approach, introduced by Chiu et al in 2008, the chromosomal fraction of interest of a sample is compared to the chromosomal fractions of interest of the reference samples, the NIPTControlGroup
object.
calculate_z_score(nipt_sample, nipt_control_group, chromo_focus)
This function takes three arguments:
nipt_sample
The NIPTSample
object that is the focus of the analysis.
nipt_control_group
The NIPTControlGroup
object used in the analysis.
chromo_focus
The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.
The output of the function is an object of class ZscoreResult
. It is a named list containing seven fields:
sample_Zscore
A numeric, The Z score for the sample of interest for the sample of interest.
control_group_statistics
Named num of length 3, the first field being the mean (name mean), the second field is the standard deviation (name SD) and the third field is the P value of the Shapiro-Wilk test (name Shapiro_P_value).
control_group_Zscores
Matrix containing the Z scores of the chromosome of interest for all used control samples.
focus_chromosome
The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.
control_group_sample_names
The sample names of all control group samples used in the analysis.
correction_status
The correction status of the control group.
sample_name
The sample_name of the sample of interest.
Code snippit:
#Calculate Z score for chromosome 13
z_score_result_13 <- calculate_z_score(nipt_sample = loess_chi_corrected_sample,
nipt_control_group = subset_loess_chi_corrected_control_group,
chromo_focus = 13)
The regression based Z-score builds n models with m predictors using stepwise regression with forward selection. The models are used to predict the chromosomal fraction of interest, for the sample and for the control group. The observed fractions are then divided by the expected fraction, and Z-scores are calculated over these ratios. The Z-score is calculated by subtracting one from the ratio of the sample and dividing this result by the coefficient of variation. The coefficient of variation (CV) can either be the Practical or Theoretical CV. The Theoretical CV is the standard error multiplied by the overdispersion. Theoretically, the CV cannot be lower than the standard error of the mean. If it is case the CV is lower than Theoretical CV, then the Theoretical CV is used.
perform_regression(nipt_sample, nipt_control_group, chromo_focus,
n_models = 4, n_predictors = 4, exclude_chromosomes = NULL,
include_chromosomes = NULL, use_test_train_set = T,
size_of_train_set = 0.6, overdispersion_rate = 1.15,
force_practical_vc = F)
nipt_sample
The NIPTSample
object that is the focus of the analysis.
nipt_control_group
The NIPTControlGroup
object used in the analysis.
chromo_focus
The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.
n_models
Number of linear models to be made. Default setting is 4 models.
n_predictors
The number of predictors each model contains. Default is 4.
exclude_chromosomes
Exclude which autosomal chromosomes as potential predictors? Default potential trisomic chromosomes 13, 18 and 21 are exluded.
include_chromosomes
Include potential trisomic chromosomes? Options are: chromosomes 13, 18 and 21.
use_test_train_set
Use a test and train set to build the models? Default is TRUE.
size_of_train_set
The size of the train set expressed in a decimal. Default is 0.6 (60% of the control samples).
overdispersion_rate
The standard error of the mean is multiplied by this factor.
force_practical_vc
Ignore the theoretical CV and always use the practical CV.
The output of this function is an object of type RegressionResult
, a named list containing:
prediction_statistics
A dataframe with 7 rows and a column for every model. Table 3 is an example of such a dataframe.Prediction_set_1 | Prediction_set_2 | |
---|---|---|
Z_score_sample | -0.171248675025602 | -0.799303637284865 |
CV | 0.00615122686158584 | 0.00599131304613039 |
cv_types | Practical_CV | Practical_CV |
P_value_shapiro | 0.583158688871931 | 0.503865450973465 |
Predictor_chromosomes | 11 2 5 12 | 10 19 17 16 |
Mean_test_set | 1.00050469190991 | 1.00011356353248 |
CV_train_set | 0.00335713985926308 | 0.00394053718934269 |
control_group_Zscores
A matrix containing the regression based Z-scores for the control sample
focus_chromosome
The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.
correction_status
The correction status of the control group autosomes.
control_group_sample_names
The sample names of the test set group.
models
List of the summary.lm() output for every model.
potential_predictors
The total pool of chromosomes where the predictors are selected from.
all_control_group_Z_scores
Z-scores for every sample using theoretical and practical VCs.
additional_statistics
Statistics for both the practical and theoretical CVs for every prediction set.
The Normalized Chromosome Value or NCV (Sehnert et al., 2011) method selects a subset of chromosomes to calculate the chromosomal fractions. The ‘best’ subset is the set which yields the lowest coefficient of variation for the chromosomal fractions of the chromosome of interest in the control group. Because a brute force approach is used to determine the best subset, which can be computationally intensive,this method is divided into two functions, prepare_ncv
and calculate_ncv
. prepare_ncv
returns a template object (NCVTemplate) for a given chromosome of interest and the control group used. This template can be used for any number of analyses. If the control group or chromosome of interest changes, a new template must be made.
calculate_ncv
uses a NCVTemplate and a NIPTSample to calculate the NCV score for the NIPTSample.
prepare_ncv(nipt_control_group, chr_focus, max_elements,
exclude_chromosomes = NULL, include_chromosomes = NULL,
use_test_train_set = T, size_of_train_set = 0.6)
nipt_control_group
The NIPTControlGroup
object used in the analysis.
chr_focus
The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.
max_elements
The maximum number of denominator chromosomes.
exclude_chromosomes
Exclude which autosomal chromosomes as potential predictors? Default potential trisomic chromosomes 13, 18 and 21 are exluded.
include_chromosomes
Include potential trisomic chromosomes? Options are: chromosomes 13, 18 and 21.
use_test_train_set
Use a test and train set to to build the models? Default is TRUE.
size_of_train_set
The size of the train set expressed in a decimal. Default is 0.6 (60% of the control group samples).
The output of this function is an object of class NCVTemplate, a named list.
denominators
The set of ‘best’ denominators.
focus_chromosome
The chromosome of interest used for this NCVTemplate
object.
control_group_sample_names
The sample names of the test set samples.
correction_status
The correction status of the NIPTControlGroup
used.
control_group_Zscores
The NCV scores for the test set samples.
potential_denominators
The total pool of denominators the best denominators are selected from.
control_group_statistics
Named num of length 3, the first field being the mean (name mean), the second field is the standard deviation (name SD) and the third field is the P value of the Shapiro-Wilk test (name Shapiro_P_value).
sample_names_train_set
The names of the samples used in the train set.
train_set_statistics
Named num of length 3, the first field being the mean of the train set (name mean), the second field is the standard deviation of the train set (name SD) and the third field is the P value of the Shapiro-Wilk test of the train set (name Shapiro_P_value).
train_set_Zscores
Matrix containing the Z scores for the train set samples.
calculate_ncv_score(nipt_sample, ncv_template)
nipt_sample
The NIPTSample
object that is the focus of the analysis.
ncv_template
An object of class NCVTemplate that is produced by function prepare_ncv
The output of this function is an object of class NCVResult, which is basically the same as a NCVTemplate object, the only difference being the NCV score of the sample appended to the list.
This functions converts a list of NIPTSample
objects to a NIPTControlGroup
object and return this object.
as_control_group(nipt_samples)
1 nipt_samples
A list of NIPTSample
objects.
This functions adds NIPTSample
objects to an existing control group and returns a new NIPTControlGroup
object.
add_samples_controlgroup(nipt_control_group, samples_to_add)
nipt_control_group
The NIPTControlGroup
to add the samples to.
samples_to_add
A list with sample(s) to add. This always needs to be a list.
This functions removes duplicate samples from the control group based on name. It returns a new NIPTControlGroup
object.
remove_duplicates_controlgroup(nipt_control_group)
nipt_control_group
The NIPTControlGroup
object to remove duplicates from. This function removes a sample from the NIPTControlGroup
object by name. Note that this function uses a regular expression, and if more sample_names satisfy the regular expression, they will also be removed. It returns a new NIPTControlGroup
object.
remove_sample_controlgroup(samplename, nipt_control_group)
samplename
samplename Regular expression string. All matching samplenames are removed from the control group.
nipt_control_group
NIPTControlGroup
object where the samples are removed from.
This function computes a regular Z-score for every chromosome of every sample in the NIPTControlGroup
object. It returns a named list with diagnostics information.
diagnose_control_group(nipt_control_group)
nipt_control_group
The NIPTControlGroup
to diagnose.The function returns a named list with 3 fields:
Z_scores
A matrix containing Z-scores for every sample and every chromosome.
abberant_scores
Dataframe with samplename and chromosome of Z-scores outside -3 3 range.
control_group_statistics
Matrix with mean, standard deviation and P value of Shapiro-Wilk test.
Since NIPT is based on the comparison of the sample of interest to a set of samples with no known aneuploidies, the first thing that should be done before any samples can be analyzed is making this NIPTControlGroup
. A NIPTControlGroup
is an object and is composed of NIPTSample
objects. First, all bam file paths have to be collected in a vector. This tutorial assumes all .bam files are in the same directory. To collect all bam files the following command can be used:
bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)
Subsequently, the filepaths must be fed to the function bin_bam_sample
. This function loads bam files and converts them to NIPTSample
objects. For example:
list_NIPT_samples <- lapply(X = bam_filepaths, bin_bam_sample, do_sort = FALSE,
separate_strands = T, custom_name = NULL)
In this example we count the forward and reverse reads separately and a pre-sorted bam file is used, so the argument do_sort is ommited since the default is F.
Using the above command all samples will be loaded with the same default settings. To load the files with for instance custom names, store the filenames in a vector of the same length as the bam filepath vector and use mapply
. For example:
list_NIPT_samples <- mapply(FUN = bin_bam_sample, bam_filepaths, name = names, SIMPLIFY = F)
To convert the list of NIPTSamples
to a NIPTControlGroup
object use:
control_group <- as_control_group(nipt_samples = list_NIPT_samples)
of course, the fastest and simplest way is to wrap the as_control_group
function around the loading of samples like this:
control_group <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample,
do_sort = F, separate_strands = T))
or when applying custom names:
control_group <- as_control_group(nipt_samples = mapply(FUN = bin_bam_sample, bam_filepaths,
custom_name = names, SIMPLIFY = FALSE))
To store the control group on disk for later use:
saveRDS(object = control_group, file = "/Path/to/directory/control_group.rds")
So, in conclusion, the data control group preparation script might look something like this:
library(NIPTeR)
#Retrieve filenames
bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)
#Load files and convert to control group
control_group <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample,
do_sort = F, separate_strands = FALSE))
#Save control group for later
saveRDS(object = control_group, file = "/Path/to/directory/control_group.rds")
Assuming a NIPTControlGroup
object has been prepared and stored on disk we can analyse our first sample. The script starts with loading the NIPTeR
package:
library(NIPTeR)
After that, we can load a sample of interest with the bin_bam_sample
function:
sample_of_interest <- bin_bam_sample(bam_filepath = "/path/to/sample.bam", separate_strands = T)
library(NIPTeR)
#Gather all bam filepaths in a vector. Corresponds to 1a in figure
bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)
#Load all bam files using lapply and feed the results to function as_control_group,
#converting the NIPTSamples to a NIPTControlGroup object. Corresponds to 1b in figure
control_group <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample,
do_sort = F, separate_strands = F))
#apply a gc LOESS correction to all samples. Since this can take up to 30 seconds
#sample, doing this once for a control group and store the results can be rewarding
#in terms of analysis time. Corresponds to 2a in figure
gc_control_group <- gc_correct(nipt_object = control_group, method = "LOESS")
#Retrieve control group diagnostics. Corresponds with 3a in figure
control_group_diagnostics <- diagnose_control_group(nipt_control_group = control_group)
#Retrieve samplenames with an abberant Z score for any chromosome and remove these samples
#from the control group. Corresponds with 3b in figure
abberant_sample_names <- unique(control_group_diagnostics$abberant_scores$Sample_name)
for (i in 1:length(abberant_sample_names)){
control_group <- remove_sample_controlgroup(samplename = abberant_sample_names[i],
nipt_control_group = control_group)
}
#Save the gc_corrected control groups to disk. Corresponds to 4a in figure
saveRDS(object = control_group, file = "/path/to/controlgroup.rds")
library(NIPTeR)
#Load sample. Corresponds with 1a in figure
sample_of_interest <- bin_bam_sample(bam_filepath = "/Path/to/bamfile.bam", separate_strands = T)
#Load control group. Corresponds with 1b in figure
control_group <- readRDS("/Path/to/control_group.rds")
#Perform a chi-square based variation reduction on new trimmed control group and sample and
#extract data. Corresponds with 2a in figure
chi_data <- chi_correct(nipt_sample = sample_of_interest, nipt_control_group = control_group)
sample_of_interest <- chi_data$sample
control_group <- chi_data$control_group
#Perform regression for chromosome 21 with default settings, so:
#Create 4 models with 4 predictors each
#All chromosomes are potential predictors except the potential trisomic chromosomes 13, 18 and 21
#Use a test and train set where the size of the train set is 60% of the control group
#Assume at least 15% overdispersion in the data
#Dont force practical CV, so if the CV is below 1.15 * standard error of the mean use this as VC
#Corresponds with 3a in figure
regression_score_21 <- perform_regression(nipt_sample = sample_of_interest,
nipt_control_group = control_group, chromo_focus = 21)
### Gather relevant data from the objects on the workspace #######