The `naflex` R package provides additional flexibility for
handling missing values in summary functions beyond the existing options
(`na.rm = TRUE`/`FALSE`) available in base R.

Most summary functions in base R, e.g. `mean`, provide only the
two extreme options for handling missing values:

- calculate the summary ignoring all missing values
  (`na.rm = TRUE`), or
- require no missing values for the summary to be calculated
  (`na.rm = FALSE`).

In many cases, something in between these two extremes is more
appropriate. For example, you may wish to give a summary statistic if
less than `5%` of values are missing.

`naflex` provides helper functions to facilitate this
flexibility. It allows for omitting missing values conditionally, using
four types of checks:

- a maximum proportion of missing values allowed
- a maximum number of missing values allowed
- a maximum number of consecutive missing values allowed, and
- a minimum number of non-missing values required.

The motivating application for producing this package was the
calculation of *Climate Normals*: long-term averages of surface
meteorological measurements, e.g. total rainfall and mean temperature,
that provide benchmark information about the climate at specific
locations. The World Meteorological Organization (WMO) Guidelines on the
Calculation of Climate Normals^{1}
provide recommendations to standardise these calculations across
countries, including the handling of missing values.

For example, the Guidelines recommend that a monthly mean value calculated
from daily values should only be calculated when there are no more than
`10` missing values in the month and no more than `4` days of consecutive
missing values. Adhering to such rules using base R requires further
calculations, which increases the complexity and length of code. The aim
of `naflex` is to make it easier to apply such rules routinely and
efficiently as part of calculations.

Install the current release from CRAN:

`install.packages("naflex")`

Or install the latest development version from GitHub:

```
# install.packages("devtools")
devtools::install_github("dannyparsons/naflex")
```

The main function in `naflex` is `na_omit_if`.

When wrapped around a vector in a summary function, `na_omit_if`
ensures that the summary value is calculated when the checks pass and
that `NA` is returned otherwise. The example below shows how to
calculate the `mean` conditional on the proportion of missing values.

```
library(naflex)
x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)
# Calculate if 30% or less missing values
mean(na_omit_if(x, prop = 0.3))
#> [1] 4.142857
# Calculate if 20% or less missing values
mean(na_omit_if(x, prop = 0.2))
#> [1] NA
```

Four types of checks are available:

- `prop`: the maximum proportion (0 to 1) of missing values allowed
- `n`: the maximum number of missing values allowed
- `consec`: the maximum number of consecutive missing values allowed, and
- `n_non`: the minimum number of non-missing values required.
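As a quick illustration of the four arguments, the sketch below applies each check on its own to the example vector `x` from above (redefined here so the block is self-contained). The thresholds are arbitrary values chosen for illustration and are deliberately kept away from the boundary cases.

```r
library(naflex)

x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)

# x has 3 missing values (30%), a longest run of 2 consecutive
# missing values, and 7 non-missing values.

# Passes: 30% missing is within the 50% allowed
mean(na_omit_if(x, prop = 0.5))
#> [1] 4.142857
# Fails: 3 missing values exceeds the 2 allowed
mean(na_omit_if(x, n = 2))
#> [1] NA
# Passes: a run of 2 is within the 3 consecutive allowed
mean(na_omit_if(x, consec = 3))
#> [1] 4.142857
# Fails: only 7 non-missing values, but 8 are required
mean(na_omit_if(x, n_non = 8))
#> [1] NA
```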

If multiple checks are specified, all checks must pass for missing
values to be removed. For example, although there are fewer than 4
missing values in `x`, there are two consecutive missing values, hence
the `consec = 1` check fails and the result in the example below is `NA`.

```
# Calculate if 4 or less missing values and 1 or less consecutive missing values
mean(na_omit_if(x, n = 4, consec = 1))
#> [1] NA
```
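Combining checks in this way is how the WMO rule described earlier (no more than 10 missing days and no more than 4 consecutive missing days) might be applied to monthly means. The following is a minimal sketch using base R's `tapply`; the temperature values, the month labels and the variable names `temp`, `month` and `monthly` are made up for illustration.

```r
library(naflex)

# Hypothetical daily temperatures for two 30-day months (made-up values)
temp <- rep(c(20, 22, 24), times = 20)
month <- rep(c("Apr", "Jun"), each = 30)

# April: 3 scattered missing days, which satisfies both limits
temp[c(2, 10, 25)] <- NA
# June: 5 consecutive missing days, which breaks the consecutive limit
temp[30 + 11:15] <- NA

# Monthly mean, calculated only if there are at most 10 missing days
# and at most 4 consecutive missing days in the month
monthly <- tapply(temp, month, function(v) {
  mean(na_omit_if(v, n = 10, consec = 4))
})
monthly
```

With this data, April should get a numeric mean while June should be `NA`, since its run of 5 missing days exceeds the `consec = 4` limit.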

The `%>%` (“pipe”) operator from `magrittr` can be used to make the code
look clearer and more familiar. The beginning of the line is now the
same as in standard R, and `na_omit_if` moves after `x`, where it
appears more like an option within the function, like `na.rm`, which is
how you might think about `na_omit_if` conceptually in this case.

```
require(magrittr)
#> Loading required package: magrittr
sum(x %>% na_omit_if(prop = 0.25))
#> [1] NA
```

Note that you should not use `na_omit_if` together with `na.rm = TRUE`
in the summary function, since this will always remove missing values,
so the checks are effectively ignored.
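The short sketch below illustrates this pitfall with the same example vector `x` (redefined here so the block is self-contained).

```r
library(naflex)

x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)

# The 20% check fails, so na_omit_if() leaves the missing values in
# place and the mean is NA, as intended
mean(na_omit_if(x, prop = 0.2))
#> [1] NA

# Adding na.rm = TRUE makes mean() drop the missing values anyway,
# so the failed check no longer has any effect
mean(na_omit_if(x, prop = 0.2), na.rm = TRUE)
#> [1] 4.142857
```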

How `naflex` works & more details

`na_omit_if` works by removing the missing values from `x` if the
checks pass, and leaving `x` unmodified otherwise.

```
# Missing values removed
na_omit_if(x, n = 4)
#> [1] 1 3 3 2 5 8 7
#> attr(,"na.action")
#> [1] 3 4 7
#> attr(,"class")
#> [1] "omit"
```

`na_omit_if` can be thought of as an extension of `stats::na.omit`.
If missing values are removed, an `na.action` attribute and `omit`
class are added for consistency with `stats::na.omit`.

```
# Missing values not removed, x is unmodified
na_omit_if(x, n = 2)
#> [1] 1 3 NA NA 3 2 NA 5 8 7
```

A further set of four `na_omit_if_*` functions is provided for doing
the same thing but restricted to a single check,
e.g. `na_omit_if_n(x, 2)`.

`na_check` has the same parameters as `na_omit_if` but returns a
logical indicating whether the checks pass. It is used internally in
`na_omit_if` and may also be a useful helper function.

```
if (na_check(x, n = 4, consec = 1)) "NA checks pass" else "NA checks fail"
#> [1] "NA checks fail"
```

A set of four `na_check_*` functions is also provided for doing the
same thing restricted to a single check, e.g. `na_check_prop(x, 0.2)`.

Finally, `naflex` provides a set of helper functions for calculating
the missing value properties used in these checks.

```
na_prop(x)
#> [1] 0.3
na_n(x)
#> [1] 3
na_consec(x)
#> [1] 2
na_non_na(x)
#> [1] 7
```

In base R, this functionality can often be achieved using a
combination of `ifelse`, `is.na`, `rle` and the option `na.rm = TRUE`.
`naflex` aims to simplify, shorten and standardise this process for users.

For example, the equivalent of:

```
mean(na_omit_if(x, n = 4, prop = 0.2))
#> [1] NA
```

in base R is:

```
ifelse(sum(is.na(x)) <= 4 && mean(is.na(x)) <= 0.2, mean(x, na.rm = TRUE), NA)
#> [1] NA
```

The check for the longest sequence of consecutive missing values is more
complex and requires clever use of the `rle` function. For example,

```
mean(na_omit_if(x, consec = 5))
#> [1] 4.142857
```

is equivalent to:

```
r <- rle(is.na(x))
m <- r$lengths[r$values]
ifelse(max(m) <= 5, mean(x, na.rm = TRUE), NA)
#> [1] 4.142857
```
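To show how these base R pieces fit together, the sketch below gathers them into a single helper. The function `omit_if_base` and its internals are hypothetical and written only for illustration; this is not how `naflex` is implemented. It assumes that every threshold is inclusive, as in the examples above, and it treats a vector with no missing values as having a longest run of 0, which avoids calling `max()` on an empty vector.

```r
# Hypothetical helper (not part of naflex): omit missing values from x
# only if every supplied check passes; otherwise return x unchanged.
omit_if_base <- function(x, prop = NULL, n = NULL, consec = NULL, n_non = NULL) {
  miss <- is.na(x)
  runs <- rle(miss)
  # Longest run of consecutive missing values (0 if nothing is missing)
  longest <- max(c(0L, runs$lengths[runs$values]))
  pass <- (is.null(prop) || mean(miss) <= prop) &&
    (is.null(n) || sum(miss) <= n) &&
    (is.null(consec) || longest <= consec) &&
    (is.null(n_non) || sum(!miss) >= n_non)
  if (pass) x[!miss] else x
}

x <- c(1, 3, NA, NA, 3, 2, NA, 5, 8, 7)

# The proportion check fails (30% missing > 20%), so x keeps its NAs
mean(omit_if_base(x, n = 4, prop = 0.2))
#> [1] NA
# The consecutive check passes (longest run of 2 <= 5)
mean(omit_if_base(x, consec = 5))
#> [1] 4.142857
```

Unlike `na_omit_if`, this sketch does not add the `na.action` attribute or `omit` class to the result.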

^{1} WMO Guidelines on the Calculation of Climate Normals