Introduction to the rgnoisefilt package

The rgnoisefilt package contains filtering techniques to remove noisy samples from regression datasets. It adapts classic and recent filtering techniques for use in regression problems, and it also incorporates methods specifically designed for regression data. To do so, it draws on approaches proposed in the specialized literature, such as Martín et al. (2021) and Arnaiz-González et al. (2016).

Installation

The rgnoisefilt package can be installed in R from CRAN with the command:

install.packages("rgnoisefilt")

This command installs the dependencies of the package, including the regression algorithms that the noise filters need to operate. To access the functions of the package, load it with the R command:

library(rgnoisefilt)

Documentation

The documentation of each noise filter is available on the CRAN website. The help() command can also be used. For example, to check the documentation of the regIPF noise filter, we can use:

help(regIPF)
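
Beyond a single help page, base R offers generic ways to explore an attached package. The short sketch below assumes that rgnoisefilt is already installed; it only uses standard R functions together with the regIPF name mentioned above.

```r
library(rgnoisefilt)

# Shorthand equivalent to help(regIPF)
?regIPF

# Package index listing every documented topic
help(package = "rgnoisefilt")

# Names of the objects exported by the attached package
filters <- ls("package:rgnoisefilt")
head(filters)
```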

Usage of regression noise filters

To process noisy regression data, each noise filter in the rgnoisefilt package can be called in two standard ways:

  • Default method. It receives a data frame with the input attributes in the x argument and the output variable in the y argument (a double vector).
  • Formula class method. It receives the whole data frame (attributes and response variable) in the data argument. The response variable and the attributes used to predict it are indicated in the formula argument.

An example of how to use these two methods to filter the rock dataset with the regCNN noise filter is shown below:

data(rock)
head(rock)
#>   area    peri     shape perm
#> 1 4990 2791.90 0.0903296  6.3
#> 2 7002 3892.60 0.1486220  6.3
#> 3 7558 3930.66 0.1833120  6.3
#> 4 7352 3869.32 0.1170630  6.3
#> 5 7943 3948.54 0.1224170 17.1
#> 6 7979 4010.15 0.1670450 17.1
# Using the default method:
set.seed(9)
out.def <- regCNN(x = rock[,-ncol(rock)], y = rock[,ncol(rock)])
# Using the formula method:
set.seed(9)
out.frm <- regCNN(formula = perm ~ ., data = rock)
# Check the match of noisy indices:
all(out.def$idnoise == out.frm$idnoise)
#> [1] TRUE

Note that the $ operator is used to access the elements returned by the filter in the objects out.def and out.frm.
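
As a small illustration of the $ operator, the following sketch retrieves the idnoise element and uses it to inspect the flagged rows of the original dataset. It assumes that idnoise holds row positions in the data passed to the filter; the name rock.flagged is introduced here only for illustration.

```r
library(rgnoisefilt)

data(rock)
set.seed(9)
out.def <- regCNN(x = rock[, -ncol(rock)], y = rock[, ncol(rock)])

# Indices of the samples flagged as noisy
out.def$idnoise

# Rows of the original dataset at those positions
# (assumes idnoise refers to row positions in rock)
rock.flagged <- rock[out.def$idnoise, ]
head(rock.flagged)
```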

Output values

All regression noise filters return an object of class rfdata, which unifies the output of the methods included in the rgnoisefilt package. An rfdata object is a list containing the most relevant information about the noise filtering process:

  • xclean: a data frame with the input attributes of clean samples (without errors).
  • yclean: a double vector with the output regressand of clean samples (without errors).
  • numclean: an integer with the number of clean samples.
  • idclean: an integer vector with the indices of clean samples.
  • xnoise: a data frame with the input attributes of noisy samples (with errors).
  • ynoise: a double vector with the output regressand of noisy samples (with errors).
  • numnoise: an integer with the number of noisy samples.
  • idnoise: an integer vector with the indices of noisy samples.
  • filter: the full name of the noise filter used.
  • param: a list of the argument values.
  • call: the function call.

As an example, the structure of the rfdata object returned using the regCNN noise filter is shown below:

str(out.def)
#> List of 11
#>  $ xclean  :'data.frame':    39 obs. of  3 variables:
#>   ..$ area : int [1:39] 4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
#>   ..$ peri : num [1:39] 2792 3893 3931 3869 3949 ...
#>   ..$ shape: num [1:39] 0.0903 0.1486 0.1833 0.1171 0.1224 ...
#>  $ yclean  : num [1:39] 6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
#>  $ numclean: int 39
#>  $ idclean : num [1:39] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ xnoise  :'data.frame':    9 obs. of  3 variables:
#>   ..$ area : int [1:9] 3469 1468 3524 5267 5048 1016 5605 8793 5514
#>   ..$ peri : num [1:9] 1377 476 1189 1645 942 ...
#>   ..$ shape: num [1:9] 0.177 0.439 0.164 0.254 0.329 ...
#>  $ ynoise  : num [1:9] 100 100 100 100 1300 1300 1300 1300 580
#>  $ numnoise: int 9
#>  $ idnoise : int [1:9] 37 38 39 40 41 42 43 44 47
#>  $ filter  : chr "Condensed Nearest Neighbors"
#>  $ param   :List of 1
#>   ..$ t: num 0.2
#>  $ call    : language regCNN(x = rock[, -ncol(rock)], y = rock[, ncol(rock)])
#>  - attr(*, "class")= chr "rfdata"
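
A common next step is to recombine the clean elements into a single data frame and train a model on it. The sketch below is one possible way to do this, not a workflow prescribed by the package: rock.clean and fit are illustrative names, and lm (from the stats package) is used only as an example learner.

```r
library(rgnoisefilt)

data(rock)
set.seed(9)
out.def <- regCNN(x = rock[, -ncol(rock)], y = rock[, ncol(rock)])

# Recombine the clean attributes and the clean response into one data frame
rock.clean <- data.frame(out.def$xclean, perm = out.def$yclean)

# Basic consistency checks on the documented elements
stopifnot(nrow(rock.clean) == out.def$numclean)
stopifnot(out.def$numclean + out.def$numnoise == nrow(rock))
stopifnot(length(out.def$idnoise) == out.def$numnoise)

# Fit a linear model on the filtered data
fit <- lm(perm ~ area + peri + shape, data = rock.clean)
summary(fit)
```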

To display rfdata objects in a readable way in the R console, specific print and summary functions are implemented. The print function presents the basic information of the noise filtering process:

print(out.def)
#> 
#> ## Noise filter: 
#> Condensed Nearest Neighbors
#> 
#> ## Parameters:
#> - t = 0.2
#> 
#> ## Number of noisy and clean samples:
#> - Noisy samples: 9/48 (18.75%)
#> - Clean samples: 39/48 (81.25%)

The information offered by print is as follows:

  • The name of the regression noise filter.
  • The parameters associated with the noise filter.
  • The number of noisy and clean samples in the dataset.
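
The counts reported by print can also be obtained programmatically from the numnoise and numclean elements, for instance to log them or to compare several filters. The following is a minimal sketch; the variable names n.total, pct.noise and pct.clean are illustrative.

```r
library(rgnoisefilt)

data(rock)
set.seed(9)
out.def <- regCNN(x = rock[, -ncol(rock)], y = rock[, ncol(rock)])

# Total number of samples processed by the filter
n.total <- out.def$numnoise + out.def$numclean

# Percentage of noisy and clean samples
pct.noise <- 100 * out.def$numnoise / n.total
pct.clean <- 100 * out.def$numclean / n.total

cat(sprintf("Noisy samples: %d/%d (%.2f%%)\n", out.def$numnoise, n.total, pct.noise))
cat(sprintf("Clean samples: %d/%d (%.2f%%)\n", out.def$numclean, n.total, pct.clean))
```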

On the other hand, the summary function displays the information of the dataset processed with the noise filter along with other additional details. This function can be called by typing the following R command:

summary(out.frm, showid = TRUE)
#> 
#> ########################################################
#>  Noise filtering process: Summary
#> ########################################################
#> 
#> ## Original call:
#> regCNN(formula = perm ~ ., data = rock)
#> 
#> ## Noise filter: 
#> Condensed Nearest Neighbors
#> 
#> ## Parameters:
#> - t = 0.2
#> 
#> ## Number of noisy and clean samples:
#> - Noisy samples: 9/48 (18.75%)
#> - Clean samples: 39/48 (81.25%)
#> 
#> ## Indices of noisy samples:
#> 37, 38, 39, 40, 41, 42, 43, 44, 47

The information offered by this function is as follows:

  • The function call.
  • The name of the regression noise filter.
  • The parameters associated with the noise filter.
  • The number of noisy and clean samples in the dataset.
  • The indices of the noisy samples (if showid = TRUE).