Cleaning data • impactR

Why?

This vignette explains how to clean data with impactR. Note that the cleaning functions are tied to the data collection monitoring functions. If you do not use the monitoring functions, you will need to format your cleaning log so that it matches what the functions shown below expect.

# If impactR is not yet installed
# devtools::install_github("impactR")
library(impactR)

Let’s import the dataset to clean.

# Load dataset in environment
data(data)

# Convert the dataset to a tibble
data <- data |> tibble::as_tibble()

Import the ‘survey’ sheet:

data(survey)
survey <- survey |> tibble::as_tibble()

The first thing to do with the ‘survey’ object, as it will be used elsewhere, is to split the ‘type’ column into two columns.

# 'col_to_split' is the column to split; the other arguments are shown with their default values
survey <- survey |>
  split_survey(
    col_to_split = "type",
    into = c("type", "list_name"),
    sep = " ",
    fill = "right"
  )
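
To make the effect of this split concrete, here is a minimal base R sketch on a toy survey sheet. It is not impactR's implementation: the toy rows and the name survey_toy are made up for illustration. The argument names of split_survey() resemble those of tidyr::separate(), but whether it wraps that function is an assumption.

```r
# Toy 'survey' sheet in XLSForm style (rows invented for illustration)
survey_toy <- data.frame(
  type = c("select_one yes_no", "integer", "select_multiple needs"),
  name = c("consent", "age", "r_besoin_assistance"),
  stringsAsFactors = FALSE
)

# Split 'type' on the first space-separated pieces
parts <- strsplit(survey_toy$type, " ", fixed = TRUE)

# Second piece becomes 'list_name'; rows without one get NA
# (the behaviour one would expect from fill = "right")
survey_toy$list_name <- vapply(
  parts,
  function(p) if (length(p) >= 2) p[2] else NA_character_,
  character(1)
)

# First piece stays in 'type'
survey_toy$type <- vapply(parts, function(p) p[1], character(1))

survey_toy
```

After the split, 'type' holds only the question type and 'list_name' holds the name of the choice list, which is what makes it possible to look up choices for a given question.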

Import the ‘choices’ sheet:

data(choices)
choices <- choices |> tibble::as_tibble()

Import the cleaning log:

data(cleaning_log)
cleaning_log <- cleaning_log |> tibble::as_tibble()

Clean data

Cleaning takes two steps.

1 - Check that the cleaning log is at least minimally filled in:

check_cleaning_log(cleaning_log, data, uuid, "autre_")

## Warning in check_cleaning_log(cleaning_log, data, uuid, "autre_"): The following id_col and question_name have remaining bits from the template such as 'Fill in' in column 'feedback', please check:
##  x7: i_consensus
##  x8: i_consensus
##  x1: survey_duration
##  x2: survey_duration
##  x3: survey_duration
##  x4: survey_duration
##  x5: survey_duration
##  x6: survey_duration
##  x7: survey_duration
##  x8: survey_duration
##  x2: c_chef_menage_age
##  x2: i_enquete_age
##  x1: c_n_chef_menage_age
##  x2: c_n_chef_menage_age
##  x1: c_chef_menage_age

## Warning in check_cleaning_log(cleaning_log, data, uuid, "autre_"): The following id_col and question_name have remaining bits from the template such as 'Fill in' in column 'new_value', please check:
##  x7: autre_r_besoin_assistance
##  x2: i_enquete_age
##  x2: c_n_chef_menage_age
##  x1: c_chef_menage_age

## NULL

# A return value of NULL means the cleaning log is OK
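
The warnings above list entries that still contain template text such as 'Fill in'. To find those rows yourself before fixing them, something like the following base R sketch may help. The toy log is invented for illustration; the column names 'question_name', 'feedback' and 'new_value' are taken from the warnings, while 'uuid' as the identifier column is an assumption.

```r
# Toy cleaning log (values invented for illustration)
log_toy <- data.frame(
  uuid = c("x1", "x2", "x7"),
  question_name = c("survey_duration", "i_enquete_age", "autre_r_besoin_assistance"),
  feedback = c("Fill in", "Verified with enumerator", "Checked"),
  new_value = c("", "24", "Fill in"),
  stringsAsFactors = FALSE
)

# Flag rows where either column still contains the template text
has_residue <- grepl("Fill in", log_toy$feedback, fixed = TRUE) |
  grepl("Fill in", log_toy$new_value, fixed = TRUE)

# Keep only the identifier and question name of the flagged rows
residue_rows <- log_toy[has_residue, c("uuid", "question_name")]
residue_rows
```

The same filter can be run on the real cleaning_log object, provided its column names match those assumed here.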

2 - Use the clean_all() function:

cleaned_data <- clean_all(data, cleaning_log, survey, choices, uuid, "autre_")

## Warning in check_cleaning_log(log, .tbl, {: The following id_col and question_name have remaining bits from the template such as 'Fill in' in column 'feedback', please check:
##  x7: i_consensus
##  x8: i_consensus
##  x1: survey_duration
##  x2: survey_duration
##  x3: survey_duration
##  x4: survey_duration
##  x5: survey_duration
##  x6: survey_duration
##  x7: survey_duration
##  x8: survey_duration
##  x2: c_chef_menage_age
##  x2: i_enquete_age
##  x1: c_n_chef_menage_age
##  x2: c_n_chef_menage_age
##  x1: c_chef_menage_age

## Warning in check_cleaning_log(log, .tbl, {: The following id_col and question_name have remaining bits from the template such as 'Fill in' in column 'new_value', please check:
##  x7: autre_r_besoin_assistance
##  x2: i_enquete_age
##  x2: c_n_chef_menage_age
##  x1: c_chef_menage_age

## Warning: There is no corresponding list_name in choices for col: 'r_besoin_assistance'.
## An empty vector or an empty tibble is returned.

## Warning in Sys.timezone(): unable to identify current timezone 'H':
## please set environment variable 'TZ'

The cleaned_data object is the cleaned dataset, with:

the interviews to be deleted removed
the duplicated interviews removed
the values to modify modified
the ‘other’ child and parent values to modify or recode updated
the 0/1 multiple-choice columns updated to reflect the recodings and modifications.
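
As an illustration of the "values to modify" step only, the sketch below applies a toy cleaning log to a toy dataset in base R. It is not how clean_all() is implemented: apply_log() is a hypothetical helper, the data are invented, and 'uuid' as the identifier column is an assumption ('question_name' and 'new_value' appear in the warnings above).

```r
# Toy raw data: one row per interview (values invented for illustration)
raw_toy <- data.frame(
  uuid = c("x1", "x2", "x3"),
  c_chef_menage_age = c(35, 240, 51),
  stringsAsFactors = FALSE
)

# Toy log: one row per correction
log_toy <- data.frame(
  uuid = "x2",
  question_name = "c_chef_menage_age",
  new_value = "24",
  stringsAsFactors = FALSE
)

# Hypothetical helper: overwrite each (uuid, question_name) cell with new_value
apply_log <- function(df, log, id_col = "uuid") {
  for (i in seq_len(nrow(log))) {
    rows <- which(df[[id_col]] == log[[id_col]][i])
    col <- log$question_name[i]
    # Convert the new value to the class of the target column before assigning
    df[[col]][rows] <- methods::as(log$new_value[i], class(df[[col]])[1])
  }
  df
}

cleaned_toy <- apply_log(raw_toy, log_toy)
cleaned_toy$c_chef_menage_age
# 35 24 51
```

The real function also handles deletions, duplicates, and the recoding of "other" answers and 0/1 columns, none of which this sketch attempts.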