Why?
This document aims at providing the workflow that may arise from and
through the use of impactR
. Let’s load the package.
Below, let’s use the ‘airports’ dataset from package
nycflights13
# Load dataset in environment
data(data)
# Show the first lines and types of airports' tibble
data <- data |> tibble::as_tibble()
data
## Warning in as.POSIXlt.POSIXct(x, tz): unable to identify current timezone 'H':
## please set environment variable 'TZ'
## # A tibble: 30 × 52
## uuid cluster weights start end
## <chr> <chr> <dbl> <dttm> <dttm>
## 1 x1 wendou_1 0.8 2021-10-18 09:08:24 2021-10-18 10:01:35
## 2 x2 wendou_1 0.8 2021-10-20 09:14:47 2021-10-20 10:05:23
## 3 x3 wendou_2 0.8 2021-10-20 10:11:14 2021-10-20 11:00:07
## 4 x4 wendou_2 0.8 2021-10-18 09:08:24 2021-10-20 12:02:57
## 5 x5 wendou_1 0.8 2021-10-20 10:11:14 2021-10-20 15:02:31
## 6 x6 wendou_2 0.8 2021-10-20 14:14:36 2021-10-20 16:04:35
## 7 x7 wendou_1 0.8 2021-10-20 15:09:11 2021-10-20 10:26:34
## 8 x8 wendou_1 0.8 2021-10-20 09:44:15 2021-10-20 11:12:48
## 9 x9 gnarala_1 1.2 2021-10-20 11:12:05 2021-10-21 11:12:48
## 10 x10 gnarala_2 1.2 2021-10-20 14:14:36 2021-10-22 11:12:48
## # … with 20 more rows, and 47 more variables: today <dttm>, deviceid <chr>,
## # i_enum_id <dbl>, i_enum_genre <chr>, i_admin2 <chr>, i_ville <chr>,
## # i_secteur <chr>, i_info_secteur <chr>, i_zad <chr>, i_zad_1 <chr>,
## # i_consensus <chr>, i_consensus_note <chr>, i_enquete_age <dbl>,
## # i_enquete_genre <chr>, i_statut <chr>, i_note_statut <lgl>,
## # i_statut_1 <chr>, c_note <lgl>, c_chef_menage <chr>,
## # c_n_chef_menage_genre <chr>, c_n_chef_menage_age <dbl>, …
Most of impactR
’s functions are written with the
assumption that the provided data may be coerced to a tibble, since it
extensively use the tidyverse.
Let’s sat we have a survey sheet composed as:
First thing to do with ‘survey’ as it will be used elsewhere: separate the type column.
# All are already defaults apart from 'col_to_split'
survey <- survey |>
split_survey(
col_to_split = "type",
into = c("type", "list_name"),
sep = " ",
fill = "right")
Monitor data collection
Check for outliers
In order to obtain the so-called outliers (for numerical variables),
you can use the make_log_outlier
function which allows you
to obtain two types of outliers: the deviation from the mean using the
outliers_sd
function and the interquartile range using the
outliers_iqr
function.
# Helper to get numeric columns
# numeric_cols(data) # Get all numeric columns: it includes for example numeric enumerator ids or gps locations
# numeric_cols(data, survey) # Check for numeric columns using the survey sheet.
# Get IQR outliers (1.5 default rule)
# outliers_iqr(data, col = i_enquete_age,times = 1.5, id_col = uuid) # id_col is usually uuid with Kobo
# Get standard deviation outliers
# outliers_sd(data, i_enquete_age, times = 3, id_col = uuid)
# Create the full log of outliers of all numeri columns
log_outliers <- make_log_outlier(data, survey, id_col = uuid, today, i_enum_id, i_zad)
log_outliers
## # A tibble: 2 × 16
## id_check today i_enu…¹ i_zad uuid quest…² quest…³ why feedb…⁴
## <chr> <dttm> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 outlier 2021-10-18 00:00:00 8 wend… x1 c_n_ch… Quel e… Outl… Fill in
## 2 outlier 2021-10-20 00:00:00 3 wend… x2 c_n_ch… Quel e… Outl… Fill in
## # … with 7 more variables: action <chr>, old_value <dbl>, new_value <chr>,
## # type <chr>, other_parent_question <chr>, other_new_value <chr>,
## # checkdate <chr>, and abbreviated variable names ¹i_enum_id, ²question_name,
## # ³question_label, ⁴feedback
Check for others
This functions needs the Kobo question with “other” answer to be defined as “variable” for the parent question and “other_variable” for the parent question. “other_” maybe different and is defined in the following question by the ‘other’ arg, being the character pattern. In the example that follows, it is “autre_”.
# Get other answers
other_cols <- other_cols(data, "autre_", id_col = uuid)
# Get other parent answers
other_parent_cols(data, other_cols, "autre_", id_col = uuid)
## # A tibble: 4 × 3
## uuid other_parent_question other_old_value
## <chr> <chr> <chr>
## 1 x15 r_besoin_assistance abri_bna secal autre
## 2 x22 h_3_type_latrine latrine_prive autre
## 3 x5 h_3_type_latrine latrine_prive_partage autre
## 4 x7 r_besoin_assistance abri_bna secal autre
# Fabriquer le journal de nettoyage des "autres"
log_others <- make_log_other(data, survey, "autre_", uuid, i_enum_id, i_zad)
log_others
## # A tibble: 4 × 16
## id_check i_enum_id i_zad uuid questi…¹ quest…² why feedb…³ action old_v…⁴
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 autre_ 3 wendou x5 autre_h… Lesque… Othe… Fill in check Aussi …
## 2 autre_ 12 wendou x7 autre_r… De que… Othe… Fill in check Un mét…
## 3 autre_ 3 gnarala x15 autre_r… De que… Othe… Fill in check Electr…
## 4 autre_ 5 gnarala x22 autre_h… Lesque… Othe… Fill in check Trou o…
## # … with 6 more variables: new_value <chr>, type <chr>,
## # other_parent_question <chr>, other_old_value <chr>, other_new_value <chr>,
## # checkdate <chr>, and abbreviated variable names ¹question_name,
## # ²question_label, ³feedback, ⁴old_value
Cleaning log based on a checklist
All you need to do is to have produced an Excel file of logical checks in advance, which can be modified during the collection process.
## # A tibble: 5 × 7
## id_check question_name why logic…¹ new_v…² action type
## <chr> <chr> <chr> <chr> <lgl> <chr> <chr>
## 1 id1 i_consensus Pas de consentement p… "i_con… NA remove char…
## 2 id2 survey_duration Durée d'enquête de mo… "surve… NA urgen… doub…
## 3 id3 survey_duration Durée d'enquête de pl… "surve… NA check doub…
## 4 id4 survey_duration Durée d'enquête de pl… "surve… NA urgen… doub…
## 5 id5 c_chef_menage_age Le ou la cheffe de mé… "c_che… NA check doub…
## # … with abbreviated variable names ¹logical_test, ²new_value
The check excel spreadsheet must follow a few rules in order to be
read by the impactR [add list] functions. For example, it is necessary
that all variables present in the logical tests (column ‘logical_test’
of the check table) also exist in the data, object data
.
For this, we can use the check_check_list
function, which
aims to validate or not a logical check table [note: it is not yet
robust, but it already allows to do a number of checks].
For example, below the column survey_duration
does not
exist in the data
. However, there are logical checks that
take it into account in the logical checks spreadsheet
check_list
. If we run the following command, we would get
an error:
check_check_list(check_list, data)
# following column/s from `question_name` is/are missing in `.tbl`: survey_duration, survey_duration, survey_duration
So we will add the survey_duration column using the
survey_duration
function. We take the opportunity to add
the time difference between two surveys per interviewer:
data <- data |>
survey_duration(start, end, new_colname = "survey_duration") #|>
# NOT RUN!
# survey_difftime(start, end, new_colname = "survey_difftime", i_enum_id)
data$survey_duration
## [1] 53 51 49 3055 291 110 -283 89 1441 2698 1204 2969 1481 5884 1558
## [16] 2942 4444 2942 1258 2644 1529 2881 1258 2644 1529 2921 1529 5884 1502 2698
We can check the logical check spreadsheet again:
check_check_list(check_list, data)
## [1] TRUE
This time, it’s ok, the function gives TRUE
, so we can
proceed with the production of the cleaning log.
log_check_list <- make_log_from_check_list(data, survey, check_list, uuid, today, i_enum_id, i_zad)
log_check_list
## # A tibble: 57 × 17
## id_ch…¹ today i_enu…² i_zad uuid quest…³ quest…⁴ why feedb…⁵
## <chr> <dttm> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 id2 2021-10-20 00:00:00 12 wend… x7 survey… "" Duré… Verifi…
## 2 id3 2021-10-18 00:00:00 8 wend… x1 survey… "" Duré… Verifi…
## 3 id3 2021-10-20 00:00:00 3 wend… x2 survey… "" Duré… Verifi…
## 4 id3 2021-10-20 00:00:00 3 wend… x3 survey… "" Duré… Verifi…
## 5 id3 2021-10-20 00:00:00 3 wend… x4 survey… "" Duré… Verifi…
## 6 id3 2021-10-20 00:00:00 3 wend… x5 survey… "" Duré… Verifi…
## 7 id3 2021-10-20 00:00:00 3 wend… x6 survey… "" Duré… Verifi…
## 8 id3 2021-10-20 00:00:00 12 wend… x8 survey… "" Duré… Verifi…
## 9 id3 2021-10-20 00:00:00 1 gnar… x9 survey… "" Duré… Verifi…
## 10 id3 2021-10-20 00:00:00 5 gnar… x10 survey… "" Duré… Verifi…
## # … with 47 more rows, 8 more variables: action <chr>, old_value <chr>,
## # new_value <lgl>, type <chr>, other_parent_question <chr>,
## # other_old_value <chr>, other_new_value <chr>, checkdate <chr>, and
## # abbreviated variable names ¹id_check, ²i_enum_id, ³question_name,
## # ⁴question_label, ⁵feedback
Full cleaning log and export
Finally, we just need to combine all these cleaning logs into one. We can then export it to an excel file.
log <- list(log_outliers, log_check_list, log_others) |>
purrr::map(~ .x |> dplyr::mutate(dplyr::across(.fns = as.character))) |>
dplyr::bind_rows() |>
readr::type_convert()
It is also possible to use the make_all_logs
function
which combines these three functions and outputs a single cleanup
log.
log <- make_all_logs(data, survey, check_list, "autre_", uuid, today, i_enum_id, i_zad)
log
## # A tibble: 63 × 17
## id_check today i_enum…¹ i_zad uuid quest…² quest…³ why feedb…⁴ action
## <chr> <date> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 id2 2021-10-20 12 wend… x7 survey… NA Duré… Verifi… urgen…
## 2 id3 2021-10-18 8 wend… x1 survey… NA Duré… Verifi… check
## 3 id3 2021-10-20 3 wend… x2 survey… NA Duré… Verifi… check
## 4 id3 2021-10-20 3 wend… x3 survey… NA Duré… Verifi… check
## 5 id3 2021-10-20 3 wend… x4 survey… NA Duré… Verifi… check
## 6 id3 2021-10-20 3 wend… x5 survey… NA Duré… Verifi… check
## 7 id3 2021-10-20 3 wend… x6 survey… NA Duré… Verifi… check
## 8 id3 2021-10-20 12 wend… x8 survey… NA Duré… Verifi… check
## 9 id3 2021-10-20 1 gnar… x9 survey… NA Duré… Verifi… check
## 10 id3 2021-10-20 5 gnar… x10 survey… NA Duré… Verifi… check
## # … with 53 more rows, 7 more variables: old_value <chr>, new_value <chr>,
## # type <chr>, other_parent_question <chr>, other_old_value <chr>,
## # other_new_value <lgl>, checkdate <chr>, and abbreviated variable names
## # ¹i_enum_id, ²question_name, ³question_label, ⁴feedback
Finally, there are several ways to export. Here we give the example
with the package writexl
:
# Not run! The simplest one
# writexl::write_xlsx(log, "output/log.xlsx)
# You can add the current date to allow tracking
# writexl::write_xlsx(log, paste0("output/log_", Sys.Date(), ".xlsx))