Some workflow • impactR

Why?

This document aims at providing the workflow that may arise from and through the use of impactR. Let’s load the package.

# If you haven't done so 
# devtools::install_github("impactR)
library(impactR)

Below, let’s use the ‘airports’ dataset from package nycflights13

# Load dataset in environment
data(data)

# Show the first lines and types of airports' tibble
data <- data |> tibble::as_tibble()
data

## Warning in as.POSIXlt.POSIXct(x, tz): unable to identify current timezone 'H':
## please set environment variable 'TZ'

## # A tibble: 30 × 52
##    uuid  cluster   weights start               end                
##    <chr> <chr>       <dbl> <dttm>              <dttm>             
##  1 x1    wendou_1      0.8 2021-10-18 09:08:24 2021-10-18 10:01:35
##  2 x2    wendou_1      0.8 2021-10-20 09:14:47 2021-10-20 10:05:23
##  3 x3    wendou_2      0.8 2021-10-20 10:11:14 2021-10-20 11:00:07
##  4 x4    wendou_2      0.8 2021-10-18 09:08:24 2021-10-20 12:02:57
##  5 x5    wendou_1      0.8 2021-10-20 10:11:14 2021-10-20 15:02:31
##  6 x6    wendou_2      0.8 2021-10-20 14:14:36 2021-10-20 16:04:35
##  7 x7    wendou_1      0.8 2021-10-20 15:09:11 2021-10-20 10:26:34
##  8 x8    wendou_1      0.8 2021-10-20 09:44:15 2021-10-20 11:12:48
##  9 x9    gnarala_1     1.2 2021-10-20 11:12:05 2021-10-21 11:12:48
## 10 x10   gnarala_2     1.2 2021-10-20 14:14:36 2021-10-22 11:12:48
## # … with 20 more rows, and 47 more variables: today <dttm>, deviceid <chr>,
## #   i_enum_id <dbl>, i_enum_genre <chr>, i_admin2 <chr>, i_ville <chr>,
## #   i_secteur <chr>, i_info_secteur <chr>, i_zad <chr>, i_zad_1 <chr>,
## #   i_consensus <chr>, i_consensus_note <chr>, i_enquete_age <dbl>,
## #   i_enquete_genre <chr>, i_statut <chr>, i_note_statut <lgl>,
## #   i_statut_1 <chr>, c_note <lgl>, c_chef_menage <chr>,
## #   c_n_chef_menage_genre <chr>, c_n_chef_menage_age <dbl>, …

Most of impactR’s functions are written with the assumption that the provided data may be coerced to a tibble, since it extensively use the tidyverse.

Let’s sat we have a survey sheet composed as:

data(survey)
survey <- survey |> tibble::as_tibble(survey)

First thing to do with ‘survey’ as it will be used elsewhere: separate the type column.

# All are already defaults apart from 'col_to_split'
survey <- survey |>  
  split_survey(
    col_to_split = "type",
    into = c("type", "list_name"),
    sep = " ",
    fill = "right")

Monitor data collection

Check for outliers

In order to obtain the so-called outliers (for numerical variables), you can use the make_log_outlier function which allows you to obtain two types of outliers: the deviation from the mean using the outliers_sd function and the interquartile range using the outliers_iqr function.

# Helper to get numeric columns
# numeric_cols(data) # Get all numeric columns: it includes for example numeric enumerator ids or gps locations
# numeric_cols(data, survey) # Check for numeric columns using the survey sheet.

# Get IQR outliers (1.5 default rule)
# outliers_iqr(data, col = i_enquete_age,times = 1.5,  id_col = uuid)  # id_col is usually uuid with Kobo

# Get standard deviation outliers
# outliers_sd(data, i_enquete_age, times = 3, id_col = uuid)

# Create the full log of outliers of all numeri columns
log_outliers <- make_log_outlier(data, survey, id_col = uuid, today, i_enum_id, i_zad)
log_outliers

## # A tibble: 2 × 16
##   id_check today               i_enu…¹ i_zad uuid  quest…² quest…³ why   feedb…⁴
##   <chr>    <dttm>                <dbl> <chr> <chr> <chr>   <chr>   <chr> <chr>  
## 1 outlier  2021-10-18 00:00:00       8 wend… x1    c_n_ch… Quel e… Outl… Fill in
## 2 outlier  2021-10-20 00:00:00       3 wend… x2    c_n_ch… Quel e… Outl… Fill in
## # … with 7 more variables: action <chr>, old_value <dbl>, new_value <chr>,
## #   type <chr>, other_parent_question <chr>, other_new_value <chr>,
## #   checkdate <chr>, and abbreviated variable names ¹i_enum_id, ²question_name,
## #   ³question_label, ⁴feedback

Check for others

This functions needs the Kobo question with “other” answer to be defined as “variable” for the parent question and “other_variable” for the parent question. “other_” maybe different and is defined in the following question by the ‘other’ arg, being the character pattern. In the example that follows, it is “autre_”.

# Get other answers
other_cols <- other_cols(data, "autre_", id_col = uuid)

# Get other parent answers
other_parent_cols(data, other_cols, "autre_", id_col = uuid)

## # A tibble: 4 × 3
##   uuid  other_parent_question other_old_value            
##   <chr> <chr>                 <chr>                      
## 1 x15   r_besoin_assistance   abri_bna secal autre       
## 2 x22   h_3_type_latrine      latrine_prive autre        
## 3 x5    h_3_type_latrine      latrine_prive_partage autre
## 4 x7    r_besoin_assistance   abri_bna secal autre

# Fabriquer le journal de nettoyage des "autres"
log_others <- make_log_other(data, survey, "autre_", uuid, i_enum_id, i_zad) 
log_others

## # A tibble: 4 × 16
##   id_check i_enum_id i_zad   uuid  questi…¹ quest…² why   feedb…³ action old_v…⁴
##   <chr>        <dbl> <chr>   <chr> <chr>    <chr>   <chr> <chr>   <chr>  <chr>  
## 1 autre_           3 wendou  x5    autre_h… Lesque… Othe… Fill in check  Aussi …
## 2 autre_          12 wendou  x7    autre_r… De que… Othe… Fill in check  Un mét…
## 3 autre_           3 gnarala x15   autre_r… De que… Othe… Fill in check  Electr…
## 4 autre_           5 gnarala x22   autre_h… Lesque… Othe… Fill in check  Trou o…
## # … with 6 more variables: new_value <chr>, type <chr>,
## #   other_parent_question <chr>, other_old_value <chr>, other_new_value <chr>,
## #   checkdate <chr>, and abbreviated variable names ¹question_name,
## #   ²question_label, ³feedback, ⁴old_value

Cleaning log based on a checklist

All you need to do is to have produced an Excel file of logical checks in advance, which can be modified during the collection process.

data(check_list)
check_list <- check_list |> tibble::as_tibble()
check_list

## # A tibble: 5 × 7
##   id_check question_name     why                    logic…¹ new_v…² action type 
##   <chr>    <chr>             <chr>                  <chr>   <lgl>   <chr>  <chr>
## 1 id1      i_consensus       Pas de consentement p… "i_con… NA      remove char…
## 2 id2      survey_duration   Durée d'enquête de mo… "surve… NA      urgen… doub…
## 3 id3      survey_duration   Durée d'enquête de pl… "surve… NA      check  doub…
## 4 id4      survey_duration   Durée d'enquête de pl… "surve… NA      urgen… doub…
## 5 id5      c_chef_menage_age Le ou la cheffe de mé… "c_che… NA      check  doub…
## # … with abbreviated variable names ¹logical_test, ²new_value

The check excel spreadsheet must follow a few rules in order to be read by the impactR [add list] functions. For example, it is necessary that all variables present in the logical tests (column ‘logical_test’ of the check table) also exist in the data, object data. For this, we can use the check_check_list function, which aims to validate or not a logical check table [note: it is not yet robust, but it already allows to do a number of checks].

For example, below the column survey_duration does not exist in the data. However, there are logical checks that take it into account in the logical checks spreadsheet check_list. If we run the following command, we would get an error:

check_check_list(check_list, data)
# following column/s from `question_name` is/are missing in `.tbl`: survey_duration, survey_duration, survey_duration

So we will add the survey_duration column using the survey_duration function. We take the opportunity to add the time difference between two surveys per interviewer:

data <- data |> 
  survey_duration(start, end, new_colname = "survey_duration") #|>
  # NOT RUN!
  # survey_difftime(start, end, new_colname = "survey_difftime", i_enum_id)
data$survey_duration

##  [1]   53   51   49 3055  291  110 -283   89 1441 2698 1204 2969 1481 5884 1558
## [16] 2942 4444 2942 1258 2644 1529 2881 1258 2644 1529 2921 1529 5884 1502 2698

We can check the logical check spreadsheet again:

check_check_list(check_list, data)

## [1] TRUE

This time, it’s ok, the function gives TRUE, so we can proceed with the production of the cleaning log.

log_check_list <- make_log_from_check_list(data, survey, check_list, uuid, today, i_enum_id, i_zad)

log_check_list

## # A tibble: 57 × 17
##    id_ch…¹ today               i_enu…² i_zad uuid  quest…³ quest…⁴ why   feedb…⁵
##    <chr>   <dttm>                <dbl> <chr> <chr> <chr>   <chr>   <chr> <chr>  
##  1 id2     2021-10-20 00:00:00      12 wend… x7    survey… ""      Duré… Verifi…
##  2 id3     2021-10-18 00:00:00       8 wend… x1    survey… ""      Duré… Verifi…
##  3 id3     2021-10-20 00:00:00       3 wend… x2    survey… ""      Duré… Verifi…
##  4 id3     2021-10-20 00:00:00       3 wend… x3    survey… ""      Duré… Verifi…
##  5 id3     2021-10-20 00:00:00       3 wend… x4    survey… ""      Duré… Verifi…
##  6 id3     2021-10-20 00:00:00       3 wend… x5    survey… ""      Duré… Verifi…
##  7 id3     2021-10-20 00:00:00       3 wend… x6    survey… ""      Duré… Verifi…
##  8 id3     2021-10-20 00:00:00      12 wend… x8    survey… ""      Duré… Verifi…
##  9 id3     2021-10-20 00:00:00       1 gnar… x9    survey… ""      Duré… Verifi…
## 10 id3     2021-10-20 00:00:00       5 gnar… x10   survey… ""      Duré… Verifi…
## # … with 47 more rows, 8 more variables: action <chr>, old_value <chr>,
## #   new_value <lgl>, type <chr>, other_parent_question <chr>,
## #   other_old_value <chr>, other_new_value <chr>, checkdate <chr>, and
## #   abbreviated variable names ¹id_check, ²i_enum_id, ³question_name,
## #   ⁴question_label, ⁵feedback

Full cleaning log and export

Finally, we just need to combine all these cleaning logs into one. We can then export it to an excel file.

log <- list(log_outliers, log_check_list, log_others) |> 
  purrr::map(~ .x |> dplyr::mutate(dplyr::across(.fns = as.character))) |> 
  dplyr::bind_rows() |> 
  readr::type_convert()

It is also possible to use the make_all_logs function which combines these three functions and outputs a single cleanup log.

log <- make_all_logs(data, survey, check_list, "autre_", uuid, today, i_enum_id, i_zad)
log

## # A tibble: 63 × 17
##    id_check today      i_enum…¹ i_zad uuid  quest…² quest…³ why   feedb…⁴ action
##    <chr>    <date>        <dbl> <chr> <chr> <chr>   <chr>   <chr> <chr>   <chr> 
##  1 id2      2021-10-20       12 wend… x7    survey… NA      Duré… Verifi… urgen…
##  2 id3      2021-10-18        8 wend… x1    survey… NA      Duré… Verifi… check 
##  3 id3      2021-10-20        3 wend… x2    survey… NA      Duré… Verifi… check 
##  4 id3      2021-10-20        3 wend… x3    survey… NA      Duré… Verifi… check 
##  5 id3      2021-10-20        3 wend… x4    survey… NA      Duré… Verifi… check 
##  6 id3      2021-10-20        3 wend… x5    survey… NA      Duré… Verifi… check 
##  7 id3      2021-10-20        3 wend… x6    survey… NA      Duré… Verifi… check 
##  8 id3      2021-10-20       12 wend… x8    survey… NA      Duré… Verifi… check 
##  9 id3      2021-10-20        1 gnar… x9    survey… NA      Duré… Verifi… check 
## 10 id3      2021-10-20        5 gnar… x10   survey… NA      Duré… Verifi… check 
## # … with 53 more rows, 7 more variables: old_value <chr>, new_value <chr>,
## #   type <chr>, other_parent_question <chr>, other_old_value <chr>,
## #   other_new_value <lgl>, checkdate <chr>, and abbreviated variable names
## #   ¹i_enum_id, ²question_name, ³question_label, ⁴feedback

Finally, there are several ways to export. Here we give the example with the package writexl:

# Not run! The simplest one
# writexl::write_xlsx(log, "output/log.xlsx)

# You can add the current date to allow tracking
# writexl::write_xlsx(log, paste0("output/log_", Sys.Date(), ".xlsx))