library("knitr")
library("bookdown")
library("kableExtra")
library("DT")
library("Hmisc")
REDCap to R Data Set
Introduction
Many investigators, project managers, and data managers have turned to REDCap to manage their data (Harris et al. 2009, 2019). Eventually, it falls to the statistician to take the REDCap data and load it into their statistical analysis program of choice. In this tutorial I show how to use the CSV and R script file downloaded from REDCap to create a clean R data set.
This tutorial uses the R programming language (R Core Team 2019). All of the files needed to reproduce these results can be downloaded from the Git repository https://github.com/wkingc/redcap-to-r-data-set.
Libraries
The libraries knitr, bookdown, kableExtra, and DT generate the HTML output (Xie 2019, 2018; Zhu 2019; Xie, Cheng, and Tan 2018). The Hmisc library is loaded to generate and store variable labels (Harrell Jr, Charles Dupont, and others. 2019).
REDCap Data Structure
For this tutorial, I created two files that mimic the general structure of data exported from REDCap for use in the R programming language. This structure is accurate as of REDCap version 9.1.0. While not the focus of this tutorial, for the sake of reproducibility, the file used to generate the example data is downloadable here.
The first file is a flat file of comma-separated values (CSV). Below is a display of the example data downloaded here. Of note, all of the factor variables, such as sex, are numerically coded. There are also indicators for ambiguous data (666) and missing data (999).
d_csv <- read.csv(file = "fromREDCap_DATA_2019-12-05_1111.csv", stringsAsFactors = FALSE)

d_csv_DT <- datatable(
  d_csv,
  rownames = FALSE,
  class = 'cell-border stripe',
  caption = htmltools::tags$caption(
    style = 'caption-side: bottom; text-align: center;',
    '', htmltools::em('The CSV file generated with REDCap for use with the R programming language. Of note, all of the factor variables are numerically coded.')
  ),
  filter = 'top',
  options = list(
    pageLength = 5,
    autoWidth = TRUE,
    scrollX = TRUE
  )
)

d_csv_DT
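Because the 666 and 999 indicators are stored as ordinary numbers, they flow into numeric summaries unless they are handled explicitly. As a minimal sketch, assuming a small made-up data frame (`toy`) and a hypothetical helper (`count_sentinels()`), neither of which is part of the REDCap export, the following counts how often each indicator appears in each column:

```r
# Toy data mimicking the numeric coding described above (values are made up).
toy <- data.frame(
  sex = c(0, 1, 999, 1, 666),
  age = c(25, 999, 40, 61, 33),
  social_connectedness = c(3, 666, 999, 7, 0)
)

# Hypothetical helper: count occurrences of each indicator code per column.
# Assumes at least two codes so that sapply() returns a matrix.
count_sentinels <- function(df, codes = c(666, 999)) {
  counts <- sapply(df, function(x) {
    sapply(codes, function(code) sum(x == code, na.rm = TRUE))
  })
  rownames(counts) <- as.character(codes)
  counts
}

sentinel_counts <- count_sentinels(toy)
sentinel_counts
```

A table like this only shows where the indicators occur. Whether to recode them as `NA` or keep them as explicit categories is an analysis decision that depends on the variable.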
The second file is an R script that codes the data for use in R. The contents of the example REDCap script file used in this tutorial are shown below and downloadable here. The script uses the Hmisc library to set the labels. It leaves the numerically coded variables as they are and stores the factor versions in new variables with .factor appended.
d_script <- readLines("fromREDCap_R_2019-12-05_1111.R")
cat(paste(d_script, "\n", sep = ""))
#Clear existing data and graphics
rm(list = ls())
graphics.off()
#Load Hmisc library
library(Hmisc)
#Read Data
data = read.csv('fromREDCap_DATA_2019-12-05_1111.csv')
#Setting Labels
label(data$record_id) = "Record ID"
label(data$redcap_event_name) = "Event Name"
label(data$visit_date) = "Visit Date"
label(data$randomization) = "Randomization"
label(data$sex) = "Sex"
label(data$age) = "Age"
label(data$social_connectedness) = "Social Connectedness"
label(data$comments) = "Comments"
#Setting Factors(will create new variable for factors)
data$redcap_event_name.factor = factor(data$redcap_event_name, levels = c("baseline", "followup"))
data$randomization.factor = factor(data$randomization,levels = c("0", "1"))
data$sex.factor = factor(data$sex,levels = c("0", "1", "666", "999"))
data$social_connectedness.factor = factor(data$social_connectedness, levels = c("0", "1", "2", "3", "4", "5", "6", "7", "666", "999"))
levels(data$redcap_event_name.factor) = c("Baseline","Follow-up")
levels(data$randomization.factor) = c("Control","Treatment")
levels(data$sex.factor) = c("Male","Female","Ambiguous","Missing")
levels(data$social_connectedness.factor) = c("0", "1", "2", "3", "4", "5", "6", "7", "Ambiguous","Missing")
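The two-step pattern in the script, first `factor()` with the numeric codes as levels and then `levels()<-` to swap in readable names, can be seen in isolation on a toy data frame. This is a minimal sketch with made-up values, not part of the exported script:

```r
# Made-up numerically coded variable: 0 = Male, 1 = Female, 999 = Missing.
toy <- data.frame(sex = c(0, 1, 1, 999, 0))

# Step 1: create a new factor column whose levels are the numeric codes.
toy$sex.factor <- factor(toy$sex, levels = c("0", "1", "666", "999"))

# Step 2: replace the codes with readable names, in the same order.
levels(toy$sex.factor) <- c("Male", "Female", "Ambiguous", "Missing")

# The original numeric column is untouched; the names live in sex.factor.
str(toy)
table(toy$sex.factor)
```

The result is two columns per coded variable, which is the duplication the next section removes.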
REDCap Script File Clean-up and Parsing
To set up the data, we want to parse and evaluate the script file with the primary goal of removing the appended .factor. The function redcap_to_r_data_set does just that. It takes two arguments: the first is redcap_data_file, which is the path to the REDCap data set, and the second is redcap_script_file, which is the path to the REDCap script file.
redcap_to_r_data_set <- function(redcap_data_file, redcap_script_file) {
  # Read in the data and script file.
  redcap_data <- read.csv(file = redcap_data_file, stringsAsFactors = FALSE)
  redcap_script <- readLines(redcap_script_file)

  # We want to remove the appended .factor, but since releveling the numerically coded data erases the labels, we need to reorder the file so that the labels are last.
  # Every line in the script file that uses the factor() function
  redcap_factor <- redcap_script[grep("factor\\(", redcap_script)]

  # Every line in the script file that uses the levels() function
  redcap_levels <- redcap_script[grep("levels\\(", redcap_script)]

  # Every line in the script file that begins with the label function
  redcap_label <- redcap_script[grep("^label\\(", redcap_script)]

  # Reorder the chunks in the script file.
  redcap_reorder <- c(redcap_factor, "", redcap_levels, "", redcap_label)

  # Remove the appended .factor.
  redcap_no_append <- gsub("\\.factor", "", redcap_reorder)

  # REDCap defaults to calling the data 'data'. Before evaluating, we need to change this to what the data is named here.
  redcap_rename <- gsub("data\\$", "redcap_data\\$", redcap_no_append)

  # Now we can safely evaluate the script file.
  eval(parse(text = redcap_rename))

  return(redcap_data)
}

d <- redcap_to_r_data_set(redcap_data_file = "fromREDCap_DATA_2019-12-05_1111.csv", redcap_script_file = "fromREDCap_R_2019-12-05_1111.R")
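To see what the function does to the script text before evaluating it, the same `grep()` and `gsub()` steps can be applied to a few lines held in memory. The sketch below assumes the script follows the layout shown earlier. `toy_script` is a made-up three-line excerpt, not the exported file. The second half illustrates why the label lines are moved to the end, using a plain `"label"` attribute as a stand-in for an Hmisc label (an assumption made so the example runs without Hmisc).

```r
# Made-up excerpt in the order REDCap writes it: label, factor, levels.
toy_script <- c(
  'label(data$sex) = "Sex"',
  'data$sex.factor = factor(data$sex,levels = c("0", "1"))',
  'levels(data$sex.factor) = c("Male","Female")'
)

# Pull out each kind of line with the same patterns the function uses.
factor_lines <- toy_script[grep("factor\\(", toy_script)]
levels_lines <- toy_script[grep("levels\\(", toy_script)]
label_lines  <- toy_script[grep("^label\\(", toy_script)]

# Put the label lines last, drop ".factor", and rename the data object.
reordered <- c(factor_lines, levels_lines, label_lines)
no_append <- gsub("\\.factor", "", reordered)
rewritten <- gsub("data\\$", "redcap_data\\$", no_append)
cat(rewritten, sep = "\n")

# Why labels go last: factor() builds a new object and drops the attribute.
sex <- c(0, 1, 1, 0)
attr(sex, "label") <- "Sex"  # stand-in for an Hmisc label
sex_factor <- factor(sex, levels = c("0", "1"))
label_after_factor <- attr(sex_factor, "label")

# Setting the label after the conversion keeps it attached.
attr(sex_factor, "label") <- "Sex"
label_set_last <- attr(sex_factor, "label")

print(is.null(label_after_factor))
print(label_set_last)
```

Because `eval()` is called inside the function, the rewritten lines act on the function's local `redcap_data`, which is then returned.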
Below is the completed R data set after cleaning and parsing the REDCap script file.
d_DT <- datatable(
  d,
  rownames = FALSE,
  class = 'cell-border stripe',
  caption = htmltools::tags$caption(
    style = 'caption-side: bottom; text-align: center;',
    '', htmltools::em('The complete R data set after cleaning and parsing the REDCap script file.')
  ),
  filter = 'top',
  options = list(
    pageLength = 5,
    autoWidth = TRUE,
    scrollX = TRUE
  )
)

d_DT
Printing the data summary shows that variable coding worked as expected. The label accessor function shows that the labels were accurately created.
summary(d)
   record_id  redcap_event_name      visit_date     randomization
 Min.   : 1   Baseline :25       Length:50          Control  :10
 1st Qu.: 7   Follow-up:25       Class1:labelled    Treatment:15
 Median :13                      Class2:character   NA's     :25
 Mean   :13                      Mode  :character
 3rd Qu.:19
 Max.   :25
       sex           age        social_connectedness   comments
 Male     : 0   Min.   : 25.0   1        :9            Length:50
 Female   :11   1st Qu.: 33.0   2        :9            Class1:labelled
 Ambiguous: 0   Median : 38.0   3        :8            Class2:character
 Missing  : 3   Mean   :228.8   4        :8            Mode  :character
 NA's     :36   3rd Qu.: 61.0   Missing  :6
                Max.   :999.0   Ambiguous:4
                NA's   :25      (Other)  :6
label(d)
             record_id      redcap_event_name             visit_date
           "Record ID"           "Event Name"           "Visit Date"
         randomization                    sex                    age
       "Randomization"                  "Sex"                  "Age"
  social_connectedness               comments
"Social Connectedness"             "Comments"
R Session Info
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin24.4.0
Running under: macOS Tahoe 26.0.1
Matrix products: default
BLAS: /opt/homebrew/Cellar/openblas/0.3.30/lib/libopenblasp-r0.3.30.dylib
LAPACK: /opt/homebrew/Cellar/r/4.5.1/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] C.UTF-8/C.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Hmisc_5.2-4 DT_0.34.0 kableExtra_1.4.0 bookdown_0.45
[5] knitr_1.50
loaded via a namespace (and not attached):
[1] sass_0.4.10 generics_0.1.4 xml2_1.4.0 stringi_1.8.7
[5] digest_0.6.37 magrittr_2.0.4 evaluate_1.0.5 grid_4.5.1
[9] RColorBrewer_1.1-3 fastmap_1.2.0 jsonlite_2.0.0 nnet_7.3-20
[13] backports_1.5.0 Formula_1.2-5 gridExtra_2.3 crosstalk_1.2.2
[17] viridisLite_0.4.2 scales_1.4.0 jquerylib_0.1.4 textshaping_1.0.4
[21] cli_3.6.5 rlang_1.1.6 cachem_1.1.0 base64enc_0.1-3
[25] yaml_2.3.10 tools_4.5.1 checkmate_2.3.3 htmlTable_2.4.3
[29] dplyr_1.1.4 colorspace_2.1-2 ggplot2_4.0.0 vctrs_0.6.5
[33] R6_2.6.1 rpart_4.1.24 lifecycle_1.0.4 stringr_1.5.2
[37] htmlwidgets_1.6.4 foreign_0.8-90 cluster_2.1.8.1 pkgconfig_2.0.3
[41] bslib_0.9.0 pillar_1.11.1 gtable_0.3.6 glue_1.8.0
[45] data.table_1.17.8 systemfonts_1.3.1 xfun_0.53 tibble_3.3.0
[49] tidyselect_1.2.1 rstudioapi_0.17.1 farver_2.1.2 htmltools_0.5.8.1
[53] rmarkdown_2.30 svglite_2.2.1 compiler_4.5.1 S7_0.2.0