REDCap to R Data Set

Author

Wade K. Copeland

Published

October 15, 2025

Introduction

Many investigators, project managers, and data managers have turned to REDCap to manage their data (Harris et al. 2009, 2019). Eventually, it falls to the statistician to take the REDCap data and load it into their statistical analysis program of choice. In this tutorial, I show how to use the CSV file and R script file downloaded from REDCap to create a clean R data set.

This tutorial uses the R programming language (R Core Team 2019). All of the files needed to reproduce these results can be downloaded from the Git repository https://github.com/wkingc/redcap-to-r-data-set.

Libraries

The libraries knitr, bookdown, kableExtra, and DT generate the HTML output (Xie 2019, 2018; Zhu 2019; Xie, Cheng, and Tan 2018). The Hmisc library is loaded to generate and store variable labels (Harrell Jr, Charles Dupont, and others. 2019).

library("knitr")
library("bookdown")
library("kableExtra")
library("DT")
library("Hmisc")

REDCap Data Structure

For this tutorial, I created two files that mimic the general structure of data exported from REDCap for use in the R programming language. This structure is accurate as of REDCap version 9.1.0. While not the focus of this tutorial, for the sake of reproducibility, the file used to generate the example data is downloadable here.

The first file is a flat file of comma-separated values (CSV). Below is a display of the example data, which can be downloaded here. Of note, all of the factor variables, such as sex, are numerically coded. There are also indicator codes for ambiguous data (666) and missing data (999).

d_csv <- read.csv(file = "fromREDCap_DATA_2019-12-05_1111.csv", stringsAsFactors = FALSE)

d_csv_DT <- datatable(
    d_csv,
    rownames = FALSE,
    class = 'cell-border stripe',
    caption = htmltools::tags$caption(
        style = 'caption-side: bottom; text-align: center;',
        '', htmltools::em('The CSV file generated with REDCap for use with the R programming language.  Of note, all of the factor variables are numerically coded.')
    ),
    filter = 'top',
    options = list(
        pageLength = 5,
        autoWidth = TRUE,
        scrollX = TRUE
    )
)

d_csv_DT
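
Because the ambiguous (666) and missing (999) codes sit alongside real values in the raw export, it can help to count them per column before any recoding. The following is a minimal sketch on a made-up data frame; the names toy, sentinels, and sentinel_counts are hypothetical and are not part of the REDCap export or the tutorial files.

```r
# Made-up data standing in for a few columns of a REDCap export.
toy <- data.frame(
    age = c(25, 999, 38, 61),
    sex = c(0, 1, 666, 999)
)

# Codes used in this tutorial for ambiguous and missing data.
sentinels <- c(666, 999)

# Count how many values in each column match one of the codes.
sentinel_counts <- sapply(toy, function(column) sum(column %in% sentinels))
print(sentinel_counts)
```

A check like this assumes that 666 and 999 never occur as legitimate values in the columns being scanned.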

The second file is an R script that codes the data for use in R. The contents of the example REDCap script file used in this tutorial are shown below and downloadable here. We see that the Hmisc library is used to set the labels. The numerically coded variables are not converted to factors in place; instead, the factor versions are stored as new variables with .factor appended.

d_script <- readLines("fromREDCap_R_2019-12-05_1111.R")

cat(paste(d_script, "\n", sep = ""))

#Clear existing data and graphics
 rm(list = ls())
 graphics.off()
 #Load Hmisc library
 library(Hmisc)
 #Read Data
 data = read.csv('fromREDCap_DATA_2019-12-05_1111.csv')
 #Setting Labels
 
 label(data$record_id) = "Record ID"
 label(data$redcap_event_name) = "Event Name"
 label(data$visit_date) = "Visit Date"
 label(data$randomization) = "Randomization"
 label(data$sex) = "Sex"
 label(data$age) = "Age"
 label(data$social_connectedness) = "Social Connectedness"
 label(data$comments) = "Comments"
                  
 #Setting Factors(will create new variable for factors)
 data$redcap_event_name.factor = factor(data$redcap_event_name, levels = c("baseline", "followup"))
 data$randomization.factor = factor(data$randomization,levels = c("0", "1"))
 data$sex.factor = factor(data$sex,levels = c("0", "1", "666", "999"))
 data$social_connectedness.factor = factor(data$social_connectedness, levels = c("0", "1", "2", "3", "4", "5", "6", "7", "666", "999"))
 
 levels(data$redcap_event_name.factor) = c("Baseline","Follow-up")
 levels(data$randomization.factor) = c("Control","Treatment")
 levels(data$sex.factor) = c("Male","Female","Ambiguous","Missing")
 levels(data$social_connectedness) = c("0", "1", "2", "3", "4", "5", "6", "7", "Ambiguous","Missing")
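
To make the pattern in the script concrete, here is a minimal sketch of the two-step factor coding on a made-up data frame. The object toy and its values are hypothetical; only the factor() and levels() idiom is taken from the script above.

```r
# Made-up data standing in for a REDCap export.
toy <- data.frame(sex = c(0, 1, 999, 1, 666))

# Step 1: build a factor whose levels are the raw codes, in a fixed order.
toy$sex.factor <- factor(toy$sex, levels = c("0", "1", "666", "999"))

# Step 2: rename the levels positionally with readable labels.
levels(toy$sex.factor) <- c("Male", "Female", "Ambiguous", "Missing")

# The numeric column is untouched; the labeled version sits beside it.
print(toy)
```

The result is two columns per coded variable, which is the duplication the next section removes.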

REDCap Script File Clean-up and Parsing

To set up the data, we want to parse and evaluate the script file with the primary goal of removing the appended .factor, so that each coded variable is converted to a factor in place. The function redcap_to_r_data_set does just that. It takes two arguments: the first is redcap_data_file, which is the path to the REDCap data set, and the second is redcap_script_file, which is the path to the REDCap script file.

redcap_to_r_data_set <- function(redcap_data_file, redcap_script_file) {
    # Read in the data and script file.
    redcap_data <- read.csv(file = redcap_data_file, stringsAsFactors = FALSE)
    redcap_script <- readLines(redcap_script_file)
    
    # We want to remove the appended .factor, but since releveling the numerically coded data erases the labels, we need to reorder the file so that the labels are last.
    # Every line in the script file that uses the factor() function
    redcap_factor <- redcap_script[grep("factor\\(", redcap_script)]
    
    # Every line in the script file that uses the levels() function
    redcap_levels <- redcap_script[grep("levels\\(", redcap_script)]
    
    # Every line in the script file that begins with the label function
    redcap_label <- redcap_script[grep("^label\\(", redcap_script)]
    
    # Reorder the chunks in the script file.
    redcap_reorder <- c(redcap_factor, "", redcap_levels, "", redcap_label)
    
    # Remove the appended .factor.
    redcap_no_append <- gsub("\\.factor", "", redcap_reorder)
    
    # REDCap defaults to calling the data 'data'.  Before evaluating, we need to change this to what the data is named here.
    redcap_rename <- gsub("data\\$", "redcap_data\\$", redcap_no_append)
    
    # Now we can safely evaluate the script file.
    eval(parse(text = redcap_rename))
    
    return(redcap_data)
}
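
The text transformation inside the function can be traced on a small in-memory example. This is only a sketch: toy_script is a hypothetical three-line stand-in for a REDCap script, and toy_data is a made-up data frame. Only the factor and levels lines are evaluated here, since the label line needs the Hmisc library.

```r
# A three-line stand-in for a REDCap-generated script (hypothetical content).
toy_script <- c(
    "label(data$sex) = \"Sex\"",
    "data$sex.factor = factor(data$sex,levels = c(\"0\", \"1\"))",
    "levels(data$sex.factor) = c(\"Male\",\"Female\")"
)

# Pull out each kind of line, then put the label lines last.
factor_lines <- grep("factor\\(", toy_script, value = TRUE)
levels_lines <- grep("levels\\(", toy_script, value = TRUE)
label_lines <- grep("^label\\(", toy_script, value = TRUE)
reordered <- c(factor_lines, levels_lines, label_lines)

# Drop the .factor suffix and point the code at a differently named data frame.
rewritten <- gsub("\\.factor", "", reordered)
rewritten <- gsub("data\\$", "toy_data$", rewritten)
writeLines(rewritten)

# Evaluate the factor and levels lines against the made-up data.
toy_data <- data.frame(sex = c(0, 1, 1))
eval(parse(text = rewritten[1:2]))
print(toy_data$sex)
```

After the rewrite, the factor line assigns back to the original column, so no .factor duplicate is created.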

d <- redcap_to_r_data_set(redcap_data_file = "fromREDCap_DATA_2019-12-05_1111.csv", redcap_script_file = "fromREDCap_R_2019-12-05_1111.R")
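
The function applies the label lines last because converting a variable to a factor in place discards attributes such as a label. The sketch below illustrates this with a plain base R "label" attribute as a stand-in for the Hmisc label; the objects x and f are hypothetical.

```r
# Base R stand-in for a variable label: a plain "label" attribute.
x <- c(0, 1, 1, 0)
attr(x, "label") <- "Sex"

# factor() builds a new object, and the attribute is not carried over.
f <- factor(x, levels = c("0", "1"))
print(is.null(attr(f, "label")))

# Setting the attribute after the conversion keeps it.
attr(f, "label") <- "Sex"
print(attr(f, "label"))
```

Had the labels been set first, as in the original REDCap script order, the in-place factor conversion would have removed them.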

Below is the completed R data set after cleaning and parsing the REDCap script file.

d_DT <- datatable(
    d,
    rownames = FALSE,
    class = 'cell-border stripe',
    caption = htmltools::tags$caption(
        style = 'caption-side: bottom; text-align: center;',
        '', htmltools::em('The complete R data set after cleaning and parsing the REDCap script file.')
    ),
    filter = 'top',
    options = list(
        pageLength = 5,
        autoWidth = TRUE,
        scrollX = TRUE
    )
)

d_DT

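Printing the data summary shows that the variable coding worked as expected: the factor variables now display their labels rather than numeric codes. Numeric variables such as age are not recoded, so the missing-data code remains in them (the maximum of age is 999). The label accessor function shows that the labels were accurately created.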

summary(d)

   record_id  redcap_event_name  visit_date          randomization
 Min.   : 1   Baseline :25      Length:50          Control  :10   
 1st Qu.: 7   Follow-up:25      Class1:labelled    Treatment:15   
 Median :13                     Class2:character   NA's     :25   
 Mean   :13                     Mode  :character                  
 3rd Qu.:19                                                       
 Max.   :25                                                       
                                                                  
        sex          age        social_connectedness   comments        
 Male     : 0   Min.   : 25.0   1        :9          Length:50         
 Female   :11   1st Qu.: 33.0   2        :9          Class1:labelled   
 Ambiguous: 0   Median : 38.0   3        :8          Class2:character  
 Missing  : 3   Mean   :228.8   4        :8          Mode  :character  
 NA's     :36   3rd Qu.: 61.0   Missing  :6                            
                Max.   :999.0   Ambiguous:4                            
                NA's   :25      (Other)  :6

label(d)

             record_id      redcap_event_name             visit_date 
           "Record ID"           "Event Name"           "Visit Date" 
         randomization                    sex                    age 
       "Randomization"                  "Sex"                  "Age" 
  social_connectedness               comments 
"Social Connectedness"             "Comments"

R Session Info

sessionInfo()

R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin24.4.0
Running under: macOS Tahoe 26.0.1

Matrix products: default
BLAS:   /opt/homebrew/Cellar/openblas/0.3.30/lib/libopenblasp-r0.3.30.dylib 
LAPACK: /opt/homebrew/Cellar/r/4.5.1/lib/R/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] C.UTF-8/C.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Hmisc_5.2-4      DT_0.34.0        kableExtra_1.4.0 bookdown_0.45   
[5] knitr_1.50      

loaded via a namespace (and not attached):
 [1] sass_0.4.10        generics_0.1.4     xml2_1.4.0         stringi_1.8.7     
 [5] digest_0.6.37      magrittr_2.0.4     evaluate_1.0.5     grid_4.5.1        
 [9] RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0     nnet_7.3-20       
[13] backports_1.5.0    Formula_1.2-5      gridExtra_2.3      crosstalk_1.2.2   
[17] viridisLite_0.4.2  scales_1.4.0       jquerylib_0.1.4    textshaping_1.0.4 
[21] cli_3.6.5          rlang_1.1.6        cachem_1.1.0       base64enc_0.1-3   
[25] yaml_2.3.10        tools_4.5.1        checkmate_2.3.3    htmlTable_2.4.3   
[29] dplyr_1.1.4        colorspace_2.1-2   ggplot2_4.0.0      vctrs_0.6.5       
[33] R6_2.6.1           rpart_4.1.24       lifecycle_1.0.4    stringr_1.5.2     
[37] htmlwidgets_1.6.4  foreign_0.8-90     cluster_2.1.8.1    pkgconfig_2.0.3   
[41] bslib_0.9.0        pillar_1.11.1      gtable_0.3.6       glue_1.8.0        
[45] data.table_1.17.8  systemfonts_1.3.1  xfun_0.53          tibble_3.3.0      
[49] tidyselect_1.2.1   rstudioapi_0.17.1  farver_2.1.2       htmltools_0.5.8.1 
[53] rmarkdown_2.30     svglite_2.2.1      compiler_4.5.1     S7_0.2.0

References

Harrell Jr, Frank E, with contributions from Charles Dupont, and many others. 2019. Hmisc: Harrell Miscellaneous. https://CRAN.R-project.org/package=Hmisc.

Harris, Paul A, Robert Taylor, Brenda L Minor, Veida Elliott, Michelle Fernandez, Lindsay O’Neal, Laura McLeod, et al. 2019. “The REDCap Consortium: Building an International Community of Software Platform Partners.” Journal of Biomedical Informatics 95: 103208.

Harris, Paul A, Robert Taylor, Robert Thielke, Jonathon Payne, Nathaniel Gonzalez, Jose G Conde, et al. 2009. “A Metadata-Driven Methodology and Workflow Process for Providing Translational Research Informatics Support.” Journal of Biomedical Informatics 42 (2): 377–81.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Xie, Yihui. 2018. Bookdown: Authoring Books and Technical Documents with R Markdown. https://github.com/rstudio/bookdown.

———. 2019. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.name/knitr/.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2018. DT: A Wrapper of the JavaScript Library ’DataTables’. https://CRAN.R-project.org/package=DT.

Zhu, Hao. 2019. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.