library("knitr")
library("bookdown")
library("kableExtra")
library("ggplot2")
library("Hmisc")
library("DT")
Creating an Analytic Data Set in R
Introduction
Creating an analytic data set is very important when doing data analysis and will be used to reproduce the results. An analytic data set should include the following:
- Variables (observed data) used in the analysis
- Variables (observed data) used to construct other variables used in the analysis
- Variable labels
- Variable coding
This tutorial uses the R programming language (R Core Team 2019). All of the files needed to reproduce these results can be downloaded from the Git repository https://github.com/wkingc/creating-an-analytic-data-set-r.
Libraries
The libraries knitr, bookdown, and kableExtra are used to generate the HTML output (Xie 2019, 2018; Zhu 2019). The ggplot2 library is loaded for the example data set that is used (Wickham 2016). The Hmisc and DT libraries provide functions needed to make and display the analytic data set (Harrell Jr, Charles Dupont, and others. 2019; Xie, Cheng, and Tan 2018).
Example Data
The example data for this tutorial will be mpg from the ggplot2 package. From the description in the manual:
This dataset contains a subset of the fuel economy data that the EPA makes available here. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
data(mpg)
<- data.frame(mpg)
mpg
kable(head(mpg), caption = "Header of <b>mpg</b>.", booktabs = TRUE, escape = FALSE, table.attr = "data-quarto-disable-processing=true") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
Data Setup
There are two essential parts to setting up the data. First, make sure each variable is coded to reflect its class. Second, we want to add labels to each variable in the data set using the label function of the Hmisc library.
set.seed(123)
data(mpg)
<- data.frame(mpg)
mpg
colnames(mpg)[which(colnames(mpg) == "manufacturer")] <- "manu"
$manu <- factor(mpg$manu)
mpg$model <- factor(mpg$model)
mpg$displ <- as.numeric(mpg$displ)
mpg$year <- factor(mpg$year, levels = c("1999", "2008"), ordered = TRUE)
mpg
$dp <- as.Date(NA, origin = "1970-01-01")
mpg$dp[which(mpg$year == "1999")] <- sample(seq(as.Date('1999-01-01', format = "%Y-%m-%d", origin = "1970-01-01"), as.Date('1999-12-25', format = "%Y-%m-%d", origin = "1970-01-01"), by = "day"), dim(mpg)[1]/2)
mpg$dp[which(mpg$year == "2008")] <- sample(seq(as.Date('2008-01-01', format = "%Y-%m-%d", origin = "1970-01-01"), as.Date('2008-12-25', format = "%Y-%m-%d", origin = "1970-01-01"), by = "day"), dim(mpg)[1]/2)
mpg
$cyl <- factor(mpg$cyl, levels = c(4, 5, 6, 8), ordered = TRUE)
mpg$trans <- factor(mpg$trans)
mpg$drv <- factor(mpg$drv, levels = c("f", "r", "4"), labels = c("front-wheel drive", "rear wheel drive", "4wd"))
mpg$fl <- factor(mpg$fl)
mpg$class <- factor(mpg$class)
mpg
$rn <- rnorm(dim(mpg)[1], mean = 10, sd = 5)
mpg$rn[sample(1:length(mpg$rn), size = 50)] <- NA
mpg
$party <- factor(sample(c("republican", "democrat", "independent", NA), dim(mpg)[1], replace = TRUE), levels = c("republican", "democrat", "independent"))
mpg$comments <- sample(c("I like this car!", "Meh.", "This is the worst car ever!", "Does it come in green?", "want cheese flavoured cars.", "Does it also fly?", "Blah, Blah, Blah, Blah, Blah, Blah, Blah, Blah", NA), dim(mpg)[1], replace = TRUE)
mpg
label(mpg$manu) <- "manufacturer"
label(mpg$model) <- "model name"
label(mpg$displ) <- "engine displacement, in litres"
label(mpg$year) <- "year of manufacture"
label(mpg$dp) <- "date of purchase"
label(mpg$cyl) <- "number of cylinders"
label(mpg$trans) <- "type of transmission"
label(mpg$drv) <- "drive type"
label(mpg$cty) <- "city miles per gallon"
label(mpg$hwy) <- "highway miles per gallon"
label(mpg$fl) <- "fuel type"
label(mpg$class) <- "type of car"
label(mpg$rn) <- "some random numbers that are generated from a normal distrubtion with mean = 10 and sd = 5"
label(mpg$party) <- "some random political parties"
label(mpg$comments) <- "some random comments"
kable(head(mpg), caption = "Header of <b>mpg</b>.", booktabs = TRUE, escape = FALSE, table.attr = "data-quarto-disable-processing=true") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
manu | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | dp | rn | party | comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | front-wheel drive | 18 | 29 | p | compact | 1999-06-28 | 18.98570 | democrat | I like this car! |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | front-wheel drive | 21 | 29 | p | compact | 1999-01-14 | NA | NA | I like this car! |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | front-wheel drive | 20 | 31 | p | compact | 2008-02-08 | 19.50450 | independent | Does it also fly? |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | front-wheel drive | 21 | 30 | p | compact | 2008-07-14 | NA | independent | I like this car! |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | front-wheel drive | 16 | 26 | p | compact | 1999-07-14 | 13.68097 | democrat | Meh. |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | front-wheel drive | 18 | 26 | p | compact | 1999-11-02 | 16.82888 | NA | Meh. |
Data Dictionary
After the data is set up, the data dictionary can be printed as follows.
html(contents(mpg),
sortlevels = TRUE,
maxlevels = Inf,
levelType = "table")
Data frame:mpg
234 observations and 15 variables, maximum # NAs:55Name | Labels | Levels | Class | Storage | NAs |
---|---|---|---|---|---|
manu | manufacturer | 15 | integer | 0 | |
model | model name | 38 | integer | 0 | |
displ | engine displacement, in litres | numeric | double | 0 | |
year | year of manufacture | 2 | ordered | integer | 0 |
cyl | number of cylinders | 4 | ordered | integer | 0 |
trans | type of transmission | 10 | integer | 0 | |
drv | drive type | 3 | integer | 0 | |
cty | city miles per gallon | integer | integer | 0 | |
hwy | highway miles per gallon | integer | integer | 0 | |
fl | fuel type | 5 | integer | 0 | |
class | type of car | 7 | integer | 0 | |
dp | date of purchase | Date | double | 0 | |
rn | some random numbers that are generated from a normal distrubtion with mean = 10 and sd = 5 | numeric | double | 50 | |
party | some random political parties | 3 | integer | 55 | |
comments | some random comments | character | character | 31 |
Variable | Levels |
---|---|
manu | audi |
chevrolet | |
dodge | |
ford | |
honda | |
hyundai | |
jeep | |
land rover | |
lincoln | |
mercury | |
nissan | |
pontiac | |
subaru | |
toyota | |
volkswagen | |
model | 4runner 4wd |
a4 | |
a4 quattro | |
a6 quattro | |
altima | |
c1500 suburban 2wd | |
camry | |
camry solara | |
caravan 2wd | |
civic | |
corolla | |
corvette | |
dakota pickup 4wd | |
durango 4wd | |
expedition 2wd | |
explorer 4wd | |
f150 pickup 4wd | |
forester awd | |
grand cherokee 4wd | |
grand prix | |
gti | |
impreza awd | |
jetta | |
k1500 tahoe 4wd | |
land cruiser wagon 4wd | |
malibu | |
maxima | |
mountaineer 4wd | |
mustang | |
navigator 2wd | |
new beetle | |
passat | |
pathfinder 4wd | |
ram 1500 pickup 4wd | |
range rover | |
sonata | |
tiburon | |
toyota tacoma 4wd | |
year | 1999 |
2008 | |
cyl | 4 |
5 | |
6 | |
8 | |
trans | auto(av) |
auto(l3) | |
auto(l4) | |
auto(l5) | |
auto(l6) | |
auto(s4) | |
auto(s5) | |
auto(s6) | |
manual(m5) | |
manual(m6) | |
drv | front-wheel drive |
rear wheel drive | |
4wd | |
fl | c |
d | |
e | |
p | |
r | |
class | 2seater |
compact | |
midsize | |
minivan | |
pickup | |
subcompact | |
suv | |
party | republican |
democrat | |
independent |
Analytic Data
This analytic data is created using the datatable function of the DT library. There are many options here, so I will only cover the high points. For more information check out the excellent documentation here.
The main things I want to show are the extensions declared by the extensions argument. The Buttons option creates buttons so that the end user can download the data set in various formats. The KeyTable allows users to select cells within the table and navigate similar to how they would in spreadsheet software. The FixedColumns allows us to create fixed columns that follow the cursor and scroll bars.
<- lapply(mpg, grep, pattern = ",")
commas <- names(which(unlist(lapply(commas, length)) > 0))
commas <- gsub(",", "_comma_", mpg[, commas], fixed = TRUE)
mpg[, commas]
datatable(
escape = FALSE,
mpg, extensions = c('Buttons', 'KeyTable', 'FixedColumns'),
class = 'cell-border stripe',
rownames = TRUE,
options = list(
dom = 'Bfrtip',
pageLength = 5,
deferRender = TRUE,
fixedColumns = list(leftColumns = 1),
responsive = TRUE,
scrollX = TRUE,
scrollCollaspe = TRUE,
paging = TRUE,
autoWidth = TRUE,
keys = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
))
R Session Information
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin24.4.0
Running under: macOS Tahoe 26.0.1
Matrix products: default
BLAS: /opt/homebrew/Cellar/openblas/0.3.30/lib/libopenblasp-r0.3.30.dylib
LAPACK: /opt/homebrew/Cellar/r/4.5.1/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] C.UTF-8/C.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] DT_0.34.0 Hmisc_5.2-4 ggplot2_4.0.0 kableExtra_1.4.0
[5] bookdown_0.45 knitr_1.50
loaded via a namespace (and not attached):
[1] sass_0.4.10 generics_0.1.4 renv_1.0.10 xml2_1.4.0
[5] stringi_1.8.7 digest_0.6.37 magrittr_2.0.4 evaluate_1.0.5
[9] grid_4.5.1 RColorBrewer_1.1-3 fastmap_1.2.0 jsonlite_2.0.0
[13] nnet_7.3-20 backports_1.5.0 Formula_1.2-5 gridExtra_2.3
[17] crosstalk_1.2.2 viridisLite_0.4.2 scales_1.4.0 jquerylib_0.1.4
[21] textshaping_1.0.4 cli_3.6.5 rlang_1.1.6 cachem_1.1.0
[25] base64enc_0.1-3 withr_3.0.2 yaml_2.3.10 tools_4.5.1
[29] checkmate_2.3.3 htmlTable_2.4.3 dplyr_1.1.4 colorspace_2.1-2
[33] vctrs_0.6.5 R6_2.6.1 rpart_4.1.24 lifecycle_1.0.4
[37] stringr_1.5.2 htmlwidgets_1.6.4 foreign_0.8-90 cluster_2.1.8.1
[41] pkgconfig_2.0.3 bslib_0.9.0 pillar_1.11.1 gtable_0.3.6
[45] glue_1.8.0 data.table_1.17.8 systemfonts_1.3.1 xfun_0.53
[49] tibble_3.3.0 tidyselect_1.2.1 rstudioapi_0.17.1 farver_2.1.2
[53] htmltools_0.5.8.1 rmarkdown_2.30 svglite_2.2.1 compiler_4.5.1
[57] S7_0.2.0