Summarizing quantitative data using R

May 11, 2021 5 min read

In a previous article, we outlined how you can summarize qualitative data using frequency tables and bar graphs. We now turn our attention to quantitative variables and their numerical, and graphical summaries.

Let’s load the data we had used in our previous session into R. This data is adapted from Prof. Rebecca Hubbard’s github repository.

 df <- read.csv("https://www.stat.courses/uploads/diabetes_vs_depression.csv"

Let’s take a look at the first few rows of the data.

head(df)

patientid	number_of_visits	average_age	race	gender	type2_diabetes_ind	Type1_diabetes_ind	endocrinologist_vist_ind	depression_ind	average_bmi	average_glucose	average_hba1c
100172	2	11.5	Black	Male	FALSE	FALSE	TRUE	FALSE	27.51	NA	5.91
100182	2	10.5	Black	Male	TRUE	FALSE	TRUE	FALSE	24.50	NA	6.54
100463	5	18	White	Male	FALSE	FALSE	TRUE	TRUE	43.19	NA	4.92
100545	4	13.25	White	Male	FALSE	FALSE	FALSE	TRUE	44.38	87.62	NA
100582	3	12	Black	Male	FALSE	TRUE	TRUE	FALSE	26.48	NA	7.98
100612	5	12.8	White	Female	FALSE	TRUE	TRUE	TRUE	49.58	84.27	5.365

Quantitative Variables

Numerical measurement of attributes/characteristics of individuals can be achieved using quantitative variables.

What are examples of quantitative variables in the the dataset?

average_bmi: average body mass index (expressed in kg/m²)
average_glucose: laboratory measurement of blood glucose level (measured in mg/dl)
average_hba1c: hemoglobin A1c (hba1c) in %
number_of_visits: number of visits to outpatient facility

Discrete vs. continuous quantitative variables

Discrete quantitative variables (e.g., number of visits to a clinic) take a countable values, while continuous variables could take any value in the range of possible values (e.g., blood glucose level measured in mg/dl, or hemoglobin A1c (hba1c) in % ).

Graphical Summary of Quantitative Variables using Histograms

Histograms visually display the distribution of a quantitative variables. A histogram helps answer what ranges of values are more likely and least likely to occur in the sample data.

Using `hist` command to create histograms

hist(df$number_of_visits, breaks = 15)

Plotting a histogram using `hist()` function in R

Notes from the histogram:

The most likely range of number of visits is between 2 and 4.
It is least likely for higher values of number of visits to be observed.
We can also get a visual display of the “center” and the “dispersion”/“spread” of the number of visits variable.

Using `dplyr` package to create histograms

You could also use the popular ggplot2 R package to create histograms. If you have not yet installed the R packages dplyr and ggplot2, install and load them using the following commands install.packages() and library().

install.packages("ggplot2")
install.packages("dplyr")

#load packages 
library("ggplot2")
library("dplyr")

Plotting a histogram using ggplot library.

ggplot(df, mapping = aes(x=number_of_visits)) +
  geom_histogram(binwidth = 2 ) +
  labs(title = "Histogram of number of visits")

Plotting a histogram using `ggplot2` library in R

Numerical description of quantitative variables

We have noted that a histogram gives us an idea of where the center of the data measurement lies. It also gives us an idea about the dispersion of the measurements. In addition to the visual display, we can quantify the notions of center/spread using numerical summaries.

Mean as a measure of center

The sample mean is the most widely used measure of center. It is the average observed value in the sample. You can use the following command in R to calculate the sample mean

mean(df$number_of_visits)

The sample mean is typically represented using the notation $\bar{Y}$, where $y_1, y_2, \ldots, y_n$ are the measurements/observables of $n$ individuals in our sample.

$$\bar{Y} = \frac{\sum_{i = 1}^{n}y_i}{n}$$

The population mean about which we want to infer using the sample mean is typically represented using the Greek symbol $\mu$.

Median as a measure of center

The sample median is a value such that 50% of our sample is below it and 50% is above it. R command for calculating the median is

median(df$number_of_visits)

Both the sample mean and the median measure the notion of center of data, but they differ in how they accomplish that. Therefore, they need not always concur.

Compared to the sample mean, the sample median is not affected by extremely high or low observations in the sample.

The function summary(df$number_of_visits) can also be used to calculate the sample mean and the sample median.

REMARK: We hope that the numerical and graphical summaries from our sample (e.g., the average number of visits, average blood glucose level, etc. ) resemble that of the population from which our sample is taken from.

Measures and visualizations of spread

Visual display of spread

Two populations (or samples) may have the same mean, but have different spread.

One way to visually see the spread of a data is using histogram (see above for an example). Another visual display of spread can be achieved using a box plot.

boxplot(df$number_of_visits)

Numerical quantification of spread

The most commonly used numerical measures of spread are the variance and standard deviation. Other measures of spread include range and interquartile range.

Range is the difference between the maximum observed value and the minimum observed value.

 max(df$number_of_visits) - min(df$number_of_visits)

Interquartile range is the is the difference between the third quartile (Q3) and the first quartile (Q1), where the third quartile is the 75th percentile and the first quartile is the 25th percentile.

 IQR(df$number_of_visits)

Statistics Programming

Bereket Kindo, PhD

Bereket Kindo’s interests include Bayesian analysis and machine learning.