In this chapter, we will learn how descriptive statistics can be used to describe employee-demographic variables. To determine which type of descriptive statistics is appropriate for a given variable, we will learn about measurement scales and how to distinguish between a construct and a measure. Finally, we’ll conclude with a tutorial.

## 24.1 Conceptual Overview

In this section, we’ll review the four different types of measurement scales (i.e., nominal, ordinal, interval, ratio), the distinctions between constructs, measures, and measurement scales, and different types of descriptive statistics (e.g., counts, measures of central tendency, measures of dispersion).

### 24.1.1 Review of Measurement Scales

When determining what type of descriptive statistics is appropriate for summarizing data contained within a particular variable, it is important to determine the measurement scale of the variable. **Measurement scale** (i.e., *scale of measurement*, *level of measurement*) refers to the type of information contained within a vector of data (e.g., variable), and the four measurement scales are: nominal, ordinal, interval, and ratio.

#### 24.1.1.1 Nominal

Variables with a **nominal** measurement scale have different *category labels*, which are sometimes referred to as *levels*. The category labels, however, do not have any inherent numeric properties. As an example, let’s operationalize gender identity as having a nominal measurement scale, such that gender identity includes the following category labels: agender, man, nonbinary, trans man, trans woman, and woman. These category labels do not have any inherent numeric values, and although we could assign numeric values to the gender identity category labels (e.g., agender = 1, man = 2, nonbinary = 3, etc.), doing so wouldn’t imply that one category label has a higher value than another. Variables with a nominal measurement scale are sometimes referred to as *categorical* variables.

The Facility and Gender variables (i.e., columns) contain examples of nominal measurement scales, as each variable has category labels that lack any inherent numeric values and cannot be ordered in a meaningful way.
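To make the idea concrete, here is a minimal sketch in R, assuming a small hypothetical vector of gender identity labels (the object name `gender_identity` and its values are invented for illustration). Storing the labels as an unordered factor signals that they carry no ranking.

```r
# Hypothetical gender identity labels (invented for illustration)
gender_identity <- factor(c("woman", "man", "nonbinary", "woman", "agender"))

# An unordered factor has levels but no inherent ranking
levels(gender_identity)
is.ordered(gender_identity)

# The integer codes R stores behind the labels are arbitrary
# (by default they follow the alphabetical order of the labels)
as.integer(gender_identity)
```

The integer codes are a storage detail only; a larger code does not mean a "higher" category.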

#### 24.1.1.2 Ordinal

Like variables with a nominal measurement scale, variables with an **ordinal** measurement scale are a specific type of *categorical* variable; however, unlike nominal variables, the category labels (i.e., levels) associated with ordinal variables can be ordered or ranked in a meaningful way. It should be noted that the gaps – or intervals – between category labels of an ordinal variable are unknown; in other words, we can’t quantify the exact difference between adjacent category labels (i.e., levels). For example, let’s operationalize employee education levels with the following ordered category labels: high school diploma, some college, and college degree. That is, completing some portion of a college degree is a higher level of education than earning a high school diploma, and completing a college degree is a higher level of education than completing some portion of a college degree. We don’t know, though, the size of the interval between earning a high school diploma and completing some college, or between completing some college and earning a college degree; thus, as operationalized in this example, employee education level demonstrates an ordinal measurement scale (as opposed to an interval measurement scale, which is described in the following section).

A controversial example of an ordinal measurement scale is any type of Likert (or Likert-type) scale or response format. Examples of Likert scales include agreement response formats (e.g., Strongly Disagree, Disagree, Neither Disagree Nor Agree, Agree, Strongly Agree) and frequency response formats (e.g., Never, Rarely, Sometimes, Always). Likert scales are commonly used in employee surveys; for example, survey respondents might be asked to indicate their level of agreement with the following survey item that is designed to assess job satisfaction: “In general, I am satisfied with my job.” Just like any variable with an ordinal measurement scale, we don’t know the exact quantitative intervals between adjacent category labels (i.e., response options) on a Likert scale. Nonetheless, in the social sciences, it is relatively common for analysts to apply numerical values to the ordered category labels on a Likert scale (e.g., 1 = Strongly Disagree, 2 = Disagree, 3 = Neither Disagree Nor Agree, 4 = Agree, 5 = Strongly Agree). After adding these numerical values, the analysts often treat Likert scales as though they were *interval* measurement scales for the purposes of data analysis, particularly when composite variables (i.e., overall scale score variables) are created by summing or averaging respondents’ scores across multiple survey items.

The Education and Performance variables (i.e., columns) contain examples of ordinal measurement scales, as each variable has category labels that can be ordered in a meaningful way but where the exact quantitative intervals between category labels are unknown or undefined.
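As a rough illustration of what "ordered but with unknown intervals" looks like in practice, the sketch below builds a hypothetical ordered factor in R (the object name `education` and its three values are made up for this example).

```r
# Hypothetical education levels (invented for illustration)
education <- factor(
  c("some college", "high school diploma", "college degree"),
  levels = c("high school diploma", "some college", "college degree"),
  ordered = TRUE
)

# Ordered factors support rank comparisons between category labels
education[1] > education[2]

# The underlying codes (1, 2, 3) encode rank only, not distance
as.integer(education)
```

The comparison is meaningful because the levels are ranked, but subtracting the codes would not tell us how much more education one level represents than another.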

#### 24.1.1.3 Interval

Variables with an **interval** measurement scale have a numeric scale (e.g., have inherent numeric values), and not only is there an order to the numeric values, equally sized intervals between values have the same meaning or interpretation – hence, the term *interval* measurement scale. With all that being said, interval variables lack a true or meaningful zero value; in other words, a value of zero is an arbitrary point on the scale – if it even appears in the possible range of values in the first place. Variables with an interval measurement scale are sometimes referred to as *continuous* variables. As an example, suppose we purchase a cognitive ability (i.e., intelligence) test that we plan to administer to job applicants. Let’s now imagine that this test operationalizes cognitive ability, such that scores can range from 0 to 200, where 100 indicates the average level of cognitive ability in the population. Further, the test is designed such that every 1-point interval holds the same interpretation and is of equal quantitative size when compared to other 1-point intervals on the scale. For instance, let’s imagine that the 1-point interval between 78 and 79 has the same meaning (and quantitative size) as the 1-point interval between 110 and 111. In other words, equally sized intervals between values have the same meaning or interpretation in terms of incremental differences in cognitive ability. Even though this cognitive ability test can produce a score of zero, the zero value is not meaningful, as it does not imply the absence of cognitive ability; rather, it just indicates the lowest point on the numeric scale used to assess cognitive ability happens to be zero, making the zero point on the scale somewhat arbitrary.

The Cognitive Ability and BARS (Behaviorally Anchored Rating Scale) variables (i.e., columns) contain examples of interval measurement scales, as each variable has a numeric scale in which equally sized intervals between values have the same meaning or interpretation; however, both variables lack a meaningful or true zero.

#### 24.1.1.4 Ratio

Like variables with an interval measurement scale, variables with a **ratio** measurement scale are a specific type of *continuous* variable, as they have a numeric scale in which equally sized intervals between values have the same meaning or interpretation. Unlike interval variables, however, ratio variables have a true and meaningful zero value, such that zero indicates the absence of the construct being measured. Common examples of variables with a ratio measurement scale include those that measure (elapsed) time, where time is measured in standardized units like seconds, minutes, hours, days, months, years, decades, or centuries. Equally sized intervals between various time points have the same meaning, and a time of zero implies the absence of time having elapsed. In organizational settings, we often measure employee age and tenure as numeric elapsed time since a prior date. Because there is a true zero associated with ratio measurement scales, we can make statements like “this individual is twice as old as that individual” or “this individual has worked here one third as long as that individual.” Finally, I should note that even if we do *not* observe a true-zero value in our acquired data, a variable can still have a ratio measurement scale. What matters is whether the scale used to measure the construct in question has a possible true-zero value. Using the example of employee age, we can safely assume that we won’t observe any employees who have an age of exactly zero years; however, because age is measured as a standardized unit of time (i.e., years), we know that when measuring time in this way a value of zero years does exist on this scale – and it would indicate the absence of time having passed. That is, a value of zero could hypothetically indicate the lack of time having passed since the exact moment of a person’s birth.
In sum, even if we don’t observe a zero score in our data, a variable can still be classified as having a ratio measurement scale, so long as the scale used to measure the underlying construct could theoretically include a true or meaningful zero value.

The Age and Monthly Pay variables (i.e., columns) contain examples of ratio measurement scales, as each variable has a numeric scale in which equally sized intervals between values have the same meaning or interpretation; in addition, both variables have a meaningful or true zero, where zero implies the absence of whatever is being measured.
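The practical difference between interval and ratio scales can be sketched with a little arithmetic. In the example below, all values are hypothetical: two ages on a ratio scale, and two test scores on an interval scale whose zero point we relocate by an arbitrary `shift`.

```r
# Hypothetical ages in years (ratio scale)
age_a <- 50
age_b <- 25
age_a / age_b   # "twice as old" is a meaningful statement

# Hypothetical test scores on an interval scale with an arbitrary zero
score_a <- 120
score_b <- 60
shift <- 100    # relocating the arbitrary zero point

# Differences are unaffected by where the zero point sits
(score_a - score_b) == ((score_a + shift) - (score_b + shift))

# Ratios change when the zero point moves, so they are not interpretable
(score_a / score_b) == ((score_a + shift) / (score_b + shift))
```

Because an interval scale's zero is arbitrary, only differences survive a change of origin; ratio statements require a true zero.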

### 24.1.2 Constructs, Measures, & Measurement Scales

Importantly, we use **measures** to assess **constructs** (i.e., **concepts**), and often there are different ways in which we can measure or operationalize the same construct. Consequently, different measures might have different measurement scales, even though they are each designed to assess the same construct. For example, if we wish to assess the construct of job performance for sales professionals, we could have supervisors rate employee performance using a three-point scale, ranging from “Does Not Meet Expectations” to “Meets Expectations” to “Exceeds Expectations,” which could be described as an *ordinal* measurement scale. Alternatively, we might assess the construct of job performance for sales professionals based on how much revenue they generate (in US dollars), which could be described as a *ratio* measurement scale.

### 24.1.3 Types of Descriptive Statistics

Link to conceptual video: https://youtu.be/WCC4IXavits

Once we have determined the measurement scale of a variable, we’re ready to choose an appropriate type of descriptive statistics to summarize the data associated with that variable. **Descriptive statistics** are used to describe the characteristics of a sample drawn from a population; often, when dealing with data about human beings in organizations, it’s not feasible to attain data for the entire population, so instead we settle for what is hopefully a representative sample of individuals from the focal population. Common types of descriptive statistics include counts (i.e., frequencies), measures of central tendency (e.g., mean, median, mode), and measures of dispersion (e.g., variance, standard deviation, interquartile range). Note that descriptive statistics are *not* tests of statistical significance; for tests of statistical significance, we need to look to *inferential statistics* (e.g., independent-samples *t*-test, multiple linear regression). When we analyze employee demographic data, for example, we often compute descriptive statistics like the number of employees who identify with each race/ethnicity category or the average employee age and standard deviation. It’s important to remember that descriptive statistics are, well, descriptive. That is, they help us *summarize* characteristics of a sample, which is why they are sometimes referred to as **summary statistics**. As discussed in the chapter on the Data Analysis phase of the HR Analytics Project Life Cycle, descriptive statistics are a specific type of descriptive analytics, as they summarize data that were collected in the past.

Broadly speaking, when describing just a single variable (i.e., applying *univariate* descriptive statistics), we can distinguish between descriptive statistics that are appropriate for describing categorical versus continuous variables, where categorical variables have a nominal or ordinal measurement scale and continuous variables have an interval or ratio measurement scale. Often, counts (i.e., frequencies) are used to describe data associated with a categorical variable, and measures of central tendency and dispersion are used to describe data associated with a continuous variable.

#### 24.1.3.1 Counts

**Counts** are useful descriptive statistics when a variable has a *nominal* or *ordinal* measurement scale. Counts are also referred to as **frequencies**, so I’ll use those two terms interchangeably. As an added benefit, counts tend to be understood by a broad audience, as they simply refer to counting or tallying how many instances of each category label (i.e., level) of a nominal or ordinal variable have occurred. In fact, sometimes it can be quite amazing what insights we can glean just by counting things. A common example of counts in the HR context is headcount by department, facility, or unit. Imagine, if you will, an organization with facilities in three locations: Beaverton, Hillsboro, and Portland. After tallying up how many employees work at each location, we might find that 15 work at the Beaverton facility, 5 at the Hillsboro facility, and 10 at the Portland facility. In this example, “Beaverton,” “Hillsboro,” and “Portland” are our category labels for this nominal variable, and the values 15, 5, and 10, respectively, are the counts associated with each of those category labels.

#### 24.1.3.2 Measures of Central Tendency & Dispersion

**Measures of central tendency** (e.g., mean, median, mode) summarize the center or most common scores from a distribution of numeric scores, whereas **measures of dispersion** (e.g., variance, standard deviation, range, interquartile range) summarize variation in numeric scores. Typically, one would apply these specific types of descriptive statistics to describe or summarize variables that have an *interval* or *ratio* measurement scale. For example, we might compute the median pay (in US dollars) and the interquartile range in pay for a sample of workers, where pay in this example has a ratio measurement scale.
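As a quick sketch of the pay example, the code below computes measures of central tendency and dispersion for a hypothetical vector of monthly pay values (the object name `pay` and the dollar amounts are invented for illustration).

```r
# Hypothetical monthly pay values in US dollars (invented for illustration)
pay <- c(3200, 3500, 4100, 4400, 5200, 6100, 9800)

# Measures of central tendency
median(pay)   # middle value of the sorted scores
mean(pay)     # pulled upward by the single high value

# Measures of dispersion
IQR(pay)      # distance between the 25th and 75th percentiles
sd(pay)       # standard deviation
```

With one unusually high value in the sample, the mean sits above the median, which is one reason the median and interquartile range are often reported for pay data.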

In some instances, however, numeric values could be assigned to category labels of a variable that can be most accurately described as having an *ordinal* measurement scale – and upon doing so, the variable might be reclassified as having an *interval* measurement scale. Such a numeric conversion from ordinal to interval allows for measures of central tendency and dispersion to be computed. For example, a variable with five Likert response options ranging from “Strongly Disagree” to “Strongly Agree” would technically have an ordinal measurement scale because there are unknown intervals between each of the levels (i.e., category labels); in other words, the interval distance between “Strongly Disagree” and “Disagree” might not be equal to the interval distance between “Disagree” and “Neither Disagree Nor Agree”. Yet, in order to perform certain analyses, sometimes such variables are reconceptualized as having equal intervals and thus having an interval measurement scale. To do so, we would typically assign numeric values to each of the Likert response options, such as 1 = “Strongly Disagree” and 5 = “Strongly Agree” – which gives the illusion of equal intervals. Perhaps a more compelling case for treating a variable with Likert responses as having an interval measurement scale is when we create a composite variable (i.e., overall scale score) based on the sum or average of scores from multiple Likert variables (e.g., multiple survey items from a measure).
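The conversion described above can be sketched as follows. This is only an illustration under stated assumptions: the three items, the three respondents, and all object names (`response_labels`, `item1`, `item2`, `item3`, `scores`, `composite`) are hypothetical.

```r
# Ordered Likert response options
response_labels <- c("Strongly Disagree", "Disagree", "Neither Disagree Nor Agree",
                     "Agree", "Strongly Agree")

# Hypothetical responses from three people to three job satisfaction items
item1 <- factor(c("Agree", "Strongly Agree", "Disagree"),
                levels = response_labels, ordered = TRUE)
item2 <- factor(c("Agree", "Agree", "Strongly Disagree"),
                levels = response_labels, ordered = TRUE)
item3 <- factor(c("Strongly Agree", "Agree", "Disagree"),
                levels = response_labels, ordered = TRUE)

# Assign 1-5 by taking each response's position among the ordered levels
scores <- cbind(as.integer(item1), as.integer(item2), as.integer(item3))

# Composite (overall scale score): average across items for each respondent
composite <- rowMeans(scores)
composite
```

Averaging across items treats the 1-5 codes as equally spaced, which is exactly the interval-scale assumption discussed above.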

### 24.1.4 Sample Write-Up

Based on data stored in the organization’s HR information system, we set out to describe the organization’s employee demographics. The employee gender and race/ethnicity variables have nominal measurement scales, and thus we computed counts to describe these variables. Specifically, 321 employees identified as women, 300 as men, 25 as nonbinary, 8 as trans women, and 7 as trans men. Further, 192 employees identified as Hispanic/Latino, 145 as White, 132 as Asian, 119 as Black, 40 as Native American, and 33 as Native Hawaiian. Given that employee age was measured in years since birth, we classified the variable as having a ratio measurement scale, meaning that measures of central tendency and dispersion would be appropriate for describing the variable. We found that employee ages were normally distributed, and that the average employee age was 42.13 years with a standard deviation of 7.71 years, indicating that roughly two-thirds of employees’ ages fall between 34.42 and 49.84 years.
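The age range quoted in the write-up is simply the mean plus or minus one standard deviation. The short sketch below reproduces that arithmetic and checks the "roughly two-thirds" claim against a normal distribution; the object names are hypothetical.

```r
# Values reported in the sample write-up
mean_age <- 42.13
sd_age <- 7.71

# One standard deviation below and above the mean
c(lower = mean_age - sd_age, upper = mean_age + sd_age)

# Proportion of a normal distribution within one SD of the mean
pnorm(1) - pnorm(-1)
```

The second result is about 0.68, which is why the interval from 34.42 to 49.84 years is described as covering roughly two-thirds of employees when ages are normally distributed.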

## 24.2 Tutorial

This chapter’s tutorial demonstrates how to compute various types of descriptive statistics and how to present the findings visually.

### 24.2.1 Video Tutorials

As usual, you have the choice to follow along with the written tutorial in this chapter or to watch one of the video tutorials below. Note that in the videos, I show how to read in the data using the `read.csv` function from base R, whereas in the written tutorial portion of this chapter, I show how to read in the data using the `read_csv` function from the `readr` package.

Link to video tutorial: https://youtu.be/Xg0wiBofjCU

Link to video tutorial: https://youtu.be/10jYstRPDAU

### 24.2.2 Functions & Packages Introduced

Function | Package |
---|---|
`table` | base R |
`levels` | base R |
`factor` | base R |
`c` | base R |
`barplot` | base R |
`pie` | base R |
`colors` | base R |
`abline` | base R |
`hist` | base R |
`boxplot` | base R |
`mean` | base R |
`median` | base R |
`var` | base R |
`sd` | base R |
`min` | base R |
`max` | base R |
`range` | base R |
`IQR` | base R |
`quantile` | base R |
`summary` | base R |

### 24.2.3 Initial Steps

If you haven’t already, save the file called **“employee_demo.csv”** into a folder that you will subsequently set as your working directory. Your working directory will likely be different than the one shown below (i.e., `"H:/RWorkshop"`). As a reminder, you can access all of the data files referenced in this book by downloading them as a compressed (zipped) folder from my GitHub site: https://github.com/davidcaughlin/R-Tutorial-Data-Files; once you’ve followed the link to GitHub, just click “Code” (or “Download”) followed by “Download ZIP”, which will download all of the data files referenced in this book. For the sake of parsimony, I recommend downloading all of the data files into the same folder on your computer, which will allow you to set that same folder as your working directory for each of the chapters in this book.

Next, using the `setwd` function, set your working directory to the folder in which you saved the data file for this chapter. Alternatively, you can manually set your working directory folder in your drop-down menus by going to *Session > Set Working Directory > Choose Directory…*. Be sure to create a new R script file (.R) or update an existing R script file so that you can save your script and annotations. If you need refreshers on how to set your working directory and how to create and save an R script, please refer to Setting a Working Directory and the chapter on creating and saving an R script.

```r
# Set your working directory
setwd("H:/RWorkshop")
```

Next, read in the .csv data file called **“employee_demo.csv”** using your choice of read function. In this example, I use the `read_csv` function from the `readr` package (Wickham, Hester, and Bryan 2024). If you choose to use the `read_csv` function, be sure that you have installed and accessed the `readr` package using the `install.packages` and `library` functions. *Note: You don’t need to install a package every time you wish to access it; in general, I would recommend updating a package installation once every 1-3 months.* For refreshers on installing packages and reading data into R, please refer to Packages and Reading Data into R.

```r
# Install readr package if you haven't already
# [Note: You don't need to install a package every
# time you wish to access it]
install.packages("readr")
```

```r
# Access readr package
library(readr)

# Read data and name data frame (tibble) object
demo <- read_csv("employee_demo.csv")
```

```
## Rows: 30 Columns: 5
## ── Column specification ────────────────────────────────
## Delimiter: ","
## chr (3): EmpID, Facility, Education
## dbl (2): Performance, Age
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```r
# Print the names of the variables in the data frame (tibble) object
names(demo)
```

```
## [1] "EmpID"       "Facility"    "Education"   "Performance" "Age"
```

```r
# Print number of rows in data frame (tibble) object
nrow(demo)
```

```
## [1] 30
```

```r
# Print data frame (tibble) object
print(demo)
```

```
## # A tibble: 30 × 5
##    EmpID Facility  Education           Performance   Age
##    <chr> <chr>     <chr>                     <dbl> <dbl>
##  1 EE123 Beaverton College Degree              3.8    25
##  2 EE124 Beaverton Some College                9      30
##  3 EE125 Portland  High School Diploma         8.3    32
##  4 EE126 Beaverton Some College                9.8    28
##  5 EE127 Beaverton Some College                5.7    30
##  6 EE128 Beaverton College Degree              8.2    30
##  7 EE129 Beaverton College Degree              7.3    28
##  8 EE130 Beaverton College Degree              7.7    28
##  9 EE131 Portland  Some College                6.3    28
## 10 EE132 Hillsboro Some College                8.4    27
## # ℹ 20 more rows
```

The `demo` data frame object contains five variables: `EmpID`, `Facility`, `Education`, `Performance`, and `Age`. The `EmpID` variable is the employee unique identifier, and in this data frame, each row corresponds to a unique employee. The `Facility` variable contains the name of the facility where each employee works. The `Education` variable includes the highest level of education each employee attained (i.e., High School Diploma, Some College, College Degree). The `Performance` variable includes the employees’ annual performance scores (as derived by a proprietary algorithm), where a score of 0.0 would indicate exceptionally low job performance and a score of 10 would indicate exceptionally high job performance. The `Age` variable includes employees’ age (in years).

### 24.2.4 Determine the Measurement Scale

As described above, we have four employee-demographic variables at our disposal in the data frame object we named `demo`: `Facility`, `Education`, `Performance`, and `Age`. Now it’s time to determine which measurement scale best describes each variable – and spoiler alert: These four variables correspond to nominal, ordinal, interval, and ratio measurement scales, respectively. Below, I describe why a particular measurement scale maps onto each variable.

By viewing the data frame object called `demo` using the `print`, `head`, or `View` functions (as shown above in the Initial Steps), we can see that the `Facility` variable consists of the following categories (i.e., levels): `Beaverton`, `Hillsboro`, and `Portland`. These categories do not have inherent numeric properties, and they can’t be ordered meaningfully given that they just represent different facility locations for this fictitious organization. Given all that, the `Facility` variable can best be described as having a *nominal* measurement scale.

The `Education` variable contains three levels (i.e., categories): `High School Diploma`, `Some College`, and `College Degree`. These three discrete categories do not have inherent numeric properties but can be ordered in terms of a conventional educational progression, where earning a high school diploma would be the lowest level and earning a college degree would be the highest level (of the three). Furthermore, although the three variable levels can be ordered, they do not necessarily have equal intervals between the levels; in other words, the distance (e.g., time) between a high school diploma and completing some college is not necessarily the same as the distance between completing some college and a college degree. Given all of those characteristics, the `Education` variable in these data can best be described as having an *ordinal* measurement scale.

The `Performance` variable includes the annual performance score for each employee (as derived from a proprietary algorithm), where a score of 0.0 would indicate exceptionally low job performance and a score of 10 would indicate exceptionally high job performance. We can assume in this case that intervals between integers are equal, such that the distance between scores of 1 and 2 is the same as the distance between scores of 2 and 3; however, because a value of zero (0.0) does not indicate the absence of performance for this variable (but rather exceptionally low job performance), we must conclude that it has an *interval* measurement scale as opposed to a ratio measurement scale.

Finally, the `Age` variable includes the age of each employee measured in years. Because `Age` has ordered numeric values and because there are equal intervals between years as a standard measure of time, we can conclude that the variable does *not* have a nominal or ordinal measurement scale. What’s more, *hypothetically*, a value of zero when measuring something in years would imply the absence of years – which is to say `Age` as measured in years has a meaningful zero value. Given all that, the `Age` variable can be most accurately described as having a *ratio* measurement scale.
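As a rough companion to the reasoning above, the sketch below re-enters the first three rows of the printed data by hand (the object name `demo_head` is made up for this example) and checks how R stores each column. Storage type is only a hint: R cannot tell a nominal variable from an ordinal one, or an interval variable from a ratio one, so the classification still rests on our judgment.

```r
# First three rows of the data printed in the Initial Steps, re-entered by hand
demo_head <- data.frame(
  EmpID = c("EE123", "EE124", "EE125"),
  Facility = c("Beaverton", "Beaverton", "Portland"),
  Education = c("College Degree", "Some College", "High School Diploma"),
  Performance = c(3.8, 9.0, 8.3),
  Age = c(25, 30, 32),
  stringsAsFactors = FALSE
)

# Storage type of each column: character for the categorical variables,
# numeric for the continuous ones
sapply(demo_head, class)
```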

### 24.2.5 Describe Nominal & Ordinal (Categorical) Variables

We can describe variables with *nominal* or *ordinal* measurement scales by computing counts (i.e., frequencies) and by creating univariate bar charts (or pie charts), and we’ll work through each of these descriptive approaches in the following sections.

#### 24.2.5.1 Compute Counts & Frequencies

Fortunately, it’s quite easy to run counts in R, and we’ll begin by running counts for the `Facility` variable. One of the simplest approaches is to use the `table` function from base R. As the sole parenthetical argument, just type the name of the data frame object (`demo`) followed by the `$` operator and the name of the variable that belongs to that data frame object (`Facility`).

```r
# Compute counts for Facility variable (which has nominal measurement scale)
table(demo$Facility)
```

```
## 
## Beaverton Hillsboro  Portland 
##        15         5        10
```

As we can see, 15 employees work at the Beaverton facility, 5 at the Hillsboro facility, and 10 at the Portland facility. Simply put, the most employees work in Beaverton, followed by Portland and Hillsboro. Of course, we would also hope that these data are accurate and timely; point-in-time headcount can be surprisingly difficult to estimate accurately in some organizations, but that’s a story for another time.
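Counts are sometimes easier to compare when expressed as proportions or percentages. The following is a minimal sketch using the base R `prop.table` function, which this chapter does not otherwise cover; the `facility` vector is a hypothetical stand-in for `demo$Facility`, built to match the counts above so that the example runs without the data file.

```r
# Hypothetical stand-in for demo$Facility, built to match the counts above
facility <- rep(c("Beaverton", "Hillsboro", "Portland"), times = c(15, 5, 10))

# Compute counts
facility_counts <- table(facility)

# Convert counts to proportions and then to rounded percentages
facility_props <- prop.table(facility_counts)
round(100 * facility_props, 1)
```

With the actual data, the same idea would presumably be applied as `prop.table(table(demo$Facility))`.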

Because we have classified the `Education` variable as ordinal, we want to make sure that it has ordered levels. That is, `High School Diploma` should be the lowest level and `College Degree` should be the highest. To check to see if the variable is a factor with ordered levels, we can apply the `levels` function from base R and, as the sole parenthetical argument, type the name of the data frame object (`demo`) followed by the `$` operator and the name of the variable that belongs to that data frame object (`Education`).

```r
# Determine whether the Education variable is a factor with ordered levels
levels(demo$Education)
```

```
## NULL
```

Running the `levels` function for the `Education` variable returns `NULL`, which indicates that this variable is not yet a factor variable and therefore has no levels, ordered or otherwise. Never fear, we can fix that by using the `factor` function from base R.

To convert the `Education` variable to an ordered factor variable, we will overwrite the existing `Education` variable from the `demo` data frame object. Thus, we will start by typing the name of the data frame object (`demo`) followed by the `$` operator and the name of the variable (`Education`), and to the right, we will type the `<-` operator so that we can perform the variable assignment. To the right of the `<-` operator, we will type the name of the `factor` function. As the first argument, we will type the name of the data frame object (`demo`) followed by the `$` operator and the name of the variable (`Education`). As the second argument, we will type `ordered=TRUE` to signify that this variable will have ordered levels. As the third argument, we’ll type `levels=` followed by a vector of the variable levels in ascending order. Note that we use the `c` (combine) function from base R to construct the vector, and we need to put each level within quotation marks (`" "`).

```r
# Convert Education variable to ordered factor
demo$Education <- factor(demo$Education,
                         ordered=TRUE,
                         levels=c("High School Diploma", "Some College", "College Degree"))
```

Now that we’ve converted the `Education` variable to an ordered factor variable, let’s verify that we did so correctly by running the same `levels` function that we did above.

```r
# Determine whether the Education variable is a factor with ordered levels
levels(demo$Education)
```

```
## [1] "High School Diploma" "Some College"        "College Degree"
```

Instead of `NULL`, now we see the levels of the variable in ascending order. Good for us!
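One caveat worth sketching: `levels` lists the levels of any factor, whether or not it was created with `ordered=TRUE`, so a more direct check for ordering is the base R `is.ordered` function. The example below is illustrative only; `education`, `education_levels`, `unordered_f`, and `ordered_f` are hypothetical names standing in for `demo$Education`.

```r
# Hypothetical stand-in for demo$Education
education <- c("Some College", "College Degree", "High School Diploma")
education_levels <- c("High School Diploma", "Some College", "College Degree")

# Same data and levels, created without and with ordered=TRUE
unordered_f <- factor(education, levels = education_levels)
ordered_f <- factor(education, levels = education_levels, ordered = TRUE)

# levels() returns the same labels for both factors
identical(levels(unordered_f), levels(ordered_f))

# is.ordered() distinguishes them
is.ordered(unordered_f)
is.ordered(ordered_f)
```

Assuming the conversion shown earlier was run, `is.ordered(demo$Education)` should return `TRUE`.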

With the `Education` variable now an ordered factor, it now makes sense to run the `table` function to compute the counts.

```r
# Compute counts for Education variable
table(demo$Education)
```

```
## 
## High School Diploma        Some College      College Degree 
##                   4                  15                  11
```

Descriptively, we see that the most people completed some college (15), followed closely by 11 people who completed a full college degree. Relatively few employees in this sample had just a high school diploma (4).
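Because the levels of an ordinal variable are ranked, cumulative counts can also be informative (e.g., how many employees have at most some college). The sketch below is one possible approach, not part of the original tutorial; the `education` object is a hypothetical stand-in built to match the counts above.

```r
# Hypothetical stand-in for demo$Education, built to match the counts above
education <- factor(
  rep(c("High School Diploma", "Some College", "College Degree"), times = c(4, 15, 11)),
  levels = c("High School Diploma", "Some College", "College Degree"),
  ordered = TRUE
)

# Running total of counts, moving from the lowest to the highest level
education_cum <- cumsum(table(education))
education_cum
```

Cumulative counts only make sense when the levels are ordered, which is why they would not be appropriate for a nominal variable like `Facility`.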

#### 24.2.5.2 Create Data Visualizations

When interpreting descriptive statistics, it’s often useful to create some kind of data visualization to display the findings in a pictorial or graphical format. A bar chart is a simple data visualization that many potential audience members will be familiar with, making it a good choice. In addition, when the different categories (e.g., levels) are mutually exclusive and sum to a whole, we might also choose to create a pie chart. We’ll begin by creating a bar chart for the `Facility` variable and follow that up with creating a pie chart for the `Education` variable – though, we just as easily could make a bar chart for the `Education` variable and a pie chart for the `Facility` variable.

**Create Bar Charts:** Using the `barplot` function from base R, we can create a very simple and straightforward bar chart without too many frills and embellishments. Let’s start with the `Facility` variable. As the sole parenthetical argument in the `barplot` function, simply enter the `table(demo$Facility)` code that we wrote in the previous section.

```r
# Create a bar chart based on Facility counts
barplot(table(demo$Facility))
```

As you can see, a very simple (and not super aesthetically pleasing) bar chart appears in our Plots window. *When exploring data on our own, it is often fine to just complete a simple bar chart like this one, as opposed to fine-tuning the aesthetics (e.g., size, color, font) of the plot.* If you want, you can export this plot as a PDF or PNG image file, or you can copy it and paste it in another document. To do so, just click on the *Export* button in the Plots window, which should appear in the lower right of your RStudio interface.
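If you would rather export the plot with code than with the *Export* button, one option is to open a file-based graphics device, draw the plot, and close the device. The sketch below uses the base R `pdf` and `dev.off` functions; the file name, the dimensions, and the `facility_counts` stand-in for `table(demo$Facility)` are all arbitrary choices made for illustration.

```r
# Hypothetical stand-in for table(demo$Facility)
facility_counts <- table(rep(c("Beaverton", "Hillsboro", "Portland"),
                             times = c(15, 5, 10)))

# Choose where to save the plot (a temporary folder in this sketch)
plot_file <- file.path(tempdir(), "facility_barplot.pdf")

# Open a PDF device, draw the bar chart, and close the device
pdf(plot_file, width = 6, height = 4)
barplot(facility_counts)
invisible(dev.off())

# Confirm that the file was written
file.exists(plot_file)
```

Swapping `pdf` for `png` (with width and height given in pixels) would produce an image file instead, provided your R installation supports PNG output.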

If you’re feeling adventurous and would like to learn how to fine-tune the bar chart, feel free to continue on with this tutorial. Additional attention paid to aesthetics might be worthwhile if you plan to present the plot to others in a formal presentation or report.

Using the `barplot` code we wrote above, we can add a second argument, `ylim=`, followed by a vector (created with the `c` function) containing the lower and upper limits for the y-axis. In this example, I set the lower and upper y-axis limits to 0 and 20.

`# Create a bar chart based on Facility counts`
`barplot(table(demo$Facility), ylim=c(0,20))`

Building on the previous code, we add arguments that provide more meaningful labels for the x- and y-axes. To do so, we use the `xlab` argument for the x-axis label and the `ylab` argument for the y-axis label. Just make sure to put quotation marks (`" "`) around whatever text you come up with for your axis labels.

`# Create a bar chart based on Facility counts`
`barplot(table(demo$Facility), ylim=c(0,20), xlab="Facility", ylab="Counts")`

We can change the color of the bars by adding the `col` (color) argument. There are many, many different colors that can be used in R, and one of my favorites is “dodgerblue”.

`# Create a bar chart based on Facility counts`
`barplot(table(demo$Facility), ylim=c(0,20), xlab="Facility", ylab="Counts", col="dodgerblue")`

If you’d like to explore additional colors, check out this website: https://www.r-graph-gallery.com/colors.html. Or, you can run the `colors()` function (without any arguments), and you’ll get a (huge) list of the color options.

`# List names of base R color choices`
`colors()`

`## [1] "white" "aliceblue" "antiquewhite" "antiquewhite1" "antiquewhite2"`
`## [6] "antiquewhite3" "antiquewhite4" "aquamarine" "aquamarine1" "aquamarine2"`
`## ...`
`## [651] "whitesmoke" "yellow" "yellow1" "yellow2" "yellow3"`
`## [656] "yellow4" "yellowgreen"`

(The output is abbreviated here; the full list contains 657 color names.)

By default, the `barplot` function does not draw a horizontal line where the y-axis is equal to 0. If you’d like to add such a line, simply follow your `barplot` function with the `abline` function, and as the sole argument, type `h=0`.

`# Create a bar chart based on Facility counts`
`barplot(table(demo$Facility), ylim=c(0,20), xlab="Facility", ylab="Counts", col="dodgerblue")`
`abline(h=0)`

And finally, here’s a quick example of how you might visualize the `Education` variable using the `barplot` function.

`# Create a bar chart for Education variable`
`barplot(table(demo$Education), ylim=c(0,20), xlab="Education Level", ylab="Counts", col="orange")`
`abline(h=0)`
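One further refinement that readers of a report often appreciate is printing each count above its bar. The sketch below is one possible way to do this, not a required step: it re-enters the three Education counts reported earlier by hand (the object names `edu_counts` and `bar_mids` are made up), and it relies on the fact that `barplot` invisibly returns the x-axis midpoint of each bar.

```r
# Education counts copied by hand from the output earlier in this chapter
edu_counts <- c("High School Diploma" = 4, "Some College" = 15, "College Degree" = 11)

# Draw the bar chart and keep the bar midpoints that barplot() returns
bar_mids <- barplot(edu_counts, ylim = c(0, 20),
                    xlab = "Education Level", ylab = "Counts", col = "orange")
abline(h = 0)

# Print each count just above its bar (pos = 3 places the text above the point)
text(x = bar_mids, y = edu_counts, labels = edu_counts, pos = 3)
```

With the real data, `table(demo$Education)` could be stored in an object and used in place of `edu_counts`.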

**Create Pie Charts:** Using the `pie` function from base R, we can create a simple pie chart without many frills or embellishments. Let’s start with the `Education` variable. As the sole parenthetical argument in the `pie` function, simply enter the `table(demo$Education)` code that we wrote earlier when computing counts.

`# Create a pie chart based on Education counts`
`pie(table(demo$Education))`

A very simple and generic pie chart appears in our Plots window. *When exploring data on our own, it is often fine to just complete a simple pie chart like this one, as opposed to fine-tuning the aesthetics (e.g., size, color, font) of the plot.* If you want, you can export this plot as a PDF or PNG image file, or you can copy it and paste it in another document. To do so, just click on the *Export* button in the Plots window, which should appear in the lower right of your RStudio interface.

If you’re feeling adventurous and would like to learn how to adjust the colors of the pie chart, feel free to continue on with this tutorial.

Using the `pie` code we wrote above, let’s add the `col=` argument followed by the `c` (combine) function containing a vector of colors, one color for each slice of the pie. Here, I chose the primary colors of red, yellow, and blue.

`# Create a pie chart based on Education counts`
`pie(table(demo$Education), col=c("red", "yellow", "blue"))`
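Because slice sizes are hard to judge by eye, it can help to add percentages to the slice labels. Here is a minimal sketch under the assumption that the counts are the three Education counts reported earlier, re-entered by hand; the object names are made up for this example.

```r
# Education counts copied by hand from the output earlier in this chapter
edu_counts <- c("High School Diploma" = 4, "Some College" = 15, "College Degree" = 11)

# Convert counts to whole-number percentages of the total
edu_pct <- round(100 * edu_counts / sum(edu_counts))

# Build one label per slice, e.g., "Some College (50%)"
slice_labels <- paste0(names(edu_counts), " (", edu_pct, "%)")

# Draw the pie chart with the custom labels
pie(edu_counts, labels = slice_labels, col = c("red", "yellow", "blue"))
```

Rounded percentages do not always sum to exactly 100, so it is worth checking the labels before sharing the chart.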

If you’d like to explore additional colors, check out this website: https://www.r-graph-gallery.com/colors.html. Or, you can run the `colors()` function (without any arguments), and you’ll get a (huge) list of the color options.


### 24.2.6 Describe Interval & Ratio (Continuous) Variables

We can describe variables with *interval* or *ratio* measurement scales (i.e., continuous variables) by computing measures of central tendency (e.g., mean, median) and dispersion (e.g., standard deviation, range); however, it’s often good practice to begin by creating data visualizations (e.g., histograms, box plots) that will enable us to understand the nature of each variable’s distribution.

#### 24.2.6.1 Create Data Visualizations

By visualizing the shape of a continuous variable’s distribution (e.g., normal distribution, positive skew, negative skew), we can make a more informed decision regarding how to select, interpret, and report measures of central tendency and dispersion. In this section, we’ll focus on creating histograms and box plots.

**Create Histograms:** A **histogram** visually approximates the distribution of a set of numerical scores. The scores are grouped into ranges (which by default are often equally sized), and the boundaries of these ranges are referred to as *breaks* or *break points*. The bars in a histogram fill these ranges, and their heights represent the frequency (i.e., count) of scores within each range.

Let’s begin with the `Age` variable. To create a histogram, we can use the `hist` function from base R. To get things started, let’s enter a single argument: the name of the data frame object (`demo`), followed by the `$` operator and the name of the variable we wish to visualize (`Age`).

`# Create a histogram`
`hist(demo$Age)`

This histogram will do just fine for our purposes. Note that the histogram indicates that the scores from the `Age` variable appear to be roughly normally distributed. With smaller sample sizes (e.g., fewer than 30 observations or cases), a histogram is less likely to show a clean, normal-looking shape, even when the scores come from a normally distributed population, because a handful of scores can only roughly trace the underlying distribution (i.e., sampling variability). The take-home message is that histograms provide rough approximations of the shapes of distributions, and a clean normal shape is less likely when there are fewer observations (i.e., a smaller sample) and thus fewer scores on a variable.
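The point about small samples can be seen with a quick simulation. The sketch below is purely illustrative: the seed, sample sizes, mean, and standard deviation are arbitrary choices rather than values from the tutorial's data. It draws a small and a large sample from the same normal distribution and plots their histograms side by side.

```r
set.seed(123)  # arbitrary seed so that the sketch is reproducible

# Draw a small and a large sample from the same normal distribution
small_sample <- rnorm(15, mean = 28, sd = 3)
large_sample <- rnorm(1500, mean = 28, sd = 3)

# Show the two histograms side by side
old_par <- par(mfrow = c(1, 2))
hist(small_sample, main = "n = 15", xlab = "Simulated score")
hist(large_sample, main = "n = 1,500", xlab = "Simulated score")
par(old_par)  # restore the previous plot layout
```

The histogram for the large sample typically looks much closer to a bell shape than the one for the small sample, even though both samples come from the same distribution.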

*For your own internal data-exploration purposes, it is often fine to create a simple histogram like the one we created above, meaning that you would not need to worry about the aesthetics (e.g., size, color) of the histogram.* If you want, you can export this plot as a PDF or PNG image file, or you can copy it and paste it in another document. To do so, just click on the *Export* button in the Plots window, which should appear in the lower right of your RStudio interface.

As *optional* next steps, you can play around with arguments to adjust the y-axis limits (`ylim`), x-axis label (`xlab`), y-axis label (`ylab`), main title (`main`), and the bar color (`col`). If you’d like to explore additional colors, check out this website: https://www.r-graph-gallery.com/colors.html. Or, you can run the `colors()` function (without any arguments), and you’ll get a (huge) list of the color options. A more in-depth description of these plot arguments is provided in the section above called Create Bar Charts.

`# Create a histogram and add style`
`hist(demo$Age,`
`     ylim=c(0, 15), # y-axis limits`
`     xlab="Employee Age", # x-axis label`
`     ylab="Count", # y-axis label`
`     main=NULL, # remove main title`
`     col="dodgerblue") # bar color`

We can also set the break points between the bars by adding the `breaks` argument and supplying a vector of break points created with the `c` function from base R. Just be sure that the lowest value in your vector is equal to or less than the minimum value for the variable and that the highest value is equal to or greater than the maximum value for the variable.

`# Create a histogram and add style`
`hist(demo$Age,`
`     ylim=c(0, 25), # y-axis limits`
`     xlab="Employee Age", # x-axis label`
`     ylab="Count", # y-axis label`
`     main=NULL, # remove main title`
`     col="dodgerblue", # bar color`
`     breaks=c(20, 25, 30, 35)) # set break points between bars`
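When there are many break points, typing them out with `c` gets tedious, and it is easy to miss a score at either end. As a sketch, assuming a made-up `ages` vector rather than the real `demo$Age` values, the break points can be generated with `seq`, checked against the data, and then passed to `hist`.

```r
# Hypothetical ages, made up for this sketch
ages <- c(22, 25, 26, 27, 27, 28, 28, 29, 30, 31, 34)

# Equally spaced break points every 5 years from 20 to 35
age_breaks <- seq(from = 20, to = 35, by = 5)

# Confirm that the break points span every score; hist() stops with an error otherwise
breaks_ok <- min(age_breaks) <= min(ages) && max(age_breaks) >= max(ages)
breaks_ok

# plot = FALSE returns the pieces of the histogram instead of drawing it
age_hist <- hist(ages, breaks = age_breaks, plot = FALSE)
age_hist$counts  # number of scores falling in each range
```

By default, each range includes its upper break point, so in this sketch a score of 25 is counted in the 20 to 25 range rather than the 25 to 30 range.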

**Create Box Plots:** We could use a histogram to visualize the `Performance` variable, but let’s use this opportunity to create a box plot instead. Like a histogram, a **box plot** (sometimes called a “box and whiskers plot”) reveals information about the shape of a distribution, including the median, 25th percentile (i.e., lower quartile), 75th percentile (i.e., upper quartile), and the variation outside the 25th and 75th percentiles.

We’ll use the `boxplot` function from base R. To kick things off, let’s enter a single argument: the name of the data frame object (`demo`), followed by the `$` operator and the name of the variable we wish to visualize (`Performance`).

`# Create a box plot`
`boxplot(demo$Performance)`

The thick horizontal line in the middle of the box is the median score, the lower edge of the box represents the lower quartile (i.e., 25th percentile, median of the lower half of the distribution), and the upper edge of the box represents the upper quartile (i.e., 75th percentile, median of the upper half of the distribution). The height of the box is the interquartile range. By default, the `boxplot` function draws the upper “whisker” (i.e., the horizontal line at the top of the upper dashed line) at the largest score that lies no more than 1.5 times the interquartile range above the upper quartile; if no scores are more extreme than that, the whisker sits at the maximum value. Likewise, the function draws the lower “whisker” (i.e., the horizontal line at the bottom of the lower dashed line) at the smallest score that lies no more than 1.5 times the interquartile range below the lower quartile, which is the minimum value when no scores are more extreme than that.

In the box plot for `Performance`, we can see that the distribution of scores appears to be slightly negatively skewed, as the distance from the median to the upper quartile is smaller than the distance from the median to the lower quartile (i.e., the median is closer to the top of the box) and the upper whisker is shorter than the lower whisker. If there had been any outlier scores, they would appear as individual points beyond the ends of the whiskers.
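To see the exact values that a box plot draws, we can ask base R's `boxplot.stats` function for them. The sketch below uses a made-up `scores` vector with one unusually high value (it is not the `Performance` variable), so that the whisker and outlier behavior is visible.

```r
# Hypothetical scores with one unusually high value, made up for this sketch
scores <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

box_info <- boxplot.stats(scores)

# Lower whisker, lower edge of box, median, upper edge of box, upper whisker
box_info$stats
# → 1.0 3.0 5.5 8.0 9.0

# Scores beyond the whiskers, which boxplot() draws as individual points
box_info$out
# → 30

# The corresponding box plot
boxplot(scores)
```

Here the box spans 3 to 8, so the upper whisker can reach at most 8 + 1.5 × 5 = 15.5. The largest score within that limit is 9, which is where the whisker ends, and the score of 30 is flagged as an outlier.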

*If you plan to create a box plot for your own data-exploration purposes only, it is often fine to create a simple box plot like the one we created above, which means you would not need to proceed forward with subsequent steps in which I show how to refine the aesthetics of the box plot.* If you want, you can export this plot as a PDF or PNG image file, or you can copy it and paste it in another document. To do so, just click on the *Export* button in the Plots window, which should appear in the lower right of your RStudio interface.

As *optional* next steps, you can play around with arguments to adjust the y-axis label (`ylab`) and the box color (`col`). If you’d like to explore additional colors, check out this website: https://www.r-graph-gallery.com/colors.html. Or, you can run the `colors()` function (without any arguments), and you’ll get a (huge) list of the color options.

`# Create a box plot and add style`
`boxplot(demo$Performance,`
`        ylab="Employee Job Performance", # y-axis label`
`        col="orange") # box color`

#### 24.2.6.2 Compute Measures of Central Tendency & Dispersion

Now that we’ve visualized our interval and ratio measurement scale variables, we’re ready to compute some measures of central tendency and dispersion. In R, the process is quite straightforward, as the function names are fairly intuitive: `mean` (mean), `var` (variance), `sd` (standard deviation), `median` (median), `min` (minimum), `max` (maximum), `range` (range), and `IQR` (interquartile range). Within each function’s parentheses, you will enter the same arguments. Specifically, you should include the name of the data frame (`demo`), followed by the `$` operator and the name of the variable of interest (`Age`). Keep the `na.rm=TRUE` argument as is if you would like to calculate these measures even if there are missing data for the variable of interest.
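To make the role of `na.rm=TRUE` concrete, here is a small sketch with a made-up vector that contains one missing value (`NA`); the vector name and its values are hypothetical.

```r
# Hypothetical ages with one missing value, made up for this sketch
ages_with_gap <- c(25, 28, NA, 31)

# Without na.rm=TRUE, a single missing value makes the result missing
mean(ages_with_gap)
# → NA

# With na.rm=TRUE, the missing value is dropped before computing
mean(ages_with_gap, na.rm = TRUE)
# → 28
median(ages_with_gap, na.rm = TRUE)
# → 28

# Count how many values are missing
sum(is.na(ages_with_gap))
# → 1
```

It is worth checking how many values are missing, as in the last line, because statistics computed with `na.rm=TRUE` describe only the non-missing cases.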

Let’s start with some measures of central tendency for the `Age` variable, specifically the mean (`mean`) and median (`median`).

`# Mean of Age`
`mean(demo$Age, na.rm=TRUE)`

`## [1] 28`

`# Median of Age`
`median(demo$Age, na.rm=TRUE)`

`## [1] 28`

As you can see, both the mean and the median happen to be 28, which indicates that the center of the `Age` distribution is about 28 years. Should we have a skewed distribution (positive or negative), the median is often a better indicator of central tendency given that it is less susceptible to influential cases (e.g., outliers). A classic example of a skewed distribution in organizations involves pay variables, especially when executive pay is included. In U.S. organizations, executive pay is often far greater than the average worker’s pay, which is why the median is often reported as the indicator of central tendency for pay.
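The pay example can be illustrated with a few made-up numbers. In this sketch, the `pay` vector is entirely hypothetical (annual pay in thousands of dollars), with the last value standing in for an executive.

```r
# Hypothetical annual pay in thousands of dollars; the last value stands in for an executive
pay <- c(40, 45, 50, 55, 60, 1000)

# The mean is pulled far upward by the single extreme value
mean(pay)    # about 208.3

# The median stays close to what a typical employee earns
median(pay)
# → 52.5
```

Five of the six hypothetical employees earn 60 or less, so the median of 52.5 describes the typical case far better than the mean does.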

Let’s move on to some measures of dispersion, specifically the variance (`var`) and standard deviation (`sd`).

`# Variance of Age`
`var(demo$Age, na.rm=TRUE)`

`## [1] 7.103448`

`# Standard deviation (SD) of Age`
`sd(demo$Age, na.rm=TRUE)`

`## [1] 2.665229`

The variance is expressed in squared units (here, squared years), which makes it difficult to interpret directly, so we typically interpret the square root of the variance, which is called the standard deviation. As we saw in the histogram for `Age`, the variable has a roughly normal distribution. Given that we found a mean age of 28 years for this sample of employees, the standard deviation of approximately 2.67 years therefore suggests that approximately 68% of employees’ ages fall within 2.67 years (i.e., 1 SD) of 28 years (i.e., between 25.33 and 30.67 years), and approximately 95% of employees’ ages fall within 5.33 years (i.e., 2 SD) of 28 years (i.e., between 22.67 and 33.33 years).
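The arithmetic behind this interpretation can be reproduced in a few lines. The sketch below re-enters the mean and variance reported above by hand (the object names are made up) and uses `pnorm` to show where the familiar 68% and 95% figures come from; those percentages hold only to the extent that the variable is normally distributed.

```r
# Values copied by hand from the output above
age_mean <- 28
age_var  <- 7.103448

# The standard deviation is the square root of the variance
age_sd <- sqrt(age_var)
age_sd

# Ages within 1 SD and within 2 SD of the mean
c(lower = age_mean - age_sd, upper = age_mean + age_sd)
c(lower = age_mean - 2 * age_sd, upper = age_mean + 2 * age_sd)

# Share of a normal distribution within 1 SD and within 2 SD of its mean
pnorm(1) - pnorm(-1)  # about 0.683
pnorm(2) - pnorm(-2)  # about 0.954
```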

Let’s compute the minimum and maximum scores for `Age` using the `min` and `max` functions, respectively.

`# Minimum of Age`
`min(demo$Age, na.rm=TRUE)`

`## [1] 22`

`# Maximum of Age`
`max(demo$Age, na.rm=TRUE)`

`## [1] 34`

The minimum age is 22 years for this sample, and the maximum age is 34 years.

Next, let’s compute the range, which will give us the minimum and maximum scores using a single function.

`# Range of Age`
`range(demo$Age, na.rm=TRUE)`

`## [1] 22 34`

As you can see, the `range` function provides both the minimum and maximum scores.
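If you want the range as a single number (i.e., the distance between the minimum and maximum), you can wrap `range` in the `diff` function. A minimal sketch, using a small made-up `ages` vector whose minimum and maximum happen to match the values reported above:

```r
# Hypothetical ages, made up for this sketch
ages <- c(22, 27, 28, 30, 34)

# range() returns the minimum and maximum as a two-element vector
age_range <- range(ages, na.rm = TRUE)
age_range
# → 22 34

# diff() subtracts the first element from the second
diff(age_range)
# → 12
```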

Next, let’s compute the interquartile range (IQR), which is the distance between the lower and upper quartiles (i.e., between the 25th and 75th percentile). As noted above in the section called Create Box Plots, the lower and upper quartiles correspond to the outer edges of the box, whereas the median (50th percentile) corresponds to the line within the box.

`# Interquartile range (IQR) of Age`
`IQR(demo$Age, na.rm=TRUE)`

`## [1] 3`

The IQR is 3 years, which indicates that the middle 50% of ages spans 3 years.

As a follow-up, let’s compute the lower and upper quartiles (i.e., the 25th and 75th percentiles) by using the `quantile` function from base R. As the first argument, type the name of the data frame (`demo`), followed by the `$` operator and the name of the variable of interest (`Age`). As the second argument, type `.25` if you would like to request the 25th percentile (lower quartile) and `.75` if you would like to request the 75th percentile (upper quartile). Let’s do both.

`# Request specific quartiles/percentiles`
`quantile(demo$Age, .25) # lower quartile / 25th percentile`

`## 25%`
`##  27`

`quantile(demo$Age, .75) # upper quartile / 75th percentile`

`## 75%`
`##  30`

Corroborating what we found with the IQR, the difference between the upper and lower quartiles is 3 years (30 - 27 = 3).
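The same check can be done in code. The sketch below, which assumes a hypothetical vector (`ages_example`) in place of the `demo` data, requests both quartiles in a single `quantile` call by passing a vector to the `probs` argument, and then confirms that their difference matches the value returned by `IQR`.

```r
# Hypothetical ages for illustration only
ages_example <- c(22, 25, 27, 27, 28, 28, 29, 30, 30, 34)

# Request both quartiles in one call by passing a vector of probabilities
quartiles <- quantile(ages_example, probs = c(.25, .75), na.rm = TRUE)
quartiles

# The IQR equals the upper quartile minus the lower quartile
iqr_by_hand <- unname(quartiles["75%"] - quartiles["25%"])
all.equal(iqr_by_hand, IQR(ages_example, na.rm = TRUE))
```

Both functions use the same default quantile algorithm, which is why the two results agree.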

The IQR and lower and upper quartiles are typically reported along with the median (as evidenced by the box plot we created above), so let’s report them together. If you recall, the median age was 28 years for this sample, and the IQR spans 3 years from 27 years to 30 years. These measures indicate that the middle 50% of ages for this sample are between 27 and 30 years, and that the middle-most age (i.e., 50th percentile) is 28 years.

Alternatively, if we wish to automatically compute the 0th, 25th, 50th, 75th, and 100th percentiles all at once, we can simply type the name of the `quantile` function and then enter the name of the data frame object (`demo`) followed by the `$` operator and the name of the variable (`Age`) as the sole argument.

`# Request 0, 25, 50, 75, and 100 percentiles`
`quantile(demo$Age)`

`##   0%  25%  50%  75% 100%`
`##   22   27   28   30   34`

Finally, one way to compute the minimum, lower quartile (1st quartile), median, mean, upper quartile (3rd quartile), and maximum all at once is to use the `summary` function from base R with the name of the data frame object (`demo`) followed by the `$` operator and the name of the variable (`Age`) as the sole parenthetical argument.

`# Minimum, lower quartile (1st quartile), median, mean, upper quartile (3rd quartile), and maximum`
`summary(demo$Age)`

`##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.`
`##      22      27      28      28      30      34`
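The `quantile` calls above omit the `na.rm` argument, which works only when the variable has no missing values. As a minimal sketch of how the two functions behave otherwise, the example below uses a hypothetical vector (`ages_example`) that contains one `NA`; `tryCatch` is used only to capture the error message so that the script keeps running.

```r
# Hypothetical ages for illustration only, including one missing value
ages_example <- c(22, 27, 28, 30, 34, NA)

# quantile() stops with an error when missing values are present and
# na.rm is left at its default of FALSE; capture the message instead
quantile_result <- tryCatch(
  quantile(ages_example),
  error = function(e) conditionMessage(e)
)
quantile_result

# With na.rm = TRUE, the missing value is dropped before computing
quantile(ages_example, na.rm = TRUE)

# summary() drops missing values itself and reports how many it found
age_summary <- summary(ages_example)
age_summary
```

Adding `na.rm=TRUE` to `quantile`, as we did for `var`, `sd`, `min`, `max`, `range`, and `IQR`, keeps the syntax consistent across variables that may contain missing data.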

### 24.2.7 Summary

In this chapter, we focused on descriptive statistics. First, we learned about four different measurement scales (i.e., nominal, ordinal, interval, ratio) and how identifying the measurement scale of a variable is an important first step in determining an appropriate descriptive statistic or data-visualization display type. Second, we learned how to compute counts (i.e., frequencies) for nominal and ordinal variables using the `table` function from base R, how to convert a variable to an ordered factor using the `factor` function from base R, and how to visualize counts data using the `barplot` function from base R. Third, we learned how to visualize the distribution of a variable with an interval or ratio measurement scale using histograms (`hist` function from base R) and box plots (`boxplot` function from base R). Finally, we learned how to compute measures of central tendency and dispersion using base R functions like `mean` (mean), `var` (variance), `sd` (standard deviation), `median` (median), `min` (minimum), `max` (maximum), `range` (range), and `IQR` (interquartile range).

## 24.3 Chapter Supplement

In this chapter supplement, we will learn how to compute the coefficient of variation (CV).

### 24.3.1 Functions & Packages Introduced

Function | Package |
---|---|
`mean` | base `R` |
`sd` | base `R` |

### 24.3.2 Initial Steps

If required, please refer to the Initial Steps section from this chapter for more information on these initial steps.

`# Set your working directory`
`setwd("H:/RWorkshop")`

`# Access readr package`
`library(readr)`
`# Read data and name data frame (tibble) object`
`demo <- read_csv("employee_demo.csv")`

`## Rows: 30 Columns: 5`
`## ── Column specification ────────────────────────`
`## Delimiter: ","`
`## chr (3): EmpID, Facility, Education`
`## dbl (2): Performance, Age`
`##`
`` ## ℹ Use `spec()` to retrieve the full column specification for this data. ``
`` ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ``

`# Print the names of the variables in the data frame (tibble) object`
`names(demo)`

`## [1] "EmpID" "Facility" "Education" "Performance" "Age"`

### 24.3.3 Compute Coefficient of Variation (CV)

The **coefficient of variation (CV)** (also known as *relative standard deviation*) is a standardized indicator of dispersion that can be used to compare the relative variability of two or more variables with different scaling. Technically, it is really only appropriate and meaningful to compute the CV for variables that have a ratio measurement scale and thus a meaningful zero; however, sometimes people relax this assumption (albeit inappropriately) to allow for the CV to be computed for a variable with an interval measurement scale. I urge you to only compute CVs for variables with ratio measurement scales.

As a hypothetical application of the CV, imagine you would like to compare the variability of these two measures – both having a ratio measurement scale: (a) monthly base pay measured in US dollars, and (b) monthly variable pay measured in US dollars.

The formula to compute the CV for a variable is simple. In fact, it’s just the ratio of a variable’s standard deviation (SD) relative to its mean, which results in a proportion. If we multiply that proportion by 100, we can interpret the CV as a percentage.

\(CV = \frac{SD}{mean} * 100\)

Let’s imagine that for a given sample of employees the mean monthly base pay is 3,553 dollars, and the SD is 593 dollars. Further, the mean monthly variable pay for these same employees is 422 dollars, and the SD is 98 dollars. Let’s compute the CV for each measure and then compare.

\(CV_{basepay} = \frac{593}{3553} * 100 = 16.7\)

\(CV_{variablepay} = \frac{98}{422} * 100 = 23.2\)

Note that the CV for monthly base pay is 16.7%, and the CV for the monthly variable pay is 23.2%. We can interpret these descriptively as indicating that monthly variable pay shows higher variability around its mean relative to monthly base pay. In other words, monthly variable pay shows higher relative dispersion than monthly base pay. It’s important to note that comparing CVs in this way is entirely descriptive, which means that we cannot conclude that the two CVs differ significantly from one another in a statistical sense; to draw such a conclusion, we would need to conduct an appropriate inferential statistical analysis (Feltz and Miller 1996; Lewontin 1966; Miller 1991).
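The arithmetic for the pay example can be reproduced in R. The sketch below simply plugs the means and SDs reported above into the CV formula; the object names (e.g., `base_pay_mean`) are assumptions chosen for illustration.

```r
# Summary values reported in the text (US dollars per month)
base_pay_mean     <- 3553
base_pay_sd       <- 593
variable_pay_mean <- 422
variable_pay_sd   <- 98

# CV = SD / mean * 100
cv_base_pay     <- base_pay_sd / base_pay_mean * 100
cv_variable_pay <- variable_pay_sd / variable_pay_mean * 100

# Round to one decimal place for reporting
round(cv_base_pay, 1)      # 16.7
round(cv_variable_pay, 1)  # 23.2
```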

Alternatively, CVs can be computed to compare the relative variability of the same measure assessed with two independent samples. For example, in a clinical setting, the CV for a measure can be computed for each clinical trial sample in which it was administered to evaluate whether it’s appropriate to combine data from multiple samples.

Now that we understand what a coefficient of variation is, let’s practice computing one by using the data frame we read in called `demo`. Note that the `Performance` and `Age` variables can be described as having interval and ratio measurement scales, respectively. Given the caution above, treat the CV for `Performance` as practice with the syntax rather than as a strictly meaningful value.

Let’s begin by computing the coefficient of variation (CV) for the `Performance` variable. As noted above, the formula is simply a ratio, such that we divide the standard deviation (SD) for the measure by the mean for that measure. We can then convert the resulting proportion to a percentage by multiplying the proportion by 100. To compute the SD, we’ll use the `sd` function from base `R`, and to compute the mean, we’ll use the `mean` function from base `R`. Within each function, we enter the name of the data frame object (`demo`) followed by the `$` operator and the name of the variable in question that belongs to the aforementioned data frame object. To divide we use the forward slash (`/`), and to multiply we use the asterisk (`*`), as shown below.

`# Compute coefficient of variation (CV) for Performance variable`
`sd(demo$Performance) / mean(demo$Performance) * 100`

`## [1] 21.15746`

In the Console, we should see that the CV for the `Performance` variable is approximately 21.2%.

Next, let’s compute the CV for the `Age` variable. We’ll use the same formula as above, except swap out the `Performance` variable for the `Age` variable.

`# Compute coefficient of variation (CV) for Age variable`
`sd(demo$Age) / mean(demo$Age) * 100`

`## [1] 9.518677`

In the Console, we should see that the CV for the `Age` variable is approximately 9.5%.
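If you expect to compute the CV for several variables, one option is to wrap the formula in a small function. The sketch below defines a hypothetical helper called `cv_percent` (it is not part of base R or any package used in this chapter) that drops missing values by default and stops when the mean is zero, because the CV is undefined in that case. The vector `ages_example` is likewise an assumption for illustration.

```r
# Hypothetical helper (not part of base R): CV expressed as a percentage
cv_percent <- function(x, na.rm = TRUE) {
  x_mean <- mean(x, na.rm = na.rm)
  # The CV is undefined when the mean is zero
  if (isTRUE(x_mean == 0)) {
    stop("The CV is undefined when the mean is zero.")
  }
  sd(x, na.rm = na.rm) / x_mean * 100
}

# Hypothetical ages for illustration only, including one missing value
ages_example <- c(22, 27, 28, 30, 34, NA)
cv_percent(ages_example)
```

With the `demo` data frame loaded, a call such as `cv_percent(demo$Age)` should reproduce the value computed by hand, assuming the variable contains no missing values.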

As noted in the introduction, we are being descriptive in our comparisons in this tutorial and are not applying an inferential statistical analysis. Given that, we *cannot* make statements indicating that one CV is significantly larger than the other. To make such a statement, we would need to apply an inferential statistical analysis (Lewontin 1966; Miller 1991), which is beyond the scope of this chapter on descriptive statistics.

### References

Feltz, Carol J, and G Edward Miller. 1996. “An Asymptotic Test for the Equality of Coefficients of Variation from k Populations.” *Statistics in Medicine* 15 (6): 647–58.

Lewontin, Richard C. 1966. “On the Measurement of Relative Variability.” *Systematic Zoology* 15 (2): 141–42.

Miller, Edward G. 1991. “Asymptotic Test Statistics for Coefficients of Variation.” *Communications in Statistics-Theory and Methods* 20 (10): 3351–63.

Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2024. *Readr: Read Rectangular Text Data*. https://CRAN.R-project.org/package=readr.