Datasets and Data Types As Seen in R
Understanding Data Types in R: From Nominal to Ratio
When we begin working with data in R, one of the first challenges is learning how to think about the different types of variables. In statistics, we often distinguish between four levels of measurement: nominal, ordinal, interval, and ratio. These levels aren’t just technicalities. They determine what we can meaningfully do with our data, and just as importantly, what we cannot do. R doesn’t enforce these distinctions for us, but it does represent variables in certain ways (i.e., as numbers, factors, or characters) that shape how we interact with them.
To make these ideas more concrete, let's start by loading a dataset. R comes with several datasets built in (most of which are stored as "data frames"), which makes practice very easy. One of the most famous is the iris dataset, which contains measurements of sepal and petal dimensions for 150 flowers across three species. In your R console, you can type help(iris) if you want more information on the dataset.
R comes with extensive built-in documentation for every function, dataset, and object. The help() function is the primary way to access this documentation. For example, running help(iris) or ?iris in the console brings up a page describing the iris dataset: how many rows and columns it has, what each variable represents, and sometimes even references to the original source. Using help() is essential when you are exploring a new dataset or learning a new function, because it provides the authoritative explanation of how the object is structured and what operations make sense. It also shows examples of usage, which is invaluable when you are experimenting in R for the first time.
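For example, either of the following commands opens the documentation page for the iris dataset:
#Open the documentation for the iris dataset
help(iris)
?iris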
Loading in a dataset
When we are working with any dataset in R, the first task is to load it into our R environment. Typically, we do this using the load() command. But because iris comes built into R, we do not need to do this, nor do we have to download anything. All we have to do is type data(iris) to make it available in our session.
#Call the dataset that we would like to use
data(iris)
Once called, you should see the iris dataset appear in your environment tab. Its name should be iris and you should see the label <Promise> next to it. This signals that the dataset is ready for use.
This <Promise> label and activation step only happen when working with R's built-in datasets. When we load external data (i.e., datasets used in the CP assignments), they become instantly available for use and manipulation.
In this case, the iris dataset is also instantly available for use and manipulation; you just have to explicitly do something with the dataset for it to become activated. For example, if you hover your mouse over the iris object and click on it, you will see that the <Promise> label disappears and is replaced with a summary of the dataset's dimensions. Similarly, if you run any command using the dataset, as we will do shortly, it will activate automatically.
Once loaded into R, we can do a whole host of things to the dataset. One recommended starting point is to create a carbon copy of the raw data you are using. This can be helpful if you plan on manipulating the data or changing it in some way. It is always advisable to keep an unaltered copy of the raw data on hand in case you make a bags of it and need to start over - if you alter the only copy of your raw data and it turns out to be incorrect, you could land yourself in hot water.
#Generate a copy of our dataset and call it iris_data
iris_data <- iris
All we have done here is create an identical object and give it a new name. In R-speak, what we have done is, using the object iris, create a new object called iris_data. The direction of the arrow illustrates which object is an input of the function, and which is an output. For example, we could also do it this way:
#Flip the sign of the arrow
iris -> iris_flip
It is, however, much less typical to structure your code this way because there are normally far more inputs than there are outputs. Having your outputs on the left-hand side of the arrow (i.e., iris_data <- iris) allows you to utilise the full breadth of the document when constructing code and commands.
Exploring the structure of datasets
Okay. Enough about loading datasets in. If we want to explore our dataset in more detail, we can go about it in a few ways. On one hand, you can use your mouse to simply hover over your dataset, click on it, and navigate the table that automatically opens. This table is your data. Each cell represents a data point. In most datasets, the rows will represent cases; in this case, each row represents a different flower. Meanwhile, the columns will usually represent variables - that is, the different characteristics of each flower that this dataset captures.
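If you prefer the console over the mouse, the same table can be opened with the View() command (this assumes you are working in RStudio, where the table opens in a dedicated tab):
#Open the dataset in the spreadsheet-style viewer
View(iris_data)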
In this case, for example, we see that there are five variables (characteristics) in total and 150 observations (flowers) in total. We can confirm this in the table, but also in the environment tab, as this information now replaces the <Promise> label. For each of these 150 flowers, we have information about their petals and sepals, as well as the type of iris species they belong to. We see that, for four of these variables, their values are expressed as numbers, whereas for the fifth, its values are expressed in words.
This is about the most we can gather from this table without having to make educated guesses about the underlying structure of the data. To dig a bit deeper, we will have to use some code to examine the structure of the dataset and its different variables. Many of these commands will give us the same information as the table did anyway, so it is often more convenient to skip the table and jump straight into the code. That being said, it is always valuable to look at your raw data, something which will be explored in more detail in later posts.
Anyway, let's say we want to explicitly examine the structure of the dataset. By that I mean we want to examine the type of data each variable captures. To do this, we can use the str() command.
#Examine a dataset's structure
str(iris_data)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The output of this command is quite informative. We see the total number of observations (flowers) and variables (characteristics). But we also see the structure of each variable. We can confirm now that R interprets four of the five variables as numeric, meaning it treats them as numbers. Interestingly, R does not distinguish between the types of numeric data these variables are. Although they all appear to be ratio variables, we cannot be sure without digging a bit deeper. Fortunately, we do not need to dig deeper, as in practice all that matters nine times out of ten is whether these data are numbers or not. In other words, in practice the distinction between ratio and interval seldom comes into play beyond the extent to which we use theory to interpret their meaning - at least not at the level of beginner statistics.
Meanwhile, the fifth variable in our dataset is classified as a Factor variable. This factor variable has three levels, described through words (although in the output we can only see two of them), and each level is associated with a specific value. You cannot really infer this directly from the output, but take it from me as being true. This variable thus appears to contain two types of data: words and numbers. What gives?
TRUE or FALSE: The variable Species is categorical.
TRUE or FALSE: The variable Species is interval.
Which of the following most accurately describes the structure of the variable Species:
Well, this is something of a trick. Levels in factor variables are better thought of as labels, labels that are discrete in nature. Thus, the label setosa, for example, is really just a placeholder to give more intuitive meaning to the actual value within the cell, which is 1. This is an interesting case where the distinction between discrete and continuous variables becomes important. Factors in R are technically stored as integers under the hood, with each level corresponding to a number. You can check this by converting a factor back into numeric form: what once looked like setosa, versicolor, and virginica will suddenly appear as 1, 2, and 3.
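You can try this yourself with as.numeric(), which strips away the labels and returns the underlying integer codes. Since the first rows of the dataset are all setosa, the first six codes are all 1:
#Convert the factor back into its underlying integer codes
head(as.numeric(iris_data$Species))
[1] 1 1 1 1 1 1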
The crucial point is that even though factors are stored as numbers, those numbers do not carry any quantitative meaning. Adding 1 and 2 does not give you a “hybrid” category between setosa and versicolor, and calculating an average of factor levels is meaningless. The numeric coding is only there for efficiency and to allow R to keep track of categories in a consistent way.
This distinction also helps us see why categorical data are considered discrete. With continuous variables such as Sepal.Length, we could in principle measure infinitely fine differences - 5.1 cm, 5.11 cm, 5.111 cm, and so on. But with a factor, there is no continuum between setosa and virginica. The categories are separate bins, and an observation must fall entirely into one of them.
Individual variables can only host one data type (i.e., nominal, ordinal, interval, or ratio). In cases where there are, for some reason, multiple different types of data (e.g., numbers and words) contained within the same column, R will revert to classifying the column according to the least restrictive classification.
For example, if one of your variables contains a mixture of words and numbers (e.g., zero, 1, 2, 3 ...), then R will classify everything in that column as a word. Or, more precisely, as a string variable (what R calls character data). String variables represent textual data (i.e., words) and form a separate class of data in R. What makes them interesting is that, once converted into a factor, each unique string value is mapped internally to a number, while the original string is retained as a label. This is exactly how factor variables work: for instance, the Species variable in the iris dataset is a factor. Internally, setosa, versicolor, and virginica are stored as the integers 1, 2, and 3, but R keeps the original string labels so that outputs remain readable. This mapping allows R to efficiently store categorical data and perform statistical operations that depend on levels rather than the textual content itself.
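A quick way to see this coercion at work is to build a small vector by hand (the mixed object below is invented for illustration and is not part of the iris data):
#Mixing words and numbers forces everything into character (string) form
mixed <- c("zero", 1, 2, 3)
class(mixed)
[1] "character"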
Another important special type is date-time data. Variables that represent dates or times, such as 2025-09-03 or 12:30:00, are stored in classes like Date or POSIXct in R. These are technically numeric under the hood (representing days or seconds since a reference point), but they behave differently because R provides specialized functions for comparison, formatting, and arithmetic with dates and times. For example, you can calculate the difference between two dates, extract the month or weekday, or filter data by date ranges. This type of data is not so important for beginners but can become quite important if you wish to work with time-series or longitudinal data, for example.
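As a small sketch of what this looks like in practice (reusing the 2025-09-03 date from above alongside an invented second date):
#Dates support arithmetic and component extraction
d1 <- as.Date("2025-09-03")
d2 <- as.Date("2025-09-10")
d2 - d1
weekdays(d1)
Running these lines reports a time difference of 7 days and the weekday "Wednesday".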
When thinking about the distinction between nominal and ordinal data in R, things become a bit more tricky. You can make this distinction in R by creating "unordered" and "ordered" factor variables, but it usually only matters in more advanced contexts. As a result, ordered factors will be covered in a separate post. What is important for now is that nominal and ordinal data are both categorical. From an R perspective, this means they should take a factor form. Nominal data can also be represented using string, but, as explained above, it is difficult to do any statistics with this type of data. Meanwhile, ordinal data can be represented using numeric, but this is theoretically problematic and makes the interpretation of any statistical analysis difficult, something which will also be covered in later posts. Bottom line: categorical data are factor data.
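For the curious, here is a minimal sketch of the difference, using an invented ratings variable (the finer details of ordered factors are left for that later post):
#An ordered factor encodes the ranking of its levels
ratings <- factor(c("low", "high", "medium", "low"),
                  levels = c("low", "medium", "high"),
                  ordered = TRUE)
Printing ratings now displays its levels as low < medium < high, making the ordering explicit; an unordered factor would treat the three levels as mere labels.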
Looking at your data: A Primer
When working with categorical variables, it is often convenient to look at how well each of the categories is represented in your data. A simple way to do this is through the table() command.
#Look at categories more closely
table(iris_data$Species)
setosa versicolor virginica
50 50 50
The $ operator in R is a convenient way to access individual columns (variables) from a dataset. For instance, if we have the iris dataset loaded, iris$Sepal.Length returns the column (sometimes called a "vector") of all sepal lengths. Using $ is often more readable and faster than alternative methods, like indexing with iris[, "Sepal.Length"], and it combines well with packages like tidyverse. One important thing to remember is that $ only works smoothly with column names that are valid R identifiers - if a column name has spaces or unusual characters, you'll need backticks or a different indexing method. Using $ in combination with functions like table() makes it simple to explore and summarize individual variables (Species) in a dataset (iris_data).
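To illustrate, both lines below extract the same column; head() simply trims the result to the first six values:
#Two equivalent ways of extracting a single column
head(iris_data$Sepal.Length)
head(iris_data[, "Sepal.Length"])
Each returns the same vector: 5.1 4.9 4.7 4.6 5.0 5.4.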
Here we see that we have an equal number of observations for each species of flower. This may or may not be meaningful, depending on your context. Where things can become interesting is when, for example, you have categories which are represented by hardly any observations. In more advanced statistical analyses, this can raise questions around, for example, whether such categories should be aggregated and combined. Usually, this type of decision is inspired by both empirics and theory. Theoretically, it becomes a question of what these categories mean and the extent to which you are actually interested in them. Empirically, it is often difficult to work with categories which only report a small number of observations.
To take another example, when studying property prices, it is often important to consider the type of building people are living in. It seems obvious that houses, boats, and apartments are probably priced differently, so this makes sense. But there are potentially dozens of ways to classify a building, depending on whether you are considering architectural, land-use, and/or household differences, for example.
Most people are not explicitly interested in all the specific categories you can make, but rather just a handful (e.g., standalone houses and apartments). In such cases, it makes sense to think about aggregating many highly detailed categories into broader groups: these broader groups better capture what you are interested in theoretically, while the highly disaggregated categories offer little-to-no value to your empirical work.
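As a hypothetical sketch (the building vector is invented for illustration), rare categories can be folded into a broader group by relabelling factor levels:
#Collapse a rare category into a broader "other" group
building <- factor(c("house", "apartment", "houseboat",
                     "house", "apartment", "apartment"))
table(building)
levels(building)[levels(building) == "houseboat"] <- "other"
table(building)
The first table() call shows houseboat with only a single observation; after relabelling, it is absorbed into other.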
You see similar problems emerge in countless other contexts. Studies of race/ethnicity, age, and gender are three examples which come directly to mind.