R on R (for ecology)

Top five(ish) sources of ecological data

Thu, 23 Mar 2023 08:45:39 +0000

As you’re learning R, it can be hard to come up with data sets that you can practice with. Though many of us have our own data, those might not always be in the best format to do what we want. Our own data are often messy and require a lot of recoding and reformatting. Wouldn’t it be nice if we could download clean data sets that we could work with? Luckily, there are a number of resources out there - you just have to know where to look!

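In this tutorial, I discuss the following data sets:

1) Basic data sets in R
2) The Knowledge Network for Biocomplexity
3) The Environmental Data Initiative
4) National Ecological Observatory Network
5) Species and biodiversity data (the Global Biodiversity Information Facility)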

I also mention the Ocean Biodiversity Information System, DataOne, and the Central Michigan University Library website’s list of resources.

1) Basic data sets in R

One of the first places you can look for practice data sets is within R itself.

R comes with some standard data sets that you can view if you type data() into the console. These data sets range from describing the survival of Titanic passengers to describing the locations of earthquakes off the island of Fiji. They are wide-ranging and fun to explore, but most of them are not explicitly ecological.

Some common ecological data sets that you might use are iris, PlantGrowth, and Loblolly. I find these data sets useful when I’m trying to do something quick, like testing how a new function works. Since these data sets are so straightforward, I can usually predict what my expected output should be, and then I can know whether or not the function worked correctly. I also use these data sets as examples for the blog posts that I write - these data sets are great teaching tools because they’re fairly simple and easy to understand.
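As a quick sketch of how you might explore these built-in data sets (they ship with R's datasets package, so nothing needs to be installed; the variable names below are just for illustration):

```r
# List the names of the data sets that ship with the 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Load a built-in data set and inspect its structure
data(PlantGrowth)
str(PlantGrowth)
summary(PlantGrowth$weight)

# Try out a function on simple data where the result is easy to sanity-check:
# here, the mean plant weight within each treatment group
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_means
```

Because PlantGrowth has only three groups, it is easy to eyeball whether the three means look reasonable before you trust the same function on your own data.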

These data sets are not really intended to be used to conduct your own research; they are primarily used for practice and demonstration purposes.

2) The Knowledge Network for Biocomplexity

Introduction and how-to

The Knowledge Network for Biocomplexity (KNB) is an international repository of ecological data sets that have been uploaded by scientists to facilitate environmental research. These data are also often affiliated with published papers.

You can search data sets in a variety of ways. On the left side, you can filter the data based on different attributes (e.g., author, year, taxon, geographic location). On the right side, you can look for data sets by location by navigating the handy world map and clicking on the different squares.

When you click on a data set, you’re taken to a page where you can download all the associated files. The heading at the top is also the citation for the data package, so it’s easy to correctly attribute the work. If you’re using a public data set and publishing something (even if just in a blog post or an example), it’s a good idea to cite the data set.

Published data sets are often identified by their DOI, or “digital object identifier”. This is a unique ID assigned to each published entity. If you append the DOI string to “https://doi.org/” (e.g., https://doi.org/10.5063/F1FN14M4), you’ll get a URL that takes you to the publication.
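If you have a DOI in hand, a small sketch like this builds the URL in R. The DOI used here is the one from the Haas-Desmarais et al. (2021) citation at the end of this post; the variable names are my own.

```r
# A DOI as it often appears in a citation, with a "doi:" prefix
raw_doi <- "doi:10.5063/F1FN14M4"

# Strip the prefix (if present) to get the bare DOI string
doi <- sub("^doi:", "", raw_doi)

# Append the DOI to the resolver address to get a clickable URL
doi_url <- paste0("https://doi.org/", doi)
doi_url
```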

This page also includes the metadata for the data set to make it easier to navigate and understand the data you’re downloading.

All good data sets come with metadata, or data that describes the data set of interest. When you download a data set that was collected by someone else, it’s usually hard to tell what each column means, how it was collected, and what its units are. Luckily, metadata helps us figure out how a data set is organized and how we might want to use it. If a data set doesn’t come with metadata, then it’s very difficult to use and understand the data, rendering it almost useless.

For example, this data set by Haas-Desmarais et al. (2021) comes with great metadata for each file that’s included in the data package. The “observations_complete.csv” file contains several variables, listed on the side. The authors have defined each variable for us - now we know that the variable “actual_time” represents the time listed on the camera and does not reflect the actual time in the world. The metadata also tells us the format / unit of the measurement.

Takeaways and application

One of the great things about KNB data sets is that there’s often a published journal article associated with them (usually linked in the metadata). This allows you to put the data set in the context of the research, and can give you an idea of how you might be able to manipulate the data as you’re practicing your R skills. Maybe reading the article will even raise some questions for you that you might want to explore.

Sometimes the data sets also come with associated R scripts or R Markdown documents that contain the analysis for the paper. This provides a great learning tool where you can see how other scientists conducted their analyses and try to reproduce them.

You can also download data from the KNB through R, using the dataone package (developed in the rdataone repository on GitHub). However, I usually like to download data directly from the site so I can first familiarize myself with the data set.

3) The Environmental Data Initiative

Introduction and how-to

One of my favorite places to download ecological data is the Environmental Data Initiative (EDI) data portal. The EDI archives a lot of environmental data that come from publicly funded research. Its specialty is that it is the primary archive for data from Long-Term Ecological Research (LTER) sites in the United States. This means that the EDI will often have several years' worth of data for a given data set, making it a great resource for examining long-term trends.

For example, the EDI hosts data for a project called “EcoTrends”, a large synthesis effort that aggregates ecological data on a yearly or monthly time scale. The aim of the project is to make long-term ecological data easier to access, analyze, and compare among research sites to evaluate global change. All the EcoTrends data are organized into a common, clean data format (maybe providing good practice for making plots in R?).

As with the KNB, you can browse data in the EDI portal in a number of ways - you can search by LTER site, or based on keywords that the data creators associated with their data set. Some especially useful methods might be to look for data by discipline, by ecosystem, or by organism.

You can also browse data sets by their package identifier, which groups data sets by LTER site or by a specific project (e.g., EcoTrends or the PaleoEcological Observatory Network). Examples of package identifier names include “edi”, “ecotrends”, or “knb-lter-arc”. These codes, in combination with strings of numbers, are used within the EDI to uniquely identify each data set.
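To make the identifier structure concrete, here is a small sketch that splits a package ID into its pieces. It assumes the common EDI layout of scope.identifier.revision, and the numbers in the example ID are made up for illustration rather than pointing at a real data package.

```r
# A hypothetical package ID: scope, then an identifier number, then a revision
package_id <- "knb-lter-arc.1234.2"

# Split on the literal dots (fixed = TRUE avoids treating "." as a regex)
parts <- strsplit(package_id, ".", fixed = TRUE)[[1]]

scope      <- parts[1]               # which site or project the data belong to
identifier <- as.integer(parts[2])   # the data set number within that scope
revision   <- as.integer(parts[3])   # the version of the data set

scope
identifier
revision
```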

The EDI also has an advanced search tool, where you can specify several attributes like geographic location, temporal scale, research site, authors, taxon, etc.

Once you’ve decided on a data set, you’ll be taken to a page that summarizes the data package you’re looking at. This page will provide some basic information like authors, publication date, citation, abstract, and spatial coverage. There will also be a link to download the data, and a link to view the full metadata. Like with the KNB, some data sets come with R scripts that you can run and learn from.

Takeaways and application

Something really neat that the EDI provides on each data package page is a code generator that will read in the data for you and format it appropriately. The EDI will generate code for several different coding languages, like Matlab, Python, R, and SAS. We are of course interested in the “R” and “tidyr” options.

The code under the “R” option will read in the data as a data frame, while the code under the “tidyr” option will read in the data as a tibble, using the tidyverse package (check out our post here [LINK] for a rundown on the differences between data frames and tibbles). You can either download an .R file with the code already written, or you can copy and paste the code into your own file.

Again, the EDI boasts numerous data sets with long-term measurements (some on the scale of decades!), making it really useful for examining long-term trends.

Quick note from Luka: But what do you do with your data once you have it? If you are still a beginner with R, then I encourage you to check out my full course on The Basics of R (for ecologists). I designed the course to take away the stress of learning R by leading you through a self-paced curriculum that makes R easy and painless. I’m confident this course will give you all the essentials you need to feel comfortable working with your own data in just a few weeks. Just click below 👇 to start the course and see what you think!

Or, if you already feel solid with the basics, take your data visualization to the next level with my Introduction to Data Visualization with R (for ecologists) where I teach you everything you need to create professional and publication-quality figures in R. 👇

4) National Ecological Observatory Network

Introduction and how-to

The next resource I’m going to discuss is the National Ecological Observatory Network (NEON), which is a network of field sites across the United States at which several types of ecological data are regularly collected in terrestrial and aquatic environments.

The network is designed so that the U.S. is divided into 20 ecological/climatic domains. Almost every domain has terrestrial and aquatic field sites, which are often placed in close proximity to one another to allow for analysis of linkages across these ecosystems. NEON collects remotely sensed data, observational data, and data via automatic sensors (e.g., meteorological towers), with the idea that these data will be collected over many, many years. These data are also standardized across NEON sites. As a result, NEON data cover a broad spatial and temporal extent, allowing us to collect and compare certain measurements across the entire U.S. and over long periods of time.

When you’re looking for NEON data, you can search for data in one of two ways.

The first way is to look for data by site or location through the interactive map on NEON’s homepage. This is more of an exploratory approach, where you can zoom in on different parts of the map. The table beneath the map shows you what field sites and plots are visible. If you want to look at a site’s data, you can just click “Explore Data” under the site name, and you’ll be taken to NEON’s data archive page.

If you zoom in on a specific research site (I zoomed in on the Smithsonian Environmental Research Center), the map will show you specific plots and locations of towers.

If you’re curious about a specific research site, you can also navigate to the site’s information page, which gives a lot of great background about the history of the site, some native fauna and flora, the geology, climate, etc. The image below shows part of the Toolik Field Station NEON page. The right-hand side of the page shows a lot of basic information about the site, like the coordinates, elevation, mean annual temperature, etc. Note that many NEON sites are also LTER sites (e.g., Toolik, Konza Prairie, Jornada).

The other way to search for data is to simply go to NEON’s “Explore Data Products” page. You can filter your data search by date, research site, state, domain, and research theme (e.g., atmosphere, biogeochemistry, land cover, organisms/populations/communities). The data sets are grouped by measurement and not by research site. So, for example, you can download a wind speed data set that includes wind speeds from all the research sites that collect that data.

When you decide on a data set that you want to look at, you can click on the data set name. This will take you to the page for the specific data set, which has loads of information.

The first part of the page shows information on the data set, including a description of the data, an abstract / reasoning for the data collection, and a citation for when you use the data.

If you scroll down, you can see information about how the data was collected and processed. NEON provides a brief description about the sampling scheme and instrumentation, as well as detailed documentation about the methods and QA/QC process. They also provide an issue log to address problems that arose during data collection or processing, and they let you know at what sites those issues occurred.

The next section shows the spatial and temporal availability of the data. In the table below, each row represents a research site and each column represents a month. The cells are colored in if there is data available at the research site during that month. The cells are grey if there is no data available. You can click the blue “Download Data” button to begin the data downloading process.

When you’re ready to begin downloading data, you can choose what research sites and time periods you want to download data for. Note the estimated file size in the top right corner, as some data sets are very large and can take a while to download. The page provides instructions for how to select sites and your date range. After you make your selection, you will be able to choose whether or not you want to download any associated documentation (i.e., sampling scheme and protocol documents listed in the “Collection and Processing” section). You can then choose whether you want a basic data package or expanded package, which includes QA/QC metrics. After you agree to NEON Usage and Citation policies, you can then download your data set!

When you unzip the data download, you’ll see a bunch of folders. Each folder represents a site-month combination. Within each folder, there are several .csv files. I recommend that you read the .txt file that comes with it, as it describes what each .csv file contains and helps you put together the pieces to understand the data.
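Since each site-month lives in its own folder, you will usually want to combine the files into one table. Here is a rough base R sketch of that idea. It builds a tiny fake download in a temporary directory, so the folder names, file names, columns, and values are all made up for illustration; real NEON downloads have their own naming scheme, and the neonUtilities package has a stackByTable() function designed for them.

```r
# Simulate a tiny unzipped download: one folder per site-month,
# each holding a .csv file (all names and values here are hypothetical)
download_dir <- file.path(tempdir(), "neon_demo")
site_months <- c("TOOL_2020-06", "TOOL_2020-07")
for (sm in site_months) {
  dir.create(file.path(download_dir, sm), recursive = TRUE, showWarnings = FALSE)
  demo <- data.frame(site_month = sm, day = 1:3, value = c(1.5, 2.5, 3.5))
  write.csv(demo, file.path(download_dir, sm, "measurements.csv"), row.names = FALSE)
}

# Find every .csv file across the site-month folders
csv_files <- list.files(download_dir, pattern = "\\.csv$",
                        recursive = TRUE, full.names = TRUE)

# Read each file and stack them into a single data frame
stacked <- do.call(rbind, lapply(csv_files, read.csv))
nrow(stacked)
table(stacked$site_month)
```

This only works as written when every file has the same columns, which is why it helps to read the accompanying .txt file first and stack one type of table at a time.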

NEON also provides a helpful visualization tool on the data set information page. The tool will graph the data for you, so you can get an idea of what it looks like before you download it. You can manipulate pretty much any aspect of the graph. You can add sites to the plot to see how they compare to one another, and you can choose which specific sensor’s data you want to display (each site usually has multiple sensors at different locations). You can also adjust the date range that is displayed and the specific variable that is plotted (e.g., minimum, maximum, or mean values). The scroll bar below the X axis allows you to zoom in/focus on a specific time range. The axis ranges, scales, and breaks can also be adjusted. Lastly, you can download the plot as a PNG.

I encourage you to play around with this - it’s such a neat tool! Unfortunately, the visualization tool isn’t available for every data set, but it’s often available for measurements that are taken by automatic sensors or towers (e.g., air temperature, wind speed, barometric pressure).

Takeaways and application

NEON has its own R package, called neonUtilities, which provides functions to help you import and work with NEON data. NEON also provides R tutorials for working with NEON data and for general ecological analysis. For example, here’s a tutorial on how to download and explore NEON data. And here’s a guided practice lesson where you can learn how to search for and visualize precipitation data. Here are NEON’s recommendations for people who are just getting started with NEON data and/or R.

In short, NEON data are useful for illuminating spatiotemporal trends. NEON is great for comparing several types of data (phenological, biogeochemical, climatological, etc.) across different terrestrial and aquatic environments in the United States. There are also several sites within each ecoclimatic Domain, so you can examine trends across ecological gradients (e.g., elevation).

5) Species and biodiversity data

The Global Biodiversity Information Facility

Introduction and how-to

Collecting species occurrence and biodiversity data can be really useful for modeling species distributions and understanding how they might change (e.g., studying impacts of climate change or predicting the spread of invasive species).

The Global Biodiversity Information Facility (GBIF) is an international data repository that is commonly used to obtain species occurrence data. Let’s check it out.

The main ways to search for data are to search for occurrences, to search for species, or to browse data sets.

When you search for data by occurrences, the easiest method is probably to search for your species of interest. When you type in your species name in the search bar, a drop down menu will appear that shows you the different names or subspecies that your species of interest might be known by. If you want to download all occurrences for your species, then you should include all possible names in your search. In the image below, I searched for Callinectes sapidus, commonly known as the Atlantic blue crab.

Once you complete your search, you can view occurrences in a table, as a map, or through a photo gallery (usually photos from iNaturalist, an app used for sharing biodiversity/wildlife observations).

There’s also a tab that you can click on to download occurrence data, which will look something like this once it’s downloaded. Each row of data is one observation of the species, and there are columns that will give you information on taxonomy, the country where the species was observed, the coordinates, and the date, among other data.
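Once the occurrence table is in R, a first pass often looks something like the sketch below. The data frame here is a made-up stand-in, not real GBIF records: the column names follow the Darwin Core style used in GBIF downloads, but every value is invented for illustration.

```r
# A made-up stand-in for a GBIF occurrence download (values are illustrative only)
occ <- data.frame(
  species          = rep("Callinectes sapidus", 5),
  countryCode      = c("US", "US", "MX", "US", "BR"),
  decimalLatitude  = c(38.9, 37.2, NA, 29.7, -23.0),
  decimalLongitude = c(-76.5, -76.0, NA, -93.9, -43.2),
  eventDate        = c("2019-06-01", "2020-07-15", "2018-05-20",
                       "2021-08-03", "2017-01-11")
)

# Keep only the records that have usable coordinates
occ_clean <- occ[!is.na(occ$decimalLatitude) & !is.na(occ$decimalLongitude), ]

# Count the remaining records per country
country_counts <- table(occ_clean$countryCode)
country_counts

# Quick look at where the records fall
plot(decimalLatitude ~ decimalLongitude, data = occ_clean,
     xlab = "Longitude", ylab = "Latitude", pch = 16)
```

Dropping records without coordinates is only one of several cleaning steps you might want, but it is usually the first one needed before mapping.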

The species search is slightly different from the occurrence search. As one might think, the species search focuses more on information about the species itself than individual records of occurrence data. The page has a pane on the left that describes the species taxonomy. The pane on the right shows an overview of the species, including the photo gallery, a map of its distribution, its common names, and places where the species is classified as “introduced” rather than native. This is helpful for broadly learning about your species of interest before you dive into the data.

Lastly, you can browse GBIF-associated data sets, which are not organized by species but by network / event / project.

For example, if I click on the “iNaturalist Research-grade Observations” data set, I’m taken to a page where I can download the whole iNaturalist database of species observations, see the geographic distribution of occurrences, and see the taxonomic breakdown of species listed in the data set.

Takeaways and application

GBIF also has a “Resources” section that can provide inspiration for projects and show you several helpful tools. For example, the “Data Use” tab lists different publications and projects that use GBIF data, showing you how GBIF data can be used to drive research.

You can also explore biodiversity and species distribution-related tools in the “Tools” tab and search for GBIF-related literature in the “Literature” tab. GBIF also has a data blog, where they discuss tips and tricks for how to use GBIF. Very useful!

One last note about GBIF is that it has its own R package, called rgbif. rgbif makes it really easy to read GBIF data into R. For more on this, check out this blog post from R-bloggers, which provides a commented script that walks you through how to import, clean, and map the data. GBIF is pretty commonly used, so there are several tutorials out there on how to use the data.

The Ocean Biodiversity Information System

There’s also the Ocean Biodiversity Information System (OBIS), which is like GBIF but for marine species (OBIS actually contributes marine data to GBIF). I’m not going to dive too deep into this resource, but OBIS also comes with its own R package, called robis. Something nice is that OBIS provides a few examples of analyses that can be done using OBIS data and using the robis package. The image below is an example of an R notebook that OBIS created to showcase its data - this can be a great learning tool to follow along with!

OBIS also has a great visualization tool, called “mapper”, that allows you to map species distributions on top of one another. Mapper is also the primary way you can search for species records in OBIS. In the image below, I mapped Callinectes sapidus (blue crab) distributions on top of Zostera marina (eelgrass) distributions. The green drop down menu beside each species occurrence layer also allows you to view or download occurrence data for that species and modify its appearance on the map.

Looking for more?

The DataOne portal is a huge archive of environmental data that aggregates data sets from several different repositories and organizations, including many of the resources we listed above (e.g., KNB, EDI, NEON). This is a good portal to look to if you want a very comprehensive search, or if you don’t know exactly what you’re looking for. The other repositories might be more helpful if you already know exactly what kind of data you want to retrieve.

I also want to highlight the Central Michigan University Library website, which has a great list of resources that you can consult to find data relating to the life sciences (including ecological data!). The website lists a few of the sources we described above, and more. It also provides some good sources of environmental data (e.g., habitat/spatial data and climate data), which could be helpful for modeling. I would definitely check it out, especially if you’re searching for public data to use for your own research.

If you’re just looking for practice data, the resources we listed above should provide plenty of data sets for you to use! I encourage you to explore all the data repositories that I recommended - they’re rich with tools and exciting data beyond what I covered in this blog post.

Do you have any favorite sources of ecological data? Let us know in the comments below! We made a top 5 list so we could dive deep into the details of each one, but it never hurts to learn about more resources. ;)

I hope this tutorial was helpful. As always, happy coding!

Also be sure to check out R-bloggers for other great tutorials on learning R

Citations

Stephanie Haas-Desmarais, Gabriel Benjamen, and Christopher Lortie. 2021. The effect of shrubs and exclosures on animal abundance, Carrizo National Monument. Knowledge Network for Biocomplexity. doi:10.5063/F1FN14M4.

How to make a scatterplot in R

Mon, 14 Nov 2022 09:30:50 -0400

Now that you’ve learned the very basics of plotting from our earlier tutorial on making your very first plot in R, this blog post will teach you how to customize your scatterplots to make them look better. If you want to take this even a step further, check out my step-by-step tutorial introduction to publication-quality scatterplots.

You can also watch this blog post as a video by clicking on the image below.

Scatterplots are one of the most common types of plots in ecology, where they show the relationship (or lack thereof) between two continuous variables.

We’re going to create the same scatterplot that we did in the other lesson by loading up the data set PlantGrowth.

This data set has 30 rows of data and two columns. The first column, “weight”, represents the dry biomass of each plant in grams. The second column, “group”, lists the experimental treatment that each plant was given. We’re going to add another column to this data set called “water”, which will describe the amount of water that each plant has received throughout its life, in liters. If you’re following along in RStudio (which you should be! 😄), then you can just copy and paste the code below to add the new column.

# Load data
data(PlantGrowth)
# Add a new column
PlantGrowth$water <- c(3.063, 3.558, 2.233, 3.147, 2.379, 2.106, 2.384, 2.444, 2.492, 3.292, 2.732, 2.153, 2.660, 1.938, 3.583, 1.817, 3.494, 2.559, 1.530, 2.372, 3.176, 2.611, 3.262, 2.947, 2.523, 2.152, 2.771, 2.878, 2.263, 2.518)
# View first few rows of data
head(PlantGrowth)

##   weight group water
## 1   4.17  ctrl 3.063
## 2   5.58  ctrl 3.558
## 3   5.18  ctrl 2.233
## 4   6.11  ctrl 3.147
## 5   4.50  ctrl 2.379
## 6   4.61  ctrl 2.106

Awesome. Now, using the plot() function, let’s create a plot of plant weight versus the amount of water that the plant received.

# Plot plant weight versus water received
plot(weight ~ water, data = PlantGrowth)

Now we have a basic scatterplot, but it doesn’t look all that great aesthetically. To help with that, I’m going to show you some different customizations that allow you to modify several of the plot elements.

Let’s start with the axis labels. We can modify the xlab and ylab arguments within the plot() function. xlab refers to the label on the X axis, while ylab refers to the label on the Y axis. Notice that I also pressed the “Enter” or “Return” key after each comma in the plot() function. This just keeps the code cleaner and more readable, but you could have also written it all in one long line.

# Edit the axis labels of the plot
plot(weight ~ water,
     data = PlantGrowth,
     xlab = "Total Water (L)",
     ylab = "Dried Biomass Weight (g)")

Great! Our axis labels look good. We can also make the graph a little more spacious by editing the limits of the axes. We can do this using the xlim and ylim arguments. These arguments accept vectors of the form c(lower_limit, upper_limit). So if we wanted the X axis to go from 1 to 5, we would say xlim = c(1, 5).

# Edit the axis limits of the plot
plot(weight ~ water,
     data = PlantGrowth,
     xlab = "Total Water (L)",
     ylab = "Dried Biomass Weight (g)",
     xlim = c(1.25, 3.75),
     ylim = c(3.25, 6.75))

Nice, our plot looks a little less crowded. The last aspect of the axes that you might want to change is the placement of the axis tick marks. We can do this using the xaxp and yaxp arguments. These arguments accept vectors of the form c(first_tick, last_tick, number_of_intervals). So if we want the X axis tick marks to go from 1.25 to 3.75 with 5 intervals in between, we would write xaxp = c(1.25, 3.75, 5).

# Edit the axis tick marks of the plot
plot(weight ~ water,
     data = PlantGrowth,
     xlab = "Total Water (L)",
     ylab = "Dried Biomass Weight (g)",
     xlim = c(1.25, 3.75),
     ylim = c(3.25, 6.75),
     xaxp = c(1.25, 3.75, 5),
     yaxp = c(3.5, 6.5, 3))
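If you want to check where the tick marks will land before plotting, a quick sketch with seq() works, since n intervals means n + 1 evenly spaced ticks (the variable names are just for illustration):

```r
# Tick positions implied by xaxp = c(1.25, 3.75, 5):
# 5 intervals between the first and last tick means 6 tick marks
x_ticks <- seq(1.25, 3.75, length.out = 5 + 1)
x_ticks

# Tick positions implied by yaxp = c(3.5, 6.5, 3):
# 3 intervals means 4 tick marks
y_ticks <- seq(3.5, 6.5, length.out = 3 + 1)
y_ticks
```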

Now let’s change the appearance of the points in the plot. The open circles that we currently have can be nice, especially if many of the points overlap. However, normally we would probably want to have simple, filled-in circles.

We can change the shape of the points using the pch argument. 16 happens to be the value that corresponds to filled-in points, but you can play around with other numbers to see the types of symbols that are available.

# Edit the point shape
plot(weight ~ water,
     data = PlantGrowth,
     xlab = "Total Water (L)",
     ylab = "Dried Biomass Weight (g)",
     xlim = c(1.25, 3.75),
     ylim = c(3.25, 6.75),
     xaxp = c(1.25, 3.75, 5),
     yaxp = c(3.5, 6.5, 3),
     pch = 16)
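Rather than trying the numbers one at a time, you could draw all of the standard symbols side by side. This is just a sketch for browsing the options; it plots the pch values 0 through 25 in a row so you can see which number gives which symbol.

```r
# Draw one point for each pch value from 0 to 25
pch_values <- 0:25
plot(x = pch_values, y = rep(1, length(pch_values)),
     pch = pch_values, cex = 1.5,
     xlab = "pch value", ylab = "", yaxt = "n")
```

Symbols 21 through 25 are outlined shapes that can also take a fill color through the bg argument.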

You can also change the color of the points using the col argument, where you can just type the name of a color in quotes.

# Edit the point color
plot(weight ~ water,
     data = PlantGrowth,
     xlab = "Total Water (L)",
     ylab = "Dried Biomass Weight (g)",
     xlim = c(1.25, 3.75),
     ylim = c(3.25, 6.75),
     xaxp = c(1.25, 3.75, 5),
     yaxp = c(3.5, 6.5, 3),
     pch = 16,
     col = "blue")

It can be fun to use different colors, but the best practice is to keep your figures in grayscale unless the colors in your figure specifically signify something. In the case of our figure, there isn’t really a reason to change the color of the points except for the purposes of demonstration. So let’s change the color back to black.

You can also change point size using the argument cex. The default for cex is 1, which represents 100%. So if we change the cex argument to 1.5, the points will be 50% larger.

# Edit the point size (and set the color back to black)
plot(weight ~ water,
     data = PlantGrowth,
     xlab = "Total Water (L)",
     ylab = "Dried Biomass Weight (g)",
     xlim = c(1.25, 3.75),
     ylim = c(3.25, 6.75),
     xaxp = c(1.25, 3.75, 5),
     yaxp = c(3.5, 6.5, 3),
     pch = 16,
     col = "black",
     cex = 1.5)

And now we have a nicer-looking scatterplot. The axis labels are clearer, the points have been filled in, and our plot looks less crowded. Now you know how to customize the axis labels, the axis tick marks and limits, and the point shape, color, and size within your scatterplot.

There is of course a lot more that you can do, but this tutorial is aimed at giving you the most important attributes that you can modify in the base plot() function. I used only these for the longest time without needing to branch out to ggplot or other more advanced techniques. But be sure to check out my other tutorial that takes this just a bit further to show you how to make publication-quality scatterplots. Happy visualizing!

Found this tutorial helpful? Check out my full course, Introduction to Data Visualization with R (for ecologists).

Free workshop on how to learn R

Thu, 12 May 2022 09:30:50 -0400

Hello everyone! I am psyched to announce the launch of my free workshop about how to learn R. It’s been a long time in the making, but it is finally here. The workshop is called The Myth of the R Learning Curve (or how not to go crazy when learning R).

In the workshop, I go over my own personal story and how I came to love and learn R. I also talk about why R is such a powerful tool that brings you to the cutting edge of science. Some of the other key topics I cover in the workshop include:

The Myth of the R learning curve: what it is, why it’s there, and how to quickly get beyond it
Why the order in which you learn R is critical for making it easy to learn
How to apply Pareto’s principle (the 80:20 rule) when learning R
The counter-intuitive secret about statistics and R
How you can keep practicing R even if you don’t have any data yet (and have fun in the process!)

I feel strongly about the fact that it doesn’t need to take years and expensive university courses to feel comfortable working with R. I have taught hundreds of students and I am excited to share what I’ve learned along the way. Don’t let the R learning curve stand in the way of doing good science. Learning R can be faster, more fun, and easier than you thought.

If you watch it to the very end, I’ll be sharing a cool bonus surprise so that you never have another reason to say you don’t know R.

To watch my free workshop, just click here to enter your email and access the workshop. 👇

I look forward to seeing you there!

~ Luka

The basics of prototyping and exporting your plots in R

Thu, 05 May 2022 09:30:50 -0400

It’s super rewarding when you finally figure out how to plot and visualize your data. But to show off your plot to the rest of the world, you first need to be able to save and export it from your RStudio workspace.

In this tutorial, I’m going to show you how to prototype, save, and export your plots from R. (Note, I use the terms ‘plot’ and ‘figure’ interchangeably to mean the same thing: a data visualization!)

For starters, if you need a tutorial on how to make plots in R, you can check out this video on how to make your first plot. You can also enroll in our full online course on data visualization, titled “Intro to data visualization in R (for ecologists)”.

You can also watch this blog post as a video if you want to follow along with one of the lessons from my full course:

Let’s start out by loading up some data:

# Load up our data
data(PlantGrowth)
# Look at our data
head(PlantGrowth)

## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl

The PlantGrowth data are pre-built into R, so you can load the dataset with just the data() function as I’ve done above. These data describe the weights of plants that were placed under different experimental treatments. Let’s look at those treatments:

# Look at the treatment levels
levels(PlantGrowth$group)

## [1] "ctrl" "trt1" "trt2"

Hmmm… “ctrl”, “trt1”, and “trt2” are not very good descriptions of the treatment levels. We need to describe them better if we want to put these treatment levels in a useful plot. Let’s rename them and we can say that the different treatments reflect different light levels.

# Change the names of the treatment levels 
levels(PlantGrowth$group) <- c("Control", "High-Light", "Low-Light")

Now the data look a little better.

# View the levels again
levels(PlantGrowth$group)

## [1] "Control" "High-Light" "Low-Light"

Let’s run this code to create a boxplot (see how I came up with this code in this video).

# Create the plot
plot(weight ~ group, data = PlantGrowth,
     xlab = "Sunlight Treatment",
     ylab = "Dried Biomass Weight (g)",
     col = 4,
     boxlty = 0,
     whisklty = 1,
     whisklwd = 1.5,
     staplelwd = 1.5)

When we just create a plot like this in RStudio, the visual proportions of the plot aren’t set automatically. In other words, your figure is drawn in, and conforms to, the Plots pane (the plot viewer) in RStudio. If you drag the edges of that pane, you can make the plot have whatever proportions you want. As a result, it can be hard to come up with figures that have consistent and correct sizing and proportions, especially if you’re making several figures that need to match.

So the general workflow that I use for creating figures is to first create something that looks more or less good in the viewer window. Then I begin prototyping the different sizing and aspect ratio of the figure by writing out the width and height right in the code until I find something that I like.

You can do this using the quartz() function on a Mac. If you run quartz(), it will open up a blank graphics device window like this one:

# Open graphics device window
quartz()

And then if we run our plot code after running quartz(), the plot will show up in the pre-sized window:

(Note that for Windows, you can use the windows() function, and for Linux, you can use the x11() function. I’m going to show how to do it on Mac since that’s the computer I have, but it should be similar on Windows or Linux computers as long as you change the function.)
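If you share scripts with people on other operating systems, one option is to wrap the three device functions in a small helper. The sketch below is only an illustration: the name open_plot_window() is made up for this example, and I'm assuming that checking Sys.info() is good enough to tell the platforms apart.

```r
# Hypothetical helper (not part of the original workflow): open a pre-sized
# graphics window with the device function that matches the operating system.
open_plot_window <- function(width = 4.5, height = 4.5) {
  os <- Sys.info()[["sysname"]]
  if (os == "Darwin") {
    quartz(width = width, height = height)    # macOS
  } else if (os == "Windows") {
    windows(width = width, height = height)   # Windows
  } else {
    x11(width = width, height = height)       # Linux and other Unix-alikes
  }
}

# Only try to open a window when R is running interactively
if (interactive()) {
  open_plot_window(width = 4.5, height = 4.5)
}
```

The width and height are in inches, just like the h and w values used with quartz() in this post.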

We can set a standard height and width for the new window, where h is height and w is width:

# Set a standard plot size
quartz(h = 4, w = 4)

If you don’t specify a height or width, the default size for quartz() is height = 7 and width = 7, measured in inches; as you can see in the image above, our h = 4 and w = 4 Quartz window is much smaller than the default behind it.

But notice that the font sizes and other graphical elements such as line widths or point sizes remain the same size! This is why it’s important to prototype the sizing. For example, it’s quite clear that the smaller 4x4 figure looks a bit better, aesthetically speaking, than the 7x7.

My biggest pet peeve is the common tendency of saving figures with a size that is way too big relative to the font and point size. This creates at best a very unappealing visual, and at worst a figure that is very hard to read or interpret in the first place.

Ok. I’ll stop that rant… 😆

When you’re assigning values to height and width, you should generally use values ranging from 1 to 10. But also watch out if you make the figure too small, because you might receive an error about the figure margins being too large to fit the figure itself.

For example, if I set the window to be 1 inch by 1 inch and then try to run the plot code, the console says Error in plot.new() : figure margins too large

You can keep playing around with the window size until you find something that works for you. Since 1"x 1" was too small, let’s set our plot size at height = 4.5, width = 4.5.

# Set plot size
quartz(h = 4.5, w = 4.5)

Now we have a plot that we’re happy with! To save the figure from the Quartz window, go to the “RStudio” menu tab and click “Save”. On a Windows computer, you might go to “File” or a similar menu tab.

This will prompt you to save your figure as a .pdf file. PDFs are actually one of the best file formats for figures because they store the plot as vector graphics, which gives them a virtually infinite resolution (try to keep zooming in on a figure you save as a PDF and you’ll see what I mean!). PDFs can sometimes end up being fairly large files (for example, when a plot contains a lot of points), so you can just convert the figure to a .jpg or .png whenever you need a smaller file type.

You also have the option to export your figure from the R figure viewer pane, either as an image or as a PDF. When you select either option, a window will pop up that will allow you to choose your figure height and width. The disadvantage of doing this versus using the Quartz window is that you aren’t really able to visualize what your sizing might look like, and if you want to share reproducible code with someone, they won’t know what size to save the figure as.

Now if we go to our files and click on the plot that we saved, we can see it in PDF form.

So those are the basics of prototyping and saving your plots in R. These are the tools I’ve always used for the majority of all my visualization work in R. That’s not to say that there aren’t other (even fancier) ways to save plots directly from the R code. Remember that from the Quartz window you still have to go to File > Save in order to export the plot. However, I find that this simple system using Quartz windows is the perfect intermediary between hard coding everything and total point-and-click exporting. It also lets you easily prototype your figures as you create them.
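For anyone curious about the "save directly from the R code" route, here is a minimal sketch using the pdf() graphics device from base R. It isn't the workflow I use above, just an illustration. The file name and the temporary folder are assumptions for the example, and the 4.5 x 4.5 inch size is the one we settled on while prototyping.

```r
# Assumed output location for this example: a file in R's temporary folder
out_file <- file.path(tempdir(), "plantgrowth_boxplot.pdf")

# Open a PDF device at the size we prototyped (width and height in inches)
pdf(file = out_file, width = 4.5, height = 4.5)

# Draw the plot; it goes into the PDF instead of onto the screen
plot(weight ~ group, data = PlantGrowth,
     xlab = "Sunlight Treatment",
     ylab = "Dried Biomass Weight (g)",
     col = 4)

# Close the device so the file is actually written to disk
invisible(dev.off())
```

Because the width and height live in the script, anyone who reruns the code gets a figure with the same proportions.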

And that’s it! Remember that if you want to learn more about visualization, be sure to check out my complete course “Intro to data visualization in R (for ecologists)”.

If you liked this and want to learn even more, you can check out the full course on the complete basics of R for ecology right here.

Check out my full course on data visualization with R (for ecologists) here:

Start visualizing your data now

Also be sure to check out R-bloggers for other great tutorials on learning R

How to reshape your data in R for analysis

Thu, 28 Apr 2022 09:30:50 -0400

One of the toughest parts of data analysis is preparing your data to be analyzed. We often have to deal with problems like NAs, typos, and data that are formatted incorrectly. In this blog post, I’m specifically going to help you with that last one — I’m going to show you how to reshape data so that it’s in the correct form for data analysis in R.

Wide vs. Long format data

Data often comes in two formats: wide or long.

Wide format data looks something like this:

This table describes the average diameter at breast height (DBH) in centimeters for three different tree species (red maple, white oak, and loblolly pine) at four sites labeled A, B, C, and D.

This is a very common way to format data, where the first column (tree species) contains unique values. Each tree species appears only once, and their DBH measurements are sorted in the table by site. So in wide format data, each row represents a tree species that we’re observing. This format is called “wide format” because the table becomes wider — if you want to add data, you need to add more columns to the table.

Note that it is common to start with your data in wide format since this format often makes it easy to enter data in the field. So when you transcribe your data sheets to your computer, it’s usually easiest to follow the same format — especially if you have hundreds of pages to upload.

Long format data, in contrast, looks like this:

Here, each tree species is repeated several times in the first column, and the site names all become values in a new column, called “Site”. Now, each row contains a tree species, the site it was found at, and its DBH measurement. In long format data, each row should represent one observation (each DBH measurement = one observation). This type of data is called “long format” because the table grows longer when you add more data.

Which format is more useful?

The advantage of wide format data is that it clearly and concisely summarizes DBH measurements for us. Wide format data is what we usually use to display data as tables for presentations or papers. Long format data, by comparison, is easier to use for data analysis or visualization in R.

Long format data is also called “tidy data”, as termed by Hadley Wickham, the lead developer of the tidyverse packages. He describes “tidy data” as having the following attributes:

Each column is its own variable
Each row is one observation
Each cell is one value

Our long format data fulfills all of those requirements. Each column in our long format data represents a variable (tree species, site, and DBH). Each row represents one observation: DBH. And each cell contains one value.

Let me provide a concrete example to show why tidy data is useful. What if we go to other sites and not all the species are present there? Let’s say Quercus alba is present at Site E but not Site F, while Acer rubrum and Pinus taeda are present at Site F but not Site E. If we add this data to our wide format table, it would look like this:

There are NAs in the “Site E” column for Acer rubrum and Pinus taeda because they weren’t present at that site, so it wouldn’t make sense for them to have a DBH measurement there. Likewise, there is an NA in the “Site F” column for Quercus alba because it wasn’t present at that site. By adding data to our wide format table, we also added in missing values that we’ll have to deal with.

If we add this data to our long format table instead, we don’t have any NAs because each row in our table represents a DBH measurement. The places where there were NAs in the wide format table just don’t have a row in our long format table. Though this is an advantage in some cases, the fact that the missing observations are not there at all can also make it easy to overlook missing data in long format. The nice thing about converting from wide to long format in R, though, is that those rows with NA values can be preserved if you need them.

Long format data also clearly shows the categories that we might want to analyze the data by — we can see that we have columns for tree species and for site. This makes it easy for us to average DBH for a specific species, or summarize DBH for a specific site. Organizing data in this way makes it much easier for us to add and analyze data.
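To make that concrete, here is a small sketch that averages DBH by species and by site using base R's aggregate() function. The DBH values are made up purely for illustration (they are not the numbers from the tables in this post), and the object names are my own.

```r
# Made-up DBH values (cm) in long format, for illustration only
dbh_long <- data.frame(
  species = c("Acer rubrum", "Acer rubrum", "Quercus alba", "Quercus alba",
              "Pinus taeda", "Pinus taeda"),
  site    = c("A", "B", "A", "B", "A", "B"),
  dbh     = c(20, 24, 35, 31, 28, 30)
)

# Average DBH for each species (one row per species)
species_means <- aggregate(dbh ~ species, data = dbh_long, FUN = mean)

# Average DBH for each site (one row per site)
site_means <- aggregate(dbh ~ site, data = dbh_long, FUN = mean)

species_means
site_means
```

Since species and site are ordinary columns, switching the grouping only means changing the right-hand side of the formula.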

Unfortunately, data often starts in wide format and it can be tedious to manually change it to long format data. Good thing there are functions in R to help us out! Let’s see how they work.

How to reshape your data using tidyr

First, let’s load the tidyverse package, which contains the tidyr package within it. You might need to run install.packages("tidyverse") if you don’t have this package installed yet.

library(tidyverse)

Preparing our data

Let’s also load a data set. I downloaded data describing forest area as a percent of total land area for each country of the world, from the World Bank’s Open Data catalog. You can find the same data set here to follow along.

# Load data
forest_dat <- read.csv("pct_forest_world.csv")
# Subset data and rename columns for easier visualization for this blog post
forest_dat <- forest_dat[114:122, c(1:2, 33:40)] %>%
  rename(name = Country.Name, code = Country.Code)
# View data
head(forest_dat)

## name code X1988 X1989 X1990 X1991
## 114 Iraq IRQ NA NA 1.8382605 1.8414615
## 115 Iceland ISL NA NA 0.1702743 0.1830025
## 116 Israel ISR NA NA 6.0998152 6.1968577
## 117 Italy ITA NA NA 25.8058210 26.0708578
## 118 Jamaica JAM NA NA 48.1329640 48.1303786
## 119 Jordan JOR NA NA 1.1049411 1.1049411
## X1992 X1993 X1994 X1995
## 114 1.8446624 1.8478634 1.851064 1.8542653
## 115 0.1957307 0.2084589 0.221187 0.2339152
## 116 6.2939002 6.3909427 6.487985 6.5850277
## 117 26.3358947 26.6009316 26.865969 27.1310054
## 118 48.1277932 48.1252078 48.122622 48.1200369
## 119 1.1049411 1.1049411 1.104941 1.1049411

If we check out the data, we can see that the first two columns describe country name and country code. After that, each column represents one year, ranging from 1988 to 1995. There are a bunch of NAs in the data before 1990, which is likely when the data set starts.

This data is currently in wide format. Each row represents one country, and the observations (% forest area each year) are spread out across a lot of columns. As time goes on, more columns will be added for each new year. This is very common for time-series data, where each column represents a new time point.

Before we begin reshaping our data, let’s get rid of the “X” that’s in front of every single year. We can do this using the sub() function. Here, we ask R to substitute the “X” at the start of the forest_dat column names with "" (nothing). The ^ in the pattern anchors the match to the beginning of the name, so an “X” anywhere else in a column name is left alone.

# Edit column names
colnames(forest_dat) <- sub("^X", "", colnames(forest_dat))
# View data
head(forest_dat)

## name code 1988 1989 1990 1991 1992
## 114 Iraq IRQ NA NA 1.8382605 1.8414615 1.8446624
## 115 Iceland ISL NA NA 0.1702743 0.1830025 0.1957307
## 116 Israel ISR NA NA 6.0998152 6.1968577 6.2939002
## 117 Italy ITA NA NA 25.8058210 26.0708578 26.3358947
## 118 Jamaica JAM NA NA 48.1329640 48.1303786 48.1277932
## 119 Jordan JOR NA NA 1.1049411 1.1049411 1.1049411
## 1993 1994 1995
## 114 1.8478634 1.851064 1.8542653
## 115 0.2084589 0.221187 0.2339152
## 116 6.3909427 6.487985 6.5850277
## 117 26.6009316 26.865969 27.1310054
## 118 48.1252078 48.122622 48.1200369
## 119 1.1049411 1.104941 1.1049411

Great, now we’re ready to reshape our data.

How to use `pivot_longer()`

The tidyr package provides us with some useful functions to help us reshape our data. One of these functions is pivot_longer(), which — you guessed it — changes your data from wide to long format.

The important arguments to know in this function are as follows: pivot_longer(data = data.frame, cols = columns.to.pivot, names_to = "New Column Name", values_to = "New Column Name")

data is just the data frame you want to reshape. cols lists the columns that you want to pivot. names_to is the name of the column that will be created from the variables that are in the column names. values_to is the name of the column that will be created from the values that are in the cells of the table.

This image shows how the function would pivot a simplified version of our data. You can see that each cell in the wide table on the right becomes its own row in the long table on the left:

Let’s look at a concrete example to see how the function works. In the code below, I asked the function to pivot our forest_dat data set, focusing on the columns labeled “1988” through “1995”. The function will create a new column called “year” to store all of the years that currently act as column names, and a new column called “pct_forest_area” to store all the % forest area values. I also added the argument values_drop_na and set it to TRUE, which asks the function to drop any row whose % forest area value is missing (NA).

# Create a long format table
forest_dat_long <- pivot_longer(forest_dat, cols = "1988":"1995", names_to = "year",
                                values_to = "pct_forest_area", values_drop_na = TRUE)

And now if you look at the data, you can see that our table is much longer than it was before (going from 9 to 54 rows). The country names are repeated several times in the first column, and now we have a column that contains the year and another column that contains the observation (% forest area). Now, each row of the data table describes one year’s measurement of % forest area in a certain country. You’ll also notice that there are no rows for 1988 and 1989: those years contained missing values (NAs) for all countries, so values_drop_na = TRUE dropped them. If we hadn’t added the values_drop_na argument, then we would still have rows for 1988 and 1989 in our table, and their pct_forest_area values would just say NA.

# View data
print(forest_dat_long, n = 54)

## # A tibble: 54 × 4
## name code year pct_forest_area
## <chr> <chr> <chr> <dbl>
## 1 Iraq IRQ 1990 1.84
## 2 Iraq IRQ 1991 1.84
## 3 Iraq IRQ 1992 1.84
## 4 Iraq IRQ 1993 1.85
## 5 Iraq IRQ 1994 1.85
## 6 Iraq IRQ 1995 1.85
## 7 Iceland ISL 1990 0.170
## 8 Iceland ISL 1991 0.183
## 9 Iceland ISL 1992 0.196
## 10 Iceland ISL 1993 0.208
## 11 Iceland ISL 1994 0.221
## 12 Iceland ISL 1995 0.234
## 13 Israel ISR 1990 6.10
## 14 Israel ISR 1991 6.20
## 15 Israel ISR 1992 6.29
## 16 Israel ISR 1993 6.39
## 17 Israel ISR 1994 6.49
## 18 Israel ISR 1995 6.59
## 19 Italy ITA 1990 25.8
## 20 Italy ITA 1991 26.1
## 21 Italy ITA 1992 26.3
## 22 Italy ITA 1993 26.6
## 23 Italy ITA 1994 26.9
## 24 Italy ITA 1995 27.1
## 25 Jamaica JAM 1990 48.1
## 26 Jamaica JAM 1991 48.1
## 27 Jamaica JAM 1992 48.1
## 28 Jamaica JAM 1993 48.1
## 29 Jamaica JAM 1994 48.1
## 30 Jamaica JAM 1995 48.1
## 31 Jordan JOR 1990 1.10
## 32 Jordan JOR 1991 1.10
## 33 Jordan JOR 1992 1.10
## 34 Jordan JOR 1993 1.10
## 35 Jordan JOR 1994 1.10
## 36 Jordan JOR 1995 1.10
## 37 Japan JPN 1990 68.4
## 38 Japan JPN 1991 68.4
## 39 Japan JPN 1992 68.4
## 40 Japan JPN 1993 68.4
## 41 Japan JPN 1994 68.3
## 42 Japan JPN 1995 68.3
## 43 Kazakhstan KAZ 1990 1.27
## 44 Kazakhstan KAZ 1991 1.27
## 45 Kazakhstan KAZ 1992 1.17
## 46 Kazakhstan KAZ 1993 1.17
## 47 Kazakhstan KAZ 1994 1.17
## 48 Kazakhstan KAZ 1995 1.17
## 49 Kenya KEN 1990 6.78
## 50 Kenya KEN 1991 6.80
## 51 Kenya KEN 1992 6.82
## 52 Kenya KEN 1993 6.83
## 53 Kenya KEN 1994 6.85
## 54 Kenya KEN 1995 6.87
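If you want to see what happens without values_drop_na, here is a small sketch on a made-up table (two hypothetical countries, two years). I also use the names_transform argument, which as far as I know is available in recent versions of tidyr (1.1.0 and later), to store the year as an integer instead of text. Treat the object names and values as placeholders.

```r
library(tidyr)

# Made-up values for two hypothetical countries; check.names = FALSE keeps
# the column names as "1989" and "1990" instead of adding an "X" in front
toy_wide <- data.frame(
  name   = c("Country A", "Country B"),
  `1989` = c(NA_real_, NA_real_),
  `1990` = c(10.5, 42.0),
  check.names = FALSE
)

# No values_drop_na here, so the NA rows for 1989 are kept;
# names_transform converts the new year column from text to integer
toy_long <- pivot_longer(toy_wide, cols = c("1989", "1990"),
                         names_to = "year", values_to = "pct_forest_area",
                         names_transform = list(year = as.integer))

toy_long
```

Keeping the NA rows can be handy when you want missing measurements to stay visible in the long table.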

This format also makes it easy to plot our data. For example, let’s look at % forest cover over the years for Japan.

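# Keep only the rows for Japan
japan <- filter(forest_dat_long, name == "Japan")
# pivot_longer() stored year as text, so convert it to a number before plotting
japan$year <- as.numeric(japan$year)
# Plot % forest cover over time in Japan
plot(pct_forest_area ~ year, data = japan)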

With our data in long format, we can easily see how we might want to group our data (maybe by country or by year) and then analyze it. Now let’s see how to turn our data back into a wide format table.

How to use `pivot_wider()`

The pivot_wider() function works similarly to pivot_longer(), but in the opposite direction.

Now, we want to widen our data and spread it out instead of gathering it into a longer form. The function works like this:

pivot_wider(data = data.frame, id_cols = identifying_columns, names_from = "Col with Names", values_from = "Col with Values")

data is just the data frame you want to reshape. id_cols lists the columns that contain essential identifying information for each observation. names_from is the name of the column that will be spread out to become more column names. values_from is the name of the column that the cell values will come from.

In the code below, I asked pivot_wider() to keep the columns “name” and “code” as identifying columns. I told the function to take all the new column names from the “year” column, and to take all the new values to fill in the table from the “pct_forest_area” column.

# Create a wide format table
forest_dat_wide <- pivot_wider(data = forest_dat_long, id_cols = c("name", "code"),
                               names_from = "year", values_from = "pct_forest_area")
# View table
print(forest_dat_wide, n = 9)

## # A tibble: 9 × 8
## name code `1990` `1991` `1992` `1993` `1994` `1995`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Iraq IRQ 1.84 1.84 1.84 1.85 1.85 1.85
## 2 Iceland ISL 0.170 0.183 0.196 0.208 0.221 0.234
## 3 Israel ISR 6.10 6.20 6.29 6.39 6.49 6.59
## 4 Italy ITA 25.8 26.1 26.3 26.6 26.9 27.1
## 5 Jamaica JAM 48.1 48.1 48.1 48.1 48.1 48.1
## 6 Jordan JOR 1.10 1.10 1.10 1.10 1.10 1.10
## 7 Japan JPN 68.4 68.4 68.4 68.4 68.3 68.3
## 8 Kazakhstan KAZ 1.27 1.27 1.17 1.17 1.17 1.17
## 9 Kenya KEN 6.78 6.80 6.82 6.83 6.85 6.87

You can see that the table looks much as it did when we first downloaded it. We have our main identifying columns for country name and code, and then we have several columns after that, each representing one year. The 1988 and 1989 columns didn’t get added back in because we dropped those rows (which only had NA values) when we created the long table.
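One thing to keep in mind when widening: any combination that has no row in the long table shows up as an NA in the wide table. Here is a small sketch with made-up DBH values that mirrors the Site E and Site F situation from the start of this post (the numbers and object names are placeholders, not real measurements).

```r
library(tidyr)

# Made-up DBH values (cm); not every species was measured at every site
dbh_obs <- data.frame(
  species = c("Acer rubrum", "Acer rubrum", "Quercus alba", "Quercus alba",
              "Pinus taeda", "Pinus taeda"),
  site    = c("A", "F", "A", "E", "A", "F"),
  dbh     = c(20, 23, 35, 34, 28, 30)
)

# Widen the table: one row per species, one column per site
dbh_wide <- pivot_wider(dbh_obs, id_cols = "species",
                        names_from = "site", values_from = "dbh")

dbh_wide
```

The three species and site combinations that have no row in dbh_obs come back as NA cells in dbh_wide.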

And now you know how to reshape your data from wide to long format and then back again. I hope this tutorial was helpful! Happy coding!

If you liked this and want to learn more, you can check out the full course on the complete basics of R for ecology right here or by clicking the link below.

Check out Luka's full course, the Basics of R (for ecologists), here:

Start learning R now

Also be sure to check out R-bloggers for other great tutorials on learning R

How to create your own functions in R

Thu, 21 Apr 2022 11:46:53 -0400

We’ve talked a lot about how to use different pre-made functions in R, but sometimes you just need to make your own function to tackle your data. In this blog post, I’m going to talk about how to create your own function and give a few examples.

Components of a function

Remember that a function is essentially a “black box” into which you add some inputs and then receive some outputs. Building a function is about building that “black box”, and there are several components that go into it.

Let’s first discuss those components. I’ve created an example function below, called “add_three”. It adds three to the value that is passed by the user.

add_three <- function(x){
  y <- x + 3
  return(y)
}
add_three(5)

## [1] 8

Take note of a few important elements:

Function name (add_three): this is just the name that you want to call your function. It should be something pretty short and easy to remember, like so many of the common functions we use (e.g., mean, plot, select). I chose the name “add_three”. As when we create any variables or objects in R, we use the arrow <- to assign this name to our function.
“function” and arguments (function(x)): we tell R that we want to create a function using function(). Within the parentheses, we can specify the number of arguments that we want our function to have. It doesn’t matter what we name our arguments within the parentheses (I named mine x), as long as we use the same names in the body of the function. If you want to have multiple arguments, it would look something like this: function(arg1, arg2, arg3, ...). Later, when you put your function to use, you’ll have to specify values for the arguments, like I did with the 5 in add_three(5).
Curly brackets: { and } come after function(argument) and need to bracket the actual function code that you’re writing.
Body of the function: this is the code in the function between the curly brackets that executes the task that you want. Here, I’ve created a new variable, y, to store the x + 3 value.
The return value (return(y)): Also inside the curly brackets, but usually at the end, this is the result that the function hands back to you when it’s done running (R prints it in the console if you don’t assign it to anything). I asked the function to return the value of y (aka, x + 3) to me.

And that’s all there is to creating your own function! Now I have a great function called add_three that I can use over and over again. You’ll notice that when you create a function, R adds this function to your environment. Just like you have to load packages to use them in a script, you’ll have to run your function code to add it to your environment each time you use it in a new script.
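As a quick illustration of a function with more than one argument, here is a sketch of a variation on add_three. The name add_n and the default value are made up for this example: the second argument, n, is the amount to add, and it defaults to 3 if you don't supply it.

```r
# Hypothetical variation on add_three(): n is the amount to add,
# and it defaults to 3 when the user leaves it out
add_n <- function(x, n = 3){
  y <- x + n
  return(y)
}

add_n(5)          # uses the default, n = 3
add_n(5, n = 10)  # overrides the default
```

Default values like this are a nice way to keep a function easy to call while still letting you change its behavior when you need to.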

The example I just gave was very simple, but learning how to create your own function unlocks a whole new realm of coding that can be as simple or complex as you want.

A few examples

Mathematical formulas

It’s not that hard to add three to a value. In fact, we probably didn’t need to create a function for that. But what if we want to create a function that performs something more complex, like solving quadratic equations? Let’s create a quadratic formula function.

quadratic <- function(a, b, c){
  root1 <- (-b + sqrt(b^2 - 4 * a * c)) / (2 * a)
  root2 <- (-b - sqrt(b^2 - 4 * a * c)) / (2 * a)
  root1 <- paste("x =", root1)
  root2 <- paste("x =", root2)
  # Return one root if both are identical; otherwise return both
  if (root1 == root2) {
    return(root1)
  }
  return(c(root1, root2))
}

This function accepts the coefficient $a$ of the quadratic term, the coefficient $b$ of the linear term, and the constant $c$ as arguments. I created two values to hold the two possible roots of the equation. I also wanted the function to print “x = answer”, so I created values that pasted the “x =” string onto the answer. The if statement at the end just says that if the two roots are equivalent, return only one of them. Otherwise, return both roots.

Now let’s see if the function works. Let’s test an equation with only one root, $x^2 + 6x + 9 = 0$, and an equation with two roots: $x^2 - 8x + 15 = 0$

quadratic(1, 6, 9)

## [1] "x = -3"

quadratic(1, -8, 15)

## [1] "x = 5" "x = 3"

It works! And now we have a function to help with our math homework :)
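One case the function above doesn't cover is an equation with no real roots, such as $x^2 + 1 = 0$, where the value under the square root is negative. Here is a sketch of one way to handle that, under the assumption that you're happy to get complex numbers back. The name quadratic_roots is made up for this example, and it returns the numbers themselves rather than "x = ..." strings.

```r
# Hypothetical variant: returns the roots as complex numbers, so a negative
# value under the square root is not a problem
quadratic_roots <- function(a, b, c){
  # Converting to complex lets sqrt() handle a negative discriminant
  disc <- as.complex(b^2 - 4 * a * c)
  roots <- c((-b + sqrt(disc)) / (2 * a),
             (-b - sqrt(disc)) / (2 * a))
  # Drop the duplicate when both roots are the same
  unique(roots)
}

quadratic_roots(1, 6, 9)  # one repeated real root
quadratic_roots(1, 0, 1)  # two complex roots
```

Real roots still come back in complex form (with a zero imaginary part), which is the trade-off of this approach.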

Manipulating strings

Now that we’ve created a mathematical function, let’s try creating a function that manipulates strings. Let’s say we want a function that accepts a species name as an argument and returns an abbreviated version: the first letter of the genus + the rest of the species name. For example, the blue crab, Callinectes sapidus, would be shortened to C. sapidus.

shorten <- function(name){
  name_split <- strsplit(name, split = " ")
  genus <- substr(name, 1, 1)
  species <- name_split[[1]][2]
  new_name <- paste(genus, ". ", species, sep = "")
  return(new_name)
}

I first used strsplit() to split up the full species name into genus and species, by specifying that I wanted the split to occur at the space between the words. The function substr() allows you to pick out specific characters in a string. I asked substr() to just take the first letter of the name. Then I created the new string by pasting together the first letter of the genus, a period and space, and the species.

shorten("Homo sapiens")

## [1] "H. sapiens"

shorten("Leiostomus xanthurus")

## [1] "L. xanthurus"

Neat! This could be really useful for shortening the names in a list of species — writing a custom function makes the process much easier.
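Here is a sketch of how that could look for a whole vector of names. The function is repeated (written so that it returns the new name) to keep the example self-contained, and the species_list and short_names objects are just names I picked for the example.

```r
# Same logic as shorten() above, repeated so this example runs on its own
shorten <- function(name){
  name_split <- strsplit(name, split = " ")
  genus <- substr(name, 1, 1)
  species <- name_split[[1]][2]
  paste(genus, ". ", species, sep = "")
}

species_list <- c("Callinectes sapidus", "Homo sapiens", "Leiostomus xanthurus")

# vapply() calls shorten() once per name and collects the results
# into a character vector
short_names <- vapply(species_list, shorten, character(1), USE.NAMES = FALSE)
short_names
```

The function itself only handles one name at a time, so vapply() does the looping for us.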

Functions without arguments

You can also create functions that don’t require arguments at all. For example, I could create a function that generates random coordinates for me, prints them, and plots them on a world map.

# Load necessary packages
library(tidyverse)
library(maps)
# Create the function
coords <- function(){
  # Randomly sample to get a random lat and long
  latitude <- runif(n = 1, min = -90, max = 90)
  longitude <- runif(n = 1, min = -180, max = 180)
  print(paste("Latitude: ", latitude, " Longitude: ", longitude, sep = ""))
  # Get data to plot a world map
  world <- map_data("world")
  # Plot the world map
  ggplot() +
    geom_map(data = world, map = world,
             aes(long, lat, map_id = region),
             color = "black", fill = "lightgray", size = 0.1) +
    # Plot our random point on top of the world map
    geom_point(aes(longitude, latitude), color = "red")
}

I used the runif() function to randomly sample one value each from the range of viable latitudes and longitudes. I used print() and paste() to display a message telling you the latitude and longitude values. Then I plotted the world map and our random point on top.

# What coordinates will I get this time?
coords()

## [1] "Latitude: 61.1949375551194 Longitude: 90.1810506172478"

# What about this time?
coords()

## [1] "Latitude: -62.7056198241189 Longitude: 24.0213383920491"

These examples were just three of many, many possibilities. Whatever task or operation you can think of in R, you can code it in a function. Get creative and have fun! Happy coding!

If you liked this and want to learn more, you can check out the full course on the complete basics of R for ecology right here or by clicking the link below.

Check out Luka's full course, the Basics of R (for ecologists), here:

Start learning R now

Also be sure to check out R-bloggers for other great tutorials on learning R

A simple introduction to ggplot2 (for plotting your data!)

Thu, 14 Apr 2022 09:20:07 -0400

If you’ve ever been totally confused by ggplot2 and what it is or how it works, my intention is that this short tutorial simplifies it down to a conceptual level from which you can build up later. Hope you enjoy!

Data visualization is a powerful tool for scientists and their audiences to easily grasp relationships and trends in data. Some of you may already know how to generate plots using base R. In this blog post, we’re going to introduce a package called “ggplot2” that makes it more intuitive to create consistently nice-looking figures in R.

You can also watch this blog post as a video if you want to follow along while reading:

The “gg” part of “ggplot2” stands for the grammar of graphics. Just like sentences are composed of various parts of speech (e.g., nouns, verbs, adjectives) that are arranged using a grammatical structure, ggplot2 allows us to create figures using a standardized syntax.

The first element in data visualization is your data, of course! Let’s load up a data set that comes built into R, called ChickWeight, and take a quick look at it. The data describes the weights and ages of chicks that are fed different diets.

# Load data
data(ChickWeight)
head(ChickWeight)

## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1

The next element is the aesthetics. This includes things like which variable goes on the X axis, which variable goes on the Y axis, and what size, shape, or color you want your points/lines/bars/etc. to be. You might have already noticed, but in this blog post, I’m going to assign different colors to the different graphical elements so that you can quickly pick them out in the syntax.

Let’s say we want to create a scatterplot showing weight versus time for these chicks. We’re going to assign time to the X axis, weight to the Y axis, and we want the different diets to show up as different point colors. When we assign variables to the different aesthetic elements, this is called “mapping” the variables to the elements.

Once you figure out how you want to map your data to aesthetic elements, then you present your data using a geometric object, like a scatterplot, boxplot, lineplot, etc.

So now we’ve talked about the essential graphical elements: data, aesthetics, and geometry.

There are a couple more elements in ggplot such as the coordinates, which allow you to choose what part of the plot you’re showing, and the theme, which allows you to decide how the graph looks in terms of things like font color, font family, and font size. If you don’t specify them, ggplot will just use the default settings for your plot.

Now let’s see how we actually code this in R! The basic method of constructing a figure in ggplot begins with the function:

ggplot()

Notice that this doesn’t say ggplot2(), though that’s the name of the package.

The first argument in the function is the data:

ggplot(data)

Then, we add the aesthetics:

ggplot(data, aes(x, y))

What would happen if we tried to plot this right now using the data? Remember when we loaded up ChickWeight way back at the start of this blog post?

ggplot(ChickWeight, aes(x = Time, y = weight))

We do see time on the X axis and weight on the Y axis, but nothing has shown up in the actual bounds of our plot because we’re missing our geometry.

To add a geometry object to the ggplot() function, we just have to add a "+" sign, start a new line, and add the geometry.

The function for a scatterplot is geom_point(). This specific function changes depending on what kind of plot you want, but the functions all begin with geom_. Within geom_point(), we can also specify aesthetics such as color or fill of the points, or any other aesthetic property that might be connected to the data. So now we have:

ggplot(data, aes(x, y)) +
  geom_point(aes(color))

Now to actually put the data in! To map data to aesthetics, we just set the aesthetics equal to whatever the variable name is in our dataframe. Using the current data, the code should look like this:

ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_point(aes(color = Diet))

And now if we plot it…

ggplot(ChickWeight, aes(x = Time, y = weight)) +
geom_point(aes(color = Diet))

Ta-da! We have a graph showing chick weight versus time, and we are able to represent different chick diets with different colors in the figure. Notice that ggplot automatically adds in a legend for you.
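One thing that often trips people up is the difference between mapping a variable to an aesthetic and simply setting an aesthetic to a fixed value. Here is a minimal sketch of the two, assuming ggplot2 is installed; the object names p_mapped and p_fixed and the color "steelblue" are just my choices for illustration.

```r
library(ggplot2)

# Mapping: the color depends on a column in the data, so it goes inside aes()
# and ggplot adds a legend
p_mapped <- ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_point(aes(color = Diet))

# Setting: every point gets the same fixed color, so it goes outside aes()
# and no legend is drawn
p_fixed <- ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_point(color = "steelblue")

# Print the plots
p_mapped
p_fixed
```

If a color name ends up inside aes() by accident, ggplot treats it as a data value rather than as a color, which is a common source of confusing legends.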

If we really want, we can also add in other elements such as the coordinates and theme like so (the X’s are stand-ins for various functions that could fill in the space, such as “theme_classic()”, for example):

ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_point(aes(color = Diet)) +
  coord_XXX() +
  theme_XXX()

If we plot this out, it might look something like this:

ggplot(ChickWeight, aes(x = Time, y = weight)) +
geom_point(aes(color = Diet)) +
coord_cartesian() +
theme_classic()
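The coordinates and theme become more interesting once you give them arguments. As a rough sketch (the axis limits, text size, and legend position here are arbitrary values I picked for the demonstration), you could zoom in on the first 10 days and tweak the theme like this:

```r
library(ggplot2)

p <- ggplot(ChickWeight, aes(x = Time, y = weight)) +
  geom_point(aes(color = Diet)) +
  # Zoom in on days 0 to 10 without removing any data points
  coord_cartesian(xlim = c(0, 10)) +
  theme_classic() +
  # Adjust individual theme settings on top of theme_classic()
  theme(text = element_text(size = 14),
        legend.position = "bottom")

# Print the plot
p
```

The extra theme() call comes after theme_classic() on purpose: a complete theme like theme_classic() resets all theme settings, so any tweaks placed before it would be overwritten.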

And don’t worry if all this gets a bit confusing or hard to remember after the basic graphical elements. Luckily there’s a cheatsheet online to help you remember everything that you can do with ggplot2.

Hope you enjoyed this brief introduction to ggplot2! It took me a long time to come to terms with learning how to ggplot, but when I finally did, it really did change how I do data visualizations. If you want to learn even more about how to create different types of figures with ggplot2, check out my full online course in data visualization, titled “Introduction to data visualization in R (for ecologists)” here. There I go over the five key types of plots in R for ecology and much more! Here’s a sample of what you’ll learn to create in that course:

If you liked this and want to learn even more, check out my full course on data visualization with R (for ecologists) here:

Start visualizing your data now

Also be sure to check out R-bloggers for other great tutorials on learning R

How to make a boxplot in R

Wed, 06 Apr 2022 11:46:53 -0400

In this tutorial, I’m going to show you how to plot and customize boxplots (also known as box and whisker plots). Boxplots are a common type of graph that allow you to look at the relationships between a continuous variable and various categorical groups. They are super common in ecology because we often need to compare values between different categories.

BTW, you can also follow along with a video tutorial of this blog post if you click on the image below:

For this tutorial, we’re going to use the built-in R dataset PlantGrowth, which might seem familiar to you because we used it in a few other data visualization tutorials.

To refresh your memory, PlantGrowth has 30 rows and two columns. The “weight” column represents the dry biomass of each plant in grams, while the “group” column describes the experimental treatment that each plant was given.

# Load the data
data(PlantGrowth)
# View the data
head(PlantGrowth)

## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl

Let’s say we want to compare the weight of plants among the different treatments. A boxplot is perfect for this type of visualization.

We’ve already learned about the plot() function in our earlier scatterplot tutorial (see our previous blog post). Something neat about plot() is that if the X axis is a categorical variable, the function will recognize that and will automatically graph a boxplot for you instead of a scatterplot.

If we look at the levels in the “group” column, we can see that “group” is indeed a categorical variable, with three different levels:

# Look at the levels of the "group" column
levels(PlantGrowth$group)

## [1] "ctrl" "trt1" "trt2"

So if we plot weight as a function of group (y as a function of x), we should get a boxplot.

# Make a boxplot of weight as a function of treatment group
plot(weight ~ group, data = PlantGrowth)

Awesome! We can see plant weight across the three different treatment groups, allowing us to easily compare groups.

Boxplot components

Now, let’s quickly go over the components of a box plot.

The solid black line in the middle of each box represents the median of the data.
The grey box represents the “interquartile range” (IQR) of your data, or the range between the 1st and 3rd quartiles. Values below the 1st quartile represent the lowest 25% of your data points, while values above the 3rd quartile represent the highest 25% of your data. The interquartile range contains the middle 50% of your data points.
The “whiskers” of a box and whisker plot are the dotted lines outside of the grey box. These end at the minimum and maximum values of your data set, excluding outliers.
Sometimes, you will have outliers in your data that are shown as points in the plot. Outliers are points that are more than (1.5 * IQR) below the 1st quartile or above the 3rd quartile.

Quick note about the min and max whiskers: The whisker end markers (the staples or “T”s) only sit at the true maximum or minimum of the data when no points lie beyond the outlier limits, that is, when the 3rd quartile + 1.5 x IQR is at least as large as the maximum value, or the 1st quartile - 1.5 x IQR is at least as small as the minimum value, respectively. Otherwise, the whisker ends at the most extreme data point inside that limit. In other words, the whiskers exclude outliers, which are all points more than 1.5 x IQR above the 3rd quartile or below the 1st quartile.
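If you want to check these rules against real numbers, the base R function boxplot.stats() returns the values that get drawn. The sketch below rebuilds the whisker limits by hand for one treatment group. The names pg, w, lower_fence, and upper_fence are my own, and note that base R uses “hinges”, which are close to (but not always identical to) the 1st and 3rd quartiles.

```r
# Use a fresh copy of the built-in data and pull out one treatment group
pg <- datasets::PlantGrowth
w <- pg$weight[pg$group == "trt1"]

# The five numbers that get drawn: lower whisker, lower hinge,
# median, upper hinge, upper whisker
bs <- boxplot.stats(w)
bs$stats

# Rebuild the outlier limits by hand
box_height <- bs$stats[4] - bs$stats[2]   # hinge-to-hinge distance (roughly the IQR)
lower_fence <- bs$stats[2] - 1.5 * box_height
upper_fence <- bs$stats[4] + 1.5 * box_height

# Points outside the limits are the ones drawn as outliers
bs$out
w[w < lower_fence | w > upper_fence]

# Each whisker ends at the most extreme data point inside the limits
range(w[w >= lower_fence & w <= upper_fence])
```

The last line should match the first and fifth values of bs$stats, which is a quick way to convince yourself that the whiskers stop at real data points rather than at the limits themselves.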

Modifying the axes

Now that we understand all the parts of a boxplot, let’s play around with the different components of the plot, starting with the axes. Customizing the axes is the same as for scatterplots, where we’ll use the arguments xlab and ylab to change the axis labels.

# Adding axis labels
plot(weight ~ group,
     data = PlantGrowth,
     xlab = "Treatment Group",
     ylab = "Dried Biomass Weight (g)")

Great, now we have axis labels! But the individual treatment group labels on our X axis are still worded pretty vaguely. To change this, let’s actually go back to our data. Let’s change “ctrl” to “Control”, “trt1” to “High light”, and “trt2” to “Low light”.

# Look at the levels of the group column
levels(PlantGrowth$group)

## [1] "ctrl" "trt1" "trt2"

# Change the names of the treatments in the data set itself
levels(PlantGrowth$group) <- c("Control", "High light", "Low light")
# View the group column again
PlantGrowth$group

## [1] Control Control Control Control Control Control
## [7] Control Control Control Control High light High light
## [13] High light High light High light High light High light High light
## [19] High light High light Low light Low light Low light Low light
## [25] Low light Low light Low light Low light Low light Low light
## Levels: Control High light Low light
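Assigning to levels() overwrites the labels in your copy of PlantGrowth. If you would rather keep the original labels around, one alternative (a sketch, with pg as my own name for the copy) is to relabel a copy of the data using factor():

```r
# Work on a copy so that PlantGrowth itself stays unchanged
pg <- datasets::PlantGrowth

# Relabel the factor; the labels are matched to the levels by position
pg$group <- factor(pg$group,
                   levels = c("ctrl", "trt1", "trt2"),
                   labels = c("Control", "High light", "Low light"))

# Check the new levels and the number of plants in each group
levels(pg$group)
table(pg$group)
```

Spelling out both levels and labels makes the pairing explicit, which helps avoid accidentally attaching a label to the wrong treatment.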

Now that we’ve changed the names of our treatments, let’s run the plot again.

plot(weight ~ group,
     data = PlantGrowth,
     xlab = "Treatment Group",
     ylab = "Dried Biomass Weight (g)")

Modifying the boxes and whiskers

Our plot is looking pretty good so far. Now let’s see how we can change the appearance of the boxes and whiskers. We can change the fill color of the boxes using the col argument, which accepts any color name or hex code in quotes. You can also set col to an integer, which picks the color at that position in R’s current color palette (type palette() into the console to see the list).

plot(weight ~ group,
     data = PlantGrowth,
     xlab = "Treatment Group",
     ylab = "Dried Biomass Weight (g)",
     col = 4) # or something like "blue" or a hex code like "#f234f9"

It can be fun to use colors, but it’s data visualization best-practice to keep your figures black and white (or grey-scale) unless you need to use colors to signify something in particular. Note that in the case of our figure, there isn’t really a reason to change the color of the boxes except for the purposes of demonstration here.

We can also change the appearance of the boxes' borders using boxlty, which stands for “box line type”. This argument can accept integers, which represent different line types. 1 corresponds to a normal line, 2 corresponds to a dashed line, and 0 corresponds to no line. You can test out other numbers, too! For now, let’s get rid of the box borders.

plot(weight ~ group,
     data = PlantGrowth,
     xlab = "Treatment Group",
     ylab = "Dried Biomass Weight (g)",
     col = 4,
     boxlty = 0)

To change the whisker line type, you can use the argument whisklty, which works the same way as boxlty. You can also change whisker line thickness using whisklwd.

plot(weight ~ group,
     data = PlantGrowth,
     xlab = "Treatment Group",
     ylab = "Dried Biomass Weight (g)",
     col = 4,
     boxlty = 0,
     whisklty = 3,
     whisklwd = 1.5)

Lastly, you can change the line thickness of the ends of the whiskers (these are called staples) using the staplelwd argument.

plot(weight ~ group,
     data = PlantGrowth,
     xlab = "Treatment Group",
     ylab = "Dried Biomass Weight (g)",
     col = 4,
     boxlty = 0,
     whisklty = 3,
     whisklwd = 1.5,
     staplelwd = 1.5)

You’ll notice that the arguments boxlty and whisklty seem similar, and that whisklwd and staplelwd also seem similar. You might have already figured out that to change the different plot components and their attributes, you can just mix and match box, whisk, and staple with lty, lwd, and col (which changes the color).
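To make the mix-and-match idea concrete, here is a sketch that combines a few more of these arguments. I call boxplot() directly, which is the function that plot() hands off to when the X variable is categorical. The colors and line widths are arbitrary choices, and pg and bp are just my names for the data copy and the returned object.

```r
# Use a fresh copy of the built-in data
pg <- datasets::PlantGrowth

bp <- boxplot(weight ~ group,
              data = pg,
              xlab = "Treatment Group",
              ylab = "Dried Biomass Weight (g)",
              boxcol = "grey40",     # color of the box border
              boxlwd = 2,            # thickness of the box border
              whiskcol = "grey40",   # color of the whiskers
              whisklty = 1,          # solid whiskers
              staplecol = "grey40",  # color of the staples
              staplelty = 1,         # line type of the staples
              medlwd = 3)            # "med" works the same way for the median line

# boxplot() also (invisibly) returns the statistics that it plotted
bp$names
bp$stats
```

Saving the result of boxplot() is handy when you want the plotted values, for example to report the medians alongside the figure.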

Changing the boxplot orientation

The last thing you can modify is the orientation of the boxplot. Right now, the boxes and whiskers are oriented vertically. If you want them to be horizontal, you can just add the argument horizontal = TRUE. This can be especially helpful if you have a lot of groups that you want to compare. Note that xlab always labels the horizontal axis and ylab the vertical one, so the two labels need to swap places when you flip the plot.

plot(weight ~ group,
     data = PlantGrowth,
     xlab = "Dried Biomass Weight (g)",
     ylab = "Treatment Group",
     col = 4,
     boxlty = 0,
     whisklty = 3,
     whisklwd = 1.5,
     staplelwd = 1.5,
     horizontal = TRUE)

And that’s it! Now we have a good-looking boxplot. In this tutorial I went over what the different parts of a boxplot mean, as well as how to modify the axes, the boxes and whiskers, and the orientation of the plot.

I hope you enjoyed this post! If you liked this and want to learn more, you can check out my full course on the complete basics of R for ecology right here or my course on data visualization with R (for ecologists) by clicking the link below.

Check out my full course on Data Visualization with R (for ecologists) here:

Start Visualizing Data with R now


Search through your ecological data with the 'grep()' function

Tue, 29 Mar 2022 09:09:55 -0400

We often want to search for a certain character pattern in our data. We do this all the time when we press “ctrl + F” (or “cmd + F” on a Mac) on a webpage. For example, maybe you have a list of species names and want to find all of the individuals within a certain genus. Or maybe you have several columns of climate data and only want to select the ones related to precipitation.

Here, I’m going to talk about the functions called grep() and grepl() that allow you to find strings in your data that match the pattern you’re looking for. I’m also going to discuss a function called sub(), which allows you to find and replace strings.

First, let’s load the dplyr package, which I’ll be using once or twice during the tutorial to demonstrate common uses for grep() and grepl(). Note that grep(), grepl(), and sub() come with base R, so there’s no need to load packages to use those functions.

library(dplyr)

Find matches using `grep()` and `grepl()`

To demonstrate how to use these functions, I’ve downloaded a data set from the Environmental Data Initiative (EDI) data portal. The EDI archives troves of environmental data that are publicly available and great for demonstration purposes or for supporting your own research. The data I downloaded describe the vegetation on barrier islands within the Virginia Coast Reserve Long-Term Ecological Research project. To follow along, you can download the data here.

Let’s import the data into R and subset it so that it’s easier to understand for this tutorial. I used the select() function in dplyr, where I first listed the data frame I want to analyze, and then the names of the columns I want to keep.

# Upload data
veg_dat <- read.csv("VCR_data.csv")
# Select specific columns
veg_dat <- dplyr::select(veg_dat, genus, species, island, habitat, relabund)
# View first few rows
head(veg_dat)

## genus species island habitat relabund
## 1 Acer rubrum Smith Pine-Hardwood_forest_stands 6
## 2 Acer rubrum Parramore Hardwood_forest_stands 6
## 3 Achillea millefolium Wreck Foredune_grassland 3
## 4 Achillea millefolium Smith Hardwood_forest_stands 4
## 5 Achillea millefolium Smith Dense_grasslands 4
## 6 Achillea millefolium Smith Foredune_grassland 4

This data set lists observations of species presence on different islands and in different habitats on those islands. We have a column for genus, species, island, habitat type, and the relative abundance of the species.

Let’s say that we’re interested in looking at all species that are found in forested habitats. How many habitat types do we have that are forested? We can use the unique() function to view all the unique entries for the habitat column.

# View all unique values in the habitat column
unique(veg_dat$habitat)

## [1] "Pine-Hardwood_forest_stands"
## [2] "Hardwood_forest_stands"
## [3] "Foredune_grassland"
## [4] "Dense_grasslands"
## [5] "Low_thickets"
## [6] "Tall_thickets"
## [7] "Open_dunes-thicket_complex"
## [8] "Beachgrass_dunes--Dense_grassland_dunes"
## [9] "Foredune-sparse_grassland_complex"
## [10] "Salt_flat"
## [11] "Sparse_grassland"
## [12] ""
## [13] "Drift--Wrack"
## [14] "Beach"
## [15] "Pine_forest_stands"
## [16] "Fresh_marsh"
## [17] "Upper_salt_marsh"
## [18] "Overwash_flats"
## [19] "Mudflats"
## [20] "Brackish_marsh"
## [21] "Lower_salt_marsh"
## [22] "Pine-hardwood_forest_stands"
## [23] "Juniper_thickets"
## [24] "code_error"
## [25] "Open_water"

It looks like we have several types of forest: “Pine-Hardwood_forest_stands”, “Hardwood_forest_stands”, and “Pine_forest_stands”. We also have “Pine-hardwood_forest_stands”, which is the same as the first one, but identified as a separate entry because “hardwood” is not capitalized. The easiest way to pick all of these habitats out of the data set would be if we could “ctrl-F” the word “forest” in the habitat column. Luckily, we have the grep() function to help us with that!

You can use the function like so: grep(pattern_text, vector). You can also add an argument to grep() where you set ignore.case = TRUE. This tells the function that you don’t want your search to be case-sensitive (if you leave the ignore.case argument out, the default is that the function is case-sensitive). Let’s try it out.

# Let's see how grep works
grep("forest", veg_dat$habitat, ignore.case = TRUE)

## [1] 1 2 4 7 34 35 54 91 96 97 106 112 113 114 146
## [16] 147 152 207 209 218 257 258 259 260 262 263 349 383 384 385
## [31] 388 389 390 397 398 414 424 442 443 444 474 484 485 488 545
## [46] 555 558 581 585 595 615 619 750 752 753 754 759 760 762 764
## [61] 768 771 812 823 828 834 835 837 855 911 915 916 932 933 934
## [76] 943 944 949 964 965 998 1015 1016 1028 1032 1033 1109 1122 1124 1125
## [91] 1138 1141 1142 1144 1145 1146 1173 1174 1175 1179 1180 1211 1215 1223 1224
## [106] 1227 1228 1229 1237 1238 1246 1247 1248 1250 1251 1252 1253 1256 1259 1264
## [121] 1265 1269 1270 1271 1272 1288 1289 1300 1408 1410 1458 1459 1462 1463 1464
## [136] 1538 1554 1555 1560 1561 1565 1567 1568 1569 1577 1578 1579 1585 1658 1659
## [151] 1745 1746 1877 1897 1899 1900 1903 1904 1908 1909 1910 1912 1915 1916 1917
## [166] 1918 1922 1923 1930 1931 1951 1952

You can see that grep() returns every row number in the data frame where a habitat type contains the string “forest”. If we add another argument to grep() that says value = TRUE, we can see the values where the function has found a match (in this case, the actual habitat types).

# Assign the list of values to a variable so we don't see them all at once (it's a long list)
hab <- grep("forest", veg_dat$habitat, ignore.case = TRUE, value = TRUE)
# View some of the values
head(hab)

## [1] "Pine-Hardwood_forest_stands" "Hardwood_forest_stands"
## [3] "Hardwood_forest_stands" "Hardwood_forest_stands"
## [5] "Pine-Hardwood_forest_stands" "Hardwood_forest_stands"

Great! grep() seems pretty useful. So then how does grepl() differ from grep()? The input arguments are the same, but the function gives us a different output. Let’s see:

# Assign the list of values to a variable so we don't have to see all of them (it's a long list)
hab_log <- grepl("forest", veg_dat$habitat, ignore.case = TRUE)
# View the grepl output
head(hab_log, 60)

## [1] TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE

The extra “l” in grepl() stands for “logical”, which is the data type that it returns. grepl() returns TRUE when there is a match and FALSE when there isn’t one, with one value for every element of the vector we searched (here, every row of the habitat column).
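It can be easier to see how the three kinds of output relate to each other on a vector that is short enough to read in full. Below is a small sketch using five habitat names copied from the list above; hab_small, idx, vals, and lgl are just names I made up for this example.

```r
# A small stand-in for the habitat column
hab_small <- c("Pine-Hardwood_forest_stands", "Foredune_grassland",
               "Pine-hardwood_forest_stands", "Beach", "Pine_forest_stands")

idx <- grep("forest", hab_small)                 # positions of the matches
vals <- grep("forest", hab_small, value = TRUE)  # the matching strings themselves
lgl <- grepl("forest", hab_small)                # one TRUE/FALSE per element

idx
vals
lgl

# All three outputs describe the same matches
identical(which(lgl), idx)
identical(hab_small[lgl], vals)

# Case sensitivity: by default, only the capitalized "Hardwood" matches
grep("Hardwood", hab_small)
grep("Hardwood", hab_small, ignore.case = TRUE)
```

Because grepl() always returns a value for every element, it slots neatly into anything that expects a logical test, while grep() is handy when you want the positions or the matching values directly.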

Now that we know what grep() and grepl() return, we can subset our data frame using their outputs.

Here, I used grep() to subset the data to return all rows where a species is found in forested habitat.

# Use the grep output to subset the data frame
forest_species <- veg_dat[grep("forest", veg_dat$habitat, ignore.case = TRUE), ]
# View the new data set
head(forest_species)

## genus species island habitat relabund
## 1 Acer rubrum Smith Pine-Hardwood_forest_stands 6
## 2 Acer rubrum Parramore Hardwood_forest_stands 6
## 4 Achillea millefolium Smith Hardwood_forest_stands 4
## 7 Achillea millefolium Parramore Hardwood_forest_stands 5
## 34 Amelanchier obovalis Smith Pine-Hardwood_forest_stands 5
## 35 Amelanchier obovalis Smith Hardwood_forest_stands 5

# How many rows?
nrow(forest_species)

## [1] 172

Now I’ll demonstrate the same thing, but this time I’ll use grepl() to subset the data, combined with the filter() function in the dplyr package. The filter() function accepts the name of the data frame you want to analyze, then a logical test. The function will return the rows that are TRUE.

These two methods return the same rows, but some people prefer the filter() method because it follows the dplyr approach to tidy scripts and can easily be combined with other functions such as select() and mutate() (see our post on the pipe operator to learn more). One small difference is that base subsetting keeps the original row numbers, while filter() renumbers the rows starting from 1.

# Use the grepl output to subset the data frame
forest_species2 <- dplyr::filter(veg_dat, grepl("forest", habitat, ignore.case = TRUE))
# View the new data set
head(forest_species2)

## genus species island habitat relabund
## 1 Acer rubrum Smith Pine-Hardwood_forest_stands 6
## 2 Acer rubrum Parramore Hardwood_forest_stands 6
## 3 Achillea millefolium Smith Hardwood_forest_stands 4
## 4 Achillea millefolium Parramore Hardwood_forest_stands 5
## 5 Amelanchier obovalis Smith Pine-Hardwood_forest_stands 5
## 6 Amelanchier obovalis Smith Hardwood_forest_stands 5

# How many rows?
nrow(forest_species2)

## [1] 172

Now that I have a data frame containing only species found in forested habitat, I can do whatever type of data manipulation I want. For example, I could group by island and habitat and use the summarize() function to see how many species are found within each habitat type on each island (to learn more about the group_by() and summarize() functions, check out our tutorial here).

I used the function n() to summarize the data, which just counts the number of rows in each group.

# A dplyr workflow, starting by filtering with grepl(), grouping the data, then summarizing it
forest_summary <- dplyr::filter(veg_dat, grepl("forest", habitat, ignore.case = TRUE)) %>%
  group_by(island, habitat) %>%
  summarize(obs = n())

## `summarise()` has grouped output by 'island'. You can override using the `.groups` argument.

# View summary table
print(forest_summary)

## # A tibble: 8 × 3
## # Groups: island [3]
## island habitat obs
## <chr> <chr> <int>
## 1 Parramore Hardwood_forest_stands 21
## 2 Parramore Pine_forest_stands 27
## 3 Parramore Pine-Hardwood_forest_stands 31
## 4 Revel Pine-Hardwood_forest_stands 7
## 5 Smith Hardwood_forest_stands 44
## 6 Smith Pine_forest_stands 1
## 7 Smith Pine-hardwood_forest_stands 1
## 8 Smith Pine-Hardwood_forest_stands 40

Cool! This is useful information to know, and it’s all thanks to grepl() that we were able to perform this operation so easily.

Find and replace using `sub()`

You may have noticed that the last two rows of the table show that Smith Island has 1 species in “Pine-hardwood_forest_stands”, and 40 species in “Pine-Hardwood_forest_stands”. This is a typo that we need to fix — those two habitat types should be the same.

No worries, we can use the sub() function to replace all instances of “Pine-hardwood_forest_stands” with “Pine-Hardwood_forest_stands”. The function works like this: sub(pattern_text, replacement_text, vector). Note that sub() only replaces the first match within each string, which is all we need here because “hardwood” appears at most once in each habitat name. We’re also going to spell out ignore.case = FALSE (which is the default) because in this case, we care about the lowercase versus uppercase “H”.

# Replace lowercase "hardwood" with "Hardwood" in every habitat name
veg_dat$habitat <- sub("hardwood", "Hardwood", veg_dat$habitat, ignore.case = FALSE)
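sub() has a sibling called gsub(), and the difference only shows up when a string contains the pattern more than once. Here is a quick sketch; the second string is made up purely to show a repeated match and does not come from the data set.

```r
# The second string is a made-up example with two matches
x <- c("Pine-hardwood_forest_stands", "hardwood_near_hardwood")

# sub() replaces only the first match in each string
first_only <- sub("hardwood", "Hardwood", x)
first_only

# gsub() replaces every match in each string
every_match <- gsub("hardwood", "Hardwood", x)
every_match
```

For the habitat column either function gives the same result, so the choice only matters for strings where the pattern can repeat.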

Now if we summarize the data like we did above, all the pine-hardwood forests should be aggregated under the type “Pine-Hardwood_forest_stands”.

# A dplyr workflow, starting by filtering with grepl(), grouping the data, then summarizing it
forest_summary <- dplyr::filter(veg_dat, grepl("forest", habitat, ignore.case = TRUE)) %>%
  group_by(island, habitat) %>%
  summarize(obs = n())

## `summarise()` has grouped output by 'island'. You can override using the `.groups` argument.

# View summary table
forest_summary

## # A tibble: 7 × 3
## # Groups: island [3]
## island habitat obs
## <chr> <chr> <int>
## 1 Parramore Hardwood_forest_stands 21
## 2 Parramore Pine_forest_stands 27
## 3 Parramore Pine-Hardwood_forest_stands 31
## 4 Revel Pine-Hardwood_forest_stands 7
## 5 Smith Hardwood_forest_stands 44
## 6 Smith Pine_forest_stands 1
## 7 Smith Pine-Hardwood_forest_stands 41

Great! It looks like that fixed the issue.

This was just one example of all the things you could do with grep() and related functions. They’re extremely useful for organizing data and searching for the data you want.
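The patterns you pass to grep() are regular expressions, so a few special characters can make a search much more specific. Below is a minimal sketch of four common ones, using habitat names taken from the list earlier in this post (hab_small is just my name for the example vector).

```r
hab_small <- c("Pine_forest_stands", "Pine-Hardwood_forest_stands",
               "Hardwood_forest_stands", "Dense_grasslands",
               "Foredune_grassland", "Upper_salt_marsh")

# "^" anchors the match to the start of the string
starts_pine <- grep("^Pine", hab_small, value = TRUE)

# "$" anchors the match to the end of the string
ends_grassland <- grep("grassland$", hab_small, value = TRUE)

# "|" means "or"
forest_or_marsh <- grep("forest|marsh", hab_small, value = TRUE)

# "?" makes the character before it optional
any_grassland <- grep("grasslands?$", hab_small, value = TRUE)

starts_pine
ends_grassland
forest_or_marsh
any_grassland
```

If you ever need to search for one of these special characters literally, adding fixed = TRUE tells grep() to treat the pattern as plain text.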

For further reading on strings and how to make your search queries with grep() more specific, learn more about regex (regular expressions) here:

I hope you found this tutorial helpful! Happy coding!

Data set citation:

McCaffrey, C. and R. Dueser. 2018. Vegetation Survey on the Virginia Barrier Islands - Species by habitat, 1974 ver 3. Environmental Data Initiative. https://doi.org/10.6073/pasta/9c276fb0ce844030c4afae81ff2cadfb (Accessed 2022-02-25).

If you enjoyed this tutorial and want to learn more about searching and filtering your data, you can check out Luka Negoita's full course on the complete basics of R for ecology here:

Start learning R now


Learning about data structures in R

Wed, 23 Mar 2022 09:09:55 -0400

Last week, we posted a tutorial on the different types of data in R (check it out here). In this tutorial, we’re going to talk about the different structures that R provides to help you organize your data.

Data structures go hand-in-hand with data types, as both of these form the foundation for the work we do in R. You may have already worked with many of the structures that we describe in this blog post, but I wanted to take the time to describe them in depth and show you how they relate to or are different from one another.

Let’s jump in!

The different data structures

R provides several data structures that we commonly use as ecologists:

1. Vectors
2. Lists
3. Matrices
4. Data frames
4a. Tibbles

Vectors

Vectors are one of the most common data structures. You can create a vector using the function c(). c() combines all of its arguments into a vector like so:

# Create a vector
vec <- c("this", "is", "a", "vector")
# View the vector
vec

## [1] "this" "is" "a" "vector"

You can create a vector using any data type (numeric, character, logical, etc). However, if you combine data types in a vector, R will force all elements to be the same type. The type that R chooses for the vector will be the most “flexible” data type. Data types in order from least to greatest flexibility are: logical, integer, numeric, and character. For example, in the vector below, I combined numbers and characters into one vector.

# Create a vector
ex <- c(1, "species", 10)
# Check the data type of the vector
class(ex)

## [1] "character"

When we check the data type of the vector, it says character because we can change 1 and 10 to be “1” and “10”, but we can’t change “species” into a number. What number would “species” represent?? So here, R has chosen the more flexible data type — characters.
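You can watch R climb this “flexibility ladder” one step at a time. The sketch below mixes two types at a time and checks the class of the result; the values themselves are arbitrary.

```r
# Mix two data types at a time and see which one wins
class(c(TRUE, 2L))         # logical + integer
class(c(2L, 3.5))          # integer + numeric
class(c(3.5, "species"))   # numeric + character

# Going back down the ladder only works when the text looks like a number
as.numeric(c("1", "10"))

# "species" cannot be turned into a number, so R returns NA (with a warning,
# which is silenced here)
suppressWarnings(as.numeric("species"))
```

This is also why a single stray text entry in a column of numbers (a typo like "1O" instead of "10", for example) can quietly turn the whole column into characters when you import data.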

You can also examine certain attributes of the vector such as length() (i.e., number of elements) or, if you have a character vector, number of characters in each element (nchar()).

# View vector
ex

## [1] "1" "species" "10"

# Length of vector
length(ex)

## [1] 3

# Number of characters
nchar(ex)

## [1] 1 7 2

Vector elements can also be given names. You do this by assigning a character vector to names(my.vector).

# Create a vector
crabs <- c(10, 15, 26)
# Give the vector names
names(crabs) <- c("Blue crab", "Mud crab", "Ghost crab")
# View named vector
crabs

## Blue crab Mud crab Ghost crab
## 10 15 26

You can subset a vector by specifying the element number in square brackets. You could also subset a vector using the element name.

# Choose element number 3
crabs[3]

## Ghost crab
## 26

# Choose element named "Ghost crab"
crabs["Ghost crab"]

## Ghost crab
## 26

Lastly, you can view the structure of a vector using the str() function. This will tell us that the vector is a numeric vector with 3 elements: 10, 15, and 26. Below the vector, it also says that the attribute names for the vector is a character vector with the elements “Blue crab”, “Mud crab”, and “Ghost crab”.

str(crabs)

## Named num [1:3] 10 15 26
## - attr(*, "names")= chr [1:3] "Blue crab" "Mud crab" "Ghost crab"

Lists

Lists are similar to vectors, but are unique in that their elements do not all have to be the same type. List elements can also be vectors or even other lists, which lets you nest data structures inside one another.

To create a list, you use list() instead of c().

# Create a list
animals <- list(c("Eastern elliptio", "Diamondback terrapin", "Spring peeper", "American eel"),
                c(25, 3, 0, 10),
                "Maryland",
                c(T, T, F, T))
# View the structure of the list
str(animals)

## List of 4
## $ : chr [1:4] "Eastern elliptio" "Diamondback terrapin" "Spring peeper" "American eel"
## $ : num [1:4] 25 3 0 10
## $ : chr "Maryland"
## $ : logi [1:4] TRUE TRUE FALSE TRUE

Here, my list contains a vector of animal names (character), a vector of numbers (numeric), the U.S. state that these animals can be found in (character), and a logical vector. The vectors don’t all need to be the same length: the third element has only one value, “Maryland”, while all the other elements have a length of 4.

If we view the list, you’ll notice that each element is identified within double square brackets [[these]].

# View list
animals

## [[1]]
## [1] "Eastern elliptio" "Diamondback terrapin" "Spring peeper"
## [4] "American eel"
##
## [[2]]
## [1] 25 3 0 10
##
## [[3]]
## [1] "Maryland"
##
## [[4]]
## [1] TRUE TRUE FALSE TRUE

You can subset elements of a list using double square brackets, and further subset that list element using single square brackets.

# View animal names (element 1 in the list)
animals[[1]]

## [1] "Eastern elliptio" "Diamondback terrapin" "Spring peeper"
## [4] "American eel"

# View the second animal name (element 2 of element 1 in the list)
animals[[1]][2]

## [1] "Diamondback terrapin"
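A common point of confusion is what happens if you use single square brackets on a list. Here is a short sketch with a cut-down version of the list (animals_small is my own name, so it won’t overwrite the animals list from above):

```r
# A cut-down version of the list with two elements
animals_small <- list(c("Eastern elliptio", "Diamondback terrapin"),
                      c(25, 3))

# Single brackets return a smaller list that still wraps the element
class(animals_small[1])
length(animals_small[1])

# Double brackets pull out the element itself (here, a character vector)
class(animals_small[[1]])
length(animals_small[[1]])
```

So if a function complains that it was given a list when it expected a vector, a missing pair of brackets is a good first thing to check.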

As with vectors, you can give list elements names. Let’s create the same list that we did above, but give it some more descriptive names by writing name.of.element = element within the list() function. In the code below, I named the list elements “common.name”, “abundance”, “state”, and “presence”.

# Create a list
animals <- list(common.name = c("Eastern elliptio", "Diamondback terrapin",
                                "Spring peeper", "American eel"),
                abundance = c(25, 3, 0, 10),
                state = "Maryland",
                presence = c(T, T, F, T))
# View list
animals

## $common.name
## [1] "Eastern elliptio" "Diamondback terrapin" "Spring peeper"
## [4] "American eel"
##
## $abundance
## [1] 25 3 0 10
##
## $state
## [1] "Maryland"
##
## $presence
## [1] TRUE TRUE FALSE TRUE

Now, instead of numbers inside of double square brackets, each element is identified by $name. You can still subset the list using the element number in double square brackets, like this: animals[[1]], but you can also subset the list using this dollar sign notation:

# View whether the animals were present in our survey
animals$presence

## [1] TRUE TRUE FALSE TRUE

Lists are really useful for storing lots of data, but it can get confusing if you have several lists nested in other lists. Naming your elements can help you keep things straight when subsetting your data.

Matrices

The next data structure I want to introduce is the matrix. Matrices are two-dimensional, rectangular objects that must contain elements of the same type, like a vector. These are most useful for mathematical operations, but are also common for species abundance data, where the rows are sites and the columns are species (or the other way around). Each cell then holds the abundance of one species at one site, which is the format that many multivariate analyses expect.

You can create matrices using matrix(data = your.data, nrow = num.rows, ncol = num.cols, byrow = T/F, dimnames = your.names).

data accepts a vector of the data you want to use. nrow is the number of rows you want in your matrix, while ncol is the number of columns you want. The byrow argument can be set to TRUE or FALSE depending on whether you want the matrix to fill your table by rows or by columns, though the default is FALSE. dimnames accepts a list of 2 elements that specifies names for the rows and columns of your matrix.

The byrow argument is best understood through demonstration:

# Create a matrix that is filled by rows
m1 <- matrix(data = 1:12, nrow = 4, ncol = 3, byrow = T)
m1

## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12

# Create a matrix that is filled by columns
m2 <- matrix(data = 1:12, nrow = 4, ncol = 3, byrow = F)
m2

## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12

You can see that the first bit of code fills in the table row by row — it fills it in from left to right, then moves down. The second chunk of code fills in the table by columns — it fills it in from top to bottom, then moves to the right.

You can access matrix elements using single square brackets where the first number represents the row, while the second represents the column. So m1[2,3] would access the element in the 2nd row and 3rd column. You could also type m1[2, ], leaving the column space blank. This will return the entire 2nd row of the matrix. Conversely, you could type m1[ , 3], which leaves the row space blank and returns the entire 3rd column of the matrix. Let’s see these in action.

# Return element in 2nd row, 3rd column
m1[2,3]

## [1] 6

# Return 2nd row
m1[2, ]

## [1] 4 5 6

# Return 3rd column
m1[ , 3]

## [1] 3 6 9 12
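The dimnames argument of matrix() is easiest to understand with a small species-by-site example. The sketch below uses made-up counts, and the species and site names are only there for illustration.

```r
# Hypothetical counts for two species (rows) at three sites (columns)
abund <- matrix(data = c(2, 0, 5,
                         10, 4, 0),
                nrow = 2, ncol = 3, byrow = TRUE,
                dimnames = list(c("Blue crab", "Mud crab"),         # row names
                                c("Site A", "Site B", "Site C")))   # column names
abund

# Names can be used for subsetting in place of row and column numbers
abund["Mud crab", "Site B"]

# Total count of each species, and total count at each site
rowSums(abund)
colSums(abund)
```

Naming the rows and columns makes subsetting code much easier to read than remembering which number belongs to which species or site.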

We can also look at the number of rows and columns of a matrix by using nrow() and ncol(); these functions are analogous to the length() function that we used for vectors. Alternatively, we can use dim(), which will tell us both the number of rows and columns.

# Number of rows
nrow(m1)

## [1] 4

# Number of columns
ncol(m1)

## [1] 3

# View matrix dimensions
dim(m1)

## [1] 4 3

Data frames

Data frames are the most common way to store and display tabular data in R and are the standard format for applying any analyses to your data. Like matrices, these are two-dimensional objects with rows and columns. But data frames are also like lists, in that you can have elements of several types within them. In fact, a data frame is a type of list where each list element has the same length (this is what makes them rectangular / tabular).

You have likely encountered data frames before, for example when importing data into R using functions such as read.csv().

You can create a data frame using the function data.frame(col1 = vector1, col2 = vector2, etc.), where each vector should be the same length. You could also have a vector of length 1 or a length that is a divisor of the other vector lengths — this shorter vector will then get recycled until it reaches the length of the other columns.

In the code below, I created a data frame of species, whether or not they were present, and their abundance. Each column consists of different data types. The 1st column is a character vector, the 2nd is logical, and the 3rd is numeric. This is really useful and allows us to store much more information than in a matrix.

# Create a data frame
species_dat <- data.frame(species = c("Callinectes sapidus",
                                      "Sciaenops ocellatus",
                                      "Anchoa mitchilli",
                                      "Micropognias undulatus",
                                      "Menidia menidia"),
                          presence = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                          abundance = c(2, 0, 10, 0, 9))
# View data frame
species_dat

##                  species presence abundance
## 1    Callinectes sapidus     TRUE         2
## 2    Sciaenops ocellatus    FALSE         0
## 3       Anchoa mitchilli     TRUE        10
## 4 Micropognias undulatus    FALSE         0
## 5        Menidia menidia     TRUE         9
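The recycling rule mentioned earlier is easier to see with a small example. This is only a sketch: the column names and values (site, gear, and so on) are made up, and the error text in the last step is my own wording rather than R’s.

```r
# id has 4 values, so the shorter vectors are recycled to length 4
recycled <- data.frame(id = 1:4,
                       site = "marsh",               # length 1: repeated 4 times
                       gear = c("seine", "trawl"))   # length 2: repeated twice
recycled

# A length that does not divide evenly (3 into 4) is an error instead
result <- tryCatch(data.frame(id = 1:4, gear = c("a", "b", "c")),
                   error = function(e) "error: column lengths are not compatible")
result
```

Recycling is convenient for filling in a constant column, but it can also hide mistakes, so it’s worth checking your vector lengths if a data frame looks longer than you expected.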

You also have the option to add an argument row.names = c("vector", "of", "names", "for", "rows"), though adding row.names is less common for data frames.

As with matrices, you can view number of rows and columns using nrow(my.dataframe) or ncol(my.dataframe), or use dim(my.dataframe) to view the full dimensions.

And like matrices, you can subset your data frame into its rows or columns using single square brackets: my.dataframe[row.num, col.num].

# View the third item in the first column
species_dat[3, 1]

## [1] "Anchoa mitchilli"

# View the first column
species_dat[ , 1]

## [1] "Callinectes sapidus" "Sciaenops ocellatus" "Anchoa mitchilli"
## [4] "Micropognias undulatus" "Menidia menidia"

# View the third row
species_dat[3, ]

## species presence abundance
## 3 Anchoa mitchilli TRUE 10

Alternatively, you can subset your data frame in the same way as lists, by using the dollar sign symbol or double square brackets. Each column is essentially a list element, so you can easily choose a data frame column using my.dataframe$col.name.

# View the abundance column in three different ways
species_dat$abundance

## [1] 2 0 10 0 9

species_dat[[3]]

## [1] 2 0 10 0 9

species_dat[["abundance"]]

## [1] 2 0 10 0 9

The function str() is also useful. It shows you the structure of your data frame. This will tell you the number of rows and columns in your data frame and will tell you the data types of each column.

# View structure
str(species_dat)

## 'data.frame': 5 obs. of 3 variables:
## $ species : chr "Callinectes sapidus" "Sciaenops ocellatus" "Anchoa mitchilli" "Micropognias undulatus" ...
## $ presence : logi TRUE FALSE TRUE FALSE TRUE
## $ abundance: num 2 0 10 0 9

These are a few functions that are very useful for getting to know your data frames.

head() or tail() to view the first 6 or last 6 rows of your data frame
dim(), nrow(), or ncol() to view the number of rows or columns (or both!) of your data frame
rownames() or colnames() to view or set the row or column names of your data frame. Note that just names() will also give you the column names of a data frame.
str() to view the structure of your data frame

As you can see, data frames are very useful for organizing complex, multi-attribute data sets that contain data of different types. No wonder we use them so often!

Tibbles

I added in tibbles as a side data structure — even though it isn’t an official data structure in R, it’s something that comes up often if you use the tidyverse set of packages. Tibbles come with the tibble package (which comes with the tidyverse) and are basically data frames with a few added benefits!

Functionally, tibbles are the same as data frames when you manipulate them. They can do everything that data frames can do, but they have slightly different properties that make them more convenient. In fact, ‘tibble’ stands for ‘tidy table’ :) Let’s find out what makes tibbles different.

First, let’s load up the tidyverse set of packages.

library(tidyverse)

To create a tibble, all you have to do is use the function tibble(), which works the same way as the function data.frame(). When you’re creating a tibble, you can only use vectors that are either all the same length, or have length of 1. The vector with a length of 1 will just be recycled until it fills all of the rows in its column. Tibbles also don’t use row.names(), which keeps things simpler.

Let’s create the same species table that we did earlier, but this time as a tibble.

# Create a tibble
species_dat <- tibble(species = c("Callinectes sapidus",
                                  "Sciaenops ocellatus",
                                  "Anchoa mitchilli",
                                  "Micropognias undulatus",
                                  "Menidia menidia"),
                      presence = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                      abundance = c(2, 0, 10, 0, 9))
# View the tibble
species_dat

## # A tibble: 5 × 3
## species presence abundance
## <chr> <lgl> <dbl>
## 1 Callinectes sapidus TRUE 2
## 2 Sciaenops ocellatus FALSE 0
## 3 Anchoa mitchilli TRUE 10
## 4 Micropognias undulatus FALSE 0
## 5 Menidia menidia TRUE 9

When we print the tibble, it clearly tells us that it’s a tibble. It also tells us the table dimensions and the column names and data types.

You might be thinking: okay…and? The tibble doesn’t look that different from the data frame we originally created.

Let’s try another example.

This time, let’s load up an example data set that comes with the ggplot2 package. This data set is called msleep, and describes the sleep times and brain weights of several different types of mammals. This data set already comes as a tibble, so let’s turn it into a data frame for the purposes of demonstration, using the as.data.frame() function.

# Load data
data("msleep")
# Turn data into class data frame
msleep <- as.data.frame(msleep)
# View data
msleep

## name genus vore
## 1 Cheetah Acinonyx carni
## 2 Owl monkey Aotus omni
## 3 Mountain beaver Aplodontia herbi
## 4 Greater short-tailed shrew Blarina omni
## 5 Cow Bos herbi
## 6 Three-toed sloth Bradypus herbi
## 7 Northern fur seal Callorhinus carni
## 8 Vesper mouse Calomys <NA>
## 9 Dog Canis carni
## 10 Roe deer Capreolus herbi
## 11 Goat Capri herbi
## 12 Guinea pig Cavis herbi
## 13 Grivet Cercopithecus omni
## 14 Chinchilla Chinchilla herbi
## 15 Star-nosed mole Condylura omni
## 16 African giant pouched rat Cricetomys omni
## 17 Lesser short-tailed shrew Cryptotis omni
## 18 Long-nosed armadillo Dasypus carni
## 19 Tree hyrax Dendrohyrax herbi
## 20 North American Opossum Didelphis omni
## 21 Asian elephant Elephas herbi
## 22 Big brown bat Eptesicus insecti
## 23 Horse Equus herbi
## 24 Donkey Equus herbi
## 25 European hedgehog Erinaceus omni
## 26 Patas monkey Erythrocebus omni
## 27 Western american chipmunk Eutamias herbi
## 28 Domestic cat Felis carni
## 29 Galago Galago omni
## 30 Giraffe Giraffa herbi
## 31 Pilot whale Globicephalus carni
## 32 Gray seal Haliochoerus carni
## 33 Gray hyrax Heterohyrax herbi
## 34 Human Homo omni
## 35 Mongoose lemur Lemur herbi
## 36 African elephant Loxodonta herbi
## 37 Thick-tailed opposum Lutreolina carni
## 38 Macaque Macaca omni
## 39 Mongolian gerbil Meriones herbi
## 40 Golden hamster Mesocricetus herbi
## 41 Vole Microtus herbi
## 42 House mouse Mus herbi
## 43 Little brown bat Myotis insecti
## 44 Round-tailed muskrat Neofiber herbi
## 45 Slow loris Nyctibeus carni
## 46 Degu Octodon herbi
## 47 Northern grasshopper mouse Onychomys carni
## 48 Rabbit Oryctolagus herbi
## 49 Sheep Ovis herbi
## 50 Chimpanzee Pan omni
## 51 Tiger Panthera carni
## 52 Jaguar Panthera carni
## 53 Lion Panthera carni
## 54 Baboon Papio omni
## 55 Desert hedgehog Paraechinus <NA>
## 56 Potto Perodicticus omni
## 57 Deer mouse Peromyscus <NA>
## 58 Phalanger Phalanger <NA>
## 59 Caspian seal Phoca carni
## 60 Common porpoise Phocoena carni
## 61 Potoroo Potorous herbi
## 62 Giant armadillo Priodontes insecti
## 63 Rock hyrax Procavia <NA>
## 64 Laboratory rat Rattus herbi
## 65 African striped mouse Rhabdomys omni
## 66 Squirrel monkey Saimiri omni
## 67 Eastern american mole Scalopus insecti
## 68 Cotton rat Sigmodon herbi
## 69 Mole rat Spalax <NA>
## 70 Arctic ground squirrel Spermophilus herbi
## 71 Thirteen-lined ground squirrel Spermophilus herbi
## 72 Golden-mantled ground squirrel Spermophilus herbi
## 73 Musk shrew Suncus <NA>
## 74 Pig Sus omni
## 75 Short-nosed echidna Tachyglossus insecti
## 76 Eastern american chipmunk Tamias herbi
## 77 Brazilian tapir Tapirus herbi
## 78 Tenrec Tenrec omni
## 79 Tree shrew Tupaia omni
## 80 Bottle-nosed dolphin Tursiops carni
## 81 Genet Genetta carni
## 82 Arctic fox Vulpes carni
## 83 Red fox Vulpes carni
## order conservation sleep_total sleep_rem
## 1 Carnivora lc 12.1 NA
## 2 Primates <NA> 17.0 1.8
## 3 Rodentia nt 14.4 2.4
## 4 Soricomorpha lc 14.9 2.3
## 5 Artiodactyla domesticated 4.0 0.7
## 6 Pilosa <NA> 14.4 2.2
## 7 Carnivora vu 8.7 1.4
## 8 Rodentia <NA> 7.0 NA
## 9 Carnivora domesticated 10.1 2.9
## 10 Artiodactyla lc 3.0 NA
## 11 Artiodactyla lc 5.3 0.6
## 12 Rodentia domesticated 9.4 0.8
## 13 Primates lc 10.0 0.7
## 14 Rodentia domesticated 12.5 1.5
## 15 Soricomorpha lc 10.3 2.2
## 16 Rodentia <NA> 8.3 2.0
## 17 Soricomorpha lc 9.1 1.4
## 18 Cingulata lc 17.4 3.1
## 19 Hyracoidea lc 5.3 0.5
## 20 Didelphimorphia lc 18.0 4.9
## 21 Proboscidea en 3.9 NA
## 22 Chiroptera lc 19.7 3.9
## 23 Perissodactyla domesticated 2.9 0.6
## 24 Perissodactyla domesticated 3.1 0.4
## 25 Erinaceomorpha lc 10.1 3.5
## 26 Primates lc 10.9 1.1
## 27 Rodentia <NA> 14.9 NA
## 28 Carnivora domesticated 12.5 3.2
## 29 Primates <NA> 9.8 1.1
## 30 Artiodactyla cd 1.9 0.4
## 31 Cetacea cd 2.7 0.1
## 32 Carnivora lc 6.2 1.5
## 33 Hyracoidea lc 6.3 0.6
## 34 Primates <NA> 8.0 1.9
## 35 Primates vu 9.5 0.9
## 36 Proboscidea vu 3.3 NA
## 37 Didelphimorphia lc 19.4 6.6
## 38 Primates <NA> 10.1 1.2
## 39 Rodentia lc 14.2 1.9
## 40 Rodentia en 14.3 3.1
## 41 Rodentia <NA> 12.8 NA
## 42 Rodentia nt 12.5 1.4
## 43 Chiroptera <NA> 19.9 2.0
## 44 Rodentia nt 14.6 NA
## 45 Primates <NA> 11.0 NA
## 46 Rodentia lc 7.7 0.9
## 47 Rodentia lc 14.5 NA
## 48 Lagomorpha domesticated 8.4 0.9
## 49 Artiodactyla domesticated 3.8 0.6
## 50 Primates <NA> 9.7 1.4
## 51 Carnivora en 15.8 NA
## 52 Carnivora nt 10.4 NA
## 53 Carnivora vu 13.5 NA
## 54 Primates <NA> 9.4 1.0
## 55 Erinaceomorpha lc 10.3 2.7
## 56 Primates lc 11.0 NA
## 57 Rodentia <NA> 11.5 NA
## 58 Diprotodontia <NA> 13.7 1.8
## 59 Carnivora vu 3.5 0.4
## 60 Cetacea vu 5.6 NA
## 61 Diprotodontia <NA> 11.1 1.5
## 62 Cingulata en 18.1 6.1
## 63 Hyracoidea lc 5.4 0.5
## 64 Rodentia lc 13.0 2.4
## 65 Rodentia <NA> 8.7 NA
## 66 Primates <NA> 9.6 1.4
## 67 Soricomorpha lc 8.4 2.1
## 68 Rodentia <NA> 11.3 1.1
## 69 Rodentia <NA> 10.6 2.4
## 70 Rodentia lc 16.6 NA
## 71 Rodentia lc 13.8 3.4
## 72 Rodentia lc 15.9 3.0
## 73 Soricomorpha <NA> 12.8 2.0
## 74 Artiodactyla domesticated 9.1 2.4
## 75 Monotremata <NA> 8.6 NA
## 76 Rodentia <NA> 15.8 NA
## 77 Perissodactyla vu 4.4 1.0
## 78 Afrosoricida <NA> 15.6 2.3
## 79 Scandentia <NA> 8.9 2.6
## 80 Cetacea <NA> 5.2 NA
## 81 Carnivora <NA> 6.3 1.3
## 82 Carnivora <NA> 12.5 NA
## 83 Carnivora <NA> 9.8 2.4
## sleep_cycle awake brainwt bodywt
## 1 NA 11.90 NA 50.000
## 2 NA 7.00 0.01550 0.480
## 3 NA 9.60 NA 1.350
## 4 0.1333333 9.10 0.00029 0.019
## 5 0.6666667 20.00 0.42300 600.000
## 6 0.7666667 9.60 NA 3.850
## 7 0.3833333 15.30 NA 20.490
## 8 NA 17.00 NA 0.045
## 9 0.3333333 13.90 0.07000 14.000
## 10 NA 21.00 0.09820 14.800
## 11 NA 18.70 0.11500 33.500
## 12 0.2166667 14.60 0.00550 0.728
## 13 NA 14.00 NA 4.750
## 14 0.1166667 11.50 0.00640 0.420
## 15 NA 13.70 0.00100 0.060
## 16 NA 15.70 0.00660 1.000
## 17 0.1500000 14.90 0.00014 0.005
## 18 0.3833333 6.60 0.01080 3.500
## 19 NA 18.70 0.01230 2.950
## 20 0.3333333 6.00 0.00630 1.700
## 21 NA 20.10 4.60300 2547.000
## 22 0.1166667 4.30 0.00030 0.023
## 23 1.0000000 21.10 0.65500 521.000
## 24 NA 20.90 0.41900 187.000
## 25 0.2833333 13.90 0.00350 0.770
## 26 NA 13.10 0.11500 10.000
## 27 NA 9.10 NA 0.071
## 28 0.4166667 11.50 0.02560 3.300
## 29 0.5500000 14.20 0.00500 0.200
## 30 NA 22.10 NA 899.995
## 31 NA 21.35 NA 800.000
## 32 NA 17.80 0.32500 85.000
## 33 NA 17.70 0.01227 2.625
## 34 1.5000000 16.00 1.32000 62.000
## 35 NA 14.50 NA 1.670
## 36 NA 20.70 5.71200 6654.000
## 37 NA 4.60 NA 0.370
## 38 0.7500000 13.90 0.17900 6.800
## 39 NA 9.80 NA 0.053
## 40 0.2000000 9.70 0.00100 0.120
## 41 NA 11.20 NA 0.035
## 42 0.1833333 11.50 0.00040 0.022
## 43 0.2000000 4.10 0.00025 0.010
## 44 NA 9.40 NA 0.266
## 45 NA 13.00 0.01250 1.400
## 46 NA 16.30 NA 0.210
## 47 NA 9.50 NA 0.028
## 48 0.4166667 15.60 0.01210 2.500
## 49 NA 20.20 0.17500 55.500
## 50 1.4166667 14.30 0.44000 52.200
## 51 NA 8.20 NA 162.564
## 52 NA 13.60 0.15700 100.000
## 53 NA 10.50 NA 161.499
## 54 0.6666667 14.60 0.18000 25.235
## 55 NA 13.70 0.00240 0.550
## 56 NA 13.00 NA 1.100
## 57 NA 12.50 NA 0.021
## 58 NA 10.30 0.01140 1.620
## 59 NA 20.50 NA 86.000
## 60 NA 18.45 NA 53.180
## 61 NA 12.90 NA 1.100
## 62 NA 5.90 0.08100 60.000
## 63 NA 18.60 0.02100 3.600
## 64 0.1833333 11.00 0.00190 0.320
## 65 NA 15.30 NA 0.044
## 66 NA 14.40 0.02000 0.743
## 67 0.1666667 15.60 0.00120 0.075
## 68 0.1500000 12.70 0.00118 0.148
## 69 NA 13.40 0.00300 0.122
## 70 NA 7.40 0.00570 0.920
## 71 0.2166667 10.20 0.00400 0.101
## 72 NA 8.10 NA 0.205
## 73 0.1833333 11.20 0.00033 0.048
## 74 0.5000000 14.90 0.18000 86.250
## 75 NA 15.40 0.02500 4.500
## 76 NA 8.20 NA 0.112
## 77 0.9000000 19.60 0.16900 207.501
## 78 NA 8.40 0.00260 0.900
## 79 0.2333333 15.10 0.00250 0.104
## 80 NA 18.80 NA 173.330
## 81 NA 17.70 0.01750 2.000
## 82 NA 11.50 0.04450 3.380
## 83 0.3500000 14.20 0.05040 4.230

Okay, wow. When we print the data frame it’s pretty overwhelming. Printing the data frame shows us all of our rows and columns. And because our columns don’t all fit on one row, they have to be carried over and added as extra rows, making the printed output even longer. This is a very messy and confusing way to view our data.

Let’s turn the data back into a tibble using the as_tibble() function, and let’s see what that looks like.

# Turn data into a tibble
msleep <- as_tibble(msleep)
# View data
msleep

## # A tibble: 83 × 11
## name genus vore order conservation sleep_total
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Cheetah Acinon… carni Carni… lc 12.1
## 2 Owl monkey Aotus omni Prima… <NA> 17
## 3 Mountain b… Aplodo… herbi Roden… nt 14.4
## 4 Greater sh… Blarina omni Soric… lc 14.9
## 5 Cow Bos herbi Artio… domesticated 4
## 6 Three-toed… Bradyp… herbi Pilosa <NA> 14.4
## 7 Northern f… Callor… carni Carni… vu 8.7
## 8 Vesper mou… Calomys <NA> Roden… <NA> 7
## 9 Dog Canis carni Carni… domesticated 10.1
## 10 Roe deer Capreo… herbi Artio… lc 3
## # … with 73 more rows, and 5 more variables:
## # sleep_rem <dbl>, sleep_cycle <dbl>, awake <dbl>,
## # brainwt <dbl>, bodywt <dbl>

The printed tibble is much neater than the printed data frame! Although there are ways to print data frames more neatly, tibbles are automatically formatted so that the columns are abbreviated to fit on one row (or are not printed), and you only see the first ten rows of data instead of every single row. This makes it way more convenient to view your data sets.

Tibbles also reduce errors when subsetting your data. For example, when subsetting with single square brackets [ ], tibbles always return another tibble. In contrast, subsetting data frames will sometimes return a vector instead of another data frame.
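Here is a small sketch of that difference, assuming the tibble package is installed. The toy data are made up.

```r
library(tibble)

df <- data.frame(species = c("a", "b"), abundance = c(2, 10))
tb <- as_tibble(df)

# Selecting a single column with single square brackets
class(df[, "abundance"])   # the data frame drops down to a plain vector
class(tb[, "abundance"])   # the tibble stays a tibble

# You can ask a data frame to keep its shape with drop = FALSE
class(df[, "abundance", drop = FALSE])
```

The data frame behaviour is not wrong, but it can surprise you when later code expects a table and gets a vector instead.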

And if you try to subset a tibble using a column that does not exist, you’ll receive a warning that the column does not exist. In contrast, subsetting a data frame using a column that doesn’t exist will only return NULL, and you don’t receive an explanation of why.

# See if msleep (the tibble) has a column called "abc"
msleep$abc

## Warning: Unknown or uninitialised column: `abc`.

## NULL

# Turn msleep into a data frame
msleep <- as.data.frame(msleep)
# See if msleep (the data frame) has a column called "abc"
msleep$abc

## NULL

One other advantage of tibbles is that they allow your column names to contain spaces. Normally you wouldn’t go out of your way to add spaces to your column names, since it’s much better practice to use underscores “_” in place of spaces to begin with. However, sometimes the data you read into R will contain spaces in the column names. While read.csv() and data.frame() replace spaces with periods “.” by default, tibbles keep the original column names, and you refer to them in code by surrounding them with backticks (also known as the grave accent; it’s the apostrophe-like character that usually shares a key with the tilde ‘~’, above the Tab key).

When reading data into R, you can import it directly as a tibble and keep all column names as they were in the original CSV by using read_csv() from the readr package (note the underscore between ‘read’ and ‘csv’, in contrast to read.csv(), which reads your data in as a data frame).
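To see the column-name behaviour for yourself, here is a minimal sketch that writes a tiny CSV to a temporary file and reads it back both ways. It assumes the readr package is installed (it comes with the tidyverse), and the file contents are made up.

```r
library(readr)

# Write a tiny CSV with spaces in its header to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("site name,water temp", "creek,18.5", "marsh,21.0"), path)

# read.csv() swaps the spaces for periods by default
df <- read.csv(path)
names(df)

# read_csv() returns a tibble and keeps the names as written
tb <- read_csv(path, show_col_types = FALSE)
names(tb)

# Backticks let you refer to a column whose name contains a space
tb$`water temp`
```

If you’d rather not type backticks all the time, renaming the columns to use underscores right after import is a reasonable alternative.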

In short, tibbles make a number of changes to normal data frames that can help reduce errors in your data analysis. These improvements in printing and subsetting are small, but useful!

And that’s it for our blog post on data structures in R! I hope this post taught you a few useful tips and tricks for working with your data. Happy coding!

If you enjoyed this tutorial and want to learn more about data frames and tibbles, and how to use them, you can check out Luka Negoita's full course on the complete basics of R for ecology here:

Start learning R now

Also be sure to check out R-bloggers for other great tutorials on learning R

R Data types 101, or What kind of data do I have?

Wed, 16 Mar 2022 09:45:39 -0400

Most of us are pretty familiar with data types in our daily lives — we can easily tell that things like 1, 2, 3, and 4 are numbers (in this case, integers). 15.7 is still a number, but has a decimal. We know that every single word I’m typing in this sentence is composed of characters, and we know that in math, “true” and “false” are the answers to logical statements.

Just as we do in our heads, R also categorizes our data into different classes. These categories are similar to the real-life ones I described above, but can be a little different in terms of syntax and things to watch out for in your code.

To work in R and perform data analyses, you’ll need to have a solid understanding of data types. In this tutorial, I’m going to introduce several different types of data, explain how to use and manipulate each of them, and show you how to check what type of data you have. Let’s dive in.

Types of data

There are five main types of data in R that you’d come across as an ecologist. I’ll discuss all of them below except complex numbers, which are rarely used for data analysis in R.

Numeric (1.2, 5, 7, 3.14159)
Integer (1, 2, 3, 4, 5)
Complex (i + 4)
Logical (TRUE / FALSE)
Character ("a", "apple")

I’m also going to discuss a sixth, related category that helps you work with categorical variables:

Factor

Numeric

Numeric data types are pretty straightforward. These are just numbers, written as either integers or decimals. We can check if our vector is numeric by using the function is.numeric().

# Create a numeric vector
x <- c(3, 5, 6, 10.7)
# Is our vector numeric? Yes!
is.numeric(x)

## [1] TRUE

We can check our data type by using the functions class() and typeof(). class() tells us that we’re working with numeric values, while typeof() is more specific and tells us we’re working with doubles (i.e., double-precision numbers, which is how R stores any numeric value that can hold decimals).

# Check the class of our data
class(x)

## [1] "numeric"

# Check the specific type of data that you have
typeof(x)

## [1] "double"

You can, of course, perform mathematical operations with numeric values.

# Add 4 to all the values in the vector
x + 4

## [1] 7.0 9.0 10.0 14.7

Integer

You can also do math with integers, which represent numbers without decimal places. These are usually used if you’re counting something — for example, you can observe 7 butterflies in a plot, but you can’t observe 7.2 butterflies (or at least I hope not!).

If you create a vector manually and don’t have any decimal values, R will still identify your vector as the class “numeric”.

# Create a vector with only integers
x <- c(1, 4, 2, 7, 8)
# Look at the class
class(x)

## [1] "numeric"

You can change this vector to be an integer by using the function as.integer().

# Change the vector class
x <- as.integer(x)
# Look at the class
class(x)

## [1] "integer"

Alternatively, you can generate an integer vector like this. The “L” after each number tells R that you want it to be an integer.

# Create an integer vector
x <- c(1L, 2L, 5L, 3L, 10L)
# View vector
x

## [1] 1 2 5 3 10

# View class
class(x)

## [1] "integer"

You could also create an integer vector like this. The colon (:) tells R to generate a sequence of integers from 1 to 10, going up by 1 each time. (Wrapping the sequence in c() is optional; 1:10 on its own gives the same vector.)

# Create a sequence of integers
x <- c(1:10)
# View vector
x

## [1] 1 2 3 4 5 6 7 8 9 10

# View data class
class(x)

## [1] "integer"

Some functions will also automatically generate integer vectors, like the function sample(). This function randomly draws a given number of values from the vector you supply. Here I asked sample() to draw ten values from the integers 1 to 10; because it samples without replacement by default, the result is simply those ten integers in a random order.

# Create a random sequence of integers from 1 to 10:
set.seed(123) # use set.seed to get the same random values as me
x <- sample(1:10, 10)
# View vector
x

## [1] 3 10 2 8 6 9 1 7 5 4

# View data class
class(x)

## [1] "integer"
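One thing to keep in mind is that integers don’t always stay integers once you start doing math with them. The sketch below uses a made-up vector of counts to show when the class is kept and when R switches to numeric.

```r
# Hypothetical butterfly counts stored as integers
counts <- c(7L, 3L, 12L)

class(counts + 1L)    # integer + integer stays integer
class(counts + 1)     # adding a plain number (a double) gives numeric
class(counts / 2L)    # regular division returns numeric, even for integers
class(counts %/% 2L)  # integer division keeps the integer class
```

In most analyses this automatic switch is harmless, but it explains why a column you created as integer can show up as numeric later on.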

Complex

I’m not going to discuss this one because complex numbers aren’t used much in R for data analysis, though they exist. These are just numbers with real and imaginary components (containing the number i, or the square root of -1).

Character

Characters are another common data type. These are used to store text in R (also called “strings”). To indicate something is a character, we put quotation marks around it "".

# Create a vector of characters
x <- c("These", "are", "characters")
# View class
class(x)

## [1] "character"

Putting quotation marks around numbers will also turn them into characters, which can get confusing.

# Create a vector of characters
x <- c("1", "4", "5", "7", "8")
# View vector
x

## [1] "1" "4" "5" "7" "8"

You can’t do math with a vector of numbers that are classed as characters.

# Try to do math
mean(x)

## Warning in mean.default(x): argument is not numeric or logical: returning NA

## [1] NA

Why? Because R views them as text!

# View class
class(x)

## [1] "character"

You can turn this character vector of numbers into a numeric vector using the as.numeric() function.

Note: a common cause of this is accidentally having a character value (i.e., a letter or symbol) in a column of values that are otherwise supposed to be numeric, which makes R read the whole column as character. A stray space added to a number or to an empty cell might have the same effect. This can happen accidentally (and so easily!) during data entry, so using as.numeric() is one way to resolve that issue. Any values that can’t be interpreted as numbers will be converted to NA, and R will warn you that this happened. In that scenario you’ll probably want to go back and fix your raw CSV file, but at least now the NAs will help you find where the problem was.

# Turn it into a numeric vector
x <- as.numeric(x)
# View vector
x

## [1] 1 4 5 7 8

# View class
class(x)

## [1] "numeric"

And then you can turn it back into a character using as.character().

# Turn it back into a character
x <- as.character(x)
# View vector
x

## [1] "1" "4" "5" "7" "8"

# View class
class(x)

## [1] "character"
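Going back to the note about data-entry mistakes, here is a minimal sketch of how the NAs can help you track down the problem values. The vector is made up to mimic a column with two typos in it.

```r
# Hypothetical column as it might arrive from a CSV with two data-entry slips
raw <- c("12.5", "7", "1o", "n/a", "3.2")

# Values that can't be read as numbers become NA.
# R normally warns about this; the warning is silenced here to keep output tidy.
converted <- suppressWarnings(as.numeric(raw))
converted

# Positions of the problem entries, and what was originally typed there
bad <- which(is.na(converted))
bad
raw[bad]
```

Looking at raw[bad] shows you exactly what was typed, which makes it much easier to fix the original file.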

Logical

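The logical class is represented by only two possible values: TRUE or FALSE (these can also be written T / F, but never true / false or t / f). Spelling out TRUE and FALSE is the safer habit, because T and F are ordinary variables that can be overwritten, while TRUE and FALSE are reserved words.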

These values result from any logical statements that are made. For example, in the code below I asked R if the elements of my vector were greater than 5. This returns a logical vector where each element is either TRUE or FALSE.

# Create a vector
x <- c(1, 5, 6, 7, 2, 8)
# Are the elements of vector x greater than 5? Store results in vector y
y <- x > 5
# View y
y

## [1] FALSE FALSE TRUE TRUE FALSE TRUE

# View class
class(y)

## [1] "logical"

You can also create a vector of logical values directly.

# Create logical vector
x <- c(T, F, T, F, F, T)
# View vector
x

## [1] TRUE FALSE TRUE FALSE FALSE TRUE

And you can convert logical values to numeric values, and back. FALSE is the same as 0, while TRUE is the same as 1.

# Convert to numeric vector
x <- as.numeric(x)
# View vector
x

## [1] 1 0 1 0 0 1

# Convert back to logical vector
x <- as.logical(x)
# View vector again
x

## [1] TRUE FALSE TRUE FALSE FALSE TRUE

This also means that you can do math with logical values. This is useful if, for example, you’re trying to see how many TRUE values you have in your vector. In fact, applying any math operations to a logical vector will automatically convert the values to 1s and 0s.

# View vector
x

## [1] TRUE FALSE TRUE FALSE FALSE TRUE

# Count how many "TRUE" values there are. There are 3!
sum(x)

## [1] 3
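The same trick works for proportions, since the mean of a set of 1s and 0s is the fraction of 1s. Here is a quick sketch using made-up abundance values.

```r
# Hypothetical abundances for five species
abundance <- c(2, 0, 10, 0, 9)

# Which species were present at all?
present <- abundance > 0

sum(present)   # number of species present
mean(present)  # proportion of species present
```

This saves you from having to count the TRUE values and divide by the length yourself.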

Factor

Factors are a special data type that is primarily used to represent repeating categories (i.e., categorical variables). When you specify an object as a factor, you’re telling R to think of it as a categorical variable, with different levels. This can be helpful when analyzing your data, as categorical variables and continuous variables are often handled differently in statistical analyses.

In the code below, I created a data frame showing the height and sex of five individuals.

# Create an example data frame
example <- data.frame(indiv = c("A", "B", "C", "D", "E"),
height = c(15, 10, 12, 9, 17),
sex = c("female", "female", "female", "male", "female"))
# View structure of data frame
str(example)

## 'data.frame': 5 obs. of 3 variables:
## $ indiv : chr "A" "B" "C" "D" ...
## $ height: num 15 10 12 9 17
## $ sex : chr "female" "female" "female" "male" ...

Right now, the sex column is a character vector because I entered the data in quotation marks. But really what I want to do is tell R that sex is a categorical variable, with “female” and “male” as levels. To do that, all I have to do is use the as.factor() function.

# Change the sex column to be a factor
example$sex <- as.factor(example$sex)
# View the factor
example$sex

## [1] female female female male female
## Levels: female male

You can see that R listed the vector and then beneath that, has figured out on its own that the levels are “female” and “male”. When writing the levels, R will sort them in alphabetical order. That’s why the levels are female male instead of male female.

You may want to change the order of your factor levels (this can be useful when plotting your data and determining the order in which they appear).

For example, you might have a vector like this:

# Create vector
places <- factor(c("first", "first", "second", "third", "fifth", "fourth", "second"))
# View factor
places

## [1] first first second third fifth fourth second
## Levels: fifth first fourth second third

The order of the levels doesn’t make sense. We want it to go from first through fifth in the implied numeric order — not alphabetically. So let’s change the order using factor(vector, levels = c("first", "second", "third", etc.)).

# Change level order
places <- factor(places, levels = c("first", "second", "third", "fourth", "fifth"))
# View factor
places

## [1] first first second third fifth fourth second
## Levels: first second third fourth fifth

Much better!
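To see why the level order matters, here is a small sketch that rebuilds the same factor and looks at a few things that follow the level order. The levels are set when the factor is created, which is an alternative to reordering afterwards.

```r
# Set the level order at the moment the factor is created
places <- factor(c("first", "first", "second", "third", "fifth", "fourth", "second"),
                 levels = c("first", "second", "third", "fourth", "fifth"))

# The levels, in the order we asked for
levels(places)

# Summaries such as table() list the categories in level order
table(places)

# Behind the scenes, each value is stored as the position of its level
as.integer(places)
```

Plotting functions generally follow this level order too, which is why reordering levels is a common step before making a figure.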

Factors don’t just have to be text. They can also be built from numbers. For example, in the code below I created a data frame describing the stream width and order of several stream sites. Stream order is not a continuous variable, even though it’s represented by numbers. It would be best to convert stream order to a factor.

# Create data frame 
example2 <- data.frame(stream = c("Patuxent", "Patapsco", "Deer Creek", "Town Creek", "Browns Branch"),
width = c(37, 42, 25, 32, 22),
order = c(6, 6, 4, 5, 3))
# View data frame structure
str(example2)

## 'data.frame': 5 obs. of 3 variables:
## $ stream: chr "Patuxent" "Patapsco" "Deer Creek" "Town Creek" ...
## $ width : num 37 42 25 32 22
## $ order : num 6 6 4 5 3

R sees stream order as being numeric, which makes sense. But let’s tell R that stream order is a factor.

# Change stream order to a factor
example2$order <- as.factor(example2$order)
# View stream order
example2$order

## [1] 6 6 4 5 3
## Levels: 3 4 5 6

Looks good. Since these are numbers, R just orders the levels in ascending order.
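One thing to watch out for with number-based factors is converting them back to numbers. The sketch below rebuilds the stream order values on their own to show what happens.

```r
# The same stream orders as above, stored as a factor
stream_order <- factor(c(6, 6, 4, 5, 3))

# as.numeric() on a factor returns the position of each level, not the label
as.numeric(stream_order)

# Converting to character first recovers the original numbers
as.numeric(as.character(stream_order))
```

The first result is easy to mistake for real data, so it’s worth going through as.character() whenever you turn a factor of numbers back into a numeric vector.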

How to check and manipulate data types

As demonstrated throughout this tutorial, it can be useful to check the type of data you’re working with and be able to change it to another type if you need. You might need this especially in situations where you’re reading in data from a .csv, and need to check that all your numbers are numeric instead of characters.

The main way to check your data type is to use the function class(). If you have a data frame, another easy way to check data types is to use the str() function. This displays the structure of your data frame and tells you what data type each of your columns is. The example below lists heights over time for five individuals.

# Create an example data frame
example <- data.frame(indiv = c("A", "B", "C", "D", "E"),
height_0 = c(15, 10, 12, 9, 17),
height_10 = c(20, 18, 14, 15, 19),
height_20 = c(23, 24, 18, 17, 26))
str(example)

## 'data.frame': 5 obs. of 4 variables:
## $ indiv : chr "A" "B" "C" "D" ...
## $ height_0 : num 15 10 12 9 17
## $ height_10: num 20 18 14 15 19
## $ height_20: num 23 24 18 17 26

You can see that the column indiv is a character vector (abbreviated “chr”), while each successive column is numeric (abbreviated “num”).

You may also have noticed me using functions like is.numeric() or as.character(). All of the data types have is. and as. functions. The is. functions are logical tests that check for a specific data type: they ask “is this object of the class XXX?” and return TRUE or FALSE. The as. functions are actions that convert objects into a new data type. You may find yourself using these often when you’re first formatting your data and preparing it for analysis.
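As a quick sketch of the two families working together, here is a made-up vector of counts that starts out as text, gets checked, and then gets converted.

```r
# Hypothetical counts that were read in as text
counts <- c("3", "8", "15")

# Check the type before converting
is.character(counts)
is.numeric(counts)

# Convert, then check again
counts <- as.integer(counts)
is.integer(counts)
is.numeric(counts)  # integers also count as numeric
```

Checking first and converting second is a handy habit, because it tells you whether a conversion is needed at all.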

That’s it for data types in R! Keep an eye out for our next tutorial, which will go over different data structures in R like vectors, lists, data frames, and tibbles. I hope this tutorial was helpful! Happy coding!

If you enjoyed this tutorial and want to learn more about data types and how to use them, you can check out Luka Negoita's full course on the complete basics of R for ecology here:

Start learning R now


Complete tutorial on using 'apply' functions in R

Tue, 08 Mar 2022 09:45:39 -0500

Today I’m going to talk about a useful family of functions that allows you to repetitively perform a specified function (e.g., sum(), mean()) across a vector, list, matrix, or data frame. For those of you familiar with ‘for’ loops, the apply() family often allows you to avoid constructing those and instead wrap the loop into one simple function.

I’m going to discuss the functions apply(), lapply(), sapply(), and tapply() in this blog post (as well as using the dplyr library for similar tasks). These functions all end in apply() because you apply the function you want across all the specified elements.

Let’s see how they work.

The `apply()` function

Let’s start with the apply() function. First, we’ll create an example data set. This data set is in wide format* and describes the heights of five individuals (e.g., plants) in inches at three different time points (0, 10, and 20 days). The first column contains the IDs for each individual, and each successive column describes their heights at time points 0, 10, and 20 in that order.

*Note: Wide format refers to having multiple repeated variations of the same column (here, one height column per time point). In this example, long format would entail having just one column for ‘height’ and one column for ‘time’, with the time values 0, 10, and 20 listed down the rows.

# Create data frame
example <- data.frame(indiv = c("A", "B", "C", "D", "E"),
                      height_0 = c(15, 10, 12, 9, 17),
                      height_10 = c(20, 18, 14, 15, 19),
                      height_20 = c(23, 24, 18, 17, 26))
# View the data frame
head(example)

## indiv height_0 height_10 height_20
## 1 A 15 20 23
## 2 B 10 18 24
## 3 C 12 14 18
## 4 D 9 15 17
## 5 E 17 19 26

apply() lets you perform a function across a data frame’s rows or columns. In the arguments, you specify what you want as follows: apply(X = data.frame, MARGIN = 1, FUN = function.you.want). First, you enter the data frame you want to analyze, then MARGIN asks you which dimension you want to analyze. MARGIN = 1 indicates that you want to analyze across the data frame’s rows, while MARGIN = 2 analyzes across columns. Then you enter the name of the function that will be applied to the rows or columns (don’t include parentheses or function arguments).

So let’s try finding the mean plant height for each row (i.e., for each individual). We also have to subset our data to only contain height values (columns 2 through 4) because our first column contains the individual identifiers.

# Calculating the mean for each row in the data frame
row.avg <- apply(X = example[, 2:4], MARGIN = 1, FUN = mean)
# View row.avg
row.avg

## [1] 19.33333 17.33333 14.66667 13.66667 20.66667

This returns a vector where each position corresponds to the row number that was averaged. Individual A’s average height is in position 1, B’s is in position 2, etc.

If we find the mean for each column (i.e., each time point), it returns a vector with named positions for each column that was analyzed.

# Calculating the mean for each column in the data frame
col.avg <- apply(example[, 2:4], 2, mean)
# View col.avg
col.avg

## height_0 height_10 height_20
## 12.6 17.2 21.6

Note: I used finding the mean as an example, but if you were actually trying to find the mean across the rows or columns of a data frame, you should use the rowMeans() or colMeans() functions instead of apply(), as they work more efficiently.
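To make that comparison concrete, here is a small sketch that checks rowMeans() and colMeans() against the apply() versions. The object names (heights, row_means, and so on) are my own for this illustration, and I use unname() so that only the values are compared, not any labels.

```r
# Recreate the height columns of the wide-format example data
heights <- data.frame(height_0 = c(15, 10, 12, 9, 17),
                      height_10 = c(20, 18, 14, 15, 19),
                      height_20 = c(23, 24, 18, 17, 26))

# Row and column means with the dedicated functions
row_means <- rowMeans(heights)
col_means <- colMeans(heights)

# Compare the values against the apply() versions
rows_match <- isTRUE(all.equal(unname(row_means),
                               unname(apply(heights, 1, mean))))
cols_match <- isTRUE(all.equal(unname(col_means),
                               unname(apply(heights, 2, mean))))
c(rows_match, cols_match)
```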

You don’t just have to use pre-made functions like sum() or mean(). You could also write your own function to use. In the code below, I wrote a function that tells you if the average plant height is above 15 inches.

# Create function is_tall
is_tall <- function(x) {
  value <- mean(x) > 15
  return(value)
}
# Apply the function to the columns in the data frame
apply(example[, 2:4], 2, is_tall)

## height_0 height_10 height_20
## FALSE TRUE TRUE

This tells me that at time point 0, the plants are not taller than 15 inches on average, while the opposite is true for time points 10 and 20.
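One more thing that can be handy: any extra arguments you list after FUN are passed along to the function being applied. The sketch below assumes a made-up missing value (individual C at day 10) that is not in the original data, and uses it to show na.rm = TRUE being handed to mean().

```r
# Recreate the height columns of the wide-format example data
heights <- data.frame(height_0 = c(15, 10, 12, 9, 17),
                      height_10 = c(20, 18, 14, 15, 19),
                      height_20 = c(23, 24, 18, 17, 26))

# Hypothetical missing measurement: individual C at day 10
heights$height_10[3] <- NA

# Without na.rm, the column containing the NA has a mean of NA
with_na <- apply(heights, 2, mean)

# Arguments after FUN are passed on to mean()
without_na <- apply(heights, 2, mean, na.rm = TRUE)
without_na
```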

The `lapply()` function

Let’s look at another function, called lapply(). The “L” in front of “apply” stands for “lists”, because this function is used on list objects and returns a list as well.

I created a list called plants, containing three elements that are each vectors with a length of ten. Each element in the list contains different plant attributes (height, mass, and # of flowers). I used the runif() function to generate random numbers, and used the sample() function to generate random integers between one and ten.

# Set seed so that the randomly-generated numbers are the same each time
set.seed(123)
# Create a list using randomly-generated numbers
plants <- list(height = runif(10, min = 10, max = 20),
               mass = runif(10, min = 5, max = 10),
               flowers = sample(1:10, 10))
# View the list
plants

## $height
## [1] 12.87578 17.88305 14.08977 18.83017 19.40467 10.45556 15.28105 18.92419
## [9] 15.51435 14.56615
##
## $mass
## [1] 9.784167 7.266671 8.387853 7.863167 5.514623 9.499125 6.230439 5.210298
## [9] 6.639604 9.772518
##
## $flowers
## [1] 9 10 1 5 3 2 6 7 8 4

If we wanted to calculate the average value for each list element, we could do it individually:

mean(plants$height)

## [1] 15.78248

mean(plants$mass)

## [1] 7.616846

mean(plants$flowers)

## [1] 5.5

This method is pretty inefficient and makes us repeat our code. And what if we have more than three list elements? That would be a pain to type out. Let’s try another method.

We could create a for loop and save the results in a vector:

# Create a vector to hold the results
plant_avgs <- numeric(length(plants))
# Loop the averages for each element and save in our vector
for (i in seq_along(plants)) {
  plant_avgs[i] <- mean(plants[[i]])
}
# View the vector
plant_avgs

## [1] 15.782475 7.616846 5.500000

This method is better because it automates the process, which would be especially useful if our list had a ton of elements. But for loops take more effort to construct, and still take up quite a bit of space in our code.

Let’s try one last method: using lapply() to wrap this whole process into a neat function. lapply() doesn’t have the MARGIN argument that apply() has. Instead, lapply() already knows that it should apply the specified function across all list elements. You can just type lapply(X = list, FUN = function.you.want), like this:

# Use lapply to find the mean of each list element
lapply(plants, mean)

## $height
## [1] 15.78248
##
## $mass
## [1] 7.616846
##
## $flowers
## [1] 5.5

You’ll notice that the output of lapply() is also a list, where the means of height, mass, and flowers are saved as list elements of the same name. lapply() does the same thing as the for loop, but is far more efficient in terms of space and effort. lapply() ends up being the best of the three methods I just showed you.

The `sapply()` function

In the previous example, our means were returned as elements in a list, but each list element was represented by just one value. There wasn’t really any reason for those values to be put in a list format instead of, say, a vector.

This is where the sapply() function comes in. It goes hand-in-hand with lapply() and works the same way, where it can accept a list and a function name as the input. But instead of returning a list, it will return the answers in the simplest possible format. In our case, this would mean returning the answers as a vector like below, which usually makes it easier to work with down the line.

# Use sapply to find the mean of each list element
sapply(plants, mean)

## height mass flowers
## 15.782475 7.616846 5.500000
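The “simplest possible format” depends on what the function returns. As a sketch (the object names plant_ranges and plant_means are mine), here is what happens when each result has two values instead of one, along with vapply(), a stricter base R relative of sapply() where you state the expected type and length of each result up front.

```r
# Recreate the plants list from above
set.seed(123)
plants <- list(height = runif(10, min = 10, max = 20),
               mass = runif(10, min = 5, max = 10),
               flowers = sample(1:10, 10))

# range() returns two values (min and max) per element,
# so sapply() simplifies the result to a matrix instead of a vector
plant_ranges <- sapply(plants, range)
dim(plant_ranges)

# vapply() works like sapply(), but errors if a result is not
# a single numeric value as declared in FUN.VALUE
plant_means <- vapply(plants, mean, FUN.VALUE = numeric(1))
plant_means
```

I find vapply() reassuring inside scripts, because the output format cannot silently change if the data do.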

The `tapply()` function

The tapply() function works in much the same way as the other functions, but allows you to perform an operation across specified groups in your data. For those of you familiar with the dplyr package, this does the same thing as the group_by() and summarize() functions.

Let’s return to our example data set from before, where we described the heights of several different individuals over time. This time, we’re going to write the data in long format, so that each row represents one observation. Stay tuned for a tutorial post on reshaping data in R coming soon if you’re interested in learning more about wide vs. long format data.

# Load library to use the pivot_longer() function
library(tidyverse)

# Pivot the data so that the data are in long format instead of wide format
example <- pivot_longer(example, cols = 2:4, names_to = "time", values_to = "height")
# Use sub() to get rid of the string "height_" in front of the time values
example$time <- sub("height_", "", example$time)
# View data
head(example)

## # A tibble: 6 × 3
## indiv time height
## <chr> <chr> <dbl>
## 1 A 0 15
## 2 A 10 20
## 3 A 20 23
## 4 B 0 10
## 5 B 10 18
## 6 B 20 24

You can see that now we have a column for time, with values of 0, 10, and 20. Let’s use tapply() to look at the individuals’ heights, grouped by time. The function accepts a new argument called INDEX: tapply(X = vector.to.analyze, INDEX = vector.to.group.by, FUN = function.you.want). In the code below, I wanted to analyze the height values grouped by time, using the function mean().

# Use tapply() to find average height by time grouping
tapply(X = example$height, INDEX = example$time, FUN = mean)

## 0 10 20
## 12.6 17.2 21.6

Looks good! tapply() returned a vector of values for the average heights at different time points.
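Because the result is labeled by group, you can pull out a single group's value by name. The sketch below rebuilds the long-format data in base R under a made-up name (long_example) so it runs on its own, and also shows that grouping by a different column only requires a different INDEX.

```r
# Rebuild the long-format example data in base R
long_example <- data.frame(
  indiv = rep(c("A", "B", "C", "D", "E"), each = 3),
  time = rep(c("0", "10", "20"), times = 5),
  height = c(15, 20, 23, 10, 18, 24, 12, 14, 18, 9, 15, 17, 17, 19, 26)
)

# Average height by time point
avg_by_time <- tapply(X = long_example$height,
                      INDEX = long_example$time,
                      FUN = mean)

# Pull out the average for day 10 by name
avg_by_time[["10"]]

# Average height by individual instead
avg_by_indiv <- tapply(long_example$height, long_example$indiv, mean)
avg_by_indiv
```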

Note: You may have noticed that in all of my examples, I’m using apply() across a list or a data frame. Even though the apply() family of functions can be used across a simple vector, there’s often no need to do so. Most functions in R are already “vectorized”, which means the function will be applied to each element of the vector instead of having to loop through one element at a time. For example, the sqrt() function is vectorized. Doing sqrt(vector) and sapply(vector, sqrt) will return the same answer, so using the apply() function is unnecessary. It is almost always faster to use the vectorized function than to run a loop or to use an apply() function, if you have the option. And in some cases, running a for loop might even be faster than using an apply() function. Check out this blog post by Michael Mayer for a great comparison of different methods.
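A quick sketch of that point, using a made-up vector called values:

```r
# Hypothetical example vector
values <- c(1, 4, 9, 16)

# Vectorized call: sqrt() handles every element at once
vectorized <- sqrt(values)

# Same result via sapply(), which calls sqrt() once per element
looped <- sapply(values, sqrt)

# Both approaches give the same answer
identical(vectorized, looped)
```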

The `apply()` functions vs. `dplyr` functions

Some of you may be wondering about how useful the apply() functions can be after you’ve learned how to use dplyr functions.

I just demonstrated how to use tapply(), but the same thing could have been accomplished in dplyr. Below, I grouped the data by the time column, and created a column called avg_height that calculates the mean height for each time group. See our tutorial here for a more in-depth discussion of the group_by() function.

# Show grouping example in dplyr
example %>%
  group_by(time) %>%
  summarize(avg_height = mean(height)) %>%
  ungroup()

## # A tibble: 3 × 2
## time avg_height
## <chr> <dbl>
## 1 0 12.6
## 2 10 17.2
## 3 20 21.6

This returns a table of values rather than a vector, but it still contains the same basic information. It shows the average heights of individuals at three different time points. So which method is better, dplyr functions or tapply()?

The answer is that it depends on what you’re going to do afterwards! tapply() might be useful to get a quick answer. It’s one easy line of code that tells you the average heights. The dplyr method is useful if you’re going to keep working on the data. The pipe operator (%>%) allows you to use the output of one function as the input of another, without having to create intermediate variables.

For example, in the code below, I wanted to not only summarize the average heights at each time point, but I also wanted to filter out only the heights that were greater than 15. I did that easily by adding another pipe to the end of my previous line and typing the next short bit of code.

# Show grouping example in dplyr and further manipulation
example %>%
  group_by(time) %>%
  summarize(avg_height = mean(height)) %>%
  ungroup() %>%
  filter(avg_height > 15)

## # A tibble: 2 × 2
## time avg_height
## <chr> <dbl>
## 1 10 17.2
## 2 20 21.6

There are other dplyr functions that are analogous to the rest of the apply() family. For example, the across() function works similarly to apply(). Let’s go back to the previous wide format of our example data frame by using pivot_wider().

# Turn the data frame back into wide format
example <- pivot_wider(example, id_cols = indiv, names_from = time, values_from = height)
# View data frame
head(example)

## # A tibble: 5 × 4
## indiv `0` `10` `20`
## <chr> <dbl> <dbl> <dbl>
## 1 A 15 20 23
## 2 B 10 18 24
## 3 C 12 14 18
## 4 D 9 15 17
## 5 E 17 19 26

Let’s say we want to convert our height values from inches to centimeters by multiplying by 2.54. We can use the across() function to do this. In the code below, I wrote a quick function that multiplies your values by 2.54 to convert from inches to cm. Then I used the function mutate() to change the data frame. Using across(), I indicated that I wanted to modify columns 2 through 4 using the to_cm() function.

# Write function called to_cm that converts values from inches to cm
to_cm <- function(x) {
  cm <- x * 2.54
  return(cm)
}
# Convert height from inches to centimeters
example %>%
  mutate(across(2:4, to_cm))

## # A tibble: 5 × 4
## indiv `0` `10` `20`
## <chr> <dbl> <dbl> <dbl>
## 1 A 38.1 50.8 58.4
## 2 B 25.4 45.7 61.0
## 3 C 30.5 35.6 45.7
## 4 D 22.9 38.1 43.2
## 5 E 43.2 48.3 66.0

And… ta-da! Our data has now been changed from inches to cm.

If we were to perform an operation across rows in dplyr, we would need to group by rows using the rowwise() function before performing any other operation (it works the same way as the group_by() function, just groups by rows).
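As a rough sketch of what that looks like, the code below computes each individual's average height with rowwise() and c_across(). To keep it self-contained, I rebuild the wide data under a made-up name (example_wide) with the original column names height_0, height_10, and height_20, rather than the `0`, `10`, and `20` columns we have at this point in the tutorial.

```r
library(dplyr)

# Rebuild the original wide-format data (hypothetical object name)
example_wide <- data.frame(indiv = c("A", "B", "C", "D", "E"),
                           height_0 = c(15, 10, 12, 9, 17),
                           height_10 = c(20, 18, 14, 15, 19),
                           height_20 = c(23, 24, 18, 17, 26))

# Treat each row as its own group, average the height columns
# within that row, then remove the row-wise grouping
row_avgs <- example_wide %>%
  rowwise() %>%
  mutate(avg_height = mean(c_across(height_0:height_20))) %>%
  ungroup()
row_avgs
```

The avg_height column holds the same values that apply() with MARGIN = 1 returned earlier, just attached to the data frame as a new column.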

Again, using the dplyr functions instead of apply() is up to your own discretion. apply() is an easy, one-line function that can account for row-wise and column-wise operations. But dplyr offers a useful grammar (pipes!) that allows you to keep working smoothly without interruption in your code. Different circumstances will call for different methods, and it might take some trial and error before you discover the method that works best for you in each situation.

That concludes our summary of the apply() functions! We learned how to use apply(), lapply(), sapply(), and tapply(), and we discussed equivalent dplyr functions for apply() and tapply().

Let us know what you think of apply() vs dplyr in the comments! Do you have a preferred method?

If you enjoyed this tutorial and want to learn more, you can check out Luka Negoita's full course on the complete basics of R for ecology here:

Start learning R now


How to use pipes to clean up your R code

Wed, 02 Mar 2022 02:45:39 -0500

I’ve talked a little bit about pipes (written as %>%) in a past blog post, but they’re so important in R that I thought they deserved their own post.

In this tutorial, I’m going to give an explanation of what pipes are and when they can be used, and then I’m going to demonstrate how useful they can be for writing clean and neat R code.

What is a pipe?

A pipe is a type of operator in R that comes with the magrittr package. It takes the output of one function and passes it as the first argument of the next function, allowing us to chain together several steps in R. Pipes help your code flow better, making it cleaner and more efficient.

The pipe shines when used in conjunction with the dplyr package and its functions such as filter, mutate, and summarise, as we often need to use these one after another to manipulate our data. Luckily, the pipe comes loaded with dplyr, so there’s no need to load the magrittr package unless you specifically need to use the other magrittr operators.
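Here is a tiny sketch of that “output becomes the first argument” idea, using base R functions on a made-up vector. The names nested and piped are just for this illustration.

```r
library(dplyr)  # also makes the %>% pipe available

# Nested version: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped version: read from top to bottom
piped <- c(1, 4, 9) %>%
  sqrt() %>%
  sum()

# Both give the same result
identical(nested, piped)
```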

A quick demonstration on how to use pipes

Let’s see pipes in action. First, load the dplyr package and the classic iris data set that comes with R. If you don’t have dplyr installed yet, you’ll need to run install.packages("dplyr") before loading the package.

# Load dplyr
library(dplyr)

# Load data
data("iris")
# View data
head(iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa

These data describe several measurements for three plant species (Iris setosa, Iris versicolor, and Iris virginica). These measurements describe morphological differences among the three species in terms of sepal length and width and petal length and width, all in centimeters.

I want to keep only the largest plants in the data set, so let’s only include plants with Sepal.Length greater than 5 cm, and Petal.Length greater than 3 cm. I also want to create two columns called “Sepal.Area” and “Petal.Area”, equivalent to length x width (for an approximation of sepal/petal area). To do this, I’ll use the filter() and mutate() functions. Notice that I also hit “Enter” or “Return” to add a new line after every pipe to keep the code clean and keep each function on a separate line.

# Filter and mutate data
new_iris <- iris %>%
  filter(Sepal.Length > 5 & Petal.Length > 3) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width,
         Petal.Area = Petal.Length * Petal.Width)
# View new data
head(new_iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Area
## 1 7.0 3.2 4.7 1.4 versicolor 22.40
## 2 6.4 3.2 4.5 1.5 versicolor 20.48
## 3 6.9 3.1 4.9 1.5 versicolor 21.39
## 4 5.5 2.3 4.0 1.3 versicolor 12.65
## 5 6.5 2.8 4.6 1.5 versicolor 18.20
## 6 5.7 2.8 4.5 1.3 versicolor 15.96
## Petal.Area
## 1 6.58
## 2 6.75
## 3 7.35
## 4 5.20
## 5 6.90
## 6 5.85

Our data set looks good. You’ll see that my arguments in the filter() and mutate() functions are a bit different from usual. Normally, most of the dplyr functions are formatted like this: function(data, arguments).

Remember that a pipe takes the output of whatever came before it and passes it as the first argument of the function that follows. Thus, the filter() function receives iris as its data argument, and then the mutate() function receives the result of filter(iris, Sepal.Length > 5 & Petal.Length > 3) as its data argument.

With pipes there was no need for me to write filter(iris, Sepal.Length > 5 & Petal.Length > 3), because that would be repetitive—I could just skip straight to the arguments and write filter(Sepal.Length > 5 & Petal.Length > 3).
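If you want to convince yourself that the two ways of writing it are equivalent, here is a quick check. The object names explicit and piped are made up for this sketch.

```r
library(dplyr)

# Data written out as the first argument
explicit <- filter(iris, Sepal.Length > 5 & Petal.Length > 3)

# Data passed in by the pipe
piped <- iris %>%
  filter(Sepal.Length > 5 & Petal.Length > 3)

# The two results are the same
identical(explicit, piped)
```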

To summarize in plain English (each “then” in this sentence can be replaced with a pipe):

I wrote code starting with the iris data set, then filtered it by Sepal.Length and Petal.Length, then used mutate to create two new columns.

Without pipes, our sentence becomes longer:

I wrote code starting with the iris data set. I filtered the iris data set by Sepal.Length and Petal.Length. Using the filtered iris data, I used mutate to create two new columns.

And those are the essentials of using pipes!

Cleaning code with pipes

After that last example, you might be thinking, OK, that’s pretty cool. But can it really make that big of a difference for organizing my code? The answer is…yes! And I’ll quickly demonstrate why.

Example 1: Creating new variables for each step

Let’s filter and mutate our data like we did above, then group by species and summarize to find the average sepal and petal area within each species. Without pipes, our code might look like this:

filtered_iris <- filter(iris, Sepal.Length > 5 & Petal.Length > 3)
mutated_iris <- mutate(filtered_iris,
                       Sepal.Area = Sepal.Length * Sepal.Width,
                       Petal.Area = Petal.Length * Petal.Width)
grouped_iris <- group_by(mutated_iris, Species)
summary_iris <- summarize(grouped_iris,
                          avg.sepal.area = mean(Sepal.Area),
                          avg.petal.area = mean(Petal.Area))
# View result
summary_iris

## # A tibble: 2 × 3
## Species avg.sepal.area avg.petal.area
## <fct> <dbl> <dbl>
## 1 versicolor 17.0 5.93
## 2 virginica 19.8 11.4

Whew. It can be a little exhausting to have to save each step as a new variable, and now our environment will be cluttered with a bunch of intermediate variables. Aside from the clutter, your code is also much more prone to errors if you change something in the earlier steps but forget to run those lines before the later steps again. So let’s not do that then.

Example 2: Nesting functions

Let’s try another method, where we nest each function inside the previous one.

summarize(group_by(mutate(filter(iris,
                                 Sepal.Length > 5 & Petal.Length > 3),
                          Sepal.Area = Sepal.Length * Sepal.Width,
                          Petal.Area = Petal.Length * Petal.Width),
                   Species),
          avg.sepal.area = mean(Sepal.Area),
          avg.petal.area = mean(Petal.Area))

## # A tibble: 2 × 3
## Species avg.sepal.area avg.petal.area
## <fct> <dbl> <dbl>
## 1 versicolor 17.0 5.93
## 2 virginica 19.8 11.4

That doesn’t really look much better. If all these nested functions are making your head spin, don’t worry, it’s doing that to me too. Code like this is a great way to spend hours searching for errors… only to realize you’re missing a parenthesis. 😖

Example 3: Pipes!

Let’s try it with pipes:

iris %>%
  filter(Sepal.Length > 5 & Petal.Length > 3) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width,
         Petal.Area = Petal.Length * Petal.Width) %>%
  group_by(Species) %>%
  summarize(avg.sepal.area = mean(Sepal.Area),
            avg.petal.area = mean(Petal.Area))

## # A tibble: 2 × 3
## Species avg.sepal.area avg.petal.area
## <fct> <dbl> <dbl>
## 1 versicolor 17.0 5.93
## 2 virginica 19.8 11.4

Now the flow of our code is much cleaner and clearer. Others will be able to follow our code much more easily, and there’s no need to create new variables each step of the way. Pipes take us smoothly from beginning to end.

This way of writing the code also lets us insert comments at each step so we can clearly document our process:

iris %>%
  # first filter and keep only plants with sepals longer than 5 cm
  # and petals longer than 3 cm:
  filter(Sepal.Length > 5 & Petal.Length > 3) %>%
  # then approximate sepal and petal area by multiplying length and width:
  mutate(Sepal.Area = Sepal.Length * Sepal.Width,
         Petal.Area = Petal.Length * Petal.Width) %>%
  # after that group by species to summarize the mean
  # sepal/petal area of each species:
  group_by(Species) %>%
  summarize(avg.sepal.area = mean(Sepal.Area),
            avg.petal.area = mean(Petal.Area))

## # A tibble: 2 × 3
## Species avg.sepal.area avg.petal.area
## <fct> <dbl> <dbl>
## 1 versicolor 17.0 5.93
## 2 virginica 19.8 11.4

All that said, I’m not suggesting that your entire R analysis script fit inside one long set of pipes. Find what works best for you and your analyses in terms of splitting up your code into neat organized chunks that make sense.

We owe a big thank you to Stefan Milton Bache (@stefanbache on Twitter), creator of the magrittr package and the almighty pipe! Hope you found this tutorial helpful. Happy coding!

P.S. A highly relevant tweet explaining pipes… (from WeAreRLadies on Twitter)

If you enjoyed this tutorial and want to learn more, you can check out Luka Negoita's full course on the complete basics of R for ecology here:

Start learning R now


How to use the group_by function with your ecological data

Wed, 23 Feb 2022 08:45:39 -0500

In scientific data and experiments, we often have groups of subjects between which we want to compare an observed response. For example, we might want to compare the growth rates of plants under different light treatments. Or maybe we want to compare CO₂ emissions of different countries over time. Each of these scenarios requires you to group your data based on a certain variable before you can compare any kind of statistic such as mean, minimum, or maximum.

In this tutorial, I’m going to discuss how to use a handy function called group_by(), which allows you to do what I just described.

group_by() is part of the dplyr package, so we’ll load that up first. Remember that if you haven’t used or installed the package before, you need to run install.packages("dplyr") before loading it in your script. Let’s also load up a data set that comes with R, called Loblolly.

# Load package
library(dplyr)

# Load data
data(Loblolly)
# View data
head(Loblolly)

## height age Seed
## 1 4.51 3 301
## 15 10.89 5 301
## 29 28.72 10 301
## 43 41.74 15 301
## 57 52.70 20 301
## 71 60.92 25 301

Loblolly describes the height of loblolly pine trees at different ages. The height column is given in feet, age is given in years, and Seed is a unique identifier for each tree.

How to use group_by() and summarise()

Let’s say we want to see the average height of loblolly pine trees within each of the age groups. To do that, we need to group our data by the variable “age”. We use the group_by() function like this: group_by(data, column).

# Group the Loblolly data by tree age
group_by(Loblolly, age)

## # A tibble: 84 × 3
## # Groups: age [6]
## height age Seed
## <dbl> <dbl> <ord>
## 1 4.51 3 301
## 2 10.9 5 301
## 3 28.7 10 301
## 4 41.7 15 301
## 5 52.7 20 301
## 6 60.9 25 301
## 7 4.55 3 303
## 8 10.9 5 303
## 9 29.1 10 303
## 10 42.8 15 303
## # … with 74 more rows

When we do this, our data look almost the same: the only visible difference is the Groups: age [6] label at the top of the table. But behind the scenes, R makes note of how we want to group our data and returns a table that is grouped accordingly. After grouping the data, we can apply functions that calculate summary statistics within each group using the function summarize(), or summarise() (the two spellings, American and British, work the same way).

summarise() can be used like so: summarise(data, new_column_name = function(column_to_evaluate)).

So if we wanted to summarize mean heights of trees, it would look like summarise(Loblolly, avgheight = mean(height)).
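Run on its own, without any grouping, that call collapses the whole data set into a single row. A minimal sketch (the object name overall is mine):

```r
library(dplyr)

# No grouping: one summary row for the entire data set
overall <- summarise(Loblolly, avgheight = mean(height))
overall

# The result has exactly one row
nrow(overall)
```

Grouping first is what turns this single overall mean into one mean per group.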

# Group the Loblolly data by tree age and then summarize the mean, min, and max heights in each group
group_by(Loblolly, age) %>%
  summarise(avgheight = mean(height),
            minheight = min(height),
            maxheight = max(height))

## # A tibble: 6 × 4
## age avgheight minheight maxheight
## <dbl> <dbl> <dbl> <dbl>
## 1 3 4.24 3.46 4.81
## 2 5 10.2 9.03 11.4
## 3 10 27.4 25.4 30.2
## 4 15 40.5 37.8 44.4
## 5 20 51.5 48.3 55.8
## 6 25 60.3 56.4 64.1

In essence, summarise() produces a new table that contains a column for your group, and then new columns of summary statistics that you define. In the code above, I asked summarise() to create new columns called “avgheight” for the mean height of trees in each age group, “minheight” for the minimum, and “maxheight” for the maximum. Because we grouped by a single variable, the output is also no longer grouped: by default, summarise() drops the last grouping variable.
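You can check the grouping of a table yourself with dplyr's group_vars() function. The sketch below uses object names of my own (by_age, by_age_summary, by_age_seed) and also shows what happens with two grouping variables, where only the last one is dropped.

```r
library(dplyr)

# Grouped by one variable
by_age <- group_by(Loblolly, age)
group_vars(by_age)

# After summarising, no grouping is left
by_age_summary <- summarise(by_age, avgheight = mean(height))
group_vars(by_age_summary)

# Grouped by two variables: only the last one (Seed) is dropped,
# so the result is still grouped by age
by_age_seed <- group_by(Loblolly, age, Seed) %>%
  summarise(avgheight = mean(height), .groups = "drop_last")
group_vars(by_age_seed)
```

Setting .groups = "drop" instead would remove all grouping in one step.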

You might be wondering about this guy %>% in the code above. This operator is called a pipe, and it comes loaded with the dplyr package. Importantly, this pipe doesn’t come with base R. For now, what you need to know about pipes is that they feed the output of one statement into the input of another. In the code above, the new table that came out of group_by() was passed into the data argument of summarise(), so there was no need for me to name the data inside the summarise() function. In plain English, I asked the code to “group the Loblolly data by tree age, and then (pipe!) summarize those groups using their mean, max, and min”.

Pipes can make your code a lot cleaner, especially if you’re performing several operations on one data frame. Don’t worry, we have a more comprehensive tutorial post on pipes coming up soon.

group_by() and other dplyr functions

We just went over the summarise() function, which is one of the most common dplyr functions to use with group_by(). But you could also use other dplyr functions such as mutate() and filter().

mutate()

For example, we could once again group our data by age, and then we could use mutate() to create a new column for mean height.

# Group the Loblolly data by age and create a new column for average height by age group
group_by(Loblolly, age) %>%
  mutate(age_avgheight = mean(height))

## # A tibble: 84 × 4
## # Groups: age [6]
## height age Seed age_avgheight
## <dbl> <dbl> <ord> <dbl>
## 1 4.51 3 301 4.24
## 2 10.9 5 301 10.2
## 3 28.7 10 301 27.4
## 4 41.7 15 301 40.5
## 5 52.7 20 301 51.5
## 6 60.9 25 301 60.3
## 7 4.55 3 303 4.24
## 8 10.9 5 303 10.2
## 9 29.1 10 303 27.4
## 10 42.8 15 303 40.5
## # … with 74 more rows

This essentially did the same thing as summarise(), but instead of creating a new table, mutate() just added this “age_avgheight” column to the original data set. You can see that for trees of the same age, the “age_avgheight” value is the same. This makes sense, since we grouped the data by age before taking the mean, and there should only be one mean height for each age group.

For functions like mutate() and filter() where we might want to keep working on the same data set afterwards, we need to ungroup() the data after grouping it so that the grouping doesn’t affect other functions down the line. I’ll demonstrate quickly:

# Demonstrating ungrouping data and mutating a new column for average height
group_by(Loblolly, age) %>%
  mutate(age_avgheight = mean(height)) %>%
  ungroup() %>%
  mutate(all_avgheight = mean(height))

## # A tibble: 84 × 5
## height age Seed age_avgheight all_avgheight
## <dbl> <dbl> <ord> <dbl> <dbl>
## 1 4.51 3 301 4.24 32.4
## 2 10.9 5 301 10.2 32.4
## 3 28.7 10 301 27.4 32.4
## 4 41.7 15 301 40.5 32.4
## 5 52.7 20 301 51.5 32.4
## 6 60.9 25 301 60.3 32.4
## 7 4.55 3 303 4.24 32.4
## 8 10.9 5 303 10.2 32.4
## 9 29.1 10 303 27.4 32.4
## 10 42.8 15 303 40.5 32.4
## # … with 74 more rows

After I ungrouped the data, I used mutate() to create a new column for average height again. But this time, because the data is ungrouped, the “all_avgheight” column just contains the average height of all trees in the data set rather than by age group.
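For contrast, here is a sketch of what happens if you forget ungroup(). The object name no_ungroup is made up, and I use datasets::Loblolly so that the example always starts from the untouched data set.

```r
library(dplyr)

# Same steps as before, but without ungroup() in the middle
no_ungroup <- group_by(datasets::Loblolly, age) %>%
  mutate(age_avgheight = mean(height)) %>%
  mutate(all_avgheight = mean(height))

# The data are still grouped by age...
group_vars(no_ungroup)

# ...so the second mean is just the per-age mean all over again
all(no_ungroup$all_avgheight == no_ungroup$age_avgheight)
```

No error or warning appears in this situation, which is why a forgotten ungroup() can be hard to spot.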

filter()

For the filter() example, I’m going to remove a few rows of data from the Loblolly data set so that we can more clearly see the effect of the filter. If you want to follow along, you can copy and paste the following code into your script:

# Remove some rows at random (sort of)
Loblolly <- Loblolly[-c(1, 2, 3, 4, 9, 10, 11, 17, 18, 22, 29, 30, 34, 35, 47, 55, 56, 70, 82, 83), ]

Now let’s see how to use filter() with group_by(). In our data set, we have 6 age classes for each tree: 3, 5, 10, 15, 20, and 25. But because I removed several rows of data, we are now missing some age classes for some trees (e.g., for trees 301 and 303).

# Look at age classes
sort(unique(Loblolly$age))

## [1] 3 5 10 15 20 25

# View modified data
head(Loblolly, 10)

## height age Seed
## 57 52.70 20 301
## 71 60.92 25 301
## 2 4.55 3 303
## 16 10.92 5 303
## 72 63.39 25 303
## 3 4.79 3 305
## 17 11.37 5 305
## 31 30.21 10 305
## 45 44.40 15 305
## 4 3.91 3 307

Let’s say our data analysis requires that we have at least 5 age classes for each tree. In that case, we’ll have to eliminate all trees for which there are fewer than 5 ages. We can use group_by() to group by Seed (the individual tree), then use filter() to only include data that are in a group of at least 5. The function n() will help us count the number of rows in each group.
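To see what n() returns for each group before filtering on it, you can use it inside summarise(). This is only a sketch: lob_subset and rows_per_tree are names I made up, and lob_subset repeats the same row removal as above, starting from datasets::Loblolly so it runs on its own.

```r
library(dplyr)

# Same row removal as above, stored under a hypothetical new name
lob_subset <- datasets::Loblolly[-c(1, 2, 3, 4, 9, 10, 11, 17, 18, 22, 29, 30,
                                    34, 35, 47, 55, 56, 70, 82, 83), ]

# Count the number of rows (age classes) left for each tree
rows_per_tree <- group_by(lob_subset, Seed) %>%
  summarise(n_ages = n())
rows_per_tree
```

Trees with an n_ages of 5 or 6 are the ones that a filter of n() >= 5 keeps.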

# Filtering to include groups of at least 5
group_by(Loblolly, Seed) %>%
  filter(n() >= 5) %>%
  ungroup()

## # A tibble: 39 × 3
## height age Seed
## <dbl> <dbl> <ord>
## 1 3.91 3 307
## 2 9.48 5 307
## 3 25.7 10 307
## 4 50.8 20 307
## 5 59.1 25 307
## 6 4.32 3 315
## 7 10.4 5 315
## 8 27.2 10 315
## 9 40.8 15 315
## 10 51.3 20 315
## # … with 29 more rows

We see that the data set is greatly reduced, and trees like 301 and 303 have been removed because they have fewer than 5 age classes. We can also run the opposite filter and only include data that are in a group of less than 5.

# Filtering to include groups of less than 5
group_by(Loblolly, Seed) %>%
  filter(n() < 5) %>%
  ungroup()

## # A tibble: 25 × 3
## height age Seed
## <dbl> <dbl> <ord>
## 1 52.7 20 301
## 2 60.9 25 301
## 3 4.55 3 303
## 4 10.9 5 303
## 5 63.4 25 303
## 6 4.79 3 305
## 7 11.4 5 305
## 8 30.2 10 305
## 9 44.4 15 305
## 10 4.81 3 309
## # … with 15 more rows

Great! Now you’ve learned how to use the group_by() function along with several of the main dplyr functions: summarise(), mutate(), and filter(). I covered just a few ways you might use these functions; it’s up to you to play around with them and learn even more. And don’t forget to use ungroup()!

If you want to learn more about data wrangling with dplyr functions, you can check out our full course on the complete basics of R for ecology here:

Start learning R now


How to use R Markdown (part two): for learning R!

Tue, 15 Feb 2022 10:45:39 +0000

Welcome to part two of my blog series on R Markdown. In the first part, I went over how to create a basic R Markdown document and how to use R Markdown syntax. In this post, I’m going to talk about how you can use R Markdown to learn R.

So why is R Markdown good for learning R? As you saw in the first post, R Markdown is a method for typing normal and formatted text alongside your R code and its outputs. This is perfect for documenting your analyses by taking notes on specific chunks of code and writing down what worked or didn’t work. This same process is perfect for creating tutorials (like the one here!) and keeping track of what you learn. Eventually, the end goal is to have a series of R Markdown documents that cover all the topics and code that you learn, which include both the code and notes explaining what everything does. These documents then also serve as a guide that you can refer back to for troubleshooting or jogging your memory.

In other words, using R Markdown to learn R allows you to:

Have a project-based learning experience
Fully document your learning
Create a reference that you can look back on in the future if you get stuck or can’t remember something
Learn by teaching because you’re explaining things in your own words and taking notes
Create a teaching resource for yourself that you can then use to help others as well

You can also follow along with this post as a video if you click on the image below. Start the video at 35:35 to cover the material in this post.

Getting set up

To use R Markdown, you’ll need to have R and RStudio already installed. If you need help with downloading R and RStudio, you can check out my blog post and lessons one, two, and three of my online course.

You’ll also have to install two packages: rmarkdown and knitr. To do that you can run install.packages("rmarkdown") and install.packages("knitr"). You’ll only need to do this once for your computer (at least until the next time you update R).
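If you aren’t sure whether these packages are already on your computer, one option is to check before installing. This is just a sketch: the names needed and missing_pkgs are my own, and the install step is written so that it only runs in an interactive session.

```r
# Packages this tutorial relies on
needed <- c("rmarkdown", "knitr")

# Which of them are not installed yet?
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]
missing_pkgs

# Install only what is missing (skipped when run non-interactively)
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```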

If you are completely new to R and R Markdown, then I strongly suggest you start with my previous blog post on how to use R Markdown (after going through the three lessons linked above). It goes through all the most important tools in R Markdown.

With the basic software and packages installed, the first thing is to create a new RStudio project where you’ll be working on your R Markdown documents. RStudio projects are incredibly helpful for file organization and managing working directories, and they remove the need to use functions like setwd() and getwd(). You can read more about RStudio projects in our post on the subject here.

To learn more about RStudio Projects and why you should always use them, check out these three other great posts from other blogs:

Go to “File” and click on “New Project…”.

This will open a new window, where you’ll click on “New Directory” and then “New Project”. That should take you to the next window, where you can give your project a name, like “Learning R”, and choose somewhere to save it. Then hit “Create Project”, and you’ll have a new RStudio project.

Now that you’ve created your new project, anytime you want to work on it, all you have to do is just open the ‘.Rproj’ file, and RStudio will open up with the scripts you were working on (R Markdown documents in this case).

With this R project open, go to “File” » “New File” and click on “R Markdown…”

Give your R Markdown document a title and hit OK. I titled my document “The basics”.

Organizing your documents

One method is to create new R Markdown documents for every topic you cover. You might start off with one document that covers the basics, then the next one might cover how to upload data, and then you’ll have another one for data visualization, etc.

If you still want to create one large R Markdown document, or if you have sub-topics within your larger topics, you can add a table of contents to your document. Do this by going to the gear button at the top of the document and clicking on “Output Options”.

From there, a window will open up and you can check the box “Include table of contents”.

After that, you’ll notice that the text “toc: yes” appears at the top of your R Markdown document. When you knit your document, any headers you’ve added will appear in the table of contents at the top (like my header, “The basics”). The table of contents is clickable, so it will take you to wherever that section is in your document.

You can also number the table of contents and the section headings by checking that option in the “Output Options” window.

Numbering the sections can help your document become more clearly organized. If you add subsections, the document will take that into account when numbering. For example, my section 1 is called “The basics”. I made two subsections within “The basics”, called “Defining variables” and “Vectors”. “Defining variables” is given the number 1.1 and “Vectors” is given the number 1.2 because they’re nested under section 1. You can also see that in the table of contents, the subsections are tabbed in under their umbrella section to show that they’re nested.
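For reference, here is a rough sketch of what the document might look like once both options are switched on, written out from R so you can inspect it. The file name is an assumption, the title and section names are just the examples from this post, and the exact text RStudio writes into the header (for example “yes” versus “true”) may differ between versions.

```r
# Lines of a small R Markdown file with a numbered table of contents
rmd_lines <- c(
  "---",
  "title: \"The basics\"",
  "output:",
  "  html_document:",
  "    toc: yes",
  "    number_sections: yes",
  "---",
  "",
  "# The basics",
  "",
  "## Defining variables",
  "",
  "## Vectors"
)

# Save the file to a temporary folder
rmd_file <- file.path(tempdir(), "the_basics.Rmd")
writeLines(rmd_lines, rmd_file)

# Knit from the console, but only if rmarkdown and pandoc are available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

In the knitted document, “Defining variables” and “Vectors” should show up as 1.1 and 1.2 under section 1.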

Dealing with errors

I want to show you one more thing that I like to do when using R Markdown for learning. Sometimes we get errors that show up, and we aren’t sure how to resolve them. For example, in the code below, I get an error that says my variable can’t be found (this is because I haven’t created a variable called “my_number” yet).

When you have an error in your code, R Markdown won’t let you knit the document unless you’ve resolved the error. One thing you could do is to delete the problematic code, but then you might make the same mistake in the future. What you can do instead is copy and paste the error message and insert it in your code chunk as a comment. Then also comment out the code that caused the error, allowing you to knit the document. Then your code chunk might look something like this.
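As a sketch, a chunk with a commented-out error might look like the code below. The variable my_number comes from the example above; the calculation, the value 10, and the exact wording of the error message are assumptions on my part, and your own message may look a little different.

```r
# This line caused an error, so it is commented out until I figure it out:
# my_number + 5
# Error: object 'my_number' not found

# Possible fix: create the variable before using it
my_number <- 10
my_number + 5
```

knitr also has a chunk option, error = TRUE, that lets a document knit even when a chunk throws an error, but commenting out the code keeps your note visible right next to the code that caused the problem.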

Now, if you share your document with someone, they can see the error and help you resolve it. Or maybe you’ll come back to the document in the future, see your note, and figure it out on your own after leaving the code alone for a bit.

Learning workflow

To summarize, here is a workflow that you can follow for learning R with R Markdown:

Start by creating the empty R project and your first R Markdown document (making sure to clear out the example contents of the new R Markdown document). Also make sure to add in a table of contents if you plan on keeping it all in one longer document.
Then, as you follow through any tutorials (or online courses! 😄), start new section headings (using ‘#’s) and begin explaining the steps you take to complete the tutorial or lesson.
After each set of text or description, add in the associated R code chunk.
Knit your document often to see the changes you are making as a stand-alone document and to make sure there are no errors in your code.
Follow the steps above for dealing with errors as they come up.
Finally, refer back to your knitted HTML document as often as you need, or even print it out as a physical reference if that helps.

All of the tips that I included in this blog post are intended to help you document your learning process. Taking notes and analyzing your own code can help ensure that everything you’re learning is sticking in your head.

And that’s it for my R Markdown tutorial series! I hope you enjoyed these posts. Remember to keep adding to your documents as you learn—it will help you grasp new topics and can even turn the R learning curve into a fun project!

Have any cool R Markdown documents you’ve created? Share links in the comments below! 👇

If you liked this post and want to learn more, then check out my online course on the complete basics of R for ecology. The course is *perfectly* suited for creating your own R Markdown document as you follow along!

Start learning R now


How to use R Markdown (part one)

Wed, 09 Feb 2022 10:45:39 -0500

Today I’m excited to share a blog post on how to use R Markdown. R Markdown is a dynamic file format that allows you to make documents containing normal text alongside chunks of embedded R code. In fact, all of my blog posts are written using R Markdown, which is how I’m able to write text like this, write code, and even insert a chunk of code

like_this <- c("isn't", "this", "neat?")

R Markdown is useful for several reasons:

It’s great for reproducibility, where you can explain your analyses alongside your code and output so someone can follow along and replicate your work
It helps with accountability, because all your code and the exact corresponding outputs are knit together into the final document
It allows you to make tutorials like this one
Finally, you can use it for learning R by helping you keep track of your notes and thinking process all while creating a custom reference document (more on this in part two!)

This tutorial is the first post of a two-part series on R Markdown. Here, you’ll learn how to create R Markdown documents with different types of content, and in part two I’ll go into how you can use it for learning R.

You can also follow along with this blog post in video format if you click on the image below. This post covers material in the video up to 35:35. I’ll cover the rest of the video in part two next week.

Getting set up with R Markdown

To use R Markdown, you’ll need to have R and RStudio already installed. If you need help with that, you can check out my blog post and lessons one, two, and three of my online course. These resources show you how to get started with R and RStudio.

install.packages("rmarkdown")
install.packages("knitr")

Now that you have the basic software and packages installed, you can get started with using R Markdown!

The first thing you’ll do after opening RStudio is go to File » New File » R Markdown.

Then a new window will pop up where you can fill out the title of your new document, the author (um, your name? 😉), and the output format. You can choose between HTML, PDF, and Word. We’re going to choose HTML for now, since that’s the simplest option and all you really need. Then hit OK.

You’ll now have a new document that is filled in with a bunch of example content. At the top, in between the boundary lines (---), you’ll see a list of document parameters that should reflect what you entered in the previous window (title, author, output). You don’t need to change anything there. We’ll come back to this header section later.

The next section shows a code chunk that says “r setup”. This sets a bunch of code chunk parameters for the rest of the document. There’s no real need to include this, so we’ll delete it for now.

What you see now is the raw R Markdown content, which contains chunks of R code between chunks of regular text with Markdown formatting. But the true power of R Markdown is when you transform that text and code into a stand-alone document.

So how do we get from an R Markdown document in RStudio to the HTML document? You click on the “Knit” button (no need to click on the dropdown arrow). The lingo here is that the R Markdown document “knits” itself into an HTML document.

If you haven’t saved your document yet, you’ll be prompted to save it when you click “Knit”. You’ll notice that R Markdown files are saved as .Rmd files instead of .R files. Now that you’ve saved your document somewhere, it will automatically save itself every time you press “Knit” from now on.

The final HTML file will automatically display in the ‘Viewer’ panel (usually on the right).

If you explore the document a little, you can see that R Markdown really lets you do a lot. You can include different types of text formats, links, code chunks, and even plots.

Working on your own R Markdown document

Cool. Now let’s get started on creating our own R Markdown document. First let’s start with a blank slate. Go ahead and delete everything in the sample document so that all you have left is the parameter header. It’s important that you leave that in!

Try to type some text, whatever you want. If you press “Knit”, it should then show up in a knitted document, and your .Rmd file should be automatically saved. Anything you type will show up in the knitted document! Neat.

Learning R Markdown text formats

Now let’s explore different types of text formatting in R Markdown. To organize different sections of your report, you’ll want to add section headings or titles. You can write headings of different sizes by writing different numbers of pound signs (#) + a space + your text, like this:

As you can see, the more pound signs you add, the smaller the headings get.

You can also add bold and italicized text by surrounding text with asterisks (*). Using one asterisk gives you italicized text, using two gives you bold text, and using three gives you bold italicized text.

You can create numbered lists just using 1. 2. 3. (…) in front of your text.

And you can create bulleted lists using either hyphens (-) or asterisks (*) before your text.

You can add links by putting square brackets [these] around the word or phrase that you want to hyperlink, and then immediately put the link (with the https://) in parentheses after the square brackets, like this:
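To pull the text formats above together, here is a small sketch that collects one example of each into a character vector and prints it. The heading text, list items, and link are placeholders I picked for illustration; you could paste the printed lines into your own .Rmd file and knit it to see the result.

```r
# One example of each Markdown text format described above
md_lines <- c(
  "# Heading 1",
  "## Heading 2",
  "### Heading 3",
  "",
  "*italic text*, **bold text**, and ***bold italic text***",
  "",
  "1. First item",
  "2. Second item",
  "3. Third item",
  "",
  "- A bulleted item",
  "- Another bulleted item",
  "",
  "[R-bloggers](https://www.r-bloggers.com)"
)

# Print the lines exactly as they would be typed in the document
cat(md_lines, sep = "\n")
```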

Lastly, if you’re already an HTML wiz, you can also add any kind of HTML code to your document since the final document is HTML anyway, but I’m going to keep this tutorial simple and let you experiment with HTML on your own.

RStudio has an excellent cheat sheet that you can check out if you’re interested in learning more about what you can do with R Markdown. I just wanted to cover the essential features here, which is all you really need to know for creating most reports.

Learning how to embed code in R Markdown

Now that we’ve talked about how to format the text, let’s move on to embedding code!

You can add a code chunk by clicking on this button in the toolbar at the top of your screen:

That will add a code chunk, which looks like this:

You could also type out the code chunk boundaries yourself instead of pressing the button at the top, if you want. Those marks aren’t normal single quotes; they’re backticks ( ` ), found on the key under the escape key on a standard U.S. keyboard, usually paired with the tilde (~).

You can also use these backtick quotes to casually embed code in your text by using them like normal quotation marks, `like this`. This will turn the text into a little code snippet in the middle of your sentence.

Back to the code chunk. If you type within the code chunk, whatever you type will appear as if you are typing it in a normal R script. Then your output will look like a normal output in your console or plot viewer. I wrote a comment in my code chunk in the image below, but the cool thing about R Markdown is that you can put most of the commentary in your R Markdown text, so there’s no need to clutter the actual code with long explanation comments.
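Since the screenshots can be hard to copy from, here is a sketch of what a typed-out code chunk and an inline snippet might look like as raw text. I build the three-backtick boundary with strrep() only so that this example prints cleanly; in your own document you would simply type the backticks. The sentence and the code inside the chunk are placeholders.

```r
# Three backticks mark the start and end of a code chunk
fence <- strrep("`", 3)

chunk_lines <- c(
  "Regular text with inline code such as `mean(x)` in the middle.",
  "",
  paste0(fence, "{r}"),
  "# Code inside the chunk runs like a normal R script",
  "answer <- 2 + 4",
  "answer",
  fence
)

# Print the raw R Markdown text
cat(chunk_lines, sep = "\n")
```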

I’m going to embed some code now in the current R Markdown document (the one that I’m writing this blog post with). I created a variable called “answer” and loaded a data set called “cars” that comes with R. R actually comes with a whole bunch of premade data sets that you can look at if you type data() into the console.

# I'm writing some code here:
answer <- 2 + 4
# View the answer
answer

## [1] 6

# Let's load some data
my_data <- cars
# View the first few rows of my data
head(my_data, 3)

## speed dist
## 1 4 2
## 2 4 10
## 3 7 4

You’ll notice that R Markdown has split up the code chunk into different boxes each time there’s a piece of code that prints an output. The code is contained within light grey boxes, and the output is printed in white boxes. This just helps keep things organized so you can see what output goes with what code.

Let’s briefly explore another, related element of R Markdown: displaying plots. We’ll plot car speed as a function of distance (Y as a function of X).

# Plotting speed vs. distance from the cars data set
plot(my_data$speed ~ my_data$dist)

Awesome! We can see our code and our plot output.

Running code in R Markdown

I’ve been pressing the “Knit” button to see the output of my code, but you can also run your code in RStudio as if you’re doing it in a normal R script. You can just put your cursor wherever you want and then press command + return on a Mac, or control + Enter on a PC.

As you run the code in R Markdown, the output will appear below your code chunk:

Running your code within the code chunk first (instead of knitting it) is especially useful if you want to work through any errors, since the error messages will be easier to understand.

Note: if you’re running code directly within code chunks, you have to run all of the code in the correct order, just like in a normal R script. This will ensure that all your variables and packages are loaded when you need them later on in your code.

For example, we refer to a variable called my_data in our plot code. If we’re running our code manually and try to run the plot code without creating the my_data variable first, we’re going to get an error. We have to run my_data <- cars before we run plot(my_data$speed ~ my_data$dist) for the code to work. Luckily, you don’t have to worry about this when you’re knitting your document because knitting runs all of the code in order for you.

There’s also a neat trick you can use to make sure you’ve run all the necessary code and prevent errors. Pressing the button in the image below will run all of the code up to the chunk that you’re on, so you don’t have to manually go line by line or chunk by chunk.

Changing your R Markdown theme

One last thing you can do to make your document look nice is to change the theme. You can click on the gear icon in the toolbar at the top, and select “Output Options…”

A window will open up, from which you can do things like change the theme of your HTML document. If you go to “Apply theme” and select the dropdown menu, you’re given a list of different themes to choose from. Changing the theme will do things like change the fonts and colors that are displayed. You can play around with the themes to see what you prefer, just remember to press “Knit” to process the theme change.

If you’re interested in HTML and CSS, you can also apply your own CSS file to change the style of the document. Again, we’re going to keep it simple in this blog post—you can explore CSS on your own but please comment down below if you have any cool style sheets/themes for your R Markdown documents.

In the same way that you can change the theme of your document, you can also change the syntax highlighting. That changes how your code looks when it’s embedded in the document. For example, the image below shows the “zenburn” option.

Now it’s time for what I think is the most useful addition. The “Output Options” window also allows you to include an interactive, clickable table of contents for your document. This is especially useful for larger documents with multiple sections. The table of contents in your document will be based off of the different headings that you use, with smaller heading sizes nested within larger ones.

You’ll notice that headings 4, 5, and 6 aren’t included in the table of contents. You can change this if you want in the “Output Options” window, where it says “depth of headers for table of contents”. If you set the depth of the headings to 6, then the table of contents will display headings all the way up to heading level 6.

Once you make these changes, you’ll notice that they have also been added to the header section at the top of your document. This means that once you familiarize yourself with the themes, you can type this information into the header yourself. I showed you how to do it via “Output Options” because we didn’t know what the different themes were called, nor what our options were.
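As a sketch of what typing this information yourself might look like, the snippet below prints a header with a theme, syntax highlighting, and a table of contents that goes down to heading level 6. “zenburn” is the highlighting option mentioned above; the “flatly” theme, the title, and the file name are assumptions I picked for illustration, so swap in whatever you prefer.

```r
# Header lines with theme, highlighting, and table of contents settings
header_lines <- c(
  "---",
  "title: \"My document\"",
  "output:",
  "  html_document:",
  "    theme: flatly",
  "    highlight: zenburn",
  "    toc: yes",
  "    toc_depth: 6",
  "---"
)

# Print the header so it can be copied into the top of an .Rmd file
cat(header_lines, sep = "\n")

# Or save it as the start of a new document in a temporary folder
themed_file <- file.path(tempdir(), "themed_document.Rmd")
writeLines(header_lines, themed_file)
```

The indentation matters here: the options have to be nested under html_document for them to be picked up when you knit.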

One quick pointer for creating an R Markdown document is to end your document with the code sessionInfo(). This will show the information about your current R session, including the version of R you’re using, the operating system, and the packages you have loaded. This is important because packages and software get updated over time, so certain aspects of your code might not work in the same way in the future, depending on what versions you’re using. If you know the version information for how the code was originally run, you can download older versions of R and the associated packages, or at least track down where an error stems from (and how to fix it in the code). In essence, including your session info can help ensure reproducibility in the future.

sessionInfo()

## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] bookdown_0.31 digest_0.6.31 R6_2.5.1 lifecycle_1.0.3
## [5] jsonlite_1.8.4 magrittr_2.0.3 evaluate_0.19 highr_0.9
## [9] blogdown_1.16 stringi_1.7.8 cachem_1.0.6 rlang_1.0.6
## [13] cli_3.4.1 rstudioapi_0.14 jquerylib_0.1.4 bslib_0.4.2
## [17] vctrs_0.5.1 rmarkdown_2.19.1 tools_4.2.2 stringr_1.5.0
## [21] glue_1.6.2 xfun_0.35 yaml_2.3.6 fastmap_1.1.0
## [25] compiler_4.2.2 htmltools_0.5.4 knitr_1.41 sass_0.4.4

And that’s it for our basic R Markdown tutorial! You learned how to create an R Markdown document, how to apply different types of text formats, how to embed code inline or in code chunks, and how to stylize your final R Markdown document. Our next blog post will be about how to use R Markdown to learn R, so keep your eyes peeled for a Part Two.

Have any cool R Markdown documents you’ve created? Share links in the comments below! 👇

If you liked this post and want to learn more, then check out my online course on the complete basics of R for ecology:

Start learning now


Introduction to missing data (NAs) in R

Tue, 01 Feb 2022 09:45:39 -0500

As many of us know, science is not a perfect process. Maybe you can’t get out in the field on a certain day. Maybe you can only sample a portion of what needs to get done. Or maybe you’re downloading public data sets and they aren’t lining up perfectly. All of these can result in missing data, which can be a real pain when it comes time for analysis.

Another common source of missing data, especially when recording species abundance data in community ecology, is when you forget to write a ‘0’ and instead leave the entry blank. In the moment you might know that blank entries mean zero, but give it just a few weeks and you’ll be scratching your head! In those cases it’s often best to label those entries as unknown or missing.

In this tutorial, I’m going to explain what exactly an NA value is, how you can find NAs in your data, and how you can remove them.

What does it mean to have NAs in my data?

NAs represent missing values in R. This is pretty common if you’re importing data from Excel and have some empty cells in the spreadsheet. When you load the data into R, the empty cells will be populated with NAs.

Note: missing data points, or those where you don’t actually know what the true value should be, are marked as NA (which stands for ‘Not Available’) in R. In fact, you’ll notice the color change when you type NA in your code since R already knows what that means.

# Read in an example data set with NAs
ex <- read.csv("example_data.csv")
# View data
ex

## example data set
## 1 1 2 4
## 2 NA 2 4
## 3 16 1 4
## 4 2 NA 5
## 5 3 1 NA
## 6 6 7 8

Click here to download the example_data.csv file if you want to follow along.

NAs cannot be treated like other types of data (e.g., strings, numeric values). For example, you can’t perform math with them or use them in logical comparisons. If you do so, all you’ll get is an NA. In the following examples, all positions in the vector with NA just return NA again, no matter what operation is performed. We also get NA if we use mathematical functions such as sum() on the vector, because R can’t add NAs.

# Create a vector with NAs
v <- c(1.2, 4.5, NA, 8.9, NA)
# Can we do math with NAs?
v + 1

## [1] 2.2 5.5 NA 9.9 NA

sum(v)

## [1] NA

# Can we perform logical comparisons?
v < 7

## [1] TRUE TRUE NA FALSE NA

v == 4.5

## [1] FALSE TRUE NA FALSE NA

And the reason of course is simple… What’s the answer to 5 + 'some unknown number' ?

Have you figured it out yet?

The answer is 'some unknown number'! 😄

Thus: 5 + NA = NA

How can I detect NAs in my data?

So how can we see if we have NAs in our data? We normally use == to see if a value is equal to another one. Let’s see if that will work on our vector. We know that there’s an NA in the 3rd position of our vector.

# Create a vector with NAs
v <- c(1.2, 4.5, NA, 8.9, NA)

So theoretically, v == NA should return FALSE FALSE TRUE FALSE TRUE.

# Are there any NAs in our vector?
v == NA

## [1] NA NA NA NA NA

But this code just gives us NAs. Unfortunately, NAs don’t work with comparison operators like == either.

Just as with math operations, NA is a placeholder for 'I don't know the real value'. Asking whether NA == NA is the same as asking whether 'some unknown number' == 'some unknown number', which clearly has no known answer.

Luckily, R gives us a special function to detect NAs: is.na(). In fact, if you type my_vector == NA in RStudio, the code diagnostics will suggest using is.na() instead.

is.na() works on individual values, vectors, lists, and data frames. It returns TRUE wherever you have an NA and FALSE wherever you don’t.

# Which values in my vector are NA?
is.na(v)

## [1] FALSE FALSE TRUE FALSE TRUE

# Which values in my data frame are NA?
is.na(ex)

## example data set
## [1,] FALSE FALSE FALSE
## [2,] TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] FALSE TRUE FALSE
## [5,] FALSE FALSE TRUE
## [6,] FALSE FALSE FALSE

You can also combine is.na() with sum() and which() to figure out how many NAs you have and where they’re located.

# How many NAs in my data frame?
sum(is.na(ex))

## [1] 3

# Which row contains an NA in the 'data' column?
which(is.na(ex$data))

## [1] 4

# Which vector positions contain NAs?
which(is.na(v))

## [1] 3 5

Note: the reason sum(is.na(ex)) works is because is.na() first converts your values to TRUE or FALSE, and applying math operations to T/F values automatically converts them to 1s or 0s.
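The same trick extends to counting NAs per column or per row. Here is a small sketch; ex_demo is a stand-in for the example data set that I typed in by hand, so you can run this without downloading the .csv file.

```r
# Stand-in for the example data set, typed in by hand
ex_demo <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Number of NAs in each column
colSums(is.na(ex_demo))

# Number of NAs in each row
rowSums(is.na(ex_demo))

# Row and column position of every NA
which(is.na(ex_demo), arr.ind = TRUE)
```

This can be handy with larger data sets, where a single total doesn’t tell you which columns are causing the trouble.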

How do I remove NAs from my data?

Now that we know we have NAs in our data… how do we get rid of them?

Some functions have an easy built-in argument, na.rm, which you can set to TRUE to remove NAs from the data before it is evaluated. If you remember the example from earlier, just running sum(v) returned NA. Adding na.rm = TRUE fixes this:

# Sum across vector v
sum(v, na.rm = TRUE)

## [1] 14.6

# Take the mean of our vector v
mean(v, na.rm = TRUE)

## [1] 4.866667

Note that deciding whether to remove or replace missing values, rather than leaving them as they are, is both a technical and a philosophical question that should be addressed on a case-by-case basis. There are statistical methods for replacing missing values without biasing the outcome of analyses (e.g., in multivariate ordination analyses). Many statistical tests in R will automatically remove NA values, but in other cases it makes more sense to remove them manually. This goes beyond the scope of this post, but it is important to keep in mind.

If you want to remove all observations containing NAs, you can also use the na.omit() function. Keep in mind that removing an observation means removing the entire row of data.

# remove NAs from our data frame
na.omit(ex)

## example data set
## 1 1 2 4
## 3 16 1 4
## 6 6 7 8
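Before dropping rows, it can help to see exactly which ones would be removed. One way to do that is with complete.cases(), sketched below. As before, ex_demo is a hand-typed stand-in for the example data set.

```r
# Stand-in for the example data set, typed in by hand
ex_demo <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# TRUE for rows that have no NAs at all
keep_rows <- complete.cases(ex_demo)
keep_rows

# Rows that na.omit() would drop
ex_demo[!keep_rows, ]

# Rows that na.omit() would keep
ex_demo[keep_rows, ]
```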

Something else you might want to do is replace those NAs with another value. Maybe you want to replace missing values with 0 (You’re 200% sure those missing values were supposed to be 0s?? 😄), or maybe you want to replace those missing values with the mean of your data to approximate what those values would be (that can be especially useful for multivariate analyses). You can subset your vector or data frame to the places where is.na() is true, and set those equal to a new value.

# Replace NAs in data frame with 0
ex[is.na(ex)] <- 0
# View data frame
ex

## example data set
## 1 1 2 4
## 2 0 2 4
## 3 16 1 4
## 4 2 0 5
## 5 3 1 0
## 6 6 7 8

# Replace NAs in vector with the mean
v[is.na(v)] <- mean(v, na.rm = TRUE)
# View vector
v

## [1] 1.200000 4.500000 4.866667 8.900000 4.866667
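If you want to do the same thing for a data frame, you probably want each column’s NAs replaced with that column’s own mean. Here is one possible sketch using a simple loop; ex_demo is again a hand-typed stand-in for the example data set, and ex_filled is a name I made up for the result.

```r
# Stand-in for the example data set, typed in by hand
ex_demo <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Work on a copy so the original data stay untouched
ex_filled <- ex_demo

# Replace the NAs in each column with the mean of that column
for (col in names(ex_filled)) {
  is_missing <- is.na(ex_filled[[col]])
  ex_filled[[col]][is_missing] <- mean(ex_filled[[col]], na.rm = TRUE)
}

# View the filled-in data frame
ex_filled
```

Working on a copy means you can always go back and compare against the original values.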

Awesome! Now you know how to find NAs in your data, perform functions without letting NAs get in the way, and remove NAs from your data for further analysis. Soon these functions will come to you NAturally…haha. I hope you found this tutorial helpful. Happy coding!

P.S. I’d recommend listening to this song to put you in the NA-removing mood!

If you liked this post and want to learn more, then check out our online course on the complete basics of R for ecology:

Start learning now


How to join tables in R

Wed, 26 Jan 2022 11:45:39 -0500

In this blog post, I’m going to talk about joining data tables together. Joining tables is incredibly useful when you have to download several data files on a common set of subjects and then combine them into a larger, single data set.

This is pretty common with spatial data. For example, you might have one table that contains geographic information on parcels of land like census tracts, each with their own ID. You can then find separate demographic or economic data tables online that can link up with the geographic data using the census tract ID.

Another common example is if you collected community survey data from plots, but then also have associated environmental data collected from those same plots saved as a different spreadsheet of data.

These kinds of situations would call for you to merge, or join, your two data tables together. In this tutorial, I’m going to introduce you to different types of joins, and I’ll show you how to perform joins both in base R and using the dplyr package.

Joining data in base R

We’re going to start with a basic data set. These data contain 6 different students and the distance of their morning commute to school, in miles.

# Create a data frame with information on where students live
set.seed(123)
student_residence <- data.frame(student = seq(1, 6),
                                distance = runif(6, 3, 10))
# Look at the data
head(student_residence)

## student distance
## 1 1 5.013043
## 2 2 8.518136
## 3 3 5.862838
## 4 4 9.181122
## 5 5 9.583271
## 6 6 3.318895

The runif() function creates a random assortment of numbers between a minimum and maximum value that you specify. I asked runif() to generate 6 random numbers between 3 and 10. The set.seed() function just makes it so that each time you run this code, the random output will always be the same (when using the same seed number). Use set.seed(123) if you’d like to follow along with the same numbers I have here.
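If you’d like to convince yourself that set.seed() really does this, here is a quick sketch. The names first_draw and second_draw are just for illustration.

```r
# Draw 6 random numbers between 3 and 10 with a fixed seed
set.seed(123)
first_draw <- runif(6, min = 3, max = 10)

# Reset the same seed and draw again
set.seed(123)
second_draw <- runif(6, min = 3, max = 10)

# Are the two draws exactly the same?
identical(first_draw, second_draw)

# Do all values fall between the minimum and maximum?
range(first_draw)
```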

Students at this school were also surveyed to find out what method of transportation they use to get to school in the morning. This survey was offered to several students, but not everyone responded (looks like only students 1, 3, 5, and 7 responded). Note that in this scenario we somehow don’t have data on commute distance for student 7.

# Create another data frame with information on how students get to school
student_transport <- data.frame(student = seq(1, 7, by = 2),
transport = c("Bus", "Carpool", "Walk", "Bus"))
# Look at the data
head(student_transport)

## student transport
## 1 1 Bus
## 2 3 Carpool
## 3 5 Walk
## 4 7 Bus

Let’s say we want to look at both student transportation methods and morning commute distance so we can create a better bus schedule. It’s tough to do that when transportation method and commute distance are in different data sets, so we want to join them together.

Note: I’m using the terms ‘table’ and ‘data frame’ interchangeably here.

Importantly, to join two different tables together, you need to make sure you have a column in common between both data sets. This common column is called a “key”, and it should provide a unique identifier for every row. In the case of our data, the “student” column is our key, and it provides a unique number for each student.
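Before joining, it can be worth checking that the key really is unique in each table. The sketch below is one way to do that in base R; key_is_unique() is a small helper written for this example, not a built-in function.

```r
# Recreate the two example data frames
set.seed(123)
student_residence <- data.frame(student = seq(1, 6),
                                distance = runif(6, 3, 10))
student_transport <- data.frame(student = seq(1, 7, by = 2),
                                transport = c("Bus", "Carpool", "Walk", "Bus"))

# Hypothetical helper: anyDuplicated() returns 0 when no value is repeated
key_is_unique <- function(df, key) {
  anyDuplicated(df[[key]]) == 0
}

key_is_unique(student_residence, "student")
key_is_unique(student_transport, "student")

# Keys that appear in both tables
shared_keys <- intersect(student_residence$student, student_transport$student)
shared_keys
```

If a key is repeated in one of the tables, a join will repeat the matching rows from the other table, which is an easy way to end up with more rows than you expected.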

To join our data, we can use the merge() function in base R. merge() will first accept two data frames as arguments, and then the name of the column that the two data frames have in common, like so: merge(x = dataframe1, y = dataframe2, by = "column name"). With our data, this would look like:

# Merge data frames together
students <- merge(x = student_residence, y = student_transport, by = "student")

If we compare the values for student 1 in the new and old data sets, the values are the same. Great! Looks like the merge worked.

# Compare the data to see if the merge worked
head(students)

## student distance transport
## 1 1 5.013043 Bus
## 2 3 5.862838 Carpool
## 3 5 9.583271 Walk

student_residence[1, 2]

## [1] 5.013043

student_transport[1, 2]

## [1] "Bus"

But what if the common columns that we want to merge by don’t have the same name? Let’s change the name of the “student” column in student_transport to “studentID” instead.

# Rename the key column in student_transport
colnames(student_transport)[1] <- "studentID"
# View data
head(student_transport)

## studentID transport
## 1 1 Bus
## 2 3 Carpool
## 3 5 Walk
## 4 7 Bus

If this is the case, we can still use the merge() function with the names of two data frames, but instead of using one “by” argument, we’re going to use two, the by.x and by.y arguments, like so: merge(x = dataframe1, y = dataframe2, by.x = "dataframe1 column", by.y = "dataframe2 column").

# Try the merge again
students2 <- merge(x = student_residence, y = student_transport, by.x = "student", by.y = "studentID")
# Change the column name back so the later examples can still join by "student"
colnames(student_transport)[1] <- "student"
# Compare this new data set to the old one
head(students2)

## student distance transport
## 1 1 5.013043 Bus
## 2 3 5.862838 Carpool
## 3 5 9.583271 Walk

head(students)

## student distance transport
## 1 1 5.013043 Bus
## 2 3 5.862838 Carpool
## 3 5 9.583271 Walk

The data sets look the same, so we know both methods worked.

Types of Joins

Inner join

You probably noticed that in the join we just performed, there were only three rows in the joined table. That’s because we performed something called an “inner join”, where R only returns the data frame rows that match up with the other data frame. If you were to visualize this type of join, it would look something like this:

Left join

There are also “left” joins and “right” joins. A left join returns all rows from the left data frame and any matching rows from the right data frame. In the merge() function, the “left” data frame is the x data frame, or the one you name first. The “right” data frame is the y data frame, or the one you list second. We can tell merge() that we want to keep all rows from the “left” data frame by adding the argument all.x = TRUE. If we’re more interested in where students live, we’ll want to keep all the rows from student_residence. Let’s go ahead and do that:

# Perform a left join
merge(x = student_residence, y = student_transport, by = "student", all.x = TRUE)

## student distance transport
## 1 1 5.013043 Bus
## 2 2 8.518136 <NA>
## 3 3 5.862838 Carpool
## 4 4 9.181122 <NA>
## 5 5 9.583271 Walk
## 6 6 3.318895 <NA>

We can see that indeed, all the rows from student_residence have been kept. Since student_transport was missing some of the student records, there are NAs in the table where the join operation couldn’t find a match for the student. The image below visualizes what a left join would look like.
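To list which students had no match, you can look for the NAs that the left join introduced. This is a minimal sketch; left_joined and no_survey are names made up for the example, and it assumes the transport column had no missing values before the join.

```r
# Recreate the two example data frames
set.seed(123)
student_residence <- data.frame(student = seq(1, 6),
                                distance = runif(6, 3, 10))
student_transport <- data.frame(student = seq(1, 7, by = 2),
                                transport = c("Bus", "Carpool", "Walk", "Bus"))

# Left join: keep every row of student_residence
left_joined <- merge(x = student_residence, y = student_transport,
                     by = "student", all.x = TRUE)

# Students whose transport is NA had no matching row in student_transport
no_survey <- left_joined$student[is.na(left_joined$transport)]
no_survey
```

If your original data already contain NAs in that column, this check can't tell the two cases apart, so compare the key columns directly instead.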

Right join

A right join is the mirror image of a left join: it returns all rows from the right data frame and any matching rows from the left data frame. Instead of specifying all.x, we’ll use the argument all.y = TRUE. If we’re more interested in student transportation methods, we’ll want to keep all the rows from student_transport.

# Perform a right join
merge(x = student_residence, y = student_transport, by = "student", all.y = TRUE)

## student distance transport
## 1 1 5.013043 Bus
## 2 3 5.862838 Carpool
## 3 5 9.583271 Walk
## 4 7 NA Bus

Now, we have all the rows from student_transport. Again, there’s an NA where the join operation couldn’t find a match for the student in the other data frame. The image below visualizes what a right join does.

Full join

The last type of join is called a “full join” (or “outer join”) which includes all the rows from both data frames, whether or not they match with one another. We can specify this by including both the all.x and all.y arguments.

# Perform a full join
merge(x = student_residence, y = student_transport, by = "student", all.x = TRUE, all.y = TRUE)

## student distance transport
## 1 1 5.013043 Bus
## 2 2 8.518136 <NA>
## 3 3 5.862838 Carpool
## 4 4 9.181122 <NA>
## 5 5 9.583271 Walk
## 6 6 3.318895 <NA>
## 7 7 NA Bus
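As a side note, merge() also has an all argument that sets all.x and all.y at the same time. The sketch below (full_long and full_short are illustrative names) checks that both spellings give the same full join:

```r
# Recreate the two example data frames
set.seed(123)
student_residence <- data.frame(student = seq(1, 6),
                                distance = runif(6, 3, 10))
student_transport <- data.frame(student = seq(1, 7, by = 2),
                                transport = c("Bus", "Carpool", "Walk", "Bus"))

# Full join written out with both arguments
full_long <- merge(x = student_residence, y = student_transport,
                   by = "student", all.x = TRUE, all.y = TRUE)

# Full join using the all shorthand
full_short <- merge(x = student_residence, y = student_transport,
                    by = "student", all = TRUE)

# Both approaches return the same table
identical(full_long, full_short)
```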

Joining data using the dplyr package

I just demonstrated how to join tables in base R, but many of you are probably also familiar with the dplyr package. dplyr provides a convenient way to perform the different types of joins using the functions inner_join(), left_join(), right_join(), and full_join(). All of these functions take the form XXX_join(dataframe1, dataframe2, by = "column name"), and you don’t need to add anything else like all.x or all.y because the type of join is already built into each function. I’ll quickly demonstrate how to use these functions below:

# Load package
library(dplyr)

# Inner join
inner_join(student_residence, student_transport, by = "student")

## student distance transport
## 1 1 5.013043 Bus
## 2 3 5.862838 Carpool
## 3 5 9.583271 Walk

# Left join
left_join(student_residence, student_transport, by = "student")

## student distance transport
## 1 1 5.013043 Bus
## 2 2 8.518136 <NA>
## 3 3 5.862838 Carpool
## 4 4 9.181122 <NA>
## 5 5 9.583271 Walk
## 6 6 3.318895 <NA>

# Right join
right_join(student_residence, student_transport, by = "student")

## student distance transport
## 1 1 5.013043 Bus
## 2 3 5.862838 Carpool
## 3 5 9.583271 Walk
## 4 7 NA Bus

# Full join
full_join(student_residence, student_transport, by = "student")

## student distance transport
## 1 1 5.013043 Bus
## 2 2 8.518136 <NA>
## 3 3 5.862838 Carpool
## 4 4 9.181122 <NA>
## 5 5 9.583271 Walk
## 6 6 3.318895 <NA>
## 7 7 NA Bus

# Inner join when the key columns have different names
colnames(student_transport)[1] <- "studentID"
inner_join(student_residence, student_transport, by = c("student" = "studentID"))

## student distance transport
## 1 1 5.013043 Bus
## 2 3 5.862838 Carpool
## 3 5 9.583271 Walk

These joins should look the same as the ones demonstrated above using the merge() function. And now you know how to perform several types of join operations depending on which rows you need to retain!

I hope this tutorial was helpful! Let us know what other tutorials you’d like to see in the comments below. 👇

If you liked this post and want to learn more, then check out our online course on the complete basics of R for ecology:

Start learning now

Also be sure to check out R-bloggers for other great tutorials on learning R

The Basics of R (in Spanish!)

Wed, 19 Jan 2022 09:45:39 -0500

(Want more details about the course in Spanish? Scroll down.)

Hello everyone! This blog post is a bit different from usual posts in that I’d like to make a very exciting announcement about an upcoming course launch.

Part of my vision with R for Ecology is to make R as accessible as possible to as many people as possible, especially ecologists and other scientists. Understanding how to work with, organize, visualize, and analyze data is essential for doing good science. To that end, I’m very fortunate to have partnered with a fantastic biologist and ecologist from Argentina named Joaquin Cochero, who has done an outstanding job translating my entire Basics of R (for ecologists) course into Spanish!

Dr. Joaquin Cochero (Spanish-speaking R guru)

Having worked in Galapagos, Ecuador for the last four years, I’ve made many Spanish-speaking friends and colleagues who have long requested that I make my course available in their native tongue. I’m very excited to say that it’s finally almost here.

Without any further ado and for those that are not familiar with my original course in English, here is some more information about the course (the original announcement is in Spanish, of course 😄). And if you are interested in enrolling, there’s a link at the bottom of this post to pre-register for when enrollment opens.

Lo esencial de R (para ecólogos)

This is the course I wish I had had when I started out as a graduate student. You can spend a lot of time designing experiments and collecting data and then have no idea how to explore or visualize those data.

The R learning curve doesn’t have to be so long and difficult. In this course, I have carefully selected the key topics and functions that will help you master the fundamentals and get over the curve quickly and with confidence, even if you are a complete beginner.

In fact, with just a few functions and methods in R you can do at least 80% of all the data manipulation and visualization you will need to do in ecology. This course focuses on those key concepts so that your learning experience is as efficient as possible.

A brief overview of the syllabus:

Welcome to the course: Welcome and download of the course materials.
Introduction to R and RStudio: Installation and the very basics.
Vectors and data frames: All the essentials on how to start working with numbers in R when they come in the form of data sets.
Loading data: How to load and access your own data in R and RStudio.
Basic data visualization: The most common types of plots and how to create them to visualize your data.
Basic data handling: Learn all the essentials for organizing your data sets and preparing them for visualization or analysis, including how to use the ‘dplyr’ package for powerful and efficient data cleaning and organization.
Advanced data handling: Fill your tool belt with additional tools and techniques for doing almost anything with your data, from joining different data sets, to dealing with missing data, to working with date formats.
Project organization: In this final section I go over how you can organize your R projects for an efficient and powerful workflow.
Conclusion

It is also important to know what this course does NOT cover:

To cover the fundamentals of R effectively, I can’t cover everything. So this course does NOT cover:

Statistical analysis or data modeling
GIS or spatial visualization
Advanced visualization topics such as using ‘ggplot2’

I think these topics are relatively easy to dig into once the fundamentals are in place, and I plan to expand on these and other topics in future courses. Trying to cover all of this in your first course is not necessary (and would add a lot of stress to the learning curve).

To answer some frequently asked questions:

When does the course start and finish? The course is a completely self-paced online course, so you decide when you start and when you finish.
How long will I have access to the course? How does lifetime access sound? After enrolling, you will have unlimited access to this course for as long as you like, on any device you own.
What if I’m not an ecologist? Is the course still relevant? Yes. Although the course is based on my own experience using R for ecology, all of the course content will be applicable and relevant to most other fields of biology, if not many fields outside the sciences as well. The course also uses ecological data sets, but the principles are mostly universal.
But what if I don’t know anything about R or statistics? No problem! This course is designed as the first step for anyone interested in learning to use R, and the course content assumes no prerequisites.

If you are interested in Lo esencial de R (para ecólogos) (in Spanish!), just click below to pre-register for the upcoming launch!

Pre-registration page

Making your first plot in R

Wed, 05 Jan 2022 09:45:39 -0500

With the new year, I’m hoping more of you take up learning R, so with that I want to share a tutorial from my course on an introduction to data visualization with R to help get you started.

If you are completely new to R and don’t even know where to start, check out my last post on installing R and RStudio here.

In this tutorial I’ll teach you how to create a scatterplot using base R, meaning the functions that come with every R installation (no need to install any additional packages).

You can also follow along with this blog post in the video tutorial that is part of my course if you click on the thumbnail below:

To start with, we’re going to use some data that’s built into R using the data() function to access it:

# Load data
data(PlantGrowth)
# Look at the beginning and ending 4 rows of data
head(PlantGrowth, 4)

## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl

tail(PlantGrowth, 4)

## weight group
## 27 4.92 trt2
## 28 6.15 trt2
## 29 5.80 trt2
## 30 5.26 trt2

It looks like we have 30 rows of data and two columns. One column is called “weight”, which represents the dry biomass of each plant in grams. The other column is called “group”, and describes the experimental treatment that each plant is given.

We can also see that there are ten plants in each treatment group. Note that I used $ after the name of the data set to refer to the ‘group’ column in this case:

# View number of rows per treatment group
table(PlantGrowth$group)

##
## ctrl trt1 trt2
## 10 10 10

Let’s add another column to this data set that describes the amount of water that each plant has received throughout its life (in liters). You can just copy and paste these numbers from the code here:

# Add a new column
PlantGrowth$water <- c(3.063, 3.558, 2.233, 3.147, 2.379, 2.106, 2.384, 2.444, 2.492, 3.292,
2.732, 2.153, 2.660, 1.938, 3.583, 1.817, 3.494, 2.559, 1.530, 2.372,
3.176, 2.611, 3.262, 2.947, 2.523, 2.152, 2.771, 2.878, 2.263, 2.518)

And now if we view our data, we can see that the new column was added.

# View first few rows of data 
head(PlantGrowth)

## weight group water
## 1 4.17 ctrl 3.063
## 2 5.58 ctrl 3.558
## 3 5.18 ctrl 2.233
## 4 6.11 ctrl 3.147
## 5 4.50 ctrl 2.379
## 6 4.61 ctrl 2.106

For our first plot, let’s create a scatterplot to see how plant weight varies with the amount of water that the plant has received.

To do this, we’re going to use the plot() function, where you can assign variables to the X and Y axes. Since we want to see how weight varies as a function of water, we’ll put weight on the Y axis and water on the X axis. Remember that we use the dollar sign $ to reference a specific column in a data set.

# Our first plot!
plot(x = PlantGrowth$water, y = PlantGrowth$weight)

And that’s our first plot! You can make the plot smaller or larger by just moving the plot viewing window around.

There’s also another way to use the plot() function, and this method is generally considered the better practice (and will translate to other types of data visualization and analysis techniques).

As we said before, we visualize relationships between the X and Y axes by viewing the Y variable “as a function of” X. If we’re talking in terms of experimental design, the Y axis is the dependent variable (the variable you measure), and the X axis is the independent variable (the variable you control or want to examine the effect of).

The shorthand for “as a function of” is the ~ symbol, or the tilde. On most US keyboards, the tilde key sits just below the Escape key, and you usually have to hold Shift down to type it.

So if we use this with the plot() function, we would just write:

# Plotting plant weight as a function of the amount of water it received
plot(PlantGrowth$weight ~ PlantGrowth$water)

In plain English, we are plotting plant weight as a function of the amount of water it has received. This plot looks exactly the same as the plot that we made earlier, as it should.

We can also make this code simpler by adding another argument to the function. If we specify the data that we want to use, we can just use the column names directly instead of typing out the whole phrase PlantGrowth$water, like so:

plot(weight ~ water, data = PlantGrowth)

So now, the axis labels look much nicer because they just say “weight” and “water” instead of having “PlantGrowth$” in front of both words. Voilà! We now have a basic scatterplot.
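If you want to go one small step further, plot() also accepts the xlab, ylab, and main arguments for custom axis labels and a title. The sketch below rebuilds the data from this post so it runs on its own; the label text is just a suggestion based on the units described above.

```r
# Load the built-in data and add the water column used in this post
data(PlantGrowth)
PlantGrowth$water <- c(3.063, 3.558, 2.233, 3.147, 2.379, 2.106, 2.384, 2.444, 2.492, 3.292,
                       2.732, 2.153, 2.660, 1.938, 3.583, 1.817, 3.494, 2.559, 1.530, 2.372,
                       3.176, 2.611, 3.262, 2.947, 2.523, 2.152, 2.771, 2.878, 2.263, 2.518)

# Same scatterplot, now with custom axis labels and a title
plot(weight ~ water, data = PlantGrowth,
     xlab = "Water received (L)",
     ylab = "Dry weight (g)",
     main = "Plant weight as a function of water")
```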

In summary, we learned:

How to load built-in data and add our own custom data as another column in the data set
How to plot a simple scatterplot in base R using the plot() function
How to use a tilde in the plot() function to make the code neater

Best of luck making your first plots using your own data! I hope this tutorial was helpful.

If you liked this post and want to learn more, then check out my online course on the Introduction to Data Visualization with R (for ecologists):

Start visualizing your data now

How to install (and update!) R and RStudio

Sat, 01 Jan 2022 05:28:39 -0500

One of the first steps to learning R is to have it downloaded and installed on your computer. In this post I’ll show you how to do that and how to download and install RStudio—a key tool for using R, and how I do all my work and tutorials.

If you want to follow along with a video tutorial, you can click on the image below where you can watch the first lesson in my full course on the Basics of R (for ecologists).

For starters, R is a free open-source programming language used for organizing, analyzing, and visualizing data. Its versatility is highlighted by the large number of user-created packages available for it (e.g., on CRAN), which provide useful functions and guides that anyone can use. So R is the programming language itself, and it comes with an environment or console that can read and execute your code. You could code in R without using RStudio, as you can see in the image below. That’s what the plain R console looks like; I just loaded up some data, viewed the first few rows, and renamed the columns.

By comparison, RStudio is a more versatile IDE, or Integrated Development Environment. Most people who use R also use RStudio because it provides a clean point-and-click dashboard of tools where you can type your code, view your figures, organize your data, variables, and files, and view the help window. The plain R console, by contrast, is very bare-bones and doesn’t provide as many accessible tools as RStudio does.

Here I’ve set the editor color theme in RStudio to Solarized Dark, which is easier on the eyes when spending a lot of time coding in R. To change the theme, just go to RStudio –> Preferences (on a Mac) or Tools –> Options (on Windows) and then click the Appearance tab where you can modify the Editor theme. Also check out this tutorial where I show you how to do that plus a few other useful tweaks for setting up RStudio.

If you are installing R and RStudio for the first time:

To download R, go here. Choose the download link that corresponds to your computer. I have a Mac, so I clicked that link.

You can download RStudio here, and you want to choose “RStudio Desktop”.

The important thing when installing R and RStudio is that you need to install R before you install RStudio. If you do it in the reverse order, you will likely run into errors. All you’ll need to do is open the files you downloaded for R and RStudio, and the installation process should begin on its own.

For Mac users, there’s also something called XQuartz, which you might not need for basic coding in R, but which might be helpful down the line for running certain packages. You can download XQuartz here. Similarly, if you just open the downloaded file, XQuartz should install on its own.

If you want to update R and RStudio:

There are a few ways you can check your version of R and see whether or not it needs to be updated. One way is to run the actual R program. There, you can go to the “R” menu and click “Check for R Updates” (see image below). If you do that, R will tell you the current version you’re on, and whether or not there is a more updated version that you can download (circled in blue).

Alternatively, if you’re in RStudio, you can type and run “sessionInfo()” in the R Console. The first line that the console returns is the version of R that you’re using. You can then download and install the latest version of R here for Mac, and here for Windows.
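If you only need the version number, a couple of base R calls are a bit more direct than reading through the sessionInfo() output. This is a minimal sketch; the comparison against 4.0.2 simply reuses the R version this post mentions for the “Basics of R” course, and the object names are illustrative.

```r
# The full version string (the same text sessionInfo() shows on its first line)
R.version.string

# The installed version as an object that can be compared
installed_version <- getRversion()

# 4.0.2 is the R version this post mentions for the course
course_version <- numeric_version("4.0.2")

# TRUE if the installed R is at least as new as the course version
at_least_course_version <- installed_version >= course_version
at_least_course_version
```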

If you’re using a Windows computer, you may need to uninstall R to update it. You can find a quick guide for that here. Another great option for Windows users is to use a package called installr (unfortunately only available for Windows; sorry, Mac users). All you need to do is install “installr”, load the package, and run the code “updateR()”. This function will check for newer versions and will guide you through the update process.
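Put together, the installr steps might look like the sketch below. The wrapper function update_r_on_windows() is something made up for this example (it is not part of installr); it just skips the update on non-Windows systems and installs the package the first time it is needed.

```r
# Hypothetical wrapper around the installr steps described above
update_r_on_windows <- function() {
  # installr only supports Windows, so do nothing on other systems
  if (.Platform$OS.type != "windows") {
    message("installr is Windows-only; download the latest R manually on this system.")
    return(invisible(FALSE))
  }
  # Install installr the first time it is needed
  if (!requireNamespace("installr", quietly = TRUE)) {
    install.packages("installr")
  }
  # Check for a newer R release and walk through the update
  installr::updateR()
}

# On Windows, you would then run:
# update_r_on_windows()
```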

If you want to update to the latest version of RStudio, hover over “Help” on the top menu bar of your Mac, and click “Check for Updates”. Then, quit the RStudio program, go to the RStudio website, and download and install the latest version. Now you should have the latest versions of R and RStudio on your computer. I hope this tutorial was helpful!

As a quick note: my “Basics of R” course uses R version 4.0.2 and RStudio version 1.3.959. There shouldn’t be any incompatibility issues if you’re running a slightly different version, but it is usually best to stay up to date with your software!

Where to ask for help when coding in R

Fri, 03 Dec 2021 09:08:00 -0600

When learning R, it can be tough to figure out how to apply what you’ve learned to your own data. We often learn general skills that are helpful for manipulating our data, but things aren’t always so simple when it comes to your own analysis. Sometimes, we have very specific problems that we need to address but don’t know how.

In this blog post, I’m going to describe a few R forums that are particularly useful when you need specific help with your own project.

As an example, let’s say that we want to replace a specific character of every word (string) in this vector:

words <- c("Apple", "Orange", "Banana", "Peach", "Nectarine")

I know that there must be a function that can address this, but I don’t know how to accomplish my particular need.

No worries, we’ll just turn to Google real quick. How convenient—one of the first results is someone asking a similar question on StackOverflow. Even better, there are a whole bunch of related questions that are listed underneath the main result, in case any of those might also help me out.

If we click on the link, we can see the specific question that the person asked.

And if we scroll down further, we can see the answers that people have provided. The really awesome part about StackOverflow and similar forums is that you can receive opinions from multiple people. There will always be multiple ways to solve a problem, and learning about the multiple ways can help you think more creatively when you code. People will often also comment on the answers themselves, generating discussion about why a certain method might be better than another, or how it can be improved.
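For reference, answers to a question like this often point to the base R functions sub() and gsub(). The sketch below shows how they could be applied to the words vector; the choice of replacing “a” with an underscore is arbitrary and only for illustration.

```r
words <- c("Apple", "Orange", "Banana", "Peach", "Nectarine")

# sub() replaces only the first match in each string
first_replaced <- sub("a", "_", words)
first_replaced

# gsub() replaces every match in each string
all_replaced <- gsub("a", "_", words)
all_replaced

# Matching is case-sensitive by default; ignore.case = TRUE also catches "A"
any_case_replaced <- gsub("a", "_", words, ignore.case = TRUE)
any_case_replaced
```

Notice that “Apple” is untouched in the first two results because the pattern is a lowercase “a”.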

These forums are great references because you’ll find that a lot of people have questions similar to yours. But there are also situations where you’re analyzing your data and have a question that is VERY specific to your data or analysis. Times like this will call for you to make your own detailed post!

I highlighted StackOverflow in this blog post, but there are a number of other sites that serve similar purposes.

Here are some of my favorite forum resources:

…for questions seeking advice on statistics or research:

Now go forth and add all these resources to your R toolbelt! Feel free to leave any of your favorite resources in the comments below.

How to go from R to nice tables in Microsoft Word

Tue, 23 Nov 2021 12:28:39 -0600

As scientists, we often have data or results in R that we want to export to Microsoft Word for the reports or publications that we’re writing.

In this tutorial I show you how to do just that. You can also watch this tutorial as a video if you want to follow along while I code:

The first step is to load up some data. We’re going to use the Orange dataset that comes built into R, which describes the growth of orange trees:

# Load the data
data(Orange)

If we view the data, we can see the following columns: “Tree”, which contains an identifier for each tree that was measured; “age”, which contains the age (in days) of the tree at the time of measurement; and “circumference”, which is the circumference of the tree trunk, measured in millimeters.

head(Orange, 15)

## Tree age circumference
## 1 1 118 30
## 2 1 484 58
## 3 1 664 87
## 4 1 1004 115
## 5 1 1231 120
## 6 1 1372 142
## 7 1 1582 145
## 8 2 118 33
## 9 2 484 69
## 10 2 664 111
## 11 2 1004 156
## 12 2 1231 172
## 13 2 1372 203
## 14 2 1582 203
## 15 3 118 30

So in this dataset, there are five different trees, each of which has been measured at the same time points (age).

Let’s say we want to summarize this dataset to see how the different age groups compare in their growth. In the script below I’ve organized the data so that now we have a table called Orange_summ, which shows the mean and standard deviation of the tree circumferences for each age group. (To run the code below, make sure that the 'dplyr' package is installed first.)

# install.packages("dplyr")
library(dplyr)
Orange_summ <- group_by(Orange, "days"=age) %>%
summarize(mean_circ_mm = mean(circumference), sd_circ_mm = round(sd(circumference), 2))
Orange_summ

## # A tibble: 7 × 3
## days mean_circ_mm sd_circ_mm
## <dbl> <dbl> <dbl>
## 1 118 31 1.41
## 2 484 57.8 8.17
## 3 664 93.2 17.2
## 4 1004 134. 25.9
## 5 1231 146. 29.2
## 6 1372 173. 32.8
## 7 1582 176. 33.3

If you’re interested in learning more about how to summarize data like this, check out our full online course, “The Basics of R (for ecologists)” here.

Great! Now we have a summary table that we can export to Word. First, we’re going to save our table as a ‘*.csv’ file.

write.csv(Orange_summ, "Orange_summ.csv", row.names = FALSE)

What’s important to note here is that we set row.names to FALSE. Doing this eliminates the row numbers in our .csv file, since we don’t need them.

Next, open the .csv file. You can see below that Microsoft Excel is the default software for opening .csv files, but we don’t want that. We’re going to open the file in TextEdit or a similar text editor by right-clicking on our file and choosing the appropriate app.

It should look something like this.

After opening the .csv file in your text editor app, just copy and paste the text onto a blank Microsoft Word document.

In Word, highlight the text, and then go to Table » Convert » Convert Text to Table…

That will open a window where you should check that the number of columns is correct, and make sure you have chosen “Commas” in the “Separate text at” section. That’s because you saved the file as a .csv, or “comma-separated values” file.

If you have a Windows computer, the exact method for converting text to tables might be slightly different, but the concept is the same—you can find a tutorial for that here.

Click “OK” and we have a table!

Next, use the “Find and Replace” function to clean up the table by going to Edit » Find » Replace. (The Mac keyboard shortcut for this is Shift + Command + H).

We want to get rid of all the double quotes in our table, so put a double quote (") in the top bar, and leave the bottom bar blank. Then click “Replace all”. Word should report 6 replacements. This is definitely something that could have been fixed manually in this case since there are only 6 occurrences, but if your table contains a character or factor column, all the values in that column will end up having double quotes around them, so that’s where this trick comes in handy…
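As an alternative to cleaning up the quotes in Word, write.csv() has a quote argument that can leave them out of the file in the first place. The sketch below uses a base R stand-in for the summary table (means only, built with tapply() so it runs without dplyr) and writes to temporary files; orange_means, quoted_file, and plain_file are names invented for this example.

```r
data(Orange)

# Base R stand-in for the summary table: mean circumference per age
means <- tapply(Orange$circumference, Orange$age, mean)
orange_means <- data.frame(days = as.numeric(names(means)),
                           mean_circ_mm = as.numeric(means))

# Default behavior: the column names are wrapped in double quotes
quoted_file <- tempfile(fileext = ".csv")
write.csv(orange_means, quoted_file, row.names = FALSE)
quoted_lines <- readLines(quoted_file)
quoted_lines[1]

# With quote = FALSE, no double quotes are written at all
plain_file <- tempfile(fileext = ".csv")
write.csv(orange_means, plain_file, row.names = FALSE, quote = FALSE)
plain_lines <- readLines(plain_file)
plain_lines[1]
```

One caveat: if any of your values contain commas, leaving out the quotes will break the columns, so the Find and Replace approach is the safer choice for that kind of table.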

Looking good!

Now just rename the columns and reformat the table to make it nice and polished. Word has several border editing tools that allow you to change which borders are visible. I like to remove all borders first. Then, by putting your cursor in a table cell, you can go to Table Design » Border Painter, which lets you “paint” in whichever borders you do want to add.

And that’s it! You’ve just exported your first table from R into Microsoft Word.

Video tutorial on the essentials of R for ecology cheatsheet

Fri, 10 Sep 2021 09:30:39 -0400

Hey everyone! I just finished putting together a video tutorial that goes over my Essential Functions of R (for ecology) Cheatsheet. I decided to create a separate post here because some of you were asking for an easy walk-through of the functions on the cheatsheet and I think that merits its own post. For those that are ready to just download the cheatsheet and go running with it, here is the link to my original post on the subject.

👇 Download the Cheatsheet here 👇
Click here to download the Essential Functions of R Cheatsheet.

The cheatsheet is still a work in progress, but for now the video goes over my first version. I thought this is also a good opportunity to go over some of the questions and suggestions I’ve received since publishing this first version. More on this towards the bottom of this post.

But first, here is a link to the video:

And here is the starting code that I use (that you can copy and paste) for following along in the tutorial:

# Starting Code (contains most of the data used for this tutorial):
num_vec <- c(3,6,3,8)
spp_vec <- c("spp1","spp3","spp2","spp3")
dataframe <- data.frame(num_vec, spp_vec)
data(trees)
tree_data <- trees
tree_data$light <- c(rep(c("shade","sun"), each=15), "sun")
tree_data$light <- as.factor(tree_data$light)
my_matrix <- as.matrix(dataframe)

If helpful, you can also download the entire script that I wrote out over the course of the tutorial: Click here to download the entire script from the video tutorial/walkthrough on the essential functions of R cheatsheet.

If this is what you came for, then you can ignore the rest of this post. (More advanced R users might want to keep reading.)

Some notes on two of the more common suggestions I’ve received:

Watch out with setwd() or Jenny Bryan will burn your computer down 😜 https://www.tidyverse.org/blog/2017/12/workflow-vs-script/ — Eric Scott

I’ve already gotten some version of this comment several times. The idea is that setwd() is a function that should rarely (if ever) be used. setwd() sets the working directory, which is the base folder R looks in when you read in your data (or save your results). The problem is that scripts almost always pass setwd() a full file path, which makes the script only work on your own computer (at that moment in time!). How many of you have opened an R script with the following code:

setwd("/Users/lukanegoita/Documents/my_special_folder/another_folder/final_folder")
read.csv("my_data.csv")

Only to get the error message: In file(file, "rt") : cannot open file 'my_data.csv': No such file or directory. Usually this happens because at some point you moved the R script to a different folder or renamed some folders, and now you have no idea where that CSV is (or, best case, it takes you a while to find it again). Another common cause is sharing scripts: someone else’s computer will have a totally different file path than yours. To prevent this error, and as good coding and sharing practice, it’s very important to use RStudio Projects for managing all of your scripts. It’s beyond the scope of this post to explain how that works, but you can check out my older post where I explain this in more detail (along with some links to other good articles on the subject).

This is all to say that the only reason I included setwd() in this cheatsheet is because many beginners will still find this function in their code, usually from people sharing their scripts without adhering to the best practice of using Projects instead. I think this is my new thing: Don’t share scripts, share projects. Stop the spread of STWDs (Scriptually Transmitted Working Directories).
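To make the contrast concrete, here is a minimal sketch of what reading data can look like inside a Project. It assumes a hypothetical layout where the CSV sits in a ‘data’ folder inside the project folder; the folder and file names are placeholders and not part of the cheatsheet.

```r
# When you open an RStudio Project, the working directory is set to the
# project folder, so paths can be written relative to it (no setwd() needed).
# Hypothetical layout (names are placeholders):
#   my_project/my_project.Rproj
#   my_project/data/my_data.csv
data_path <- file.path("data", "my_data.csv")

if (file.exists(data_path)) {
  my_data <- read.csv(data_path)
  head(my_data)
} else {
  # A friendlier message than the default "cannot open file" error:
  message("Could not find '", data_path, "' relative to ", getwd())
}
```

Because the path is relative, the same script should keep working when the whole project folder is moved or shared.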

Ok, enough on that.

Second, I’ve gotten comments on why I didn’t include any of the “apply” category functions (such as lapply(), tapply(), vapply(), sapply() and just apply()). It’s true that those functions may crop up every once in a while, and they are no doubt a powerful set of tools for working with data. However, I have always been thoroughly confused by the multitude of different “apply” functions and never knew which one to use where. Discovering the dplyr group_by() and summarize() functions made it so that I (almost) never have to use the “apply” functions now. To prevent others from going through the same frustration I went through, I just decided to omit that family of functions and stick to the few key dplyr functions I did include. The point is that I’ve been able to do most of my work without needing to use “apply” functions, so I think others can too.

Convince me otherwise and I’ll include them in a future version of the cheatsheet 😉
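For anyone curious what the trade-off looks like, here is a small side-by-side sketch using the starting data from the video. It calculates the mean tree height for each light level, once with tapply() and once with group_by() and summarize(). The object names are my own placeholders, and the dplyr part only runs if dplyr is installed.

```r
# Starting data from the tutorial:
data(trees)
tree_data <- trees
tree_data$light <- as.factor(c(rep(c("shade","sun"), each=15), "sun"))

# Base R: mean tree height for each light level with tapply()
mean_height_base <- tapply(tree_data$Height, tree_data$light, mean)
mean_height_base

# dplyr: the same summary with group_by() and summarize()
# (skipped if dplyr is not installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  mean_height_dplyr <- tree_data %>%
    group_by(light) %>%
    summarize(mean_height = mean(Height))
  print(mean_height_dplyr)
}
```

Both approaches give the same two means; the dplyr version returns them as a data frame, which tends to be easier to keep working with.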

That’s it for now, but comment down below to keep the conversation going! I hope this cheatsheet evolves into the most helpful resource that it can be.

Intro to evolutionary algorithms with R for beginners (from scratch) [PART 1]

Fri, 27 Aug 2021 00:00:00 +0000

Evolution by natural selection is a powerful process that leads to (and continues to shape) the wonderful diversity of organisms on Earth today. Its mechanics are fairly simple. It is essentially an optimization routine that allows species to adapt to an ever-changing environment:

Basic flowchart of biological evolution by natural selection

In this intro series of posts on the basics, I want to show you how you can use the same evolutionary optimization algorithm to ‘evolve’ (optimize) solutions to other problems. Using evolutionary algorithms to solve problems is very powerful—just think of how many different solutions to flight have been reached through biological evolution.

Bird, Bat, and Insect wings (from https://scholarblogs.emory.edu/artsbrain/2020/09/20/the-biggest-mystery-in-evolution-the-origin-of-insect-flight/)

There are all sorts of really cool applications for evolutionary algorithms, including some fun ways of simulating biological evolution itself. Jeffrey Ventrella is an algorithmic artist who uses various types of coded algorithms to create some really outstanding works of art. Of particular interest for me as an ecologist and biologist was his creation of Gene Pool, an artificial life ecosystem of swimming creatures that slowly evolve through natural selection. Check out the Swimbots website to learn more about that and see his simulation in action. Full disclosure: I’ve recently been collaborating with him on some really interesting projects aimed at studying the evolution and co-existence of swimbot creatures in Gene Pool (click here to see a video about our latest work).

Screenshot of the browser-run Gene Pool simulation from http://www.swimbots.com

For now, let’s back down a bit and focus on something somewhat simpler than virtual swimmers or flight… We’ll start with how to use evolutionary algorithms for fitting a basic linear model. We’ll use the same dataset that I went over in my tutorial for basic linear regression so that you can compare the two methods.

The basic evolutionary algorithm we use is very similar to the biological algorithm of evolution by natural selection, but I’ll expand it a bit in more detail and explain each step. I’ll note that there are some packages and functions built for running evolutionary algorithms in R, but I want to show you how it’s done from scratch so that you can understand the mechanics more directly.

Basic form of the evolutionary algorithm

To start, let’s load the data we are working with using the data() function.

# Load the data:
data(trees)
# rename columns
names(trees) <- c("DBH_in","height_ft", "volume_ft3")
# Show the top few entries:
head(trees)

## DBH_in height_ft volume_ft3
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7

These data include measurements of the diameter at breast height in inches (DBH_in), height in feet (height_ft) and volume in feet cubed (volume_ft3) of 31 black cherry trees. I just went ahead and renamed those columns for clarity.

For this tutorial (as with my tutorial on simple linear regression), our goal is to model the association between tree height and diameter.

`DBH_in ~ height_ft`

A quick plot shows that there is probably a relationship there:

plot(DBH_in ~ height_ft, data = trees, pch=16, cex=1.5)

So our goal is to determine the coefficients of the relationship between height and DBH. In other words, we want to find the best-fit line for this regression.

The actual function for the line we want is Y = a + b*X, where Y is DBH_in, X is height_ft, ‘a’ is the intercept, and ‘b’ is the slope of the line. Our goal is to ‘evolve’ a solution for the ‘a’ and ‘b’ coefficients that generate the best-fit line.
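To make the formula concrete, here is a minimal sketch that calculates the predicted DBH for a single candidate line. The values of ‘a’ and ‘b’ below are arbitrary numbers I picked for illustration; they are not results from the tutorial.

```r
data(trees)
names(trees) <- c("DBH_in","height_ft","volume_ft3")

# One made-up candidate model (arbitrary values, for illustration only):
a <- -5
b <- 0.25

# Predicted DBH for every tree: Y = a + b*X
DBH_predicted <- a + b*trees$height_ft

# Compare predictions with the measured values for the first few trees:
head(data.frame(height_ft = trees$height_ft,
                DBH_in = trees$DBH_in,
                DBH_predicted))
```

Each candidate pair of coefficients produces one such column of predictions, and the closer it sits to the measured DBH_in column, the better the candidate.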

1) Starting Population

First we’ll generate a starting population of 100 potential organisms (models):

set.seed(123) # to get the same results as me
# 100 random 'a' values based on a uniform distribution from -100 to 100
a_coef <- runif(min=-100, max=100, n=100)
# 100 random 'b' values based on a uniform distribution from -100 to 100
b_coef <- runif(min=-100, max=100, n=100)
# pair these together into a dataframe of two columns and call this a population
# and also add in a column for fitness so that we can keep track of the fitness
# of each organism/model:
population <- data.frame(a_coef, b_coef, fitness=NA)
head(population)

## a_coef b_coef fitness
## 1 -42.48450 19.997792 NA
## 2 57.66103 -33.435292 NA
## 3 -18.20462 -2.277393 NA
## 4 76.60348 90.894765 NA
## 5 88.09346 -3.419521 NA
## 6 -90.88870 78.070044 NA

Now put this code into its own function:

gen_starting_pop <- function(){
# 100 random 'a' values based on a uniform distribution from -100 to 100
a_coef <- runif(min=-100, max=100, n=100)
# 100 random 'b' values based on a uniform distribution from -100 to 100
b_coef <- runif(min=-100, max=100, n=100)
# pair these together into a dataframe of two columns and call this a population
# and also add in a column for fitness so that we can keep track of the fitness
# of each organism/model:
population <- data.frame(a_coef, b_coef, fitness=NA)
return(population)
}

2) Function to evaluate fitness

Next we need to create a function that will evaluate the ‘fitness’ of each ‘organism’ (I’ll just refer to the organisms as the ‘models’ from here on). The fitness we need in this case is based on how good of a fit each model provides. To think in terms of biological evolution, imagine that the line that the coefficients create is the phenotype while the coefficients themselves are the DNA. Natural selection always acts on the phenotype, which we can create from the DNA by applying the full model I described before: DBH_in = a + b*height_ft. We can evaluate the fitness by testing how well the right side of the equation predicts the left side (DBH_in).

So first we’ll loop through each model in the population and calculate its predicted DBH values. Then we subtract the real DBH values from the predicted values to get the difference between the two. Bigger differences mean a poorer model and lower fitness. To create just one ‘fitness’ value per model, we square those differences and sum them. This gives an overall index of how far the predicted DBH is from the real DBH. We square the values so that negative and positive differences count in the same way (only the size of the difference matters). Finally, so that higher values mean better models, take the inverse of the result (1/result). For these data the sum of squares is always well above 1, so the fitness values fall between 0 and 1:

# loop through each organism:
for(i in 1:100){
# for each organism, calculate predicted DBH values
DBH_predicted <- population[i,"a_coef"] + population[i,"b_coef"]*trees$height_ft
# Now, subtract the real DBH values from the predicted values to get the difference:
difference <- DBH_predicted - trees$DBH_in
# calculate the sum of squared differences:
sum_sq_diff <- sum(difference^2)
# make the fitness value range from 0 to 1:
fitness <- 1/sum_sq_diff
# finally, save the fitness values to the population dataframe:
population$fitness[i] <- fitness
}
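If the loop feels verbose, the same calculation can be sketched as a small helper function that is applied to every pair of coefficients. The helper name calc_fitness is my own placeholder and mapply() is just one way to do this; the loop gives the same numbers.

```r
data(trees)
names(trees) <- c("DBH_in","height_ft","volume_ft3")

# Rebuild the same kind of starting population as before:
set.seed(123)
a_coef <- runif(min=-100, max=100, n=100)
b_coef <- runif(min=-100, max=100, n=100)
population <- data.frame(a_coef, b_coef, fitness=NA)

# Hypothetical helper: fitness of a single model with intercept a and slope b
calc_fitness <- function(a, b){
  difference <- (a + b*trees$height_ft) - trees$DBH_in
  1/sum(difference^2)
}

# Apply the helper to each pair of coefficients:
population$fitness <- mapply(calc_fitness, population$a_coef, population$b_coef)
head(population)
```

Wrapping the calculation in a helper also makes it easy to check the fitness of any single line by hand, for example calc_fitness(-5, 0.25).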

Then we need to choose survivors. Let’s say only the top 10 models with the highest fitness will survive:

# find the index of the top 10 (highest) fitness values:
top_10_fit <- order(population$fitness, decreasing = T)[1:10]
# then use those to index the population:
survivors <- population[top_10_fit,]
survivors

## a_coef b_coef fitness
## 61 33.02304 0.4599127 1.073925e-05
## 52 -11.55999 -0.4945466 8.224736e-06
## 5 88.09346 -3.4195206 9.275710e-07
## 3 -18.20462 -2.2773933 7.663413e-07
## 35 -95.07726 4.2271452 7.017355e-07
## 86 -13.02145 -3.7420399 3.320522e-07
## 78 22.55420 5.9671372 1.497032e-07
## 66 -10.29673 6.7375891 1.342400e-07
## 96 -62.46178 -6.6934595 9.392552e-08
## 17 -50.78245 9.8569312 6.820099e-08

But let’s also put this code that calculates fitness and picks survivors into its own function to make it easier to write out in future steps:

evaluate_fitness <- function(population){
# loop through each organism:
for(i in 1:100){
# for each organism, calculate predicted DBH values
DBH_predicted <- population[i,"a_coef"] + population[i,"b_coef"]*trees$height_ft
# Now, subtract the real DBH values from the predicted values to get the difference:
difference <- DBH_predicted - trees$DBH_in
# calculate the sum of squared differences:
sum_sq_diff <- sum(difference^2)
# make the fitness value range from 0 to 1:
fitness <- 1/sum_sq_diff
# finally, save the fitness values to the population dataframe:
population$fitness[i] <- fitness
}
# find the index values of the top 10 (highest) fitness values:
top_10_fit <- order(population$fitness, decreasing = T)[1:10]
# then use those to index the population:
survivors <- population[top_10_fit,]
# return survivors
return(survivors)
}

Now, all we have to do is call:
survivors <- evaluate_fitness(population)
whenever we want to pick out the survivors of the population.

3) Mate survivors and mutate DNA (and generate the new population)

Next we need to create a new population of 100 models using those survivors, making sure to add some random mutations to ensure the potential for evolution exists.

First, generate the new population of models by cloning these survivors at random:

Note, I’m just cloning individuals rather than sexual reproduction here to keep things simple for this example. However, DNA crossover can provide an important advantage for evolutionary algorithms as it does in biology.

# First, choose the parents at random from the 10 possible survivors:
parents <- sample(1:10, size=100, replace=T)
# and use those parents (index values) to clone offspring:
offspring <- survivors[parents,]
# Then add mutations:
# choose a mutation rate:
mutation_rate <- 0.6
# total number of mutations is our population (100) * the rate, and rounded to 
# make sure the result is an integer value:
total_mutations <- round(100*mutation_rate)
# choose which models receive mutations for a or b coefficients:
a_to_mutate <- sample(x=c(1:100), size=total_mutations)
b_to_mutate <- sample(x=c(1:100), size=total_mutations)
# then generate a set of random mutations for the a and b coefficients:
a_mutations <- rnorm(n = total_mutations, mean=0, sd=3)
b_mutations <- rnorm(n = total_mutations, mean=0, sd=3)
# and apply those mutations:
offspring$a_coef[a_to_mutate] <- offspring$a_coef[a_to_mutate] + a_mutations
offspring$b_coef[b_to_mutate] <- offspring$b_coef[b_to_mutate] + b_mutations
# finally, reset the row names from 1 to 100:
row.names(offspring) <- 1:100

Note, I’m setting the mutation rate to 0.6 and the magnitude for the mutations to 3 (sd=3), but you can play around with mutation rate and magnitude to see what happens.

But let’s also make this step into a function so that we can easily call it later when we are looping through each generation:

mate_and_mutate <- function(survivors){
# First, choose the parents at random from the 10 possible survivors
# (a series of 100 random index values):
parents <- sample(1:10, size=100, replace=T)
# and use those parents (index values) to clone offspring:
offspring <- survivors[parents,]
# Then add mutations:
# choose a mutation rate:
mutation_rate <- 0.6
# total number of mutations is our population (100) * the rate, and rounded to 
# make sure the result is an integer value:
total_mutations <- round(100*mutation_rate)
# choose which models receive mutations for a or b coefficients:
a_to_mutate <- sample(x=c(1:100), size=total_mutations)
b_to_mutate <- sample(x=c(1:100), size=total_mutations)
# then generate a set of random mutations for the a and b coefficients:
a_mutations <- rnorm(n = total_mutations, mean=0, sd=3)
b_mutations <- rnorm(n = total_mutations, mean=0, sd=3)
# and apply those mutations:
offspring$a_coef[a_to_mutate] <- offspring$a_coef[a_to_mutate] + a_mutations
offspring$b_coef[b_to_mutate] <- offspring$b_coef[b_to_mutate] + b_mutations
# finally, reset the row names from 1 to 100:
row.names(offspring) <- 1:100
# return the new generation of offspring:
return(offspring)
}
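For the curious, here is a rough sketch of what a very simple form of crossover could look like, where each offspring takes its ‘a’ coefficient from one parent and its ‘b’ coefficient from another. This is my own illustrative variation (the function name and the toy survivors are made up) and it is not used in the rest of this tutorial. Mutations would still be added afterwards, as in mate_and_mutate().

```r
# Hypothetical variation on the cloning step: combine the 'DNA' of two parents.
mate_with_crossover <- function(survivors, pop_size = 100){
  n_survivors <- nrow(survivors)
  # choose two parents (index values) for each offspring:
  parent_1 <- sample(1:n_survivors, size=pop_size, replace=TRUE)
  parent_2 <- sample(1:n_survivors, size=pop_size, replace=TRUE)
  # 'a' comes from the first parent and 'b' from the second:
  offspring <- data.frame(a_coef = survivors$a_coef[parent_1],
                          b_coef = survivors$b_coef[parent_2],
                          fitness = NA)
  return(offspring)
}

# Small made-up set of survivors to try it out:
toy_survivors <- data.frame(a_coef = c(-10, -5, 0),
                            b_coef = c(0.2, 0.3, 0.4),
                            fitness = NA)
set.seed(42)
head(mate_with_crossover(toy_survivors, pop_size = 6))
```

With only two coefficients there is not much DNA to shuffle, but the idea is that a good intercept from one survivor can end up paired with a good slope from another.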

So that was one generation, great! Now we need to repeat the cycle for many more generations. We’ll put everything together with the help of the for loop.

# First set the starting population:
population <- gen_starting_pop()
# set how many generations you want to run this for.
# we'll start with 5 for now:
generations <- 5
# begin the for loop:
for(i in 1:generations){
# 1) Evaluate fitness:
survivors <- evaluate_fitness(population)
# 2) Mate and mutate survivors to generate next generation:
next_generation <- mate_and_mutate(survivors)
# 3) Redefine the population using the new generation:
population <- next_generation
}
survivors <- evaluate_fitness(population)
head(survivors)

## a_coef b_coef fitness
## 46 -13.872104 0.3557666 0.004382764
## 95 -14.649663 0.3592565 0.004170481
## 8 -3.179754 0.2292144 0.004046204
## 67 -5.458559 0.2292144 0.003732053
## 11 -15.186570 0.3557666 0.003467042
## 59 -12.313690 0.3557666 0.003383982

Notice how even in just a few generations the fitness has gone up quite a bit. We can actually plot and visualize how the fitness changes over time by saving a fitness value from each generation:

set.seed(1239)
# First set the starting population:
population <- gen_starting_pop()
# set how many generations you want to run this for.
# Use 100 now:
generations <- 100
# define empty variable to collect fitness values:
fitness <- NULL
# begin the for loop:
for(i in 1:generations){
# 1) Evaluate fitness:
survivors <- evaluate_fitness(population)
# 2) Mate and mutate survivors to generate next generation:
next_generation <- mate_and_mutate(survivors)
# 3) Redefine the population using the new generation:
population <- next_generation
# save the best fitness value from each generation to plot it over time
# (at this point the offspring still carry their parents' fitness values):
fitness[i] <- max(population$fitness)
}

And plot the results during the first 100 generations:

plot(fitness ~ c(1:generations), type="l", lwd=2, xlab="Generation", ylab="Fitness")

So you can see how fitness slowly increases over time. Try this again but run the simulation for 1000 generations (note that it may take a minute to run the for loop).
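If you want to experiment with different numbers of generations, mutation rates, or mutation sizes, one option is to wrap the whole cycle in a single function. The sketch below is my own compact rewrite of the steps in this post, not the code used for the figures: the function name and its arguments are placeholders, and fitness is calculated for the whole population at once, which makes longer runs quicker. Your numbers may not match the ones shown in this post exactly.

```r
data(trees)
names(trees) <- c("DBH_in","height_ft","volume_ft3")

# Hypothetical wrapper around the three steps of the algorithm:
run_evolution <- function(generations, pop_size=100, n_survivors=10,
                          mutation_rate=0.6, mutation_sd=3){
  # starting population:
  population <- data.frame(a_coef = runif(pop_size, min=-100, max=100),
                           b_coef = runif(pop_size, min=-100, max=100))
  best_fitness <- numeric(generations)
  n_mutations <- round(pop_size*mutation_rate)
  for(i in 1:generations){
    # 1) Evaluate fitness (one column of predicted DBH values per model):
    predicted <- outer(trees$height_ft, population$b_coef)
    predicted <- sweep(predicted, 2, population$a_coef, "+")
    population$fitness <- 1/colSums((predicted - trees$DBH_in)^2)
    top <- order(population$fitness, decreasing=TRUE)[1:n_survivors]
    survivors <- population[top,]
    best_fitness[i] <- survivors$fitness[1]
    # 2) Clone the survivors at random and add mutations:
    offspring <- survivors[sample(1:n_survivors, size=pop_size, replace=TRUE),]
    a_to_mutate <- sample(1:pop_size, size=n_mutations)
    b_to_mutate <- sample(1:pop_size, size=n_mutations)
    offspring$a_coef[a_to_mutate] <- offspring$a_coef[a_to_mutate] +
      rnorm(n_mutations, mean=0, sd=mutation_sd)
    offspring$b_coef[b_to_mutate] <- offspring$b_coef[b_to_mutate] +
      rnorm(n_mutations, mean=0, sd=mutation_sd)
    row.names(offspring) <- 1:pop_size
    # 3) The offspring become the next population:
    population <- offspring
  }
  # return the best model of the last generation and the fitness history:
  list(best_model = survivors[1,], best_fitness = best_fitness)
}

set.seed(1239)
result <- run_evolution(generations = 1000)
result$best_model
plot(result$best_fitness, type="l", lwd=2, xlab="Generation", ylab="Fitness")
```

Try calling it with a smaller mutation_sd (for example 0.5) or a different mutation_rate to see how the fitness curve changes.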

We can also visualize the evolution of the best-fit line by plotting how the line changes over time. The code below runs 1000 generations, and every 50 generations it pauses to plot the scatterplot with the current best-fit line:

set.seed(1239)
# First set the starting population:
population <- gen_starting_pop()
# set how many generations you want to run this for.
generations <- 1000
# define empty variable to collect fitness values:
fitness <- NULL
# begin the for loop:
for(i in 1:generations){
# 1) Evaluate fitness:
survivors <- evaluate_fitness(population)
# 2) Mate and mutate survivors to generate next generation:
next_generation <- mate_and_mutate(survivors)
# 3) Redefine the population using the new generation:
population <- next_generation
fitness[i] <- max(population$fitness)
# Every 50 generations, pause the simulation and plot the
# points with the current best-fit line:
if(i %% 50 == 0){
survivors <- evaluate_fitness(population)
plot(DBH_in ~ height_ft, data = trees, pch=16, cex=1.5)
title(main=paste0("Generation: ",i), cex.main=2)
abline(a=survivors$a_coef[1], b=survivors$b_coef[1], lwd=3, col="red")
# pause for half a second:
Sys.sleep(.5)
}
}

Evolution of the best-fit line

Now let’s extract the highest fitness model after having run the simulation for 1000 generations to see how the coefficients compare to the basic linear model lm() output in R:

# Extract the best model:
top_models <- evaluate_fitness(population) # Be sure to run this first
best_model <- top_models[which.max(top_models$fitness),]
# compare this to a basic linear model result with 'lm()':
evo_model <- best_model
lm_model <- lm(DBH_in ~ height_ft, data=trees)
# Plot the scatterplot and add the best-fit lines:
plot(DBH_in ~ height_ft, data = trees, pch=16, cex=1.5)
# red line for our model solution
abline(a=evo_model$a_coef, b=evo_model$b_coef, col="red", lwd=2)
# blue line for the lm() model solution:
abline(lm_model, col="blue", lwd=2)

Not bad for only 1000 generations! You can barely even see the difference between the two lines.

Conclusion:

our model: a = -5.9599 b = 0.2528
lm() model: a = -6.1884 b = 0.2557
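Another way to compare the two solutions is to put them on the same fitness scale. Since lm() finds the line with the smallest possible sum of squared differences, its fitness is the ceiling that the evolved models can approach but not exceed. Here is a quick sketch, reusing the fitness calculation from earlier in a helper whose name is my own placeholder:

```r
data(trees)
names(trees) <- c("DBH_in","height_ft","volume_ft3")

# Same fitness calculation as before, wrapped in a hypothetical helper:
calc_fitness <- function(a, b){
  1/sum((a + b*trees$height_ft - trees$DBH_in)^2)
}

lm_model <- lm(DBH_in ~ height_ft, data=trees)
lm_coefs <- unname(coef(lm_model))

# Fitness of the lm() solution (the best any straight line can do):
fitness_lm <- calc_fitness(lm_coefs[1], lm_coefs[2])
# Fitness of the evolved solution reported above:
fitness_evo <- calc_fitness(-5.9599, 0.2528)

c(lm = fitness_lm, evolved = fitness_evo)
```

The closer the evolved fitness gets to the lm() fitness, the closer the red line sits to the blue one.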

So, those are the basics of how evolutionary algorithms work. I hope this part of the tutorial was helpful. In the next part I’m going to get a bit more advanced and show you how I used an evolutionary algorithm to create images made of letters such as the one below.

Evolving an image made of only the letter R…

Five fun things you can do with R (Vol. 1)

Mon, 16 Aug 2021 00:00:00 +0000

I’ve been having fun with R through some side projects lately. One of them involves trying to use machine learning and evolutionary algorithms to teach my computer to draw (I hope to share a post on that project before too long). That got me thinking, so I decided to start a series of posts on fun projects you can do with R, either ones I’ve written myself or ones I’ve found online to re-share. Thanks in advance to all the authors of these articles. Special thanks to Ryan Timpe for the inspiration for this post¹ (also see his site: http://www.ryantimpe.com/).

Yes, full disclosure I’m a bit of a geek when it comes to R (if you couldn’t guess already), but if you are just starting out maybe some of the ideas below will spark your interest about the possibilities. If you’re a more advanced R user, then maybe take a shot at completing some of these projects yourself.

So here’s the first installment:

Christmas Card by Greta Gasparac

Make holiday or birthday cards. Yes, this might be the geekiest gift you could give someone, but how cool would it be to share your love for R with your friends and family? Greta Gasparac: https://towardsdatascience.com/christmas-cards-81e7e1cce21c

Google Search history barplot by Saul Buentello

Analyze your personal Google search history. What kind of cool patterns can you find and learn about yourself? Saúl Buentello: https://towardsdatascience.com/explore-your-activity-on-google-with-r-how-to-analyze-and-visualize-your-search-history-1fb74e5fb2b6

Example Datasaur tweet by Ryan Timpe

Build a twitterbot that ‘creates’ dinosaurs. Yes. Exactly as that sounds. Ryan Timpe: http://www.ryantimpe.com/post/datasaurs1/

Example Kerasaur phylogeny by Ryan Timpe

Use deep learning to create new dinosaur names. This is connected to the previous idea, but really cool! There are many ways you could apply this to other topics and themes. Ryan Timpe: https://www.ryantimpe.com/post/kerasaurs1/

Example Amazon book purchase history

Analyze your Amazon shopping history. Here’s an older post of mine about how you can visualize your Amazon purchase history and maybe even draw some insights about yourself. WARNING: You might prefer not to see how much money you’ve been spending on all those purchases… 😆 https://lukaneg.github.io/personal-scrape.html

Have you come across or written a post about doing something fun with R? Let me know! I’ll try to share it in an upcoming post. Just comment down below.

Footnote 1: Check out Ryan Timpe’s presentation about how side projects are a great way to learn on your own terms, practice, and have fun while doing so!

How to actually make a quality scatterplot in R

Fri, 06 Aug 2021 00:00:00 +0000

Scatterplots are one of the most common types of data visualizations you will encounter as a biologist. They present the relationship between two continuous variables. We might take them for granted because of their simplicity, but we shouldn’t overlook how intuitively we can see and comprehend these figures. They are a powerful tool, and one that I believe merits a bit more attention. Check out this really cool article from the New Yorker about ‘When graphs are a matter of life and death’ for more history on the subject.

All through my grad school years and beyond, I’ve repeatedly come across scatterplots that almost defeat the purpose of helping us easily understand the relationship between two variables. Here’s a typical example of the type of plot I’ve seen one too many times:

There are several issues here, but without elaborating, here are the same data after a few visual tweaks: Much more striking and easy to read, no?

In other words, while the data may be accurate, the actual visual design of scatterplots is often overlooked and unattended. Unlike a statistical test, the goal of data visualizations is subjective—to help a viewer understand a particular relationship or story. For that reason, it is important that we take a subjective, and dare I say aesthetic, approach towards ensuring scatterplots (and all other plot types, really) are visually appealing and easy to understand on a quick glance.

I have a hunch that the main reason plots such as the first one above are so common is simply due to a lack of knowing how to easily customize plots in R. Unfortunately, even ggplot2—which is commended for the ease with which one can make good quality visualizations—is not so pretty right out of the box.

Hence this blog post ;)

Here is a simple tutorial on how to re-create the nice version of the plot above using the ‘base’ R package. The key is just to include a few additional parameters and functions. In the future I may update this post with how to do this using ggplot2.

First, let’s load the data. In this case we are using the built-in dataset on air quality measurements in New York from May through September in 1973:

# Load the built-in data:
data(airquality)
help(airquality)
head(airquality)

## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6

For this tutorial we are only interested in ozone concentration, which is measured in parts per billion (ppb), and wind speed, which is measured in miles per hour (mph). To get this info, I just ran the ‘help(airquality)’ function to pull up a description of these data.

Next, let’s start with the plot created using the ‘plot()’ function right out of the box:

plot(Ozone ~ Wind, data=airquality)

Whichever variable you want on the Y-axis goes to the left of the tilde ‘~’ and the X-axis variable goes to the right of the tilde. The neat thing about this notation is that you can just directly use the column names of the data you’d like to plot. In this case we set the ‘data’ argument to our dataframe ‘airquality’, and ‘Ozone’ and ‘Wind’ were the column names taken right from that dataframe.

Next, remove all the axes and tick marks from the plot so that we can start with a clean slate:

plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="")

Then we’ll add back new axes that are fully customizable using the ‘axis()’ function:

plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="")
# add the wind speed axis:
axis(side=1, at=seq(0, max(airquality$Wind, na.rm=T), 2), padj=-0.8)
# add the Ozone axis:
axis(side=2, at=seq(0, max(airquality$Ozone, na.rm=T), 20), hadj=0.8, las=2)

The ‘side’ argument indicates what side of the graph we are adding the axis on. Sides 1, 2, 3, and 4 are the bottom, left, top, and right, respectively. Then, ‘at’ is where we tell the function where to put the axis ticks. For example, here is what we use for that argument:

# This finds the maximum value for wind:
max(airquality$Wind, na.rm=T)

## [1] 20.7

# Then use that in the 'seq()' function to create the sequence of places for the tickmarks:
seq(from = 0, to = max(airquality$Wind, na.rm=T), by = 2)

## [1] 0 2 4 6 8 10 12 14 16 18 20

The ‘padj’ and ‘hadj’ (perpendicular adjust and horizontal adjust) arguments are used to nudge the axis tick labels so that they line up more neatly. Play around with those values to see what you get. Finally, setting the ‘las’ argument to 2 turns the y axis tick labels horizontal so that they are easier to read and all fit on the axis neatly.

Next, add in the axis name labels using the ‘mtext’ function. It’s always important to add units to these labels, which I did. The ‘line’ argument is how far from the edge of the plot you want the label to appear. ‘cex’ affects the size of the text, and finally, ‘font’ is used to make the text bold. You can set ‘font’ to 1, 2, 3, or 4, for normal, bold, italic, or italic + bold respectively. Play around with all those parameters to see how it changes the figure.

plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="")
# add the wind speed axis:
axis(side=1, at=seq(0, max(airquality$Wind, na.rm=T), 2), padj=-0.8)
# add the Ozone axis:
axis(side=2, at=seq(0, max(airquality$Ozone, na.rm=T), 20), hadj=0.8, las=2)
# add in the labels for each axis:
mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, font=2)
mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, font=2)

Now that we have all the elements there, let’s adjust the points a bit. I’m really not a fan of the circle outline points. Something about it just doesn’t give the emphasis I want to see in the figure (feel free to disagree!). Instead, I prefer to fill in the points using the ‘pch = 16’ argument, and then make them a bit bigger with the ‘cex’ argument (both in the ‘plot()’ function):

plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="", pch=16, cex=1.5)
# add the wind speed axis:
axis(side=1, at=seq(0, max(airquality$Wind, na.rm=T), 2), padj=-0.8)
# add the Ozone axis:
axis(side=2, at=seq(0, max(airquality$Ozone, na.rm=T), 20), hadj=0.8, las=2)
# add in the labels for each axis:
mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, font=2)
mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, font=2)

I think that looks a lot better. The only problem here is that a lot of points overlap, so you lose the ability to see those clusters. To fix that we can make the point color transparent. This is actually very easy to do using the ‘ggplot2’ package, but we can also do it with the ‘base’ package; it just takes a bit more code. I made a function to make creating transparent colors a bit easier:

### transparent colors function
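t_col <- function(color, opacity = 0.5) {
# convert the color name to its red, green, and blue values (0 to 255):
rgb.val <- col2rgb(color)
# rebuild the color with an alpha (opacity) value, also on the 0 to 255 scale:
t.col <- rgb(rgb.val[1], rgb.val[2], rgb.val[3], maxColorValue = 255, alpha = opacity*255)
invisible(t.col)
}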

This function essentially takes in two arguments: ‘color’ which is the color you want to make transparent, and then ‘opacity’ which goes from 0 to 1, with 0 being totally transparent, and 1 being no transparency. Adding this function to our plot looks like this:

plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="", pch=16, cex=1.5,
col=t_col("black",0.6))
# add the wind speed axis:
axis(side=1, at=seq(0, max(airquality$Wind, na.rm=T), 2), padj=-0.8)
# add the Ozone axis:
axis(side=2, at=seq(0, max(airquality$Ozone, na.rm=T), 20), hadj=0.8, las=2)
# add in the labels for each axis:
mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, font=2)
mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, font=2)
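As an aside, base R (through the grDevices package, which is loaded by default) also has the adjustcolor() function, which can do the same job as the helper above through its ‘alpha.f’ argument. A quick sketch, with a variable name of my own choosing:

```r
# adjustcolor() scales the opacity of a color by alpha.f (0 to 1):
see_through_black <- adjustcolor("black", alpha.f = 0.6)
see_through_black

plot(Ozone ~ Wind, data=airquality, pch=16, cex=1.5, col=see_through_black)
```

Either approach works; writing your own function just makes it clearer what is happening under the hood.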

Almost done! I like to add a bit of white space around the edges of the points so that they don’t suffer from any “edge effects”, which lets you figuratively “stand back” when looking at all of the data. There’s also no reason wind speed shouldn’t start at zero since we’re close to that anyway. So to add that spacing and extend the axes, we just change the axis limits using the ‘xlim’ and ‘ylim’ arguments in the ‘plot()’ function. They each take a vector of two values that indicate the minimum and maximum extent of each axis:

plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="", pch=16, cex=1.5,
col=t_col("black",0.6), ylim=c(0,185), xlim=c(0,22))
# add the wind speed axis:
axis(side=1, at=seq(0, max(airquality$Wind, na.rm=T), 2), padj=-0.8)
# add the Ozone axis:
axis(side=2, at=seq(0, max(airquality$Ozone, na.rm=T), 20), hadj=0.8, las=2)
# add in the labels for each axis:
mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, font=2)
mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, font=2)
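
The limits above were picked by eye. If you would rather derive them from the data, one possible sketch is below. The ‘padded_limits()’ helper and its 10% padding are hypothetical choices for illustration, not something from the steps above.

```r
# Observed ranges, ignoring missing values
wind_range  <- range(airquality$Wind,  na.rm = TRUE)
ozone_range <- range(airquality$Ozone, na.rm = TRUE)

# Hypothetical helper: start the axis at zero and add some headroom at the top
padded_limits <- function(x_range, pad = 0.1) {
  c(0, x_range[2] * (1 + pad))
}

xlim_auto <- padded_limits(wind_range)
ylim_auto <- padded_limits(ozone_range)

plot(Ozone ~ Wind, data = airquality, pch = 16,
     xlim = xlim_auto, ylim = ylim_auto)
```

This lands close to the hand-picked values, and it adapts if the data change.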

We’ll end by using the ‘par()’ function to set the margins of the plot. I don’t like how close the Y axis label is to the edge of the figure. The ‘mar’ argument sets the margins around the edge of the figure, measured in lines of text, so play around with the numbers until you get something that looks good. The four values in the vector represent the four sides in the same order as the ‘side’ argument used for the axes: bottom, left, top, and right:

Here is the plot again with a background color so that you can see what I mean:

And after we added the ‘par(mar=c(5,5,2,2))’: That looks good! So here is the final code:

par(mar=c(5,5,2,2))
plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="", pch=16, cex=1.5,
ylim=c(0,185), xlim=c(0,22), col=t_col("black",0.6))
mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, font=2)
axis(side=1, seq(0, max(airquality$Wind)+2, 2), padj=-0.8)
mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, font=2)
axis(side=2, at=seq(0, 185, 20), hadj=0.8, las=2)
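
One thing worth knowing about ‘par()’ is that the setting stays in effect for the rest of the graphics device, so later plots inherit it. A minimal sketch of saving and restoring the old margins is below. It uses ‘pdf(NULL)’ as a throwaway device purely so the example writes nothing to disk; in your own session you would just use your normal plotting window.

```r
pdf(NULL)                            # throwaway device, nothing is written to disk
old_par <- par(mar = c(5, 5, 2, 2))  # bottom, left, top, right; returns the old settings
new_mar <- par("mar")                # the margins now in effect

plot(Ozone ~ Wind, data = airquality, pch = 16)

par(old_par)                         # put the previous margins back
restored_mar <- par("mar")
invisible(dev.off())
```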

Finally (and maybe most importantly), when you save your figure by clicking ‘Export’ above the ‘Plots’ pane in R Studio, you’ll have the option to resize the figure dimensions and see a preview of how it looks with different dimensions. Don’t neglect the important step of ensuring the figure dimensions are set to a size that considers the proportion of all the elements in the figure. Just play around with the sizing and you’ll see what I mean. This is what you are going for:

Not this:

Alternatively, if you want the image size to also remain in the code, create a ‘quartz()’ window (if using a Mac) or a ‘windows()’ window (if using a PC). Set ‘height’ and ‘width’ in those functions to the desired size in inches and run that function before the code that creates the plot. This will open up an external graphics window that is sized to your specifications and you can then go to the file menu at the top of your screen to save the figure as a PDF.

You can also save the figure directly to a file from your code. Here is the final code for how to do this:

pdf(file="my_scatterplot.pdf",width=7,height=4.5)
par(mar=c(5,5,2,2))
plot(Ozone ~ Wind, data=airquality, xaxt="n", yaxt="n", ylab="", xlab="", pch=16, cex=1.5,
ylim=c(0,185), xlim=c(0,22), col=t_col("black",0.6))
mtext(side=1, "Wind Speed (mph)", line=2.8, cex=1.5, font=2)
axis(side=1, seq(0, max(airquality$Wind)+2, 2), padj=-0.8)
mtext(side=2, "Ozone Concentration (ppb)", line=2.8, cex=1.4, font=2)
axis(side=2, at=seq(0, 185, 20), hadj=0.8, las=2)
dev.off()

Simply use the ‘pdf()’ function to set the name of the file and directory where it will be saved and specify the height and width in inches. You can play around with those measurements until you find something that works. Then run the code that creates the plot. And finally, run ‘dev.off()’ to close that graphic device.
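
If you export a lot of figures, it is easy to forget ‘dev.off()’ or to leave a device open when the plotting code fails. One way to guard against that is a small wrapper, sketched below. The ‘save_pdf()’ name and its arguments are hypothetical, and the file goes to a temporary folder only so the example is safe to run anywhere.

```r
# Hypothetical helper: open a PDF device, draw, and always close the device
save_pdf <- function(file, plot_code, width = 7, height = 4.5) {
  pdf(file = file, width = width, height = height)
  on.exit(dev.off())   # runs even if the plotting code throws an error
  plot_code()
  invisible(file)
}

out_file <- file.path(tempdir(), "my_scatterplot.pdf")
save_pdf(out_file, function() {
  par(mar = c(5, 5, 2, 2))
  plot(Ozone ~ Wind, data = airquality, pch = 16)
})
file.exists(out_file)
```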

I recommend always saving your figures as PDFs to retain maximum quality. Check out this other excellent blog post by David Smith with more details about how and why to save your figures in particular formats.

Well done! That’s it for now. Do you think this is easier to do with ggplot? I’ll follow up with an update or post on that as well.

If you liked this article, let me know what you might want to see next in the comments down below.

If you liked this post and want to learn more, then check out my online course on an introduction to data visualization with R for ecology:

Start Visualizing Data with R Now

Also be sure to check out R-bloggers for other great tutorials on learning R

The myth of the R learning curve

Mon, 26 Jul 2021 00:00:00 +0000

I think that the “difficult” R learning curve is a myth.

That’s because what people call the “R learning curve” is actually the combination of several disparate skill sets that are often taught as one conglomerated curriculum.

Let me explain.

Most university courses in biology that teach R don’t just teach R. The goal of those classes is to help students learn how to plot and analyze their own data (and eventually use those skills for actual research).

So, is the course teaching data analysis and statistics, or R? Usually its goal is to teach all of those things. That’s where the problem is.

Having worked one-on-one with many undergraduate and graduate students learning R, I’ve found that there is a particularly deep chasm between what it means to learn R and what it means to learn statistics. R is a programming language, but data analysis and statistics per se are mostly math. R is just a tool for doing statistics. For example, statistics and data analyses can be conducted using tools ranging from a calculator to Microsoft Excel. While R remains one of the best tools, there is no intrinsic link that implies R must be taught simultaneously with statistics. In fact, that’s my point.

One of the main reasons R appears to have a difficult learning curve is simply because it is often confounded with learning statistics at the same time. One of my goals with the courses that I teach is to separate statistics and R. If I’m going to teach a course on R, it is just about R. Once you have a solid handle on that, then we can move on to using R for learning statistics. But you need to know how to use the right tools first. That’s why I created my course on the basics of R for ecologists. It doesn’t cover any stats or data analysis, but that’s my intention.

I want to outline one more reason why R appears to have a difficult learning curve.

Many of the mainstream R courses (such as the university courses I mentioned above) tend to mistake “learning R” for “learning everything in R.” The professors that teach these courses usually have many years of experience and have thus accumulated a very large tool shed of packages and functions and operations for R (take a look at this Popular Mechanics post about some of the weirdest actual hardware tools). This then becomes the standard for what should be taught and the course is now about cramming 10 years of experience with R into one semester. Not only is this too much to teach in such a short time, but it also takes the focus away from learning what is actually most important for simply plotting and analyzing data.

To be fair, I must say there are a lot of great professors out there that do recognize this issue and carefully focus on the most important functions and operations when teaching R, but those seem to be uncommon.

In a recent post, I shared a cheat sheet on the most common but important functions when using R for ecology. (Click here to see the post and download the cheat sheet). My goal there was to share the most common functions that also provide the most bang for the buck. In other words, the majority of all the code you will ever write in R comes down to just a handful of functions.

So why don’t most R courses start by focusing on those few functions first? Maybe for the same reasons that traditional language classes focus more on grammar and syntax than on just speaking? (Check out the Natural Order Hypothesis about learning new languages.)

To wrap this all up and summarize my point, I think that there are two primary reasons that there appears to be a difficult R learning curve and why so many students do end up having a truly difficult time with R.

First, teaching R is often confounded with teaching statistics. Pick one (preferably R first), and then move on to the other.

Second, start by only teaching the most essential and important functions first. Don’t overwhelm students with all the functions they might ever need to know. And if you know two ways to do the same thing? Just pick one.

So, what do you think about this topic? What are your Stork Beak Pliers in R?

Stork Beak Pliers: an example of an uncommon tool that has a specific purpose but not necessary for beginners to learn.

If you liked this post and want to learn more, then check out my online course on the complete basics of R for ecology:

Start learning now

The essential functions of R cheatsheet

Mon, 19 Jul 2021 00:00:00 +0000

👇 Download link is at the bottom of the post 👇

Something that I quickly came to learn as an ecologist using R is that out of the hundreds (possibly thousands?) of functions available in R, only a handful were those that I used frequently throughout my code.

I’m also learning to speak Spanish right now, and I’ve found that for learning a new language it is a good idea to start by focusing on the most common words, since only those few words account for a significant proportion of everything you’ll ever need to say.

Anyone familiar with Tim Ferriss probably knows about the 80-20 rule (Pareto’s principle) that he’s made popular throughout his books and podcasts. The rule simply states that 80% of results come from 20% of the work.

To apply that to learning a language, learning only a small proportion of words (20%) will allow you to say a large proportion (80%) of what you’d ever need to say. Now, these percentages might not be exactly the same for every application, but hopefully you get the point.

Now back to R! So, that’s what I did with all the functions I use in R. I found the “20%” of functions that I ever used in ecology that gave me the most results. In other words, if you learn these functions (51 functions to be exact), you will be well on your way to doing almost anything you need to do with your data. And if there’s something missing, that will be easy to learn when you need it.

So here is my version 1.0 of a cheat sheet on the essential functions of R (for ecology). Please enjoy and share! Notice a typo? Let me know in the comments below.

Click here to download the Essentials of R cheatsheet v1.0 (PDF)

How to do a simple linear regression in R

Fri, 09 Jul 2021 00:00:00 +0000

In this tutorial I show you how to do a simple linear regression in R that models the relationship between two numeric variables.

Check out this tutorial on YouTube if you’d prefer to follow along while I do the coding:

The first step is to load some data. We’ll use the ‘trees’ dataset that comes built in with R:

# Load the data:
data(trees)
# Show the top few entries:
head(trees)

## Girth Height Volume
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7

These data include measurements of the diameter, height and volume of 31 black cherry trees. Note that the ‘Girth’ is actually the diameter at breast height (DBH) in inches, ‘Height’ is height in feet, and ‘Volume’ is volume in cubic feet. So let’s just rename those variables for clarity:

# rename columns
names(trees) <- c("DBH_in","height_ft", "volume_ft3")
head(trees)

## DBH_in height_ft volume_ft3
## 1 8.3 70 10.3
## 2 8.6 65 10.3
## 3 8.8 63 10.2
## 4 10.5 72 16.4
## 5 10.7 81 18.8
## 6 10.8 83 19.7

There we go.

Now, for the basic linear regression, let’s model how the tree diameters change as they grow taller.

First, let’s start by writing out what it is we actually want to model. We want to know how DBH varies as a function of tree height, but we can also write that out as: DBH_in ~ height_ft, the tilde (~) being read as “is a function of.”

You can also think of this as the Y variable (the dependent or response variable) is a function of the X variable. How does Y depend on or respond to X?
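
A formula is an ordinary R object, so it can be stored once and reused. The sketch below shows that idea; the ‘dbh_formula’ name is just an illustrative choice.

```r
data(trees)
names(trees) <- c("DBH_in", "height_ft", "volume_ft3")

# Store the "is a function of" relationship as an object
dbh_formula <- DBH_in ~ height_ft
class(dbh_formula)       # "formula"
all.vars(dbh_formula)    # response first, then the predictor

# The same object can drive both the plot and the model
plot(dbh_formula, data = trees, pch = 16)
mod <- lm(dbh_formula, data = trees)
```

Keeping the formula in one place means the plot and the model cannot drift apart if you later change the variables.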

It’s important to note that we are not drawing any conclusions about the causal relationship between DBH and tree height; the linear regression analysis simply allows us to test the correlation or association of these two variables. This is very important to understand.

Now let’s visualize the potential association of these variables by plotting our model. The neat thing is that we can write out the plotting function using our “is a function of” notation:

plot(DBH_in ~ height_ft, data = trees, pch=16)

I’m not a fan of the open circle points, so I added in the argument ‘pch = 16’ to that plotting function to fill in the circles.

So there appears to be a trend of increasing DBH with increasing height, but is that trend statistically significant?

To test that, we will use the function lm(), which stands for linear model. The syntax is actually almost exactly the same as our plot! The only difference is that we will save the output of the model to its own object called ‘mod’:

# Run the linear model and save it as 'mod'
mod <- lm(DBH_in ~ height_ft, data = trees)
# let's view the output:
mod

##
## Call:
## lm(formula = DBH_in ~ height_ft, data = trees)
##
## Coefficients:
## (Intercept) height_ft
## -6.1884 0.2557

If we look at ‘mod’ we don’t get much to work with, just the coefficient estimates for the intercept and slope. But we can run the ‘summary’ function with ‘mod’ to get more interesting results:

summary(mod)

##
## Call:
## lm(formula = DBH_in ~ height_ft, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2386 -1.9205 -0.0714 2.7450 4.5384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.18839 5.96020 -1.038 0.30772
## height_ft 0.25575 0.07816 3.272 0.00276 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.728 on 29 degrees of freedom
## Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445
## F-statistic: 10.71 on 1 and 29 DF, p-value: 0.002758

Beneath ‘Call’, which shows us what our model looks like, we can see the distribution of the residuals (the differences between the observed and fitted values, in other words what the model leaves unexplained): the min and max, the 1st and 3rd quartiles, and the median.

But below that we have a table that gets a bit more interesting…

Remember that two coefficients get estimated from a basic linear model: The intercept and the slope. To model a line, we use the equation Y = a + bX, and the goal of the regression analysis is to estimate the a and the b.

In that first column we have the estimate for each coefficient. Then we have the standard error of those estimates, then the test statistic (the t value), and finally the p-value of each coefficient, which tests the null hypothesis that the true intercept or slope is zero. In our case, the p-value for the slope (height_ft) coefficient is less than 0.05, allowing you to say that the association between DBH_in and height_ft is statistically significant.
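
If you want to use those numbers in code rather than read them off the screen, the coefficient table can be pulled out as a matrix. This is a minimal sketch using base R only; the object names are illustrative.

```r
data(trees)
names(trees) <- c("DBH_in", "height_ft", "volume_ft3")
mod <- lm(DBH_in ~ height_ft, data = trees)

# The coefficient table from the summary, as a numeric matrix
coef_table <- coef(summary(mod))
colnames(coef_table)

# Pull out the p-value for the slope by row and column name
slope_p <- coef_table["height_ft", "Pr(>|t|)"]
slope_p < 0.05
```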

To make reading these results a bit easier, this model summary output also includes asterisk symbols that indicate the significance levels of the p-values.

Continuing to go down the summary we can see the residual standard error, and then we have the multiple R squared, or simply R². This can be thought of as the proportion of variance in the data explained by the model. You can ignore the adjusted R squared for now if you are just starting out.
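
To make the “proportion of variance explained” idea concrete, here is a short sketch that rebuilds R² by hand and compares it with the value in the summary. The variable names are illustrative.

```r
data(trees)
names(trees) <- c("DBH_in", "height_ft", "volume_ft3")
mod <- lm(DBH_in ~ height_ft, data = trees)

# Variation left over after fitting the line
rss <- sum(residuals(mod)^2)
# Total variation of DBH around its mean
tss <- sum((trees$DBH_in - mean(trees$DBH_in))^2)

r2_manual  <- 1 - rss / tss
r2_summary <- summary(mod)$r.squared

# With a single predictor, R-squared is also the squared correlation
r2_cor <- cor(trees$DBH_in, trees$height_ft)^2

c(manual = r2_manual, summary = r2_summary, correlation = r2_cor)
```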

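Finally we have the F-statistic and its p-value, which test whether all of the slope coefficients in the model (everything except the intercept) are zero. With only one predictor, this is the same test as the one for the height_ft slope, which is why the two p-values match.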
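
For a model with a single predictor, the overall F-statistic is just the square of the slope’s t value. The sketch below checks that relationship and recomputes the overall p-value with ‘pf()’; the object names are illustrative.

```r
data(trees)
names(trees) <- c("DBH_in", "height_ft", "volume_ft3")
mod <- lm(DBH_in ~ height_ft, data = trees)

t_slope <- coef(summary(mod))["height_ft", "t value"]
f_stat  <- summary(mod)$fstatistic   # named vector: value, numdf, dendf

# Upper-tail probability of the F distribution gives the overall p-value
f_p <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"], lower.tail = FALSE)

c(t_squared = t_slope^2, F = unname(f_stat["value"]), p = unname(f_p))
```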

Next we’ll add a line to our plot that shows the fitted line from this model.

All you have to do is first run the plot function that we ran before, and then run the ‘abline’ function with the model as its argument:

# Plot the scatterplot as before:
plot(DBH_in ~ height_ft, data = trees, pch=16)
# And then plot the fitted line:
abline(mod)
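
The fitted line is simply Y = a + bX with the estimated intercept and slope. As a sketch of that connection, the code below pulls out the two coefficients, computes one prediction by hand, and draws the same line explicitly. The height of 80 ft is an arbitrary value chosen for illustration.

```r
data(trees)
names(trees) <- c("DBH_in", "height_ft", "volume_ft3")
mod <- lm(DBH_in ~ height_ft, data = trees)

a <- coef(mod)[["(Intercept)"]]   # intercept
b <- coef(mod)[["height_ft"]]     # slope

# Predicted DBH for a tree 80 ft tall, by hand and with predict()
manual_dbh    <- a + b * 80
predicted_dbh <- predict(mod, newdata = data.frame(height_ft = 80))

# Draws the same line as abline(mod)
plot(DBH_in ~ height_ft, data = trees, pch = 16)
abline(a = a, b = b)
```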

Finally, you might need to extract a table with the regression results from the summary output, so I’ll show you a quick trick for doing that easily using the ‘broom’ package.

First, make sure to install the broom package if you haven’t already (though you only have to do this once for your computer), and then run the ‘library’ function to load up that package (you have to do this each time you open up R and start a new working session):

# Install the 'broom' package:
# install.packages('broom') #commented out since I already have it installed
# Then load the package:
library('broom')

Finally, use the ‘tidy’ function to extract the table of results from your model:

# extract the table
my_results <- tidy(mod)
my_results

## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -6.19 5.96 -1.04 0.308
## 2 height_ft 0.256 0.0782 3.27 0.00276

Note that the output is actually a tibble, which is better than a normal dataframe, but that’s for another lesson. ;)
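
If you would rather not install an extra package, a similar table can be put together with base R. This is a rough sketch rather than a replacement for ‘tidy()’: the column names stay as R prints them, and the output file goes to a temporary folder only so the example is safe to run.

```r
data(trees)
names(trees) <- c("DBH_in", "height_ft", "volume_ft3")
mod <- lm(DBH_in ~ height_ft, data = trees)

# Turn the coefficient matrix into a data frame with a 'term' column
results_df <- as.data.frame(coef(summary(mod)))
results_df <- cbind(term = rownames(results_df), results_df)
rownames(results_df) <- NULL
results_df

# Save it so it can be opened in a spreadsheet or pasted into a manuscript
out_file <- file.path(tempdir(), "regression_results.csv")
write.csv(results_df, out_file, row.names = FALSE)
```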

And that’s it!

How to organize your analyses with R Studio Projects

Mon, 05 Jul 2021 00:00:00 +0000

Here is a post that I am sharing from my old blog to get this one started. Enjoy!

In this post I’ll go over a basic method for organizing your ecological data analysis projects in R. Why do this? Reproducing analyses is critical for good science. There is nothing worse than trying to re-run a script when you finally get comments back from your reviewers only to find that your results are a bit different than before. What?! Speaking from personal experience, it’s taken days of blood, sweat, and tears to figure out what was different in the data, what code I was running in the wrong order, or that I was running the wrong code altogether! Start now and get in the habit of sticking to a system for organizing your R projects.

While there are many methods and variations on how to do this (see links at the end of the post), the scope of this current post is to offer a short and simple overview of my own method so that you can get started ASAP. Those that follow me know that I am a big fan of getting right into the code and data—that is the best way to learn. So let’s get to it.

1) Use R Studio for all your analyses. Some of you 1% hardcore coders might prefer the minimalist terminal-type interface included in the basic R download, but for everyone else, use R Studio. It’s a no-brainer. See my video tutorial here on how to install it.

2) Create a new project (File > New Project). The directory you set here will be the folder where you store your data, scripts, and other files related to your analysis.

3) Create the folder structure inside your project folder so that it looks like this:

“data” is where you keep your data, split into two folders, “raw” and “processed”. “Raw” is where you save your data as you entered or downloaded it (usually an Excel spreadsheet file), and “processed” is where you save the CSV file ready for reading into R.
“output” is where you save all the figures and tables that you generate with your R scripts.
“scripts” is where you keep all the R code files.
Finally, “temp” is not necessary, but I’ve found it very useful. It is a folder where I can save any temporary outputs or scripts that I want to test out or explore, but that I know should not get confused with the final output of my analyses.
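
You can create these folders by hand, or with a few lines of R. Here is a minimal sketch; the project location is a hypothetical temporary folder so the example is safe to run, and in practice you would point it at your own project directory.

```r
# Hypothetical project location, used only for illustration
project_dir <- file.path(tempdir(), "my_analysis")

folders <- c("data/raw", "data/processed", "output", "scripts", "temp")
for (folder in folders) {
  # recursive = TRUE also creates the parent "data" folder when needed
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

list.dirs(project_dir, full.names = FALSE, recursive = TRUE)
```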

4) Create your R scripts. Unless your analysis is very simple and direct, you should be using multiple scripts (pretty much always the case when your project is large enough for an entire publication). Ideally, each script should be a set of code that you can run in one go. This is not always possible, but strive for that and use a separate script for each component of the analysis. I recommend you create the following scripts right away:

Script for loading packages and custom functions
Script for cleaning up and preparing the data for analysis
Script for each analysis in the project. For example, in one study you might need both a figure that presents two histograms for visualization purposes, along with one linear mixed effects regression to test your primary hypothesis. Each of those should have its own script
Name each script using this format: “##_name_v#”, where ## indicates the order that the scripts should be run in, “name” is a descriptor, and “v#” indicates the version number. Sometimes you want to change the script, but should keep older versions in case you mess something up. That’s where saving a new file with an updated version makes sense. So, all together your first set of scripts might look like this:
00_packages_v1.r
01_dataclean_v1.r
02_HistogramFigure_v1.r
03_LMER_v1.r
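
One payoff of the numbered prefix is that alphabetical order is also the run order, so the whole analysis can be re-run in one go. The sketch below demonstrates that with two stand-in scripts written to a temporary folder; their contents are made up for illustration. Note that if several versions of the same script sit in the folder, this loop would run all of them, so older versions would need to be moved elsewhere first.

```r
# Throwaway "scripts" folder with two stand-in scripts (hypothetical contents)
scripts_dir <- file.path(tempdir(), "scripts_demo")
dir.create(scripts_dir, showWarnings = FALSE)
writeLines('message("loading packages")', file.path(scripts_dir, "00_packages_v1.r"))
writeLines('message("cleaning data")',    file.path(scripts_dir, "01_dataclean_v1.r"))

# The numeric prefix means sorting by name gives the intended order
script_files <- sort(list.files(scripts_dir, pattern = "\\.[rR]$", full.names = TRUE))

for (script in script_files) {
  source(script)
}
```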

5) Start off each R script with a good description of the entire project and particular scope of the script. The more comments the better, but more on script commenting in another post. Here’s an example:

That’s pretty much it! Each time you open the project in RStudio, all the scripts will open. Just make sure to run the packages and dataclean scripts before the others. By using RStudio Projects, there is no need to include a setwd() line; just add “data/processed/” before your filename whenever reading in any data, or add “output/” or “temp/” whenever exporting something.
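
As a sketch of what those relative paths look like in practice, the code below builds a tiny stand-in project in a temporary folder. The file names are hypothetical, and the ‘setwd()’ call is only there to imitate what opening an RStudio Project does for you automatically.

```r
# Stand-in project folder (hypothetical), so the example is safe to run
project_dir <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_dir, "data", "processed"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "output"), showWarnings = FALSE)
old_wd <- setwd(project_dir)   # an RStudio Project sets this for you

# Hypothetical processed data file, created here only for illustration
write.csv(head(airquality), "data/processed/airquality_clean.csv", row.names = FALSE)

# Reading data: path is relative to the project folder
dat <- read.csv("data/processed/airquality_clean.csv")

# Exporting a figure: same idea, with "output/" in front
pdf("output/wind_ozone.pdf", width = 7, height = 4.5)
plot(Ozone ~ Wind, data = dat, pch = 16)
invisible(dev.off())

figure_saved <- file.exists("output/wind_ozone.pdf")
setwd(old_wd)                  # return to where we started
```

Because none of the paths mention your own computer’s folders, the same scripts keep working when the project is moved or shared.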

If you want some longer in-depth explanations on code management in R, check out these other excellent blog posts:
