Sort Dataframe by Column in R
Using dplyr's arrange function to order our data frame based on a columns values.
By Daniel D. Bonneau in R Basics
June 25, 2022
The Data Set
For this project, we’ll use a data set that comes pre-loaded in R - mtcars
. If you haven’t used one of these data sets before, you can check out what data sets are present by using the data()
command and scrolling through the suggested options. Additionally, we’ll be using dplyr
, as we’re going to use the arrange()
function within the package to order our dataframe by a column.
library(dplyr)
data("mtcars")
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Order a Dataframe by Column in R
For us to order the data frame, we only need a few pieces of information: Our data frame, the column we want to sort by, and the direction we want to sort in. For reference, ascending order refers to values that increase as you move down the data set (i.e. the lowest values are at the top), and descending order sorts with the highest values at the top.
For this, we’ll start by just looking at ordering in ascending order.
Sort Dataframe in Ascending Order in R
To do this, all we need to do is pipe the arrange()
function from our data set and pass in the column name. We could also omit the pipe, but I want to display it here as you should become comfortable with ‘piping’ your data if you aren’t yet.
We’ll be using the weight (wt
) column for the original sort.
mtcars %>%
arrange(wt) %>%
head()
## mpg cyl disp hp drat wt qsec vs am gear carb
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
As you can see from the above output, we now have the lowest weight at the top of our data set and the weight increases as we move through it.
If we wanted to arrange in descending order, we have two options which I’ll demonstrate below.
Sort Dataframe in Descending Order in R
First, we can use the desc()
function around our column name within our arrange()
function. As the name suggests, this will arrange our data in descending order. However, we can also do the same thing with a little less code by placing a minus sign (-
) in front of our column name.
Here’s a demonstration of each way:
mtcars %>%
arrange(desc(wt)) %>%
head()
## mpg cyl disp hp drat wt qsec vs am gear carb
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
mtcars %>%
arrange(-wt) %>%
head()
## mpg cyl disp hp drat wt qsec vs am gear carb
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
With either method, we see that we now have the heaviest cars at the top of our data set. For the rest of this demonstration, I’ll be using the desc()
function (to see primary usage of the -
, I use that exclusively within the
YouTube video)
Sorting a Data Frame by Multiple Columns
If we wanted to present our data so it was sorted by multiple columns, we can just pass an additional column into the arrange()
function. For example, let’s say that we were interested in ordering our data set not just by weight, but by the number of cylinders (cyl
) as well.
library(tidyr) # for slice
mtcars %>%
arrange(cyl, desc(wt)) %>%
slice(1:15)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Here, what we did was ordered our data frame so cyl
was ordered in ascending order, and wt
is in descending order. In other words, we’re ordering so we have the fewest cylinders and after we make that sorting, we then order the cars within each cylinder group so they are ordered from heaviest to lightest.
So, we see in the output above it gives us all 4 cylinder cars first, ordered by heaviest to lightest weight. Then, it moves on to 6 cylinder cars and orders the weight similarly.
This feels like sorting within groups. But typically, whenever we’re working within dplyr and want to preserve group structures, we’ll use the formal group_by()
command. If we use this in the code above, let’s take a look at what happens.
Sorting Grouped Data Frame in R
mtcars %>%
group_by(cyl) %>%
arrange(desc(wt)) %>%
print(n = 20)
## # A tibble: 32 × 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 10.4 8 460 215 3 5.42 17.8 0 0 3 4
## 2 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
## 3 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
## 4 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
## 5 19.2 8 400 175 3.08 3.84 17.0 0 0 3 2
## 6 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
## 7 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
## 8 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
## 9 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 10 15 8 301 335 3.54 3.57 14.6 0 1 5 8
## 11 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
## 12 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 13 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 14 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## 15 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
## 16 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
## 17 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 18 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 19 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
## 20 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## # … with 12 more rows
Here, we can see that our data frame is simply ordered by weight and cylinders seem to be in no particular order. Well, let’s now try to see if we add cylinders into our arrange if that solves it.
mtcars %>%
group_by(cyl) %>%
arrange(cyl, desc(wt)) %>%
print(n = 20)
## # A tibble: 32 × 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 2 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 3 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
## 4 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
## 5 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 6 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
## 7 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
## 8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
## 9 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
## 10 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
## 11 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
## 12 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 13 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## 14 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
## 15 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 16 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 17 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## 18 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 19 10.4 8 460 215 3 5.42 17.8 0 0 3 4
## 20 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
## # … with 12 more rows
Here, everything seems to be as it should. We have 3 separate groups, and our data set is ordered by both cylinders and weight. However, there is a better way to do this. The arrange()
function includes a default argument: .by_group = FALSE
. With this, it is ignoring any grouping that occurs before we call our arrange()
function. If we simply change that argument to TRUE
instead, then we can omit cyl
from our arrange()
function.
Note: You can shorten TRUE
to T
and FALSE
to F
.
mtcars %>%
group_by(cyl) %>%
arrange(desc(wt), .by_group = T) %>%
print(n = 20)
## # A tibble: 32 × 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 2 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 3 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
## 4 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
## 5 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 6 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
## 7 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
## 8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
## 9 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
## 10 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
## 11 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
## 12 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 13 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## 14 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
## 15 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 16 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 17 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## 18 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 19 10.4 8 460 215 3 5.42 17.8 0 0 3 4
## 20 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
## # … with 12 more rows
But you might be wondering - they both do the same thing and it’s easier to remember to just pass the grouping column into the arrange()
function as well. Well, you’d be correct. However, it does limit the flexibility of your code. For example, if I wanted to transform this piece of code into a function where the user can pass in the column they want to group by and order by, this code would be much more prepared to handle the change into a function.
But don’t worry, we’ll cover that complexity at a later date!