Convert Revenue Data to Numeric
Using the gsub function to turn revenue data into a type of numeric
By Daniel D. Bonneau in R Basics
June 24, 2022
The Data Set
library(dplyr)
rev_df <- read.csv("simple_revenue.csv")
glimpse(rev_df)
## Rows: 30
## Columns: 2
## $ date <chr> "6/1/2022", "6/2/2022", "6/3/2022", "6/4/2022", "6/5/2022", "6…
## $ revenue <chr> "$307.00 ", "$557.00 ", "$549.00 ", "$1,159.00 ", "$1,525.00 "…
For this article, we’ll be using the simple revenue data set I’ve used previously. This data set contains each day in June, 2022 and some made up revenue numbers for each day. As you can see from above, we have both our date and revenue columns reading in as characters. Here, we’ll focus on converting our revenue data to numeric values. To see how we can transform our date column into a proper Date type, see here.
Character’s Cause Problems with Numbers in R
Often, when we try to read in revenue data in R, our revenue numbers are read in as character strings. If this is the case and we try to do something, such as creating a total revenue value that sums up our column, you’re likely to see an error such as this:
sum(rev_df$revenue)
## Error in sum(rev_df$revenue): invalid 'type' (character) of argument
In this article, I’ll walk you through the steps to remove the unnecessary characters (talking about the $
and ,
in this particular case) so that we can convert that value into an actual number to perform mathematical operations with the data.
Will Using as.numeric()
Work?
In some cases, the answer to that question will be yes, without requiring us to do any additional transformation. For example, if we have values that are simple read in as characters but don’t contain any special strings, we’ll be able to just pass those values into as.numeric()
and move on with what we’re trying to do. An example of something like that would be:
nums <- c("1234", "5678")
as.numeric(nums)
## [1] 1234 5678
Here, we create a vector of two strings called nums
. While each element of the vector only contains numbers, they are considered to be strings because we wrap each in quotation marks. However, because there are no special characters, we are able to use the as.numeric()
function to cleanly transform our nums
vector into a numeric value.
Now, let’s consider the simple example where it does not work. We’ll create a very similar example to the code block above, but let’s throw in a comma.
nums_comma <- c("1,234", "5,678")
as.numeric(nums_comma)
## Warning: NAs introduced by coercion
## [1] NA NA
Notice what happens now. NAs introduced by coercion
is a common error you may be likely to encounter as you begin working in R. All this really means is that something you did caused an NA value to appear in the data set. In this case, it was us trying to convert a string that contained a comma into a numeric value.
With this simple example in mind, let’s move in to our actual data and take the necessary steps to convert our revenue data into numeric values in R.
Converting Revenue Data to Numeric in R
First, let’s just try something that we know likely won’t work due to the example above - just using as.numeric()
.
as.numeric(rev_df$revenue)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA
Alright, so we should at least feel comfortable that we’re now seeing the error that we were expecting to see - NAs introduced by coercion
.
If we take a look at our data again, we’ll see that our revenue data is structured like so: $1,159.00
. Therefore, we can immediately see that we have 3 elements that aren’t just numbers: a dollar sign, comma, and period. Luckily, the period is fine as that’s how we would represent decimals. But the dollar sign and comma do present us with issues.
Using gsub() to Remove Dollar Signs
To remove each of these characters from our revenue column, we’re going to use the gsub()
function. The structure for this function is very straightforward and includes 3 arguments you’ll give to it each time you use it:
gsub(What do you want to replace?, What do you want to replace it with?, Where do you want to replace it?)
If we want to remove dollar signs (or any other special characters), we’ll need to use the ‘escape’ command - \\
.
So, let’s just verbally talk about what we’re wanting to do:
- What do we want to replace? - The dollar sign (
\\$
) - What do we want to replace it with? - Nothing (
""
) - Where do we want to replace it? - In the revenue column (
rev_df$revenue
)
Now we can just put it all together and see what we get:
rev_df %>%
mutate(no_dollar = gsub("\\$", "", revenue))
## date revenue no_dollar
## 1 6/1/2022 $307.00 307.00
## 2 6/2/2022 $557.00 557.00
## 3 6/3/2022 $549.00 549.00
## 4 6/4/2022 $1,159.00 1,159.00
## 5 6/5/2022 $1,525.00 1,525.00
## 6 6/6/2022 $1,310.00 1,310.00
## 7 6/7/2022 $1,257.00 1,257.00
## 8 6/8/2022 $1,471.00 1,471.00
## 9 6/9/2022 $1,293.00 1,293.00
## 10 6/10/2022 $1,958.00 1,958.00
## 11 6/11/2022 $944.00 944.00
## 12 6/12/2022 $1,989.00 1,989.00
## 13 6/13/2022 $187.00 187.00
## 14 6/14/2022 $1,370.00 1,370.00
## 15 6/15/2022 $1,442.00 1,442.00
## 16 6/16/2022 $1,321.00 1,321.00
## 17 6/17/2022 $1,829.00 1,829.00
## 18 6/18/2022 $1,958.00 1,958.00
## 19 6/19/2022 $1,511.00 1,511.00
## 20 6/20/2022 $1,087.00 1,087.00
## 21 6/21/2022 $1,687.00 1,687.00
## 22 6/22/2022 $480.00 480.00
## 23 6/23/2022 $1,826.00 1,826.00
## 24 6/24/2022 $566.00 566.00
## 25 6/25/2022 $932.00 932.00
## 26 6/26/2022 $114.00 114.00
## 27 6/27/2022 $418.00 418.00
## 28 6/28/2022 $333.00 333.00
## 29 6/29/2022 $1,683.00 1,683.00
## 30 6/30/2022 $413.00 413.00
Perfect, so now we can see inside of the no_dollar
column we created, there is no longer a dollar sign! Let’s move on to the comma.
Using gsub() to Remove Commas
If we were to run as.numeric()
on our no_dollar
column, it still wouldn’t work for some of our values. As you can see above, some of the revenue numbers have commas, while others don’t. Those that have commas would still be NA
, while those without commas would convert to a number without issue. But we don’t want only some of our data to convert to numbers, rather we want to do it to everything. To do that, let’s run through the same excersize as above:
- Replace what? - Commas (
,
) - With what? - Nothing (
""
) - Where? - In the
no_dollar
column (rev_df$no_dollar
)
Doing so we get:
rev_df %>%
mutate(no_dollar = gsub("\\$", "", revenue)) %>%
mutate(all_clean = gsub(",", "", no_dollar))
## date revenue no_dollar all_clean
## 1 6/1/2022 $307.00 307.00 307.00
## 2 6/2/2022 $557.00 557.00 557.00
## 3 6/3/2022 $549.00 549.00 549.00
## 4 6/4/2022 $1,159.00 1,159.00 1159.00
## 5 6/5/2022 $1,525.00 1,525.00 1525.00
## 6 6/6/2022 $1,310.00 1,310.00 1310.00
## 7 6/7/2022 $1,257.00 1,257.00 1257.00
## 8 6/8/2022 $1,471.00 1,471.00 1471.00
## 9 6/9/2022 $1,293.00 1,293.00 1293.00
## 10 6/10/2022 $1,958.00 1,958.00 1958.00
## 11 6/11/2022 $944.00 944.00 944.00
## 12 6/12/2022 $1,989.00 1,989.00 1989.00
## 13 6/13/2022 $187.00 187.00 187.00
## 14 6/14/2022 $1,370.00 1,370.00 1370.00
## 15 6/15/2022 $1,442.00 1,442.00 1442.00
## 16 6/16/2022 $1,321.00 1,321.00 1321.00
## 17 6/17/2022 $1,829.00 1,829.00 1829.00
## 18 6/18/2022 $1,958.00 1,958.00 1958.00
## 19 6/19/2022 $1,511.00 1,511.00 1511.00
## 20 6/20/2022 $1,087.00 1,087.00 1087.00
## 21 6/21/2022 $1,687.00 1,687.00 1687.00
## 22 6/22/2022 $480.00 480.00 480.00
## 23 6/23/2022 $1,826.00 1,826.00 1826.00
## 24 6/24/2022 $566.00 566.00 566.00
## 25 6/25/2022 $932.00 932.00 932.00
## 26 6/26/2022 $114.00 114.00 114.00
## 27 6/27/2022 $418.00 418.00 418.00
## 28 6/28/2022 $333.00 333.00 333.00
## 29 6/29/2022 $1,683.00 1,683.00 1683.00
## 30 6/30/2022 $413.00 413.00 413.00
Excellent, now we can see inside of the all_clean
column, we no longer have dollar signs, and we no longer have commas! However, there’s still one more step. If we want these to become numbers, we still need to use the as.numeric()
function. The whole purpose of the previous two sections was simply to clean the data so we can pass it in to the as.numeric()
function without producing any NA
s. Let’s go ahead and do that and take a glimpse()
of our data.
Converting Cleaned Data to Numeric
rev_df %>%
mutate(no_dollar = gsub("\\$", "", revenue)) %>%
mutate(all_clean = gsub(",", "", no_dollar)) %>%
mutate(numeric = as.numeric(all_clean)) %>%
glimpse()
## Rows: 30
## Columns: 5
## $ date <chr> "6/1/2022", "6/2/2022", "6/3/2022", "6/4/2022", "6/5/2022", …
## $ revenue <chr> "$307.00 ", "$557.00 ", "$549.00 ", "$1,159.00 ", "$1,525.00…
## $ no_dollar <chr> "307.00 ", "557.00 ", "549.00 ", "1,159.00 ", "1,525.00 ", "…
## $ all_clean <chr> "307.00 ", "557.00 ", "549.00 ", "1159.00 ", "1525.00 ", "13…
## $ numeric <dbl> 307, 557, 549, 1159, 1525, 1310, 1257, 1471, 1293, 1958, 944…
In the code above, we take our all_clean()
column and pass it in to the as.numeric()
function. Finally, we wrap it all up with the glimpse()
function to give us a quick peek into our data. Here, we see that our no_dollar
and all_clean
columns are still of character types (<chr>
). But, the numeric column we made is now a double (<dbl>
) and we can run mathematical functions without any issue.
Let’s go back to our previous attempt of summing all of the data. For this, I’ll put everything together we did above into more streamlined code by combining both arguments within our gsub()
function and immediately wrapping that inside of as.numeric()
.
**Note: Any time we want to pass multiple arguments into gsub()
, we can just separate it with a pipe (|
).
Putting it All Together Into Streamlined Code
rev_df %>%
mutate(rev_cleaned = as.numeric(gsub("\\$|,", "", revenue))) %>%
as_tibble() %>% # Just doing this to get types under column names
head()
## # A tibble: 6 × 3
## date revenue rev_cleaned
## <chr> <chr> <dbl>
## 1 6/1/2022 "$307.00 " 307
## 2 6/2/2022 "$557.00 " 557
## 3 6/3/2022 "$549.00 " 549
## 4 6/4/2022 "$1,159.00 " 1159
## 5 6/5/2022 "$1,525.00 " 1525
## 6 6/6/2022 "$1,310.00 " 1310
Perfect, so we now see the streamlined way of combining all of that code into one simple line. Now… back to the example from earlier:
cleaned_df <- rev_df %>%
mutate(rev_cleaned = as.numeric(gsub("\\$|,", "", revenue)))
sum(cleaned_df$rev_cleaned)
## [1] 33476
As you can see, we run the sum()
the same way we did at the beginning of this article. The only difference is that we create a new data frame (cleaned_df
) that includes our rev_cleaned
variable that does all of the transformations within the gsub()
function like we talked about and then converts it to a number using as.numeric()
.
Closing Thoughts
We want into some depth in this, and I want to make it clear why. If your data is not properly transformed to the correct type, you may think you’re ready to run analysis or visualizations and have no idea why it’s not working. Taking the care to check all of your datas types before you get started with anything is a necessary step that will make sure you won’t run into issues in the next steps of working with your data.
I hope this was useful and you now have a clearer understanding not only of how to transform your data into numeric, but also why you’re getting the errors that you’re getting when attempting to do this process.