This post describes how to use the R dplyr package to calculate percentages within groups. The data set comes from the U.S. Census Bureau, and three tests check that the dplyr calculation is accurate. The code uses the pipe operator %>%, which was introduced into R in 2014 via the magrittr package.
The general question is: what percentage of a subgroup's total does each individual value in that subgroup represent? Stated this way the question is accurate but abstract, so an example in SQL makes it concrete. The following statement computes each employee's salary as a percentage of the total salary within that employee's department:
SELECT depname, empno, salary, 100.0 * salary / sum(salary) OVER (PARTITION BY depname) AS pct FROM empsalary;
In dplyr, the calculation plays the same role as a window function in SQL.
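As a rough illustration of that correspondence, here is a minimal sketch of the same per-department percentage in dplyr. The empsalary rows are made-up toy values and the column name pct_of_dept is my own choice for the example; neither comes from the SQL source.

```r
library(dplyr)

# Hypothetical toy data with the same columns as the SQL example
empsalary <- data.frame(
  depname = c("sales", "sales", "develop", "develop"),
  empno   = c(1, 2, 3, 4),
  salary  = c(5000, 3000, 6000, 2000),
  stringsAsFactors = FALSE
)

# group_by() plays the role of PARTITION BY; mutate() keeps one row per
# employee, so sum(salary) is evaluated within each department
salary_share <- empsalary %>%
  group_by(depname) %>%
  mutate(pct_of_dept = 100 * salary / sum(salary)) %>%
  ungroup()

print(salary_share)
```

Because mutate() is used rather than summarise(), every input row is kept, which is what makes this behave like a window function instead of an aggregate.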
Calculation in dplyr
Here’s the code:
Census data set
The data set is publicly available; the URL is listed in the read.table() call on line 10 of the gist. I selected this data set because it has useful characteristics: it contains data that can be grouped, and it already includes a calculated field that I could compare my results against. Additionally, I didn’t want Puerto Rico in the results, so I used filter() to remove it before running the calculation. The aim was to make sure I knew how to use dplyr.
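The filter-then-group pattern described above can be sketched as follows. This is not the code from the gist: the data frame, the column names name, region, and pop18plus, and all of the numbers are invented for illustration and are not Census figures. Only pct18Plus matches a name used in this post.

```r
library(dplyr)

# Illustrative rows only: names and numbers are made up, not Census data
states <- data.frame(
  name      = c("State A", "State B", "State C", "Puerto Rico"),
  region    = c("4", "4", "3", "X"),
  pop18plus = c(300, 100, 250, 50),
  stringsAsFactors = FALSE
)

# Drop Puerto Rico first, then compute each state's share of its region
pct_by_region <- states %>%
  filter(name != "Puerto Rico") %>%
  group_by(region) %>%
  mutate(pct18Plus = 100 * pop18plus / sum(pop18plus)) %>%
  ungroup()

# One possible sanity check (not necessarily one of the three tests in
# this post): the shares within every region should add up to 100
region_totals <- pct_by_region %>%
  group_by(region) %>%
  summarise(total_pct = sum(pct18Plus))

print(pct_by_region)
print(region_totals)
```

Filtering before grouping matters here: if the excluded row stayed in, it would either inflate a region's denominator or show up as a group of its own.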
Here’s a screenshot of the results for region 4, the West:
The field I calculated is pct18Plus, the right-most column in the table above. (This is a screenshot of the result, so the column header isn’t included.)
The result for Wyoming is interesting. The calculated percentage is so small because Wyoming's population is small compared to the population of California, not because the 18+ population in Wyoming is a small percentage of Wyoming’s own population.
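The difference between those two percentages is easy to see with a toy example. The sketch below uses a hypothetical two-state region with invented numbers (not Census figures) and hypothetical column names; it contrasts a state's share of the region's 18+ population with the 18+ share of the state's own population.

```r
library(dplyr)

# Made-up numbers for a hypothetical two-state region
toy_region <- data.frame(
  name      = c("Large state", "Small state"),
  pop_total = c(40000, 600),
  pop18plus = c(30000, 450),
  stringsAsFactors = FALSE
)

comparison <- toy_region %>%
  mutate(
    # Share of the region's 18+ population: denominator is the regional sum
    pct_of_region_18plus = 100 * pop18plus / sum(pop18plus),
    # Share of the state's own population: denominator is that state's total
    pct_of_own_state = 100 * pop18plus / pop_total
  )

print(comparison)
```

Both toy states have the same 18+ share of their own population, yet the small state accounts for only a sliver of the regional total, which is the same effect seen for Wyoming.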
The R dplyr package just works. It’s easier than using base R to complete the same tasks. It’s not arcane. It’s elegant to read and use, and it’s decreased my development time.