This is a collection of useful R code and notes to save me from having to Google stuff.
To set up a new GitHub repository (repo) I did the following:
Go to GitHub and create a new repository
Name the new repo (e.g. ‘testrepo’).
Do not select the option to ‘Initialize this repository with a README’. Click the button ‘Create respository’
In RStudio, enter the following commands into the terminal:
echo "# testrepo" >> README.md git init
git add README.md
git commit -m "first commit"
git remote add origin <https://github.com/SteveRxD/testrepo.git>
git push -u origin master
To make a commit, type the following into the terminal:
# Add:
git add -A
# Commit:
git commit -m "description"
# Push
git push -u
You can create a standalone webpage by (1) knitting your markdown file to HTML, and (2) saving a copy of the html file to github.
Specifically, it must be saved to the [username].github.io repository. In my case, it should be uploaded to https://github.com/SteveRxD/SteveRxD.github.io
The file will then be available online, e.g. https://steverxd.github.io/Useful-R-code.html.
To read data from Excel I typically use the read_excel()
function from the readxl
package. An example is shown below:
df_spend <- readxl::read_excel(
path = "file_name_here.xlsx",
sheet = "sheet_name_here",
range = "A1:D100", # this can be excluded
skip = 0, # ignored if a range is specified above
na = "NA")
There are a number of alternative ways to specify the range of cells to be read. For example, to specify the top right point of a rectangle (whose upper right cell = D2) you could use the following:
range = cell_limits(c(2, NA), c (NA,4))
To specify the rows or columns to read, you could use the following:
read_excel(path, range = cell_rows(3:6))
read_excel(path, range = cell_cols("C:D"))
The documentation for the read_excel() function can be found here.
In some cases, when loading your data, will find that columns have been assigned to the wrong class. For example, a numeric column may have been interpreted as a character column, or a column of characters may be been loaded as factors.
To change the data type for a large number of columns you can use the across()
function as in the following examples:
# to change all columns from 'columnB' to 'columnE' to numeric
my_data %>% mutate(across(columnB:columnE, as.numeric))
# to change all columns that contain factors to characters
my_data %>% mutate(across(where(is.factor), as.character))
I typically format my data in Excel using ‘NA’ to denote missing values. This is automatically converted to a missing value using the read_excel()
function above, using the argument
na = "NA"
This ensures that columns with numbers are are correctly read in as numeric, rather than character variables, even if they contain a few ‘NA’ values.
If you have an existing data frame has the characters “NA” or “N/A” (or similar) to denote missing value, then you can manage these using the dplyr::if_na()
function.
For example, my_data %>% if_na("N/A")
will look through all the columns in the data frame and replace any instances of “N/A” with a missing value.
The following would be used to generate the same ranks from column ‘X’ as would be produced in Microsoft Excel:
my_rank = rank(desc(X), ties.method = 'min', na.last = 'keep')
The last argument, na.last
, specifies that any NA values in column X should be kept in the dataframe and assigned a rank of NA.
This provides an overview of the different types of charts: http://www.sthda.com/english/wiki/ggplot2-essentials
The following is a useful reference for understanding theme elements, by Isabella Benabaye (link):
You can set the default theme and appearance as follows:
# Set theme to theme_minimal() as a baseline
theme_set(theme_minimal())
# Use theme_update() to update theme elements you want to customize
theme_update(
# Set set the font size of axis titles and labels
plot.title = element_text(size = 16, face = "bold"),
axis.title = element_text(size = 14),
axis.text = element_text(size=12),
# add distance between axes and axes labels
axis.title.x = element_text(margin = margin(t = 10)), # t = top
axis.title.y = element_text(margin = margin(r = 10)) # r = right
)
This would produce the following chart:
p <- mtcars %>%
ggplot(aes(x = cyl, y = mpg))+
geom_point() +
ggtitle("MPG vs CYL")
p
Guidance on how to get, set and modify the active theme can be found here.
To add a vertical and/or horizontal axis line, you can change the theme as follows:
# theme(axis.line.x = element_line(colour = 'grey75'),
# axis.line.y = element_line(colour = 'grey75'))
For example, to add a horizontal axis line:
mtcars %>%
ggplot(aes(x = cyl, y = mpg))+
geom_point() +
theme(axis.line.x = element_line(color = 'grey75'))
Alternatively, you can use geom_hline
and geom_vline
to create the following objects, which can then be added to your charts:
add_axis_x <- geom_hline(yintercept = 0, color = 'grey75')
add_axis_y <- geom_vline(xintercept = 0, color = 'grey75')
For example:
mtcars %>%
ggplot(aes(x = cyl, y = mpg))+
geom_point() +
add_axis_x
Note that this means y = 0 will be visible in the chart, which was not the case when theme(axis.line.x)
was used.
To set the axis scales - for example, to change units to percent or comma format - use the scales
package.
# As a percentage, 0 decimal places
scale_y_continuous(labels = percent_format(accuracy = 1))
# As a percentage, 1 decimal place
scale_y_continuous(labels = percent_format(accuracy = 0.1))
# As a percentage, 2 decimal places
scale_y_continuous(labels = percent_format(accuracy = 0.01))
# To use a comma
scale_y_continuous(labels = comma_format())
There are two ways to add limits: (1) within scale_y_continuous
or scale_x_continuous
, or (2) using ylim
and xlim
.
The latter is faster, but if you have to use the former if you are already using these to change the scale of the axis (see section above).
For example, to set the y axis to show a maximum of 30 you would use the following (replacing ‘y’ for ‘x’ as neccessary):
p + scale_y_continuous(limits = c(NA,30))
p + ylim(ylims = c(NA,30))
An alternative is to use coord_cartesian
. This approach effectively ‘zooms in’ the relevant section of the chart and does not drop any of the data, unlike the two options above. For example:
p + coord_cartesian(ylim = c(NA, 30))
To change the title of an axis
# to change axis title descriptions
p + labs(x = 'New X axis label', y = 'New Y axis label')
To remove axis titles and text from the x axis. remove .x
for it to apply to both axes, or use .y
for it to apply to the y axis.
# to remove axis titles or text
theme(axis.title.x= element_blank(),
axis.text.x= element_blank(),
axis.ticks.x= element_blank())
To remove gridlines
# remove all grid lines
theme(panel.grid = element_blank())
# remove all major or all minor grid lines
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
# remove vertical gridlines
theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())
# remove horizontal gridlines
theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())
To edit the legend:
# to remove the theme legend
theme(legend.position = "none")
# to remove the legend title only
theme(legend.title = element_blank())
# to change the position of the legend within the plot
# specify the x, y coordinates, which fall within the range 0,1
theme(legend.position = c(0.8, 0.85))
You can define some common colors used in Excel:
excel_blue <- "#0070C0"
excel_blue_light <- "#9BC2E6"
excel_red <- "#FF0000"
excel_orange <- "#FF9900"
excel_orange_light <- "#FFC000"
excel_gray <- "#D9D9D9"
There are a couple of ways to add labels to your data points. One is to use geom_text
and another is to use gg_repel
The built-in option geom_text
allows you to add a label to each data point. Note that geom_label
will draw a rectangle behind the text. More options for geom_text
can be found here.
# to make it clearer, first create a column with car names
my_label <- rownames(mtcars)
# add labels to our plot
p +
geom_text(
label = my_label,
check_overlap = TRUE, # this prevents overlap
hjust = 0, # 0 = text at right/bottom, 1 = at top/left
nudge_x = 0.1 # or use nudge_y to shift verically.
) +
xlim(c(4,10)) # add extra space to ensure label is visible
This is part of the ggrepel
package.
More options can be found here.
p +
ggrepel::geom_text_repel(
label = my_label,
hjust = 0,
nudge_x = 1,
direction = "both", # use "x" or "y" if you only want horiz or vert
segment.color = "grey90", # the appearance of the line
segment.alpha = .8 # the appearance of the line
) +
xlim(c(4,12)) # add extra space to ensure label is visible
The stat_summary()
function can be used for adding summary statistics to charts, that include average bars and mean bars.
This can be added directly to the plot code using ggplot() + stat_summary()
, or you could define it in advance.
For example, to add a median or average line to the chart above we could define the following:
# for adding axis lines
add_axis_x <- geom_hline(yintercept = 0, color = 'grey75')
add_axis_y <- geom_vline(xintercept = 0, color = 'grey75')
# for adding median lines
add_median <- stat_summary(fun = median, fun.min = median,
fun.max = median, geom = "crossbar",
width = 0.4, size = .2)
# for adding average lines
add_mean <- stat_summary(fun = mean, fun.min = mean, fun.max = mean,
geom = "crossbar",
width = 0.4, size = .2)
# for adding hollow bars (in this case median)
add_bar <- geom_bar(position = 'dodge', stat = 'summary', fun = 'median',
width = .4, color = 'black', fill = 'grey', alpha = 0.5)
Then add it to our plot:
p + add_median
Or if we wanted to superimpose our points over a bar, we could add the bar first:
p <- mtcars %>%
ggplot(aes(x = cyl, y = mpg))+
add_bar +
geom_point() +
add_axis_x +
ggtitle("MPG vs CYL")
p
I generally use ggsave()
, which saves the last generated plot to an image file.
For example, the following would save the last plot to
ggsave('test_image.png', width = 4, height = 7)
Common tasks when working with strings include:
As an example, say we have the following data set. It has spelling errors (“Dalas”) and includes the word “error” where we might want null.
df_travel_messy <- tibble::tribble(
~from, ~to, ~via, ~minutes, ~charge,
"Dalas", "Forth Worth", "Denton", "80", "1.2",
"Fort Worth", "Denton", "Dalas", "error", "1.1",
"Denton", "Dalas", "Fort Worth", "70", "error"
)
df_travel_messy
from | to | via | minutes | charge |
---|---|---|---|---|
Dalas | Forth Worth | Denton | 80 | 1.2 |
Fort Worth | Denton | Dalas | error | 1.1 |
Denton | Dalas | Fort Worth | 70 | error |
Equivalent function to ‘find and replace’ in Excel is str_replace_all()
. For example, to correct the city name in the first column we could do the following:
df_travel_messy %>%
mutate(
from = str_replace_all(from, pattern = 'Dalas', replacement = 'Dallas')
)
from | to | via | minutes | charge |
---|---|---|---|---|
Dallas | Forth Worth | Denton | 80 | 1.2 |
Fort Worth | Denton | Dalas | error | 1.1 |
Denton | Dalas | Fort Worth | 70 | error |
To replace the city name in all columns we could use a combination of str_replace_all()
and across()
:
df_travel_messy %>%
mutate(across(everything(), str_replace_all, "Dalas", "Dallas"))
from | to | via | minutes | charge |
---|---|---|---|---|
Dallas | Forth Worth | Denton | 80 | 1.2 |
Fort Worth | Denton | Dallas | error | 1.1 |
Denton | Dallas | Fort Worth | 70 | error |
Or if we only wanted to change the city name in the first two columns:
df_travel_messy %>%
mutate(across(from:to, str_replace_all, "Dalas", "Dallas"))
from | to | via | minutes | charge |
---|---|---|---|---|
Dallas | Forth Worth | Denton | 80 | 1.2 |
Fort Worth | Denton | Dalas | error | 1.1 |
Denton | Dallas | Fort Worth | 70 | error |
To replace a string with NA, you need to specify the replacement as NA_character_
:
df_travel_messy %>%
mutate(across(everything(), str_replace_all, "error", NA_character_))
from | to | via | minutes | charge |
---|---|---|---|---|
Dalas | Forth Worth | Denton | 80 | 1.2 |
Fort Worth | Denton | Dalas | NA | 1.1 |
Denton | Dalas | Fort Worth | 70 | NA |
Alternatively it may be ‘cleaner’ to use the na_if
function:
df_travel_messy %>%
mutate(across(everything(), na_if, "error"))
from | to | via | minutes | charge |
---|---|---|---|---|
Dalas | Forth Worth | Denton | 80 | 1.2 |
Fort Worth | Denton | Dalas | NA | 1.1 |
Denton | Dalas | Fort Worth | 70 | NA |
Here is an example of how to use the na_if
function in a single column:
df_travel_messy %>%
mutate(minutes = na_if(minutes, "error"))
from | to | via | minutes | charge |
---|---|---|---|---|
Dalas | Forth Worth | Denton | 80 | 1.2 |
Fort Worth | Denton | Dalas | NA | 1.1 |
Denton | Dalas | Fort Worth | 70 | error |
For example, to filter rows with cities that include the letters “Fort”:
df_travel_messy %>%
filter(str_detect(from,'Fort'))
from | to | via | minutes | charge |
---|---|---|---|---|
Fort Worth | Denton | Dalas | error | 1.1 |
Other regular expressions can be used, e.g. to specify words starting (^) or ending ($) with a character.
# cities in the 'from' column starting with the letter 'd'
df_travel_messy %>%
filter(str_detect(from, "^D"))
from | to | via | minutes | charge |
---|---|---|---|---|
Dalas | Forth Worth | Denton | 80 | 1.2 |
Denton | Dalas | Fort Worth | 70 | error |
# cities in the 'from' column ending with the letter 'n'
df_travel_messy %>%
filter(str_detect(from, "n$"))
from | to | via | minutes | charge |
---|---|---|---|---|
Denton | Dalas | Fort Worth | 70 | error |