This is a collection of useful R code and notes to save me from having to Google stuff.


GitHub

New GitHub repository

To set up a new GitHub repository (repo) I did the following:

  1. Go to GitHub and create a new repository

  2. Name the new repo (e.g. ‘testrepo’).

  3. Do not select the option to ‘Initialize this repository with a README’. Click the button ‘Create respository’

  4. In RStudio, enter the following commands into the terminal:

echo "# testrepo" >> README.md git init
git add README.md
git commit -m "first commit"
git remote add origin <https://github.com/SteveRxD/testrepo.git>
git push -u origin master

Making commits

To make a commit, type the following into the terminal:

# Add: 
git add -A
# Commit: 
git commit -m "description"
# Push
git push -u 

Standalone webpages

You can create a standalone webpage by (1) knitting your markdown file to HTML, and (2) saving a copy of the html file to github.

Specifically, it must be saved to the [username].github.io repository. In my case, it should be uploaded to https://github.com/SteveRxD/SteveRxD.github.io

The file will then be available online, e.g. https://steverxd.github.io/Useful-R-code.html.


Importing data

To read data from Excel I typically use the read_excel() function from the readxl package. An example is shown below:

df_spend <-  readxl::read_excel(
  path = "file_name_here.xlsx", 
  sheet = "sheet_name_here", 
  range = "A1:D100", # this can be excluded 
  skip = 0, # ignored if a range is specified above
  na = "NA")

There are a number of alternative ways to specify the range of cells to be read. For example, to specify the top right point of a rectangle (whose upper right cell = D2) you could use the following:

range = cell_limits(c(2, NA), c (NA,4))

To specify the rows or columns to read, you could use the following:

read_excel(path, range = cell_rows(3:6))

read_excel(path, range = cell_cols("C:D"))

The documentation for the read_excel() function can be found here.


Cleaning & wrangling data

Convert column types

In some cases, when loading your data, will find that columns have been assigned to the wrong class. For example, a numeric column may have been interpreted as a character column, or a column of characters may be been loaded as factors.

To change the data type for a large number of columns you can use the across() function as in the following examples:

# to change all columns from 'columnB' to 'columnE' to numeric
my_data %>% mutate(across(columnB:columnE, as.numeric))

# to change all columns that contain factors to characters 
my_data %>% mutate(across(where(is.factor), as.character))

Replacing ‘NA’

I typically format my data in Excel using ‘NA’ to denote missing values. This is automatically converted to a missing value using the read_excel() function above, using the argument

na = "NA"

This ensures that columns with numbers are are correctly read in as numeric, rather than character variables, even if they contain a few ‘NA’ values.

If you have an existing data frame has the characters “NA” or “N/A” (or similar) to denote missing value, then you can manage these using the dplyr::if_na() function.

For example, my_data %>% if_na("N/A") will look through all the columns in the data frame and replace any instances of “N/A” with a missing value.

Create ranks like Excel

The following would be used to generate the same ranks from column ‘X’ as would be produced in Microsoft Excel:

my_rank = rank(desc(X), ties.method = 'min', na.last = 'keep')

The last argument, na.last, specifies that any NA values in column X should be kept in the dataframe and assigned a rank of NA.


Plots

Useful resources

This provides an overview of the different types of charts: http://www.sthda.com/english/wiki/ggplot2-essentials

The following is a useful reference for understanding theme elements, by Isabella Benabaye (link):

Plot themes

You can set the default theme and appearance as follows:

# Set theme to theme_minimal() as a baseline
theme_set(theme_minimal())

# Use theme_update() to update theme elements you want to customize
theme_update(
    # Set set the font size of axis titles and labels
    plot.title = element_text(size = 16, face = "bold"),
    axis.title = element_text(size = 14),
    axis.text = element_text(size=12),
  
    # add distance between axes and axes labels
    axis.title.x = element_text(margin = margin(t = 10)), # t = top
    axis.title.y = element_text(margin = margin(r = 10))  # r = right
  )


This would produce the following chart:

p <- mtcars %>% 
      ggplot(aes(x = cyl, y = mpg))+
      geom_point() +
      ggtitle("MPG vs CYL")
p

Guidance on how to get, set and modify the active theme can be found here.

Working with the axis

Add axis lines

To add a vertical and/or horizontal axis line, you can change the theme as follows:

# theme(axis.line.x = element_line(colour = 'grey75'),
#       axis.line.y = element_line(colour = 'grey75'))

For example, to add a horizontal axis line:

mtcars %>% 
  
ggplot(aes(x = cyl, y = mpg))+
geom_point() +
theme(axis.line.x = element_line(color = 'grey75'))

Alternatively, you can use geom_hline and geom_vline to create the following objects, which can then be added to your charts:

add_axis_x <- geom_hline(yintercept = 0, color = 'grey75')
add_axis_y <- geom_vline(xintercept = 0, color = 'grey75')

For example:

mtcars %>% 
  
ggplot(aes(x = cyl, y = mpg))+
geom_point() +
add_axis_x

Note that this means y = 0 will be visible in the chart, which was not the case when theme(axis.line.x) was used.

Setting axis scales

To set the axis scales - for example, to change units to percent or comma format - use the scales package.

  # As a percentage, 0 decimal places
  scale_y_continuous(labels = percent_format(accuracy = 1))
  # As a percentage, 1 decimal place
  scale_y_continuous(labels = percent_format(accuracy = 0.1))
  # As a percentage, 2 decimal places
  scale_y_continuous(labels = percent_format(accuracy = 0.01))
  
  # To use a comma
  scale_y_continuous(labels = comma_format())


Axis limits

There are two ways to add limits: (1) within scale_y_continuous or scale_x_continuous, or (2) using ylim and xlim.

The latter is faster, but if you have to use the former if you are already using these to change the scale of the axis (see section above).

For example, to set the y axis to show a maximum of 30 you would use the following (replacing ‘y’ for ‘x’ as neccessary):

p + scale_y_continuous(limits = c(NA,30))
p + ylim(ylims = c(NA,30))

An alternative is to use coord_cartesian. This approach effectively ‘zooms in’ the relevant section of the chart and does not drop any of the data, unlike the two options above. For example:

p + coord_cartesian(ylim = c(NA, 30))


Axis labels and text

To change the title of an axis

# to change axis title descriptions
p + labs(x = 'New X axis label', y = 'New Y axis label')

To remove axis titles and text from the x axis. remove .x for it to apply to both axes, or use .y for it to apply to the y axis.

# to remove axis titles or text
 theme(axis.title.x= element_blank(),
       axis.text.x= element_blank(),
       axis.ticks.x= element_blank())


Gridlines

To remove gridlines

# remove all grid lines
theme(panel.grid = element_blank())

# remove all major or all minor grid lines
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

# remove vertical gridlines
theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())

# remove horizontal gridlines
theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())


Titles, subtitles & captions

Titles, subtitles and captions can be added to labs() as shown above

To change the title of an axis

# to change axis title descriptions
p + labs(title = 'Chart title',
         subtitle = 'Subtitle',
         caption = 'This is a caption',
         x = 'New X axis label', 
         y = 'New Y axis label')


Legend

To edit the legend:

# to remove the theme legend
theme(legend.position = "none")
      
# to remove the legend title only
theme(legend.title = element_blank())

# to change the position of the legend within the plot
# specify the x, y coordinates, which fall within the range 0,1
theme(legend.position = c(0.8, 0.85))


Colors

You can define some common colors used in Excel:

excel_blue         <- "#0070C0"
excel_blue_light   <- "#9BC2E6"
excel_red          <- "#FF0000"
excel_orange       <- "#FF9900"
excel_orange_light <- "#FFC000"
excel_gray         <- "#D9D9D9"


Label data points

There are a couple of ways to add labels to your data points. One is to use geom_text and another is to use gg_repel

Using geom_text

The built-in option geom_text allows you to add a label to each data point. Note that geom_label will draw a rectangle behind the text. More options for geom_text can be found here.

# to make it clearer, first create a column with car names
my_label <- rownames(mtcars)


# add labels to our plot
p + 
  geom_text(
    label = my_label,
    check_overlap = TRUE, # this prevents overlap
    hjust = 0,            # 0 = text at right/bottom, 1 = at top/left
    nudge_x = 0.1         # or use nudge_y to shift verically.
    ) +
  
    xlim(c(4,10))         # add extra space to ensure label is visible

Using geom_text_repel

This is part of the ggrepel package.

More options can be found here.

p + 
  ggrepel::geom_text_repel(
    label = my_label,
    hjust = 0,
    nudge_x = 1,
    direction = "both", # use "x" or "y" if you only want horiz or vert 
    segment.color = "grey90", # the appearance of the line
    segment.alpha = .8        # the appearance of the line
  ) +
  
  xlim(c(4,12))         # add extra space to ensure label is visible

Adding summary statistics

The stat_summary() function can be used for adding summary statistics to charts, that include average bars and mean bars.

This can be added directly to the plot code using ggplot() + stat_summary(), or you could define it in advance.

For example, to add a median or average line to the chart above we could define the following:

# for adding axis lines 

  add_axis_x <- geom_hline(yintercept = 0, color = 'grey75')
  add_axis_y <- geom_vline(xintercept = 0, color = 'grey75')
  
# for adding median lines  
  add_median <- stat_summary(fun = median, fun.min = median, 
                             fun.max = median, geom = "crossbar", 
                             width = 0.4, size = .2)

# for adding average lines  
  add_mean   <- stat_summary(fun = mean, fun.min = mean, fun.max = mean,
                             geom = "crossbar", 
                             width = 0.4, size = .2)
  
# for adding hollow bars (in this case median)
  add_bar <- geom_bar(position = 'dodge', stat = 'summary', fun = 'median',
             width = .4, color = 'black', fill = 'grey', alpha = 0.5)

Then add it to our plot:

p + add_median

Or if we wanted to superimpose our points over a bar, we could add the bar first:

p <- mtcars %>% 
      ggplot(aes(x = cyl, y = mpg))+
      add_bar +
      geom_point() +
      add_axis_x +
      ggtitle("MPG vs CYL")
p


Saving plots

I generally use ggsave(), which saves the last generated plot to an image file.

For example, the following would save the last plot to

ggsave('test_image.png', width = 4, height = 7)

Working with strings

Common tasks when working with strings include:

  • Finding and replacing values
  • Filtering based on text


As an example, say we have the following data set. It has spelling errors (“Dalas”) and includes the word “error” where we might want null.

df_travel_messy <- tibble::tribble(
         ~from,           ~to,         ~via, ~minutes, ~charge,
       "Dalas", "Forth Worth",     "Denton",     "80",   "1.2",
  "Fort Worth",      "Denton",      "Dalas",  "error",   "1.1",
      "Denton",       "Dalas", "Fort Worth",     "70", "error"
  )

df_travel_messy
from to via minutes charge
Dalas Forth Worth Denton 80 1.2
Fort Worth Denton Dalas error 1.1
Denton Dalas Fort Worth 70 error

Find & replace

Single column

Equivalent function to ‘find and replace’ in Excel is str_replace_all(). For example, to correct the city name in the first column we could do the following:

df_travel_messy %>% 
  mutate(
    from = str_replace_all(from, pattern = 'Dalas', replacement = 'Dallas')
    )
from to via minutes charge
Dallas Forth Worth Denton 80 1.2
Fort Worth Denton Dalas error 1.1
Denton Dalas Fort Worth 70 error


Multiple columns


To replace the city name in all columns we could use a combination of str_replace_all() and across():

df_travel_messy %>% 
  mutate(across(everything(), str_replace_all, "Dalas", "Dallas"))
from to via minutes charge
Dallas Forth Worth Denton 80 1.2
Fort Worth Denton Dallas error 1.1
Denton Dallas Fort Worth 70 error


Or if we only wanted to change the city name in the first two columns:

df_travel_messy %>% 
  mutate(across(from:to, str_replace_all, "Dalas", "Dallas"))
from to via minutes charge
Dallas Forth Worth Denton 80 1.2
Fort Worth Denton Dalas error 1.1
Denton Dallas Fort Worth 70 error


Replacing with NA


To replace a string with NA, you need to specify the replacement as NA_character_:

df_travel_messy %>% 
 mutate(across(everything(), str_replace_all, "error", NA_character_))
from to via minutes charge
Dalas Forth Worth Denton 80 1.2
Fort Worth Denton Dalas NA 1.1
Denton Dalas Fort Worth 70 NA


Alternatively it may be ‘cleaner’ to use the na_if function:

df_travel_messy %>% 
  mutate(across(everything(), na_if, "error"))
from to via minutes charge
Dalas Forth Worth Denton 80 1.2
Fort Worth Denton Dalas NA 1.1
Denton Dalas Fort Worth 70 NA


Here is an example of how to use the na_if function in a single column:

df_travel_messy %>% 
  mutate(minutes = na_if(minutes, "error"))
from to via minutes charge
Dalas Forth Worth Denton 80 1.2
Fort Worth Denton Dalas NA 1.1
Denton Dalas Fort Worth 70 error


Strings when filtering


For example, to filter rows with cities that include the letters “Fort”:

df_travel_messy %>% 
  filter(str_detect(from,'Fort'))
from to via minutes charge
Fort Worth Denton Dalas error 1.1


Other regular expressions can be used, e.g. to specify words starting (^) or ending ($) with a character.

# cities in the 'from' column starting with the letter 'd'
df_travel_messy %>% 
  filter(str_detect(from, "^D"))
from to via minutes charge
Dalas Forth Worth Denton 80 1.2
Denton Dalas Fort Worth 70 error
# cities in the 'from' column ending with the letter 'n'
df_travel_messy %>% 
  filter(str_detect(from, "n$"))
from to via minutes charge
Denton Dalas Fort Worth 70 error