R, with a little help from a few friends

teaching and learning
R
northernBUG
Author

Jarek Bryk

Published

June 28, 2023

My presentation at the Northern BUG

In January 2023, I gave a short talk at the 8th meeting of the Northern Bioinformatics User Group (nBUG for short), an informal network of computational biologists and users of bioinformatics services in the (loosely defined) north of England. If you haven’t heard about us and are within a reasonable commute, please come to one of our next meetings (we hold three one-day, single-track meetings per year). It’s really nice :-).

My talk wasn’t actually that short, as I ran over time and could not finish it properly. My excuse is that I was juggling a presentation (with slides), a live demo in RStudio and sharing my screen over Teams, all through a single projector. That seems like a very good reason to write up my short presentation as a blog post.

Who may find this useful?

When I thought about the target audience of the talk, I had in mind postgraduate students who have already done some work in R and are familiar with the basics of the language (e.g. various data types, loading and transforming data, working in RStudio), but who may not have thought about how to organise their data and scripts, or be aware of really simple tricks that would make their work much more effective and efficient. I wasn’t sure whether this was the right pitch, but a few post-talk comments indicated that it was.

Photo of the author in front of the presentation slide with a title "I may be completely wrong"

He’s not wrong ;-). Photo by Andy Mason.

Here we go.

1. Use projects + here + Rmd/Qmd for everything

This advice is number one for a reason: projects will instantly make your work easier, because they force you to organise your files into a consistent structure. And if you combine them with the here package, you get the extra benefit of code that is simpler and, most importantly, portable.

I usually set up a self-explanatory three-folder structure within any project: folders code, data, and output. You can make it as complicated as you want (and there are packages that will build a default structure for you - see also advice #2 below), but for 70% of my needs, this is sufficient (and 100% for everything I teach R with). Any self-contained idea, no matter how small, should be in a separate project.

Screenshot of folder structure from the Files panel in R Studio

A consistent project structure will make your life easier
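As a minimal sketch (not something shown in the talk), the three folders can also be created from the R console. The project location below is a hypothetical placeholder inside a temporary directory; in practice you would point it at a project you have already created in RStudio.

```r
# Hypothetical project location; a temporary directory keeps the example harmless
project_dir <- file.path(tempdir(), "my_project")

# The three folders described above
folders <- file.path(project_dir, c("code", "data", "output"))

# recursive = TRUE also creates project_dir itself if it does not exist yet
for (folder in folders) {
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}

# List what was created inside the project folder
list.files(project_dir)
```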

here() is a simple function that combines strings into paths. The magic bit is that it does so relative to the project location. So you don’t have to remember, or type, that your data is located in /Users/jarek/one_folder/another_folder/folder_hell/my_project_folder/data/my_data.csv. If you use projects + here(), it understands where your project is and creates the path relative to where it is on your hard drive. Like so:

library(here)
#> here() starts at /Users/jarek/Sites/miserable
# Calling the function with no arguments returns what here understands as the project folder location
here()
#> [1] "/Users/jarek/Sites/miserable"
# Calling it with arguments returns path to folders and files relative to the project folder location
here("data", "my_data.csv")
#> [1] "/Users/jarek/Sites/miserable/data/my_data.csv"

It doesn’t matter if you are on a Linux machine and your collaborator is on Windows. As long as you both use the same project structure and call here() wherever your code refers to files in the project folder, the code will work on both machines with no changes.
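A short sketch of how this looks in practice, assuming the project has (or may create) a data folder. The file name example_cars.csv is made up for illustration, and the built-in mtcars data stands in for real data.

```r
library(here)

# Make sure the data folder exists inside the project (no warning if it already does)
dir.create(here("data"), showWarnings = FALSE)

# Write a small example file; example_cars.csv is a hypothetical file name
write.csv(head(mtcars), here("data", "example_cars.csv"), row.names = FALSE)

# Read it back using the same project-relative path
cars <- read.csv(here("data", "example_cars.csv"))

nrow(cars)
```

Neither call contains an absolute path, so the same two lines should work unchanged on any machine that has the project folder.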

Short rant about file system

The lack of familiarity with the concepts of a filesystem and directory trees is by far the biggest issue for students who begin working with R. This is compounded by Microsoft’s push to use OneDrive as the main storage space on Windows machines without making it explicit in the user interface. Students tend to download the Rmd/qmd files from our virtual learning environment (VLE) platform and open them directly from the downloads folder. This opens RStudio but confuses here(), which then reports the downloads folder as the working directory, breaking all relative paths in the project files. It is not obvious what is going on, because RStudio by default opens the last used project, so the interface shows the “correct” project name and the file system viewer in the bottom right panel shows the “correct” project location on the hard drive. The correct behaviour is to move downloaded files to the appropriate place in the project folder (e.g. code) and then open the project itself in RStudio. This is also tricky for the students, who often struggle to answer the question “where is your project folder?” and to locate it with File Explorer (also, “what is File Explorer?”).
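When this happens, a quick check in the console can show where R thinks it is. This is only a suggested diagnostic, not something from the talk; it assumes the here package is installed and that the project is expected to contain a data folder.

```r
library(here)

# Where R is currently working
getwd()

# What here treats as the project root
here()

# dr_here() explains why here picked that root
dr_here()

# Sanity check: is the expected data folder under the project root?
dir.exists(here("data"))
```

If the first two paths point at the downloads folder, the file was most likely opened outside the project.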

2. Name things well

There is nothing that I can say about naming things that Jenny Bryan and Danielle Navarro haven’t already said much better. Check their presentations, pick one of the suggested approaches to naming and stick to it. Sticking to it is more important than the exact approach that you choose.

3. Five or six packages that will make your life much easier

Cheating a little, I also wanted to mention several packages with functions that, in my opinion, really make data wrangling and running statistics (pretty much 90% of what my imaginary target audience wants to do) much easier. Here are the best of them:

  • datapasta by Miles McBain. It’s an RStudio addin that lets you easily copy-paste basic data structures into R (e.g. vectors and tables), skipping the formatting and importing steps. Here is an animated GIF from the linked website that explains it:

Animated GIF demonstrating what datapasta does

What datapasta does. Excellent name, too.

I use it quite often. It’s great for creating dummy data to test various functions or to understand what’s going on with my code.
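To give a rough idea of the result: pasting a small copied table as a tribble leaves you with a ready-to-run tribble() call along these lines. The values below are made up for illustration and were typed by hand, not produced by datapasta.

```r
library(tibble)

# Made-up values, laid out the way a table pasted as a tribble would look
dummy <- tribble(
  ~sample, ~treatment, ~count,
  "s1",    "control",  12,
  "s2",    "control",  15,
  "s3",    "drug",     31
)

dummy
```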

  • janitor by Sam Firke. Probably the most popular of the basic data wrangling packages, with its blockbuster function clean_names(), which standardises messy column names by substituting spaces, normalising letter case and guarding against names that start with a number or another forbidden symbol. It also has a function get_dupes() that identifies duplicated rows/variables in the data and a function tabyl() that prettifies tables, including adding rows with totals or formatting the tables as inputs to statistical tests such as χ².
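A small sketch of the three janitor functions mentioned above, using a made-up data frame with deliberately messy column names and one duplicated row:

```r
library(janitor)

# Made-up data: spaces, capitals and a leading digit in the column names
messy <- data.frame(
  `Sample ID`       = c("a1", "a2", "a2", "b1"),
  `2023 Count`      = c(10, 12, 12, 7),
  `Treatment Group` = c("ctrl", "drug", "drug", "ctrl"),
  check.names = FALSE
)

# Standardise the column names
clean <- clean_names(messy)
names(clean)

# Find rows that are duplicated across all columns
dupes <- get_dupes(clean)
dupes

# Count rows per treatment group and add a row with totals
adorn_totals(tabyl(clean, treatment_group), "row")
```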

  • rstatix by Alboukadel Kassambara. This package is useful for two main reasons: a) it provides wrappers around base R statistical tests, making them compatible with the pipe (including outputting test results as tibbles), and b) it provides the function get_summary_stats(), which calculates basic and not-so-basic descriptive statistics (including operations on groups). Here is an example:

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rstatix)
#> 
#> Attaching package: 'rstatix'
#> 
#> The following object is masked from 'package:stats':
#> 
#>     filter
mtcars %>% 
    t_test(disp ~ cyl)
#> # A tibble: 3 × 10
#>   .y.   group1 group2    n1    n2 statistic    df        p    p.adj p.adj.signif
#> * <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl>    <dbl>    <dbl> <chr>       
#> 1 disp  4      6         11     7     -4.42  9.22 2   e- 3 2   e- 3 **          
#> 2 disp  4      8         11    14    -12.5  17.8  3.03e-10 9.09e-10 ****        
#> 3 disp  6      8          7    14     -7.08 17.9  1.36e- 6 2.72e- 6 ****        

If your categorical variable contains more than two groups, t_test will automatically perform all pairwise tests.

Here is another:

mtcars %>% 
    group_by(cyl) %>% 
    get_summary_stats(disp)
#> # A tibble: 3 × 14
#>     cyl variable     n   min   max median    q1    q3   iqr   mad  mean    sd
#>   <dbl> <fct>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4 disp        11  71.1  147.   108   78.8  121.  41.8  43.0  105.  26.9
#> 2     6 disp         7 145    258    168. 160    196.  36.3  11.3  183.  41.6
#> 3     8 disp        14 276.   472    350. 302.   390   88.2  73.4  353.  67.8
#> # ℹ 2 more variables: se <dbl>, ci <dbl>

Check the type argument of get_summary_stats() for the options that control which descriptive statistics are included in the output.
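For instance, a sketch that asks only for the mean and standard deviation, assuming "mean_sd" is among the accepted values of type in your installed version of rstatix:

```r
library(dplyr)
library(rstatix)

# Restrict the summary to the mean and standard deviation of disp per cylinder group
mean_sd_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  get_summary_stats(disp, type = "mean_sd")

mean_sd_by_cyl
```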

The only (slight) concern about rstatix is its pace of development. Only one issue was patched in the last 1.5 years and at least some of the wrappers do not yet work (I am aware of chisq_test() which doesn’t yet seem to be pipe-compatible). But other than that, it’s great.

  • broom by David Robinson et al. is another of the “prettifying” packages, this time for statistical model outputs. Essentially, it turns output from lm() (and 100+ other models) into a tidy tabular format. It is also able to add extra columns with residuals and predicted values from the model to the original data. It is now part of the tidymodels approach.
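A minimal sketch of broom’s three main verbs applied to a simple linear model. The model itself (fuel economy against car weight in mtcars) is chosen purely for illustration.

```r
library(broom)

# An arbitrary example model: miles per gallon as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)

# One row per model term: estimate, standard error, test statistic, p-value
coefs <- tidy(fit)
coefs

# One row of model-level summaries (R squared, AIC and so on)
glance(fit)

# The data used in the model plus extra columns such as .fitted and .resid
augmented <- augment(fit)
head(augmented)
```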

  • forcats by Hadley Wickham. It is a part of the tidyverse metapackage and is meant to facilitate handling of categorical variables. It is particularly useful for ordering these variables and for grouping them. For example, you can plot only the top three categories in your data (lumping the rest into the “Other” category) with fct_lump() and put the values in decreasing order by median of another variable with fct_reorder().

diamonds %>% 
    sample_frac(0.1) %>% 
    mutate(
        # Group categories outside of top 3 into "Other"
        cut = fct_lump(cut, 3),
        # Reorder categories of diamond cut by median of their price, in decreasing order
        cut = fct_reorder(cut, price, median, .desc = TRUE)
    ) %>% 
    ggplot(aes(x = cut, y = price)) + geom_boxplot() + theme_minimal()

4. Know your interface

Spend some time learning the interface of RStudio and force yourself to use its features until they become second nature. In the simplest case, pick a good colour theme in settings, add coloured lines to indicate tabs (and matching colours for brackets), and choose a good typeface. Uncheck the default options to save history and environment, and make sure you can reproduce your entire analysis from your Rmd/qmd document.

Then learn the basic keyboard shortcuts (Option-Shift-K shows all shortcuts; on Windows, replace Option with Alt and Command with Control):

  • insert new chunk (Control-Option-I)
  • insert pipe symbol (Command-Shift-M by default; Option-minus inserts the assignment operator)
  • run current line/run current chunk (Control-Enter/Control-Shift-Enter)
  • switch between panes and expand them to full screen (Control-Shift-1 for the source panel, etc.; press again to expand)
  • move lines of code up/down (Option-↑ or ↓)

Finally, learn to use these:

  • multi-line cursors (Control-Option-↑ or ↓)
  • rename-in-scope (Control-Option-Shift-M)
  • multi-file find-and-replace (Shift-Command-F): you need to find stuff first, then toggle the Replace switch

Good people at Appsilon have compiled those and many others into a nice gif-torial: RStudio IDE Tips And Tricks - Shortcuts You Must Know Part 1 and Part 2.

5. What lies beyond

And that’s it. I hope you will find these tips and resources useful. The slides from the presentation and the code I ran during the talk are available on GitHub.