Introduction to R Data Analysis

Part 1

Natalie Elphick

November 11th, 2024

Introductions

Natalie Elphick
Bioinformatician I

Poll 1

What is your level of experience with coding/data analysis?

  1. I know another data analysis programming language (Python, Matlab etc.)
  2. I can use Excel
  3. I know some R
  4. All of the above
  5. None of the above

Target Audience

  • No background in statistics or computing
  • No prior experience with programming or R/RStudio

Learning Objectives

  1. Navigate the RStudio environment and understand how R works
  2. Understand variable types and data structures
  3. Perform data cleaning and transformation in R
  4. Create simple visualizations using ggplot2

Learning R Takes Time!

  • Workshop Pace: This is an intro, and it’s okay if everything doesn’t click right away.
  • Practice is Key: Plan to spend extra time on practicing concepts after the workshop.
  • Self-Guided Learning: Use the materials provided at the end of the workshop to continue at your own pace.

Keep at it—progress comes with persistence!

Part 1:

  1. What is R and why should you use it?
  2. The RStudio interface
  3. File types
  4. Variables
  5. Error and warning messages
  6. Types & data structures
  7. Math and logic operations
  8. Functions and packages

What is R?

R

  • An open-source language developed for statistical computing by Ross Ihaka and Robert Gentleman
  • Inspired by the S language, developed at Bell Labs in 1976 to make interactive data analysis easier
  • The first official version was released in 2000

Why use R for data analysis?

  • R is free and open source, released under the GNU General Public License
  • Most statistical methods are already available in base R or in a package
  • Code serves as a record, which makes analyses reproducible with minimal effort
  • As of August 2024, there were over 21,000 open-source packages to extend its functionality
    • Highly customizable graphics (ggplot2)
    • Analysis reports (knitr)
    • RNA-seq analysis (DESeq2)

RStudio

  • RStudio is an integrated development environment (IDE)
  • An app that makes R code easier to write by providing a feature-rich graphical user interface (GUI)



R and RStudio

Layout

File types

  • R script files that end in .R
    • The most basic type: a file that contains only R code
  • R Markdown files that end in .Rmd
  • Let’s create a blank R script to see how it works. Open RStudio and click:
    • File -> New File -> R Script

R Markdown

  • A file format combining R code with Markdown for text formatting.
  • Designed for creating reproducible research reports in various formats (HTML, PDF, Word).
  • Let’s create an Rmd file in RStudio to explore the basics of how they work:
    • File -> New File -> R Markdown

R Markdown Advanced Usage

  • Presentations: Creating slides (like these) with revealjs.
  • Publications: Authoring online books that combine narrative, code, and output with bookdown.
  • Interactive Documents: Developing interactive tutorials or dashboards with learnr and other embedded applications.

Variables

Variable definition

  • Variables store information that is referenced and manipulated in a computer program
  • There are 3 ways to define variables in R, but one is preferred:
x <- 1  # Preferred way
x = 1
1 -> x
print(x)
[1] 1

Example

  • Run the following in the R console:
x <- 1  
y <- 4
z <- y
x + y + z
[1] 9
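
A small sketch of what z <- y does, reusing the names from the example above: assignment copies the value y holds at that moment, so changing y afterwards should leave z untouched.

```r
y <- 4
z <- y      # z receives a copy of y's current value
y <- 10     # reassigning y does not change z

z           # 4
y + z       # 14
```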

Error and Warning Messages

Errors

  • Errors: Stop the execution of your code and must be fixed for the code to run successfully
x <- 5
y <- 10
z <- x + a
Error: object 'a' not found

Common Errors

  • Syntax Error: Invalid R code syntax (e.g. misplaced parentheses)
Error: unexpected ")"
  • Object not found: The variable is not defined (e.g. a misspelled variable name)
Error: object 'a' not found

See this article for more common errors and how to fix them.

Warnings

  • Do not stop the execution but indicate potential issues that you should be aware of and might need to address
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9)
result <- a + b
Warning in a + b: longer object length is not a multiple of shorter object length
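
The warning above comes from vector recycling: when two vectors differ in length, R reuses the shorter one from its start. A minimal sketch (the names short and even_fit are made up for illustration) of a case that recycles silently and the case that warns:

```r
short <- c(10, 20)
even_fit <- c(1, 2, 3, 4) + short   # short is reused as 10, 20, 10, 20; no warning
even_fit                            # 11 22 13 24

a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9)
result <- a + b                     # still runs; b is reused as 6, 7, 8, 9, 6
result                              # 7 9 11 13 11

length(a) == length(b)              # FALSE: comparing lengths first catches the mismatch
```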

Variable Naming

  • Variable names must start with a letter (or a period not followed by a digit) and can contain letters, digits, underscores and periods
  • It is best practice to use descriptive variable names and stick to one style of names
# Snake case
dog_breeds <- c("Labrador Retriever", "Akita", "Bulldog")

# Period separated
dog.breeds <- c("Labrador Retriever", "Akita", "Bulldog")

# Camel case
dogBreeds <- c("Labrador Retriever", "Akita", "Bulldog")

Poll 2

Which variable name is not valid in R?

  1. cat_dog
  2. CatDOG
  3. cat.dog
  4. catD*g

Exercise 1

  • Open the R script file part_1.R in RStudio

Data Types and Structures

Data Types

  • Integer
    • Whole numbers (written with an L suffix, e.g. 1L, 2L)
  • Numeric
    • Decimal numbers
  • Logical
    • Boolean (TRUE, FALSE)
  • Character
    • Text strings of any length, written in quotes
    • “A”, “Labrador Retriever”
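
To check which type a value has, base R provides class(). A short sketch with made-up values:

```r
class(3L)        # "integer"
class(3.7)       # "numeric"
class(FALSE)     # "logical"
class("Akita")   # "character"

is.numeric(3L)   # TRUE: integers also count as numeric
```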

Missing Values

  • R has a special value, NA, which represents missing data
  • NAs can take the place of any type but by default are logical
NA + 1
[1] NA
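
A minimal sketch of how NA behaves in practice, assuming a hypothetical vector weights_kg with one missing measurement:

```r
weights_kg <- c(30.5, NA, 24.0)   # hypothetical measurements, one missing

is.na(weights_kg)                 # FALSE  TRUE FALSE
mean(weights_kg)                  # NA: the missing value propagates
mean(weights_kg, na.rm = TRUE)    # 27.25: NA is dropped before averaging

class(NA)                         # "logical"
class(c("Akita", NA))             # "character": NA takes on the vector's type
```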

Poll 3

Which of these is not the correct data type for the value?

  1. “1.5” - Numeric
  2. “A” - Character
  3. 1L - Integer
  4. TRUE - Boolean

Data Structures

  • Vectors
    • Atomic vectors - one-dimensional sequences of values that all have the same type
    • Lists - can hold values of different types and structures, including other lists (nested lists)
  • Factors
    • Categorical vectors whose values come from a fixed set of levels (optionally ordered)
  • Matrix
    • Columns and rows of the same type
  • Data frames
    • Columns and rows of mixed types
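
The structures above can be created with base R functions. This is a sketch only; every name and value in it is made up for illustration:

```r
# Atomic vector: every element has the same type
weights_kg <- c(30, 45, 23)

# List: elements can differ in type and can be nested
dog <- list(breed = "Akita", weight_kg = 45, tags = list("large", "spitz"))

# Factor: categorical values restricted to a set of levels
size <- factor(c("medium", "large", "medium"), levels = c("medium", "large"))

# Matrix: rows and columns, all of one type
m <- matrix(1:6, nrow = 2)

# Data frame: each column can have its own type
dogs <- data.frame(
  breed = c("Labrador Retriever", "Akita", "Bulldog"),
  weight_kg = weights_kg
)

str(dogs)      # compact summary of the columns and their types
levels(size)   # "medium" "large"
dim(m)         # 2 3
```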

Data structures

Exercise 2: Data Types and Structures

  • Reopen the R script file part_1.R in RStudio

10 min break

Math and Logic Operations

Math & Logic

  • Built-in functions compute common summaries of data (e.g. mean(), median(), sd()); note that mode() returns an object's storage mode, not the most frequent value
  • Relational comparison operators to compare values
x == y  # Equal to
x != y  # Not equal to
x <  y  # Less than
x > y   # Greater than
x <= y  # Less than or equal to
x >= y  # Greater than or equal to

x %in% y # Is x in this vector y?
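
A short sketch of these operations on a hypothetical vector weights_kg. Comparisons are applied to each element, so the result is a logical vector of the same length:

```r
weights_kg <- c(30, 45, 23)   # hypothetical values

mean(weights_kg)              # 32.66667
median(weights_kg)            # 30

weights_kg > 25               # TRUE  TRUE FALSE
weights_kg == 45              # FALSE  TRUE FALSE

"Akita" %in% c("Labrador Retriever", "Akita", "Bulldog")   # TRUE
```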

Logical Operators

  • Logical operators combine or negate TRUE and FALSE values
x <- TRUE
y <- FALSE

!x     # Not x
x | y  # x or y
x & y  # x and y
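
The same operators also work element by element on logical vectors, which is how several conditions are usually combined. A sketch with a made-up vector of ages:

```r
ages <- c(2, 7, 11, 4)       # hypothetical dog ages in years

is_young <- ages < 3         # TRUE FALSE FALSE FALSE
is_senior <- ages > 10       # FALSE FALSE  TRUE FALSE

is_young | is_senior         # TRUE FALSE  TRUE FALSE
!is_young & !is_senior       # FALSE  TRUE FALSE  TRUE
```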

Poll 4

What is the output of the following code?

2 + 2 == 4 & 8 + 10 < 20
  1. TRUE
  2. FALSE
  3. NA

Poll 5

What is the output of the following code?

x <- TRUE
y <- FALSE

y | (y | x)
  1. TRUE
  2. FALSE
  3. NA

Conditional execution

  • Relational and logical operations allow for conditional execution of code
dog_breeds <- c("Labrador Retriever", "Akita", "Bulldog")

if ("Akita" %in% dog_breeds) {
  
  print("dog_breeds already contains Akita")
  
} else {
  
  dog_breeds <- c("Akita", dog_breeds)
  
}
[1] "dog_breeds already contains Akita"

Functions

  • A function is a block of organized, reusable code that performs a single action
  • R has many built-in functions; these are called base R functions
  • Not all arguments are required and some have default values
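
As a sketch of optional arguments and default values, here are two base R functions called with and without their optional arguments:

```r
round(3.14159)                    # 3: digits defaults to 0
round(3.14159, digits = 2)        # 3.14: argument matched by name
round(3.14159, 2)                 # 3.14: argument matched by position

mean(c(1, NA, 3))                 # NA: na.rm defaults to FALSE
mean(c(1, NA, 3), na.rm = TRUE)   # 2
```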

Defining a function

  • To define a function we use the function keyword; the output is specified with return():
add_dog <- function(dog_to_add, input_vector) {
  if (dog_to_add %in% input_vector) {
    
    print("Already contains this dog")
    
  } else {
    
    output <- c(dog_to_add, input_vector)
    return(output)
    
  }
}

Example

add_dog(dog_to_add = "Akita",
        input_vector = dog_breeds)
[1] "Already contains this dog"
add_dog(dog_to_add = "German Shepherd",
        input_vector = dog_breeds)
[1] "German Shepherd"    "Labrador Retriever" "Akita"             
[4] "Bulldog"           
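
One detail of add_dog as written: when the dog is already present, the last expression evaluated is print(), so the function returns the message text rather than the vector. A possible variant is sketched below under the hypothetical name add_dog_keep (not part of the workshop materials); it uses message() for the note and always returns a vector.

```r
dog_breeds <- c("Labrador Retriever", "Akita", "Bulldog")

# add_dog_keep is a hypothetical name used only for this sketch
add_dog_keep <- function(dog_to_add, input_vector) {
  if (dog_to_add %in% input_vector) {
    message("Already contains this dog")  # a note for the user, not a return value
    return(input_vector)                  # hand back the vector unchanged
  }
  c(dog_to_add, input_vector)             # the last expression is returned
}

unchanged <- add_dog_keep("Akita", dog_breeds)
length(unchanged)                         # 3

longer <- add_dog_keep("Beagle", dog_breeds)
length(longer)                            # 4
```

Because both branches return a character vector, the result can be assigned back to dog_breeds in either case.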

Poll 6

What does this function do?

mystery_function <- function(x) {
  if (x > 0) {
    return(x)
  } else {
    return(-x)
  }
}
  1. Returns the absolute value of x
  2. Returns x
  3. Returns the square root of x
  4. Returns -x

Packages

  • Packages are collections of functions that are specialized to a specific task (plotting, data manipulation etc.)
library(ggplot2) # Makes all of the ggplot2 functions available
  • The tidyverse is a collection of commonly used data analysis packages
    • Learning curve is less steep
    • Lots of useful packages for cleaning and “wrangling” data into the correct format
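
A sketch of the usual package workflow, assuming ggplot2 may or may not be installed on your machine: install once, then load in each new R session. The requireNamespace() check is only there so the example runs either way.

```r
# Install once per computer (needs an internet connection):
# install.packages("ggplot2")

# Load once per R session; here only if the package is actually installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
}

# package::function calls a single function without loading the whole package
stats::median(c(1, 3, 10))   # 3
```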

Why use Tidyverse Packages?

  • Most of the work in data analysis is getting data into the correct format to create outputs
  • The tidyverse collection of packages simplifies this process
    • Intuitive syntax
    • Comprehensive (data manipulation, cleaning, modeling and graphics)
    • Consistent data structure
    • Strong community support

End of Part 1

Workshop survey

  • Please fill out our workshop survey so we can continue to improve these workshops

Upcoming Workshops

Introduction to scATAC-seq Data Analysis
November 14-15, 2024 1:00-4:00pm PST

Introduction to Linear Mixed Effects Models
November 18-19, 2024 1:00-3:00pm PST

scATAC-seq and scRNA-seq Data Integration
November 22, 2024 1:00-4:00pm PST