Introduction to R Data Analysis

Part 2

Natalie Elphick

November 12th, 2024

Introductions

Natalie Elphick
Bioinformatician I

Min-Gyoung Shin
Bioinformatician III

Schedule

  1. Introduction to Tidyverse
  2. Filtering and reformatting data
  3. Plotting data
  4. Hands-on data analysis
  5. ChatGPT tips for R
  6. Where to get help

Introduction to Tidyverse

Tidyverse

  • The tidyverse packages work well together because they share common data representations and design principles
    • Rows = observations, columns = variables
  • ggplot2, for data visualization.
  • dplyr, for data manipulation.
  • tidyr, for data tidying.
  • readr, for data import.
  • purrr, for iteration.
  • and more
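
A minimal sketch of the shared "rows = observations, columns = variables" layout, assuming ggplot2 and dplyr are installed. It uses mpg, the example dataset that ships with ggplot2; library(tidyverse) would attach the core packages in one call instead.

```r
# Load two of the tidyverse packages individually
library(ggplot2)  # provides the mpg example dataset
library(dplyr)    # provides glimpse() and the data manipulation verbs

# Rows = observations, columns = variables
mpg_dims <- dim(mpg)  # number of rows, then number of columns
mpg_dims

# One line per column, showing its type and first few values
glimpse(mpg)
```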

dplyr

  • Offers a common “grammar” of functions for data manipulation
    • mutate() adds new variables that are functions of existing columns
    • select() picks columns based on their names
    • filter() picks rows based on their values
    • summarise() reduces multiple values down to a single summary
    • arrange() changes the ordering of the rows
    • group_by() allows any operation to be done “by group”
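
mutate() can be sketched as follows. The column name hwy_cty_ratio is a hypothetical name chosen only for this illustration.

```r
library(dplyr)
library(ggplot2)  # only needed for the mpg dataset

# mutate() adds a new column computed from existing columns;
# hwy_cty_ratio is a made-up name for this example
mpg_ratio <- mutate(.data = mpg,
                    hwy_cty_ratio = hwy / cty)

# Keep only the columns involved, to check the result
select(.data = mpg_ratio,
       model, cty, hwy, hwy_cty_ratio)
```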

Example Dataframe

  • mpg is a data frame of fuel economy data included with the ggplot2 package
head(mpg)
# A tibble: 6 × 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

Select Columns

select(.data = mpg,
       year, cty, hwy, manufacturer)
# A tibble: 234 × 4
    year   cty   hwy manufacturer
   <int> <int> <int> <chr>       
 1  1999    18    29 audi        
 2  1999    21    29 audi        
 3  2008    20    31 audi        
 4  2008    21    30 audi        
 5  1999    16    26 audi        
 6  1999    18    26 audi        
 7  2008    18    27 audi        
 8  1999    18    26 audi        
 9  1999    16    25 audi        
10  2008    20    28 audi        
# ℹ 224 more rows

Filter Rows

filter(.data = mpg,
       year == 2008)
# A tibble: 117 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 2 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 3 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 4 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
 5 audi         a4 quattro   2    2008     4 auto… 4        19    27 p     comp…
 6 audi         a4 quattro   3.1  2008     6 auto… 4        17    25 p     comp…
 7 audi         a4 quattro   3.1  2008     6 manu… 4        15    25 p     comp…
 8 audi         a6 quattro   3.1  2008     6 auto… 4        17    25 p     mids…
 9 audi         a6 quattro   4.2  2008     8 auto… 4        16    23 p     mids…
10 chevrolet    c1500 sub…   5.3  2008     8 auto… r        14    20 r     suv  
# ℹ 107 more rows
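
filter() also accepts several conditions at once. A short sketch, where the object names audi_2008 and efficient_or_japanese are hypothetical:

```r
library(dplyr)
library(ggplot2)  # only needed for the mpg dataset

# Conditions separated by commas must all be TRUE (logical AND)
audi_2008 <- filter(.data = mpg,
                    year == 2008,
                    manufacturer == "audi")

# %in% matches any of several values; | means logical OR
efficient_or_japanese <- filter(.data = mpg,
                                manufacturer %in% c("honda", "toyota") | cty > 30)

audi_2008
efficient_or_japanese
```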

Arrange Rows

  • desc() sorts rows in descending order; the default is ascending
arrange(.data = mpg,
        desc(cty))
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 volkswagen   new beetle   1.9  1999     4 manu… f        35    44 d     subc…
 2 volkswagen   jetta        1.9  1999     4 manu… f        33    44 d     comp…
 3 volkswagen   new beetle   1.9  1999     4 auto… f        29    41 d     subc…
 4 honda        civic        1.6  1999     4 manu… f        28    33 r     subc…
 5 toyota       corolla      1.8  2008     4 manu… f        28    37 r     comp…
 6 honda        civic        1.8  2008     4 manu… f        26    34 r     subc…
 7 toyota       corolla      1.8  1999     4 manu… f        26    35 r     comp…
 8 toyota       corolla      1.8  2008     4 auto… f        26    35 r     comp…
 9 honda        civic        1.6  1999     4 manu… f        25    32 r     subc…
10 honda        civic        1.8  2008     4 auto… f        25    36 r     subc…
# ℹ 224 more rows

Summarising data

  • The dplyr summarise() function computes a table of summaries for a data frame
  • group_by() groups the input data frame by the specified variable(s)
  • Combining these two allows us to easily create summaries for different categorical groupings

Group and Summarise

  • Get the mean and median city mileage for each manufacturer
summarise(group_by(.data = mpg,
                   manufacturer),
          mean_cty = mean(cty),
          median_cty = median(cty))
# A tibble: 10 × 3
   manufacturer mean_cty median_cty
   <chr>           <dbl>      <dbl>
 1 audi             17.6       17.5
 2 chevrolet        15         15  
 3 dodge            13.1       13  
 4 ford             14         14  
 5 honda            24.4       24  
 6 hyundai          18.6       18.5
 7 jeep             13.5       14  
 8 land rover       11.5       11.5
 9 lincoln          11.3       11  
10 mercury          13.2       13  

The pipe operator |>

  • Allows “chaining” of function calls to make code more readable
mpg |>
  group_by(manufacturer) |>
  summarise(mean_cty = mean(cty),
            median_cty = median(cty)) |>
  head(5)
# A tibble: 5 × 3
  manufacturer mean_cty median_cty
  <chr>           <dbl>      <dbl>
1 audi             17.6       17.5
2 chevrolet        15         15  
3 dodge            13.1       13  
4 ford             14         14  
5 honda            24.4       24  
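
As a quick check, the nested call and the piped version give the same result. Note that the native pipe |> requires R 4.1.0 or later. The object names nested and piped are only for illustration.

```r
library(dplyr)
library(ggplot2)  # only needed for the mpg dataset

# Nested version: read from the inside out
nested <- summarise(group_by(mpg, manufacturer),
                    mean_cty = mean(cty),
                    median_cty = median(cty))

# Piped version: read from top to bottom
piped <- mpg |>
  group_by(manufacturer) |>
  summarise(mean_cty = mean(cty),
            median_cty = median(cty))

# Both approaches produce the same table
all.equal(nested, piped)
```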

Plotting

ggplot2

  • One of the most widely used tidyverse packages
  • Creates publication-quality, highly customizable plots
  • ggplots use “layers” to build, modify and overlap visualizations
    • Layers are added using the + symbol and can be added to an existing ggplot
  • Many popular packages output ggplots, which can then be easily modified by adding layers
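
A small sketch of adding layers to an existing ggplot. The object names p and p_labeled are hypothetical; the same pattern should apply to a ggplot returned by another package.

```r
library(ggplot2)

# Store a base plot in an object
p <- ggplot(data = mpg,
            mapping = aes(x = cty, y = hwy)) +
  geom_point()

# Add more layers to the existing ggplot with +
p_labeled <- p +
  labs(x = "City miles per gallon",
       y = "Highway miles per gallon") +
  theme_bw()

p_labeled
```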

Creating ggplots

Plot Example

ggplot(data = mpg,                         # Input dataframe
       mapping = aes(x = cty, y = hwy)) +  # Aesthetic mapping
  geom_point()                             # Point graph

Adding and Modifying Layers

ggplot(data = mpg,                         
       mapping = aes(x = cty, y = hwy)) +  
  geom_point(color = "brown") +
  geom_smooth(formula = y ~ x, method = "lm")
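
The example above sets a fixed color. A variation, sketched here with the hypothetical object name p_by_drv, maps color to a column inside aes() instead, so that each drive type (drv) gets its own color and a legend is added.

```r
library(ggplot2)

p_by_drv <- ggplot(data = mpg,
                   mapping = aes(x = cty, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +      # color mapped to a column
  geom_smooth(formula = y ~ x, method = "lm",
              color = "black")                  # fixed color, set outside aes()

p_by_drv
```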

10 min break

Hands-on Data Analysis

Dataset Description

  • PanTHERIA
    • A global species-level data set of key traits of all known extant and recently extinct mammals compiled from literature
    • Used in macroecological and macroevolutionary research projects
    • Data is organized by taxonomic rank

Taxonomic Rank

Example: Red fox (Vulpes vulpes)

  • Domain: Eukarya
  • Kingdom: Animalia
  • Phylum: Chordata
  • Class: Mammalia
  • Order: Carnivora
  • Family: Canidae
  • Genus: Vulpes
  • Species: Vulpes vulpes

Data Preview

Six example rows from the dataset, shown transposed (one line per variable, one column per species):

Order                    Carnivora              Carnivora              Carnivora              Carnivora              Cetacea                Cetacea
Family                   Canidae                Canidae                Canidae                Canidae                Balaenopteridae        Balaenopteridae
Genus                    Canis                  Canis                  Canis                  Atelocynus             Balaenoptera           Balaenoptera
Species                  latrans                lupus                  simensis               microtis               musculus               physalus
Binomial                 Canis latrans          Canis lupus            Canis simensis         Atelocynus microtis    Balaenoptera musculus  Balaenoptera physalus
ActivityCycle            crepuscular            crepuscular            diurnal                NA                     NA                     NA
AdultBodyMass_g          11989.1                31756.51               14361.86               8363.22                154321304.5            47506008.23
AdultForearmLen_mm       NA                     NA                     NA                     NA                     NA                     NA
AdultHeadBodyLen_mm      872.39                 1055                   938.19                 831.01                 30480                  20641.06
AgeatEyeOpening_d        11.94                  14.01                  NA                     NA                     NA                     NA
AgeatFirstBirth_d        365                    547.5                  NA                     NA                     NA                     NA
BasalMetRate_mLO2hr      3699                   11254.2                NA                     NA                     NA                     NA
BasalMetRateMass_g       10450                  33100                  NA                     NA                     NA                     NA
DietBreadth              1                      1                      1                      1                      1                      2
DispersalAge_d           255                    180                    180                    NA                     NA                     NA
GestationLen_d           61.74                  63.5                   63.61                  NA                     326.97                 338.36
HabitatBreadth           1                      1                      1                      1                      1                      1
HomeRange_km2            18.88                  159.86                 4.2                    NA                     NA                     NA
HomeRange_Indiv_km2      19.91                  43.13                  5.02                   NA                     NA                     NA
InterbirthInterval_d     365                    365                    365                    NA                     821.25                 730
LitterSize               5.72                   4.98                   NA                     NA                     1                      1.01
LittersPerYear           NA                     2                      NA                     NA                     0.45                   0.37
MaxLongevity_m           262                    354                    NA                     132                    1320                   1392
NeonateBodyMass_g        200.01                 412.31                 NA                     NA                     2738612.79             1899999.99
NeonateHeadBodyLen_mm    NA                     NA                     NA                     NA                     7236.55                6273.75
PopulationDensity_n/km2  0.25                   0.01                   1.2                    NA                     NA                     NA
PopulationGrpSize        NA                     NA                     NA                     NA                     1                      1.5
SexualMaturityAge_d      372.9                  679.37                 754.74                 NA                     1959.8                 2666.41
SocialGrpSize            NA                     NA                     NA                     1                      1.25                   NA
Terrestriality           fossorial              fossorial              fossorial              fossorial              NA                     NA
TrophicLevel             carnivore              carnivore              carnivore              carnivore              carnivore              carnivore
WeaningAge_d             43.71                  44.82                  69.6                   NA                     211.71                 196.58
WeaningBodyMass_g        NA                     NA                     NA                     NA                     16999999.97            NA
WeaningHeadBodyLen_mm    NA                     NA                     NA                     NA                     NA                     12000
AdultBodyMass_g_EXT      NA                     NA                     NA                     NA                     NA                     NA
LittersPerYear_EXT       1.1                    NA                     1.1                    NA                     NA                     NA
NeonateBodyMass_g_EXT    NA                     NA                     NA                     NA                     NA                     NA
WeaningBodyMass_g_EXT    NA                     NA                     NA                     NA                     NA                     6395530.42
GR_Area_km2              17099094.3             50803439.7             11402.81               7634256.6              NA                     NA
GR_MaxLat_dd             71.39                  83.27                  13.31                  4.79                   NA                     NA
GR_MinLat_dd             8.02                   11.48                  6.55                   -32.31                 NA                     NA
GR_MidRangeLat_dd        39.7                   47.38                  9.93                   -13.76                 NA                     NA
GR_MaxLong_dd            -67.07                 179.65                 39.96                  -43.54                 NA                     NA
GR_MinLong_dd            -168.12                -171.84                38.02                  -78.61                 NA                     NA
GR_MidRangeLong_dd       -117.6                 3.9                    38.99                  -61.08                 NA                     NA
HuPopDen_Min_n/km2       0                      0                      30                     0                      NA                     NA
HuPopDen_Mean_n/km2      27.27                  37.87                  99.87                  7.43                   NA                     NA
HuPopDen_5p_n/km2        0                      0                      30                     0                      NA                     NA
HuPopDen_Change          0.06                   0.04                   0.15                   0.12                   NA                     NA
Precip_Mean_mm           53.03                  34.79                  83.87                  163.06                 NA                     NA
Temp_Mean_01degC         58.18                  4.82                   99.03                  235.49                 NA                     NA
AET_Mean_mm              503.02                 313.33                 931.35                 1316.27                NA                     NA
PET_Mean_mm              728.37                 561.11                 1471.36                1488                   NA                     NA

References (the column between WeaningHeadBodyLen_mm and AdultBodyMass_g_EXT), by species:

  • Canis latrans: 367;542;543;730;1113;1297;1573;1822;2655
  • Canis lupus: 367;542;543;730;1015;1052;1113;1297;1573;1594;2338;2655
  • Canis simensis: 542;730;1113;1573;2655
  • Atelocynus microtis: 543;890;1113;2655
  • Balaenoptera musculus: 172;511;543;899;1004;1015;1217;1297;2151;2409;2655
  • Balaenoptera physalus: 24;27;543;899;1004;1015;1217;1297;1577;2151;2655

Hands-on Analysis

  • We will read in the data and test whether trophic level has a significant effect on the adult body mass of mammals

Steps:
1. Combine and clean the data
2. Visualize adult body mass by trophic level
3. Check for overrepresented groups
4. Fit a simple linear model
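
The four steps can be sketched on simulated stand-in data. This is not the workshop's actual code and the values are NOT real PanTHERIA data: only the column names TrophicLevel and AdultBodyMass_g come from the data preview, and the log10 transformation is an assumption made here because body mass spans several orders of magnitude.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in values (NOT real PanTHERIA data), split into two pieces
# so that there is something to combine
set.seed(1)
toy_a <- data.frame(TrophicLevel = rep(c("herbivore", "omnivore"), each = 20),
                    AdultBodyMass_g = exp(rnorm(40, mean = 8, sd = 2)))
toy_b <- data.frame(TrophicLevel = c(rep("carnivore", 10), NA),
                    AdultBodyMass_g = c(exp(rnorm(10, mean = 9, sd = 2)), NA))

# 1. Combine and clean: stack the pieces, drop rows with missing values
mammals <- bind_rows(toy_a, toy_b) |>
  filter(!is.na(TrophicLevel), !is.na(AdultBodyMass_g))

# 2. Visualize adult body mass by trophic level (log scale is an assumption)
mass_plot <- ggplot(data = mammals,
                    mapping = aes(x = TrophicLevel, y = AdultBodyMass_g)) +
  geom_boxplot() +
  scale_y_log10()
mass_plot

# 3. Check for overrepresented groups by counting rows per group
group_counts <- count(mammals, TrophicLevel)
group_counts

# 4. Fit a simple linear model of log body mass on trophic level
fit <- lm(log10(AdultBodyMass_g) ~ TrophicLevel, data = mammals)
summary(fit)
```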

Hands-on Analysis

  • Open part_2.Rmd
  • If you just want to follow along and not run code, open part2_filled_out.html

ChatGPT Tips for R

General Tips

  • Follow any relevant institutional guidelines on using LLMs
  • Always confirm ChatGPT’s outputs are correct
  • Provide as much detail as possible about the problem in the first prompt
  • Use separate chats for separate tasks/projects
  • Try the ‘Custom Instructions’ function

Code Tips

  • Commented R code yields better responses
  • Provide the code and error message in the same prompt
  • ChatGPT can work well to convert syntax and improve your code:
    • “Turn this loop into a function: [your code]”
    • “Is there a better way to do this: [your code]”
  • Check out the file: example_code/1_convert_syntax_example.R for an example use case
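
As a hypothetical illustration of the "turn this loop into a function" prompt (this is not the content of the example file above), a commented loop and the kind of rewrite one might ask for could look like this. The function name column_means is made up for the example, and any suggested rewrite should still be checked against the original.

```r
library(ggplot2)  # only needed for the mpg dataset

# Starting point: a loop that computes the mean of several columns
numeric_cols <- c("displ", "cty", "hwy")
col_means_loop <- numeric(0)
for (col in numeric_cols) {
  col_means_loop[col] <- mean(mpg[[col]])
}

# Possible rewrite: a reusable function (hypothetical name)
column_means <- function(data, cols) {
  # vapply() returns one numeric value per column, named after the column
  vapply(cols, function(col) mean(data[[col]]), numeric(1))
}

col_means_fn <- column_means(mpg, numeric_cols)

# Always confirm the rewrite gives the same answer as the original
all.equal(col_means_loop, col_means_fn)
```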

Where to Get Help

Bioinformatics Questions

For any bioinformatics-specific questions, feel free to reach out to the Gladstone Bioinformatics Core.

Debugging Errors

  • Try searching the web by pasting the error message along with relevant keywords (package or function name)
  • Websites like Stack Overflow and the Posit Community forum often have relevant answers
  • If the problem is package-specific, check the documentation and reach out to the authors using their preferred method
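
One way to collect the details worth pasting into a search or a help request is sketched below. The failing call is deliberately artificial, and the object name error_message is hypothetical.

```r
# Capture the exact text of an error message
error_message <- tryCatch(
  log("ten"),  # deliberately wrong: log() needs a numeric argument
  error = function(e) conditionMessage(e)
)
error_message

# Version details that are often useful to include when asking for help
R.version.string
packageVersion("dplyr")
```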

Additional Resources

Coding Templates

Code templates can be used to avoid typing the same code over and over again.

R Resources

End of Part 2

Workshop survey

  • Please fill out our workshop survey so we can continue to improve these workshops

Upcoming Workshops

Introduction to scATAC-seq Data Analysis
November 14-15, 2024, 1:00-4:00pm PST

Introduction to Linear Mixed Effects Models
November 18-19, 2024, 1:00-3:00pm PST

scATAC-seq and scRNA-seq Data Integration
November 22, 2024, 1:00-4:00pm PST