R Protocols Notebook

Shantel A. Martinez | Update 2019.02.19


Hyperlink to Rmd Files | Hyperlink within Notebook & External Webpages

This notebook contains summaries, data analysis, and explanations of R scripts and programs found useful for genomic selection and beyond.
This notebook is meant to be shared for the purpose of transparency and as a collaborative resource. It is a working document, so mistakes, unintentionally incorrect analyses, and outdated material may be present.
If you do see a mistake, I would love to learn a better way to analyze my data. Please feel free to email me (shantel.a.martinez@gmail.com) with helpful feedback and I will be sure to update this public R Protocols Notebook.

TABLE OF CONTENTS:
General Topic
       Genomic Prediction
       h: Bandwidth Parameter for RKHS
       GWAS
Other Useful Commands
       References for online cheatsheets
       Other Computing Resources
Coding Shortcuts
       Git
       Anaconda
       Jupyter
       R
       Markdown
       Typora


Genomic Prediction Modeling

PHS GS One-step Approach
      rrBLUP five-fold cross-validation broken down and explained in an R markdown notebook

Five-Fold CV
      rrBLUP, RKHS, and LASSO broken down and explained in an R markdown notebook
      5-fold cross-validation
      Two-step approach (MISSING reliability adjustment)
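A minimal five-fold CV loop of the kind described above might be sketched like this (assuming a marker matrix `geno` coded -1/0/1 and a phenotype vector `pheno`; both names are placeholders, not objects from my actual scripts):

```r
# Sketch: 5-fold cross-validated prediction accuracy with rrBLUP
library(rrBLUP)

set.seed(42)
n <- nrow(geno)
folds <- sample(rep(1:5, length.out = n))   # assign each line to one of 5 folds
acc <- numeric(5)

for (k in 1:5) {
  test <- which(folds == k)
  y.trn <- pheno
  y.trn[test] <- NA                          # mask the validation set
  fit  <- mixed.solve(y = y.trn, Z = geno)   # RR-BLUP marker effects
  gebv <- as.vector(geno %*% fit$u)          # GEBVs for all lines
  acc[k] <- cor(gebv[test], pheno[test], use = "complete.obs")
}
mean(acc)   # average prediction accuracy across folds
```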

D. Sweeney helped me realize that my initial two-step GP process did not adjust for reliability. I was calculating BLUPs for my phenotype data, then using those BLUP values as the phenotype to calculate GEBVs. From what I understand so far (I still need to read more on this), a two-step process requires dividing the GEBVs by a reliability factor, which I had not done. The two-step method itself is not wrong, but without the adjustment I could be inflating my prediction accuracies.
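A rough sketch of how the reliability could be computed, based on my current (unvetted) understanding: `mixed.solve()` can return standard errors of the BLUPs with `SE = TRUE`, and reliability is then 1 - PEV/Vu, with PEV = u.SE^2. Here `pheno` and `geno` are placeholder names, and the de-regression step should be checked against the literature before use:

```r
# Sketch only -- not a vetted pipeline; verify the de-regression against the literature
library(rrBLUP)

A <- A.mat(geno)                                      # additive relationship matrix
fit <- mixed.solve(y = pheno, K = A, SE = TRUE)       # genomic BLUPs + their SEs
rel <- 1 - (fit$u.SE^2) / fit$Vu                      # per-line reliability
gebv.dereg <- fit$u / rel                             # "adjusted" (de-regressed) BLUPs
```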

What I'm working on now is a one-step approach: including my covariate variables directly while running the mixed.solve command in rrBLUP.

Progress on the one-step approach is above.
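The one-step idea can be sketched by passing the covariates as fixed effects (the `X` argument of `mixed.solve`). `pheno`, `covs` (an n x p covariate matrix), and `geno` are placeholder names:

```r
# One-step sketch: covariates as fixed effects, markers as random effects
library(rrBLUP)

X <- cbind(1, covs)                        # intercept + covariates as fixed effects
fit <- mixed.solve(y = pheno, Z = geno, X = X)
fit$beta                                   # fixed-effect (covariate) estimates
gebv <- as.vector(geno %*% fit$u)          # GEBVs obtained in a single step
```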

Bandwidth Parameter for RKHS

Bandwidth Parameter Notes
RKHS: Based on genetic distance and a kernel function with a smoothing parameter to regulate the distribution of QTL effects. Effective for detecting nonadditive gene effects.

When I was trying to understand modeling using Reproducing Kernel Hilbert Spaces (RKHS), I also needed to understand the parameters being used by the model in BGLR. This led me down a rabbit hole of defining a bandwidth parameter h for my datasets.

NOTE1: I am still in the process of fully understanding h and RKHS, so my thoughts and process so far are shown in the R notebook (above link); it will evolve as I better understand h and read more articles.
NOTE2: I had a meeting with the statistical consulting group and both the advisor and I went down another rabbit hole of determining the bandwidth parameter for RKHS. But, there is hope!
i) people are still working out how best to identify the optimal bandwidth parameter, so at least I am not alone.
ii) some papers that are useful are de los Campos et al., 2010; Pérez-Elizalde et al., 2015; and Pérez and de los Campos, 2014

"The bandwidth parameter of the Gaussian kernel can be chosen using either cross-validation (CV) or Bayesian methods. From a Bayesian perspective, one possibility is to treat h as random; however, this is computationally demanding because the RK needs to be recomputed any time h is updated. To overcome this problem de los Campos et al. (2010) proposed using a multikernel approach (named kernel averaging, KA) consisting of: (a) defining a sequence of kernels based on a set of values of h, and (b) fitting a multikernel model with as many random effects as kernels in the sequence." - Pérez and de los Campos, 2014
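A sketch of that kernel-averaging idea in BGLR (`geno` and `pheno` are placeholder names; the candidate h values and MCMC settings are illustrative, not recommendations):

```r
# Kernel averaging sketch: one RKHS random effect per candidate bandwidth h
library(BGLR)

D <- as.matrix(dist(scale(geno)))^2        # squared Euclidean distance between lines
D <- D / mean(D)                           # scale distances so h values are comparable
h <- c(0.25, 1, 5)                         # candidate bandwidths
ETA <- lapply(h, function(hk) list(K = exp(-hk * D), model = "RKHS"))

fit <- BGLR(y = pheno, ETA = ETA, nIter = 12000, burnIn = 2000, verbose = FALSE)
fit$ETA[[1]]$varU                          # posterior variance captured by the first kernel
```

Comparing the estimated variances across kernels gives a sense of which bandwidths the data favor.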


Online Coding Cheatsheets

Simple cheat sheets for markdown:

GitHub Wiki | GitHub Markdown | md Basics | Typora document | R md document

Simple cheat sheets for Jupyter Notebook:

Jupyter Notebook shortcuts
DataCamp's Jupyter and R markdown cheatsheet


Other Computing Resources

Data Management

Data organization for Spreadsheets

A Nature article on how we need to be very transparent in our data analysis pipelines. If I ever want to publish raw data files, the scripts that organize and analyze the data, and everything down to the final figures, I know from personal experience that I need to be extremely organized, and I am starting here. I have high admiration for scientists who are very transparent, and I dream of getting to the point of publishing a GitHub repo with every raw data file and 'clean' scripts for the public to follow. So I'm working on those skills, and these articles on how to tidy up my scripts are also a good start.

Furthermore, here is a talk by Karl Broman on collaborating reproducibly. Feel free to get lost in his blog posts on everything coding, science, and reproducibility.


Electronic Lab Notebook (ELN)

Great article on getting started with an electronic notebook.
A Nature article about why so many scientists love Jupyter Notebook (I am biased, I know).


Resources found while on the hunt for something

How to output nice tables in R: a list of packages and 5 package tutorials
A compare and contrast of how the Economist presents their data figures.
Want some figure inspiration? Check out the tidyverse, dataviz, and TidyTuesday hashtags on Instagram


Coding Shortcuts

I often forget coding commands or shortcuts I haven't used recently and end up googling them again. Instead, I keep my running list of forgotten favorites here.

Git

All research files are backed up to GitHub
git pull origin master : Updates this computer's master branch with changes from the other computer
git status : Tells you what has changed since the last commit
git add -u : This tells git to automatically stage tracked files -- including deleting the previously tracked files.
       OR git add /folder to add a whole specific folder of changes
git commit -m 'enter commit comment here'
git push origin master
      Enter in user name (email) and password

Anaconda

Use the Anaconda terminal to access Jupyter Notebook
cd /d D:
cd /d General\ Research\ Files/
jupyter notebook This will start Jupyter Notebook in the web browser
      Notebooks are found in the /Lab notebook/ folder

Jupyter

Shorthand keys
Shift-Enter run cell, select below
Ctrl-Enter run cell
Alt-Enter run cell, insert below
Esc then M : convert cell to Markdown
Esc then Y : convert cell to code

Markdown

<span style="color:#6E8B3D">Green Text</span>: Green Text
<span style="color:#CD5C5C">Red Text</span>: Red Text
<span style="color:#4F94CD">Blue Text</span>: Blue Text
Colors only work when the md output is html ex: github pages or jupyter notebooks

<a id="abbr_name"></a>: Header link
[link text](#abbr_name): Reference Header Link
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;: Tab 6 spaces

<div style="text-align: right"> [TOC](#TOC) </div>

--- (three or more hyphens on their own line) creates a horizontal rule

Typora

Ctrl+/: Source Code Mode
Ctrl+Shift+-: Zoom Out
Ctrl+Shift+=: Zoom In

R

colnames(k) <- sub("X","",colnames(k)) : removes characters like X or V from column names. This is useful when importing data tables, because R prepends X to empty or numeric column names.

c("#F8766D", "#7CAE00", "#00BFC4","#C77CFF") : default ggplot2 colors
length(df[!is.na(df)]) : counts how many values in df are NOT NA; you can also specify a column with df$col

CV <- subset(CNLM, GID %in% myCVc$taxa) : subset CNLM to the GIDs shared with myCVc taxa; there are 1059 GIDs in common
PCC <- merge(myCVc, CV, by="ID") : merge two data frames on a column with the same name
names(myCVc)[14] <- "GID" : rename column 14 only
PHSred7$GIDx <- with(PHSred7, paste("cuGS", PHSred7$GID, sep="")) : turn df$GID ### into df$GIDx cuGS### by adding a new column
PHSwhite$GIDx <- gsub("cuGSOH", "OH", PHSwhite$GIDx) : replace cuGSOH in column GIDx with OH

lme4
VarName : Fixed Effect
(1|VarName) : Random Effect; random intercept with fixed mean
x + (x|VarName) : Random Effect; correlated intercept and slope with the fixed effect x
(1|Env/GID) : Random effect nesting; GID within Env, which expands to (1|Env) + (1|Env:GID)
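A toy fit putting the terms above together (`Yield`, `GID`, `Env`, and `df` are placeholder names for illustration):

```r
# Sketch: genotype and environment as random effects, BLUPs extracted per genotype
library(lme4)

fit <- lmer(Yield ~ (1 | GID) + (1 | Env) + (1 | Env:GID), data = df)
summary(fit)           # variance components for GID, Env, and their interaction
blups <- ranef(fit)$GID   # BLUPs for each genotype (e.g., for a two-step GP pipeline)
```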