Jonathan Pevsner
Bioinformatics and Functional Genomics (3rd edition)
August, 2015

First set your working directory (it can be your desktop or anywhere else). Use getwd() to see the current working directory, and (as needed) use setwd() to change it.

getwd()
## [1] "/Users/pevsner/Documents/#3e/3e_SolutionsToProblems"
# setwd("/Users/pevsner/Documents/#3e/3e_SolutionsToProblems")

Next download a file containing chromosome 11 repeats into this directory. Confirm the file is present using dir().
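A programmatic check can complement a visual scan of the dir() listing. The following is a minimal sketch, not part of the original exercise: it creates an empty placeholder file in a temporary directory (so that it runs anywhere) and then tests for it. The directory name repeats_demo and the placeholder file are assumptions made for illustration; with real data you would check your own working directory.

```r
# Illustrative setup: a temporary directory stands in for your working directory.
demo_dir <- file.path(tempdir(), "repeats_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Create an empty placeholder with the file name used in this tutorial.
demo_file <- file.path(demo_dir, "ucsc_chr11_repeats.txt")
file.create(demo_file)

# dir() lists the files; file.exists() answers the yes/no question directly.
present_in_listing <- "ucsc_chr11_repeats.txt" %in% dir(demo_dir)
present_on_disk <- file.exists(demo_file)
print(c(listing = present_in_listing, exists = present_on_disk))
```

If either value is FALSE for your own file, check the download location and the working directory before calling read.delim.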

We’ll load the file into R with the read.delim function and call the new table myRepeats (you can assign it almost any name you prefer). We specify that the table has a header (header = TRUE) and that it is tab-delimited (sep = "\t"). To get help on a function such as read.delim, enter ?read.delim at the R command prompt.

myRepeats <- read.delim("ucsc_chr11_repeats.txt", header = TRUE, sep = "\t")
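To see what header and sep do without the downloaded file, here is a small sketch that reads tab-delimited text held in a string. The two data rows are copied from the output shown in this section; the leading "#" in the header line is an assumption about how the original file begins. Note also that from R 4.0.0 onward read.delim no longer converts text columns to factors by default, so stringsAsFactors = TRUE is set here to match the R 3.1.3 behavior seen in this tutorial.

```r
# Toy tab-delimited text standing in for the downloaded file.
# The leading "#" in the header is an assumption, not taken from the file.
toy_text <- paste(
  "#swScore\tgenoStart\tgenoEnd\tstrand\trepName\trepClass",
  "208\t5230215\t5230295\t+\tMIRb\tSINE",
  "1218\t5230647\t5231194\t-\tL1ME3A\tLINE",
  sep = "\n"
)

# stringsAsFactors = TRUE reproduces the pre-4.0 default, so text columns
# become factors as in the str() output of this tutorial.
toy <- read.delim(text = toy_text, header = TRUE, sep = "\t",
                  stringsAsFactors = TRUE)

print(colnames(toy))        # "#swScore" is not a valid name, so R rewrites it
print(sapply(toy, class))   # column types after import
```

Under these assumptions the first column name comes back as X.swScore, which may explain the header seen in the head() output of the real file.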

Now we’ll look at the table in various ways and change the names of the column headers.

head(myRepeats)
##   X.swScore genoStart genoEnd strand repName repClass
## 1       208   5230215 5230295      +    MIRb     SINE
## 2      1218   5230647 5231194      -  L1ME3A     LINE
## 3       189   5231331 5231407      -    MIRc     SINE
## 4      1691   5232000 5232286      +   AluJb     SINE
## 5     12383   5232660 5234055      - L1PREC2     LINE
## 6      1530   5234055 5234278      -   L1PA5     LINE
colnames(myRepeats) <- c("swScore", "start", "end", "strand", "name", "class")
head(myRepeats)
##   swScore   start     end strand    name class
## 1     208 5230215 5230295      +    MIRb  SINE
## 2    1218 5230647 5231194      -  L1ME3A  LINE
## 3     189 5231331 5231407      -    MIRc  SINE
## 4    1691 5232000 5232286      +   AluJb  SINE
## 5   12383 5232660 5234055      - L1PREC2  LINE
## 6    1530 5234055 5234278      -   L1PA5  LINE
summary(myRepeats)
##     swScore          start              end          strand      name   
##  Min.   :   21   Min.   :5230215   Min.   :5230295   -:36   AT_rich:10  
##  1st Qu.:  218   1st Qu.:5249258   1st Qu.:5249306   +:55   L2a    : 4  
##  Median :  342   Median :5262340   Median :5262377          (CA)n  : 3  
##  Mean   : 1803   Mean   :5265313   Mean   :5265657          (TA)n  : 3  
##  3rd Qu.: 1402   3rd Qu.:5283566   3rd Qu.:5285348          L1PA11 : 3  
##  Max.   :25729   Max.   :5299772   Max.   :5300052          L1PA7  : 3  
##                                                             (Other):65  
##             class   
##  DNA           : 2  
##  LINE          :34  
##  Low_complexity:15  
##  LTR           : 5  
##  Simple_repeat :17  
##  SINE          :17  
##  Unknown       : 1
dim(myRepeats)
## [1] 91  6
str(myRepeats)
## 'data.frame':    91 obs. of  6 variables:
##  $ swScore: int  208 1218 189 1691 12383 1530 12383 4149 266 797 ...
##  $ start  : int  5230215 5230647 5231331 5232000 5232660 5234055 5234278 5235524 5236584 5236631 ...
##  $ end    : int  5230295 5231194 5231407 5232286 5234055 5234278 5235526 5236191 5236624 5236773 ...
##  $ strand : Factor w/ 2 levels "-","+": 2 1 1 2 1 1 1 2 2 1 ...
##  $ name   : Factor w/ 51 levels "(A)n","(CA)n",..: 48 29 49 13 38 36 38 38 10 23 ...
##  $ class  : Factor w/ 7 levels "DNA","LINE","Low_complexity",..: 6 2 6 6 2 2 2 2 5 6 ...
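The str() output shows each factor as a set of levels followed by integers. As an illustrative sketch, the code below builds a factor from the first four strand values shown above and prints both parts. The levels are given explicitly, which is a choice made for this example: the default ordering of "-" and "+" depends on the locale's sort order.

```r
# Four strand values copied from the first rows shown above.
strand_values <- c("+", "-", "-", "+")

# Set the levels explicitly so the result does not depend on the locale.
strand <- factor(strand_values, levels = c("-", "+"))

print(levels(strand))       # the distinct labels
print(as.integer(strand))   # the integer codes that str() displays
print(table(strand))        # counts per level, as summary() reports for factors
```

Each integer is the position of that value in the level list, so "+" is stored as 2 and "-" as 1.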

Plot the SW scores for each repeat class as a boxplot. Because class is a factor and swScore is numeric, plot() draws one box per class.

plot(x = myRepeats$class, 
     y = myRepeats$swScore, 
     main = "Repeat classes in the human beta globin locus",
     col = "pink", 
     xlab = "repeat class",
     ylab = "SW score")
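The same kind of figure can be requested explicitly with boxplot() and a formula. This is a sketch on six rows copied from the head() output above rather than the full table; the data frame name toy is hypothetical. The conversion with factor() matters in R 4.0.0 and later, where text columns are character by default and plot(x, y) with a character x would not produce a boxplot.

```r
# Six rows copied from the head() output above.
toy <- data.frame(
  swScore = c(208, 1218, 189, 1691, 12383, 1530),
  class = c("SINE", "LINE", "SINE", "SINE", "LINE", "LINE")
)

# Make sure the grouping variable is a factor, as it was in the
# R 3.1.3 session shown in this tutorial.
toy$class <- factor(toy$class)

# Formula interface: swScore grouped by class. The returned list holds
# the statistics behind each box.
box_stats <- boxplot(swScore ~ class, data = toy,
                     main = "Toy example: six repeats",
                     col = "pink", xlab = "repeat class", ylab = "SW score")
print(box_stats$names)
```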

Use tapply to apply the functions mean and range to the swScore values within each repeat class.

tapply(myRepeats$swScore, myRepeats$class, mean)
##            DNA           LINE Low_complexity            LTR  Simple_repeat 
##      1118.0000      3729.9706       112.4667      1055.4000       252.7059 
##           SINE        Unknown 
##      1383.4706       230.0000
tapply(myRepeats$swScore, myRepeats$class, range)
## $DNA
## [1] 1118 1118
## 
## $LINE
## [1]   193 25729
## 
## $Low_complexity
## [1]  21 556
## 
## $LTR
## [1]  801 1275
## 
## $Simple_repeat
## [1] 185 335
## 
## $SINE
## [1]  189 2615
## 
## $Unknown
## [1] 230 230
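What tapply does can be spelled out in two steps: split the values by group, then apply the function to each group. The sketch below does this on six rows copied from the head() output above (the data frame name toy is hypothetical), so the numbers differ from the full-table results.

```r
# Six rows copied from the head() output above.
toy <- data.frame(
  swScore = c(208, 1218, 189, 1691, 12383, 1530),
  class = factor(c("SINE", "LINE", "SINE", "SINE", "LINE", "LINE"))
)

# tapply: group swScore by class, then apply mean to each group.
class_means <- tapply(toy$swScore, toy$class, mean)

# The same computation in two explicit steps.
groups <- split(toy$swScore, toy$class)
manual_means <- sapply(groups, mean)

print(groups)
print(class_means)

# range returns two values per group, so the result holds one
# length-2 vector per class.
class_ranges <- tapply(toy$swScore, toy$class, range)
print(class_ranges[["SINE"]])
```

Because mean returns a single number per group, tapply simplifies the result to a named vector; with range it keeps a list-like structure, which is why the two outputs above are printed differently.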

Show the session information, which records the R version, platform, and loaded packages used to produce these results.

sessionInfo()
## R version 3.1.3 (2015-03-09)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.9.5 (Mavericks)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.8    evaluate_0.7.2  formatR_1.2     htmltools_0.2.6
##  [5] knitr_1.10.5    magrittr_1.5    rmarkdown_0.7   stringi_0.5-5  
##  [9] stringr_1.0.0   tools_3.1.3     yaml_2.1.13
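If you want to keep this record with your results, one option is to write it to a text file. This is a minimal sketch: the file name sessionInfo.txt is a placeholder, and a temporary directory is used so that the example runs anywhere.

```r
# Capture the printed session information as lines of text.
session_lines <- capture.output(sessionInfo())

# Write the lines to a file; the name and location are placeholders.
session_file <- file.path(tempdir(), "sessionInfo.txt")
writeLines(session_lines, session_file)

print(R.version.string)
print(head(readLines(session_file), 3))
```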