V prvej casti cvicenia si kratko vyskusate zhlukovanie dat v R a jednoduche zobrazenie najdenych zhlukov.

Posluzi nam priklad z prednasky:

set1 <- matrix(cbind(rnorm(100,0,2),rnorm(100,0,2)),100,2)
set2 <- matrix(cbind(rnorm(100,0,2),rnorm(100,8,2)),100,2)
set3 <- matrix(cbind(rnorm(100,8,2),rnorm(100,0,2)),100,2)
set4 <- matrix(cbind(rnorm(100,8,2),rnorm(100,8,2)),100,2)
dati <- list(values=rbind(set1,set2,set3,set4),classes=c(rep(1,100),rep(2,100),rep(3,100),rep(4,100)))
# clustering - common methods
op <- par(mfcol = c(2, 2))
par(las =1)
plot(dati$values, col = as.integer(dati$classes), xlim=c(-6,14), ylim = c(-6,14), xlab="", ylab="", main = "True Groups")
party <- kmeans(dati$values,4)
plot(dati$values, col = party$cluster, xlab = "", ylab = "", main = "kmeans")
hc = hclust(dist(dati$values), method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
memb <- cutree(hc, k = 4)
plot(dati$values, col = memb, xlab = "", ylab = "", main = "hclust Euclidean ward")
hc = hclust(dist(dati$values), method = "complete")
memb <- cutree(hc, k = 4)
plot(dati$values, col = memb, xlab = "", ylab = "", main = "hclust Euclidean complete")

par(op)

Cielom zvysnej casti cvicenia je zoznamit sa s pracou s grafmi v R a vytvorit funkciu, ktora zobrazi ako graf okolie lubovolneho vrcholu grafu. Nasledovne tuto funkciu vylepsime o graficke znazornenie vlastnosti vrcholov.

Ako priklad pouzijeme sadu dat z kvasiniek. Obsahuje zoznam proteinov “yeast_proteins.txt”, ich sekvencie “yeastProteins.fasta”, interakcie “yeast_interactions.txt” a vlastnosti “yeastProteins.Rdata”.

Budeme pouzivat balik “igraph”

library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union

Nacitame si data do premennych v R.

edge <- read.table("data/yeast_interactions.txt")
vert <- read.table("data/yeast_proteins.txt")
orfs <- vert[,2] # nazvy genov
#vert <- vert[,1] # budeme pouzivat len cisla vrcholov

a vytvorime prislusny graf (datovy objekt)

g <-  graph.data.frame(edge,vertices=vert,directed=FALSE)

Vykreslit podgraf pre prvych 10 hran je jednoduche

plot(induced_subgraph(g,1:10))

Vsimnite si, ze zobrazenie pri opakovanom volani funkcie vyzera vzdy trochu inak. Aby obrazky boli informativne uniestnime do nich oznacenie z premennej “orfs” a upravime velkost pisma a niekolko dalsich parametrov

plot(induced_subgraph(g,1:10),vertex.label=orfs[1:10],vertex.label.cex=0.8,vertex.label.dist=1,vertex.label.degree=-pi/2)

Umiestnenie uzlov je v tejto chvili nahodne, nechame ich rozmiestnit do kruhu

plot(induced_subgraph(g,1:10),vertex.label=orfs[1:10],vertex.label.cex=0.8,vertex.label.dist=1,vertex.label.degree=-pi/2,layout=layout.circle)

Takto vytvoreny graf nam posluzi ako podklad pre vizualizaciu. Pokusime sa najprv zobrazit na grafe nejake zaujimave vlastnosti proteinov z kvasiniek. Jednou z takych vlastnosti moze byt molekulova hmotnost. Hypotezou by mohlo byt, ze vacsie proteiny maju viac partnerov. Pozrime sa najprv na datovy subor (pochadza z baliku ppiData; ppi = protein/protein interaction)

load("data/yeast_proteins.Rdata")
names(proteinProperty) # nazvy stlpcov (premennych) v tabulke
##  [1] "FEATURE (ORF name)"                                                   
##  [2] "SGDID"                                                                
##  [3] "MOLECULAR WEIGHT (in Daltons)"                                        
##  [4] "PI"                                                                   
##  [5] "CAI (Codon Adaptation Index)"                                         
##  [6] "PROTEIN LENGTH"                                                       
##  [7] "N TERM SEQ"                                                           
##  [8] "C TERM SEQ"                                                           
##  [9] "CODON BIAS"                                                           
## [10] "ALA"                                                                  
## [11] "ARG"                                                                  
## [12] "ASN"                                                                  
## [13] "ASP"                                                                  
## [14] "CYS"                                                                  
## [15] "GLN"                                                                  
## [16] "GLU"                                                                  
## [17] "GLY"                                                                  
## [18] "HIS"                                                                  
## [19] "ILE"                                                                  
## [20] "LEU"                                                                  
## [21] "LYS"                                                                  
## [22] "MET"                                                                  
## [23] "PHE"                                                                  
## [24] "PRO"                                                                  
## [25] "SER"                                                                  
## [26] "THR"                                                                  
## [27] "TRP"                                                                  
## [28] "TYR"                                                                  
## [29] "VAL"                                                                  
## [30] "FOP SCORE (Frequency of Optimal Codons)"                              
## [31] "GRAVY SCORE (Hydropathicity of Protein)"                              
## [32] "AROMATICITY SCORE (Frequency of aromatic amino acids: Phe, Tyr, Trp)" 
## [33] "Feature type (ORF classification: Verified, Uncharacterized, Dubious)"
head(proteinProperty) # prvych 6 riadkov v tabulke
##   FEATURE (ORF name)      SGDID MOLECULAR WEIGHT (in Daltons)    PI CAI (Codon Adaptation Index) PROTEIN LENGTH N TERM SEQ
## 1            YAL001C S000000001                        132107 10.16                         .121           1160    MVLTIYP
## 2            YAL002W S000000002                        144961  5.74                         .134           1274    MEQNGLD
## 3            YAL003W S000000003                         22627  4.13                         .763            206    MASTDFS
## 4            YAL004W S000002136                         23811  5.04                         .198            215    MGVTSGG
## 5            YAL005C S000000004                         69767  4.82                         .709            642    MSKAVGI
## 6            YAL007C S000000005                         24063  4.83                         .235            215    MIKSTIA
##   C TERM SEQ CODON BIAS ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL
## 1    YSIYEST      -.019  52  64  66  79   7  37  73  59  16  87  97 125  17  42  30  97  75  15  35  87
## 2    ESNPKIV       .055  50  41  75  69  26  53  80  45  30 102 161  79  15  59  47 129  67  14  50  82
## 3    IAAMQKL       .741  32   2   6  22   1   9  22   6   3  12  12  18   4  10   6  15   8   3   3  12
## 4    NTRVCCT       .295   0   9  15  13   2  12   9  17   6  13  23   5   1  12   0  28  20   5   0  25
## 5    PTVEEVD        .76  60  26  32  49   3  23  51  50   5  40  48  53   8  30  27  40  45   1  10  41
## 6    GRQKNYV        .23  16   6  14  12   2   5  13   8   1  22  22  14   3  11   4  19  16   2   9  16
##   FOP SCORE (Frequency of Optimal Codons) GRAVY SCORE (Hydropathicity of Protein)
## 1                                    .409                                -.511552
## 2                                    .446                                -.090346
## 3                                    .849                                -.416505
## 4                                    .589                                -.032558
## 5                                    .858                                -.426013
## 6                                    .557                                 .147907
##   AROMATICITY SCORE (Frequency of aromatic amino acids: Phe, Tyr, Trp)
## 1                                                               .07931
## 2                                                              .096546
## 3                                                               .07767
## 4                                                               .07907
## 5                                                              .063863
## 6                                                              .102326
##   Feature type (ORF classification: Verified, Uncharacterized, Dubious)
## 1                                                          ORF|Verified
## 2                                                          ORF|Verified
## 3                                                          ORF|Verified
## 4                                                           ORF|Dubious
## 5                                                          ORF|Verified
## 6                                                          ORF|Verified

Dalej nas bude zaujimat

head(proteinProperty$FEATURE) # identifikator
## [1] YAL001C YAL002W YAL003W YAL004W YAL005C YAL007C
## 6714 Levels: Q0010 Q0017 Q0032 Q0045 Q0050 Q0055 Q0060 Q0065 Q0070 Q0075 Q0080 Q0085 Q0092 Q0105 Q0110 Q0115 Q0120 ... YPR204W
head(proteinProperty$MOL) # molekulova hmotnost
## [1] 132107 144961 22627  23811  69767  24063 
## 6387 Levels: 100017 100024 100034 100152 100208 100228 10027 100281 100289 10030 100331 100339 100341 100394 100475 ... 99999
head(proteinProperty$GRAVY) # hydrofobicita
## [1] -.511552 -.090346 -.416505 -.032558 -.426013 .147907 
## 6472 Levels:  .000359 .000387 .000476 .000637 .000643 -.000858 -.001066 -.001266 -.00138 -.001389 .00147 -.001531 ... -.999528

Vyuzijeme tieto tri udaje na konstrukciu jednoduchych funkcii zistujucich mol.hmotnost lubovolneho proteinu zadaneho pomocou identifikatora alebo jeho hydrofobicitu

mw <-
function (id) 
{
  as.numeric(as.vector(proteinProperty$MOL[which(proteinProperty$FEATURE == id)]))
}

hf <-
function (id) 
{
  as.numeric(as.vector(proteinProperty$GRAVY[which(proteinProperty$FEATURE == id)]))
}

Overte si, ze

mw("YAL005C")
## [1] 69767
hf("YAL005C")
## [1] -0.426013

je 69767 a -0.426013. Teraz vytvorime vektor s velkostami pre uzly grafu odvodenymi z mol.hmotnosti a farbou podla hydrofobicity

size <- as.numeric(as.vector(mapply(mw,as.vector(orfs))))
colr <- as.numeric(as.vector(mapply(hf,as.vector(orfs))))

a priradime preskalovane vlastnostiam vrcholov grafu

V(g)$size <- size/10000
V(g)$color <- floor((colr+2.5)*2)

a teraz spojime vsetky vylepsenia

plot(induced_subgraph(g,1:10),vertex.label=orfs[1:10],vertex.label.cex=0.8,vertex.label.dist=1,vertex.label.degree=-pi/2,layout=layout.circle)

Vytvorte funkciu, ktorej argument bude identifikator a graf, a ktora Vam vrati zoznam uzlov, ktore s danym uzlom susedia. Vyuzit mozete funkciu neighbors(). Vylepsite tuto funkciu aby rovno vytvorila prislusny podgraf, ktory sa da zobrazit prikazom plot().

Experimentujte s tymito datami s balikom ggraph (https://cran.r-project.org/web/packages/ggraph/vignettes/Layouts.html) Na zaver pozhlukujte data o proteinoch hmotnosti a hydrofobicity a zobrazte ako kruhovy dendrogram.