V prvej casti cvicenia si kratko vyskusate zhlukovanie dat v R a jednoduche zobrazenie najdenych zhlukov.
Posluzi nam priklad z prednasky:
set1 <- matrix(cbind(rnorm(100,0,2),rnorm(100,0,2)),100,2)
set2 <- matrix(cbind(rnorm(100,0,2),rnorm(100,8,2)),100,2)
set3 <- matrix(cbind(rnorm(100,8,2),rnorm(100,0,2)),100,2)
set4 <- matrix(cbind(rnorm(100,8,2),rnorm(100,8,2)),100,2)
dati <- list(values=rbind(set1,set2,set3,set4),classes=c(rep(1,100),rep(2,100),rep(3,100),rep(4,100)))
# clustering - common methods
op <- par(mfcol = c(2, 2))
par(las =1)
plot(dati$values, col = as.integer(dati$classes), xlim=c(-6,14), ylim = c(-6,14), xlab="", ylab="", main = "True Groups")
party <- kmeans(dati$values,4)
plot(dati$values, col = party$cluster, xlab = "", ylab = "", main = "kmeans")
hc = hclust(dist(dati$values), method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
memb <- cutree(hc, k = 4)
plot(dati$values, col = memb, xlab = "", ylab = "", main = "hclust Euclidean ward")
hc = hclust(dist(dati$values), method = "complete")
memb <- cutree(hc, k = 4)
plot(dati$values, col = memb, xlab = "", ylab = "", main = "hclust Euclidean complete")
par(op)
Cielom zvysnej casti cvicenia je zoznamit sa s pracou s grafmi v R a vytvorit funkciu, ktora zobrazi ako graf okolie lubovolneho vrcholu grafu. Nasledovne tuto funkciu vylepsime o graficke znazornenie vlastnosti vrcholov.
Ako priklad pouzijeme sadu dat z kvasiniek. Obsahuje zoznam proteinov “yeast_proteins.txt”, ich sekvencie “yeastProteins.fasta”, interakcie “yeast_interactions.txt” a vlastnosti “yeastProteins.Rdata”.
Budeme pouzivat balik “igraph”
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
Nacitame si data do premennych v R.
edge <- read.table("data/yeast_interactions.txt")
vert <- read.table("data/yeast_proteins.txt")
orfs <- vert[,2] # nazvy genov
#vert <- vert[,1] # budeme pouzivat len cisla vrcholov
a vytvorime prislusny graf (datovy objekt)
g <- graph.data.frame(edge,vertices=vert,directed=FALSE)
Vykreslit podgraf pre prvych 10 hran je jednoduche
plot(induced_subgraph(g,1:10))
Vsimnite si, ze zobrazenie pri opakovanom volani funkcie vyzera vzdy trochu inak. Aby obrazky boli informativne uniestnime do nich oznacenie z premennej “orfs” a upravime velkost pisma a niekolko dalsich parametrov
plot(induced_subgraph(g,1:10),vertex.label=orfs[1:10],vertex.label.cex=0.8,vertex.label.dist=1,vertex.label.degree=-pi/2)
Umiestnenie uzlov je v tejto chvili nahodne, nechame ich rozmiestnit do kruhu
plot(induced_subgraph(g,1:10),vertex.label=orfs[1:10],vertex.label.cex=0.8,vertex.label.dist=1,vertex.label.degree=-pi/2,layout=layout.circle)
Takto vytvoreny graf nam posluzi ako podklad pre vizualizaciu. Pokusime sa najprv zobrazit na grafe nejake zaujimave vlastnosti proteinov z kvasiniek. Jednou z takych vlastnosti moze byt molekulova hmotnost. Hypotezou by mohlo byt, ze vacsie proteiny maju viac partnerov. Pozrime sa najprv na datovy subor (pochadza z baliku ppiData; ppi = protein/protein interaction)
load("data/yeast_proteins.Rdata")
names(proteinProperty) # nazvy stlpcov (premennych) v tabulke
## [1] "FEATURE (ORF name)"
## [2] "SGDID"
## [3] "MOLECULAR WEIGHT (in Daltons)"
## [4] "PI"
## [5] "CAI (Codon Adaptation Index)"
## [6] "PROTEIN LENGTH"
## [7] "N TERM SEQ"
## [8] "C TERM SEQ"
## [9] "CODON BIAS"
## [10] "ALA"
## [11] "ARG"
## [12] "ASN"
## [13] "ASP"
## [14] "CYS"
## [15] "GLN"
## [16] "GLU"
## [17] "GLY"
## [18] "HIS"
## [19] "ILE"
## [20] "LEU"
## [21] "LYS"
## [22] "MET"
## [23] "PHE"
## [24] "PRO"
## [25] "SER"
## [26] "THR"
## [27] "TRP"
## [28] "TYR"
## [29] "VAL"
## [30] "FOP SCORE (Frequency of Optimal Codons)"
## [31] "GRAVY SCORE (Hydropathicity of Protein)"
## [32] "AROMATICITY SCORE (Frequency of aromatic amino acids: Phe, Tyr, Trp)"
## [33] "Feature type (ORF classification: Verified, Uncharacterized, Dubious)"
head(proteinProperty) # prvych 6 riadkov v tabulke
## FEATURE (ORF name) SGDID MOLECULAR WEIGHT (in Daltons) PI CAI (Codon Adaptation Index) PROTEIN LENGTH N TERM SEQ
## 1 YAL001C S000000001 132107 10.16 .121 1160 MVLTIYP
## 2 YAL002W S000000002 144961 5.74 .134 1274 MEQNGLD
## 3 YAL003W S000000003 22627 4.13 .763 206 MASTDFS
## 4 YAL004W S000002136 23811 5.04 .198 215 MGVTSGG
## 5 YAL005C S000000004 69767 4.82 .709 642 MSKAVGI
## 6 YAL007C S000000005 24063 4.83 .235 215 MIKSTIA
## C TERM SEQ CODON BIAS ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL
## 1 YSIYEST -.019 52 64 66 79 7 37 73 59 16 87 97 125 17 42 30 97 75 15 35 87
## 2 ESNPKIV .055 50 41 75 69 26 53 80 45 30 102 161 79 15 59 47 129 67 14 50 82
## 3 IAAMQKL .741 32 2 6 22 1 9 22 6 3 12 12 18 4 10 6 15 8 3 3 12
## 4 NTRVCCT .295 0 9 15 13 2 12 9 17 6 13 23 5 1 12 0 28 20 5 0 25
## 5 PTVEEVD .76 60 26 32 49 3 23 51 50 5 40 48 53 8 30 27 40 45 1 10 41
## 6 GRQKNYV .23 16 6 14 12 2 5 13 8 1 22 22 14 3 11 4 19 16 2 9 16
## FOP SCORE (Frequency of Optimal Codons) GRAVY SCORE (Hydropathicity of Protein)
## 1 .409 -.511552
## 2 .446 -.090346
## 3 .849 -.416505
## 4 .589 -.032558
## 5 .858 -.426013
## 6 .557 .147907
## AROMATICITY SCORE (Frequency of aromatic amino acids: Phe, Tyr, Trp)
## 1 .07931
## 2 .096546
## 3 .07767
## 4 .07907
## 5 .063863
## 6 .102326
## Feature type (ORF classification: Verified, Uncharacterized, Dubious)
## 1 ORF|Verified
## 2 ORF|Verified
## 3 ORF|Verified
## 4 ORF|Dubious
## 5 ORF|Verified
## 6 ORF|Verified
Dalej nas bude zaujimat
head(proteinProperty$FEATURE) # identifikator
## [1] YAL001C YAL002W YAL003W YAL004W YAL005C YAL007C
## 6714 Levels: Q0010 Q0017 Q0032 Q0045 Q0050 Q0055 Q0060 Q0065 Q0070 Q0075 Q0080 Q0085 Q0092 Q0105 Q0110 Q0115 Q0120 ... YPR204W
head(proteinProperty$MOL) # molekulova hmotnost
## [1] 132107 144961 22627 23811 69767 24063
## 6387 Levels: 100017 100024 100034 100152 100208 100228 10027 100281 100289 10030 100331 100339 100341 100394 100475 ... 99999
head(proteinProperty$GRAVY) # hydrofobicita
## [1] -.511552 -.090346 -.416505 -.032558 -.426013 .147907
## 6472 Levels: .000359 .000387 .000476 .000637 .000643 -.000858 -.001066 -.001266 -.00138 -.001389 .00147 -.001531 ... -.999528
Vyuzijeme tieto tri udaje na konstrukciu jednoduchych funkcii zistujucich mol.hmotnost lubovolneho proteinu zadaneho pomocou identifikatora alebo jeho hydrofobicitu
mw <-
function (id)
{
as.numeric(as.vector(proteinProperty$MOL[which(proteinProperty$FEATURE == id)]))
}
hf <-
function (id)
{
as.numeric(as.vector(proteinProperty$GRAVY[which(proteinProperty$FEATURE == id)]))
}
Overte si, ze
mw("YAL005C")
## [1] 69767
hf("YAL005C")
## [1] -0.426013
je 69767 a -0.426013. Teraz vytvorime vektor s velkostami pre uzly grafu odvodenymi z mol.hmotnosti a farbou podla hydrofobicity
size <- as.numeric(as.vector(mapply(mw,as.vector(orfs))))
colr <- as.numeric(as.vector(mapply(hf,as.vector(orfs))))
a priradime preskalovane vlastnostiam vrcholov grafu
V(g)$size <- size/10000
V(g)$color <- floor((colr+2.5)*2)
a teraz spojime vsetky vylepsenia
plot(induced_subgraph(g,1:10),vertex.label=orfs[1:10],vertex.label.cex=0.8,vertex.label.dist=1,vertex.label.degree=-pi/2,layout=layout.circle)
Vytvorte funkciu, ktorej argument bude identifikator a graf, a ktora Vam vrati zoznam uzlov, ktore s danym uzlom susedia. Vyuzit mozete funkciu neighbors(). Vylepsite tuto funkciu aby rovno vytvorila prislusny podgraf, ktory sa da zobrazit prikazom plot().
Experimentujte s tymito datami s balikom ggraph (https://cran.r-project.org/web/packages/ggraph/vignettes/Layouts.html) Na zaver pozhlukujte data o proteinoch hmotnosti a hydrofobicity a zobrazte ako kruhovy dendrogram.