How to make a loop to calculate a function based on strings in a column?

Refresh

March 2019

Views

46 time

1

I have a data.frame that looks like:

              SNP              CLST A1 A2       FRQ IMP     POS CHR BVAL
    1   rs2803291            Brahui  C  T  0.660000   0 1882185   1  878
    2   rs2803291           Balochi  C  T  0.750000   0 1882185   1  878
    3   rs2803291            Hazara  C  T  0.772727   0 1882185   1  878
    4   rs2803291           Makrani  C  T  0.620000   0 1882185   1  878
    5   rs2803291            Sindhi  C  T  0.770833   0 1882185   1  878
    6   rs2803291            Pathan  C  T  0.681818   0 1882185   1  878
    53  rs12060022           Brahui  T  C 0.0600000   1 3108186   1  982
    54  rs12060022          Balochi  T  C 0.0416667   1 3108186   1  982
    55  rs12060022           Hazara  T  C 0.0000000   1 3108186   1  982
    56  rs12060022          Makrani  T  C 0.0200000   1 3108186   1  982
    57  rs12060022           Sindhi  T  C 0.0625000   1 3108186   1  982
    58  rs12060022           Pathan  T  C 0.0681818   1 3108186   1  982
    105   rs870171           Brahui  T  G 0.2200000   0 3332664   1  976
    106   rs870171          Balochi  T  G 0.3333330   0 3332664   1  976
    107   rs870171           Hazara  T  G 0.3636360   0 3332664   1  976
    108   rs870171          Makrani  T  G 0.1800000   0 3332664   1  976
    109   rs870171           Sindhi  T  G 0.2083330   0 3332664   1  976
    110   rs870171           Pathan  T  G 0.1590910   0 3332664   1  976
    157  rs4282783           Brahui  G  T 0.8400000   1 4090545   1  992
    158  rs4282783          Balochi  G  T 0.9583333   1 4090545   1  992
    159  rs4282783           Hazara  G  T 0.8409090   1 4090545   1  992
    160  rs4282783          Makrani  G  T 0.9000000   1 4090545   1  992
    161  rs4282783           Sindhi  G  T 0.8958330   1 4090545   1  992
    162  rs4282783           Pathan  G  T 0.9772727   1 4090545   1  992

Each SNP locus has certain populations associated with it and a certain frequency (FRQ) for each population. There are "L" amount of unique SNPs in the total data.frame. I would like to randomly sample 3 SNPs from the data.frame and then I would like to take the sum of (FRQ_balochi_SNP1 - FRQ_Pathan_SNP1)* *(FRQ_Y_SNP1 - FRQ_Pathan_SNP1) across + (FRQ_balochi_SNP2 - FRQ_Pathan_SNP2) * (FRQ_Y_SNP2 - FRQ_Pathan_SNP2) + (FRQ_balochi_SNP3 - FRQ_Pathan_SNP3) * (FRQ_Y_SNP3 - FRQ_Pathan_SNP3) using the "3" randomly generated SNPs. The notation looks something like Value = Sum(i to 3) of (FRQ_Bal_i - FRQ_Pat_i) * (FRQ_Y_i - FRQ_Pat_i). Y is a given population. For example: "Hazara".

I would like my output to be a list of Values from this calculation along with their Y populations.

For example, let's walk through Hazara as our Y population. We randomly sample and get SNP1, SNP2, and SNP4. The first SNP (rs2803291) gives us (0.75 - 0.681818) * (0.772727 - 0.681818) for a value of 0.006198. The second SNP (rs12060022) gives us (0.041666 - 0.0681818) * (0.0000 - 0.061818) for a value of 0.001639. The fourth SNP (rs4282783) gives us (0.958333 - 0.9772727) * (0.8409090 - 0.9772727) for a value of 0.002582. Summing our values together we would get 0.006198+0.001639+0.002582 for a total sum of 0.01402. Thus the first line of the output file would be

Population   Value
Hazara       0.01402
Makrani      ???

I would like this done for every population, including Balochi and Pathan if possible.

1 answers

2

Я хотел бы создать вспомогательную функцию а затем поместить его в обхвате механизма, который будет пытаться каждую этикетку:

library(dplyr)

snp_sum <- function(SNP, FRQ, CLST) {
  (FRQ[CLST == "Balochi"] - FRQ[CLST == "Pathan"]) * (FRQ[CLST == SNP] - FRQ[CLST == "Pathan"])
}

sum_df <- function(mydf, clst_list) {
  lst <- lapply(clst_list, function(x) {
           mydf %>% group_by(SNP) %>%
           summarise(FRQ_SUM=snp_sum(x, FRQ, CLST)) %>%
           summarise(Value=sum(FRQ_SUM[sample(n(), 3)]))
         })
  cbind.data.frame(Population=clst_list, do.call("rbind", lst))
}

sum_df(df1, unique(df1$CLST))
#   Population        Value
# 1     Brahui 0.0134297098
# 2    Balochi 0.0353677606
# 3     Hazara 0.0400308238
# 4    Makrani 0.0008918497
# 5     Sindhi 0.0161916643
# 6     Pathan 0.0000000000

редактировать

Возможная скорость с встроенным R пакет под названием parallel:

library(parallel)
no_cores <- detectCores() - 1L
cl <- makeCluster(no_cores)
clusterExport(cl, c("df1", "snp_sum"))
clusterEvalQ(cl, library(dplyr))

sum_parallel <- parLapply(cl, unique(df1$CLST), function(x) {

  df1 %>% group_by(SNP) %>%
    summarise(FRQ_SUM = snp_sum(x, FRQ, CLST)) %>%
    summarise(Value=sum(FRQ_SUM[sample(n(), 3)]))
})

cbind.data.frame(Population=unique(df1$CLST), do.call("rbind", sum_parallel))

stopCluster(cl)