OneDrive link doesn't work
Hi,
I was hoping to access the G4iMGrinder results in the human genome, but the OneDrive link doesn't load for me.
Please can you check the link is working?
Thanks!
Hi Efres,
Thanks for the suggestion! It worked when I tried the link from my laptop at home, so it must have been blocked by my institute.
I have also been trying the script on a chromosome-by-chromosome basis for the newer telomere-to-telomere (T2T) release of the human genome. I just want to check that the script I used is valid. Out of curiosity I looked at some of the telomere sequences, as they should be rich in both G4s and iMotifs. Looking at the predicted G4s (filtered by score > 40) and iMotifs (score < -40), at one end of the chromosome the telomere sequence was predicted purely as G4, and at the other end purely as iMotif. I read that by default the script should create the reverse complementary sequence and analyse it as well, is this correct? I've posted an example script for each below:
G4_chr1_t2t <- G4iMGrinder(Name = "chr1_t2t", Sequence = myseq_chr1, Method3 = TRUE)
iMotif_chr1_t2t <- G4iMGrinder(Name = "chr1_t2t", Sequence = myseq_chr1, RunComposition = "C", Method3 = TRUE)
Does this look ok? I would really appreciate your feedback!
Thanks
Alison
Hello again
Great news then that you could download the data.
Regarding the code, and questions:
Yes, by default the G4iMGrinder function assumes that you want to analyse the complementary strand too. For that, it will REVERSE complement the sequence you have supplied. Beware: annotation files usually give just the complement, so you need to check that both orientations coincide. If they don't, the readme on the webpage and the package help include code that will sync both.
You can tell that the results include both analyses because, with both Methods 2 and 3, the results contain a column called “strand” for each PQS or PiMS candidate (it assumes the input sequence is + and the reverse complement is -).
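To make the reverse-complement step concrete, here is a minimal base R sketch. The helper name rev_comp is hypothetical and is not part of G4iMGrinder; it only illustrates that a C-rich telomere repeat on one strand reads as a G-rich repeat on the other.

```r
# Hypothetical helper (not a G4iMGrinder function): reverse complement of a DNA string.
rev_comp <- function(dna) {
  # Swap each base for its partner, then reverse the order of the bases.
  complemented <- chartr("ACGT", "TGCA", toupper(dna))
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

# The C-rich telomere repeat on one strand is the G-rich repeat on the other.
c_rich <- paste(rep("CCCTAA", 4), collapse = "")
g_rich <- rev_comp(c_rich)
print(g_rich)
```

So a stretch that holds only C-runs on the input (+) strand would be expected to show up as G-runs on the - strand, and the reverse.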
On the other hand, when you want to run the analysis only on the input strand
(which you have to state explicitly, as in:
G4iMGrinder(Name = "chr1_t2t", Sequence = myseq_chr1, Complementary = FALSE)
) you won't have the strand column.
With the code you have written:
G4_chr1_t2t <- G4iMGrinder(Name = "chr1_t2t", Sequence = myseq_chr1, Method3 = TRUE)
iMotif_chr1_t2t <- G4iMGrinder(Name = "chr1_t2t", Sequence = myseq_chr1, RunComposition = "C", Method3 = TRUE)
You will analyze myseq_chr1 with the default values of the algorithm (which I recommend for the entire human genome, given its size and the number of results, unless you have a very good computer, time, and plenty of disk space and RAM).
The first line will analyze for PQS, and the second for PiMS as you added RunComposition = "C".
You have also included Method3 = TRUE, which means that you will search the genome in both:
a size-restrictive and overlapping manner (Method 2; PQSM2a and PQSM2b results), and
a size-unrestrictive and non-overlapping manner (Method 3; PQSM3a and PQSM3b results).
Method 3 is useful to find higher order quadruplex sequences, mostly described in the telomeres actually.
However, if you are not interested in the potential formation of higher order G4s (or i-Motifs), it may be better, for both run time and the number of results, to set Method3 = FALSE.
Concretely, in some telomeres GGGTTA is repeated many times in a row. For example, for:
“GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTA”
With Method 3, it would be detected as one single higher order G4 (as all the GGG runs are linked).
With Method 2, by contrast, the algorithm finds all overlapping G4s that fit the maximum size and the maximum number of runs (not used by default), and hence finds 15 candidates (including those that overlap).
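The figure of 15 can be reproduced with a toy count in base R. This is only an illustration of the counting, not G4iMGrinder's own algorithm, and it assumes a minimum of 4 G-runs per candidate and a maximum candidate length of 33 nt (the variable names min_runs and max_size are made up for this sketch).

```r
# Toy count of overlapping Method 2 style candidates; NOT the package's algorithm.
telomere <- paste(rep("GGGTTA", 9), collapse = "")

# Start and end positions of every GGG run (each run here is exactly 3 nt).
run_starts <- as.integer(gregexpr("GGG", telomere, fixed = TRUE)[[1]])
run_ends <- run_starts + 2L

min_runs <- 4L   # assumed minimum number of G-runs in one candidate
max_size <- 33L  # assumed maximum candidate length in nucleotides

n_runs <- length(run_starts)
candidates <- 0L
for (i in seq_len(n_runs)) {
  for (j in seq(i, n_runs)) {
    runs_in_window <- j - i + 1L
    window_size <- run_ends[j] - run_starts[i] + 1L
    # Count every window of consecutive runs that satisfies both limits.
    if (runs_in_window >= min_runs && window_size <= max_size) {
      candidates <- candidates + 1L
    }
  }
}
print(candidates)
```

Under these assumptions, windows of 4, 5 and 6 consecutive runs fit (21, 27 and 33 nt), giving 6 + 5 + 4 = 15 overlapping candidates, while a window of 7 runs (39 nt) is already too long.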
Once you have the analyses (G4_chr1_t2t and iMotif_chr1_t2t, for example), what I usually do is combine them all in one data frame (df), adding a new variable with the chromosome each result is located in.
As you are only interested in those that have a high probability of forming a G4 (using dplyr):
# Create an empty df to store all results
Alldata.M2.40 <- NULL
# Get the Method 2 results for chr 1 that score 40 or more, and create the Chr variable
Temp <- G4_chr1_t2t$PQSM2a %>% filter(Score >= 40) %>% mutate(Chr = "1")
# Bind it to the Alldata df
Alldata.M2.40 <- rbind(Alldata.M2.40, Temp)
# Repeat for all other chromosomes, in a loop if possible
# Example with chromosome 2
Temp <- G4_chr2_t2t$PQSM2a %>% filter(Score >= 40) %>% mutate(Chr = "2")
Alldata.M2.40 <- rbind(Alldata.M2.40, Temp)
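One way the per-chromosome repetition could be put in a loop is sketched below in base R (so it runs without dplyr). The results list and its Start column are toy stand-ins invented for this example; in practice each element would be the object returned by G4iMGrinder for that chromosome.

```r
# Toy stand-ins for per-chromosome results; real G4iMGrinder output has more columns.
results <- list(
  "1" = list(PQSM2a = data.frame(Start = c(10, 250, 900), Score = c(55, 12, 41))),
  "2" = list(PQSM2a = data.frame(Start = c(5, 77), Score = c(39, 63)))
)

per_chr <- lapply(names(results), function(chr) {
  hits <- results[[chr]]$PQSM2a
  # Keep candidates scoring 40 or more, as in the filter() step.
  hits <- hits[hits$Score >= 40, , drop = FALSE]
  # Tag each remaining row with its chromosome, as in the mutate() step.
  hits$Chr <- rep(chr, nrow(hits))
  hits
})

# Stack all chromosomes into one data frame.
Alldata.M2.40 <- do.call(rbind, per_chr)
print(Alldata.M2.40)
```

Naming the list elements by chromosome keeps the Chr label and the data together, so nothing has to be typed by hand per chromosome.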
Now you have all the genomic data in Alldata.M2.40, where it is easy to manipulate.
Additionally, you can now quantify how many times each sequence appears in the entire genome (and not only per chromosome).
We need to delete the Chr column we created before so:
Alldata.M2.40.B <- G4iMGrinder:::.M2B(df = Alldata.M2.40 %>% select(-Chr), RunComposition = "G" , FreqWeight = 0 )
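To show only the general idea of counting identical sequences across chromosomes, here is a toy base R sketch. It does not reproduce what the internal .M2B function computes, and the Sequence column name and the example rows are assumptions made for the illustration.

```r
# Toy pooled table; "Sequence" is an assumed column name and the rows are made up.
pooled <- data.frame(
  Sequence = c("GGGTTAGGGTTAGGGTTAGGG", "GGGAGGGAGGGAGGG", "GGGTTAGGGTTAGGGTTAGGG"),
  Chr = c("1", "1", "2"),
  stringsAsFactors = FALSE
)

# Count how often each distinct sequence occurs, ignoring the chromosome.
freq <- as.data.frame(table(Sequence = pooled$Sequence), stringsAsFactors = FALSE)

# Most frequent sequences first.
freq <- freq[order(-freq$Freq), ]
print(freq)
```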
For the i-Motifs, you can do the same, changing where necessary:
iMotif_chr1_t2t_M2.40 <- iMotif_chr1_t2t$PQSM2a %>% filter(Score <= -40) %>% mutate(Chr = "1")
If you finally decide to use Method 3, then again, do the same:
G4_chr1_t2t_M3.40 <- G4_chr1_t2t$PQSM3a %>% filter(Score >= 40) %>% mutate(Chr = "1")
I hope I have answered your questions.
Don't hesitate to ask if you have more.
Efres
Hi Efres,
Thank you for your really thorough reply, I really appreciate it!
I will try with this and compare what I get to the hg38 version, and I'll come back to you if I have any further questions.
Best wishes,
Alison