BioGenies/tidysq

Add function to apply `write_fasta` on a data frame

danlooo opened this issue · 5 comments

Is your feature request related to a problem? Please describe.

Piping in R is very common to apply a sequence of steps to an object. It usually starts with reading a file, followed by modifications and finally writing the modified object into another file like this:

library(tidyverse)

read_csv("in.csv") %>%
  filter(a > 5) %>%
  mutate(b = a / 2) %>%
  write_csv("out.csv")

I'd like to apply the same concept to tidy sequences.

Describe the solution you'd like

In order to do so, we need an implementation of the function write_fasta accepting a data.frame as an input:

library(tidyverse)
library(tidysq)

# demonstration to make the function generic
write_fasta <- function(x, ...) {
  # dispatch on the class of the first argument
  UseMethod("write_fasta")
}

#' @param sq name of the column containing the sequences
#' @param name name of the column containing the sequence names
write_fasta.data.frame <- function(data, file = "out.fasta", sq = "sq", name = "name", ...) {
  tidysq::write_fasta(x = data[[sq]], name = data[[name]], file = file, ...)
}

read_fasta("in.fasta") %>%
  mutate(name = name %>% toupper()) %>%
  write_fasta("out.fasta")

Note that write_fasta must now be a generic, so the codebase will contain several methods for it. The current function would then be refactored into a method such as write_fasta.sq_dna_bsc.
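To make the refactoring concrete, here is a minimal base-R sketch of one generic with two methods. Everything in it is a stand-in: `write_fasta_demo` and the `sq_demo` class are hypothetical names used so the example runs without tidysq, and the sequence method is a simplified writer, not the package's actual implementation.

```r
# Hypothetical stand-ins; tidysq's real functions and classes are not used here.
write_fasta_demo <- function(x, ...) {
  UseMethod("write_fasta_demo")
}

# Method for a sequence vector: plays the role of the current write_fasta().
# "sq_demo" is a made-up class wrapping a plain character vector.
write_fasta_demo.sq_demo <- function(x, name, file, ...) {
  stopifnot(length(x) == length(name))
  # Interleave ">name" header lines with the sequence lines.
  lines <- as.vector(rbind(paste0(">", name), unclass(x)))
  writeLines(lines, file)
  invisible(x)
}

# Method for a data frame: extracts the two columns and re-dispatches.
write_fasta_demo.data.frame <- function(x, file, sq = "sq", name = "name", ...) {
  write_fasta_demo(x[[sq]], name = x[[name]], file = file, ...)
  invisible(x)
}

seqs <- structure(c("ACGT", "TTGA"), class = "sq_demo")
df <- data.frame(name = c("s1", "s2"), stringsAsFactors = FALSE)
df$sq <- seqs

out <- tempfile(fileext = ".fasta")
write_fasta_demo(df, out)
written <- readLines(out)
```

Because the data frame method returns its input invisibly, it can sit at the end of a pipe or in the middle of one.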

Describe alternatives you've considered

Currently, I'm doing this manually using the current implementation of write_fasta:

read_fasta("in.fasta") %>%
  mutate(name = name %>% toupper()) %>%
  {
    .x <- .
    write_fasta(.x$sq, .x$name, "out.fasta")
  }

First of all, thanks a lot for your detailed description of a problem!

The solution you proposed is pretty much the same as what I'd do. I worry a little about backwards compatibility, but (as far as I know) there aren't that many users of this package for it to be a serious problem.

On the other hand, @DominikRafacz and I were discussing our plans for the sqibble class some time ago. The object returned by read_fasta() is a sqibble, for example (as well as a tibble and a data.frame). We were thinking about including some metadata about column "roles", that is, some info that column x stores sequences while column y stores names. This would take care of the need to pass sq and name arguments.

Though, now that I've written all that, the best approach may be to write a method for data.frame just as you described and add another method for sqibble later, where default values are taken from sqibble metadata instead of being hardcoded.
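A rough sketch of how those two layers could fit together is shown below. All names are assumptions made for illustration: the `sqibble_demo` class, its `roles` attribute, and the `fasta_columns` generic are hypothetical and do not describe how tidysq stores or will store this metadata.

```r
# All names are hypothetical: "sqibble_demo" and its "roles" attribute are
# assumptions for illustration, not the package's actual design.
new_sqibble_demo <- function(data, sq, name) {
  stopifnot(is.data.frame(data), sq %in% names(data), name %in% names(data))
  attr(data, "roles") <- list(sq = sq, name = name)
  class(data) <- c("sqibble_demo", class(data))
  data
}

fasta_columns <- function(x, ...) UseMethod("fasta_columns")

# Plain data frames fall back to hardcoded column names.
fasta_columns.data.frame <- function(x, sq = "sq", name = "name", ...) {
  list(sq = x[[sq]], name = x[[name]])
}

# The sqibble-like method takes its defaults from the stored metadata and
# then delegates to the data frame method.
fasta_columns.sqibble_demo <- function(x,
                                       sq = attr(x, "roles")$sq,
                                       name = attr(x, "roles")$name,
                                       ...) {
  fasta_columns.data.frame(x, sq = sq, name = name, ...)
}

df <- data.frame(id = c("s1", "s2"), seq = c("ACGT", "TTGA"),
                 stringsAsFactors = FALSE)
sqb <- new_sqibble_demo(df, sq = "seq", name = "id")
cols <- fasta_columns(sqb)  # no need to pass sq = or name =
```

The caller can still override either column explicitly, so the metadata only changes the defaults.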

We'll implement your suggestion soon :)

PS. Actually there are more functions that take sq and name parameters, e.g. find_motifs(). I believe I'd like to implement something like you described for all these functions for consistency.

Thanks for the quick response! Since write_fasta will be just copied, e.g. to write_fasta.sq_dna_bsc, I do not see big issues here regarding backwards compatibility.

Additional thoughts:

  • Besides data.frame, there are also the tibble and grouped_df classes
  • One might have a data.frame with columns sq and name where the class of sq is just character, without alphabet annotation. That should be enough information to write a FASTA file. However, maybe a warning should be raised, e.g. to prevent accidentally writing a FASTA of the iris data frame. An error should be raised, though, if e.g. the sq column contains numbers in addition to IUPAC characters.
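On the first point, tibbles and grouped data frames carry "data.frame" at the end of their class vector, so S3 dispatch falls through to a data.frame method when no more specific method exists. The sketch below illustrates this with a hypothetical generic, `which_method`; the class vectors are attached by hand so it runs without tibble or dplyr.

```r
# Hypothetical generic, used only to show which method S3 picks.
which_method <- function(x, ...) UseMethod("which_method")
which_method.data.frame <- function(x, ...) "data.frame method"
which_method.default <- function(x, ...) "default method"

df <- data.frame(name = "s1", sq = "ACGT", stringsAsFactors = FALSE)

# Class vectors as set by tibble and dplyr, attached by hand here.
tbl <- structure(df, class = c("tbl_df", "tbl", "data.frame"))
grp <- structure(df, class = c("grouped_df", "tbl_df", "tbl", "data.frame"))

res_tbl <- which_method(tbl)
res_grp <- which_method(grp)
res_chr <- which_method("ACGT")
```

A dedicated grouped_df method would presumably only be needed if grouping should change the behaviour, which this thread does not propose.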

About the last part, I'd rather not allow writing FASTA from a character sq column. It's too risky, given that constructing an sq object explicitly is simply a matter of calling %>% mutate(sq = tidysq::sq(sq)) (side note: there's a name clash here that we'd like to change sometime in the future).
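Such a guard could look roughly like the sketch below. `assert_sq_column` is a hypothetical helper, not a tidysq function, and it assumes that tidysq sequence vectors inherit from a class named "sq"; only the idea of rejecting plain character columns comes from this thread.

```r
# Hypothetical helper; assumes tidysq sequence vectors inherit from class "sq".
assert_sq_column <- function(data, sq = "sq") {
  if (!sq %in% names(data)) {
    stop("column '", sq, "' not found", call. = FALSE)
  }
  if (!inherits(data[[sq]], "sq")) {
    stop("column '", sq, "' is not an sq object; convert it first, ",
         "e.g. with mutate(sq = tidysq::sq(sq))", call. = FALSE)
  }
  invisible(data)
}

# A plain character column is rejected instead of being written silently.
chr_df <- data.frame(name = "s1", sq = "ACGT", stringsAsFactors = FALSE)
chr_result <- tryCatch(
  assert_sq_column(chr_df),
  error = function(e) conditionMessage(e)
)

# So is a data frame that has no such column at all, like iris.
iris_result <- tryCatch(
  assert_sq_column(iris),
  error = function(e) conditionMessage(e)
)
```

Checking the class rather than the content keeps the guard cheap and leaves alphabet validation to the sq constructor.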

@danlooo -- changes merged into main!

You can download the development version from GitHub. We're probably not going to submit it to CRAN yet, because it's a lot of fuss and right now we're still discussing some low-level aspects of the package, so more changes will probably be applied in the following weeks...
Once again -- thanks for your great suggestion! We're happy that somebody is using our package ^^

@ErdaradunGaztea -- actually, I think that since we've introduced the concept of the sqibble into the code, maybe it would be a good idea to also adjust find_invalid_letters...? I know this function is not meant to be commonly used, but a unified interface would be cool. Leaving this as a note; I'll possibly turn it into an issue.