BioPandas/biopandas

Stream support for exporting pdbs

djberenberg opened this issue · 6 comments

Describe the workflow you want to enable

I'd like to be able to export a PDB to a stream instead of to disk. In particular, I want to pass the stream directly to wandb.Molecule.

Describe your proposed solution

The PandasPdb.to_pdb method could accept a path_or_stream: typing.Union[io.StringIO, str] argument instead of just a path: str argument. Internally, if path_or_stream happens to be an io.StringIO object, we don't need an openf function and instead can just execute the internal loops seen here, where f is now the io.StringIO object.

Making this change would make it possible to fill the stream with the PDB text in place.
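
To make the proposed dispatch concrete, here is a rough sketch using only the standard library. It is not biopandas code: write_lines and the demo lines are hypothetical names made up for illustration, and the real to_pdb method would format the records itself before writing.

```python
import io
import os
import tempfile
from typing import Iterable, Union


def write_lines(lines: Iterable[str], path_or_stream: Union[str, io.TextIOBase]) -> None:
    """Hypothetical helper: write lines to a file path or to an open text stream."""
    if isinstance(path_or_stream, str):
        # A path was given: open the file ourselves and close it when done.
        with open(path_or_stream, "w") as f:
            for line in lines:
                f.write(line + "\n")
    else:
        # A stream was given: write into it and leave it open for the caller.
        for line in lines:
            path_or_stream.write(line + "\n")


demo_lines = ["HEADER    DEMO", "END"]

# Stream target: nothing touches the disk.
buffer = io.StringIO()
write_lines(demo_lines, buffer)
stream_text = buffer.getvalue()

# Path target: the same content ends up in a file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pdb")
    write_lines(demo_lines, path)
    with open(path) as f:
        file_text = f.read()
```

The stream branch deliberately does not close the stream, since the caller created it and presumably still wants to read from it.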

Describe alternatives you've considered, if relevant

Currently I am needlessly writing to disk temporarily, reopening the file, and passing its contents to the wandb.Molecule object.
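
For reference, the temporary-file round trip looks roughly like the sketch below. export_to_path is a hypothetical stand-in for any exporter that only accepts a file path, and the file content is placeholder text, not real output from biopandas.

```python
import os
import tempfile


def export_to_path(path: str) -> None:
    """Hypothetical stand-in for an exporter that can only write to a file path."""
    with open(path, "w") as f:
        f.write("HEADER    DEMO\nEND\n")


# Workaround: write to a temporary file, reopen it, and read the text back.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = os.path.join(tmp, "structure.pdb")
    export_to_path(tmp_path)
    with open(tmp_path) as f:
        pdb_text = f.read()

# The temporary directory and the file in it are gone at this point;
# pdb_text is what would then be handed to the consumer.
```

With stream support, the write, reopen, and cleanup steps would all disappear.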

Additional context

a-r-j commented

Hey @djberenberg I've actually done this already. Code to follow once I find it :) I agree this would be a nice feature for biopandas

a-r-j commented

Here you go:

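from io import StringIO

import pandas as pd
# Assumption: pdb_records is the record-format table from biopandas' PDB engines module.
from biopandas.pdb.engines import pdb_records


def to_pdb_stream(df: pd.DataFrame) -> StringIO: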
    """Writes a PDB dataframe to a stream.

    :param df: PDB dataframe
    :type df: pandas.DataFrame
    :return: StringIO Buffer
    :rtype: StringIO
    """

    df = df.copy().drop(columns=["model_id"])
    df.residue_number = df.residue_number.astype(int)
    records = [r.strip() for r in list(set(df.record_name))]
    dfs = {r: df.loc[df.record_name == r] for r in records}

    for r in dfs:
        for col in pdb_records[r]:
            dfs[r][col["id"]] = dfs[r][col["id"]].apply(col["strf"])
            dfs[r]["OUT"] = pd.Series("", index=dfs[r].index)

        for c in dfs[r].columns:
            # fix issue where coordinates with four or more digits would
            # cause issues because the columns become too wide
            if c in {"x_coord", "y_coord", "z_coord"}:
                for idx in range(dfs[r][c].values.shape[0]):
                    if len(dfs[r][c].values[idx]) > 8:
                        dfs[r][c].values[idx] = str(
                            dfs[r][c].values[idx]).strip()

            if c not in {"line_idx", "OUT"}:
                dfs[r]["OUT"] = dfs[r]["OUT"] + dfs[r][c]

    df = pd.concat(dfs, sort=False)
    df.sort_values(by="line_idx", inplace=True)

    output = StringIO()
    # Pad every line with trailing spaces to the fixed 80-column PDB width.
    s = [line.ljust(80) for line in df["OUT"].tolist()]
    to_write = "\n".join(s)
    output.write(to_write)
    output.write("\n")
    return output

djberenberg commented

Thank you @a-r-j !!!

rasbt commented

Wow, thanks @a-r-j . If this works for you @djberenberg , it'd be great to add this to biopandas as a PR :)

a-r-j commented

Sure @rasbt , I'll add it to the open PR once I've got a moment.

djberenberg commented

@rasbt @a-r-j Works for me. The only changes I made were to drop "model_id" conditionally, since that column might not be present, and to add output.seek(0) before returning.
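
A small standard-library illustration of why the seek(0) matters: after writing, the stream position sits at the end of the buffer, so a consumer that calls read() gets nothing back unless the stream is rewound first. The ATOM line here is placeholder text.

```python
from io import StringIO

output = StringIO()
output.write("ATOM      1\n")

# After writing, the position is at the end of the buffer,
# so a plain read() returns an empty string.
text_without_seek = output.read()

# Rewinding to the start makes the full content readable.
output.seek(0)
text_after_seek = output.read()

# getvalue() returns the whole buffer regardless of the position.
full_text = output.getvalue()
```

For the conditional drop, one option is pandas' DataFrame.drop with errors="ignore", which skips column labels that are not present instead of raising.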