Stream support for exporting pdbs
djberenberg opened this issue · 6 comments
Describe the workflow you want to enable
I'd like to export a PDB to a stream instead of to disk, so that I can pass the stream directly to wandb.Molecule.
Describe your proposed solution
The PandasPdb.to_pdb method could accept a path_or_stream: typing.Union[io.StringIO, str] argument instead of just a path: str argument. Internally, if path_or_stream happens to be an io.StringIO object, we don't need an openf function and can instead just execute the internal loops seen here, where f is now the io.StringIO object.
Making this change would allow the stream to be filled in place with the PDB text.
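The dispatch described above could look roughly like the following. This is a minimal sketch, not the actual biopandas implementation: `write_records` and `_write_lines` are hypothetical names, and the lines are assumed to be already formatted PDB records.

```python
import io
from typing import Iterable, TextIO, Union


def _write_lines(lines: Iterable[str], f: TextIO) -> None:
    """Write each record padded to 80 columns, one per line."""
    for line in lines:
        f.write(line.ljust(80) + "\n")


def write_records(
    lines: Iterable[str], path_or_stream: Union[io.StringIO, str]
) -> None:
    """Write pre-formatted PDB lines to a file path or an in-memory stream.

    Hypothetical helper illustrating the proposed `path_or_stream` argument.
    """
    if isinstance(path_or_stream, io.StringIO):
        # The caller owns the stream: fill it in place and leave it open.
        _write_lines(lines, path_or_stream)
    else:
        # Plain path: open the file ourselves and close it when done.
        with open(path_or_stream, "w") as f:
            _write_lines(lines, f)
```

With this shape the same inner loop serves both cases, and only the code that obtains `f` differs.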
Describe alternatives you've considered, if relevant
Currently I am needlessly writing to disk temporarily, reopening the file, and passing its contents to the wandb.Molecule object.
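For reference, the round trip through disk looks roughly like this. It is a sketch only: `export_via_tempfile` is a hypothetical helper, and `write_to_path` stands in for any exporter that only accepts a file path.

```python
import os
import tempfile
from typing import Callable


def export_via_tempfile(write_to_path: Callable[[str], None]) -> str:
    """Run a path-only exporter against a temporary file and return its text."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "structure.pdb")
        # Step 1: let the exporter write to disk.
        write_to_path(path)
        # Step 2: reopen the file and read the contents back into memory.
        with open(path) as f:
            return f.read()
```

A stream-based export would remove both the temporary file and the second read.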
Additional context
Hey @djberenberg I've actually done this already. Code to follow once I find it :) I agree this would be a nice feature for biopandas
Here you go:
from io import StringIO

import pandas as pd
# pdb_records holds the per-record column formatting specs used by
# PandasPdb.to_pdb (import path assumed from the biopandas source layout).
from biopandas.pdb.engines import pdb_records


def to_pdb_stream(df: pd.DataFrame) -> StringIO:
    """Writes a PDB dataframe to a stream.

    :param df: PDB dataframe
    :type df: pandas.DataFrame
    :return: StringIO Buffer
    :rtype: StringIO
    """
    df = df.copy().drop(columns=["model_id"])
    df.residue_number = df.residue_number.astype(int)
    records = [r.strip() for r in list(set(df.record_name))]
    dfs = {r: df.loc[df.record_name == r] for r in records}
    for r in dfs:
        # format every column of this record type as a fixed-width string
        for col in pdb_records[r]:
            dfs[r][col["id"]] = dfs[r][col["id"]].apply(col["strf"])
            dfs[r]["OUT"] = pd.Series("", index=dfs[r].index)
        for c in dfs[r].columns:
            # fix issue where coordinates with four or more digits would
            # cause issues because the columns become too wide
            if c in {"x_coord", "y_coord", "z_coord"}:
                for idx in range(dfs[r][c].values.shape[0]):
                    if len(dfs[r][c].values[idx]) > 8:
                        dfs[r][c].values[idx] = str(
                            dfs[r][c].values[idx]).strip()
            if c not in {"line_idx", "OUT"}:
                dfs[r]["OUT"] = dfs[r]["OUT"] + dfs[r][c]
    df = pd.concat(dfs, sort=False)
    df.sort_values(by="line_idx", inplace=True)
    output = StringIO()
    s = df["OUT"].tolist()
    # pad every line to the 80-column PDB record width
    for idx in range(len(s)):
        if len(s[idx]) < 80:
            s[idx] = f"{s[idx]}{' ' * (80 - len(s[idx]))}"
    to_write = "\n".join(s)
    output.write(to_write)
    output.write("\n")
    return output
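One thing to keep in mind when consuming the returned buffer: after the writes, the stream's cursor sits at the end, so a plain `read()` returns an empty string. The sketch below uses a stand-in buffer with made-up contents instead of a real call to `to_pdb_stream`. Whether a given consumer such as wandb.Molecule rewinds the stream itself is not confirmed here.

```python
from io import StringIO

# Stand-in for the buffer returned by to_pdb_stream (hypothetical contents).
buf = StringIO()
buf.write("HEADER    EXAMPLE".ljust(80) + "\n")

# The cursor is at the end after writing, so read() yields nothing.
at_end = buf.read()

# getvalue() returns the full contents regardless of the cursor position.
text = buf.getvalue()

# Rewind before handing the buffer to a consumer that calls read().
buf.seek(0)
first_line = buf.readline()
```

Calling `seek(0)` before passing the buffer on, or passing `getvalue()` where a string is accepted, avoids the empty read.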
Thank you @a-r-j !!!
Wow, thanks @a-r-j. If this works for you, @djberenberg, it'd be great to add this to biopandas as a PR :)