aspuru-guzik-group/group-selfies

Request for Adding Support for **Monoatomic Groups** in `GroupGrammar`


Hello, thanks for your wonderful work!

I would like to request adding support for monoatomic groups in the GroupGrammar class.

Currently, the system effectively handles multi-atom groups by converting them to GroupSELFIES, but monoatomic groups (e.g., [C], [F], [N]) are treated separately as individual tokens outside the GroupGrammar vocabulary. This leads to challenges when attempting to extract the full molecular connectivity between groups and individual atoms.
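
To make the split concrete, here is a minimal sketch that sorts the tokens of the example string from this issue into group tokens, structural tokens, and plain atom tokens. It uses only the standard library, not the group_selfies API. The token shapes it assumes (a group token looks like `[:<index><name>]`, structural tokens are Branch/Ring/pop) are read off the example string and may not cover every token the library can emit.

```python
import re

# GroupSELFIES string copied from the example in this issue.
GROUP_SELFIES = (
    "[C][:2frag65][=Branch][:0frag68][Ring1][:5frag66]"
    "[#Branch][F][pop][pop][pop][#Branch][C][pop]"
)

TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")
# Assumed shape of a group token: optional bond symbol, ':', start index, name.
GROUP_RE = re.compile(r"^[=#]?:(\d+)(\w+)$")
# Assumed structural tokens: Branch, Ring<n>, and pop, with optional bond symbol.
STRUCT_RE = re.compile(r"^[=#]?(Branch|Ring\d*|pop)$")


def classify(selfies):
    """Split a GroupSELFIES string into (token, kind) pairs."""
    result = []
    for body in TOKEN_RE.findall(selfies):
        if GROUP_RE.match(body):
            kind = "group"
        elif STRUCT_RE.match(body):
            kind = "structural"
        else:
            kind = "atom"  # anything else is treated as a plain atom token
        result.append((f"[{body}]", kind))
    return result


classified = classify(GROUP_SELFIES)
atom_tokens = [tok for tok, kind in classified if kind == "atom"]
group_tokens = [tok for tok, kind in classified if kind == "group"]
print(atom_tokens)   # the tokens that have no group in the vocabulary
print(group_tokens)
```

The atom tokens found this way are exactly the ones this request would like to see represented as groups.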

Required Feature:

  1. Introduce monoatomic groups (e.g., fragC, fragF, fragN, etc.) to GroupGrammar.vocab so that atoms like ["B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I", "Li", "Na", "K", "Rb", "Cs", "Fr", "Be", "Mg", "Ca", "Sr", "Ba", "Ra"] can also be processed as groups.
  2. Allow these monoatomic groups to be added dynamically or to be included in the essential grammar set, similar to how groups like frag65, frag66, etc., are treated.
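
As a rough illustration of the naming scheme only, the sketch below derives one hypothetical group name per element and merges the names into a plain dict standing in for GroupGrammar.vocab. The helper names `monoatomic_group_names` and `merge_names` are made up for this issue, and the sketch deliberately leaves out the actual group pattern (attachment points) for each atom, since that is the part that would need to be designed.

```python
# Element list copied from this request.
ELEMENTS = [
    "B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I",
    "Li", "Na", "K", "Rb", "Cs", "Fr",
    "Be", "Mg", "Ca", "Sr", "Ba", "Ra",
]


def monoatomic_group_names(elements, prefix="frag"):
    """Map a hypothetical group name (e.g. 'fragC') to its element symbol."""
    return {f"{prefix}{el}": el for el in elements}


def merge_names(existing, extra):
    """Add new names to a vocabulary-like dict, refusing to overwrite entries."""
    clashes = sorted(set(existing) & set(extra))
    if clashes:
        raise ValueError(f"group names already in use: {clashes}")
    return {**existing, **extra}


# Stand-in for names already present in GroupGrammar.vocab (values omitted).
existing = {"frag65": None, "frag66": None, "frag68": None}
vocab_names = merge_names(existing, monoatomic_group_names(ELEMENTS))
print(len(vocab_names))
```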

Motivation:

The main issue arises when trying to extract the molecular connectivity between the subgraphs represented by GroupSELFIES. GroupSELFIES, in essence, represents the original molecular graph by grouping atoms into subgraphs (i.e., groups). The connectivity between group tokens is clearly defined, but for monoatomic tokens like [C], the connectivity remains unclear. This inconsistency makes it difficult to extract subgraph-to-subgraph connectivity in a unified way.

Adding support for monoatomic groups would allow all atoms, even single atoms like [C] and [F], to be treated as subgraphs, ensuring that the connections between subgraphs can be easily traced and understood.

Example:

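Here is an example. In the GroupSELFIES string below, the single atoms [C] and [F] appear as plain tokens, separate from the defined groups. Ideally, they should be included as monoatomic groups within GroupGrammar.vocab so that their connectivity is explicit. The ATOMS / BONDS / GROUPS listing that follows illustrates the desired result; the trailing markers # 1 to # 9 are referenced in the notes after the listing: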

smiles: Cc1ccc(NC(=O)c2ccc(COc3ccc(F)cc3)o2)c(C)c1
GroupSELFIES: [C][:2frag65][=Branch][:0frag68][Ring1][:5frag66][#Branch][F][pop][pop][pop][#Branch][C][pop]
ATOMS
0 C 1/4 bonds filled group_tag=(3, 0)  # 1
1 C 4/4 bonds filled group_tag=(0, 8)
2 C 3/4 bonds filled group_tag=(0, 6)
3 C 3/4 bonds filled group_tag=(0, 4)
4 C 4/4 bonds filled group_tag=(0, 3)
5 N 2/3 bonds filled group_tag=(0, 2)
6 C 4/4 bonds filled group_tag=(0, 1)
7 O 2/2 bonds filled group_tag=(0, 0)
8 C 4/4 bonds filled group_tag=(2, 1)
9 C 3/4 bonds filled group_tag=(2, 0)
10 C 3/4 bonds filled group_tag=(2, 6)
11 C 4/4 bonds filled group_tag=(2, 4)
12 C 2/4 bonds filled group_tag=(1, 12)
13 O 2/2 bonds filled group_tag=(1, 0)
14 C 4/4 bonds filled group_tag=(1, 1)
15 C 3/4 bonds filled group_tag=(1, 2)
16 C 3/4 bonds filled group_tag=(1, 4)
17 C 4/4 bonds filled group_tag=(1, 6)
18 F 1/1 bonds filled group_tag=(4, 0)  # 2
19 C 3/4 bonds filled group_tag=(1, 8)
20 C 3/4 bonds filled group_tag=(1, 10)
21 O 2/2 bonds filled group_tag=(2, 3)
22 C 4/4 bonds filled group_tag=(0, 12)
23 C 1/4 bonds filled group_tag=(5, 0)  # 3
24 C 3/4 bonds filled group_tag=(0, 10)

BONDS
0 -> 1 order=1 group_idxs [0, 3]  # 4
1 -> 2 order=2 group_idxs []
2 -> 3 order=1 group_idxs []
3 -> 4 order=2 group_idxs []
4 -> 5 order=1 group_idxs []
4 -> 22 order=1 group_idxs []
5 -> 6 order=1 group_idxs []
6 -> 7 order=2 group_idxs []
6 -> 8 order=1 group_idxs [0, 2]
8 -> 9 order=2 group_idxs []
9 -> 10 order=1 group_idxs []
10 -> 11 order=2 group_idxs []
11 -> 12 order=1 group_idxs [1, 2]
11 -> 21 order=1 group_idxs []
12 -> 13 order=1 group_idxs []
13 -> 14 order=1 group_idxs []
14 -> 15 order=2 group_idxs []
15 -> 16 order=1 group_idxs []
16 -> 17 order=2 group_idxs []
17 -> 18 order=1 group_idxs [1, 4]  # 5
17 -> 19 order=1 group_idxs []
19 -> 20 order=2 group_idxs []
20 -> 14 order=1 group_idxs []
21 -> 8 order=1 group_idxs []
22 -> 23 order=1 group_idxs [0, 5]  # 6
22 -> 24 order=2 group_idxs []
24 -> 1 order=1 group_idxs []

GROUPS
<Group frag65 O=C(N(C1=C(*1)C(*1)=C(*1)C(*1)=C1*1)*1)*1>
<Group frag66 O(C1=C(*1)C(*1)=C(*1)C(*1)=C1*1)C(*1)(*1)*1>
<Group frag68 C1=C(*1)OC(*1)=C1*1>
<Group C ??>  # 7
<Group F ??>  # 8
<Group C ??>  # 9

In this example:

  • Markers # 1, # 2, and # 3 show single atoms ([C], [F], and [C]) each carrying its own group_tag, i.e., being treated as a group. This is the behavior we want to implement.
  • As a result, the bonds at markers # 4, # 5, and # 6, which join a monoatomic group to a neighboring group, are also recorded as group-to-group connections in group_idxs.
  • Markers # 7, # 8, and # 9 show the corresponding single-atom groups listed under GROUPS.

To this end:

  1. The monoatomic groups must be defined in GroupGrammar.vocab.
  2. When converting a graph to GroupSELFIES, it must be possible to match monoatomic groups with group tokens.
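
Once every atom carries a group index, the subgraph-to-subgraph connectivity can be read off in a single pass over the bonds. The sketch below does this for the listing above using plain Python data rather than the library's graph objects. The group index per atom is the first element of each group_tag, and the mapping from group index to the order of the GROUPS listing is an assumption on my part.

```python
# First element of group_tag for atoms 0..24, copied from the ATOMS listing.
ATOM_GROUP = [3, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 1,
              1, 1, 1, 1, 1, 4, 1, 1, 2, 0, 5, 0]

# (begin, end) atom indices copied from the BONDS listing.
BONDS = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 22), (5, 6), (6, 7), (6, 8),
    (8, 9), (9, 10), (10, 11), (11, 12), (11, 21), (12, 13), (13, 14),
    (14, 15), (15, 16), (16, 17), (17, 18), (17, 19), (19, 20), (20, 14),
    (21, 8), (22, 23), (22, 24), (24, 1),
]

# Assumed to follow the order of the GROUPS listing.
GROUP_NAMES = ["frag65", "frag66", "frag68", "C", "F", "C"]


def group_edges(atom_group, bonds):
    """Return sorted (group_a, group_b) pairs for bonds that cross groups."""
    edges = set()
    for a, b in bonds:
        ga, gb = atom_group[a], atom_group[b]
        if ga != gb:
            edges.add((min(ga, gb), max(ga, gb)))
    return sorted(edges)


def adjacency(edges, n_groups):
    """Build a group-level adjacency list from undirected edges."""
    adj = {g: [] for g in range(n_groups)}
    for ga, gb in edges:
        adj[ga].append(gb)
        adj[gb].append(ga)
    return {g: sorted(nbrs) for g, nbrs in adj.items()}


edges = group_edges(ATOM_GROUP, BONDS)
adj = adjacency(edges, len(GROUP_NAMES))
for g, nbrs in adj.items():
    print(g, GROUP_NAMES[g], "->", [GROUP_NAMES[n] for n in nbrs])
```

The edges recovered this way are the same pairs that appear as non-empty group_idxs in the BONDS listing, which is only possible because the single atoms have group indices 3, 4, and 5.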

Conclusion:

By adding support for monoatomic groups, the molecular connectivity between all subgraphs (whether complex groups or individual atoms) can be traced uniformly, greatly simplifying tasks such as graph extraction, reconstruction, and representation.

Thank you for considering this request! Looking forward to your feedback.