elttaes/Revisiting-PLMs

Metal ion binding dataset

empyriumz opened this issue · 19 comments

Hi there,

Nice work!
I have a question about the metal ion binding dataset used in your paper.
Could you let me know where you got the original dataset?

Thanks!

Hi empyriumz,

The metal ion binding dataset was collected from the PDB (https://www.rcsb.org/). If a protein has any metal ion binding site, we set its label to 1.

Thanks for your reply!
To clarify, I tried searching PDB for metal ion binding (screenshots of the two queries omitted); both searches return 87,669 entries.
Did you perform similar queries to compile the dataset?

We wrote a crawler to collect the annotations of each PDB protein. Do you need the original dataset we collected?

By original dataset, do you mean all the PDB files? I guess that would be too large, so could you share the script used to search and annotate the PDB entries?
Thanks!

I am sorry, but the classmates who wrote the crawler are not on the author list and are unwilling to share it with us. They now have jobs and will release the relevant dataset themselves; I can notify you after their paper is released.

But I can give you a simple snippet that checks whether a page contains the keyword. It may help you.

import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.rcsb.org/annotations/2XEV'
req = urllib.request.Request(url=url)
content = urllib.request.urlopen(req).read()
content = content.decode('utf-8')
soup = BeautifulSoup(content, "html.parser")
# Collect every text node that exactly matches the keyword
tag = soup.find_all(string='metal ion binding')

If the page does not contain 'metal ion binding', find_all returns an empty list.
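If you want to batch-label many entries without the bs4 dependency, the same keyword check can be sketched with only the standard library. This is my own stand-in, not the paper's crawler; the function names are hypothetical, but the labeling rule (1 if the keyword appears, else 0) is the one described above.

```python
from html.parser import HTMLParser

class KeywordFinder(HTMLParser):
    """Scan the text nodes of an HTML page for a keyword."""
    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.found = False

    def handle_data(self, data):
        if self.keyword in data:
            self.found = True

def label_html(html, keyword='metal ion binding'):
    """Return 1 if the page mentions the keyword, else 0 (the labeling rule above)."""
    finder = KeywordFinder(keyword)
    finder.feed(html)
    return 1 if finder.found else 0

print(label_html('<ul><li>metal ion binding</li></ul>'))  # 1
print(label_html('<p>no annotations here</p>'))           # 0
```

Fetching the page itself can be done with urllib.request as in the snippet above; this only replaces the parsing step.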

Hi, I am trying to use your metal AlphaFold code to predict other protein features, but I see that your code takes a pkl file as input. How do you generate the pkl files? Thanks!

Hi Violet969,

The pkl contains the MSA and template information. For the related code, see https://github.com/deepmind/alphafold/blob/main/run_alphafold.py, lines 172-174: data_pipeline.process takes a FASTA file as input and returns the MSA, templates, and the pkl features.

feature_dict = data_pipeline.process(
    input_fasta_path=fasta_path,
    msa_output_dir=msa_output_dir)

For the detailed contents of the pkl, see pages 8-9 of the AlphaFold paper's supplementary information.

I have already released the MSA on https://drive.google.com/drive/folders/1iShEW8NcMIlWqxTRgsEaI_t5ahoHsixt?usp=share_link

To generate the pkl yourself, you may need to modify run_alphafold.py a little. I can upload this part of the preprocessing code later.
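For reference, run_alphafold.py saves the feature dictionary returned by data_pipeline.process with the pickle module. A minimal sketch of writing and reading such a file (the placeholder dictionary contents here are mine; the real feature_dict holds MSA and template feature arrays):

```python
import os
import pickle
import tempfile

# Placeholder stand-in for the feature_dict returned by data_pipeline.process;
# the real dictionary contains MSA and template feature arrays.
feature_dict = {'sequence': b'MKTAYIAK', 'num_alignments': 123}

path = os.path.join(tempfile.gettempdir(), 'features.pkl')
with open(path, 'wb') as f:
    pickle.dump(feature_dict, f, protocol=4)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == feature_dict)  # True
```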

I see, thanks for the sample code! I'll check whether the results match my earlier query.

Thanks for your answer. I also have a question: I saw that you use Evoformer and ESM to predict protein secondary structure, but I don't see that code. Will you share it?

Sure, I will upload this part of the code later.

Hi Violet969,
The secondary structure code and the code that generates a pkl from an a3m file have been uploaded to the Structure and Data folders. If you have any questions, you can contact me.

I see, thanks for the answer. I used merge_msa.py but it didn't work; can you show me an example of how to use it?

Hi,
I have added an example; you can have a look at the latest code.

Thanks for your answer. I also want an example of how to run metal/alphafold/train.py. Can you share that?

You should now be able to run train.py directly with a few simple modifications. Please make sure you have configured the AlphaFold runtime environment.

In addition, the current AlphaFold parameter format seems to differ from the earlier one; you may need to find the previously published parameter file.

Thanks for your reply. I tried to run train.py on my server, but I always get this error:

2023-01-01 07:07:23.507834: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2130] Execution of replica 0 failed: INTERNAL: Failed to allocate 50331648 bytes for new constant
Traceback (most recent call last):
File "train.py", line 264, in
app.run(main)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "train.py", line 216, in main
state = jax.pmap(updater.init)(rng_pmap, data)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/api.py", line 2158, in cache_miss
out_tree, out_flat = f_pmapped(*args, **kwargs)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/api.py", line 2034, in pmap_f
out = pxla.xla_pmap(
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2022, in bind
return map_bind(self, fun, *args, **params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2054, in map_bind
outs = primitive.process(top_trace, fun, tracers, params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2025, in process
return trace.process_map(self, fun, tracers, params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 687, in process_call
return primitive.impl(f, *tracers, **params)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 841, in xla_pmap_impl
return compiled_fun(*args)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/profiler.py", line 294, in wrapper
return func(*args, **kwargs)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 1656, in call
out_bufs = self.xla_executable.execute_sharded_on_local_devices(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to allocate 50331648 bytes for new constant: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "train.py", line 264, in
app.run(main)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "train.py", line 216, in main
state = jax.pmap(updater.init)(rng_pmap, data)
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to allocate 50331648 bytes for new constant: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).

I have 8 GPUs with 12 GB of memory each, and 125 GB of RAM. Can you tell me how to solve it?

I tested this code on an A40 (48 GB) server and it works.
You can try setting os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '2' or lower to reduce memory usage.
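A note on where to put this: XLA reads these variables when JAX initializes, so they must be set before jax is imported. A sketch, using the value '2' suggested in this thread (XLA_PYTHON_CLIENT_PREALLOCATE is a related knob I'm adding as an assumption, not something mentioned here):

```python
import os

# Must be set before importing jax; '2' is the value suggested in this thread.
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '2'
# Related knob (assumption): allocate GPU memory on demand instead of up front.
os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = 'false'

# import jax  # only import jax AFTER the environment is configured
```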

Thanks for your fast reply; os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '2' works.
But I hit another error:

Traceback (most recent call last):
  File "train.py", line 269, in <module>
    app.run(main)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "train.py", line 233, in main
    state, metrics = updater.update(state, data)
  File "train.py", line 176, in update
    if step % self._checkpoint_every_n == 0:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Can you tell me how to solve it?

Delete the './tmp' folder.
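For context on the ValueError above: under jax.pmap the training state is replicated across devices, so state.step can be an array with one entry per device rather than a Python int, and `if step % n == 0:` on such an array raises exactly this error. A minimal sketch of the failure and one possible fix, using a NumPy array as a stand-in for the replicated step (the function name is hypothetical, not from train.py):

```python
import numpy as np

CHECKPOINT_EVERY_N = 100

# Under jax.pmap the training step is replicated across devices,
# e.g. an array of shape (num_devices,) rather than a Python int.
replicated_step = np.array([200] * 8)

# `if replicated_step % CHECKPOINT_EVERY_N == 0:` raises
# "ValueError: The truth value of an array with more than one element is ambiguous."

def should_checkpoint(step):
    """Reduce the replicated step to a scalar before the modulo test."""
    step = int(np.asarray(step).reshape(-1)[0])  # take the value from the first replica
    return step % CHECKPOINT_EVERY_N == 0

print(should_checkpoint(replicated_step))  # True
```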