This code manages the process of running AlphaFold predictions on protein sequences: it handles input data, template options, MSA modes, and other advanced settings to customize the predictions, and saves the results. Let’s break it down step by step:
This line creates a boolean variable `display_images` that can be toggled through Colab’s form widget. It determines whether images are displayed during execution of the code.
```python
display_images = True #@param {type:"boolean"}
```
This section imports various Python modules and classes required for the rest of the code. It includes modules for handling warnings, working with file paths, downloading AlphaFold parameters, setting up logging, managing batches of protein sequences, and plotting Multiple Sequence Alignments (MSAs).
```python
import sys
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from Bio import BiopythonDeprecationWarning
warnings.simplefilter(action='ignore', category=BiopythonDeprecationWarning)
from pathlib import Path
from colabfold.download import download_alphafold_params, default_data_dir
from colabfold.utils import setup_logging
from colabfold.batch import get_queries, run, set_model_type
from colabfold.plot import plot_msa_v2
```
This section checks whether a specific GPU type (Tesla K80) is present. If it is found, it prints a warning and removes certain memory-management environment variables, since unified memory does not work well on that card.
```python
import os  # used here, so it must be imported before this block

try:
    K80_chk = os.popen('nvidia-smi | grep "Tesla K80" | wc -l').read()
except Exception:
    K80_chk = "0"

if "1" in K80_chk:
    print("WARNING: found GPU Tesla K80: limited to total length < 1000")
    if "TF_FORCE_UNIFIED_MEMORY" in os.environ:
        del os.environ["TF_FORCE_UNIFIED_MEMORY"]
    if "XLA_PYTHON_CLIENT_MEM_FRACTION" in os.environ:
        del os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]
```
Here, additional functions and classes are imported for protein visualization, working with file paths, and plotting using Matplotlib.
```python
from colabfold.colabfold import plot_protein
from pathlib import Path
import matplotlib.pyplot as plt
```
## 5. User Input Section

Here the user is prompted to provide input: a protein sequence, a job name, the number of top-ranked models to relax, and a template mode. The parameters are set through Colab form annotations (`#@param`). Note that `query_sequence` is not important as a parameter here, since the user will later be prompted to upload their own sequence files.
```python
from google.colab import files
import os
import re
import hashlib
import random
from sys import version_info

python_version = f"{version_info.major}.{version_info.minor}"

def add_hash(x, y):
    return x + "_" + hashlib.sha1(y.encode()).hexdigest()[:5]

query_sequence = 'PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK' #@param {type:"string"}
#@markdown - Use `:` to specify inter-protein chainbreaks for **modeling complexes** (supports homo- and hetero-oligomers). For example **PI...SK:PI...SK** for a homodimer
jobname = 'test' #@param {type:"string"}
# number of top-ranked models to relax
num_relax = 0 #@param [0, 1, 5] {type:"raw"}
#@markdown - specify how many of the top ranked structures to relax using amber
template_mode = "none" #@param ["none", "pdb70", "custom"]
#@markdown - `none` = no template information is used. `pdb70` = detect templates in pdb70. `custom` = upload and search own templates (PDB or mmCIF format, see [notes below](#custom_templates))
use_amber = num_relax > 0

# remove whitespace
query_sequence = "".join(query_sequence.split())
basejobname = "".join(jobname.split())
basejobname = re.sub(r'\W+', '', basejobname)
jobname = add_hash(basejobname, query_sequence)

# check whether a directory with this jobname already exists
def check(folder):
    return not os.path.exists(folder)

if not check(jobname):
    n = 0
    while not check(f"{jobname}_{n}"):
        n += 1
    jobname = f"{jobname}_{n}"

# make directory to save results
os.makedirs(jobname, exist_ok=True)

# save queries
queries_path = os.path.join(jobname, f"{jobname}.csv")
with open(queries_path, "w") as text_file:
    text_file.write(f"id,sequence\n{jobname},{query_sequence}")

if template_mode == "pdb70":
    use_templates = True
    custom_template_path = None
elif template_mode == "custom":
    custom_template_path = os.path.join(jobname, "template")
    os.makedirs(custom_template_path, exist_ok=True)
    uploaded = files.upload()
    use_templates = True
    for fn in uploaded.keys():
        os.rename(fn, os.path.join(custom_template_path, fn))
else:
    custom_template_path = None
    use_templates = False
```
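For illustration, the template branching can be reduced to a small pure function mapping `template_mode` to the pair (`use_templates`, `custom_template_path`). The helper name below is ours, not ColabFold’s, and it omits the interactive upload step:

```python
import os

def resolve_templates(template_mode, jobname):
    # Mirrors the notebook's branching: "pdb70" searches public templates,
    # "custom" expects user-uploaded files under <jobname>/template,
    # anything else disables template usage entirely.
    if template_mode == "pdb70":
        return True, None
    if template_mode == "custom":
        return True, os.path.join(jobname, "template")
    return False, None

print(resolve_templates("pdb70", "job"))   # (True, None)
print(resolve_templates("none", "job"))    # (False, None)
```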
When Amber relaxation is enabled (`use_amber`), this part adds the system site-packages directory to `sys.path` so that the packages required for relaxation can be imported.
```python
if use_amber and f"/usr/local/lib/python{python_version}/site-packages/" not in sys.path:
    sys.path.insert(0, f"/usr/local/lib/python{python_version}/site-packages/")
```
The input protein sequence is cleaned of whitespace characters. The job name is cleaned up and hashed with the sequence to create a unique identifier for the job.
```python
query_sequence = "".join(query_sequence.split())
basejobname = "".join(jobname.split())
basejobname = re.sub(r'\W+', '', basejobname)
jobname = add_hash(basejobname, query_sequence)
```
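Because `add_hash` appends the first five hex characters of the SHA-1 digest of the sequence, the identifier is short but deterministic: the same jobname/sequence pair always yields the same name. A standalone sketch:

```python
import hashlib

def add_hash(x, y):
    # Append the first 5 hex characters of SHA-1(sequence) to the jobname
    return x + "_" + hashlib.sha1(y.encode()).hexdigest()[:5]

seq = "PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK"
print(add_hash("test", seq))  # "test_" followed by a 5-character hex suffix
```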
A directory is created with the job name to save results. If a directory with the same name already exists, a new name with an appended number is generated to avoid overwriting existing data.
```python
if not check(jobname):
    n = 0
    while not check(f"{jobname}_{n}"):
        n += 1
    jobname = f"{jobname}_{n}"
os.makedirs(jobname, exist_ok=True)
```
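The collision-avoidance logic can be exercised on its own; this sketch runs it inside a temporary directory rather than the Colab working directory (the helper name is ours):

```python
import os
import tempfile

def unique_name(base):
    # Return base if unused, otherwise base_0, base_1, ... until free
    if not os.path.exists(base):
        return base
    n = 0
    while os.path.exists(f"{base}_{n}"):
        n += 1
    return f"{base}_{n}"

with tempfile.TemporaryDirectory() as tmp:
    first = unique_name(os.path.join(tmp, "job"))
    os.makedirs(first)
    second = unique_name(os.path.join(tmp, "job"))
    print(os.path.basename(first), os.path.basename(second))  # job job_0
```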
The protein sequence is saved to a CSV file within the job directory.
```python
queries_path = os.path.join(jobname, f"{jobname}.csv")
with open(queries_path, "w") as text_file:
    text_file.write(f"id,sequence\n{jobname},{query_sequence}")
```
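The file is a two-column table, one row per query; for complexes the `:` chain separators stay inside the sequence field. Reading it back with the standard library shows the layout (the id below is a hypothetical example of the hashed jobname):

```python
import csv
import io

csv_text = "id,sequence\ntest_1a2b3,PIAQIHILEGR:PIAQIHILEGR"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["id"])        # test_1a2b3
print(rows[0]["sequence"])  # PIAQIHILEGR:PIAQIHILEGR
```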
Depending on the chosen template mode, template paths are managed accordingly. Templates provide structural information for modeling.
```python
if template_mode == "pdb70":
    use_templates = True
    ...
elif template_mode == "custom":
    custom_template_path = os.path.join(jobname, "template")
    ...
else:
    custom_template_path = None
    use_templates = False
```
This section determines the path to the multiple sequence alignment (a3m) file based on the chosen MSA mode.
```python
...
if "mmseqs2" in msa_mode:
    a3m_file = os.path.join(jobname, f"{jobname}.a3m")
elif msa_mode == "custom":
    a3m_file = os.path.join(jobname, f"{jobname}.custom.a3m")
    ...
else:
    a3m_file = os.path.join(jobname, f"{jobname}.single_sequence.a3m")
```
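Extracted as a hypothetical helper (the function name is ours), the naming convention is easy to verify in isolation:

```python
import os

def a3m_path(msa_mode, jobname):
    # Mirrors the notebook's file-naming convention for the alignment file
    if "mmseqs2" in msa_mode:
        return os.path.join(jobname, f"{jobname}.a3m")
    if msa_mode == "custom":
        return os.path.join(jobname, f"{jobname}.custom.a3m")
    return os.path.join(jobname, f"{jobname}.single_sequence.a3m")

print(a3m_path("mmseqs2_uniref_env", "job"))
print(a3m_path("single_sequence", "job"))
```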
Advanced settings related to the AlphaFold model are provided, including the model type, number of recycles, dropout usage, and more.
```python
model_type = "auto" #@param ["auto", "alphafold2_ptm", ...]
...
save_recycles = False #@param {type:"boolean"}
dpi = 200 #@param {type:"integer"}
```
This loop iterates through the job names and runs the AlphaFold prediction process for each job. The results are then compressed into a ZIP file.
```python
for jobname in jobname_list:
    ...
    results = run(
        queries=queries,
        ...
    )
    results_zip = f"{jobname}.result.zip"
    os.system(f"zip -r {results_zip} {jobname}")
```
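`os.system(f"zip -r ...")` shells out to the external `zip` binary, which is available on Colab. A portable standard-library alternative is `shutil.make_archive`; below is a sketch with a dummy job directory and a hypothetical result filename, not what the notebook itself does:

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    jobdir = os.path.join(tmp, "test_job")
    os.makedirs(jobdir)
    with open(os.path.join(jobdir, "ranked_0.pdb"), "w") as f:
        f.write("MODEL     1\n")
    # Produces <tmp>/test_job.result.zip containing the test_job/ directory
    archive = shutil.make_archive(
        os.path.join(tmp, "test_job.result"), "zip",
        root_dir=tmp, base_dir="test_job",
    )
    with zipfile.ZipFile(archive) as zf:
        print(sorted(zf.namelist()))
```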
After the predictions are completed, the results are packaged into a ZIP file and then downloaded. If enabled, the results are also uploaded to Google Drive.
```python
...
for jobname in jobname_list:
    ...
    result_zip = f"{jobname}.result.zip"
    os.system(f"zip -r {result_zip} {jobname}")
    files.download(result_zip)
```
Please note that the code assumes certain modules are available. To fully understand how it fits into the larger context and its intended use, you would need to examine the entire notebook it is part of. If you want to test the code with your own data, you can do so via the modified ColabFold link provided at the beginning of this document.