This code manages the process of running AlphaFold predictions on protein sequences: it handles input data, template options, MSA modes, and other advanced settings to customize the predictions, and saves the results. Let’s break it down step by step:
This line creates a boolean variable `display_images` that can be toggled through Colab’s form widget. It determines whether images are displayed during execution of the code.
```python
display_images = True #@param {type:"boolean"}
```
This section imports various Python modules and classes required for the rest of the code. It includes modules for handling warnings, working with file paths, downloading AlphaFold parameters, setting up logging, managing batches of protein sequences, and plotting Multiple Sequence Alignments (MSAs).
```python
import sys
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from Bio import BiopythonDeprecationWarning
warnings.simplefilter(action='ignore', category=BiopythonDeprecationWarning)
from pathlib import Path
from colabfold.download import download_alphafold_params, default_data_dir
from colabfold.utils import setup_logging
from colabfold.batch import get_queries, run, set_model_type
from colabfold.plot import plot_msa_v2
```
This section checks whether a specific GPU type (Tesla K80) is present. If it is found, it prints a warning and removes certain memory-management environment variables, since unified memory does not work well on that card.
```python
import os  # used here, so it must be imported before this block

try:
    K80_chk = os.popen('nvidia-smi | grep "Tesla K80" | wc -l').read()
except Exception:
    K80_chk = "0"

if "1" in K80_chk:
    print("WARNING: found GPU Tesla K80: limited to total length < 1000")
    if "TF_FORCE_UNIFIED_MEMORY" in os.environ:
        del os.environ["TF_FORCE_UNIFIED_MEMORY"]
    if "XLA_PYTHON_CLIENT_MEM_FRACTION" in os.environ:
        del os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]
```
Here, additional functions and classes are imported for protein visualization, working with file paths, and plotting using Matplotlib.
```python
from colabfold.colabfold import plot_protein
from pathlib import Path
import matplotlib.pyplot as plt
```
## 5. User Input Section

Here the user is prompted to provide input: a protein sequence, a job name, the number of top-ranked models to relax, and a template mode. The parameters are set through Colab form annotations (`#@param`). Note that `query_sequence` is not important as a parameter here, since the user will later be prompted to upload their own sequence files.
```python
from google.colab import files
import os
import re
import hashlib
import random
from sys import version_info

python_version = f"{version_info.major}.{version_info.minor}"

def add_hash(x, y):
    return x + "_" + hashlib.sha1(y.encode()).hexdigest()[:5]

query_sequence = 'PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK' #@param {type:"string"}
#@markdown - Use `:` to specify inter-protein chainbreaks for **modeling complexes** (supports homo- and hetero-oligomers). For example **PI...SK:PI...SK** for a homodimer
jobname = 'test' #@param {type:"string"}
# number of top-ranked models to relax
num_relax = 0 #@param [0, 1, 5] {type:"raw"}
#@markdown - specify how many of the top ranked structures to relax using amber
template_mode = "none" #@param ["none", "pdb70", "custom"]
#@markdown - `none` = no template information is used. `pdb70` = detect templates in pdb70. `custom` = upload and search own templates (PDB or mmCIF format, see [notes below](#custom_templates))
use_amber = num_relax > 0

# remove whitespace
query_sequence = "".join(query_sequence.split())
basejobname = "".join(jobname.split())
basejobname = re.sub(r'\W+', '', basejobname)
jobname = add_hash(basejobname, query_sequence)

# check whether a directory with this jobname already exists
def check(folder):
    return not os.path.exists(folder)

if not check(jobname):
    n = 0
    while not check(f"{jobname}_{n}"):
        n += 1
    jobname = f"{jobname}_{n}"

# make directory to save results
os.makedirs(jobname, exist_ok=True)

# save queries
queries_path = os.path.join(jobname, f"{jobname}.csv")
with open(queries_path, "w") as text_file:
    text_file.write(f"id,sequence\n{jobname},{query_sequence}")

if template_mode == "pdb70":
    use_templates = True
    custom_template_path = None
elif template_mode == "custom":
    custom_template_path = os.path.join(jobname, "template")
    os.makedirs(custom_template_path, exist_ok=True)
    uploaded = files.upload()
    use_templates = True
    for fn in uploaded.keys():
        os.rename(fn, os.path.join(custom_template_path, fn))
else:
    custom_template_path = None
    use_templates = False
```
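For illustration, the template branching can be reduced to a small pure function mapping `template_mode` to the pair (`use_templates`, `custom_template_path`). The helper name below is ours, not ColabFold’s, and it omits the interactive upload step:

```python
import os

def resolve_templates(template_mode, jobname):
    # Mirrors the notebook's branching: "pdb70" searches public templates,
    # "custom" expects user-uploaded files under <jobname>/template,
    # anything else disables template usage entirely.
    if template_mode == "pdb70":
        return True, None
    if template_mode == "custom":
        return True, os.path.join(jobname, "template")
    return False, None

print(resolve_templates("pdb70", "job"))   # (True, None)
print(resolve_templates("none", "job"))    # (False, None)
```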
When Amber relaxation is enabled (`use_amber`), this part adds the system site-packages directory to `sys.path` so that the packages required for relaxation can be imported.
```python
if use_amber and f"/usr/local/lib/python{python_version}/site-packages/" not in sys.path:
    sys.path.insert(0, f"/usr/local/lib/python{python_version}/site-packages/")
```
The input protein sequence is cleaned of whitespace characters. The job name is cleaned up and hashed with the sequence to create a unique identifier for the job.
```python
query_sequence = "".join(query_sequence.split())
basejobname = "".join(jobname.split())
basejobname = re.sub(r'\W+', '', basejobname)
jobname = add_hash(basejobname, query_sequence)
```
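Because `add_hash` appends the first five hex characters of the SHA-1 digest of the sequence, the identifier is short but deterministic: the same jobname/sequence pair always yields the same name. A standalone sketch:

```python
import hashlib

def add_hash(x, y):
    # Append the first 5 hex characters of SHA-1(sequence) to the jobname
    return x + "_" + hashlib.sha1(y.encode()).hexdigest()[:5]

seq = "PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK"
print(add_hash("test", seq))  # "test_" followed by a 5-character hex suffix
```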
A directory is created with the job name to save results. If a directory with the same name already exists, a new name with an appended number is generated to avoid overwriting existing data.
```python
if not check(jobname):
    n = 0
    while not check(f"{jobname}_{n}"):
        n += 1
    jobname = f"{jobname}_{n}"
os.makedirs(jobname, exist_ok=True)
```
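The collision-avoidance logic can be exercised on its own; this sketch runs it inside a temporary directory rather than the Colab working directory (the helper name is ours):

```python
import os
import tempfile

def unique_name(base):
    # Return base if unused, otherwise base_0, base_1, ... until free
    if not os.path.exists(base):
        return base
    n = 0
    while os.path.exists(f"{base}_{n}"):
        n += 1
    return f"{base}_{n}"

with tempfile.TemporaryDirectory() as tmp:
    first = unique_name(os.path.join(tmp, "job"))
    os.makedirs(first)
    second = unique_name(os.path.join(tmp, "job"))
    print(os.path.basename(first), os.path.basename(second))  # job job_0
```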
The protein sequence is saved to a CSV file within the job directory.
```python
queries_path = os.path.join(jobname, f"{jobname}.csv")
with open(queries_path, "w") as text_file:
    text_file.write(f"id,sequence\n{jobname},{query_sequence}")
```
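The file is a two-column table, one row per query; for complexes the `:` chain separators stay inside the sequence field. Reading it back with the standard library shows the layout (the id below is a hypothetical example of the hashed jobname):

```python
import csv
import io

csv_text = "id,sequence\ntest_1a2b3,PIAQIHILEGR:PIAQIHILEGR"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["id"])        # test_1a2b3
print(rows[0]["sequence"])  # PIAQIHILEGR:PIAQIHILEGR
```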
Depending on the chosen template mode, template paths are managed accordingly. Templates provide structural information for modeling.
```python
if template_mode == "pdb70":
    use_templates = True
    ...
elif template_mode == "custom":
    custom_template_path = os.path.join(jobname, "template")
    ...
else:
    custom_template_path = None
    use_templates = False
```
This section determines the path to the multiple sequence alignment (a3m) file based on the chosen MSA mode.
```python
...
if "mmseqs2" in msa_mode:
    a3m_file = os.path.join(jobname, f"{jobname}.a3m")
elif msa_mode == "custom":
    a3m_file = os.path.join(jobname, f"{jobname}.custom.a3m")
    ...
else:
    a3m_file = os.path.join(jobname, f"{jobname}.single_sequence.a3m")
```
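Extracted as a hypothetical helper (the function name is ours), the naming convention is easy to verify in isolation:

```python
import os

def a3m_path(msa_mode, jobname):
    # Mirrors the notebook's file-naming convention for the alignment file
    if "mmseqs2" in msa_mode:
        return os.path.join(jobname, f"{jobname}.a3m")
    if msa_mode == "custom":
        return os.path.join(jobname, f"{jobname}.custom.a3m")
    return os.path.join(jobname, f"{jobname}.single_sequence.a3m")

print(a3m_path("mmseqs2_uniref_env", "job"))
print(a3m_path("single_sequence", "job"))
```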
Advanced settings related to the AlphaFold model are provided, including the model type, number of recycles, dropout usage, and more.
```python
model_type = "auto" #@param ["auto", "alphafold2_ptm", ...]
...
save_recycles = False #@param {type:"boolean"}
dpi = 200 #@param {type:"integer"}
```
This loop iterates through the job names and runs the AlphaFold prediction process for each job. The results are then compressed into a ZIP file.
```python
for jobname in jobname_list:
    ...
    results = run(
        queries=queries,
        ...
    )
    results_zip = f"{jobname}.result.zip"
    os.system(f"zip -r {results_zip} {jobname}")
```
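`os.system(f"zip -r ...")` shells out to the external `zip` binary, which is available on Colab. A portable standard-library alternative is `shutil.make_archive`; below is a sketch with a dummy job directory and a hypothetical result filename, not what the notebook itself does:

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    jobdir = os.path.join(tmp, "test_job")
    os.makedirs(jobdir)
    with open(os.path.join(jobdir, "ranked_0.pdb"), "w") as f:
        f.write("MODEL     1\n")
    # Produces <tmp>/test_job.result.zip containing the test_job/ directory
    archive = shutil.make_archive(
        os.path.join(tmp, "test_job.result"), "zip",
        root_dir=tmp, base_dir="test_job",
    )
    with zipfile.ZipFile(archive) as zf:
        print(sorted(zf.namelist()))
```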
After the predictions are completed, the results are packaged into a ZIP file and then downloaded. If enabled, the results are also uploaded to Google Drive.
```python
...
for jobname in jobname_list:
    ...
    result_zip = f"{jobname}.result.zip"
    os.system(f"zip -r {result_zip} {jobname}")
    files.download(result_zip)
```
Please note that the code assumes certain modules are available. To fully understand how it fits into the larger context and its intended use, you would need to examine the entire notebook it is part of. If you want to test the code with your own data, you can do so via the modified ColabFold link provided at the beginning of this document.