Working with Multiple PDB Files? Here’s How to Clean Them All at Once

If you’ve ever worked with multiple protein structures—for example, when analyzing a protein family, running virtual screening, or preparing data sets for machine learning—you know how tedious it can be to clean and prepare each file manually.

Removing solvents, fixing missing atoms, ensuring consistent formatting, or adding hydrogens across dozens or hundreds of PDB files is demanding, repetitive, and error-prone. And yet, just one poorly prepared file can break your entire pipeline.

Fortunately, SAMSON provides a solution designed specifically for this pain point: the Batch Protein Prepare extension. It applies SAMSON’s protein-cleaning steps automatically to a list of structures, whether they live in a local folder or need to be fetched by their PDB codes online.

When batch preparation makes a difference

Imagine this: you’ve downloaded 50 structure files to study serine proteases. Most of them are incomplete or include extraneous molecules like cofactors, water, and monatomic ions. Some are in mmCIF format, others in MMTF or regular PDB. You want them cleaned, hydrogen-completed, and saved in a consistent format, ready for downstream simulations or analysis.

Instead of opening each file individually, importing it into an editor, clicking through the preparation steps, exporting, and repeating, you can use Batch Protein Prepare to process them all in one go.

How Batch Protein Prepare works

Get started by installing the Batch Protein Prepare extension from SAMSON Connect. Once installed, you can:

Prepare all structures in a selected folder, automatically preserving any subfolder arrangement in the output.
Download structures using PDB codes—either entered manually or via a text file—and apply the cleaning pipeline to them as they download.
Clean files in multiple formats, including PDB, PDBx/mmCIF, MMTF, and MOL2.
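
To make the folder mode concrete, here is a minimal Python sketch of how a batch run could pair each input file with an output path that mirrors the original subfolders. It is not SAMSON's implementation or API: the function name plan_outputs and the extension set are assumptions chosen for illustration, based only on the formats listed above.

```python
from pathlib import Path

# Assumed file extensions for the formats listed above; adjust to your data.
STRUCTURE_EXTENSIONS = {".pdb", ".cif", ".mmtf", ".mol2"}


def plan_outputs(input_root, output_root, extensions=STRUCTURE_EXTENSIONS):
    """Return (source, destination) pairs that mirror the input subfolders."""
    input_root = Path(input_root)
    output_root = Path(output_root)
    pairs = []
    for source in sorted(input_root.rglob("*")):
        # Keep only regular files with a recognized structure extension.
        if source.is_file() and source.suffix.lower() in extensions:
            relative = source.relative_to(input_root)
            pairs.append((source, output_root / relative))
    return pairs
```

Calling plan_outputs("structures", "prepared") would map structures/family_a/1abc.pdb to prepared/family_a/1abc.pdb, which is a convenient way to preview a batch before anything is written.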

Here’s what the preparation pipeline does under the hood:

Removes alternate conformations (keeping highest-occupancy atoms)
Strips solvent, ions, and unwanted ligands
Adds missing hydrogens to standard residues
Leaves you with clean, consistent structures ready for simulation
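
For readers who want to see what the first two steps amount to at the file level, the sketch below applies them to the lines of a PDB-format file in plain Python. It is a simplified illustration, not the code SAMSON runs. It assumes a single-model file with standard fixed-column ATOM records, treats every HETATM record as removable (which would also drop modified residues stored as HETATM), and does not add hydrogens. The function name clean_pdb_lines is hypothetical.

```python
WATER_NAMES = {"HOH", "WAT", "DOD"}


def clean_pdb_lines(lines):
    """Keep ATOM records only, with one alternate location per atom."""
    best = {}   # atom key -> (occupancy, line)
    order = []  # atom keys in first-seen order
    for line in lines:
        if not line.startswith("ATOM"):
            continue  # skips HETATM (waters, ions, ligands) and all other records
        if line[17:20].strip() in WATER_NAMES:
            continue  # waters occasionally appear as ATOM records
        # Identify the atom by chain, residue number + insertion code, and atom
        # name, ignoring the alternate-location flag in column 17 (index 16).
        key = (line[21:22], line[22:27], line[12:16])
        try:
            occupancy = float(line[54:60])
        except ValueError:
            occupancy = 0.0  # missing or malformed occupancy field
        if key not in best:
            order.append(key)
            best[key] = (occupancy, line)
        elif occupancy > best[key][0]:
            best[key] = (occupancy, line)
    # Blank the alternate-location flag on the records that were kept.
    return [best[key][1][:16] + " " + best[key][1][17:] for key in order]
```

When two alternate locations have the same occupancy, this sketch keeps the first one encountered; a real tool may break ties differently.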

[Figure: the Batch Protein Prepare interface while processing multiple files]

The batch process is especially helpful if you’re running automated docking queues, training deep learning models that require consistently formatted inputs, or simply want to save time and reduce manual steps in structure preparation.

Tips when using Batch Protein Prepare

Prepare a list of PDB codes in advance—one per line if using a text file.
Double-check your output folder before running to avoid overwriting any important data.
If necessary, combine it with other extensions such as PDBFixer for deeper fixes, for example rebuilding missing residues or protonating at a custom pH.
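
The first two tips can be checked with a small script before launching a batch. The Python sketch below validates a one-code-per-line list and reports output files that already exist. It assumes classic four-character PDB codes (a digit followed by three letters or digits) and an output naming scheme of CODE.pdb. Both are illustrative assumptions, as are the function names; none of this reflects how the extension itself names or checks files.

```python
import re
from pathlib import Path

# Assumed pattern for classic PDB codes: one digit, then three letters or digits.
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def parse_pdb_codes(text):
    """Return (unique valid codes in upper case, rejected entries)."""
    valid, rejected, seen = [], [], set()
    for raw in text.splitlines():
        entry = raw.strip()
        if not entry:
            continue  # ignore blank lines
        if PDB_ID.match(entry):
            code = entry.upper()
            if code not in seen:
                seen.add(code)
                valid.append(code)
        else:
            rejected.append(entry)
    return valid, rejected


def existing_outputs(output_dir, codes, suffix=".pdb"):
    """List files in output_dir that a run would overwrite (hypothetical naming)."""
    output_dir = Path(output_dir)
    candidates = [output_dir / f"{code}{suffix}" for code in codes]
    return [path for path in candidates if path.exists()]
```

A typical use would be to read the list with Path("codes.txt").read_text(), fix any rejected entries, and only start the batch once existing_outputs returns an empty list or the files have been backed up.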

By minimizing human input, SAMSON’s Batch Protein Prepare makes protein structure preparation more scalable and reproducible across large datasets.

Learn more in the full SAMSON documentation.

SAMSON and all SAMSON Extensions are free for non-commercial use. You can download and install SAMSON at https://www.samson-connect.net.