Cleaning Hundreds of PDB Files? Here’s How to Do It Automatically

If you’ve ever faced the task of cleaning dozens, if not hundreds, of PDB files before running simulations or docking workflows, you’re not alone. Fixing missing atoms, stripping waters, and resolving alternate locations by hand is not the most enjoyable part of molecular modeling. Luckily, if you are using SAMSON, there is a time-saving solution built right into the platform: the Batch Protein Prepare extension.

This blog post walks you through automated batch cleaning of protein structures, so you can spend more time running your simulations and less time clicking through files.

Why batch preparation is necessary

When working on large protein datasets—say for virtual screening, molecular dynamics, or building statistical models—you need your structures to be consistent and valid. That means:

Removing alternate atom locations
Stripping out unneeded ligands, ions, or water molecules
Adding missing hydrogens
Ensuring the structure is simulation-ready
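
To make these steps concrete, here is a minimal Python sketch of what two of them (stripping waters and keeping a single alternate location) look like at the level of raw PDB text. It is not how SAMSON implements preparation: the function name clean_pdb_lines and its parameters are hypothetical, and the code relies only on the fixed-column PDB layout (alternate location indicator in column 17, residue name in columns 18 to 20). Adding hydrogens requires real chemistry and is deliberately left out.

```python
def clean_pdb_lines(lines, keep_altloc=("", "A"), drop_resnames=("HOH", "WAT")):
    """Return PDB lines without waters and with one alternate location kept.

    Hypothetical helper for illustration; not part of SAMSON.
    """
    cleaned = []
    for line in lines:
        record = line[:6].strip()
        # Only coordinate-like records carry altLoc and residue name columns.
        if record not in ("ATOM", "HETATM", "ANISOU"):
            cleaned.append(line)
            continue
        altloc = line[16:17].strip()   # column 17
        resname = line[17:20].strip()  # columns 18-20
        if resname in drop_resnames:
            continue  # strip water molecules
        if altloc not in keep_altloc:
            continue  # drop every alternate location except the kept one
        # Blank the altLoc column so the kept atom reads as a regular atom.
        cleaned.append(line[:16] + " " + line[17:])
    return cleaned
```

Keeping only the blank and "A" locations is a simplification: a more careful tool would choose the conformer with the highest occupancy for each residue.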

Doing this once is fine. Doing this a hundred times manually? That’s when it becomes a real bottleneck.

What Batch Protein Prepare does for you

The Batch Protein Prepare extension in SAMSON lets you process a large number of protein files automatically. It replicates the cleaning steps of the Home > Prepare tool but scales the process to multiple files or PDB identifiers.

Key features include:

Bulk preparation from folders: Load entire directories containing PDB, mmCIF, MMTF, or MOL2 files. The tool processes each file and outputs a cleaned version, while preserving folder structure.
Download PDBs by ID: Supply a list of PDB codes (manually or via a text file) and the extension will download and prepare them automatically. Both standard and extended PDB formats are supported.
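
If you keep your PDB codes in a text file, it can help to validate them before handing them to the extension. The sketch below is an illustration only and not part of the extension: parse_pdb_ids is a hypothetical helper, and it assumes that "extended" refers to the wwPDB extended identifier form (pdb_ followed by eight characters) alongside the classic four-character codes.

```python
import re

# Classic ID: a digit followed by three alphanumerics (e.g. 1abc).
STANDARD_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")
# Assumed extended ID: "pdb_" followed by eight alphanumerics (e.g. pdb_00001abc).
EXTENDED_ID = re.compile(r"^pdb_[A-Za-z0-9]{8}$", re.IGNORECASE)


def parse_pdb_ids(text):
    """Split text on whitespace, commas, or semicolons.

    Returns (valid_ids, rejected_tokens); duplicates are dropped,
    first occurrence wins. Hypothetical helper for illustration.
    """
    valid, rejected, seen = [], [], set()
    for token in re.split(r"[\s,;]+", text.strip()):
        if not token:
            continue
        if STANDARD_ID.match(token):
            pdb_id = token.upper()
        elif EXTENDED_ID.match(token):
            pdb_id = token.lower()
        else:
            rejected.append(token)
            continue
        if pdb_id not in seen:
            seen.add(pdb_id)
            valid.append(pdb_id)
    return valid, rejected
```

Reporting rejected tokens separately, rather than silently skipping them, makes typos in a long list easy to spot before a batch run starts.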

This is especially helpful when integrating protein preparation into automated pipelines, pre-processing data for machine learning, or cleaning public datasets for model evaluation.

How to use it

Install the Batch Protein Prepare extension from SAMSON Connect.
Launch SAMSON, and open the extension via the Extensions menu.
Select a folder containing PDB files, or paste in your list of PDB codes.
Choose your preparation options (e.g., remove waters, keep ligands, add hydrogens).
Run the process and retrieve cleaned files from the output directory.
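
For intuition about the folder-based workflow, the following sketch shows one way to mirror an input tree into an output directory so that each cleaned file lands in the same relative location. It reflects an assumption about the general idea, not the extension’s actual code: plan_outputs and the suffix list are hypothetical.

```python
from pathlib import Path

# Assumed file suffixes for the formats mentioned in this post.
STRUCTURE_SUFFIXES = {".pdb", ".cif", ".mmtf", ".mol2"}


def plan_outputs(input_root, output_root):
    """Pair each structure file under input_root with a mirrored output path.

    Hypothetical helper: it only plans paths and does not modify any file.
    """
    input_root, output_root = Path(input_root), Path(output_root)
    plan = []
    for src in sorted(input_root.rglob("*")):
        if src.is_file() and src.suffix.lower() in STRUCTURE_SUFFIXES:
            # relative_to keeps the subfolder layout intact in the output tree.
            plan.append((src, output_root / src.relative_to(input_root)))
    return plan
```

Planning the paths first, before any processing happens, makes it easy to review what a batch run would touch.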

When should you use this?

If you’re working with any sizable quantity of proteins, for example from the Protein Data Bank or from high-throughput modeling tasks, this tool can save substantial manual effort. It handles both missing data (such as hydrogens) and redundant elements (such as alternate locations, waters, or unneeded ligands and ions) that would otherwise affect your simulations or analysis.

Even if you only have a dozen files to clean, the time saved and the consistent output are often worth it.

Keep things reproducible

Another benefit of automating protein preparation is reproducibility. Once you define and save your settings, it’s easy to apply the same cleaning process every time. No more wondering which structure had alternate locations removed and which didn’t.
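
Outside SAMSON, the same idea can be sketched by storing the chosen options next to your dataset. The example below is hypothetical: PrepSettings and its field names merely mirror the options mentioned in this post and are not the extension’s real settings format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PrepSettings:
    """Hypothetical record of preparation options, for illustration only."""
    remove_waters: bool = True
    keep_ligands: bool = True
    add_hydrogens: bool = True
    remove_alternate_locations: bool = True


def dump_settings(settings):
    """Serialize settings to stable, diff-friendly JSON text."""
    return json.dumps(asdict(settings), indent=2, sort_keys=True)


def load_settings(text):
    """Rebuild settings from JSON text produced by dump_settings."""
    return PrepSettings(**json.loads(text))
```

Committing such a file alongside the cleaned structures records exactly which options produced them.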

To learn more, visit the full documentation page at https://documentation.samson-connect.net/tutorials/prepare-protein/prepare-protein/.

SAMSON and all SAMSON Extensions are free for non-commercial use. You can download SAMSON at https://www.samson-connect.net.