Introduction
Stable Diffusion (most commonly used to convert text into images) is a growing application of AI technology in the content creation industry. Unlike many workflows that utilize commercially-developed software (Photoshop, Premiere Pro, DaVinci Resolve, etc.), many commonly used Stable Diffusion applications are open source and constantly evolving.
To run Stable Diffusion on your local system, you need a GPU powerful enough to handle its heavy requirements. Not only will a more powerful card allow you to generate images more quickly, but you also need a card with plenty of VRAM if you want to create larger-resolution images.
In this article, we will present our methodology for benchmarking various GPUs for Stable Diffusion. This includes what implementations of Stable Diffusion we recommend, system configuration, testing prompts, automation, and scoring calculation. With a wide technology like this, there are effectively infinite ways to measure performance. These guidelines are not intended to be definitive or absolute, but rather to act as a common baseline.
Stable Diffusion Implementations
While it is possible to directly use the version of Stable Diffusion that Stability AI and Runway developed, most people use one of the web-based graphical UIs created by third parties. The most common are:
- Automatic 1111 – Automatic 1111 is most commonly used with NVIDIA GPUs, although forks are available for AMD and Apple Silicon. It allows for the use of xformers which can significantly increase performance on NVIDIA GPUs. There is also OpenVINO support through a custom script to run it on Intel GPUs.
- SHARK – An alternative to Automatic 1111 that works natively with both NVIDIA and AMD GPUs. Performance with AMD GPUs tends to be higher than with Automatic 1111, while performance with NVIDIA GPUs is typically lower.
- Custom – Since Stable Diffusion is available for anyone to utilize directly, some opt to create their own application with the exact features they need. In fact, we are in the process of developing an easy-to-run application focused on benchmarking rather than actual image generation.
Each of these implementations (as well as many others that are available) has unique pros and cons regarding features and usability. However, from a performance and benchmarking standpoint, Automatic 1111 and SHARK are the two we currently recommend using. In addition, depending on what GPUs you plan to test, we advise using both in tandem: Automatic 1111 for NVIDIA GPUs, and SHARK for AMD GPUs.
One thing to be aware of is that since Stable Diffusion is constantly being updated and improved, it is important to note the versions of the underlying packages used. For Automatic 1111, this is reported at the bottom of the web UI and looks something like:
version: v1.3.2 • python: 3.10.6 • torch: 2.0.1+cu118 • xformers: 0.0.17 • gradio: 3.32.0 • checkpoint: ad2a33c361
Differences in any of these components can result in performance changes.
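If you want to record these versions as part of an automated run rather than reading them off the web UI, a minimal Python sketch like the following can log the components that matter most. It assumes a local Python environment with torch installed (and optionally xformers); it is only one way to capture this information, not part of Automatic 1111 itself.

```python
# Minimal sketch: log the package versions that most affect Stable Diffusion performance.
import sys

import torch

print(f"python: {sys.version.split()[0]}")
print(f"torch: {torch.__version__}")
print(f"cuda available: {torch.cuda.is_available()}")

try:
    import xformers
    print(f"xformers: {xformers.__version__}")
except ImportError:
    print("xformers: not installed")
```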
Settings & Models
Stable Diffusion can be used for several tasks (including text-to-image, image-to-image, inpainting, scaling, etc.) and has a host of settings you can adjust. From our testing, the most commonly changed settings (prompt, negative prompt, cfg scale, and seed) don’t impact performance meaningfully. Generating an image of a dog or a mountain landscape should take the same amount of time. Even the model used tends to make only a minor difference in generation time.
Other settings, such as the number of steps, will change how long it takes to generate the image, but will give the same result in terms of iterations per second. While most people will use between 20 and 50 steps to generate an image, we recommend using a higher step count (such as 200) as that can help with run-to-run consistency.
The resolution will have the biggest impact on performance, and also changes how much VRAM is required to generate the image. For benchmarking, we recommend using 512×512 to maximize compatibility across different GPU models.
Beyond the resolution, the sampling method (Euler, DPM, etc.) can make an impact. “Euler” and “Euler a” are the most commonly used methods and tend to give among the best overall performance, while others, like DPM2, take roughly twice as long. From a GPU benchmarking perspective, we recommend sticking to a variation of Euler for consistency.
Recommended Prompts for Benchmarking
With the effectively infinite number of prompts and settings you can use, we wanted to provide a sample set of prompts to facilitate consistent benchmarking.
Text to Image
- Prompt: “red sports car, (centered), driving, ((angled shot)), full car, wide angle, mountain road”
- Negative prompt: N/A
- Steps: 200
- Cfg Scale: 7
- Seed: 3936349264
- Width: 512
- Height: 512
- Batch count: 1
- Batch size: 1
- Sampler: Euler
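For reference, here is a minimal sketch of how these text-to-image settings could be submitted programmatically to Automatic 1111. It assumes the web UI was launched with the --api flag and is reachable at http://127.0.0.1:7860 (both are assumptions about your setup); the field names follow the /sdapi/v1/txt2img endpoint.

```python
# Minimal sketch: send the recommended text-to-image benchmark settings to Automatic 1111.
import requests

payload = {
    "prompt": "red sports car, (centered), driving, ((angled shot)), full car, wide angle, mountain road",
    "negative_prompt": "",
    "steps": 200,
    "cfg_scale": 7,
    "seed": 3936349264,
    "width": 512,
    "height": 512,
    "n_iter": 1,        # batch count
    "batch_size": 1,
    "sampler_name": "Euler",
}

response = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
response.raise_for_status()
print("Generated", len(response.json()["images"]), "image(s)")
```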
Image to Image
- Source Image
- Prompt: “photo of a minivan, cinematic, realistic, hyper detailed, maximum detail”
- Negative prompt: N/A
- Steps: 200
- DeNoising Strength: 0.8
- Cfg Scale: 7
- Seed: 1417252111
- Width: 512
- Height: 512
- Batch count: 1
- Batch size: 1
- Sampler: Euler
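The image-to-image settings can be scripted the same way through the /sdapi/v1/img2img endpoint. The sketch below makes the same assumptions as the text-to-image example (web UI launched with --api at http://127.0.0.1:7860), and "source_image.png" is a placeholder path for whatever source image you use.

```python
# Minimal sketch: send the recommended image-to-image benchmark settings to Automatic 1111.
import base64

import requests

with open("source_image.png", "rb") as f:  # placeholder path for your source image
    init_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "init_images": [init_image],
    "prompt": "photo of a minivan, cinematic, realistic, hyper detailed, maximum detail",
    "negative_prompt": "",
    "steps": 200,
    "denoising_strength": 0.8,
    "cfg_scale": 7,
    "seed": 1417252111,
    "width": 512,
    "height": 512,
    "n_iter": 1,        # batch count
    "batch_size": 1,
    "sampler_name": "Euler",
}

response = requests.post("http://127.0.0.1:7860/sdapi/v1/img2img", json=payload)
response.raise_for_status()
```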
Suggested inpainting and scaling prompts coming soon!
Measuring Performance
Most implementations of Stable Diffusion report a performance metric of “iterations per second” or “it/s”. We recommend using this metric as it is the most commonly used measurement of performance for Stable Diffusion. If your testing method does not allow you to directly record the it/s, you can calculate it by dividing the number of iterations you are running by the number of seconds it took to complete the test.
For example, if you run our recommended text to image prompt for 200 iterations and it takes 15 seconds to generate the image, that is 200/15 or approximately 13.3 it/s.
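If you are scripting your runs, a minimal sketch of that calculation might look like the following. It reuses the assumed local Automatic 1111 API endpoint from the earlier examples and simply times the request; any other way of invoking generation can be substituted.

```python
# Minimal sketch: derive it/s when the tool does not report it directly.
import time

import requests

steps = 200
start = time.perf_counter()
resp = requests.post(
    "http://127.0.0.1:7860/sdapi/v1/txt2img",
    json={"prompt": "red sports car", "steps": steps, "sampler_name": "Euler"},
)
elapsed = time.perf_counter() - start
resp.raise_for_status()

print(f"{steps / elapsed:.1f} it/s over {elapsed:.1f} seconds")
```

Keep in mind that wall-clock timing like this includes some fixed overhead outside of the sampling loop itself, which is part of why we recommend a higher step count such as 200.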
Automation
Instructions and scripting for automation are coming soon!