Introduction
Stable Diffusion (most commonly used to convert text into images) is a growing application of AI technology in the content creation industry. Unlike many workflows that utilize commercially-developed software (Photoshop, Premiere Pro, DaVinci Resolve, etc.), many commonly used Stable Diffusion applications are open source and constantly evolving.
To run Stable Diffusion on your local system, you need a GPU powerful enough to handle its heavy requirements. Not only will a more powerful card allow you to generate images more quickly, but you also need a card with plenty of VRAM if you want to create larger-resolution images.
In this article, we will present our methodology for benchmarking various GPUs for Stable Diffusion. This includes what implementations of Stable Diffusion we recommend, system configuration, testing prompts, automation, and scoring calculation. With a wide technology like this, there are effectively infinite ways to measure performance. These guidelines are not intended to be definitive or absolute, but rather to act as a common baseline.
Stable Diffusion Implementations
While it is possible to directly use the version of Stable Diffusion that Stability AI and Runway developed, most people use one of the web-based graphical UIs created by third parties. The most common are:
- Automatic 1111 – Automatic 1111 is most commonly used with NVIDIA GPUs, although forks are available for AMD and Apple Silicon. It allows for the use of xformers which can significantly increase performance on NVIDIA GPUs. There is also OpenVINO support through a custom script to run it on Intel GPUs.
- SHARK – An alternative to Automatic 1111 that works natively with both NVIDIA and AMD GPUs. Performance with AMD GPUs tends to be higher than with Automatic 1111, while performance with NVIDIA GPUs is typically lower.
- Custom – Since Stable Diffusion is available for anyone to utilize directly, some opt to create their own application with the exact features they need. In fact, we are in the process of developing an easy-to-run application focused on benchmarking rather than actual image generation.
Each of these implementations (as well as many others that are available) has unique pros and cons regarding features and usability. However, from a performance and benchmarking standpoint, Automatic 1111 and SHARK are the two we currently recommend using. In addition, depending on what GPUs you plan to test, we advise using both in tandem: Automatic 1111 for NVIDIA GPUs, and SHARK for AMD GPUs.
One thing to be aware of is that since Stable Diffusion is constantly being updated and improved, it is important to note the versions of the underlying packages used. For Automatic 1111, this is reported at the bottom of the web UI and looks something like:
version: v1.3.2 • python: 3.10.6 • torch: 2.0.1+cu118 • xformers: 0.0.17 • gradio: 3.32.0 • checkpoint: ad2a33c361
Differences in any of these components can result in performance changes.
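If you want to record these versions as part of an automated run rather than reading them off the web UI, a minimal Python sketch like the following can log the components that matter most. It assumes a local Python environment with torch installed (and optionally xformers); it is only one way to capture this information, not part of Automatic 1111 itself.

```python
# Minimal sketch: log the package versions that most affect Stable Diffusion performance.
import sys

import torch

print(f"python: {sys.version.split()[0]}")
print(f"torch: {torch.__version__}")
print(f"cuda available: {torch.cuda.is_available()}")

try:
    import xformers
    print(f"xformers: {xformers.__version__}")
except ImportError:
    print("xformers: not installed")
```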
Settings & Models
Stable Diffusion can be used for several tasks (including text-to-image, image-to-image, inpainting, scaling, etc.) and has a host of settings you can adjust. From our testing, the most commonly changed settings (prompt, negative prompt, cfg scale, and seed) don’t impact performance meaningfully. Generating an image of a dog or a mountain landscape should take the same amount of time. Even the model used tends to make only a minor difference in generation time.
Other settings, such as the number of steps, will change how long it takes to generate the image, but will give the same result in terms of iterations per second. While most people will use between 20 and 50 steps to generate an image, we recommend using a higher step count (such as 200) as that can help with run-to-run consistency.
The resolution will have the biggest impact on performance, and also changes how much VRAM is required to generate the image. For benchmarking, we recommend using 512×512 to maximize compatibility across different GPU models.
Beyond the resolution, the sampling method (Euler, DPM, etc.) can make an impact. “Euler” and “Euler a” are the most commonly used methods and tend to give among the best overall performance, while others, like DPM2, take roughly twice as long. From a GPU benchmarking perspective, we recommend sticking to a variation of Euler for consistency.
Recommended Prompts for Benchmarking
With the effectively infinite number of prompts and settings you can use, we wanted to provide a sample set of prompts to facilitate consistent benchmarking.
Text to Image
- Prompt: “red sports car, (centered), driving, ((angled shot)), full car, wide angle, mountain road”
- Negative prompt: N/A
- Steps: 200
- Cfg Scale: 7
- Seed: 3936349264
- Width: 512
- Height: 512
- Batch count: 1
- Batch size: 1
- Sampler: Euler
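For reference, here is a minimal sketch of how these text-to-image settings could be submitted programmatically to Automatic 1111. It assumes the web UI was launched with the --api flag and is reachable at http://127.0.0.1:7860 (both are assumptions about your setup); the field names follow the /sdapi/v1/txt2img endpoint.

```python
# Minimal sketch: send the recommended text-to-image benchmark settings to Automatic 1111.
import requests

payload = {
    "prompt": "red sports car, (centered), driving, ((angled shot)), full car, wide angle, mountain road",
    "negative_prompt": "",
    "steps": 200,
    "cfg_scale": 7,
    "seed": 3936349264,
    "width": 512,
    "height": 512,
    "n_iter": 1,        # batch count
    "batch_size": 1,
    "sampler_name": "Euler",
}

response = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
response.raise_for_status()
print("Generated", len(response.json()["images"]), "image(s)")
```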
Image to Image
- Source Image
- Prompt: “photo of a minivan, cinematic, realistic, hyper detailed, maximum detail”
- Negative prompt: N/A
- Steps: 200
- DeNoising Strength: 0.8
- Cfg Scale: 7
- Seed: 1417252111
- Width: 512
- Height: 512
- Batch count: 1
- Batch size: 1
- Sampler: Euler
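The image-to-image settings can be scripted the same way through the /sdapi/v1/img2img endpoint. The sketch below makes the same assumptions as the text-to-image example (web UI launched with --api at http://127.0.0.1:7860), and "source_image.png" is a placeholder path for whatever source image you use.

```python
# Minimal sketch: send the recommended image-to-image benchmark settings to Automatic 1111.
import base64

import requests

with open("source_image.png", "rb") as f:  # placeholder path for your source image
    init_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "init_images": [init_image],
    "prompt": "photo of a minivan, cinematic, realistic, hyper detailed, maximum detail",
    "negative_prompt": "",
    "steps": 200,
    "denoising_strength": 0.8,
    "cfg_scale": 7,
    "seed": 1417252111,
    "width": 512,
    "height": 512,
    "n_iter": 1,        # batch count
    "batch_size": 1,
    "sampler_name": "Euler",
}

response = requests.post("http://127.0.0.1:7860/sdapi/v1/img2img", json=payload)
response.raise_for_status()
```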
Suggested inpainting and scaling prompts coming soon!
Measuring Performance
Most implementations of Stable Diffusion report a performance metric of “iterations per second” or “it/s”. We recommend using this metric as it is the most commonly used measurement of performance for Stable Diffusion. If your testing method does not allow you to directly record the it/s, you can calculate it by dividing the number of iterations you are running by the number of seconds it took to complete the test.
For example, if you run our recommended text to image prompt for 200 iterations and it takes 15 seconds to generate the image, that is 200/15 or approximately 13.3 it/s.
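If you are scripting your runs, a minimal sketch of that calculation might look like the following. It reuses the assumed local Automatic 1111 API endpoint from the earlier examples and simply times the request; any other way of invoking generation can be substituted.

```python
# Minimal sketch: derive it/s when the tool does not report it directly.
import time

import requests

steps = 200
start = time.perf_counter()
resp = requests.post(
    "http://127.0.0.1:7860/sdapi/v1/txt2img",
    json={"prompt": "red sports car", "steps": steps, "sampler_name": "Euler"},
)
elapsed = time.perf_counter() - start
resp.raise_for_status()

print(f"{steps / elapsed:.1f} it/s over {elapsed:.1f} seconds")
```

Keep in mind that wall-clock timing like this includes some fixed overhead outside of the sampling loop itself, which is part of why we recommend a higher step count such as 200.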
Automation
Instructions and scripting for automation are coming soon!