Introduction

About Stable Diffusion

Stable Diffusion is a generative model for producing images from text prompts, and the same diffusion approach has also been applied to audio. It is based on the diffusion process and can model complex, high-dimensional distributions. During training, noise is gradually added to an image and the model learns to undo it. At generation time, the model starts from random noise and denoises it step by step, typically over tens of iterations, until a new sample emerges. Stable Diffusion has shown strong results in image generation and is a popular model in the machine learning research community.
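
As a rough illustration of the noising half of this process, the sketch below applies the closed-form forward step used by DDPM-style diffusion models to a toy list of values. The linear noise schedule, its constants, and all function names are assumptions chosen for illustration; Stable Diffusion itself works on latent tensors and uses its own scheduler.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (an assumed toy schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the share of the original signal kept."""
    kept, products = 1.0, []
    for beta in betas:
        kept *= 1.0 - beta
        products.append(kept)
    return products


def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * noise."""
    a_t = alpha_bars[t]
    return [
        math.sqrt(a_t) * x + math.sqrt(1.0 - a_t) * rng.gauss(0.0, 1.0)
        for x in x0
    ]


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
signal = [1.0, -1.0, 0.5, 0.0]  # stand-in for pixel or latent values
slightly_noisy = add_noise(signal, 10, alpha_bars, rng)
almost_pure_noise = add_noise(signal, 999, alpha_bars, rng)
```

A trained model learns to predict and remove this noise, and generation runs the process in reverse, starting from pure noise.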

*Images generated by fine-tuned Stable Diffusion v1.5 model*

Diffusion models rely heavily on matrix and vector operations during both training and inference. This is where modern graphics processing units, or GPUs, demonstrate their capabilities.

CPUs vs GPUs

CPUs and GPUs differ in architecture and purpose. CPUs are general-purpose processors designed to handle a wide range of tasks, including running operating systems and applications and handling input/output operations. They typically have a relatively small number of powerful cores, each of which can run a small number of threads simultaneously.

GPUs, on the other hand, are specialized processors designed for graphics rendering and parallel processing. They have many more cores than CPUs, each optimized for executing a single instruction on multiple data points in parallel. This makes them very efficient at performing matrix and vector operations, which are common in neural network training and inference.

*Example: the difference between CPU and GPU in a matrix factorization task (source)*

That said, recent advancements in CPUs have made them more capable of vector and matrix work. For example, Intel's Advanced Vector Extensions (AVX) are SIMD instructions that let a CPU apply one arithmetic operation to several data elements at once. GPUs still have an advantage over CPUs in raw parallel processing power, which is why they are the usual choice for deep learning applications.

CPU inference at Realm

At Realm, we use CPU machines to serve customer requests for art generation. Working through the request queue is slower than it would be on GPU machines, but because our product delivers goodie bags asynchronously, we can benefit from simpler CPU setups. The slower speed remains acceptable, as we demonstrate later in this article.

3 ways to run diffusion models on CPU

Here are three ways of running diffusion models on a CPU machine:

Baseline: This method relies on the default PyTorch execution, which uses the CPU to perform matrix and vector operations. While this method can be slow and less efficient than using a GPU, recent advancements in CPU architectures have made it more feasible to run deep learning models on a CPU.
OpenVINO conversion: OpenVINO is a toolkit provided by Intel for optimizing deep learning models on CPUs. By converting a PyTorch model to an OpenVINO format, the model can be run more efficiently on a CPU. This method can significantly improve inference speed and reduce CPU usage.
Intel Extension for PyTorch (IPEX): IPEX is a library developed by Intel that optimizes PyTorch models for CPU execution using instruction sets such as Advanced Vector Extensions (AVX-512) and, on the newest chips, Advanced Matrix Extensions (AMX). It can significantly improve the speed and efficiency of running deep learning models on a CPU by making better use of the hardware's parallel matrix and vector operations.
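
The baseline and IPEX options can be sketched roughly as follows with the diffusers library. This is a minimal sketch, not the exact setup benchmarked in this article. The function names, the step count, and the choice to optimize only the UNet are assumptions. The heavy imports are deferred into the functions so the file loads even where the packages are missing, and bfloat16 only pays off on CPUs with hardware support for it.

```python
from contextlib import nullcontext


def load_baseline_pipeline(model_id):
    """Baseline: plain PyTorch execution on the CPU with float32 weights."""
    from diffusers import StableDiffusionPipeline  # deferred: optional dependency

    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    return pipe.to("cpu")


def optimize_with_ipex(pipe):
    """IPEX: optimize the UNet (the most expensive part) for bfloat16 on CPU."""
    import torch
    import intel_extension_for_pytorch as ipex  # deferred: optional dependency

    pipe.unet = ipex.optimize(pipe.unet.eval(), dtype=torch.bfloat16, inplace=True)
    return pipe


def generate(pipe, prompt, use_bf16=False, steps=25):
    """Run the pipeline, optionally under CPU autocast so math runs in bfloat16."""
    import torch

    if use_bf16:
        context = torch.autocast(device_type="cpu", dtype=torch.bfloat16)
    else:
        context = nullcontext()
    with context:
        return pipe(prompt, num_inference_steps=steps).images[0]
```

A typical call would load a checkpoint with `load_baseline_pipeline(...)`, pass the result through `optimize_with_ipex`, and then call `generate(pipe, prompt, use_bf16=True)`. The model ID is left to the reader because it depends on which checkpoint is in use.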

Let’s have a look at the execution time comparison of these three approaches:

*Execution time comparison table, image generation with the diffusers library (source)*

The first column shows a baseline measured on an Intel Xeon Ice Lake chip, the generation that precedes Sapphire Rapids.
The second column shows the same baseline on Intel Sapphire Rapids hardware.
The third and fourth columns show how OpenVINO accelerates generation with dynamic and static shapes (the difference is explained later in this article).
The fifth column shows what the baseline code can achieve when the system is set up properly, i.e. with libjemalloc-dev, intel-mkl, and numactl installed.
The sixth and seventh columns show the results with IPEX and with IPEX plus minor tricks.

Greater power comes with greater costs

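To benefit fully from Intel Extension for PyTorch (IPEX), one needs a recent Intel CPU. AVX-512 is already available on Ice Lake, but the hardware bfloat16 acceleration behind the largest IPEX gains arrives with Advanced Matrix Extensions (AMX), introduced with the Sapphire Rapids chips. These chips are noticeably more expensive than the previous generation (Ice Lake) when renting hardware resources on demand.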

For example, on Google Cloud, a c3-standard-22 (22 vCPUs, 88 GB memory) instance with Sapphire Rapids costs around $1.14884 per hour, while a c2-standard-16 (16 vCPUs, 64 GB memory) instance with Ice Lake costs around $0.8352 per hour. The performance improvements with Sapphire Rapids can be significant, however, especially when using IPEX. If cost is a concern, it may be more economical to use the baseline method or OpenVINO conversion, depending on the specific use case and budget constraints.
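
One way to weigh price against speed is to ask how much faster the pricier instance must be before it becomes cheaper per image. The sketch below uses only the two hourly prices quoted above and treats each instance as a single unit; the function names are illustrative, and no generation times are assumed.

```python
SAPPHIRE_RAPIDS_PRICE_PER_HOUR = 1.14884  # c3-standard-22, USD, as quoted above
ICE_LAKE_PRICE_PER_HOUR = 0.8352          # c2-standard-16, USD, as quoted above


def cost_per_image(price_per_hour, seconds_per_image):
    """Dollars spent on one image when the instance is kept fully busy."""
    return price_per_hour * seconds_per_image / 3600.0


def break_even_speedup(expensive_price, cheap_price):
    """Speedup the expensive instance needs for an equal cost per image."""
    return expensive_price / cheap_price


ratio = break_even_speedup(SAPPHIRE_RAPIDS_PRICE_PER_HOUR, ICE_LAKE_PRICE_PER_HOUR)
print(f"Sapphire Rapids pays off above a {ratio:.2f}x speedup")
```

With these prices the threshold is roughly 1.38x: if a configuration on Sapphire Rapids generates images more than about 38% faster than the Ice Lake alternative, it is also the cheaper option per image.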

OpenVINO and model shape

OpenVINO is a great tool for optimizing deep learning models on the CPU, especially on older hardware. One of the advantages of OpenVINO is the ability to use both static and dynamic model shapes. The static shape means that the dimensions of the model's input and output tensors are fixed at compile time, while the dynamic shape means that the dimensions can vary at runtime.

In general, static shape models can be faster and more efficient than dynamic shape models. However, static shape models can also be more memory-intensive. For example, when using OpenVINO with a static shape model for image generation, the memory usage can be up to four times higher than with a dynamic shape model that outputs images one at a time. It's important to consider the trade-offs between speed, efficiency, and memory usage when choosing between static and dynamic shape models in OpenVINO.
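
With the optimum-intel package, the choice between dynamic and static shapes might look like the sketch below. The function name and default sizes are illustrative assumptions. `export=True` converts the PyTorch weights to OpenVINO format on load, and `reshape` pins the input dimensions before the model is compiled.

```python
def load_openvino_pipeline(model_id, static_shape=False, height=512, width=512):
    """Convert a Stable Diffusion checkpoint to OpenVINO and compile it for CPU."""
    from optimum.intel import OVStableDiffusionPipeline  # deferred: optional dependency

    # compile=False postpones compilation until the shapes are decided.
    pipe = OVStableDiffusionPipeline.from_pretrained(
        model_id, export=True, compile=False
    )
    if static_shape:
        # Fix batch size and resolution so OpenVINO can specialize the graph.
        pipe.reshape(
            batch_size=1, height=height, width=width, num_images_per_prompt=1
        )
    pipe.compile()
    return pipe
```

A pipeline loaded with `static_shape=True` must then be called with the same height, width, and number of images it was reshaped for, while the dynamic variant accepts varying sizes at the cost of some speed.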

BF16 vs F16 vs F32

In recent years, float16 (also known as half-precision) has become a popular format for representing numerical values in deep-learning models. Float16 uses half as many bits as float32 (single precision), which makes it more memory-efficient and faster to compute on GPUs.

Although float16 has lower precision than float32, it can be used without significant loss of accuracy in many deep-learning tasks. Some studies have even suggested that reduced precision can act as a mild regularizer and reduce overfitting.

Although float16 can improve the efficiency of deep learning models on GPUs, it is not always efficient on CPUs, many of which lack native float16 arithmetic. An alternative is bfloat16, which also uses 16 bits but splits them differently: it keeps the 8 exponent bits of float32 and shortens the mantissa to 7 bits, whereas float16 has 5 exponent bits and 10 mantissa bits. As a result, bfloat16 covers the same numeric range as float32 at lower precision, and converting between the two is cheap. This format is supported by recent Intel instruction set extensions (AVX-512 BF16 and AMX), which Intel Extension for PyTorch (IPEX) uses to optimize PyTorch models for CPU execution.

*Bit layout comparison between the three mentioned float number types (source)*
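
The trade-off between the two 16-bit layouts can be checked with nothing but the Python standard library. In this sketch, bfloat16 is emulated by zeroing the lower 16 bits of a float32; real hardware usually rounds to nearest instead of truncating, so this is a simplification, and the function names are illustrative.

```python
import struct


def float32_to_bits(value):
    """Return the 32-bit pattern of a value stored as IEEE float32."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def truncate_to_bfloat16(value):
    """Keep the sign, 8 exponent bits and top 7 mantissa bits; zero the rest."""
    bits = float32_to_bits(value) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def round_to_float16(value):
    """Round-trip through IEEE half precision (5 exponent, 10 mantissa bits)."""
    return struct.unpack(">e", struct.pack(">e", value))[0]


# Precision: float16 resolves finer steps near 1.0 than bfloat16 does.
print(round_to_float16(1.001))      # 1.0009765625
print(truncate_to_bfloat16(1.001))  # 1.0

# Range: bfloat16 inherits float32's range, while float16 tops out at 65504.
print(truncate_to_bfloat16(100000.0))  # 99840.0
try:
    round_to_float16(100000.0)
except OverflowError:
    print("100000.0 does not fit in float16")
```

The output shows both sides of the trade: bfloat16 loses the small fractional part that float16 preserves, but it can still represent a value that overflows float16.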

Reproducibility

One challenge that arises when you mix GPU and CPU neural network jobs is keeping the results consistent for the same random seed.

If you use different GPUs but the same initial random seed, the results will not be bit-for-bit identical. However, the difference in terms of MSE between the two images can be very small, as long as both GPUs generate the same pseudo-random numbers.

*Visualized difference between two generations made on an RTX 3070 and a Tesla M40: almost every pixel is black, which means the two GPUs output the same values (source)*

Things change when we take the GPU out of the setup and still want to generate consistent images across runs. One can achieve consistency in two ways:

Generate everything in float32 on both GPU and CPU (consistency comes at the cost of a severe slowdown).
Generate the random numbers with the CPU pseudo-random algorithm and pass them to the GPU (this loses consistency with seeds previously used with the GPU algorithm).
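
The second option can be sketched in PyTorch as follows. The helper name and default sizes are assumptions. The factor of 8 reflects the downscaling between image and latent resolution in Stable Diffusion v1.x, and 4 is its number of latent channels.

```python
def make_cpu_latents(seed, batch_size=1, channels=4, height=512, width=512):
    """Draw the initial noise on the CPU so every device starts from the same values."""
    import torch  # deferred so the file loads without PyTorch installed

    generator = torch.Generator(device="cpu").manual_seed(seed)
    # Latents are 8x smaller than the output image along each spatial axis.
    shape = (batch_size, channels, height // 8, width // 8)
    return torch.randn(shape, generator=generator, dtype=torch.float32)
```

The resulting tensor can be passed to a diffusers pipeline through its `latents` argument after casting it to the pipeline's device and dtype, so GPU and CPU workers begin denoising from identical noise. Images produced this way will not match those from seeds that were previously fed to a GPU generator.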

Conclusion

In this article, we explored different ways of running the Stable Diffusion model on CPUs, including the baseline method, OpenVINO conversion, and Intel Extension for PyTorch (IPEX). We also discussed the differences between CPUs and GPUs, the trade-offs between static and dynamic model shapes in OpenVINO, and the advantages of using float16 and bfloat16 over float32 in deep learning models.

While GPUs are often more efficient for running deep learning models, CPUs can still be a viable option, especially for smaller-scale projects or when cost is a concern. With recent advancements in CPU architecture and optimization tools like OpenVINO and IPEX, CPUs can offer a reasonable alternative to GPUs for running deep learning models like Stable Diffusion.

At Realm, we utilize CPU machines for deep learning art generation, leveraging our unique delivery system to make up for any potential performance drawbacks. We hope that this article has provided some insights into running the Stable Diffusion model on CPUs and the factors to consider when choosing between different methods and hardware configurations.

Running Stable Diffusion without GPU