Interactive viscoelastic fluid simulation
For my major project, I wanted to design and develop a fluid application from the ground up with performance and user interactivity in mind. While researching various SIGGRAPH papers on fluid solvers, I came across the 2013 Position Based Fluids paper by Miles Macklin and Matthias Müller. It describes a method that is robust and suitable for real-time applications such as games. Rather than relying on a traditional explicit time integration scheme, position based dynamics works directly with the positions of the particles, which sidesteps the instabilities of explicit integration such as energy gain. So I decided this would be a good paper to implement.
The position based fluid solver is a Lagrangian method for solving the Navier-Stokes equations. It is built on core principles of the smoothed particle hydrodynamics (SPH) method, such as density estimation and smoothing kernel functions. Enforcing incompressibility is computationally expensive in other SPH methods; this paper instead enforces it with an iterative density solver that satisfies a system of non-linear constraints for constant density. This gives the system greater stability, so a larger time step can be used in the simulation, which is desirable for games applications. For my application I wanted additional control over the type of fluid effects produced, so I researched viscoelasticity, a distinctive fluid dynamics property, and implemented it in the solver.
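Concretely, each particle i carries a density constraint of the following form (this is the formulation from the Position Based Fluids paper, where ρ₀ is the rest density, W is an SPH smoothing kernel and h is the kernel radius):

$$C_i(\mathbf{p}_1,\dots,\mathbf{p}_n) = \frac{\rho_i}{\rho_0} - 1 = 0, \qquad \rho_i = \sum_j m_j\,W(\mathbf{p}_i - \mathbf{p}_j,\,h)$$

Each solver iteration computes a position correction along the constraint gradients that drives every C_i towards zero, which is what allows the larger time step.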
By taking advantage of the GPU architecture and the huge number of cores available for parallel processing, I have implemented the position based solver with viscoelasticity on the GPU using Compute Unified Device Architecture (CUDA) and the Thrust C++ library. Thrust is a parallel algorithms library that resembles the C++ STL; Thrust device and host vectors are used for storing and manipulating the fluid data. As with any particle based simulation where particles are only influenced by neighbouring particles, we must avoid a brute force approach: for each particle, the equations are evaluated only over its neighbours, so that the algorithm's time complexity does not degrade to O(n²).
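For illustration, holding the particle state in a Thrust device vector looks roughly like this (a hypothetical, trimmed-down struct, matching the array-of-structures layout discussed in the profiling notes further down):

```cpp
#include <thrust/device_vector.h>
#include <vector_types.h> // float4

// Trimmed-down, illustrative particle state (array-of-structures layout).
struct Particle {
    float4 position;
    float4 velocity;
    float  density;
};

int main()
{
    // Thrust manages the GPU allocation like std::vector manages heap
    // memory, and frees it when the vector goes out of scope.
    thrust::device_vector<Particle> d_particles(35000);

    // A raw device pointer can be passed straight into a CUDA kernel.
    Particle* ptr = thrust::raw_pointer_cast(d_particles.data());
    (void)ptr;
    return 0;
}
```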
The particle neighbours are computed using a uniform grid data structure implemented on the GPU. First, each particle is given a key by hashing its current position; this identifies the grid cell the particle belongs to. The particles are then sorted by this key. Next, the number of particles in each cell is computed and stored in a device vector, and a Thrust exclusive scan over this array gives the index of the first particle in each cell. Pointers to these arrays are passed to the CUDA kernels, where pointer arithmetic locates the neighbouring particles. When I first implemented the neighbour search, sorting the particle array itself by key caused a jittering issue. This may be because the particle array is a shared resource: the OpenGL vertex buffer object has to be mapped and unmapped for CUDA interoperability. To overcome this, I created another device vector storing the original particle indices and applied the sort by key operation to that array instead; the CUDA kernels then use the sorted index array to look up the correct particles in the particle array. The main issue with this method is that the OpenGL buffer itself remains unsorted, so accessing it reads from non-contiguous global memory, which is a bottleneck.
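A condensed sketch of the grid build using Thrust (illustrative names; the per-particle keys and per-cell counts are assumed to have been filled by small CUDA kernels beforehand):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/scan.h>

// cellKeys[i]  : linearised cell of particle i, e.g. (z * 128 + y) * 128 + x
// cellCounts[c]: number of particles hashed into cell c
void buildGrid(thrust::device_vector<unsigned int>& cellKeys,
               thrust::device_vector<unsigned int>& sortedIndex,
               thrust::device_vector<unsigned int>& cellCounts,
               thrust::device_vector<unsigned int>& cellStarts)
{
    // Sort a separate index array by cell key so the OpenGL-backed
    // particle buffer itself stays untouched (the workaround above).
    thrust::sequence(sortedIndex.begin(), sortedIndex.end());
    thrust::sort_by_key(cellKeys.begin(), cellKeys.end(), sortedIndex.begin());

    // An exclusive scan over the per-cell counts yields the offset of the
    // first particle of every cell in the sorted ordering.
    thrust::exclusive_scan(cellCounts.begin(), cellCounts.end(),
                           cellStarts.begin());
}
```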
Once the solver update step is finished, the fluid is rendered in screen space using several render passes with GLSL shaders, each drawing into an off-screen frame buffer object. The full details of this technique can be found in the fluid curvature paper in the References section. The particle depth is smoothed using a bilateral filter, separated into two passes for efficiency. The smoothing step ping-pongs between two frame buffer objects so that the number of iterations, and therefore the smoothness of the final surface, can be tuned for performance or visual quality. The main problem I encountered with this technique is that because the final render pass draws a full-screen quad, the fluid would always appear in front of any scene geometry. To fix this, I set up my own stencil-style operation: a render buffer object is attached to the frame buffer object, allowing the scene geometry and particles to be rendered to the off-screen target with depth testing. In this pass the scene geometry is written with a specific colour, and in the final render pass the texture is sampled to discard fluid pixels that overlap the scene geometry.
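The ping-pong between the two smoothing targets looks roughly like this (a sketch, assuming the two FBOs and their depth textures are already configured; drawFullScreenQuad and bilateralBlurShader are hypothetical helpers):

```cpp
#include <utility> // std::swap
// GL headers assumed via the engine (e.g. GLEW).

void smoothDepth(GLuint fbos[2], GLuint depthTex[2], int smoothingIterations)
{
    int src = 0, dst = 1;
    for (int i = 0; i < smoothingIterations; ++i) {
        // Write this blur pass into one FBO while sampling the depth
        // texture produced by the previous pass.
        glBindFramebuffer(GL_FRAMEBUFFER, fbos[dst]);
        glBindTexture(GL_TEXTURE_2D, depthTex[src]);
        drawFullScreenQuad(bilateralBlurShader); // hypothetical helpers
        std::swap(src, dst); // ping-pong the targets for the next pass
    }
    // depthTex[src] now holds the most recently smoothed depth.
}
```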
This project has been integrated with Gameplay3D, an open source C++ game engine. Bullet Physics handles the physics for the scene geometry, but since Bullet runs on the CPU and my fluid solver runs on the GPU, I had to set up my own basic collision detection and response. I implemented axis-aligned bounding box and bounding sphere checks. For the collision response I used a simple projection method, where penetrating particles are projected back onto the object surface. This is done inside the solver iteration loop on the GPU, as the position based solver allows positions to be modified directly while maintaining stability.
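For the bounding sphere case, the projection response amounts to pushing a penetrating particle out along the contact normal; a minimal sketch with illustrative names, not my exact kernel code:

```cuda
// Project a particle out of a bounding sphere along the contact normal.
// Runs inside the solver iteration loop, so correcting the position
// directly is safe in a position based solver.
__device__ float3 projectOutOfSphere(float3 p, float3 centre, float radius)
{
    float3 d = make_float3(p.x - centre.x, p.y - centre.y, p.z - centre.z);
    float dist = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
    if (dist >= radius || dist < 1e-6f) {
        return p; // outside the sphere (or degenerate) - no correction
    }
    float s = radius / dist; // scale that puts the particle on the surface
    return make_float3(centre.x + d.x * s,
                       centre.y + d.y * s,
                       centre.z + d.z * s);
}
```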
Development platform: Linux
GPU: GeForce GTX 560 Ti
Compute capability: 2.1
Solver iterations: 2
Uniform grid size: 128x128x128
Performance: roughly 60 fps with 35,000 particles
Overall, I think this project has been successful in fulfilling most of the goals I set out to achieve. The application is designed to support multiple fluid objects with different properties, and the user can interact with the fluid using various forces. The proof-of-concept game idea, which could be developed further, is to navigate the fluid to the end of the level with as many particles as possible in a limited amount of time.
I used NVIDIA's Visual Profiler to analyse the bottlenecks; it showed that memory operations and latency were the main issues. This is as expected, as there are a lot of global memory reads and writes in the CUDA kernels. Global memory is a virtual address space that can be mapped to device DRAM and can be accessed and modified by both the host (CPU) and the device (GPU).

One way to optimise further would be to load each particle's neighbour data into shared memory. Shared memory resides on the streaming multiprocessor chip, so it is faster to read and has much lower latency; GPUs with compute capability 2.0 or higher provide up to 48KB of shared memory per streaming multiprocessor. It is allocated per thread block, so only threads in that block can access the data. The fluid data would be read from global memory into shared memory, then a thread synchronisation call waits until all the data is loaded. As the fluid calculations access this data multiple times, performing these operations out of shared memory is much faster. Shared memory is divided into memory banks, however, which introduces the possibility of bank conflicts: when two threads in the same warp access different addresses in the same memory bank, the accesses cannot be served in parallel and are serialised. The bandwidth of shared memory depends on the graphics card but is usually 32 bits per bank per clock cycle, and the warp size and number of banks depend on the compute capability. Bank conflicts can be avoided with a well designed array access pattern; for example, if each thread in a half-warp accesses successive 32-bit values, there should be no conflicts.
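A sketch of what the shared memory staging could look like (a hypothetical kernel assuming a block size of 256; the real kernels would stage neighbour cell data rather than a simple block-aligned tile):

```cuda
__global__ void densityKernel(const float4* positions, float* densities, int n)
{
    // Stage one block-sized tile of positions in shared memory so the
    // repeated reads inside the inner loop hit fast on-chip memory
    // instead of global DRAM.
    __shared__ float4 tile[256]; // one entry per thread in the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = positions[i];
    }
    __syncthreads(); // wait until every thread has loaded its entry

    // ... density estimation would now read neighbour data from `tile`
    // rather than re-fetching it from global memory ...
}
```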
This leads to another issue to consider in my CUDA application: the memory access pattern. Currently all the fluid data is stored as an array of structures, so when a CUDA kernel modifies only one specific attribute for all particles, the memory reads are not contiguous, because each whole structure is stored as a block. The alternative is a structure of arrays, where each attribute has its own separate array; when modifying the density attribute, for example, the data for all particles would then be contiguous in memory. This means one memory read can load data for multiple particles into cache for faster access.
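The two layouts side by side (a simplified sketch; the real particle state carries more attributes):

```cpp
#include <vector_types.h> // float4

// Array of structures (current layout): reading only `density` for all
// particles strides over the whole struct, wasting bandwidth.
struct ParticleAoS {
    float4 position;
    float4 velocity;
    float  density;
};

// Structure of arrays (proposed layout): each attribute is contiguous,
// so a warp touching density[i], density[i+1], ... reads coalesced,
// cache-friendly memory.
struct ParticlesSoA {
    float4* position;
    float4* velocity;
    float*  density;
};
```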
The collision detection currently works on basic primitives. This could be expanded to more complex objects by computing a signed distance field for each object, or by sampling an object so that it is represented by tightly packed small particles. The rendering of the surface could also be improved by implementing curvature flow, as proposed in the research paper, instead of the smoothing filter.
References:
Macklin, M. and Müller, M. "Position Based Fluids", SIGGRAPH 2013. matthias-mueller-fischer.ch/publications/pbf_sig_preprint.pdf
van der Laan, W. J., Green, S. and Sainz, M. "Screen Space Fluid Rendering with Curvature Flow", I3D 2009. cs.rug.nl/~roe/publications/fluidcurvature.pdf
Clavet, S., Beaudoin, P. and Poulin, P. "Particle-based Viscoelastic Fluid Simulation", SCA 2005. ligum.umontreal.ca/Clavet-2005-PVFS/pvfs.pdf