Nowadays, GPU accelerators are commonly used to speed up general-purpose computing tasks on a variety of hardware. However, due to the diversity of GPU architectures and processed data, optimization of codes for a particular type of hardware and specific data characteristics can be extremely challenging. The autotuning of performance-relevant source-code parameters allows for automatic optimization of applications and keeps their performance portable. Although the autotuning process typically results in code speed-up, searching the tuning space can bring unacceptable overhead if (i) the tuning space is vast and full of poorly-performing implementations, or (ii) the autotuning process has to be repeated frequently because of changes in processed data or migration to different hardware.
In this paper, we introduce a novel method for searching generic tuning spaces. The tuning spaces can contain tuning parameters changing any user-defined property of the source code. The method takes advantage of collecting hardware performance counters (also known as profiling counters) during empirical tuning. Those counters are used to navigate the searching process towards faster implementations. The method requires the tuning space to be sampled on any GPU. It builds a problem-specific model, which can be used during autotuning on various, even previously unseen inputs or GPUs. Using a set of five benchmarks, we experimentally demonstrate that our method can speed up autotuning when an application needs to be ported to different hardware or when it needs to process data with different characteristics. We also compared our method to state of the art and show that our method is superior in terms of the number of searching steps and typically outperforms other searches in terms of convergence time.
Cryo Electron Microscopy (Cryo-EM) is currently one of the main tools to reveal the structural information of biological specimens at high resolution. Despite the great development of the techniques involved to solve the biological structures with Cryo-EM in the last years, the reconstructed 3D maps can present lower resolution due to errors committed while processing the information acquired by the microscope. One of the main problems comes from the 3D alignment step, which is an error-prone part of the reconstruction workflow due to the very low signal-to-noise ratio (SNR) common in Cryo-EM imaging. In fact, as we will show in this work, it is not unusual to find a disagreement in the alignment parameters in approximately 20-40% of the processed images, when outputs of different alignment algorithms are compared. In this work, we present a novel method to align sets of single particle images in the 3D space, called DeepAlign. Our proposal is based on deep learning networks that have been successfully used in plenty of problems in image classification. Specifically, we propose to design several deep neural networks on a regionalized basis to classify the particle images in sub-regions and, then, make a refinement of the 3D alignment parameters only inside that sub-region. We show that this method results in accurately aligned images, improving the Fourier shell correlation (FSC) resolution obtained with other state-of-the-art methods while decreasing computational time.
Autotuning, the practice of automatic tuning of applications to provide performance portability, has received increased attention in the research community, especially in high performance computing. Ensuring high performance on a variety of hardware usually means modifications to the code, often via different values of a selected set of parameters, such as tiling size, loop unrolling factor, or data layout. However, the search space of all possible combinations of these parameters can be large, which can result in cases where the benefits of autotuning are outweighed by its cost, especially with dynamic tuning. Therefore, estimating the tuning time in advance or shortening the tuning time is very important in dynamic tuning applications. We have found that certain properties of tuning spaces do not vary much when hardware is changed. In this article, we demonstrate that it is possible to use historical data to reliably predict the number of tuning steps that is necessary to find a well‐performing configuration and to reduce the size of the tuning space. We evaluate our hypotheses on a number of HPC benchmarks written in CUDA and OpenCL, using several different generations of GPUs and CPUs.
Cryogenic Electron Microscopy (Cryo-EM) has been established as one of the key players in Structural Biology. It can reconstruct a 3D model of the sample at the near-atomic resolution, which led to a Method of the year award by Nature, and the Nobel Prize in 2017. With the growing number of facilities, faster microscopes, and new imaging techniques, new algorithms are needed to process the so-called movies data produced by the microscopes in real-time, while preserving a high resolution and maximum of additional information. In this article, we present a new algorithm used for movie alignment, called FlexAlign. FlexAlign is able to correctly compensate for the shift produced during the movie acquisition on-the-fly, using the current generation of hardware. The algorithm performs a global and elastic local registration of the movie frames using Cross-Correlation and B-spline interpolation for high precision. We show that our execution time is compatible with real-time correction and that we preserve the high-resolution information up to high frequency.
In recent years, the heterogeneity of both commodity and supercomputers hardware has increased sharply. Accelerators, such as GPUs or Intel Xeon Phi co-processors, are often key to improving speed and energy efficiency of highly-parallel codes. However, due to the complexity of heterogeneous architectures, optimization of codes for a certain type of architecture as well as porting codes across different architectures, while maintaining a comparable level of performance, can be extremely challenging. Addressing the challenges associated with performance optimization and performance portability, autotuning has gained a lot of interest. Autotuning of performance-relevant source-code parameters allows to automatically tune applications without hard coding optimizations and thus helps with keeping the performance portable. In this paper, we introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA. Using our Kernel Tuning Toolkit, we show that with autotuning most of the kernels reach near-peak performance on various GPUs and outperform baseline implementations on CPUs and Xeon Phis. Our evaluation also demonstrates that autotuning is key to performance portability. In addition to offline tuning, we also introduce dynamic autotuning of code optimization parameters during application runtime. With dynamic tuning, the Kernel Tuning Toolkit enables applications to re-tune performance-critical kernels at runtime whenever needed, for example, when input data changes. Although it is generally believed that autotuning spaces tend to be too large to be searched during application runtime, we show that it is not necessarily the case when tuning spaces are designed rationally. Many of our kernels reach near peak-performance with moderately sized tuning spaces that can be searched at runtime with acceptable overhead. Finally we demonstrate, how dynamic performance tuning can be integrated into a real-world application from cryo-electron microscopy domain.
Protein tunnels and channels are attractive targets for drug design. Drug molecules that block the access of substrates or release of products can be efficient modulators of biological activity. Here, we demonstrate the applicability of a newly developed software tool CaverDock for screening databases of drugs against pharmacologically relevant targets. First, we evaluated the effect of rigid and flexible side chains on sets of substrates and inhibitors of seven different proteins. In order to assess the accuracy of our software, we compared the results obtained from CaverDock calculation with experimental data previously collected with heat shock protein 90α. Finally, we tested the virtual screening capabilities of CaverDock with a set of oncological and anti-inflammatory FDA-approved drugs with two molecular targets—cytochrome P450 17A1 and leukotriene A4 hydrolase/aminopeptidase. Calculation of rigid trajectories using four processors took on average 53 min per molecule with 90% successfully calculated cases. The screening identified functional tunnels based on the profile of potential energies of binding and unbinding trajectories. We concluded that CaverDock is a sufficiently fast, robust, and accurate tool for screening binding/unbinding processes of pharmacologically important targets with buried functional sites. The standalone version of CaverDock is available freely at https://loschmidt.chemi.muni.cz/caverdock/ and the web version at https://loschmidt.chemi.muni.cz/caverweb/.
Caver Web 1.0 is a web server for comprehensive analysis of protein tunnels and channels, and study of the ligands’ transport through these transport pathways. Caver Web is the first interactive tool allowing both the analyses within a single graphical user interface. The server is built on top of the abundantly used tunnel detection tool Caver 3.02 and CaverDock 1.0 enabling the study of the ligand transport. The program is easy-to-use as the only required inputs are a protein structure for a tunnel identification and a list of ligands for the transport analysis. The automated guidance procedures assist the users to set up the calculation in a way to obtain biologically relevant results. The identified tunnels, their properties, energy profiles and trajectories for ligands’ passages can be calculated and visualized. The tool is very fast (2–20 min per job) and is applicable even for virtual screening purposes. Its simple setup and comprehensive graphical user interface make the tool accessible for a broad scientific community. The server is freely available at https://loschmidt.chemi.muni.cz/caverweb.
Protein tunnels and channels are key transport pathways that allow ligands to pass between proteins’ external and internal environments. These functionally important structural features warrant detailed attention. It is difficult to study the ligand binding and unbinding processes experimentally, while molecular dynamics simulations can be time-consuming and computationally demanding.
ResultsCaverDock is a new software tool for analysing the ligand passage through the biomolecules. The method uses the optimized docking algorithm of AutoDock Vina for ligand placement docking and implements a parallel heuristic algorithm to search the space of possible trajectories. The duration of the simulations takes from minutes to a few hours. Here we describe the implementation of the method and demonstrate CaverDock’s usability by: (i) comparison of the results with other available tools, (ii) determination of the robustness with large ensembles of ligands and (iii) the analysis and comparison of the ligand trajectories in engineered tunnels. Thorough testing confirms that CaverDock is applicable for the fast analysis of ligand binding and unbinding in fundamental enzymology and protein engineering.
Availability and implementationSupplementary data are available at Bioinformatics online.
Here we present a novel method for the analysis of transport processes in proteins and its implementation called CaverDock. Our method is based on a modified molecular docking algorithm. It iteratively places the ligand along the access tunnel in such a way that the ligand movement is contiguous and the energy is minimized. The result of CaverDock calculation is a ligand trajectory and an energy profile of transport process. CaverDock uses the modified docking program Autodock Vina for molecular docking and implements a parallel heuristic algorithm for searching the space of possible trajectories. Our method lies in between the geometrical approaches and molecular dynamics simulations. Contrary to the geometrical methods, it provides an evaluation of chemical forces. However, it is far less computationally demanding and easier to set up compared to molecular dynamics simulations. CaverDock will find a broad use in the fields of computational enzymology, drug design and protein engineering. The software is available free of charge to the academic users at https://loschmidt.chemi.muni.cz/caverdock/.
Cryo-electron microscopy is a popular method for macromolecules structure determination. Reconstruction of a 3-D volume from raw data obtained from a microscope is highly computationally demanding. Thus, acceleration of the reconstruction has a great practical value. In this article, we introduce a novel graphics processing unit (GPU)-friendly algorithm for direct Fourier reconstruction, one of the main computational bottlenecks in the 3-D volume reconstruction pipeline for some experimental cases (particularly those with a large number of images and a high internal symmetry). Contrary to the state of the art, our algorithm uses a gather memory pattern, improving cache locality and removing race conditions in parallel writing into the 3-D volume. We also introduce a finely tuned CUDA implementation of our algorithm, using auto-tuning to search for a combination of optimization parameters maximizing performance on a given GPU architecture. Our CUDA implementation is integrated in widely used software Xmipp, version 3.19, reaching 11.4x speedup compared to the original parallel CPU implementation using GPU with comparable power consumption. Moreover, we have reached 31.7x speedup using four GPUs and 2.14x–5.96x speedup compared to optimized GPU implementation based on a scatter memory pattern.
Fast Fourier transform (FFT) has many applications. It is often one of the most computationally demanding kernels, so a lot of attention has been invested into tuning its performance on various hardware devices. However, FFT libraries have usually many possible settings and it is not always easy to deduce which settings should be used for optimal performance. In practice, we can often slightly modify the FFT settings, for example, we can pad or crop input data. Surprisingly, a majority of state-of-the-art papers focus to answer the question how to implement FFT under given settings but do not pay much attention to the question which settings result in the fastest computation. In this paper, we target a popular implementation of FFT for GPU accelerators, the cuFFT library. We analyze the behavior and the performance of the cuFFT library with respect to input sizes and plan settings. We also present a new tool, cuFFTAdvisor, which proposes and by means of autotuning finds the best configuration of the library for given constraints of input size and plan settings. We experimentally show that our tool is able to propose different settings of the transformation, resulting in an average 6x speedup using fast heuristics and 6.9x speedup using autotuning.
Autotuning is an important method for automatically exploring code optimizations. It may target low-level code optimizations, such as memory blocking, loop unrolling or memory prefetching, as well as high-level optimizations, such as placement of computation kernels on proper hardware devices, optimizing memory transfers between nodes or between accelerators and main memory. In this paper, we introduce an autotuning method, which extends state-of-the-art low-level tuning of OpenCL or CUDA kernels towards more complex optimizations. More precisely, we introduce a Kernel Tuning Toolkit (KTT), which implements inter-kernel global optimizations, allowing to tune parameters affecting multiple kernels or also the host code. We demonstrate on practical examples, that with global kernel optimizations we are able to explore tuning options that are not possible if kernels are tuned separately. Moreover, our tuning strategies can take into account numerical accuracy across multiple kernel invocations and search for implementations within specific numerical error bounds.
Kernel fusion is an optimization method, in which the code from several kernels is composed to create a new, fused kernel. It can push the performance of kernels beyond limits given for their isolated, unfused form. In this paper, we introduce a classification of different types of kernel fusion for both data dependent and data independent kernels. We study kernel fusion on three types of OpenCL devices: GPU, Xeon Phi and CPU. Those hardware platforms have quite different properties, thus, kernel fusion often affects performance in quite different ways. We analyze the impact of kernel fusion on those hardware platforms and show how it can be used to improve performance. Based on our study we also introduce a basic transformation method for generating fused kernels, which has good potential to be automatized.
In this paper, we introduce the GPU acceleration of dRMSD algorithm, used to compare different structures of a molecule. Comparing to multithreaded CPU implementation, we have reached 13.4x speedup in clustering and 62.7x speedup in 1:1 dRMSD computation using mid-end GPU. The dRMSD computation exposes strong memory locality and thus is compute-bound. Along with conservative implementation using shared memory, we have decided to implement variants of the algorithm using GPU caches to maintain memory locality. Our implementation using cache reaches 96.5 % and 91.6 % of shared memory performance on Fermi and Maxwell, respectively. We have identified several performance pitfalls related to cache blocking in compute-bound codes and suggested optimization techniques to improve the performance.
Contemporary GPUs have significantly higher arithmetic throughput than a memory throughput. Hence, many GPU kernels are memory bound and cannot exploit arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector?vector) and BLAS-2 (matrix?vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing shared data, originally passed via off-chip global memory, into a faster, but distributed on-chip memory. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusions. Compared with similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.24x faster for the examples tested.
The element subroutines in finite element method (FEM) provides enough parallelism to be successfully accelerated by contemporary GPUs. However, their efficient implementation is not straightforward and requires time-consuming exploration of numerous implementation variants. In this paper, we present optimization by kernel fusion for element subroutines. Moreover, we show how the optimization is automated using our source-to-source compiler. We demonstrate the optimization of the element subroutines for FEM model using St.Venant-Kirchhoff material. The performance of code generated by our compiler outperforms our previously published hand-tuned implementation by factor of 1.32 -- 1.54 depending on used GPU architecture. Although the optimization technique is demonstrated on element subroutines for using St.Venant-Kirchhoff material, it is generally usable for wider area of computationally-demanding problems.
When implementing a function mapping on the contemporary GPU, several contradictory performance factors affecting distribution of computation into GPU kernels have to be balanced. A decomposition-fusion scheme suggests to decompose the computational problem to be sol\-ved by several simple functions implemented as standalone kernels and to fuse some of these functions later into more complex kernels to improve memory locality. In this paper, a prototype of source-to-source compiler automating the fusion phase is presented and the impact of fusions generated by the compiler as well as compiler efficiency is experimentally evaluated.
Haptic rendering is an important area of research that enables the user to employ haptic perception in human-computer interaction. An important motivation here is to use the human touch to study the behavior of various models. However, the high refresh rate needed for a stable haptic interaction on the one hand and the high-computational-cost characteristic for the simulation of numerous phenomena on the other hand represent a big issue. In this paper, an approach based on the distributed construction of configuration spaces is presented. The main idea behind this approach is to profit from employing a high-performance environment (e. g., computational grid) to overcome or at least moderate the high-frequency issue. The approach is presented using nonlinear deformation models, which are essential for realistic modeling of soft tissues. A distributed algorithm is presented, and its properties are evaluated quantitatively.
The AutoDock suite is widely used molecular docking software consisting of two main programs -- AutoGrid for precomputation of potential grid maps and AutoDock for docking into potential grid maps. In this paper, the acceleration of potential maps generation based on AutoGrid and its implementation called FastGrid is presented. The most computationally expensive algorithms are accelerated using GPU, the rest of algorithms run on CPU with asymptotically lower time complexity that has been obtained using more sophisticated data structures than in the original AutoGrid code. Moreover, the CPU implementation is parallelized to fully exploit computational power of machines that are equipped with more CPU cores than GPUs. Our implementation outperforms original AutoGrid more than 400x for large, but quite common molecules and sufficiently large grids.
Soft tissue modelling is important in the realm of haptic interactions. The main challenge in this research area is to combine two basic conditions which are essential - the stability of the haptic interaction running on high refresh rate on one hand and realistic behavior of the tissue assuming computationally expensive mathematical models on the other. In this paper, a distributed algorithm addressing this challenge is presented. The algorithm is based on the precomputation-interpolation scheme when the force feedback for the actual position of the haptic interaction point (HIP) is computed by a fast interpolation from precomputed data running inside the haptic loop. The standard precomputation-interpolation scheme is modified, however, so that the data needed for the interpolation are generated directly during the interaction. The distributed algorithm and the underlying architecture are presented together with a prototype implementation. Preliminary evaluation of the algorithm using non-linear models of a human liver having more than 1700 elements shows the feasibility of the proposed approach.