cuQuantum

Accelerate quantum computing research.

Quantum computing has the potential to offer giant leaps in computational capabilities. The ability of scientists, developers, and researchers to simulate quantum circuits on classical computers is vital to getting us there.

NVIDIA cuQuantum is an SDK of optimized libraries and tools for accelerating quantum computing workflows. With NVIDIA Tensor Core GPUs, developers can use cuQuantum to accelerate quantum circuit simulations based on state vector and tensor network methods by orders of magnitude.


NVIDIA cuQuantum SDK

Looking to Run in the Cloud?

Run on AWS   |   Run on Azure   |   Run on GCP   |   Run on OCI


Quick Links

cuQuantum Appliance

A full simulation stack based on cuQuantum in a ready-to-deploy container.

Documentation

Documentation for cuQuantum and the cuQuantum Appliance.

GitHub

The cuQuantum public repository, including the cuQuantum Python bindings and examples.

Latest Notes

The cuQuantum release notes, including the latest features.

NVIDIA cuQuantum Appliance

cuQuantum Appliance helps developers get started by making simulation software available in a container optimized to run on the latest NVIDIA DGX™ and HGX™ systems.

The stack includes Google’s Cirq framework and qsim simulator along with NVIDIA cuQuantum.

The appliance software achieved best-in-class performance on key problems in quantum computing, including Shor’s algorithm, random quantum circuits, and quantum Fourier transform. Recent software updates to our container offering have enabled a 4.4X speedup over previously reported numbers. Combined with ~2x speedups offered by Hopper GPUs, users see even greater speedups over CPU implementations despite CPU hardware and software improvements.

cuQuantum Appliance is available now in the NVIDIA® NGC™ catalog and as a machine image on each major cloud marketplace.

Multi-GPU Speedups

cuQuantum Appliance speeds up simulations of popular quantum algorithms like quantum Fourier transform, Shor’s algorithm, and quantum supremacy circuits by 90-369x on NVIDIA H100 80GB Tensor Core GPUs over CPU implementations on dual Intel Xeon Platinum 8480C CPUs.

Multi-Node Speedups

Line graph showing weak scaling comparison of multi-node simulators

Performance is benchmarked using Quantum Volume at depths of 10 and 30, along with QAOA and a small Quantum Phase Estimation, run on NVIDIA H100 80GB GPUs. On average, cuQuantum on H100 GPUs is ~2x faster than on A100 GPUs.

Our latest multi-node update introduces support for IBM’s Qiskit Aer, which enables users to scale their Qiskit code with no code changes to the largest NVIDIA machines.

This new capability enables users of the NVIDIA Quantum platform to achieve the most performant quantum circuit simulations at supercomputer scales. On key problems like Quantum Phase Estimation, QAOA, Quantum Volume, and more, the newest cuQuantum Appliance is over two orders of magnitude faster than previous implementations, and seamlessly scales from a single GPU to a supercomputer.

cuQuantum Appliance users are only restricted by the number of GPUs they have access to.


Features and Benefits

cuQuantum SDK offers two flexible accelerated quantum circuit simulation methods

Flexible

Choose the best approach for your work from algorithm-agnostic accelerated quantum circuit simulation methods.

State vector method features include optimized memory management and math kernels, efficient index bit swaps, gate application kernels, and probability array calculations for qubit sets.
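To make "probability array calculations for qubit sets" concrete, here is a minimal pure-Python sketch of the underlying idea: summing squared amplitudes over all qubits outside a chosen set. The function name `marginal_probabilities` and the little-endian qubit ordering are assumptions made for illustration only. This is not the cuStateVec API, which performs the equivalent reduction on the GPU.

```python
def marginal_probabilities(state, qubits):
    """Return the probability of each bitstring on `qubits`.

    `state` is a list of 2**n complex amplitudes, with qubit 0 stored in the
    least significant bit of the index (an assumed convention).
    `qubits[0]` becomes the least significant bit of the output index.
    """
    probs = [0.0] * (1 << len(qubits))
    for index, amplitude in enumerate(state):
        key = 0
        for position, qubit in enumerate(qubits):
            # Copy the selected qubit's bit into the output index.
            key |= ((index >> qubit) & 1) << position
        probs[key] += abs(amplitude) ** 2
    return probs


# Three-qubit GHZ state: (|000> + |111>) / sqrt(2).
amp = 2 ** -0.5
ghz = [amp, 0, 0, 0, 0, 0, 0, amp]

# Marginal distribution over qubits 0 and 2 (qubit 1 is summed out).
marginal = marginal_probabilities(ghz, [0, 2])
print(marginal)
```

For the GHZ state, the two selected qubits are perfectly correlated, so only the outcomes 00 and 11 carry probability.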

Tensor network method features include accelerated tensor and tensor network contraction, order optimization, approximate contractions, and multi-GPU contractions.

cuQuantum SDK offers scalable options with multi-node, multi-GPU clusters

Scalable

Leverage the power of multi-node, multi-GPU clusters using the latest GPUs on premises or in the cloud.

Low-level C++ APIs provide increased control and flexibility for a single GPU and single-node multi-GPU clusters.

The high-level Python API supports drop-in multi-node execution.

cuQuantum SDK can simulate bigger problems faster and get more work done sooner.

Fast

Simulate bigger problems faster and get more work done sooner.

Using an NVIDIA H100 Tensor Core GPU over CPU implementations delivers orders-of-magnitude speedups on key quantum problems, including random quantum circuits, Shor’s algorithm, and the Variational Quantum Eigensolver.

Leveraging the NVIDIA Selene supercomputer, cuQuantum generated a sample from a full-circuit simulation of the Google Sycamore processor in less than 10 minutes.


Framework Integrations

cuQuantum is integrated with leading quantum circuit simulation frameworks.

Download cuQuantum to dramatically accelerate performance using your framework of choice, with zero code changes.

cuQuantum is integrated with Amazon Web Services (AWS)
cuQuantum is integrated with blueqat
cuQuantum is integrated with Cirq
cuQuantum is integrated with ExaTN
cuQuantum is integrated with Orquestra
cuQuantum is integrated with PennyLane
cuQuantum is integrated with Qibo
cuQuantum is integrated with Qiskit
cuQuantum is integrated with QuEST
cuQuantum is integrated with TKET
cuQuantum is integrated with TorchQuantum
cuQuantum is integrated with XACC Quantum Framework

Performance

State Vector Method

Quantum Machine Learning

CPU vs Single GPU (1 thread and 32 thread comparisons)

Line graph showing CPU vs Single GPU (1 thread and 32 thread comparisons)

Evaluation of the Jacobian of a strongly entangling layered circuit using adjoint backpropagation. lightning.gpu was run on an NVIDIA DGX A100 and compared to lightning.qubit on an EPYC 7742 CPU. Results are averaged across three runs.

State vector simulation tracks the entire state of the system over time, through each gate operation. It’s an excellent tool for simulating deep or highly entangled quantum circuits, and for simulating noisy qubits.
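As a rough illustration of what tracking the full state through each gate means, the sketch below applies a Hadamard gate and a CNOT to a two-qubit state vector in plain Python. The helper names and the little-endian qubit ordering are assumptions for this example. It is a CPU toy model of the technique, not how cuStateVec implements its gate application kernels.

```python
import math


def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 `gate` to qubit `target` of a state vector (list of complex)."""
    new_state = list(state)
    stride = 1 << target
    for i in range(len(state)):
        if (i & stride) == 0:
            j = i | stride  # Partner index with the target bit set.
            a0, a1 = state[i], state[j]
            new_state[i] = gate[0][0] * a0 + gate[0][1] * a1
            new_state[j] = gate[1][0] * a0 + gate[1][1] * a1
    return new_state


def apply_cnot(state, control, target):
    """Swap amplitude pairs that differ in `target` wherever `control` is 1."""
    new_state = list(state)
    for i in range(len(state)):
        if ((i >> control) & 1) == 1 and ((i >> target) & 1) == 0:
            j = i | (1 << target)
            new_state[i], new_state[j] = state[j], state[i]
    return new_state


s = 1 / math.sqrt(2)
HADAMARD = [[s, s], [s, -s]]

# Start in |00>, then build a Bell state: H on qubit 0, CNOT(0 -> 1).
state = [1 + 0j, 0j, 0j, 0j]
state = apply_single_qubit_gate(state, HADAMARD, target=0)
state = apply_cnot(state, control=0, target=1)

bell_probs = [abs(a) ** 2 for a in state]
print(bell_probs)
```

Every gate touches all 2^n amplitudes, which is why the method handles deep or highly entangled circuits well but needs memory that grows exponentially with qubit count.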

An NVIDIA DGX™ A100 system with eight NVIDIA A100 80GB Tensor Core GPUs can simulate up to 36 qubits, delivering an orders-of-magnitude speedup on leading state vector simulations over a dual-socket CPU server.
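The 36-qubit figure is consistent with a simple memory estimate. The sketch below assumes single-precision complex amplitudes (8 bytes each), treats each GPU as exactly 80 x 10^9 bytes, and ignores workspace and communication buffers. These are illustrative assumptions, not published cuQuantum sizing rules.

```python
def max_state_vector_qubits(total_memory_bytes, bytes_per_amplitude):
    """Largest n such that 2**n amplitudes fit in the given memory."""
    amplitudes = total_memory_bytes // bytes_per_amplitude
    # bit_length() - 1 is floor(log2(amplitudes)) for positive integers.
    return amplitudes.bit_length() - 1


GPU_MEMORY_BYTES = 80 * 10**9  # Assumed usable memory per A100 80GB GPU.
NUM_GPUS = 8                   # GPUs in one DGX A100 system.
total_memory = GPU_MEMORY_BYTES * NUM_GPUS

qubits_single = max_state_vector_qubits(total_memory, 8)   # complex64
qubits_double = max_state_vector_qubits(total_memory, 16)  # complex128

print(qubits_single)  # 36
print(qubits_double)  # 35
```

Each additional qubit doubles the memory requirement, so moving from single to double precision costs exactly one qubit under this model.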

cuStateVec has been adopted by leading publicly available simulators, including integrations into AWS Braket, Google Cirq's qsim simulator, the IBM Qiskit Aer simulator, and Xanadu’s PennyLane Lightning simulator. Users leveraging lightning.gpu on AWS Braket experienced 900X speedups and saved 3.5X on costs. cuStateVec will soon support an even wider range of frameworks and simulators. Read the NVIDIA Technical Blog for more details.

Tensor Network Method

Tensor network methods are rapidly gaining popularity for simulating hundreds or thousands of qubits for near-term quantum algorithms. Tensor networks scale with the number of quantum gates rather than the number of qubits. This makes it possible to simulate very large qubit counts with smaller gate counts on large supercomputers.

Tensor contractions dramatically reduce the memory requirement for running a circuit on a tensor network simulator. The research community is investing heavily in improving pathfinding methods for quickly finding near-optimal tensor contractions before running a simulation.

cuTensorNet provides state-of-the-art performance for both the pathfinding and contraction stages of tensor network simulation. See the NVIDIA Technical Blog for more details.
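To see why pathfinding matters, consider the simplest tensor network: a chain of matrices. The sketch below compares the cost of a naive left-to-right contraction with the best order found by the textbook matrix-chain dynamic program. The matrix shapes are made up for illustration, and this is not the algorithm cuTensorNet uses. Real pathfinders handle general networks, where finding the optimal order is much harder.

```python
from functools import lru_cache


def left_to_right_cost(dims):
    """Scalar multiplications to contract the chain strictly left to right.

    Matrix k has shape dims[k] x dims[k + 1].
    """
    cost = 0
    rows = dims[0]
    for k in range(1, len(dims) - 1):
        cost += rows * dims[k] * dims[k + 1]
    return cost


def optimal_cost(dims):
    """Minimum scalar multiplications over all contraction orders."""
    dims = tuple(dims)

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest way to contract matrices i..j (inclusive).
        if i == j:
            return 0
        return min(
            best(i, k) + best(k + 1, j) + dims[i] * dims[k + 1] * dims[j + 1]
            for k in range(i, j)
        )

    return best(0, len(dims) - 2)


# Hypothetical chain: A (1000 x 2), B (2 x 1000), C (1000 x 2).
dims = [1000, 2, 1000, 2]
naive = left_to_right_cost(dims)
best_found = optimal_cost(dims)

print(naive)       # 4000000
print(best_found)  # 8000
```

Contracting B with C first also keeps the intermediate at 2 x 2 instead of 1000 x 1000, so a good path lowers memory use as well as operation count.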

Using cuQuantum, NVIDIA researchers were able to simulate a variational quantum algorithm for solving the MaxCut optimization problem using 1,688 qubits to encode 3,375 vertices on an NVIDIA DGX SuperPOD™ system, a 16X improvement over the previous largest simulation — and multiple orders of magnitude larger than the largest problem run on quantum hardware to date.

Pathfinding and Contraction Performance

State-of-the-Art Performance for Pathfinding

Bar chart showing state-of-the-art performance for Pathfinding

Performance for cuTensorNet pathfinding compared to Cotengra, measured in seconds per sample. Both runs use a single core of an EPYC 7742 CPU.

Sycamore refers to 53-qubit random quantum circuits of depth 10, 12, 14, and 20 from Arute et al., Quantum Supremacy Using a Programmable Superconducting Processor.
www.nature.com/articles/s41586-019-1666-5

Cotengra: Gray & Kourtis, Hyper-optimized Tensor Network Contraction, 2021.
quantum-journal.org/papers/q-2021-03-15-410

State-of-the-Art Performance for Contraction Time

Contraction performance for cuTensorNet compared to PyTorch, CuPy, and NumPy. All runs use the same best contraction path. cuTensorNet, CuPy, and PyTorch each ran on one NVIDIA A100 GPU; NumPy ran on a single-socket EPYC 7742. CuPy and NumPy cannot execute Sycamore depth 12 and 14 because they limit tensors to a maximum rank of 32, and both circuits contain tensors that exceed this limit.

BQSKit: circuits with 48 and 64 qubits: Berkeley Quantum Synthesis Toolkit https://github.com/BQSKit/bqskit
QAOA: 36 qubits with 4 parameters
PEPS: tensor network with dimensions of 3x3 and operator depth 30.

Approximate Tensor Network Methods

Line graph showing Matrix Product States (MPS) gate split performance measurement

MPS gate split performance is measured in execution time as a function of bond dimension. We execute this on an NVIDIA A100 80GB GPU and compare it to NumPy running on an EPYC 7742 data center CPU.

As the quantum problems of interest can greatly vary in both size and complexity, researchers have developed highly customized approximate tensor network algorithms to address the gamut of possibilities. To enable easy integration with these frameworks and libraries, cuTensorNet provides a set of APIs to cover the following common use cases: Tensor QR, Tensor SVD, and Gate Split.

These primitives enable users to accelerate and scale different types of quantum circuit simulators. A common approach to simulating quantum computers that takes advantage of these methods is matrix product states (MPS, also known as tensor train). Users can leverage these new cuTensorNet APIs to accelerate MPS-based quantum circuit simulators.
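A short calculation shows why bond dimension is the key parameter for MPS simulators. The sketch below counts the stored elements of an open-boundary MPS whose bonds are capped at a maximum dimension, and compares that with a full state vector. The function names and the example sizes are assumptions for illustration. The count ignores the truncation error that capping the bond dimension introduces.

```python
def mps_bond_dimensions(num_qubits, max_bond):
    """Bond dimensions of an open-boundary MPS, capped at `max_bond`.

    Bond k sits between site k - 1 and site k; the two outer bonds have
    dimension 1. Without a cap, bond k needs at most min(2**k, 2**(n - k)).
    """
    return [
        min(2 ** k, 2 ** (num_qubits - k), max_bond)
        for k in range(num_qubits + 1)
    ]


def mps_element_count(num_qubits, max_bond):
    """Total elements over all site tensors of shape (left, 2, right)."""
    bonds = mps_bond_dimensions(num_qubits, max_bond)
    return sum(bonds[k] * 2 * bonds[k + 1] for k in range(num_qubits))


# Hypothetical example: 50 qubits with bond dimension capped at 64.
mps_elements = mps_element_count(50, 64)
state_vector_elements = 2 ** 50

print(mps_elements, state_vector_elements)
```

Because the cost of each Gate Split or Tensor SVD step grows with the bond dimension, execution time as a function of bond dimension is the natural way to benchmark these primitives.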

The Gate Split and Tensor SVD APIs enable nearly an order-of-magnitude speedup over state-of-the-art CPU implementations. Tensor QR is the most efficient, with a speedup of nearly two orders of magnitude over the same EPYC 7742 data center CPU.

Get started with NVIDIA cuQuantum.


Download Now