GPU-Accelerated Optimal Interpolation for Global Ocean Surface pCO2 Mapping

Gyundo Pak

doi:10.4217/OPR.2026001

Ocean and Polar Research. 14 January 2026. 1-10
https://doi.org/10.4217/OPR.2026001

Gyundo Pak¹*

¹Ocean Circulation & Climate Research Department, Korea Institute of Ocean Science & Technology, Busan 49111, Korea

*Corresponding Author

This is an open access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0), which permits unrestricted educational and non-commercial use, provided the original work is properly cited.

ABSTRACT

GPU acceleration has become essential for meeting the rising computational demands of high-resolution ocean modeling and data assimilation. In this study, a GPU-accelerated, chunk-based, sequential Optimal Interpolation (OI) scheme was developed to reconstruct global ocean surface partial pressure of CO₂ (pCO₂) fields. OI analyses were performed for 1990–2019 using 30 years of monthly background fields and observations. Benchmarking showed substantial gains: the GPU runs were up to 35 times faster than the single-core CPU baseline and approximately 12 times faster than the fastest multi-core CPU run. The GPU performance improved with increasing chunk size, and the fastest runtimes were obtained with observation batch sizes of 1,000–2,000. Sequential OI generated analysis fields nearly identical to those from the all-at-once observation update, and its runtime ranged from slightly slower to slightly faster depending on the observation batch size. These results show that GPU-based OI is both practical and efficient, offering a pathway for the direct application of GPUs in operational ocean prediction systems.

Keywords

CPU-GPU, benchmarking, data assimilation, GPU acceleration, optimal interpolation, pCO₂

1. Introduction

The explosive growth in ocean data and model resolutions is rapidly outpacing the capabilities of traditional CPU computing (Vance et al. 2019; Delmas and Soulaïmani 2022; Beech et al. 2024). Many ocean models and data assimilation schemes still depend on CPUs, which are built with a small number of powerful and versatile cores (Ciżnicki et al. 2012; Zhang et al. 2012). Despite its usual effectiveness, this architecture becomes inefficient when faced with the massive and repetitive operations required for high-resolution ocean analyses (Häfner et al. 2021; Yuan et al. 2024; Porter and Heimbach 2025). In contrast, GPUs contain thousands of smaller cores designed for parallel processing, enabling them to execute a large number of calculations simultaneously (Keckler et al. 2011; Navarro et al. 2014). This fundamental difference allows GPUs to deliver orders-of-magnitude accelerations over CPU-based approaches for ocean modeling (Bleichrodt et al. 2012; Xu et al. 2015; Zhao et al. 2017; Silvestri et al. 2025) and data assimilation (De Luca et al. 2021).

Although GPUs have been rapidly adopted in many scientific fields, their use in ocean modeling and assimilation remains limited. Major community ocean models such as MOM6 (Adcroft et al. 2019) and ROMS (Shchepetkin and McWilliams 2005) have not yet transitioned entirely to GPU architectures, and most implementations continue to rely on CPUs. Although experimental efforts and prototype tools have demonstrated the potential of GPU acceleration, the widely used data assimilation frameworks are still CPU-based and the operational data assimilation systems have not yet been coupled with GPUs (Martin et al. 2025). This slow transition is largely because of the extensive legacy of CPU-based Fortran codes, whose adaptation to the GPU architecture calls for an overhaul (Zhao et al. 2017; Vanderbauwhede and Davidson 2018; Porter and Heimbach 2025). Consequently, most operational ocean prediction systems are still implemented using CPUs, even as the scientific community increasingly demands higher-resolution products and near-real-time applications. Considering this, GPU acceleration is no longer optional but essential.

This study demonstrates the advantages of GPU computing through a classical yet computationally demanding example: Optimal Interpolation (OI). OI has long been used in oceanography because of its analytical rigor, cost-effectiveness, and relatively straightforward implementation. At the same time, global-scale applications involve nontrivial computational demands, which makes OI a suitable example for testing GPU acceleration. Furthermore, OI shares its mathematical formulation with data assimilation schemes widely used in operational ocean prediction systems, such as the Ensemble Kalman Filter (Evensen 1994) and Ensemble Optimal Interpolation (Evensen 2003; Chang et al. 2024; Jin et al. 2024), so the approach can be extended to broader applications. The OI was implemented on a modern GPU platform to evaluate the efficiency gains while maintaining methodological fidelity. As a practical test case, GPU-accelerated OI was applied to reconstruct the global ocean surface partial pressure of CO₂ (pCO₂), which is a key variable for quantifying air-sea CO₂ fluxes and assessing the role of the ocean in the global carbon cycle (Iida et al. 2015; Roobaert et al. 2024). Although existing global pCO₂ products provide useful large-scale features, they still contain uncertainties associated with sparse observations and methodological assumptions. Data assimilation of pCO₂ observations therefore plays a critical role in reducing these uncertainties. This demonstration highlights the computational benefits of GPU-based assimilation and its potential to support timely and higher-resolution ocean prediction systems.

The remainder of this paper is organized as follows. Section 2 describes the datasets, principles of OI, and its GPU-based implementation. Section 3 presents the reconstructed global ocean surface pCO₂ fields. Section 4 benchmarks the computational performance of the GPU and CPU implementations, highlighting efficiency gains. Finally, Section 5 summarizes the findings and provides concluding remarks.

2. Data and Method

Data sources

The analysis focuses on the global ocean surface, using the background field of pCO₂ from SeaFlux v2021.04 (https://doi.org/10.5281/zenodo.5482547), which provides 1° global monthly pCO₂ estimates for 1990–2019 (Fay et al. 2021). This product was developed within the framework of the Surface Ocean pCO₂ Mapping Intercomparison (SOCOM) project (Rödenbeck et al. 2015), an international effort that systematically compared and combined multiple mapping approaches. The spco2_filler field from SeaFlux, based on an adjusted climatology (Landschützer et al. 2020), was used within this framework to provide a spatially complete and consistent background for the OI assimilation experiments. The spatial distribution of the long-term mean pCO₂ from SeaFlux is shown in Fig. 1a.

https://cdn.apub.kr/journalsite/sites/opr/2026-048-00/N00804801/images/opr_2026_48_001_F1.jpg

Fig. 1.

Spatial distribution of (a) the annual mean ocean surface pCO₂ from the background (μatm) and (b) the time-accumulated number of observations, and (c) monthly time-series of the number of global bin-averaged observations

To constrain the data assimilation, three major observational compilations were used: the Surface Ocean CO₂ Atlas (SOCAT), the Lamont-Doherty Earth Observatory (LDEO) database, and underway observations from the Korea Institute of Ocean Science & Technology (KIOST). SOCAT provides the most comprehensive global collection of surface ocean CO₂ data (Bakker et al. 2016) and contains more than 41 million quality-controlled observations from 1957 to 2023 (https://socat.info). It primarily includes underway shipboard measurements and data from fixed time-series stations that offer both wide spatial coverage and long-term reference points. As SOCAT provides the fugacity of CO₂ (fCO₂) rather than pCO₂, fCO₂ was converted to pCO₂ using the f2pco2 function in the seacarb R package (https://doi.org/10.32614/CRAN.package.seacarb). The LDEO V2019 database (https://www.ncei.noaa.gov/data/oceans/ncei/ocads/metadata/0160492.html) was one of the earliest global efforts to curate surface water pCO₂ records. Although most of its content overlaps with that of SOCAT, it includes unique time-series and carefully calibrated cruise data. Underway pCO₂ measurements collected by KIOST research vessels between 2003 and 2019 were also included; these covered the East/Japan Sea, the East China Sea, and a limited number of cruises in the North Pacific. These data complement the global compilations and provide valuable constraints for regions in which international datasets are sparse.

For assimilation, all the observations were aggregated monthly and bin-averaged onto a 0.25° grid, and all the available measurements in each bin were combined into monthly means (Fig. 1b). These binned observations were assimilated into a 1° background field, and thus the resulting analysis fields are also produced at 1° resolution. The spatial distribution shows particularly dense coverage in the North Pacific and North Atlantic, whereas open ocean regions such as the South Pacific, South Atlantic, and South Indian Ocean are poorly observed. To match the background period (1990–2019), the observational dataset was restricted to the same time range. The global time-series of bin-averaged observations demonstrated a sharp increase in data availability after 2000, with an average of 1,193 bins containing observations per month and a maximum of 9,770 in the most densely sampled months (Fig. 1c).
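The bin-averaging step described above can be sketched in NumPy as follows. This is a minimal illustration, not the author's code: the function name `bin_average`, the longitude wrapping to [0, 360), and the use of bin centers as observation locations are all assumptions, and the grouping by month is assumed to have been done before the call.

```python
import numpy as np

def bin_average(lon, lat, values, res=0.25):
    """Average values that fall in the same res-degree longitude/latitude bin.

    Returns bin-center longitudes, bin-center latitudes, bin means, and counts.
    """
    lon = np.mod(np.asarray(lon, dtype=float), 360.0)  # wrap to [0, 360)
    lat = np.asarray(lat, dtype=float)
    values = np.asarray(values, dtype=float)

    n_lon = int(round(360.0 / res))
    n_lat = int(round(180.0 / res))
    # Integer bin indices; the clip keeps lat = 90 inside the last bin.
    i = np.minimum(np.floor(lon / res).astype(int), n_lon - 1)
    j = np.minimum(np.floor((lat + 90.0) / res).astype(int), n_lat - 1)

    # One flat index per bin, then sum and count per occupied bin.
    flat = j * n_lon + i
    bins, inverse = np.unique(flat, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse)
    means = np.bincount(inverse, weights=values) / counts

    bin_lon = (bins % n_lon + 0.5) * res
    bin_lat = (bins // n_lon + 0.5) * res - 90.0
    return bin_lon, bin_lat, means, counts
```

Only occupied bins are returned, so the length of the output corresponds to the number of bins containing observations in that month.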

Optimal Interpolation

The OI method combines the background with observations to generate an improved analysis field. The fundamental formulae are as follows:

(1)

X^{a} = X^{b} + W (Y - H X^{b})

where X^a and X^b denote the state vectors of the analysis and background fields, respectively, Y is the observation vector, and the operator H projects the background field onto the observation points. The gain matrix W is defined as follows:

(2)

W = B H^{T} (H B H^{T} + R)^{- 1}

where B and R denote the background and observational error covariance matrices, respectively. The superscript T denotes the transpose, and -1 denotes the matrix inverse.
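As a small illustration of Eqs. (1) and (2), the update can be written in a few lines of NumPy. The function name `oi_update` is hypothetical, and solving the linear system instead of forming the explicit inverse is a choice made for this sketch, not something stated in the text.

```python
import numpy as np

def oi_update(xb, y, H, B, R):
    """Return X^a = X^b + B H^T (H B H^T + R)^-1 (Y - H X^b)."""
    xb = np.asarray(xb, dtype=float)
    innovation = np.asarray(y, dtype=float) - H @ xb   # Y - H X^b
    S = H @ B @ H.T + R                                # H B H^T + R
    # Solve S z = innovation rather than inverting S explicitly.
    z = np.linalg.solve(S, innovation)
    return xb + B @ H.T @ z
```

For a single grid point observed directly, with equal background and observational error variances, the gain is 0.5 and the analysis falls halfway between the background and the observation.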

The background error covariance B was prescribed using a Gaussian correlation function with an e-folding length scale of 250 km. This value is broadly consistent with previous estimates of the spatial decorrelation scale of ocean surface fCO₂ variability (~400 ± 250 km; Jones et al. 2012); however, it was set slightly shorter to better capture the local variability in the complex Korean marginal seas, where dense observations are available (Fig. 1b). The background error variance at each grid point (diagonal of B) and the observational error variance (diagonal of R) were both set to 10 (μatm)². Although this value is arbitrary, it was chosen to balance the background and observational influences in the absence of additional quality control. It is the relative magnitudes of B and R, rather than their absolute values, that determine the OI analysis: multiplying both by the same factor leaves the gain matrix W unchanged. The specific choice of 10 (μatm)² is therefore not important in itself. The observations were assumed to be independent, i.e., R is diagonal.
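The point about relative magnitudes can be checked numerically. The sketch below uses a small hypothetical three-point covariance matrix (the correlation values are made up for illustration) and shows that scaling B and R together leaves the gain from Eq. (2) unchanged, whereas changing only R does not.

```python
import numpy as np

def gain(B, H, R):
    """Gain matrix W = B H^T (H B H^T + R)^-1 from Eq. (2)."""
    S = H @ B @ H.T + R
    # An explicit inverse is acceptable for a tiny illustrative matrix.
    return B @ H.T @ np.linalg.inv(S)

# Hypothetical 3-point background with two directly observed points.
corr = np.array([[1.0, 0.6, 0.1],
                 [0.6, 1.0, 0.6],
                 [0.1, 0.6, 1.0]])
B = 10.0 * corr                      # 10 (uatm)^2 background variance
R = 10.0 * np.eye(2)                 # 10 (uatm)^2 observational variance
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

W_ref = gain(B, H, R)
W_scaled = gain(100.0 * B, H, 100.0 * R)   # same ratio, larger values
W_ratio = gain(B, H, 10.0 * R)             # different ratio
```

Here `W_scaled` matches `W_ref`, while `W_ratio` gives the observations less weight.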

The full background error covariance matrix B of size n_s × n_s, where n_s is the number of background grid points, was not explicitly constructed because this would be computationally prohibitive. Instead, only the covariance terms required for the OI analysis were directly computed, namely BH^T (n_s × n_o) and HBH^T (n_o × n_o), where n_o is the number of observation points. Both represent background error covariances: BH^T between the background and observation points and HBH^T among the observation points.
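One way to obtain BH^T and HBH^T without building B is to evaluate the covariance function directly between coordinate pairs. The sketch below assumes a correlation of the form exp(-(d/L)²), which equals 1/e at d = L = 250 km, together with great-circle distances on a sphere of radius 6,371 km. The exact functional form and distance metric are not given in the text, so both are assumptions, as are the function names.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def great_circle_km(lon1, lat1, lon2, lat2):
    """Haversine distance in km between points given in degrees."""
    lon1, lat1, lon2, lat2 = (np.radians(v) for v in (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def gaussian_cov(lon_a, lat_a, lon_b, lat_b, variance=10.0, length_km=250.0):
    """Covariance between every point in set a and every point in set b."""
    lon_a, lat_a = np.atleast_1d(lon_a), np.atleast_1d(lat_a)
    lon_b, lat_b = np.atleast_1d(lon_b), np.atleast_1d(lat_b)
    d = great_circle_km(lon_a[:, None], lat_a[:, None],
                        lon_b[None, :], lat_b[None, :])
    return variance * np.exp(-(d / length_km) ** 2)

def oi_covariances(grid_lon, grid_lat, obs_lon, obs_lat):
    """Return BH^T (n_s x n_o) and HBH^T (n_o x n_o) without forming B."""
    bht = gaussian_cov(grid_lon, grid_lat, obs_lon, obs_lat)
    hbht = gaussian_cov(obs_lon, obs_lat, obs_lon, obs_lat)
    return bht, hbht
```

With this construction the largest array has n_s × n_o elements instead of n_s × n_s.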

OI implementation for GPU

A grid-chunking strategy was employed to accelerate the OI computation. The global background grid (n_s) was divided into smaller chunks of size n_c and OI updates were performed independently for each chunk (Fig. 2a). The OI analysis for chunk c can be written as:

(3)

X_{c}^{a} = X_{c}^{b} + (B H^{T})_{c} (H B H^{T} + R)^{- 1} (Y - H X^{b})

where subscript c denotes the chunk. Matrix (HBH^T + R)^-1 and vector (Y – HX^b) have dimensions of n_o × n_o and n_o × 1, respectively; therefore their product (HBH^T + R)^-1(Y – HX^b) is manageable and can be computed once for use in all the chunks (Fig. 2a). Each chunk requires only the calculation of (BH^T)_c with a size of n_c × n_o. In this study, chunks were not defined based on geographic regions but were sequentially divided for computational convenience, since the chunking strategy does not affect the OI results.
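The chunked update in Eq. (3) can be sketched as below. For brevity the example uses a hypothetical one-dimensional transect with positions in km, a Gaussian covariance of the assumed form exp(-(d/L)²), and linear interpolation as the operator H. None of these simplifications come from the text.

```python
import numpy as np

def cov(xa, xb, variance=10.0, length_km=250.0):
    """Gaussian covariance between two sets of 1-D positions (km)."""
    d = xa[:, None] - xb[None, :]
    return variance * np.exp(-(d / length_km) ** 2)

def chunked_oi(x_grid, xb, x_obs, y, n_c, r_var=10.0):
    """Chunk-based OI: the n_o-sized solve is done once, then reused."""
    hxb = np.interp(x_obs, x_grid, xb)              # H X^b (assumed linear H)
    S = cov(x_obs, x_obs) + r_var * np.eye(x_obs.size)
    z = np.linalg.solve(S, y - hxb)                 # (HBH^T + R)^-1 (Y - HX^b)

    xa = np.empty_like(xb)
    for start in range(0, xb.size, n_c):
        sl = slice(start, start + n_c)
        # Only the (n_c x n_o) block (BH^T)_c is built for each chunk.
        xa[sl] = xb[sl] + cov(x_grid[sl], x_obs) @ z
    return xa
```

Because each chunk only selects rows of BH^T, the result does not depend on the chunk size, which is consistent with the statement above that chunking does not affect the OI results.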

https://cdn.apub.kr/journalsite/sites/opr/2026-048-00/N00804801/images/opr_2026_48_001_F2.jpg

Fig. 2.

Schematics of (a) chunk-based and (b) chunk-based sequential Optimal Interpolation (OI). The state vector of length n_s is partitioned into chunks of size n_c, and the observations (n_o) are grouped into batches of size n_b for sequential OI

Although the chunk-based strategy facilitates efficient implementation, its memory demand remains large because the term (HBH^T + R)^-1 scales as n_o² (Fig. 2a). In an operational ocean prediction system, the number of observation points in the three-dimensional field at one data assimilation step can reach the order of 10⁵. Storing the corresponding 10⁵ × 10⁵ matrix in double precision would require approximately 80 GB of memory, which is already beyond the capacity of a modern GPU with 32 GB of VRAM. To overcome this limitation, a sequential approach (Houtekamer and Mitchell 2001) was applied in which the full observation set (n_o) was divided into smaller batches of size n_b (Fig. 2b). Here, a chunk refers to a subset of background grid points, whereas a batch denotes a subset of observations. Instead of applying the entire observation set in a single update, the sequential method assimilates the batches one after another and updates the analysis field iteratively. This reduces memory usage by limiting the matrix operations to size n_b × n_b. However, this approach can cause small discrepancies relative to the all-at-once observation update. In the present experiments, fewer than 10,000 observations were used per update; therefore, the all-at-once method is feasible. Sequential OI is nevertheless considered with future operational applications in mind, where more than 10⁵ observations may need to be assimilated.
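The memory figures above follow from simple arithmetic, sketched here under the assumption of 8-byte (double-precision) values and decimal gigabytes. The helper name is hypothetical.

```python
def dense_matrix_gb(n_rows, n_cols, bytes_per_value=8):
    """Memory needed to store a dense matrix, in GB (1 GB = 1e9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9

# All-at-once update with n_o = 1e5 observations: the n_o x n_o matrix.
all_at_once_gb = dense_matrix_gb(100_000, 100_000)

# Sequential update with a batch of n_b = 2,000 observations.
batch_gb = dense_matrix_gb(2_000, 2_000)

# The (BH^T)_c block for a chunk of n_c = 50,000 grid points and one batch.
chunk_block_gb = dense_matrix_gb(50_000, 2_000)
```

Under these assumptions the all-at-once matrix needs 80 GB, whereas the batch matrix and the chunk block together stay below 1 GB.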

The chunk-based sequential OI implementation was further accelerated using GPU parallelization with the CuPy library in Python (https://cupy.dev). Many Python users rely on NumPy for array operations (Harris et al. 2020), and CuPy provides a NumPy-compatible interface that enables most matrix operations in the OI update to be executed on CUDA-enabled GPUs with minimal modifications. In the GPU runs, both the grid chunk and observation batch loops were executed sequentially and the matrix operations inside each loop were performed in parallel on the GPU. Although data reading and storage were handled by the CPU, nearly all the subsequent computations were performed on the GPU. This design substantially reduced the runtime compared with the CPU-only experiments (see Section 4.1). The chunk-based OI algorithm was also executed on CPUs using NumPy, both in the single-core mode and in parallel with 8-core and 24-core configurations, to provide baselines for performance comparison. In single-core mode, the grid chunk loop was executed sequentially, and in multi-core mode, it was split across the CPU cores and executed simultaneously.
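Because CuPy mirrors much of the NumPy interface, the same function can run on either backend by passing the array module as an argument. The sketch below is an assumption about how such code might be organized, not the author's implementation. It falls back to NumPy when CuPy or a CUDA device is unavailable, and the function names are hypothetical.

```python
import numpy as np

def get_backend():
    """Return the CuPy module if a CUDA device is usable, else NumPy."""
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp
    except Exception:
        # CuPy missing, or no working CUDA driver/device: use the CPU.
        pass
    return np

def chunk_increment(xp, bht_c, s, d):
    """Increment (BH^T)_c (HBH^T + R)^-1 (Y - HX^b) on backend xp."""
    bht_c, s, d = (xp.asarray(a, dtype=float) for a in (bht_c, s, d))
    return bht_c @ xp.linalg.solve(s, d)

def to_host(xp, arr):
    """Copy a result back to host memory as a NumPy array."""
    return arr if xp is np else xp.asnumpy(arr)
```

A typical call would be `xp = get_backend()` followed by `to_host(xp, chunk_increment(xp, bht_c, s, d))`, so that file reading and writing stay on the CPU while the matrix operations run on whichever backend was selected.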

3. Test Case: Global pCO₂ Analysis

The OI update is illustrated for February 2013 (Fig. 3a–d) as a representative snapshot to show the spatial characteristics of the analysis fields. The background field (Fig. 3a) provides gap-free global coverage with elevated pCO₂ (> 450 μatm) in the eastern equatorial Pacific and coastal regions in the Northern Hemisphere, contrasted by relatively low pCO₂ (< 350 μatm) across the midlatitude Northern Hemisphere and the high-latitude Southern Hemisphere. The spatial distribution was not significantly different from that of the long-term average field (Fig. 1a). The observations comprised numerous underway tracks, particularly across the North Pacific, North Atlantic, and Southern Ocean, along with several fixed-point stations worldwide (Fig. 3b). Although the OI analysis (Fig. 3c) closely resembled the background, the increments (analysis minus background; Fig. 3d) revealed local adjustments near the observations. The magnitude of the update remained moderate, indicating that the OI nudges the background toward the observations without introducing noticeable artifacts. The resulting analysis is available at https://doi.org/10.22711/idr/1105.

A comparison of the time-series of the area-averaged pCO₂ shows that the analysis closely follows the background at the global scale (Fig. 3e), with larger departures in the Northwest Pacific, where observations were denser (Fig. 3f). These results confirm that OI behaves as expected, with sharper local corrections in observation-rich regions, whereas observation-poor regions retain background characteristics. Overall, the resulting pCO₂ analysis fields remain broadly consistent with the SeaFlux background, while local pCO₂ features reflect observational information more explicitly. In particular, the inclusion of observations around Korea is expected to improve the representation of regional air-sea CO₂ exchange, which may be beneficial for quantifying carbon fluxes in this region. This test case primarily focuses on computational performance and methodological consistency, without aiming for a detailed validation of the pCO₂ analysis fields.

https://cdn.apub.kr/journalsite/sites/opr/2026-048-00/N00804801/images/opr_2026_48_001_F3.jpg

Fig. 3.

(a–d) Spatial distribution of (a) background, (b) observations, (c) OI analysis, and (d) update (analysis minus background) of ocean surface pCO₂ (μatm) in February 2013. (e–f) Time-series of area-averaged pCO₂ from the background (thick black line) and analysis (red line) fields for (e) the global ocean and (f) the Northwest Pacific (125–145°E, 25–45°N)

4. Performance Benchmarking

GPU vs CPU

Benchmarking experiments were performed on a workstation with an NVIDIA RTX 5090 GPU (32 GB of GDDR7 VRAM), Intel Xeon W7-3465X CPU (28 cores, 3.2 GHz clock), 256 GB system memory, and WD Black 850x NVMe SSD (8 TB). Comparisons were performed using chunk sizes ranging from 1,000 to 50,000 under an all-at-once observation update, where the maximum number of observations per step was slightly less than 10,000. The OI background field for pCO₂ comprised 44,598 grid points. Therefore, a chunk size of 50,000 effectively processed the entire background grid as a single chunk. CPU runs were executed in single-core and multi-core modes (i.e., 1X, 8X, and 24X). Each test was repeated five times, and the average runtime was calculated after discarding the fastest and slowest runs. The input/output operations, which required approximately 6 s per run, were included in the runtime. The experiment names follow the format Processor-Chunk size, for example, GPU-01k for a GPU run with a chunk size of 1,000 and CPU-24X-50k for a 24-core CPU run with a chunk size of 50,000. The speedup ratio is defined as the runtime of the baseline (CPU-1X-01k; Table 1) divided by that of each test case, with values greater (less) than one indicating acceleration (slowdown).
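The averaging rule and the speedup definition above can be written as two small helpers. The function names are hypothetical. The runtimes passed to `speedup_ratio` below are the averaged values listed in Table 1.

```python
def trimmed_mean_runtime(runtimes):
    """Mean runtime after discarding the fastest and slowest runs."""
    if len(runtimes) < 3:
        raise ValueError("need at least three runs to discard two of them")
    kept = sorted(runtimes)[1:-1]
    return sum(kept) / len(kept)

def speedup_ratio(baseline_s, runtime_s):
    """Baseline runtime divided by test runtime (>1 means faster)."""
    return baseline_s / runtime_s

# Averaged runtimes (s) from Table 1.
BASELINE_CPU_1X_01K = 1338.14
gpu_50k = speedup_ratio(BASELINE_CPU_1X_01K, 37.90)        # about 35.31
cpu_24x_01k = speedup_ratio(BASELINE_CPU_1X_01K, 457.40)   # about 2.93
gpu_vs_fastest_cpu = 457.40 / 37.90                        # about 12.07
```

The last ratio compares GPU-50k with the fastest multi-core CPU run rather than with the single-core baseline, which is where the figure of approximately 12 comes from.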

The benchmarking results clearly highlight the advantages of GPU acceleration. The fastest GPU run (GPU-50k) was approximately 35 times faster than the single-core CPU baseline (CPU-1X-01k) and approximately 12 times faster than the fastest multi-core CPU run (CPU-24X-01k). The runtime on the GPU decreased with larger chunk sizes and leveled off beyond 5,000 grid points, indicating that GPU efficiency saturated once sufficiently large chunks were used.

The CPU performance exhibited a different pattern. Multi-core settings reduced the runtime relative to the single-core case (1,338 s for CPU-1X-01k, 521 s for CPU-8X-01k, and 457 s for CPU-24X-01k), although the gains diminished between 8 and 24 cores. The CPU runtime generally increased with larger chunk sizes; for example, it increased from 1,338 s at 1,000 points to 1,732 s at 50,000 points in the single-core case. This slowdown presumably reflects memory bandwidth limitations when larger chunks exceed the cache capacity (Patterson and Hennessy 2016), a behavior also observed in the multi-core runs. Because the full domain contained 44,598 grid points, chunk sizes above 5,000 (for 8 cores) or 2,000 (for 24 cores) yielded fewer chunks than cores, so the workload could no longer be balanced across all cores. Consequently, for chunk sizes of 10k, 20k, and 50k, the runtimes of the 8X and 24X parallel modes were nearly identical (Table 1). Notably, although the multi-core 50k cases should have been equivalent to a single-core computation (CPU-1X-50k; 1,732 s), CPU-8X-50k (1,485 s) and CPU-24X-50k (1,486 s) performed slightly better. This modest improvement likely arises because NumPy relies on an optimized linear algebra library (OpenBLAS) for matrix operations, which can automatically use multiple CPU cores for the matrix operations outside the chunk loop. In the single-core mode, in contrast, these operations are handled by a single core.
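The load-balancing argument can be made concrete by counting chunks. The sketch below assumes that each core processes one chunk at a time, so the number of busy cores is at most the number of chunks. The helper names are hypothetical.

```python
import math

N_GRID = 44_598  # background grid points, as stated in the text

def n_chunks(n_s, n_c):
    """Number of chunks needed to cover n_s grid points."""
    return math.ceil(n_s / n_c)

def busy_cores(n_s, n_c, cores):
    """Cores that can work at once if each core handles one chunk."""
    return min(cores, n_chunks(n_s, n_c))

chunk_sizes = [1_000, 2_000, 5_000, 10_000, 20_000, 50_000]
chunks = [n_chunks(N_GRID, c) for c in chunk_sizes]
busy_8 = [busy_cores(N_GRID, c, 8) for c in chunk_sizes]
busy_24 = [busy_cores(N_GRID, c, 24) for c in chunk_sizes]
```

Under this assumption, chunk sizes of 10k, 20k, and 50k give 5, 3, and 1 chunks, so the 8-core and 24-core modes keep the same number of cores busy, in line with their nearly identical runtimes in Table 1.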

Table 1.

Experiment designs, runtimes, and speedup ratios for all-at-once observation update experiments

Experiment	Processor	CPU cores	Chunk size	Runtime (s)	Speedup ratio
GPU-01k	GPU	-	1,000	43.36	30.86
GPU-02k	GPU	-	2,000	39.77	33.65
GPU-05k	GPU	-	5,000	38.09	35.13
GPU-10k	GPU	-	10,000	37.93	35.28
GPU-20k	GPU	-	20,000	38.16	35.07
GPU-50k	GPU	-	50,000	37.90	35.31
CPU-1X-01k	CPU	1	1,000	1,338.14	1.00
CPU-1X-02k	CPU	1	2,000	1,480.31	0.90
CPU-1X-05k	CPU	1	5,000	1,638.88	0.82
CPU-1X-10k	CPU	1	10,000	1,699.35	0.79
CPU-1X-20k	CPU	1	20,000	1,722.61	0.78
CPU-1X-50k	CPU	1	50,000	1,732.97	0.77
CPU-8X-01k	CPU	8	1,000	521.65	2.57
CPU-8X-02k	CPU	8	2,000	534.96	2.50
CPU-8X-05k	CPU	8	5,000	595.63	2.25
CPU-8X-10k	CPU	8	10,000	603.84	2.22
CPU-8X-20k	CPU	8	20,000	826.55	1.62
CPU-8X-50k	CPU	8	50,000	1,485.60	0.90
CPU-24X-01k	CPU	24	1,000	457.40	2.93
CPU-24X-02k	CPU	24	2,000	470.44	2.84
CPU-24X-05k	CPU	24	5,000	506.55	2.64
CPU-24X-10k	CPU	24	10,000	604.40	2.21
CPU-24X-20k	CPU	24	20,000	826.85	1.62
CPU-24X-50k	CPU	24	50,000	1,486.42	0.90

Sequential OI tests with GPU

Sequential OI sensitivity tests were performed only on the GPU, varying both the chunk size (1,000–50,000) and the batch size (100–10,000) to evaluate the runtime (Fig. 4). In general, the runtime decreased as the batch size increased: smaller batches used less memory per operation but required more iterations, which slowed the computation. Larger batches reduced the number of iterations and improved efficiency; however, the benefit became negligible beyond a batch size of approximately 2,000. The fastest runtimes were obtained for n_b = 1,000–2,000, whereas the performance declined at larger n_b values (e.g., 5,000 or 10,000). This decline likely reflects the increasing computational cost of inverting the larger matrix (HBH^T + R). The chunk size also had a strong effect on performance. Larger chunks reduced the number of loop iterations and accelerated the computation, whereas smaller chunks increased the loop overhead and lengthened the runtimes. The fastest runtime was obtained at the maximum chunk size (50,000) with a batch size of 1,000. These results indicate that, although larger chunk sizes are consistently beneficial, the batch size has an optimal range of approximately 1,000–2,000 rather than improving monotonically with size.

https://cdn.apub.kr/journalsite/sites/opr/2026-048-00/N00804801/images/opr_2026_48_001_F4.jpg

Fig. 4.

Runtimes of GPU-accelerated sequential OI as a function of chunk size (1,000–50,000) and batch size (100–10,000)

Beyond runtime, the consistency of sequential OI with the all-at-once approach (n_b = 10,000) was also evaluated. The time-series of the spatial root-mean-square difference (RMSD) between the background and analysis fields was calculated for the all-at-once run (n_b = 10,000) and for sequential OI with n_b = 100 (Fig. 5). The all-at-once update typically modified the background by approximately 3–6 µatm at the global scale. When sequential OI with n_b = 100 was applied, the resulting analysis fields exhibited similar temporal variations, with only minor deviations from the all-at-once case. Because n_b = 100 is the smallest batch size considered and thus represents the worst-case scenario, the consistency observed for this case implies even smaller differences for larger batch sizes (not shown). This finding indicates that sequential OI introduces only minor discrepancies, as reported by Houtekamer and Mitchell (2001). Although it requires slightly longer runtimes (~45 s) than the all-at-once case (~38 s), sequential OI provides a practical strategy when memory limitations prevent an all-at-once implementation. In addition, the sequential approach can be computationally advantageous under optimal conditions (e.g., n_b = 1,000), achieving slightly shorter runtimes than the all-at-once case.
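A sequential update and the RMSD diagnostic can be sketched as follows. The text does not spell out how each batch is applied, so this example assumes that the same static Gaussian covariance is reused for every batch and that the innovation is recomputed from the current analysis. It also uses a hypothetical one-dimensional transect with linear interpolation as H. It is an illustration under those assumptions, not the author's scheme.

```python
import numpy as np

def cov(xa, xb, variance=10.0, length_km=250.0):
    """Gaussian covariance between two sets of 1-D positions (km)."""
    d = xa[:, None] - xb[None, :]
    return variance * np.exp(-(d / length_km) ** 2)

def oi_step(x_grid, x_state, x_obs, y, r_var=10.0):
    """One OI update of x_state with the observations (x_obs, y)."""
    hx = np.interp(x_obs, x_grid, x_state)            # assumed linear H
    S = cov(x_obs, x_obs) + r_var * np.eye(x_obs.size)
    return x_state + cov(x_grid, x_obs) @ np.linalg.solve(S, y - hx)

def sequential_oi(x_grid, xb, x_obs, y, n_b):
    """Assimilate the observations batch by batch (batch size n_b)."""
    xa = xb.copy()
    for start in range(0, y.size, n_b):
        sl = slice(start, start + n_b)
        xa = oi_step(x_grid, xa, x_obs[sl], y[sl])    # only n_b x n_b solves
    return xa

def rmsd(a, b):
    """Root-mean-square difference between two fields."""
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

In this simplified setting, a single batch covering all observations reproduces the all-at-once update exactly, and batches whose observations lie many length scales apart give practically the same analysis. Differences appear when correlated observations are split across batches, and their size depends on the observation layout.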

https://cdn.apub.kr/journalsite/sites/opr/2026-048-00/N00804801/images/opr_2026_48_001_F5.jpg

Fig. 5.

Time-series of RMSD between the background and the analysis from all-at-once observation update (thick black line) and from sequential OI with a batch size of 100 (red line) overlaid with their difference (blue dotted line)

5. Summary and Concluding Remarks

This study highlights the benefits of GPU acceleration for the OI of global ocean surface pCO₂. A chunk-based sequential OI scheme was implemented using 30 years of monthly background fields and bin-averaged observations. A test case for February 2013 confirmed that the OI system successfully assimilated the observations into the background. Performance benchmarking demonstrated substantial acceleration with the GPU implementation, with runtimes up to 35 times faster than the single-core CPU baseline and approximately 12 times faster than the fastest multi-core CPU run. The GPU performance improved with increasing chunk size, with only modest gains beyond 5,000. Sensitivity tests further showed that batch sizes of 1,000–2,000 provided the best runtimes. Sequential OI produced analysis fields that were nearly identical to those from the all-at-once update, with at most a moderate runtime penalty.

This study provides a practical pathway for applying GPUs in data assimilation. As an initial step toward GPU-based ocean data assimilation, a relatively simple test case, in which OI was applied to 1° global pCO₂ fields, was used to examine the feasibility of GPU acceleration. The proposed OI schemes are compatible with operational ocean prediction systems based on Ensemble Optimal Interpolation (Jin et al. 2024). An operational system must assimilate large numbers of observations into high-resolution three-dimensional model grids, which imposes prohibitive demands on memory and computational power. An efficient GPU implementation can satisfy these computational requirements, whereas a chunk-based sequential OI algorithm alleviates the memory burden. The GPU-based system also achieved more than an order-of-magnitude speedup over the fastest CPU configuration in this study, and such a clear advantage of GPUs is expected to remain even when using other recent CPU and GPU hardware. This transition to GPU-based data assimilation is expected to reduce the prediction time. Moreover, if numerical ocean models are migrated to GPU platforms, an even faster and more efficient ocean prediction system can be realized.

Acknowledgements

This research was supported by the Korea Institute of Marine Science & Technology Promotion (KIMST), funded by the Ministry of Oceans and Fisheries (RS-2025-02217872), and by the Korea Institute of Ocean Science & Technology (“Enhancing Capacity for Assessing and Predicting Marine Environmental and Ecosystem Variability around the Korean Peninsula”, PEA0403). The author also thanks the three anonymous reviewers for their constructive and valuable comments.

References

Adcroft A, Anderson W, Balaji V, Blanton C, Bushuk M, Dufour CO, Dunne JP, Griffies SM, Hallberg R, Harrison MJ (2019) The GFDL global ocean and sea ice model OM4.0: model description and simulation features. J Adv Model Earth Syst 11:3167–3211. doi:10.1029/2019MS001726

Bakker DCE, Pfeil B, Landa CS, Metzl N, O'Brien KM, Olsen A, Smith K, Cosca C, Harasawa S, Jones SD (2016) A multi-decade record of high-quality fCO₂ data in version 3 of the Surface Ocean CO₂ Atlas (SOCAT). Earth Syst Sci Data 8:383–413. doi:10.5194/essd-8-383-2016

Beech N, Rackow T, Semmler T, Jung T (2024) Exploring the ocean mesoscale at reduced computational cost with FESOM 2.5: efficient modeling strategies applied to the Southern Ocean. Geosci Model Dev 17:529–543. doi:10.5194/gmd-17-529-2024

Bleichrodt F, Bisseling RH, Dijkstra HA (2012) Accelerating a barotropic ocean model using a GPU. Ocean Modell 41:16–21. doi:10.1016/j.ocemod.2011.10.001

Chang I, Kim YH, Park Y-G, Jin H, Pak G, Kwon J-I, Chang Y-S (2024) Assessment of high-resolution regional ocean reanalysis K-ORA22 for the Northwest Pacific. Prog Oceanogr 229:103359. doi:10.1016/j.pocean.2024.103359

Ciżnicki M, Kierzynka M, Kopta P, Kurowski K, Gepner P (2012) Benchmarking data and compute intensive applications on modern CPU and GPU architectures. Procedia Comput Sci 9:1900–1909. doi:10.1016/j.procs.2012.04.208

De Luca P, Galletti A, Giunta G, Marcellino L (2021) Recursive filter based GPU algorithms in a Data assimilation scenario. J Comput Sci 53:101339. doi:10.1016/j.jocs.2021.101339

Delmas V, Soulaïmani A (2022) Multi-GPU implementation of a time-explicit finite volume solver using CUDA and a CUDA-Aware version of OpenMPI with application to shallow water flows. Comput Phys Commun 271:108190. doi:10.1016/j.cpc.2021.108190

Evensen G (1994) Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. J Geophys Res-Oceans 99:10143–10162. doi:10.1029/94JC00572

Evensen G (2003) The ensemble Kalman Filter: theoretical formulation and practical implementation. Ocean Dynamics 53:343–367. doi:10.1007/s10236-003-0036-9

Fay AR, Gregor L, Landschützer P, McKinley GA, Gruber N, Gehlen M, Iida Y, Laruelle GG, Rödenbeck C, Roobaert A (2021) SeaFlux: harmonization of air-sea CO₂ fluxes from surface pCO₂ data products using a standardized approach. Earth Syst Sci Data 13:4693–4710. doi:10.5194/essd-13-4693-2021

Häfner D, Nuterman R, Jochum M (2021) Fast, cheap, and turbulent-Global ocean modeling with GPU acceleration in Python. J Adv Model Earth Syst 13:e2021MS002717. doi:10.1029/2021MS002717

Harris CR, Millman KJ, Van Der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ (2020) Array programming with NumPy. Nature 585:357–362. doi:10.1038/s41586-020-2649-2

Houtekamer PL, Mitchell HL (2001) A sequential ensemble Kalman Filter for atmospheric data assimilation. Mon Weather Rev 129:123–137. doi:10.1175/1520-0493(2001)129<0123:ASEKFF>2.0.CO;2

Iida Y, Kojima A, Takatani Y, Nakano T, Sugimoto H, Midorikawa T, Ishii M (2015) Trends in pCO₂ and sea–air CO₂ flux over the global open oceans for the last two decades. J Oceanogr 71:637–661. doi:10.1007/s10872-015-0306-4

Jin H, Kim YH, Park Y-G, Chang I, Chang Y-S, Park H, Pak G (2024) Simulation characteristics of ocean predictability experiment for marine environment (OPEM): A Western North Pacific regional ocean prediction system. Ocean Sci J 59:71. doi:10.1007/s12601-024-00195-6

Jones SD, Le Quéré C, Rödenbeck C (2012) Autocorrelation characteristics of surface ocean pCO₂ and air-sea CO₂ fluxes. Global Biogeochem Cy 26:GB2042. doi:10.1029/2010GB004017

Keckler SW, Dally WJ, Khailany B, Garland M, Glasco D (2011) GPUs and the future of parallel computing. IEEE Micro 31:7–17. doi:10.1109/MM.2011.89

Landschützer P, Laruelle GG, Roobaert A, Regnier P (2020) A uniform pCO₂ climatology combining open and coastal oceans. Earth Syst Sci Data 12:2537–2553. doi:10.5194/essd-12-2537-2020

Martin MJ, Hoteit I, Bertino L, Moore AM (2025) Data assimilation schemes for ocean forecasting: state of the art. State of the Planet 5:9. doi:10.5194/sp-2024-20

Navarro CA, Hitschfeld-Kahler N, Mateu L (2014) A survey on parallel computing and its applications in Data-Parallel problems using GPU architectures. Commun Comput Phys 15:285–329. doi:10.4208/cicp.110113.010813a

Patterson DA, Hennessy JL (2016) Computer organization and design ARM edition: the hardware software interface. Morgan Kaufmann

Porter AR, Heimbach P (2025) Unlocking the power of parallel computing: GPU technologies for ocean forecasting. State of the Planet 5-opsr:23. doi:10.5194/sp-2024-32

Rödenbeck C, Bakker DCE, Gruber N, Iida Y, Jacobson AR, Jones S, Landschützer P, Metzl N, Nakaoka S, Olsen A (2015) Data-based estimates of the ocean carbon sink variability-first results of the surface ocean pCO₂ mapping intercomparison (SOCOM). Biogeosciences 12:7251–7278. doi:10.5194/bg-12-7251-2015

Roobaert A, Regnier P, Landschützer P, Laruelle GG (2024) A novel sea surface pCO₂-product for the global coastal ocean resolving trends over 1982–2020. Earth Syst Sci Data 16:421–441. doi:10.5194/essd-16-421-2024

Shchepetkin AF, McWilliams JC (2005) The regional oceanic modeling system (ROMS): a split-explicit, free-surface, topography-following-coordinate oceanic model. Ocean Modell 9:347–404. doi:10.1016/j.ocemod.2004.08.002

Silvestri S, Wagner GL, Constantinou NC, Hill CN, Campin J-M, Souza AN, Bishnu S, Churavy V, Marshall J, Ferrari R (2025) A GPU-Based ocean dynamical core for routine mesoscale-resolving climate simulations. J Adv Model Earth Syst 17:e2024MS004465. doi:10.1029/2024MS004465

Vance TC, Wengren M, Burger E, Hernandez D, Kearns T, Medina-Lopez E, Merati N, O’Brien K, O’Neil J, Potemra JT (2019) From the oceans to the cloud: opportunities and challenges for data, models, computation and workflows. Front Mar Sci 6:211. doi:10.3389/fmars.2019.00211

Vanderbauwhede W, Davidson G (2018) Domain-specific acceleration and auto-parallelization of legacy scientific code in FORTRAN 77 using source-to-source compilation. Comput Fluids 173:1–5. doi:10.1016/j.compfluid.2018.06.005

Xu S, Huang X, Oey LY, Xu F, Fu H, Zhang Y, Yang G (2015) POM.gpu-v1.0: a GPU-based Princeton Ocean Model. Geosci Model Dev 8:2815–2827. doi:10.5194/gmd-8-2815-2015

Yuan Y, Yu F, Chen Z, Li X, Hou F, Gao Y, Gao Z, Pang R (2024) Towards a real-time modeling of global ocean waves by the fully GPU-accelerated spectral wave model WAM6-GPU v1.0. Geosci Model Dev 17:6123–6136. doi:10.5194/gmd-17-6123-2024

Zhang H, Zhang D-F, Bi X-A (2012) Comparison and analysis of GPGPU and parallel computing on multi-core CPU. Int J Inf Educ Technol 2:185–187. doi:10.7763/IJIET.2012.V2.106

Zhao X-D, Liang S-X, Sun Z-C, Zhao X-Z, Sun J-W, Liu Z-B (2017) A GPU accelerated finite volume coastal ocean model. J Hydrodyn 29:679–690. doi:10.1016/S1001-6058(16)60780-1

Ocean and Polar Research ISSN: 2234-7313 (Online)