Assessing the Applicability of Deep-Learning Method for Predicting Cyanobacteria in a Regulated River

Kim, Jungwook; Kim, Hongtae; Kim, Kyunghyun; Ahn, Jung Min

doi:10.1061/JOEEDU.EEENG-7427

Open access

Technical Papers

Feb 23, 2024


Authors: Jungwook Kim, Hongtae Kim, Kyunghyun Kim, and Jung Min Ahn

Publication: Journal of Environmental Engineering

Volume 150, Issue 5

https://doi.org/10.1061/JOEEDU.EEENG-7427


Abstract

Cyanobacterial harmful algal blooms (cyanoHABs) negatively affect aquatic life and, through river water, humans. Thus, reliable cyanobacteria predictions are essential for managing cyanoHABs. With recent advancements in computer technology and big data usage, artificial intelligence (AI) technologies have gained attention in various fields, such as water resources, weather and climate, and water quality. This study evaluated the applicability of deep-learning-based AI technology for predicting cyanobacteria. A convolutional neural network (CNN)–long short-term memory (LSTM) model, a deep-learning-based AI technology advantageous for predicting time-series data and cyanobacteria features, was built. Its results were analyzed and compared with those of the existing physical Environmental Fluid Dynamics Code (EFDC)–National Institute of Environmental Research (NIER) model for cyanobacteria prediction. The CNN-LSTM model performed better, with an accuracy of 69%, which is an improvement over the previous EFDC-NIER model’s accuracy of 45%. In particular, there was a dramatic improvement in the prediction accuracy for low cyanobacteria cell counts in Level 1, which increased from 39% to 87%. There also was an improvement in the prediction accuracy for Levels 2 and 3. The accuracy for Level 2 increased from 56% to 69%, and the accuracy for Level 3 increased from 38% to 48%. However, there was a significant decrease in prediction accuracy for high cyanobacteria cell counts in Level 4, for which the measured data were very scarce; accuracy decreased from 49% to 16.7%. The CNN-LSTM model yielded better overall prediction performance than the EFDC-NIER model, demonstrating its applicability in cyanobacteria prediction. However, it has the limitations of overfitting in areas with inadequate data and of not accurately predicting patterns that have not occurred in the past. To address these limitations, we propose a hybrid approach that combines the advantages of physics-based models and AI-based deep learning models.

Introduction

Every summer, South Korea faces many problems owing to algal blooms (NIER 2020; Ahn et al. 2021a), which cause the death of aquatic life and negatively affect water supply management (Schmidt et al. 2013; Pavagadhi and Balasubramanian 2013; Preece et al. 2017). In particular, because the Nakdong River has been transformed artificially into a lake due to the Four Major Rivers Project, the occurrence of cyanoHABs has become even more severe. Algal blooms are a global environmental issue typically occurring when cyanobacteria proliferate (Huisman et al. 2004). In South Korea, algal blooms caused by cyanobacteria also are a serious social problem (Ahn et al. 2021c). Therefore, predicting cyanobacteria cell counts is crucial for managing cyanobacterial harmful algal blooms (cyanoHABs) (Pyo et al. 2020).

In South Korea, water quality and cyanobacteria cell counts are measured weekly for early response and management of cyanoHABs. These data are used to predict cyanobacteria cell counts. The physical Environmental Fluid Dynamics Code (EFDC)–National Institute of Environmental Research (NIER) model predicts cyanobacteria cell counts. The EFDC-NIER model has been adapted by NIER to better suit South Korean river environments and has undergone extensive improvements for enhanced prediction accuracy (Ahn et al. 2021b). The EFDC-NIER model is utilized for two main purposes: short-term forecasting for operating the harmful algae alert system; and predicting variations in cyanobacterial occurrences based on various scenarios involving meteorological, hydrological, water quality, and hydraulic factors. It is crucial to predict the occurrence of harmful algal blooms (HABs) in the near future quickly and accurately. By doing so, sufficient time can be secured to implement proactive measures and strategies to effectively mitigate the impacts of HABs on river ecosystems and minimize potential risks to water supply safety in surrounding communities. The ability to make timely and precise short-term forecasts allows for better planning and preparation, thus enhancing the overall management and response to HAB events.

Operating a harmful algae alert system using a physics-based model can present several challenges. One of the major issues is the computational complexity and time required to execute the model. The process of simulating the physical interactions and environmental factors is resource-intensive and time-consuming. Running simulations to predict algal blooms and their dynamics involves complex calculations that require substantial computational power and data processing. Input data, model structure, and parameters considerably influence cyanobacteria cell count predictions using physics-based models (McAvoy et al. 2003; Zhang et al. 2015; Jiang et al. 2018; Su et al. 2018). Physics-based models can yield accurate results through well-validated mechanisms and processes. However, the complexity of these mechanisms and processes can make model construction difficult and result in significantly longer execution times. As a result, obtaining real-time or near-real-time predictions may be difficult, hindering the system’s ability to provide timely warnings and effective responses to potential HAB events. This limitation calls for the exploration of alternative modeling approaches or techniques that can strike a balance between accuracy and computational efficiency in order to improve the operational capabilities of the harmful algae alert system.

Our main research objective was to leverage the strengths of artificial intelligence (AI)-based deep learning prediction models to complement the weaknesses of the physics-based models. For this purpose, we applied artificial intelligence technology to predict harmful algal blooms and compared the results with those obtained from traditional physics-based models. Through discussions and analysis, we present the limitations of AI technology and propose potential solutions to overcome them, ultimately enhancing the accuracy and efficiency of algal bloom predictions.

AI-based deep learning models exhibit strengths in handling large-scale data processing and providing rapid predictions (Lek and Park 2008). By learning patterns and relationships from past data, they enable quicker response times and more-efficient short-term forecasting (Miotto et al. 2018). Therefore, data-driven models can be an alternative to physics-based models because they require only data distribution to predict cyanobacteria cell counts (KEI 2020). Research on cyanoHABs is relatively scarce, and it is still a very challenging field. However, several studies have been conducted recently. Neural-network-based AI technology mainly was used in those studies (Velo-Suarez and Gutierrez-Estrada 2007; Coad et al. 2014; Zhang et al. 2015). Artificial neural network (ANN) techniques have been used widely in algae prediction, including for predicting cyanobacteria cell counts (Maier et al. 1998; Xiao et al. 2017; Yim et al. 2019; Chen et al. 2020). ANNs have evolved in various ways depending on the purpose of prediction. In particular, long short-term memory (LSTM) solves the vanishing gradient problem and improves performance when predicting long sequences (Kim et al. 2021). Cyanobacteria data and various weather and water-quality variables affecting cyanobacteria have a time-series distribution. LSTM is particularly effective for time-series analysis and is used widely in predicting algae (Lee and Lee 2018; Shin et al. 2019; Hill et al. 2020; Liang et al. 2020; Zheng et al. 2021). Recently, convolutional neural networks (CNNs) have been used to predict the temporal and spatial distribution of cyanobacteria (Pyo et al. 2020). Reports indicate that CNNs have advantages in understanding the features of cyanobacteria data. Accordingly, recent studies show that models combining a CNN and LSTM yield the best results compared with other neural-network-based models in cyanobacteria prediction (Naghdi et al. 2020; Cao et al. 2022).

In previous studies using CNN-LSTM to predict cyanobacteria, the focus was on utilizing the advantages of two-dimensional (2D) CNNs for image processing to extract features from cyanoHABs regions along with meteorological data, and then using LSTM for prediction. In other words, the studies aimed to forecast the spatial variation of algal blooms in lakes. Our main objective was to predict the cyanobacteria directly as a numerical value, specifically in terms of cell counts. This approach represents a significant advancement, because it allows for more-precise and quantitative predictions of cyanobacteria occurrences, which can be crucial for effective management and mitigation strategies. Comparing the results of the AI-based prediction model with those of the conventional physics-based model in regions with relatively scarce water quality and algal bloom data, such as Korea, is a critical aspect of our research. The significance of our research lies in evaluating the applicability of artificial intelligence in the fields of water quality and algal blooms, which have not been explored extensively compared with the existing evaluations focused on floods and droughts. Although the utilization of AI has been assessed in flood and drought aspects by comparison with physics-based models, our study provides a fresh perspective by focusing on water quality and algal bloom prediction.

In this study, we selected the CNN-LSTM model, a combination of a CNN and LSTM, as the AI-based deep learning model to compare with the physics-based model. When choosing the optimal model, it is crucial to consider the characteristics of the data that are to be predicted. Cyanobacteria cell counts exhibit a unique feature in which the observed data can vary drastically from 0 to as high as several hundred thousand cells/mL, unlike other variables. Recently, several studies have suggested that combining LSTM with a CNN can enhance accuracy in predicting time-series data with rapid and drastic changes (Hwang and Shin 2020). A CNN helps identify meaningful patterns and correlations in the data (Han et al. 2019), and LSTM can effectively capture temporal dependencies, thus enhancing the overall predictive performance (Wan et al. 2020; Xia et al. 2020). This combination of a CNN and LSTM allows the model to benefit from the complementary nature of these two architectures, leading to improved accuracy and more-robust predictions in time-series forecasting tasks, especially when dealing with data that exhibit rapid and drastic changes.

The primary contributions of this study are as follows:

1. In the existing research on predicting harmful algal blooms, the focus has been on predicting the spatial distribution of harmful algae. In contrast, our study improved the model to predict cyanobacteria cell counts directly.

2. The developed AI-based deep learning model was compared with the conventional physics-based model that traditionally is used in the field of cyanoHAB prediction. We advanced the evaluation process beyond comparing the predictions with the observed data by directly comparing the results of the AI-based model with those of the established physics-based model. This comprehensive comparison allowed us to thoroughly assess the applicability and limitations of the AI model in the context of predicting cyanobacteria cell counts.

3. By evaluating the performance of AI-based models in these specific domains, our research contributes to a broader understanding of the potential and limitations of artificial intelligence in cyanobacterial forecasting and management.

Data and Methods

Study Area

The Nakdong River is the longest river in South Korea, and is one of its four major rivers. It is the only river in South Korea with a developed delta, and it has many wetlands, making it ecologically important. Since 2009, following a large-scale river improvement project for flood prevention, water resource security, and water quality improvement, eight weirs have been constructed on the Nakdong River; starting at the upstream, they are the Sangju, Nakdan, Gumi, Chilgok, Gangjeong-Goryeong, Dalseong, Hapcheon-Changnyeong, and Changnyeong-Haman Weirs. The Ministry of Environment has installed monitoring stations at weir sections. It measures water quality and algal items once each week (Fig. 1). The Nakdong River has a gentle slope and a meandering shape, which results in slow flow and increased water temperature every summer, causing extensive damage from algal blooms attributable to cyanobacteria, particularly in the downstream areas.

Data Construction

Water Quality, Algae, and Meteorological Data

Water-quality, algae, and meteorological data of the Nakdong River weir sections available for predicting cyanobacteria cell counts were constructed. The data for training, validation, and prediction were collected from 2012 to 2022. After the completion of the Four Major Rivers Restoration Project, monitoring stations were established at each weir of the Nakdong River. Since 2012, the Ministry of Environment has been conducting weekly observations of water quality and algal parameters at these monitoring stations. The water-quality data included water temperature (WT) (degrees Celsius), pH, dissolved oxygen (DO) (milligrams per liter), biochemical oxygen demand (BOD) (milligrams per liter), chemical oxygen demand (COD) (milligrams per liter), suspended solids (SS) (milligrams per liter), total nitrogen (TN) (milligrams per liter), total phosphorus (TP) (milligrams per liter), and total organic carbon (TOC) (milligrams per liter). The algae data included chlorophyll-a (Chl-a) (milligrams per cubic meter) and cyanobacteria (cells per milliliter). These data can be obtained from the Water Environment Information System operated by the Ministry of Environment (n.d.).

The meteorological observation stations located near the weirs along the Nakdong River include Sangju, Gumi, Daegu, Hapcheon, and Miryang. The Korea Meteorological Administration (KMA) collects data from each of these meteorological observation stations every hour or every day. For this study, daily data were collected because their resolution is closest to that of the weekly harmful algal bloom target data. The collected meteorological data included sea level pressure (SLP) (hectopascals), daily maximum temperature (MaxTEM) (degrees Celsius), relative humidity (RH) (percent), precipitation (PCP) (millimeters), solar radiation (SR) (megajoules per square meter), and cloud cover (CC) (1/10). The data from the meteorological observation stations can be obtained by downloading the Automated Surface Observing System (ASOS) data available on the Open MET Data Portal (Korea Meteorological Administration, n.d.) operated by the KMA. The collected data have different temporal resolutions; meteorological data are available on a daily basis, and water quality and algae data are available on a weekly basis. Therefore, in this study, the daily meteorological data were transformed into a weekly resolution to match the resolution of the other data. Detailed information is given in the section “Construction of Cyanobacteria Prediction Model.”

Data Preprocessing

Time-series data must be preprocessed before they are used to construct a prediction model for cyanobacteria cell counts so that defects in the data do not degrade the prediction results. Incomplete data are cleaned first, and the data then are normalized before being input into the prediction model.

Firstly, data cleaning was conducted to identify and remove missing values or outliers in the observed data. The missing values were interpolated for the daily data, whereas entire rows corresponding to the missing dates were removed for the weekly data. This decision was made because the time intervals for weekly data were relatively large, and interpolation could lead to biased and distorted data. Therefore, it was deemed more appropriate to remove the entire rows corresponding to the missing dates in the weekly data.
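The two cleaning rules described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the use of `None` to mark a missing value, and the choice of linear interpolation for the daily series are all assumptions (the text does not state which interpolation method was used).

```python
from typing import Dict, List, Optional


def interpolate_daily(values: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate interior gaps (None) in a daily series.

    Assumption: linear interpolation; the paper does not name the method.
    Gaps at the very start or end have no neighbor on one side and stay None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            weight = (i - left) / gap
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def drop_incomplete_weeks(
    rows: List[Dict[str, Optional[float]]]
) -> List[Dict[str, Optional[float]]]:
    """Keep only weekly rows in which every variable was observed."""
    return [row for row in rows if all(v is not None for v in row.values())]
```

Dropping rather than filling weekly rows keeps the model from learning values that were never measured, at the cost of a shorter record.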

Secondly, data partitioning was performed to split the data into appropriate proportions for training and validation sets. According to empirical results, it is recommended to use approximately 70%–80% of the data for training and 20%–30% for validation (Gholamy et al. 2018).
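For time-series data, a split of this kind is usually made chronologically so that the validation period follows the training period. The sketch below assumes records are `(date, value)` tuples with ISO-formatted date strings; the function name and the record layout are hypothetical.

```python
def chronological_split(records, train_fraction=0.8):
    """Split time-ordered records into (training, validation) without shuffling.

    Assumption: each record is a tuple whose first element is an ISO date
    string (YYYY-MM-DD), so lexicographic order equals chronological order.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    ordered = sorted(records, key=lambda record: record[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

Keeping the later observations for validation avoids leaking future information into training, which a random shuffle would do.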

Lastly, the data were transformed. Each variable was normalized to align the ranges of the data and thereby obtain a model better suited to analyzing them. For example, cyanobacteria cell counts range from 0 to 100,000 units, whereas water temperature is > 35°C and TP values are between 0 and 1. Hence, each variable differs in scale, which would negatively affect the prediction results if used for training without normalization. Therefore, before training is performed, normalization must be applied to remove the scale differences between variables and bring the individual variables onto a common scale. Normalization methods include StandardScaler, RobustScaler, MinMaxScaler, and Normalizer. Cyanobacteria grow rapidly when the temperature and organic matter concentration conditions suitable for cyanobacteria growth in summer are met, with cell counts ranging from tens of thousands to millions even under the same environmental conditions. Therefore, rapidly increasing values may be regarded as outliers with respect to past cyanobacteria growth patterns. In the past, cases of massive cyanobacterial blooms exceeding hundreds of thousands accounted for only approximately 5% of total observations. However, the occurrence rate is increasing gradually due to climate change and changes in river environments. In 2022, cyanoHABs of several hundred thousand cells or more appeared in the lower Nakdong River, causing algal bloom damage. Rather than treating such data as outliers, it was necessary to retain their effect by normalizing and adjusting the scale, recognizing that these cases do occur in the actual Nakdong River environment. Additionally, the distribution characteristics of cyanobacteria that occur only in the summer should not be altered. Therefore, the MinMaxScaler method was adopted from among the several normalization methods. MinMaxScaler is a method that adjusts the scale by converting the values of each data variable with different maximum sizes to values between 0 and 1. The conversion formula is

$$\mathrm{MinMaxScaler}(x) = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$
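Eq. (1) can be written as two small functions. This is a sketch under one assumption that the text does not state: that $x_{\min}$ and $x_{\max}$ are taken from the training data and then reused for later data. The function names are illustrative.

```python
def fit_min_max(train_values):
    """Return (x_min, x_max) of the training data for use in Eq. (1)."""
    lo, hi = min(train_values), max(train_values)
    if hi == lo:
        raise ValueError("constant series: Eq. (1) would divide by zero")
    return lo, hi


def apply_min_max(values, lo, hi):
    """Apply Eq. (1): (x - x_min) / (x_max - x_min) to each value."""
    return [(v - lo) / (hi - lo) for v in values]
```

Values outside the fitted range map outside [0, 1]; for example, with a fitted range of 0 to 10, an input of 20 maps to 2.0.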

Even after normalization, the range of cyanobacteria cell counts remained wide, and the data were extremely asymmetric compared with those of other variables. Learning from such extremely asymmetric data can produce many errors in the results. In the harmful algae alert system, levels are distinguished by cell counts of 1,000, 10,000, and 100,000. Applying a log scale is recommended to reduce the asymmetry of cyanobacteria cell count data, in keeping with the characteristics of these data (KEI 2020). Therefore, in this study, the cyanobacteria cell counts were normalized by using a log scale, and the model was trained with these data.
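A log transform and the alert-level binning can be sketched as follows. Several details here are assumptions rather than statements from the paper: the base-10 logarithm, the +1 offset that keeps a count of zero finite, the mapping of the three cell counts to Levels 1 through 4, and the treatment of a count exactly equal to a threshold as belonging to the higher level.

```python
import math

# Cell-count boundaries named in the text (cells/mL).
ALERT_THRESHOLDS = (1_000, 10_000, 100_000)


def log_transform(cells_per_ml):
    """Compress the skewed cell counts; log10(x + 1) is an assumed form."""
    return math.log10(cells_per_ml + 1.0)


def inverse_log_transform(value):
    """Map a log-scale prediction back to cells/mL."""
    return 10.0 ** value - 1.0


def alert_level(cells_per_ml):
    """Return 1-4; a count at or above a threshold moves up one level."""
    level = 1
    for threshold in ALERT_THRESHOLDS:
        if cells_per_ml >= threshold:
            level += 1
    return level
```

On the log scale the three boundaries fall at roughly 3, 4, and 5, so the four levels occupy comparable intervals instead of spanning several orders of magnitude.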

Model for Cyanobacteria Simulation

Physics-Based Model: EFDC-NIER

The EFDC-NIER model is a widely used three-dimensional (3D) numerical model for analyzing hydrodynamics and water quality in various areas, such as rivers, lakes, estuaries, and oceans. It was developed by the National Institute of Environmental Research with improved environmental fluid dynamics code functions to suit domestic water conditions, and it currently is used as a water-quality prediction model for major river sections in South Korea. The EFDC-NIER model has new functions, such as weir functions for major domestic rivers, multispecies simulation for algae, vertical migration mechanisms of cyanobacteria, dormant spore formation and germination mechanisms, wind stress, and changes in nutrient release owing to changes in oxidation and reduction conditions. The EFDC-NIER model has been improved to accurately reflect flow and water-level control by using artificial hydraulic structures, such as multipurpose weirs, after the Four Major Rivers Project, thus enhancing the simulation accuracy for the changed domestic river environment. The existing EFDC model simulates algae in three separate groups (cyanobacteria, diatoms, and green algae), making the prediction of the rapid dominance and transition of specific algae species difficult. However, the EFDC-NIER model has an improved algae module that enables multispecies simulations, enabling quantitative prediction of algae occurrence, including the dominance and transition of specific algae species (Fig. 2). In this study, the EFDC-NIER model was constructed for all Nakdong River sections. Factors affecting the water balance, such as tributaries flowing into the main river, wastewater discharge from sewage treatment plants, and water intake facilities, were applied as boundary conditions to the model. For the weir sections for predicting cyanobacteria cell counts, the flow of the water body was reflected by allowing the inflow from upstream to be released downstream through the hydraulic structure module.

Fig. 2. Schematic of the reactions among multiple algal species. CHc = cyanobacteria; CHd = diatoms; CHg = green algae; CHx1–CHxn = multiple algal species; DOM = dissolved organic material; and POM = particulate organic material. [Reprinted from Ahn et al. 2021a, under CC BY 4.0 Deed Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/).]

AI-Based Deep Learning Model: CNN-LSTM

LSTM is a model proposed by Hochreiter and Schmidhuber (1997) that improves on recurrent neural networks (RNNs) by solving problems such as vanishing and exploding gradients when long-term data sets are considered. LSTM comprises forget, input, update, and output gates, and it controls data input, storage, and output through the four gates. It is used mainly to learn continuously structured data, such as for speech pattern recognition and stock price prediction. In the water environment field, it is used widely for predicting time-series data, such as runoff (Fan et al. 2020), water level (Zhang et al. 2020), precipitation (Akbari Asanjan et al. 2018), and water quality (Hu et al. 2019; Li et al. 2020).
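For reference, one common formulation of the LSTM cell is given below. The notation is assumed rather than taken from the paper: $x_t$ is the input at time step $t$, $h_t$ the hidden state, $c_t$ the cell state, $W$, $U$, and $b$ are learned weights and biases, $\sigma$ is the logistic sigmoid, and $\odot$ is elementwise multiplication. The "update" step named above is taken here to correspond to the candidate cell state $\tilde{c}_t$.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate update)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

Because $c_t$ is updated additively, gradients can pass through many time steps without shrinking as quickly as in a plain RNN.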

The CNN is an algorithm that automatically learns the features needed for recognition tasks, such as character, image, and object recognition, and that effectively absorbs morphological changes. LeCun et al. (1998) found that it demonstrated better performance than other methods in learning 2D data by successfully recognizing handwritten characters. The structure of a CNN is created by adding convolution layers and pooling layers to the basic artificial neural network (ANN) structure. The feature extraction area consists of multiple convolution and pooling layers. The convolution layer is an essential element that applies a filter (kernel) to the input data and then an activation function, extracting features (a feature map) from the input data through convolution. This is followed by the pooling layer, which reduces the input data size and emphasizes specific features. This significantly saves learning time and reduces overfitting, improving the CNN’s image-data-processing ability. In the last part of the CNN, a fully connected layer is added for image classification. A flatten layer, which converts the image data into an array format, is included between the image feature extraction and classification parts, and a softmax layer normalizes the values. CNNs are neural network models built around the convolution operation and are used primarily in deep learning for processing image and video data. Although CNNs typically extract features from 2D layers (2D CNN) for images, they also can extract features from nonimage time-series data.

Construction of Cyanobacteria Prediction Model

Various data related to algal blooms, such as cyanobacteria, meteorological, and water-quality data, have time-series characteristics. To accurately predict cyanoHABs, the influence of data from the distant past must be considered. An LSTM model can analyze long-term changes over time because it addresses the long-term dependency problem, in which the correlation and prediction power weaken as the distance between the input and output increases. Therefore, the LSTM is one of the optimal models for predicting cyanobacteria. Because cyanobacteria are living organisms, accurate prediction requires a precise understanding of the features between related data and consideration of spatiotemporal characteristics. A one-dimensional CNN (1D CNN) uses convolution kernels to automatically extract features along the time direction that are not otherwise visible, making it suitable for time-series applications. Zhao et al. (2021) reported that as input data pass through the layers of the CNN, they are refined into LSTM model input values that are more sensitive to time-series information.

Therefore, in this study a cyanobacteria prediction model was constructed by combining the CNN and LSTM. The first part of the CNN-LSTM model is a CNN model consisting of convolution layers and max-pooling layers. Meteorological, water-quality, and cyanobacteria data are input into the convolution layers. Convolution operations are performed on the convolution kernel weights and local sequence segments of input information to obtain a preliminary feature matrix. The feature matrix calculated from the previous convolution then is input in the pooling layers. A pooling window is slid over the sequence, taking the maximum value in each sliding window to output a more distinctive matrix. Because the input data here were 1D time-series data, a 1D CNN was constructed using the flatten function, which converted the 2D array output format in the CNN to a 1D format, and local features of the time-series data were extracted. To align the resolution of the data, daily data were transformed into weekly data. The daily weather data were passed through the CNN’s pooling layer to convert them into weekly data, taking into account the average weekly features present in the data. In the LSTM part, the LSTM architecture was configured to receive the features extracted from the CNN as input. Here, the training data were configured with the other gates of the LSTM model so that connections could be discovered in the input and output sequences. The architecture was configured to model time-series data using stacking of the LSTM network, and the cyanobacteria cell counts prediction results were output through the dense layer (Fig. 3).
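The convolution and max-pooling steps of the CNN front end can be illustrated on a single sequence in plain Python. This is a sketch of the operations only, not the model in Fig. 3: the kernel weights, the kernel size, and the pooling window of 7 (chosen to show how a daily series shrinks to a weekly one) are assumptions, and in a trained network the kernel weights are learned.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Valid (no padding) 1D convolution as used in CNN layers.

    Each output is the weighted sum of one local segment of the sequence.
    """
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, sequence[i:i + k])) + bias
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    """Rectified linear activation applied elementwise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, window, stride=None):
    """Slide a window over the sequence and keep the maximum of each window."""
    stride = stride or window
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, stride)
    ]
```

For instance, `max_pool1d(list(range(14)), 7)` reduces 14 daily values to 2 weekly values. The pooled features would then be passed, one time step at a time, to the LSTM layers.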

Fig. 3. CNN-LSTM architecture for predicting harmful algal blooms.

Results and Discussion

Determining Major Factors Affecting Cyanobacteria Cell Counts

Variable selection is important in data-driven learning. Variables with low importance in prediction can add uncertainty to the prediction model and consequently degrade the performance of the model (Kuhn and Johnson 2013). In this study, correlation was analyzed using a data heatmap with the Seaborn library. Generally, a Pearson correlation coefficient of 0.4 or higher indicates some degree of correlation (Akoglu 2018). A low Pearson coefficient does not establish that two variables are independent, because they may be related nonlinearly; in that case, determining the effect on the prediction results and controlling the influence of each variable when constructing a model is difficult. To address this, a data set with correlated variables must be constructed to increase prediction accuracy.
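The screening step can be sketched with a hand-written Pearson coefficient. The function names are illustrative, and comparing the absolute value of the coefficient with the 0.4 threshold is an assumption, made so that strongly negative correlations also qualify.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def select_features(features, target, threshold=0.4):
    """Return {name: r} for features with |r| >= threshold (assumed rule)."""
    selected = {}
    for name, series in features.items():
        r = pearson_r(series, target)
        if abs(r) >= threshold:
            selected[name] = r
    return selected
```

A feature with a constant series yields `nan`, which fails the comparison and is therefore excluded.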

In this study, the water temperature, DO, COD, TN, TOC, maximum temperature, and RH were found to be highly correlated with the cyanobacteria cell counts (Table 1). Therefore, these were selected as the main limiting factors affecting cyanobacteria occurrence. The water temperature had the most significant impact on cyanobacteria occurrence. A high water temperature aided the growth of cyanobacteria in the Nakdong River and contributed to the formation of cyanoHABs through large cyanobacteria blooms (Ha et al. 1998; Hur et al. 2013). Heavy summer rainfall results in large amounts of nonpoint pollutants, such as nitrogen and phosphorus, in the Nakdong River. This maintains a high nutrient concentration in the Nakdong River, which helps form cyanoHABs (Park et al. 2021).

Table 1. Correlation of water quality and meteorological data with cyanobacteria cell counts at each weir

Variable	SJ	ND	GM	CG	GG	DS	HC	CH
WT	0.59	0.68	0.73	0.68	0.69	0.70	0.65	0.63
pH	0.10	0.05	-0.02	-0.10	-0.16	-0.05	0.01	0.15
DO	-0.55	-0.60	-0.63	-0.66	-0.70	-0.65	-0.56	-0.47
COD	0.34	0.28	0.31	0.22	0.19	0.13	0.24	0.41
SS	0.17	0.03	0.04	-0.03	-0.06	-0.01	-0.05	0.01
TN	-0.39	-0.52	-0.54	-0.64	-0.70	-0.66	-0.65	-0.70
TP	0.17	0.16	0.18	0.07	0.13	-0.03	0.04	0.01
TOC	0.27	0.24	0.25	0.15	0.14	0.09	0.17	0.43
Chl-a	0.21	0.06	0.04	-0.17	-0.25	-0.27	-0.15	0.00
SLP	-0.26	-0.36	-0.37	-0.36	-0.37	-0.39	-0.36	-0.37
MaxTEM	0.50	0.58	0.59	0.55	0.53	0.56	0.52	0.52
RH	0.35	0.39	0.44	0.37	0.35	0.41	0.46	0.43
PCP	0.06	0.15	0.09	0.08	0.02	0.06	0.12	0.05
SR	0.04	0.04	0.04	0.06	0.03	0.02	-0.05	-0.01
CC	0.02	0.24	0.23	0.21	0.20	0.24	0.28	0.24

Note: SJ = Sangju; ND = Nakdan; GM = Gumi; CG = Chilgok; GG = Gangjeong-Goryeong; DS = Dalseong; HC = Hapcheon-Changnyeong; and CH = Changnyeong-Haman.

Definition of Prediction Model Parameters

The aim of this study was to construct a prediction model for cyanobacteria cell counts using a CNN-LSTM model. Optimal hyperparameters must be selected to achieve the highest accuracy. Hyperparameters are parameters that influence the training of a model, such as the learning rate, hidden size, number of layers (num_layers), and dropout rate. Adjusting hyperparameter values is crucial to improve the model’s performance. However, manually tuning these hyperparameters requires significant time and effort, and finding the optimal values can be challenging. Therefore, various techniques such as Optuna, Hyperopt, KerasTuner, and so forth, recently have been used to automatically search for the optimal hyperparameter values.

In this study, the hyperparameters for the cyanobacteria cell count prediction model that yielded the highest test score among all possible combinations of the learning rate, hidden size, num_layers, and dropout rate were selected. We applied Optuna as the hyperparameter tuning technique to optimize the model parameters. Optuna is an algorithm that automatically optimizes various parameters. It automatically finds the optimal values of parameters required for training, helping to enhance learning efficiency. Optuna is based on a Bayesian optimization algorithm, which makes it more efficient in searching for optimal hyperparameters compared with other algorithms. It offers support for various optimization algorithms, including tree-structured Parzen estimator (TPE), evolution strategies (ES), and random search, among others. Additionally, Optuna can be used seamlessly with various popular libraries such as PyTorch, TensorFlow, and scikit-learn, providing flexibility and compatibility for researchers and practitioners. This versatility and efficiency make Optuna a powerful tool for hyperparameter tuning in machine learning models. Therefore, Optuna has been used widely in various AI-based deep learning models, including CNN-LSTM, to select the best parameters (Akiba et al. 2019; Ekundayo 2020). We applied the ReduceLROnPlateau scheduler as one of the important elements during training. This scheduler reduces the learning rate by 5% when the training results do not improve. By adjusting the learning rate of the optimizer dynamically using the scheduler, we initially applied a higher learning rate to facilitate faster convergence. As the training progresses, the scheduler gradually reduces the learning rate, allowing for more-precise adjustments and better convergence. This approach helps to achieve better optimization results during the training process. 
The results of hyperparameter tuning, optimal learning rate, hidden size, num_layers, and dropout rate for the model are presented in Table 2.
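The learning-rate schedule described above can be illustrated with a minimal pure-Python sketch. It is not the PyTorch implementation of ReduceLROnPlateau; the class name `PlateauScheduler`, the `patience` value, and the loss sequence are assumptions made for illustration. Only the reduction factor of 0.95 (a 5% reduction) and the starting learning rate of 0.00075 (Table 2) come from this paper.

```python
class PlateauScheduler:
    """Reduce the learning rate when a monitored loss stops improving."""

    def __init__(self, lr, factor=0.95, patience=2):
        self.lr = lr
        self.factor = factor      # 0.95 corresponds to the 5% reduction in the text
        self.patience = patience  # assumed value; not reported in the paper
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return the (possibly reduced) learning rate."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# Hypothetical validation losses: improvement stalls after the third epoch.
losses = [1.00, 0.80, 0.79, 0.81, 0.80, 0.82, 0.78]
scheduler = PlateauScheduler(lr=0.00075)
history = [scheduler.step(loss) for loss in losses]
```

In this sketch the rate stays at 0.00075 until three consecutive epochs fail to improve on the best loss, and is then multiplied by 0.95 once.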

Table 2. Results of hyper-parameters for CNN-LSTM models

Hyperparameter	Search range	Optimal hyperparameter of CNN-LSTM
Learning rate	0.00001–0.1	0.00075
Hidden size	1–128	2
Num_layers	1–64	1
Dropout_rate	0.0–0.5 (0.1)	0.2
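To make the search ranges in Table 2 concrete, the following sketch draws candidate hyperparameters from them with a plain random search. This is a stand-in written with the Python standard library, not Optuna and not the search used in this study. The log-uniform sampling of the learning rate, the reading of "(0.1)" as a step size for the dropout rate, and the function `toy_score` are assumptions; in practice the score would come from training and testing the CNN-LSTM model.

```python
import math
import random


def sample_params(rng):
    """Draw one candidate from the Table 2 search ranges."""
    return {
        # Log-uniform over 0.00001-0.1 (the sampling scale is an assumption).
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "hidden_size": rng.randint(1, 128),
        "num_layers": rng.randint(1, 64),
        # "(0.1)" in Table 2 is read here as a step of 0.1 (assumption).
        "dropout_rate": rng.choice([0.0, 0.1, 0.2, 0.3, 0.4, 0.5]),
    }


def toy_score(params):
    """Hypothetical score; a real objective would train and test the model."""
    lr_term = abs(math.log10(params["learning_rate"]) - math.log10(0.00075))
    return -(lr_term + abs(params["dropout_rate"] - 0.2))


def random_search(n_trials, seed=0):
    """Keep the candidate with the highest score over n_trials draws."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        score = toy_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(n_trials=50)
```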

Results of Prediction Model Training and Validation

The input variables included the water temperature, DO, COD, TN, TOC, maximum temperature, and RH, and the output variable was cyanobacteria cell counts. The model parameters were constructed using the optimal parameters described in the section “Definition of Prediction Model Parameters.” The water quality and cyanobacteria cell count data were based on weekly data, whereas the meteorological data were based on daily data. Therefore, the resolution of each variable had to be transformed for consistency. In this study, a cyanobacteria cell count prediction model was constructed by converting daily meteorological data to weekly averages and applying consistent resolutions between variables.
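The resolution matching described above can be sketched as follows. The example averages daily values within each ISO calendar week; the function name `weekly_means`, the use of ISO weeks as the averaging window, and the temperature values are assumptions for illustration, because the paper does not state how the weekly windows were aligned with the sampling dates.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_means(daily):
    """Average daily (date, value) records within each ISO calendar week."""
    buckets = defaultdict(list)
    for day, value in daily:
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(value)  # key: (ISO year, ISO week)
    return {week: sum(vals) / len(vals) for week, vals in sorted(buckets.items())}


# Hypothetical daily maximum temperatures (deg C) for the two weeks
# starting on Monday, 2 May 2022.
start = date(2022, 5, 2)
daily_tmax = [(start + timedelta(days=i), 20.0 + i) for i in range(14)]
weekly = weekly_means(daily_tmax)
```

Each weekly mean can then be paired with the weekly water quality and cell count record for the same week.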

The exact ratio for splitting the training and validation sets is not defined clearly, but having a training set that is too small may not provide enough data for the algorithm to learn effectively, whereas a small validation set can make it difficult to have confidence in the model. Based on the empirical results related to the ratio (Gholamy et al. 2018) and considering the limited weekly data available, we decided to use an 8:2 ratio for the model set, with 80% of the data allocated for training and 20% for validation. For this reason, data from 2012 to 2019 for these sites were used as training data, and data from 2020 to 2021 were used as a validation set for the trained model. The model was trained reasonably well for eight weir points in the Nakdong River, and when validated using the test data from 2020 to 2021, it accurately predicted the occurrence of cyanobacteria (Table 3). The accuracy of the model’s training and validation was evaluated using the coefficient of determination ($R^{2}$), mean absolute error (MAE), and root-mean-square error (RMSE). Detailed information is presented in the Appendix.
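The chronological 8:2 split can be sketched as below. The record layout (one placeholder record per week, 52 weeks per year) is hypothetical; only the year boundaries, 2012 to 2019 for training and 2020 to 2021 for validation, are taken from the text.

```python
def split_by_year(records, last_train_year=2019):
    """Split records chronologically so that validation data follow training data."""
    train = [r for r in records if r["year"] <= last_train_year]
    valid = [r for r in records if r["year"] > last_train_year]
    return train, valid


# Hypothetical weekly records for 2012-2021 (52 per year, no gaps).
records = [{"year": y, "week": w} for y in range(2012, 2022) for w in range(1, 53)]
train, valid = split_by_year(records)
train_fraction = len(train) / len(records)
```

Splitting by year rather than at random keeps every validation week later in time than every training week, which suits time-series prediction.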

Table 3. Results of training and validation for CNN-LSTM model

Parameter	Weir
Parameter	SJ	ND	GM	CG	GG	DS	HC	CH
Train $R^{2}$	0.88	0.76	0.93	0.85	0.89	0.95	0.85	0.90
MAE_Train	0.4	0.5	0.3	0.4	0.4	0.2	0.4	0.3
RMSE_Train	0.5	0.7	0.4	0.5	0.5	0.3	0.6	0.4
Test $R^{2}$	0.77	0.55	0.83	0.71	0.77	0.87	0.73	0.72
MAE_Test	0.5	0.6	0.4	0.5	0.4	0.3	0.5	0.3
RMSE_Test	0.6	0.9	0.5	0.7	0.7	0.4	0.7	0.4

Note: SJ = Sangju; ND = Nakdan; GM = Gumi; CG = Chilgok; GG = Gangjeong-Goryeong; DS = Dalseong; HC = Hapcheon-Changnyeong; and CH = Changnyeong-Haman.

Comparison of the Results from the Two Models

The performance of a deep-learning-based AI algorithm was compared with that of the currently used EFDC-NIER model for predicting cyanobacteria cell counts. To compare the prediction performance, the cyanobacteria cell counts for the eight weirs of the Nakdong River in 2022 were predicted (Fig. 4; Tables 4 and 5). In Korea, cyanobacteria cell counts are divided into four sections, called Harmful Algal Bloom (HAB) alert levels, and management measures for cyanoHABs are established accordingly. To evaluate the performance of the CNN-LSTM model, comparing the prediction accuracy at each level was considered to be the most suitable method for assessing the model’s applicability. The comparison of the level-by-level prediction accuracy showed that the CNN-LSTM model performed better than the EFDC-NIER model in most cases, except for Level 4 (>100,000 cells/mL) (Tables 6 and 7). By learning the characteristics of surrounding data and time-series patterns, the CNN-LSTM model simulated the observed cyanobacteria cell counts at a level very similar to the actual data and demonstrated better simulation performance than the EFDC-NIER model.
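A level-by-level comparison of this kind can be sketched as follows. The Level 3 and Level 4 thresholds (10,000 and 100,000 cells/mL) are those given in this paper; the boundary of 1,000 cells/mL between Levels 1 and 2, and the definition of accuracy as the share of observations whose predicted level equals the observed level, are assumptions. The sample values are a few Sangju (SJ) rows from Table 4.

```python
def alert_level(cells_per_ml):
    """Map a cyanobacteria cell count (cells/mL) to an HAB alert level."""
    if cells_per_ml > 100_000:
        return 4
    if cells_per_ml >= 10_000:
        return 3
    if cells_per_ml >= 1_000:  # assumed Level 1/Level 2 boundary
        return 2
    return 1


def accuracy_by_level(observed, predicted):
    """Percentage of observations per observed level whose predicted level matches."""
    hits, totals = {}, {}
    for obs, pred in zip(observed, predicted):
        level = alert_level(obs)
        totals[level] = totals.get(level, 0) + 1
        hits[level] = hits.get(level, 0) + (alert_level(pred) == level)
    return {level: 100.0 * hits[level] / totals[level] for level in sorted(totals)}


# SJ rows for 5/2, 5/30, 6/13, 7/13, 8/1, and 9/14 (Table 4): OBS and CNN-LSTM.
observed = [0, 128, 55, 1_168, 6_223, 13_701]
predicted = [4, 61, 2_127, 1_317, 1_715, 1_023]
accuracy = accuracy_by_level(observed, predicted)
```

For these six rows, two of the three Level 1 observations and both Level 2 observations are assigned the correct level, whereas the single Level 3 observation is underpredicted.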

Fig. 4. Results from two models for (a) Sangju weir; (b) Nakdan weir; (c) Gumi weir; (d) Chilgok weir; (e) Gangjeong-Goryeong weir; (f) Dalseong weir; (g) Hapcheon-Changnyeong weir; and (h) Changnyeong-Haman weir.

Table 4. Cyanobacteria cell count prediction results of CNN-LSTM model and EFDC-NIER model

Date	Weir
	SJ			ND			GM			CG
	OBS	EFDC-NIER	CNN-LSTM	OBS	EFDC-NIER	CNN-LSTM	OBS	EFDC-NIER	CNN-LSTM	OBS	EFDC-NIER	CNN-LSTM
5/2	0	0	4	0	0	19	0	0	17	0	0	27
5/9	0	2	3	8	1	31	0	9	36	0	37	13
5/16	0	0	6	0	0	147	0	0	361	0	0	13
5/23	0	0	5	185	0	2,185	5,970	1	3,343	107	0	140
5/30	128	0	61	148	5	105	239	1	139	128	2	145
6/7	55	111	78	1,055	78	798	1,633	683	2,125	2,017	5,356	1,872
6/13	55	461	2,127	3,750	383	3,133	3,766	2,683	6,975	6,095	3,082	8,323
6/20	40	183	3,825	11,200	749	2,509	24,086	9,010	18,620	127,406	8,964	805,749
6/27	345	1,152	592	4,581	7,176	2,861	2,236	5,491	7,851	6,862	29,806	7,303
7/4	445	6,129	175	61,780	4,150	2,505	13,506	8,230	43,223	6,568	29,409	4,658
7/13	1,168	1,045	1,317	58,020	4,155	1,747	21,505	27,410	62,620	13,315	14,169	5,438
7/18	602	2,001	1,166	7,919	3,435	1,041	66,134	11,637	6,990	14,774	37,752	7,339
7/25	150	4,306	399	15,988	8,034	1,811	132,470	20,888	16,352	1,990	69,570	3,145
8/1	6,223	1,950	1,715	607	3,587	3,515	20,348	11,244	2,569	545	39,744	1,795
8/8	3,254	6,517	1,355	1,969	11,100	2,628	2,009	39,383	972	4,701	35,183	4,801
8/22	1,607	3,160	388	1,829	4,240	169	1,944	4,920	1,354	1,135	2,920	1,138
8/29	7,629	3,750	1,139	2,551	2,800	1,375	959	1,340	1,167	385	2,380	788
9/14	13,701	400	1,023	14,835	300	10	17,363	300	2,740	9,752	200	2,227
9/20	8,264	1,100	1,006	21,500	2,175	2	9,785	16,450	1,303	863	11,250	412
9/26	16,959	867	1,447	14,558	1,333	999	12,415	1,567	7,839	181	4,866	335
10/4	2,392	3,633	526	8,449	3,233	1,292	1,503	6,183	397	1,638	6,366	444
10/11	560	2,967	112	801	3,000	879	1,275	3,750	279	441	3,100	359
10/17	180	2,767	6	1,288	4,767	380	1,752	2,867	531	520	2,200	183
10/24	140	175	6	790	425	12	1,306	237	340	625	1,250	395
10/31	25	175	6	57	175	7	1,318	150	384	5,784	625	1,142

Note: SJ = Sangju; ND = Nakdan; GM = Gumi; CG = Chilgok; and OBS = observed.

Table 5. Cyanobacteria cell count prediction results of CNN-LSTM model and EFDC-NIER model

Date	Weir
	GG			DS			HC			CH
	OBS	EFDC-NIER	CNN-LSTM	OBS	EFDC-NIER	CNN-LSTM	OBS	EFDC-NIER	CNN-LSTM	OBS	EFDC-NIER	CNN-LSTM
5/2	0	0	25	0	0	12	0	0	5	0	0	55
5/9	0	1	30	0	0	13	0	38	5	31	76	59
5/16	0	0	37	0	0	11	0	0	4	204	0	313
5/23	267	0	548	0	0	11	541	0	853	872	0	1,918
5/30	514	2	650	169	0	243	697	0	392	1,472	1,477	977
6/7	2,296	1,548	1,802	8,752	1,188	8,964	16,916	4,016	18,834	4,708	4,443	4,939
6/13	13,672	5,103	11,996	38,572	2,513	28,011	65,232	5,608	17,064	15,437	5,799	22,175
6/20	80,961	36,236	29,444	50,852	27,507	28,030	94,713	23,687	22,115	71,967	21,497	35,521
6/27	79,986	133,028	32,637	33,649	87,252	18,367	43,827	71,504	17,212	43,497	95,665	29,981
7/4	58,084	62,795	22,075	14,126	92,185	2,820	42,050	112,062	16,390	79,767	109,144	35,349
7/13	71,515	33,377	25,532	47,348	53,568	9,042	37,922	31,810	15,464	69,281	46,084	35,286
7/18	114,735	47,381	37,083	53,203	48,147	14,964	123,340	48,462	30,484	40,308	44,735	27,974
7/25	98,595	74,945	31,988	348,349	107,813	25,359	165,744	76,044	32,906	151,711	150,714	41,458
8/1	29,454	68,492	20,926	5,800	133,458	2,353	134,195	121,718	25,450	93,310	102,560	38,217
8/8	32,203	58,350	16,931	11,363	97,967	8,614	148,645	74,667	33,719	134,373	94,733	41,914
8/22	6,321	1,260	5,421	500	9,160	945	635	12,860	1,542	5,382	22,480	5,138
8/29	1,030	1,540	1,561	1,425	8,820	2,624	2,680	7,360	4,747	6,433	9,680	11,992
9/14	1,963	333	912	1,362	1,067	1,394	194	1,167	139	993	2,100	404
9/20	1,166	16,200	847	3,244	5,250	2,749	1,509	2,175	1,662	3,264	2,025	1,099
9/26	240	5,067	437	673	5,133	1,072	1,512	3,433	1,484	3,007	1,367	4,514
10/4	176	4,533	281	2,753	2,867	1,745	1,486	1,733	168	6,634	2,133	3,657
10/11	2,510	1,267	1,065	417	433	576	3,664	233	1,866	5,407	800	2,615
10/17	2,596	2,167	858	2,175	1,900	1,088	4,855	1,433	179	10,476	3,933	5,476
10/24	1,535	825	736	2,255	1,300	1,612	1,413	2,425	68	1,808	2,525	764
10/31	5,031	350	1,580	4,557	375	2,069	1,359	1,300	66	1,559	1,450	695

Note: GG = Gangjeong-Goryeong; DS = Dalseong; HC = Hapcheon-Changnyeong; CH = Changnyeong-Haman; and OBS = observed.

Table 6. Prediction accuracy of EFDC-NIER by HAB alert level (%)

Alert level	Weir								Average
Alert level	SJ	ND	GM	CG	GG	DS	HC	CH	Average
Level 1	44	31	38	26	48	39	42	32	39
Level 2	63	67	67	33	56	61	68	39	56
Level 3	14	6	37	71	59	39	52	28	38
Level 4	—	—	17	0	0	100	50	75	49
Average	49	36	46	35	52	49	53	41	45

Note: SJ = Sangju; ND = Nakdan; GM = Gumi; CG = Chilgok; GG = Gangjeong-Goryeong; DS = Dalseong; HC = Hapcheon-Changnyeong; and CH = Changnyeong-Haman.

Table 7. Prediction accuracy of CNN-LSTM by HAB alert level (%)

Alert level	Weir								Average
Alert level	SJ	ND	GM	CG	GG	DS	HC	CH	Average
Level 1	81	78	80	92	100	86	100	80	87
Level 2	83	75	46	89	50	100	50	56	69
Level 3	0	0	43	0	100	57	100	88	48
Level 4	—	—	0	100	0	0	0	0	16.7
Average	75	54	50	83	79	79	67	67	69

Note: SJ = Sangju; ND = Nakdan; GM = Gumi; CG = Chilgok; GG = Gangjeong-Goryeong; DS = Dalseong; HC = Hapcheon-Changnyeong; and CH = Changnyeong-Haman.
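The paper does not state how the Average column was computed. One simple reading, an unweighted mean over the weirs that have observations at a given level, is sketched below for the Level 1 and Level 4 rows of Table 7. The function name `row_average` and the use of `None` for weirs without data are assumptions.

```python
def row_average(values):
    """Unweighted mean over weirs, skipping weirs with no data (None)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


# Table 7 rows (CNN-LSTM), in weir order SJ, ND, GM, CG, GG, DS, HC, CH.
level_1 = [81, 78, 80, 92, 100, 86, 100, 80]
level_4 = [None, None, 0, 100, 0, 0, 0, 0]

level_1_avg = round(row_average(level_1))     # 87
level_4_avg = round(row_average(level_4), 1)  # 16.7
```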

Discussion

Limitation of the CNN-LSTM Model

The comparison of the two models showed that the CNN-LSTM model outperformed the EFDC-NIER model overall. However, the CNN-LSTM prediction had a major limitation. The predictive power was significantly low for Level 4 (>100,000 cells/mL). If the HAB alert is at Level 3 (10,000–100,000 cells/mL) or Level 4, cyanoHABs have occurred and an alert has been issued. At Level 4, special management is required owing to the severe cyanoHAB levels. From this perspective, the lower predictive power of the CNN-LSTM model in Level 4 compared with the existing EFDC-NIER model can be considered to be a significant limitation in terms of managing cyanoHABs.

The first cause of this limitation is overfitting. To mitigate overfitting, a dropout rate of 0.2 was set in this study (Shah et al. 2018). However, the amount of training data for Level 4 was far smaller than that for the other levels. Consequently, although highly accurate training results were obtained, the accuracy in the prediction interval decreased because of overfitting.

The second cause is that the pattern of cyanobacteria occurrence in 2022 was different from that in the past. In 2022, a severe drought and heat wave occurred in the Nakdong River Basin. In particular, Gyeongsangbuk-do and Gyeongsangnam-do, where the Nakdong River is located, experienced the lowest number of rainy days in the last 10 years, according to the Korea Meteorological Administration’s climate information portal.

As a result, the worst cyanoHABs since 2018 occurred in 2022. In particular, for the first time since monitoring began, Gumi Weir in the upstream area of the Nakdong River had cyanobacteria cell counts of more than 100,000 cells/mL in 2022. This is a substantially different phenomenon from past patterns, in which a large amount of cyanobacteria was not observed in the upper reaches.

The CNN-LSTM is a technique for predicting the future by learning past time-series patterns. Therefore, the predictive power may be lower for future patterns that have not occurred in the past. Deep-learning models, such as the CNN-LSTM, are black box models in which the AI autonomously extracts the desired information from complex multidimensional data without the user specifying how (Sengupta et al. 2020). Understanding the learning process of these deep-learning algorithms is difficult, which makes it challenging to identify the causes that affect uncertainty. By contrast, physics-based models, such as the EFDC-NIER model, can simulate HABs using well-defined equations and mathematical theories.

Applicability of AI-Based Deep Learning Models in the Prediction of Cyanobacteria Cell Counts

Although the CNN-LSTM model has limitations, it demonstrated excellent performance in predicting cyanobacteria cell counts. It achieved results that are comparable to or slightly better than those of physics-based models, such as the EFDC-NIER model, while significantly reducing the analysis time. In general, cyanobacteria cell counts are predicted by collecting and analyzing real-time measurement data in the morning; forecasters then make a prediction in the afternoon based on the results. However, the construction of input data for the EFDC-NIER model requires a considerable amount of time, and the analysis takes several hours or more. Moreover, human errors or incorrect model designs can lead to situations in which the results cannot be produced within the deadline. The CNN-LSTM model has the advantage of a prediction time of only a few minutes. Therefore, even if an error is present in the input data configuration, the analysis can be rerun easily. In addition, CNN-LSTM models can be run easily even by nonexperts. The EFDC-NIER and other physics-based models typically rely heavily on the model’s structure and parameters (Jiang et al. 2018; Su et al. 2018). Therefore, building such models and adjusting the parameters require substantial expertise. In contrast, the input data for data-driven models, such as CNN-LSTM, are easy to construct, which makes these models accessible to the general public.

Proposing a Direction for the Research on AI-Based Deep Learning Models in the Prediction of Cyanobacteria Cell Counts

The recent trend of the various developments and advancements in AI algorithms has revealed that the application of deep-learning-based AI technology in the management of cyanoHABs is inevitable. This paper presents both the positive aspects and limitations of deep-learning-based AI technology. Measures that can be taken to overcome the limitations and apply deep-learning-based AI technology to cyanoHABs management are as follows.

Through comparison with the results of the physics-based model, the AI-based deep learning model demonstrated the capability to construct rapid models and predict cyanobacteria cell counts based on data statistics even in situations with limited prior knowledge. In situations in which there are sufficient data, the AI-based deep learning model has shown comparable or better results than the physics-based model, especially in cases with low cyanobacteria cell counts. However, in cases in which data are limited, the AI-based deep learning model demonstrated lower predictive performance for new data, because it may not capture the relationships between input and output variables sufficiently, particularly in cases with high cyanobacteria cell counts. The findings of this study provide evidence that there is a need for the complementary use of the physics-based model and the AI-based deep learning model.

AI-based deep learning models allow for rapid model construction and predictions based on data statistics even in situations with limited prior knowledge. However, they can be challenging to interpret physically, and their reliability may be dependent on the conditions included in the training data, exhibiting a tendency to rely heavily on the provided data (von Stosch et al. 2014). Conversely, physics-based models, grounded in first principles, focus on learning interpretable causal relationships between input and output variables, aiming to understand the physical world. These models significantly advance the understanding of various phenomena in fields such as material balances, fluid dynamics, and reactor kinetics, thereby addressing the limitations of AI-based deep learning models mentioned earlier. Physics-based models offer the advantage of providing a deeper understanding of the physical world, and can complement the drawbacks associated with AI-based deep learning models (Daw et al. 2021). Physics-based models also may face challenges in model construction due to insufficient prior knowledge or complex mechanisms (von Stosch et al. 2014; Jeong et al. 2023). The drawbacks associated with physics-based models can be complemented by AI-based deep learning models. Therefore, AI-based deep learning models and physics-based models possess distinct characteristics and have the potential to mutually complement each other’s limitations, because they offer unique strengths that can address the weaknesses of the other (Aykol et al. 2021).

To overcome the limitations of AI-based deep learning models, a general solution is to collect and construct more input data for the model. This is not an easy task, considering the substantial budget and effort required to collect and observe data. However, the concept of hybridization, which combines the strengths of AI-based deep learning models and physics-based models, makes it possible to address these challenges. A well-calibrated physics-based model can predict cyanobacteria cell counts accurately at Level 4 because of its internal mechanisms and various parameters, even with limited data. In addition, it can be used to analyze scenarios that did not exist in the past, such as dam and weir operations in rivers, virtual climate scenarios, and water intake from rivers. If the simulated results are used in an AI-based deep learning model, they can compensate for the shortage of input data and partially solve the problem of overfitting. This approach, which has been proposed widely in recent research, takes advantage of the clear boundaries between AI-based deep learning models and physics-based models (Vo et al. 2022; Jung et al. 2023). As a result, it allows for the application of existing methodologies specific to each model type. This hybrid method harnesses the unique strengths of both approaches, enabling a more versatile and effective solution that can accommodate different modeling requirements and enhance overall predictive capabilities (von Stosch et al. 2014; Aykol et al. 2021). The fusion of these models can lead to a more robust and comprehensive approach, offering improved predictive capabilities and a better understanding of complex systems while overcoming data limitations in certain scenarios.
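The idea of supplementing scarce Level 4 observations with physics-based simulation results can be sketched as follows. Everything in this example is hypothetical: the function `augment_level4`, the cap on the number of simulated samples, the `source` tag, and all cell counts. It is only one possible way to organize such a data set, not a procedure used in this study.

```python
def augment_level4(observed, simulated, threshold=100_000, max_ratio=1.0):
    """Append simulated Level 4 samples to the observed training set.

    Simulated samples are tagged so they can later be weighted or excluded,
    and their number is capped relative to the size of the observed set.
    """
    extra = [s for s in simulated if s["cells"] > threshold]
    cap = int(max_ratio * len(observed))
    tagged = [dict(s, source="simulated") for s in extra[:cap]]
    return [dict(o, source="observed") for o in observed] + tagged


# Hypothetical cell counts (cells/mL).
observed = [{"cells": 500}, {"cells": 12_000}, {"cells": 150_000}]
simulated = [{"cells": 80_000}, {"cells": 120_000}, {"cells": 300_000}]

training = augment_level4(observed, simulated)
n_level4 = sum(1 for s in training if s["cells"] > 100_000)
```

Tagging the origin of each sample keeps the option of checking how strongly the trained model depends on the simulated data, which matters if the physics-based model is not well calibrated.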

A direct combination of the two models, not just a simple fusion concept, also can serve as an alternative to overcome limitations. This hybrid model, by integrating both models directly, has the potential to address the drawbacks effectively. Because examples of combining hydrological analysis with deep learning exist in the field of hydrology (Maraun et al. 2017; De Luca et al. 2020), combining the mechanisms of cyanobacterial growth, proliferation, and extinction with deep-learning-based AI algorithms seems possible. In this concept, the hybrid model can incorporate a loss function that reflects the physical laws present in the physics-based model. This enables the AI-based deep learning model’s predictions to be controlled, ensuring that they do not violate the physical laws (Wang et al. 2019). One of the key advantages of this approach is that it allows for the interpretation of predictive results in a physically meaningful way, leveraging the benefits of both models.
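A loss function of this kind can be sketched as a data term plus a weighted penalty for predictions that violate a physical constraint. The constraint used here, a maximum week-to-week change in log10 cell counts, is a hypothetical placeholder and is not taken from the EFDC-NIER model; `max_step`, `weight`, and the sample values are likewise assumptions.

```python
def physics_guided_loss(pred, obs, max_step=2.0, weight=0.5):
    """Mean squared error plus a penalty for implausibly fast changes.

    pred and obs are weekly series of log10 cell counts. The penalty grows
    when consecutive predictions differ by more than max_step log10 units.
    """
    n = len(pred)
    data_loss = sum((p - o) ** 2 for p, o in zip(pred, obs)) / n

    steps = [abs(b - a) for a, b in zip(pred[:-1], pred[1:])]
    excess = [max(0.0, s - max_step) for s in steps]
    penalty = sum(e ** 2 for e in excess) / len(excess) if excess else 0.0

    return data_loss + weight * penalty


# Hypothetical log10 cell counts for three consecutive weeks.
obs = [2.0, 3.0, 4.0]
pred = [2.0, 3.0, 6.0]  # last step jumps by 3.0, exceeding max_step
loss = physics_guided_loss(pred, obs)
```

Minimizing such a loss steers the network toward predictions that fit the data while staying within the chosen physical bound; in a real hybrid model the penalty would encode the growth and loss processes of the physics-based model.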

However, when applying this hybrid concept, if the accuracy of the predictions from the physics-based model is not sufficiently ensured, using them as inputs for the AI-based deep learning model can lead to increased uncertainty in the results. Therefore, to ensure the accuracy and reliability of the results, further research is needed on the calibration of parameters in the physics-based model. Additionally, efficient deep learning algorithm structures need to be developed to incorporate the mechanisms and processes, such as the physical laws present in the physics-based model, into the algorithms of the AI-based deep learning model. Such efforts are essential to enhance the overall performance and effectiveness of the hybrid approach.

Conclusion

In this study, a deep-learning-based AI technique (CNN-LSTM) that recently has been applied in various fields was used to predict the cyanobacteria cell counts. The results were compared with those obtained from a physics-based model (EFDC-NIER) that commonly is used in cyanobacterial analysis, to evaluate the applicability of deep-learning-based AI techniques in predicting harmful cyanobacteria. Directions for future research on the prediction of harmful cyanobacteria using deep-learning-based AI techniques were suggested.

The conclusions of this study are as follows:

1. For the period 2012–2021, a CNN-LSTM model was constructed for eight weir sites along the Nakdong River. The cyanobacteria cell counts in 2022 were predicted by using both the CNN-LSTM and EFDC-NIER models. Overall, the CNN-LSTM model had a level of prediction accuracy similar to that of the EFDC-NIER model.

2. Although the CNN-LSTM model exhibited excellent prediction performance, its predictive power was low for Level 4 (in which the cyanoHABs are very severe), for which relatively little training data were available. Therefore, AI-based deep learning models currently are considered to be unable to completely replace physics-based models.

3. As more data are accumulated and algorithms become more advanced in the future, the accuracy of AI-based deep learning models will be improved further. Currently, water quality and cyanobacteria data are being observed and developed continually, and deep learning algorithms for predicting time-series data, such as water quality and cyanobacteria, are evolving continually. Deep learning algorithms may become a new and effective way for researchers to analyze and predict cyanobacteria. Therefore, the prediction model for cyanobacteria cell counts should evolve into a hybrid concept that combines the advantages of AI-based deep learning models and physics-based models.

Appendix. AI-Based Deep Learning Model Performance Evaluation Metrics

The coefficient of determination, $R^{2}$, indicates how well the predicted values explain the measured values; the higher the value of $R^{2}$, the higher is the prediction accuracy. MAE is the mean absolute difference between the estimated and measured values, and RMSE is the square root of the mean squared difference between the estimated and measured values. The lower the MAE and RMSE values of a prediction model, the higher is the prediction performance of the model.

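R^{2} = 1 - \frac{SSR}{SST}

SST = \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}

SSR = \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}

MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}

where y_{i} is the measured value, {\hat{y}}_{i} is the estimated value, \bar{y} is the mean of the measured values, and n is the number of data points.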
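The three metrics can be computed with the short sketch below, which uses the usual definitions: the total sum of squares is taken about the mean of the measured values, and MAE and RMSE are averaged over the n data points. The function names and the sample values are illustrative and are not data from this study.

```python
import math


def r2_score(y, y_hat):
    """Coefficient of determination: 1 - SSR / SST."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    ssr = sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))
    return 1 - ssr / sst


def mae(y, y_hat):
    """Mean absolute error between measured and estimated values."""
    return sum(abs(yi - pi) for yi, pi in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Root-mean-square error between measured and estimated values."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat)) / len(y))


# Illustrative measured and estimated values.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.0, 2.5, 4.0]

scores = {"R2": r2_score(y, y_hat), "MAE": mae(y, y_hat), "RMSE": rmse(y, y_hat)}
```

For these values, SST = 5.0 and SSR = 0.5, so $R^{2}$ = 0.9, MAE = 0.25, and RMSE is approximately 0.354.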

Data Availability Statement

All data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

This study was supported by a Grant (NIER-2023-01-01-097) from the National Institute of Environmental Research (NIER), which is funded by the Ministry of Environment (MOE) of the Republic of Korea.

References

Ahn, J. M., B. Kim, J. Jong, G. Nam, L. J. Park, S. Park, T. Kang, J.-K. Lee, and J. Kim. 2021a. “Predicting cyanobacterial blooms using hyperspectral images in a regulated river.” Sensors 21 (2): 530. https://doi.org/10.3390/s21020530.
