Handling Missing Values in Hydrological Time Series Data

Uploaded by

Khondwani Banda
Copyright
© All Rights Reserved
Handling Missing Values in Hydrological Time Series Data
Managing missing data in hydrological records is crucial for reliable analysis. Incomplete streamflow or
water-level series can bias results and add uncertainty to model outputs if not handled properly 1 .
Below, I outline how to assess and address missing values in your station datasets (Bakel and Podor) in
a way that minimizes interference with your analysis, drawing on best practices and relevant research.

Importance of Addressing Missing Data


Hydrological analyses (trend detection, climate correlation, modeling, etc.) typically require continuous
data. Missing values due to instrument failures or other issues are common 1 . If left unaddressed,
gaps can introduce bias and reduce the accuracy of any models or statistical tests 1 . Properly infilling
these gaps (a process known as imputation) helps maintain the integrity of your analysis and improves
model performance 2 3 . It’s not the missing data itself, but how we handle it that matters 4 .

That said, there is no one-size-fits-all solution – the appropriate method depends on factors like the
amount and pattern of missing data, seasonal context, and availability of other data sources 4 . For
example, dealing with a few isolated missing days is very different from dealing with a 3-month gap.
Before choosing a fix, it’s important to assess the missing data pattern in your series:

• In the Podor station data, the early 1960s have large gaps (e.g. 1960–63 have over 30% missing
days), whereas recent years have only sporadic missing days. By contrast, Bakel’s record is
mostly complete except for the very latest year (2024–25) which is partly empty. This tells us that
Podor’s early record might need special treatment or could be excluded if too incomplete, while
most other gaps are short and might be interpolated. Always start by quantifying how many
values are missing in each year/season and whether the gaps are random or occur in blocks (e.g.
a broken gauge for several months).
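One way to do this quantification is sketched below. It is a minimal Python example using only the standard library. It assumes the record is already a contiguous list of (date, value) pairs with None marking a missing day. The function names and the synthetic values are illustrative and are not taken from your Bakel or Podor files. It groups by calendar year for simplicity, so adapt the grouping if you work in hydrological years.

```python
from collections import defaultdict
from datetime import date, timedelta


def missing_fraction_by_year(records):
    """Return {year: fraction of recorded days that are missing}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for day, value in records:
        total[day.year] += 1
        if value is None:
            missing[day.year] += 1
    return {year: missing[year] / total[year] for year in sorted(total)}


def gap_runs(records):
    """Return (first_missing_date, length_in_days) for each block of gaps."""
    runs = []
    start, length = None, 0
    for day, value in records:
        if value is None:
            if start is None:
                start = day
            length += 1
        elif start is not None:
            runs.append((start, length))
            start, length = None, 0
    if start is not None:  # the record ends inside a gap
        runs.append((start, length))
    return runs


# Synthetic example: ten daily values, one isolated gap and one 3-day block.
start_day = date(1971, 12, 27)
levels = [2.1, None, 2.0, 1.9, None, None, None, 1.6, 1.5, 1.5]
records = [(start_day + timedelta(days=i), v) for i, v in enumerate(levels)]

fractions = missing_fraction_by_year(records)
runs = gap_runs(records)
print(fractions)
print(runs)
```

The run lengths tell you whether gaps are isolated days or blocks, which is what decides between the strategies discussed next. Days that are absent from the file altogether (rather than present but empty) would not be counted here, so it is worth rebuilding a full daily calendar first.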

Strategies for Handling Missing Values


Once you understand the scope of missingness, you can choose an imputation strategy. Short and long gaps
often require different approaches. Below are common techniques – you might end up using
a combination of these in your thesis, depending on each gap’s characteristics:

• Leave Data Missing (No Imputation): In some cases, the least biased option is to leave the gaps
as they are (or exclude those periods from analysis). As one expert cautions, blindly inserting
invented values can distort your analysis 5 . If you have only a few missing points that won’t
affect a calculation (or if your analysis method can handle missing values), you may choose to
simply flag them and proceed without filling. This avoids introducing any “wrong” data 5 . For
example, if a year is missing large portions of data (like Podor 1960–61 which is missing over half
the year), you might exclude that year from a climate trend analysis rather than fill it with
unreliable estimates. However, many modeling tools and statistical tests cannot handle gaps, so
in practice we often need to impute values to have a complete series 6 .

• Linear or Spline Interpolation (Short Gaps): For relatively short gaps (say a few days up to ~10
days), simple interpolation is a common approach. This draws a line (or smooth curve) between
the last observed value before the gap and the first value after, filling in the intervening points. It
preserves continuity and is easy to do. In fact, a rule of thumb some hydrologists use is linear
interpolation for gaps shorter than 10 days 7 . This would likely be suitable for many of the small
gaps in your data. Spline (polynomial) interpolation can be used instead of straight lines to
better capture curvature in the hydrograph. One hydrologist suggests using a polynomial fit for
streamflow gaps around 10–15 days 8 , especially if a straight line might underestimate a peak; the
12-day mid-year gap in Bakel 2019–20 falls in this range, just beyond the 10-day rule of thumb.
Caution: Interpolation assumes the change between known points is relatively steady; it may
smooth out any extreme values that occurred during the gap (e.g. missing a sudden flood
peak). Use it for short gaps where that assumption is reasonable. Always plot the result to
ensure the filled values look realistic and don’t introduce a sudden jump or dip.

• Seasonal or Climatological Averages: For longer gaps (e.g. several weeks or months) where
straight interpolation becomes too uncertain, one simple approach is to fill with expected values
based on climatology. For instance, if an entire month is missing, you could fill it with the long-
term average for that month (or the same month in other years). This maintains the seasonal
pattern but erases that year’s anomaly. Some practitioners suggest using monthly averages
for gaps longer than ~10 days 9 . However, be careful: if only part of a month is missing, don’t
replace the whole month with the average, as that throws away any real data in that month 10 .
Instead, you could use daily averages for each calendar day (computed from other years) to fill a
section. In hydrology, another approach is to infill with data from a similar period in a different
year (the “analog year” method, which assumes, say, that the pattern in Feb 1972 resembles
Feb of another year with similar conditions). These climatological fills should be clearly
noted, since they have no year-specific information and will dampen variability.

• Neighboring Station Infilling: If you have another gauge on the same river or in the same
region (in your case, the Bakel and Podor stations on presumably the same river system), use
that relationship to your advantage. This is often considered the best approach when available 8 4 .
For example, if Podor’s gauge was down for a period, you can use Bakel’s readings to
estimate Podor’s, since the two stations likely have correlated flows/water levels. This can be
done via regression or ratio methods: regress Podor against Bakel for periods where both have
data to establish a relationship, then apply it to Bakel’s values during Podor’s gap 11 . Ensure
you choose a nearby station with a strong correlation and similar response (and watch out for
travel time or attenuation if one is upstream of the other – you might need to shift by a lag or
adjust magnitudes). Using neighboring station data tends to produce more realistic infills than
unguided interpolation, as it is anchored in real concurrent observations 4 . In your data, for
instance, Podor’s large gaps in the early 1960s might be filled by leveraging Bakel’s data from
those years (assuming Bakel was recording then). Conversely, Bakel’s 12-day gap in 2019–20
could be filled from Podor’s record for the same days, provided Podor was recording during
that period. Always perform a sanity check on these infills (plot the two station series
together) to confirm the imputed segment follows the right trend and range. If no nearby station
exists, one creative alternative is using remote sensing (e.g. satellite river height if available) to
fill major gaps 12 , but that may be beyond your project’s scope.

• Time-Series Modeling (ARIMA/Kalman): Another approach is to treat the data as a time series
and use a statistical model to predict missing values. For example, fit an ARMA/ARIMA model or
use Kalman filtering on the known parts of the series, then generate estimates for the gaps 13 .
The model leverages the temporal autocorrelation in the river levels. Kalman smoothing (state-space
models) is often recommended for hydrological data because it can dynamically adjust to
trends and seasonal components. In fact, studies have found Kalman filter/smoother methods to
be among the top performers for univariate hydrological time series imputation 14 . An
advantage of these methods is that they consider the structure of the data (e.g. seasonality,
persistence) rather than just drawing a straight line. For instance, an ARIMA model might
capture the typical rise and fall pattern of the river, and fill a gap in a way that honors those
dynamics (potentially better than linear interpolation if the gap spans part of a flood wave).
However, these models assume the statistical properties (mean, autocorrelation) remain steady –
which might not hold if there’s a long-term trend or shift. They also require a sufficient length of
data around the gap to calibrate the model 13 . You would likely use software (R, Python, etc.) to
fit such a model. If you’re comfortable with coding, you could attempt an ARIMA on a year with
artificial gaps to see how well it predicts them, before trusting it on real missing segments.

• Advanced Methods (Machine Learning & Multiple Imputation): With larger datasets and
multiple variables, you can consider more sophisticated infilling techniques:

• Machine Learning: Techniques like Artificial Neural Networks (ANNs) or Random Forests can
be trained to predict flow from past values or other inputs (rainfall, upstream flow, etc.). For
instance, an ANN could learn the relationship between Bakel and Podor, or between river level
and climate indices, and fill in missing periods. These can capture non-linear relationships better
than simple regression. One Q&A respondent even suggests Gaussian Process regression (a
Bayesian approach) for gap-filling, since it naturally provides uncertainty estimates 15 . ML
methods require substantial data for training and careful validation to avoid overfitting. They
might be overkill for a few gaps, but could be useful if you plan to model the system anyway.

• Multiple Imputation (MI): This is a statistical approach where you fill in missing values multiple
times to reflect uncertainty, and then carry all those filled datasets through your analysis to see
how results vary. MI is powerful because it preserves the natural variability that single
imputation tends to underestimate 16 . However, classic MI methods (like the popular MICE
algorithm) need multivariate data – they work by exploiting correlations between variables 17 .
In your case, if you treat Bakel, Podor, and perhaps climate indices (ENSO, NAO, etc.) as a
multivariate set, you could use MI to infer missing values from the inter-relationships. Recent
research published in the journal Water shows that for multivariate water level data, machine-learning-based
imputation (like the missForest algorithm or k-Nearest Neighbors) outperforms simpler
methods 18 . Meanwhile, for univariate series, methods that incorporate seasonality (like STL
decomposition imputation) perform best 14 19 . In fact, an evaluation on Nigerian river data
found seasonal decomposition imputation to have the lowest error among single-variable
methods, with Kalman smoothing a close second, and simple mean or random fills performing
worst 14 . This aligns with the idea that understanding the seasonal cycle of a river helps fill
gaps more realistically. If you have time and resources, you might experiment with the
imputeTS package in R (as referenced by Moritz et al.), which provides these advanced
univariate methods (e.g. na_seadec for seasonal decomposition, na_kalman for Kalman
filtering) 20 19 . For multivariate MI, packages like mice in R or the missForest algorithm
can use other variables (like combining both stations’ data and climate indices) to fill in values
while quantifying uncertainty 18 . These are more complex but could be worth it if you have
large gaps that simpler methods can’t handle well.

• Hybrid Approaches: Often, the best solution is a combination of methods. For example, a
common approach: use linear interpolation for small gaps and regression or climatology for larger
gaps. In a discussion on ResearchGate, one hydrologist described exactly this: interpolate gaps
< 10 days, and use monthly averages for longer gaps 21 (with later experts refining that one
should be cautious applying monthly means blindly 10 ). You can also use different methods in
stages – e.g., fill what you can with a reliable method (like using Bakel to fill Podor where
possible), then use interpolation or modeling for the remaining holes. After filling, always review
the statistical properties of the filled series: does the mean or variance of a filled year look
unrealistic? Does the filled segment introduce any strange trend or break? If so, adjust your
method or at least document the potential uncertainty.
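The staged logic of the hybrid approach can be sketched roughly as follows. This is a minimal standard-library Python example under stated assumptions. The function `fill_hybrid` is hypothetical. The `climatology` list is assumed to hold an expected value for each position (for example a long-term mean for that calendar day). The 10-day default simply mirrors the rule of thumb quoted above. It is not the method of any particular paper, and you could swap the climatological branch for a regression on the other station.

```python
def fill_hybrid(values, climatology, max_interp_days=10):
    """Fill None entries: linear interpolation for short interior gaps,
    climatological values for longer gaps or gaps at the record edges.

    Returns (filled, flags), where flags[i] is 'obs', 'interp' or 'clim'.
    """
    n = len(values)
    filled = list(values)
    flags = ["obs"] * n
    i = 0
    while i < n:
        if values[i] is not None:
            i += 1
            continue
        j = i
        while j < n and values[j] is None:
            j += 1
        # The gap occupies indices i .. j-1.
        gap_len = j - i
        interior = i > 0 and j < n
        if interior and gap_len <= max_interp_days:
            left, right = values[i - 1], values[j]
            step = (right - left) / (gap_len + 1)
            for k in range(i, j):
                filled[k] = left + step * (k - i + 1)
                flags[k] = "interp"
        else:
            for k in range(i, j):
                filled[k] = climatology[k]
                flags[k] = "clim"
        i = j
    return filled, flags


# Synthetic example: a 2-day gap (interpolated) and a 4-day gap (climatology).
observed = [1.0, None, None, 4.0, None, None, None, None, 3.0]
clim = [2.0] * len(observed)
filled, flags = fill_hybrid(observed, clim, max_interp_days=3)
print(filled)
print(flags)
```

Keeping the flags alongside the values makes it easy to mark filled points on a plot and to report how many values came from each method. Note how the climatological segment in the example jumps from 4.0 to 2.0 at its edge, which is exactly the kind of break the post-fill review is meant to catch. If you work in pandas instead, be aware that the `limit` argument of `Series.interpolate` fills the first few values of a long gap rather than skipping the gap, so the behaviour differs from this sketch.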

Recommendations for Your Data


Considering your specific datasets (Bakel and Podor water levels 1960–2025) and the aim of your
research (which likely involves linking these hydrological series with climate indices like AMO, ENSO,
NAO, PDO):

1. Perform Exploratory Missing Data Analysis: You’ve already started this by noting there is “no
temporal solution at the moment” – meaning you haven’t addressed the gaps yet. Use
visualizations (like a heatmap of missing days by year, or simply plotting the time series with
gaps marked) to see where the missing chunks are. Identify if they tend to occur in certain
seasons (for instance, some gauges fail during extreme floods, or during dry season downtimes).
In your data, Podor’s early 1960s gaps and 1971–72 gap (120 days) stand out, as does the
partially complete final year 2024–25 for both stations. Because they sit at the beginning and end of the
record, the early-1960s and 2024–25 gaps may be better left out of some analyses (e.g. you might decide to
analyze 1964–2023 fully, excluding the very patchy first years and the incomplete last year for
trend analysis). Document these decisions so your supervisor knows you’ve considered data
quality.

2. Leverage Both Stations: Since you have two stations on presumably the same river, use one to
fill the other whenever possible. Compute the correlation between Bakel and Podor daily or
monthly values (after aligning dates) to confirm their relationship. If strong, do a regression or
even a simple ratio for overlapping periods. For example, if Podor is typically, say, 0.8 m higher
than Bakel or lags by a day, account for that. Then for Podor’s missing data in the 1960s, see if Bakel
has data for those dates and use the regression model to estimate Podor. This “informed
interpolation” using nearby stations is far superior to guessing values in isolation 22 . It’s also
defensible in front of reviewers/supervisors if you show the regression fit and error range.
Essentially, you are borrowing strength from a neighboring station 4 – a widely recommended
practice in hydrology.

3. Interpolate Minor Gaps: For the many one-day or few-days gaps (for instance, Podor had some
years with just 1–2 missing days), simply interpolate them. The error introduced by interpolating
a day or two is minimal, and it preserves continuity. If the gap is very short, linear is fine; if it’s on
the order of a week or more and you suspect non-linear changes (like a rising or falling limb of a
hydrograph), consider a spline or piecewise polynomial fit to nearby points. Given that daily
water level tends to change gradually (except during flood pulses), interpolation over <10-day
gaps will “not interfere” much with analyses of monthly or annual patterns. Just avoid
interpolating across major inflection points (e.g., don’t draw a straight line through what was
likely a sharp flood peak – in such cases, use a model or partial data from the other station if
available).

4. Handle Long Gaps with Caution: For longer gaps (several weeks to months), decide on a case-
by-case basis:

• If you have other data to inform the gap (neighbor station, rainfall data, etc.), use it (e.g., a
regression or even a simple proportion of flows if the watershed areas are known).
• If no external data is available, a seasonal mean might be the only simple choice – but be aware
this will produce a very “flat” fill that might not reflect reality. If your analysis is sensitive to
extremes (say you are analyzing droughts and floods), filling with average values could dampen
an extreme that actually occurred or create a false sense of normality. In such cases, you might
indicate that the results for that period are uncertain. For example, Podor 1971–72 missing Jan–
Apr: If that included an extreme low or high, a mean fill would hide it. You could instead use
Bakel 1972 Jan–Apr pattern scaled to Podor, or if that’s not reliable, at least indicate the gap in
any plots.
• Another option for long gaps is to treat it as a forecasting problem: use data before the gap to
project forward. For instance, if a flood peak is missing due to gauge failure, you could
extrapolate the recession curve from after the flood (if that portion is recorded) backwards. This
is effectively a manual modeling of the hydrograph shape. In some cases, hydrologists fit known
mathematical curves to rising or falling limbs to estimate missing peak flow or volume.

5. Quality-Check After Filling: Whatever fills you perform, do a post-imputation check:

• Plot the filled time series and look for any discontinuities or obvious anomalies introduced by
your fills.
• Check statistics like annual means, maxima, minima of the filled series against the original.
Large deviations might indicate over/under-estimation in gaps.
• If possible, quantify uncertainty for major filled values. For instance, if you used a regression
from Bakel to Podor, you can give an error bar based on the regression RMSE. This was
suggested in the RG discussion: provide confidence bounds for infilled values to acknowledge
uncertainty 23 . In an academic presentation, noting these uncertainties shows rigor.

6. Document and Justify Your Approach: Finally, be transparent in your report about how many
values were missing and how you filled them. Note that different methods were used for
different situations, and why. Supervisors (and future readers) will appreciate that you didn’t just
blindly replace data, but rather applied hydrological reasoning. For example, you might write:
“We filled short gaps (<10 days) via linear interpolation, and longer gaps via a regression-based
donor method using the upstream station. Where data were missing at both stations (e.g., Bakel
and Podor both incomplete in 2024–25), we left those gaps unfilled and excluded those periods
from analysis to avoid speculative data.” This aligns with expert advice that if no “educated
guess” can be made, it’s better to leave data blank than to introduce potentially erroneous
values 5 .
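The regression-based donor fill and its RMSE error bar, as recommended above, could look something like the sketch below. It is a minimal standard-library Python example under stated assumptions. `fit_line` and `infill_from_donor` are hypothetical helper names. The one-step lag and the numbers are invented for illustration and say nothing about the real Bakel to Podor relationship, which you would estimate from your own overlapping data.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit y = a + b*x; returns (a, b, rmse)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return a, b, rmse


def infill_from_donor(target, donor, lag=0):
    """Fill None entries in `target` using `donor` values `lag` steps earlier.

    The regression is fitted only on time steps where both series are
    observed. Gaps where the donor is also missing stay None.
    """
    pairs = [
        (donor[i - lag], target[i])
        for i in range(lag, len(target))
        if target[i] is not None and donor[i - lag] is not None
    ]
    a, b, rmse = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    filled = list(target)
    for i in range(lag, len(target)):
        if target[i] is None and donor[i - lag] is not None:
            filled[i] = a + b * donor[i - lag]
    return filled, (a, b, rmse)


# Synthetic example: the target follows 1.0 + 0.5 * donor with a one-day lag.
bakel = [3.0, 4.0, 5.0, 6.0, 5.0, 4.0]
podor = [2.0, 2.5, 3.0, None, 4.0, 3.5]
podor_filled, (intercept, slope, rmse) = infill_from_donor(podor, bakel, lag=1)
print(podor_filled, intercept, slope, rmse)
```

The RMSE returned here is computed on the same points used for fitting, so it is an optimistic error bar. A more honest figure would come from holding out some observed days, predicting them, and reporting that error alongside the filled values.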

Insights from Literature and Similar Projects


To strengthen your approach, it helps to reference what other researchers have done in similar
hydrological studies:

• Don’t Ignore Missingness: A recent study emphasizes that ignoring missing data or handling it
poorly can bias results and undermine hydrological modeling, especially with shorter data
records 1 . They stress that filling in streamflow gaps appropriately improves model calibration
and water resource planning 24 .

• Method Selection Depends on Context: As noted, the “best” method varies. You should
consider how large the gaps are, whether they occur in wet or dry season, and what other
data you have 4 . If a gap is during a critical seasonal event (like monsoon peak), using an
average could be very misleading – you’d lean towards using correlated station data or a model
in that case. Conversely, a gap in a stable low-flow period could be filled with a seasonal mean
with less impact.

• Interpolation vs. Advanced Methods: Simple linear interpolation is common but has
limitations. One hydrologist bluntly called linear interpolation over big gaps the “least preferable”
method, acceptable only as a last resort 5 . It tends to create artificial smoothness and can be
flagged by reviewers if not justified. More advanced single-series methods, like seasonal
decomposition imputation, have been found to yield lower error, since they respect the
repeating annual cycle 14 . Kalman smoothing (essentially using a state-space time-series
model) is another high-performing technique for filling complex hydrologic data 14 . These
methods are implemented in some tools (e.g., R’s imputeTS ), which could be worth exploring if
time permits. The seasonal decomposition approach essentially models the seasonal pattern
from the data and fills missing points based on that model – great for preserving seasonality.

• Multiple Variables Improve Imputation: If you utilize your climate index datasets (AMO, ENSO,
NAO, PDO) or even rainfall records, you add context that might improve infilling through
multivariate imputation. For instance, if a year had an extreme El Niño and you suspect it
would have caused extreme river levels, a multiple-imputation model incorporating ENSO index
could, in theory, predict higher (or lower) values for that gap consistent with El Niño’s typical
influence. Studies have shown that algorithms like missForest (a Random Forest-based
imputer) or k-Nearest Neighbors can successfully leverage multiple stations’ data to fill gaps,
outperforming simpler fills 18 . They effectively “learn” the relationship among stations. In one
comparative study, missForest and kNN were the top methods for multivariate water level data,
while simpler methods (like filling with a regression or mean) were less accurate 18 . This
suggests that if you have time-series from multiple gauges, using them together in a data-driven
imputation algorithm can yield very good results (with the bonus of an error estimate). That said,
these require some statistical coding and understanding to apply correctly.

• Manual and Visual Verification: Multiple experts advise not to rely solely on automated
methods – always validate the fills with hydrological reasoning 25 22 . Plotting the data is
essential: for example, check that an imputed flood peak doesn’t exceed physical possibilities or
that an imputed dry-season flow isn’t negative or out of character. If something looks off, adjust
the approach. As one answer noted, “plot the data!” for a reality check 26 . Sometimes the
process of filling gaps can also uncover data quality issues (like suspicious zeros or outliers that
might indicate sensor errors rather than true values).

• There Are Gap-Tolerant Analyses: In case you are worried about filling methods affecting your
analysis, be aware that some analysis techniques can work with missing data. For example,
certain trend tests or spectral analyses have versions that accommodate gaps. Also, if you
compute correlations between the river and climate indices on a monthly or annual basis, you
don’t necessarily need every single day filled – you could compute monthly means from
whatever days are available (though if an entire month is missing, you have a problem). If a
small fraction of data is missing, using pairwise deletion (each calculation uses whatever
overlapping data exist) might be sufficient. However, if you intend to run any continuous
simulation or frequency analysis, you will need a gap-free record or else handle those periods
specially.
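As an illustration of the gap-tolerant route, the sketch below computes monthly means from whatever days are available and refuses to report a month that is too sparsely observed. It is a minimal standard-library Python example. The function name `monthly_means` and the 80% coverage threshold are assumptions chosen for illustration, not values taken from the cited sources, so pick a threshold you can justify for your own analysis.

```python
import calendar
from collections import defaultdict
from datetime import date, timedelta


def monthly_means(records, min_coverage=0.8):
    """Mean of available daily values per (year, month).

    Months with fewer than `min_coverage` of their calendar days observed
    are returned as None instead of a possibly unrepresentative mean.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    months = set()
    for day, value in records:
        key = (day.year, day.month)
        months.add(key)
        if value is not None:
            sums[key] += value
            counts[key] += 1
    result = {}
    for key in sorted(months):
        year, month = key
        n_days = calendar.monthrange(year, month)[1]
        if counts[key] / n_days >= min_coverage:
            result[key] = sums[key] / counts[key]
        else:
            result[key] = None
    return result


# Synthetic example: January 2020 misses 3 days, February 2020 misses 20.
first_day = date(2020, 1, 1)
records = [
    (first_day + timedelta(days=i), None if i in (5, 6, 7) or i >= 40 else 2.0)
    for i in range(60)
]
means = monthly_means(records)
print(means)
```

Months returned as None can then be dropped pairwise when you correlate the river series with a climate index, which avoids filling daily values just to obtain a monthly figure.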

Conclusion
In summary, I recommend a strategic gap-filling approach for your hydrological data:

• Use interpolation for short gaps (ensuring it doesn’t mask any sharp changes).
• Use informed methods for longer gaps – preferably leveraging the other station’s data (or any
relevant auxiliary data) to guide the fills.
• Avoid brute-force methods like blanket monthly means unless absolutely necessary, and even
then, acknowledge the uncertainty introduced.
• If no good estimation basis exists for a particular gap, it may be better to leave it and adapt
your analysis (or at least highlight that results for that period are uncertain) 5 .
• Whichever imputation you do, document it and perhaps test that it doesn’t overly change your
key analysis outcomes (for instance, you could see if a trend line with and without filled data
differs significantly – if it does, investigate why).
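The with-and-without comparison mentioned in the last point can be sketched in a few lines. This is a minimal standard-library Python example under stated assumptions. `ols_slope` and `slope_sensitivity` are hypothetical helpers. The trend is a plain least-squares line on annual means, whereas your thesis may use a different trend test. The numbers are synthetic.

```python
def ols_slope(points):
    """Least-squares slope of y against x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def slope_sensitivity(years, annual_means, is_filled):
    """Compare the trend over all years with the trend over only those
    years whose annual mean contains no imputed values."""
    all_points = list(zip(years, annual_means))
    observed_points = [p for p, f in zip(all_points, is_filled) if not f]
    slope_all = ols_slope(all_points)
    slope_obs = ols_slope(observed_points)
    return slope_all, slope_obs, slope_all - slope_obs


# Synthetic example: 2003 relies on filled data and sits well above the line.
years = [2000, 2001, 2002, 2003, 2004, 2005]
annual = [1.0, 1.1, 1.2, 2.0, 1.4, 1.5]
filled_flag = [False, False, False, True, False, False]
slope_all, slope_obs, difference = slope_sensitivity(years, annual, filled_flag)
print(slope_all, slope_obs, difference)
```

If the two slopes differ by more than you are comfortable with, that points to the filled years driving the result, and those years deserve a closer look before you report the trend.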

By carefully filling gaps in this manner, you’ll maintain the integrity of your dataset without letting
missing values derail your climate impact analysis. Just remember that imputed values are an analysis
aid, not actual observations – treat the filled values with some caution and be transparent about
their use 22 . Your supervisor will be looking for a solid rationale behind any gap-filling, and the
approach above, backed by literature, should provide that. Good luck with your presentation on Sunday!

References
• Baddoo, T.D. et al. (2021). “Comparison of Missing Data Infilling Mechanisms for Recovering a
Real-World Single Station Streamflow Observation.” Int. J. Environ. Res. Public Health 18(16):
8375. (Highlights the importance of proper missing-data handling to avoid bias, and notes that
method selection depends on gap size, seasonality, and available neighboring data 1 4 .)

• Umar, N. et al. (2023). “Comparing Single and Multiple Imputation Approaches for Missing
Values in Univariate and Multivariate Water Level Data.” Water 15(8):1519. (Found that for
univariate water-level series, a seasonal decomposition imputation was most accurate, followed
by Kalman smoothing 14 . For multivariate data (multiple stations), machine-learning methods
like missForest and kNN performed best 18 . Emphasizes that no single method works for all
cases, and multiple imputation can preserve uncertainty 16 .)

• ResearchGate Q&A (2013). “What is the most effective way to fill the missing hydrological
data?” (Community advice from hydrologists on handling missing data. Key takeaways: use
nearby station data if possible; for short gaps <10 days use interpolation; for longer gaps
consider polynomial interpolation or other models rather than filling entire months with means 8 .
Also, several experts warn against uninformed linear interpolation for long gaps and
suggest leaving data blank or quantifying uncertainty if you can’t confidently infill 5 23 .)

1 2 3 4 6 17 24 Comparison of Missing Data Infilling Mechanisms for Recovering a Real-World Single Station Streamflow Observation - PMC
https://pmc.ncbi.nlm.nih.gov/articles/PMC8394992/

5 7 8 9 10 11 13 15 21 22 23 25 26 What is the most effective way to fill the missing hydrological data? | ResearchGate
https://www.researchgate.net/post/What_is_the_most_effective_way_to_fill_the_missing_hydrological_data

12 Infilling Missing Data in Hydrology: Solutions Using Satellite Radar ...
https://www.mdpi.com/2073-4441/10/10/1483

14 16 18 19 20 Comparing Single and Multiple Imputation Approaches for Missing Values in Univariate and Multivariate Water Level Data
https://www.mdpi.com/2073-4441/15/8/1519
