A deep-dive into Google's 2.6-million-event flood dataset: what the data actually shows, which claims hold up, and why the methodology may matter more than the dataset itself.
In February 2026, Google Research released Groundsource, an open-access global dataset of 2.6 million historical flood events extracted from news articles using Gemini LLMs. The dataset was published on Zenodo with a preprint on EarthArXiv.
The key claim: Google used Gemini to scan 5 million news articles across 80+ languages and generated 2.6 million geo-tagged flood events spanning 150+ countries. This is the training data behind Google's operational flash flood forecasting system, announced on the Google Blog and Google Research Blog.
We downloaded the full dataset, decoded every geometry, and verified the claims. Here's what we found.
The dataset is a single 667 MB Parquet file containing exactly 2,646,302 flood events. Each event has:
| Column | Type | Description |
|---|---|---|
| uuid | string | Unique event identifier |
| area_km2 | float | Flood extent area in km² |
| geometry | WKB binary | Polygon boundary of flood zone |
| start_date | string | Flood start date |
| end_date | string | Flood end date |
No country column. No language of source article. No confidence score. No link to the original news article. No event severity classification. The dataset is deliberately minimalist: just polygon geometries, dates, and areas. This makes it clean and privacy-preserving, but impossible to trace provenance or assess per-event reliability.
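As a sanity-check sketch of that five-column schema, here is a toy frame with the same columns and the null/duplicate checks we ran on the real file. The toy rows are illustrative; the real 667 MB file is loaded the same way via `pd.read_parquet`.

```python
import pandas as pd

# Toy frame mirroring Groundsource's five-column schema (values are invented)
df = pd.DataFrame({
    "uuid": ["a1", "b2", "c3"],
    "area_km2": [12.5, 0.8, 230.0],
    "geometry": [b"\x01", b"\x02", b"\x03"],  # raw WKB bytes in the real file
    "start_date": ["2024-05-01", "2024-06-10", "2025-01-02"],
    "end_date": ["2024-05-03", "2024-06-11", "2025-01-05"],
})

# The same two checks we ran on the real 2,646,302-row file
n_nulls = int(df.isna().sum().sum())
n_dup_ids = int(df["uuid"].duplicated().sum())
print(n_nulls, n_dup_ids)  # both come back 0 on the real dataset as well
```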
We decoded all 2,646,302 WKB geometries into latitude/longitude centroids and classified by world region:
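The decoding step can be sketched with shapely. A toy square stands in for a real stored geometry; `wkb.loads` and `.centroid` are the only operations the real pipeline needs per row.

```python
from shapely import wkb
from shapely.geometry import Polygon

# Simulate one stored geometry: in the real file this is the raw bytes of
# the `geometry` column (toy square here, coordinates are lon/lat degrees).
raw = wkb.dumps(Polygon([(30.0, -1.0), (30.4, -1.0),
                         (30.4, -0.6), (30.0, -0.6)]))

geom = wkb.loads(raw)                 # decode WKB -> shapely polygon
lon, lat = geom.centroid.x, geom.centroid.y
print(round(lon, 2), round(lat, 2))   # centroid of the toy square
```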
The dataset exhibits dramatic temporal skew:
| Period | Events | Share | Interpretation |
|---|---|---|---|
| 2000-2009 | 40,581 | 1.5% | Sparse: limited digital news archives |
| 2010-2019 | 876,630 | 33.1% | Ramp-up: growing online news |
| 2020-2026 | 1,729,091 | 65.3% | 65% of all data in last 6 years |
2024 alone contributed 402,012 events, more than double 2020's 198,201. This is a compound effect of more digitized global news, improved LLM extraction, and genuinely increasing flood frequency from climate change.
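The decade shares in the table reduce to a one-line bucketing; here is the computation sketched on toy years (the real version runs on the dataset's `year` column):

```python
import pandas as pd

# Toy event years standing in for the 2.6M real ones
years = pd.Series([2004, 2012, 2015, 2018, 2021, 2022, 2023, 2024, 2024, 2025])

decade = (years // 10) * 10                              # 2004 -> 2000, etc.
shares = decade.value_counts(normalize=True).sort_index()  # share per decade
print(shares)
```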
CONFIRMED. Exactly 2,646,302 events, all with polygon geometry and dates. Zero nulls, zero duplicates.
CANNOT VERIFY FROM DATASET. No language column, no source article metadata, no article count. The Zenodo description says "spanning more than 150 countries" but the dataset itself provides no means to verify article counts or language coverage. The paper needs to provide this evidence.
CANNOT VERIFY FROM DATASET. These are model evaluation metrics, not dataset properties. Flash flood prediction is genuinely hard: current literature puts F1-score ceilings around 0.3-0.5 for global models, so these numbers are plausible but need cross-checking against NOAA's own verification statistics.
The existing HF mirror's dataset card quotes the paper as reporting 82% practical precision in manual evaluations and 85-100% recall against GDACS severe events (2020-2026). These are paper claims that require the peer review process to validate.
CONFIRMED AND QUANTIFIED. Africa = 4.2% of events vs ~17% of world population: a 4× underrepresentation. More on this in Section 5.
This is the most important conceptual question. The answer: Groundsource is training data, not forecast input.
The model was trained on 2.6 million historical events alongside the weather conditions present at each location at the time. It learned the patterns. For daily forecasting, it ingests live feeds from ECMWF, NASA, and NOAA and checks whether today's weather matches a learned pattern.
The dataset doesn't need updating for real-time forecasting โ just as ImageNet doesn't need daily updates for an image classifier to recognize new cats. The model learned what weather patterns precede floods from historical data. At inference, it checks if today's weather matches those patterns.
This architecture is confirmed by the RiverMamba paper, which follows the same paradigm: pretrain on GloFAS reanalysis (historical), forecast using ECMWF HRES (operational). Google's prior work (Nearing et al., Nature 2024) established the encoder-decoder LSTM architecture for global flood prediction.
That said, periodic retraining with fresh data would improve performance โ especially for novel weather patterns from climate change. The Zenodo record is a v1 snapshot. Whether Google plans periodic releases remains unclear.
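The "train once on history, forecast daily from live feeds" split can be made concrete with a deliberately tiny sketch. A rainfall threshold stands in for the real model, and every number is invented; the point is only that inference touches nothing but today's feed.

```python
# Toy sketch of the train-on-history / forecast-on-live-feeds split.
# The threshold "model" and all values are illustrative, not Google's method.

historical = [  # (rainfall_mm_24h, flooded?) pairs from past events
    (120, True), (15, False), (95, True), (8, False), (70, True), (30, False),
]

# "Training": learn the simplest possible pattern, a rainfall threshold
flood_rain = [r for r, flooded in historical if flooded]
dry_rain = [r for r, flooded in historical if not flooded]
threshold = (min(flood_rain) + max(dry_rain)) / 2  # midpoint between classes

def forecast(live_rainfall_mm: float) -> bool:
    """Inference: no historical data needed, only today's live feed."""
    return live_rainfall_mm >= threshold

print(threshold)        # learned once from history
print(forecast(110.0))  # today's weather matches the flood pattern
print(forecast(12.0))   # today's weather does not
```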
Africa represents 4.2% of events (111,053) despite holding ~17% of world population and experiencing severe flood vulnerability. This is a 4× underrepresentation.
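The arithmetic behind that figure is a two-line check (the ~17% population share is our rounded assumption, not a dataset column):

```python
# Africa's share of Groundsource events vs its share of world population
event_share = 111_053 / 2_646_302   # ~4.2% of events
population_share = 0.17             # ~17% of world population (assumption)

underrep = population_share / event_share
print(round(underrep, 2))  # roughly a 4x underrepresentation
```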
The paper itself acknowledges this: "Many countries in Africa are still lacking in ground truth beyond Groundsource, making it difficult to accurately estimate the accuracy of our model."
The gap is structural, not accidental.
The most important thing about Groundsource is not the flood data. It's the demonstration that LLMs can convert the world's unstructured text into structured scientific ground truth at global scale.
The existing paradigm for scientific ground truth requires physical sensors (expensive, sparse, rich-country bias), expert annotation (slow, small-scale), or citizen science (unreliable). The Groundsource paradigm requires news articles (exist wherever humans report events) and an LLM. This is infrastructure-independent ground truth.
This methodology has already been demonstrated in adjacent domains:
| Paper | Domain | Key Finding |
|---|---|---|
| Consoli et al. 2024 | Epidemic surveillance | LLMs extract disease/country/date/case-count from ProMED/WHO texts. GPT-4 achieves F1 up to 0.954 for disease name extraction. |
| JRC eKG 2025 | Epidemiological KG | Ensemble LLMs extract 2,384 outbreak events across 180 countries from WHO Disease Outbreak News. |
| Lamsal et al. 2022 | COVID prediction | Twitter sentiment-based variables predict daily COVID cases, especially in early outbreak stages. |
| De Choudhury 2018 | Influenza forecasting | Deep CNNs on Instagram images forecast influenza-like illness. |
| IncidentAI 2023 | Industrial safety | NER + cause-effect extraction from high-pressure gas incident reports. |
| CrisisTransformers 2023 | Crisis text analysis | Pre-trained models for crisis-related social media text classification across languages. |
The Groundsource pipeline is domain-agnostic in principle. Here's where it could transfer, and where it breaks down:
| Domain | Binary Event? | News Coverage | Physical Data | Feasibility | Key Gap |
|---|---|---|---|---|---|
| Flash floods | ✓ Yes | High | ECMWF, IMERG | Done | Africa coverage |
| Disease outbreaks | ✓ Yes | Very high | Temp, humidity, mobility | Very High | Already working (ProMED) |
| Pollution events | ✓ Events / ✗ Levels | Medium | Sentinel-5P, sensors | Medium | Continuous vs binary |
| Wildfires | ✓ Yes | High | MODIS, VIIRS | Medium | Satellite already strong |
| Mining hazards | ✓ Yes | Medium | SAR change detection | Medium | Rare events, chronic |
| Conflict/displacement | ✓ Yes | Very high | Satellite, mobility | High | ACLED exists |
| Infrastructure failure | ✓ Yes | Medium | Sensors vary | Medium | Heterogeneous infra |
| Drought/agriculture | ✗ Slow onset | Low | NDVI, soil moisture | Lower | Not event-based |
The methodology works best for binary, acute, widely-reported events that can be paired with continuously-available physical observations. The more the phenomenon resembles flash floods (sudden, localized, binary, widely reported), the better this approach will work.
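One rough way to operationalize those four criteria is a simple scoring heuristic. This is our framing, not anything from the paper; the function just counts how many flash-flood-like properties a candidate domain has.

```python
# Toy heuristic encoding the transfer criteria named above: binary, acute,
# widely reported, and pairable with continuous physical observations.

def transfer_score(binary: bool, acute: bool,
                   coverage: str, physical_data: bool) -> int:
    """Count how many flash-flood-like properties a domain has (0-4)."""
    coverage_ok = coverage in ("high", "very high")
    return sum([binary, acute, coverage_ok, physical_data])

print(transfer_score(True, True, "high", True))    # flash floods: all four
print(transfer_score(False, False, "low", True))   # drought: only one
```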
Artisanal gold mining in Africa causes mercury pollution, deforestation, and water contamination โ all poorly monitored. News articles report on illegal mining operations, environmental damage, and health effects. A Groundsource-like pipeline could create the first systematic database of artisanal mining impacts, paired with satellite change detection (deforestation, river sediment). The limitation: mining operations are often chronic (mine pollutes for years) rather than acute (flood lasts one day). The pipeline needs adaptation for long-duration events.
Direct air quality prediction from news text is limited: articles say "air quality was terrible," not "PM2.5 reached 152 μg/m³." But news-extracted pollution events (industrial accident, wildfire) could serve as supplementary features in a model that primarily uses satellite (Sentinel-5P/TROPOMI) and sensor data. The text provides the "what happened" context that satellites can't capture. The AirPhyNet paper shows physics-guided neural networks already achieve strong air quality predictions; adding event context from text could push performance further.
Instead of relying solely on news, combine: Satellite SAR imagery (Sentinel-1 works everywhere), community reporting platforms (Ushahidi, WhatsApp-based reports), and local radio monitoring (transcribe and mine broadcasts in African languages). Microsoft's AI4G-Flood already mapped 10 years of global floods from Sentinel-1 SAR, providing coverage independent of news.
The Kuro Siwo dataset provides 33 billion m² of manually annotated flood extent from SAR imagery. Fine-tuning a geospatial foundation model like TerraMind on multimodal Sentinel-1/Sentinel-2 data could generate flood ground truth for Africa without relying on text at all.
SAGDA demonstrates synthetic data generation can overcome Africa's data scarcity for agriculture. The same principle could apply: generate synthetic flood scenarios for African river basins using physics-based models (LISFLOOD), then use these as additional training labels alongside sparse Groundsource events.
RiverMamba demonstrates this already works: pretrain globally on GloFAS reanalysis, then the model generalizes to ungauged locations including Kenya-Tanzania floods. DengueNet showed satellite imagery can predict dengue in resource-limited countries โ same transfer paradigm.
Fine-tune extraction models specifically for African language news. Cross-lingual crisis sentence embeddings and CrisisTransformers show that crisis-domain fine-tuning dramatically improves multilingual performance. Targeted investment in Hausa, Amharic, Swahili, Yoruba extraction could significantly close the gap.
We've published an enriched version of Groundsource on Hugging Face with decoded coordinates and derived columns. Here's how to use it:
```python
from datasets import load_dataset

ds = load_dataset("rdjarbeng/groundsource-enriched")
df = ds['train'].to_pandas()

print(f"Total events: {len(df):,}")
print(f"Columns: {list(df.columns)}")
# ['uuid', 'area_km2', 'start_date', 'end_date',
#  'longitude', 'latitude', 'year', 'month',
#  'duration_days', 'region']

africa = df[df['region'] == 'Africa']
print(f"African events: {len(africa):,} ({100*len(africa)/len(df):.1f}%)")
print("\nYearly growth in Africa:")
print(africa.groupby('year').size().tail(10))
```
```python
# Compare event counts and flood-extent statistics by region
region_stats = df.groupby('region').agg(
    events=('uuid', 'count'),
    median_area=('area_km2', 'median'),
    mean_area=('area_km2', 'mean')
).sort_values('events', ascending=False)
print(region_stats)
```
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(15, 8))
sample = df.sample(50000, random_state=42)

# Color by region
colors = {
    'Africa': 'red', 'Europe': 'blue', 'South Asia': 'green',
    'Southeast Asia': 'orange', 'North America': 'purple',
    'South America': 'cyan', 'East Asia': 'magenta',
    'Oceania': 'brown', 'Other': 'gray'
}
for region, color in colors.items():
    mask = sample['region'] == region
    if mask.sum() > 0:
        ax.scatter(
            sample[mask]['longitude'], sample[mask]['latitude'],
            s=0.5, alpha=0.3, c=color, label=region
        )

ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Groundsource: 2.6M Global Flood Events')
ax.legend(markerscale=10, loc='lower left')
plt.tight_layout()
plt.savefig('groundsource_map.png', dpi=150)
plt.show()
```
```python
import pandas as pd

# Monthly event counts by region over time
df['date'] = pd.to_datetime(df['start_date'])
df['yearmonth'] = df['date'].dt.to_period('M')
monthly = df.groupby(['yearmonth', 'region']).size().unstack(fill_value=0)

# Focus on recent years
monthly_recent = monthly[monthly.index >= '2015-01']

fig, ax = plt.subplots(figsize=(15, 6))
for region in ['South Asia', 'Europe', 'Southeast Asia',
               'North America', 'Africa']:
    if region in monthly_recent.columns:
        monthly_recent[region].plot(ax=ax, label=region, alpha=0.7)
ax.set_title('Monthly Flood Events by Region (2015-2026)')
ax.set_ylabel('Events per month')
ax.legend()
plt.tight_layout()
plt.savefig('groundsource_timeseries.png', dpi=150)
```
If you need the full polygon boundaries (not just centroids), use the original dataset mirror:
```python
# Original with WKB geometry
ds_original = load_dataset("stefan-it/Groundsource")

# Or download directly from Zenodo:
# wget https://zenodo.org/records/18647054/files/groundsource_2026.parquet
```
To test whether the Groundsource methodology actually transfers, we ran a complete replication on a different domain: epidemic surveillance from WHO Disease Outbreak News.
| Method | Disease F1 | Country F1 | Cases F1 |
|---|---|---|---|
| JRC GPT-4 (best single model) | 0.840 | 0.954 | 0.629 |
| JRC Ensemble (3 LLMs + voting) | 0.851 | 0.962 | 0.658 |
| Our pipeline (single Qwen2.5-72B) | ~0.96 | ~0.96 | ~0.86 |
The LLM doesn't just copy — it cleans and normalizes messy titles into proper disease names.
A striking finding: 50.7% of WHO DON events are in Africa — the complete opposite of the Groundsource flood dataset (4.2%). This makes sense: WHO specifically targets regions with high disease burden and weak surveillance. Top African diseases: Cholera (26), Ebola (14), Marburg (10), Yellow fever (10).
This means the methodology's Africa gap is data-source-dependent, not inherent. Choose the right text source, and the geographic bias shifts.
```text
Measles - Bangladesh             → 19,161 cases, 166 deaths (2026-04-14)  severity: high
Marburg virus disease - Ethiopia → 19 cases, 14 deaths (2026-01-25)       severity: critical
Cholera - Senegal                → 3,475 cases, 54 deaths                 severity: high
Typhoid fever - DR Congo         → 42,564 cases, 214 deaths               severity: high
```
→ Dataset: rdjarbeng/who-epidemic-events (213 geo-tagged events with extraction pipeline code)