
Where Should We Look? Modeling Spotted Lanternfly Detection Gaps in Great Smoky Mountains National Park

Author: Cole

A species distribution model, an observer effort surface, and one invasive insect that hasn't been found yet.

The Notebooks: https://github.com/cskujawa/jupyter-notebooks/tree/main/SLF%20SDM


I've been visiting the Great Smoky Mountains area quite a bit lately, reading about its geology, its ecology, and its particular brand of complicated history. While considering the kinds of challenges GSMNP faces, I was reminded of the Spotted Lanternfly I had seen in D.C. After that first sighting, I started noticing it turn up in Reddit threads, particularly r/LanternDie, where it was clearly well established in some areas and universally disliked. It made perfect sense to me: a bounded area and an invasive insect. This was the kind of data science project I wanted to pick up for the weekend (it took a week...).

Lycorma delicatula is an invasive planthopper native to China, first detected in the United States in Berks County, Pennsylvania in 2014; it has since become one of the more alarming agricultural and ecological stories in the eastern US [1]. It feeds on dozens of plant species, has a particular affinity for Tree-of-Heaven (Ailanthus altissima), and has been spreading steadily southwest along transportation corridors ever since that first Pennsylvania detection. By late 2024, it had confirmed populations in six Tennessee counties. Jefferson County, Tennessee - which directly borders the park - is on that list.

But here’s what stopped me: iNaturalist records from GBIF had zero confirmed SLF records within approximately 50 km of GSMNP.

The question isn’t is SLF in the park? It’s if it arrives, where would it establish, and where are we failing to look for it? This project is my attempt to answer that. It was also my first Random Forest model, my first geospatial raster pipeline, and my first species distribution model - so the learning curve was real. This study carries known biases; I wanted to explore some new topics without exhausting myself chasing down every possible source of bias in a hobby project.


The Map

Before the methodology: here’s what I built.

The interactive map turned out far better than I expected. The culmination of all the work is the detection gap score - that's the pink/purple shading. Even the best places for SLF in the Smokies are still marginal habitat for the species. That's the end result, though; the three maps that feed into it shed more light on how I built it.

Three-panel detection gap map for GSMNP. Left: habitat suitability predicted by a Random Forest SDM trained on southeastern US SLF occurrence data (warmer colors = higher suitability). Center: observer effort proxy derived from 100,000 iNaturalist records, Gaussian-smoothed at 1 km (darker blue = more surveyed). Right: detection gap score - suitability multiplied by the complement of effort - showing where high-suitability areas go unobserved (deeper pink = higher priority for survey). The GSMNP boundary, road network, and trail system are overlaid on all three panels.

The pattern is immediately readable. Suitability is highest in the lower-elevation valleys and along the park’s southern and western boundaries - the warm, lower-elevation areas that most resemble the mid-Atlantic environments where SLF has established. Observer effort clusters tightly around roads and major trailheads, with most of the park’s interior seeing near-zero iNaturalist coverage. The gap score - the product of those two signals - peaks in lower-elevation areas away from the trail network. Realistically, the suitable habitat area is fairly large; the map could help guide surveyors toward trails that cut through it.


The Data: 32,641 Records, Thinned to 8,835

Every species distribution model starts with occurrence records - georeferenced observations of the species you’re modeling. For SLF, those records come almost entirely from citizen scientists uploading photos to iNaturalist, which contributes data to GBIF (the Global Biodiversity Information Facility [2]), the standard aggregator for biodiversity occurrence data.

I queried GBIF’s API using pygbif, pulling all L. delicatula records within a southeastern US bounding box: 32,641 records, spanning 2008 through early 2026. The top contributing states were Pennsylvania (12,048 records), Virginia (7,966), and Maryland (6,691) - which tells you something immediately. This is not a random sample of where SLF lives. It’s a sample of where people with iNaturalist accounts have been paying attention.

That’s the core problem with opportunistic citizen science data for invasive species work. The records reflect observer density as much as species presence. Urban and suburban areas are systematically overrepresented. Remote areas - like, say, a national park interior - are underrepresented not because the species is absent but because observers are absent. Any model trained on this data without accounting for that bias will learn “SLF is found near Philadelphia” as much as it learns “SLF prefers warm lowland environments.” [11]

I ran the records through a four-step quality pipeline. First, dropped records with coordinate uncertainty above 1,000 m - a defensible threshold for a species modeled at ~250 m resolution. Second, dropped pre-2014 records (the species wasn’t in the US before then). Third, removed duplicate coordinates rounded to ~11 m. Fourth, applied grid-based spatial thinning at ~1 km resolution, keeping one record per grid cell to reduce spatial autocorrelation in the training data. The final dataset: 8,835 presence records.
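The four steps above are straightforward pandas operations. Here's a minimal sketch of the pipeline, assuming GBIF's Darwin Core column names (`coordinateUncertaintyInMeters`, `decimalLatitude`, etc.) as returned by pygbif; the thresholds match the ones described, with thinning done on a ~0.01° (~1 km) grid:

```python
import pandas as pd

def clean_and_thin(df: pd.DataFrame) -> pd.DataFrame:
    """Four-step quality pipeline for GBIF occurrence records."""
    # 1. Drop records with coordinate uncertainty above 1,000 m
    #    (in this sketch, records with no stated uncertainty are dropped too).
    df = df[df["coordinateUncertaintyInMeters"] <= 1000]
    # 2. Drop pre-2014 records (the species wasn't in the US before then).
    df = df[df["year"] >= 2014]
    # 3. Remove duplicate coordinates at ~11 m precision (4 decimal places).
    df = df.assign(
        _lat4=df["decimalLatitude"].round(4),
        _lon4=df["decimalLongitude"].round(4),
    ).drop_duplicates(subset=["_lat4", "_lon4"])
    # 4. Grid-based spatial thinning: keep one record per ~1 km cell
    #    (0.01 degrees is roughly 1.1 km of latitude).
    df = df.assign(
        _row=(df["decimalLatitude"] // 0.01).astype(int),
        _col=(df["decimalLongitude"] // 0.01).astype(int),
    ).drop_duplicates(subset=["_row", "_col"])
    return df.drop(columns=["_lat4", "_lon4", "_row", "_col"])
```

The ordering matters: duplicates are removed before thinning so that a pile of identical photos at one trailhead doesn't "use up" the grid cell before a genuinely distinct nearby record is considered.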

Zero of those records fall within approximately 50 km of GSMNP.

This GBIF data pipeline wasn’t entirely new to me. I had built a similar workflow earlier for a raccoon (Procyon lotor) occurrence dataset - parsing Darwin Core format, running geographic EDA, generating proper GBIF citations. That smaller project taught me the mechanics. The SLF pull was larger and messier, but the pattern was familiar.


The Environmental Layers

To project SLF habitat suitability across GSMNP, I needed environmental predictor layers that describe conditions across the landscape - variables that capture what distinguishes places SLF has colonized from places it hasn’t.

I assembled six layers:

  • PRISM 30-year climate normals [4]: mean annual precipitation, mean max temperature, mean temperature, and mean min temperature - all at ~800 m native resolution (30 arc-second), covering the continental US.
  • USGS 3DEP elevation [5]: 1/3 arc-second (~10 m native) DEM, accessed via the py3dep library from the HyRiver project.
  • NLCD 2021 land cover [6]: 30 m native resolution, fetched via pygeohydro.

Every layer was reprojected to EPSG:5070 (NAD83 / Conus Albers Equal Area - the standard projection for continental US ecological work, where equal-area cells matter because every pixel gets a prediction), clipped to the GSMNP boundary plus a 10 km buffer, and resampled to a common 250 m grid.
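The raster warping itself ran through GDAL-backed tooling, but the coordinate math is easy to sanity-check with pyproj. A minimal sketch - the park-center coordinates below are illustrative, not values from the pipeline:

```python
from pyproj import Transformer

# WGS84 lon/lat -> NAD83 / Conus Albers (EPSG:5070). always_xy=True keeps
# (longitude, latitude) axis order regardless of each CRS's convention.
to_albers = Transformer.from_crs("EPSG:4326", "EPSG:5070", always_xy=True)

# A point near the middle of GSMNP (illustrative).
x, y = to_albers.transform(-83.5, 35.6)

# On an equal-area grid, snapping a point to the 250 m mesh is integer
# division: every observation lands in exactly one pixel of equal area.
col, row = int(x // 250), int(y // 250)
```

The equal-area property is why EPSG:5070 matters here: pixel counts, smoothing kernels, and area summaries all assume every cell covers the same ground area, which lon/lat degrees do not.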

That 250 m target resolution was a deliberate choice worth explaining. The coarsest input (PRISM climate) is ~800 m native. Going finer - to 30 m, say - would create a roughly 27×27 block of sub-pixels per climate cell, all sharing identical climate values but differing in elevation: false heterogeneity that implies precision the data doesn’t support. Going coarser would discard the real topographic variation in a park with 1,750 m of elevation relief, where conditions change dramatically over short distances. 250 m is a defensible middle ground.

The output: a 6-band GeoTIFF at 437×286 pixels (81,715 valid pixels within the study extent mask), covering roughly 109 km × 72 km of the Smokies and surrounding landscape.


What is a Random Forest, Actually?

I had heard “Random Forest” for years - in conference talks, in job descriptions, in the section headers of papers I skimmed. I read descriptions and it always sounded impressive. This project was in part an excuse to build one.

A Random Forest is an ensemble of decision trees - in this case, 500 of them [9]. Each tree is trained on a random subset of the training data and allowed to split on only a random subset of the predictor variables at each node. That randomness is the whole trick: it prevents any single tree from memorizing the training data, and when 500 trees with different views of the data vote together, the ensemble is far more stable and generalizable than any individual tree.

For a species distribution model, the voting output is what matters. For any input point - a set of climate and elevation values - I ask all 500 trees: “does this look like SLF habitat?” The fraction that vote yes is the predicted suitability score. A score of 0.35 means 175 out of 500 trees thought the conditions matched. This is a relative suitability index, not a calibrated probability - the distinction matters, and I’ll come back to it.

The training setup is a presence-background framework [11]: I have 8,835 places where SLF was observed (class 1) but no confirmed absences. The standard approach is to pair presences with random background points drawn from the accessible environment - here, 10,000 points sampled uniformly across the southeastern US calibration area. The model learns to distinguish “conditions where SLF was found” from “conditions in the broader environment it could theoretically occupy.” Background points aren’t pseudo-absences - they’re the reference distribution.
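The whole presence-background setup fits in a few lines of sklearn. A toy sketch with synthetic data standing in for the real extracted values - two fake predictors instead of five, and made-up distributions - just to show the mechanics:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Toy stand-ins for the real predictors (think "min winter temperature"
# and "elevation"). Presences cluster in warm lowlands; background points
# sample the whole accessible environment uniformly.
presence = rng.normal(loc=[2.0, 300.0], scale=[1.5, 150.0], size=(500, 2))
background = np.column_stack([
    rng.uniform(-10, 8, 2000),    # regional temperature range
    rng.uniform(0, 2000, 2000),   # regional elevation range
])

X = np.vstack([presence, background])
y = np.concatenate([np.ones(len(presence)), np.zeros(len(background))])

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X, y)

# Suitability score: predict_proba averages per-tree class probabilities,
# which for fully grown trees behaves like the vote fraction described
# above. First point: core presence conditions; second: cold highlands.
suitability = rf.predict_proba([[2.0, 300.0], [-8.0, 1800.0]])[:, 1]
```

Run on this toy data, the warm lowland point scores high and the cold high-elevation point scores near zero - the same relative-ranking behavior the park-wide prediction relies on.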

One notable decision: I excluded land cover from model training. This wasn’t because land cover is ecologically irrelevant (it isn’t - SLF host trees have specific habitat associations). It was because extracting NLCD across the full southeastern US training extent was impractical at notebook scope, and because NLCD values at occurrence points would partly reflect where iNaturalist observers cluster - in developed, road-adjacent areas - rather than genuine SLF habitat preference. The five-predictor model (four PRISM climate variables + elevation) is simpler, but its training signal is cleaner.

My first Random Forest was actually a RandomForestRegressor for a rainfall prediction project - predicting daily precipitation from historical NOAA weather station data. Classification on spatial data is a different application, but the sklearn workflow was familiar ground. Going from a weather regressor to a spatial habitat classifier felt like a natural next step rather than a leap.


Evaluating the Model: Why Random Splits Don’t Work for Spatial Data

The standard machine learning approach to model evaluation - randomly split data into training and test sets, report accuracy on the held-out test set - fails badly for geospatial data. Nearby occurrence points share nearly identical environmental conditions. If adjacent points end up in different folds, the model effectively memorizes local conditions and reports inflated performance. You’re testing whether the model can interpolate between neighboring points, not whether it can generalize to new environments - which is exactly what you need when projecting into a region with zero training data. [10]

The solution is spatial block cross-validation. I assigned all training points to 2° geographic blocks (roughly 150–220 km across at these latitudes, well above the spatial autocorrelation range of PRISM climate layers). With GroupKFold(5), entire geographic blocks rotate between training and test sets, forcing the model to demonstrate it can extrapolate across geographic distance.
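The block assignment plus GroupKFold is only a few lines. A sketch with synthetic coordinates standing in for the real training points (in the actual pipeline, X holds the five predictor values, not lon/lat):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic stand-ins for the thinned presence + background coordinates.
lon = rng.uniform(-90.0, -75.0, 1000)
lat = rng.uniform(30.0, 42.0, 1000)
X = np.column_stack([lon, lat])

# Assign each point to a 2-degree geographic block: one integer ID per
# block, combining the lon and lat block indices.
blocks = (np.floor(lon / 2.0).astype(int) * 1000
          + np.floor(lat / 2.0).astype(int))

# With GroupKFold, whole blocks rotate between train and test, so the
# model is always scored on geographically disjoint data.
n_folds = 0
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, groups=blocks):
    # No geographic block appears on both sides of the split.
    assert not set(blocks[train_idx]) & set(blocks[test_idx])
    n_folds += 1
```

The assertion inside the loop is the whole point: a random split would scatter every block across both sides, and the AUC would measure interpolation rather than extrapolation.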

The result: AUC-ROC of 0.747 ± 0.126, with individual folds ranging from 0.62 to 0.93. That variance is real and expected - some spatial blocks contain environments very different from the rest of the training data, and the model struggles to generalize to them. The mean of 0.747 is acceptable for a five-predictor first-pass SDM with no host tree information. Feature importance analysis showed elevation and minimum winter temperature as the dominant predictors, consistent with SLF’s known association with low-elevation, mild-winter environments - cold winters limit survival of egg masses. [1]

The suitability predictions across GSMNP ranged from 0.000 to 0.458, with a mean of 0.040. That compressed range carries real information: no pixel in the park matches the core invasion-range conditions of the mid-Atlantic lowlands. The park sits at the edge of SLF’s currently occupied environmental envelope, not its center. I left those values as-is rather than rescaling - rank-transforming to fill a [0,1] range would imply that some park location has conditions as favorable as the Philadelphia suburbs, which the model explicitly says is not true.


The Detection Gap

A habitat suitability map is only half of the problem. High predicted suitability in an area that’s been thoroughly surveyed and found SLF-free is a different situation than high suitability in an area no one has ever walked through.

To build the effort surface, I pulled 100,000 iNaturalist observations - all taxa, any species - from the GSMNP study area via the GBIF API. All 100,000 records, not specifically SLF-related. The reasoning: iNaturalist naturalists are photographing organisms broadly and would notice a large, conspicuous invasive insect colonizing trees along a trail. Using all-iNaturalist rather than all-GBIF was deliberate: eBird dominates GBIF’s human observation counts in the eastern US, and birders are specifically looking for birds.

I rasterized the 96,683 observations that fell within my grid onto the 250 m pixel mesh, then applied a Gaussian filter with σ=4 pixels (1 km). That sigma is not arbitrary: a three-sigma effective radius of 3 km approximates the detection zone of a ground observer for a conspicuous insect on vegetation. Smaller and you get a sparse point pattern that exaggerates the spatial heterogeneity; larger and you lose the meaningful gradient between the park’s high-traffic corridors and its genuinely remote backcountry.
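The smoothing step is a single scipy call. A small sketch on a toy grid (the real mesh is 437×286; the hotspot positions and counts here are invented for illustration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Toy observation counts on a 250 m pixel mesh: a couple of "trailhead"
# hotspots, zeros everywhere else.
counts = np.zeros((60, 90))
counts[10, 20] = 500.0   # a busy trailhead
counts[45, 70] = 120.0   # a quieter access point

# sigma=4 pixels = 1 km at 250 m resolution; the ~3-sigma reach is 3 km.
smoothed = gaussian_filter(counts, sigma=4)

# Normalize to [0, 1] so the effort surface can be combined with the
# suitability surface on the same scale.
effort = smoothed / smoothed.max()
```

Each point count spreads into a 1 km bump of "this area has been looked at", and pixels beyond the kernel's reach stay at zero - exactly the unsurveyed backcountry the gap score is after.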

The gap score is simple: gap = suitability × (1 - effort).

This formula has no free parameters, unlike weighted additive combinations that require tuning. It has correct zero-floor behavior - if suitability is zero, gap is zero regardless of how unsurveyed the area is, because there’s nothing to detect. The absolute scale of the gap score doesn’t carry meaning; what matters is spatial ranking. The highest gap score in the dataset was 0.431.
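In code the combination is one line. A toy example with four invented pixels, both layers assumed scaled to [0, 1], shows both properties:

```python
import numpy as np

# Four illustrative pixels: well-surveyed suitable habitat, unsurveyed
# suitable habitat, unsuitable habitat, and a middling case.
suitability = np.array([0.40, 0.40, 0.00, 0.10])
effort      = np.array([0.90, 0.05, 0.00, 0.50])

# Highest where habitat is suitable AND nobody is looking.
gap = suitability * (1.0 - effort)
```

The two pixels with identical suitability (0.40) split sharply - the well-surveyed one scores 0.04, the unsurveyed one 0.38 - while the zero-suitability pixel stays at zero no matter how unsurveyed it is.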

The spatial pattern that emerges is the map at the top of this post. High-gap areas concentrate in lower-elevation zones near the park boundary, particularly areas with road access but away from the most heavily trafficked trailheads. These are ecologically plausible SLF environments - warm, low elevation, near the Ailanthus-colonized disturbed edges that SLF favors - and they’re being watched by almost no one.


What I Learned

This project was simultaneously a first in three categories: first Random Forest classifier, first geospatial raster processing pipeline, first species distribution model. Each of the four notebooks required learning a domain I’d only read about before. Building it end-to-end, with a real question attached, turned out to be the only way I actually retained any of it.

The honest limitations: the model’s fold-level AUC variance (0.62–0.93) means it performs well in some geographic contexts and poorly in others. The compressed suitability range tells me the GSMNP environment sits at the margin of SLF’s occupied niche, not inside it - the model is hedging appropriately. Adding host tree distribution data (particularly Ailanthus altissima presence) would almost certainly improve predictions; so would incorporating additional occurrence data as the Tennessee invasion front advances. Identifying the likely vectors of spread would help as well. A five-predictor climate model is a starting point, not a final answer.


Data Sources

[1] Barringer, L.E., et al. (2015). First records of Lycorma delicatula (Hemiptera: Fulgoridae) in the United States. Journal of the Entomological Society of Washington, 117(4), 379–382.

[2] GBIF.org - Global Biodiversity Information Facility. Occurrence data accessed via pygbif.

[3] iNaturalist - Community observation platform; data aggregated through GBIF.

[4] PRISM Climate Group, Oregon State University. 30-year climate normals (1991–2020), 30 arc-second resolution. Accessed 2025.

[5] U.S. Geological Survey. 3D Elevation Program (3DEP), 1/3 arc-second DEM. Accessed via py3dep (HyRiver project).

[6] Multi-Resolution Land Characteristics Consortium (MRLC). National Land Cover Database 2021. Accessed via pygeohydro.

[7] National Park Service. GSMNP boundary, road, and trail shapefiles.

Literature

[8] Elith, J., et al. (2006). Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 29(2), 129–151.

[9] Cutler, D.R., et al. (2007). Random forests for classification in ecology. Ecology, 88(11), 2783–2792.

[10] Roberts, D.R., et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40(8), 913–929.

[11] Phillips, S.J., Dudík, M., Elith, J., Graham, C.H., Lehmann, A., Leathwick, J., & Ferrier, S. (2009). Sample selection bias and presence-only distribution models: Implications for background and pseudo-absence data. Ecological Applications, 19(1), 181–197.