Why webhooks from payment gateways failed when my webhook receiver was behind a VPN, and the public relay + tunneling architecture I used to fix it

Webhooks are a critical component of modern web applications, particularly for real-time communication from external services like payment gateways. When properly configured, they let a service push data to your system, enabling near-instant event tracking, payment confirmations, or updates. However, this only works if the sender can actually reach the receiving endpoint. In the case examined here, the webhook listener was deployed behind a VPN, which led to consistent delivery failures from payment gateways.

TL;DR

When hosting a webhook receiver behind a VPN, external services like payment gateways cannot reach it due to restricted network visibility. This results in webhook failures and can lead to serious synchronization issues in your systems. The solution is to introduce a public relay server combined with a secure tunneling mechanism to forward incoming HTTPS requests to your private backend. This article explains how this architecture was implemented and outlines key takeaways to ensure reliable handling of webhook events.

Understanding the Issue

To start with, it’s essential to understand how webhooks work. A webhook is essentially an HTTP callback: a mechanism through which one service notifies another about an event via an HTTP POST request. The URL receiving the request must be publicly accessible so that the source system (such as Stripe, PayPal, or Square) can deliver the payload in real time.
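To make the callback idea concrete, here is a minimal sketch of a webhook listener using only the Python standard library. The path /webhooks/payments, the JSON body format, and the in-memory RECEIVED_EVENTS list are assumptions for illustration, not details of the original system.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Events received so far; a real listener would hand these to business logic.
RECEIVED_EVENTS = []


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts POSTed JSON events on the hypothetical path /webhooks/payments."""

    def do_POST(self):
        if self.path != "/webhooks/payments":
            self.send_error(404, "unknown webhook path")
            return
        length = int(self.headers.get("Content-Length", 0))
        raw_body = self.rfile.read(length)
        try:
            event = json.loads(raw_body)
        except json.JSONDecodeError:
            self.send_error(400, "body is not valid JSON")
            return
        RECEIVED_EVENTS.append(event)
        # Acknowledge with a 2xx status so the sender records a successful delivery.
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_listener(host="127.0.0.1", port=0):
    """Start the listener in a background thread; port=0 picks a free port."""
    server = HTTPServer((host, port), WebhookHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The sender only ever sees the status code, so the whole exchange depends on the sender being able to open a connection to this listener in the first place.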

In this case, the listener was placed behind a VPN. The internal system could still initiate outbound requests, but nothing on the public internet could open a connection to it. Payment gateways trying to push webhooks saw connection timeouts or 403/404 responses (these necessarily came from whatever device answered at the public address, because the requests never reached the listener).

Most payment gateways assume the destination webhook URL is a live HTTP endpoint reachable over the public internet. When it is not, delivery fails, and the failure is easy to miss on the receiving side: the request never arrives, so nothing is logged there. These failures led to missed payment events and incorrect system states.

Diagnosing Webhook Failures

The webhook failures were discovered during routine payment audits when it became clear some transactions were missing associated confirmation events. Troubleshooting involved:

Reviewing logs on the webhook receiver
Enabling verbose debug logs on payment gateway dashboards
Using curl from external machines to test public reachability
Capturing network traffic using tcpdump and Wireshark

All signs pointed to a network-level issue: the listener was not exposed to the internet.
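The external reachability check can also be scripted. The sketch below is a rough Python equivalent of the curl test, meant to be run from a machine outside the VPN. The function name probe_webhook_url and the empty JSON test body are assumptions for illustration.

```python
import http.client
import socket
from urllib.parse import urlsplit


def probe_webhook_url(url, timeout=5.0):
    """POST an empty JSON body to url and classify what happens.

    Returns "http <status>" when something answered, "timeout" when the
    connection or response timed out, and "unreachable: <error>" otherwise.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        conn.request(
            "POST",
            parts.path or "/",
            body=b"{}",
            headers={"Content-Type": "application/json"},
        )
        return f"http {conn.getresponse().status}"
    except socket.timeout:
        # Typical when a firewall silently drops the packets.
        return "timeout"
    except (OSError, http.client.HTTPException) as exc:
        # Connection refused, DNS failure, TLS error, and similar.
        return f"unreachable: {exc.__class__.__name__}"
    finally:
        conn.close()
```

A "timeout" or "unreachable" result from outside the VPN, combined with a normal response from inside it, points at network exposure rather than at the application.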

The Public Relay and Tunneling Architecture

To solve the exposure issue while retaining the internal hosting environment and VPN for security, a new architecture was introduced: a public relay server with tunneling capabilities.

This setup had several goals:

Ensure public accessibility of the webhook endpoint
Securely forward incoming payloads to the listener behind the VPN
Allow internal processing of webhooks without exposing the entire backend
Maintain observability and logging

Here’s the architectural solution that was implemented:

A public-facing web server (AWS EC2, DigitalOcean droplet, etc.) was configured to serve as a relay endpoint.
This server accepted HTTPS webhook requests from payment providers and forwarded them to a local port that was the relay-side end of a persistent tunnel (ngrok or an SSH reverse tunnel).
On the internal machine behind the VPN, a tunnel client kept an outbound, encrypted connection open to the relay server and handed traffic arriving through it to the webhook listener's port.
When a webhook request hit the public endpoint, it was piped securely to the internal server, where it was processed as if it had arrived directly.
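The forwarding step on the relay can be sketched as follows. This is an illustrative stand-in written in Python, not the Nginx configuration used in the original setup. It assumes TLS has already been terminated in front of it and that the tunnel exposes the internal listener on a local port of the relay. The names make_relay_handler and start_relay are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Headers that describe the client-to-relay connection and are not copied upstream.
SKIPPED_HEADERS = {
    "connection", "keep-alive", "transfer-encoding", "upgrade", "te",
    "trailer", "proxy-authorization", "proxy-authenticate",
    "host", "content-length",  # http.client sets these itself
}


def make_relay_handler(tunnel_host, tunnel_port, timeout=10.0):
    """Build a handler that forwards POST requests to the tunnel's local port."""

    class RelayHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)  # forwarded byte for byte
            headers = {
                name: value
                for name, value in self.headers.items()
                if name.lower() not in SKIPPED_HEADERS
            }
            conn = http.client.HTTPConnection(tunnel_host, tunnel_port, timeout=timeout)
            try:
                conn.request("POST", self.path, body=body, headers=headers)
                upstream = conn.getresponse()
                status, payload = upstream.status, upstream.read()
            except (OSError, http.client.HTTPException):
                # Tunnel or backend is down: answer 502 so the sender can retry.
                status, payload = 502, b"upstream unavailable"
            finally:
                conn.close()
            self.send_response(status)
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep the example quiet

    return RelayHandler


def start_relay(listen_port, tunnel_host, tunnel_port):
    """Run the relay on 127.0.0.1 in a background thread and return the server."""
    handler = make_relay_handler(tunnel_host, tunnel_port)
    server = ThreadingHTTPServer(("127.0.0.1", listen_port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The body and any signature headers are passed through unchanged on purpose: a signature check on the internal server would fail if the relay re-encoded the payload.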

Tools Used in Our Setup

ngrok: Simple tool for creating HTTP tunnels. Best suited for development or testing.
OpenSSH: Used for reverse port forwarding via a command like ssh -R. Worked well for production-like environments.
Nginx: Used to handle TLS termination and endpoint routing on the public server.
autossh: Kept the SSH tunnel persistent by restarting it whenever the connection dropped.
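To show what the ssh -R and autossh combination does, here is a small Python sketch that builds a reverse-tunnel command and restarts it with exponential backoff. It illustrates the idea and is not how autossh itself is implemented. The relay address, the port numbers, and the function names are assumptions, while the ssh options used (-N, -R, ServerAliveInterval, ServerAliveCountMax, ExitOnForwardFailure) are standard OpenSSH options.

```python
import subprocess
import time


def build_tunnel_command(relay_user_host, relay_port, local_port):
    """Return an ssh command exposing local_port on the relay's loopback interface."""
    return [
        "ssh",
        "-N",                               # no remote command, forwarding only
        "-o", "ServerAliveInterval=30",     # probe the connection every 30 s
        "-o", "ServerAliveCountMax=3",      # give up after 3 missed probes
        "-o", "ExitOnForwardFailure=yes",   # exit if the remote port cannot be bound
        "-R", f"127.0.0.1:{relay_port}:127.0.0.1:{local_port}",
        relay_user_host,
    ]


def backoff_delays(base=1.0, cap=60.0):
    """Yield 1, 2, 4, ... seconds, capped at cap."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)


def keep_tunnel_alive(command, run=subprocess.call, sleep=time.sleep, max_attempts=None):
    """Run command repeatedly, sleeping with backoff between attempts.

    run blocks for as long as the tunnel stays up. max_attempts=None means
    retry forever; the number of attempts made is returned otherwise.
    """
    attempts = 0
    for delay in backoff_delays():
        run(command)
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            return attempts
        sleep(delay)
```

A call such as keep_tunnel_alive(build_tunnel_command("tunnel@relay.example.com", 9000, 8080)) would run on the internal machine. The sketch never resets the backoff after a long-lived connection, which a production supervisor would want to do.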

Security Considerations

While the architectural fix allowed webhooks to pass through to the private server, additional security measures were implemented:

Mutual TLS (mTLS) between the internal service and the public relay
Validating HMAC signatures sent by payment providers to ensure payload authenticity
Firewall rules limiting access to the relay server strictly to the payment gateways’ published IP ranges

This protected against spoofed requests and helped meet the security requirements set by the payment providers and applicable regulations.
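Signature validation is the measure most easily shown in code. The sketch below assumes a provider that sends a hex-encoded HMAC-SHA256 of the raw request body in a header. Real providers differ in header name, encoding, and what exactly is signed, so the scheme here is an assumption and the provider's documentation is authoritative.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 of raw_body under secret.

    raw_body must be the exact bytes received, before any JSON parsing or
    re-serialization, otherwise the digest will not match.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged signature were correct.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

A handler would call this before acting on the event and reject the request (for example with a 400 status) when it returns False.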

Challenges Faced

Several non-obvious challenges emerged while transitioning to this architecture:

Latency: While not significant, introducing another network hop slightly increased latency.
Monitoring: Visibility into failures had to shift to both the external relay and internal systems.
Failover: A backup tunnel mechanism was required to ensure uptime in case the primary tunnel failed.

These were mitigated by proactive monitoring with tools like Prometheus (metrics) and Grafana (dashboards), and by deploying multiple relay endpoints in different regions.
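The failover idea can be sketched as a simple health check: try the primary tunnel port first and fall back to a backup if it does not accept connections. The function names and the list of (host, port) candidates are assumptions for illustration.

```python
import socket


def tunnel_is_up(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_upstream(candidates, timeout=2.0):
    """Return the first (host, port) in candidates that accepts connections.

    candidates is ordered by preference, for example primary tunnel first
    and backup tunnel second. Returns None when none of them is reachable.
    """
    for host, port in candidates:
        if tunnel_is_up(host, port, timeout):
            return host, port
    return None
```

A successful TCP connect only shows that the tunnel's port is open on the relay. It does not prove that the internal listener behind it is healthy, so an end-to-end check that sends a test request is a useful complement.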

Best Practices for Future Deployments

The fix provided lasting stability and reliability. Here are some best practices for others deploying webhook receivers behind VPNs:

Use public endpoints when dealing with external callbacks
Keep sensitive processing on private systems behind VPN/firewalls
Employ tunneling solutions with robust reconnection and encryption
Always log incoming webhook data for traceability
Verify requests through HMAC or provided authentication mechanisms
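For the logging practice, one possible shape is a single JSON line per delivery. The sketch below records a hash of the body rather than the body itself, which is a design choice of this example and not something the original setup prescribes. The field names and the X-Signature header name are assumptions.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone


def build_log_record(path, headers, raw_body, received_at=None):
    """Summarize one webhook delivery as a JSON-serializable dict."""
    received_at = received_at or datetime.now(timezone.utc)
    return {
        "received_at": received_at.isoformat(),
        "path": path,
        "content_length": len(raw_body),
        # The hash lets a delivery be matched against the provider's records
        # without writing payment details to the log.
        "body_sha256": hashlib.sha256(raw_body).hexdigest(),
        "signature_present": any(k.lower() == "x-signature" for k in headers),
    }


def log_webhook(logger, path, headers, raw_body, received_at=None):
    """Write the record as one JSON line at INFO level and return that line."""
    record = build_log_record(path, headers, raw_body, received_at)
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

Logging the same record on both the relay and the internal server makes it possible to tell on which hop a delivery was lost.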

Conclusion

Operating webhook listeners behind a VPN or private environment introduces serious delivery and reliability issues. A secure tunneling setup with a public relay endpoint provides a clean, scalable solution: it preserves private network integrity while allowing payment providers and other external services to deliver webhook events securely and consistently.

FAQ

Why were payment webhooks failing? Because the webhook listener was behind a VPN and not accessible from the public internet, payment gateways couldn’t deliver their payloads.
Why not expose the internal server directly? Security concerns often prevent directly exposing back-end servers. Services behind VPNs typically host sensitive business logic or data.
What is a public relay? A public relay is a server with a public IP that receives requests on behalf of your internal server and forwards them over a secure tunnel.
Can this setup scale for production use? Yes, with robust tools like SSH reverse tunnels or commercial tunnel services, this architecture can scale, especially when paired with logging and monitoring.
Is using ngrok safe for production? It is not recommended. It’s great for testing, but it can be rate limited, and it routes traffic through a third-party service, which may not suit long-running production deployments.
What are some secure alternatives to ngrok? Options like self-hosted frp (Fast Reverse Proxy), OpenSSH with autossh, or dedicated reverse proxies on cloud infrastructure give you more control and tend to be more reliable.