Webhooks are a critical component of modern web applications, particularly when dealing with real-time communication from external services like payment gateways. When properly configured, webhooks allow services to push data to your system, enabling instantaneous event tracking, payment confirmations, or updates. However, this system depends heavily on accessibility and reliability. In the case examined here, the webhook listener was deployed behind a VPN, which led to consistent delivery failures from payment gateways.
TLDR
When hosting a webhook receiver behind a VPN, external services like payment gateways cannot reach it due to restricted network visibility. This results in webhook failures and can lead to serious synchronization issues in your systems. The solution is to introduce a public relay server combined with a secure tunneling mechanism to forward incoming HTTPS requests to your private backend. This article explains how this architecture was implemented and outlines key takeaways to ensure reliable handling of webhook events.
Understanding the Issue
To start with, it’s essential to understand how webhooks work. A webhook is essentially an HTTP callback: a mechanism through which one service can notify another about an event via an HTTP POST request. The URL receiving the request must be publicly accessible so that the source system—such as Stripe, PayPal, or Square—can deliver the payload in real-time.
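To make the callback model concrete, here is a minimal sketch of a webhook receiver using only Python's standard library. The path /webhooks/payments, the port 8080, and the "type" field in the payload are assumptions chosen for illustration; real providers define their own payload shapes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts POSTed JSON events on a hypothetical /webhooks/payments path."""

    def do_POST(self):
        if self.path != "/webhooks/payments":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        raw_body = self.rfile.read(length)
        try:
            event = json.loads(raw_body)
        except ValueError:
            self.send_error(400, "body is not valid JSON")
            return
        # "type" is an assumed field name; a real handler would enqueue the event.
        event_type = event.get("type") if isinstance(event, dict) else None
        print("received event:", event_type)
        # Acknowledge quickly with a 2xx so the sender does not retry.
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default per-request stderr logging


def make_server(host="127.0.0.1", port=8080):
    """Build (but do not start) the receiver; call serve_forever() to run it."""
    return HTTPServer((host, port), WebhookHandler)
```

Calling make_server().serve_forever() starts the listener. The important point for this article is the bind address: a listener that is only reachable inside a private network never sees the provider's POST.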
In this case, the listener was placed behind a VPN. This meant that even though the internal system could initiate outbound requests, it was invisible to the public internet. Payment gateways trying to push webhooks saw connection timeouts or 403/404 responses, and then retried or gave up according to their own retry and error-handling logic.
Most payment gateways assume the destination webhook URL is a live HTTP endpoint accessible through the public internet, and will fail—often silently—if it’s not reachable. These failures triggered missed payment events and incorrect system states.
Diagnosing Webhook Failures
The webhook failures were discovered during routine payment audits when it became clear some transactions were missing associated confirmation events. Troubleshooting involved:
- Reviewing logs on the webhook receiver
- Enabling verbose debug logs on payment gateway dashboards
- Using curl from external machines to simulate public reachability
- Capturing network traffic using tcpdump and Wireshark
All signs pointed to a network-level issue: the listener was not exposed to the internet.
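The external reachability check can also be scripted. The sketch below is not the curl command used in the investigation; it is a rough Python equivalent that classifies a plain TCP connection attempt. It only tells you something useful when run from a machine outside the VPN, and the function name probe is hypothetical.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Something answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # Packets are being dropped, typical of a host hidden behind a VPN or firewall.
        return "timeout"
    except OSError:
        # DNS failure, unreachable network, and similar errors.
        return "error"
```

A result of "timeout" or "error" from an outside machine, combined with "open" from inside the VPN, is consistent with the network-level diagnosis described above.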
The Public Relay and Tunneling Architecture
To solve the exposure issue while retaining the internal hosting environment and VPN for security, a new architecture was introduced: a public relay server with tunneling capabilities.
This setup had several goals:
- Ensure public accessibility of the webhook endpoint
- Securely forward incoming payloads behind the VPN
- Allow internal processing of webhooks without exposing the entire backend
- Maintain observability and logging
Here’s the architectural solution that was implemented:
- A public-facing web server (AWS EC2, DigitalOcean droplet, etc.) was configured to serve as a relay endpoint.
- This server accepted HTTPS webhook requests from payment providers and simply forwarded them to a local port over a persistent tunnel (using ngrok or SSH reverse tunnels).
- On the internal machine behind the VPN, a tunnel client maintained a live secure connection to the relay server, listening on a specific path or port.
- When a webhook request hit the public endpoint, it was piped securely to the internal server, where it was processed as if it had arrived directly.
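The forwarding step on the relay can be sketched as follows. In the setup described here Nginx handled routing and SSL termination, so this Python version is only an illustration of the idea, not the deployed configuration. The tunnel port 9000 and the names RelayHandler and make_relay are assumptions, and the sketch only handles request bodies sent with a Content-Length header.

```python
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the tunnel exposes the internal listener on this local port.
TUNNEL_URL = "http://127.0.0.1:9000"
# Headers that describe this hop only and should not be copied to the next one.
SKIPPED_HEADERS = ("host", "content-length", "connection", "transfer-encoding")


class RelayHandler(BaseHTTPRequestHandler):
    upstream = TUNNEL_URL

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # Pass the body through byte-for-byte so provider signatures stay valid.
        forward = urllib.request.Request(self.upstream + self.path, data=body, method="POST")
        for name, value in self.headers.items():
            if name.lower() not in SKIPPED_HEADERS:
                forward.add_header(name, value)
        try:
            with urllib.request.urlopen(forward, timeout=10) as resp:
                status, payload = resp.status, resp.read()
        except urllib.error.HTTPError as err:
            # The internal service answered with an error; relay it unchanged.
            status, payload = err.code, err.read()
        except (urllib.error.URLError, OSError):
            # Tunnel is down: report a 502 so senders with retry logic can try again.
            status, payload = 502, b"tunnel unavailable"
        self.send_response(status)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_relay(upstream=TUNNEL_URL, host="127.0.0.1", port=8081):
    """Build (but do not start) a relay bound to the given upstream URL."""
    handler = type("BoundRelayHandler", (RelayHandler,), {"upstream": upstream})
    return HTTPServer((host, port), handler)
```

Forwarding the raw bytes rather than re-serializing parsed JSON matters later, because signature checks are computed over the exact body the provider sent.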
Tools Used in Our Setup
- ngrok: Simple tool for creating HTTP tunnels. Best suited for development or testing.
- OpenSSH: Used for reverse tunneling via a command like ssh -R. Worked well for production-like environments.
- Nginx: Used to handle SSL termination and endpoint routing on the public server.
- Autossh: Maintained persistent SSH tunnels even during intermittent SSH drops.
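As a rough illustration of how the OpenSSH and autossh pieces fit together, the helper below assembles an autossh command line for a reverse tunnel. The host name, user, and port numbers are placeholders, and the keepalive values are example settings rather than the ones used in this deployment.

```python
def reverse_tunnel_command(relay_host, relay_port=9000, local_port=8080, user="tunnel"):
    """Return an argv list that opens a persistent reverse tunnel to the relay."""
    return [
        "autossh",
        "-M", "0",  # disable autossh's monitor port; rely on SSH keepalives instead
        "-N",       # no remote command, forwarding only
        "-o", "ServerAliveInterval=30",
        "-o", "ServerAliveCountMax=3",
        "-o", "ExitOnForwardFailure=yes",  # exit if the remote port cannot be bound
        # Relay's 127.0.0.1:relay_port is forwarded to this machine's local_port.
        "-R", f"127.0.0.1:{relay_port}:127.0.0.1:{local_port}",
        f"{user}@{relay_host}",
    ]
```

The list can be passed to subprocess.run on the internal machine, or joined into a string for a service unit. Binding the remote side to 127.0.0.1 keeps the tunnel port reachable only from the relay itself, so the public web server remains the single entry point.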
Security Considerations
While the architectural fix allowed webhooks to pass through to the private server, additional security measures were implemented:
- Mutual TLS (mTLS) between the internal service and the public relay
- Validating HMAC signatures sent by payment providers to ensure payload authenticity
- Firewall rules limiting access to the relay server strictly to payment gateways’ known IPs
This not only protected against spoofed requests but ensured compliance with the security policies expected by the payment providers and regulatory bodies.
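The HMAC check is the easiest of these measures to show in code. The sketch below assumes a hex-encoded HMAC-SHA256 computed over the raw request body with a shared secret. Each provider documents its own header name, encoding, and signed string (some include a timestamp), so treat this as a generic pattern rather than any specific provider's scheme.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Compare the expected and received signatures in constant time."""
    expected = sign(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

The check has to run against the exact bytes received. Parsing the JSON and serializing it again before verifying will usually change whitespace or key order and make valid signatures fail.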
Challenges Faced
Several non-obvious challenges emerged while transitioning to this architecture:
- Latency: While not significant, introducing another network hop slightly increased latency.
- Monitoring: Visibility into failures had to shift to both the external relay and internal systems.
- Failover: A backup tunnel mechanism was required to ensure uptime in case the primary tunnel failed.
These were mitigated by proactive monitoring, integrating monitoring tools like Prometheus and Grafana, and deploying multiple relay endpoints in different regions.
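One simple way to approach the failover point is to have the relay try its tunnels in priority order. This is a sketch under assumptions, not the mechanism used in this deployment: it treats each tunnel as a local (host, port) pair and considers it healthy if it accepts a TCP connection.

```python
import socket


def first_reachable(upstreams, timeout=1.0):
    """Return the first (host, port) that accepts a TCP connection, or None."""
    for host, port in upstreams:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            # Refused, timed out, or unreachable: fall through to the next tunnel.
            continue
    return None
```

For example, first_reachable([("127.0.0.1", 9000), ("127.0.0.1", 9001)]) would prefer a primary tunnel on port 9000 and fall back to a backup on 9001. A None result is a natural signal to raise an alert.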
Best Practices for Future Deployments
The fix provided lasting stability and reliability. Here is some advice for others deploying behind VPNs:
- Use public endpoints when dealing with external callbacks
- Keep sensitive processing on private systems behind VPN/firewalls
- Employ tunneling solutions with robust reconnection and encryption
- Always log incoming webhook data for traceability
- Verify requests through HMAC or provided authentication mechanisms
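For the logging recommendation, a structured record per request is usually enough to trace a missing event later. The sketch below is one possible shape with hypothetical field names. It records a hash of the body instead of the body itself, on the assumption that payment payloads may contain data you do not want in log files; log the full payload if your policies allow it.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("webhooks")


def log_webhook(path: str, body: bytes, status: int) -> dict:
    """Emit one JSON log line per received webhook and return the record."""
    record = {
        "received_at": datetime.now(timezone.utc).isoformat(),
        "path": path,
        "bytes": len(body),
        # The digest lets you match a log line to a payload without storing it.
        "sha256": hashlib.sha256(body).hexdigest(),
        "status": status,
    }
    logger.info(json.dumps(record))
    return record
```

Writing the same record on both the relay and the internal listener makes it possible to tell whether an event was lost before or after the tunnel.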
Conclusion
Operating webhook listeners behind a VPN or private environment introduces serious delivery and reliability issues. However, the use of a secure tunneling setup with a public relay endpoint provides a clean, scalable solution. By preserving private network integrity while enabling external communication, this architecture ensures that incoming webhook events arrive securely and consistently, unlocking the real power of event-driven integration with payment providers and beyond.
FAQ
- Why were payment webhooks failing?
  - Because the webhook listener was behind a VPN and not accessible from the public internet, payment gateways couldn’t deliver their payloads.
- Why not expose the internal server directly?
  - Security concerns often prevent directly exposing back-end servers. Services behind VPNs typically host sensitive business logic or data.
- What is a public relay?
  - A public relay is a server with a public IP that receives requests on behalf of your internal server and forwards them over a safe tunnel.
- Can this setup scale for production use?
  - Yes, with robust tools like SSH reverse tunnels or commercial tunnel services, this architecture can scale, especially when paired with logging and monitoring.
- Is using ngrok safe for production?
  - Not recommended for production. It’s great for testing but can be rate limited or less secure in long-running deployments.
- What are some secure alternatives to ngrok?
  - Options like self-hosted frp (Fast Reverse Proxy), OpenSSH with autossh, or dedicated reverse proxies on cloud infrastructure are more secure and reliable.