Our headquarters is located in Bowling Green, Ohio. (I’ll label this “HQ” from here on out – IP address is 18.104.22.168)
We have a branch office in Lebanon, OH using Spectrum Business Class (I’ll label this “BO” – IP address is 22.214.171.124)
We have a site-to-site VPN tunnel between HQ and BO. This allows the BO to use the servers at the HQ as well as the phone system. This tunnel has been up and running well for a couple of years.
Around July 16th, we started to see the tunnel drop anywhere from 0-5 times per day. Each drop is brief, but it is enough to close connections and drop phone calls, and it is irritating the BO staff members greatly.
I started by rebooting routers and firewalls at both ends of the tunnel. The problem continued. I spent literally hours poring over logs in the firewalls looking for something that could explain what I was seeing and could find no indication that anything was wrong. I talked to WatchGuard (the firewall manufacturer) and they looked at it and were able to find nothing other than that the firewalls were losing communication with each other (but not with the internet).
I started placing ping traps on a machine at the BO. What I found was that when the tunnel went down, the BO was still reaching the Internet, but was dropping pings to the HQ public IP (126.96.36.199).
I checked for bandwidth spikes on both ends. The BO traffic is barely using any bandwidth. The HQ normally stays in the 20% range, with very occasional spikes up to 60%, but nowhere near hitting the bandwidth cap.
I then ran a traceroute and mapped out the hops. I started running continuous pings on those hops to see if I could find where the traffic breaks down. This is where I found a problem. Below is a screenshot of me pinging the hops and capturing a breakdown in the tunnel. What you see is a series of continuous pings from the HQ to the BO. The first square is a ping to a device at the BO through the tunnel. The second is a ping of the WAN of the BO. Each square after that is a ping to the sequential hops between the HQ and the BO. As you can see, the first drop is at the 188.8.131.52 hop, and everything after that fails in a cascade. This capture was taken on 7/25 at 3:04 PM:
This pattern repeats each time the tunnel goes down. I also tried it the other way (using a PC at the BO, I pinged the hops to the HQ) and I got similar results, EXCEPT I was able to ping all the hops up until 184.108.40.206, at which point I got a mix of “Request timed out” and “TTL expired in transit”. I checked and the 220.127.116.11 hop is a TW router. Do you guys have a router flaking out?
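For anyone who wants to reproduce the hop-by-hop ping test without a wall of ping windows, here is a rough Python sketch of the idea. It is not what I actually ran (I used plain continuous pings in separate windows), and it pings the hops one after another rather than in parallel. The hop addresses in `HOPS` are placeholders, and the function names are made up for this example. It shells out to the system `ping` command, assuming the Windows `-n`/`-w` flags or the Linux `-c`/`-W` flags.

```python
import platform
import subprocess
import time
from datetime import datetime

# Placeholder hop list (assumption): replace with the hops from your own traceroute,
# in path order from the local site toward the remote site.
HOPS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]


def ping_once(host, timeout_s=1):
    """Send one echo request with the system ping command; True if a reply came back."""
    windows = platform.system() == "Windows"
    if windows:
        # -n count, -w timeout in milliseconds
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Linux ping: -c count, -W timeout in seconds
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s + 5)
    except (subprocess.TimeoutExpired, OSError):
        return False
    if windows:
        # Windows ping can exit 0 on "Destination host unreachable",
        # so also require a TTL= field, which only real echo replies carry.
        return result.returncode == 0 and "TTL=" in result.stdout
    return result.returncode == 0


def first_failing_hop(results):
    """Given [(hop, ok), ...] in path order, return the first hop with no reply, or None."""
    for hop, ok in results:
        if not ok:
            return hop
    return None


def monitor(hops, interval_s=1.0, rounds=None, ping=ping_once):
    """Ping every hop once per round; print and collect (timestamp, first failing hop)."""
    events = []
    done = 0
    while rounds is None or done < rounds:
        results = [(hop, ping(hop)) for hop in hops]
        failed = first_failing_hop(results)
        if failed is not None:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} first unresponsive hop: {failed}")
            events.append((stamp, failed))
        done += 1
        time.sleep(interval_s)
    return events
```

Calling `monitor(HOPS)` runs until interrupted and prints a timestamped line each time a hop stops answering, which gives you the kind of record support asks for. One caveat: some routers rate-limit or ignore pings addressed to themselves, so a single silent hop is only meaningful when the hops behind it go quiet at the same moment.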
I called and spent 2 hours on the phone with Spectrum support. The woman said that the engineers refused to speak with me without a traceroute of the incidents. I have since gathered some and sent them and have heard nothing back. Is there anyone there willing to look into this issue for us please?
TBH the outage only lasts ~5 seconds. If we were running something buffered like video, it probably would not be that noticeable. Unfortunately we have stuff going through the tunnel like VOIP and telnet that is extremely susceptible to packet loss, and this is enough to be extremely irritating. It was all working fine until about 3-4 weeks ago when we started having daily trouble. It took me forever to track down where the breakdown was. Once I did and went to Spectrum, I've been kind of blown off.....
I do not think that a keep alive setting on the firewalls is going to help if a hop on the internet stops responding.
Think of the old 2 tin cans and a string. If someone cuts the string in the middle, no amount of fiddling with the cans is going to help.
Well, as soon as the hop starts responding again, the VPN tunnel reconnects immediately.
Bear in mind that the trace that I am posting is not going through the tunnel in any way; it is a straight trace between the two endpoints.