Tuesday, April 1, 2014

This is what happens to the Internet when a fibre transit cable gets cut.

I came home from work today and sat down to check my Facebook messages and immediately noticed something strange. The shapes and formatting of the page would load but none of the images or text were loading. 





At first I thought my WiFi card might be having issues with the access point, since it's an old D-Link, so I opened a ping to Google to see what was going on. I then thought perhaps someone was running a torrent, because the pings were in the hundreds of milliseconds, peaking at 300-400 ms. But then the ping would drop down to 59 ms for a few seconds, lose a few packets and jump to 200 ms again. This is not what happens when someone is saturating the connection with a torrent or two; in that case the ping would have stayed more consistently high.
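The reasoning above can be sketched in a few lines of Python. This is only a rough illustration, not a real diagnostic tool: the function names and the thresholds (`high_ms`, `spread_ms`, `loss_limit`) are my own assumptions, and the sample numbers are made up to resemble what I saw rather than copied from a real capture.

```python
from statistics import mean, pstdev


def summarize_pings(rtts_ms):
    """Summarize round-trip times in ms; None marks a lost packet."""
    replies = [r for r in rtts_ms if r is not None]
    loss = 1 - len(replies) / len(rtts_ms) if rtts_ms else 0.0
    return {
        "loss": loss,
        "mean_ms": mean(replies) if replies else None,
        "spread_ms": pstdev(replies) if replies else None,
    }


def guess_cause(rtts_ms, high_ms=150, spread_ms=50, loss_limit=0.02):
    """Very rough guess; all three thresholds are hypothetical."""
    s = summarize_pings(rtts_ms)
    if s["mean_ms"] is None:
        return "no replies"
    # Packet loss or wildly swinging latency points away from simple saturation.
    if s["loss"] > loss_limit or s["spread_ms"] > spread_ms:
        return "erratic: looks like an upstream problem"
    # High but steady latency is what a saturated home link tends to look like.
    if s["mean_ms"] > high_ms:
        return "steadily high: looks like a saturated link"
    return "normal"


# Illustrative samples only (None = lost packet).
torrent_like = [310, 320, 305, 315, 300, 325]
outage_like = [59, 350, None, 60, 210, None, 400, 58]

print(guess_cause(torrent_like))
print(guess_cause(outage_like))
```

The point is just that "consistently high" and "all over the place with dropped packets" are different patterns, and the second one is a hint to look beyond your own network.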

My next thought was to check http://www.internethealthreport.com/ for the ping times between all the tier one ISPs. This site shows a breakdown of the ping times between the largest fibre optic backbone providers, whose links carry tens or hundreds of gigabits per second. The bulk of evening traffic in North America was just starting, as I live in the Eastern time zone and work a normal 9-5 schedule. What I saw showed that something was definitely wrong, and it wasn’t my computer or my ISP.

Then Netflix went down and there wasn’t much else to do but figure out what was going wrong. I also decided to pick my own hashtag so I could group all my tweets more easily in the future.



Right away you can see that the latency is starting to yellow-line and the packet loss has significantly red-lined. This usually means something has crashed or broken, or a cable has been cut, or it could even be a symptom of a DDoS attack. But how do you figure this out for sure when you don’t work for any of those companies?

Before long, to my surprise, Netflix customer service tweeted back with a troubleshooting guide. (Yeah, I’m a bit new to Twitter... leave me alone.) Now I felt obligated to tell Netflix that it wasn’t their fault, since they were so nice to try to help, and that it was just an IP transit issue or something.

I went back to check the ping times and saw that another network had started dropping packets from Level 3.

I started searching for tools that had error reports for Canada to see if it was localized and found one, http://canadianoutages.com/, which showed a spike across every big Internet name I knew of, such as Facebook, YouTube, PSN, Google, Xbox Live, and more. At the time of this writing one can still see the spike in outages and issues across the board on that site.


I wanted to make a call to tell my boss we might be having issues tomorrow, but when I tried to load my app on my phone it was unable to sync with the network. I tried calling my own number from a landline and it rang through, but I couldn't be sure it was really working, so I decided to use the landline instead.

I thought perhaps Bell Canada's Internet Tech department might be able to tell me if they had been notified of any fibre cuts or lines going down, but I got the usual "if you can't give me an account or anything I'm not allowed to tell you anything", which in this case I guess is good business practice but annoying nonetheless. I tried calling TekSavvy after not finding much on their website (nothing had been posted yet), but hung up while on hold when I realized they must be swamped with calls from actual customers and I shouldn't bother them. I decided I'd try more twittering...


I washed a few dishes, then went back to Twitter. I noticed RingCentral had replied, and I let them know too that something was wrong with a main peering hub.





TekSavvy must have noticed the hashtag usage somehow and replied.

I sent them a message and they referred me to a forum thread they had just started as a staging ground for information as it developed. At this point it looked more like a fibre cut than anything else.

http://www.dslreports.com/forum/r29144871-Slow-Service
At the time of the message there was one reply; at the time of writing there were 13 pages.

It wasn't long after this that the CEO of TekSavvy posted on the thread and confirmed that it was indeed a fibre cable cut, from Hurricane Electric, affecting 100 gigabits of fibre. That was effectively half their upstream network, and the same cut hit other ISPs in Ontario, Quebec and beyond, enough to congest networks across North America.

Marc, the CEO, then posted a link to the details of exactly what happened:

Incident
Beauharnois
In progress
0%

We have a fiber cut between Newark < > Beauharnois.

We contacted LEVEL3. 
http://status.ovh.net/?do=details&id=6629
http://status.ovh.net/?do=details&id=6632

And that's one of my favourite ways to spend the evening. I talked to a dozen different people in half a dozen different countries and have some new tools to use in the future. It was great fun :D


**update

Tuesday, 01 April 2014, 09:43AM
The cut in the fibre is at the New York exit towards Albany
and it's affecting multiple providers and services
- Level3 NWK / BHS (10x10G)
- Level3 NWK / MTL (2x10G)
- Hibernia NWK / MTL (2x10G)
- Telia NWK / CHI (6x10G)
- Telia NWK / PAL (2x10G)
- Level3 Paris / New York / BHS (4x10G)
Tuesday, 01 April 2014, 09:51AM
The provider has reported that the fault has been located, and was determined to have been caused by road construction. It appears that the 120-count fiber was damaged by a new post being installed during the repairs to a highway guardrail. Splice crews remain en route to the area; an estimated arrival time has not yet been provided.
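
To get a feel for how much capacity that 09:43 update adds up to, here is a small Python sketch that totals the listed links. I'm assuming that an entry like "10x10G" means ten links of 10 gigabits per second each; the function and variable names are just mine for illustration.

```python
import re
from collections import defaultdict

# Entries copied from the 09:43 update above.
AFFECTED_LINKS = [
    "Level3 NWK / BHS (10x10G)",
    "Level3 NWK / MTL (2x10G)",
    "Hibernia NWK / MTL (2x10G)",
    "Telia NWK / CHI (6x10G)",
    "Telia NWK / PAL (2x10G)",
    "Level3 Paris / New York / BHS (4x10G)",
]


def capacity_gbps(entry):
    """Parse '(NxMG)' as N links of M Gbit/s (an assumption about the notation)."""
    match = re.search(r"\((\d+)x(\d+)G\)", entry)
    if match is None:
        raise ValueError(f"no capacity found in: {entry!r}")
    return int(match.group(1)) * int(match.group(2))


# Group by provider, taking the first word of each entry as the provider name.
per_provider = defaultdict(int)
for entry in AFFECTED_LINKS:
    per_provider[entry.split()[0]] += capacity_gbps(entry)

total_gbps = sum(per_provider.values())

for provider, gbps in per_provider.items():
    print(f"{provider}: {gbps} Gbit/s")
print(f"Total affected: {total_gbps} Gbit/s")
```

Under that reading, the first Level3 entry alone is 100 Gbit/s, which lines up with the 100 gigabits mentioned on the forum thread.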


*note
I am a self-taught enthusiast in this subject and know only as much about OC-192 and 40GigE as I have read on my own. I do hold a CompTIA A+ and work in the IT industry, but I'm not a professional expert on fibre transit lines by any means. Please feel free to submit constructive criticism; I'm always looking to learn more.