- Cloud: Service provider delivering an application over the internet.
- Client: Business using the Cloud.
- Telco: Service provider operating part of the network infrastructure connecting them.
- elkement: Somebody who always ends up playing intermediary.
Client: Cloud logs us off every so often! We can’t work like this!
elkement: Cloud, what timeouts do you use? Client was only idle for a short break and is logged off.
Cloud: Must be something about your infrastructure – we set the timeout to 1 hour.
Client: It’s becoming worse – Cloud logs us off every few minutes even when we are in the middle of working.
[elkement does a quick test. Yes, it is true.]
elkement: Cloud, what’s going on? Any known issue?
Cloud: No issue on our side. We have thousands of happy clients online. If we had issues, our inboxes would be on fire.
[elkement does more tests. Different computers at Client. Different logon users. Different Client offices. Different speeds of internet connections. Computers at elkement office.]
elkement: It is difficult to reproduce. It seems like it works well for some computers or some locations for some time. But Cloud – we did not have any issues of that kind in the last year. This year the troubles started.
Cloud: The timing of our app is sensitive: If network cards in your computers turn on power saving that might appear as a disconnect to us.
[elkement learns what she never wanted to know about various power saving settings. To no avail.]
Cloud: What about your bandwidth?… Well, that’s really slow. If all people in the office are using that connection we can totally understand why our app sees your users disappearing.
[elkement on a warpath: Tracking down each application eating bandwidth. Learning what she never wanted to know about tuning the background apps, tracking down processes.]
elkement: Cloud, I’ve throttled everything. I am the only person using Client’s computers late at night, and I still encounter these issues.
Cloud: Upgrade the internet connection! Our protocol might choke on a hardly noticeable outage.
[elkement has to agree. The late-night tests were done over a remote connection, so the measurement may have affected the results, as in quantum physics.]
Client: Telco, we buy more internet!
[Telco installs more internet, elkement measures speed. Yeah, fast!]
Client: Nothing has changed, Cloud still kicks us out every few minutes.
elkement: Cloud, I need to badger you again….
Cloud: Check the power saving settings of your firewalls, switches, routers. Again, you are the only one reporting such problems.
[The router is a blackbox operated by Telco]
elkement: Telco, does the router use any power saving features? Could you turn that off?
Telco: No, we don’t use any power saving at all.
[elkement dreams up conspiracy theories: Sometimes performance seems to degrade after business hours. Cloud running backup jobs? Telco’s lines clogged by private users streaming movies? But sometimes it’s working well even in the location with the crappiest internet connection.]
elkement: Telco, we see this weird issue. It’s either Cloud, Client’s infrastructure, or anything in between, e.g. you. Any known issues?
Telco: No, but [proposal of test that would be difficult to do]. Or send us a Wireshark trace.
elkement: … which is what I planned to do anyway…
[elkement on a warpath 2: Sniffing, tracing every process. Turning off all background stuff. Looking at every packet in the trace. Getting to the level where there are no other packets in between the stream of messages between Client’s computers and Cloud’s servers.]
elkement: Cloud, I tracked it down. This is not a timeout. Look at the trace: Server and client communicating nicely, textbook three-way handshake, server says FIN! And no other packet in the way!
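The check elkement describes, finding out which side ends the connection and how long the line had been quiet before that, can be sketched in a few lines. This is only an illustration: the `Packet` record, the `first_closer` helper and the sample trace are hypothetical stand-ins for what one would read off a real Wireshark capture, not data from the actual incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    t: float          # seconds since start of capture
    src: str          # "client" or "server" (hypothetical labels)
    flags: frozenset  # TCP flags set on this packet


def first_closer(packets):
    """Return (src, time, quiet_gap) for the first FIN or RST, or None.

    quiet_gap is the time since the previous packet in the stream; a
    small gap suggests an active close rather than an idle timeout.
    """
    prev_t = None
    for p in sorted(packets, key=lambda p: p.t):
        if "FIN" in p.flags or "RST" in p.flags:
            gap = 0.0 if prev_t is None else p.t - prev_t
            return p.src, p.t, gap
        prev_t = p.t
    return None


# Made-up trace: handshake, one request/response, then the server closes.
trace = [
    Packet(0.00, "client", frozenset({"SYN"})),
    Packet(0.02, "server", frozenset({"SYN", "ACK"})),
    Packet(0.04, "client", frozenset({"ACK"})),
    Packet(0.10, "client", frozenset({"PSH", "ACK"})),
    Packet(0.15, "server", frozenset({"PSH", "ACK"})),
    Packet(0.40, "server", frozenset({"FIN", "ACK"})),
]

result = first_closer(trace)
print(result)
```

In this invented trace the server sends the FIN only a fraction of a second after its own last data packet, which is the kind of evidence that rules out an idle timeout on the client side.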
Cloud: Try to connect to one specific server of ours.
[elkement: Conspiracy theory about load balancers]
elkement: No – erratic as ever. Sometimes we are logged off, sometimes it works with crappy internet. Note that Client could work during vacation last summer with super shaky wireless connections.
[Lots of small changes and tests by elkement and Cloud. No solution yet, but the collaboration is seamless. No politics and no finger-pointing about who is to blame – just work. The thing that keeps you happy as a netadmin / sysadmin in stressful times.]
elkement: Client, there is another interface which has fewer features. I am going to test it…
[elkement: Conspiracy theory about protocols. More night-time testing].
elkement: Client, Other Interface has the same problems.
[elkement on a warpath 3: Testing again with all possible combinations of computers, clients, locations, internet connections. Suddenly a pattern emerges…]
elkement: I see something!! Cloud, I believe it’s user-dependent. Users X and Y are logged off all the time while A and B aren’t.
[elkement scratches head: Why was this so difficult to see? Tests were not that unambiguous until now!]
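One way to make such a pattern jump out earlier is to log every test run with all its parameters and then compare logoff rates per factor. The following is a minimal sketch with invented test records; the field names, the `logoff_rate_by` and `spread` helpers and the results are assumptions for illustration, not elkement's actual notes.

```python
from collections import defaultdict


def logoff_rate_by(factor, tests):
    """Fraction of test runs ending in a logoff, per value of one factor."""
    counts = defaultdict(lambda: [0, 0])  # value -> [logoffs, runs]
    for t in tests:
        c = counts[t[factor]]
        c[0] += int(t["logged_off"])
        c[1] += 1
    return {value: lo / n for value, (lo, n) in counts.items()}


def spread(rates):
    """Difference between the worst and the best logoff rate."""
    return max(rates.values()) - min(rates.values())


# Invented records: users X and Y are always logged off, A and B never.
tests = [
    {"user": "X", "computer": "pc1", "location": "office1", "logged_off": True},
    {"user": "X", "computer": "pc2", "location": "office2", "logged_off": True},
    {"user": "Y", "computer": "pc1", "location": "office2", "logged_off": True},
    {"user": "Y", "computer": "pc2", "location": "office1", "logged_off": True},
    {"user": "A", "computer": "pc1", "location": "office1", "logged_off": False},
    {"user": "A", "computer": "pc2", "location": "office2", "logged_off": False},
    {"user": "B", "computer": "pc1", "location": "office2", "logged_off": False},
    {"user": "B", "computer": "pc2", "location": "office1", "logged_off": False},
]

for factor in ("user", "computer", "location"):
    rates = logoff_rate_by(factor, tests)
    print(factor, rates, "spread:", spread(rates))
```

The factor with the largest spread is the prime suspect. In this tidy example only `user` separates good runs from bad ones; with a second, unrelated problem mixed into the data, the picture would be far less clean.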
Cloud: We’ve created a replacement user – please test.
elkement: Yes – New User works reliably all the time! 🙂
Client: It works – we are not thrown off in the middle of work anymore!
Cloud: Seems that something about the user on our servers is broken – never happened before…
elkement: But wait 😦 it’s not totally OK: Now logged off after 15 minutes of inactivity? But never mind – at least not as bad as logged off every 2 minutes in the middle of some work.
Cloud: Yeah, that could happen – an issue with Add-On Product. But only if your app looks idle to our servers!
elkement: But didn’t you tell us that every timeout ever is no less than 1 hour?
Cloud: No – that 1 hour was another timeout …
elkement: Wow – classic misunderstanding! That’s why it was so difficult to spot the pattern. So we had two completely different problems, but both looked like unwanted logoffs after a brief period, and at the beginning neither was fully reproducible.
[elkement’s theory validated again: If anything qualifies elkement for such stuff at all, it is her experience in the applied physics lab – tracking down the impact of temperature, pressure and 1000 other parameters on the electrical properties of superconductors… and trying to tell artifacts from reproducible behavior.]