Certificates and PKI. The Prequel.

Some public key infrastructures have been running quietly in the background for years. They are half forgotten until the lifetime of a signed file comes to an end – but then everything is on fire. In contrast to other seemingly important deadlines (Management needs this by XY or the world will come to an end!) this deadline really can’t be extended. The time of death has been baked into the signed data all along.

The entire security ‘ecosystem’ has changed while these systems slept in the background. Now we have Let’s Encrypt (I was late to that party), HTTPS is everywhere, and the green padlock as an indicator of a secure site is about to die.

Recently I stumbled upon a whirlwind tour of the history of PKI and SSL/TLS – covering important events in the evolution of standards and technologies, from shipping SSLv2 in Netscape Navigator 1.1 in 1995 to Chrome marking HTTP pages as ‘not secure’ in 2018. Scrolling down the list of years I could not help waxing nostalgic. I had written about PKI at length before, but this time I do what Hollywood directors of blockbusters do – I write a prequel.

I remember the first time I created a Certificate Signing Request (CSR) and submitted it to a Certificate Authority (CA). It was well before 2000, and it was an adventure!

I was a scientist turned freelance IT consultant – I went from looking at Transmission Electron Microscope images to troubleshooting why Outlook did not start on small business owners’ computers. And I was daring enough to give training courses, based on the little I knew (with hindsight) about IT and networking. I also developed some classes from scratch – creating wiki-style training material, using Microsoft FrontPage 1998.

One class was called networking and security or the like, and it was part of a vocational retraining curriculum – to turn former factory workers and admin assistants into computer technicians. For reasons I cannot remember I included a brief explanation of the RSA algorithm in my clunky FrontPage site. It was maybe a pretext to justify an exciting lab exercise: As the PKI history timeline shows, SSL was still rather new. Press releases by Austrian IT companies highlighted the military-grade protection from eavesdropping. It felt like Star Trek. One of the early Austrian National CAs offered ‘light’ test certificates. The description of the enrollment process was targeted at business users, but it was pure geek speak: a mysterious multi-step procedure explained in hacker terms like Secure Vault.

I don’t remember if my students found it that exciting, or if the test enrollment of lots of certificates simultaneously worked that well at all. But I was hooked.

As a freelancer I started working with my former colleagues again – supporting the scientists in subverting, er, re-interpreting the central IT department’s policies by setting up their own server, or by circumventing the firewall by dialing in to their own modem. These were the days of IT hype in the late 1990s, before the dotcom bust. The research center had a new CEO with an IT background, and to get your projects approved you had to tack the label virtual onto anything. So I helped with creating a Virtual Materials Science Lab – which meant we used Microsoft Netmeeting.

Despite or because of such activities I also started working for the IT department. It was the time when The Cluetrain Manifesto told us that hyperlinks were subversive. As a ‘manager’ I should have disciplined shadow IT admins purchasing their own domains and running their shadow servers, but I could not stop tinkering with the web servers myself. It was also the time when I learned that to make things work in larger organizations – or a combination of several of those – you often need to social engineer someone.

We needed an SSL certificate – and I was the super qualified person for that task, based on my experience playing with the Secure Vault. But creating and submitting the CSR, and installing the certificate, was the easy part. There were unexpected challenges:

The research center had a long legal name – 65 characters including the final dot in the abbreviation indicating the legal form of the entity. Common Names in X.509 certificates are limited to 64 characters, so I could not enter the final dot in IIS’s (Internet Information Server’s) wizard for CSRs. The legal name was cross-checked against the Dun&Bradstreet database. One would think that the first 64 characters of a peculiar German name would have been sufficient, but no. It took several phone calls – and faxes! – to prove to the US-based CA company that we were who we claimed to be.
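Just for illustration, this is roughly how one would generate such a CSR today with the Python cryptography package instead of the IIS wizard – the key size, the hash, and the legal name below are placeholders, not the original values – with the name truncated to stay within RFC 5280’s 64-character bound:

  from cryptography import x509
  from cryptography.x509.oid import NameOID
  from cryptography.hazmat.primitives import hashes, serialization
  from cryptography.hazmat.primitives.asymmetric import rsa

  legal_name = "Some Very Long Research Center Forschungsgesellschaft m.b.H."  # placeholder

  key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
  csr = (
      x509.CertificateSigningRequestBuilder()
      .subject_name(x509.Name([
          # RFC 5280 caps the CommonName at 64 characters, hence the truncation
          x509.NameAttribute(NameOID.COMMON_NAME, legal_name[:64]),
      ]))
      .sign(key, hashes.SHA256())
  )

  with open("request.csr", "wb") as f:
      f.write(csr.public_bytes(serialization.Encoding.PEM))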

The fact that I called a CA company in the US might highlight a mistake: if I recall correctly, Big CA had partners in Europe already at that time, but I missed that, or I wanted to talk to the mothership for some reason.

To purchase the certificate from the US-based company you needed a credit card, to be entered exactly when you submitted the CSR. This process disrupted the usual purchasing procedures, and I had to social engineer somebody from the procurement department into joining me in my adventure, bringing the corporate credit card.

The research center was a company 51% owned by the government – so you had SAP and insane management deadlines as well as conferences and academic publication records. The Internet in general was still dominated by its academic roots. Not long before, there had been a single web page listing All WWW servers in Austria, and that page was run by the academic internet backbone. Domain registration data were tied to a person, to the wrong person, or to the wrong entity – which came back to bite you later.

Fortunately the domain assigned to the SSL certificate belonged to us – so I did not have to social engineer a DNS admin this time. But it was assigned to a person and not to the organization. The person was an employee in charge of the network, but how do you prove that? More faxes and phone calls were required to sort out the fine legal points.

I did not keep records of that period, so I don’t know if this web server is still alive or if at least the domain still exists. Probably unlikely, given the rapid decay of rotting links. But while researching history for this post – randomly googling for early versions of Microsoft’s web servers – I discovered interesting things. There is a small chance it may be alive!

The first version of the Windows Certificate Authority had been released as an add-on to Windows NT 4, as part of the so-called Windows NT 4 Option Pack – the same add-on that also contained the web server (IIS) itself. It was the time when I learned ASP programming by going online via dial-up and browsing through MSDN as quickly as possible so as not to overspend my precious monthly online time.

I wanted to relive the setup of Internet Information Server 4.0 and the Option Pack – and found lots of support articles and how-to’s, like this one.

However, I also found live websites like this:

This is only the setup CD, so no danger yet, but you can also find sites with the welcome page of an operating web server online – including sample ASP applications – which I deliberately don’t show here. (Image credits: Microsoft.)

I wonder why I had been frantically re-developing my websites in ASP.NET from scratch – ‘just because’ ASP was outdated technology, even though there were no known vulnerabilities and the sites were running on a modern operating system.

Time to quote from Peter Gutmann’s book Engineering Security:

A great many of today’s security technologies are “secure” only because no-one has ever bothered attacking them.

… which is also true for yesterday’s technology still online!

Looking for Patterns

Scott Adams, of Dilbert fame, has a lot of useful advice in his autobiographical book How to Fail at Almost Everything and Still Win Big. He recommends looking for patterns in your life, without attempting to theorize about cause and effect. By learning from those patterns you could increase the chance that luck will hit you. I believe in increasing your options, so I can relate a lot to applying this approach to Life, the Universe and Everything.

It should be true in relation to the iconic example of patterns, that is: web traffic. In this post I’ll try to briefly summarize what I have learned so far from recent unfortunate events (this is PR speak for disaster). I have been intrigued by web statistics, web servers’ log files, and the summaries shown by the free Google and Bing webmaster tools for a long time, but I started to follow the trends more closely after my other, non-WordPress web server was hacked at the end of November.

How do you recognize that your site has been hacked?

This is very different from what you might expect from popular lore and movies. I downloaded the log files for my web server from time to time, and I simply noticed that suddenly the size of the daily files was about twice the usual. Inspecting the IP addresses the traffic to my site came from, I spotted a lot of hits by the Google bot. Sites are indexed all the time, but I was baffled by the URLs – all pointing to pages that should not exist on my server. These URLs contained a long query string with all kinds of brand names, as you know them from spam comments or e-mails.

This is an example line in the log file:

Spammy page on hacked web server, accessed by Google bot

This IP address belongs to a *.googlebot.com machine, as can be confirmed by resolving the name, e.g. using nslookup. The worrying fact was the status code 200, which means the page had indeed been there.
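That check can also be scripted. Here is a minimal sketch in Python: it resolves a claimed Googlebot address back to a host name and then forward again, and only accepts it if the host is under googlebot.com or google.com and maps back to the same IP. The sample address at the end is merely an illustration from Googlebot’s published range.

  import socket

  def is_googlebot(ip):
      """Verify a claimed Googlebot IP via reverse and forward DNS lookup."""
      try:
          host, _, _ = socket.gethostbyaddr(ip)   # e.g. crawl-66-249-66-1.googlebot.com
          if not host.endswith((".googlebot.com", ".google.com")):
              return False
          # The forward lookup must map back to the original address.
          return ip in socket.gethostbyname_ex(host)[2]
      except (socket.herror, socket.gaierror):
          return False

  print(is_googlebot("66.249.66.1"))  # sample address, replace with the IP from the log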

A few days later this had changed to a 404, so the page did not exist anymore:

Spammy page removed from hacked web server, Google bot tries to access it.

The attack had happened on the weekend, and the pages had been removed immediately by my hosting provider.

To cross-check if those pages had indeed been indexed by Google, I searched for site:[domain name]. This is a snippet from the search results – the spammers even borrowed the tag line of our legitimate site as a description (which I cropped from the screenshot here).

spammy-page-in-google-index

Overall these were just a bunch of different pages (ASP files), but Google recognizes every different query string, appended after the question mark, as a different URL. So suddenly Google had a lot more URLs to index, and you could see a spike in Webmaster Tools:

Crawl stats after hack

There was also a warning message on the welcome page:

Google warning message about 404 errors

What to do?

Obviously the first thing is to delete the spammy pages and deal with whatever vulnerability had been exploited. This was done before I noticed the hack myself. But I am still in clean-up mode to get the spammy pages removed from Google’s index:

robots.txt. Using the site:[domain name] search I identified all the spammy pages and added them to the robots.txt file on my server. This file tells search engines which pages not to crawl (and thus, usually, not to index). Fortunately you do not have to add each individual URL – adding the page (ending in .asp in this case) is sufficient.
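A minimal entry looks like the sketch below; spammy-page.asp is a hypothetical name standing in for the injected ASP pages. Since Disallow matches the beginning of the path, one rule covers all query-string variants of that page:

  User-agent: *
  Disallow: /spammy-page.asp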

But pages were still in the index after that, just the description was changed to:
A description for this result is not available because of this site’s robots.txt.

As far as I can tell, entries are still added to the index if somebody else links to your pages (actually, spammy pages on other hacked servers, see root cause analysis below). But as Google is not allowed to investigate the target as per robots.txt, it only adds the link without a description.

URL parameters. Since the spammy pages all use query strings, and all strings have the same parameter – [page].asp?dca= in my case – I tried managing the URL parameters via Webmaster Tools. This is actually an option to let Google know whether a query string should really denote another version of a page or whether all query strings for one page should be indexed as a single page. E.g. I am using a query string called imgClicked to magnify an image when clicking on the top image, and I could tell Google that the clicked / unclicked image should not be counted as different URLs.

In the special case of the spammy pages I tried to tell Google that different dca values don’t make for a separate page (which would result in about 6 spammy URLs in the index instead of 1500) but this did not impact the gradual accumulation of indexed spammy pages.
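To illustrate why a handful of ASP files ballooned into about 1500 indexed URLs: every distinct query string counts as a distinct URL, while stripping the query string collapses them back to the few underlying pages. A minimal sketch with made-up URLs and parameter values:

  from urllib.parse import urlsplit

  # Hypothetical spammy URLs as Google sees them: the same pages with varying 'dca' values.
  urls = [
      "http://example.org/page.asp?dca=brand-a",
      "http://example.org/page.asp?dca=brand-b",
      "http://example.org/other.asp?dca=brand-c",
  ]

  print(len(set(urls)))                         # 3 distinct URLs in the index
  print(len({urlsplit(u).path for u in urls}))  # 2 underlying pages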

Mind-numbing work. To get rid of all pages as fast as possible I also removed each. of. them. manually. via Google Webmaster Tools. This means:

  • Click on the URL from the search results, opening a new tab. This results in a 404.
  • Copy the URL from the address bar to web master tools in the form for removing the URL.
  • Click submit.
  • Repeat 1500 times.

I am now at about 500. Not all spammy pages that ever existed are displayed at once in the index, but about 10 are added every day. Where do they come from after the original pages had been deleted?

How was this hack actually supposed to work?

The legitimate pages had not been changed or vandalized; the hacker-spammers just placed additional pages on the server. I would never have noticed them had I not encountered Google’s indexing activities.

I was curious what those pages had looked like, so I inspected Google’s cache by searching for cache:[spammy URL]. The cached page consisted of:

  • Your typical spammy junk text – otherwise I would have been delighted about raw material for poetry.
  • A list of links to other spammy pages, most of them on my hacked server.
  • An exact copy of the default page of this (legitimate) web site.

I haven’t investigated all of those more than 1000 pages and the spammy links displayed on them, but I conjectured there had to be some outbound links to other – hacked – servers. Links are only boosted if there are backlinks from seemingly independent web sites. Somehow this should make people buy something in a shady webshop at the end of a cascade of links.

After some weeks I was able to confirm this, as Google Webmaster Tools now show external backlinks to my domain from other spammy pages on legitimate sites, mostly small businesses in the US. Many of them used the same hosting provider, which obviously had been hacked as well.

This explains where the gradual supply of spammy links to the index comes from: Google has followed the spammy links from the other hacked servers inbound to my server. It seems to take a while to clean this out, even though all the other webmasters have removed their pages as well – I checked each. of. them. from the long list supplied by Google as a CSV file.
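That check could be scripted along these lines, assuming the Webmaster Tools export is saved as links.csv with one URL per row (both the file name and the layout are assumptions):

  import csv
  import urllib.error
  import urllib.request

  # Check whether the exported backlink URLs still respond, i.e. whether the
  # other webmasters have already removed their spammy pages.
  with open("links.csv", newline="") as f:
      for row in csv.reader(f):
          if not row or not row[0].startswith("http"):
              continue  # skip empty rows and a possible header line
          url = row[0]
          try:
              status = urllib.request.urlopen(url, timeout=10).status
          except urllib.error.HTTPError as e:
              status = e.code   # 404 or 410 means the spammy page is gone
          except urllib.error.URLError:
              status = None     # DNS failure, timeout, connection refused
          print(status, url)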

Had I not been hacked, I might never have become aware of the completely unrelated onslaught by Google itself, targeted at this blog. I reported on this in detail previously; here is just an update and a summary.

Edit, as from the comments I conclude this was not clear: The following analysis is unrelated to the hack of the non-WordPress site – the hacked site has not been penalized by Google so far. But the blog you are reading right now has.

Symptoms of your site having been penalized by a search engine

Rapid decline of impressions. Webmaster Tools show a period of 3 months maximum. I have checked the trend for all my sites now and then, but there was actually never anything that constituted a real trend. For this blog, however, page impressions went from a few hundred – often more than 1000 – per day this summer to less than 10 per day now.

Page impressions Sept to Dec

Page impressions have stayed at their all-time low since last time, so just extend that graph to the right.

Comparison with sites that should rank much lower. Currently this blog has as many or as few impressions as my personal website e-stangl.at. Its Google pagerank is 1 – as compared to 3 for the WordPress blog; I only update it every quarter at most, and its word count is perhaps a thousandth of this blog’s.

My other two sites subversiv.at and radices.net score better although I update them only about once every 6 weeks, and I am pretty sure I violate best practices due to my creative mixing of languages, commenting on my own stuff, and/or curating enormous lists of outbound links.

It is ironic that Google has penalized this blog now, as since autumn 2014 my quality control has become more ruthless. I had quite a number of posts in Drafts, with more than 1000 words each, edited and spell-checked – and finally deleted all of them. The remaining posts were the ones requiring considerable research, plus my poetry. This spam poem is one of my most popular posts as measured by Google’s page impressions. So all theorizing is really futile, and I had better just watch the pattern emerge.

Identifying offending pages. I added an update to the previous post as I spotted the offending pages using the following method:

  • Identify your top performing pages by ranking pages in the list of search results by impressions or clicks.
  • Then order pages in the list of search results by page name. This is effectively ranking by date for blogs, and the list can be compared to the archive of all pages.
  • Make the time span covered by the Google tools smaller and smaller and check if one of your former top pages suddenly vanishes from the list.

In my case these pages were:

  • A review of a new, a bit unconventional, textbook on quantum field theory and
  • a list of physics books, blogs and websites.

As a reader correctly pointed out, this does not mean that the page has been deleted from the index – as you can confirm by searching for site:[Offending URL] explicitly or by adding a more specific search criterion, like adding elkement. I found that the results displayed for my offending pages are erratic: sometimes, surprisingly, the page will still show up if I just use the title of the post; perhaps a consequence of me, the owner of the site, being logged on to Google. Sometimes I need to add an additional keyword to move it to the top of the search results again.

But anyway, even if the pages had not been deleted, they had been pushed back to search results page >10.

Something had been deleted from the index though. Here is the number of indexed pages over time, showing a decline starting at the time impressions were plummeting, too:

Pages indexed by Google for this blog, as per writing of this post

I cannot see a similar effect for any of the other sites, and as far as I know it does not correlate with some Google update (Google has indicated a major update in March 2014 in the figure).

Find the root cause. Apart from links on my own sites, and links on other blogs, my blog has no backlinks. As I learned in this research, backlinks from forums are often tagged nofollow so that search engines do not consider them spammy. This means links from your avatar commenting on other pages might not boost your blog, but might not hurt either.
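In HTML terms such a link simply carries an extra rel attribute – a generic example, not taken from any particular forum:

  <a href="http://example.org/my-blog" rel="nofollow">my avatar’s link</a>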

The only ‘worthy’ backlink was from the page dedicated to that book I had reviewed – and that page linked exactly to the offending pages. My blog and the author’s page may look to Google like the tangle of cross-linked spammy pages hackers had misused my other web server for.

Do something about it? Conclusion? I replaced some of my links to the author’s site with a link to the book’s page on amazon.com. I moved one of the offending pages, the physics link list, over to radices.net – as I had planned to do for quite a while in my eternal quest for tidy, consistent web sites. The page is still available on this blog, but not visible in the menu anymore.

But I will not ask the author to remove a valid backlink or remove my innocuous post – that would feel like succumbing to the rules of a silly game.

What I learned from this episode is that one single page – perhaps one you don’t even consider important on the grand scale of things and your blog in particular – can boost a blog or drag it down. Which pages are the chosen ones is beyond unpredictable.

Ending on a more positive note, I currently encounter the boost effect for our German blog, as we indulge in writing about the configuration of this gadget, the programmable control unit we use with our heat pump system. The device is very popular among ambitious DIY enthusiasts, and readers are obviously searching for it.

Programmable control unit

We are often linking to the vendor’s business page and manuals. I hope they will never link back to us.

I will just keep watching the patterns and reporting on my encounters. One of the next enigmas to be resolved: Why is the number of Google searches in my WordPress Stats much higher than the number of page impressions in Google Tools for that day, let alone clicks in Google Tools?

Update 2015-01-23: The answer was embarrassingly simple, and all my paranoia had been misguided. WordPress had migrated their hosted blogs to HTTPS only. All my traffic was hiding in the statistics for the https version, which has to be added in Google Webmaster Tools as a separate website.