Looking for Patterns

Scott Adams, of Dilbert fame, has a lot of useful advice in his autobiographical book How to Fail at Almost Everything and Still Win Big. He recommends looking for patterns in your life, without attempting to theorize about cause and effect. By learning from those patterns you could increase the chance that luck will hit you. I believe in increasing your options, so I can relate a lot to applying this approach to Life, the Universe and Everything.

This should also hold for the iconic example of patterns: web traffic. In this post I’ll try to briefly summarize what I have learned so far from the most recent unfortunate events (this is PR speak for disaster). I have always been intrigued by web statistics, web servers’ log files, and the summaries shown by the free Google or Bing Webmaster Tools, but I started to follow the trends more closely after my other, non-WordPress web server had been hacked at the end of November.

How do you recognize that your site has been hacked?

This is very different from what you might expect from popular lore and movies. I downloaded the log files for my web server from time to time, and I just noticed that the daily files had suddenly grown to about twice their usual size. Inspecting the IP addresses the traffic to my site came from, I spotted a lot of hits by Google bot. Sites are indexed all the time, but I was baffled by the URLs – all pointing to pages that should not exist on my server. These URLs contained a long query string with all kinds of brand names, as you know them from spam comments or e-mails.
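The "twice as large as usual" observation can be turned into a simple check. The following is only a minimal sketch: the function name, the seven-day window, and the sample sizes are all assumptions made for illustration, not values from my actual logs.

```python
from statistics import median


def flag_size_spikes(daily_sizes, factor=2.0, window=7):
    """Return the days whose log size is at least `factor` times the
    median size of the preceding `window` days.

    `daily_sizes` is a list of (day_label, size) tuples in date order.
    """
    flagged = []
    for i in range(window, len(daily_sizes)):
        day, size = daily_sizes[i]
        baseline = median(s for _, s in daily_sizes[i - window:i])
        if baseline > 0 and size >= factor * baseline:
            flagged.append(day)
    return flagged


# Hypothetical sizes in kilobytes, for illustration only.
sizes = [("day-%02d" % d, 100 + d) for d in range(1, 8)] + [("day-08", 230)]
print(flag_size_spikes(sizes))
```

Comparing against the median rather than the mean keeps a single earlier outlier from hiding the next one.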

This is an example line in the log file:

Spammy page on hacked web server, accessed by Google bot

This IP address belongs to a *.googlebot.com machine, as can be confirmed by resolving the name, e.g. using nslookup. The worrying fact was the status code 200, which means the page had indeed been there.

A few days later this had changed to a 404, so the page did not exist anymore:

Spammy page removed from hacked web server, Google bot tries to access it.

The attack had happened over the weekend, and the pages were removed immediately by my hosting provider.
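The name resolution mentioned above can also be scripted instead of running nslookup by hand. This is a hedged sketch, not a definitive check: it assumes that a genuine crawler address reverse-resolves to a *.googlebot.com name and that this name resolves forward to the same address again. The function name is hypothetical, and the two resolver parameters exist only so that the logic can be tried without network access.

```python
import socket

# Suffix taken from the host name pattern mentioned in the post.
GOOGLEBOT_SUFFIX = ".googlebot.com"


def is_googlebot(ip, reverse=socket.gethostbyaddr,
                 forward=socket.gethostbyname_ex):
    """Reverse-resolve `ip`, check the host name suffix, then
    forward-resolve that name and confirm it maps back to `ip`.

    Note: socket.gethostbyname_ex only handles IPv4 addresses.
    """
    try:
        host = reverse(ip)[0]
        if not host.lower().endswith(GOOGLEBOT_SUFFIX):
            return False
        return ip in forward(host)[2]
    except OSError:
        # Covers failed lookups (socket.herror, socket.gaierror).
        return False
```

The forward lookup matters because anybody who controls the reverse DNS zone for their own address range can make it claim an arbitrary host name.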

To cross-check whether those pages had indeed been indexed by Google, I searched for site:[domain name]. This is a snippet from the search results – the spammers even borrowed the tag line of our legitimate site as a description (which I cropped from the screenshot here).

spammy-page-in-google-index

Overall these were just a handful of different pages (ASP files), but Google treats every different query string, appended after the question mark, as a different URL. So suddenly Google had a lot more URLs to index, and you could see a spike in Webmaster Tools:

Crawl stats after hack

There was also a warning message on the welcome page:

Google warning message about 404 errors

What to do?

Obviously the first thing is to delete the spammy pages and deal with whatever vulnerability had been exploited. This was done before I noticed the hack myself. But I am still in clean-up mode to get the spammy pages removed from Google’s index:

robots.txt. Using the site:[domain name] search I identified all the spammy pages and added them to the robots.txt file on my server. This file tells search engines which pages not to crawl. Fortunately you do not have to add each individual URL – adding the page (ending in .asp in this case) is sufficient.

But pages were still in the index after that, just the description was changed to:
A description for this result is not available because of this site’s robots.txt.

As far as I can tell, entries are still added to the index if somebody else links to your pages (actually, spammy pages on other hacked servers, see root cause analysis below). But as Google is not allowed to investigate the target as per robots.txt, it only adds the link without a description.
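That a single entry per page covers all of its query-string variants can be checked with Python's standard robots.txt parser. This is only a sketch: the page name spam-page.asp and the domain example.com are placeholders, as are the brand values, and the parser in the standard library is not necessarily identical to what Google itself uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule; the real page names are not listed in this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /spam-page.asp
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Rules are matched as path prefixes, so the query string is covered too.
spam_ok = parser.can_fetch(
    "Googlebot", "http://example.com/spam-page.asp?dca=some-brand")
home_ok = parser.can_fetch("Googlebot", "http://example.com/index.html")
print("spam page may be fetched:", spam_ok)
print("home page may be fetched:", home_ok)
```

Note that this only tests whether a crawler may fetch the URL. As described above, a blocked URL can still show up in the index if other pages link to it.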

URL parameters. Since the spammy pages all use query strings and all strings have the same parameter – [page].asp?dca= in my case – I tried managing the URL parameters via Webmaster Tools. This is an option to let Google know whether a query string really denotes another version of a page or whether all query strings for one page should be indexed as a single page. E.g. I am using a query string called imgClicked to magnify an image here when the top image is clicked, and I could tell Google that the clicked / unclicked image should not be counted as different URLs.

In the special case of the spammy pages I tried to tell Google that different dca values don’t make for a separate page (which would result in about 6 spammy URLs in the index instead of 1500) but this did not impact the gradual accumulation of indexed spammy pages.
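To see what "different dca values don’t make for a separate page" means in terms of URLs, here is a small sketch that drops the parameter and counts what is left. The function name and the sample URLs are made up for illustration; only the parameter name dca is taken from my case.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, ignored=("dca",)):
    """Return `url` without the query parameters listed in `ignored`."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in ignored]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/a.asp?dca=brand-one",
    "http://example.com/a.asp?dca=brand-two",
    "http://example.com/b.asp?dca=brand-one",
]
unique_pages = {canonical_url(u) for u in urls}
print(len(urls), "URLs collapse to", len(unique_pages), "pages")
```

Parameters that are not in the ignore list, such as imgClicked, are kept, so a legitimate page variant would still count as its own URL.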

Mind-numbing work. To get rid of all pages as fast as possible I also removed each. of. them. manually. via Google Webmaster Tools. This means:

  • Click on the URL from the search results, opening a new tab. This results in a 404.
  • Copy the URL from the address bar to the form for removing URLs in Webmaster Tools.
  • Click submit.
  • Repeat 1500 times.

I am now at about 500. Not all spammy pages that ever existed are displayed at once in the index, but about 10 are added every day. Where do they come from after the original pages had been deleted?

How was this hack actually supposed to work?

The legitimate pages had not been changed or vandalized, but the hacker-spammers just placed additional pages on the server. I would never have noticed them had I not encountered Google’s indexing activities.

I was curious what those pages had looked like, and I inspected Google’s cache by searching for cache:[spammy URL]. The cached page consisted of:

  • Your typical junk of spammy text, otherwise I would be delighted about raw material for poetry.
  • A list of links to other spammy pages, most of them on my hacked server
  • An exact copy of the default page of this (legitimate) web site.

I haven’t investigated all of those more than 1000 pages and the spammy links displayed on them, but I conjectured there have to be some outbound links to other – hacked – servers. Links will only be boosted if there are backlinks from seemingly independent web sites. Somehow this should make people buy something in a shady webshop at the end of a cascade of links.

After some weeks I was able to confirm this, as Google Webmaster Tools now shows external backlinks to my domain from other spammy pages on legitimate sites, mostly small businesses in the US. Many of them used the same provider, which obviously had been hacked as well.

This explains where the gradual supply of spammy links to the index comes from: Google has followed the spammy links from the other hacked servers inbound to my server. It seems to take a while to clean this out, as all the other webmasters have removed their pages as well – I checked each. of. them. from the long list supplied by Google as a CSV file.

Had I not been hacked, I might never have become aware of the completely unrelated onslaught by Google itself, targeted at this blog. I reported on this in detail previously; here is just an update and a summary.

Edit, as I conclude from the comments that this was not clear: The following analysis is unrelated to the hack of the non-WordPress site – the hacked site had not been penalized so far by Google. But the blog you are reading right now was.

Symptoms of your site having been penalized by a search engine

Rapid decline of impressions. Webmaster tools show a period of 3 months maximum. I have checked the trend for all my sites now and then, but there was actually never anything that constituted a real trend. But for this blog page impressions went from a few hundred, often more than 1000 per day this summer to less than 10 per day now.

Page impressions Sept to Dec

Page impressions have stayed at their all-time low since last time, so just extend that graph to the right.

Comparison with sites that should rank much lower. Currently this blog has as many or as few impressions as my personal website e-stangl.at. Its Google pagerank is 1 – as compared to 3 for the WordPress blog; I update it once per quarter at most, and its word count is perhaps a thousandth of this blog’s.

My other two sites subversiv.at and radices.net score better although I update them only about once every 6 weeks, and I am pretty sure I violate best practices due to my creative mixing of languages, commenting on my own stuff, and/or curating enormous lists of outbound links.

It is ironic that Google has penalized this blog now, as my quality control has become more ruthless since autumn 2014. I had quite a number of posts in Drafts, with more than 1000 words each, edited, and spell-checked – and finally deleted all of them. The remaining posts were the ones requiring considerable research plus my poetry. This spam poem is one of my most popular posts according to Google’s page impressions. So all theorizing is really futile and I had better watch the pattern emerge.

Identifying offending pages. I added an update to the previous post as I spotted the offending pages using the following method:

  • Identify your top performing pages by ranking pages in the list of search results by impressions or clicks.
  • Then order pages in the list of search results by page name. This is effectively ranking by date for blogs, and the list can be compared to the archive of all pages.
  • Make the time span covered by the Google tools smaller and smaller and check if one of your former top pages is suddenly vanishing from the list.
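The comparison in the steps above can be sketched in a few lines if the page lists for two time spans are available as plain data, e.g. copied from the tools by hand. The function name and all numbers below are invented for illustration; this does not use any Google interface.

```python
def vanished_top_pages(earlier, later, top_n=10):
    """Return former top pages that no longer appear in the later period.

    `earlier` and `later` map page URL -> impressions for two time spans.
    """
    top = sorted(earlier, key=earlier.get, reverse=True)[:top_n]
    return [page for page in top if later.get(page, 0) == 0]


# Hypothetical numbers, for illustration only.
autumn = {"/book-review": 900, "/physics-links": 700, "/poem": 300, "/about": 5}
winter = {"/poem": 4, "/about": 3}
print(vanished_top_pages(autumn, winter, top_n=3))
```

Shrinking the later time span step by step, as described in the last bullet, narrows down the date when a page dropped out.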

In my case these pages were:

  • A review of a new, a bit unconventional, textbook on quantum field theory and
  • a list of physics books, blogs and websites.

As Michelle pointed out correctly this does not mean that the page has been deleted from the index – as you can confirm by searching for site:[Offending URL] explicitly or by adding a more specific search criterion, like adding elkement. I found that the results displayed for my offending pages are erratic: Sometimes, surprisingly, the page will still show up if I just use the title of the post; perhaps a consequence of me, owner of the site, being logged on to Google. Sometimes I need to add an additional keyword to move it to the top in search results again.

But anyway, even if the pages had not been deleted, they had been pushed back to search results page >10.

Something had been deleted from the index though. Here is the number of indexed pages over time, showing a decline starting at the time impressions were plummeting, too:

Pages indexed by Google for this blog as per writing of this post

I cannot see a similar effect for any of the other sites, and as far as I know it does not correlate with some Google update (Google has indicated a major update in March 2014 in the figure).

Find the root cause. Except for links on my own sites and links on other blogs, my blog has no backlinks. As I learned in this research, backlinks from forums are often tagged nofollow so that search engines would not consider them spammy. This means links from your avatar commenting on other pages might not boost your blog, but might not hurt either.

The only ‘worthy’ backlink was from the page dedicated to that book I had reviewed – and that page linked exactly to the offending pages. My blog and the author’s page may look to Google as the tangle of cross-linked spammy pages hackers had misused my other web server for.

Do something about it? Conclusion? I replaced some of my links to the author’s site with a link to the book’s page on amazon.com. I moved one of the offending pages, the physics link list, over to radices.net – as I had planned to do so for quite a while in my eternal quest for tidy, consistent web sites. The page is still available on this blog, but not visible in the menu anymore.

But I will not ask the author to remove a valid backlink, nor will I remove my innocuous post, as this seems like succumbing to the rules of a silly game.

What I learned from this episode is that one single page – perhaps one you don’t even consider important on the grand scale of things and your blog in particular – can boost a blog or drag it down. Which pages are the chosen ones is beyond unpredictable.

Ending on a more positive note, I currently encounter the boost effect for our German blog as we indulge in writing about the configuration of this gadget, the programmable control unit we use with our heat pump system. The device is very popular among ambitious DIY enthusiasts, and readers are obviously searching for it.

Programmable control unit

We are often linking to the vendor’s business page and manuals. I hope they will never link back to us.

I will just keep watching the patterns and reporting on my encounters. One of the next enigmas to be resolved: Why is the number of Google searches in my WordPress Stats much higher than the number of page impressions in Google Tools for that day, let alone clicks in Google Tools?

Update 2015-01-23: The answer was embarrassingly simple, and all my paranoia had been misguided. WordPress has migrated their hosted blogs to https only. All my traffic was hiding in the statistics for the https version which has to be added in Google Webmaster Tools as a separate website.

19 thoughts on “Looking for Patterns”

  1. Pingback: All My Theories Have Been Wrong. Fortunately! | Theory and Practice of Trying to Combine Just Anything

    • Thanks a lot – as my blog has now fallen from Google’s grace and I feel like ‘nobody will read that’ 😉 it is particularly nice to meet a new follower!

  2. Why is it that after reading posts like this I get very, very nervous? Can you distill all of this into its significance for me and a WP enduser? I am so naive … why should I worry about spammers/hackers functioning as parasites in this way?

    • As an owner of a blog hosted by WP you can relax and read the first part – about my hacked website – just for entertainment. If you ever consider self-hosting your blog, it would become relevant.

      The second thing – my predicament with Google now basically ignoring my WP blog – is perhaps unlikely but could happen to every WP blog. The equivalent in your case would be something like: You link to a page of a photographer, perhaps writing a positive review (as I did for that physics book). That photographer links back to you. All is legitimate in principle. But now Google thinks that these cross-links are part of a link scam, so they believe you and the other photographer try to game the ranking of search results…. and your blog will be less likely to appear on the first pages of search results.
      But it still may be much more unlikely in your case than it was for me – I have done more research and developed my conspiracy theory further after I published this post. Google rolled out a minor update in September 2014, and my blog might have been hit by that. Currently I believe that in addition to the links back and forth to the author’s site, my blogging about ‘anything’ and the diverse topics might have made it look even more like a fake site. The new Google algorithm is said to promote small sites with niche topics – which would also explain why our German blog has even benefitted from the change. But my blog may look more like a ‘content farm’ – with many outbound links.

  3. I had to do a bit of research on self-hosted word press websites a couple of years ago. After not having to think about spam with WP, I was surprised by how much work a typical user of some services has to do themselves to maintain a spam-free site. I was left with the impression that most of these hacks and spam comments are done to leave links to particular sites and thus boost SEO on a typical person’s google search for the retailers who purchase these SEO services. I didn’t think of it as a black market scam before, as most web developers advertise their SEO schemes… but I suppose there would be differences to how an ethical or questionable business would proceed with that. Your conversation with Maurice is interesting, and it has been nice to read more about your experience in this post. Many of these questions rest in the back of my mind, too, and I appreciate you sharing all your research and observation.

    • Thanks for your comment, Michelle! I am also still researching this, and my intention in this ‘series’ is to write the posts I would have loved to read when starting to google randomly some weeks ago.

      Yes, this seems to be quite an efficient black market. Having previously viewed all things security more from the perspective of protecting IT infrastructure from corporate spies, disgruntled employees, evil nation states, mischievous vandalizing script-kiddies etc., I was not aware of hacking used primarily for spamming. For example, it was interesting to learn (from Krebs’ book) that those web shops advertised by the most experienced and ‘famous’ spammers will not misuse your credit card data, as they don’t want to risk having their sites shut down. So there is cybercrime in terms of hijacking PCs for sending spam e-mails or hacking web servers for hosting spam pages, but there is no credit card scam in the traditional sense, and the organization(s) are set up in a way that separates the spammers from the shop operators… It seems difficult for law enforcement to chase the criminals.

      But the ironic thing is that dealing with the hack is just a bit of a nuisance as cleaning out the URLs takes time. I am more baffled by the down-voting of the blog which is not related to the hack of the site that is hosted on another web server (as might not have been clear from my posts).
      Currently I believe my blog might have been hit by Google’s most recent minor update, Panda 4.1, whereas our German blog might have benefitted a bit from it. SEO experts say the goal of this update is to make small, specialized sites rank higher. But ‘content farms’ offering ‘thin’ posts on whatever topic might be penalized; this also applies to community sites like howstuffworks or to sites with ‘medical advice’. The timing of the ‘Google attack’ would match the 4.1 rollout last autumn. SEO articles have some examples of sites that took a heavy blow, like 80% less traffic from search engines.
      The advice by Google is to provide longform original content on niche topics, written by experts, with outbound links to authoritative sites. Experts state that with this Panda version one ‘bad’ post can impact all URLs though just as I experienced it (Quote: ‘Then you are in real trouble!’). So I think it is / was the combination of that unfortunate article that might have looked like a link scam plus the diverse nature of the topics, as it is obvious also from the tag cloud. Actually I have been thinking about ‘the future of this blog’ for several months, and I was gradually making adjustments, maybe somewhat unconsciously and mainly by deleting many posts I had already written. I will try to see the Panda’s attack as an opportunity to get more serious about this change.

      • I admit that I’ve been thinking about how to adapt my blog in a similar way. I periodically remove posts and I have also second-guessed drafted posts and deleted them before publishing. It’s a bit sad to think about the ways we limit ourselves when trying to grow ourselves professionally. For me, an interest in many topics is a benefit, and being able to demonstrate research abilities in a broad way is important. Yet, I can also understand why a company managing a search engine would drive its returns like this, so as to keep quality information a priority on line. The entire marketing industry makes me sick with its gimmicks and tricks for attracting more attention and business… whereas just being helpful in a field can do a better job of growing a business. In that way, I’m glad that SEO keeps changing.

        When I wrote those posts on life transitions, I felt a little concerned for straying off topic, and tried to build in some writers’ resources in the text. It crossed my mind after that maybe what would have been better was to build the posts in a private site and share selectively (by asking others to join as co-administrators). This would keep the public WordPress site focused and sparse. I often feel the uncertain pull between using the blog for professional uses and responding to social and personal interests. I’m not sure how, as a writer, one can separate the two!

        But, as I said before, I suspended publicizing my blog, have removed links on LinkedIn and turned off Google indexing for the time being. I needed to simplify, but I also need to rethink what I do with this blog and other social media outlets. Then again, I’ve almost gone off-line and feel a little too busy right now to be able to change that.

        • I have decided now I will ignore Google and cute furry Asian animals. I rather play with my own Big Data and post about it! Ha!
          I realized that the effect of high or low Google page impressions is not that impressive (pun intended) as there are other search engines in the world and many people don’t come to the blog via search results.

    • If this is a spam comment, I will not delete it. It complements the post nicely, in a weird way. But if it is spam, your comment and profile miss a link. If it is not spam, I failed sort of a reverse Turing test.
      It would not be the first time that I tried to talk to a spam bot on this blog – so: Welcome and thanks!

  4. Hi there Elke–quite enlightening in several different ways. First, the way that the hack affected your page rank information was completely unexpected to me, however, I suppose that had I thought about it in advance it would all have made sense. Second, though, is the way that the site was hacked. As you noted it’s not what you might expect. In my case I would have expected a denial of service attack or an outright hijacking, not something as subtle as what was done. I wonder what’s the real goal in doing that–does it make any kind of economic sense or is it just more or less silly efforts from those who really just want to mess around with systems without regard to how it affects others.
    By the way, I followed the links and visited some of your other sites. You have quite a wide internet footprint overall, don’t you! I noticed, with interest, the “story” that led to the new profile pictures 🙂 You live in a very beautiful land!

    • Hi Maurice! I think I did not explain the (un-)relation between the hack (of my non-Wordpress site) and the ranking issue of my WordPress blog well – really totally unrelated. I just paid more attention to the Google results for the blog as I used the Google tools also to understand the hack.
      The ranking of the hacked site actually had not been impacted so far – although I would sort of deserve it in this case. Maybe the spammy pages had been removed quickly enough.

      The final goal of the hack was quite mundane – just placing spammy links on a page so finally somebody would buy something. I had once encountered a hacked Hotmail account of a client – same thing: no damage like sending e-mails in your name or deleting your stuff… just using your account to send spam.

      Thanks for following some links 🙂 Yes, I am mega-googleable! I hope Google will not down-vote the other sites now as I linked them!

    • Forgot to add a link to a book – oh, wait, I better just quote the title and no link 🙂 I recently read ‘Spam Nation’ by Brian Krebs, investigative journalist in cyber security. He explains how the whole spam business really works. Spammers rent bot nets of PCs infected by malware to send out spam, and the sort of hack I have encountered is called ‘Black SEO’. So a lot of hacking is actually done for spamming.

  5. I like the way you dive deep into such a disaster – and learn from it. You had me rushing to my webstats that I rarely check other than visits. Touch wood – all is well.

    A few years ago we launched a micro-donation site where people could donate a few bucks for a good cause. (It didn’t really work but that is another story). Then we got this call from a Sheriff in Texas who claimed we had stolen credit card details and were deducting small amounts from it. Some digging later we discovered that a stolen credit card reseller (big businesses with hundreds of staff in offices in Turkey and elsewhere) was using our site to test the stolen numbers!!! They would pay us 1 USD with the card. If it passed, they sold the card on. The owner would probably not bother to investigate and we were supposed to be happy with the 1 USD.

    Not so. The FBI let us off the hook but the clearing banks put us on their blacklists and we had to do a lot of explaining to save our business from official and lasting disrepute.

    Lessons learned: think twice before setting up your own online payment facility and if you do, make sure your business insurance is up to scratch!

    • Thanks for sharing this great story – in a sense I feel better now! There is a saying that goes like: An expert is somebody who made every possible error in his field, so I really try to see the opportunity… and turn from “We have been hacked, terrible!” to “We have been hacked, awesome opportunity for first-hand learning experience!”

      • Definitely. I once read about a CEO of VW in the US who had lost 3 BUSD due to a strategic error. He offered his resignation but the president of the board refused, saying: “I have just paid 3 BUSD for your learning experience, you can’t leave now!”

        • Thanks again – great example in contrast to the culture of some organizations that spend more time in tracking down the ‘culprit’ than trying to fix the actual problem.
