Bots, Like This! I am an Ardent Fan of HTTPS and Certificates!

This is an experiment in Machine Learning, Big Data, Artificial Intelligence, whatever.

But I need a proper digression first.

Last autumn, I turned my back on social media and went offline for a few days.

There, in that magical place, the real world was offline as well. A museum of the history of physics had to be opened just for us.

The sign says: Please call XY and we open immediately.

Scientific instruments of the past have a strange appeal – steampunk-y, artisanal, timeless. But I could not have enjoyed it had I not locked down the gates of my social media fortresses before.

Last year’ improved’ bots and spammers seem to have invaded WordPress. Did their vigilant spam filters feel a disturbance of the force? My blog had been open for anonymous comments since more than 5 years, but I finally had to restrict access. Since last year every commentator needs to have one manually approved comment.

But how to get attention if I block the comments? Spam your links by Liking other blogs. Anticipate that clickers will be very dedicated: Clicking on your icon only takes the viewer to your gravatar profile. The gravatar shows a link to the actual spammy website.

And how to pick suitable – likeable – target blog posts? Use your sophisticated artificial intelligence: If you want to sell SSL certificates (!), pick articles that contain keywords like SSL or domain – like this one. BTW, I take the ads for acne treatment personally. Please stick to marketing SSL certificates. Especially in the era of free certificates provided by Let’s Encrypt.

Please use a different image for your different gravatars. You have done rather well when spam-liking the post on my domains and HTTPS, but what was on your mind when you found my post on hijacking orphaned domains for malvertizing?

Did statements like this attract the army of bots?

… some of the pages contain links to other websites that advertize products in a spammy way.

So what do I need to do to make you all like this post? Should I tell you that I have a bunch of internet domains? That I migrated my non-blogs to HTTPS last year? That WordPress migrated blogs to HTTPS some time ago? That they use Let’s Encrypt certificates now, just as the hosting provider of my other websites does?

[Perhaps I should quote ‘SSL’ and ‘TLS’, too.]

Or should I tell you that I once made a fool of myself by publishing my conspiracy theories – about how Google ditched my blog from their index? While I had actually missed that you need to add the HTTPS version as a separate item in Google Webmaster Tools?

So I desperately need help with Search Engine Optimization and Online Marketing. Google shows me ads for their free online marketing courses on Facebook all the time now.

Or I need help with HTTPS (TLS/SSL) – embarrassing, as for many years I did hardly anything else than implement Public Key Infrastructures and troubleshoot certificates. I am still debugging all kinds of weird certificate chaining and browser issues. The internet is always a little bit broken, says Sir Tim Berners-Lee.

[Is X.509 certificate a good search term? No, too nerdy, I guess.]

Or maybe you are more interested in my pioneering Search Term Poetry and Spam Poetry.  I need new raw material.

Like this! Like this! Like this!

Maybe I am going to even approve a comment and talk to you. It would not be the first time I fail the Turing test on this blog.

Don’t let me down, bots! I count on you!

Update 2018-02-13: So far, this post was a success. The elkemental blog has not seen this many likes in years… and right now I noticed that the omnipresent suit bot also started to market solar energy and to like my related posts!

Update 2018-02-18: They have not given up yet – we welcome another batch of bots!

bots-welcome-experiment-success-2

Update 2018-04-01: They become more subtle – now they spam-like comments – albeit (sadly) not the comments on this article. Too bad I don’t display the comment likes – only I see them in the admin console 😉

bots-welcome-experiment-success-3

Bing Says We Are Weird. I Prove It. Using Search Term Poetry.

Bing has done so repeatedly:

Bing Places asks for this every few months.

In order to learn more about this fundamental confusion I investigated my Bing search terms. [This blog has now entered the phase of traditionally light summer entertainment.]

Rules:

  • Raw material: Search terms shown in Bing Web Master Tools for any of my / our websites.
  • Each line is a search term or a snippet of a search term – snippets must not be edited, but truncation of phrases at the beginning or end is allowed.
  • Images: Random pick from the media library of elkement.blog or our German blog (‘Professional Tinkerers – Restless Settlers‘)

_______________________

you never know
my life and my work

is life just about working
please try to give a substantial answer

A substantial answer.

element not found
map random elkement

Random elkement, confused by maps.

so as many of you may know (though few of you may care)
my dedication to science

this can happen if
the internet is always a little bit broken

The internet is always a little broken

google translate repetetive glitch poetry

_______________________

So let’s hear what Google has to say!
Same rules, only for Google Search Console:

_______________________

so called art
solar energy poem

ploughing through
proof of carnot theorem
thermodynamics in a nutshell

Thermodynamics in a nutshell.

mr confused
why him?

Mr Confused

plastic cellar
sublime attic
in three sentences or fewer, explain the difference

Plastic tank - 'ice storage'

Sublime attic.

name the three common sources of heat for heat pumps
magic gyroscope
frozen herbs
mice in oven

Had contained frozen herbs. Isomorphic to folding of ice/water tank pond liner.

shapeshift vs kraken
tinkers construct slimey
you found that planet z should not have seasons

Slimey

the best we can hope for is that
tv is dangerous

what is the main function of the mulling phase?
you connect a packet sniffer to a switch

Connect a sniffer - to your heat pump

how to keep plastic water tank cool in summer
throwing boiling water into freezing air

Freezing air.

just elke
self employed physicist
the force is strong in me

The force is strong in me.

_______________________

Blinded by the Light: Links Not Considered Click-Worthy

I promised more experimental internet poetry — here it is!

The problem: Spam comments have petered out; their number seems to scale with posting frequency. The quality of search terms for this blog has improved, and most terms are technical and dull.

I turned to untapped raw material – finally I know what last year’s web development project was good for. As predicted, views plummeted after merging three sites into one and picking a country domain. The upside is that search terms got much better.

The following ‘poem’ is built from phrases displayed in Google Search Console (formerly called Webmaster Tools) for my personal website. I am picking from search terms in the list of Impressions, which means my pages appeared in the list of results but most of them were not clicked – hence the title of this post.

~

blinded by the light
despicable me

subversive activities
inconspicuous in a sentence

blinded by the light

blended movie stream
spooky black without you

offensive security
issued by an authority that is not trusted

offensive security

how does a steel plow work
educated guess

definition of tossed
krypton
deflector

energy armor
was last opened
across the universe

prevent the details of
dubious battle

energy armor

galaxy life hacked
mathematical modelling of zombies
lest means
gloomy sunday lyrics

meaning of strangeness
toilet success hacked

letterbox company

educated sentence generator
template missing

define subversive humor
farewell letter to customers
reply all bcc
without papers

subversive (humor)

 ~

Shortest Post Ever

… self-indulgent, admittedly – but just to add an update on the previous post.

My new personal website is live:

elkement.subversiv.at

I have already redirected the root URLs of the precursor sites radices.net, subversiv.at and e-stangl.at. Now I am waiting for Google’s final verdict; then I am going to add the rewrite map for the 1:n mapping of old ASP files to new ‘posts’. This is also the prerequisite for informing Google about the move officially.

The blog-like structure and standardized attributes like Open Graph meta tags and an XML sitemap should make my site more Google-likeable. With the new site – and one dedicated host name only – I finally added permanent redirects (HTTP 301). Before, I had used temporary (HTTP 302) redirects to send requests from the root directory to subfolders, which (so the experts say) is not search-engine-friendly.
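For what it’s worth, such a setup can be sketched in a web.config like this – a minimal sketch assuming the sites run on IIS with the URL Rewrite module; the file names and targets are illustrative (and simplified to 1:1), not my actual rewrite map:

    <rewrite>
      <rewriteMaps>
        <!-- hypothetical entries standing in for the old-ASP-to-post map -->
        <rewriteMap name="OldAspToPosts">
          <add key="/physics.asp" value="/posts/physics" />
          <add key="/pki.asp" value="/posts/pki" />
        </rewriteMap>
      </rewriteMaps>
      <rules>
        <rule name="RedirectOldAsp" stopProcessing="true">
          <match url=".*" />
          <conditions>
            <!-- look up the requested URL in the map; redirect only on a hit -->
            <add input="{OldAspToPosts:{REQUEST_URI}}" pattern="(.+)" />
          </conditions>
          <!-- permanent (301) redirect, the search-engine-friendly variant -->
          <action type="Redirect" url="{C:1}" redirectType="Permanent" />
        </rule>
      </rules>
    </rewrite>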

On the other hand the .at domain will not help: You can pick a certain country as preferred audience for a non-country domain, but I have to stick with Austria here, even if the language is set to English in all the proper places (I hope).

I have discovered that every WordPress.com Tag or Category has its own feed – just add /feed/ to the respective URL – and I will make use of this in order to automate some of my link curation, like this. This list of physics postings has been created from this feed of selected postings:
https://elkement.wordpress.com/category/science-and-technology/physics/feed/
Of course this means re-tagging and re-categorizing here! Thanks WordPress for the Tags to Categories (and vice versa) Conversion Tools!
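As a sketch of how such a feed can drive automated link curation – assuming the Python feedparser library; the feed URL is the one quoted above:

    import feedparser

    feed = feedparser.parse(
        "https://elkement.wordpress.com/category/science-and-technology/physics/feed/"
    )
    for entry in feed.entries:
        # turn each post in the category feed into a simple HTML link
        print(f'<a href="{entry.link}">{entry.title}</a>')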

It is fun to watch my server’s log files more closely. Otherwise I would have missed that SQL injection attack attempt, trying to put spammy links on my website (into my database):

SQL injection by spammer-hackers

Finally Mobile-Friendly! (How I Made Googlebot Happy)

Not this blog of course – it had been responsive already.

But I gave in to Google’s nagging and did not ignore messages in Google Webmaster Tools any longer. All my home-grown websites had a fixed width of the content pane and a fixed left sidebar. On a mobile device you only saw the upper left corner – showing the side bar and only part of the content pane.

Learning about a major Google update implemented last week, I spent one night coding until the test went fine for our business website

punktwissen website, Google's test for mobile friendliness

… and for my/our other sites subversiv.at, radices.net, e-stangl.at, and z-village.net. I keep one non-responsive page: epsi.name.

This is not a guide to the perfect responsive design: I am not a professional web developer, and I don’t claim my CSS or HTML code is flawless, elegant, or processed correctly by all browsers in the world. I read this tutorial and this guide, and they provided me with clues to answer my main question:

What is the bare minimum to make a classical website
mobile-friendly according to Google’s requirements?

Failing these requirements also does not necessarily mean a website is extremely difficult to read on a mobile device. There is a famous website that doesn’t meet Google’s standards although the content pane fits nicely into the width of a smartphone – if you turn it by 90° and scroll to the right … which Googlebot will not do.

In summary I did the following:

Pre-requisites: Use only CSS for formatting; in particular, define the layout by containers referred to in the stylesheet. Fortunately I had made that move long ago.

1) Set a viewport meta tag which tells the device to adapt the visible content to the width of the screen. Even if the width of the content is not fixed in a desktop browser, it is not automatically rendered correctly on mobile devices without a viewport tag. Actually, I was wrong in assuming that a plain old-school, hardly formatted HTML text of variable width is mobile-friendly by default. In this case the content adapts to the width of the device, but Google rightly complains about too-small text and links too close together – in addition to the missing viewport.
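For reference, this is the standard tag, added to the <head> of each page:

    <!-- tells mobile browsers to map the layout to the device width -->
    <meta name="viewport" content="width=device-width, initial-scale=1">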

I had been intimidated by the small text / links too close errors some time ago and figured I had to re-do all navigation elements. But after adding the viewport tag, the ‘only’ thing left was to make the content break or flow so that it won’t be wider than the screen. Text size and links were fine without any change to font size or to the width / height of containers for navigation links.

2) Add at least one media query to my CSS stylesheets in order to make the left sidebar vanish or move if the width of the screen in pixels is smaller than a certain size. I tested with an Android device and with Google’s tool – but mainly I was squeezing the window on a desktop PC to very small widths. For the business website I decided the sidebar is nice-to-have, as it just shows recent blog posts – the same approach as used by my current WordPress template. For some other sites it was an essential navigation pane, so I let it move to the top.
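A minimal sketch of such a media query – the container names and the 640px threshold are made up for this sketch, not copied from my actual stylesheets:

    /* below the threshold, give up the fixed two-column layout */
    @media (max-width: 640px) {
      #sidebar {
        float: none;        /* an essential navigation pane moves to the top ... */
        width: auto;
        /* display: none;      ... or hide a nice-to-have sidebar entirely */
      }
      #content {
        margin-left: 0;     /* the content pane now spans the full width */
        width: auto;
      }
    }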

3) Make sure that all containers and images on a page resize or flow accordingly, by making their styles change at the threshold width or continuously – this meant cross-checking the styles of all containers that define the layout and changing / adding style definitions depending on the screen width. I made images resizable, and text displayed next to images should flow under them at a certain width.
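The image part boils down to very little CSS (again just a sketch):

    /* fluid images: scale down with their container, keep the aspect ratio */
    img { max-width: 100%; height: auto; }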

All My Theories Have Been Wrong. Fortunately!

I apologize to Google. They still like my blog.

This blog’s numbers plummeted as per Webmaster Tools; here and here you find everything you never wanted to know about it. I finally figured that my blog had been a victim of Google’s latest update, Panda 4.1. Sites about ‘anything’ had suffered, and the Panda rollout matched the date of the onset of the decline.

Other things happened in autumn, too: I had displayed links to latest WordPress blog posts on my other websites, but my feed parser suddenly refused to work. The root cause was the gradual migration of all WP.com blogs and feeds to https:// only. Only elkement’s blog had been migrated at that time; our German blog’s feed was affected two months later.

Recently the German blog also started its descent in impressions and clicks, again two months after elkement’s blog. I pondered https URLs again – the correlation was too compelling. Then suddenly the answer came to me:

!

!!

!!!

You need to add the https URL as an additional site in Webmaster Tools.

!!!

!!

!

It was that simple. All the traffic I missed was here all the time – tucked away in the statistics for https://elkement.wordpress.com. This also answers the question I posed in my last Google rant post: Why do I see more Search Engine referrers in WordPress stats than clicks in Webmaster Tools? I had just looked in the wrong place.

I had briefly considered the https thing last year but ruled it out as I had misinterpreted Webmaster Tools – falsely believing that one entry for a site would cover both the http and the https version. These are the results for both URLs – treated as separate entities by Webmaster Tools:

Results for http : // elkement.wordpress.com  – abysmal:

(Edit: I cannot use a link here and have to add those weird blanks – otherwise WP will always convert both URL and text to https automatically even if the prefix is displayed as http in the editor.)

Google traffic for the http version of this blog

Results for https://elkement.wordpress.com – better by a factor of 100:

Way more Google traffic for the https version of this blog URL

Popular pages were the first to ‘move’ over to the https entry. This explains why my top page was missing first from the http pages’ impressions – the book review which I had assumed to have been penalized by Panda as an alleged cross-link scam. In full paranoia mode I was also concerned about adding random Wikimedia images to my poetry.

But now I will do it again as I feel relieved. And relaxed – like this panda.

Giant panda (Wikimedia Commons image)

______________________________

You have read a post in my new category Make a Fool of Myself. (I tried to top the self-sabotaging effect of writing about my business website being hacked – as a so-called security expert.)

Yet the theory was all too compelling. I found numerous examples of small sites penalized by Panda in a weird way. See this discussion: A shop’s webmaster makes a product database with succinct descriptions available online and is penalized for ‘keyword spamming’ – as his keywords are part of each product name. Advice by SEO experts: Circumscribe your product names.

Legend has it that Panda was named after a Google engineer. I figured it was because the panda is so choosy, insisting on bamboo (*), just as Google scrutinizes our sites more and more. (*) One more theory I got wrong, now edited – I had originally claimed it insists on eucalyptus! Thanks to commentator Cleo for pointing out the mistake.

Looking for Patterns

Scott Adams, of Dilbert fame, has a lot of useful advice in his autobiographical book How to Fail at Almost Everything and Still Win Big. He recommends looking for patterns in your life, without attempting to theorize about cause and effect. Learning from those patterns, you could increase the chance that luck will hit you. I believe in increasing your options, so I can relate a lot to applying this approach to Life, the Universe and Everything.

It should be true in relation to the iconic example of patterns, that is: web traffic. In this post I’ll try to briefly summarize what I have learned so far from the most recent unfortunate events (this is PR speak for disaster). I have been intrigued by web statistics, web servers’ log files, and the summaries shown by the free Google or Bing Webmaster Tools for a long time, but I started to follow the trends more closely after my other, non-WordPress web server had been hacked at the end of November.

How do you recognize that your site has been hacked?

This is very different from what you might expect from popular lore and movies. I downloaded the log files for my web server from time to time, and I just noticed that suddenly the size of the daily files was about twice the usual. Inspecting the IP addresses the traffic to my site came from, I spotted a lot of hits by Googlebot. Sites are indexed all the time, but I was baffled by the URLs – all pointing to pages that should not exist on my server. These URLs contained a long query string with all kinds of brand names, as you know them from spam comments or e-mails.

This is an example line in the log file:

Spammy page on hacked web server, accessed by Google bot

This IP address belongs to a *.googlebot.com machine, as can be confirmed by reverse-resolving the IP address, e.g. using nslookup. The worrying fact was the status code 200, which means the page had indeed been there.
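In code, the same check looks about like this – a sketch in Python; the IP address is a made-up example, not one from my actual logs:

    import socket

    ip = "66.249.66.1"                   # example crawler IP, not from my logs
    host = socket.gethostbyaddr(ip)[0]   # e.g. crawl-66-249-66-1.googlebot.com
    genuine = host.endswith(".googlebot.com")
    # forward-confirm: the host name should resolve back to the same IP
    genuine = genuine and ip in socket.gethostbyname_ex(host)[2]
    print(host, genuine)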

A few days later this has changed to a 404, so the page did not exist anymore:

Spammy page removed from hacked web server, Google bot tries to access it.

The attack had happened on a weekend, and the pages were removed immediately by my hosting provider.

I cross-checked if those pages had indeed been indexed by Google by searching for site:[domain name]. This is a snippet from the search results – the spammers even borrowed the tag line of our legitimate site as a description (which I cropped from the screenshot here).

spammy-page-in-google-index

Overall these were just a bunch of different pages (ASP files), but Google recognizes every different query string, appended after the question mark, as a different URL. So suddenly Google had a lot more URLs to index, and you could see a spike in Webmaster Tools:

Crawl stats after hack

There was also a warning message on the welcome page:

Google warning message about 404 errors

What to do?

Obviously the first thing is to delete the spammy pages and deal with whatever vulnerability had been exploited. This was done before I noticed the hack myself. But I am still in clean-up mode to get the spammy pages removed from Google’s index:

robots.txt. Using the site:[domain name] search I identified all the spammy pages and added them to the robots.txt file on my server. This file tells search engines which pages not to index. Fortunately you do not have to add each individual URL – adding the page (ending in .asp in this case) is sufficient.
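The entries look about like this – the page name is hypothetical; a Disallow rule matches by prefix, so it covers every query string appended to that page:

    User-agent: *
    Disallow: /spammy-brands.asp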

But pages were still in the index after that, just the description was changed to:
A description for this result is not available because of this site’s robots.txt.

As far as I can tell, entries are still added to the index if somebody else links to your pages (actually, spammy pages on other hacked servers, see root cause analysis below). But as Google is not allowed to investigate the target as per robots.txt, it only adds the link without a description.

URL parameters. Since the spammy pages all use query strings and all strings have the same parameter – [page].asp?dca= in my case – I tried managing the URL parameters via Webmaster Tools. This is actually an option to let Google know if a query string should really denote another version of a page, or if all query strings for one page should be indexed as a single page. E.g. I am using a query string called imgClicked to magnify an image here – when clicking on the top image – and I could tell Google that the clicked / unclicked image should not be counted as different URLs.

In the special case of the spammy pages I tried to tell Google that different dca values don’t make for a separate page (which would result in about 6 spammy URLs in the index instead of 1500), but this did not impact the gradual accumulation of indexed spammy pages.

Mind-numbing work. To get rid of all pages as fast as possible I also removed each. of. them. manually. via Google Webmaster Tools. This means:

  • Click on the URL from the search results, opening a new tab. This results in a 404.
  • Copy the URL from the address bar to Webmaster Tools in the form for removing the URL.
  • Click submit.
  • Repeat 1500 times.

I am now at about 500. Not all spammy pages that ever existed are displayed in the index at once, but about 10 are added every day. Where do they come from, after the original pages had been deleted?

How was this hack actually supposed to work?

The legitimate pages had not been changed or vandalized; the hacker-spammers just placed additional pages on the server. I would never have noticed them had I not encountered Google’s indexing activities.

I was curious what those pages had looked like, so I inspected Google’s cache by searching for cache:[spammy URL]. The cached page consisted of:

  • Your typical junk of spammy text, otherwise I would be delighted about raw material for poetry.
  • A list of links to other spammy pages, most of them on my hacked server.
  • An exact copy of the default page of this (legitimate) web site.

I haven’t investigated all those more than 1000 pages and the spammy links displayed on them, but I conjectured there had to be some outbound links to other – hacked – servers. Links are only boosted if there are backlinks from seemingly independent websites. Somehow this should make people buy something in a shady webshop at the end of a cascade of links.

After some weeks I was able to confirm this, as Google Webmaster Tools now shows external backlinks to my domain from other spammy pages on legitimate sites, mostly small businesses in the US. Many of them used the same provider, which obviously had been hacked as well.

This explains where the gradual supply of spammy links to the index comes from: Google has followed the spammy links from the other hacked servers inbound to my server. It seems to take a while to clean this out, as all the other webmasters have removed their pages as well – I checked each. of. them. from the long list supplied by Google as a CSV file.

Had I not been hacked, I might never have become aware of the completely unrelated onslaught by Google itself, targeted at this blog. I reported on this in detail previously; here is just an update and a summary.

Edit, as I conclude from the comments that this was not clear: The following analysis is unrelated to the hack of the non-WordPress site – the hacked site has not been penalized by Google so far. But the blog you are reading right now has.

Symptoms of your site having been penalized by a search engine

Rapid decline of impressions. Webmaster Tools shows a period of 3 months maximum. I have checked the trend for all my sites now and then, but there was actually never anything that constituted a real trend. But for this blog, page impressions went from a few hundred – often more than 1000 – per day this summer to less than 10 per day now.

Page impressions Sept to Dec

Page impressions have stayed at their all-time low since last time, so just extend that graph to the right.

Comparison with sites that should rank much lower. Currently this blog has as many or as few impressions as my personal website e-stangl.at. Its Google PageRank is 1 – as compared to 3 for the WordPress blog; I update it once a quarter at most, and its word count is perhaps a thousandth of this blog’s.

My other two sites, subversiv.at and radices.net, score better although I update them only about once every 6 weeks, and I am pretty sure I violate best practices due to my creative mixing of languages, commenting on my own stuff, and/or curating enormous lists of outbound links.

It is ironic that Google has penalized this blog now, as of autumn 2014 my quality control had become more ruthless. I had quite a number of posts in Drafts, with more than 1000 words each, edited and spell-checked – and finally deleted all of them. The remaining posts were the ones requiring considerable research, plus my poetry. This spam poem is one of my most popular posts as per Google’s page impressions. So all theorizing is really futile, and I should better watch the pattern emerge.

Identifying offending pages. I added an update to the previous post as I spotted the offending pages using the following method:

  • Identify your top performing pages by ranking pages in the list of search results by impressions or clicks.
  • Then order pages in the list of search results by page name. This is effectively ranking by date for blogs, and the list can be compared to the archive of all pages.
  • Make the time span covered by the Google tools smaller and smaller and check if one of your former top pages suddenly vanishes from the list.

In my case these pages were:

  • A review of a new, a bit unconventional, textbook on quantum field theory and
  • a list of physics books, blogs and websites.

As Michelle pointed out correctly, this does not mean that the page has been deleted from the index – as you can confirm by searching for site:[Offending URL] explicitly, or by adding a more specific search criterion, like adding elkement. I found that the results displayed for my offending pages are erratic: Sometimes, surprisingly, the page will still show up if I just use the title of the post; perhaps a consequence of me, owner of the site, being logged on to Google. Sometimes I need to add an additional keyword to move it to the top of the search results again.

But anyway, even if the pages had not been deleted, they had been pushed back to search results page >10.

Something had been deleted from the index though. Here is the number of indexed pages over time, showing a decline starting at the time impressions were plummeting, too:

Pages indexed by Google for this blog as per writing of this post

I cannot see a similar effect for any of the other sites, and as far as I know it does not correlate with some Google update (Google has indicated a major update in March 2014 in the figure).

Find the root cause. Except for links on my own sites and links on other blogs, my blog has no backlinks. As I learned in this research, backlinks from forums are often tagged nofollow so that search engines would not consider them spammy. This means links from your avatar commenting on other pages might not boost your blog, but might not hurt either.
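Such a tagged link looks like this in the page source (URL and link text just for illustration):

    <!-- search engines see the link but are asked not to pass on ranking credit -->
    <a href="https://elkement.wordpress.com/" rel="nofollow">elkement</a>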

The only ‘worthy’ backlink was from the page dedicated to that book I had reviewed – and that page linked exactly to the offending pages. My blog and the author’s page may look to Google like the tangle of cross-linked spammy pages hackers had misused my other web server for.

Do something about it? Conclusion? I replaced some of my links to the author’s site with a link to the book’s page on amazon.com. I moved one of the offending pages, the physics link list, over to radices.net – as I had planned to do for quite a while in my eternal quest for tidy, consistent websites. The page is still available on this blog, but not visible in the menu anymore.

But I will not ask the author to remove a valid backlink or remove my innocuous post; that would feel like succumbing to the rules of a silly game.

What I learned from this episode is that one single page – perhaps one you don’t even consider important on the grand scale of things and your blog in particular – can boost a blog or drag it down. Which pages are the chosen ones is beyond unpredictable.

Ending on a more positive note: I currently encounter the boost effect on our German blog, as we indulge in writing about the configuration of this gadget, the programmable control unit we use with our heat pump system. The device is very popular among ambitious DIY enthusiasts, and readers are obviously searching for it.

Programmable control unit

We are often linking to the vendor’s business page and manuals. I hope they will never link back to us.

I will just keep watching the patterns and reporting on my encounters. One of the next enigmas to be resolved: Why is the number of Google searches in my WordPress stats much higher than the number of page impressions in Google Webmaster Tools for that day, let alone clicks?

Update 2015-01-23: The answer was embarrassingly simple, and all my paranoia had been misguided. WordPress has migrated their hosted blogs to https only. All my traffic was hiding in the statistics for the https version, which has to be added in Google Webmaster Tools as a separate website.