A Color Box. Lost in Translation

It was that time again.

The Chief Engineer had rebuilt the technical room from scratch. Each piece of heavy equipment had a new place, each pipe and wire was reborn in a new incarnation (German stories here.)

The control system was turned upset down as well, and thus the Data Kraken was looking at its entangled tentacles, utterly confused. The fabric of spacetime was broken again – the Kraken was painfully reminded of its last mutation in 2016.

Back then a firmware update had changed the structure of log files exported by Winsol. Now these changes were home-made. Sensors have been added. Sensor values have been added to the logging setup. Known values are written to different places in the log file. The log has more columns. The electrical input power of the heat pump has now a positive value, finally. Energy meters have been reset during the rebuild. More than once. And on and on.

But Data Kraken had provided for such disruptions! In a classical end-of-calendar-year death march project its software architecture had been overturned 2016. Here is a highly illustrative ‘executive level’ diagram:

The powerful SQL Server Kraken got a little companion – the Proto Kraken. Proto Kraken proudly runs on Microsoft Access. It comprises the blueprint of Big Kraken – its DNA: a documentation of all measured values, and their eternal history: When was the sensor installed, at which position in the log file do you find its values from day X to day Y, when was it retired, do the values need corrections …

A Powershell-powered tentacle crafts the Big Kraken from this proto version. It’s like scientists growing a mammoth from fossils: The database can be rebuilt from the original log files, no matter how the file structure mutates over time.

A different tentacle embraces the actual functions needed for data analysis – which is Data Kraken’s true calling. Kraken consolidates data from different loggers, and it has to do more than just calculating max / min / totals / averages. For example, the calculation of the heat pump’s performance factor had to mutate over time. Originally energy values had been read off manually from displays, but then the related meters were automated. Different tentacles need to reach out into different tables at different points of time.

Most ‘averages’ only make sense under certain conditions: The temperature at different points in the brine circuits should only contribute to an average temperature when the brine circulation pump is active. If you calculate the performance factor from heat source and target temperature (using a fit function), only time intervals may contribute when the heat pump did actually run.

I live in fear of Kraken’s artificial intelligence – will it ever gain consciousness? Will I wake up once in a science fiction dystopia? Fortunately, there is one remaining stumbling block: We have not yet fully automated genetic engineering. How could that ever work? A robot or a drone trying to follow the Chief Engineer’s tinkering with sensor wiring … and converting this video stream into standardized change alerts sent to Data Kraken?

After several paragraphs laden with silly metaphors, I finally come to the actual metaphor in the title of this post. The

Color Box

Once you came up with a code name for your project, you cannot get it out of your head. That also happened to the Color Box (Farbenkastl).

Here, tacky tasteless multi-colored things are called a color box. Clothes and interior design for example. Or the mixture of political parties in parliament. That’s probably rather boring, but the Austrian-German term Farbenkastl has a colorful history: It had been used in times of the monarchy to mock the overly complex system of color codes applied to the uniforms of the military.

What a metaphor for a truly imperial tool: As a precursor to the precursor to the Kraken Database … I use the Color Box! Brought to me by Microsoft Excel! I can combine my artistic streak, coloring categories of sensors and their mutations. Excel formulas spawn SQL code.

The antediluvian 2016 color box was boring:

But trying to display the 2018 color box I am hitting the limit of Excel’s zooming abilities:

I am now waiting now for the usual surprise nomination for an Science & Arts award. In the meantime, my Kraken enjoys its new toys. Again, the metaphoric power of this video is lost in translation as in German ‘Krake’ means octopus.

(We are still working at automating PVC piping via the Data Kraken, using 3D printing.)

Waging a Battle against Sinister Algorithms

I have felt a disturbance of the force.

As you might expect from a blog about anything, this one has a weird collection of unrelated top pages and posts. My WordPress Blog Stats tell me I am obviously an internet authority on: how rodents get into kitchen appliances, about the physics of a spinning toy, about the history of the first heat pump, and most recently about how to sniff router traffic. But all those posts and topics are eclipsed by the meteoric rise of the single most popular ever article, which was a review of a book on a subfield in theoretical physics. I am not linking this post or quoting its title for reasons you might understand in a minute.

Checking out Google Webmaster Tools the effect is even more pronounced. Some months ago this textbook review attracted by far the most Google search impressions and clicks. Looking at the data from the perspective of a bot it might appear as if my blog had been created just to promote that book. Which is, what I believe might actually had happened.

Concluding from historical versions of the book author’s website (on archive.org), the page impressions of my review started to surge when he put a backlink to my post on his page, some when in spring this year.

But then in autumn this happened.

Page impressions for this blog on Google Webmaster Tools, Sept to Dec.These are the impressions for searches from desktop computers (‘Web’), without image or mobile search. A page impression means that  the link had been displayed on Google Search Results pages to some user. The curve does not change much if I remove the filter for Web.

For this period of three months, that article I Shall Not Quote is the top page in terms of impressions, right after the blog’s default page. I wondered about the reason for this steep decline as I usually don’t see any trend within three months on any of my sites.

If I decrease the time slot to the past month that infamous post suddenly vanishes from the top posts:

Page impressions and top pages in the last monthIt was eradicated quickly – which can only be recognized when decreasing the time slot step-by-step. With a few days at the end of October / beginning of November the entry seems to have been erased from the list of impressions.

I sorted the list of results shown above by the name of the page, not by impressions. Since WordPress posts’ names are prefixed with dates you would expect to see any of your posts in that list somewhere, some of them of course with very slow scores. Actually, that list does include also obscure early posts from 2012 nobody ever clicks at.

The former top post, however, did not get a single impression anymore in the past month. I have highlighted the posts before and after in the list, and I have removed all filters for this one, thus also image and mobile search are taken into account. The post’s name started with /2013/12/22/:

Last month, top pages, recent top post missingChecking the status of indexed pages in total confirms that links have been recently removed:

Index status of this blogFor my other sites and blog this number is basically constant – as long as a website does not get hacked. As our business site actually has been a month ago. Yes, I only mention this in passing as I am less worried about that hack than about that mysterious penalizing of this blog.

I learned that your typical hack of a website is less spectacular that what hacker movies let you believe: If you are not a high-profile target, hacker-spammers leave your site intact, but place additional spammy pages with cross-links on your site to promote their links. You recognize this immediately by a surge of the number of URLs, of indexing activities, and – in case your hoster is as vigilant as mine – a peak in 404 not found errors after that spammy pages have been removed. This is the intermittent spike in spammy pages on our business page crawled by Google:

Crawl stats after hackI used all tools at my disposal to clean up the mess the hackers caused – those pages actually have been indexed already. It will take a while until things like ‘fake Gucci belts’ will be removed from our top content keywords, after I removed the links from the index by editing robots.txt, and using the Google URL removal tool and the URL parameters tool (the latter comes in handy as the spammy pages have been indexed with various query strings, that is: parameters).

I have expected the worst but Google have not penalized me for that intermittent link spam attack (yet?). Numbers are now back to normal after a peak in queries for those fake brand stuff:

Queries back to normal after clean-up.It was an awful lot of work to clean those URLs popping up again and again every day. I am willing to fight the sinister forces without too much whining. But Google’s harsh treatment of the post on this blog freaks me out. It is not only the blog post that was affected but also the pages for the tags, categories and archive entries. Nearly all of these pages – thus all the pages linking to the post – did not get a single impression anymore.

Google Webmaster Tools also tells me that the number of so-called Structured Data for this blog had been reduced to nearly zero:

Structured data on this blogStructured Data are useful for pages that show e.g. product reviews or recipes – anything that should have a pre-defined structure that might be presented according to that structure in Google search results, via nice formatted snippets. My home-grown websites do not use those, but the spammer-hackers had used such data in their link spam pages – so on our business site we saw a peak in structured data at the time of the hack.

Obviously WP blogs use those per design. Our German blog is based on the same WP theme – but the number of structured data there has been constant. So if anybody out there is using theme Twenty Eleven I would be happy to learn about your encounters with structured data.

I have read a lot: what I never wanted to know about search engine optimization. This also included hackers’ Black SEO. I recommend the book Spam Nation by renowned investigative reporter and IT security insider Brian Krebs, published recently. Whose page and book I will again not link.

What has happened? I can only speculate.

Spammers build networks of shady backlinks to promote their stuff. So common knowledge is of course that you should not buy links or create such network scams. Ironically, I have cross-linked all my own sites like hell for many years. Not for SEO purposes but in my eternal quest for organizing my stuff, keeping things separate, but adding the right pointers though, Raking the virtual Zen Garden etc. Never ever did this backfire. I was always concerned about the effect of my links and resources pages (links to other pages, mainly tech and science). Today my site radices.net which was once an early German predecessor of this blog is my big link dump – but still these massive link collections are not voted down by Google.

Maybe Google considers my posting and the physics book author’s website part of such a link scam. I have linked to the author’s page several times – to sample chapters, generously made available via download as PDFs, and the author linked back to me. I had refused to tie my blog to my Google+ account and claim ‘Google authorship’ so far as I don’t wanted to trade elkement for my real name on G+. Via Webmaster tools Google knows about all my domains but they might suspect I – a pseudo-anonymous elkement, using an @subversiv.at address on G+ – might also own the book author’s domain that I – diabolically smart – did not declare in Webmaster Tools.

As I said before, from a most objective perspective Google’s rationale might not be that unreasonable. I don’t write book reviews that often, my most recent were about The Year Without Pants and The Glass Cage. I rather write posts triggered by one idea in a book, maybe not even the main one. When I write about books I don’t use Amazon Affiliate marketing – as professional reviewers such as Brain Pickings or Farnam Street do. I write about unrelated topics. I might not match the expected pattern. This is amusing as long as only a blog is concerned but on principle it is similar as being interviewed by the FBI at an airport because your travel pattern just can’t be normal (as detailed in the book Bursts, on modelling human behaviour – a book I also sort of reviewed last year).

In short, I sometimes review and ‘promote’ books without any return on that. I simply don’t review books I don’t like as I think blogging should be fun. Maybe in an age of gamified reviews and fake forum posts with spammy signatures Google simply doesn’t buy into that. I sympathize. I learned that forums websites shod add a nofollow tag to any hyperlinks users post so that Google will now downvote the link targets. So links in discussion groups are considered spammy per se and you need to do something about it so that they don’t hurt what you – as a forum user – are probably trying to discuss or recommend in good faith. I already live in fear that those links some tinkerers set in DIYer’s forums (linking to our business site or my posts on our heating system) will be considered paid link spam.

However, I cannot explain why I can find my book review post on Google (thus generating an impression) when searching for site:[URL of the post]. Perhaps consolidation takes time. Perhaps there is hope. I even see the post when I use Tor Browser and a foreign IP address so this is not related to my preferences as a logged on Google user. But if there isn’t a glitch in Webmaster Tools, no other typical searcher encounters this impression. I am aware of the tool for disavowing URLs but I don’t want to report a perfectly valid backlink. In addition, that backlink from the author’s site does not even show up in the list of external backlinks which is another enigma.

I know that this seems to be an obsession with a first world problem: This was an post on a topic I don’t claim expertise or that I don’t consider strategically important. But whatever happens to this blog could happen to other sites I am more concerned about, business-wise. So I hope if is just a bug and/or Google Bots will read this post and will release my link. Just in case I mentioned your book or blog here, even if indirectly, please don’t backlink.

Perhaps Google did not like my ranting about encrypted search terms, not available to the search term poet. I dared to display the Bing logo back then. Which I will do again now as:

  • Bing tells me that the infamous post generates impressions and clicks
  • Bing recognizes the backlink
  • The number of indexed pages is increasing gradually with time.
  • And Bing did not index the spammy pages in the brief period they were on our hacked website.

Bing logo (2013)Update 2014-12-23 – it actually happened twice:

Analyzing the impressions from the last day I realize that Google has also treated my physics resources page Physics Books on the Bedside Table. Page impressions dropped and now that page which was the top one )after the review had plummeted) is gone, too. I had already considered to move this page to my site that hosts all those list of links (without issues, so far): radices.net, and I will complete this migration in a minute. Now of course Google might think I, the link spammer, am frantically moving on to another site.

Update 2014-12-24 – now at least results are consistent:

I cannot see my own review post anymore when I search for the title of the book. So finally the results from Webmaster Tools are in line with my tests.

Update 2015-01-23 – totally embarrassing final statement on this:

WordPress has migrated their hosted blogs to https only. All my traffic was hiding in the statistics for the https version which has to be added in Google Webmaster Tools as a separate website.