Other People Have Lives – I Have Domains

These are just some boring update notifications from the elkemental Webiverse.

The elkement blog has recently celebrated its fifth anniversary, and the punktwissen blog will turn five in December. Time to celebrate this – with new domain names that say exactly what these sites are – 'elkement.blog' and 'punktwissen.blog'.

Actually, I wanted to get rid of the ads on both blogs, and with the upgrade came a free domain. WordPress has a detailed cookie policy – and I am showing it dutifully using the respective widget – but they have to defer to their partners when it comes to third-party cookies. I only want to worry about research cookies set by Twitter and Facebook, not by ad providers, and I am also considering removing the social media sharing buttons and the embedded tweets. (Yes, I am thinking about this!)

On the websites under my control I went full dinosaur: the server sends only non-interactive HTML pages to the client, not requiring any client-side activity. I have now got rid of the last half-hearted usage of a session object and the respective cookie, and I have never used any social media buttons or other tracking.

So there are no login data or cookies to protect, and yet I have finally migrated all sites to HTTPS.

It is a matter of principle: I of all website owners should use HTTPS. For 15 years I have been planning and building Public Key Infrastructures and troubleshooting X.509 certificates.

But of course I fear Google's verdict: they announced long ago that HTTPS is considered a positive ranking signal by their search engine. Pages not using HTTPS will be tagged as insecure with more and more terrifying icons – e.g. HTTP-only pages with login buttons already display a crossed-out padlock in Firefox. In recent years I migrated a lot of PKIs from SHA1 to SHA256 to fight the first wave of Insecure icons.

Finally Let's Encrypt has started a revolution: free SSL certificates, based on domain validation only. My hosting provider uses a solution based on Let's Encrypt – a reverse proxy that does the actual HTTPS. I only had to re-target all my DNS records to the reverse proxy – which would have been very easy had it not been for all my existing URL rewriting, tweaking, and redirecting. I also wanted to keep the option of still using HTTP in the future for tests and special scenarios (like hosting a revocation list), so I decided to do the redirecting myself in the application(s) instead of using the offered automated redirect. But a code review and clean-up now and then can never hurt 🙂 For large, complex sites the migration to HTTPS is anything but easy.
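Such an in-application redirect boils down to a few lines at the top of a page – a minimal classic ASP sketch, assuming the reverse proxy that terminates TLS announces the original scheme in an X-Forwarded-Proto header (the header name is an assumption about the proxy, not my provider's documented setup):

  <%
  ' Minimal sketch of an in-application HTTP-to-HTTPS redirect (classic ASP).
  ' Assumption: the TLS-terminating reverse proxy sets X-Forwarded-Proto.
  ' Query strings are omitted for brevity.
  If LCase(Request.ServerVariables("HTTP_X_FORWARDED_PROTO")) = "http" Then
      Response.Status = "301 Moved Permanently"
      Response.AddHeader "Location", "https://" & Request.ServerVariables("HTTP_HOST") & Request.ServerVariables("URL")
      Response.End
  End If
  %>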

In case I ever forget which domains and host names I use, I just need to check out this list of Subject Alternative Names again:

(And I have another certificate for the ‘test’ host names that I need for testing the sites themselves and also for testing various redirects ;-))

WordPress.com also uses Let’s Encrypt (Automattic is a sponsor), and the SAN elkement.blog is lumped together with several other blog names, allegedly the ones which needed new certificates at about the same time.

It will be interesting to see what the consequences for phishing websites will be. Malicious websites will look trustworthy, as they are issued certificates automatically, but revoking a certificate might provide another way to invalidate a malicious website.

Anyway, special thanks to the WordPress.com Happiness Engineers and support staff at my hosting provider Puaschitz IT. Despite all the nerdiness displayed on this blog I prefer hosted / ‘shared’ solutions when it comes to my own websites because I totally like it when somebody else has to patch the server and deal with attacks. I am an annoying client – with all kinds of special needs and questions – thanks for the great support! 🙂

Anniversary 4 (4 Me): “Life Ends Despite Increasing Energy”

I published my first post on this blog on March 24, 2012. Back then its title and tagline were:

Theory and Practice of Trying to Combine Just Anything
Physics versus engineering
off-the-wall geek humor versus existential questions
IT versus the real thing
corporate world's strangeness versus small business entrepreneur's microcosmos
knowledge worker's connectedness versus striving for independence

… which became

Theory and Practice of Trying to Combine Just Anything
I mean it

… which became

elkemental Force
Research Notes on Energy, Software, Life, the Universe, and Everything

last November. It seems I have run out of philosophical ideas and have said everything I had to say about Life and Work and Culture. Now it's not Big Ideas that make me publish a new post but my small Big Data. Recent posts on measurement data analysis or on the differential equation of heat transport are typical of my new editorial policy.

Cartoonist Scott Adams (of Dilbert fame) encourages looking for patterns in one's life, rather than interpreting and theorizing – and being fooled by biases and fallacies. Following this advice and my new policy, I celebrate my 4th blogging anniversary by crunching this blog's numbers.

No, this does not mean I will show off the humbling statistics of views provided by WordPress 🙂 I am rather interested in my own evolution as a blogger. Since raking my virtual Zen garden two years ago I have manually maintained lists of posts in each main category – these are my menu pages. Now I have processed each page's HTML code automatically to count the posts published per month, quarter, or year in each category. All figures in this post are based on all posts excluding reblogs and the current post.

Since I assigned two categories to some posts, I had to pick one primary category to make the height of one column reflect the total posts per month:

Statistics on blog postings: Posts per month in each main category

It seems I had too much time in May 2013. Perhaps I needed creative compensation – indulging in Poetry and pop culture (Web) – as back then I was writing a master's thesis.

I have never missed a single month, but there were two summer breaks in 2012 and 2013 with only one post per month. It seems Life and Web have gradually been replaced by Energy, and there was a flash of IT in 2014, which I attribute to both nostalgia and a professional flashback owing to lots of cryptography-induced deadlines.

But I find it hard to see a trend, and I am not sure about the distortion introduced by picking one primary category.

So I group by quarter instead:

Statistics on blog postings: Posts per quarter in each main category

… which shows that posts per quarter have reached a low right now in Q1 2016, even if I add the current post. Most posts are now based on original calculations or data analysis, which take more time to create than search term poetry or my autobiographical vignettes. But maybe my anecdotes and opinionated posts were just easy to write because I was drawing on 'content' I had had in mind for years before 2012.

In order to spot my 'paradigm shifts' I include duplicates in the next diagram: each post assigned to two categories is counted twice. Since the total number then does not make sense, I just depict relative category counts per quarter:

Statistics on blog postings: Posts per quarter in each category, including the assignment of more than one category.

Ultimate wisdom: Life ends, although Energy is increasing. IT is increasing, too, and was just hidden in the other diagram: recently it is often the secondary category in posts about energy systems' data logging. Physics follows an erratic pattern. Quantum Field Theory was responsible for the maximum at the end of 2013, but was then replaced by thermodynamics.

Web is also somewhat constant, but the list of posts shows that the most recent Web posts are on average more technical and less about Web and Culture and Everything. There are exceptions.

Those trends are also visible in yearly overviews. The Decline Of Web seems to be more pronounced – so I tag this post with Web.

Statistics on blog postings: Posts per year in each main category

Statistics on blog postings: Posts per year in each category, including the assignment of more than one category.

But perhaps I was cheating. Each category was not as stable as the labels in the diagrams' legends imply.

Shortcut categories refer to
1) these category pages: Energy, IT, Life, Physics, Poetry, Web,
2) and these categories: Energy, IT, Life, Physics, Poetry, Web, respectively – manually kept in sync.

So somehow…

public-key-infrastructure became control-and-it

and

on-writing-blogging-and-indulging-in-web-culture is now simply web

… and should maybe be called nerdy-web-stuff-and-software-development.

In summary, I like my statistics as they confirm my hunches, but there is one exception: there was no Poetry in Q1 2016, and I have to do something about this!

________________________________

The Making Of

  • Copy the HTML content of each page with a list to a text editor (I use Notepad2).
  • Find double line breaks (\r\n\r\n) and replace them with a single one (\r\n).
  • Copy the lines to an application that lets you manipulate strings (I use Excel).
  • Tweak strings with formulas / commands to cut out date, URL, title, and comment. Use the HTML tags as markers.
  • Batch-add the page’s category in a new column.
  • Indicate in a new column if this is the primary or secondary category (find duplicates automatically beforehand, so that the value 1 – primary – can be assigned automatically to most posts).
  • Group the list by month, quarter, and year respectively, and add the counts to new data tables that will be used for the diagrams (e.g. via the Excel function COUNTIFS, using only the category, or the category name plus the indicator for the primary category, as criteria – see the example formula below).
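One cell of such a data table could look like this – a minimal sketch assuming hypothetical columns B (publication date), C (category), and D (the primary indicator from the step above), here counting primary Energy posts in Q1 2016:

  =COUNTIFS($C:$C, "Energy", $D:$D, 1, $B:$B, ">=" & DATE(2016,1,1), $B:$B, "<" & DATE(2016,4,1))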

It could be automated even better – without having to maintain category pages at all – by simply using the category feeds (like this: https://elkement.wordpress.com/category/physics/feed) or by filtering the full blog feed for categories. I have re-categorized all my posts so that the categories match the menu page lists, but I chose to use my lists because:

  1. I get not only date and headline, but also my own additional summary / comment that's not part of the feed. For our German blog, I actually do this in reverse: I create the HTML code of a sitemap-style overview page on wordpress.com from an Excel list of all posts plus custom comments and then copy the auto-generated code to the HTML view of the respective menu page on the blog.
  2. The feed provided by WordPress.com can contain at most 150 items, no matter which higher number you try to configure, so you need to start analyzing before you have published 150 posts.
  3. I can never resist creating a tool that manipulates text files and automates something, however weird.

Random Things I Have Learned from My Web Development Project

It’s nearly done (previous episode here).

I have copied all the content from my personal websites, painstakingly disentangling snippets of different ‘posts’ that were physically contained in the same ‘web page’, re-assigning existing images to them, adding tags, consolidating information that was stored in different places. Raking the Virtual Zen Garden – again.

New website: A 'post.'

Draft of the layout, showing a 'post'. Left and right panes vanish in responsive fashion if the screen gets too small.

… Nothing you have not seen in more elaborate fashion elsewhere. For me the pleasure is in creating the whole thing bottom-up, without using existing frameworks, content management systems, or templates – requiring only an FTP client and a text editor.

I spent a lot of time on designing my redirect strategy. For historical reasons, all my sites use the same virtual web server. Different sites are separated just by different virtual directories. So in order to display the e-stangl.at content as one stand-alone website, a viewer accessing e-stangl.at is redirected to e-stangl.at/e/. This means that entering [personal.at]/[business] would result in showing the business content at the personal URL. In order to prevent this, the main page generation script checks the virtual directory and redirects 'bottom-up' to [business.at]/[business].
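In essence this check compares the requested host name with the requested virtual directory – a minimal classic ASP sketch with placeholder names, not the actual script:

  <%
  ' Minimal sketch: send visitors to the canonical domain of a virtual directory.
  ' Host and directory names are placeholders.
  Dim host, path
  host = LCase(Request.ServerVariables("HTTP_HOST"))
  path = LCase(Request.ServerVariables("URL"))
  If InStr(path, "/business/") = 1 And InStr(host, "business.at") = 0 Then
      Response.Redirect "http://business.at" & path
  End If
  %>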

In the future, I am going to use a new host name for my website. In addition, I want to have the option to migrate only some applications while keeping the others tied to the old ASP scripts temporarily. This means more redirect logic, especially as I want to test all the redirects. I have a non-public test site on the same server, but I had never tested redirects, as that means creating loads of test host names. Due to the complexity of the redirects to come, I added names like wwwdummy for every domain, redirecting to my new main test host name, in the same way the www URLs will redirect to my new public host name.

And, lest we forget, I am obsessed with keeping old URLs working. I don't like it when websites are migrated to a new content management system and all the URLs change. As I mentioned before, I already use ASP.NET Routing to get nice URLs on the new site: a request for /en/2014/10/29/some-post-title does not access a physical folder; instead, the 'flat-file database engine' I wrote from scratch searches for the proper content text file based on a SQL string handed to it, retrieves attributes from both file name and file content, and displays the HTML content and attributes like title and thumbnail image properly.

New website: Flat-file database.

Flat-file database: Two folders, 'pages' and 'posts'. Post file names include creation date, short relative URL, and category. The files use the ascx extension (actually meant for .NET 'user controls'), as the web server will not return these files directly but responds with a 404 – no need to tweak permissions.

The top menu, the tag cloud, the yearly/monthly/daily archives, the list of posts on the Home page, the XML RSS feed, and the XML sitemap are also created by querying this set of files.

New web site: File / database entry

File representing a post: Upper half – meta tags and attributes, lower half – after attribute ‘content’: Actual content in plain HTML.

Now I want to redirect from the old .asp files (to be deleted from the server at some point in the future) to these nice URLs. My preferred solution for this class of redirects is using a rewrite map hard-coded in the web server’s config file. From my spreadsheet documentation of the 1:n relation of old ASP pages to new ‘posts’ I have automatically created the XML tags to be inserted in the ‘rewrite map’.
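The resulting fragment of the web server's config file looks roughly like this – a sketch using the IIS URL Rewrite module's rewrite map, with a placeholder old path rather than my actual entries:

  <rewrite>
    <rewriteMaps>
      <rewriteMap name="OldAspToPosts">
        <!-- one generated entry per old ASP page (placeholder shown here) -->
        <add key="/old-folder/some-old-page.asp" value="/en/2014/10/29/some-post-title" />
      </rewriteMap>
    </rewriteMaps>
    <rules>
      <rule name="Redirect old ASP pages" stopProcessing="true">
        <match url=".*" />
        <conditions>
          <add input="{OldAspToPosts:{REQUEST_URI}}" pattern="(.+)" />
        </conditions>
        <action type="Redirect" url="{C:1}" redirectType="Permanent" />
      </rule>
    </rules>
  </rewrite>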

Now that the boring part is over and I have scared everybody off (but just in case, you can find more technical information on the last update on the English versions of all websites, e.g. here) …

… I come up with my grand insights, in click-bait X-Things-You-Need-To-Know-About-Something-You-Should-Not-Do-and-Could-Not-Care-Less style:

It is sometimes painful to read really old content, like articles, manifestos and speeches from the last century. Yet I don’t hide or change anything.

After all, this is perhaps the point of such a website. I did not go online for the interaction (of social networks, clicks, likes, comments). Putting your thoughts out there, on the internet that never forgets, is like publishing a book you cannot un-publish. It is about holding yourself accountable and aiming at self-consistency.

I am not a visual person. If I were more courageous I'd use plain Courier New without formatting or images. Just for the fun of it, I tested adding dedicated images to each post and creating thumbnails from them – and I admit it adds to the content. Disturbing, that is!

I truly love software development. After a day of 'professional' software development (simulations re physics and engineering) I am still happy to plunge into this personal web development project. I realized programming is one of the few occupations that has been part of every job I ever had. Years ago, soul-searching and preparing for the next career change, I figured instead that the main common feature was teaching and know-how transfer – workshops and academic lectures etc. But I am relieved I gave that up; perhaps I just tried to live up to the expected ideal of the techie who will finally turn to a more managerial or at least 'social' role.

You can always find perfect rationales for irrational projects: our web server had been hacked last year (ASP pages with spammy links put into some folders), and from backlinks in the network of spammy links I concluded that classic ASP pages had been targeted. My web server was then hosted on Windows Server 2003, at that time still fully supported. I made use of Parent Paths (../ relative URLs), which might have eased the hack. Now I am migrating to ASP.NET with the goal of turning off classic ASP completely, and I have already got rid of the Parent Paths requirement by editing the existing pages.

This website and my obsession with keeping the old stuff intact reflect my appreciation of The Existing and of Being Creative With What You Have. Re-using my old images and articles feels like re-using our cellar as a water tank. Both are passions I might not share with too many people.

My websites had been an experiment in compartmentalizing my thinking and writing – 'Personal', 'Science', 'Weird'; at the very beginning the latter two were authored pseudonymously – briefly. My wordpress.com blog has been one quick shot at a Grand Unified Theory of my Blogging, and I could not prevent my personal websites from becoming more and more intertwined, too, in the past years. So finally both reflect my reluctance to separate my personal and professional self.

My website is self-indulgent – in content and in meta-content. I realize that the technical features I have added are exactly what I need to browse my own stuff for myself, not necessarily what readers might expect or what is considered standard practice. One example is my preference for a three-pane design, and for that infinite (no dropdown-menu) archive.

New website: Category page.

Nothing slows a website down like social media integration. My text file management is for sure not the epitome of efficient programming, but I was flabbergasted by how fast it was to display nearly 150 posts at once – compared to the endless back and forth of questionable stuff between social networks, tracking, and ad sites (watch the status bar!).

However, this gives me some ideas about the purpose of this blog versus the purpose of my website. Here, on the WordPress.com blog, I feel more challenged to write self-contained, complete, edited, shareable (?) articles – often based on extensive research and consolidation of our original(*) data (OK, there are exceptions, such as this post) – whereas the personal website is more of a container for drafts and personal announcements. This also explains why the technical sections of my personal websites contain collections of links rather than full articles.

(*)Which is why I totally use my subversive sense of humour and turn into a nitpicking furious submitter of copyright complaints if somebody steals my articles published here, on the blog. However, I wonder how I’d react if somebody infringed my rights as the ‘web artist’ featured on subversiv.at.

For 15 years I have spent a lot of time on (re-)organizing and categorizing my content. This blog has also been part of this initiative. That re-organization is what I like websites and blogs for – a place to play with structure and content, and their relationship. Again, doing this in public makes me hold myself accountable. Categories are weird – I believe they can only be done right with hindsight. Now all my websites, blogs, and social media profiles eventually use the same categories, which have evolved naturally and are very unlike what I might have planned 'theoretically'.

Structure should be light-weight. I started my websites with the idea of first- and second-level 'menus' and hardly any emphasis on time stamps. But your own persona and your ideas seem to be moving targets. I started commenting on my old articles, correcting or amending what I said (as I don't delete, see above). subversiv.at has been my Art-from-the-Scrapyard-Weird-Experiments playground, before and in addition to the Art category here, and over there I enjoyed commenting in English on German articles and vice versa. But the Temporal Structure, the Arrow of Time, was stronger; so I finally made the structure more blog-like.

Curated lists … were most often just 'posts'. I started collecting links, like resources for specific topics or my own posts written elsewhere, but after some time I did not consider them so useful anymore. Perhaps somebody noticed that I have mothballed and hidden my Reading list and Physics Resources here (the latter moved to my 'science site' radices.net – the URLs do still work, of course). Again: The arrow of time wins!

I loved and I cursed the bilingual nature of all my sites. Cursed, because the old structure made it too obvious when the counterpart in the other language was 'missing'; so it felt like a translation assignment. However, I don't like translations. I am actually not even capable of really translating the spirit of my own posts. Sometimes I feel like writing in English, sometimes I feel like writing in German. Some days or weeks or months later I feel like reflecting on the same ideas, using the other language. Now I have come up with a loose connection between an English and a German article, referencing each other via a meta attribute, which results in an unobtrusive URL pointing to the other version.

Quantitative analysis helps to correct distorted views. I thought I wrote 'so much'. But the tangle of posts and pages in the old sites obscured the fact that the content actually translates to only 138 posts in German and 78 in English. I wrote in bursts, typically immediately before and after an important change, and the first main burst in 2004/2005 was German-only. I think the numbers would have been higher had I given up on the menu-based approach earlier and written a new, updated 'post' instead of adding infinitesimal amendments to the existing pseudo-static pages.

Analysing my own process of analysing puts me into this detached mode of thinking. I have shielded myself from social media timelines in the past weeks and tinkered with articles, content written long before somebody could have ‘shared’ it. I feel that it motivates me again to not care about things like word count (too long), target groups (weird mixture of armchair web psychology and technical content), and shareability.

Finally Mobile-Friendly! (How I Made Googlebot Happy)

Not this blog of course – it had been responsive already.

But I gave in to Google's nagging and no longer ignored the messages in Google Webmaster Tools. All my home-grown websites had a fixed-width content pane and a fixed left sidebar. On a mobile device you only saw the upper left corner – showing the sidebar and only part of the content pane.

Learning about a major Google update implemented last week, I spent one night coding until the test went fine for our business website

punktwissen website, Google's test for mobile friendliness

… and for my/our other sites subversiv.at, radices.net, e-stangl.at, and z-village.net. I keep one non-responsive page: epsi.name.

This is not a guide to perfect responsive design: I am not a professional web developer, and I don't claim my CSS or HTML code is flawless, elegant, or processed correctly by all browsers in the world. I read this tutorial and this guide, and they provided me with clues to answer my main question:

What is the bare minimum to make a classical website
mobile-friendly according to Google’s requirements?

Failing the test also does not necessarily mean a website is extremely difficult to read on a mobile device. There is a famous website that doesn't meet Google's standards although its content pane fits nicely into the width of a smartphone – if you turn it by 90° and scroll to the right … which Googlebot will not do.

In summary I did the following:

Prerequisites: Use only CSS for formatting; in particular, define the layout by containers referred to in the stylesheet. Fortunately I made that move long ago.

1) Set a viewport meta tag, which tells the device to adapt the visible content to the width of the screen. Even if the width of the content is not fixed in a desktop browser, it is not automatically rendered correctly on mobile devices without the viewport tag. Actually, I was wrong in assuming that plain, old-school, hardly formatted HTML text of variable width is mobile-friendly by default. In this case the content adapts to the width of the device, but Google rightly complains about text that is too small and links that are too close together – in addition to the missing viewport tag.

I had been intimidated by the 'small text / links too close' errors some time ago and figured I had to re-do all navigation elements. But after adding the viewport tag, the 'only' thing left was to make the content break or flow so that it would not be larger than the screen width. Text size and links were fine without any change to font size or to the width / height of the containers for navigation links.
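The tag itself is a single line in the HTML head:

  <meta name="viewport" content="width=device-width, initial-scale=1">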

2) Add at least one media query to my CSS stylesheets in order to make the left sidebar vanish or move if the width of the screen in pixels is smaller than a certain size. I tested with an Android device and with Google's tool – but mainly I was squeezing the window on a desktop PC to very small widths. For the business website I decided the sidebar is nice-to-have, as it just shows recent blog posts – the same approach as used by my current WordPress template. For some other sites it was an essential navigation pane, so I let it move to the top.
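A minimal sketch of such a media query – the 640px breakpoint and the container names are placeholders, not my actual stylesheet:

  /* Below the threshold width: hide the sidebar (or move it to the top)
     and let the content container use the full width. */
  @media screen and (max-width: 640px) {
      #sidebar { display: none; }
      #content { width: auto; margin-left: 0; }
  }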

3) Make sure that all containers and images on a page resize or flow accordingly by making their styles change at the threshold width or continuously – this meant cross-checking the styles of all containers that define the layout and changing / adding style definitions depending on the screen width. I made the images resizable, and text displayed next to images should flow under them below a certain width.
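For the images, a generic rule like the following does most of the work (again just a sketch, not my actual stylesheet):

  img { max-width: 100%; height: auto; }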

About 14,5 Random Thoughts on Blogging and Social Media

I have been blogging on WordPress.com for nearly three years, and I have noted the following:

Blogs have a half-life. Many decay after 2 years. Blogs I had followed have been deleted, or bloggers have suddenly stopped publishing without notice.

There are tons of single-post blogs. A user-friendly editor motivates people to get started. But blogging does not take more time than HTML editing. We need time for composition, not for typing.

An important change in personal or professional life often triggers the launch of a new blog. Once the change has been mastered successfully, the well might run dry.

You can write the articles you want to write, or you can write what you want to read. Perhaps many hobbyist authors go from the former, introspective-therapeutic stage to the latter.

Bloggers running blogs of the same age flock together in groups. Groups consist of fewer than 10 people; everybody reads and comments on the others' blogs regularly.

WordPress.com is both publishing platform and social network, and it works well because nearly every user is both contributor and commentator.

Nearly all social media have done away with nested discussion threads, and only the first few lines of comments are visible unless you click More. Will WordPress follow suit?

It is hard to resist popular topics, and the hype might not be obvious. Who knew that all things quantum would enthrall the masses?

At the beginning there was the classical website; then there was the blog – configurable to serve any purpose. Now there is a specific platform for images, for long-form texts, and whatnot.

Optimization for mobile devices can make sites harder to read on PCs. There is no such thing as the integrity of individual web pages anymore.

Web-logging the diary way messes up structure and categories. But on static WordPress pages organized via nested menus I always look for that signature date information.

Social media have fundamentally recalibrated communications; we go asynchronous. A synchronous phone call feels like an intrusion unless it is life-altering.

Blogging and social media have revived the art of rhetoric, and I learned a new word: humblebragging.

Our online repositories are like the human brain: content needs to be alive – revisited, rearranged, and curated all the time – to be useful.

You ought to add an image.

The View, 2015-02

Looking for Patterns

Scott Adams, of Dilbert fame, has a lot of useful advice in his autobiographical book How to Fail at Almost Everything and Still Win Big. He recommends looking for patterns in your life, without attempting to theorize about cause and effect. Learning from those patterns, you can increase the chance that luck will hit you. I believe in increasing your options, so I can relate a lot to applying this approach to Life, the Universe and Everything.

It should be true for the iconic example of patterns, that is: web traffic. In this post I'll try to briefly summarize what I have learned so far from recent unfortunate events (this is PR speak for disaster). I have long been intrigued by web statistics, web servers' log files, and the summaries shown by the free Google and Bing Webmaster Tools, but I started to follow the trends more closely after my other, non-WordPress web server was hacked at the end of November.

How do you recognize that your site has been hacked?

This is very different from what you might expect from popular lore and movies. I downloaded the log files for my web server from time to time, and I just noticed that suddenly the daily files were about twice their usual size. Inspecting the IP addresses the traffic to my site came from, I spotted a lot of hits by Googlebot. Sites are indexed all the time, but I was baffled by the URLs – all pointing to pages that should not exist on my server. These URLs contained a long query string with all kinds of brand names, as you know them from spam comments or e-mails.

This is an example line in the log file:

Spammy page on hacked web server, accessed by Google bot

This IP address belongs to a *.googlebot.com machine, as can be confirmed by resolving the name, e.g. using nslookup. The worrying fact was the status code 200, which means the page had indeed been there.
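The crawler's identity can be double-checked in both directions – a sketch with placeholders rather than the actual addresses from my log:

  nslookup <IP address from the log>     reverse lookup: expect a *.googlebot.com host name
  nslookup <the returned host name>      forward lookup: expect the same IP address again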

A few days later this had changed to a 404, so the page did not exist anymore:

Spammy page removed from hacked web server, Google bot tries to access it.

The attack had happened on the weekend, and the pages were removed immediately by my hosting provider.

To cross-check if those pages had indeed been indexed by Google, I searched for site:[domain name]. This is a snippet from the search results – the spammers even borrowed the tag line of our legitimate site as a description (which I cropped from the screenshot here).

Spammy page in Google's index

Overall these were just a bunch of different pages (ASP files), but Google recognizes every different query string, appended after the question mark, as a different URL. So suddenly Google had a lot more URLs to index, and you could see a spike in Webmaster Tools:

Crawl stats after hack

There was also a warning message on the welcome page:

Google warning message about 404 errors

What to do?

Obviously the first thing is to delete the spammy pages and deal with whatever vulnerability had been exploited. This was done before I noticed the hack myself. But I am still in clean-up mode to get the spammy pages removed from Google’s index:

robots.txt. Using the site:[domain name] search I identified all the spammy pages and added them to the robots.txt file on my server. This file tells search engines which pages they should not crawl. Fortunately you do not have to add each individual URL – adding the page (ending in .asp in this case) is sufficient.
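The corresponding entries are short – a sketch with placeholder file names; one Disallow line per spammy page is enough, since the path prefix also covers every query-string variant of that page:

  User-agent: *
  Disallow: /spammy-page-1.asp
  Disallow: /spammy-page-2.asp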

But pages were still in the index after that, just the description was changed to:
A description for this result is not available because of this site’s robots.txt.

As far as I can tell, entries are still added to the index if somebody else links to your pages (actually, spammy pages on other hacked servers, see root cause analysis below). But as Google is not allowed to investigate the target as per robots.txt, it only adds the link without a description.

URL parameters. Since the spammy pages all use query strings and all strings have the same parameter – [page].asp?dca= in my case – I tried managing the URL parameters via Webmaster Tools. This is actually an option to let Google know if a query string really denotes another version of a page or if all query strings for one page should be indexed as a single page. E.g. I am using a query string called imgClicked to magnify an image here – when clicking on the top image – and I could tell Google that the clicked / unclicked image should not be counted as two different URLs.

In the special case of the spammy pages I tried to tell Google that different dca values don’t make for a separate page (which would result in about 6 spammy URLs in the index instead of 1500) but this did not impact the gradual accumulation of indexed spammy pages.

Mind-numbing work. To get rid of all pages as fast as possible I also removed each. of. them. manually. via Google Webmaster Tools. This means:

  • Click on the URL from the search results, opening a new tab. This results in a 404.
  • Copy the URL from the address bar to Webmaster Tools, into the form for removing the URL.
  • Click submit.
  • Repeat 1500 times.

I am now at about 500. Not all spammy pages that ever existed are displayed in the index at once, but about 10 are added every day. Where do they come from after the original pages have been deleted?

How was this hack actually supposed to work?

The legitimate pages had not been changed or vandalized; the hacker-spammers just placed additional pages on the server. I would never have noticed them had I not encountered Google's indexing activities.

I was curious what those pages had looked like, so I inspected Google's cache by searching for cache:[spammy URL]. The cached page consisted of:

  • Your typical junk of spammy text, otherwise I would be delighted about raw material for poetry.
  • A list of links to other spammy pages, most of them on my hacked server.
  • An exact copy of the default page of this (legitimate) web site.

I haven't investigated all of those more than 1000 pages and the spammy links displayed on them, but I conjectured there had to be some outbound links to other – hacked – servers. Links will only be boosted if there are backlinks from seemingly independent web sites. Somehow this should make people buy something in a shady webshop at the end of a cascade of links.

After some weeks I was able to confirm this, as Google Webmaster Tools now shows external backlinks to my domain from other spammy pages on legitimate sites, mostly small businesses in the US. Many of them used the same provider, which obviously had been hacked as well.

This explains where the gradual supply of spammy links to the index comes from: Google has followed the spammy links from the other hacked servers inbound to my server. It seems to take a while to clean this out, as all the other webmasters have removed their pages as well – I checked each. of. them. from the long list supplied by Google as a CSV file.

Had I not been hacked, I might never have become aware of the completely unrelated onslaught by Google itself, targeted at this blog. I reported on this in detail previously; here is just an update and a summary.

Edit, as from the comments I conclude this was not clear: The following analysis is unrelated to the hack of the non-WordPress site – the hacked site has not been penalized by Google so far. But the blog you are reading right now has.

Symptoms of your site having been penalized by a search engine

Rapid decline of impressions. Webmaster Tools shows a period of 3 months at maximum. I have checked the trend for all my sites now and then, but there was actually never anything that constituted a real trend. But for this blog, page impressions went from a few hundred – often more than 1000 – per day this summer to less than 10 per day now.

Page impressions Sept to Dec

Page impressions have stayed at their all-time low since last time, so just extend that graph to the right.

Comparison with sites that should rank much lower. Currently this blog gets as many or as few impressions as my personal website e-stangl.at. Its Google PageRank is 1 – as compared to 3 for the WordPress blog; I update it only once a quarter at most, and its word count is perhaps a thousandth of this blog's.

My other two sites subversiv.at and radices.net score better, although I update them only about once every 6 weeks, and I am pretty sure I violate best practices due to my creative mixing of languages, commenting on my own stuff, and/or curating enormous lists of outbound links.

It is ironic that Google has penalized this blog now, as since autumn 2014 my quality control has become more ruthless. I had quite a number of posts in Drafts, with more than 1000 words each, edited, and spell-checked – and finally deleted all of them. The remaining posts were the ones requiring considerable research, plus my poetry. This spam poem is one of my most popular posts as per Google's page impressions. So all theorizing is really futile, and I had better watch the pattern emerge.

Identifying offending pages. I added an update to the previous post as I spotted the offending pages using the following method:

  • Identify your top performing pages by ranking pages in the list of search results by impressions or clicks.
  • Then order pages in the list of search results by page name. This is effectively ranking by date for blogs, and the list can be compared to the archive of all pages.
  • Make the time span covered by the Google tools smaller and smaller and check if one of your former top pages suddenly vanishes from the list.

In my case these pages were:

  • A review of a new, somewhat unconventional textbook on quantum field theory, and
  • a list of physics books, blogs and websites.

As Michelle correctly pointed out, this does not mean that the page has been deleted from the index – as you can confirm by searching for site:[Offending URL] explicitly or by adding a more specific search criterion, like adding elkement. I found that the results displayed for my offending pages are erratic: sometimes, surprisingly, the page will still show up if I just use the title of the post – perhaps a consequence of me, the owner of the site, being logged on to Google. Sometimes I need to add an additional keyword to move it to the top of the search results again.

But anyway, even if the pages had not been deleted, they had been pushed back to search results page >10.

Something had been deleted from the index though. Here is the number of indexed pages over time, showing a decline starting at the time impressions were plummeting, too:

Pages indexed by Google for this blog, as per the writing of this post

I cannot see a similar effect for any of the other sites, and as far as I know it does not correlate with some Google update (Google has indicated a major update in March 2014 in the figure).

Find the root cause. Except for links on my own sites and links on other blogs, my blog has no backlinks. As I learned in this research, backlinks from forums are often tagged nofollow so that search engines do not consider them spammy. This means links from your avatar commenting on other pages might not boost your blog, but they might not hurt either.
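For reference, a nofollow-ed backlink is just a normal link with one extra attribute – roughly what forum software adds to links in comments and signatures:

  <a href="https://elkement.wordpress.com/" rel="nofollow">commenter's blog</a>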

The only 'worthy' backlink was from the page dedicated to the book I had reviewed – and that page linked exactly to the offending pages. My blog and the author's page may look to Google like the tangle of cross-linked spammy pages the hackers had misused my other web server for.

Do something about it? Conclusion? I replaced some of my links to the author's site with a link to the book's page on amazon.com. I moved one of the offending pages, the physics link list, over to radices.net – as I had planned to do for quite a while in my eternal quest for tidy, consistent web sites. The page is still available on this blog, but not visible in the menu anymore.

But I will not ask the author to remove a valid backlink or remove my innocuous post – it seems like succumbing to the rules of a silly game.

What I learned from this episode is that one single page – perhaps one you don’t even consider important on the grand scale of things and your blog in particular – can boost a blog or drag it down. Which pages are the chosen ones is beyond unpredictable.

Ending on a more positive note: I currently see the boost effect on our German blog, as we indulge in writing about the configuration of this gadget, the programmable control unit we use with our heat pump system. The device is very popular among ambitious DIY enthusiasts, and readers are obviously searching for it.

Programmable control unit

We are often linking to the vendor’s business page and manuals. I hope they will never link back to us.

I will just keep watching the patterns and reporting on my encounters. One of the next enigmas to be resolved: Why is the number of Google searches in my WordPress Stats much higher than the number of page impressions in Google Tools for that day, let alone clicks in Google Tools?

Update 2015-01-23: The answer was embarrassingly simple, and all my paranoia had been misguided. WordPress has migrated their hosted blogs to HTTPS only. All my traffic was hiding in the statistics for the HTTPS version, which has to be added in Google Webmaster Tools as a separate website.

The More Content You Have Created

… the more time you need for curating.

My first ever attempt at tweeting an aphorism. But it is true for me, and it defines the way I use online spaces.

As a contributor of online content, I am operating in different modes:

  • Creator, with emphasis on creating something original – including unintended re-invention of already existing things I had not googled.
  • Researcher, when cross-checking sources or doing calculations.
  • Commentator on the blogs of others.
  • Curator. It is this role I want to dwell upon now.

I started writing online by editing static web pages, and this still determines my netizen's philosophy. Creating content was always playing with structure versus content, and playing with how to present content in a way that is useful to me – and maybe to others, too.

You cannot pre-define categories, tags, and other structure upfront in my opinion, but you have to revisit them regularly. Social media like Facebook, Google+, or Twitter are primarily determined by the timeline, without giving the user a chance to provide a more timeless structure. Actually the user cannot influence the layout at all. That’s why I consider them secondary channels. Originally I had planned to organize resources useful to me on such sites – but I don’t want to full-text search my own posts or edit hashtags.

What I prefer is what search engines might penalize me for: long, hand-crafted lists of links. A few months ago I resumed posting to technical IT security forums. These forums provide automatically compiled lists of my threads, my activity, etc. Yet I still compile lists of my threads on one of my sites, whose domain name accidentally makes for an insider pun: radices, which means roots. I violate database best practices by organizing the same content in redundant fashion – by date or by topic. It is not the final list that is so terribly useful to me, but the act of revisiting all the content, struggling with categorizing, and adding summaries.

I made it a rule to add only links that I had already used to my collections, as I believe the ease of arranging links and downloading documents makes it too simple to just collect and hoard – without actually reading or using. The internet is not to blame here: in the old times, at the university, I was collecting and curating scientific papers – the collecting being a result of my monthly browsing of interesting journals, and it was often more extensive than needed.

This year I have created new major categories for this blog by laying out the site map – the pages making up the main menu. It dawned on me that this is more than a navigation menu. It reflects what is important to me, and that I don't care anymore about explaining how all this goes together. I created these pages at about the same time I stopped trying to explain why my professional playgrounds would complement each other so nicely. I rather say: I work on A and B, as odd as this combination seems. Combining Anything – I Mean It.

I do love playing with different platforms, and I do maintain all of my non-blog websites. My most innovative experiments in Google-Translate-assisted poetry go unnoticed as I published them only on my subversive website.

The reason I have this Anything blog is that I tried to keep things separated originally, but finally all of my sites cover all of my topics anyway. Curation is what makes me aware of it.

But in principle I do know that it makes sense to stick to a topic. On our German blog we focus more on heat pumps; there is a lot of technical information, and we keep to a specific narrative style. I am at a loss for how to explain this. We call ourselves the Two Fearless Settlers and Professional Tinkerers who tell stories about renewable energies. We use synonyms for places that are reminiscent of fairy tales or slightly old-fashioned newspapers. Ironically, there is a game called Settlers Online whose style might quite match our blog, and we only learned about this via search terms on our blog (which have also been turned into search term poetry). Clients find the site both informative and entertaining – some of the unexpected positive feedback left me speechless.

On this blog here the navel-gazing ramblings may still outweigh light entertainment or practically useful information. I am not sure if I want to change this – I am just parsing my content and recognizing this. The more I create site maps and categories, the less I use them to plan for future content. It is more about coming to a halt and contemplating. I have also decided that I will go on a literary diet until the end of 2014 – so I will not amend my Reading List in the next weeks.

I feel like I have reached a goal I hadn’t defined before.

Vineyards