My Flat-File Database

A brief update on my web programming project.

I prefer to create online text by editing simple text files, so the only management tools I need are a text editor and an FTP client. My ‘old’ personal and business web pages are currently created dynamically in the following way:
[Code for including a script (including other scripts)]
[Content of the article in plain HTML = inner HTML of content div]
[Code for writing footer]

The main script(s) create layout containers, meta tags, navigation menus etc.
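For illustration, a page of the ‘old’ kind might look roughly like this – the include path and the footer routine are placeholders, not my actual file names:

<!-- heatpump.asp : minimal sketch of an 'old' page -->
<!--#include virtual="/includes/main.asp"-->      <!-- writes <head>, layout containers, menus -->
<div id="content">
  <p>Content of the article in plain HTML.</p>
</div>
<% WriteFooter %>                                 <!-- Sub defined in the included script -->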

Meta information about pages or about the whole site is kept in CSV text files. There are, e.g., files with tables…

  • … listing all pages of each site and their attributes – like title, keywords, hover texts for navigation links – or
  • … tabulating all main properties of all web sites – such as ‘tag lines’ or the name of the CSS file.

A bunch of CSV files / tables can be accessed like a database by defining the columns in a schema.ini file, and using a text driver (on my Windows web server). I am running SQL queries against these text files, and it would be simple to migrate my CSV files to a grown-up database. But I tacked on RSS feeds later; these XML files are hand-crafted and basically a parallel ‘database’.
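For illustration – a schema.ini entry for one of those CSV ‘tables’ and a query against it might look roughly like the following; folder, file, and column names are made-up placeholders (the text driver expects schema.ini in the same folder as the CSV files):

[pages.csv]
ColNameHeader=True
Format=CSVDelimited
Col1=url Text
Col2=title Text
Col3=keywords Text
Col4=hovertext Text

' VB.NET sketch: querying the CSV 'table' via the Microsoft Text Driver (ODBC)
Imports System.Data.Odbc

Module CsvQueryDemo
    Sub Main()
        Dim cs As String = "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=C:\site\meta\;"
        Using cn As New OdbcConnection(cs)
            cn.Open()
            Using cmd As New OdbcCommand("SELECT title, url FROM pages.csv", cn)
                Using rdr As OdbcDataReader = cmd.ExecuteReader()
                    While rdr.Read()
                        Console.WriteLine(rdr("title").ToString() & " -> " & rdr("url").ToString())
                    End While
                End Using
            End Using
        End Using
    End Sub
End Module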

This CSV file database is not yet what I mean by flat-file database: In my new site the content of a typical ‘article file’ should be plain text, free from code. All meta information will be included in each file, instead of being kept in separate CSV files. A typical file would look like this:

title: Some really catchy title
headline: Some equally catchy, but a bit longer headline
date_created: 2015-09-15 11:42
date_changed: 2015-09-15 11:45
author: elkement
[more properties and meta tags]
content:
Text in plain HTML.

The logic for creating formatted pages with header, footer, menus etc. has to be contained in code separate from these files, and the text files need to be parsed for meta data and content. The set of files has effectively become ‘the database’, the plain text content being just one of many attributes of a page. Folder structure and file naming conventions are part of the ‘database logic’.

I figured this was all an unprofessional hack until I found many so-called flat-file / database-less content management systems on the internet, intended for smaller sites. They comprise some folders with text files, named according to a pre-defined schema, plus parsing code that extracts meta data from the files’ contents.

Motivated by that find, I created the following structure in VB.NET from scratch:

  • Retrieving a set of text files based on a search criteria from the file system – e.g. for creating the menu from all pages, or for searching for one specific file that should represent the current page – current as per the URL the user entered.
  • Code for parsing a text file for lines having a [name]: [value] structure (a sketch follows this list).
  • Processing the ‘nice’ URL entered by the user to make the web server pick the correct text file.
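A minimal sketch of the parsing part, assuming the file layout shown above – the function name and the Dictionary-based return value are ad-hoc choices for this illustration:

' VB.NET: read one flat file and split it into [name]: [value] properties plus the content block
Imports System.IO
Imports System.Collections.Generic

Public Module ArticleParser
    Public Function ParseArticleFile(path As String) As Dictionary(Of String, String)
        Dim props As New Dictionary(Of String, String)
        Dim contentLines As New List(Of String)
        Dim inContent As Boolean = False
        For Each line As String In File.ReadAllLines(path)
            If inContent Then
                contentLines.Add(line)                  ' everything after 'content:' is the article
            ElseIf line.Trim().ToLower() = "content:" Then
                inContent = True
            Else
                Dim pos As Integer = line.IndexOf(":"c) ' a [name]: [value] line
                If pos > 0 Then
                    props(line.Substring(0, pos).Trim()) = line.Substring(pos + 1).Trim()
                End If
            End If
        Next
        props("content") = String.Join(Environment.NewLine, contentLines)
        Return props
    End Function
End Module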

Speaking of URLs, so-called ASP.NET Routing came in handy: Before, I had used a few folders whose default page redirects to an existing page (such as /heatpump/ redirecting to /somefolder/heatpump.asp); otherwise my URLs all corresponded to existing single files.

I use a typical blogging platform’s schema with the new site: If a user enters

/en/2015/09/15/some-cool-article/

the server accesses a text file whose name contains the language, the date, and the title, such as:

2015-09-15_en_some-cool-article.txt

… and displays the content at the nice URL.
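A rough sketch of how this can be wired up with ASP.NET routing – the route pattern, the page name article.aspx, and the format string are assumptions for illustration, not necessarily what the real site uses:

' VB.NET, Global.asax: register a route for the 'nice' blog URLs
Imports System.Web.Routing

Sub Application_Start(sender As Object, e As EventArgs)
    RouteTable.Routes.MapPageRoute("post", "{lang}/{year}/{month}/{day}/{slug}", "~/article.aspx")
End Sub

' VB.NET, article.aspx code-behind: translate the route values into the file name
Protected Sub Page_Load(sender As Object, e As EventArgs) Handles Me.Load
    Dim v = Page.RouteData.Values
    Dim fileName As String = String.Format("{0}-{1}-{2}_{3}_{4}.txt", _
        v("year"), v("month"), v("day"), v("lang"), v("slug"))
    ' /en/2015/09/15/some-cool-article/  ->  2015-09-15_en_some-cool-article.txt
End Sub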

‘Language’ is part of the URL: If a user with a German browser explicitly accesses a URL starting with /en/, the language is effectively set to English. However, if the main page is hit, I detect the language from the header sent by the client.
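A minimal sketch of that fallback, using the Accept-Language header as ASP.NET exposes it – the default language and the variable names are assumed:

' VB.NET: no /en/ or /de/ in the URL, so look at the browser's Accept-Language header
Dim lang As String = "de"                         ' assumed site default
Dim accepted As String() = Request.UserLanguages  ' languages sent by the client, best match first
If accepted IsNot Nothing AndAlso accepted.Length > 0 _
   AndAlso accepted(0).StartsWith("en", StringComparison.OrdinalIgnoreCase) Then
    lang = "en"
End If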

I am not overly original: I use two categories of content – posts and pages – corresponding to text files organized in two different folders in the file system, and following different conventions for file names. Learning from my experience with hand-crafted menu pages in this blog here, I added:

  • A summary text included in the file, to be displayed in a list of posts per category.
  • A list of posts in a single category, displayed on the category / menu page.

The category is assigned to the post simply as part of the file name; moving a post to another category is done by renaming it.

Since I found that having to add my Google+ posts to just a single Collection was a nice exercise, I deliberately limit myself to one category per post.

Having built all the required search patterns and functions – for creating lists of posts, menus, or recent posts, or for extracting information from specific pages such as the current page or its counterpart in the other language – I realized that I needed a clear-cut separation between the high-level query for a bunch of attributes of any set of files meeting some criteria, and the lower level doing the search, file retrieval, and parsing.

So why not use genuine SQL commands at the top level – to be translated into file searches and file content parsing on the lower level?

I envisaged building the menu of all pages e.g. by executing something like

SELECT title, url, headline FROM pages WHERE isMenu=TRUE

and creating the list of recent posts on the home page by running

SELECT * FROM posts WHERE date_created < [some date]

This would also allow for a smooth migration to an actual relational database system if the performance of the file-based database should turn out not to be that great after all.

I underestimated the effort of ‘building your own database engine’, but the main logic is finally done. My file system recordset class has this functionality (and I think I finally got the hang of classes and objects) – a condensed sketch follows the list:

  • Parse a SQL string to check if it is well-formed.
  • Split it into pieces and translate pieces to names of tables (from FROM) and list of fields (from SELECT and WHERE).
  • For each field, check (against my schema) if the field should be encoded in the file’s name or if it was part of the name / value attributes in the file contents.
  • Build a file search pattern string with * at the right places from the file name attributes.
  • Get the list of files meeting this part of the WHERE criteria.
  • Parse the contents of each file and exclude those not meeting the ‘content fields’ criteria specified in the WHERE clause.
  • Stuff all attributes specified in the SELECT statement into a table-like structure (a DataTable in .NET) and return a recordset object that can be queried and handled like recordsets returned by standard database queries – that is: check for End Of File, MoveNext, or return the value of a specific cell in a column with a specific name.
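To give an idea of how these steps fit together, here is a heavily condensed sketch of such a recordset class. Class, table, and field names are invented for this illustration, only a single optional WHERE condition on the file-name field ‘lang’ is handled, and the real code covers more of the schema:

' VB.NET: condensed sketch - translate a simple SELECT into a file search plus content parsing
Imports System.Data
Imports System.IO
Imports System.Linq
Imports System.Text.RegularExpressions

Public Class FileRecordset
    Public Function Query(sql As String, folder As String) As DataTable
        ' 1. Check that the statement is well-formed and split it into its pieces
        Dim pattern As String = "^SELECT\s+(?<fields>.+)\s+FROM\s+(?<table>\w+)" & _
                                "(\s+WHERE\s+(?<field>\w+)\s*=\s*'(?<value>[^']*)')?\s*$"
        Dim m As Match = Regex.Match(sql.Trim(), pattern, RegexOptions.IgnoreCase)
        If Not m.Success Then Throw New ArgumentException("Not a supported SELECT statement")

        Dim rawFields As String() = m.Groups("fields").Value.Split(","c)
        Dim fields As String() = rawFields.Select(Function(f) f.Trim()).ToArray()
        Dim result As New DataTable(m.Groups("table").Value)
        For Each f As String In fields
            result.Columns.Add(f)
        Next

        ' 2. Build the file search pattern: * wherever the WHERE clause does not constrain the name
        '    (file name convention: yyyy-mm-dd_lang_slug.txt; only 'lang' is handled in this sketch)
        Dim filePattern As String = "*_*_*.txt"
        If m.Groups("field").Success AndAlso m.Groups("field").Value.ToLower() = "lang" Then
            filePattern = "*_" & m.Groups("value").Value & "_*.txt"
        End If

        ' 3. Retrieve the matching files, parse each one, and stuff the requested fields into the table
        For Each path As String In Directory.GetFiles(folder, filePattern)
            Dim props = ArticleParser.ParseArticleFile(path)   ' the parser sketched further up
            Dim row As DataRow = result.NewRow()
            For Each f As String In fields
                If props.ContainsKey(f) Then row(f) = props(f)
            Next
            result.Rows.Add(row)
        Next
        Return result
    End Function
End Class

Building the list of English posts would then boil down to something like New FileRecordset().Query("SELECT title, slug FROM posts WHERE lang='en'", Server.MapPath("~/posts")) – again just an example call, not the actual code.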

Now I am (re-)creating all collections of pages and posts using my personal SQL engine. In parallel I am manually sifting through old content and turning my web pages into articles. To do: The tag cloud and handling tags in general, and the generation of the RSS XML file from the database.

The new site is not publicly available yet. At the time of writing of this post, all my sites still use the old schema.

Disclaimers:

  • I don’t claim this is the best way to build a web site / blog. It’s also a fun project – done for the fun of developing it, for exploring the limits of flat-file databases, and for forcing myself to deal with potential performance issues.
  • It is a deliberate choice: My hosting space allows for picking from different well-known relational databases and I have done a lot of SQL Server programming in the past months in other projects.
  • I have a licence for Visual Studio. Using only a text editor instead is a deliberate choice, too.

12 Comments

  1. Wonderful :) I completely understand where you are coming from: there is nothing like DIY!

    A couple of years ago I developed a website builder using my Genetic Fractal language (what else!). It dealt with all the layout issues automatically and understood the UI dependencies without coding. But I missed things like updating from a phone or adding new features in a few clicks, and went back to WordPress.

    However, I can’t help myself. A couple of weeks ago I went back to the same concept to implement a more general language that represents program logic as a tree architecture where the root is the lowest common denominator between all the components. These components communicate via their branches, and as such the communication channels construct themselves as you build the tree. The underlying idea is to have a language that is able to construct with high-level components that automatically understand the structure they are grafted onto without necessarily understanding the overall purpose of the system. A bit like an API-free subsystem.

    The point is that this is completely unnecessary because there are beautiful languages and systems already available but … there is nothing like creating and trying out new paradigms yourself. Exploring parts of the forest that no one has visited before.

    Best of luck with your database: it is awesome!

    1. elkement says:

      Your project sounds very impressive. My text file database just has the most basic features. ;-)
      I will also stay away from social media widgets. Yesterday I tested embedding my latest tweets using a standard Twitter widget – and my website suddenly got slower by a remarkable factor. So my conclusion is that, no matter how inefficient my code or my concept of accessing text files is, it cannot beat those slow cross-references to social media sites.

      It would be more reasonable to use an existing system, such as installing WordPress. But I have this unreasonable desire to know every bit of my code. I enjoy debugging and improving my own stuff, but I find the ongoing maintenance – such as installing somebody else’s (security) updates, or staying vigilant re the latest issues – rather tedious.

      From the most pragmatic perspective I can say that my own personal programming projects have always helped me develop skills that finally turned out very useful in professional projects. For example, having got the hang of ASP.NET now, I set out to also re-do the simulations for the heat pump system.

      1. “I have this unreasonable desire to know every bit of my code” Absolutely – I think that is the mark of the true scientist and engineer: having to understand it right down to its bitsiest bits. New generation programmers probably haven’t got that urge. In the new environment one doesn’t hesitate to download a module with more than a million lines of code, and since the computers we run can handle it, everyone is happy. Except that the code that you write is a hundred times faster and you know that there is zero pollution in your system.

        Here’s to barebones programming!

  2. Michelle H says:

    Interesting post. I find database construction particularly intriguing, and I appreciate the way you have chosen to structure your categories for searching your sites. Recently I encountered a database which had some problematic limitations in how it restricts (or fails to restrict) information in the most important category for those of us using it to find particular entries. The problem was that some key words were similar to one another and provided no filter, leaving no option but to manually read through all of the hundreds of entries to tag the relevant ones. There were other problems with the database as well, and when combined with other issues I’ve encountered this past year in using data from various organizations, I have come to think that database architecture needs some serious consideration and redevelopment. All in all, I like the way you’ve outlined your thought process for creating your changes, with insight on how a set of decisions will impact your process. Like Maurice, I want to reread what you’ve posted… so more later!

    1. elkement says:

      Our village has just changed its postal code. On trying to update all my records I learn what I never wanted to know about databases keeping customer records :-) There is hardly any company that accepted the new zip code immediately and used it in their – obviously distinct – systems for holding duplicates of the same customer records. Paper mail still has the old zip code, even after changing all the online records and communicating with support. I even got a letter from our local government that had the old zip code.

      It is in line with what a startup founder told me years ago – they were provided client records by one of their big partners or stakeholders – and these inconsistent data came from six (!) different databases.
      But looking at my own databases I am not the one to point fingers. I think if more than one person is working with the system, even the best input validation cannot prevent duplicate or inconsistent records.

      On the other hand, professionally speaking, I have to admit that I have always enjoyed reverse engineering and fixing such software. I think what I enjoy is sorting out the different levels – the database as such, the business logic, the frontend.

      1. Michelle H says:

        I encountered a similar situation with postal codes a few years ago when our former residence was in a new development and was regarded by many agencies (government included) as a fake address because the new code wasn’t listed in their systems yet. What made authenticity more complex was that we hadn’t been photographed by Google Earth’s car either, so people tended to believe the neighbourhood did not exist. The database is becoming people’s everyday bible, if you will, and they seem to be increasingly strict about living by it. If only they knew what these things were like behind the curtain.

    1. elkement says:

      Of course it is more efficient to paint a blog post – a picture is worth a thousand words :-)

  3. I’ll come back for another comment later on after I’ve had a chance to re-read, especially the parts about the process you are following. For now I’m happy to tell you that for the first time I’m including coding in my “Education 3484” course, which teaches pre-service Primary / Elementary teachers about computer applications and electronic learning resources. After some deliberation I have decided to build the material (sadly I could only spare a week, ideally I’d do 4) around the “Introduction to programming – 101” course which is found in Tynker.com (essentially an online implementation of Scratch). I’m quite excited about this :-)

    1. elkement says:

      I don’t know anything about Tynker or programming for kids in general (I just had a quick look at the stuff included with my Raspberry Pi) – but I have now had a look at the website. I liked the option for transitioning to a standard language later. I wonder if psychological research in 20 years will find a difference between programmers who learned the traditional way and those who started with a platform specifically for kids.

      My flat-file processing approach is for sure not a best practice to learn from :-) I am baffled I have not noticed any performance penalty so far.

      The other project I am working on would be more educational – I am turning my simulation of the heat pump system from Excel / VBA and a classical scripting approach into a .NET application with classes and objects – and the correspondence of real-life objects like tanks and heaters, objects in the code, and objects in the programming interface of our control system is a nice demo of what classes and objects are for.

  4. Irgendeine says:

    Aha ;-)?!
