Release of an HA blogging platform & finally a new design for this blog

High availability (HA), distributed, lightweight, static site yet with comments, with a responsive UI. These are a few characteristics of the ideal blogging platform I have always desired. So I built it. Warning: this is proof-of-concept code only.

I exported data from my old PivotX instance. I wrote server-side code to handle blog comments and distribute them across multiple servers. I designed a responsive UI. And after many hours of work I finally switched blog.zorinaq.com to—wait I need a name for my project… hablog!

Visual design

First, let’s talk about design & vertical space:

I took screenshots of 8 mobile sites on an Android phone with a 720×1280 display running Chrome.

The leftmost screenshot shows that content on my blog typically starts at 350–400 pixels from the top of the screen, whereas most sites start at 600–900 pixels, or in extreme cases they use more than the entire screen to display ads and zero content *cough*BBC*cough*. I dislike waste of vertical space and I think my design gives readers a chance to engage with more of the post before scrolling down and before waiting for the whole page to load.

Notice also that, on a small screen, images on my blog can be as wide as the full width of the screen. Why waste margin space?

Responsive layout

On larger screens the responsive UI transitions to 2 columns:

(Compare this to the previous design of my blog.)

The new design allows the post’s first sentence to start right at the top of the page, maximizing (again) the amount of content shown to readers while keeping a reasonable line length and without scrolling down.

Another thing. I am lucky to have relatively high-quality comments on my blog. So instead of relegating them to the bottom of the page, the 2-column layout lets me showcase them by tucking them alongside the post—like comments in a Google Doc or MS Word document. The comment submission form is also right there at the top, to entice readers to leave comments without hunting for a form at the bottom of the page.

Finally, I needed a mechanism to emphasize my own comments. So I came up with the idea of this vertical orange line between the 2 columns that swerves around my replies. Doing so groups them with the post, which is perfectly logical because they share authorship—me.

My blog requires no sign-in in order to reduce friction when submitting a comment.

Fonts

Many modern web designs adopt a sans-serif font for titles and headers, and a serif one for the main body. I like that.

For titles & headers I picked Raleway. Notice its elegant “fi” ligature in “finally” in this post’s title. [Edit: I eventually decided to stop using a custom font for titles & headers, instead I use the system’s default sans-serif. One less font to download.]

For the main body I picked Noto Serif, which happens to be the default serif font for Chrome on Android (so this browser does not even have to download it). It has great Unicode coverage. I was annoyed at how many other popular fonts do not provide a glyph for U+2126 OHM SIGN (Ω), which one of my posts uses. A missing glyph causes text rendering to fall back to the browser’s default serif font, and if that font has a taller line-height than the custom font, a visually unappealing glitch occurs: a bigger-than-normal gap appears between that line and the one above it.

Color

Low-contrast sites suck. And black text on white background hurts the eyes. So I chose black text on very light grey background (#f0f0f0).

As to the color theme, it is grey & orange. Maybe not the best? I am open to suggestions.

hablog

The visual design is the only thing visible to my readers. But what about the technical guts of blog.zorinaq.com?

Architecture

Six years ago I described the architecture I wanted:

“I will soon have 2 servers colocated in 2 datacenters on 2 different continents, with blog.zorinaq.com having 2 A records for these 2 servers. Browsers try to connect to the 2nd if the 1st fails; and with DNS pinning they tend to stick with the one that works for the remaining of the browsing session. Doing it this way is a cheap way of providing HA for a website.”

Today the cost of VPS and dedicated ARM servers is so low that I decided to run my site on 3 servers, from 3 different providers, on 3 continents. This is why blog.zorinaq.com resolves to 3 IP addresses:

DigitalOcean in the US ($5/month VPS)
Scaleway in Europe (3€/month dedicated ARM server)
Vultr in Asia ($5/month VPS)
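
The failover behavior that multiple A records rely on can be sketched in Python. This only illustrates what a client does with several addresses; it is not browser code. The function name connect_first_available, the injectable connect parameter, and the default port are assumptions made for the example.

```python
import socket


def connect_first_available(addresses, port=80, timeout=3.0,
                            connect=socket.create_connection):
    """Try each IP address in order and return the first successful connection.

    `connect` defaults to socket.create_connection; it is a parameter only so
    the failover logic can be exercised without real servers (an assumption
    of this sketch, not something a browser exposes).
    """
    last_error = None
    for address in addresses:
        try:
            return connect((address, port), timeout)
        except OSError as exc:  # refused, unreachable, timed out, ...
            last_error = exc
    raise ConnectionError(f"none of {list(addresses)} is reachable") from last_error
```

As long as one of the listed addresses answers, the caller gets a working connection and never needs to know which servers were down.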

On the software side, I put all my posts in a local Mercurial repository, and use the static site generator Jekyll to generate the site locally. The pages look complete except that, being static rather than dynamic, they lack blog comments. I place this tag in the page at the location where I would like comments to be inserted:

<!--hablog-insert-comments-->

Remember this tag for now. I will come back to it later.

After generating the site locally I run a bash script to rsync the files to my 3 servers, except with a twist…

The static content (image assets, home page index.html) is rsync'd to the web server’s document root /foobar/html:

/foobar/html/index.html
/foobar/html/assets/image1.png
/foobar/html/assets/image2.png
/foobar/html/...

However the dynamic content (post pages that will contain the reader comments but for now only have the <!--hablog-insert-comments--> tag) is rsync'd to a different directory /foobar/db:

/foobar/db/disk-vibrations-and-ssds/index.html
/foobar/db/what-the-heck-pandora/index.html
/foobar/db/...

Keep in mind this is all done in parallel on 3 different servers.
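
The split between the two destination directories can be sketched as follows, in Python rather than the bash I actually use. It assumes a page counts as "dynamic" exactly when it contains the comments tag; the function name deploy_target is hypothetical, and the real script may decide differently.

```python
import posixpath

COMMENTS_TAG = "<!--hablog-insert-comments-->"
HTML_ROOT = "/foobar/html"  # web server document root
DB_ROOT = "/foobar/db"      # directory holding posts awaiting comments


def deploy_target(relative_path, contents):
    """Return the remote path a generated file should be rsync'd to.

    Assumption of this sketch: a page is dynamic if, and only if, it
    contains the comments tag. Everything else is treated as static.
    """
    root = DB_ROOT if COMMENTS_TAG in contents else HTML_ROOT
    return posixpath.join(root, relative_path)
```

For instance, the home page maps to /foobar/html/index.html while a post page carrying the tag maps to /foobar/db/<post-id>/index.html.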

Now hablog (high availability blog) comes into play. It is made of 3 components: watch-db, hablog.fcgi, and sync-daemon (total ~400 lines of Python code and ~50 lines of bash).

watch-db

Each server runs a daemon, watch-db, that uses inotify to watch the content of /foobar/db. Whenever files are rsync'd there, they are processed and copied to the web server’s document root /foobar/html. The processing step replaces the <!--hablog-insert-comments--> tag mentioned earlier with the actual comments.
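
The processing step boils down to a string substitution. Here is a minimal sketch that leaves out the inotify loop; the HTML markup produced by render_comment and both function names are hypothetical, not what hablog really emits.

```python
import html

COMMENTS_TAG = "<!--hablog-insert-comments-->"


def render_comment(comment):
    """Render one comment as HTML (the markup is an assumption of this sketch)."""
    user = html.escape(comment["user"])
    text = html.escape(comment["comment"])
    return f'<div class="comment"><b>{user}</b><p>{text}</p></div>'


def insert_comments(page_html, comments):
    """Replace the placeholder tag with the rendered comments."""
    rendered = "\n".join(render_comment(c) for c in comments)
    return page_html.replace(COMMENTS_TAG, rendered)
```

Escaping the user-supplied fields matters here because the result is served as static HTML with no later sanitizing step.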

hablog.fcgi

Where are the actual comments fetched from? When a comment is submitted to blog.zorinaq.com via a POST /hablog request, a simple FastCGI server hablog.fcgi handles the request, verifies the Google reCAPTCHA, and writes the comment as a JSON file under /foobar/db/<post-id>/:

{
  "user": "john",
  "comment": "Yes I was aware...",
  "ip": "10.0.0.0",
  "user-agent": "Mozilla/5.0...",
  "removed": 0
}

(I will explain “removed” in a moment.)
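
Writing such a file could look like the sketch below. It leaves out the FastCGI and reCAPTCHA handling, and the function name save_comment and its parameters are hypothetical. The filename follows the <timestamp-since-epoch>.<comment-ID> convention described further down.

```python
import json
import os
import time


def save_comment(db_root, post_id, comment_id, fields, timestamp=None):
    """Write one comment as a JSON file under <db_root>/<post_id>/.

    Assumption of this sketch: the caller has already validated the request
    and computed the comment ID.
    """
    timestamp = int(time.time()) if timestamp is None else int(timestamp)
    post_dir = os.path.join(db_root, post_id)
    os.makedirs(post_dir, exist_ok=True)
    path = os.path.join(post_dir, f"{timestamp}.{comment_id}")
    try:
        # Mode "x" refuses to overwrite an existing file, so a comment file
        # is only ever created, never modified.
        with open(path, "x", encoding="utf-8") as f:
            json.dump(fields, f, indent=2)
    except FileExistsError:
        pass  # same timestamp and ID already saved, e.g. a retried request
    return path
```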

And because watch-db watches /foobar/db, it notices both new posts (/foobar/db/<post-id>/index.html) and new comments (/foobar/db/<post-id>/...), and can replace the <!--hablog-insert-comments--> tag with the comments in order to regenerate the final HTML file under /foobar/html.

sync-daemon

How are the comment files synchronized between my 3 servers? This is the role of sync-daemon: a simple cronjob which runs every few minutes on each server and rsyncs only the comment files to/from the other 2 servers. If any 1 of the 3 servers goes down, the remaining 2 online servers still synchronize comment files between each other. When the offline server comes back online, whichever one of the 2 other servers runs the cronjob first will resync all the comments to the resuscitated server.

This is the crux of how hablog implements high availability: the 3 servers form a distributed redundant architecture and are independent from each other.

Note that sync-daemon does not use the rsync --delete option. Comment files are never modified, never deleted, only created once. As a result synchronization conflicts are impossible by design (KISS).
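
A rough Python equivalent of one sync pass is sketched below. The rsync options shown (--archive, --ignore-existing, --exclude) are real rsync options, but whether sync-daemon uses exactly these is an assumption, as are the function names and the idea of excluding index.html to leave post pages out of the sync.

```python
import subprocess

DB_ROOT = "/foobar/db"


def rsync_command(source, destination):
    """Build an rsync invocation that only ever adds files (no --delete)."""
    return [
        "rsync",
        "--archive",             # recurse and preserve file attributes
        "--ignore-existing",     # never touch a file the receiver already has
        "--exclude=index.html",  # post pages are deployed separately (assumption)
        source,
        destination,
    ]


def sync_with_peers(peers, run=subprocess.run):
    """Pull from, then push to, every peer. A failing peer does not stop the loop."""
    results = {}
    for peer in peers:
        remote = f"{peer}:{DB_ROOT}/"
        local = f"{DB_ROOT}/"
        pull = run(rsync_command(remote, local))
        push = run(rsync_command(local, remote))
        results[peer] = pull.returncode == 0 and push.returncode == 0
    return results
```

Because each pass only adds missing files, running it in any order on any server converges to the same set of comment files everywhere.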

Comment deletion

Occasionally a comment does need to be deleted, such as spam that circumvented reCAPTCHA. But wait, I said comment files are never modified and never deleted…

Here is another crucial design aspect of hablog. This one makes comment modification/deletion possible.

hablog names the comment files according to the convention <timestamp-since-epoch>.<comment-ID>, e.g. 1470000000.633ce99f46a21520b67a3022469241fa. When watch-db processes a post, it sorts all comments according to timestamps (that is how they end up in chronological order in the final HTML). However watch-db also lets a more recent comment file overwrite the JSON attributes of an older one if both files have the same comment ID.

For example if a comment is saved as 1470000000.633ce99f46a21520b67a3022469241fa, but a file named 1470000001.633ce99f46a21520b67a3022469241fa exists (notice the newer timestamp) and contains:

{ "removed": 1 }

Then it overwrites the removed JSON attribute from 0 to 1, and the code considers it deleted. Any of the other JSON attributes (user, comment, etc) can be overwritten by a newer comment file. For example comment could be overwritten to edit the text content.
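
The overlay logic can be sketched in a few lines. This is an illustration under stated assumptions, not the actual watch-db code: the function name merge_comment_files is hypothetical, and it takes (filename, attributes) pairs instead of reading files from disk.

```python
def merge_comment_files(files):
    """Merge (filename, attributes) pairs into the list of visible comments.

    Filenames follow <timestamp-since-epoch>.<comment-ID>. Files are applied
    oldest first, so attributes from a newer file overwrite older ones that
    share the same comment ID.
    """
    parsed = []
    for name, attributes in files:
        timestamp, _, comment_id = name.partition(".")
        parsed.append((int(timestamp), comment_id, attributes))

    merged = {}  # comment ID -> attributes, in order of first appearance
    for _, comment_id, attributes in sorted(parsed, key=lambda p: p[0]):
        merged.setdefault(comment_id, {}).update(attributes)

    # A comment whose latest "removed" value is truthy is considered deleted.
    return [c for c in merged.values() if not c.get("removed")]
```

A comment keeps the chronological position of its oldest file, so an edit or a deletion never reorders the thread.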

Deterministic comment IDs

Comment IDs are generated server-side by hashing together the post ID + the post’s last comment ID (if you inspect the HTML, this is the seed form input value) + the username + the content of the comment. Therefore if a browser submits a comment to 2 or more servers (e.g. due to network glitches causing the browser to retry the POST request against 2 or more of the IP addresses of blog.zorinaq.com), the servers will each generate the same comment ID and each save the file in /foobar/db, which will not cause a data discrepancy. At worst this would result in 2 files with a possibly different <timestamp-since-epoch> in the filename but with the same content, which is harmless (per the logic of newer JSON data overwriting older JSON data).
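
Here is a sketch of such a deterministic ID. The hash function and the way the fields are combined are assumptions: MD5 is used only because it yields 32 hex digits like the example ID above, and each field is length-prefixed so that different inputs cannot collapse into the same byte stream.

```python
import hashlib


def comment_id(post_id, seed, user, comment):
    """Derive a comment ID from the submitted fields alone.

    No clock, counter, or server identity is involved, so every server
    computes the same ID for the same submission.
    """
    digest = hashlib.md5()  # assumption: the real hash function may differ
    for field in (post_id, seed, user, comment):
        data = field.encode("utf-8")
        digest.update(len(data).to_bytes(4, "big"))  # unambiguous field boundary
        digest.update(data)
    return digest.hexdigest()
```

Two servers receiving the same retried POST therefore write files that differ at most in their timestamp prefix.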

Download

Download hablog here.

Warning: proof-of-concept code only. hablog is probably not for you. Storing one comment per file and using rsync to synchronize comments only scales up to a point, maybe up to 100k files. My blog has only about 1000 files as of today (80 posts + ~1000 comments).

Conclusion

hablog gives me many advantages for a high-traffic few-comments site.

Lightweight high performance static site. All 3 servers combined can handle up to ~2500 page hits/sec of my largest text-only posts (50kB), or 350 Mbit/s of traffic according to my benchmarks. The bottleneck is not CPU or IOPS, but the network bandwidth available to my servers. This level of performance is definitely much more than I need considering that my heaviest slashdotting (when I published this) was 40 page hits/sec sustained for a few hours. At the time the PivotX instance could not keep up with the traffic because PHP handling was too CPU-intensive, so I am relieved to move to a static site that can handle 100× more page hits/sec :)

Highly available redundant architecture with no single point of failure. It would take 3 different outages at 3 hosters on 3 continents at the same time to take down the site. In fact even if the servers are available only 98% of the time—7 days of downtime per year!—hablog is expected to still provide five nines availability (as long as downtime amongst the 3 servers is random and uncorrelated): 1 - (1 - 0.98)³ = 99.9992%

CLI tools and revision control. My posts are text files, edited locally with my favorite editor and placed under revision control. I can run custom CLI scripts on my servers to bulk delete the occasional spam comments. I prefer doing it this way rather than using a constraining point-and-click web UI.
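
The availability estimate given above is easy to check numerically. The function name is hypothetical, and the calculation holds only under the stated assumption that outages are random and uncorrelated.

```python
def combined_availability(per_server, servers):
    """Probability that at least one of `servers` independent servers is up."""
    return 1 - (1 - per_server) ** servers


# One server at 98% is down about a week per year...
print(f"{(1 - 0.98) * 365:.1f} days")             # 7.3 days
# ...but all 3 being down at once is far rarer.
print(f"{combined_availability(0.98, 3):.4%}")    # 99.9992%
```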