Blog

Writing about software engineering, health, or both combined.

Connecting Intermittent Explosive Disorder with diet and histamine intolerance: no more uncontrollable anger and rage

July 24, 2025

Intermittent Explosive Disorder (IED) is a disease defined by uncontrollable anger and rage, racing thoughts and other exaggerated, abnormal behavior.

Science and medical information only provides symptom descriptions for this disease, and sufferers have basically no chance to solve it. Only psychiatric medication is well-known to improve it to some extent but can have strong side effects and does not solve the cause.

IED destroys families, relationships and probably even lifes. It’s considered a rare condition, but with an estimated lifetime prevalence of 3-5%, it seems not rare at all (for comparison: ADHD is estimated at 5-10% and considered a widespread syndrome nowadays). Guess the impact that a permanent solution could have, even if it only works for a subset of affected people. I found an easily-explainable solution that worked for me ‒ mostly the dietary change of leaving out histamine-rich food items.

My story
Development of the disease
Detecting IED in yourself
Medication and optionally therapy can be a good first step
Measuring symptoms to find the cause
Histamine intolerance (HIT) is the cause, omitting histamine-rich food items the solution
What comes after the anger is solved?
Related thoughts and reading

My story

…won’t be detailed very much here. It doesn’t matter for resolving the illness, and I want my privacy after all these years of fighting and not knowing what’s going on. The goal of this article is to help others. There are many stories online to read and watch, even scientific case reports, in case you’re unfamiliar with how the disease can present behavior-wise:

Videos about someone’s experience with IED incl. self-harm, pet harm, hitting walls and other people, strong anger outbursts. Got better with medication. Disappointments, loud noises or getting touched mentioned as main triggers. Relation to autistic traits.
Description from someone about her boyfriend and what she did to help. On the other side, most IED stories online can rather be summarized with recommendations like "leave that person". This disease easily destroys family ties if untreated.
As already mentioned in the above links, there are hints about anger symptoms relating to ADHS and autism. For instance, try searching for "anger" on r/ADHD and lots of posts will pop up.
Case reports of Mayo Clinic patients (dated 1987), mostly clinical severity
One study correlates road rage incidents, aggressive criminal acts and other actions with IED (full text not available)
Not about IED, but this r/Anger post reveals various scary stories of what people do while their brain is in anger mode

Another health topic on a software engineering blog? Yes, normally such a switch of topics is unexpected because those computer freaks only write about their stuff. Unless health problems take over your life, you develop an analysis tool specifically to figure it all out, you become a hobby scientist and expert for your own health, ultimately find real solutions, and want to share it for the benefit of everyone out there. My analysis software, yet unpublished, again had an important stake in studying this problem, and below I’ll prove once more how important data analysis and correlations are.

Development of the disease

Intermittent Explosive Disorder didn’t really occur when I was younger. A few signs here and there when I got unnerved for small things, but overall it either didn’t "break out" yet, or wasn’t noticeable. My father had anger and rage outbursts ever since I can remember, and IED is said to possibly be genetic. I didn’t want to be like him.

I got clearer symptoms only at adult age. It’s impossible to say when this really started. Some problems at work annoyed me, particularly if others did stuff without taking care, resulting in me fixing the issues, for example, and I’d sometimes rant about it. This could however be explained as the perfectionism of a young engineer, or by the environment the company gave me. I left the somewhat-toxic environment at my own choice, switched companies and was good for a long time, but still, the same triggers could annoy me. Stupid drivers on the road would also upset me. I can’t say at all when road rage started for me, but I had it to a non-extreme extent, meaning I wouldn’t honk and shout at everyone, like some other people do. It mostly stayed inside my car. And then came family life and failing partnerships, creating strong social triggers for this illness. Yet, all my anger usually had a good explanation: stupid or ignorant people around me, bad relationships, a crying baby, inconsiderate drivers, clumsily forgetting where I had put my belongings in the apartment, things falling down or just "everything goes wrong today" kind of days, having to repeat myself when talking to others, and so on. I kept telling myself that it’s those triggers which caused anger, and didn’t think much of it.

There’s an important distinction to make here: IED presents as uncontrollable anger and rage. Probably every healthy person gets angry about something many times in their life, maybe even many times per week depending on the circumstances. The difference is that someone suffering from IED goes from trigger to rage in milliseconds and doesn’t have the capability to stop the bad behavior at the onset. Specifically, in your mind, you can’t choose an alternative behavior. There’s no time for it, and no other thoughts in your head at that time. You explode, shout, rage, misbehave, and the angry feeling in your gut ‒ sometimes felt literally ‒ might only go away after a few minutes or sometimes hours. I have no clue how and when the disease developed.

Detecting IED in yourself

You came to this article, so you already know or suspect that you or someone close to you has a problem. The first important part now is to know your triggers. I’ve listed a few of mine above. Also, you will likely have non-triggers, that is, bad situations which you can tolerate without getting angry. For me, I was mostly fine at work except for a few occasions, I never had problems among my friends, and my anger was rather in family situations and on the road. Triggers can vary a lot, but I guess the disease is the same.

Second, you need to find out if you have a normal level of anger, for example triggered solely by your surroundings, or rather anger that is uncontrollable, pathological anger, stemming from a mental illness that needs to be fixed in your body and appears even for the smallest issue. It’s super easy to always blame others, especially if you’re really behaving well with other people 80% of the time. But those 80% of time don’t matter if the remaining 20% are killed by your rage outbursts. Mind that it’s very much possible that other people in your environment treat you badly, making you angry, and you still can’t just state it’s only the other person. Try to deal with, and objectively look at, yourself, as you can’t change others. If your current relationship already got intoxicated by anger and stress, there’s always the chance in another one and sometimes things need a fresh start.

There seems to be no objective way to quickly determine if you have the disease. Nevertheless, read on to what my body’s problem and the solution was… you’ll see that anyone can try my dietary solution safely and thereby find out if bad behavior and outbursts turn back to healthy levels ‒ or go to complete remission (no more symptoms), as is the case for me.

Medication and optionally therapy can be a good first step

I had no idea that a mental illness, caused by biochemical reasons in my body, could even be the case. I only got that last-resort idea after many years of struggling, facing life consequences and trying a myriad of interventions to fix this syndrome, all without any improvements. With that idea, the logical next steps to rescue myself were to call the crisis hotline, confirm the idea of getting medical treatment, getting an emergency appointment and a first prescription for an antidepressive from my general practitioner doctor, and then following up with a psychiatrist about the medication. (You can’t imagine the effort and pain it took me to get to that simple-sounding conclusion which finally got the ball rolling.)

The psychiatrist appointment, which in countries like mine usually only happens after several weeks or months on the waiting list even in case of emergency, was honestly useless because it only confirmed that the medication was a good first treatment choice and that the psychiatrist had no idea about IED whatsoever. In fact, IED only appeared in the ICD (International Statistical Classification of Diseases and Related Health Problems) around 2016, and was mainly documented in the USA before that, described for much longer in their DSM (Diagnostic and Statistical Manual of Mental Disorders) and case/medication reports dating back to 1987 and maybe even earlier. Doctor can’t do much about IED apart from prescribing antidepressives and a few other, off-label medication types. Psychiatry is a field that has almost no technical or medical options for root cause analyses, but I was still surprised that not even blood was drawn or any tests made when I was there. The medication I got from my general practitioner, chosen by guidelines, was quetiapine (Americans know it from the brand name Seroquel), a drug affecting neurotransmitters in the brain. My dosage was 2x25 mg a day, the minimal therapeutic dose. Another well-known type of medications that "work" for IED, according to online stories and case report, are SSRIs (selective serotonin reuptake inhibitors). The beta blocker propranolol is also documented to help in many cases.

My point here is: medication has been available for decades, why not try it? For me, trying the pills for a while proved that 1) they do work pretty quickly and 2) it’s in my body and not a psychological problem. That’s why all the talk therapy and other attempts failed in the long run. Therapy has a short-term positive effect that feels like starting with a new mindset, and therefore I’d recommend trying it in combination. It’s not for me though. Since the medication worked to a very good extent, and symptoms reappeared when trying to lower the dose or not taking a pill for a day, I knew that it might be possible to find the actual cause in my body. I think medication is a good start to quickly get rid of IED-triggered behavior. It may even work long-term if you don’t experience intolerable side effects. However, mind that those medications can have serious side effects and don’t treat a cause. Particularly medications affecting neurotransmitters like serotonin, dopamine and norepinephrine are well-described if you search online.

I had researched SSRIs before and was actually hoping to get one of them to try, as they are said to be highly effective within weeks, albeit sometimes with harsh side effects. It turned out that quetiapine was a similar drug I hadn’t read about before, and it worked within days, not weeks. Not only my irritability improved a lot, but also my motivation, insomnia and how good my days felt. Racing thoughts were completely gone. I was finally able to freely decide what my mind thinks. It felt like seeing snow for the first time in life. My thoughts were totally calm. If I did get a small intrusive thought, I could now just decide to stop thinking about that topic! This was suddenly the case in all life situations. In addition, and much more important, things wouldn’t annoy me so easily anymore, and if they did, I could decide within a few seconds whether I react to it and get annoyed. Previously, I had milliseconds and the reaction was like pre-programmed, not decideable. That was the "uncontrollable anger" part of IED ‒ also gone, thanks to medication.

Eventually though, I was able to get rid of the medication after about 4 months. At that time, I had not only, luckily, found the real cause, but also experienced severe and dangerous side effects. There had been a few rough and somewhat-acceptable, known effects such as sedation (which subsided quickly after experimenting with intake timing), insomnia, and exacerbated restless legs syndrome (RLS; I normally have this in summer) due to the influence on dopamine. But those last weeks before quitting, I couldn’t sleep at all anymore, was groggy and very dizzy all day, was ready to pause my gym membership because I couldn’t sustain any sports or much movement anymore, and eventually considered visiting a hospital. Particularly my lower body was physically tired in a way that I’ve never sensed before. I’d like to sensitize you about such risks. Almost every medication with strong main effect can also have strong side effects. Every doctor knows that, normally. I’m especially considerate of anything that crosses the blood-brain barrier. These issues went away after tapering off the medication ‒ which by the way should never be done at once ("cold turkey") because of the corresponding risks.

Measuring symptoms to find the cause

In autumn 2024, I finally picked up the term "Intermittent Explosive Disorder" again for my research even though I remembered that other than some symptom descriptions, there was nothing interesting to find online when I heard of the term first around 2022. This led me to finally write down symptoms in a structured way, long-term, every day. It’s not easy to find a metric that can be objectively measured and consequently written down each day without guessing (was my anger a 7 or a 9 out of 10??). The same applies to other symptoms, though, and fortunately I had years of experience writing down health symptoms from my migraine research (I fixed migraine to a good extent, see my article).

The first metric I started to write down daily is "inner anger" on a scale of 0-3 where 1 would be annoyed feeling, 2 stronger anger, and 3 a freakout, even if it remained only in my head.

And the second one is "nice to person XYZ" on a scale of 1-5, with both the day’s minimum (worst behavior) and maximum (best behavior) written down, so actually two metrics. I did it for multiple close people, so I had even more data. On that scale, 1 stands for screaming, wild, unfair, outrageous behavior, 2 for very unfriendly, 3 for neutral/affirmative, 4 for friendly/nice and 5 for loving/caring/understanding behavior.

Together, this means I have one internal metric that was "measurable" even if I didn’t shout or otherwise react to other people, and two metrics that define my externalizing behavior, or in other words, how bad IED actually affected my behavior towards others, not just my thoughts. It was so incredibly stupid to start the data collection so late, as I knew that the data is the most important input for finding correlations with other conditions.

What could I now do by just having some numbers of how angry and outrageous I was? Standalone, those numbers are useless. They must be correlated with, at best, possible causes. I had lots of conditions already written down for a while due to my migraine research. For example whether I was outside in the sun, did sports, took supplements and ‒ sadly only since that autumn ‒ also what I ate on each day. After a few months, I still had no clue and also didn’t find any success stories online. I tested and analyzed supplements which are supposed to help (e.g. magnesium is said to be calming), among other attempts. I spare you the details.

Histamine intolerance (HIT) is the cause, omitting histamine-rich food items the solution

I had used magnesium for years to try and combat migraine. It’s recommended for a myriad of health conditions and said to have a calming effect, improve sleep and headache, yada yada. It’s the holy grail supplement nowadays. Different magnesium salts have different effects and side effects, but for me, magnesium had exactly one outcome:

Magnesium vs. anger (since writing down data, until roughly the time I figured out this negative effect, ~100 days of data)

Magnesium vs. anger

Remember my scale of being nice and friendly to "person XYZ"? It’s the main behavioral problem I wanted to improve and magnesium supplementation lowered my niceness score by 0.8 points! That’s vastly significant. (For simplicity, I combined all magnesium types and dosages into one bar in the graph since there was no difference for me.) For the "inner anger" metric, worsening can be seen as well, but it’s not very obvious, so I’m happy to have chosen an externalizing behavior metric as well. The more often I took magnesium, the longer the very bad effect lasted.

This was the first hint of effects from diet and supplements. I also tried the ketogenic diet which is effective for the more well-known mental health disorders (depression, bipolar disorder, anxiety, schizophrenia, epilepsy, etc.). In short: it didn’t help for my IED (and I might have done it wrong by miscalculating the fat ratio). After starting to write down total daily fat, protein, carbohydrates and calories to help get used to keto, I extended that at some point to write down the foods I ate. For example, my spreadsheet has one row for each meal (breakfast 1/2/3, lunch 1/2/3, dinner 1/2/3) and I’d put the cooking ingredients in there, such as "pizza with flour, tomato sauce, champignons, salami, grated gouda cheese".

After several months of recording food items, I found the first food which strongly affected my anger: my beloved peanuts. Then other nuts. Actually, almost all nuts (not all are displayed below). Shit.

Without/with certain types of nuts vs. anger (74 days of data)

Nuts vs. anger

The bars for "I ate cashews/almonds" are shown transparent because they have less than 10 days of data each, but the effect is clear: these and even more types of nuts turned out as very bad triggers for my anger. Statistically, one of course has to exclude cofactors like combinations with other foods. I did that. The data didn’t lie. Ripping those favorite snacks out of my diet already helped tremendously.

I still wasn’t sure what the common denominator was, though. That turned out to be high histamine content. My "bad list" of foods grew to over 10 items after a few more weeks of research. It was all easy after getting the initial idea and the written-down data available for every day. It’s just statistical analysis and in the years before, I had already written the tool which produces the above graphs from my spreadsheet of health data (the tool will be released in a few weeks or months, I hope).

As of June 2025, my known incompatible foods were:

avocado
banana
blueberries
chocolate
coconut milk
coffee (incl. decaf)
cream
eggs
green salad
joghurt
kiwi
nuts (most probably all of them, but some I eat so rarely that I don’t have enough data; for example brazil nuts which are anyway to be eaten in very small amounts)
onions
spinache
tea (from the actual tea plant i.e. black/green/white tea, not herbal/fruity water)
vinegar and fermented foods (sauerkraut diarrhea, ewwww never again)

There are very few stories on the internet saying that histamine can have an effect on your mental state and behavior, but there are. And here’s mine. The data proves it. I created a metric in my statistical analysis which tracks how many of those known-incompatible foods I ate on a day, and in retrospective, the number of portions perfectly, linearly (!) correlate with anger symptoms:

Incompatible foods vs. anger

It can’t be more obvious than those graphs: all anger metrics are worse the more high-histamine foods I eat. It’s not gluten, not lactose, not some minor, usually destroyed-by-cooking toxin like oxalates ‒ it’s histamine! Histamine is in many foods, can’t be destroyed or dissolved by cooking and there are great lists available online that show which foods can have a high content.

Mind that not everyone has the same trigger foods. Tomatoes are listed on all web pages that discuss histamine intolerance, but for me, they’re actually very positive. Maybe that’s just a gift so I have at least one favorite food left that I can eat? Meat is good for me as well, luckily, because otherwise I wouldn’t know anymore what to eat to get satiated.

The above data had already convinced me, but there were more signs that hinted at histamine being the actual problem:

Since magnesium was extremely bad for me, which I still can’t understand today, I tried its antagonist calcium. And indeed, calcium is linearly positive regarding anger symptoms! Calcium even seems to work independently from leaving out my trigger foods. I guess that it’s because calcium is a potent antihistamine. It was used as such in the early 20th century. The effect was very easy to observe: if I have a stuffed nose and take calcium, the nose is mostly free within 20 minutes, and the effect could last up to 3 hours. Surely there might also be a risk to high calcium supplementation, so I rather consider diet the main solution for IED, and will be very considerate about using calcium at all.
The more incompatible foods I ate, the more my nose is stuffed. This relationship is also perfectly linear in my data. So the surrounding topic to research here is allergies. Not just food allergies, but also pollen and other triggers. I currently believe that heat is also a trigger. Start reading about "histamine intolerance" (HIT), "mast cell activation syndrome" (MCAS) and related topics, and you’ll get into a huge world of people with extreme health trouble, with various suggested solutions.

So in summary, leaving out food items and taking calcium did the trick. I’m still in full remission, no anger symptoms, after several months on this restrictive diet.

If you ask now: what if the medication has fixed the problem in the long term, and not the dietary change? No, the data and my observations were absolutely clear here. Going lower than the minimal dosage of quetiapine brought back anger symptoms latest after 36 hours. Its effect only lasts while the level remains roughly constant, which is why you need to take two pills each day. While the medication reduced symptoms to a good extent, leaving out foods had a much stronger impact, namely complete remission of symptoms. I don’t consider the medication a solution, or find it in any way comparable to the exclusion diet.

Also, I use the antihistamine cetirizine every now and then against my pollen allergy on days with a runny nose. That didn’t have an effect on the anger symptoms. It’s not surprising: antihistamine medications block certain histamine receptors, and usually not all of them; they don’t influence accumulation of histamine.

This is the end of the informative part of the article. Below you’ll find another optional section with my next steps to try and find the root cause for my body’s overreaction to, or overaccumulation of, histamine. I don’t want to miss out on great foods and drinks forever. No facts or findings in there, yet, but maybe there will be an update later. Good luck to you!

What comes after the anger is solved?

How to ensure the symptoms don’t come back? I don’t really believe in histamine intolerance as standalone problem or it being the root cause. I reckon it’s caused by something else. And I don’t want to skip eating all those great food items forever. I don’t have a clear answer yet, but the research about HIT led to SIBO (small intestinal bacterial overgrowth) as a possible cause. Since SIBO is quite widespread, the possibility for me to have it seemed high, so I did a test. SIBO is a common underlying problem of IBS (irritable bowel syndrome), but I don’t have any digestive or abdominal problems, or frequent flatulence from certain foods. Why do a test then if I don’t have the typical symptoms? Well, let’s compare that to my histamine intolerance: it’s absolutely rare to get mental symptoms from too much histamine. Most people simply get allergy symptoms like skin rashes, runny nose, itchy eyes and so on. I had regular allergy symptoms as well since I’m allergic to trees and grasses. This still doesn’t mean you can’t have other symptoms. Just because something is rare and almost completely undocumented doesn’t mean it cannot happen. I’m ready to do anything to fix my health issues long term. It’s like in software engineering: if you do it the right way, the problem doesn’t come back to you later. This explains why I wanted to test for SIBO despite not having the well-known symptoms like bloating, diarrhea and digestive issues.

The test cost around 100€ incl. return shipment of the glass vials, and debrief phone call about the results. Since doctors and insurances here don’t know or cover SIBO, I paid and did all this myself. There are several labs offering tests, even in other countries. The results turned out strongly positive for hydrogen gas production. In the article The Surprising Connection Between Histamine Intolerance and SIBO, among other sources, I found a reasonable explanation: overgrowth of bacteria in the small intestine, where naturally only a small amount of bacteria should arrive and settle, can damage the gut lining. And that in turn can lead to bad degradation of histamine by the body. There are enzymes involved in the degradation process, as you’ll read in the first 5 minutes about histamine intolerance, yet supplementing the DAO enzyme didn’t help in my case. Anyway, due to this finding, my current attempt is to fix SIBO and thereby maybe the root cause of histamine intolerance. The problem is that I don’t have anger symptoms anymore, so it’s impossible to relate the gut healing with improving symptoms. Yet I still have insomnia as very major health issue, and if my craziest theories prove to be true, then histamine keeps me from falling sleep at night, treating SIBO will fix HIT, and I’ll be happy in life 🤔. Wish me luck, since this isn’t much based on data or other people’s experiences anymore.

I’m following the recommended antimicrobial treatment: oregano oil, berberine, then probiotics, and finally gut restoration with certain supplements. I’ll update this article if it improves anything. Again: I can’t tell if anger symptoms improve since I’m in full remission already. Plus I want to reintroduce avocado, chocolate, eggs, among other items, and not fear food any longer. The exclusion diet is said to be required only temporarily for histamine intolerance, and not required long-term for everyone.

How I mostly fixed my migraine, weather and sports-induced headaches

May 28, 2025

After 37 years of migraine, I found a scientifically well-explained solution. It’s hydration with just table salt as electrolyte. And a few additional habits to turn off the remaining headache that the salt couldn’t always prevent.

In this article, I want to explain the background, and since it’s a supplementary intervention without much research behind it, my data about safety and long-term efficacy after trying several variants of salt supplementation.

My last blog post was 3 years ago, as I was spending lots of private time solving health issues. This is still a software engineering blog, and software helped me figure this all out.

Migraine started in childhood
Science principles ‒ why even write about this single-person case?
Failed attempts
The solution for my migraine: salt at the right amount
Safety of the salt intervention
Scientific theory of electrolyte imbalance behind migraine
The end of the fight after 37 years of pain
Related reading

Migraine started in childhood

It all started with the early signs of developing a "migraine brain": many nights of my childhood, I had to wake up several times to vomit. Combined with a terrible headache, of course. In adolescent years, I would often come home from school, just wanting to lie down in a dark room until the headache pain, light sensitivity and sometimes sound sensitivity ended the next morning. Surely, I missed out on lots of fun in my childhood. I don’t remember when the puking part ended, but the rest never did… until I found the solution which drastically helped within a matter of weeks.

Doctors couldn’t do or understand much about migraine back then. At the age of 18 or so, I still had no remedy at all. I remember a general doctor telling me: "Strange, aspirine should work after 30 minutes". Of course, aspirine can be effective in some common headache types like tension headache, but barely for migraine headaches. Why would nobody ever tell me? I still tried that medication all the time to no avail. The term "medication overuse headache" only became well-known later. At that age, I already had some theories what’s triggering the illness for me, such as weather and air pressure changes, but my pure observations wouldn’t help because I couldn’t change the weather.

My main symptoms are a headache developing over several hours, then in case of very strong migraine also a neck strain spreading usually on one side of my head. Once developed, I wouldn’t get rid of it until waking up the next morning. Luckily, I have never experienced cluster headaches or migraine lasting several days. A visual aura happened only once, and scared me. Migraine can have many prodromes (early sign of migraine onset), symptoms with and without pain, and is therefore studied as multifaceted, allegedly hard-to-fix sickness.

For a while in my youth, I subjectively believed that drinking coke works against headaches. Even in later years, I tried a few caffeinated and other drinks in this faith. My parents and me never collected any data. It could have proven that the sugary drink didn’t help at all. I was too young and naive to get this idea, and no doctor recommended a headache journal either. All I got was two dead teeth and several root canal surgeries from all those drinks. They were available because my parents bought them all the time. Moms and dads, stop this crap!

Science principles ‒ why even write about this single-person case?

Doctors had no idea of migraines back then. My parents also had the average health knowledge for that era: none. And who’s to blame them ‒ the 1990’s were still a time in which you had no other chance but to trust experts: doctors and unfortunately also newspaper and magazine articles. The internet, open science and wide-spread, free health information by governments or even influencers all happened much later. And despite the existence of these great sources of information, many people, not just our parents' generation, still don’t make use of them. So as I solved my migraine by a convincing extent, I also set out to tell others about the story, and clearly document my case on this public, free website. I believe that free information is a cornerstone of modern science. There are too many commercial interests and corruption conflicting with science and its subsidization. The solution is to simply override scientific malpractice by single people’s experience-based, anecdotal, and in my case data-backed, solutions. That’s my personal rationale behind this article and also behind the things I’ll write about next: free information gives people the power to fix their health.

Obviously, my solution may not work for everyone else. But you’ll read below that salt supplementation wasn’t my own idea. A whole Facebook group is dedicated to a related protocol and has helped an unknown number of people. There are forums and groups for many illnesses, which is a great way to discuss your specific symptoms. These groups are usually structured by questions and comment threads and, only if you’re lucky, clear instructions on what to try against the respective illness. They are however not an easily, publicly accessible and searchable way to find information about migraines. I’m therefore trying to put together the major parts in a concise way, even if I only have data and proofs for my single case.

Failed attempts

French fries: not a healthy, permanent solution

French fries

By the age of 25, I hadn’t tried much apart from aspirine and triptans, in the hopes that some unproven "blood pressure in the brain vessels" theory plays out somehow. No medication really worked, and regularly buying a pack of 2 (stupid package size!) triptan pills at a horrendous price therefore wasn’t justified. Instead, I toughened up, so-to-say, not letting migraine days ruin my activities. They of course still did, because you can’t fully focus on work or hobbies, nor do sports with a full-blown migraine. The acceptance however worked somehow on the psychological level to weaken its position as a miserable part of my life.

In 2021, four years ago, me now aged 34, I somehow got the idea of writing down daily symptoms in a spreadsheet, as an attempt to get the headache and other symptoms solved. Not long after, my mouth ulcers were gone after trying out several interventions (hint: use SLS-free toothpaste). So I considered data as a great way to correlate symptoms, interventions and with that, potentially solve health issues. I continued, switched from paper spreadsheets to a huge Google spreadsheet, tried out more and more supplements and stuff in order to solve my main concern at the time ‒ the migraine headache and accompanying neck pain.

My strongest migraine attacks would usually come with strong neck pain, slowly spreading into my head. Many migraineurs have prodromes as first symptoms, and the headache appearing as last, most severe symptom. Prodromes let you predict that you get a headache, so the typical advice "know your triggers" should be extended to "know your triggers and prodromes". The nausea and light sensitivity from my youth are also well-known prodromes. "Feeling" weather changes is another one. I’d know in most cases if a strong headache develops over the next hours, but simply didn’t know how to prevent it.

Example spreadsheet with health data (simplified, contrived)

Spreadsheet

Before having tried any interventions, when I started to write down how strong my headache was each day, I had a headache (of any severity) every third day, 120 out of 365 days! Life sucks, eh? Many of those were also full-blown migraine attacks. The way I wrote them down in my log was easy: 0=no headache, 1=weak headache, 2=medium and already quite painful, 3=ruining my day. With this data, analyzing the monthly/yearly count of headaches, and their average severity, was the next logical step. As a software developer, I wrote my own small program to analyze my spreadsheet data. That gave answers about whether I have more or less headache if I drink apple juice, take supplements, do sports, had alcohol, etc.

I also got enthusiastic about reading scientific studies around the topic. Which remedies exist? What supplements and dietary interventions could I try? I became focused on trying vitamin D, magnesium, eating enough food because skipping meals is a trigger for me, etc. And indeed, the year 2022 turned out to have 22 headache days less than the year before. Or in words: from every third day to roughly every fourth day! Quite a success, I thought, not yet knowing that long-term health issues can be solved much better.

In 2023, I tried many more supplements from scientific literature and ideas from the internet. For migraine, you quickly find a plethora of things to try. Apart from r/migraine’s top recommendation of salty McDonald’s fries, you’ll find quite some substantiated evidence in favor of several supplements, for instance coenzyme Q10, vitamin B2 (riboflavin), vitamin D or magnesium. And of course modern medication such as triptans or monoclonal antibody medications like Erenumab. Food triggers are also listed for a small percentage of migraineurs, but I hadn’t considered that at the time. Of those data-backed suggestions from science, and countless comments from other people on Reddit, I picked the ones which made sense to me. Since either or both of Vitamin D and magnesium seemed to help, I was eager to find other supplementary or lifestyle interventions to help my pain. Regular sports was one of the things I kept evaluating. I did neck stretches, neck strengthening with weight training, avoiding the valsalva maneuver (breath-holding) during training, different magnesium supplements and higher dosage, caffeine and electrolyte drinks, other supplements. My improvements from the previous year remained, likely because I was still using vitamin D and magnesium, but boy did this take a vengeance… higher doses of vitamin D (2000-5000 IU/day), taken over a long time, left me with permanent, severe dizziness over several months, useless neurologist appointments, their new distracting theories of "vestibular migraine", and so on. By looking into my spreadsheet, I could luckily see that I had raised the dosage before this occurred, and fully tapered off vitamin D to get rid of the dizziness. The body buffers vitamin D, so even after taking a small daily dose, I kept seeing symptoms. If you check online information, vitamin D is hyped as the world savior and high doses often mentioned as safe. That’s simply not true for everyone! This definitely made me trust N>1 science (studies on many subjects) less in the future, now equally taking single case studies and experience reports ("N=1 science") more into account. Reddit in particular turned out as great source for real people reporting about their health problems and solutions. Unfortunately, the average internet user doesn’t really detail their story at all, and health still remains individual. Anyway, internet answers can be great hints of possible illness causes that you didn’t consider before.

My research on migraine-related neck pain also misled me: my neck training sessions turned out as potential trigger of migraine, mostly since I was lifting heavy weights (up to 15 kg / 33 lbs). DOMS in the neck is a one-of-a-kind experience that not everybody gets in their life, and it surely made me stronger, but it didn’t make the headache much better in the long run, and definitely worse on training days. Another wasted effort to follow this idea, as so many before. The below section on the scientific theory will explain why migraine doesn’t originate in the neck. I consider neck pain as secondary symptom or prodrome of my migraine.

But I insisted to follow the obvious observations: intense sports were still triggering my headache. A common term for this is exercise-induced headache and it can have various causes. After a lower back injury in the gym end of 2023, I reconsidered how I do my training. I wasn’t really following common advice, even though it was known to me. I was weak in my back, but luckily smart enough to use the safety bars. Therefore, I didn’t destroy my life but was able to recover the muscle which just shut off during squats. Upon resuming strength training, I now always warm up for 5 minutes on the treadmill. I switched to exercises on safer machines and did Andy Galpin’s "3 by 5" protocol (as advertised by Andrew Huberman), meaning 3-5 repetitions, sets, minutes of break, etc. On top, I started with real progressive overload training, increasing the weight whenever I was ready to do more than 5 reps of the previous weight. Writing down the weights in a spreadsheet was again a good solution. A digital spreadsheet makes it easy to change exercises, hide the ones currently not in my plan, and so on. Google Spreadsheet comes with a good mobile app, so that’s my preferred tool. With the warmup, breaks and safety changes, the average headache severity reduced even more. Hard to quantify how much, since just a few months later, beginning of 2024, I had finally found the best-yet fix ‒ the salt intervention. Overall, in 2023, I was down to 83 headache days, i.e. one every 4.4 days.

An electrolyte product I tried

An electrolyte product

Electrolytes were an often mentioned topic on the internet and in some studies about exercise-induced headache, so I researched it further. I tried apple juice soda and other rather natural sources which didn’t help, and later sports mixtures containing sodium, potassium and sometimes magnesium. Unfortunately, here comes a crux of analyzing preventative drugs or supplements: if you only take them when there’s a trigger (for me: sports) or you feel a prodrome (for me: noticing a weather change), you can’t really find out if it helps unless there’s a super-significant difference. Interventions must be tried in a preventative fashion over the course of at least some weeks. Headache frequency without electrolytes after sports was 20.1% in 2023, but with electrolytes it was worse at 36.8%. This proves nothing. In order to look at long-term effects, I also looked at "with/without supplement on at least X out of the last Y days". And indeed, if I took electrolytes after sports on 3 out of the last 7 days (this also means sports 3x a week), the risk of a headache (20.8% in 2023) would even be slightly less compared to not taking them (23.1%). Fine, those results wouldn’t make anyone jump and say that electrolytes are any effective. Such a small difference just isn’t significant.
However, I was lucky to also have analyzed only the highest headache severity separately ‒ and of those debilitating days, I had zero when taking electrolytes after sports in 3 out of the last 7 days. Getting rid of the worst days of my daily life was a big finding, end of 2022. This was the first time when I had a clear hint about electrolytes being involved. The finding was in a small time frame and only on sports days, though. It yet didn’t prevent migraine overall.

Regarding weather-induced headaches: the triggering weather pattern never became clear to me until I analyzed air pressure information of my location. What if it was air pressure change? That would turn out as true in my data: days with pressure increase had somewhat higher risk of headache. But it doesn’t matter, as attacks could happen on days with sunshine turning to rain, rain turning to sunshine, air pressure changing in any direction. Only the probability is different, but every day is a potential migraine day! I stopped writing down the useless weather information at some point. We can’t change the weather.

Those years of reading scientific papers, social media reports of real people trying interventions took a lot of my private time. It became a full-blown hobby. The average severity (my daily headache score between 0 and 3) went down from 0.57 in the year 2021, to only 0.38 in 2023. Significant, but not life-saving. However, I wouldn’t know how to do it differently. Doing nothing would have changed nothing at all. Was it all a failure? I don’t think so. Learning about a hundred failed attempts that don’t work taught me a lot about scientific methods, real-world statistics, making logical conclusions, following the obvious instead of miniscule and pseudoscientific hints, and that apple juice isn’t really a working electrolyte even if I had the feeling that it works. Subjective feelings all turned out as false or even negative, so writing down and objectively evaluating and graphing my statistics was a good course of action. As of 2023, my homegrown analysis app had to analyze many symptoms against many conditions and produced a huge table of correlation percentages that I starred at for hours to see significant effects of supplements and other attempts. This made no sense anymore. Each analysis took 10 minutes already as the program was poorly developed and slow, not meant for long-term usage. I set out to rewrite it in another programming language, add a web page to control what I want to analyze, and to visually show graphs for correlations of symptoms vs. interventions. I’m going to publish that application in the next months, and believe it’s a quite good biohacking and health optimization tool for people who are happy entering daily data into a spreadsheet.

Despite the failure to fix headache in those previous years, I consider these years a success. They turned me into a hobby scientist dedicated to my own health, and I became interested to use this for the good of other people, too, resulting in this blog article, for instance. The fix for migraine came in 2024, and later, in 2025, I would even be able to put a decades-long mental disease into remission with the help of health data collected in my ever-growing spreadsheet. But that’s for a later blog post, if I ever dare to share that story.

The solution for my migraine: salt at the right amount

The results in the years 2021-2024:

Headache statistics over the years

The graph doesn’t show the severity of the headache days, only the count. You can see the dip from 83 headache days in 2023 to only 32 headache days in 2024, the year in which I started supplementing salt. That’s only one headache every 11.4 days! The average severity, not shown here, reduced from 0.38 to 0.15 (as mentioned above, I wrote down severity as score between 0 and 3 where 0=no headache, 1=weak headache, 2=medium and already quite painful, 3=ruining my day). In 2024, I only had 7 days of full-blown migraine. Numbers, numbers… they are critical to health science, but don’t nicely translate to what they mean for my life. The successful change in 2024 allowed me to finally consider migraine fixed to an extent that doesn’t make me fear it anymore. I am so happy and thankful for that.

I can only imagine what this could mean for people who have a really strong migraine problem. I have almost never experienced multi-day migraine. Mine mostly resolved overnight, but some folks even have episodes lasting weeks! Whether this solution works for you or not, I think it’s well worth the little effort to try it for a few weeks.

As mentioned, the main change is to drink a tiny amount of salt each day. If the scientific explanation holds true, there’s a good chance that certain types of migraine may have a root cause related to sodium-potassium imbalance in nerve cells and therefore supplementing with a safe dose of table salt could work for many people. There are unfortunately many types of headache in the International Classification of Headache Disorders (ICHD) for which I cannot provide any insights. There’s however enough information on the internet to distinguish the vastly different "regular" tension headache from migraines, so I don’t explain that here. This article is about migraines, and headache as their most typical and most painful symptom. I’m not a doctor and not giving medical advice here, so treat and try all information at your own risk. Below, you’ll find a section about safety, for good reasons.

Science provides many theories and even more recommendations for migraine, with some of them making obvious sense (such as weather changes and exercise inducing headache), but no clear "cause X? use remedy Y!" findings, or explanations why there are so many different triggers. Most people with migraines are left with barely working medicine such as painkillers, aspirine, triptans or supplements (e.g. magnesium, CoQ10). With those, they get a few percent of improvement in headache days, as I experienced in my own story above, but no life-changing outcomes. And most importantly, those treatments aren’t preventative, but curative and often applied only on pain days, usually too late after the root cause has already developed in the body. Funnily, if you do statistics like me, you may get surprising results such as headache being worse on days when you took ‒ for example ‒ aspirine. That’s easily explained: you don’t take aspirine as preventative measure (I hope you don’t; ever heard of medication overuse headaches??!), but only when you feel pain already. So statistically, headache severity will very likely be higher on those days, making the statistic totally useless because the data can’t answer if aspirine worked for you. You need to try a preventative, at best cause-treating intervention over a longer period of time, and can probably ditch medication at some point.

This is the intervention that works for me now, long-term:

Drink around least 2.5 liters (85 oz) of plain water every day. I live in a country with clean, unchlorinated tap water, so I’m using that. See below for my data about headache vs. water intake ‒ I highly recommend finding your own optimum amount! On sports days, my body automatically demands more. I mostly don’t drink tea or coffee. I have an insufficient amount of data hinting that tea (black/green/white i.e. from tea plant, not herb/fruit tea) might be very positive for my headache, but such diuretic drinks make me pee very often, and both tea and coffee contain histamine which isn’t for me. Hence I’m skipping them for non-migraine reasons.
I add less than a quarter tea spoon of table salt (¼ tea spoon ≈ 1.25 g NaCl alias sodium chloride ≈ 0.5 g sodium) to every 1-liter bottle that I drink. After some days, it doesn’t taste salty at all anymore. I found that the body buffers sodium, so the absolute amount over several days is more important than weighing the exact mass.
Watch out for concerning symptoms, see the section "Safety of the salt intervention" below!

That’s why I reduced from ¼ tea spoon to basically just a pinch of salt per liter:

Less than a quarter table spoon suffices for me

At first, it took some weeks to notice much fewer days with headache or headache onset. I started the intervention in January 2024 and still got 7 headache days in that month. In March and April, I was down to only 2 headache days each month. Incredible! Since the effect doesn’t come from one day to another, it doesn’t feel like a miracle cure but I only realized after some weeks that something was strongly improved in my life. This wasn’t the end of the story though, so please read on about the safety research I did. And if you’re still interested by then, there’s a section at the end explaining what the theory behind this solution is, and how a large group of people may already have known, and benefited from this for over 10 years. It felt bad for me that the knowledge was out there on the internet all the time, me having no clue about it. My migraine could have been solved many years ago.

A common recommendation is to drink lots of water together with electrolytes, but my own data says that drinking too much is bad for my headaches:

Headache severity (0-3) vs. water intake in milliliters (data from 2024-06-01 to 2025-05-27; insufficient data for the [3500,4000) ml bucket)

Graph

For the probability of feeling any migraine prodrome on a day, the results are similar:

Prodrome probability (0 or 1) vs. water intake in milliliters (data from 2024-06-01 to 2025-05-27; insufficient data for the [3500,4000) ml bucket)

Graph

For me, 2000-2500 ml of water might be optimal, with more water making it worse. I can only guess that dilution could be the problem. Important: I only analyzed the same-day water intake. For example, if I analyze headache vs. yesterday’s water intake, the results are similar, but less pronounced. Also, I only consider salted water here and not which other electrolytes I did or didn’t take with it, such as magnesium. Significant linear-looking relationships in my data however often turned out as meaningful, so I go with roughly 2500 ml as my optimal amount, with more on sports days.

Intense sports or a full-day in a sauna and thermal spa can still evoke rather strong headaches for me. I haven’t figured this out fully and can’t claim a great remedy just yet. I’m wondering if electrolyte imbalance could explain those triggers as well.

Here are some additional recommendations:

Hydration means water + electrolytes. Electrolytes include sodium, potassium, magnesium and possibly other micronutrients if you’re deficient. It’s important to hydrate the day before sports or another intense event, and then also during that day. Pee color and times-on-toilet are said to be good hints about over-hydrated vs. under-hydrated status, but I didn’t research if that’s true. Sugar-free electrolyte supplements might work as well, but don’t overdo potassium supplementation since getting that from food is safer than drinking it all at once (article about potassium toxicity).
Warm up before sports. My data doesn’t show much benefit of warmup regarding migraine risk, but it reduces injury risk for sure. It’s a no-brainer. Overall, my gym sessions are now long, structured, and with breaks, rather than intense 30-minute workouts that make me totally tired for the rest of the day.
I take breaks during sports, so that my heart rate doesn’t go over a certain threshold (for me: 170 🤷). This idea isn’t data-backed; I only have a subjective clue that this could raise migraine risk on that day. I have a pretty high heart rate overall, typically at least 160 for any speed of running and also with other cardio sports. The more volume (~ repetitions multiplied by weight), the higher the heart rate spikes will be. I’m avoiding that by taking breaks, especially for compound/leg exercises, since those seem to require the most energy and therefore strongest heart pumping.
Another trigger: skipping a meal. If intermittent fasting isn’t good for you, leave it.
In a sauna/thermal spa, there are often very warm temperatures. Some indoor pools offer "warm water days". Both warm air and water can quickly raise your body temperature, while the body tries to stay in a small range of feel-good temperatures. So cool down outside or in the ice bath more often. Don’t hop from sauna to sauna, and don’t remain in the warm water or warm air without a cooldown break.
Make typical modern health improvements: ditch alcohol and carb-loaded drinks (also alcohol-free beer), eat fresh, little-processed and rather low-carb food. Alcohol screws up sleep and muscle growth. In my experience, the headache from alcohol hangover seems to be a different type from migraine, and should be avoided in addition. I noticed that certain, highly-processed foods gave me a headache within minutes. That only happened rarely to me, but the trigger was obvious.

I haven’t written down food macros for long enough, but my current data shows that I should probably avoid a high-carb diet:

Headache severity (0-3) vs. eaten carbohydrate amount on a day (data from 61 days of food data, year 2025, scores 0-3)

In fact, the Stanton Migraine Protocol which I’ll mention in the next sections, recommends low-carb nutrition. I tried going low-carb/keto and it’s definitely helpful to feel satiated for longer and not having to cook all the time. I’d recommend it independently of my migraine research and data.
If you feel a headache coming up despite salt supplementation, there’s still the chance to stop its progression. Put the aforementioned amount of salt on a table spoon, place it under your tongue and let it dissolve. The strong salty taste is a special experience you shouldn’t miss. I’m joking. If you’re not too late, this can still make a difference on the same day. For me, it can sometimes still fight the onset, but it adds to the absolute amount of ingested salt, so safety is a concern.
Sunstroke can lead to a similar, unstoppable headache as with migraine, but this one is actually dangerous. It’s often also associated with nausea, similar to alcohol hangover. The solution is simple: avoid long exposure of the head to strong sunlight. A baseball cap can be enough on certain days. Shadow is best. And mind that air temperature and sunlight intensity aren’t much related. You can easily get a sunburn or sunstroke in winter, just like solar panels can create lots of electrical energy despite freezing temperatures. Estimate sunlight energy, not temperature.

That’s it ‒ some changes are simple, especially the salt part, and some take more effort and knowledge, like the dietary part. Mind that the intervention does not cure your migraine illness; it only prevents it, sorry!

Safety of the salt intervention

The so-called Stanton Migraine Protocol, one of the sources recommending salt ingestion in water as migraine remedy, prescribes a larger amount of salt. I initially salted each glass of water as suggested. At some point though, dizziness symptoms started. With my previous dizziness phase caused by vitamin D (see my story above), I was already aware that the attempted interventions could be the likely culprit. And indeed, it turned out as too much salt which triggered the dizziness! (I’m not taking vitamin D anymore, so that was ruled out.)

Eating more potassium, or supplementing with a 2:1 sodium-potassium salt mix instead of table salt, didn’t change that.

Regarding the safety of sodium salt in general: too much and too little sodium both correlate with higher mortality. The WHO recommends reducing salt intake from the statistical average to less than 5 g/day salt (= 2 g elementary sodium). With processed foods, it’s easy to get over that threshold, with fresh foods not at all. You’ll need to judge your individual intake and read food labels.

Since experiencing dizziness, I tried a few variations of salt dosing. For a while, I turned back to not salting my water while still drinking 2.5 to 3 liters of water a day and keeping other things roughly the same as well. I tried without salt supplementation several times, as a crossover experiment, and it seems that my body has roughly a 10-day buffer. After that, my headaches come back, but dizziness subsides. These repeated, identical results also prove that the salt really works as preventative intervention. I didn’t want dangerous side effects like dizziness, so I reckoned that taking much less salt, or every second day, might do it as well. So what works for me right now, as written above, is to dissolve only a pinch of salt per liter, several times a day. Less than a quarter tea spoon of table salt.

I’ll keep testing this. In contrast, I’m not writing down and analyzing dietary salt intake, but that’s an option, too. Overall, my recommendation is to not treat the salt intervention as a perfect, final solution, and to observe whether it works for you without side effects.

By the way, very intense gym sessions still give me temporary dizziness, independent from salt intake. It’s a complex symptom.

Scientific theory of electrolyte imbalance behind migraine

When I found electrolytes as possibly related to my headache, I didn’t know much about brain neurology yet. That changed quickly. Angela Stanton, with her neuroscience background, pioneered both in creating a reasonable scientific theory and in bringing awareness to people about what works. Being a decades-long migraine sufferer herself, she’s much more believable than strange healing stories on the internet. Creating a Facebook group for healing migraine, in which, as of 2025-05, around 19000 people have joined, takes dedication.

That said, I particularly like the neurons' electrolyte imbalance theory as described in her book and articles. Here, I’m summarizing a few resources:

Definition of Migraine
- Migraine is genetic, and a "migraine brain" is different than a healthy one. Problems with sodium/electrolyte homeostasis (balance) leads to different electrical properties, and that is critical for nerve cells (neurons).
- Restricting carbohydrate consumption can improve electrolyte balance
- Hydration isn’t drinking lots of water
- Other possible triggers, e.g. weather changes, menstrual cycles, etc.
- Recommendations for salt intervention during air pressure increase/decrease
Specifically formulated ketogenic, low carbohydrate, and carnivore diets can prevent migraine: a perspective

Front. Nutr., 30 April 2024, Sec. Nutrition, Psychology and Brain Health, Volume 11 - 2024, https://doi.org/10.3389/fnut.2024.1367570

An overview paper for her theory and what’s behind it. The abstract greatly summarizes it and I have nothing to add to this great article.

This article presents a hypothesis explaining the cause of migraines, suggesting that electrolyte imbalance, specifically a lack of sufficient sodium in the extracellular space of sensory neurons, leads to failed action potentials. The author argues that migraines are triggered when sodium channels fail to initiate action potentials, preventing communication between neurons. The article discusses the evolutionary perspective of the migraine brain, stating that migraineurs have a hypersensitive brain with more sensory neuronal connections, making them more reactive to environmental stimuli and in need of more minerals for the increased sensory neuronal communication. Since glucose is often used to reduce serum hypernatremia, it follows that a high carbohydrate diet reduces sodium availability for use in the brain, causing an electrolyte imbalance. Low carbohydrate diets, such as ketogenic, low carb-high fat (LCHF), and carnivore (all animal products), can be beneficial for migraineurs by reducing/eliminating carbohydrate intake, thereby increasing sodium availability. In support, many research papers and some anecdotal evidences are referred to. The article concludes by proposing lifestyle modifications, such as dietary changes and sodium intake management. These will provide migraineurs with a long-term healthy metabolic foundation helping them to maintain strong nutritional adherence and with that aiding continued proper neuronal functioning and migraine free life.
The book "Fighting The Migraine Epidemic: A Complete Guide: How To Treat & Prevent Migraines Without Medicine" was the first thing I’ve read. It’s much longer than the above resources, but worth a read for migraineurs or relatives, such as parents trying to help their kids, as the book is written for everyone (not just scientists). If you want to check out the complete protocol, the go-to places would be the book and the Facebook group.

Hint: I’m not advertising. Science should be as open, free and without financial interests as possible. It’s just fair and important to mention my sources of information. If there’s a sold book along a plethora of freely available resources from one scientist, that’s totally fine. I could offer my endless writeup of other resources from the last years. I must have read thousands of web pages and full-text studies. This article is already too long to list and explain everything. Therefore I refer to sources that give explanations in a single place, and which influenced me the most.

The sodium-potassium homeostasis explanation makes sense to me. But I can’t tell if it’s the correct explanation. And it doesn’t matter. The result ‒ getting rid of migraine pain ‒ is more important, as long as the intervention is safe to use. That said, I’m not ruling out other explanations. Just like I’m not excluding that I might have multiple causes for headache and other symptoms.

The end of the fight after 37 years of pain

The whole endeavor was worth the effort. I became a hobby scientist, fixed more health issues afterwards, and invest even more effort into free and open science now: I’ll publish my free data analysis program soon. I learned that some of my observation-derived ideas turned out to be true: I had my own theory that electrolytes were involved, and that’s the case. Those observations should be followed first rather than trying hundreds of likely unrelated ideas from the internet or from studies that average out effects on a large population. Most importantly, my life improved a lot by solving migraine. And even though it’s not cured, and also not 100% fixed, the pain I have remaining is definitely something I can live with. And I’m happy being able to help others now by sharing my story, solution and specific instructions, fully backed by my own efficacy and safety data.

With the problem now solved closer to the cause, rather than only treating symptoms, certain triggers are fully gone: Flights usually gave me nausea and headache; even before a flight, I would already have a bad feeling in my body. Also high-speed train rides used to give me a high headache risk, likely due to the fast and countless changes of air pressure and weather along the route. With the salt intervention, these are no longer a trigger.

The result is incredible. According to the recommended treatment goals, I’m under "optimal control" with 1-2 monthly moderate-to-severe headaches (mid of 2025, much over a year into the intervention).

From: Sacco S, Ashina M, Diener H-C, et al. Setting higher standards for migraine prevention: A position statement of the International Headache Society. Cephalalgia. 2025;45(2). doi:10.1177/03331024251320608, published by Sage here. Figure 2. Aspirational goals of migraine prevention according to the position statement of the International Headache Society.

Aspirational goals of migraine prevention according to the position statement of the International Headache Society

Thank you for reading, and best of luck to all sufferers!

My research on migraine will continue. I’ll probably try again to sustain a strict low-carb diet for a longer time. This article will definitely be updated once I have new insights.

Grafana dashboards — best practices and dashboards-as-code

April 21, 2022

Grafana is a web-based visualization tool for observability, and also part of a whole stack of related technologies, all based on open source. You can configure various data sources — time series sources like Prometheus, databases, cloud providers, Loki, Tempo, Jaeger — and use or even combine them for your observability needs. As of April 2022, that means metrics, logs, traces, and visualizations based on those concepts. That is where dashboards come into play.

Dashboards are the typical first solution that even small companies or hobbyists can use to quickly get insights of their running software, infrastructure, network, and other data-generating things such as edge devices. In case of incidents, or for just glancing over them at will, dashboards are supposed to give a good overview to find and solve problems. Great dashboards make the difference of understanding an incident in 10 seconds vs. digging into red herrings for over 5 minutes. Meaningful dashboards also ease the path to setting correct alerts that do not wake you up unnecessarily.

On top of dashboards, you can leverage alerts, logs and tracing which together form a simple and helpful look at your stuff, if done right — or a complicated mess like your legacy software code, if done wrong 😏. Those other concepts are out of scope in this article. I only focus on how to achieve helpful, up-to-date dashboards, using private research and my production experience, where I extensively used Grafana dashboards — and also logs — to resolve incidents fast, and often with ease.

Typical examples on the internet present dashboards as a huge screen full of graphs, sometimes a collection of numbers or percentages. A lot to look at for human eyes. I will show you how to visualize better, with realistic examples. This article showcases solutions for Grafana and Prometheus, but can also be applied generically for other platforms.

Goals for all your dashboards
Spinning up a Grafana playground in a minute
Getting started with a playground dashboard
High-level dashboard creation guidelines
Dashboards-as-code and GitOps (committed dashboards)
jsonnet for dashboards — some tips
- Distinguish environments
- Consider jsonnet as full programming language
Grafana-specific dashboard tips
Observability tips not specific to Grafana
Summary
Out of scope
Related reading

Goals for all your dashboards

You cannot look only at best practices for a single dashboard. If you do that, you will end up with 100 dashboards, each for a single, distinct purpose, such as one per microservice. But altogether, that becomes an unmanageable mess.

Think of your company: who is the main audience of the dashboards? If you are using the "you build it, you run it" concept, it may be mostly developers (e.g. on-call engineers during an incident). For other organizational concepts, it might be SREs or infrastructure engineers. Or it may be multiple technical departments (operations + engineering). This article will focus on examples how to monitor the health of software systems (easy to apply to network and infrastructure likewise). If however your organization looks totally different and dashboards are for sales, management, compliance, information security, then the goals can be different — the article may still be helpful nevertheless. Most importantly, you as author should be part of the audience yourself, or else you are not a good fit to develop a reasonable dashboard. On the same note: great dashboards help in many ways, such as against following red herrings, but nevertheless you need application experts to resolve incidents, so those should definitely be part of the audience.

This is what we want to achieve for users:

Fast resolution of incidents, by finding the root cause and impacted services/customers fast
Users should have one browser bookmark, leading to a main, high-level dashboard. It will be the first page you open when you get called for an incident. Within seconds, it tells you which parts could be problematic, and which ones are okay (as operators in Star Trek say: "operating within normal parameters").
Show health at a glance, with a simple indicator that the human eyes can quickly consume (e.g. green or red background color), for each component
Allow drilling down into more detail (low-level) in order to come closer to the root cause if the high-level dashboard is not enough

How this can roughly work:

Create a high-level overview dashboard. It depends on your company how many of those make sense. The scope could be one system, service, product domain, or for small companies even the whole landscape at once.
Represent each component or microservice of a system in the high-level overview
For fine-grained analysis, it also often makes sense to create a separate, detailed (low-level) dashboard for each component. The high-level dashboard links to those.
From each visualization (those are the rectangles on your dashboard, such as graphs), link to detailed dashboards, prepared log queries, debugging tools on your intranet, the system/website itself, etc.
Create dashboards solely through code, to avoid having a mess of manually created, unreviewed, inconsistent dashboards after a few weeks, and the need for a company-wide "tooling switch" after a few months or years, only to clean up all of that. Users nevertheless get write access to Grafana, since that allows temporarily adapting a dashboard for own usage, such as special investigation during incidents. They should however be trained that changes should not be saved, and any saved changes will regularly be overwritten by the dashboards committed as code. Those get automatically deployed, for example by a CI pipeline.
You give no training to users. Yes, you heard right! A well-designed dashboard is 100% obvious and requires no explanation to use it, given the user knows the relevant terminology of your monitored system. Training for incidents anyway mostly happens through practice. Therefore, I recommend you present the dashboards on screen during incidents, so that other users see the capabilities they offer, and less obvious features such as the hyperlinks that can be added to the clickable top-left corner of each visualization. Code review for dashboards, and reviewers who are application experts (who understand the meaning of the displayed metrics), are essential to keep up good quality and really make the dashboards plain simple to use without explanations.
For medium to large companies in terms of head count, introducing such a consistent concept will be impossible unless technical leadership supports the full switch from the old or non-existing monitoring solution to Grafana with dashboards-as-code. It is key to document and communicate that codifying dashboards is the only way to go, listing the rationale for your company and also how a developer can start creating or modifying dashboards. This requires no more than one-page documentation/guide and an introduction by leadership or engineering managers.

Spinning up a Grafana playground in a minute

You only need this if you want to follow along with the blog article recommendations, play around with random sample data, and do not have a live Grafana instance with real application metrics at hand.

Some solutions exist — most of them using docker-compose — which allow you to easily and quickly spin up Grafana and data sources in a minute. Here, I describe the official devenv that Grafana developers and contributors use (see Grafana’s devenv README):

git clone --depth 1 https://github.com/grafana/grafana.git
cd grafana/devenv

# As of 2022-04, these are the instructions.
# Check `README.md` files for more information.
./setup.sh

cd ..

# See directory `devenv/docker/blocks` for more supported sources.
# "grafana" is not a source - this value ensures you don't have to
# build Grafana, and an official image is used instead.
make devenv sources=grafana,loki,prometheus

# To tear down later: `make devenv-down`

Now open http://localhost:3001/ and log in with user admin and password admin. Navigate to Explore mode on the lefthand navigation bar, choose gdev-prometheus as data source and query an example metric such as counters_logins. If it shows data, you are ready to play around.

Getting started with a playground dashboard

Mind that a good dashboard takes hours or days to create! As a start, the graphical way of clicking one together is the fastest. Once you have found out a good concept and layout, codifying it for the first time is some work, but worth the effort — more on that later.

If you have to experiment on a live instance, start by adding a name and hint so that your testing dashboard will not be touched by others. You can use a text visualization for that. Save the dashboard with a meaningful name. Do not use filler words like "monitoring" or "dashboard" — of course it is a dashboard… the screenshot only does this for the temporary "it’s a test" hint in the title. Now you can follow along with the recommendations and examples in this blog post.

Create a dashboard

High-level dashboard creation guidelines

Choose main input for the high-level dashboard

By first concentrating on monitoring and alerting for the main function of your business and system, you can cover almost all critical problems in subcomponents and infrastructure resources, without having to monitor those explicitly.

That needs explanation… The business in this blog article’s example is to process payments. So if payments fail, people cannot pay, and the business is at risk. Anything else is not as important, and therefore not worth to observe as the first thing. If a critical issue arises in our systems or network, the payment_errors_total metric will most likely cover that! To use IT terminology: the metric of payment failures is a significant service level indicator (SLI).

Admittedly, that will not cover if the internet, a customer, or our API are down, because payment requests would not even reach the system and therefore cannot produce logs or error metrics. That can be covered by a metric describing the payment request rate, probably by customer and payment method, since each of those have different typical traffic rates (minimum/maximum, different per timezone or day/night, etc.). We keep this shortcoming out of scope to keep the blog article simple. The point is: choose very few, essential business metrics as a start, not technical metrics.

Often, you would select the most business-relevant Prometheus metric offered by the system you want to monitor. Metrics are stored as time series and therefore very fast to query, as opposed to logs. If you use other observability tools, such as an ELK stack, you can check if Grafana supports the relevant data source. This metric would typically pertain to the methods "RED" (requests/rate, errors, duration) or "USE" (utilization, saturation, errors). The Four Golden Signals of Google’s SRE book additionally distinguishes traffic from saturation. An error metric is a good choice to start, since it is easy to determine which error rate is acceptable, and at which threshold you would consider it a serious problem.

Throughout this blog post, we will use the following example metric and simple terminology from the payments world:

Imagine we are in a company that processes payments, offering different payment methods, with each of those methods (e.g. credit card, voucher, bank transfer) having a separate microservice implementation
Counter payment_errors_total
Cardinality — the counter has these labels:
- payment_method (example values credit_card, voucher, bank_transfer)
- error_type (example values connectivity_through_internet, remote_service_down, local_configuration_error)

Metric naming and cardinality

We do not want to have separate metrics credit_card_payment_errors_total and bank_transfer_payment_errors_total! If you have microservices of the same type, as in this example one service per payment method, the metrics really mean the same thing. So rather improve your consistency and use just one metric name. Easy to do if your code is structured in a monorepo, by the way. If you have inconsistent names, it will take extra effort to repeat each dashboard visualization for each of the conventions, instead of just using labels to distinguish and plot the services (or other traits that you define as label dimensions, such as customers).

Metric names should be easy to find in code, so that developers can make sense of them fast in a stressful situation, or find out if/where a metric (still) exists. Here is a bad example where the metric name mysystem_payment_errors_total cannot be found in code: for request_type in ['payment', 'refund', 'status']: prometheus.Counter(name=f’mysystem_{request_type}_errors_total') (Python pseudocode).

Avoid high-cardinality metrics (many label combinations), since those take up lots of space, and queries take longer. Like for the logging rate of your systems, you might want to check for large metrics sporadically, or you may run into unnecessary cost and performance issues.

Stat instead of Graph for human understanding within milliseconds

A graph (nowadays called Time series visualization) for our example metric, showing payment method and error type combinations, looks like this:

Dumb graph

Cool graph, right? And we’re already done and have a monitored system! No, this is very, very bad! Great observability requires much more than just clicking together some visuals. The example is not sufficient to monitor a service. A graph visualization is a bad way to get an impression within milliseconds. The eyes have to scan the whole rendered graph, potentially containing multiple lines on varying bounds of the Y axis. You also need to know which thresholds are bad, or configure horizontal lines on the graph which represent warning and errors thresholds. That means lots of lines, colors, and points to look at before getting your question answered: "is this normal or do we have a problem, and where?"

Graphs can be helpful if you set them up nicely, but definitely not in the high-level part of your overview dashboard.

Instead, a Stat visualization, combined with traffic light colors, gives you the answer in milliseconds: green is good, amber (yellow) is noteworthy, red is bad. In addition, I tend to use blue as "noteworthy but may not be problematic" — kind of an early warning sign or unusually high amount of traffic, such as during sales/promotion seasons. So for me personally, I like the order green-blue-amber-red. Grafana allows choosing to color the background instead of the value line, which I recommend since then your whole screen should look green most of the time (click Panel > Display > Color mode > Background), and your eyes do not need to focus on the color of tiny graph lines. Exactly one value is shown — typically a number or human description.

Stat (many items)

Settings for the above screenshot:

Prometheus query: sum by (payment_method, error_type) (increment(payment_errors_total[2m]))
Legend: {{payment_method}} / {{error_type}}
Choose Calculation > Last. That will give the latest metric value, since now is the most interesting time point to show. Aggregations such as Mean may be a useless "all problems averaged away" view if you pick a big time range such as Last 24 hours, and would therefore show different values to different people. Since the Last setting does not average at all, your query should do that instead of sampling a single raw value: Prometheus queries such as increment(the_metric[2m]), or rate(the_metric[2m]) if you prefer a consistent unit to work with, will average for you. The [2m] in there should be selected depending on how stable the metric is and how fast you need to react once the metric reaches a threshold (mind the averaging!). Magic variables like $__rate_interval may sound promising, but also have the issue that a different time range selection shows different results, and that could lead to confusion if you exchange links to dashboard views with other people during an incident.

To show the colors, you need to set thresholds on the Field tab. Setting them as static numbers (with the queried unit, e.g. "errors per 2 minutes") may work for the start. That is called Absolute in Grafana.

In our example though, the different payment methods and error types have very different thresholds: for instance, let’s say the credit_card payment method has remote_service_down errors very frequently because the 3rd party provider is unreliable and we cannot help it, so we want to set a higher threshold because it otherwise unnecessarily shows a problem. Or instead of a higher threshold, you could consider querying the error rate increment over 10 minutes, to even out any short spikes. To use relative thresholds, click Percentage and fill some values. They will use the Min/Max settings. For example: if you set Min = 0 and Max = 200, red background color above the 66% threshold will be shown above 66% * 200 = 132 (unit in this example: errors within 2 minutes). Everything above 66% will be red. Everything between 16% and 33% will be blue. And so on.

Relative (percentage) thresholds

To set specific thresholds per value combination (here: "payment method / error type"), adjust Max:

Override Max setting

Since Grafana live-previews your changes, it should be simple to choose good values for Max. Select a healthy time range for your system, and it should be green (note that the right-most time point is displayed, as we chose Calculation > Last). Select an incident time window, and choose a Max value to make it red. The other values (amber/blue) might then just work, since they are based on percentages. Start with values that work, and adjust them if you later see false positives (red when system is fine) or false negatives (green when system has problems). If you want human descriptions instead of numbers, you can also use the override feature (Field > Value mappings, or for specific fields: Overrides > [create an override] > Add override property > Value Mappings / No value), for instance to replace 0 with the text "no errors".

Example: our query shows the number of errors in 2 minutes. By looking at the graphed data of the last few days (paste query into Explore mode), we might find that 20 errors in 2 minutes are acceptable and should be shown as green. We therefore choose a slightly higher threshold of 25 to still be within the green color. Since we switch to blue color from 16%, we get Max = 25 * 100% / 16% = 156. As a result, red background color — which shouts "something is seriously wrong" — would be shown above Max * 66% = 103 errors in 2 minutes. You should experiment a little so that in healthy times, your dashboard remains green.

Side note: for "higher is better" metrics such as customer conversion rate (100% = all customer purchases succeed), you can just turn around the colors (green on top, red on bottom). The Max setting also defines the upper bound of the graph which is shown as part of the Stat visualization, so if values are higher, the line will cross the top of the Y axis and therefore becomes invisible. Not a big deal if Min/Max cover the expected range. You may also have the rare case of "too high and too low are both bad" metrics, e.g. a counter for payment requests where you always expect some traffic to be made, but also want to be warned if the request rate is soaring. The colors could be adapted to show both low range and high range as red, with green for the expected, normal range.

Keep panels small on screen

Pack only few visualizations horizontally, so the font stays large enough. Other people may work on a smaller screen than yours, or do not use full screen sizing. The repeat feature (Panel > Repeat options) makes Grafana create and align them automatically. In our example, the repeat feature is unused, but since we pulled different combinations of payment method and error type out of our (single, consistently named and labeled) metric, that will also show multiple rectangles in one Stat visualization, and try to align them on screen. In the screenshot further above, the titles are barely readable, and the visualization is large and could require scrolling on small screens. To solve that, you could:

Keep as-is and show a separate rectangle for each combination. With each added or newly monitored product/feature (here: payment methods), the whole dashboard size grows, so the page does not always look the same or fit on one screen. I’m not telling you it should fit on one screen, but a high-level dashboard must not be an endless scrolling experience.
Show only problematic items — for instance, only yellow and worse. The downside is that in healthy cases, nothing gets shown, making users unaware of how it should normally look like. See below for a better option.
Show the top 10 highest error rates (Prometheus: topk). This can be combined with traffic light coloring. If only one shows red, you will think that one payment method is down, while if multiple show red, you may think of a larger issue. With this solution, the visualization will never show as empty, so you’ll see ~10 green rectangles in healthy scenarios (the section Show only offenders or top N problematic items later explains why it may not be exactly 10, and how to fix that). Compared to the above option, this ensures that the visualization remains at the same size and does not jump around on the web page, and you know that the dashboard is still working. Just like software, dashboards can become buggy and not show you the right things, for example if someone renames a metric!

Top items only to keep panels small on screen

Avoid dropdowns for variable values

Typical pre-built, open source dashboards may show variables like these:

Variables

Such low-level investigation dashboards are helpful, but for high-level purposes, your dashboard should show everything at one glance. If you have clusters A and B, of which one serves traffic at a time and the other one is the passive backup, you should not be required to know by heart which cluster is active. Instead, the dashboard should show health across clusters. You can still repeat your visualizations for each cluster, or query for by (cluster) if it proves helpful — but probably rather on low-level dashboards. For our example of an error metric, you want to know if it goes above a threshold, and sum(rate(…)) does not strictly require distinction by cluster in the high-level visualizations.

Dashboards-as-code and GitOps (committed dashboards)

The Grafana provisioning documentation describes how to use a local directory which Grafana will watch and load dashboards from. However, you cannot just write dashboards as plain JSON as a human. Also, more ergonomic ways of importing dashboards, e.g. from a Git repo, are not supported yet but Grafana Labs is considering improvements.

Here is the rough plan:

Generate Grafana-compatible JSON containing dashboard objects. The Grafonnet library is the official way to develop dashboards using the Jsonnet language. There is also Weaveworks' grafanalib for Python which is not presented in this article.
Add that build to CI. Deployment means that you have to make the JSON files available to Grafana in some directory.
Configure Grafana to pick up files from that directory
Also, support a developer workflow

Advantages of not creating dashboards visually through the Grafana UI:

You get a developer workflow. The later section Fast develop-deploy-view cycle explains how that works. It is surely worse than WYSIWYG. But a programming language such as jsonnet makes the repetitive parts so much more expressive and consistent. With file watching tools and a Grafana API key, you can deploy each saved change within a second and only need to reload in your browser. Very soon, it will be a great experience, once you have assembled some basic functionality and learned the language. While jsonnet is not the best or most well-known language, better alternatives such as CDK integration or other libraries may arise in the future. And once you have developed a dashboard, it is easy to improve in small increments, similar to software code. In fact, just like the main parts of your software code, dashboard code will typically be written once and then not touched for a long time. Codifying dashboards therefore leads to long-term consistency, yet making large changes easy. If you tell people to visually create dashboards instead of dashboard-as-code, after a few months you will see a bloat of outdated, awful, non-informative and unreviewed dashboards, with lots of them probably unused. Coded dashboards improve quality and allow you to throw out old stuff easily and with the needed 4-eye principle. You can still allow people to visually author dashboards, but tell them they will be automatically destroyed every Deleteday. Changes can be tested visually (at best with production data!) but should then be ported back into code. Once coded, a dashboard should go through review, and it is very likely that most changes reuse homegrown jsonnet functions instead of reinventing each dashboard from scratch. Simple improvements should become one-line changes. Reviewed dashboards are much more robust, stable and avoid outdated parts.
Consistent annotations such as deployment events, for instance by reusing a custom function which adds them everywhere
With such a custom base library of functionality, nobody needs to be an expert to get started making changes to monitoring
Vendor lock-in can be avoided to some extent. The Grafonnet library and custom object properties are highly Grafana-specific. No way around that. Even if you use WYSIWYG editing and storage, the dashboard is stored as Grafana-specific JSON, not transferable at all to other providers. I recommend to choose one observability platform and stick with it for years — just like you would for an infrastructure platform such as Kubernetes and its ecosystem. This article shows you how to cleanly manage your dashboards as code. That way, improving or fixing all dashboards at once is done within minutes, and you get the benefit of code review. If you write some high-level jsonnet functions such as addDashboard(prometheusQuery, yellowThreshold, redThreshold) (pseudocode), you can even abstract Grafana-specific stuff to some extent and later port more easily once the company switches to another observability provider or cloud product. I cannot provide experience or examples (yet) whether such an abstraction layer is worth the effort.
Old and unused stuff is easy to detect and delete. For example, you can grep for metric names or other things that do not exist anymore in your software, and delete those dashboards or visualizations from the code. Likewise, it is easy for a developer to find out where and if a metric is actually used for monitoring.
A monorepo keeps all observability-related things in one place. You do not want to copy the solution into every engineering team’s projects, since it may take some effort upfront to include it with your existing CI/CD/GitOps tooling, and decentralizing the solution defeats many advantages (consistency, shared base library functions, availability of good examples for new joiners to learn from). If you can reuse your software (mono)repository, even better, since that makes it easier to put relevant changes together — such as adding/removing/extending metrics or log fields.

Dashboard generation from jsonnet code using the grafonnet library

Let’s set up the generation of our dashboard from code. First, the necessary tools. jb (jsonnet-bundler) will be used as jsonnet package manager, and go-jsonnet (not the much slower C++ implementation!) for the language itself.

# macOS + Homebrew
brew install go-jsonnet jsonnet-bundler
# Any other OS / package manager combination
go install -a github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
go install github.com/google/go-jsonnet/cmd/jsonnet@latest

We need jsonnet libraries for the outputs we want to generate. In this case, Grafonnet for Grafana dashboards is enough. If you want to generate non-Grafana resources, consider the kube-prometheus collection which covers much of the Kubernetes landscape (but mind its Kubernetes version compatibility matrix).

cd my/version/control/repository
jb init
jb install https://github.com/grafana/grafonnet-lib/grafonnet
echo "/vendor" >>.gitignore

Commit jsonnetfile.json and jsonnetfile.lock.json. The lock file ensures that the next user gets the same version of libraries, so you do not need to commit the downloaded modules in the vendor directory. But that probably starts a flame war among Go developers, so please decide yourself…

Instead of jb, you could also use Git submodules, but probably will regret it after adding more dependencies — I did not test that alternative.

First write a minimal dashboard as code and save as dashboards/payment-gateway.jsonnet:

local grafana = import 'grafonnet/grafana.libsonnet';

grafana.dashboard.new(
  timezone='utc',
  title='Payment gateway (high-level)',
  uid='payment-gateway',
)
.addPanel(
  grafana.text.new(
    content='Yippie',
    mode='markdown',
  ),
  gridPos={
    x: 0,
    y: 0,
    w: 24,
    h: 2,
  },
)

Play around manually:

# This may fail if you forget the newline at the end of the file, or the code is
# otherwise not compatible with jsonnetfmt's expectations
jsonnetfmt --test dashboards/payment-gateway.jsonnet || echo "ERROR: File must be reformatted" >&2 # use in CI and IDE ;)

export JSONNET_PATH="$(realpath vendor)" # same as using `-J vendor` argument for the below commands
jsonnet-lint dashboards/payment-gateway.jsonnet # use in CI and IDE ;)
jsonnet dashboards/payment-gateway.jsonnet

The last command outputs a valid Grafana dashboard as JSON. To manually apply it, open the test dashboard in your Grafana instance, then ⚙️ > JSON Model > copy-paste generated JSON > Save Changes. You should now see the dashboard as described by code — containing only a text panel that says "Yippie". This is the simplest development workflow. But it is very tiring to always go and copy-paste some generated blob and save it with many clicks. And the workflow is not visual (WYSIWYG). The next section explains a better way.

Fast develop-deploy-view cycle

While there is no WYSIWYG editor for the whole conversion from jsonnet to a visual dashboard in Grafana, here is an alternative which works right now (in 2022):

Create a personal API key (side bar > Configuration > API Keys) with Editor permission
Use entr or other file watching tool to execute a script whenever you save your jsonnet files. IDEs may offer this action-on-save feature as well.
That script uses your API key to overwrite dashboards in your Grafana instance

Preparation in your shell:

export GRAFANA_API_KEY="THE_API_KEY_YOU_CREATED_WITH_EDITOR_PERMISSION"
export GRAFANA_URL="THE_GRAFANA_URL" # if you use devenv: `export GRAFANA_URL="http://localhost:3001/"`

Next, store the following script as watch.sh:

#!/usr/bin/env bash
set -eu -o pipefail

error() {
	>&2 echo "ERROR:" "${@}"
	exit 1
}

[ -n "${GRAFANA_API_KEY:-}" ] || error "Invalid GRAFANA_API_KEY"
[[ "${GRAFANA_URL:-}" =~ ^https?://[^/]+/$ ]] || error "Invalid GRAFANA_URL (example: 'http://localhost:3001/' incl. slash at end)"

[ $# = 1 ] || error "Usage: $(basename "${0}") JSONNET_FILE_OF_DASHBOARD"
dashboard_jsonnet_file="${1}"

rendered_json_file="/tmp/$(basename "${dashboard_jsonnet_file%.jsonnet}").rendered.json"

cat >/tmp/render-and-upload-dashboard.sh <<-EOF
	#!/usr/bin/env bash
	set -euo pipefail
	clear

	# Render
	echo "Will render to \${2}"
	JSONNET_PATH="\$(realpath vendor)"
	export JSONNET_PATH
	jsonnet-lint "\${1}"
	jsonnet -o "\${2}" "\${1}"

	# Enable editable flag and upload via Grafana API
	cat "\${2}" \
		| jq '{"dashboard":.,"folderId":0,"overwrite":true} | .dashboard.editable = true' \
		| curl \
			--fail-with-body \
			-sS \
			-X POST \
			-H "Authorization: Bearer \${GRAFANA_API_KEY}" \
			-H "Content-Type: application/json" \
			--data-binary @- "${GRAFANA_URL}api/dashboards/db" \
		&& printf '\nDashboard uploaded at: %s\n' "$(date)" \
		|| { >&2 printf '\nERROR: Failed to upload dashboard\n'; exit 1; }
EOF
chmod +x /tmp/render-and-upload-dashboard.sh

echo "${dashboard_jsonnet_file}" | entr /tmp/render-and-upload-dashboard.sh /_ "${rendered_json_file}"

Run the script, passing the source files as argument.

# Make script executable
chmod +x watch.sh

./watch.sh dashboards/payment-gateway.jsonnet

The script listens for changes to the source file and then overwrites the dashboard in your Grafana instance with an API request. Open your Grafana instance and find the dashboard by its title. Mind that the uid field in source code must be set to a fixed value per dashboard in order to overwrite the dashboard instead of creating new ones.

This workflow gives you results within seconds and you only need to refresh in your browser to see saved changes. You may want to keep an eye on your terminal, since jsonnet is a compiled language and therefore spits out errors if you make coding mistakes.

Automatic provisioning of generated dashboards into Grafana instance

Grafana provisioning allows automatic reloading of dashboards from a certain place. We want to load the committed, generated dashboards. On the other hand, we will not fully recreate Grafana and its database on every commit to some control repository — I’d call that murder by GitOps, and the sheer idea does not sound useful, as users and their settings are stored in the database, so we do not want to manage everything as code.

We can set up Grafana in various ways: via Ansible on a single server, with containers on Docker or Kubernetes, manually run on the company’s historic Raspberry Pi in the CEO’s closet, etc. They luckily all work the same way for configuration: local files.

In the dashboard provisioning section, it says you can put "one or more YAML configuration files in the provisioning/dashboards directory". We try that with a realistic setup of Grafana, and for the sake of simplicity, we assume a GitOps model, meaning that Grafana loads dashboards from committed files. You have to adapt this article yourself to your respective setup.

The following instructions show you how to do this on a self-hosted Kubernetes setup of Grafana (see their setup instructions). As I do not have a Grafana Cloud account right now, I cannot tell if the cloud offering provides this much flexibility, or any good way of using the GitOps model — if they do, the documentation misses this important piece (as of 2022-04). We use kind to simulate a production Kubernetes cluster.

# Create Kubernetes cluster and a target namespace
kind create cluster --kubeconfig ~/.kube/config-kind
export KUBECONFIG=~/.kube/config-kind
kubectl create ns monitoring

# Install Prometheus so we have a data source to play with
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install -n monitoring prometheus prometheus-community/prometheus

To install from Grafana’s Helm chart, you need to configure it. Store the following content in grafana-values.yaml:

adminUser: admin
adminPassword: admin

# Disable persistence so all data is lost on restart. That's a little unfair to your users,
# though, so you may want to instead combine GitOps with a delete-every-Sunday concept.
# A real production setup would provide persistence, but that is out of scope for this article.
persistence:
  enabled: false

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: prometheus
        type: prometheus
        url: http://prometheus-server
        access: server
        isDefault: true

dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: default
        orgId: 1
        folder: ""
        type: file
        disableDeletion: false
        updateIntervalSeconds: 10 # how often Grafana will scan for changed dashboards
        allowUiUpdates: true
        options:
          path: /var/lib/grafana/dashboards/default
          foldersFromFilesStructure: false

rbac:
  extraRoleRules:
    # Allow k8s-sidecar image to read ConfigMap objects in same namespace
    - apiGroups: [""]
      resources: ["configmaps"]
      verbs: ["get", "watch", "list"]

extraContainers: |
  - name: collect-dashboard-configmaps-in-directory
    image: kiwigrid/k8s-sidecar:latest
    volumeMounts:
      - name: collection
        mountPath: /tmp/collection
    env:
      - name: LABEL
        value: "collect-me"
      - name: LABEL_VALUE
        value: "grafana-dashboard"
      - name: FOLDER
        value: /tmp/collection
      - name: RESOURCE
        value: configmap

extraVolumeMounts:
  # This will implicitly create an `emptyDir` volume as well (quite surprising),
  # so we do not require the value `extraContainerVolumes`.
  - name: collection
    mountPath: /var/lib/grafana/dashboards/default
    readOnly: true

And continue installation:

# Install Grafana
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install -f grafana-values.yaml -n monitoring grafana grafana/grafana

# Wait until the installation is ready, then keep this running in order to access
# Grafana in the browser
kubectl port-forward -n monitoring svc/grafana 7878:80

Now open http://127.0.0.1:7878/ and log in with admin:admin.

Grafana reads dashboards from a directory structure. In the Kubernetes world, we can put generated dashboards into a ConfigMap in order to mount it into the Grafana directory structure. They however have a 1 MiB limit each. An example high-level dashboard of mine takes 115 kiB when Base64-encoded, so you also will not be able to say "we can fix this problem later", since you will reach the limit very soon. We will render all dashboards into one manifest file. That YAML file will contain one ConfigMap object per dashboard. Committing that file in a GitOps fashion is easy to add in your CI (git clone && git add && git commit && git push), but that is out of scope for this article. You could also do CIOps (kubectl apply from CI; not recommended), or just rsync -r --delete if you have Grafana on physical, mutable hardware and not on Kubernetes, or whatever other way of deployment to the directory structure. Make sure you overwrite during deployment instead of only adding new files/dashboards, since deletion and cleanup of technical debt is just as important as it is for writing software. Treat your Grafana instance and database as something that gets reset regularly, from a committed state. This avoids people making manual, unreviewed edits.

As you can see in the configuration, we use k8s-sidecar to automatically collect all dashboard JSON files into one directory for use by Grafana. Each ConfigMap must have the label collect-me: grafana-dashboard to get picked up. The following script creates such ConfigMap manifests. I do not explain here how to integrate it with your specific CI tool, but that should be easy if it works locally. Save the script as render-and-configmap.sh.

#!/usr/bin/env bash
set -eu -o pipefail

error() {
	>&2 echo "ERROR:" "${@}"
	exit 1
}

[ $# = 1 ] || error "Usage: $(basename "${0}") JSONNET_FILE_OF_DASHBOARD"
dashboard_jsonnet_file="${1}"

rendered_json_file="/tmp/$(basename "${dashboard_jsonnet_file%.jsonnet}").rendered.json"

# Render
JSONNET_PATH="$(realpath vendor)"
export JSONNET_PATH
jsonnet-lint "${dashboard_jsonnet_file}"
jsonnet -o "${rendered_json_file}" "${dashboard_jsonnet_file}"

# Grafana wants `.json` file extension to pick up dashboards
kubectl create configmap "$(echo "${dashboard_jsonnet_file}" | openssl sha1)" \
	--from-file="$(basename "${dashboard_jsonnet_file%.jsonnet}").json"="${rendered_json_file}" \
	--dry-run=client -o json \
	| jq '.metadata.labels["collect-me"]="grafana-dashboard"'

Ensure you are still pointing KUBECONFIG to the desired Kubernetes cluster, and test dashboard deployment like so:

# Make script executable
chmod +x render-and-configmap.sh

./render-and-configmap.sh dashboards/payment-gateway.jsonnet | kubectl apply -n monitoring -f -

Head over to the Grafana instance running in Kubernetes, and you see that the dashboard was already loaded. Integrate this with your CI pipeline, et voilà, you have a GitOps workflow!

Once done with the cluster, you can delete it:

kind delete cluster

jsonnet for dashboards — some tips

Recommendations specifically for the jsonnet language.

Distinguish environments

You will surely have different environments, such as dev/staging/prod. They have varying URLs, potentially different set of running systems, and thresholds for production not always make sense in pre-production environments. Pass --ext-str myCompanyEnv=prod to the jsonnet tool to pass in a variable which you can use inside the source code. This allows previewing dashboards with development data before a software feature even goes live, and you will have very consistent views across environments. Example usage:

assert std.extVar('myCompanyEnv') == 'dev' || std.extVar('myCompanyEnv') == 'prod';

{
  my_company_environment_config:: {
    dev: {
      environment_title: 'Development',
      prometheus_cluster_selector: 'k8s_cluster="dev.mycompany.example.com"',
    },
    prod: {
      environment_title: 'Production',
      prometheus_cluster_selector: 'k8s_cluster="prod.mycompany.example.com"',
    },
  }[std.extVar('myCompanyEnvironment')],
}

Alternatively, --ext-code-file seems also a viable option, but I have no experience with it (see external blog post Grafana dashboards and Jsonnet which showcases this parameter).

This can also be interesting if you have a dev/prod split for your Grafana instances.

Consider jsonnet as full programming language

A simple dashboard should be a matter of only few lines of code. Follow the jsonnet tutorial to learn more how to achieve that. You will see some similarities with Python, for example string formatting with %, slicing, array comprehension, modules/imports and other syntax that can make your code easier and shorter.

If you think your users are not technical enough, jsonnet may not be a good fit unless you either provide high-level functions, or replace the whole jsonnet+Grafonnet rendering idea with your own custom solution that does the same thing: output dashboard definitions as Grafana-compatible JSON.

Grafana-specific dashboard tips

Some small tips and their solution, some with jsonnet examples.

Categorize, do not make a dashboard mess

Use the folder structure to organize your dashboards. You and your colleagues will surely play around with test dashboards, and mixing them with production-ready, usable ones is not helpful. Also, particularly if you have many systems to watch, you want everything categorized for easy access. Product-unrelated dashboards, such as monitoring for Kubernetes clusters or infrastructure, can go into a separate category. Unfortunately, you cannot set the parent folder through jsonnet as of 2022-04, but it has to be achieved as part of deploying the generated dashboards.

Grouping within a dashboard

For very detailed dashboards, you may have a lot of graphs. While this is typically discouraged, your software may really have so many important metrics. In such case, group them into rows. They are collapsible and ease navigation. Maybe Grafana could consider adding a "Table of contents" feature to jump around quickly on a dashboard, using a navigation sidebar.

Rows

Make dashboards editable

In the developer workflow above, we explicitly set dashboards to editable. You may want this for the GitOps/CI workflow as well. This is helpful because incidents sometimes require a bit of playing around with shown data. Users should however not save changes, since they are supposed to be overwritten regularly by deploying dashboards from committed code.

By default, hovering over a graph with many series shows them in a box in alphabetical order of the display label, e.g. sorted by {{customer_name}}. Typically however, such as for error rate metrics, you want the top values shown first, since the bottom of the tooltip may be cut off in case of many entries. Go to setting Display > Hover tooltip > Sort order and adjust to your liking (e.g. Decreasing). With jsonnet, use grafana.graphPanel.new(sort="decreasing") (not documented as of 2022-04).

Tooltip sort order

Do not confuse the tooltip with the legend (which also has a configurable sort order!).

Y axis display range

Many metrics only produce non-negative numbers. Graphs with such a metric on the Y axis should therefore have Visualization > Axes > Y-Min set to 0 instead of auto in order to save screen space by not showing the negative area. Another problem is that you often want 0 to be the lower bound, or else Grafana chooses the display range based on the available data. In jsonnet: grafana.graphPanel.new(min=0). Setting the maximum may be helpful if you know the number range (e.g. disk full 0-100%) and want to have a consistent display.

See how the bad example on the left makes you think of a fluctuating metric. The corrected example on the right shows that in reality, the value is quite stable. In general, make trends easier to recognize for the eyes.

y min zero

Be cautious with your use case, though. If you want to display and warn when a disk gets full, for example, you better extrapolate and display the trend. Or use a Stat visualization with a warning color once the trend reaches "disk becomes full within 30 days". The respective alerts need to be designed in a similar way. Otherwise, if a short time range is selected, the user may not see that the disk usage is going up a lot, as the difference between 200 GiB and 210 GiB may not look dramatic with Y-Min set to zero.

Link to detailed dashboards, logs, other observability tools

Use grafana.graphPanel.new(…).addLinks(…) to create panel links:

grafana.graphPanel.new(
  // ...
).addTarget(
  prometheus.target(
    // ...
  )
).addLinks([
  {
    title: 'Logs - Payment gateway',
    url: ...,
  },
  // ...
])

For links to detailed dashboards, consistently pre-select a reasonable time frame such as now-30m.

If you consistently tag dashboards, you can use dashboard links to put clickable links to related dashboards on top. You can also add external links such as other company tools. I have not used this feature yet and typically rather repeat the links on each panel since that does not require scrolling all the way to the top. With jsonnet, it is easy to provide a consistent set of links (as dashboard or panel links).

Use variables for repetitive values

In rare cases, you want a repetitive variable such as datacenter = cluster="dc"\,host=~"server.*" so that queries become less verbose: sum by (payment_method, error_type) (rate(payment_errors_total{$datacenter}[2m])). If the value is used in a label filter of a Prometheus query, as in this example, remember that commas need to be escaped with a backslash, or else Grafana treats the comma as separator between different choices for the variable value.

Custom variable

Even if you use jsonnet, you should use variables instead of filling a hardcoded value into each query. This allows users to change all visualizations on a dashboard at once (at ⚙️ > Variables).

Consider hiding those variables on the dashboard if their sole purpose is to avoid repeated, hardcoded values. See also below for some rationale.

Clearly differentiate environments

You do not want to be looking at a development dashboard while debugging a production issue, so make that mistake impossible to happen. Possible solutions:

Separate Grafana instance per environment. See this Reddit thread for some options.
Set the category and title of each dashboard so that non-production ones show a clear hint
Different colors and backgrounds
Different Grafana UI theme per environment. I am not aware of an official way to customize styles using CSS or external themes. You could patch built-in themes and build Grafana yourself, or use the Boomtheme plugin. I did not test those options. Users can change their own preference (light vs. dark), so this idea anyway does not really help unless you hardcode one fixed, customized theme. The feature request Custom UI themes discusses solutions and describes drawbacks of the available plugin.

Grid positioning

The grid position must be specified explicitly:

.addPanel(
  [...],
  gridPos={
    x: 0,
    y: 0,
    w: 24,
    h: 12,
  },
)

See Panel size and position documentation — width is split in 24 columns, height is 30 pixels each. I recommend you use 12 or 24 columns width for readability on small screens, and set a reasonable, consistent height for all visualizations on a dashboard.

As of 2022-04, you cannot easily align visualizations using jsonnet. You can hardcode x/y absolute values to your liking, but that is a hassle since you do not want to develop a user interface in an absolute grid, right? I recommend setting both to 0 in order to automatically align the visualizations on screen.

Choose the right data unit

Choose the right unit, e.g. Seconds, Requests per second, etc.

Data unit

Mind subtle differences between the built-in choices, e.g. Duration / seconds will show long text such as "412 milliseconds" which makes it hard to put much information on one screen — consider using Time / seconds instead.

Also, do not confuse the order of magnitude: if your data is provided in seconds, do not choose Time / milliseconds since that would show falsified values.

UTC timezone everywhere

Grafana’s default is to use the browser timezone. Particularly for international companies or those who have international customers, consistent values and avoidance of confusion are important. Employees usually do not open the user settings page, for example to choose light/dark mode or their timezone preference, resulting in inconsistent customer and incident communication regarding dates and times. I am working from Germany and keep seeing confusion between CET/CEST once daylight saving time toggles, and sometimes even do such mistakes myself. Let’s avoid that and communicate only in UTC, and default to UTC in tools such as Grafana.

By writing a jsonnet wrapper function instead of using raw calls to grafana.dashboard.new, you can set that as your default for generated dashboards.

Somewhat related xkcd comic: ISO 8601. Did you know that the Z suffix in 2022-04-21T17:13Z stands for UTC ("Zulu time") and is therefore a pretty good abbreviation? Most non-technical people rather know the suffix "UTC", so that one is preferable.

Do not rely on `default` data source

Even if you rely on one Prometheus-compatible source in the beginning, you will very likely add more data sources, or migrate to another one, in the future. Therefore, explicitly define the source in each visualization. In jsonnet, use for example grafana.statPanel.new(datasource='thanos'). In general, do not ever name something default, anywhere. The same applies to the words "old" and "new", since "new" is always the next "old".

Heatmaps

Those are hard to set up since the UI does not give guidance. You have to set several options correctly to see reasonable results:

Prometheus query example: sum by (le) (increase(prometheus_http_request_duration_seconds_bucket[1m]))
Query > Format: Set to Heatmap instead of Time series
Visualization: Choose type Heatmap
Visualization > Y Axis > Data format: Time series buckets
Visualization > Y Axis > Unit: Choose according to the metric, typically seconds (s)
Visualization > Y Axis > Decimals: For seconds (s) or other time unit, use 0 decimals, as Grafana automatically shows the appropriate text "ms"/"s"/"min", so the .0 decimal after each axis label is useless. This would best be fixed within Grafana source code.
The Y axis will only be sorted numerically once you change the query legend to {{le}}
Visualization > Display > Colors: I recommend opacity-based coloring with full blue (rgb(0,0,255)) as strongest color, in order to see things without getting eye strain or having to come close to the monitor. Use a color that is visible with light and dark theme. I would love to see the thresholds feature for the heatmap visualization as well, so that good values can be colored green, and bad ones yellow or red. Right now, colors are assigned by how often a value range ("bucket") appeared, not by the value itself — that means your eyes have to rest on the visualization for some seconds to understand it.
Visualization > Tooltip: Enable, and optionally show the histogram for fast interpretation of the value distribution on mouseover
This option seems not available through the UI anymore for Prometheus data sources, but let me put this here for the record: if the visualization is showing too detailed information (too many faded bars), limit Query > Query options > Max data points to e.g. 24.

Heatmap visualization

Instead of a histogram, laying out the information as percentiles on a Stat visualization may give a faster overview and is preferable on high-level dashboards. For example, p50 (median), p95 and p99 percentiles are often useful. Make sure you use the rate function inside histogram_quantile.

Example Prometheus query: histogram_quantile(0.95, sum(rate(prometheus_http_request_duration_seconds_bucket[2m])) by (le))

3 percentiles in a Stat visualization

Display interesting events as annotations

Grafana annotations can mark interesting time points on graphs. Among many imaginable events to enrich on your dashboards, software and infrastructure deployments are the most interesting ones since change in a technology-driven company usually means risk and the potential for failure. In an incident, the starting point is often known quite soon by looking at dashboards. If graphs additionally show whether and when changes were made, you have better chances to find the cause.

Do not bother adding annotations manually (e.g. time window of every deployment), since people will forget the procedure, get the timezone wrong, and it only adds an unnecessary burden which should be automated.

Here is an example how to consistently show Argo CD sync events on your dashboards. Those mostly relate to real deployments. When I developed that query, no better, human-level event type was available. You may want to tweak this to your own use cases.

local grafana = import 'grafonnet/grafana.libsonnet';

{
  deployments:: grafana.annotation.datasource(
    name='Deployments related to payment methods and their infrastructure',
    datasource='loki',
    expr=|||
      {app_kubernetes_io_name="argocd-application-controller"}
      |~ "reason=(OperationStarted|ResourceUpdated)"
      | logfmt
      | dest_namespace =~ "payment-methods|ingress-nginx"
        and
        msg =~ "(?i).*?(?:initiated.*(?:sync|rollback)|sync status: OutOfSync -> Synced|health status: Progressing ->).*"
      | line_format `App {{.application}}, namespace {{.dest_namespace}}, cluster {{.cluster}}: {{.msg}}`
    |||,
    iconColor='blue',
  ),
}

You can now use grafana.dashboard.new(…).addAnnotation(deployments) to add the annotations to your dashboard.

Self-describing visualization titles

Each visualization’s title should be self-describing. Bad: "Error rate". Good: "Payment methods — error rate of requests to provider".

One reason is because you can link to a single visualization which is helpful during incidents to tell others what exactly you are looking at (or to present one detail on a really large TV):

View single visualization

And again, it helps the eyes to quickly get a full picture instead of having to look at multiple locations on screen.

In Grafana, the dashboard title is always displayed, even for such single-visualization URLs. So if your dashboard is nicely titled "Payment gateway (high-level)", that will already be a good starting point and you may not even need or want verbose visualization titles.

For averaging queries like sum(rate(metric[5m])), which may constitute most of your dashboards, you should consider adding the interval hint (e.g. abbreviated [5m]) to the visualization title — and/or the Y axis — so that users are aware how fast a recovered metric or an error peak will become visible.

Observability tips not specific to Grafana

These tips relate for example to Prometheus query practices and other things that do not require Grafana in the monitoring stack per se.

Stay consistent in naming metrics and labels

The Prometheus naming practices page gives very good guidance, such as to use lower_snake_case, name counters xxx_total or specify the unit such as xxx_seconds.

No need to create a metric for everything / how to easily get started monitoring an uninstrumented application

The 3 current pillars of observability — metrics, logs and traces — may not remain considered the best solution forever. We can expect tooling to try and combine them in the future, such as "metrics from logs" features. You want to observe your applications with minimum instrumentation effort? Then sometimes, a LogQL query such as sum(count_over_time(… [15m])) to look for specific log lines may be what you want (temporarily), instead of developing and maintaining a new metric. Beware however that log text tends to change much more frequently than metric names, and how much slower and more expensive it is to query logs. A totally uninstrumented application can easily be monitored if you have access to its logs. Later on, you can make the dashboards more efficient once you learned which indicators are important to show the application health, and which ones are not. Very helpful if you are just getting started with observability.

Show only offenders or top N problematic items

You can use > 0 or topk(…) > 5 to display only offenders in your high-level dashboard (please also read Keep panels small on screen above).

For example, the customers with the highest concurrency of API requests. Use Value mappings feature to map "null" to e.g. "currently low concurrency" for better understanding in humans (since Stat visualizations always show something). Together with green/yellow/red thresholds, this explains in 2 seconds what the current value is and whether it is problematic. As explained before, use Calculation > Last if only the latest value is relevant — you do not care about the Average API concurrency over 3 hours while debugging an incident, right?

In our payment example, we could alternatively show the payment methods with the highest rate of errors. Or depending on the business, define each payment method’s business importance in code and then only show the most critical products with a label filter (e.g. importance="boss_says_this_is_super_critical"; or name them "Tier1", "Tier2", etc.).

Note that topk(5, …) may not show you the top 5 items if evaluated for a graph, since the "top 5" are checked for many time points and all resulting items are shown. The same applies to Stat visualizations — unless you choose Instant to only choose the end time point, but that can falsify the desired data to show. If you really want only up to 5 items to be shown, use the @ end() modifier to evaluate only the latest top-5 values in the range (thanks to a blog reader for pointing this out! – available since Prometheus v2.25.0).

Mind test and synthetic traffic

In a modern infrastructure, you might run synthetic test traffic to verify the end-to-end health of your applications. Since those are not from real customers, you should check if that should be shown differently or excluded from certain dashboards or visualizations.

Daytime vs. nighttime

If your business is mostly in a certain region or timezone of the world, such as European payments, traffic goes down at night. Consider different error and request thresholds at day and night, respectively. Visualizations should be clearly distinguished with e.g. 🔆 or 🌒 in the title.

Prometheus allows the time distinction with and/unless hour() >=6 <21. This can be tricky, though: in special cases such as calculations sum(…)/sum(…) and hour() >=6 <21, label set matching will surprise you with an empty result. Example to fix that: (sum by (something) (rate(some_metric[15m]))) / sum by (something) (rate(some_metric2[15m])) and ignoring(something) hour() >=6 <21.

This is cumbersome and should be avoided for the start, unless you really need such a strong distinction by time.

Do not use `rate(…)` alone

The same applies to calculations like rate(…) / rate(…). Why? Any change to the labels will make them explode into many series. Combine rate with sum or sum by.

Prefer counters over gauges

A counter in Prometheus represents a value that can only increase. In contrast, a gauge can take an arbitrary value.

In regular scrape intervals, a metric’s value gets collected by Prometheus. A longer scrape interval means less storage and cost, but can mean that a short spike of a gauge’s value is not stored at all, and therefore also will not produce an alert. So prefer a counter if possible for your for use case, since its value does not lose increments (but on the other hand, it only supports increments).

Good example for using a gauge: queue size. Items can be processed, i.e. removed from the queue, or added. The more interesting metrics for queues however are error rate and per-item processing time.

Make observed components distinguishable

To find a root cause quickly in case of problems, dashboards must allow drilling down into details. In our example of payment methods as products, each of them could fail separately, or several/all at once. This must be visible in visualizations and alert messages.

Examples why this distinction is important:

1 payment method failing — only that application’s code might be affected, for example from a bad change recently deployed
Multiple payment methods failing — perhaps those have something in common, such as serving traffic from a certain cloud region or Kubernetes cluster, or which are otherwise special (in the middle of a migration, feature toggled, traffic pattern changed, rate limit of database reached, etc.)
All payment methods failing — bad code change affecting all those applications was introduced, networking issues, infrastructure down, other catastrophic scenario

Other ideas for details to drill down into: per customer, per Kubernetes cluster, per cloud region, per API endpoint. For some of these, you may be able to leverage variables (mind Avoid dropdowns for variable values), while some value ranges may simply be too large — for instance if you have a million customers — and you should rather show the top N problematic ones (Show only offenders or top N problematic items).

Summary

I showed how high-level dashboards and main business metrics cover most of your monitoring and incident resolution needs. On top of that, the article explains the advantages of dashboards as code and how to apply that concept, using jsonnet, the Grafonnet library and working scripts to integrate in your developer and CI/CD/GitOps workflow. Lastly, I listed the best practices for dashboard creation and visualization so that your monitoring becomes easier and faster to use.

Out of scope

Detailed relation to logs, traces, alerting, and other tools. Great dashboards can help you shape alerting — particularly, I mean that if you have built an understandable and quickly navigable dashboard without any clutter, then alerts should cover those observed areas. For example, if your revenue is driven by successful outcomes of payments, that should be on your main dashboard of the payment gateway, and represented in alerts. Such a high-level alert can replace a hundred fine-grained alerts. How? Here’s an example alert: "for payment method SuperFastPay, alert if there are more than 50 failed payments per minute" (set this value based on an expected failure rate). Once such an alert is received, and the on-call engineer opens the main dashboard, or the SuperFastPay-specific dashboard (if that even makes sense), it should show red for that component. The detailed dashboard may show things like failure type statistics based on metrics, or the most common recent errors in logs. If it shows you mainly internet/connectivity issues, follow your way to the payment logs and infrastructure dashboards, for example (which at best would be linked). In the end, you may find that one of your cloud availability zones A/B/C, in which the software runs, does not have internet access. And that only by getting alerted about the most important business symptom, not because you had put large effort into monitoring internet connectivity from those availability zones. If only your dashboards make sense, allowing you to navigate quickly from symptom to root cause, you can probably live with fewer alert definitions overall. This example is not from production, but a wild dream of mine if all the suggestions are optimally applied. Surely you still want alerts for symptoms in infrastructure/platform/network, particularly if the company reaches a scale where those are handled by separate teams, but those alerts then may not need highest priority ("P1") — while business-critical symptoms like failing payments of your customers should be P1 alerts. The fewer high priority alerts you have, the better people’s work life, sleep and therefore productivity will be.
How hard it is to convince people of doing dashboards in code. There are very valid points against it, such as the missing WYSIWYG support as of 2022-04. Those can mostly be resolved with good tooling or a reasonable "how to develop a dashboard" README file. Other concerns are often just opinionated, and you will simply need to take the decision "do we allow it to become a mess or not". I recommend vendors to make codifying resources easier, so that even less technical people will be able to work with this concept. Exporting a visually-crafted dashboard to JSON is unfortunately not a solution, since that diminishes many of the advantages explained in this article (such as consistency).
The article is all about live monitoring of a service/system which could have incidents at any time. For example, an API serving requests for customers. There are lots of other use cases where monitoring, alerting and tracing may help, such as performance issues, SLOs, business statistics and intelligence.
Accessibility / color blindness support. Red and green may not be the best options, but I do not have the experience to give help here.
Installation and maintenance of the observability stack does not belong in this article. Dashboards are most helpful if you can also look at historical data and not only use them for short-term review of incidents. Therefore, prefer using long-term storage such as Thanos or Cortex (and since recently in March 2022: Mimir). Those solutions provide a very good backup solution as well.
The current jsonnet+Grafonnet solution for generated dashboards is not the final stage of evolution. Tooling like CDK could be adapted so dashboards can be written in a Grafana-agnostic way, using great languages like TypeScript. For now, if you go with jsonnet, I recommend you implement common functions that abstract the Grafana details away and set reasonable defaults everywhere.
Observing only the main business metric(s) is not sufficient. Particularly when you have split into several engineering teams or even have a platform infrastructure / DevOps / SRE team, specific monitoring depending on the teams' respective responsibility makes a lot of sense. In our example business, watching the health of partner or provider companies can make sense, since they may not have the most modern health monitoring in place. For examples, Grafana Labs has acquired k6 which can be used for load tests, but in the future hopefully also to monitor TLS certificate expiry (until that feature exists, Blackbox exporter is a reasonable tool). Try a "pre-mortem" brainstorming session to think of what could go wrong, and you will find many things to monitor which are not covered by the main metrics. Consider also "value under threshold" checks, since an error rate of zero could simply come from zero requests per second, and that can mean a whole service or feature is not working, or customers cannot reach your API.
Recommendations in this article were collected mainly in 2020-2021, before Tempo/tracing, exemplars and k6 were in wide-spread use. All these can prove helpful in combination with metrics-based monitoring.
Training. As mentioned, I think a good solution survives without training, but instead has proper and concise documentation, and the code speaks for itself. There are very few professional training and recommendation videos on the internet around dashboarding, and the available beginner content often showcases "The more metrics/graphs on a dashboard, the better" 😬. I cannot disagree more, so please try my "high-level dashboard + most important business metric" approach first and see if you prefer that, or rather a jungle of messy, unreviewed stuff which fosters a useless and long-winded tooling replacement every 2-3 years. See also the Grafana webinar Getting started with Grafana dashboard design for a gentle introduction which requires less upfront knowledge about Grafana compared to my article. The video however has some examples where dashboards are too crowded for my taste.
Advanced observability features. Grafana and its competitors offer quite interesting features such as anomaly detection (Datadog) (also possible with Prometheus — interesting blog post), error/crash tracking (Sentry) and others. Those deserve a place on dashboards if reasonably applicable to your products.
Auto-deletion of manually authored changes: Remember my term Deleteday from above? Make sure to implement that. At best, your deployment from CI simply takes care to replace all existing dashboards, including those not created by code. Think of the deployment like rsync -a --delete committed-dashboards production-grafana-instance.

Setting up buildbot in FreeBSD jails

April 22, 2018

In this article, I would like to present a tutorial to set up buildbot, a continuous integration (CI) software (like Jenkins, drone, etc.), making use of FreeBSD’s containerization mechanism "jails". We will cover terminology, rationale for using both buildbot and jails together, and installation steps. At the end, you will have a working buildbot instance using its sample build configuration, ready to play around with your own CI plans (or even CD, it’s very flexible!). Some hints for production-grade installations are given, but the tutorial steps are meant for a test environment (namely a virtual machine). Buildbot’s configuration and detailed concepts are not in scope here.

Choosing host operating system and version for buildbot
Create a FreeBSD playground
Introduction to jails
Overview of buildbot
Set up jails
Install buildbot master
Run buildbot master
Install buildbot worker
Run buildbot worker
Set up web server nginx to access buildbot UI
Run your first build
Production hints
Finished!

Choosing host operating system and version for buildbot

We choose the released version of FreeBSD (11.1-RELEASE at the moment). There is no particular reason for it, and as a matter of fact buildbot as a Python-based server is very cross-platform; therefore the underlying OS platform and version should not make a large difference.

It will make a difference for what you do with buildbot, however. For instance, poudriere is the de-facto standard for building packages from source on FreeBSD. Builds run in jails which may be any FreeBSD base system version older or equal to the host’s version (reason will be explained below). In other words, if the host is FreeBSD 11.1, build jails created by poudriere could e.g. use 9.1, 10.3, 11.0, 11.1, but potentially not version 12 or newer because of incompatibilities with the host’s kernel (jails do not run their own kernel as full virtual machines do). To not prolong this article over the intended scope, the details of which nice things could be done or automated with buildbot are not covered.

Package names on the FreeBSD platform are independent of the OS version, since external software (as in: not part of base system) is maintained in FreeBSD ports. So, if your chosen FreeBSD version (here: 11) is still officially supported, the packages mentioned in this post should work. In the unlikely event of package name changes before you read this article, you should be able to find the actual package names like pkg search buildbot.

Other operating systems like the various Linux distributions will use different package names but might also offer buildbot pre-packaged. If not, the buildbot installation manual offers steps to install it manually. In such case, the downside is that you will have to maintain and update the buildbot modules outside the stability and (semi-)automatic updates of your OS packages.

Create a FreeBSD playground

Vagrant is a popular tool to quickly set up virtual machines from pre-built images. We are using it here for simplicity. Any form of test environment or virtual machine would suffice. If you choose to follow along using Vagrant, please install it and ensure you have a compatible hypervisor installed as well in order to run a virtual machine (for instance VirtualBox).

Official and nightly FreeBSD images for Vagrant are available. With the following commands, we create a new directory for the playground virtual machine (called "VM" from here on) and then use Vagrant to download the FreeBSD 11.1-RELEASE image. Ensure you have enough disk space: the image presented here has around 1.4 GB, and you additionally need to allocate space for the VM.

mkdir -p ~/vagrant/freebsd-11.1-buildbot
cd ~/vagrant/freebsd-11.1-buildbot
vagrant init freebsd/FreeBSD-11.1-RELEASE

After vagrant init, the image is available to create new VMs and a Vagrantfile was created in the current directory. We must edit the file, because the metadata (contained in what Vagrant calls a "box" = disk image + metadata) is missing two pieces of information: base MAC address and shell (see bug report). Vagrant’s default shell is bash -l, but FreeBSD does not ship bash in its base system; hence we use sh. Also, we will disable synced folders as we will not need them here and they do not work out of the box (literally!). Without the commented sample configurations, the file should look as follows:

Vagrant.configure("2") do |config|
  config.vm.box = "freebsd/FreeBSD-11.1-RELEASE"
  config.ssh.shell = "/bin/sh"
  config.vm.base_mac = "080027D14C66"
  config.vm.synced_folder ".", "/vagrant", disabled: true
  config.vm.network "forwarded_port", guest: 80, host: 8999
end

Now let’s provision the virtual machine:

vagrant up

If you see messages like Warning: Connection reset. Retrying… for a while, keep hanging on — the official FreeBSD image defaults to connect to the Internet on first startup in order to fetch and install the latest updates. This can take a few minutes and several VM reboots.

Once the VM has fully booted, we can drop into a terminal via SSH. Vagrant handles the connection details for us:

vagrant ssh

Remember we set /bin/sh as shell in the Vagrantfile? Confusingly, Vagrant 2.0.3 needs this setting to work (else fails while bringing up the virtual machine), but now totally ignores the setting and we find ourselves in csh, the default configured for the connecting user account 🙄. You can recognize it from its default vagrant@freebsd:~ % shell prompt (sh uses $ without extra information), or type ps -p $$ to show details about the shell itself (where $$ resolves to the shell process ID in all popular shells). If you are more familiar with a different shell, you could for example install and use bash like so: sudo pkg install bash && chsh && sudo chsh. If you decide to stick to the default terminal csh, ensure you do not copy-and-paste example shell command lines starting with #, as those are not interpreted as comments in interactive csh shells.

Introduction to jails

FreeBSD has been supporting the concept of jails since the start of its 4.x release series in the year 2000. This is way before its modern competitors LXC/Docker/rkt and — like most other mechanisms — OS-specific. Some people say that jails are more mature. Since I have not worked with any Linux container mechanisms after OpenVZ many years back, I cannot give any experience or comparison here, and in any case it would probably be apples vs. pears; I like pears when they lay around a little and got soft.

Jails work like a full FreeBSD environment, but access to the outer system’s resources is restricted. For example, a jail may only listen on a network interface and IP address that was assigned to it. Filesystem access and other permissions like mounting of filesystems is (configurably) limited, as well (similar to a chroot environment). The performance difference of running software in a jail vs. directly on the jailhost is usually not noticeable (somewhat related study: packet routing performance analysis by Olivier Cochard-Labbé at EuroBSDcon 2017).

No other operating systems like Linux or Windows can be run in a jail, because the kernel is shared among jailhost (this is what I will call the outer operating system in this article) and all jails. For the same reason, running e.g. FreeBSD 12 in a jail — while the host is still on FreeBSD 11 — might not work because software built for the newer OS version may expect a different kernel interface and crash if run with the older kernel.

Overview of buildbot

Buildbot is a very versatile software. While I mentioned its main use as CI (Continuous Integration) and probably even CD (Continuous Delivery/Deployment) platform, it could theoretically do any automated task that runs on a computer. It’s just so that the "batteries included" are mostly related to building software. If you need something else, you can easily write build steps and other things in your Python-based master configuration file.

The main components to understand are the buildbot master and buildbot worker:

buildbot master: component which parses all build configuration and other settings (notification e-mails, change sources such as Git repositories, when builds are triggered/scheduled, etc.) and distributes the actual builds to its workers.
buildbot worker: a dumb component which only has connection details as configuration and gets all other commands from the master, namely to run builds. There could be multiple, and in large production setups, it makes a lot of sense to put them onto powerful, separate servers. Ephemeral workers (buildbot calls them "latent workers"), i.e. dynamically created and destroyed instances, are another option and support for several cloud providers and hypervisors is included. In this article, we will start small and set up a single, jailed worker which may be enough for your first steps with buildbot. You can later easily add/move workers somewhere else if you see the need.

Set up jails

Jails are a cheap way to semantically (and security-wise) separate applications or groups of them. If we later want to move the buildbot worker component or clone it, it is easiest to have the worker — and nothing else — in a jail.

We begin by installing ezjail, a very popular and stable wrapper around FreeBSD’s jail functionality. It makes creation and administration of jails much easier.

sudo pkg install ezjail
# Create directory structure and "base jail" i.e. extract base
# FreeBSD system to /usr/jails/basejail
sudo ezjail-admin install

Now it’s time to actually create the jails. Since the master offers a web UI and the worker talks to the master, both need IP addresses assigned. For simplicity, we choose local-only addresses here (network 10.0.0.0/24).

Jail networking has several gotchas, one of them being how loopback addresses are handled: namely, when accessing the IP addresses 127.0.0.1 and ::1 inside the jail, the connection does not end up on the jailhost’s loopback interface (else jails could access its parent’s services — a security hole), but the kernel rewrites those connections to the first IPv4/IPv6 address assigned to the jail. If the first assigned IP address is public and a service in the jail listens on 127.0.0.1:1234, port 1234 will suddenly be publically accessible! Therefore, the recommended practice is to have a separate network interface for jails (you could even have one per jail, but in this tutorial we want the jails to communicate with each other directly). This works by "cloning" lo0 into the new interface lo1.

# Configure a separate network interface for jails
sudo sysrc cloned_interfaces+=lo1

# We can assign an IP to the server ("jailhost") as well. Needed in
# this tutorial so jailhost and jails can communicate (we will
# serve buildbot's web user interface with nginx later).
sudo sysrc ifconfig_lo1="inet 10.0.0.240 netmask 255.255.255.0"

# Create the cloned interface (automatically happens at next boot as
# well, no need to repeat this step)
sudo service netif cloneup


# Set default network interface for jails (if not explicitly configured)
sudo sysrc jail_interface=lo1
# Start ezjail's configured jails on boot
sudo sysrc ezjail_enable=YES
# Actually create our jails
sudo ezjail-admin create -f example master "10.0.0.2/24"
sudo ezjail-admin create -f example worker0 "10.0.0.3/24"
# Start all ezjail-managed jails (will also happen on reboot because
# of ezjail_enable=YES). Please ignore the warning
# "Per-jail configuration via jail_* variables is obsolete" - ezjail
# simply has not been changed yet to use another mechanism.
sudo ezjail-admin start

The jails have successfully started, but to do something useful — like installing packages inside — we want Internet access from within the jails (at least if you decide to use the official source pkg.freebsd.org). For that purpose, we set up a NAT networking rule using one of FreeBSD’s built-in firewalls (or rather: package filters), pf.

sudo tee /etc/pf.conf <<EOF
ext_if = "em0" # external network interface, adapt to your hardware/network if needed
jail_if = "lo1" # the interface we chose for communication between jails

# Allow jails to access Internet via NAT, but avoid NAT within same network so jails can
# communicate with each other
no nat on \$ext_if from (\$jail_if:network) to (\$jail_if:network)
nat on \$ext_if from (\$jail_if:network) to any -> \$ext_if
# Note: above two rules split for clarity -> equivalent to this one-liner:
# nat on \$ext_if from (\$jail_if:network) to ! (\$jail_if:network) -> \$ext_if

# No restrictions on jail network
set skip on \$jail_if

# Common recommended pf rules, not exactly related to this article
set skip on lo0
block drop in
pass out on \$ext_if

# Don't lock ourselves out from SSH
pass in on \$ext_if proto tcp to \$ext_if port 22
# Allow web access
pass in on \$ext_if proto tcp to \$ext_if port 80
EOF

# Check firewall rules syntax
sudo service pf onecheck

sudo sysrc pf_enable=YES
sudo service pf start

(mind that $ must be escaped in shells and will land in /etc/pf.conf unescaped)

At this point, your SSH connection will stall (and drop after some time) because the firewall does not have a state of your existing connection. To drop out from the hanging terminal, press Enter, ~, . one after another. To understand how this keyboard shortcut closes the SSH session, please read up about escape characters in the ssh manpage. Now, please reconnect to the VM with vagrant ssh.

# Check if Internet connection works at all
fetch -o - http://example.com

# Copy resolv.conf to every jail to allow resolving hostnames
# (note: typically added to your default ezjail flavor)
sudo tee /usr/jails/master/etc/resolv.conf < /etc/resolv.conf
sudo tee /usr/jails/worker0/etc/resolv.conf < /etc/resolv.conf

# Check if Internet connection works from a jail
sudo jexec master fetch -o - http://example.com

Install buildbot master

Apart from the master, we want to install the web user interface (called "UI" hereinafter) and Git since that is used in buildbot’s sample configuration for fetching a source project (the smaller package git-lite should be enough for fetching of most typical schemes like ssh and https).

sudo pkg -j master install git-lite py36-buildbot py36-buildbot-www
# Alternative which requires installing the tool package manager `pkg`
# itself inside jail:
# sudo jexec master pkg install git-lite py36-buildbot py36-buildbot-www

We create a regular, unprivileged user to run the buildbot master:

# Open a shell inside jail
sudo jexec master sh

# Instead of pw, you can use the interactive command `adduser`. We use a
# random password to protect the account. Since we are always root when
# doing `jexec` into a jail, we can become the user without entering the
# password and therefore can forget which password was automatically generated.
pw useradd -n buildbot-master -m -w random

# Create directory for master
mkdir /var/buildbot-master
chown buildbot-master:buildbot-master /var/buildbot-master

# Become unprivileged user
su -l buildbot-master
buildbot create-master /var/buildbot-master
cp /var/buildbot-master/master.cfg.sample /var/buildbot-master/master.cfg
# Switch to root user again (we did `su -l buildbot-master` earlier)
exit

The sample configuration polls a "Hello world" project every few minutes and builds it on changes. Nothing very interesting here, but it explains the principles quite well.

Time to do configure something useful, right? Not so fast! Without a worker, no build could run. For now, we copied the sample configuration to get started. In the next steps, we permanently run the master and set up a worker to actually run the builds.

Run buildbot master

The built-in mechanism for running buildbot is simply buildbot start. Since this starts the master only once, we opt for a permanent solution to start on boot. The package maintainers have thought of this and provide an rc script (such scripts manage service start, stop and other subcommands like restart/reload). It can be executed at boot (or more exactly in this tutorial: when the jail is started) to bring up the service. For that to happen, we only have to enable the service permanently and specify its working directory and user:

# Still inside jail shell
sysrc buildbot_enable=YES
sysrc buildbot_basedir=/var/buildbot-master
sysrc buildbot_user=buildbot-master
service buildbot start

# Check log file if you wish
tail /var/buildbot-master/twistd.log

If you are interested how the rc script starts and stops the service, check its code at /usr/local/etc/rc.d/buildbot.

Install buildbot worker

If you are still in the buildbot master jail’s shell, drop out with exit, or alternatively create a new session to the jailhost with vagrant ssh.

Like for the master, we first install required packages and then create an unprivileged user. Watch out to not mistype buildbot-master for buildbot-worker — below, we will only execute commands related to the worker. Git is used in the example builder to fetch the source code for the build. Not to be confused with the GitPoller on the master which is a "change source" i.e. regularly checks if changes exist in a repository; therefore we need Git on both master and worker for our example usage.

sudo pkg -j worker0 install git-lite py36-buildbot-worker
# Alternative which requires installing the tool package manager `pkg`
# itself inside jail:
# sudo jexec worker0 pkg install git-lite py36-buildbot-worker

# Open a shell inside jail
sudo jexec worker0 sh

# Instead of pw, you can use the interactive command `adduser`. We use a
# random password to protect the account. Since we are always root when
# doing `jexec` into a jail, we can become the user without entering the
# password and therefore can forget which password was automatically generated.
pw useradd -n buildbot-worker -m -w random

# Create directory for worker
mkdir /var/buildbot-worker
chown buildbot-worker:buildbot-worker /var/buildbot-worker

# Become unprivileged user
su -l buildbot-worker

buildbot-worker create-worker /var/buildbot-worker 10.0.0.2 example-worker pass

# The output told us do perform some actions manually. Let's obey:
cd /var/buildbot-worker
# Please fill in yourself or the admin
echo "Your Name <your.name@example.test>" > info/admin
# Worker description for display in UI
echo "worker0" > info/host

# Switch to root user again (we did `su -l buildbot-master` earlier)
exit

Note

Buildbot workers were previously called "slaves" and due to the politically unsound meaning, Mozilla assigned a $15000 contribution to take care of the rename, which went from documentation all the way down to source code and package names. So luckily, I do not have to write about a "slave in a jail" here 👍.

Run buildbot worker

We are lucky: buildbot workers do not need any configuration other than the connection details because the master handles all logic. Workers are "dumb" and only perform builds locally, reporting progress and results back to the master over the connection we specified (worker connects to master at IP 10.0.0.2 using default port 9989). Most extensibility of buildbot is in the master (and its master.cfg file). However, flexibility for your actual build purposes is in the workers as well, since you have the freedom to choose a different operating system, configuration and installed software for each worker. Since we work with FreeBSD jails in this tutorial, we are "restricted" to the jailhost’s FreeBSD kernel, but can freely choose any base system and extra packages for the worker as long as the OS release version is not newer than the host (as mentioned in the introduction).

Similar to the buildbot master rc script, you will probably want to run the worker permanently:

# Still inside jail shell
sysrc buildbot_worker_enable=YES
sysrc buildbot_worker_basedir=/var/buildbot-worker
sysrc buildbot_worker_uid=buildbot-worker
sysrc buildbot_worker_gid=buildbot-worker
service buildbot-worker start
# if it fails with "cannot run /usr/local/bin/twistd", apply this patch from
# https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227675 to the file
# `/usr/local/etc/rc.d/buildbot-worker` and try again:
# sed -i '' 's|command="/usr/local/bin/twistd"|command="/usr/local/bin/twistd-3.6"|' /usr/local/etc/rc.d/buildbot-worker

# Check log file, should show a message "Connected to 10.0.0.2:9989; worker is ready"
tail /var/buildbot-worker/twistd.log

# Back to jailhost shell
exit

Set up web server nginx to access buildbot UI

Master and worker have been set up, and if you watch log files, activity will be visible:

# On jailhost
$ tail -F /usr/jails/*/var/buildbot*/twistd.log
[...]
2018-04-21 17:23:28+0000 [-] gitpoller: processing changes from "git://github.com/buildbot/hello-world.git"

Here, "processing changes" means that if a change was detected from the previous build, a new build will be triggered. The change source is explicitly connected to trigger a build in the sample configuration — no builds are triggered implicitly only because there is a Git change source; the configuration does only and exactly what you code into it 💪.

There is of course no reason to look into log files to see which build is running. Buildbot features a web-based UI to give an overview, see results, force-trigger builds and more. In the sample master configuration, the www component is already set up to serve HTTP on port 8010. In a real environment, you would not serve unencrypted HTTP or open up the non-standard port 8010 to the outside (mind how listening on port 80 needs superuser privileges). Also, our server contains more than just the buildbot UI: depending on your actual use case for CI/CD, you may also want to serve the build logs and artifacts (such as built software). Hence, we serve the UI with nginx (any other server with HTTP and Web Sockets support would work just as well), and you can later configure yourself which data you are serving to outside users, allowing everyone to see everything and even to trigger builds. By the way, the buildbot UI by default does not perform user authorization. HTTPS is not covered in this tutorial — we will use plain HTTP for test purposes. Nevertheless, the nginx configuration presented below works if you enable SSL/TLS.

# On jailhost
sudo pkg install nginx

sudo tee /usr/local/etc/nginx/nginx.conf <<EOF
events {
    worker_connections 1024;
}
http {
    include           mime.types;
    default_type      application/octet-stream;
    sendfile          on;
    keepalive_timeout 65;
    server {
        listen 80;
        server_name localhost;

        location / {
            root /usr/local/www/nginx;
            index index.html index.htm;
        }

        location /buildbot/ {
            proxy_pass http://10.0.0.2:8010/;
        }
        location /buildbot/sse/ {
            # proxy buffering will prevent sse to work
            proxy_buffering off;
            proxy_pass http://10.0.0.2:8010/sse/;
        }
        # required for websocket
        location /buildbot/ws {
            proxy_http_version 1.1;
            proxy_set_header Upgrade \$http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_pass http://10.0.0.2:8010/ws;
            # raise the proxy timeout for the websocket
            proxy_read_timeout 6000s;
        }

        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
            root /usr/local/www/nginx-dist;
        }
    }
}
EOF

sudo sysrc nginx_enable=YES
sudo service nginx start

(mind again that $ is escaped in the shell but not in the output file)

Remember the line config.vm.network "forwarded_port", guest: 80, host: 8999 in our Vagrantfile? Vagrant’s networking is a little different in that access to a VM’s TCP ports is not directly possible, but typically achieved by a port forward which Vagrant establishes for you. You should therefore see a welcoming nginx example page at http://localhost:8999/ (open in your computer’s browser).

Let us replace the page with an index of what’s on the server — the buildbot master is already active, while as mentioned, other items like serving build artifacts or logs might become important to you later (not in scope of this tutorial).

sudo tee /usr/local/www/nginx/index.html <<EOF
<html>
<body>
<a href="/buildbot/">buildbot</a>
<!--
    Since there's only one thing here right now, let's redirect automatically
    until you figure out which artifacts you want to put here.
-->
<script>
    window.location.href = "/buildbot/";
</script>
</body>
</html>
EOF

Run your first build

Reload the browser page. The buildbot UI should come up. There will be a warning about the configured buildbotURL because we use Vagrant’s port forwarding; in production, you should have direct access to https://your-ci.your-company.example.com and configure the value accordingly.

Feel free to browse around the UI. You will find the example builder runtests, our single worker on host worker0 and some other information already available. Since the example builder has a "force" scheduler configured, you can even trigger a first build now! Click "Builds > Builders > runtests > force > Start Build" and see how the build runs. It will fail when trying to run trial, the example project’s test runner because we have not installed this software on the worker (at time of writing, it was not available as separate FreeBSD package).

buildbot UI screenshot

We are now ready to do something useful with our buildbot instance. Buildbot configuration and essentials are not covered in here — please read the official documentation to get started. The configuration at /usr/jails/master/var/buildbot-master/master.cfg is right at your fingertips and ready for editing. Here is an edit-and-reload workflow that you may need as "trial and error" strategy until you have successfully learned all the basics:

# Open a shell inside jail
sudo jexec master sh
# Make some changes and reload
vi /var/buildbot-master/master.cfg
service buildbot reload

The rc script’s reload command actually calls something like buildbot reconfigure /var/buildbot-master under the hood, telling our master process to reload the configuration.

Production hints

We worked in a test virtual machine for this setup, but for production grade, you may still want to adapt a few things:

Think about using ZFS as filesystem so ezjail can take advantage of it (see manpage’s Using ZFS section). Official Vagrant images of FreeBSD are set up using UFS, not ZFS.
In my company, I have set up buildbot to run package builds using poudriere. Poudriere performs clean builds by means of creating empty jails ("empty" = only FreeBSD base system installed but no packages) and starting the build within. For that to work within our buildbot worker jail, you need to allow it to create subjails, among other settings. At some point, especially if you are a friend of human-readable names and paths, you may run into the current FreeBSD mount point name length limit of 88 characters which will be fixed in FreeBSD 12. To work around that limitation now, you could set ezjail_jaildir=/j in ezjail.conf (before running ezjail-admin install) instead of using the longer path /usr/jails. Or you could choose shorter jail names like w0 instead of my-cool-project-worker0-freebsd-10.3.
Store the worker password in a separate file instead of hardcoding it in master.cfg (as done in the sample configuration). This allows you to share the configuration with software developers (e.g. commit to a version-controlled repo) or even allow them to edit it — without any security concerns.
You should replace the sample worker name and password with own values, obviously.

Finished!

The tutorial narrated about basics of FreeBSD jails and buildbot, followed by the setup of a test virtual machine featuring a buildbot master and single attached worker. With this in place, you can go on to implement your CI/CD intentions with buildbot’s explicit and programmable configuration. Good luck!

Read more… (post is longer)

Ansible best practices

April 24, 2017

Ansible can be summarized as tool for running automated tasks on servers that require nothing but Python installed on the remote side. Typically used as configuration management framework, Ansible comes with a set of key benefits:

Has simple configuration with YAML, avoiding copy-paste by applying customizable "roles"
Uses inventories to scope and define the set of servers
Fosters repeatable "playbook" runs, i.e. applying same configuration to a server twice should be idempotent
Doesn’t suffer from feature matrix issues because by design it is a framework, not a full-fledged solution for configuration management. You cannot say "it supports only web servers X and Y, but not Z", as principally Ansible allows you to do anything that is possible through manual server configuration.

For a full introduction to Ansible, better read the documentation first. This article assumes you have already made yourself familiar with the concepts and have some existing attempts of getting Ansible working for a certain use case, but want some guidance on improving the way you are working with Ansible.

The company behind Ansible gives some official guidelines which mostly relate to file structure, naming and other common rules. While these are helpful, as they are not immediately common sense for beginners, only a fraction of Ansible’s features and complexity of larger setups are touched by that small set of guidelines.

I would like to present my experience from roughly over 2 years of Ansible experience, during which I have used it for a test environment at work (allowing developers to test systems like in production), for configuring my laptop and eventually for setting up this server and web application, and also my home server (a Raspberry Pi).

Why Ansible over other frameworks?

Honestly, I did not compare many alternatives because the Ansible environment at work already existed when I joined and soon I believed Ansible to be the best option. The usual suspects Chef and Puppet did not really please me because the recipes do not really look like "infrastructure as code", but are too declarative and hard to understand in detail without looking at many files — while in a typical Ansible playbook, the actions taken can be read top-down like code.
Many years ago, I built my own solution to deploy my personal web applications ("Site Deploy"; UI-based). As hobby project, it never became popular or sophisticated enough, and eventually I learned that it suffers from the aforementioned feature matrix problem. Essentially it only supported the features relevant to me 🙄, without providing a framework to support anything on any server. Nevertheless, Site Deploy already had support for configuring hosts with their connection data and services, with the help of variable substitution in most places. Or in other words: the very basic concepts of Ansible.
Size of the user-base says a lot (cf. their 2016 recap)
Ansible aims at simple design, and becomes powerful by all the open-source modules to support services, applications, hardware, network, connections, etc.
No server-side, persistent component required. Only Python needed to execute modules. Usual connection type is SSH, but custom modules are available for other types.
Flat learning curve: once you understand the basic concepts (define hosts in inventory, set variables on different levels, write tasks in playbooks) and you know the commands/steps to configure a host manually, it’s easy to get started writing the same steps down in Ansible’s YAML format.
Put simply, Ansible combines a set of hosts (inventory) with a list of applicable tasks (playbooks & roles), customizable with variables (at different places), allowing you to use pre-defined or own task modules and plugins (connection, value lookup, etc.). If you rolled your own, generic configuration management, you probably could not implement its principles much simpler. Since the concepts are so clearly separated, the source code (Python) is easy enough to read, if ever needed. Usually you will only have 2 situations to look into Ansible source code: learning how modules should be implemented and finding out about changed behavior when upgrading Ansible. The latter is not common and only occurred to me when switching from Ansible 1.8/1.9.x to 2.2.x which was quite a big step both in features, deprecations and also Ansible source code architecture itself.
Change detection and idempotency. Whenever a task is run, there may be distinct outcomes: successfully changed, failed, skipped, unchanged. After running a playbook, you will have an overview of which tasks actually made changes on the target hosts. Usually, one would design playbooks in a way that running it a second time only gives "unchanged" outcomes, and Ansible’s modules support this idea of idempotency — for example, a command task can be marked as "already done that before, no changes required" by specifying creates: /file/created/by/command → once the file was successfully created, a repeated execution of the task module will not run the command again.

Choose your type of environment

Before we jump into practice, in the first thought we must consider what kind of Ansible-based setup we want to achieve, which greatly depends on the environment: work/personal, production/staging/testing, mixture of those…

Testing

A test environment could have many faces: for instance, at my company we manage a separate Git repo for the test environment, unrelated to any production configuration and therefore very quick to modify for developers without lengthy code reviews or approval by devops, as no production system can be affected. Ansible is used to fully configure the system and our software within a virtual machine.

To spin up a VM, many solutions exist already — for instance Vagrant with a small provisioning script that installs everything required for Ansible (only Python 😉) in the VM. We use a small Fabric script to bootstrap a FreeBSD VM and networking before continuing with Ansible.

Staging/production

You should keep separate inventories for staging and production. If you don’t have staging, you should probably aim at automating staging setup with Ansible, since you already develop the production configuration in playbooks. But if you have both, the below recommendations apply.

Both non-production and production with one Ansible setup

When deploying both non-production and production environments from the same roles/playbooks, you must take care they don’t interfere with each other. For instance, you don’t want to send real e-mails to customers from staging, use different domain names, etc. The main way to decide on applying non-production vs. production properties should be your use of inventories and variables. An example will be discussed below (dynamic inventory).
Careful — developers should not have live credentials such as SSH access to a production server, but probably be able to manage testing/staging systems?!
GPG encryption of sensitive files or other protection to disallow unprivileged people from accessing production machines at all (mentioned in section Storing sensitive files)
A safe default choice for inventories is required, and the default should most probably not be production. This is described below in the section Ansible configuration.

Careful when mixing manual and automated configuration

If you already have a production system manually set up — which is almost always the case, at least for initial OS installation steps which cannot be done via Ansible on physical servers — making the switch to fully automated configuration via Ansible is not easy. You may want to introduce automation step-by-step.

There are many imaginable ways to achieve that migration. I want to propose what I would do, admittedly without any real-world experience because I do not manage any production systems as developer.

Develop playbooks and maintain check mode and the --diff option. This is not always easy and sometimes unnerving because you have to think both in normal mode (read-write) and check mode (read-only) when writing tasks, and apply appropriate options for modules that can’t handle it themselves (like command):
- check_mode: no (previously called always_run: yes)
- changed_when
- If you use tags: apply tags: [ always ] to tasks that e.g. provide results for subsequent tasks
Take care when making manual changes to servers. While often okay and necessary to react quickly, ensure the responsible people (e.g. devops team) can later reproduce the setup rather sooner than later with playbooks.
Use {{ ansible_managed }} to mark auto-generated files as such, so nobody unknowingly edits them manually
Automate as much setup as you can, but only the parts that you are able to implement via Ansible without risk. For example, if you fear that an automatic database setup could go horribly wrong (like overwrite the existing production database), then rely on your distrust and do those steps manually.

Directory structure

Some common directory layouts are already part of the official documentation. In addition, you may want to separate your playbooks in subdirectories of playbooks/ once your content grows too large. This cannot really be handled well in best practices because size and purpose of each project varies, so I just leave this on you to decide when time comes to "clean up". Note that if you use several playbook (sub-)directories and files relative to them (such as a custom library folder), you may have to symlink into the each directory containing playbooks.

Basic setup

It should be clear that Ansible uses text files and therefore should be versioned in a VCS like Git. Make sure you ignore files that should not be committed (for example in .gitignore: *.retry).
Add something like alias apl=ansible-playbook in your shell. Or do you want to type ansible-playbook all the time?
Require users to use at least a certain Ansible version, e.g. the latest version available in OS package managers at the time of starting your endeavors. You could have a little role check-preconditions doing this:

# Check and require certain Ansible version. You should document why that
# version is required, for instance:
#
# We require Ansible 2.2.1 or newer, see changelog
# (https://github.com/ansible/ansible/blob/devel/CHANGELOG.md#221-the-battle-of-evermore---2017-01-16):
# > Fixes a bug where undefined variables in with_* loops would cause a task
# > failure even if the when condition would cause the task to be skipped.
- name: Check Ansible version
  assert:
    that: '(ansible_version.major, ansible_version.minor, ansible_version.revision) >= (2, 2, 1)'
    msg: 'Please install the recommended version 2.2.1+. You have Ansible {{ ansible_version.string }}.'
  run_once: true

Ansible configuration

ansible.cfg allows you to tweak many settings to be a little saner than the defaults.

I recommend the following:

[defaults]
# Default to no fact gathering because it's slow and "explicit is better
# than implicit". Depending how you use variables, you may rather explicitly
# define variables instead of relying on facts. You can enable this on
# a per-playbook basis with `gather_facts: yes`.
gathering = explicit
# You should default either 1) to a non-risky inventory (not production)
# or 2) point to a nonexistent one so that the person explicitly needs to
# specify which one to use. I find the alternative 1) the least risky,
# because 2) may lead to people creating shortcuts to deploy to live machines
# which defeats the purpose of having a safer default here.
inventory = inventories/test
# Cows are scared of playbook developers
nocows = 1

# Point to your local collection of extras, e.g. roles
roles_path = ./roles

[ssh_connection]
# Enable SSH multiplexing to increase performance
pipelining = True
control_path = /tmp/ansible-ssh-%%h-%%p-%%r

Choosing a safe default for the inventory is obviously important, thinking about recent catastrophic events like the Amazon S3 outage that originated from a typo. Inventory names should not be confusable with each other, e.g. avoid using a prefix (inv_live, inv_test) because people hastily using tab completion may quickly introduce a typo.

If you are annoyed by *.retry files being created next to playbooks which hinders filename tab completion, an environment variable ANSIBLE_RETRY_FILES_SAVE_PATH lets you put them in a different place. For myself, I never use them as I’m not working with hundreds of hosts matching per playbook, so I just disable them with ANSIBLE_RETRY_FILES_ENABLED=no. Since that is a per-person decision, it should be an environment variable and not go into ansible.cfg.

Name tasks

While already outlined in the mentioned best practices article, I’d like to stress this point: names, comments and readability enable you and others to understand playbooks and roles later on. Ansible output on its own is too concise to really tell you the exact spot which is currently executing, and sometimes in large setups you will be searching that spot where you canceled (Ctrl+C) or a task failed fatally. Naming even the single tasks comes in handy here. Or tooling like ARA which I personally did not try yet (overkill for me). After all we’re doing programming, and no reasonable language would allow you to make public functions unnamed/anonymous.

- name: 'Create directories for service {{ daemontools_service_name }}'
  file:
    state: directory
    dest: '{{ item }}'
    owner: '{{ daemontools_service_user }}'
  with_items: '{{ daemontools_service_directories }}'

In recent versions of Ansible, variables in the task name will be correctly substituted by their value in the console output, giving you visual feedback which part of the play is executing. That will be especially important once your configuration management project is growing and you run large collections of playbooks that execute a certain role (this example: daemontools_service) multiple times, for example to create a couple of permanent services.

Another advantage of this technique is that you can start where a play canceled/failed previously using the --start-at-task="Task name" option. That might not always work, e.g. if a task depends on a previously register:-ed variable, but is often helpful to save time by skipping all previously succeeded tasks. If you use static task names like "Install packages", then --start-at-task="Install packages" will start at the first occurrence of that task name in the play instead of a specific one ("Install dependencies for service XYZ").

Avoid skipping items

…because it might hurt idempotency. What if your Ansible playbook adds a cronjob based on a boolean variable, and later you change the value to false? Using when: my_bool (value now changed to no) will skip the task, leaving the cronjob intact even though you expected it to be removed or disabled.

Here’s a slightly more complicated example: I had to set up a service that should be disabled by default until the developer enables it (because it would log error messages all the time unless the developer had established a required, manual SSH tunnel). Considerations:

When configuring that service (let’s call the role daemontools_service; daemontools are great to set up and manage services on *nix), we cannot simply enable/disable the service conditionally: the service should only be disabled initially (first playbook run = service created for the first time on remote machine) and on boot, but its state should be untouched if the developer had already enabled the service manually. Or in other words (since that fact is not easy to find out), leave state untouched if the service was already configured by a previous playbook run (= idempotency).
You might also want an option to toggle enabling/disabling the service by default, so I’ll show that as well

- hosts: xyz

  vars:
    xyz_service_name: xyz-daemon

    # Knob to enable/disable service by default (on reboot, and after
    # initial configuration)
    xyz_always_enabled: true

  roles:
    - role: daemontools_service
      daemontools_service_name: '{{ xyz_service_name }}'
      # Contrived variable, leaving state untouched should be the default
      # behavior unless you want to risk in production that services are
      # unintentionally enabled or disabled by a playbook run.
      daemontools_service_enabled: 'do_not_change_state'
      daemontools_service_other_variables: ...

  tasks:
    - name: Disable XYZ service on boot
      cron:
        # We know that the role will symlink into /var/service,
        # as usual for daemontools
        job: "svc -d /var/service/{{ xyz_service_name }}"
        name: "xyz_default_disabled"
        special_time: "reboot"
        disabled: "{{ xyz_always_enabled }}"
        # ...or...
        # state: "{{ 'absent' if xyz_always_enabled else 'present' }}"
      tags: [ cron ]

    - name: Disable XYZ service initially
      # After *all* initial configuration steps succeeded, take the service
      # down (`svc -d`) and mark the service as created so we...
      shell: "svc -d /var/service/{{ xyz_service_name }} && touch /var/service/{{ xyz_service_name }}/.created"
      args:
        # ...don't disable the service again if playbook is run again
        # (as someone may have enabled the service manually in the meantime).
        creates: "/var/service/{{ xyz_service_name }}/.created"
      when: not xyz_always_enabled
      tags: [ cron ]

Use and abuse of variables

The most important principle for variables is that you should know which variables are used when looking at a portion of "Ansible code" (YAML). As an Ansible beginner, you might have 1) wondered a few times, or looked up, in which order of precedence variables are taken into account. Or 2) you might have just given up and asked the author what is happening there. Like in software development, both 1) and 2) are fatal mistakes that hamper productivity — code must be readable (hopefully top-down or by looking within the surrounding 100 lines) and understandable by colleagues and other contributors. The case that you even had to check the precedence shows the problem in the first place! Variables should be specified at exactly one place (or two places if a variable has a reasonable, overridable default value), as close as possible to their usage while still being at the relevant location and most variables should be ultimately mandatory so that Ansible loudly complains if a variable is missing. Let us look at a few examples to see what these basic rules mean.

[exampleservers]
192.168.178.34

[all:vars]
# Global helper variables.
#
# I tend to use these specific ones because when inside a role, Ansible 1.9.x
# did not correctly find files/templates in some cases (if called from playbook
# or dependency of other role). Not sure if that is still required for 2.x,
# so don't copy-paste without understanding the need! These are really
# just examples.
my_playbooks_dir={{ inventory_dir + "/../playbooks" }}
my_roles_dir={{ inventory_dir + "/../roles" }}

# With dynamic inventories, you can structure your per-host and per-group
# variables in a nicer way than this INI file top-down format. If you use
# INI files, at least try to create some structure, like alphabetical sorting
# for hosts and groups.
[exampleservers:vars]
# Here, put only variables that belong to matching servers in general,
# not to a functional component
ansible_ssh_user=dog

Let’s look at an example role "mysql" which installs a MySQL server, optionally creates a database and then optionally gives privileges to the database (also allows value * for all databases) to a user:

# ...contrived excerpt...
- name: Ensure database {{ database_name }} exists
  mysql_db:
    name: 'ourprefix_{{ database_name }}'
  when: database_name is defined and database_name != "*"

- name: Ensure database user {{ database_user }} exists and has access to {{ database_name }}
  mysql_user:
    name: '{{ database_user }}'
    password: '{{ database_password }}'
    priv: '{{ database_name }}.*:ALL'
    host: '%'
  when: database_user is defined and database_user
# ...

The good parts first:

Once database_user is given, the required variable database_password is mandatory, i.e. not checked with another database_password is defined.
Variables used in task names, so that Ansible output clearly tells you what exactly is currently happening

But many things should be fixed here:

Role (I called this example role "mysql") is doing way too many things at once without having a proper name. It should be split up into several roles: MySQL server installation, database creation, user & privilege setup. If you really find yourself doing these three things together repeatedly, you can still create an uber-role "mysql" that depends on the others.
Role variables should be prefixed with the role name (e.g. mysql_database_name) because Ansible has no concept of namespaces or scoping these variables only to the role. This helps finding out quickly where a variable comes from. In contrast, host groups in Ansible are a way to scope variables so they are only available to a certain set of hosts.
The database name prefix ourprefix_ seems to be a hardcoded string. First of all, this led to a bug — privileges are not correctly applied to the user in the second task because the prefix was forgotten. The hardcoded string could be an internal variable (mark those with an underscore!) defined in the defaults file roles/mysql/defaults/main.yml: _database_name_prefix: 'ourprefix_' # comment describing why it’s hardcoded, and must be used wherever applicable. Whenever the value needs changing, you only need to touch one location.
The special value database_name: '*' must be considered. Because the role has more than one responsibility (remember software engineering best practices?!), the variables have too many meanings. As said, there had better be a role "mysql_user" that only handles user creation and privileges — inside such a scoped role, using one special value turns out to be less bug-prone.
database_user is defined and database_user is again only necessary because the role is doing too much. In general, you should almost never use such a conditional. For no real reason, an empty value is principally allowed, and the task skipped in that case, and also if the variable is not specified. Once you decide to rename the variable and forget to replace one occurrence, you suddenly always skip the task. Whenever you can, let Ansible complain loudly when a variable is undefined, instead of e.g. skipping a task conditionally. In this example, splitting up the role is the solution to immediately make the variables mandatory. In other cases, you could introduce a default value for a role variable and allow users to override that value.

Other practices regarding variables and their values and inline templates:

Consistently name your variables. Just like code, Ansible plays should be grep-able. A simple text search through your Ansible setup repo should immediately find the source of a variable and other places where it is used.
Avoid indirections like includes or vars_files if possible to keep relevant variables close to their use. In some cases, these helpers can shorten repeated code, but usually they just add one more level of having to jump around between files to grasp where a value comes from.
Don’t use the special one-line dictionary syntax mysql_db: name="{{ database_name }}" state="present" encoding="utf8mb4". YAML is very readable per se, so why use Ansible’s crippled syntax instead? It’s okay to use for single-variable tasks, though.
On the same note, remove defaults which are obvious, such as the usual state: present. The "official" blog post on best practices recommends otherwise, but I like to keep code short and boilerplate-less.
Decide for one quoting style and use it consistently: double quotes (dest: "/etc/some.conf"), single quotes (dest: '/etc/some.conf') plus decision if you quote things that don’t need it (dest: /etc/some.conf). Keep in mind that dest: {{ var }} is not possible (must be quoted), and that mode: 0755 (chmod) will give an unexpected result (no octal number support), so recommended practice is of course mode: '0755'.
Also decide for one style for spacing and writing Jinja templates. I prefer dest: '{{ var|int + 5 }}' over dest: '{{var | int + 5}}' but only staying consistent is key, not the style you choose.
You don’t need --- at the top of YAML files. Just leave them away unless you know what it means.

More rules can be shown best in a playbook example:

- hosts: web-analytics-database

  vars:
    # Under `vars`, only put variables that really must be available in several
    # roles and tasks below. They have high precedence and therefore are prone
    # to clash with other variables of the same name (if you didn't follow
    # the principle of only one definition), or may set a value in one of the
    # below roles that you didn't want to be set! Therefore the role name
    # prefix is so important (`mysql_user_name` instead of `username` because
    # the latter might also be used in many other places and is hard to grep
    # for if used all over the place).

    # When writing many playbooks, you probably don't want to hardcode your
    # DBA's username everywhere, but define a variable `database_admin_username`.
    # The rule of putting it as close as possible to its use tells you to
    # create a group "database-servers" containing all database hosts and put
    # the variable into `group_vars/database-servers.yml` so it's only available
    # in the limited scope.
    # Using variable name prefix `wa_` for "web analytics" as example.
    wa_mysql_user_name_prefix: '{{ database_admin_username }}'

  roles:
    - role: mysql_server

      # [Comment describing why we chose MySQL 5.5...]
      # Alternatively (but more risky than requiring it to be defined explicitly),
      # this might have a default value in the role, stating the version you
      # normally use in production.
      mysql_server_version: '5.5'

    # Admin with full privileges
    - role: mysql_user
      mysql_user_name: '{{ wa_mysql_user_name_prefix }}_admin'

      # This should not have a default. Defaulting to `ALL` means that on a
      # playbook mistake, a new user may get all privileges!
      mysql_user_privileges: 'ALL'

      # Production passwords should not be committed to version control
      # in plaintext. See article section "Storing sensitive files".
      mysql_user_password: '{{ lookup("gpgfile", "secure/web-analytics-database.password") }}'

    # Read-only access
    - role: mysql_user
      mysql_user_name: '{{ wa_mysql_user_name_prefix }}_readonly'
      mysql_user_privileges: 'SELECT'
      mysql_user_password: '{{ lookup("gpgfile", "secure/web-analytics-database.readonly.password") }}'

  tasks:
    # With well-developed roles, you don't need extra {pre_}tasks!

sudo only where necessary

The command failed, so I used sudo command and it worked fine. I’m now doing that everywhere because it’s easier.

It should be obvious to devops people, and hopefully also software developers, how very wrong this is. Just like you would not do that for manual commands, you also should not use become: yes globally for a whole playbook. Better only use it for tasks that actually need root rights. The become flag can be assigned to task blocks, avoiding repetition.

Another downside of "sudo everywhere" is that you have to take care of owner/group membership of directories and files you create, instead of defaulting to creating files owned by the connecting user.

Assertions

If you ever had a to debug a case where a YAML dictionary was missing a key, you will know how bad Ansible is at telling you where an error came from (does not even tell you the dictionary variable name). I have found my own way to deal with that: assert a condition before actually running into the default error message. Only a very simple plugin is required. I opened a pull request already but the maintainers did not like the approach. Still I will recommend it here because of practical experience.

In ansible.cfg, ensure you have:

filter_plugins = ./plugins/filter

Then add the plugin plugins/filter/assert.py:

from ansible import errors


def _assert(value, msg=''):
    # You can leave this condition away if you think it's too strict.
    # It's supposed to help find typos and type mistakes in assertion conditions.
    if not isinstance(value, bool):
        raise errors.AnsibleFilterError('assert filter requires boolean as input, got %s' % type(value))

    if not value:
        raise errors.AnsibleFilterError('assertion failed: %s' % (msg or '<no message given>',))
    return ''


class FilterModule(object):
    filter_map = {
        'assert': _assert,
    }

    def filters(self):
        return self.filter_map

And use it like so:

- name: My task
  command: 'somecommand {{ (somevar|int > 5)|assert("somevar must be number > 5") }}{{ somevar }}'

This will only be able to test Jinja expressions, which are mostly but not 100% Python, but that should be enough.

Less code by using repetition primitives

Ever wrote something like this?

- name: Do something with A
  command: dosomething A
  args:
    creates: /etc/somethingA
  when: '{{ is_admin_user["A"] }}'

- name: Do something with B
  command: dosomething --a-little-different B
  args:
    creates: /etc/somethingB
  when: '{{ is_admin_user["B"] }}'

A little exaggerated, but chances are that you suffered from copy-pasting too much Ansible code a few times in your configuration management career, and had the usual share of copy-paste mistakes and typos. Use with_items and friends to your advantage:

- name: Do something with {{ item.name }}
  # At a task-level scope, it's totally okay to use non-mandatory variables
  # because you have to read only these few lines to understand what it's
  # doing. Use quoting if you want to support e.g. whitespace in values - just
  # saying, of course it's unusual on *nix...
  command: 'dosomething {{ item.args|default("") }} "{{ item.name }}"'
  args:
    creates: '/etc/something{{ item.name }}'
  # This is again following the rule of mandatory variables: making dictionary
  # keys mandatory protects you from typos and, in this case, from forgetting
  # to add people to a list. Get a good error message instead of just
  # `KeyError: B` by using the aforementioned assert module.
  when: '{{ item.name in is_admin_user|assert("User " + item.name + " missing in is_admin_user") }}{{ is_admin_user[item.name] }}'
  with_items:
    - name: A
    - name: B
      args: '--a-little-different'

More readable (once it gets bigger than my contrived example), and still does the same thing without being prone to copy-paste mistakes and complexity.

Idempotency done right

This term was already mentioned a few times above. I want to give more hints on how to achieve repeatable playbook runs. "Idempotent" effectively means that on the second run, everything is green and no actual changes happened, which Ansible calls "ok" but in a well-developed setup means "unchanged" or "read-only action was performed".

The advantages should be pretty clear: not only can you see the exact --diff of what would happen on remote servers but also it gives visual feedback of what has really changed (even if you don’t use diff mode).

Only a few considerations are necessary when writing tasks and playbooks, and you can get perfect idempotency in most cases:

Avoid skipping items in certain cases (explained above)
Often you need a command or shell task to perform very specific work. These tasks are always considered "changed" unless you define e.g. the creates argument or use changed_when.
Example: changed_when: _previously_registered_process_result.stdout == ''
On the same note, you may want to use failed_when in special cases, like if a program exits with code 0 even on errors.
Always use same inputs. For example, don’t write a new timestamp into a file at every task run, but detect that the file is already up-to-date and does not need to be changed.
Use built-in modules like lineinfile, file, synchronize, copy and template which support the relevant arguments to get idempotency if used right. They also typically fully support checked mode and other features that are hard to achieve yourself. Avoid command/shell if built-ins can be used instead.
The argument force: no can be used for some modules to ensure that a task is only run once. For instance, you want a configuration template copied once if not existent, but afterwards manage it manually or with other tools, use copy and force: no to only upload the file if not yet existent, but on repeated run don’t make any changes to the existing remote file. This is not exactly related to idempotency but sometimes a valid use case.

Leverage dynamic inventory

Who needs to fiddle around carefully in check mode every time you change a production system, if there’s a staging environment which can bear a downtime if something goes wrong? Dynamic inventories can help separate staging and production in the most readable and — you guessed it — dynamic way.

Separate environments like test, staging or production of course have different properties like

IP addresses and networks
Host and domain names (FQDN)
Set of hosts. Production software may be distributed to multiple servers, while your staging may simply be installed on one server or virtual machine.
Other values

Ideally, all of these should be specified in variables, so that you can use different values for each environment in the respective inventory, but with consistent variable names. In your roles and playbooks, you can then mostly ignore the fact that you have different environments — except for tasks that e.g. should not or only run in production, but that should also be decided by a variable (→ when: not is_production).

Check the official introduction to Dynamic Inventories and Developing Dynamic Inventory Sources to understand my example inventory script. It forces the domain suffix .test for the "test" environment, and no suffix for the "live" environment.

#!/usr/bin/env python
from __future__ import print_function
import argparse
import json
import os
import sys

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))

# One way to go "dynamic": decide inventory type (test, staging, production)
# based on inventory directory. Remember that Ansible calls the first file
# found if you specify a directory as inventory. Symlinking the same script
# into different directories allows you to use one inventory script
# for several environments.
IS_LIVE = {'live': True, 'test': False}[os.path.basename(SCRIPT_DIR)]
DOMAIN_SUFFIX = '' if IS_LIVE else '.test'


host_to_vars = {
    'first': {
        'public_ip': '1.2.3.4',
        'public_hostname': 'first.mystuff.example.com',
    },
    'second': {
        'public_ip': '1.2.3.5',
        'public_hostname': 'second.mystuff.example.com',
    },
}
groups = {
    'webservers': ['first', 'second'],
}


# Avoid human mistakes by applying test settings everywhere at once (instead
# of inline per-variable)
for host, variables in host_to_vars.items():
    if 'public_hostname' in variables:
        # Just an example. Realistically you may want to change `public_ip`
        # as well, plus other variables that differ between test and production.
        variables['public_hostname'] += DOMAIN_SUFFIX


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--debug', action='store_true', default=False)
    parser.add_argument('--host')
    parser.add_argument('--list', action='store_true', default=False)
    args = parser.parse_args()

    def printJson(v):
        print(json.dumps(v, sort_keys=True, indent=4 if args.debug else None, separators=(',', ': ' if args.debug else ':')))

    if args.host is not None:
        printJson(host_to_vars.get(args.host, {}))
    elif args.list:
        # Allow Ansible to only make one call to this script instead
        # of one per host.
        # See https://docs.ansible.com/ansible/dev_guide/developing_inventory.html#tuning-the-external-inventory-script
        groups['_meta'] = {
            'hostvars': host_to_vars,
        }
        printJson(groups)
    else:
        parser.print_usage(sys.stderr)
        print('Use either --host or --list', file=sys.stderr)
        exit(1)

Much more customization is possible with dynamic inventories. Another example: in my company, we use FreeBSD servers with our software installed and managed in jails. For developer testing, we have an Ansible setup to roughly resemble the production configuration. Unfortunately, at the time of writing, Ansible does not directly support configuration of jails or a concept of "child hosts". Therefore, we simply created an SSH connection plugin to connect to jails. Each jail looks like a regular host to Ansible, with the special naming pattern jailname@servername. Our dynamic inventory allows us to easily configure the hierarchy of groups > servers > jails and all their variables.

For personal and simple setups, in which only a few servers are involved, you might as well just use the INI-style inventory file format that Ansible uses by default. For the above example inventory, that would mean to split into two files test.ini and live.ini and managing them separately.

Dynamic inventories have one major downside compared to INI files: they don’t allow text diffs. Or in other words, you see the script change when looking at your VCS history, not the inventory diff. If you want a more explicit history, you may want a different setup: auto-generate INI inventory files with some script or template, then commit the INI files whenever you change something. Of course you will have to make sure to actually re-generate the files (potential for human mistakes!). I will leave this as exercise to you to decide.

Modern Ansible features

While you may have introduced Ansible years back when it was still in v1.x or earlier stages, the framework is in very active development both by Red Hat and the community. Ansible 2.0 introduced many powerful features and preparations for future improvements:

Task blocks (try-except-finally): useful to perform cleanups if a block of tasks should be applied "either all or none of the tasks". Also can reduce repeated code because you can apply when, become and other flags to a block.
Dynamic includes: you can now use variables in includes, e.g. - include: 'server-setup-{{ environment_name }}.yml'
Conditional roles are nothing new. I had some trouble with related bugs in 1.8.x, but those are obviously resolved and role: […] when: somecondition can help in some use cases to make code cleaner (similar to task blocks).
Plugins were refactored to cater for clean, more maintainable APIs, and more changes will come in 2.x updates (like the persistent connections framework). Migrating your own library to 2.x should be simple in most cases.

Off-topic: storing sensitive files

For this special use case, I don’t have a recommendation since I never compared different approaches.

Vault support seems to be a good start but seems to only support protection by a single password — a password which you then have to share among the team.

Several built-in lookups exist for password retrieval and storage, such as "password" (only supports plaintext) and Ansible 2.3’s "passwordstore".

In my company, we store somewhat sensitive files (such as passwords for financial test systems) in our developers' Ansible test environment repository, but in GPG-encrypted form. A script contains a list of files and people and encrypts the files. The encrypted .gpg files are committed, while original files should be in .gitignore. Within playbooks, we use a lookup plugin to decrypt the respective files. That way, access can be limited to a "need to know" group of people. While this is not tested for production use, it may be an idea to try and incorporate this extra level of security if you are dealing with sensitive information.

Conclusion

Ansible can be complex and overwhelming after developing playbooks in a wrong way for a long time. Just like for source code, readability, simplicity and common practices do not come naturally and yet are important to keep your Ansible code base lean and understandable. I’ve shown basic and advanced principles and some examples to structure your setup. Many things are left out of this general article, because either I have no experience with it yet (like Ansible Galaxy) or it would just be too much for an introductory article.

Happy automation!

Read more… (post is longer)

Today I learned — episode 4 (numbers in JavaScript considered useless)

December 21, 2016

This blog series is supposed to cover short topics in software development, learnings from working in software companies, tooling, etc.

Numbers in JavaScript considered useless

For my hobby web application project, I wanted to implement a simple use case: my music player application needs to know the playback status including some other fields, and retrieves that status using AJAX calls to the local server. While that should be pretty fast in theory, every network request will slow down your (JavaScript) application, especially if we assume that the web server might not always be on localhost. An easy way to circumvent this are bidirectional WebSocket messages (here: server pushes status). However I’m playing with Rust and the nickel.rs web framework so I just wanted a quick solution without having to add WebSocket support.

My idea was to just have the server sleep during the request until the playback status has actually changed. This way, the client makes a request which simply takes longer if the status remains unchanged, resulting in fewer connections being made. I added a GET parameter previous_hash to the URL so the server could check if the status had changed from what the client stored earlier. Using Rust’s Hash trait, it was very simple to create a u64 hash of my struct and send the new hash back to the client.

In Rust pseudo-code:

router.get("/app/status", {
    middleware! { |request, mut response|
        let previous_hash : Option<u64> = request.query().get("previous_hash")
                                                         .map(|s| s.parse::<u64>().expect("previous_hash not an integer"));
        response.set(MediaType::Json);

        // Delay response for some time if nothing changes, to help client make fewer calls
        let mut ret = None;
        for i in 0..100
        {
            let status_response = get_status_response(); // unnecessary detail
            if previous_hash != Some(status_response.status_hash) {
                ret = Some(json::encode(&status_response).unwrap());
                break;
            }
            sleep_ms(50);
        }

        // If nothing changed while we slept ~100*50 milliseconds, just send latest status
        if ret.is_none() {
            let status_response = get_status_response();
            ret = Some(json::encode(&status_response).unwrap())
        }
        ret.unwrap()
}
});

A change so simple should have just worked, but even though the playback status of my music player remained the same, my requests kept taking 1 millisecond without any sleep calls. The web developer tools in Firefox quickly showed me the potential problem:

The JSON response view and the raw response from the server showed different values. OMFG this must be a browser bug showing big numbers — let’s file a bug on Firefox! Just joking, this was not the real problem, but my first suspect was Firefox simply because I’m using the nightly version.

Long story short: I wasted some nerves and time just to stumble over the same old JavaScript problem again. Numbers in JS are all IEEE 754 floating point. Firefox was showing me the correct thing. My Rust-based web server could easily output the exact u64 integer value while JavaScript converts to floating point, losing precision and making comparisons and any other use of (big) numbers for my hashing use case totally useless. That means I have to switch to using a string representation of the number instead.

While this is just another WAT moment, I am hoping that WebAssembly (supposed to include 64-bit types at some point) and languages that compile to that target can alleviate such problems for the sake of a better future of web development.

Read more… (post is longer)

Giving technical talks — tips to make your listeners happy

December 17, 2016

I’m not a speaker. Since finishing my Master studies, I never held a technical presentation in front of many people, except for doing lots of company-internal presentations related to tooling, security training and induction. In the last years, I’ve visited conferences, meetups and smaller presentations and am seeing the same mistakes over and over again. You might ask — who am I to give you advice? Obviously I’m not a well-known speaker, so what do I know? Well, the important point is that I am a good listener, and the quality of a talk is only defined by the perception/reception of its listeners — you can believe you’re the best speaker in the world, but if people don’t like it, they will 1) typically not give you helpful feedback and thereby not allow you to improve and 2) not come back to your next year’s talk (or even vote it out of the program). I observed many speakers to learn how to present own topics at a future conference or local meetup, and would like to share my experiences with you.

Here’s a list of the most common observations of what is going wrong, how to improve, and other helpful tips to just be a better presenter and get a better conversion and perception from your audience.

Common problems and hints in one list

Readability

The rhetorical question "Can you all read this, yes?" is almost always answered with a silent mumbling of the audience, which actually expresses "Oh not another dude who cannot create readable slides". And even if we were taught since university, and some even since school (PowerPoint started slowly being allowed in my school era, while it is already "the thing" nowadays), that you should not put too many bullets on a slide and keep the text large and readable, speakers still fail to see their presentation from the eyes of people watching.

Font size and amount of content: This is the most crucial setting for your slides. It doesn’t depend as much on the room size as you think, as larger venues are often equipped with large canvases or even mirrored ones for the people in the back. Large font sizes are equally important for any room and any audience. If you don’t set a reasonable size when starting to work on your slides, you will 1) later have to reorganize your slides on font size increase because the content will not fit anymore, or 2) get the resentment of the audience when having to change it during the talk. The latter case is much worse, and I have seen many speakers use reveal.js and other web-based presentation frameworks without understanding how to use them. I even saw a presenter who understood at second zero of his talk that font size was way too small, asked the rhetorical question, tried to use the browser zoom feature, but failed at the attempt because the framework generated HTML that only zoomed controls, not font size. In such a stressful situation, you probably wouldn’t think of hacking it using Web Inspector to enforce the size change.
I can understand that PowerPoint and foes are not very helpful when it comes to syntax highlighting or embedding source code from a file, but yet you have to know what you use and come prepared. For static content, LaTeX presentations are a good starting point.
In summary: know your room and presentation target as you would know your deployment target when developing software. Just because it looks good on your screen does not mean people can read it with a projector (which by the way are usually 4:3 or seldom 16:9). Think of font size, foreground and background colors, contrast, limit font family and size variations and keep examples and text readable. That applies to slides, examples and also applications you switch to (terminal, IDE → zoom feature).
Colors: mind the color blind. I admit to have little knowledge around this, but if you tend to distinguish meaning by color, consider using something else instead (bold/italic/underlined text, side-by-side comparison table, multiple slides… depends heavily on content).

Content

Quoting Hadi Hariri’s great talk The Silver Bullet Syndrome: a talk should be informative, thought-provoking, entertaining and inspirational. Please have a look (at least) at the first few minutes of the video to understand the terms. I want to give some related advice with real examples, again in no particular order of importance (you decide!):

Hobby projects: At developer conferences, I noticed that speakers are often a mixture of: 1) experienced speakers who prepare well, probably even held their talk before in a smaller group and chose their topic based on either strong interest for a programming language, technology or standardization, or out of a real (business) use case/issue they encountered. 2) People whose name you didn’t hear before — often those basing their topic and slides around a personal problem statement or hobby project.
While a personal topic can be very interesting (I’m a big fan of lightning talks which have a lot of such topics), some topics are also very boring or useless for an audience that paid to learn about new standards, technologies and practices instead of a hobby project with questionable future, public interest (e.g. GitHub stars) or substantiated problem statement. Before even starting to work on slides or publishing something, check if it may be interesting for others. Key deliverables in that kind of presentation could be: real use case, description of other public projects which face the same issue, (your) library/framework to solve the problem statement, proposals for improvement and — often forgotten — public source code.
For one bad example: the latest hype in C++ conferences was functional programming (immutable data structures, monad-like chaining, etc.), and I saw talks centered around the guys' home projects which were advertised to the fullest on their blog with code excerpts, but none of it was ever published. Then on the other hand, functional programming libraries like brigand became popular also because they were made public immediately with a request for trying it out, including some good examples.
Real examples: This directly continues on the problems of hobby projects, but applies to all presentations. Without actual examples that people can apply to their own work or personal projects, a talk may not be informative (depends on topic, of course). For myself, I dislike variables named foo/a/b/whatever in examples. Many speakers present a problem that came from a real work problem. In my previous blog posts, I used examples from the financial sector in which I work, for instance. Try to put your real problem statement into a minimal (source code) example, removing all the confidential, over-detailed and useless stuff. You will even find out that if you do so, you may be able to reuse that problem statement as job interview question for software developers!
And please, for the love of all we honor as modern software developers, stop using Monkey, Giraffe and Animal as class names. Not even a zoo’s source code would have such a thing! The only exceptions may be study classes on object orientation and I even admit to have used those names myself on a covariance question when I was younger, but please, keep those and other nonsense examples out of technical talks. Show real use cases.
Number and size of slides: It feels sad that people still do this wrong even though it’s common sense and by just practicing your talk once (even mumbling it to yourself in silence), you can find out that you have too many. I never saw the case of too few slides — never! But I saw the opposite — a guy presenting way more than a hundred slides in a 60 minutes slot, constantly skipping content that he said was not relevant for the audience. Remember that the slides exist to guide the audience, not you, and your voice and highlighting is there to amend and explain the slides. There’s no silver bullet for the ratio of slides per minute, but quite clearly if you find you have to present 2 slides per minute with lots of content or code examples, that simply is not comprehensible at such high speed and your listeners will hate you. In school and university, I learned to put an agenda at the beginning. Even if that is not always very helpful for listeners, it is one way to provide a common thread to guide through your topic in a reasonable order.
There are many ways to reduce complexity by removing and shrinking content, and thus improve understandability:
- Remove uninteresting clutter and images: Every so often I see people trying to be entertaining with funny images and memes.
  
  That’s okay if your talk is supposed to be funny as (part of) its selling argument (like WAT or The Silver Bullet Syndrome), but I would recommend not to overdo it. This applies to all kinds of images: photos of famous persons of the 1x-th century who no one can recognize and you kept unlabeled for people to guess (hint: lame!), complicated flow graphs like your manager’s manager would put in a PowerPoint slide (keep it simple so people can understand!), unrelated side stories if not exactly interesting or amusing (distracts from the common thread).
- Avoid copy-pasting external resources: if you have to paste a whole StackOverflow question or answer into your slides, something is wrong with the way you are explaining the problem or solution. Most often, the title or small summary is enough.
- Short examples: Code snippets must be to the point, i.e. concisely show the use case, problem or solution. For longer examples, shortly indicate which lines you are going to explain next, e.g. by selecting them or using highlighting features of your presentation software.
- Inline explanation in your code snippets: Instead of talking several slides about "how you’re going to do it" and then show the code which does what you just explained, sometimes you can simply put the relevant explanation into the code or on the same slide. Example: algorithm that transforms matrices for which viewers can relate the single steps with the respective line of the code snippet (great example slide of Kris Jusiak, video wasn’t available yet at time of writing).
Take care of details: Similar to typos in your examples which make them not compile, other small mistakes such as text typos, half-truths and incomplete explanations may lead to annoyances among the attentive people of your audience. You should be confident that your presentation is good and exact instead of creating it in a hurry.

Listening comprehension

About making the audience understand what you’re saying (literally).

Accent in English speaking: German speakers are a lucky few, because even if they have the typical, horrible accent when speaking English, it’s one of those which everyone can still understand. However the Germans also have something called "Denglisch", a bad mixture of German and English words, which can lead to misunderstanding. For instance, the German word "Handy" means "mobile phone", and if you mix that into an English sentence, people will be justifiably confused since in English, "handy" means "practical" or "useful". You should be aware of your own accent and such language traps and avoid them. Clearly, speaking perfect English is harder for people of certain cultures, but everyone should try their best when talking to an international audience.
The typical stereotypes and prejudices about how certain peoples speak English very often hold true, especially if speakers are unaware of how they speak English, so you may be able to just find out something about your culture/language/people by researching (insulting) comments about it. Seriously! To give you a personal example: I googled "german speakers accent horrible" and found e.g. Why are Germans among the worst speakers of English?, which felt somehow insulting, but since as an adult I don’t care much, I read on and found quite good reasoning by the author and others.
The difference of languages and cultures is a big source of trouble in listening comprehension and a topic of its own. I might give more hints on that in a future article, but here it is too much.
Summary: be aware of how you speak! If you know of your accent, and hear that you cannot avoid it completely, at least try to speak slowly.
Do not turn around to your slides on the wall: you should have a mirrored display (or presentation mode) on your laptop or second screen. You are turning around because you are feeling unsafe — similar to putting a hand in your pocket. Knowing your content or at least the order of chapters helps not needing to look up the slide content all the time. Presentation or mirror mode can show you the current and/or next slide if you need to see it.
Practice: Many people don’t like rehearsing their presentation in front of a mirror, with family, colleagues or friends or even by themselves. That’s natural and often even unnecessary. Regarding stress level: you are not in a job interview (if you are by any chance, ignore this hint), no need to haste or feel unsafe. Think less and keep the presentation style simple and your voice calm but controlled. The slower and more calm you are, the lower the risk for increased stress levels are. Watch a few minutes of Louis Dionne’s Meeting C++ 2016 keynote to see the meaning of that hint. He is the best example — exaggeratedly calm in some people’s opinion, but yet doing a perfect presentation with close to zero glitches or mistakes 👍.
To get a real preparation for a bigger event, you can try a local meetup group, present at your company or in another small setting. You could extract a small part of your slides into a lightning talk to check if people like the topic at all. Request and collect feedback from the rehearsal/practice session audience.
Speak loud and clear: Some rooms simply have bad acoustics and you can make up for it with your voice. It also makes listeners think that you are feeling confident about your topic. But do not speak too fast — as mentioned above, try to keep your pace and style calm.
Know your own quirks: You should be aware of your own behavior and speaking. Practice and feedback can help find out what you didn’t know yourself yet or didn’t want to realize. Example: many people say the same words or phrases all the time. It’s a pity that humanity is so fearful that we do not tell each other, but no way to change that at large. I know there are people who have that as invalidity or uncurable illness, but for the ones who can control themselves — just listen to yourself to find such tics. It even happens to keynote speakers ("you know"). Other popular words to accidentally repeat are "I", "so", "amazing" (or similar), "like" and of course "ummmm".
Do not read full slides aloud, separate "inputs" for listeners: Your audience can mostly only concentrate on one input at a time — voice, code, slides, seatmates whispering, other distractions. This is the very reason why your slides' content should be complementary to your voice, and therefore avoid too much text on a slide. Summarize in very short sentences if the slide has bullet points, or use emphasized text. If it’s about code, amend the code on the slide with your spoken explanations. From an "input" point of view, listeners should switch between 1) having the time to look at and understand your code and 2) you explaining it. Obviously not all brains are of the same effective speed and not all listeners an expert in your topic, so there must be appropriate "thinking breaks" in between (remember to talk calmly!). Provide good, slow example descriptions, and mind beginners and people who are not deep inside the topic.
At the conclusion slide, it’s fine to list things that were already mentioned in short, to repeat them verbally and/or in writing.

Miscellaneous technical hints

Use latest tools: In one particular presentation of Meeting C++ 2016, I saw a C++11 implementation of what was already available in C++17 as built-in feature. While all other speakers were already showing off trunk compiler features which would soon be shipped as implementation of the new 2017 standard, I had to read slides which proved not only that it can already be done now, but also that I have to use ridiculous templating tricks and go through 50 copy-pasted lines of code on one slide just to get a point across that did not help the actual example, but which would also become obsolete in only a few months.
If you don’t want to make the effort of building/installing the latest compiler, just use online tools like compiler explorer (but mind that you may be offline during the presentation).
Consistent examples: Try to keep your examples centered around one topic. Be it one use case, one other programming language to compare to, and so on. Try to keep the variation low to guide watchers, not confuse them. One real-life example I experienced was a talk loosely related to functional programming which pointed out some wildly mixed examples in both Python and Haskell, a combination which was obviously quite unfamiliar to most of the (C++) audience, especially given Haskell’s syntax which is not immediately comprehensible.
Working examples: Make sure your examples actually compile. A great way to do so is to add a Makefile to your slides repo, and embed the source files into your slides, instead of copy-pasting examples into slides directly. Provide a precompiled header to remove all the clutter (e.g. for C++: includes, using namespace statements, repeated example types), so that only the relevant part is imported from the source file into the slide. Having your examples compile allows you to demo and adapt your use cases if someone has a question.
Display your keystrokes (topic-dependent): If you are showing an IDE or anything else where shortcuts, pressed keys and clicks help the understanding, use a tool to display what you are typing.
Do not disturb: Nothing is more annoying than notifications and applications popping up while you present. You have to say "sorry" and probably it is even something embarrassing like a chat or e-mail notification including content. Take measures to avoid such a situation:
- Close or mute browser tabs which could show notifications (such as WhatsApp Web or Facebook), or temporarily disable them (Firefox: open about:config and search for dom.webnotifications.enabled).
- Enable "do not disturb" mode in your presentation tool or operating system. For macOS, open the notification center (right-most menu button), scroll up and enable "Do not disturb". You can even auto-enable it by opening "System Preferences > Notifications > Turn on Do Not Disturb > When mirroring to TVs and projectors". Windows 10 users have a setting "System > Notifications & actions > Hide notifications while presenting" (which the software must support). And so on.
- Close unnecessary applications, especially the ones which show custom style notifications (and thus don’t react to no-distraction settings as mentioned above) or change the screen color. Popular examples are f.lux (adapts color temperature based on daytime) or Time Out (forces you to take a break).
- Do the same on your phone, or set it to silent or vibration mode. Tell your partner not to call you while presenting.

Hardware

Internet access: Hopefully, you’re fully prepared without requiring any online resources. Wi-Fi access is often a pain or sometimes even unavailable, so try to open websites beforehand or store resources on disk. The same applies to software and packages that you need to have installed. If a Wi-Fi is available, connect early enough and consider a temporary WLAN hotspot of your phone’s mobile connection as backup.
Display/projector problems: Conference organizers can bring tons of cables and adapters, but still we can see problems in 2 out of 10 presentations. Before one particular talk, I observed 6 software engineers and 2 venue assistants trying to get to one out of two laptops to work with the projector — for over 10 minutes. Organizers should offer the speakers to try that out before, and if so, take that chance.
- Bring a USB backup of your slides so in case of unsolvable problems (such as output with wrong colors, green stripes, graphics chip or projector related), the organizers can give you an alternative laptop to present with. This means you should be prepared to hold the presentation on another machine, if possible. If you want to run code examples, for instance, try to have a portable environment ready, such as prebuilt binaries or a pre-installed Python virtualenv. Use common formats like PDF or PowerPoint because those are most likely to be openable on spare laptops if yours is not working.
- Bring your own adapter: Especially for Apple machines, often a funny mixture of adapters is necessary. Some lonely projectors out there are still VGA-only. Chances are that soon young people do not even know what VGA or analog display means…
Try a presenter mouse if you are the type of person to walk around, or the room setup forces you to stand far from your laptop ("Can I have the next slide, please?"). The model should be easy to handle with a fixed grip and have all the features you need, e.g. next/previous slide, start presentation, laser pointer and probably extended features like media controls, right click, touchpad for cursor moves and whatnot. A regular mouse can serve as stupid backup if things do not work on the spot. Try out the presenter mouse before the presentation to get to know how to use it (and potentially install required drivers while you are still online). I’m using an older version of this Logitech model which serves me well with the most basic features.
Laser pointers may not show up on mirrored canvases — for instance, talks with large audiences are sometimes in a venue that has one main projector on stage, and more projectors for people in the back). The small dot could also be invisible in the video recording. Alternatively, select text with your cursor, which is easy for PDF and HTML-based slides. Other tools like PowerPoint include markup tools to highlight text. Or use the old-school "one bullet point at a time" feature to reveal only the passage or code snippet you are currently talking about, to allow people to focus on the right spot of your slide.

Conclusion

Naturally, the above list cannot be comprehensive. The items are the ones I found important for me as the listener of a talk. I know many of them might sound very meta or hard to achieve (like changing your voice, speed or stress level), but only by knowing about what could go wrong, you get a chance of improving (as in: "if nobody tells you about your bad breath, you can never change it"). Please comment if you have more important things to add or want to give feedback, no matter if from the perspective of speaker or listener!

Read more… (post is longer)

Names are important – improving use of terms in software engineering

November 14, 2016

In our field, few things are more important than reading code, which — except for one-man army companies — involves numerous developers reading and trying to understand the same code. One code base is read way more often than written or refactored (if not, you’re doing it wrong), hence the importance of a common understanding of the terminology. Herein I want to present challenges with examples, the different types of scope applying to a term and tips to improve on your use of terminology to foster better communication within companies and elsewhere.

Examples

Terms can occur at various semantic locations in code. Here are some examples (some real, some contrived) from the financial sector in which I’m working. They language assumption is C++ in this article, but the recommendations can be applied equally to other languages.

Type names (class/struct/enum/interface depending on programming language): Account
Function names: validateIban
Method names: getBankList
Module and namespace names: Billing::Aggregation

Terminology scopes

Even if you’re working in the same area, you might not immediately understand each of the example terms above in the way they are used in code and conversations in my company. The reason is that different levels and types of scope apply.

Global scope

Globally familiar words are often clear by themselves, and typically can be looked up in an English dictionary without having more information on the context. The dictionary should usually give no ambiguities. One example is Billing, which expresses that the topic is centered around bills in some sense. If all code had globally intuitive words only, our understanding would be perfect! Reality is that there’s almost always a context which describes (required) details.

Exceptions are truly global names. Think of public trademarks and product names known worldwide. These don’t need explanations anymore, but still the companies behind them have to ensure that the understanding does not get altered through the years. Some people wish for the term "Java" to only relate to an island… A good example is "Microsoft Office". It is familiar all around the globe even for non-tech people (keep this hint in mind).

Local scope (field of expertise, company, team, project, module, etc.)

Even if Billing is easily comprehensible, the example namespace Billing::Aggregation is probably gibberish to someone who is not from the financial sector, or even new hires who don’t know yet how the company handles bills — namely, in this example, by aggregating some key figures. The word may therefore as well be very specific to the company, with a different understanding in other businesses.

A local context also applies to the method name getBankList. Without extra information like a documenting comment, the signature is not enough to find out 1) what kind of list this is, 2) if "get" means downloading, retrieving via remote call or parsing from file, and so on. At best, the surrounding class/module/project is clear enough to the reader of the code to understand the concept later, or provides a clarifying unit test or sample input.

Terms may even be team-specific (or evolve so over time) by various reasons: access to sensitive code may be restricted, the software company grows and splits up teams, topics of teams are not interrelated and there are no intersection points for shared code/guidelines/rules, …

The worst case is when the scope gets so narrow that, earnestly, "it’s all in the code" (only). Below, I will list some recommendations for choosing names, and related tips, to not get to this point (I call it "people lock-in", as in "vendor lock-in").

Ambiguous contexts

The aforementioned "local scope" is one context for a term. If one word can be understood in different ways, it is contextually ambiguous. You don’t want to collect many such terms in your code base, or else switching to another project will leave you confused why accountName suddenly means "human description of bank account" for at another point it meant "username of login credentials".

Likewise, even if terms have a unique meaning across the code base, they may be represented as different types and therefore again cause an inconsistent understanding. For example, let’s assume that "account" is always a bank account, but once there’s a variable AccountInfo account and in another spot int64_t account. "The latter is quite obviously the account ID in the database!" — oh no sorry, Mr. Original Code Author, it’s not obvious!

Another popular piece of information stored in data structures is string address (compatible with all international addresses! 🌍) being represented differently elsewhere: struct Address { string address1; string houseNo; […] }. These are even incompatible in conversion.

The key to overcome ambiguities is to stay consistent in naming. In the majority of cases, the main term (here: "account"), which by itself is not explanatory, can be suffixed to give a meaningful variable name: int64_t accountId and AccountInfo accountInfo (still nicely readable if type is omitted: const auto accountInfo = […];).

Implementation detail scope

While validateIban very obviously validates IBANs, knowing only the function name doesn’t say anything about how the function works. It requires at least the function signature and possibly a documentation comment to grasp the semantics. A company or development team may have their concept of how all validateXYZ functions should work, e.g. throw specific exception or return false on invalid value, and even if that concept is "well-known", it’s a notion that must be transferred to new hires. Such an induction to the company’s development practices is of course necessary for new developers, but too much will overload those people, resulting in small details being forgotten. Let’s say you forgot what the function validateIban returns for an empty input string? It’s a very important detail, and the most sane way would be to consider an empty value as invalid, because then the caller can decide whether an empty/optional value is allowed depending on use case. Yet this detail is not found in the name (granted, it’s hard in this case without getting wildly over-verbose names).

Here are a few alternative function signatures (C++):

auto validateIban(const std::string& s) -> bool; — this suggests to the reader that the function does not throw and returns whether the input is valid or not. It does not say what happens in the case of empty string, but as stated above, this could just be left off because there’s a sane default behavior. Nevertheless, following the "verbSubject" naming principle, a better signature would be [[nodiscard]] auto isValidIban(const std::string& s) -> bool which makes it even clearer that the function doesn’t throw but returns a boolean. Developers don’t even have to read the full signature to use it correctly, and are warned (starting with C++17) if the return value is unused by mistake.
auto validateIbanOrThrow(const std::string& s) -> void; — the void result type and "OrThrow" suffix in the name makes it totally clear that the function will throw on invalid input. Whether you include the type of exception in the signature or name is a question for your style guide (e.g. template<typename TExc> … to make it explicit). Personally, I’d just throw a standard exception here (std::invalid_argument), and stay consistent in similar functions.
No function at all. Use strong typing to ensure that at the relevant spots, only valid IBAN arguments can be passed in (i.e. auto extractAccountNumberFromIban(const ValidIban& iban) -> std::string;). Along the same line, introduce the practice to validate inputs at the input boundary (e.g. remote call), not just later where you could forget calling validateIban by accident. This will also improve your error handling because you will fail earlier, and can write functions that make assumptions about their inputs and thus may even become exception-free. As mentioned in the linked post, using strong types is probably overkill if done throughout your code, so this is a also something for the style guide, or to decide per case.

Factors other than context

Surely the context defines to which domain a term belongs. Nevertheless other influences can help determine whether a name or term makes sense to use.

Complexity and language

The first influences I want to summarize here are complexity and (spoken) language. International mixes of development team members can be found in almost all companies. The language and culture barrier and gap are the most influential topics to be aware of when it comes to creating a common and mutual understanding of technical and personal themes. Hence it’s no wonder that the English language reigns software development both in coded and spoken words. To account for culture differences, complex English vocabulary should be banned where reading or listening comprehension is important.

To give an example, I want to name the Unicode standard. After reading 20+ articles (incl. the famous Joel Spolsky) about Unicode in the last 15 years, and continuously learning about its updates, its terminology still is only partially burned into my brain. The sheer count of terms is high, but in my opinion not the issue, since the memory of a software developer is quite durable once a term is clear. Can you distinguish UTF-8, UCS-2, UCS-4, UTF-16{BE,LE}, UTF-32, (UTF-7), character, glyph, character set, code point, surrogate pair, BMP, BOM, U+1F4A9? I have no problem recalling their meaning when I see them, but what really makes my brain smoke are the non-technical things mentioned in that list: is a glyph a fully rendered code point, or a partial symbol? How did they define character again — was it the same as a code point? Just look for a minute at their glossary and you’re going to be overwhelmed as well. In summary, the standard, the related myriad of blog posts, true/half-assed/false answers on StackOverflow and other resources are simply an overload for the software industry and simplifying now takes a huge amount of effort. If we were to use UTF-8 everywhere (great simplified glossary there!) already 20 years ago, there probably wouldn’t be crazy inventions like MySQL’s UTF-8 variants (yes, your UTF-8 enabled database probably cannot store all of Unicode!):

For a supplementary character, utf8 cannot store the character at all, whereas utf8mb4 requires four bytes to store it.

See, complexity and amount of terminology is like a growing company — the smart ones can handle growth easily by keeping things simple and stupid, while the typical response to growth is levels of management, performance reviews, more business, less "family" feeling, or in other words: complexity.

Ambiguous wording

Imagine you’re in one well-defined context, have chosen simple English words that need no explanation in your opinion, developers you ask tell you they understand the meaning immediately — what could possible go wrong? You’ve landed a set of terms to be carved in stone. They will call a dictionary after you! Well, probably not… In reality this is long before the finish line.

One area for which I really have a strong opinion are filesystem terms. Those are around since ages but still confused and forcefully en-ambiguated (my opposite of disambiguated) or highly confused in code all the time, to the point where it’s not funny anymore. The problem is that even if the words are clear, and you were given Tanenbaum’s book on operating systems in studies class, the terms are still way too interchangeable. Find below some examples of ambiguous wording, including my proposals and what people also use as alternatives. I’m using lowerCamelCase examples here to also nitpick about spelling differences. This was the motivation to start writing this blog post, so sorry about the lengthy commentary! I’d like to hear comments on this admittedly very opinionated section:

file, f, path, p, filepath, filePath, filename: In operating system terms, a "file" can be a regular file, symlink, hard link, socket, FIFO, other special types or a directory. Often it is perfectly fine to use the terms "file" and "directory" to tell (regular) files apart from directories. Just think of the famous error message "No such file or directory". Usage depends a little on the use case, but mostly readers of code will simply understand because it is clear that a file with content is being read, or a directory is listed, for instance.
But: file != path != filePath != filename ☝️. First of all, "filepath" is a spelling that nobody uses, so you also shouldn’t, while "filename" is funnily the typical spelling (not "fileName"), just like "filesystem" exists in some dictionaries alongside "file system" (I don’t have a preference there). A "file path" is a path that points to a (optionally existing) file, and is mostly used in code to mean a regular file (or transparently a regular file behind a symlink). The difference to a "path" is that the latter means it can point to any file type on the system, including a directory. Using the variable name path therefore is probably underspecified and not a good idea if the intention is specific. Using p alone as variable name is much worse than the familiar abbreviations f (to represent a file handle) or i (for loop indices).
Moreover, people don’t seem to get the difference between filename and file path. A "file name" is the name of a file entry (mostly within a directory, but without exposing that context), e.g. "Hello.cpp", while its path may be any path pointing to that file, e.g. "/tmp/Hello.cpp" or "C:\SuperSource\Hello.cpp" (absolute paths), or "../../private/tmp/Hello.cpp" or — equaling the filename — "Hello.cpp" (relative paths).
Last, if I were to see a variable called file, in C++ I’m most likely to guess that it’s a file input stream, while many people use that name in place of a file name or path, which is greatly misleading and semantically wrong. This is a case for a naming guideline, since different opinions exist, and it’s also slightly dependent on the programming language — in Python, I would use f for an input file stream and out or out_file for a writing stream, while in other languages such short variable names are unusual.
directory, dir, dirPath, folder: In my memory, it was mostly Microsoft coining the term "folder". Wikipedia explains that a "folder" is just the graphical metaphor that represents a directory on the filesystem, and that e.g. Windows has special folders (like "Photo library") that don’t map directly to a directory on disk. Therefore in code, the correct term is almost always "directory" or an abbreviation (dir). Unlike file, the variable name dir by itself says even less about its meaning: unless you’re working with directory handles, you couldn’t infer what dir should stand for, and if it might represent an absolute directory path, or something else. So often times, this had better be dirPath, or if the variable name includes the meaning (it should!), I’m tempted to omit the *Path suffix: bankStatementsDownloadDir.

Recommendations

In no particular order:

Simple English: Use vocables that are taught internationally and resolve to one clear meaning when looked up in a dictionary. You should not even have to look it up. It starts at easy terms like "replace" instead of "substitute", and continues to native level complexity (missing reasonable bad examples here, sorry), or even to words that are only understood in certain English-speaking countries.
Code that reads like English sentences is often the best choice for later comprehension.
No code names. Made up words and names, or acronyms, can be a nice memory or story behind a project, but should not leak into the writing of its source code. Stay with clear English wording that other people can grasp.
Also: prefer short names before abbreviations — please stay away from stupid acronyms and be smarter than governments, research institutions and armies who use letter abbreviations everywhere. Example: STYLE = "Strategic Transitions For Youth Labour in Europe" — you gotta be kidding me!
Comprehensible by non-techies: If terms are important and publically visible for other departments or consumers, name them accordingly. "Billing aggregated information per merchant" is much better than "Merchant tx sums" (totally contrived 😉). "Microsoft Office" is much better than "Humble Write Bundle".
I could write a whole book about this item done wrong in public-facing user interfaces and applications. Assume Google sent you an e-mail "Login from unknown IP abcd:beef:1234:::1 with device supermario". Now estimate how many of the people in your neighborhood would react to such a mail, or even know what an "IP" is (or IPv6)? In reality, Google is much smarter, and the title for an unknown login alert currently reads "Someone has your password". While this could also be a spam subject, the average tech user is much more likely to react to clickbait titles warning about a virus or stolen password than to titles they don’t understand. No technical details like IP, device name or location are shared by Google’s alert (only after the click), but instead there’s a single, fat button "REVIEW YOUR DEVICES NOW". This far it’s wonderful naming and perfectly smart design to attract people to security measures — an outstanding example.
Maintain a technical glossary page or a good practices project: Create a Wiki or intranet page for developers to look up commonly used terms. You could even add the recommended variable name(s) in there for important concepts. Don’t pack too many words in there and don’t grant other (non-technical) departments write access because else they might quickly pile up half-true or unrelated descriptions of things that developers don’t even need to know, or must have a much deeper technical understanding of. If you’re one of those "our Wiki is always outdated" or "our Wiki is write-only" companies, you could instead "appoint" a best practices code project, so to say a flagship project that does most things (including naming) right and consistently. Newcomers should learn good practices from that project. In my team at work, for example, we develop implementations for many payment methods (e.g. Credit Card or PayPal are payment methods) based on the same module interface, so implementations only (need to) vary slightly in their overall logic and naming concepts. We implicitly know which projects are the ones we wrote this year, and as such are the ones where we applied the most modern practices and conventions to stay consistent or introduce better terminology. These latest projects can be seen as starting point for any new project. In addition, we have a Wiki page outlining important points to consider for these similar implementations — much like a checklist (not related to terminology per se, just as general hint).
Provide examples: If there is a core spot where a term stems from or which defines the main usage, for instance a module that parses the important company report called "monthly aggregated Blobby Volley results and player of the month", that code repository probably should contain a relevant unit test and sample file/input where reviewers can later look up what makes up this report (can be anonymized data), how its output would look like, and probably a short explanation of its meaning for the company. Alternatively, I imagine explanatory articles on the company Wiki, structured in reasonable order/hierarchy of topics, and linked in the glossary.
Ambiguous meanings: In many cases, code and terminology grew historically and you can’t easily change names anymore — accept the fact and try to disambiguate as far as possible. If a term "account" is ambiguous between two projects, let’s say project A ("LoginService", where it stands for login credentials) and B ("BankAccountService", here it represents bank account information), then ensure the ambiguous term doesn’t slip from project B into A, and vice versa.
If both meanings need really be mixed within one code repository, use namespaces, type and variable name prefixes or suffixes to overcome the ambiguity: loginAccountInfo and bankAccountInfo. Before introducing terms, look which ones already exist, or else you won’t be able to disambiguate easily — for instance, Rust’s package manager cargo uses the word "target" publicly for both "build target" (as in make <targetname>) and for "target architecture" (alias platform triple, e.g. "x86_64-unknown-linux-gnu"), which is mostly clear in the code because internally it’s most often called "platform", but the slight annoyance remains existent because the public configuration key is still called target and will remain so for a long time to stay backward-compatible.
Use one consistent name and stop the typos already to make code grep-able. This allows to search through the whole code base and see where a term or type/variable name is actually in use. If you consistently used accountInfo for all places where you store a local variable about bank account information, you can more easily rename all places to the new desired name bankAccountInfo. Side note: in reality, renames tend to be a bit less trivial, though. The same applies to sentences such as public error messages: if they are all identical, or even in one shared linked library, it’s easy to fix/amend/replace/gettext-translate them.
Ensure a given name is clear within the desired scope: If you have a method getBankList, you should make sure that the parent class describes what it is about — e.g. DeutscheBundesbankXmlBankListParser is a little exaggerated but clearly says it parses the XML bank list of the German federal bank. The bigger the scope is, the more important good naming is for types and items that lie within. Imagine this class was part of a shared library that you’re selling to customers!
Function names should be verb-followed-by-subject where a reader should be able to infer the output from the verb ("validate" in our example was not helpful).

I hope this list proves helpful to see terminology from a different perspective and allows you to take action enhancing your practices and sweeping out old, nonsense names from your code.

Read more… (post is longer)

Today I learned — episode 3 (strong typing in C++ vs. Rust)

November 4, 2016

This blog series is supposed to cover topics in software development, learnings from working in software companies, tooling, but also private matters (family, baby, hobbies).

Strong typing in Rust and comparison to C++

C++ enthusiast Arne Mertz recently wrote a post Use Stronger Types!, a title which immediately sounded like an appealing idea to me. Take a look at his article, or for a tl;dr, I recommend looking at the suggestion of strong typedefs and links to libraries implementing such constructs/macros.

My inclination towards the Rust programming language and own expertise in related C++ constructs (and attempts to use stronger typing in work projects) commanded me to research how the languages compare and what other simple options exist. Matter of fact, I’m going to present below some findings that I already had prepared for a draft presentation which compares C++ with Rust (with the goal of finding out where C++ could improve). This article explains possible alternatives in C++, a suggested solution that is very explicit, and how one can achieve something similar in Rust.

Terminology for code samples

As I’m working for payment service provider PPRO, my examples come from the financial sector. Let me quickly introduce a few relevant terms.

The term PAN essentially means a credit card number, where the full PAN may never be stored on disk ("at rest") without encryption (such as in a log file) or leave the protected environment, and has many more security restrictions demanded by the PCI-DSS data-security standard (PCI = Payment Card Industry). Masked PANs are the ones that can be displayed outside of a PCI environment. For example, if you purchase a product with your credit card (number 1234569988771234), that number may be stored in an encrypted form within a PCI-compliant environment. However it may only leave that environment in masked PAN form, that is, at most the first six digits (BIN = bank identification number) and the last four digits (non-identifying fraction of the customer’s card number). At your next purchase at the merchant, they can offer you to pay with the same card again (displaying the masked number 123456XXXX1234).

C++ `typedef` is not strong typing

We want to ensure that full PANs cannot be converted directly to masked PANs (among other reasonable restrictions). Look at a beginner attempt to define separate types:

#include <cstdint>
#include <iostream>
#include <string>

typedef std::string MaskedPan;
typedef std::string FullPan;
// Assuming PANs never start with a zero, we can stuff them into an integer type
typedef uint64_t MaskedPanU;
typedef uint64_t FullPanU;

int main()
{
    {
        FullPan full = "1234569988771234";
        MaskedPan masked = full;
        std::cout << "Masked (string): " << masked << std::endl;
    }

    {
        FullPanU full = 1234569988771234;
        MaskedPanU masked = full;
        masked += full; // even this works
        std::cout << "Masked (integer): " << masked << std::endl;
    }
}

Well, that compiled just fine and led to a fatal bug — we just logged a full PAN to stdout. That action shouldn’t have been possible. You don’t want to sit through and pay for those two extra weeks in the next credit card audit, not to mention the cleanup to get the sensitive data out of the way!

Using an enum wrapper is also ugly, and not really readable like English prose — so probably not a good idea in general.

The C++ standard gives a simple explanation:

A typedef-name is thus a synonym for another type. A typedef-name does not introduce a new type the way a class declaration (9.1) or enum declaration does.

Or in other words, typedef A B; seems to be no different in this use case from using B = A; — if there’s a difference at all?! Fortunately right in that quotation we have a proposed solution: declare a new type with struct/class/enum.

C++ strong typing with `enum`

While I wouldn’t recommend using an enum for scenario, it apparently has its strong typing benefits:

#include <cstdint>
#include <iostream>
#include <string>

enum class MaskedPanE : uint64_t {};
enum class FullPanE : uint64_t {};

auto maskPan(FullPanE full) -> MaskedPanE
{
    // Take first six and last four digits of full (optionally test that full
    // is at least 10 digits if not guaranteed by other code)
    const auto fullPan = std::to_string(static_cast<uint64_t>(full));
    const auto maskedPan = std::stoull(fullPan.substr(0, 6) + fullPan.substr(fullPan.size() - 4));
    return static_cast<MaskedPanE>(maskedPan);
}

int main()
{
    FullPanE full = static_cast<FullPanE>(1234569988771234);

    // Now we have strong typing :) This gives
    //   error: cannot convert 'FullPanE' to 'MaskedPanE' in initialization
    // MaskedPanE masked = full;

    MaskedPanE masked = maskPan(full);

    // Additional benefit: outputting only possible with explicit cast
    std::cout << "Masked (enum): " << static_cast<uint64_t>(masked) << std::endl;
}

C++ strong typing of strings

The std::string case actually is a no-brainer: strings are ubiquitous in business logic of most companies. They are used for

money amounts (different format, decimal and thousand separator, rounding, precision)
file paths, filenames
numbers
binary data, but also text of varying encodings (C++ and Unicode is a different story altogether 😜)
data types which are incompatible or should be semantically distinct (such as full and masked PANs in our scenario)
maaaaaany more use and abuse cases all around the globe

We have to create a new struct or class to have disjoint string-based data types.

#include <cstdint>
#include <stdexcept>
#include <iostream>
#include <string>

class StringBasedType
{
    const std::string _s;
public:
    explicit StringBasedType(const std::string& s): _s(s) {}
    auto str() const -> const std::string& { return _s; }
};

// Randomly using `struct` keyword here, could as well be `class X: public StringBasedType`
struct MaskedPan: StringBasedType
{
    explicit MaskedPan(const std::string& s): StringBasedType(s)
    {
        // Check input value: require 123456XXXX1234 format
        if (s.size() != 14 ||
            s.substr(0, 6).find_first_not_of("0123456789") != std::string::npos ||
            s.substr(6, 4) != "XXXX" ||
            s.substr(10).find_first_not_of("0123456789") != std::string::npos)
        {
            throw std::invalid_argument{"Invalid masked PAN"};
        }
    }
};

struct FullPan: StringBasedType
{
    explicit FullPan(const std::string& s): StringBasedType(s)
    {
        // Check input value based on assumptions
        if (s.size() < 13 || s.find_first_not_of("0123456789") != std::string::npos)
            throw std::invalid_argument{"Invalid full PAN"};
    }

    auto getMasked() const -> MaskedPan
    {
        const auto& s = str();
        // Use assumptions about string size and content from `MaskedPan` constructor
        return MaskedPan{s.substr(0, 6) + "XXXX" + s.substr(s.size() - 4)};
    }
};

int main()
{
    try
    {
        FullPan full = FullPan("1234569988771234");

        // This fails to compile because no such converting constructor exists
        //   error: conversion from 'FullPan' to non-scalar type 'MaskedPan' requested
        // MaskedPan masked = full;

        MaskedPan masked = full.getMasked();

        // Outputting only possible with explicit `str()` - more visible in a code review!
        // If you're calling `str()` all the time, you're probably misusing strong typing.
        std::cout << "Masked (string-based): " << masked.str() << std::endl;

        return 0;
    }
    catch (const std::exception& e)
    {
        std::cerr << "Exception: " << e.what() << std::endl;
    }
}

Summarizing this solution (one of many):

Everything is explicit:
- conversion to raw string value (str()) and thus the ability to output or compare the value
- conversion to disjoint type (must add constructor or method like getMasked)
- construction of specific type (use explicit constructors)
Exactly one place for input validation/assertion
Can be adapted to other base types as well, not only strings
Operators must be defined manually. This can be an advantage, for instance, if the base type (here: string) can be compared for equality/order, but ordering does not make sense for the specific type (here: PAN). BOOST_STRONG_TYPEDEF(BaseType, SpecificType) is an example implementation which defines operators for you.

Strong typing wrappers are not — and will presumably never be — an inherent part of C++. Instead, the above solutions proved simple enough for the mentioned use cases. It’s on developers to decide whether they write a few lines of wrapper code to be very explicit, or choose a library which does the same thing.

Comparison with Rust

Rust has the same notion as C++'s aliasing typedef:

type Num = i32;

which has the same problems, so no need to repeat that topic.

Syntactically, Rust provides a very lightweight way of creating new types in order to achieve strong typing — tuple structs:

struct FullPan(String);
struct MaskedPan(String);

fn main() {
    let full = FullPan("1234569988771234".to_string());

    // Fails to build with
    //   error[E0308]: mismatched types
    //   expected struct `MaskedPan`, found struct `FullPan`
    // let masked: MaskedPan = full;

    let masked = MaskedPan("123456XXXX1234".to_string());
    println!("Masked (tuple struct): {}", masked.0);

    // Oops, no input validation: we can pass a full PAN value without getting an error
    let masked2 = MaskedPan("1234569988771234".to_string());
    println!("Masked2 (tuple struct): {}", masked2.0);
}

Well, that didn’t help much… Admittedly, tuple structs are more helpful for other use cases, such as struct Point(f32, f32) where it’s clear that X and Y coordinates are meant. A rule of thumb is: if you have to give the tuple fields a name to understand them, or you require input validation at construction time, don’t use a tuple struct. Remember that Rust uses an error model that is different from throwing exceptions, and in the above example there’s not even a constructor involved that could return an error (or panic) on invalid input.

Let’s replicate what we did in C++:

#[derive(Debug)]
struct FormatError { /* ... */ }

// Rust doesn't have object orientation i.e. we cannot "derive" from a base type
struct MaskedPan {
    value: String,
}

impl MaskedPan {
    pub fn new(value: &str) -> Result<Self, FormatError> {
        if value.len() != 14 || value[..6].find(|c: char| !c.is_digit(10)).is_some() ||
           value[6..10] != *"XXXX" ||
           value[10..].find(|c: char| !c.is_digit(10)).is_some() {
            Err(FormatError {})
        } else {
            Ok(MaskedPan { value: value.to_string() })
        }
    }

    pub fn as_str(&self) -> &str {
        &self.value
    }
}

struct FullPan {
    value: String,
}

impl FullPan {
    pub fn new(value: &str) -> Result<Self, FormatError> {
        if value.len() < 13 || value.find(|c: char| !c.is_digit(10)).is_some() {
            Err(FormatError {})
        } else {
            Ok(FullPan { value: value.to_string() })
        }
    }

    pub fn get_masked(&self) -> MaskedPan {
        // Since we already checked the `FullPan` value assumptions, we can call
        // `unwrap` here because, knowing the `MaskedPan` implementation, we can
        // be sure `new` will not fail.
        MaskedPan::new(&format!("{}XXXX{}",
                                &self.value[..6],
                                &self.value[self.value.len() - 4..]))
            .unwrap()
    }
}

fn main() {
    match FullPan::new("1234569988771234") {
        Ok(full) => {
            let masked = full.get_masked();
            println!("Masked (string-based): {}", masked.as_str())
        }
        Err(_) => println!("Invalid full PAN"),
    }
}

Should I use strong typing everywhere?

This questions seems to be mostly language-independent, and a matter of taste to some extent. In my experience, there are ups and downs:

Yay:

Safety from mistakes, especially if they can lead to horrific problems like in the credit card scenario, where full PANs could be leaked to the outside or written to disk if types are confused.
Code using the strong types may become more readable (as in: reading English prose) as things get spelled out explicitly
User-defined literals can make code even more concise, but that only applies to code which uses a lot of constants. To be honest, I’ve never had a project where those literals would be worthwhile.

Nay:

Much extra typing and explicit definition of operators/actions
Avoid using strings all over the place and you will have fewer problems from the start. For example, there’s boost::filesystem::path.
No real benefit for structures which probably never change and have well-named fields. To prevent mistakes in the order of constructor arguments, use POD structs and C++ designated initialization (syntax extension). Rust also has such a syntax, and additionally gives builds errors if you forgot to initialize a field. The builder pattern is a similar alternative (however not really beautiful). Stupid example:

// C++
struct CarAttribs
{
    float maxSpeedKmh; // kilometers per hour
    float powerHp; // horsepower
};

class Car
{
public:
    explicit Car(const CarAttribs& a) { /* ... */ }
};

int main()
{
    auto car = Car{{.maxSpeedKmh = 220, .powerHp = 180}};

    // Unfortunately that syntax doesn't prevent unspecified fields (no compiler warning)
    auto car2 = Car{{.maxSpeedKmh = 220}};
}

// Rust
struct CarAttribs {
    max_speed_kmh: f32, // kilometers per hour
    power_hp: f32, // horsepower
}

struct Car { /* ... */ }
impl Car {
    fn new(attribs: &CarAttribs) -> Self {
        Car{ /* ... */ }
    }
}

fn main() {
    let car = Car::new(&CarAttribs {
        max_speed_kmh: 220.0,
        power_hp: 180.0,
    });

    // This fails to build with
    //   error[E0063]: missing field `power_hp` in initializer of `CarAttribs`
    // let car2 = Car::new(&CarAttribs { max_speed_kmh: 220.0 });
}

In the end, you must decide per case. Often times, the declaration of functions or types allows for human errors, so before changing to strong typing, you should first consider if the order of parameters, name of fields, choice of constructor(s), et cetera are sane, consistent in their meaning (money amount shouldn’t be 123 cents in one place, but decimal number string 1.23 elsewhere) and follow the principle of least surprise and smallest risk of mistakes.

There’s also no clear winner between the languages — since strong typing is not a built-in feature in either language, you must roll your own or use a library, and that isn’t exactly elegant, but still readable.

Read more… (post is longer)

Today I learned — episode 2 (hacking on Rust language)

October 28, 2016

This blog series (in short: TIL) is supposed to cover topics in software development, learnings from working in software companies, tooling, but also private matters (family, baby, hobbys).

Hacking on the Rust language — error messages

Right now I’m totally digging Rust, a modern systems programming language which covers for instance thread-safety and memory access checks at compile time to guarantee code safety — while still being close to hardware like C/C++ — and has many more benefits such as a well-designed standard library, fast paced community and release cycle, etc.

Since I’m professionally working in C++, I am currently drafting a presentation that compares C++ with Rust with the goal of finding out where C++ could improve. The plan is to first present the slides in the Munich C++ meetup when completed.

One topic where C++ lags behind are macros — in Rust (macro documentation), one can match language elements, instead of doing direct text preprocessing (pre = before the compiler even parses the code).

// Simple macro to autogenerate an enum-to-int function
macro_rules! enum_mapper {
    // `ident` means identifier, `expr` is an expression. `,*` is comma-separated repetition (optional trailing comma)
    ( $enum_name:ident, $( ($enum_variant:ident, $int_value:expr) ),* ) => {
        impl $enum_name {
            fn to_int(&self) -> i32 {
                match *self {
                    // I can put comments in a macro, and don't need to have backslashes everywhere!
                    $(
                        $enum_name::$enum_variant => $int_value
                    ),* // repetition consumes matches in lockstep
                }
            }
        }
    };
}

// Totally stupid enum
#[derive(Debug)]
enum State {
    Succeeded,
    Failed(String), // error message
    Timeout,
}

enum_mapper!(
    State,
    (Succeeded, 1),
    (Failed, 2),
    (Timeout, 3)
);

fn main() {
    let st = State::Failed("myerror".to_string());
    println!("{:?} maps to int {}", st, st.to_int());
}

(play with this code)

That code snippet produces the error error[E0532]: expected unit struct/variant or constant, found tuple variant State::Failed with the nightly compiler. To me, reading such a verbose error was like learning C++ in my childhood — I just had no idea of the terminology used with the language, so "unit struct/variant" and "tuple variant" were totally unclear to me and not immediately intuitive. The displayed error location also wasn’t helpful, and neither provided me the expanded macro, nor the failing code line. In this sense, the error messages are on par with the C++ preprocessor (just as bad 😂). Normally, Rust provides error explanations with examples, displayed by rustc --explain E0532. But in this case: error: no extended information for E0532.

So I found out myself — removing the variant parameter (String) from State::Failed(String) (so the enum only has simple variants), my macro was working fine, and after some thinking it was clear that I had previously commented out the consideration of variant parameters (that’s how I call them at the moment). Here’s how I could match State::Failed(String):

$enum_name::$enum_variant(..) => $int_value

Note that this is not a solution because now it won’t match State::Succeeded and State::Timeout anymore (maybe it used to work earlier), but this article is more about getting to understand the problem by the error message.

Having found the problem, I still didn’t feel happy because that debug session cost me time and might happen again for me and surely for others as well. Hence, let’s hack Rust!

Getting started with hacking Rust is elegantly simple: clone, ./configure, make. That will build the whole compiler and rustdoc toolchain, but not the cargo build tool, but that’s already enough hard disk space and download/build time spent. On my slow connection, the configure script was still cloning LLVM after 15 minutes 🐢💨…

make tips hints at different targets to limit what is built, like if you’re only working on the compiler (→ make rustc-stage1, make <target-triple>/stage1/bin/rustc).

In the meantime I searched the code for existing error messages (just grepped for the one of E0003), and immediately found the source file src/librustc_const_eval/diagnostics.rs. I found it strange that the list of diagnostic error messages was so short, so I did ag -l '^\s*E[0-9]{4}:' and discovered that the error messages belong to the respective crate. In case of the example error I grepped (E0003), it’s the crate for "constant evaluation on the HIR and code to validate patterns/matches" (HIR stands for High-level Intermediate Representation). My desired error explanation should therefore go into the right crate, which turned out to be librustc_resolve.

Finally the compiler build completed, but to my surprise, x86_64-apple-darwin/stage1/bin/rustc --explain E0003 could not find an explanation. That was peculiar as stage 1 already should give me a working compiler (as of the writeup in make tips and the great summary Contributing to the Rust compiler). The solution of the riddle was easy: E0003 has vanished with the following commit:

commit 76fb7d90ecde3659021341779fea598a6daab013
Author: Ariel Ben-Yehuda <ariel.byd@gmail.com>
Date:   Mon Oct 3 21:39:21 2016 +0300

    remove StaticInliner and NaN checking

    NaN checking was a lint for a deprecated feature. It can go away.

Using another error code, it displays the explanation just fine, e.g. x86_64-apple-darwin/stage1/bin/rustc --explain E0004. Only missing point was to get in my desired explanation and example for E0532, and test it in the same way. This part is too detailed for a blog post, but I ultimately ended up with a pull request (still pending at the time of writing).

Now I’m happy to have started my first contribution. There will surely be more blog posts following about my experiences with Rust!

P.S. Only later I found that the stable rustc 1.12.1 would’ve given a slightly better error for the initial problem (State::Failed does not name a unit variant, unit struct or a constant). Remember you can play around with Rust versions online or with rustup.

Read more… (post is longer)

Today I learned — episode 1 (introduction to blog series, Ansible)

October 22, 2016

This new blog series (in short: TIL) is supposed to cover topics in software development, learnings from working in software companies, tooling, but also private matters (family, baby, hobbys). No idea where I’m heading with it, though 😉

I would like to start with the reason for creating TIL.

nginx replaces Apache, introducing Ansible

The setup on my server got too complicated, especially with Apache configs that I’ve been maintaining since Apache 2.0.x was installed. From experience at my company, I know that nginx is much easier and concise in its configuration.

Around 2011, I created Site Deploy, a UI for deploying web sites to web servers via SSH. The set of supported web site types was limited (the ones I used, e.g. Django, Play!, static files), as were the supported servers (Apache, nginx, lighttpd).

Only in 2015, while working on an automated test environment setup for my company, I learned about Ansible, and that my software was basically the same invention to a small extent, just earlier and created for private purposes. Site Deploy has similar concepts like hosts, variable resolution, SSH connection, but never became as elegant as Ansible.

My previous setup was manual configuration of virtual hosts of Apache, plus configuration files automatically created by Site Deploy per site. This was all replaced by an Ansible-based setup, which makes it easier to

Run tasks and updates from one place
Create a test environment. I’m using a VM with the same OS as my real server, plus a suffix .testdomain so that I only have to add e.g. <IP of test VM> andidog.de.testdomain to /etc/hosts.
Test repeatability (and idempotency) of the tasks. I tried checked mode, which only performs a dry run, but it’s much harder to maintain playbooks that are compatible with it. Having a test machine (or VM) is a much better idea. Seeing things fail (even if on a test system) makes you a better developer / devop / sysadmin.
Get the same setup every time, therefore having a full configuration backup in one place, without the need to backup anything from the server (except for live data like databases).
Reuse, reuse, reuse. Ansible roles are one way to apply the same changes multiple times, e.g. creating a web site configuration. Another way is to use nginx/Apache’s include operation to add common configuration directives to other configs.
Readability and ability to share playbooks/roles with others.

Ansible is definitely the way to go for me. It’s available as package on *nix systems (not on Windows) and principally only needs Python installed on the to-be-configured server. It works by syncing modules to the server and running them there, with the inputs you provide. Read their introduction article to get a grasp of the concepts - the learning curve is gentle and you will quickly get good results.

Other than my server, I am also maintaining my home server - a Raspberry Pi with a hard drive, exposed through dynamic DNS on the Internet so that I can, for example, access my music from work. At least that’s the theory. In reality, my neighborhood only gets DSL 6000 speed with horrible upload rate which makes my remote MP3 listening experience very hickupy… Back to topic: I’ve even used Ansible to setup/restore my work laptop before.

Conclusion: Ansible is the simplest, most readable and, in my opinion, architecturally best (e.g. no server-side component, only Python required) way to set up your private server or other machine. Learning it should be really quick, while mastering takes the usual time. Even if the tool is one of the industry standards among its rivals Chef, Puppet, etc., it’s hard to find (consistent) best practices. The sharing portal Ansible Galaxy has playbooks/roles of mixed quality, which I haven’t checked out very much, and software maintainers typically don’t provide direct Ansible support in their upstream project (for instance, wouldn’t that be nice for nginx?!). However there are books on it already (I’ve read a bit of an Ansible: Up and Running excerpt a while ago, looks promising), and meetup groups around the world (for me: Munich group). The company behind the open source project also seems to be quite good at communicating and documenting everything, while keeping a good balance between making money and maintaining the open source part. Their developers and evangelists whom I’ve met on the meetups ranged from very competent to brilliant, so the future looks bright on this project.

Read more… (post is longer)

Giving away my master thesis: Comparison and evaluation of cross-platform frameworks for the development of mobile business applications

December 28, 2012

Since I’ve received enquiries about the full text of my master’s thesis from several people, I am now offering it for download. In case you want to use the thesis commercially, e.g. to decide about your future mobile development strategy, I would like you to consider making a donation — that will help me work more on personal efforts regarding mobile application development and research.

Enjoy! (even if it’s 164 pages)

Read more… (post is longer)

New tab bar PhoneGap plugin for Android based on ActionBarSherlock

October 12, 2012

I just implemented a simple new plugin to add a native tab bar to the top of an existing PhoneGap application. More exactly, I extracted the plugin from work I have previously done in conjunction with my side job and also my master’s thesis about cross-platform frameworks for mobile applications (almost finished, Monday is the deadline ^^). Since I already maintain the plugins for a tab bar and navigation bar on iOS, it only made sense to also work on this one. For a personal app idea, I also want to extract a jQuery Mobile based project template that uses PhoneGap and works on both Android and iOS – but that will still take me a bit of the time I don’t have.

Here’s how the plugin looks like in action. Both text and icon labels are possible on a tab (but not both):

Please try it out and report any problems over at my GitHub repo. It is very simple to use, check the README. Note that at the moment, the plugin is not yet in the upstream repository (pull requests seem to be accumulating there).

Read more… (post is longer)

Packaging a Sencha Touch 2 application with PhoneGap for Android

June 28, 2012

Rationale

As part of my master’s thesis, I’m comparing and evaluating several cross-platform mobile frameworks. I also wanted to have PhoneGap (now called Cordova) in the comparison, but since it does not include a UI library, I decided to combine it with Sencha Touch 2 for that purpose. Why not jQuery Mobile you may ask? Rhodes includes jQuery Mobile, and I covered Rhodes already in my comparison – and I don’t want to compare too similar frameworks. Also, Sencha Touch only includes dependencies that are actually used (concatenated into a single file), thus it seemed worthwhile to check its performance. And yes, Sencha Touch comes with native packaging for Android, but 1) that didn’t work for me, 2) does not include the device functionality offered by PhoneGap and 3) should be as performant as the PhoneGap app wrapper since it uses the platform’s web view component.

Update: Sencha Touch 2.1 and Sencha Command Utility

Somebody commented that people are having trouble when using the newer Sencha Touch 2.1 SDK. This article walks you through setting up version 2.0. Since I don’t use Sencha Touch for myself and the Sencha team might change their mind again to reinvent their own build toolchain, I will not update the whole article. However, I found that there are only few differences introduced by the new “Sencha Command utility” (replacing “SDK tools”) and build system, and my example application below works fine with Sencha Touch 2.1 if you consider the following differences:

sencha app create is now sencha generate app
sencha app build testing will not consider the argument -d android/assets/www anymore but looks into the build.xml file to find the build output directory. It defaults to <project dir>/build/<project name>/testing.
Unchanged: The sencha app build process still happily returns 0 in case of errors, so the wrapper script is still necessary.
You will have to adapt the wrapper script or your build.xml file to output the application into the android/assets/www folder.

Let’s get going

First of all, I’m using Eclipse to package the Android app, so make sure you have everything installed to create a simple Android project. In this article, I’m using Windows but it should work the same way on Linux and others. You will also need the Sencha Touch SDK and tools – ensure that the sencha command works and that it is always on the PATH (on Windows, just restart your computer). At the time of writing, the Sencha Touch SDK version was 2.0.1.1 and the Sencha Touch SDK tools version was 2.0.0-beta3. For some black magic build automation, you will also need Python (UPDATE: I use 2.7, but it should also work with 3.x).

The directory structure

Since the Android project will later go in a subdirectory, let me first explain the directory structure that our application will have:

AndroidSencha
- android
  - assets
    
    www
  - libs
  - res
  - src
  - .project (and other Eclipse Android project files)
- app
- resources
- sdk
- .senchasdk
- app.js
- app.json
- cordova-x.y.z.js
- index.html
- sencha_wrapper.py

The AndroidSencha directory contains the app scaffolding created by Sencha Touch, i.e. app (models, stores, views, controllers), resources (CSS, images), sdk (necessary Sencha Touch SDK files), .senchasdk (points to the SDK), app.js, app.json, cordova-x.y.z.js and index.html. The android folder will be created manually and contains our Eclipse project. And sencha_wrapper.py is my wrapper script for the sencha command that will be explained later.

If this does not make sense to you, check out the finished application at Github or just bear with me in the rest of the article.

Create the Sencha app

cd "/path/to/downloaded/sencha/sdk"
sencha app create AndroidSencha "/path/where/you/want/the/app/AndroidSencha"

The command should now copy/create some files and directories.

Create the Android project in a subdirectory

Now it’s time to do some bootstrapping for the Android part. In the newly created application directory (the one that contains app.js), create a folder named android and create a new Android project there using Eclipse:

Note: I use Android 2.3.3 (SDK version 10) as build target and the package name org.dyndns.andidogs.androidsencha. You might run into the error "Can’t find variable: Ext" or similar if you use the 2.2 or older emulator (see here).

Set up PhoneGap

PhoneGap is set up almost as usual with Android projects, following the official guide, but some steps differ a bit. In the android folder:

Create the folders /libs and /assets/www
Copy cordova-x.y.z.jar to /libs and add it to the build path using Eclipse
Copy the xml folder from PhoneGap to /res
Make the changes to the main activity (in my case AndroidSenchaActivity)
Different and optional: Add setIntegerProperty("loadUrlTimeoutValue", 60000); before the super.loadUrl call in case you run into timeout problems!
Make the changes to AndroidManifest.xml
Different: It is not necessary to put cordova-x.y.z.js and the sample index.html file into /assets/www, but you might want to do that and run the app on the emulator to see if the PhoneGap "Hello World" works!! Note that the index.html will later be overwritten automatically, so don’t change it in the /assets/www directory, in fact don’t change anything there! (you will see later why)

Test if Sencha Touch is working

We have a main app directory for the Sencha Touch application and a subdirectory for Android. You should now check if the Sencha Touch application actually works. By default, it should contain a tab bar with two different views. Fire up a web server in that main directory and open it up in a browser (should be a Webkit browser, not Firefox). With Python, it’s as simple as:

python -m SimpleHTTPServer 8000
# Multithreaded alternative if you have Twisted installed:
twistd.py web --path . --port 8000

Note that it might take some time to load (especially with the single threaded SimpleHTTPServer of Python). It should look something like this:

Important: The fact that it works in the browser does not mean it works on a mobile device/emulator. It took me a while to find out that you have to change "logger": "no" in app.json to "logger": "false". Else you will get an error like "Error: [Ext.Loader] Failed loading 'file:///android_asset/www/sdk/src/log/Logger.js', please verify that the file exists at file:///android_asset/www/sdk/sencha-touch.js:7908".

PhoneGap working? Sencha Touch working? Time to combine them!

Including PhoneGap

First of all, add the PhoneGap script as dependency, so copy cordova-x.y.z.js to the main folder and change app.json to include it – therefore you only have to add it to the key "js".

"js": [
    {
        "path": "cordova-1.8.0rc1.js"
    },

Building

Sencha Touch comes with the sencha command line utility that can build an application, i.e. scan its dependencies, concatenate necessary files into a final app.js, copy resources etc. What we want to accomplish is to put that build output into the Android app’s /assets/www folder. And that’s why I said, don’t edit any files there because they will get overwritten.

I am using a simple builder configuration in Eclipse to run this command (will be explained below). Unfortunately, it seems that Eclipse does not stop the build if the sencha command returns an error – the sencha command actually always returns 0, but even if I wrap it with a script, Eclipse does not stop the build on non-zero return codes. Also, when set up as builder, Eclipse hides the sencha command output after completion (don’t know why?). That is a problem, because if you have a syntax error or other mistake in your Sencha Touch app, then you will see only the loading indicator and some unhelpful "[Ext.Loader] Failed loading…" error in LogCat once you try and start the app. I work around this problem with a dirty hack wrapper for the sencha command. In the main folder (the one with app.js), add the file sencha_wrapper.py with the following content:

import os
import subprocess
import sys

def contains_errors(s):
    return '[ERROR]' in s

def get_errors(s):
    ret = ''
    for line in s.splitlines():
        if contains_errors(line):
            ret += line + '\n'
    return ret

print('Running Sencha command...')
try:
    proc = subprocess.Popen(['sencha.bat' if os.name == 'nt' else 'sencha'] + list(sys.argv[1:]),
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    stdout, stderr = proc.communicate()

    # Try to decode output to Unicode
    stdout = stdout.decode('utf-8', 'replace')
    stderr = stderr.decode('utf-8', 'replace')

    if proc.returncode != 0 or contains_errors(stdout) or contains_errors(stderr):
        return_code = proc.returncode or 1
        sys.stderr.write('Command failed\n')
    else:
        return_code = 0

    sys.stdout.write(stdout)
    sys.stderr.write(stderr)
except Exception as e:
    stdout = ''
    stderr = ('[ERROR] Failed to execute sencha command, did you reboot and ensure the sencha command is always on the '
              'PATH? (%s)' % str(e))
    return_code = 2

# Eclipse does not seem to stop the build even for return codes != 1, so let's be a bit more cruel
with open(os.path.join('android', 'AndroidManifest.xml'), 'r+t') as f:
    MAGIC = ('SENCHA BUILD FAILED, PLEASE CHECK FOR ERRORS AND RE-RUN BUILD (THIS LINE IS REMOVED AUTOMATICALLY IF '
             'SENCHA BUILD SUCCEEDS)')
    content = f.read()

    magicPosition = content.find(MAGIC)
    if magicPosition != -1:
        content = content[:magicPosition].strip()

    if return_code != 0:
        content += '\n' + MAGIC + '\n' + get_errors(stdout + '\n' + stderr)

    f.seek(0)
    f.write(content)
    f.truncate()

exit(return_code)

This script runs the sencha command with the passed arguments and checks if the string "[ERROR]" occurs in the output (by the way, they also don’t use stderr as they should) and if so, writes these errors to the end of AndroidManifest.xml and thus stops the Android build process because that XML file is no longer valid. As I said, a dirty hack. These lines are automatically removed once you correct the Sencha Touch app mistakes and run the script again.

So let’s try that out. From the main folder, run python sencha_wrapper.py app build testing -d android/assets/www. If successful, it will show no errors and end with "Embedded microloader into index.html". Now go to Eclipse, refresh the project (select project name and hit F5) and then run it. The app should work without problems (only the emulator is slow as hell):

Build automation

Great progress! But of course you don’t want to use the command line and refresh manually every time you want to run your app, so let’s automate this. In Eclipse, right click the project, select "Properties" and then "Builders", "New…" and "Program". Configure it as follows (your Python path will vary):

Click "OK" and use the "Up" button to move that builder to the beginning of the list:

Ensure that AndroidManifest.xml is refreshed after the script is run:

As a bonus, you can set up another builder called "Force recompile" that ensures that every time you click the run button in Eclipse, the Sencha Touch app is recompiled and Eclipse rebuilds the Android app instead of bringing the current intent to the foreground as it would normally do if it doesn’t recognize any changes (which it can’t because Sencha Touch app changes are in the parent directory!). The builder is configured as follows, note that you will need touch.exe (equivalent for the touch command of Linux, for Windows you can use the one from msysgit or the gnuwin32 coreutils package):

Eclipse builder for forcing recompilation (1)

Make sure it’s located before "Android Package Builder":

Eclipse builder for forcing recompilation (2)

That’s it, now when you click the run button in Eclipse, the app should always be recompiled from the current Sencha Touch source code in the main folder.

Short test

Open app/view/Main.js and replace the Getting Started content as follows:

html: "<a href=\"javascript:navigator.notification.alert('Congratulations, you are ready to work with Sencha Touch 2 and PhoneGap!')\">Click me</a>",

Run the application and when you click the link, you should get a native alert box:

Working app using Sencha Touch 2 and PhoneGap

Finished!

So there you have it, two possibly great frameworks combined! Whether they really are that great – I will find out in my thesis while implementing a sample application using this combination. Then I shall see how it performs on my Huawei Ideos X3 (probably the slowest Android phone available). You can get the finished application at Github.

Some hints: Again, do not change anything in /android/assets/www, but rather edit the code in the main folder (app.js and anything in app and resources). Mind that we used the command app build testing in our builder – this is cool for debugging because it leaves the JavaScript unminified. If you want to release your app, you should replace testing by production.

Note also that PhoneGap might not be available as soon as Sencha Touch loads. I tried it in a painted event and it was not loaded yet. Since the sencha app build command loads the application to find dependencies, you have to take care that it can be loaded – using PhoneGap in startup code will produce errors, for example.

And watch out for stupid syntax errors:

Feedback and further considerations

I’m happy to hear any feedback to this article. Just leave a comment or send me a mail. If someone finds out how to create a splash screen that hides when Sencha Touch is loaded, or how to use PhoneGap in startup code, let me know! Cheers!

Archived comments from old blog:

I saved those from the old system since this article is still popular, so the comments may be important. Commenting is no longer possible.

107 archived comments

Edu commented on Thursday, June 14, 2012 at 22:54 UTC

Thank you so much for this tutorial. I was going crazy trying to find a way to deploy a sencha + phonegap app on the Android. Sencha's website has a tutorial for iOS applications using xcode, but they are lacking an Android walkthrough. Once again, thanks a lot!

Cliff commented on Wednesday, June 20, 2012 at 13:27 UTC

Great Tutorial! Finally a coherent walkthrough. Thank you so much! Helped me a lot.
Two things though.
1. Maybe you should explain, where you get the touch.exe
2. Could you provide us a screenshot how your Eclipse project explorer looks like? I was a bit confused with all the folders :)
Also do you develop your SenchaTouch2 app in Eclipse?
Once again, Thanks a lot!
Cliff

Andreas Sommer commented on Wednesday, June 20, 2012 at 15:57 UTC

Thanks for your feedback! I made the changes to the article to explain the directory structure and added links to where you can get touch.exe for Windows.

bouhlel commented on Thursday, June 21, 2012 at 11:21 UTC

running osx 10.7.4 and eclipse 3.7.2 and for some reason best known to eclipse it refuses to make the android project in AndroidSencha/android but in AndroidSencha ( as per your example)
one point-it may be useful to point out how you launch the apk in the emu as some phonegap examples use the RunAs > Android Appl(right click on project) and not the usual run configuration setup.
Other than that a great headsup for my compilation.Many thanks

Andreas Sommer commented on Thursday, June 21, 2012 at 17:44 UTC

Regarding the run configuration, right-click > Run As > Android Application is the usual way and I thought that goes without saying for most people that read an article with this title.
I'm not really sure what you are saying here with Eclipse on OSX – can you explain that? The Android project (.project file) goes into the android subdirectory that you create manually.

Mahmut commented on Friday, June 22, 2012 at 02:18 UTC

Hi, thanks this post. is very excellent.
I'am getting the following error.
CordovaLog(828): Uncaught TypeError: Cannot call method 'alert' of undefined
CordovaLog(828): file:///android_asset/www/index.html: Line 1 : Uncaught TypeError: Cannot call method 'alert' of undefined
Web Console(828): Uncaught TypeError: Cannot call method 'alert' of undefined at file:///android_asset/www/index.html:1
Do you help me? Thanks.

Björn R commented on Friday, June 22, 2012 at 08:33 UTC

thx for your great tutorial but i got Problems to launch my app.
If i use one cordova,exec command, there fires a error that i use it before device ready.
I know that sencha touch fires after device ready, so i don't know wheres the Problem, did you know that Problem?
The funny thing is, that all works fine on iOS

Andreas Sommer commented on Friday, June 22, 2012 at 11:33 UTC

@Mahmut: You should try the application in your browser as described in my article (must be a WebKit browser like Chrome or Safari). Use the developer tools to see if all JavaScript files load correctly. Maybe you entered the wrong filename for the Cordova script in app.json?!

Andreas Sommer commented on Friday, June 22, 2012 at 11:39 UTC

@Björn: Looking at the Sencha Touch source code, it seems that it should work fine and the log of my Android emulator also says that deviceready is fired right before the launch method is called. But I remember that I had some issues with that, so I took the safe route and do not use any Cordova functions directly in launch. Not sure what's wrong here and I unfortunately cannot reproduce the problem anymore.
If you need Cordova when the app starts, you may want to try a workaround. For example I'm hiding Cordova's splash screen when deviceready has fired:

// In global context
document.addEventListener("deviceready", function() {
    console.log("phoneGapReady()")
    phoneGapReady = true
    console.log("phoneGapReady()~")
}, false)

// In app.js
launch: function() {
    console.log("launch()")

    function hideSplashScreen()
    {
        if(phoneGapReady)
            navigator.splashscreen.hide()
        else
            setTimeout(hideSplashScreen, 100)
    }

    setTimeout(hideSplashScreen, 100)

    // other stuff

    console.log("launch()~")
},

Mahmut commented on Friday, June 22, 2012 at 13:26 UTC

I'am sorry, app.json file incorrectly written.

"js": [
        {
            "path": "cordova-1.8.1.js",
            "path": "sdk/sencha-touch.js"
        },

written as follows

"js": [
        {
            "path": "sdk/sencha-touch.js"
        },
        {
            "path": "cordova-1.8.1.js"
        },

now it works.
very thanks.

Andreas Sommer commented on Saturday, June 23, 2012 at 23:08 UTC

It should be noted that Björn's problem was solved by taking the right files from the PhoneGap package – the ones from the android folder, not the ios directory!

Sandip commented on Wednesday, June 27, 2012 at 16:37 UTC

Thank you so much for such a nice tutorial. I had spent one and half day following tutorial available at Sencha Touch, PhoneGap and Android site but none of them explains clearly how to setup a perfect development environment using Eclipse like you did in this tutorial.

Srini commented on Wednesday, June 27, 2012 at 21:48 UTC

Excellent tutorial. I am new to mobile app development but was able to follow through and able to get far enough. When I run the application in emulator after the build I get the following message in the eclipse console "AndroidSencha] ActivityManager: Warning: Activity not started, its current task" and on the emulator 3 blinking dots. I had the same same 3 blinking dots issue on the google chrome browser also after long reasearch I found that I had to allow jason mime type on iis.
Please help.

srini commented on Wednesday, June 27, 2012 at 21:56 UTC

On a side note, I was not able to run a simple javascript on the index.html

Srini commented on Wednesday, June 27, 2012 at 22:06 UTC

E/Web Console(1467): Uncaught Error: [Ext.Loader] Failed loading 'file:///android_asset/www/sdk/src/log/Logger.js', please verify that the file exists at file:///android_asset/www/sdk/sencha-touch.js:7908
The above error is occurring where as sencha-touch.js file exists, but the log\logger.js doesn't exist. FYI.. if that helps in answering my issue.

Andreas Sommer commented on Wednesday, June 27, 2012 at 22:23 UTC

@Srini: Please try to change the "logger" value in the file app.json to "false" as denoted in my article. If that doesn't work, please try false without quotes. I think I read about a recent change in Sencha Touch and you have to use a normal boolean value instead of a string. Please let me know if that worked!

Srini commented on Thursday, June 28, 2012 at 01:02 UTC

Hi, I only have the following text in the app.json file

{"id":"2f1abb90-c085-11e1-8064-f3c7ed002926","js":[{"path":"sdk/sencha-touch.js","type":"js"},{"path":"app.js","bundle":true,"update":"delta","type":"js"}],"css":[{"path":"resources/css/app.css","update":"delta","type":"css"}]}

I don't see logger entry, am I missing anything?

Andreas Sommer commented on Thursday, June 28, 2012 at 15:09 UTC

That is all you got when creating a new project?
I have another key "buildOptions" in that file, maybe you can try adding this:

"buildOptions": {
        "product": "touch",
        "minVersion": 3,
        "debug": false,
        "logger": "false"
    },

and then try which value works for "logger".
Which version of the Sencha Touch SDK are you using?

Srini commented on Friday, June 29, 2012 at 01:19 UTC

1. Yes that is all in my app.json file
2. I tried adding the "buildOptions" tag and tried with all options "false and no" with and with out quotes. Here is the updated app.json file

{"id":"2f1abb90-c085-11e1-8064-f3c7ed002926","js":[{"path":"sdk/sencha-touch.js","type":"js"},{"path":"app.js","bundle":true,"update":"delta","type":"js"}],"css":[{"path":"resources/css/app.css","update":"delta","type":"css"}],"buildOptions": [{"product": "touch","minVersion": 3,"debug": false,"logger": "no"}]}

3. Sencha Touch version 2.0.1.1
Please let me know if there is anything else I should try.
Regards
Srini

Andreas Sommer commented on Friday, June 29, 2012 at 08:24 UTC

So you have the very same version of the SDK. When a new project is created with sencha app create <name> <path>, the JSON file should be like this. Maybe you want to try and copy the contents completely. The only thing I changed in app.json in my own application is the ID (autogenerated), name and that "logger" value.

Srini commented on Friday, June 29, 2012 at 18:42 UTC

Thanks for your help, but unfortunately issue remains same with all variations of "logger" values, here is the error I get.
06-29 18:36:18.754: E/Web Console(560): Uncaught Error: [Ext.Loader] Failed loading 'file:///android_asset/www/sdk/src/log/Logger.js', please verify that the file exists at file:///android_asset/www/sdk/sencha-touch.js:7908
Let me know if you think of anything.
Regards
Srini

Srini commented on Friday, June 29, 2012 at 19:32 UTC

Forgot to mention that I used the app.json file you have listed above

Andreas Sommer commented on Friday, June 29, 2012 at 20:33 UTC

Can you zip up your application as it is (include the large sdk folder also) and upload it somewhere? Maybe I can have a look.

Srini commented on Saturday, June 30, 2012 at 21:22 UTC

Sure, please get the zip file at https://docs.google.com/open?id=0B1h-jczMK2IBLU9xTW5WMVB2TWs

Andreas Sommer commented on Sunday, July 01, 2012 at 10:25 UTC

In that ZIP file, the app.json contains the line "logger": "no" just like I said, which you have to replace by "logger": "false". I don't know which JSON file you quoted above.
And don't forget to add the builder configurations to the Eclipse project.

Srini commented on Sunday, July 01, 2012 at 17:28 UTC

app.json , logger value "no" was the last one I tried, as I mentioned I tried all variations of logger value no, "no", false, "false".
I am not following this line "And don't forget to add the builder configurations to the Eclipse project"

Andreas Sommer commented on Monday, July 02, 2012 at 12:08 UTC

It worked immediately for me once I changed that single line to "logger": "false" in the files of your ZIP file.
With the "builder configurations", I mean the ones I mention in my article. Please read the article carefully. I didn't see in your Eclipse project that you set up these builder configurations. The first one is very important because it compiles the application and puts it into the Android project's assets/www folder. It is possible that you forgot that, and therefore your changes in <root folder>/app.json did not get copied to <root folder>/android/assets/www/app.json.

Srini commented on Wednesday, July 04, 2012 at 23:12 UTC

Finally, it worked when I recreated the everything from scratch. Only deference is this time, I changed logger value to "false" in the root app.json file, all along I was changing it in assets/www/app.json file.
Thanks for your help and support. I will continue with the next steps in the article and will let you know how it goes.

Victor commented on Thursday, July 05, 2012 at 06:44 UTC

Hi,
I'm encountering a problem when following your guide.
When i have to combine phonegap and sencha i get an error from that it can not find the file index.html. Another issue is when launching the web server with python, i solved that by first starting in the AndroidSencha catalog and from there launch the python simpleserver. But when i have to combine phonegap and sencha i'm also starting in androidsencha catalog but it cannot find all the files. How is your python installation on windows? Mine is under C:\python27\ and AndroidSencha is C:\AndroidSencha\. The issue maybe already starts that i can't just type python in cmd, i have to be located inside the AndroidSencha catalog and run the python web server.
Any help would be appreciated. Thank you for a great tutorial by the way!

Andreas Sommer commented on Thursday, July 05, 2012 at 07:56 UTC

@Victor: If the application directory is C:\AndroidSencha\ (the one with app.json in it), then you should start the web server in that directory. Open up the command prompt (press Win+R, "cmd"), switch to the project directory (cd C:\AndroidSencha) and start the web server from there, i.e. type C:\Python27\python -m SimpleHTTPServer 8000. The server will then serve files from that directory, and when you open http://localhost:8000/ in the browser, you should see a lot of requests to "/sdk/src/*" (takes a while because the server is single-threaded).
Let me know if that solved your issues.

Srini commented on Thursday, July 05, 2012 at 19:48 UTC

Now that I am able to do this article I I am trying to use the following link to further get some exposure to do development and using the following (http://miamicoder.com/2012/how-to-create-a-sencha-touch-2-app-part-1/) tutorial to use in the application we created here, but I am unable to continue further. I just want to keep the application framework we created in this article and extend the new tutorial under this. Could you guide me on how can I go about this?

Andreas Sommer commented on Thursday, July 05, 2012 at 21:00 UTC

Well that tutorial does not use the official way to create an application (sencha app create ...). You can just take what you have from my article, and then start in the other tutorial from the section "Extending Classes In Sencha Touch". You won't have to change any of the directory structure, but you will only create new files for views and stuff, and change the launch function in app.js – it's all in their tutorial. Just make sure you don't change index.html as they do, follow from the heading I mentioned above.

Victor commented on Friday, July 06, 2012 at 07:03 UTC

Thank you for the quick reply!
The problem was not testing the web server it is when i merge phonegap and sencha: python sencha_wrapper.py app build testing -d android/assets/www. I get an error: Failed loading your applicaition from: 'file:///C:AndroidSencha/index.html'. Try setting the absolute URL to your application for the 'url' item inside 'app.json'..I've tried setting the url but then it just show the index.html content. I'm running win7 x64 if it has any meaning.

Andreas Sommer commented on Friday, July 06, 2012 at 09:44 UTC

@Victor: This usually means there's an error in your application, because the sencha build tool actually interprets some parts of the JavaScript code while it build the application! Can you try to run sencha app build testing -d android/assets/www yourself in a console and look through all the output? There should be some error shown in one of the JavaScript files. If you don't find it, you can upload what you have somewhere, then I can have a look.

Victor commented on Friday, July 06, 2012 at 12:53 UTC

I only get ERROR : CreateProcessW: The system cannot find the file specified. And then it says that it cannot load the AndroidSencha/index.html. I've tried many things, manually moving etc...but i cannot get it to work. I've UL the folder to http://code.google.com/p/group4-gps/downloads/detail?name=AndroidSencha.zip&can=2&q=#makechanges

Srini commented on Friday, July 06, 2012 at 21:08 UTC

Hi Andreas, thanks for the advice. Today I started to follow the tutorial from other site I mentioned and able to get to a point where it shows the initial screen on part1 tutorial. But when I click the "New" button, the event fires and nothing happens. As per the tutorial it should list the log messages when taped on new button.
I have ziped my project and placed at the following link https://docs.google.com/open?id=0B1h-jczMK2IBZXl1Tlo1Mkl4bXM
Only thing is I just commented out "Ext.Viewport.add(Ext.create('AndroidSencha.view.Main'));" so that other page will be loaded initially.
I have included logCat messages in LogCat.txt file, kindly let me know if I am doing anything wrong here.
Regards
Srinivas

Andreas Sommer commented on Friday, July 06, 2012 at 22:10 UTC

@Victor: Obviously the sencha command is not available. Please ensure you installed the SDK tools, the installer will add the relevant directory (e.g. C:\Program Files\SenchaSDKTools-2.0.0-beta3) to the PATH so that you can call that command from anywhere (which is necessary for my wrapper script as well). After installation, reboot or log out and back in and then try the sencha command from a command prompt. If that works, my wrapper script should work, too.

Andreas Sommer commented on Friday, July 06, 2012 at 22:17 UTC

@Srini: I'm not sure I understand your problem here. In the tutorial, clicking on the new button only logs the message "onNewNote", and that is exactly what your application does (in the Notes.js file, method onNewNote). If you have a look at the LogCat output, it states the three messages from the tutorial ("init", "launch", "onNewNote"), marked with the tag "CordovaLog".

Srini commented on Saturday, July 07, 2012 at 23:19 UTC

I realize now, I was hopping it would list the messages on the screen. They have shown a screen in the tutorial, I didn't realize that that is a console message screen.
Thanks for the help.
Regards
Srinivas

Srini commented on Monday, July 09, 2012 at 00:35 UTC

Andreas, thanks for your continued support. I went ahead with the tutorial 2 of from the following link (http://miamicoder.com/2012/how-to-create-a-sencha-touch-2-app-part-2/) everything went fine, but our build process was not completing successfully. I have included the build error in the zip file (https://docs.google.com/open?id=0B1h-jczMK2IBeGU4WWJvSjdYR2M) . I had to comment the line "Ext.getStore("Notes").load();" in the controller(controller\Notes.js) to build it successfully. Not able to figure it out on why it is not recognized. It has been declared in the "views\NotesListContainer.js".
Please help in proceeding further.
Thanks And Regards
Srini

Srini commented on Monday, July 09, 2012 at 00:45 UTC

Never mind I had to add stores: ['Notes'], to app.js and that helped.
Regards
Srini

Srini commented on Monday, July 09, 2012 at 02:32 UTC

Hi Andreas, This tutorial helped me to understand to package for Android which it is meant for, great job!
Is there such tutorial to package Sencha Touch 2 to Windows Phone?
If you have any references please help me.
Thanks and Regards
Srini

Andreas Sommer commented on Monday, July 09, 2012 at 10:44 UTC

Glad you found out what the problem is.
For Windows Phone, the process is quite similar. Don't know any tutorial, but it should work the very same way as I explained in this article, except that you would have to create a custom builder in Visual Studio that runs python sencha_wrapper.py app build testing -d /www. Should be easy to figure out.

Srini commented on Wednesday, July 11, 2012 at 04:38 UTC

I have tried running the command "python sencha_wrapper.py app build testing -d /www" from command prompt it ran successfully but the windows phone application doesn't work , it just hangs on the start up screen with 3 dots.
I just want to give this FYI.
Regards
Srinivas

Andreas Sommer commented on Wednesday, July 11, 2012 at 08:12 UTC

The cordova-x.y.z.js JavaScript file is different between the platforms because the native bridging mechanism differs (and probably for other reasons). So for Windows Phone, you'd have to use the one from the lib/windows-phone folder of the PhoneGap download. If there are still problems, you should look at the JavaScript console output to see what's the problem – I'm not sure if that console can actually be seen when developing WP, if not you may want to integrate weinre.

Aaron commented on Saturday, July 14, 2012 at 03:18 UTC

This is awesome, thank you. I initially installed Python 3.2 and had a couple of problems. After uninstalling and then installing Python 2.7 the problems were solved.

Andreas Sommer commented on Saturday, July 14, 2012 at 10:02 UTC

@Aaron: Thanks for the hint. I checked that shortly and made a little modification to make it work with Python 3.2. That change can also be found in the GitHub repository.

Victor commented on Tuesday, July 17, 2012 at 07:16 UTC

Hi again!
I've still not been able to compile the project, (python sencha_wrapper.py app build testing -d android/assets/www). Just keeps saying that it can't find the index.html file. I've also looked around at other forums with similar problem, one suggested to not have the Sencha tools folder under Program Files due to the space in the folder name, but this didn't either solve my problem. I've also tried to set the absolut url path in the app.json file and still unsuccessful. Don't know what I've missed because it works on a mac but now on my Win7 x64. Any other suggestions would be apperciated. /Victor

Andreas Sommer commented on Tuesday, July 17, 2012 at 11:12 UTC

Are you executing the command from the directory with app.json and index.html? Please try sencha app build testing -d android/assets/www and post the build errors here (all of it!), or a screenshot of the output.

Victor commented on Tuesday, July 17, 2012 at 11:18 UTC

I've UL it to http://code.google.com/p/group4-gps/downloads/detail?name=ST2.jpg&can=2&q=#makechanges. If i add the absolute URL in app.json, it gives the same error but with the new URL (http://localhost/AndroidAA/index.html)

Andreas Sommer commented on Tuesday, July 17, 2012 at 12:43 UTC

The actual error is "CreateProcessW: The system cannot find the file specified.", meaning that some executable could not be started by the sencha tool. I can't really help you here. Make sure you can get a simple ST application to work and that you have the latest SDK tools installed. If that doesn't work, you should debug all spawn calls in the sencha tool (sencha.js and probably other files in the SDK tools folder) and check whether the executed file actually exists. For example, bin\jsdb.exe is called from sencha.js.

victor commented on Tuesday, July 17, 2012 at 21:06 UTC

I solved it! The problem was that i didn't create the sencha app from the sdk. I was creating it from the sencha tools folder. Don't know why it worked from there but there it is. Thank you for all the help and for a great tutorial.

spencer commented on Monday, August 20, 2012 at 05:53 UTC

I have a question about setting up PhoneGap. What do you mean by "Make the changes to the main activity" and "AndroidManifest.xml"? What changes do you make in those files?

Shree commented on Thursday, August 23, 2012 at 06:03 UTC

Thanks for the Tutorial.
I was also getting the same "Uncaught TypeError: Cannot call method 'alert' of undefined". I added "Cordova-x.y.z" in the head of index.html and that solved the problem.
Thank you :)

Andreas Sommer commented on Thursday, August 23, 2012 at 11:26 UTC

@spencer: Only what is described in the PhoneGap setup guide for Android.

Andreas Sommer commented on Thursday, August 23, 2012 at 11:28 UTC

@Shree: If you followed my article, the Cordova script should be loaded automatically by Sencha Touch.

Stephen Brown commented on Friday, September 14, 2012 at 10:36 UTC

I have downloaded from GitHub your code but when I run it the AndroidManifest.xml file has an error added to the bottom of the file.
[1m[31m[ERROR][39m[22m ENOENT, no such file or directory 'C:\Users\rross\workspace\android-sencha-touch-2-phonegap-packaging\resources\images'

Andreas Sommer commented on Friday, September 14, 2012 at 15:21 UTC

@Stephen Brown: The Sencha build script expects this directory to exist even if it's empty. Just create it and it should work. I updated the repo.

Anestis Kivranoglou commented on Saturday, September 15, 2012 at 21:41 UTC

Thank you for your great tutorial.
Everything seems to work fine, although i have a problem probably with the python script.
When i manually copy paste my sencha app on the android/assets/www folder the app runs fine.
But when i try to compile the project, only some of the sencha app files are copied into the "www" folder thus getting different kind of errors.
(like Uncaught TypeError: Object #<Object> has no method 'define)

Andreas Sommer commented on Saturday, September 15, 2012 at 23:11 UTC

@Anestis: This does not sound like a problem with my script because it only wraps the sencha command. Please run sencha app build testing -d android/assets/www manually and see if you get the same problem.

Osman commented on Tuesday, September 18, 2012 at 14:58 UTC

I keep getting this error when running the sentcha_wrapper file. What could be the problem ? Thanks.

←[1m←[31m[ERROR]←[39m←[22m ReferenceError: Can't find variable: deviceapis
Stack trace:
   file:///C:/inetpub/wwwroot/myapp/cordova-2.0.0.js : 4132 : Anonymous
   file:///C:/inetpub/wwwroot/myapp/cordova-2.0.0.js : 4114 : Anonymous
   file:///C:/inetpub/wwwroot/myapp/cordova-2.0.0.js : 579 : Anonymous
   file:///C:/inetpub/wwwroot/myapp/cordova-2.0.0.js : 619 : Anonymous
   file:///C:/inetpub/wwwroot/myapp/cordova-2.0.0.js : 5648 : Anonymous
   file:///C:/inetpub/wwwroot/myapp/cordova-2.0.0.js : 480 : Anonymous
   file:///C:/inetpub/wwwroot/myapp/cordova-2.0.0.js : 579 : Anonymous
   file:///C:/inetpub/wwwroot/myapp/cordova-2.0.0.js : 619 : Anonymous
   file:///C:/inetpub/wwwroot/myapp/cordova-2.0.0.js : 78 : Anonymous
←[1m←[31m[ERROR]←[39m←[22m Failed loading your application from: 'file:///C:/ine
tpub/wwwroot/myapp/index.html'. Try setting the absolute URL to your application
 for the 'url' item inside 'app.json'

Andreas Sommer commented on Tuesday, September 18, 2012 at 17:12 UTC

@Osman: Doesn't make sense to me – the deviceapis attribute isn't referenced anywhere in the JavaScript file of Cordova 2.0.0. Please check where you use this attribute. Just do a file search through all the files in your app folder. Maybe you used it in your own application or you have a new Sencha Touch SDK that uses it?! This has nothing to do with my build script or tutorial, I assume.

Osman commented on Wednesday, September 19, 2012 at 09:18 UTC

The earlier problem was solved. Dont know exactly why, but I had something wrong with my app.json file.
I have finished the rest of the tutorial. My problem is that I cant see the changes I made in the Main.js . What could be the problem ? I save the file and refresh the project every time..

Andreas Sommer commented on Wednesday, September 19, 2012 at 10:23 UTC

@Osman: Please check if the output in android/assets/www changes whenever you start the app from Eclipse. You should see the sencha wrapper script running every time. The default builder options is only after "clean" and on manual compilation, i.e. when you execute the app.

James commented on Tuesday, October 09, 2012 at 07:26 UTC

Hi, you really made a great job here and it works fine for me. My question now is: how can we package the app for google play?

Andreas Sommer commented on Tuesday, October 09, 2012 at 08:45 UTC

@James: Thanks! For publishing on Google Play you just follow the normal process for any Android app. Note that you may have to disable the two builders first (untick them) because the APK export could fail if files are changed during the build ("file out of sync" errors).

James commented on Tuesday, October 09, 2012 at 09:58 UTC

Okay, thank you again for the quick response...

Santo commented on Sunday, November 11, 2012 at 22:12 UTC

Hey there :)
Good job on this tutorial, but I'm encountering a problem where I'm not able to find a solution for by myself.
I'm using Sensa Touch 2.1 and Sencha CMD v3.0.0.250.
I'm using your phyton sencha_wrapper to build the stuff, but my /assets/www is empty everytime.
Confuesing is, the script itself seems to run without encountering problems. (you can see the log here: http://pastebin.com/R37737LY).
Are there any common newb fails for this situation?
Thanks for the help :)
Regards :)

Andreas Sommer commented on Sunday, November 11, 2012 at 23:14 UTC

@Santo: You should first check if the Sencha build tool successfully creates your application in assets/www. Just run `sencha app build testing -d android/assets/www` and see what happens. Note also that this article was about Sencha Touch 2.0.1.1, and it seems the build tool has changed (at least its output...). Let's see what we can find out.

Santo commented on Sunday, November 11, 2012 at 23:27 UTC

Hey :)
Thanks for the fast reply. Tested and it's not creating content within /assets/www as well. I also figured out, that the sdk folder is missing within my AppFolder. Maybe this is the problem? Or is this based in regards to using Sencha CMD and not the SDK Tool? :)
Thank's for the heads up!

Andreas Sommer commented on Sunday, November 11, 2012 at 23:32 UTC

The .senchasdk must point to a valid SDK folder which is normally copied automatically when creating a project. If you don't have it, copy it from a fresh project.

Santo commented on Sunday, November 11, 2012 at 23:43 UTC

Is this still needed, even by using Sencha CMD? Cause, the whole sdk folder itself is missing as well.
If so, I guess I'll try to generate a new app and copy the content?

kermit136 commented on Sunday, November 18, 2012 at 10:30 UTC

Hi, the entire community have problems with ST 2.1.
http://www.sencha.com/forum/showthread.php?248720-PhoneGap-Build-produces-broken-app-after-upgrading-from-ST-2.0.1-to-ST-2.1./page5
maybe you could help us

kermit136 commented on Sunday, November 18, 2012 at 16:20 UTC

Thank you Andreas,
it worked on Android.
If I substitute the phonegap js to run on ios it doesn't work.
Any suggestion?

Andreas Sommer commented on Sunday, November 18, 2012 at 18:08 UTC

Mind that the PhoneGap JS file on iOS is a different one. You should use the typical debugging tools (weinre / iWebInspector) if you don't see any problems in the logs. I cannot help you here because I don't have time to dig into Sencha Touch since I don't use it myself.

Rob commented on Friday, December 14, 2012 at 00:26 UTC

Hi ich hab da n kleines problem mit sencha, wenn ich sencha in phonegap integrieren möchte bekomme ich ne fehlermeldung dass die index.html nicht geöffnet werden kann, hab das auch schon in der app.json datei hartcodiert, aber es nützt nichts hab auch nur die standard sencha app dafür genutzt. Hoffe du kannst mir da vllt weiterhelfen.

TB commented on Tuesday, December 18, 2012 at 17:55 UTC

I know that you have addressed this before but it still wasnt very clear to me (being a noob) especially since I just installed ST 2.1 and trying to follow your sample.
-"PhoneGap is set up almost as usual with Android projects, following the official guide, but some steps differ a bit. In the android folder:" When setting up phonegap. Do you create folders in phonegap directory or ST project directory?
-When you say "Make the changes to the main activity (in my case AndroidSenchaActivity)". What changes are you talking about and where is this AndroidSenchaActivity and how to I get to it?
-How do you actually know if PG is working? I get the sample project to come up in the website but I think that is just ST, not PG.

Andreas Sommer commented on Tuesday, December 18, 2012 at 21:07 UTC

@Rob: That is not enough information to help you in detail. You should have a look at LogCat output (e.g. Web Console / JavaScript errors) to see what's going wrong when loading the page. Also, you can serve the assets/www folder with a web server and try it in a normal browser.

Andreas Sommer commented on Tuesday, December 18, 2012 at 21:21 UTC

@TB:

I explicitly said it's the android folder. Have a look at the directory structure above, or on GitHub.
Changes to the main activity: As documented in the PhoneGap guide I referenced.
Check the section "Set up PhoneGap" again. In the last step you should test with the index.html that comes with the PhoneGap project template. If it runs correctly on the emulator, you can happily continue. You won't be able to test it in a normal browser until non-mobile OSes are supported – run it in the emulator to see whether PhoneGap's ondeviceready event fires, it will then show a blinking text or so.

Subhajit Sarkar commented on Wednesday, December 19, 2012 at 06:45 UTC

@Andreas Sommer
"SENCHA BUILD FAILED, PLEASE CHECK FOR ERRORS AND RE-RUN BUILD (THIS LINE IS REMOVED AUTOMATICALLY IF SENCHA BUILD SUCCEEDS)
[ERROR] Failed to execute sencha command, did you reboot and ensure the sencha command is always on the PATH? ([Error 2] The system cannot find the file specified) "
I am getting this error can u suggest me where I am going wrong. every thing is configured correctly all are completely ok.

koo commented on Wednesday, December 19, 2012 at 09:59 UTC

hi, i keep on get tis error
SENCHA BUILD FAILED, PLEASE CHECK FOR ERRORS AND RE-RUN BUILD (THIS LINE IS REMOVED AUTOMATICALLY IF SENCHA BUILD SUCCEEDS)
[ERROR] Failed to execute sencha command, did you reboot and ensure the sencha command is always on the PATH? ([WinError 2] The system cannot find the file specified)
Please help

Jerome commented on Friday, December 21, 2012 at 16:33 UTC

Hi There,
I found personnly easier to have to 2 projects in Eclipse, one for Sencha code and another for Phonegap and have a simple builder (like a .bat file) in the Sencha one that call Sencha command (the output in complete in the Eclipse console view) and then copy the generated files to assets/www of the Phonegap project. I added in the builder settings the refresh of ressources, no need for touch.exe neither. Works great! Hope this help! Merry Christmas to all!

Andreas Sommer commented on Monday, December 24, 2012 at 02:02 UTC

Regarding the "SENCHA BUILD FAILED" error: Check if you can run the sencha command from the command line (e.g. sencha app build testing -d android/assets/www). If the executable cannot be found, fix your PATH variable and reboot. Then my wrapper script should also be able to find it.

algone commented on Sunday, January 06, 2013 at 19:28 UTC

I followed the turorial some modifications, but ended up with an error trying to run the following command:
C:\Users\moss\Documents\NetBeansProjects\AndroidSencha>python sencha_wrapper.py app build testing -d android/assets/www
Running Sencha command...
[31m[1m[ERROR] The current working directory (C:\Users\moss\Documents\NetBeansProjects\AndroidSencha) is not a recognized Sencha SDK or application folder[22m[39m
Command failed
[31m[1m[ERROR] The current working directory (C:\Users\moss\Documents\NetBeansProjects\AndroidSencha) is not a recognized Sencha SDK or application folder[22m[39m

Andreas Sommer commented on Sunday, January 06, 2013 at 21:05 UTC

@algone: Looks like you're missing the .senchasdk file or sdk folder which should be created in a new project. If you don't have them, copy them from a newly generated project. However this may have changed in the latest version of Sencha Touch, not sure.

kanyerezi@ commented on Monday, January 07, 2013 at 15:20 UTC

thank you for this great tutorial but can you please give me a screen shot of the whole folder structure please am some how lost with the www folder and my application has a server side in php are there any adjustments i should do for that? thank you in advance man...you rock

Andreas Sommer commented on Monday, January 07, 2013 at 19:25 UTC

@kanyerezi: The folder structure is mentioned in the article, and you can find it on GitHub (precompiled into the assets/www folder).

rupak das commented on Wednesday, January 09, 2013 at 12:50 UTC

Hi Andreas,
I followed your tutorial. i am using windows pc-32bit and python is not installed on my pc.so 'python' command not working on command prompt.I also tried "sencha app build testing -d android/assets/www ".it is also showing error(building fail)
I cant move ahead from here..

Andreas Sommer commented on Wednesday, January 09, 2013 at 22:12 UTC

@rupak das: I cannot help you with so little information. If the sencha compiler shows an error, post it here or try to understand what it says.

neeraj commented on Friday, January 11, 2013 at 04:56 UTC

when i run the command
"python sencha_wrapper.py app build testing -d android/assets/www"
i get an Error
i checked my /assets/www folder
it had all the files except the index.html file
please help !!
where i'm going wrong.
i have followed the tutorial well

Andreas Sommer commented on Wednesday, January 16, 2013 at 21:22 UTC

@neeraj: What error?

@redwuan commented on Tuesday, January 22, 2013 at 08:36 UTC

@neeraj: Make sure that you have Compass or SASS installed. I was getting errors too running the command and it was cryptic but it turns out that the new Sencha needs compass to compile the SASS files. I fixed it by installing the ruby gem compass and it fixed the problem.
@Andreas: Thanks for the detailed tutorial. You should point out that people need a SASS compiler installed to help with some of the issues encountered while running the sencha_wrapper.py script.

Andreas Sommer commented on Tuesday, January 22, 2013 at 18:11 UTC

@redwuan: Thanks for the information. If I remember correctly, Compass was included with the SDK tools that I used at the time of writing. Good to know about this as possible solution.

geetha commented on Tuesday, January 29, 2013 at 04:36 UTC

hi i am getting htis error!
SENCHA BUILD FAILED, PLEASE CHECK FOR ERRORS AND RE-RUN BUILD (THIS LINE IS REMOVED AUTOMATICALLY IF SENCHA BUILD SUCCEEDS)
[ERROR] Failed to execute sencha command, did you reboot and ensure the sencha command is always on the PATH? ([WinError 2] The system cannot find the file specified)

jagadish commented on Tuesday, January 29, 2013 at 05:55 UTC

SENCHA BUILD FAILED, PLEASE CHECK FOR ERRORS AND RE-RUN BUILD (THIS LINE IS REMOVED AUTOMATICALLY IF SENCHA BUILD SUCCEEDS)
[ERROR] Failed to execute sencha command, did you reboot and ensure the sencha command is always on the PATH? ([WinError 2] The system cannot find the file specified)
i am geting this error in AndriodManifest.xml ,please give me a solution

Andreas Sommer commented on Tuesday, January 29, 2013 at 20:30 UTC

Run the sencha command yourself and see what the error is.

mohammad commented on Saturday, February 23, 2013 at 19:00 UTC

hi all , i new in sencha touch and phoengap ,i'm using sencha touch 2.1
1- i'm a bit confuse about project structure
2- when i download the application at Github , then extract it , i get error
please advice , thanks a lot

webdev commented on Sunday, February 24, 2013 at 11:56 UTC

i followed you tutorial , first thank you a lot , i'm using sencha touch 2.1 sencha cmd v3.0.0.250 and both working fine
but when i run" python sencha_wrapper.py app build testing -d android/assets/www" command , at the command line i see "running sencha command ..."
and at eclipse :
SENCHA BUILD FAILED, PLEASE CHECK FOR ERRORS AND RE-RUN BUILD (THIS LINE IS REMOVED AUTOMATICALLY IF SENCHA BUILD SUCCEEDS)
[ERROR] Failed to execute sencha command, did you reboot and ensure the sencha command is always on the PATH? ([Error 2] The system cannot find the file specified)
how to fix this issue please
thanks again

Andreas Sommer commented on Tuesday, February 26, 2013 at 21:48 UTC

@webdev: I don't understand your problem. Are you saying the sencha command is found from the command line, but not when run automatically by Eclipse? In any case, make sure your PATH variable includes the directory to the sencha executable.

Andreas Sommer commented on Tuesday, February 26, 2013 at 21:50 UTC

@mohammad: I'm not going to answer this kind of incomplete or poor questions anymore. Everybody seems to think that other programmers have an oracle or can read minds.

nostoppinnow.com commented on Thursday, February 28, 2013 at 07:35 UTC

Pretty! This has been an incredibly wonderful post.
Thank you for providing this info.

amith commented on Tuesday, March 05, 2013 at 09:02 UTC

Thanx for the nice post.
I checked out the code from git hub and try to import it on eclips , it open perfectly and i got the same application once I run it, but i could not able to do the edit part since those files are not available on eclips project. Could you please help me on this.

Andreas Sommer commented on Saturday, March 09, 2013 at 00:38 UTC

@amith: Which files are you trying to edit – are any missing from the GitHub repo?

horcle_buzzz commented on Thursday, April 25, 2013 at 02:01 UTC

Missing directions regarding activity, manifest files, etc. are here http://www.adobe.com/devnet/html5/articles/getting-started-with-phonegap-in-eclipse-for-android.html
The ones on the Cordova site do not mention the specifics.

bsegvic commented on Tuesday, May 28, 2013 at 12:03 UTC

I have problem with python -m SimpleHTTPServer 8000
I have call it and it takes on and on (I saw that you wrote it colud take a while, but 20+mins)..nothing happens, when I try to open it via http://192.168.1.2:8000 I only get the files in the directory like there in no index.html file. Can u help me please, what am I missing? Thanks!

Andreas Sommer commented on Saturday, June 01, 2013 at 09:54 UTC

You have to run the web server from the directory with app.js, index.html, etc. If it doesn't work for you, just use any other web server.

Prasanna commented on Wednesday, August 07, 2013 at 14:02 UTC

Great article. Thanks to you, i could wrap my sencha app as an apk!

Read more… (post is longer)

Blog

Connecting Intermittent Explosive Disorder with diet and histamine intolerance: no more uncontrollable anger and rage

Table of contents

My story

Development of the disease

Detecting IED in yourself

Medication and optionally therapy can be a good first step

Measuring symptoms to find the cause

Histamine intolerance (HIT) is the cause, omitting histamine-rich food items the solution

What comes after the anger is solved?

Related thoughts and reading

How I mostly fixed my migraine, weather and sports-induced headaches

Table of contents

Migraine started in childhood

Science principles ‒ why even write about this single-person case?

Failed attempts

The solution for my migraine: salt at the right amount

Safety of the salt intervention

Scientific theory of electrolyte imbalance behind migraine

The end of the fight after 37 years of pain

Related reading

Grafana dashboards — best practices and dashboards-as-code

Table of contents

Goals for all your dashboards

Spinning up a Grafana playground in a minute

Getting started with a playground dashboard

High-level dashboard creation guidelines

Choose main input for the high-level dashboard

Metric naming and cardinality

Stat instead of Graph for human understanding within milliseconds

Keep panels small on screen

Avoid dropdowns for variable values

Dashboards-as-code and GitOps (committed dashboards)

Dashboard generation from jsonnet code using the grafonnet library

Fast develop-deploy-view cycle

Automatic provisioning of generated dashboards into Grafana instance

jsonnet for dashboards — some tips

Distinguish environments

Consider jsonnet as full programming language

Grafana-specific dashboard tips

Categorize, do not make a dashboard mess

Grouping within a dashboard

Make dashboards editable

Tooltip sort order

Y axis display range

Link to detailed dashboards, logs, other observability tools

Use variables for repetitive values

Clearly differentiate environments

Grid positioning

Choose the right data unit

UTC timezone everywhere

Do not rely on default data source

Heatmaps

Display interesting events as annotations

Self-describing visualization titles

Observability tips not specific to Grafana

Stay consistent in naming metrics and labels

No need to create a metric for everything / how to easily get started monitoring an uninstrumented application

Show only offenders or top N problematic items

Mind test and synthetic traffic

Daytime vs. nighttime

Do not use rate(…​) alone

Prefer counters over gauges

Make observed components distinguishable

Summary

Out of scope

Related reading

Setting up buildbot in FreeBSD jails

Table of contents

Choosing host operating system and version for buildbot

Create a FreeBSD playground

Introduction to jails

Overview of buildbot

Set up jails

Install buildbot master

Run buildbot master

Install buildbot worker

Run buildbot worker

Set up web server nginx to access buildbot UI

Run your first build

Do not rely on `default` data source

Do not use `rate(…)` alone

C++ `typedef` is not strong typing

C++ strong typing with `enum`