The Scene
The Project
Over the last two years, I worked on my first project that used the Zephyr OS, and I thought I would use the quieter time at the end of the year to share my experiences and thoughts on Zephyr with you. Hopefully, someone finds them interesting or gets some value out of this article.
Zephyr was used in a project that could broadly be described as a medical data collection device, connected to the cloud and using Bluetooth Low Energy.
Due to the scope of the project, a microcontroller from Nordic Semiconductor became the heart of the hardware. Nordic provides a range of BLE-enabled microcontrollers, so we could pick one of their flagship models and still be able to move to a smaller microcontroller later without having to redesign the hardware or software.
Me (in that Context)
Up to that point, I already had a lot of project experience with microcontrollers from Nordic. All of those projects were bare metal and used either Nordic’s SoftDevice (a kind of shared library that contains the BLE protocol implementation) or my own open source BLE protocol library (Bluetoe, a lightweight C++ GATT server framework).
Zephyr
Zephyr entered the scene when we were exploring different architectural choices. Zephyr looked very promising, with support for Bluetooth Low Energy, MQTT-SN, and a bootloader out of the box. Zephyr was even supported by the chip vendor, and it was open source.
As someone who had already had to cope with bugs in vendors’ hardware drivers, having access to the source code was a very big plus for me.
Zephyr was already some years old and supported by many hardware vendors. There was really no good reason to doubt Zephyr’s maturity, or not to at least give Zephyr a try at this early stage of the project.
Zephyr promises to be secure, suitable for resource-constrained systems, and portable.
The Curtain Rises
Two Zephyrs?
First, it took me quite some time to figure out that there are two different flavours of Zephyr: the original vanilla Zephyr and a branch created, maintained and renamed by Nordic (they named it nRF Connect SDK).
I can only speculate about why they decided to do so, but from a user’s perspective it doesn’t make things easier. Whenever I had to approach the community (be it to seek help or guidance, or to report or discuss a bug), I first had to investigate whether this was a Zephyr Zephyr issue or a Nordic Zephyr issue, and spend some time checking whether there were differences in the area of concern.
The experience described in this article is based on the Nordic version of Zephyr. I guess most of it applies to the vanilla Zephyr version as well, but to be clear: I haven’t checked.
West
When following the online documentation, you get in touch with West pretty quickly. Basically, West is something like git submodules: it describes dependencies as recursive lists of git repositories. Understanding how West works helped later in the project to inject patches without the need to clone the entire Zephyr project.
Initially, I had some problems making sense of the terms (project, workspace, module) used in the West documentation. Knowing how useful a fresh pair of eyes can be for reviewing documentation, I used this initial confusion to open my first issue in the Zephyr bug tracker. The issue was taken seriously, I felt that my report was welcome, and the resulting change to the documentation was clearly an improvement to me. Wonderful start!
Weird Use of CMake
I use CMake to implement the build process in nearly all of my projects, so I was pleased to learn that Zephyr also uses CMake. But there were a lot of surprises: knowing CMake was not really of any help; if anything, the opposite was true.
To add Zephyr-specific functions to your CMake build, you have to use the find_package() function to add the Zephyr package. So far, nothing new, until I had to figure out that this find_package() call has to be placed before the (usually initial) call to project(). Not calling both functions in that specific order yields hard-to-understand error messages.
What looks like a small detail is a real surprise to anyone with basic familiarity with CMake. To be fair, this requirement is documented.
The reason for this seems to be that Zephyr does not rely on CMake’s built-in compiler and compiler feature detection but implements that on its own (which also means that Zephyr does not take CMAKE_TOOLCHAIN_FILE into account). I have no clue why they do this. I suspect that they moved their build from Make to CMake without considering the additional features CMake has over Make.
In my projects, there is always some portion of the code that does not have any direct dependency on hardware. I try to move everything that involves some degree of complexity into that part of the project’s code. That allows me to test that code very conveniently on the host machine (that is, my development machine or the CI pipeline).
In addition, for more complex peripheral drivers, I have some special test programs that link against the peripheral drivers and that I can use to stress the driver on the target platform.
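To illustrate what I mean by code without hardware dependencies, here is a minimal sketch. It is not taken from the project; the sample type, the wire format and all names are made up for this example. The function only depends on the C standard library, so the very same file could be compiled for the target and for a test executable on the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical measurement type, invented for this example. */
struct sample {
    uint32_t timestamp_s; /* seconds since device start */
    int16_t  value;       /* raw sensor reading */
};

/* Size of one encoded sample: 4 bytes timestamp + 2 bytes value. */
enum { SAMPLE_WIRE_SIZE = 6 };

/*
 * Serialises one sample into `out` in little-endian byte order.
 * Returns the number of bytes written, or 0 if `out` is too small.
 * No hardware access, no RTOS calls: trivially testable on the host.
 */
size_t sample_encode(const struct sample *s, uint8_t *out, size_t out_size)
{
    assert(s != NULL && out != NULL);

    if (out_size < SAMPLE_WIRE_SIZE) {
        return 0;
    }

    const uint16_t raw = (uint16_t)s->value;

    out[0] = (uint8_t)(s->timestamp_s);
    out[1] = (uint8_t)(s->timestamp_s >> 8);
    out[2] = (uint8_t)(s->timestamp_s >> 16);
    out[3] = (uint8_t)(s->timestamp_s >> 24);
    out[4] = (uint8_t)(raw);
    out[5] = (uint8_t)(raw >> 8);

    return SAMPLE_WIRE_SIZE;
}
```

A host-side test then simply calls sample_encode() with a few samples and buffer sizes and compares the resulting bytes, without any target hardware involved.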
Now, the surprise: Zephyr does not really take into account that people might like to build the same code for target / Zephyr platforms and for their development machine. CMake does not impose any such limitation. Zephyr has one very limiting model: there is a single executable, named app. You can add sources and libraries to that executable target, but you cannot have a second Zephyr executable target.
Ironically, Nordic stumbled over the same limitation and, instead of lifting it (for example, by turning Zephyr into a configurable, linkable library), they added the ability to have multiple executables again on top of CMake (which has this feature already) and called it sysbuild.
The same applies to firmware versions: CMake implements that, and Zephyr ignores the feature and implements it on its own.
First Features
Now to the good parts: I was able to add Bluetooth Low Energy very quickly. It really worked out of the box as promised. Same for our first tests with MQTT-SN.
The BLE API felt well designed and was a pleasure to work with. You really can expect to have a GATT server up and running in very little time. The API documentation is excellent.
There were other APIs that were less well documented (L2CAP or Settings, for example), but fortunately we have the source code, and that helps to understand exactly what is required to get done what needs to get done.
Threads Everywhere
To my surprise, all asynchronous APIs were implemented by having library threads call your callbacks to indicate that a certain event happened. I haven’t found much about that in the API documentation, but maybe this is documented somewhere in the more general documentation of Zephyr.
The consequence should be that all data touched by such a callback (be it reading or writing) is synchronised with other threads by using a mutex. But I have not found a single example that does this. It might be that, technically, none of the examples shares data with any other thread, but in a real-world application it would be extremely rare that data handled by the BLE part of the application is not used by (for example) the main thread.
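The kind of synchronisation I would have expected to see in the examples can be sketched as follows. To keep the sketch runnable on a host machine, it uses POSIX threads as a stand-in for Zephyr’s library threads; in a Zephyr application, the same pattern would presumably be written with k_mutex_lock() and k_mutex_unlock(). The callback, the data structure and all other names are hypothetical and not part of any Zephyr API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Data shared between a library (callback) thread and the main thread. */
struct shared_reading {
    pthread_mutex_t lock;
    uint32_t        count; /* number of values received */
    int32_t         sum;   /* sum of all values received */
};

static struct shared_reading g_reading = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
};

/* Hypothetical callback, imagined to be invoked from a library thread. */
void on_value_received(int16_t value)
{
    pthread_mutex_lock(&g_reading.lock);
    g_reading.count += 1;
    g_reading.sum   += value;
    pthread_mutex_unlock(&g_reading.lock);
}

/* Called from the main thread: takes a consistent snapshot of both fields. */
void reading_snapshot(uint32_t *count, int32_t *sum)
{
    assert(count != NULL && sum != NULL);

    pthread_mutex_lock(&g_reading.lock);
    *count = g_reading.count;
    *sum   = g_reading.sum;
    pthread_mutex_unlock(&g_reading.lock);
}

/* Stand-in for a library thread: delivers 1000 values to the callback. */
static void *fake_library_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; ++i) {
        on_value_received(1);
    }
    return NULL;
}

/* Runs two fake library threads concurrently; returns the final count. */
uint32_t run_demo(void)
{
    pthread_t a;
    pthread_t b;
    uint32_t  count = 0;
    int32_t   sum = 0;

    if (pthread_create(&a, NULL, fake_library_thread, NULL) != 0) {
        return 0;
    }
    if (pthread_create(&b, NULL, fake_library_thread, NULL) != 0) {
        pthread_join(a, NULL);
        return 0;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    reading_snapshot(&count, &sum);
    return count;
}
```

Without the mutex, the two increments in on_value_received() could interleave with each other or with a reader, and the main thread could observe a count that does not match the sum.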
Zephyr spawns a lot of threads. The stack sizes are configurable and come with decent defaults, at least until you want to debug your application and disable optimization. As soon as you do that, you will see crashes all over the place and, if you are lucky, you will quickly figure out that these crashes are caused by stack overflows. Now you only have to figure out where the threads are spawned and what the name of the configuration parameter is that you have to raise to increase the stack size. Good luck!
Looking deeper into this, whether threads are enabled in the Zephyr kernel is a configuration option. Unfortunately, features like BLE require threads to be enabled.
The design decision to use multiple threads makes sure that, if the target hardware has multiple CPUs, these CPUs can be utilized by callback calls. But especially if the target has multiple CPUs and the kernel spreads the Zephyr threads over them, proper synchronisation is required. Having a lot of threads also means that a larger portion of the available RAM has to be spent on the threads’ stacks.
Configuration
The good part is that Zephyr has configurations for everything. The bad part: Zephyr has configurations for everything. After hitting configuration issues multiple times, I asked the rather broad question of how to cope with such problems in general in the Nordic support forums.
The answer was more or less: “Try until it works or look into my colleague’s unofficial samples”.
Very often, I ended up with incompatible configurations, where I had to figure out which combination of configuration options is supported and how certain options would affect the behaviour of the software. Solving these issues took a lot of time. (I saw these issues mostly around the sysbuild configuration, with MCUBoot and Bluetooth.)
Besides the configuration system that configures the software, there is a second major configuration system that describes the hardware configuration of the application, named DeviceTree (DT for short). Using DeviceTree to describe the hardware allows building the very same software for different hardware.
Having so many configuration options and systems is probably one of the reasons why Zephyr has the reputation of having a very steep learning curve.
Sometimes, I stumbled over configuration issues where unsupported features are only revealed at runtime. For example, a driver is configured in a way that is not supported by that driver. Instead of letting the build fail, the driver builds and returns a “Feature not supported” error code at runtime. In my opinion that’s the wrong choice, and I really can’t think of a situation where you want to link against a driver that you cannot use at runtime. Needless to say, this design choice cost me more time than it would have if the driver had failed to build.
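The difference between the two approaches can be sketched in a few lines of plain C. Everything here is made up for illustration: the two macros stand in for configuration options, and the "driver" is not a real Zephyr driver. Variant A reports the unsupported configuration only when the code runs; variant B rejects the same configuration at build time with a C11 _Static_assert (Zephyr offers a BUILD_ASSERT() macro for the same purpose).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical configuration switches, standing in for Kconfig options. */
#define APP_UART_USE_DMA    0 /* what the application asks for */
#define DRIVER_UART_HAS_DMA 0 /* what the (made-up) driver can do */

/*
 * Variant A: the unsupported combination is only detected at runtime.
 * The build succeeds, and the caller has to notice the error code.
 */
int uart_start_runtime_check(int use_dma)
{
    if (use_dma && !DRIVER_UART_HAS_DMA) {
        return -ENOTSUP;
    }
    return 0;
}

/*
 * Variant B: the same condition, checked by the compiler. Setting
 * APP_UART_USE_DMA to 1 without driver support stops the build here.
 */
_Static_assert(!APP_UART_USE_DMA || DRIVER_UART_HAS_DMA,
               "UART DMA requested, but this driver does not support it");

/* With variant B in place, no runtime check for DMA support is needed. */
int uart_start(void)
{
    return 0;
}
```

With variant B, changing APP_UART_USE_DMA to 1 turns the misconfiguration into a compiler error with a readable message, instead of an error code that shows up somewhere during testing.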
Bugs
It probably comes as no surprise that a project as large as Zephyr contains bugs (on GitHub there seems to be a constant number of roughly 2,000 open issues). And again, having the source code helps in finding the cause once you experience any “weird” behaviour.
Seeing “weird” behaviour does not necessarily mean that one has stumbled over a bug. It could mean that an API was not used correctly, and even if it is a bug, that does not mean the bug is in Zephyr. But having the source code makes it so much easier to find the root cause of whatever shows up in Zephyr and needs to be investigated.
All in all, I think the quality of the source code of Zephyr and the maturity of Zephyr are above industry average, although I was a little bit disappointed by the rather low test coverage.
Conclusion
Given the same task, would we pick Zephyr again? I would say yes! Not because Zephyr is brilliant and a pleasure to work with, but because the alternatives are not better. Zephyr is OK, but nowhere near what I would dream of if I were to dream of a hardware abstraction.
Using C instead of C++ or Rust might have been a good / valid choice when the project started but feels awkward from today’s perspective. The “creative” use of CMake makes it really hard to integrate a Zephyr project into a larger context.
Zephyr is definitely a good choice when there is a requirement to port the code to different hardware. If the BOM (bill of materials) is of very great concern, Zephyr (and probably most other RTOSes) is for sure not the most efficient way to program a microcontroller.
If speed of development is important, Zephyr is probably a good choice too, given that the initial learning has been done already.
Have I mentioned already that Zephyr is open source? If you have ever spent days hunting a bug in an SDK provided by a hardware vendor without source code, and hours begging that vendor for support, you feel invincible having access to the sources. For example, we had to find and fix a bug in the L2CAP (part of Bluetooth) layer of Zephyr, and I’m pretty sure that bug would have taken us much longer to fix if we hadn’t had access to the sources.
