For the Suricata 1.3beta1 release, one of our goals was to improve the scalability of the engine when running on many cores. As the graph below shows, we made a good deal of progress.
The blue line is the older 1.1 release, the yellow line is 1.3dev. It clearly shows that 1.1 peaked at 4 cores and then ran into serious contention issues. 1.3dev scales nicely beyond that, up to 24 cores in this test (four six-core AMD CPUs). Tilera recently demonstrated Suricata on their many-core systems, running a single Suricata process per CPU. Their CPUs have 36 real cores.
We had already manually identified some potential hotspots, but that wasn’t enough: we needed to be able to measure. So I added lock profiling code. This gave us the tools needed to really pinpoint contention points in the code. The hotspots were the flow engine, the thresholding engine and the tag engine. Not very surprising, as each of those represents a global data structure, used by all packet processing threads.
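To illustrate the idea, here is a minimal sketch of a profiled mutex (this is not Suricata's actual profiling code; the `ProfMutex` type and function names are made up for this example). It tries a non-blocking `trylock` first and counts how often it had to fall back to blocking, which is exactly the contention you want to measure:

```c
/* Minimal lock-profiling sketch (not Suricata's real implementation):
 * count how often a lock acquisition had to block. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t m;
    unsigned long acquired;   /* total successful lock operations */
    unsigned long contended;  /* how often the lock was already held */
} ProfMutex;

static void prof_mutex_init(ProfMutex *pm) {
    pthread_mutex_init(&pm->m, NULL);
    pm->acquired = 0;
    pm->contended = 0;
}

static void prof_mutex_lock(ProfMutex *pm) {
    int blocked = (pthread_mutex_trylock(&pm->m) != 0);
    if (blocked)
        pthread_mutex_lock(&pm->m);  /* fall back to a blocking lock */
    /* counters are updated while holding the lock, so no extra
     * synchronization is needed for them */
    pm->acquired++;
    pm->contended += (unsigned long)blocked;
}

static void prof_mutex_unlock(ProfMutex *pm) {
    pthread_mutex_unlock(&pm->m);
}
```

A lock whose contended/acquired ratio stays high under load is a candidate hotspot, which is how global structures like the flow table show up in such measurements.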
Several improvements were made to the flow engine. The main contention point was a queue that was really a series of lists, ordered by flow timeout. The idea behind it was that the “flow manager”, which takes care of timing out and cleaning up flows, could just look at those queues for the oldest flows to process.
The problem was that these queues had to be updated for every packet (sometimes even twice per packet). This queue is now gone. Instead, the flow manager walks the entire flow hash table, which removes the contention point: the flow hash has fine-grained locking, leading to much less contention.
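A sketch of that pattern, with assumed names and structures (Suricata's real flow table is considerably more involved): one lock per hash bucket, and a flow-manager pass that locks only the bucket it is currently scanning, so packet threads working on other buckets are never blocked.

```c
/* Fine-grained-locking sketch (hypothetical names, not Suricata's
 * real code): the flow manager walks the hash table bucket by
 * bucket, evicting flows not seen for `timeout` seconds. */
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

#define FLOW_HASH_SIZE 1024

typedef struct Flow {
    struct Flow *next;
    time_t lastseen;
} Flow;

typedef struct {
    pthread_mutex_t lock;  /* fine-grained: one lock per bucket */
    Flow *head;
} FlowBucket;

static FlowBucket flow_hash[FLOW_HASH_SIZE];

static void FlowHashInit(void) {
    for (int i = 0; i < FLOW_HASH_SIZE; i++) {
        pthread_mutex_init(&flow_hash[i].lock, NULL);
        flow_hash[i].head = NULL;
    }
}

static void FlowInsert(unsigned int hash, time_t lastseen) {
    FlowBucket *b = &flow_hash[hash % FLOW_HASH_SIZE];
    Flow *f = malloc(sizeof(*f));
    f->lastseen = lastseen;
    pthread_mutex_lock(&b->lock);
    f->next = b->head;
    b->head = f;
    pthread_mutex_unlock(&b->lock);
}

/* One flow-manager pass. Only one bucket is locked at a time, so
 * there is no single global lock for packet threads to fight over. */
static unsigned long FlowManagerScan(time_t now, time_t timeout) {
    unsigned long evicted = 0;
    for (int i = 0; i < FLOW_HASH_SIZE; i++) {
        FlowBucket *b = &flow_hash[i];
        pthread_mutex_lock(&b->lock);
        Flow **pf = &b->head;
        while (*pf != NULL) {
            Flow *f = *pf;
            if (now - f->lastseen > timeout) {
                *pf = f->next;  /* unlink and free the timed-out flow */
                free(f);
                evicted++;
            } else {
                pf = &f->next;
            }
        }
        pthread_mutex_unlock(&b->lock);
    }
    return evicted;
}
```

The trade-off is that the manager now touches every bucket instead of just the head of an ordered queue, but that periodic walk is cheap compared to updating a shared queue on every packet.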
When dealing with a hash table, distribution is very important, and a good hash algorithm takes care of that. One of the changes in 1.3dev is the replacement of our naive algorithm with the Jenkins hash. At the cost of a small computational overhead, this gives a much better hash distribution and thus less contention.
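For illustration, here is the compact “one-at-a-time” member of the Jenkins hash family (shown because it is short; I'm not claiming this exact variant is the one Suricata uses over its flow keys):

```c
/* Jenkins "one-at-a-time" hash: a small member of the Jenkins hash
 * family, shown here to illustrate the mixing idea. */
#include <stddef.h>
#include <stdint.h>

static uint32_t jenkins_oaat(const uint8_t *key, size_t len) {
    uint32_t hash = 0;
    for (size_t i = 0; i < len; i++) {
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    /* final avalanche: spread the influence of every input byte
     * across the whole 32-bit output */
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}
```

The bucket index is then `jenkins_oaat(key, len) % hash_size`; because of the avalanche mixing, near-identical flow keys (say, two IPs differing in one bit) land in unrelated buckets, which is exactly the even distribution that keeps per-bucket contention low.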
Finally, for the flow engine I’d like to mention once more the flow-based auto load balancing work I’ve written about before. It gives a more balanced distribution of flows between threads.
Thresholding and Tag engines
Both the thresholding and tag engines store information per host. Until 1.3, both used a separate hash table governed by a single lock. Lookups are frequent: once for each event in the case of thresholding, once per packet for tags.
To address this, a host table was introduced, modelled after the flow table, so it too uses fine-grained locking. Both thresholding and tagging now use that table.
For thresholding, one contention point remains unresolved: thresholding per signature id still uses a global table with a single lock.
Lots of improvements in this version. Still, scaling is not yet as good as we’d like: it takes too many cores to double performance. Our goal is to get as close to linear as possible. The work continues! :)
The graph was provided by Josh White and is part of his performance research for Clarkson University. Thanks Josh, looking forward to your final paper!