Speeding up Suricata with tcmalloc

‘tcmalloc’ is a library Google created as part of the google-perftools suite for speeding up memory handling in threaded programs. It’s very simple to use and works fine with Suricata. Don’t expect magic from it, but it should give you a few percent more speed.

On Ubuntu, install the libtcmalloc-minimal0 package:

apt-get install libtcmalloc-minimal0

Then run Suricata as follows (on a single line):

LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.0" ./src/suricata -c suricata.yaml -i eth0

That is all there is to it. 🙂

Improving Suricata performance with bitmask based signature prefiltering

Over the last few weeks I’ve spent quite a bit of time improving Suricata’s performance, and I’m making good progress. I did a lot of optimizations all over the code, but the most significant is a new way of prefiltering signatures for inspection. I’ll briefly explain the concept here.

But first a quick explanation of how Suricata selects signatures for inspection. When Suricata starts, it organizes signatures into groups, called SigGroupHead in the code. To reduce the number of signatures that need inspection for each packet, the grouping is done on quite a few properties: flow direction, protocol, src ip, dst ip, src port, dst port. Even though this grouping is quite aggressive, a single SigGroupHead can still contain many thousands of signatures. For example Emerging Threats web-client sigs will almost all end up in the same SigGroupHead.

To reduce the overhead of checking all of these signatures, a more efficient prefiltering mechanism was added.

The bitmask prefilter

The basic concept is simple. Each signature creates a bitmask at engine initialization time, setting a bit for each “feature” it requires to match. Examples of such features are: needs payload, needs flowbit set, needs flow, needs http state.

Then at runtime, we create a mask for each packet. There we set flags for when the packet has a payload, has a flow associated with it, the flow has flowbits, etc. This operation is quite cheap as it needs to be done for each packet only once and requires only relatively simple checks.

As the final step, we compare the mask of each signature in a SigGroupHead against the mask of the packet. A signature is skipped unless the packet has every feature the signature requires:

if ((packetmask & sigmask) != sigmask)
    skip_this_signature();
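
To make the three steps above concrete, here is a minimal sketch in C. It is not the actual Suricata code: the flag names, the PacketInfo struct and the function names are all made up for illustration, and only four of the features mentioned above are modelled.

```c
#include <stdint.h>

/* Hypothetical feature flags; the names are illustrative, not Suricata's. */
#define MASK_PAYLOAD   (1u << 0)  /* needs / has payload */
#define MASK_FLOW      (1u << 1)  /* needs / has flow */
#define MASK_FLOWBITS  (1u << 2)  /* needs flowbit / flow has flowbits */
#define MASK_HTTP      (1u << 3)  /* needs / has http state */

/* Hypothetical stand-in for the per-packet facts the engine knows. */
typedef struct {
    int has_payload;
    int has_flow;
    int flow_has_flowbits;
    int has_http_state;
} PacketInfo;

/* Build the packet mask; this is done once per packet. */
static uint8_t packet_mask(const PacketInfo *p)
{
    uint8_t m = 0;
    if (p->has_payload)       m |= MASK_PAYLOAD;
    if (p->has_flow)          m |= MASK_FLOW;
    if (p->flow_has_flowbits) m |= MASK_FLOWBITS;
    if (p->has_http_state)    m |= MASK_HTTP;
    return m;
}

/* Returns 1 if the packet offers every feature the signature requires,
 * 0 if the signature can be skipped for this packet. */
static int sig_may_match(uint8_t pktmask, uint8_t sigmask)
{
    return (pktmask & sigmask) == sigmask;
}
```

In this sketch, a packet with a payload and a flow but no flowbits gets the mask MASK_PAYLOAD | MASK_FLOW. A signature that only needs a payload passes the check, a signature that also needs a flowbit is skipped, and a signature with mask 0 (no requirements) always passes.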

With this filter in place, using flowbits becomes much more attractive. Most flows don’t have flowbits set, so the filter excludes all signatures that require a flowbit from being checked almost all of the time.

In the current git master (soon to become 1.0.3) this mask is only 8 bits wide, of which only 5 are used. I’m experimenting with more fine-grained bitmasks.

SigGroupHead based masks

One idea I’m currently exploring is whether there is any use in additionally creating a single mask for a whole SigGroupHead. The idea is that if many signatures in a group are alike, the SigGroupHead will have a strong mask and we can quite often bypass all signature checking for a packet. This would bypass pattern matching as well.
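
A rough sketch of what such a group mask could look like, assuming it is simply the bitwise AND of the masks of all signatures in the group, i.e. the features that every signature requires. The function names are hypothetical and this is not taken from the Suricata source.

```c
#include <stddef.h>
#include <stdint.h>

/* Group mask = the bits required by every signature in the group.
 * Assumption: computed as the bitwise AND of all member masks. */
static uint8_t group_mask(const uint8_t *sigmasks, size_t n)
{
    uint8_t m = 0xFF;
    if (n == 0)
        return 0;  /* empty group: no common requirement */
    for (size_t i = 0; i < n; i++)
        m &= sigmasks[i];
    return m;
}

/* Returns 0 if the whole group can be skipped for this packet,
 * 1 if the individual signatures still have to be checked. */
static int group_may_match(uint8_t pktmask, uint8_t grpmask)
{
    return (pktmask & grpmask) == grpmask;
}
```

With this definition, a group whose signatures all require a payload and a flow gets a mask with both bits set, and a packet without a flow skips the whole group in one comparison. A single signature with mask 0 reduces the group mask to 0, after which no packet can ever be skipped at the group level.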

Preliminary results show that the idea works, but only for small and homogeneous rulesets. For a 38M pkt pcap, with just emerging-web.rules I see about 40% of the packets bypassing all signature checks. For emerging-all.rules it’s less than 1%, and for a larger ruleset (14k sigs) it’s 0%. So it may not be a viable optimization.

More conditions

I’m also experimenting with increasing the number of conditions. So far, I’ve defined about 20. This way all TCP signatures at least have some form of condition set. A single signature with mask 0 (no conditions set) kills the SigGroupHead based filtering, as its mask is determined by the lowest common denominator. So far I’m not seeing much, if any, gain from using more conditions.

Maybe increasing the mask size to 32 bits undoes the performance gains, or the added complexity of creating the mask for each packet at runtime is too expensive.

SIMD checks

One other thing I’m planning to explore is whether SIMD can help speed up these bit checks. The SSE extensions should be able to do multiple checks at the same time. Here the mask size becomes important as well. As SSE works on 16 bytes at a time, with an 8-bit mask I could check 16 sigs at once, but with a 32-bit mask only 4. I’m not sure it’ll be worth it though. CPUs are quite good at doing bitwise operations, so SIMD instructions might not be faster at all.
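
For illustration, this is roughly what the 8-bit case could look like with SSE2 intrinsics. It is a sketch only, assuming the signature masks of a group are stored contiguously as bytes. The function name prefilter16 is made up, and nothing here says whether it would actually beat the scalar loop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, x86 only */
#include <stdint.h>

/*
 * Check 16 8-bit signature masks against one packet mask in one go.
 * Returns a 16-bit value: bit i is set if signature i passes the
 * prefilter, i.e. (pktmask & sigmasks[i]) == sigmasks[i].
 */
static unsigned prefilter16(uint8_t pktmask, const uint8_t sigmasks[16])
{
    /* Load 16 signature masks (unaligned load is fine here). */
    __m128i sigs = _mm_loadu_si128((const __m128i *)sigmasks);
    /* Broadcast the packet mask into all 16 byte lanes. */
    __m128i pkt  = _mm_set1_epi8((char)pktmask);
    /* Per lane: pktmask & sigmask. */
    __m128i both = _mm_and_si128(pkt, sigs);
    /* Per lane: 0xFF if the AND equals the sigmask, else 0x00. */
    __m128i eq   = _mm_cmpeq_epi8(both, sigs);
    /* Collect the top bit of each lane into a 16-bit result. */
    return (unsigned)_mm_movemask_epi8(eq);
}
```

The caller would then walk the set bits of the result and only inspect those signatures. With 32-bit masks the same approach would use the 32-bit lane variants of these intrinsics and handle 4 signatures per register.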

The initial version of the bitmask based prefilter code is available now in the current git master. If you’re interested, please give it a try and let me know how it works for you!