Profiling Suricata with JEMALLOC

JEMALLOC is a memory allocation library: http://www.canonware.com/jemalloc/

It offers many interesting things for a tool like Suricata. Ken Steele of EZchip (formerly Tilera) made me aware of it. In Ken’s testing it helps performance.

Install

wget http://www.canonware.com/download/jemalloc/jemalloc-3.6.0.tar.bz2
tar xvfj jemalloc-3.6.0.tar.bz2
cd jemalloc-3.6.0
./configure --prefix=/opt/jemalloc/
make
sudo make install

Then use it by preloading it:

LD_PRELOAD=/opt/jemalloc/lib/libjemalloc.so ./src/suricata -c suricata.yaml -l tmp/ -r ~/sync/pcap/sandnet.pcap -S emerging-all.rules -v

I haven’t benchmarked this, but if you’re running a high performance setup it may well be worth a shot.
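
If you want to double check that the preload actually took effect, you can grep the memory maps of the running process for the library (replace <suricata pid> with the actual pid):

$ grep jemalloc /proc/<suricata pid>/maps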

Profiling

The library comes with many interesting profiling and debugging features.

make clean
./configure --prefix=/opt/jemalloc-prof/ --enable-prof
make
sudo make install

Start Suricata like this:

LD_PRELOAD=/opt/jemalloc-prof/lib/libjemalloc.so ./src/suricata -c suricata.yaml -l tmp/ -r ~/sync/pcap/sandnet.pcap -S emerging-all.rules -v

Running it like this we don’t see any change yet, as we still need to tell jemalloc what we want it to do.

Exit stats

This dumps a lot of stats to the screen when Suricata exits.

MALLOC_CONF=stats_print:true LD_PRELOAD=/opt/jemalloc-prof/lib/libjemalloc.so ./src/suricata -c suricata.yaml -l tmp/ -r ~/sync/pcap/sandnet.pcap -S emerging-all.rules -v
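
As a side note, jemalloc also exposes the same statistics through its API, so a program linked against it can dump them itself. Below is a minimal sketch of my own, not something Suricata does; the include and library paths assume the install prefix used above:

/* stats_demo.c: build e.g. with
 *   gcc stats_demo.c -I/opt/jemalloc/include -L/opt/jemalloc/lib -ljemalloc
 */
#include <stdlib.h>
#include <jemalloc/jemalloc.h>

int main(void)
{
    void *p = malloc(1024 * 1024);         /* some allocation activity */
    malloc_stats_print(NULL, NULL, NULL);  /* print jemalloc stats, by default to stderr */
    free(p);
    return 0;
}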

Memory leak checks

MALLOC_CONF=prof_leak:true,lg_prof_sample:0 LD_PRELOAD=/opt/jemalloc-prof/lib/libjemalloc.so ./src/suricata -c suricata.yaml -l tmp/ -r ~/sync/pcap/sandnet.pcap -S emerging-all.rules -v
[... suricata output ...]
<jemalloc>: Leak summary: 2011400 bytes, 4523 objects, 645 contexts
<jemalloc>: Run pprof on "jeprof.22760.0.f.heap" for leak detail

Then use the pprof tool that comes with jemalloc to inspect the dumped stats.

$ /opt/jemalloc-prof/bin/pprof --show_bytes ./src/suricata jeprof.22760.0.f.heap
Using local file ./src/suricata.
Using local file jeprof.22760.0.f.heap.
Welcome to pprof!  For help, type 'help'.
(pprof) top
Total: 2011400 B
1050112  52.2%  52.2%  1050112  52.2% PacketGetFromAlloc
600064  29.8%  82.0%   600064  29.8% SCProfilePacketStart
103936   5.2%  87.2%   103936   5.2% SCACCreateDeltaTable
65536   3.3%  90.5%    66192   3.3% pcap_fopen_offline
35520   1.8%  92.2%    35520   1.8% ConfNodeNew
26688   1.3%  93.6%    26688   1.3% __GI___strdup
20480   1.0%  94.6%    20480   1.0% MemBufferCreateNew
20480   1.0%  95.6%    20480   1.0% _TmSlotSetFuncAppend
14304   0.7%  96.3%    14304   0.7% pcre_compile2
14064   0.7%  97.0%    25736   1.3% SCPerfRegisterQualifiedCounter

So it seems we don’t properly clean up our packet pools yet.

Create a PDF of this info:

$ /opt/jemalloc-prof/bin/pprof --show_bytes --pdf ./src/suricata jeprof.22760.0.f.heap > jemalloc.pdf

Dumping stats during runtime

Dump stats after every 16MiB of allocations (lg_prof_interval:24 means every 2^24 bytes, so 16MiB).

I’ve done this in a separate directory since it dumps many files.

$ mkdir jemalloc-profile
$ cd jemalloc-profile/
$ MALLOC_CONF="prof:true,prof_prefix:victor.out,lg_prof_interval:24" LD_PRELOAD=/opt/jemalloc-prof/lib/libjemalloc.so ../src/suricata -c ../suricata.yaml -l ../tmp/ -r ~/sync/pcap/sandnet.pcap -S ../emerging-all.rules -v

You should then see new *.heap files appear, many of them during startup. After some time the rate should slow down.

You can then visualize the diff between two dumps:

$ /opt/jemalloc-prof/bin/pprof --show_bytes --pdf ../src/suricata --base victor.out.24159.150.i150.heap victor.out.24159.200.i200.heap > jemalloc.pdf

This creates a PDF of the 200th dump taking the 150th dump as a baseline. As we dump every ~16MiB, this covers about 50 * 16 = 800MiB worth of allocations.
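
If you just want a quick look instead of a PDF, pprof can print the same diff as text:

$ /opt/jemalloc-prof/bin/pprof --show_bytes --text ../src/suricata --base victor.out.24159.150.i150.heap victor.out.24159.200.i200.heap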

Further reading

http://www.canonware.com/jemalloc/
https://github.com/jemalloc/jemalloc/wiki
https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling

Many thanks to Ken Steele for pointing me to the lib and providing me with some good examples.

Crossing the Streams in Suricata

At its core, Suricata is a packet processor. It reads packets and pushes them through a configurable pipeline. The second most important processing unit in Suricata is the flow. In Suricata we use the term flow for the bidirectional flow of packets with the same 5-tuple (proto, src ip, dst ip, sp, dp; VLANs can be added as well). In fact, much of Suricata’s threading effort revolves around the flow. In the two main runmodes, autofp and workers, flow based load balancing makes sure that all packets of a single flow always go through the same threading pipeline. In workers this means a single thread, in autofp two: the capture thread and a stream/detect/output thread.
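
To illustrate the idea, here is a simplified sketch (in no way Suricata’s actual code) of what flow based load balancing boils down to: a hash over the 5-tuple that is identical for both directions, used to pick the thread/pipeline:

#include <stdint.h>

/* Hypothetical flow key, IPv4 only to keep the sketch short. */
typedef struct FlowKey_ {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
    uint16_t vlan_id;   /* optional vlan id */
} FlowKey;

/* Direction-insensitive hash: A->B and B->A give the same value,
 * so both sides of a flow always land on the same thread. */
static uint32_t FlowKeyPickThread(const FlowKey *k, uint32_t nthreads)
{
    uint32_t h = k->src_ip ^ k->dst_ip;
    h ^= (uint32_t)(k->src_port ^ k->dst_port) << 16;
    h ^= k->proto;
    h ^= k->vlan_id;
    return h % nthreads;
}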

Flows are the central unit for our ‘app layer’ parsing. Protocol parsers like HTTP don’t even have access to the original packet. It all runs on top of the stream engine, which tracks TCP flows in … our flow structure.

Another place where the flow is crucial is in many of the rules. Rules extensively use the concept of ‘flowbits’. This allows one rule to ‘flag’ a flow, and then another to check this flag. In the Emerging Threats ruleset many hundreds of rules use this logic.
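
For example, a pair of rules using flowbits could look like this (the sids and content are made up for illustration):

alert tcp any any -> any any (msg:"stage 1 seen"; content:"STAGE1"; \
      flowbits:set,stage1; flowbits:noalert; sid:1000001; rev:1;)
alert tcp any any -> any any (msg:"stage 2 after stage 1"; content:"STAGE2"; \
      flowbits:isset,stage1; sid:1000002; rev:1;)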

Ever since we started Suricata, we’ve been talking about what some called ‘global flowbits’. A bit of a strange and contradictory name, but in essence rule writers wanted the flowbits logic applied to other units as well. So a few weeks ago I (finally) decided to check if I could quickly implement ‘hostbits’. As Suricata already has a scalable ‘host table’, it was easy to add the storage of ‘bits’ there. In a few hours I had the basics working and made it public: see this pull request.

Although I got some nice feedback, I was mostly interested in what the ET folks would think, since they would be the main consumers. While presenting the work I also mentioned the xbits ideas by Michael Rash and the response was “wow, do we have ip_pair tracking now?”. Ehh, no, just ip/host based… “Ah well, I guess that is nice too”. Not exactly the response I hoped for 🙂

IP pair tracking is not something Suricata did yet. But as the need was clear I decided to have a look at it, and it turned out to be quite simple to do. The IPPair tracker works much like the Host tracking: it’s only done on demand, which sets it apart from the Flow tracking, which is done unconditionally. Currently only the new xbits keyword makes use of the IPPair storage.

So, what I have implemented is pretty much ‘xbits’. It supports tracking by ‘ip_src’, ‘ip_dst’ and ‘ip_pair’. It uses the syntax as suggested by Michael Rash:

xbits:<set|unset|isset|isnotset|toggle>,<bitname>,\
      track <ip_src|ip_dst|ip_pair>,expire <seconds>
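
To make that concrete, a hypothetical pair of rules using the new keyword could look like this (names, ports and sids are made up):

alert tcp any any -> any 22 (msg:"SSH attempt from host"; flags:S; \
      xbits:set,seen_ssh,track ip_src,expire 300; sid:1000003; rev:1;)
alert tcp any any -> any 80 (msg:"HTTP from host that also tried SSH"; flags:S; \
      xbits:isset,seen_ssh,track ip_src; sid:1000004; rev:1;)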

It’s only lightly tested, so I would appreciate testing feedback!

You’ll find the code in PR 1275 on github. This should normally end up in Suricata 2.1, which will come out early next year.