Profiling Suricata with JEMALLOC

JEMALLOC is a memory allocation library: http://www.canonware.com/jemalloc/

It offers many interesting things for a tool like Suricata. Ken Steele of EZchip (formerly Tilera) made me aware of it. In Ken’s testing it helps performance.

Install

wget http://www.canonware.com/download/jemalloc/jemalloc-3.6.0.tar.bz2
tar xvfj jemalloc-3.6.0.tar.bz2
cd jemalloc-3.6.0
./configure --prefix=/opt/jemalloc/
make
sudo make install

Then use it by preloading it:

LD_PRELOAD=/opt/jemalloc/lib/libjemalloc.so ./src/suricata -c suricata.yaml -l tmp/ -r ~/sync/pcap/sandnet.pcap -S emerging-all.rules -v

I haven’t benchmarked this, but if you’re running a high performance setup it may certainly be worth a shot.

Profiling

The library comes with many interesting profiling and debugging features.

make clean
./configure --prefix=/opt/jemalloc-prof/ --enable-prof
make
sudo make install

Start Suricata like this:

LD_PRELOAD=/opt/jemalloc-prof/lib/libjemalloc.so ./src/suricata -c suricata.yaml -l tmp/ -r ~/sync/pcap/sandnet.pcap -S emerging-all.rules -v

Now we don’t see any change as we need to tell jemalloc what we want.

Exit stats

Dumps a lot of stats to the screen.

MALLOC_CONF=stats_print:true LD_PRELOAD=/opt/jemalloc-prof/lib/libjemalloc.so ./src/suricata -c suricata.yaml -l tmp/ -r ~/sync/pcap/sandnet.pcap -S emerging-all.rules -v

Memory leak checks

MALLOC_CONF=prof_leak:true,lg_prof_sample:0 LD_PRELOAD=/opt/jemalloc-prof/lib/libjemalloc.so ./src/suricata -c suricata.yaml -l tmp/ -r ~/sync/pcap/sandnet.pcap -S emerging-all.rules -v
[... suricata output ...]
<jemalloc>: Leak summary: 2011400 bytes, 4523 objects, 645 contexts
<jemalloc>: Run pprof on "jeprof.22760.0.f.heap" for leak detail

Then use the pprof tool that comes with jemalloc to inspect the dumped stats.

$ /opt/jemalloc-prof/bin/pprof --show_bytes ./src/suricata jeprof.22760.0.f.heap
Using local file ./src/suricata.
Using local file jeprof.22760.0.f.heap.
Welcome to pprof!  For help, type 'help'.
(pprof) top
Total: 2011400 B
1050112  52.2%  52.2%  1050112  52.2% PacketGetFromAlloc
600064  29.8%  82.0%   600064  29.8% SCProfilePacketStart
103936   5.2%  87.2%   103936   5.2% SCACCreateDeltaTable
65536   3.3%  90.5%    66192   3.3% pcap_fopen_offline
35520   1.8%  92.2%    35520   1.8% ConfNodeNew
26688   1.3%  93.6%    26688   1.3% __GI___strdup
20480   1.0%  94.6%    20480   1.0% MemBufferCreateNew
20480   1.0%  95.6%    20480   1.0% _TmSlotSetFuncAppend
14304   0.7%  96.3%    14304   0.7% pcre_compile2
14064   0.7%  97.0%    25736   1.3% SCPerfRegisterQualifiedCounter

So it seems we don’t properly clean up our packet pools yet.

Create a PDF of this info:

$ /opt/jemalloc-prof/bin/pprof --show_bytes --pdf ./src/suricata jeprof.22760.0.f.heap > jemalloc.pdf

Dumping stats during runtime

Dump stats after every 16MiB of allocations (lg_prof_interval:24, means every 2^24 bytes, so 16MiB)

I’ve done this in a separate directory since it dumps many files.

$ mkdir jemalloc-profile
$ cd jemalloc-profile/
$ MALLOC_CONF="prof:true,prof_prefix:victor.out,lg_prof_interval:24" LD_PRELOAD=/opt/jemalloc-prof/lib/libjemalloc.so ../src/suricata -c ../suricata.yaml -l ../tmp/ -r ~/sync/pcap/sandnet.pcap -S ../emerging-all.rules -v

Then you should see new *.heap files appear, many during startup. But after some time it should slow down.

You can then visualize the diff between two dumps:

$ /opt/jemalloc-prof/bin/pprof --show_bytes --pdf ../src/suricata --base victor.out.24159.150.i150.heap victor.out.24159.200.i200.heap > jemalloc.pdf

This creates a PDF of the 200th dump taking the 150th dump as a baseline. As we dump every ~16MiB, this covers about 50 * 16 = 800MiB worth of allocations.

Further reading

http://www.canonware.com/jemalloc/
https://github.com/jemalloc/jemalloc/wiki
https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling

Many thanks to Ken Steele for pointing me to the lib and providing me with some good examples.

Suricata 1.4 development update

Today, a day after 1.3.2, we’ve released 1.4beta2. While 1.3.2 is an important update for those running 1.3.1 or lower, today’s release is where things get exciting. A lot of things were improved and added. Let me show some numbers first.

The 1.4beta2 release is a pretty big update over 1.4beta1 as it touches over 5k lines of code:

234 files changed, 5033 insertions(+), 3759 deletions(-)

Compared to 1.4beta2 vs yesterday’s 1.3.2 it’s clear over 11k lines of code are touched:

262 files changed, 11406 insertions(+), 5794 deletions(-)

Personally, I’ve been working on two main area’s: defrag engine and the luajit integration, and a couple of other things.

Defrag

The defrag engine was the last major subsystem that still used a Big Lock. Defrag uses so called “trackers” to track fragments belonging to a single IP packet. These trackers are stored in a hash table. 1.3 and prior used a hash that had no locking, so it relied on a Big Lock to protect it’s operations. Suricata has had fine grained hashes for flow and host tables for some time already, so it made sense to port defrag over as well.

Luajit

I’ve written about the luajit a couple of times already. While the basic functionality debuted in beta1, the code has been completely overhauled. The most important change that is user visible is the integration with the various HTTP inspection engines. This did result in a limitation though, for now you can just inspect one HTTP buffer per script.

A weird challenge with luajit is that it’s “state” needs to be in the 32 bit part of memory. The reason isn’t clear to me, but this gave us some trouble. Some users use many rules and agressive pattern matcher settings. When after this memory usage the luajit states had to be alloc’d, it failed. I’ve worked around this by allocating a bunch of states in advance, hoping they’ll end up in the proper memory. We’ll see how that will work.

Misc

I’ve also largely rewritten the optional rule profiling to perform better. Here too, a Big Lock was removed. The accounting is now first done on a per thread basis, and only merged at detection engine shut down. Another nice feature is that it will now print the profiling stats during a live rule reload as well.

Next, I’ve improved performance of the decode, stream and app layer event keywords. They were quite expensive as they were checked quite often. I’ve now added a prefilter check to the detection engine’s prefilter stage. Helps quite a bit!

Finally, I’ve been working on getting global and rule threshold play well together. This work isn’t done yet, but some real progress has been made. Work is tracker here and documentation lives here.

So all in all quite a bit of changes. Please help us test this so we can move to a stable and high performing 1.4! 🙂

Suricata luajit update

After an exciting week of meeting and working with the team around the RAID conference, time for another lua update.

The keyword supports an interesting set of buffers now:

packet
payload

http.uri
http.uri.raw
http.request_line
http.request_headers
http.request_headers.raw
http.request_cookie
http.request_user_agent
http.request_body

http.response_headers
http.response_headers.raw
http.response_body
http.response_cookie

The http keywords are now integrated into their respective inspection engines. This led to one important limitation for now: you can only inspect one such buffer per script.

We pass the inspection offset to the script as well for these. In the lua script you can access it as follows:

function match(args)
    a = tostring(args["http.request_headers.raw"])
    o = args["offset"]

    s = a:sub(o)
    print (s)

    return 0
end

In a buffer “Mozilla/5.0” and a signature “content:Mozilla;”, “s” in the script will contain “/5.0”. At this moment there is no way yet to pass back an offset from the script to the inspection engine.

On the performance side things are looking good as well. At RAID Will Metcalf converted a set of 6 ETpro sigs to a single lua script. It resulted in better detection accuracy and better performance. That work is still private, but we’ll get some real world scripts public soon! 🙂

Update 10/4: this code is now available for testing in the new Suricata 1.4beta2 release!

First impressions of lua(jit) performance in Suricata

Today I decided to look into the potential performance of the luajit keyword a bit. It’s important to know if this can perform at reasonable speeds so that we can actually use it in real deployments. Even if we can’t the feature may still be appealing though, for offline pcap analysis.

So far, the results are rather encouraging.

First, I added 2 buffers today: http.uri, which contains the normalized uri (same buffer as the http_uri content modifier inspects) and http.request_line, which is the request line given to us by libhtp. This contains method, separators, uri, HTTP version.

Next I created 5 rules. A pure Lua rule (1), a Lua rule with content prefilter (2), a Lua rule with content and pcre prefilter (3), a pcre rule with content prefilter (4) and a pure pcre rule (5).

alert http any any -> any any (msg:"LUAJIT HTTP POST test, pure lua"; luajit:test2.lua; sid:1;)
alert http any any -> any any (msg:"LUAJIT HTTP POST test, content prefilter"; content:"POST"; http_method; content:".php"; http_uri; luajit:test2.lua; sid:2;)
alert http any any -> any any (msg:"LUAJIT HTTP POST test, pcre prefilter"; content:"POST"; http_method; content:".php"; http_uri; pcre:"/^POST\s+\/.*\.php\s+HTTP\/1\.0\r\n/m"; luajit:test2.lua; sid:3;)
alert http any any -> any any (msg:"LUAJIT HTTP POST test, pcre no lua"; content:"POST"; http_method; content:".php"; http_uri; pcre:"/^POST\s+\/.*\.php\s+HTTP\/1\.0\r\n/m"; sid:4;)
alert http any any -> any any (msg:"LUAJIT HTTP POST test, pure pcre"; pcre:"/^POST\s+\/.*\.php\s+HTTP\/1\.0\r\n/m"; sid:5;)

and the following Lua script:

function init (args)
    local needs = {}
    needs["http.request_line"] = tostring(true)
    return needs
end

-- match if packet and payload both contain HTTP
function match(args)
    a = tostring(args["http.request_line"])
    if #a > 0 then
        if a:find("^POST%s+/.*%.php%s+HTTP/1.0$") then
            return 1
        end
    end
  
    return 0
end

return 0

The script does a pattern match (regex even) against the request line, something I’d consider quite expensive.

So, how does this perform? Here are the rule perf stats:

   Num      Rule        Avg Ticks   Avg Match   Avg No Match
  -------- ------------ ----------- ----------- -------------- 
  1        5            12113.53    7198.08     12114.28   
  2        3            11638.15    39842.23    9424.83    
  3        2            10682.71    35497.08    10194.56   
  4        1            8812.31     15841.85    8807.01    
  5        4            8536.46     20074.97    7630.97 

Pure pcre rules are bad, we all know that, but they end up being most expensive in this test which surprises me. The pure Lua rule is quite a bit cheaper and even ends up below the prefilted Lua rules. Only the content+pcre (no Lua) rule is slightly faster.

So far things look rather good for the lua keyword. Who knows, maybe it can even be used on live traffic.

The work continues! 🙂

Suricata 1.3.1 is out

Since this morning Suricata 1.3.1 is available. The main focus of this release was fixing a number of bugs. See the list of closed bugs, the release notes and the upgrade instructions.

As a bonus, I applied a set of patches by Eric Leblond. Eric has been trying to push AF_PACKET to the limit and has achieved some spectacular results with it. Read all about his quest to get to 10Gbps here on Eric’s blog.

As a final note, the Suricata git repository is now mirrored at Github. Forking, submitting pull requests and downloading the latest source is a lot simpler now. Check the official repo at github.

Suricata http_user_agent vs http_header

One of the new features in Suricata 1.3 is a new content modifier called http_user_agent. This allows rule writers to match on the User-Agent header in HTTP requests more efficiently. The new keyword is documented in the OISF wiki. In this post, I’ll show it’s efficiency with two examples.

Example 1: rarely matching UA

Consider a signature where the match if on a part of the UA that is very rare, so not part of regular User Agents. In my example “abc”.

The signature looks like this:
alert http any any -> any any (msg:"User-Agent abc http_header"; content:"User-Agent: "; http_header; nocase; content:"abc"; http_header; distance:0; pcre:"/User-Agent:[^\n]*abc/iH"; sid:1; rev:1;)

The http_user_agent variant looks much simpler:
alert http any any -> any any (msg:"User-Agent abc http_user_agent"; content:"abc"; http_user_agent; sid:2; rev:1;)

Now when running this against a pcap with over 12.500 HTTP requests, neither signature matched. However, signature 1 was inspected 209752 times! This high number is because the request headers are inspected one-by-one. Signature 2 wasn’t inspected at all, as it never made it past the multi pattern matching stage (mpm).

When looking at pcap runtime, running with only the http_user_agent version is about 10% faster.

Example 2: commonly matching UA

So, what if we want to match on something that is quite common? In other words, the signature will have frequent matches?

First, the http_header signature:
alert http any any -> any any (msg:"User-Agent MSIE 6 http_header"; content:"User-Agent: "; http_header; nocase; content:"MSIE 6"; http_header; distance:0; pcre:"/User-Agent:[^\n]*MSIE 6/iH"; sid:3; rev:1;)
The http_user_agent variant:
alert http any any -> any any (msg:"User-Agent MSIE 6 http_user_agent"; content:"MSIE 6"; http_user_agent; sid:4; rev:1;)

In this case both signatures do match, just over 10.000 times even. The stats look like this:

Each of the inspections of signature 4, the http_user_agent variant, is actually a match. This makes sense as we look for a simple string and the mpm will only invoke the signature if that string is found. It’s clear that the http_header variant takes way more resources. Here too, when looking at pcap runtime, running with only the http_user_agent version is approximately 10% faster.

Final remarks

It’s quite clear that the http_user_agent keyword is much more efficient that inspecting all the HTTP headers. But other than efficiency, the http_user_agent also allows for much easier to read rules.

The Emerging Threats project will likely fork their Suricata ruleset for 1.3 (see this blog post). Even though this will be a significant effort on their side, it’s pretty clear to me the performance effect will be noticeable!

Suricata scaling improvements

For the Suricata 1.3beta1 release, one of our goals was to improve the scalability of the engine when running on many cores. As the graph below shows, we made a good deal of progress.

The blue line is an older 1.1 version, the yellow line is 1.3dev. It clearly shows that 1.1 peaked at 4 cores, then started to get serious contention issues. 1.3dev scales nicely beyond that, up to 24 cores in this test (four 6core AMD cpu’s). Tilera recently demonstrated Suricata on their many core systems, running a single Suricata process per cpu. Their cpu’s have 36 real cores.

We had already manually identified some potential hotspots, but that wasn’t enough. We needed to be able to measure. So I added lock profiling code. This gave us the tools needed to really pin point contention points in the code. Hotspots were: flow engine, thresholding engine, tag engine. Not very surprising, as each of those represent an global data structure, used by all packet processing threads.

Flow engine

Several improvements were made to the flow engine. First of all, the main contention point was a queue that was really a series of ordered lists. These lists were ordered by flow time out. The idea behind it was that this way the “flow manager”, which takes care of timing out and cleaning up flows, would just look at those queues for the oldest flows to process.

The problem was that these queues had to be updated for every packet (sometimes even twice per packet). This queue is now gone. Instead, the flow manager now walks the entire flow hash table. This removes the contention point. The flow hash has fine grained locking leading to much less contention.

When dealing with a hash table, distribution is very important and a good hash algorithm takes care of that. One of the changes in 1.3dev is the replacement of our naive algorithm by the Jenkins hash. At the cost of a small computational overhead, this leads to much better hash distribution and thus less contention.

Finally, for the flow engine I’d like to mention once more the flow based auto load balancing work I’ve written about before here. It gives more balanced distribution between threads.

Thresholding and Tag engines

Both the thresholding and tag engines store information per host. Until 1.3, both used a separate hash table governed by a single lock. Lookups are frequent: once for each event in the case of thresholding, once per packet for tags.

To address this a host table was introduced, modelled after the flow table. So with fine grained locking. Both thresholding and tagging now use that table.

For thresholding one contention point is unresolved. Thresholding per signature id is still a global table with a single lock.

 

Lots of improvements in this version. Still scaling is not as good as we’d like, it takes too many cores to double performance. Our goal is to get as close to linear as possible. The work continues! 🙂

The graph was provided by Josh White and is part of his performance research for Clarkson University. Thanks Josh, looking forward to your final paper!