Suricata on Myricom capture cards

Myricom and OISF just announced that Myricom has joined the OISF consortium to support the development of Suricata. The good folks at Myricom had already sent me one of their cards earlier. In this post I’ll describe how you can use these cards today, even though Suricata doesn’t have native Myricom support yet: this guide uses Myricom’s libpcap support.

Getting started

I’m going to assume you have the card installed properly, the Sniffer driver installed, and that it all works. Check your dmesg output to confirm the card is in sniffer mode:

[ 2102.860241] myri_snf INFO: eth4: Link0 is UP
[ 2101.341965] myri_snf INFO: eth5: Link0 is UP
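
If these messages don’t show up, something like this should pull them out of the kernel log (assuming the driver logs under the myri_snf name shown above):
dmesg | grep myri_snf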

I have installed the Myricom runtime and libraries in /opt/snf.

Compile Suricata against Myricom’s libpcap:

./configure --with-libpcap-includes=/opt/snf/include/ --with-libpcap-libraries=/opt/snf/lib/ --prefix=/usr --sysconfdir=/etc --localstatedir=/var
make
sudo make install
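
To check that the resulting binary actually picked up Myricom’s libpcap rather than the system one, and to make the SNF libraries visible at runtime, something along these lines should do (a sketch assuming the /opt/snf install location used above):
export LD_LIBRARY_PATH=/opt/snf/lib:$LD_LIBRARY_PATH
ldd $(which suricata) | grep pcap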

Next, configure the number of ringbuffers. I’m going to work with 8 here, as my quad core with hyper-threading has 8 logical CPUs.

pcap:
  - interface: eth5
    threads: 8
    buffer-size: 512kb
    checksum-checks: no

The 8 threads setting makes Suricata create 8 reader threads for eth5. The Myricom driver makes sure each of those is attached to its own ringbuffer.

Then start Suricata as follows:
SNF_NUM_RINGS=8 SNF_FLAGS=0x1 suricata -c suricata.yaml -i eth5 --runmode=workers

If you want 16 ringbuffers, update the “threads” variable in your yaml to 16 and start Suricata:
SNF_NUM_RINGS=16 SNF_FLAGS=0x1 suricata -c suricata.yaml -i eth5 --runmode=workers

It looks like you can use any number of ringbuffers, so you’re not limited to, for example, a power of 2.
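
For example, to try 12 ringbuffers (purely an illustration, any other value should behave the same way), set the threads value in the yaml to 12 and start Suricata with a matching ring count:

pcap:
  - interface: eth5
    threads: 12
    buffer-size: 512kb
    checksum-checks: no

SNF_NUM_RINGS=12 SNF_FLAGS=0x1 suricata -c suricata.yaml -i eth5 --runmode=workers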

Example with CPU affinity

You can also use Suricata’s built-in CPU affinity settings to assign a worker to a CPU/core. In this example I’ll create 7 worker threads that will each run on their own logical CPU. The remaining CPU can then be used by the management threads, most importantly the flow manager.

max-pending-packets: 8192

detect-engine:
  - sgh-mpm-context: full

mpm-algo: ac-bs

threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
      cpu: [ "0" ]
    - detect-cpu-set:
      cpu: [ "1-7" ]
      mode: "exclusive"
      prio:
        default: "high"

pcap:
  - interface: eth5
    buffer-size: 512kb
    threads: 7
    checksum-checks: no

Then start Suricata with:
SNF_NUM_RINGS=7 SNF_FLAGS=0x1 suricata -c suricata.yaml -i eth5 --runmode=workers

This configuration will reserve cpu0 for the management threads and will assign a worker thread (and thus a ringbuffer) to each of cpu1 through cpu7. Note that I added a few more performance tricks to the config. This setup with 7 CPU-pinned threads appears to be a little faster than the case where CPU affinity is not used.

Myricom has a nice traffic replay tool as well. This replays a pcap at 1Gbps:
snf_replay -r 1.0 -i 10 -p1 /path/to/pcap

Final remarks

The Myricom card already works nicely with Suricata. Because of the way their libpcap code works, we can use the card’s ringbuffers feature even without native support. Myricom also offers a native API. Later this year, together with Myricom, we’ll be looking into adding support for it.

Suricata scaling improvements

For the Suricata 1.3beta1 release, one of our goals was to improve the scalability of the engine when running on many cores. As the graph below shows, we made a good deal of progress.

The blue line is an older 1.1 version, the yellow line is 1.3dev. It clearly shows that 1.1 peaked at 4 cores, then started to get serious contention issues. 1.3dev scales nicely beyond that, up to 24 cores in this test (four 6-core AMD CPUs). Tilera recently demonstrated Suricata on their many-core systems, running a single Suricata process per CPU. Their CPUs have 36 real cores.

We had already manually identified some potential hotspots, but that wasn’t enough: we needed to be able to measure. So I added lock profiling code. This gave us the tools needed to really pinpoint contention points in the code. The hotspots were the flow engine, the thresholding engine and the tag engine. Not very surprising, as each of those represents a global data structure used by all packet processing threads.
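
If you want to play with the lock profiling yourself, it has to be compiled in. The configure option should be roughly as below, but check ./configure --help for the exact name in your version:
./configure --enable-profiling-locks
make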

Flow engine

Several improvements were made to the flow engine. First of all, the main contention point was a queue that was really a series of ordered lists. These lists were ordered by flow time out. The idea behind it was that this way the “flow manager”, which takes care of timing out and cleaning up flows, would just look at those queues for the oldest flows to process.

The problem was that these queues had to be updated for every packet (sometimes even twice per packet). That queue is now gone. Instead, the flow manager now walks the entire flow hash table. This removes the contention point. The flow hash has fine-grained locking, leading to much less contention.

When dealing with a hash table, distribution is very important and a good hash algorithm takes care of that. One of the changes in 1.3dev is the replacement of our naive algorithm by the Jenkins hash. At the cost of a small computational overhead, this leads to much better hash distribution and thus less contention.

Finally, for the flow engine I’d like to mention once more the flow-based auto load balancing work I’ve written about here before. It gives a more balanced distribution between threads.

Thresholding and Tag engines

Both the thresholding and tag engines store information per host. Until 1.3, both used a separate hash table governed by a single lock. Lookups are frequent: once for each event in the case of thresholding, once per packet for tags.

To address this, a host table was introduced, modelled after the flow table and thus with fine-grained locking. Both thresholding and tagging now use that table.

For thresholding, one contention point remains unresolved: thresholding per signature id still uses a global table with a single lock.


Lots of improvements in this version. Still, scaling is not as good as we’d like; it takes too many cores to double performance. Our goal is to get as close to linear as possible. The work continues! 🙂

The graph was provided by Josh White and is part of his performance research for Clarkson University. Thanks Josh, looking forward to your final paper!

Suricata runmode changes

Yesterday I pushed a patch that changes the default runmode from “auto” to “autofp”. The autofp name stands for “auto flow pinning” and it automatically makes sure all packets belonging to a flow are processed by the same stream, detection and output thread. Until now, the assignment was done with a simple hash calculation. The problem with that is that it doesn’t take into account how busy a thread may be.

OISF’s Anoop Saldanha recently wrote a new load balancer, called “active-packets”, which is now the default. Before assigning a new flow to a thread, it checks how busy each candidate thread is. This leads to a much fairer distribution of flows and packets.

AutoFP - Total flow handler queues - 6
AutoFP - Queue 0 - pkts: 82879145 flows: 30589
AutoFP - Queue 1 - pkts: 36997716 flows: 4042
AutoFP - Queue 2 - pkts: 22168624 flows: 356
AutoFP - Queue 3 - pkts: 36886948 flows: 40
AutoFP - Queue 4 - pkts: 22135664 flows: 118
AutoFP - Queue 5 - pkts: 22121314 flows: 101

In the example above it’s clearly visible that the number of flows assigned to the queues (and thus threads) varies greatly. However, the number of packets varies much less. It may appear that Queue 0 is somewhat oversubscribed, but remember that a queue is selected based on how busy it is. In this case the IDS box is not very busy, so queue 0 was available most of the time.

The output above is displayed at shutdown if you use the (now default) “autofp” mode. Take a look at it to see if the load balancing makes sense in your setup!
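
If you’d rather set the runmode explicitly than rely on the new default, you can still pass it on the command line, for example:
suricata -c suricata.yaml -i eth0 --runmode=autofp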

Listening on multiple interfaces with Suricata

A question I see quite often is: can I listen on multiple interfaces with a single Suricata instance? Until now the answer has always been “no”. I’d suggest trying the “any” pseudo-interface (suricata -i any) with a BPF filter to limit the traffic, or running multiple instances of Suricata. That last suggestion was especially painful, as one of the goals of Suricata is to allow a single process to process all packets using all available resources.
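
For completeness, a rough sketch of that workaround, assuming your build accepts a trailing BPF filter on the command line (tcpdump-style); adjust the filter to whatever traffic you want to keep or exclude:
suricata -c suricata.yaml -i any not host 192.168.1.1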

Last week I found some time to look at how hard adding support for acquiring packets from multiple interfaces would be. Turned out: not so hard! Due to Suricata’s highly modular threading design, it was actually quite easy. I decided to keep it simple, so if you want to add multiple interfaces to listen on, just add each separately on the command line, like so: suricata -i eth0 -i eth1 -i ppp0. This will create a so-called “receive thread” for each of those interfaces.

I’ve added no internal limits, so in theory it should be possible to add dozens. I only tested with 2 though, so be careful. Normally the thread name in logs and “top” for the pcap receive thread is “ReceivePcap”. This is still true if a single interface is passed to Suricata. In case more are passed, the thread names change to “RecvPcap-<int>”, e.g. RecvPcap-eth0 and RecvPcap-eth1. Untested, but it should work fine to monitor multiple interfaces of different types; Suricata sets the data link type in the interface-specific receive thread.
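
To check which receive threads you ended up with, listing the process’ threads should show the names mentioned above; for example (assuming a single running Suricata process):
top -H -p $(pidof suricata)
ps -T -p $(pidof suricata) -o spid,comm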

If you’re interested in trying out this new feature, there are a few limitations to consider. First, there is no Windows support yet; I hope this can be addressed later. Second, the case where two or more interfaces (partly) see the same traffic is untested. The problem here is that we’ll see identical packets going into the engine. This may (or may not, like I said, it’s untested) screw up the defrag and stream engines, and might cause duplicate alerts, etc. Addressing this would probably require keeping a hash of packets so we can detect duplicates. That is probably quite computationally intensive, so it may not be worth it. I’m very much open to other solutions. Patches are even more welcome 🙂

So, for now use it only if your interfaces see completely separate traffic. Unless, of course, you’re interested in seeing what happens when you ignore my warnings, in which case I’d like to know! The code is available right now in our current git master, and will be part of 1.1beta2.

Merry xmas everyone!