Suricata: Handling of multiple different SYN/ACKs

synackWhen processing the TCP 3 way handshake (3whs), Suricata’s TCP stream engine will closely follow the setup of a TCP connection to make sure the rest of the session can be tracked and reassembled properly. Retransmissions of SYN/ACKs are silently accepted, unless they are different somehow. If the SEQ or ACK values are different they are considered wrong and events are set. The stream events rules will match on this.

I ran into some cases where not the initial SYN/ACK was used by the client, but instead a later one. Suricata however, had accepted the initial SYN/ACK. The result was that every packet from that point was rejected by the stream engine. A 67 packet pcap resulting in 64 stream events.

If people have the stream events enabled _and_ pay attention to them, a noisy session like this should certainly get their attention. However, many people disable the stream events, or choose to ignore them, so a better solution is necessary.

Analysis

In this case the curious thing is that the extra SYN/ACK(s) have different properties: the sequence number is different. As the SYN/ACKs sequence number is used as “initial sequence number” (ISN) in the “to client” direction, it’s crucial to track it correctly. Failing to do so, Suricata will loose track of the stream, causing reassembly to fail. This could lead to missed alerts.

Whats happening on the wire:

TCP SSN 1:

-> SYN: SEQ 10
<- SYN/ACK 1: ACK 11, SEQ 100
<- SYN/ACK 2: ACK 11, SEQ 1000
-> ACK: SEQ 11, ACK 101

TCP SSN 2:

-> SYN: SEQ 10
<- SYN/ACK 1: ACK 11, SEQ 100
<- SYN/ACK 2: ACK 11, SEQ 1000
-> ACK: SEQ 11, ACK 1001

It’s clear that in SSN 1 the client ACKs the first SYN/ACK while in SSN 2 the 2nd SYN/ACK is ACK’d. It’s likely that the first SYN/ACK was lost before it reached the client. Suricata accepts the first though, and rejects any others that are not the same.

Solution

The solution I’ve been working on is to delay judgement on the extra SYN/ACKs until Suricata sees the ACK that completes the 3whs. At that point Suricata knows what the client accepted, and which SYN/ACKs were either ignored, or never received.

Logic in pseudo code:

Normal SYN/ACK coming in:

    UpdateState(p);
    ssn->state = TCP_SYN_RECV;

Extra SYN/ACK packets:

    if (p != ssn) {
        QueueState(p);

On receiving the ACK that completes the 3whs:

    if (ssn->queue_len) {
        q = QueueFindState(p);
        if (q)
            UpdateState(q);
    }
    UpdateState(p);
    ssn->state = TCP_ESTABLISHED;

So when receiving the ACK, Suricata first searches for the proper SYN/ACK on the list. If it’s not found, the ACK will be processed normally, which means it’s checked against the original SYN/ACK. If Suricata did have a queued state, it will first apply it to the SSN. Then the ACK will be processed normally, so that is can complete the 3whs and move the state to ESTABLISHED.

Limitations

Queuing these states takes some memory, and for this reason there is a limit to the number each SSN will accept. This is configurable through a new stream option:

stream:
  max-synack-queued: 5

It defaults to 5. I’ve seen a few (valid) hits against a few terrabytes of traffic, so I think the default is reasonably safe. An event is being set if the limit is exceeded. It can be matched using a stream-event rule:

  alert tcp any any -> any any (msg:"SURICATA STREAM 3way handshake \
      excessive different SYN/ACKs"; stream-event:3whs_synack_flood; \
      sid:2210055; rev:1;)

Performance

This functionality doesn’t affect the regular “fast path” except for a small check to see if we have queued states. However, if the queue list is being used Suricata enters a slow path. Currently this involves an memory allocation per stored queue. It may be interesting to consider using pools here, although a single global pool might be ineffecient. In such a case a lock would have to be used and this might lead to contention, especially in a case where Suricata would be flooded. Per thread pools (519, 520, 521) may be best here.

IPS mode

SYN/ACKs that exceed the limit are dropped if stream.inline is enabled as is the case with all packets that are considered to be bad in some way.

Code

The code is now part of the git master through commit 4c6463f3784f533a07679589dab713096137a439. Feedback welcome through our oisf-devel list.

Suricata IPS improvements

January has been a productive month for Suricata, especially for the IPS part of it. I’ve quite some time on adding support to the stream engine to operate differently when running inline. This was needed as dropping attacks found in the reassembled stream or the application layer was not reliable. Up until now the stream engine would offer the reassembled stream to the detection engine as soon as it was ACK’d. This meant that by definition the packets containing the data had already passed the IPS device. Simply switching to sending un-ACK’d data to the detection engine would have it’s own set of issues.

To be able to work with un-ACK’d data, we need to make sure we deal with possible evasions properly. The problem, as extensively documented by Judy Novak and Steven Sturges, is that in TCP streams there can be overlapping packets. Those are being dealt with differently based on the receiving OS. If we would need to account for overlaps in the application layer, we would have to be able to tell the HTTP parser for example: “sorry, that last data is wrong, please revert and use the new packet instead”. A nightmare.

The solution I opted for was to not care about destination OS’ for overlaps and such. The approach is fairly simple: once we have accepted a segment, thats what it’s going to be. This means that if we receive a segment later that (partially) overlaps and has different data, it’s data portion will simply be overwritten to be the same as the original segment. This way, the IPS and not an obscure mix of the sender (attacker?) and destination OS, determines the data the destination will see.

Of course the approach comes with some drawbacks. First, we need to keep segments in memory for a longer period of time. This causes significantly higher memory usage. Secondly, if we rewrite a packet, it needs to be reinjected on the wire. As we modified the packet payload a checksum recalculation is required.

In Suricata’s design the application layer parsers, such as our HTTP parser, run on top of the reassembly engine. After the reassembly engine and the app layer parsers are updated, the packet with the associated stream and app layer state is passed on to the detection engine. In the case where we work with ACK’d data, an ACK packet in the opposite direction triggers the reassembly process. If we detect based on that, and decide we need to drop, all we can do is drop the ACK packet as the actual data segment(s) have already passed. This is not good enough in many cases.

In the new code the data segment itself triggers the reassembly process. In this case, if the detection engine decides a drop is required, the packet containing the data itself can be dropped, not just the ACK. The reason we’re not taking the same approach in IDS mode is that we wouldn’t be able to properly deal with the said evasion/overlap issues. The IPS can exactly control what packets pass Suricata. The IDS, being passive, can not.

You can try this code by checking out the current git master. In the suricata.yaml that lives in our git tree you’ll find a new option in the stream config, “stream.inline”. If you enable this, the code as explained above is activated.

Feedback is very welcome!

Suricata 1.0.2 released

After some well deserved vacation I’m getting back up to speed in Suricata development. Luckily most of our dev team continued to work in my absence, making today’s 1.0.2 release possible.

The main focus of this release was fixing the TCP stream engine. Judy Novak found a number of ways to evade detection. See her blog post describing the issues.

The biggest other change is the addition of a new application layer module. The SSH parser parses SSH sessions and stops detection/inspection of the stream after the encrypted part of the session has started. So this is mainly a module focused on reducing the number of packets that need inspection, just like the SSL and TLS modules.

As a bonus though, we introduced two rule keywords that match on the parsed SSH parameters:

ssh.protoversion will match against the ssh protocol version. I’ll give some examples.

ssh.protoversion:2.0

This will match on 2.0 exactly.

ssh.protoversion:2_compat

This will match on 2, but also 1.99 and other versions compatible to “2”.

ssh.protoversion:1.

The last example will match on all versions starting with “1.”, so 1.6, 1.7, etc.

ssh.softwareversion will match on the software version identifier. An example:

ssh.softwareversion:PuTTY

This will match only on session using the PuTTY SSH client.

Other changes include better HTTP accuracy, better IPS functionality.

For the next release we will focus on further improving overall detection accuracy, improving inline mode further, improving performance and specifically improving CUDA performance. As always, we welcome any feedback. Or if you are interested in helping out, please contact us!

Update: added a link to Judy Novak’s blog post on the TCP evasions.

OISF engine prototype: streams handling

I’ve been thinking about how to deal with streams in the OISF engine. We need to do stream reassembly to be able to handle spliced sessions, otherwise it would be very easy to evade detection. Snort traditionally used an approach of inspecting the packets individually and reassembling (part of) the stream in a pseudo packet, that was inspected mostly like a normal packet. Recent Snort versions, especially when Stream5 was introduced, have a so called stream api. This enables detection modules to control the reassembly better.

In Snort_inline’s Stream4 I’ve been experimenting with ways to improve stream reassembly in an inline setup. The problem with Snort’s pseudo packet scanning way of operation is that it’s after the fact scanning. Which means that any threat detected in the reassembled stream can’t be dropped anymore. The way I tried to work around this was by constantly scanning a sliding window of reassembled unacked data. It worked quite well, except for the performance of it. That was quite bad.

I’m thinking about a stream reassembler for the OISF engine that can both do the after-the-fact pseudo packet scanning and do a sliding window approach as I did in stream4inline. This would be used for the normal tcp signatures. I think it should be possible to determine the minimal size of the reassembled packet based on the signatures per port, possibly more fine grained. Of course things like target based reassembly and robust reassembly will be part of it.

In addition to this I’m thinking about a way to make modules act on the stream similary to how programs deal with sockets. Code that only wakes up if a new piece of data in that connection is received, with semantics similar to recv()/read(). I haven’t really made up my mind about how such an api should work exactly, but I think it would be very useful to detection module writers if they only have to care about handling the next chunk of data.

I haven’t implemented any of this yet, but I plan to start working on this soon. I’ll start with simple TCP state tracking that I’m planning to build on top of the flow handling already implemented. I’ll blog about this as I go…

Improving Snort_inline’s NFQ performance

When using Snort_inline with NFQ support, it’s likely that at some point you’ve seen messages like these on the console: packet recv contents failure: No buffer space available. When the messages are appearing Snort_inline slows down significantly. I’ve been trying to find out why.

There are a number of setting that influence NFQ performance. One of them is the NFQ queue maximum length. This is a value in packets. Snort_inline takes an argument to modify the buffer length: –queue-maxlen 5000 (note: there are two dashes before queue-maxlen).

That’s not enough though. The following settings increase the buffer that NFQ seems to use for it’s queue. Since I’ve set it this high, I haven’t been able to get a single read error anymore:

sysctl -w net.core.rmem_default=’8388608′
sysctl -w net.core.wmem_default=’8388608′

The values are in bytes. The following values increase buffers for tcp traffic.

sysctl -w net.ipv4.tcp_wmem=’1048576 4194304 16777216′
sysctl -w net.ipv4.tcp_rmem=’1048576 4194304 16777216′

For more details see this page: http://www-didc.lbl.gov/TCP-tuning/linux.html

Setting these values fixed all my NFQ related slowdowns. The values probably work for ip_queue as well. If you use other values, please put them in a comment below.

Thanks to Dave Remien for helping me track this down!

New Snort_inline TCP window normalization code in SVN

A while ago I wrote about why the TCP window scaling normalization in Snort_inline was broken by design. I also wrote about a new solution I was working on and testing that would be uploaded to SVN soon. I just committed the patch to SVN. What it does is add two new options to stream4:

norm_window: normalize the TCP window (disabled by default). This is to protect Snort_inline from being forced to queue too many packets.
max_win_size: maximum size of the scaled TCP window. Packets increasing the window beyond the limit are modified.

This option is disabled by default, and the old wscale normalization code is removed, as are the options that configured it. It runs fine on my gateway, without noticeable slowdowns, but I haven’t done any benchmarking so far. Please try this and let me know how it works for you!

Window scaling normalization in Snort_inline broken by design

After debugging some connection problems I found that the wscale normalization concept is flawed. I’ll describe here what is wrong with it and then move on to suggest a different solution I’m currently testing. The problem I was seeing is that some connections to some webservers stalled without an apparent reason.

First a quick reminder of why I originally came up with the wscale normalization. Stream4 originally doesn’t look at the window scaling value when determining the TCP window. This causes it to be wrong about the TCP window in about every connection, which is one of the reasons out of window packets are not dropped (this is actually a gaping evasion hole since these packets are not used in stream reassembly). This is why I decided to add window scaling support to the stream4inline extension. This works great and allows the admin to drop out of window packets. There is a problem associated with it though. The maximal window that is possible with wscaling is 1GB. This would mean that Snort_inline would in the worst case have to queue almost 1GB of data in it’s buffers for a single stream. To prevent this being used by an attacker to attack Snort_inline, I wanted give the admin the option to set a maximal wscale size.

So, why doesn’t replacing the wscale value in packets work? I’ll explain that now. First an example without normalization. Say we have client connecting to a server. The client sends in it’s SYN packet a window of 5840 and a wscale of 5. The server replies with a SYN/ACK with window 5792, wscale 9. Both have a unscaled window in their packet since the wscale won’t be used before both sides have received a packet with the wscale option enabled. The client sends an ACK completing the three way handshake, with a window of 183. That means a scaled window of 5856 (183 x 2^5). The client will now send an actual data packet, using the same window. The server ACK’s the data with a packet with a window of 16, meaning a scaled window of 8192 (16 x 2^9).

Now, what happens when we normalize? Consider the same connection, but now Snort_inline normalizes all wscale values above 2, to 2. The client sends in it’s SYN packet: window of 5840, wscale of 5, but due to the normalization, the server receives it as window of 5840, wscale of 2. The server replies with a SYN/ACK with window 5792, wscale 9, but the client receives it as window 5792, wscale 2. The problem here is that neither the client or the server know that it’s wscale value was been modified. Nor is there a way to make it known. So what then happens is this. When the server wants to say it has a window of 8192, it will send a packet with the window field set to 16 (16 * 2^9 = 8192). But, due to the normalization, it actually says it has a window of 64 (16 * 2^2). Likewise, when the client wants to tell the server it has a window of 5856 (window field set to 183), it actually says it has a window of 732 (183 * 2^2). This completely stalls connections. So why did I only see this on some rare connections? That is because most servers on the internet use low wscale values. The server I ran into issues with however, used a value of 9.

The solution I am now testing is normalizing the scaled window. With this idea Snort_inline takes the full scaled window into account and compares it with a maximum value. If it exceeds it, the window value in the packet is modified taking the wscale value into account. I’ve been running like this for about 2 weeks now, and so far I have seen no stalling connections anymore. There is however quite a drawback to this approach. The window size is a constantly changing value that is adapted in almost every TCP packet. Unlike the wscale normalization, that could be done by modifying the SYN and SYN/ACK packets, the new approach in the worst case has to modify and replace almost every single packet in a stream. This will take more resources from Snort_inline.

I’m interested in hearing other possible solutions to this problem or other drawbacks of my new solution. I will be checking my new solution into SVN soon. I will make sure it is disabled by default. To work around the broken wscale normalization just set it to it’s maximum value, so add ‘norm_wscale_max 14’ to your stream4 configuration line.