Suricata: Handling of multiple different SYN/ACKs

When processing the TCP 3-way handshake (3whs), Suricata’s TCP stream engine closely follows the setup of a TCP connection to make sure the rest of the session can be tracked and reassembled properly. Retransmissions of SYN/ACKs are silently accepted, unless they differ somehow. If the SEQ or ACK values are different they are considered wrong and events are set. The stream-events rules will match on this.

I ran into some cases where the client did not use the initial SYN/ACK, but a later one instead. Suricata, however, had accepted the initial SYN/ACK. The result was that every packet from that point on was rejected by the stream engine: a 67-packet pcap resulted in 64 stream events.

If people have the stream events enabled _and_ pay attention to them, a noisy session like this should certainly get their attention. However, many people disable the stream events, or choose to ignore them, so a better solution is necessary.

Analysis

In this case the curious thing is that the extra SYN/ACK(s) have different properties: the sequence number is different. As the SYN/ACK’s sequence number is used as the “initial sequence number” (ISN) in the “to client” direction, it’s crucial to track it correctly. Failing to do so, Suricata loses track of the stream, causing reassembly to fail. This could lead to missed alerts.

What’s happening on the wire:

TCP SSN 1:

-> SYN: SEQ 10
<- SYN/ACK 1: ACK 11, SEQ 100
<- SYN/ACK 2: ACK 11, SEQ 1000
-> ACK: SEQ 11, ACK 101

TCP SSN 2:

-> SYN: SEQ 10
<- SYN/ACK 1: ACK 11, SEQ 100
<- SYN/ACK 2: ACK 11, SEQ 1000
-> ACK: SEQ 11, ACK 1001

It’s clear that in SSN 1 the client ACKs the first SYN/ACK, while in SSN 2 the second SYN/ACK is ACK’d. It’s likely that in SSN 2 the first SYN/ACK was lost before it reached the client. Suricata, however, accepts the first one and rejects any others that are not the same.

Solution

The solution I’ve been working on is to delay judgement on the extra SYN/ACKs until Suricata sees the ACK that completes the 3whs. At that point Suricata knows what the client accepted, and which SYN/ACKs were either ignored, or never received.

Logic in pseudo code:

Normal SYN/ACK coming in:

    UpdateState(p);
    ssn->state = TCP_SYN_RECV;

Extra SYN/ACK packets:

    if (p != ssn) {
        QueueState(p);
    }

On receiving the ACK that completes the 3whs:

    if (ssn->queue_len) {
        q = QueueFindState(p);
        if (q)
            UpdateState(q);
    }
    UpdateState(p);
    ssn->state = TCP_ESTABLISHED;

So when receiving the ACK, Suricata first searches for the matching SYN/ACK on the list. If it’s not found, the ACK is processed normally, which means it’s checked against the original SYN/ACK. If Suricata did have a queued state, it first applies it to the SSN. Then the ACK is processed normally, so that it can complete the 3whs and move the state to ESTABLISHED.
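
To make the idea a bit more concrete, here is a rough sketch of what a queued SYN/ACK state could store; the field names are illustrative and not necessarily those used in the actual Suricata code:

    /* Sketch of a queued SYN/ACK state entry (illustrative only). Each
     * extra SYN/ACK only needs the TCP parameters required to
     * re-evaluate the handshake once the final ACK arrives. */
    #include <stdint.h>

    typedef struct SynAckState_ {
        uint32_t seq;                /* SYN/ACK sequence number: candidate ISN to client */
        uint32_t ack;                /* acknowledgement number */
        uint16_t win;                /* advertised window */
        uint8_t  wscale;             /* window scale option, if present */
        uint8_t  sack_ok;            /* SACK permitted option */
        struct SynAckState_ *next;   /* list of queued states on the session */
    } SynAckState;

QueueFindState() would then walk this list looking for the entry whose sequence number matches the final ACK (ACK == seq + 1) and apply that one to the session.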

Limitations

Queuing these states takes some memory, and for this reason there is a limit to the number each SSN will accept. This is configurable through a new stream option:

stream:
  max-synack-queued: 5

It defaults to 5. I’ve seen a few (valid) hits against a few terabytes of traffic, so I think the default is reasonably safe. An event is set if the limit is exceeded. It can be matched using a stream-event rule:

  alert tcp any any -> any any (msg:"SURICATA STREAM 3way handshake \
      excessive different SYN/ACKs"; stream-event:3whs_synack_flood; \
      sid:2210055; rev:1;)

Performance

This functionality doesn’t affect the regular “fast path”, except for a small check to see if we have queued states. However, if the queue list is being used Suricata enters a slow path. Currently this involves a memory allocation per queued state. It may be interesting to consider using pools here, although a single global pool might be inefficient: a lock would have to be used and this might lead to contention, especially in a case where Suricata is flooded with SYN/ACKs. Per-thread pools (519, 520, 521) may be best here.
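
As a rough illustration of why per-thread pools avoid the contention problem, a minimal per-thread free list could look like this (a conceptual sketch, not the pool implementation used by Suricata):

    /* Conceptual per-thread pool: each packet processing thread keeps its
     * own free list, so no lock is needed and a SYN/ACK flood handled by
     * one thread cannot cause lock contention on the others. */
    #include <stdlib.h>

    typedef struct PoolEntry_ {
        struct PoolEntry_ *next;
    } PoolEntry;

    typedef struct ThreadPool_ {
        PoolEntry *free_list;   /* owned by a single thread */
    } ThreadPool;

    static void *ThreadPoolGet(ThreadPool *tp, size_t size) {
        if (tp->free_list != NULL) {
            PoolEntry *e = tp->free_list;
            tp->free_list = e->next;
            return e;
        }
        return malloc(size < sizeof(PoolEntry) ? sizeof(PoolEntry) : size);
    }

    static void ThreadPoolReturn(ThreadPool *tp, void *ptr) {
        PoolEntry *e = ptr;
        e->next = tp->free_list;
        tp->free_list = e;
    }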

IPS mode

SYN/ACKs that exceed the limit are dropped if stream.inline is enabled, as is the case with all packets that are considered to be bad in some way.

Code

The code is now part of the git master through commit 4c6463f3784f533a07679589dab713096137a439. Feedback welcome through our oisf-devel list.

Suricata IPS improvements

January has been a productive month for Suricata, especially for the IPS part of it. I’ve spent quite some time adding support to the stream engine to operate differently when running inline. This was needed as dropping attacks found in the reassembled stream or the application layer was not reliable. Up until now the stream engine would offer the reassembled stream to the detection engine as soon as it was ACK’d. This meant that by definition the packets containing the data had already passed the IPS device. Simply switching to sending un-ACK’d data to the detection engine would have its own set of issues.

To be able to work with un-ACK’d data, we need to make sure we deal with possible evasions properly. The problem, as extensively documented by Judy Novak and Steven Sturges, is that TCP streams can contain overlapping packets, and those are dealt with differently depending on the receiving OS. If we had to account for overlaps in the application layer, we would have to be able to tell the HTTP parser, for example: “sorry, that last data was wrong, please revert and use the new packet instead”. A nightmare.

The solution I opted for was to not care about the destination OS for overlaps and such. The approach is fairly simple: once we have accepted a segment, that’s what it’s going to be. This means that if we later receive a segment that (partially) overlaps and has different data, its data portion is simply overwritten to match the original segment (see the sketch below). This way the IPS, and not an obscure mix of the sender (attacker?) and the destination OS, determines the data the destination will see.
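
A minimal sketch of that idea, assuming stored segments are kept with their sequence number and length (names are illustrative, this is not the actual stream engine code):

    /* Sketch: make the overlapping part of an incoming segment carry the
     * same bytes as the segment we already accepted. Sequence number
     * wrap-around is ignored here for brevity. */
    #include <stdint.h>
    #include <string.h>

    typedef struct Segment_ {
        uint32_t seq;       /* sequence number of the first byte */
        uint32_t len;       /* payload length */
        uint8_t *data;      /* payload bytes */
    } Segment;

    /* Returns 1 if the incoming packet's payload was modified; the TCP
     * checksum then has to be recalculated before reinjection. */
    static int SegmentEnforceOriginal(const Segment *stored, Segment *incoming) {
        uint32_t start = incoming->seq > stored->seq ? incoming->seq : stored->seq;
        uint32_t stored_end = stored->seq + stored->len;
        uint32_t incoming_end = incoming->seq + incoming->len;
        uint32_t end = incoming_end < stored_end ? incoming_end : stored_end;
        if (start >= end)
            return 0;                          /* no overlap */

        uint32_t len = end - start;
        const uint8_t *src = stored->data + (start - stored->seq);
        uint8_t *dst = incoming->data + (start - incoming->seq);
        if (memcmp(dst, src, len) == 0)
            return 0;                          /* same data, nothing to do */
        memcpy(dst, src, len);                 /* enforce the accepted data */
        return 1;
    }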

Of course the approach comes with some drawbacks. First, we need to keep segments in memory for a longer period of time, which causes significantly higher memory usage. Secondly, if we rewrite a packet it needs to be reinjected on the wire, and as we modified the packet payload a checksum recalculation is required.

In Suricata’s design the application layer parsers, such as our HTTP parser, run on top of the reassembly engine. After the reassembly engine and the app layer parsers are updated, the packet with the associated stream and app layer state is passed on to the detection engine. In the case where we work with ACK’d data, an ACK packet in the opposite direction triggers the reassembly process. If we detect based on that, and decide we need to drop, all we can do is drop the ACK packet as the actual data segment(s) have already passed. This is not good enough in many cases.

In the new code the data segment itself triggers the reassembly process. In this case, if the detection engine decides a drop is required, the packet containing the data itself can be dropped, not just the ACK. The reason we’re not taking the same approach in IDS mode is that we wouldn’t be able to properly deal with the evasion/overlap issues described above. The IPS can exactly control which packets pass Suricata. The IDS, being passive, cannot.

You can try this code by checking out the current git master. In the suricata.yaml that lives in our git tree you’ll find a new option in the stream config, “stream.inline”. If you enable this (see the example below), the code explained above is activated.
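
For reference, enabling it looks roughly like this in the stream section of suricata.yaml (other stream options omitted):

stream:
  inline: yes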

Feedback is very welcome!

Suricata 1.0.2 released

After some well-deserved vacation I’m getting back up to speed in Suricata development. Luckily most of our dev team continued to work in my absence, making today’s 1.0.2 release possible.

The main focus of this release was fixing the TCP stream engine. Judy Novak found a number of ways to evade detection. See her blog post describing the issues.

The biggest other change is the addition of a new application layer module. The SSH parser parses SSH sessions and stops detection/inspection of the stream after the encrypted part of the session has started. So this is mainly a module focused on reducing the number of packets that need inspection, just like the SSL and TLS modules.

As a bonus though, we introduced two rule keywords that match on the parsed SSH parameters:

ssh.protoversion will match against the SSH protocol version. I’ll give some examples.

ssh.protoversion:2.0

This will match on 2.0 exactly.

ssh.protoversion:2_compat

This will match on 2, but also on 1.99 and other versions compatible with “2”.

ssh.protoversion:1.

The last example will match on all versions starting with “1.”, so 1.6, 1.7, etc.

ssh.softwareversion will match on the software version identifier. An example:

ssh.softwareversion:PuTTY

This will match only on sessions using the PuTTY SSH client.
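
As an illustration, using the keyword syntax shown above, a complete rule combining both keywords could look something like this (the sid is just a placeholder):

  alert tcp any any -> any 22 (msg:"SSH session from PuTTY client"; \
      ssh.protoversion:2_compat; ssh.softwareversion:PuTTY; \
      sid:1000001; rev:1;)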

Other changes include better HTTP accuracy and better IPS functionality.

For the next release we will focus on further improving overall detection accuracy, improving inline mode further, improving performance and specifically improving CUDA performance. As always, we welcome any feedback. Or if you are interested in helping out, please contact us!

Update: added a link to Judy Novak’s blog post on the TCP evasions.

OISF engine prototype: streams handling

I’ve been thinking about how to deal with streams in the OISF engine. We need to do stream reassembly to be able to handle spliced sessions, otherwise it would be very easy to evade detection. Snort traditionally used an approach of inspecting the packets individually and reassembling (part of) the stream into a pseudo packet that was then inspected mostly like a normal packet. Recent Snort versions, especially since Stream5 was introduced, have a so-called stream API. This enables detection modules to control the reassembly better.

In Snort_inline’s Stream4 I’ve been experimenting with ways to improve stream reassembly in an inline setup. The problem with Snort’s pseudo packet way of operation is that it is after-the-fact scanning, which means that any threat detected in the reassembled stream can’t be dropped anymore. The way I tried to work around this was by constantly scanning a sliding window of reassembled unacked data. It worked quite well, except for its performance, which was quite bad.

I’m thinking about a stream reassembler for the OISF engine that can do both the after-the-fact pseudo packet scanning and the sliding window approach I used in stream4inline. This would be used for the normal TCP signatures. I think it should be possible to determine the minimal size of the reassembled packet based on the signatures per port, possibly more fine-grained. Of course things like target-based reassembly and robust reassembly will be part of it.

In addition to this I’m thinking about a way to make modules act on the stream similarly to how programs deal with sockets: code that only wakes up when a new piece of data on that connection is received, with semantics similar to recv()/read() (see the sketch below). I haven’t really made up my mind about how such an API should work exactly, but I think it would be very useful to detection module writers if they only have to care about handling the next chunk of data.
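
A hypothetical sketch of what such a socket-like stream API could look like; the names and fields are illustrative, not the eventual OISF engine API:

    #include <stdint.h>

    typedef struct StreamMsg_ {
        uint8_t  flags;          /* e.g. start of stream, gap, end of stream */
        uint32_t data_len;       /* number of valid bytes in data[] */
        uint8_t  data[4096];     /* next in-order chunk of reassembled data */
    } StreamMsg;

    /* A detection module would only be woken up when new in-order data is
     * available for the connection, with semantics similar to a blocking
     * recv()/read() on a socket. Returns data_len, 0 on end of stream, or
     * -1 on error. */
    int StreamRecv(void *stream_ctx, StreamMsg *msg);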

I haven’t implemented any of this yet, but I plan to start working on this soon. I’ll start with simple TCP state tracking that I’m planning to build on top of the flow handling already implemented. I’ll blog about this as I go…

Improving Snort_inline’s NFQ performance

When using Snort_inline with NFQ support, it’s likely that at some point you’ve seen messages like these on the console: packet recv contents failure: No buffer space available. When these messages appear, Snort_inline slows down significantly. I’ve been trying to find out why.

There are a number of settings that influence NFQ performance. One of them is the NFQ queue maximum length, a value in packets. Snort_inline takes an argument to modify it: --queue-maxlen 5000.

That’s not enough though. The following settings increase the buffer that NFQ seems to use for its queue. Since I set these this high, I haven’t seen a single read error anymore:

sysctl -w net.core.rmem_default='8388608'
sysctl -w net.core.wmem_default='8388608'

The values are in bytes. The following values increase the buffers for TCP traffic.

sysctl -w net.ipv4.tcp_wmem='1048576 4194304 16777216'
sysctl -w net.ipv4.tcp_rmem='1048576 4194304 16777216'

For more details see this page: http://www-didc.lbl.gov/TCP-tuning/linux.html

Setting these values fixed all my NFQ related slowdowns. The values probably work for ip_queue as well. If you use other values, please put them in a comment below.

Thanks to Dave Remien for helping me track this down!

New Snort_inline TCP window normalization code in SVN

A while ago I wrote about why the TCP window scaling normalization in Snort_inline was broken by design. I also wrote about a new solution I was working on and testing that would be uploaded to SVN soon. I just committed the patch to SVN. What it does is add two new options to stream4:

norm_window: normalize the TCP window (disabled by default). This is to protect Snort_inline from being forced to queue too many packets.
max_win_size: maximum size of the scaled TCP window. Packets increasing the window beyond the limit are modified.
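
For illustration, enabling both on the stream4 preprocessor line could look roughly like this (the 8MB cap is just an example value; check the stream4 documentation for the exact syntax):

preprocessor stream4: norm_window, max_win_size 8388608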

The normalization is disabled by default, and the old wscale normalization code is removed, as are the options that configured it. It runs fine on my gateway, without noticeable slowdowns, but I haven’t done any benchmarking so far. Please try this and let me know how it works for you!

Window scaling normalization in Snort_inline broken by design

After debugging some connection problems I found that the wscale normalization concept is flawed. I’ll describe here what is wrong with it and then move on to suggest a different solution I’m currently testing. The problem I was seeing is that some connections to some webservers stalled for no apparent reason.

First a quick reminder of why I originally came up with the wscale normalization. Stream4 originally doesn’t look at the window scaling value when determining the TCP window. This causes it to be wrong about the TCP window in just about every connection, which is one of the reasons out-of-window packets are not dropped (this is actually a gaping evasion hole since these packets are not used in stream reassembly). This is why I decided to add window scaling support to the stream4inline extension. This works great and allows the admin to drop out-of-window packets. There is a problem associated with it though. The maximum window that is possible with window scaling is 1GB. This would mean that Snort_inline would in the worst case have to queue almost 1GB of data in its buffers for a single stream. To prevent this from being used by an attacker to attack Snort_inline, I wanted to give the admin the option to set a maximum wscale value.

So, why doesn’t replacing the wscale value in packets work? I’ll explain that now. First an example without normalization. Say we have a client connecting to a server. The client sends in its SYN packet a window of 5840 and a wscale of 5. The server replies with a SYN/ACK with window 5792, wscale 9. Both have an unscaled window in their packet, since the wscale isn’t used before both sides have received a packet with the wscale option enabled. The client sends an ACK completing the three-way handshake, with a window of 183. That means a scaled window of 5856 (183 x 2^5). The client will now send an actual data packet, using the same window. The server ACKs the data with a packet with a window of 16, meaning a scaled window of 8192 (16 x 2^9).

Now, what happens when we normalize? Consider the same connection, but now Snort_inline normalizes all wscale values above 2 down to 2. The client sends in its SYN packet a window of 5840 and a wscale of 5, but due to the normalization the server receives it as window 5840, wscale 2. The server replies with a SYN/ACK with window 5792, wscale 9, but the client receives it as window 5792, wscale 2. The problem here is that neither the client nor the server knows that its wscale value has been modified, nor is there a way to make it known. So what then happens is this. When the server wants to say it has a window of 8192, it will send a packet with the window field set to 16 (16 * 2^9 = 8192). But, due to the normalization, it actually says it has a window of 64 (16 * 2^2). Likewise, when the client wants to tell the server it has a window of 5856 (window field set to 183), it actually says it has a window of 732 (183 * 2^2). This completely stalls connections. So why did I only see this on some rare connections? That is because most servers on the internet use low wscale values. The server I ran into issues with, however, used a value of 9.

The solution I am now testing is normalizing the scaled window. With this idea Snort_inline takes the full scaled window into account and compares it with a maximum value. If it exceeds it, the window value in the packet is modified, taking the wscale value into account (a sketch of the idea follows below). I’ve been running like this for about two weeks now, and so far I have not seen any stalled connections. There is however quite a drawback to this approach. The window size is a constantly changing value that is adapted in almost every TCP packet. Unlike the wscale normalization, which could be done by modifying just the SYN and SYN/ACK packets, the new approach in the worst case has to modify and replace almost every single packet in a stream. This will take more resources from Snort_inline.
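
A minimal sketch of the clamping idea, assuming the stream tracker already knows the wscale value each side announced (illustrative names, not the actual stream4 code):

    /* Clamp the advertised (scaled) receive window to max_win_size.
     * Returns 1 if the window field was modified, in which case the TCP
     * checksum must be recalculated before the packet is passed on. */
    #include <stdint.h>
    #include <arpa/inet.h>

    static int NormalizeScaledWindow(uint16_t *win_field, uint8_t wscale,
                                     uint32_t max_win_size) {
        uint32_t scaled = (uint32_t)ntohs(*win_field) << wscale;
        if (scaled <= max_win_size)
            return 0;
        /* largest raw window value that stays within the cap */
        uint32_t raw = max_win_size >> wscale;
        if (raw > UINT16_MAX)
            raw = UINT16_MAX;
        *win_field = htons((uint16_t)raw);
        return 1;
    }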

I’m interested in hearing other possible solutions to this problem, or other drawbacks of my new solution. I will be checking my new solution into SVN soon and will make sure it is disabled by default. To work around the broken wscale normalization, just set it to its maximum value by adding ‘norm_wscale_max 14’ to your stream4 configuration line.

Snort_inline and out of order packets

In Snort_inline’s stream4 modifications, one of the changes is that out-of-order TCP packets are treated differently than in unmodified stream4. This can cause some new alerts to appear and some unexpected behaviour, so I’ll try to explain what happens here.

First of all, let me explain quickly what out-of-order packets are. To put it simply, TCP packets are sent out by the source host in a specific order, but can arrive in a different order at the destination. Packet loss, link saturation and routing issues are among the many things that can cause this. A Snort_inline-specific issue is that when Snort_inline can’t keep up with the packets it needs to process, it will drop packets, which causes packet loss. These packets will then have to be resent by the sending host.

Out-of-order packets become a problem when dealing with stream reassembly. Stream reassembly is basically putting all data from the packets in the right order to get the original data as it was sent. We can’t do stream reassembly if we don’t have all packets. Unmodified stream4 basically ignores gaps in the stream: designed for passive listening to traffic, it has to deal with packet loss differently than Snort_inline.

Next, some definitions of this functionality in Snort_inline.

Out-of-order packets: the number of packets we have in queue that are out of order for a stream, meaning they have a higher sequence number than the next in-sequence packet we are expecting.
Out-of-order bytes: the number of bytes of the combined data of the out-of-order packets in the stream.
Sequence number hole: a gap between two packets that can be closed by one or more missing packets.

To prevent Snort_inline from using too much memory on bad connections, or when an attacker sends lots of out-of-order packets, Snort_inline can enforce limits to protect itself. Snort_inline can even force a stream to be completely in order by dropping all packets that are out of order. Sadly, this has a bad effect on the performance of the connections, so you can set limits that balance performance and protection.

When Snort_inline hits these limits, it will (optionally) fire alerts that look like this:

(spp_stream4) TCP out-of-order packets limit reached for stream
(spp_stream4) TCP out-of-order bytes limit reached for stream
(spp_stream4) TCP sequence number holes limit reached for stream

You can disable the alerts by adding the following option to the preprocessor stream4 line: disable_ooo_alerts. The limits themselves can be adjusted using the following options: max_seq_holes 2, max_ooo_pkts 25, max_ooo_bytes 7000. These are the values I currently use on my home gateway (see the example line below). I got the idea of implementing these limits from this paper by Vern Paxson. However, it seems to me that his suggestion of at most one sequence hole per stream (even per host) was a bit optimistic. Maybe DSL has more packet loss than the university links he studied.
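
As an illustration, the relevant part of a stream4 preprocessor line using these values could look like this (other stream4 options left out; verify the exact syntax against the stream4 documentation):

preprocessor stream4: max_seq_holes 2, max_ooo_pkts 25, max_ooo_bytes 7000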

By default Snort_inline uses settings that were chosen somewhat arbitrarily, so they may not fit your usage. As with the wscaling, please let me know in a comment what values you use!

TCP Window scaling in Snort_inline

The TCP window field in the TCP header is only 16 bits, so the maximum window size it can handle is only 64KB. A long time ago this was enough, but nowadays it isn’t, by far. Luckily, this is something the window scaling option fixes. Window scaling is very common these days; your PC or laptop probably uses it by default. Snort’s stream4, however, does not support it. This means that when tracking and reassembling streams, Snort for most connections has no idea which data is in window and which is out of window. To make matters worse, the packets that are in window when using wscaling, but appear out of window when the wscaling is not accounted for, are never used in the reassembly process. This makes Snort evadable.

One of the goals when creating the stream4inline modifications was to be able to drop on all TCP anomalies stream4 detects. For this, support for window scaling was added to stream4, so Snort_inline would be able to drop out-of-window packets. There is however a big problem with window scaling: the TCP window can increase to a maximum of 1GB (with the maximum wscale value of 14). Stream4 would thus theoretically have to queue up to 1GB of packet data per stream. While this is unlikely to happen during normal connections, it is possible, and it could be used by an attacker to attack Snort_inline itself.

To prevent this, I added an option to stream4inline that allows the administrator to set a maximum allowable wscale setting. Any higher setting is normalized away: the packet is modified and the wscale lowered to the maximum that is allowed. The hosts talking to each other then think the other side accepts only the lower wscale and use that setting. This can, however, have some unexpected consequences. If the link that Snort_inline deals with is high speed, high latency or both, setting the wscale value too low can result in serious performance degradation. This issue shows itself as connections that are (way) slower than usual. In these cases the wscale value needs to be increased.

The default value of Snort_inline 2.6.1.5 is a wscale of 2, which is quite low but works fine on my home DSL connection. To change the setting add ‘norm_wscale_max 5’ to your stream4 configuration line. This will allow for a wscale of up to 5. The maximum value is 14. I’d be interested in what values people use on what types and speeds of lines, so please let me know! We can use it to suggest values in the docs or to set a less insane default value 🙂

Memory leak fixed in stream4inline

A few days ago William told me that if he enabled stream4inline on a busy gateway, Snort_inline would consume all memory within hours. The problem went away when disabling stream4inline, so it made sense that the problem would be in there somewhere.

The first suspect was the reassembly cache. The reassembly cache is used to keep a per-stream copy of the reassembled packet in memory. While memory expensive, it greatly speeds up the sliding window stream reassembly process, especially with small packets. The reason for this being the first and primary suspect is that it is the only place where stream4inline code allocates memory. Reviewing the code, however, showed no leaks, and adding a debug counter to monitor the memory usage also showed that the leak was not in that code.

Next my investigation focused on parts where stream4 behaves differently in stream4inline mode. I initially focused on what happens when stream4 hits its memory limit: the memcap. When the configurable memcap is reached, stream4 nukes 5 random sessions. In stream4inline the option was added to instead truncate 15 of the sessions, where an attempt is made to free memory by removing stored packets no longer needed from a stream. If that fails, 5 random sessions are nuked anyway.

Reviewing the truncating of the sessions didn’t show anything obvious to me, so I went on to the killing of the sessions. Descending into the code I finally reached the DropSession function, where the memory cleanup for a session is handled. Here it turned out that the DeleteSpd function, used to clear the stored packets in a stream, was not called in stream4inline mode. The reason for this mistake is that with Snort 2.6.1 support for UDP was added to stream4, and the merge with the Snort_inline code went wrong because of extra checks added to the DropSession function.

The stupid thing is that when I did the merge, I was already in doubt about it as a comment showed:

/* XXX did I merge this right??? VJ */

Guess I know the answer now: No 😉