File extraction in Suricata

Today I pushed out a new feature in Suricata I’m very excited about. It has been long in the making and with over 6000 new lines of code it’s a significant effort. It’s available in the current git master. I’d consider it alpha quality, so handle with care.

So what is this all about? Simply put, we can now extract files from HTTP streams in Suricata. Both uploads and downloads. Fully controlled by the rule language. But that’s not all. I’ve added a touch of magic. By utilizing libmagic (this powers the “file” command), we know the file type of files as well. Lots of interesting stuff that can be done there.

Rule keywords

Four new rule keywords were added: filename, fileext, filemagic and filestore.

Filename and fileext are pretty trivial: match on the full name or file extension of a file.

alert http any any -> any any (filename:"secret.xls";)
alert http any any -> any any (fileext:"pdf";)

More interesting is the filemagic keyword. It runs on the magic output of inspecting (the start of) a file. Examples of this value:

GIF image data, version 89a, 1 x 1
PE32 executable for MS Windows (GUI) Intel 80386 32-bit
HTML document text
Macromedia Flash data (compressed), version 9
MS Windows icon resource - 2 icons, 16x16, 256-colors
PNG image data, 70 x 53, 8-bit/color RGBA, non-interlaced
JPEG image data, JFIF standard 1.01
PDF document, version 1.6

Matching on this with the filemagic keyword is pretty simple:

alert http any any -> any any (filemagic:"PDF document";)
alert http any any -> any any (filemagic:"PDF document, version 1.6";)

Pretty cool, eh? You can match both very specifically and loosely. For example:

alert http any any -> any any (filemagic:"executable for MS Windows";)

Will match on (among others) these types:

PE32 executable for MS Windows (DLL) (GUI) Intel 80386 32-bit
PE32 executable for MS Windows (GUI) Intel 80386 32-bit
PE32+ executable for MS Windows (GUI) Mono/.Net assembly
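
To make the loose versus specific matching concrete, here is a small Python sketch that treats filemagic as a plain substring test over the magic strings listed above. It is an illustration only, not Suricata code: the helper names matches_filemagic and select are made up for this example, and the case-sensitive comparison is an assumption that may differ from what the engine does.

```python
# Illustration only: models filemagic as a plain substring test.
# matches_filemagic and select are hypothetical helpers, not Suricata APIs.
MAGIC_SAMPLES = [
    "GIF image data, version 89a, 1 x 1",
    "PE32 executable for MS Windows (DLL) (GUI) Intel 80386 32-bit",
    "PE32 executable for MS Windows (GUI) Intel 80386 32-bit",
    "PE32+ executable for MS Windows (GUI) Mono/.Net assembly",
    "PDF document, version 1.6",
]


def matches_filemagic(pattern: str, magic: str) -> bool:
    """Return True when pattern occurs anywhere in the magic string.

    Assumption: case-sensitive comparison; the real keyword may differ.
    """
    return pattern in magic


def select(pattern: str, samples=MAGIC_SAMPLES) -> list:
    """Return the sample magic strings that the pattern would match."""
    return [m for m in samples if matches_filemagic(pattern, m)]


if __name__ == "__main__":
    # A loose pattern selects every Windows executable in the sample list.
    for magic in select("executable for MS Windows"):
        print(magic)
```

A short pattern such as "PDF document" matches any PDF version, while "PDF document, version 1.6" only matches that one version.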

Finally there is the filestore keyword. It is the simplest of all: if the rule matches, the files will be written to disk.

Naturally you can combine the file keywords with the regular HTTP keywords, limiting to POSTs for example:

alert http $EXTERNAL_NET any -> $HOME_NET any (msg:"pdf upload claimed, but not pdf"; flow:established,to_server; content:"POST"; http_method; fileext:"pdf"; filemagic:!"PDF document"; filestore; sid:1; rev:1;)

This will alert on and store all files that are uploaded using a POST request that have a filename extension of pdf, but the actual file is not pdf.
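
The file-related part of that rule boils down to two conditions that must both hold. A rough Python sketch of the same decision follows, with a hypothetical helper name (claimed_pdf_but_not_pdf) and two assumptions: that the extension comparison ignores case, and that the negated filemagic is a plain substring test. The flow and http_method parts of the rule are not modelled.

```python
import os


def claimed_pdf_but_not_pdf(filename: str, magic: str) -> bool:
    """Mimic fileext:"pdf" combined with filemagic:!"PDF document".

    Hypothetical helper for illustration; not part of Suricata.
    """
    # Take the text after the last dot; lowercasing is an assumption.
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    # Alert only when the name claims PDF but the magic says otherwise.
    return ext == "pdf" and "PDF document" not in magic


if __name__ == "__main__":
    print(claimed_pdf_but_not_pdf(
        "report.pdf",
        "PE32 executable for MS Windows (GUI) Intel 80386 32-bit"))
    print(claimed_pdf_but_not_pdf("report.pdf", "PDF document, version 1.6"))
```

The first call describes an executable renamed to .pdf, so it would trigger; the second is a genuine PDF and would not.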

Storage

The storage to disk is handled by a new output module called “file”. Its config looks like this:

enabled: yes # set to yes to enable
log-dir: files # directory to store the files
force-magic: no # force logging magic on all stored files

It needs to be enabled for file storing to work.

The files are stored to disk as “file.1”, “file.2”, etc. For each of the files a meta file is created containing the flow information, file name, size, etc. Example:

TIME: 01/27/2010-17:41:11.579196
PCAP PKT NUM: 2847035
SRC IP: 68.142.93.214
DST IP: 10.7.185.57
PROTO: 6
SRC PORT: 80
DST PORT: 56207
FILENAME: /msdownload/update/software/defu/2010/01/mpas-fe_7af9217bac55e4a6f71c989231e424a9e3d9055b.exe
MAGIC: PE32+ executable for MS Windows (GUI) Mono/.Net assembly
STATE: CLOSED
SIZE: 5204
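
Since the meta file is a list of "KEY: value" lines, it is easy to post-process. Below is a minimal Python sketch that turns one into a dictionary. It assumes every line follows the layout of the example above, with the key ending at the first colon; parse_meta is a made-up name, and no claim is made about how the meta files are named on disk.

```python
# Sample taken from the example meta file above.
SAMPLE_META = """\
TIME: 01/27/2010-17:41:11.579196
PCAP PKT NUM: 2847035
SRC IP: 68.142.93.214
DST IP: 10.7.185.57
PROTO: 6
SRC PORT: 80
DST PORT: 56207
FILENAME: /msdownload/update/software/defu/2010/01/mpas-fe_7af9217bac55e4a6f71c989231e424a9e3d9055b.exe
MAGIC: PE32+ executable for MS Windows (GUI) Mono/.Net assembly
STATE: CLOSED
SIZE: 5204
"""


def parse_meta(text: str) -> dict:
    """Parse 'KEY: value' lines into a dict of strings.

    Hypothetical helper; splits each line at the first colon only,
    so colons inside the value (as in TIME) are preserved.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a colon
        fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    meta = parse_meta(SAMPLE_META)
    print(meta["MAGIC"], int(meta["SIZE"]))
```

All values come back as strings, so numeric fields such as SIZE need an explicit int() conversion.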

Configuration

The file extraction is for HTTP only currently, and works on top of our HTTP parser. As the HTTP parser runs on top of the stream reassembly engine, configuration parameters of both these parts of Suricata affect handling of files.

The stream engine option “stream.reassembly.depth” (default 1 MB) controls how deep into a stream we look. Set it to 0 for no limit.
The libhtp options request-body-limit and response-body-limit control how far into an HTTP request or response body we look. Again, set to 0 for no limit. This can be controlled per HTTP server.
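
As a rough mental model of how these settings interact, the tightest non-zero limit wins. The Python sketch below encodes just that idea. It is a deliberate simplification and not how Suricata computes anything internally: the stream depth counts bytes of the whole stream rather than of a single body, so the real cut-off point can come earlier. The function name effective_limit is hypothetical.

```python
from typing import Optional


def effective_limit(stream_depth: int, body_limit: int) -> Optional[int]:
    """Return the smallest non-zero limit in bytes, or None if unlimited.

    Simplified model: a value of 0 means "no limit" for either setting.
    """
    limits = [n for n in (stream_depth, body_limit) if n > 0]
    return min(limits) if limits else None


if __name__ == "__main__":
    one_mb = 1024 * 1024
    # Stream depth at 1 MB, body limit disabled: files are cut at about 1 MB.
    print(effective_limit(one_mb, 0))
    # A small body limit truncates files long before the stream depth does.
    print(effective_limit(one_mb, 4096))
    # Both set to 0: nothing in this model truncates the file.
    print(effective_limit(0, 0))
```

If extracted files come out truncated, one of these limits is the first thing to check.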

Performance

The file handling is fully streaming, so it’s very efficient. Nonetheless there will be an overhead for the extra parsing, bookkeeping, writing to disk, etc. Memory requirements appear to be limited as well. Suricata shouldn’t keep more than a few KB per flow in memory.

Limitations

Lack of limits is a limitation. For file storage no limits have been implemented yet. So it’s easy to clutter your disk up with files. Example: a 118 GB enterprise pcap, storing just JPGs, extracted 400,000 files. Better use a separate partition if you’re on a live link.
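
Until limits exist, it may help to keep an eye on the storage directory yourself. Here is a small Python sketch that counts the stored entries and adds up their size. It assumes only what is described above, namely that stored files have names starting with "file."; anything else sharing that prefix is counted too. The function name store_usage and the example path are assumptions.

```python
import os


def store_usage(log_dir: str) -> tuple:
    """Return (entry_count, total_bytes) for entries named 'file.*'.

    Hypothetical helper; it does not distinguish between stored files
    and any other entries that share the 'file.' prefix.
    """
    count = 0
    total = 0
    with os.scandir(log_dir) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.startswith("file."):
                count += 1
                total += entry.stat().st_size
    return count, total


if __name__ == "__main__":
    # Example path only; adjust to your own log-dir setting.
    path = "/var/log/suricata/files"
    if os.path.isdir(path):
        count, total = store_usage(path)
        print(f"{count} entries, {total} bytes")
```

Run from cron, something like this could warn you before the partition fills up.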

Future work

Apart from stabilizing this code and performance optimizing it, the next step will be SMTP file extraction. Possibly other protocols, although nothing is set in stone there yet.

29 thoughts on “File extraction in Suricata”

  1. Hello,

    enabled: yes # set to yes to enable
    log-dir: files # directory to store the files
    force-magic: no # force logging magic on all stored files

    in suricata.yaml config ????

  2. Hello,

    ¿?

    outputs:
      - console:
          enabled: yes
      - file:
          enabled: yes
          filename: /var/log/suricata/suricata.log
          log-dir: /var/log/suricata
          force-magic: no

    Best Regards,

  3. The “filename” in the “file” section does nothing.

    log-dir is fine, although the files will go into /var/log/suricata directly. If you enter just “files” there, it goes into /var/log/suricata/files/

  4. Thank you. I left the settings in “File” section in this way:

    - file:
        enabled: yes
        log-dir: files
        force-magic: no

    I added the rules:

    - files.rules

    When I run Suricata, it does not give any error. Everything seems fine, but no files are extracted to /var/log/suricata/files

    Best Regards,

  5. 2 things to check:

    First and foremost, the files.rules file contains example rules. They are all disabled by default. Remove the # before them to enable the rules.

    Second, the file extraction is heavily influenced by 3 other settings:
    stream.reassembly.depth (defaults to 1mb, set larger if you want to extract larger files)

    In the libhtp section, the request-body-limit and response-body-limit settings. Both default to just a few kb, set to 0 or a high value.

  6. Hi!
    What about going a step further and using Apache Tika?
    I think that would be great, if it is possible.

  7. Interesting, I wasn’t aware of this project. How would you use it with Suricata?

    If you look in the contrib/ directory in the source you will find a perl framework for processing the files. Maybe you can hook Tika into that.

  8. Pingback: Defend your network from Microsoft Word upload with Suricata and Netfilter » To Linux and beyond !

  9. I am interested in pulling out a soap payload. I am not seeing that feature. Is that in Suricata and I am missing it? Or is it something that has not been done yet?
    thanks
    Chris

      • A rule like "alert http any any -> any any (flow:to_server; filestore; sid:1;)" would store all request bodies. Use "to_client" to store reply bodies.

    • Can you store only Windows PE files from an HTTP stream? What would be the command for that?

      • You could use the filemagic keyword in combination with the filestore keyword. Something like:
        alert http any any -> any any (msg:"FILE magic -- windows"; flow:established,to_client; filemagic:"executable for MS Windows"; filestore; sid:17; rev:1;)
        alert http any any -> any any (msg:"FILE magic -- windows"; flow:established,to_client; filemagic:"PE32"; filestore; sid:18; rev:1;)

  10. Hello,
    Thank you for your great work on suricata.
    I would like to ask a little question.
    I extracted some image files using Suricata with the file extraction sample rules (files.rules).
    But it didn’t extract the whole image. It only extracted a small part, consisting of the first few lines of the image.

    My scenario is as follows:
    – All rules are disabled in suricata.yaml except files.rules, because I only want to see the image file extraction process.
    – The file extraction config is enabled in suricata.yaml, just as you describe in the post above.

    I am wondering what kind of values can be used for the filemagic parameter, or where I can find them. I want to create rules that generate alerts when files with masked content are downloaded.
    Example: alert ip $EXTERNAL_NET any -> $HOME_NET any (fileext:pdf; filemagic:!"PDF document";)
    But of course there are a lot of extensions. I would like to make a proof of concept with the most commonly used file types and compare them to their actual content type.
    The following URL lists these extensions, but is the 2nd column a possible value for the filemagic parameter? (http://fileinfo.com/filetypes/common)

    Thanks in advance!!

    • As there are some differences for various libmagic versions, it would probably be a good idea to enable the ‘magic’ logging in the file log, so you can see the output.

      Looks like ‘file -l’ may also contain the same output for your setup.

      • Thanks for the tip! I will enable this feature and look into the ‘file -l’ command further!

        Thanks again.

  12. Rules for blocking file extension:
    drop ip any any 192.168.3.141 any (msg:"Block user141 "; fileext:"dat"; sid:64615384;rev:1;)

    The above does not work in 2.1.beta4:

    When I change it to drop http it works. However, we need to block files for a specific IP address regardless of protocol, not just HTTP (for example, when the file download link uses SSL).

  13. Hi,

    Does this feature work well on Windows? (I’ve read that it makes use of Unix file magic)…

  14. Hi,
    I would like to check AWS VPC flow logs and S3 access logs using Suricata. My plan is to download both logs to a Linux server and install Suricata on that machine for monitoring. Is there any way to check/monitor those logs using Suricata?

    Thanks in advance

  15. enabled: yes # set to yes to enable
    log-dir: files # directory to store the files
    force-magic: no # force logging magic on all stored files
