Reference

Pattern Matching

There are occasions in which we use a parser against a lots of string, parsing even invalid data, with the goal to filter out and keep only a few strings that are valid and of interest for us.

But instead of unpacking everything like a brute-force attack, we can use a pattern matching technique to improve the search and filtering.

Imagine the following IP packet (simplified):

>>> from bisturi.packet import Packet
>>> from bisturi.field  import Int, Data, Bits, Bkpt

>>> class IP(Packet):
...    version = Bits(4)
...    header_length = Bits(4)
...
...    type_of_service = Int(1)
...    total_length    = Int(2)
...    identification  = Int(2)
...
...    fragment_offset = Bits(13)
...    flags = Bits(3)
...
...    others = Data(4)   # simplification of some fields
...
...    source_address      = Data(4)
...    destination_address = Data(4)
...
...    options = Data(header_length * 4 - 20)   # simplified
...    data    = Data(total_length - header_length * 4)

Lets create a pattern that can match any string that it is similar to an IP packet.

This doesn’t mean that the string that matches is an IP packet, it is just a quick check.

We can create this generic pattern with anything_like function:

>>> from bisturi.pattern_matching import anything_like, Any

>>> ip = anything_like(IP)

>>> isinstance(ip, Packet)
True

This will return a normal IP packet, but with a difference: all of its fields have Any as their value.

The Any value is a placeholder for the field and cannot be pack to a string.

For debugging we can ask what is the pattern of this incomplete packet

>>> bytes(ip.as_regular_expression().pattern)
b'(?s).{1}.{1}.{2}.{2}.{1}.{1}.{4}.{4}.{4}.*.*'

This pattern will match almost anything. Only strings that don’t match the expected length will not match the pattern.

So far nothing impressive but here is the twist.

We can replace the Any values with a specific and concrete value and that will be part of the pattern to search for.

For example if we want to search all the IP packets with a fixed destination_address we do the following:

>>> ip.destination_address = b"\xff\xff\xff\xff" # broadcast address
>>> ip.as_regular_expression().pattern
b'(?s).{1}.{1}.{2}.{2}.{1}.{1}.{4}.{4}\xff\xff\xff\xff.*.*'

If we want to look for any destination address we set the Any object again:

>>> ip.destination_address = Any()
>>> ip.as_regular_expression().pattern
b'(?s).{1}.{1}.{2}.{2}.{1}.{1}.{4}.{4}.{4}.*.*'

Bits pattern

The Bits field can also be set but its regexp is more complex.

If the lower bits of a byte can be any and the higher are fixed, then the regexp will be a range:

  fragment_offset flags
     0000000000111 xxx
     |||||||| \\\\\ \\\
     ||||||||  \\\\\ \\\
     ||||||||   \\\\\ \\\
     00000000    00111 xxx

        0         56-63
      \x00         8-?
>>> ip.fragment_offset = 7
>>> ip.as_regular_expression().pattern
b'(?s).{1}.{1}.{2}.{2}\x00[8-\\?].{4}.{4}.{4}.*.*'

But if the lower bits are fixed, then we need to try every single bit pattern:

  fragment_offset flags
     xxxxxxxxxxxxx 011
     |||||||| \\\\\ \\\
     ||||||||  \\\\\ \\\
     ||||||||   \\\\\ \\\
     xxxxxxxx    xxxxx 011

       0-256     00000 011 | 00001 011 | 00010 011 | ... | 11111 011
        .                    32 possibilities
>>> ip.fragment_offset = Any()
>>> ip.flags = 3
>>> ip.as_regular_expression().pattern          # byexample: +geometry=10x300
b'(?s).{1}.{1}.{2}.{2}.{1}[\x03\\\x0b\x13\x1b\\#\\+3;CKS\\[cks\\{\x83\x8b\x93\x9b\xa3\xab\xb3\xbb\xc3\xcb\xd3\xdb\xe3\xeb\xf3\xfb].{4}.{4}.{4}.*.*'

Complex, eh?

This is a more ease case:

  fragment_offset flags
     0000000000111 011
     |||||||| \\\\\ \\\
     ||||||||  \\\\\ \\\
     ||||||||   \\\\\ \\\
     00000000    00111 011

         0          59
       \x00          ;
>>> ip.fragment_offset = 7
>>> ip.flags = 3
>>> ip.as_regular_expression().pattern
b'(?s).{1}.{1}.{2}.{2}\x00;.{4}.{4}.{4}.*.*'

Dependency

Setting a value not only define a fixed string to match for that field but also can change the pattern for others.

For example, the options field depends on header_length.

If we set the header length to 5 (5 * 4 = 20 bytes) then the options field should have 0 bytes.

This dependency is tracked automatically.

>>> ip.fragment_offset = ip.flags = Any()
>>> ip.version = 4
>>> ip.header_length = 5
>>> ip.as_regular_expression().pattern
b'(?s)E.{1}.{2}.{2}.{1}.{1}.{4}.{4}.{4}.{0}.*'

Notice how the data field depends of both, the header length and the total length and because the last field can be anything, the data field still has a generic pattern.

Setting the total length to 30 bytes then the data must have room for only 10 bytes.

>>> ip.total_length = 30
>>> ip.as_regular_expression().pattern
b'(?s)E.{1}\x00\x1e.{2}.{1}.{1}.{4}.{4}.{4}.{0}.{10}'

Now it’s time to do something more useful.

Filtering

In the following data file there are 1210 IP packets, all of them of 84 bytes.

>>> packet_size  = 84
>>> packet_count = 1210
>>> data = open("pingpattern.data", 'rb').read()
>>> raw_packets = [data[i*packet_size: (i+1)*packet_size] for i in range(packet_count)]
>>> del data

Now we can build a very simple pattern that will match every packet

>>> ip = anything_like(IP)
>>> ip.version = 4
>>> ip.header_length = 5
>>> ip.total_length = 84

With our packet we can filter the raw packets loaded before.

The filter uses the pattern to discard string that are not compatible with the IP structure without parsing the string.

>>> from bisturi.pattern_matching import filter_like
>>> len(list(filter_like(ip, raw_packets))) == packet_count
True

This is by two orders faster than parsing each string:

If we want to find a particular packet, then the filter is much faster:

>>> ip.identification = 0x2fbb
>>> len(list(filter_like(ip, raw_packets))) == 1
True

filter_like is a generator of strings that match the pattern and are possible IPs packets.

If we want to be sure we need to unpack the string and do a manual check comparing the unpacked packet with our target.

The function filter will do that for us and it will return the packets found (IP objects not strings):

>>> import bisturi.pattern_matching as pattern_matching
>>> found = list(pattern_matching.filter(ip, raw_packets))

>>> len(found) == 1
True

>>> isinstance(found[0], IP) and found[0].identification == 0x2fbb
True

Substring matching

The Any object can accept certain conditions, strings that should be at the begin (startswith) or the end of the field (endswith) or even in the middle (contains).

This allow us to find packets that contains a particular string in the field.

This feature is only supported by Data and Ref fields.

For example, to search all the packets that contain the famous two byte sequence 0xde 0xad in their data we do:

>>> ip.total_length = Any()
>>> ip.identification = Any()
>>> ip.data = Any(contains=b"\xde\xad")
>>> ip.as_regular_expression().pattern
b'(?s)E.{1}.{2}.{2}.{1}.{1}.{4}.{4}.{4}.{0}.*\xde\xad.*'

>>> found = list(pattern_matching.filter(ip, raw_packets))
>>> len(found)
100