Opportunistic Search in Unstructured Data

Ad-Hoc Pattern Search Without Indexing

By Jordan Hrycaj on Wed, Sep 14, 2016
Category: Architecture
Tags: algorithm bin iin pci dss programming searching

Introduction

In Computer Forensics applications it is often beneficial to triage computer systems before full images are acquired for further analysis. For instance, in a PCI DSS environment the task might be to hunt for leaked credit card numbers, where only a few out of a hundred workstations are expected to hold unprotected data. Ideally, there are tools available to scan workstations live, without shutting them down or otherwise disrupting processes.

Forensic images of hard disks and memory are not taken before the triage tools flag a high probability that some of the workstations are of interest.

So, from the triage tool set it is expected:

  1. Report all files that contain certain data objects.
  2. Do not use or modify any mass storage (aka hard disk) resources on the triaged machines.
  3. Run the tool as-is, i.e. as a programme that needs no installation.

I already wrote about developing such a tool set in another post, where I was more interested in general aspects of development. Here I will discuss some of the algorithmic aspects of how to build efficient tools that

  4. Find data objects in documents, email storage, and compressed data containers.
  5. Either the data objects can be debit/credit card numbers,
  6. or they can be Boolean combinations of – possibly restricted – regular expressions.

Talking about item 4 in depth would be an article of its own and is not covered here.

A Mock Solution

Starting with a naive approach one could try on the Linux/Unix command line

 cat /usr/lib/*.so |                # search in shared objects
    strings |                       # extract strings
    grep require | grep password    # lines with: require AND password

which searches shared objects (DLLs in Windows-speak) for lines containing the words require and password. Apart from being a bit contrived, the result of such a construct is not very useful: it only shows whether at least one object file contains such a line, but not the name of the corresponding file. This is not quite what requirement item 1 asks for. A cheap fix for this command line programme could be

 for file in /usr/lib/*.so
 do
     echo $file                      # print the current file name
     cat $file |                     # search in a single shared object
        strings |                    # extract strings
        grep require | grep password # lines with: require AND password
 done

This shell scripting is neither meant to be efficient nor clever in any way. It illustrates what the building blocks of an ad-hoc scanner are and how they are combined in a pipeline.

Logical Diagram For A Scanner Solution

The building blocks of the scanning tool would roughly look like

         [A]               [B]                [C]             [D]
        +-----------+     +------------+     +---------+     +--------+
        | input     | --> | serialise  | --> | object  | --> | report |
        | container |     | into files |     | scanner |     +--------+
        +-----------+     +------------+     +---------+       ^
            ^                  |    |                          |
            | decode directory |    +--------------------------+
            +------------------+             metadata

where the arrows indicate data flow. Blocks [A] and [B] are invoked recursively and kept on a decoder stack (think of zipped ZIP-files). As mentioned earlier in the comment on item 4, how such a recursive serialisation works will not be discussed here. For the purpose of this article let it suffice to assume that the input for the object scanner – block [C] – is just a sequence of binary data streams from files, one file after the other.

There is no assumption about the data encoding of a file in the data stream. It is treated as opaque, unstructured, binary data. This works quite well when the data object searched for can be expected to be embedded unmodified, as often happens with (Win-)Word/97 type files (OLE).

The object scanner – block [C] – works similarly to the group of pipelined grep commands in the mock example. Looking at requirements 5 and 6, there are two kinds of scanners,

  • an ASCII digits string scanner for payment cards (see item 5) and
  • a general Boolean text expression scanner (see item 6)

which will both be discussed below.

A Payment Card Object Scanner

A payment card number is a sequence of 12 to 19 digits. Some of these numbers can be validated by the weak Luhn verification algorithm. Nowadays, most of these numbers consist of 16 digits. Without any additional information, a sequence of 20 zeros has all characteristics of overlapping card numbers:

     00000000000000000000              -- 20 digits
     0000000000000000                  -- 16 digits satisfying Luhn
      0000000000000000                 -- overlapping
       0000000000000000                -- overlapping
     ...

This illustrates the weak nature of checking that a number really is a payment card number. There is no general inner structure of these numbers that a scanner could exploit in order to eliminate unwanted positive findings.
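
For illustration, here is a minimal sketch of the Luhn check (Python, used throughout this post only for brevity; the actual scanner is not written in Python). The test numbers are publicly known examples:

    def luhn_ok(digits: str) -> bool:
        """Luhn check: double every second digit from the right, subtract
        9 from doubled values above 9; the sum must be divisible by 10."""
        total = 0
        for i, ch in enumerate(reversed(digits)):
            d = int(ch)
            if i % 2 == 1:          # every second digit from the right
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0

    assert luhn_ok("0" * 16)                # 16 zeros pass, as noted above
    assert luhn_ok("4111111111111111")      # a well-known test card number
    assert not luhn_ok("4111111111111112")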

Card scanners typically search for ASCII/UTF16 digits.

The IIN List

The only possibility to eliminate unwanted digit sequences that otherwise have all the characteristics of a valid card number is to maintain a database of valid numbers. While the leading 6 digits – called IIN or BIN – identify the issuing institute, some issuers publish details of how the numbers following the IIN are assigned. This can help to build a database with prefixes of up to 9 digits (out of mostly 16). Unfortunately, the prefixes are dense and, in some regions, randomly distributed, so expect the database to become huge (~500k of optimised ASCII in my case).

I call this database the IIN list although – strictly speaking – the IIN is made up of only the first six digits of a card number. In my case, the IIN list also holds some extra information, namely the valid lengths of the digit strings (mostly 16) and which validation algorithm, if any, applies (e.g. Luhn).

With this IIN list in place, every sequence of digits can be checked for whether it is a valid payment card number or not.
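
To make the idea concrete, here is a toy sketch of such a lookup, assuming a plain in-memory prefix map. The entries and the helper name iin_lookup are made up for illustration and bear no relation to real issuer data:

    # Hypothetical IIN list entries: prefix -> (valid lengths, check algorithm).
    # The real list holds prefixes of up to 9 digits; these values are made up.
    IIN_LIST = {
        "411111":    ({16}, "luhn"),
        "510000123": ({16, 19}, "luhn"),
        "360000":    ({14}, "luhn"),
    }
    MAX_PREFIX = 9

    def iin_lookup(digits: str):
        """Return the longest matching prefix entry, or None."""
        for n in range(min(MAX_PREFIX, len(digits)), 0, -1):
            entry = IIN_LIST.get(digits[:n])
            if entry and len(digits) in entry[0]:
                return digits[:n], entry[1]
        return None

    print(iin_lookup("4111111111111111"))   # ('411111', 'luhn')
    print(iin_lookup("9999999999999999"))   # None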

The Digits Parser State Machine

While the IIN list solves the general problem of deciding whether a sequence of digits is a valid card number, it would be very inefficient for the object scanner (block [C] in the diagram) to query this IIN list for every character/byte sequence from the input stream.

It is legitimate to assume that card numbers in the data object are sparsely distributed. So the aim is to discard, as early as possible, the possibility that there is a card number at all. A simple finite state automaton solving this problem could look like

              +------------+       +------------+       +------------+       +----------+
 +-------+    | 1st input  | digit | 2nd input  | digit | 3rd input  | digit | check    |
 | start | -> | character? | ----> | character? | ----> | character? | ----> | IIN list |
 +-------+    +------------+       +------------+       +------------+       +----------+
    ^              |                    |                      |
    |              |                    |                      |
    +-- no digit --+--------------------+----------------------+

When the state check IIN list is reached, the remainder of the at most 19 characters that make up a card number is buffered and the IIN list is queried for a match.

There are implementation variants for the state machine:

  • UTF16 little endian has an ASCII NUL character following each digit.
  • Magnetic stripes have card numbers separated by characters % ^ ; or =.
  • A context sensitive approach may require that each card number is preceded and followed by non-digit characters.

These variants are easily implemented with a state machine that is compiled at run-time rather than a static structure.
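
As a toy illustration, here is a table-driven prefilter whose transition table is built at run time; only plain ASCII digits are handled, and the variants above would be expressed as additional states and transitions. Names and the exact reporting behaviour are mine, not the original tool's:

    DIGITS = frozenset(b"0123456789")

    def build_table(depth: int = 3):
        """Build a (state, byte) -> state transition table at run time.
        State k means 'k consecutive digits seen'; the last state is
        sticky while digits keep coming.  Variants (UTF16le NUL padding,
        magnetic stripe separators) would add states and transitions
        here without touching the scanning loop."""
        table = [[0] * 256 for _ in range(depth + 1)]
        for state in range(depth + 1):
            for byte in DIGITS:
                table[state][byte] = min(state + 1, depth)
        return table

    def prefilter(data: bytes, depth: int = 3):
        """Yield start offsets of runs of at least `depth` ASCII digits."""
        table = build_table(depth)
        state = 0
        for pos, byte in enumerate(data):
            new = table[state][byte]
            if new == depth and state == depth - 1:
                # candidate found: the real scanner now buffers up to 19
                # digits and queries the IIN list
                yield pos - depth + 1
            state = new

    print(list(prefilter(b"xx12x4111111111111111yy")))   # [5]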

Scanner State Machine

I ran heuristic tests on the depth of the state machine to see whether it is worth considering an extra state for testing a fourth digit. It turned out that after reading the third digit, the chance of not having another digit follow was negligible. Equally, testing with fewer than three digit verification states slowed the system down somewhat.

Both observations seem plausible because the state machine quickly discards stray digits in binary code. For evenly distributed characters, a single digit (out of 256 choices) appears with probability ~4%, a quadruple appears with probability ~0.0002%. So sequences of more than three digits are not expected to be random – and I am looking for non-random digit sequences.

Character Fetch Implementation

What really boosted the throughput of reading characters and executing the state machine was grouping about 20 read/process statements inside a loop, instead of looping after each single read/process statement. This is a well-known speed-up technique (loop unrolling), applied here after profiling.

Assembling Components

Combining the digits parser state machine and the IIN list gave a powerful scanner engine. I collected and sorted the scanner results in an internal database for later perusal.

Originally I used hash tables. Although extremely fast when there were no more than a few thousand scanner results, the scanner programme became unusable with detailed reports for data sets with millions of findings; in that case the hash tables degenerated into unsorted linked lists. After switching to binary (red-black) trees, I could manage “quick” reports for up to 4 million findings (the limit mainly due to Windows memory restrictions).

A General Boolean Objects Scanner

Having learned from the payment card scanner, I envisaged a more general digital objects scanner, searching for Boolean AND/OR combinations – possibly bracketed – of texts or character strings. Any such expression can be normalised to something like

      (text11 AND text12 AND ...)
   OR (text21 AND text22 AND ...)
   OR (...)
      ...

Note that I did not implement NOT as it was not required at the time. A search is considered successful if one of the lines above is satisfied completely, i.e. all the text pieces text#1, text#2, … are found for a particular line number #.
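
A minimal sketch of that acceptance test, assuming the normalised expression is represented as a list of AND-tuples and that the scanner has already collected the texts found in the object (all names illustrative):

    # Normalised search expression: outer list = OR, inner tuples = AND,
    # e.g. (require AND password) OR (secret).  Purely illustrative.
    EXPRESSION = [
        (b"require", b"password"),
        (b"secret",),
    ]

    def matches(found: set, expression) -> bool:
        """Accept the object if all texts of at least one AND line were found."""
        return any(all(text in found for text in line) for line in expression)

    # `found` would be filled by the binary text scanner while reading the object
    print(matches({b"require", b"password"}, EXPRESSION))   # True
    print(matches({b"password"}, EXPRESSION))                # False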

Some regular expressions instead of simple texts were allowed if they could reasonably be expanded into all implied text combinations, e.g.

   /[Tt]ext/     => "Text" OR "text"
   /(text){1,3}/ => "text" OR "texttext" OR "texttexttext"
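
As an illustration, here is a toy expansion covering only the simplest case, character classes; bounded repetition as in the second example above is not handled by this sketch:

    from itertools import product

    def expand_classes(pattern: str):
        """Expand a pattern of literals and [..] character classes into
        all literal alternatives, e.g. '[Tt]ext' -> 'Text', 'text'."""
        parts, i = [], 0
        while i < len(pattern):
            if pattern[i] == "[":
                j = pattern.index("]", i)
                parts.append(list(pattern[i + 1:j]))
                i = j + 1
            else:
                parts.append([pattern[i]])
                i += 1
        return ["".join(p) for p in product(*parts)]

    print(expand_classes("[Tt]ext"))    # ['Text', 'text']
    print(expand_classes("gr[ae]y"))    # ['gray', 'grey']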

I was mainly working with UTF8 character sets, so I restricted regular expressions by banning the unspecified wildcard characters . and * as well as the negated set [^…]. A similar expansion approach was employed to support different international character sets such as UTF16le or ISO8859-1, by re-writing an input text into all possible text encodings to be searched for.
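
The encoding expansion can be sketched as follows, assuming the small set of target encodings mentioned above (the helper name is hypothetical):

    def encoding_variants(text, encodings=("utf-8", "utf-16-le", "iso8859-1")):
        """Rewrite a search text into all byte encodings to be searched for;
        encodings that cannot represent the text are skipped."""
        variants = set()
        for enc in encodings:
            try:
                variants.add(text.encode(enc))
            except UnicodeEncodeError:
                pass
        return variants

    for b in sorted(encoding_variants("password")):
        print(b)
    # b'p\x00a\x00s\x00s\x00w\x00o\x00r\x00d\x00'   (UTF16le)
    # b'password'                                    (UTF8 == ISO8859-1 here)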

The more general scanner challenge was thus reduced to the task of efficiently scanning for a huge set of binary texts or character strings.

Opportunistic Binary Text Scanner

As before, I assumed that the binary texts to scan for were sparsely distributed. Unfortunately, there was no general structure to exploit as in the case of ASCII digits (10 out of 256 byte values). So the best I could do was to compile the set of first characters of all possible text pieces and test whether the current scanner input stream could be the start of such a text piece. Still, it worked nicely in many cases.

The main algorithmic efficiency gain came from a table of the text pieces that can be looked up partially and queried sequentially, character by character. This table accepts a sequence of binary characters and returns the

  • number of prefixed entries, i.e. entries that start with the sequence so far
  • True/False according to whether an entry fully matches the sequence so far

Starting a new instance at each input character, the scanner could search in parallel (dropping instances as soon as no prefixed entries were left).
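
A minimal sketch of such a table and of the parallel scanner instances, assuming a simple byte-wise trie; class and function names are mine, not the original implementation. The count field mirrors the "number of prefixed entries" above, although this sketch only tests whether any prefixed entry is left:

    class Node:
        __slots__ = ("children", "count", "terminal")
        def __init__(self):
            self.children = {}     # byte value -> Node
            self.count = 0         # number of entries prefixed by this node
            self.terminal = False  # an entry ends exactly here

    def build_text_table(texts):
        root = Node()
        for text in texts:
            node = root
            for byte in text:
                node = node.children.setdefault(byte, Node())
                node.count += 1
            node.terminal = True
        return root

    def scan(data: bytes, root):
        """One table instance per input offset, run in parallel; report
        (offset, length) of every fully matched search text."""
        instances = []                          # (start offset, current node)
        for pos, byte in enumerate(data):
            instances.append((pos, root))
            alive = []
            for start, node in instances:
                child = node.children.get(byte)
                if child is None:               # no prefixed entries left: drop
                    continue
                if child.terminal:
                    yield start, pos - start + 1
                alive.append((start, child))
            instances = alive

    table = build_text_table([b"require", b"password", b"passwd"])
    print(list(scan(b"\x00\x7fpassword required\x00", table)))  # [(2, 8), (11, 7)]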

Reducing Long Text Entries

In a real search scenario there are often hundreds of binary texts to search for, due to the expansion of regular expressions or target encodings. As explained earlier, target texts are considered sparse. It is often enough to search for a matching prefix and then decide all at once whether the rest (or tail) matches as well.

Large tables with binary/byte search texts can be reduced by using prefixes only – similar to the IIN list case. All possible tails are found in a bucket associated with a prefix from the table. Using a data structure other than plain byte strings allows for more convenient character representations when encodings differ in length (as in HTML encoding, e.g. ' as &#x27; or &#39;).

Case Insensitive Texts

Many search applications I came across required ASCII case insensitive search. This would have resulted in an expansion of the search text pieces into all possible upper and lower case variants.

To avoid this all too common expansion, I added case insensitive characters as an extra character type in the search texts table. The only difference between binary/byte characters and case insensitive characters is that the latter are converted to lower case before being compared for equality.
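
A minimal sketch of that comparison rule (ASCII only, names illustrative):

    EXACT, NOCASE = 0, 1     # the two character types in the search text table

    def char_matches(kind, wanted: int, got: int) -> bool:
        if kind == EXACT:
            return wanted == got
        # case insensitive: lower-case both byte values before comparing
        return bytes([wanted]).lower() == bytes([got]).lower()

    print(char_matches(EXACT,  ord("P"), ord("p")))    # False
    print(char_matches(NOCASE, ord("P"), ord("p")))    # True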

Comparison Of Solutions

In the case of the payment card scanner the throughput is mainly limited by the data reading system. This statement holds easily under the premise that credit card numbers are sparsely distributed. There is hardly any CPU load involved.

In a particular real scenario around 2014, when I measured the throughput on a Windows 2003 server scanning several terabytes of Microsoft Exchange (ESE formatted) databases for payment card numbers, I found that the maximum data input rate was already exhausted by approximately 2½ scanner instances running in parallel. There might still be improvements possible for blocks [A] and [B] of the logical diagram, but that is not the topic here.

In the second case, with Boolean text expressions, the throughput felt considerably slower, maybe ⅔ or ½ of the payment card scanner's throughput (sorry, guessing only).

The main application for the Boolean text expression scanner tool was scanning Microsoft Outlook databases, where an email or attachment was found and reported when the text pieces in the email object covered at least one row of the normalised Boolean search expression. In many cases the search expression was a list of single-word alternatives, and searching the current entity stopped immediately once one of the words was found.

So often only a part of an object was really scanned, whereas the payment card scanner had to report all payment card numbers and therefore always scanned the whole file. I never made hard throughput comparisons between the different search algorithms.