Deadpan Tooling

My Design of a Forensic PAN Scanner

By Jordan Hrycaj on Tue, Mar 14, 2017
Category: Application
Tags: bin iin pan pci dss searching programming
Updated: 18 Apr 2017

TL;DR

I have written a new forensic proof-of-concept PAN scanner from scratch, based on some previous experience. My deadpan PAN scanner keeps a blank face in the sense that it reveals no information by itself; a companion report extractor (probably remote, and run much later) will show the results.

I designed some old novelties into the scanner: standard technology, yet probably not found elsewhere in a forensic scanner.

Introduction and Overview

See here for a quick overview of what a PAN scanner does. Having worked in forensic software development for some years, I decided to provide a prototype, a proof of concept of what I understand would be useful in a forensic scanner. This prototype is for PAN scanning only, although many parts apply to general forensic scanners where the PAN finding part is replaced by another scanning automaton.

I have not yet decided where to go from here and how/whether to maintain this tools archive publicly; I am open to business suggestions.

Contents:

  1. Split Tool Set
      - Restrict Access to Confidential Scanning Data
      - Produce a Refined Report at a Later Stage
  2. Build System
      - Universal Availability on Target Systems
      - Chicken or NIM?
  3. Flexible PAN Context
      - Non-digits Preamble and Epilogue
  4. IIN/BIN Lists
      - A Simple Metric for Matching Against IIN/BIN Lists
  5. File Objects and Directory Traversal
      - Tree View
      - Discarding GIF, JPEG, and DLL Objects
      - File Walker Policy
  6. Summary


1. Split Tool Set

The building blocks of the scanner tool set look like:

            deadpan                              dprep
      +------------------------------+     +----------------------------------+
      |  +--------+     +---------+  |     |  +-----------+     +-----------+ |
      |  |  file  | --> |  PAN    |  | --> |  |  report   | --> |  report   | |
      |  | walker |     | finder  |  |     |  | extractor |     | generator | |
      |  +--------+     +---------+  |     |  +-----------+     +-----------+ |
      +------------------------------+     +----------------------------------+

See the unstructured search article for a more general technical analysis. I have grouped the components of deadpan and dprep above in the way they are physically implemented as programmes.

Restrict Access to Confidential Scanning Data

Apart from business considerations (an expiring licence), I was made aware of concerns about possible misuse of such a scanner: it could reveal confidential data, such as credit/debit card numbers (in the case of the PAN scanner) and their locations.

      Ok, then don't produce any scanning data.

Consequently, deadpan is a scanner that outputs data to an encrypted scanning journal. This journal can only be decrypted by another tool, dprep. A technical description of the ECC logic used for public/private key encryption and the stream cipher is available on GitHub. The scanning and reporting process logic looks like:

        scans for                      produces
          PANs                          report
      +-------------+  log pipe     +-----------+
      |  [deadpan]  | ----------->  |  [dprep]  |
      +-------------+               +-----------+
                 |       (or)          ^
                 v                     |
              +--------------------------+
              |       log file           |
              +--------------------------+

where the deadpan programme writes to a file or a pipe, which can then be decrypted and decoded by dprep. That is simple and efficient. The command for running a deadpan log pipe example on a Windows/Linux console reads

      deadpan -o- . | dprep       # "." = scan starts at current directory

and a log file example reads

      deadpan -o journal.txt .    # "." = scan starts at current directory
      dprep journal.txt

(On Linux, local commands should probably be preceded by ./ as in ./deadpan.) In the absence of dprep (the report extractor) the scanning journals remain unreadable. Any possibly confidential data are buried in the encrypted journal.
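
The principle behind this can be pictured as hybrid encryption: the scanner carries only the report extractor's public key and encrypts every journal under a fresh ephemeral key agreement, so that only the holder of the matching private key can read it. The following Python sketch is a toy model of that principle using X25519 and the ChaCha20 stream cipher (with authentication); it is illustrative only and not the deadpan implementation, whose actual ECC construction is described on GitHub as noted above.

      # Toy model of the journal encryption principle -- NOT deadpan's code.
      import os
      from cryptography.hazmat.primitives import hashes, serialization
      from cryptography.hazmat.primitives.asymmetric.x25519 import (
          X25519PrivateKey, X25519PublicKey)
      from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305
      from cryptography.hazmat.primitives.kdf.hkdf import HKDF

      def derive_key(shared: bytes) -> bytes:
          return HKDF(algorithm=hashes.SHA256(), length=32,
                      salt=None, info=b"journal").derive(shared)

      def encrypt_journal(extractor_pub: X25519PublicKey, journal: bytes) -> bytes:
          eph = X25519PrivateKey.generate()        # fresh key pair per journal
          key = derive_key(eph.exchange(extractor_pub))
          nonce = os.urandom(12)
          raw = eph.public_key().public_bytes(
              serialization.Encoding.Raw, serialization.PublicFormat.Raw)
          # journal layout: ephemeral public key | nonce | ciphertext
          return raw + nonce + ChaCha20Poly1305(key).encrypt(nonce, journal, None)

      def decrypt_journal(extractor_priv: X25519PrivateKey, blob: bytes) -> bytes:
          # only the extractor, holding the private key, can recompute the key
          eph_pub = X25519PublicKey.from_public_bytes(blob[:32])
          key = derive_key(extractor_priv.exchange(eph_pub))
          return ChaCha20Poly1305(key).decrypt(blob[32:44], blob[44:], None)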

Produce a Refined Report at a Later Stage

The split tool set allows for a clear post-processing logic. The command chain for accessing the raw deadpan scanning journal reads

      deadpan -o- . | dprep --raw

There are more possibilities. In a chapter below I will present a metric for matching against IIN/BIN lists. This metric could be left weak in the scanner so that it produces many findings (minimising the loss of valid PANs). Later, when the report is extracted, bespoke versions of dprep could pull out the PANs matching the lists of a particular payment card issuer.
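
As an illustration of such a bespoke filter, the following sketch keeps only the findings of one issuer; the journal record layout shown here is hypothetical, not dprep's actual format.

      # Hypothetical report-time filter: keep only one issuer's PANs.
      ISSUER_PREFIXES = ("4567", "455678")      # assumed issuer IIN/BINs

      def issuer_findings(records, prefixes=ISSUER_PREFIXES):
          # records are assumed to be (file path, offset, PAN) triples
          for path, offset, pan in records:
              if pan.startswith(prefixes):      # any matching prefix
                  yield path, offset, pan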

What makes it different is that there is a clear process chain which can be verified individually at each step.

back to overview


2. Build System

Information about the build system used is relevant because it determines what can be promised about the future availability of the tools, regarding:

  • Supported target systems
    • How complicated is it to support legacy systems?
    • Which embedded systems can be supported?
  • Extensibility
    • Is further feature development feasible in the near term?
  • Dependencies
    • How much does it rely on external libraries? In particular, systems developed in C++ tend to build up heavy dependencies. A forensic toolkit with heavy operating system dependencies is hardly in control of its findings.
    • Is bare-bones coding (dependency reduction) possible? For example, under Windows with Visual Studio/C++ it is not obvious how to do that.
  • Size of Binaries
    • Binaries need not be extra small but portable command line tools (as opposed to installed ones) should be well under 2MiB, say.

Here is a summary of what the current deadpan build system produces:

  • C/C99 language based
    • developed in a high level language (different from C, see below)
  • only base libraries needed:
    • Windows: ntdll.dll, kernel32.dll, KERNELBASE.dll, msvcrt.dll
    • Linux: linux-vdso.so, libm.so, librt.so, libdl.so, libc.so,
      ld-linux-*.so, libpthread.so
  • binary sizes:
    • currently ~ 12 MiB
  • binaries currently provided for:
    • Windows, Linux
    • i386 or amd64 architecture (not ARM)

Universal Availability on Target Systems

It is desirable to keep open the option of compiling for smaller embedded systems (not the Raspberry Pi, which is basically a full-blown PC) as well as for legacy SVR4 systems like UHC, UnixWare, or SCO. After experimenting with Rust, Go, C++11, and other popular languages, I went back to C because it is the only one that is universally supported.

Chicken or NIM?

Increasing the productivity of programming in C is accomplished by a development system that produces portable C code. Of the two contenders, Chicken Scheme and the NIM compiler, I chose NIM, which has been influenced by many languages such as Lisp, Go, C++, and C#. Several backend C compilers are officially supported by NIM, which promises quick portability to uncommon systems as long as an Intel or generic Unix C compiler is available (most systems are covered by GCC or CLANG anyway).

back to overview


3. Flexible PAN Context

A PAN, the primary account number of a payment card, is a rather weakly defined construct: mainly a character string of 12 to 19 digits (more details here). The more restrictions apply, the lower the chance of unwanted findings.

Non-digits Preamble and Epilogue

While it makes sense to expect isolated PANs, i.e. PANs neither preceded nor followed by digits, some payment card processors store PANs with the expiry date attached. So there is no general rule, and the PAN scanner must allow the expected PAN context to be configured.

The prototype provided here allows the search context to be adjusted either way: non-digit preamble or epilogue/trailer.
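
A minimal sketch of such a configurable context check follows; it assumes the usual Luhn checksum as an additional plausibility filter, which the deadpan internals may or may not apply in this form.

      # Sketch: PAN candidates with a configurable non-digit context.
      import re

      def luhn_ok(digits: str) -> bool:
          # standard payment card checksum
          total = 0
          for i, ch in enumerate(reversed(digits)):
              d = int(ch)
              if i % 2 == 1:                    # double every second digit
                  d = d * 2 - 9 if d > 4 else d * 2
              total += d
          return total % 10 == 0

      def find_pans(text: str, isolated_tail: bool = True):
          # a candidate is a digit run with a non-digit preamble
          for m in re.finditer(r"(?<!\d)\d{12,}", text):
              run = m.group()
              if isolated_tail:
                  # the whole run must be a PAN, no digits attached
                  if len(run) <= 19 and luhn_ok(run):
                      yield run
              else:
                  # tolerate trailing digits (e.g. an attached expiry
                  # date) and take the longest Luhn-valid prefix
                  for n in range(min(len(run), 19), 11, -1):
                      if luhn_ok(run[:n]):
                          yield run[:n]
                          break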

back to overview


4. IIN/BIN Lists

In order to reduce or control false positives (see discussion here) I definitely expect a PAN scanner to use some sort of embedded IIN/BIN lists.

But a single list is not always sufficient, since some lists come from banks which, for some reason, only support a subset of all payment cards available worldwide. The simple solution is to use several lists with varying coverage. In tests with fraudulent card usage/preset numbers I found that Wikipedia generally has sufficiently broad coverage, as does the Discover Network, which has published its full IIN/BIN ranges for years.

A Simple Metric for Matching Against IIN/BIN Lists

The strategy is to add more specific IIN/BIN lists to the collection of broad-coverage lists and match PANs against them all using a simple metric. A metric makes it possible to control the kind of error one makes regarding false positives:

  • Either prefer not to lose PANs at the price of a huge number of false positives,
  • or reduce the number of false positives at the risk of losing perfectly valid PANs to probably incomplete lists.

The situation resembles type I and type II error control in statistical hypothesis testing. As a metric I use the (obvious choice of the) prefix length needed to match a PAN against some IIN/BIN list entry. For a digit string starting with 30000.., the metric in the current deadpan implementation is 3, as can be tested with

      dprep --lookup 30000

while the metric for the string starting with 31000.. is 0.

By default I require a minimum metric of 2 in the deadpan scanner. This seems a safe bet: I have not seen the first two or three IIN/BIN digits change much over the last ten years, so I expect the second kind of error above to be negligible with this metric. The minimum required metric remains configurable. Try

      dprep --lookup 4
      dprep --lookup 45
      dprep --lookup 4567890

to get a feeling of how the metric works.
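
Read as code, one plausible implementation of the metric is the length of the longest list entry that prefixes the digit string. The list entries below are made up for illustration and are not dprep's embedded data.

      # Sketch of the prefix-length metric; list entries are examples only.
      IIN_LIST = {"300", "36", "4", "45", "51", "6011"}
      MIN_METRIC = 2                   # deadpan's default threshold

      def metric(digits: str, iin_list=IIN_LIST) -> int:
          # length of the longest list entry prefixing the digit string
          return max((len(e) for e in iin_list if digits.startswith(e)),
                     default=0)

      assert metric("30000") == 3      # entry "300" matches
      assert metric("31000") == 0      # no entry matches
      assert metric("4567890") == 2    # "45" wins over "4"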

The value of embedding IIN/BIN lists with different coverage is to have the card details needed from the specific lists (all detailed information at hand) and to fall back to the broader-coverage lists when there are gaps.

If there is need to reduce PAN matches to a particular payment card issuer, this should be filtered at report extraction time when working with the scanner journal to produce reports (see above.)

back to overview


5. File Objects and Directory Traversal

A naive PAN scanner on a Linux (or Windows Cygwin/MSys) system could be written as a shell command that reads

      find . -type f -exec grep -E '[0-9]{12,19}' {} /dev/null \;

(‘grep -R’ would also work instead of find; see here for more examples.) This example demonstrates a general mechanism: find produces file names, which in turn are passed on to grep for finding PANs in the named files.

The hard bit is to implement the role of find properly for the PAN scanner and expose its mechanism so it remains configurable. As with the find command there are many ways to tweak the way a directory traversal tool operates.

I have not seen anything configurable or documented in commercial scanners regarding this logic (they use it, but seem to expect to be trusted to do it right.)

Tree View

While find only works on file system directories, the PAN scanner needs to look into ZIP files, for example. Inside the ZIP file there is again a directory structure with files, and this recurs. Note that DOCX, XLSX, and JAR files are all ZIP files in disguise. So a typical ZIP/DOCX expansion logic looks like

           [file system]      [inside ZIP]     [inside DOCX]

      directory --+---
                  |
                  +--- arch.zip --+---
                  |               |
                  +---            +--- info.docx --+--- [Content_Types].xml
                  |               |                |
                  +---            +---             +--- _rels/.rels
                  :               :                :

Associated with this tree there is a stack of decoder modules (as I call them). Each applicable module interprets a file (e.g. arch.zip) as an archive with its own directory structure.
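
A stripped-down sketch of this recursion follows, with Python's zipfile standing in for a decoder module; the discard set anticipates the next subsection, and the real module stack is of course richer than this.

      # Sketch: treat archives as directories and recurse into them.
      import io, os, zipfile

      DISCARD = {".gif", ".jpeg", ".jpg", ".dll"}    # see next subsection

      def walk(top):
          # yield (virtual path, content) pairs for all leaf objects
          for root, _dirs, files in os.walk(top):
              for name in files:
                  full = os.path.join(root, name)
                  with open(full, "rb") as f:
                      yield from expand(full, f.read())

      def expand(vpath, data):
          if os.path.splitext(vpath)[1].lower() in DISCARD:
              return                        # ignore the object entirely
          if zipfile.is_zipfile(io.BytesIO(data)):
              with zipfile.ZipFile(io.BytesIO(data)) as zf:
                  for info in zf.infolist():     # ZIP, DOCX, XLSX, JAR ...
                      if not info.is_dir():
                          # an archive member may itself be an archive
                          yield from expand(vpath + "//" + info.filename,
                                            zf.read(info))
          else:
              yield vpath, data             # leaf: hand to the PAN finder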

Discarding GIF, JPEG, and DLL Objects

With this stack of decoder modules, rather than listing a file object's directory content, a module can also discard the object (i.e. it is not scanned for PANs at all). This amounts to ignoring the particular file and is used for JPEG, GIF, DLL, etc.

Although this might be fine in most cases, there will be exceptions, so this behaviour is configurable within an open architecture.

File Walker Policy

For the proof-of-concept tool, the way decoder modules interact is called the file walker policy and can be listed with

      deadpan --policy=?

Here is an example of how the policy works.

In the default configuration, ZIP files are tested first and decoded if applicable. If this fails, binaries (e.g. EXE or DLL files) are tried at a later stage and discarded when applicable. Swapping priorities is also possible: then binaries are tested first and probably discarded. This discarding includes self-extracting ZIP files, which are binaries.

The example shows that it makes sense to be able to modify which types of file objects are scanned and how this is prioritised.
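
As a sketch of the idea (not deadpan's actual policy engine), such a policy can be pictured as an ordered chain of modules, each of which either claims a file object or passes it on to the next one:

      # Sketch: the walker policy as an ordered chain of decoder modules.
      import zipfile

      def zip_module(path):
          if zipfile.is_zipfile(path):
              return "descend"          # treat the archive as a directory
          return None                   # not ours, try the next module

      def binary_module(path):
          with open(path, "rb") as f:
              if f.read(2) == b"MZ":    # DOS/PE magic: EXE and DLL files
                  return "discard"
          return None

      DEFAULT_POLICY = [zip_module, binary_module]   # ZIPs claimed first
      SWAPPED_POLICY = [binary_module, zip_module]   # a self-extracting ZIP
                                                     # (an EXE) is now dropped

      def classify(path, policy=DEFAULT_POLICY):
          for module in policy:
              action = module(path)
              if action:
                  return action
          return "scan"                 # default: hand over to the PAN finder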

Here is another example.

In order to leave no forensic trace on a Windows system, one can read the whole disk partition similarly to how a ZIP file is read (given that an NTFS decoder module is available). This leaves the file time stamps untouched, and locked files are no problem either. The catch is that administrator rights are needed to scan a live Windows system that way. So there must be a mechanism to choose between scanning the raw partition and the file system tree (this goes even deeper: on some systems the whole disk with all partitions is open while the C: drive is locked.)

back to overview


6. Summary

The main value of this proof-of-concept study is to explore how current technology can be used to build a modern version of the good old PAN scanner. There are traditional aspects of scanning tool development, which involve file/directory traversal and deep inspection of all kinds of file objects. And some new aspects come up.

The first new aspect is the use of crypto engineering to build a resilient journal/report system that mitigates concerns about unwanted exposure while fitting perfectly into a report post-processing chain.

The second aspect is the use of the C programming language as a compilation target rather than for manual coding. The development system used here is a high-level, statically typed, functional and OO language which supports metaprogramming. This keeps the code base lean and portable to as yet unknown systems.

And finally, making up a metric for matching against a set of IIN/BIN lists is not a big deal, but it comes in handy for describing how the scanner works internally.

Apart from tools whose source code is available, I did not find enough documentation of how commercial tools work inside; rather, a trust-me approach seems prevalent. So I suspect that most of the features I provide are not available there.

back to overview