Challenging PAN Scanners

I Have My Thoughts - Still They Are Useful

By Jordan Hrycaj on Mon, Mar 6, 2017
Category: Application,
Tags: bin iin pan pci dss programming searching
Updated: 29 Mar 2017

TL;DR

This article will not evaluate currently available PAN scanners in detail. Rather, it provides some tools – mostly thought tools – for evaluating quality aspects of your favourite PAN scanner tool (see summary).

Introduction

The acronym PAN (primary account number) is used here as a general term comprising credit and debit card numbers. Having worked on PAN scanner software and similar tools for the best part of the last ten years, I have finally come back to programming this sort of tool set again (among others, I wrote 7seec for 7Safe/PA). I wrote about earlier work here and here.

PAN scanning is part of the certification for PCI DSS compliance.

A simplistic technical summary of PAN scanning reads like:

  • Find digit strings of 12 to 19 ASCII characters in all sorts of files and digital data objects (a minimal sketch follows this list).
  • Print a pretty report.
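
As a minimal sketch of that first step, candidate digit strings can be found with a regular expression. This is illustration only, not any particular scanner’s code, and the file name scan_me.bin is just a placeholder:

import re

# Candidate PANs: runs of 12 to 19 ASCII digits. The look-around
# assertions prevent a longer digit run from yielding sub-matches.
PAN_CANDIDATE = re.compile(rb'(?<!\d)\d{12,19}(?!\d)')

def scan(path):
    with open(path, 'rb') as f:
        data = f.read()
    for m in PAN_CANDIDATE.finditer(data):
        print(path, m.start(), m.group().decode('ascii'))

scan('scan_me.bin')   # placeholder file name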

There are variations of this basic functionality:

  • Digit string verification. Most PANs have a check digit (Luhn algorithm), but it is a very weak filter: the string of 16 zeros is fully validated by this algorithm (a sketch follows this list).
  • Character sets other than ASCII might apply, e.g. UTF-16 or EBCDIC.
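
To see how weak the Luhn filter is, here is a minimal sketch of the standard check (again illustration only):

def luhn_ok(digits: str) -> bool:
    # Double every second digit from the right, subtract 9 where the
    # result exceeds 9, and require the total to be divisible by 10.
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0

print(luhn_ok('0000000000000000'))   # True -- 16 zeros pass
print(luhn_ok('4111111111111112'))   # False -- last digit off by one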

Overview

Contrary to the situation with virus scanners, the market for PAN scanners is narrow and dominated by a few commercial suppliers. There is virtually no open source scene (but I might be proven wrong). I will highlight some technical aspects of PCI DSS scanning. Nevertheless, I view technical superiority (or inferiority) as only one of several aspects of problem-solving competence, i.e. producing a useful scanning report one can work with.

PAN scanning faces several hard challenges apart from simply finding something (i.e. PANs), which can be hard enough already:

  1. There are all sorts of digit strings in all kinds of data objects.
  2. There are lots of false positives.
  3. There are lots of unwanted findings.
  4. There are expectations that software will do the right thing, which in turn puts pressure on developers to do the impossible.
  5. Commercial aspects and how suppliers meet their sales targets.
  6. Summary

1. All sorts of digit strings

This first list item is less a challenge in its own right than a statement in support of what follows. Digit strings are used in all kinds of applications, not only those intended to hold payment data. This fact will come up repeatedly.

back to overview

2. False Positives

Some scanners report almost anything that has 16 digits and passes the Luhn check algorithm (16 digits has become the prevalent length for PANs). This leads to reporting digit strings which are not PANs.

Reporting false positives is avoidable in most cases. There are publicly and commercially available IIN/BIN databases for the first six digits of a PAN. Suppliers may argue that these databases or lists are never 100% accurate for commercial grade applications and therefore not useful – would the scanner miss valid PANs? This argument is futile. In particular, the first two or three IIN/BIN digits rarely change, and a scanning tool has to be updated regularly anyway – and so does its database. Some card vendors have published their IIN/BIN prefixes for years already, detailed enough to be used in PAN scanners.
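
A prefix check against such a list is cheap. Here is a minimal sketch; the prefix table below is a made-up excerpt for illustration, not a real IIN/BIN database:

# Hypothetical IIN/BIN excerpt: prefix -> plausible PAN lengths.
# A real database maps six-digit prefixes (or ranges) to card schemes.
IIN_TABLE = {
    '4':  {13, 16, 19},   # Visa-style prefix
    '51': {16},           # Mastercard-style prefix
    '34': {15},           # Amex-style prefix
}

def iin_ok(pan: str) -> bool:
    # Longest matching prefix wins; unknown prefixes are rejected.
    for n in range(len(pan), 0, -1):
        lengths = IIN_TABLE.get(pan[:n])
        if lengths is not None:
            return len(pan) in lengths
    return False

print(iin_ok('0000000000000000'))   # False -- Luhn-valid, no known IIN
print(iin_ok('4111111111111111'))   # True  -- matches the '4' entry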

Think of a virus scanner supplier issuing this kind of statement. They would have a hard time being taken seriously afterwards. It is the equivalent of flagging a vast number of legitimate, clean binaries as virus infected because the virus database was deliberately left weak on the grounds that it can never be 100% accurate.

For my research I have used a little PAN generator tool which allows me to generate fake PANs in order to test scanner accuracy. There is also a range of fake PAN generators available on the internet.
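
The idea behind such a generator is simple: pick a prefix no card scheme uses, pad with random digits, and append a matching Luhn check digit so that naive scanners accept the number. A minimal sketch (the prefix 9999 is assumed unassigned here for the sake of illustration):

import random

def fake_pan(prefix: str = '9999', length: int = 16) -> str:
    # Random body, leaving room for the trailing Luhn check digit.
    body = prefix + ''.join(random.choice('0123456789')
                            for _ in range(length - len(prefix) - 1))
    # Luhn sum over body plus a zero placeholder for the check digit.
    total = 0
    for i, ch in enumerate(reversed(body + '0')):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return body + str((10 - total % 10) % 10)

print(fake_pan())   # Luhn-valid, but with an IIN no card scheme uses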

It costs time and money to verify PAN findings after scanning. This is particularly annoying if it could have been avoided in the first place. So it makes sense to qualify the PAN scanner’s accuracy before field use.

back to overview

3. Unwanted Findings

Suppliers and sales people – no bashing here – tend to conflate unwanted findings with the previously used term false positive. I reject this because a correctly identified PAN cannot be false – otherwise it would not be a PAN (a simple truism). But PANs appear virtually everywhere in digital data. I suppose the reason for not explaining what unwanted findings are is that it is believed to make an easier sell for a PAN scanning product. Also, talking only about false positives sounds much more authoritative.

In order to mitigate this challenge of unwanted findings, scanners look at the context. For example, a PAN appearing in a web server’s HTML or XML/XHTML pages might be absolutely legitimate (because the number is not used as a PAN) whereas a PAN in a DOCX or ODT file is probably not (it is likely stored there to remember a PAN). The PAN scanner is expected to suppress the first finding and to report the second. This sounds easy, but consider the following:

  • Compress the web server’s XML files into a ZIP file, call it server.zip.
  • Rename the DOCX or ODT file to document.zip.

Where did we end up? Both files, server.zip and document.zip, are now compressed ZIP files (indeed, DOCX, XLSX, etc. are all ZIP archives with the extension changed to docx, xlsx, etc.) and the scanner has to either report everything or make a decision. Whatever the decision is, it should be predictable, which is not the case with some scanners I tested in the past.
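
One way a scanner can tell the two apart regardless of the file extension is to look inside the ZIP container: OOXML documents such as DOCX or XLSX carry a [Content_Types].xml entry, ODF documents a mimetype entry. A minimal sketch of such a probe, using the file names from the example above:

import zipfile

def zip_kind(path: str) -> str:
    # Classify a ZIP container by tell-tale member names,
    # ignoring the file extension entirely.
    with zipfile.ZipFile(path) as z:
        names = set(z.namelist())
    if '[Content_Types].xml' in names:
        return 'OOXML office document (DOCX, XLSX, ...)'
    if 'mimetype' in names:
        return 'ODF office document (ODT, ODS, ...)'
    return 'plain ZIP archive'

print(zip_kind('server.zip'))     # plain ZIP archive
print(zip_kind('document.zip'))   # renamed office document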

Unfortunately, PAN scanner decisions are often not configurable even though there is no useful one-size-fits-all approach. Take the example of virus scanners, where you add signatures to a white list (network administrators typically need to whitelist netcat, which is often considered harmful by virus scanners). In the case of PAN scanners, image (GIF, JPEG, etc.) and font files are usually skipped over, but there are exceptions; similar reasoning applies to Java JAR files.
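
A hypothetical sketch of how such decisions might be exposed to the user follows. The rules and file types below are made up for illustration, and the policy is keyed by extension for brevity; a real scanner would rather key on the detected content type, as in the ZIP probe above:

import os

# Hypothetical per-file-type policy a scanner might let users edit:
# 'skip' = do not scan, 'scan' = scan content, 'unpack' = recurse.
POLICY = {
    '.gif':  'skip',     # image data
    '.jpeg': 'skip',
    '.ttf':  'skip',     # font files
    '.jar':  'unpack',   # Java archives: recurse, exceptions permitting
    '.docx': 'unpack',
    '.html': 'skip',     # web content, PANs usually displayed legitimately
}

def action_for(filename: str) -> str:
    ext = os.path.splitext(filename)[1].lower()
    return POLICY.get(ext, 'scan')   # default: scan unknown types

print(action_for('report.docx'))   # unpack
print(action_for('notes.txt'))     # scan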

back to overview

4. Software Expectations

I compare PAN scanners with virus scanners because the latter are well understood, but there are important differences that are often neglected. In particular, the problem of unwanted findings does not exist with virus scanners: a virus scanner must report everything in the first place, and exceptions are handled by a white list at a later stage. There is no hassle here because the white list will stay relatively small.

Naively, the same is sometimes expected from the developer or supplier of a PAN scanner. But a PAN scanner working that way would be rejected rather quickly, simply because of the overwhelming after-scan verification work related to unwanted findings (provided that all findings are PANs in the first place, see false positives).

A PAN scanner might be used for triage or at a particular stage of the PCI DSS compliance certification cited earlier. The most desirable property is then a fast scanner running in the foreground and not leaving a trace (i.e. no write access) on the host machine. This again differs from a virus scanner, where it hardly matters how long a scan takes and where a private storage area is usable as a cache.

As always with software, doing it right is only possible if the challenges are fully understood and acknowledged.

back to overview

5. Commercial Aspects

Depending on the business model, PAN scanning is sold as a software tool or as a service (e.g. PCI DSS certification, see above). But because of the complexity of the after-scan verification work needed, a PAN scanner cannot be sold as easily as a virus scanner.

Packaging and Licensing

Not many years ago, PAN scanners were mostly installed on the target (Windows) system rather than run as-is. This might have changed, but today’s scanner programmes still come with sizes around ~20MiB (compare with 7seec, which was ~500KiB including a full IIN/BIN list).

Large binary sizes make it desirable to use password/encryption technology for managing licences. With a lean scanner binary, on the contrary, the effort of producing a release is not much higher than that of producing a licence key: the scanner binary itself is just thrown away when expired and a new one used. There is no need for bug-fixing with short-lived throw-away binaries.

But on-line licensing gives a ready excuse for calling home (see below).

Sales Spin

Sometimes the details of PAN scanners are shrouded in secrecy, justified by the sensitivity of PANs and the monetary transactions involved. I see this more as a PR stunt needed to raise awareness that such tools exist.

For an example, consider this blog entry from 2014 by a leading software supplier: it reads like an advertisement stressing the importance of their well designed tool. The format resembles a security breach report, but there is no CVE reference – registering one there would be the thing to do for a genuine security breach. What also seems a bit odd is that at the time (2014) the tool was already more than 20MiB in size (~40 times the size of 7seec, see above) – hardly something one would seriously use for hacking. One could get a demo version for free anyway, so what? Here is more about this issue.

Calling Home

There are tools around that frequently pass data from the customer site to the supplier’s home server. As a contractor working on a customer site, I would not want to find out that, unbeknownst to me, my favourite PAN scanner had been contacting some server out on the internet.

On Linux, it is not hard to check what is going on under the hood of a PAN scanner using the strace monitor:

strace <pan-scanner> <options ...> 2>&1 | grep AF_INET

With the internet connection turned off, the above command might produce something like

socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
ioctl(3, SIOCGIFADDR, {ifr_name="eth1", [..] sin_addr=inet_addr("192.0.2.1")}}) = 0
ioctl(3, SIOCGIFADDR, {ifr_name="docker0", [..] sin_addr=inet_addr("172.17.0.1")}}) = 0
socket(AF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
connect(3, [..] sin_port=htons(53), sin_addr=inet_addr("172.16.200.1")}, 16) = 0
[..]

It is particularly annoying when alien software inspects local interfaces like eth1 and docker0 for whatever reason (the DNS connection to 172.16.200.1 would be OK). This can all be legitimate and for sound reasons, but it lacks explanation. Just for comparison, connecting to my local gateway (called pudding) with netcat as in

strace nc pudding 12345 2>&1 | grep AF_INET

produces

socket(AF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
connect(3, [..] sin_port=htons(53), sin_addr=inet_addr("172.16.200.1")}, 16) = 0
recvfrom(3, [..]pudding\6office\0[..]53), [..]"172.16.200.1")}, [28->16]) = 48
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, [..]12345), [..]"172.16.207.1"}, 16) = -1 ENETUNREACH (Network is unreachable)

indicating that there is really no need to snoop on the network interfaces eth1 and docker0. Another test with Wireshark or tcpdump can reveal similar, more detailed results on the IP protocols used.
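
For example, a capture along the following lines (the interface name and the DNS filter match the setup above; pan-scanner.pcap is just a chosen output file name) would record everything the scanner sends off-host while running:

tcpdump -n -i eth1 -w pan-scanner.pcap 'not (port 53 and host 172.16.200.1)'

The resulting capture file can then be inspected at leisure with Wireshark.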

As mentioned earlier, calling home once is used for checking a PAN scanner licence against the supplier’s database. But I lack the imagination to find a reason for software calling home while running – other than collecting customer data. For commercial applications to be run on confidential third party sites this should at least be explained.

Comparing again with the virus scanner situation, the calling home model there is clear, open and configurable (well, there might be exceptions, too).

back to overview

6. Summary

While pretty useful nevertheless, PAN scanners must be validated for accuracy and good behaviour.

  • I use my own PAN generator tools for quick-checking accuracy. Most (if not all) of the generated fake PANs should never be reported by a sound scanner.

  • The Linux operating system provides tools for quick-checking good behaviour and apparent remote communication (and possibly stealth data collection) in the absence of a clear supplier statement. On Linux, the tool tcpdump reveals more about possibly unsolicited remote communication. On other systems (incl. Linux) Wireshark (command line version tshark) might help.

  • I am wary of supplier spin. PR is needed and duly highlights features, but it often exaggerates negligible properties while playing down technical deficiencies.

back to overview