Visualizing the spread of binary content

Edit: this has been worked into a set of scripts for downloading here.

One of the other weird ideas I had (and one that was more feasible for me to implement at this stage) was to visualize the content of binary data in a way that shows its characteristics at a glance. Analyzing binary data seems to be something you do a lot in DFIR work when you are faced with cryptanalysis tasks.

A couple of questions that came to mind when looking at a set of binary data were:

  1. Which offsets have relatively constant binary values?
  2. What is the range of binary values that can be observed at these offsets?

Getting some ideas from my dabbling with converting binaries into images, here’s something that I cobbled together. Please do feel free to ask questions or discuss in the comments section below.

A small sidetrack: automating the generation of test data. It’s pretty repeatable, and I’m too lazy to keep searching the shell history to modify the commands… (the script defaults to generating 128 bytes of urandom data if no parameter is passed in)

size="${1:-128}"   # default to 128 bytes when no parameter is passed in
dd if=/dev/urandom count=1 bs="$size" 2>/dev/null

And so we generate one set of sample data…the insert part is to demonstrate a part of the binary data that will be “constant”.

$ 16 > insert
$ 64 > part1
$ 128 > part2
$ cat part1 insert part2 > test1.bin

$ xxd insert
0000000: b7c0 a172 3a6f 146c 4c27 b5cd 72e1 29d0  ...r:o.lL'..r.).

$ xxd test1.bin
0000000: 59a7 ed2d 7e7a 4cd9 f3df 157a 27aa f98b  Y..-~zL....z'...
0000010: 5988 14c0 56e8 965d c92f 9907 9ec0 370e  Y...V..]./....7.
0000020: 0d9d 74b7 f5a6 ee92 310d fe3e 5dd7 5e63  ..t.....1..>].^c
0000030: 6282 dc9a 1a7f 6c92 ab6e 39b0 5b92 323b  b.....l..n9.[.2;
0000040: b7c0 a172 3a6f 146c 4c27 b5cd 72e1 29d0  ...r:o.lL'..r.).
0000050: 3503 9689 01f6 ddef f8ea e4db 09f4 bc13  5...............
0000060: fcff 1ccd 43fa 062b e745 7dd5 d0ae 9f25  ....C..+.E}....%
0000070: dc85 5b12 5f7f 5da9 bf2f 8628 cd73 2702  ..[._.]../.(.s'.
0000080: c36c cc1e 4c6a 8ce0 590a 3756 3c6f b460  .l..Lj..Y.7V<o.` 
0000090: d553 49d7 2b7d e361 87c4 662f 3ed3 a262  .SI.+}.a..f/>..b
00000a0: b721 6c74 8037 5ceb 927a dc2a 75f5 932c  .!lt.7\..z.*u..,
00000b0: 36ac af30 2a72 a9bb eca6 188e fb8f 4648  6..0*r........FH
00000c0: fa62 434f 9962 42e9 e305 9109 07d1 9305  .bCO.bB.........
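
Eyeballing the xxd dump shows the insert sitting at offset 0x40, but the same check can be scripted. Here is a rough sketch; the function name has_insert_at is made up for illustration and is not part of the original scripts.

```shell
has_insert_at() {
    # $1 = file to inspect, $2 = file holding the expected bytes, $3 = decimal offset
    # Exit status 0 if the bytes of $2 appear in $1 starting at offset $3.
    len=$(( $(wc -c < "$2") ))   # arithmetic expansion strips any padding from wc
    dd if="$1" bs=1 skip="$3" count="$len" 2>/dev/null | cmp -s - "$2"
}
```

Something like `has_insert_at test1.bin insert 64 && echo "insert found at 0x40"` should then confirm the layout of the sample.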

$ xxd -c1 -ps test1.bin

A small script to help with converting a binary file into something like CSV: first column is the offset, and the second column is the decimal equivalent of the binary value at that point. This output is space separated so that gnuplot can work with it nicely later on.



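#!/bin/bash
# Print "offset value" (both in decimal) for every byte of the file given as $1.
if [ -z "$1" ]; then echo "I need something to work on!" 1>&2; exit 1; fi

tmpfile=$(mktemp)   # holds one hex byte per line
cntr=0              # current byte offset

xxd -c1 -ps "$1" > "$tmpfile"

for i in $(cat "$tmpfile"); do
	echo "$cntr $((0x$i))"
	let "cntr++"
done

rm "$tmpfile"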

Running it on our test data gives us:

$ test1.bin
0 89
1 167
2 237
3 45
4 126
5 122
6 76
7 217
8 243
9 223
10 21
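
For what it’s worth, the same two-column output can probably be had without the temp file and the loop. The sketch below swaps xxd for od (a substitution on my part, since od is available almost everywhere), and the function name bin2dat is just a placeholder.

```shell
bin2dat() {
    # $1 = binary file to convert
    # od -An: no address column; -v: do not collapse repeated lines;
    # -t u1: print each byte as an unsigned decimal number.
    od -An -v -t u1 "$1" |
        awk '{ for (i = 1; i <= NF; i++) { print n + 0, $i; n++ } }'
}
```

The awk part simply numbers the bytes as they stream past, so `bin2dat test1.bin > input.dat` should give the same kind of file as the script above.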

The gnuplot script we will be using:


set terminal png
set output "output.png"
set yrange [-1:256]
set ytics 16
set key off
set grid xtics ytics
set title "Scatterplot of binary values across offsets"
set xlabel "Offset (decimal)"
set ylabel "Value (decimal)"
plot "input.dat" using 1:2

First run on the test binary data we just generated:

$ test1.bin > input.dat
$ gnuplot gnuplotscript

And we have our first scatterplot!

Scatterplot of single binary (values vs offset)

Okay…let’s see how this thing works when we start adding more samples into the mix…

$ for i in {2..50}; do 64 > part1; 128 > part2; cat part1 insert part2 > test$i.bin; rm part1 part2; done
$ for i in *.bin; do $i > $i.chomped; done

Let’s start off with 10 samples…

$ rm input.dat
$ for i in {1..10}; do cat test"$i".bin.chomped >> input.dat; done
$ sort -u input.dat | sort -n > tmp && mv tmp input.dat
$ gnuplot gnuplotscript

Scatterplot of 10 binaries (values vs offset)
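
A note on the two sort calls used when merging: as far as I can tell, a plain `sort -n -u` would not do here, because -n only compares the leading number (the offset), so -u would keep a single value per offset. A sketch of a one-pass alternative with explicit keys follows; merge_dat is a made-up name.

```shell
merge_dat() {
    # $@ = any number of "offset value" files
    # Sort numerically by offset, then by value; with -u, only lines whose
    # two keys are both equal count as duplicates and get dropped.
    cat "$@" | sort -u -k1,1n -k2,2n
}
```

Usage would be along the lines of `merge_dat test*.bin.chomped > input.dat`.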

Heh, the band of “constants” is already pretty obvious here. And when we move on to using the entire set of 50 “samples” (moar data!), the differences get even more obvious.

$ rm input.dat
$ cat *.chomped >> input.dat
$ sort -u input.dat | sort -n > tmp && mv tmp input.dat
$ gnuplot gnuplotscript

Scatterplot of 50 binaries (values vs offset)
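
The plot answers the two questions from the start visually, but the numbers can be pulled out of the merged data as well. Below is a minimal sketch, assuming the input has already been de-duplicated (one line per distinct offset/value pair, as input.dat is after the sort -u step); the name offset_spread is hypothetical.

```shell
offset_spread() {
    # $1 = de-duplicated "offset value" file
    # Prints: offset, number of distinct values, minimum value, maximum value.
    awk '{
            o = $1; v = $2 + 0
            if (!(o in n)) { min[o] = v; max[o] = v }
            if (v < min[o]) min[o] = v
            if (v > max[o]) max[o] = v
            n[o]++
         }
         END { for (o in n) print o, n[o], min[o], max[o] }' "$1" | sort -n
}
```

Offsets with exactly one distinct value are the “constant” ones, so `offset_spread input.dat | awk '$2 == 1'` would be expected to list offsets 64 to 79 for the test data here. Note that with a single sample every offset looks constant, so this only means something once several samples have been merged.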

Now the remaining question is: why not just plot these points in Excel? “It could do the same thing, and we prefer using the GUI instead!”

Well, one thing is scalability: Excel will choke on opening such a humongous file (a worksheet tops out at 1,048,576 rows in Excel 2007 and later, and at 65,536 rows in older versions). Not to mention that charting anything beyond a few thousand data points gets really sloooow…

And the other is automation: gnuplot allows the user to have some pretty fine-grained control of the output, if one would only learn it. Which would you prefer? Having to click a few buttons here and there every time you need to do something repetitive, or to just distill these actions into a set of scripts/commands that can be invoked with much less pain?
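
As a small taste of that kind of automation, the gnuplot script itself can be generated with the input and output names filled in, instead of being hardcoded. This is only a sketch; make_plot_script is a name I made up, and the settings are copied from the gnuplot script shown earlier.

```shell
make_plot_script() {
    # $1 = data file to plot, $2 = PNG file to write
    cat <<EOF
set terminal png
set output "$2"
set yrange [-1:256]
set ytics 16
set key off
set grid xtics ytics
set title "Scatterplot of binary values across offsets"
set xlabel "Offset (decimal)"
set ylabel "Value (decimal)"
plot "$1" using 1:2
EOF
}
```

Since gnuplot reads commands from standard input when it is not given a script file, `make_plot_script input.dat output.png | gnuplot` should produce the same plot as before.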


In terms of scalability, running gnuplot against an input file of well over a million points is possible too, as compared with trying to wrangle something out in Excel.

Yeps, gnuplot was one answer to the whole issue of scalability:

Scatterplot with 100k data points, some tweaking of gnuplot settings needed

Much remains to be done to improve and tweak this method (and both visualization and gnuplot will have their own limitations anyway), but here’s something to start off with.


