
Edit: this has been worked into a set of scripts for downloading here.
One of the other weird ideas I had (which was more possible for me to implement at this stage) was to visualize the content of binary data in a way that showed its characteristics at a glance. Analyzing binary data appears to be one of the things you do a lot in DFIR work when you are faced with cryptanalysis tasks.
A couple of questions that came to mind when looking at a set of binary data were:
- Which offsets had relatively constant binary values?
- What is the range of binary values that can be observed at these offsets?
Getting some ideas from my dabbling with converting binaries into images, here’s something that I cobbled together. Please do feel free to ask questions or discuss in the comments section below.
A small sidetrack: automating the generation of test data. It’s pretty repeatable, and I’m too lazy to keep searching the shell history to modify the commands…
generatedata.sh (defaults to generating 128 bytes of urandom data if no parameter is passed in)
#!/bin/sh
size=${1:-"128"}
dd if=/dev/urandom count=1 bs=$size 2>/dev/null
And so we generate one set of sample data…the insert part is to demonstrate a part of the binary data that will be “constant”.
$ generatedata.sh 16 > insert
$ generatedata.sh 64 > part1
$ generatedata.sh 128 > part2
$ cat part1 insert part2 > test1.bin
$ xxd insert
0000000: b7c0 a172 3a6f 146c 4c27 b5cd 72e1 29d0  ...r:o.lL'..r.).
$ xxd test1.bin
0000000: 59a7 ed2d 7e7a 4cd9 f3df 157a 27aa f98b  Y..-~zL....z'...
0000010: 5988 14c0 56e8 965d c92f 9907 9ec0 370e  Y...V..]./....7.
0000020: 0d9d 74b7 f5a6 ee92 310d fe3e 5dd7 5e63  ..t.....1..>].^c
0000030: 6282 dc9a 1a7f 6c92 ab6e 39b0 5b92 323b  b.....l..n9.[.2;
0000040: b7c0 a172 3a6f 146c 4c27 b5cd 72e1 29d0  ...r:o.lL'..r.).
0000050: 3503 9689 01f6 ddef f8ea e4db 09f4 bc13  5...............
0000060: fcff 1ccd 43fa 062b e745 7dd5 d0ae 9f25  ....C..+.E}....%
0000070: dc85 5b12 5f7f 5da9 bf2f 8628 cd73 2702  ..[._.]../.(.s'.
0000080: c36c cc1e 4c6a 8ce0 590a 3756 3c6f b460  .l..Lj..Y.7V<o.`
0000090: d553 49d7 2b7d e361 87c4 662f 3ed3 a262  .SI.+}.a..f/>..b
00000a0: b721 6c74 8037 5ceb 927a dc2a 75f5 932c  .!lt.7\..z.*u..,
00000b0: 36ac af30 2a72 a9bb eca6 188e fb8f 4648  6..0*r........FH
00000c0: fa62 434f 9962 42e9 e305 9109 07d1 9305  .bCO.bB.........
$ xxd -c1 -ps test1.bin
59
a7
ed
2d
7e
7a
4c
d9
f3
df
...
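Since the same three steps get repeated for every sample, they could be folded into one small function. This is only a sketch: the name make_sample is made up for illustration, and the 64/128 byte sizes are simply the ones used in the session above.

```shell
#!/bin/sh
# Hypothetical helper (not one of the original scripts): write one sample
# to the file named in $2, made of 64 random bytes, then the "constant"
# block stored in the file named in $1, then 128 more random bytes.
make_sample() {
    {
        dd if=/dev/urandom bs=64 count=1 2>/dev/null
        cat "$1"
        dd if=/dev/urandom bs=128 count=1 2>/dev/null
    } > "$2"
}

# Example use:
#   make_sample insert test1.bin
```

With a 16-byte insert file, every sample comes out at 208 bytes, with the constant block sitting at offsets 64 to 79.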
A small script to help with converting a binary file into something like CSV: first column is the offset, and the second column is the decimal equivalent of the binary value at that point. This output is space separated so that gnuplot can work with it nicely later on.
chompdata.sh
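#!/bin/bash
# Print "offset value" pairs (both in decimal) for every byte of the file in $1.
tmpfile="tmpfile"
if [ -z "$1" ]; then
    echo "I need something to work on!" 1>&2
    exit 1
fi
# Dump the file as plain hex, one byte per line
xxd -c1 -ps "$1" > "$tmpfile"
cntr=0
for i in $(cat "$tmpfile"); do
    # $((0x..)) converts the hex byte to decimal
    echo "$cntr $((0x$i))"
    let "cntr++"
done
rm "$tmpfile"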
Running it on our test data gives us:
$ chompdata.sh test1.bin
0 89
1 167
2 237
3 45
4 126
5 122
6 76
7 217
8 243
9 223
10 21
...
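The same conversion can probably be done as a single stream, without the temporary file or the shell loop. The sketch below swaps xxd for od (a different tool from the one used above) and lets awk do the counting; the function name chomp_od is invented for this example.

```shell
#!/bin/sh
# Hypothetical alternative to chompdata.sh: print "offset value" pairs
# (both decimal) for every byte of the file named in $1.
chomp_od() {
    # -A n : no address column
    # -v   : do not collapse repeated lines into "*"
    # -t u1: print each byte as an unsigned decimal number
    od -A n -v -t u1 "$1" |
        awk 'BEGIN { n = 0 } { for (i = 1; i <= NF; i++) print n++, $i }'
}

# Example use:
#   chomp_od test1.bin > test1.bin.chomped
```

The output format matches chompdata.sh, so it should be usable as a drop-in for the later steps. The -v flag matters here: without it od abbreviates runs of identical lines, and the offsets would go out of step.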
The gnuplot script we will be using:
gnuplotscript
set terminal png
set output "output.png"
set yrange [-1:256]
set ytics 16
set key off
set grid xtics ytics
set title "Scatterplot of binary values across offsets"
set xlabel "Offset (decimal)"
set ylabel "Value (decimal)"
plot "input.dat" using 1:2
First run on the test binary data we just generated:
$ chompdata.sh test1.bin > input.dat
$ gnuplot gnuplotscript
And we have our first scatterplot!
Okay…let’s see how this thing works when we start adding more samples into the mix…
$ for i in {2..50}; do generatedata.sh 64 > part1; generatedata.sh 128 > part2; cat part1 insert part2 > test$i.bin; rm part1 part2; done
$ for i in *.bin; do chompdata.sh $i > $i.chomped; done
Let’s start off with 10 samples…
$ rm input.dat
$ for i in {1..10}; do cat test"$i".bin.chomped >> input.dat; done
$ sort -u input.dat | sort -n > tmp && mv tmp input.dat
$ gnuplot gnuplotscript
Heh, the band of “constants” is already pretty obvious here. And when we move on to using the entire set of 50 “samples” (moar data!), the differences get even more obvious.
$ rm input.dat
$ cat *.chomped >> input.dat
$ sort -u input.dat | sort -n > tmp && mv tmp input.dat
$ gnuplot gnuplotscript
<philosophy_sidetrack>
Now the remaining question is: why not just plot these points in Excel? “It could do the same thing, and we prefer using the GUI instead!”
Well, one thing is scalability: Excel will choke on opening such a humongous file (a worksheet holds at most 1,048,576 rows in Excel 2007 and later). And charting anything beyond a few thousand data points gets really sloooow…
And the other is automation: gnuplot allows the user to have some pretty fine-grained control of the output, if one would only learn it. Which would you prefer? Having to click a few buttons here and there every time you need to do something repetitive, or to just distill these actions into a set of scripts/commands that can be invoked with much less pain?
</philosophy_sidetrack>
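To make the automation point a bit more concrete, here is a rough sketch of how the merge and plot steps might be wrapped up. The function names merge_chomped and plot_script are made up for this example; the gnuplot settings are copied from gnuplotscript, with only the two file names turned into parameters.

```shell
#!/bin/sh
# Hypothetical helpers; the names merge_chomped and plot_script are
# illustrative and not part of the original scripts.

# Merge any number of "offset value" files, drop duplicate points, and
# sort numerically by offset and then by value, in one sort invocation.
merge_chomped() {
    cat "$@" | sort -u -k1,1n -k2,2n
}

# Print a gnuplot script that plots input file $1 into PNG file $2.
plot_script() {
    cat <<EOF
set terminal png
set output "$2"
set yrange [-1:256]
set ytics 16
set key off
set grid xtics ytics
set title "Scatterplot of binary values across offsets"
set xlabel "Offset (decimal)"
set ylabel "Value (decimal)"
plot "$1" using 1:2
EOF
}

# Example use (needs gnuplot installed):
#   merge_chomped *.chomped > input.dat
#   plot_script input.dat output.png | gnuplot
```

Both sort keys are spelled out on purpose: a plain `sort -n -u` compares only the leading number, so it would keep just one point per offset and throw away exactly the spread of values the plot is meant to show.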
In terms of scalability, running gnuplot against an input file of well over a million points is possible too, as compared with trying to wrangle something out in Excel.
Yeps, gnuplot was one answer to the whole issue of scalability:
Anyone knows of F/OSS tools tt can chart scatterplots frm huge #data sets (scaling up to millions of points)? #visualization (pls help RT :)
—
Ray Foo (@rrrayfoo) November 17, 2012
Much remains to be done to improve and tweak this method (and both visualization and gnuplot will have their own limitations anyway), but here’s something to start off with.
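One possible tweak, sketched here under assumptions: a plain-text summary to go with the picture, listing the offsets where only a single value was ever seen across all samples (the first question at the top of this post). The function name constant_offsets is invented for this example, and the input is assumed to be the "offset value" lines produced by chompdata.sh.

```shell
#!/bin/sh
# Hypothetical helper: print "offset value" for every offset that shows
# exactly one distinct value across all input lines.
# Input: one or more files of "offset value" lines (e.g. *.chomped).
constant_offsets() {
    awk '
        # Count each (offset, value) pair only the first time it is seen
        !(($1, $2) in seen) {
            seen[$1, $2] = 1
            distinct[$1]++
            value[$1] = $2
        }
        END {
            for (o in distinct)
                if (distinct[o] == 1)
                    print o, value[o]
        }
    ' "$@" | sort -n
}

# Example use:
#   constant_offsets *.chomped
```

This only means something with several samples in the mix: with a single sample every offset looks "constant", and with very few samples a random offset can match by chance.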
HTH.



