Tag Archives: Data Mining

Visualizing Data (using Processing)

Visualizing Data, by Ben Fry (O’Reilly)

One of my interests has always been data visualization (it makes data more understandable, and is one step towards easier interaction with it). Chanced upon this book at the library today; it’s certainly something I’d like to look into in more detail later.

What caught my interest was simply that there’s another book on this topic at all. Other than Applied Security Visualization by Raffael Marty, I’ve yet to chance upon anything else.

A quick browse of the book showed that it’s very possible to use Processing (yet another good reason to take up this book: simple programming!) to implement many of the data visualization concepts. Though many people would say that this is “raw” and “slow” compared to using a ready-made tool, I’d say that doing it this way gives the user a much better understanding of the data visualization process itself. Furthermore, who’s to say that Processing isn’t the tool itself! 😛 Also, the author has helpfully made the source code examples available online at his blog.

Will keep this book in mind to look at later. Have other books to go through first… :}

Profiling client internet connections

Some more fun with p0f and Splunk… now with profiling of client internet connections!

The p0f setup and logging are the same as in the OS Profiling post.

The Splunk search string has been extended to extract the source’s internet link as a field too (the rex commands are the field-extracting portion):

| file /home/path/to/p0f.log | rex field=_raw "> (?<srcip>[^:]+):(?<srcport>[^ ]+) - (?<srcos>.+?) \(" | rex field=_raw "-> (?<dstip>[^:]+):(?<dstport>[^ ]+) " | rex field=_raw "link: (?<srclink>.*)\)$" | regex srclink!="(unspecified|unknown)" | top limit=0 srclink

The fields that I extract with this:

  • srcip -> source IP
  • srcport -> source TCP port
  • srcos -> source’s OS (woot!)
  • dstip -> destination IP (which is my machine’s)
  • dstport -> the destination port which the TCP connection was initiated to
  • srclink -> the internet link of the source machine
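For anyone without Splunk handy, here’s a rough Python sketch of the srclink part of the search: pull out the link field, drop the “unspecified” and “unknown” ones, and rank the rest. The names rank_links and make_record are my own for illustration, and the sample records simply reuse the log line from the OS Profiling post with only the link text swapped, so treat them as illustrative rather than real captures.

```python
import re
from collections import Counter

# Python spells named groups (?P<name>...) where Splunk's rex uses (?<name>...).
LINK_RE = re.compile(r"link: (?P<srclink>.*)\)$", re.MULTILINE)
SKIP_RE = re.compile(r"(unspecified|unknown)")


def rank_links(records):
    """Count link types, most common first, skipping unspecified/unknown."""
    counts = Counter()
    for record in records:
        match = LINK_RE.search(record)
        if match is None:
            continue  # no link field found in this record
        srclink = match.group("srclink")
        if SKIP_RE.search(srclink):
            continue  # same effect as: regex srclink!="(unspecified|unknown)"
        counts[srclink] += 1
    return counts.most_common()


def make_record(link):
    """Illustrative record: the post's sample log line with the link swapped."""
    return (
        "<Sat Jul  3 07:03:56 2010> 175.40.12.47:1095 - "
        "Windows 2000 SP2+, XP SP1+ (seldom 98)\n"
        f"-> 74.207.229.183:80 (distance 12, link: {link})"
    )


records = [
    make_record(link)
    for link in ["ethernet/modem", "sometimes DSL (2)", "ethernet/modem",
                 "unspecified", "unknown"]
]

for link, count in rank_links(records):
    print(count, link)
```

The greedy `.*` followed by `\)$` is what lets a link such as “sometimes DSL (2)” keep its own parentheses while the record’s closing bracket is dropped.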

After filtering out the “unspecified” and “unknown” links, the list of detected links is as follows:

“ethernet/modem” points mostly to cable connections.  There are some interesting entries in the list though, like vtun, pppoe, Google/AOL, IPv6/IPIP (early adopters? haha).  I have no idea what IPSec/GRE or vLAN mean in this context though.

Just for the heck of it, here’s the chart for this table, generated from the reports link in Splunk.

I like the charts because they allow some interaction for simple datasets, but I digress 😛

OS Profiling

Trying out p0f along with Splunk..

p0f allows you to determine the OS of a remote machine based on the characteristics of its TCP fields.  It can also tell whether the machine is behind a firewall and what kind of internet connection it is running from… pretty useful for information junkies like me 😀

Here’s what I did:

./p0f -t -u MyUseridHere -i eth0 'src not MyIPAddressHere' | tee -a p0f.log

This runs p0f with actual timestamps in the log (-t), chroot and setuid to MyUseridHere (-u), listening on eth0 (-i), and filters out packets from connections initiated by my own machine (since I’m not interested in profiling it).

tee is a (really nifty!) Linux command.  It copies its input (stdin) to two places: stdout and the file specified.  The -a option tells it to append to the file instead of overwriting it.
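To make the tee behaviour concrete, here’s a minimal Python sketch of what tee -a boils down to. The tee_append name is made up for illustration, and the demo writes to a temporary file and an in-memory “screen” rather than to a real p0f.log, so it is only an approximation of the real command.

```python
import io
import os
import tempfile


def tee_append(lines, path, out):
    """Copy each input line to `out` and append it to the file at `path`."""
    with open(path, "a") as log:  # "a" = append, like tee's -a option
        for line in lines:
            out.write(line)
            log.write(line)
            log.flush()  # keep the file current while the input is still flowing


# Demo: two separate runs against the same file, as if p0f had been restarted.
log_path = os.path.join(tempfile.mkdtemp(), "p0f.log")
screen = io.StringIO()
tee_append(["first run\n"], log_path, screen)
tee_append(["second run\n"], log_path, screen)

with open(log_path) as log:
    print(log.read(), end="")
```

The second run adds to the file instead of wiping it, which is the whole point of -a here. For a real pipe you would pass sys.stdin and sys.stdout instead of the list and the StringIO.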

Using this, p0f outputs logs like this one:

<Sat Jul  3 07:03:56 2010> 175.40.12.47:1095 - Windows 2000 SP2+, XP SP1+ (seldom 98)
-> 74.207.229.183:80 (distance 12, link: sometimes DSL (2))

One of the Splunk queries that I poked around with:

| file /path/to/p0f.log | rex field=_raw "> (?<srcip>[^:]+):(?<srcport>[^ ]+) - (?<srcos>.+?) \(" | rex field=_raw "-> (?<dstip>[^:]+):(?<dstport>[^ ]+) " | regex srcos!="UNKNOWN" | top limit=0 srcos

This query extracts the source and destination IPs and ports, and the source OS.  Then, after filtering out the entries whose OS is tagged UNKNOWN, the remaining entries are ranked by how often each OS appears.
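Here’s roughly the same extract, filter and rank pipeline as a Python sketch, in case it helps to see the regexes outside of Splunk. The function names extract_fields and rank_os are my own, and I’m assuming each record arrives as one string holding both lines of the p0f output.

```python
import re
from collections import Counter

# Python spells named groups (?P<name>...) where Splunk's rex uses (?<name>...).
# The trailing "(" is a literal bracket in the log, so it is escaped.
SRC_RE = re.compile(r"> (?P<srcip>[^:]+):(?P<srcport>[^ ]+) - (?P<srcos>.+?) \(")
DST_RE = re.compile(r"-> (?P<dstip>[^:]+):(?P<dstport>[^ ]+) ")


def extract_fields(record):
    """Return whichever of the five fields can be found in one p0f record."""
    fields = {}
    for pattern in (SRC_RE, DST_RE):
        match = pattern.search(record)
        if match:
            fields.update(match.groupdict())
    return fields


def rank_os(records):
    """Count source OS guesses, most common first, skipping UNKNOWN ones."""
    counts = Counter()
    for record in records:
        srcos = extract_fields(record).get("srcos")
        if srcos and "UNKNOWN" not in srcos:
            counts[srcos] += 1
    return counts.most_common()


# The sample log entry from this post, both lines joined into one record.
sample = (
    "<Sat Jul  3 07:03:56 2010> 175.40.12.47:1095 - "
    "Windows 2000 SP2+, XP SP1+ (seldom 98)\n"
    "-> 74.207.229.183:80 (distance 12, link: sometimes DSL (2))"
)

print(extract_fields(sample))
print(rank_os([sample, sample]))
```

The lazy `.+?` in the srcos group stops at the first “ (”, which is why the “(seldom 98)” note is left out of the OS name.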

The resulting chart, of not much real interest by itself, just shows that the connections are predominantly from Linux machines (hurhur), and that there’s a connection from a really old NetWare machine (NetWare 5 was released in Oct 1998!).