Note: Although this was created some time back (sorry for sharing it so late), there are still improvements to be made. Discussion is always welcome.
When responding to an enterprise network compromise, one big question (and source of pressure) is how to determine network IOCs quickly. This information would usually come from the malware and tools used in the compromise, but surfacing network IOCs and triaging machines happen in parallel, which presents a Catch-22: How do we find machines and malware without network IOCs available? How do we get network IOCs without analyzing any suspect machines or malware?
One of the richest (and most commonly available) sources of enterprise network information is the proxy logs. While there are caveats to their usefulness in malware detection, the concepts and methods remain the same for other enterprise perimeter logs. The problem is that looking through any enterprise-wide logs isn’t going to be easy, primarily due to the sheer size and complexity of the information you get. To compound the problem, simply looking at basic stats like the top N remote hosts by request count isn’t going to be enough, since too much information is lost when generating such statistics.
Before I move on, let’s define what a potential malware beacon is:
- It has to beacon (duh). Meaning that it is regular (every N seconds/minutes/hours/days from any given infected machine). This also means that it is not the one-off type.
- It does not look like normal traffic. If you have done web application profiling before, you would have some idea of what normal traffic is like. Most/all of a page’s resources are requested within the first few seconds when the browser loads the page. Anything else is either “AJAX”-type requests (legit), or the suspicious domains that we’re looking for. This also helps in avoiding legit domains that enter our radar simply because they generate a lot of requests (the top N domains method).
Using Splunk’s advanced features (sorry this time, fellow CLI scripters, but you could try out Splunk too, since it’s free to use), we can narrow the list of suspicious domains from thousands to a handful, quickly and more accurately. Simply put, we try to find domains which have a “substantial” number of requests that aren’t normal. Here’s the basic Splunk query based on this idea:
index="myindex" | convert mktime(_time) as epoch | sort 0 uri_host,client_ip,epoch | delta epoch as epoch_delta | search epoch_delta>0 epoch_delta<30 | chart count over epoch_delta by uri_host
Let’s break down the query to understand what’s going on (and for you to tweak this according to your needs).
convert mktime(_time) as epoch
This is one of the most important pieces of the puzzle. This command adds a field to every event containing the epoch time of the event, regardless of whether the original event has the timestamps in epoch or not (most don’t). Having numerical timestamps is the key enabler here for the entire algorithm/method to work.
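For those working outside Splunk, the same normalisation step can be sketched in Python. This is only an illustration: it assumes the proxy log writes ISO 8601 timestamps with a UTC offset (your log format may well differ), and `to_epoch` is a made-up helper name, not anything from Splunk.

```python
from datetime import datetime


def to_epoch(timestamp):
    """Convert an ISO 8601 timestamp with a UTC offset to epoch seconds.

    Assumption: the log uses this timestamp format; adapt the parsing
    to whatever your proxy actually writes.
    """
    # fromisoformat keeps the offset, so timestamp() is timezone-correct.
    return int(datetime.fromisoformat(timestamp).timestamp())


# The same instant written with two different offsets gives one epoch value.
print(to_epoch("2020-01-01T00:00:00+00:00"))  # 1577836800
print(to_epoch("2020-01-01T08:00:00+08:00"))  # 1577836800
```

Once every event carries a plain number like this, subtracting one timestamp from another becomes trivial, which is all the later steps rely on.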
sort 0 uri_host,client_ip,epoch
Next, we sort all the events on three levels: the remote host, then the (requesting) client IP, then the numerical timestamp. This allows us to extract the period between each successive request from each client to the remote host. The 0 in the command tells Splunk not to limit the number of results in the sorted set; without it, sort returns only the first 10,000 results by default (dangerously counter-intuitive, I’d say).
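As a rough Python equivalent of this three-level sort, a tuple key does the job. The `Event` type and the sample hosts and IPs below are invented for illustration only.

```python
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical, minimal view of a proxy log event.
    epoch: int
    client_ip: str
    uri_host: str


events = [
    Event(105, "10.0.0.2", "b.example"),
    Event(100, "10.0.0.1", "a.example"),
    Event(160, "10.0.0.1", "a.example"),
    Event(101, "10.0.0.2", "a.example"),
]

# Same ordering idea as `sort 0 uri_host,client_ip,epoch`: host first,
# then client, then time. IPs are compared as plain strings here, which
# is enough to keep each client's requests together.
ordered = sorted(events, key=lambda e: (e.uri_host, e.client_ip, e.epoch))

for e in ordered:
    print(e.uri_host, e.client_ip, e.epoch)
```

After sorting, all requests from one client to one host sit next to each other in time order, so neighbouring rows can be compared directly.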
delta epoch as epoch_delta
Here, we calculate the time periods themselves between every successive request from any given client to each remote host. At this stage, we have the “raw” information needed for sifting through the whole mess of requests: the actual time periods between successive client-remote requests.
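To make the behaviour concrete, here is a small Python sketch of an adjacent-row difference in the spirit of the `delta` command. `delta_like` is an illustrative name and the epoch values are made up; it is not Splunk's implementation.

```python
def delta_like(values):
    """Difference between each value and the one before it.

    The first row has nothing before it, so it gets None.
    """
    if not values:
        return []
    deltas = [None]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


# Epochs as they might look after the sort step:
# client A -> host X at 100, 160, 220, then client B -> host X at 90, 95.
epochs = [100, 160, 220, 90, 95]
print(delta_like(epochs))  # [None, 60, 60, -130, 5]
```

The two 60s are the interesting part (a client calling back every minute). The -130 comes from stepping across the boundary between client A and client B, which is why the next step in the query filters on the sign.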
search epoch_delta>0 epoch_delta<30
The first part filters out the events where the time period was calculated to be negative (a side effect of the sorting and time period calculation steps; these can be removed without affecting how the method works). The second part is just there to assist in visualizing, and can be tweaked (or removed) according to your needs.
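One thing to watch: a delta taken across the boundary between two client/host pairs can also come out positive (when the next pair's first request is later than the previous pair's last one), and the sign filter alone won't drop those. The Python sketch below is a variation on the method, not the query above: it computes gaps only within each (host, client) pair and then applies the same 0 to 30 second window. The function name and sample data are hypothetical.

```python
from itertools import groupby


def per_pair_deltas(events, low=0, high=30):
    """events: iterable of (uri_host, client_ip, epoch) tuples.

    Returns (uri_host, gap) for every gap with low < gap < high,
    computed only between requests of the same (uri_host, client_ip).
    """
    ordered = sorted(events)  # host, then client, then epoch
    out = []
    for (host, _client), group in groupby(ordered, key=lambda e: (e[0], e[1])):
        epochs = [e[2] for e in group]
        for prev, cur in zip(epochs, epochs[1:]):
            gap = cur - prev
            if low < gap < high:
                out.append((host, gap))
    return out


events = [
    ("x.example", "10.0.0.1", 100),
    ("x.example", "10.0.0.1", 110),
    ("x.example", "10.0.0.1", 120),
    ("x.example", "10.0.0.2", 125),  # different client: 125 - 120 is never computed
    ("x.example", "10.0.0.2", 200),  # 75 second gap falls outside the window
]
print(per_pair_deltas(events))  # [('x.example', 10), ('x.example', 10)]
```

Whether the extra bookkeeping is worth it depends on how noisy the boundary deltas turn out to be in your data.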
chart count over epoch_delta by uri_host
Lastly, we chart (remember to set the chart type to column) the counts of these time periods for each remote host, and we get something like this… (will try to put in a better example when I can, sorry)
Remembering that the norm for page loads is that most/all of the requests are fired off within the first few seconds, you will find that the charts generally follow a bell curve. The outliers will be like the two shown in the picture above.
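For reference, the tally behind `chart count over epoch_delta by uri_host` can be approximated in Python as a two-way count: one row per delta value, one column per host. This is a sketch with invented hostnames and numbers, not output from real logs.

```python
from collections import Counter, defaultdict


def chart_counts(rows):
    """rows: iterable of (uri_host, epoch_delta) pairs.

    Returns {epoch_delta: {uri_host: count}}, ordered by delta.
    """
    counts = Counter((delta, host) for host, delta in rows)
    table = defaultdict(dict)
    for (delta, host), n in counts.items():
        table[delta][host] = n
    return {delta: table[delta] for delta in sorted(table)}


rows = [
    ("news.example", 1), ("news.example", 1), ("news.example", 2),
    ("odd.example", 20), ("odd.example", 20), ("odd.example", 20),
]
for delta, per_host in chart_counts(rows).items():
    print(delta, per_host)
```

In this toy data, `news.example` clusters at one or two seconds like a normal page load, while `odd.example` piles up at exactly 20 seconds, which is the kind of spike worth a closer look.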
In the charting view in Splunk, you can mouse over the outliers to find out more details about them:
Also, mousing over the legend items in Splunk’s charting view shows you all the relevant plots for that remote host. Nice and interactive 🙂 In this example, the (legit) domain has pages that fire off requests every 2-4 seconds, which is probably due to the picture ticker on every news article.