While Pandas may not care much about time, incident responders certainly should. Timeline creation and analysis are the core activities of many deep-dive digital forensics investigations. Run log2timeline/plaso on logs and other common evidence data and you’ll get a nice CSV file with parsed events together with their associated times. This allows you to correlate large amounts of information in time and build up an understanding of what exactly happened on a system, and when.
The only downside of this process is that the resulting super timeline file can get really large. There are people out there who like working with hundreds-of-megabytes CSV files in Excel; I think it’s a form of meditation. Those who don’t turn to tools like Eric Zimmerman’s Timeline Explorer, which offers both a much faster experience of working with timelines and a better presentation of the data.
The problem with both of these tools is that to work efficiently you already need a timestamp in mind to pivot around. And that may often be the case. But I still usually find myself needing to quickly visualize certain types of events across the whole super timeline before starting the manual deep dive. Again, that’s something that can be done in Excel, but even with 300 MB files (and timelines can easily get larger than that) it’s a test of one’s patience.
So below is my take on initial timeline visualization and triage with Jupyter and Pandas. What I like about this approach is that you can get everything you need running on Windows, Linux or macOS in minutes, and that it takes under 3 seconds to complete any computation I throw at it on a 300 MB CSV timeline sample.
Another good thing is that my very modest experience with Pandas was enough to get started. I can imagine people with better data science skills doing wonders: working with multiple timelines in the same investigation, building baselines, finding anomalies, automating, and all that. So in short, there’s still a lot more that could be done here.
Prepare the data
The first couple of lines take care of loading the source file for analysis and fixing up data types in some columns. I’m also extracting Windows event IDs here, as these are very common in most timelines.
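A sketch of this step. I’m using a tiny inline sample in place of the real export, and the column names (`datetime`, `source_long`, `message`, `parser`) follow psort’s dynamic CSV output, so they may differ in your timeline; the event ID regex assumes the `[4624 / 0x1210]` style of WinEVTX messages.

```python
import io
import pandas as pd

# Tiny inline sample standing in for a real psort CSV export.
sample = io.StringIO(
    "datetime,source_long,message,parser\n"
    "2023-01-05T10:00:00+00:00,WinEVTX,[4624 / 0x1210] An account was logged on,winevtx\n"
    "2023-01-05T10:05:00+00:00,WinEVTX,[4625 / 0x1211] An account failed to log on,winevtx\n"
    "2023-01-05T11:00:00+00:00,Prefetch,Prefetch [EVIL.EXE] was executed,prefetch\n"
)

# With a real file this would be: df = pd.read_csv("supertimeline.csv", low_memory=False)
df = pd.read_csv(sample)

# Fix the timestamp column's dtype so time-based operations work later.
df["datetime"] = pd.to_datetime(df["datetime"], utc=True)

# Pull Windows event IDs (e.g. "[4624 / 0x1210]") into a dedicated column.
df["event_id"] = df["message"].str.extract(r"\[(\d{1,5}) /", expand=False)
```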
Checking the number of different event types we got.
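Something along these lines, assuming `df` is the timeline frame from the previous step (a small stand-in is built inline here):

```python
import pandas as pd

# Stand-in for the timeline DataFrame loaded earlier.
df = pd.DataFrame({"source_long": ["WinEVTX", "WinEVTX", "Prefetch", "NTFS USN Change"]})

# How many events of each type made it into the timeline.
type_counts = df["source_long"].value_counts()
print(type_counts)
```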
Distribution of event counts over time. This helps me identify any spikes or gaps in the timeline.
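Roughly like this, again with an inline stand-in for `df`; the daily bucket size is an arbitrary choice:

```python
import pandas as pd

# Stand-in timeline with the datetime column fixed up earlier.
df = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2023-01-05 10:00", "2023-01-05 10:20", "2023-01-06 09:00"], utc=True)
})

# Events per day; spikes hint at noisy activity, gaps at missing
# (or cleared) logs. In Jupyter, .plot() renders the chart inline.
per_day = df.set_index("datetime").resample("D").size()
# per_day.plot(kind="bar")
```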
Analyzing logon events
Here I’m focusing on the logon event IDs 4624 and 4625, first extracting interesting values into dedicated columns. Kerberos and RDP events would be another good set to analyze here.
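A sketch of the extraction. The field names inside real 4624/4625 messages vary between plaso versions, so the regexes below are illustrative and will likely need tuning against your own export:

```python
import pandas as pd

# Stand-in for the timeline with event_id already extracted; real messages
# are long structured strings, shortened here for illustration.
df = pd.DataFrame({
    "event_id": ["4624", "4625", "4624"],
    "message": [
        "... TargetUserName: admin ... WorkstationName: WS01 ... LogonType: 10 ...",
        "... TargetUserName: guest ... WorkstationName: WS02 ... LogonType: 3 ...",
        "... TargetUserName: svc ... WorkstationName: WS01 ... LogonType: 2 ...",
    ],
})

# Keep only the logon events and pull the interesting fields into columns.
logons = df[df["event_id"].isin(["4624", "4625"])].copy()
logons["username"] = logons["message"].str.extract(r"TargetUserName:\s*(\S+)", expand=False)
logons["hostname"] = logons["message"].str.extract(r"WorkstationName:\s*(\S+)", expand=False)
logons["logon_type"] = logons["message"].str.extract(r"LogonType:\s*(\d+)", expand=False)
```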
Summarizing logon type 3 and 10 events by username and hostname to get a grasp of all the remote logons.
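Something like the following, assuming the `logons` frame with extracted columns from the previous step:

```python
import pandas as pd

# Stand-in for the logons frame built in the previous step.
logons = pd.DataFrame({
    "username": ["admin", "admin", "guest"],
    "hostname": ["WS01", "WS01", "WS02"],
    "logon_type": ["10", "3", "2"],
})

# Remote logons only: type 3 (network) and 10 (RDP),
# summarized per user/host pair, busiest first.
remote = logons[logons["logon_type"].isin(["3", "10"])]
summary = remote.groupby(["username", "hostname"]).size().sort_values(ascending=False)
```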
Comparing the number of remote logon events performed by different users to identify outliers.
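A minimal version of that comparison, again on an inline stand-in for the remote-logon subset:

```python
import pandas as pd

# Stand-in for the remote logons filtered above.
remote = pd.DataFrame({"username": ["admin", "admin", "admin", "guest"]})

# A user with far more remote logons than the rest stands out immediately.
per_user = remote["username"].value_counts()
# per_user.plot(kind="barh")  # inline chart in Jupyter
```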
Looking at the remote logon event count over time. This can easily be modified to visualize a specific user or hostname.
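Sketched below with an hourly bucket (an arbitrary choice); the commented filter line shows how to narrow it to one account:

```python
import pandas as pd

# Stand-in for the remote logons with their timestamps.
remote = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2023-01-05 10:00", "2023-01-05 23:00", "2023-01-06 01:00"], utc=True),
    "username": ["admin", "admin", "guest"],
})

# Remote logons per hour; swap in a filter such as
# remote[remote["username"] == "admin"] to chart a single user.
per_hour = remote.set_index("datetime").resample("h").size()
# per_hour.plot()
```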
Quantitative comparison of different logon types.
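For example:

```python
import pandas as pd

# Stand-in for the logons frame with the extracted logon_type column.
logons = pd.DataFrame({"logon_type": ["2", "3", "3", "10"]})

# How the logon types compare in volume:
# 2 = interactive, 3 = network, 10 = RDP, among others.
type_counts = logons["logon_type"].value_counts()
# type_counts.plot(kind="pie")
```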
Summarizing failed logon attempts.
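A sketch, grouping the 4625 events by account name:

```python
import pandas as pd

# Stand-in for the logons frame with event_id and username columns.
logons = pd.DataFrame({
    "event_id": ["4625", "4625", "4624"],
    "username": ["admin", "admin", "guest"],
})

# Failed attempts (4625) per account; many failures against one
# name can indicate password guessing or brute forcing.
failed = logons[logons["event_id"] == "4625"]
failed_by_user = failed["username"].value_counts()
```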
Checking out Prefetch
Finding rare executions in Prefetch. The rarest executable names here are often worth looking at more carefully, especially if the timeline extends over a longer period of time.
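Roughly like this; the regex assumes the `[NAME.EXE]` convention in plaso’s Prefetch messages and may need adjusting for your version:

```python
import pandas as pd

# Stand-in for the timeline with Prefetch rows.
df = pd.DataFrame({
    "source_long": ["Prefetch", "Prefetch", "Prefetch", "WinEVTX"],
    "message": [
        "Prefetch [CMD.EXE] was executed",
        "Prefetch [CMD.EXE] was executed",
        "Prefetch [EVIL.EXE] was executed",
        "irrelevant",
    ],
})

# Pull executable names out of Prefetch messages, rarest first.
pf = df[df["source_long"] == "Prefetch"].copy()
pf["executable"] = pf["message"].str.extract(r"\[(.+?)\]", expand=False)
rare = pf["executable"].value_counts(ascending=True).head(20)
```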
Quickly checking out a binary name of interest without needing to switch to Timeline Explorer just yet.
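A simple case-insensitive substring match does the job; the binary name here is just a placeholder:

```python
import pandas as pd

# Stand-in for the full timeline.
df = pd.DataFrame({"message": ["Prefetch [EVIL.EXE] was executed", "something else"]})

# All timeline rows mentioning the binary of interest, case-insensitively;
# na=False keeps rows with missing messages from raising.
hits = df[df["message"].str.contains("evil.exe", case=False, na=False)]
```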
Pivoting around a specific timestamp while still in Jupyter.
This is normally something you’d do in Timeline Explorer, but this snippet may be useful for some narrowed-down visualizations while still here.
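A sketch of that pivot: everything within a chosen window around a timestamp of interest (both the pivot value and the five-minute window are arbitrary placeholders):

```python
import pandas as pd

# Stand-in for the timeline with parsed timestamps.
df = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2023-01-05 10:00", "2023-01-05 10:04", "2023-01-05 12:00"], utc=True),
    "message": ["event a", "event b", "event c"],
})

# All events within five minutes of the pivot timestamp, oldest first.
pivot = pd.Timestamp("2023-01-05 10:02", tz="UTC")
window = pd.Timedelta(minutes=5)
around = df[df["datetime"].between(pivot - window, pivot + window)].sort_values("datetime")
```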