robswain.au

Writing

IT posture, infrastructure fundamentals, and the patterns that keep showing up.

A place to collect thinking on IT posture, infrastructure fundamentals, and the kinds of problems that appear in practice across different organisations and sectors. Most of it is observational. Some of it is opinionated. None of it is sponsored.

Posts

  1. The cause is usually older than the trigger, and sometimes the cause is you. 12 Apr 2026

    The worst change I ever made to a production environment took months to fully surface, and I'd made it deliberately, carefully, and for good reasons.

    Microsoft had renamed the user shell folder from Documents to My Documents, or back the other way depending on which era you're counting from, and the Group Policy that redirected user folders needed updating to reflect the new path. I made the change. It applied cleanly. Spot checks looked fine. New logons picked up the new path, redirection worked, files were where they were supposed to be, and the change moved out of my head and into the pile of things that were done.

    The problems started turning up weeks later, and none of them looked like a Group Policy problem. A user couldn't find a file they were sure they'd saved. Another user's application was writing to a path that no longer existed for anyone else. A third user had two folders with overlapping contents and no clear sense of which was current. Each one looked like user error or an application quirk in isolation, and each one got handled in isolation, which is the worst possible way to handle a pattern.

    The mechanism was that the rename hadn't been clean for every profile. Existing profiles, profiles that had logged on before the change, profiles that had been roamed, profiles that had been touched by older policies still cached somewhere, all behaved slightly differently from the freshly minted ones I'd tested against. Some users were reading from one path and writing to another. The redirection was working exactly as configured. The configuration was correct. The environment underneath it wasn't uniform enough for "correct" to mean the same thing for every user.
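
    As a rough illustration of the kind of check that would surface this drift, here is a minimal sketch in Python (Windows, standard library only) that compares the Documents path a profile actually resolves against the path redirection is supposed to point at. The UNC path is a hypothetical placeholder rather than the real policy value; the "User Shell Folders" key and its "Personal" value are where Windows records the per-profile Documents location.

      import os
      import winreg

      # Hypothetical redirection target for the example, not a real policy value.
      EXPECTED = r"\\fileserver\users\%USERNAME%\Documents"

      def resolved_documents_path():
          # Windows keeps the per-profile Documents location under this key;
          # the value name for Documents is "Personal".
          key_path = r"Software\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders"
          with winreg.OpenKey(winreg.HKEY_CURRENT_USER, key_path) as key:
              raw, _type = winreg.QueryValueEx(key, "Personal")
          return os.path.expandvars(raw)

      actual = resolved_documents_path()
      expected = os.path.expandvars(EXPECTED)
      if os.path.normcase(actual) == os.path.normcase(expected):
          print(f"OK: Documents resolves to {actual}")
      else:
          print(f"MISMATCH: policy expects {expected}, profile resolves {actual}")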

    What made it take months to surface was that nothing failed loudly. No error dialogs, no event log entries that pointed at the policy, no helpdesk ticket that said "since the Group Policy change". In fact, the helpdesk started doing "Profile Resets", which caused even more harm, including user data loss.

    The symptoms were all downstream and all human: missing files, confusion, duplicated work, the slow erosion of trust in the file server. By the time I traced it back to the change I'd made, the incident had stopped being a Group Policy problem and started being a data hygiene problem across dozens of profiles, which is a much harder thing to fix than the original change would have been to roll back.

    The lesson I took from it, and the reason I still think about it nearly 15 years later, is that a change being technically correct is not the same as a change being safe in the environment you're applying it to. The test environment was uniform. Production wasn't. Nothing in my process at the time was designed to find that difference before users did.

    The technical trigger is always recent. The cause is usually older than the trigger, and sometimes the cause is you.

  2. Event Logs and the Story Behind Every Incident. 1 Apr 2026

    This one is close to my heart and probably the most important thing I would say to anyone starting out in IT.

    Over the years, I've learned that most systems are already telling you what went wrong. Not in summaries or dashboards, but in the event logs, quietly and often long before anyone notices there's a problem.

    When incidents escalate, it is common to see teams restart services, reapply configurations or focus on the last visible failure. The pressure to "do something" is understandable, but it often skips over the place where the full story already exists.

    Event logs are rarely neat. They're noisy, repetitive, and sometimes frustrating to interpret, but taken together they describe sequence, timing, dependency and cause in a way no single alert ever can.

    With experience, patterns start to stand out: the same warnings appearing before different failures, the same authentication errors preceding broader outages, and the same timing gaps that point back to an earlier dependency failure rather than the symptom everyone is reacting to.

    Reading logs well isn't about memorising event IDs or knowing every subsystem in advance. It's about learning how systems narrate their own behaviour, and trusting that narrative even when it contradicts first impressions.

    I've lost count of the number of times an incident felt complex and multi-dimensional, only for the logs to show a very ordinary sequence of events once they were read in order.
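
    As a rough sketch of what "read in order" looks like in practice, the snippet below (Python, standard library only) merges exported entries from several logs into a single timeline. The JSON-lines layout, field names and file names are assumptions for the example, not any particular tool's output.

      import json
      from datetime import datetime
      from pathlib import Path

      def load_entries(path):
          # One JSON object per line, with assumed "time", "source" and "message" fields.
          for line in Path(path).read_text(encoding="utf-8").splitlines():
              entry = json.loads(line)
              entry["time"] = datetime.fromisoformat(entry["time"])
              yield entry

      def merged_timeline(paths):
          # Interleave every source by timestamp so the sequence, not the severity,
          # drives the reading order.
          entries = [e for p in paths for e in load_entries(p)]
          return sorted(entries, key=lambda e: e["time"])

      # Placeholder file names standing in for whatever exports are to hand.
      for e in merged_timeline(["system.jsonl", "application.jsonl", "security.jsonl"]):
          print(f'{e["time"].isoformat()}  {e["source"]:<12}  {e["message"]}')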

    The fundamentals show up here again. Time matters, identity matters and dependencies matter. The logs reflect all of it, if you're willing to slow down and listen.

    For me, this is where troubleshooting stopped feeling reactive and started feeling deliberate, not because problems disappeared, but because the story was already there, waiting to be read.

Most of the writing lives on LinkedIn. The posts here are a selection of the longer or more reference-worthy pieces.

Let's talk.

Interested in working together, or just want to connect? Drop me a line and I'll get back to you.

rob@robswain.au