Build With Me

A Data Scientist's Workshop

Building real things with AI, open data, and a security mindset.
From messy datasets to deployed products. NYC-based.

Let's Learn Together

Welcome to the workshop. The world is moving fast — and technology is moving even faster. There's a lot to learn: new things, old things, things that didn't exist six months ago. I'm a data scientist based in NYC with a background in cybersecurity, and I'm here to learn alongside you. We'll work through building real tools together, varying in scope and difficulty, and where applicable, I'll show exactly how I use LLMs and AI during the process because they're part of how this work gets done now. So here we are. Welcome.


Here's what I believe: We are living through a genuine shift in how technical work gets done. Tools like ChatGPT, Claude, and the broader ecosystem of AI assistants have changed the game. I started my career right before this wave hit, which means I'm in a strange middle ground. I didn't get decades of deep programming fundamentals before AI showed up. But I also know enough to understand that "just have AI do it" is not always a feasible strategy.

A lot of people will tell you that learning to code properly doesn't matter anymore. I disagree. Not because everyone needs to be a software engineer, but because knowing how to think through a problem is what makes AI tools actually useful. If you can't break a problem down, you can't prompt your way out of it. If you don't understand what a deployment pipeline does, you can't ask an AI to build one for you. The fundamentals still matter very much.

what this blog is about
# Not just code. The full picture.

$ find /real-world-problems -type f
  ./messy-data/cleaning-and-parsing
  ./ai-tools/using-them-to-actually-build
  ./deployment/getting-it-into-production
  ./security/because-that-matters-too
  ./problem-solving/the-real-skill

$ echo "Let's build."
  Let's build.

So what am I actually going to build? The good news is that I have a few projects lined up and a few ideas to talk through based on recent discoveries. Here's the roadmap:

Project Roadmap
01

Contributing to Apache Flink

Taking on FLINK-25672, a known limitation in Flink's DataStream filesystem connector: when the source runs unbounded, the enumerator keeps every processed file path in state, and that state grows forever. This series walks through a real open-source contribution end to end: reading the codebase, reproducing the issue, designing a compressed tracking approach, and submitting the fix upstream.

Apache Flink Open Source Data Engineering Stateful Streaming
02

Building Security Agents for AI-Accelerated Offense

Anthropic recently published “Preparing Your Security Program for AI-Accelerated Offense,” a piece about how defenders need to rethink their programs as attackers start moving at AI speed. It's a compelling vision, but it's still a vision. I'm going to put it to the test: take the use cases they describe and actually build them out as working agents, documenting what's realistic today, what breaks, and where the gaps between blog post and production really are.

AI Agents Cybersecurity Claude API Defensive Security

Every project will be built in the open. I'll share the code, the data sources, the mistakes, and the decisions. If you want to build along with me, you'll have everything you need.

Who is this for? Everyone is welcome to take a look around and have some fun reading through it.

Welcome to the workshop. Let's build something.

Coming Up Next
Flink Series · Post 2

Understand the Codebase

Following the problem write-up, this post digs into Flink's FileSource internals: tracing how the enumerator discovers files, where processed-path state lives, and which classes we'll need to touch to fix FLINK-25672.

Standalone Post

Verizon DBIR: How to Read It

The DBIR is dense. This is a practical guide to reading it: which sections are worth your time, how to interpret the methodology, and how to apply the findings to your own environment.

Strengthening Cyber Defense with AI: Lessons from the 2026 Threat Landscape

This post is based on a talk I gave at the Women in Data Sciences (WiDS) NYC event in April 2026. The findings draw from the 2026 IBM X-Force Threat Intelligence Index, an annual report based on data from thousands of real security incidents that the IBM X-Force team responded to across the globe. This post expands on that talk and connects it to something that landed the day before I presented: Anthropic's blog post on preparing your security program for AI-accelerated offense. What struck me was how directly their recommendations mapped to the problems I was already planning to discuss. The threats are real, the solutions are emerging, and as security professionals we must find a way to stay ahead of them.


The Common Thread: Security Basics Are Still Broken

If you follow the cybersecurity news cycle, you'd think the biggest threats are prompt injection attacks and deepfake doomsday scenarios. To be clear, those are real concerns; bad actors do use those techniques to take advantage of people. But the X-Force Threat Intelligence Index points to more pressing threats, and it also surfaces other, more pertinent ways AI can be used against us.

The majority of incidents that IBM's X-Force team responded to last year weren't caused by anything exotic. They were caused by the basics not being done:

  • passwords being reused
  • authentication controls that were weak or missing entirely
  • organizations not having clear policies around AI use
  • teams not even knowing what assets they have

That gap between what makes headlines and what's actually causing breaches is the backdrop for everything that follows. The 2026 threat landscape is punishing organizations for basic security hygiene failures, and attackers are increasingly using AI to exploit those gaps faster than humans can close them.

Here is what I find compelling: the same AI capabilities being used against us can be turned around to strengthen defense. For the talk and this article, I pulled three lessons from the 2025 threat data to illustrate this. For each one, I want to show the same thing: here's the threat, here's how AI makes it worse, and here's how we can use AI to fight back.

Lesson 1: The Attack Surface Is Exploding

The threat. In previous years, the X-Force Threat Intelligence Index listed valid credentials (use of valid accounts) as the leading initial access vector, but that changed in 2025. X-Force observed a 44% increase in attacks that began with the exploitation of public-facing applications: customer portals, APIs, web apps. Exploiting vulnerabilities in internet-facing software is now the number one way attackers gain initial access, overtaking stolen credentials for the first time in years.

How AI makes it worse. AI is squeezing organizations from both sides. On the development side, AI-generated code is introducing more vulnerabilities. Veracode's 2025 GenAI Code Security Report tested over 100 large language models and found that AI-generated code contains roughly 2.7 times more vulnerabilities than human-written code. Meanwhile, Georgia Tech's Vibe Security Radar project tracked CVEs directly caused by AI coding tools and reported finding 56 in the first three months of 2026, with 35 coming from March alone.

On the attack side, AI is helping adversaries find and exploit those vulnerabilities faster. In 2025, over 32% of vulnerabilities were exploited on or before the day the CVE was publicly disclosed, and AI-powered scanning reached 36,000 scans per second. The window between a vulnerability existing and an attacker exploiting it is collapsing fast.

The chain is simple: an organization builds a piece of software (possibly with AI, which introduces more flaws), the software goes live on the internet, attackers use AI to scan for known flaws at massive scale, the scanner finds a match, and the attacker exploits it to get in. I believe this combination of more flaws being created alongside faster scanning to find them is a significant contributor to the 44% jump we see in the X-Force Index.

How can we use AI to curb this. Security teams have a tough job in front of them. The Forum of Incident Response and Security Teams (FIRST), in its 2026 Vulnerability Forecast, predicted a median of 59,000 new CVEs this year.

The good news is that real systems are already being built to solve this. CrowdStrike built an Exposure Prioritization Agent (that works alongside ExPRT.AI) that uses live data to answer questions like “how could a bad actor use this vulnerability?” and “what's the business impact?”, then delivers customers a prioritized list of what to fix first.

In Anthropic's recent blog post on preparing for AI-accelerated offense, the first recommendation is to close the patch gap: use EPSS (Exploit Prediction Scoring System) to prioritize, automate deployment, and reduce time-to-patch on internet-exposed systems. The post also covers what we just discussed, which is to expect more strain on your vulnerability processes. It lays out a playbook for using AI, and one example I want to highlight is AI-powered vulnerability scanning. Traditional code scanners are rule-based: they check your code against a library of known vulnerability patterns. AI-powered scanning works differently: instead of pattern matching, an agent reads and reasons through your code the way a human security researcher would. Anthropic's recommendation is straightforward: build or adopt an AI agent that scans your own codebase before a bad actor does. In practice, this means pointing an LLM at your codebase in a contained environment, having it find vulnerabilities, and keeping a human in the loop to verify the findings before acting on them.
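To make that scanning idea concrete, here's a minimal, hypothetical sketch of the prompt-building half of such an agent: it walks a repository and prepares one review prompt per source file. `request_review` is a placeholder for whichever LLM client you wire in, and the template wording is my own invention, not Anthropic's.

```python
from pathlib import Path

# Placeholder for your LLM client call (e.g. an SDK from your model provider).
# It takes a prompt and returns the model's review text; wire in your own client.
def request_review(prompt: str) -> str:
    raise NotImplementedError("connect your LLM client here")

REVIEW_TEMPLATE = (
    "You are a security reviewer. Read the following source file and list any "
    "vulnerabilities (injection, path traversal, unsafe deserialization, etc.) "
    "with line references and a severity guess.\n\n--- {name} ---\n{code}"
)

def build_scan_prompts(repo_root: str, suffixes=(".py",), max_chars=12_000):
    """Walk a codebase and build one review prompt per matching source file.

    Files larger than max_chars are truncated so each prompt fits a typical
    context budget; a production scanner would chunk files instead.
    """
    prompts = []
    for path in sorted(Path(repo_root).rglob("*")):
        if path.suffix in suffixes and path.is_file():
            code = path.read_text(errors="replace")[:max_chars]
            prompts.append(REVIEW_TEMPLATE.format(name=path.name, code=code))
    return prompts
```

The human-in-the-loop part lives outside this sketch: each `request_review` result would land in a queue for an analyst to verify before anything is filed or fixed.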

The X-Force findings tell us what the problem is, and Anthropic's recommendations show how AI can help us act faster. The hard part is implementing them quickly.

Lesson 2: Your Software Supply Chain Is Your Attack Surface

The threat. Most organizations today don't build all their software in house. They rely on platforms, libraries, packages, and services from third-party suppliers. That interconnectedness is what makes modern software possible, but it's also what makes it fragile. The X-Force report tracked this over five years and found that major supply chain and third-party breaches have nearly quadrupled. Attackers target open-source registries like npm and PyPI, exploiting developer trust. One compromised component can propagate across thousands of projects.

How AI makes it worse. The AI supply chain adds a new layer. Organizations aren't just pulling in traditional software dependencies anymore. They're pulling in training data, pre-built models, plugins, skills, and AI agents, all from third parties. When you download a pre-trained model from Hugging Face, you're trusting that the weights are safe, that the training data was clean, and that nobody has tampered with it. But you didn't train it, so you can't be certain what went into it. That's a supply chain decision. The chain is getting much more complex and harder to trace, and AI adoption is accelerating that.
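One small, concrete habit that helps here: verify checksums on anything you download, model weights included. It won't tell you the training data was clean, but it does catch tampering in transit or on the mirror. A minimal sketch using only Python's standard library:

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-gigabyte model weights
    never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path: str, published_checksum: str) -> bool:
    """Compare a downloaded artifact against the checksum the publisher lists."""
    return sha256_file(path) == published_checksum.lower().strip()
```

Many model hubs publish per-file hashes alongside the weights; comparing against those is the supply-chain equivalent of checking the seal on a package.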

How can we use AI to curb this. The core defensive need is visibility: knowing what's in your software and what happens if a piece of it breaks, or worse, is compromised. A lot of organizations don't have this picture.

When it comes to the software supply chain, AI can help in ways that go beyond what traditional tools offer. Anthropic's blog lays out a few practical approaches that stood out to me. The first is using AI to identify redundancy in your dependencies. Most large codebases accumulate multiple libraries doing the same job: multiple HTTP clients, multiple JSON parsers, and each one extends the attack surface for no functional gain. Anthropic recommends pointing an LLM at your dependency file and asking which packages overlap and what consolidation would look like. Fewer dependencies means fewer things that can be compromised.
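As an illustration of the redundancy idea, here's a deterministic toy version. In practice you'd have an LLM classify your dependencies by role; the sketch below stands in for that with a hand-maintained role map (my own invention, not a real taxonomy) and flags packages doing the same job.

```python
# Illustrative only: a hand-maintained map of packages that fill the same role.
# In practice an LLM (or your own audit) would build and maintain this mapping.
ROLE_MAP = {
    "requests": "http-client", "httpx": "http-client", "urllib3": "http-client",
    "simplejson": "json", "ujson": "json", "orjson": "json",
    "pyyaml": "config", "toml": "config",
}

def find_overlaps(requirements_text: str) -> dict:
    """Group declared dependencies by role and return only the roles
    covered by more than one package (each extra one is attack surface)."""
    roles = {}
    for line in requirements_text.splitlines():
        name = line.split("==")[0].split(">=")[0].strip().lower()
        if not name or name.startswith("#"):
            continue
        role = ROLE_MAP.get(name)
        if role:
            roles.setdefault(role, []).append(name)
    return {role: pkgs for role, pkgs in roles.items() if len(pkgs) > 1}
```

Feeding this a requirements file that declares both `requests` and `httpx` would flag a single `http-client` overlap, which is exactly the kind of consolidation question worth asking.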

The second is using AI to replace dependencies that are no longer maintained. Some packages your software relies on may have no active maintainer, no recent updates, and no commitment to patching vulnerabilities. Rather than continuing to depend on them, Anthropic recommends having an LLM rewrite the specific functionality you actually use from that package. The LLM can scan the package's codebase and replicate the functionality. This way you replace a risky third-party component with code you now control, thereby removing that link from the supply chain entirely.
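To make that concrete with a deliberately tiny example: suppose the only thing you use from a hypothetical unmaintained 'leftpad'-style package is one padding function. Rewriting it in-house might look like this:

```python
def left_pad(text: str, width: int, fill: str = " ") -> str:
    """In-house replacement for the single function we used from a
    hypothetical unmaintained padding package. A few lines we now own
    and can patch, instead of one more third-party link in the chain."""
    if len(fill) != 1:
        raise ValueError("fill must be a single character")
    return text.rjust(width, fill)
```

Real cases are rarely this small, but the trade is the same: you accept maintaining a little more code in exchange for removing an unmaintained dependency from your supply chain.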

Non-AI tools like OpenSSF Scorecard can also help audit the security of your open-source dependencies.

Like the previous lesson, the X-Force report quantifies the risk while Anthropic's recommendations show how AI-powered tooling can address it.

Lesson 3: More Attackers, More Noise, Same You

The threat. In the past couple of years, law enforcement has had real success dismantling the big ransomware gangs; REvil, for example, was taken down in 2021/2022. That's positive for the community, but when you break up a large criminal operation, the factions disperse into smaller groups. The X-Force report identified 109 active ransomware or extortion groups in 2025, a 49% increase from the year before. These smaller groups are less resourced, but there are a lot more of them, and they're harder to track because they use shared tooling and overlapping tactics.

How AI makes it worse. We can't prove that the increase in ransomware groups is because of AI; the fragmentation happened because of law enforcement action. That said, AI likely sustains it by lowering the barrier to operate. In March 2026, IBM X-Force published research on “Slopoly,” AI-generated malware found during a real ransomware investigation. The script was technically mediocre, probably produced by a less advanced model, but it worked. The attackers, a group called Hive0163 known for major global ransomware attacks, used the malware to maintain persistent access for over a week. As the X-Force analysis concluded: “AI-generated malware doesn't pose a new or sophisticated threat from a technical standpoint. What it does is disproportionately enable attackers by reducing the time needed to develop and execute an attack.”

How can we use AI to curb this. When you go from tracking a handful of major ransomware groups to 109, the volume of threat intelligence explodes. More groups means more indicators of compromise, more tactics and techniques to catalog, more reports to read, more alerts to triage. The people doing this work are our colleagues: the threat intelligence analysts, detection engineers, and threat hunters. This volume of intel can be overwhelming, and we can use AI to sift through a good chunk of it.

This is where AI plugs in most directly. An AI system can take an indicator from a threat feed, say an IP address, and check it against internal telemetry. Has it appeared in your logs? When? What happened when it did appear? When analysts write up incident reports, they need to map attacker behavior to the MITRE ATT&CK framework and LLMs can be used to perform this task. Across all the feeds, alerts, and disclosures, AI can filter, prioritize, and surface what actually needs human attention.
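The indicator-matching step at the heart of that workflow can be sketched in a few lines. A real implementation would query your SIEM; this toy version, with names of my own choosing, just scans an iterable of raw log lines for known indicators:

```python
import re

def find_ioc_hits(log_lines, iocs):
    """Return (line_number, ioc, line) for every log line containing a
    known indicator of compromise.

    Indicators are matched literally (re.escape) so dots in IPs and
    domains aren't treated as regex wildcards.
    """
    patterns = {ioc: re.compile(re.escape(ioc)) for ioc in iocs}
    hits = []
    for lineno, line in enumerate(log_lines, start=1):
        for ioc, pat in patterns.items():
            if pat.search(line):
                hits.append((lineno, ioc, line))
    return hits
```

The AI layer sits on top of this kind of retrieval: once the raw hits exist, a model can summarize when the indicator appeared, what happened around it, and whether it warrants escalation.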

Anthropic's blog makes some direct agent recommendations. They suggest putting a model at the front of your alert queue, giving every inbound alert an automated first-pass investigation before a human sees it. They describe an AI “triage agent” with read-only access to your SIEM that can direct attention to the alerts requiring human judgment. And they recommend using AI as an incident scribe and parallel investigator during active incidents: letting the agent take notes, capture artifacts, pursue parallel investigation tracks, and draft postmortems can be an immense timesaver.

Their practical advice is also worth noting for this lesson: “pick one noisy alert rule with a high false positive rate, wire a model into its alert stream with read-only access, have it produce a structured disposition for every firing, and measure agreement against a human reviewer for two weeks. Start small, prove it works, expand from there.”
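The “measure agreement against a human reviewer” step is easy to implement once you log both sides' verdicts. A minimal sketch, assuming you've collected each side's dispositions keyed by alert ID (the label names here are my own):

```python
def agreement_rate(model_dispositions, human_dispositions):
    """Fraction of alerts where the model's disposition matched the
    human reviewer's.

    Both inputs map alert IDs to a label such as 'benign' or 'escalate';
    only alerts reviewed by both sides are counted.
    """
    shared = set(model_dispositions) & set(human_dispositions)
    if not shared:
        return 0.0
    agree = sum(
        1 for a in shared if model_dispositions[a] == human_dispositions[a]
    )
    return agree / len(shared)
```

Running this daily over the two-week trial gives you a single trend line to decide whether the triage agent has earned a bigger slice of the queue.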

The Pattern

What struck me most when I read Anthropic's blog the day before my talk was how clearly the recommendations mapped to the problems the X-Force data was surfacing. Two completely independent sources, one an annual threat report based on thousands of real incidents, the other a set of security recommendations from an AI company based on what it has learned using frontier models to secure real systems, were pointing in the same direction.

The vulnerability flood needs AI-powered prioritization and scanning. The supply chain complexity needs AI-powered dependency mapping and auditing. The threat intelligence overload needs AI-powered triage, classification, and summarization.

At WiDS, I framed this through the lens of data science, asking where people with our skills plug into these problems. I will admit, though, that the broader point holds regardless of your role. AI is accelerating both offense and defense in cybersecurity, and the organizations and practitioners that adopt AI-powered defensive tooling will be better positioned than those that don't. The threats the X-Force report documents are not going away; they will likely be back in the report next year, and I reckon we will see the same trends by the time the Verizon DBIR rolls around. The good news is that the tools we can use to fight back are readily available, but as security practitioners we need to act fast.

The same AI that is being used against us can be used to defend us.

Hey, I'm Sophia.

I'm a security data engineer based in NYC who spends most of her days building the pipes, tools, and interfaces that help security teams surface relevant information in mountains of telemetry. I am excited about working on things that sit right at the intersection of data engineering, data science, and cybersecurity.

I started Build With Me because I wanted a place to share projects I'm working on, things I'm learning, thoughts on the tools and problems I find interesting. Sometimes that'll be a deep dive into a blog post released by a frontier AI company. Other times it might be me picking up something totally new and documenting the messy middle of figuring it out.

I'm also using this blog to show what I'm capable of and how I think as I figure stuff out. Mostly, I just like building things, and I'd rather do it where people can follow along.

Open Source Contributions
Public Appearances & Speeches
AI.dev · Cassandra Summit · 2023

Watch the talk on YouTube

WiDS NYC · Apr 11, 2026

Strengthening Cyber Defense with AI: Lessons from the 2026 Threat Landscape