[{
    "title": "Why Your AI Agent Keeps Making the Same Mistakes",
    "url": "/blog/2026/06/15/why-your-ai-agent-keeps-making-the-same-mistakes/",
    "date": "Jun 15, 2026",
    "tags": ["AI","Productivity","Agents","Automation","Claude Code"],
    "excerpt": "My AI agent used to guess people’s names from their email addresses. It saw an email handle and confidently produced a full name in a meeting brief - wrong person entirely. In the same session, it did the same thing with another handle.",
    "content": "My AI agent used to guess people’s names from their email addresses. It saw an email handle and confidently produced a full name in a meeting brief - wrong person entirely. In the same session, it did the same thing with another handle.I corrected it. The agent fixed the output. Next session, it guessed again.This is the frustrating part about working with AI agents: they don’t remember being wrong. You correct the same mistakes across sessions, and every tomorrow starts fresh. I decided immediately that this needed a real fix - /learn was one of the first skills I invested in.The problem isn’t intelligence, it’s amnesiaAI agents are good at following instructions in the moment. If you say “don’t guess names from email handles, look them up in the directory,” it will do exactly that - for the rest of the conversation. Next session, it has no idea you ever said it.This isn’t a flaw - it’s how LLMs work. Under the hood, a language model is a function: it takes a sequence of tokens and predicts the next one based on patterns frozen into its weights during training. There’s no feedback loop, no parameter update at inference time - just matrix multiplications over whatever’s in the context window. The model’s knowledge is static; its memory is the prompt. Your correction lives in one conversation’s context and vanishes when that session ends. Fine-tuning could theoretically bake corrections into the model’s weights, but it’s too slow, too expensive, and too broad for individual preferences. You shouldn’t have to retrain a model just because it guessed someone’s name wrong.The fix seems obvious: write the correction down somewhere the agent reads at session start. But that raises questions:  Where does the rule go? A project config file? A memory file? A specific skill’s instructions?  How do you phrase it so the agent applies it correctly in new contexts?  How do you know it actually worked?  
What happens when you have 50 rules and some of them contradict each other? Over time, I built a system that handles this. It has three stages: correct, codify, verify. The first two are about capturing and persisting corrections. The third - which came later as the rule count grew - is about making sure they actually stuck. Correct (the conversation): Every correction starts in conversation. Some are explicit - “No, don’t do that,” “Stop summarizing what you just did, I can read the diff.” Some are implicit - I rewrite the agent’s output, or I ignore a suggestion and do something different. Both carry signal. Explicit corrections are obvious. Implicit ones reveal preferences you haven’t articulated yet. Then there are positive confirmations. “Yes, exactly like that.” Accepting an unusual approach without pushback. These are easy to miss. If you only capture mistakes, you avoid past errors but drift away from approaches that already work. Codify (the learning): At the end of a session with corrections, I run /learn - a skill that reviews the conversation, identifies corrections, and asks three questions about each one: Is this generalizable? “Change this sentence” is not. “Don’t use markdown blockquotes for text I need to copy-paste” is. What’s the scope? A rule about Jira triaging belongs in the triage skill’s instructions. A rule about communication style belongs in a feedback memory. A rule that applies everywhere belongs in the project config. Is there a why? “Don’t use blockquotes for copy-paste” is a weak rule. “Don’t use blockquotes for copy-paste - the &gt; characters get included when selecting text, requiring manual cleanup before pasting” is strong. The why lets the agent judge edge cases instead of following the rule blindly. Each correction gets routed to exactly one place. Global behavior goes to the project config (AGENT.md). Personal preferences go to feedback memory files. Skill-specific quirks go to that skill’s instructions. 
Domain facts go to knowledge files. The output is a structured summary showing what was codified, where, and the exact text added. I review everything before it’s final. Routing is the hard part: At this point I have dozens of feedback memories. Getting the routing right matters more than getting the rule right. A rule in the wrong place either gets ignored (too narrow) or creates noise (too broad). Some real routing decisions:  “Don’t use em dashes” goes in the writing style memory, not the project config - it’s about my voice, not agent behavior  “Always include clickable links in output” goes in the project config as a protocol rule - it applies to every skill  “Search all Gmail pages, don’t stop at the first batch” goes in the inbox triage skill - it’s specific to how that skill processes email  “Never mark tasks as done, only move them to review” goes in the project config as a non-negotiable rule - violating it breaks my review workflow across every skill  The pattern: if a rule applies to more than one skill, it goes in the project config or a memory file. If it applies to one context, it goes in that skill. What a feedback memory looks like: Each memory has three parts: the rule, the reason, and how to apply it. The email handle rule says: never guess names - always look them up in the directory or stakeholder map. The reason explains that wrong names in meeting briefs destroy credibility. The application section specifies the fallback chain: check stakeholders first, then directory lookup, then show the raw email handle rather than guessing. The reason isn’t decoration. It helps the agent decide what to do in situations the rule doesn’t explicitly cover. Verify (the system): This is the part most people skip. /learn captures individual corrections. But did they actually propagate? If I added “always include links” to the project config, does every skill that produces output actually include links? 
If I renamed a concept in one skill, did the other skills that reference it update too? /audit is a separate skill that scans the repo for drift and inconsistencies:  Cross-references: A skill says “save to the triages directory” - does it exist? A skill says “run /meeting-prep” - does that skill exist?  Contradictions: A memory file says one thing, a skill says the opposite. The project config documents a tool that was removed.  Drift: A count in a document fell out of date. The workflow guide no longer matches reality.  Staleness: Memory files reference things that no longer exist.  /learn is writing a test. /audit is running the test suite. I run /learn at the end of sessions with corrections and /audit weekly or after structural changes. Most audit runs find nothing - which is the point. The ones that do catch something prevent subtle bugs: a rule that contradicts another rule, a skill that references a renamed file, a count that drifted. What compounds: The first few weeks were clunky. Over time, corrections started getting rarer. Eventually, most sessions had zero. A few interesting observations: The agent improves at things I never explicitly corrected. The timezone rule (“always verify dates before stating them”) came from a few incidents where the agent guessed the day of the week wrong. But the underlying principle - don’t assert verifiable facts, look them up - started applying to other contexts too. Positive patterns compound faster than corrections. When I confirmed that bundling related changes into one PR was the right call, recording that confirmation meant the agent would default to it in similar situations. Without the positive signal, it might try something different next time. The loop can close itself. I codified rules that tell the agent to proactively suggest running /learn when it notices it was corrected multiple times, and /audit after structural changes. 
The learning system uses its own output to trigger itself - which is a satisfying kind of recursion. Curation is ongoing. Dozens of feedback memories is a lot of context to load at session start. Rules need consolidation, outdated ones need pruning. The system doesn’t maintain itself - but the maintenance cost is low compared to repeating corrections every session. The bigger picture: The agent’s intelligence isn’t the bottleneck. The system around it - rules, memory, verification - determines whether it’s useful day after day or just impressive in demos. This aligns with what others are finding: Adel Zaalouk built an attribution pipeline that traces failures to specific skills. The concept of harness engineering frames the system around the model - memory, skills, context management - as more important than what the model generates. Charity Majors argues that nondeterministic systems demand more engineering discipline, not less. The three-stage loop is simple: correct, codify, verify. Most people do the first. Some do the second. Almost nobody does the third. The verify step is what turns a collection of corrections into a system that actually learns. The template repo includes a basic version of /learn you can customize. /audit is something I built for my own setup as the rule count grew - once you have enough corrections to lose track of, you’ll want one too. Start correcting, start codifying, and don’t skip the verification."
  },{
    "title": "Build Your Own AI-Augmented Workflow",
    "url": "/blog/2026/06/14/build-your-own-ai-augmented-workflow/",
    "date": "Jun 14, 2026",
    "tags": ["AI","Productivity","Claude Code","MCP","Agents","Automation"],
    "excerpt": "A few weeks ago, I published a post about NirOps - my AI-augmented workflow for product management. The response was more than I expected. People reached out asking how to build something similar for their own roles.",
    "content": "A few weeks ago, I published a post about NirOps - my AI-augmented workflow for product management. The response was more than I expected. People reached out asking how to build something similar for their own roles.The honest answer was “it took months of daily iteration.” Not a great answer.So I extracted the underlying architecture into a template repo that anyone can clone, customize, and make their own. This post walks through the key concepts and how the pieces fit together.The problemKnowledge workers spend a huge chunk of their day on operational overhead. Scanning email, triaging Slack, prepping for meetings, tracking tasks, writing status updates, chasing down context that lives in six different tools. The actual thinking - the part you were hired for - gets squeezed into whatever time is left.AI assistants help, but most workflows look like this: open a chat window, paste some context, ask a question, copy the answer somewhere else. Each interaction starts from scratch. The AI doesn’t know your role, your tools, your preferences, or what you did yesterday. That’s not AI-first - that’s AI-assisted, and the gap between the two is enormous.My own system is built around product management. But when people asked how to build something similar, I realized the underlying architecture has nothing to do with PM. The skills are role-specific; the infrastructure, memory system, and trust model are generic. So I extracted the foundation into a template that works for any knowledge worker role.The approachThe system I built uses Claude Code as the agent and MCP (Model Context Protocol) to connect it to real tools - email, calendar, chat, task management, documents. The key idea is that the agent operates with the same tools you use, but with explicit boundaries and a human-as-approver model.Here’s the architecture:Everything runs locally. 
The MCP servers connect to APIs you’re already using - no new cloud services, no data leaving your machine except the API calls themselves. I use Claude Code, but the architecture - MCP servers, skills as markdown, memory files, tool allowlists - is portable to any MCP-compatible agent or coding assistant. Six layers: The system has six layers, each doing a different job. 1. Rules (CLAUDE.md): CLAUDE.md is the project instruction file that Claude Code reads at every session start. It’s where you define the agent’s behavior, boundaries, and knowledge. The most important part is the protocol rules - non-negotiable behaviors that prevent real damage. For my setup:  Never send messages (Slack and Gmail are read-only - the agent drafts, I send)  Never create issues in external trackers without confirmation  Never mark tasks as done (only I can close the review loop)  These aren’t limitations - they’re the trust model that makes the system usable for real work. If the agent could send messages on my behalf, I’d never trust it to triage my inbox. Beyond protocol rules, CLAUDE.md documents the MCP integrations (what tools are available and their quirks), the repo structure (where things go), and key context (account IDs, email addresses, constants the agent needs). 2. Infrastructure (MCP stack): The MCP servers run as containers behind a proxy gateway. The proxy does two things: it aggregates multiple servers behind a single endpoint (so the agent connects to one URL), and it enforces tool allowlists - your safety net for read-only access. This is important. Many MCP servers have both read and write capabilities. The proxy config lets you restrict which tools are actually available. My Slack server can technically post messages, but the proxy only exposes read and search tools. 
Same for Gmail - read, search, and label management (archive/organize), but no send. The stack uses Docker or Podman Compose and includes servers for Google Workspace (Gmail, Calendar, Drive, Docs), Slack, Google Contacts, and a timezone-aware clock. The template also includes Crux (a task board with its own MCP server) which connects directly to the agent as a local stdio process, so it doesn’t need the network proxy. Adding a new tool - GitHub, a CRM, whatever you need - means adding a container and a proxy route. 3. Skills: Skills are reusable multi-step workflows encoded as markdown files. When I type /inbox-triage, the agent follows a structured process: gather unread messages from Slack and Gmail, categorize each one (respond/review/FYI/defer), extract action items into the task board, archive processed emails, and produce a summary with direct links to every item. Each skill has a tool allowlist that restricts what it can access. The inbox triage skill can read email and create tasks, but it can’t touch anything else. This is defense in depth - even if the skill instructions were somehow corrupted, the tool boundary holds. The template includes six skills:  inbox-triage - process unread messages, extract action items  meeting-prep - generate a brief with context, prior interactions, talking points  capture - auto-discover meeting notes, capture knowledge  delegate - pick up a queued task, execute it, hand back for review  review - end-of-day or end-of-week work review  learn - analyze corrections from the session, codify into rules  But the real value comes from building or customizing skills for whatever you want to achieve. A skill is just a markdown file with instructions - if you can explain it to a person, you can encode it as a skill. The agent skills format is becoming an industry standard - open-source skill marketplaces are growing, and enterprises are building internal collections. 4. Memory: Memory files give the agent persistent context across sessions. 
They’re markdown files with frontmatter that cover four types:  User - your role, goals, preferences, expertise  Feedback - corrections you’ve made (“don’t do X because Y”)  Project - ongoing work, decisions, deadlines  Reference - pointers to external systems and resources  The feedback memories are the most interesting. Every time you correct the agent, that correction can become a rule that prevents recurrence. Over weeks, this builds a detailed model of how you work. My system has about 100 memory files covering everything from “always verify dates before stating them” to “always check open merge requests before claiming a codebase gap.” 5. Templates: Templates provide consistent structure for recurring documents - meeting briefs, decision records, meeting notes. When a skill creates an output file, it starts from the matching template. This keeps output predictable without forcing you to specify formatting every time. The repo includes starter templates for common document types, and you add your own as needed. 6. Ops: The ops layer is what separates “cool demo” from “daily driver.” It includes:  Crux - a lightweight task board with an MCP server. Skills create and manage tasks here. Bundled in the template repo at crux/.  App Dashboard - a web UI for managing local services. Start, stop, restart, view logs, toggle autostart. Because after a reboot, you don’t want to remember which five containers need starting. Bundled at app-dashboard/.  Backup script - a daily scheduled task that snapshots the task database, copies configs, and commits/pushes the repo. Your knowledge base is in git - losing it would mean losing months of accumulated context.  Disaster recovery - step-by-step restore procedure. If your laptop dies tomorrow, you can rebuild the full system in under an hour.  What it’s like to use: A typical session might start with /inbox-triage. The agent reads unread Slack and Gmail, categorizes everything, creates tasks for action items, and archives processed emails. 
This takes about 5 minutes and replaces what used to be 30-45 minutes of manual scanning and context-switching. Before meetings, I run /meeting-prep [topic]. The agent pulls calendar details, looks up attendees in the company directory, searches for prior interactions in Slack and email, and produces a brief with context I’d otherwise need to scramble for. Throughout the day, /delegate handles routine tasks - reading articles, drafting responses, researching topics. Each task goes through a human review gate before anything is sent or published. At the end of the day, /review daily produces a summary of what got done, what’s carrying forward, and what needs attention tomorrow. The pattern is always the same: the agent gathers, synthesizes, and drafts. I review, adjust, and decide. No messages are sent automatically. No issues created without my confirmation. The agent handles the operational overhead; I handle the judgment calls. Beyond the six skills in the template, my own setup has about 20 skills tailored to product management and Red Hat’s environment - Jira triage, RFE drafting, competitive analysis, quarterly reviews, and more. The template gives you the foundation; you build the role-specific layer on top. One thing to watch for: running an AI-heavy workflow has its own cognitive cost - something I’m actively researching and working to improve. I wrote about that separately in AI Brain Fry Is Real. Getting started: You’ll need a Claude Pro or Max subscription for Claude Code. The template repo includes everything described above. A setup script handles the automatable parts: git clone https://github.com/nyechiel/ai-augmented-workflow.git && cd ai-augmented-workflow && ./scripts/setup.sh (Linux/macOS/WSL2), or .\\scripts\\setup.ps1 in place of the last step on Windows PowerShell. This installs Crux and App Dashboard (both bundled in the repo), copies config templates, and sets up the symlinks Claude Code needs. 
What’s left is the parts that need your input:  Clone the MCP servers - The MCP servers that connect to Gmail, Slack, etc. are separate open-source projects. The template includes the docker-compose file that builds and runs them as containers, but you need to clone each server’s source code alongside your workflow repo (e.g., ~/Projects/google_workspace_mcp, ~/Projects/slack-mcp-server). Start with Google Workspace and Slack.  Configure credentials - Set up OAuth tokens for Google and browser session tokens for Slack in the mcp-secrets.env file the setup script created. This is the most involved step.  Start the MCP stack - docker compose up -d (or podman-compose up -d) to launch the containers.  Customize CLAUDE.md - Add your rules, integrations, and context. The setup script created a starter file from the template.  Start Claude Code - type / and try a skill.  The setup guide has the full step-by-step, the customization guide covers how to adapt everything to your specific role and tools, and the workflow guide shows how the skills connect and a suggested daily rhythm. Start small. Get CLAUDE.md, two memory files, and one skill working before adding complexity. Let feedback memories accumulate naturally as you correct the agent. The system gets better the more you use it. Initial setup takes a few hours, mostly for OAuth credentials. The real investment is in writing rules and building skills over weeks. What I learned: Building this system taught me a few things that only became clear in hindsight. The rules layer is the most important. I spent more time refining CLAUDE.md than any other file. A well-written rules file makes the agent reliable; a sloppy one makes every session feel like explaining things from scratch. Skills encode what to do; rules encode how to think. Memory compounds. The first week is clunky. By week four, the agent knows your preferences, your stakeholders, and your quirks. It stops making mistakes you’ve already corrected. 
This compounding effect is the thing that makes the system genuinely useful rather than just interesting. Read-only is a feature, not a limitation. The strongest trust signal is that the agent can’t accidentally send a message or create an issue. Once you trust the boundaries, you stop second-guessing and start delegating more. Skills are never done. My earliest skills were simple and brittle. Over time, they grew cross-references (inbox-triage creates tasks that delegate picks up), graceful degradation (skills that work even when some MCP servers are down), and edge case handling I never anticipated. Quality comes from multiple layers: tool allowlists limit what each skill can access, structured steps guide the agent through a predictable process, the human-as-approver model means every output gets reviewed, and /learn turns corrections into persistent rules. I’m looking at a more formal eval pipeline as the skill set grows. The ops layer matters more than you think. A backup script and a service manager sound boring compared to skills and MCP servers. But the system I had in week one (no backup, manual container restarts) felt fragile and temporary. The one I have now feels permanent. That psychological shift changes how much you invest in it. This template is a starting point. The value comes from making it yours - your tools, your workflows, your accumulated knowledge. Clone the repo, start small, and iterate. If you build something interesting with it, I’d love to hear about it - contributions are welcome too."
  },{
    "title": "AI Brain Fry Is Real",
    "url": "/blog/2026/06/05/ai-brain-fry-is-real/",
    "date": "Jun 5, 2026",
    "tags": ["AI","Product Management","Productivity"],
    "excerpt": "A few weeks ago I wrote about building an AI operating system for product management - a local system called NirOps that takes an AI-first approach to my PM workflows. Not just operational tasks like inbox triage and meeting prep, but core PM work - RFE drafting, feature specs, competitive analysis, research, writing. It has 20 skills now, connects to my email, calendar, Slack, Jira, and task board, and runs entirely on my laptop. I use it daily and keep tailoring it to how I actually work.",
    "content": "A few weeks ago I wrote about building an AI operating system for product management - a local system called NirOps that takes an AI-first approach to my PM workflows. Not just operational tasks like inbox triage and meeting prep, but core PM work - RFE drafting, feature specs, competitive analysis, research, writing. It has 20 skills now, connects to my email, calendar, Slack, Jira, and task board, and runs entirely on my laptop. I use it daily and keep tailoring it to how I actually work.It works. That’s the problem.The system works perfectlyNirOps captures every action item from every meeting. It extracts tasks from emails, Slack threads, and document comments. It tracks everything in a local task board and queues reading and research for background execution. Nothing falls through the cracks anymore.But it goes beyond capture. The system produces research summaries, drafts, and analyses faster than I can review them. Tasks in “review” queue up waiting for my attention. The bottleneck shifted from production to absorption - and I’m the bottleneck.Before I built this, I was like most PMs - drowning in messages, forgetting follow-ups, occasionally rediscovering an action item weeks after the meeting where it was assigned. The manual approach was lossy but it had a hidden feature: natural forgetting acted as a filter. If something was important enough, it would come back around. If it wasn’t, it quietly disappeared.NirOps removed that filter. Now everything is captured, everything is visible, and the task board shows me the actual scope of my commitments for the first time. It turns out the real workload is bigger than what I was tracking manually. Not because the work increased - because visibility did. And the ease of doing more means I actually do more.This has a name, apparentlyI went looking for research to understand what I was feeling and found that this pattern is well-documented. 
It’s part of what people are now calling “AI fatigue” - the cognitive and emotional toll of working alongside AI systems. The research gives it more specific names. BCG surveyed 1,488 workers in March 2026 and coined the term “AI brain fry.” Workers whose AI tasks required higher oversight experienced 14% more mental effort, 12% greater mental fatigue, and 19% greater information overload. This isn’t burnout, which builds over months. It’s acute cognitive overload that recovers when you step away. UC Berkeley ran an 8-month ethnographic study at a tech company and identified three mechanisms:  Scope expansion - workers absorb tasks that would have belonged to others (“I can just AI this”)  Boundary dissolution - natural stopping points disappear (prompts sent during lunch, evenings, between meetings)  Relentless multitasking - multiple AI threads running in the background during meetings  The paradox they found: moment-to-moment, using AI feels exciting and empowering. Cumulatively, workers feel stretched and unable to disconnect. And none of this was imposed by management - it emerged from voluntary adoption. The Jevons paradox for productivity: There’s an economics concept called the Jevons paradox: when a resource becomes more efficient to use, total consumption goes up, not down. Coal-powered engines got more efficient, so people used more coal, not less. AI is doing this to knowledge work. When capture becomes effortless, you capture everything. When triage becomes fast, you triage more. When delegation is easy, you delegate more - which generates more outputs to review. Your “saved time” gets absorbed by new tasks, more content to review, and more requests from stakeholders who know you have AI assistance. Developers on high-AI teams merge 98% more pull requests. But PR review time on those same teams increased 91%. Production is happening at AI speed. 
Absorption is still happening at human speed. And there’s another layer: when AI handles more routine parts of your job, you’re left with the concentrated hard stuff - complex decisions, deep thinking, edge cases, stakeholder politics. Cognitive intensity per hour goes up. You feel exhausted without being able to point to volume that justifies it. What this looks like in practice: Here’s how each of those patterns maps to my own workflow: Scope expansion. NirOps captures every action item from every meeting. Before, many of those would have been naturally forgotten. Now they’re all on the board, and each one feels like a commitment even when it probably isn’t. Boundary dissolution. I can send myself a quick capture at any time - a link, a thought, an observation - and it gets automatically routed to the right place in my knowledge base. The friction that used to make me think “I’ll deal with this tomorrow” is gone, and with it, the ability to disconnect. Work thoughts become capture actions instantly, at any hour. Production faster than absorption. This is the bottleneck problem I described above. The review queue grows faster than I can work through it, and every item sitting there is a small cognitive weight. The body keeps the score: I’m a regular runner and do yoga weekly. Physical activity, especially running outside, has been a key tool in my toolbox for managing stress and staying sharp long before any of this. I also happen to wear a Garmin watch that tracks stress, heart rate variability (HRV), sleep, and something called Body Battery. When I started noticing the cognitive load patterns described above, I looked at what the watch data was already telling me. My weekly stress average has been trending up: from 27 to 33 over four weeks. Not alarming on its own, but a clear direction. The more interesting finding: morning runs have a measurable next-day effect. 
Days after a morning run show lower stress (avg 29 vs 35), higher overnight HRV (+5.8ms), and zero “stressful” day ratings. Two-thirds of days without a preceding morning run were rated “stressful.” Small sample, directional data, all the usual caveats. But directional is enough to act on. What I’m trying: I don’t have solutions. I have experiments. Here’s what I’m implementing, and I’ll report back on what actually works. Deliberate prioritization: This is the biggest mindset shift. NirOps should capture everything - that’s its job. But capturing something doesn’t mean committing to it. I added a weekly review step that sorts new tasks into three buckets: commit, delegate, and deprioritize. The key is making deprioritization an explicit decision, not a passive one. I added an “archived” status to my task board for items I’ve reviewed and decided aren’t worth pursuing right now. Not deleted (that loses the signal), not left sitting indefinitely. Explicitly triaged and set aside. The goal is to focus on the things that actually matter rather than spreading thin across everything the system captured. Finish before starting: I limit myself to 5 tasks in “doing” at any time. The captured list can grow - that’s fine, it’s just a queue. What matters is the active set. When you have 20 things in progress, nothing actually moves forward. When you have 5, things get done. I’m tracking this in my weekly review. The hard part isn’t setting the limit - it’s resisting the urge to start something new when the current work hits a snag. Batch review, not drip-feed: The BCG research found that isolated individual AI power-use is the highest-risk pattern for cognitive overload. 
Reviewing AI outputs one at a time throughout the day is drip-feeding - each one costs a context switch, and research suggests each switch takes 23+ minutes to recover from. I’m trying to batch my review of AI outputs into dedicated blocks instead of processing them as they arrive. Protect the deep work day: I’m based in Israel, so my workweek is Sunday through Thursday. Sunday is a gift - I still catch up on what happened over my weekend and Friday, but most of my colleagues haven’t started their week yet, so nothing new piles up during the day. No meetings, no incoming requests. It’s the one day where I can do deep work without interruptions. I protect that day aggressively. Track sustainability, not just productivity: I built a weekly review that combines work output (tasks completed, zone alignment against my goals) with physical metrics - stress trend, exercise frequency, sleep quality. The idea is to make the connection between work patterns and physical state visible, not just track them separately. The Garmin data confirms what I already knew intuitively: consistent exercise matters. Morning runs correlate with lower next-day stress. My goal is four runs a week, and the review tracks whether I’m hitting the target. No optimization games about which days to run - just consistency. The weekly stress trend is a leading indicator. If it keeps climbing while task throughput is flat, the answer is workload reduction, not optimization. What the research doesn’t cover yet: This is genuinely new territory. The tools have moved faster than our understanding of how they change the way we work. The research is starting to document the patterns, but we simply haven’t had enough time to understand the long-term impact of going AI-first in our daily work. Nobody has a playbook yet. Why I’m sharing this: I think a lot of knowledge workers are going to hit this wall. 
You build or adopt an AI system, it works great, your operational efficiency improves - and then the cognitive load increases because you can see and do more than before. The AI didn’t reduce your work. It made the real scope visible for the first time, and that visibility is both valuable and overwhelming.If you’re experiencing something similar - feeling stretched even though your tools are better than ever - I hope it helps to know this pattern has names, research backing, and some directional data on what might help.I’ll follow up on what actually works and what doesn’t. For now, I’m running the experiments."
  },{
    "title": "What I Learned in My First 30 Days Back in Product Management",
    "url": "/blog/2026/05/31/what-i-learned-in-my-first-30-days-back-in-product-management/",
    "date": "May 31, 2026",
    "tags": ["Product Management","Career","AI","Red Hat","Leadership"],
    "excerpt": "In May 2026, I started a new role as Senior Principal Product Manager at Red Hat, focused on AI and the company’s Digital Workforce initiative. On paper, it’s a PM role. In practice, it’s a homecoming - and a culture shock at the same time.",
    "content": "In May 2026, I started a new role as Senior Principal Product Manager at Red Hat, focused on AI and the company’s Digital Workforce initiative. On paper, it’s a PM role. In practice, it’s a homecoming - and a culture shock at the same time.I spent the last 6+ years leading engineering teams. Before that, I was a PM. Now I’m back, and the job I returned to is not the job I left.This post is the story of those first 30 days - what transferred from my engineering leadership years, what didn’t, and why I think the gap between PM and engineering leadership is both narrower and wider than people assume.Why I went backI want to be clear: I loved engineering leadership. Mentoring engineers, watching people grow into roles they didn’t think they were ready for, building teams that shipped great products - that was some of the most rewarding work of my career. I’d do it again, and honestly, I might go back to management someday.But as you climb the leadership ladder, the work naturally shifts. More of your time goes to people, process, and organizational design - and less to the technology itself. That’s not a complaint, it’s just the nature of the role. Over time, though, I noticed I was spending most of my energy on staffing plans, reorgs, and cross-team alignment, and less and less on the technical problems that drew me to this industry in the first place.I missed being close to the technology. Forming opinions about architecture, understanding trade-offs at the system level, being in the room where technical direction gets set - not as the person approving headcount, but as someone with a point of view on the product itself. At the same time, I found myself drawn more to the what and why than the how. 
PM lets you stay technical while owning the direction - that combination appealed to me.When the opportunity came to own an AI product portfolio - a domain I was genuinely curious about, at a company I already knew deeply - it felt like the right move at the right time.What transferred1. Technical credibility buys you time.Walking into a new PM role with a strong engineering leadership background gives you a head start that’s hard to overstate. Engineers trust you faster. Architecture conversations are productive from day one. You don’t need someone to translate the trade-offs for you.This matters more than I expected. In AI specifically, the gap between understanding the technology and not understanding it is the difference between asking good questions and nodding along. PMs who can engage meaningfully in architecture conversations and evaluate technical trade-offs on their own - they earn a seat at the table faster.2. Organizational intuition is portable.After years of navigating engineering orgs - managing managers, working across teams, handling reorgs, building consensus in open source communities - I came in with a feel for how large organizations actually work. Who to talk to. When to escalate. How to read the room in a meeting where nobody is saying what they really think.This is the stuff that doesn’t show up on a resume but determines whether your first month is productive or just busy. Working in open source communities, where you have no authority by definition, turns out to be pretty good practice for this.3. Empathy for the engineering side of the house.I know what it’s like to receive a vague RFE that could mean five different things. I know what it’s like to be mid-sprint when priorities shift. I know what it’s like when a PM doesn’t understand the cost of what they’re asking for.So I try not to be that PM. Every feature request I write includes the why, the user problem, and an honest assessment of complexity as I understand it. 
When I don’t know the cost, I say so and ask. When the engineering team pushes back, I listen - because I’ve been on their side of that conversation.What didn’t transfer1. Domain knowledge resets to zero.Technical credibility gets you in the room. Domain knowledge keeps you there. And in a new domain - AI enterprise products, sales tooling, LLM-powered agents - I was starting from scratch.The first few weeks were humbling. I was in meetings where every acronym was new. I was reading internal docs where the context assumed two years of history I didn’t have. I was forming opinions about product direction while still figuring out what the product actually did.The temptation is to fake it - to nod along and figure it out later. I decided early on to just be honest about what I didn’t know. “I’m new, help me understand why we made that decision” turned out to be one of the most useful phrases in my vocabulary. People are surprisingly generous with context when you ask directly.2. The tempo is different.Engineering leadership operates on a cadence: sprints, releases, quarterly planning. You know what’s coming and when. The feedback loops are relatively tight - you ship something, you see if it works, you iterate.Product management - especially at the strategic level - operates on a different cadence. Some things move slower: you’re writing RFEs that might ship in six months, planting seeds for initiatives that won’t materialize for a quarter. The feedback loops are longer and noisier. I found myself itching to do things - to ship, to fix, to make progress that I could measure.But other things move incredibly fast. The AI space doesn’t wait for your quarterly planning cycle. Models improve weekly, competitive dynamics shift overnight, and the product you’re defining today might need to be rethought next month based on new capabilities that didn’t exist when you started. It’s a strange combination - strategic patience and tactical urgency at the same time. 
Learning to hold both took deliberate adjustment.3. The information firehose is real.In engineering leadership, the scope is relatively bounded - your teams, your area, your stakeholders. Even as a Director, where you’re too far from the code to go truly deep, you at least know the boundaries of what you need to track.In PM - especially a cross-functional role that touches multiple business functions - the information surface is enormous. Slack channels, customer feedback, competitive intel, market research, internal strategy docs, stakeholder updates, field reports. I attended 30+ meetings in my first three weeks.The risk isn’t missing information - it’s drowning in it. I spent my first month absorbing much more than I synthesized. That’s natural for a ramp-up period, but it’s a pattern you have to actively break. The value of a PM isn’t in how much they know - it’s in what they do with what they know.What surprised mePM changed while I was away. The biggest surprise wasn’t the domain shift - it was how much the craft of product management had evolved. AI is now a first-class tool in the PM toolkit, not a curiosity. The best PMs I see aren’t just using AI to write faster emails - they’re building systems around it, using it for research synthesis, competitive analysis, and stakeholder preparation at a depth that would have been impossible three years ago.But here’s the flip side: building things has never been easier, and that makes the PM’s core job - truly understanding the problem, scoping the right solution, knowing what not to build - more critical than ever. When the cost of building drops toward zero, the cost of building the wrong thing becomes the dominant risk. The PMs who will thrive aren’t the ones who ship the fastest. They’re the ones who make sure the team is pointed at the right problem before anyone writes a line of code.I leaned into both sides of this. 
Within my first month, I built an AI-augmented workflow that goes beyond just reducing operational overhead. It’s about being AI-first across the entire PM lifecycle - from research and discovery, through competitive analysis and stakeholder prep, to feature definition and delivery tracking. The goal isn’t to automate the PM role. It’s to leverage AI at every stage so I can go deeper on the parts that actually require human judgment: understanding the problem, talking to customers, making trade-off decisions, setting direction.The outsider advantage is real. Coming from outside the AI domain, I ask questions that insiders stopped asking a long time ago. “Why do we do it this way?” is a powerful question when asked sincerely by someone who genuinely doesn’t know the historical context. Sometimes the answer is a good reason. Sometimes the answer is “that’s just how we’ve always done it.” The second answer is where the opportunities live.The challenge is holding onto that fresh perspective as you ramp up. The more you learn, the more you risk normalizing the same things everyone else has normalized. I’m trying to be deliberate about preserving that outsider’s critical eye - questioning assumptions, challenging defaults, not accepting “that’s just how it works here” too quickly. That fresh point of view is a temporary asset, and once it’s gone, it’s hard to get back.Relationships compound faster than knowledge. The most valuable thing I did in my first 30 days wasn’t reading docs or writing specs. It was meeting people. Every 1:1, every “help me understand your world” conversation added a layer of context and trust that keeps paying off weeks later. Knowledge is perishable - especially in a fast-moving AI domain. Relationships are durable.The honest versionReturning to PM after years of engineering leadership is not a lateral move. 
It’s a diagonal one - you carry some things across, you leave others behind, and you have to build new muscles in real time while delivering in a role that expects you to already have them.There were days in that first month where I felt completely lost.And then there were days where everything clicked - where my engineering background gave me an insight that a pure PM would have missed, where a relationship from my open source years opened a door in my new domain, where the outsider perspective helped me see a problem that insiders had normalized.The answer, 30 days in, is that the transition is worth it - but only if you’re honest about what you don’t know, deliberate about what you need to learn, and patient enough to let the compound interest of relationships and domain knowledge do its work.What I’d tell someone considering the same move  Your technical depth is an asset, not a crutch. Use it to earn credibility, but don’t let it become your identity. You’re a PM now. The job is to define what to build and why, not to prove you could build it yourself.  Be honest about what you don’t know. The fastest way to earn trust in a new domain is to ask good questions, not to pretend you have answers.  Invest in relationships early. You obviously need to read the docs too, but don’t let that crowd out meeting people. Knowledge decays. Relationships compound.  Go AI-first from day one. Don’t wait until you’re “settled in” to rethink your workflow. I started building my AI-augmented PM system during my first week, and it’s one of the best decisions I made. It’s not just about saving time - it’s been a force multiplier for onboarding itself, helping me absorb context, track stakeholders, and synthesize information faster than I could have done manually. That investment is already paying dividends.  Give yourself grace. The first 30 days are supposed to be uncomfortable. If they’re not, you’re probably not stretching enough."
  },{
    "title": "I Built a Personal AI Operating System for Product Management",
    "url": "/blog/2026/05/13/ai-operating-system-for-product-management/",
    "date": "May 13, 2026",
    "tags": ["AI","Product Management","Claude Code","MCP","Red Hat"],
    "excerpt": "I recently moved back into PM after more than six years in engineering leadership - a transition that probably deserves its own post someday. I currently work as a Senior Principal Product Manager at Red Hat with a strong technical focus, building agentic AI systems that transform how teams across the business work.",
    "content": "I recently moved back into PM after more than six years in engineering leadership - a transition that probably deserves its own post someday. I currently work as a Senior Principal Product Manager at Red Hat with a strong technical focus, building agentic AI systems that transform how teams across the business work.But this post is about something else. Over the past few weeks I’ve been building a personal AI system that handles the operational side of my job. This space is evolving fast, and what I describe here will probably look different in a few months, but I wanted to capture where I am in this journey and share what’s working.Product management generates a lot of operational work. Triaging inboxes, prepping for meetings, tracking action items, writing specs, staying current on industry news, keeping stakeholders aligned. Most of it is important, but repetitive. It fills the day and leaves little room for the deep work that actually compounds.I wanted to flip that ratio. Instead of spending 70% of my time on operational overhead and 30% on strategic work, I wanted the inverse.What I builtI started building what I call NirOps - a local AI system that handles the operational side of my job so I can focus on the parts that require human judgment.The core idea is simple:  The AI handles routine operations. Inbox triage, meeting prep, research, news monitoring, action item tracking, knowledge capture.  I handle judgment calls. Strategy, prioritization, stakeholder relationships, decisions.  Everything is saved locally. Every meeting brief, triage output, research note, and decision record lives in a structured repo on my machine. The AI builds context over time, so it gets better the longer I use it.It’s not a chatbot I talk to occasionally. 
It’s closer to an operating system that sits underneath my daily workflow - connecting my calendar, email, stakeholders, and ongoing work into a single context that the AI uses to do real operational work.What a typical day looks likeMorningI start with three commands:  Jira triage - the agent scans my Jira landscape and flags what needs attention.  News brief - curated top-10 industry news relevant to my domain. Deep-dive articles get queued as reading tasks that the agent processes in the background.  Inbox triage - the agent scans Slack and Gmail, categorizes every message (respond / review / FYI / defer), drafts responses for the urgent ones, extracts action items into a task board, and archives processed emails.Before MeetingsI give the agent a meeting topic and attendee list. It pulls context from prior interactions, recent activity, relevant documents, and stakeholder notes, then generates a brief with talking points. For first-time 1:1s, it generates a personal cheat sheet with the person’s background, our overlap areas, and good questions to ask.After MeetingsThis is where it gets interesting. My meetings are transcribed by Google’s Gemini. The agent auto-discovers those transcripts, correlates them with calendar events, extracts action items into the task board, updates my contact map with new people I met, and saves key facts to domain knowledge files. I don’t fill in any forms or update any trackers manually.I can also quick-capture ad-hoc thoughts throughout the day - a link, a fact, a contact, a decision - and the agent routes it to the right place in my knowledge base.During the DayResearch on demand. Discovery briefs for vague problem areas. Competitive analyses. 
Quick briefings on any Slack thread, email, or document someone sends me.All of these are codified workflows - not ad-hoc prompts.How it’s wired togetherThe system runs on Claude Code, Anthropic’s CLI agent, connected to my work tools via MCP (Model Context Protocol) - small services that bridge the AI to external APIs.Claude Code    |    +-- Gmail (read + archive)    +-- Google Calendar (read)    +-- Google Drive / Docs / Sheets / Slides (read)    +-- Slack (read + mark-as-read)    +-- Google Contacts (directory lookup)    +-- Jira / Confluence (read + write)    +-- Crux (custom task board)Most MCP servers run locally as containers behind a gateway proxy. The proxy enforces tool allowlists - a JSON config defines exactly which tools each server can expose. This is how I enforce read-only access: the Google Workspace MCP server has dozens of tools, but the proxy only exposes the ones for searching and reading, not sending or creating.Everything runs on my laptop (shoutout to the Fedora team). No third-party SaaS indexing my workspace data. The knowledge base is markdown files, the task board is SQLite. If the AI tool disappears tomorrow, I still have all my meeting briefs, research notes, and decision records.What makes this different from “Using ChatGPT”The difference isn’t the model - it’s the context and the workflow.A generic AI chat session starts from zero every time. You paste in context, ask a question, get an answer, lose everything when you close the tab. Useful for one-off tasks, but nothing compounds.This system maintains context across sessions: who my stakeholders are, what was discussed in last week’s meeting, what action items are open, what the competitive landscape looks like, how I prefer to communicate. When I prep for a meeting with someone, the agent already knows our shared history, our last conversation, and the open items between us.The other difference is tooling. 
The agent doesn’t just answer questions - it reads my email, checks my calendar, searches Jira, looks up people in the directory, and saves its output to structured files. It operates on my actual work environment, not a text box.Skills, not promptsEach workflow is codified as a skill - a markdown file with explicit steps, tool access declarations, verification checks, and graceful degradation rules. This makes the system reliable instead of ad-hoc:  Access control. Skills declare which tools they can use. An inbox triage skill can read email and manage labels but can’t send messages. A research skill can search the web but can’t modify Jira.  Self-verification. Every skill that produces output reads it back and checks for completeness before presenting results. This catches silent failures - the kind where the AI says “done” but actually dropped half the data.  Composability. Skills can invoke other skills. My inbox triage automatically delegates reading tasks after categorizing messages. My news brief queues deep-dive articles and processes them in the background.  Learning from corrections. When I correct the agent, a dedicated skill extracts the principle and persists it so the same mistake doesn’t happen twice.There are currently 16 skills covering everything from quick text polishing to full competitive analyses.The design principles that matterHuman-as-approver. The agent can research, draft, and organize - but it never sends messages, creates tickets without my confirmation, or marks work as done. Every outward-facing action goes through me. This isn’t a limitation - it’s the design. The judgment layer stays human.Artifacts over chat. The agent produces files, not just conversation responses. Meeting briefs, research notes, decision records, competitive analyses - all saved in structured formats. This means the work compounds: next month’s meeting prep references last month’s capture notes.Progressive automation. 
I started with read-only integrations and manual workflows. As I trust the system more, I automate more. Email archiving was added after weeks of manual triage. Action item extraction was added after I noticed I was doing it by hand every session. The system grows with confidence, not ambition.Inbox zero as a feature. After triage, processed emails get archived and Slack gets marked as read. Items awaiting my reply stay visible. My inbox reflects my actual to-do list, not a pile of stuff I’ve already processed mentally.What I’ve learnedStart with read-only. Connect tools for reading before writing. Build trust in the system’s judgment before giving it any write access. This is the single most important principle for anyone building something similar.Verification matters more than you think. Silent failures are the biggest trust killer. The AI says “I triaged your inbox” but actually only processed 8 of 15 messages because it didn’t paginate. Every skill now reads its own output back and checks for completeness. This sounds paranoid until the first time it catches a real bug.The contact map is the most valuable artifact. An auto-maintained list of every person I interact with - their role, our shared context, interaction history. It makes every meeting prep and inbox triage smarter because the agent knows who people are and why they matter.Quick capture changes everything. The ability to text myself a link, thought, or observation from my phone and have it automatically routed to the right place in my knowledge base means nothing falls through the cracks. Before this, I had a dozen half-organized note apps. Now there’s one inbox, and the agent sorts it.The AI gets better, not just me. Every captured meeting, every triage run, every research note makes the next one faster and more contextual. The system compounds. After a few weeks, meeting prep that used to take 20 minutes of manual research takes one command and 30 seconds.What’s nextThis space is moving fast. 
The models get more capable every few months, and the tooling around them is maturing just as quickly. I expect the skills and workflows I described here to keep evolving - some of what I do manually today will be automated tomorrow, and new capabilities will open up patterns I haven’t thought of yet.I also made a deliberate choice to invest in building this system while onboarding into my new role, rather than waiting until I was “settled in.” The thinking is that the earlier I build the right habits and infrastructure, the more it compounds over time. So far, that bet is paying off - the system is already making me faster at absorbing context, tracking commitments, and staying on top of a new domain.The bigger challenge ahead isn’t the AI itself - it’s connecting these workflows to the people around me. Right now, the system mostly serves me: my inbox, my context, my task board. The next step is integrating more deeply into the product delivery team - bridging the gap between PM operational work and engineering execution, making the context I accumulate useful to the people I work with, not just to me. That’s where the real leverage is.Is this for everyone?Not yet. Today, this requires comfort with the command line, willingness to debug container networking, and patience to iterate on workflow design. It’s not a product you install - it’s a system you build. I expect more turnkey offerings to appear as this space matures, but there will always be a need to tune and optimize for your specific workflow.But the pattern - AI as operational infrastructure for knowledge workers - is where things are heading. Not AI replacing PMs (or lawyers, or analysts, or researchers), but AI handling the operational overhead so humans can focus on the work that requires judgment.If you’re interested in exploring something similar, I’m happy to walk through the details. The practical challenges (auth, reliability, prompt engineering, trust calibration) are real, but solvable. 
And the payoff - reclaiming hours of your week for the work that actually matters - is worth the investment.Special thanks to Jonathan Zarecki who helped me get started and validated some of my ideas."
  },{
    "title": "Second Anniversary at Red Hat",
    "url": "/blog/2022/03/20/second-anniversary-at-red-hat/",
    "date": "Mar 20, 2022",
    "tags": ["Career","Kubernetes","Leadership","OpenShift","Red Hat"],
    "excerpt": "Last month marked my second service anniversary since rejoining Red Hat. I feel like this is a good opportunity to reflect on the last two years and what I have been up to, both personally and professionally.",
    "content": "Last month marked my second service anniversary since rejoining Red Hat. I feel like this is a good opportunity to reflect on the last two years and what I have been up to, both personally and professionally.Recharge – or why I decided to rejoin Red Hat?When I left my job at Facebook I had no idea what I wanted to do next. I certainly did not plan to come back to Red Hat at that point. What I did know was that I needed a break. Like a real, proper break. This is when I decided to carefully plan a “funemployment” period with full support from my wife and family. After about 10 years in tech, with much international travel involved, this was very well-needed.During my recharge, which took around four months, I focused on myself and my family first. No laptop. Very little emails. Free schedule. That also allowed my wife, a full time employee in the tech industry herself, to develop her career and spend more time at work like she wanted to.Other than a few personal projects I finally got a chance to focus on (like getting back in shape and completing my first half-marathon), the best part was two vacations: one with the entire family – we spent two weeks traveling across Israel with the kids; and the other one just with my wife – we were in Rome in December 2019, a couple of months before COVID spread across nearly every border.This time away from work really got me thinking about what I wanted to do next. I started to explore a few ideas but then an interesting offer came up: to come back to Red Hat as an engineering manager. I knew Red Hat (I worked there for more than 5 years before). I even knew most of the team members I was supposed to manage and how awesome they are. I wasn’t an engineering manager before.. but I was up for the challenge. I felt like it’s the right move for me, and the opportunity to directly support people was very appealing. 
In February 2020 I officially rejoined Red Hat.Transitioning from Product to Engineering ManagementPeople keep asking me about the transition I made from a Product Manager role to an Engineering Manager role. First, a disclaimer: your mileage may vary. I can only share my view which is based on my own experience. Now back to the question: in theory, this move is supposed to be about shifting from owning the “problem space” (product) to owning the “solution space” (engineering). In reality, things are much more complex. The boundaries are not so clear, and depending on the company, the teams involved, the product itself, and where the product is in its life-cycle, there can be a good amount of overlap between product and engineering work. At least the way I look at this, the title does not mean much. There is work that needs to be done, and users and customers that want to get value. I keep asking myself: what can I bring to the table at this point and how can I help?To me, the biggest difference wasn’t about “problem space”, “solution space”, or technology. It’s about moving from being an individual contributor (IC) to a people management role.With people, for the peopleI always had a passion for supporting others, whether I was a manager or not. As a product manager, which by definition is an IC-type role, I got the opportunity to work with many other teams and individuals and I learnt how to earn trust and ultimately influence and make an impact. I also got the opportunity to mentor others, be it officially (via mentoring programs) or organically as part of the job. But being a people manager with a defined team of people you need to support is a different experience.A people manager plays a pivotal role in the employee experience. As much as I love the technical parts of the role, I made a clear decision: people first.Multi-cluster KubernetesI rejoined Red Hat to manage a team that is focused on multi-cluster networking and the Submariner project. 
In the last two years, the team managed to achieve a lot of amazing things, both on the technology front, with many key features and capabilities added to the project, and on the community front, making Submariner a vibrant community project which is now part of the CNCF. The team also worked hard to integrate Submariner into Red Hat Advanced Cluster Management for Kubernetes, which is how Red Hat customers can leverage and use Submariner.What’s next?As I mature as a people manager, I continue to look for ways to delegate and empower others, so that I can focus more on career development and personal and professional mentoring. We are also starting to see more adoption of multi-cluster Kubernetes solutions, and I am looking forward to supporting customers and helping them in their multi-cluster journey, while advocating for further standardization in the networking space."
  },{
    "title": "Migrating my blog from WordPress to GitHub Pages",
    "url": "/blog/2020/07/01/migrating-my-blog-from-wordpress-to-github-pages/",
    "date": "Jul 1, 2020",
    "tags": ["GitHub Pages","Jekyll"],
    "excerpt": "Goodbye The Network Way, welcome nyechiel.com!",
    "content": "Goodbye The Network Way, welcome nyechiel.com!When I started my blog a few years back I made the decision to use WordPress.com. While WordPress provides a powerful set of tools for creating and maintaining a blog and hides many of the complexities associated with running a website, I felt like I needed more control at this time and that I prefer to maintain my site myself. My goal was to make things as clean, minimalist and simple as possible. I really just need a way to keep my posts organized and accessible, and I don’t see a need in many of the advanced features that WordPress offers.After looking at some of the alternatives I decided upon using GitHub Pages to host my blog, and use Jekyll, a static site generator which has built-in support for GitHub Pages. My new site is now available at nyechiel.com, which you obviously know if you read this post. I migrated all of the content and old posts from the WordPress site into this new blog, and I decided to completely delete my WordPress site and account.The source code for this website is available on GitHub. Feel free to take a look if you want to learn how I use Jekyll and what customization I made to make the site works for me.Happy blogging, and of course stay safe!"
  },{
    "title": "Thoughts on mental health, project Aristotle, and Maslow's hierarchy of needs",
    "url": "/blog/2020/05/24/thoughts-on-mental-health-project-aristotle-and-maslows-hierarchy-of-needs/",
    "date": "May 24, 2020",
    "tags": ["COVID-19","Leadership","People"],
    "excerpt": "At least up until today, I used this blog to document technical stuff. But with May being Mental Health Awareness month, and with many of us spending most of our days at home amid the spread of COVID-19, I’ve been thinking a lot about mental health recently. I wanted to put together some of my thoughts, even if somewhat random, here.",
    "content": "At least up until today, I used this blog to document technical stuff. But with May being Mental Health Awareness month, and with many of us spending most of our days at home amid the spread of COVID-19, I’ve been thinking a lot about mental health recently. I wanted to put together some of my thoughts, even if somewhat random, here.Personally, I feel incredibly privileged. My family is healthy and safe and I work in a great supportive company. And yet, I feel like it’s hard to ignore the millions around me that have been directly or indirectly impacted by this new situation, not to mention the long term impact (which is still unknown, for the most part) on job markets and macroeconomics. In 2012, Google embarked on an internal project code-named project Aristotle studying over 180 teams to figure out the answer to one key question: what makes teams successful? The study, which was published two years later, highlighted five factors common to effective team dynamics at Google. The first, and by far the most important factor according to this study, was psychological safety - a term first coined by Harvard scientist Amy Edmondson, and defined as “a shared belief held by members of a team that the team is safe for interpersonal risk taking”. Simply put, according to this Google research, the most successful teams are those where team members feel safe to take risks and be honest and vulnerable in front of each other without the fear of being embarrassed, or “wrong”. For completeness, the other four factors were dependability, structure and clarity, meaning, and impact.70 years back, psychologist Abraham Maslow proposed his famous hierarchy of needs theory. The theory studies what motivates humans behavior, and highlights the things that are vital to our survival and defined as physiological needs (level 1), followed by security and safety needs (level 2), as the most basic necessities. 
In other words, without these foundational needs met, one cannot strive towards the higher layers associated with personal growth or reach one’s potential. The conclusion, even if somewhat simplistic, is that with so many people concerned about their physiological and security needs as defined by Maslow, teams would also struggle to develop any form of psychological safety dynamic as defined in Google’s research. It’s time to step back and give ourselves a moment of care for our health, both physically and mentally. It’s going to be a long run. Stay safe!"
  },{
    "title": "Red Hat Summit 2020: Networking Sessions You Don’t Want to Miss",
    "url": "/blog/2020/04/13/red-hat-summit-2020-networking-sessions-you-dont-want-to-miss/",
    "date": "Apr 13, 2020",
    "tags": ["COVID-19","Open Source","Red Hat","Red Hat Summit","Talks"],
    "excerpt": "Red Hat Summit is one of my favorite events of the year. It brings together customers, partners, community members and Red Hatters to talk about the open source innovations that are enabling the future of enterprise technology. Seeing the number of attendees growing year after year is also impressive and reassuring, as more and more folks are showing interest in Red Hat and its growing portfolio of products.",
    "content": "Red Hat Summit is one of my favorite events of the year. It brings together customers, partners, community members and Red Hatters to talk about the open source innovations that are enabling the future of enterprise technology. Seeing the number of attendees growing year after year is also impressive and reassuring, as more and more folks are showing interest in Red Hat and its growing portfolio of products.This year, due to the evolving coronavirus (COVID-19) developments, we decided to rebuild Red Hat Summit 2020 as a virtual event and cancel the physical event that was planned to be hosted in San Francisco. The date does not change - and the virtual event is going to happen on April 28-29.The Red Hat Summit team is doing an INCREDIBLE job recreating the same content you would expect, with keynotes, breakout sessions, access to Red Hat experts, and even “booth” areas. In addition, all recorded content will be available for up to one year after the event. And the cherry on top, access and registration is free of charge and the event is designed for a global audience across three main time-zones: North America and Latin America (NA + LATAM); Europe, the Middle East, and Africa (EMEA); and Asia and the Pacific Rim (APAC).Similar to a post I did back in 2016, I would like to highlight key networking related sessions that are planned this time. (Note: The Summit 2020 event platform has since been decommissioned, so session detail links are no longer available.)  
Red Hat Ansible Network Automation + Telco—containerized network function virtualization (NFV)  Red Hat Enterprise Linux networking roadmap: Connecting workloads of tomorrow  Red Hat OpenShift’s road to the network edge  Connecting workloads across OpenShift clusters  Living on the edge—A reference architecture for distributed compute nodes (DCN)  OpenShift core technologies roadmap  The next evolution of Red Hat OpenStack Platform  Datacenter networking with Cisco Application Centric Infrastructure (ACI)  Stay healthy and safe!"
  },{
    "title": "Hello (again), Red Hat!",
    "url": "/blog/2020/04/12/hello-again-red-hat/",
    "date": "Apr 12, 2020",
    "tags": ["Career","Kubernetes","Leadership","OpenShift","Red Hat","Submariner"],
    "excerpt": "tl;dr - as of February 2020, I am back at Red Hat, focusing on OpenShift multi-cluster networking.",
    "content": "tl;dr - as of February 2020, I am back at Red Hat, focusing on OpenShift multi-cluster networking.2020 brings a new beginning for me. Last year I decided to join the Connectivity team at Facebook. Working at Facebook was a great experience for me, but after a few months there I realized that I was not happy and this is not going to be a long-term fit for me. In August, I made the decision to quit and go and spend some well-needed time with my family. I was happily funemployed for around 4 months, and it was just awesome!Starting with February 2020, I am back at Red Hat, supporting the OpenShift multi-cluster network engineering team. Transitioning from Product Management to engineering (and engineering management in particular) is new to me, but I enjoy every second of it so far.Like everything else we do in Red Hat, our contribution is fully open sourced and we follow an upstream first development philosophy. The main project I am currently working on is Submariner, and our mission is to allow organizations to seamlessly connect, scale, and migrate their workloads across OpenShift (Kubernetes) clusters deployed on prem or in any cloud.I am super excited and looking forward to working with brilliant folks at Red Hat and the Kubernetes community. I am also planning to be more active on this blog, where I hope to share more information about our progress.Stay healthy and safe!"
  },{
    "title": "A New Beginning",
    "url": "/blog/2019/04/23/a-new-beginning/",
    "date": "Apr 23, 2019",
    "tags": ["Red Hat","Career","Product Management"],
    "excerpt": "2019 brings a new beginning for me. After a little over five years at Red Hat, I have decided to move on to my next challenge. This is a big move, and I wanted to take a moment and reflect back on those years.",
    "content": "2019 brings a new beginning for me. After a little over five years at Red Hat, I have decided to move on to my next challenge. This is a big move, and I wanted to take a moment and reflect back on those years.I joined Red Hat at 2013 as Product Manager for Red Hat Enterprise Virtualization (which since then was rebranded to Red Hat Virtualization, or RHV). That was my first PM job ever, and really a dream come true. I would always be grateful for my manager at the time, Andy Cathrow, for believing in me and offering me the job although I did not have PM experience before.When I joined Red Hat we were around 4,500 employees worldwide. We were still primarily known for our enterprise Linux offering, and cloud was not yet a thing. On the personal front I was in a relationship with my soon-to-be my wife, but not married yet.If I need to sum up my time at Red Hat with just one word, it would be “growth”. I was really fortunate to experience growth in many aspects: the company itself, my career, and my family lives.Fast forward five years and Red Hat is a major player in hybrid cloud, with an impressive portfolio that goes way beyond Linux and touches application development, virtualization, cloud, storage, networking, management and automation. The company has more than 13,000 (!) employees as of February 2019, and was acquired by IBM in what is expected to be the third-biggest deal in the history of U.S tech. Seeing this growth from the inside was truly amazing. I have had the chance to work on cool projects, and bring new products to market.  Career wise, I ended up working on three major products: RHV, OpenStack, and OpenShift - and due to the special culture of Red Hat, also working and interacting with many open source communities, including (but not limited to) Kubernetes, oVirt, Open vSwitch, KVM, OpenDaylight, Skydive, Kuryr, OPNFV, DPDK, and Ansible. 
I had the pleasure to work alongside and learn from the most amazing group of people and shape my product management skills and personality. I have also learned (the hard way) what it’s like to be working from home and in a super distributed team across Israel, Europe, and US East and West Coasts. On the personal front, I am married with two beautiful children now. Integrating “work” and “life” and building the optimal schedule was also a key lesson learned from my time at Red Hat. Next I am joining the Facebook Connectivity team to work on a new product. Facebook certainly feels different than what I did previously, although I am still going to be involved with networking. The Connectivity mission to bring more people online to a faster Internet is near and dear to my heart. Moving on was not an easy decision. That said, I felt like this was the right thing for me and for my career. Among other things, I wanted to experience what it’s like to be a PM in a different company, and explore new markets and people problems."
  },{
    "title": "Red Hat OpenStack Platform 13: five things you need to know about networking",
    "url": "/blog/2018/07/15/red-hat-openstack-platform-13-five-things-you-need-to-know-about-networking/",
    "date": "Jul 15, 2018",
    "tags": ["Kubernetes","Kuryr","NFV","OpenDaylight","OpenShift","OpenStack","OVN","OVS","Red Hat","SDN"],
    "excerpt": "A post I wrote for the Red Hat Stack blog, on key networking features included in Red Hat OpenStack Platform 13. Read more here: Red Hat OpenStack Platform 13: five things you need to know about networking.",
    "content": "A post I wrote for the Red Hat Stack blog, on key networking features included in Red Hat OpenStack Platform 13. Read more here: Red Hat OpenStack Platform 13: five things you need to know about networking."
  },{
    "title": "SDN with Red Hat OpenStack Platform: OpenDaylight Integration",
    "url": "/blog/2017/02/28/sdn-with-red-hat-openstack-platform-opendaylight-integration/",
    "date": "Feb 28, 2017",
    "tags": ["NetVirt","NFV","OpenDaylight","OpenStack","SDN"],
    "excerpt": "A short post I wrote for the Red Hat Stack blog, on what Red Hat is doing with OpenStack and OpenDaylight. Read more here: SDN with Red Hat OpenStack Platform: OpenDaylight Integration.",
    "content": "A short post I wrote for the Red Hat Stack blog, on what Red Hat is doing with OpenStack and OpenDaylight. Read more here: SDN with Red Hat OpenStack Platform: OpenDaylight Integration."
  },{
    "title": "Networking sessions in Red Hat Summit 2016",
    "url": "/blog/2016/07/07/networking-sessions-in-red-hat-summit-2016/",
    "date": "Jul 6, 2016",
    "tags": ["Ansible","Linux","NFV","OpenStack","Red Hat","Red Hat Summit","SDN","Talks","Virtualization"],
    "excerpt": "I recently attended the Red Hat Summit 2016 event that took place at San Francisco, CA, on June 27-30. Red Hat Summit is a great place to interact with customers, partners, and product leads, and learn about Red Hat and the company’s direction. While Red Hat is still mostly known for its Enterprise Linux (RHEL) business, it also offers products and solutions in the cloud computing, virtualization, middleware, storage, and systems management spaces. And networking is really a key piece in all of these.",
    "content": "I recently attended the Red Hat Summit 2016 event that took place at San Francisco, CA, on June 27-30. Red Hat Summit is a great place to interact with customers, partners, and product leads, and learn about Red Hat and the company’s direction. While Red Hat is still mostly known for its Enterprise Linux (RHEL) business, it also offers products and solutions in the cloud computing, virtualization, middleware, storage, and systems management spaces. And networking is really a key piece in all of these.In this short post I wanted to highlight a few sessions which are relevant to networking and were presented during the event. While video recordings are not available, slide decks were originally available for download from the event platform, which has since been decommissioned.Software-defined networking (SDN) fundamentals for NFV, OpenStack, and containers  Session overview: With software-defined networking (SDN) gaining traction, administrators are faced with technologies that they need to integrate into their infrastructure. Red Hat Enterprise Linux offers a robust foundation for SDN implementations that are based on an open source standard technologies and designed for deploying containers, OpenStack, and network function virtualization (NFV). We’ll dissect the technology stack involved in SDN and introduce the latest Red Hat Enterprise Linux options designed to address the packet processing requirements of virtual network functions (VNFs), such as Open vSwitch (OVS), single root I/O virtualization (SR-IOV), PCI Passthrough, and DPDK accelerated OVS.  (Slides no longer available)Use Linux on your whole rack with RDO and open networking  Session overview: OpenStack networking is never easy–each new release presents new challenges that are hard to keep up with. Come see how open networking using Linux can help simplify and standardize your RDO deployment. 
We will demonstrate spine/leaf topology basics, Layer-2 and Layer-3 trade-offs, and building your deployment in a virtual staging environment–all in Linux. Let us demystify your network.  (Slides no longer available) Extending full stack automation to the physical network  Session overview: In this session, we’ll talk about the unique operational challenges facing organizations considering how to encompass the physical network infrastructure when implementing agile practices. We’ll focus on the technical and cultural challenges facing this transition, including how Ansible is uniquely architected to serve as the right foundational framework for powering this change. We’ll touch on why it’s more important than ever that organizations embrace the introduction of new automated orchestration capabilities and start moving away from traditional command and control network device administration being done hop by hop. You’ll see some of the theories in action and touch on expanding configuration automation to include elements of state validation of configuration changes. Finally, we’ll touch on the changing role of network engineering and operations teams and why their expertise is needed now more than ever to lead this transition.  (Slides no longer available) Best practices for deploying and scaling Red Hat OpenStack Platform with Open vSwitch and Red Hat Ceph storage  Session overview: In this deep dive implementation session, you’ll learn how to successfully deploy and scale Red Hat OpenStack Platform with Red Hat’s best practices for integration of Open vSwitch and Red Hat Ceph Storage, taking into consideration high availability, IPv6 networking, and the deployment and usage of Director for massive scalability. Learn the tips and tricks, while avoiding typical pitfalls to ensure you’re successful.  
(Slides no longer available) Telco breakout: Reliability, availability, and serviceability at cloud scale  Session overview: Many operators are faced with fierce market competition that is attracting their customers with personalized alternatives. Technologies, like SDN, NFV, and 5G, hold the key to adapting to the networks of the future. However, operators are also looking to ensure that they can continue to offer the service-level guarantees their customers expect. With the advent of cloud-based service infrastructures, building secure, fault-tolerant, and reliable networks that deliver five nines (99.999%) service availability in the same way they have done for years has become untenable. The goal of zero downtime is still the same, as every hour of it is costly to service providers and their customers. As we continually move to new levels of scale, service providers and their customers expect that infrastructure failures will occur and are proactively changing their development and operational strategies. This session will explore these industry challenges and how service providers are applying new technologies and approaches to achieve reliability, availability, and serviceability at cloud scale. Service providers and vendors will join us to share their views on this complex topic and explain how they are applying and balancing the use of open source innovations, resilient service and application software, automation, DevOps, service assurance, and analytics to add value for their customers and business partners.  (Slides no longer available) Red Hat Enterprise Linux roadmap  Session overview: Red Hat Enterprise Linux is the premier Linux distribution, known for reliability, security, and performance. Red Hat Enterprise Linux is also the underpinning of Red Hat’s efforts in containers, virtualization, Internet of Things (IoT), network function virtualization (NFV), Red Hat Enterprise Linux OpenStack Platform, and more. 
Learn what’s new and emerging in this powerful operating system, and how new function and capability can help in your environment.  (Slides no longer available) Repeatable, reliable OpenStack deployments: Pipe dream or reality?  Session overview: Deploying OpenStack is an involved, complicated, and error-prone process, especially when deploying a physical Highly Available (HA) cluster with other software and hardware components, like Ceph. Difficulties include everything from hardware selection to the actual deployment process. Dell and Red Hat have partnered together to produce a solution based on Red Hat Enterprise Linux OSP Director that streamlines the entire process of setting up an HA OpenStack cluster. This solution includes a jointly developed reference architecture that includes hardware, simplified Director installation and configuration, Ceph storage backed by multiple back ends including Dell SC and PS series storage arrays, and other enterprise features–such as VM instance HA and networking segregation flexibility. In this session, you’ll learn how this solution drastically simplifies standing up an OpenStack cloud.  (Slides no longer available) Running a policy-based cloud with Cisco Application Centric Infrastructure, Red Hat OpenStack, and Project Contiv  Session overview: Infrastructure managers are constantly asked to push the envelope in how they deliver cloud environments. In addition to speed, scale, and flexibility, they are increasingly focused on both security and operational management and visibility as adoption increases within their organizations. This presentation will look at ways Cisco and Red Hat are partnering together to deliver policy-based cloud solutions to address these growing challenges. We will discuss how we are collaborating in the open source community and building products based on this collaboration. 
It will cover topics including:          Group-Based Policy for OpenStack      Cisco Application Centric Infrastructure (ACI) with Red Hat OpenStack      Project Contiv and its integration with Cisco ACI        (Slides no longer available)"
  },{
    "title": "Boosting the NFV datapath with RHEL OpenStack Platform",
    "url": "/blog/2016/02/14/boosting-the-nfv-datapath-with-rhel-openstack-platform/",
    "date": "Feb 14, 2016",
    "tags": ["DPDK","NFV","Open Source","OpenStack","OVS","SR-IOV"],
    "excerpt": "A post I wrote for the Red Hat Stack blog, trying to clarify what we are doing with RHEL OpenStack Platform to accelerate the datapath for NFV applications.",
    "content": "A post I wrote for the Red Hat Stack blog, trying to clarify what we are doing with RHEL OpenStack Platform to accelerate the datapath for NFV applications.Read the full post here: Boosting the NFV datapath with RHEL OpenStack Platform"
  },{
    "title": "NFV and Open Networking with RHEL OpenStack Platfrom",
    "url": "/blog/2016/01/04/nfv-and-open-networking-with-rhel-openstack-platfrom/",
    "date": "Jan 4, 2016",
    "tags": ["Network Virtualization","NFV","Open Source","OpenStack","Talks"],
    "excerpt": "(This is a summary version of a talk I gave at Intel Israel Telecom and NFV event on December 2nd, 2015. Slides are available on GitHub).",
    "content": "(This is a summary version of a talk I gave at Intel Israel Telecom and NFV event on December 2nd, 2015. Slides are available on GitHub).I was honored to be invited to speak on a local Intel event about Red Hat and what we are doing in the NFV space. I only had 30 minutes, so I tried to provide a high level overview of our offering, covering some main points:  Upstream first approach and why we believe it is a fundamental piece in the NFV journey; this is not a marketing pitch but really how we deliver our entire product portfolio  NFV and OpenStack; I touched on the fact that many service providers are asking for OpenStack-based solutions, and that OpenStack is the de-facto choice for VIM. That said, there are some limitations today (both cultural and technical) with OpenStack and clearly we have a way to go to make it a better engine for the telco needs  Full open source approach to NFV; it’s not just OpenStack but also other key projects such as QEMU/KVM, Open vSwitch, DPDK, libvirt, and the underlying Linux operating system. It’s hard to coordinate across these different communities, but this is what we are trying to do, with active participants on all of those  Red Hat product focus and alignment with OPNFV  Main use-cases we see in the market (atomic VNFs, vCPE, vEPC) with a design example of vPGW using SR-IOV  What telco and NFV specific features were introduced in RHEL OpenStack Platform 7 (Kilo) and what is planned for OpenStack Platform 8 (Liberty); as a VIM provider we want to offer our customers and the Network Equipment Providers (NEPs) maximum flexibility for packet processing options with PCI Passthrough, SR-IOV, Open vSwitch and DPDK-accelerated Open vSwitch based solutions.Thanks to Intel Israel for a very interesting and well-organized event! "
  },{
    "title": "LLDP traffic and Linux bridges",
    "url": "/blog/2016/01/04/lldp-traffic-and-linux-bridges/",
    "date": "Jan 4, 2016",
    "tags": ["Bridge","Linux","LLDP","Virtualization"],
    "excerpt": "In my previous post I described my Cumulus VX lab environment which is based on Fedora and KVM. One of the first things I noticed after bringing up the setup is that although I have got L3 connectivity between the emulated Cumulus switches, I can’t get LLDP to operate properly between the devices.",
    "content": "In my previous post I described my Cumulus VX lab environment which is based on Fedora and KVM. One of the first things I noticed after bringing up the setup is that although I have got L3 connectivity between the emulated Cumulus switches, I can’t get LLDP to operate properly between the devices.For example, a basic ICMP ping between the directly connected interfaces of leaf1 and spine3 is successful, but no LLDP neighbor shows up:cumulus@leaf1$ ping 13.0.0.3PING 13.0.0.3 (13.0.0.3) 56(84) bytes of data.64 bytes from 13.0.0.3: icmp_req=1 ttl=64 time=0.210 ms64 bytes from 13.0.0.3: icmp_req=2 ttl=64 time=0.660 ms64 bytes from 13.0.0.3: icmp_req=3 ttl=64 time=0.635 mscumulus@leaf1$ lldpcli show neighbors LLDP neighbors:-------------------------------------Reading through the Cumulus Networks documentation, I discovered that LLDP is turn on by default on all active interfaces. It is possible to tweak things, such as timers, but the basic neighbor discovery functionality should be there by default.Looking at the output from lldpcli show statistics I also discovered that LLDP messages are being sent out of the interfaces, but never received.So what’s going on?Remember that leaf1 and spine3 are not really directly connected. They are bridged together using a Linux bridge device.This is where I discovered that by design, Linux bridges silently drop LLDP messages (sent to the LLDP_Multicast address 01-80-C2-00-00-0E) and other control frames in the 01-80-C2-00-00-xx range.Explanation to that can be found in the 802.1AB standard which is stating that “the destination address shall be 01-80-C2-00-00-0E. 
This address is within the range reserved by IEEE Std 802.1D-2004 for protocols constrained to an individual LAN, and ensures that the LLDPDU will not be forwarded by MAC Bridges that conform to IEEE Std 802.1D-2004.” It is possible to change this behavior on a per-bridge basis, though, by using:  # echo 16384 > /sys/class/net/<bridge_name>/bridge/group_fwd_mask  The value 16384 is 0x4000, i.e. bit 14 of the mask, which corresponds to the 01-80-C2-00-00-0E address. Retesting with leaf1 and spine3:  # echo 16384 > /sys/class/net/virbr1/bridge/group_fwd_mask  LLDP now operates as expected between leaf1 and spine3. Remember that this is a per-bridge setting, so in order to get this fixed across the entire setup, the command needs to be issued for the rest of the bridges (virbr2, virbr3, virbr4) as well."
  },{
    "title": "Hands on with Fedora, KVM and Cumulus VX",
    "url": "/blog/2015/12/31/hands-on-with-fedora-kvm-and-cumulus-vx/",
    "date": "Dec 31, 2015",
    "tags": ["Data Center","Disaggregation","Ethernet","IP","Linux","Open Source","Virtualization"],
    "excerpt": "Cumulus Linux is a network operating system based on Debian that runs on top of industry standard networking hardware. By providing a software-only solution, Cumulus is enabling disaggregation of data center switches similar to the x86 server hardware/software disaggregation. In addition to the networking features you would expect from a network operating system like L2 bridging, Spanning Tree Protocol, LLDP, bonding/LAG, L3 routing, and so on, it enables users to take advantage of the latest Linux applications and automation tools, which is in my opinion its true power.",
    "content": "Cumulus Linux is a network operating system based on Debian that runs on top of industry standard networking hardware. By providing a software-only solution, Cumulus is enabling disaggregation of data center switches similar to the x86 server hardware/software disaggregation. In addition to the networking features you would expect from a network operating system like L2 bridging, Spanning Tree Protocol, LLDP, bonding/LAG, L3 routing, and so on, it enables users to take advantage of the latest Linux applications and automation tools, which is in my opinion its true power.Cumulus VX is a community-supported virtual appliance that enables network engineers to preview and test Cumulus Networks technology. The appliance is available in different formats (for VMware, VirtualBox, KVM, and Vagrant environments), and since I am running Fedora on my laptop the easiest thing for me was to use the KVM qcow2 image to try it out.My goal is to build a four node leaf/spine topology. To form the fabric, each leaf will be connected to each spine, so we will end up with two “fabric facing” interfaces on each switch. In addition, I want to have a separate management interface on each device I can use for SSH access as well as automation purposes (Ansible being an immediate suspect), and a loopback interface to be used as the router-id.Prerequisites      Install KVM and related virtualization packages. I am running Fedora 22 and used yum groupinstall \"Virtualization\\*\" to obtain the latest versions of libvirt, virt-manager, qemu-kvm and associated dependencies.    From the Virtual Machine Manager, create four basic isolated networks (without IP, DHCP or NAT settings). Those will serve as transport for the point-to-point links between our switches. I named them as follows:          net1      net2      net3      net4        Download the KVM qcow2 image from the Cumulus website. At the time of writing the image is based on Cumulus Linux v2.5.5. 
You would want to copy it four times, and name the copies as follows:          leaf1.qcow2      leaf2.qcow2      spine3.qcow2      spine4.qcow2      Creating the VMs  While creating each VM you will need to specify the network settings, in particular what interfaces you want to be created, what networks they should be part of, and what is their L2 (MAC) information. To ease troubleshooting, I came up with my own convention for the interface MAC addresses.  Leaf1:  Leaf1 should have three interfaces:          One belonging to the default network - a network created by virt-manager with DHCP and NAT enabled, and will be used for the management access.      One belonging to net1, which is going to be used for the connection between leaf1 and spine3. Behind the scenes, virt-manager created a Linux bridge for this network.      One belonging to net2, which is going to be used for the connection between leaf1 and spine4. Behind the scenes, virt-manager created a Linux bridge for this network.        Make sure to adjust the path to specify the location of the image:  sudo virt-install --os-variant=generic --ram=256 --vcpus=1 --network=default,model=virtio,mac=00:00:00:00:00:11 --network network=net1,model=virtio,mac=00:00:01:00:00:13 --network network=net2,model=virtio,mac=00:00:01:00:00:14 --boot hd --disk path=/home/nyechiel/Downloads/VX/leaf1.qcow2,format=qcow2 --name=leaf1  Leaf2:  Leaf2 should have three interfaces:          One belonging to the default network - a network created by virt-manager with DHCP and NAT enabled, and will be used for the management access.      One belonging to net3, which is going to be used for the connection between leaf2 and spine3. Behind the scenes, virt-manager created a Linux bridge for this network.      One belonging to net4, which is going to be used for the connection between leaf2 and spine4. Behind the scenes, virt-manager created a Linux bridge for this network.        
Make sure to adjust the path to specify the location of the image:  sudo virt-install --os-variant=generic --ram=256 --vcpus=1 --network=default,model=virtio,mac=00:00:00:00:00:22 --network network=net3,model=virtio,mac=00:00:02:00:00:23 --network network=net4,model=virtio,mac=00:00:02:00:00:24 --boot hd --disk path=/home/nyechiel/Downloads/VX/leaf2.qcow2,format=qcow2 --name=leaf2  Spine3:  Spine3 should have three interfaces:          One belonging to the default network - a network created by virt-manager with DHCP and NAT enabled, and will be used for the management access.      One belonging to net1, which is going to be used for the connection between leaf1 and spine3. Behind the scenes, virt-manager created a Linux bridge for this network.      One belonging to net3, which is going to be used for the connection between leaf2 and spine3. Behind the scenes, virt-manager created a Linux bridge for this network.        Make sure to adjust the path to specify the location of the image:  sudo virt-install --os-variant=generic --ram=256 --vcpus=1 --network=default,model=virtio,mac=00:00:00:00:00:33 --network network=net1,model=virtio,mac=00:00:03:00:00:31 --network network=net3,model=virtio,mac=00:00:03:00:00:32 --boot hd --disk path=/home/nyechiel/Downloads/VX/spine3.qcow2,format=qcow2 --name=spine3  Spine4:  Spine4 should have three interfaces:          One belonging to the default network - a network created by virt-manager with DHCP and NAT enabled, and will be used for the management access.      One belonging to net2, which is going to be used for the connection between leaf1 and spine4. Behind the scenes, virt-manager created a Linux bridge for this network.      One belonging to net4, which is going to be used for the connection between leaf2 and spine4. Behind the scenes, virt-manager created a Linux bridge for this network.        
Make sure to adjust the path to specify the location of the image:  sudo virt-install --os-variant=generic --ram=256 --vcpus=1 --network=default,model=virtio,mac=00:00:00:00:00:44 --network network=net2,model=virtio,mac=00:00:04:00:00:41 --network network=net4,model=virtio,mac=00:00:04:00:00:42 --boot hd --disk path=/home/nyechiel/Downloads/VX/spine4.qcow2,format=qcow2 --name=spine4  Verifying the hypervisor topologyBefore we log in to any of the newly created VMs, I would first like to verify the configuration and make sure that we have got the right connectivity in place. Using ifconfig on my Fedora system, and by looking into the MAC addresses, I correlated the Linux bridges created by virt-manager (virbr0, virbr1, virbr2, virbr3, virbr4) with the virtual Ethernet devices (vnet). This gives me the hypervisor’s point of view, which is going to be really useful for troubleshooting. I came up with this topology:Useful commands to use here are brctl show and brctl showmacs. For example, let’s examine the link between leaf1 and spine3 (note that libvirt bases the vnet MAC on the configured guest MAC address, with the high byte set to 0xFE):  $ ip link show vnet1 | grep link  link/ether fe:00:01:00:00:13 brd ff:ff:ff:ff:ff:ff    $ ip link show vnet10 | grep link  link/ether fe:00:03:00:00:31 brd ff:ff:ff:ff:ff:ff    brctl show virbr1    brctl showmacs virbr1  Verifying the fabric topologyNow that we have the basic networking setup between the VMs and we understand the topology, we can jump into the switches and confirm their view. The switches can be accessed with the username “cumulus” and the password “CumulusLinux!”. This is also the password for root.Using console access to the VMs and the ifconfig command we can learn a couple of things:  eth0 is the base interface on each switch used for management purposes. It picked up an address from the 192.168.122.0/24 range, which is what virt-manager used to set up the “default” network. 
SSH to this address is enabled by default on the standard TCP port 22.  The “fabric” interfaces are swp1 and swp2.Based on this information we can build up our final topology, which is a representation of the actual fabric:Now what?Now that we have the basic topology set up and the right diagrams to support us, we can go on and configure things. Cumulus has good documentation, so I will let you take it from here. You can configure things manually using the CLI (which is really a bash system with standard Linux commands) or use automation tools to control the switch.Using the CLI and following the documentation, it was pretty straightforward for me to configure hostnames and IP addresses and to bring up OSPF and BFD (using Quagga) between the switches. Next I plan to play with the more advanced stuff (personally I want to test out BGP and IPv6 configurations), and try to automate things using Ansible. Happy testing!"
  },{
    "title": "Reflections on the networking industry, part 2: On CLI, APIs and SNMP",
    "url": "/blog/2015/12/29/reflections-on-the-networking-industry-part-2-on-cli-apis-and-snmp/",
    "date": "Dec 29, 2015",
    "tags": ["API","Automation","CLI","NETCONF","Open Source","REST","SNMP","Vendor"],
    "excerpt": "In the previous post I briefly described the fact that many networks today are closed and vertically designed. While standard protocols are being adopted by vendors, true interoperability is still a challenge. Sure, you can bring up a BGP peer between platforms from different vendors and exchange route information (otherwise we couldn’t scale the Internet), but management and configuration is still, in most cases, vendor specific.",
    "content": "In the previous post I briefly described the fact that many networks today are closed and vertically designed. While standard protocols are being adopted by vendors, true interoperability is still a challenge. Sure, you can bring up a BGP peer between platforms from different vendors and exchange route information (otherwise we couldn’t scale the Internet), but management and configuration is still, in most cases, vendor specific.Every network engineer out there has got to respect the CLI. We sometimes love them and sometimes hate them, but we all tend to master them. The CLI is still the glorious way of interacting with a network device, even in 2015. Some common properties of CLIs are:  They are vendor, and sometimes even device, specific;  They are not standardized; there is no standard for setting up the data or for displaying the text;  They don’t have a strict notion of versioning or guarantee backward compatibility;  They can change between software releases;All of the above make CLIs an acceptable solution up to a certain scale. With large-scale networks, automation is a key part and usually mandatory. But given the properties mentioned above, automating a network device configuration based on CLI commands isn’t a trivial task.Today, you can see more and more vendors that support other protocols such as NETCONF or REST for interacting with their devices. The impression is that you suddenly have a proper API and a standard method to communicate with the devices. Reality is that with such protocols you do have a standard transport to interact with a device, but you still do not have an API, as each device/vendor still represents data differently, as brilliantly described by Jason Edelman in this blog post.We, as an industry, must agree on a standard way for representing the network data. No more vendor-specific implementations, but true, open, models. The last major try was with SNMP, the Simple Network Management Protocol, which is anything but simple. 
Most people just turn it off, or use it to capture (read: poll) very basic information from a device. Anything more complex than that, not to mention device configuration, requires installation of vendor specific MIBs and we are back to the same problem."
  },{
    "title": "Reflections on the networking industry, part 1: Welcome to vendor land",
    "url": "/blog/2015/12/28/reflections-on-the-networking-industry-part-1/",
    "date": "Dec 28, 2015",
    "tags": ["Data Center","Ethernet","IP","Open Source","Vendor"],
    "excerpt": "I have been involved with networking for quite some time now; I have had the opportunity to design, implement and operate different networks across different environments such as enterprise, data-center, and service provider - which inspired me to create this series of short blog posts exploring the computer networking industry. My view on the history, challenges, hype and reality, and most importantly - what’s next and how we can do better.",
    "content": "I have been involved with networking for quite some time now; I have had the opportunity to design, implement and operate different networks across different environments such as enterprise, data-center, and service provider - which inspired me to create this series of short blog posts exploring the computer networking industry. My view on the history, challenges, hype and reality, and most importantly - what’s next and how we can do better.Protocols and standards were always a key part of networking and were born out of necessity: we need different systems to be able to talk to each other.The modern networking suite is built around Ethernet and the TCP/IP stack, including TCP, UDP, and ICMP - all riding on top of IPv4 or IPv6. There is a general consensus that Ethernet and TCP/IP won the race against the other alternatives. This is great, right? Well, the problem is not with Ethernet nor the TCP/IP stack, but with their “ecosystem”: a long list of complementary technologies and protocols.Getting the industry to agree on the base layer 2, layer 3 and layer 4 protocols and their header format was indeed a big thing, but we kind of stopped there. Let’s say you have got a standard-based Ethernet link. How would you bring it up and negotiate its speed? And what about monitoring, loop prevention, or neighbor discovery? Except for the very basic, common denominator functionality, vendors came up with their own set of proprietary protocols for solving these issues. Just off the top of my head: ISL, VTP, DTP, UDLD, PAgP, CDP, and PVST are all examples of the “Ethernet ecosystem” from one (!) vendor.True, you can find standard alternatives for the mentioned protocols today. Vendors are embracing open standards and tend to replace their proprietary implementation with a standard one if available. But why not start with the standard one to begin with?If you think that these are just historical examples from a different era, think again. 
Even today, more and more protocols are being developed and/or adopted by single vendors only. I usually like to point out MC-LAG as an example of a fairly recent and very common architecture with no standard-based implementation. This feature alone can lead you to choose one vendor (or even one specific hardware model from one vendor) across your entire network, resulting in a perfect vendor lock-in."
  },{
    "title": "Neutron networking with Red Hat Enterprise Linux OpenStack Platform",
    "url": "/blog/2015/06/28/neutron-networking-with-red-hat-enterprise-linux-openstack-platform/",
    "date": "Jun 28, 2015",
    "tags": ["IPv6","Neutron","OpenStack","SDN","Talks","VXLAN"],
    "excerpt": "(This is a summary version of a talk I gave at Red Hat Summit on June 25th, 2015. Slides are available on GitHub).",
    "content": "(This is a summary version of a talk I gave at Red Hat Summit on June 25th, 2015. Slides are available on GitHub).I was honored to speak for the second time in a row at Red Hat Summit, the premier open source technology event hosted in Boston this year. As I am now focusing on product management for networking in Red Hat Enterprise Linux OpenStack Platform I presented Red Hat’s approach to Neutron, the OpenStack networking service.Since OpenStack is a fairly new project and a new product in Red Hat’s portfolio, I was not sure what level of knowledge to expect from my audience. Therefore I started with a fairly basic overview of Neutron - what it is and which common features you can get from its API today. I was very happy to see that most of the people in the audience seemed to be already familiar with OpenStack and with Neutron so the overview part was quick.The next part of my presentation was a deep dive into Neutron when deployed with the ML2/Open vSwitch (OVS) plugin. This is our default configuration when deploying Red Hat Enterprise Linux OpenStack Platform today, and like any other Red Hat product, it is based on fully open-source components. Since there is so much to cover here (and I only had one hour for the entire talk), I focused on the core elements of the solution, and the common features we see customers using today: L2 connectivity, L3 routing and NAT for IPv4, and DHCP for IP address assignment. I explained the theory of operation and used some graphics to describe the backend implementation and how things look on the OpenStack nodes.The OVS-based solution is our default, but we are also working with a very large number of leading vendors in the industry providing their own solutions through the use of Neutron plugins. 
I spent some time describing the various plugins out there, our current partner ecosystem, and Red Hat’s certification program for 3rd party software.I then covered some of the major recent enhancements introduced in Red Hat Enterprise Linux OpenStack Platform 6 based on the upstream Juno code base: IPv6 support, L3 HA, and distributed virtual router (DVR) - which is still a Technology Preview feature, yet very interesting to our customers.Overall, I was very happy with this talk and with the number of questions I got at the end. It looks like OpenStack is happening, and more and more customers are interested in finding out more about it. See you next year in San Francisco for Red Hat Summit 2016!"
  },{
    "title": "IPv6 Prefix Delegation - what is it and how is it going to help OpenStack?",
    "url": "/blog/2015/06/23/ipv6-prefix-delegation-what-is-it-and-how-does-it-going-to-help-openstack/",
    "date": "Jun 22, 2015",
    "tags": ["IPv6","Neutron","OpenStack"],
    "excerpt": "IPv6 offers several ways to assign IP addresses to end hosts. Some of them (SLAAC, stateful DHCPv6, stateless DHCPv6) were already covered in this post. The IPv6 Prefix Delegation mechanism (described in RFC 3769 and RFC 3633) provides “a way of automatically configuring IPv6 prefixes and addresses on routers and hosts” - which sounds like yet another IP assignment option. How does it differ from the other methods? And why do we need it? Let’s try to figure it out.",
    "content": "IPv6 offers several ways to assign IP addresses to end hosts. Some of them (SLAAC, stateful DHCPv6, stateless DHCPv6) were already covered in this post. The IPv6 Prefix Delegation mechanism (described in RFC 3769 and RFC 3633) provides “a way of automatically configuring IPv6 prefixes and addresses on routers and hosts” - which sounds like yet another IP assignment option. How does it differ from the other methods? And why do we need it? Let’s try to figure it out.Understanding the problemI know that you still find it hard to believe… but IPv6 is here, and with IPv6 there are enough addresses. That means that we can finally design our networks properly and avoid using different kinds of network address translation (NAT) in different places across the network. Clean IPv6 design will use addresses from the Global Unicast Address (GUA) range, which are routable in the public Internet. Since these are globally routed, care needs to be taken to ensure that prefixes configured by one customer do not overlap with prefixes chosen by another.While SLAAC or DHCPv6 enable simple and automatic host configuration, they do not specify a way to automatically delegate a prefix to a customer site. With IPv6, there is a need to create a hierarchical model in which the service provider allocates prefixes from a set of pools to the customer. The customer then assigns addresses to its end systems out of the predefined pool. This is powerful, as it provides the service provider with control over the IPv6 prefix assignment, and could eliminate potential conflicts in prefix selection.How does it work?With Prefix Delegation, a delegating router (Prefix Delegation Server) delegates IPv6 prefixes to a requesting router (Prefix Delegation Client). The requesting router then uses the prefixes to assign global IPv6 addresses to the devices on its internal interfaces. 
Prefix Delegation is useful when the delegating router does not have information about the topology of the networks in which the requesting router is located. The delegating router requires only the identity of the requesting router to choose a prefix for delegation. Prefix Delegation is not a new protocol. It uses DHCPv6 messages as defined in RFC 3633, and is thus sometimes referred to as DHCPv6 Prefix Delegation.DHCPv6 prefix delegation operates as follows:  A delegating router (Server) is provided with IPv6 prefixes to be delegated to requesting routers.  A requesting router (Client) requests one or more prefixes from the delegating router.  The delegating router (Server) chooses prefixes for delegation, and responds with prefixes to the requesting router (Client).  The requesting router (Client) is then responsible for the delegated prefixes.  The final address allocation mechanism in the local network can be performed with SLAAC or stateful/stateless DHCPv6, based on the customer preference. At this step the key thing is the IPv6 prefix and not how it is delivered to end systems.IPv6 in OpenStack NeutronBack in the Icehouse development cycle, the Neutron “subnet” API was enhanced to support IPv6 address assignment options. A reference implementation of this followed in the Juno cycle, where dnsmasq and radvd processes were chosen to serve the subnets with RAs, SLAAC or DHCPv6.In the current Neutron implementation, tenants must supply a prefix when creating subnets. This is not a big deal for IPv4, as tenants are expected to pick private IPv4 subnets for their networks and NAT is going to take place anyway when reaching external public networks. For IPv6 subnets that use Global Unicast Address (GUA) format, addresses are globally routable and cannot overlap. There is no NAT or floating IP model for IPv6 in Neutron. And if you ask me, there should not be one. GUA is the way to go. But can we just trust the tenants to configure their IPv6 prefixes correctly? 
Probably not, and that’s why Prefix Delegation is an important feature for OpenStack.An OpenStack administrator may want to simplify the process of subnet prefix selection for the tenants by automatically supplying prefixes for IPv6 subnets from one or more large pools of pre-configured IPv6 prefixes. The tenant would not need to specify any prefix configuration. Prefix Delegation will take care of the address assignment.The code is expected to land in OpenStack Liberty based on this specification. Other than REST API changes, a PD client would need to run in the Neutron router network namespace whenever a subnet attached to that router requires prefix delegation. Dibbler is an open-source utility that supports PD client and can be used to provide the required functionality."
  },{
    "title": "OpenStack Networking with Neutron: What Plugin Should I Deploy?",
    "url": "/blog/2015/06/17/openstack-networking-with-neutron-what-plugin-should-i-deploy/",
    "date": "Jun 17, 2015",
    "tags": ["Network Virtualization","Neutron","OpenStack","Overlay Networks","SDN","Talks"],
    "excerpt": "(This is a summary version of a talk I gave at OpenStack Israel event on June 15th, 2015. Slides are available on GitHub).",
    "content": "(This is a summary version of a talk I gave at OpenStack Israel event on June 15th, 2015. Slides are available on GitHub).Neutron is probably one of the most pluggable projects in OpenStack today. The theory is very simple and goes like this: Neutron provides just an API layer and you have got to choose the backend implementation you want. But in reality, there are plenty of plugins (or drivers) to choose from and the plugin architecture is not always so clear.The plugin is a critical piece of the deployment and directly affects the feature set you are going to get, as well as the scale, performance, high availability, and supported network topologies. In addition, different plugins offer different approaches for managing and operating the networks.So what is a Neutron plugin?The Neutron API exposed via the Neutron server is split into two buckets: the core (L2) API and the API extensions. While the core API consists only of the fundamental Neutron definitions (Network, Subnet, Port), the API extensions are where the interesting stuff gets defined, and where you can deal with constructs like L3 router, provider networks, or L4-L7 services such as FWaaS, LBaaS or VPNaaS.In order to match this design, the plugin architecture is built out of a “core” plugin (which implements the core API) and one or more “service” plugins (to implement additional “advanced” services defined in the API extensions). To make things more interesting, these advanced network services can also be provided by the core plugin by implementing the relevant extensions.What plugins are out there?There are many plugins out there, each with its own approach. But when trying to categorize them, I found that usually it boils down to “software centric” plugins versus “hardware centric” plugins.With the software centric ones, the assumption is that the network hardware is general-purpose, and the functionality is offered, as the name implies, with software only. 
This is where we get to see most of the overlay networking approaches with the virtual tunnel end-points (VTEP) implemented in the Compute/Hypervisor nodes. The only requirement from the physical fabric is to provide basic IP routing/switching. The plugin can use an SDN approach to provision the overlay tunnels in an optimal manner, and handle broadcast, unknown unicast and multicast (BUM) traffic efficiently.With the hardware centric ones, the assumption is that dedicated network hardware is in place. This is where the traditional network vendors usually offer a combined software/hardware solution taking advantage of their network gear. The advantages of this design are better performance (if you offload certain network functions to the hardware) and the promise of better manageability and control of the physical fabric.And what is there by default?There are efforts in the Neutron community to completely separate the API (or control-plane components) from the plugin or actual implementation. The vision is to position Neutron as a platform, and not as any specific implementation. That being said, Neutron was really developed out of the Open vSwitch plugin, and a good amount of the upstream development today is still focused around that. Open vSwitch (with the OVS ML2 driver) is what you get by default, and this is by far the most common plugin deployed in production (see the recent user survey). This solution is not perfect and has pros and cons like any other of the solutions out there.While Open vSwitch is used on the Compute nodes to provide connectivity for VM instances, some of the key components with this solution are actually not related to Open vSwitch. L3 routing, DHCP, and other services are implemented using dedicated software agents using Linux tools such as network namespaces (ip netns), dnsmasq, or iptables.So how should one choose a plugin?I am sorry, but there is no easy answer here. 
From my experience, the best way is to develop a methodical approach:  Evaluate the default Open vSwitch based solution first. Even if you end up not choosing it for your production environment, it should at least get you familiar with the Neutron constructs, definitions and concepts   Get to know your business needs, and collect technical requirements. Some key questions to answer:          Are you building a greenfield deployment?      What level of interaction is expected with your existing network?      What type of applications are going to run in your cloud?      Is self-service required?      Who are the end-users?      What level of isolation and security is required?      What level of QoS is expected?      Are you building a multi cloud/multi data-center or a hybrid deployment?        Test things out yourself. Don’t rely on vendor presentations and other marketing materials"
  },{
    "title": "What’s Coming in OpenStack Networking for the Kilo Release",
    "url": "/blog/2015/05/11/whats-coming-in-openstack-networking-for-the-kilo-release/",
    "date": "May 11, 2015",
    "tags": ["IPv6","Network Virtualization","Neutron","OpenStack"],
    "excerpt": "A post I wrote for the Red Hat Stack blog on what’s coming in OpenStack Networking for the Kilo release.",
    "content": "A post I wrote for the Red Hat Stack blog on what’s coming in OpenStack Networking for the Kilo release."
  },{
    "title": "An Overview of Link Aggregation and LACP",
    "url": "/blog/2015/05/01/an-overview-of-link-aggregation-and-lacp/",
    "date": "May 1, 2015",
    "tags": ["Data Center","Ethernet","LACP","LAG","Network Virtualization"],
    "excerpt": "The concept of Link Aggregation (LAG) is well known in the networking industry by now, and people usually consider it as a basic functionality that just works out of the box. With all of the SDN hype that’s going on out there, I sometimes feel that we tend to neglect some of the more “traditional” stuff like this one. As with many networking technologies and protocols, things may not just work out of the box, and it’s important to master the details to be able to design things properly, know what to expect (i.e., what the normal behavior is) and ultimately be able to troubleshoot in case of a problem.",
    "content": "The concept of Link Aggregation (LAG) is well known in the networking industry by now, and people usually consider it as a basic functionality that just works out of the box. With all of the SDN hype that’s going on out there, I sometimes feel that we tend to neglect some of the more “traditional” stuff like this one. As with many networking technologies and protocols, things may not just work out of the box, and it’s important to master the details to be able to design things properly, know what to expect (i.e., what the normal behavior is) and ultimately be able to troubleshoot in case of a problem.The basic concept of LAG is that multiple physical links are combined into one logical bundle. This provides two major benefits, depending on the LAG configuration:  Increased capacity - traffic may be balanced across the member links to provide aggregated throughput  Redundancy - the LAG bundle can survive the loss of one or more member linksLAG is defined by the IEEE 802.1AX-2008 standard, which states, “Link Aggregation allows one or more links to be aggregated together to form a Link Aggregation Group, such that a MAC client can treat the Link Aggregation Group as if it were a single link”. This layer 2 transparency is achieved by the LAG using a single MAC address for all the device’s ports in the LAG group. The individual port members must be of the same speed, so you cannot bundle, for example, a 1G and a 10G interface. The ports should also have the same duplex settings, encapsulation type (i.e., access/untagged or 802.1q tagged with the exact same number of VLANs) as well as MTU.LAG can be configured as either static (manually) or dynamic by using a protocol to negotiate the LAG formation, with LACP being the standard-based one. 
There is also the Port Aggregation Protocol (PAgP), which is similar in many regards to LACP, but is Cisco proprietary and not in common usage anymore.Wait… LAG, bond, bundle, team, trunk, EtherChannel, Port Channel?Let’s clear this right away - there are several terms used to describe LAG which are sometimes used interchangeably. While LAG is the standard name defined by the IEEE specification, different vendors and operating systems came up with their own implementation and terminology. Bond, for example, is well known on Linux-based systems, following the name of the kernel driver. Team (or NIC teaming) is also pretty common across Windows systems, and lately Linux systems as well. EtherChannel is one of the famous terms, being used on Cisco’s IOS. Interestingly enough, Cisco has changed the term in their IOS-XR software to bundles, and in their NX-OS systems to Port Channels. Oh… I love the standardization out there!LAG can also be used as a general term to describe link aggregation with different technologies (such as MLPPP for PPP links) which can cause some confusion, while Ethernet is the de facto standard and the focus of the IEEE spec.Use casesToday, Link Aggregations can be found in many network designs, and across different portions of the network. LAG can be found across the Enterprise, Data Center, and Service Provider networks. In the cloud and virtualization space, it’s also common to want to use multiple network connections in your hypervisors to support Virtual Machine traffic. So you can have LAG configured between different network devices (e.g., switch to switch, router to router), or between an end host or hypervisor and the upstream network device (usually some sort of a ToR switch).L2 LAG and STPFrom Spanning Tree Protocol (STP) perspective, no matter how many physical ports are being used to form the LAG, there is going to be only one logical interface representing each LAG bundle. 
The individual ports are not part of the STP topology, but only the one logical interface. STP is still going to be active on the LAG interface and should not be turned off, so that if there are multiple LAGs configured between two adjacent nodes, STP will block one of them.L3 LAGWhile LAG is extremely common across L2 network designs, and sometimes even seen as a partial replacement for Spanning Tree Protocol (STP), it is important to mention that LAG can also operate at L3, i.e., by assigning an IPv4 or IPv6 subnet to the aggregated link. You can then set up static or dynamic routing over the LAG like any other routed interface.LAG versus MC-LAGBy definition, LAG is formed across two adjacent nodes which are directly connected to each other. The two nodes must be configured properly to form the LAG, so that traffic is transferred properly between the nodes without, for example, creating traffic loops between the individual members.MC-LAG, or Multi-Chassis Link Aggregation Group, is a type of LAG with constituent ports that terminate on separate chassis, thereby providing node-level redundancy. Unlike link aggregation in general, MC-LAG is not covered by an IEEE standard, and its implementation varies by vendor. Cisco’s vPC is a good example of an MC-LAG implementation. The real challenge with MC-LAG is to maintain a consistent control plane state across the LAG setup, which is why the various multi-chassis mechanisms insist on countermeasures such as peer links or out of band connectivity between the redundant chassis.Load sharing operationTraffic is not randomly placed across the LAG members, but instead shared using a deterministic hash algorithm. 
Depending on the platform and the configuration, a number of parameters may feed into the algorithm, including for example the ingress interface, source and/or destination MAC address, source and/or destination IP address, source and/or destination L4 (TCP/UDP) port numbers, MPLS labels, and so on.Ultimately the hash will take in some combination of parameters to identify a flow and decide on which member link the frame should be placed. It is important to note that all traffic for a particular flow will always be placed on the same link. That also means that traffic for a single flow (e.g., source and destination MAC) cannot exceed the bandwidth of a single member link. It is also important to note that each node (or chassis) performs the hash calculations locally itself, so that upstream and downstream traffic for a single flow will not necessarily traverse the same link.Static configurationThe basic way to form a LAG is to simply specify the member ports on each node manually. This method does not involve any protocols to negotiate and form the LAG. Depending on the platform, the user can also control the hash algorithm on each side. As soon as a port becomes physically up it becomes a member of the LAG bundle. The major advantage of this is that the configuration is very simple. The disadvantage is that there is no method to detect any kind of cabling or configuration errors, which is why most vendors would recommend an LACP configuration instead.LACP configurationLACP is the standards-based protocol used to signal LAGs. It detects and protects the network from a variety of misconfigurations, ensuring that links are only aggregated into a bundle if they are consistently configured and cabled. 
LACP can be configured in one of two modes:  Active mode - the device immediately sends LACP messages (LACP PDUs) when the port comes up  Passive mode - places a port into a passive negotiating state, in which the port only responds to LACP PDUs it receives but does not initiate LACP negotiationIf both sides are configured as active, LAG can be formed assuming successful negotiation of the other parameters. If one side is configured as active and the other one as passive, LAG can be formed as the passive port will respond to the LACP PDUs received from the active side. If both sides are passive, LACP will fail to negotiate the bundle. In practice it is rare to find passive mode used as it should be clearly and consistently defined which links will use LACP/LAG ahead of deployment. There are even vendors who do not offer the passive mode option at all.With LACP, you can also control the interval at which LACP PDUs are sent. The standard defines two intervals: fast (1 second) and slow (30 seconds). Note that the timeout value does not have to agree between peers. While it is not a recommended configuration, it is possible to bring up a LAG with one end sending every 1 second and the other sending every 30 seconds. Depending on the platform and configuration, it is also possible to use Bidirectional Forwarding Detection (BFD) for fast detection of link failures."
  },{
    "title": "Red Hat Enterprise Linux OpenStack Platform 6: SR-IOV Networking - Part II: Walking Through the Implementation",
    "url": "/blog/2015/04/29/red-hat-enterprise-linux-openstack-platform-6-sr-iov-networking-part-ii-walking-through-the-implementation/",
    "date": "Apr 29, 2015",
    "tags": ["Data Center","NFV","OpenStack","SR-IOV"],
    "excerpt": "Second part of the SR-IOV networking post I wrote for the Red Hat Stack blog.",
    "content": "Second part of the SR-IOV networking post I wrote for the Red Hat Stack blog."
  },{
    "title": "Red Hat Enterprise Linux OpenStack Platform 6: SR-IOV Networking - Part I: Understanding the Basics",
    "url": "/blog/2015/03/05/red-hat-enterprise-linux-openstack-platform-6-sr-iov-networking-part-i-understanding-the-basics/",
    "date": "Mar 5, 2015",
    "tags": ["Data Center","NFV","OpenStack","SR-IOV"],
    "excerpt": "Check out this blog post I wrote for Red Hat Stack on SR-IOV networking support introduced in RHEL OpenStack Platfrom 6. This is based on the Nova and Neutron work done at the upstream community for the OpenStack Juno release.",
    "content": "Check out this blog post I wrote for Red Hat Stack on SR-IOV networking support introduced in RHEL OpenStack Platfrom 6. This is based on the Nova and Neutron work done at the upstream community for the OpenStack Juno release."
  },{
    "title": "The need for Network Overlays – part II",
    "url": "/blog/2014/11/30/the-need-for-network-overlays-part-ii/",
    "date": "Nov 30, 2014",
    "tags": ["Data Center","Geneve","Network Virtualization","OpenStack","Overlay Networks","VXLAN"],
    "excerpt": "In the previous post, I covered some of the basic concepts behind network overlays, primarily highlighting the need to move into a more robust, L3 based, network environments. In this post I would like to cover network overlays in more detail, going over the different encapsulation options and highlighting some of the key points to consider when deploying an overlay-based solution.",
    "content": "In the previous post, I covered some of the basic concepts behind network overlays, primarily highlighting the need to move into a more robust, L3 based, network environments. In this post I would like to cover network overlays in more detail, going over the different encapsulation options and highlighting some of the key points to consider when deploying an overlay-based solution.Underlying fabric considerationsWhile network overlays give you the impression that networks are suddenly all virtualized, we still need to consider the physical underlying network. No matter what overlay solution you might pick, it’s still going to be the job of the underlying transport network to switch or route the traffic from source to destination (and vice versa).Like any other network design, there are several options to choose from when building the underlying network. Before picking up a solution, it’s important to analyze the requirements - namely the scale, amount of virtual machines (VMs), size of the network as well as the amount of traffic. Yes, there are some fancy network fabric solutions out there from any of your favorite vendors, but simple L3 Clos network will do just fine. The big news here is that the underlying network should no longer be a L2 bridged network, but can be configured as a L3 routed network. Clos topology with ECMP routing can provide efficient non-blocking forwarding with a quick convergence time in a case of a failure. Known protocols such as OSPF, IS-IS, and BGP, with the addition of a protocol like BFD, can provide a good standard-based foundation for such a network. One thing I do want to highlight when it comes to the underlying network, is the requirement to support Jumbo frames. 
No matter what overlay encapsulation you may choose to implement, extra bytes of header will be added to the frames, resulting in a need for high MTU support from the physical network.For the virtualization/cloud admin, with overlay networks, the data network used to carry the overlay traffic is no longer a special network that requires careful VLAN configuration. It is now just one more infrastructure network used to provide simple TCP/IP connectivity.EncapsulationWhen it comes to the overlay data-plane encapsulation, the amount of discussion, comparison and debate out there is amazing. There are several options and standards available, and all of them have the same goal: provide emulated L2 networks over an IP infrastructure. The main difference between them is the encapsulation format itself and their approach to the control plane - which is essentially the way to obtain MAC-to-IP mapping information for the tunnel end-points.It all started with the well-known Generic Routing Encapsulation (GRE) protocol that was rebranded as NVGRE. GRE is a simple point-to-point tunneling protocol which is being used in today’s networks to solve various design challenges and therefore is well understood by many network engineers. With NVGRE, the inner frame is encapsulated with GRE encapsulation as specified in RFC 2784 and RFC 2890. The Key field (32 bits) in the GRE header is used to carry the Tenant Network Identifier (TNI) and is used to isolate different logical segments. One thing to note about GRE is the fact that it uses IP protocol number 47 for communication, i.e., it does not use TCP or UDP - which makes it hard to create header entropy. Header entropy is something that you really want to have if you are using a flow-based ECMP network to carry the overlay traffic. Interestingly enough, the authors of NVGRE do not cover the control plane part but only the data-plane considerations.Another option would be Virtual Extensible LAN (VXLAN). 
Unlike NVGRE, VXLAN is a new protocol that was designed to solve the overlay networks use case. It uses UDP for communication (port 4789) and a 24-bit segment ID known as the VXLAN network identifier (VNI). With VXLAN, a hash of the inner frame’s header is used as the VXLAN source UDP port. As a result, a VXLAN flow can be uniquely identified by the combination of IP addresses and UDP ports in its outer header while traversing the underlay physical network. Therefore, the hashed source UDP port introduces a desirable level of entropy for ECMP load balancing. When it comes to the control plane, VXLAN does not provide any solution, but instead relies on flooding emulated with IP multicast. The original standard recommends creating an IP multicast group per VNI to handle broadcast traffic within a segment. This requires support for IP multicast on the underlying physical network as well as proper configuration and maintenance of the various multicast trees. This approach may work for small scale environments, but for large environments with a large number of logical VXLAN segments this is probably not a good idea. It is also important to note here that while IP multicast is a clever way to handle IP traffic, it is not commonly implemented in Data Center networks today, and the requirement to deploy an IP multicast network (which can be fairly complex) just to introduce VXLAN is not something that is accepted in most cases. These days, it is common to see “unicast mode” VXLAN implementations that do not require any kind of multicast support.You may also have heard about Stateless Transport Tunneling Protocol (STT) which was originally introduced by Nicira (now VMware NSX). The main reason I decided to mention STT here is one of its benefits: the ability to leverage TCP offloading capabilities from existing physical NICs, resulting in improved performance. STT uses a header that looks just like the TCP header to the NIC. 
The NIC is thus able to perform Large Segment Offload (LSO) on what it thinks is a simple TCP datagram. That said, new generation NICs also offer offload capabilities for NVGRE and VXLAN, so this is not a unique benefit of STT anymore.Last but not least, I would also like to introduce Geneve: Generic Network Virtualization Encapsulation, which looks to take a more holistic view of tunneling. At first glance, Geneve looks very similar to VXLAN. It uses a UDP-based header and a 24-bit Virtual Network Identifier. So what is unique about Geneve? The fact that it uses an extensible header format, similar to (long-living) protocols such as BGP, LLDP, and IS-IS. The idea is that Geneve can evolve over time with new capabilities, not by revising the base protocol, but by adding new optional capabilities. The protocol has a set of fixed headers, parameters and values, but then leaves room for non-defined optional fields. New fields can be added to the protocol by simply defining and publishing them. The protocol is created in such a way that implementations know there may be optional fields that they may or may not understand. Although the protocol is new, there is some work to enable Open vSwitch support as well as NIC vendors announcing support for offloading capabilities.I also want to leave room here for some other protocols that can be used as an encapsulation option. There is nothing wrong with MPLS for example, other than the fact that it needs to be enabled throughout the underlying transport network.So should I pick a winner? Probably not. As you can see you have got some options to choose from, but let’s make it clear: all protocols discussed above are ignoring the real problems (hint: control-plane operations) and providing a nice framework for data-plane encapsulation, which is just part of the deal. 
If I had to pick one, I would say that it looks like VXLAN and Geneve are here to stay (but we should let the market decide).Tunnel End PointI have already mentioned the term tunnel end-point, sometimes referred to as a VTEP, earlier. But what is this end-point, and more importantly, where is it located? The function of the VTEP is to encapsulate the VM traffic within an IP header to send across the underlying IP network. With the most common implementations, the VMs are unaware of the VTEP. They just send untagged or VLAN-tagged traffic that needs to be classified and associated with a VTEP. Initial designs (which are still the most common ones) implemented the VTEP functionality within the hypervisor which houses the VMs, usually in the software vSwitch. While this is a valid solution that is probably here to stay, it is also worth mentioning an alternative design in which the VTEP functionality is implemented in hardware, e.g., within a top-of-rack (ToR) switch. This makes sense in some environments, especially where performance and throughput are critical.Control plane or floodingProbably the most interesting question to ask when picking an overlay network solution is what’s going on with the control plane and how the network is going to handle Broadcast, Unknown unicast and Multicast traffic (sometimes referred to as BUM traffic). I am not going to provide easy answers here, simply because there are plenty of solutions out there, each addressing this problem differently. I just want to emphasize that the protocol you are going to use to form the overlay network (e.g., NVGRE, VXLAN, or what have you) is essentially taking care only of the data-plane encapsulation. 
For the control plane you will need to rely either on flooding (basically continue to learn MAC addresses via the “flood and learn” method to ensure that the packet reaches all the other tunnel end-points), or on consulting some sort of database which includes the MAC to IP bindings in the network (e.g., an SDN controller).Connectivity with the outside worldAnother factor to consider is the connectivity with the outside world - or how a VM within an overlay network can communicate with a device that resides outside of the network. No matter how popular overlays become throughout the network, there are still going to be devices inside and outside of the Data Center that speak only native IP or understand just 802.1Q VLANs. In order to communicate with those, the overlay packet will need to reach some kind of gateway that is capable of bridging or routing the traffic correctly. This gateway should handle the encapsulation/decapsulation function and provide the required connectivity. As with the control plane considerations, this part is not really covered in any of the encapsulation standards. Common ways to solve this challenge are virtual gateways, essentially logical routers/switches implemented in software (take a look at Neutron’s l3-agent to see how OpenStack handles this), or dedicated physical gateway devices.Are overlays the only option?I would like to summarize this post by emphasizing that overlays are an exciting technology which probably makes sense in certain environments. As you saw, an overlay-based solution needs to be carefully designed, and as always depends on your business and network requirements. I also would like to emphasize that overlays are not the only option to scale-out networking, and I have seen some cool proposals lately which probably deserve their own post."
  },{
    "title": "What’s Coming in OpenStack Networking for Juno Release",
    "url": "/blog/2014/09/16/whats-coming-in-openstack-networking-for-juno-release/",
    "date": "Sep 16, 2014",
    "tags": ["Neutron","OpenStack"],
    "excerpt": "A blog post I wrote for Red Hat Stack on what’s coming in OpenStack Neutron for the Juno release.",
    "content": "A blog post I wrote for Red Hat Stack on what’s coming in OpenStack Neutron for the Juno release."
  },{
    "title": "IPv6 address assignment – stateless, stateful, DHCP... oh my!",
    "url": "/blog/2014/07/02/ipv6-address-assignment-stateless-stateful-dhcp-oh-my/",
    "date": "Jul 2, 2014",
    "tags": ["IPv6"],
    "excerpt": "People don’t like changes. IPv6 could have help to solve a lot of the burden in networks deployed today, which are still mostly based on the original version of the Internet Protocol, aka version 4. But time has come, and even the old tricks like throwing network address translation (NAT) everywhere are not going to help anymore, simply because we are out of IP addresses. It may take some more time, and people will do everything they can to (continue and) delay it, but believe me – there is no other way around – IPv6 is here to replace IPv4. IPv6 is also a critical part of the promise of the cloud and the Internet of Things (IoT). If you want to connect everything to the network, you better plan for massive scale and have enough addresses to use.",
    "content": "People don’t like changes. IPv6 could have help to solve a lot of the burden in networks deployed today, which are still mostly based on the original version of the Internet Protocol, aka version 4. But time has come, and even the old tricks like throwing network address translation (NAT) everywhere are not going to help anymore, simply because we are out of IP addresses. It may take some more time, and people will do everything they can to (continue and) delay it, but believe me – there is no other way around – IPv6 is here to replace IPv4. IPv6 is also a critical part of the promise of the cloud and the Internet of Things (IoT). If you want to connect everything to the network, you better plan for massive scale and have enough addresses to use.One of the trickiest things with IPv6 though is the fact that it’s pretty different from IPv4. While some of the concepts remains the same, there are some fundamental differences between IPv4 and IPv6, and it’s definitely takes some time to get used into some of the IPv6 basics, including the terms being used. Experienced IPv4 engineers will probably need to change their mindset, and as I stated before, people don’t really like changes…In this post, I want to highlight the address assignment options available with IPv6, which is in my view one of the most fundamental things in IP networking, and where things are pretty different comparing to IPv4. I am going to assume you have some basic background on IPv6, and while I will cover the theory part I will also show the command line interface and demonstrate some of the configuration options, focusing on SLAAC and stateless DHCPv6. I am going to use a simple topology with two Cisco routers directly connected to each other using their GigabitEthernet 1/0 interface. Both routers are running IOS 15.2(4).Let the party startedWith IPv6 an interface can have multiple prefixes and IP addresses, and unlike IPv4, all of them are primary. 
All interfaces will have a Link-Local address which is the address used to implement many of the control plane functions. If you don’t manually set the Link-Local address, one will automatically be generated for you. Note that the IPv6 protocol stack will not become operational on an interface until a Link-Local address has been assigned or generated and has passed Duplicate Address Detection (DAD) verification. In Cisco IOS, we will first need to enable IPv6 on the router which is done globally using the ipv6 unicast-routing command. We will then enable IPv6 on the interface using the ipv6 enable command:ipv6 unicast-routing ! interface GigabitEthernet1/0 ipv6 enable !Now IPv6 is enabled on the interface, and we should get a Link-Local address assigned automatically:show ipv6 interface g1/0 | include link IPv6 is enabled, link-local address is FE80::C800:51FF:FE2F:1CIPv6 address assignment optionsA little bit of theory as promised. When it comes to IPv6 address assignment there are several options you can use:      Static (manual) address assignment - exactly like with IPv4, you can go on and apply the address yourself. I believe this is straightforward and therefore I am not going to demonstrate that.        Stateless Address Auto Configuration (SLAAC) - nodes listen for ICMPv6 Router Advertisement (RA) messages periodically sent out by routers on the local link, or requested by the node using a Router Solicitation (RS) message. They can then create a Global unicast IPv6 address by combining their interface EUI-64 identifier (based on the MAC address on Ethernet interfaces) with the Link Prefix obtained via the Router Advertisement. This feature is unique to IPv6 and provides simple “plug &amp; play” networking. By default, SLAAC does not provide anything to the client outside of an IPv6 address and a default gateway. SLAAC is discussed in detail in RFC 4862.        
Stateless DHCPv6 – with this option SLAAC is still used to get the IP address, but DHCP is used to obtain “other” configuration options, usually things like DNS, NTP, etc. The advantage here is that the DHCP server is not required to store any dynamic state information about any individual clients. In large networks with a huge number of end points attached, implementing stateless DHCPv6 will greatly reduce the number of DHCPv6 messages that are needed to refresh address state.        Stateful DHCPv6 - functions exactly the same as IPv4 DHCP in which hosts receive both their IPv6 address and additional parameters from the DHCP server. Like DHCP for IPv4, the components of a DHCPv6 infrastructure consist of DHCPv6 clients that request configuration, DHCPv6 servers that provide configuration, and DHCPv6 relay agents that convey messages between clients and servers when clients are on subnets that do not have a DHCPv6 server. You can learn more about DHCP for IPv6 in RFC 3315.  Note: The only way to get a default gateway in IPv6 is via an RA message. DHCPv6 does not carry default route information at this time.Putting it all togetherAn IPv6 host performs stateless address autoconfiguration (SLAAC) by default and uses a configuration protocol such as DHCPv6 based on the following flags in the Router Advertisement message sent by a neighboring router:      Managed Address Configuration Flag, the ‘M’ flag. When set to 1, this flag instructs the host to use a configuration protocol to obtain stateful IPv6 addresses        Other Stateful Configuration Flag, the ‘O’ flag. When set to 1, this flag instructs the host to use a configuration protocol to obtain other configuration settings, e.g., DNS, NTP, etc.  Combining the values of the M and O flags can yield the following:      Both M and O Flags are set to 0. This combination corresponds to a network without a DHCPv6 infrastructure. 
Hosts use Router Advertisements for non-link-local addresses and other methods (such as manual configuration) to configure other parameters.        Both M and O Flags are set to 1. DHCPv6 is used for both addresses and other configuration settings, aka stateful DHCPv6.        The M Flag is set to 0 and the O Flag is set to 1. DHCPv6 is not used to assign addresses, only to assign other configuration settings. Neighboring routers are configured to advertise non-link-local address prefixes from which IPv6 hosts derive stateless addresses. This combination is known as stateless DHCPv6.  Examining the configurationSLAACClient configuration:interface GigabitEthernet1/0 ipv6 address autoconfig ipv6 enableServer configuration:interface GigabitEthernet1/0 ipv6 address 2001:1111:1111::1/64 ipv6 enableWe can see the server sending the RA message with the prefix that was configured:ICMPv6-ND: Request to send RA for FE80::C801:51FF:FE2F:1C ICMPv6-ND: Setup RA from FE80::C801:51FF:FE2F:1C to FF02::1 on GigabitEthernet1/0 ICMPv6-ND: MTU = 1500 ICMPv6-ND: prefix = 2001:1111:1111::/64 onlink autoconfig ICMPv6-ND: 2592000/604800 (valid/preferred)And the client receiving the message and calculating an address using EUI-64:ICMPv6-ND: Received RA from FE80::C801:51FF:FE2F:1C on GigabitEthernet1/0 ICMPv6-ND: Prefix : 2001:1111:1111::ICMPv6-ND: Update on-link prefix 2001:1111:1111::/64 on GigabitEthernet1/0 IPV6ADDR: Generating IntfID for 'eui64', prefix 2001:1111:1111::/64 ICMPv6-ND: IPv6 Address Autoconfig 2001:1111:1111:0:C800:51FF:FE2F:1CR1#show ipv6 interface briefGigabitEthernet1/0 [up/up]FE80::C800:51FF:FE2F:1C2001:1111:1111:0:C800:51FF:FE2F:1CStateless DHCPClient configuration:No changes are required on the client side. The client is already configured to use SLAAC through the “autoconfig” option.interface GigabitEthernet1/0 ipv6 address autoconfig ipv6 enableServer configuration:ipv6 dhcp pool STATELESS_DHCP dns-server 2001:1111:1111::10 domain-name test.com ! 
interface GigabitEthernet1/0 ipv6 address 2001:1111:1111::1/64 ipv6 enable ipv6 nd other-config-flag ipv6 dhcp server STATELESS_DHCPWe can see the client keeping the same IP address, but now obtaining DNS settings through DHCP:IPv6 DHCP: Adding server FE80::C801:51FF:FE2F:1C IPv6 DHCP: Processing options IPv6 DHCP: Configuring DNS server 2001:1111:1111::10 IPv6 DHCP: Configuring domain name test.com"
  },{
    "title": "The need for Network Overlays – part I",
    "url": "/blog/2014/07/01/the-need-for-network-overlays-part-i/",
    "date": "Jul 1, 2014",
    "tags": ["Data Center","Network Virtualization","OpenStack","Overlay Networks","VXLAN"],
    "excerpt": "The IT industry has gained significant efficiency and flexibility as a direct result of virtualization. Organizations are moving toward a virtual datacenter model, and flexibility, speed, scale and automation are central to their success. While compute, memory resources and operating systems were successfully virtualized in the last decade, primarily due to the x86 server architecture, networks and network services have not kept pace.",
    "content": "The IT industry has gained significant efficiency and flexibility as a direct result of virtualization. Organizations are moving toward a virtual datacenter model, and flexibility, speed, scale and automation are central to their success. While compute, memory resources and operating systems were successfully virtualized in the last decade, primarily due to the x86 server architecture, networks and network services have not kept pace.The traditional solution: VLANsWay before the era of server virtualization, Virtual LANs (or 802.1q VLANs) were used to partition different logical networks (or broadcast domains) over the same physical fabric. Instead of wiring a separate physical infrastructure for each group, VLANs were used efficiently to isolate the traffic from different groups or applications based on the business needs, with a unique identifier allocated to each logical network. For years, a physical server represented one end-point from the network perspective and was attached to an “access” (i.e., untagged) port in the network switch. The access switch was responsible to enforce the VLAN ID as well as other security and network settings (e.g., quality of service). The VLAN ID is a 12-bit field allowing a theoretical limit of 4096 unique logical networks. In practice though, most switch vendors support much lower number to be configured. You should remember that for each active VLAN in a switch, a VLAN database need to be maintained for proper mapping of the physical interfaces and the MAC addresses associated with the VLAN. Furthermore, some vendors would also create a different spanning-tree (STP) instance for each active VLAN on the switch which require additional memory cycles.VLANs are a perfect solution for small-scale environments, where the number of end-points (and MAC addresses respectively) is small and controlled. 
With virtualization though, one server, now called a hypervisor, can host many virtual machines and many network end-points. As I stated before, the networks have not kept pace, and the easiest (and also rational) thing to do was to reuse the good-old VLANs. We were essentially adding an additional layer of software access switch in the hypervisor to link the different virtual machines on the host, and those server “access” ports in the physical switch that traditionally were untagged are now expecting tagged traffic with different VLAN IDs differentiating between the virtual machine networks. The main issue here is the fact that the virtual machines’ MAC addresses must be visible end-to-end throughout the network core. Reminder: VLANs must be properly configured on each switch along the path, as well as on the appropriate interfaces to get end-to-end MAC learning and connectivity.In a virtualized world, where the number of end-points is constantly increasing and can be very high, VLANs are a limited solution that does not follow one of the main principles behind virtualization: use of a software application to divide one physical resource into multiple isolated virtual environments. Yes, VLANs do offer segmentation of different logical networks (or broadcast domains) over the same physical fabric, but you still need to manually provision the network and make sure the VLANs are properly configured across the network devices. This starts to become a management and configuration nightmare and simply does not scale.Where network vendors started to be (really) creativeAt this point, when there was no doubt that VLANs and traditional L2 based networks are not suitable for large virtualized environments, plenty of network solutions were proposed. I don’t really want to go into detail on any of those, but you can look for 802.1Qbg, VM-FEX, FabricPath, TRILL, 802.1ad (QinQ), and 802.1ah (PBB) to name a few. 
In my view, these are overcomplicating the network while ignoring the main problem – an L2-based solution is a bad thing to begin with, and we should have looked for something completely different (hint: L3 routing is your friend).Overlays to the rescueL3 routing is a scalable and well-known solution (it runs the Internet, doesn’t it?). With proper planning, routing domains can handle a massive number of routes/networks, keeping the broadcast (and failure) domains small. Furthermore, most modern routing operating systems can utilize equal-cost multi-path routing (ECMP), effectively load-sharing the traffic across all available routed links. In contrast, by default spanning-tree protocol (STP) blocks redundant L2 switched links to avoid switching loops, simply because there is no way to handle loops within a switched environment (there is no “time-to-live” field within an Ethernet frame).Routing sounds a lot better, but note that L2 adjacency is required by most applications running inside the virtual machines. L2 connectivity between the virtual machines is also required for virtual machine mobility (e.g., Live Migration in VMware terminology). This is where overlay networks enter the picture; an overlay network is a computer network which is built on top of another network. Using an overlay, we can build an L2 switched network on top of an L3 routed network. Don’t get me wrong – overlays are not a new networking concept and are already used extensively to solve many network challenges (see GRE tunneling and MPLS L2/L3 VPNs for some examples and use cases).In the next post I will bring the second part of this article, diving into the theory behind network overlays and the way they aim to solve the network virtualization case."
  }]