
DevOps Engineer

TechDigital Group, Florida, NY, United States


Job Description:
We are seeking a highly skilled and experienced Principal Software Engineer focused on Agentic AI and DevOps. The ideal candidate will architect and deliver agentic microservices and platform capabilities, lead cloud-native DevOps at scale, and partner with organizational leaders to communicate strategy, status, and results. Deep hands‑on expertise with Azure, Kubernetes, CI/CD, infrastructure as code, and LLM/agent frameworks (LangChain/LangSmith/OpenAI/LiteLLM) is essential. Experience with dataflow orchestration (Apache NiFi), enterprise integrations (ServiceNow/Snowflake/Power BI/SharePoint), and production‑grade observability is highly desirable.

What You’ll Do:

Architect, build, and operate agentic AI services and microservices leveraging LangChain, LangSmith, OpenAI/Azure OpenAI, and LiteLLM; implement tool‑use orchestration, evaluation, and guardrails.

Design, build, and maintain CI/CD pipelines using Azure DevOps (ADO) YAML and GitHub Actions; enforce trunk‑based workflows, quality gates, progressive delivery, and automated rollbacks.

Stand up and manage Azure infrastructure (AKS, Service Bus, Event Hubs, Storage Accounts, Key Vault, Bastion); codify environments with Terraform; implement secure networking, secrets, and RBAC.

Containerize and ship services with Docker/Buildah; operate Kubernetes with CNI networking and Linkerd service mesh; implement canary/blue‑green strategies and autoscaling.

Create and operate Apache NiFi dataflows; deploy and manage NiFi clusters on AKS with VM Scale Sets, enabling resilient, scalable ingestion and orchestration.

Implement enterprise‑grade observability and logging: ELK/EFK (Elasticsearch, Fluentd/Fluent‑Bit, Kibana), Prometheus metrics, Azure Dashboards, and KQL‑based alerting.

Engineer data and analytics integrations: Azure Databricks, PostgreSQL, Snowflake; operationalize Power BI, SharePoint, and Jupyter‑based workflows.

Build robust platform and app integrations: ServiceNow APIs, REST APIs, SMTP/IMAP/POP email automations; configure and manage NGINX/HAProxy load balancers.

Lead incident response, root‑cause analysis, and postmortems; continuously improve reliability, performance, security, and cost.

Mentor teams, drive architectural runway, and communicate plans, trade‑offs, and outcomes to stakeholders and leadership.

Key Qualifications / Experience Required:
DevOps Experience

Expert‑level hands‑on DevOps across Azure and Kubernetes: CI/CD, Git workflows, infrastructure as code, automated testing, monitoring, and secure deployment.

Proficiency with Azure DevOps (ADO) YAML pipelines and GitHub Actions; experience optimizing pipelines for cloud‑native systems.

Strong Kubernetes operations including CNI networking and service mesh (Linkerd); container build and supply chain (Docker, Buildah).

Observability at scale using ELK/EFK, Prometheus, Fluentd/Fluent‑Bit, Azure Monitor dashboards and alerting (KQL).

Automation Skills

Deep automation with PowerShell, Bash, and Python to eliminate toil across build, release, environment, and operational workflows.

Infrastructure as Code expertise with Terraform (Azure resources: AKS, Service Bus, Event Hubs, Storage, Key Vault, Bastion).

Proven track record reducing manual intervention, increasing repeatability, and improving MTTR through automation.

Agentic AI Experience

Practical, production experience delivering agentic AI solutions (task orchestration, tool‑use, planning, retrieval, and evaluation).

Hands‑on with LangChain, LangSmith (tracing/eval), OpenAI/Azure OpenAI, and LiteLLM integration; familiarity with prompt engineering, safety/guardrails, and LLM observability (e.g., Arize).

Experience operationalizing AI services within DevOps pipelines and platform governance.

Technical Proficiency

Apache NiFi expertise: authoring and governing dataflows; deploying and scaling NiFi clusters on AKS with VM Scale Sets.

Azure services: AKS, Service Bus, Event Hubs (setup and integration), Storage Accounts (setup and integration), Key Vault, Bastion, Azure Dashboards & Kusto Query Language (KQL).

Data/analytics: Azure Databricks, PostgreSQL, Snowflake; Power BI and SharePoint integrations; Jupyter Notebook workflows.

Networking fundamentals: DHCP/DNS; load balancer configuration and operations (NGINX, HAProxy); Kubernetes ingress best practices.

Messaging and email protocols: SMTP, IMAP/POP.

Microservices and app frameworks: Python and Node.js microservices (REST APIs), Electron build and packaging.

Required Technical Skills

Windows PowerShell; Linux/Unix administration; Bash and Python.

Azure Cloud (architecture, security, cost, RBAC); Azure DevOps (ADO) with YAML; GitHub Actions.

Docker and Buildah; Kubernetes (CNI), Linkerd; ELK/EFK, Prometheus, Fluentd/Fluent‑Bit.

Apache NiFi flow development and clustered operations on Kubernetes with scale sets.

Azure Databricks; PostgreSQL; Snowflake; REST APIs; ServiceNow APIs; Power BI; SharePoint.

Azure Service Bus, Azure Event Hubs, Storage Accounts, Key Vault, Bastion.

Jira; Jupyter Notebook; Azure Dashboards and KQL; SMTP/IMAP/POP.

Python and Node.js microservice architecture; Electron build.

Project Management Skills

Plan, schedule, and coordinate multi‑team deliveries and releases; manage dependencies, risks, and change.

Drive execution across platform, app, data, and AI workstreams with clear milestones and success criteria.

Establish SLOs/SLAs and error budgets; align roadmaps to business priorities.

Communication and Interpersonal Skills

Communicate architectural decisions, roadmaps, and trade‑offs to technical and executive audiences.

Lead cross‑functional ceremonies; produce clear runbooks, architecture docs, and dashboards.

Foster collaboration across engineering, product, security, and operations.

Analytical and Problem‑Solving Abilities

Rapid diagnosis and resolution of complex production issues; strong RCA and remediation planning.

Attention to detail in reliability, security, performance, and cost optimization.

Adaptability and Continuous Learning

Track and adopt evolving best practices in cloud, containers, DevOps, and agentic AI.

Champion continuous improvement in engineering excellence and platform governance.

Experience and Education

Typically requires 10–15+ years in software engineering, DevOps/SRE, or platform engineering with principal‑level impact.

Bachelor's degree in Computer Science, Information Technology, or related field preferred (or equivalent experience).

Secondary Skills and Experience (Desired)
Design and Development

Define and design subsystems and interfaces; allocate responsibilities across services and platforms.

Translate non‑functional requirements (security, reliability, scalability) into concrete designs.

Technical Enablement

Provide technical enablement for components and subsystems; drive critical design decisions and reviews.

Establish patterns and reusable templates for CI/CD, IaC, and agentic service scaffolding.

Continuous Delivery Pipeline

Plan, define, and implement the continuous delivery pipeline with quality gates, progressive delivery, and rollback strategies.

Architectural Runway

Develop the architectural runway to support new features and capabilities; align with Solution and Enterprise Architects and portfolio stakeholders.

Integration

Architect and implement integrations with external components, systems, and platforms (ServiceNow, Snowflake, Power BI, SharePoint, email systems, and enterprise identity/secrets).

Top Skills

Windows PowerShell; Linux/Unix administration; Bash and Python

Azure Cloud (architecture, security, cost, RBAC); Azure DevOps (ADO) with YAML; GitHub Actions

Docker and Buildah; Kubernetes (CNI), Linkerd; ELK/EFK, Prometheus, Fluentd/Fluent‑Bit
