Toshiba Global Commerce Solutions
Director of Production Engineering (Reliability Platform Engineering)
Toshiba Global Commerce Solutions, Durham, North Carolina, United States, 27703
Toshiba Global Commerce Solutions is seeking a Director of Production Engineering to own the engineering systems that make reliability, performance, correctness, and release safety predictable across our global POS, edge, cloud, and middleware platform.
This is a production engineering leadership role focused on distributed system correctness, resilience, and performance engineering. You will partner closely with existing SRE and Operations leaders, but your charter is to engineer prevention by building production readiness standards, automated release gates, and performance/resilience validation mechanisms that stop unsafe changes before they ship.
As AI accelerates development velocity, the bottleneck shifts from writing code to verifying correctness, performance, and safe behavior under failure. This role builds the production engineering foundation that allows teams to move fast without degrading latency, throughput, or availability.
What Success Looks Like
Releases ship faster without increasing Sev-1 / Sev-2 incidents
Incident recurrence drops measurably due to enforced learning and prevention
Edge → store → cloud workflows behave safely under real failure conditions
Reliability is engineered, automated, and enforced, not reactive
Teams clearly understand what “safe to release” means — and pipelines enforce it
Responsibilities
Production Engineering & Release Safety
Own non-functional release criteria and automated release gates for reliability, resilience, performance, and correctness across complex release trains.
Define and enforce Production Readiness Reviews (PRRs) and platform-wide engineering standards.
Establish objective, measurable “safe-to-release” signals consumed by CI/CD and release tooling.
Distributed Systems Correctness (Edge → Cloud Commerce)
Partner with Architects and Principal Engineers to define failure modes, degradation behavior, and system guardrails for distributed and eventually consistent workflows.
Ensure systems behave correctly during retries, partial outages, intermittent connectivity, degraded modes, and recovery.
Lead engineering initiatives that reduce risk of data loss, duplication, corruption, or inconsistent state across POS, middleware, and cloud services.
Incident Learning That Prevents Recurrence
Lead blameless incident reviews using formal analysis methods.
Ensure corrective actions are engineered into systems, validated, tracked, and audited.
Institutionalize learning so failures do not reappear under new conditions or scale.
Resilience & Performance Engineering
Own platform-level strategies for resilience, performance, and scalability validation.
Drive chaos, failover, load, stress, and soak testing focused on real failure modes, not synthetic demos.
Validate store-mode behavior, payment workflows, edge-device dependencies, and multi-service interactions.
Observability & Reliability Signals
Ensure high-fidelity telemetry (logs, metrics, traces, and business signals) that supports release gating, correctness verification, and diagnosis.
Drive instrumentation standards that allow teams to prove reliability outcomes with data.
Cross-Org Technical Leadership
Partner with Software Engineering, Architecture, Quality Engineering, Cloud Operations, and TPM/TPO teams.
Build and lead senior technical managers and staff-level engineers.
Set expectations for technical depth, ownership, and execution quality.
Required Experience
Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
10–15+ years building production engineering capabilities for distributed software platforms, with direct accountability for production outcomes.
Demonstrated experience defining and enforcing production readiness standards and non-functional release gates that prevent unsafe changes from shipping.
Proven ability to lead formal root-cause / reliability analysis and ensure systemic fixes reduce recurrence.
Strong distributed systems fundamentals, including the ability to reason about:
Failure modes and degradation behavior
Dependency risk, retries, and backpressure
Consistency tradeoffs and correctness under failure
Experience partnering deeply with Architecture and Software Engineering to embed reliability guardrails into design reviews, CI/CD pipelines, and system standards.
Senior leadership experience building teams and influencing across large engineering organizations.
Preferred Experience
Retail POS, payments, edge devices, or store environments.
Designing reliability automation (release scoring, regression detection, incident pattern analysis).
Hybrid cloud + edge architectures; Kubernetes/AKS; modern observability platforms.
Leading reliability transformations in large, complex engineering organizations.
Why This Role Matters
Uptime is engineered, not reactive
Development and QA operate at AI-enabled speed
The platform scales safely without sacrificing correctness
TGCS matches or exceeds best-in-class engineering organizations
Benefits
Group health coverage (medical, dental, & vision)
Employee Assistance Programs
Pre-tax spending accounts
401(k) plan (with company match)
Company-provided life insurance
Pet insurance
Employee discounts
Generous paid holiday schedule, paid vacation & sick/personal days
EEO
Toshiba Global Commerce Solutions is an equal opportunity/affirmative action employer that evaluates qualified applicants without regard to age, ancestry, color, religious creed, disability, marital status, medical condition, genetic information, military or veteran status, national origin, race, sex, gender, gender identity, gender expression, sexual orientation, or any other protected factor. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.
Individuals who need a reasonable accommodation because of a disability for any part of the employment process should email benefits@toshibagcs.com to request an accommodation.
Diversity, Equity & Inclusion
We at Toshiba Global Commerce Solutions firmly believe that our people are integral to the success of our customers. Furthermore, we're committed to Diversity, Equity, and Inclusion for all our people, as highlighted by our 5 Core Principles (Create Outreach, Foster Belonging, Unleash Opportunity, Diverse Cultural Engagement, and Culture of Transparency). We're passionate about our customers, the retail industry, and becoming a more responsible company as we help create a brighter future.