Site Reliability / Infrastructure Engineer

General Intuition & Medal, New York, NY, United States

The Company
Medal
Medal is the world’s largest and fastest-growing platform for gaming clips, where millions of gamers capture, share, and relive their best moments. Every year, our players record billions of clips, each representing a unique, action-packed highlight. We’re building the next generation of gaming communities: social, monetized, and creator-powered. Our mission is to design products that make sharing, discovering, and connecting around gaming moments seamless and fun.

The Role
Medal's infrastructure handles billions of clips, video ingestion pipelines, and social features at a massive scale most engineers never get to touch. We're looking for an SRE who cares deeply about reliability and scalability.

The work centers on reliability, incident response, scaling, and making sure our infrastructure keeps up with our growth. You'll own the on-call rotation, drive postmortems, and work directly with engineering teams to meet their infra needs.

The right person probably came through startups and scale-ups. You've been in the room when things broke at 2am, you've scaled databases under pressure, and you know the difference between a durable fix and a patch that buys you a week.

Key Responsibilities

Own reliability across our GCP infrastructure: Kubernetes clusters, managed services, and data pipelines, driving measurable improvements to availability and latency

Lead incident response end-to-end: on-call rotations, runbooks, postmortems, and the follow-through that makes sure the same thing doesn't happen twice

Architect and execute database scaling strategies (sharding, replication, query optimization, and capacity planning) across MySQL and Postgres at meaningful scale

Partner with product engineering to translate feature requirements into infrastructure designs that hold up as we grow

Manage and evolve our Terraform-managed GCP environment and Kubernetes cluster configurations

Build and maintain observability across the stack: metrics, dashboards, alerting, and tracing

Continuously improve the reliability of our CI/CD and delivery pipelines on GitHub Actions

Harden IAM, secrets management, and network segmentation as part of normal infra hygiene

About You

You’ve worked at startups and are comfortable in an environment of rapid growth where scaling up is a priority

You have great judgment: you know the difference between a durable, sustainable fix and a patch that buys you a week

You have deep, hands-on experience scaling and sharding relational databases in production environments

You know GCP maybe a little too well: Kubernetes, VPC, IAM, Cloud Logging, and the managed services ecosystem

You are fluent in Terraform and have owned real infrastructure-as-code at scale

You have strong incident response instincts: you can work a P0 calmly, communicate clearly under pressure, and run a postmortem that prevents recurrence

You’ve worked with GitHub Actions in a production CI/CD environment

You have excellent communication skills (this is crucial!): you can flag issues clearly and quickly during incidents, and you can lead and write actionable postmortems

Our Stack
Google Cloud Platform

Terraform, Salt, GitHub Actions

Java, Redis, RabbitMQ, Elasticsearch, BigQuery, Kubernetes for backend

Electron+React

C# and C++ for native Windows recording & more

Swift for iOS, Kotlin for Android

Benefits

Competitive salary and meaningful equity

Comprehensive medical, dental, and vision coverage

401(k)

Wellness and fitness perks including a Wellhub membership and mental health resources

Paid parental leave, fertility and maternal health benefits

Generous PTO policy

Daily meals and commuter benefits at our NYC HQ in Flatiron

Learning and development stipend

Benefits vary by country and employment type.
