
CI/CD Best Practices

By Atif Alam

Good CI/CD isn’t just about having a pipeline — it’s about having a pipeline that is fast, reliable, secure, and maintainable. This page covers the practices that separate a basic pipeline from a production-grade one.

Platform teams often expose self-service pipelines or templates so product teams can ship without a ticket for every change. That only works with guardrails: approved base images, mandatory scans, environment promotion rules, and observability hooks. The goal is safe autonomy—speed with defaults that prevent repeated mistakes. See Pipeline fundamentals for stages and secrets; pair with Kubernetes production patterns and service readiness for what “done” means before production.

Order stages so the quickest checks run first. If linting takes 10 seconds and e2e tests take 10 minutes, run linting first:

lint (10s) ──► unit tests (60s) ──► integration tests (3m) ──► e2e tests (10m) ──► deploy

If any step fails, the pipeline stops there, so the slower steps that follow never run.
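As a sketch, this fail-fast ordering can be expressed in GitHub Actions with `needs:`; job names and commands here are illustrative, not a fixed convention:

```yaml
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint          # ~10s, fails fastest
  unit-tests:
    needs: lint                    # runs only if lint passed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test
  e2e-tests:
    needs: unit-tests              # the slowest job runs last
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run test:e2e
```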

Run independent jobs simultaneously:

                    ┌── lint (10s)
build (30s) ────────┼── unit tests (60s)
                    └── security scan (30s)

Total: 30s + 60s = 90s (not 30s + 10s + 60s + 30s = 130s run sequentially)
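A sketch of this fan-out in GitHub Actions: each independent job declares `needs: build` and nothing else, so all three run in parallel once the build finishes (job names and commands are illustrative):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run build
  # These three jobs depend only on build, so they run simultaneously
  lint:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint
  unit-tests:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test
  security-scan:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high   # a simple dependency scan
```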

Fast feedback is the core value of CI. If the pipeline takes 30+ minutes, developers stop waiting for it and context-switch.

| Technique | Impact |
|---|---|
| Cache dependencies | Save 30-60s per run (npm, pip, Go modules) |
| Parallelize tests | Cut test time by N (number of parallel jobs) |
| Use faster runners | Larger VMs = faster builds |
| Skip unnecessary work | Path filtering for monorepos |
| Use incremental builds | Only recompile changed modules |
| Split test suites | Run unit tests in CI, e2e tests on merge to main only |
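For example, dependency caching in GitHub Actions: `actions/setup-node` has a built-in `cache` input that keys the cache on the lockfile, so repeat runs skip the network:

```yaml
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
  with:
    node-version: 20
    cache: npm          # restores ~/.npm between runs, keyed on package-lock.json
- run: npm ci           # hits the cache instead of the registry on repeat runs
```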

A recommended stage ordering:

| Stage | What | When to Run |
|---|---|---|
| Lint / format | Code style, formatting | Every push and PR |
| Build | Compile, install deps, create artifact | Every push and PR |
| Unit tests | Fast, isolated tests | Every push and PR |
| Integration tests | Tests with real dependencies (DB, API) | Every push and PR (or on merge) |
| Security scan | SAST, dependency vulnerabilities, container scan | Every push and PR |
| E2E tests | Full system tests | On merge to main (or nightly) |
| Deploy staging | Deploy to staging, smoke test | On merge to main |
| Deploy production | Manual approval, deploy, monitor | After staging validation |
Never hardcode credentials:

Bad: AWS_SECRET_KEY = "AKIA..." hardcoded in pipeline YAML or source code
Good: Use the CI/CD platform's encrypted secret store
Best: Use OIDC — no stored credentials at all
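The "Good" option, referencing the platform's encrypted store from a hypothetical deploy step, looks like this in GitHub Actions:

```yaml
- name: Deploy
  run: ./deploy.sh                 # hypothetical deploy script
  env:
    # Values come from the repo/org secret store, never from the YAML itself
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```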

OIDC (OpenID Connect) lets the pipeline request a short-lived token from the cloud provider — no access keys to store, rotate, or leak:

| Platform | OIDC Support |
|---|---|
| GitHub Actions | `permissions: id-token: write` + cloud provider trust |
| GitLab CI | `id_tokens` keyword |
| Azure Pipelines | Workload Identity Federation |

See GitHub Actions OIDC and GitLab CI OIDC for setup.
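A minimal GitHub Actions sketch, assuming an AWS IAM role that already trusts the repository's OIDC provider (the role ARN here is hypothetical):

```yaml
permissions:
  id-token: write      # allow this job to request an OIDC token
  contents: read

steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/ci-deploy   # hypothetical role
      aws-region: us-east-1
  # Subsequent steps get short-lived credentials; nothing is stored in secrets
```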

  • GitHub Actions: Set permissions in the workflow to restrict GITHUB_TOKEN scope.
  • GitLab CI: Use protected and masked variables, scoped to environments.
  • Cloud roles: Grant only the permissions the pipeline needs (e.g. push to ECR, deploy to ECS — not full admin).
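In GitHub Actions, restricting GITHUB_TOKEN is a `permissions` block; once any scope is listed, all unlisted scopes default to no access. A sketch for a build-and-push workflow:

```yaml
permissions:
  contents: read       # enough to check out the code
  packages: write      # push images to GitHub Container Registry
  # everything else (issues, deployments, ...) defaults to no access
```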
Pin third-party actions to immutable references:

# Bad: tracks a moving branch, so it can change without notice
- uses: actions/checkout@main
# Good: pin to a version tag
- uses: actions/checkout@v4
# Best: pin to a full commit SHA (immutable)
- uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
| Practice | What It Does |
|---|---|
| Pin action/image versions | Prevent unexpected changes from upstream |
| Dependency scanning | Detect known vulnerabilities in packages |
| Container scanning | Scan Docker images for CVEs |
| SBOM generation | Create a Software Bill of Materials for each build |
| Signed artifacts | Sign container images (cosign, Notary) to prove provenance |
| Dependabot / Renovate | Auto-update dependencies with PRs |
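For the last row, a minimal `.github/dependabot.yml` that opens weekly update PRs for both npm packages and the pinned actions themselves:

```yaml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "github-actions"   # keeps pinned action versions fresh
    directory: "/"
    schedule:
      interval: "weekly"
```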
                 ┌───────────┐
                /  E2E Tests  \        Slow, expensive, fragile
               /  (few: ~10)   \       Run on merge to main
              /─────────────────\
             / Integration Tests \     Medium speed
            /      (~50-100)      \    Run on every PR
           /───────────────────────\
          /       Unit Tests        \  Fast, cheap, reliable
         /       (~500-1000+)        \ Run on every push
        /─────────────────────────────\
| Level | What It Tests | Speed | Run When |
|---|---|---|---|
| Unit | Individual functions/classes in isolation | Milliseconds | Every push |
| Integration | Components working together (DB, API) | Seconds | Every PR |
| E2E | Full user flows through the UI or API | Minutes | Merge to main, nightly |

For large test suites, split tests across parallel runners:

# GitHub Actions — run tests in parallel shards
strategy:
  matrix:
    shard: [1, 2, 3, 4]
steps:
  - run: npm test -- --shard=${{ matrix.shard }}/4

Flaky tests (tests that sometimes pass, sometimes fail) erode confidence in the pipeline:

| Strategy | What It Does |
|---|---|
| Quarantine | Move flaky tests to a separate job (non-blocking) |
| Retry | Retry failed tests once (but track flake rate) |
| Track metrics | Dashboard of flaky tests — fix or delete them |
| No new flakes | Require new tests to pass 10 consecutive runs before merging |
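A simple retry sketch for GitHub Actions, with a workflow warning so the flake stays visible rather than being silently absorbed (the test command is illustrative):

```yaml
- name: Unit tests (one retry)
  run: |
    npm test || {
      echo "::warning::test suite failed once, retrying (possible flake)"
      npm test
    }
```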
Trunk-based development:

main ───●───●───●───●───●───●───●───●──► (always deployable)
         \ /         \ /
        feat-A      feat-B
        (short-lived, 1-2 days)
  • Everyone commits to main (or very short-lived feature branches).
  • Feature flags hide incomplete work.
  • CI runs on every push; CD deploys main continuously.

Best for: Teams with good test coverage and feature flags. Fastest feedback loop.

GitHub Flow:

main ───●───────●───────●───────●──► (protected, always deployable)
         \     /         \     /
          feat-A          feat-B
       (PR + review)   (PR + review)
  • Create a feature branch from main.
  • Open a PR, get review, merge.
  • main is always deployable.

Best for: Most teams. Simple, well-understood, works with GitHub PRs.

GitLab Flow:

main ──────●──────●──────●──────●───► (development)
            \             \
             ▼             ▼
staging ─────●─────────────●────────► (staging environment)
              \             \
               ▼             ▼
production ────●─────────────●──────► (production environment)
  • main for development, staging and production branches for deployment.
  • Merge from main → staging → production.

Best for: Teams that need environment branches and explicit promotion.

| Strategy | Branches | Merge Frequency | Complexity | Best For |
|---|---|---|---|---|
| Trunk-based | main only (+ short feature) | Multiple times/day | Low | High-performing teams |
| GitHub Flow | main + feature branches | Daily to weekly | Low | Most teams |
| GitLab Flow | main + env branches | Weekly | Medium | Teams needing env promotion |
| Git Flow | main + develop + feature + release + hotfix | Weekly to monthly | High | Versioned software (avoid if possible) |

For repositories containing multiple services/packages:

Only run pipelines for the service that changed:

# GitHub Actions
on:
  push:
    paths:
      - 'services/api/**'
      - 'shared/**'   # Also rebuild if shared code changes

# GitLab CI
api-tests:
  rules:
    - changes:
        - services/api/**
        - shared/**

Tools like Nx (JavaScript), Turborepo, Bazel, or Pants understand the dependency graph and only build/test what was affected:

# Nx: only test projects affected by changes since main
npx nx affected --target=test --base=origin/main
| Practice | Why |
|---|---|
| Path filters | Don’t rebuild everything on every change |
| Shared base image | Pre-built Docker image with common deps |
| Dependency graph tool | Only build/test affected packages |
| Separate deploy jobs per service | Don’t deploy the API when only the frontend changed |
| Cache aggressively | Share caches across services where possible |
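A sketch of per-service deploy workflows: each service gets its own workflow guarded by its own path filter, so a frontend-only change never deploys the API (paths and the deploy script are hypothetical):

```yaml
# .github/workflows/deploy-frontend.yml
on:
  push:
    branches: [main]
    paths:
      - 'services/frontend/**'
jobs:
  deploy-frontend:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh frontend   # hypothetical deploy script
```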

Treat pipeline definitions like application code:

| Practice | What It Means |
|---|---|
| Version controlled | Pipeline YAML lives in the same repo as the code |
| Code reviewed | Pipeline changes go through PR review |
| Tested | Use act (GitHub Actions) or gitlab-ci-lint to validate locally |
| DRY | Reusable workflows (GitHub) / includes+extends (GitLab) / templates (Azure) |
| Documented | Comments explaining non-obvious steps |
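For the DRY row, reusable workflows in GitHub Actions use the `workflow_call` trigger; a sketch (file name and input are illustrative):

```yaml
# .github/workflows/reusable-test.yml
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: '20'
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm ci && npm test

# A caller workflow then needs only:
# jobs:
#   test:
#     uses: ./.github/workflows/reusable-test.yml
#     with:
#       node-version: '22'
```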
# GitHub Actions — Slack notification on failure
- name: Notify Slack on failure
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    channel-id: 'C0123456789'
    slack-message: "Pipeline failed: ${{ github.repository }}@${{ github.sha }}"
  env:
    SLACK_BOT_TOKEN: ${{ secrets.SLACK_TOKEN }}
| Channel | When | What |
|---|---|---|
| Slack / Teams | Failure | Pipeline failed, deployment failed |
| Email | Failure (optional) | Summary of failures |
| GitHub/GitLab comments | PR pipelines | Test results, coverage, plan output |
| Dashboard | Always | Pipeline success rate, duration trends |

The DORA (DevOps Research and Assessment) metrics measure CI/CD effectiveness:

| Metric | What It Measures | Elite Benchmark |
|---|---|---|
| Deployment Frequency | How often you deploy to production | Multiple times per day |
| Lead Time for Changes | Time from commit to production | Less than 1 hour |
| Change Failure Rate | % of deployments that cause a failure | 0–15% |
| Time to Restore Service | Time to recover from a production failure | Less than 1 hour |

Track these metrics to understand and improve your CI/CD process:

Deployment Frequency: 3x/day ✓ Elite
Lead Time for Changes: 45 min ✓ Elite
Change Failure Rate: 8% ✓ Elite
Time to Restore Service: 30 min ✓ Elite

Tools for DORA metrics: Sleuth, LinearB, Faros AI, GitLab Value Stream Analytics, GitHub-based custom dashboards.

| Anti-Pattern | Problem | Fix |
|---|---|---|
| 30+ minute pipelines | Developers don’t wait, context-switch | Parallelize, cache, split test suites |
| Flaky tests | False failures erode trust | Quarantine, fix, or delete flaky tests |
| Manual gates everywhere | Slow deployments, bottleneck on approvers | Automate staging deploy; manual gate only for production |
| No rollback plan | Stuck when a deployment goes bad | Test rollback procedures regularly |
| Secrets in code | Credential leaks | Use platform secret store + OIDC |
| Pipeline YAML copy-paste | Inconsistent, hard to maintain | Reusable workflows / includes / templates |
| No path filtering in monorepo | Every change rebuilds everything | Add path filters and affected-only builds |
| Testing only in CI | Slow feedback for developers | Run fast tests locally too (pre-commit, husky) |
| Ignoring security scans | Vulnerabilities ship to production | Block merge if critical/high vulnerabilities found |
| No deployment observability | Don’t know if deploy succeeded or degraded | Smoke tests + monitoring after deploy |
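For deployment observability, even a minimal post-deploy smoke test catches a dead deployment before users do; the health URL here is hypothetical:

```yaml
- name: Smoke test staging
  run: |
    # Fail the pipeline if the app doesn't come up healthy
    curl --fail --retry 5 --retry-delay 10 https://staging.example.com/healthz
```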

A quick checklist for a healthy CI/CD setup:

  • Pipeline runs on every push and PR.
  • Pipeline completes in under 10 minutes (ideally under 5).
  • Secrets are in the platform’s encrypted store, not in code.
  • OIDC is used for cloud authentication (no long-lived keys).
  • Actions and dependencies are pinned to specific versions.
  • Tests follow the test pyramid (many unit, some integration, few e2e).
  • Flaky tests are tracked and fixed.
  • Path filtering is in place for monorepos.
  • Staging deploys automatically; production has a manual approval gate.
  • Rollback procedure is documented and tested.
  • Pipeline failures send notifications (Slack/Teams).
  • DORA metrics are tracked.
  • Pipeline YAML is reviewed like application code.
  • Fail fast — lint and unit tests before long-running jobs.
  • Parallelize — independent jobs should run simultaneously.
  • Keep it under 10 minutes — fast feedback is the core value of CI.
  • OIDC > stored credentials — short-lived tokens with no secrets to manage.
  • Pin everything — actions, images, dependencies.
  • Test pyramid — many unit tests, some integration, few e2e.
  • Trunk-based or GitHub Flow — short-lived branches, frequent merges.
  • Track DORA metrics — deployment frequency, lead time, change failure rate, time to restore.
  • Treat pipeline YAML as code — version controlled, reviewed, DRY, documented.