Commit Graph

70 Commits

Author SHA1 Message Date
Donal McBreen
822590dcf6 Replace Traefik with parachute
[mproxy](https://github.com/basecamp/parachute) is a custom minimal
proxy designed specifically for Kamal.

It has two big advantages over Traefik:
1. Imperative deployments - we tell it to switch from container A to
   container B, and it waits for container B to start then switches. No
   need to poll for health checks ourselves or mess around with forcing
   health checks to fail.
2. Support for multiple apps - as much as possible, configuration is
   supplied at runtime by the deploy command, allowing us to have
   multiple apps share an instance of mproxy without conflicting config.
2024-05-27 14:00:24 +01:00
Donal McBreen
64f5955444 Don't hold lock on error 2024-05-21 12:02:12 +01:00
Donal McBreen
706b82baa1 Simplify messages and remove multiple execute error 2024-05-21 10:40:01 +01:00
Donal McBreen
78c0a0ba4b Don't start other roles we have a healthy container
If a primary role container is unhealthy, we might take a while to
timeout the health check poller. In the meantime if we have started the
other roles, they'll be running tow containers.

This could be a problem, especially if they read run jobs as that
doubles the worker capacity which could cause exessive load.

We'll wait for the first primary role container to boot successfully
before starting the other containers from other roles.
2024-05-21 08:37:36 +01:00
Donal McBreen
ee758d951a Only use barrier when needed, more descriptive info 2024-05-20 12:18:30 +01:00
Donal McBreen
773ba3a5ab Show container logs and healthcheck status on failure 2024-05-20 12:18:30 +01:00
Donal McBreen
0efb5ccfff Remove the healthcheck step
To speed up deployments, we'll remove the healthcheck step.

This adds some risk to deployments for non-web roles - if they don't
have a Docker healthcheck configured then the only check we do is if
the container is running.

If there is a bad image we might see the container running before it
exits and deploy it. Previously the healthcheck step would have avoided
this by ensuring a web container could boot and serve traffic first.

To mitigate this, we'll add a deployment barrier. Until one of the
primary role containers passes its healthcheck, we'll keep the barrier
up and avoid stopping the containers on the non-primary roles.

It the primary role container fails its healthcheck, we'll close the
barrier and shut down the new containers on the waiting roles.

We also have a new integration test to check we correctly handle a
a broken image. This highlighted that SSHKit's default runner will
stop at the first error it encounters. We'll now have a custom runner
that waits for all threads to finish allowing them to clean up.
2024-05-20 12:18:30 +01:00
Donal McBreen
12cad5458a Merge pull request #762 from kryachkov/main
Trim long hostnames
2024-05-10 16:05:27 +01:00
Donal McBreen
6d062ce271 Host specific env with tags
Allow hosts to be tagged so we can have host specific env variables.

We might want host specific env variables for things like datacenter
specific tags or testing GC settings on a specific host.

Right now you either need to set up a separate role, or have the app
be host aware.

Now you can define tag env variables and assign those to hosts.

For example:
```
servers:
  - 1.1.1.1
  - 1.1.1.2: tag1
  - 1.1.1.2: tag2
  - 1.1.1.3: [ tag1, tag2 ]
env_tags:
  tag1:
    ENV1: value1
  tag2:
    ENV2: value2
```

The tag env supports the full env format, allowing you to set secret and
clear values.
2024-05-09 16:02:45 +01:00
André Falk
63c47eca4c Trim long hostnames
Hostnames longer than 64 characters are not supported by docker
2024-05-07 19:06:39 +02:00
Donal McBreen
1f5b936fa2 Escape single quotes to fix log following
Fixes: https://github.com/basecamp/kamal/issues/777
2024-04-26 14:16:19 +01:00
Donal McBreen
05ac808f2a Use image tag to determine stale containers
Use current_running_version to determine the latest version when finding
stale containers.
2024-03-29 10:23:50 +00:00
Donal McBreen
bade195e93 Redefine what the "latest" container means
Currently the latest container is the one that was created last. But if
we have had a failed deployment that left two containers running that
would not be the one we want. The second container could be in a
restart loop for example.

Instead we want the container that is running the image tagged as
latest. As we now tag as latest after a successful deployment we can
trust that that is a healthy container.

In the case that there is no container running the latest image tag,
we'll fall back to the latest container.

This could happen if the deploy was halted in between the old container
being stopped and the image being tagged as latest.
2024-03-29 08:51:50 +00:00
Igor Alexandrov
cee449c269 Put locks in a locks directory. Ensure that locks directory exits on a primary host. 2024-03-27 12:04:39 +04:00
Donal McBreen
3ecfb3744f Add Rubocop
- Pull in the 37signals house style
- Autofix violations
- Add to CI
2024-03-20 10:23:02 +00:00
Leon
2d86d4f7cc Add SSH port to run_over_ssh 2023-11-03 22:32:37 +01:00
Donal McBreen
645f5ab72d App exec with env file
When calling `kamal app exec` for new non interactive containers, run
the command per role on each server and include the role config
including the environment.

Fixes: https://github.com/basecamp/kamal/issues/492
2023-09-25 15:07:05 +01:00
Donal McBreen
0861730e0e Run interactive commands with the correct host
Fixes https://github.com/basecamp/kamal/issues/430
2023-09-18 12:00:36 +01:00
Donal McBreen
3c12d1799c Copy all files into asset volume
Adding -T to the copy command ensures that the files are copied at the
same level into the target directory whether it exists or not.

That allows us to drop the `/*` which was not picking up hidden files.

Fixes: https://github.com/basecamp/kamal/issues/465
2023-09-15 08:07:48 +01:00
Donal McBreen
fb0aeec27e Escape the newline in the inspect query 2023-09-12 19:10:39 +01:00
Donal McBreen
0b439362da Asset paths
During deployments both the old and new containers will be active for a
small period of time. There also may be lagging requests for older CSS
and JS after the deployment.

This can lead to 404s if a request for old assets hits a new container
or visa-versa.

This PR makes sure that both sets of assets are available throughout the
deployment from before the new version of the app is booted.

This can be configured by setting the asset path:

```yaml
asset_path: "/rails/public/assets"
```

The process is:
1. We extract the assets out of the container, with docker run, docker
cp, docker stop. Docker run sets the container command to "sleep" so
this needs to be available in the container.
2. We create an asset volume directory on the host for the new version
of the app on the host and copy the assets in there.
3. If there is a previous deployment we also copy the new assets into
its asset volume and copy the older assets into the new asset volume.
4. We start the new container mapping the asset volume over the top of
the container's asset path.

This means the both the old and new versions have replaced the asset
path with a volume containing both sets of assets and should be able
to serve any request during the deployment. The older assets will
continue to be available until the next deployment.
2023-09-11 12:18:18 +01:00
Donal McBreen
cd02510d0f Output one mount per line
The go template was concatenating all the mounts into one line. It
happened to work because the mount we are interested was always first.

Fix it to output one mount per line instead.
2023-09-07 15:20:50 +01:00
Donal McBreen
8a41d15b69 Zero downtime deployment with cord file
When replacing a container currently we:
1. Boot the new container
2. Wait for it to become healthy
3. Stop the old container

Traefik will send requests to the old container until it notices that it
is unhealthy. But it may have stopped serving requests before that point
which can result in errors.

To get round that the new boot process is:

1. Create a directory with a single file on the host
2. Boot the new container, mounting the cord file into /tmp and
including a check for the file in the docker healthcheck
3. Wait for it to become healthy
4. Delete the healthcheck file ("cut the cord") for the old container
5. Wait for it to become unhealthy and give Traefik a couple of seconds
to notice
6. Stop the old container

The extra steps ensure that Traefik stops sending requests before the
old container is shutdown.
2023-09-06 14:35:30 +01:00
David Heinemeier Hansson
c4a203e648 Rename to Kamal 2023-08-22 08:24:31 -07:00
Donal McBreen
4dd8208290 Extract versions that contains dashes
The version extraction assumed that the version is everything after the
last `-` in the container name. This doesn't work if you deploy a
non-MRSK generated version that contains a `-`.

To fix we'll generate the non version prefix and strip it off. In some
places for this to work we need to make sure to pass the role through.

Fixes: https://github.com/mrsked/mrsk/issues/402
2023-08-08 14:16:32 +01:00
Donal McBreen
28d6a131a9 Prefix container hostname with the underlying one
To make it easier to identity where a docker container is running,
prefix its hostname with the underlying one from the host.

Docker chooses a 12 character random hex string by default, so we'll
keep that as the suffix.
2023-05-31 16:22:25 +01:00
Donal McBreen
cedb8d900f Stop containers with restarting status
When stopping the old container we need to also look for ones with a
restarting status.
2023-05-25 12:10:26 +01:00
Donal McBreen
19f0f40adf Add skip_hooks option 2023-05-23 15:56:47 +01:00
Donal McBreen
58c1096a90 MRSK hooks
Adds hooks to MRSK. Currently just two hooks, pre-build and post-push.

We could break the build and push into two separate commands if we
found the need for post-build and/or pre-push hooks.

Hooks are stored in `.mrsk/hooks`. Running `mrsk init` will now create
that folder and add sample hook scripts.

Hooks returning non-zero exit codes will abort the current command.

Further potential work here:
- We could replace the audit broadcast command with a
post-deploy/post-rollback hook or similar
- Maybe provide pre-command/post-command hooks that run after every
mrsk invocation
- Also look for hooks in `~/.mrsk/hooks`
2023-05-23 13:55:04 +01:00
Donal McBreen
ee25f200d7 Call app:boot to rollback
The code in Mrsk::Cli::Main#rollback was very similar to
Mrsk::Cli::App#boot.

Modify Mrsk::Cli::App#boot so it can handle rollbacks by:
1. Only renaming running containers
2. Trying first to start then run the new container
2023-05-16 08:59:07 +01:00
Donal McBreen
a5ef1f254f Highlight uncommitted changes in version
If there are uncommitted changes in the app repository when building,
then append `_uncommitted_<random>` to it to distinguish the image
from one built from a clean checkout.

Also change the version used when renaming a container on redeploy to
distinguish and explain the version suffixes.
2023-05-12 11:08:48 +01:00
David Heinemeier Hansson
71bc9bcf54 Merge pull request #222 from basecamp/deploy-groups
Allow booting containers in groups for rolling restarts
2023-05-02 13:14:32 +02:00
David Heinemeier Hansson
c83b74dcb7 Simplify domain language to just "boot" and unscoped config keys 2023-05-02 13:11:31 +02:00
Kevin McConnell
f055766918 Allow percentage-based rolling deployments 2023-04-14 12:46:14 +01:00
Kevin McConnell
df202d6ef4 Move health checks into Docker
Replaces our current host-based HTTP healthchecks with Docker
healthchecks, and adds a new `healthcheck.cmd` config option that can be
used to define a custom health check command. Also removes Traefik's
healthchecks, since they are no longer necessary.

When deploying a container that has a healthcheck defined, we wait for
it to report a healthy status before stopping the old container that it
replaces. Containers that don't have a healthcheck defined continue to
wait for `MRSK.config.readiness_delay`.

There are some pros and cons to using Docker healthchecks rather than
checking from the host. The main advantages are:

- Supports non-HTTP checks, and app-specific check scripts provided by a
  container.
- When booting a container, allows MRSK to wait for a container to be
  healthy before shutting down the old container it replaces. This
  should be safer than relying on a timeout.
- Containers with healthchecks won't be active in Traefik until they
  reach a healthy state, which prevents any traffic from being routed to
  them before they are ready.

The main _disadvantage_ is that containers are now required to provide
some way to check their health. Our default check assumes that `curl` is
available in the container which, while common, won't always be the
case.
2023-04-13 16:08:43 +01:00
Kevin McConnell
f530009a6e Allow performing boot & start operations in groups
Adds top-level configuration options for `group_limit` and `group_wait`.
When a `group_limit` is present, we'll perform app boot & start
operations on no more than `group_limit` hosts at a time, optionally
sleeping for `group_wait` seconds after each batch.

We currently only do this batching on boot & start operations (including
when they are part of a deployment). Other commands, like `app stop` or
`app details` still work on all hosts in parallel.
2023-04-13 15:58:27 +01:00
Jacopo
9ddb181f50 Merge branch 'main' into cleanup-excessive-containers-running
* main:
  Pull the primary host from the role
  Minimise holding the deploy lock
2023-04-12 15:19:19 +02:00
Donal McBreen
051556674f Minimise holding the deploy lock
If we get an error we'll only hold the deploy lock if it occurs while
trying to switch the running containers.

We'll also move tagging the latest image from when the image is pulled
to just before the container switch. This ensures that earlier errors
don't leave the hosts with an updated latest tag while still running the
older version.
2023-04-12 12:09:56 +01:00
Jacopo
03d933d10b Add Role to the message 2023-04-11 10:59:25 +02:00
Jacopo
579b4cd9aa Simplify
By using and ad-hoc command to detect and stop stale containers.
By default stale containers are only detected.
2023-04-11 10:22:03 +02:00
Jacopo
8ae5331d97 Boot stop all the old containers 2023-04-11 08:53:33 +02:00
Jacopo
4d47fbdf41 Merge stop and stop_stale_containers 2023-04-11 08:53:33 +02:00
Jacopo
e980f1164e Avoid using GNU-only Perl Regepx Grep 2023-04-11 08:53:33 +02:00
Jacopo
e2f6db5cae Clear stale containers
By stopping all the older containers with matching /#{service}-#{role}-#{dest}-.*/ running on the same host.
2023-04-11 08:53:33 +02:00
David Heinemeier Hansson
1f83b5f6be Fix failure to pass on class options to subcommands 2023-03-28 18:04:16 +02:00
milk1000cc
03614bfb79 Rolify app logs cli/command 2023-03-27 23:08:46 +09:00
Donal McBreen
05488e4c1e Zero downtime redeploys
When deploying check if there is already a container with the existing
name. If there is rename it to "<version>_<random_hex_string>" to remove
the name clash with the new container we want to boot.

We can then do the normal zero downtime run/wait/stop.

While implementing this I discovered the --filter name=foo does a
substring match for foo, so I've updated those filters to do an exact
match instead.
2023-03-24 17:09:20 +00:00
David Heinemeier Hansson
494e29d672 Fix tests 2023-03-24 14:35:17 +01:00
David Heinemeier Hansson
93423f2f20 Merge branch 'main' into pr/99
* main:
  Wording
  Remove accessory images using tags rather than labels
  Update readme to point to ghcr.io/mrsked/mrsk
  Validate that all roles have hosts
  Commander needn't accumulate configuration
  Pull latest image tag, so we can identity it
  Default to deploying the config version
  Remove unneeded Dockerfile.dind, update Readme
  add D-in-D dockerfile, update Readme
2023-03-24 14:26:31 +01:00
Donal McBreen
1ed4a37da2 Pull latest image tag, so we can identity it
`docker image ls` doesn't tell us what the latest deployed image is (e.g
if we've rolled back). Pull the latest image tag through to the server
so we can use it instead.
2023-03-23 14:39:32 +00:00