kamal

Author	SHA1	Message	Date
Donal McBreen	822590dcf6	Replace Traefik with parachute [mproxy](https://github.com/basecamp/parachute) is a custom minimal proxy designed specifically for Kamal. It has two big advantages over Traefik: 1. Imperative deployments - we tell it to switch from container A to container B, and it waits for container B to start then switches. No need to poll for health checks ourselves or mess around with forcing health checks to fail. 2. Support for multiple apps - as much as possible, configuration is supplied at runtime by the deploy command, allowing us to have multiple apps share an instance of mproxy without conflicting config.	2024-05-27 14:00:24 +01:00
Donal McBreen	64f5955444	Don't hold lock on error	2024-05-21 12:02:12 +01:00
Donal McBreen	706b82baa1	Simplify messages and remove multiple execute error	2024-05-21 10:40:01 +01:00
Donal McBreen	78c0a0ba4b	Don't start other roles until we have a healthy container If a primary role container is unhealthy, we might take a while to time out the health check poller. In the meantime if we have started the other roles, they'll be running two containers. This could be a problem, especially if they run jobs as that doubles the worker capacity which could cause excessive load. We'll wait for the first primary role container to boot successfully before starting the containers from other roles.	2024-05-21 08:37:36 +01:00
Donal McBreen	ee758d951a	Only use barrier when needed, more descriptive info	2024-05-20 12:18:30 +01:00
Donal McBreen	773ba3a5ab	Show container logs and healthcheck status on failure	2024-05-20 12:18:30 +01:00
Donal McBreen	0efb5ccfff	Remove the healthcheck step To speed up deployments, we'll remove the healthcheck step. This adds some risk to deployments for non-web roles - if they don't have a Docker healthcheck configured then the only check we do is if the container is running. If there is a bad image we might see the container running before it exits and deploy it. Previously the healthcheck step would have avoided this by ensuring a web container could boot and serve traffic first. To mitigate this, we'll add a deployment barrier. Until one of the primary role containers passes its healthcheck, we'll keep the barrier up and avoid stopping the containers on the non-primary roles. If the primary role container fails its healthcheck, we'll close the barrier and shut down the new containers on the waiting roles. We also have a new integration test to check we correctly handle a broken image. This highlighted that SSHKit's default runner will stop at the first error it encounters. We'll now have a custom runner that waits for all threads to finish allowing them to clean up.	2024-05-20 12:18:30 +01:00
Donal McBreen	12cad5458a	Merge pull request #762 from kryachkov/main Trim long hostnames	2024-05-10 16:05:27 +01:00
Donal McBreen	6d062ce271	Host specific env with tags Allow hosts to be tagged so we can have host specific env variables. We might want host specific env variables for things like datacenter specific tags or testing GC settings on a specific host. Right now you either need to set up a separate role, or have the app be host aware. Now you can define tag env variables and assign those to hosts. For example: ``` servers: - 1.1.1.1 - 1.1.1.2: tag1 - 1.1.1.2: tag2 - 1.1.1.3: [ tag1, tag2 ] env_tags: tag1: ENV1: value1 tag2: ENV2: value2 ``` The tag env supports the full env format, allowing you to set secret and clear values.	2024-05-09 16:02:45 +01:00
André Falk	63c47eca4c	Trim long hostnames Hostnames longer than 64 characters are not supported by docker	2024-05-07 19:06:39 +02:00
Donal McBreen	1f5b936fa2	Escape single quotes to fix log following Fixes: https://github.com/basecamp/kamal/issues/777	2024-04-26 14:16:19 +01:00
Donal McBreen	05ac808f2a	Use image tag to determine stale containers Use current_running_version to determine the latest version when finding stale containers.	2024-03-29 10:23:50 +00:00
Donal McBreen	bade195e93	Redefine what the "latest" container means Currently the latest container is the one that was created last. But if we have had a failed deployment that left two containers running that would not be the one we want. The second container could be in a restart loop for example. Instead we want the container that is running the image tagged as latest. As we now tag as latest after a successful deployment we can trust that that is a healthy container. In the case that there is no container running the latest image tag, we'll fall back to the latest container. This could happen if the deploy was halted in between the old container being stopped and the image being tagged as latest.	2024-03-29 08:51:50 +00:00
Igor Alexandrov	cee449c269	Put locks in a locks directory. Ensure that locks directory exists on a primary host.	2024-03-27 12:04:39 +04:00
Donal McBreen	3ecfb3744f	Add Rubocop - Pull in the 37signals house style - Autofix violations - Add to CI	2024-03-20 10:23:02 +00:00
Leon	2d86d4f7cc	Add SSH port to `run_over_ssh`	2023-11-03 22:32:37 +01:00
Donal McBreen	645f5ab72d	App exec with env file When calling `kamal app exec` for new non interactive containers, run the command per role on each server and include the role config including the environment. Fixes: https://github.com/basecamp/kamal/issues/492	2023-09-25 15:07:05 +01:00
Donal McBreen	0861730e0e	Run interactive commands with the correct host Fixes https://github.com/basecamp/kamal/issues/430	2023-09-18 12:00:36 +01:00
Donal McBreen	3c12d1799c	Copy all files into asset volume Adding -T to the copy command ensures that the files are copied at the same level into the target directory whether it exists or not. That allows us to drop the `/*` which was not picking up hidden files. Fixes: https://github.com/basecamp/kamal/issues/465	2023-09-15 08:07:48 +01:00
Donal McBreen	fb0aeec27e	Escape the newline in the inspect query	2023-09-12 19:10:39 +01:00
Donal McBreen	0b439362da	Asset paths During deployments both the old and new containers will be active for a small period of time. There also may be lagging requests for older CSS and JS after the deployment. This can lead to 404s if a request for old assets hits a new container or vice versa. This PR makes sure that both sets of assets are available throughout the deployment from before the new version of the app is booted. This can be configured by setting the asset path: ```yaml asset_path: "/rails/public/assets" ``` The process is: 1. We extract the assets out of the container, with docker run, docker cp, docker stop. Docker run sets the container command to "sleep" so this needs to be available in the container. 2. We create an asset volume directory on the host for the new version of the app and copy the assets in there. 3. If there is a previous deployment we also copy the new assets into its asset volume and copy the older assets into the new asset volume. 4. We start the new container mapping the asset volume over the top of the container's asset path. This means that both the old and new versions have replaced the asset path with a volume containing both sets of assets and should be able to serve any request during the deployment. The older assets will continue to be available until the next deployment.	2023-09-11 12:18:18 +01:00
Donal McBreen	cd02510d0f	Output one mount per line The go template was concatenating all the mounts into one line. It happened to work because the mount we are interested in was always first. Fix it to output one mount per line instead.	2023-09-07 15:20:50 +01:00
Donal McBreen	8a41d15b69	Zero downtime deployment with cord file When replacing a container currently we: 1. Boot the new container 2. Wait for it to become healthy 3. Stop the old container Traefik will send requests to the old container until it notices that it is unhealthy. But it may have stopped serving requests before that point which can result in errors. To get round that the new boot process is: 1. Create a directory with a single file on the host 2. Boot the new container, mounting the cord file into /tmp and including a check for the file in the docker healthcheck 3. Wait for it to become healthy 4. Delete the healthcheck file ("cut the cord") for the old container 5. Wait for it to become unhealthy and give Traefik a couple of seconds to notice 6. Stop the old container The extra steps ensure that Traefik stops sending requests before the old container is shut down.	2023-09-06 14:35:30 +01:00
David Heinemeier Hansson	c4a203e648	Rename to Kamal	2023-08-22 08:24:31 -07:00
Donal McBreen	4dd8208290	Extract versions that contain dashes The version extraction assumed that the version is everything after the last `-` in the container name. This doesn't work if you deploy a non-MRSK generated version that contains a `-`. To fix we'll generate the non version prefix and strip it off. In some places for this to work we need to make sure to pass the role through. Fixes: https://github.com/mrsked/mrsk/issues/402	2023-08-08 14:16:32 +01:00
Donal McBreen	28d6a131a9	Prefix container hostname with the underlying one To make it easier to identify where a docker container is running, prefix its hostname with the underlying one from the host. Docker chooses a 12 character random hex string by default, so we'll keep that as the suffix.	2023-05-31 16:22:25 +01:00
Donal McBreen	cedb8d900f	Stop containers with restarting status When stopping the old container we need to also look for ones with a restarting status.	2023-05-25 12:10:26 +01:00
Donal McBreen	19f0f40adf	Add skip_hooks option	2023-05-23 15:56:47 +01:00
Donal McBreen	58c1096a90	MRSK hooks Adds hooks to MRSK. Currently just two hooks, pre-build and post-push. We could break the build and push into two separate commands if we found the need for post-build and/or pre-push hooks. Hooks are stored in `.mrsk/hooks`. Running `mrsk init` will now create that folder and add sample hook scripts. Hooks returning non-zero exit codes will abort the current command. Further potential work here: - We could replace the audit broadcast command with a post-deploy/post-rollback hook or similar - Maybe provide pre-command/post-command hooks that run after every mrsk invocation - Also look for hooks in `~/.mrsk/hooks`	2023-05-23 13:55:04 +01:00
Donal McBreen	ee25f200d7	Call app:boot to rollback The code in Mrsk::Cli::Main#rollback was very similar to Mrsk::Cli::App#boot. Modify Mrsk::Cli::App#boot so it can handle rollbacks by: 1. Only renaming running containers 2. Trying first to start then run the new container	2023-05-16 08:59:07 +01:00
Donal McBreen	a5ef1f254f	Highlight uncommitted changes in version If there are uncommitted changes in the app repository when building, then append `_uncommitted_<random>` to it to distinguish the image from one built from a clean checkout. Also change the version used when renaming a container on redeploy to distinguish and explain the version suffixes.	2023-05-12 11:08:48 +01:00
David Heinemeier Hansson	71bc9bcf54	Merge pull request #222 from basecamp/deploy-groups Allow booting containers in groups for rolling restarts	2023-05-02 13:14:32 +02:00
David Heinemeier Hansson	c83b74dcb7	Simplify domain language to just "boot" and unscoped config keys	2023-05-02 13:11:31 +02:00
Kevin McConnell	f055766918	Allow percentage-based rolling deployments	2023-04-14 12:46:14 +01:00
Kevin McConnell	df202d6ef4	Move health checks into Docker Replaces our current host-based HTTP healthchecks with Docker healthchecks, and adds a new `healthcheck.cmd` config option that can be used to define a custom health check command. Also removes Traefik's healthchecks, since they are no longer necessary. When deploying a container that has a healthcheck defined, we wait for it to report a healthy status before stopping the old container that it replaces. Containers that don't have a healthcheck defined continue to wait for `MRSK.config.readiness_delay`. There are some pros and cons to using Docker healthchecks rather than checking from the host. The main advantages are: - Supports non-HTTP checks, and app-specific check scripts provided by a container. - When booting a container, allows MRSK to wait for a container to be healthy before shutting down the old container it replaces. This should be safer than relying on a timeout. - Containers with healthchecks won't be active in Traefik until they reach a healthy state, which prevents any traffic from being routed to them before they are ready. The main _disadvantage_ is that containers are now required to provide some way to check their health. Our default check assumes that `curl` is available in the container which, while common, won't always be the case.	2023-04-13 16:08:43 +01:00
Kevin McConnell	f530009a6e	Allow performing boot & start operations in groups Adds top-level configuration options for `group_limit` and `group_wait`. When a `group_limit` is present, we'll perform app boot & start operations on no more than `group_limit` hosts at a time, optionally sleeping for `group_wait` seconds after each batch. We currently only do this batching on boot & start operations (including when they are part of a deployment). Other commands, like `app stop` or `app details` still work on all hosts in parallel.	2023-04-13 15:58:27 +01:00
Jacopo	9ddb181f50	Merge branch 'main' into cleanup-excessive-containers-running * main: Pull the primary host from the role Minimise holding the deploy lock	2023-04-12 15:19:19 +02:00
Donal McBreen	051556674f	Minimise holding the deploy lock If we get an error we'll only hold the deploy lock if it occurs while trying to switch the running containers. We'll also move tagging the latest image from when the image is pulled to just before the container switch. This ensures that earlier errors don't leave the hosts with an updated latest tag while still running the older version.	2023-04-12 12:09:56 +01:00
Jacopo	03d933d10b	Add Role to the message	2023-04-11 10:59:25 +02:00
Jacopo	579b4cd9aa	Simplify By using an ad-hoc command to detect and stop stale containers. By default stale containers are only detected.	2023-04-11 10:22:03 +02:00
Jacopo	8ae5331d97	Boot stop all the old containers	2023-04-11 08:53:33 +02:00
Jacopo	4d47fbdf41	Merge stop and stop_stale_containers	2023-04-11 08:53:33 +02:00
Jacopo	e980f1164e	Avoid using GNU-only Perl Regexp Grep	2023-04-11 08:53:33 +02:00
Jacopo	e2f6db5cae	Clear stale containers By stopping all the older containers with matching /#{service}-#{role}-#{dest}-.*/ running on the same host.	2023-04-11 08:53:33 +02:00
David Heinemeier Hansson	1f83b5f6be	Fix failure to pass on class options to subcommands	2023-03-28 18:04:16 +02:00
milk1000cc	03614bfb79	Rolify app logs cli/command	2023-03-27 23:08:46 +09:00
Donal McBreen	05488e4c1e	Zero downtime redeploys When deploying check if there is already a container with the existing name. If there is, rename it to "<version>_<random_hex_string>" to remove the name clash with the new container we want to boot. We can then do the normal zero downtime run/wait/stop. While implementing this I discovered that --filter name=foo does a substring match for foo, so I've updated those filters to do an exact match instead.	2023-03-24 17:09:20 +00:00
David Heinemeier Hansson	494e29d672	Fix tests	2023-03-24 14:35:17 +01:00
David Heinemeier Hansson	93423f2f20	Merge branch 'main' into pr/99 * main: Wording Remove accessory images using tags rather than labels Update readme to point to ghcr.io/mrsked/mrsk Validate that all roles have hosts Commander needn't accumulate configuration Pull latest image tag, so we can identify it Default to deploying the config version Remove unneeded Dockerfile.dind, update Readme add D-in-D dockerfile, update Readme	2023-03-24 14:26:31 +01:00
Donal McBreen	1ed4a37da2	Pull latest image tag, so we can identify it `docker image ls` doesn't tell us what the latest deployed image is (e.g. if we've rolled back). Pull the latest image tag through to the server so we can use it instead.	2023-03-23 14:39:32 +00:00
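
As an illustration of the fix described in commit 4dd8208290 ("Extract versions that contain dashes"), the sketch below contrasts taking everything after the last `-` with stripping a generated prefix. It is not the project's code: the function names are hypothetical, and the `service-role-destination-version` naming is assumed from the pattern quoted in commit e2f6db5cae.

```python
from typing import Optional


def version_from_last_dash(container_name: str) -> str:
    """Old assumption: the version is whatever follows the last '-'."""
    return container_name.rsplit("-", 1)[-1]


def version_from_prefix(
    container_name: str,
    service: str,
    role: str,
    destination: Optional[str] = None,
) -> str:
    """Build the non-version prefix and strip it off the container name.

    Assumes names look like '<service>-<role>-<destination>-<version>',
    with the destination part omitted when none is configured.
    """
    parts = [p for p in (service, role, destination) if p]
    prefix = "-".join(parts) + "-"
    if not container_name.startswith(prefix):
        raise ValueError(f"{container_name!r} does not start with {prefix!r}")
    return container_name[len(prefix):]


if __name__ == "__main__":
    name = "app-web-production-release-2024-05"
    print(version_from_last_dash(name))  # loses part of the version
    print(version_from_prefix(name, "app", "web", "production"))
```

The prefix approach needs the role to be known at the call site, which matches the commit's note about passing the role through.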


70 Commits