Expanding Ceph clusters with Juju

We just got a set of new SuperMicro servers for one of our Ceph clusters at HUNT Cloud. This made for a great opportunity to write up the simple steps of expanding a Ceph cluster with Juju.

New to Juju? Juju is a cool controller- and agent-based tool from Canonical to easily deploy and manage applications (called Charms) on different clouds and environments (see how it works for more details).

Scaling applications with Juju is easy and Ceph is no exception. You can deploy more Ceph OSD hosts with just a simple juju add-unit ceph-osd command. The challenging part is to add new OSDs without impacting client performance due to large amounts of backfilling.

Below is a brief walkthrough of the steps to scale your Ceph cluster, together with a small example cluster that you can deploy locally on LXD containers to follow along:

  • Set crush initial weight to 0
  • Add new OSDs to the cluster
  • Clear crush initial weight
  • Reweight new OSDs

Deploy a Ceph cluster on LXD for testing

The first step is to get a Juju controller up and running so you can deploy Ceph. If you’re new to Juju and LXD, you can get started with the official docs here. If you have already installed all the requirements, you can simply bootstrap a new controller like so:

$ juju bootstrap localhost

Creating Juju controller "localhost-localhost" on localhost/localhost
Looking for packaged Juju agent version 2.3.7 for amd64
To configure your system to better support LXD containers, please see: https://github.com/lxc/lxd/blob/master/doc/production-setup.md
Launching controller instance(s) on localhost/localhost...
 - juju-a24b9d-0 (arch=amd64)
Installing Juju agent on bootstrap instance
Fetching Juju GUI 2.12.1
Waiting for address
Attempting to connect to 10.181.145.171:22
Connected to 10.181.145.171
Running machine configuration script...
Bootstrap agent now started
Contacting Juju controller at 10.181.145.171 to verify accessibility...
Bootstrap complete, "localhost-localhost" controller now available
Controller machines are in the "controller" model
Initial model "default" added

Next up you need to deploy the Ceph cluster. Here’s a bundle called ceph-lxd which sets up a small cluster for you with:

  • 1 Ceph Monitor host using the ceph-mon charm
  • 3 Ceph OSD hosts with 3 OSDs each using the ceph-osd charm

You can deploy it straight from the Juju charm store:

$ juju deploy cs:~szeestraten/bundle/ceph-lxd

Located bundle "cs:~szeestraten/bundle/ceph-lxd-1"
Resolving charm: cs:ceph-mon-24
Resolving charm: cs:ceph-osd-261
Executing changes:
- upload charm cs:ceph-mon-24 for series xenial
- deploy application ceph-mon on xenial using cs:ceph-mon-24
- set annotations for ceph-mon
- upload charm cs:ceph-osd-261 for series xenial
- deploy application ceph-osd on xenial using cs:ceph-osd-261
- set annotations for ceph-osd
- add relation ceph-osd:mon - ceph-mon:osd
- add unit ceph-mon/0 to new machine 0
- add unit ceph-osd/0 to new machine 1
- add unit ceph-osd/1 to new machine 2
- add unit ceph-osd/2 to new machine 3
Deploy of bundle completed.

The deployment may take a little while, so here’s your perfect chance to refill your coffee. You can follow along with juju status (or watch --color juju status --color in case you get impatient).

If all goes well, you should end up with something that looks like this:

$ juju status

Model    Controller  Cloud/Region  Version  SLA
default  lxd         lxd           2.3.7    unsupported

App       Version  Status  Scale  Charm     Store       Rev  OS      Notes
ceph-mon  12.2.4   active      1  ceph-mon  jujucharms   24  ubuntu
ceph-osd  12.2.4   active      3  ceph-osd  jujucharms  261  ubuntu

Unit         Workload  Agent  Machine  Public address  Ports  Message
ceph-mon/0*  active    idle   0        10.247.146.247         Unit is ready and clustered
ceph-osd/0*  active    idle   1        10.247.146.135         Unit is ready (3 OSD)
ceph-osd/1   active    idle   2        10.247.146.173         Unit is ready (3 OSD)
ceph-osd/2   active    idle   3        10.247.146.143         Unit is ready (3 OSD)

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.247.146.247  juju-07321b-0  xenial      Running
1        started  10.247.146.135  juju-07321b-1  xenial      Running
2        started  10.247.146.173  juju-07321b-2  xenial      Running
3        started  10.247.146.143  juju-07321b-3  xenial      Running

Relation provider  Requirer      Interface  Type     Message
ceph-mon:mon       ceph-mon:mon  ceph       peer
ceph-mon:osd       ceph-osd:mon  ceph-osd   regular

Now, take a closer look at the cluster to ensure that it is in a healthy state and that all OSDs have been created:

$ juju ssh ceph-mon/0 "sudo -s"

$ ceph status
  cluster:
    id:     f719d3e8-52ff-11e8-91f4-00163e5622ff
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum juju-07321b-0
    mgr: juju-07321b-0(active)
    osd: 9 osds: 9 up, 9 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   7367 MB used, 234 GB / 242 GB avail
    pgs:

The output above shows that the cluster does not contain any pools, PGs or objects. Let’s fix that by creating a pool called testpool and writing some data to it with rados bench, one of Ceph’s internal benchmarking tools, so that we have something to actually shuffle around:

$ ceph osd pool create testpool 100 100
pool 'testpool' created

$ ceph osd pool application enable testpool rgw
enabled application 'rgw' on pool 'testpool'

$ rados bench -p testpool 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_juju-47966f-0_25611
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        36        20    79.891        80    0.934116    0.444238
    2      16        63        47   93.8659       108     1.39832    0.513196
    3      16        80        64   85.2469        68     1.32455    0.610896
    4      16       107        91   90.9015       108    0.877032    0.653832
    5      16       125       109   87.1202        72    0.218025    0.675221
    6      16       147       131   87.2638        88    0.754443     0.66334
    7      16       170       154   87.9318        92    0.645512    0.689901
    8      16       204       188   93.9318       136    0.987254    0.663071
    9      16       230       214   95.0482       104     1.37409    0.645098
   10      16       259       243   97.1393       116     1.39946    0.630096
Total time run:         10.306007
Total writes made:      260
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     100.912
Stddev Bandwidth:       21.1702
Max bandwidth (MB/sec): 136
Min bandwidth (MB/sec): 68
Average IOPS:           25
Stddev IOPS:            5
Max IOPS:               34
Min IOPS:               17
Average Latency(s):     0.630013
Stddev Latency(s):      0.382552
Max latency(s):         2.26802
Min latency(s):         0.0696069

Let’s check ceph status once again. You should now see the new pool and some objects created by the benchmarking tool.

$ ceph status
  cluster:
    id:     f719d3e8-52ff-11e8-91f4-00163e5622ff
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum juju-07321b-0
    mgr: juju-07321b-0(active)
    osd: 9 osds: 9 up, 9 in

  data:
    pools:   1 pools, 100 pgs
    objects: 262 objects, 1044 MB
    usage:   7537 MB used, 234 GB / 241 GB avail
    pgs:     100 active+clean

Finally, let’s take a look at ceph osd tree, which prints a tree of all the OSDs according to their position in the CRUSH map. Pay particular attention to the WEIGHT column, as we will be manipulating these values for the new OSDs when expanding the cluster later.

$ ceph osd tree
ID CLASS WEIGHT  TYPE NAME              STATUS REWEIGHT PRI-AFF
-1       0.23662 root default
-3       0.07887     host juju-07321b-1
 1   hdd 0.02629         osd.1              up  1.00000 1.00000
 3   hdd 0.02629         osd.3              up  1.00000 1.00000
 6   hdd 0.02629         osd.6              up  1.00000 1.00000
-5       0.07887     host juju-07321b-2
 0   hdd 0.02629         osd.0              up  1.00000 1.00000
 4   hdd 0.02629         osd.4              up  1.00000 1.00000
 8   hdd 0.02629         osd.8              up  1.00000 1.00000
-7       0.07887     host juju-07321b-3
 2   hdd 0.02629         osd.2              up  1.00000 1.00000
 5   hdd 0.02629         osd.5              up  1.00000 1.00000
 7   hdd 0.02629         osd.7              up  1.00000 1.00000

Don’t worry if you don’t end up with the exact weight or usage numbers as above. Those numbers depend on the size of storage available on the LXD host.

So, with a working Ceph cluster, we can finally get started.

Set crush initial weight to 0

As mentioned at the top, the challenge is to manage the amount of backfilling when adding new OSDs. One way of doing this is to make sure that all new OSDs get an initial weight of 0. This ensures that Ceph doesn’t start shuffling around data right away when we introduce the new OSDs.

The ceph-osd charm has a handy configuration option called crush-initial-weight which allows us to set this easily across all OSD hosts:

juju config ceph-osd crush-initial-weight=0

Note: There was a bug in the ceph-osd charm before revision 261 which did not render the correct configuration when setting crush-initial-weight=0. Here’s a workaround for those on older revisions:

juju config ceph-osd config-flags='{ "global": { "osd crush initial weight": 0 } }'
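
Either way, it is worth checking that the setting actually ended up in the rendered Ceph configuration on an OSD host before adding new units. A quick way to do that (the unit name here is just an example) is to grep the config file; you should see a line along the lines of osd crush initial weight = 0:

juju ssh ceph-osd/0 "grep -i crush /etc/ceph/ceph.conf"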

Add new OSDs to the cluster

Juju makes adding new Ceph OSD hosts a breeze. You simply tell it to add another ceph-osd unit and it will take care of the rest:

juju add-unit ceph-osd

Time for another coffee while you wait for a new LXD container to spin up. If all goes well, you should end up with a fourth ceph-osd unit in your list:

$ juju status

Model    Controller  Cloud/Region  Version  SLA
default  lxd         lxd           2.3.7    unsupported

App       Version  Status  Scale  Charm     Store       Rev  OS      Notes
ceph-mon  12.2.4   active      1  ceph-mon  jujucharms   24  ubuntu
ceph-osd  12.2.4   active      4  ceph-osd  jujucharms  261  ubuntu

Unit         Workload  Agent  Machine  Public address  Ports  Message
ceph-mon/0*  active    idle   0        10.247.146.247         Unit is ready and clustered
ceph-osd/0*  active    idle   1        10.247.146.135         Unit is ready (3 OSD)
ceph-osd/1   active    idle   2        10.247.146.173         Unit is ready (3 OSD)
ceph-osd/2   active    idle   3        10.247.146.143         Unit is ready (3 OSD)
ceph-osd/3   active    idle   4        10.247.146.230         Unit is ready (3 OSD)

Machine  State    DNS             Inst id        Series  AZ  Message
0        started  10.247.146.247  juju-07321b-0  xenial      Running
1        started  10.247.146.135  juju-07321b-1  xenial      Running
2        started  10.247.146.173  juju-07321b-2  xenial      Running
3        started  10.247.146.143  juju-07321b-3  xenial      Running
4        started  10.247.146.230  juju-07321b-4  xenial      Running

Relation provider  Requirer      Interface  Type     Message
ceph-mon:mon       ceph-mon:mon  ceph       peer
ceph-mon:osd       ceph-osd:mon  ceph-osd   regular

Note that the new host and OSDs should get a weight of 0 in the CRUSH tree (here represented by juju-07321b-4 and osd.9, osd.10 and osd.11).

$ ceph osd tree
ID CLASS WEIGHT  TYPE NAME              STATUS REWEIGHT PRI-AFF
-1       0.23662 root default
-3       0.07887     host juju-07321b-1
 1   hdd 0.02629         osd.1              up  1.00000 1.00000
 3   hdd 0.02629         osd.3              up  1.00000 1.00000
 6   hdd 0.02629         osd.6              up  1.00000 1.00000
-5       0.07887     host juju-07321b-2
 0   hdd 0.02629         osd.0              up  1.00000 1.00000
 4   hdd 0.02629         osd.4              up  1.00000 1.00000
 8   hdd 0.02629         osd.8              up  1.00000 1.00000
-7       0.07887     host juju-07321b-3
 2   hdd 0.02629         osd.2              up  1.00000 1.00000
 5   hdd 0.02629         osd.5              up  1.00000 1.00000
 7   hdd 0.02629         osd.7              up  1.00000 1.00000
-9             0     host juju-07321b-4
 9   hdd       0         osd.9              up  1.00000 1.00000
10   hdd       0         osd.10             up  1.00000 1.00000
11   hdd       0         osd.11             up  1.00000 1.00000

If you only want to add OSDs (drives) to existing Ceph OSD hosts, you can use the osd-devices configuration option. Here’s an example for this test cluster which adds a new directory as a new OSD on all ceph-osd hosts:

juju config ceph-osd osd-devices="/srv/osd1 /srv/osd2 /srv/osd3 /srv/osd4"

Clear crush initial weight

With all new OSDs in place, you can clear the crush-initial-weight configuration option we set earlier.

juju config ceph-osd --reset crush-initial-weight

Reweight new OSDs

You now need to increase the weight of the new OSDs in the CRUSH map from 0 to their target weight in order for Ceph to store data on them.

There are a couple of different ways to go, depending on how many new OSDs you have and how carefully (or slowly) you want to introduce them: you can reweight individual OSDs one at a time with ceph osd crush reweight, or reweight all the OSDs under a host in one go with ceph osd crush reweight-subtree.

The main point is that you want to increment the weight of the OSDs in small steps in order to control the amount of backfilling Ceph will have to do. As with all things Ceph, this will of course depend on your cluster, so I highly recommend you try these things in a staging cluster and start small.
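
For reference, the per-OSD form looks something like this (the OSD name and the tiny weight are just example values):

$ ceph osd crush reweight osd.9 0.005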

For our test, let’s use the reweight-subtree command with a weight of 0.01 so we can reweight all the OSDs of the new Ceph OSD host:

$ ceph osd crush reweight-subtree juju-07321b-4 0.01
reweighted subtree id -9 name 'juju-07321b-4' to 0.01 in crush map

While Ceph is working, keep an eye on the CRUSH tree and the disk usage with this command:

$ ceph osd df tree
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS TYPE NAME
-1       0.26660        -   316G 10012M   306G 3.09 1.00   - root default
-3       0.07887        - 80948M  2509M 78439M 3.10 1.00   -     host juju-07321b-1
 1   hdd 0.02629  1.00000 26982M   836M 26146M 3.10 1.00  26         osd.1
 3   hdd 0.02629  1.00000 26983M   836M 26146M 3.10 1.00  27         osd.3
 6   hdd 0.02629  1.00000 26983M   836M 26146M 3.10 1.00  31         osd.6
-5       0.07887        - 80947M  2507M 78439M 3.10 1.00   -     host juju-07321b-2
 0   hdd 0.02629  1.00000 26982M   835M 26146M 3.10 1.00  24         osd.0
 4   hdd 0.02629  1.00000 26982M   835M 26146M 3.10 1.00  24         osd.4
 8   hdd 0.02629  1.00000 26982M   835M 26146M 3.10 1.00  32         osd.8
-7       0.07887        - 80951M  2511M 78439M 3.10 1.00   -     host juju-07321b-3
 2   hdd 0.02629  1.00000 26983M   837M 26146M 3.10 1.00  20         osd.2
 5   hdd 0.02629  1.00000 26983M   837M 26146M 3.10 1.00  41         osd.5
 7   hdd 0.02629  1.00000 26983M   837M 26146M 3.10 1.00  29         osd.7
-9       0.02998        - 80923M  2483M 78439M 3.07 0.99   -     host juju-07321b-4
 9   hdd 0.00999  1.00000 26974M   827M 26146M 3.07 0.99  10         osd.9
10   hdd 0.00999  1.00000 26974M   827M 26146M 3.07 0.99  13         osd.10
11   hdd 0.00999  1.00000 26974M   827M 26146M 3.07 0.99  23         osd.11
                    TOTAL   316G 10012M   306G 3.09
MIN/MAX VAR: 0.99/1.00  STDDEV: 0.01

As you can see, the new OSDs osd.9, osd.10 and osd.11 now have a weight of roughly 0.01.

Again, for real clusters, reweighting in multiple small steps is what will take most of the time and what you really want to automate.
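
As a rough illustration of what such automation might look like (this is not our actual tooling), here is a small shell sketch that nudges the weight of a new host towards a target in fixed increments and waits for backfilling to quiet down between steps. The host name, target weight, step size and the grep-based health check are all assumptions you would adapt to your own cluster:

#!/bin/bash
# Sketch: step a new host's CRUSH weight towards a target weight,
# pausing between steps until no PGs report backfill activity.
HOST=juju-07321b-4   # example host bucket from this test cluster
TARGET=0.02629       # example target weight
STEP=0.005           # example increment per round

CURRENT=0
while (( $(echo "$CURRENT < $TARGET" | bc -l) )); do
  CURRENT=$(echo "$CURRENT + $STEP" | bc -l)
  # Never overshoot the target weight
  if (( $(echo "$CURRENT > $TARGET" | bc -l) )); then
    CURRENT=$TARGET
  fi
  ceph osd crush reweight-subtree "$HOST" "$CURRENT"
  # Crude wait: sleep until ceph status no longer mentions backfill
  while ceph status | grep -q backfill; do
    sleep 60
  done
done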

To keep this short, let’s reweight the OSDs again, this time directly to the target weight of the original OSDs:

$ ceph osd crush reweight-subtree juju-07321b-4 0.026299
reweighted subtree id -9 name 'juju-07321b-4' to 0.026299 in crush map

$ ceph osd df tree
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE VAR  PGS TYPE NAME
-1       0.31549        -   316G 10029M   306G 3.10 1.00   - root default
-3       0.07887        - 80931M  2509M 78421M 3.10 1.00   -     host juju-07321b-1
 1   hdd 0.02629  1.00000 26977M   836M 26140M 3.10 1.00  25         osd.1
 3   hdd 0.02629  1.00000 26977M   836M 26140M 3.10 1.00  26         osd.3
 6   hdd 0.02629  1.00000 26977M   836M 26140M 3.10 1.00  25         osd.6
-5       0.07887        - 80927M  2505M 78421M 3.10 1.00   -     host juju-07321b-2
 0   hdd 0.02629  1.00000 26975M   835M 26140M 3.10 1.00  20         osd.0
 4   hdd 0.02629  1.00000 26975M   835M 26140M 3.10 1.00  23         osd.4
 8   hdd 0.02629  1.00000 26975M   835M 26140M 3.10 1.00  22         osd.8
-7       0.07887        - 80933M  2511M 78421M 3.10 1.00   -     host juju-07321b-3
 2   hdd 0.02629  1.00000 26977M   837M 26140M 3.10 1.00  21         osd.2
 5   hdd 0.02629  1.00000 26977M   837M 26140M 3.10 1.00  33         osd.5
 7   hdd 0.02629  1.00000 26977M   837M 26140M 3.10 1.00  27         osd.7
-9       0.07887        - 80925M  2503M 78421M 3.09 1.00   -     host juju-07321b-4
 9   hdd 0.02629  1.00000 26975M   834M 26140M 3.09 1.00  25         osd.9
10   hdd 0.02629  1.00000 26975M   834M 26140M 3.09 1.00  21         osd.10
11   hdd 0.02629  1.00000 26975M   834M 26140M 3.09 1.00  32         osd.11
                    TOTAL   316G 10029M   306G 3.10
MIN/MAX VAR: 1.00/1.00  STDDEV: 0.00

And there you go. All the new OSDs have been introduced to the cluster and weighted correctly.

Afterword

Thanks to all the OpenStack Charmers for creating and keeping all of these charms in great shape. Also thanks to Dan van der Ster and the storage folks at CERN for the tools and many great tips on how to run Ceph at scale. Finally, many thanks to Oddgeir Lingaas Holmen for helping write and clean up these posts.

Upgrading Juju

I recently spent some time upgrading our Juju environments from 2.1 to 2.3. Below are a few lessons learned aimed at other Juju enthusiasts doing the same experiment.

First, Juju is a cool controller- and agent-based tool from Canonical to easily deploy and manage applications (called Charms) on different clouds and environments (see how it works for more details).

We run an academic cloud, HUNT Cloud, where we utilize a highly available Juju deployment, in concert with MAAS, to run things like OpenStack and Ceph. For this upgrade, we were looking forward to some of the new features such as cross model relations and overlay bundles.

How to upgrade Juju (for dummies)

Upgrading a Juju environment is usually a straightforward task completed with a cup of coffee and a couple of commands. The main steps are:

  1. Upgrade your Juju client (the client talking to the Juju controllers, usually on your local machine, apt upgrade juju or snap refresh juju)
  2. Upgrade your Juju controller (the controller managing the agents, juju upgrade-juju --model controller)
  3. Upgrade your Juju model (the model containing your deployed applications, juju upgrade-juju --model <name-of-model>)

Check out the official docs for a more thorough explanation.
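
As a concrete example, a full pass on a snap-installed client could look something like this (the workload model name in the last two commands is just a placeholder); running with --dry-run first shows which version Juju intends to pick:

snap refresh juju
juju upgrade-juju --model controller --dry-run
juju upgrade-juju --model controller
juju upgrade-juju --model mymodel --dry-run
juju upgrade-juju --model mymodel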

Our task at hand was to upgrade the Juju environment from 2.1.2 to 2.3. Step 1 was as easy as can be; however, the remaining steps provided a few lessons learned that might prove useful for others.

Issue No. 1

We ran the following command to upgrade our controllers:

$ juju upgrade-juju --model controller
best version:
    2.2.9
started upgrade to 2.2.9

Now, if you look closely, the output above says 2.2.9, not 2.3.2, which was the latest version at the time and the one I actually wanted. Well, the upgrade to 2.2.9 succeeded, so I continued by running juju upgrade-juju --model controller once more to reach 2.3.2.

This time things didn’t go as smoothly for the controllers: they got stuck upgrading, which rendered the environment unusable. It did, however, produce some rather bleak yet humorous error messages.

2018-01-30 11:15:22 WARNING juju.worker.upgradesteps worker.go:275 stopped waiting for other controllers: tomb: dying
2018-01-30 11:15:22 ERROR juju.worker.upgradesteps worker.go:379 upgrade from 2.2.9 to 2.3.2 for "machine-0" failed (giving up): tomb: dying

I was able to reproduce this in one of our larger staging areas and the bug got fixed in 2.3.3 in lp#1746265.

Issue No. 2

So, after getting stuck with the issue above, I was encouraged to try upgrading straight to 2.3.2, skipping 2.2.9 altogether. Juju allows you to specify the target version using the --agent-version flag. The command you end up with is juju upgrade-juju --model controller --agent-version 2.3.2.

Sticking to good form and the rule of three, the controllers got stuck upgrading, rendering the environment unusable once again. Fortunately, it was easy to reproduce both in our staging area and on local LXD deployments, so this one also got fixed in 2.3.3 in lp#1748294.

Issue no. 3

We gave the upgrade another try when version 2.3.4 rolled around late in February. Things looked good after multiple runs in staging, so I finally upgraded one of our production controllers using juju upgrade-juju --model controller --agent-version 2.3.4.

The upgrade process took around 15 minutes. After a lot of logspam in the controller logs and some unnerving error messages in the juju status --model controller output, things seemed to settle. Almost. We noticed charm agent failures and connection errors between the controllers and a small number of the applications in the main production Juju model containing our OpenStack and Ceph deployments.

After filing lp#1755155, I was advised to push on and upgrade the Juju model even though some of the charms had errored out. This approach resolved the connection errors.

The root cause was most likely lp#1697936, which was reported last year. It turned out that 2.1 agents could fail to read from 2.2 and newer controllers. I did eventually find a mention of the bug in the changelog for 2.2.0; however, the description did not contain the error messages, so my searches in Launchpad came up empty.

Upgrading the model with juju upgrade-juju --model openstack --agent-version 2.3.4 and restarting the affected agents finally did the trick and all components were running smoothly on 2.3.4.
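
For reference, restarting an agent comes down to restarting its systemd service on the machine it runs on. The unit and machine names below are placeholders, and the exact service names can vary between Juju versions:

juju ssh 3 "sudo systemctl restart jujud-machine-3"
juju ssh myapp/0 "sudo systemctl restart jujud-unit-myapp-0"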

Afterword

To be fair to the Juju team, our production model has a decent amount of different charms and therefore a decent amount of Juju agents (we are talking about OpenStack after all).

Now you might rightfully ask: Sandor, why on earth didn’t you just upgrade the model right away as described in step 3? Well, I simply became a bit wary of proceeding without any easy way to roll back after running into all the previous bugs where things got stuck.

Lessons learned

  • Always read the changelogs. Carefully.
  • Always test the upgrades. This goes both for users and the dev team.
  • The upgrade UX has room for improvement, with everything from apt upgrade juju and snap refresh juju to juju upgrade-juju --model controller, juju upgrade-juju --model <model>, juju upgrade-charm and juju upgrade-gui.
  • As things can go awry, it would be nice if juju upgrade-juju told you what it is going to do without needing the --dry-run flag, as it may not pick the version you want.
  • It would also be nice if there were a way to do proper dry runs, or even roll back upgrades (both failed and successful), beyond backing up and restoring your controllers.
  • Even though the controller and the model are upgraded separately and should be able to run different versions, they can break each other.

Many thanks to Rick, Tim, John and the rest of the Juju gang from Canonical for helping out with tips, troubleshooting and fixes.
