<h1>Why I Can’t Recommend Renegade Tonneau Covers (An Honest Review)</h1>
<p>For this 2013 Ford F-350 Super Duty owner, the Renegade Tonneau Cover gets 2.5/10 stars.
</p>
<p>I bought the <a href="https://renegadecovers.com">Renegade Tonneau Cover</a> in April 2021 after weeks of research. Having been a previous BAK Revolver X2 customer, I was looking for something a little more forgiving and durable. I spotted another truck with a DiamondBack cover, and research into that led me to Renegade.</p>
<p>After finally receiving my cover nearly four months after ordering, here’s what I found:</p>
<ul>
<li>It covers the bed</li>
<li>It keeps things in the bed pretty secure</li>
<li>It keeps those secure things pretty dry</li>
<li>It’s heavy duty</li>
<li>It looks awesome</li>
<li>Its L-track system can come in handy</li>
</ul>
<p>These are all qualities of a typical tonneau cover, except for maybe the heavy-dutyness and the unique L-track system.</p>
<p>Fast forward nearly two years…</p>
<ul>
<li>50% of the time I can’t get it to open properly</li>
<li>The panels are pretty heavy</li>
<li>50% of the time I can’t get it to close properly</li>
<li>The panels have a habit of shifting, especially in an open position on a bumpy road</li>
<li>Did I mention that it’s heavy?</li>
</ul>
<p>In all seriousness, though, the heavy dutyness of this tonneau cover (2,000 lbs bearing weight!) does come at a cost: you won’t be taking this off by yourself unless you have a Tacoma. Opening and closing the front or rear panel is an exercise in delicateness. Close it too hard, and the rail might come out of alignment and you’ll find yourself in the bed of the truck trying to get it situated just right. Allow it to slam down and you’ll be lucky to not lose a finger.</p>
<p>The attachment points for the rails in my F-350 are near the front and rear stake pockets (the holes in the bed rails). So, for each rail, there are only two mounting points. This allows the rail to sort of ‘tilt’ down from its ordinary position parallel to the bed rail and fall out of alignment, causing the pins to bind or creating some other complication that results in the cover not opening or closing properly. This is my biggest complaint. I spend 80% of my time making adjustments to the rail and/or the cover to get it to open or close properly until the next time I need to open or close it. It’s a <em>constant</em> battle, and one that has escalated beyond annoying.</p>
<p>The rubber seal at the tailgate has degraded and separated from the cover due to the plastic tailgate end cap not sitting flush with the seal. A few months of this annoyance led me to solve the problem with some Gorilla Glue, which seems to have held up well over the last 60 days.</p>
<p>Renegade has been fairly responsive to my complaints; enough to send me a shim to help with rail alignment when the torsion bar is installed. The torsion bar allows you to open both rails while only using a single handle. Given my rail alignment issues, however, I have found it less painful to simply use both handles.</p>
<p>I am willing to accept that my particular application may not be the greatest matchup for this cover, given the lack of solid mounting locations compared to, say, the late-model F-150 and Super Duty models (with aluminum bed) and maybe smaller trucks. Had I known just how inconvenient and annoying it would be to operate the tonneau <em>every time</em> I use the bed of the truck, I would’ve gone with the DiamondBack or no cover at all. While I do occasionally use the L-track system, I find it hard to put faith in the sturdiness of the setup given the possibility of one or more corners not being fully “locked down”. The concept is cool, though, and having all sorts of tools and utilities mounted to the cover is definitely a selling point. But the thought of those accessories scattered across Interstate 10 doesn’t really sound appealing.</p>
<p>In 2023, I have serious regrets about purchasing the Renegade tonneau cover. Given the opportunity to buy new, I might look at the DiamondBack SwitchBack cover (vs the HD), since it is similar in function to the Renegade cover.</p>
<p><img src="/assets/images/2023-08-14-renegade-covers/cover.png" alt="Tonneau" /></p>
<p>2.5 stars for fitting the part of a great-looking and somewhat functional tonneau. -7.5 stars for all of the stress it induces.</p>
<h1>Neutron Dynamic Routing - What it is (and isn’t)</h1>
<p>To understand OpenStack Neutron’s Dynamic Routing feature, you must first understand what BGP Speaker is… and what it isn’t.
</p>
<p>Recent workshops with a customer made it very clear to me that Neutron’s Dynamic Routing feature leaves a lot on the table, and likely isn’t a good fit for many of the environments that would look at using it. That doesn’t mean it isn’t useful, though.</p>
<p>Before jumping too far into Neutron Dynamic Routing and its core function, advertising tenant networks, let’s revisit Neutron’s logical network designs.</p>
<h2 id="tenant-networking">Tenant Networking</h2>
<p>Whether you’re using ML2/LXB (Linux Bridge), ML2/OVS (Open vSwitch), or ML2/OVN, the <em>logical</em> network topology for tenant networking looks relatively the same. It’s composed of:</p>
<ul>
<li>An external provider network</li>
<li>A virtual router</li>
<li>One or more tenant networks</li>
</ul>
<p>On paper, it looks something like this:</p>
<p><img src="/assets/images/2022-10-12-openstack-bgp-speaker/standard_tenant_networking.png" alt="Standard Tenant Networking" /></p>
<p>Tenant networks are not reachable by default. The virtual router can source NAT (SNAT) outbound traffic from instances in tenant networks to allow connectivity to external networks or the Internet. Inbound traffic in this scenario is not possible without the use of Floating IPs. Floating IPs, in turn, are sourced from the <strong>external provider network</strong>. Your standard Neutron tenant network topology looks something like this:</p>
<p><img src="/assets/images/2022-10-12-openstack-bgp-speaker/floating_tenant_networking.png" alt="Floating Tenant Networking" /></p>
<p>To reach tenant networks directly and bypass the use of floating IPs, one <em>could</em> implement a static route on the provider network gateway device and redistribute that route upstream. In fact, we’ve done this for many years as far back as the Grizzly release of OpenStack, when Neutron (née Quantum) was in its infancy. Where this falls apart, though, is in the <strong><em>self-servicing</em></strong> of tenant networking. Tenants can’t (or shouldn’t) access that provider gateway device and would not be able to add that static route.</p>
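<p>On a Linux-based gateway, that static route might be as simple as the following (a sketch; the tenant CIDR and the Neutron router’s external address are examples, and a hardware gateway will have its own syntax):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Send traffic for the tenant subnet to the Neutron router's address on the provider network
ip route add 10.5.0.0/24 via 192.168.100.202
</code></pre></div></div>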
<h2 id="neutron-dynamic-routing">Neutron Dynamic Routing</h2>
<p>The obvious solution is to implement some sort of dynamic routing mechanism to allow tenants to advertise their tenant network(s) upstream with no involvement from the network administrator. Neutron provides this capability with a combination of <strong>Neutron Dynamic Routing</strong>, <strong>Subnet Pools</strong>, and <strong>Address Scopes</strong>.</p>
<p>Neutron Dynamic Routing provides a service known as <strong>BGP Speaker</strong> that peers with external routers to advertise the tenant networks using BGP. Subnet pools and address scopes are used together to avoid overlapping subnets, especially when advertising to a given peer.</p>
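<p>To give a sense of how those pieces fit together, a minimal sketch using the OpenStack CLI might look like this (the names, prefixes, and prefix lengths are illustrative):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Admin creates a shared address scope and a subnet pool within it
openstack address scope create --share --ip-version 4 public-scope
openstack subnet pool create --address-scope public-scope \
  --pool-prefix 10.5.0.0/16 --default-prefix-length 24 --share tenant-pool

# Tenants then allocate subnets from the pool instead of choosing arbitrary CIDRs
openstack subnet create --network web --subnet-pool tenant-pool web
</code></pre></div></div>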
<p>Where the misunderstanding appears to sneak in is <em>how</em> and <em>where</em> the advertisements occur. It’s fairly common practice to have two routers directly connected to one another to exchange routes, like so:</p>
<p><img src="/assets/images/2022-10-12-openstack-bgp-speaker/bgp.png" alt="BGP" /></p>
<p>One might assume, then, that the Neutron router would peer with the provider network router in this fashion. They’d be wrong!</p>
<p><img src="/assets/images/2022-10-12-openstack-bgp-speaker/soup.jpg" alt="No BGP For you!" width="350" /></p>
<p>That’s where <strong>BGP Speaker</strong> comes into play. The BGP Speaker is a <em>control plane</em> service that advertises tenant network(s) on behalf of the tenant router. The BGP Speaker peers with the provider network router and advertises the tenant network with a next hop of the tenant router, like so:</p>
<p><img src="/assets/images/2022-10-12-openstack-bgp-speaker/speaker.png" alt="BGP Speaker!" /></p>
<p>The BGP Speaker is not a router. It is not a route reflector. It does not accept BGP routes from other speakers or routers. It. Only. Speaks. BGP. And, it does this from the control plane or network node hosting the BGP “dragent”. What that means in practice is that the controller or network node hosting the agent needs L3 connectivity to the provider network gateway device: either the WAN, the LAN, or some other interface to peer on. This requirement is not ideal in many environments and could be a deal breaker in others.</p>
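<p>Configuration is done through the Neutron API; a minimal sketch of wiring up a speaker (the ASNs, peer address, and names below are illustrative) might look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Create the speaker and its peer, then associate the two
openstack bgp speaker create --ip-version 4 --local-as 64512 bgp-speaker
openstack bgp peer create --peer-ip 192.168.100.1 --remote-as 64513 provider-rtr
openstack bgp speaker add peer bgp-speaker provider-rtr

# Associate the external network; eligible tenant networks behind its routers get advertised
openstack bgp speaker add network bgp-speaker vlan100

# Schedule the speaker to a dynamic routing agent (dragent)
openstack bgp dragent add speaker DRAGENT_UUID bgp-speaker
</code></pre></div></div>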
<h2 id="summary">Summary</h2>
<p>The documentation upstream for <a href="https://docs.openstack.org/neutron/latest/admin/config-bgp-dynamic-routing.html">Neutron Dynamic Routing</a> has some pretty good diagrams and goes into further detail than what I’ve described here. The BGP speaker can even advertise floating IPs, though I’m not sure how this makes sense if the provider router is locally connected. However, I’m sure there’s a use case I haven’t considered. There have been attempts to implement BGP at the Neutron router itself, as seen in this <a href="https://bugs.launchpad.net/neutron/+bug/1921461">RFE</a>, but it has not really gained much traction since late 2021. This functionality would mirror something I’ve seen in NSX and other (legacy) cases, but might result in too much overhead, especially when hundreds of routers are involved.</p>
<hr />
<p>If you have some thoughts or comments on this post, I’d love to hear ‘em. Feel free to reach out on Twitter at @jimmdenton or hit me up on LinkedIn.</p>
<h1>[OpenStack] Migrating from LinuxBridge to OVN</h1>
<p>Migrating from one Neutron mechanism driver to another, especially in a production environment, is not a decision one takes on without giving much thought. In many cases, the process involves migrating to a “greenfield” environment, or a new environment that is stood up running the same or similar operating system and cloud service software but configured in a new way, then migrating entire workloads in a weekend (or more). To say this process is tedious is an understatement.
</p>
<p>Brave individuals have sometimes taken to in-place migrations. In fact, my first OpenStack Summit presentation involved migrating from ML2/OVS to ML2/LXB in-place due to issues with Open vSwitch stability and performance in the early days. Since then, I have been involved with multiple OVS->LXB and LXB->OVS migrations, as well as LXB->OVN.</p>
<h2 id="overview">Overview</h2>
<p>Since performing the initial migration(s) in the lab, I’ve decided to better document the process here so you, the reader, can see what’s involved and determine if this is the right move for your environment. I’m running OpenStack-Ansible Wallaby, so the steps may need to be extrapolated for environments that involve a more ‘manual’ process of modifying configurations.</p>
<p>The environment here consists of five nodes:</p>
<ul>
<li>3x controller</li>
<li>2x compute</li>
</ul>
<p>The original plugin/driver is ML2/LinuxBridge with multiple Neutron resources:</p>
<ul>
<li>2x routers</li>
<li>2x provider (vlan) networks</li>
<li>3x tenant (vxlan) networks</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# openstack router list
+--------------------------------------+---------+--------+-------+----------------------------------+
| ID | Name | Status | State | Project |
+--------------------------------------+---------+--------+-------+----------------------------------+
| cee5e805-ecf9-456b-87be-d60f155c8fd8 | rtr-web | ACTIVE | UP | d1ae5313d10c411fa772e8fa697a6aeb |
| d5052734-53e8-4a58-9fbd-2b76ec138af6 | rtr-db | ACTIVE | UP | d1ae5313d10c411fa772e8fa697a6aeb |
+--------------------------------------+---------+--------+-------+----------------------------------+
root@infra1:~# openstack network list
+--------------------------------------+----------------------------------------------------+--------------------------------------+
| ID | Name | Subnets |
+--------------------------------------+----------------------------------------------------+--------------------------------------+
| 12a0ab09-d130-4e69-9aa2-c28c66509b02 | db | 37ae585e-1c48-4aff-98de-dad4f9502428 |
| 282e63e3-5120-4396-a63d-0186e5e96466 | app | d6974d0e-685e-4cfd-ba06-4335c2834788 |
| 3fb2d48e-8c71-4bca-92ce-f64a4c932338 | vlan200 | 6e960212-2104-4aea-b51f-686d2b1190d7 |
| 9e151884-67a5-4905-b157-f08f1b3b0040 | HA network tenant d1ae5313d10c411fa772e8fa697a6aeb | 5db20ee1-1d8d-42fe-9724-301fee8c6f43 |
| ab3f0f85-a509-406a-8dca-5db13fbcb48b | web | 90d2e2fe-2301-47b8-b31c-bb6dc7264acb |
| dddfdce8-a8fd-4802-a01c-261b92043488 | vlan100 | 6799e6c1-5b66-4894-81b6-6dc698d43462 |
+--------------------------------------+----------------------------------------------------+--------------------------------------+
root@infra1:~# openstack subnet list
+--------------------------------------+---------------------------------------------------+--------------------------------------+------------------+
| ID | Name | Network | Subnet |
+--------------------------------------+---------------------------------------------------+--------------------------------------+------------------+
| 37ae585e-1c48-4aff-98de-dad4f9502428 | db | 12a0ab09-d130-4e69-9aa2-c28c66509b02 | 192.168.55.0/24 |
| 5db20ee1-1d8d-42fe-9724-301fee8c6f43 | HA subnet tenant d1ae5313d10c411fa772e8fa697a6aeb | 9e151884-67a5-4905-b157-f08f1b3b0040 | 169.254.192.0/18 |
| 6799e6c1-5b66-4894-81b6-6dc698d43462 | vlan100 | dddfdce8-a8fd-4802-a01c-261b92043488 | 192.168.100.0/24 |
| 6e960212-2104-4aea-b51f-686d2b1190d7 | vlan200 | 3fb2d48e-8c71-4bca-92ce-f64a4c932338 | 192.168.200.0/24 |
| 90d2e2fe-2301-47b8-b31c-bb6dc7264acb | web | ab3f0f85-a509-406a-8dca-5db13fbcb48b | 10.5.0.0/24 |
| d6974d0e-685e-4cfd-ba06-4335c2834788 | app | 282e63e3-5120-4396-a63d-0186e5e96466 | 172.25.0.0/24 |
+--------------------------------------+---------------------------------------------------+--------------------------------------+------------------+
</code></pre></div></div>
<p>Six virtual machine instances were deployed across two compute nodes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# openstack server list
+--------------------------------------+---------+--------+-----------------------------------+--------------+--------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+---------+--------+-----------------------------------+--------------+--------+
| b3a33fb1-98dc-4cf9-99c3-53d5352310e5 | vm-db2 | ACTIVE | db=192.168.55.223 | cirros-0.5.2 | 1-1-1 |
| e2ea6e2a-aa47-4f44-b285-1b727ad4f709 | vm-db1 | ACTIVE | db=192.168.100.215, 192.168.55.21 | cirros-0.5.2 | 1-1-1 |
| 916052d7-a5f7-4e4a-87a0-7249eef45801 | vm-app2 | ACTIVE | app=172.25.0.250 | cirros-0.5.2 | 1-1-1 |
| dd3046a3-128a-4585-8ffd-54c11b516052 | vm-app1 | ACTIVE | app=172.25.0.50 | cirros-0.5.2 | 1-1-1 |
| 7e1af764-a034-4ef2-9695-ca19838812e5 | vm-web1 | ACTIVE | web=10.5.0.121, 192.168.100.90 | cirros-0.5.2 | 1-1-1 |
| dbb98201-52fc-420d-bf6e-5a40fad74327 | vm-web2 | ACTIVE | web=10.5.0.162 | cirros-0.5.2 | 1-1-1 |
+--------------------------------------+---------+--------+-----------------------------------+--------------+--------+
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@compute1:~# virsh list
Id Name State
-----------------------------------
1 instance-00000006 running
2 instance-0000000c running
3 instance-00000012 running
root@compute2:~# virsh list
Id Name State
-----------------------------------
1 instance-00000009 running
2 instance-0000000f running
3 instance-00000015 running
</code></pre></div></div>
<h2 id="inspections">Inspections</h2>
<p>Before conducting the migration, I performed a series of tests to confirm the following worked:</p>
<h4 id="icmp-to-all-instances-from-the-dhcp-namespaces">ICMP to all instances from the DHCP namespace(s)</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# ip netns exec qdhcp-12a0ab09-d130-4e69-9aa2-c28c66509b02 ping 192.168.55.223 -c2
PING 192.168.55.223 (192.168.55.223) 56(84) bytes of data.
64 bytes from 192.168.55.223: icmp_seq=1 ttl=64 time=13.3 ms
64 bytes from 192.168.55.223: icmp_seq=2 ttl=64 time=1.20 ms
--- 192.168.55.223 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 1.204/7.239/13.275/6.035 ms
root@infra1:~# ip netns exec qdhcp-12a0ab09-d130-4e69-9aa2-c28c66509b02 ping 192.168.55.21 -c2
PING 192.168.55.21 (192.168.55.21) 56(84) bytes of data.
64 bytes from 192.168.55.21: icmp_seq=1 ttl=64 time=1.61 ms
64 bytes from 192.168.55.21: icmp_seq=2 ttl=64 time=1.19 ms
--- 192.168.55.21 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 1.193/1.401/1.610/0.208 ms
root@infra1:~# ip netns exec qdhcp-282e63e3-5120-4396-a63d-0186e5e96466 ping 172.25.0.250 -c2
PING 172.25.0.250 (172.25.0.250) 56(84) bytes of data.
64 bytes from 172.25.0.250: icmp_seq=1 ttl=64 time=1.69 ms
64 bytes from 172.25.0.250: icmp_seq=2 ttl=64 time=1.34 ms
--- 172.25.0.250 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 1.341/1.516/1.691/0.175 ms
root@infra1:~# ip netns exec qdhcp-282e63e3-5120-4396-a63d-0186e5e96466 ping 172.25.0.50 -c2
PING 172.25.0.50 (172.25.0.50) 56(84) bytes of data.
64 bytes from 172.25.0.50: icmp_seq=1 ttl=64 time=1.27 ms
64 bytes from 172.25.0.50: icmp_seq=2 ttl=64 time=1.24 ms
--- 172.25.0.50 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 1.241/1.255/1.270/0.014 ms
root@infra1:~# ip netns exec qdhcp-ab3f0f85-a509-406a-8dca-5db13fbcb48b ping 10.5.0.121 -c2
PING 10.5.0.121 (10.5.0.121) 56(84) bytes of data.
64 bytes from 10.5.0.121: icmp_seq=1 ttl=64 time=1.94 ms
64 bytes from 10.5.0.121: icmp_seq=2 ttl=64 time=1.44 ms
--- 10.5.0.121 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 1.437/1.687/1.937/0.250 ms
root@infra1:~# ip netns exec qdhcp-ab3f0f85-a509-406a-8dca-5db13fbcb48b ping 10.5.0.162 -c2
PING 10.5.0.162 (10.5.0.162) 56(84) bytes of data.
64 bytes from 10.5.0.162: icmp_seq=1 ttl=64 time=1.71 ms
64 bytes from 10.5.0.162: icmp_seq=2 ttl=64 time=1.61 ms
--- 10.5.0.162 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 1.608/1.656/1.705/0.048 ms
</code></pre></div></div>
<h4 id="ssh-to-all-instances-from-the-dhcp-namespaces">SSH to all instances from the DHCP namespace(s)</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# ip netns exec qdhcp-ab3f0f85-a509-406a-8dca-5db13fbcb48b ssh cirros@10.5.0.162 uptime
The authenticity of host '10.5.0.162 (10.5.0.162)' can't be established.
ECDSA key fingerprint is SHA256:NAb9iUzaNKhRptbCLQj/ROZ1vJKisSlFM2amR/s/1Dk.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.5.0.162' (ECDSA) to the list of known hosts.
cirros@10.5.0.162's password:
14:15:52 up 20 min, 0 users, load average: 0.00, 0.00, 0.00
root@infra1:~# ip netns exec qdhcp-282e63e3-5120-4396-a63d-0186e5e96466 ssh cirros@172.25.0.50 uptime
The authenticity of host '172.25.0.50 (172.25.0.50)' can't be established.
ECDSA key fingerprint is SHA256:GMBDGbQ1g1JiyqCTH/kIlrzaojtAXoCGCG/J8BdxEKA.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '172.25.0.50' (ECDSA) to the list of known hosts.
cirros@172.25.0.50's password:
14:16:26 up 18 min, 0 users, load average: 0.00, 0.00, 0.00
root@infra1:~# ip netns exec qdhcp-12a0ab09-d130-4e69-9aa2-c28c66509b02 ssh cirros@192.168.55.223 uptime
The authenticity of host '192.168.55.223 (192.168.55.223)' can't be established.
ECDSA key fingerprint is SHA256:WRuu37KvrvU16c7cgF3f4EbA+U9oWMVTY59r/X7rRaA.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '192.168.55.223' (ECDSA) to the list of known hosts.
cirros@192.168.55.223's password:
14:16:51 up 13 min, 0 users, load average: 0.00, 0.00, 0.00
</code></pre></div></div>
<h4 id="icmp-between-instances">ICMP between instances</h4>
<p>DB2->DB1</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ hostname
vm-db2
$ ping 192.168.55.21 -c2
PING 192.168.55.21 (192.168.55.21): 56 data bytes
64 bytes from 192.168.55.21: seq=0 ttl=64 time=1.481 ms
64 bytes from 192.168.55.21: seq=1 ttl=64 time=1.868 ms
--- 192.168.55.21 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 1.481/1.674/1.868 ms
</code></pre></div></div>
<p>WEB1 -> WEB2 and WEB1 -> APP2</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# ip netns exec qdhcp-ab3f0f85-a509-406a-8dca-5db13fbcb48b ssh cirros@10.5.0.121
cirros@10.5.0.121's password:
$ hostname
vm-web1
$ ping 10.5.0.162 -c2
PING 10.5.0.162 (10.5.0.162): 56 data bytes
64 bytes from 10.5.0.162: seq=0 ttl=64 time=2.099 ms
64 bytes from 10.5.0.162: seq=1 ttl=64 time=1.880 ms
--- 10.5.0.162 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 1.880/1.989/2.099 ms
$ ping 172.25.0.50 -c2
PING 172.25.0.50 (172.25.0.50): 56 data bytes
64 bytes from 172.25.0.50: seq=0 ttl=63 time=9.040 ms
64 bytes from 172.25.0.50: seq=1 ttl=63 time=2.553 ms
--- 172.25.0.50 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 2.553/5.796/9.040 ms
</code></pre></div></div>
<h4 id="connectivity-to-floating-ip">Connectivity to floating IP</h4>
<p>WEB1->DB1 via FLOAT</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# ip netns exec qdhcp-ab3f0f85-a509-406a-8dca-5db13fbcb48b ssh cirros@10.5.0.121
cirros@10.5.0.121's password:
$ hostname
vm-web1
$ ping 192.168.100.215 -c2
PING 192.168.100.215 (192.168.100.215): 56 data bytes
64 bytes from 192.168.100.215: seq=0 ttl=62 time=11.783 ms
64 bytes from 192.168.100.215: seq=1 ttl=62 time=10.835 ms
--- 192.168.100.215 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 10.835/11.309/11.783 ms
</code></pre></div></div>
<h2 id="pre-flight">Pre-Flight</h2>
<p>Before starting the migration, there are a few config changes that can be staged. Please note that this entire process will result in downtime, and is probably not well suited to any sort of “rollback” without serious testing beforehand.</p>
<p>I like to live dangerously.</p>
<p>First, modify the <code class="language-plaintext highlighter-rouge">/etc/openstack_deploy/user_variables.yml</code> file to include some overrides:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>neutron_plugin_type: ml2.ovn
neutron_plugin_base:
- neutron.services.ovn_l3.plugin.OVNL3RouterPlugin
- qos
neutron_ml2_drivers_type: "geneve,vxlan,vlan,flat"
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">qos</code> plugin may be required if not already enabled in your environment. Previous testing showed that the Neutron API server would not start without it. YMMV.</p>
<p>Next, update the <code class="language-plaintext highlighter-rouge">openstack_inventory.json</code> inventory file to remove members of the L3, DHCP, LinuxBridge, and Metadata agent groups (this will have to be done by hand).</p>
<h4 id="before">BEFORE</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"neutron_dhcp_agent": {
"children": [],
"hosts": [
"infra1",
"infra2",
"infra3"
]
},
"neutron_l3_agent": {
"children": [],
"hosts": [
"infra1",
"infra2",
"infra3"
]
},
"neutron_linuxbridge_agent": {
"children": [],
"hosts": [
"compute1",
"compute2",
"infra1",
"infra2",
"infra3"
]
},
"neutron_metadata_agent": {
"children": [],
"hosts": [
"infra1",
"infra2",
"infra3"
]
}
</code></pre></div></div>
<h4 id="after">AFTER</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"neutron_dhcp_agent": {
"children": [],
"hosts": [
]
},
"neutron_l3_agent": {
"children": [],
"hosts": [
]
},
"neutron_linuxbridge_agent": {
"children": [],
"hosts": [
]
},
"neutron_metadata_agent": {
"children": [],
"hosts": [
]
}
</code></pre></div></div>
<p>Then, update the <code class="language-plaintext highlighter-rouge">/etc/openstack_deploy/group_vars/network_hosts</code> file to add an OVS-related override:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack_host_specific_kernel_modules:
- name: "openvswitch"
pattern: "CONFIG_OPENVSWITCH"
</code></pre></div></div>
<p>Modify the <code class="language-plaintext highlighter-rouge">/etc/openstack_deploy/env.d/neutron.yml</code> file to update Neutron-related group memberships:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
component_skel:
neutron_ovn_controller:
belongs_to:
- neutron_all
neutron_ovn_northd:
belongs_to:
- neutron_all
container_skel:
neutron_agents_container:
contains: {}
neutron_ovn_northd_container:
belongs_to:
- network_containers
contains:
- neutron_ovn_northd
properties:
is_metal: true
neutron_server_container:
belongs_to:
- network_containers
contains:
- neutron_server
- opendaylight
properties:
is_metal: true
</code></pre></div></div>
<p>Also, modify the <code class="language-plaintext highlighter-rouge">/etc/openstack_deploy/env.d/nova.yml</code> file to update Nova-related group memberships:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
container_skel:
nova_api_container:
belongs_to:
- compute-infra_containers
- os-infra_containers
contains:
- nova_api_metadata
- nova_api_os_compute
- nova_conductor
- nova_scheduler
- nova_console
properties:
is_metal: true
nova_compute_container:
belongs_to:
- compute_containers
- kvm-compute_containers
- lxd-compute_containers
- qemu-compute_containers
contains:
- neutron_ovn_controller
- nova_compute
properties:
is_metal: true
</code></pre></div></div>
<p>The network definitions in <code class="language-plaintext highlighter-rouge">openstack_user_config.yml</code> will need to be updated to support OVN. In this environment there are two bridges: <code class="language-plaintext highlighter-rouge">br-vlan</code> and <code class="language-plaintext highlighter-rouge">br-flat</code>. I am taking the opportunity to rename <code class="language-plaintext highlighter-rouge">br-vlan</code> to <code class="language-plaintext highlighter-rouge">br-ex</code> to better match upstream documentation. Also, <code class="language-plaintext highlighter-rouge">host_bind_override</code> is really no good in an OVS-based deployment; we should use <code class="language-plaintext highlighter-rouge">network_interface</code> instead.</p>
<h4 id="before-1">BEFORE</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> - network:
container_bridge: "br-vxlan"
container_type: "veth"
container_interface: "eth10"
ip_from_q: "tunnel"
type: "vxlan"
range: "1:1000"
net_name: "vxlan"
group_binds:
- neutron_linuxbridge_agent
- network:
container_bridge: "br-vlan"
container_type: "veth"
container_interface: "eth11"
type: "vlan"
range: "1:1"
net_name: "vlan"
group_binds:
- neutron_linuxbridge_agent
- network:
container_bridge: "br-flat"
container_type: "veth"
container_interface: "eth12"
host_bind_override: "veth2"
type: "flat"
net_name: "flat"
group_binds:
- neutron_linuxbridge_agent
- utility_all
</code></pre></div></div>
<h4 id="after-1">AFTER</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> - network:
container_bridge: "br-vxlan"
container_type: "veth"
container_interface: "eth10"
ip_from_q: "tunnel"
type: "geneve"
range: "1:1000"
net_name: "geneve"
group_binds:
- neutron_ovn_controller
- network:
container_bridge: "br-ex"
container_type: "veth"
container_interface: "eth11"
type: "vlan"
range: "1:1"
net_name: "vlan"
group_binds:
- neutron_ovn_controller
- network:
container_bridge: "br-flat"
container_type: "veth"
container_interface: "eth12"
network_interface: "veth2"
type: "flat"
net_name: "flat"
group_binds:
- neutron_ovn_controller
- utility_all
</code></pre></div></div>
<p>Once those changes are made, take note of all running VMs and stop them:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# openstack server list --all | grep ACTIVE
| b3a33fb1-98dc-4cf9-99c3-53d5352310e5 | vm-db2 | ACTIVE | db=192.168.55.223 | cirros-0.5.2 | 1-1-1 |
| e2ea6e2a-aa47-4f44-b285-1b727ad4f709 | vm-db1 | ACTIVE | db=192.168.100.215, 192.168.55.21 | cirros-0.5.2 | 1-1-1 |
| 916052d7-a5f7-4e4a-87a0-7249eef45801 | vm-app2 | ACTIVE | app=172.25.0.250 | cirros-0.5.2 | 1-1-1 |
| dd3046a3-128a-4585-8ffd-54c11b516052 | vm-app1 | ACTIVE | app=172.25.0.50 | cirros-0.5.2 | 1-1-1 |
| 7e1af764-a034-4ef2-9695-ca19838812e5 | vm-web1 | ACTIVE | web=10.5.0.121, 192.168.100.90 | cirros-0.5.2 | 1-1-1 |
| dbb98201-52fc-420d-bf6e-5a40fad74327 | vm-web2 | ACTIVE | web=10.5.0.162 | cirros-0.5.2 | 1-1-1 |
root@infra1:~# for i in $(openstack server list --all | grep ACTIVE | awk {'print $2'}); do openstack server stop $i; done
</code></pre></div></div>
<h2 id="lift-off">Lift Off</h2>
<p>Now that everything is staged, it’s time to kick off the changes.</p>
<p><strong>STOP</strong> and <strong>DISABLE</strong> existing Neutron agents on network and compute hosts:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd /opt/openstack-ansible/playbooks
ansible network_hosts,compute_hosts -m shell -a 'systemctl stop neutron-linuxbridge-agent'
ansible network_hosts,compute_hosts -m shell -a 'systemctl stop neutron-l3-agent'
ansible network_hosts,compute_hosts -m shell -a 'systemctl stop neutron-dhcp-agent'
ansible network_hosts,compute_hosts -m shell -a 'systemctl stop neutron-metadata-agent'
ansible network_hosts,compute_hosts -m shell -a 'systemctl disable neutron-linuxbridge-agent'
ansible network_hosts,compute_hosts -m shell -a 'systemctl disable neutron-l3-agent'
ansible network_hosts,compute_hosts -m shell -a 'systemctl disable neutron-dhcp-agent'
ansible network_hosts,compute_hosts -m shell -a 'systemctl disable neutron-metadata-agent'
</code></pre></div></div>
<p>Delete the Neutron-managed network namespaces (qdhcp,qrouter) from controller and compute hosts (repeat as necessary):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh infra1;
for i in $(ip netns | grep 'qdhcp\|qrouter' | awk {'print $1'}); do ip netns delete $i; done;
exit
</code></pre></div></div>
<p>Delete all ‘brq’ bridges and ‘tap’ interfaces from controller and compute hosts (repeat as necessary):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh infra1;
for i in $(ip -br link show | grep brq | awk {'print $1'}); do ip link delete $i; done
for i in $(ip -br link show | grep tap | awk {'print $1'} | sed 's/@.*//'); do ip link delete $i; done
exit;
</code></pre></div></div>
<p>Run the playbooks:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd /opt/openstack-ansible/playbooks
openstack-ansible os-nova-install.yml
openstack-ansible os-neutron-install.yml
</code></pre></div></div>
<h2 id="turbulance">Turbulance</h2>
<p>After the playbooks have executed, you should expect to have Open vSwitch installed where needed and, if configured correctly, you may even have the physical interfaces connected (via <code class="language-plaintext highlighter-rouge">network_interface</code>).</p>
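<p>A quick way to sanity-check the result on a compute node (a hedged example; bridge names and mappings will vary with your configuration) is to inspect Open vSwitch directly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Verify the bridges and ports that Open vSwitch knows about
ovs-vsctl show

# Verify the provider bridge mappings handed to ovn-controller
ovs-vsctl get Open_vSwitch . external_ids:ovn-bridge-mappings
</code></pre></div></div>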
<p>Check the agent list – the L3, DHCP, and LXB agents should be down and can be deleted. Metering is TBD:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# openstack network agent list
+--------------------------------------+------------------------------+----------+-------------------+-------+-------+----------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+------------------------------+----------+-------------------+-------+-------+----------------------------+
| 002dd54a-7637-4989-b217-10cf79d6b7f2 | L3 agent | infra3 | nova | XXX | UP | neutron-l3-agent |
| 06ba670f-d560-43aa-b1e7-be60d5914551 | Metering agent | infra2 | None | :-) | UP | neutron-metering-agent |
| 1d3bf53f-dcc5-453c-893e-05b053dda55f | Metering agent | infra3 | None | :-) | UP | neutron-metering-agent |
| 4353a993-d512-4d42-a554-eccdd4ceeaf8 | Metadata agent | infra2 | None | XXX | UP | neutron-metadata-agent |
| 491f1be3-407d-48dd-b8ed-364f2b90c6cb | DHCP agent | infra1 | nova | XXX | UP | neutron-dhcp-agent |
| 55a04c8e-f54e-4e32-81e7-b12c1e2e1c3f | Metadata agent | infra1 | None | XXX | UP | neutron-metadata-agent |
| 581954d1-25c8-4b63-a82e-5792250f8b58 | L3 agent | infra2 | nova | XXX | UP | neutron-l3-agent |
| 70d614fa-d3f4-4be9-8fb7-a37eb76d5e38 | DHCP agent | infra2 | nova | XXX | UP | neutron-dhcp-agent |
| 76a84bb3-84b1-4ba9-b2bb-00f4d2776b6d | Linux bridge agent | infra2 | None | XXX | UP | neutron-linuxbridge-agent |
| 8153be7b-82d5-4077-b5df-b7e414756220 | Linux bridge agent | compute2 | None | XXX | UP | neutron-linuxbridge-agent |
| 9ed0c376-44c1-4150-b7ab-23138cee7430 | Linux bridge agent | infra3 | None | XXX | UP | neutron-linuxbridge-agent |
| a7d803df-191e-413c-bafc-23049c7732e0 | Linux bridge agent | compute1 | None | XXX | UP | neutron-linuxbridge-agent |
| bb9986d4-5d44-4040-8490-a1a5af1feb33 | Metadata agent | infra3 | None | XXX | UP | neutron-metadata-agent |
| d04290ee-1215-42b2-af34-3ce84eada471 | Metering agent | infra1 | None | :-) | UP | neutron-metering-agent |
| d1bd340d-8da8-4915-8eba-f7078d08e9ed | Linux bridge agent | infra1 | None | XXX | UP | neutron-linuxbridge-agent |
| eb868890-7b6d-41e3-8fbd-54730963bca7 | DHCP agent | infra3 | nova | XXX | UP | neutron-dhcp-agent |
| f13952a5-ba28-4767-ad3c-b72fe6c0db6a | L3 agent | infra1 | nova | XXX | UP | neutron-l3-agent |
| fc536b52-a35c-4523-885d-0708759445e0 | OVN Controller Gateway agent | compute1 | | :-) | UP | ovn-controller |
| d60d8a20-d977-4352-a886-c7b5ef477446 | OVN Controller Gateway agent | compute2 | | :-) | UP | ovn-controller |
| c3c7ff97-998c-5adb-ac2a-75c930724959 | OVN Metadata agent | compute2 | | :-) | UP | neutron-ovn-metadata-agent |
| f20d28dd-83c7-5589-8f9f-37a4f974996d | OVN Metadata agent | compute1 | | :-) | UP | neutron-ovn-metadata-agent |
+--------------------------------------+------------------------------+----------+-------------------+-------+-------+----------------------------+
</code></pre></div></div>
<p><strong>DELETE</strong> the now-stale agents:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# for i in $(openstack network agent list | grep XXX | awk {'print $2'}); do openstack network agent delete $i; done
root@infra1:~# openstack network agent list
+--------------------------------------+------------------------------+----------+-------------------+-------+-------+----------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+------------------------------+----------+-------------------+-------+-------+----------------------------+
| 06ba670f-d560-43aa-b1e7-be60d5914551 | Metering agent | infra2 | None | :-) | UP | neutron-metering-agent |
| 1d3bf53f-dcc5-453c-893e-05b053dda55f | Metering agent | infra3 | None | :-) | UP | neutron-metering-agent |
| d04290ee-1215-42b2-af34-3ce84eada471 | Metering agent | infra1 | None | :-) | UP | neutron-metering-agent |
| fc536b52-a35c-4523-885d-0708759445e0 | OVN Controller Gateway agent | compute1 | | :-) | UP | ovn-controller |
| d60d8a20-d977-4352-a886-c7b5ef477446 | OVN Controller Gateway agent | compute2 | | :-) | UP | ovn-controller |
| c3c7ff97-998c-5adb-ac2a-75c930724959 | OVN Metadata agent | compute2 | | :-) | UP | neutron-ovn-metadata-agent |
| f20d28dd-83c7-5589-8f9f-37a4f974996d | OVN Metadata agent | compute1 | | :-) | UP | neutron-ovn-metadata-agent |
+--------------------------------------+------------------------------+----------+-------------------+-------+-------+----------------------------+
</code></pre></div></div>
<p>Check the OVN DBs using the local server IP - the northbound database is likely empty, while the southbound database should be populated:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# ovn-nbctl --db=tcp:10.0.236.100:6641 show
root@infra1:~# ovn-sbctl --db=tcp:10.0.236.100:6642 show
Chassis "d60d8a20-d977-4352-a886-c7b5ef477446"
hostname: compute2
Encap vxlan
ip: "10.0.240.121"
options: {csum="true"}
Encap geneve
ip: "10.0.240.121"
options: {csum="true"}
Chassis "fc536b52-a35c-4523-885d-0708759445e0"
hostname: compute1
Encap vxlan
ip: "10.0.240.120"
options: {csum="true"}
Encap geneve
ip: "10.0.240.120"
options: {csum="true"}
</code></pre></div></div>
<p>An empty northbound database is the result of a lack of sync between OVN and Neutron, and can be resolved by running the <code class="language-plaintext highlighter-rouge">neutron-ovn-db-sync-util</code> command in <code class="language-plaintext highlighter-rouge">repair</code> mode:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/openstack/venvs/neutron-23.4.1.dev3/bin/neutron-ovn-db-sync-util \
--config-file /etc/neutron/neutron.conf \
--config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
--ovn-neutron_sync_mode repair
</code></pre></div></div>
<h4 id="example">EXAMPLE</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Example:
root@infra1:~# /openstack/venvs/neutron-23.4.1.dev3/bin/neutron-ovn-db-sync-util \
> --config-file /etc/neutron/neutron.conf \
> --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
> --ovn-neutron_sync_mode repair
/openstack/venvs/neutron-23.4.1.dev3/lib/python3.8/site-packages/sqlalchemy/orm/relationships.py:1994: SAWarning: Setting backref / back_populates on relationship QosNetworkPolicyBinding.port to refer to viewonly relationship Port.qos_network_policy_binding should include sync_backref=False set on the QosNetworkPolicyBinding.port relationship. (this warning may be suppressed after 10 occurrences)
util.warn_limited(
/openstack/venvs/neutron-23.4.1.dev3/lib/python3.8/site-packages/sqlalchemy/orm/relationships.py:1994: SAWarning: Setting backref / back_populates on relationship Tag.standard_attr to refer to viewonly relationship StandardAttribute.tags should include sync_backref=False set on the Tag.standard_attr relationship. (this warning may be suppressed after 10 occurrences)
util.warn_limited(
root@infra1:~# echo $?
0
</code></pre></div></div>
<p>A successful run should result in logical switch, ports, floating IPs, etc. being populated in the northbound DB:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# ovn-nbctl --db=tcp:10.0.236.100:6641 show
switch 44717724-6a70-4ee7-b0ab-143bdcb12c79 (neutron-12a0ab09-d130-4e69-9aa2-c28c66509b02) (aka db)
port f3e92114-f005-4029-81e3-65f1d60e8862
addresses: ["fa:16:3e:8c:1b:8f 192.168.55.2", "unknown"]
port 196d3e0e-295d-4317-a04c-e9d950160e61
addresses: ["fa:16:3e:b6:b1:7f 192.168.55.223"]
port c762d350-417c-4a3b-b40c-d595dafcc368
type: localport
addresses: ["fa:16:3e:e4:21:7c 192.168.55.5"]
port b0c5e704-fe40-4538-9232-94a091b7adb7
addresses: ["fa:16:3e:6c:6d:57 192.168.55.21"]
port 60c8459c-be90-44b0-8007-56cfa995da4f
addresses: ["fa:16:3e:19:0c:14 192.168.55.4", "unknown"]
port 76dd53d9-7eac-4bc6-92f9-48cf975235b5
type: router
router-port: lrp-76dd53d9-7eac-4bc6-92f9-48cf975235b5
port 2837c67c-c7c0-44dd-be82-1192226cb7b8
addresses: ["fa:16:3e:dc:07:6f 192.168.55.3", "unknown"]
switch 38efee60-da0c-4ed8-ad21-e76ce12a4cb3 (neutron-282e63e3-5120-4396-a63d-0186e5e96466) (aka app)
port c8caf3c7-ac86-4cb5-85eb-12e88f3713eb
addresses: ["fa:16:3e:73:70:29 172.25.0.50"]
port ae7b0df8-4343-448b-af68-5f3afd78e869
type: localport
addresses: ["fa:16:3e:56:b9:6a 172.25.0.5"]
port a448543b-fe5c-4aaf-aef4-cdcd6421e84b
addresses: ["fa:16:3e:56:b7:dc 172.25.0.2", "unknown"]
port 4626c5a9-f578-4849-8fb3-93700f3ddb06
addresses: ["fa:16:3e:60:8c:2d 172.25.0.250"]
port cf2644b6-abc4-42e7-bbdb-0e204f261446
addresses: ["fa:16:3e:aa:42:c8 172.25.0.3", "unknown"]
port 8dcc107d-272f-4ced-b601-3090171ce01c
addresses: ["fa:16:3e:51:6c:b8 172.25.0.4", "unknown"]
port 9b4eb252-e69e-43e1-8585-8b79c986d07c
type: router
router-port: lrp-9b4eb252-e69e-43e1-8585-8b79c986d07c
switch f56b65e0-8638-4e0d-baeb-2194ec8dacac (neutron-3fb2d48e-8c71-4bca-92ce-f64a4c932338) (aka vlan200)
port 07e29662-6e52-4ccb-b2de-361a888e633c
addresses: ["fa:16:3e:c7:9b:37 192.168.200.4", "unknown"]
port c2bbb3e0-2c65-45d7-b8f6-573e665dfc6e
addresses: ["fa:16:3e:e4:69:4f 192.168.200.2", "unknown"]
port d493513f-8b81-4e01-9dfd-89b43f2fa3f5
addresses: ["fa:16:3e:da:44:d6 192.168.200.3", "unknown"]
port c3d08e61-41a9-4495-83f1-6720cf798c75
type: localport
addresses: ["fa:16:3e:05:61:e9 192.168.200.5"]
port provnet-ccfefb4d-0da0-4138-ac74-be1934eca9d7
type: localnet
tag: 200
addresses: ["unknown"]
switch 08c14ec5-c809-467d-92e4-a9dc5092217e (neutron-dddfdce8-a8fd-4802-a01c-261b92043488) (aka vlan100)
port ee9592f0-d028-4941-ad02-77385cd371aa
type: router
router-port: lrp-ee9592f0-d028-4941-ad02-77385cd371aa
port d4478a35-0406-46b9-bab9-17df99e1e44c
addresses: ["fa:16:3e:a4:71:eb 192.168.100.2", "unknown"]
port 7484fd6e-c82c-4679-aa5b-a7f7b6ef5f9a
type: localport
addresses: ["fa:16:3e:57:3c:8f 192.168.100.5"]
port 555e54dd-4edc-4286-84d1-d639cc7fb143
addresses: ["fa:16:3e:6b:80:b5 192.168.100.4", "unknown"]
port provnet-fc4d896e-9eb8-4a73-a363-223a5dc81ec5
type: localnet
tag: 100
addresses: ["unknown"]
port 10062270-348a-473a-8ed0-f551cfacfce5
type: router
router-port: lrp-10062270-348a-473a-8ed0-f551cfacfce5
port ae75e43e-c4e6-4972-8917-59f5779b3d5c
addresses: ["fa:16:3e:07:86:b5 192.168.100.3", "unknown"]
switch 74bd93f5-0434-4c64-8b65-1a44d4370bef (neutron-9e151884-67a5-4905-b157-f08f1b3b0040) (aka HA network tenant d1ae5313d10c411fa772e8fa697a6aeb)
port 15a2cb60-d85f-4ae2-b867-4621c4e66b72 (aka HA port tenant d1ae5313d10c411fa772e8fa697a6aeb)
type: router
router-port: lrp-15a2cb60-d85f-4ae2-b867-4621c4e66b72
port cdb9739f-9b11-453e-b1c2-3bfbb8bad187 (aka HA port tenant d1ae5313d10c411fa772e8fa697a6aeb)
type: router
router-port: lrp-cdb9739f-9b11-453e-b1c2-3bfbb8bad187
port 9634c76e-e309-40cf-b701-1bcf38b4bde4
type: localport
addresses: ["fa:16:3e:2e:78:c3"]
port 106ede6e-1f6f-4c17-a478-9e58045da88b (aka HA port tenant d1ae5313d10c411fa772e8fa697a6aeb)
type: router
router-port: lrp-106ede6e-1f6f-4c17-a478-9e58045da88b
port e1918430-42ca-40dc-aa59-5c54934e121c (aka HA port tenant d1ae5313d10c411fa772e8fa697a6aeb)
type: router
router-port: lrp-e1918430-42ca-40dc-aa59-5c54934e121c
port 850149e1-8f9c-4c64-8b47-90df032a8d65 (aka HA port tenant d1ae5313d10c411fa772e8fa697a6aeb)
type: router
router-port: lrp-850149e1-8f9c-4c64-8b47-90df032a8d65
port 3670eea6-7adf-42d9-b524-cd438cf51a09 (aka HA port tenant d1ae5313d10c411fa772e8fa697a6aeb)
type: router
router-port: lrp-3670eea6-7adf-42d9-b524-cd438cf51a09
switch de0dac56-f19e-45b9-b7fb-ee12ccb2fea4 (neutron-ab3f0f85-a509-406a-8dca-5db13fbcb48b) (aka web)
port 09380ac6-cfcf-4969-b704-4f0de6433f89
type: router
router-port: lrp-09380ac6-cfcf-4969-b704-4f0de6433f89
port 45f94f9a-2a3d-489e-9b64-23002a1d495c
addresses: ["fa:16:3e:b6:02:c7 10.5.0.3", "unknown"]
port 37511b88-ee9a-4be2-bb46-fff22d01d5af
addresses: ["fa:16:3e:6d:3a:91 10.5.0.162"]
port 53a7a6c2-d350-4e98-91e3-9c3df6ebc3e2
addresses: ["fa:16:3e:a2:cd:b8 10.5.0.4", "unknown"]
port cf4dc1f3-0774-496a-87e1-b9954cb90320
type: localport
addresses: ["fa:16:3e:31:2b:9b 10.5.0.5"]
port 637e2a67-c198-4a1d-b836-55757227eb39
addresses: ["fa:16:3e:b5:d2:44 10.5.0.121"]
port 7a94ca01-8e5a-4248-b56d-8343e3a15fe8
addresses: ["fa:16:3e:10:c9:68 10.5.0.2", "unknown"]
router 5e9befda-2983-44a9-ab68-195858ca89f8 (neutron-d5052734-53e8-4a58-9fbd-2b76ec138af6) (aka rtr-db)
port lrp-cdb9739f-9b11-453e-b1c2-3bfbb8bad187
mac: "fa:16:3e:82:19:af"
networks: ["169.254.194.128/18"]
port lrp-850149e1-8f9c-4c64-8b47-90df032a8d65
mac: "fa:16:3e:c9:6c:7d"
networks: ["169.254.193.130/18"]
port lrp-15a2cb60-d85f-4ae2-b867-4621c4e66b72
mac: "fa:16:3e:8f:03:97"
networks: ["169.254.195.150/18"]
port lrp-76dd53d9-7eac-4bc6-92f9-48cf975235b5
mac: "fa:16:3e:f0:d2:09"
networks: ["192.168.55.1/24"]
port lrp-10062270-348a-473a-8ed0-f551cfacfce5
mac: "fa:16:3e:64:41:cb"
networks: ["192.168.100.235/24"]
gateway chassis: [d60d8a20-d977-4352-a886-c7b5ef477446 fc536b52-a35c-4523-885d-0708759445e0]
nat 6c32224c-1d97-44a4-abb8-184bea546880
external ip: "192.168.100.235"
logical ip: "169.254.192.0/18"
type: "snat"
nat 917d1234-ebb7-4578-a8c0-5355302e5aab
external ip: "192.168.100.215"
logical ip: "192.168.55.21"
type: "dnat_and_snat"
nat df8a97a6-0954-4ccc-a742-26e32c493974
external ip: "192.168.100.235"
logical ip: "192.168.55.0/24"
type: "snat"
nat e0d236dc-63e0-4c51-9451-588a3ac5c051
external ip: "192.168.100.235"
logical ip: "169.254.192.0/18"
type: "snat"
nat f0ee4069-7d38-47f2-ad48-9d6b393aa773
external ip: "192.168.100.235"
logical ip: "169.254.192.0/18"
type: "snat"
router dfde8f9a-c539-48ec-82ce-a3fda47c7a86 (neutron-cee5e805-ecf9-456b-87be-d60f155c8fd8) (aka rtr-web)
port lrp-e1918430-42ca-40dc-aa59-5c54934e121c
mac: "fa:16:3e:0a:cc:de"
networks: ["169.254.193.111/18"]
port lrp-ee9592f0-d028-4941-ad02-77385cd371aa
mac: "fa:16:3e:60:72:8a"
networks: ["192.168.100.202/24"]
gateway chassis: [fc536b52-a35c-4523-885d-0708759445e0 d60d8a20-d977-4352-a886-c7b5ef477446]
port lrp-09380ac6-cfcf-4969-b704-4f0de6433f89
mac: "fa:16:3e:90:36:8f"
networks: ["10.5.0.1/24"]
port lrp-3670eea6-7adf-42d9-b524-cd438cf51a09
mac: "fa:16:3e:3a:fe:53"
networks: ["169.254.195.230/18"]
port lrp-106ede6e-1f6f-4c17-a478-9e58045da88b
mac: "fa:16:3e:be:b5:54"
networks: ["169.254.194.238/18"]
port lrp-9b4eb252-e69e-43e1-8585-8b79c986d07c
mac: "fa:16:3e:70:e3:46"
networks: ["172.25.0.1/24"]
nat 02530272-13cf-4690-aad9-4160943b7418
external ip: "192.168.100.202"
logical ip: "172.25.0.0/24"
type: "snat"
nat 4149ca39-b9f2-4e35-8be7-9c0d3a343bf7
external ip: "192.168.100.202"
logical ip: "10.5.0.0/24"
type: "snat"
nat bc084986-d1a6-4877-810a-ca39fc01d064
external ip: "192.168.100.202"
logical ip: "169.254.192.0/18"
type: "snat"
nat d83383c4-91f4-47ca-9fac-9d1cfb56ed01
external ip: "192.168.100.202"
logical ip: "169.254.192.0/18"
type: "snat"
nat dddaa10e-6a24-4133-8efa-612517247c88
external ip: "192.168.100.202"
logical ip: "169.254.192.0/18"
type: "snat"
nat ef26f8a2-b61b-431f-a288-3fa6c6b6488c
external ip: "192.168.100.90"
logical ip: "10.5.0.121"
type: "dnat_and_snat"
</code></pre></div></div>
<h2 id="approach">Approach</h2>
<p>One of the last steps of this process is one of the trickiest: Neutron ports must be updated to reflect a <code class="language-plaintext highlighter-rouge">vif_type</code> of <code class="language-plaintext highlighter-rouge">ovs</code> rather than <code class="language-plaintext highlighter-rouge">bridge</code>. Unfortunately, this is not an API-driven change but one that must be done within the database itself.</p>
<p>The following command can be used:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>use neutron;
update ml2_port_bindings set vif_type='ovs' where vif_type='bridge';
</code></pre></div></div>
<h4 id="example-1">EXAMPLE</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MariaDB [neutron]> select * from ml2_port_bindings where vif_type='bridge';
+--------------------------------------+----------+----------+-----------+---------+---------------------------------------------+--------+
| port_id | host | vif_type | vnic_type | profile | vif_details | status |
+--------------------------------------+----------+----------+-----------+---------+---------------------------------------------+--------+
| 07e29662-6e52-4ccb-b2de-361a888e633c | infra3 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 09380ac6-cfcf-4969-b704-4f0de6433f89 | infra3 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 106ede6e-1f6f-4c17-a478-9e58045da88b | infra2 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 15a2cb60-d85f-4ae2-b867-4621c4e66b72 | infra3 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 196d3e0e-295d-4317-a04c-e9d950160e61 | compute2 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 2837c67c-c7c0-44dd-be82-1192226cb7b8 | infra2 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 3670eea6-7adf-42d9-b524-cd438cf51a09 | infra3 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 37511b88-ee9a-4be2-bb46-fff22d01d5af | compute1 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 45f94f9a-2a3d-489e-9b64-23002a1d495c | infra1 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 4626c5a9-f578-4849-8fb3-93700f3ddb06 | compute2 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 53a7a6c2-d350-4e98-91e3-9c3df6ebc3e2 | infra3 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 555e54dd-4edc-4286-84d1-d639cc7fb143 | infra3 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 60c8459c-be90-44b0-8007-56cfa995da4f | infra1 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 637e2a67-c198-4a1d-b836-55757227eb39 | compute2 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 76dd53d9-7eac-4bc6-92f9-48cf975235b5 | infra3 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 7a94ca01-8e5a-4248-b56d-8343e3a15fe8 | infra2 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 850149e1-8f9c-4c64-8b47-90df032a8d65 | infra2 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 8dcc107d-272f-4ced-b601-3090171ce01c | infra1 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| 9b4eb252-e69e-43e1-8585-8b79c986d07c | infra3 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| a448543b-fe5c-4aaf-aef4-cdcd6421e84b | infra3 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| ae75e43e-c4e6-4972-8917-59f5779b3d5c | infra2 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| b0c5e704-fe40-4538-9232-94a091b7adb7 | compute1 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| c2bbb3e0-2c65-45d7-b8f6-573e665dfc6e | infra1 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| c8caf3c7-ac86-4cb5-85eb-12e88f3713eb | compute1 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| cdb9739f-9b11-453e-b1c2-3bfbb8bad187 | infra1 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| cf2644b6-abc4-42e7-bbdb-0e204f261446 | infra2 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| d4478a35-0406-46b9-bab9-17df99e1e44c | infra1 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| d493513f-8b81-4e01-9dfd-89b43f2fa3f5 | infra2 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| e1918430-42ca-40dc-aa59-5c54934e121c | infra1 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
| f3e92114-f005-4029-81e3-65f1d60e8862 | infra3 | bridge | normal | | {"connectivity": "l2", "port_filter": true} | ACTIVE |
+--------------------------------------+----------+----------+-----------+---------+---------------------------------------------+--------+
MariaDB [neutron]> update ml2_port_bindings set vif_type='ovs' where vif_type='bridge';
Query OK, 30 rows affected (0.026 sec)
Rows matched: 30 Changed: 30 Warnings: 0
</code></pre></div></div>
<p>Also, according to the OVN <a href="https://www.ovn.org/support/dist-docs/ovn-controller.8.html">manpage</a>, VXLAN networks are only supported for gateway nodes and not traffic between hypervisors:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>external_ids:ovn-encap-type
The encapsulation type that a chassis should use to con‐
nect to this node. Multiple encapsulation types may be
specified with a comma-separated list. Each listed encap‐
sulation type will be paired with ovn-encap-ip.
Supported tunnel types for connecting hypervisors are
geneve and stt. Gateways may use geneve, vxlan, or stt.
</code></pre></div></div>
<p>So, the DB can be munged to convert those VXLAN segments to Geneve, too:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MariaDB [neutron]> select * from networksegments;
+--------------------------------------+--------------------------------------+--------------+------------------+-----------------+------------+---------------+------------------+------+
| id | network_id | network_type | physical_network | segmentation_id | is_dynamic | segment_index | standard_attr_id | name |
+--------------------------------------+--------------------------------------+--------------+------------------+-----------------+------------+---------------+------------------+------+
| 84334795-0a9c-46dc-bb45-abd858a787ae | 282e63e3-5120-4396-a63d-0186e5e96466 | vxlan | NULL | 942 | 0 | 0 | 27 | NULL |
| 8c866571-b041-426c-9ddd-5f126fd694e3 | ab3f0f85-a509-406a-8dca-5db13fbcb48b | vxlan | NULL | 230 | 0 | 0 | 21 | NULL |
| 949f5d71-fd93-45e3-9895-ad2415541e89 | 9e151884-67a5-4905-b157-f08f1b3b0040 | vxlan | NULL | 10 | 0 | 0 | 144 | NULL |
| b4f4131d-f988-49ef-9440-361d747af8eb | 12a0ab09-d130-4e69-9aa2-c28c66509b02 | vxlan | NULL | 665 | 0 | 0 | 33 | NULL |
| ccfefb4d-0da0-4138-ac74-be1934eca9d7 | 3fb2d48e-8c71-4bca-92ce-f64a4c932338 | vlan | vlan | 200 | 0 | 0 | 126 | NULL |
| fc4d896e-9eb8-4a73-a363-223a5dc81ec5 | dddfdce8-a8fd-4802-a01c-261b92043488 | vlan | vlan | 100 | 0 | 0 | 120 | NULL |
+--------------------------------------+--------------------------------------+--------------+------------------+-----------------+------------+---------------+------------------+------+
MariaDB [neutron]> update networksegments set network_type='geneve' where network_type='vxlan';
Query OK, 4 rows affected (0.008 sec)
Rows matched: 4 Changed: 4 Warnings: 0
</code></pre></div></div>
<h2 id="soft-landing">Soft landing</h2>
<p>At this point, all of the tough changes have been made and it’s time to try out our new toy.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# openstack server start vm-web1
get() takes 1 positional argument but 2 were given
</code></pre></div></div>
<p>Uh oh.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# openstack server start vm-web1
</code></pre></div></div>
<p>That’s better.</p>
<p>Checking the console of the VM demonstrates proper DHCP and Metadata connectivity:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# openstack console log show vm-web1
...
Starting network: udhcpc: started, v1.29.3
udhcpc: sending discover
udhcpc: sending select for 10.5.0.121
udhcpc: lease of 10.5.0.121 obtained, lease time 43200
...
checking http://169.254.169.254/2009-04-04/instance-id
successful after 1/20 tries: up 3.04. iid=i-00000009
...
</code></pre></div></div>
<p>Let’s try spinning up the others:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# openstack server start vm-web2
root@infra1:~# openstack server start vm-app1
get() takes 1 positional argument but 2 were given
root@infra1:~# openstack server start vm-app1
root@infra1:~# openstack server start vm-app2
get() takes 1 positional argument but 2 were given
root@infra1:~# openstack server start vm-app2
root@infra1:~# openstack server start vm-db1
get() takes 1 positional argument but 2 were given
root@infra1:~# openstack server start vm-db1
root@infra1:~# openstack server start vm-db2
Networking client is experiencing an unauthorized exception. (HTTP 400) (Request-ID: req-3a3348c9-39fb-49f2-84b2-f2b3c3e9e466)
root@infra1:~# openstack server start vm-db2
</code></pre></div></div>
<p>It looks like most of the VMs complain on their first start attempt but come up fine on retry, which could be related to stale cache or something else that gets resolved automatically. Without looking at the API logs, it’s hard to say.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@infra1:~# openstack server list
+--------------------------------------+---------+--------+-----------------------------------+--------------+--------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+---------+--------+-----------------------------------+--------------+--------+
| b3a33fb1-98dc-4cf9-99c3-53d5352310e5 | vm-db2 | ACTIVE | db=192.168.55.223 | cirros-0.5.2 | 1-1-1 |
| e2ea6e2a-aa47-4f44-b285-1b727ad4f709 | vm-db1 | ACTIVE | db=192.168.100.215, 192.168.55.21 | cirros-0.5.2 | 1-1-1 |
| 916052d7-a5f7-4e4a-87a0-7249eef45801 | vm-app2 | ACTIVE | app=172.25.0.250 | cirros-0.5.2 | 1-1-1 |
| dd3046a3-128a-4585-8ffd-54c11b516052 | vm-app1 | ACTIVE | app=172.25.0.50 | cirros-0.5.2 | 1-1-1 |
| 7e1af764-a034-4ef2-9695-ca19838812e5 | vm-web1 | ACTIVE | web=10.5.0.121, 192.168.100.90 | cirros-0.5.2 | 1-1-1 |
| dbb98201-52fc-420d-bf6e-5a40fad74327 | vm-web2 | ACTIVE | web=10.5.0.162 | cirros-0.5.2 | 1-1-1 |
+--------------------------------------+---------+--------+-----------------------------------+--------------+--------+
</code></pre></div></div>
<h2 id="inspection">Inspection</h2>
<p>The moment of truth is here, but performing checks from DHCP namespaces that no longer exist will be tricky. Fortunately, an <code class="language-plaintext highlighter-rouge">ovnmeta</code> namespace exists on each node, connected to its respective network. Unfortunately, the namespace is only connected to the <em>local</em> bridge and cannot communicate across hosts.</p>
<p>The following example demonstrates connectivity from the <code class="language-plaintext highlighter-rouge">ovnmeta</code> namespace to vm-db1, and from within vm-db1 to vm-db2 (across hosts):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@compute1:~# ip netns exec ovnmeta-12a0ab09-d130-4e69-9aa2-c28c66509b02 ssh cirros@192.168.55.21
cirros@192.168.55.21's password:
$ hostname
vm-db1
$ ping 192.168.55.223 -c2
PING 192.168.55.223 (192.168.55.223): 56 data bytes
64 bytes from 192.168.55.223: seq=0 ttl=64 time=4.595 ms
64 bytes from 192.168.55.223: seq=1 ttl=64 time=2.770 ms
$ ssh cirros@192.168.55.223
Host '192.168.55.223' is not in the trusted hosts file.
(ecdsa-sha2-nistp256 fingerprint sha1!! ae:44:c1:5c:da:13:06:05:56:22:76:0d:0c:82:1e:84:bf:e8:2d:9c)
Do you want to continue connecting? (y/n) y
cirros@192.168.55.223's password:
$ hostname
vm-db2
</code></pre></div></div>
<p>Here we ping from vm-web2 to vm-app1 and vm-app2:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ hostname
vm-web2
$ ping 172.25.0.50 -c2
PING 172.25.0.50 (172.25.0.50): 56 data bytes
64 bytes from 172.25.0.50: seq=0 ttl=63 time=1.578 ms
64 bytes from 172.25.0.50: seq=1 ttl=63 time=1.125 ms
--- 172.25.0.50 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 1.125/1.351/1.578 ms
$ ping 172.25.0.250 -c2
PING 172.25.0.250 (172.25.0.250): 56 data bytes
64 bytes from 172.25.0.250: seq=0 ttl=63 time=5.781 ms
64 bytes from 172.25.0.250: seq=1 ttl=63 time=3.353 ms
--- 172.25.0.250 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 3.353/4.567/5.781 ms
</code></pre></div></div>
<p>Lastly, we can see that floating IP traffic from vm-web1 to vm-db1 works as well:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ hostname
vm-web1
$ ping 192.168.100.215 -c2
PING 192.168.100.215 (192.168.100.215): 56 data bytes
64 bytes from 192.168.100.215: seq=0 ttl=62 time=14.153 ms
64 bytes from 192.168.100.215: seq=1 ttl=62 time=4.775 ms
--- 192.168.100.215 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 4.775/9.464/14.153 ms
</code></pre></div></div>
<h2 id="summary">Summary</h2>
<p>Being able to perform in-place migrations and upgrades is important, especially when the resources don’t exist to perform a “lift-n-shift” type of migration. When looking to perform an in-place migration, my suggestion is to always <strong><em>TEST TEST TEST</em></strong> in a similarly-configured lab environment to work out all kinks and potential unknowns. Make configuration and database backups, and be prepared to lose instances in a worst-case scenario.</p>
<hr />
<p>If you have some thoughts or comments on this post, I’d love to hear ‘em. Feel free to reach out on Twitter at @jimmdenton or hit me up on LinkedIn.</p>jamesdentonMigrating from one Neutron mechanism driver to another, especially in a production environment, is not a decision one takes on without giving much thought. In many cases, the process involves migrating to a “Greenfield” environment, or a new environment that is stood up running the same or similar operating system and cloud service software but configured in a new way, then migrating entire workloads in a weekend (or more). To say this process is tedious is an understatement.[OVN] ‘Chassis_Private’ object has no attribute ‘hostname’2022-03-25T00:00:00+00:002022-03-25T00:00:00+00:00http://www.jimmdenton.com/neutron-ovn-private-chassis<p>On more than one occasion I have turned to this blog to fix issues that reoccur weeks/months/years after the initial post is born, and this post will serve as one of those reference points in the future, I’m sure. In my OpenStack-Ansible Xena lab running OVN, I’ve twice now come across the following error when performing a <code class="language-plaintext highlighter-rouge">openstack network agent list</code> command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'Chassis_Private' object has no attribute 'hostname'
</code></pre></div></div>
<p>What does that even mean?!
<!--more--></p>
<p>What <code class="language-plaintext highlighter-rouge">chassis_private</code> is referring to is a table in the OVN Southbound database. Not to be confused with the <code class="language-plaintext highlighter-rouge">chassis</code> table, a row in the <code class="language-plaintext highlighter-rouge">chassis_private</code> table is used by <code class="language-plaintext highlighter-rouge">ovn-northd</code> and the owning chassis to store <em>private</em> data about that chassis, including:</p>
<ul>
<li>uuid</li>
<li>name</li>
<li>chassis</li>
<li>nb_cfg</li>
<li>nb_cfg_timestamp</li>
<li>external_ids</li>
</ul>
<p>The manpage does a better job of describing its purpose:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>These data are stored in this separate table instead of the Chassis
table for performance considerations:
the rows in this table can be conditionally monitored by chassises
so that each chassis only get update notifications for its own row,
to avoid unnecessary chassis private data update flooding in a large
scale deployment.
</code></pre></div></div>
<p>My environment consists of 3x controller nodes and 3x compute nodes running a variety of services, including OVN, OVN Metadata Agent, Legacy DHCP Agent (for Ironic), and the SR-IOV Agent. The catalyst for this particular post was an error when trying to retrieve a list of those agents:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra01:~# openstack network agent list
HttpException: 500: Server Error for url: http://10.20.0.11:9696/v2.0/agents, Request Failed: internal server error while processing your request.
</code></pre></div></div>
<p>A look at the <code class="language-plaintext highlighter-rouge">neutron-server</code> log revealed the following traceback:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Mar 24 19:37:30 lab-infra03 neutron-server[3148184]:
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource [req-82bf64ab-d8d4-4678-abdb-de439c392e71 34f3cf48b24f41c097555c07961f139e 7a8df96a3c6a47118e60e57aa9ecff54 - default default] index failed: No details.: AttributeError: 'Chassis_Private' object has no attribute 'hostname'
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource Traceback (most recent call last):
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/neutron/api/v2/resource.py", line 98, in resource
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource result = method(request=request, **args)
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/neutron_lib/db/api.py", line 139, in wrapped
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource setattr(e, '_RETRY_EXCEEDED', True)
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource self.force_reraise()
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource raise self.value
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/neutron_lib/db/api.py", line 135, in wrapped
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource return f(*args, **kwargs)
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/oslo_db/api.py", line 154, in wrapper
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource ectxt.value = e.inner_exc
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource self.force_reraise()
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource raise self.value
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/oslo_db/api.py", line 142, in wrapper
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource return f(*args, **kwargs)
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/neutron_lib/db/api.py", line 183, in wrapped
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource LOG.debug("Retry wrapper got retriable exception: %s", e)
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource self.force_reraise()
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource raise self.value
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/neutron_lib/db/api.py", line 179, in wrapped
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource return f(*dup_args, **dup_kwargs)
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/neutron/api/v2/base.py", line 369, in index
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource return self._items(request, True, parent_id)
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/neutron/api/v2/base.py", line 304, in _items
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource obj_list = obj_getter(request.context, **kwargs)
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1165, in fn
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource return op(results, new_method(*args, _driver=self, **kwargs))
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 1229, in get_agents
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource agent_dict = agent.as_dict()
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource File "/openstack/venvs/neutron-24.0.1/lib/python3.8/site-packages/neutron/plugins/ml2/drivers/ovn/agent/neutron_agent.py", line 59, in as_dict
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource 'host': self.chassis.hostname,
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource AttributeError: 'Chassis_Private' object has no attribute 'hostname'
2022-03-24 19:37:30.625 3148184 ERROR neutron.api.v2.resource
</code></pre></div></div>
<p>Most importantly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>AttributeError: 'Chassis_Private' object has no attribute 'hostname'
</code></pre></div></div>
<h2 id="a-look-at-ovn">A Look at OVN</h2>
<p>To understand what the ‘Chassis_Private’ object was and what its structure was expected to be, I took a visit to the Neutron source code; specifically <code class="language-plaintext highlighter-rouge">neutron/plugins/ml2/drivers/ovn/agent/neutron_agent.py</code> line 59:</p>
<p><a href="https://github.com/openstack/neutron/blob/stable/xena/neutron/plugins/ml2/drivers/ovn/agent/neutron_agent.py">https://github.com/openstack/neutron/blob/stable/xena/neutron/plugins/ml2/drivers/ovn/agent/neutron_agent.py</a></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> def as_dict(self):
return {
'binary': self.binary,
'host': self.chassis.hostname,
'heartbeat_timestamp': timeutils.utcnow(),
'availability_zone': ', '.join(
ovn_utils.get_chassis_availability_zones(self.chassis)),
'topic': 'n/a',
'description': self.description,
'configurations': {
'chassis_name': self.chassis.name,
'bridge-mappings':
self.chassis.external_ids.get('ovn-bridge-mappings', '')},
'start_flag': True,
'agent_type': self.agent_type,
'id': self.agent_id,
'alive': self.alive,
'admin_state_up': True}
</code></pre></div></div>
<p>In the above snippet, we can see in <code class="language-plaintext highlighter-rouge">as_dict</code> that <code class="language-plaintext highlighter-rouge">host</code> references <code class="language-plaintext highlighter-rouge">self.chassis.hostname</code>, and <code class="language-plaintext highlighter-rouge">chassis</code> itself is defined here:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@property
def chassis(self):
return self.chassis_from_private(self.chassis_private)
</code></pre></div></div>
<p>If we take a look at <code class="language-plaintext highlighter-rouge">chassis_from_private</code>, we get this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@staticmethod
def chassis_from_private(chassis_private):
try:
return chassis_private.chassis[0]
except (AttributeError, IndexError):
# No Chassis_Private support, just use Chassis
return chassis_private
</code></pre></div></div>
<p>I don’t proclaim to be a Python expert, or even a developer for that matter, but in following along I can see that it’s returning the 1st element ([0]) of the list <code class="language-plaintext highlighter-rouge">chassis</code> for this <code class="language-plaintext highlighter-rouge">chassis_private</code> object.</p>
<p>Using some OVN tools, I was able to list both the <code class="language-plaintext highlighter-rouge">chassis_private</code> and <code class="language-plaintext highlighter-rouge">chassis</code> tables from the Southbound DB:</p>
<h4 id="chassis-table">chassis table</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra02:~# ovn-sbctl list chassis
_uuid : 6c90b020-be8e-4b7c-9aa8-0f4a9f826e6d
encaps : [4c1cae4d-36a4-4541-af8c-fc02758fab4e, ac4dd143-10db-48c3-b4dd-8f42d0d6efd0]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet1:br-rpn,vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-compute01
name : "0c9b25a6-3760-4b57-ba71-49e7091730bb"
nb_cfg : 0
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet1:br-rpn,vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
_uuid : 478e3679-f4af-4a2d-a986-85323c840620
encaps : [1e5060c3-a6ce-41bd-b54a-2ba3907f7092, 3177060c-bcdc-4c02-bf31-ff359c666538]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-compute03
name : "1f318a3c-f607-4272-814c-b0c4d813daa5"
nb_cfg : 0
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
_uuid : eed64f71-9cd1-4e1a-a891-e8bbb9049c41
encaps : [3932bdff-e3a2-425b-9bf5-8d05fffbd171, a6f63e7b-a59a-41a3-9fdf-a2b6fae892cd]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet2:br-rpn,vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-compute02
name : "6c2a75b1-482a-40e3-91f8-3e449986f5b6"
nb_cfg : 174
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet2:br-rpn,vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
_uuid : 8a07dfc5-1e52-49aa-aa97-ec0515334fc6
encaps : [63a84cd2-cb93-485e-aaa4-e6701dbb9a7d, a67c8b1c-da4e-4178-b0eb-58315983ca68]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-infra01
name : "900595a5-a02a-4566-b6dc-0c1e0e2cb392"
nb_cfg : 0
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
_uuid : 8f4829ad-746d-4125-8561-363adbbc4dce
encaps : [ce7a5ab6-534a-4667-a7c0-5f112b0f4507, fcaf8226-fa90-4848-8c1d-d3c975276e05]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", "neutron:ovn-metadata-id"="344341e0-8e69-5e00-979c-d59fee1b9b27", "neutron:ovn-metadata-sb-cfg"="173", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-infra03
name : "d50d391d-910f-40d6-8aa7-24fbfda018ff"
nb_cfg : 171
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
_uuid : ada8b169-bd60-490b-8520-7e621cbbb84e
encaps : [072fa878-8849-4b06-acfd-4e889ff308b0, 8b6e5992-a4bf-46e3-b1c2-d5494765ca62]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", "neutron:ovn-metadata-id"="83641d9c-6244-564c-b67c-d5b3298adc85", "neutron:ovn-metadata-sb-cfg"="574", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-infra02
name : "30757b96-cb1b-4512-bfdd-df6df50f2f4c"
nb_cfg : 171
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
</code></pre></div></div>
<p>There I see 6 chassis defined in the Southbound DB, which is to be expected.</p>
<h4 id="chassis_private-table">chassis_private table</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra02:~# ovn-sbctl list chassis_private
_uuid : 38940279-b953-4ec8-9069-9fc2e7b7fe3d
chassis : eed64f71-9cd1-4e1a-a891-e8bbb9049c41
external_ids : {"neutron:ovn-metadata-id"="6645f143-2dc0-5f03-b7bb-681bc3e8b969", "neutron:ovn-metadata-sb-cfg"="662"}
name : "6c2a75b1-482a-40e3-91f8-3e449986f5b6"
nb_cfg : 662
nb_cfg_timestamp : 1648150649984
_uuid : 521bfe75-5a10-4928-8158-b15df2cb0c5d
chassis : ada8b169-bd60-490b-8520-7e621cbbb84e
external_ids : {"neutron:ovn-metadata-id"="83641d9c-6244-564c-b67c-d5b3298adc85", "neutron:ovn-metadata-sb-cfg"="662"}
name : "30757b96-cb1b-4512-bfdd-df6df50f2f4c"
nb_cfg : 662
nb_cfg_timestamp : 1648150649989
_uuid : 4c614b3d-1e26-40d7-8d92-00d1a1e77243
chassis : 8f4829ad-746d-4125-8561-363adbbc4dce
external_ids : {"neutron:ovn-metadata-id"="344341e0-8e69-5e00-979c-d59fee1b9b27", "neutron:ovn-metadata-sb-cfg"="662"}
name : "d50d391d-910f-40d6-8aa7-24fbfda018ff"
nb_cfg : 662
nb_cfg_timestamp : 1648150649988
_uuid : 73b1096a-b38f-4e6a-960c-4a99e93735d6
chassis : 6c90b020-be8e-4b7c-9aa8-0f4a9f826e6d
external_ids : {"neutron:ovn-metadata-id"="2864488c-c9a8-5cf1-b1c0-184c295493b6", "neutron:ovn-metadata-sb-cfg"="662"}
name : "0c9b25a6-3760-4b57-ba71-49e7091730bb"
nb_cfg : 662
nb_cfg_timestamp : 1648150649984
_uuid : 3b580d16-2896-489f-8f1a-10d2cd13e1ae
chassis : []
external_ids : {"neutron:ovn-metadata-id"="bdc50d9c-42b1-5f20-8737-baba108b2f67", "neutron:ovn-metadata-sb-cfg"="425"}
name : "5236f154-4a73-44ab-a588-b602a0b56bd5"
nb_cfg : 425
nb_cfg_timestamp : 1643778233179
_uuid : a65846ba-67dc-49eb-9558-17ca0db09e0f
chassis : 478e3679-f4af-4a2d-a986-85323c840620
external_ids : {"neutron:ovn-metadata-id"="4d9e06dc-69c0-5ea7-8a6d-e750d11ebb9f", "neutron:ovn-metadata-sb-cfg"="662"}
name : "1f318a3c-f607-4272-814c-b0c4d813daa5"
nb_cfg : 662
nb_cfg_timestamp : 1648150649985
_uuid : cb281eea-a02c-44c7-81e9-19aab7637c12
chassis : 8a07dfc5-1e52-49aa-aa97-ec0515334fc6
external_ids : {"neutron:ovn-metadata-id"="64b68ff2-b068-5e64-a1cd-9c95afadd0b7", "neutron:ovn-metadata-sb-cfg"="662"}
name : "900595a5-a02a-4566-b6dc-0c1e0e2cb392"
nb_cfg : 662
nb_cfg_timestamp : 1648150649986
</code></pre></div></div>
<p>In listing the <code class="language-plaintext highlighter-rouge">chassis_private</code> table, however, I see 7 entries. And wouldn’t you know, <strong>one</strong> of those entries has an empty <code class="language-plaintext highlighter-rouge">chassis</code> list:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_uuid : 3b580d16-2896-489f-8f1a-10d2cd13e1ae
chassis : []
external_ids : {"neutron:ovn-metadata-id"="bdc50d9c-42b1-5f20-8737-baba108b2f67", "neutron:ovn-metadata-sb-cfg"="425"}
name : "5236f154-4a73-44ab-a588-b602a0b56bd5"
nb_cfg : 425
nb_cfg_timestamp : 1643778233179
</code></pre></div></div>
<p>That would explain, then, why a traceback was encountered when the agent code attempted to reference the <code class="language-plaintext highlighter-rouge">hostname</code> of a <em>null</em> chassis:</p>
<p><code class="language-plaintext highlighter-rouge">AttributeError: 'Chassis_Private' object has no attribute 'hostname'</code></p>
<p>On a whim, I deleted the errant row:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ovn-sbctl destroy chassis_private 3b580d16-2896-489f-8f1a-10d2cd13e1ae
</code></pre></div></div>
<p>Running the command again, I confirmed there were only six entries and that they lined up with their corresponding chassis:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra02:~# ovn-sbctl list chassis
_uuid : 6c90b020-be8e-4b7c-9aa8-0f4a9f826e6d
encaps : [4c1cae4d-36a4-4541-af8c-fc02758fab4e, ac4dd143-10db-48c3-b4dd-8f42d0d6efd0]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet1:br-rpn,vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-compute01
name : "0c9b25a6-3760-4b57-ba71-49e7091730bb"
nb_cfg : 0
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet1:br-rpn,vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
_uuid : 478e3679-f4af-4a2d-a986-85323c840620
encaps : [1e5060c3-a6ce-41bd-b54a-2ba3907f7092, 3177060c-bcdc-4c02-bf31-ff359c666538]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-compute03
name : "1f318a3c-f607-4272-814c-b0c4d813daa5"
nb_cfg : 0
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
_uuid : eed64f71-9cd1-4e1a-a891-e8bbb9049c41
encaps : [3932bdff-e3a2-425b-9bf5-8d05fffbd171, a6f63e7b-a59a-41a3-9fdf-a2b6fae892cd]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet2:br-rpn,vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-compute02
name : "6c2a75b1-482a-40e3-91f8-3e449986f5b6"
nb_cfg : 174
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="physnet2:br-rpn,vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
_uuid : 8a07dfc5-1e52-49aa-aa97-ec0515334fc6
encaps : [63a84cd2-cb93-485e-aaa4-e6701dbb9a7d, a67c8b1c-da4e-4178-b0eb-58315983ca68]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-infra01
name : "900595a5-a02a-4566-b6dc-0c1e0e2cb392"
nb_cfg : 0
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
_uuid : 8f4829ad-746d-4125-8561-363adbbc4dce
encaps : [ce7a5ab6-534a-4667-a7c0-5f112b0f4507, fcaf8226-fa90-4848-8c1d-d3c975276e05]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", "neutron:ovn-metadata-id"="344341e0-8e69-5e00-979c-d59fee1b9b27", "neutron:ovn-metadata-sb-cfg"="173", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-infra03
name : "d50d391d-910f-40d6-8aa7-24fbfda018ff"
nb_cfg : 171
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
_uuid : ada8b169-bd60-490b-8520-7e621cbbb84e
encaps : [072fa878-8849-4b06-acfd-4e889ff308b0, 8b6e5992-a4bf-46e3-b1c2-d5494765ca62]
external_ids : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", "neutron:ovn-metadata-id"="83641d9c-6244-564c-b67c-d5b3298adc85", "neutron:ovn-metadata-sb-cfg"="574", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
hostname : lab-infra02
name : "30757b96-cb1b-4512-bfdd-df6df50f2f4c"
nb_cfg : 171
other_config : {datapath-type=system, iface-types="bareudp,erspan,geneve,gre,gtpu,internal,ip6erspan,ip6gre,lisp,patch,stt,system,tap,vxlan", is-interconn="false", ovn-bridge-mappings="vlan:br-ex", ovn-chassis-mac-mappings="", ovn-cms-options=enable-chassis-as-gw, ovn-enable-lflow-cache="true", ovn-limit-lflow-cache="", ovn-memlimit-lflow-cache-kb="", ovn-monitor-all="false", ovn-trim-limit-lflow-cache="", ovn-trim-wmark-perc-lflow-cache="", port-up-notif="true"}
transport_zones : []
vtep_logical_switches: []
</code></pre></div></div>
<h2 id="testing">Testing</h2>
<p>So now, the moment of truth!</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra01:~# openstack network agent list
HttpException: 500: Server Error for url: http://10.20.0.11:9696/v2.0/agents, Request Failed: internal server error while processing your request.
</code></pre></div></div>
<p>Dang.</p>
<p>I spent another few minutes mulling this over before considering that a restart of the <code class="language-plaintext highlighter-rouge">neutron-server</code> service might be warranted. After restarting <code class="language-plaintext highlighter-rouge">neutron-server</code> across the three controller nodes, the following attempt worked:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra01:~# openstack network agent list
+--------------------------------------+------------------------------+--------------------------------------+-------------------+-------+-------+----------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+------------------------------+--------------------------------------+-------------------+-------+-------+----------------------------+
| 1591b8ad-8a59-47f8-b1cf-53c4375eea5c | NIC Switch agent | lab-infra03 | None | :-) | UP | neutron-sriov-nic-agent |
| 16355a23-b872-4ec2-995e-208094f2057c | Baremetal Node | 8919cf4d-a9dd-4985-ae70-835ba024e7b7 | None | :-) | UP | ironic-neutron-agent |
| 258a10ff-1090-4e90-a32c-4c6f8d01c938 | DHCP agent | lab-infra01 | nova | :-) | UP | neutron-dhcp-agent |
| 29d9376e-dee5-41a1-9e86-e3d9607f4a59 | NIC Switch agent | lab-infra02 | None | :-) | UP | neutron-sriov-nic-agent |
| 346ba9ea-1c2d-4dc8-ba61-4cde37bbeaf9 | Metering agent | lab-infra03 | None | :-) | UP | neutron-metering-agent |
| 416d7511-3ef2-4bda-9b5c-157d2bef182a | Baremetal Node | f7945b37-f43f-4b69-b987-1277d0a5777f | None | :-) | UP | ironic-neutron-agent |
| 467885c9-539e-4b9b-8bde-69405bf0597d | Baremetal Node | 97c9e327-9b72-4566-a345-ca0544e28d14 | None | :-) | UP | ironic-neutron-agent |
| 57175ad2-02be-4d05-a9ee-08643a6393c8 | NIC Switch agent | lab-infra01 | None | :-) | UP | neutron-sriov-nic-agent |
| 594fdaab-d0be-4c69-8081-b293009b4808 | Metering agent | lab-infra01 | None | :-) | UP | neutron-metering-agent |
| 8533ec17-f8f5-4240-a085-98f158a981df | NIC Switch agent | lab-compute03 | None | :-) | UP | neutron-sriov-nic-agent |
| 868a1ae9-3f3b-4574-9fce-0ff1762df160 | Metering agent | lab-infra02 | None | :-) | UP | neutron-metering-agent |
| 8bc8691c-064b-4661-b4d0-f2ca778012ee | DHCP agent | lab-infra02 | nova | :-) | UP | neutron-dhcp-agent |
| b126376b-e253-47f8-b22e-fce5ffb87f94 | Baremetal Node | 1ff24bbc-6058-41f9-aad5-7d4e78c81695 | None | :-) | UP | ironic-neutron-agent |
| b589a112-0877-4968-a6f5-04a3e3a383b6 | NIC Switch agent | lab-compute01 | None | :-) | UP | neutron-sriov-nic-agent |
| b5c326c6-cbe3-42b2-a78c-bf3008272dc1 | NIC Switch agent | lab-compute02 | None | :-) | UP | neutron-sriov-nic-agent |
| c2b0c5e4-9499-4d97-8ecf-f09c7496b0bd | DHCP agent | lab-infra03 | nova | :-) | UP | neutron-dhcp-agent |
| da06498a-fc06-45a0-bbba-1568f700cca6 | Baremetal Node | eac40a3f-3854-426c-b232-7ae7df4ab549 | None | :-) | UP | ironic-neutron-agent |
| 900595a5-a02a-4566-b6dc-0c1e0e2cb392 | OVN Controller Gateway agent | lab-infra01 | | :-) | UP | ovn-controller |
| 64b68ff2-b068-5e64-a1cd-9c95afadd0b7 | OVN Metadata agent | lab-infra01 | | :-) | UP | neutron-ovn-metadata-agent |
| 1f318a3c-f607-4272-814c-b0c4d813daa5 | OVN Controller Gateway agent | lab-compute03 | | :-) | UP | ovn-controller |
| 4d9e06dc-69c0-5ea7-8a6d-e750d11ebb9f | OVN Metadata agent | lab-compute03 | | :-) | UP | neutron-ovn-metadata-agent |
| d50d391d-910f-40d6-8aa7-24fbfda018ff | OVN Controller Gateway agent | lab-infra03 | | :-) | UP | ovn-controller |
| 344341e0-8e69-5e00-979c-d59fee1b9b27 | OVN Metadata agent | lab-infra03 | | :-) | UP | neutron-ovn-metadata-agent |
| 0c9b25a6-3760-4b57-ba71-49e7091730bb | OVN Controller Gateway agent | lab-compute01 | | :-) | UP | ovn-controller |
| 2864488c-c9a8-5cf1-b1c0-184c295493b6 | OVN Metadata agent | lab-compute01 | | :-) | UP | neutron-ovn-metadata-agent |
| 6c2a75b1-482a-40e3-91f8-3e449986f5b6 | OVN Controller Gateway agent | lab-compute02 | | :-) | UP | ovn-controller |
| 6645f143-2dc0-5f03-b7bb-681bc3e8b969 | OVN Metadata agent | lab-compute02 | | :-) | UP | neutron-ovn-metadata-agent |
| 30757b96-cb1b-4512-bfdd-df6df50f2f4c | OVN Controller Gateway agent | lab-infra02 | | :-) | UP | ovn-controller |
| 83641d9c-6244-564c-b67c-d5b3298adc85 | OVN Metadata agent | lab-infra02 | | :-) | UP | neutron-ovn-metadata-agent |
+--------------------------------------+------------------------------+--------------------------------------+-------------------+-------+-------+----------------------------+
</code></pre></div></div>
<h2 id="summary">Summary</h2>
<p>This was not the first time I’d come across this issue, and unfortunately, I can neither explain <em>why</em> it happens nor remember what I did last time to fix it. It’s probably obvious I did something similar, but I’ve slept since then and don’t recall.</p>
<p>The following links were helpful in gaining a better understanding of what is/was happening and upstream changes being put in place to either keep it from happening in the future or more gracefully recover:</p>
<ul>
<li><a href="https://review.opendev.org/c/openstack/neutron/+/797796/">https://review.opendev.org/c/openstack/neutron/+/797796/</a></li>
<li><a href="https://bugzilla.redhat.com/show_bug.cgi?id=1975264">https://bugzilla.redhat.com/show_bug.cgi?id=1975264</a></li>
<li><a href="https://review.opendev.org/c/openstack/neutron/+/818132">https://review.opendev.org/c/openstack/neutron/+/818132</a></li>
</ul>
<hr />
<p>If you have some thoughts or comments on this post, I’d love to hear ‘em. Feel free to reach out on Twitter at @jimmdenton or hit me up on LinkedIn.</p>jamesdentonOn more than one occasion I have turned to this blog to fix issues that reoccur weeks/months/years after the initial post is born, and this post will serve as one of those reference points in the future, I’m sure. In my OpenStack-Ansible Xena lab running OVN, I’ve twice now come across the following error when performing a openstack network agent list command: 'Chassis_Private' object has no attribute 'hostname' What does that even mean?!Using Minio as S3 Backend for OpenStack Glance2021-12-25T00:00:00+00:002021-12-25T00:00:00+00:00http://www.jimmdenton.com/using-minio-as-s3-backend-glance<p>My homelab consists of a few random devices, including a Synology NAS that doubles as a home backup system. I use NFS to provide shared storage for Glance images and Cinder volumes, and Synology even has Cinder drivers that leverage iSCSI. All-in-all, it’s a pretty useful setup to test a myriad of OpenStack functionality.</p>
<p>I recently discovered Minio, which is an open-source object storage solution that provides S3 compatibility. Since it’s installable with Docker, I thought I’d give it a go and test OpenStack’s <em>reintroduced</em> support for S3 backends in Glance.
<!--more--></p>
<h2 id="configuring-minio">Configuring Minio</h2>
<p>To install Minio in Docker on DSM, I followed a <a href="https://jonaharagon.me/installing-minio-on-synology-diskstation-4823caf600c3">guide</a> that, while a little old, worked out well enough. In my environment, using <code class="language-plaintext highlighter-rouge">host</code> networking rather than <code class="language-plaintext highlighter-rouge">bridge</code> worked better.</p>
<p>Once installed, it requires a minimal amount of configuration to work with Glance. You will need:</p>
<ul>
<li>a user with r/w permissions</li>
<li>a region defined</li>
</ul>
<p>To create the user, navigate to <strong>Users</strong> -> <strong>Create User</strong> and provide an ACCESS KEY and SECRET KEY and appropriate permissions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ACCESS_KEY: openstack
SECRET_KEY: 0p3nstack
POLICY: readwrite
</code></pre></div></div>
<p>To define a region, navigate to <strong>Settings</strong> -> <strong>Region</strong> and set the region name in the <strong>Server Location</strong> field. I originally set <code class="language-plaintext highlighter-rouge">us-south-lab</code>, but due to some pre-configured assumptions in the <code class="language-plaintext highlighter-rouge">boto3</code> python client, I had to change this to <code class="language-plaintext highlighter-rouge">us-east-1</code> for things to work properly.</p>
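<p>Before pointing Glance at Minio, it doesn’t hurt to confirm the keys and region actually work from Python. Here’s a quick sanity-check sketch using <code class="language-plaintext highlighter-rouge">boto3</code>; the endpoint, credentials, and region are the ones from this lab and will differ in your environment.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sanity check of the Minio endpoint using boto3 (values are from this lab;
# substitute your own endpoint, keys, and region).
import boto3

s3 = boto3.client(
    's3',
    endpoint_url='http://172.22.0.4:9000',
    aws_access_key_id='openstack',
    aws_secret_access_key='0p3nstack',
    region_name='us-east-1',
)

print([bucket['Name'] for bucket in s3.list_buckets()['Buckets']])
</code></pre></div></div>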
<h2 id="configuring-openstack">Configuring OpenStack</h2>
<p>There are some overrides on the OpenStack-Ansible side that must be configured to allow the playbooks to properly configure Glance for the additional backend. Use the <code class="language-plaintext highlighter-rouge">glance_additional_stores</code> variable, taking care to ensure that any defaults are also specified (since you’re overriding the default variable).</p>
<p>The value for <code class="language-plaintext highlighter-rouge">name</code> is arbitrary, and used as an identifier for specific settings that will also be defined, while <code class="language-plaintext highlighter-rouge">type</code> is a specific type of Glance backend.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>glance_additional_stores:
- http
- cinder
- name: minio
type: s3
</code></pre></div></div>
<p>In addition to <code class="language-plaintext highlighter-rouge">glance_additional_stores</code>, you must define a new configuration block that maps to the new backend definition. For OpenStack-Ansible, this can be done as a config override:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>glance_glance_api_conf_overrides:
minio:
s3_store_host: http://172.22.0.4:9000
s3_store_access_key: openstack
s3_store_secret_key: 0p3nstack
s3_store_bucket: glance
s3_store_create_bucket_on_put: True
s3_store_bucket_url_format: auto
</code></pre></div></div>
<p>In <code class="language-plaintext highlighter-rouge">glance-api.conf</code>, the override above will be written as this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[minio]
s3_store_host = http://172.22.0.4:9000
s3_store_access_key = openstack
s3_store_secret_key = 0p3nstack
s3_store_bucket = glance
s3_store_create_bucket_on_put = True
s3_store_bucket_url_format = auto
</code></pre></div></div>
<h2 id="testing-the-backend">Testing the Backend</h2>
<p>If the default Glance backend (file) has not been changed, it is still possible to upload individual images to the new S3 backend using the <code class="language-plaintext highlighter-rouge">glance</code> client.</p>
<p>In this example, a Cirros image will be uploaded to the <code class="language-plaintext highlighter-rouge">minio</code> store:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra01:~/images# glance image-create --file cirros-0.5.1-x86_64-disk.img --disk-format raw --container-format bare --name cirros3 --store minio --progress
[=============================>] 100%
+------------------+----------------------------------------------------------------------------------+
| Property | Value |
+------------------+----------------------------------------------------------------------------------+
| checksum | 1d3062cd89af34e419f7100277f38b2b |
| container_format | bare |
| created_at | 2021-12-24T04:23:19Z |
| disk_format | raw |
| id | 53627724-da3e-4b81-9910-55598d9393d4 |
| locations | [{"url": "s3://openstack:0p3nstack@172.22.0.4:9000/glance/53627724-da3e-4b81-991 |
| | 0-55598d9393d4", "metadata": {"store": "minio"}}] |
| min_disk | 0 |
| min_ram | 0 |
| name | cirros3 |
| os_hash_algo | sha512 |
| os_hash_value | 553d220ed58cfee7dafe003c446a9f197ab5edf8ffc09396c74187cf83873c877e7ae041cb80f3b9 |
| | 1489acf687183adcd689b53b38e3ddd22e627e7f98a09c46 |
| os_hidden | False |
| owner | 7a8df96a3c6a47118e60e57aa9ecff54 |
| protected | False |
| size | 16338944 |
| status | active |
| stores | minio |
| tags | [] |
| updated_at | 2021-12-24T04:23:21Z |
| virtual_size | 16338944 |
| visibility | shared |
+------------------+----------------------------------------------------------------------------------+
</code></pre></div></div>
<p>Once uploaded, an instance can be created by specifying the new image name or UUID.</p>
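<p>If you’d rather script that boot step, a rough sketch using openstacksdk might look like the following. The cloud name, flavor, and network below are hypothetical placeholders and assume a working <code class="language-plaintext highlighter-rouge">clouds.yaml</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Boot an instance from the S3-backed image via openstacksdk (sketch only;
# 'lab', '1-1-1', and 'web' are placeholder names).
import openstack

conn = openstack.connect(cloud='lab')

image = conn.image.find_image('cirros3')
flavor = conn.compute.find_flavor('1-1-1')
network = conn.network.find_network('web')

server = conn.compute.create_server(
    name='vm-s3-test',
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{'uuid': network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.status)
</code></pre></div></div>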
<h2 id="benchmarking-minio">Benchmarking Minio</h2>
<p>The Minio team provides a benchmarking utility named Warp, which is available on <a href="https://github.com/minio/warp">Github</a> as source code or pre-compiled binaries.</p>
<p>To test, you’ll need the Minio endpoint along with the access and secret keys:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># warp mixed --host=172.22.0.4:9000 --access-key=openstack --secret-key=0p3nstack --autoterm
Throughput 7.3 objects/s within 7.500000% for 25.802s. Assuming stability. Terminating benchmark.
warp: Benchmark data written to "warp-mixed-2021-12-24[050521]-hCzP.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 1m33s.
* Throughput: 2.39 obj/s
Operation: GET, 44%, Concurrency: 20, Ran 1m33s.
* Throughput: 104.80 MiB/s, 10.48 obj/s
Operation: PUT, 15%, Concurrency: 20, Ran 1m32s.
* Throughput: 36.17 MiB/s, 3.62 obj/s
Operation: STAT, 30%, Concurrency: 20, Ran 1m33s.
* Throughput: 7.00 obj/s
Cluster Total: 139.94 MiB/s, 23.35 obj/s over 1m34s.
</code></pre></div></div>
<p>The NAS hosting this instance of Minio is a DS1815+ with 4x 6TB 6Gbps SATA disks and 1Gbps networking. Things look considerably better with a different NAS (DS1621+) using NVMe and 10Gbps networking:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># warp mixed --host=10.22.0.4:9000 --access-key=openstack --secret-key=0p3nstack --autoterm
Throughput 51.6 objects/s within 7.500000% for 13.489s. Assuming stability. Terminating benchmark.
warp: Benchmark data written to "warp-mixed-2021-12-27[153057]-mzH0.csv.zst"
Mixed operations.
Operation: DELETE, 10%, Concurrency: 20, Ran 48s.
* Throughput: 16.51 obj/s
Operation: GET, 45%, Concurrency: 20, Ran 48s.
* Throughput: 742.10 MiB/s, 74.21 obj/s
Operation: PUT, 15%, Concurrency: 20, Ran 48s.
* Throughput: 248.44 MiB/s, 24.84 obj/s
Operation: STAT, 30%, Concurrency: 20, Ran 48s.
* Throughput: 49.47 obj/s
Cluster Total: 986.46 MiB/s, 164.60 obj/s over 48s.
</code></pre></div></div>
<h2 id="summary">Summary</h2>
<p>I was glad to see that the S3 backend had been re-introduced in Ussuri after being deprecated around the Mitaka timeframe, and having some local object storage options is nice for testing and for eventually setting up Cinder volume backups. Using something like Ceph (for object) is a bit overkill for my use cases, and another administrative headache I don’t want to deal with. I might try to implement a Swift proxy to translate Swift -> S3 for Ironic, but will leave that for another day.</p>
<hr />
<p>If you have some thoughts or comments on this article, I’d love to hear ‘em. Feel free to reach out on Twitter at @jimmdenton or hit me up on LinkedIn.</p>jamesdentonMy homelab consists of a few random devices, including a Synology NAS that doubles as a home backup system. I use NFS to provide shared storage for Glance images and Cinder volumes, and Synology even has Cinder drivers that leverage iSCSI. All-in-all, it’s a pretty useful setup to test a myriad of OpenStack functionality. I recently discovered Minio, which is an open-source object storage solution that provides S3 compatibility. Installable with Docker, I thought I’d give it a go and test OpenStack’s reintroduced support for S3 backends in Glance.Mounting Virtual Media Using Redfish on iDRAC 82021-12-19T00:00:00+00:002021-12-19T00:00:00+00:00http://www.jimmdenton.com/mounting-virtual-media-drac8<p>Using HP iLO 4 for the last few years, you could say I’ve been a bit <em>spoiled</em> with some of the conveniences provided within.</p>
<p>So, imagine my surprise when firing up my recently-acquired Dell R630 for the first time, only to find that HTTP-based virtual media was not an option in the UI!
<!--more-->
Some time later I came to find out that mounting virtual media requires the use of the API. No big deal, except that I had not found an obvious guide to using the included API (which I later found out was Redfish v1). It took some time to find a good, working example <a href="https://github.com/dell/iDRAC-Redfish-Scripting/issues/24">here</a>.</p>
<p>And now, I’ll save you some time and trouble by demonstrating a mount and eject operation via curl.</p>
<h2 id="mounting-virtual-media-via-https">Mounting Virtual Media via HTTP/S</h2>
<p>To attach virtual media, one must use the following format:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>POST
URI: https://<drac_ip_address>/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.InsertMedia
BODY:
{
"Image": "http://<web_server>/<image_name>.iso"
}
</code></pre></div></div>
<p>The following example will <strong><em>mount</em></strong> an ISO using curl:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -v -k -X POST https://172.19.0.25/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.InsertMedia \
-u root \
-H 'Content-Type: application/json' \
-d '{"Image": "http://172.22.0.5/VMware-VMvisor-Installer-7.0U2-17630552.x86_64.iso"}'
</code></pre></div></div>
<p>A successful operation will result in an HTTP <code class="language-plaintext highlighter-rouge">204</code> status code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> POST /redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.InsertMedia HTTP/1.1
> Host: 172.19.0.25
> Authorization: Basic cm9vdDpjYWx2aW5jYWx2aW4=
> User-Agent: curl/7.77.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 81
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 204 No Content
< Strict-Transport-Security: max-age=63072000
< Vary: Accept-Encoding
< Keep-Alive: timeout=60, max=199
< X-Frame-Options: SAMEORIGIN
< Content-Type: application/json; charset=utf-8
< Server: iDRAC/8
< Date: Mon, 20 Dec 2021 07:24:06 GMT
< Cache-Control: no-cache
< Content-Length: 0
< Connection: Keep-Alive
< Accept-Ranges: bytes
<
* Connection #0 to host 172.19.0.25 left intact
</code></pre></div></div>
<p>Attempting to mount an ISO with something already attached will result in a <code class="language-plaintext highlighter-rouge">500</code> error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> POST /redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.InsertMedia HTTP/1.1
> Host: 172.19.0.25
> Authorization: Basic cm9vdDpjYWx2aW5jYWx2aW4=
> User-Agent: curl/7.77.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 81
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Strict-Transport-Security: max-age=63072000
< OData-Version: 4.0
< Vary: Accept-Encoding
< Keep-Alive: timeout=60, max=199
< X-Frame-Options: SAMEORIGIN
< Content-Type: application/json;odata.metadata=minimal;charset=utf-8
< Server: iDRAC/8
< Date: Mon, 20 Dec 2021 07:22:56 GMT
< Cache-Control: no-cache
< Content-Length: 424
< Connection: Keep-Alive
< Access-Control-Allow-Origin: *
< Accept-Ranges: bytes
<
{"error":{"@Message.ExtendedInfo":[{"Message":"The Virtual Media image server is already connected.","MessageArgs":[],"MessageArgs@odata.count":0,"MessageId":"IDRAC.1.6.VRM0012","RelatedProperties":[],"RelatedProperties@odata.count":0,"Resolution":"No response action is required.","Severity":"Informational"}],"code":"Base.1.2.GeneralError","message":"A general error has occurred. See ExtendedInfo for more information"}}
</code></pre></div></div>
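<p>Before attempting a mount, you can also check what (if anything) is currently attached by querying the VirtualMedia resource directly. This is a quick sanity check; the standard Redfish VirtualMedia schema exposes <code class="language-plaintext highlighter-rouge">Inserted</code> and <code class="language-plaintext highlighter-rouge">Image</code> properties, which the iDRAC should report in the JSON response:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -s -k -u root https://172.19.0.25/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD
</code></pre></div></div>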
<h2 id="ejecting-virtual-media">Ejecting Virtual Media</h2>
<p>To eject virtual media, one must use the following format:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Method: POST
URI: https://<idrac_ip_address>/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.EjectMedia
BODY:
{}
</code></pre></div></div>
<p>The following example will <strong><em>eject</em></strong> an ISO using curl:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -v -k -X POST https://drac_ip_address>/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.EjectMedia \
-u root \
-H 'Content-Type: application/json' \
-d '{}'
</code></pre></div></div>
<p>And, yes, the payload is required (and empty) on an eject operation.</p>
<p>A successful operation will result in an HTTP <code class="language-plaintext highlighter-rouge">204</code> status code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> POST /redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.EjectMedia HTTP/1.1
> Host: 172.19.0.25
> Authorization: Basic cm9vdDpjYWx2aW5jYWx2aW4=
> User-Agent: curl/7.77.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 2
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 204 No Content
< Strict-Transport-Security: max-age=63072000
< Vary: Accept-Encoding
< Keep-Alive: timeout=60, max=199
< X-Frame-Options: SAMEORIGIN
< Content-Type: application/json; charset=utf-8
< Server: iDRAC/8
< Date: Mon, 20 Dec 2021 07:23:29 GMT
< Cache-Control: no-cache
< Connection: Keep-Alive
< Transfer-Encoding: chunked
< Accept-Ranges: bytes
<
* Excess found: excess = 5 url = /redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.EjectMedia (zero-length body)
</code></pre></div></div>
<p>Attempting to eject an ISO that is not attached will result in a <code class="language-plaintext highlighter-rouge">500</code> error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> POST /redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.EjectMedia HTTP/1.1
> Host: 172.19.0.25
> Authorization: Basic cm9vdDpjYWx2aW5jYWx2aW4=
> User-Agent: curl/7.77.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 2
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Strict-Transport-Security: max-age=63072000
< OData-Version: 4.0
< Vary: Accept-Encoding
< Keep-Alive: timeout=60, max=199
< X-Frame-Options: SAMEORIGIN
< Content-Type: application/json;odata.metadata=minimal;charset=utf-8
< Server: iDRAC/8
< Date: Mon, 20 Dec 2021 07:20:35 GMT
< Cache-Control: no-cache
< Content-Length: 774
< Connection: Keep-Alive
< Access-Control-Allow-Origin: *
< Accept-Ranges: bytes
<
{"error":{"@Message.ExtendedInfo":[{"Message":"No Virtual Media devices are currently connected.","MessageArgs":[],"MessageArgs@odata.count":0,"MessageId":"IDRAC.1.6.VRM0009","RelatedProperties":[],"RelatedProperties@odata.count":0,"Resolution":"No response action is required.","Severity":"Critical"},{"Message":"The request failed due to an internal service error. The service is still operational.","MessageArgs":[],"MessageArgs@odata.count":0,"MessageId":"Base.1.2.InternalError","RelatedProperties":[],"RelatedProperties@odata.count":0,"Resolution":"Resubmit the request. If the problem persists, consider resetting the service.","Severity":"Critical"}],"code":"Base.1.2.GeneralError","message":"A general error has occurred. See ExtendedInfo for more information"}}
</code></pre></div></div>
<h2 id="summary">Summary</h2>
<p>Now that I know what to do, using the API can be faster (and more flexible) than the UI. However, getting there was a bit of a challenge. Hopefully this helps you on your journey.</p>
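<p>For repeated use, those two curl calls are easy to wrap in a small shell script. Here’s a rough sketch based on the commands above; the iDRAC address and image URL are placeholders to substitute with your own:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# Rough sketch: mount/eject virtual media on iDRAC 8 via Redfish
BASE=https://172.19.0.25/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions

mount_iso() {
  # $1 = HTTP/S URL of the ISO to attach
  curl -s -k -X POST $BASE/VirtualMedia.InsertMedia \
    -u root \
    -H 'Content-Type: application/json' \
    -d "{\"Image\": \"$1\"}"
}

eject_iso() {
  # The empty JSON payload is still required on eject
  curl -s -k -X POST $BASE/VirtualMedia.EjectMedia \
    -u root \
    -H 'Content-Type: application/json' \
    -d '{}'
}
</code></pre></div></div>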
<hr />
<p>If you have some thoughts or comments on this article, I’d love to hear ‘em. Feel free to reach out on Twitter at @jimmdenton or hit me up on LinkedIn.</p>jamesdentonUsing HP iLO 4 for the last few years, you could say I’ve been a bit spoiled with some of the conveniences provided within. So, imagine my surprise when firing up my recently-acquired Dell R630 for the first time, only to find that HTTP-based virtual media was not an option in the UI!Updating from 1024-bit to 2048-bit SSL Keys on HPE iLO 42021-12-16T00:00:00+00:002021-12-16T00:00:00+00:00http://www.jimmdenton.com/updating-from-1024-2048-ssl-ilo4<p>A recent attempt to move away from IPMI to the native HPE iLO 4 driver in my OpenStack Ironic lab showed just how wrong I was to believe it would be a seamless change. What I found was that while ironic-conductor could communicate with iLO, apparently, it didn’t like what it saw:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: EE certificate key too weak
</code></pre></div></div>
<!--more-->
<p>No big deal, right? There should be something obvious in the iLO page to alert me to this weakness, and a check and a click and I’d be back in business.</p>
<p>Wrong.</p>
<p>Even moving from the <code class="language-plaintext highlighter-rouge">ECDHE-RSA-DES-CBC3-SHA</code> cipher to <code class="language-plaintext highlighter-rouge">ECDHE-RSA-AES256-GCM-SHA384</code> by enabling AES in iLO wasn’t enough to get things moving. I had to dig <em>deeper</em>.</p>
<h2 id="unnamed-internet-hero">Unnamed Internet Hero</h2>
<p>A little bit of Googling and I came across something <a href="https://itsjustbytes.wordpress.com/2020/04/22/hp-ilo-4-certificate-upgrade-from-1024-bit-to-2048-bit/">interesting</a>:</p>
<blockquote>
<p>There is an update for ILO 4 that incorporates a new 2048 bit certificate</p>
</blockquote>
<p>My new friend <strong>roadglide03</strong> gave me the hint I needed, along with an upgrade script and some RPMs. A cursory glance at the Perl didn’t reveal anything suspicious, so off I went.</p>
<h2 id="getting-started">Getting Started</h2>
<p>To follow the process to the letter, one would download HPE’s <a href="https://support.hpe.com/hpesc/public/swd/detail?swItemId=MTX_640d4499d8c64ee79f546d439f">Lights-Out Online Configuration Utility for Linux</a> <a href="https://downloads.hpe.com/pub/softlib2/software1/pubsw-linux/p215998034/v182899/hponcfg-5.6.0-0.x86_64.rpm">here</a>. This link provides an RPM that may need to be extracted with <code class="language-plaintext highlighter-rouge">rpm2cpio</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># rpm2cpio hponcfg-5.6.0-0.x86_64.rpm | cpio -id
</code></pre></div></div>
<p>Or, for the Ubuntu folks, the deb works just as well:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># curl https://downloads.linux.hpe.com/SDR/hpPublicKey2048.pub | apt-key add -
# curl https://downloads.linux.hpe.com/SDR/hpPublicKey2048_key1.pub | apt-key add -
# curl https://downloads.linux.hpe.com/SDR/hpePublicKey2048_key1.pub | apt-key add -
# add-apt-repository 'deb http://downloads.linux.hpe.com/SDR/repo/mcp focal/current non-free'
# add-apt-repository 'deb http://downloads.linux.hpe.com/SDR/repo/mcp focal/12.20 non-free'
# apt-get update
# apt-get install hponcfg
</code></pre></div></div>
<p>The thing to know about <code class="language-plaintext highlighter-rouge">hponcfg</code> is that it allows one to modify the <strong>local</strong> iLO only. Fine if you have an OS on your machine and have a handful to manage. Not fine if you have a fleet and/or no operating system (more on that later).</p>
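<p>For the local, single-node case, <code class="language-plaintext highlighter-rouge">hponcfg</code> accepts the same RIBCL XML you’ll see used later in this post. As a hedged example (confirm the flags against your installed version with <code class="language-plaintext highlighter-rouge">hponcfg -h</code>), you can dump the current iLO configuration to a file or apply an XML file locally:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># hponcfg -w /tmp/ilo-current.xml    # write the current iLO configuration to a file
# hponcfg -f /path/to/ribcl.xml      # apply a RIBCL XML file to the local iLO
</code></pre></div></div>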
<h2 id="using-replacesslcertpl">Using replaceSSLcert.pl</h2>
<p>The <code class="language-plaintext highlighter-rouge">replaceSSLcert.pl</code> script works with the following switches:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">--check</code></li>
<li><code class="language-plaintext highlighter-rouge">--update</code></li>
</ul>
<p>Using <code class="language-plaintext highlighter-rouge">--check</code>, you ought to end up with a message like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># perl replaceSSLcert.pl --check
Here's the output:
Pre Check/Update Info Gathering
Gathering info from the local iLO
ILO IP: 172.19.0.27
ILO DNS NAME: lab-infra01-ilo
ILO DOMAIN NAME: shands.local
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: ILO DNS Domain name does not match local domain
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: ILO IP (172.19.0.27) resolves to
which does not match configured ILO DNS Name
lab-infra01-ilo
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Checking certificate for lab-infra01-ilo
CERTIFICATE UPDATE NEEDED
lab-infra01-ilo(172.19.0.27) certificate is only 1024 bits long
Which is less than the minimum length of 2048 bits.
</code></pre></div></div>
<p>Which is to say, there are a lot of complaints here about the state of iLO on this machine. Whatever, I don’t really care, I just want a 2048-bit key:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lab-infra01-ilo(172.19.0.27) certificate is only 1024 bits long
Which is less than the minimum length of 2048 bits.
</code></pre></div></div>
<p>The process of updating the key is handled by the script, and it will work through the following:</p>
<ul>
<li>Generate CSR</li>
<li>Create a CA</li>
<li>Generate a ‘signed’ key</li>
<li>Upload PEM to iLO</li>
</ul>
<p>To help generate a CSR with at least some accurate information, the following blocks in the <code class="language-plaintext highlighter-rouge">replaceSSLcert.pl</code> should be updated to reflect the proper values:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><CSR_STATE VALUE ="TX"/>
<CSR_COUNTRY VALUE ="US"/>
<CSR_LOCALITY VALUE ="San Antonio"/>
<CSR_ORGANIZATION VALUE ="jimmdenton"/>
<CSR_ORGANIZATIONAL_UNIT VALUE ="lab"/>
<CSR_COMMON_NAME VALUE ="lab-infra01-ilo.jimmdenton.com"/>
</code></pre></div></div>
<p>In addition, you should update <code class="language-plaintext highlighter-rouge">/etc/hosts</code> on the machine running <code class="language-plaintext highlighter-rouge">replaceSSLcert.pl</code> with an entry for iLO that includes both the short name <em>and</em> the common name:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>172.19.0.27 lab-infra01-ilo lab-infra01-ilo.jimmdenton.com
</code></pre></div></div>
<p>To verify things work as expected, you can run <code class="language-plaintext highlighter-rouge">openssl</code> to verify the key size before and after the change:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># echo | openssl s_client -connect 172.19.0.27:443 2>/dev/null | openssl x509 -text -noout | grep "Public-Key"
RSA Public-Key: (1024 bit)
</code></pre></div></div>
<p>Now you can run the script with <code class="language-plaintext highlighter-rouge">--update</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># perl replaceSSLcert.pl --update
Pre Check/Update Info Gathering
Gathering info from the local iLO
ILO IP: 172.19.0.27
ILO DNS NAME: lab-infra01-ilo
ILO DOMAIN NAME: shands.local
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: ILO DNS Domain name does not match local domain
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
ILO DNS Name matches ILO IP DNS Lookup (lab-infra01-ilo)
Checking certificate for lab-infra01-ilo
lab-infra01-ilo(172.19.0.27) certificate is only 1024 bits long
Which is less than the minimum length of 2048 bits.
Issuing openssl genrsa command
Issuing openssl req command
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: ILO DNS Domain:
arcanebyte.com
DOES NOT MATCH LOCAL DOMAIN:
openstack.local
THIS NEEDS TO BE FIXED ON THE LOCAL ILO
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
About to update the local iLO certificate with:
FQDN: lab-infra01-ilo
ARE YOU SURE YOU WANT TO CONTINUE?
PLEASE ANSWER 'YES' or 'NO':
</code></pre></div></div>
<p>Answering <code class="language-plaintext highlighter-rouge">YES</code> will allow the script to proceed with the aforementioned steps. The process takes about 30 seconds, give or take, and will result in iLO being reset - so you can expect to lose access if you’re already logged in.</p>
<p>To verify, run <code class="language-plaintext highlighter-rouge">openssl</code> again:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># echo | openssl s_client -connect 172.19.0.27:443 2>/dev/null | openssl x509 -text -noout | grep "Public-Key"
RSA Public-Key: (2048 bit)
</code></pre></div></div>
<p>Sweet! This should rid me of the pesky <code class="language-plaintext highlighter-rouge">certificate key too weak</code> error. Now, how to scale this operation.</p>
<h2 id="remote-ilo-or-using-locfgpl">Remote iLO (or using locfg.pl)</h2>
<p>To scale this out across the lab (about 9 nodes), I wanted to find a way to hit iLO over the network rather than locally. More Googling and a Little Bit of Luck™ led me to the <a href="https://support.hpe.com/hpesc/public/swd/detail?swItemId=MTX_ac3efba097b84fe8acfcbdd5d5">HPE Lights-Out XML PERL Scripting Sample for Linux</a>, which provides (yet another) Perl script for managing iLO remotely: <code class="language-plaintext highlighter-rouge">locfg.pl</code>.</p>
<p>The download provides a ton of sample XML files along with the Perl script itself; we’ll use one of those samples to verify the script actually works. But first, you may need to install a prerequisite:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># apt install libsocket6-perl
</code></pre></div></div>
<p>Running the script, we can see what’s necessary to make it go:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ./locfg.pl
Usage: perl locfg.pl -s server -f inputfile [options]
perl locfg.pl -s ipV4Address -f inputfile [options]
perl locfg.pl -s ipV4Address:portNumber -f inputfile [options]
perl locfg.pl -s ipV6Address -f inputfile [options]
perl locfg.pl -s [ipV6Address] -f inputfile [options]
perl locfg.pl -s [ipV6Address]:portNumber -f inputfile [options]
perl locfg.pl -s DnsName:portnumber -f inputfile [options]
-l logfile log file
-v enable verbose mode
-t substitute variables with values specified(ab=xy,c=z)
-i entering username and password interactively
-u username username
-p password password
-ilo3|-ilo4|-ilo5 target is iLO 3, iLO 4 or iLO 5
Note: Use -u and -p with caution as command line options are
visible on Linux. The '-i' option is for entering the
username and password interactively.
</code></pre></div></div>
<p>To test, I’ll try to get all users using <code class="language-plaintext highlighter-rouge">Get_All_Users.xml</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><RIBCL VERSION="2.0">
<LOGIN USER_LOGIN="adminname" PASSWORD="password">
<USER_INFO MODE="read">
<GET_ALL_USERS/>
</USER_INFO>
</LOGIN>
</RIBCL>
</code></pre></div></div>
<p>Executing it looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># perl locfg.pl -s 172.19.0.24 -u root -p <password> -ilo4 -f Get_All_Users.xml
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
<GET_ALL_USERS>
<USER_LOGIN VALUE="Administrator"/>
<USER_LOGIN VALUE="maas"/>
<USER_LOGIN VALUE="root"/>
</GET_ALL_USERS>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
...Script Succeeded...
</code></pre></div></div>
<p>Looks good! Now comes the fun part of constructing everything that <code class="language-plaintext highlighter-rouge">replaceSSLcert.pl</code> did for us.</p>
<h3 id="generating-things">Generating Things</h3>
<p>The server I hope to attack first is <strong>texas04</strong>, a baremetal node used with Ironic that does not have an operating system installed.</p>
<p>The first step is to generate an RSA key for the CA that will sign the new 2048-bit certificate generated for iLO on <strong>texas04</strong>:</p>
<p>From <strong>lab-infra01</strong>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># mkdir /tmp/texas04/
# /usr/bin/openssl genrsa -out /tmp/texas04/myCA.key 2048 2>/dev/null
</code></pre></div></div>
<p>Then, generate a CA:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># /usr/bin/openssl req -x509 -new -nodes -key /tmp/texas04/myCA.key -sha256 -days 3650 -out /tmp/texas04/myCA.pem -subj "/C=US/ST=TX/L=San Antonio/O=jimmdenton/OU=lab/CN=US ORG" 2>/dev/null
</code></pre></div></div>
<p>Next, we want iLO on <strong>texas04</strong> to generate a CSR using attributes we’ve defined here (which you should change). The <code class="language-plaintext highlighter-rouge">locfg.pl</code> script will be used to trigger iLO to do the needful:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># cat <<EOF >> /tmp/texas04/csr.xml
<RIBCL VERSION="2.0">
<LOGIN USER_LOGIN = "USERID" PASSWORD = "PASSW0RD">
<RIB_INFO MODE="write">
<!-- Default -->
<!-- <CERTIFICATE_SIGNING_REQUEST/> -->
<!-- Custom CSR -->
<CERTIFICATE_SIGNING_REQUEST>
<!-- Change the following to match your needs -->
<CSR_STATE VALUE ="TX"/>
<CSR_COUNTRY VALUE ="US"/>
<CSR_LOCALITY VALUE ="San Antonio"/>
<CSR_ORGANIZATION VALUE ="jimmdenton"/>
<CSR_ORGANIZATIONAL_UNIT VALUE ="lab"/>
<CSR_COMMON_NAME VALUE ="texas04-ilo.jimmdenton.com"/>
</CERTIFICATE_SIGNING_REQUEST>
</RIB_INFO>
</LOGIN>
</RIBCL>
EOF
</code></pre></div></div>
<p>Running this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># perl locfg.pl -s 172.19.0.24 -u root -p <password> -ilo4 -f /tmp/texas04/csr.xml
</code></pre></div></div>
<p>Results in this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The iLO subsystem is currently generating a Certificate Signing Request(CSR), run script after 10 minutes or more to receive the CSR.
</code></pre></div></div>
<p>I waited maybe 2 minutes before proceeding, but you can check the status with this command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># perl locfg.pl -s 172.19.0.24 -u root -p <password> -ilo4 -f /tmp/texas04/csr.xml -l /tmp/texas04/csr.out
</code></pre></div></div>
<p>When the CSR is ready, it will be reflected in <code class="language-plaintext highlighter-rouge">/tmp/texas04/csr.out</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># cat /tmp/texas04/csr.out
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
<CERTIFICATE_SIGNING_REQUEST>
-----BEGIN CERTIFICATE REQUEST-----
MIIC7TCCAdUCAQAwdDEfMB0GA1UEAwwWdGV4YXMwNC5hcmNhbmVieXRlLmNvbTEM
MAoGA1UECwwDTGFiMRMwEQYDVQQKDApBcmNhbmVCeXRlMRQwEgYDVQQHDAtTYW4g
QW50b25pbzELMAkGA1UECAwCVFgxCzAJBgNVBAYTAlVTMIIBIjANBgkqhkiG9w0B
AQEFAAOCAQ8AMIIBCgKCAQEAuQUaPlY4LdIEecqYWEy6WHk4p/J5WyNyJ9o01l/R
dtrfquYsBgNWMZqVRJt8FgCbbLqTUBH+C+aB1E34BPxcKFBvIG2bYQuFf+aokPNc
RuXR8/0pOodQtMJQYrpCZwJMnU6CrDQ5aIl0NCiSOxU6HSxnS/Bkly2PR64JjgWq
bv5794MQQUXtP4bxhOodlJaIVagCenklSIm8xN+/dfjkZdtjo/yVSF79a/DokbNb
iiX+zLCQO11OjCFTJMBC2aub4F2Q9D6fqaAKgp8mdykGLM2GJBvKYEMzqv5/RrcE
qc2I8Uc6CjreDYApYDgsNrEuNG1XhnaeE8P1jBeqhExKvwIDAQABoDQwMgYJKoZI
hvcNAQkOMSUwIzAhBgNVHREEGjAYghZ0ZXhhczA0LmFyY2FuZWJ5dGUuY29tMA0G
CSqGSIb3DQEBCwUAA4IBAQBd7Zxy8Suo48csSDkoLxLnG3Z6zeqNvjAlnENVUfHg
IkKGctpPbzVSvZUJj+uaXGDsjJeg/Qwptab2PU/E2j/QPqt/9bNtl7eEdlqXaGHJ
qJoSL+wi4mO2/wczdax7QLvSvCtJ+HvDKIXwq1ra7cuThlosWjQhUzhKJCrK6PAH
xNcOhJxIGld41To+kH98YPJoWDq4GsD9Fl48OIpjr0ItDo3htGahKOMsinqgDfjc
GFsxEKDrVkDf8iD+7gHgs+VHtslkG5Bz+pIFeza9M4MmKPGaitlUR6K+j7ZiAs/L
N3EP5Ti6I6iUd8SA79i1wFhUzoyqDTk7UURatLu07XyX
-----END CERTIFICATE REQUEST-----
</CERTIFICATE_SIGNING_REQUEST>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
...Script Succeeded...
</code></pre></div></div>
<p>The important bits lie between the brackets:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-----BEGIN CERTIFICATE REQUEST-----
MIIC7TCCAdUCAQAwdDEfMB0GA1UEAwwWdGV4YXMwNC5hcmNhbmVieXRlLmNvbTEM
MAoGA1UECwwDTGFiMRMwEQYDVQQKDApBcmNhbmVCeXRlMRQwEgYDVQQHDAtTYW4g
QW50b25pbzELMAkGA1UECAwCVFgxCzAJBgNVBAYTAlVTMIIBIjANBgkqhkiG9w0B
AQEFAAOCAQ8AMIIBCgKCAQEAuQUaPlY4LdIEecqYWEy6WHk4p/J5WyNyJ9o01l/R
dtrfquYsBgNWMZqVRJt8FgCbbLqTUBH+C+aB1E34BPxcKFBvIG2bYQuFf+aokPNc
RuXR8/0pOodQtMJQYrpCZwJMnU6CrDQ5aIl0NCiSOxU6HSxnS/Bkly2PR64JjgWq
bv5794MQQUXtP4bxhOodlJaIVagCenklSIm8xN+/dfjkZdtjo/yVSF79a/DokbNb
iiX+zLCQO11OjCFTJMBC2aub4F2Q9D6fqaAKgp8mdykGLM2GJBvKYEMzqv5/RrcE
qc2I8Uc6CjreDYApYDgsNrEuNG1XhnaeE8P1jBeqhExKvwIDAQABoDQwMgYJKoZI
hvcNAQkOMSUwIzAhBgNVHREEGjAYghZ0ZXhhczA0LmFyY2FuZWJ5dGUuY29tMA0G
CSqGSIb3DQEBCwUAA4IBAQBd7Zxy8Suo48csSDkoLxLnG3Z6zeqNvjAlnENVUfHg
IkKGctpPbzVSvZUJj+uaXGDsjJeg/Qwptab2PU/E2j/QPqt/9bNtl7eEdlqXaGHJ
qJoSL+wi4mO2/wczdax7QLvSvCtJ+HvDKIXwq1ra7cuThlosWjQhUzhKJCrK6PAH
xNcOhJxIGld41To+kH98YPJoWDq4GsD9Fl48OIpjr0ItDo3htGahKOMsinqgDfjc
GFsxEKDrVkDf8iD+7gHgs+VHtslkG5Bz+pIFeza9M4MmKPGaitlUR6K+j7ZiAs/L
N3EP5Ti6I6iUd8SA79i1wFhUzoyqDTk7UURatLu07XyX
-----END CERTIFICATE REQUEST-----
</code></pre></div></div>
<p>That CSR should get saved in its own file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat <<EOF >> /tmp/texas04/real-csr.out
-----BEGIN CERTIFICATE REQUEST-----
MIIC7TCCAdUCAQAwdDEfMB0GA1UEAwwWdGV4YXMwNC5hcmNhbmVieXRlLmNvbTEM
MAoGA1UECwwDTGFiMRMwEQYDVQQKDApBcmNhbmVCeXRlMRQwEgYDVQQHDAtTYW4g
QW50b25pbzELMAkGA1UECAwCVFgxCzAJBgNVBAYTAlVTMIIBIjANBgkqhkiG9w0B
AQEFAAOCAQ8AMIIBCgKCAQEAuQUaPlY4LdIEecqYWEy6WHk4p/J5WyNyJ9o01l/R
dtrfquYsBgNWMZqVRJt8FgCbbLqTUBH+C+aB1E34BPxcKFBvIG2bYQuFf+aokPNc
RuXR8/0pOodQtMJQYrpCZwJMnU6CrDQ5aIl0NCiSOxU6HSxnS/Bkly2PR64JjgWq
bv5794MQQUXtP4bxhOodlJaIVagCenklSIm8xN+/dfjkZdtjo/yVSF79a/DokbNb
iiX+zLCQO11OjCFTJMBC2aub4F2Q9D6fqaAKgp8mdykGLM2GJBvKYEMzqv5/RrcE
qc2I8Uc6CjreDYApYDgsNrEuNG1XhnaeE8P1jBeqhExKvwIDAQABoDQwMgYJKoZI
hvcNAQkOMSUwIzAhBgNVHREEGjAYghZ0ZXhhczA0LmFyY2FuZWJ5dGUuY29tMA0G
CSqGSIb3DQEBCwUAA4IBAQBd7Zxy8Suo48csSDkoLxLnG3Z6zeqNvjAlnENVUfHg
IkKGctpPbzVSvZUJj+uaXGDsjJeg/Qwptab2PU/E2j/QPqt/9bNtl7eEdlqXaGHJ
qJoSL+wi4mO2/wczdax7QLvSvCtJ+HvDKIXwq1ra7cuThlosWjQhUzhKJCrK6PAH
xNcOhJxIGld41To+kH98YPJoWDq4GsD9Fl48OIpjr0ItDo3htGahKOMsinqgDfjc
GFsxEKDrVkDf8iD+7gHgs+VHtslkG5Bz+pIFeza9M4MmKPGaitlUR6K+j7ZiAs/L
N3EP5Ti6I6iUd8SA79i1wFhUzoyqDTk7UURatLu07XyX
-----END CERTIFICATE REQUEST-----
EOF
</code></pre></div></div>
<p>Now, we generate the PEM:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># /usr/bin/openssl x509 -req -in /tmp/texas04/real-csr.out -CA /tmp/texas04/myCA.pem -CAkey /tmp/texas04/myCA.key -CAcreateserial -out /tmp/texas04/CRT.pem -days 3650 -sha256"
</code></pre></div></div>
<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Signature ok
subject=CN = texas04.arcanebyte.com, OU = lab, O = jimmdenton, L = San Antonio, ST = TX, C = US
Getting CA Private Key
</code></pre></div></div>
<p>Verify:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># cat /tmp/texas04/CRT.pem
-----BEGIN CERTIFICATE-----
MIIDXzCCAkcCFFJKSKj/1ixZEyQEWKTbFs9DOQ66MA0GCSqGSIb3DQEBCwUAMGQx
CzAJBgNVBAYTAlVTMQswCQYDVQQIDAJUWDEUMBIGA1UEBwwLU2FuIEFudG9uaW8x
EzARBgNVBAoMCkFyY2FuZUJ5dGUxDDAKBgNVBAsMA0xhYjEPMA0GA1UEAwwGVVMg
T1JHMB4XDTIxMTIxNjA0MzQwNloXDTMxMTIxNDA0MzQwNlowdDEfMB0GA1UEAwwW
dGV4YXMwNC5hcmNhbmVieXRlLmNvbTEMMAoGA1UECwwDTGFiMRMwEQYDVQQKDApB
cmNhbmVCeXRlMRQwEgYDVQQHDAtTYW4gQW50b25pbzELMAkGA1UECAwCVFgxCzAJ
BgNVBAYTAlVTMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAuQUaPlY4
LdIEecqYWEy6WHk4p/J5WyNyJ9o01l/RdtrfquYsBgNWMZqVRJt8FgCbbLqTUBH+
C+aB1E34BPxcKFBvIG2bYQuFf+aokPNcRuXR8/0pOodQtMJQYrpCZwJMnU6CrDQ5
aIl0NCiSOxU6HSxnS/Bkly2PR64JjgWqbv5794MQQUXtP4bxhOodlJaIVagCenkl
SIm8xN+/dfjkZdtjo/yVSF79a/DokbNbiiX+zLCQO11OjCFTJMBC2aub4F2Q9D6f
qaAKgp8mdykGLM2GJBvKYEMzqv5/RrcEqc2I8Uc6CjreDYApYDgsNrEuNG1Xhnae
E8P1jBeqhExKvwIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQCXPFYDn69ceirgt5TR
i6iBgIsVDEcuFmSj72krf+dTrlt1JYtUVFyRYdLw3MaWy186JF3emq2lvPEyU6SA
fnOSM2lBrxF0LDZ9QpkOb+PWZE1JRthzE5Xxg6q5oUPbR/XJuFLljkg9hz60v5Xd
pGNXcV/Hh4S6EBELfQ94ju73rvuRK149VYSp9TMpzja5GEyKH9xHDgfG+GK/siDB
JlyLlmSwr3PeNZwtwB+rZmkjzxzBvsp9CQSuNiLN6B12OeD946MuvJcQ6hhXkImY
WwURDSE8sII4XYeLT9+4D1gPWbBDAkx5kUCgVqE4jtn232MCGd3Md+4ek23S+Stz
wW0O
-----END CERTIFICATE-----
</code></pre></div></div>
<p>Now, we can tuck the certificate into an XML file for upload:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat <<EOF >> /tmp/texas04/2048cert.xml
<RIBCL VERSION="2.0">
<LOGIN USER_LOGIN = "USERID" PASSWORD = "PASSW0RD">
<RIB_INFO MODE = "write">
<IMPORT_CERTIFICATE>
-----BEGIN CERTIFICATE-----
MIIDXzCCAkcCFFJKSKj/1ixZEyQEWKTbFs9DOQ66MA0GCSqGSIb3DQEBCwUAMGQx
CzAJBgNVBAYTAlVTMQswCQYDVQQIDAJUWDEUMBIGA1UEBwwLU2FuIEFudG9uaW8x
EzARBgNVBAoMCkFyY2FuZUJ5dGUxDDAKBgNVBAsMA0xhYjEPMA0GA1UEAwwGVVMg
T1JHMB4XDTIxMTIxNjA0MzQwNloXDTMxMTIxNDA0MzQwNlowdDEfMB0GA1UEAwwW
dGV4YXMwNC5hcmNhbmVieXRlLmNvbTEMMAoGA1UECwwDTGFiMRMwEQYDVQQKDApB
cmNhbmVCeXRlMRQwEgYDVQQHDAtTYW4gQW50b25pbzELMAkGA1UECAwCVFgxCzAJ
BgNVBAYTAlVTMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAuQUaPlY4
LdIEecqYWEy6WHk4p/J5WyNyJ9o01l/RdtrfquYsBgNWMZqVRJt8FgCbbLqTUBH+
C+aB1E34BPxcKFBvIG2bYQuFf+aokPNcRuXR8/0pOodQtMJQYrpCZwJMnU6CrDQ5
aIl0NCiSOxU6HSxnS/Bkly2PR64JjgWqbv5794MQQUXtP4bxhOodlJaIVagCenkl
SIm8xN+/dfjkZdtjo/yVSF79a/DokbNbiiX+zLCQO11OjCFTJMBC2aub4F2Q9D6f
qaAKgp8mdykGLM2GJBvKYEMzqv5/RrcEqc2I8Uc6CjreDYApYDgsNrEuNG1Xhnae
E8P1jBeqhExKvwIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQCXPFYDn69ceirgt5TR
i6iBgIsVDEcuFmSj72krf+dTrlt1JYtUVFyRYdLw3MaWy186JF3emq2lvPEyU6SA
fnOSM2lBrxF0LDZ9QpkOb+PWZE1JRthzE5Xxg6q5oUPbR/XJuFLljkg9hz60v5Xd
pGNXcV/Hh4S6EBELfQ94ju73rvuRK149VYSp9TMpzja5GEyKH9xHDgfG+GK/siDB
JlyLlmSwr3PeNZwtwB+rZmkjzxzBvsp9CQSuNiLN6B12OeD946MuvJcQ6hhXkImY
WwURDSE8sII4XYeLT9+4D1gPWbBDAkx5kUCgVqE4jtn232MCGd3Md+4ek23S+Stz
wW0O
-----END CERTIFICATE-----
</IMPORT_CERTIFICATE>
<!-- The iLO will be reset after the certificate has been imported. -->
<MOD_GLOBAL_SETTINGS>
<ENFORCE_AES VALUE="Y"/>
<IPMI_DCMI_OVER_LAN_ENABLED VALUE="N"/>
</MOD_GLOBAL_SETTINGS>
<RESET_RIB/>
</RIB_INFO>
</LOGIN>
</RIBCL>
EOF
</code></pre></div></div>
<p>Before the upload commences, check the current state:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># echo | openssl s_client -connect 172.19.0.24:443 2>/dev/null | openssl x509 -text -noout | grep "Public-Key"
RSA Public-Key: (1024 bit)
</code></pre></div></div>
<p>Then, run <code class="language-plaintext highlighter-rouge">locfg.pl</code> with the new XML:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># perl locfg.pl -s 172.19.0.24 -u root -p <password> -ilo4 -f /tmp/texas04/2048cert.xml
</code></pre></div></div>
<p>Output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra01:~/hp# perl locfg.pl -s 172.19.0.24 -u root -p <password> -ilo4 -f /tmp/texas04/2048cert.xml
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
<INFORM>Integrated Lights-Out will reset at the end of the script.</INFORM>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='AES is already Enabled.'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
<INFORM>Integrated Lights-Out will reset at the end of the script.</INFORM>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
<?xml version="1.0"?>
<RIBCL VERSION="2.23">
<RESPONSE
STATUS="0x0000"
MESSAGE='No error'
/>
</RIBCL>
...Script Succeeded...
</code></pre></div></div>
<p>After iLO resets, another look shows the change is successful:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># echo | openssl s_client -connect 172.19.0.24:443 2>/dev/null | openssl x509 -text -noout | grep "Public-Key"
RSA Public-Key: (2048 bit)
</code></pre></div></div>
<h2 id="summary">Summary</h2>
<p>Like everything I do, it is usually preceded by a day’s worth of unexpected, yet related, tasks to get things to a state where I can actually get done what I wanted to get done. And then, that is followed up by new errors that allow the process to repeat ad-nauseum. I’m sure you can all relate.</p>
<p>I don’t think it would take much to update the <code class="language-plaintext highlighter-rouge">replaceSSLcert.pl</code> script to use <code class="language-plaintext highlighter-rouge">locfg.pl</code> to interface with a remote iLO, since it already handles the bulk of the SSL-related generation and would really save time. Better yet, Ansible could be used to make it a pretty quick process. I’ve made both Perl scripts available on my <a href="https://github.com/busterswt/hp_tooling">GitHub</a> for anyone that needs them.</p>
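<p>In the meantime, a simple loop over the lab iLOs gets most of the way there. A rough sketch, assuming the per-node certificate XML files have already been generated as described above (the node names here are placeholders for your own fleet):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for node in texas04 texas05 texas06; do
  ilo=$(getent hosts ${node}-ilo | awk '{print $1}')
  perl locfg.pl -s $ilo -u root -p <password> -ilo4 -f /tmp/$node/2048cert.xml
done
</code></pre></div></div>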
<hr />
<p>If you have some thoughts or comments on this article, I’d love to hear ‘em. Feel free to reach out on Twitter at @jimmdenton or hit me up on LinkedIn.</p>jamesdentonA recent attempt to move away from IPMI to the native HPE iLO 4 driver in my OpenStack Ironic lab showed just how wrong I was to believe it would be a seamless change. What I found was that while ironic-conductor could communicate with iLO, apparently, it didn’t like what it saw: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: EE certificate key too weakConfiguring Masakari (Instance HA) on OpenStack-Ansible2021-12-01T00:00:00+00:002021-12-01T00:00:00+00:00http://www.jimmdenton.com/configuring-masakari-on-openstack-ansible<p>Providing high availability of cloud resources, whether it be networking or virtual machines, is a topic that comes up often in my corner of the world. I’d heard of Instance-HA in some Red Hat circles, and only recently learned the OpenStack Masakari project was the one to provide that functionality.<!--more--></p>
<p>That said, I’d like to kick the tires on it, and what better place to start than my own OpenStack-based lab running OpenStack-Ansible (Xena).</p>
<h2 id="getting-started">Getting Started</h2>
<p>OpenStack Masakari provides high availability of instances by performing the following actions:</p>
<ul>
<li>Host evacuation</li>
<li>Instance restart</li>
</ul>
<p><strong>Host evacuation</strong> is a feature provided by a monitor known as <strong>masakari-hostmonitor</strong>. With this feature, all instances are evacuated from a node that is considered <em>DOWN</em>. There are some requirements for this feature, such as shared storage, and some other considerations that must be made, including the need to fence offline hosts to ensure the evacuation is successful.</p>
<p><strong>Instance restart</strong> is provided by a monitor known as <strong>masakari-instancemonitor</strong>. With this feature, an instance is restarted should its process on the compute node die.</p>
<p>There are additional monitors provided by Masakari, including:</p>
<ul>
<li>masakari-processmonitor</li>
<li>masakari-introspectivemonitor</li>
</ul>
<p>The <strong>masakari-processmonitor</strong> monitor can be used to monitor other processes and services on the host to ensure they are running consistently. Processes and services can be added to the <code class="language-plaintext highlighter-rouge">process_list.yaml</code> file found at <code class="language-plaintext highlighter-rouge">/etc/masakarimonitors/process_list.yaml</code> on <strong><em>compute</em></strong> nodes or other nodes running monitoring agents. Monitored services can be modified using the <code class="language-plaintext highlighter-rouge">masakari_monitors_process_overrides</code> override in OpenStack-Ansible.</p>
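<p>As a hedged illustration (the field names are taken from the masakari-monitors sample file and should be verified against the deployed <code class="language-plaintext highlighter-rouge">process_list.yaml</code> in your environment), an entry for libvirtd resembles the following, and drives the restart behavior shown later in this post:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- process_name: /usr/sbin/libvirtd
  start_command: systemctl start libvirtd
  restart_command: systemctl restart libvirtd
  run_as_root: True
</code></pre></div></div>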
<p>Lastly, the <strong>masakari-introspectivemonitor</strong> monitor can be used to detect system-level failure events via the qemu-qa protocol. Not much has been written about this particular monitor as of yet.</p>
<h3 id="host-evacuation">Host Evacuation</h3>
<p>With Masakari, compute nodes are grouped into <strong><em>failover segments</em></strong>. In the event of a compute node failure, that node’s instances are moved onto another compute node within the <em>same</em> segment. Failover segments are not to be confused with other logical groups of computes, such as availability zones or aggregates, but represent a similar concept.</p>
<p>The destination node is determined by the recovery method configured for the affected segment. There are four methods:</p>
<ul>
<li>reserved_host</li>
<li>auto</li>
<li>rh_priority</li>
<li>auto_priority</li>
</ul>
<p>The <strong>reserved_host</strong> recovery method relocates instances to a subset of <em>non-active</em> nodes. Because these nodes are not active and are typically resourced adequately for failover duty of similarly-equipped active nodes, there is a guarantee that sufficient resources will exist on a reserved node to accommodate migrated instances.</p>
<p>The <strong>auto</strong> recovery method relocates instances to <strong><em>any</em></strong> available node in the same segment. Because all the nodes are active, however, there is no guarantee that sufficient resources will exist on the destination node to accommodate migrated instances.</p>
<p>The <strong>rh_priority</strong> recovery method attempts to evacuate instances using the reserved_host method first, and falls back to the auto method should the reserved_host method fail.</p>
<p>The <strong>auto_priority</strong> recovery method attempts to evacuate instances using the auto method first, and falls back to the reserved_host method should the auto method fail.</p>
<p>Host evacuation requires shared storage and some method of fencing nodes, likely provided by Pacemaker/STONITH and access to the OOB management network. Given these requirements and an incomplete implementation within OpenStack-Ansible at this time, I’ll skip this demonstration.</p>
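<p>For reference, though, failover segments and their member hosts are managed with the masakari plugin for the OpenStack client. Roughly, and treat this as a sketch (check <code class="language-plaintext highlighter-rouge">openstack segment create --help</code> and <code class="language-plaintext highlighter-rouge">openstack segment host create --help</code> for the exact arguments in your release):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># openstack segment create segment1 reserved_host COMPUTE
# openstack segment host create lab-compute02 COMPUTE SSH segment1
</code></pre></div></div>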
<h3 id="instance-restart">Instance Restart</h3>
<p>The instance restart feature is enabled on a <em>per-instance</em> basis using the <code class="language-plaintext highlighter-rouge">HA_Enabled=True</code> property. Once Masakari has been deployed, an agent on the compute node will detect instance failure and (hopefully) restart the instance according to policy.</p>
<h2 id="configuring-and-deploying">Configuring and Deploying</h2>
<p>In an OpenStack-Ansible environment, managing the inventory and group membership is key to deploying.</p>
<p>To enable Masakari, simply add the following to the <code class="language-plaintext highlighter-rouge">openstack_user_config.yml</code> file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>masakari-infra_hosts: *infrastructure_hosts
masakari-monitor_hosts: *compute_hosts
</code></pre></div></div>
<p>Keep in mind, those aliases will only work if you’ve defined them in your environment, like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>infrastructure_hosts: &infrastructure_hosts
lab-infra01:
ip: 10.20.0.30
no_containers: true
lab-infra02:
ip: 10.20.0.22
no_containers: true
lab-infra03:
ip: 10.20.0.23
no_containers: true
compute_hosts: &compute_hosts
lab-compute01:
ip: 10.20.0.31
lab-compute02:
ip: 10.20.0.32
</code></pre></div></div>
<p>Then, execute the following playbooks:</p>
<ul>
<li>haproxy-install.yml</li>
<li>os-masakari-install.yml</li>
</ul>
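<p>In a typical OpenStack-Ansible deployment, that means running them with the <code class="language-plaintext highlighter-rouge">openstack-ansible</code> wrapper from the playbooks directory on the deployment host (paths may differ in your environment):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># cd /opt/openstack-ansible/playbooks
# openstack-ansible haproxy-install.yml
# openstack-ansible os-masakari-install.yml
</code></pre></div></div>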
<p>Once installed, you should notice a few new services across your infrastructure and compute nodes, including:</p>
<h4 id="infra">Infra:</h4>
<ul>
<li>masakari-api.service</li>
<li>masakari-engine.service</li>
</ul>
<h4 id="compute">Compute:</h4>
<ul>
<li>masakari-hostmonitor.service</li>
<li>masakari-instancemonitor.service</li>
<li>masakari-introspectiveinstancemonitor.service</li>
<li>masakari-processmonitor.service</li>
</ul>
<h2 id="testing-instance-restart">Testing Instance Restart</h2>
<p>To test automatic instance restart, I first spun up an instance with the following command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack server create --image "Ubuntu Server 20.04 Focal" --boot-from-volume 20 --flavor m1.small --network LAN --security-group SSH --key-name imac-rsa masakari-vm1
+--------------------------------------+---------------+---------+--------------------+--------------------------+-----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+---------------+---------+--------------------+--------------------------+-----------+
| 5101ed69-00c9-4956-b7fd-7f2256c23474 | masakari-vm1 | ACTIVE | LAN=192.168.2.188 | N/A (booted from volume) | m1.small |
+--------------------------------------+---------------+---------+--------------------+--------------------------+-----------+
</code></pre></div></div>
<p>Then, I set the <code class="language-plaintext highlighter-rouge">HA_Enabled</code> property to <code class="language-plaintext highlighter-rouge">True</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack server set --property HA_Enabled=True masakari-vm1
</code></pre></div></div>
<p>To simulate an unexpected failure, I killed the instance on the compute node:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-compute01:~# pgrep -f guest=instance-00000450
154259
root@lab-compute01:~# pkill -f -9 guest=instance-00000450
</code></pre></div></div>
<p>At the same time, we can see the following events taking place in the log:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Dec 1 16:36:28 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 16:36:28.660 151091 INFO masakarimonitors.instancemonitor.libvirt_handler.callback [-] Libvirt Event: type=VM, hostname=lab-compute01, uuid=5101ed69-00c9-4956-b7fd-7f2256c23474, time=2021-12-01 16:36:28.660446, event_id=LIFECYCLE, detail=STOPPED_FAILED)
Dec 1 16:36:28 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 16:36:28.661 151091 INFO masakarimonitors.ha.masakari [-] Send a notification. {'notification': {'type': 'VM', 'hostname': 'lab-compute01', 'generated_time': datetime.datetime(2021, 12, 1, 16, 36, 28, 660446), 'payload': {'event': 'LIFECYCLE', 'instance_uuid': '5101ed69-00c9-4956-b7fd-7f2256c23474', 'vir_domain_event': 'STOPPED_FAILED'}}}
Dec 1 16:36:29 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 16:36:29.910 151091 INFO masakarimonitors.ha.masakari [-] Response: openstack.instance_ha.v1.notification.Notification(type=VM, hostname=lab-compute01, generated_time=2021-12-01T16:36:28.660446, payload={'event': 'LIFECYCLE', 'instance_uuid': '5101ed69-00c9-4956-b7fd-7f2256c23474', 'vir_domain_event': 'STOPPED_FAILED'}, id=15, notification_uuid=bc7eb253-b5fc-4feb-92fd-46e494772f0d, source_host_uuid=6d39c8c7-9d8f-4faf-a7e7-bbcd7dd5d79d, status=new, created_at=2021-12-01T16:36:29.000000, updated_at=None, location=Munch({'cloud': '10.20.0.11', 'region_name': 'RegionOne', 'zone': None, 'project': Munch({'id': '36de0c24e456401d8df6ffaff42224d0', 'name': None, 'domain_id': None, 'domain_name': None})}))
Dec 1 16:36:43 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 16:36:43.372 151091 INFO masakarimonitors.instancemonitor.libvirt_handler.callback [-] Libvirt Event: type=VM, hostname=lab-compute01, uuid=5101ed69-00c9-4956-b7fd-7f2256c23474, time=2021-12-01 16:36:43.371886, event_id=REBOOT, detail=UNKNOWN)
Dec 1 16:36:43 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 16:36:43.373 151091 INFO masakarimonitors.ha.masakari [-] Send a notification. {'notification': {'type': 'VM', 'hostname': 'lab-compute01', 'generated_time': datetime.datetime(2021, 12, 1, 16, 36, 43, 371886), 'payload': {'event': 'REBOOT', 'instance_uuid': '5101ed69-00c9-4956-b7fd-7f2256c23474', 'vir_domain_event': 'UNKNOWN'}}}
Dec 1 16:36:43 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 16:36:43.383 151091 INFO masakarimonitors.instancemonitor.libvirt_handler.callback [-] Libvirt Event: type=VM, hostname=lab-compute01, uuid=5101ed69-00c9-4956-b7fd-7f2256c23474, time=2021-12-01 16:36:43.383194, event_id=REBOOT, detail=UNKNOWN)
Dec 1 16:36:43 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 16:36:43.384 151091 INFO masakarimonitors.ha.masakari [-] Send a notification. {'notification': {'type': 'VM', 'hostname': 'lab-compute01', 'generated_time': datetime.datetime(2021, 12, 1, 16, 36, 43, 383194), 'payload': {'event': 'REBOOT', 'instance_uuid': '5101ed69-00c9-4956-b7fd-7f2256c23474', 'vir_domain_event': 'UNKNOWN'}}}
Dec 1 16:36:44 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 16:36:44.818 151091 INFO masakarimonitors.ha.masakari [-] Response: openstack.instance_ha.v1.notification.Notification(type=VM, hostname=lab-compute01, generated_time=2021-12-01T16:36:43.371886, payload={'event': 'REBOOT', 'instance_uuid': '5101ed69-00c9-4956-b7fd-7f2256c23474', 'vir_domain_event': 'UNKNOWN'}, id=18, notification_uuid=7a08ef3c-ce4a-48a3-ba1a-05d8890c4b9f, source_host_uuid=6d39c8c7-9d8f-4faf-a7e7-bbcd7dd5d79d, status=new, created_at=2021-12-01T16:36:44.000000, updated_at=None, location=Munch({'cloud': '10.20.0.11', 'region_name': 'RegionOne', 'zone': None, 'project': Munch({'id': '36de0c24e456401d8df6ffaff42224d0', 'name': None, 'domain_id': None, 'domain_name': None})}))
Dec 1 16:36:44 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 16:36:44.918 151091 INFO masakarimonitors.ha.masakari [-] Stop retrying to send a notification because same notification have been already sent.
</code></pre></div></div>
<p>A look at the process list shows a new PID:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-compute01:~# pgrep -f guest=instance-00000450
168720
</code></pre></div></div>
<p>Finally, a simultaneous ping test shows the ping fail and subsequently recover once the instance has been restarted:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra01:~# ping 192.168.2.188
PING 192.168.2.188 (192.168.2.188) 56(84) bytes of data.
64 bytes from 192.168.2.188: icmp_seq=1 ttl=64 time=2.22 ms
64 bytes from 192.168.2.188: icmp_seq=2 ttl=64 time=1.06 ms
64 bytes from 192.168.2.188: icmp_seq=3 ttl=64 time=0.891 ms
64 bytes from 192.168.2.188: icmp_seq=4 ttl=64 time=0.909 ms
64 bytes from 192.168.2.188: icmp_seq=5 ttl=64 time=0.933 ms
64 bytes from 192.168.2.188: icmp_seq=6 ttl=64 time=0.813 ms
64 bytes from 192.168.2.188: icmp_seq=7 ttl=64 time=0.978 ms
64 bytes from 192.168.2.188: icmp_seq=8 ttl=64 time=0.967 ms
64 bytes from 192.168.2.188: icmp_seq=9 ttl=64 time=0.884 ms
64 bytes from 192.168.2.188: icmp_seq=10 ttl=64 time=0.969 ms
64 bytes from 192.168.2.188: icmp_seq=11 ttl=64 time=0.937 ms
64 bytes from 192.168.2.188: icmp_seq=12 ttl=64 time=0.876 ms
...
64 bytes from 192.168.2.188: icmp_seq=40 ttl=64 time=1.97 ms
64 bytes from 192.168.2.188: icmp_seq=41 ttl=64 time=0.886 ms
64 bytes from 192.168.2.188: icmp_seq=42 ttl=64 time=1.70 ms
64 bytes from 192.168.2.188: icmp_seq=43 ttl=64 time=0.728 ms
</code></pre></div></div>
<h2 id="testing-service-restart">Testing Service Restart</h2>
<p>A similar test can be used to verify the automatic restart of crucial services, such as libvirtd, nova-compute, and others.</p>
<p>Here, I ungracefully kill the libvirtd process:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-compute01:~# pgrep -f libvirtd
183773
183807
root@lab-compute01:~# killall libvirtd
</code></pre></div></div>
<p>Masakari gets to work restarting the service:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Dec 1 20:21:56 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 20:21:56.399 151091 WARNING masakarimonitors.instancemonitor.instance [-] Libvirt Connection Closed Unexpectedly.
Dec 1 20:21:56 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 20:21:56.400 151091 WARNING masakarimonitors.instancemonitor.instance [-] Error from libvirt : internal error: client socket is closed
Dec 1 20:21:56 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 20:21:56.400 151091 WARNING masakarimonitors.instancemonitor.instance [-] Error from libvirt : internal error: client socket is closed
Dec 1 20:21:56 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 20:21:56.401 151091 WARNING masakarimonitors.instancemonitor.instance [-] Error from libvirt : internal error: client socket is closed
Dec 1 20:21:56 lab-compute01 masakari-instancemonitor[151091]: message repeated 2 times: [ 2021-12-01 20:21:56.401 151091 WARNING masakarimonitors.instancemonitor.instance [-] Error from libvirt : internal error: client socket is closed]
Dec 1 20:21:56 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 20:21:56.402 151091 WARNING masakarimonitors.instancemonitor.instance [-] Error from libvirt : internal error: client socket is closed
Dec 1 20:21:56 lab-compute01 masakari-instancemonitor[151091]: message repeated 2 times: [ 2021-12-01 20:21:56.402 151091 WARNING masakarimonitors.instancemonitor.instance [-] Error from libvirt : internal error: client socket is closed]
Dec 1 20:21:56 lab-compute01 masakari-instancemonitor[151091]: 2021-12-01 20:21:56.403 151091 WARNING masakarimonitors.instancemonitor.instance [-] Error from libvirt : internal error: client socket is closed
Dec 1 20:21:56 lab-compute01 masakari-introspectiveinstancemonitor[151131]: 2021-12-01 20:21:56.460 151131 WARNING masakarimonitors.introspectiveinstancemonitor.instance [-] Libvirt Connection Closed Unexpectedly.
Dec 1 20:21:56 lab-compute01 masakari-introspectiveinstancemonitor[151131]: 2021-12-01 20:21:56.461 151131 WARNING masakarimonitors.introspectiveinstancemonitor.instance [-] Error from libvirt : internal error: client socket is closed
Dec 1 20:21:56 lab-compute01 masakari-introspectiveinstancemonitor[151131]: 2021-12-01 20:21:56.461 151131 WARNING masakarimonitors.introspectiveinstancemonitor.instance [-] Error from libvirt : internal error: client socket is closed
Dec 1 20:21:56 lab-compute01 masakari-processmonitor[184385]: 2021-12-01 20:21:56.937 184385 WARNING masakarimonitors.processmonitor.process_handler.handle_process [-] Process '/usr/sbin/libvirtd' is not found.
Dec 1 20:21:57 lab-compute01 masakari-processmonitor[184385]: 2021-12-01 20:21:57.046 184385 INFO masakarimonitors.processmonitor.process_handler.handle_process [-] Restart of process with executing command: systemctl restart libvirtd
</code></pre></div></div>
<p>A look at the process list shows a new PID:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-compute01:~# pgrep -f libvirtd
184026
184063
</code></pre></div></div>
<h2 id="summary">Summary</h2>
<p>Automated <em>anything</em> can be a delicate balance of risk and reward. I’m glad to have had some time looking at OpenStack Masakari, and while the instance restart functionality is great, I’m really looking forward to helping implement the host evacuation capabilities within OpenStack-Ansible in the coming <del>weeks</del> <del>months</del> <del>years</del> <em>sometime</em>.</p>
<hr />
<p>If you have some thoughts or comments on this article, I’d love to hear ‘em. Feel free to reach out on Twitter at @jimmdenton or hit me up on LinkedIn.</p>jamesdentonProviding high availability of cloud resources, whether it be networking or virtual machines, is a topic that comes up often in my corner of the world. I’d heard of Instance-HA in some Red Hat circles, and only recently learned the OpenStack Masakari project was the one to provide that functionality.[NSX] Installing VMware vCenter Server Appliance on OpenStack2021-04-09T00:00:00+00:002021-04-09T00:00:00+00:00http://www.jimmdenton.com/installing-vcenter-on-openstack<p>While working through the process of installing VMware NSX-T, I have not yet determined whether it is a standalone product or requires the use of vCenter (vSphere Client). I know NSX-T supports both ESXi and KVM hypervisors, so I will have to clear up this confusion later. However, I no longer have ESX anywhere in my home lab to host a vCenter appliance, so my mission has been to install NSX-T and supporting resources on my existing OpenStack cloud running OpenStack-Ansible (Ussuri). <!--more--></p>
<p>vCenter ships as an ISO that would ordinarily be installed on a virtual machine running on ESX. Join me while I attempt (and succeed) in deploying vCenter on top of OpenStack.</p>
<p>This post is Part 2 of a series of posts about installing NSX-T and supporting resources onto an OpenStack cloud to be used with a separate OpenStack cloud. If you haven’t read it yet, check out <a href="https://www.jimmdenton.com/installing-nsxt-manager-on-openstack/">Installing VMware NSX-T Manager on OpenStack</a>, the first post in this series.</p>
<h2 id="getting-started">Getting Started</h2>
<p>This isn’t my first rodeo in shoehorning operating systems onto (cloud) platforms they weren’t meant to run on, and for things that usually run on ESX, that means making a KVM-based environment look a lot like a VMware-based environment. For a VM, that may mean using <strong>sata</strong> disks instead of <strong>virtio</strong> disks, or <strong>e1000</strong> instead of <strong>virtio</strong> nics. Not much is different here.</p>
<p>I found a few <a href="https://github.com/jeffmcutter/vcsa_on_kvm/blob/master/vcenter-install.yaml">repos</a> on GitHub where folks have deployed VCSA 6.0 and 6.5/6.7 using KVM, and those were super helpful starting points. For this NSX lab, vCenter 7.0 will be used.</p>
<h3 id="procuring-vcenter">Procuring vCenter</h3>
<p>I have an active VMUG membership which allows access to NSX and vCenter, along with ESX and all sorts of other stuff. To start, I downloaded the latest VCSA image, VMware-VCSA-all-7.0.2-17694817.iso, and corresponding license key.</p>
<h3 id="extracting-files">Extracting Files</h3>
<p>The following packages are needed to peek inside the VCSA ISO and grab the components we need to make them compatible with OpenStack (KVM):</p>
<ul>
<li>bsdtar</li>
<li>qemu-utils</li>
<li>virtinst</li>
</ul>
<p>Install the packages with the following command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># apt install bsdtar qemu-utils virtinst
</code></pre></div></div>
<p>Extract the iso to stdout and untar the ova directly into <code class="language-plaintext highlighter-rouge">/tmp</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># bsdtar -xvOf VMware-VCSA-all-7.0.2-17694817.iso vcsa/VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10.ova | tar xv -C /tmp/ -xvf -
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">/tmp</code> directory will end up with all sorts of files needed to make the magic happen:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-rw-r--r-- 1 64 64 102400 Mar 2 23:01 VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10.cert
-rw-r--r-- 1 64 64 735812096 Mar 2 23:01 VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10-disk1.vmdk
-rw-r--r-- 1 64 64 5571473920 Mar 2 23:02 VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10-disk2.vmdk
-rw-r--r-- 1 64 64 72704 Mar 2 23:02 VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10-disk3.vmdk
-rw-r--r-- 1 64 64 14578 Mar 2 23:01 VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10-file1.json
-rw-r--r-- 1 64 64 89613741 Mar 2 23:01 VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10-file2.rpm
-rw-r--r-- 1 64 64 856 Mar 2 23:01 VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10.mf
-rw-r--r-- 1 64 64 182520 Mar 2 23:01 VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10.ovf
</code></pre></div></div>
<p>Convert the disks from <code class="language-plaintext highlighter-rouge">vmdk</code> to <code class="language-plaintext highlighter-rouge">qcow2</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># qemu-img convert -O qcow2 /tmp/VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10-disk1.vmdk /tmp/vcenter70-disk1.qcow2
# qemu-img convert -O qcow2 /tmp/VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10-disk2.vmdk /tmp/vcenter70-disk2.qcow2
# qemu-img convert -O qcow2 /tmp/VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10-disk3.vmdk /tmp/vcenter70-disk3.qcow2
</code></pre></div></div>
<p>When attaching multiple disks to an OpenStack instance, the disks will be volumes hosted by Cinder. To create a volume from an image, you must know the size of the disk needed. Using the <code class="language-plaintext highlighter-rouge">VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10-file1.json</code> file, I was able to determine the vCPU count, RAM, and the size of each of the 16 disks that need to be attached to the VCSA appliance:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>system disk1: 48GB
cloudcomponents disk2: 6GB
swap disk3: 25GB
core disk4: 25GB
log disk5: 10GB
db disk6: 10GB
dblog disk7: 15GB
seat disk8: 10GB
netdump disk9: 1GB
autodeploy disk10: 10GB
imagebuilder disk11: 10GB
updatemgr disk12: 100GB
archive disk13: 50GB
vtsdb disk14: 10GB
vtsdblog disk15: 5GB
disk-lifecycle disk16: 100GB
</code></pre></div></div>
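<p>If you want to poke around the sizing data yourself, pretty-printing the JSON makes the properties much easier to read. The exact key names aren’t reproduced here, so treat this as a starting point rather than a recipe:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># python3 -m json.tool /tmp/VMware-vCenter-Server-Appliance-7.0.2.00000-17694817_OVF10-file1.json | less
</code></pre></div></div>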
<p>The plan, then, was to pre-create volumes for disks 2-16, with disks 2 and 3 based on the <code class="language-plaintext highlighter-rouge">vmdk</code> (now <code class="language-plaintext highlighter-rouge">qcow2</code>) disks included with the ISO. The other volumes would be blank but sized accordingly.</p>
<h2 id="upload-images">Upload images</h2>
<p>Using the <code class="language-plaintext highlighter-rouge">openstack</code> command, upload the <code class="language-plaintext highlighter-rouge">qcow2</code> images:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for i in {1..3}; do
openstack image create --disk-format qcow2 --container-format bare --file /tmp/vcenter70-disk$i.qcow2 vcenter70-disk$i
done
</code></pre></div></div>
<p>Verify:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra01:/home/jdenton/vcsa# openstack image list | grep vcenter
| b2e45a89-0d77-400e-8a46-8183d8926382 | vcenter70-disk1 | active |
| 176329d8-cdbe-4724-9202-a6e5f03484c4 | vcenter70-disk2 | active |
| d2a01904-7ab0-4786-8219-41f3a11465f1 | vcenter70-disk3 | active |
</code></pre></div></div>
<p>As mentioned earlier, certain hardware needs to be present for the VCSA appliance to work properly. Notably, a SATA bus and e1000 NIC. The <code class="language-plaintext highlighter-rouge">disk1</code> image is the system, or root, disk for the appliance, so it can be modified with some property adjustments to support the needed hardware:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack image set \
--property hw_disk_bus=sata \
--property hw_vif_model=e1000 \
vcenter70-disk1
</code></pre></div></div>
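<p>A quick <code class="language-plaintext highlighter-rouge">openstack image show</code> confirms the properties were applied:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># openstack image show vcenter70-disk1 -c properties
</code></pre></div></div>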
<h2 id="create-volumes">Create volumes</h2>
<p>Using the <code class="language-plaintext highlighter-rouge">openstack</code> command, create two volumes for <code class="language-plaintext highlighter-rouge">disk2</code> and <code class="language-plaintext highlighter-rouge">disk3</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack volume create \
--image vcenter70-disk2 \
--size 6 \
--desc cloudcomponents \
vcenter70-disk2
openstack volume create \
--image vcenter70-disk3 \
--size 26 \
--desc swap \
vcenter70-disk3
</code></pre></div></div>
<p>The size may need to be rounded up to ensure the volume is created successfully, which is why the 25GB swap disk was given a 26GB volume above.</p>
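<p>If a volume creation fails with a size-related error, check the virtual size reported by <code class="language-plaintext highlighter-rouge">qemu-img info</code> and round up to the next whole gigabyte; Cinder won’t build a volume smaller than the image’s virtual size:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># qemu-img info /tmp/vcenter70-disk3.qcow2
</code></pre></div></div>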
<p>Verify:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra01:/home/jdenton/vcsa# openstack volume list | grep vcenter
| 2528e324-b293-49b6-ab54-7fbe0c732e2e | vcenter70-disk3 | downloading | 26 |
| 41380794-e410-4f0f-8ae3-0fd0a110fb9c | vcenter70-disk2 | downloading | 6 |
</code></pre></div></div>
<p>Create empty volumes for the remainder:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack volume create --size 25 vcenter70-disk4 --desc core
openstack volume create --size 10 vcenter70-disk5 --desc log
openstack volume create --size 10 vcenter70-disk6 --desc db
openstack volume create --size 15 vcenter70-disk7 --desc dblog
openstack volume create --size 10 vcenter70-disk8 --desc seat
openstack volume create --size 1 vcenter70-disk9 --desc netdump
openstack volume create --size 10 vcenter70-disk10 --desc autodeploy
openstack volume create --size 10 vcenter70-disk11 --desc imagebuilder
openstack volume create --size 100 vcenter70-disk12 --desc updatemgr
openstack volume create --size 50 vcenter70-disk13 --desc archive
openstack volume create --size 10 vcenter70-disk14 --desc vtsdb
openstack volume create --size 5 vcenter70-disk15 --desc vtsdblog
openstack volume create --size 100 vcenter70-disk16 --desc disk-lifecycle
</code></pre></div></div>
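<p>If you’d rather not paste thirteen commands, the same volumes can be created with a quick loop (sizes and descriptions identical to the list above):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while read -r size name desc; do
  openstack volume create --size "$size" --desc "$desc" "$name"
done &lt;&lt;'EOF'
25 vcenter70-disk4 core
10 vcenter70-disk5 log
10 vcenter70-disk6 db
15 vcenter70-disk7 dblog
10 vcenter70-disk8 seat
1 vcenter70-disk9 netdump
10 vcenter70-disk10 autodeploy
10 vcenter70-disk11 imagebuilder
100 vcenter70-disk12 updatemgr
50 vcenter70-disk13 archive
10 vcenter70-disk14 vtsdb
5 vcenter70-disk15 vtsdblog
100 vcenter70-disk16 disk-lifecycle
EOF
</code></pre></div></div>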
<p>After a few minutes, all of the volumes should be listed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra01:/home/jdenton/vcsa# openstack volume list | grep vcenter
| 5e443e77-f0cb-4ca9-a741-330ea215d4f1 | vcenter70-disk16 | available | 100 |
| 145f4c56-5565-41f7-a150-bc81aa50c519 | vcenter70-disk15 | available | 5 |
| 03b209ea-ad83-40a5-9cd4-60e4de1395b7 | vcenter70-disk14 | available | 10 |
| 467608e0-85b4-44da-92f2-375c8d4d2c25 | vcenter70-disk13 | available | 50 |
| 56a03176-ea36-4ad6-861e-565571ce9d12 | vcenter70-disk12 | available | 100 |
| 25000084-0c20-4dbd-949a-20065a64c143 | vcenter70-disk11 | available | 10 |
| e37ca38d-28c1-41cd-a1ac-fd13d48825dc | vcenter70-disk10 | available | 10 |
| cc7cf024-1bfb-4c75-8c9a-51b3f256079b | vcenter70-disk9 | available | 1 |
| 10afd11d-0d7a-4797-9b5f-bdb1ff42695f | vcenter70-disk8 | available | 10 |
| 07dcf745-510c-48a1-91d4-11aef9f7cc96 | vcenter70-disk7 | available | 15 |
| 9f95acd1-0131-4adc-9e93-a71a73ac57c3 | vcenter70-disk6 | available | 10 |
| 99ac0727-555a-4047-a1ec-4936ed9f9963 | vcenter70-disk5 | available | 10 |
| b27ffa4e-aa41-4481-83b4-d629e6e3dadf | vcenter70-disk4 | available | 25 |
| 2528e324-b293-49b6-ab54-7fbe0c732e2e | vcenter70-disk3 | available | 26 |
| 41380794-e410-4f0f-8ae3-0fd0a110fb9c | vcenter70-disk2 | available | 6 |
</code></pre></div></div>
<h2 id="create-a-flavor">Create a flavor</h2>
<p>Based on the information presented in the JSON file, I found that there are different sizes of vCenter deployments that support tens or hundreds of nodes. For this environment, a <strong>tiny</strong> sizing will work well:</p>
<ul>
<li>vCPU: 2</li>
<li>RAM: 12GB</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack flavor create \
--vcpu 2 \
--ram 12288 \
vcsa-tiny
</code></pre></div></div>
<h2 id="do-the-networking">Do the networking</h2>
<p>vCenter has port/traffic requirements that can be found <a href="https://ports.vmware.com/home/vSphere-7">here</a>. The following command(s) create a new security group and rules that can be applied to the VCSA:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack security group create vcsa
openstack security group rule create vcsa --protocol icmp
openstack security group rule create vcsa --protocol tcp --dst-port 443
openstack security group rule create vcsa --protocol tcp --dst-port 80
openstack security group rule create vcsa --protocol tcp --dst-port 22
openstack security group rule create vcsa --protocol tcp --dst-port 5480
</code></pre></div></div>
<p>The appliance needs at least one (1) interface for management. It supports DHCP, so I’ve pre-created a Neutron port with the security group applied:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack port create --network LAN --security-group vcsa VCSA1
...
| fixed_ips | ip_address='192.168.2.190', subnet_id='1d500a35-ff27-4aa2-9201-82159ce1b2f5' |
</code></pre></div></div>
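<p>The assigned address is needed again for DNS below; it can always be pulled back out of Neutron:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># openstack port show VCSA1 -c fixed_ips
</code></pre></div></div>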
<h3 id="dns">DNS</h3>
<p>My experience with vCenter tells me that functioning forward/reverse DNS is extremely important for a successful deployment. I have an Unbound DNS service running in my environment, which makes it super simple to implement forward and reverse entries for any FQDN/IP. Here’s a working example for my vCenter host:</p>
<p>Hostname: vcsa1.jimmdenton.com
IP: 192.168.2.190</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>local-data: "vcsa1.jimmdenton.com. A 192.168.2.190"
local-data-ptr: "192.168.2.190 vcsa1.jimmdenton.com."
</code></pre></div></div>
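<p>Before moving on, it’s worth confirming that both records resolve from a host that uses this resolver; the forward lookup should return <code class="language-plaintext highlighter-rouge">192.168.2.190</code> and the reverse lookup should return <code class="language-plaintext highlighter-rouge">vcsa1.jimmdenton.com</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># dig +short vcsa1.jimmdenton.com
# dig +short -x 192.168.2.190
</code></pre></div></div>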
<p>Once DNS is in place, it’s time to boot a server.</p>
<h2 id="boot-the-appliance">Boot the appliance</h2>
<p>Using the <code class="language-plaintext highlighter-rouge">nova</code> command, boot the appliance with the first disk using <code class="language-plaintext highlighter-rouge">source=image</code> and <code class="language-plaintext highlighter-rouge">bootindex=0</code>. Additional disks should be attached, in order, using the <code class="language-plaintext highlighter-rouge">sata</code> bus:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nova boot --flavor vcsa-tiny \
--block-device source=image,id=b2e45a89-0d77-400e-8a46-8183d8926382,dest=volume,size=49,shutdown=preserve,bootindex=0 \
--nic port-id=35aff7fb-2c39-4041-bd52-d24e4a264ad8 \
--block-device source=volume,dest=volume,id=41380794-e410-4f0f-8ae3-0fd0a110fb9c,bus=sata,shutdown=preserve,bootindex=1 \
--block-device source=volume,dest=volume,id=2528e324-b293-49b6-ab54-7fbe0c732e2e,bus=sata,shutdown=preserve,bootindex=2 \
--block-device source=volume,dest=volume,id=b27ffa4e-aa41-4481-83b4-d629e6e3dadf,bus=sata,shutdown=preserve,bootindex=3 \
--block-device source=volume,dest=volume,id=99ac0727-555a-4047-a1ec-4936ed9f9963,bus=sata,shutdown=preserve,bootindex=4 \
--block-device source=volume,dest=volume,id=9f95acd1-0131-4adc-9e93-a71a73ac57c3,bus=sata,shutdown=preserve,bootindex=5 \
--block-device source=volume,dest=volume,id=07dcf745-510c-48a1-91d4-11aef9f7cc96,bus=sata,shutdown=preserve,bootindex=6 \
--block-device source=volume,dest=volume,id=10afd11d-0d7a-4797-9b5f-bdb1ff42695f,bus=sata,shutdown=preserve,bootindex=7 \
--block-device source=volume,dest=volume,id=cc7cf024-1bfb-4c75-8c9a-51b3f256079b,bus=sata,shutdown=preserve,bootindex=8 \
--block-device source=volume,dest=volume,id=e37ca38d-28c1-41cd-a1ac-fd13d48825dc,bus=sata,shutdown=preserve,bootindex=9 \
--block-device source=volume,dest=volume,id=25000084-0c20-4dbd-949a-20065a64c143,bus=sata,shutdown=preserve,bootindex=10 \
--block-device source=volume,dest=volume,id=56a03176-ea36-4ad6-861e-565571ce9d12,bus=sata,shutdown=preserve,bootindex=11 \
--block-device source=volume,dest=volume,id=467608e0-85b4-44da-92f2-375c8d4d2c25,bus=sata,shutdown=preserve,bootindex=12 \
--block-device source=volume,dest=volume,id=03b209ea-ad83-40a5-9cd4-60e4de1395b7,bus=sata,shutdown=preserve,bootindex=13 \
--block-device source=volume,dest=volume,id=145f4c56-5565-41f7-a150-bc81aa50c519,bus=sata,shutdown=preserve,bootindex=14 \
--block-device source=volume,dest=volume,id=5e443e77-f0cb-4ca9-a741-330ea215d4f1,bus=sata,shutdown=preserve,bootindex=15 \
vcsa1
</code></pre></div></div>
<p>Depending on the speed of your network and storage device, it may be a few minutes before the instance becomes <code class="language-plaintext highlighter-rouge">ACTIVE</code>. Once active, check out the console.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra01:/home/jdenton/vcsa# openstack console url show vcsa1
+-------+--------------------------------------------------------------------------------------------+
| Field | Value |
+-------+--------------------------------------------------------------------------------------------+
| type | novnc |
| url | https://10.20.0.10:6080/vnc_lite.html?path=%3Ftoken%3Dc3b68bb1-c1b3-4771-ba4a-5d4a49f98a42 |
+-------+--------------------------------------------------------------------------------------------+
</code></pre></div></div>
<p>The device may go through a series of reboots and/or service restarts before settling on the familiar VMware console dashboard:</p>
<p><img src="../assets/images/2021-04-09-installing-vcenter-on-openstack/console_dash.png" alt="" /></p>
<p>The good news is that DHCP worked and the appliance picked up its IP address and other network configuration.</p>
<p>Before you can log in, a <code class="language-plaintext highlighter-rouge">root</code> password must be set. Hit <strong>F2</strong> and change the password. For this exercise, I set the password to <code class="language-plaintext highlighter-rouge">0p3nst@ck$$NSX</code>.</p>
<h2 id="keep-going">Keep going</h2>
<p>At this point, the installation process is really just getting started. The remainder of the process occurs within the vCenter Server Management dashboard in a browser. The dashboard can be reached on port 5480:</p>
<p><a href="https://vcsa1.jimmdenton.com:5480">https://vcsa1.jimmdenton.com:5480</a></p>
<p>Log in as the <code class="language-plaintext highlighter-rouge">root</code> user with the newly-minted password:</p>
<p><img src="../assets/images/2021-04-09-installing-vcenter-on-openstack/getting_started.png" alt="" /></p>
<p>Once logged in, click <strong>Setup</strong>, then <strong>Next</strong> to begin Stage 2.</p>
<p>Leave the network configuration alone (rather, leave it set to DHCP), but set the following:</p>
<ul>
<li>Time synchronization mode: Synchronize time with NTP Servers</li>
<li>SSH Access: Enabled</li>
</ul>
<p>My NTP server is <code class="language-plaintext highlighter-rouge">172.22.0.5</code>, but use what’s right for you. Hit <strong>Next</strong>.</p>
<p>On the SSO Configuration screen, enter what’s appropriate for your environment. In this environment, I will build a new SSO domain.</p>
<ul>
<li>SSO Domain Name: <code class="language-plaintext highlighter-rouge">jimmdenton.com</code></li>
<li>Username: <code class="language-plaintext highlighter-rouge">administrator</code></li>
<li>Password: <code class="language-plaintext highlighter-rouge">0p3nst@ck$$NSX</code></li>
</ul>
<p><img src="../assets/images/2021-04-09-installing-vcenter-on-openstack/sso_config.png" alt="" /></p>
<p>Hit <strong>Next</strong>.</p>
<p>On the following screen, accept (or not) the CEIP agreement, then hit <strong>Next</strong>. Once details are confirmed, hit <strong>Finish</strong>.</p>
<p><img src="../assets/images/2021-04-09-installing-vcenter-on-openstack/finish.png" alt="" /></p>
<p>Once the installation process has started, you will not be able to stop it. The install process may require you to log back in to the GUI after 10-15 minutes as services are (re)started. Log back in as <code class="language-plaintext highlighter-rouge">root</code>, then wait some more. For me, the entire process took approximately 20 minutes.</p>
<p><img src="../assets/images/2021-04-09-installing-vcenter-on-openstack/complete.png" alt="" /></p>
<p>Finally, log back in as <code class="language-plaintext highlighter-rouge">root</code> to view the <strong>vCenter Server Management</strong> dashboard.</p>
<p><img src="../assets/images/2021-04-09-installing-vcenter-on-openstack/vcsa_dashboard.png" alt="" /></p>
<h2 id="vcenter-client">vCenter Client</h2>
<p>To open the <strong>vSphere Client</strong> dashboard, navigate to <a href="https://vcsa1.jimmdenton.com">https://vcsa1.jimmdenton.com</a> and hit the <strong>Launch vSphere Client (HTML5)</strong> button.</p>
<p><img src="../assets/images/2021-04-09-installing-vcenter-on-openstack/vsphere_login.png" alt="" /></p>
<p>Here, you will log in with the credentials set during the SSO creation process:</p>
<ul>
<li>username: <code class="language-plaintext highlighter-rouge">administrator@jimmdenton.com</code></li>
<li>password: <code class="language-plaintext highlighter-rouge">0p3nst@ck$$NSX</code></li>
</ul>
<p>Once successfully logged in, you will see the vSphere Client dashboard in all its glory:</p>
<p><img src="../assets/images/2021-04-09-installing-vcenter-on-openstack/vsphere_main.png" alt="" /></p>
<p>Where you go from here is up to you!</p>
<hr />
<p>If you have some thoughts or comments on this process, I’d love to hear ‘em. Feel free to reach out on Twitter at @jimmdenton or hit me up on LinkedIn.</p>jamesdentonWhile working through the process of installing VMware NSX-T, I have not yet determined whether it is a standalone product or requires the use of vCenter (vSphere Client). I know NSX-T supports both ESXi and KVM hypervisors, so I will have to clear up this confusion later. However, I no longer have ESX anywhere in my home lab to host a vCenter appliance, so my mission has been to install NSX-T and supporting resources on my existing OpenStack cloud running OpenStack-Ansible (Ussuri).[NSX] Installing VMware NSX-T Manager on OpenStack2021-04-07T00:00:00+00:002021-04-07T00:00:00+00:00http://www.jimmdenton.com/installing-nsxt-manager-on-openstack<p>For a long time now I’ve been interested in better understanding alternatives to a ‘vanilla’ Neutron deployment, but other than demonstrations and some hacking on OpenContrail a few years ago and Plumgrid years before that, I’ve really kept it simple by sticking to the upstream components and features.</p>
<p>VMware’s <strong>NSX-T</strong> product has been on my roadmap since it was first introduced as “compatible with All The Clouds™”, and I’m hoping to deploy the NSX-T Manager and other components on my OpenStack cloud as virtual machine instances that in turn manage networking for a yet-to-be-deployed OpenStack-Ansible based OpenStack cloud in the home lab.<!--more--></p>
<p>This post demonstrates the steps involved in prepping an OpenStack cloud to host the NSX-T Manager appliance. Future posts will cover additional requirements.</p>
<p>First off, you’ll need the following:</p>
<ul>
<li>An OpenStack cloud!</li>
<li>Cinder volume support</li>
<li>At least one (1) network for management</li>
<li>NSX-T software</li>
</ul>
<p>I loosely followed the guide <a href="https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/installation/GUID-4200878F-DC59-4BE1-967C-374E5C985B9A.html">here</a> and modified the KVM-based installation accordingly.</p>
<h2 id="obtaining-software">Obtaining Software</h2>
<p>I can’t really help much when it comes to obtaining the NSX-T software and licenses, other than to say you may want to speak to your VMware representative. Don’t have one? The VMware Users Group (VMUG) provides a subscription to a host of VMware products for a reasonable yearly subscription fee. <a href="https://www.vmug.com/home">Check it out!</a>.</p>
<p>When downloading the software, you’ll want to grab the following two <em>qcow2</em> images:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">nsx-unified-appliance-3.1.0.0.0.17107212-le.qcow2</code></li>
<li><code class="language-plaintext highlighter-rouge">nsx-unified-appliance-secondary-3.1.0.0.0.17107212-le.qcow2</code></li>
</ul>
<p>Versioning may change, but you need both the (unmarked) primary and secondary unified appliance images.</p>
<h2 id="prep-work">Prep Work</h2>
<p>Before starting, we need to create some resources in the OpenStack cloud hosting the NSX Manager, including security group rules and port(s) for the manager instance itself.</p>
<h3 id="create-security-groups-and-rules">Create security group(s) and rules</h3>
<p>VMware does a great job of listing protocols and ports needed for their software products <a href="https://ports.vmware.com/home/NSX-T-Data-Center">here</a>. I created the following group and rules based on their requirements:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack security group create nsx
openstack security group rule create nsx --protocol icmp
openstack security group rule create nsx --protocol tcp --dst-port 443
openstack security group rule create nsx --protocol tcp --dst-port 6081
openstack security group rule create nsx --protocol tcp --dst-port 9000
openstack security group rule create nsx --protocol tcp --dst-port 5671
openstack security group rule create nsx --protocol tcp --dst-port 1234
openstack security group rule create nsx --protocol tcp --dst-port 8080
openstack security group rule create nsx --protocol tcp --dst-port 1235
openstack security group rule create nsx --protocol udp --dst-port 6081
openstack security group rule create nsx --protocol tcp --dst-port 22
</code></pre></div></div>
<h3 id="create-a-neutron-port">Create a Neutron port</h3>
<p>The NSX Manager appliance is bootstrapped with a configuration that is injected into the image using the <code class="language-plaintext highlighter-rouge">guestfish</code> utility. Part of the configuration defines the IP address, netmask, and gateway for the Manager appliance. With that in mind, now is a good time to create a Neutron port on the management network so the fixed IP is known and the configuration can be built accordingly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># openstack port create --network LAN --security-group nsx NSX_MANAGER_MGMT --description 'NSX Manager'
</code></pre></div></div>
<p>The port details are as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>IP: 192.168.2.168
Netmask: 255.255.255.0
Gateway: 192.168.2.1
DNS: 172.22.0.5
NTP: 172.22.0.5
</code></pre></div></div>
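<p>These values came straight out of Neutron. If you need to gather them again, the port and its subnet hold everything except the NTP server (substitute the subnet ID reported by the port command):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># openstack port show NSX_MANAGER_MGMT -c fixed_ips
# openstack subnet show SUBNET_ID -c cidr -c gateway_ip -c dns_nameservers
</code></pre></div></div>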
<h3 id="create-the-bootstrap-config-file">Create the bootstrap config file</h3>
<p>There are a handful of <a href="https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.1/installation/GUID-5229A83D-1B97-4203-BA30-F52716F68F7F.html#GUID-5229A83D-1B97-4203-BA30-F52716F68F7F">properties</a> that must be defined in the configuration file to properly bootstrap the NSX-T Manager:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nsx_cli_passwd_0
nsx_cli_audit_passwd_0
nsx_passwd_0
nsx_hostname
nsx_role
nsx_isSSHEnabled
nsx_allowSSHRootLogin
nsx_dns1_0
nsx_ntp_0
nsx_domain_0
nsx_gateway_0
nsx_netmask_0
nsx_ip_0
</code></pre></div></div>
<p>The following values are intentionally insecure for demonstration purposes only:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nsx_cli_passwd_0 = 0p3nst@ck$$NSX
nsx_cli_audit_passwd_0 = 0p3nst@ck$$NSX
nsx_passwd_0 = 0p3nst@ck$$NSX
nsx_hostname = nsx-manager1
nsx_role = "NSX Manager"
nsx_isSSHEnabled = True
nsx_allowSSHRootLogin = True
nsx_dns1_0 = 172.22.0.5
nsx_ntp_0 = 172.22.0.5
nsx_domain_0 = jimmdenton.com
nsx_gateway_0 = 192.168.2.1
nsx_netmask_0 = 255.255.255.0
nsx_ip_0 = 192.168.2.168
</code></pre></div></div>
<p>Create a file named <code class="language-plaintext highlighter-rouge">guestinfo-manager.xml</code> with the corresponding values, as shown here:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><?xml version="1.0" encoding="UTF-8"?>
<Environment
xmlns="http://schemas.dmtf.org/ovf/environment/1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:oe="http://schemas.dmtf.org/ovf/environment/1">
<PropertySection>
<Property oe:key="nsx_cli_passwd_0" oe:value="0p3nst@ck$$NSX"/>
<Property oe:key="nsx_cli_audit_passwd_0" oe:value="0p3nst@ck$$NSX"/>
<Property oe:key="nsx_passwd_0" oe:value="0p3nst@ck$$NSX"/>
<Property oe:key="nsx_hostname" oe:value="nsx-manager1"/>
<Property oe:key="nsx_role" oe:value="NSX Manager"/>
<Property oe:key="nsx_isSSHEnabled" oe:value="True"/>
<Property oe:key="nsx_allowSSHRootLogin" oe:value="True"/>
<Property oe:key="nsx_dns1_0" oe:value="172.22.0.5"/>
<Property oe:key="nsx_ntp_0" oe:value="172.22.0.5"/>
<Property oe:key="nsx_domain_0" oe:value="jimmdenton.com"/>
<Property oe:key="nsx_gateway_0" oe:value="192.168.2.1"/>
<Property oe:key="nsx_netmask_0" oe:value="255.255.255.0"/>
<Property oe:key="nsx_ip_0" oe:value="192.168.2.168"/>
</PropertySection>
</Environment>
</code></pre></div></div>
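<p>Since a typo in this file means re-injecting and re-uploading the image, it doesn’t hurt to confirm the XML is well-formed first. <code class="language-plaintext highlighter-rouge">xmllint</code> (from the <code class="language-plaintext highlighter-rouge">libxml2-utils</code> package on Ubuntu) exits silently when the file parses cleanly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># apt install libxml2-utils
# xmllint --noout guestinfo-manager.xml
</code></pre></div></div>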
<h3 id="create-a-flavor">Create a flavor</h3>
<p>VMware lists requirements for the virtualized Manager based on environment size. Here, in a small environment, the CPU and RAM requirements are somewhat reasonable:</p>
<ul>
<li>CPUs: 4</li>
<li>RAM: 16 GB</li>
</ul>
<p>Create the flavor:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack flavor create \
--vcpu 4 \
--ram 16384 \
nsx-manager-extra-small
</code></pre></div></div>
<p>You might have noticed a disk size was not set. Because we will be attaching volumes to the instance, no size is required in the flavor definition.</p>
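<p>A quick look at the flavor confirms the root disk is 0 and the instance will rely entirely on the attached volumes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># openstack flavor show nsx-manager-extra-small -c vcpus -c ram -c disk
</code></pre></div></div>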
<h3 id="upload-the-images">Upload the images</h3>
<p>Both the primary and secondary unified appliance images must be uploaded to Glance. However, the primary image needs to be modified to include the <code class="language-plaintext highlighter-rouge">guestinfo</code> file created earlier.</p>
<p>Because the unified appliance image may be used to create other appliances in future posts, now is a good time to create a duplicate:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># cp nsx-unified-appliance-3.1.0.0.0.17107212-le.qcow2 nsx-unified-appliance-manager-3.1.0.0.0.17107212-le.qcow2
</code></pre></div></div>
<p>Use the <code class="language-plaintext highlighter-rouge">guestfish</code> utility to inject the xml file as <code class="language-plaintext highlighter-rouge">/config/guestinfo</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># apt install libguestfs-tools
# guestfish --rw -i -a nsx-unified-appliance-manager-3.1.0.0.0.17107212-le.qcow2 upload guestinfo-manager.xml /config/guestinfo
</code></pre></div></div>
<p>After a brief moment, and with no feedback, the image will be modified. To verify, perform the following command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># guestfish --ro -a nsx-unified-appliance-manager-3.1.0.0.0.17107212-le.qcow2 -i
</code></pre></div></div>
<p>The image will be opened, and a <code class="language-plaintext highlighter-rouge">cat</code> of the file should reveal the proper contents:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.
Type: 'help' for help on commands
'man' to read the manual
'quit' to quit the shell
Operating system: Ubuntu 18.04.4 LTS
/dev/sda2 mounted on /
/dev/sda1 mounted on /boot
/dev/nsx/config mounted on /config
/dev/nsx/config__bak mounted on /config_bak
/dev/nsx/image mounted on /image
/dev/sda3 mounted on /os_bak
/dev/nsx/repository mounted on /repository
/dev/nsx/tmp mounted on /tmp
/dev/nsx/var+dump mounted on /var/dump
/dev/nsx/var+log mounted on /var/log
><fs> cat /config/guestinfo
<?xml version="1.0" encoding="UTF-8"?>
<Environment
xmlns="http://schemas.dmtf.org/ovf/environment/1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:oe="http://schemas.dmtf.org/ovf/environment/1">
<PropertySection>
<Property oe:key="nsx_cli_passwd_0" oe:value="0p3nst@ck$$NSX"/>
<Property oe:key="nsx_cli_audit_passwd_0" oe:value="0p3nst@ck$$NSX"/>
<Property oe:key="nsx_passwd_0" oe:value="0p3nst@ck$$NSX"/>
<Property oe:key="nsx_hostname" oe:value="nsx-manager1"/>
<Property oe:key="nsx_role" oe:value="NSX Manager"/>
<Property oe:key="nsx_isSSHEnabled" oe:value="True"/>
<Property oe:key="nsx_allowSSHRootLogin" oe:value="True"/>
<Property oe:key="nsx_dns1_0" oe:value="172.22.0.5"/>
<Property oe:key="nsx_ntp_0" oe:value="172.22.0.5"/>
<Property oe:key="nsx_domain_0" oe:value="jimmdenton.com"/>
<Property oe:key="nsx_gateway_0" oe:value="192.168.2.1"/>
<Property oe:key="nsx_netmask_0" oe:value="255.255.255.0"/>
<Property oe:key="nsx_ip_0" oe:value="192.168.2.168"/>
</PropertySection>
</Environment>
><fs> quit
</code></pre></div></div>
<p>Now, upload the images:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack image create \
--disk-format qcow2 \
--container-format bare \
  --file nsx-unified-appliance-manager-3.1.0.0.0.17107212-le.qcow2 \
nsx-unified-appliance-manager
openstack image create \
--disk-format qcow2 \
--container-format bare \
--file nsx-unified-appliance-secondary-3.1.0.0.0.17107212-le.qcow2 \
nsx-unified-appliance-secondary-manager
</code></pre></div></div>
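<p>As with any Glance upload, a quick listing confirms both images are active before moving on:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># openstack image list | grep nsx
</code></pre></div></div>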
<h3 id="create-the-volumes">Create the volumes</h3>
<p>Because we need to mount a secondary disk at boot, I found it easier to boot the instance with both images attached as volumes:</p>
<ul>
<li>primary image as <code class="language-plaintext highlighter-rouge">sda</code></li>
<li>secondary image as <code class="language-plaintext highlighter-rouge">sdb</code></li>
</ul>
<p>To create the volumes from image, you must first determine what size the volume needs to be. Using <code class="language-plaintext highlighter-rouge">qemu-img</code>, find the real size as shown here:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># qemu-img info nsx-unified-appliance-manager-3.1.0.0.0.17107212-le.qcow2
image: nsx-unified-appliance-manager-3.1.0.0.0.17107212-le.qcow2
file format: qcow2
virtual size: 200 GiB (214748364800 bytes)
disk size: 10.2 GiB
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
# qemu-img info nsx-unified-appliance-secondary-3.1.0.0.0.17107212-le.qcow2
image: nsx-unified-appliance-secondary-3.1.0.0.0.17107212-le.qcow2
file format: qcow2
virtual size: 100 GiB (107374182400 bytes)
disk size: 196 KiB
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false
</code></pre></div></div>
<p>Turns out, the primary image has a virtual size of 200GB while the secondary ends up at 100GB.</p>
<p>Knowing that, the volumes can now be created:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack volume create \
--image nsx-unified-appliance-manager \
--size 200 \
nsx-unified-appliance-manager
openstack volume create \
  --image nsx-unified-appliance-secondary-manager \
--size 100 \
nsx-unified-appliance-secondary-manager
</code></pre></div></div>
<p>After a while (depending on the speed of your network), the volumes should show as <code class="language-plaintext highlighter-rouge">available</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@lab-infra01:/home/jdenton# openstack volume list
+--------------------------------------+-----------------------------------------+----------------+------+-------------+
| ID | Name | Status | Size | Attached to |
+--------------------------------------+-----------------------------------------+----------------+------+-------------+
| 15a3daff-b06e-4a3c-9a00-7ef4639a56da | nsx-unified-appliance-secondary-manager | available | 100 | |
| 327999a1-8901-4273-be43-d1151f388195 | nsx-unified-appliance-manager | available | 200 | |
+--------------------------------------+-----------------------------------------+----------------+------+-------------+
</code></pre></div></div>
<h2 id="deploy-an-nsx-t-manager-instance">Deploy an NSX-T Manager Instance</h2>
<p>With the required resources in place, it’s time to create the instance:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openstack server create \
--port NSX_MANAGER_MGMT \
--flavor nsx-manager-extra-small \
--volume nsx-unified-appliance-manager \
--block-device-mapping vdb=nsx-unified-appliance-secondary-manager \
nsx-manager1
</code></pre></div></div>
<p>After a brief moment, the instance should go <code class="language-plaintext highlighter-rouge">ACTIVE</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># openstack server show nsx-manager1
+-------------------------------------+----------------------------------------------------------------+
| Field | Value |
+-------------------------------------+----------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | lab-compute02 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | lab-compute02.openstack.local |
| OS-EXT-SRV-ATTR:instance_name | instance-0000020e |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2021-04-08T01:59:33.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | LAN=192.168.2.168 |
| config_drive | |
| created | 2021-04-08T01:59:14Z |
| flavor | nsx-manager-extra-small (38f00cb5-9d5e-43f4-b63e-4da7175f00a0) |
| hostId | 619a1b066ba5e16258c79ded5319a206777219e3e688f5200d74dd72 |
| id | 89d16e20-1807-465f-9703-16d78675db1f |
| image | N/A (booted from volume) |
| key_name | None |
| name | nsx-manager1 |
| progress | 0 |
| project_id | 7a8df96a3c6a47118e60e57aa9ecff54 |
| properties | |
| security_groups | name='default' |
| status | ACTIVE |
| updated | 2021-04-08T01:59:33Z |
| user_id | 34f3cf48b24f41c097555c07961f139e |
| volumes_attached | id='327999a1-8901-4273-be43-d1151f388195' |
| | id='15a3daff-b06e-4a3c-9a00-7ef4639a56da' |
+-------------------------------------+----------------------------------------------------------------+
</code></pre></div></div>
<p>The instance’s console can be checked to ensure the instance is booting:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># openstack console url show nsx-manager1
+-------+--------------------------------------------------------------------------------------------+
| Field | Value |
+-------+--------------------------------------------------------------------------------------------+
| type | novnc |
| url | https://10.20.0.10:6080/vnc_lite.html?path=%3Ftoken%3Dcbe20437-6ad4-46e4-9056-014dc791040e |
+-------+--------------------------------------------------------------------------------------------+
</code></pre></div></div>
<p>After a few minutes, a console prompt appeared on screen:</p>
<p><img src="../assets/images/2021-04-07-installing-nsxt-manager-on-openstack/login.png" alt="" /></p>
<p>Using the credentials provided in <code class="language-plaintext highlighter-rouge">guestinfo-manager.xml</code>, log in to the console:</p>
<p><img src="../assets/images/2021-04-07-installing-nsxt-manager-on-openstack/loggedin.png" alt="" /></p>
<p>The VMware installation guide walks you through a few additional validation steps, one of those being network validation:</p>
<p><img src="../assets/images/2021-04-07-installing-nsxt-manager-on-openstack/eth.png" alt="" /></p>
<p>The IP is applied, and ICMP responds as well:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>64 bytes from 192.168.2.168: icmp_seq=182 ttl=64 time=5.350 ms
64 bytes from 192.168.2.168: icmp_seq=183 ttl=64 time=4.304 ms
64 bytes from 192.168.2.168: icmp_seq=184 ttl=64 time=5.505 ms
64 bytes from 192.168.2.168: icmp_seq=185 ttl=64 time=4.239 ms
64 bytes from 192.168.2.168: icmp_seq=186 ttl=64 time=5.627 ms
64 bytes from 192.168.2.168: icmp_seq=187 ttl=64 time=4.980 ms
64 bytes from 192.168.2.168: icmp_seq=188 ttl=64 time=4.232 ms
</code></pre></div></div>
<h2 id="connecting-to-the-dashboard">Connecting to the Dashboard</h2>
<p>At this point, all signs point to a successful deployment of the NSX-T Manager (unified appliance) on an OpenStack cloud. Using a web browser, connect to the management address defined in <code class="language-plaintext highlighter-rouge">guestinfo-manager.xml</code>:</p>
<p><img src="../assets/images/2021-04-07-installing-nsxt-manager-on-openstack/web.png" alt="" />
<img src="../assets/images/2021-04-07-installing-nsxt-manager-on-openstack/dashboard.png" alt="" /></p>
<p>If you’ve downloaded and installed the VMUG-provided image (like me), configure your individualized license key by clicking on <code class="language-plaintext highlighter-rouge">Manage Licenses</code>. The <em>NSX For vShield Endpoint</em> license is included, but the <em>NSX Data Center Evaluation</em> license is what is (likely) required for the fun stuff.</p>
<p><img src="../assets/images/2021-04-07-installing-nsxt-manager-on-openstack/license.png" alt="" /></p>
<p>In a series of follow-on posts, I hope to explore NSX-T features and OpenStack Neutron integration by deploying a small all-in-one (AIO) cloud using OpenStack-Ansible. Stay tuned!</p>
<hr />
<p>If you have some thoughts or comments on this process, I’d love to hear ‘em. Feel free to reach out on Twitter at @jimmdenton or hit me up on LinkedIn.</p>jamesdentonFor a long time now I’ve been interested in better understanding alternatives to a ‘vanilla’ Neutron deployment, but other than demonstrations and some hacking on OpenContrail a few years ago and Plumgrid years before that, I’ve really kept it simple by sticking to the upstream components and features. VMware’s NSX-T product has been on my roadmap since it was first introduced as “compatible with All The Clouds™”, and I’m hoping to deploy the NSX-T Manager and other components on my OpenStack cloud as virtual machine instances that in turn manage networking for a yet-to-be-deployed OpenStack-Ansible based OpenStack cloud in the home lab.