git commit -m understanding firewalld's status

April 25, 2024

A few days ago, I accidentally ran into an observability (and potentially a security) problem. In order to prevent configuration drift, I have an Ansible playbook that runs in my environment every morning to check and make sure the firewalld zone and ipset files in /etc/firewalld contain the services, ports, and IP addresses I expect to be there. Here is the basic gist of what it does:

- name: "Deploy ipset files"
  ansible.builtin.template:
    lstrip_blocks: true
    trim_blocks: true
    src: "{{ item.src }}"
    dest: "{{ item.dest }}"
  loop:
    - { src: 'myipset_1.xml.j2', dest: '/etc/firewalld/ipsets/myipset_1.xml' }
    - { src: 'myipset_2.xml.j2', dest: '/etc/firewalld/ipsets/myipset_2.xml' }

- name: "Deploy zone files"
  ansible.builtin.template:
    lstrip_blocks: true
    trim_blocks: true
    src: "{{ item.src }}"
    dest: "{{ item.dest }}"
  loop:
    - { src: 'myzone_1.xml.j2', dest: '/etc/firewalld/zones/myzone_1.xml' }
    - { src: 'myzone_2.xml.j2', dest: '/etc/firewalld/zones/myzone_2.xml' }

The playbook is run in check mode, so it is only reporting changes, not actually making them (changing firewall rules when you’re AFK can often produce fun results). This is all well and good, but the state of these files on the filesystem does not always reflect the status of firewalld’s running configuration.

The silent opening

Earlier this week, I was playing with Fluent Bit logging and needed to open a port on our syslog server for testing. Since it’s just an experiment at the moment, I simply added the port in the public zone, but without the --permanent flag. This means the port was immediately opened on the host, but it was not written into the zone file on the filesystem in /etc/firewalld/zones/public.xml. I was expecting this change to trip the playbook during its morning run the next day and flag our syslog server as a drifted host. I awoke the next morning and checked the playbook run…

Drifted Hosts: 0

I immediately realized what the problem was: my playbook is checking the permanent firewall configuration as stored on the filesystem, but it’s not checking the runtime firewall configuration that is live in the kernel. For those uninitiated, you can see a breakdown of this concept of runtime vs. permanent in the documentation on the firewalld website.

Finding a solution

The cogs started to turn on how to solve this discrepancy between the configuration as written to the filesystem and the runtime state. Do I just insert a quick firewall-cmd --reload at the top to bring the state back to permanent before proceeding? Well, it wouldn’t work in a playbook that runs in check mode, but more importantly, I don’t want to change anything on the host, I just want to be notified if something has changed. My first instinct was to turn to syslog and the journal.

Does firewalld log any changes to its runtime configuration? Nope, it does not.

There are some clues that firewalld has been touched if you check the audit log, but it is not immediately clear what changed and I’d rather not stare at a fire hose to find a leak.

I then stumbled upon a novel idea in this serverfault post, which uses diff against two firewall-cmd commands—one showing the status of a zone as written in the zone file on the filesystem, and the other showing the live running status of the zone. Modified slightly, I can get a very concise view of any differences between the filesystem configuration and the runtime state:

diff \
> <(firewall-cmd --zone=myzone_1 --list-all) \
> <(firewall-cmd --permanent --zone=my_zone1 --list-all)

It is a simple stretch to modify this command to check an ipset, just use --info-ipset=myipset_1 instead of the --zone and --list-all flags.

The playbook

Now equipped with a solution, I sketched out a basic playbook to detect changes in the live configuration of firewalld that deviate from the state expected by the filesystem configuration. I was even able to add checks for entirely new zones or ipsets that may have been added in runtime, silly as it might sound to do such a thing.

- name: "Detect changes to firewalld runtime configuration"
  hosts: all
  tasks:
    # First, we get a space-delimited list of the runtime ipsets on the host
    - name: "Retrieve ipsets"
      ansible.builtin.command:
        cmd: firewalld-cmd --get-ipsets
      register: ipsets
      changed_when: false
      become: true

    # Now we iterate through the list of ipsets to see if any entries have been
    # added or removed. The grep pipe is used prevent false positives from the
    # diff command, as we only care about entries changing.
    - name: "Check for ipset runtime changes"
      ansible.builtin.shell:
        cmd: |
          diff \
          <(firewall-cmd --permanent --info-ipset={{ item }}) \
          <(firewall-cmd --info-ipset={{ item }}) \
          | grep entries
        executable: /bin/bash
      register: ipset_result
      changed_when: ipset_result.rc == 0
      failed_when: false
      become: true
      loop: "{{ ipsets.stdout | split }}"

    # Do the same thing with zones
    - name: "Retrieve zones"
      ansible.builtin.command:
        cmd: firewall-cmd --get-zones
      register: zones
      changed_when: false
      become: true

    # For zones, we grep for services, ports, sources, and the zone target
    # (i.e. the actor could change the target from DEFAULT to ACCEPT, which
    # would allow all traffic in a zone)
    - name: "Check for zone runtime changes"
      ansible.builtin.shell:
        cmd: |
          diff \
          <(firewall-cmd --permanent --info-zone={{ item }}) \
          <(firewall-cmd --info-zone={{ item }}) \
          | grep -E '(services|ports|sources|target)'
        executable: /bin/bash
      register: zone_result
      changed_when: zone_result.rc == 0
      failed_when: false
      become: true
      loop: "{{ zones.stdout | split }}"

    - name: "Catch changes and fail playbook"
      ansible.builtin.fail:
        msg: "Runtime changes detected"
      when: ipset_result.changed or zone_result.changed

The design of the playbook may turn some heads, especially all of the changed_when and failed_when: false statements. There are reasons for writing it this way. Intuitively, it makes sense to mark any runtime changes our playbook finds as “changed” in the playbook run. The grep command comes back with a return code of 0 when it finds a match, which in our case means it finds a runtime change. That means a task should be marked as changed when the return code of our shell command is 0, hence the changed_when: result.rc == 0. We add a failed_when: false to these tasks as well because any return code that is not 0 is considered an error or failure in shell terms, but in our case it just means we didn’t find any runtime changes, and everything is okay.

Perhaps you’re also wondering why I put become: true on every single task instead of declaring it at the play level. You can do it either way, but I think declaring it at the task level improves readability (it is also the recommended method by Red Hat in their Ansible Best Practices certification).

At the end, you can also see a fail task which will be triggered on any hosts where runtime changes have been detected. The purpose of this is to more easily highlight which hosts have runtime changes, and also to fail the playbook to get your attention.

I should also mention I use this in AWX (Red Hat Ansible Automation Platform, if you’re paying for it), so the playbook isn’t necessarily designed to communicate changes most effectively from the CLI. As-is, you might need to run with -vv to ensure you see the changes in stdout. In AWX, you can click the changed tasks and check the Output tab to see what the runtime changes are very easily. One other AWX-specific detail: the final fail task allows you to send a notification specifically when runtime changes are detected (i.e. a failure), as opposed to sending a notification on any successful playbook run that may or may not indicate changes were detected. This is helpful when you’re running this playbook on a schedule so that you don’t have to think about it unless you get notified.

Should I actually care?

So what really is the benefit of all this? Is it worth consuming resources and CPU cycles to run these checks? Well, if you’re managing an environment where the uptime is rarely more than 24 hours, this is probably a waste of time. But some of us “lucky” admins run 6-month patch cycles and have seen uptimes measured in years, and in such environments it might be prudent to keep an eye on the runtime state of the firewall.

More importantly, I would say it’s quite useful when you want to ensure nobody is opening ephemeral ports and services on your host behind your back. Perhaps a feisty application team that occasionally abuses the privileges given to them in good faith. As the owners and maintainers of our infrastructure, we have the responsibility of knowing what is happening on our systems. If someone can open a port or whitelist an IP on a host without me noticing, whether for good intention or ill, that’s not a good look for the sysadmin. Together with the original playbook that checks the permanent configuration, I now have a more complete view of the state of firewalld on any given host. And what’s more fun than catching someone red-handed trying to add an extra web service to their application server without telling anyone?

There is probably a more efficent way to do this, and there are certainly other ways to detect anomalous behavior on a host (IP addresses or unexpected services exposed in the logs), but in my opinon, running this playbook on a schedule seems like a straightforward way to get a more comprehensive look at your firewall configuration.

bergnerd

git commit -m understanding firewalld's status

The silent opening

Finding a solution

The playbook

Should I actually care?