ASA/FTD High-Availability Troubleshooting
October 2023
Session Goal
• Understanding the methodology of troubleshooting the most common issues regarding High-Availability
  setup in both ASA and FTD.
• Using verification commands in real scenarios to determine causes of the failover events.
   © 2023 Cisco and/or its affiliates. All rights reserved.   Cisco Confidential
Agenda
1   Few words about High Availability
3   Troubleshooting workflow
4   Common issues
Disclaimer
  • Rebranding
  • Cisco Next-Generation Firewall (NGFW) is now Cisco Secure Firewall.
  • Rebranded names in version 7.2:
    Former Name                                      Rebranded Name
    Firepower Threat Defense Virtual (FTDv)          Secure Firewall Threat Defense Virtual
    Firepower Management Center Virtual (FMCv)       Secure Firewall Management Center Virtual
    Firepower eXtensible Operating System (FXOS)     Secure Firewall eXtensible Operating System
• High availability refers to the failover configuration. High availability or failover setup joins two devices
  so that if one of the devices fails, the other device can take over.
• Primary and Secondary are roles: they stay with the units and are specified during the initial HA configuration.
• Active and Standby are states and change depending on the health status of each unit.
Few words about HA
• Both ASA/FTD units in a pair must be identical in hardware, software, memory, interfaces, and mode.
Classic ASA vs FTD failover
• ASA monitors the state of the interfaces. FTD also monitors Snort and disk space.
• Failover replication command options are not configurable for FTD and use the default settings:
• On ASA you can configure encryption for the failover link in two different ways: a simple key or an IPsec
  tunnel. FTD supports only the IPsec tunnel option.
• On ASA you can use a subinterface as the failover or state interface. On FTD you must use a physical
  interface.
HA state flow diagram
Verification commands
• show failover:
                                       Primary/Active                                                 Secondary/Standby
          > show failover                                                            > show failover
          Failover On                                                                Failover On
          Failover unit Primary                                                      Failover unit Secondary
          Failover LAN Interface: failover GigabitEthernet0/4 (up)                   Failover LAN Interface: failover GigabitEthernet0/4 (up)
          Reconnect timeout 0:00:00                                                  Reconnect timeout 0:00:00
          Unit Poll frequency 1 seconds, holdtime 15 seconds                         Unit Poll frequency 1 seconds, holdtime 15 seconds
          Interface Poll frequency 5 seconds, holdtime 25 seconds                    Interface Poll frequency 5 seconds, holdtime 25 seconds
          Interface Policy 1                                                         Interface Policy 1
          Monitored Interfaces 3 of 361 maximum                                      Monitored Interfaces 3 of 361 maximum
          MAC Address Move Notification Interval not set                             MAC Address Move Notification Interval not set
          failover replication http                                                  failover replication http
          Version: Ours 9.18(2)219, Mate 9.18(2)219                                  Version: Ours 9.18(2)219, Mate 9.18(2)219
          Serial Number: Ours 9AD2AL87FDQ, Mate 9ALU58NUM7A                          Serial Number: Ours 9ALU58NUM7A, Mate 9AD2AL87FDQ
          Last Failover at: 06:24:15 UTC Jul 5 2023                                  Last Failover at: 19:07:10 UTC Jul 5 2023
                  This host: Primary - Active                                                This host: Secondary - Standby Ready
                          Active time: 102448 (sec)                                                  Active time: 0 (sec)
                          slot 0: ASAv hw/sw rev (/9.18(2)219) status (Up Sys)                       slot 0: ASAv hw/sw rev (/9.18(2)219) status (Up Sys)
                            Interface diagnostic (0.0.0.0): Normal (Waiting)                           Interface Outside (0.0.0.0): Normal (Waiting)
                            Interface Outside (192.168.2.10): Normal (Waiting)                         Interface Inside (0.0.0.0): Normal (Waiting)
                            Interface Inside (192.168.28.1): Normal (Waiting)                          Interface diagnostic (0.0.0.0): Normal (Waiting)
                          slot 1: snort rev (1.0) status (up)                                        slot 1: snort rev (1.0) status (up)
                          slot 2: diskstatus rev (1.0) status (up)                                   slot 2: diskstatus rev (1.0) status (up)
                  Other host: Secondary - Standby Ready                                      Other host: Primary - Active
                          Active time: 0 (sec)                                                       Active time: 102512 (sec)
                            Interface diagnostic (0.0.0.0): Normal (Waiting)                           Interface Outside (192.168.2.10): Normal (Waiting)
                            Interface Outside (0.0.0.0): Normal (Waiting)                              Interface Inside (192.168.28.1): Normal (Waiting)
                            Interface Inside (0.0.0.0): Normal (Waiting)                               Interface diagnostic (0.0.0.0): Normal (Waiting)
                          slot 1: snort rev (1.0) status (up)                                        slot 1: snort rev (1.0) status (up)
                          slot 2: diskstatus rev (1.0) status (up)                                   slot 2: diskstatus rev (1.0) status (up)
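When the output is collected from many units, the role and state lines can be pulled out of `show failover` text with a few lines of Python. This is a minimal sketch, assuming the output was saved as plain text; the function name `parse_failover_hosts` is hypothetical and not part of any Cisco tooling.

```python
import re

# Matches lines such as "This host: Primary - Active" or
# "Other host: Secondary - Standby Ready" from `show failover` output.
HOST_LINE = re.compile(r"^\s*(This|Other) host:\s*(Primary|Secondary)\s*-\s*(.+?)\s*$")


def parse_failover_hosts(output: str) -> dict:
    """Return {'This': (role, state), 'Other': (role, state)} from `show failover` text."""
    hosts = {}
    for line in output.splitlines():
        match = HOST_LINE.match(line)
        if match:
            which, role, state = match.groups()
            hosts[which] = (role, state)
    return hosts


# Excerpt of the Primary/Active output shown above.
sample = """\
Failover On
Failover unit Primary
        This host: Primary - Active
                Active time: 102448 (sec)
        Other host: Secondary - Standby Ready
                Active time: 0 (sec)
"""

hosts = parse_failover_hosts(sample)
print(hosts)
```

The role (Primary/Secondary) should be the same every time the command is run, while the state changes with the health of each unit.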
Primary/Active                                                      Secondary/Standby
Stateful Failover Logical Update Statistics                         Stateful Failover Logical Update Statistics
Link : failover GigabitEthernet0/4 (up)                             Link : failover GigabitEthernet0/4 (up)
Stateful Obj        xmit       xerr       rcv        rerr           Stateful Obj        xmit       xerr       rcv        rerr
General             79005      0          78326      0              General             7601       0          7607       0
sys cmd             78333      0          78326      0              sys cmd             7601       0          7601       0
up time             0          0          0          0              up time             0          0          0          0
RPC services        0          0          0          0              RPC services        0          0          0          0
TCP conn            117        0          0          0              TCP conn            0          0          0          0
UDP conn            402        0          0          0              UDP conn            0          0          0          0
ARP tbl             143        0          0          0              ARP tbl             0          0          5          0
Xlate_Timeout       0          0          0          0              Xlate_Timeout       0          0          0          0
IPv6 ND tbl         0          0          0          0              IPv6 ND tbl         0          0          0          0
VPN IKEv1 SA        0          0          0          0              VPN IKEv1 SA        0          0          0          0
VPN IKEv1 P2        0          0          0          0              VPN IKEv1 P2        0          0          0          0
VPN IKEv2 SA        0          0          0          0              VPN IKEv2 SA        0          0          0          0
VPN IKEv2 P2        0          0          0          0              VPN IKEv2 P2        0          0          0          0
VPN CTCP upd        0          0          0          0              VPN CTCP upd        0          0          0          0
VPN SDI upd         0          0          0          0              VPN SDI upd         0          0          0          0
VPN DHCP upd        0          0          0          0              VPN DHCP upd        0          0          0          0
SIP Session         0          0          0          0              SIP Session         0          0          0          0
SIP Tx              0          0          0          0              SIP Tx              0          0          0          0
SIP Pinhole         0          0          0          0              SIP Pinhole         0          0          0          0
Route Session       0          0          0          0              Route Session       0          0          0          0
Router ID           0          0          0          0              Router ID           0          0          0          0
User-Identity       5          0          0          0              User-Identity       0          0          1          0
CTS SGTNAME         0          0          0          0              CTS SGTNAME         0          0          0          0
CTS PAC             0          0          0          0              CTS PAC             0          0          0          0
TrustSec-SXP        0          0          0          0              TrustSec-SXP        0          0          0          0
IPv6 Route          0          0          0          0              IPv6 Route          0          0          0          0
STS Table           0          0          0          0              STS Table           0          0          0          0
Umbrella Device-ID  0          0          0          0              Umbrella Device-ID  0          0          0          0
Rule DB B-Sync      0          0          0          0              Rule DB B-Sync      0          0          0          0
Rule DB P-Sync      4          0          0          0              Rule DB P-Sync      0          0          0          0
Rule DB Delete      1          0          0          0              Rule DB Delete      0          0          0          0
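The xerr and rerr columns of the statistics above are the ones worth scanning first, since non-zero values point at replication problems on the state link. The sketch below shows one way to flag such rows, assuming one unit's statistics were saved as text. The function name `replication_errors` is hypothetical, and the error value in the sample is invented for illustration (the captured output above shows no errors).

```python
import re

# A statistics row: object name followed by four integer counters
# (xmit, xerr, rcv, rerr). Header and banner lines do not match.
STAT_ROW = re.compile(r"^\s*(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def replication_errors(output: str) -> dict:
    """Return {object_name: (xerr, rerr)} for rows with a non-zero error counter."""
    errors = {}
    for line in output.splitlines():
        match = STAT_ROW.match(line)
        if not match:
            continue
        name = match.group(1)
        _xmit, xerr, _rcv, rerr = (int(value) for value in match.groups()[1:])
        if xerr or rerr:
            errors[name] = (xerr, rerr)
    return errors


# Hypothetical sample: the xerr value of 3 on "TCP conn" is made up.
sample = """\
Stateful Obj        xmit       xerr       rcv        rerr
General             79005      0          78326      0
TCP conn            117        3          0          0
VPN IKEv1 SA        0          0          0          0
"""

print(replication_errors(sample))
```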
                                                Primary/Active                                                                                Secondary/Standby
> show failover state                                                                                                 > show failover state
                  State                          Last Failure Reason                        Date/Time                                 State           Last Failure Reason   Date/Time
This host   -     Primary                                                                                             This host   -   Secondary
                  Active                        Comm Failure                                06:22:47 UTC Jul 5 2023                   Standby Ready   None
Other host -     Secondary                                                                                            Other host -    Primary
                 Standby Ready                   Comm Failure                               19:01:26 UTC Jul 5 2023                   Active          None
Common issues related to HA
• There are common situations where failover happens without a clear reason:
  • Issue with monitored interfaces.
  • Disk issue.
  • Traceback (reboot).
Unexpected failover – Monitored interfaces
• When a unit does not receive hello messages on a monitored interface for 15 seconds, it runs
  interface tests.
• If one of the interface tests fails for an interface, but the same interface on the other unit continues to
  successfully pass traffic, then the interface is considered to be failed, and the device stops running
  tests.
• If the faulty interface is on the Active unit, a failover occurs.
• If the faulty interface is on the Standby unit, no failover happens; the Standby unit is marked as Failed.
• If a unit is marked Failed because of a monitored interface failure, that interface needs to be verified.
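The rules above can be summarized as a small decision function. This is only an illustrative model of the behavior described in the bullets, not the device's implementation; the function name and the returned keys are hypothetical.

```python
def on_interface_test_failure(unit_state: str, peer_interface_passing: bool) -> dict:
    """Model the outcome when a monitored interface fails its tests on one unit.

    unit_state is the state of the unit with the faulty interface:
    "Active" or "Standby".
    """
    if not peer_interface_passing:
        # Assumption: the bullets above only cover the case where the same
        # interface on the peer still passes traffic, so nothing changes here.
        return {"failover": False, "standby_marked_failed": False}
    if unit_state == "Active":
        # Faulty interface on the Active unit: failover happens.
        return {"failover": True, "standby_marked_failed": False}
    if unit_state == "Standby":
        # Faulty interface on the Standby unit: no failover, Standby is marked Failed.
        return {"failover": False, "standby_marked_failed": True}
    raise ValueError(f"unexpected unit state: {unit_state!r}")


print(on_interface_test_failure("Active", peer_interface_passing=True))
print(on_interface_test_failure("Standby", peer_interface_passing=True))
```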
Primary
> show failover state
               State          Last Failure Reason      Date/Time
This host  -   Primary
               Failed         Ifc Failure              10:31:10 UTC Jul 17 2023
                              Outside: No Link
Other host -   Secondary
               Active         Comm Failure             18:44:37 UTC Jul 10 2023
====Configuration State===
        Sync Done
====Communication State===
        Mac set
Secondary
> show failover state
               State          Last Failure Reason      Date/Time
This host  -   Secondary
               Active         Comm Failure             18:44:01 UTC Jul 10 2023
Other host -   Primary
               Failed         Ifc Failure              10:31:10 UTC Jul 17 2023
                              Outside: No Link
====Configuration State===
====Communication State===
        Mac set
Unexpected failover – Disk issue
• Troubleshooting to be performed:
  • Check disk utilization with sudo df -hT (-h prints sizes in human-readable form, -T prints the
    file system type):
     admin@firepower:~$ sudo df -hT
     Filesystem     1K-blocks       Used  Available  Use%  Mounted on
     overlay        720917580  104508748  616408832   15%  /
     tmpfs              65536          0      65536    0%  /dev
     tmpfs           98385644          0   98385644    0%  /sys/fs/cgroup
     /dev/sda6       41943040   40814524    1128516   98%  /opt
     tmpfs           98385644        248   98385396    1%  /run
     shm             13331456      51400   13280056    1%  /dev/shm
     tmpfs           98385644          4   98385640    1%  /var/config
     tmpfs           98385644      42320   98343324    1%  /var/volatile/tmp
     /dev/sda5       51474044      53200   48799456    1%  /var/data/cores
     /dev/sda2        1001328      30664     918136    4%  /opt/cisco/config/host-common
     /dev/sda3        4722056      16760    4458768    1%  /opt/cisco/csp/applications/cisco-ftd.7.2_ftd_001_/app_data/disk0/log/.ntp.log
     tmpfs           98385644          0   98385644    0%  /proc/acpi
     tmpfs           98385644          0   98385644    0%  /proc/scsi
     tmpfs           98385644          0   98385644    0%  /sys/firmware
     none              514048          0     514048    0%  /dev/shm/snort
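In the output above, /opt stands out at 98% utilization. A short script can flag such file systems automatically when the `df` output is collected from several devices. This is a minimal sketch: the function name `filesystems_over` and the 90% threshold are assumptions chosen for illustration, not Cisco-documented limits.

```python
def filesystems_over(df_output: str, threshold_pct: int = 90) -> list:
    """Return (mount_point, use_pct) pairs whose Use% is at or above threshold_pct."""
    flagged = []
    for line in df_output.splitlines():
        fields = line.split()
        # Locate the Use% column by its trailing '%', so an extra column
        # (for example the Type column printed by -T) does not matter.
        # The header cell "Use%" is skipped because "Use" is not a number.
        pct_index = next(
            (i for i, field in enumerate(fields)
             if field.endswith("%") and field[:-1].isdigit()),
            None,
        )
        if pct_index is None or pct_index + 1 >= len(fields):
            continue
        use_pct = int(fields[pct_index][:-1])
        mount_point = " ".join(fields[pct_index + 1:])
        if use_pct >= threshold_pct:
            flagged.append((mount_point, use_pct))
    return flagged


# Excerpt of the output shown above.
sample = """\
Filesystem     1K-blocks       Used  Available  Use%  Mounted on
overlay        720917580  104508748  616408832   15%  /
/dev/sda6       41943040   40814524    1128516   98%  /opt
/dev/sda5       51474044      53200   48799456    1%  /var/data/cores
"""

print(filesystems_over(sample))
```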
Unexpected failover – Disk issue
• High disk utilization is often caused by old files that are no longer needed.
• Old files can be cleaned from the disk, but only with extra caution.
• Linux does not have the concept of a "recycle bin"; deleted items practically cannot be restored.
• Do not use absolute paths; first enter the directory and then remove the file.
• If you are not sure whether a specific file can be removed, do not delete it.
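In line with the caution above, it helps to list the largest files first and review them before removing anything. The sketch below is read-only and deletes nothing. It assumes a Python interpreter is available in the shell where the files are inspected; the function name `largest_files` is hypothetical.

```python
import os


def largest_files(directory: str, top_n: int = 10) -> list:
    """Return up to top_n (size_bytes, path) pairs under directory, largest first.

    The function only reads file sizes; it never modifies or removes files.
    """
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                found.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(found, reverse=True)[:top_n]
```

For example, after entering the directory in question, `largest_files(".")` returns the ten largest files below it, which can then be reviewed one by one.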
DEMO: Unexpected failover – Disk issue
Unexpected failover – Traceback
• The root cause of Lina/Snort tracebacks is usually investigated by TAC and the software engineering
  team.
• Before opening a case, you can take these steps to collect the needed outputs:
  • Generate a Troubleshoot file for FTD or show tech-support for ASA.
  • Check the show tech-support output to confirm the traceback.
  • Collect the Lina crash-info (if it exists).
  • Collect the core file (if it exists).
Directory of disk0:/
Unexpected failover – Traceback
------------------ show failover history ------------------                                             ------------------ show failover history ------------------
==========================================================================                              ==========================================================================
From State                    To State            Reason                                                From State                    To State                  Reason
==========================================================================                              ==========================================================================
06:08:57 UTC Jun 20 2023                                                                                04:51:06 UTC May 13 2023
Not Detected                  Disabled            No Error                                              Bulk Sync                     Standby Ready            Failover state check
App Sync issues during joining HA
• If the show failover history output indicates an App Sync failure, then there was a problem at the time
  of the HA validation phase, where the system checks that the units can function correctly as a high
  availability group.
• When validation succeeds, the message “All validation passed” appears with From State App Sync, and
  the node moves to the Standby Ready state.
• Any validation failure transitions the peer to the Disabled (Failed) state.
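The App Sync check described above can be applied to history entries once they are available as (From State, To State, Reason) tuples. This is a minimal sketch: the function name `app_sync_result` is hypothetical, and the sample rows are illustrative rather than captured from a device.

```python
def app_sync_result(transitions) -> str:
    """Classify the most recent App Sync transition in a failover history.

    transitions is an iterable of (from_state, to_state, reason) tuples in
    chronological order. Returns "passed", "failed" or "not seen".
    """
    result = "not seen"
    for from_state, to_state, reason in transitions:
        if from_state != "App Sync":
            continue
        if "All validation passed" in reason:
            result = "passed"
        elif to_state.startswith("Disabled"):
            result = "failed"
    return result


# Illustrative rows only.
healthy = [
    ("Negotiation", "Cold Standby", "Detected an Active mate"),
    ("App Sync", "Sync Config", "All validation passed"),
]
broken = [
    ("Negotiation", "Cold Standby", "Detected an Active mate"),
    ("App Sync", "Disabled", "CD App Sync error is Rsync based file retrieval failed"),
]

print(app_sync_result(healthy))
print(app_sync_result(broken))
```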
• At this stage, policy deployments also fail because the active unit thinks app sync is still in progress.
• Policy deployment fails with the error: "since new Node join/AppSync process is in progress,
  Configuration Changes are not allowed, and hence rejects the deployment request. Please retry
  deployment after some time".
• Sometimes, when you resume high availability on the Standby node, it can resolve the issue.
• Example error message: "CD App Sync error is Rsync based file retrieval failed. Check app-sync-history
  CLI for details."
• The Standby unit can recover on its own, after a reboot, or after resuming HA.
• Some sync issues are temporary and can be resolved by resuming HA on the standby unit:
• ASA:
ciscoasa(config)# failover
• FTD:
> configure high-availability resume
• If the issue persists after resuming, it needs further analysis, so a TAC engineer needs to be involved.
DEMO: App-Sync Issues
Bug: CSCwh02757
Split-Brain (Active/Active) - What is it?
                                         Primary                                                                                                      Secondary
>show failover state                                                                                                 >show failover state
                  State                    Last Failure Reason                             Date/Time                                  State       Last Failure Reason   Date/Time
This host –       Primary                                                                                            This host –     Secondary
                  Active                   None                                                                                      Active       None
Other host -      Secondary                                                                                          Other host -    Primary
                  Failed                   Comm Failure                                    06:24:15 UTC Jul 6 2023                   Failed       Comm Failure          06:24:15 UTC Jul 6 2023
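The signature of a split-brain in the output above is that each unit reports itself as Active and its peer as Failed. The sketch below checks that condition from the two views; the function name `is_split_brain` and the dictionary layout are hypothetical.

```python
def is_split_brain(primary_view: dict, secondary_view: dict) -> bool:
    """Return True when both units report themselves as Active.

    Each view maps "this" and "other" to the states a unit shows in
    `show failover state` for "This host" and "Other host".
    """
    return primary_view["this"] == "Active" and secondary_view["this"] == "Active"


# States taken from the side-by-side output above.
primary = {"this": "Active", "other": "Failed"}
secondary = {"this": "Active", "other": "Failed"}
print(is_split_brain(primary, secondary))

# A healthy pair for comparison.
healthy_primary = {"this": "Active", "other": "Standby Ready"}
healthy_secondary = {"this": "Standby Ready", "other": "Active"}
print(is_split_brain(healthy_primary, healthy_secondary))
```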
  Split-Brain (Active/Active)
Primary
> show failover history
==========================================================================
From State                 To State                   Reason
==========================================================================
06:45:28 UTC Jun 27 2023
Not Detected               Disabled                   No Error
11:54:36 UTC Jun 27 2023
Disabled                   Negotiation                Set by the config command
                                                      (failover)
11:55:21 UTC Jun 27 2023
Negotiation                Just Active                No Active unit found
11:55:21 UTC Jun 27 2023
Just Active                Active Drain               No Active unit found
11:55:21 UTC Jun 27 2023
Active Drain               Active Applying Config     No Active unit found
Secondary
> show failover history
==========================================================================
From State                 To State                   Reason
==========================================================================
19:04:58 UTC Jul 5 2023
Bulk Sync                  Standby Ready              Detected an Active peer
06:24:15 UTC Jul 6 2023
Standby Ready              Just Active                HELLO not heard from peer
                                                      (failover link up, no response from peer)
06:24:15 UTC Jul 6 2023
Just Active                Active Drain               HELLO not heard from peer
                                                      (failover link up, no response from peer)
06:24:15 UTC Jul 6 2023
Active Drain               Active Applying Config     HELLO not heard from peer
                                                      (failover link up, no response from peer)
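To compare timelines from both units, the history can be turned into structured entries. This is a minimal sketch, assuming one unit's `show failover history` output was saved as text with columns separated by two or more spaces. The function name `parse_failover_history` is hypothetical.

```python
import re

# Timestamp lines look like "06:24:15 UTC Jul 6 2023".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2} \S+ \w{3} \d{1,2} \d{4}$")


def parse_failover_history(output: str) -> list:
    """Return a list of dicts with time, from_state, to_state and reason."""
    entries = []
    current_time = None
    expecting_entry = False
    for raw in output.splitlines():
        line = raw.strip()
        if not line or set(line) <= {"="}:
            continue  # blank line or "====" separator
        if TIMESTAMP.match(line):
            current_time, expecting_entry = line, True
            continue
        if current_time is None:
            continue  # prompt and column headers before the first timestamp
        if expecting_entry:
            parts = re.split(r"\s{2,}", line, maxsplit=2)
            if len(parts) == 3:
                entries.append({
                    "time": current_time,
                    "from_state": parts[0],
                    "to_state": parts[1],
                    "reason": parts[2],
                })
                expecting_entry = False
        elif entries:
            # Wrapped text that continues the previous reason.
            entries[-1]["reason"] += " " + line
    return entries


# Excerpt of the Secondary unit's history shown above.
sample = """\
==========================================================================
From State                 To State                   Reason
==========================================================================
19:04:58 UTC Jul 5 2023
Bulk Sync                  Standby Ready              Detected an Active peer
06:24:15 UTC Jul 6 2023
Standby Ready              Just Active                HELLO not heard from peer
                                                      (failover link up, no response from peer)
"""

history = parse_failover_history(sample)
for entry in history:
    print(entry)
```

Filtering the entries for reasons containing "HELLO not heard from peer" gives the moment the unit stopped hearing its peer, which can be matched against syslog timestamps.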
Emergency Recovery from Split-Brain
• To minimize the impact of split-brain, you can disable failover on one of the units or disconnect it from
  the network.
• Disable failover on the unit not passing traffic:
  • On the ASA platform, over CLI, navigate to the configuration terminal and enter the "no failover" command.
  • On the FTD platform, over CLI, enter the "configure high-availability suspend" command.
• For FTD, shut down the interfaces on the connected device. Alternatively, you can physically
  disconnect the interfaces.
• You can also power off the device, but you will then be unable to manage it.
Suspend HA:
> configure high-availability suspend
Resume HA:
> configure high-availability resume
Successfully resumed high-availablity.
Split-Brain - Possible causes
• Split-Brain occurs when the communication between the failover Link interfaces is down
  (unidirectionally or bidirectionally). This scenario can be seen if failover and data links travel through
  the same path. The most common reasons are:
Procedure to Troubleshoot failover link - Flowchart
Start of troubleshooting
1. L1/L2: Is the status/protocol for the Failover LAN interface up on both units? (show interface ip brief)
   • NO: The link has to be UP on both units. Common reasons for the connection to be down include:
     • Failed/shut interface of an intermediate device: check the intermediate device, if any.
     • Issue with physical cabling or interface failure: check the physical connection and, if possible,
       replace cables/SFP.
   • YES: continue with step 2.
2. L3: Can both units ping each other over the Failover Link?
   • NO: Apply captures on both units for protocol 105 on the failover link interface, e.g.:
       cap test interface fover match 105 any any
     You should see protocol 105 packets in the capture between the Primary and Secondary unit. You will
     see ESP packets in case IPsec encryption is enabled on the failover interface.
     In case you see only one-way traffic on one or both of the boxes:
     > Check show blocks to verify whether Memory Block 1550 has been depleted.
     > Check show mac address-table on the intermediate L2 device, if any. Verify that the MAC addresses
       are being learnt correctly.
     > Another quick way to verify connectivity is to run the show failover command on both units. A
       "normal" status on each interface indicates that the keepalive packets are correctly received.
   • Otherwise, continue with step 3.
3. Is latency between the two units greater than 10ms?
   • YES: Check for latency by pinging the peer firewall's failover interface. Usually the round-trip
     time/2 is a good indicator of peak and average latency. For more accurate readings, captures on the
     failover interface from both units can be exported and compared.
     Latency between the two units in a Failover Pair needs to be under 250ms. It is recommended to keep
     latency under 10ms. Although latency is unlikely to cause a split-brain scenario, high latency can
     cause intermittent failovers and impact failover performance in general.
   • NO: Your problem is not a common problem. You should engage TAC by opening a case for further
     troubleshooting.
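The latency guidance in the flowchart (under 250 ms required, under 10 ms recommended, estimated as round-trip time divided by two) can be expressed as a small check. This is an illustrative sketch; the function name `classify_failover_latency` and the returned labels are hypothetical.

```python
REQUIRED_MAX_MS = 250.0     # latency needs to be under 250 ms
RECOMMENDED_MAX_MS = 10.0   # recommended to keep latency under 10 ms


def classify_failover_latency(round_trip_ms: float) -> str:
    """Classify failover-link latency from a measured ping round-trip time."""
    if round_trip_ms < 0:
        raise ValueError("round-trip time cannot be negative")
    one_way_ms = round_trip_ms / 2  # RTT/2 as the latency estimate
    if one_way_ms >= REQUIRED_MAX_MS:
        return "exceeds limit"
    if one_way_ms >= RECOMMENDED_MAX_MS:
        return "above recommendation"
    return "within recommendation"


for rtt in (8, 100, 600):
    print(rtt, "ms RTT ->", classify_failover_latency(rtt))
```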
To proactively prepare against a Split-Brain condition:
• Enable logging to an external syslog server and enable the logging timestamp option.
Verification Cheat Sheet
Commands                                   Logs
Disk0/log/fover_trace.log | /mnt/Disk0/log/fover_trace.log
Summary
• Troubleshooting steps for unexpected failover due to issues with monitored interfaces, disk or
  traceback.
• Explanation of App-sync errors and troubleshooting steps.
• HA best practices.