Sunday, February 2, 2014

How to detect a memory leak on Checkpoint Security Gateway SPLAT / Gaia

Background

A memory leak is an abnormal growth in memory usage that can originate in either Kernel Space or User Space.
Memory is allocated but never freed, which significantly degrades the performance of the machine and might cause it to crash.
This article describes a procedure for detecting a memory leak in Kernel Space (memory leaks in User Space are detected using special tools, e.g., valgrind, for a specific process).


Procedure

Note: The kernel parameters described below can be enabled (value set to 1) indefinitely without any impact - neither on security, nor on performance.
  1. To enable memory leak detection, set the following kernel parameters in the $FWDIR/boot/modules/fwkern.conf file, per sk26202:

    fw_salloc_debug_leaks=1
    fw_hmem_debug_leaks=1
    fw_kmem_cphwd_use_fw=1
    fw_kmem_detailed_leak_report=1
    fw_kdprintf_limit=0
    fw_kdprintf_limit_time=0
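
The edit in Step 1 can also be scripted. The following is a minimal sketch, assuming each parameter sits on its own name=value line in the file. The helper names set_kernel_param and enable_leak_detection are hypothetical (they are not Check Point commands), and the file path is passed as an argument so the sketch can be tried on a copy first.

```shell
#!/bin/sh
# Hypothetical helper: set NAME=VALUE in a fwkern.conf-style file.
# Assumes one "name=value" pair per line, as in the list above.
set_kernel_param() {
    conf_file=$1
    name=$2
    value=$3
    tmp_file="${conf_file}.tmp.$$"

    # Create the file if it does not exist yet.
    [ -f "$conf_file" ] || : > "$conf_file"

    # Drop any existing line for this parameter, then append the new value.
    # grep exits non-zero when nothing is left to print, hence '|| true'.
    grep -v "^${name}=" "$conf_file" > "$tmp_file" || true
    printf '%s=%s\n' "$name" "$value" >> "$tmp_file"

    # Write back in place so the original file keeps its permissions.
    cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}

# Hypothetical wrapper: write all six leak-detection parameters from Step 1.
enable_leak_detection() {
    for pair in \
        fw_salloc_debug_leaks=1 \
        fw_hmem_debug_leaks=1 \
        fw_kmem_cphwd_use_fw=1 \
        fw_kmem_detailed_leak_report=1 \
        fw_kdprintf_limit=0 \
        fw_kdprintf_limit_time=0
    do
        # Split "name=value" at the first '='.
        set_kernel_param "$1" "${pair%%=*}" "${pair#*=}"
    done
}
```

On a gateway this might be invoked as enable_leak_detection "$FWDIR/boot/modules/fwkern.conf" after backing up the file. Running it twice leaves a single line per parameter.
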
  2. Save the changes and reboot the machine.
  3. Verify that the values for kernel parameters were accepted:

    [Expert@HostName]# fw ctl get int fw_salloc_debug_leaks
    [Expert@HostName]# fw ctl get int fw_hmem_debug_leaks
    [Expert@HostName]# fw ctl get int fw_kmem_cphwd_use_fw
    [Expert@HostName]# fw ctl get int fw_kmem_detailed_leak_report
    [Expert@HostName]# fw ctl get int fw_kdprintf_limit
    [Expert@HostName]# fw ctl get int fw_kdprintf_limit_time
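
The six checks in Step 3 can be wrapped in a loop. This is only a sketch: it assumes that 'fw ctl get int NAME' prints a line of the form "NAME = VALUE", so it takes the last field of the output as the value. The function name check_leak_params is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report every parameter whose current value differs
# from the value set in Step 1. Returns non-zero if any mismatch is found.
# Assumption: 'fw ctl get int NAME' prints "NAME = VALUE".
check_leak_params() {
    mismatch=0
    for pair in \
        fw_salloc_debug_leaks=1 \
        fw_hmem_debug_leaks=1 \
        fw_kmem_cphwd_use_fw=1 \
        fw_kmem_detailed_leak_report=1 \
        fw_kdprintf_limit=0 \
        fw_kdprintf_limit_time=0
    do
        name=${pair%%=*}
        expected=${pair#*=}
        # Keep the last whitespace-separated field of the last non-empty line.
        actual=$(fw ctl get int "$name" 2>/dev/null |
                 awk 'NF { v = $NF } END { print v }')
        if [ "$actual" != "$expected" ]; then
            echo "MISMATCH: $name is '$actual', expected '$expected'"
            mismatch=1
        fi
    done
    return $mismatch
}
```

Calling check_leak_params after the reboot prints nothing and returns 0 when all six values were accepted.
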
  4. Collect CPinfo file:

    [Expert@HostName]# cpinfo -z -n -o /var/log/$(uname -n)_before.cpinfo
  5. Let the system run for at least several days - if possible, stress the machine by passing complex traffic through the gateway.
  6. On Gaia OS: Stop RouteD daemon:

    [Expert@HostName]# tellpm process:routed

    Notes:
    • In R76 cluster, this might cause a fail-over between cluster members (starting in R76, a new Device Name / Pnote called 'routed' was introduced). Refer to sk92787.
    • If the RouteD daemon is not stopped, the Check Point kernel module cannot be unloaded (in Steps 8, 9 and 10) because the /dev/fw* devices remain in use, as can be seen in the output of the 'lsof | grep -v grep | grep -E "PID|routed" | grep -E "PID|/dev/fw"' command.
    • This step applies only to Gaia R75.40 / R75.40VS / R75.45 / R75.46 / R76.
    • This issue was fixed in R75.47.

  7. CRUCIAL: Collect a CPinfo file immediately before Step 8:

    [Expert@HostName]# cpinfo -z -n -o /var/log/$(uname -n)_during.cpinfo
  8. Stop all Check Point processes and applications:

    [Expert@HostName]# cpstop
  9. Stop all Check Point services:

    [Expert@HostName]# service cpboot stop
  10. Unload the Check Point kernel modules:

    [Expert@HostName]# cpstop -fwflag -driver

    Note: Check the output carefully - there should NOT be any messages stating that the FireWall kernel module could not be unloaded.
    Example of problematic message:
    fwmod_smp.2.4.21.cp.i686: Device or resource busy
    Possible reasons that the module is still being used:
    • Policy installation was in progress
    • Kernel debug was running
    • Some User Space process is still using the FireWall module (fwmod)
    Possible checks:
      Perform the following checks, then repeat the previous step ('service cpboot stop'):

    • Stop policy installation
    • Stop kernel debug

      • (A)
        Check that only the 'error', 'warning', or 'none' flags are enabled for the different modules:
        [Expert@FW]# fw ctl debug

        To reset the flags to their defaults, run:
        [Expert@FW]# fw ctl debug 0
      • (B)
        Check that no kernel debugs are running
        The output of the following command should be empty
        [Expert@FW]# ps auxw | grep -v 'grep' | grep 'debug'

    • Stop the User Space process that uses the FireWall module

      • (A)
        The best practice is to stop the service that runs this process, using the Linux 'service' command.
        If no such service exists, go to Step (B).

        Example:
        [Expert@FW]# lsof /dev/fw0
        COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
        cpsnmpage 1112 root 20u CHR 253,0 65622 /dev/fw0

        In this case, the 'cpsnmpage' process is /usr/sbin/cpsnmpagentx (lsof truncates the command name).

        The service that runs this process is SNMP.

        Try stopping the SNMP service via the Linux 'service' command:
        [Expert@FW]# service snmp stop

        If 'lsof /dev/fw0' command still shows this process, then try stopping the SNMP service via SNMP command
        [Expert@FW]# snmp service disable

        If 'lsof /dev/fw0' command still shows this process, then go to next Step (B)
      • (B)
        Kill the process that uses the FireWall module
        [Expert@HostName]# kill -KILL PID_of_Process

        NOTE: PID of the process appears in the output of 'lsof' command in 2nd column 'PID'

        If the 'lsof /dev/fw0' command still shows this process, contact Check Point Support.
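
When several processes hold the device, the PIDs can be pulled out of the 'lsof' output mechanically. The sketch below assumes the column layout shown in the example above (COMMAND first, PID second, device path last); the function name fw_device_holders is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read 'lsof' output on stdin and print "PID COMMAND"
# for every line whose last column is a /dev/fw* device.
# The header line is skipped automatically because its last column is "NAME".
fw_device_holders() {
    awk '$NF ~ /^\/dev\/fw/ { print $2, $1 }' | sort -u
}
```

A possible use is 'lsof /dev/fw0 | fw_device_holders', which lists the candidates for Steps (A) and (B) without killing anything.
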

  11. Check if FireWall kernel module is still loaded:

    [Expert@HostName]# lsmod | grep fwmod

    Note: This step is relevant for R6x versions only, skip this step for R7x versions.
  12. If FireWall kernel module is still loaded, unload it manually:

    [Expert@HostName]# rmmod <NAME_OF_FWMOD>

    Notes:
    • This step is relevant for R6x versions only, skip this step for R7x versions.
    • Check the output carefully - there should NOT be any messages stating that the kernel module could not be unloaded.

  13. CRUCIAL: Collect the memory leak information by using the following exact syntax:

    [Expert@HostName]# \date >> /var/log/leak.txt
    [Expert@HostName]# dmesg >> /var/log/leak.txt
    [Expert@HostName]# \date >> /var/log/leak.txt
  14. Collect CPinfo file:

    [Expert@HostName]# cpinfo -z -n -o /var/log/$(uname -n)_after.cpinfo
  15. Start the Check Point services:

    [Expert@HostName]# service cpboot start
  16. Start Check Point processes and applications:

    [Expert@HostName]# cpstart
  17. On Gaia OS: Start RouteD daemon (which was stopped in Step 6):

    [Expert@HostName]# tellpm process:routed t
  18. Send the following files to Check Point Support:

    /var/log/leak.txt
    /var/log/messag*
    /var/log/<HostName>_before.cpinfo.gz
    /var/log/<HostName>_during.cpinfo.gz
    /var/log/<HostName>_after.cpinfo.gz
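
Before sending, it may help to confirm that every file from the list above exists. The following is a minimal sketch; the function name missing_support_files is hypothetical, and the log directory and host name are passed as arguments rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helper: print the path of every expected file that is
# missing or empty. Prints nothing when the set is complete.
#   $1 = log directory (e.g. /var/log)
#   $2 = host name used in the CPinfo file names (e.g. output of 'uname -n')
missing_support_files() {
    log_dir=$1
    host=$2
    for f in leak.txt \
             "${host}_before.cpinfo.gz" \
             "${host}_during.cpinfo.gz" \
             "${host}_after.cpinfo.gz"
    do
        # -s is true only if the file exists and is not empty.
        [ -s "$log_dir/$f" ] || echo "$log_dir/$f"
    done

    # At least one file should match the messag* pattern.
    # If nothing matches, the pattern stays unexpanded and -e fails.
    set -- "$log_dir"/messag*
    [ -e "$1" ] || echo "$log_dir/messag*"
}
```

For example, missing_support_files /var/log "$(uname -n)" should print nothing once Steps 4, 7, 13 and 14 have all been completed.
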
  19. To disable memory leak detection, set the following kernel parameters in the $FWDIR/boot/modules/fwkern.conf file, per sk26202:

    fw_salloc_debug_leaks=0
    fw_hmem_debug_leaks=0
    fw_kmem_cphwd_use_fw=0
    fw_kmem_detailed_leak_report=0
    fw_kdprintf_limit=30
    fw_kdprintf_limit_time=60

    Note: Another way to disable memory leak detection is to delete all these parameters from the $FWDIR/boot/modules/fwkern.conf file.
  20. Save the changes in $FWDIR/boot/modules/fwkern.conf file and reboot the machine.
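
The alternative from the note in Step 19 (deleting the parameters instead of resetting them) could be scripted roughly as follows. This is a sketch under the assumption that each parameter occupies its own name=value line; the function name remove_leak_params is hypothetical, and the file path is an argument so that a copy can be edited first.

```shell
#!/bin/sh
# Hypothetical helper: delete the six leak-detection parameters from a
# fwkern.conf-style file and leave every other line untouched.
remove_leak_params() {
    conf_file=$1
    tmp_file="${conf_file}.tmp.$$"

    # Keep only the lines that do not set one of the six parameters.
    # grep exits non-zero when no line is kept, hence '|| true'.
    grep -v -E '^(fw_salloc_debug_leaks|fw_hmem_debug_leaks|fw_kmem_cphwd_use_fw|fw_kmem_detailed_leak_report|fw_kdprintf_limit|fw_kdprintf_limit_time)=' \
        "$conf_file" > "$tmp_file" || true

    # Write back in place so the original file keeps its permissions.
    cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}
```

As with any change to this file, a reboot is still needed for it to take effect.
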
