Sunday, February 2, 2014

How to detect a memory leak on Checkpoint Security Gateway SPLAT / Gaia

Background

A memory leak is an abnormal growth of memory usage that can occur in either Kernel Space or User Space.
Memory is allocated but never freed, which significantly degrades the performance of the machine and might cause it to crash.
This article describes a procedure for detecting a memory leak in Kernel Space (memory leaks in User Space are detected using special tools, e.g., valgrind, for a specific process).


Procedure

Note: The kernel parameters described below can be enabled (value set to 1) indefinitely without any impact - neither on security, nor on performance.
  1. To enable memory leak detection, set the following kernel parameters in $FWDIR/boot/modules/fwkern.conf file per sk26202.

    fw_salloc_debug_leaks=1
    fw_hmem_debug_leaks=1
    fw_kmem_cphwd_use_fw=1
    fw_kmem_detailed_leak_report=1
    fw_kdprintf_limit=0
    fw_kdprintf_limit_time=0
  2. Save the changes and reboot the machine.
  3. Verify that the values for kernel parameters were accepted:

    [Expert@HostName]# fw ctl get int fw_salloc_debug_leaks
    [Expert@HostName]# fw ctl get int fw_hmem_debug_leaks
    [Expert@HostName]# fw ctl get int fw_kmem_cphwd_use_fw
    [Expert@HostName]# fw ctl get int fw_kmem_detailed_leak_report
    [Expert@HostName]# fw ctl get int fw_kdprintf_limit
    [Expert@HostName]# fw ctl get int fw_kdprintf_limit_time
  4. Collect CPinfo file:

    [Expert@HostName]# cpinfo -z -n -o /var/log/$(uname -n)_before.cpinfo
  5. Let the system run for at least several days - if possible, stress the machine by passing complex traffic through the gateway.
  6. On Gaia OS: Stop RouteD daemon:

    [Expert@HostName]# tellpm process:routed

    Notes:
    • In R76 cluster, this might cause a fail-over between cluster members (starting in R76, a new Device Name / Pnote called 'routed' was introduced). Refer to sk92787.
    • If RouteD daemon is not stopped, then Check Point kernel module will not be able to unload (in Steps 8, 9 and 10) because /dev/fw* devices will remain in use, which can be seen in the output of 'lsof | grep -v grep | grep -E "PID|routed" | grep -E "PID|/dev/fw"' command.
    • This step applies only to Gaia R75.40 / R75.40VS / R75.45 / R75.46 / R76.
    • This issue was fixed in R75.47.

  7. CRUCIAL: Collect CPinfo file right before next Step 8:

    [Expert@HostName]# cpinfo -z -n -o /var/log/$(uname -n)_during.cpinfo
  8. Stop all Check Point processes and applications:

    [Expert@HostName]# cpstop
  9. Stop all Check Point services:

    [Expert@HostName]# service cpboot stop
  10. Unload the Check Point kernel modules:

    [Expert@HostName]# cpstop -fwflag -driver

    Note: Check the output carefully - there should NOT be any messages telling that FireWall kernel module could not be unloaded.
    Example of problematic message:
    fwmod_smp.2.4.21.cp.i686: Device or resource busy
    Possible reasons that the module is still being used:
    • Policy installation was in progress
    • Kernel debug was running
    • Some User Space process is still using the FireWall module (fwmod)
    Possible checks:
      Perform the following checks, then repeat the previous step ('service cpboot stop')

    • Stop policy installation
    • Stop kernel debug

      • (A)
        Check that only the 'error' and 'warning' flags, or no flags at all, are enabled for the different modules
        [Expert@HostName]# fw ctl debug

        To reset the flags to their defaults, run
        [Expert@HostName]# fw ctl debug 0
      • (B)
        Check that no kernel debugs are running
        The output of the following command should be empty
        [Expert@HostName]# ps auxw | grep -v 'grep' | grep 'debug'

    • Stop the User Space process that uses the FireWall module

      • (A)
        The best practice is to try stopping the Service, which runs this process via Linux 'service' command
        If no such Service exists, then go to next Step (B)

        Example:
        [Expert@HostName]# lsof /dev/fw0
        COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
        cpsnmpage 1112 root 20u CHR 253,0 65622 /dev/fw0

        In this case, the 'cpsnmpage' process is /usr/sbin/cpsnmpagentx

        The service that runs this process is SNMP

        Try stopping the SNMP service via Linux 'service' command
        [Expert@HostName]# service snmp stop

        If 'lsof /dev/fw0' command still shows this process, then try stopping the SNMP service via SNMP command
        [Expert@HostName]# snmp service disable

        If 'lsof /dev/fw0' command still shows this process, then go to next Step (B)
      • (B)
        Kill the process that uses the FireWall module
        [Expert@HostName]# kill -KILL PID_of_Process

        NOTE: PID of the process appears in the output of 'lsof' command in 2nd column 'PID'

        If 'lsof /dev/fw0' command still shows this process, then contact Check Point Support

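The check in Step 10 (finding which User Space process still holds the FireWall device) can be sketched as a small shell helper. This is only an illustration, not part of the original procedure: the function name fw_device_holders is hypothetical, and it assumes the 'lsof' output has the same column layout as the example above (COMMAND in the 1st column, PID in the 2nd).

```shell
#!/bin/sh
# fw_device_holders (hypothetical helper): read 'lsof' output on stdin and
# print one "PID COMMAND" line per distinct process.
# Assumption: column 1 is COMMAND and column 2 is PID, as in the example above.
# The header line is skipped because its 2nd field ("PID") is not numeric.
fw_device_holders() {
    awk '$2 ~ /^[0-9]+$/ && !seen[$2]++ { print $2, $1 }'
}

# Intended usage on the gateway (not executed here):
#   lsof /dev/fw0 | fw_device_holders
```

An empty result means nothing is holding the device, so the kernel module should be free to unload.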
  11. Check if FireWall kernel module is still loaded:

    [Expert@HostName]# lsmod | grep fwmod

    Note: This step is relevant for R6x versions only, skip this step for R7x versions.
  12. If FireWall kernel module is still loaded, unload it manually:

    [Expert@HostName]# rmmod <NAME_OF_FWMOD>

    Notes:
    • This step is relevant for R6x versions only, skip this step for R7x versions.
    • Check the output carefully - there should NOT be any messages telling that kernel module could not be unloaded.

  13. CRUCIAL: Collect the memory leak information by using the following exact syntax:

    [Expert@HostName]# \date >> /var/log/leak.txt
    [Expert@HostName]# dmesg >> /var/log/leak.txt
    [Expert@HostName]# \date >> /var/log/leak.txt
  14. Collect CPinfo file:

    [Expert@HostName]# cpinfo -z -n -o /var/log/$(uname -n)_after.cpinfo
  15. Start the Check Point services:

    [Expert@HostName]# service cpboot start
  16. Start Check Point processes and applications:

    [Expert@HostName]# cpstart
  17. On Gaia OS: Start RouteD daemon (which was stopped in Step 6):

    [Expert@HostName]# tellpm process:routed t
  18. Send the following files to Check Point Support:

    /var/log/leak.txt
    /var/log/messag*
    /var/log/<HostName>_before.cpinfo.gz
    /var/log/<HostName>_during.cpinfo.gz
    /var/log/<HostName>_after.cpinfo.gz
  19. To disable memory leak detection, set the following kernel parameters in $FWDIR/boot/modules/fwkern.conf file per sk26202.

    fw_salloc_debug_leaks=0
    fw_hmem_debug_leaks=0
    fw_kmem_cphwd_use_fw=0
    fw_kmem_detailed_leak_report=0
    fw_kdprintf_limit=30
    fw_kdprintf_limit_time=60

    Note: Another way to disable memory leak detection is to delete all these parameters from the $FWDIR/boot/modules/fwkern.conf file.
  20. Save the changes in $FWDIR/boot/modules/fwkern.conf file and reboot the machine.
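Before rebooting in Step 2 or Step 20, it may help to confirm that the file really contains the intended settings. The following is a minimal sketch, not part of the original procedure: check_fwkern_conf is a hypothetical helper, and it assumes every parameter sits on its own line exactly as listed above (name=value, with no surrounding spaces).

```shell
#!/bin/sh
# check_fwkern_conf FILE NAME=VALUE... (hypothetical helper)
# Returns 0 when every NAME=VALUE pair appears as a complete line in FILE.
# Otherwise prints "missing: NAME=VALUE" for each absent pair and returns 1.
# Assumption: entries are written exactly as "name=value", one per line.
check_fwkern_conf() {
    conf=$1
    shift
    missing=0
    for pair in "$@"; do
        # -q: quiet, -x: match the whole line, -F: treat the pair as a fixed string
        if ! grep -qxF -- "$pair" "$conf"; then
            echo "missing: $pair"
            missing=1
        fi
    done
    return "$missing"
}

# Intended usage on the gateway for Step 1 (not executed here):
#   check_fwkern_conf "$FWDIR/boot/modules/fwkern.conf" \
#       fw_salloc_debug_leaks=1 fw_hmem_debug_leaks=1 \
#       fw_kmem_cphwd_use_fw=1 fw_kmem_detailed_leak_report=1 \
#       fw_kdprintf_limit=0 fw_kdprintf_limit_time=0
```

For Step 19, the same call can be made with the values listed there (0, 0, 0, 0, 30 and 60).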

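Before sending the files listed in Step 18, a quick check that all of them exist and are not empty can save a round trip with Support. This is a hedged sketch only: missing_support_files is a hypothetical helper, and the directory and host name are passed in as arguments rather than taken from the system.

```shell
#!/bin/sh
# missing_support_files DIR HOSTNAME (hypothetical helper)
# Prints every file from Step 18 that is absent or empty in DIR.
# Prints nothing when the whole set is present.
missing_support_files() {
    dir=$1
    host=$2
    for f in "$dir/leak.txt" \
             "$dir/${host}_before.cpinfo.gz" \
             "$dir/${host}_during.cpinfo.gz" \
             "$dir/${host}_after.cpinfo.gz"; do
        # -s: file exists and has a size greater than zero
        [ -s "$f" ] || echo "$f"
    done
    # messag* is a glob; report it only when no file matches at all.
    set -- "$dir"/messag*
    [ -e "$1" ] || echo "$dir/messag*"
}

# Intended usage on the gateway (not executed here):
#   missing_support_files /var/log "$(uname -n)"
```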