
Performance Management Guide

TCP and UDP performance tuning

The optimal settings of the tunable communications parameters vary with the type of LAN, as well as with the communications-I/O characteristics of the predominant system and application programs. This section describes the global principles of communications tuning for AIX.

The sections that follow provide an outline for verifying and tuning a network installation and workload.

Adapter Placement

Network performance depends on the hardware you select, such as the adapter type, and on where the adapters are placed in the machine. To ensure the best performance, you must place the network adapters in the I/O bus slots that are best suited for each adapter.

Consider the following items:

The higher the bandwidth or data rate of the adapter, the more critical the slot placement. For example, PCI-X adapters perform best when used in PCI-X slots, as they typically run at 133 MHz clock speed on the bus. You can place PCI-X adapters in PCI slots, but they run slower on the bus, typically at 33 MHz or 66 MHz, and do not perform as well on some workloads.

Similarly, 64-bit adapters work best when installed in 64-bit slots. You can place 64-bit adapters in a 32-bit slot, but they do not perform at optimal rates. Large MTU adapters, like Gigabit Ethernet in jumbo frame mode, perform much better in 64-bit slots.

Another issue that potentially affects performance is the number of adapters per bus or per PCI host bridge (PHB). Depending on the system model and the adapter type, the number of high-speed adapters per PHB may be limited. The placement guidelines ensure that the adapters are spread across the various PCI buses and might limit the number of adapters per PCI bus. Consult the PCI Adapter Placement Reference for more information by machine model and adapter type.

The following table lists the types of PCI and PCI-X slots available in IBM pSeries eServers:

Slot type                  Code used in this topic
PCI 32-bit 33 MHz          A
PCI 32-bit 50/66 MHz       B
PCI 64-bit 33 MHz          C
PCI 64-bit 50/66 MHz       D
PCI-X 32-bit 33 MHz        E
PCI-X 32-bit 66 MHz        F
PCI-X 64-bit 33 MHz        G
PCI-X 64-bit 66 MHz        H
PCI-X 64-bit 133 MHz       I

The newer IBM pSeries servers only have PCI-X slots. The PCI-X slots are backwards-compatible with the PCI adapters.

The following table shows examples of common adapters and the suggested slot types:

Adapter type                                                                  Preferred slot type (lowest to highest priority)
10/100 Mbps Ethernet PCI Adapter II (10/100 Ethernet), FC 4962               A-I
IBM PCI 155 Mbps ATM adapter, FC 4953 or 4957                                 D, H, and I
IBM PCI 622 Mbps MMF ATM adapter, FC 2946                                     D, G, H, and I
Gigabit Ethernet-SX PCI Adapter, FC 2969                                      D, G, H, and I
IBM 10/100/1000 Base-T Ethernet PCI Adapter, FC 2975                          D, G, H, and I
Gigabit Ethernet-SX PCI-X Adapter (Gigabit Ethernet fiber), FC 5700           G, H, and I
10/100/1000 Base-TX PCI-X Adapter (Gigabit Ethernet), FC 5701                 G, H, and I
2-Port Gigabit Ethernet-SX PCI-X Adapter (Gigabit Ethernet fiber), FC 5707    G, H, and I
2-Port 10/100/1000 Base-TX PCI-X Adapter (Gigabit Ethernet), FC 5706          G, H, and I

The lsslot -c pci command displays the PCI slots in the system, their characteristics, and the devices installed in them.

The following is an example of the lsslot -c pci command on a 2-way p615 system with 6 internal slots:

# lsslot -c pci
# Slot      Description                         Device(s)
U0.1-P1-I1  PCI-X capable, 64 bit, 133 MHz slot  fcs0
U0.1-P1-I2  PCI-X capable, 32 bit, 66 MHz slot   Empty
U0.1-P1-I3  PCI-X capable, 32 bit, 66 MHz slot   Empty
U0.1-P1-I4  PCI-X capable, 64 bit, 133 MHz slot  fcs1
U0.1-P1-I5  PCI-X capable, 64 bit, 133 MHz slot  ent0
U0.1-P1-I6  PCI-X capable, 64 bit, 133 MHz slot  ent2

For a Gigabit Ethernet adapter, the adapter-specific statistics at the end of the entstat -d ent[adapter-number] command output or the netstat -v command output show the PCI bus type and bus speed of the adapter. The following is an example output of the netstat -v command:

# netstat -v

10/100/1000 Base-TX PCI-X Adapter (14106902) Specific Statistics:
--------------------------------------------------------------------
Link Status: Up
Media Speed Selected: Auto negotiation
Media Speed Running: 1000 Mbps Full Duplex
PCI Mode: PCI-X (100-133)
PCI Bus Width: 64 bit

System Firmware

The system firmware is responsible for configuring several key parameters on each PCI adapter as well as configuring options in the I/O chips on the various I/O and PCI buses in the system. In some cases, the firmware sets parameters unique to specific adapters, for example the PCI Latency Timer and Cache Line Size, and for PCI-X adapters, the Maximum Memory Read Byte Count (MMRBC) values. These parameters are key to obtaining good performance from the adapters. If these parameters are not properly set because of down-level firmware, it will be impossible to achieve optimal performance by software tuning alone. Ensure that you update the firmware on older systems before adding new adapters to the system.

Firmware release level information and firmware updates can be downloaded from the following link: https://techsupport.services.ibm.com/server/mdownload//download.html

You can see both the platform and system firmware levels with the lscfg -vp|grep -p " ROM" command, as in the following example:

lscfg -vp|grep -p " ROM"

     ...lines omitted...
 
     System Firmware:
        ROM Level (alterable).......M2P030828
        Version.....................RS6K
        System Info Specific.(YL)...U0.1-P1/Y1
      Physical Location: U0.1-P1/Y1

      SPCN firmware:
        ROM Level (alterable).......0000CMD02252
        Version.....................RS6K
        System Info Specific.(YL)...U0.1-P1/Y3
      Physical Location: U0.1-P1/Y3

      SPCN firmware:
        ROM Level (alterable).......0000CMD02252
        Version.....................RS6K
        System Info Specific.(YL)...U0.2-P1/Y3
      Physical Location: U0.2-P1/Y3

      Platform Firmware:
        ROM Level (alterable).......MM030829
        Version.....................RS6K
        System Info Specific.(YL)...U0.1-P1/Y2
      Physical Location: U0.1-P1/Y2

Adapter performance guidelines

User payload data rates can be obtained by sockets-based programs for applications that are streaming data over a TCP connection, for example, one program doing send() calls and the receiver doing recv() calls. The rates are a function of the network bit rate, the MTU size (frame size), the physical-level overhead (such as the inter-frame gap and preamble bits), the data link headers, and the TCP/IP headers, and they assume a gigahertz-speed CPU. These rates are best-case numbers for a single LAN, and may be lower if the traffic goes through routers, additional network hops, or remote links.

Single direction (simplex) TCP streaming rates are the rates that can be achieved by a workload such as the ftp command sending data from machine A to machine B in a memory-to-memory test. See The ftp Command in Analyzing Network Performance. Note that full duplex media performs slightly better than half duplex media because the TCP ACKs can flow back without contending for the same wire that the data packets are flowing on.
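
For example, a memory-to-memory streaming test can be run with the ftp command by piping data from /dev/zero into a put that writes to /dev/null on the remote side. This is a sketch; the block size and count shown here are arbitrary:

ftp> put "|dd if=/dev/zero bs=32k count=10000" /dev/null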

The following table lists maximum possible network payload speeds and the single direction (simplex) TCP streaming rates:

Note:
In the following tables, the Raw bit Rate value is the physical media bit rate and does not reflect physical media overheads like Inter-Frame gaps, preamble bits, cell overhead (for ATM), data link headers and trailers. These all reduce the effective usable bit rate of the wire.
Network type                                Raw bit Rate (Mbits)           Payload Rate (Mbits)   Payload Rate (MB)
10 Mbit Ethernet, Half Duplex               10                             6                      0.7
10 Mbit Ethernet, Full Duplex               10 (20 Mbit full duplex)       9.48                   1.13
100 Mbit Ethernet, Half Duplex              100                            62                     7.3
100 Mbit Ethernet, Full Duplex              100 (200 Mbit full duplex)     94.8                   11.3
1000 Mbit Ethernet, Full Duplex, MTU 1500   1000 (2000 Mbit full duplex)   948                    113.0
1000 Mbit Ethernet, Full Duplex, MTU 9000   1000 (2000 Mbit full duplex)   989                    117.9
FDDI, MTU 4352 (default)                    100                            92                     11.0
ATM 155, MTU 1500                           155                            125                    14.9
ATM 155, MTU 9180 (default)                 155                            133                    15.9
ATM 622, MTU 1500                           622                            364                    43.4
ATM 622, MTU 9180 (default)                 622                            534                    63.6

Two direction (duplex) TCP streaming workloads have data streaming in both directions, for example, running the ftp command from machine A to machine B and another instance of the ftp command from machine B to machine A concurrently. These types of workloads take advantage of full duplex media that can send and receive data concurrently. Some media, like FDDI or Ethernet in Half Duplex mode, cannot send and receive data concurrently and will not perform well when running duplex workloads. Duplex workloads do not scale to twice the rate of a simplex workload because the TCP ACK packets coming back from the receiver now have to compete with data packets flowing in the same direction. The following table lists the two direction (duplex) TCP streaming rates:

Network type                                Raw bit Rate (Mbits)           Payload Rate (Mbits)   Payload Rate (MB)
10 Mbit Ethernet, Half Duplex               10                             5.8                    0.7
10 Mbit Ethernet, Full Duplex               10 (20 Mbit full duplex)       18                     2.2
100 Mbit Ethernet, Half Duplex              100                            58                     7.0
100 Mbit Ethernet, Full Duplex              100 (200 Mbit full duplex)     177                    21.1
1000 Mbit Ethernet, Full Duplex, MTU 1500   1000 (2000 Mbit full duplex)   1470 (1660 peak)       175 (198 peak)
1000 Mbit Ethernet, Full Duplex, MTU 9000   1000 (2000 Mbit full duplex)   1680 (1938 peak)       200 (231 peak)
FDDI, MTU 4352 (default)                    100                            97                     11.6
ATM 155, MTU 1500                           155 (310 Mbit full duplex)     180                    21.5
ATM 155, MTU 9180 (default)                 155 (310 Mbit full duplex)     236                    28.2
ATM 622, MTU 1500                           622 (1244 Mbit full duplex)    476                    56.7
ATM 622, MTU 9180 (default)                 622 (1244 Mbit full duplex)    884                    105
Notes:
  1. Peak numbers represent best case throughput with multiple TCP sessions running in each direction. Other rates are for single TCP sessions.
  2. 1000 Mbit Ethernet (Gigabit Ethernet) duplex rates are for PCI-X adapters in PCI-X slots. Performance is slower on duplex workloads for PCI adapters or PCI-X adapters in PCI slots.
  3. Data rates are for TCP/IP using IPv4. Adapters with an MTU size of 4096 or larger have the rfc1323 option enabled.

Adapter and device settings

Several adapter or device options are important for both proper operation and best performance. AIX devices typically have default values that should work well for most installations. Therefore, these device values normally do not require changes. However, some companies have policies that require specific network settings or some network equipment might require some of these defaults to be changed.

Adapter speed and duplex mode settings

You can configure Ethernet adapters either for a specific speed and duplex mode or for Auto_Negotiation mode.

It is important that you configure both the adapter and the other endpoint of the cable (normally an Ethernet switch or another adapter if running in a point-to-point configuration without an Ethernet switch) the same way. The default setting for AIX is Auto_Negotiation, which negotiates the speed and duplex settings for the highest possible data rates. For the Auto_Negotiation mode to function properly, you must also configure the other endpoint (switch) for Auto_Negotiation mode.

If one endpoint is manually set to a specific speed and duplex mode, the other endpoint should also be manually set to the same speed and duplex mode. Having one end manually set and the other in Auto_Negotiation mode normally results in problems that make the link perform slowly.

It is best to use Auto_Negotiation mode whenever possible, as it is the default setting for most Ethernet switches. However, some 10/100 Ethernet switches do not support Auto_Negotiation of the duplex mode. These types of switches require that you manually set both endpoints to the desired speed and duplex mode.

You must use the commands that are unique to each Ethernet switch to display the port settings and change the port speed and duplex mode settings within the Ethernet switch. Refer to your switch vendors' documentation for these commands.

For AIX, you can use the smitty devices command to change the adapter settings. You can use the netstat -v command or the entstat -d entX command, where X is the adapter number, to display the settings and negotiated mode. The following is part of an example of the entstat -d ent3 command output:

10/100/1000 Base-TX PCI-X Adapter (14106902) Specific Statistics:
--------------------------------------------------------------------
Link Status: Up
Media Speed Selected: Auto negotiation
Media Speed Running: 1000 Mbps Full Duplex
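
If your site requires a fixed setting instead of Auto_Negotiation, the chdev command can change the adapter attribute directly. The following is a sketch that assumes the adapter supports the media_speed attribute (shown in the lsattr output later in this section) and that the interface is detached first:

# ifconfig en0 detach
# chdev -l ent0 -a media_speed=100_Full_Duplex
# ifconfig en0 hostname up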

Adapter MTU setting

All devices on the same physical network, or on the same logical network if using VLAN tagging, must have the same Maximum Transmission Unit (MTU) size. This is the maximum size of a frame (or packet) that can be sent on the wire.

The various network adapters support different MTU sizes, so make sure that you use the same MTU size for all the devices on the network. For example, you cannot have a Gigabit Ethernet adapter using jumbo frame mode with an MTU size of 9000 bytes while other adapters on the network use the default MTU size of 1500 bytes. 10/100 Ethernet adapters do not support jumbo frame mode, so they are not compatible with this Gigabit Ethernet option. You also have to configure Ethernet switches to use jumbo frames, if jumbo frames are supported on your Ethernet switch.

It is important to select the MTU size of the adapter early in the network setup so you can properly configure all the devices and switches. Also, many AIX tuning options are dependent upon the selected MTU size.
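
For example, you can verify the MTU size currently in use on each host with the following commands:

# netstat -i                    # the Mtu column shows the MTU of each interface
# lsattr -E -l en0 -a mtu       # shows the mtu attribute of the en0 interface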

MTU size performance impacts

The MTU size of the network can have a large impact on performance. The use of large MTU sizes allows the operating system to send fewer packets of a larger size to reach the same network throughput. The larger packets greatly reduce the processing required in the operating system, assuming the workload allows large messages to be sent. If the workload is only sending small messages, then the larger MTU size will not help.

When possible, use the largest MTU size that the adapter and network support. For example, on ATM, the default MTU size of 9180 bytes is much more efficient than an MTU size of 1500 bytes (normally used by LAN Emulation). With Gigabit Ethernet, if all of the machines on the network have Gigabit Ethernet adapters and there are no 10/100 adapters on the network, then it would be best to use jumbo frame mode. For example, a server-to-server connection within the computer lab can typically use jumbo frames.

Selecting jumbo frame mode on Gigabit Ethernet

You must select jumbo frame mode as a device option; trying to change the MTU size with the ifconfig command does not work. Use SMIT to change the adapter setting with the following steps:

  1. Select Devices
  2. Select Communications
  3. Select Adapter Type
  4. Select Change/Show Characteristics of an Ethernet Adapter
  5. Change the Transmit Jumbo Frames option from no to yes

The SMIT screen looks like the following:

                                  Change/Show Characteristics of an Ethernet Adapter

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                                     [Entry Fields]
  Ethernet Adapter                                    ent0
  Description                                         10/100/1000 Base-TX PCI-X Adapter (14106902)
  Status                                              Available
  Location                                            1H-08
  Receive descriptor queue size                      [1024]                                                      +#
  Transmit descriptor queue size                     [512]                                                       +#
  Software transmit queue size                       [8192]                                                      +#
  Transmit jumbo frames                               yes                                                         +
  Enable hardware transmit TCP resegmentation         yes                                                         +
  Enable hardware transmit and receive checksum       yes                                                         +
  Media Speed                                         Auto_Negotiation                                            +
  Enable ALTERNATE ETHERNET address                   no                                                          +
  ALTERNATE ETHERNET address                         [0x000000000000]                                             +
  Apply change to DATABASE only                       no                                                          +

F1=Help                       F2=Refresh                    F3=Cancel                     F4=List
Esc+5=Reset                   Esc+6=Command                 Esc+7=Edit                    Esc+8=Image
Esc+9=Shell                   Esc+0=Exit                    Enter=Do
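
As a command-line alternative, the same change can be made with the chdev command. The following is a sketch that assumes the device attribute behind the Transmit jumbo frames field is named jumbo_frames (verify the attribute name with lsattr -E -l ent0) and that the interface is detached first:

# ifconfig en0 detach
# chdev -l ent0 -a jumbo_frames=yes
# ifconfig en0 hostname up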

The no command

The network options (no) command displays, changes, and manages the global network options. An alternate method for tuning some of these parameters is discussed in the Interface-Specific Network Options (ISNO) section.

The following no command options are used to change the tuning parameters:

Option                   Definition
-a                       Prints all tunables and their current values.
-d [tunable]             Sets the specified tunable back to its default value.
-D                       Sets all options back to their default values.
-o tunable=[New Value]   Displays the value of the specified tunable or sets it to the specified new value.
-h [tunable]             Displays help about the specified tunable parameter, if one is specified. Otherwise, displays the no command usage statement.
-r                       Used with the -o option to change a tunable that is of type Reboot to be permanent in the nextboot file.
-p                       Used with the -o option to make a dynamic tunable permanent in the nextboot file.
-L [tunable]             Lists the characteristics of one or all tunables, one per line.

The following is an example of the no -L command output:

NAME                      CUR    DEF    BOOT   MIN    MAX    UNIT           TYPE     DEPENDENCIES

-------------------------------------------------------------------------------------------------

General Network Parameters

-------------------------------------------------------------------------------------------------
sockthresh                85     85     85     0      100    %_of_thewall      D
-------------------------------------------------------------------------------------------------
fasttimo                  200    200    200    50     200    millisecond       D
-------------------------------------------------------------------------------------------------
inet_stack_size           16     16     16     1             kbyte             R
-------------------------------------------------------------------------------------------------
...lines omitted....


where:
CUR = current value

DEF = default value

BOOT = reboot value

MIN = minimum value

MAX = maximum value

UNIT = tunable unit of measure

TYPE = parameter type: D (for Dynamic), S (for Static), R (for Reboot), B (for Bosboot), M (for Mount),
                       I (for Incremental), and C (for Connect)

DEPENDENCIES = list of dependent tunable parameters, one per line

Some network attributes are run-time attributes that can be changed at any time. Others are load-time attributes that must be set before the netinet kernel extension is loaded.

Note:
When you use the no command to change parameters, dynamic parameters are changed in memory and the change is in effect only until the next system boot. At that point, all parameters are set to their reboot settings. To make dynamic parameter changes permanent, use the -r or -p options of the no command to set the options in the nextboot file. Reboot parameter options require a system reboot to take effect.
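
For example, to change a dynamic tunable immediately and also record the change in the nextboot file, you might run the following commands:

# no -o tcp_sendspace=262144            # takes effect now, but is lost at the next reboot
# no -p -o tcp_sendspace=262144         # also records the value in the nextboot file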

For more information on the no command, see The no Command.

Interface-Specific Network Options (ISNO)

Interface-Specific Network Options (ISNO) allows IP network interfaces to be custom-tuned for the best performance. Values set for an individual interface take precedence over the systemwide values set with the no command. The feature is enabled (the default) or disabled for the whole system with the no command use_isno option. This single-point ISNO disable option is included as a diagnostic tool to eliminate potential tuning errors if the system administrator needs to isolate performance problems.

Programmers and performance analysts should note that the ISNO values will not show up in the socket (meaning they cannot be read by the getsockopt() system call) until after the TCP connection is made. The specific network interface that a socket actually uses is not known until the connection is complete, so the socket reflects the system defaults from the no command. After the TCP connection is accepted and the network interface is known, ISNO values are put into the socket.

The following parameters have been added for each supported network interface and are only effective for TCP (and not UDP) connections: rfc1323, tcp_mssdflt, tcp_nodelay, tcp_recvspace, and tcp_sendspace.

When set for a specific interface, these values override the corresponding no option values set for the system. These parameters are available for all of the mainstream TCP/IP interfaces (Token-Ring, FDDI, 10/100 Ethernet, and Gigabit Ethernet), except the css# IP interface on the SP switch. As a simple workaround, SP switch users can set the tuning options appropriate for the switch using the systemwide no command, then use the ISNOs to set the values needed for the other system interfaces.

These options are set for the TCP/IP interface (such as en0 or tr0), and not the network adapter (ent0 or tok0).

AIX sets default values for the Gigabit Ethernet interfaces, both for MTU 1500 and for jumbo frame mode (MTU 9000). As long as you configure the interface through the SMIT tcpip screens, the ISNO options should be set to the default values, which provide good performance.

For 10/100 Ethernet and token ring adapters, the ISNO defaults are not set by the system as they typically work fine with the system global no defaults. However, the ISNO attributes can be set if needed to override the global defaults.

The following example shows the default ISNO values for tcp_sendspace and tcp_recvspace for Gigabit Ethernet in MTU 1500 mode:

# ifconfig en0
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
        inet 10.0.0.1 netmask 0xffffff00 broadcast 10.0.0.255
        tcp_sendspace 131072 tcp_recvspace 65536

For jumbo frame mode, the default ISNO values for tcp_sendspace, tcp_recvspace, and rfc1323 are set as follows:

 # ifconfig en0
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
       inet 192.0.0.1 netmask 0xffffff00 broadcast 192.0.0.255
       tcp_sendspace 262144 tcp_recvspace 131072 rfc1323 1

You can set the ISNO options using SMIT, the chdev command, or the ifconfig command.

Using SMIT or the chdev command changes the values in the ODM database on disk, so they are permanent. The ifconfig command only changes the values in memory, so they revert to the values stored in the ODM on the next reboot.

Modifying the ISNO options with SMIT

You can change the ISNO options with SMIT as follows:

# smitty tcpip
  1. Select the Further Configuration option.
  2. Select the Network Interfaces option.
  3. Select the Network Interface Selection.
  4. Select the Change/Show Characteristics of a Network Interface.
  5. Select the interface with your cursor. For example, en0

Then, you will see the following screen:

Change / Show a Standard Ethernet Interface

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

                                                     [Entry Fields]
  Network Interface Name                              en0
  INTERNET ADDRESS (dotted decimal)                  [192.0.0.1]
  Network MASK (hexadecimal or dotted decimal)       [255.255.255.0]
  Current STATE                                       up                          +
  Use Address Resolution Protocol (ARP)?              yes                         +
  BROADCAST ADDRESS (dotted decimal)                 []
  Interface Specific Network Options
    ('NULL' will unset the option)
    rfc1323                                          []
    tcp_mssdflt                                      []
    tcp_nodelay                                      []
    tcp_recvspace                                    []
    tcp_sendspace                                    []


F1=Help                       F2=Refresh                    F3=Cancel                     F4=List
Esc+5=Reset                   Esc+6=Command                 Esc+7=Edit                    Esc+8=Image
Esc+9=Shell                   Esc+0=Exit                    Enter=Do

Notice that the ISNO system defaults do not display, even though they are set internally. For this example, override the default value for tcp_sendspace and lower it to 65536.

Bring the interface back up with smitty tcpip and select Minimum Configuration and Startup. Then select en0 and accept the default values that were set when the interface was first configured.

If you use the ifconfig command to show the ISNO options, you can see that the value of the tcp_sendspace attribute is now set to 65536. The following is an example:

# ifconfig en0
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
          inet 192.0.0.1 netmask 0xffffff00 broadcast 192.0.0.255
          tcp_sendspace 65536 tcp_recvspace 65536

The lsattr command output also shows that the system default has been overridden for this attribute:

# lsattr -E -l en0
alias4                      IPv4 Alias including Subnet Mask           True
alias6                      IPv6 Alias including Prefix Length         True
arp           on            Address Resolution Protocol (ARP)          True
authority                   Authorized Users                           True
broadcast                   Broadcast Address                          True
mtu           1500          Maximum IP Packet Size for This Device     True
netaddr       192.0.0.1     Internet Address                           True
netaddr6                    IPv6 Internet Address                      True
netmask       255.255.255.0 Subnet Mask                                True
prefixlen                   Prefix Length for IPv6 Internet Address    True
remmtu        576           Maximum IP Packet Size for REMOTE Networks True
rfc1323                     Enable/Disable TCP RFC 1323 Window Scaling True
security      none          Security Level                             True
state         up            Current Interface Status                   True
tcp_mssdflt                 Set TCP Maximum Segment Size               True
tcp_nodelay                 Enable/Disable TCP_NODELAY Option          True
tcp_recvspace               Set Socket Buffer Space for Receiving      True
tcp_sendspace 65536         Set Socket Buffer Space for Sending        True

Modifying the ISNO options with the chdev and ifconfig commands

You can use the following commands to first verify system and interface support and then to set and verify the new values.
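
For example, the following sequence is a sketch (using en0 and the tcp_sendspace attribute) that verifies ISNO is enabled, sets a value, and then verifies it:

# no -a | grep use_isno                   # verify that ISNO is enabled system-wide
# lsattr -E -l en0                        # show the current interface attributes
# chdev -l en0 -a tcp_sendspace=65536     # permanent change, stored in the ODM
# ifconfig en0 tcp_sendspace 65536        # temporary change, in memory only
# ifconfig en0                            # verify the values now in effect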

TCP workload tuning

There are several AIX tunable values that can have a large impact on TCP performance. Many applications use the reliable Transmission Control Protocol (TCP), including the ftp and rcp commands.

Note:
The no -o command warns you that when you change tuning options that affect TCP/IP connections, the changes are only effective for connections that are established after the changes are made. In addition, the no -o command restarts the inetd daemon process when options are changed that might affect processes for which the inetd daemon is listening for new connections.

TCP streaming workload tuning

Streaming workloads move large amounts of data from one endpoint to the other endpoint. Examples of streaming workloads are file transfer, backup or restore workloads, or bulk data transfer. The main metric of interest in these workloads is bandwidth, but you can also look at end-to-end latency.

The primary tunables that affect TCP performance for streaming applications are tcp_sendspace, tcp_recvspace, rfc1323, and sb_max, which are described in the sections that follow.

The following table shows suggested sizes for the tunable values to obtain optimal performance, based on the type of adapter and the MTU size:

Device         Speed          MTU size   tcp_sendspace   tcp_recvspace      sb_max (note 1)   rfc1323
Token Ring     4 or 16 Mbit   1492       16384           16384              32768             0
Ethernet       10 Mbit        1500       16384           16384              32768             0
Ethernet       100 Mbit       1500       16384           16384              65536             0
Ethernet       Gigabit        1500       131072          65536              131072            0
Ethernet       Gigabit        9000       131072          65535              262144            0
Ethernet       Gigabit        9000       262144          131072 (note 2)    524288            1
ATM            155 Mbit       1500       16384           16384              131072            0
ATM            155 Mbit       9180       65535           65535 (note 3)     131072            0
ATM            155 Mbit       65527      655360          655360 (note 4)    1310720           1
FDDI           100 Mbit       4352       45056           45056              90012             0
Fiber Channel  2 Gigabit      65280      655360          655360             1310720           1
Notes:
  1. It is suggested to use the default value of 1048576 for the sb_max tunable. The values shown in the table are acceptable minimum values for the sb_max tunable.
  2. Performance is slightly better when using these options, with rfc1323 enabled, on jumbo frames on Gigabit Ethernet.
  3. Certain combinations of TCP send and receive space will result in very low throughput (1 Mbit or less). To avoid this problem, set the tcp_sendspace tunable to at least 3 times the MTU size, or to a value greater than or equal to the receiver's tcp_recvspace value.
  4. TCP has only a 16-bit value to use for its window size. This translates to a maximum window size of 65536 bytes. For adapters that have large MTU sizes (for example, 32 KB or 64 KB), TCP streaming performance might be very poor. For example, on a device with a 64 KB MTU size and with tcp_recvspace set to 64 KB, TCP can only send one packet and then its window closes. It must wait for an ACK back from the receiver before it can send again. This problem can be solved in two ways: enable the rfc1323 option so that window scaling allows a window larger than 64 KB, or use a smaller MTU size.

The following are general guidelines for tuning TCP streaming workloads:

The ftp and rcp commands are examples of TCP applications that benefit from tuning the tcp_sendspace and tcp_recvspace tunables.
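
For example, based on the table above, a Gigabit Ethernet interface running in jumbo frame mode (MTU 9000) might be tuned as follows. This is a sketch; setting the ISNO attributes on the interface keeps other interfaces at their own values:

# no -p -o sb_max=1048576                 # suggested default value from note 1
# chdev -l en0 -a rfc1323=1 -a tcp_sendspace=262144 -a tcp_recvspace=131072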

The tcp_sendspace tunable

The TCP send buffer size can limit how much data the application can send before the application is put to sleep. The TCP socket send buffer is used to buffer the application data in the kernel, using mbufs/clusters, before it is sent beyond the socket and TCP layer. The default size of this buffer is specified by the tcp_sendspace parameter, but you can use the setsockopt() subroutine to override it.

If the amount of data that the application wants to send is smaller than the send buffer size and also smaller than the maximum segment size and if TCP_NODELAY is not set, then TCP will delay up to 200 ms, until enough data exists to fill the send buffer or the amount of data is greater than or equal to the maximum segment size, before transmitting the packets.

If TCP_NODELAY is set, then the data is sent immediately (useful for request/response type of applications). If the send buffer size is less than or equal to the maximum segment size (ATM and SP switches can have 64 K MTUs), then the application's data will be sent immediately and the application must wait for an ACK before sending another packet (this prevents TCP streaming and could reduce throughput).

Note:
To maintain a steady stream of packets, increase the socket send buffer size so that it is greater than the MTU (3-10 times the MTU size could be used as a starting point).

If an application uses nonblocking I/O (that is, it specified O_NDELAY or O_NONBLOCK on the socket) and the send buffer fills up, the send call returns with an EWOULDBLOCK/EAGAIN error rather than putting the application to sleep. Applications must be coded to handle this error (the suggested solution is to sleep for a short while and try to send again).

When you are changing send/recv space values, in some cases you must stop/restart the inetd process as follows:

# stopsrc -s inetd; startsrc -s inetd
The tcp_recvspace tunable

TCP receive-buffer size limits how much data the receiving system can buffer before the application reads the data. The TCP receive buffer is used to accommodate incoming data. When the data is read by the TCP layer, TCP can send back an acknowledgment (ACK) for that packet immediately or it can delay before sending the ACK. Also, TCP tries to piggyback the ACK if a data packet was being sent back anyway. If multiple packets are coming in and can be stored in the receive buffer, TCP can acknowledge all of these packets with one ACK. Along with the ACK, TCP returns a window advertisement to the sending system telling it how much room remains in the receive buffer. If not enough room remains, the sender will be blocked until the application has read the data. Smaller values will cause the sender to block more. The size of the TCP receive buffer can be set using the setsockopt() subroutine or by the tcp_recvspace parameter.

The rfc1323 tunable

The TCP window size by default is limited to 65536 bytes (64 K) but can be set higher if rfc1323 is set to 1. If you are setting tcp_recvspace to greater than 65536, set rfc1323=1 on each side of the connection. Without having rfc1323 set on both sides, the effective value for tcp_recvspace will be 65536.

If you are sending data through adapters that have large MTU sizes (32 K or 64 K for example), TCP streaming performance may not be optimal because the packet or packets will be sent and the sender will have to wait for an acknowledgment. By enabling the rfc1323 option using the command no -o rfc1323=1, TCP's window size can be set as high as 4 GB. However, on adapters that have 64 K or larger MTUs, TCP streaming performance can be degraded if the receive buffer can only hold 64 K. If the receiving machine does not support rfc1323, then reducing the MTU size is one way to enhance streaming performance.

After setting the rfc1323 option to 1, you can increase the tcp_recvspace parameter to something much larger, such as 10 times the size of the MTU.

The sb_max tunable

This parameter controls how much buffer space is consumed by buffers that are queued to a sender's socket or to a receiver's socket. The system accounts for socket buffers used based on the size of the buffer, not on the contents of the buffer.

If a device driver puts 100 bytes of data into a 2048-byte buffer, then the system considers 2048 bytes of socket buffer space to be used. It is common for device drivers to receive data into a buffer that is large enough to hold the adapter's maximum-size packet. This often results in wasted buffer space, but it would require more CPU cycles to copy the data to smaller buffers.

Because there are so many different network device drivers, set the sb_max value considerably higher than the largest TCP or UDP socket buffer size parameter rather than making it exactly the same. After the total number of mbufs/clusters on the socket reaches the sb_max limit, no additional buffers can be queued to the socket until the application has read the data.

Note:
When you are setting buffer size parameters to larger than 64 K, you must also increase the value of sb_max, which specifies the maximum socket buffer size for any socket buffer.

One guideline is to set sb_max to at least twice the size of the largest TCP or UDP receive space.

TCP request/response workload tuning

TCP request/response workloads are workloads that involve a two-way exchange of information. Examples of request/response workloads are Remote Procedure Call (RPC) types of applications and client/server applications, like web browser requests to a web server, NFS file systems (when they use TCP for the transport protocol), or a database's lock management protocol. Such exchanges often consist of a small request and a larger response, but they might also consist of a large request and a small response.

The primary metric of interest in these workloads is the round-trip latency of the network. Many of these requests or responses use small messages, so the network bandwidth is not a major consideration.

Hardware has a major impact on latency. For example, the type of network, the type and performance of any network switches or routers, the speed of the processors used in each node of the network, and the adapter and bus latencies all affect the round-trip time.

Tuning options that provide minimum latency (best response time) typically cause higher CPU overhead, because the system sends more packets and takes more interrupts in order to minimize latency and response time. These are classic performance trade-offs.

Primary tunables for request/response applications are the following:

Note:
Some request/response workloads involve large amounts of data in one direction. Such workloads might need to be tuned for a combination of streaming and latency, depending on the workload.

UDP Tuning

User Datagram Protocol (UDP) is a datagram protocol that is used by Network File System (NFS), name server (named), Trivial File Transfer Protocol (TFTP), and other special purpose protocols.

Since UDP is a datagram protocol, the entire message (datagram) must be copied into the kernel on a send operation as one atomic operation. The datagram is also received as one complete message on the recv or recvfrom system call. You must set the udp_sendspace and udp_recvspace parameters to handle the buffering requirements on a per-socket basis.

The largest UDP datagram that can be sent is 64 KB, minus the UDP header size (8 bytes) and the IP header size (20 bytes for IPv4 or 40 bytes for IPv6 headers).

The udp_sendspace and udp_recvspace tunables, described in the sections that follow, affect UDP performance.

The udp_sendspace tunable

Set this parameter to 65536, because any value greater than 65536 is ineffective. Because UDP transmits a packet as soon as it gets any data, and because IP has an upper limit of 65536 bytes per packet, anything beyond 65536 runs the small risk of being discarded by IP. The IP protocol will fragment the datagram into smaller packets if needed, based on the MTU size of the interface that the packet will be sent on. For example, if an 8 KB datagram is sent over Ethernet, IP fragments it into 1500-byte packets. Because UDP does not implement any flow control, all packets given to UDP are passed to IP (where they may be fragmented) and then placed directly on the device driver's transmit queue.

The udp_recvspace tunable

On the receive side, the incoming datagram (or fragment if the datagram is larger than the MTU size) will first be received into a buffer by the device driver. This will typically go into a buffer that is large enough to hold the largest possible packet from this device.

The setting of udp_recvspace is harder to compute because it varies by network adapter type, UDP sizes, and number of datagrams queued to the socket. Set the udp_recvspace larger rather than smaller, because packets will be discarded if it is too small.

For example, Ethernet might use 2 KB receive buffers. Even if the incoming packet is the maximum MTU size of 1500 bytes, it will only use 73 percent of the buffer. IP queues the incoming fragments until a full UDP datagram is received and then passes it to UDP. UDP puts the incoming datagram on the receiver's socket. However, if the total buffer space in use on this socket exceeds udp_recvspace, then the entire datagram is discarded. This is indicated in the output of the netstat -s command as dropped due to full socket buffers errors.

Because the communication subsystem accounts for buffers used, and not the contents of the buffers, you must account for this when setting udp_recvspace. In the above example, the 8 K datagram would be fragmented into 6 packets which would use 6 receive buffers. These will be 2048 byte buffers for Ethernet. So, the total amount of socket buffer consumed by this one 8 K datagram is as follows:

6*2048=12,288 bytes

Thus, you can see that the udp_recvspace must be adjusted higher depending on how efficient the incoming buffering is. This will vary by datagram size and by device driver. Sending a 64 byte datagram would consume a 2 K buffer for each 64 byte datagram.

Then, you must account for the number of datagrams that may be queued onto this one socket. For example, an NFS server receives UDP packets at one well-known socket from all clients. If the queue depth of this socket could be 30 packets, then you would use 30 * 12,288 = 368,640 for udp_recvspace if NFS is using 8 KB datagrams. NFS Version 3 allows up to 32 KB datagrams.

A suggested starting value for udp_recvspace is 10 times the value of udp_sendspace, because UDP may not be able to pass a packet to the application before another one arrives. Also, several nodes can send to one node at the same time. To provide some staging space, this size is set to allow 10 packets to be staged before subsequent packets are discarded. For large parallel applications using UDP, the value may have to be increased.

Note:
The value of sb_max, which specifies the maximum socket buffer size for any socket buffer, should be at least twice the size of the largest of the UDP and TCP send and receive buffers.
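
For example, a starting point based on the guidelines above might look like the following (a sketch; increase udp_recvspace further for large parallel UDP workloads):

# no -p -o udp_sendspace=65536            # the largest useful value
# no -p -o udp_recvspace=655360           # roughly 10 times udp_sendspace
# no -p -o sb_max=1310720                 # at least twice the largest socket buffer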

UDP packet chaining

When UDP datagrams to be transmitted are larger than the adapter's MTU size, the IP protocol layer fragments the datagram into MTU-sized fragments. Ethernet interfaces include a UDP packet chaining feature, which is enabled by default in AIX.

UDP packet chaining causes IP to build the entire chain of fragments and pass that chain down to the Ethernet device driver in one call. This improves performance by reducing the number of calls down through the ARP and interface layers and to the driver. It also reduces lock and unlock calls in SMP environments and helps the cache affinity of the code loops. These changes reduce the CPU utilization of the sender.

You can view the UDP packet chaining option with the ifconfig command. The following example shows the ifconfig command output for the en0 interface, where the CHAIN flag indicates that packet chaining is enabled:

# ifconfig en0
en0: flags=5e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
         inet 192.1.6.1 netmask 0xffffff00 broadcast 192.1.6.255
         tcp_sendspace 65536 tcp_recvspace 65536 tcp_nodelay 1

Packet chaining can be disabled by the following command:

# ifconfig en0 -pktchain

# ifconfig en0
en0: flags=5e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG>
         inet 192.1.6.1 netmask 0xffffff00 broadcast 192.1.6.255
         tcp_sendspace 65536 tcp_recvspace 65536 tcp_nodelay 1

Packet chaining can be re-enabled with the following command:

# ifconfig en0 pktchain

# ifconfig en0
en0: flags=5e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
         inet 192.1.6.1 netmask 0xffffff00 broadcast 192.1.6.255
         tcp_sendspace 65536 tcp_recvspace 65536 tcp_nodelay 1

Adapter Transmit and Receive Queue Tuning

Most communication drivers provide a set of tunable parameters to control transmit and receive resources. These parameters typically control the transmit queue and receive queue limits, but may also control the number and size of buffers or other resources. These parameters limit the number of buffers or packets that may be queued for transmit or limit the number of receive buffers that are available for receiving packets. These parameters can be tuned to ensure enough queueing at the adapter level to handle the peak loads generated by the system or the network.

Following are some general guidelines:

Transmit Queues

For transmit, the device drivers may provide a transmit queue limit. There may be both hardware queue and software queue limits, depending on the driver and adapter. Some drivers have only a hardware queue; some have both hardware and software queues. Some drivers internally control the hardware queue and only allow the software queue limits to be modified. Generally, the device driver will queue a transmit packet directly to the adapter hardware queue. If the system CPU is fast relative to the speed of the network, or on an SMP system, the system may produce transmit packets faster than they can be transmitted on the network. This will cause the hardware queue to fill. After the hardware queue is full, some drivers provide a software queue and they will then queue to the software queue. If the software transmit queue limit is reached, then the transmit packets are discarded. This can affect performance because the upper-level protocols must then time out and retransmit the packet.

The transmit queue limits on most of the current device drivers allow up to 2048 buffers. The default values have also been increased to 512 for most of these drivers, because faster CPUs and SMP systems can overrun the smaller queue limits.

Following are examples of PCI adapter transmit queue sizes:

PCI Adapter Type   Default             Range
Ethernet           64                  16 - 256
10/100 Ethernet    256, 512, or 2048   16 - 16384
Token-Ring         96, 512, or 2048    32 - 16384
FDDI               30 or 2048          3 - 16384
155 ATM            100 or 2048         0 - 16384

For adapters that provide hardware queue limits, changing these values causes more real memory to be consumed because of the associated control blocks and buffers. Therefore, raise these limits only if needed, or for larger systems where the increase in memory use is negligible. For the software transmit queue limits, increasing these limits does not increase memory usage. It only allows packets to be queued that were already allocated by the higher-layer protocols.

Receive Queues

Some adapters allow you to configure the number of resources used for receiving packets from the network. This might include the number of receive buffers (and even their size) or may be a receive queue parameter (which indirectly controls the number of receive buffers).

The receive resources might need to be increased to handle peak bursts on the network. The network interface device driver places incoming packets on a receive queue. If the receive queue is full, packets are dropped and lost, and the sender must retransmit them. The receive queue is tunable using the SMIT or chdev commands (see How to Change the Parameters). The maximum queue size is specific to each type of communication adapter (see Tuning PCI Adapters).

For the Micro Channel adapters and the PCI adapters, receive queue parameters typically control the number of receive buffers that are provided to the adapter for receiving input packets.

Device-Specific Buffers

AIX 4.1.4 and later support device-specific mbufs. This allows a driver to allocate its own private set of buffers and have them pre-setup for Direct Memory Access (DMA). This can provide additional performance because the overhead to set up the DMA mapping is done one time. Also, the adapter can allocate buffer sizes that are best suited to its MTU size. For example, ATM, High Performance Parallel Interface (HIPPI), and the SP switch support a 64 K MTU (packet) size. The maximum system mbuf size is 16 KB. By allowing the adapter to have 64 KB buffers, large 64 K writes from applications can be copied directly into the 64 KB buffers owned by the adapter, instead of copying them into multiple 16 K buffers (which has more overhead to allocate and free the extra buffers).

Device-specific buffers add an extra layer of complexity for the system administrator. The system administrator must use device-specific commands to view the statistics relating to the adapter's buffers and then change the adapter's parameters as necessary. If the statistics indicate that packets were discarded because not enough buffer resources were available, then those buffer sizes need to be increased.

Increasing the receive/transmit queue parameters

Following are some guidelines to help you determine when to increase the receive/transmit queue parameters:

  1. When the CPU is much faster than the network and multiple applications may be using the same network. This would be common on a larger multi-processor system (SMP).
  2. When running with large values for tcp_sendspace or tcp_recvspace as set in the no options or running applications that might use system calls to increase the TCP send and receive socket buffer space. These large values can cause the CPU to send down large numbers of packets to the adapter, which will need to be queued. Procedures are similar for udp_sendspace and udp_recvspace for UDP applications.
  3. When there is very bursty traffic.
  4. When there is a high-traffic load of small packets. A high-traffic load of small packets can consume more resources than a high-traffic load of large packets, because large packets take more time to send on the network and the packet rate is therefore lower for larger packets.

Commands to query and change the queue parameters

Several status utilities can be used to show the transmit queue high-water limits and number of queue overflows. You can use the command netstat -v, or go directly to the adapter statistics utilities (entstat for Ethernet, tokstat for Token-Ring, fddistat for FDDI, atmstat for ATM, and so on).

For an entstat example output, see The entstat Command. Another method is to use the netstat -i utility. If it shows non-zero counts in the Oerrs column for an interface, then this is typically the result of output queue overflows.
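
For example, the following commands give a quick view of queue-related counters (the exact counter names vary by adapter and statistics utility):

# netstat -i                              # non-zero Oerrs usually indicates output queue overflows
# entstat -d ent0 | grep -i queue         # transmit queue high-water marks and overflow counts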

Viewing the network adapter settings

You can use the lsattr -E -l adapter-name command or you can use the SMIT command (smitty commodev) to show the adapter configuration.

Different adapters have different names for these variables. For example, they may be named sw_txq_size, tx_que_size, or xmt_que_size for the transmit queue parameter. The receive queue size and receive buffer pool parameters may be named rec_que_size, rx_que_size, or rv_buf4k_min for example.

Following is the output of an lsattr -E -l atm0 command on an IBM PCI 155 Mbps ATM adapter. This output shows the sw_txq_size attribute set to 250 and the rv_buf4k_min receive buffers attribute set to 0x30.

# lsattr -E -l atm0
dma_mem        0x400000    N/A                                          False
regmem         0x1ff88000  Bus Memory address of Adapter Registers      False
virtmem        0x1ff90000  Bus Memory address of Adapter Virtual Memory False
busintr        3           Bus Interrupt Level                          False
intr_priority  3           Interrupt Priority                           False
use_alt_addr   no          Enable ALTERNATE ATM MAC address             True
alt_addr       0x0           ALTERNATE ATM MAC address (12 hex digits)  True
sw_txq_size  250         Software Transmit Queue size                 True
max_vc         1024        Maximum Number of VCs Needed                 True
min_vc         32          Minimum Guaranteed VCs Supported             True
rv_buf4k_min 0x30        Minimum 4K-byte pre-mapped receive buffers   True
interface_type 0           Sonet or SDH interface                       True
adapter_clock  1           Provide SONET Clock                          True
uni_vers       auto_detect N/A                                          True

Following is an example of the settings of a Micro Channel 10/100 Ethernet adapter using the lsattr -E -l ent0 command. This output shows the tx_que_size and rx_que_size attributes both set to 256.

# lsattr -E -l ent0
bus_intr_lvl  11              Bus interrupt level                False
intr_priority 3               Interrupt priority                 False
dma_bus_mem   0x7a0000        Address of bus memory used for DMA False
bus_io_addr   0x2000          Bus I/O address                    False
dma_lvl       7               DMA arbitration level              False
tx_que_size 256             TRANSMIT queue size                True
rx_que_size 256             RECEIVE queue size                 True
use_alt_addr  no              Enable ALTERNATE ETHERNET address  True
alt_addr      0x              ALTERNATE ETHERNET address         True
media_speed   100_Full_Duplex Media Speed                        True
ip_gap        96              Inter-Packet Gap                   True

Tuning PCI adapters

The information in this section documents the various adapter tuning parameters. It is provided to help you understand the parameters, or to serve as a reference when a system is not available to view them.

These parameter names, defaults, and range values were obtained from the ODM database. The comment field was obtained from the lsattr -E -l interface-name command.

The Notes field provides additional comments.

PCI Adapters

Feature Code: 2985
IBM PCI Ethernet Adapter (22100020)

Parameter      Default  Range            Comment             Notes
------------- -------- ----------------- ------------------- ---------
tx_que_size   64       16,32,64,128,256  TRANSMIT queue size HW Queues
rx_que_size   32       16,32,64,128,256  RECEIVE queue size  HW Queues


Feature Code: 2968
IBM 10/100 Mbps Ethernet PCI Adapter (23100020)

Parameter        Default Range            Comment               Notes
---------------- ------- ---------------- --------------------- --------------------
tx_que_size      256     16,32,64,128,256 TRANSMIT queue size   HW Queue Note 1
rx_que_size      256     16,32,64,128,256 RECEIVE queue size    HW Queue Note 2
rxbuf_pool_size  384     16-2048          # buffers in receive  Dedicat. receive
                                          buffer pool           buffers Note 3


Feature Code: 2969
Gigabit Ethernet-SX PCI Adapter (14100401)

Parameter     Default Range    Comment                             Notes
------------- ------- -------- ----------------------------------- ---------
tx_que_size   512     512-2048 Software Transmit Queue size        SW Queue
rx_que_size   512     512      Receive queue size                  HW Queue
receive_proc  6       0-128    Minimum Receive Buffer descriptors


Feature Code: 2986
3Com 3C905-TX-IBM Fast EtherLink XL NIC

Parameter      Default  Range  Comment                      Notes
-------------- -------- ------ ---------------------------- ----------
tx_wait_q_size 32       4-128  Driver TX Waiting Queue Size HW Queues
rx_wait_q_size 32       4-128  Driver RX Waiting Queue Size HW Queues


Feature Code: 2742
SysKonnect PCI FDDI Adapter (48110040)

Parameter     Default  Range    Comment             Notes
------------- -------- -------- ------------------- ---------------
tx_queue_size 30       3-250    Transmit Queue Size SW Queue
RX_buffer_cnt 42       1-128    Receive frame count Rcv buffer pool


Feature Code: 2979
IBM PCI Tokenring Adapter (14101800)

Parameter     Default  Range   Comment                     Notes
------------- -------- ------- --------------------------- --------
xmt_que_size  96       32-2048 TRANSMIT queue size         SW Queue
rx_que_size   32       32-160  HARDWARE RECEIVE queue size HW queue


Feature Code: 2979
IBM PCI Tokenring Adapter (14103e00)

Parameter     Default  Range    Comment              Notes
------------- -------- -------- -------------------- --------
xmt_que_size  512      32-2048  TRANSMIT queue size  SW Queue
rx_que_size   64       32-512   RECEIVE queue size   HW Queue


Feature Code: 2988
IBM PCI 155 Mbps ATM Adapter (14107c00)

Parameter     Default   Range        Comment                          Notes
------------- --------- ------------ -------------------------------- --------
sw_txq_size   100       0-4096       Software Transmit Queue size     SW Queue
rv_buf4k_min  48 (0x30) 0-512 (x200) Minimum 4K-byte pre-mapped receive buffers

Notes on the IBM 10/100 Mbps Ethernet PCI Adapter:

  1. Prior to AIX 4.3.2, the default tx_que_size was 64.
  2. Prior to AIX 4.3.2, the default rx_que_size was 32.
  3. In AIX 4.3.2 and later, the driver added a new parameter to control the number of buffers dedicated to receiving packets.

Enabling thread usage on LAN adapters (dog threads)

Drivers, by default, call IP directly, which calls up the protocol stack to the socket level while running on the interrupt level. This minimizes instruction path length, but increases the interrupt hold time. On an SMP system, a single CPU can become the bottleneck for receiving packets from a fast adapter. By enabling the dog threads, the driver queues the incoming packet to the thread and the thread handles calling IP, TCP, and the socket code. The thread can run on other CPUs which may be idle. Enabling the dog threads can increase capacity of the system in some cases.

Note:
This feature is not supported on uniprocessors, because it would only add path length and slow down performance.

This is a feature for the input side (receive) of LAN adapters. It can be configured at the interface level with the ifconfig command (ifconfig interface thread or ifconfig interface hostname up thread).

To disable the feature, use the ifconfig interface -thread command.
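
For example, assuming the interface is en0, a minimal sequence to turn the feature on and then back off is:

# ifconfig en0 thread
# ifconfig en0 -thread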

Guidelines when considering using dog threads are as follows:

Changing network parameters

The following are some of the network parameters that are user-configurable:

To change any of the parameter values, do the following:

  1. Detach the interface by running the following command:

    # ifconfig en0 detach

    where en0 represents the interface name.

  2. Use SMIT to display the adapter settings. Select Devices -> Communications -> adapter type -> Change/Show...
  3. Move the cursor to the field you want to change, and press F4 to see the minimum and maximum ranges for the field (or the specific set of sizes that are supported).
  4. Select the appropriate size, and press Enter to update the ODM database.
  5. Reattach the interface by running the following command:

    # ifconfig en0 hostname up

An alternative method to change these parameter values is to run the chdev command against the adapter device:

# chdev -l [adapter-name] -a [attribute-name]=[new-value]

For example, to change the tx_que_size attribute on adapter ent0 (interface en0) to 128, use the following sequence of commands. Note that this driver supports only a limited set of discrete sizes, so it is better to use SMIT to see the valid values.

# ifconfig en0 detach
# chdev -l ent0 -a tx_que_size=128
# ifconfig en0 hostname up
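
To verify the new value from the command line, or to list the values the driver accepts, you can also use the lsattr command (shown here for the same ent0 adapter as in the example above):

# lsattr -E -l ent0 -a tx_que_size
# lsattr -R -l ent0 -a tx_que_size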

TCP MSS tuning

The maximum size of the packets that TCP sends can have a major impact on bandwidth, because it is more efficient to send the largest possible packet size on the network. TCP controls this maximum size, known as the Maximum Segment Size (MSS), for each TCP connection. For directly attached networks, TCP computes the MSS by using the MTU size of the network interface and then subtracting the protocol headers to arrive at the size of the data in the TCP packet. For example, Ethernet with an MTU of 1500 would result in an MSS of 1460 after subtracting 20 bytes for the IPv4 header and 20 bytes for the TCP header.

The TCP protocol includes a mechanism for both ends of a connection to advertise the MSS to be used over the connection when the connection is created. Each end uses the OPTIONS field in the TCP header to advertise a proposed MSS. The MSS that is chosen is the smaller of the values provided by the two ends. If one endpoint does not provide its MSS, then 536 bytes is assumed, which is bad for performance.

The problem is that each TCP endpoint only knows the MTU of the network to which it is attached. It does not know the MTU size of other networks that might lie between the two endpoints. So, TCP only knows the correct MSS if both endpoints are on the same network. Therefore, TCP handles the advertising of the MSS differently depending on the network configuration, in order to avoid sending packets that might require IP fragmentation to cross a smaller-MTU network.

The value of MSS advertised by the TCP software during connection setup depends on whether the other end is a local system on the same physical network (that is, the systems have the same network number) or whether it is on a different (remote) network.

Hosts on the same network

If the other end of the connection is on the same IP network, the MSS advertised by TCP is based on the MTU of the local network interface, as follows:

TCP MSS = MTU - TCP header size - IP header size

The TCP header size is 20 bytes, the IPv4 header size is 20 bytes, and the IPv6 header size is 40 bytes.
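
For example, an Ethernet interface with an MTU of 1500 bytes yields an MSS of 1500 - 20 - 20 = 1460 bytes for an IPv4 connection, or 1500 - 20 - 40 = 1440 bytes for IPv6.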

Because this is the largest possible MSS that can be accommodated without IP fragmentation, this value is inherently optimal, so no MSS-tuning is required for local networks.

Hosts on different networks

When the other end of the connection is on a remote network, the operating system's TCP advertises an MSS that is determined by one of the methods below. Which method is used depends on whether TCP path MTU discovery is enabled. If path MTU discovery is not enabled (tcp_pmtu_discover=0), TCP determines what MSS to use in the following order:

  1. If the route add command specified an MTU size for this route, the MSS is computed from this MTU size.
  2. If the tcp_mssdflt interface-specific network option (ISNO) is defined for the network interface being used, that tcp_mssdflt value is used for the MSS.
  3. If neither of the above is defined, TCP uses the global tcp_mssdflt tunable value of the no command. The default value for this option is 512 bytes.

TCP path MTU discovery

The TCP path MTU discovery protocol option is enabled by default in AIX. This option is controlled by the tcp_pmtu_discover=1 network option. This option allows the protocol stack to determine the minimum MTU size on any network that is currently in the path between two hosts.
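
For example, the current setting can be displayed, and the option re-enabled if it has been turned off, with the no command (a minimal sketch; the option is on by default):

# no -o tcp_pmtu_discover
# no -o tcp_pmtu_discover=1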

In AIX, this option is implemented using ICMP echo messages sent from the source host to the destination host for a specific route. The initial echo message for a route has a size equal to the MTU size of the sending interface and has the Don't Fragment (DF) bit set in the IP header. If this packet reaches a network router that has an MTU smaller than the size of the echo message, an error packet is sent back indicating that the message cannot be forwarded because it cannot be fragmented.

If the router sending the error packet complies with RFC 1191, the network's MTU is contained in the ICMP error packet, and the source host tries again with an echo message of this smaller size. Otherwise, the sending host must guess a smaller MTU size for the next ICMP echo message; it does so from a table of values within the AIX TCP/IP kernel extension. When a valid echo response is finally received from the destination host, the MTU size is saved in a cloned route in the routing table for this route. TCP uses this value to compute the MSS for the next TCP connection on this route. The first TCP connection runs with the default tcp_mssdflt value (normally 512 bytes, or the smaller of the two values advertised by the endpoints).

Once the first TCP connection is open and a cloned route is created, subsequent TCP connections pick up the PMTU value from the cloned route entry. This can be seen under the PMTU column in the netstat -r command output. Also, for these connections, TCP will set the Don't Fragment (DF) bit in the IP header so it will be informed of any changes in the network topology. The Refs column of the netstat -r command report shows the use count, which is the number of TCP connections using this cloned route to this remote host. The cloned route remains in the routing table until this Refs count goes to zero. Then, after one minute, the cloned route will be deleted if the route_expire network option is enabled (set to "1", which is the AIX default). This purging of the cloned routes keeps the route table from getting too large.

A typical side effect of enabling PMTU discovery is that the routing table becomes larger, because a cloned route entry is maintained for each remote host. The route_expire option of the no command should be set to a non-zero value so that unused cached route entries are removed from the table after a period of inactivity. The default is route_expire=1, which purges expired routes from the routing table. Once a cloned route is deleted, any new TCP connection request to that host starts the path MTU discovery and route cloning process over again. If that rediscovery overhead is a concern, it can be avoided by disabling the route_expire option, at the cost of a larger routing table.

To avoid the problem of the first TCP connection ending up with an MSS of 512 bytes, the tcp_mssdflt option should be set to 1460 bytes (or to the smallest MTU in your network minus 40 bytes for the IPv4 and TCP headers). Most networks use Ethernet, so a value of 1460 is a good choice. The worst case, if a smaller-MTU network is encountered, is that IP has to fragment the TCP packets before crossing that network. While fragmentation of TCP packets is not desirable, it does work.
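
For example, the following is a sketch of raising the global default with the no command and, on an interface that supports the ISNO options, setting the same attribute for that interface only (the en0 name is only an illustration):

# no -o tcp_mssdflt=1460
# chdev -l en0 -a tcp_mssdflt=1460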

You can use the netstat -ao command to show all the TCP connections; for each socket in the ESTABLISHED state, the MSS field shows the MSS value in use for that connection.

The following is an example of the netstat -r command:

Routing tables
Destination      Gateway           Flags   Refs     Use  If   PMTU Exp Groups

Route tree for Protocol Family 2 (Internet):
default          res101141         UGc       0        0  en1     -   -
ausdns01.srv.ibm res101141         UGHW      1      225  en1  1500   -
10.1.14.0        server1           UHSb      0        0  en1     -   -  =>
10.1.14/24       server1           U         6     2228  en1     -   -
server1          loopback          UGHS      6      111  lo0     -   -
10.1.14.255      server1           UHSb      0        0  en1     -   -
127/8            loopback          U         7       17  lo0     -   -
192.1.0/24       en1host2          UGc       0        0  en0     -   -
en0host1         en1host2          UGHW      2      109  en0  1500   -
192.1.1.0        en1host1          UHSb      0        0  en0     -   -  =>
192.1.1/24       en1host1          U         2        2  en0     -   -
en1host1         loopback          UGHS      0        2  lo0     -   -
192.1.1.255      en1host1          UHSb      0        0  en0     -   -

The above routing table shows a cloned route entry for en0host1, with a Refs count of 2 and a PMTU size of 1500. Thus, there are two TCP connections to this remote host and the path discovery has set the PMTU size to 1500 bytes so TCP uses a MSS of 1460. The route is cloned from the 192.1.0/24 entry just above it. The c in that entry shows it is cloneable.

Static routes

The default MSS of 512 bytes can be overridden by specifying a static route to a specific remote network. Use the -mtu option of the route command to specify the MTU to that network. In this case, you specify the actual minimum MTU of the route, rather than calculating an MSS value. For example, the following command sets the MTU size to 1500 for the route to network 192.3.3 through gateway en0host2:

# route add -net 192.3.3 en0host2 -mtu 1500
1500 net 192.3.3: gateway en0host2

The netstat -r command displays the routing table and shows that the PMTU size for this route is 1500 bytes. TCP computes the MSS from that MTU size (1500 - 40 = 1460 bytes for IPv4). The following is an example of the netstat -r command:

# netstat -r
Routing tables
Destination      Gateway           Flags   Refs     Use  If   PMTU Exp Groups

Route tree for Protocol Family 2 (Internet):
default          res101141         UGc       0        0  en4     -   -
ausdns01.srv.ibm res101141         UGHW      8       40  en4  1500   -
10.1.14.0        server1           UHSb      0        0  en4     -   -  =>
10.1.14/24       server1           U         5     4043  en4     -   -
server1          loopback          UGHS      0      125  lo0     -   -
10.1.14.255      server1           UHSb      0        0  en4     -   -
127/8            loopback          U         2  1451769  lo0     -   -
192.1.0.0        en0host1          UHSb      0        0  en0     -   -  =>
192.1.0/24       en0host1          U         4       13  en0     -   -
en0host1         loopback          UGHS      0        2  lo0     -   -
192.1.0.255      en0host1          UHSb      0        0  en0     -   -
192.1.1/24       en0host2          UGc       0        0  en0     -   -
en1host1         en0host2          UGHW      1   143474  en0  1500   -
192.3.3/24       en0host2          UGc       0        0  en0  1500   -
192.6.0/24       en0host2          UGc       0        0  en0     -   -

Route tree for Protocol Family 24 (Internet v6):
loopbackv6       loopbackv6        UH        0        0  lo0 16896   -

In a small, stable environment, this method allows precise control of MSS on a network-by-network basis. The disadvantages of this approach are as follows:

Use of the tcp_mssdflt option of the no command

This parameter is used to set the maximum segment size used for communication with remote networks. The global tcp_mssdflt option of the no command applies to all networks. However, for network interfaces that support the ISNO options, you can set the tcp_mssdflt option on each of those interfaces; that value then overrides the global no command value for connections over that interface.

The tcp_mssdflt option is the TCP MSS size, which represents the TCP data size. To compute this MSS size, take the desired network MTU size and subtract 40 bytes from it (20 for the IP header and 20 for the TCP header). There is no need to adjust for other protocol options; TCP handles that adjustment itself if other options, such as the rfc1323 option, are used.
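
For example, if every link in the path uses Gigabit Ethernet jumbo frames with an MTU of 9000 bytes (an assumed value used only for illustration), the corresponding MSS is 9000 - 40 = 8960 bytes, which could be set globally as follows:

# no -o tcp_mssdflt=8960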

In an environment with a larger-than-default MTU, this method has the advantage that the MSS does not need to be set on a per-network basis. The disadvantages are as follows:

Subnetting and the subnetsarelocal option of the no command

You can use the subnetsarelocal option of the no command to control when TCP considers a remote endpoint to be local (on the same network) or remote. Several physical networks can be made to share the same network number by subnetting. The subnetsarelocal option specifies, on a system-wide basis, whether subnets are to be considered local or remote networks. With the no -o subnetsarelocal=1 command, which is the default, Host A on subnet 1 considers Host B on subnet 2 to be on the same physical network.

The consequence is that when Host A and Host B establish a connection, they negotiate the MSS assuming they are on the same network. Each host advertises an MSS based on the MTU of its network interface, usually leading to an optimal MSS being chosen.
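
For example, to display the current setting, or to change it so that subnets with different subnet numbers are treated as remote networks, use the no command:

# no -o subnetsarelocal
# no -o subnetsarelocal=0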

The advantages to this approach are as follows:

The disadvantages to this approach are as follows:

IP protocol performance tuning recommendations

At the IP layer, the only tunable parameter is ipqmaxlen, which controls the length of the IP input queue discussed in IP Layer. In general, interfaces do not do queuing. Packets can arrive very quickly and overrun the IP input queue. You can use the netstat -s or netstat -p ip command to view an overflow counter (ipintrq overflows).
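
For example, one way to check the counter from the command line is to filter the statistics output (the grep pattern assumes the counter label shown above):

# netstat -p ip | grep "ipintrq overflows"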

If the number returned is greater than 0, overflows have occurred. Use the no command to set the maximum length of this queue. For example:

# no -o ipqmaxlen=100

This example allows 100 packets to be queued up. The exact value to use is determined by the maximum burst rate received. If this cannot be determined, using the number of overflows can help determine what the increase should be. No additional memory is used by increasing the queue length. However, an increase may result in more time spent in the off-level interrupt handler, because IP will have more packets to process on its input queue. This could adversely affect processes needing CPU time. The tradeoff is reduced packet-dropping versus CPU availability for other processing. It is best to increase ipqmaxlen by moderate increments if the tradeoff is a concern in your environment.

[ Top of Page | Previous Page | Next Page | Contents | Index | Library Home | Legal | Search ]