The optimal settings of the tunable communications parameters vary with the type of LAN, as well as with the communications-I/O characteristics of the predominant system and application programs. This section describes the global principles of communications tuning for AIX.
Use the following outline for verifying and tuning a network installation and workload:
Network performance depends on the hardware you select, such as the adapter type, and on where the adapter is placed in the machine. For best performance, place each network adapter in the I/O bus slot that is best suited to it.
Consider the following items:
The higher the bandwidth or data rate of the adapter, the more critical the slot placement. For example, PCI-X adapters perform best when used in PCI-X slots, as they typically run at 133 MHz clock speed on the bus. You can place PCI-X adapters in PCI slots, but they run slower on the bus, typically at 33 MHz or 66 MHz, and do not perform as well on some workloads.
Similarly, 64-bit adapters work best when installed in 64-bit slots. You can place 64-bit adapters in a 32-bit slot, but they do not perform at optimal rates. Large MTU adapters, like Gigabit Ethernet in jumbo frame mode, perform much better in 64-bit slots.
Another issue that can affect performance is the number of adapters per bus or per PCI host bridge (PHB). Depending on the system model and the adapter type, the number of high-speed adapters may be limited per PHB. The placement guidelines ensure that the adapters are spread across the various PCI buses and might limit the number of adapters per PCI bus. Consult the PCI Adapter Placement Reference for more information by machine model and adapter type.
The following table lists the types of PCI and PCI-X slots available in IBM pSeries eServers:
| Slot type | Code used in this topic |
| PCI 32-bit 33 MHz | A |
| PCI 32-bit 50/66 MHz | B |
| PCI 64-bit 33 MHz | C |
| PCI 64-bit 50/66 MHz | D |
| PCI-X 32-bit 33 MHz | E |
| PCI-X 32-bit 66 MHz | F |
| PCI-X 64-bit 33 MHz | G |
| PCI-X 64-bit 66 MHz | H |
| PCI-X 64-bit 133 MHz | I |
The newer IBM pSeries servers only have PCI-X slots. The PCI-X slots are backwards-compatible with the PCI adapters.
The following table shows examples of common adapters and the suggested slot types:
| Adapter type | Preferred slot type (lowest to highest priority) |
| 10/100 Mbps Ethernet PCI Adapter II (10/100 Ethernet), FC 4962 | A-I |
| IBM PCI 155 Mbps ATM adapter, FC 4953 or 4957 | D, H, and I |
| IBM PCI 622 Mbps MMF ATM adapter, FC 2946 | D, G, H, and I |
| Gigabit Ethernet-SX PCI Adapter, FC 2969 | D, G, H, and I |
| IBM 10/100/1000 Base-T Ethernet PCI Adapter, FC 2975 | D, G, H, and I |
| Gigabit Ethernet-SX PCI-X Adapter (Gigabit Ethernet fiber), FC 5700 | G, H, and I |
| 10/100/1000 Base-TX PCI-X Adapter (Gigabit Ethernet), FC 5701 | G, H, and I |
| 2-Port Gigabit Ethernet-SX PCI-X Adapter (Gigabit Ethernet fiber), FC 5707 | G, H, and I |
| 2-Port 10/100/1000 Base-TX PCI-X Adapter (Gigabit Ethernet), FC 5706 | G, H, and I |
The lsslot -c pci command provides the following information:
The following is an example of the lsslot -c pci command on a 2-way p615 system with 6 internal slots:
# lsslot -c pci
# Slot      Description                            Device(s)
U0.1-P1-I1  PCI-X capable, 64 bit, 133 MHz slot    fcs0
U0.1-P1-I2  PCI-X capable, 32 bit, 66 MHz slot     Empty
U0.1-P1-I3  PCI-X capable, 32 bit, 66 MHz slot     Empty
U0.1-P1-I4  PCI-X capable, 64 bit, 133 MHz slot    fcs1
U0.1-P1-I5  PCI-X capable, 64 bit, 133 MHz slot    ent0
U0.1-P1-I6  PCI-X capable, 64 bit, 133 MHz slot    ent2
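The slot descriptions in this output can be mapped back to the slot codes in the table above. The following is a minimal sketch, not an AIX tool: the function and dictionary names are hypothetical, the 50/66 MHz slot types are keyed as 66 MHz for simplicity, and the parser assumes descriptions in the same "NN bit, NNN MHz" form shown in the example.

```python
import re

# Slot codes from the slot-type table; "50/66 MHz" rows are keyed as 66 (assumption).
SLOT_CODES = {
    ("PCI", 32, 33): "A", ("PCI", 32, 66): "B",
    ("PCI", 64, 33): "C", ("PCI", 64, 66): "D",
    ("PCI-X", 32, 33): "E", ("PCI-X", 32, 66): "F",
    ("PCI-X", 64, 33): "G", ("PCI-X", 64, 66): "H",
    ("PCI-X", 64, 133): "I",
}

# Preferred slot codes for one adapter, copied from the adapter table.
PREFERRED = {"FC 5701": {"G", "H", "I"}}


def slot_code(description):
    """Map a description like 'PCI-X capable, 64 bit, 133 MHz slot' to a code."""
    match = re.search(r"(\d+) bit, (\d+) MHz", description)
    if match is None:
        return None  # unrecognized description format
    bus = "PCI-X" if "PCI-X" in description else "PCI"
    return SLOT_CODES.get((bus, int(match.group(1)), int(match.group(2))))


for desc in ("PCI-X capable, 64 bit, 133 MHz slot",
             "PCI-X capable, 32 bit, 66 MHz slot"):
    code = slot_code(desc)
    print(desc, "->", code, "preferred for FC 5701:", code in PREFERRED["FC 5701"])
```

In the p615 example, slots I1, I4, I5, and I6 map to code I, which is in the preferred list for the FC 5701 adapter, while I2 and I3 map to code F, which is not.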
For a Gigabit Ethernet adapter, the adapter-specific statistics at the end of the entstat -d en[interface-number] command output or the netstat -v command output show the PCI bus type and bus speed of the adapter. The following is an example output of the netstat -v command:
# netstat -v
10/100/1000 Base-TX PCI-X Adapter (14106902) Specific Statistics:
--------------------------------------------------------------------
Link Status: Up
Media Speed Selected: Auto negotiation
Media Speed Running: 1000 Mbps Full Duplex
PCI Mode: PCI-X (100-133)
PCI Bus Width: 64 bit
The system firmware is responsible for configuring several key parameters on each PCI adapter as well as configuring options in the I/O chips on the various I/O and PCI buses in the system. In some cases, the firmware sets parameters unique to specific adapters, for example the PCI Latency Timer and Cache Line Size, and for PCI-X adapters, the Maximum Memory Read Byte Count (MMRBC) values. These parameters are key to obtaining good performance from the adapters. If these parameters are not properly set because of down-level firmware, it will be impossible to achieve optimal performance by software tuning alone. Ensure that you update the firmware on older systems before adding new adapters to the system.
Firmware release level information and firmware updates can be downloaded from the following link: https://techsupport.services.ibm.com/server/mdownload//download.html
You can see both the platform and system firmware levels with the lscfg -vp|grep -p " ROM" command, as in the following example:
lscfg -vp|grep -p " ROM"
...lines omitted...
System Firmware:
ROM Level (alterable).......M2P030828
Version.....................RS6K
System Info Specific.(YL)...U0.1-P1/Y1
Physical Location: U0.1-P1/Y1
SPCN firmware:
ROM Level (alterable).......0000CMD02252
Version.....................RS6K
System Info Specific.(YL)...U0.1-P1/Y3
Physical Location: U0.1-P1/Y3
SPCN firmware:
ROM Level (alterable).......0000CMD02252
Version.....................RS6K
System Info Specific.(YL)...U0.2-P1/Y3
Physical Location: U0.2-P1/Y3
Platform Firmware:
ROM Level (alterable).......MM030829
Version.....................RS6K
System Info Specific.(YL)...U0.1-P1/Y2
Physical Location: U0.1-P1/Y2
User payload data rates can be obtained by sockets-based programs for applications that stream data over a TCP connection, for example, one program doing send() calls and the receiver doing recv() calls. The rates are a function of the network bit rate, the MTU size (frame size), physical-level overhead such as the inter-frame gap and preamble bits, data link headers, and TCP/IP headers, and they assume a gigahertz-speed CPU. These rates are best-case numbers for a single LAN and may be lower if traffic goes through routers, additional network hops, or remote links.
Single direction (simplex) TCP Streaming rates are rates that can be seen by a workload like FTP sending data from machine A to machine B in a memory to memory test. See The ftp Command in Analyzing Network Performance. Note that full duplex media performs slightly better than half duplex media because the TCP acks can flow back without contending for the same wire that the data packets are flowing on.
The following table lists maximum possible network payload speeds and the single direction (simplex) TCP streaming rates:
| Network type | Raw bit rate (Mbit/s) | Payload rate (Mbit/s) | Payload rate (MB/s) |
| 10 Mbit Ethernet, Half Duplex | 10 | 6 | 0.7 |
| 10 Mbit Ethernet, Full Duplex | 10 (20 Mbit full duplex) | 9.48 | 1.13 |
| 100 Mbit Ethernet, Half Duplex | 100 | 62 | 7.3 |
| 100 Mbit Ethernet, Full Duplex | 100 (200 Mbit full duplex) | 94.8 | 11.3 |
| 1000 Mbit Ethernet, Full Duplex, MTU 1500 | 1000 (2000 Mbit full duplex) | 948 | 113.0 |
| 1000 Mbit Ethernet, Full Duplex, MTU 9000 | 1000 (2000 Mbit full duplex) | 989 | 117.9 |
| FDDI, MTU 4352 (default) | 100 | 92 | 11.0 |
| ATM 155, MTU 1500 | 155 | 125 | 14.9 |
| ATM 155, MTU 9180 (default) | 155 | 133 | 15.9 |
| ATM 622, MTU 1500 | 622 | 364 | 43.4 |
| ATM 622, MTU 9180 (default) | 622 | 534 | 63.6 |
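As a rough cross-check of the Gigabit Ethernet rows, the payload rate can be approximated from per-frame overhead alone. The sketch below assumes standard Ethernet framing (14-byte header, 4-byte frame check sequence, 8-byte preamble, 12-byte inter-frame gap) and 20-byte IP and TCP headers with no options; the function name is illustrative. It ignores ACK traffic and host overhead, so it is an upper bound rather than a reproduction of the measured values.

```python
def tcp_payload_rate_mbit(raw_mbit, mtu, ip_hdr=20, tcp_hdr=20):
    """Estimate the best-case TCP payload rate on Ethernet in Mbit/s."""
    eth_overhead = 14 + 4 + 8 + 12    # header + FCS + preamble + inter-frame gap
    payload = mtu - ip_hdr - tcp_hdr  # user data bytes carried per frame
    wire_bytes = mtu + eth_overhead   # bytes of wire time consumed per frame
    return raw_mbit * payload / wire_bytes


for mtu in (1500, 9000):
    print("MTU", mtu, "->", round(tcp_payload_rate_mbit(1000, mtu), 1), "Mbit/s")
```

Under these assumptions the estimate is roughly 949 Mbit/s for MTU 1500 and 991 Mbit/s for MTU 9000, slightly above the 948 and 989 listed in the table.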
Two direction (duplex) TCP streaming workloads have data streaming in both directions, for example, running the ftp command from machine A to machine B and another instance of the ftp command from machine B to machine A concurrently. These types of workloads take advantage of full duplex media that can send and receive data concurrently. Some media, like FDDI or Ethernet in half duplex mode, cannot send and receive data concurrently and do not perform well when running duplex workloads. Duplex workloads do not scale to twice the rate of a simplex workload because the TCP ACK packets coming back from the receiver now have to compete with data packets flowing in the same direction. The following table lists the two direction (duplex) TCP streaming rates:
| Network type | Raw bit rate (Mbit/s) | Payload rate (Mbit/s) | Payload rate (MB/s) |
| 10 Mbit Ethernet, Half Duplex | 10 | 5.8 | 0.7 |
| 10 Mbit Ethernet, Full Duplex | 10 (20 Mbit full duplex) | 18 | 2.2 |
| 100 Mbit Ethernet, Half Duplex | 100 | 58 | 7.0 |
| 100 Mbit Ethernet, Full Duplex | 100 (200 Mbit full duplex) | 177 | 21.1 |
| 1000 Mbit Ethernet, Full Duplex, MTU 1500 | 1000 (2000 Mbit full duplex) | 1470 (1660 peak) | 175 (198 peak) |
| 1000 Mbit Ethernet, Full Duplex, MTU 9000 | 1000 (2000 Mbit full duplex) | 1680 (1938 peak) | 200 (231 peak) |
| FDDI, MTU 4352 (default) | 100 | 97 | 11.6 |
| ATM 155, MTU 1500 | 155 (310 Mbit full duplex) | 180 | 21.5 |
| ATM 155, MTU 9180 (default) | 155 (310 Mbit full duplex) | 236 | 28.2 |
| ATM 622, MTU 1500 | 622 (1244 Mbit full duplex) | 476 | 56.7 |
| ATM 622, MTU 9180 (default) | 622 (1244 Mbit full duplex) | 884 | 105 |
Several adapter or device options are important for both proper operation and best performance. AIX devices typically have default values that should work well for most installations. Therefore, these device values normally do not require changes. However, some companies have policies that require specific network settings or some network equipment might require some of these defaults to be changed.
You can configure the Ethernet adapters for the following modes:
It is important that you configure both the adapter and the other endpoint of the cable (normally an Ethernet switch or another adapter if running in a point-to-point configuration without an Ethernet switch) the same way. The default setting for AIX is Auto_Negotiation, which negotiates the speed and duplex settings for the highest possible data rates. For the Auto_Negotiation mode to function properly, you must also configure the other endpoint (switch) for Auto_Negotiation mode.
If one endpoint is manually set to a specific speed and duplex mode, the other endpoint should also be manually set to the same speed and duplex mode. Having one end manually set and the other in Auto_Negotiation mode normally results in problems that make the link perform slowly.
It is best to use Auto_Negotiation mode whenever possible, as it is the default setting for most Ethernet switches. However, some 10/100 Ethernet switches do not support auto-negotiation of the duplex mode. These types of switches require that you manually set both endpoints to the desired speed and duplex mode.
You must use the commands that are unique to each Ethernet switch to display the port settings and to change the port speed and duplex mode settings within the Ethernet switch. Refer to your switch vendor's documentation for these commands.
For AIX, you can use the smitty devices command to change the adapter settings. To display the settings and the negotiated mode, use the netstat -v command or the entstat -d enX command, where X is the Ethernet interface number. The following is part of an example of the entstat -d en3 command output:
10/100/1000 Base-TX PCI-X Adapter (14106902) Specific Statistics:
--------------------------------------------------------------------
Link Status: Up
Media Speed Selected: Auto negotiation
Media Speed Running: 1000 Mbps Full Duplex
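When checking many interfaces, it can help to extract the selected and running media speed from this output programmatically. The following is a minimal sketch that parses the "Name: value" lines of the sample shown above; the function name is hypothetical and the parser assumes the output keeps this layout.

```python
def parse_adapter_stats(text):
    """Collect 'Name: value' lines from adapter-specific statistics output."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():      # skip titles and separator lines
            stats[name.strip()] = value.strip()
    return stats


SAMPLE = """\
10/100/1000 Base-TX PCI-X Adapter (14106902) Specific Statistics:
--------------------------------------------------------------------
Link Status: Up
Media Speed Selected: Auto negotiation
Media Speed Running: 1000 Mbps Full Duplex
"""

stats = parse_adapter_stats(SAMPLE)
full_duplex = "Full Duplex" in stats["Media Speed Running"]
print(stats["Media Speed Selected"], "->", stats["Media Speed Running"])
print("full duplex:", full_duplex)
```

A link that reports half duplex or an unexpectedly low running speed is a hint that the two endpoints are not configured the same way.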
All devices on the same physical network, or logical network if using VLAN tagging, must have the same maximum transmission unit (MTU) size. This is the maximum size of a frame (or packet) that can be sent on the wire.
The various network adapters support different MTU sizes, so make sure that you use the same MTU size for all the devices on the network. For example, you cannot have a Gigabit Ethernet adapter using jumbo frame mode with an MTU size of 9000 bytes while other adapters on the network use the default MTU size of 1500 bytes. 10/100 Ethernet adapters do not support jumbo frame mode, so they are not compatible with this Gigabit Ethernet option. You also have to configure Ethernet switches to use jumbo frames, if jumbo frames are supported on your Ethernet switch.
It is important to select the MTU size of the adapter early in the network setup so you can properly configure all the devices and switches. Also, many AIX tuning options are dependent upon the selected MTU size.
The MTU size of the network can have a large impact on performance. The use of large MTU sizes allows the operating system to send fewer packets of a larger size to reach the same network throughput. The larger packets greatly reduce the processing required in the operating system, assuming the workload allows large messages to be sent. If the workload is only sending small messages, then the larger MTU size will not help.
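To make the packet-count difference concrete, the following sketch counts how many TCP segments are needed to move a given amount of user data at two MTU sizes. It assumes 20-byte IP and TCP headers with no options; the function name is illustrative.

```python
import math


def frames_needed(data_bytes, mtu, ip_hdr=20, tcp_hdr=20):
    """Number of full-size TCP segments needed to carry data_bytes."""
    mss = mtu - ip_hdr - tcp_hdr       # user data bytes per packet
    return math.ceil(data_bytes / mss)


one_mb = 1024 * 1024
print("MTU 1500:", frames_needed(one_mb, 1500), "packets")
print("MTU 9000:", frames_needed(one_mb, 9000), "packets")
```

Under these assumptions, 1 MB of data takes 719 packets at MTU 1500 but only 118 at MTU 9000, which is where the reduction in per-packet processing comes from.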
When possible, use the largest MTU size that the adapter and network support. For example, on ATM, the default MTU size of 9180 is much more efficient than an MTU size of 1500 bytes (normally used by LAN Emulation). With Gigabit Ethernet, if all of the machines on the network have Gigabit Ethernet adapters and there are no 10/100 adapters on the network, it is best to use jumbo frame mode. For example, a server-to-server connection within the computer lab can typically use jumbo frames.
You must select jumbo frame mode as a device option. Trying to change the MTU size with the ifconfig command does not work. Use SMIT to display the adapter settings with the following steps:
The SMIT screen looks like the following:
Change/Show Characteristics of an Ethernet Adapter
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
Ethernet Adapter ent0
Description 10/100/1000 Base-TX PCI-X Adapter (14106902)
Status Available
Location 1H-08
Receive descriptor queue size [1024] +#
Transmit descriptor queue size [512] +#
Software transmit queue size [8192] +#
Transmit jumbo frames yes +
Enable hardware transmit TCP resegmentation yes +
Enable hardware transmit and receive checksum yes +
Media Speed Auto_Negotiation +
Enable ALTERNATE ETHERNET address no +
ALTERNATE ETHERNET address [0x000000000000] +
Apply change to DATABASE only no +
F1=Help F2=Refresh F3=Cancel F4=List
Esc+5=Reset Esc+6=Command Esc+7=Edit Esc+8=Image
Esc+9=Shell Esc+0=Exit Enter=Do
The no (network options) command displays, changes, and manages the global network options. An alternate method for tuning some of these parameters is discussed in the Interface-Specific Network Options (ISNO) section.
The following no command options are used to change the tuning parameters:
The following is an example of the no command:
NAME CUR DEF BOOT MIN MAX UNIT TYPE DEPENDENCIES
-------------------------------------------------------------------------------------------------
General Network Parameters
-------------------------------------------------------------------------------------------------
sockthresh 85 85 85 0 100 %_of_thewall D
-------------------------------------------------------------------------------------------------
fasttimo 200 200 200 50 200 millisecond D
-------------------------------------------------------------------------------------------------
inet_stack_size 16 16 16 1 kbyte R
-------------------------------------------------------------------------------------------------
...lines omitted....
where:
CUR = current value
DEF = default value
BOOT = reboot value
MIN = minimal value
MAX = maximum value
UNIT = tunable unit of measure
TYPE = parameter type: D (for Dynamic), S (for Static), R (for Reboot), B (for Bosboot), M (for Mount),
I (for Incremental) and C (for Connect)
DEPENDENCIES = list of dependent tunable parameters, one per line
Some network attributes are run-time attributes that can be changed at any time. Others are load-time attributes that must be set before the netinet kernel extension is loaded.
For more information on the no command, see The no Command.
Interface-Specific Network Options (ISNO) allows IP network interfaces to be custom-tuned for the best performance. Values set for an individual interface take precedence over the systemwide values set with the no command. The feature is enabled (the default) or disabled for the whole system with the no command use_isno option. This single-point ISNO disable option is included as a diagnostic tool to eliminate potential tuning errors if the system administrator needs to isolate performance problems.
Programmers and performance analysts should note that the ISNO values will not show up in the socket (meaning they cannot be read by the getsockopt() system call) until after the TCP connection is made. The specific network interface that a socket actually uses is not known until the connection is complete, so the socket reflects the system defaults from the no command. After the TCP connection is accepted and the network interface is known, ISNO values are put into the socket.
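One way to observe this behavior is to read the socket buffer sizes before and after the connection is made. The following is a minimal sketch using the standard sockets API over the loopback interface; the helper names are hypothetical. On a system without interface-specific values, or over loopback, the two readings will typically be identical, so this only illustrates where in the connection lifecycle to read the values.

```python
import socket


def sockbuf(sock):
    """Return (send, receive) socket buffer sizes as reported by getsockopt()."""
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))


def compare_before_after_connect():
    """Read the client's buffer sizes before and after a loopback connect."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))    # loopback, ephemeral port (illustrative)
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        before = sockbuf(client)       # system defaults, interface not yet known
        client.connect(listener.getsockname())
        server_side, _ = listener.accept()
        after = sockbuf(client)        # interface is known at this point
        server_side.close()
        return before, after
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    before, after = compare_before_after_connect()
    print("before connect:", before)
    print("after connect: ", after)
```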
The following parameters have been added for each supported network interface and are only effective for TCP (and not UDP) connections: rfc1323, tcp_mssdflt, tcp_nodelay, tcp_recvspace, and tcp_sendspace.
When set for a specific interface, these values override the corresponding no option values set for the system. These parameters are available for all of the mainstream TCP/IP interfaces (Token-Ring, FDDI, 10/100 Ethernet, and Gigabit Ethernet), except the css# IP interface on the SP switch. As a simple workaround, SP switch users can set the tuning options appropriate for the switch using the systemwide no command, then use the ISNOs to set the values needed for the other system interfaces.
These options are set for the TCP/IP interface (such as en0 or tr0), and not the network adapter (ent0 or tok0).
AIX sets default values for the Gigabit Ethernet interfaces, for both MTU 1500 and for jumbo frame mode (MTU 9000). As long as you configure the interface through the SMIT tcpip screens, the ISNO options are set to the default values, which provide good performance.
For 10/100 Ethernet and token ring adapters, the ISNO defaults are not set by the system as they typically work fine with the system global no defaults. However, the ISNO attributes can be set if needed to override the global defaults.
The following example shows the default ISNO values for tcp_sendspace and tcp_recvspace for Gigabit Ethernet in MTU 1500 mode:
# ifconfig en0
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
inet 10.0.0.1 netmask 0xffffff00 broadcast 192.0.0.255
tcp_sendspace 131072 tcp_recvspace 65536
For jumbo frame mode, the default ISNO values for tcp_sendspace, tcp_recvspace, and rfc1323 are set as follows:
# ifconfig en0
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
inet 192.0.0.1 netmask 0xffffff00 broadcast 192.0.0.255
tcp_sendspace 262144 tcp_recvspace 131072 rfc1323 1
You can set ISNO options with SMIT, the chdev command, or the ifconfig command.
Using SMIT or the chdev command changes the values in the ODM database on disk so they will be permanent. The ifconfig command only changes the values in memory, so they go back to the prior values stored in ODM on the next reboot.
You can change the ISNO options with SMIT as follows:
# smitty tcpip
Then, you will see the following screen:
Change / Show a Standard Ethernet Interface
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[Entry Fields]
Network Interface Name en0
INTERNET ADDRESS (dotted decimal) [192.0.0.1]
Network MASK (hexadecimal or dotted decimal) [255.255.255.0]
Current STATE up +
Use Address Resolution Protocol (ARP)? yes +
BROADCAST ADDRESS (dotted decimal) []
Interface Specific Network Options
('NULL' will unset the option)
rfc1323 []
tcp_mssdflt []
tcp_nodelay []
tcp_recvspace []
tcp_sendspace []
F1=Help F2=Refresh F3=Cancel F4=List
Esc+5=Reset Esc+6=Command Esc+7=Edit Esc+8=Image
Esc+9=Shell Esc+0=Exit Enter=Do
Notice that the ISNO system defaults do not display, even though they are set internally. For this example, override the default value for tcp_sendspace and lower it to 65536.
Bring the interface back up with smitty tcpip and select Minimum Configuration and Startup. Then select en0, and take the default values that were set when the interface was first set up.
If you use the ifconfig command to show the ISNO options, you can see that the value of the tcp_sendspace attribute is now set to 65536. The following is an example:
# ifconfig en0
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
inet 192.0.0.1 netmask 0xffffff00 broadcast 192.0.0.255
tcp_sendspace 65536 tcp_recvspace 65536
The lsattr command output also shows that the system default has been overridden for this attribute:
# lsattr -E -l en0
alias4                      IPv4 Alias including Subnet Mask           True
alias6                      IPv6 Alias including Prefix Length         True
arp           on            Address Resolution Protocol (ARP)          True
authority                   Authorized Users                           True
broadcast                   Broadcast Address                          True
mtu           1500          Maximum IP Packet Size for This Device     True
netaddr       192.0.0.1     Internet Address                           True
netaddr6                    IPv6 Internet Address                      True
netmask       255.255.255.0 Subnet Mask                                True
prefixlen                   Prefix Length for IPv6 Internet Address    True
remmtu        576           Maximum IP Packet Size for REMOTE Networks True
rfc1323                     Enable/Disable TCP RFC 1323 Window Scaling True
security      none          Security Level                             True
state         up            Current Interface Status                   True
tcp_mssdflt                 Set TCP Maximum Segment Size               True
tcp_nodelay                 Enable/Disable TCP_NODELAY Option          True
tcp_recvspace               Set Socket Buffer Space for Receiving      True
tcp_sendspace 65536         Set Socket Buffer Space for Sending        True
You can use the following commands to first verify system and interface support and then to set and verify the new values.
# no -a | grep isno
use_isno = 1

# lsattr -E -l en0 -H
attribute     value description                                user_settable
:
rfc1323             Enable/Disable TCP RFC 1323 Window Scaling True
tcp_mssdflt         Set TCP Maximum Segment Size               True
tcp_nodelay         Enable/Disable TCP_NODELAY Option          True
tcp_recvspace       Set Socket Buffer Space for Receiving      True
tcp_sendspace       Set Socket Buffer Space for Sending        True
For example, to set the tcp_recvspace and tcp_sendspace to 64 KB and enable tcp_nodelay, use one of the following methods:
# ifconfig en0 tcp_recvspace 65536 tcp_sendspace 65536 tcp_nodelay 1
or
# chdev -l en0 -a tcp_recvspace=65536 -a tcp_sendspace=65536 -a tcp_nodelay=1
# ifconfig en0
en0: flags=5e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
inet 9.19.161.100 netmask 0xffffff00 broadcast 9.19.161.255
tcp_sendspace 65536 tcp_recvspace 65536 tcp_nodelay 1
or
# lsattr -El en0
rfc1323             Enable/Disable TCP RFC 1323 Window Scaling True
tcp_mssdflt         Set TCP Maximum Segment Size               True
tcp_nodelay   1     Enable/Disable TCP_NODELAY Option          True
tcp_recvspace 65536 Set Socket Buffer Space for Receiving      True
tcp_sendspace 65536 Set Socket Buffer Space for Sending        True
There are several AIX tunable values that can have a large impact on TCP performance. Many applications use the reliable Transmission Control Protocol (TCP), including FTP and RCP.
Streaming workloads move large amounts of data from one endpoint to the other endpoint. Examples of streaming workloads are file transfer, backup or restore workloads, or bulk data transfer. The main metric of interest in these workloads is bandwidth, but you can also look at end-to-end latency.
The primary tunables that affect TCP performance for streaming applications are the following:
The following table shows suggested sizes for the tunable values to obtain optimal performance, based on the type of adapter and the MTU size:
| Device | Speed | MTU size | tcp_sendspace | tcp_recvspace | sb_max | rfc1323 |
| Token Ring | 4 or 16 Mbit | 1492 | 16384 | 16384 | 32768 | 0 |
| Ethernet | 10 Mbit | 1500 | 16384 | 16384 | 32768 | 0 |
| Ethernet | 100 Mbit | 1500 | 16384 | 16384 | 65536 | 0 |
| Ethernet | Gigabit | 1500 | 131072 | 65536 | 131072 | 0 |
| Ethernet | Gigabit | 9000 | 131072 | 65535 | 262144 | 0 |
| Ethernet | Gigabit | 9000 | 262144 | 131072 | 524288 | 1 |
| ATM | 155 Mbit | 1500 | 16384 | 16384 | 131072 | 0 |
| ATM | 155 Mbit | 9180 | 65535 | 65535 | 131072 | 0 |
| ATM | 155 Mbit | 65527 | 655360 | 655360 | 1310720 | 1 |
| FDDI | 100 Mbit | 4352 | 45056 | 45056 | 90012 | 0 |
| Fiber Channel | 2 Gigabit | 65280 | 655360 | 655360 | 1310720 | 1 |
The following are general guidelines for tuning TCP streaming workloads:
The ftp and rcp commands are examples of TCP applications that benefit from tuning the tcp_sendspace and tcp_recvspace tunables.
TCP send buffer size can limit how much data the application can send before the application is put to sleep. The TCP socket send buffer is used to buffer the application data in the kernel using mbufs/clusters before it is sent beyond the socket and TCP layer. The default size of this buffer is specified by the parameter tcp_sendspace, but you can use the setsockopt() subroutine to override it.
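The per-socket override can be sketched with the standard sockets API. The helper name below is hypothetical, and 65536 is only an example value. The size that the kernel reports back may differ from the size requested, because operating systems can round, scale, or cap the value.

```python
import socket


def request_send_buffer(sock, size):
    """Ask the kernel for a send buffer of `size` bytes; return what it reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    granted = request_send_buffer(sock, 65536)
    print("requested 65536, kernel reports", granted)
finally:
    sock.close()
```

Setting the option before the connection is established keeps the example simple; the same call pattern with SO_RCVBUF applies to the receive buffer.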
If the amount of data that the application wants to send is smaller than the send buffer size and also smaller than the maximum segment size and if TCP_NODELAY is not set, then TCP will delay up to 200 ms, until enough data exists to fill the send buffer or the amount of data is greater than or equal to the maximum segment size, before transmitting the packets.
If TCP_NODELAY is set, then the data is sent immediately (useful for request/response type of applications). If the send buffer size is less than or equal to the maximum segment size (ATM and SP switches can have 64 K MTUs), then the application's data will be sent immediately and the application must wait for an ACK before sending another packet (this prevents TCP streaming and could reduce throughput).
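For a request/response application, TCP_NODELAY can be set on an individual socket. The following is a minimal sketch using the standard sockets API; the helper name is hypothetical.

```python
import socket


def enable_nodelay(sock):
    """Set TCP_NODELAY on a TCP socket and report whether it is now on."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    print("TCP_NODELAY enabled:", enable_nodelay(sock))
finally:
    sock.close()
```

Setting the option per socket affects only that application, whereas the tcp_nodelay ISNO attribute applies to every TCP connection on the interface.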
If an application does nonblocking I/O (by specifying O_NDELAY or O_NONBLOCK on the socket) and the send buffer fills up, the send call returns an EWOULDBLOCK/EAGAIN error rather than putting the application to sleep. Applications must be coded to handle this error (a suggested solution is to sleep for a short while and try to send again).
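The error handling can be sketched as follows. This example swaps the sleep-and-retry approach for a wait on select(), which resumes as soon as the socket is writable again; that substitution, the function names, and the use of a local socket pair in place of a real TCP connection are all choices made for illustration.

```python
import select
import socket
import threading


def send_all_nonblocking(sock, data, timeout=5.0):
    """Send all of data on a nonblocking socket, waiting when the buffer is full.

    Returns the number of times a send would have blocked.
    """
    view = memoryview(data)
    would_block = 0
    while view:
        try:
            sent = sock.send(view)
            view = view[sent:]
        except BlockingIOError:        # EWOULDBLOCK / EAGAIN: send buffer is full
            would_block += 1
            _, writable, _ = select.select([], [sock], [], timeout)
            if not writable:
                raise TimeoutError("send buffer stayed full")
    return would_block


def demo(total=4 * 1024 * 1024):
    """Push `total` bytes through a socket pair while a thread drains it."""
    a, b = socket.socketpair()
    a.setblocking(False)
    received = []

    def drain():
        count = 0
        while count < total:
            chunk = b.recv(65536)
            if not chunk:
                break
            count += len(chunk)
        received.append(count)

    reader = threading.Thread(target=drain)
    reader.start()
    blocked = send_all_nonblocking(a, b"x" * total)
    reader.join()
    a.close()
    b.close()
    return received[0], blocked


if __name__ == "__main__":
    got, blocked = demo()
    print("received", got, "bytes; send would have blocked", blocked, "times")
```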
When you are changing send/recv space values, in some cases you must stop/restart the inetd process as follows:
# stopsrc -s inetd; startsrc -s inetd
TCP receive-buffer size limits how much data the receiving system can buffer before the application reads the data. The TCP receive buffer is used to accommodate incoming data. When the data is read by the TCP layer, TCP can send back an acknowledgment (ACK) for that packet immediately or it can delay before sending the ACK. Also, TCP tries to piggyback the ACK if a data packet was being sent back anyway. If multiple packets are coming in and can be stored in the receive buffer, TCP can acknowledge all of these packets with one ACK. Along with the ACK, TCP returns a window advertisement to the sending system telling it how much room remains in the receive buffer. If not enough room remains, the sender will be blocked until the application has read the data. Smaller values will cause the sender to block more. The size of the TCP receive buffer can be set using the setsockopt() subroutine or by the tcp_recvspace parameter.
The TCP window size by default is limited to 65536 bytes (64 K) but can be set higher if rfc1323 is set to 1. If you are setting tcp_recvspace to greater than 65536, set rfc1323=1 on each side of the connection. Without having rfc1323 set on both sides, the effective value for tcp_recvspace will be 65536.
If you are sending data through adapters that have large MTU sizes (32 K or 64 K for example), TCP streaming performance may not be optimal because the packet or packets will be sent and the sender will have to wait for an acknowledgment. By enabling the rfc1323 option using the command no -o rfc1323=1, TCP's window size can be set as high as 4 GB. However, on adapters that have 64 K or larger MTUs, TCP streaming performance can be degraded if the receive buffer can only hold 64 K. If the receiving machine does not support rfc1323, then reducing the MTU size is one way to enhance streaming performance.
After setting the rfc1323 option to 1, you can increase the tcp_recvspace parameter to something much larger, such as 10 times the size of the MTU.
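The reason a larger window helps can be seen from a simple bound: a sender can have at most one window of unacknowledged data in flight per round trip. The following sketch computes that bound and the window needed to keep a link full. The 1 ms round-trip time is a hypothetical figure chosen for illustration, and the function names are not from AIX.

```python
def max_tcp_throughput_mbit(window_bytes, rtt_seconds):
    """Upper bound on one connection's throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e6


def window_for_rate(rate_mbit, rtt_seconds):
    """Bandwidth-delay product in bytes: window needed to keep the link full."""
    return rate_mbit * 1e6 / 8 * rtt_seconds


rtt = 0.001  # hypothetical 1 ms round-trip time
print("64 K window cap:", max_tcp_throughput_mbit(65536, rtt), "Mbit/s")
print("window for 1000 Mbit/s:", round(window_for_rate(1000, rtt)), "bytes")
```

With these assumed numbers, a 64 K window caps a single connection at about 524 Mbit/s, and filling a Gigabit link needs a window of about 125,000 bytes, which is only possible with rfc1323 enabled on both sides.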
The sb_max parameter controls how much buffer space is consumed by buffers that are queued to a sender's socket or to a receiver's socket. The system accounts for socket buffers used based on the size of the buffer, not on the contents of the buffer.
If a device driver puts 100 bytes of data into a 2048-byte buffer, then the system considers 2048 bytes of socket buffer space to be used. It is common for device drivers to receive data into a buffer that is large enough to hold the adapter's maximum-size packet. This often results in wasted buffer space, but copying the data to smaller buffers would require more CPU cycles.
Because there are so many different network device drivers, set the sb_max value much higher than the largest TCP or UDP socket buffer size parameter rather than making it the same. After the total number of mbufs/clusters on the socket reaches the sb_max limit, no additional buffers can be queued to the socket until the application has read the data.
One guideline would be to set it to twice as large as the largest TCP or UDP receive space.
TCP request/response workloads are workloads that involve a two-way exchange of information. Examples of request/response workloads are Remote Procedure Call (RPC) types of applications or client/server applications, like web browser requests to a web server, NFS file systems (that use TCP for the transport protocol), or a database's lock management protocol. Such requests are often small messages with larger responses, but might also be large requests with a small response.
The primary metric of interest in these workloads is the round-trip latency of the network. Many of these requests or responses use small messages, so the network bandwidth is not a major consideration.
Hardware has a major impact on latency. For example, the type of network, the type and performance of any network switches or routers, the speed of the processors used in each node of the network, the adapter and bus latencies all impact the round-trip time.
Tuning options to provide minimum latency (best response) typically cause higher CPU overhead as the system sends more packets, gets more interrupts, etc. in order to minimize latency and response time. These are classic performance trade-offs.
Primary tunables for request/response applications are the following:
User Datagram Protocol (UDP) is a datagram protocol that is used by Network File System (NFS), name server (named), Trivial File Transfer Protocol (TFTP), and other special purpose protocols.
Since UDP is a datagram protocol, the entire message (datagram) must be copied into the kernel on a send operation as one atomic operation. The datagram is also received as one complete message on the recv or recvfrom system call. You must set the udp_sendspace and udp_recvspace parameters to handle the buffering requirements on a per-socket basis.
The largest UDP datagram that can be sent is 64 KB, minus the UDP header size (8 bytes) and the IP header size (20 bytes for IPv4 or 40 bytes for IPv6 headers).
The following tunables affect UDP performance:
Set the udp_sendspace parameter to 65536, because any value greater than 65536 is ineffective. Because UDP transmits a packet as soon as it gets any data, and because IP has an upper limit of 65536 bytes per packet, anything beyond 65536 runs the small risk of being discarded by IP. The IP protocol will fragment the datagram into smaller packets if needed, based on the MTU size of the interface the packet will be sent on. For example, when sending an 8 K datagram over Ethernet, IP fragments it into 1500-byte packets. Because UDP does not implement any flow control, all packets given to UDP are passed to IP (where they may be fragmented) and then placed directly on the device driver's transmit queue.
On the receive side, the incoming datagram (or fragment if the datagram is larger than the MTU size) will first be received into a buffer by the device driver. This will typically go into a buffer that is large enough to hold the largest possible packet from this device.
The setting of udp_recvspace is harder to compute because it varies by network adapter type, UDP sizes, and number of datagrams queued to the socket. Set the udp_recvspace larger rather than smaller, because packets will be discarded if it is too small.
For example, Ethernet might use 2 K receive buffers. Even if the incoming packet is the maximum MTU size of 1500 bytes, it uses only 73 percent of the buffer. IP queues the incoming fragments until a full UDP datagram is received, and then passes it to UDP. UDP puts the incoming datagram on the receiver's socket. However, if the total buffer space in use on this socket exceeds udp_recvspace, the entire datagram is discarded. The netstat -s command reports this condition in its "dropped due to full socket buffers" error count.
Because the communication subsystem accounts for buffers used, and not the contents of the buffers, you must account for this when setting udp_recvspace. In the above example, the 8 K datagram would be fragmented into 6 packets which would use 6 receive buffers. These will be 2048 byte buffers for Ethernet. So, the total amount of socket buffer consumed by this one 8 K datagram is as follows:
6*2048=12,288 bytes
Thus, you can see that the udp_recvspace must be adjusted higher depending on how efficient the incoming buffering is. This will vary by datagram size and by device driver. Sending a 64 byte datagram would consume a 2 K buffer for each 64 byte datagram.
Then, you must account for the number of datagrams that may be queued onto this one socket. For example, NFS server receives UDP packets at one well-known socket from all clients. If the queue depth of this socket could be 30 packets, then you would use 30 * 12,288 = 368,640 for the udp_recvspace if NFS is using 8 K datagrams. NFS Version 3 allows up to 32K datagrams.
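The buffer accounting above can be sketched as a small calculation. This is an illustrative sketch, not an AIX utility: the function names are hypothetical, and it assumes IPv4 with a 20-byte header and no options, a 1500-byte Ethernet MTU, and 2048-byte driver receive buffers, as in the example.

```python
import math

IP_HEADER = 20   # IPv4 header bytes (assumption: no IP options)
UDP_HEADER = 8   # UDP header bytes


def fragments_per_datagram(datagram_bytes, mtu=1500):
    """Number of IP fragments needed to carry one UDP datagram."""
    payload_per_fragment = mtu - IP_HEADER
    # Every fragment except the last carries a multiple of 8 payload bytes.
    payload_per_fragment -= payload_per_fragment % 8
    return math.ceil((datagram_bytes + UDP_HEADER) / payload_per_fragment)


def recvspace_estimate(datagram_bytes, queue_depth, buffer_bytes=2048, mtu=1500):
    """Socket buffer space charged when queue_depth datagrams are queued.

    Each fragment is charged a whole receive buffer, whatever its contents.
    """
    per_datagram = fragments_per_datagram(datagram_bytes, mtu) * buffer_bytes
    return per_datagram * queue_depth


if __name__ == "__main__":
    print(fragments_per_datagram(8192))    # fragments for one 8 K datagram
    print(recvspace_estimate(8192, 1))     # bytes charged for one 8 K datagram
    print(recvspace_estimate(8192, 30))    # queue depth of 30 datagrams
    print(recvspace_estimate(64, 1))       # a 64-byte datagram still uses a full buffer
```

Under these assumptions the figures match the worked example: 6 fragments, 12,288 bytes per 8 K datagram, and 368,640 bytes for a queue depth of 30.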
A suggested starting value for udp_recvspace is 10 times the value of udp_sendspace, because UDP may not be able to pass a packet to the application before another one arrives. Also, several nodes can send to one node at the same time. To provide some staging space, this size is set to allow 10 packets to be staged before subsequent packets are discarded. For large parallel applications using UDP, the value may have to be increased.
When UDP datagrams to be transmitted are larger than the adapter's MTU size, the IP protocol layer fragments the datagram into MTU-size fragments. Ethernet interfaces include a UDP packet chaining feature. This feature is enabled by default in AIX.
UDP packet chaining causes IP to build the entire chain of fragments and pass that chain down to the Ethernet device driver in one call. This improves performance by reducing the calls down through the ARP and interface layers and to the driver. It also reduces lock and unlock calls in an SMP environment and helps the cache affinity of the code loops. These changes reduce the CPU utilization of the sender.
You can view the UDP packet chaining option with the ifconfig command. The following example shows the ifconfig command output for the en0 interface, where the CHAIN flag indicates that packet chaining is enabled:
# ifconfig en0
en0: flags=5e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
inet 192.1.6.1 netmask 0xffffff00 broadcast 192.1.6.255
tcp_sendspace 65536 tcp_recvspace 65536 tcp_nodelay 1
Packet chaining can be disabled by the following command:
# ifconfig en0 -pktchain
# ifconfig en0
en0: flags=5e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG>
inet 192.1.6.1 netmask 0xffffff00 broadcast 192.1.6.255
tcp_sendspace 65536 tcp_recvspace 65536 tcp_nodelay 1
Packet chaining can be re-enabled with the following command:
# ifconfig en0 pktchain
# ifconfig en0
en0: flags=5e080863,80<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD,PSEG,CHAIN>
inet 192.1.6.1 netmask 0xffffff00 broadcast 192.1.6.255
tcp_sendspace 65536 tcp_recvspace 65536 tcp_nodelay 1
Most communication drivers provide a set of tunable parameters to control transmit and receive resources. These parameters typically control the transmit queue and receive queue limits, but may also control the number and size of buffers or other resources. These parameters limit the number of buffers or packets that may be queued for transmit or limit the number of receive buffers that are available for receiving packets. These parameters can be tuned to ensure enough queueing at the adapter level to handle the peak loads generated by the system or the network.
Following are some general guidelines:
For transmit, the device drivers may provide a transmit queue limit. There may be both hardware queue and software queue limits, depending on the driver and adapter. Some drivers have only a hardware queue; some have both hardware and software queues. Some drivers internally control the hardware queue and only allow the software queue limits to be modified. Generally, the device driver will queue a transmit packet directly to the adapter hardware queue. If the system CPU is fast relative to the speed of the network, or on an SMP system, the system may produce transmit packets faster than they can be transmitted on the network. This will cause the hardware queue to fill. After the hardware queue is full, some drivers provide a software queue and they will then queue to the software queue. If the software transmit queue limit is reached, then the transmit packets are discarded. This can affect performance because the upper-level protocols must then time out and retransmit the packet.
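The queueing order described above (hardware queue first, then the software queue if the driver has one, then discard) can be illustrated with a minimal sketch. The function name and the queue limits used here are hypothetical, and the model assumes the worst case in which the whole burst arrives before the adapter transmits anything.

```python
def enqueue_burst(num_packets, hw_limit, sw_limit=0):
    """Distribute a burst of transmit packets across the driver queues.

    Returns (hw_queued, sw_queued, dropped). A driver without a software
    queue is modeled with sw_limit=0.
    """
    hw_queued = min(num_packets, hw_limit)
    sw_queued = min(num_packets - hw_queued, sw_limit)
    dropped = num_packets - hw_queued - sw_queued
    return hw_queued, sw_queued, dropped


if __name__ == "__main__":
    # Burst fits in the hardware and software queues: nothing is dropped.
    print(enqueue_burst(700, hw_limit=256, sw_limit=512))
    # Burst exceeds both limits: the excess is discarded and must be
    # retransmitted by the upper-level protocol after a timeout.
    print(enqueue_burst(1000, hw_limit=256, sw_limit=512))
```

A nonzero third value corresponds to the discarded packets that show up as transmit queue overflows in the adapter statistics.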
The transmit queue limits on most of the device drivers are 2048 buffers. The default values were also increased to 512 for most of these drivers. The default values were increased because the faster CPUs and SMP systems can overrun the smaller queue limits.
Following are examples of PCI adapter transmit queue sizes:
| PCI Adapter Type | Default | Range |
| Ethernet | 64 | 16 - 256 |
| 10/100 Ethernet | 256, 512, or 2048 | 16 - 16384 |
| Token-Ring | 96, 512, or 2048 | 32 - 16384 |
| FDDI | 30 or 2048 | 3 - 16384 |
| 155 ATM | 100 or 2048 | 0 - 16384 |
For adapters that provide hardware queue limits, raising these values consumes more real memory on receives because of the associated control blocks and buffers. Therefore, raise these limits only if needed, or on larger systems where the increase in memory use is negligible. Increasing the software transmit queue limits does not increase memory usage; it only allows packets that were already allocated by the higher-layer protocols to be queued.
Some adapters allow you to configure the number of resources used for receiving packets from the network. This might include the number of receive buffers (and even their size) or may be a receive queue parameter (which indirectly controls the number of receive buffers).
The receive resources may need to be increased to handle peak bursts on the network. The network interface device driver places incoming packets on a receive queue. If the receive queue is full, packets are dropped and lost, and the sender must retransmit them. The receive queue is tunable using the SMIT or chdev commands (see How to Change the Parameters). The maximum queue size is specific to each type of communication adapter (see Tuning PCI Adapters).
For the Micro Channel adapters and the PCI adapters, receive queue parameters typically control the number of receive buffers that are provided to the adapter for receiving input packets.
AIX 4.1.4 and later support device-specific mbufs. This allows a driver to allocate its own private set of buffers and have them pre-setup for Direct Memory Access (DMA). This can provide additional performance because the overhead to set up the DMA mapping is done one time. Also, the adapter can allocate buffer sizes that are best suited to its MTU size. For example, ATM, High Performance Parallel Interface (HIPPI), and the SP switch support a 64 K MTU (packet) size. The maximum system mbuf size is 16 KB. By allowing the adapter to have 64 KB buffers, large 64 K writes from applications can be copied directly into the 64 KB buffers owned by the adapter, instead of copying them into multiple 16 K buffers (which has more overhead to allocate and free the extra buffers).
Device-specific buffers add an extra layer of complexity for the system administrator. The system administrator must use device-specific commands to view the statistics relating to the adapter's buffers and then change the adapter's parameters as necessary. If the statistics indicate that packets were discarded because not enough buffer resources were available, then those buffer sizes need to be increased.
Following are some guidelines to help you determine when to increase the receive/transmit queue parameters:
Several status utilities can be used to show the transmit queue high-water limits and number of queue overflows. You can use the command netstat -v, or go directly to the adapter statistics utilities (entstat for Ethernet, tokstat for Token-Ring, fddistat for FDDI, atmstat for ATM, and so on).
For an entstat example output, see The entstat Command. Another method is to use the netstat -i utility. If it shows non-zero counts in the Oerrs column for an interface, then this is typically the result of output queue overflows.
You can use the lsattr -E -l adapter-name command or you can use the SMIT command (smitty commodev) to show the adapter configuration.
Different adapters have different names for these variables. For example, they may be named sw_txq_size, tx_que_size, or xmt_que_size for the transmit queue parameter. The receive queue size and receive buffer pool parameters may be named rec_que_size, rx_que_size, or rv_buf4k_min for example.
Following is the output of the lsattr -E -l atm0 command on an IBM PCI 155 Mbps ATM adapter. This output shows that sw_txq_size is set to 250 and the rv_buf4k_min receive buffers are set to 0x30.
# lsattr -E -l atm0
dma_mem        0x400000    N/A                                          False
regmem         0x1ff88000  Bus Memory address of Adapter Registers      False
virtmem        0x1ff90000  Bus Memory address of Adapter Virtual Memory False
busintr        3           Bus Interrupt Level                          False
intr_priority  3           Interrupt Priority                           False
use_alt_addr   no          Enable ALTERNATE ATM MAC address             True
alt_addr       0x0         ALTERNATE ATM MAC address (12 hex digits)    True
sw_txq_size    250         Software Transmit Queue size                 True
max_vc         1024        Maximum Number of VCs Needed                 True
min_vc         32          Minimum Guaranteed VCs Supported             True
rv_buf4k_min   0x30        Minimum 4K-byte pre-mapped receive buffers   True
interface_type 0           Sonet or SDH interface                       True
adapter_clock  1           Provide SONET Clock                          True
uni_vers       auto_detect N/A                                          True
Following is an example of the settings of a Micro Channel 10/100 Ethernet adapter, as shown by the lsattr -E -l ent0 command. This output shows tx_que_size and rx_que_size both set to 256.
# lsattr -E -l ent0
bus_intr_lvl  11              Bus interrupt level                False
intr_priority 3               Interrupt priority                 False
dma_bus_mem   0x7a0000        Address of bus memory used for DMA False
bus_io_addr   0x2000          Bus I/O address                    False
dma_lvl       7               DMA arbitration level              False
tx_que_size   256             TRANSMIT queue size                True
rx_que_size   256             RECEIVE queue size                 True
use_alt_addr  no              Enable ALTERNATE ETHERNET address  True
alt_addr      0x              ALTERNATE ETHERNET address         True
media_speed   100_Full_Duplex Media Speed                        True
ip_gap        96              Inter-Packet Gap                   True
The information in this section is provided to document the various adapter-tuning parameters. These parameters and values are provided to aid you in understanding the various tuning parameters, or when a system is not available to view the parameters.
These parameter names, defaults, and range values were obtained from the ODM database. The comment field was obtained from the lsattr -E -l interface-name command.
The Notes field provides additional comments.
Feature Code: 2985
IBM PCI Ethernet Adapter (22100020)
Parameter     Default  Range             Comment             Notes
------------- -------- ----------------- ------------------- ---------
tx_que_size   64       16,32,64,128,256  TRANSMIT queue size HW Queues
rx_que_size   32       16,32,64,128,256  RECEIVE queue size  HW Queues
Feature Code: 2968
IBM 10/100 Mbps Ethernet PCI Adapter (23100020)
Parameter        Default Range            Comment               Notes
---------------- ------- ---------------- --------------------- --------------------
tx_que_size      256     16,32,64,128,256 TRANSMIT queue size   HW Queue Note 1
rx_que_size      256     16,32,64,128,256 RECEIVE queue size    HW Queue Note 2
rxbuf_pool_size  384     16-2048          # buffers in receive  Dedicated receive
                                          buffer pool           buffers Note 3
Feature Code: 2969
Gigabit Ethernet-SX PCI Adapter (14100401)
Parameter     Default Range    Comment                             Notes
------------- ------- -------- ----------------------------------- ---------
tx_que_size   512     512-2048 Software Transmit Queue size        SW Queue
rx_que_size   512     512      Receive queue size                  HW Queue
receive_proc  6       0-128    Minimum Receive Buffer descriptors
Feature Code: 2986
3Com 3C905-TX-IBM Fast EtherLink XL NIC
Parameter      Default  Range  Comment                      Notes
-------------- -------- ------ ---------------------------- ----------
tx_wait_q_size 32       4-128  Driver TX Waiting Queue Size HW Queues
rx_wait_q_size 32       4-128  Driver RX Waiting Queue Size HW Queues
Feature Code: 2742
SysKonnect PCI FDDI Adapter (48110040)
Parameter     Default  Range    Comment             Notes
------------- -------- -------- ------------------- ---------------
tx_queue_size 30       3-250    Transmit Queue Size SW Queue
RX_buffer_cnt 42       1-128    Receive frame count Rcv buffer pool
Feature Code: 2979
IBM PCI Tokenring Adapter (14101800)
Parameter     Default  Range   Comment                     Notes
------------- -------- ------- --------------------------- --------
xmt_que_size  96       32-2048 TRANSMIT queue size         SW Queue
rx_que_size   32       32-160  HARDWARE RECEIVE queue size HW queue
Feature Code: 2979
IBM PCI Tokenring Adapter (14103e00)
Parameter     Default  Range    Comment              Notes
------------- -------- -------- -------------------- --------
xmt_que_size  512      32-2048  TRANSMIT queue size  SW Queue
rx_que_size   64       32-512   RECEIVE queue size   HW Queue
Feature Code: 2988
IBM PCI 155 Mbps ATM Adapter (14107c00)
Parameter     Default   Range        Comment                          Notes
------------- --------- ------------ -------------------------------- --------
sw_txq_size   100       0-4096       Software Transmit Queue size     SW Queue
rv_buf4k_min  48 (0x30) 0-512 (x200) Minimum 4K-byte pre-mapped
                                     receive buffers
Notes on the IBM 10/100 Mbps Ethernet PCI Adapter:
Drivers, by default, call IP directly, which calls up the protocol stack to the socket level while running on the interrupt level. This minimizes instruction path length, but increases the interrupt hold time. On an SMP system, a single CPU can become the bottleneck for receiving packets from a fast adapter. By enabling the dog threads, the driver queues the incoming packet to the thread and the thread handles calling IP, TCP, and the socket code. The thread can run on other CPUs which may be idle. Enabling the dog threads can increase capacity of the system in some cases.
This is a feature for the input side (receive) of LAN adapters. It can be configured at the interface level with the ifconfig command (ifconfig interface thread or ifconfig interface hostname up thread).
To disable the feature, use the ifconfig interface -thread command.
Guidelines when considering using dog threads are as follows:
The dog threads run best when they find more work on their queue and do not have to go back to sleep (waiting for input). This saves the overhead of the driver waking up the thread and the system dispatching the thread.
The following are some of the network parameters that are user-configurable:
The device driver supports a user-configurable transmit queue. This is the queue the adapter uses (not an extension of the adapter's queue). It is configurable among the values of 16, 32, 64, 128 and 256, with a default of 256.
Because of the configurable size of the adapter's hardware queue, the driver does not support a software queue.
The device driver supports a user-configurable receive queue. This is the queue the adapter uses (not an extension of the adapter's queue). It is configurable among the values of 16, 32, 64, 128 and 256, with a default of 256.
The device driver supports a user-configurable receive buffer pool size. The pool size is the number of preallocated mbufs for receiving packets. The minimum pool size is the receive queue size and the maximum is 2048 buffers. The default value is 384.
The device driver supports speeds of 10 (10 Mbps, half-duplex), 20 (10 Mbps, full-duplex), 100 (100 Mbps, half-duplex), 200 (100 Mbps, full-duplex), and auto-negotiate on twisted pair. On the AUI port, the device driver supports speeds of 10 (10 Mbps, half-duplex) and 20 (10 Mbps, full-duplex). The Bayonet Neill-Concelman (BNC) port supports only 10 (10 Mbps, half-duplex). This attribute is user-configurable, with a default of auto-negotiate on twisted pair.
The device driver supports a configuration option to toggle use of an alternate network address. The values are yes and no, with a default of no. When this value is set to yes, the alt_addr parameter defines the address.
For the network address, the device driver accepts the adapter's hardware address or a configured alternate network address. When the use_alt_addr configuration option is set to yes, this alternate address is used. Any valid individual address can be used, but a multicast address cannot be defined as a network address.
The inter-packet gap (IPG) bit rate setting controls the aggressiveness of the adapter on the network. A smaller number will increase the aggressiveness of the adapter, while a larger number will decrease the aggressiveness (and increase the fairness) of the adapter. If the adapter statistics show a large number of collisions and deferrals, increase this number. Valid values range from 96 to 252, in increments of 4. The default value of 96 results in IPG of 9.6 microseconds for 10 Mb and 0.96 microseconds for 100 Mb media speeds. Each unit of bit rate introduces an IPG of 100 ns at 10 Mb and 10 ns at 100 Mb media speed.
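The relationship between the ip_gap setting and the resulting gap on the wire is simple arithmetic: each unit is one bit time, so the gap in microseconds is the setting divided by the media speed in Mbps. The following sketch is illustrative only; the function name is hypothetical, and the range check reflects the valid values stated above.

```python
def ipg_microseconds(ip_gap, media_mbps):
    """Inter-packet gap duration for a given ip_gap setting (in bit times)."""
    if not (96 <= ip_gap <= 252 and ip_gap % 4 == 0):
        raise ValueError("ip_gap must be 96 to 252 in increments of 4")
    # One bit time lasts 1/media_mbps microseconds.
    return ip_gap / media_mbps


if __name__ == "__main__":
    for setting in (96, 128, 252):
        print(setting, ipg_microseconds(setting, 10), ipg_microseconds(setting, 100))
```

With the default of 96, this gives 9.6 microseconds at 10 Mbps and 0.96 microseconds at 100 Mbps, as stated above.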
To change any of the parameter values, do the following:
# ifconfig en0 detach
where en0 represents the interface name.
# ifconfig en0 hostname up
An alternative method to change these parameter values is to run the following command:
# chdev -l [ifname] -a [attribute-name]=newvalue
For example, to change the above tx_que_size on ent0 to 128, use the following sequence of commands. Note that this driver supports only five different sizes (16, 32, 64, 128, and 256), so it is better to use the SMIT command to see these values.
# ifconfig en0 detach
# chdev -l ent0 -a tx_que_size=128
# ifconfig en0 hostname up
The maximum packet size that TCP sends can have a major impact on bandwidth, because it is more efficient to send the largest possible packet size on the network. TCP controls this maximum size, known as the Maximum Segment Size (MSS), for each TCP connection. For direct-attached networks, TCP computes the MSS by taking the MTU size of the network interface and subtracting the protocol headers to arrive at the size of the data in the TCP packet. For example, Ethernet with an MTU of 1500 results in an MSS of 1460 after subtracting 20 bytes for the IPv4 header and 20 bytes for the TCP header.
The TCP protocol includes a mechanism for both ends of a connection to advertise the MSS to be used over the connection when the connection is created. Each end uses the OPTIONS field in the TCP header to advertise a proposed MSS. The MSS that is chosen is the smaller of the values provided by the two ends. If one endpoint does not provide its MSS, then 536 bytes is assumed, which is bad for performance.
The problem is that each TCP endpoint knows only the MTU of the network it is attached to. It does not know the MTU sizes of other networks that might lie between the two endpoints. So, TCP knows the correct MSS only if both endpoints are on the same network. Therefore, TCP handles the advertising of MSS differently depending on the network configuration, if it wants to avoid sending packets that might require IP fragmentation to go over smaller MTU networks.
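The negotiation rule described above (each end advertises an MSS, the smaller value is chosen, and 536 bytes is assumed for an end that advertises nothing) can be sketched as follows. The function name is hypothetical and the sketch models only the selection rule, not the TCP option encoding.

```python
DEFAULT_MSS = 536  # assumed when an endpoint does not advertise an MSS


def negotiated_mss(local_mss=None, peer_mss=None):
    """MSS used on a connection: the smaller of the two advertised values."""
    if local_mss is None:
        local_mss = DEFAULT_MSS
    if peer_mss is None:
        peer_mss = DEFAULT_MSS
    return min(local_mss, peer_mss)


if __name__ == "__main__":
    print(negotiated_mss(1460, 1460))  # both ends on Ethernet
    print(negotiated_mss(8960, 1460))  # jumbo-frame end limited by its peer
    print(negotiated_mss(1460, None))  # peer did not advertise an MSS
```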
The value of MSS advertised by the TCP software during connection setup depends on whether the other end is a local system on the same physical network (that is, the systems have the same network number) or whether it is on a different (remote) network.
If the other end of the connection is on the same IP network, the MSS advertised by TCP is based on the MTU of the local network interface, as follows:
TCP MSS = MTU - TCP header size - IP header size
The TCP header size is 20 bytes, the IPv4 header size is 20 bytes, and the IPv6 header size is 40 bytes.
Because this is the largest possible MSS that can be accommodated without IP fragmentation, this value is inherently optimal, so no MSS-tuning is required for local networks.
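As a worked instance of the formula above, the following sketch computes the local-network MSS for a few MTU sizes. The function name is hypothetical, and the header sizes assume no IP or TCP options.

```python
TCP_HEADER = 20
IPV4_HEADER = 20
IPV6_HEADER = 40


def local_mss(mtu, ipv6=False):
    """TCP MSS = MTU - TCP header size - IP header size."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - TCP_HEADER - ip_header


if __name__ == "__main__":
    print(local_mss(1500))             # standard Ethernet over IPv4
    print(local_mss(1500, ipv6=True))  # standard Ethernet over IPv6
    print(local_mss(9000))             # Gigabit Ethernet in jumbo frame mode
```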
When the other end of the connection is on a remote network, the operating system's TCP defaults to advertising an MSS that is determined with the method below. The method varies if TCP Path MTU discovery is enabled or not. If Path MTU discovery is not enabled, where tcp_pmtu_discover=0, TCP determines what MSS to use in the following order:
The TCP path MTU discovery protocol option is enabled by default in AIX. This option is controlled by the tcp_pmtu_discover=1 network option. This option allows the protocol stack to determine the minimum MTU size on any network that is currently in the path between two hosts.
In AIX, this option is implemented using ICMP echo messages sent from the source to the destination host for a specific route. The initial echo message for a route has a size equal to the MTU size of the sending interface, and has the Don't Fragment (DF) bit set in the IP header. If this packet reaches a network router that has an MTU smaller than the size of the echo message, an error packet is sent back indicating that the message cannot be forwarded because it cannot be fragmented.
If the router sending the error packet complies with RFC 1191, the network's MTU is contained in the ICMP error packet, and the source host will try again with this smaller sized echo message. Otherwise the sending source has to make a guess at a smaller MTU size for the next ICMP echo message. This is done from a table of values within the AIX TCP/IP kernel extension. When a valid echo response is finally received from the destination host by the source host, the MSS size is saved in a cloned route in the routing table for this route. TCP uses this value on the next TCP connection on this route. The first TCP connection runs with the default tcp_mssdflt value (normally 512 bytes, or the smaller of the two advertised values).
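The probing loop described above can be modeled with a short simulation. This is a sketch under stated assumptions, not the AIX implementation: the function name and the hop_mtus input are hypothetical, and because the contents of the AIX kernel table are not given here, the plateau values listed in RFC 1191 are used as a stand-in for the fallback guesses.

```python
# Plateau values from RFC 1191, used as an assumed stand-in for the AIX table.
PLATEAUS = [65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68]


def discover_path_mtu(first_hop_mtu, hop_mtus, routers_report_mtu=True):
    """Simulate DF-set probes until one fits every network on the path.

    hop_mtus lists the MTU of each network along the path (assumed >= 68).
    Returns (path_mtu, probes_sent).
    """
    size = first_hop_mtu
    probes = 0
    while True:
        probes += 1
        # First network too small to carry the probe without fragmentation.
        blocker = next((m for m in hop_mtus if m < size), None)
        if blocker is None:
            return size, probes          # echo response received
        if routers_report_mtu:
            size = blocker               # error packet carries the MTU
        else:
            smaller = [p for p in PLATEAUS if p < size]
            if not smaller:
                raise ValueError("no smaller MTU left to try")
            size = max(smaller)          # guess the next smaller table value


if __name__ == "__main__":
    print(discover_path_mtu(4352, [4352, 1500, 1006]))
    print(discover_path_mtu(4352, [4352, 1500, 1006], routers_report_mtu=False))
```

In this model, guessing from the table can settle on a value slightly below the true path MTU (for example, 1492 on a path whose smallest MTU is 1500), and it takes more probes than when routers report the MTU directly.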
Once the first TCP connection is open and a cloned route is created, subsequent TCP connections pick up the PMTU value from the cloned route entry. This can be seen under the PMTU column in the netstat -r command output. Also, for these connections, TCP will set the Don't Fragment (DF) bit in the IP header so it will be informed of any changes in the network topology. The Refs column of the netstat -r command report shows the use count, which is the number of TCP connections using this cloned route to this remote host. The cloned route remains in the routing table until this Refs count goes to zero. Then, after one minute, the cloned route will be deleted if the route_expire network option is enabled (set to "1", which is the AIX default). This purging of the cloned routes keeps the route table from getting too large.
A typical side effect of PMTU discovery being enabled is that the routing table will be larger, because a route entry is maintained to each remote host. The no command option route_expire should be set to a non-zero value, in order to have any unused cached route entry removed from the table, after route_expire time of inactivity. The default is route_expire=1, which purges expired routes from the routing table. Once a cloned route is deleted, any new TCP connection requests to that host start the path MTU discovery and route cloning process over again. This can be avoided by disabling the route_expire option.
To avoid the problem of the first TCP connection ending up with a MSS of 512 bytes, the tcp_mssdflt option should be set to 1460 bytes (or the size of the smallest MTU size in your network). Most networks use Ethernet, so a value of 1460 is a good choice. The worst case, if a smaller network were to be encountered, is that IP would have to fragment the TCP packets before crossing that network. While fragmentation of TCP packets is not desirable, it will work.
You can use the netstat -ao command to show all the TCP connections. For each socket in the ESTABLISHED state, the MSS field shows the MSS value in use for that connection.
The following is an example of the netstat -r command:
Routing tables
Destination       Gateway          Flags  Refs      Use  If   PMTU Exp  Groups

Route tree for Protocol Family 2 (Internet):
default           res101141        UGc       0        0  en1     -   -
ausdns01.srv.ibm  res101141        UGHW      1      225  en1  1500   -
10.1.14.0         server1          UHSb      0        0  en1     -   -  =>
10.1.14/24        server1          U         6     2228  en1     -   -
server1           loopback         UGHS      6      111  lo0     -   -
10.1.14.255       server1          UHSb      0        0  en1     -   -
127/8             loopback         U         7       17  lo0     -   -
192.1.0/24        en1host2         UGc       0        0  en0     -   -
en0host1          en1host2         UGHW      2      109  en0  1500   -
192.1.1.0         en1host1         UHSb      0        0  en0     -   -  =>
192.1.1/24        en1host1         U         2        2  en0     -   -
en1host1          loopback         UGHS      0        2  lo0     -   -
192.1.1.255       en1host1         UHSb      0        0  en0     -   -
The above routing table shows a cloned route entry for en0host1, with a Refs count of 2 and a PMTU size of 1500. Thus, there are two TCP connections to this remote host and the path discovery has set the PMTU size to 1500 bytes so TCP uses a MSS of 1460. The route is cloned from the 192.1.0/24 entry just above it. The c in that entry shows it is cloneable.
The default MSS of 512 can be overridden by specifying a static route to a specific remote network. Use the -mtu option of the route command to specify the MTU to that network. In this case, you would specify the actual minimum MTU of the route, rather than calculating an MSS value. For example, the following command sets the default MTU size to 1500 for a route to network 192.3.3, with en0host2 as the gateway to that network:
# route add -net 192.3.3 en0host2 -mtu 1500
1500 net 192.3.3: gateway en0host2
The netstat -r command displays the route table and shows that the PMTU size is 1500 bytes. TCP will compute the MSS from that MTU size. The following is an example of the netstat -r command:
# netstat -r
Routing tables
Destination       Gateway          Flags  Refs      Use  If   PMTU Exp  Groups

Route tree for Protocol Family 2 (Internet):
default           res101141        UGc       0        0  en4     -   -
ausdns01.srv.ibm  res101141        UGHW      8       40  en4  1500   -
10.1.14.0         server1          UHSb      0        0  en4     -   -  =>
10.1.14/24        server1          U         5     4043  en4     -   -
server1           loopback         UGHS      0      125  lo0     -   -
10.1.14.255       server1          UHSb      0        0  en4     -   -
127/8             loopback         U         2  1451769  lo0     -   -
192.1.0.0         en0host1         UHSb      0        0  en0     -   -  =>
192.1.0/24        en0host1         U         4       13  en0     -   -
en0host1          loopback         UGHS      0        2  lo0     -   -
192.1.0.255       en0host1         UHSb      0        0  en0     -   -
192.1.1/24        en0host2         UGc       0        0  en0     -   -
en1host1          en0host2         UGHW      1   143474  en0  1500   -
192.3.3/24        en0host2         UGc       0        0  en0  1500   -
192.6.0/24        en0host2         UGc       0        0  en0     -   -

Route tree for Protocol Family 24 (Internet v6):
loopbackv6        loopbackv6       UH        0        0  lo0 16896   -
In a small, stable environment, this method allows precise control of MSS on a network-by-network basis. The disadvantages of this approach are as follows:
This parameter is used to set the maximum packet size for communication with remote networks. The global no command option tcp_mssdflt applies to all networks. However, for network interfaces that support the ISNO options, you can set the tcp_mssdflt option on each of those interfaces. This value overrides the global no command value for routes using the network.
The tcp_mssdflt option is the TCP MSS size, which represents the TCP data size. To compute this MSS size, take the desired network MTU size and subtract 40 bytes from it (20 for the IP header and 20 for the TCP header). There is no need to adjust for other protocol options, because TCP handles this adjustment if other options, like the rfc1323 option, are used.
In an environment with a larger-than-default MTU, this method has the advantage in that the MSS does not need to be set on a per-network basis. The disadvantages are as follows:
You can use the subnetsarelocal option of the no command to control when TCP considers a remote endpoint to be local (on the same network) or remote. Several physical networks can be made to share the same network number by subnetting. The subnetsarelocal option specifies, on a system-wide basis, whether subnets are to be considered local or remote networks. With the no -o subnetsarelocal=1 command, which is the default, Host A on subnet 1 considers Host B on subnet 2 to be on the same physical network.
The consequence is that when Host A and Host B establish a connection, they negotiate the MSS assuming they are on the same network. Each host advertises an MSS based on the MTU of its network interface, usually leading to an optimal MSS being chosen.
The advantages to this approach are as follows:
The disadvantages to this approach are as follows:

At the IP layer, the only tunable parameter is ipqmaxlen, which controls the length of the IP input queue discussed in IP Layer. In general, interfaces do not do queuing. Packets can arrive very quickly and overrun the IP input queue. You can use the netstat -s or netstat -p ip command to view an overflow counter (ipintrq overflows).
If the number returned is greater than 0, overflows have occurred. Use the no command to set the maximum length of this queue. For example:
# no -o ipqmaxlen=100
This example allows 100 packets to be queued up. The exact value to use is determined by the maximum burst rate received. If this cannot be determined, using the number of overflows can help determine what the increase should be. No additional memory is used by increasing the queue length. However, an increase may result in more time spent in the off-level interrupt handler, because IP will have more packets to process on its input queue. This could adversely affect processes needing CPU time. The tradeoff is reduced packet-dropping versus CPU availability for other processing. It is best to increase ipqmaxlen by moderate increments if the tradeoff is a concern in your environment.
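The check described above (read the ipintrq overflow counter, then raise ipqmaxlen moderately if it is nonzero) could be scripted along these lines. This is a hedged sketch: the function names are hypothetical, the sample text only approximates the layout of the netstat -p ip output, and the step size of 50 is an arbitrary illustration of a "moderate increment", not a recommended value.

```python
import re


def ipintrq_overflows(netstat_output):
    """Return the ipintrq overflow count, or None if the counter is absent."""
    match = re.search(r"(\d+)\s+ipintrq overflows", netstat_output)
    return int(match.group(1)) if match else None


def suggest_ipqmaxlen(current, overflows, step=50):
    """Raise the queue length by a moderate step only if overflows occurred."""
    return current + step if overflows else current


if __name__ == "__main__":
    # Assumed sample of the relevant part of the output (layout is illustrative).
    sample = "ip:\n    1234 total packets received\n    7 ipintrq overflows\n"
    count = ipintrq_overflows(sample)
    print(count, suggest_ipqmaxlen(100, count))
```

The suggested value would then be applied with the no -o ipqmaxlen command shown above, and the counter rechecked after the next peak load.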