Diffstat (limited to 'Documentation/networking')
-rw-r--r--  Documentation/networking/dccp.txt                       3
-rw-r--r--  Documentation/networking/ip-sysctl.txt                148
-rw-r--r--  Documentation/networking/ixgbe.txt                    199
-rw-r--r--  Documentation/networking/rds.txt                      356
-rw-r--r--  Documentation/networking/timestamping.txt             180
-rw-r--r--  Documentation/networking/timestamping/.gitignore        1
-rw-r--r--  Documentation/networking/timestamping/Makefile          6
-rw-r--r--  Documentation/networking/timestamping/timestamping.c  533
-rw-r--r--  Documentation/networking/vxge.txt                     100
9 files changed, 1458 insertions, 68 deletions
diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt
index 7a3bb1abb83..b132e4a3cf0 100644
--- a/Documentation/networking/dccp.txt
+++ b/Documentation/networking/dccp.txt
@@ -141,7 +141,8 @@ rx_ccid = 2
Default CCID for the receiver-sender half-connection; see tx_ccid.
seq_window = 100
- The initial sequence window (sec. 7.5.2).
+ The initial sequence window (sec. 7.5.2) of the sender. This influences
+ the local ackno validity and the remote seqno validity windows (7.5.1).
tx_qlen = 5
The size of the transmit buffer in packets. A value of 0 corresponds
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c7712787933..ec5de02f543 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -2,7 +2,7 @@
ip_forward - BOOLEAN
0 - disabled (default)
- not 0 - enabled
+ not 0 - enabled
Forward Packets between interfaces.
@@ -36,49 +36,49 @@ rt_cache_rebuild_count - INTEGER
IP Fragmentation:
ipfrag_high_thresh - INTEGER
- Maximum memory used to reassemble IP fragments. When
+ Maximum memory used to reassemble IP fragments. When
ipfrag_high_thresh bytes of memory is allocated for this purpose,
the fragment handler will toss packets until ipfrag_low_thresh
is reached.
-
+
ipfrag_low_thresh - INTEGER
- See ipfrag_high_thresh
+ See ipfrag_high_thresh
ipfrag_time - INTEGER
- Time in seconds to keep an IP fragment in memory.
+ Time in seconds to keep an IP fragment in memory.
ipfrag_secret_interval - INTEGER
- Regeneration interval (in seconds) of the hash secret (or lifetime
+ Regeneration interval (in seconds) of the hash secret (or lifetime
for the hash secret) for IP fragments.
Default: 600
ipfrag_max_dist - INTEGER
- ipfrag_max_dist is a non-negative integer value which defines the
- maximum "disorder" which is allowed among fragments which share a
- common IP source address. Note that reordering of packets is
- not unusual, but if a large number of fragments arrive from a source
- IP address while a particular fragment queue remains incomplete, it
- probably indicates that one or more fragments belonging to that queue
- have been lost. When ipfrag_max_dist is positive, an additional check
- is done on fragments before they are added to a reassembly queue - if
- ipfrag_max_dist (or more) fragments have arrived from a particular IP
- address between additions to any IP fragment queue using that source
- address, it's presumed that one or more fragments in the queue are
- lost. The existing fragment queue will be dropped, and a new one
+ ipfrag_max_dist is a non-negative integer value which defines the
+ maximum "disorder" which is allowed among fragments which share a
+ common IP source address. Note that reordering of packets is
+ not unusual, but if a large number of fragments arrive from a source
+ IP address while a particular fragment queue remains incomplete, it
+ probably indicates that one or more fragments belonging to that queue
+ have been lost. When ipfrag_max_dist is positive, an additional check
+ is done on fragments before they are added to a reassembly queue - if
+ ipfrag_max_dist (or more) fragments have arrived from a particular IP
+ address between additions to any IP fragment queue using that source
+ address, it's presumed that one or more fragments in the queue are
+ lost. The existing fragment queue will be dropped, and a new one
started. An ipfrag_max_dist value of zero disables this check.
Using a very small value, e.g. 1 or 2, for ipfrag_max_dist can
result in unnecessarily dropping fragment queues when normal
- reordering of packets occurs, which could lead to poor application
- performance. Using a very large value, e.g. 50000, increases the
- likelihood of incorrectly reassembling IP fragments that originate
+ reordering of packets occurs, which could lead to poor application
+ performance. Using a very large value, e.g. 50000, increases the
+ likelihood of incorrectly reassembling IP fragments that originate
from different IP datagrams, which could result in data corruption.
Default: 64
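Note on ipfrag_max_dist: the check described in this entry can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not the kernel's implementation. The names peer_state, frag_queue, frag_queue_init and frag_queue_too_far are hypothetical; the sketch assumes one fragment counter per source address and one remembered counter value per reassembly queue.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-source-address state: counts every fragment seen. */
struct peer_state {
    uint32_t frags_seen;
};

/* Hypothetical per-queue state: counter value at the last addition. */
struct frag_queue {
    uint32_t last_seen;
};

/* A new queue starts from the peer's current counter value. */
static void frag_queue_init(struct frag_queue *q, const struct peer_state *peer)
{
    q->last_seen = peer->frags_seen;
}

/*
 * Call once per arriving fragment, before adding it to queue q.
 * Returns true if max_dist (or more) other fragments from the same
 * source arrived since the last addition to q, i.e. the queue should
 * be dropped and restarted. max_dist == 0 disables the check.
 */
static bool frag_queue_too_far(struct peer_state *peer,
                               struct frag_queue *q,
                               uint32_t max_dist)
{
    uint32_t before = q->last_seen;
    uint32_t now = ++peer->frags_seen;   /* count this fragment */

    q->last_seen = now;
    if (max_dist == 0)
        return false;
    /* Unsigned subtraction stays correct across counter wrap-around. */
    return (now - before) > max_dist;
}
```

With max_dist set to 2, a queue survives as long as fewer than two fragments from the same source were added to other queues between two of its own additions.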
INET peer storage:
inet_peer_threshold - INTEGER
- The approximate size of the storage. Starting from this threshold
+ The approximate size of the storage. Starting from this threshold
entries will be thrown aggressively. This threshold also determines
entries' time-to-live and time intervals between garbage collection
passes. More entries, less time-to-live, less GC interval.
@@ -105,7 +105,7 @@ inet_peer_gc_maxtime - INTEGER
in effect under low (or absent) memory pressure on the pool.
Measured in seconds.
-TCP variables:
+TCP variables:
somaxconn - INTEGER
Limit of socket listen() backlog, known in userspace as SOMAXCONN.
@@ -310,7 +310,7 @@ tcp_orphan_retries - INTEGER
tcp_reordering - INTEGER
Maximal reordering of packets in a TCP stream.
- Default: 3
+ Default: 3
tcp_retrans_collapse - BOOLEAN
Bug-to-bug compatibility with some broken printers.
@@ -521,7 +521,7 @@ IP Variables:
ip_local_port_range - 2 INTEGERS
Defines the local port range that is used by TCP and UDP to
- choose the local port. The first number is the first, the
+ choose the local port. The first number is the first, the
second the last local port number. Default value depends on
amount of memory available on the system:
> 128Mb 32768-61000
@@ -594,12 +594,12 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
If zero, icmp error messages are sent with the primary address of
the exiting interface.
-
+
If non-zero, the message will be sent with the primary address of
the interface that received the packet that caused the icmp error.
This is the behaviour many network administrators will expect from
a router. And it can make debugging complicated network layouts
- much easier.
+ much easier.
Note that if no primary address exists for the interface selected,
then the primary address of the first non-loopback interface that
@@ -611,7 +611,7 @@ igmp_max_memberships - INTEGER
Change the maximum number of multicast groups we can subscribe to.
Default: 20
-conf/interface/* changes special settings per interface (where "interface" is
+conf/interface/* changes special settings per interface (where "interface" is
the name of your network interface)
conf/all/* is special, changes the settings for all interfaces
@@ -625,11 +625,11 @@ log_martians - BOOLEAN
accept_redirects - BOOLEAN
Accept ICMP redirect messages.
accept_redirects for the interface will be enabled if:
- - both conf/{all,interface}/accept_redirects are TRUE in the case forwarding
- for the interface is enabled
+ - both conf/{all,interface}/accept_redirects are TRUE in the case
+ forwarding for the interface is enabled
or
- - at least one of conf/{all,interface}/accept_redirects is TRUE in the case
- forwarding for the interface is disabled
+ - at least one of conf/{all,interface}/accept_redirects is TRUE in the
+ case forwarding for the interface is disabled
accept_redirects for the interface will be disabled otherwise
default TRUE (host)
FALSE (router)
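Note on accept_redirects: the enable rule stated in this entry reduces to a small boolean function. The following is a sketch only; the function name and parameters are hypothetical, with all_val and if_val standing for conf/all/accept_redirects and conf/interface/accept_redirects.

```c
#include <stdbool.h>

/*
 * Effective accept_redirects for one interface, following the rule in
 * the text. The name and signature are illustrative, not a kernel API.
 */
static bool effective_accept_redirects(bool all_val, bool if_val,
                                       bool forwarding)
{
    if (forwarding)
        return all_val && if_val;   /* forwarding enabled: both TRUE */
    return all_val || if_val;       /* forwarding disabled: one TRUE */
}
```

So a forwarding interface needs both settings enabled, while a non-forwarding interface accepts redirects if either setting is enabled.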
@@ -640,8 +640,8 @@ forwarding - BOOLEAN
mc_forwarding - BOOLEAN
Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE
and a multicast routing daemon is required.
- conf/all/mc_forwarding must also be set to TRUE to enable multicast routing
- for the interface
+ conf/all/mc_forwarding must also be set to TRUE to enable multicast
+ routing for the interface
medium_id - INTEGER
Integer value used to differentiate the devices by the medium they
@@ -649,7 +649,7 @@ medium_id - INTEGER
the broadcast packets are received only on one of them.
The default value 0 means that the device is the only interface
to its medium, value of -1 means that medium is not known.
-
+
Currently, it is used to change the proxy_arp behavior:
the proxy_arp feature is enabled for packets forwarded between
two devices attached to different media.
@@ -699,16 +699,22 @@ accept_source_route - BOOLEAN
default TRUE (router)
FALSE (host)
-rp_filter - BOOLEAN
- 1 - do source validation by reversed path, as specified in RFC1812
- Recommended option for single homed hosts and stub network
- routers. Could cause troubles for complicated (not loop free)
- networks running a slow unreliable protocol (sort of RIP),
- or using static routes.
-
+rp_filter - INTEGER
0 - No source validation.
-
- conf/all/rp_filter must also be set to TRUE to do source validation
+ 1 - Strict mode as defined in RFC3704 Strict Reverse Path
+ Each incoming packet is tested against the FIB and if the interface
+ is not the best reverse path the packet check will fail.
+ By default failed packets are discarded.
+ 2 - Loose mode as defined in RFC3704 Loose Reverse Path
+ Each incoming packet's source address is also tested against the FIB
+ and if the source address is not reachable via any interface
+ the packet check will fail.
+
+ Current recommended practice in RFC3704 is to enable strict mode
+ to prevent IP spoofing from DDoS attacks. If using asymmetric routing
+ or other complicated routing, then loose mode is recommended.
+
+ conf/all/rp_filter must also be set to non-zero to do source validation
on the interface
Default value is 0. Note that some distributions enable it
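Note on rp_filter: the three modes described in the new text can be summarised as a decision function. This is a sketch under assumptions; the enum, the function name and the two boolean inputs are hypothetical stand-ins for the result of the FIB lookup, which is not modelled here.

```c
#include <stdbool.h>

enum rp_filter_mode {
    RP_FILTER_OFF    = 0,   /* no source validation */
    RP_FILTER_STRICT = 1,   /* RFC3704 Strict Reverse Path */
    RP_FILTER_LOOSE  = 2    /* RFC3704 Loose Reverse Path */
};

/*
 * best_path_is_ingress: the best reverse path to the source address
 *                       uses the interface the packet arrived on.
 * reachable_any:        the source address is reachable via at least
 *                       one interface.
 * Returns true if the packet passes the check.
 */
static bool rp_filter_accepts(enum rp_filter_mode mode,
                              bool best_path_is_ingress,
                              bool reachable_any)
{
    switch (mode) {
    case RP_FILTER_STRICT:
        return best_path_is_ingress;
    case RP_FILTER_LOOSE:
        return reachable_any;
    case RP_FILTER_OFF:
    default:
        return true;
    }
}
```

The asymmetric-routing case mentioned in the text corresponds to best_path_is_ingress being false while reachable_any is true: strict mode fails the packet, loose mode passes it.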
@@ -782,6 +788,12 @@ arp_ignore - INTEGER
The max value from conf/{all,interface}/arp_ignore is used
when ARP request is received on the {interface}
+arp_notify - BOOLEAN
+ Define mode for notification of address and device changes.
+ 0 - (default): do nothing
+ 1 - Generate gratuitous arp replies when device is brought up
+ or hardware address changes.
+
arp_accept - BOOLEAN
Define behavior when gratuitous arp replies are received:
0 - drop gratuitous arp frames
@@ -823,7 +835,7 @@ apply to IPv6 [XXX?].
bindv6only - BOOLEAN
Default value for IPV6_V6ONLY socket option,
- which restricts use of the IPv6 socket to IPv6 communication
+ which restricts use of the IPv6 socket to IPv6 communication
only.
TRUE: disable IPv4-mapped address feature
FALSE: enable IPv4-mapped address feature
@@ -833,19 +845,19 @@ bindv6only - BOOLEAN
IPv6 Fragmentation:
ip6frag_high_thresh - INTEGER
- Maximum memory used to reassemble IPv6 fragments. When
+ Maximum memory used to reassemble IPv6 fragments. When
ip6frag_high_thresh bytes of memory is allocated for this purpose,
the fragment handler will toss packets until ip6frag_low_thresh
is reached.
-
+
ip6frag_low_thresh - INTEGER
- See ip6frag_high_thresh
+ See ip6frag_high_thresh
ip6frag_time - INTEGER
Time in seconds to keep an IPv6 fragment in memory.
ip6frag_secret_interval - INTEGER
- Regeneration interval (in seconds) of the hash secret (or lifetime
+ Regeneration interval (in seconds) of the hash secret (or lifetime
for the hash secret) for IPv6 fragments.
Default: 600
@@ -854,17 +866,17 @@ conf/default/*:
conf/all/*:
- Change all the interface-specific settings.
+ Change all the interface-specific settings.
[XXX: Other special features than forwarding?]
conf/all/forwarding - BOOLEAN
- Enable global IPv6 forwarding between all interfaces.
+ Enable global IPv6 forwarding between all interfaces.
- IPv4 and IPv6 work differently here; e.g. netfilter must be used
+ IPv4 and IPv6 work differently here; e.g. netfilter must be used
to control which interfaces may forward packets and which not.
- This also sets all interfaces' Host/Router setting
+ This also sets all interfaces' Host/Router setting
'forwarding' to the specified value. See below for details.
This is referred to as global forwarding.
@@ -875,12 +887,12 @@ proxy_ndp - BOOLEAN
conf/interface/*:
Change special settings per interface.
- The functional behaviour for certain settings is different
+ The functional behaviour for certain settings is different
depending on whether local forwarding is enabled or not.
accept_ra - BOOLEAN
Accept Router Advertisements; autoconfigure using them.
-
+
Functional default: enabled if local forwarding is disabled.
disabled if local forwarding is enabled.
@@ -926,7 +938,7 @@ accept_source_route - INTEGER
Default: 0
autoconf - BOOLEAN
- Autoconfigure addresses using Prefix Information in Router
+ Autoconfigure addresses using Prefix Information in Router
Advertisements.
Functional default: enabled if accept_ra_pinfo is enabled.
@@ -935,11 +947,11 @@ autoconf - BOOLEAN
dad_transmits - INTEGER
The amount of Duplicate Address Detection probes to send.
Default: 1
-
+
forwarding - BOOLEAN
- Configure interface-specific Host/Router behaviour.
+ Configure interface-specific Host/Router behaviour.
- Note: It is recommended to have the same setting on all
+ Note: It is recommended to have the same setting on all
interfaces; mixed router/host scenarios are rather uncommon.
FALSE:
@@ -948,13 +960,13 @@ forwarding - BOOLEAN
1. IsRouter flag is not set in Neighbour Advertisements.
2. Router Solicitations are being sent when necessary.
- 3. If accept_ra is TRUE (default), accept Router
+ 3. If accept_ra is TRUE (default), accept Router
Advertisements (and do autoconfiguration).
4. If accept_redirects is TRUE (default), accept Redirects.
TRUE:
- If local forwarding is enabled, Router behaviour is assumed.
+ If local forwarding is enabled, Router behaviour is assumed.
This means exactly the reverse from the above:
1. IsRouter flag is set in Neighbour Advertisements.
@@ -989,7 +1001,7 @@ router_solicitation_interval - INTEGER
Default: 4
router_solicitations - INTEGER
- Number of Router Solicitations to send until assuming no
+ Number of Router Solicitations to send until assuming no
routers are present.
Default: 3
@@ -1013,11 +1025,11 @@ temp_prefered_lft - INTEGER
max_desync_factor - INTEGER
Maximum value for DESYNC_FACTOR, which is a random value
- that ensures that clients don't synchronize with each
+ that ensures that clients don't synchronize with each
other and generate new addresses at exactly the same time.
value is in seconds.
Default: 600
-
+
regen_max_retry - INTEGER
Number of attempts before giving up attempting to generate
valid temporary addresses.
@@ -1025,13 +1037,15 @@ regen_max_retry - INTEGER
max_addresses - INTEGER
Number of maximum addresses per interface. 0 disables limitation.
- It is recommended not set too large value (or 0) because it would
- be too easy way to crash kernel to allow to create too much of
+ It is recommended not set too large value (or 0) because it would
+ be too easy way to crash kernel to allow to create too much of
autoconfigured addresses.
Default: 16
disable_ipv6 - BOOLEAN
- Disable IPv6 operation.
+ Disable IPv6 operation. If accept_dad is set to 2, this value
+ will be dynamically set to TRUE if DAD fails for the link-local
+ address.
Default: FALSE (enable IPv6 operation)
accept_dad - INTEGER
diff --git a/Documentation/networking/ixgbe.txt b/Documentation/networking/ixgbe.txt
new file mode 100644
index 00000000000..eeb68685c78
--- /dev/null
+++ b/Documentation/networking/ixgbe.txt
@@ -0,0 +1,199 @@
+Linux Base Driver for 10 Gigabit PCI Express Intel(R) Network Connection
+========================================================================
+
+March 10, 2009
+
+
+Contents
+========
+
+- In This Release
+- Identifying Your Adapter
+- Building and Installation
+- Additional Configurations
+- Support
+
+
+
+In This Release
+===============
+
+This file describes the ixgbe Linux Base Driver for the 10 Gigabit PCI
+Express Intel(R) Network Connection. This driver includes support for
+Itanium(R)2-based systems.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your 10 Gigabit adapter. All hardware requirements listed apply
+to use with Linux.
+
+The following features are available in this kernel:
+ - Native VLANs
+ - Channel Bonding (teaming)
+ - SNMP
+ - Generic Receive Offload
+ - Data Center Bridging
+
+Channel Bonding documentation can be found in the Linux kernel source:
+/Documentation/networking/bonding.txt
+
+Ethtool, lspci, and ifconfig can be used to display device and driver
+specific information.
+
+
+Identifying Your Adapter
+========================
+
+This driver supports devices based on the 82598 controller and the 82599
+controller.
+
+For specific information on identifying which adapter you have, please visit:
+
+ http://support.intel.com/support/network/sb/CS-008441.htm
+
+
+Building and Installation
+=========================
+
+Select m for "Intel(R) 10GbE PCI Express adapters support" located at:
+ Location:
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet (10000 Mbit) (NETDEV_10000 [=y])
+
+1. make modules && make modules_install
+
+2. Load the module:
+
+# modprobe ixgbe
+
+ The insmod command can be used if the full
+ path to the driver module is specified. For example:
+
+ insmod /lib/modules/<KERNEL VERSION>/kernel/drivers/net/ixgbe/ixgbe.ko
+
+ With 2.6 based kernels also make sure that older ixgbe drivers are
+ removed from the kernel, before loading the new module:
+
+ rmmod ixgbe; modprobe ixgbe
+
+3. Assign an IP address to the interface by entering the following, where
+ x is the interface number:
+
+ ifconfig ethx <IP_address>
+
+4. Verify that the interface works. Enter the following, where <IP_address>
+ is the IP address for another machine on the same subnet as the interface
+ that is being tested:
+
+ ping <IP_address>
+
+
+Additional Configurations
+=========================
+
+ Viewing Link Messages
+ ---------------------
+ Link messages will not be displayed to the console if the distribution is
+ restricting system messages. In order to see network driver link messages on
+ your console, set dmesg to eight by entering the following:
+
+ dmesg -n 8
+
+ NOTE: This setting is not saved across reboots.
+
+
+ Jumbo Frames
+ ------------
+ The driver supports Jumbo Frames for all adapters. Jumbo Frames support is
+ enabled by changing the MTU to a value larger than the default of 1500.
+ The maximum value for the MTU is 16110. Use the ifconfig command to
+ increase the MTU size. For example:
+
+ ifconfig ethx mtu 9000 up
+
+ The maximum MTU setting for Jumbo Frames is 16110. This value coincides
+ with the maximum Jumbo Frames size of 16128.
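Note on the Jumbo Frames limits: the relationship between the documented MTU and frame-size limits can be checked with a few lines of C. This is a sketch; the macro and function names are hypothetical. The 18-byte difference between 16110 and 16128 is assumed here to be a 14-byte Ethernet header plus a 4-byte FCS, which the text does not state.

```c
#include <stdbool.h>

#define DOC_DEFAULT_MTU   1500    /* default MTU given in the text */
#define DOC_MAX_MTU       16110   /* maximum MTU given in the text */
#define DOC_MAX_FRAME     16128   /* maximum jumbo frame size in the text */

/* Assumption: 14-byte Ethernet header + 4-byte frame check sequence. */
#define ASSUMED_ETH_OVERHEAD  (14 + 4)

/* Jumbo Frames are in use once the MTU exceeds the default. */
static bool mtu_is_jumbo(int mtu)
{
    return mtu > DOC_DEFAULT_MTU;
}

/* True if the MTU does not exceed the documented maximum. */
static bool mtu_within_documented_max(int mtu)
{
    return mtu > 0 && mtu <= DOC_MAX_MTU;
}

/* On-wire frame size implied by an MTU under the assumption above. */
static int mtu_to_frame_size(int mtu)
{
    return mtu + ASSUMED_ETH_OVERHEAD;
}
```

Under this assumption mtu_to_frame_size(DOC_MAX_MTU) equals DOC_MAX_FRAME, which is consistent with the two limits quoted in the text.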
+
+ Generic Receive Offload, aka GRO
+ --------------------------------
+ The driver supports the in-kernel software implementation of GRO. GRO has
+ shown that by coalescing Rx traffic into larger chunks of data, CPU
+ utilization can be significantly reduced when under large Rx load. GRO is an
+ evolution of the previously-used LRO interface. GRO is able to coalesce
+ other protocols besides TCP. It's also safe to use with configurations that
+ are problematic for LRO, namely bridging and iSCSI.
+
+ GRO is enabled by default in the driver. Future versions of ethtool will
+ support disabling and re-enabling GRO on the fly.
+
+
+ Data Center Bridging, aka DCB
+ -----------------------------
+
+ DCB is a configurable Quality of Service implementation in hardware.
+ It uses the VLAN priority tag (802.1p) to filter traffic. That means
+ that there are 8 different priorities that traffic can be filtered into.
+ It also enables priority flow control which can limit or eliminate the
+ number of dropped packets during network stress. Bandwidth can be
+ allocated to each of these priorities, which is enforced at the hardware
+ level.
+
+ To enable DCB support in ixgbe, you must enable the DCB netlink layer to
+ allow the userspace tools (see below) to communicate with the driver.
+ This can be found in the kernel configuration here:
+
+ -> Networking support
+ -> Networking options
+ -> Data Center Bridging support
+
+ Once this is selected, DCB support must be selected for ixgbe. This can
+ be found here:
+
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet (10000 Mbit) (NETDEV_10000 [=y])
+ -> Intel(R) 10GbE PCI Express adapters support
+ -> Data Center Bridging (DCB) Support
+
+ After these options are selected, you must rebuild your kernel and your
+ modules.
+
+ In order to use DCB, userspace tools must be downloaded and installed.
+ The dcbd tools can be found at:
+
+ http://e1000.sf.net
+
+
+ Ethtool
+ -------
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. Ethtool
+ version 3.0 or later is required for this functionality.
+
+ The latest release of ethtool can be found from
+ http://sourceforge.net/projects/gkernel.
+
+
+ NAPI
+ ----
+
+ NAPI (Rx polling mode) is supported in the ixgbe driver. NAPI is enabled
+ by default in the driver.
+
+ See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI.
+
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ http://support.intel.com
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://e1000.sourceforge.net
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
new file mode 100644
index 00000000000..c67077cbeb8
--- /dev/null
+++ b/Documentation/networking/rds.txt
@@ -0,0 +1,356 @@
+
+Overview
+========
+
+This readme tries to provide some background on the hows and whys of RDS,
+and will hopefully help you find your way around the code.
+
+In addition, please see this email about RDS origins:
+http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
+
+RDS Architecture
+================
+
+RDS provides reliable, ordered datagram delivery by using a single
+reliable connection between any two nodes in the cluster. This allows
+applications to use a single socket to talk to any other process in the
+cluster - so in a cluster with N processes you need N sockets, in contrast
+to N*N if you use a connection-oriented socket transport like TCP.
+
+RDS is not Infiniband-specific; it was designed to support different
+transports. The current implementation used to support RDS over TCP as well
+as IB. Work is in progress to support RDS over iWARP, and using DCE to
+guarantee no dropped packets on Ethernet, it may be possible to use RDS over
+UDP in the future.
+
+The high-level semantics of RDS from the application's point of view are
+
+ * Addressing
+ RDS uses IPv4 addresses and 16-bit port numbers to identify
+ the end point of a connection. All socket operations that involve
+ passing addresses between kernel and user space generally
+ use a struct sockaddr_in.
+
+ The fact that IPv4 addresses are used does not mean the underlying
+ transport has to be IP-based. In fact, RDS over IB uses a
+ reliable IB connection; the IP address is used exclusively to
+ locate the remote node's GID (by ARPing for the given IP).
+
+ The port space is entirely independent of UDP, TCP or any other
+ protocol.
+
+ * Socket interface
+ RDS sockets work *mostly* as you would expect from a BSD
+ socket. The next section will cover the details. At any rate,
+ all I/O is performed through the standard BSD socket API.
+ Some additions like zerocopy support are implemented through
+ control messages, while other extensions use the getsockopt/
+ setsockopt calls.
+
+ Sockets must be bound before you can send or receive data.
+ This is needed because binding also selects a transport and
+ attaches it to the socket. Once bound, the transport assignment
+ does not change. RDS will tolerate IPs moving around (e.g. in
+ an active-active HA scenario), but only as long as the address
+ doesn't move to a different transport.
+
+ * sysctls
+ RDS supports a number of sysctls in /proc/sys/net/rds
+
+
+Socket Interface
+================
+
+ AF_RDS, PF_RDS, SOL_RDS
+ These constants haven't been assigned yet, because RDS isn't in
+ mainline yet. Currently, the kernel module assigns some constant
+ and publishes it to user space through two sysctl files
+ /proc/sys/net/rds/pf_rds
+ /proc/sys/net/rds/sol_rds
+
+ fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ This creates a new, unbound RDS socket.
+
+ setsockopt(SOL_SOCKET): send and receive buffer size
+ RDS honors the send and receive buffer size socket options.
+ You are not allowed to queue more than SO_SNDBUF bytes to
+ a socket. A message is queued when sendmsg is called, and
+ it leaves the queue when the remote system acknowledges
+ its arrival.
+
+ The SO_RCVBUF option controls the maximum receive queue length.
+ This is a soft limit rather than a hard limit - RDS will
+ continue to accept and queue incoming messages, even if that
+ takes the queue length over the limit. However, it will also
+ mark the port as "congested" and send a congestion update to
+ the source node. The source node is supposed to throttle any
+ processes sending to this congested port.
+
+ bind(fd, &sockaddr_in, ...)
+ This binds the socket to a local IP address and port, and a
+ transport.
+
+ sendmsg(fd, ...)
+ Sends a message to the indicated recipient. The kernel will
+ transparently establish the underlying reliable connection
+ if it isn't up yet.
+
+ An attempt to send a message that exceeds SO_SNDBUF will
+ return EMSGSIZE.
+
+ An attempt to send a message that would take the total number
+ of queued bytes over the SO_SNDBUF threshold will return
+ EAGAIN.
+
+ An attempt to send a message to a destination that is marked
+ as "congested" will return ENOBUFS.
+
+ recvmsg(fd, ...)
+ Receives a message that was queued to this socket. The socket's
+ recv queue accounting is adjusted, and if the queue length
+ drops below SO_RCVBUF, the port is marked uncongested, and
+ a congestion update is sent to all peers.
+
+ Applications can ask the RDS kernel module to receive
+ notifications via control messages (for instance, there is a
+ notification when a congestion update arrives, or when an RDMA
+ operation completes). These notifications are received through
+ the msg.msg_control buffer of struct msghdr. The format of the
+ messages is described in manpages.
+
+ poll(fd)
+ RDS supports the poll interface to allow the application
+ to implement async I/O.
+
+ POLLIN handling is pretty straightforward. When there's an
+ incoming message queued to the socket, or a pending notification,
+ we signal POLLIN.
+
+ POLLOUT is a little harder. Since you can essentially send
+ to any destination, RDS will always signal POLLOUT as long as
+ there's room on the send queue (i.e. the number of bytes queued
+ is less than the sendbuf size).
+
+ However, the kernel will refuse to accept messages to
+ a destination marked congested - in this case you will loop
+ forever if you rely on poll to tell you what to do.
+ This isn't a trivial problem, but applications can deal with
+ this - by using congestion notifications, and by checking for
+ ENOBUFS errors returned by sendmsg.
+
+ setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
+ This allows the application to discard all messages queued to a
+ specific destination on this particular socket.
+
+ This allows the application to cancel outstanding messages if
+ it detects a timeout. For instance, if it tried to send a message,
+ and the remote host is unreachable, RDS will keep trying forever.
+ The application may decide it's not worth it, and cancel the
+ operation. In this case, it would use RDS_CANCEL_SENT_TO to
+ nuke any pending messages.
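Note on the socket interface: the calls described in this section can be tied together in a short C sketch. It is a minimal illustration under assumptions, not code from rds-tools. The names rds_send_action, rds_classify_send, read_sysctl_int and rds_socket_open are hypothetical; the sysctl path is the one given in the text, and the errno values are the ones the text lists for sendmsg.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* What a sender might do after sendmsg() returns (illustrative names). */
enum rds_send_action {
    RDS_SEND_OK,          /* message was queued */
    RDS_SEND_TOO_BIG,     /* EMSGSIZE: message exceeds the send buffer */
    RDS_SEND_QUEUE_FULL,  /* EAGAIN: wait for POLLOUT, then retry */
    RDS_SEND_CONGESTED,   /* ENOBUFS: wait for a congestion notification */
    RDS_SEND_OTHER        /* any other error */
};

/* rc is the sendmsg() return value, err the errno observed after it. */
static enum rds_send_action rds_classify_send(ssize_t rc, int err)
{
    if (rc >= 0)
        return RDS_SEND_OK;
    switch (err) {
    case EMSGSIZE: return RDS_SEND_TOO_BIG;
    case EAGAIN:   return RDS_SEND_QUEUE_FULL;
    case ENOBUFS:  return RDS_SEND_CONGESTED;
    default:       return RDS_SEND_OTHER;
    }
}

/* Read one integer from a sysctl file. Returns 0 on success, -1 on error. */
static int read_sysctl_int(const char *path, int *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Create an unbound RDS socket; returns -1 if the module is not loaded. */
static int rds_socket_open(void)
{
    int pf_rds;

    if (read_sysctl_int("/proc/sys/net/rds/pf_rds", &pf_rds) != 0)
        return -1;
    return socket(pf_rds, SOCK_SEQPACKET, 0);
}
```

The distinction between RDS_SEND_QUEUE_FULL and RDS_SEND_CONGESTED matters because, as the poll discussion explains, POLLOUT alone does not tell the application that a congested destination has cleared.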
+
+
+RDMA for RDS
+============
+
+ see rds-rdma(7) manpage (available in rds-tools)
+
+
+Congestion Notifications
+========================
+
+ see rds(7) manpage
+
+
+RDS Protocol
+============
+
+ Message header
+
+ The message header is a 'struct rds_header' (see rds.h):
+ Fields:
+ h_sequence:
+ per-packet sequence number
+ h_ack:
+ piggybacked acknowledgment of last packet received
+ h_len:
+ length of data, not including header
+ h_sport:
+ source port
+ h_dport:
+ destination port
+ h_flags:
+ CONG_BITMAP - this is a congestion update bitmap
+ ACK_REQUIRED - receiver must ack this packet
+ RETRANSMITTED - packet has previously been sent
+ h_credit:
+ indicate to other end of connection that
+ it has more credits available (i.e. there is
+ more send room)
+ h_padding[4]:
+ unused, for future use
+ h_csum:
+ header checksum
+ h_exthdr:
+ optional data can be passed here. This is currently used for
+ passing RDMA-related information.
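Note on the message header: the field list above could correspond to a C layout along the following lines. This is a sketch and the details are assumptions: the document names the fields but gives neither their widths nor the flag bit values, so the types, the 16-byte extension area and the flag constants below should be checked against rds.h before being relied on. The _sketch suffix marks the names as hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag bit values for h_flags. */
#define SKETCH_FLAG_CONG_BITMAP    0x01  /* congestion update bitmap */
#define SKETCH_FLAG_ACK_REQUIRED   0x02  /* receiver must ack this packet */
#define SKETCH_FLAG_RETRANSMITTED  0x04  /* packet has been sent before */

#define SKETCH_EXT_SPACE 16              /* assumed size of h_exthdr */

/* Multi-byte fields are carried in network byte order on the wire. */
struct rds_header_sketch {
    uint64_t h_sequence;     /* per-packet sequence number */
    uint64_t h_ack;          /* piggybacked ack of last packet received */
    uint32_t h_len;          /* length of data, not including header */
    uint16_t h_sport;        /* source port */
    uint16_t h_dport;        /* destination port */
    uint8_t  h_flags;        /* SKETCH_FLAG_* bits */
    uint8_t  h_credit;       /* send credits granted to the peer */
    uint8_t  h_padding[4];   /* unused, for future use */
    uint16_t h_csum;         /* header checksum */
    uint8_t  h_exthdr[SKETCH_EXT_SPACE]; /* optional data, e.g. RDMA info */
};

/* True if the receiver is required to acknowledge this packet. */
static bool hdr_wants_ack(const struct rds_header_sketch *h)
{
    return (h->h_flags & SKETCH_FLAG_ACK_REQUIRED) != 0;
}
```

With these assumed widths the fields pack without compiler padding into 48 bytes.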
+
+ ACK and retransmit handling
+
+ One might think that with reliable IB connections you wouldn't need
+ to ack messages that have been received. The problem is that IB
+ hardware generates an ack message before it has DMAed the message
+ into memory. This creates a potential message loss if the HCA is
+ disabled for any reason between when it sends the ack and before
+ the message is DMAed and processed. This is only a potential issue
+ if another HCA is available for fail-over.
+
+ Sending an ack immediately would allow the sender to free the sent
+ message from their send queue quickly, but could cause excessive
+ traffic to be used for acks. RDS piggybacks acks on sent data
+ packets. Ack-only packets are reduced by only allowing one to be
+ in flight at a time, and by the sender only asking for acks when
+ its send buffers start to fill up. All retransmissions are also
+ acked.
+
+ Flow Control
+
+ RDS's IB transport uses a credit-based mechanism to verify that
+ there is space in the peer's receive buffers for more data. This
+ eliminates the need for hardware retries on the connection.
+
+ Congestion
+
+ Messages waiting in the receive queue on the receiving socket
+ are accounted against the socket's SO_RCVBUF option value. Only
+ the payload bytes in the message are accounted for. If the
+ number of bytes queued equals or exceeds rcvbuf then the socket
+ is congested. All sends attempted to this socket's address
+ should block or return -EWOULDBLOCK.
+
+ Applications are expected to be reasonably tuned such that this
+ situation very rarely occurs. An application encountering this
+ "back-pressure" is considered a bug.
+
+ This is implemented by having each node maintain bitmaps which
+ indicate which ports on bound addresses are congested. As the
+ bitmap changes it is sent through all the connections which
+ terminate in the local address of the bitmap which changed.
+
+ The bitmaps are allocated as connections are brought up. This
+ avoids allocation in the interrupt handling path which queues
+ messages on sockets. The dense bitmaps let transports send the
+ entire bitmap on any bitmap change reasonably efficiently. This
+ is much easier to implement than some finer-grained
+ communication of per-port congestion. The sender does a very
+ inexpensive bit test to test if the port it's about to send to
+ is congested or not.
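Note on the congestion bitmap: the per-port bitmap and the "inexpensive bit test" described above can be sketched in C as follows. This is an illustration under assumptions, not the layout used by struct rds_cong_map; the names cong_map_sketch, cong_set, cong_clear, cong_test and socket_is_congested are hypothetical, and the sketch uses one bit per port in a flat array covering the 16-bit port space.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PORT_COUNT     65536u   /* 16-bit port space */
#define BITS_PER_WORD  (sizeof(unsigned long) * CHAR_BIT)

/* One bit per port; a set bit means the port is congested. */
struct cong_map_sketch {
    unsigned long bits[PORT_COUNT / BITS_PER_WORD];
};

static void cong_set(struct cong_map_sketch *m, uint16_t port)
{
    m->bits[port / BITS_PER_WORD] |= 1UL << (port % BITS_PER_WORD);
}

static void cong_clear(struct cong_map_sketch *m, uint16_t port)
{
    m->bits[port / BITS_PER_WORD] &= ~(1UL << (port % BITS_PER_WORD));
}

/* The sender-side test performed before queueing a message to a port. */
static bool cong_test(const struct cong_map_sketch *m, uint16_t port)
{
    return (m->bits[port / BITS_PER_WORD] >> (port % BITS_PER_WORD)) & 1UL;
}

/* Receiver-side rule from the text: congested once queued >= rcvbuf. */
static bool socket_is_congested(size_t queued_bytes, size_t rcvbuf)
{
    return queued_bytes >= rcvbuf;
}
```

A flat bitmap for 65536 ports occupies 8 KB, which gives a sense of why sending the whole map on each change is considered reasonably efficient.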
+
+
+RDS Transport Layer
+===================
+
+ As mentioned above, RDS is not IB-specific. Its code is divided
+ into a general RDS layer and a transport layer.
+
+ The general layer handles the socket API, congestion handling,
+ loopback, stats, usermem pinning, and the connection state machine.
+
+ The transport layer handles the details of the transport. The IB
+ transport, for example, handles all the queue pairs, work requests,
+ CM event handlers, and other Infiniband details.
+
+
+RDS Kernel Structures
+=====================
+
+ struct rds_message
+ aka possibly "rds_outgoing", the generic RDS layer copies data to
+ be sent and sets header fields as needed, based on the socket API.
+ This is then queued for the individual connection and sent by the
+ connection's transport.
+ struct rds_incoming
+ a generic struct referring to incoming data that can be handed from
+ the transport to the general code and queued by the general code
+ while the socket is awoken. It is then passed back to the transport
+ code to handle the actual copy-to-user.
+ struct rds_socket
+ per-socket information
+ struct rds_connection
+ per-connection information
+ struct rds_transport
+ pointers to transport-specific functions
+ struct rds_statistics
+ non-transport-specific statistics
+ struct rds_cong_map
+ wraps the raw congestion bitmap, contains rbnode, waitq, etc.
+
+Connection management
+=====================
+
+ Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
+ ERROR states.
+
+ The first time an attempt is made by an RDS socket to send data to
+ a node, a connection is allocated and connected. That connection is
+ then maintained forever -- if there are transport errors, the
+ connection will be dropped and re-established.
+
+ Dropping a connection while packets are queued will cause queued or
+ partially-sent datagrams to be retransmitted when the connection is
+ re-established.
+
+
+The send path
+=============
+
+ rds_sendmsg()
+ struct rds_message built from incoming data
+ CMSGs parsed (e.g. RDMA ops)
+ transport connection alloced and connected if not already
+ rds_message placed on send queue
+ send worker awoken
+ rds_send_worker()
+ calls rds_send_xmit() until queue is empty
+ rds_send_xmit()
+ transmits congestion map if one is pending
+ may set ACK_REQUIRED
+ calls transport to send either non-RDMA or RDMA message
+ (RDMA ops never retransmitted)
+ rds_ib_xmit()
+ allocs work requests from send ring
+ adds any new send credits available to peer (h_credits)
+ maps the rds_message's sg list
+ piggybacks ack
+ populates work requests
+ post send to connection's queue pair
+
+The recv path
+=============
+
+ rds_ib_recv_cq_comp_handler()
+ looks at write completions
+ unmaps recv buffer from device
+ no errors, call rds_ib_process_recv()
+ refill recv ring
+ rds_ib_process_recv()
+ validate header checksum
+ copy header to rds_ib_incoming struct if start of a new datagram
+ add to ibinc's fraglist
+ if completed datagram:
+ update cong map if datagram was cong update
+ call rds_recv_incoming() otherwise
+ note if ack is required
+ rds_recv_incoming()
+ drop duplicate packets
+ respond to pings
+ find the sock associated with this datagram
+ add to sock queue
+ wake up sock
+ do some congestion calculations
+ rds_recvmsg
+ copy data into user iovec
+ handle CMSGs
+ return to application
+
+
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
new file mode 100644
index 00000000000..0e58b453917
--- /dev/null
+++ b/Documentation/networking/timestamping.txt
@@ -0,0 +1,180 @@
+The existing interfaces for getting network packets time stamped are:
+
+* SO_TIMESTAMP
+ Generate time stamp for each incoming packet using the (not necessarily
+ monotonic!) system time. Result is returned via recvmsg() in a
+ control message as timeval (usec resolution).
+
+* SO_TIMESTAMPNS
+ Same time stamping mechanism as SO_TIMESTAMP, but returns result as
+ timespec (nsec resolution).
+
+* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
+ Only for multicasts: approximate send time stamp by receiving the looped
+ packet and using its receive time stamp.
+
+The following interface complements the existing ones: receive time
+stamps can be generated and returned for arbitrary packets and much
+closer to the point where the packet is really sent. Time stamps can
+be generated in software (as before) or in hardware (if the hardware
+has such a feature).
+
+SO_TIMESTAMPING:
+
+Instructs the socket layer which kind of information is wanted. The
+parameter is an integer with some of the following bits set. Setting
+other bits is an error and doesn't change the current state.
+
+SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware
+SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or
+ fails, then do it in software
+SOF_TIMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp
+ as generated by the hardware
+SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or
+ fails, then do it in software
+SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp
+SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to
+ the system time base
+SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in
+ software
+
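As a rough sketch of how an application might build and check the SO_TIMESTAMPING parameter: the helper names `timestamping_flags_valid()` and `request_timestamping()` and the macro `DOC_TIMESTAMPING_BITS` are hypothetical, and the set of accepted bits is assumed to be exactly the seven listed above. A kernel that knows additional flags will accept more than this check does.

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/net_tstamp.h>

#ifndef SO_TIMESTAMPING
# define SO_TIMESTAMPING 37	/* fallback, same value as the example program */
#endif

/* The seven bits documented above (assumption: no other bits are valid). */
#define DOC_TIMESTAMPING_BITS \
	(SOF_TIMESTAMPING_TX_HARDWARE | SOF_TIMESTAMPING_TX_SOFTWARE | \
	 SOF_TIMESTAMPING_RX_HARDWARE | SOF_TIMESTAMPING_RX_SOFTWARE | \
	 SOF_TIMESTAMPING_RAW_HARDWARE | SOF_TIMESTAMPING_SYS_HARDWARE | \
	 SOF_TIMESTAMPING_SOFTWARE)

/* Returns 1 if only documented bits are set, 0 otherwise. */
static int timestamping_flags_valid(int flags)
{
	return (flags & ~DOC_TIMESTAMPING_BITS) == 0;
}

/*
 * Pass the flags to the socket layer. Unknown bits are rejected locally
 * with EINVAL, mirroring the rule that setting other bits is an error.
 */
static int request_timestamping(int sock, int flags)
{
	if (!timestamping_flags_valid(flags)) {
		errno = EINVAL;
		return -1;
	}
	return setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}
```

A typical request for hardware send time stamps with raw reporting would be `request_timestamping(sock, SOF_TIMESTAMPING_TX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE)`.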
+SOF_TIMESTAMPING_TX/RX determine how time stamps are generated.
+SOF_TIMESTAMPING_RAW/SYS determine how they are reported in the
+following control message:
+ struct scm_timestamping {
+ struct timespec systime;
+ struct timespec hwtimetrans;
+ struct timespec hwtimeraw;
+ };
+
+recvmsg() can be used to get this control message for regular incoming
+packets. For send time stamps the outgoing packet is looped back to
+the socket's error queue with the send time stamp(s) attached. It can
+be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the
+original outgoing packet data including all headers prepended down to
+and including the link layer, the scm_timestamping control message and
+a sock_extended_err control message with ee_errno==ENOMSG and
+ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending
+bounced packet is ready for reading as far as select() is concerned.
+If the outgoing packet has to be fragmented, then only the first
+fragment is time stamped and returned to the sending socket.
+
+All three values correspond to the same event in time, but were
+generated in different ways. Each of these values may be empty (= all
+zero), in which case no such value was available. If the application
+is not interested in some of these values, they can be left blank to
+avoid the potential overhead of calculating them.
+
+systime is the value of the system time at that moment. This
+corresponds to the value also returned via SO_TIMESTAMP[NS]. If the
+time stamp was generated by hardware, then this field is
+empty. Otherwise it is filled in if SOF_TIMESTAMPING_SOFTWARE is
+set.
+
+hwtimeraw is the original hardware time stamp. Filled in if
+SOF_TIMESTAMPING_RAW_HARDWARE is set. No assumptions about its
+relation to system time should be made.
+
+hwtimetrans is the hardware time stamp transformed so that it
+corresponds as closely as possible to system time. This correlation is
+not perfect; as a consequence, sorting packets received via different
+NICs by their hwtimetrans may differ from the order in which they were
+received. hwtimetrans may be non-monotonic even for the same NIC.
+Filled in if SOF_TIMESTAMPING_SYS_HARDWARE is set. Requires support
+by the network device and will be empty without that support.
+
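The three fields can be pulled out of a received message roughly as follows. This is a sketch under stated assumptions: `struct doc_scm_timestamping` is a local, hypothetically named mirror of the layout shown above, `find_timestamping()` and `ts_is_empty()` are illustrative helpers, and the kernel's timespec layout is assumed to match the C library's.

```c
#include <string.h>
#include <time.h>
#include <sys/socket.h>

#ifndef SO_TIMESTAMPING
# define SO_TIMESTAMPING 37	/* fallback, same value as the example program */
#endif
#ifndef SCM_TIMESTAMPING
# define SCM_TIMESTAMPING SO_TIMESTAMPING
#endif

/* Local mirror of the control message layout documented above. */
struct doc_scm_timestamping {
	struct timespec systime;
	struct timespec hwtimetrans;
	struct timespec hwtimeraw;
};

/* An all-zero value means that this time stamp was not available. */
static int ts_is_empty(const struct timespec *ts)
{
	return ts->tv_sec == 0 && ts->tv_nsec == 0;
}

/*
 * Walk the control messages of a msghdr filled in by recvmsg() and copy
 * out the first SCM_TIMESTAMPING payload. Returns 1 if found, 0 if not.
 */
static int find_timestamping(struct msghdr *msg,
			     struct doc_scm_timestamping *out)
{
	struct cmsghdr *cmsg;

	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
		if (cmsg->cmsg_level != SOL_SOCKET ||
		    cmsg->cmsg_type != SCM_TIMESTAMPING)
			continue;
		if (cmsg->cmsg_len < CMSG_LEN(sizeof(*out)))
			continue;	/* too short for three timespecs */
		memcpy(out, CMSG_DATA(cmsg), sizeof(*out));
		return 1;
	}
	return 0;
}
```

The same helper works for regular reads and for reads with MSG_ERRQUEUE. The caller should check each field with `ts_is_empty()` before using it, since any of the three may be left blank.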
+
+SIOCSHWTSTAMP:
+
+Hardware time stamping must also be initialized for each device driver
+that is expected to do hardware time stamping. The parameter is:
+
+struct hwtstamp_config {
+ int flags; /* no flags defined right now, must be zero */
+ int tx_type; /* HWTSTAMP_TX_* */
+ int rx_filter; /* HWTSTAMP_FILTER_* */
+};
+
+Desired behavior is passed into the kernel and to a specific device by
+calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
+ifr_data points to a struct hwtstamp_config. The tx_type and
+rx_filter are hints to the driver about what it is expected to do. If
+the requested fine-grained filtering for incoming packets is not
+supported, the driver may time stamp more than just the requested types
+of packets.
+
+A driver which supports hardware time stamping shall update the struct
+with the actual, possibly more permissive configuration. If the
+requested packets cannot be time stamped, then nothing should be
+changed and ERANGE shall be returned (in contrast to EINVAL, which
+indicates that SIOCSHWTSTAMP is not supported at all).
+
+Only a process with admin rights may change the configuration. User
+space is responsible for ensuring that multiple processes don't interfere
+with each other and that the settings are reset.
+
+/* possible values for hwtstamp_config->tx_type */
+enum {
+ /*
+ * no outgoing packet will need hardware time stamping;
+ * should a packet arrive which asks for it, no hardware
+ * time stamping will be done
+ */
+ HWTSTAMP_TX_OFF,
+
+ /*
+ * enables hardware time stamping for outgoing packets;
+ * the sender of the packet decides which are to be
+ * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
+ * before sending the packet
+ */
+ HWTSTAMP_TX_ON,
+};
+
+/* possible values for hwtstamp_config->rx_filter */
+enum {
+ /* time stamp no incoming packet at all */
+ HWTSTAMP_FILTER_NONE,
+
+ /* time stamp any incoming packet */
+ HWTSTAMP_FILTER_ALL,
+
+ /* return value: time stamp all packets requested plus some others */
+ HWTSTAMP_FILTER_SOME,
+
+ /* PTP v1, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+
+ ...
+};
+
+
+DEVICE IMPLEMENTATION
+
+A driver which supports hardware time stamping must support the
+SIOCSHWTSTAMP ioctl. Time stamps for received packets must be stored
+in the skb with skb_hwtstamp_set().
+
+Time stamps for outgoing packets are to be generated as follows:
+- In hard_start_xmit(), check if skb_hwtstamp_check_tx_hardware()
+ returns non-zero. If yes, then the driver is expected
+ to do hardware time stamping.
+- If this is possible for the skb and requested, then declare
+ that the driver is doing the time stamping by calling
+ skb_hwtstamp_tx_in_progress(). A driver not supporting
+ hardware time stamping doesn't do that. A driver must never
+ touch sk_buff::tstamp! It is used to store how time stamping
+ for an outgoing packet is to be done.
+- As soon as the driver has sent the packet and/or obtained a
+ hardware time stamp for it, it passes the time stamp back by
+ calling skb_hwtstamp_tx() with the original skb, the raw
+ hardware time stamp and a handle to the device (necessary
+ to convert the hardware time stamp to system time). If obtaining
+ the hardware time stamp somehow fails, then the driver should
+ not fall back to software time stamping. The rationale is that
+ this would occur at a later time in the processing pipeline
+ than other software time stamping and therefore could lead
+ to unexpected deltas between time stamps.
+- If the driver did not call skb_hwtstamp_tx_in_progress(), then
+ dev_hard_start_xmit() checks whether software time stamping
+ is wanted as fallback and potentially generates the time stamp.
diff --git a/Documentation/networking/timestamping/.gitignore b/Documentation/networking/timestamping/.gitignore
new file mode 100644
index 00000000000..71e81eb2e22
--- /dev/null
+++ b/Documentation/networking/timestamping/.gitignore
@@ -0,0 +1 @@
+timestamping
diff --git a/Documentation/networking/timestamping/Makefile b/Documentation/networking/timestamping/Makefile
new file mode 100644
index 00000000000..2a1489fdc03
--- /dev/null
+++ b/Documentation/networking/timestamping/Makefile
@@ -0,0 +1,6 @@
+CPPFLAGS = -I../../../include
+
+timestamping: timestamping.c
+
+clean:
+ rm -f timestamping
diff --git a/Documentation/networking/timestamping/timestamping.c b/Documentation/networking/timestamping/timestamping.c
new file mode 100644
index 00000000000..43d14310421
--- /dev/null
+++ b/Documentation/networking/timestamping/timestamping.c
@@ -0,0 +1,533 @@
+/*
+ * This program demonstrates how the various time stamping features in
+ * the Linux kernel work. It emulates the behavior of a PTP
+ * implementation in stand-alone master mode by sending PTPv1 Sync
+ * multicasts once every five seconds. It looks for similar packets, but
+ * beyond that doesn't actually implement PTP.
+ *
+ * Outgoing packets are time stamped with SO_TIMESTAMPING with or
+ * without hardware support.
+ *
+ * Incoming packets are time stamped with SO_TIMESTAMPING with or
+ * without hardware support, SIOCGSTAMP[NS] (per-socket time stamp) and
+ * SO_TIMESTAMP[NS].
+ *
+ * Copyright (C) 2009 Intel Corporation.
+ * Author: Patrick Ohly <patrick.ohly@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <string.h>
+#include <strings.h> /* strcasecmp() */
+
+#include <sys/time.h>
+#include <sys/socket.h>
+#include <sys/select.h>
+#include <sys/ioctl.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+
+#include "asm/types.h"
+#include "linux/net_tstamp.h"
+#include "linux/errqueue.h"
+
+#ifndef SO_TIMESTAMPING
+# define SO_TIMESTAMPING 37
+# define SCM_TIMESTAMPING SO_TIMESTAMPING
+#endif
+
+#ifndef SO_TIMESTAMPNS
+# define SO_TIMESTAMPNS 35
+#endif
+
+#ifndef SIOCGSTAMPNS
+# define SIOCGSTAMPNS 0x8907
+#endif
+
+#ifndef SIOCSHWTSTAMP
+# define SIOCSHWTSTAMP 0x89b0
+#endif
+
+static void usage(const char *error)
+{
+ if (error)
+ printf("invalid option: %s\n", error);
+ printf("timestamping interface option*\n\n"
+ "Options:\n"
+ " IP_MULTICAST_LOOP - looping outgoing multicasts\n"
+ " SO_TIMESTAMP - normal software time stamping, us resolution\n"
+ " SO_TIMESTAMPNS - more accurate software time stamping\n"
+ " SOF_TIMESTAMPING_TX_HARDWARE - hardware time stamping of outgoing packets\n"
+ " SOF_TIMESTAMPING_TX_SOFTWARE - software fallback for outgoing packets\n"
+ " SOF_TIMESTAMPING_RX_HARDWARE - hardware time stamping of incoming packets\n"
+ " SOF_TIMESTAMPING_RX_SOFTWARE - software fallback for incoming packets\n"
+ " SOF_TIMESTAMPING_SOFTWARE - request reporting of software time stamps\n"
+ " SOF_TIMESTAMPING_SYS_HARDWARE - request reporting of transformed HW time stamps\n"
+ " SOF_TIMESTAMPING_RAW_HARDWARE - request reporting of raw HW time stamps\n"
+ " SIOCGSTAMP - check last socket time stamp\n"
+ " SIOCGSTAMPNS - more accurate socket time stamp\n");
+ exit(1);
+}
+
+static void bail(const char *error)
+{
+ printf("%s: %s\n", error, strerror(errno));
+ exit(1);
+}
+
+static const unsigned char sync[] = {
+ 0x00, 0x01, 0x00, 0x01,
+ 0x5f, 0x44, 0x46, 0x4c,
+ 0x54, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x01, 0x01,
+
+ /* fake uuid */
+ 0x00, 0x01,
+ 0x02, 0x03, 0x04, 0x05,
+
+ 0x00, 0x01, 0x00, 0x37,
+ 0x00, 0x00, 0x00, 0x08,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x49, 0x05, 0xcd, 0x01,
+ 0x29, 0xb1, 0x8d, 0xb0,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x01,
+
+ /* fake uuid */
+ 0x00, 0x01,
+ 0x02, 0x03, 0x04, 0x05,
+
+ 0x00, 0x00, 0x00, 0x37,
+ 0x00, 0x00, 0x00, 0x04,
+ 0x44, 0x46, 0x4c, 0x54,
+ 0x00, 0x00, 0xf0, 0x60,
+ 0x00, 0x01, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x01,
+ 0x00, 0x00, 0xf0, 0x60,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x04,
+ 0x44, 0x46, 0x4c, 0x54,
+ 0x00, 0x01,
+
+ /* fake uuid */
+ 0x00, 0x01,
+ 0x02, 0x03, 0x04, 0x05,
+
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00
+};
+
+static void sendpacket(int sock, struct sockaddr *addr, socklen_t addr_len)
+{
+ struct timeval now;
+ int res;
+
+ res = sendto(sock, sync, sizeof(sync), 0,
+ addr, addr_len);
+ gettimeofday(&now, 0);
+ if (res < 0)
+ printf("%s: %s\n", "send", strerror(errno));
+ else
+ printf("%ld.%06ld: sent %d bytes\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ res);
+}
+
+static void printpacket(struct msghdr *msg, int res,
+ char *data,
+ int sock, int recvmsg_flags,
+ int siocgstamp, int siocgstampns)
+{
+ struct sockaddr_in *from_addr = (struct sockaddr_in *)msg->msg_name;
+ struct cmsghdr *cmsg;
+ struct timeval tv;
+ struct timespec ts;
+ struct timeval now;
+
+ gettimeofday(&now, 0);
+
+ printf("%ld.%06ld: received %s data, %d bytes from %s, %zu bytes control messages\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular",
+ res,
+ inet_ntoa(from_addr->sin_addr),
+ (size_t)msg->msg_controllen);
+ for (cmsg = CMSG_FIRSTHDR(msg);
+ cmsg;
+ cmsg = CMSG_NXTHDR(msg, cmsg)) {
+ printf(" cmsg len %zu: ", (size_t)cmsg->cmsg_len);
+ switch (cmsg->cmsg_level) {
+ case SOL_SOCKET:
+ printf("SOL_SOCKET ");
+ switch (cmsg->cmsg_type) {
+ case SO_TIMESTAMP: {
+ struct timeval *stamp =
+ (struct timeval *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMP %ld.%06ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_usec);
+ break;
+ }
+ case SO_TIMESTAMPNS: {
+ struct timespec *stamp =
+ (struct timespec *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMPNS %ld.%09ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ break;
+ }
+ case SO_TIMESTAMPING: {
+ struct timespec *stamp =
+ (struct timespec *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMPING ");
+ printf("SW %ld.%09ld ",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ stamp++;
+ printf("HW transformed %ld.%09ld ",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ stamp++;
+ printf("HW raw %ld.%09ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ break;
+ }
+ default:
+ printf("type %d", cmsg->cmsg_type);
+ break;
+ }
+ break;
+ case IPPROTO_IP:
+ printf("IPPROTO_IP ");
+ switch (cmsg->cmsg_type) {
+ case IP_RECVERR: {
+ struct sock_extended_err *err =
+ (struct sock_extended_err *)CMSG_DATA(cmsg);
+ printf("IP_RECVERR ee_errno '%s' ee_origin %d => %s",
+ strerror(err->ee_errno),
+ err->ee_origin,
+#ifdef SO_EE_ORIGIN_TIMESTAMPING
+ err->ee_origin == SO_EE_ORIGIN_TIMESTAMPING ?
+ "bounced packet" : "unexpected origin"
+#else
+ "probably SO_EE_ORIGIN_TIMESTAMPING"
+#endif
+ );
+ if (res < sizeof(sync))
+ printf(" => truncated data?!");
+ else if (!memcmp(sync, data + res - sizeof(sync),
+ sizeof(sync)))
+ printf(" => GOT OUR DATA BACK (HURRAY!)");
+ break;
+ }
+ case IP_PKTINFO: {
+ struct in_pktinfo *pktinfo =
+ (struct in_pktinfo *)CMSG_DATA(cmsg);
+ printf("IP_PKTINFO interface index %u",
+ pktinfo->ipi_ifindex);
+ break;
+ }
+ default:
+ printf("type %d", cmsg->cmsg_type);
+ break;
+ }
+ break;
+ default:
+ printf("level %d type %d",
+ cmsg->cmsg_level,
+ cmsg->cmsg_type);
+ break;
+ }
+ printf("\n");
+ }
+
+ if (siocgstamp) {
+ if (ioctl(sock, SIOCGSTAMP, &tv))
+ printf(" %s: %s\n", "SIOCGSTAMP", strerror(errno));
+ else
+ printf("SIOCGSTAMP %ld.%06ld\n",
+ (long)tv.tv_sec,
+ (long)tv.tv_usec);
+ }
+ if (siocgstampns) {
+ if (ioctl(sock, SIOCGSTAMPNS, &ts))
+ printf(" %s: %s\n", "SIOCGSTAMPNS", strerror(errno));
+ else
+ printf("SIOCGSTAMPNS %ld.%09ld\n",
+ (long)ts.tv_sec,
+ (long)ts.tv_nsec);
+ }
+}
+
+static void recvpacket(int sock, int recvmsg_flags,
+ int siocgstamp, int siocgstampns)
+{
+ char data[256];
+ struct msghdr msg;
+ struct iovec entry;
+ struct sockaddr_in from_addr;
+ struct {
+ struct cmsghdr cm;
+ char control[512];
+ } control;
+ int res;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_iov = &entry;
+ msg.msg_iovlen = 1;
+ entry.iov_base = data;
+ entry.iov_len = sizeof(data);
+ msg.msg_name = (caddr_t)&from_addr;
+ msg.msg_namelen = sizeof(from_addr);
+ msg.msg_control = &control;
+ msg.msg_controllen = sizeof(control);
+
+ res = recvmsg(sock, &msg, recvmsg_flags|MSG_DONTWAIT);
+ if (res < 0) {
+ printf("%s %s: %s\n",
+ "recvmsg",
+ (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular",
+ strerror(errno));
+ } else {
+ printpacket(&msg, res, data,
+ sock, recvmsg_flags,
+ siocgstamp, siocgstampns);
+ }
+}
+
+int main(int argc, char **argv)
+{
+ int so_timestamping_flags = 0;
+ int so_timestamp = 0;
+ int so_timestampns = 0;
+ int siocgstamp = 0;
+ int siocgstampns = 0;
+ int ip_multicast_loop = 0;
+ char *interface;
+ int i;
+ int enabled = 1;
+ int sock;
+ struct ifreq device;
+ struct ifreq hwtstamp;
+ struct hwtstamp_config hwconfig, hwconfig_requested;
+ struct sockaddr_in addr;
+ struct ip_mreq imr;
+ struct in_addr iaddr;
+ int val;
+ socklen_t len;
+ struct timeval next;
+
+ if (argc < 2)
+ usage(0);
+ interface = argv[1];
+
+ for (i = 2; i < argc; i++) {
+ if (!strcasecmp(argv[i], "SO_TIMESTAMP"))
+ so_timestamp = 1;
+ else if (!strcasecmp(argv[i], "SO_TIMESTAMPNS"))
+ so_timestampns = 1;
+ else if (!strcasecmp(argv[i], "SIOCGSTAMP"))
+ siocgstamp = 1;
+ else if (!strcasecmp(argv[i], "SIOCGSTAMPNS"))
+ siocgstampns = 1;
+ else if (!strcasecmp(argv[i], "IP_MULTICAST_LOOP"))
+ ip_multicast_loop = 1;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_HARDWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_TX_HARDWARE;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_SOFTWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_TX_SOFTWARE;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_HARDWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_RX_HARDWARE;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_SOFTWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_RX_SOFTWARE;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SOFTWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_SOFTWARE;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SYS_HARDWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_SYS_HARDWARE;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RAW_HARDWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_RAW_HARDWARE;
+ else
+ usage(argv[i]);
+ }
+
+ sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
+ if (sock < 0)
+ bail("socket");
+
+ memset(&device, 0, sizeof(device));
+ strncpy(device.ifr_name, interface, sizeof(device.ifr_name) - 1);
+ if (ioctl(sock, SIOCGIFADDR, &device) < 0)
+ bail("getting interface IP address");
+
+ memset(&hwtstamp, 0, sizeof(hwtstamp));
+ strncpy(hwtstamp.ifr_name, interface, sizeof(hwtstamp.ifr_name) - 1);
+ hwtstamp.ifr_data = (void *)&hwconfig;
+ memset(&hwconfig, 0, sizeof(hwconfig));
+ hwconfig.tx_type =
+ (so_timestamping_flags & SOF_TIMESTAMPING_TX_HARDWARE) ?
+ HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF;
+ hwconfig.rx_filter =
+ (so_timestamping_flags & SOF_TIMESTAMPING_RX_HARDWARE) ?
+ HWTSTAMP_FILTER_PTP_V1_L4_SYNC : HWTSTAMP_FILTER_NONE;
+ hwconfig_requested = hwconfig;
+ if (ioctl(sock, SIOCSHWTSTAMP, &hwtstamp) < 0) {
+ if ((errno == EINVAL || errno == ENOTSUP) &&
+ hwconfig_requested.tx_type == HWTSTAMP_TX_OFF &&
+ hwconfig_requested.rx_filter == HWTSTAMP_FILTER_NONE)
+ printf("SIOCSHWTSTAMP: disabling hardware time stamping not possible\n");
+ else
+ bail("SIOCSHWTSTAMP");
+ }
+ printf("SIOCSHWTSTAMP: tx_type %d requested, got %d; rx_filter %d requested, got %d\n",
+ hwconfig_requested.tx_type, hwconfig.tx_type,
+ hwconfig_requested.rx_filter, hwconfig.rx_filter);
+
+ /* bind to PTP port */
+ addr.sin_family = AF_INET;
+ addr.sin_addr.s_addr = htonl(INADDR_ANY);
+ addr.sin_port = htons(319 /* PTP event port */);
+ if (bind(sock,
+ (struct sockaddr *)&addr,
+ sizeof(struct sockaddr_in)) < 0)
+ bail("bind");
+
+ /* set multicast group for outgoing packets */
+ inet_aton("224.0.1.130", &iaddr); /* alternate PTP domain 1 */
+ addr.sin_addr = iaddr;
+ imr.imr_multiaddr.s_addr = iaddr.s_addr;
+ imr.imr_interface.s_addr =
+ ((struct sockaddr_in *)&device.ifr_addr)->sin_addr.s_addr;
+ if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_IF,
+ &imr.imr_interface.s_addr, sizeof(struct in_addr)) < 0)
+ bail("set multicast");
+
+ /* join multicast group, loop our own packet */
+ if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
+ &imr, sizeof(struct ip_mreq)) < 0)
+ bail("join multicast group");
+
+ if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_LOOP,
+ &ip_multicast_loop, sizeof(ip_multicast_loop)) < 0) {
+ bail("loop multicast");
+ }
+
+ /* set socket options for time stamping */
+ if (so_timestamp &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP,
+ &enabled, sizeof(enabled)) < 0)
+ bail("setsockopt SO_TIMESTAMP");
+
+ if (so_timestampns &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS,
+ &enabled, sizeof(enabled)) < 0)
+ bail("setsockopt SO_TIMESTAMPNS");
+
+ if (so_timestamping_flags &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING,
+ &so_timestamping_flags,
+ sizeof(so_timestamping_flags)) < 0)
+ bail("setsockopt SO_TIMESTAMPING");
+
+ /* request IP_PKTINFO for debugging purposes */
+ if (setsockopt(sock, SOL_IP, IP_PKTINFO,
+ &enabled, sizeof(enabled)) < 0)
+ printf("%s: %s\n", "setsockopt IP_PKTINFO", strerror(errno));
+
+ /* verify socket options */
+ len = sizeof(val);
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &val, &len) < 0)
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMP", strerror(errno));
+ else
+ printf("SO_TIMESTAMP %d\n", val);
+
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS, &val, &len) < 0)
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMPNS",
+ strerror(errno));
+ else
+ printf("SO_TIMESTAMPNS %d\n", val);
+
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &val, &len) < 0) {
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMPING",
+ strerror(errno));
+ } else {
+ printf("SO_TIMESTAMPING %d\n", val);
+ if (val != so_timestamping_flags)
+ printf(" not the expected value %d\n",
+ so_timestamping_flags);
+ }
+
+ /* send packets forever every five seconds */
+ gettimeofday(&next, 0);
+ next.tv_sec = (next.tv_sec + 1) / 5 * 5;
+ next.tv_usec = 0;
+ while (1) {
+ struct timeval now;
+ struct timeval delta;
+ long delta_us;
+ int res;
+ fd_set readfs, errorfs;
+
+ gettimeofday(&now, 0);
+ delta_us = (long)(next.tv_sec - now.tv_sec) * 1000000 +
+ (long)(next.tv_usec - now.tv_usec);
+ if (delta_us > 0) {
+ /* continue waiting for timeout or data */
+ delta.tv_sec = delta_us / 1000000;
+ delta.tv_usec = delta_us % 1000000;
+
+ FD_ZERO(&readfs);
+ FD_ZERO(&errorfs);
+ FD_SET(sock, &readfs);
+ FD_SET(sock, &errorfs);
+ printf("%ld.%06ld: select %ldus\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ delta_us);
+ res = select(sock + 1, &readfs, 0, &errorfs, &delta);
+ gettimeofday(&now, 0);
+ printf("%ld.%06ld: select returned: %d, %s\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ res,
+ res < 0 ? strerror(errno) : "success");
+ if (res > 0) {
+ if (FD_ISSET(sock, &readfs))
+ printf("ready for reading\n");
+ if (FD_ISSET(sock, &errorfs))
+ printf("has error\n");
+ recvpacket(sock, 0,
+ siocgstamp,
+ siocgstampns);
+ recvpacket(sock, MSG_ERRQUEUE,
+ siocgstamp,
+ siocgstampns);
+ }
+ } else {
+ /* write one packet */
+ sendpacket(sock,
+ (struct sockaddr *)&addr,
+ sizeof(addr));
+ next.tv_sec += 5;
+ continue;
+ }
+ }
+
+ return 0;
+}
diff --git a/Documentation/networking/vxge.txt b/Documentation/networking/vxge.txt
new file mode 100644
index 00000000000..d2e2997e6fa
--- /dev/null
+++ b/Documentation/networking/vxge.txt
@@ -0,0 +1,100 @@
+Neterion's (Formerly S2io) X3100 Series 10GbE PCIe Server Adapter Linux driver
+==============================================================================
+
+Contents
+--------
+
+1) Introduction
+2) Features supported
+3) Configurable driver parameters
+4) Troubleshooting
+
+1) Introduction:
+----------------
+This Linux driver supports all of Neterion's X3100 series 10 GbE PCIe I/O
+Virtualized Server adapters.
+The X3100 series supports four modes of operation, configurable via
+firmware -
+ Single function mode
+ Multi function mode
+ SRIOV mode
+ MRIOV mode
+The functions share a 10GbE link and the PCIe bus, but hardly anything else
+inside the ASIC. Features like independent hw reset, statistics, bandwidth/
+priority allocation and guarantees, GRO, TSO, interrupt moderation, etc. are
+supported independently on each function.
+
+(See below for a complete list of features supported for both IPv4 and IPv6)
+
+2) Features supported:
+----------------------
+
+i) Single function mode (up to 17 queues)
+
+ii) Multi function mode (up to 17 functions)
+
+iii) PCI-SIG's I/O Virtualization
+ - Single Root mode: v1.0 (up to 17 functions)
+ - Multi-Root mode: v1.0 (up to 17 functions)
+
+iv) Jumbo frames
+ X3100 Series supports MTU up to 9600 bytes, modifiable using
+ the ifconfig command.
+
+v) Offloads supported: (Enabled by default)
+ Checksum offload (TCP/UDP/IP) on transmit and receive paths
+ TCP Segmentation Offload (TSO) on transmit path
+ Generic Receive Offload (GRO) on receive path
+
+vi) MSI-X: (Enabled by default)
+ Resulting in noticeable performance improvement (up to 7% on certain
+ platforms).
+
+vii) NAPI: (Enabled by default)
+ For better Rx interrupt moderation.
+
+viii)RTH (Receive Traffic Hash): (Enabled by default)
+ Receive side steering for better scaling.
+
+ix) Statistics
+ Comprehensive MAC-level and software statistics displayed using
+ the "ethtool -S" option.
+
+x) Multiple hardware queues: (Enabled by default)
+ Up to 17 hardware based transmit and receive data channels, with
+ multiple steering options (transmit multiqueue enabled by default).
+
+3) Configurable driver parameters:
+----------------------------------
+
+i) max_config_dev
+ Specifies maximum device functions to be enabled.
+ Valid range: 1-8
+
+ii) max_config_port
+ Specifies number of ports to be enabled.
+ Valid range: 1,2
+ Default: 1
+
+iii)max_config_vpath
+ Specifies maximum VPATH(s) configured for each device function.
+ Valid range: 1-17
+
+iv) vlan_tag_strip
+ Enables/disables vlan tag stripping from all received tagged frames that
+ are not replicated at the internal L2 switch.
+ Valid range: 0,1 (disabled, enabled respectively)
+ Default: 1
+
+v) addr_learn_en
+ Enable learning the mac address of the guest OS interface in
+ virtualization environment.
+ Valid range: 0,1 (disabled, enabled respectively)
+ Default: 0
+
+4) Troubleshooting:
+-------------------
+
+To resolve an issue with the source code or X3100 series adapter, please
+collect the statistics and register dumps using ethtool, along with relevant
+logs, and email them to support@neterion.com.