summaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2012-12-12 18:07:07 -0800
committerLinus Torvalds <torvalds@linux-foundation.org>2012-12-12 18:07:07 -0800
commit6be35c700f742e911ecedd07fcc43d4439922334 (patch)
treeca9f37214d204465fcc2d79c82efd291e357c53c /Documentation
parente37aa63e87bd581f9be5555ed0ba83f5295c92fc (diff)
parent520dfe3a3645257bf83660f672c47f8558f3d4c4 (diff)
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David Miller: 1) Allow to dump, monitor, and change the bridge multicast database using netlink. From Cong Wang. 2) RFC 5961 TCP blind data injection attack mitigation, from Eric Dumazet. 3) Networking user namespace support from Eric W. Biederman. 4) tuntap/virtio-net multiqueue support by Jason Wang. 5) Support for checksum offload of encapsulated packets (basically, tunneled traffic can still be checksummed by HW). From Joseph Gasparakis. 6) Allow BPF filter access to VLAN tags, from Eric Dumazet and Daniel Borkmann. 7) Bridge port parameters over netlink and BPDU blocking support from Stephen Hemminger. 8) Improve data access patterns during inet socket demux by rearranging socket layout, from Eric Dumazet. 9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and Jon Maloy. 10) Update TCP socket hash sizing to be more in line with current day realities. The existing heurstics were choosen a decade ago. From Eric Dumazet. 11) Fix races, queue bloat, and excessive wakeups in ATM and associated drivers, from Krzysztof Mazur and David Woodhouse. 12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions in VXLAN driver, from David Stevens. 13) Add "oops_only" mode to netconsole, from Amerigo Wang. 14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also allow DCB netlink to work on namespaces other than the initial namespace. From John Fastabend. 15) Support PTP in the Tigon3 driver, from Matt Carlson. 16) tun/vhost zero copy fixes and improvements, plus turn it on by default, from Michael S. Tsirkin. 17) Support per-association statistics in SCTP, from Michele Baldessari. And many, many, driver updates, cleanups, and improvements. Too numerous to mention individually. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits) net/mlx4_en: Add support for destination MAC in steering rules net/mlx4_en: Use generic etherdevice.h functions. net: ethtool: Add destination MAC address to flow steering API bridge: add support of adding and deleting mdb entries bridge: notify mdb changes via netlink ndisc: Unexport ndisc_{build,send}_skb(). uapi: add missing netconf.h to export list pkt_sched: avoid requeues if possible solos-pci: fix double-free of TX skb in DMA mode bnx2: Fix accidental reversions. bna: Driver Version Updated to 3.1.2.1 bna: Firmware update bna: Add RX State bna: Rx Page Based Allocation bna: TX Intr Coalescing Fix bna: Tx and Rx Optimizations bna: Code Cleanup and Enhancements ath9k: check pdata variable before dereferencing it ath5k: RX timestamp is reported at end of frame ath9k_htc: RX timestamp is reported at end of frame ...
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/testing/sysfs-bus-mdio9
-rw-r--r--Documentation/ABI/testing/sysfs-class-net-batman-adv11
-rw-r--r--Documentation/ABI/testing/sysfs-class-net-grcan35
-rw-r--r--Documentation/ABI/testing/sysfs-class-net-mesh40
-rw-r--r--Documentation/devicetree/bindings/net/can/grcan.txt28
-rw-r--r--Documentation/devicetree/bindings/net/cdns-emac.txt23
-rw-r--r--Documentation/devicetree/bindings/net/cpsw.txt48
-rw-r--r--Documentation/kernel-parameters.txt18
-rw-r--r--Documentation/networking/batman-adv.txt3
-rw-r--r--Documentation/networking/ip-sysctl.txt56
-rw-r--r--Documentation/networking/packet_mmap.txt246
-rw-r--r--Documentation/networking/stmmac.txt28
12 files changed, 430 insertions, 115 deletions
diff --git a/Documentation/ABI/testing/sysfs-bus-mdio b/Documentation/ABI/testing/sysfs-bus-mdio
new file mode 100644
index 00000000000..6349749ebc2
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-mdio
@@ -0,0 +1,9 @@
+What: /sys/bus/mdio_bus/devices/.../phy_id
+Date: November 2012
+KernelVersion: 3.8
+Contact: netdev@vger.kernel.org
+Description:
+ This attribute contains the 32-bit PHY Identifier as reported
+ by the device during bus enumeration, encoded in hexadecimal.
+ This ID is used to match the device with the appropriate
+ driver.
diff --git a/Documentation/ABI/testing/sysfs-class-net-batman-adv b/Documentation/ABI/testing/sysfs-class-net-batman-adv
index 38dd762def4..bdc00707c75 100644
--- a/Documentation/ABI/testing/sysfs-class-net-batman-adv
+++ b/Documentation/ABI/testing/sysfs-class-net-batman-adv
@@ -1,4 +1,10 @@
+What: /sys/class/net/<iface>/batman-adv/iface_status
+Date: May 2010
+Contact: Marek Lindner <lindner_marek@yahoo.de>
+Description:
+ Indicates the status of <iface> as it is seen by batman.
+
What: /sys/class/net/<iface>/batman-adv/mesh_iface
Date: May 2010
Contact: Marek Lindner <lindner_marek@yahoo.de>
@@ -7,8 +13,3 @@ Description:
displays the batman mesh interface this <iface>
currently is associated with.
-What: /sys/class/net/<iface>/batman-adv/iface_status
-Date: May 2010
-Contact: Marek Lindner <lindner_marek@yahoo.de>
-Description:
- Indicates the status of <iface> as it is seen by batman.
diff --git a/Documentation/ABI/testing/sysfs-class-net-grcan b/Documentation/ABI/testing/sysfs-class-net-grcan
new file mode 100644
index 00000000000..f418c92ca55
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-net-grcan
@@ -0,0 +1,35 @@
+
+What: /sys/class/net/<iface>/grcan/enable0
+Date: October 2012
+KernelVersion: 3.8
+Contact: Andreas Larsson <andreas@gaisler.com>
+Description:
+ Hardware configuration of physical interface 0. This file reads
+ and writes the "Enable 0" bit of the configuration register.
+ Possible values: 0 or 1. See the GRCAN chapter of the GRLIB IP
+ core library documentation for details. The default value is 0
+ or set by the module parameter grcan.enable0 and can be read at
+ /sys/module/grcan/parameters/enable0.
+
+What: /sys/class/net/<iface>/grcan/enable1
+Date: October 2012
+KernelVersion: 3.8
+Contact: Andreas Larsson <andreas@gaisler.com>
+Description:
+ Hardware configuration of physical interface 1. This file reads
+ and writes the "Enable 1" bit of the configuration register.
+ Possible values: 0 or 1. See the GRCAN chapter of the GRLIB IP
+ core library documentation for details. The default value is 0
+ or set by the module parameter grcan.enable1 and can be read at
+ /sys/module/grcan/parameters/enable1.
+
+What: /sys/class/net/<iface>/grcan/select
+Date: October 2012
+KernelVersion: 3.8
+Contact: Andreas Larsson <andreas@gaisler.com>
+Description:
+ Configuration of which physical interface to be used. Possible
+ values: 0 or 1. See the GRCAN chapter of the GRLIB IP core
+ library documentation for details. The default value is 0 or is
+ set by the module parameter grcan.select and can be read at
+ /sys/module/grcan/parameters/select.
diff --git a/Documentation/ABI/testing/sysfs-class-net-mesh b/Documentation/ABI/testing/sysfs-class-net-mesh
index c81fe89c4c4..bc41da61608 100644
--- a/Documentation/ABI/testing/sysfs-class-net-mesh
+++ b/Documentation/ABI/testing/sysfs-class-net-mesh
@@ -6,6 +6,14 @@ Description:
Indicates whether the batman protocol messages of the
mesh <mesh_iface> shall be aggregated or not.
+What: /sys/class/net/<mesh_iface>/mesh/ap_isolation
+Date: May 2011
+Contact: Antonio Quartulli <ordex@autistici.org>
+Description:
+ Indicates whether the data traffic going from a
+ wireless client to another wireless client will be
+ silently dropped.
+
What: /sys/class/net/<mesh_iface>/mesh/bonding
Date: June 2010
Contact: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
@@ -31,14 +39,6 @@ Description:
mesh will be fragmented or silently discarded if the
packet size exceeds the outgoing interface MTU.
-What: /sys/class/net/<mesh_iface>/mesh/ap_isolation
-Date: May 2011
-Contact: Antonio Quartulli <ordex@autistici.org>
-Description:
- Indicates whether the data traffic going from a
- wireless client to another wireless client will be
- silently dropped.
-
What: /sys/class/net/<mesh_iface>/mesh/gw_bandwidth
Date: October 2010
Contact: Marek Lindner <lindner_marek@yahoo.de>
@@ -60,6 +60,13 @@ Description:
Defines the selection criteria this node will use
to choose a gateway if gw_mode was set to 'client'.
+What: /sys/class/net/<mesh_iface>/mesh/hop_penalty
+Date: Oct 2010
+Contact: Linus Lüssing <linus.luessing@web.de>
+Description:
+ Defines the penalty which will be applied to an
+ originator message's tq-field on every hop.
+
What: /sys/class/net/<mesh_iface>/mesh/orig_interval
Date: May 2010
Contact: Marek Lindner <lindner_marek@yahoo.de>
@@ -67,19 +74,12 @@ Description:
Defines the interval in milliseconds in which batman
sends its protocol messages.
-What: /sys/class/net/<mesh_iface>/mesh/hop_penalty
-Date: Oct 2010
-Contact: Linus Lüssing <linus.luessing@web.de>
-Description:
- Defines the penalty which will be applied to an
- originator message's tq-field on every hop.
-
-What: /sys/class/net/<mesh_iface>/mesh/routing_algo
-Date: Dec 2011
-Contact: Marek Lindner <lindner_marek@yahoo.de>
+What: /sys/class/net/<mesh_iface>/mesh/routing_algo
+Date: Dec 2011
+Contact: Marek Lindner <lindner_marek@yahoo.de>
Description:
- Defines the routing procotol this mesh instance
- uses to find the optimal paths through the mesh.
+ Defines the routing procotol this mesh instance
+ uses to find the optimal paths through the mesh.
What: /sys/class/net/<mesh_iface>/mesh/vis_mode
Date: May 2010
diff --git a/Documentation/devicetree/bindings/net/can/grcan.txt b/Documentation/devicetree/bindings/net/can/grcan.txt
new file mode 100644
index 00000000000..34ef3498f88
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/can/grcan.txt
@@ -0,0 +1,28 @@
+Aeroflex Gaisler GRCAN and GRHCAN CAN controllers.
+
+The GRCAN and CRHCAN CAN controllers are available in the GRLIB VHDL IP core
+library.
+
+Note: These properties are built from the AMBA plug&play in a Leon SPARC system
+(the ordinary environment for GRCAN and GRHCAN). There are no dts files for
+sparc.
+
+Required properties:
+
+- name : Should be "GAISLER_GRCAN", "01_03d", "GAISLER_GRHCAN" or "01_034"
+
+- reg : Address and length of the register set for the device
+
+- freq : Frequency of the external oscillator clock in Hz (the frequency of
+ the amba bus in the ordinary case)
+
+- interrupts : Interrupt number for this device
+
+Optional properties:
+
+- systemid : If not present or if the value of the least significant 16 bits
+ of this 32-bit property is smaller than GRCAN_TXBUG_SAFE_GRLIB_VERSION
+ a bug workaround is activated.
+
+For further information look in the documentation for the GLIB IP core library:
+http://www.gaisler.com/products/grlib/grip.pdf
diff --git a/Documentation/devicetree/bindings/net/cdns-emac.txt b/Documentation/devicetree/bindings/net/cdns-emac.txt
new file mode 100644
index 00000000000..09055c2495f
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/cdns-emac.txt
@@ -0,0 +1,23 @@
+* Cadence EMAC Ethernet controller
+
+Required properties:
+- compatible: Should be "cdns,[<chip>-]{emac}"
+ Use "cdns,at91rm9200-emac" Atmel at91rm9200 SoC.
+ or the generic form: "cdns,emac".
+- reg: Address and length of the register set for the device
+- interrupts: Should contain macb interrupt
+- phy-mode: String, operation mode of the PHY interface.
+ Supported values are: "mii", "rmii".
+
+Optional properties:
+- local-mac-address: 6 bytes, mac address
+
+Examples:
+
+ macb0: ethernet@fffc4000 {
+ compatible = "cdns,at91rm9200-emac";
+ reg = <0xfffc4000 0x4000>;
+ interrupts = <21>;
+ phy-mode = "rmii";
+ local-mac-address = [3a 0e 03 04 05 06];
+ };
diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
index dcaabe9fe86..6ddd0286a9b 100644
--- a/Documentation/devicetree/bindings/net/cpsw.txt
+++ b/Documentation/devicetree/bindings/net/cpsw.txt
@@ -9,21 +9,15 @@ Required properties:
number
- interrupt-parent : The parent interrupt controller
- cpdma_channels : Specifies number of channels in CPDMA
-- host_port_no : Specifies host port shift
-- cpdma_reg_ofs : Specifies CPDMA submodule register offset
-- cpdma_sram_ofs : Specifies CPDMA SRAM offset
-- ale_reg_ofs : Specifies ALE submodule register offset
- ale_entries : Specifies No of entries ALE can hold
-- host_port_reg_ofs : Specifies host port register offset
-- hw_stats_reg_ofs : Specifies hardware statistics register offset
-- bd_ram_ofs : Specifies internal desciptor RAM offset
- bd_ram_size : Specifies internal descriptor RAM size
- rx_descs : Specifies number of Rx descriptors
- mac_control : Specifies Default MAC control register content
for the specific platform
- slaves : Specifies number for slaves
-- slave_reg_ofs : Specifies slave register offset
-- sliver_reg_ofs : Specifies slave sliver register offset
+- cpts_active_slave : Specifies the slave to use for time stamping
+- cpts_clock_mult : Numerator to convert input clock ticks into nanoseconds
+- cpts_clock_shift : Denominator to convert input clock ticks into nanoseconds
- phy_id : Specifies slave phy id
- mac-address : Specifies slave MAC address
@@ -45,30 +39,22 @@ Examples:
interrupts = <55 0x4>;
interrupt-parent = <&intc>;
cpdma_channels = <8>;
- host_port_no = <0>;
- cpdma_reg_ofs = <0x800>;
- cpdma_sram_ofs = <0xa00>;
- ale_reg_ofs = <0xd00>;
ale_entries = <1024>;
- host_port_reg_ofs = <0x108>;
- hw_stats_reg_ofs = <0x900>;
- bd_ram_ofs = <0x2000>;
bd_ram_size = <0x2000>;
no_bd_ram = <0>;
rx_descs = <64>;
mac_control = <0x20>;
slaves = <2>;
+ cpts_active_slave = <0>;
+ cpts_clock_mult = <0x80000000>;
+ cpts_clock_shift = <29>;
cpsw_emac0: slave@0 {
- slave_reg_ofs = <0x208>;
- sliver_reg_ofs = <0xd80>;
- phy_id = "davinci_mdio.16:00";
+ phy_id = <&davinci_mdio>, <0>;
/* Filled in by U-Boot */
mac-address = [ 00 00 00 00 00 00 ];
};
cpsw_emac1: slave@1 {
- slave_reg_ofs = <0x308>;
- sliver_reg_ofs = <0xdc0>;
- phy_id = "davinci_mdio.16:01";
+ phy_id = <&davinci_mdio>, <1>;
/* Filled in by U-Boot */
mac-address = [ 00 00 00 00 00 00 ];
};
@@ -79,30 +65,22 @@ Examples:
compatible = "ti,cpsw";
ti,hwmods = "cpgmac0";
cpdma_channels = <8>;
- host_port_no = <0>;
- cpdma_reg_ofs = <0x800>;
- cpdma_sram_ofs = <0xa00>;
- ale_reg_ofs = <0xd00>;
ale_entries = <1024>;
- host_port_reg_ofs = <0x108>;
- hw_stats_reg_ofs = <0x900>;
- bd_ram_ofs = <0x2000>;
bd_ram_size = <0x2000>;
no_bd_ram = <0>;
rx_descs = <64>;
mac_control = <0x20>;
slaves = <2>;
+ cpts_active_slave = <0>;
+ cpts_clock_mult = <0x80000000>;
+ cpts_clock_shift = <29>;
cpsw_emac0: slave@0 {
- slave_reg_ofs = <0x208>;
- sliver_reg_ofs = <0xd80>;
- phy_id = "davinci_mdio.16:00";
+ phy_id = <&davinci_mdio>, <0>;
/* Filled in by U-Boot */
mac-address = [ 00 00 00 00 00 00 ];
};
cpsw_emac1: slave@1 {
- slave_reg_ofs = <0x308>;
- sliver_reg_ofs = <0xdc0>;
- phy_id = "davinci_mdio.16:01";
+ phy_id = <&davinci_mdio>, <1>;
/* Filled in by U-Boot */
mac-address = [ 00 00 00 00 00 00 ];
};
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 28bd0f0e32c..20e248cc03a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -905,6 +905,24 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
gpt [EFI] Forces disk with valid GPT signature but
invalid Protective MBR to be treated as GPT.
+ grcan.enable0= [HW] Configuration of physical interface 0. Determines
+ the "Enable 0" bit of the configuration register.
+ Format: 0 | 1
+ Default: 0
+ grcan.enable1= [HW] Configuration of physical interface 1. Determines
+ the "Enable 0" bit of the configuration register.
+ Format: 0 | 1
+ Default: 0
+ grcan.select= [HW] Select which physical interface to use.
+ Format: 0 | 1
+ Default: 0
+ grcan.txsize= [HW] Sets the size of the tx buffer.
+ Format: <unsigned int> such that (txsize & ~0x1fffc0) == 0.
+ Default: 1024
+ grcan.rxsize= [HW] Sets the size of the rx buffer.
+ Format: <unsigned int> such that (rxsize & ~0x1fffc0) == 0.
+ Default: 1024
+
hashdist= [KNL,NUMA] Large hashes allocated during boot
are distributed across NUMA nodes. Defaults on
for 64-bit NUMA, off otherwise.
diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt
index a173d2a879f..c1d82047a4b 100644
--- a/Documentation/networking/batman-adv.txt
+++ b/Documentation/networking/batman-adv.txt
@@ -203,7 +203,8 @@ abled during run time. Following log_levels are defined:
2 - Enable messages related to route added / changed / deleted
4 - Enable messages related to translation table operations
8 - Enable messages related to bridge loop avoidance
-15 - enable all messages
+16 - Enable messaged related to DAT, ARP snooping and parsing
+31 - Enable all messages
The debug output can be changed at runtime using the file
/sys/class/net/bat0/mesh/log_level. e.g.
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c7fc1072494..dd52d516cb8 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -30,16 +30,24 @@ neigh/default/gc_thresh3 - INTEGER
Maximum number of neighbor entries allowed. Increase this
when using large numbers of interfaces and when communicating
with large numbers of directly-connected peers.
+ Default: 1024
neigh/default/unres_qlen_bytes - INTEGER
The maximum number of bytes which may be used by packets
queued for each unresolved address by other network layers.
(added in linux 3.3)
+ Seting negative value is meaningless and will retrun error.
+ Default: 65536 Bytes(64KB)
neigh/default/unres_qlen - INTEGER
The maximum number of packets which may be queued for each
unresolved address by other network layers.
(deprecated in linux 3.3) : use unres_qlen_bytes instead.
+ Prior to linux 3.3, the default value is 3 which may cause
+ unexpected packet loss. The current default value is calculated
+ according to default value of unres_qlen_bytes and true size of
+ packet.
+ Default: 31
mtu_expires - INTEGER
Time, in seconds, that cached PMTU information is kept.
@@ -199,15 +207,16 @@ tcp_early_retrans - INTEGER
Default: 2
tcp_ecn - INTEGER
- Enable Explicit Congestion Notification (ECN) in TCP. ECN is only
- used when both ends of the TCP flow support it. It is useful to
- avoid losses due to congestion (when the bottleneck router supports
- ECN).
+ Control use of Explicit Congestion Notification (ECN) by TCP.
+ ECN is used only when both ends of the TCP connection indicate
+ support for it. This feature is useful in avoiding losses due
+ to congestion by allowing supporting routers to signal
+ congestion before having to drop packets.
Possible values are:
- 0 disable ECN
- 1 ECN enabled
- 2 Only server-side ECN enabled. If the other end does
- not support ECN, behavior is like with ECN disabled.
+ 0 Disable ECN. Neither initiate nor accept ECN.
+ 1 Always request ECN on outgoing connection attempts.
+ 2 Enable ECN when requested by incomming connections
+ but do not request ECN on outgoing connections.
Default: 2
tcp_fack - BOOLEAN
@@ -215,15 +224,14 @@ tcp_fack - BOOLEAN
The value is not used, if tcp_sack is not enabled.
tcp_fin_timeout - INTEGER
- Time to hold socket in state FIN-WAIT-2, if it was closed
- by our side. Peer can be broken and never close its side,
- or even died unexpectedly. Default value is 60sec.
- Usual value used in 2.2 was 180 seconds, you may restore
- it, but remember that if your machine is even underloaded WEB server,
- you risk to overflow memory with kilotons of dead sockets,
- FIN-WAIT-2 sockets are less dangerous than FIN-WAIT-1,
- because they eat maximum 1.5K of memory, but they tend
- to live longer. Cf. tcp_max_orphans.
+ The length of time an orphaned (no longer referenced by any
+ application) connection will remain in the FIN_WAIT_2 state
+ before it is aborted at the local end. While a perfectly
+ valid "receive only" state for an un-orphaned connection, an
+ orphaned connection in FIN_WAIT_2 state could otherwise wait
+ forever for the remote to close its end of the connection.
+ Cf. tcp_max_orphans
+ Default: 60 seconds
tcp_frto - INTEGER
Enables Forward RTO-Recovery (F-RTO) defined in RFC4138.
@@ -1514,6 +1522,20 @@ cookie_preserve_enable - BOOLEAN
Default: 1
+cookie_hmac_alg - STRING
+ Select the hmac algorithm used when generating the cookie value sent by
+ a listening sctp socket to a connecting client in the INIT-ACK chunk.
+ Valid values are:
+ * md5
+ * sha1
+ * none
+ Ability to assign md5 or sha1 as the selected alg is predicated on the
+ configuarion of those algorithms at build time (CONFIG_CRYPTO_MD5 and
+ CONFIG_CRYPTO_SHA1).
+
+ Default: Dependent on configuration. MD5 if available, else SHA1 if
+ available, else none.
+
rcvbuf_policy - INTEGER
Determines if the receive buffer is attributed to the socket or to
association. SCTP supports the capability to create multiple
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 1c08a4b0981..94444b152fb 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -3,9 +3,9 @@
--------------------------------------------------------------------------------
This file documents the mmap() facility available with the PACKET
-socket interface on 2.4 and 2.6 kernels. This type of sockets is used for
-capture network traffic with utilities like tcpdump or any other that needs
-raw access to network interface.
+socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for
+i) capture network traffic with utilities like tcpdump, ii) transmit network
+traffic, or any other that needs raw access to network interface.
You can find the latest version of this document at:
http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
@@ -21,19 +21,18 @@ Please send your comments to
+ Why use PACKET_MMAP
--------------------------------------------------------------------------------
-In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very
-inefficient. It uses very limited buffers and requires one system call
-to capture each packet, it requires two if you want to get packet's
-timestamp (like libpcap always does).
+In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
+inefficient. It uses very limited buffers and requires one system call to
+capture each packet, it requires two if you want to get packet's timestamp
+(like libpcap always does).
In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
configurable circular buffer mapped in user space that can be used to either
send or receive packets. This way reading packets just needs to wait for them,
most of the time there is no need to issue a single system call. Concerning
transmission, multiple packets can be sent through one system call to get the
-highest bandwidth.
-By using a shared buffer between the kernel and the user also has the benefit
-of minimizing packet copies.
+highest bandwidth. By using a shared buffer between the kernel and the user
+also has the benefit of minimizing packet copies.
It's fine to use PACKET_MMAP to improve the performance of the capture and
transmission process, but it isn't everything. At least, if you are capturing
@@ -41,7 +40,8 @@ at high speeds (this is relative to the cpu speed), you should check if the
device driver of your network interface card supports some sort of interrupt
load mitigation or (even better) if it supports NAPI, also make sure it is
enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
-supported by devices of your network.
+supported by devices of your network. CPU IRQ pinning of your network interface
+card can also be an advantage.
--------------------------------------------------------------------------------
+ How to use mmap() to improve capture process
@@ -87,9 +87,7 @@ the following process:
socket creation and destruction is straight forward, and is done
the same way with or without PACKET_MMAP:
-int fd;
-
-fd= socket(PF_PACKET, mode, htons(ETH_P_ALL))
+ int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
where mode is SOCK_RAW for the raw interface were link level
information can be captured or SOCK_DGRAM for the cooked
@@ -163,11 +161,23 @@ As capture, each frame contains two parts:
A complete tutorial is available at: http://wiki.gnu-log.net/
+By default, the user should put data at :
+ frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
+
+So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
+the beginning of the user data will be at :
+ frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+If you wish to put user data at a custom offset from the beginning of
+the frame (for payload alignment with SOCK_RAW mode for instance) you
+can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
+to make this work it must be enabled previously with setsockopt()
+and the PACKET_TX_HAS_OFF option.
+
--------------------------------------------------------------------------------
+ PACKET_MMAP settings
--------------------------------------------------------------------------------
-
To setup PACKET_MMAP from user level code is done with a call like
- Capture process
@@ -201,7 +211,6 @@ indeed, packet_set_ring checks that the following condition is true
frames_per_block * tp_block_nr == tp_frame_nr
-
Lets see an example, with the following values:
tp_block_size= 4096
@@ -227,7 +236,6 @@ be spawned across two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".
-
--------------------------------------------------------------------------------
+ PACKET_MMAP setting constraints
--------------------------------------------------------------------------------
@@ -264,7 +272,6 @@ User space programs can include /usr/include/sys/user.h and
The pagesize can also be determined dynamically with the getpagesize (2)
system call.
-
Block number limit
--------------------
@@ -284,7 +291,6 @@ called pg_vec, its size limits the number of blocks that can be allocated.
v block #2
block #1
-
kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator which is at the end the responsible for doing the allocation and
@@ -299,7 +305,6 @@ pointers to blocks is
131072/4 = 32768 blocks
-
PACKET_MMAP buffer size calculator
------------------------------------
@@ -340,7 +345,6 @@ and a value for <frame size> of 2048 bytes. These parameters will yield
and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames
-
Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space, in the case of
an i386 kernel's memory size is limited to 1GiB.
@@ -372,7 +376,6 @@ the following (from include/linux/if_packet.h):
- Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
- Pad to align to TPACKET_ALIGNMENT=16
*/
-
The following are conditions that are checked in packet_set_ring
@@ -413,7 +416,6 @@ and the following flags apply:
#define TP_STATUS_LOSING 4
#define TP_STATUS_CSUMNOTREADY 8
-
TP_STATUS_COPY : This flag indicates that the frame (and associated
meta information) has been truncated because it's
larger than tp_frame_size. This packet can be
@@ -462,7 +464,6 @@ packets are in the ring:
It doesn't incur in a race condition to first check the status value and
then poll for frames.
-
++ Transmission process
Those defines are also used for transmission:
@@ -494,6 +495,196 @@ The user can also use poll() to check if a buffer is available:
retval = poll(&pfd, 1, timeout);
-------------------------------------------------------------------------------
++ What TPACKET versions are available and when to use them?
+-------------------------------------------------------------------------------
+
+ int val = tpacket_version;
+ setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+ getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+
+where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
+
+TPACKET_V1:
+ - Default if not otherwise specified by setsockopt(2)
+ - RX_RING, TX_RING available
+ - VLAN metadata information available for packets
+ (TP_STATUS_VLAN_VALID)
+
+TPACKET_V1 --> TPACKET_V2:
+ - Made 64 bit clean due to unsigned long usage in TPACKET_V1
+ structures, thus this also works on 64 bit kernel with 32 bit
+ userspace and the like
+ - Timestamp resolution in nanoseconds instead of microseconds
+ - RX_RING, TX_RING available
+ - How to switch to TPACKET_V2:
+ 1. Replace struct tpacket_hdr by struct tpacket2_hdr
+ 2. Query header len and save
+ 3. Set protocol version to 2, set up ring as usual
+ 4. For getting the sockaddr_ll,
+ use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
+ (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+TPACKET_V2 --> TPACKET_V3:
+ - Flexible buffer implementation:
+ 1. Blocks can be configured with non-static frame-size
+ 2. Read/poll is at a block-level (as opposed to packet-level)
+ 3. Added poll timeout to avoid indefinite user-space wait
+ on idle links
+ 4. Added user-configurable knobs:
+ 4.1 block::timeout
+ 4.2 tpkt_hdr::sk_rxhash
+ - RX Hash data available in user space
+ - Currently only RX_RING available
+
+-------------------------------------------------------------------------------
++ AF_PACKET fanout mode
+-------------------------------------------------------------------------------
+
+In the AF_PACKET fanout mode, packet reception can be load balanced among
+processes. This also works in combination with mmap(2) on packet sockets.
+
+Minimal example code by David S. Miller (try things like "./test eth0 hash",
+"./test eth0 lb", etc.):
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#include <unistd.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+
+#include <net/if.h>
+
+static const char *device_name;
+static int fanout_type;
+static int fanout_id;
+
+#ifndef PACKET_FANOUT
+# define PACKET_FANOUT 18
+# define PACKET_FANOUT_HASH 0
+# define PACKET_FANOUT_LB 1
+#endif
+
+static int setup_socket(void)
+{
+ int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
+ struct sockaddr_ll ll;
+ struct ifreq ifr;
+ int fanout_arg;
+
+ if (fd < 0) {
+ perror("socket");
+ return EXIT_FAILURE;
+ }
+
+ memset(&ifr, 0, sizeof(ifr));
+ strcpy(ifr.ifr_name, device_name);
+ err = ioctl(fd, SIOCGIFINDEX, &ifr);
+ if (err < 0) {
+ perror("SIOCGIFINDEX");
+ return EXIT_FAILURE;
+ }
+
+ memset(&ll, 0, sizeof(ll));
+ ll.sll_family = AF_PACKET;
+ ll.sll_ifindex = ifr.ifr_ifindex;
+ err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+ if (err < 0) {
+ perror("bind");
+ return EXIT_FAILURE;
+ }
+
+ fanout_arg = (fanout_id | (fanout_type << 16));
+ err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
+ &fanout_arg, sizeof(fanout_arg));
+ if (err) {
+ perror("setsockopt");
+ return EXIT_FAILURE;
+ }
+
+ return fd;
+}
+
+static void fanout_thread(void)
+{
+ int fd = setup_socket();
+ int limit = 10000;
+
+ if (fd < 0)
+ exit(fd);
+
+ while (limit-- > 0) {
+ char buf[1600];
+ int err;
+
+ err = read(fd, buf, sizeof(buf));
+ if (err < 0) {
+ perror("read");
+ exit(EXIT_FAILURE);
+ }
+ if ((limit % 10) == 0)
+ fprintf(stdout, "(%d) \n", getpid());
+ }
+
+ fprintf(stdout, "%d: Received 10000 packets\n", getpid());
+
+ close(fd);
+ exit(0);
+}
+
+int main(int argc, char **argp)
+{
+ int fd, err;
+ int i;
+
+ if (argc != 3) {
+ fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
+ return EXIT_FAILURE;
+ }
+
+ if (!strcmp(argp[2], "hash"))
+ fanout_type = PACKET_FANOUT_HASH;
+ else if (!strcmp(argp[2], "lb"))
+ fanout_type = PACKET_FANOUT_LB;
+ else {
+ fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
+ exit(EXIT_FAILURE);
+ }
+
+ device_name = argp[1];
+ fanout_id = getpid() & 0xffff;
+
+ for (i = 0; i < 4; i++) {
+ pid_t pid = fork();
+
+ switch (pid) {
+ case 0:
+ fanout_thread();
+
+ case -1:
+ perror("fork");
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ for (i = 0; i < 4; i++) {
+ int status;
+
+ wait(&status);
+ }
+
+ return 0;
+}
+
+-------------------------------------------------------------------------------
+ PACKET_TIMESTAMP
-------------------------------------------------------------------------------
@@ -519,6 +710,13 @@ the networking stack is used (the behavior before this setting was added).
See include/linux/net_tstamp.h and Documentation/networking/timestamping
for more information on hardware timestamps.
+-------------------------------------------------------------------------------
++ Miscellaneous bits
+-------------------------------------------------------------------------------
+
+- Packet sockets work well together with Linux socket filters, thus you also
+ might want to have a look at Documentation/networking/filter.txt
+
--------------------------------------------------------------------------------
+ THANKS
--------------------------------------------------------------------------------
diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
index ef9ee71b4d7..f9fa6db40a5 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -29,11 +29,9 @@ The kernel configuration option is STMMAC_ETH:
dma_txsize: DMA tx ring size;
buf_sz: DMA buffer size;
tc: control the HW FIFO threshold;
- tx_coe: Enable/Disable Tx Checksum Offload engine;
watchdog: transmit timeout (in milliseconds);
flow_ctrl: Flow control ability [on/off];
pause: Flow Control Pause Time;
- tmrate: timer period (only if timer optimisation is configured).
3) Command line options
Driver parameters can be also passed in command line by using:
@@ -60,17 +58,19 @@ Then the poll method will be scheduled at some future point.
The incoming packets are stored, by the DMA, in a list of pre-allocated socket
buffers in order to avoid the memcpy (Zero-copy).
-4.3) Timer-Driver Interrupt
-Instead of having the device that asynchronously notifies the frame receptions,
-the driver configures a timer to generate an interrupt at regular intervals.
-Based on the granularity of the timer, the frames that are received by the
-device will experience different levels of latency. Some NICs have dedicated
-timer device to perform this task. STMMAC can use either the RTC device or the
-TMU channel 2 on STLinux platforms.
-The timers frequency can be passed to the driver as parameter; when change it,
-take care of both hardware capability and network stability/performance impact.
-Several performance tests on STM platforms showed this optimisation allows to
-spare the CPU while having the maximum throughput.
+4.3) Interrupt Mitigation
+The driver is able to mitigate the number of its DMA interrupts
+using NAPI for the reception on chips older than the 3.50.
+New chips have an HW RX-Watchdog used for this mitigation.
+
+On Tx-side, the mitigation schema is based on a SW timer that calls the
+tx function (stmmac_tx) to reclaim the resource after transmitting the
+frames.
+Also there is another parameter (like a threshold) used to program
+the descriptors avoiding to set the interrupt on completion bit in
+when the frame is sent (xmit).
+
+Mitigation parameters can be tuned by ethtool.
4.4) WOL
Wake up on Lan feature through Magic and Unicast frames are supported for the
@@ -121,6 +121,7 @@ struct plat_stmmacenet_data {
int bugged_jumbo;
int pmt;
int force_sf_dma_mode;
+ int riwt_off;
void (*fix_mac_speed)(void *priv, unsigned int speed);
void (*bus_setup)(void __iomem *ioaddr);
int (*init)(struct platform_device *pdev);
@@ -156,6 +157,7 @@ Where:
o pmt: core has the embedded power module (optional).
o force_sf_dma_mode: force DMA to use the Store and Forward mode
instead of the Threshold.
+ o riwt_off: force to disable the RX watchdog feature and switch to NAPI mode.
o fix_mac_speed: this callback is used for modifying some syscfg registers
(on ST SoCs) according to the link speed negotiated by the
physical layer .