diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/feature-removal-schedule.txt | 27 | ||||
-rw-r--r-- | Documentation/networking/ip-sysctl.txt | 3 | ||||
-rw-r--r-- | Documentation/networking/l2tp.txt | 169 | ||||
-rw-r--r-- | Documentation/networking/multiqueue.txt | 111 | ||||
-rw-r--r-- | Documentation/networking/netdevices.txt | 38 |
5 files changed, 321 insertions, 27 deletions
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 281458b47d7..0599a0c7c02 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -262,25 +262,6 @@ Who: Richard Purdie <rpurdie@rpsys.net> --------------------------- -What: Multipath cached routing support in ipv4 -When: in 2.6.23 -Why: Code was merged, then submitter immediately disappeared leaving - us with no maintainer and lots of bugs. The code should not have - been merged in the first place, and many aspects of it's - implementation are blocking more critical core networking - development. It's marked EXPERIMENTAL and no distribution - enables it because it cause obscure crashes due to unfixable bugs - (interfaces don't return errors so memory allocation can't be - handled, calling contexts of these interfaces make handling - errors impossible too because they get called after we've - totally commited to creating a route object, for example). - This problem has existed for years and no forward progress - has ever been made, and nobody steps up to try and salvage - this code, so we're going to finally just get rid of it. -Who: David S. Miller <davem@davemloft.net> - ---------------------------- - What: read_dev_chars(), read_conf_data{,_lpm}() (s390 common I/O layer) When: December 2007 Why: These functions are a leftover from 2.4 times. They have several @@ -337,3 +318,11 @@ Who: Jean Delvare <khali@linux-fr.org> --------------------------- +What: iptables SAME target +When: 1.1. 2008 +Files: net/ipv4/netfilter/ipt_SAME.c, include/linux/netfilter_ipv4/ipt_SAME.h +Why: Obsolete for multiple years now, NAT core provides the same behaviour. + Unfixable broken wrt. 32/64 bit cleanness. +Who: Patrick McHardy <kaber@trash.net> + +--------------------------- diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index af6a63ab902..09c184e41cf 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -874,8 +874,7 @@ accept_redirects - BOOLEAN accept_source_route - INTEGER Accept source routing (routing extension header). - > 0: Accept routing header. - = 0: Accept only routing header type 2. + >= 0: Accept only routing header type 2. < 0: Do not accept routing header. Default: 0 diff --git a/Documentation/networking/l2tp.txt b/Documentation/networking/l2tp.txt new file mode 100644 index 00000000000..2451f551c50 --- /dev/null +++ b/Documentation/networking/l2tp.txt @@ -0,0 +1,169 @@ +This brief document describes how to use the kernel's PPPoL2TP driver +to provide L2TP functionality. L2TP is a protocol that tunnels one or +more PPP sessions over a UDP tunnel. It is commonly used for VPNs +(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP +network infrastructure. + +Design +====== + +The PPPoL2TP driver, drivers/net/pppol2tp.c, provides a mechanism by +which PPP frames carried through an L2TP session are passed through +the kernel's PPP subsystem. The standard PPP daemon, pppd, handles all +PPP interaction with the peer. PPP network interfaces are created for +each local PPP endpoint. + +The L2TP protocol http://www.faqs.org/rfcs/rfc2661.html defines L2TP +control and data frames. L2TP control frames carry messages between +L2TP clients/servers and are used to setup / teardown tunnels and +sessions. An L2TP client or server is implemented in userspace and +will use a regular UDP socket per tunnel. L2TP data frames carry PPP +frames, which may be PPP control or PPP data. The kernel's PPP +subsystem arranges for PPP control frames to be delivered to pppd, +while data frames are forwarded as usual. + +Each tunnel and session within a tunnel is assigned a unique tunnel_id +and session_id. These ids are carried in the L2TP header of every +control and data packet. The pppol2tp driver uses them to lookup +internal tunnel and/or session contexts. Zero tunnel / session ids are +treated specially - zero ids are never assigned to tunnels or sessions +in the network. In the driver, the tunnel context keeps a pointer to +the tunnel UDP socket. The session context keeps a pointer to the +PPPoL2TP socket, as well as other data that lets the driver interface +to the kernel PPP subsystem. + +Note that the pppol2tp kernel driver handles only L2TP data frames; +L2TP control frames are simply passed up to userspace in the UDP +tunnel socket. The kernel handles all datapath aspects of the +protocol, including data packet resequencing (if enabled). + +There are a number of requirements on the userspace L2TP daemon in +order to use the pppol2tp driver. + +1. Use a UDP socket per tunnel. + +2. Create a single PPPoL2TP socket per tunnel bound to a special null + session id. This is used only for communicating with the driver but + must remain open while the tunnel is active. Opening this tunnel + management socket causes the driver to mark the tunnel socket as an + L2TP UDP encapsulation socket and flags it for use by the + referenced tunnel id. This hooks up the UDP receive path via + udp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed + in this special PPPoX socket. + +3. Create a PPPoL2TP socket per L2TP session. This is typically done + by starting pppd with the pppol2tp plugin and appropriate + arguments. A PPPoL2TP tunnel management socket (Step 2) must be + created before the first PPPoL2TP session socket is created. + +When creating PPPoL2TP sockets, the application provides information +to the driver about the socket in a socket connect() call. Source and +destination tunnel and session ids are provided, as well as the file +descriptor of a UDP socket. See struct pppol2tp_addr in +include/linux/if_ppp.h. Note that zero tunnel / session ids are +treated specially. When creating the per-tunnel PPPoL2TP management +socket in Step 2 above, zero source and destination session ids are +specified, which tells the driver to prepare the supplied UDP file +descriptor for use as an L2TP tunnel socket. + +Userspace may control behavior of the tunnel or session using +setsockopt and ioctl on the PPPoX socket. The following socket +options are supported:- + +DEBUG - bitmask of debug message categories. See below. +SENDSEQ - 0 => don't send packets with sequence numbers + 1 => send packets with sequence numbers +RECVSEQ - 0 => receive packet sequence numbers are optional + 1 => drop receive packets without sequence numbers +LNSMODE - 0 => act as LAC. + 1 => act as LNS. +REORDERTO - reorder timeout (in millisecs). If 0, don't try to reorder. + +Only the DEBUG option is supported by the special tunnel management +PPPoX socket. + +In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided +to retrieve tunnel and session statistics from the kernel using the +PPPoX socket of the appropriate tunnel or session. + +Debugging +========= + +The driver supports a flexible debug scheme where kernel trace +messages may be optionally enabled per tunnel and per session. Care is +needed when debugging a live system since the messages are not +rate-limited and a busy system could be swamped. Userspace uses +setsockopt on the PPPoX socket to set a debug mask. + +The following debug mask bits are available: + +PPPOL2TP_MSG_DEBUG verbose debug (if compiled in) +PPPOL2TP_MSG_CONTROL userspace - kernel interface +PPPOL2TP_MSG_SEQ sequence numbers handling +PPPOL2TP_MSG_DATA data packets + +Sample Userspace Code +===================== + +1. Create tunnel management PPPoX socket + + kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP); + if (kernel_fd >= 0) { + struct sockaddr_pppol2tp sax; + struct sockaddr_in const *peer_addr; + + peer_addr = l2tp_tunnel_get_peer_addr(tunnel); + memset(&sax, 0, sizeof(sax)); + sax.sa_family = AF_PPPOX; + sax.sa_protocol = PX_PROTO_OL2TP; + sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */ + sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr; + sax.pppol2tp.addr.sin_port = peer_addr->sin_port; + sax.pppol2tp.addr.sin_family = AF_INET; + sax.pppol2tp.s_tunnel = tunnel_id; + sax.pppol2tp.s_session = 0; /* special case: mgmt socket */ + sax.pppol2tp.d_tunnel = 0; + sax.pppol2tp.d_session = 0; /* special case: mgmt socket */ + + if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) { + perror("connect failed"); + result = -errno; + goto err; + } + } + +2. Create session PPPoX data socket + + struct sockaddr_pppol2tp sax; + int fd; + + /* Note, the target socket must be bound already, else it will not be ready */ + sax.sa_family = AF_PPPOX; + sax.sa_protocol = PX_PROTO_OL2TP; + sax.pppol2tp.fd = tunnel_fd; + sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr; + sax.pppol2tp.addr.sin_port = addr->sin_port; + sax.pppol2tp.addr.sin_family = AF_INET; + sax.pppol2tp.s_tunnel = tunnel_id; + sax.pppol2tp.s_session = session_id; + sax.pppol2tp.d_tunnel = peer_tunnel_id; + sax.pppol2tp.d_session = peer_session_id; + + /* session_fd is the fd of the session's PPPoL2TP socket. + * tunnel_fd is the fd of the tunnel UDP socket. + */ + fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax)); + if (fd < 0 ) { + return -errno; + } + return 0; + +Miscellanous +============ + +The PPPoL2TP driver was developed as part of the OpenL2TP project by +Katalix Systems Ltd. OpenL2TP is a full-featured L2TP client / server, +designed from the ground up to have the L2TP datapath in the +kernel. The project also implemented the pppol2tp plugin for pppd +which allows pppd to use the kernel driver. Details can be found at +http://openl2tp.sourceforge.net. diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt new file mode 100644 index 00000000000..00b60cce222 --- /dev/null +++ b/Documentation/networking/multiqueue.txt @@ -0,0 +1,111 @@ + + HOWTO for multiqueue network device support + =========================================== + +Section 1: Base driver requirements for implementing multiqueue support +Section 2: Qdisc support for multiqueue devices +Section 3: Brief howto using PRIO or RR for multiqueue devices + + +Intro: Kernel support for multiqueue devices +--------------------------------------------------------- + +Kernel support for multiqueue devices is only an API that is presented to the +netdevice layer for base drivers to implement. This feature is part of the +core networking stack, and all network devices will be running on the +multiqueue-aware stack. If a base driver only has one queue, then these +changes are transparent to that driver. + + +Section 1: Base driver requirements for implementing multiqueue support +----------------------------------------------------------------------- + +Base drivers are required to use the new alloc_etherdev_mq() or +alloc_netdev_mq() functions to allocate the subqueues for the device. The +underlying kernel API will take care of the allocation and deallocation of +the subqueue memory, as well as netdev configuration of where the queues +exist in memory. + +The base driver will also need to manage the queues as it does the global +netdev->queue_lock today. Therefore base drivers should use the +netif_{start|stop|wake}_subqueue() functions to manage each queue while the +device is still operational. netdev->queue_lock is still used when the device +comes online or when it's completely shut down (unregister_netdev(), etc.). + +Finally, the base driver should indicate that it is a multiqueue device. The +feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features +bitmap on device initialization. Below is an example from e1000: + +#ifdef CONFIG_E1000_MQ + if ( (adapter->hw.mac.type == e1000_82571) || + (adapter->hw.mac.type == e1000_82572) || + (adapter->hw.mac.type == e1000_80003es2lan)) + netdev->features |= NETIF_F_MULTI_QUEUE; +#endif + + +Section 2: Qdisc support for multiqueue devices +----------------------------------------------- + +Currently two qdiscs support multiqueue devices. A new round-robin qdisc, +sch_rr, and sch_prio. The qdisc is responsible for classifying the skb's to +bands and queues, and will store the queue mapping into skb->queue_mapping. +Use this field in the base driver to determine which queue to send the skb +to. + +sch_rr has been added for hardware that doesn't want scheduling policies from +software, so it's a straight round-robin qdisc. It uses the same syntax and +classification priomap that sch_prio uses, so it should be intuitive to +configure for people who've used sch_prio. + +The PRIO qdisc naturally plugs into a multiqueue device. If PRIO has been +built with NET_SCH_PRIO_MQ, then upon load, it will make sure the number of +bands requested is equal to the number of queues on the hardware. If they +are equal, it sets a one-to-one mapping up between the queues and bands. If +they're not equal, it will not load the qdisc. This is the same behavior +for RR. Once the association is made, any skb that is classified will have +skb->queue_mapping set, which will allow the driver to properly queue skb's +to multiple queues. + + +Section 3: Brief howto using PRIO and RR for multiqueue devices +--------------------------------------------------------------- + +The userspace command 'tc,' part of the iproute2 package, is used to configure +qdiscs. To add the PRIO qdisc to your network device, assuming the device is +called eth0, run the following command: + +# tc qdisc add dev eth0 root handle 1: prio bands 4 multiqueue + +This will create 4 bands, 0 being highest priority, and associate those bands +to the queues on your NIC. Assuming eth0 has 4 Tx queues, the band mapping +would look like: + +band 0 => queue 0 +band 1 => queue 1 +band 2 => queue 2 +band 3 => queue 3 + +Traffic will begin flowing through each queue if your TOS values are assigning +traffic across the various bands. For example, ssh traffic will always try to +go out band 0 based on TOS -> Linux priority conversion (realtime traffic), +so it will be sent out queue 0. ICMP traffic (pings) fall into the "normal" +traffic classification, which is band 1. Therefore pings will be send out +queue 1 on the NIC. + +Note the use of the multiqueue keyword. This is only in versions of iproute2 +that support multiqueue networking devices; if this is omitted when loading +a qdisc onto a multiqueue device, the qdisc will load and operate the same +if it were loaded onto a single-queue device (i.e. - sends all traffic to +queue 0). + +Another alternative to multiqueue band allocation can be done by using the +multiqueue option and specify 0 bands. If this is the case, the qdisc will +allocate the number of bands to equal the number of queues that the device +reports, and bring the qdisc online. + +The behavior of tc filters remains the same, where it will override TOS priority +classification. + + +Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com> diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt index ce1361f9524..37869295fc7 100644 --- a/Documentation/networking/netdevices.txt +++ b/Documentation/networking/netdevices.txt @@ -20,6 +20,30 @@ private data which gets freed when the network device is freed. If separately allocated data is attached to the network device (dev->priv) then it is up to the module exit handler to free that. +MTU +=== +Each network device has a Maximum Transfer Unit. The MTU does not +include any link layer protocol overhead. Upper layer protocols must +not pass a socket buffer (skb) to a device to transmit with more data +than the mtu. The MTU does not include link layer header overhead, so +for example on Ethernet if the standard MTU is 1500 bytes used, the +actual skb will contain up to 1514 bytes because of the Ethernet +header. Devices should allow for the 4 byte VLAN header as well. + +Segmentation Offload (GSO, TSO) is an exception to this rule. The +upper layer protocol may pass a large socket buffer to the device +transmit routine, and the device will break that up into separate +packets based on the current MTU. + +MTU is symmetrical and applies both to receive and transmit. A device +must be able to receive at least the maximum size packet allowed by +the MTU. A network device may use the MTU as mechanism to size receive +buffers, but the device should allow packets with VLAN header. With +standard Ethernet mtu of 1500 bytes, the device should allow up to +1518 byte packets (1500 + 14 header + 4 tag). The device may either: +drop, truncate, or pass up oversize packets, but dropping oversize +packets is preferred. + struct net_device synchronization rules ======================================= @@ -43,16 +67,17 @@ dev->get_stats: dev->hard_start_xmit: Synchronization: netif_tx_lock spinlock. + When the driver sets NETIF_F_LLTX in dev->features this will be called without holding netif_tx_lock. In this case the driver has to lock by itself when needed. It is recommended to use a try lock - for this and return -1 when the spin lock fails. + for this and return NETDEV_TX_LOCKED when the spin lock fails. The locking there should also properly protect against - set_multicast_list - Context: Process with BHs disabled or BH (timer). - Notes: netif_queue_stopped() is guaranteed false - Interrupts must be enabled when calling hard_start_xmit. - (Interrupts must also be enabled when enabling the BH handler.) + set_multicast_list. + + Context: Process with BHs disabled or BH (timer), + will be called with interrupts disabled by netconsole. + Return codes: o NETDEV_TX_OK everything ok. o NETDEV_TX_BUSY Cannot transmit packet, try later @@ -74,4 +99,5 @@ dev->poll: Synchronization: __LINK_STATE_RX_SCHED bit in dev->state. See dev_close code and comments in net/core/dev.c for more info. Context: softirq + will be called with interrupts disabled by netconsole. |