Alexander Duyck [Mon, 4 May 2015 21:34:05 +0000 (14:34 -0700)]
openvswitch: Use eth_proto_is_802_3
Replace "ntohs(proto) >= ETH_P_802_3_MIN" w/ eth_proto_is_802_3(proto).
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 4 May 2015 21:33:59 +0000 (14:33 -0700)]
ipv4/ip_tunnel_core: Use eth_proto_is_802_3
Replace "ntohs(proto) >= ETH_P_802_3_MIN" w/ eth_proto_is_802_3(proto).
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 4 May 2015 21:33:54 +0000 (14:33 -0700)]
ebtables: Use eth_proto_is_802_3
Replace "ntohs(proto) >= ETH_P_802_3_MIN" w/ eth_proto_is_802_3(proto).
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 4 May 2015 21:33:48 +0000 (14:33 -0700)]
etherdev: Fix sparse error, make test usable by other functions
This change does two things. First it fixes a sparse error for the fact
that the __be16 degrades to an integer. Since that is actually what I am
kind of doing I am simply working around that by forcing both sides of the
comparison to u16.
Also I realized on some compilers I was generating another instruction for
big endian systems such as PowerPC since it was masking the value before
doing the comparison. So to resolve that I have simply pulled the mask out
and wrapped it in an #ifndef __BIG_ENDIAN.
Lastly I pulled this all out into its own function. I notices there are
similar checks in a number of other places so this function can be reused
there to help reduce overhead in these paths as well.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bernhard Thaler [Mon, 4 May 2015 20:47:13 +0000 (22:47 +0200)]
bridge: change BR_GROUPFWD_RESTRICTED to allow forwarding of LLDP frames
BR_GROUPFWD_RESTRICTED bitmask restricts users from setting values to
/sys/class/net/brX/bridge/group_fwd_mask that allow forwarding of
some IEEE 802.1D Table 7-10 Reserved addresses:
(MAC Control) 802.3 01-80-C2-00-00-01
(Link Aggregation) 802.3 01-80-C2-00-00-02
802.1AB LLDP 01-80-C2-00-00-0E
Change BR_GROUPFWD_RESTRICTED to allow to forward LLDP frames and document
group_fwd_mask.
e.g.
echo 16384 > /sys/class/net/brX/bridge/group_fwd_mask
allows to forward LLDP frames.
This may be needed for bridge setups used for network troubleshooting or
any other scenario where forwarding of LLDP frames is desired (e.g. bridge
connecting a virtual machine to real switch transmitting LLDP frames that
virtual machine needs to receive).
Tested on a simple bridge setup with two interfaces and host transmitting
LLDP frames on one side of this bridge (used lldpd). Setting group_fwd_mask
as described above lets LLDP frames traverse bridge.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 4 May 2015 04:34:46 +0000 (21:34 -0700)]
tcp: provide SYN headers for passive connections
This patch allows a server application to get the TCP SYN headers for
its passive connections. This is useful if the server is doing
fingerprinting of clients based on SYN packet contents.
Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.
The first is used on a socket to enable saving the SYN headers
for child connections. This can be set before or after the listen()
call.
The latter is used to retrieve the SYN headers for passive connections,
if the parent listener has enabled TCP_SAVE_SYN.
TCP_SAVED_SYN is read once, it frees the saved SYN headers.
The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
headers.
Original patch was written by Tom Herbert, I changed it to not hold
a full skb (and associated dst and conntracking reference).
We have used such patch for about 3 years at Google.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Lüssing [Mon, 4 May 2015 22:19:35 +0000 (00:19 +0200)]
net: fix two sparse warnings introduced by IGMP/MLD parsing exports
> net/core/skbuff.c:4108:13: sparse: incorrect type in assignment (different base types)
> net/ipv6/mcast_snoop.c:63 ipv6_mc_check_exthdrs() warn: unsigned 'offset' is never less than zero.
Introduced by
9afd85c9e4552b276e2f4cfefd622bdeeffbbf26
("net: Export IGMP/MLD message validation code")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 19:37:08 +0000 (15:37 -0400)]
Merge branch 'master' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-05-04
This series contains updates to igb, e100, e1000e and ixgbe.
Todd cleans up igb_enable_mas() since it should only be called for the
82575 silicon and has no clear return, so modify the function to void.
Jean Sacren found upon inspection that 'err' did not need to be
initialized, since it is immediately overwritten.
Alex Duyck provides two patches for e1000e, the first cleans up the
handling VLAN_HLEN as a part of max frame size. Fixes the issue:
c751a3d58cf2d ("e1000e: Correctly include VLAN_HLEN when changing
interface MTU"). The second fixes an issue where the driver was not
allowing jumbo frames to be enabled when CRC stripping was disabled,
however it was allowing CRC stripping to be disabled while jumbo frames
were enabled.
Jeff (me) fixes a warning found on PPC where the use of do_div() needed
to use u64 arg and not s64.
Mark provides three ixgbe patches, first to fix the Intel On-chip System
Fabric (IOSF) Sideband message interfaces, to serialize access using both
PHY bits in the SWFW_SEMAPHORE register. Then fixes how semaphore bits
were released, since they should be released in reverse of the order that
they were taken. Lastly updates ixgbe to use a signed type to hold
error codes, since error codes are negative, so consistently use signed
types when handling them.
v2: dropped the previous #6-#8 patches by Hiroshi Shimanoto based on
feedback from Or Gerlitz (and David Miller) that it appears there
needs to be further discussion on how this gets implemented.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 19:04:02 +0000 (15:04 -0400)]
Merge branch 'tipc-topology-cleanup'
Ying Xue says:
====================
tipc: cleanup topology server
Not only function names declared in subscr.c are very confused, but
also topology server's locking policy is not designed very well, for
instance, usually leading to panic in some special corner cases.
In this series, we attempt to eliminate the confusion of function names
and simplify topology server's locking policy to solve above mentioned
issues. More importantly, the change will make relevant code easily
understandable and maintainable.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Mon, 4 May 2015 02:36:48 +0000 (10:36 +0800)]
tipc: deal with return value of tipc_conn_new callback
Once tipc_conn_new() returns NULL, the connection should be shut
down immediately, otherwise, oops may happen due to the NULL pointer.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Mon, 4 May 2015 02:36:47 +0000 (10:36 +0800)]
tipc: adjust locking policy of subscription
Currently subscriber's lock protects not only subscriber's subscription
list but also all subscriptions linked into the list. However, as all
members of subscription are never changed after they are initialized,
it's unnecessary for subscription to be protected under subscriber's
lock. If the lock is used to only protect subscriber's subscription
list, the adjustment not only makes the locking policy simpler, but
also helps to avoid a deadlock which may happen once creating a
subscription is failed.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Mon, 4 May 2015 02:36:46 +0000 (10:36 +0800)]
tipc: involve reference counter for subscriber
At present subscriber's lock is used to protect the subscription list
of subscriber as well as subscriptions linked into the list. While one
or all subscriptions are deleted through iterating the list, the
subscriber's lock must be held. Meanwhile, as deletion of subscription
may happen in subscription timer's handler, the lock must be grabbed
in the function as well. When subscription's timer is terminated with
del_timer_sync() during above iteration, subscriber's lock has to be
temporarily released, otherwise, deadlock may occur. However, the
temporary release may cause the double free of a subscription as the
subscription is not disconnected from the subscription list.
Now if a reference counter is introduced to subscriber, subscription's
timer can be asynchronously stopped with del_timer(). As a result, the
issue is not only able to be fixed, but also relevant code is pretty
readable and understandable.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Mon, 4 May 2015 02:36:45 +0000 (10:36 +0800)]
tipc: introduce tipc_subscrb_create routine
Introducing a new function makes the purpose of tipc_subscrb_connect_cb
callback routine more clear.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Mon, 4 May 2015 02:36:44 +0000 (10:36 +0800)]
tipc: rename functions defined in subscr.c
When a topology server accepts a connection request from its client,
it allocates a connection instance and a tipc_subscriber structure
object. The former is used to communicate with client, and the latter
is often treated as a subscriber which manages all subscription events
requested from a same client. When a topology server receives a request
of subscribing name services from a client through the connection, it
creates a tipc_subscription structure instance which is seen as a
subscription recording what name services are subscribed. In order to
manage all subscriptions from a same client, topology server links
them into the subscrp_list of the subscriber. So subscriber and
subscription completely represents different meanings respectively,
but function names associated with them make us so confused that we
are unable to easily tell which function is against subscriber and
which is to subscription. So we want to eliminate the confusion by
renaming them.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 18:49:23 +0000 (14:49 -0400)]
Merge branch 'igmp_mld_export'
Linus Lüssing says:
====================
Exporting IGMP/MLD checking from bridge code
The multicast optimizations in batman-adv are yet only usable and
enabled in non-bridged scenarios. To be able to support bridged setups
batman-adv needs to be able to detect IGMP/MLD queriers and reports on
mesh nodes without bridges, too. See the following link for details:
http://www.open-mesh.org/projects/batman-adv/wiki/Multicast-optimizations-listener-reports
To avoid duplicate code between the bridge and batman-adv, the IGMP/MLD
message validation code is moved from the bridge to the IPv4/IPv6 stack.
On the way, some refactoring to increase readability and to iron out
some subtle differences between the IGMP and MLD parsing code is done.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Lüssing [Sat, 2 May 2015 12:01:07 +0000 (14:01 +0200)]
net: Export IGMP/MLD message validation code
With this patch, the IGMP and MLD message validation functions are moved
from the bridge code to IPv4/IPv6 multicast files. Some small
refactoring was done to enhance readibility and to iron out some
differences in behaviour between the IGMP and MLD parsing code (e.g. the
skb-cloning of MLD messages is now only done if necessary, just like the
IGMP part always did).
Finally, these IGMP and MLD message validation functions are exported so
that not only the bridge can use it but batman-adv later, too.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Lüssing [Sat, 2 May 2015 12:01:06 +0000 (14:01 +0200)]
bridge: multicast: call skb_checksum_{simple_, }validate
Let's use these new, neat helpers.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jamal Hadi Salim [Sat, 2 May 2015 05:19:43 +0000 (22:19 -0700)]
tc: remove unused redirect ttl
improves ingress+u32 performance from 22.4 Mpps to 22.9 Mpps
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark Rustad [Fri, 10 Apr 2015 17:36:36 +0000 (10:36 -0700)]
ixgbe: Use a signed type to hold error codes
Because error codes are negative, it only makes sense to
consistently use signed types when handling them. Also remove
some explicit comparisons with 0 on these variables.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 10 Apr 2015 17:36:31 +0000 (10:36 -0700)]
ixgbe: Release semaphore bits in the right order
The global semaphore bits should be released in the reverse of the
order that they were taken, so correct that.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 10 Apr 2015 17:36:26 +0000 (10:36 -0700)]
ixgbe: Fix IOSF SB access issues
IOSF is the Intel On-chip System Fabric used in SOCs. IOSF SB is
the IOSF SideBand message interface. This patch serializes IOSF SB
access using both phy bits in the SWFW_SEMAPHORE register. It also
adds a helper function to wait for IOSF SB accesses to complete.
Use the new function to perform this wait before each access, as
specified in the datasheet, in addition to using it to wait for
IOSF SB read/write completion.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jeff Kirsher [Sat, 2 May 2015 08:20:04 +0000 (01:20 -0700)]
e1000e: fix call to do_div() to use u64 arg
We were using s64 for lat_ns (latency nano-second value) since in
our calculations a negative value could be a resultant. For negative
values, we then assign lat_ns to be zero, so the value passed to
do_div() was never negative, but do_div() expects the argument type
to be u64, so do a cast to resolve a compile warning seen on
PowerPC.
CC: Yanjiang Jin <yanjiang.jin@windriver.com>
CC: Yanir Lubetkin <yanirx.lubetkin@intel.com>
Reported-by: Yanjiang Jin <yanjiang.jin@windriver.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Alexander Duyck [Sat, 2 May 2015 08:09:59 +0000 (01:09 -0700)]
e1000e: Do not allow CRC stripping to be disabled on 82579 w/ jumbo frames
The driver wasn't allowing jumbo frames to be
enabled when CRC stripping was disabled, however it was allowing CRC
stripping to be disabled while jumbo frames were enabled. This fixes that by
making it so that the NETIF_F_RXFCS flag cannot be set when jumbo frames are
enabled on 82579 and newer parts.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 2 May 2015 07:52:00 +0000 (00:52 -0700)]
e1000e: Cleanup handling of VLAN_HLEN as a part of max frame size
When the VLAN_HLEN was added to the calculation for the maximum frame size
there seems to have been a number of issues added to the driver.
The first issue is that in some cases the maximum frame size for a device
never really reached the actual maximum frame size as the VLAN header
length was not included the calculation for that value. As a result some
parts only supported a maximum frame size of either 1496 in the case of
parts that didn't support jumbo frames, and 8996 in the case of the parts
that do.
The second issue is the fact that there were several checks that weren't
updated so as a result setting an MTU of 1500 was treated as enabling jumbo
frames as the calculated value was 1522 instead of 1518. I have addressed
those by replacing ETH_FRAME_LEN with VLAN_ETH_FRAME_LEN where appropriate.
The final issue was the fact that lowering the MTU below 1500 would cause
the driver to allocate 2K buffers for the rings. This is an old issue that
was fixed several years ago in igb/ixgbe and I am addressing now by just
replacing == with a <= so that we always just round up to 1522 for anything
that isn't a jumbo frame.
Fixes:
c751a3d58cf2d ("e1000e: Correctly include VLAN_HLEN when changing interface MTU")
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jean Sacren [Sat, 2 May 2015 07:49:26 +0000 (00:49 -0700)]
e100: don't initialize int object to zero
'err' will be overwritten so no need to initialize it to zero.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Todd Fujinaka [Sat, 2 May 2015 07:39:03 +0000 (00:39 -0700)]
igb: simplify and clean up igb_enable_mas()
igb_enable_mas() should only be called for the 82575 and has no clear
return so changing it to void. Also simplify the odd conditional
expression.
Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David S. Miller [Mon, 4 May 2015 04:18:27 +0000 (00:18 -0400)]
Merge branch 'via-rhine-rework'
Francois Romieu says:
====================
via-rhine rework
The series applies against davem-next as of
9dd3c797496affd699805c8a9d8429ad318c892f ("drivers: net: xgene: fix kbuild
warnings").
Patches #1..#4 avoid holes in the receive ring.
Patch #5 is a small leftover cleanup for #1..#4.
Patches #6 and #7 are fairly simple barrier stuff.
Patch #8 closes some SMP transmit races - not that anyone really
complained about these but it's a bit hard to handwave that they
can be safely ignored. Some testing, especially SMP testing of
course, would be welcome.
. Changes since #2:
- added dma_rmb barrier in vlan related patch 6.
- s/wmb/dma_wmb/ in (*new*) patch 7 of 8.
- added explicit SMP barriers in (*new*) patch 8 of 8.
. Changes since #1:
- turned wmb() into dma_wmb() as suggested by davem and Alexander Duyck
in patch 1 of 6.
- forgot to reset rx_head_desc in rhine_reset_rbufs in patch 4 of 6.
- removed rx_head_desc altogether in (*new*) patch 5 of 6
- remoed some vlan receive uglyness in (*new*) patch 6 of 6.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
françois romieu [Fri, 1 May 2015 20:14:45 +0000 (22:14 +0200)]
via-rhine: close SMP transmit races.
7ab87ff4c770eed71e3777936299292739fcd0fe ("via-rhine: move work from
irq handler to softirq and beyond") forgot to explicitely control the
lifespan of the tx_dirty and tx_cur pointers.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
françois romieu [Fri, 1 May 2015 20:14:44 +0000 (22:14 +0200)]
via-rhine: dma_wmb transmit barrier.
Follow the now usual transmit descriptor update path:
1. content change
2. dma_wmb
3. ownership change
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
françois romieu [Fri, 1 May 2015 20:14:43 +0000 (22:14 +0200)]
via-rhine: add consistent memory barrier in vlan receive code.
The NAPI receive path depends on desc->rx_status but it does not
enforce any explicit receive barrier.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
françois romieu [Fri, 1 May 2015 20:14:42 +0000 (22:14 +0200)]
via-rhine: kiss rx_head_desc goodbye.
The driver no longer produces holes in its receive ring so rx_head_desc
only duplicates cur_rx.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
françois romieu [Fri, 1 May 2015 20:14:41 +0000 (22:14 +0200)]
via-rhine: forbid holes in the receive descriptor ring.
Rationales:
- throttle work under memory pressure
- lower receive descriptor recycling latency for the network adapter
- lower the maintenance burden of uncommon paths
The patch is twofold:
- it fails early if the receive ring can't be completely initialized
at dev->open() time
- it drops packets on the floor in the napi receive handler so as to
keep the received ring full
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
françois romieu [Fri, 1 May 2015 20:14:40 +0000 (22:14 +0200)]
via-rhine: gotoize rhine_open error path.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
françois romieu [Fri, 1 May 2015 20:14:39 +0000 (22:14 +0200)]
via-rhine: allocate and map receive buffer in a single transaction
It's used to initialize the receive ring but it will actually shine when
the receive poll code is reworked.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
françois romieu [Fri, 1 May 2015 20:14:38 +0000 (22:14 +0200)]
via-rhine: commit receive buffer address before descriptor status update.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 04:09:09 +0000 (00:09 -0400)]
Merge branch 'flow_keys_digest'
Tom Herbert says:
====================
net: Eliminate calls to flow_dissector and introduce flow_keys_digest
In this patch set we add skb_get_hash_perturb which gets the skbuff
hash for a packet and perturbs it using a provided key and jhash1.
This function is used in serveral qdiscs and eliminates many calls
to flow_dissector and jhash3 to get a perturbed hash for a packet.
To handle the sch_choke issue (passes flow_keys in skbuff cb) we
add flow_keys_digest which is a digest of a flow constructed
from a flow_keys structure.
This is the second version of these patches I posted a while ago,
and is prerequisite work to increasing the size of the flow_keys
structure and hashing over it (full IPv6 address, flow label, VLAN ID,
etc.).
Version 2:
- Add keyval parameter to __flow_hash_from_keys which allows caller to
set the initval for jhash
- Perturb always does flow dissection and creates hash based on
input perturb value which acts as the keyval to __flow_hash_from_keys
- Added a _flow_keys_digest_data which is used in make_flow_keys_digest.
This fills out the digest by populating individual fields instead
of copying the whole structure.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 1 May 2015 18:30:18 +0000 (11:30 -0700)]
sch_choke: Use flow_keys_digest
Call make_flow_keys_digest to get a digest from flow keys and
use that to pass skbuff cb and for comparing flows.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 1 May 2015 18:30:17 +0000 (11:30 -0700)]
net: Add flow_keys digest
Some users of flow keys (well just sch_choke now) need to pass
flow_keys in skbuff cb, and use them for exact comparisons of flows
so that skb->hash is not sufficient. In order to increase size of
the flow_keys structure, we introduce another structure for
the purpose of passing flow keys in skbuff cb. We limit this structure
to sixteen bytes, and we will technically treat this as a digest of
flow_keys struct hence its name flow_keys_digest. In the first
incaranation we just copy the flow_keys structure up to 16 bytes--
this is the same information previously passed in the cb. In the
future, we'll adapt this for larger flow_keys and could use something
like SHA-1 over the whole flow_keys to improve the quality of the
digest.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 1 May 2015 18:30:16 +0000 (11:30 -0700)]
sched: Call skb_get_hash_perturb in sch_sfq
Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
jhash by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 1 May 2015 18:30:15 +0000 (11:30 -0700)]
sched: Call skb_get_hash_perturb in sch_sfb
Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
jhash by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 1 May 2015 18:30:14 +0000 (11:30 -0700)]
sched: Call skb_get_hash_perturb in sch_hhf
Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
jhash by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 1 May 2015 18:30:13 +0000 (11:30 -0700)]
sched: Call skb_get_hash_perturb in sch_fq_codel
Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
jhash by hand.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 1 May 2015 18:30:12 +0000 (11:30 -0700)]
net: Add skb_get_hash_perturb
This calls flow_disect and __skb_get_hash to procure a hash for a
packet. Input includes a key to initialize jhash. This function
does not set skb->hash.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Fri, 1 May 2015 14:39:54 +0000 (16:39 +0200)]
net: ipv4: route: Fix sending IGMP messages with link address
In setups with a global scope address on an interface, and a lesser
scope address on an interface sending IGMP reports, the reports can be
sent using the other interfaces global scope address rather than the
local interface address. RFC 2236 suggests:
Ignore the Report if you cannot identify the source address of
the packet as belonging to a subnet assigned to the interface on
which the packet was received.
since such reports could be forged.
Look at the protocol when deciding if a RT_SCOPE_LINK address should
be used for the packet.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov [Fri, 1 May 2015 03:14:07 +0000 (20:14 -0700)]
net: sched: run ingress qdisc without locks
TC classifiers/actions were converted to RCU by John in the series:
http://thread.gmane.org/gmane.linux.network/329739/focus=329739
and many follow on patches.
This is the last patch from that series that finally drops
ingress spin_lock.
Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.
In two cpu case when both cores are receiving traffic on the same
device and go into the same ingress+u32 the performance jumps
from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 03:18:02 +0000 (23:18 -0400)]
Merge branch 'tcp_sack_rttm'
Kenneth Klette Jonassen says:
====================
tcp: SACK RTTM changes for congestion control
This patch series improves SACK RTT measurements for congestion control:
o Picks the latest sequence SACKed for RTT, i.e. most accurate delay
signal.
o Calls the congestion control's pkts_acked hook with SACK RTTMs
even when not sequentially ACKing new data.
V2: amend misleading comment
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kenneth Klette Jonassen [Thu, 30 Apr 2015 23:10:59 +0000 (01:10 +0200)]
tcp: invoke pkts_acked hook on every ACK
Invoking pkts_acked is currently conditioned on FLAG_ACKED:
receiving a cumulative ACK of new data, or ACK with SYN flag set.
Remove this condition so that CC may get RTT measurements from all SACKs.
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kenneth Klette Jonassen [Thu, 30 Apr 2015 23:10:58 +0000 (01:10 +0200)]
tcp: improve RTT from SACK for CC
tcp_sacktag_one() always picks the earliest sequence SACKed for RTT.
This might not make sense for congestion control in cases where:
1. ACKs are lost, i.e. a SACK following a lost SACK covers both
new and old segments at the receiver.
2. The receiver disregards the RFC 5681 recommendation to immediately
ACK out-of-order segments.
Give congestion control a RTT for the latest segment SACKed, which is the
most accurate RTT estimate, but preserve the conservative RTT for RTO.
Removes the call to skb_mstamp_get() in tcp_sacktag_one().
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kenneth Klette Jonassen [Thu, 30 Apr 2015 23:10:57 +0000 (01:10 +0200)]
tcp: move struct tcp_sacktag_state to tcp_ack()
Later patch passes two values set in tcp_sacktag_one() to
tcp_clean_rtx_queue(). Prepare passing them via struct tcp_sacktag_state.
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 03:08:54 +0000 (23:08 -0400)]
Merge branch 'rhashtable-test'
Thomas Graf says:
====================
rhashtable self-test improvements
This series improves the rhashtable self-test to:
* Avoid allocation of test objects
* Measure the time of test runs
* Use the iterator to walk the table for consistency
* Account for failed insertions due to memory pressure or
utilization pressure
* Ignore failed insertions when checking for consistency
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Thu, 30 Apr 2015 22:37:45 +0000 (22:37 +0000)]
rhashtable-test: Detect insertion failures
Account for failed inserts due to memory pressure or EBUSY and
ignore failed entries during the consistency check.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Thu, 30 Apr 2015 22:37:44 +0000 (22:37 +0000)]
rhashtable-test: Use walker to test bucket statistics
As resizes may continue to run in the background, use walker to
ensure we see all entries. Also print the encountered number
of rehashes queued up while traversing.
This may lead to warnings due to entries being seen multiple
times. We consider them non-fatal.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Thu, 30 Apr 2015 22:37:43 +0000 (22:37 +0000)]
rhashtable-test: Do not allocate individual test objects
By far the most expensive part of the selftest was the allocation
of entries. Using a static array allows to measure the rhashtable
operations.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Thu, 30 Apr 2015 22:37:42 +0000 (22:37 +0000)]
rhashtable-test: Get rid of ptr in test_obj structure
This only blows up the size of the test structure for no gain
in test coverage. Reduces size of test_obj from 24 to 16 bytes.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Thu, 30 Apr 2015 22:37:41 +0000 (22:37 +0000)]
rhashtable-test: Measure time to insert, remove & traverse entries
Make test configurable by allowing to specify all relevant knobs
through module parameters.
Do several test runs and measure the average time it takes to
insert & remove all entries. Note, a deferred resize might still
continue to run in the background.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Thu, 30 Apr 2015 22:37:40 +0000 (22:37 +0000)]
rhashtable-test: Remove unused TEST_NEXPANDS
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 02:30:36 +0000 (22:30 -0400)]
Merge branch 'eth_type_trans'
Alexander Duyck says:
====================
A few minor clean-ups to eth_type_trans
This series addresses a few minor issues I found in eth_type_trans that
that allow us to gain back something like 3 or more cycles per packet.
The first change is to drop the byte swap since it isn't necessary. On x86
we could just check the first byte and compare that against the upper 8
bits of the Ethertype to determine if we are dealing with a size value or
not.
The second makes it so that the value we read in to test for multicast can
be used for the address comparison. This allows us to avoid a second read
of the destination address.
The final change is to avoid some unneeded instructions in computing the
Ethernet header pointer. When we start the call the Ethernet header is at
skb->data, so we just use that rather than computing mac_header, and then
adding that back to skb->head.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 30 Apr 2015 21:53:59 +0000 (14:53 -0700)]
etherdev: Use skb->data to retrieve Ethernet header instead of eth_hdr
Avoid recomputing the Ethernet header location and instead just use the
pointer provided by skb->data. The problem with using eth_hdr is that the
compiler wasn't smart enough to realize that skb->head + skb->mac_header
was the same thing as skb->data before it added ETH_HLEN. By just caching
it off before calling skb_pull_inline we can avoid a few unnecessary
instructions.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 30 Apr 2015 21:53:54 +0000 (14:53 -0700)]
etherdev: Process is_multicast_ether_addr at same size as other operations
This change makes it so that we process the address in
is_multicast_ether_addr at the same size as the other calls. This allows
us to avoid duplicate reads when used with other calls such as
is_zero_ether_addr or eth_addr_copy. In addition I have added a 64 bit
version of the function so in eth_type_trans we can process the destination
address as a 64 bit value throughout.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 30 Apr 2015 21:53:48 +0000 (14:53 -0700)]
etherdev: Avoid unnecessary byte swap in check for Ethertype
This change takes advantage of the fact that ETH_P_802_3_MIN is aligned to
512 so as a result we can actually ignore the lower 8b when comparing the
Ethertype to ETH_P_802_3_MIN. This allows us to avoid a byte swap by simply
masking the value and comparing it to the byte swapped value for
ETH_P_802_3_MIN.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Wed, 29 Apr 2015 22:33:21 +0000 (15:33 -0700)]
ipv6: Flow label state ranges
This patch divides the IPv6 flow label space into two ranges:
0-7ffff is reserved for flow label manager, 80000-fffff will be
used for creating auto flow labels (per RFC6438). This only affects how
labels are set on transmit, it does not affect receive. This range split
can be disbaled by systcl.
Background:
IPv6 flow labels have been an unmitigated disappointment thus far
in the lifetime of IPv6. Support in HW devices to use them for ECMP
is lacking, and OSes don't turn them on by default. If we had these
we could get much better hashing in IPv6 networks without resorting
to DPI, possibly eliminating some of the motivations to to define new
encaps in UDP just for getting ECMP.
Unfortunately, the initial specfications of IPv6 did not clarify
how they are to be used. There has always been a vague concept that
these can be used for ECMP, flow hashing, etc. and we do now have a
good standard how to this in RFC6438. The problem is that flow labels
can be either stateful or stateless (as in RFC6438), and we are
presented with the possibility that a stateless label may collide
with a stateful one. Attempts to split the flow label space were
rejected in IETF. When we added support in Linux for RFC6438, we
could not turn on flow labels by default due to this conflict.
This patch splits the flow label space and should give us
a path to enabling auto flow labels by default for all IPv6 packets.
This is an API change so we need to consider compatibility with
existing deployment. The stateful range is chosen to be the lower
values in hopes that most uses would have chosen small numbers.
Once we resolve the stateless/stateful issue, we can proceed to
look at enabling RFC6438 flow labels by default (starting with
scaled testing).
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Mon, 4 May 2015 00:05:49 +0000 (17:05 -0700)]
ipv6: Check RTF_LOCAL on rt->rt6i_flags instead of rt->dst.flags
In my earlier commit:
653437d02f1f ("ipv6: Stop /128 route from disappearing after pmtu update"),
there was a horrible typo. Instead of checking RTF_LOCAL on
rt->rt6i_flags, it was checked on rt->dst.flags. This patch fixes
it.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hajime Tazaki <tazaki@sfc.wide.ad.jp>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Thu, 30 Apr 2015 10:12:00 +0000 (12:12 +0200)]
net: sched: remove TC_MUNGED bits
Not used.
pedit sets TC_MUNGED when packet content was altered, but all the core
does is unset MUNGED again and then set OK2MUNGE.
And the latter isn't tested anywhere. So lets remove both
TC_MUNGED and TC_OK2MUNGE.
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Li RongQing [Thu, 30 Apr 2015 09:25:12 +0000 (17:25 +0800)]
ipv4: remove the unnecessary codes in fib_info_hash_move
The whole hlist will be moved, so not need to call hlist_del before
add the hlist_node to other hlist_head.
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 3 May 2015 02:05:58 +0000 (22:05 -0400)]
Merge git://git./linux/kernel/git/davem/net
Merge net into net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 2 May 2015 03:51:04 +0000 (20:51 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Receive packet length needs to be adjust by 2 on RX to accomodate
the two padding bytes in altera_tse driver. From Vlastimil Setka.
2) If rx frame is dropped due to out of memory in macb driver, we leave
the receive ring descriptors in an undefined state. From Punnaiah
Choudary Kalluri
3) Some netlink subsystems erroneously signal NLM_F_MULTI. That is
only for dumps. Fix from Nicolas Dichtel.
4) Fix mis-use of raw rt->rt_pmtu value in ipv4, one must always go via
the ipv4_mtu() helper. From Herbert Xu.
5) Fix null deref in bridge netfilter, and miscalculated lengths in
jump/goto nf_tables verdicts. From Florian Westphal.
6) Unhash ping sockets properly.
7) Software implementation of BPF divide did 64/32 rather than 64/64
bit divide. The JITs got it right. Fix from Alexei Starovoitov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits)
ipv4: Missing sk_nulls_node_init() in ping_unhash().
net: fec: Fix RGMII-ID mode
net/mlx4_en: Schedule napi when RX buffers allocation fails
netxen_nic: use spin_[un]lock_bh around tx_clean_lock
net/mlx4_core: Fix unaligned accesses
mlx4_en: Use correct loop cursor in error path.
cxgb4: Fix MC1 memory offset calculation
bnx2x: Delay during kdump load
net: Fix Kernel Panic in bonding driver debugfs file: rlb_hash_table
net: dsa: Fix scope of eeprom-length property
net: macb: Fix race condition in driver when Rx frame is dropped
hv_netvsc: Fix a bug in netvsc_start_xmit()
altera_tse: Correct rx packet length
mlx4: Fix tx ring affinity_mask creation
tipc: fix problem with parallel link synchronization mechanism
tipc: remove wrong use of NLM_F_MULTI
bridge/nl: remove wrong use of NLM_F_MULTI
bridge/mdb: remove wrong use of NLM_F_MULTI
net: sched: act_connmark: don't zap skb->nfct
trivial: net: systemport: bcmsysport.h: fix 0x0x prefix
...
Stefan Hajnoczi [Fri, 1 May 2015 23:12:29 +0000 (08:42 +0930)]
virtio: fix typo in vring_need_event() doc comment
Here the "other side" refers to the guest or host.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rusty Russell [Fri, 1 May 2015 23:12:38 +0000 (08:42 +0930)]
virtio: pass baton to Michael Tsirkin
With my job change kernel work will be "own time"; I'm keeping lguest
and modules (and the virtio standards work), but virtio kernel has to
go.
This makes it clear that Michael is in charge. He's good, but having
me watch over his shoulder won't help.
Good luck Michael!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 2 May 2015 03:35:39 +0000 (20:35 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/sage/ceph-client
Pull Ceph RBD fix from Sage Weil.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: end I/O the entire obj_request on error
David S. Miller [Sat, 2 May 2015 02:02:47 +0000 (22:02 -0400)]
ipv4: Missing sk_nulls_node_init() in ping_unhash().
If we don't do that, then the poison value is left in the ->pprev
backlink.
This can cause crashes if we do a disconnect, followed by a connect().
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Wen Xu <hotdog3645@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman [Thu, 30 Apr 2015 06:21:29 +0000 (15:21 +0900)]
net: rocker: Use ether_addr_equal
A small cleanup to make use of the ether_addr_equal helper.
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 2 May 2015 00:57:07 +0000 (20:57 -0400)]
Merge branch 'rt6_pmtu'
Martin KaFai Lau says:
====================
ipv6: Stop /128 route from disappearing after pmtu update
The series is separated from another patch series,
'ipv6: Only create RTF_CACHE route after encountering pmtu exception',
which can be found here:
http://thread.gmane.org/gmane.linux.network/359140
This series focus on fixing the /128 route issues. It is currently targeted
for net-next due to the number of code churn but it is also applicable
to net (should be without conflict). The original reported problem can be
found here:
http://thread.gmane.org/gmane.linux.network/348138
Patch 01 and 02 are to prepare the fib6 search to expect both the
RTF_CACHE clone and its original route exist at the same fib6_node.
Patch 03 fixes the /128 route disappearing bug.
Patch 04 and 05 stop rt6_info from using the inet_peer's metrics to
avoid the /128 routes (like the /128 clone and its original route)
from stepping on each others' metrics.
The second patch is by 'Steffen Klassert <steffen.klassert@secunet.com>'
which I pulled off from netdev. The third patch is also mostly by
Steffen with one minor optimization.
Many thanks to Hannes Frederic Sowa <hannes@stressinduktion.org> on
reviewing the patches and giving advice.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Tue, 28 Apr 2015 20:03:07 +0000 (13:03 -0700)]
ipv6: Remove DST_METRICS_FORCE_OVERWRITE and _rt6i_peer
_rt6i_peer is no longer needed after the last patch,
'ipv6: Stop rt6_info from using inet_peer's metrics'.
DST_METRICS_FORCE_OVERWRITE is added by
commit
e5fd387ad5b3 ("ipv6: do not overwrite inetpeer metrics prematurely").
Since inetpeer is no longer used for metrics, this bit is also not needed.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Michal Kubeček <mkubecek@suse.cz>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Tue, 28 Apr 2015 20:03:06 +0000 (13:03 -0700)]
ipv6: Stop rt6_info from using inet_peer's metrics
inet_peer is indexed by the dst address alone. However, the fib6 tree
could have multiple routing entries (rt6_info) for the same dst. For
example,
1. A /128 dst via multiple gateways.
2. A RTF_CACHE route cloned from a /128 route.
In the above cases, all of them will share the same metrics and
step on each other.
This patch will steer away from inet_peer's metrics and use
dst_cow_metrics_generic() for everything.
Change Highlights:
1. Remove rt6_cow_metrics() which currently acquires metrics from
inet_peer for DST_HOST route (i.e. /128 route).
2. Add rt6i_pmtu to take care of the pmtu update to avoid creating a
full size metrics just to override the RTAX_MTU.
3. After (2), the RTF_CACHE route can also share the metrics with its
dst.from route, by:
dst_init_metrics(&cache_rt->dst, dst_metrics_ptr(cache_rt->dst.from), true);
4. Stop creating RTF_CACHE route by cloning another RTF_CACHE route. Instead,
directly clone from rt->dst.
[ Currently, cloning from another RTF_CACHE is only possible during
rt6_do_redirect(). Also, the old clone is removed from the tree
immediately after the new clone is added. ]
In case of cloning from an older redirect RTF_CACHE, it should work as
before.
In case of cloning from an older pmtu RTF_CACHE, this patch will forget
the pmtu and re-learn it (if there is any) from the redirected route.
The _rt6i_peer and DST_METRICS_FORCE_OVERWRITE will be removed
in the next cleanup patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Tue, 28 Apr 2015 20:03:05 +0000 (13:03 -0700)]
ipv6: Stop /128 route from disappearing after pmtu update
This patch is mostly from Steffen Klassert <steffen.klassert@secunet.com>.
I only removed the (rt6->rt6i_dst.plen == 128) check from
ip6_rt_update_pmtu() because the (rt6->rt6i_flags & RTF_CACHE) test
has already implied it.
This patch:
1. Create RTF_CACHE route for /128 non local route
2. After (1), all routes that allow pmtu update should have a RTF_CACHE
clone. Hence, stop updating MTU for any non RTF_CACHE route.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert [Tue, 28 Apr 2015 20:03:04 +0000 (13:03 -0700)]
ipv6: Extend the route lookups to low priority metrics.
We search only for routes with highest priority metric in
find_rr_leaf(). However if one of these routes is marked
as invalid, we may fail to find a route even if there is
a appropriate route with lower priority. Then we loose
connectivity until the garbage collector deletes the
invalid route. This typically happens if a host route
expires afer a pmtu event. Fix this by searching also
for routes with a lower priority metric.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Tue, 28 Apr 2015 20:03:03 +0000 (13:03 -0700)]
ipv6: Consider RTF_CACHE when searching the fib6 tree
It is a prep work for the later bug-fix patch which will stop /128 route
from disappearing after pmtu update.
The later bug-fix patch will allow a /128 route and its RTF_CACHE clone
both exist at the same fib6_node. To do this, we need to prepare the
existing fib6 tree search to expect RTF_CACHE for /128 route.
Note that the fn->leaf is sorted by rt6i_metric. Hence,
RTF_CACHE (if there is any) is always at the front. This property
leads to the following:
1. When doing ip6_route_del(), it should honor the RTF_CACHE flag which
the caller is used to ask for deleting clone or non-clone.
The rtm_to_fib6_config() should also check the RTM_F_CLONED and
then set RTF_CACHE accordingly so that:
- 'ip -6 r del...' will make ip6_route_del() to delete a route
and all its clones. Note that its clones is flushed by fib6_del()
- 'ip -6 r flush table cache' will make ip6_route_del() to
only delete clone(s).
2. Exclude RTF_CACHE from addrconf_get_prefix_route() which
should not configure on a cloned route.
3. No change is need for rt6_device_match() since it currently could
return a RTF_CACHE clone route, so the later bug-fix patch will not
affect it.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Dryomov [Sat, 25 Apr 2015 12:56:15 +0000 (15:56 +0300)]
rbd: end I/O the entire obj_request on error
When we end I/O struct request with error, we need to pass
obj_request->length as @nr_bytes so that the entire obj_request worth
of bytes is completed. Otherwise block layer ends up confused and we
trip on
rbd_assert(more ^ (which == img_request->obj_request_count));
in rbd_img_obj_callback() due to more being true no matter what. We
already do it in most cases but we are missing some, in particular
those where we don't even get a chance to submit any obj_requests, due
to an early -ENOMEM for example.
A number of obj_request->xferred assignments seem to be redundant but
I haven't touched any of obj_request->xferred stuff to keep this small
and isolated.
Cc: Alex Elder <elder@linaro.org>
Cc: stable@vger.kernel.org # 3.10+
Reported-by: Shawn Edwards <lesser.evil@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Eric Dumazet [Fri, 1 May 2015 17:37:49 +0000 (10:37 -0700)]
ipv4: speedup ip_idents_reserve()
Under stress, ip_idents_reserve() is accessing a contended
cache line twice, with non optimal MESI transactions.
If we place timestamps in separate location, we reduce this
pressure by ~50% and allow atomic_add_return() to issue
a Request for Ownership.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 1 May 2015 14:46:21 +0000 (07:46 -0700)]
Merge branch 'for-linus-4.1' of git://git./linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"A few more btrfs fixes.
These range from corners Filipe found in the new free space cache
writeback to a grab bag of fixes from the list"
* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
btrfs: unlock i_mutex after attempting to delete subvolume during send
btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
btrfs: fix race on ENOMEM in alloc_extent_buffer
btrfs: handle ENOMEM in btrfs_alloc_tree_block
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
Btrfs: don't check for delalloc_bytes in cache_save_setup
Btrfs: fix deadlock when starting writeback of bg caches
Btrfs: fix race between start dirty bg cache writeout and bg deletion
Linus Torvalds [Fri, 1 May 2015 14:44:32 +0000 (07:44 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Not too much here, but we've addressed a couple of nasty issues in the
dma-mapping code as well as adding the halfword and byte variants of
load_acquire/store_release following on from the CSD locking bug that
you fixed in the core.
- fix perf devicetree warnings at probe time
- fix memory leak in __dma_free()
- ensure DMA buffers are always zeroed
- show IRQ trigger in /proc/interrupts (for parity with ARM)
- implement byte and halfword access for smp_{load_acquire,store_release}"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: perf: Fix the pmu node name in warning message
arm64: perf: don't warn about missing interrupt-affinity property for PPIs
arm64: add missing PAGE_ALIGN() to __dma_free()
arm64: dma-mapping: always clear allocated buffers
ARM64: Enable CONFIG_GENERIC_IRQ_SHOW_LEVEL
arm64: add missing data types in smp_load_acquire/smp_store_release
Iyappan Subramanian [Thu, 30 Apr 2015 23:09:17 +0000 (16:09 -0700)]
drivers: net: xgene: fix kbuild warnings
Fixed the following kbuild warnings:
1. unused variable 'of_id'
2. buffer overflow 'ring_cfg' 5 <= 5
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 30 Apr 2015 21:23:31 +0000 (14:23 -0700)]
Merge tag 'pm+acpi-4.1-rc2' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"Three regression fixes this time, one for a recent regression in the
cpuidle core affecting multiple systems, one for an inadvertently
added duplicate typedef in ACPICA that breaks compilation with GCC 4.5
and one for an ACPI Smart Battery Subsystem driver regression
introduced during the 3.18 cycle (stable-candidate).
Specifics:
- Fix for a regression in the cpuidle core introduced by one of the
recent commits in the clockevents_notify() removal series that put
a call to a function which had to be executed with disabled
interrupts into a code path running with enabled interrupts (Rafael
J Wysocki)
- Fix for a build problem in ACPICA (with GCC 4.5) introduced by one
of the recent ACPICA tools commits that added a duplicate typedef
to one of the ACPICA's header files by mistake (Olaf Hering)
- Fix for a regression in the ACPI SBS (Smart Battery Subsystem)
driver introduced during the 3.18 development cycle causing the
smart battery manager to be marked as not present when it should be
marked as present (Chris Bainbridge)"
* tag 'pm+acpi-4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle: Run tick_broadcast_exit() with disabled interrupts
ACPI / SBS: Enable battery manager when present
ACPICA: remove duplicate u8 typedef
Linus Torvalds [Thu, 30 Apr 2015 21:00:18 +0000 (14:00 -0700)]
Merge tag 'sound-4.1-rc2' of git://git./linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"One nice fix is Peter's patch to make the old good SB Audigy PCI to
work with 32bit DMA instead of 31bit. This allows the MIDI synth
running on modern machines again. Along with it, a few fixes for
emu10k1 have merged.
In ASoC side, there is one fix in the common code, but it's just
trivial additions of static inline functions for CONFIG_PM=n. The
rest are various device-specific small fixes.
Last but not least, a few HD-audio fixes are included, as usual, too"
* tag 'sound-4.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (23 commits)
ASoC: rt5677: fixed wrong DMIC ref clock
ALSA: emu10k1: Emu10k2 32 bit DMA mode
ALSA: emux: Fix mutex deadlock in OSS emulation
ASoC: Update email-id of Rajeev Kumar
ASoC: rt5645: Fix mask for setting RT5645_DMIC_2_DP_GPIO12 bit
ALSA: hda - Fix missing va_end() call in snd_hda_codec_pcm_new()
ALSA: emux: Fix mutex deadlock at unloading
ALSA: emu10k1: Fix card shortname string buffer overflow
ALSA: hda - Add mute-LED mode control to Thinkpad
ALSA: hda - Fix mute-LED fixed mode
ALSA: hda - Fix click noise at start on Dell XPS13
ASoC: rt5645: Add ACPI match ID
ASoC: rt5677: add register patch for PLL
ASoC: Intel: fix the makefile for atom code
ASoC: dapm: Enable autodisable on SOC_DAPM_SINGLE_TLV_AUTODISABLE
ASoC: add static inline funcs to fix a compiling issue
ASoC: Intel: sst_byt: remove kfree for memory allocated with devm_kzalloc
ASoC: samsung: s3c24xx-i2s: Fix return value check in s3c24xx_iis_dev_probe()
ASoC: tfa9879: Fix return value check in tfa9879_i2c_probe()
ASoC: fsl_ssi: Fix platform_get_irq() error handling
...
Markus Pargmann [Thu, 30 Apr 2015 15:07:50 +0000 (17:07 +0200)]
net: fec: Fix RGMII-ID mode
RGMII-ID uses an internal delay within the transmitter or receiver. This
feature is phy specific. The rest of the communication is normal RGMII.
So the fec driver has to check for all RGMII modes, not only
'PHY_INTERFACE_MODE_RGMII'.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Shamay [Thu, 30 Apr 2015 14:32:46 +0000 (17:32 +0300)]
net/mlx4_en: Schedule napi when RX buffers allocation fails
When system is out of memory, refilling of RX buffers fails while
the driver continue to pass the received packets to the kernel stack.
At some point, when all RX buffers deplete, driver may fall into a
sleep, and not recover when memory for new RX buffers is once again
availible. This is because hardware does not have valid descriptors,
so no interrupt will be generated for the driver to return to work
in napi context. Fix it by schedule the napi poll function from
stats_task delayed workqueue, as long as the allocations fail.
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 30 Apr 2015 14:17:27 +0000 (16:17 +0200)]
test_bpf: indicate whether bpf prog got jited in test suite
I think this is useful to verify whether a filter could be JITed or
not in case of bpf_prog_enable >= 1, which otherwise the test suite
doesn't tell besides taking a good peek at the performance numbers.
Nicolas Schichan reported a bug in the ARM JIT compiler that rejected
and waved the filter to the interpreter although it shouldn't have.
Nevertheless, the test passes as expected, but such information is
not visible.
It's i.e. useful for the remaining classic JITs, but also for
implementing remaining opcodes that are not yet present in eBPF JITs
(e.g. ARM64 waves some of them to the interpreter). This minor patch
allows to grep through dmesg to find those accordingly, but also
provides a total summary, i.e.: [<X>/53 JIT'ed]
# echo 1 > /proc/sys/net/core/bpf_jit_enable
# insmod lib/test_bpf.ko
# dmesg | grep "jited:0"
dmesg example on the ARM issue with JIT rejection:
[...]
[ 67.925387] test_bpf: #2 ADD_SUB_MUL_K jited:1 24 PASS
[ 67.930889] test_bpf: #3 DIV_MOD_KX jited:0 794 PASS
[ 67.943940] test_bpf: #4 AND_OR_LSH_K jited:1 20 20 PASS
[...]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nicolas Schichan <nschichan@freebox.fr>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Camuso [Thu, 30 Apr 2015 11:51:27 +0000 (07:51 -0400)]
netxen_nic: use spin_[un]lock_bh around tx_clean_lock
While testing this driver with DEBUG_LOCKDEP and DEBUG_SPINLOCK
enabled did not produce any traces, it would be more prudent in the
case of tx_clean_lock to use spin_[un]lock_bh, since this lock is
manipulated in both the process and softirq contexts.
This patch was tested for functionality and regressions with netperf
and DEBUG_LOCKDEP and DEBUG_SPINLOCK enabled.
Signed-off-by: Tony Camuso <tcamuso@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 30 Apr 2015 09:59:49 +0000 (11:59 +0200)]
be2net: log link status
The driver unlike other drivers does not log link state changes. It's
better for an user when asynchronous link states are logged to the system
log.
v3: Changes from v2 discarded as "not necessary"
Cc: Sathya Perla <sathya.perla@emulex.com>
Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com>
Cc: Ajit Khaparde <ajit.khaparde@emulex.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Wed, 29 Apr 2015 21:25:47 +0000 (16:25 -0500)]
ibmveth: Add support for Large Receive Offload
Enables receiving large packets from other LPARs. These packets
have a -1 IP header checksum, so we must recalculate to have
a valid checksum.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Wed, 29 Apr 2015 21:25:46 +0000 (16:25 -0500)]
ibmveth: Add GRO support
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Wed, 29 Apr 2015 21:25:45 +0000 (16:25 -0500)]
ibmveth: Add support for TSO
Add support for TSO. TSO is turned off by default and
must be enabled and configured by the user. The driver
version number is increased so that users can be sure
that they are using ibmveth with TSO support.
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Wed, 29 Apr 2015 21:25:44 +0000 (16:25 -0500)]
ibmveth: change rx buffer default allocation for CMO
This patch enables 64k rx buffer pools by default. If Cooperative
Memory Overcommitment (CMO) is enabled, the number of 64k buffers
is reduced to save memory.
Cc: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Wed, 29 Apr 2015 20:52:51 +0000 (16:52 -0400)]
net/mlx4_core: Fix unaligned accesses
Addresses the following kernel logs seen during boot:
Kernel unaligned access at TPC[
100ee150] mlx4_QUERY_HCA+0x80/0x248 [mlx4_core]
Kernel unaligned access at TPC[
100f071c] mlx4_QUERY_ADAPTER+0x100/0x12c [mlx4_core]
Kernel unaligned access at TPC[
100f071c] mlx4_QUERY_ADAPTER+0x100/0x12c [mlx4_core]
Kernel unaligned access at TPC[
100f071c] mlx4_QUERY_ADAPTER+0x100/0x12c [mlx4_core]
Kernel unaligned access at TPC[
100f071c] mlx4_QUERY_ADAPTER+0x100/0x12c [mlx4_core]
Signed-off-by: David Ahern <david.ahern@oracle.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Benjamin Poirier [Wed, 29 Apr 2015 22:59:35 +0000 (15:59 -0700)]
mlx4_en: Use correct loop cursor in error path.
Signed-off-by: Benjamin Poirier <bpoirier@suse.de>
Fixes:
9e311e7 ("net/mlx4_en: Use affinity hint")
Acked-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 30 Apr 2015 20:03:14 +0000 (16:03 -0400)]
Merge branch 'xgene-next'
Iyappan Subramanian says:
====================
drivers: net: xgene: Add ethernet with ring manager v2 support
Adding XFI based 10GbE and SGMII based 1GbE with ring manager v2 support
for APM X-Gene ethernet driver. The ring manager v2 is used by 2nd
generation SoC.
v1:
* Initial version
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Tue, 28 Apr 2015 20:52:40 +0000 (13:52 -0700)]
drivers: net: xgene: Add SGMII based 1GbE support with ring manager v2
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Tue, 28 Apr 2015 20:52:39 +0000 (13:52 -0700)]
drivers: net: xgene: Add 10GbE support with ring manager v2
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Tue, 28 Apr 2015 20:52:38 +0000 (13:52 -0700)]
drivers: net: xgene: Add ring manager v2 functions
Adding ring manager v2 support for APM X-Gene ethernet driver.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Iyappan Subramanian [Tue, 28 Apr 2015 20:52:37 +0000 (13:52 -0700)]
drivers: net: xgene: Change ring manager to use function pointers
This is a preparatory patch for adding ethernet support for APM X-Gene
ethernet driver to work with ring manager v2.
Added xgene_ring_ops structure for storing chip specific ring manager
properties and functions.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>