git.monstr.eu Git - linux-2.6-microblaze.git/log

net: Lookup actual route when oif is VRF device

If the user specifies a VRF device in a get route query the custom route
pointing to the VRF device is returned:

    $ ip route ls table vrf-red
    unreachable default
    broadcast 10.2.1.0 dev eth1  proto kernel  scope link  src 10.2.1.2
    10.2.1.0/24 dev eth1  proto kernel  scope link  src 10.2.1.2
    local 10.2.1.2 dev eth1  proto kernel  scope host  src 10.2.1.2
    broadcast 10.2.1.255 dev eth1  proto kernel  scope link  src 10.2.1.2

    $ ip route get oif vrf-red 10.2.1.40
    10.2.1.40 dev vrf-red
        cache

Add the flags to skip the custom route and go directly to the FIB. With
this patch the actual route is returned:

    $ ip route get oif vrf-red 10.2.1.40
    10.2.1.40 dev eth1  src 10.2.1.2
        cache

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-next-for-davem-2015-10-05' of git://git./linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
For the current cycle, we have the following right now:
* many internal fixes, API improvements, cleanups, etc.
* full AP client state tracking in cfg80211/mac80211 from Ayala
* VHT support (in mac80211) for mesh
* some A-MSDU in A-MPDU support from Emmanuel
* show current TX power to userspace (from Rafał)
* support for netlink dump in vendor commands (myself)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'l3mdev_saddr_op'

David Ahern says:

====================
net: Add saddr op to l3mdev and vrf

First 2 patches are re-sends of patches that got lost in the ethosphere
Tuesday; they were part of the first round of l3mdev conversions.
Next 3 handle the source address lookup for raw and datagram sockets
bound to a VRF device.

The conversion to the get_saddr op also fixes locally originated TCP
packets showing up at the VRF device. The use of the FLOWI_FLAG_L3MDEV_SRC
flag in ip_route_connect_init was causing locally generated packets
to skip the VRF device.

v2
- rebased to top of net-next per device delete fix and hash based
multipath patches
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add l3mdev saddr lookup to raw_sendmsg

ping originated on box through a VRF device is showing up in tcpdump
without a source address:
    $ tcpdump -n -i vrf-blue
    08:58:33.311303 IP 0.0.0.0 > 10.2.2.254: ICMP echo request, id 2834, seq 1, length 64
    08:58:33.311562 IP 10.2.2.254 > 10.2.2.2: ICMP echo reply, id 2834, seq 1, length 64

Add the call to l3mdev_get_saddr to raw_sendmsg.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add source address lookup op for VRF

Add operation to l3mdev to lookup source address for a given flow.
Add support for the operation to VRF driver and convert existing
IPv4 hooks to use the new lookup.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Refactor path selection in __ip_route_output_key_hash

VRF device needs the same path selection following lookup to set source
address. Rather than duplicating code, move existing code into a
function that is exported to modules.

Code move only; no functional change.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add netif_is_l3_slave

IPv6 addrconf keys off of IFF_SLAVE so can not use it for L3 slave.
Add a new private flag and add netif_is_l3_slave function for checking
it.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Rename FLOWI_FLAG_VRFSRC to FLOWI_FLAG_L3MDEV_SRC

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Fix vti use case with oif in dst lookups for IPv6

It occurred to me yesterday that 741a11d9e4103 ("net: ipv6: Add
RT6_LOOKUP_F_IFACE flag if oif is set") means that xfrm6_dst_lookup
needs the FLOWI_FLAG_SKIP_NH_OIF flag set. This latest commit causes
the oif to be considered in lookups which is known to break vti. This
explains why 58189ca7b274 did not the IPv6 change at the time it was
submitted.

Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gianfar: Add WAKE_UCAST and "wake-on-filer" support

This enables eTSEC's filer (Rx parser) and the FGPI Rx
interrupt (Filer General Purpose Interrupt) as a wakeup
source event.

Upon entering suspend state, the eTSEC filer is given
a rule to match incoming L2 unicast packets. A packet
matching the rule will be enqueued in the Rx ring and
a FGPI Rx interrupt will be asserted by the filer to
wakeup the system. Other packet types will be dropped.
On resume the filer table is restored to the content
before entering suspend state.
The set of rules from gfar_filer_config_wol() could be
extended to implement other WoL capabilities as well.

The "fsl,wake-on-filer" DT binding enables this capability
on certain platforms that feature the necessary power
management infrastructure, targeting mainly printing and
imaging applications.
(refer to Power Management section of the SoC Ref Man)

Cc: Li Yang <leoli@freescale.com>
Cc: Zhao Chenhui <chenhui.zhao@freescale.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

powerpc: dts: p1022si: Add fsl,wake-on-filer for eTSEC

Enable the "wake-on-filer" (aka. wake on user defined packet)
wake on lan capability for the eTSEC ethernet nodes.

Cc: Li Yang <leoli@freescale.com>
Cc: Zhao Chenhui <chenhui.zhao@freescale.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

doc: dt: net: Add fsl,wake-on-filer for eTSEC

Add the "fsl,wake-on-filer" property for eTSEC nodes to
indicate that the system has the power management
infrastructure needed to be able to wake up the system
via FGPI (filer, aka. h/w rx parser) interrupt.

Cc: Li Yang <leoli@freescale.com>
Cc: Zhao Chenhui <chenhui.zhao@freescale.com>
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ovs-ipv6-tunnel'

Jiri Benc says:

====================
openvswitch: add IPv6 tunneling support

This builds on the previous work that added IPv6 support to lwtunnels and
adds IPv6 tunneling support to ovs.

To use IPv6 tunneling, there needs to be a metadata based tunnel net_device
created and added to the ovs bridge. Currently, only vxlan is supported by
the kernel, with geneve to follow shortly. There's no need nor intent to add
a support for this into the vport-vxlan (etc.) compat layer.

v3: dropped the last two patches added in v2.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: netlink attributes for IPv6 tunneling

Add netlink attributes for IPv6 tunnel addresses. This enables IPv6 support
for tunnels.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: add tunnel protocol to sw_flow_key

Store tunnel protocol (AF_INET or AF_INET6) in sw_flow_key. This field now
also acts as an indicator whether the flow contains tunnel data (this was
previously indicated by tun_key.u.ipv4.dst being set but with IPv6 addresses
in an union with IPv4 ones this won't work anymore).

The new field was added to a hole in sw_flow_key.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: make br_fill_info's frame size smaller

When KASAN is enabled the frame size grows > 2048 bytes and we get a
warning, so make it smaller.
net/bridge/br_netlink.c: In function 'br_fill_info':
>> net/bridge/br_netlink.c:1110:1: warning: the frame size of 2160 bytes
>> is larger than 2048 bytes [-Wframe-larger-than=]

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add support for filtering neigh dump by device index

Add support for filtering neighbor dumps by device by adding the
NDA_IFINDEX attribute to the dump request.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2015-10-03

This series contains updates to i40e and i40evf, some of which are to
resolve more Red Hat bugzilla issues.

Jiang Liu updates the i40e and i40evf drivers to use numa_mem_id()
instead of numa_node_id() to get the nearest node with memory which
better supports memoryless nodes.

Anjali fixes an issue from Dan Carpenter <dan.carpenter@oracle.com>,
to resolve a memory leak in X722 RSS configuration path, where we should
free the memory allocated before exiting.

Shannon modifies the drivers to ensure we have the spinlocks before we
clear the ARQ and ASQ management registers.  In addition, we widen the
locked portion insert a sanity check to ensure we are working with safe
register values.

Mitch fixes an issue where under certain circumstances, we can get an
extra VF_RESOURCES message from the PF driver at runtime.  When this
occurs, we need to parse it because our VSI may have changed and that
will affect the relationship with the PF driver.  But this parsing also
blows away our current MAC address, so resolve the issue by restoring
the current MAC address from the netdev struct after we parse the
resource message.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: better error reporting

Add additional error reporting to the generic DSA code, so it's easier
to debug when things go wrong. This was useful when initially bringing
up 88e6176 on a new board.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: remove link polling

The link status is polled by the generic phy layer, there's no need to
duplicate that polling with additional polling. This additional polling
adds additional MDIO traffic, and races with the generic phy layer,
resulting in missing or duplicated link status messages.

Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "regmap: Allow installing custom reg_update_bits function"

This reverts commit 7741c373cf3ea1f5383fa97fb7a640a429d3dd7c.

Revert "net: Microchip encx24j600 driver"

This reverts commit 04fbfce7a222327b97ca165294ef19f0faa45960.

Revert "net: encx24j600_exit() can be static"

This reverts commit 9886ce2b9d4e5a8bb3d78d0f7eff3c0f1ed58d67.

ipv4: Fix compilation errors in fib_rebalance

This fixes

net/built-in.o: In function `fib_rebalance':
fib_semantics.c:(.text+0x9df14): undefined reference to `__divdi3'

and

net/built-in.o: In function `fib_rebalance':
net/ipv4/fib_semantics.c:572: undefined reference to `__aeabi_ldivmod'

Fixes: 0e884c78ee19 ("ipv4: L3 hash-based multipath")

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mac80211: make ieee80211_new_mesh_header return unsigned

The function returns always non-negative values.

The problem has been detected using proposed semantic patch
scripts/coccinelle/tests/assign_signed_to_unsigned.cocci [1].

[1]: http://permalink.gmane.org/gmane.linux.kernel/2046107

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

ebpf: include perf_event only where really needed

Commit ea317b267e9d ("bpf: Add new bpf map type to store the pointer
to struct perf_event") added perf_event.h to the main eBPF header, so
it gets included for all users. perf_event.h is actually only needed
from array map side, so lets sanitize this a bit.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Kaixu Xia <xiakaixu@huawei.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ARM: net: support BPF_ALU | BPF_MOD instructions in the BPF JIT.

For ARMv7 with UDIV instruction support, generate an UDIV instruction
followed by an MLS instruction.

For other ARM variants, generate code calling a C wrapper similar to
the jit_udiv() function used for BPF_ALU | BPF_DIV instructions.

Some performance numbers reported by the test_bpf module (the duration
per filter run is reported in nanoseconds, between "jitted:<x>" and
"PASS":

ARMv7 QEMU nojit: test_bpf: #3 DIV_MOD_KX jited:0 2196 PASS
ARMv7 QEMU jit: test_bpf: #3 DIV_MOD_KX jited:1 104 PASS
ARMv5 QEMU nojit: test_bpf: #3 DIV_MOD_KX jited:0 2176 PASS
ARMv5 QEMU jit: test_bpf: #3 DIV_MOD_KX jited:1 1104 PASS
ARMv5 kirkwood nojit: test_bpf: #3 DIV_MOD_KX jited:0 1103 PASS
ARMv5 kirkwood jit: test_bpf: #3 DIV_MOD_KX jited:1 311 PASS

Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'asix-rx-mem-handling'

Mark Craske says:

====================
Improve ASIX RX memory allocation error handling

The ASIX RX handler algorithm is weak on error handling.
There is a design flaw in the ASIX RX handler algorithm because the
implementation for handling RX Ethernet frames for the DUB-E100 C1 can
have Ethernet frames spanning multiple URBs. This means that payload data
from more than 1 URB is sometimes needed to fill the socket buffer with a
complete Ethernet frame. When the URB with the start of an Ethernet frame
is received then an attempt is made to allocate a socket buffer. If the
memory allocation fails then the algorithm sets the buffer pointer member
to NULL and the function exits (no crash yet). Subsequently, the RX hander
is called again to process the next URB which assumes there is a socket
buffer available and the kernel crashes when there is no buffer.

This patchset implements an improvement to the RX handling algorithm to
avoid a crash when no memory is available for the socket buffer.

The patchset will apply cleanly to the net-next master branch but the
created kernel has not been tested. The driver was tested on ARM kernels
v3.8 and v3.14 for a commercial product.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

asix: Continue processing URB if no RX netdev buffer

Avoid a loss of synchronisation of the Ethernet Data header 32-bit
word due to a failure to get a netdev socket buffer.

The ASIX RX handling algorithm returned 0 upon a failure to get
an allocation of a netdev socket buffer. This causes the URB
processing to stop which potentially causes a loss of synchronisation
with the Ethernet Data header 32-bit word. Therefore, subsequent
processing of URBs may be rejected due to a loss of synchronisation.
This may cause additional good Ethernet frames to be discarded
along with outputting of synchronisation error messages.

Implement a solution which checks whether a netdev socket buffer
has been allocated before trying to copy the Ethernet frame into
the netdev socket buffer. But continue to process the URB so that
synchronisation is maintained. Therefore, only a single Ethernet
frame is discarded when no netdev socket buffer is available.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Mark Craske <Mark_Craske@mentor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

asix: On RX avoid creating bad Ethernet frames

When RX Ethernet frames span multiple URB socket buffers,
the data stream may suffer a discontinuity which will cause
the current Ethernet frame in the netdev socket buffer
to be incomplete. This frame needs to be discarded instead
of appending unrelated data from the current URB socket buffer
to the Ethernet frame in the netdev socket buffer. This avoids
creating a corrupted Ethernet frame in the netdev socket buffer.

A discontinuity can occur when the previous URB socket buffer
held an incomplete Ethernet frame due to truncation or a
URB socket buffer containing the end of the Ethernet frame
was missing.

Therefore, add a sanity test for when an Ethernet frame
spans multiple URB socket buffers to check that the remaining
bytes of the currently received Ethernet frame point to
a good Data header 32-bit word of the next Ethernet
frame. Upon error, reset the remaining bytes variable to
zero and discard the current netdev socket buffer.
Assume that the Data header is located at the start of
the current socket buffer and attempt to process the next
Ethernet frame from there. This avoids unnecessarily
discarding a good URB socket buffer that contains a new
Ethernet frame.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Mark Craske <Mark_Craske@mentor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

asix: Simplify asix_rx_fixup_internal() netdev alloc

The code is checking that the Ethernet frame will fit into a
netdev allocated socket buffer within the constraints of MTU size,
Ethernet header length plus VLAN header length.

The original code was checking rx->remaining each loop of the while
loop that processes multiple Ethernet frames per URB and/or Ethernet
frames that span across URBs. rx->remaining decreases per while loop
so there is no point in potentially checking multiple times that the
Ethernet frame (remaining part) will fit into the netdev socket buffer.

The modification checks that the size of the Ethernet frame will fit
the netdev socket buffer before allocating the netdev socket buffer.
This avoids grabbing memory and then deciding that the Ethernet frame
is too big and then freeing the memory.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Mark Craske <Mark_Craske@mentor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

asix: Tidy-up 32-bit header word synchronisation

Tidy-up the Data header 32-bit word synchronisation logic in
asix_rx_fixup_internal() by removing redundant logic tests.

The code is looking at the following cases of the Data header
32-bit word that is present before each Ethernet frame:

a) all 32 bits of the Data header word are in the URB socket buffer
b) first 16 bits of the Data header word are at the end of the URB
socket buffer
c) last 16 bits of the Data header word are at the start of the URB
socket buffer eg. split_head = true

Note that the lifetime of rx->split_head exists outside of the
function call and is accessed per processing of each URB. Therefore,
split_head being true acts on the next URB to be processed.

To check for b) the offset will be 16 bits (2 bytes) from the end of
the buffer then indicate split_head is true.
To check for c) split_head must be true because the first 16 bits
have been found.
To check for a) else c)

Note that the || logic of the old code included the state
(skb->len - offset == sizeof(u16) && rx->split_head) which is not
possible because the split_head cannot be true whilst checking for b).
This is because the split_head indicates that the first 16 bits have
been found and that is not possible whilst checking for the first 16
bits. Therefore simplify the logic.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Mark Craske <Mark_Craske@mentor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

asix: Rename remaining and size for clarity

The Data header synchronisation is easier to understand
if the variables "remaining" and "size" are renamed.

Therefore, the lifetime of the "remaining" variable exists
outside of asix_rx_fixup_internal() and is used to indicate
any remaining pending bytes of the Ethernet frame that need
to be obtained from the next socket buffer. This allows an
Ethernet frame to span across multiple socket buffers.

"size" is now local to asix_rx_fixup_internal() and contains
the size read from the Data header 32-bit word.

Add "copy_length" to hold the number of the Ethernet frame
bytes (maybe a part of a full frame) that are to be copied
out of the socket buffer.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Mark Craske <Mark_Craske@mentor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf, seccomp: prepare for upcoming criu support

The current ongoing effort to dump existing cBPF seccomp filters back
to user space requires to hold the pre-transformed instructions like
we do in case of socket filters from sk_attach_filter() side, so they
can be reloaded in original form at a later point in time by utilities
such as criu.

To prepare for this, simply extend the bpf_prog_create_from_user()
API to hold a flag that tells whether we should store the original
or not. Also, fanout filters could make use of that in future for
things like diag. While fanout filters already use bpf_prog_destroy(),
move seccomp over to them as well to handle original programs when
present.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vrf: fix a kernel warning

This fixes:

tried to remove device ip6gre0 from (null)
------------[ cut here ]------------
kernel BUG at net/core/dev.c:5219!
invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 3 PID: 161 Comm: kworker/u8:2 Not tainted 4.3.0-rc2+ #1142
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: netns cleanup_net
task: ffff8800d784a9c0 ti: ffff8800d74a4000 task.ti: ffff8800d74a4000
RIP: 0010:[<ffffffff817f0797>]  [<ffffffff817f0797>] __netdev_adjacent_dev_remove+0x40/0xec
RSP: 0018:ffff8800d74a7a98  EFLAGS: 00010282
RAX: 000000000000002a RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88011adcf701 RSI: ffff88011adccbf8 RDI: ffff88011adccbf8
RBP: ffff8800d74a7ab8 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff81d190ff R11: 00000000ffffffff R12: ffff8800d599e7c0
R13: 0000000000000000 R14: ffff8800d599e890 R15: ffffffff82385e00
FS:  0000000000000000(0000) GS:ffff88011ac00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ffd6f003000 CR3: 000000000220c000 CR4: 00000000000006e0
Stack:
  0000000000000000 ffff8800d599e7c0 0000000000000b00 ffff8800d599e8a0
  ffff8800d74a7ad8 ffffffff817f0861 0000000000000000 ffff8800d599e7c0
  ffff8800d74a7af8 ffffffff817f088f 0000000000000000 ffff8800d599e7c0
Call Trace:
  [<ffffffff817f0861>] __netdev_adjacent_dev_unlink+0x1e/0x35
  [<ffffffff817f088f>] __netdev_adjacent_dev_unlink_neighbour+0x17/0x41
  [<ffffffff817f56e6>] netdev_upper_dev_unlink+0x6c/0x13d
  [<ffffffff81674a3d>] vrf_del_slave+0x26/0x7d
  [<ffffffff81674ac3>] vrf_device_event+0x2f/0x34
  [<ffffffff81098c40>] notifier_call_chain+0x75/0x9c
  [<ffffffff81098fa2>] raw_notifier_call_chain+0x14/0x16
  [<ffffffff817ee129>] call_netdevice_notifiers_info+0x52/0x59
  [<ffffffff817f179d>] call_netdevice_notifiers+0x13/0x15
  [<ffffffff817f6f18>] rollback_registered_many+0x14f/0x24f
  [<ffffffff817f70f2>] unregister_netdevice_many+0x19/0x64
  [<ffffffff819a2455>] ip6gre_exit_net+0x163/0x177
  [<ffffffff817eb019>] ops_exit_list+0x44/0x55
  [<ffffffff817ebcb7>] cleanup_net+0x193/0x226
  [<ffffffff81091e1c>] process_one_work+0x26c/0x4d8
  [<ffffffff81091d20>] ? process_one_work+0x170/0x4d8
  [<ffffffff81092296>] worker_thread+0x1df/0x2c2
  [<ffffffff810920b7>] ? process_scheduled_works+0x2f/0x2f
  [<ffffffff810920b7>] ? process_scheduled_works+0x2f/0x2f
  [<ffffffff81097a20>] kthread+0xd4/0xdc
  [<ffffffff810bc523>] ? trace_hardirqs_on_caller+0x17d/0x199
  [<ffffffff8109794c>] ? __kthread_parkme+0x83/0x83
  [<ffffffff81a5240f>] ret_from_fork+0x3f/0x70
  [<ffffffff8109794c>] ? __kthread_parkme+0x83/0x83

Fixes: 93a7e7e837af ("net: Remove the now unused vrf_ptr")
Cc: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: encx24j600_exit() can be static

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Microchip encx24j600 driver

This ethernet driver supports the Micorchip enc424j600/626j600 Ethernet
controller over a SPI bus interface. This driver makes use of the regmap API to
optimize access to registers by caching registers where possible.

Datasheet:
http://ww1.microchip.com/downloads/en/DeviceDoc/39935b.pdf

Signed-off-by: Jon Ringle <jringle@gridpoint.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

regmap: Allow installing custom reg_update_bits function

This commit allows installing a custom reg_update_bits function for cases where
the hardware provides a mechanism to set or clear register bits without a
read/modify/write cycle. Such is the case with the Microchip ENCX24J600.

Signed-off-by: Jon Ringle <jringle@gridpoint.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

enic: do hang reset only in case of tx timeout

The current code invokes hang reset in case of error interrupt. We should
hang reset only in case of tx timeout. This because of the way hang reset
is implemented in firmware. Hang reset takes more firmware resources than
soft reset. Adaptor does not generate error interrupt in case of tx
timeout.

Hang reset only in case of tx timeout, in .ndo_tx_timeout. Do soft reset
otherwise. Introduce deferred work, enic_tx_hang_reset, to do hang reset.

Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

enic: handle spurious error interrupt

Some of the enic adaptors are know to generate spurious interrupts. When
error interrupt is generated, driver just resets the device. This patch
resets the device only when an error is occurred.

Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'cxgb4-next'

Hariprasad Shenai says:

====================
cxgb4: Trivial fixes for cxgb4

Fixes the following issues
Don't read non existent T4/T5/T6 adapter registers for ethtool dump.
For T4, dont read mailbox control registers. Adds new devlog faility and
report correct link speed for unsupported ones.

This patch series has been created against net-next tree and includes
patches on cxgb4 driver.

We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Report correct link speed for unsupported ones

When we get garbage from the firmware with weird Port Speeds,
etc. we should emit a warning regarding unsupported speeds rather than
use the bogus default of "10Mbps" which isn't even an option in the
firmware Port Information message

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Adds a new Device Log Facility FW_DEVLOG_FACILITY_CF

The firmware team added a new Device Log Facility FW_DEVLOG_FACILITY_CF,
but the driver has been decoding Device Log messages with that Facility as
"(NULL)", fixing it.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: For T4, don't read the Firmware Mailbox Control register

T4 doesn't have the Shadow copy of the register which we can read without
side effect. So don't read mbox control register for T4 adapter

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4 : Update T4/T5/T6 register ranges

Update T4/T5/T6 adapter register ranges so that it doesn't read non
existent registers when dumped using ethtool

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git./linux/kernel/git/ebiederm/net-next

Eric W. Biederman says:

====================
net: Pass net through ip fragmention

This is the next installment of my work to pass struct net through the
output path so the code does not need to guess how to figure out which
network namespace it is in, and ultimately routes can have output
devices in another network namespace.

This round focuses on passing net through ip fragmentation which we seem
to call from about everywhere. That is the main ip output paths, the
bridge netfilter code, and openvswitch. This has to happend at once
accross the tree as function pointers are involved.

First some prep work is done, then ipv4 and ipv6 are converted and then
temporary helper functions are removed.
====================

Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'rds-perf'

Sowmini Varadhan says:

====================
RDS: RDS-TCP perf enhancements

A 3-part patchset that (a) improves current RDS-TCP perf
by 2X-3X and (b) refactors earlier robustness code for
better observability/scaling.

Patch 1 is an enhancment of earlier robustness fixes
that had used separate sockets for client and server endpoints to
resolve race conditions. It is possible to have an equivalent
solution that does not use 2 sockets. The benefit of a
single socket solution is that it results in more predictable
and observable behavior for the underlying TCP pipe of an
RDS connection

Patches 2 and 3 are simple, straightforward perf bug fixes
that align the RDS TCP socket with other parts of the kernel stack.

v2: fix kbuild-test-robot warnings, comments from Sergei Shtylov
and Santosh Shilimkar.
====================

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS-TCP: Set up MSG_MORE and MSG_SENDPAGE_NOTLAST as appropriate in rds_tcp_xmit

For the same reasons as commit 2f5338442425 ("tcp: allow splice() to
build full TSO packets") and commit 35f9c09fe9c7 ("tcp: tcp_sendpages()
should call tcp_push() once"), rds_tcp_xmit may have multiple pages to
send, so use the MSG_MORE and MSG_SENDPAGE_NOTLAST as hints to
tcp_sendpage()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune

Using the value of RDS_TCP_DEFAULT_BUFSIZE (128K)
clobbers efficient use of TSO because it inflates the size_goal
that is computed in tcp_sendmsg/tcp_sendpage and skews packet
latency, and the default values for these parameters actually
results in significantly better performance.

In request-response tests using rds-stress with a packet size of
100K with 16 threads (test parameters -q 100000 -a 256 -t16 -d16)
between a single pair of IP addresses achieves a throughput of
6-8 Gbps. Without this patch, throughput maxes at 2-3 Gbps under
equivalent conditions on these platforms.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: Use a single TCP socket for both send and receive.

Commit f711a6ae062c ("net/rds: RDS-TCP: Always create a new rds_sock
for an incoming connection.") modified rds-tcp so that an incoming SYN
would ignore an existing "client" TCP connection which had the local
port set to the transient port. The motivation for ignoring the existing
"client" connection in f711a6ae was to avoid race conditions and an
endless duel of reconnect attempts triggered by a restart/abort of one
of the nodes in the TCP connection.

However, having separate sockets for active and passive sides
is avoidable, and the simpler model of a single TCP socket for
both send and receives of all RDS connections associated with
that tcp socket makes for easier observability. We avoid the race
conditions from f711a6ae by attempting reconnects in rds_conn_shutdown
if, and only if, the (new) c_outgoing bit is set for RDS_TRANS_TCP.
The c_outgoing bit is initialized in __rds_conn_create().

A side-effect of re-using the client rds_connection for an incoming
SYN is the potential of encountering duelling SYNs, i.e., we
have an outgoing RDS_CONN_CONNECTING socket when we get the incoming
SYN. The logic to arbitrate this criss-crossing SYN exchange in
rds_tcp_accept_one() has been modified to emulate the BGP state
machine: the smaller IP address should back off from the connection attempt.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'xgbe-next'

Tom Lendacky says:

====================
amd-xgbe: AMD XGBE driver updates 2015-09-30

The following patches are included in this driver update series:

- Remove unneeded semi-colon
- Follow the DT/ACPI precedence used by the device_ APIs
- Add ethtool support for getting and setting the msglevel
- Add ethtool support error and debug messages
- Simplify the hardware FIFO assignment calculations
- Add receive buffer unavailable statistic
- Use the device workqueue instead of the system workqueue
- Remove the use of a link state bit

This patch series is based on net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Remove the XGBE_LINK state bit

The XGBE_LINK bit is used just to determine whether to call the
netif_carrier_on/off functions. Rather than define and use this bit,
just call the functions. The netif_carrier_ok function can be used in
place of checking the XGBE_LINK bit in the future.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Use device workqueue instead of system workqueue

The driver creates, flushes and destroys a device workqueue but queues
work to the system workqueue. Switch from using the system workqueue to
the device workqueue.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Add receive buffer unavailable statistic

Add a statistic that tracks how many times an interrupt is generated for
a receive buffer not being available to the hardware which prevents the
hardware from being able to DMA the received data.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Simplify calculation and setting of queue fifos

The calculation of the Tx and Rx fifo sizes can be calculated rather
than hardcoded in a switch statement. Additionally, the per-queue fifo
sizes can be calculated rather than hardcoded using if/else if statements
that can possibly underutilize the available fifo area.

Change the code to calculate the fifo sizes and the per-queue fifo sizes
to simplify the code and make best use of the available fifo.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Add ethtool error and debug messages

Add error and dynamic debug messages to various ethtool functions in
the driver while also removing the DBGPR debug print calls. Also, change
the message level for some error messages from alert to err.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Add ethtool support for setting the msglevel

Provide the ethtool functions to support getting and setting the
msglevel for the driver.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Use proper DT / ACPI precedence checking

Device tree presence takes precedence over ACPI in the device_* APIs.
The amd-xgbe driver should follow the same precedence. Update the check
on whether to use DT / ACPI to follow this.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

amd-xgbe: Remove an unneeded semicolon on a switch statement

Remove an unneeded semicolon at the end of a switch statement block.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: restore fastopen operations

I accidentally cleared fastopenq.max_qlen in reqsk_queue_alloc()
while max_qlen can be set before listen() is called,
using TCP_FASTOPEN socket option for example.

Fixes: 0536fcc039a8 ("tcp: prepare fastopen code for upcoming listener changes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-y2038'

Arnd Bergmann says:

====================
net: assorted y2038 changes

This is a set of changes for network drivers and core code to
get rid of the use of time_t and derived data structures.

I have a longer set of patches that enables me to build kernels
with the time_t definition removed completely as a help to find
y2038 overflow issues. This is the subset for networking that
contains all code that has a reasonable way of fixing at the
moment and that is either commonly used (in one of the defconfigs)
or that blocks building a whole subsystem.

Most of the patches in this series should be noncontroversial,
but the last two that I marked [RFC] are a bit tricky and
need input from people that are more familiar with the code than
I am. All 12 patches are independent of one another and can
be applied in any order, so feel free to pick all that look
good.

Patches that are not included here are:

- disabling less common device drivers that I don't have a fix
   for yet, this includes
drivers/net/ethernet/brocade/bna/bfa_ioc.c
drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
drivers/net/ethernet/tile/tilegx.c
drivers/net/hamradio/baycom_ser_fdx.c
drivers/net/wireless/ath/ath10k/core.h
drivers/net/wireless/ath/ath9k/
drivers/net/wireless/ath/ath9k/
drivers/net/wireless/atmel.c
drivers/net/wireless/prism54/isl_38xx.c
drivers/net/wireless/rt2x00/rt2x00debug.c
drivers/net/wireless/rtlwifi/
drivers/net/wireless/ti/wlcore/
drivers/staging/ozwpan/
net/atm/mpoa_caches.c
net/atm/mpoa_proc.c
net/dccp/probe.c
net/ipv4/tcp_probe.c
net/netfilter/nfnetlink_queue_core.c
net/netfilter/nfnetlink_queue_core.c
net/netfilter/xt_time.c
net/openvswitch/flow.c
net/sctp/probe.c
net/sunrpc/auth_gss/
net/sunrpc/svcauth_unix.c
net/vmw_vsock/af_vsock.c
   We'll get there eventually, or we an add a dependency to ensure
   they are not built on 32-bit kernels that need to survive
   beyond 2038. Most of these should be really easy to fix.

- recvmmsg/sendmmsg system calls: patches have been sent out
   as part of the syscall series, need a little more work and
   review

- SIOCGSTAMP/SIOCGSTAMPNS/ ioctl calls: tricky, need to discuss
   with some folks at kernel summit

- SO_RCVTIMEO/SO_SNDTIMEO/SO_TIMESTAMP/SO_TIMESTAMPNS socket
   opt: similar and related to the ioctl

- mmapped packet socket: need to create v4 of the API, nontrivial

- pktgen: sends 32-bit timestamps over network, need to find out
   if using unsigned stamps is good enough

- af_rxpc: similar to pktgen, uses 32-bit times for deadlines

- ppp ioctl: patch is being worked on, nontrivial but doable
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: sctp: avoid incorrect time_t use

We want to avoid using time_t in the kernel because of the y2038
overflow problem. The use in sctp is not for storing seconds at
all, but instead uses microseconds and is passed as 32-bit
on all machines.

This patch changes the type to u32, which better fits the use.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: linux-sctp@vger.kernel.org
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: use ktime_t for internal timestamps

The ipv6 mip6 implementation is one of only a few users of the
skb_get_timestamp() function in the kernel, which is both unsafe
on 32-bit architectures because of the 2038 overflow, and slightly
less efficient than the skb_get_ktime() based approach.

This converts the function call and the mip6_report_rate_limiter
structure that stores the time stamp, eliminating all uses of
timeval in the ipv6 code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfnetlink: use y2038 safe timestamp

The __build_packet_message function fills a nfulnl_msg_packet_timestamp
structure that uses 64-bit seconds and is therefore y2038 safe, but
it uses an intermediate 'struct timespec' which is not.

This trivially changes the code to use 'struct timespec64' instead,
to correct the result on 32-bit architectures.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

atm: remove 'struct zatm_t_hist'

The zatm_t_hist structure is not used anywhere in the kernel, but is
exported to user space. As we are trying to eliminate uses of time_t
in the kernel for y2038 compatibility, the current definition triggers
checking tools because it contains 'struct timeval'.

As pointed out by Chas Williams, the only user of this structure was
the ZATM_GETHIST ioctl command that has been removed a long time ago,
and we can remove the structure as well without breaking any user
space.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Chas Williams <3chas3@gmail.com>
Cc: linux-atm-general@lists.sourceforge.net
Signed-off-by: David S. Miller <davem@davemloft.net>

mac80211: use ktime_get_seconds

The mac80211 code uses ktime_get_ts to measure the connected time.
As this uses monotonic time, it is y2038 safe on 32-bit systems,
but we still want to deprecate the use of 'timespec' because most
other users are broken.

This changes the code to use ktime_get_seconds() instead, which
avoids the timespec structure and is slightly more efficient.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: linux-wireless@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

mwifiex: avoid gettimeofday in ba_threshold setting

mwifiex_get_random_ba_threshold() uses a complex homegrown implementation
to generate a pseudo-random number from the current time as returned
from do_gettimeofday().

This currently requires two 32-bit divisions plus a couple of other
computations that are eventually discarded as only eight bits of
the microsecond portion are used at all.

We could replace this with a call to get_random_bytes(), but that
might drain the entropy pool too fast if this is called for each
packet.

Instead, this patch converts it to use ktime_get_ns(), which is a
bit faster than do_gettimeofday(), and then uses a similar algorithm
as before, but in a way that takes both the nanosecond and second
portion into account for slightly-more-but-still-not-very-random
pseudorandom number.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Amitkumar Karwar <akarwar@marvell.com>
Cc: Nishant Sarmukadam <nishants@marvell.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

mwifiex: use ktime_get_real for timestamping

The mwifiex_11n_aggregate_pkt() function creates a ktime_t from
a timeval returned by do_gettimeofday, which is slow and causes
an overflow in 2038 on 32-bit architectures.

This solves both problems by using the appropriate ktime_get_real()
function.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Amitkumar Karwar <akarwar@marvell.com>
Cc: Nishant Sarmukadam <nishants@marvell.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

net: igb: avoid using timespec

We want to deprecate the use of 'struct timespec' on 32-bit
architectures, as it is will overflow in 2038. The igb
driver uses it to read the current time, and can simply
be changed to use ktime_get_real_ts64() instead.

Because of hardware limitations, there is still an overflow
in year 2106, which we cannot really avoid, but this documents
the overflow.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Reviewed-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: avoid using timespec

We want to deprecate the use of 'struct timespec' on 32-bit
architectures, as it is will overflow in 2038. The stmmac
driver uses it to read the current time, and can simply
be changed to use ktime_get_real_ts64() instead.

Because of hardware limitations, there is still an overflow
in year 2106, which we cannot really avoid, but this documents
the overflow.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fec: avoid timespec use

The fec_ptp_enable_pps uses an open-coded implementation of ns_to_timespec,
which will be removed eventually as it is not y2038-safe on 32-bit
architectures. Two more instances of the same code in this file were
already converted to use the safe ns_to_timespec64 in commit 6630514fcee
("ptp: fec: use helpers for converting ns to timespec"), this changes
the last one as well.

The seconds portion here is actually unused and we could just remove the
timespec variable, but using ns_to_timespec64 can still be better as the
implementation can be hand-optimized in the future.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Fugang Duan <b38611@freescale.com>
Cc: Luwei Zhou <b45643@freescale.com>
Cc: Frank Li <Frank.Li@freescale.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ipv4-multipath-hash'

Peter Nørlund says:

====================
ipv4: Hash-based multipath routing

When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed
from more or less being destination-based into being quasi-random per-packet
scheduling. This increases the risk of out-of-order packets and makes it
impossible to use multipath together with anycast services.

This patch series replaces the old implementation with flow-based load
balancing based on a hash over the source and destination addresses.

Distribution of the hash is done with thresholds as described in RFC 2992.
This reduces the disruption when a path is added/remove when having more than
two paths.

To futher the chance of successful usage in conjuction with anycast, ICMP
error packets are hashed over the inner IP addresses. This ensures that PMTU
will work together with anycast or load-balancers such as IPVS.

Port numbers are not considered since fragments could cause problems with
anycast and IPVS. Relying on the DF-flag for TCP packets is also insufficient,
since ICMP inspection effectively extracts information from the opposite
flow which might have a different state of the DF-flag. This is also why the
RSS hash is not used. These are typically based on the NDIS RSS spec which
mandates TCP support.

Measurements of the additional overhead of a two-path multipath
(p_mkroute_input excl. __mkroute_input) on a Xeon X3550 (4 cores, 2.66GHz):

Original per-packet: ~394 cycles/packet
L3 hash: ~76 cycles/packet

Changes in v5:
- Fixed compilation error

Changes in v4:
- Functions take hash directly instead of func ptr
- Added inline hash function
- Added dummy macros to minimize ifdefs
- Use upper 31 bits of hash instead of lower

Changes in v3:
- Multipath algorithm is no longer configurable (always L3)
- Added random seed to hash
- Moved ICMP inspection to isolated function
- Ignore source quench packets (deprecated as per RFC 6633)

Changes in v2:
- Replaced 8-bit xor hash with 31-bit jenkins hash
- Don't scale weights (since 31-bit)
- Avoided unnecesary renaming of variables
- Rely on DF-bit instead of fragment offset when checking for fragmentation
- upper_bound is now inclusive to avoid overflow
- Use a callback to postpone extracting flow information until necessary
- Skipped ICMP inspection entirely with L4 hashing
- Handle newly added sysctl ignore_routes_with_linkdown
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: ICMP packet inspection for multipath

ICMP packets are inspected to let them route together with the flow they
belong to, minimizing the chance that a problematic path will affect flows
on other paths, and so that anycast environments can work with ECMP.

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: L3 hash-based multipath

Replaces the per-packet multipath with a hash-based multipath using
source and destination address.

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp-listener-fixes-and-improvement'

Eric Dumazet says:

====================
tcp: lockless listener fixes and improvement

This fixes issues with TCP FastOpen vs lockless listeners,
and SYNACK being attached to request sockets.

Then, last patch brings performance improvement for
syncookies generation and validation.

Tested under a 4.3 Mpps SYNFLOOD attack, new perf profile looks
like :
    12.11%  [kernel]  [k] sha_transform
     5.83%  [kernel]  [k] tcp_conn_request
     4.59%  [kernel]  [k] __inet_lookup_listener
     4.11%  [kernel]  [k] ipt_do_table
     3.91%  [kernel]  [k] tcp_make_synack
     3.05%  [kernel]  [k] fib_table_lookup
     2.74%  [kernel]  [k] sock_wfree
     2.66%  [kernel]  [k] memcpy_erms
     2.12%  [kernel]  [k] tcp_v4_rcv
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: avoid two atomic ops for syncookies

inet_reqsk_alloc() is used to allocate a temporary request
in order to generate a SYNACK with a cookie. Then later,
syncookie validation also uses a temporary request.

These paths already took a reference on listener refcount,
we can avoid a couple of atomic operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: use sk_fullsock() in __netdev_pick_tx()

SYN_RECV & TIMEWAIT sockets are not full blown, they do not have a
sk_dst_cache pointer.

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: inet6_sk() should use sk_fullsock()

SYN_RECV & TIMEWAIT sockets are not full blown, they do not have a pinet6
pointer.

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

inet: ip_skb_dst_mtu() should use sk_fullsock()

SYN_RECV & TIMEWAIT sockets are not full blown,
do not even try to call ip_sk_use_pmtu() on them.

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix fastopen races vs lockless listener

There are multiple races that need fixes :

1) skb_get() + queue skb + kfree_skb() is racy

An accept() can be done on another cpu, data consumed immediately.
tcp_recvmsg() uses __kfree_skb() as it is assumed all skb found in
socket receive queue are private.

Then the kfree_skb() in tcp_rcv_state_process() uses an already freed skb

2) tcp_reqsk_record_syn() needs to be done before tcp_try_fastopen()
for the same reasons.

3) We want to send the SYNACK before queueing child into accept queue,
otherwise we might reintroduce the ooo issue fixed in
commit 7c85af881044 ("tcp: avoid reorders for TFO passive connections")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bridge-netlink'

Nikolay Aleksandrov says:

====================
bridge: complete netlink support

This set completes the bridge device's netlink support and makes it
possible to view and configure everything that can be configured via
sysfs. I have tested all of these (setting and getting). There're a few
longer line warnings about the br_get_size() ifla comments but I think we
should have them to know what has been accounted for. I have used the sysfs
interface as a guide of what and how to set. As usual I'll send the
corresponding iproute2 patches later.
The bridge port's netlink interface will be completed after this set gets
applied in some form.

This patch-set is on top of my last vlan cleanups set:
http://www.spinics.net/lists/netdev/msg346005.html
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add support for default_pvid

Add IFLA_BR_VLAN_DEFAULT_PVID to allow setting/getting bridge's
default_pvid via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add support for netfilter tables config

Add support to allow getting/setting netfilter tables settings.
Currently these are IFLA_BR_NF_CALL_IPTABLES, IFLA_BR_NF_CALL_IP6TABLES
and IFLA_BR_NF_CALL_ARPTABLES.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add support for igmp's intervals

Add support to set/get all of the igmp's configurable intervals via
netlink. These currently are:
IFLA_BR_MCAST_LAST_MEMBER_INTVL
IFLA_BR_MCAST_MEMBERSHIP_INTVL
IFLA_BR_MCAST_QUERIER_INTVL
IFLA_BR_MCAST_QUERY_INTVL
IFLA_BR_MCAST_QUERY_RESPONSE_INTVL
IFLA_BR_MCAST_STARTUP_QUERY_INTVL

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add support for multicast_startup_query_count

Add IFLA_BR_MCAST_STARTUP_QUERY_CNT to allow setting/getting
br->multicast_startup_query_count via netlink. Also align the ifla
comments.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add support for multicast_last_member_count

Add IFLA_BR_MCAST_LAST_MEMBER_CNT to allow setting/getting
br->multicast_last_member_count via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add support for igmp's hash_max

Add IFLA_BR_MCAST_HASH_MAX to allow setting/getting br->hash_max via
netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add support for igmp's hash_elasticity

Add IFLA_BR_MCAST_HASH_ELASTICITY to allow setting/getting
br->hash_elasticity via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add support for multicast_querier

Add IFLA_BR_MCAST_QUERIER to allow setting/getting br->multicast_querier
via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add support for multicast_query_use_ifaddr

Add IFLA_BR_MCAST_QUERY_USE_IFADDR to allow setting/getting
br->multicast_query_use_ifaddr via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add support for multicast_snooping

Add IFLA_BR_MCAST_SNOOPING to allow enabling/disabling multicast
snooping via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add support for multicast_router

Add IFLA_BR_MCAST_ROUTER to allow setting and retrieving
br->multicast_router when igmp snooping is enabled.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add fdb flush

Simple attribute that flushes the bridge's fdb.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: add group_addr support

Add IFLA_BR_GROUP_ADDR attribute to allow setting and retrieving the
group_addr via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: export all timers

Export the following bridge timers (also exported via sysfs):
IFLA_BR_HELLO_TIMER, IFLA_BR_TCN_TIMER, IFLA_BR_TOPOLOGY_CHANGE_TIMER,
IFLA_BR_GC_TIMER via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: export topology_change and topology_change_detected

Add IFLA_BR_TOPOLOGY_CHANGE and IFLA_BR_TOPOLOGY_CHANGE_DETECTED and
export them via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: export root path cost

Add IFLA_BR_ROOT_PATH_COST and export it via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: export root port

Add IFLA_BR_ROOT_PORT and export it via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: export bridge id

Add IFLA_BR_BRIDGE_ID and export br->bridge_id via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bridge: netlink: export root id

Add IFLA_BR_ROOT_ID and export br->designated_root via netlink. For this
purpose add struct ifla_bridge_id that would represent struct bridge_id.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>