git.monstr.eu Git - linux-2.6-microblaze.git/log

tcp: tsq: move tsq_flags close to sk_wmem_alloc

tsq_flags being in the same cache line than sk_wmem_alloc
makes a lot of sense. Both fields are changed from tcp_wfree()
and more generally by various TSQ related functions.

Prior patch made room in struct sock and added sk_tsq_flags,
this patch deletes tsq_flags from struct tcp_sock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: reorganize struct sock for better data locality

Group fields used in TX path, and keep some cache lines mostly read
to permit sharing among cpus.

Gained two 4 bytes holes on 64bit arches.

Added a place holder for tcp tsq_flags, next to sk_wmem_alloc
to speed up tcp_wfree() in the following patch.

I have not added ____cacheline_aligned_in_smp, this might be done later.
I prefer doing this once inet and tcp/udp sockets reorg is also done.

Tested with both TCP and UDP.

UDP receiver performance under flood increased by ~20 % :
Accessing sk_filter/sk_wq/sk_napi_id no longer stalls because sk_drops
was moved away from a critical cache line, now mostly read and shared.

/* --- cacheline 4 boundary (256 bytes) --- */
unsigned int               sk_napi_id;           /* 0x100   0x4 */
int                        sk_rcvbuf;            /* 0x104   0x4 */
struct sk_filter *         sk_filter;            /* 0x108   0x8 */
union {
struct socket_wq * sk_wq;                /*         0x8 */
struct socket_wq * sk_wq_raw;            /*         0x8 */
};                                               /* 0x110   0x8 */
struct xfrm_policy *       sk_policy[2];         /* 0x118  0x10 */
struct dst_entry *         sk_rx_dst;            /* 0x128   0x8 */
struct dst_entry *         sk_dst_cache;         /* 0x130   0x8 */
atomic_t                   sk_omem_alloc;        /* 0x138   0x4 */
int                        sk_sndbuf;            /* 0x13c   0x4 */
/* --- cacheline 5 boundary (320 bytes) --- */
int                        sk_wmem_queued;       /* 0x140   0x4 */
atomic_t                   sk_wmem_alloc;        /* 0x144   0x4 */
long unsigned int          sk_tsq_flags;         /* 0x148   0x8 */
struct sk_buff *           sk_send_head;         /* 0x150   0x8 */
struct sk_buff_head        sk_write_queue;       /* 0x158  0x18 */
__s32                      sk_peek_off;          /* 0x170   0x4 */
int                        sk_write_pending;     /* 0x174   0x4 */
long int                   sk_sndtimeo;          /* 0x178   0x8 */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tcp_mtu_probe() is likely to exit early

Adding a likely() in tcp_mtu_probe() moves its code which used to
be inlined in front of tcp_write_xmit()

We still have a cache line miss to access icsk->icsk_mtup.enabled,
we will probably have to reorganize fields to help data locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: add a shortcut in tcp_small_queue_check()

Always allow the two first skbs in write queue to be sent,
regardless of sk_wmem_alloc/sk_pacing_rate values.

This helps a lot in situations where TX completions are delayed either
because of driver latencies or softirq latencies.

Test is done with no cache line misses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: avoid one atomic in tcp_wfree()

Under high load, tcp_wfree() has an atomic operation trying
to schedule a tasklet over and over.

We can schedule it only if our per cpu list was empty.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: add shortcut in tcp_tasklet_func()

Under high stress, I've seen tcp_tasklet_func() consuming
~700 usec, handling ~150 tcp sockets.

By setting TCP_TSQ_DEFERRED in tcp_wfree(), we give a chance
for other cpus/threads entering tcp_write_xmit() to grab it,
allowing tcp_tasklet_func() to skip sockets that already did
an xmit cycle.

In the future, we might give to ACK processing an increased
budget to reduce even more tcp_tasklet_func() amount of work.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: remove one locked operation in tcp_wfree()

Instead of atomically clear TSQ_THROTTLED and atomically set TSQ_QUEUED
bits, use one cmpxchg() to perform a single locked operation.

Since the following patch will also set TCP_TSQ_DEFERRED here,
this cmpxchg() will make this addition free.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: add tsq_flags / tsq_enum

This is a cleanup, to ease code review of following patches.

Old 'enum tsq_flags' is renamed, and a new enumeration is added
with the flags used in cmpxchg() operations as opposed to
single bit operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bnxt_en-dcbnl'

Michael Chan says:

====================
bnxt_en: Add DCBNL support.

This series adds DCBNL operations to support host-based IEEE DCBX.

v2: Updated to the latest firmware interface spec.

David, please consider this series for net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Add PFC statistics.

Report PFC statistics to ethtool -S and DCBNL.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Implement DCBNL to support host-based DCBX.

Support only IEEE DCBX initially. Add IEEE DCBNL ops and functions to
get and set the hardware DCBX parameters. The DCB code is conditional on
Kconfig CONFIG_BNXT_DCB.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Update firmware header file to latest 1.6.0.

Latest interface has the latest DCB command structs. Get and store the
max number of lossless TCs the hardware can support.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Re-factor bnxt_setup_tc().

Add a new function bnxt_setup_mq_tc() to handle MQPRIO. This new function
will be called during ETS setup when we add DCBNL in the next patch.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: dp83848: Support ethernet pause frames

According to the documentation, the PHYs supported by this driver
can also support pause frames. Announce this to be so.
Tested with a TI83822I.

Acked-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6 addrconf: Implemented enhanced DAD (RFC7527)

Implemented RFC7527 Enhanced DAD.
IPv6 duplicate address detection can fail if there is some temporary
loopback of Ethernet frames. RFC7527 solves this by including a random
nonce in the NS messages used for DAD, and if an NS is received with the
same nonce it is assumed to be a looped back DAD probe and is ignored.
RFC7527 is enabled by default. Can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.

Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mv88e6390-batch-three'

Andrew Lunn says:

====================
mv88e6390 batch 3

More patches to support the MV88e6390. This is mostly refactoring
existing code and adding implementations for the mv88e6390. This
patchset set which reserved frames are sent to the cpu, the size of
jumbo frames that will be accepted, turn off egress rate limiting, and
configuration of pause frames.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Implement mv88e6390 pause control

The mv88e6390 has a number flow control registers accessed via the
Flow Control register. Use these to set the pause control.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Refactor pause configuration

The mv88e6390 has a different mechanism for configuring pause.
Refactor the code into an ops function, and for the moment, don't add
any mv88e6390 code yet.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Refactor egress rate limiting

There are two different rate limiting configurations, depending on the
switch generation. Refactor this into ops.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Refactor setting of jumbo frames

Some switches support jumbo frames. Refactor this code into operations
in the ops structure.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Reserved Management frames to CPU

Older devices have a couple of registers in global2. The mv88e6390
family has a single register in global1 behind which hides similar
configuration. Implement and op for this.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mv88e6390-batch-two'

Andrew Lunn says:

====================
MV88E6390 batch two

This is the second batch of patches adding support for the
MV88e6390. They are not sufficient to make it work properly.

The mv88e6390 has a much expanded set of priority maps. Refactor the
existing code, and implement basic support for the new device.

Similarly, the monitor control register has been reworked.

The mv88e6390 has something odd in its EDSA tagging implementation,
which means it is not possible to use it. So we need to use DSA
tagging. This is the first device with EDSA support where we need to
use DSA, and the code does not support this. So two patches refactor
the existing code. The two different register definitions are
separated out, and using DSA on an EDSA capable device is added.

v2:
Add port prefix
Add helper function for 6390
Add _IEEE_ into #defines
Split monitor_ctrl into a number of separate ops.
Remove 6390 code which is management, used in a later patch
s/EGREES/EGRESS/.
Broke up setup_port_dsa() and set_port_dsa() into a number of ops

v3:
Verify mandatory ops for port setup
Don't set ether type for DSA port.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Refactor CPU and DSA port setup

Older chips only support DSA tagging. Newer chips have both DSA and
EDSA tagging. Refactor the code by adding port functions for setting the
frame mode, egress mode, and if to forward unknown frames.

This results in the helper mv88e6xxx_6065_family() becoming unused, so
remove it.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
v3:
Verify mandatory ops for port setup
Don't set ether type for DSA port.
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Move the tagging protocol into info

Older chips support a single tagging protocol, DSA. New chips support
both DSA and EDSA, an enhanced version. Having both as an option
changes the register layouts. Up until now, it has been assumed that
if EDSA is supported, it will be used. Hence the register layout has
been determined by which protocol should be used. However, mv88e6390
has a different implementation of EDSA, which requires we need to use
the DSA tagging. Hence separate the selection of the protocol from the
register layout.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Monitor and Management tables

The mv88e6390 changes the monitor control register into the Monitor
and Management control, which is an indirection register to various
registers.

Add ops to set the CPU port and the ingress/egress port for both
register layouts, to global1

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Implement mv88e6390 tag remap

The mv88e6390 does not have the two registers to set the frame
priority map. Instead it has an indirection registers for setting a
number of different priority maps. Refactor the old code into an
function, implement the mv88e6390 version, and use an op to call the
right one.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'fib-notifier-event-replay'

Jiri Pirko says:

====================
ipv4: fib: Replay events when registering FIB notifier

Ido says:

In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
by a new FIB notification chain to which modules could register in order
to be notified about the addition and deletion of FIB entries. The
motivation for this change was that switchdev drivers need to be able to
reflect the entire FIB table and not only FIBs configured on top of the
port netdevs themselves. This is useful in case of in-band management.

The fundamental problem with this approach is that upon registration
listeners lose all the information previously sent in the chain and
thus have an incomplete view of the FIB tables, which can result in
packet loss. This patchset fixes that by dumping the FIB tables and
replaying notifications previously sent in the chain for the registered
notification block.

The entire dump process is done under RCU and thus the FIB notification
chain is converted to be atomic. The listeners are modified accordingly.
This is done in the first eight patches.

The ninth patch adds a change sequence counter to ensure the integrity
of the FIB dump. The last patch adds the dump itself to the FIB chain
registration function and modifies existing listeners to pass a callback
to be executed in case dump was inconsistent.

---
v3->v4:
- Register the notification block after the dump and protect it using
  the change sequence counter (Hannes Frederic Sowa).
- Since we now integrate the dump into the registration function, drop
  the sysctl to set maximum number of retries and instead set it to a
  fixed number. Lets see if it's really a problem before adding something
  we can never remove.
- For the same reason, dump FIB tables for all net namespaces.
- Add a comment regarding guarantees provided by mutex semantics.

v2->v3:
- Add sysctl to set the number of FIB dump retries (Hannes Frederic Sowa).
- Read the sequence counter under RTNL to ensure synchronization
  between the dump process and other processes changing the routing
  tables (Hannes Frederic Sowa).
- Pass a callback to the dump function to be executed prior to a retry.
- Limit the dump to a single net namespace.

v1->v2:
- Add a sequence counter to ensure the integrity of the FIB dump
  (David S. Miller, Hannes Frederic Sowa).
- Protect notifications from re-ordering in listeners by using an
  ordered workqueue (Hannes Frederic Sowa).
- Introduce fib_info_hold() (Jiri Pirko).
- Relieve rocker from the need to invoke the FIB dump by registering
  to the FIB notification chain prior to ports creation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fib: Replay events when registering FIB notifier

Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
introduced a new notification chain to notify listeners (f.e., switchdev
drivers) about addition and deletion of routes.

However, upon registration to the chain the FIB tables can already be
populated, which means potential listeners will have an incomplete view
of the tables.

Solve that by dumping the FIB tables and replaying the events to the
passed notification block. The dump itself is done using RCU in order
not to starve consumers that need RTNL to make progress.

The integrity of the dump is ensured by reading the FIB change sequence
counter before and after the dump under RTNL. This allows us to avoid
the problematic situation in which the dumping process sends a ENTRY_ADD
notification following ENTRY_DEL generated by another process holding
RTNL.

Callers of the registration function may pass a callback that is
executed in case the dump was inconsistent with current FIB tables.

The number of retries until a consistent dump is achieved is set to a
fixed number to prevent callers from looping for long periods of time.
In case current limit proves to be problematic in the future, it can be
easily converted to be configurable using a sysctl.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fib: Allow for consistent FIB dumping

The next patch will enable listeners of the FIB notification chain to
request a dump of the FIB tables. However, since RTNL isn't taken during
the dump, it's possible for the FIB tables to change mid-dump, which
will result in inconsistency between the listener's table and the
kernel's.

Allow listeners to know about changes that occurred mid-dump, by adding
a change sequence counter to each net namespace. The counter is
incremented just before a notification is sent in the FIB chain.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fib: Convert FIB notification chain to be atomic

In order not to hold RTNL for long periods of time we're going to dump
the FIB tables using RCU.

Convert the FIB notification chain to be atomic, as we can't block in
RCU critical sections.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: Register FIB notifier before creating ports

We can miss FIB notifications sent between the time the ports were
created and the FIB notification block registered.

Instead of receiving these notifications only when they are replayed for
the FIB notification block during registration, just register the
notification block before the ports are created.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: Implement FIB offload in deferred work

Convert rocker to offload FIBs in deferred work in a similar fashion to
mlxsw, which was converted in the previous commits.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: Create an ordered workqueue for FIB offload

As explained in the previous commits, we need to process FIB entries
addition / deletion events in FIFO order or otherwise we can have a
mismatch between the kernel's FIB table and the device's.

Create an ordered workqueue for rocker to which these work items will be
submitted to.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Implement FIB offload in deferred work

FIB offload is currently done in process context with RTNL held, but
we're about to dump the FIB tables in RCU critical section, so we can no
longer sleep.

Instead, defer the operation to process context using deferred work. Make
sure fib info isn't freed while the work is queued by taking a reference
on it and releasing it after the operation is done.

Deferring the operation is valid because the upper layers always assume
the operation was successful. If it's not, then the driver-specific
abort mechanism is called and all routed traffic is directed to slow
path.

The work items are submitted to an ordered workqueue to prevent a
mismatch between the kernel's FIB table and the device's.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core: Create an ordered workqueue for FIB offload

We're going to start processing FIB entries addition / deletion events
in deferred work. These work items must be processed in the order they
were submitted or otherwise we can have differences between the kernel's
FIB table and the device's.

Solve this by creating an ordered workqueue to which these work items
will be submitted to. Note that we can't simply convert the current
workqueue to be ordered, as EMADs re-transmissions are also processed in
deferred work.

Later on, we can migrate other work items to this workqueue, such as FDB
notification processing and nexthop resolution, since they all take the
same lock anyway.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fib: Add fib_info_hold() helper

As explained in the previous commit, modules are going to need to take a
reference on fib info and then drop it using fib_info_put().

Add the fib_info_hold() helper to make the code more readable and also
symmetric with fib_info_put().

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fib: Export free_fib_info()

The FIB notification chain is going to be converted to an atomic chain,
which means switchdev drivers will have to offload FIB entries in
deferred work, as hardware operations entail sleeping.

However, while the work is queued fib info might be freed, so a
reference must be taken. To release the reference (and potentially free
the fib info) fib_info_put() will be called, which in turn calls
free_fib_info().

Export free_fib_info() so that modules will be able to invoke
fib_info_put().

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

act_mirred: fix a typo in get_dev

Fixes: 255cb30425c0 ("net/sched: act_mirred: Add new tc_action_ops get_dev()")
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '40GbE' of git://git./linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2016-12-02

This series contains updates to i40e and i40evf only.

Alex provides changes so that we are much more robust about defining what
we can and cannot offload in i40e and i40evf by doing additional checks
other than L4 tunnel header length.

Jake provides several fixes/changes, first cleaning up a label that is
unnecessary, as well as cleaned up the use of a "magic number".  Clarified
the code by separating the global private flags and the regular private
flags per interface into two arrays, so that future additions will not
produce duplication and buggy code.  Adds additional checks to protect
against NULL values for msix_entries and q_vectors pointers.

Michal adds Clause22 method for accessing registers for some external
PHYs.

Piotr adds additional protocol support for the admin queue discover
capabilities function.

Tushar Dave fixes a panic seen on SPARC, where writel() should not be
used to write directly to a memory address but only to a memory mapped
I/O address otherwise it causes data access exceptions.

Joe Perches separates out a section of code into its own function, to
help reduce i40evf_reset_task() a bit.

Alan fixes an issue by checking for NULL before dereferencing msix_entries
and returning early in the case where it is NULL within the i40evf_close()
code path.

Henry provides code cleanup to remove unreachable and redundant sections
of code.  Fixed up an issue where new NICs were not identifying "unknown
PHYs" correctly.

Harshitha fixes a issue where the ethtool "Supported Link" modes list
backplane interfaces on X722 devices for 10 GbE with SFP+ and Cortina
retimer, where these interfaces should not be visible to the user since
they cannot use them.

Carolyn changes an X722 informational message so that it only appears
when extra messages are desired.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix the missing avr32 SOF_TIMESTAMPING_OPT_STATS

The commit of SOF_TIMESTAMPING_OPT_STATS didn't include the
new header for avr32, causing build to break. The patch fixes it.

Fixes: 1c885808e456 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: be less conservative with sock rmem accounting

Before commit 850cbaddb52d ("udp: use it's own memory accounting
schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
the rcvbuf by the whole current packet's truesize. After said commit
we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
is empty. As reported by Jesper this cause a performance regression
for some (small) values of rcvbuf.

This commit is intended to fix the regression restoring the old
handling of the rcvbuf limit.

Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net_sched: gen_estimator: account for timer drifts

Under heavy stress, timer used in estimators tend to slowly be delayed
by a few jiffies, leading to inaccuracies.

Lets remember what was the last scheduled jiffies so that we get more
precise estimations, without having to add a multiply/divide in the loop
to account for the drifts.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: remove EFX_BUG_ON_PARANOID, use EFX_WARN_ON_[ONCE_]PARANOID instead

Logically, EFX_BUG_ON_PARANOID can never be correct. For, BUG_ON should
only be used if it is not possible to continue without potential harm;
and since the non-DEBUG driver will continue regardless (as the BUG_ON is
compiled out), clearly the BUG_ON cannot be needed in the DEBUG driver.
So, replace every EFX_BUG_ON_PARANOID with either an EFX_WARN_ON_PARANOID
or the newly defined EFX_WARN_ON_ONCE_PARANOID.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'samples-bpf-automated-cgroup-tests'

Sargun Dhillon says:

====================
samples, bpf: Refactor; Add automated tests for cgroups

These two patches are around refactoring out some old, reusable code from the
existing test_current_task_under_cgroup_user test, and adding a new, automated
test.

There is some generic cgroupsv2 setup & cleanup code, given that most
environment still don't have it setup by default. With this code, we're able
to pretty easily add an automated test for future cgroupsv2 functionality.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

samples, bpf: Add automated test for cgroup filter attachments

This patch adds the sample program test_cgrp2_attach2. This program is
similar to test_cgrp2_attach, but it performs automated testing of the
cgroupv2 BPF attached filters. It runs the following checks:
* Simple filter attachment
* Application of filters to child cgroups
* Overriding filters on child cgroups
* Checking that this still works when the parent filter is removed

The filters that are used here are simply allow all / deny all filters, so
it isn't checking the actual functionality of the filters, but rather
the behaviour around detachment / attachment. If net_cls is enabled,
this test will fail.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

samples, bpf: Refactor test_current_task_under_cgroup - separate out helpers

This patch modifies test_current_task_under_cgroup_user. The test has
several helpers around creating a temporary environment for cgroup
testing, and moving the current task around cgroups. This set of
helpers can then be used in other tests.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

samples/bpf: silence compiler warnings

silence some of the clang compiler warnings like:
include/linux/fs.h:2693:9: warning: comparison of unsigned enum expression < 0 is always false
arch/x86/include/asm/processor.h:491:30: warning: taking address of packed member 'sp0' of class or structure 'x86_hw_tss' may result in an unaligned pointer value
include/linux/cgroup-defs.h:326:16: warning: field 'cgrp' with variable sized type 'struct cgroup' not at the end of a struct or class is a GNU extension
since they add too much noise to samples/bpf/ build.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netns: fix net_generic() "id - 1" bloat

net_generic() function is both a) inline and b) used ~600 times.

It has the following code inside

...
ptr = ng->ptr[id - 1];
...

"id" is never compile time constant so compiler is forced to subtract 1.
And those decrements or LEA [r32 - 1] instructions add up.

We also start id'ing from 1 to catch bugs where pernet sybsystem id
is not initialized and 0. This is quite pointless idea (nothing will
work or immediate interference with first registered subsystem) in
general but it hints what needs to be done for code size reduction.

Namely, overlaying allocation of pointer array and fixed part of
structure in the beginning and using usual base-0 addressing.

Ids are just cookies, their exact values do not matter, so lets start
with 3 on x86_64.

Code size savings (oh boy): -4.2 KB

As usual, ignore the initial compiler stupidity part of the table.

add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
function                                     old     new   delta
tipc_nametbl_insert_publ                    1250    1270     +20
nlmclnt_lookup_host                          686     703     +17
nfsd4_encode_fattr                          5930    5941     +11
nfs_get_client                              1050    1061     +11
register_pernet_operations                   333     342      +9
tcf_mirred_init                              843     849      +6
tcf_bpf_init                                1143    1149      +6
gss_setup_upcall                             990     994      +4
idmap_name_to_id                             432     434      +2
ops_init                                     274     275      +1
nfsd_inject_forget_client                    259     260      +1
nfs4_alloc_client                            612     613      +1
tunnel_key_walker                            164     163      -1

...

tipc_bcbase_select_primary                   392     360     -32
mac80211_hwsim_new_radio                    2808    2767     -41
ipip6_tunnel_ioctl                          2228    2186     -42
tipc_bcast_rcv                               715     672     -43
tipc_link_build_proto_msg                   1140    1089     -51
nfsd4_lock                                  3851    3796     -55
tipc_mon_rcv                                1012     956     -56
Total: Before=156643951, After=156639743, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netns: add dummy struct inside "struct net_generic"

This is precursor to fixing "[id - 1]" bloat inside net_generic().

Name "s" is chosen to complement name "u" often used for dummy unions.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netns: publish net_generic correctly

Publishing net_generic pointer is done with silly mistake: new array is
published BEFORE setting freshly acquired pernet subsystem pointer.

memcpy
rcu_assign_pointer
kfree_rcu
ng->ptr[id - 1] = data;

This bug was introduced with commit dec827d174d7f76c457238800183ca864a639365
("[NETNS]: The generic per-net pointers.") in the glorious days of
chopping networking stack into containers proper 8.5 years ago (whee...)

How it didn't trigger for so long?
Well, you need quite specific set of conditions:

*) race window opens once per pernet subsystem addition
   (read: modprobe or boot)

*) not every pernet subsystem is eligible (need ->id and ->size)

*) not every pernet subsystem is vulnerable (need incorrect or absense
   of ordering of register_pernet_sybsys() and actually using net_generic())

*) to hide the bug even more, default is to preallocate 13 pointers which
   is actually quite a lot. You need IPv6, netfilter, bridging etc together
   loaded to trigger reallocation in the first place. Trimmed down
   config are OK.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: 2-clause nla_ok()

nla_ok() consists of 3 clauses:

1) int rem >= (int)sizeof(struct nlattr)

2) u16 nla_len >= sizeof(struct nlattr)

3) u16 nla_len <= int rem

The statement is that clause (1) is redundant.

What it does is ensuring that "rem" is a positive number,
so that in clause (3) positive number will be compared to positive number
with no problems.

However, "u16" fully fits into "int" and integers do not change value
when upcasting even to signed type. Negative integers will be rejected
by clause (3) just fine. Small positive integers will be rejected
by transitivity of comparison operator.

NOTE: all of the above DOES NOT apply to nlmsg_ok() where ->nlmsg_len is
u32(!), so 3 clauses AND A CAST TO INT are necessary.

Obligatory space savings report: -1.6 KB

$ ./scripts/bloat-o-meter ../vmlinux-000* ../vmlinux-001*
add/remove: 0/0 grow/shrink: 3/63 up/down: 35/-1692 (-1657)
function                                     old     new   delta
validate_scan_freqs                          142     155     +13
tcf_em_tree_validate                         867     879     +12
dcbnl_ieee_del                               328     338     +10
netlbl_cipsov4_add_common.isra               218     215      -3
...
ovs_nla_put_actions                          888     806     -82
netlbl_cipsov4_add_std                      1648    1566     -82
nl80211_parse_sched_scan                    2889    2780    -109
ip_tun_from_nlattr                          3086    2945    -141

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

staging: wilc1000: use reset to set mac header

Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

iwlwifi: use reset to set transport header

Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlx4: use reset to set mac header

Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: use reset to set network header

Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: use reset to set network header

Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'xgene-jumbo-and-pause-frame'

Iyappan Subramanian says:

====================
drivers: net: xgene: Add Jumbo and Pause frame support

This patch set adds,

1. Jumbo frame support
2. Pause frame based flow control

and fixes RSS for non-TCP/UDP packets.
====================

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>

drivers: net: xgene: ethtool: Add get/set_pauseparam

This patch adds get_pauseparam and set_pauseparam functions.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Add flow control initialization

This patch adds flow control/pause frame initialization and
advertising capabilities.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Add flow control configuration

This patch adds functions to configure mac, when flow control
and pause frame settings change.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: fix: RSS for non-TCP/UDP

This patch fixes RSS feature, for non-TCP/UDP packets.

Signed-off-by: Khuong Dinh <kdinh@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Add change_mtu function

This patch implements ndo_change_mtu() callback function that
enables mtu change.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Add support for Jumbo frame

This patch adds support for jumbo frame, by allocating
additional buffer (page) pool and configuring the hardware.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Configure classifier with pagepool

This patch configures classifier with the pagepool information.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Add helper function

This is a prepartion patch and adds xgene_enet_get_fpsel() helper
function to get buffer pool number.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: davinci_cpdma: add missing EXPORTs

As of commit 8f32b90981dcdb355516fb95953133f8d4e6b11d
("net: ethernet: ti: davinci_cpdma: add set rate for a channel") the
ARM allmodconfig builds would fail modpost with:

ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!

Since these weren't declared as static, it is assumed they were
meant to be shared outside the file, and that modular build testing
was simply overlooked.

Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Cc: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Cc: Mugunthan V N <mugunthanvnm@ti.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: linux-omap@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'linux-can-next-for-4.10-20161201' of git://git./linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2016-12-01

this is a pull request of 4 patches for net-next/master.

There are two patches by Chris Paterson for the rcar_can and rcar_canfd
device tree binding documentation. And a patch by Geert Uytterhoeven
that corrects the order of interrupt specifiers.

The fourth patch by Colin Ian King fixes a spelling error in the
kvaser_usb driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: unify mdio functions

stmmac_mdio_{read|write} and stmmac_mdio_{read|write}_gmac4 are not
enought different for being split.
The only differences between thoses two functions are shift/mask for
addr/reg/clk_csr.

This patch introduce a per platform set of variable for setting thoses
shift/mask and unify mdio read and write functions.

Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: avoid Camelcase naming

This patch simply rename regValue to value, like it was named in other
mdio functions.

Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

irda: w83977af_ir: fix damaged whitespace

As David Miller pointed out for for the previous patch, the whitespace
in some functions looks rather odd. This was caused by commit 6329da5f258a
("obsolete config in kernel source: USE_INTERNAL_TIMER"), which removed
some conditions but did not reindent the code.

This fixes the indentation in the file and removes extraneous whitespace
at the end of the lines and before tabs.

There are many other minor coding style problems in the driver, but I'm
not touching those here.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: cleanup documenation, make it match reality

Fix english in documentation, make documentation match reality, remove
options that were removed from code.

Signed-off-by: Pavel Machek <pavel@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git./linux/kernel/git/davem/net

Couple conflicts resolved here:

1) In the MACB driver, a bug fix to properly initialize the
   RX tail pointer properly overlapped with some changes
   to support variable sized rings.

2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
   overlapping with a reorganization of the driver to support
   ACPI, OF, as well as PCI variants of the chip.

3) In 'net' we had several probe error path bug fixes to the
   stmmac driver, meanwhile a lot of this code was cleaned up
   and reorganized in 'net-next'.

4) The cls_flower classifier obtained a helper function in
   'net-next' called __fl_delete() and this overlapped with
   Daniel Borkamann's bug fix to use RCU for object destruction
   in 'net'.  It also overlapped with Jiri's change to guard
   the rhashtable_remove_fast() call with a check against
   tc_skip_sw().

5) In mlx4, a revert bug fix in 'net' overlapped with some
   unrelated changes in 'net-next'.

6) In geneve, a stale header pointer after pskb_expand_head()
   bug fix in 'net' overlapped with a large reorganization of
   the same code in 'net-next'.  Since the 'net-next' code no
   longer had the bug in question, there was nothing to do
   other than to simply take the 'net-next' hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>

i40e: change message to only appear when extra debug info is wanted

This patch changes an X722 informational message so that it only
appears when extra messages are desired. Without this patch,
on X722 devices, this message appears at load, potentially causing
unnecessary alarm.

Change-ID: I94f7aae15dc5b2723cc9728c630c72538a3e670e
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: replace for memcpy with single memcpy call in ethtool

memcpy replaced with single memcpy call in ethtool.

Change-ID: I3f5bef6bcc593412c56592c6459784db41575a0a
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: set broadcast promiscuous mode for each active VLAN

A previous workaround added to ensure receipt of all broadcast frames
incorrectly set the broadcast promiscuous mode unconditionally
regardless of active VLAN status.

Replace this partial workaround with a complete solution that sets the
broadcast promiscuous filters in i40e_sync_vsi_filters. This new method
sets the promiscuous mode based on when broadcast filters are added or
removed.

I40E_VLAN_ANY will request a broadcast filter for all VLANs, (as we're
in untagged mode) while a broadcast filter on a specific VLAN will only
request broadcast for that VLAN.

Thus, we restore addition of broadcast filter to the array, but we add
special handling for these such that they enable the broadcast
promiscuous mode instead of being sent as regular filters.

The end result is that we will correctly receive all broadcast packets
(even those with a *source* address equal to the broadcast address) but
will not receive packets for which we don't have an active VLAN filter.

Change-ID: I7d0585c5cec1a5bf55bf533b42e5e817d5db6a2d
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Fix for ethtool Supported link modes

This patch fixes the problem where the ethtool Supported link
modes list backplane interfaces on X722 devices for 10GbE with
SFP+ and Cortina retimer. This patch fixes the problem by setting
and using a flag for this particular device since the backplane
interface is only between the internal PHY and the retimer and it
should not be seen by the user as they cannot use it.
Without this patch, the user wrongly thinks that backplane interfaces
are supported on their device when they actually are not.

Change-ID: I3882bc2928431d48a2db03a51a713a1f681a79e9
Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: protect against NULL msix_entries and q_vectors pointers

Update the functions which free msix_entries and q_vectors so that they
are safe against NULL values. This allows calling code to not care
whether these have already been freed when disabling and freeing them.

Change-ID: I31bfd1c0da18023d971b618edc6fb049721f3298
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Pass unknown PHY type for unknown PHYs

The PHY type value for unrecognized PHYs and cables was changed
based on firmware version number. Newer hardware use lower firmware
version numbers and this was causing some PHYs to be identified
as type 0x16 instead of 0xe (unknown).

Without this patch, newer card will incorrectly identify unknown
PHYs and cables.

This change adds hardware type to the check for firmware version
so the PHY type is reported correctly.

Change-ID: I0723cbfd263c76fc73ff1a5275d1639051376c9a
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Remove unreachable code

The code at the end of i40e_read_phy_register_clause22() contained
unreachable code and redundant control statements.

This change removes the unreachable code. And deletes the redundant
goto statement and if statement.

Change-ID: I713032b1585396f40f903cbcfdea987abd874400
Signed-off-by: Henry Tieman <henry.w.tieman@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: check for msix_entries null dereference

It is possible for msix_entries to be freed by a previous suspend/remove
before a VF is closed. This patch fixes the issue by checking for NULL
before dereferencing msix_entries and returning early in the case where
it is NULL within the i40evf_close code path. Without this patch it is
possible to trigger a kernel panic through NULL dereference.

Change-ID: I92a2746e82533a889e25f91578eac9abd0388ae2
Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: Move some i40evf_reset_task code to separate function

The i40evf_reset_task function is a couple hundred lines and it has
a separable block that disables VF. Move that block to a new
i40evf_disable_vf function to shorten i40evf_reset_task a bit.

Signed-off-by: Joe Perches <joe@perches.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: fix panic on SPARC while changing num of desc

On SPARC, writel() should not be used to write directly to memory
address but only to memory mapped I/O address otherwise it causes
data access exception.

Commit 147e81ec75689 ("i40e: Test memory before ethtool alloc
succeeds") introduced a code that uses memory address to fake the HW
tail address and attempt to write to that address using writel()
causes kernel panic on SPARC. The issue is reproduced while changing
number of descriptors using ethtool.

This change resolves the panic by using HW read-only memory mapped
I/O register to fake HW tail address instead memory address.

e.g.
> ethtool -G eth2 tx 2048 rx 2048
i40e 0000:03:00.2 eth2: Changing Tx descriptor count from 512 to 2048.
i40e 0000:03:00.2 eth2: Changing Rx descriptor count from 512 to 2048
sun4v_data_access_exception: ADDR[fff8001f9734a000] CTX[0000]
TYPE[0004], going.
              \|/ ____ \|/
              "@'/ .. \`@"
              /_| \__/ |_\
                 \__U_/
ethtool(3273): Dax [#1]
CPU: 9 PID: 3273 Comm: ethtool Tainted: G            E
4.8.0-linux-net_temp+ #7
task: fff8001f96d7a660 task.stack: fff8001f97348000
TSTATE: 0000009911001601 TPC: 00000000103189e4 TNPC: 00000000103189e8 Y:
00000000    Tainted: G            E
TPC: <i40e_alloc_rx_buffers+0x124/0x260 [i40e]>
g0: fff8001f4eb64000 g1: 00000000000007ff g2: fff8001f9734b92c g3:
00203e0000000000
g4: fff8001f96d7a660 g5: fff8001fa6704000 g6: fff8001f97348000 g7:
0000000000000001
o0: 0006000046706928 o1: 00000000db3e2000 o2: fff8001f00000000 o3:
0000000000002000
o4: 0000000000002000 o5: 0000000000000001 sp: fff8001f9734afc1 ret_pc:
0000000010318a64
RPC: <i40e_alloc_rx_buffers+0x1a4/0x260 [i40e]>
l0: fff8001f4e8bffe0 l1: fff8001f4e8cffe0 l2: 00000000000007ff l3:
00000000ff000000
l4: 0000000000ff0000 l5: 000000000000ff00 l6: 0000000000cda6a8 l7:
0000000000e822f0
i0: fff8001f96380000 i1: 0000000000000000 i2: 00203edb00000000 i3:
0006000046706928
i4: 0000000002086320 i5: 0000000000e82370 i6: fff8001f9734b071 i7:
00000000103062d4
I7: <i40e_set_ringparam+0x3b4/0x540 [i40e]>
Call Trace:
[00000000103062d4] i40e_set_ringparam+0x3b4/0x540 [i40e]
[000000000094e2f8] dev_ethtool+0x898/0xbe0
[0000000000965570] dev_ioctl+0x250/0x300
[0000000000923800] sock_do_ioctl+0x40/0x60
[000000000092427c] sock_ioctl+0x7c/0x280
[00000000005ef040] vfs_ioctl+0x20/0x60
[00000000005ef5d4] do_vfs_ioctl+0x194/0x4c0
[00000000005ef974] SyS_ioctl+0x74/0xa0
[0000000000406214] linux_sparc_syscall+0x34/0x44
Disabling lock debugging due to kernel taint
Caller[00000000103062d4]: i40e_set_ringparam+0x3b4/0x540 [i40e]
Caller[000000000094e2f8]: dev_ethtool+0x898/0xbe0
Caller[0000000000965570]: dev_ioctl+0x250/0x300
Caller[0000000000923800]: sock_do_ioctl+0x40/0x60
Caller[000000000092427c]: sock_ioctl+0x7c/0x280
Caller[00000000005ef040]: vfs_ioctl+0x20/0x60
Caller[00000000005ef5d4]: do_vfs_ioctl+0x194/0x4c0
Caller[00000000005ef974]: SyS_ioctl+0x74/0xa0
Caller[0000000000406214]: linux_sparc_syscall+0x34/0x44
Caller[0000000000107154]: 0x107154
Instruction DUMP: e43620c8
e436204a  c45e2038
<c2a083a0> 82102000
81cfe008  90086001
82102000  81cfe008

Kernel panic - not syncing: Fatal exception

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Add protocols over MCTP to i40e_aq_discover_capabilities

Add logical_id to I40E_AQ_CAP_ID_MNG_MODE capability starting from major
version 2.

Change-ID: Idb29214b172ea5c70cbd45a99e6745c0215af7e4
Signed-off-by: Piotr Raczynski <piotr.raczynski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: fix trivial typo in naming of i40e_sync_filters_subtask

A comment incorrectly referred to i40e_vsi_sync_filters_subtask which
does not actually exist. Reference the correct function instead.

Change-ID: I6bd805c605741ffb6fe34377259bb0d597edfafd
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Add Clause22 implementation

Some external PHYs require Clause22 method for accessing registers.
This patch also adds some defines to support blink led on devices using
10CBaseT PHY.

Change-ID: I868a4326911900f6c89e7e522fda4968b0825f14
Signed-off-by: Michal Kosiarz <michal.kosiarz@intel.com>
Signed-off-by: Matt Jared <matthew.a.jared@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: avoid duplicate private flags definitions

Separate the global private flags and the regular private flags per
interface into two arrays. Future additions of private flags will not
need to be duplicated which may lead to buggy code. Also rename
"i40e_priv_flags_strings_gl" to "i40e_gl_priv_flags_strings" for
clarity, as it reads more naturally.

Change-ID: I68caef3c9954eb7da342d7f9d20f2873186f2758
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: remove second check of VLAN_N_VID in i40e_vlan_rx_add_vid

Replace a check of magic number 4095 with VLAN_N_VID. This
makes it obvious that a later check against VLAN_N_VID is
always true and can be removed.

Change-ID: I28998f127a61a529480ce63d8a07e266f6c63b7b
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: remove error_param_int label from i40e_vc_config_promiscuous_mode_msg

This label is unnecessary, as are jumping to a block that checks aq_ret
and then immediately skipping it and returning. So just jump straight to
the error_param and remove this unnecessary label.

Also use goto error_param even in the last check for style consistency.

Change-ID: If487c7d10c4048e37c594e5eca167693aaed45f6
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: Be much more verbose about what we can and cannot offload

This change makes it so that we are much more robust about defining what we
can and cannot offload. Previously we were performing no checks. This
should bring us up to parity with the i40e PF driver.

In addition the device only supports GSO as long as the MSS is 64 or
greater. We were not checking this so an MSS less than that was resulting
in Tx hangs.

Change-ID: If533553ec92fc6ba694eab6ac81fdaf3004f3592
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Be much more verbose about what we can and cannot offload

This change makes it so that we are much more robust about defining what we
can and cannot offload. Previously we were just checking for the L4 tunnel
header length, however there are other fields we should be verifying as
there are multiple scenarios in which we cannot perform hardware offloads.

In addition the device only supports GSO as long as the MSS is 64 or
greater. We were not checking this so an MSS less than that was resulting
in Tx hangs.

Change-ID: I5e2fd5f3075c73601b4b36327b771c64fcb6c31b
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>

Merge tag 'fixes-for-linus' of git://git./linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Arnd Bergmann:
"This should be the last set of bugfixes for arm-soc in v4.9. None of
  these are critical regressions, but it would be nice to still get them
  merged.

   - On the Juno platform, the idle latency was described wrong, leading
     to suboptimal cpuidle tuning.

   - Also on the same platform, PCI I/O space was set up incorrectly and
     could not work.

   - On the sti platform, a syntactically incorrect DT entry caused
     warnings.

   - The newly added 'gr8' platform has somewhat confusing file names,
     which we rename for consistency"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  arm64: dts: juno: fix cluster sleep state entry latency on all SoC versions
  arm64: dts: juno: Correct PCI IO window
  ARM: dts: STiH407-family: fix i2c nodes
  ARM: gr8: Rename the DTSI and relevant DTS

Merge git://git./linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Lots more phydev and probe error path leaks in various drivers by
    Johan Hovold.

2) Fix race in packet_set_ring(), from Philip Pettersson.

3) Use after free in dccp_invalid_packet(), from Eric Dumazet.

4) Signnedness overflow in SO_{SND,RCV}BUFFORCE, also from Eric
    Dumazet.

5) When tunneling between ipv4 and ipv6 we can be left with the wrong
    skb->protocol value as we enter the IPSEC engine and this causes all
    kinds of problems. Set it before the output path does any
    dst_output() calls, from Eli Cooper.

6) bcmgenet uses wrong device struct pointer in DMA API calls, fix from
    Florian Fainelli.

7) Various netfilter nat bug fixes from FLorian Westphal.

8) Fix memory leak in ipvlan_link_new(), from Gao Feng.

9) Locking fixes, particularly wrt. socket lookups, in l2tp from
    Guillaume Nault.

10) Avoid invoking rhash teardowns in atomic context by moving netlink
    cb->done() dump completion from a worker thread. Fix from Herbert
    Xu.

11) Buffer refcount problems in tun and macvtap on errors, from Jason
    Wang.

12) We don't set Kconfig symbol DEFAULT_TCP_CONG properly when the user
    selects BBR. Fix from Julian Wollrath.

13) Fix deadlock in transmit path on altera TSE driver, from Lino
    Sanfilippo.

14) Fix unbalanced reference counting in dsa_switch_tree, from Nikita
    Yushchenko.

15) tc_tunnel_key needs to be properly exported to userspace via uapi,
    fix from Roi Dayan.

16) rds_tcp_init_net() doesn't unregister notifier in error path, fix
    from Sowmini Varadhan.

17) Stale packet header pointer access after pskb_expand_head() in
    genenve driver, fix from Sabrina Dubroca.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits)
  net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
  geneve: avoid use-after-free of skb->data
  tipc: check minimum bearer MTU
  net: renesas: ravb: unintialized return value
  sh_eth: remove unchecked interrupts for RZ/A1
  net: bcmgenet: Utilize correct struct device for all DMA operations
  NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
  cdc_ether: Fix handling connection notification
  ip6_offload: check segs for NULL in ipv6_gso_segment.
  RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net
  Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
  ipv6: Set skb->protocol properly for local output
  ipv4: Set skb->protocol properly for local output
  packet: fix race condition in packet_set_ring
  net: ethernet: altera: TSE: do not use tx queue lock in tx completion handler
  net: ethernet: altera: TSE: Remove unneeded dma sync for tx buffers
  net: ethernet: stmmac: fix of-node and fixed-link-phydev leaks
  net: ethernet: stmmac: platform: fix outdated function header
  net: ethernet: stmmac: dwmac-meson8b: fix probe error path
  net: ethernet: stmmac: dwmac-generic: fix probe error path
  ...

net: avoid signed overflows for SO_{SND|RCV}BUFFORCE

CAP_NET_ADMIN users should not be allowed to set negative
sk_sndbuf or sk_rcvbuf values, as it can lead to various memory
corruptions, crashes, OOM...

Note that before commit 82981930125a ("net: cleanups in
sock_setsockopt()"), the bug was even more serious, since SO_SNDBUF
and SO_RCVBUF were vulnerable.

This needs to be backported to all known linux kernels.

Again, many thanks to syzkaller team for discovering this gem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

geneve: avoid use-after-free of skb->data

geneve{,6}_build_skb can end up doing a pskb_expand_head(), which
makes the ip_hdr(skb) reference we stashed earlier stale. Since it's
only needed as an argument to ip_tunnel_ecn_encap(), move this
directly in the function call.

Fixes: 08399efc6319 ("geneve: ensure ECN info is handled properly in all tx/rx paths")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: check minimum bearer MTU

Qian Zhang (张谦) reported a potential socket buffer overflow in
tipc_msg_build() which is also known as CVE-2016-8632: due to
insufficient checks, a buffer overflow can occur if MTU is too short for
even tipc headers. As anyone can set device MTU in a user/net namespace,
this issue can be abused by a regular user.

As agreed in the discussion on Ben Hutchings' original patch, we should
check the MTU at the moment a bearer is attached rather than for each
processed packet. We also need to repeat the check when bearer MTU is
adjusted to new device MTU. UDP case also needs a check to avoid
overflow when calculating bearer MTU.

Fixes: b97bf3fd8f6a ("[TIPC] Initial merge")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'linux-can-fixes-for-4.9-20161201' of git://git./linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2016-12-02

this is a pull request for net/master.

There are two patches by Stephane Grosjean, who adds support for the new
PCAN-USB X6 USB interface to the pcan_usb driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: renesas: ravb: unintialized return value

We want to set the other "err" variable here so that we can return it
later. My version of GCC misses this issue but I caught it with a
static checker.

Fixes: 9f70eb339f52 ("net: ethernet: renesas: ravb: fix fixed-link phydev leaks")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Johan Hovold <johan@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'wireless-drivers-next-for-davem-2016-12-01' of git://git./linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for 4.10

Major changes:

rsi

* filter rx frames
* configure tx power
* make it possible to select antenna
* support 802.11d

brcmfmac

* cleanup of scheduled scan code
* support for bcm43341 chipset with different chip id
* support rev6 of PCIe device interface

ath10k

* add spectral scan support for QCA6174 and QCA9377 families
* show used tx bitrate with 10.4 firmware

wil6210

* add power save mode support
* add abort scan functionality
* add support settings retry limit for short frames

bcma

* add Dell Inspiron 3148
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: remove unchecked interrupts for RZ/A1

When streaming a lot of data and the RZ/A1 can't keep up, some status bits
will get set that are not being checked or cleared which cause the
following messages and the Ethernet driver to stop working. This
patch fixes that issue.

irq 21: nobody cared (try booting with the "irqpoll" option)
handlers:
[<c036b71c>] sh_eth_interrupt
Disabling IRQ #21

Fixes: db893473d313a4ad ("sh_eth: Add support for r7s72100")
Signed-off-by: Chris Brandt <chris.brandt@renesas.com>
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bcmgenet: Utilize correct struct device for all DMA operations

__bcmgenet_tx_reclaim() and bcmgenet_free_rx_buffers() are not using the
same struct device during unmap that was used for the map operation,
which makes DMA-API debugging warn about it. Fix this by always using
&priv->pdev->dev throughout the driver, using an identical device
reference for all map/unmap calls.

Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>