Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:16 +0000 (07:58 -0700)]
ice: Add stats and ethtool support
This patch implements a watchdog task to get packet statistics from
the device.
This patch also adds support for the following ethtool operations:
ethtool devname
ethtool -s devname [msglvl N] [msglevel type on|off]
ethtool -g|--show-ring devname
ethtool -G|--set-ring devname [rx N] [tx N]
ethtool -i|--driver devname
ethtool -d|--register-dump devname [raw on|off] [hex on|off] [file name]
ethtool -k|--show-features|--show-offload devname
ethtool -K|--features|--offload devname feature on|off
ethtool -P|--show-permaddr devname
ethtool -S|--statistics devname
ethtool -a|--show-pause devname
ethtool -A|--pause devname [autoneg on|off] [rx on|off] [tx on|off]
ethtool -r|--negotiate devname
CC: Andrew Lunn <andrew@lunn.ch>
CC: Jakub Kicinski <kubakici@wp.pl>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:15 +0000 (07:58 -0700)]
ice: Add support for VLANs and offloads
This patch adds support for VLANs. When a VLAN is created a switch filter
is added to direct the VLAN traffic to the corresponding VSI. When a VLAN
is deleted, the filter is deleted as well.
This patch also adds support for the following hardware offloads.
1) VLAN tag insertion/stripping
2) Receive Side Scaling (RSS)
3) Tx checksum and TCP segmentation
4) Rx checksum
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:14 +0000 (07:58 -0700)]
ice: Implement transmit and NAPI support
This patch implements ice_start_xmit (the handler for ndo_start_xmit) and
related functions. ice_start_xmit ultimately calls ice_tx_map, where the
Tx descriptor is built and posted to the hardware by bumping the ring tail.
This patch also implements ice_napi_poll, which is invoked when there's an
interrupt on the VSI's queues. The interrupt can be due to either a
completed Tx or an Rx event. In case of a completed Tx/Rx event, resources
are reclaimed. Additionally, in case of an Rx event, the skb is fetched
and passed up to the network stack.
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:13 +0000 (07:58 -0700)]
ice: Configure VSIs for Tx/Rx
This patch configures the VSIs to be able to send and receive
packets by doing the following:
1) Initialize flexible parser to extract and include certain
fields in the Rx descriptor.
2) Add Tx queues by programming the Tx queue context (implemented in
ice_vsi_cfg_txqs). Note that adding the queues also enables (starts)
the queues.
3) Add Rx queues by programming Rx queue context (implemented in
ice_vsi_cfg_rxqs). Note that this only adds queues but doesn't start
them. The rings will be started by calling ice_vsi_start_rx_rings on
interface up.
4) Configure interrupts for VSI queues.
5) Implement ice_open and ice_stop.
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:12 +0000 (07:58 -0700)]
ice: Add support for switch filter programming
A VSI needs traffic directed towards it. This is done by programming
filter rules on the switch (embedded vSwitch) element in the hardware,
which connects the VSI to the ingress/egress port.
This patch introduces data structures and functions necessary to add
remove or update switch rules on the switch element. This is a pretty low
level function that is generic enough to add a whole range of filters.
This patch also introduces two top level functions ice_add_mac and
ice_remove mac which through a series of intermediate helper functions
eventually call ice_aq_sw_rules to add/delete simple MAC based filters.
It's worth noting that one invocation of ice_add_mac/ice_remove_mac
is capable of adding/deleting multiple MAC filters.
Also worth noting is the fact that the driver maintains a list of currently
active filters, so every filter addition/removal causes an update to this
list. This is done for a couple of reasons:
1) If two VSIs try to add the same filters, we need to detect it and do
things a little differently (i.e. use VSI lists, described below) as
the same filter can't be added more than once.
2) In the event of a hardware reset we can simply walk through this list
and restore the filters.
VSI Lists:
In a multi-VSI situation, it's possible that multiple VSIs want to add the
same filter rule. For example, two VSIs that want to receive broadcast
traffic would both add a filter for destination MAC ff:ff:ff:ff:ff:ff.
This can become cumbersome to maintain and so this is handled using a
VSI list.
A VSI list is resource that can be allocated in the hardware using the
ice_aq_alloc_free_res admin queue command. Simply put, a VSI list can
be thought of as a subscription list containing a set of VSIs to which
the packet should be forwarded, should the filter match.
For example, if VSI-0 has already added a broadcast filter, and VSI-1
wants to do the same thing, the filter creation flow will detect this,
allocate a VSI list and update the switch rule so that broadcast traffic
will now be forwarded to the VSI list which contains VSI-0 and VSI-1.
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:11 +0000 (07:58 -0700)]
ice: Add support for VSI allocation and deallocation
This patch introduces data structures and functions to alloc/free
VSIs. The driver represents a VSI using the ice_vsi structure.
Some noteworthy points about VSI allocation:
1) A VSI is allocated in the firmware using the "add VSI" admin queue
command (implemented as ice_aq_add_vsi). The firmware returns an
identifier for the allocated VSI. The VSI context is used to program
certain aspects (loopback, queue map, etc.) of the VSI's configuration.
2) A VSI is deleted using the "free VSI" admin queue command (implemented
as ice_aq_free_vsi).
3) The driver represents a VSI using struct ice_vsi. This is allocated
and initialized as part of the ice_vsi_alloc flow, and deallocated
as part of the ice_vsi_delete flow.
4) Once the VSI is created, a netdev is allocated and associated with it.
The VSI's ring and vector related data structures are also allocated
and initialized.
5) A VSI's queues can either be contiguous or scattered. To do this, the
driver maintains a bitmap (vsi->avail_txqs) which is kept in sync with
the firmware's VSI queue allocation imap. If the VSI can't get a
contiguous queue allocation, it will fallback to scatter. This is
implemented in ice_vsi_get_qs which is called as part of the VSI setup
flow. In the release flow, the VSI's queues are released and the bitmap
is updated to reflect this by ice_vsi_put_qs.
CC: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:10 +0000 (07:58 -0700)]
ice: Initialize PF and setup miscellaneous interrupt
This patch continues the initialization flow as follows:
1) Allocate and initialize necessary fields (like vsi, num_alloc_vsi,
irq_tracker, etc) in the ice_pf instance.
2) Setup the miscellaneous interrupt handler. This also known as the
"other interrupt causes" (OIC) handler and is used to handle non
hotpath interrupts (like control queue events, link events,
exceptions, etc.
3) Implement a background task to process admin queue receive (ARQ)
events received by the driver.
CC: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:09 +0000 (07:58 -0700)]
ice: Get MAC/PHY/link info and scheduler topology
This patch adds code to continue the initialization flow as follows:
1) Get PHY/link information and store it
2) Get default scheduler tree topology and store it
3) Get the MAC address associated with the port and store it
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:08 +0000 (07:58 -0700)]
ice: Get switch config, scheduler config and device capabilities
This patch adds to the initialization flow by getting switch
configuration, scheduler configuration and device capabilities.
Switch configuration:
On boot, an L2 switch element is created in the firmware per physical
function. Each physical function is also mapped to a port, to which its
switch element is connected. In other words, this switch can be visualized
as an embedded vSwitch that can connect a physical function's virtual
station interfaces (VSIs) to the egress/ingress port. Egress/ingress
filters will be eventually created and applied on this switch element.
As part of the initialization flow, the driver gets configuration data
from this switch element and stores it.
Scheduler configuration:
The Tx scheduler is a subsystem responsible for setting and enforcing QoS.
As part of the initialization flow, the driver queries and stores the
default scheduler configuration for the given physical function.
Device capabilities:
As part of initialization, the driver has to determine what the device is
capable of (ex. max queues, VSIs, etc). This information is obtained from
the firmware and stored by the driver.
CC: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:07 +0000 (07:58 -0700)]
ice: Start hardware initialization
This patch implements multiple pieces of the initialization flow
as follows:
1) A reset is issued to ensure a clean device state, followed
by initialization of admin queue interface.
2) Once the admin queue interface is up, clear the PF config
and transition the device to non-PXE mode.
3) Get the NVM configuration stored in the device's non-volatile
memory (NVM) using ice_init_nvm.
CC: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:06 +0000 (07:58 -0700)]
ice: Add support for control queues
A control queue is a hardware interface which is used by the driver
to interact with other subsystems (like firmware, PHY, etc.). It is
implemented as a producer-consumer ring. More specifically, an
"admin queue" is a type of control queue used to interact with the
firmware.
This patch introduces data structures and functions to initialize
and teardown control/admin queues. Once the admin queue is initialized,
the driver uses it to get the firmware version.
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Anirudh Venkataramanan [Tue, 20 Mar 2018 14:58:05 +0000 (07:58 -0700)]
ice: Add basic driver framework for Intel(R) E800 Series
This patch adds a basic driver framework for the Intel(R) E800 Ethernet
Series of network devices. There is no functionality right now other than
the ability to load.
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David S. Miller [Mon, 26 Mar 2018 01:27:38 +0000 (21:27 -0400)]
Merge tag 'wireless-drivers-next-for-davem-2018-03-24' of git://git./linux/kernel/git/kvalo/wireless-drivers-next
Kalle Valo says:
====================
wireless-drivers-next patches for 4.17
The biggest changes are the bluetooth related patches to the rsi
driver. It adds a new bluetooth driver which communicates directly
with the wireless driver and the interface is defined in
include/net/rsi_91x.h.
Major changes:
wl1251
* read the MAC address from the NVS file
rtlwifi
* enable mac80211 fast-tx support
mt76
* add capability to select tx/rx antennas
mt7601
* let mac80211 validate rx CCMP Packet Number (PN)
rsi
* bluetooth: add new btrsi driver
* btcoex support with the new btrsi driver
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
kbuild test robot [Fri, 23 Mar 2018 19:47:42 +0000 (03:47 +0800)]
tipc: tipc_disc_addr_trial_msg() can be static
Fixes:
25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash values")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Jon Maloy jon.maloy@ericsson.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Fri, 23 Mar 2018 11:36:15 +0000 (14:36 +0300)]
ibmvnic: Potential NULL dereference in clean_one_tx_pool()
There is an && vs || typo here, which potentially leads to a NULL
dereference.
Fixes:
e9e1e97884b7 ("ibmvnic: Update TX pool cleaning routine")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ganesh Goudar [Fri, 23 Mar 2018 11:35:49 +0000 (17:05 +0530)]
cxgb4: support new ISSI flash parts
Add support for new 32MB and 64MB ISSI (Integrated Silicon
Solution, Inc.) FLASH parts.
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ganesh Goudar [Fri, 23 Mar 2018 11:33:10 +0000 (17:03 +0530)]
cxgb4: depend on firmware event for link status
Depend on the firmware sending us link status changes,
rather than assuming that the link goes down upon L1
configuration.
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arjun Vynipadath [Fri, 23 Mar 2018 10:18:46 +0000 (15:48 +0530)]
cxgb4: copy vlan_id in ndo_get_vf_config
Copy vlan_id to get it displayed in vf info.
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudhar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arjun Vynipadath [Fri, 23 Mar 2018 09:55:10 +0000 (15:25 +0530)]
cxgb4: Setup FW queues before registering netdev
When NetworkManager is enabled, there are chances that interface up
is called even before probe completes. This means we have not yet
allocated the FW sge queues, hence rest of ingress queue allocation
wont be proper. Fix this by calling setup_fw_sge_queues() before
register_netdev().
Fixes:
0fbc81b3ad51 ('chcr/cxgb4i/cxgbit/RDMA/cxgb4: Allocate resources dynamically for all cxgb4 ULD's')
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 26 Mar 2018 00:48:26 +0000 (20:48 -0400)]
Merge branch 'broadcom-Adaptive-interrupt-coalescing'
Florian Fainelli says:
====================
net: broadcom: Adaptive interrupt coalescing
This patch series adds adaptive interrupt coalescing for the Gigabit Ethernet
drivers SYSTEMPORT and GENET.
This really helps lower the interrupt count and system load, as measured by
vmstat for a Gigabit TCP RX session:
SYSTEMPORT:
without:
1 0 0 192188 0 25472 0 0 0 0 122100 38870 1 42 57 0 0
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 1.03 GBytes 884 Mbits/sec
with:
1 0 0 192288 0 25468 0 0 0 0 58806 44401 0 100 0 0 0
[ 5] 0.0-10.0 sec 1.04 GBytes 888 Mbits/sec
GENET:
without:
1 0 0
1170404 0 25420 0 0 0 0 130785 63402 2 85 12 0 0
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 1.04 GBytes 888 Mbits/sec
with:
1 0 0
1170560 0 25420 0 0 0 0 50610 48477 0 100 0 0 0
[ 5] 0.0-10.0 sec 1.05 GBytes 899 Mbits/sec
Please look at the implementation and let me know if you see any problems, this
was largely inspired by bnxt_en.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 23 Mar 2018 01:19:33 +0000 (18:19 -0700)]
net: bcmgenet: Add support for adaptive RX coalescing
Unlike the moder modern SYSTEMPORT hardware, we do not have a
configurable TDMA timeout, which limits us to implement adaptive RX
interrupt coalescing only. We have each of our RX rings implement a
bcmgenet_net_dim structure which holds an interrupt counter, number of
packets, bytes, and a container for a net_dim instance.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Fri, 23 Mar 2018 01:19:32 +0000 (18:19 -0700)]
net: systemport: Implement adaptive interrupt coalescing
Implement support for adaptive RX and TX interrupt coalescing using
net_dim. We have each of our TX ring and our single RX ring implement a
bcm_sysport_net_dim structure which holds an interrupt counter, number
of packets, bytes, and a container for a net_dim instance.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 26 Mar 2018 00:43:42 +0000 (20:43 -0400)]
Merge branch 'mv88e6xxx-module-reloading'
Andrew Lunn says:
====================
Fixes to allow mv88e6xxx module to be reloaded
As reported by Uwe Kleine-König, the interrupt trigger is first
configured by DT and then reconfigured to edge. This results in a
failure on EPROBE_DEFER, or if the module is unloaded and reloaded.
A second crash happens on module reload due to a missing call to the
common IRQ free code when using polled interrupts.
With these fixes in place, it becomes possible to load and unload the
kernel modules a few times without it crashing.
v2: Fix the ü in Künig a couple of times
v3: But the ü should be an ö!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sun, 25 Mar 2018 21:43:15 +0000 (23:43 +0200)]
net: dsa: mv88e6xxx: Call the common IRQ free code
When free'ing the polled IRQs, call the common irq free code.
Otherwise the interrupts are left registered, and when we come to load
the driver a second time, we get an Opps.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sun, 25 Mar 2018 21:43:14 +0000 (23:43 +0200)]
net: dsa: mv88e6xxx: Use the DT IRQ trigger mode
By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING
we override the trigger mode provided in device tree. And the
interrupt is actually active low, which is what all the current device
tree descriptions use.
Suggested-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Tested-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roman Mashak [Sun, 25 Mar 2018 21:20:06 +0000 (17:20 -0400)]
tc-testing: updated police, mirred, skbedit and skbmod with more tests
Added extra test cases for control actions (reclassify, pipe etc.),
cookies, max index value and police args sanity check.
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 25 Mar 2018 21:07:41 +0000 (17:07 -0400)]
Merge branch 'hv_netvsc-Fix-improve-RX-path-error-handling'
Haiyang Zhang says:
====================
hv_netvsc: Fix/improve RX path error handling
Fix the status code returned to the host. Also add range
check for rx packet offset and length.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Thu, 22 Mar 2018 19:01:14 +0000 (12:01 -0700)]
hv_netvsc: Add range checking for rx packet offset and length
This patch adds range checking for rx packet offset and length.
It may only happen if there is a host side bug.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Thu, 22 Mar 2018 19:01:13 +0000 (12:01 -0700)]
hv_netvsc: Fix the return status in RX path
As defined in hyperv_net.h, the NVSP_STAT_SUCCESS is one not zero.
Some functions returns 0 when it actually means NVSP_STAT_SUCCESS.
This patch fixes them.
In netvsc_receive(), it puts the last RNDIS packet's receive status
for all packets in a vmxferpage which may contain multiple RNDIS
packets.
This patch puts NVSP_STAT_FAIL in the receive completion if one of
the packets in a vmxferpage fails.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 25 Mar 2018 20:46:05 +0000 (16:46 -0400)]
Merge branch 'net-permit-skb_segment-on-head_frag-frag_list-skb'
Yonghong Song says:
====================
net: permit skb_segment on head_frag frag_list skb
One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.
...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
...
The triggering input skb has the following properties:
list_skb = skb->frag_list;
skb->nfrags != NULL && skb_headlen(list_skb) != 0
and skb_segment() is not able to handle a frag_list skb
if its headlen (list_skb->len - list_skb->data_len) is not 0.
Patch #1 provides a simple solution to avoid BUG_ON. If
list_skb->head_frag is true, its page-backed frag will
be processed before the list_skb->frags.
Patch #2 provides a test case in test_bpf module which
constructs a skb and calls skb_segment() directly. The test
case is able to trigger the BUG_ON without Patch #1.
The patch has been tested in the following setup:
ipv6_host <-> nat_server <-> ipv4_host
where nat_server has a bpf program doing ipv4<->ipv6
translation and forwarding through clsact hook
bpf_skb_change_proto.
Changelog:
v5 -> v6:
. Added back missed BUG_ON(!nfrags) for zero
skb_headlen(skb) case, plus a couple of
cosmetic changes, from Alexander.
v4 -> v5:
. Replace local variable head_frag with
a static inline function skb_head_frag_to_page_desc
which gets the head_frag on-demand. This makes
code more readable and also does not increase
the stack size, from Alexander.
. Remove the "if(nfrags)" guard for skb_orphan_frags
and skb_zerocopy_clone as I found that they can
handle zero-frag skb (with non-zero skb_headlen(skb))
properly.
. Properly release segment list from skb_segment()
in the test, from Eric.
v3 -> v4:
. Remove dynamic memory allocation and use rewinding
for both index and frag to remove one branch in fast path,
from Alexander.
. Fix a bunch of issues in test_bpf skb_segment() test,
including proper way to allocate skb, proper function
argument for skb_add_rx_frag and not freeint skb, etc.,
from Eric.
v2 -> v3:
. Use starting frag index -1 (instead of 0) to
special process head_frag before other frags in the skb,
from Alexander Duyck.
v1 -> v2:
. Removed never-hit BUG_ON, spotted by Linyu Yuan.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yonghong Song [Wed, 21 Mar 2018 23:31:04 +0000 (16:31 -0700)]
net: bpf: add a test for skb_segment in test_bpf module
Without the previous commit,
"modprobe test_bpf" will have the following errors:
...
[ 98.149165] ------------[ cut here ]------------
[ 98.159362] kernel BUG at net/core/skbuff.c:3667!
[ 98.169756] invalid opcode: 0000 [#1] SMP PTI
[ 98.179370] Modules linked in:
[ 98.179371] test_bpf(+)
...
which triggers the bug the previous commit intends to fix.
The skbs are constructed to mimic what mlx5 may generate.
The packet size/header may not mimic real cases in production. But
the processing flow is similar.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yonghong Song [Wed, 21 Mar 2018 23:31:03 +0000 (16:31 -0700)]
net: permit skb_segment on head_frag frag_list skb
One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
function skb_segment(), line 3667. The bpf program attaches to
clsact ingress, calls bpf_skb_change_proto to change protocol
from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
to send the changed packet out.
3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
3473 netdev_features_t features)
3474 {
3475 struct sk_buff *segs = NULL;
3476 struct sk_buff *tail = NULL;
...
3665 while (pos < offset + len) {
3666 if (i >= nfrags) {
3667 BUG_ON(skb_headlen(list_skb));
3668
3669 i = 0;
3670 nfrags = skb_shinfo(list_skb)->nr_frags;
3671 frag = skb_shinfo(list_skb)->frags;
3672 frag_skb = list_skb;
...
call stack:
...
#1 [
ffff883ffef03558] __crash_kexec at
ffffffff8110c525
#2 [
ffff883ffef03620] crash_kexec at
ffffffff8110d5cc
#3 [
ffff883ffef03640] oops_end at
ffffffff8101d7e7
#4 [
ffff883ffef03668] die at
ffffffff8101deb2
#5 [
ffff883ffef03698] do_trap at
ffffffff8101a700
#6 [
ffff883ffef036e8] do_error_trap at
ffffffff8101abfe
#7 [
ffff883ffef037a0] do_invalid_op at
ffffffff8101acd0
#8 [
ffff883ffef037b0] invalid_op at
ffffffff81a00bab
[exception RIP: skb_segment+3044]
RIP:
ffffffff817e4dd4 RSP:
ffff883ffef03860 RFLAGS:
00010216
RAX:
0000000000002bf6 RBX:
ffff883feb7aaa00 RCX:
0000000000000011
RDX:
ffff883fb87910c0 RSI:
0000000000000011 RDI:
ffff883feb7ab500
RBP:
ffff883ffef03928 R8:
0000000000002ce2 R9:
00000000000027da
R10:
000001ea00000000 R11:
0000000000002d82 R12:
ffff883f90a1ee80
R13:
ffff883fb8791120 R14:
ffff883feb7abc00 R15:
0000000000002ce2
ORIG_RAX:
ffffffffffffffff CS: 0010 SS: 0018
#9 [
ffff883ffef03930] tcp_gso_segment at
ffffffff818713e7
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 25 Mar 2018 20:24:34 +0000 (16:24 -0400)]
Merge branch '10GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
10GbE Intel Wired LAN Driver Updates 2018-03-23
This series contains updates to ixgbe and ixgbevf only.
Paul adds status register reads to reduce a potential race condition
where registers can read 0xFFFFFFFF during a PCI reset, which in turn
causes the driver to remove the adapter. Then fixes an assignment
operation with an "OR" operation.
Shannon Nelson provides several IPsec offload cleanups to ixgbe, as well as a
patch to enable TSO with IPsec offload.
Tony provides the much anticipated XDP support for ixgbevf. Currently,
pass, drop and XDP_TX actions are supported, as well as meta data and
stats reporting.
Björn Töpel tweaks the page counting for XDP_REDIRECT, since a page can
have its reference count decreased via the xdp_do_redirect() call.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 25 Mar 2018 20:18:55 +0000 (16:18 -0400)]
Merge branch 'liquidio-Tx-queue-cleanup'
Intiyaz Basha says:
====================
liquidio: Tx queue cleanup
Moved some common function to octeon_network.h
Removed some unwanted functions and checks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:37:44 +0000 (17:37 -0700)]
liquidio: Renamed txqs_start to start_txqs
For consistency renaming txqs_start to start_txqs
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:37:41 +0000 (17:37 -0700)]
liquidio: Renamed txqs_stop to stop_txqs
For consistency renaming txqs_stop to stop_txqs
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:37:39 +0000 (17:37 -0700)]
liquidio: Renamed txqs_wake to wake_txqs
For consistency renaming txqs_wake to wake_txqs
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:37:36 +0000 (17:37 -0700)]
liquidio: Function call skb_iq for deriving queue from skb
Using skb_iq function for deriving queue from skb
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:37:33 +0000 (17:37 -0700)]
liquidio: Removed one line function wake_q
Removing one line function wake_q
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:37:30 +0000 (17:37 -0700)]
liquidio: Removed one line function stop_q
Removing one line function stop_q
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:37:28 +0000 (17:37 -0700)]
liquidio: Removed netif_is_multiqueue check
Removing checks for netif_is_multiqueue.
Configuring single queue will be a multiqueue netdev with one queues.
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:37:25 +0000 (17:37 -0700)]
liquidio: Removed start_txq function
Removing start_txq function from VF and PF files
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:37:20 +0000 (17:37 -0700)]
liquidio: Removed one line function stop_txq
Removing one line function stop_txq
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:37:17 +0000 (17:37 -0700)]
liquidio: Moved common function skb_iq to to octeon_network.h
Moving common function skb_iq to to octeon_network.h
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:37:07 +0000 (17:37 -0700)]
liquidio: Moved common function txqs_start to octeon_network.h
Moving common function txqs_start to octeon_network.h
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:36:58 +0000 (17:36 -0700)]
liquidio: Moved common function txqs_wake to octeon_network.h
Moving common function txqs_wake to octeon_network.h
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Intiyaz Basha [Sat, 24 Mar 2018 00:36:56 +0000 (17:36 -0700)]
liquidio: Moved common function txqs_stop to octeon_network.h
Moving common function txqs_stop to octeon_network.h
Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Davide Caratti [Fri, 23 Mar 2018 18:31:30 +0000 (19:31 +0100)]
net/sched: act_vlan: declare push_vid with host byte order
use u16 in place of __be16 to suppress the following sparse warnings:
net/sched/act_vlan.c:150:26: warning: incorrect type in assignment (different base types)
net/sched/act_vlan.c:150:26: expected restricted __be16 [usertype] push_vid
net/sched/act_vlan.c:150:26: got unsigned short
net/sched/act_vlan.c:151:21: warning: restricted __be16 degrades to integer
net/sched/act_vlan.c:208:26: warning: incorrect type in assignment (different base types)
net/sched/act_vlan.c:208:26: expected unsigned short [unsigned] [usertype] tcfv_push_vid
net/sched/act_vlan.c:208:26: got restricted __be16 [usertype] push_vid
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Davide Caratti [Fri, 23 Mar 2018 18:09:39 +0000 (19:09 +0100)]
net/sched: remove tcf_idr_cleanup()
tcf_idr_cleanup() is no more used, so remove it.
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Fri, 23 Mar 2018 18:03:58 +0000 (21:03 +0300)]
mlxsw: spectrum_span: Prevent duplicate mirrors
In net commit
8175f7c4736f ("mlxsw: spectrum: Prevent duplicate
mirrors") we prevented the user from mirroring more than once from a
single binding point (port-direction pair).
The fix was essentially reverted in a merge conflict resolution when net
was merged into net-next. Restore it.
Fixes:
03fe2debbb27 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Björn Töpel [Thu, 22 Mar 2018 09:02:36 +0000 (10:02 +0100)]
ixgbe: tweak page counting for XDP_REDIRECT
The current page counting scheme assumes that the reference count
cannot decrease until the received frame is sent to the upper layers
of the networking stack. This assumption does not hold for the
XDP_REDIRECT action, since a page (pointed out by xdp_buff) can have
its reference count decreased via the xdp_do_redirect call.
To work around that, we now start off by a large page count and then
don't allow a refcount less than two.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tony Nguyen [Fri, 16 Mar 2018 22:34:06 +0000 (15:34 -0700)]
ixgbevf: Add XDP queue stats reporting
XDP stats are included in TX stats, however, they are not
reported in TX queue stats since they are setup on different
queues. Add reporting for XDP queue stats to provide
consistency between the total stats and per queue stats.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tony Nguyen [Fri, 16 Mar 2018 22:34:05 +0000 (15:34 -0700)]
ixgbevf: Add support for meta data
Add support for XDP meta data when using build skb.
Based on commit
366a88fe2f40 ("bpf, ixgbe: add meta data support")
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tony Nguyen [Fri, 16 Mar 2018 22:34:04 +0000 (15:34 -0700)]
ixgbevf: Delay tail write for XDP packets
Current XDP implementation hits the tail on every XDP_TX; change the
driver to only hit the tail after packet processing is complete.
Based on
commit
7379f97a4fce ("ixgbe: delay tail write to every 'n' packets")
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tony Nguyen [Fri, 16 Mar 2018 22:34:03 +0000 (15:34 -0700)]
ixgbevf: Add support for XDP_TX action
This implements the XDP_TX action which is modeled on the ixgbe
implementation. However instead of using CPU id to determine which XDP
queue to use, this uses the received RX queue index, which is similar
to i40e. Doing this eliminates the restriction that number of CPUs not
exceed number of XDP queues that ixgbe has.
Also, based on the number of queues available, the number of TX queues
may be reduced when an XDP program is loaded in order to accommodate the
XDP queues.
Based largely on
commit
33fdc82f0883 ("ixgbe: add support for XDP_TX action")
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tony Nguyen [Fri, 16 Mar 2018 22:34:02 +0000 (15:34 -0700)]
ixgbevf: Add XDP support for pass and drop actions
Implement XDP_PASS and XDP_DROP based on the ixgbe implementation.
Based largely on commit
924708081629 ("ixgbe: add XDP support for pass and
drop actions").
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Fri, 16 Mar 2018 18:09:07 +0000 (11:09 -0700)]
ixgbe: enable TSO with IPsec offload
Fix things up to support TSO offload in conjunction
with IPsec hw offload. This raises throughput with
IPsec offload on to nearly line rate.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Fri, 16 Mar 2018 18:09:06 +0000 (11:09 -0700)]
ixgbe: no need for esp trailer if GSO
There is no need to calculate the trailer length if we're doing
a GSO/TSO, as there is no trailer added to the packet data.
Also, don't bother clearing the flags field as it was already
cleared earlier.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Fri, 16 Mar 2018 18:09:05 +0000 (11:09 -0700)]
ixgbe: remove unneeded ipsec test in TX path
Since the ipsec data fields will be zero anyway in the non-ipsec
case, we can remove the conditional jump.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Fri, 16 Mar 2018 18:09:04 +0000 (11:09 -0700)]
ixgbe: no need for ipsec csum feature check
With the patch
commit
f8aa2696b4af ("esp: check the NETIF_F_HW_ESP_TX_CSUM bit before segmenting")
we no longer need to protect ourself from checksum
offload requests on IPsec packets, so we can remove
the check in our .ndo_features_check callback.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Paul Greenwalt [Thu, 15 Mar 2018 12:22:07 +0000 (08:22 -0400)]
ixgbe: fix read-modify-write in x550 phy setup
Replaced an assignment operation with an OR operation.
The variable assignment was overwriting the value read from the PHY
register. The OR operation sets only the intended register bits.
The bits that were being overwritten are reserved, so the assignment had no
functional impact.
Reported by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Paul Greenwalt [Mon, 12 Mar 2018 13:22:55 +0000 (09:22 -0400)]
ixgbe: add status reg reads to ixgbe_check_remove
Add status register reads and delay between reads to ixgbe_check_remove.
Registers can read 0xFFFFFFFF during PCI reset, which causes the driver
to remove the adapter. The additional status register reads can reduce the
chance of this race condition.
If the status register is not 0xFFFFFFFF, then ixgbe_check_remove returns
the value of the register being read.
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mathias Kresin [Thu, 22 Mar 2018 22:31:39 +0000 (23:31 +0100)]
net: phy: intel-xway: add VR9 v1.1 phy ids
The phys embedded into the v1.1 of the VR9 SoC are using different phy
ids. Add the phy ids to use the driver for this VR9 version as well.
Signed-off-by: Mathias Kresin <dev@kresin.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mathias Kresin [Thu, 22 Mar 2018 22:31:38 +0000 (23:31 +0100)]
net: phy: intel-xway: add VR9 version number
The VR9 phy ids are matching only for the SoC version 1.2. Rename the
macros and change the names to take this into account.
Signed-off-by: Mathias Kresin <dev@kresin.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
kbuild test robot [Thu, 22 Mar 2018 21:31:07 +0000 (05:31 +0800)]
net: hns3: hclge_inform_reset_assert_to_vf() can be static
Fixes:
2bfbd35d8ecd ("net: hns3: Changes required in PF mailbox to support VF reset")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Thu, 22 Mar 2018 20:08:49 +0000 (15:08 -0500)]
qed: Use true and false for boolean values
Assign true or false to boolean variables instead of an integer value.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Sudarsana Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Thu, 22 Mar 2018 19:59:27 +0000 (14:59 -0500)]
dpaa_eth: use true and false for boolean values
Assign true or false to boolean variables instead of an integer value.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 23 Mar 2018 17:12:19 +0000 (13:12 -0400)]
Merge branch 'tipc-introduce-128-bit-auto-configurable-node-id'
Jon Maloy says:
====================
tipc: introduce 128-bit auto-configurable node id
We introduce a 128-bit free-format node identity as an alternative to
the legacy <Zone.Cluster.Node> structured 32-bit node address.
We also make configuration of this identity optional; if a bearer is
enabled without a pre-configured node id it will be set automatically
based on the used interface's MAC or IP address.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Thu, 22 Mar 2018 19:42:52 +0000 (20:42 +0100)]
tipc: obtain node identity from interface by default
Selecting and explicitly configuring a TIPC node identity may be
unwanted in some cases.
In this commit we introduce a default setting if the identity has not
been set at the moment the first bearer is enabled. We do this by
using a raw copy of a unique identifier from the used interface: MAC
address in the case of an L2 bearer, IPv4/IPv6 address in the case
of a UDP bearer.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Thu, 22 Mar 2018 19:42:51 +0000 (20:42 +0100)]
tipc: handle collisions of 32-bit node address hash values
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.
We do this as follows:
- We don't apply the generated address immediately to the node, but do
instead initiate a 1 sec trial period to allow other cluster members
to discover and handle such collisions.
- During the trial period the node periodically sends out a new type
of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
to all the other nodes in the cluster.
- When a node is receiving such a message, it must check that the
presented 32-bit identifier either is unused, or was used by the very
same peer in a previous session. In both cases it accepts the request
by not responding to it.
- If it finds that the same node has been up before using a different
address, it responds with a DSC_TRIAL_FAIL_MSG containing that
address.
- If it finds that the address has already been taken by some other
node, it generates a new, unused address and returns it to the
requester.
- During the trial period the requesting node must always be prepared
to accept a failure message, i.e., a message where a peer suggests a
different (or equal) address to the one tried. In those cases it
must apply the suggested value as trial address and restart the trial
period.
This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and
messages, so this protocol addition is completely backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Thu, 22 Mar 2018 19:42:50 +0000 (20:42 +0100)]
tipc: add 128-bit node identifier
We add a 128-bit node identity, as an alternative to the currently used
32-bit node address.
For the sake of compatibility and to minimize message header changes
we retain the existing 32-bit address field. When not set explicitly by
the user, this field will be filled with a hash value generated from the
much longer node identity, and be used as a shorthand value for the
latter.
We permit either the address or the identity to be set by configuration,
but not both, so when the address value is set by a legacy user the
corresponding 128-bit node identity is generated based on the that value.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Thu, 22 Mar 2018 19:42:49 +0000 (20:42 +0100)]
tipc: remove direct accesses to own_addr field in struct tipc_net
As a preparation to changing the addressing structure of TIPC we replace
all direct accesses to the tipc_net::own_addr field with the function
dedicated for this, tipc_own_addr().
There are no changes to program logics in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Thu, 22 Mar 2018 19:42:48 +0000 (20:42 +0100)]
tipc: allow closest-first lookup algorithm when legacy address is configured
The removal of an internal structure of the node address has an unwanted
side effect.
- Currently, if a user is sending an anycast message with destination
domain 0, the tipc_namebl_translate() function will use the 'closest-
first' algorithm to first look for a node local destination, and only
when no such is found, will it resort to the cluster global 'round-
robin' lookup algorithm.
- Current users can get around this, and enforce unconditional use of
global round-robin by indicating a destination as Z.0.0 or Z.C.0.
- This option disappears when we make the node address flat, since the
lookup algorithm has no way of recognizing this case. So, as long as
there are node local destinations, the algorithm will always select
one of those, and there is nothing the sender can do to change this.
We solve this by eliminating the 'closest-first' option, which was never
a good idea anyway, for non-legacy users, but only for those. To
distinguish between legacy users and non-legacy users we introduce a new
flag 'legacy_addr_format' in struct tipc_core, to be set when the user
configures a legacy-style Z.C.N node address. Hence, when a legacy user
indicates a zero lookup domain 'closest-first' is selected, and in all
other cases we use 'round-robin'.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Thu, 22 Mar 2018 19:42:47 +0000 (20:42 +0100)]
tipc: remove restrictions on node address values
Nominally, TIPC organizes network nodes into a three-level network
hierarchy consisting of the levels 'zone', 'cluster' and 'node'. This
hierarchy is reflected in the node address format, - it is sub-divided
into an 8-bit zone id, and 12 bit cluster id, and a 12-bit node id.
However, the 'zone' and 'cluster' levels have in reality never been
fully implemented,and never will be. The result of this has been
that the first 20 bits the node identity structure have been wasted,
and the usable node identity range within a cluster has been limited
to 12 bits. This is starting to become a problem.
In the following commits, we will need to be able to connect between
nodes which are using the whole 32-bit value space of the node address.
We therefore remove the restrictions on which values can be assigned
to node identity, -it is from now on only a 32-bit integer with no
assumed internal structure.
Isolation between clusters is now achieved only by setting different
values for the 'network id' field used during neighbor discovery, in
practice leading to the latter becoming the new cluster identity.
The rules for accepting discovery requests/responses from neighboring
nodes now become:
- If the user is using legacy address format on both peers, reception
of discovery messages is subject to the legacy lookup domain check
in addition to the cluster id check.
- Otherwise, the discovery request/response is always accepted, provided
both peers have the same network id.
This secures backwards compatibility for users who have been using zone
or cluster identities as cluster separators, instead of the intended
'network id'.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Thu, 22 Mar 2018 19:42:46 +0000 (20:42 +0100)]
tipc: some cleanups in the file discover.c
To facilitate the coming changes in the neighbor discovery functionality
we make some renaming and refactoring of that code. The functional changes
in this commit are trivial, e.g., that we move the message sending call in
tipc_disc_timeout() outside the spinlock protected region.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Thu, 22 Mar 2018 19:42:45 +0000 (20:42 +0100)]
tipc: refactor function tipc_enable_bearer()
As a preparation for the next commits we try to reduce the footprint of
the function tipc_enable_bearer(), while hopefully making is simpler to
follow.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Thu, 22 Mar 2018 18:44:56 +0000 (13:44 -0500)]
net/mlx5: Fix use-after-free
_rule_ is being freed and then dereferenced by accessing rule->ctx
Fix this by copying the value returned by PTR_ERR(rule->ctx) into a local
variable for its safe use after freeing _rule_
Addresses-Coverity-ID:
1466041 ("Read from pointer after free")
Fixes:
05564d0ae075 ("net/mlx5: Add flow-steering commands for FPGA IPSec implementation")
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 23 Mar 2018 17:00:47 +0000 (13:00 -0400)]
Merge branch 'pernet-convert-part11'
Kirill Tkhai says:
====================
Converting pernet_operations (part #11)
this series continues to review and to convert pernet_operations
to make them possible to be executed in parallel for several
net namespaces at the same time.
I thought last series was last, but there is one
new pernet_operations came to kernel. This is
udp_sysctl_ops, and here we convert it.
Also, David Howells acked rxrpc_net_ops, so I resend
the patch in case of it should be queued by patchwork:
https://www.spinics.net/lists/netdev/msg490678.html
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kirill Tkhai [Thu, 22 Mar 2018 18:34:55 +0000 (21:34 +0300)]
net: Convert rxrpc_net_ops
These pernet_operations modifies rxrpc_net_id-pointed
per-net entities. There is external link to AF_RXRPC
in fs/afs/Kconfig, but it seems there is no other
pernet_operations interested in that per-net entities.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kirill Tkhai [Thu, 22 Mar 2018 18:34:46 +0000 (21:34 +0300)]
net: Convert udp_sysctl_ops
These pernet_operations just initialize udp4 defaults.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Thu, 22 Mar 2018 18:14:47 +0000 (20:14 +0200)]
mlxsw: spectrum_span: Fix initialization of struct mlxsw_sp_span_parms
Since the first element of struct mlxsw_sp_span_parms is a pointer,
to zero-initialize this structure the correct notation is not = {0}, but
rather = {NULL}, as reported by sparse.
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Davide Caratti [Thu, 22 Mar 2018 18:12:19 +0000 (19:12 +0100)]
tc-testing: add selftests for 'bpf' action
Test d959: Add cBPF action with valid bytecode
Test f84a: Add cBPF action with invalid bytecode
Test e939: Add eBPF action with valid object-file
Test 282d: Add eBPF action with invalid object-file
Test d819: Replace cBPF bytecode and action control
Test 6ae3: Delete cBPF action
Test 3e0d: List cBPF actions
Test 55ce: Flush BPF actions
Test ccc3: Add cBPF action with duplicate index
Test 89c7: Add cBPF action with invalid index
Test 7ab9: Add cBPF action with cookie
Changes since v1:
- use index=2^32-1 in test ccc3, add tests 7a89, 89c7 (thanks Roman Mashak)
- added test 282d
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov [Fri, 23 Mar 2018 16:27:06 +0000 (18:27 +0200)]
net: bridge: fix direct access to bridge vlan_enabled and use helper
We need to use br_vlan_enabled() helper otherwise we'll break builds
without bridge vlans:
net/bridge//br_if.c: In function ‘br_mtu’:
net/bridge//br_if.c:458:8: error: ‘const struct net_bridge’ has no
member named ‘vlan_enabled’
if (br->vlan_enabled)
^
net/bridge//br_if.c:462:1: warning: control reaches end of non-void
function [-Wreturn-type]
}
^
scripts/Makefile.build:324: recipe for target 'net/bridge//br_if.o'
failed
Fixes:
419d14af9e07 ("bridge: Allow max MTU when multiple VLANs present")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 23 Mar 2018 16:25:55 +0000 (12:25 -0400)]
Merge branch 'tls-RX'
Dave Watson says:
====================
TLS Rx
TLS tcp socket RX implementation, to match existing TX code.
This patchset completes the software TLS socket, allowing full
bi-directional communication over TLS using normal socket syscalls,
after the handshake has been done in userspace. Only the symmetric
encryption is done in the kernel.
This allows usage of TLS sockets from within the kernel (for example
with network block device, or from bpf). Performance can be better
than userspace, with appropriate crypto routines [1].
sk->sk_socket->ops must be overridden to implement splice_read and
poll, but otherwise the interface & implementation match TX closely.
strparser is used to parse TLS framing on receive.
There are Openssl RX patches that work with this interface [2], as
well as a testing tool using the socket interface directly (without
cmsg support) [3]. An example tcp socket setup is:
// Normal tcp socket connect/accept, and TLS handshake
// using any TLS library.
setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
struct tls12_crypto_info_aes_gcm_128 crypto_info_rx;
// Fill in crypto_info based on negotiated keys.
setsockopt(sock, SOL_TLS, TLS_RX, &crypto_info, sizeof(crypto_info_rx));
// You can optionally TLX_TX as well.
char buffer[16384];
int ret = recv(sock, buffer, 16384);
// cmsg can be received using recvmsg and a msg_control
// of type TLS_GET_RECORD_TYPE will be set.
V1 -> V2
* For too-small framing errors, return EBADMSG, to match openssl error
code semantics. Docs and commit logs about this also updated.
RFC -> V1
* Refactor 'tx' variable names to drop tx
* Error return codes changed per discussion
* Only call skb_cow_data based on in-place decryption,
drop unnecessary frag list check.
[1] Recent crypto patchset to remove copies, resulting in optimally
zero copies vs. userspace's one, vs. previous kernel's two.
https://marc.info/?l=linux-crypto-vger&m=
151931242406416&w=2
[2] https://github.com/Mellanox/openssl/commits/tls_rx2
[3] https://github.com/ktls/af_ktls-tool/tree/RX
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Watson [Thu, 22 Mar 2018 17:10:44 +0000 (10:10 -0700)]
tls: Add receive path documentation
Add documentation on rx path setup and cmsg interface.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Watson [Thu, 22 Mar 2018 17:10:35 +0000 (10:10 -0700)]
tls: RX path for ktls
Add rx path for tls software implementation.
recvmsg, splice_read, and poll implemented.
An additional sockopt TLS_RX is added, with the same interface as
TLS_TX. Either TLX_RX or TLX_TX may be provided separately, or
together (with two different setsockopt calls with appropriate keys).
Control messages are passed via CMSG in a similar way to transmit.
If no cmsg buffer is passed, then only application data records
will be passed to userspace, and EIO is returned for other types of
alerts.
EBADMSG is passed for decryption errors, and EMSGSIZE is passed for
framing too big, and EBADMSG for framing too small (matching openssl
semantics). EINVAL is returned for TLS versions that do not match the
original setsockopt call. All are unrecoverable.
strparser is used to parse TLS framing. Decryption is done directly
in to userspace buffers if they are large enough to support it, otherwise
sk_cow_data is called (similar to ipsec), and buffers are decrypted in
place and copied. splice_read always decrypts in place, since no
buffers are provided to decrypt in to.
sk_poll is overridden, and only returns POLLIN if a full TLS message is
received. Otherwise we wait for strparser to finish reading a full frame.
Actual decryption is only done during recvmsg or splice_read calls.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Watson [Thu, 22 Mar 2018 17:10:26 +0000 (10:10 -0700)]
tls: Refactor variable names
Several config variables are prefixed with tx, drop the prefix
since these will be used for both tx and rx.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Watson [Thu, 22 Mar 2018 17:10:15 +0000 (10:10 -0700)]
tls: Pass error code explicitly to tls_err_abort
Pass EBADMSG explicitly to tls_err_abort. Receive path will
pass additional codes - EMSGSIZE if framing is larger than max
TLS record size, EINVAL if TLS version mismatch.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Watson [Thu, 22 Mar 2018 17:10:06 +0000 (10:10 -0700)]
tls: Move cipher info to a separate struct
Separate tx crypto parameters to a separate cipher_context struct.
The same parameters will be used for rx using the same struct.
tls_advance_record_sn is modified to only take the cipher info.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Watson [Thu, 22 Mar 2018 17:09:53 +0000 (10:09 -0700)]
tls: Generalize zerocopy_from_iter
Refactor zerocopy_from_iter to take arguments for pages and size,
such that it can be used for both tx and rx. RX will also support
zerocopy direct to output iter, as long as the full message can
be copied at once (a large enough userspace buffer was provided).
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher [Thu, 22 Mar 2018 17:08:48 +0000 (10:08 -0700)]
intel: add SPDX identifiers to all the Intel drivers
Add the SPDX identifiers to all the Intel wired LAN driver files, as
outlined in Documentation/process/license-rules.rst.
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Chas Williams [Thu, 22 Mar 2018 15:34:06 +0000 (11:34 -0400)]
bridge: Allow max MTU when multiple VLANs present
If the bridge is allowing multiple VLANs, some VLANs may have
different MTUs. Instead of choosing the minimum MTU for the
bridge interface, choose the maximum MTU of the bridge members.
With this the user only needs to set a larger MTU on the member
ports that are participating in the large MTU VLANS.
Signed-off-by: Chas Williams <3chas3@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jay Vosburgh [Thu, 22 Mar 2018 14:42:41 +0000 (14:42 +0000)]
virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
The operstate update logic will leave an interface in the
default UNKNOWN operstate if the interface carrier state never changes
from the default carrier up state set at creation. This includes the
case of an explicit call to netif_carrier_on, as the carrier on to on
transition has no effect on operstate.
This affects virtio-net for the case that the virtio peer does
not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
updates). Without this feature, the virtio specification states that
"the link should be assumed active," so, logically, the operstate should
be UP instead of UNKNOWN. This has impact on user space applications
that use the operstate to make availability decisions for the interface.
Resolve this by changing the virtio probe logic slightly to call
netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
cases, and then the existing call to netif_carrier_on for the "without"
case will cause an operstate transition.
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Fri, 23 Mar 2018 15:09:48 +0000 (08:09 -0700)]
devlink: Remove top_hierarchy arg for DEVLINK disabled path
Earlier change missed the path where CONFIG_NET_DEVLINK is disabled.
Thanks to Jiri for spotting.
Fixes:
145307460ba9 ("devlink: Remove top_hierarchy arg to devlink_resource_register")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 23 Mar 2018 15:24:57 +0000 (11:24 -0400)]
Merge git://git./linux/kernel/git/davem/net
Fun set of conflict resolutions here...
For the mac80211 stuff, these were fortunately just parallel
adds. Trivially resolved.
In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.
In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.
The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.
The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:
====================
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit
42cea83f9524
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit
b5ca15ad7e61 (IB/mlx5: Add proper representors support)
add as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 23 Mar 2018 01:48:43 +0000 (18:48 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"13 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, thp: do not cause memcg oom for thp
mm/vmscan: wake up flushers for legacy cgroups too
Revert "mm: page_alloc: skip over regions of invalid pfns where possible"
mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
mm/thp: do not wait for lock_page() in deferred_split_scan()
mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
x86/mm: implement free pmd/pte page interfaces
mm/vmalloc: add interfaces to free unmapped page table
h8300: remove extraneous __BIG_ENDIAN definition
hugetlbfs: check for pgoff value overflow
lockdep: fix fs_reclaim warning
MAINTAINERS: update Mark Fasheh's e-mail
mm/mempolicy.c: avoid use uninitialized preferred_node
Linus Torvalds [Fri, 23 Mar 2018 01:37:49 +0000 (18:37 -0700)]
Merge branch 'libnvdimm-fixes' of git://git./linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"Two regression fixes, two bug fixes for older issues, two fixes for
new functionality added this cycle that have userspace ABI concerns,
and a small cleanup. These have appeared in a linux-next release and
have a build success report from the 0day robot.
* The 4.16 rework of altmap handling led to some configurations
leaking page table allocations due to freeing from the altmap
reservation rather than the page allocator.
The impact without the fix is leaked memory and a WARN() message
when tearing down libnvdimm namespaces. The rework also missed a
place where error handling code needed to be removed that can lead
to a crash if devm_memremap_pages() fails.
* acpi_map_pxm_to_node() had a latent bug whereby it could
misidentify the closest online node to a given proximity domain.
* Block integrity handling was reworked several kernels back to allow
calling add_disk() after setting up the integrity profile.
The nd_btt and nd_blk drivers are just now catching up to fix
automatic partition detection at driver load time.
* The new peristence_domain attribute, a platform indicator of
whether cpu caches are powerfail protected for example, is meant to
be a single value enum and not a set of flags.
This oversight was caught while reviewing new userspace code in
libndctl to communicate the attribute.
Fix this new enabling up so that we are not stuck with an unwanted
userspace ABI"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm, nfit: fix persistence domain reporting
libnvdimm, region: hide persistence_domain when unknown
acpi, numa: fix pxm to online numa node associations
x86, memremap: fix altmap accounting at free
libnvdimm: remove redundant assignment to pointer 'dev'
libnvdimm, {btt, blk}: do integrity setup before add_disk()
kernel/memremap: Remove stale devres_free() call
Linus Torvalds [Fri, 23 Mar 2018 00:37:44 +0000 (17:37 -0700)]
Merge tag 'drm-fixes-for-v4.16-rc7' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"A bunch of fixes all over the place (core, i915, amdgpu, imx, sun4i,
ast, tegra, vmwgfx), nothing too serious or worrying at this stage.
- one uapi fix to stop multi-planar images with getfb
- Sun4i error path and clock fixes
- udl driver mmap offset fix
- i915 DP MST and GPU reset fixes
- vmwgfx mutex and black screen fixes
- imx array underflow fix and vblank fix
- amdgpu: display fixes
- exynos devicetree fix
- ast mode fix"
* tag 'drm-fixes-for-v4.16-rc7' of git://people.freedesktop.org/~airlied/linux: (29 commits)
drm/ast: Fixed 1280x800 Display Issue
drm: udl: Properly check framebuffer mmap offsets
drm/i915: Specify which engines to reset following semaphore/event lockups
drm/vmwgfx: Fix a destoy-while-held mutex problem.
drm/vmwgfx: Fix black screen and device errors when running without fbdev
drm: Reject getfb for multi-plane framebuffers
drm/amd/display: Add one to EDID's audio channel count when passing to DC
drm/amd/display: We shouldn't set format_default on plane as atomic driver
drm/amd/display: Fix FMT truncation programming
drm/amd/display: Allow truncation to 10 bits
drm/sun4i: hdmi: Fix another error handling path in 'sun4i_hdmi_bind()'
drm/sun4i: hdmi: Fix an error handling path in 'sun4i_hdmi_bind()'
drm/i915/dp: Write to SET_POWER dpcd to enable MST hub.
drm/amd/display: fix dereferencing possible ERR_PTR()
drm/amd/display: Refine disable VGA
drm/tegra: Shutdown on driver unbind
drm/tegra: dsi: Don't disable regulator on ->exit()
drm/tegra: dc: Detach IOMMU group from domain only once
dt-bindings: exynos: Document #sound-dai-cells property of the HDMI node
drm/imx: move arming of the vblank event to atomic_flush
...
David Rientjes [Thu, 22 Mar 2018 23:17:45 +0000 (16:17 -0700)]
mm, thp: do not cause memcg oom for thp
Commit
2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
madvised allocations") changed the page allocator to no longer detect
thp allocations based on __GFP_NORETRY.
It did not, however, modify the mem cgroup try_charge() path to avoid
oom kill for either khugepaged collapsing or thp faulting. It is never
expected to oom kill a process to allocate a hugepage for thp; reclaim
is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
(and charging) should fallback instead of oom killing processes.
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
Fixes:
2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Ryabinin [Thu, 22 Mar 2018 23:17:42 +0000 (16:17 -0700)]
mm/vmscan: wake up flushers for legacy cgroups too
Commit
726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
pages on the LRU") added flusher invocation to shrink_inactive_list()
when many dirty pages on the LRU are encountered.
However, shrink_inactive_list() doesn't wake up flushers for legacy
cgroup reclaim, so the next commit
bbef938429f5 ("mm: vmscan: remove old
flusher wakeup from direct reclaim path") removed the only source of
flusher's wake up in legacy mem cgroup reclaim path.
This leads to premature OOM if there is too many dirty pages in cgroup:
# mkdir /sys/fs/cgroup/memory/test
# echo $$ > /sys/fs/cgroup/memory/test/tasks
# echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
# dd if=/dev/zero of=tmp_file bs=1M count=100
Killed
dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
Call Trace:
dump_stack+0x46/0x65
dump_header+0x6b/0x2ac
oom_kill_process+0x21c/0x4a0
out_of_memory+0x2a5/0x4b0
mem_cgroup_out_of_memory+0x3b/0x60
mem_cgroup_oom_synchronize+0x2ed/0x330
pagefault_out_of_memory+0x24/0x54
__do_page_fault+0x521/0x540
page_fault+0x45/0x50
Task in /test killed as a result of limit of /test
memory: usage 51200kB, limit 51200kB, failcnt 73
memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Wake up flushers in legacy cgroup reclaim too.
Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
Fixes:
bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>