David Ahern [Wed, 20 Mar 2019 16:18:59 +0000 (09:18 -0700)]
ipv4: Allow amount of dirty memory from fib resizing to be controllable
fib_trie implementation calls synchronize_rcu when a certain amount of
pages are dirty from freed entries. The number of pages was determined
experimentally in 2009 (commit
c3059477fce2d).
At the current setting, synchronize_rcu is called often -- 51 times in a
second in one test with an average of an 8 msec delay adding a fib entry.
The total impact is a lot of slow down modifying the fib. This is seen
in the output of 'time' - the difference between real time and sys+user.
For example, using 720,022 single path routes and 'ip -batch'[1]:
$ time ./ip -batch ipv4/routes-1-hops
real 0m14.214s
user 0m2.513s
sys 0m6.783s
So roughly 35% of the actual time to install the routes is from the ip
command getting scheduled out, most notably due to synchronize_rcu (this
is observed using 'perf sched timehist').
This patch makes the amount of dirty memory configurable between 64k where
the synchronize_rcu is called often (small, low end systems that are memory
sensitive) to 64M where synchronize_rcu is called rarely during a large
FIB change (for high end systems with lots of memory). The default is 512kB
which corresponds to the current setting of 128 pages with a 4kB page size.
As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
in a second blocking for up to 30 msec in a single instance, and a total
of almost 100 msec across the 4 calls in the second. The trade off is
allowing FIB entries to consume more memory in a given time window but
but with much better fib insertion rates (~30% increase in prefixes/sec).
With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
file runs in:
$ time ./ip -batch ipv4/routes-1-hops
real 0m9.692s
user 0m2.491s
sys 0m6.769s
So the dead time is reduced to about 1/2 second or <5% of the real time.
[1] 'ip' modified to not request ACK messages which improves route
insertion times by about 20%
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kirill Tkhai [Wed, 20 Mar 2019 09:16:53 +0000 (12:16 +0300)]
tun: Remove unused first parameter of tun_get_iff()
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kirill Tkhai [Wed, 20 Mar 2019 09:16:42 +0000 (12:16 +0300)]
tun: Add ioctl() TUNGETDEVNETNS cmd to allow obtaining real net ns of tun device
In commit
f2780d6d7475 "tun: Add ioctl() SIOCGSKNS cmd to allow
obtaining net ns of tun device" it was missed that tun may change
its net ns, while net ns of socket remains the same as it was
created initially. SIOCGSKNS returns net ns of socket, so it is
not suitable for obtaining net ns of device.
We may have two tun devices with the same names in two net ns,
and in this case it's not possible to determ, which of them
fd refers to (TUNGETIFF will return the same name).
This patch adds new ioctl() cmd for obtaining net ns of a device.
Reported-by: Harald Albrecht <harald.albrecht@gmx.net>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 21 Mar 2019 17:16:54 +0000 (10:16 -0700)]
Merge branch 'ipv6-Change-addrconf_f6i_alloc-to-use-ip6_route_info_create'
David Ahern says:
====================
ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create
addrconf_f6i_alloc is the last caller of fib6_info_alloc besides
ip6_route_info_create. There really is no good reason for it do
its own fib6_info initialization, so convert it to call
ip6_route_info_create.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Thu, 21 Mar 2019 12:21:35 +0000 (05:21 -0700)]
ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create
Change addrconf_f6i_alloc to generate a fib6_config and call
ip6_route_info_create. addrconf_f6i_alloc is the last caller to
fib6_info_alloc besides ip6_route_info_create, and there is no
reason for it to do its own initialization on a fib6_info.
Host routes need to be created even if the device is down, so add a
new flag, fc_ignore_dev_down, to fib6_config and update fib6_nh_init
to not error out if device is not up.
Notes on the conversion:
- ip_fib_metrics_init is the same as fib6_config has fc_mx set to NULL
and fc_mx_len set to 0
- dst_nocount is handled by the RTF_ADDRCONF flag
- dst_host is handled by fc_dst_len = 128
nh_gw does not get set after the conversion to ip6_route_info_create
but it should not be set in addrconf_f6i_alloc since this is a host
route not a gateway route.
Everything else is a straight forward map between fib6_info and
fib6_config.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Thu, 21 Mar 2019 12:21:34 +0000 (05:21 -0700)]
ipv6: Move setting default metric for routes
ip6_route_info_create is a low level function for ensuring fc_metric is
set. Move the check and default setting to the 2 locations that do not
already set fc_metric before calling ip6_route_info_create. This is
required for the next patch which moves addrconf allocations to
ip6_route_info_create and want the metric for host routes to be 0.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vakul Garg [Thu, 21 Mar 2019 11:59:57 +0000 (11:59 +0000)]
net/tls: Replace kfree_skb() with consume_skb()
To free the skb in normal course of processing, consume_skb() should be
used. Only for failure paths, skb_free() is intended to be used.
https://www.kernel.org/doc/htmldocs/networking/API-consume-skb.html
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hoang Le [Thu, 21 Mar 2019 10:25:18 +0000 (17:25 +0700)]
tipc: fix a null pointer deref
In commit
c55c8edafa91 ("tipc: smooth change between replicast and
broadcast") we introduced new method to eliminate the risk of message
reordering that happen in between different nodes.
Unfortunately, we forgot checking at receiving side to ignore intra node.
We fix this by checking and returning if arrived message from intra node.
syzbot report:
==================================================================
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 7820 Comm: syz-executor418 Not tainted 5.0.0+ #61
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
RIP: 0010:tipc_mcast_filter_msg+0x21b/0x13d0 net/tipc/bcast.c:782
Code: 45 c0 0f 84 39 06 00 00 48 89 5d 98 e8 ce ab a5 fa 49 8d bc
24 c8 00 00 00 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03
<80> 3c 08 00 0f 85 9a 0e 00 00 49 8b 9c 24 c8 00 00 00 48 be 00 00
RSP: 0018:
ffff8880959defc8 EFLAGS:
00010202
RAX:
0000000000000019 RBX:
ffff888081258a48 RCX:
dffffc0000000000
RDX:
0000000000000000 RSI:
ffffffff86cab862 RDI:
00000000000000c8
RBP:
ffff8880959df030 R08:
ffff8880813d0200 R09:
ffffed1015d05bc8
R10:
ffffed1015d05bc7 R11:
ffff8880ae82de3b R12:
0000000000000000
R13:
000000000000002c R14:
0000000000000000 R15:
ffff888081258a48
FS:
000000000106a880(0000) GS:
ffff8880ae800000(0000)
knlGS:
0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
CR2:
0000000020001cc0 CR3:
0000000094a20000 CR4:
00000000001406f0
DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
DR3:
0000000000000000 DR6:
00000000fffe0ff0 DR7:
0000000000000400
Call Trace:
tipc_sk_filter_rcv+0x182d/0x34f0 net/tipc/socket.c:2168
tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
tipc_sk_mcast_rcv+0x724/0x1020 net/tipc/socket.c:1209
tipc_mcast_xmit+0x7fe/0x1200 net/tipc/bcast.c:410
tipc_sendmcast+0xb36/0xfc0 net/tipc/socket.c:820
__tipc_sendmsg+0x10df/0x18d0 net/tipc/socket.c:1358
tipc_sendmsg+0x53/0x80 net/tipc/socket.c:1291
sock_sendmsg_nosec net/socket.c:651 [inline]
sock_sendmsg+0xdd/0x130 net/socket.c:661
___sys_sendmsg+0x806/0x930 net/socket.c:2260
__sys_sendmsg+0x105/0x1d0 net/socket.c:2298
__do_sys_sendmsg net/socket.c:2307 [inline]
__se_sys_sendmsg net/socket.c:2305 [inline]
__x64_sys_sendmsg+0x78/0xb0 net/socket.c:2305
do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4401c9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8
48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05
<48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:
00007ffd887fa9d8 EFLAGS:
00000246 ORIG_RAX:
000000000000002e
RAX:
ffffffffffffffda RBX:
00000000004002c8 RCX:
00000000004401c9
RDX:
0000000000000000 RSI:
0000000020002140 RDI:
0000000000000003
RBP:
00000000006ca018 R08:
0000000000000000 R09:
00000000004002c8
R10:
0000000000000000 R11:
0000000000000246 R12:
0000000000401a50
R13:
0000000000401ae0 R14:
0000000000000000 R15:
0000000000000000
Modules linked in:
---[ end trace
ba79875754e1708f ]---
Reported-by: syzbot+be4bdf2cc3e85e952c50@syzkaller.appspotmail.com
Fixes:
c55c8eda ("tipc: smooth change between replicast and broadcast")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hoang Le [Thu, 21 Mar 2019 10:25:17 +0000 (17:25 +0700)]
tipc: fix use-after-free in tipc_sk_filter_rcv
skb free-ed in:
1/ condition 1: tipc_sk_filter_rcv -> tipc_sk_proto_rcv
2/ condition 2: tipc_sk_filter_rcv -> tipc_group_filter_msg
This leads to a "use-after-free" access in the next condition.
We fix this by intializing the variable at declaration, then it is safe
to check this variable to continue processing if condition matches.
syzbot report:
==================================================================
BUG: KASAN: use-after-free in tipc_sk_filter_rcv+0x2166/0x34f0
net/tipc/socket.c:2167
Read of size 4 at addr
ffff88808ea58534 by task kworker/u4:0/7
CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.0.0+ #61
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Workqueue: tipc_send tipc_conn_send_work
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
__asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131
tipc_sk_filter_rcv+0x2166/0x34f0 net/tipc/socket.c:2167
tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
tipc_topsrv_kern_evt+0x3b7/0x580 net/tipc/topsrv.c:610
tipc_conn_send_to_sock+0x43e/0x5f0 net/tipc/topsrv.c:283
tipc_conn_send_work+0x65/0x80 net/tipc/topsrv.c:303
process_one_work+0x98e/0x1790 kernel/workqueue.c:2269
worker_thread+0x98/0xe40 kernel/workqueue.c:2415
kthread+0x357/0x430 kernel/kthread.c:253
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
Reported-by: syzbot+e863893591cc7a622e40@syzkaller.appspotmail.com
Fixes:
c55c8eda ("tipc: smooth change between replicast and broadcast")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Suryaputra [Wed, 20 Mar 2019 14:29:27 +0000 (10:29 -0400)]
ipv6: Add icmp_echo_ignore_anycast for ICMPv6
In addition to icmp_echo_ignore_multicast, there is a need to also
prevent responding to pings to anycast addresses for security.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Wed, 20 Mar 2019 13:48:06 +0000 (21:48 +0800)]
net: isdn: Make isdn_ppp_mp_discard and isdn_ppp_mp_reassembly static
Fix sparse warnings:
drivers/isdn/i4l/isdn_ppp.c:1891:16: warning:
symbol 'isdn_ppp_mp_discard' was not declared. Should it be static?
drivers/isdn/i4l/isdn_ppp.c:1903:6: warning:
symbol 'isdn_ppp_mp_reassembly' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Wed, 20 Mar 2019 13:37:13 +0000 (21:37 +0800)]
net: hns3: Make hclge_destroy_cmd_queue static
Fix sparse warning:
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c:414:6:
warning: symbol 'hclge_destroy_cmd_queue' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 20 Mar 2019 18:18:55 +0000 (11:18 -0700)]
Merge branch 'net-refactor-ndo_select_queue'
Paolo Abeni says:
====================
net: refactor ndo_select_queue()
Currently, on most devices implementing ndo_select_queue(), we get 2
indirect calls per xmit packet, at least in some scenarios.
We can avoid one of such indirect calls refactoring the ndo_select_queue()
usage so that we don't need anymore the 'fallback' argument.
The first patch renames a helper used later as a public API, the second one
changes the af packet implementation so that it uses the common infrastructure
to select the xmit queue, and the second patch drops the now unneeded argument
from ndo_select_queue().
Alternatively we could use the INDIRECT_CALL_WRAPPER infrastructure to avoid
the fallback indirect call in the common case, but this solution allows also
for some code cleanup.
v1 -> v2:
- renamed select queue helpers, as per Eric's and David's suggestions
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Wed, 20 Mar 2019 10:02:06 +0000 (11:02 +0100)]
net: remove 'fallback' argument from dev->ndo_select_queue()
After the previous patch, all the callers of ndo_select_queue()
provide as a 'fallback' argument netdev_pick_tx.
The only exceptions are nested calls to ndo_select_queue(),
which pass down the 'fallback' available in the current scope
- still netdev_pick_tx.
We can drop such argument and replace fallback() invocation with
netdev_pick_tx(). This avoids an indirect call per xmit packet
in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen)
with device drivers implementing such ndo. It also clean the code
a bit.
Tested with ixgbe and CONFIG_FCOE=m
With pktgen using queue xmit:
threads vanilla patched
(kpps) (kpps)
1 2334 2428
2 4166 4278
4 7895 8100
v1 -> v2:
- rebased after helper's name change
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Wed, 20 Mar 2019 10:02:05 +0000 (11:02 +0100)]
packet: rework packet_pick_tx_queue() to use common code selection
Currently packet_pick_tx_queue() is the only caller of
ndo_select_queue() using a fallback argument other than
netdev_pick_tx.
Leveraging rx queue, we can obtain a similar queue selection
behavior using core helpers. After this change, ndo_select_queue()
is always invoked with netdev_pick_tx() as fallback.
We can change ndo_select_queue() signature in a followup patch,
dropping an indirect call per transmitted packet in some scenarios
(e.g. TCP syn and XDP generic xmit)
This changes slightly how af packet queue selection happens when
PACKET_QDISC_BYPASS is set. It's now more similar to plan dev_queue_xmit()
tacking in account both XPS and TC mapping.
v1 -> v2:
- rebased after helper name change
RFC -> v1:
- initialize sender_cpu to the expected value
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Wed, 20 Mar 2019 10:02:04 +0000 (11:02 +0100)]
net: dev: rename queue selection helpers.
With the following patches, we are going to use __netdev_pick_tx() in
many modules. Rename it to netdev_pick_tx(), to make it clear is
a public API.
Also rename the existing netdev_pick_tx() to netdev_core_pick_tx(),
to avoid name clashes.
Suggested-by: Eric Dumazet <edumazet@google.com>
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 20 Mar 2019 18:12:50 +0000 (11:12 -0700)]
Merge branch 'qed-next'
Sudarsana Reddy Kalluru says:
====================
qed* enhancements.
The patch series adds couple of enhancements for qed/qede drivers.
Please consider applying it to 'net-next' tree.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudarsana Reddy Kalluru [Wed, 20 Mar 2019 07:26:26 +0000 (00:26 -0700)]
qed: Define new MF bit for no_vlan config
The patch introduces a new Multi-Function bit for cases where firmware
shouldn't perform the insertion of vlan-0 tag. The new bit is defined to
abstract the implementation from the actual MF mode.
Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudarsana Reddy Kalluru [Wed, 20 Mar 2019 07:26:25 +0000 (00:26 -0700)]
qede: Populate mbi version in ethtool driver query data.
The patch adds support to display MBI image version in 'ethtool -i' output.
Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: Michal Kalderon <mkalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Wed, 20 Mar 2019 02:23:33 +0000 (10:23 +0800)]
macvlan: pass get_ts_info and SIOC[SG]HWTSTAMP ioctl to real device
Similiar to commit
a6111d3c93d0 ("vlan: Pass SIOC[SG]HWTSTAMP ioctls to
real device") and commit
37dd9255b2f6 ("vlan: Pass ethtool get_ts_info
queries to real device."), add MACVlan HW ptp support.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mao Wenan [Wed, 20 Mar 2019 02:06:57 +0000 (10:06 +0800)]
net: bridge: use eth_broadcast_addr() to assign broadcast address
This patch is to use eth_broadcast_addr() to assign broadcast address
insetad of memset().
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vakul Garg [Wed, 20 Mar 2019 02:03:36 +0000 (02:03 +0000)]
net/tls: Add support of AES128-CCM based ciphers
Added support for AES128-CCM based record encryption. AES128-CCM is
similar to AES128-GCM. Both of them have same salt/iv/mac size. The
notable difference between the two is that while invoking AES128-CCM
operation, the salt||nonce (which is passed as IV) has to be prefixed
with a hardcoded value '2'. Further, CCM implementation in kernel
requires IV passed in crypto_aead_request() to be full '16' bytes.
Therefore, the record structure 'struct tls_rec' has been modified to
reserve '16' bytes for IV. This works for both GCM and CCM based cipher.
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 20 Mar 2019 17:58:17 +0000 (10:58 -0700)]
Merge branch 'net-phy-aquantia-add-interface-mode-handling'
Heiner Kallweit says:
====================
net: phy: aquantia: add interface mode handling
These two patches add interface mode handling for the AQR107/AQCS109.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikita Yushchenko [Tue, 19 Mar 2019 22:05:50 +0000 (23:05 +0100)]
net: phy: aquantia: check for changed interface mode in read_status
Depending on the auto-negotiated speed the PHY may change the interface
mode. Check for new mode and set phydev->interface accordingly.
Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[hkallweit1@gmail.com: picked from bigger patch and reworked]
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Tue, 19 Mar 2019 22:04:38 +0000 (23:04 +0100)]
net: phy: aquantia: check for supported interface modes in config_init
Let config_init check for unsupported interface modes on AQR107/AQCS109.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
[hkallweit1@gmail.com: adjusted for AQR107/AQCS109 specifics]
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Tue, 19 Mar 2019 18:56:51 +0000 (19:56 +0100)]
net: phy: improve handling link_change_notify callback
Currently the Phy driver's link_change_notify callback is called
whenever the state machine is run (every second if polling), no matter
whether the state changed or not. This isn't needed and may confuse
users considering the name of the callback. Actually it contradicts
its kernel-doc description. Therefore let's change the behavior and
call this callback only in case of an actual state change.
This requires changes to the at803x and rockchip drivers.
at803x can be simplified so that it reacts on a state change to
PHY_NOLINK only.
The rockchip driver can also be much simplified. We simply re-init
the AFE/DSP registers whenever we change to PHY_RUNNING and speed
is 100Mbps. This causes very small overhead because we do this even
if the speed was 100Mbps already. But this is negligible and
I think justified by the much simpler code.
Changes are compile-tested only.
A little bit problematic seems to be to find somebody with the
hardware to test the changes to the two PHY drivers. See also [0].
David may be able to test the Rockchip driver.
[0] https://marc.info/?t=
153782508800006&r=1&w=2
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 20 Mar 2019 17:14:10 +0000 (10:14 -0700)]
Merge branch '100GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
100GbE Intel Wired LAN Driver Updates 2019-03-19
This series contains updates to ice driver only.
Michal adds support for the pruning enable flag to avoid seeing
broadcast packets on different VLANs.
Akeem fixes an issue with VF queues being disabled and the VF netdev
network carrier being lost after reset. Fixed an issue issue when doing
PFR and CORER resets, where all VF VSIs need to be reset and rebuilt
with the main VSIs before replaying all VSIs. Resolved an issue to
properly initialize VFs in the guest OS via PCI passthrough.
Bruce adds a local variable to avoid unnecessary de-references
throughout ice_probe().
Brett cleans up the code a bit by removing the need for a local variable
and re-designs the loop to simply return when get a successful result.
Cleans up the code to replace loop calls with a predefined macro to make
the code more consistent. Updated the driver to ensure ITR granularity
is always 2 usecs. Refactors the calculation of VSIs per PF into a
general function that can calculate per PF allocations for not just VSIs
but across multiple resource types. Improve the driver performance of
the driver when using the default settings by determining the ring size
and the number of descriptors for transmit and receive based on a
calculation with the PAGE_SIZE, ICE_MAX_NUM_DESC, and
ICE_REQ_DESC_MULTIPLE.
Chinh fixes an issue, where a reserved bit was possibly being set when
it should never be set.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 20 Mar 2019 17:12:03 +0000 (10:12 -0700)]
Merge branch '1GbE' of git://git./linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
1GbE Intel Wired LAN Driver Updates 2019-03-19
This series contains updates to e100, e1000, e1000e, igb, igc and
ixgbe.
Serhey Popovych fixes the return value for several of our older
drivers for netdev_update_features() to notify of changes applied.
Kai-Heng Feng fixes the WoL setting for system suspend, which should
not set to runtime suspend settings for igb. Then fixes a power
management issue with e1000e for CNP+ devices.
Colin Ian King fixes whitespace issue (indentation), which helps with
readability.
Sasha provides the remaining changes for igc, including the enabling of
multi-queues to receive. Added support for displaying and configuring
network flow classification (NFC) via ethtool. Added additional
statistics and basic counters for igc. Fixed a typo, so it aligns with
our other drivers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Brett Creeley [Fri, 8 Feb 2019 20:50:59 +0000 (12:50 -0800)]
ice: Determine descriptor count and ring size based on PAGE_SIZE
Currently we set the default number of Tx and Rx descriptors to 128 by
default. For Rx this amounts to a full page (assuming 4K pages) because
each Rx descriptor is 32 Bytes, but for Tx it only amounts to a half
page because each Tx descriptor is 16 Bytes (assuming 4K pages).
Instead of assuming 4K pages, determine the ring size and the number of
descriptors for Tx and Rx based on a calculation using the PAGE_SIZE,
ICE_MAX_NUM_DESC, and ICE_REQ_DESC_MULTIPLE. This change is being made
to improve the performance of the driver when using the default
settings.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Akeem G Abodunrin [Fri, 8 Feb 2019 20:50:58 +0000 (12:50 -0800)]
ice: Reset all VFs with VFLR during SR-IOV init flow
During SR-IOV initialization, we allocate and setup VFs with reset, and
since we were going to inform Firmware about our intention to do VFLR by
disabling LAN TX Queue, then we really have to complete VF reset flow with
VFLR using appropriate registers - Otherwise, reset status bit for VF in
the Guest OS might returns
DEADBEEF.
This resolves issue to properly initialize VFs in the Guest OS via PCI
passthrough.
Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Brett Creeley [Fri, 8 Feb 2019 20:50:57 +0000 (12:50 -0800)]
ice: Get resources per function
ice_get_guar_num_vsi currently calculates the number of VSIs per PF.
Rework this into a general function ice_get_num_per_func, that can
calculate per PF allocations for not just VSIs but across multiple
resource types.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Akeem G Abodunrin [Fri, 8 Feb 2019 20:50:56 +0000 (12:50 -0800)]
ice: Implement flow to reset VFs with PFR and other resets
All VF VSIs need to be reset and rebuild with the main VSIs before
replaying all VSIs, so that all existing switch filters, scheduler tree
and other configuration could be replayed at once. This fixes issues when
doing PFR and CORER reset.
Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Brett Creeley [Fri, 8 Feb 2019 20:50:55 +0000 (12:50 -0800)]
ice: configure GLINT_ITR to always have an ITR gran of 2
Instead of hoping that our ITR granularity will be 2 usec program the
GLINT_CTL register to make sure the ITR granularity is always 2 usecs.
Now that we know what the ITR granularity will be get rid of the check
in ice_probe() to verify our previous assumption.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Brett Creeley [Fri, 8 Feb 2019 20:50:54 +0000 (12:50 -0800)]
ice: use ice_for_each_vsi macro when possible
Replace all instances of:
for (i = 0; i < pf->num_alloc_vsi; i++)
with the following macro:
ice_for_each_vsi(pf, i)
This will allow the code to be consistent since there are currently
cases of using both.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Chinh T Cao [Fri, 8 Feb 2019 20:50:52 +0000 (12:50 -0800)]
ice : Ensure only valid bits are set in ice_aq_set_phy_cfg
In the ice_aq_set_phy_cfg AQ command, the 16.4 bit is reserved. This
patch will make sure that this bit will never be set to 1.
Signed-off-by: Chinh T Cao <chinh.t.cao@intel.com>
Reviewed-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Brett Creeley [Fri, 8 Feb 2019 20:50:51 +0000 (12:50 -0800)]
ice: remove redundant variable and if condition
In ice_pf_rxq_wait we are using an unnecessary local variable and also
we are checking if the timeout time was reached after the loop. Get rid
of the local variable and return 0 right when we get a successful
result. This makes it so we can return -ETIMEDOUT if we ever exit the
loop because we know the timeout time has been hit.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Bruce Allan [Fri, 8 Feb 2019 20:50:50 +0000 (12:50 -0800)]
ice: avoid multiple unnecessary de-references in probe
Add a local variable struct device *dev to avoid unnecessary de-references
throughout ice_probe().
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Akeem G Abodunrin [Fri, 8 Feb 2019 20:50:49 +0000 (12:50 -0800)]
ice: Fix issue with VF reset and multiple VFs support on PFs
This patch fixes issues with VF queues being disabled, and VF netdev
network carrier being lost after reset. Basically, we need to check if VF
is enabled, and queue configured in reset_all_vfs flow, and disable/enable
those queues appropriately whenever the function is called after
Global/CORER/PFR reset/rebuild/replay.
Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Michal Swiatkowski [Fri, 8 Feb 2019 20:50:48 +0000 (12:50 -0800)]
ice: Fix broadcast traffic in port VLAN mode
Set egress (Rx) pruning enable flag for VF VSI in VSI ctxt to
enable prune action.
To avoid seeing broadcast packet in different VLAN, pruning enable
flag in VSI ctxt should be set.
Write new functions (fill VSI ctx) to not repeat send ctxt code.
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Fri, 15 Mar 2019 15:12:07 +0000 (17:12 +0200)]
igc: Remove unneeded hw_dbg prints
Remove unneeded hw_dbg prints from igc_ethtool.c file.
Clean up code.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Mon, 11 Mar 2019 22:34:35 +0000 (00:34 +0200)]
igc: Fix the typo in igc_base.h header definition
Add the underline for the _IGC_BASE_H_.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Wed, 20 Feb 2019 12:39:31 +0000 (14:39 +0200)]
igc: Add support for the ntuple feature
Copy the ntuple feature into list of user selectable features.
Enable the ntuple feature.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Mon, 18 Feb 2019 08:37:31 +0000 (10:37 +0200)]
igc: Add support for statistics
Add support for statistics and show basic counters.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David S. Miller [Tue, 19 Mar 2019 21:59:32 +0000 (14:59 -0700)]
Merge branch 'enc28j60-messaging-clean-up-and-ACPI-improvements'
Andy Shevchenko says:
====================
enc28j60: messaging clean up and ACPI improvements
Most of the patches in the series dedicated to update messaging to use modern
APIs, such as netdev, with a benefit to distinguish devices, if more than one
installed on the system.
Besides that, patch 1 targeting ACPI enabled systems when MAC address provided
there via properties.
And few clean ups are included, such as:
- switching to module_spi_driver()
- converting to use ether_addr_copy() API
- converting to use SPDX
Since v2:
- cover letter is added
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:30 +0000 (20:49 +0200)]
enc28j60: Convert to use SPDX identifier
Reduce size of duplicated comments by switching to use SPDX identifier.
No functional change.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:29 +0000 (20:49 +0200)]
enc28j60: Fix indentation splats
Fix few indentation splats. No functional change intended.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:28 +0000 (20:49 +0200)]
enc28j60: Amend comments by fixing typos, adding periods, etc
Amend comments in the code:
- adding periods to the multi-line comments
- fixing typos
- capitalize first word in the sentences
- etc
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:27 +0000 (20:49 +0200)]
enc28j60: Remove linux/init.h
There is no need to include linux/init.h when at the same time
we include linux/module.h.
Remove redundant inclusion.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:26 +0000 (20:49 +0200)]
enc28j60: Convert printk() to netdev_printk()
The debug prints of network operations will look better if network
device name is printed. The benefit of that is a possibility to distinguish
the actual hardware when more than one is installed on the system.
Convert appropriate printk(KERN_DEBUG) to netdev_print(KERN_DEBUG, ndev).
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:25 +0000 (20:49 +0200)]
enc28j60: Convert HW related printk() to dev_printk()
The debug prints of hardware status and operations will look better
if SPI device name is printed. The benefit of that is a possibility
to distinguish the actual hardware when more than one is installed
on the system.
Convert appropriate printk(KERN_DEBUG) to dev_print(KERN_DEBUG, &spi->dev).
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:24 +0000 (20:49 +0200)]
enc28j60: Switch to dev_<level> from pr_<level>
Instead of using open coded printk(KERN_<LEVEL>) switch the driver to use
dev_<level> macros.
Note, the device name will be printed in full, which is beneficial when
more than one card installed on the system.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:23 +0000 (20:49 +0200)]
enc28j60: Use ether_addr_copy() in enc28j60_set_mac_address()
Use ether_addr_copy() instead of memcpy() to copy the mac address.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:22 +0000 (20:49 +0200)]
enc28j60: Switch to use module_spi_driver() macro
Eliminate some boilerplate code by using module_spi_driver() instead of
->init() / ->exit(), moving the salient bits from ->init() into ->probe().
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:21 +0000 (20:49 +0200)]
enc28j60: Drop driver name duplication from messages
When dev_<level>() macros are used against SPI device, the driver's name
is printed as well. No need to duplicate this explicitly.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:20 +0000 (20:49 +0200)]
enc28j60: Replace dev_*(&netdev->dev, ...) with netdev_*()
Replace open coded netdev_<level>() macros.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:19 +0000 (20:49 +0200)]
enc28j60: Remove duplicate messaging
The ->probe() and ->remove() stages can be easily debugged with
initcall_debug or function tracer. There is no need to repeat the same
explicitly in the driver.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Tue, 19 Mar 2019 18:49:18 +0000 (20:49 +0200)]
enc28j60: Use device_get_mac_address()
Replace the DT-specific of_get_mac_address() function with
device_get_mac_address, which works on both DT and ACPI platforms.
This change makes it easier to add ACPI support.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sasha Neftin [Thu, 14 Feb 2019 11:31:37 +0000 (13:31 +0200)]
igc: Extend the ethtool supporting
Add show and configure network flow classification (NFC) methods
to the ethtool. Show the specifies Rx ntuple filters.
Configures receive network flow classification option or rules.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Sasha Neftin [Wed, 6 Feb 2019 07:48:37 +0000 (09:48 +0200)]
igc: Add multiple receive queues control supporting
Enable the multi queues to receive.
Program the direction of packets to specified queues according
to the mode selected in the MRQC register.
Multiple receive queues defined by filters and RSS for 4 queues.
Enable/disable RSS hashing and also to enable multiple receive queues.
This patch will allow further ethtool support development.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Kai-Heng Feng [Sat, 2 Feb 2019 17:40:16 +0000 (01:40 +0800)]
e1000e: Disable runtime PM on CNP+
There are some new e1000e devices can only be woken up from D3 one time,
by plugging Ethernet cable. Subsequent cable plugging does set PME bit
correctly, but it still doesn't get woken up.
Since e1000e connects to the root complex directly, we rely on ACPI to
wake it up. In this case, the GPE from _PRW only works once and stops
working after that. Though it appears to be a platform bug, e1000e
maintainers confirmed that I219 does not support D3.
So disable runtime PM on CNP+ chips. We may need to disable earlier
generations if this bug also hit older platforms.
Bugzilla: https://bugzilla.kernel.org/attachment.cgi?id=280819
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Colin Ian King [Wed, 16 Jan 2019 18:53:16 +0000 (18:53 +0000)]
igb: fix various indentation issues
There are some lines that have indentation issues, fix these
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Stephen Suryaputra [Tue, 19 Mar 2019 16:37:12 +0000 (12:37 -0400)]
ipv6: Add icmp_echo_ignore_multicast support for ICMPv6
IPv4 has icmp_echo_ignore_broadcast to prevent responding to broadcast pings.
IPv6 needs a similar mechanism.
v1->v2:
- Remove NET_IPV6_ICMP_ECHO_IGNORE_MULTICAST.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wolfram Sang [Tue, 19 Mar 2019 16:36:34 +0000 (17:36 +0100)]
net: macb: simplify getting .driver_data
We should get 'driver_data' from 'struct device' directly. Going via
platform_device is an unneeded step back and forth.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kai-Heng Feng [Tue, 11 Dec 2018 07:59:38 +0000 (15:59 +0800)]
igb: Exclude device from suspend direct complete optimization
igb sets different WoL settings in system suspend callback and runtime
suspend callback.
The suspend direct complete optimization leaves igb in runtime suspended
state with wrong WoL setting during system suspend.
To fix this, we need to disable suspend direct complete optimization to
let igb always use suspend callback to set correct WoL during system
suspend.
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Serhey Popovych [Thu, 29 Mar 2018 14:51:36 +0000 (17:51 +0300)]
intel: correct return from set features callback
According to comments in <linux/netdevice.h> we should return either >0
or -errno from ->ndo_set_features() if changing dev->features by itself.
Return 1 in such places to notify netdev_update_features() about applied
changes in dev->features.
Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
YueHaibing [Tue, 19 Mar 2019 15:39:06 +0000 (23:39 +0800)]
net: pasemi: Make pasemi_mac_init_module static
Fix sparse warning:
drivers/net/ethernet/pasemi/pasemi_mac.c:1842:5: warning:
symbol 'pasemi_mac_init_module' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guillaume Nault [Tue, 19 Mar 2019 15:05:44 +0000 (16:05 +0100)]
tcp: free request sock directly upon TFO or syncookies error
Since the request socket is created locally, it'd make more sense to
use reqsk_free() instead of reqsk_put() in TFO and syncookies' error
path.
However, tcp_get_cookie_sock() may set ->rsk_refcnt before freeing the
socket; tcp_conn_request() may also have non-null ->rsk_refcnt because
of tcp_try_fastopen(). In both cases 'req' hasn't been exposed
to the outside world and is safe to free immediately, but that'd
trigger the WARN_ON_ONCE in reqsk_free().
Define __reqsk_free() for these situations where we know nobody's
referencing the socket, even though ->rsk_refcnt might be non-null.
Now we can consolidate the error path of tcp_get_cookie_sock() and
tcp_conn_request().
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Tue, 19 Mar 2019 14:59:46 +0000 (22:59 +0800)]
datagram: Make __skb_datagram_iter static
Fix sparse warning:
net/core/datagram.c:411:5: warning:
symbol '__skb_datagram_iter' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Tue, 19 Mar 2019 14:46:41 +0000 (22:46 +0800)]
net: hns3: Make hclgevf_update_link_mode static
Fix sparse warning:
drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c:407:6:
warning: symbol 'hclgevf_update_link_mode' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Tue, 19 Mar 2019 14:42:37 +0000 (22:42 +0800)]
ibmveth: Make array ibmveth_stats static
Fix sparse warning:
drivers/net/ethernet/ibm/ibmveth.c:96:21:
warning: symbol 'ibmveth_stats' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 19 Mar 2019 14:01:08 +0000 (07:01 -0700)]
tcp: add tcp_inet6_sk() helper
TCP ipv6 fast path dereferences a pointer to get to the inet6
part of a tcp socket, but given the fixed memory placement,
we can do better and avoid a possible cache line miss.
This also reduces register pressure, since we let the compiler
know about this memory placement.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Murilo Fossa Vicentini [Tue, 19 Mar 2019 13:28:51 +0000 (10:28 -0300)]
ibmvnic: Report actual backing device speed and duplex values
The ibmvnic driver currently reports a fixed value for both speed and
duplex settings regardless of the actual backing device that is being
used. By adding support to the QUERY_PHYS_PARMS command defined by the
PAPR+ we can query the current physical port state and report the proper
values for these feilds.
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Murilo Fossa Vicentini <muvic@linux.ibm.com>
Reviewed-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hoang Le [Tue, 19 Mar 2019 11:49:50 +0000 (18:49 +0700)]
tipc: smooth change between replicast and broadcast
Currently, a multicast stream may start out using replicast, because
there are few destinations, and then it should ideally switch to
L2/broadcast IGMP/multicast when the number of destinations grows beyond
a certain limit. The opposite should happen when the number decreases
below the limit.
To eliminate the risk of message reordering caused by method change,
a sending socket must stick to a previously selected method until it
enters an idle period of 5 seconds. Means there is a 5 seconds pause
in the traffic from the sender socket.
If the sender never makes such a pause, the method will never change,
and transmission may become very inefficient as the cluster grows.
With this commit, we allow such a switch between replicast and
broadcast without any need for a traffic pause.
Solution is to send a dummy message with only the header, also with
the SYN bit set, via broadcast or replicast. For the data message,
the SYN bit is set and sending via replicast or broadcast (inverse
method with dummy).
Then, at receiving side any messages follow first SYN bit message
(data or dummy message), they will be held in deferred queue until
another pair (dummy or data message) arrived in other link.
v2: reverse christmas tree declaration
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hoang Le [Tue, 19 Mar 2019 11:49:49 +0000 (18:49 +0700)]
tipc: introduce new capability flag for cluster
As a preparation for introducing a smooth switching between replicast
and broadcast method for multicast message, We have to introduce a new
capability flag TIPC_MCAST_RBCTL to handle this new feature.
During a cluster upgrade a node can come back with this new capabilities
which also must be reflected in the cluster capabilities field.
The new feature is only applicable if all node in the cluster supports
this new capability.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hoang Le [Tue, 19 Mar 2019 11:49:48 +0000 (18:49 +0700)]
tipc: support broadcast/replicast configurable for bc-link
Currently, a multicast stream uses either broadcast or replicast as
transmission method, based on the ratio between number of actual
destinations nodes and cluster size.
However, when an L2 interface (e.g., VXLAN) provides pseudo
broadcast support, this becomes very inefficient, as it blindly
replicates multicast packets to all cluster/subnet nodes,
irrespective of whether they host actual target sockets or not.
The TIPC multicast algorithm is able to distinguish real destination
nodes from other nodes, and hence provides a smarter and more
efficient method for transferring multicast messages than
pseudo broadcast can do.
Because of this, we now make it possible for users to force
the broadcast link to permanently switch to using replicast,
irrespective of which capabilities the bearer provides,
or pretend to provide.
Conversely, we also make it possible to force the broadcast link
to always use true broadcast. While maybe less useful in
deployed systems, this may at least be useful for testing the
broadcast algorithm in small clusters.
We retain the current AUTOSELECT ability, i.e., to let the broadcast link
automatically select which algorithm to use, and to switch back and forth
between broadcast and replicast as the ratio between destination
node number and cluster size changes. This remains the default method.
Furthermore, we make it possible to configure the threshold ratio for
such switches. The default ratio is now set to 10%, down from 25% in the
earlier implementation.
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peter Xu [Mon, 18 Mar 2019 06:56:06 +0000 (14:56 +0800)]
virtio_net: remove hcpu from virtnet_clean_affinity
The variable is never used.
CC: Michael S. Tsirkin <mst@redhat.com>
CC: Jason Wang <jasowang@redhat.com>
CC: virtualization@lists.linux-foundation.org
CC: netdev@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 18 Mar 2019 18:07:33 +0000 (11:07 -0700)]
Documentation: networking: Update netdev-FAQ regarding patches
Provide an explanation of what is expected with respect to sending new
versions of specific patches within a patch series, as well as what
happens if an earlier patch series accidentally gets merged).
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 19 Mar 2019 01:34:45 +0000 (18:34 -0700)]
Merge branch 's390-qeth-fixes'
Julian Wiedmann says:
====================
s390/qeth: fixes 2019-03-18
please apply the following three patches to -net. The first two are fixes
for minor race conditions in the probe code, while the third one gets
dropwatch working (again).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Mon, 18 Mar 2019 15:40:56 +0000 (16:40 +0100)]
s390/qeth: be drop monitor friendly
As part of the TX completion path, qeth_release_skbs() frees the completed
skbs with __skb_queue_purge(). This ends in kfree_skb(), reporting every
completed skb as dropped.
On the other hand when dropping an skb in .ndo_start_xmit, we end up
calling consume_skb()... where we should be using kfree_skb() so that
drop monitors get notified.
Switch the drop/consume logic around, and also don't accumulate dropped
packets in the tx_errors statistics.
Fixes:
dc149e3764d8 ("s390/qeth: replace open-coded skb_queue_walk()")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Mon, 18 Mar 2019 15:40:55 +0000 (16:40 +0100)]
s390/qeth: fix race when initializing the IP address table
The ucast IP table is utilized by some of the L3-specific sysfs attributes
that qeth_l3_create_device_attributes() provides. So initialize the table
_before_ registering the attributes.
Fixes:
ebccc7397e4a ("s390/qeth: add missing hash table initializations")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Mon, 18 Mar 2019 15:40:54 +0000 (16:40 +0100)]
s390/qeth: don't erase configuration while probing
The HW trap and VNICC configuration is exposed via sysfs, and may have
already been modified when qeth_l?_probe_device() attempts to initialize
them. So (1) initialize the VNICC values a little earlier, and (2) don't
bother about the HW trap mode, it was already initialized before.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bjorn Helgaas [Mon, 18 Mar 2019 13:51:06 +0000 (08:51 -0500)]
mISDN: hfcpci: Test both vendor & device ID for Digium HFC4S
The device ID alone does not uniquely identify a device. Test both the
vendor and device ID to make sure we don't mistakenly think some other
vendor's 0xB410 device is a Digium HFC4S. Also, instead of the bare hex
ID, use the same constant (PCI_DEVICE_ID_DIGIUM_HFC4S) used in the device
ID table.
No functional change intended.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 19 Mar 2019 01:31:09 +0000 (18:31 -0700)]
Merge branch 'sctp-fix-ignoring-asoc_id-for-tcp-style-sockets-on-some-setsockopts'
Xin Long says:
====================
sctp: fix ignoring asoc_id for tcp-style sockets on some setsockopts
This is a patchset to fix ignoring asoc_id for tcp-style sockets on
some setsockopts, introduced by SCTP_CURRENT_ASSOC of the patchset:
[net-next,00/24] sctp: support SCTP_FUTURE/CURRENT/ALL_ASSOC
(https://patchwork.ozlabs.org/cover/
1031706/)
As Marcelo suggested, we fix it on each setsockopt that is using
SCTP_CURRENT_ASSOC one by one by adding the check:
if (sctp_style(sk, TCP))
xxx.xxx_assoc_id = SCTP_FUTURE_ASSOC;
so that assoc_id will be completely ingored for tcp-style socket on
setsockopts, and works as SCTP_FUTURE_ASSOC.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:11 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_STREAM_SCHEDULER sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_STREAM_SCHEDULER sockopt.
Fixes:
7efba10d6bd2 ("sctp: add SCTP_FUTURE_ASOC and SCTP_CURRENT_ASSOC for SCTP_STREAM_SCHEDULER sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:10 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_EVENT sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_EVENT sockopt.
Fixes:
d251f05e3ba2 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_EVENT sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:09 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_ENABLE_STREAM_RESET sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_ENABLE_STREAM_RESET sockopt.
Fixes:
99a62135e127 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_ENABLE_STREAM_RESET sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:08 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_PRINFO sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_DEFAULT_PRINFO sockopt.
Fixes:
3a583059d187 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_DEFAULT_PRINFO sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:07 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_AUTH_DEACTIVATE_KEY sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_AUTH_DEACTIVATE_KEY sockopt.
Fixes:
2af66ff3edc7 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_AUTH_DEACTIVATE_KEY sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:06 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_AUTH_DELETE_KEY sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_AUTH_DELETE_KEY sockopt.
Fixes:
3adcc300603e ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_AUTH_DELETE_KEY sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:05 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_AUTH_ACTIVE_KEY sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_AUTH_ACTIVE_KEY sockopt.
Fixes:
bf9fb6ad4f29 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_AUTH_ACTIVE_KEY sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:04 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_AUTH_KEY sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_AUTH_KEY sockopt.
Fixes:
7fb3be13a236 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_AUTH_KEY sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:03 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_MAX_BURST sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_MAX_BURST sockopt.
Fixes:
e0651a0dc877 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_MAX_BURST sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:02 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_CONTEXT sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_CONTEXT sockopt.
Fixes:
49b037acca8c ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_CONTEXT sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:01 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SNDINFO sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_DEFAULT_SNDINFO sockopt.
Fixes:
92fc3bd928c9 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_DEFAULT_SNDINFO sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 12:06:00 +0000 (20:06 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DELAYED_SACK sockopt
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on
SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_DELAYED_SACK sockopt.
Fixes:
9c5829e1c49e ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_DELAYED_SACK sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Marcelo Ricardo Leitner [Mon, 18 Mar 2019 12:05:59 +0000 (20:05 +0800)]
sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt
Currently if the user pass an invalid asoc_id to SCTP_DEFAULT_SEND_PARAM
on a TCP-style socket, it will silently ignore the new parameters.
That's because after not finding an asoc, it is checking asoc_id against
the known values of CURRENT/FUTURE/ALL values and that fails to match.
IOW, if the user supplies an invalid asoc id or not, it should either
match the current asoc or the socket itself so that it will inherit
these later. Fixes it by forcing asoc_id to SCTP_FUTURE_ASSOC in case it
is a TCP-style socket without an asoc, so that the values get set on the
socket.
Fixes:
707e45b3dc5a ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_DEFAULT_SEND_PARAM sockopt")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 11:58:29 +0000 (19:58 +0800)]
sctp: not copy sctp_sock pd_lobby in sctp_copy_descendant
Now sctp_copy_descendant() copies pd_lobby from old sctp scok to new
sctp sock. If sctp_sock_migrate() returns error, it will panic when
releasing new sock and trying to purge pd_lobby due to the incorrect
pointers in pd_lobby.
[ 120.485116] kasan: CONFIG_KASAN_INLINE enabled
[ 120.486270] kasan: GPF could be caused by NULL-ptr deref or user
[ 120.509901] Call Trace:
[ 120.510443] sctp_ulpevent_free+0x1e8/0x490 [sctp]
[ 120.511438] sctp_queue_purge_ulpevents+0x97/0xe0 [sctp]
[ 120.512535] sctp_close+0x13a/0x700 [sctp]
[ 120.517483] inet_release+0xdc/0x1c0
[ 120.518215] __sock_release+0x1d2/0x2a0
[ 120.519025] sctp_do_peeloff+0x30f/0x3c0 [sctp]
We fix it by not copying sctp_sock pd_lobby in sctp_copy_descendan(),
and skb_queue_head_init() can also be removed in sctp_sock_migrate().
Reported-by: syzbot+85e0b422ff140b03672a@syzkaller.appspotmail.com
Fixes:
89664c623617 ("sctp: sctp_sock_migrate() returns error if sctp_bind_addr_dup() fails")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 18 Mar 2019 11:47:00 +0000 (19:47 +0800)]
sctp: get sctphdr by offset in sctp_compute_cksum
sctp_hdr(skb) only works when skb->transport_header is set properly.
But in Netfilter, skb->transport_header for ipv6 is not guaranteed
to be right value for sctphdr. It would cause to fail to check the
checksum for sctp packets.
So fix it by using offset, which is always right in all places.
v1->v2:
- Fix the changelog.
Fixes:
e6d8b64b34aa ("net: sctp: fix and consolidate SCTP checksumming code")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yoshiki Komachi [Mon, 18 Mar 2019 05:39:52 +0000 (14:39 +0900)]
af_packet: fix the tx skb protocol in raw sockets with ETH_P_ALL
I am using "protocol ip" filters in TC to manipulate TC flower
classifiers, which are only available with "protocol ip". However,
I faced an issue that packets sent via raw sockets with ETH_P_ALL
did not match the ip filters even if they did satisfy the condition
(e.g., DHCP offer from dhcpd).
I have determined that the behavior was caused by an unexpected
value stored in skb->protocol, namely, ETH_P_ALL instead of ETH_P_IP,
when packets were sent via raw sockets with ETH_P_ALL set.
IMHO, storing ETH_P_ALL in skb->protocol is not appropriate for
packets sent via raw sockets because ETH_P_ALL is not a real ether
type used on wire, but a virtual one.
This patch fixes the tx protocol selection in cases of transmission
via raw sockets created with ETH_P_ALL so that it asks the driver to
extract protocol from the Ethernet header.
Fixes:
75c65772c3 ("net/packet: Ask driver for protocol if not provided by user")
Signed-off-by: Yoshiki Komachi <komachi.yoshiki@lab.ntt.co.jp>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maxime Chevallier [Sat, 16 Mar 2019 13:41:30 +0000 (14:41 +0100)]
packets: Always register packet sk in the same order
When using fanouts with AF_PACKET, the demux functions such as
fanout_demux_cpu will return an index in the fanout socket array, which
corresponds to the selected socket.
The ordering of this array depends on the order the sockets were added
to a given fanout group, so for FANOUT_CPU this means sockets are bound
to cpus in the order they are configured, which is OK.
However, when stopping then restarting the interface these sockets are
bound to, the sockets are reassigned to the fanout group in the reverse
order, due to the fact that they were inserted at the head of the
interface's AF_PACKET socket list.
This means that traffic that was directed to the first socket in the
fanout group is now directed to the last one after an interface restart.
In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
socket that used to receive traffic from the last CPU after an interface
restart.
This commit introduces a helper to add a socket at the tail of a list,
then uses it to register AF_PACKET sockets.
Note that this changes the order in which sockets are listed in /proc and
with sock_diag.
Fixes:
dc99f600698d ("packet: Add fanout support")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>