Roi Dayan [Tue, 15 May 2018 09:08:12 +0000 (12:08 +0300)]
net/mlx5e: Split TC add rule path for nic vs e-switch
Move to have clear separation on the code path to add nic vs e-switch
flows. While here we break the code that deals with adding offloaded
TC tool to few smaller stages, each on helper function.
Besides getting us simpler and readable code, these are pre-steps
for being able to have two HW flows serving one SW TC flow for some
e-switch use cases.
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Rabie Loulou [Sun, 15 Oct 2017 12:48:41 +0000 (15:48 +0300)]
net/mlx5e: Change return type of tc add flow functions
Refactor the flow add utility functions to return err code instead of rule
pointers. This will allow for simpler logic when one tc rule is
duplicated to two HW rules in downstream patches.
Signed-off-by: Rabie Loulou <rabiel@mellanox.com>
Signed-off-by: Shahar Klein <shahark@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Wed, 3 Oct 2018 00:03:35 +0000 (00:03 +0000)]
net/mlx5: Use flow counter IDs and not the wrapping cache object
Currently, when a flow rule is created using the FS core layer, the caller
has to pass the entire flow counter object and not just the counter HW
handle (ID). This requires both the FS core and the caller to have
knowledge about the inner implementation of the FS layer flow counters
cache and limits the possible users.
Move to use the counter ID across the place when dealing with flows.
Doing this decoupling, now can we privatize the inner implementation
of the flow counters.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 2 Oct 2018 22:57:24 +0000 (22:57 +0000)]
net/mlx5: E-Switch, Get counters for offloaded flows from callers
There's no real reason for the e-switch logic to manage the creation of
counters for offloaded flows. The API already has the directive for the
caller to denote they want to attach a counter to the created flow.
As such, we go and move the management of flow counters to the mlx5e
tc offload logic. This also lets us remove an inelegant interface where
the FS layer had to provide a way to retrieve a counter from a flow rule.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Saeed Mahameed [Wed, 17 Oct 2018 21:13:36 +0000 (14:13 -0700)]
Merge branch 'mlx5-next' of git://git./linux/kernel/git/mellanox/linux into net-next
mlx5 updates for both net-next and rdma-next
* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: (21 commits)
net/mlx5: Expose DC scatter to CQE capability bit
net/mlx5: Update mlx5_ifc with DEVX UID bits
net/mlx5: Set uid as part of DCT commands
net/mlx5: Set uid as part of SRQ commands
net/mlx5: Set uid as part of SQ commands
net/mlx5: Set uid as part of RQ commands
net/mlx5: Set uid as part of QP commands
net/mlx5: Set uid as part of CQ commands
net/mlx5: Rename incorrect naming in IFC file
net/mlx5: Export packet reformat alloc/dealloc functions
net/mlx5: Pass a namespace for packet reformat ID allocation
net/mlx5: Expose new packet reformat capabilities
{net, RDMA}/mlx5: Rename encap to reformat packet
net/mlx5: Move header encap type to IFC header file
net/mlx5: Break encap/decap into two separated flow table creation flags
net/mlx5: Add support for more namespaces when allocating modify header
net/mlx5: Export modify header alloc/dealloc functions
net/mlx5: Add proper NIC TX steering flow tables support
net/mlx5: Cleanup flow namespace getter switch logic
net/mlx5: Add memic command opcode to command checker
...
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Daniel Borkmann [Tue, 16 Oct 2018 19:31:35 +0000 (21:31 +0200)]
tcp, ulp: remove socket lock assertion on ULP cleanup
Eric reported that syzkaller triggered a splat in tcp_cleanup_ulp()
where assertion sock_owned_by_me() failed. This happened through
inet_csk_prepare_forced_close() first releasing the socket lock,
then calling into tcp_done(newsk) which is called after the
inet_csk_prepare_forced_close() and therefore without the socket
lock held. The sock_owned_by_me() assertion can generally be
removed as the only place where tcp_cleanup_ulp() is called from
now is out of inet_csk_destroy_sock() -> sk->sk_prot->destroy()
where socket is in dead state and unreachable. Therefore, add a
comment why the check is not needed instead.
Fixes:
8b9088f806e1 ("tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Yonatan Cohen [Tue, 9 Oct 2018 09:05:12 +0000 (12:05 +0300)]
net/mlx5: Expose DC scatter to CQE capability bit
dc_req_scat_data_cqe capability bit determines
if requester scatter to cqe is available for 64 bytes CQE over
DC transport type.
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Reviewed-by: Guy Levi <guyle@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
David S. Miller [Tue, 16 Oct 2018 17:09:59 +0000 (10:09 -0700)]
Merge branch 'hns3-Some-cleanup-and-bugfix-for-desc-filling'
Yunsheng Lin says:
====================
Some cleanup and bugfix for desc filling
When retransmiting packets, skb_cow_head which is called in
hns3_set_tso may clone a new header. And driver will clear the
checksum of the header after doing DMA map, so HW will read the
old header whose L3 checksum is not cleared and calculate a
wrong L3 checksum.
Also When sending a big fragment using multiple buffer descriptor,
hns3 does one maping, but do multiple unmapping when tx is done,
which may cause unmapping problem.
This patchset does some cleanup before fixing the above problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 16 Oct 2018 11:58:52 +0000 (19:58 +0800)]
net: hns3: fix for multiple unmapping DMA problem
When sending a big fragment using multiple buffer descriptor,
hns3 does one maping, but do multiple unmapping when tx is done,
which may cause unmapping problem.
To fix it, this patch makes sure the value of desc_cb.length of
the non-first bd is zero. If desc_cb.length is zero, we do not
unmap the buffer.
Fixes:
76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 16 Oct 2018 11:58:51 +0000 (19:58 +0800)]
net: hns3: rename hns_nic_dma_unmap
To keep symmetrical, this patch renames hns_nic_dma_unmap to
hns3_clear_desc.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Tue, 16 Oct 2018 11:58:50 +0000 (19:58 +0800)]
net: hns3: add handling for big TX fragment
This patch unifies big tx fragment handling for tso and non-tso
case.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Tue, 16 Oct 2018 11:58:49 +0000 (19:58 +0800)]
net: hns3: move DMA map into hns3_fill_desc
To solve the L3 checksum error problem which happens when driver
does not clear L3 checksum, DMA map should be done after calling
skb_cow_head.
This patch moves DMA map into hns3_fill_desc to ensure that DMA
map is done after calling skb_cow_head.
Fixes:
76ad4f0ee747 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Tue, 16 Oct 2018 11:58:48 +0000 (19:58 +0800)]
net: hns3: remove hns3_fill_desc_tso
This patch removes hns3_fill_desc_tso in preparation for
fixing some desc filling bug, because for tso or non-tso
case, we will use the unified hns3_fill_desc.
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 17:04:29 +0000 (10:04 -0700)]
Merge branch 'qed-Align-PTT-and-add-various-link-modes'
Rahul Verma says:
====================
Align PTT and add various link modes.
This series aligns the ptt propagation as local ptt or global ptt.
Adds new transceiver modes, speed capabilities and board config,
which is utilized to display the enhanced link modes, media types
and speed. Enhances the link with detailed information.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Verma [Tue, 16 Oct 2018 10:59:22 +0000 (03:59 -0700)]
qed: Prevent link getting down in case of autoneg-off.
Newly added link modes are required to be added
during setting link modes. If the new link mode
is not available during qed_set_link, it may cause
link getting down due to empty supported capability,
being passed to MFW, after setting autoneg off/on
with current/supported speed.
Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Verma [Tue, 16 Oct 2018 10:59:21 +0000 (03:59 -0700)]
qede: Check available link modes before link set from ethtool.
Set link mode after checking available "supported" link caps
of the port.
Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Verma [Tue, 16 Oct 2018 10:59:20 +0000 (03:59 -0700)]
qed: Add supported link and advertise link to display in ethtool.
Added transceiver type, speed capability and board types
in HSI, are utilizing to display the accurate link
information in ethtool.
Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Verma [Tue, 16 Oct 2018 10:59:19 +0000 (03:59 -0700)]
qed: Added supported transceiver modes, speed capability and board config to HSI.
Added transceiver modes with different speed and media type,
speed capability and supported board types in HSI, which
will be utilizing to display correct specification of link
modes and speed type.
Signed-off-by: Rahul Verma <Rahul.Verma@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Verma [Tue, 16 Oct 2018 10:59:18 +0000 (03:59 -0700)]
qed: Align local and global PTT to propagate through the APIs.
Align the use of local PTT to propagate through the qed_mcp* API's.
Global ptt should not be used.
Register access should be done through layers. Register address is
mapped into a PTT, PF translation table. Several interface functions
require a PTT to direct read/write into register. There is a pool of
PTT maintained, and several PTT are used simultaneously to access
device registers in different flows. Same PTT should not be used in
flows that can run concurrently.
To avoid running out of PTT resources, too many PTT should not be
acquired without releasing them. Every PF has a global PTT, which is
used throughout the life of PF, in most important flows for register
access. Generic functions acquire the PTT locally and release after
the use. This patch aligns the use of Global PTT and Local PTT
accordingly.
Signed-off-by: Rahul Verma <rahul.verma@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Tue, 16 Oct 2018 10:50:56 +0000 (18:50 +0800)]
net: aquantia: make function aq_fw2x_update_stats static
Fixes the following sparse warning:
drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils_fw2x.c:282:5: warning:
symbol 'aq_fw2x_update_stats' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 07:14:18 +0000 (00:14 -0700)]
Merge branch 'net-Kernel-side-filtering-for-route-dumps'
David Ahern says:
====================
net: Kernel side filtering for route dumps
Implement kernel side filtering of route dumps by protocol (e.g., which
routing daemon installed the route), route type (e.g., unicast), table
id and nexthop device.
iproute2 has been doing this filtering in userspace for years; pushing
the filters to the kernel side reduces the amount of data the kernel
sends and reduces wasted cycles on both sides processing unwanted data.
These initial options provide a huge improvement for efficiently
examining routes on large scale systems.
v2
- better handling of requests for a specific table. Rather than walking
the hash of all tables, lookup the specific table and dump it
- refactor mr_rtm_dumproute moving the loop over the table into a
helper that can be invoked directly
- add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure
it is returned even when the dump returns nothing
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:51 +0000 (18:56 -0700)]
net/ipv4: Bail early if user only wants prefix entries
Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the
flag is set in the dump request, just return.
In the process of this change, move the CLONE check to use the new
filter flags.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:50 +0000 (18:56 -0700)]
net/ipv6: Bail early if user only wants cloned entries
Similar to IPv4, IPv6 fib no longer contains cloned routes. If a user
requests a route dump for only cloned entries, no sense walking the FIB
and returning everything.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:49 +0000 (18:56 -0700)]
net/mpls: Handle kernel side filtering of route dumps
Update the dump request parsing in MPLS for the non-INET case to
enable kernel side filtering. If INET is disabled the only filters
that make sense for MPLS are protocol and nexthop device.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:48 +0000 (18:56 -0700)]
net: Enable kernel side filtering of route dumps
Update parsing of route dump request to enable kernel side filtering.
Allow filtering results by protocol (e.g., which routing daemon installed
the route), route type (e.g., unicast), table id and nexthop device. These
amount to the low hanging fruit, yet a huge improvement, for dumping
routes.
ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can
be used to look up the device index without taking a reference. From
there filter->dev is only used during dump loops with the lock still held.
Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results
have been filtered should no entries be returned.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:47 +0000 (18:56 -0700)]
net: Plumb support for filtering ipv4 and ipv6 multicast route dumps
Implement kernel side filtering of routes by egress device index and
table id. If the table id is given in the filter, lookup table and
call mr_table_dump directly for it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:46 +0000 (18:56 -0700)]
ipmr: Refactor mr_rtm_dumproute
Move per-table loops from mr_rtm_dumproute to mr_table_dump and export
mr_table_dump for dumps by specific table id.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:45 +0000 (18:56 -0700)]
net/mpls: Plumb support for filtering route dumps
Implement kernel side filtering of routes by egress device index and
protocol. MPLS uses only a single table and route type.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:44 +0000 (18:56 -0700)]
net/ipv6: Plumb support for filtering route dumps
Implement kernel side filtering of routes by table id, egress device
index, protocol, and route type. If the table id is given in the filter,
lookup the table and call fib6_dump_table directly for it.
Move the existing route flags check for prefix only routes to the new
filter.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:43 +0000 (18:56 -0700)]
net/ipv4: Plumb support for filtering route dumps
Implement kernel side filtering of routes by table id, egress device index,
protocol and route type. If the table id is given in the filter, lookup the
table and call fib_table_dump directly for it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:42 +0000 (18:56 -0700)]
net: Add struct for fib dump filter
Add struct fib_dump_filter for options on limiting which routes are
returned in a dump request. The current list is table id, protocol,
route type, rtm_flags and nexthop device index. struct net is needed
to lookup the net_device from the index.
Declare the filter for each route dump handler and plumb the new
arguments from dump handlers to ip_valid_fib_dump_req.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 16 Oct 2018 01:56:41 +0000 (18:56 -0700)]
netlink: Add answer_flags to netlink_callback
With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED
flag is set on a message back to the user if the data returned is
influenced by some input attributes. Normally this can be done as
messages are added to the skb, but if the filter results in no data
being returned, the user could be confused as to why.
This patch adds answer_flags to the netlink_callback allowing dump
handlers to set the NLM_F_DUMP_FILTERED at a minimum in the
NLMSG_DONE message ensuring the flag gets back to the user.
The netlink_callback space is initialized to 0 via a memset in
__netlink_dump_start, so init of the new answer_flags is covered.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 06:21:07 +0000 (23:21 -0700)]
Merge git://git./linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-10-16
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable
sk_msg BPF integration for the latter, from Daniel and John.
2) Enable BPF syscall side to indicate for maps that they do not support
a map lookup operation as opposed to just missing key, from Prashant.
3) Add bpftool map create command which after map creation pins the
map into bpf fs for further processing, from Jakub.
4) Add bpftool support for attaching programs to maps allowing sock_map
and sock_hash to be used from bpftool, from John.
5) Improve syscall BPF map update/delete path for map-in-map types to
wait a RCU grace period for pending references to complete, from Daniel.
6) Couple of follow-up fixes for the BPF socket lookup to get it
enabled also when IPv6 is compiled as a module, from Joe.
7) Fix a generic-XDP bug to handle the case when the Ethernet header
was mangled and thus update skb's protocol and data, from Jesper.
8) Add a missing BTF header length check between header copies from
user space, from Wenwen.
9) Minor fixups in libbpf to use __u32 instead u32 types and include
proper perf_event.h uapi header instead of perf internal one, from Yonghong.
10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS
to bpftool's build, from Jiri.
11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install
with_addr.sh script from flow dissector selftest, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Mon, 15 Oct 2018 19:25:13 +0000 (21:25 +0200)]
net: phy: merge phy_start_aneg and phy_start_aneg_priv
After commit
9f2959b6b52d ("net: phy: improve handling delayed work")
the sync parameter isn't needed any longer in phy_start_aneg_priv().
This allows to merge phy_start_aneg() and phy_start_aneg_priv().
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Haiyang Zhang [Mon, 15 Oct 2018 19:06:15 +0000 (19:06 +0000)]
hv_netvsc: fix vf serial matching with pci slot info
The VF device's serial number is saved as a string in PCI slot's
kobj name, not the slot->number. This patch corrects the netvsc
driver, so the VF device can be successfully paired with synthetic
NIC.
Fixes:
00d7ddba1143 ("hv_netvsc: pair VF based on serial number")
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 05:56:42 +0000 (22:56 -0700)]
Merge branch 'tcp-second-round-for-EDT-conversion'
Eric Dumazet says:
====================
tcp: second round for EDT conversion
First round of EDT patches left TCP stack in a non optimal state.
- High speed flows suffered from loss of performance, addressed
by the first patch of this series.
- Second patch brings pacing to the current state of networking,
since we now reach ~100 Gbit on a single TCP flow.
- Third patch implements a mitigation for scheduling delays,
like the one we did in sch_fq in the past.
- Fourth patch removes one special case in sch_fq for ACK packets.
- Fifth patch removes a serious perfomance cost for TCP internal
pacing. We should setup the high resolution timer only if
really needed.
- Sixth patch fixes a typo in BBR.
- Last patch is one minor change in cdg congestion control.
Neal Cardwell also has a patch series fixing BBR after
EDT adoption.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:58 +0000 (09:37 -0700)]
tcp: cdg: use tcp high resolution clock cache
We store in tcp socket a cache of most recent high resolution
clock, there is no need to call local_clock() again, since
this cache is good enough.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell [Mon, 15 Oct 2018 16:37:57 +0000 (09:37 -0700)]
tcp_bbr: fix typo in bbr_pacing_margin_percent
There was a typo in this parameter name.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:56 +0000 (09:37 -0700)]
tcp: optimize tcp internal pacing
When TCP implements its own pacing (when no fq packet scheduler is used),
it is arming high resolution timer after a packet is sent.
But in many cases (like TCP_RR kind of workloads), this high resolution
timer expires before the application attempts to write the following
packet. This overhead also happens when the flow is ACK clocked and
cwnd limited instead of being limited by the pacing rate.
This leads to extra overhead (high number of IRQ)
Now tcp_wstamp_ns is reserved for the pacing timer only
(after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"),
we can setup the timer only when a packet is about to be sent,
and if tcp_wstamp_ns is in the future.
This leads to a ~10% performance increase in TCP_RR workloads.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:55 +0000 (09:37 -0700)]
net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
With the new EDT model, sch_fq no longer has to special
case TCP pure acks, since their skb->tstamp will allow them
being sent without pacing delay.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:54 +0000 (09:37 -0700)]
tcp: mitigate scheduling jitter in EDT pacing model
In commit
fefa569a9d4b ("net_sched: sch_fq: account for schedule/timers
drifts") we added a mitigation for scheduling jitter in fq packet scheduler.
This patch does the same in TCP stack, now it is using EDT model.
Note that this mitigation is valid for both external (fq packet scheduler)
or internal TCP pacing.
This uses the same strategy than the above commit, allowing
a time credit of half the packet currently sent.
Consider following case :
An skb is sent, after an idle period of 300 usec.
The air-time (skb->len/pacing_rate) is 500 usec
Instead of setting the pacing timer to now+500 usec,
it will use now+min(500/2, 300) -> now+250usec
This is like having a token bucket with a depth of half
an skb.
Tested:
tc qdisc replace dev eth0 root pfifo_fast
Before
netperf -P0 -H remote -- -q
1000000000 # 8000Mbit
540000 262144 262144 10.00 7710.43
After :
netperf -P0 -H remote -- -q
1000000000 # 8000 Mbit
540000 262144 262144 10.00 7999.75 # Much closer to 8000Mbit target
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:53 +0000 (09:37 -0700)]
net: extend sk_pacing_rate to unsigned long
sk_pacing_rate has beed introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.
We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
This patch adds no cost for 32bit kernels.
The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so iproute2/ss command require no changes.
Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0
1787144 10.246.9.76:49992 10.246.9.77:36741
timer:(on,003ms,0) ino:91863 sk:2 <->
skmem:(r0,rb540000,t66440,tb2363904,
f605944,w1822984,o0,bl0,d0)
ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
rcvmss:536 advmss:1448
cwnd:138 ssthresh:178 bytes_acked:
256699822585 segs_out:
177279177
segs_in:
3916318 data_segs_out:
177279175
bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
send 28045.5Mbps lastrcv:73333
pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
notsent:
2085120 minrtt:0.013
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Oct 2018 16:37:52 +0000 (09:37 -0700)]
tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
In EDT design, I made the mistake of using tcp_wstamp_ns
to store the last tcp_clock_ns() sample and to store the
pacing virtual timer.
This causes major regressions at high speed flows.
Introduce tcp_clock_cache to store last tcp_clock_ns().
This is needed because some arches have slow high-resolution
kernel time service.
tcp_wstamp_ns is only updated when a packet is sent.
Note that we can remove tcp_mstamp in the future since
tcp_mstamp is essentially tcp_clock_cache/1000, so the
apparent socket size increase is temporary.
Fixes:
9799ccb0e984 ("tcp: add tcp_wstamp_ns socket field")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Li RongQing [Mon, 15 Oct 2018 11:00:31 +0000 (19:00 +0800)]
net: bridge: fix a possible memory leak in __vlan_add
After per-port vlan stats, vlan stats should be released
when fail to add vlan
Fixes:
9163a0fc1f0c0 ("net: bridge: add support for per-port vlan stats")
CC: bridge@lists.linux-foundation.org
cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Mon, 15 Oct 2018 10:31:03 +0000 (11:31 +0100)]
rxrpc: Add /proc/net/rxrpc/peers to display peer list
Add /proc/net/rxrpc/peers to display the list of peers currently active.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Mon, 15 Oct 2018 03:07:16 +0000 (03:07 +0000)]
fore200e: fix missing unlock on error in bsq_audit()
Add the missing unlock before return from function bsq_audit()
in the error handling case.
Fixes:
1d9d8be91788 ("fore200e: check for dma mapping failures")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 05:44:33 +0000 (22:44 -0700)]
Merge branch 'bnxt_en-Add-support-for-new-57500-chips'
Michael Chan says:
====================
bnxt_en: Add support for new 57500 chips.
This patch-set is larger than normal because I wanted a complete series
to add basic support for the new 57500 chips. The new chips have the
following main differences compared to legacy chips:
1. Requires the PF driver to allocate DMA context memory as a backing
store.
2. New NQ (notification queue) for interrupt events.
3. One or more CP rings can be associated with an NQ.
4. 64-bit doorbells.
Most other structures and firmware APIs are compatible with legacy
devices with some exceptions. For example, ring groups are no longer
used and RSS table format has changed.
The patch-set includes the usual firmware spec. update, some refactoring
and restructuring, and adding the new code to add basic support for the
new class of devices.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:59 +0000 (07:02 -0400)]
bnxt_en: Add PCI ID for BCM57508 device.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:58 +0000 (07:02 -0400)]
bnxt_en: Add new NAPI poll function for 57500 chips.
Add a new poll function that polls for NQ events. If the NQ event is
a CQ notification, we locate the CP ring from the cq_handle and call
__bnxt_poll_work() to handle RX/TX events on the CP ring.
Add a new has_more_work field in struct bnxt_cp_ring_info to indicate
budget has been reached. __bnxt_poll_cqs_done() is called to update or
ARM the CP rings if budget has not been reached or not. If budget
has been reached, the next bnxt_poll_p5() call will continue to poll
from the CQ rings directly. Otherwise, the NQ will be ARMed for the
next IRQ.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:57 +0000 (07:02 -0400)]
bnxt_en: Refactor bnxt_poll_work().
Separate the CP ring polling logic in bnxt_poll_work() into 2 separate
functions __bnxt_poll_work() and __bnxt_poll_work_done(). Since the logic
is separated, we need to add tx_pkts and events fields to struct bnxt_napi
to keep track of the events to handle between the 2 functions. We also
add had_work_done field to struct bnxt_cp_ring_info to indicate whether
some work was performed on the CP ring.
This is needed to better support the 57500 chips. We need to poll up to
2 separate CP rings before we update or ARM the CP rings on the 57500 chips.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:56 +0000 (07:02 -0400)]
bnxt_en: Add coalescing setup for 57500 chips.
On legacy chips, the CP ring may be shared between RX and TX and so only
setup the RX coalescing parameters in such a case. On 57500 chips, we
always have a dedicated CP ring for TX so we can always set up the
TX coalescing parameters in bnxt_hwrm_set_coal().
Also, the min_timer coalescing parameter applies to the NQ on the new
chips and a separate firmware call needs to be made to set it up.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:55 +0000 (07:02 -0400)]
bnxt_en: Use bnxt_cp_ring_info struct pointer as parameter for RX path.
In the RX code path, we current use the bnxt_napi struct pointer to
identify the associated RX/CP rings. Change it to use the struct
bnxt_cp_ring_info pointer instead since there are now up to 2
CP rings per MSIX.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:54 +0000 (07:02 -0400)]
bnxt_en: Add RSS support for 57500 chips.
RSS context allocation and RSS indirection table setup are very different
on the new chip. Refactor bnxt_setup_vnic() to call 2 different functions
to set up RSS for the vnic based on chip type. On the new chip, the
number of RSS contexts and the indirection table size depends on the
number of RX rings. Each indirection table entry is also different
on the new chip since ring groups are no longer used.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:53 +0000 (07:02 -0400)]
bnxt_en: Increase RSS context array count and skip ring groups on 57500 chips.
On the new 57500 chips, we need to allocate one RSS context for every
64 RX rings. In previous chips, only one RSS context per vnic is
required regardless of the number of RX rings. So increase the max
RSS context array count to 8.
Hardware ring groups are not used on the new chips. Note that the
software ring group structure is still maintained in the driver to
keep track of the rings associated with the vnic.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:52 +0000 (07:02 -0400)]
bnxt_en: Allocate/Free CP rings for 57500 series chips.
On the new 57500 chips, we allocate/free one CP ring for each RX ring or
TX ring separately. Using separate CP rings for RX/TX is an improvement
as TX events will no longer be stuck behind RX events.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:51 +0000 (07:02 -0400)]
bnxt_en: Modify bnxt_ring_alloc_send_msg() to support 57500 chips.
Firmware ring allocation semantics are slightly different for most
ring types on 57500 chips. Allocation/deallocation for NQ rings are
also added for the new chips.
A CP ring handle is also added so that from the NQ interrupt event,
we can locate the CP ring.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:50 +0000 (07:02 -0400)]
bnxt_en: Add helper functions to get firmware CP ring ID.
On the new 57500 chips, getting the associated CP ring ID associated with
an RX ring or TX ring is different than before. On the legacy chips,
we find the associated ring group and look up the CP ring ID. On the
57500 chips, each RX ring and TX ring has a dedicated CP ring even if
they share the MSIX. Use these helper functions at appropriate places
to get the CP ring ID.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:49 +0000 (07:02 -0400)]
bnxt_en: Allocate completion ring structures for 57500 series chips.
On 57500 chips, the original bnxt_cp_ring_info struct now refers to the
NQ. bp->cp_nr_rings refer to the number of NQs on 57500 chips. There
are now 2 pointers for the CP rings associated with RX and TX rings.
Modify bnxt_alloc_cp_rings() and bnxt_free_cp_rings() accordingly.
With multiple CP rings per NAPI, we need to add a pointer in
bnxt_cp_ring_info struct to point back to the bnxt_napi struct.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:48 +0000 (07:02 -0400)]
bnxt_en: Modify the ring reservation functions for 57500 series chips.
The ring reservation functions have to be modified for P5 chips in the
following ways:
- bnxt_cp_ring_info structs map to internal NQs as well as CP rings.
- Ring groups are not used.
- 1 CP ring must be available for each RX or TX ring.
- number of RSS contexts to reserve is multiples of 64 RX rings.
- RFS currently not supported.
Also, RX AGG rings are only used for jumbo frames, so we need to
unconditionally call bnxt_reserve_rings() in __bnxt_open_nic()
to see if we need to reserve AGG rings in case MTU has changed.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:47 +0000 (07:02 -0400)]
bnxt_en: Adjust MSIX and ring groups for 57500 series chips.
Store the maximum MSIX capability in PCIe config. space earlier. When
we call firmware to query capability, we need to compare the PCIe
MSIX max count with the firmware count and use the smaller one as
the MSIX count for 57500 (P5) chips.
The new chips don't use ring groups. But previous chips do and
the existing logic limits the available rings based on resource
calculations including ring groups. Setting the max ring groups to
the max rx rings will work on the new chips without changing the
existing logic.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:46 +0000 (07:02 -0400)]
bnxt_en: Re-structure doorbells.
The 57500 series chips have a new 64-bit doorbell format. Use a new
bnxt_db_info structure to unify the new and the old 32-bit doorbells.
Add a new bnxt_set_db() function to set up the doorbell addreses and
doorbell keys ahead of time. Modify and introduce new doorbell
helpers to help abstract and unify the old and new doorbells.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:45 +0000 (07:02 -0400)]
bnxt_en: Add 57500 new chip ID and basic structures.
57500 series is a new chip class (P5) that requires some driver changes
in the next several patches. This adds basic chip ID, doorbells, and
the notification queue (NQ) structures. Each MSIX is associated with an
NQ instead of a CP ring in legacy chips. Each NQ has up to 2 associated
CP rings for RX and TX. The same bnxt_cp_ring_info struct will be used
for the NQ.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:44 +0000 (07:02 -0400)]
bnxt_en: Configure context memory on new devices.
Call firmware to configure the DMA addresses of all context memory
pages on new devices requiring context memory.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:43 +0000 (07:02 -0400)]
bnxt_en: Check context memory requirements from firmware.
New device requires host context memory as a backing store. Call
firmware to check for context memory requirements and store the
parameters. Allocate host pages accordingly.
We also need to move the call bnxt_hwrm_queue_qportcfg() earlier
so that all the supported hardware queues and the IDs are known
before checking and allocating context memory.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:42 +0000 (07:02 -0400)]
bnxt_en: Add new flags to setup new page table PTE bits on newer devices.
Newer chips require the PTU_PTE_VALID bit to be set for every page
table entry for context memory and rings. Additional bits are also
required for page table entries for all rings. Add a flags field to
bnxt_ring_mem_info struct to specify these additional bits to be used
when setting up the pages tables as needed.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:41 +0000 (07:02 -0400)]
bnxt_en: Refactor bnxt_ring_struct.
Move the DMA page table and vmem fields in bnxt_ring_struct to a new
bnxt_ring_mem_info struct. This will allow context memory management
for a new device to re-use some of the existing infrastructure.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:40 +0000 (07:02 -0400)]
bnxt_en: Update interrupt coalescing logic.
New firmware spec. allows interrupt coalescing parameters, such as
maximums, timer units, supported features to be queried. Update
the driver to make use of the new call to query these parameters
and provide the legacy defaults if the call is not available.
Replace the hard-coded values with these parameters.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:39 +0000 (07:02 -0400)]
bnxt_en: Add maximum extended request length fw message support.
Support the max_ext_req_len field from the HWRM_VER_GET_RESPONSE.
If this field is valid and greater than the mailbox size, use the
short command format to send firmware messages greater than the
mailbox size. Newer devices use this method to send larger messages
to the firmware.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:38 +0000 (07:02 -0400)]
bnxt_en: Add additional extended port statistics.
Latest firmware spec. has some additional rx extended port stats and new
tx extended port stats added. We now need to check the size of the
returned rx and tx extended stats and determine how many counters are
valid. New counters added include CoS byte and packet counts for rx
and tx.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 14 Oct 2018 11:02:37 +0000 (07:02 -0400)]
bnxt_en: Update firmware interface spec. to 1.10.0.3.
Among the new changes are trusted VF support, 200Gbps support, and new
API to dump ring information on the new chips.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 05:37:28 +0000 (22:37 -0700)]
Merge branch 'selftests-pmtu-Add-test-choice-and-captures'
Stefano Brivio says:
====================
selftests: pmtu: Add test choice and captures
This series adds a couple of features useful for debugging: 1/2
allows selecting single tests and 2/2 adds optional traffic
captures.
Semantics for current invocation of test script are preserved.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefano Brivio [Fri, 12 Oct 2018 21:54:14 +0000 (23:54 +0200)]
selftests: pmtu: Add optional traffic captures for single tests
If --trace is passed as an option and tcpdump is available,
capture traffic for all relevant interfaces to per-test pcap
files named <test>_<interface>.pcap.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefano Brivio [Fri, 12 Oct 2018 21:54:13 +0000 (23:54 +0200)]
selftests: pmtu: Allow selection of single tests
As number of tests is growing, it's quite convenient to allow
single tests to be run.
Display usage when the script is run with any invalid argument,
keep existing semantics when no arguments are passed so that
automated runs won't break.
Instead of just looping on the list of requested tests, if any,
check first that they exist, and go through them in a nested
loop to keep the existing way to display test descriptions.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Fri, 12 Oct 2018 21:30:52 +0000 (23:30 +0200)]
r8169: remove unneeded call to netif_stop_queue in rtl8169_net_suspend
netif_device_detach() stops all tx queues already, so we don't need
this call.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Fri, 12 Oct 2018 21:23:57 +0000 (23:23 +0200)]
r8169: simplify rtl8169_set_magic_reg
Simplify this function, no functional change intended.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Fri, 12 Oct 2018 19:52:22 +0000 (21:52 +0200)]
octeontx2-af: remove unused cgx_fwi_link_change
The newly added driver causes a warning about a function that is
not used anywhere:
drivers/net/ethernet/marvell/octeontx2/af/cgx.c:320:12: error: 'cgx_fwi_link_change' defined but not used [-Werror=unused-function]
Remove it for now, until a user gets added. If we want to use this
function from another module, we also need a declaration in a header
file, which is currently missing, so it would have to change anyway.
Fixes:
1463f382f58d ("octeontx2-af: Add support for CGX link management")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ryan C Goodfellow [Fri, 12 Oct 2018 18:09:01 +0000 (11:09 -0700)]
nfp: devlink port split support for 1x100G CXP NIC
This commit makes it possible to use devlink to split the 100G CXP
Netronome into two 40G interfaces. Currently when you ask for 2
interfaces, the math in src/nfp_devlink.c:nfp_devlink_port_split
calculates that you want 5 lanes per port because for some reason
eth_port.port_lanes=10 (shouldn't this be 12 for CXP?). What we really
want when asking for 2 breakout interfaces is 4 lanes per port. This
commit makes that happen by calculating based on 8 lanes if 10 are
present.
Signed-off-by: Ryan C Goodfellow <rgoodfel@isi.edu>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Greg Weeks <greg.weeks@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 05:23:20 +0000 (22:23 -0700)]
Merge branch 'dpaa2-eth-code-cleanup'
Ioana Ciornei says:
====================
dpaa2-eth: code cleanup
There are no functional changes in this patch set, only some cleanup
changes such as: unused parameters, uninitialized variables and
unnecessary Kconfig dependencies.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Radulescu [Fri, 12 Oct 2018 16:27:40 +0000 (16:27 +0000)]
dpaa2-eth: remove unused FD field
According to the hardware ArchDef, the PTV1 field in FD[CTRL]
is ignored by WRIOP, so setting it for Tx FDs is pointless.
Remove all references to it from the code.
Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Ciornei [Fri, 12 Oct 2018 16:27:38 +0000 (16:27 +0000)]
dpaa2-eth: mark unused parameter in dpaa2_eth_tx_conf
The ch parameter is never used in the dpaa2_eth_tx_conf function but
since its prototype must match the type defined in the consume field of
struct dpaa2_eth_fq, just mark it as __always_unused.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Ciornei [Fri, 12 Oct 2018 16:27:35 +0000 (16:27 +0000)]
dpaa2-eth: remove unused priv parameter
The priv parameter is never used in the build_linear_skb and
drain_channel function. Remove it from the function definitions.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Ciornei [Fri, 12 Oct 2018 16:27:33 +0000 (16:27 +0000)]
dpaa2-eth: fix uninitialized variable warnings
All 3 cases of possible uninitialized variables are false
positives since they are used only as output parameters.
Nonetheless, fix the warnings.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Ciornei [Fri, 12 Oct 2018 16:27:29 +0000 (16:27 +0000)]
dpaa2-eth: make dpaa2_eth_set_dist_key static
The dpaa2_eth_set_dist_key function is only used in a single file.
Make it static.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ioana Radulescu [Fri, 12 Oct 2018 16:27:25 +0000 (16:27 +0000)]
dpaa2-eth: Fix Kconfig dependencies
Both ARCH_LAYERSCAPE and COMPILE_TEST dependencies are already implied
through the FSL_MC_BUS dep, so there's no need to state it explicitly.
Also, the fsl-mc bus depends on COMPILE_TEST only for some
architectures (arm, arm64, ppc, x86), so it's not correct to
claim build support unconditionally.
Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Fri, 12 Oct 2018 16:06:29 +0000 (19:06 +0300)]
net: ethernet: ti: cpsw: use for mcast entries only host port
In dual-emac mode the cpsw driver sends directed packets, that means
that packets go to the directed port, but an ALE lookup is performed
to determine untagged egress only. It means that on tx side no need
to add port bit for ALE mcast entry mask, and basically ALE entry
for port identification is needed only on rx side.
So, add only host port in dual_emac mode as used directed
transmission, and no need in one more port. For single port boards
and switch mode all ports used, as usual, so no changes for them.
Also it simplifies farther changes.
In other words, mcast entries for dual-emac should behave exactly
like unicast. It also can help avoid leaking packets between ports
with same vlan on h/w level if ports could became members of same vid.
So now, for instance, if mcast address 33:33:00:00:00:01 is added then
entries in ALE table:
vid = 1, addr = 33:33:00:00:00:01, port_mask = 0x1
vid = 2, addr = 33:33:00:00:00:01, port_mask = 0x1
Instead of:
vid = 1, addr = 33:33:00:00:00:01, port_mask = 0x3
vid = 2, addr = 33:33:00:00:00:01, port_mask = 0x5
With the same considerations, set only host port for unregistered
mcast for dual-emac mode in case of IFF_ALLMULTI is set, exactly like
it's done in cpsw_ale_set_allmulti().
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 05:21:28 +0000 (22:21 -0700)]
Merge branch 'net-ethernet-ti-cpsw-fix-mcast-packet-lost'
Ivan Khoronzhuk says:
====================
net: ethernet: ti: cpsw fix mcast packet lost
The patchset omits redundant refresh of mcast address table and
prevents mcast packet lost.
Based on net-next/master
tested on am572x evm
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Fri, 12 Oct 2018 15:28:15 +0000 (18:28 +0300)]
net: ethernet: ti: cpsw: fix lost of mcast packets while rx_mode update
Whenever kernel or user decides to call rx mode update, it clears
every multicast entry from forwarding table and in some time adds
it again. This time can be enough to drop incoming multicast packets.
That's why clear only staled multicast entries and update or add new
one afterwards.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Fri, 12 Oct 2018 15:28:14 +0000 (18:28 +0300)]
net: ethernet: ti: cpsw_ale: use const for API having pointer on mac address
It allows to use function under callbacks with same const qualifier of
mac address for farther changes.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 05:06:38 +0000 (22:06 -0700)]
Merge branch 'net-phy-improve-and-simplify-state-machine'
Heiner Kallweit says:
====================
net: phy: improve and simplify state machine
Improve / simplify handling of states PHY_RUNNING and PHY_RESUMING in
phylib state machine.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Thu, 11 Oct 2018 20:37:38 +0000 (22:37 +0200)]
net: phy: simplify handling of PHY_RESUMING in state machine
Simplify code for handling state PHY_RESUMING, no functional change
intended.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Thu, 11 Oct 2018 20:36:56 +0000 (22:36 +0200)]
net: phy: improve handling of PHY_RUNNING in state machine
Handling of state PHY_RUNNING seems to be more complex than it needs
to be. If not polling, then we don't have to do anything, we'll
receive an interrupt and go to state PHY_CHANGELINK once the link
goes down. If polling and link is down, we don't have to go the
extra mile over PHY_CHANGELINK and call phy_read_status() again
but can set status PHY_NOLINK directly.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Roopa Prabhu [Thu, 11 Oct 2018 19:35:13 +0000 (12:35 -0700)]
vxlan: support NTF_USE refresh of fdb entries
This makes use of NTF_USE in vxlan driver consistent
with bridge driver.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Justin.Lee1@Dell.com [Thu, 11 Oct 2018 18:07:37 +0000 (18:07 +0000)]
net/ncsi: Extend NC-SI Netlink interface to allow user space to send NC-SI command
The new command (NCSI_CMD_SEND_CMD) is added to allow user space application
to send NC-SI command to the network card.
Also, add a new attribute (NCSI_ATTR_DATA) for transferring request and response.
The work flow is as below.
Request:
User space application
-> Netlink interface (msg)
-> new Netlink handler - ncsi_send_cmd_nl()
-> ncsi_xmit_cmd()
Response:
Response received - ncsi_rcv_rsp()
-> internal response handler - ncsi_rsp_handler_xxx()
-> ncsi_rsp_handler_netlink()
-> ncsi_send_netlink_rsp ()
-> Netlink interface (msg)
-> user space application
Command timeout - ncsi_request_timeout()
-> ncsi_send_netlink_timeout ()
-> Netlink interface (msg with zero data length)
-> user space application
Error:
Error detected
-> ncsi_send_netlink_err ()
-> Netlink interface (err msg)
-> user space application
Signed-off-by: Justin Lee <justin.lee1@dell.com>
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Thu, 11 Oct 2018 17:31:47 +0000 (19:31 +0200)]
net: phy: trigger state machine immediately in phy_start_machine
When starting the state machine there may be work to be done
immediately, e.g. if the initial state is PHY_UP then the state
machine may trigger an autonegotiation. Having said that I see no need
to wait a second until the state machine is run first time.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 04:58:46 +0000 (21:58 -0700)]
Merge branch 'veth-XDP-stats-improvement'
Toshiaki Makita says:
====================
veth: XDP stats improvement
ndo_xdp_xmit in veth did not update packet counters as described in [1].
Also, current implementation only updates counters on tx side so rx side
events like XDP_DROP were not collected.
This series implements the missing accounting as well as support for
ethtool per-queue stats in veth.
Patch 1: Update drop counter in ndo_xdp_xmit.
Patch 2: Update packet and byte counters for all XDP path, and drop
counter on XDP_DROP.
Patch 3: Support per-queue ethtool stats for XDP counters.
Note that counters are maintained on per-queue basis for XDP but not
otherwise (per-cpu and atomic as before). This is because 1) tx path in
veth is essentially lockless so we cannot update per-queue stats on tx,
and 2) rx path is net core routine (process_backlog) which cannot update
per-queue based stats when XDP is disabled. On the other hand there are
real rxqs and napi handlers for veth XDP, so update per-queue stats on
rx for XDP packets, and use them to calculate tx counters as well,
contrary to the existing non-XDP counters.
[1] https://patchwork.ozlabs.org/cover/953071/#
1967449
====================
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiaki Makita [Thu, 11 Oct 2018 09:36:50 +0000 (18:36 +0900)]
veth: Add ethtool statistics support for XDP
Expose per-queue stats for ethtool -S.
As there are only rx queues, and rx queues are used only when XDP is
used, per-queue counters are only rx XDP ones.
Example:
$ ethtool -S veth0
NIC statistics:
peer_ifindex: 11
rx_queue_0_xdp_packets:
28601434
rx_queue_0_xdp_bytes:
1716086040
rx_queue_0_xdp_drops:
28601434
rx_queue_1_xdp_packets:
17873050
rx_queue_1_xdp_bytes:
1072383000
rx_queue_1_xdp_drops:
17873050
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiaki Makita [Thu, 11 Oct 2018 09:36:49 +0000 (18:36 +0900)]
veth: Account for XDP packet statistics on rx side
On XDP path veth has napi handler so we can collect statistics on
per-queue basis for XDP.
By this change now we can collect XDP_DROP drop count as well as packets
and bytes coming through ndo_xdp_xmit. Packet counters shown by
"ip -s link", sysfs stats or /proc/net/dev is now correct for XDP.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Toshiaki Makita [Thu, 11 Oct 2018 09:36:48 +0000 (18:36 +0900)]
veth: Account for packet drops in ndo_xdp_xmit
Use existing atomic drop counter. Since drop path is really an
exceptional case here, I'm thinking atomic ops would not hurt the
performance.
XDP packets and bytes are not counted in ndo_xdp_xmit, but will be
accounted on rx side by the following commit.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hoang Le [Thu, 11 Oct 2018 01:43:08 +0000 (08:43 +0700)]
tipc: support binding to specific ip address when activating UDP bearer
INADDR_ANY is hard-coded when activating UDP bearer. So, we could not
bind to a specific IP address even with replicast mode using - given
remote ip address instead of using multicast ip address.
In this commit, we fixed it by checking and switch to use appropriate
local ip address.
before:
$netstat -plu
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address
udp 0 0 **0.0.0.0:6118** 0.0.0.0:*
after:
$netstat -plu
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address
udp 0 0 **10.0.0.2:6118** 0.0.0.0:*
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Oct 2018 04:49:56 +0000 (21:49 -0700)]
Merge tag 'mlx5e-updates-2018-10-10' of git://git./linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5e-updates-2018-10-10
IPoIB netlink support and mlx5e pre-allocated netdevice initialization
IP link was broken due to the changes in IPoIB for the rdma_netdev
support after commit
cd565b4b51e5
("IB/IPoIB: Support acceleration options callbacks").
This patchset fixes IPoIB pkey creation and removal using rtnetlink by
adding support in both IPoIB ULP layer and mlx5 layer:
From Jason and Denis:
1) Introduces changes in the RDMA netdev code in order to
allow allocation of the netdev to be done by the rtnl netdev code.
2) Reworks IPoIB initialization to use the two step rdma_netdev
creation.
From Feras and Saeed, mlx5e netdev layer refactoring to allow accepting
pre-allocated netdevs:
3) Adds support to initialize/cleanup netdevs that are not created
by mlx5 driver.
4) Change mlx5e netdevice layer to accept the pre-allocated netdevice
queue number.
5) Initialize mlx5e generic structures in one place to be used for all
netdevs types NIC/representors/IPoIB (both mlx5 allocated and
pre-allocted).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>