Vladimir Oltean [Sat, 8 Jun 2019 12:04:43 +0000 (15:04 +0300)]
net: dsa: sja1105: Expose PTP timestamping ioctls to userspace
This enables the PTP support towards userspace applications such as
linuxptp.
The switches can timestamp only trapped multicast MAC frames, and
therefore only the profiles of 1588 over L2 are supported.
TX timestamping can be enabled per port, but RX timestamping is enabled
globally. As long as RX timestamping is enabled, the switch will emit
metadata follow-up frames that will be processed by the tagger. It may
be a problem that linuxptp does not restore the RX timestamping settings
when exiting.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:42 +0000 (15:04 +0300)]
net: dsa: sja1105: Add a state machine for RX timestamping
Meta frame reception relies on the hardware keeping its promise that it
will send no other traffic towards the CPU port between a link-local
frame and a meta frame. Otherwise there is no other way to associate
the meta frame with the link-local frame it's holding a timestamp of.
The receive function is made stateful, and buffers a timestampable frame
until its meta frame arrives, then merges the two, drops the meta and
releases the link-local frame up the stack.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:41 +0000 (15:04 +0300)]
net: dsa: sja1105: Increase priority of CPU-trapped frames
Without noticing any particular issue, this patch ensures that
management traffic is treated with the maximum priority on RX by the
switch. This is generally desirable, as the driver keeps a state
machine that waits for metadata follow-up frames as soon as a management
frame is received. Increasing the priority helps expedite the reception
(and further reconstruction) of the RX timestamp to the driver after the
MAC has generated it.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:40 +0000 (15:04 +0300)]
net: dsa: sja1105: Add a global sja1105_tagger_data structure
This will be used to keep state for RX timestamping. It is global
because the switch serializes timestampable and meta frames when
trapping them towards the CPU port (lower port indices have higher
priority) and therefore having one state machine per port would create
unnecessary complications.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:39 +0000 (15:04 +0300)]
net: dsa: sja1105: Receive and decode meta frames
This adds support in the tagger for understanding the source port and
switch id of meta frames. Their timestamp is also extracted but not
used yet - this needs to be done in a state machine that modifies the
previously received timestampable frame - will be added in a follow-up
patch.
Also take the opportunity to:
- Remove a comment in sja1105_filter made obsolete by
e8d67fa5696e
("net: dsa: sja1105: Don't store frame type in skb->cb")
- Reorder the checks in sja1105_filter to optimize for the most likely
scenario first: regular traffic.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:38 +0000 (15:04 +0300)]
net: dsa: sja1105: Make sja1105_is_link_local not match meta frames
Although meta frames are configured to be sent at SJA1105_META_DMAC
(01-80-C2-00-00-0E) which is a multicast MAC address that would also be
trapped by the switch to the CPU, were it to receive it on a front-panel
port, meta frames are conceptually not link-local frames, they only
carry their RX timestamps.
The choice of sending meta frames at a multicast DMAC is a pragmatic
one, to avoid installing an extra entry to the DSA master port's
multicast MAC filter.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:37 +0000 (15:04 +0300)]
net: dsa: sja1105: Add support for the AVB Parameters Table
This table is used to program the switch to emit "meta" follow-up
Ethernet frames (which contain partial RX timestamps) after each
link-local frame that was trapped to the CPU port through MAC filtering.
This includes PTP frames.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:36 +0000 (15:04 +0300)]
net: dsa: sja1105: Build a minimal understanding of meta frames
Meta frames are sent on the CPU port by the switch if RX timestamping is
enabled. They contain a partial timestamp of the previous frame.
They are Ethernet frames with the Ethernet header constructed out of:
- SJA1105_META_DMAC
- SJA1105_META_SMAC
- ETH_P_SJA1105_META
The Ethernet payload will be decoded in a follow-up patch.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:35 +0000 (15:04 +0300)]
net: dsa: sja1105: Add logic for TX timestamping
On TX, timestamping is performed synchronously from the
port_deferred_xmit worker thread.
In management routes, the switch is requested to take egress timestamps
(again partial), which are reconstructed and appended to a clone of the
skb that was just sent. The cloning is done by DSA and we retrieve the
pointer from the structure that DSA keeps in skb->cb.
Then these clones are enqueued to the socket's error queue for
application-level processing.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:34 +0000 (15:04 +0300)]
net: dsa: sja1105: Add support for the PTP clock
The design of this PHC driver is influenced by the switch's behavior
w.r.t. timestamping. It exposes two PTP counters, one free-running
(PTPTSCLK) and the other offset- and frequency-corrected in hardware
through PTPCLKVAL, PTPCLKADD and PTPCLKRATE. The MACs can sample either
of these for frame timestamps.
However, the user manual warns that taking timestamps based on the
corrected clock is less than useful, as the switch can deliver corrupted
timestamps in a variety of circumstances.
Therefore, this PHC uses the free-running PTPTSCLK together with a
timecounter/cyclecounter structure that translates it into a software
time domain. Thus, the settime/adjtime and adjfine callbacks are
hardware no-ops.
The timestamps (introduced in a further patch) will also be translated
to the correct time domain before being handed over to the userspace PTP
stack.
The introduction of a second set of PHC operations that operate on the
hardware PTPCLKVAL/PTPCLKADD/PTPCLKRATE in the future is somewhat
unavoidable, as the TTEthernet core uses the corrected PTP time domain.
However, the free-running counter + timecounter structure combination
will suffice for now, as the resulting timestamps yield a sub-50 ns
synchronization offset in steady state using linuxptp.
For this patch, in absence of frame timestamping, the operations of the
switch PHC were tested by syncing it to the system time as a local slave
clock with:
phc2sys -s CLOCK_REALTIME -c swp2 -O 0 -m -S 0.01
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:33 +0000 (15:04 +0300)]
net: dsa: sja1105: Export symbols for upcoming PTP driver
These are needed for the situation where the switch driver and the PTP
driver are both built as modules.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:32 +0000 (15:04 +0300)]
net: dsa: sja1105: Limit use of incl_srcpt to bridge+vlan mode
The incl_srcpt setting makes the switch mangle the destination MACs of
multicast frames trapped to the CPU - a primitive tagging mechanism that
works even when we cannot use the 802.1Q software features.
The downside is that the two multicast MAC addresses that the switch
traps for L2 PTP (01-80-C2-00-00-0E and 01-1B-19-00-00-00) quickly turn
into a lot more, as the switch encodes the source port and switch id
into bytes 3 and 4 of the MAC. The resulting range of MAC addresses
would need to be installed manually into the DSA master port's multicast
MAC filter, and even then, most devices might not have a large enough
MAC filtering table.
As a result, only limit use of incl_srcpt to when it's strictly
necessary: when under a VLAN filtering bridge. This fixes PTP in
non-bridged mode (standalone ports). Otherwise, PTP frames, as well as
metadata follow-up frames holding RX timestamps won't be received
because they will be blocked by the master port's MAC filter.
Linuxptp doesn't help, because it only requests the addition of the
unmodified PTP MACs to the multicast filter.
This issue is not seen in bridged mode because the master port is put in
promiscuous mode when the slave ports are enslaved to a bridge.
Therefore, there is no downside to having the incl_srcpt mechanism
active there.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:31 +0000 (15:04 +0300)]
net: dsa: sja1105: Reverse TPID and TPID2
>From reading the P/Q/R/S user manual, it appears that TPID is used by
the switch for detecting S-tags and TPID2 for C-tags. Their meaning is
not clear from the E/T manual.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:30 +0000 (15:04 +0300)]
net: dsa: sja1105: Move sja1105_change_tpid into sja1105_vlan_filtering
This is a cosmetic patch, pre-cursor to making another change to the
General Parameters Table (incl_srcpt) which does not logically pertain
to the sja1105_change_tpid function name, but not putting it there would
otherwise create a need of resetting the switch twice.
So simply move the existing code into the .port_vlan_filtering callback,
where the incl_srcpt change will be added as well.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:29 +0000 (15:04 +0300)]
net: dsa: tag_8021q: Create helper function for removing VLAN header
This removes the existing implementation from tag_sja1105, which was
partially incorrect (it was not changing the MAC header offset, thereby
leaving it to point 4 bytes earlier than it should have).
This overwrites the VLAN tag by moving the Ethernet source and
destination MACs 4 bytes to the right. Then skb->data (assumed to be
pointing immediately after the EtherType) is temporarily pushed to the
beginning of the new Ethernet header, the new Ethernet header offset and
length are recorded, then skb->data is moved back to where it was.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:28 +0000 (15:04 +0300)]
net: dsa: Add teardown callback for drivers
This is helpful for e.g. draining per-driver (not per-port) tagger
queues.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Sat, 8 Jun 2019 12:04:27 +0000 (15:04 +0300)]
net: dsa: Keep a pointer to the skb clone for TX timestamping
For drivers that use deferred_xmit for PTP frames (such as sja1105),
there is no need to perform matching between PTP frames and their egress
timestamps, since the sending process can be serialized.
In that case, it makes sense to have the pointer to the skb clone that
DSA made directly in the skb->cb. It will be used for pushing the egress
timestamp back in the application socket's error queue.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Jun 2019 18:00:14 +0000 (11:00 -0700)]
Merge git://git./linux/kernel/git/davem/net
Some ISDN files that got removed in net-next had some changes
done in mainline, take the removals.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 7 Jun 2019 16:29:14 +0000 (09:29 -0700)]
Merge git://git./linux/kernel/git/davem/net
Pull networking fixes from David Miller:
1) Free AF_PACKET po->rollover properly, from Willem de Bruijn.
2) Read SFP eeprom in max 16 byte increments to avoid problems with
some SFP modules, from Russell King.
3) Fix UDP socket lookup wrt. VRF, from Tim Beale.
4) Handle route invalidation properly in s390 qeth driver, from Julian
Wiedmann.
5) Memory leak on unload in RDS, from Zhu Yanjun.
6) sctp_process_init leak, from Neil HOrman.
7) Fix fib_rules rule insertion semantic change that broke Android,
from Hangbin Liu.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
pktgen: do not sleep with the thread lock held.
net: mvpp2: Use strscpy to handle stat strings
net: rds: fix memory leak in rds_ib_flush_mr_pool
ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
net: aquantia: fix wol configuration not applied sometimes
ethtool: fix potential userspace buffer overflow
Fix memory leak in sctp_process_init
net: rds: fix memory leak when unload rds_rdma
ipv6: fix the check before getting the cookie in rt6_get_cookie
ipv4: not do cache for local delivery if bc_forwarding is enabled
s390/qeth: handle error when updating TX queue count
s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
s390/qeth: check dst entry before use
s390/qeth: handle limited IPv4 broadcast in L3 TX path
net: fix indirect calls helpers for ptype list hooks.
net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set
udp: only choose unbound UDP socket for multicast when not in a VRF
net/tls: replace the sleeping lock around RX resync with a bit lock
...
Linus Torvalds [Fri, 7 Jun 2019 16:25:27 +0000 (09:25 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Things are looking pretty quiet here in RDMA, not too many bug fixes
rolling in right now. The usual driver bug fixes and fixes for a
couple of regressions introduced in 5.2:
- Fix a race on bootup with RDMA device renaming and srp. SRP also
needs to rename its internal sys files
- Fix a memory leak in hns
- Don't leak resources in efa on certain error unwinds
- Don't panic in certain error unwinds in ib_register_device
- Various small user visible bug fix patches for the hfi and efa
drivers
- Fix the 32 bit compilation break"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/efa: Remove MAYEXEC flag check from mmap flow
mlx5: avoid 64-bit division
IB/hfi1: Validate page aligned for a given virtual address
IB/{qib, hfi1, rdmavt}: Correct ibv_devinfo max_mr value
IB/hfi1: Insure freeze_work work_struct is canceled on shutdown
IB/rdmavt: Fix alloc_qpn() WARN_ON()
RDMA/core: Fix panic when port_data isn't initialized
RDMA/uverbs: Pass udata on uverbs error unwind
RDMA/core: Clear out the udata before error unwind
RDMA/hns: Fix PD memory leak for internal allocation
RDMA/srp: Rename SRP sysfs name after IB device rename trigger
Linus Torvalds [Fri, 7 Jun 2019 16:21:48 +0000 (09:21 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Another round of mostly-benign fixes, the exception being a boot crash
on SVE2-capable CPUs (although I don't know where you'd find such a
thing, so maybe it's benign too).
We're in the process of resolving some big-endian ptrace breakage, so
I'll probably have some more for you next week.
Summary:
- Fix boot crash on platforms with SVE2 due to missing register
encoding
- Fix architected timer accessors when CONFIG_OPTIMIZE_INLINING=y
- Move cpu_logical_map into smp.h for use by upcoming irqchip drivers
- Trivial typo fix in comment
- Disable some useless, noisy warnings from GCC 9"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Silence gcc warnings about arch ABI drift
ARM64: trivial: s/TIF_SECOMP/TIF_SECCOMP/ comment typo fix
arm64: arch_timer: mark functions as __always_inline
arm64: smp: Moved cpu_logical_map[] to smp.h
arm64: cpufeature: Fix missing ZFR0 in __read_sysreg_by_encoding()
David S. Miller [Thu, 6 Jun 2019 23:24:30 +0000 (16:24 -0700)]
Merge branch 'Xilinx-axienet-driver-updates'
Robert Hancock says:
====================
Xilinx axienet driver updates (v5)
This is a series of enhancements and bug fixes in order to get the mainline
version of this driver into a more generally usable state, including on
x86 or ARM platforms. It also converts the driver to use the phylink API
in order to provide support for SFP modules.
Changes since v4:
-Use reverse christmas tree variable order
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:24 +0000 (16:28 -0600)]
net: axienet: convert to phylink API
Convert this driver to use the phylink API rather than the legacy PHY
API. This allows for better support for SFP modules connected using a
1000BaseX or SGMII interface.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:23 +0000 (16:28 -0600)]
net: axienet: make use of axistream-connected attribute optional
Currently the axienet driver requires the use of a second devicetree
node, referenced by an axistream-connected attribute on the Ethernet
device node, which contains the resources for the AXI DMA block used by the
device. This setup is problematic for a use case we have where the Ethernet
and DMA cores are behind a PCIe to AXI bridge and the memory resources for
the nodes are injected into the platform devices using the multifunction
device subsystem - it's not easily possible for the driver to obtain the
platform-level resources from the linked device.
In order to simplify that usage model, and simplify the overall use of
this driver in general, allow for all of the resources to be kept on one
node where the resources are retrieved using platform device APIs rather
than device-tree-specific ones. The previous usage setup is still
supported if the axistream-connected attribute is specified.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:22 +0000 (16:28 -0600)]
net: axienet: document axistream-connected attribute
The axienet driver requires the use of an axistream-connected attribute,
but this isn't documented in the devicetree bindings. Document how this
attribute is supposed to be used, including the upcoming change to make
the usage of this attribute optional.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:21 +0000 (16:28 -0600)]
net: axienet: Fix MDIO bus parent node detection
This driver was previously using the parent node of the specified PHY
node as the device node to register the MDIO bus on. Andrew Lunn
pointed out this is wrong as the PHY node is potentially not even
underneath the MDIO bus for the current device instance. Find the MDIO
node explicitly by looking it up by name under the controller's device
node instead.
This could potentially break existing device trees if they don't use
"mdio" as the name for the MDIO bus, but I did not find any with various
searches and Xilinx's examples all use mdio as the name so it seems like
this should be relatively safe.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:20 +0000 (16:28 -0600)]
net: axienet: document device tree mdio child node
The mdio child node for the MDIO bus is generally required when using
this driver but was not documented other than being shown in the
example. Document it as an optional (but usually required) parameter.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:19 +0000 (16:28 -0600)]
net: axienet: stop interface during shutdown
On some platforms, such as iMX6 with PCIe devices, crashes or hangs can
occur if the axienet device continues to perform DMA transfers after
parent devices/busses have been shut down. Shut down the axienet
interface during its shutdown callback in order to avoid this.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:18 +0000 (16:28 -0600)]
net: axienet: Make missing MAC address non-fatal
Failing initialization on a missing MAC address property is excessive.
We can just fall back to using a random MAC instead, which at least
leaves the interface in a functioning state.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:17 +0000 (16:28 -0600)]
net: axienet: Fix race condition causing TX hang
It is possible that the interrupt handler fires and frees up space in
the TX ring in between checking for sufficient TX ring space and
stopping the TX queue in axienet_start_xmit. If this happens, the
queue wake from the interrupt handler will occur before the queue is
stopped, causing a lost wakeup and the adapter's transmit hanging.
To avoid this, after stopping the queue, check again whether there is
sufficient space in the TX ring. If so, wake up the queue again.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:16 +0000 (16:28 -0600)]
net: axienet: Add optional support for Ethernet core interrupt
Previously this driver only handled interrupts from the DMA RX and TX
blocks, not from the Ethernet core itself. Add optional support for
the Ethernet core interrupt, which is used to detect rx_missed and
framing errors signalled by the hardware. In order to use this
interrupt, a third interrupt needs to be specified in the device tree.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:15 +0000 (16:28 -0600)]
net: axienet: Support shared interrupts
Specify IRQF_SHARED to support shared interrupts. If the interrupt
handler is called and the device is not indicating an interrupt,
just return IRQ_NONE rather than spewing error messages.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:14 +0000 (16:28 -0600)]
net: axienet: Add DMA registers to ethtool register dump
These registers are important for troubleshooting the state of the DMA
cores.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:13 +0000 (16:28 -0600)]
net: axienet: Make RX/TX ring sizes configurable
Add support for setting the RX and TX ring sizes for this driver using
ethtool. Also increase the default RX ring size as the previous default
was far too low for good performance in some configurations.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:12 +0000 (16:28 -0600)]
net: axienet: Cleanup DMA device reset and halt process
The Xilinx DMA blocks each have their own reset register, but they both
reset the entire DMA engine, so only one of them needs to be reset.
Also, when stopping the device, we need to not just command the DMA
blocks to stop, but wait for them to stop, and trigger a device reset
to ensure that they are completely stopped.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:11 +0000 (16:28 -0600)]
net: axienet: Re-initialize MDIO registers properly after reset
The MDIO clock divisor register setting was only applied on the initial
startup when the driver was loaded. However, this setting is cleared
when the device is reset, such as would occur when the interface was
taken down and brought up again, and so the MDIO bus would be
non-functional afterwards.
Split up the MDIO bus setup and enable into separate functions and
re-enable the bus after a device reset, to ensure that the MDIO
registers are set properly. This also allows us to remove direct access
to MDIO registers in xilinx_axienet_main.c and centralize them all in
xilinx_axienet_mdio.c.
Also, lock the MDIO bus lock around the device reset process, to avoid
MDIO accesses from occurring while the MDIO is disabled during the reset.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:10 +0000 (16:28 -0600)]
net: axienet: fix teardown order of MDIO bus
Since the MDIO is is brought up before the netdev is registered, it
should be torn down after the netdev is removed. Otherwise, PHY accesses
can potentially access freed MDIO bus references and cause a crash.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:09 +0000 (16:28 -0600)]
net: axienet: Use clock framework to get device clock rate
This driver was previously always calculating the MDIO clock divisor
(from AXI bus clock to MDIO bus clock) based on the CPU clock frequency,
assuming that it is the same as the AXI bus frequency, but that
simplistic method only works on the MicroBlaze platform.
Add support for specifying the clock used for the device in the device
tree using the clock framework. If the clock is specified then it will
be used when calculating the clock divisor. The previous CPU clock
detection method is left for backward compatibility if no clock is
specified.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:08 +0000 (16:28 -0600)]
net: axienet: add X86 and ARM as supported platforms
This driver should now build on (at least) X86 and ARM platforms, so add
them as supported platforms for the driver in Kconfig.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:07 +0000 (16:28 -0600)]
net: axienet: fix MDIO bus naming
The MDIO bus for this driver was being named using the result of
of_address_to_resource on a node which may not have any resource on it,
but the return value of that call was not checked so it was using some
random value in the bus name. Change to name the MDIO bus based on the
resource start of the actual Ethernet register block.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:06 +0000 (16:28 -0600)]
net: axienet: Use standard IO accessors
This driver was using in_be32 and out_be32 IO accessors which do not
exist on most platforms. Also, the use of big-endian accessors does not
seem correct as this hardware is accessed over an AXI bus which, to the
extent it has an endian-ness, is little-endian. Switch to standard
ioread32/iowrite32 accessors.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 6 Jun 2019 22:28:05 +0000 (16:28 -0600)]
net: axienet: Fix casting of pointers to u32
This driver was casting skb pointers to u32 and storing them as such in
the DMA buffer descriptor, which is obviously broken on 64-bit. The area
of the buffer descriptor being used is not accessed by the hardware and
has sufficient room for a 32 or 64-bit pointer, so just store the skb
pointer as such.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dinh Nguyen [Wed, 5 Jun 2019 15:05:51 +0000 (10:05 -0500)]
net: stmmac: socfpga: fix phy and ptp_ref setup for Arria10/Stratix10
On the Arria10, Agilex, and Stratix10 SoC, there are a few differences from
the Cyclone5 and Arria5:
- The emac PHY setup bits are in separate registers.
- The PTP reference clock select mask is different.
- The register to enable the emac signal from FPGA is different.
Thus, this patch creates a separate function for setting the phy modes on
Arria10/Agilex/Stratix10. The separation is based a new DTS binding:
"altr,socfpga-stmmac-a10-s10".
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dinh Nguyen [Wed, 5 Jun 2019 15:05:50 +0000 (10:05 -0500)]
dt-bindings: socfpga-dwmac: add "altr, socfpga-stmmac-a10-s10" binding
Add the "altr,socfpga-stmmac-a10-s10" binding for Arria10/Agilex/Stratix10
implementation of the stmmac ethernet controller.
On the Arria10, Agilex, and Stratix10 SoCs, there are a few differences from
the Cyclone5 and Arria5:
- The emac PHY setup bits are in separate registers.
- The PTP reference clock select mask is different.
- The register to enable the emac signal from FPGA is different.
Because of these differences, the dwmac-socfpga glue logic driver will
use this new binding to set the appropriate bits for PHY, PTP reference
clock, and signal from FPGA.
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 6 Jun 2019 21:13:40 +0000 (14:13 -0700)]
Merge branch 'nfp-tls-add-basic-TX-offload'
Jakub Kicinski says:
====================
nfp: tls: add basic TX offload
This series adds initial TLS offload support to the nfp driver.
Only TX side is added for now. We need minor adjustments to
the core tls code:
- expose the per-skb fallback helper;
- grow the driver context slightly;
- add a helper to get to the driver state more easily.
We only support TX offload for now, and only if all packets
keep coming in order. For retransmissions we use the
aforementioned software fallback, and in case there are
local drops we completely give up on given TCP stream.
This will obviously be improved soon, this patch set is the
minimal, functional yet easily reviewable chunk.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 5 Jun 2019 21:11:43 +0000 (14:11 -0700)]
nfp: tls: add basic statistics
Count TX TLS packets: successes, out of order, and dropped due to
missing record info. Make sure the RX and TX completion statistics
don't share cache lines with TX ones as much as possible. With TLS
stats they are no longer reasonably aligned.
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dirk van der Merwe [Wed, 5 Jun 2019 21:11:42 +0000 (14:11 -0700)]
nfp: tls: add/delete TLS TX connections
This patch adds the functionality to add and delete TLS connections on
the NFP, received from the kernel TLS callbacks.
Make use of the common control message (CCM) infrastructure to propagate
the kernel state to firmware.
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dirk van der Merwe [Wed, 5 Jun 2019 21:11:41 +0000 (14:11 -0700)]
nfp: tls: add datapath support for TLS TX
Prepend connection handle to each transmitted TLS packet.
For each connection, the driver tracks the next sequence number
expected. If an out of order packet is observed, the driver calls into
the TLS kernel code to reencrypt that particular skb.
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dirk van der Merwe [Wed, 5 Jun 2019 21:11:40 +0000 (14:11 -0700)]
net/tls: export TLS per skb encryption
While offloading TLS connections, drivers need to handle the case where
out of order packets need to be transmitted.
Other drivers obtain the entire TLS record for the specific skb to
provide as context to hardware for encryption. However, other designs
may also want to keep the hardware state intact and perform the
out of order encryption entirely on the host.
To achieve this, export the already existing software encryption
fallback path so drivers could access this.
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 5 Jun 2019 21:11:39 +0000 (14:11 -0700)]
net/tls: simplify driver context retrieval
Currently drivers have to ensure the alignment of their tls state
structure, which leads to unnecessary layers of getters and
encapsulated structures in each driver.
Simplify all this by marking the driver state as aligned (driver_state
members are currently aligned, so no hole is added, besides ALIGN in
TLS_OFFLOAD_CONTEXT_SIZE_RX/TX would reserve this extra space, anyway.)
With that we can add a common accessor to the core.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 5 Jun 2019 21:11:38 +0000 (14:11 -0700)]
net/tls: split the TLS_DRIVER_STATE_SIZE and bump TX to 16 bytes
8 bytes of driver state has been enough so far, but for drivers
which have to store 8 byte handle it's no longer practical to
store the state directly in the context.
Drivers generally don't need much extra state on RX side, while
TX side has to be tracking TCP sequence numbers. Split the
lengths of max driver state size on RX and TX.
The struct tls_offload_context_tx currently stands at 616 bytes and
struct tls_offload_context_rx stands at 368 bytes. Upcoming work
will consume extra 8 bytes in both for kernel-driven resync.
This means that we can bump TX side to 16 bytes and still fit
into the same number of cache lines but on RX side we would be 8
bytes over.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 5 Jun 2019 21:11:37 +0000 (14:11 -0700)]
nfp: prepare for more TX metadata prepend
Subsequent patches will add support for more TX metadata fields.
Prepare for this by handling an additional double word - firmware
handle as metadata type 7.
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 5 Jun 2019 21:11:36 +0000 (14:11 -0700)]
nfp: add tls init code
Add FW ABI defines and code for basic init of TLS offload.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 5 Jun 2019 21:11:35 +0000 (14:11 -0700)]
nfp: parse crypto opcode TLV
Parse TLV containing a bitmask of supported crypto operations.
The TLV contains a capability bitmask (supported operations)
and enabled bitmask. Each operation describes the crypto
protocol quite exhaustively (protocol, AEAD, direction).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 5 Jun 2019 21:11:34 +0000 (14:11 -0700)]
nfp: add support for sending control messages via mailbox
FW may prefer to handle some communication via a mailbox
or the vNIC may simply not have a control queue (VFs).
Add a way of exchanging ccm-compatible messages via a
mailbox.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 5 Jun 2019 21:11:33 +0000 (14:11 -0700)]
nfp: parse the mailbox cmsg TLV
Parse the mailbox TLV. When control message queue is not available
we can fall back to passing the control messages via the vNIC
mailbox.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 5 Jun 2019 21:11:32 +0000 (14:11 -0700)]
nfp: make bar_lock a semaphore
We will need to release the bar lock from a workqueue
so move from a mutex to a semaphore. This lock should
not be too hot. Unfortunately semaphores don't have
lockdep support.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 5 Jun 2019 21:11:31 +0000 (14:11 -0700)]
nfp: count all failed TX attempts as errors
Currently if we need to modify the head of the skb and allocation
fails we would free the skb and not increment the error counter.
Make sure all errors are counted.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Tue, 4 Jun 2019 22:15:01 +0000 (16:15 -0600)]
net: phy: Add detection of 1000BaseX link mode support
Add 1000BaseX to the link modes which are detected based on the
MII_ESTATUS register as per 802.3 Clause 22. This allows PHYs which
support 1000BaseX to work properly with drivers using phylink.
Previously 1000BaseX support was not detected, and if that was the only
mode the PHY indicated support for, phylink would refuse to attach it
due to the list of supported modes being empty.
Signed-off-by: Robert Hancock <hancock@sedsystems.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 6 Jun 2019 20:13:09 +0000 (13:13 -0700)]
Merge branch 'parisc-5.2-3' of git://git./linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
- Fix crashes when accessing PCI devices on some machines like C240 and
J5000. The crashes were triggered because we replaced cache flushes
by nops in the alternative coding where we shouldn't for some
machines.
- Dave fixed a race in the usage of the sr1 space register when used to
load the coherence index.
- Use the hardware lpa instruction to to load the physical address of
kernel virtual addresses in the iommu driver code.
- The kernel may fail to link when CONFIG_MLONGCALLS isn't set. Solve
that by rearranging functions in the final vmlinux executeable.
- Some defconfig cleanups and removal of compiler warnings.
* 'parisc-5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix crash due alternative coding for NP iopdir_fdc bit
parisc: Use lpa instruction to load physical addresses in driver code
parisc: configs: Remove useless UEVENT_HELPER_PATH
parisc: Use implicit space register selection for loading the coherence index of I/O pdirs
parisc: Fix compiler warnings in float emulation code
parisc/slab: cleanup after /proc/slab_allocators removal
parisc: Allow building 64-bit kernel without -mlong-calls compiler option
parisc: Kconfig: remove ARCH_DISCARD_MEMBLOCK
Linus Torvalds [Thu, 6 Jun 2019 20:10:49 +0000 (13:10 -0700)]
Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"This fixes a regression that breaks the jitterentropy RNG and a
potential memory leak in hmac"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: hmac - fix memory leak in hmac_init_tfm()
crypto: jitterentropy - change back to module_init()
Linus Torvalds [Thu, 6 Jun 2019 19:36:54 +0000 (12:36 -0700)]
Merge tag 'xfs-5.2-fixes-2' of git://git./fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Here are a couple more bug fixes for 5.2. Changes since last update:
- Fix some forgotten strings in a log debugging function
- Fix incorrect unit conversion in online fsck code"
* tag 'xfs-5.2-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: inode btree scrubber should calculate im_boffset correctly
xfs: fix broken log reservation debugging
Linus Torvalds [Thu, 6 Jun 2019 19:33:52 +0000 (12:33 -0700)]
Merge tag 'gfs2-v5.2.fixes' of git://git./linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Andreas Gruenbacher:
"A revert for a patch that turned out to be broken"
* tag 'gfs2-v5.2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
Revert "gfs2: Replace gl_revokes with a GLF flag"
Linus Torvalds [Thu, 6 Jun 2019 19:31:15 +0000 (12:31 -0700)]
Merge tag 'ovl-fixes-5.2-rc4' of git://git./linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Here's one fix for a class of bugs triggered by syzcaller, and one
that makes xfstests fail less"
* tag 'ovl-fixes-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: doc: add non-standard corner cases
ovl: detect overlapping layers
ovl: support the FS_IOC_FS[SG]ETXATTR ioctls
Linus Torvalds [Thu, 6 Jun 2019 19:25:56 +0000 (12:25 -0700)]
Merge tag 'fuse-fixes-5.2-rc4' of git://git./linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"This fixes a leaked inode lock in an error cleanup path and a data
consistency issue with copy_file_range().
It also adds a new flag for the WRITE request that allows userspace
filesystems to clear suid/sgid bits on the file if necessary"
* tag 'fuse-fixes-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: extract helper for range writeback
fuse: fix copy_file_range() in the writeback case
fuse: add FUSE_WRITE_KILL_PRIV
fuse: fallocate: fix return with locked inode
Linus Torvalds [Thu, 6 Jun 2019 19:19:37 +0000 (12:19 -0700)]
Merge tag 'nfs-for-5.2-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"These are mostly stable bugfixes found during testing, many during the
recent NFS bake-a-thon.
Stable bugfixes:
- SUNRPC: Fix regression in umount of a secure mount
- SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential
- NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter
- NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled
Other bugfixes:
- xprtrdma: Use struct_size() in kzalloc()"
* tag 'nfs-for-5.2-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4.1: Fix bug only first CB_NOTIFY_LOCK is handled
NFSv4.1: Again fix a race where CB_NOTIFY_LOCK fails to wake a waiter
SUNRPC: Fix a use after free when a server rejects the RPCSEC_GSS credential
SUNRPC fix regression in umount of a secure mount
xprtrdma: Use struct_size() in kzalloc()
YueHaibing [Thu, 6 Jun 2019 14:46:49 +0000 (22:46 +0800)]
net: mscc: ocelot: remove unused variable 'vcap_data_t'
Fix sparse warning:
drivers/net/ethernet/mscc/ocelot_ace.c:96:3:
warning: symbol 'vcap_data_t' was not declared. Should it be static?
'vcap_data_t' never used so can be removed
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Thu, 6 Jun 2019 13:45:03 +0000 (15:45 +0200)]
pktgen: do not sleep with the thread lock held.
Currently, the process issuing a "start" command on the pktgen procfs
interface, acquires the pktgen thread lock and never release it, until
all pktgen threads are completed. The above can blocks indefinitely any
other pktgen command and any (even unrelated) netdevice removal - as
the pktgen netdev notifier acquires the same lock.
The issue is demonstrated by the following script, reported by Matteo:
ip -b - <<'EOF'
link add type dummy
link add type veth
link set dummy0 up
EOF
modprobe pktgen
echo reset >/proc/net/pktgen/pgctrl
{
echo rem_device_all
echo add_device dummy0
} >/proc/net/pktgen/kpktgend_0
echo count 0 >/proc/net/pktgen/dummy0
echo start >/proc/net/pktgen/pgctrl &
sleep 1
rmmod veth
Fix the above releasing the thread lock around the sleep call.
Additionally we must prevent racing with forcefull rmmod - as the
thread lock no more protects from them. Instead, acquire a self-reference
before waiting for any thread. As a side effect, running
rmmod pktgen
while some thread is running now fails with "module in use" error,
before this patch such command hanged indefinitely.
Note: the issue predates the commit reported in the fixes tag, but
this fix can't be applied before the mentioned commit.
v1 -> v2:
- no need to check for thread existence after flipping the lock,
pktgen threads are freed only at net exit time
-
Fixes:
6146e6a43b35 ("[PKTGEN]: Removes thread_{un,}lock() macros.")
Reported-and-tested-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fabio Estevam [Thu, 6 Jun 2019 12:40:33 +0000 (09:40 -0300)]
net: fec: Do not use netdev messages too early
When a valid MAC address is not found the current messages
are shown:
fec
2188000.ethernet (unnamed net_device) (uninitialized): Invalid MAC address: 00:00:00:00:00:00
fec
2188000.ethernet (unnamed net_device) (uninitialized): Using random MAC address: aa:9f:25:eb:7e:aa
Since the network device has not been registered at this point, it is better
to use dev_err()/dev_info() instead, which will provide cleaner log
messages like these:
fec
2188000.ethernet: Invalid MAC address: 00:00:00:00:00:00
fec
2188000.ethernet: Using random MAC address: aa:9f:25:eb:7e:aa
Tested on a imx6dl-pico-pi board.
Signed-off-by: Fabio Estevam <festevam@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Litao jiao [Thu, 6 Jun 2019 09:57:58 +0000 (17:57 +0800)]
vxlan: Use FDB_HASH_SIZE hash_locks to reduce contention
The monolithic hash_lock could cause huge contention when
inserting/deletiing vxlan_fdbs into the fdb_head.
Use FDB_HASH_SIZE hash_locks to protect insertions/deletions
of vxlan_fdbs into the fdb_head hash table.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Litao jiao <jiaolitao@raisecom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 6 Jun 2019 18:02:54 +0000 (11:02 -0700)]
Merge tag 'for-rc-adfs' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ADFS cleanups/fixes from Russell King:
"As a result of some of Al Viro's great work, here are a few cleanups
with fixes for adfs:
- factor out filename comparison, so we can be sure that
adfs_compare() (used for namei compare) and adfs_match() (used for
lookup) have the same behaviour.
- factor out filename lowering (which is not the same as tolower()
which will lower top-bit-set characters) to ensure that we have the
same behaviour when comparing filenames as when we hash them.
- factor out the object fixups, so we are applying all fixups to
directory objects in the same way, independent of the disk format.
- factor out the object name fixup (into the previously factored out
function) to ensure that filenames are appropriately translated -
for example, adfs allows '/' in filenames, which being the Unix
path separator, need to be translated to a different character,
which is normally '.' (DOS 8.3 filenames represent the . as a / on
adfs, so this is the expected reverse translation.)
- remove filename truncation; Al asked about this and apparently the
decision is to remove it. In any case, adfs's truncation was buggy,
so this rids us of that bug by removing the truncation feature.
- we now have only one location which adds the "filetype" suffix to
the filename, so there's no point that code being out of line.
- since we translate '/' into '.', an adfs filename of "/" or "//"
would end up being translated to "." and ".." which have special
meanings. In this case, change the first character to "^" to avoid
these special directory names being abused"
* tag 'for-rc-adfs' of git://git.armlinux.org.uk/~rmk/linux-arm:
fs/adfs: fix filename fixup handling for "/" and "//" names
fs/adfs: move append_filetype_suffix() into adfs_object_fixup()
fs/adfs: remove truncated filename hashing
fs/adfs: factor out filename fixup
fs/adfs: factor out object fixups
fs/adfs: factor out filename case lowering
fs/adfs: factor out filename comparison
Maxime Chevallier [Thu, 6 Jun 2019 08:42:56 +0000 (10:42 +0200)]
net: mvpp2: Use strscpy to handle stat strings
Use a safe strscpy call to copy the ethtool stat strings into the
relevant buffers, instead of a memcpy that will be accessing
out-of-bound data.
Fixes:
118d6298f6f0 ("net: mvpp2: add ethtool GOP statistics")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Thu, 6 Jun 2019 08:40:39 +0000 (09:40 +0100)]
ipv6: fix spelling mistake: "wtih" -> "with"
There is a spelling mistake in a NL_SET_ERR_MSG message. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhu Yanjun [Thu, 6 Jun 2019 08:00:03 +0000 (04:00 -0400)]
net: rds: fix memory leak in rds_ib_flush_mr_pool
When the following tests last for several hours, the problem will occur.
Server:
rds-stress -r 1.1.1.16 -D 1M
Client:
rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30
The following will occur.
"
Starting up....
tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu
%
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00
"
>From vmcore, we can find that clean_list is NULL.
>From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker.
Then rds_ib_mr_pool_flush_worker calls
"
rds_ib_flush_mr_pool(pool, 0, NULL);
"
Then in function
"
int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
int free_all, struct rds_ib_mr **ibmr_ret)
"
ibmr_ret is NULL.
In the source code,
"
...
list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
if (ibmr_ret)
*ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);
/* more than one entry in llist nodes */
if (clean_nodes->next)
llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list);
...
"
When ibmr_ret is NULL, llist_entry is not executed. clean_nodes->next
instead of clean_nodes is added in clean_list.
So clean_nodes is discarded. It can not be used again.
The workqueue is executed periodically. So more and more clean_nodes are
discarded. Finally the clean_list is NULL.
Then this problem will occur.
Fixes:
1bc144b62524 ("net, rds, Replace xlist in net/rds/xlist.h with llist")
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 6 Jun 2019 17:29:21 +0000 (10:29 -0700)]
Merge branch 'ipv6-fix-EFAULT-on-sendto-with-icmpv6-and-hdrincl'
Olivier Matz says:
====================
ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
The following code returns EFAULT (Bad address):
s = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
setsockopt(s, SOL_IPV6, IPV6_HDRINCL, 1);
sendto(ipv6_icmp6_packet, addr); /* returns -1, errno = EFAULT */
The problem is fixed in the second patch. The first one aligns the
code to ipv4, to avoid a race condition in the second patch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Olivier Matz [Thu, 6 Jun 2019 07:15:19 +0000 (09:15 +0200)]
ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
The following code returns EFAULT (Bad address):
s = socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6);
setsockopt(s, SOL_IPV6, IPV6_HDRINCL, 1);
sendto(ipv6_icmp6_packet, addr); /* returns -1, errno = EFAULT */
The IPv4 equivalent code works. A workaround is to use IPPROTO_RAW
instead of IPPROTO_ICMPV6.
The failure happens because 2 bytes are eaten from the msghdr by
rawv6_probe_proto_opt() starting from commit
19e3c66b52ca ("ipv6
equivalent of "ipv4: Avoid reading user iov twice after
raw_probe_proto_opt""), but at that time it was not a problem because
IPV6_HDRINCL was not yet introduced.
Only eat these 2 bytes if hdrincl == 0.
Fixes:
715f504b1189 ("ipv6: add IPV6_HDRINCL option for raw sockets")
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Olivier Matz [Thu, 6 Jun 2019 07:15:18 +0000 (09:15 +0200)]
ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
As it was done in commit
8f659a03a0ba ("net: ipv4: fix for a race
condition in raw_sendmsg") and commit
20b50d79974e ("net: ipv4: emulate
READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()") for ipv4, copy the
value of inet->hdrincl in a local variable, to avoid introducing a race
condition in the next commit.
Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Thu, 6 Jun 2019 05:49:17 +0000 (07:49 +0200)]
r8169: silence sparse warning in rtl8169_start_xmit
The opts[] array is of type u32. Therefore remove the wrong
cpu_to_le32(). The opts[] array members are converted to little endian
later when being assigned to the respective descriptor fields.
This is not a new issue, it just popped up due to r8169.c having
been renamed and more thoroughly checked. Due to the renaming
this patch applies to net-next only.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bob Peterson [Thu, 6 Jun 2019 12:33:38 +0000 (07:33 -0500)]
Revert "gfs2: Replace gl_revokes with a GLF flag"
Commit
73118ca8baf7 introduced a glock reference counting bug in
gfs2_trans_remove_revoke. Given that, replacing gl_revokes with a GLF flag is
no longer useful, so revert that commit.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Dave Martin [Thu, 6 Jun 2019 10:33:43 +0000 (11:33 +0100)]
arm64: Silence gcc warnings about arch ABI drift
Since GCC 9, the compiler warns about evolution of the
platform-specific ABI, in particular relating for the marshaling of
certain structures involving bitfields.
The kernel is a standalone binary, and of course nobody would be
so stupid as to expose structs containing bitfields as function
arguments in ABI. (Passing a pointer to such a struct, however
inadvisable, should be unaffected by this change. perf and various
drivers rely on that.)
So these warnings do more harm than good: turn them off.
We may miss warnings about future ABI drift, but that's too bad.
Future ABI breaks of this class will have to be debugged and fixed
the traditional way unless the compiler evolves finer-grained
diagnostics.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Helge Deller [Mon, 27 May 2019 19:20:00 +0000 (21:20 +0200)]
parisc: Fix crash due alternative coding for NP iopdir_fdc bit
According to the found documentation, data cache flushes and sync
instructions are needed on the PCX-U+ (PA8200, e.g. C200/C240)
platforms, while PCX-W (PA8500, e.g. C360) platforms aparently don't
need those flushes when changing the IO PDIR data structures.
We have no documentation for PCX-W+ (PA8600) and PCX-W2 (PA8700) CPUs,
but Carlo Pisani reported that his C3600 machine (PA8600, PCX-W+) fails
when the fdc instructions were removed. His firmware didn't set the NIOP
bit, so one may assume it's a firmware bug since other C3750 machines
had the bit set.
Even if documentation (as mentioned above) states that PCX-W (PA8500,
e.g. J5000) does not need fdc flushes, Sven could show that an Adaptec
29320A PCI-X SCSI controller reliably failed on a dd command during the
first five minutes in his J5000 when fdc flushes were missing.
Going forward, we will now NOT replace the fdc and sync assembler
instructions by NOPS if:
a) the NP iopdir_fdc bit was set by firmware, or
b) we find a CPU up to and including a PCX-W+ (PA8600).
This fixes the HPMC crashes on a C240 and C36XX machines. For other
machines we rely on the firmware to set the bit when needed.
In case one finds HPMC issues, people could try to boot their machines
with the "no-alternatives" kernel option to turn off any alternative
patching.
Reported-by: Sven Schnelle <svens@stackframe.org>
Reported-by: Carlo Pisani <carlojpisani@gmail.com>
Tested-by: Sven Schnelle <svens@stackframe.org>
Fixes:
3847dab77421 ("parisc: Add alternative coding infrastructure")
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # 5.0+
John David Anglin [Sun, 2 Jun 2019 23:12:40 +0000 (19:12 -0400)]
parisc: Use lpa instruction to load physical addresses in driver code
Most I/O in the kernel is done using the kernel offset mapping.
However, there is one API that uses aliased kernel address ranges:
> The final category of APIs is for I/O to deliberately aliased address
> ranges inside the kernel. Such aliases are set up by use of the
> vmap/vmalloc API. Since kernel I/O goes via physical pages, the I/O
> subsystem assumes that the user mapping and kernel offset mapping are
> the only aliases. This isn't true for vmap aliases, so anything in
> the kernel trying to do I/O to vmap areas must manually manage
> coherency. It must do this by flushing the vmap range before doing
> I/O and invalidating it after the I/O returns.
For this reason, we should use the hardware lpa instruction to load the
physical address of kernel virtual addresses in the driver code.
I believe we only use the vmap/vmalloc API with old PA 1.x processors
which don't have a sba, so we don't hit this problem.
Tested on c3750, c8000 and rp3440.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
Krzysztof Kozlowski [Tue, 4 Jun 2019 07:57:39 +0000 (09:57 +0200)]
parisc: configs: Remove useless UEVENT_HELPER_PATH
Remove the CONFIG_UEVENT_HELPER_PATH because:
1. It is disabled since commit
1be01d4a5714 ("driver: base: Disable
CONFIG_UEVENT_HELPER by default") as its dependency (UEVENT_HELPER) was
made default to 'n',
2. It is not recommended (help message: "This should not be used today
[...] creates a high system load") and was kept only for ancient
userland,
3. Certain userland specifically requests it to be disabled (systemd
README: "Legacy hotplug slows down the system and confuses udev").
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Helge Deller <deller@gmx.de>
John David Anglin [Tue, 28 May 2019 00:15:14 +0000 (20:15 -0400)]
parisc: Use implicit space register selection for loading the coherence index of I/O pdirs
We only support I/O to kernel space. Using %sr1 to load the coherence
index may be racy unless interrupts are disabled. This patch changes the
code used to load the coherence index to use implicit space register
selection. This saves one instruction and eliminates the race.
Tested on rp3440, c8000 and c3750.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Cc: stable@vger.kernel.org
Signed-off-by: Helge Deller <deller@gmx.de>
George G. Davis [Wed, 5 Jun 2019 20:30:09 +0000 (16:30 -0400)]
ARM64: trivial: s/TIF_SECOMP/TIF_SECCOMP/ comment typo fix
Fix a s/TIF_SECOMP/TIF_SECCOMP/ comment typo
Cc: Jiri Kosina <trivial@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org
Signed-off-by: George G. Davis <george_davis@mentor.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
David S. Miller [Thu, 6 Jun 2019 02:05:01 +0000 (19:05 -0700)]
Merge branch 'tcp-flowlabel'
Eric Dumazet says:
====================
ipv6: tcp: more control on RST flowlabels
First patch allows to reflect incoming IPv6 flowlabel
on RST packets sent when no socket could handle the packet.
Second patch makes sure we send the same flowlabel
for RST or ACK packets on behalf of TIME_WAIT sockets.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 5 Jun 2019 14:55:10 +0000 (07:55 -0700)]
ipv6: tcp: send consistent flowlabel in TIME_WAIT state
After commit
1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK
messages"), we stored in tw_flowlabel the flowlabel, in the
case ACK packets needed to be sent on behalf of a TIME_WAIT socket.
We can use the same field so that RST packets sent from
TIME_WAIT state also use a consistent flowlabel.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 5 Jun 2019 14:55:09 +0000 (07:55 -0700)]
ipv6: tcp: enable flowlabel reflection in some RST packets
When RST packets are sent because no socket could be found,
it makes sense to use flowlabel_reflect sysctl to decide
if a reflection of the flowlabel is requested.
This extends commit
22b6722bfa59 ("ipv6: Add sysctl for per
namespace flow label reflection"), for some TCP RST packets.
In order to provide full control of this new feature,
flowlabel_reflect becomes a bitmask.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Gustavo A. R. Silva [Wed, 5 Jun 2019 14:45:16 +0000 (09:45 -0500)]
lib: objagg: Use struct_size() in kzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct objagg_stats {
...
struct objagg_obj_stats_info stats_info[];
};
size = sizeof(*objagg_stats) + sizeof(objagg_stats->stats_info[0]) * count;
instance = kzalloc(size, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, stats_info, count), GFP_KERNEL);
Notice that, in this case, variable alloc_size is not necessary, hence it
is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhiqiang Liu [Wed, 5 Jun 2019 10:49:49 +0000 (18:49 +0800)]
inet_connection_sock: remove unused parameter of reqsk_queue_unlink func
small cleanup: "struct request_sock_queue *queue" parameter of reqsk_queue_unlink
func is never used in the func, so we can remove it.
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Wed, 5 Jun 2019 04:27:14 +0000 (12:27 +0800)]
Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
This reverts commit
e9919a24d3022f72bcadc407e73a6ef17093a849.
Nathan reported the new behaviour breaks Android, as Android just add
new rules and delete old ones.
If we return 0 without adding dup rules, Android will remove the new
added rules and causing system to soft-reboot.
Fixes:
e9919a24d302 ("fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied")
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Reported-by: Yaro Slav <yaro330@gmail.com>
Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Tue, 4 Jun 2019 21:02:34 +0000 (23:02 +0200)]
net: phy: remove state PHY_FORCING
In the early days of phylib we had a functionality that changed to the
next lower speed in fixed mode if no link was established after a
certain period of time. This functionality has been removed years ago,
and state PHY_FORCING isn't needed any longer. Instead we can go from
UP to RUNNING or NOLINK directly (same as in autoneg mode).
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikita Danilov [Tue, 4 Jun 2019 13:23:49 +0000 (13:23 +0000)]
net: aquantia: fix wol configuration not applied sometimes
WoL magic packet configuration sometimes does not work due to
couple of leakages found.
Mainly there was a regression introduced during readx_poll refactoring.
Next, fw request waiting time was too small. Sometimes that
caused sleep proxy config function to return with an error
and to skip WoL configuration.
At last, WoL data were passed to FW from not clean buffer.
That could cause FW to accept garbage as a random configuration data.
Fixes:
6a7f2277313b ("net: aquantia: replace AQ_HW_WAIT_FOR with readx_poll_timeout_atomic")
Signed-off-by: Nikita Danilov <nikita.danilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Mon, 3 Jun 2019 20:57:13 +0000 (16:57 -0400)]
ethtool: fix potential userspace buffer overflow
ethtool_get_regs() allocates a buffer of size ops->get_regs_len(),
and pass it to the kernel driver via ops->get_regs() for filling.
There is no restriction about what the kernel drivers can or cannot do
with the open ethtool_regs structure. They usually set regs->version
and ignore regs->len or set it to the same size as ops->get_regs_len().
But if userspace allocates a smaller buffer for the registers dump,
we would cause a userspace buffer overflow in the final copy_to_user()
call, which uses the regs.len value potentially reset by the driver.
To fix this, make this case obvious and store regs.len before calling
ops->get_regs(), to only copy as much data as requested by userspace,
up to the value returned by ops->get_regs_len().
While at it, remove the redundant check for non-null regbuf.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Horman [Mon, 3 Jun 2019 20:32:59 +0000 (16:32 -0400)]
Fix memory leak in sctp_process_init
syzbot found the following leak in sctp_process_init
BUG: memory leak
unreferenced object 0xffff88810ef68400 (size 1024):
comm "syz-executor273", pid 7046, jiffies
4294945598 (age 28.770s)
hex dump (first 32 bytes):
1d de 28 8d de 0b 1b e3 b5 c2 f9 68 fd 1a 97 25 ..(........h...%
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<
00000000a02cebbd>] kmemleak_alloc_recursive include/linux/kmemleak.h:55
[inline]
[<
00000000a02cebbd>] slab_post_alloc_hook mm/slab.h:439 [inline]
[<
00000000a02cebbd>] slab_alloc mm/slab.c:3326 [inline]
[<
00000000a02cebbd>] __do_kmalloc mm/slab.c:3658 [inline]
[<
00000000a02cebbd>] __kmalloc_track_caller+0x15d/0x2c0 mm/slab.c:3675
[<
000000009e6245e6>] kmemdup+0x27/0x60 mm/util.c:119
[<
00000000dfdc5d2d>] kmemdup include/linux/string.h:432 [inline]
[<
00000000dfdc5d2d>] sctp_process_init+0xa7e/0xc20
net/sctp/sm_make_chunk.c:2437
[<
00000000b58b62f8>] sctp_cmd_process_init net/sctp/sm_sideeffect.c:682
[inline]
[<
00000000b58b62f8>] sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1384
[inline]
[<
00000000b58b62f8>] sctp_side_effects net/sctp/sm_sideeffect.c:1194
[inline]
[<
00000000b58b62f8>] sctp_do_sm+0xbdc/0x1d60 net/sctp/sm_sideeffect.c:1165
[<
0000000044e11f96>] sctp_assoc_bh_rcv+0x13c/0x200
net/sctp/associola.c:1074
[<
00000000ec43804d>] sctp_inq_push+0x7f/0xb0 net/sctp/inqueue.c:95
[<
00000000726aa954>] sctp_backlog_rcv+0x5e/0x2a0 net/sctp/input.c:354
[<
00000000d9e249a8>] sk_backlog_rcv include/net/sock.h:950 [inline]
[<
00000000d9e249a8>] __release_sock+0xab/0x110 net/core/sock.c:2418
[<
00000000acae44fa>] release_sock+0x37/0xd0 net/core/sock.c:2934
[<
00000000963cc9ae>] sctp_sendmsg+0x2c0/0x990 net/sctp/socket.c:2122
[<
00000000a7fc7565>] inet_sendmsg+0x64/0x120 net/ipv4/af_inet.c:802
[<
00000000b732cbd3>] sock_sendmsg_nosec net/socket.c:652 [inline]
[<
00000000b732cbd3>] sock_sendmsg+0x54/0x70 net/socket.c:671
[<
00000000274c57ab>] ___sys_sendmsg+0x393/0x3c0 net/socket.c:2292
[<
000000008252aedb>] __sys_sendmsg+0x80/0xf0 net/socket.c:2330
[<
00000000f7bf23d1>] __do_sys_sendmsg net/socket.c:2339 [inline]
[<
00000000f7bf23d1>] __se_sys_sendmsg net/socket.c:2337 [inline]
[<
00000000f7bf23d1>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2337
[<
00000000a8b4131f>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:3
The problem was that the peer.cookie value points to an skb allocated
area on the first pass through this function, at which point it is
overwritten with a heap allocated value, but in certain cases, where a
COOKIE_ECHO chunk is included in the packet, a second pass through
sctp_process_init is made, where the cookie value is re-allocated,
leaking the first allocation.
Fix is to always allocate the cookie value, and free it when we are done
using it.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: syzbot+f7e9153b037eac9b1df8@syzkaller.appspotmail.com
CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhu Yanjun [Mon, 3 Jun 2019 12:48:19 +0000 (08:48 -0400)]
net: rds: fix memory leak when unload rds_rdma
When KASAN is enabled, after several rds connections are
created, then "rmmod rds_rdma" is run. The following will
appear.
"
BUG rds_ib_incoming (Not tainted): Objects remaining
in rds_ib_incoming on __kmem_cache_shutdown()
Call Trace:
dump_stack+0x71/0xab
slab_err+0xad/0xd0
__kmem_cache_shutdown+0x17d/0x370
shutdown_cache+0x17/0x130
kmem_cache_destroy+0x1df/0x210
rds_ib_recv_exit+0x11/0x20 [rds_rdma]
rds_ib_exit+0x7a/0x90 [rds_rdma]
__x64_sys_delete_module+0x224/0x2c0
? __ia32_sys_delete_module+0x2c0/0x2c0
do_syscall_64+0x73/0x190
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"
This is rds connection memory leak. The root cause is:
When "rmmod rds_rdma" is run, rds_ib_remove_one will call
rds_ib_dev_shutdown to drop the rds connections.
rds_ib_dev_shutdown will call rds_conn_drop to drop rds
connections as below.
"
rds_conn_path_drop(&conn->c_path[0], false);
"
In the above, destroy is set to false.
void rds_conn_path_drop(struct rds_conn_path *cp, bool destroy)
{
atomic_set(&cp->cp_state, RDS_CONN_ERROR);
rcu_read_lock();
if (!destroy && rds_destroy_pending(cp->cp_conn)) {
rcu_read_unlock();
return;
}
queue_work(rds_wq, &cp->cp_down_w);
rcu_read_unlock();
}
In the above function, destroy is set to false. rds_destroy_pending
is called. This does not move rds connections to ib_nodev_conns.
So destroy is set to true to move rds connections to ib_nodev_conns.
In rds_ib_unregister_client, flush_workqueue is called to make rds_wq
finsh shutdown rds connections. The function rds_ib_destroy_nodev_conns
is called to shutdown rds connections finally.
Then rds_ib_recv_exit is called to destroy slab.
void rds_ib_recv_exit(void)
{
kmem_cache_destroy(rds_ib_incoming_slab);
kmem_cache_destroy(rds_ib_frag_slab);
}
The above slab memory leak will not occur again.
>From tests,
256 rds connections
[root@ca-dev14 ~]# time rmmod rds_rdma
real 0m16.522s
user 0m0.000s
sys 0m8.152s
512 rds connections
[root@ca-dev14 ~]# time rmmod rds_rdma
real 0m32.054s
user 0m0.000s
sys 0m15.568s
To rmmod rds_rdma with 256 rds connections, about 16 seconds are needed.
And with 512 rds connections, about 32 seconds are needed.
>From ftrace, when one rds connection is destroyed,
"
19) | rds_conn_destroy [rds]() {
19) 7.782 us | rds_conn_path_drop [rds]();
15) | rds_shutdown_worker [rds]() {
15) | rds_conn_shutdown [rds]() {
15) 1.651 us | rds_send_path_reset [rds]();
15) 7.195 us | }
15) + 11.434 us | }
19) 2.285 us | rds_cong_remove_conn [rds]();
19) * 24062.76 us | }
"
So if many rds connections will be destroyed, this function
rds_ib_destroy_nodev_conns uses most of time.
Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhu Yanjun [Mon, 3 Jun 2019 04:28:01 +0000 (00:28 -0400)]
net: rds: add per rds connection cache statistics
The variable cache_allocs is to indicate how many frags (KiB) are in one
rds connection frag cache.
The command "rds-info -Iv" will output the rds connection cache
statistics as below:
"
RDS IB Connections:
LocalAddr RemoteAddr Tos SL LocalDev RemoteDev
1.1.1.14 1.1.1.14 58 255 fe80::2:c903:a:7a31 fe80::2:c903:a:7a31
send_wr=256, recv_wr=1024, send_sge=8, rdma_mr_max=4096,
rdma_mr_size=257, cache_allocs=12
"
This means that there are about 12KiB frag in this rds connection frag
cache.
Since rds.h in rds-tools is not related with the kernel rds.h, the change
in kernel rds.h does not affect rds-tools.
rds-info in rds-tools 2.0.5 and 2.0.6 is tested with this commit. It works
well.
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 6 Jun 2019 00:03:14 +0000 (17:03 -0700)]
Merge branch 'dwmac-mediatek'
Biao Huang says:
====================
complete dwmac-mediatek driver and fix flow control issue
Changes in v2:
patch#1: there is no extra action in mediatek_dwmac_remove, remove it
v1:
This series mainly complete dwmac-mediatek driver:
1. add power on/off operations for dwmac-mediatek.
2. disable rx watchdog to reduce rx path reponding time.
3. change the default value of tx-frames from 25 to 1, so
ptp4l will test pass by default.
and also fix the issue that flow control won't be disabled any more
once being enabled.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Biao Huang [Mon, 3 Jun 2019 01:58:06 +0000 (09:58 +0800)]
net: stmmac: dwmac4: fix flow control issue
Current dwmac4_flow_ctrl will not clear
GMAC_RX_FLOW_CTRL_RFE/GMAC_RX_FLOW_CTRL_RFE bits,
so MAC hw will keep flow control on although expecting
flow control off by ethtool. Add codes to fix it.
Fixes:
477286b53f55 ("stmmac: add GMAC4 core support")
Signed-off-by: Biao Huang <biao.huang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Biao Huang [Mon, 3 Jun 2019 01:58:05 +0000 (09:58 +0800)]
net: stmmac: modify default value of tx-frames
the default value of tx-frames is 25, it's too late when
passing tstamp to stack, then the ptp4l will fail:
ptp4l -i eth0 -f gPTP.cfg -m
ptp4l: selected /dev/ptp0 as PTP clock
ptp4l: port 1: INITIALIZING to LISTENING on INITIALIZE
ptp4l: port 0: INITIALIZING to LISTENING on INITIALIZE
ptp4l: port 1: link up
ptp4l: timed out while polling for tx timestamp
ptp4l: increasing tx_timestamp_timeout may correct this issue,
but it is likely caused by a driver bug
ptp4l: port 1: send peer delay response failed
ptp4l: port 1: LISTENING to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
ptp4l tests pass when changing the tx-frames from 25 to 1 with
ethtool -C option.
It should be fine to set tx-frames default value to 1, so ptp4l will pass
by default.
Signed-off-by: Biao Huang <biao.huang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>