feat(cli,tunnel): v3.4 client consumes manifest endpoints + fix #45 silent client exit

Two follow-ups to the previous v3.4 commit (ba8d6b7):

## #49 — client uses BridgeEndpoint ports as authoritative

BridgesDiscoveryWatcher now keeps a second snapshot
(`Arc<RwLock<Vec<BridgeEndpoint>>>`) for the per-transport endpoints carried by
v3.4 manifests, alongside the existing flat-bridges snapshot for v3.3
compatibility. `endpoints_snapshot()` and `primary_endpoint()` expose it to the
client.
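A minimal sketch of how such a second snapshot could be exposed, for illustration only. The field layout of `BridgeEndpoint`, the `store_endpoints` helper, the choice of `std::sync::RwLock`, and the rule "primary means first entry" are all assumptions, not taken from the actual watcher.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical shape of a v3.4 manifest endpoint: one optional port per transport.
#[derive(Clone, Debug, PartialEq)]
struct BridgeEndpoint {
    tcp: Option<u16>,
    quic: Option<u16>,
    udp: Option<u16>,
}

/// Illustrative stand-in for the watcher; only the endpoints snapshot is modeled,
/// the flat-bridges snapshot kept for v3.3 compatibility is left out.
struct BridgesDiscoveryWatcher {
    endpoints: Arc<RwLock<Vec<BridgeEndpoint>>>,
}

impl BridgesDiscoveryWatcher {
    fn new() -> Self {
        Self { endpoints: Arc::new(RwLock::new(Vec::new())) }
    }

    /// Hypothetical refresh hook: swap in the endpoint list from a freshly loaded manifest.
    fn store_endpoints(&self, fresh: Vec<BridgeEndpoint>) {
        *self.endpoints.write().unwrap() = fresh;
    }

    /// Returns a clone of the current list so callers never hold the lock.
    fn endpoints_snapshot(&self) -> Vec<BridgeEndpoint> {
        self.endpoints.read().unwrap().clone()
    }

    /// Assumption: the "primary" endpoint is the first entry, if any was published.
    fn primary_endpoint(&self) -> Option<BridgeEndpoint> {
        self.endpoints.read().unwrap().first().cloned()
    }
}
```

Returning owned clones keeps the read lock short-lived, which matters if a refresh task writes to the same snapshot concurrently.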

In `client::run`, immediately after the watcher loads, the primary endpoint's
per-transport ports override the *ports* of the dial-time
`dial_cfg.endpoints.{tcp,quic,udp}`. The IP stays whatever the dialer already
resolved (server_addr / bridge list). This closes the loop on the setup
reported by the user's friend: the server picks 8444 because sing-box already
holds 443/8443 and signs a manifest with `endpoints = [{tcp: 8444, ...}]`; the
client loads it on the next refresh and starts dialing the right port without
an operator-side `client.toml` edit.

When the manifest has no `endpoints` field (old v3.3 format, or the operator
chose not to publish per-transport ports), no override is applied and the
`client.toml` `[transport] *_port` values are used as before.
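The override and its fallback can be sketched as follows. This is an illustration under assumptions: `EndpointPorts`, `DialEndpoints`, and `apply_port_override` are hypothetical names standing in for the manifest endpoint and `dial_cfg.endpoints`, and the real types may differ.

```rust
use std::net::SocketAddr;

/// Hypothetical per-transport ports taken from a manifest's `endpoints` entry.
#[derive(Clone, Copy, Debug, Default)]
struct EndpointPorts {
    tcp: Option<u16>,
    quic: Option<u16>,
    udp: Option<u16>,
}

/// Hypothetical dial-time addresses (stand-in for `dial_cfg.endpoints`).
#[derive(Clone, Copy, Debug, PartialEq)]
struct DialEndpoints {
    tcp: SocketAddr,
    quic: SocketAddr,
    udp: SocketAddr,
}

/// Overrides only the ports; the already-resolved IPs are left untouched.
/// With no primary endpoint (v3.3 manifest, or nothing published) nothing changes,
/// so the ports configured in `client.toml` stay in effect.
fn apply_port_override(dial: &mut DialEndpoints, primary: Option<EndpointPorts>) {
    if let Some(ports) = primary {
        if let Some(p) = ports.tcp {
            dial.tcp.set_port(p);
        }
        if let Some(p) = ports.quic {
            dial.quic.set_port(p);
        }
        if let Some(p) = ports.udp {
            dial.udp.set_port(p);
        }
    }
}
```

Treating each transport's port as optional means a manifest that publishes only a TCP port leaves the QUIC and UDP ports at their configured values.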

## #45 — silent client exit on broken connection

Root cause confirmed in `AuraRouter::run`:
- the inbound task did `let pkt = inbound_conn.recv_packet().await?;`, so any
  recv error returned silently via `?`
- the `to_tun_tx` channel sender dropped, `to_tun_rx.recv()` returned `None`
- the outbound `select!` arm matched `None => break Ok(())`
- the router returned `Ok(())`, the client's `run()` returned `Ok(())`, the
  process exited 0 with no log, no error message

We saw this empirically when the user disabled a co-resident VPN that had been
routing AuraVPN's UDP/444 traffic: the underlying QUIC socket broke, the
inbound task hit a recv error, and the whole client vanished.
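The root-cause chain can be reproduced in miniature. The sketch below is an analog, not the router code: it uses `std::thread` and `std::sync::mpsc` instead of tokio tasks and channels (so a closed channel shows up as `Err(_)` rather than `None`), and `recv_packet` here is a stub that always fails.

```rust
use std::sync::mpsc;
use std::thread;

/// Stub for the transport: always fails, like a socket that just broke.
fn recv_packet() -> Result<Vec<u8>, String> {
    Err("socket closed".to_string())
}

/// Pre-fix shape: reports success even though the transport died.
fn run_router_before_fix() -> Result<(), String> {
    let (to_tun_tx, to_tun_rx) = mpsc::channel::<Vec<u8>>();

    // Inbound task: `?` ends the task without logging; the error is only
    // stored in the join handle, and `to_tun_tx` is dropped on return.
    let inbound = thread::spawn(move || -> Result<(), String> {
        loop {
            let pkt = recv_packet()?;
            if to_tun_tx.send(pkt).is_err() {
                break;
            }
        }
        Ok(())
    });

    // Outbound loop: a closed channel is treated as a clean shutdown.
    let result = loop {
        match to_tun_rx.recv() {
            Ok(_pkt) => { /* would be written to the TUN device */ }
            Err(_) => break Ok(()),
        }
    };

    // Nobody reads the inbound result, so the real cause is lost.
    drop(inbound);
    result
}
```

Calling `run_router_before_fix()` yields `Ok(())`, which is the value that propagated up to the process exit code.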

Fix:
- The inbound task now logs the underlying `recv_packet` error at `error`
  level before exiting.
- The outbound `select!`'s `None` arm now returns an `Err` (not `Ok(())`), so
  the caller knows the tunnel died and `aura client` exits non-zero, which is
  what a supervisor (systemd, launchd, or a future auto-redial loop) wants to
  see.
- The router waits up to 200 ms for the inbound task to finish before
  returning, so its error or panic is logged instead of being swallowed by
  `abort()`.
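The three changes can be sketched with the standard library alone. This is an illustration under assumptions: threads and `std::sync::mpsc` stand in for tokio tasks and channels, `eprintln!` stands in for `tracing`, a hypothetical `done` channel with `recv_timeout` stands in for awaiting the task handle under `tokio::time::timeout`, and `recv_packet` is a stub that always fails.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Stub for the transport: always fails, like a socket that just broke.
fn recv_packet() -> Result<Vec<u8>, String> {
    Err("socket closed".to_string())
}

/// Post-fix shape: the cause is logged, the caller gets an error,
/// and the inbound result is collected with a bounded wait.
fn run_router_after_fix() -> Result<(), String> {
    let (to_tun_tx, to_tun_rx) = mpsc::channel::<Vec<u8>>();
    let (done_tx, done_rx) = mpsc::channel::<Result<(), String>>();

    thread::spawn(move || {
        let outcome = loop {
            match recv_packet() {
                Ok(pkt) => {
                    if to_tun_tx.send(pkt).is_err() {
                        break Ok(()); // consumer stopped; nothing more to do
                    }
                }
                Err(e) => {
                    // Log the real cause before the task ends.
                    eprintln!("peer connection broke (recv_packet failed): {}", e);
                    break Err(format!("recv_packet from peer failed: {}", e));
                }
            }
        };
        drop(to_tun_tx); // unblocks the outbound loop
        let _ = done_tx.send(outcome);
    });

    let result = loop {
        match to_tun_rx.recv() {
            Ok(_pkt) => { /* would be written to the TUN device */ }
            // A closed channel now surfaces as an error to the caller.
            Err(_) => break Err("peer connection closed; router shutting down".to_string()),
        }
    };

    // Bounded wait so a stuck inbound task cannot wedge shutdown.
    match done_rx.recv_timeout(Duration::from_millis(200)) {
        Ok(Ok(())) => {}
        Ok(Err(e)) => eprintln!("inbound task exited with error: {}", e),
        Err(_) => eprintln!("inbound task did not exit within 200ms; abandoning"),
    }
    result
}
```

A caller that maps this `Err` to a non-zero exit status gives a supervisor the signal it needs to restart the client.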

Existing tests still pass (12/12 in aura-tunnel router tests). Tested
manually: with the fix, killing the underlying transport now produces a
"peer connection broke (recv_packet failed): …" error line and a non-zero
exit, instead of silent process disappearance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
xah30
2026-05-29 17:22:10 +03:00
parent ba8d6b796f
commit 7c2080321b
3 changed files with 90 additions and 4 deletions
+34 -4
@@ -144,7 +144,22 @@ impl<P: PacketIo + 'static> AuraRouter<P> {
         let inbound_conn = Arc::clone(&self.conn);
         let inbound = tokio::spawn(async move {
             loop {
-                let pkt = inbound_conn.recv_packet().await?;
+                let pkt = match inbound_conn.recv_packet().await {
+                    Ok(p) => p,
+                    Err(e) => {
+                        // v3.4 fix for #45 (silent client exit): the inbound task used to swallow
+                        // this error and ride out via `?`, so when the underlying transport broke
+                        // (e.g. a co-resident VPN's UDP socket got remapped) the outbound select!
+                        // saw a clean `None` and returned `Ok(())`. No log, no exit message, no
+                        // reconnect hint. Now we log loudly with the real cause before propagating.
+                        let err_str = e.to_string();
+                        tracing::error!(
+                            error = %err_str,
+                            "peer connection broke (recv_packet failed); client is exiting"
+                        );
+                        return Err(anyhow::anyhow!("recv_packet from peer failed: {err_str}"));
+                    }
+                };
                 if to_tun_tx.send(pkt).await.is_err() {
                     // TUN owner loop has stopped; nothing more to do.
                     break;
@@ -177,14 +192,29 @@ impl<P: PacketIo + 'static> AuraRouter<P> {
                                 c.inc_rx();
                             }
                         }
-                        // Inbound task ended (connection closed/errored).
-                        None => break Ok(()),
+                        // Inbound task ended. Either gracefully (we drove `to_tun_tx` drop via the
+                        // outbound side exiting first — unreachable here since we'd still be inside
+                        // the select), or because the peer connection broke. v3.4: surface as an
+                        // error so `aura client` exits non-zero and a supervisor (systemd, launchd,
+                        // a future auto-redial loop) knows the tunnel died. The inbound task itself
+                        // already logged the underlying cause at error level.
+                        None => break Err(anyhow::anyhow!(
+                            "peer connection closed; router shutting down (see preceding error log for cause)"
+                        )),
                     }
                 }
             }
         };
-        inbound.abort();
+        // Wait for the inbound task to land so we can surface its error rather than just abort()
+        // it (which would silently drop the underlying cause). Bounded by a short timeout so a
+        // stuck inbound future cannot wedge shutdown.
+        match tokio::time::timeout(std::time::Duration::from_millis(200), inbound).await {
+            Ok(Ok(Ok(()))) => {}
+            Ok(Ok(Err(e))) => tracing::warn!(error = %e, "inbound task exited with error"),
+            Ok(Err(join_err)) => tracing::warn!(error = %join_err, "inbound task panicked"),
+            Err(_) => tracing::warn!("inbound task did not exit within 200ms; abandoning"),
+        }
         result
     }