\begin{CCSXML}
<ccs2012>
 <concept>
  <concept_id>10010405.10010444.10010449</concept_id>
  <concept_desc>Applied computing~Health informatics</concept_desc>
  <concept_significance>300</concept_significance>
 </concept>
 <concept>
  <concept_id>10010147.10010178.10010224</concept_id>
  <concept_desc>Computing methodologies~Computer vision</concept_desc>
  <concept_significance>500</concept_significance>
 </concept>
</ccs2012>
\end{CCSXML}
\ccsdesc[300]{Applied computing~Health informatics}
\ccsdesc[500]{Computing methodologies~Computer vision}
%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\keywords{Robotic Ultrasound, Multi-view Perception, View Selection}
%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.
\maketitle
\section{Introduction}
\label{sec:intro}
In medical ultrasound, perception typically requires multiple scan views through probe movement to reduce diagnostic ambiguity. As a non-invasive and real-time imaging modality, ultrasound is essential in clinical diagnosis, yet remains highly view-dependent~\cite{munir2025survey, elmekki2025comprehensive}. A single static image often fails to provide sufficient structural information due to acoustic occlusions and a limited field-of-view~\cite{jiang2023robotic, velikova2023lotus}. Consequently, multi-view perception through probe repositioning is necessary to improve anatomical coverage and reduce diagnostic uncertainty~\cite{men2023gaze, dai2021transmed, jiang2023robotic}.
% However, effectively realizing such multi-view acquisition remains difficult and costly, typically requiring manual operation
However, not all probe views are equally informative. Exhaustively acquiring a large number of views can introduce substantial redundancy and increase scanning and processing costs.
In manual practice, selecting which views to acquire relies on the experience of the operator, requiring repeated repositioning and real-time interpretation, which makes the process time-consuming, operator-dependent, and difficult to standardize~\cite{jiang2023robotic, men2023gaze, munir2025survey}. To reduce this operator burden, existing methods attempt to automate probe navigation~\cite{bi2024machine, jiang2024intelligent}, yet most of these methods optimize probe movement based on immediate geometric or image-quality feedback without maintaining a spatial memory of the observed anatomy~\cite{bi2024machine, jin2023neu}. These approaches may therefore acquire redundant views while missing viewpoints that could resolve occlusions or reveal unseen anatomy. A key question is then how to determine, from partial observations, which probe positions to acquire next so as to maximize diagnostic coverage within a limited scanning budget.
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{images/intro1.pdf}
\caption{Uninformed exploration vs. active view exploration with SonoSelect. \textbf{Left:} Uninformed exploration samples views redundantly and fails to reach the target cyst occluded behind an overlying organ. \textbf{Right:} SonoSelect selects the next probe position based on current observations, directing the probe toward the target cyst for diagnosis.}
\label{Fig:teaser}
\end{figure}
% In this paper, we aim to automate view selection in place of manual choice (this paragraph covers Fig. 1 and mainly states our task definition)
To address this, we define an active view exploration task for ultrasound. As illustrated in \cref{Fig:teaser}, uninformed exploration samples views redundantly within a local region, leaving the target unobserved when it is occluded behind an overlying organ (\cref{Fig:teaser}, left). In contrast, active view exploration selects the next probe position based on current observations, directing the probe toward the target for diagnosis (\cref{Fig:teaser}, right).
This definition reduces redundant acquisition and increases the likelihood of obtaining the specific viewpoints needed for accurate diagnosis.
% We therefore propose SonoSelect (briefly introduce it, and note how it differs from vanilla PPO)
We propose SonoSelect, an ultrasound-specific method, to address this task. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. As the probe moves, each new 2D ultrasound view is fused into a 3D spatial memory that represents what has been observed so far. This spatial memory then serves two purposes: it provides the agent with a volumetric summary of the current anatomical coverage, and it identifies regions that remain unobserved or uncertain, guiding where to scan next. Building on this representation, we design a reward that encourages probe movements toward greater organ coverage, lower reconstruction uncertainty, and less redundant scanning. To optimize this reward, SonoSelect decomposes the task into a sector selection module trained with Q-learning for long-horizon routing and a continuous control policy trained with PPO for short-range navigation, so that each sub-problem operates at its own temporal scale.
% Our method performs well on multiple ultrasound tasks (better than PPO)
% Our method is also efficient (a few views usually suffice)
A preliminary study on multi-view organ classification shows that a small number of adaptively chosen views can match or exceed the all-view baseline, and that the optimal views vary across patient anatomies. Building on this observation, we further evaluate SonoSelect on a kidney cyst detection task, where the target is small and often occluded behind overlying organs. SonoSelect achieves higher organ and cyst coverage than conventional baselines, with trajectories that consistently converge toward the target rather than exhaustively scanning the volume.
In particular, unlike standard RL approaches that show substantial performance degradation on unseen anatomies, SonoSelect maintains its coverage advantage across different patients, suggesting that the spatial memory provides a more generalizable basis for probe guidance than reward-driven exploration alone.
% Future significance: this is an important step toward a fully autonomous robotic ultrasound system
The proposed active view exploration approach has practical implications for both current clinical workflows and robotic ultrasound systems that automate probe positioning through mechanical arms. In current practice, the learned policy can serve as a decision support tool, suggesting informative scanning regions to assist sonographers during manual examination. For robotic systems, the selected next-best-view can be converted into target coordinates for a mechanical arm controller, providing the spatial goal that downstream motion planning modules use to reposition the probe.
%
Our contributions are: (1) We formulate ultrasound active view exploration as a sequential decision-making problem and show, through a preliminary study, that a small number of adaptively chosen views can match or exceed exhaustive acquisition. (2) We design two evaluation tasks within the SonoGym simulation environment, a multi-view organ classification task using simulated volumes and a kidney cyst detection task using human CT volumes. (3) We propose SonoSelect, which maintains a 3D spatial memory of observed anatomy and decomposes exploration into sector selection and continuous probe control, achieving effective coverage and generalization to unseen anatomies on both tasks.
\section{Related Work}
\label{sec:related}
\textbf{Ultrasound Perception.} Most research in ultrasound perception focuses on analyzing individual 2D slices for tasks such as organ segmentation, anatomical classification, lesion detection, and representation learning.
Velikova et al.~\cite{velikova2023lotus} proposed LOTUS to learn transferable ultrasound representations for downstream view-level tasks. UltraFedFM~\cite{jiang2025pretraining} scaled this line to federated self-supervised pretraining across institutions, organs, and modalities, substantially improving diagnosis and segmentation. In a visibility-aware setting, \cite{weld2025identifying} identified visible tissue and acoustic shadows by modeling probe--tissue contact confidence in intraoperative ultrasound. Sonomate~\cite{guo2026visually} further extended ultrasound understanding to vision--language modeling for anatomy detection and question answering. These studies have substantially improved the interpretation of individual ultrasound views, but they generally assume that informative slices are already available. To obtain richer anatomical observations beyond a single slice, recent studies have explored robotic probe control, autonomous scanning, and target-view acquisition. Hase et al.~\cite{hase2020ultrasound} learned probe control for target-view recovery. Men et al.~\cite{men2023gaze} coupled gaze cues with probe guidance for obstetric ultrasound scanning. Jiang et al.~\cite{jiang2025towards} developed autonomous carotid ultrasonography by combining robotic scanning with learned perception, and Jiang et al.~\cite{jiang2024intelligent} further studied intelligent robotic sonography through reward learning from few demonstrations. Duan et al.~\cite{duan2024safe} introduced safety-aware policy optimization for autonomous scanning, while Su et al.~\cite{su2025tissue} used tissue-view maps to guide structure-specific acquisition. SonoGym~\cite{ao2025sonogym} further provides a scalable simulation platform for learning-based robotic ultrasound. 
Overall, these methods advance robotic scanning and target-view acquisition, but most of them optimize local probe behavior or predefined targets rather than sequentially selecting complementary views for comprehensive multi-view perception. \textbf{Viewpoint Selection.} Viewpoint selection, often studied as next-best-view planning or active observation, aims to select future views that improve scene understanding, reconstruction quality, or task performance. Isler et al.~\cite{isler2016information} formulated active 3D reconstruction by maximizing volumetric information gain. Di et al.~\cite{di2024learning} learned viewpoint policies for active localization. Chen et al.~\cite{chen2024gennbv} developed generative next-best-view modeling, while Feng et al.~\cite{feng2024naruto} and Xue et al.~\cite{xue2024neural} explored neural view planning for active 3D reconstruction. Hou et al.~\cite{hou2024learning} further showed that multi-view recognition benefits from learning which views to acquire rather than aggregating views exhaustively. Collectively, these works show that actively selecting views can substantially improve information efficiency compared with fixed or exhaustive observation. Compared with its broad use in general active vision, the viewpoint-selection perspective has been only sparsely explored in ultrasound-related settings. Existing efforts mainly recover target views from local image feedback~\cite{hase2020ultrasound,men2023gaze}, learn task-specific rewards for robotic sonography~\cite{jiang2024intelligent}, impose safety-aware constraints during autonomous acquisition~\cite{duan2024safe}, or guide scanning with tissue-view maps for predefined structures~\cite{su2025tissue}. While closely related, these methods more commonly optimize image quality, probe safety, or navigation toward predefined views or structures. 
Relatively few studies explicitly formulate ultrasound scanning as a sequential viewpoint selection problem in which future probe views are chosen based on previously observed anatomy and accumulated spatial memory. Our work follows this viewpoint-selection perspective and instantiates it as active multi-view exploration for ultrasound.
\begin{figure*}[t]
\centering
\includegraphics[width=\linewidth]{images/fig2.pdf}
\caption{\textbf{Active multi-view ultrasound exploration with $T$ scanning steps.} Solid lines indicate the network forward pass and dashed lines indicate the agent-environment interaction. The approach starts from an initial probe pose, and the sector selection module $S$ selects a target sector $z_t$ from the current state $s_t$. The agent navigates to the selected sector and acquires a new ultrasound slice, which is fused into the probability map $\hat{V}_{t+1}$ via $U(\cdot)$. This process repeats for $T$ steps, progressively building a dense reconstruction.}
\label{fig:system_overview}
\end{figure*}
\section{Methodology}
\label{sec:method}
\subsection{Problem Definition}
% One paragraph stating the task definition (a large overview figure is needed here)
We formulate active view exploration for ultrasound perception as a sequential decision-making problem under partial observability. The complete 3D anatomy is not directly accessible; the agent can only observe it through partial 2D ultrasound slices. The objective is to learn an exploration policy $\pi_\phi(a_t|s_t)$ that maps the current state $s_t$ to continuous kinematic actions $a_t$, maximizing the cumulative coverage of the target anatomical structure within a fixed budget of $T$ steps.
% Concretely, the agent faces three subproblems: (1) estimating, from incomplete observations, how much anatomical coverage each unvisited region would provide; (2) deciding which regions to visit and in what order within the finite budget; and (3) translating each regional decision into a feasible kinematic trajectory.
% State
\textbf{State}.
Because the number of acquired slices grows with each step, directly conditioning the policy on the full observation history is impractical. We instead maintain a fixed-dimensional state $s_t$ that summarizes all spatial information collected up to step $t$. At each step $t$, the agent receives a 2D ultrasound slice $I_t$ at probe pose $(\mathbf{p}_t, \mathbf{q}_t)$ and fuses it into a 3D probability map $\hat{V}_t$ via the volumetric fusion function $U(\cdot)$. We formulate the state $s_t$ as:
\begin{equation}
s_t = (\hat{V}_t, \mathbf{p}_t, \mathbf{q}_t).
\end{equation}
Here $\hat{V}_t$ aggregates all slices observed up to step $t$ into a spatial probability map, where each voxel stores the estimated probability of tissue occupancy. $\hat{V}_0$ is initialized to a uniform probability of $0.5$ to represent maximum uncertainty. This representation maintains the same dimensionality across different time steps, allowing the policy to operate on a fixed-size input regardless of the episode length. Although $s_t$ captures the spatial structure observed so far, it does not explicitly indicate how much of the target anatomy has been covered. To provide the critic with a more informative training signal, we define a privileged coverage ratio:
\begin{equation}
c_t = \frac{\sum_{v} \hat{V}_{t}(v) \cdot g(v)}{\sum_{v} g(v) + \epsilon}, \quad c_t \in [0,1],
\end{equation}
where the summation runs over all voxels $v$ in the reconstruction volume, and $g(v)$ is the ground-truth binary mask of the target structure. Since $c_t$ requires $g(v)$, it is available only during training in simulation. Following the asymmetric actor-critic formulation~\cite{pinto2017asymmetric}, the actor $\pi_\phi(a_t | s_t)$ sees only $s_t$, while the critic $V_\psi(s_t, c_t)$ additionally receives $c_t$ for more accurate value estimation. This separation ensures that the deployed policy does not rely on any privileged information.
% Action
\textbf{Action}.
For a given state $s_t$ at time step $t \in \{1,\dots,T\}$, the agent outputs a continuous 4D action $a_t = (\Delta x, \Delta z, \Delta\phi, \Delta\psi)$, where $\Delta x$ and $\Delta z$ are translational displacements along the x and z axes, and $\Delta\phi$ and $\Delta\psi$ are rotational increments for roll and yaw, respectively. The y-axis translation is omitted because the probe maintains surface contact throughout scanning. The action space is continuous to allow fine-grained kinematic adjustments.
\begin{figure*}[t]
\includegraphics[width=\linewidth]{images/architecture.pdf}
\caption{\textbf{Architecture of SonoSelect.} The view selection module extracts sector features $f_i$ from the reconstruction volume, computes Q-values $Q(s_t, z_i)$ for each sector, and converts the selected sector $z_t$ into a positional guidance vector $\mathbf{v}_t^{\text{pos}}$. The action refinement module takes $s_t$ and $\mathbf{v}_t^{\text{pos}}$ as input and outputs a kinematic increment $\Delta_t$. The residual fusion module combines $\mathbf{v}_t^{\text{pos}}$ and the scaled increment $\hat{\Delta}_t$ to produce the final action $a_t$.}
\label{fig:sono}
\end{figure*}
\textbf{Transition}. Upon executing action $a_t$, the probe pose is updated to $(\mathbf{p}_{t+1}, \mathbf{q}_{t+1})$ via the environment's kinematic function. The environment then returns a new ultrasound slice $I_{t+1}$, which is fused into the probability map to produce $\hat{V}_{t+1}$, and the state transitions to $s_{t+1} = (\hat{V}_{t+1}, \mathbf{p}_{t+1}, \mathbf{q}_{t+1})$. The scanning process terminates when the step budget $T$ is exhausted.
% Reward
\textbf{Reward}. We design a dense, multi-objective reward function:
\begin{equation}
r_t = w_{cov} \Delta C_t + w_{info} \Delta H_t^{echo} - \ell_t^{path}
\end{equation}
The first term $\Delta C_t$ measures the incremental coverage gain over the anatomical structures of interest, weighted by $w_{cov}$, and provides the main learning signal.
However, a single partial slice may refine the reconstruction without producing measurable coverage gain. To reward such intermediate progress, the second term $\Delta H_t^{echo}$, weighted by $w_{info}$, captures the reduction in volumetric Shannon entropy over the target region, so that steps reducing acoustic uncertainty still receive positive feedback. Because the policy maximizes the cumulative sum of all these terms, entropy reduction alone cannot sustain high returns; the policy is driven toward trajectories that also achieve coverage gains over the structures of interest. This distinguishes our reward from objectives that use entropy reduction as the sole optimization target, where the policy has no incentive to prioritize diagnostically relevant regions over other high-uncertainty areas. Finally, $\ell_t^{path}$ is a conditional kinematic penalty that penalizes large translational and rotational displacements when a step produces no coverage gain, discouraging the agent from moving excessively without acquiring new information. \subsection{SonoSelect Architecture} A flat continuous policy would need to simultaneously decide which region of the anatomy to visit next and compute the kinematic actions to get there. In practice, this joint optimization is difficult because selecting which anatomical region to scan next requires reasoning over the entire observed volume and operates over long horizons with sparse diagnostic feedback, while executing the probe movement toward that region requires dense, short-horizon kinematic adjustments. These two sub-tasks differ in both temporal scale and input granularity. SonoSelect decomposes this problem into two coupled components. A sector selection module handles the long-horizon decision of where to explore. The selected region then provides a directional target for a continuous control policy, which only needs to solve a simpler, short-range navigation task toward the chosen sector. 
This decomposition constrains the search space for each sub-problem while maintaining the flexibility required for fine-grained kinematic control.
% \begin{figure}
% \centering
% \includegraphics[width=\linewidth]{images/feature.pdf}
% \caption{\textbf{Sector feature extraction pipeline.} By treating elevation slices as input channels, a shared 2D convolutional encoder processes the reconstruction volume $\hat{V}_t$ into a 32-channel feature map. For each sector $i$, a sector-specific mask filters this map, followed by parallel average and max pooling. A shared MLP then projects the concatenated 64-dimensional vector, producing the sector feature $f_i$.}
% \label{fig:sector_feature}
% \end{figure}
We discretize the local operational workspace into $S$ equiangular sectors (\cref{fig:sono}). To obtain a feature representation $f_i$ for each sector, the reconstruction volume $\hat{V}_t$ is first rearranged by treating elevation slices as input channels and then processed by a shared 2D convolutional encoder. For each sector $i$, a binary sector mask is applied to the encoded feature map, followed by parallel average and max pooling. The concatenated pooling result is then projected through a shared MLP to produce $f_i$. The sector features $\{f_i\}_{i=1}^{S}$ are each passed through a shared Q-network to produce action values $\{Q(s_t, z_i)\}_{i=1}^{S}$, where $Q(s_t, z_i)$ estimates the cumulative expected reward for navigating toward sector $z_i$. This parameter-sharing design ensures that the Q-network generalizes across all candidate sectors rather than learning separate value estimates for each. During training, the sector is chosen via an $\epsilon$-greedy strategy to balance exploration and exploitation; at deployment, the sector with the highest Q-value is deterministically selected.
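The masked pooling and $\epsilon$-greedy selection described above can be sketched as follows. This is a minimal NumPy illustration: the random feature map stands in for the output of the learned convolutional encoder, the mean over features stands in for the shared Q-network, and the function names, shapes, and quadrant masks are illustrative assumptions rather than the implementation.

```python
import numpy as np

def sector_features(feat_map, masks, eps=1e-8):
    """Masked average + max pooling per sector (before the shared MLP projection)."""
    feats = []
    for m in masks:                        # m: (H, W) binary sector mask
        masked = feat_map * m              # zero out features outside the sector
        avg = masked.sum(axis=(1, 2)) / (m.sum() + eps)
        mx = masked.max(axis=(1, 2))
        feats.append(np.concatenate([avg, mx]))
    return np.stack(feats)                 # (S, 2C) sector features

def select_sector(q_values, epsilon, rng):
    """Epsilon-greedy during training; greedy (epsilon = 0) at deployment."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
feat_map = rng.random((32, 8, 8))          # stand-in for the encoded volume (C, H, W)
masks = np.zeros((4, 8, 8))                # 4 sectors; quadrants used here for simplicity
masks[0, :4, :4] = 1; masks[1, :4, 4:] = 1
masks[2, 4:, :4] = 1; masks[3, 4:, 4:] = 1
f = sector_features(feat_map, masks)       # (4, 64)
q = f.mean(axis=1)                         # stand-in for the shared Q-network
z = select_sector(q, epsilon=0.0, rng=rng) # greedy pick at deployment
```

Because every sector is pooled into the same fixed-size vector and scored by one shared network, the number of learned parameters is independent of $S$, which is what lets the Q-network generalize across candidate sectors.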
The geometric center of the selected sector $z_t$ is converted into a positional target vector $\mathbf{v}_t^{\text{pos}} \in \mathbb{R}^2$ in the probe's local coordinate frame, representing the translational direction toward the selected sector. This vector serves as the guidance signal for the downstream continuous control policy. The continuous control policy translates the selected sector into kinematic actions. We employ a PPO-based actor-critic architecture. The actor takes as input the current state $s_t$ concatenated with the sector guidance vector $\mathbf{v}_t^{\text{pos}}$, and outputs a local kinematic increment $\Delta_t = [\Delta_t^{\text{pos}}, \Delta_t^{\text{ang}}] \in \mathbb{R}^4$. A residual scaling factor $\alpha$ is applied to obtain the scaled increment $\hat{\Delta}_t = \alpha \Delta_t$. The final action $a_t$ fuses the sector-derived target with this scaled increment: \begin{equation} a_t^{\text{pos}} = \beta_t \mathbf{v}_t^{\text{pos}} + (1-\beta_t) \hat{\Delta}_t^{\text{pos}}, \quad a_t^{\text{ang}} = \hat{\Delta}_t^{\text{ang}} \end{equation} where $\beta_t$ linearly anneals from an initial value $\beta_0$ to a final value $\beta_f$ over training. In early training, $\beta_t$ is large so that the translational component is dominated by the sector guidance $\mathbf{v}_t^{\text{pos}}$, providing a stable learning signal before the policy has converged. As training progresses, $\beta_t$ decreases and the policy's own output $\hat{\Delta}_t^{\text{pos}}$ takes over. The angular component $a_t^{\text{ang}}$ is determined entirely by the policy, as the sector selection provides only translational guidance. The critic estimates the state value $V_{\psi}(s_t, c_t)$ using the augmented state. \subsection{Training Scheme} We employ a rollout-based sequential updating approach to jointly train the continuous control policy (via Proximal Policy Optimization, PPO~\cite{schulman2017proximal}) and the sector selection module via Q-learning. 
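The annealed residual fusion described in the previous subsection can be sketched as follows; the schedule endpoints $\beta_0$, $\beta_f$ and the scaling factor $\alpha$ used here are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def beta_schedule(step, total_steps, beta0=0.9, beta_f=0.1):
    """Linear annealing of the fusion weight beta_t over training."""
    frac = min(step / total_steps, 1.0)
    return beta0 + frac * (beta_f - beta0)

def fuse_action(v_pos, delta, beta, alpha=0.1):
    """a_pos = beta * v_pos + (1 - beta) * alpha * delta_pos;  a_ang = alpha * delta_ang."""
    d = alpha * np.asarray(delta, dtype=float)   # scaled increment \hat{Delta}_t
    a_pos = beta * np.asarray(v_pos, dtype=float) + (1.0 - beta) * d[:2]
    a_ang = d[2:]                                # angles come from the policy alone
    return np.concatenate([a_pos, a_ang])        # 4D action (dx, dz, droll, dyaw)

v_pos = np.array([1.0, 0.0])                     # guidance toward the selected sector
delta = np.array([0.2, -0.1, 0.05, 0.0])         # raw policy increment
a_early = fuse_action(v_pos, delta, beta_schedule(0, 100))    # guidance-dominated
a_late = fuse_action(v_pos, delta, beta_schedule(100, 100))   # policy-dominated
```

Early in training the translational component closely tracks the sector guidance, giving the PPO policy a stable target; by the end of the schedule the same code path is dominated by the policy's own increment.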
This joint training scheme allows both modules to co-adapt within the same trajectory data, ensuring consistent learning signals across the two decision levels. The continuous control policy is optimized using the standard PPO objective with Generalized Advantage Estimation (GAE)~\cite{schulman2015high}. The actor outputs kinematic increments $\Delta_t$ and is updated via clipped surrogate objectives, while the critic estimates $V_\psi(s_t, c_t)$ and provides the baseline for advantage computation. For the sector selection module, we train the sector Q-network using Monte Carlo rollout returns as regression targets. The action-value function $Q_{\theta}(s_t, z_t)$ estimates the expected discounted return after selecting sector $z_t$ at state $s_t$:
\begin{equation}
Q_{\theta}(s_{t}, z_{t}) = \mathbb{E}\left[ \sum_{\tau=t}^{T} \gamma^{\tau-t} r_{\tau} \right],
\end{equation}
where $\mathbb{E}[\cdot]$ denotes the expectation and $\gamma \in [0, 1]$ is the discount factor. Although both modules share the same reward signal, they require different value representations. The PPO critic learns a state value $V_\psi(s_t, c_t)$ used to compute advantages for the continuous control policy, while the sector selection module learns action-conditional values $Q_\theta(s_t, z_i)$ that compare the expected return of each candidate sector. This difference motivates maintaining separate value functions despite the shared reward. Given this formulation, we compute the return from the collected rollouts as the supervision target:
\begin{equation}
y_{t} =
\begin{cases}
r_{t} + \gamma (1 - d_{t}) y_{t+1}, & \text{if } t < T \\
r_{T}, & \text{otherwise}
\end{cases},
\end{equation}
where $d_t$ is the termination mask.
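The backward recursion for the return targets above can be sketched directly; the helper name is illustrative, and the reward and mask values below are a toy rollout rather than data from the paper.

```python
def mc_return_targets(rewards, dones, gamma=0.99):
    """Backward recursion: y_T = r_T;  y_t = r_t + gamma * (1 - d_t) * y_{t+1}."""
    y = [0.0] * len(rewards)
    y[-1] = rewards[-1]                    # terminal target is the final reward
    for t in range(len(rewards) - 2, -1, -1):
        y[t] = rewards[t] + gamma * (1 - dones[t]) * y[t + 1]
    return y

# Toy rollout: three unit rewards, episode terminates at the last step.
targets = mc_return_targets([1.0, 1.0, 1.0], [0, 0, 1], gamma=0.5)
# targets == [1.75, 1.5, 1.0]
```

Each $y_t$ then serves as the regression target for $Q_\theta(s_t, z_t)$ in the loss described next.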
The Q-network is then optimized with a mean squared error (MSE) loss on these targets:
\begin{equation}
\mathcal{L}_{Q} = \lambda_Q \frac{1}{T} \sum_{t=1}^{T} \text{MSE}(Q_{\theta}(s_{t}, z_{t}), y_{t}),
\end{equation}
where $\lambda_Q$ controls the loss weight. In joint training, the two objectives are optimized in separate backward passes within each iteration. First, the PPO objective $\mathcal{L}_{\text{PPO}}$ updates the continuous control policy and the critic. Then, in a separate backward pass, the Q-learning loss $\mathcal{L}_{Q}$ updates the sector selection module, including the Q-network and its associated feature encoder. This sequential scheme prevents gradient interference between the two objectives. A step-by-step demonstration of this process can be found in \cref{alg:sonoselect}.
\begin{algorithm}[t]
\caption{SonoSelect}
\label{alg:sonoselect}
\begin{algorithmic}[1]
\small
\STATE \textbf{Input}: Env $\mathcal{E}$, budget $T$, exploration rate $\epsilon$, scaling factor $\alpha$, annealing weight $\beta_t$, Q-loss weight $\lambda_Q$.
\STATE \textbf{Update}: Q-network $Q_{\theta}$, actor $\pi_{\phi}$, critic $V_{\psi}$.
\FOR{each training iteration} \STATE Initialize rollout buffers $\mathcal{B}_{\text{PPO}}, \mathcal{B}_{Q} \leftarrow \emptyset$ \STATE Reset environment: $s_1 \leftarrow \mathcal{E}.\text{reset}()$ \FOR{$t = 1$ to $T$} \STATE Extract sector features $\{f_i\}_{i=1}^{S}$ from $\hat{V}_t$ \STATE Select sector using $\epsilon$-greedy: with probability $\epsilon$ adopt a random sector, or else choose $z_t = \arg\max_{z_i} Q_{\theta}(s_t, z_i)$ \STATE Compute guidance $\mathbf{v}_t^{\text{pos}} \leftarrow \text{GeometricCenter}(z_t)$ \STATE Sample $\Delta_t \sim \pi_{\phi}(\cdot \mid s_t, \mathbf{v}_t^{\text{pos}})$; scale $\hat{\Delta}_t \leftarrow \alpha \Delta_t$ \STATE Fuse action: $a_t^{\text{pos}} \leftarrow \beta_t \mathbf{v}_t^{\text{pos}} + (1{-}\beta_t) \hat{\Delta}_t^{\text{pos}}$, $a_t^{\text{ang}} \leftarrow \hat{\Delta}_t^{\text{ang}}$ \STATE Execute $a_t$ in $\mathcal{E}$; observe $s_{t+1}, r_t, d_t$ \STATE Store $(s_t, c_t, a_t, r_t, s_{t+1}, c_{t+1}, d_t)$ in $\mathcal{B}_{\text{PPO}}$ \STATE Store $(s_t, z_t, r_t, s_{t+1}, d_t)$ in $\mathcal{B}_{Q}$ \ENDFOR \STATE Compute GAE advantages from $\mathcal{B}_{\text{PPO}}$; update $\pi_{\phi}, V_{\psi}$ via $\mathcal{L}_{\text{PPO}}$ \STATE Compute discounted returns $\{y_t\}$ from $\mathcal{B}_{Q}$; update $Q_{\theta}$ via $\nabla\mathcal{L}_{Q}$ \ENDFOR \end{algorithmic} \end{algorithm} \section{Experiment} We evaluate our approach in two stages. Sec.~\ref{sec:discrete_classification} studies multi-view classification in a simplified discrete setting, and Sec.~\ref{sec:continuous_detection} evaluates SonoSelect on a continuous kidney cyst detection task. \subsection{Preliminary: Multi-view Classification} \label{sec:discrete_classification} Before evaluating the full scanning pipeline, we first ask a simpler question: given a set of candidate viewpoints, can an adaptive selection policy identify the few informative views for each instance? 
To answer this, we use a simplified setting where the probe can access any candidate viewpoint without movement cost. As the adaptive selection method, we adopt MVSelect~\cite{hou2024learning}, which sequentially chooses the next view conditioned on previously acquired observations. This setup lets us test whether observation-driven view selection is beneficial before introducing continuous probe control.
\textbf{Datasets.} We construct two custom multi-view ultrasound datasets, both adopting a strict 80\%/20\% train-test split and extracting $120 \times 120$ 2D slices under two distinct viewpoint configurations (12 views and 20 views).
\begin{itemize}
\item \textit{Geometry:} this synthetic dataset comprises 10 distinct categories (sphere, ellipsoid, cube, cuboid, cylinder, capsule, cone, torus, octahedron, and cross), with 150 unique instances per category.
\item \textit{Organ:} to move closer to clinical realism, we introduce a more challenging dataset comprising real human anatomical structures sourced from the publicly available TotalSegmentator dataset~\cite{wasserthal2023totalsegmentator}. It contains 6 distinct categories: left kidney, liver, pancreas, spleen, aorta, and stomach, with 100 unique patient instances per category.
\end{itemize}
\textbf{Task Network.} For both datasets, we employ a ResNet-18~\cite{he2016deep} backbone combined with a max-pooling aggregation module. The network is trained offline on complete multi-view sequences so that the learned representations are not biased toward any particular view subset.
\textbf{Quantitative Results.} The classification performances on both datasets are summarized in \cref{tab:combined_results}.
We compare five selection strategies: (1) \textit{dataset-level oracle}, which uses the same fixed pair of views that achieves the highest average accuracy across all instances in the training set; (2) \textit{instance-level oracle}, which selects the optimal pair for each test instance by exhaustive search; (3) \textit{random selection}, which samples two views uniformly; (4) \textit{validation best policy}, which selects the fixed pair that achieves the highest accuracy on the validation set; and (5) \textit{MVSelect}~\cite{hou2024learning}, which sequentially selects views conditioned on previous observations.
\begin{table}[t]
\centering
\small
\setlength{\tabcolsep}{0.5mm}
\renewcommand{\arraystretch}{1.1}
% Use \resizebox to keep the table within the column width if it overflows on the right
\begin{tabular}{l|cc|cc}
% Column definition simplified: redundant vertical rules removed
\multirow{2}{*}{view selection} & \multicolumn{2}{c|}{Geometry} & \multicolumn{2}{c}{Organ} \\
% Horizontal rule under the Geometry and Organ group headers
\cline{2-5}
& 12 views & 20 views & 12 views & 20 views \\
\hline
N/A: all $N$ views & 84.02 & 92.96 & 92.50 & 91.73 \\
\hline
dataset-lvl oracle & 79.07 $\pm$ 0.71 & 83.42 $\pm$ 2.23 & 92.75 $\pm$ 0.64 & 90.13 $\pm$ 1.98 \\
instance-lvl oracle & 93.80 $\pm$ 0.47 & 99.23 $\pm$ 0.61 & 98.04 $\pm$ 0.61 & 99.32 $\pm$ 1.06 \\
\hline
random selection & 74.61 $\pm$ 2.32 & 68.71 $\pm$ 8.34 & 87.44 $\pm$ 4.17 & 73.91 $\pm$ 11.58 \\
validation best policy & 71.57 $\pm$ 2.46 & 70.87 $\pm$ 5.60 & 88.51 $\pm$ 1.61 & 83.73 $\pm$ 4.06 \\
\hline
MVSelect~\cite{hou2024learning} & 79.11 $\pm$ 1.33 & 89.59 $\pm$ 1.61 & 96.23 $\pm$ 1.62 & 97.00 $\pm$ 1.04 \\
\end{tabular}
\caption{Classification accuracy (\%) on Geometry and Organ.
Each adaptive or fixed policy selects two views per instance from the candidate set.}
\label{tab:combined_results}
\end{table}
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{images/Experiment1.pdf}
\caption{\textbf{Qualitative results of the discrete view selection policy.} We visualize the selected viewpoints for both the Geometry (top) and Organ (bottom) datasets.}
\label{fig:Experiment1}
\end{figure}
We first note that using all $N$ views does not yield the highest accuracy. The instance-level oracle, which selects the best two views per instance via exhaustive search, substantially surpasses the all-view baseline on both datasets. This suggests that not all views contribute positively to the final prediction. However, the dataset-level oracle, which fixes the same two views across all instances, performs considerably worse than the instance-level oracle. This gap confirms that the most informative views vary across instances. Random selection performs the worst overall, with high variance reflecting the inconsistency of uninformed view choices. MVSelect, which selects views conditioned on each instance's observations, approaches the instance-level oracle on both datasets. This confirms that an adaptive policy can recover near-optimal view combinations without exhaustive search. \cref{fig:Experiment1} visualizes the selected viewpoints for representative instances, showing that the policy chooses different orientations for different objects and anatomies. Together, these results support the two properties that motivate SonoSelect: (1) a small number of well-chosen views can match or exceed the performance of exhaustive acquisition, and (2) the optimal views are instance-dependent and vary across patient anatomies. Building on these findings, SonoSelect further addresses how to acquire such informative views by adaptively guiding probe movement based on a 3D spatial memory of the observed anatomy.
The following section evaluates SonoSelect on a kidney cyst detection task, where the agent guides probe movement to detect target pathology within a limited scanning budget.

\begin{table*}[t]
\centering
\footnotesize
\setlength{\tabcolsep}{0.6mm}
\renewcommand{\arraystretch}{1.2}
% Columns: method and sector count on the left, separated from the two result blocks by vertical rules
\begin{tabular}{lc|cccccc|cccccc}
\hline
\multirow{2}{*}{\textbf{Method}} & \multirow{2}{*}{\textbf{Sectors}} & \multicolumn{6}{c|}{\cellcolor{yellow!50}\textbf{Seen Patient}} & \multicolumn{6}{c}{\cellcolor{orange!20}\textbf{Unseen Patient}} \\
\cline{3-14}
 & & \textbf{Kidney (\%)} $\uparrow$ & \textbf{Cyst (\%)} $\uparrow$ & \textbf{Dice (\%)} $\uparrow$ & \textbf{IoU (\%)} $\uparrow$ & \textbf{Trans.(voxels)} $\downarrow$ & \textbf{Rot.($^\circ$)} $\downarrow$ & \textbf{Kidney (\%)} $\uparrow$ & \textbf{Cyst (\%)} $\uparrow$ & \textbf{Dice (\%)} $\uparrow$ & \textbf{IoU (\%)} $\uparrow$ & \textbf{Trans.(voxels)} $\downarrow$ & \textbf{Rot.($^\circ$)} $\downarrow$ \\
\hline
Random & - & 13.79 & 8.38 & 22.82 & 14.13 & 925.81 & 2623.35 & 23.41 & 1.88 & 37.41 & 24.50 & 937.55 & 2624.38 \\
PPO & - & 60.46 & 44.94 & 69.76 & 53.69 & 427.52 & 294.24 & 33.14 & 12.10 & 47.97 & 34.62 & 376.92 & 247.93 \\
RND & - & 64.99 & 45.53 & 72.59 & 57.08 & 549.41 & 318.89 & 48.91 & 23.41 & 64.94 & 49.33 & 540.09 & 272.15 \\
VIG & - & 56.95 & 40.35 & 63.88 & 50.36 & 484.74 & 350.06 & 48.62 & 23.12 & 64.84 & 49.03 & 463.56 & 417.47 \\
\hline
\textbf{SonoSelect} & 4 & 56.52 & 41.50 & 66.66 & 50.40 & 398.51 & 258.11 & 23.06 & 3.67 & 37.05 & 24.24 & 353.37 & 201.96 \\
\textbf{SonoSelect} & 8 & 62.39 & 47.13 & 71.52 & 55.93 & 570.16 & 186.57 & 50.87 & 30.91 & 67.53 & 51.50 & 561.80 & 198.36 \\
\textbf{SonoSelect} & 16 & 67.55 & 48.37 & 73.88 & 58.62 & 423.40 & 311.65 & 54.56 & 35.13 & 70.76 & 54.78 & 446.98 & 311.12 \\
\textbf{SonoSelect} & 32 & 65.07 & 45.67 & 72.70 & 57.20 & 532.87 & 258.49 & 52.14 & 27.90 & 66.97 & 51.16 & 574.52 & 213.78 \\
\hline
\end{tabular}
\caption{Quantitative comparison of active scanning performance. SonoSelect shows smaller performance degradation on unseen patient anatomies compared to other learned baselines.}
\label{tab:main_results}
\end{table*}

\begin{figure*}[t]
% === Row 1: Figures 5, 6, 7 ===
\begin{minipage}[t]{0.30\textwidth}
\centering
\includegraphics[width=\linewidth]{images/kde_1d_cyst_coverage_sonoselect_vs_ppo.pdf}
\captionof{figure}{Distribution of per-episode cyst coverage on unseen anatomies.}
\label{fig:kde}
\end{minipage}\hfill
\begin{minipage}[t]{0.35\textwidth}
\centering
\includegraphics[width=\linewidth]{images/traj.pdf}
\captionof{figure}{Scanning trajectories on unseen data. Red/blue: on/off-target; percentages: effective scanning ratio.}
\label{fig:qualitative_trajectories}
\end{minipage}\hfill
\begin{minipage}[t]{0.30\textwidth}
\centering
\includegraphics[width=\linewidth]{images/curve_cyst_coverage_t.pdf}
\captionof{figure}{Cyst coverage over scanning steps on unseen anatomies.}
\label{fig:cyst_curve}
\end{minipage}
\end{figure*}

\subsection{Kidney Cyst Detection}
\label{sec:continuous_detection}
The preliminary study confirms that a small number of adaptively chosen views can match or exceed exhaustive acquisition, and that the optimal views vary across instances. Building on this finding, we now evaluate whether SonoSelect can realize these benefits when the probe moves sequentially along the body surface, where each movement carries a scanning cost and the agent receives only partial observations of the underlying anatomy. We test on a kidney cyst detection task, a clinically motivated scenario that requires both broad organ coverage and precise localization of small pathological targets, and evaluate generalization to unseen patient anatomies.

\textbf{Experimental Setup.} The primary task requires the agent to dynamically scan the left kidney and identify renal cysts.
We utilize 3D clinical CT volumes from the TotalSegmentator dataset~\cite{wasserthal2023totalsegmentator}. To evaluate structural generalization, patient anatomies are strictly partitioned into seen and unseen domains.

\textbf{Implementation details.} We train all models using PPO with 64 parallel environments for a total of 500K environment steps. Each policy update collects a rollout of 32 steps and performs 3 learning epochs over 4 mini-batches. We use the Adam optimizer with a learning rate of $3 \times 10^{-4}$, adjusted by a KL-adaptive scheduler. The discount factor is $\gamma = 0.97$ with GAE parameter $\lambda = 0.95$. For the auxiliary Q-network, we use a lower learning rate of $1 \times 10^{-4}$ and schedule its loss weight $\lambda_Q$ from 0.05 to 0.02 over 150K steps to gradually reduce its influence as the policy matures. A target network updated via Polyak averaging ($\tau = 0.005$) is employed to provide stable regression targets for Q-value training. The residual guidance coefficient $\beta_t$ is linearly annealed from 1.0 to 0.05 over 300K steps, allowing the policy to transition from heuristic-guided exploration to autonomous scanning. The residual scale $\alpha$ is kept at 1.0 throughout training. Each episode has a step budget of $T = 600$. All experiments are conducted on a single NVIDIA RTX 4090D GPU with a fixed random seed of 42.

\textbf{Baselines.} In this fully continuous setting, we benchmark SonoSelect against baselines representing alternative exploration strategies. \textit{Random} applies uniformly sampled kinematic actions at each step, providing a lower bound on diagnostic yield without any learned or heuristic guidance. \textit{PPO}~\cite{schulman2017proximal} trains a single continuous control policy that directly maps observations to probe actions, testing whether end-to-end reinforcement learning can implicitly learn effective exploration without explicit spatial planning.
\textit{VIG} (Volumetric Information Gain)~\cite{isler2016information} represents classical Next-Best-View planning driven by entropy maximization, testing whether uncertainty reduction alone provides sufficient guidance for diagnostic exploration. \textit{RND}~\cite{burda2018exploration} provides a state-visitation driven exploration bonus, testing whether encouraging novel state visits improves coverage without task-specific guidance.

\textbf{Quantitative Results.} \cref{tab:main_results} presents scanning performance on seen and unseen patient anatomies. On seen anatomies, SonoSelect (16 sectors) achieves the highest scores across all four diagnostic metrics. The differences among learned methods are moderate on seen anatomies, as all methods can exploit spatial regularities present in the training data. We further examine whether this advantage transfers to unseen patients, where memorized spatial patterns are no longer reliable. All methods degrade on unseen anatomies, but the extent differs. PPO exhibits the largest drop, with kidney coverage falling from 60.46\% to 33.14\% and cyst coverage from 44.94\% to 12.10\%, suggesting that the policy relies on spatial regularities specific to the training anatomies and does not transfer well when the anatomy changes. RND and VIG show better retention than PPO, with cyst coverage reaching 23.41\% and 23.12\% respectively on unseen data. Both methods incorporate task-agnostic exploration signals (novelty for RND, entropy for VIG) that provide some robustness to anatomy changes. However, both still fall behind SonoSelect on all four diagnostic metrics, suggesting that task-agnostic exploration signals alone are not sufficient for efficient diagnostic scanning. In contrast, SonoSelect achieves the highest scores on all four diagnostic metrics on unseen data.
SonoSelect shows the smallest performance gap between seen and unseen anatomies across all four diagnostic metrics, suggesting that the spatial memory provides a more transferable basis for probe guidance than the alternatives tested. We also report translation and rotation errors in \cref{tab:main_results}. These metrics reflect the cumulative probe displacement during scanning rather than diagnostic quality. Among the learned methods, the translation and rotation errors do not show a consistent ranking, as different exploration strategies produce trajectories of varying lengths and orientations. SonoSelect (16 sectors) achieves moderate translation and rotation values while obtaining the highest diagnostic coverage, indicating that its coverage advantage comes from more targeted probe movement rather than simply longer trajectories. \textbf{Effect of Sector Granularity.} The number of sectors controls the granularity of the routing decision (\cref{tab:main_results}). With only 4 sectors, the routing module partitions the search space too coarsely, and performance on unseen anatomies drops close to the random baseline. Increasing to 8 sectors provides sufficient directional resolution for the routing policy to generalize, yielding a substantial improvement on unseen data. Performance peaks at 16 sectors, where the routing granularity balances expressiveness against the difficulty of learning a reliable policy from limited training episodes. At 32 sectors, performance slightly decreases, indicating that finer partitioning introduces more choices than the policy can reliably distinguish given the available training data. \textbf{Scanning Efficiency.} \cref{fig:cyst_curve} compares how efficiently each method converts scanning steps into diagnostic coverage. SonoSelect accumulates cyst coverage at a faster rate than all baselines, with the gap widening after approximately 200 steps as the routing module begins directing the probe toward high-value regions. 
PPO's curve flattens early, suggesting that the agent converges to a local scanning pattern and stops exploring new regions, while RND and VIG show intermediate growth rates that slow in later steps.

\textbf{Episode-level Analysis.} The scatter plot in~\cref{fig:tradeoff} examines this efficiency at the episode level. PPO clusters in the bottom-left quadrant, indicating frequent near-zero-coverage episodes with short, spatially confined trajectories on unseen anatomies. SonoSelect occupies the upper-right quadrant, where longer trajectories correspond to higher diagnostic coverage. The per-episode distribution in~\cref{fig:kde} further illustrates this contrast: PPO's cyst coverage concentrates near zero, while SonoSelect's distribution shifts toward higher values.

\textbf{Qualitative Results.} Representative trajectories in~\cref{fig:qualitative_trajectories} show the same pattern spatially. PPO produces circular movements far from the kidney, with effective scanning ratios of 13.5\%--19.6\%, indicating that the probe spends most of its budget on non-target regions. SonoSelect follows the kidney contours with substantially higher ratios, reflecting that the sector routing module directs the probe toward diagnostically relevant areas. Across all three analyses, the results indicate that SonoSelect achieves higher coverage not by scanning longer, but by allocating its scanning budget more effectively, reducing redundant acquisition while increasing the likelihood of reaching diagnostically relevant viewpoints.

\subsection{Ablation Studies}
\label{sec:ablation}
To validate the core architectural designs of SonoSelect, we conduct ablation experiments on the kidney cyst detection task using unseen patient data. We isolate three components: the learned routing policy, the per-sector feature representation, and the residual control module.
Each ablation removes one component while keeping the rest unchanged. The quantitative comparisons are summarized in \cref{tab:ablation}.

\textbf{Effect of Learned Routing.} We first evaluate the high-level decision maker by replacing the learned routing policy with random sector selection. Without a task-driven geometric prior, the continuous policy receives arbitrary directional targets, leading to uncoordinated probe movement. As shown in \cref{tab:ablation}, this variant shows a notable drop in cyst coverage, with the score falling by roughly half, confirming that the learned routing policy is necessary to constrain the search space and direct the continuous policy toward diagnostically relevant regions: without it, the continuous policy can execute local movements but lacks directional guidance.

\textbf{Necessity of Explicit Sector Features.} The w/o Sector Features variant replaces the per-sector feature vectors with uniform values, making all sectors appear identical to the Q-network. Although the Q-network still receives the global observation $s_t$, it cannot distinguish sectors based on their spatial content in the reconstruction volume. As a result, the Q-network selects sectors without considering what each region contains, leading to reduced coverage for both kidney and cyst targets. This drop confirms that the Q-network relies on per-sector spatial features to make informed routing decisions: without them, it cannot leverage the spatial memory to differentiate among candidate sectors.
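Concretely, the routing decision targeted by these two ablations can be summarized as a simplified sketch (the notation is introduced here for illustration):
\begin{equation*}
k_t \in \arg\max_{k \in \{1, \dots, K\}} Q\bigl(s_t,\, f_k\bigr),
\end{equation*}
where $K$ is the number of sectors, $f_k$ is the per-sector feature summarizing the reconstructed content of sector $k$, and $s_t$ is the global observation. Random Routing replaces the $\arg\max$ with uniform sampling over $k$, while w/o Sector Features sets every $f_k$ to the same constant vector, leaving the Q-network no basis for distinguishing sectors.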
\begin{table}[t]
\centering
\small
\begin{tabular}{l|cccc}
\hline
Method & Kidney (\%) & Cyst (\%) & Dice (\%) & IoU (\%) \\
\hline
Random Routing & 47.85 & 13.77 & 62.80 & 45.53 \\
w/o Sector Features & 45.32 & 18.39 & 61.35 & 45.24 \\
w/o Residual Control & 49.94 & 16.18 & 59.23 & 44.92 \\
Fixed $\beta=1.0$ & 49.41 & 27.60 & 67.04 & 50.45 \\
\textbf{SonoSelect} & 54.56 & 27.13 & 70.76 & 54.78 \\
\hline
\end{tabular}
\caption{Ablation study of SonoSelect components on unseen anatomies.}
\label{tab:ablation}
\vspace{-0.5cm}
\end{table}

\textbf{Role of Residual Control.} The w/o Residual Control variant removes the low-level kinematic adjustments and relies solely on the sector-level waypoints for probe guidance. This variant achieves the lowest Dice and IoU among all configurations, while its kidney coverage remains comparable to the other ablated variants. This asymmetry indicates that sector-level routing is sufficient to guide the probe toward the correct anatomical region, but capturing small targets such as cysts requires the fine-grained probe adjustments that the residual control module provides.

\textbf{Effect of Guidance Annealing.} In the full model, the guidance coefficient $\beta_t$ is linearly annealed from 1.0 to 0.05 during training, gradually shifting translational control from the sector guidance to the policy's own output. The Fixed $\beta=1.0$ variant keeps $\beta_t$ at 1.0 throughout training and deployment. As shown in \cref{tab:ablation}, this variant achieves comparable cyst coverage to the full model, but kidney coverage and reconstruction quality both decline.
This suggests that the annealing schedule allows the policy to learn fine-grained translational adjustments beyond the sector center, which contributes to more complete coverage of the kidney surface. \section{Conclusion} We propose SonoSelect, an active multi-view exploration framework for robotic ultrasound that selects informative viewpoints without exhaustive scanning or predefined trajectories. By bridging discrete high-level regional routing with continuous low-level kinematic control, SonoSelect learns to resolve anatomical ambiguities and achieves robust generalization to unseen anatomies where standard reinforcement learning approaches show substantial performance degradation. This approach represents a step toward autonomous robotic ultrasound deployment in clinical workflows. While the current evaluation is conducted in simulation, the hierarchical formulation of coupling discrete region selection with continuous probe control provides a principled way to handle the view-dependent nature of ultrasound imaging. This work suggests that structured, observation-driven exploration can serve as an effective mechanism for multi-view ultrasound perception, reducing the number of views needed for accurate diagnosis while maintaining robust coverage across diverse patient anatomies. \begin{figure}[t] \centering \includegraphics[width=0.9\linewidth]{images/pareto_tradeoff_allpoints.pdf} \caption{Episode-level cyst coverage vs.\ trajectory length on unseen anatomies. Each point represents one episode.} \label{fig:tradeoff} \end{figure} %% %% The next two lines define the bibliography style to be used, and %% the bibliography file. \bibliographystyle{ACM-Reference-Format} \bibliography{sample-base} \end{document} \endinput %% %% End of file `sample-sigconf-authordraft.tex'.