%%
%% This is file `sample-sigconf-authordraft.tex',
%% generated with the docstrip utility.
%%
%% The original source files were:
%%
%% samples.dtx  (with options: `all,proceedings,bibtex,authordraft')
%%
%% IMPORTANT NOTICE:
%%
%% For the copyright see the source file.
%%
%% Any modified versions of this file must be renamed
%% with new filenames distinct from sample-sigconf-authordraft.tex.
%%
%% For distribution of the original source see the terms
%% for copying and modification in the file samples.dtx.
%%
%% This generated file may be distributed as long as the
%% original source files, as listed above, are part of the
%% same distribution. (The sources need not necessarily be
%% in the same archive or directory.)
%%
%% Commands for TeXCount
%TC:macro \cite [option:text,text]
%TC:macro \citep [option:text,text]
%TC:macro \citet [option:text,text]
%TC:envir table 0 1
%TC:envir table* 0 1
%TC:envir tabular [ignore] word
%TC:envir displaymath 0 word
%TC:envir math 0 word
%TC:envir comment 0 0
%%
%% The first command in your LaTeX source must be the \documentclass
%% command.
%%
%% For submission and review of your manuscript please change the
%% command to \documentclass[manuscript, screen, review]{acmart}.
%%
%% When submitting camera ready or to TAPS, please change the command
%% to \documentclass[sigconf]{acmart} or whichever template is required
%% for your publication.
%%
%%
\documentclass[sigconf, screen, review, anonymous]{acmart}
\usepackage{multirow}
\usepackage{algorithmic}
\usepackage{algorithm}
\usepackage{colortbl}
\usepackage{wrapfig}
\usepackage{hyperref}
\usepackage[capitalize]{cleveref}
\usepackage{subcaption}
\usepackage{xspace}% provides \xspace, required by the \etal macro below
\newcommand{\etal}{\textit{et al.}\@\xspace}
%%
%% \BibTeX command to typeset BibTeX logo in the docs
\AtBeginDocument{%
  \providecommand\BibTeX{{%
    Bib\TeX}}}
%% Rights management information. This information is sent to you
%% when you complete the rights form. These commands have SAMPLE
%% values in them; it is your responsibility as an author to replace
%% the commands and values with those provided to you when you
%% complete the rights form.
\setcopyright{acmlicensed}
\copyrightyear{2018}
\acmYear{2018}
\acmDOI{XXXXXXX.XXXXXXX}
%% These commands are for a PROCEEDINGS abstract or paper.
\acmConference[Conference acronym 'XX]{Make sure to enter the correct
  conference title from your rights confirmation email}{June 03--05,
  2018}{Woodstock, NY}
%%
%% Uncomment \acmBooktitle if the title of the proceedings is different
%% from ``Proceedings of ...''!
%%
%%\acmBooktitle{Woodstock '18: ACM Symposium on Neural Gaze Detection,
%%  June 03--05, 2018, Woodstock, NY}
\acmISBN{978-1-4503-XXXX-X/2018/06}
%%
%% Submission ID.
%% Use this when submitting an article to a sponsored event. You'll
%% receive a unique submission ID from the organizers
%% of the event, and this ID should be used as the parameter to this command.
\acmSubmissionID{3405}
%%
%% For managing citations, it is recommended to use bibliography
%% files in BibTeX format.
%%
%% You can then either use BibTeX with the ACM-Reference-Format style,
%% or BibLaTeX with the acmnumeric or acmauthoryear styles, that include
%% support for advanced citation of software artefacts from the
%% biblatex-software package, also separately available on CTAN.
%%
%% Look at the sample-*-biblatex.tex files for templates showcasing
%% the biblatex styles.
%% %% %% The majority of ACM publications use numbered citations and %% references. The command \citestyle{authoryear} switches to the %% "author year" style. %% %% If you are preparing content for an event %% sponsored by ACM SIGGRAPH, you must use the "author year" style of %% citations and references. %% Uncommenting %% the next command will enable that style. %%\citestyle{acmauthoryear} %% %% end of the preamble, start of the body of the document source. \begin{document} %% %% The "title" command has an optional parameter, %% allowing the author to define a "short title" to be used in page headers. \title{SonoSelect: Efficient Ultrasound Perception via \\ Active Probe Exploration } %% %% The "author" command and its associated commands are used to define %% the authors and their affiliations. %% Of note is the shared affiliation of the first two authors, and the %% "authornote" and "authornotemark" commands %% used to denote shared contribution to the research. \author{Ben Trovato} \authornote{Both authors contributed equally to this research.} \email{trovato@corporation.com} \orcid{1234-5678-9012} \author{G.K.M. Tobin} \authornotemark[1] \email{webmaster@marysville-ohio.com} \affiliation{% \institution{Institute for Clarity in Documentation} \city{Dublin} \state{Ohio} \country{USA} } %% %% By default, the full list of authors will be used in the page %% headers. Often, this list is too long, and will overlap %% other information printed in the page headers. This command allows %% the author to define a more concise list %% of authors' names for this purpose. \renewcommand{\shortauthors}{Trovato et al.} %% %% The abstract is a short summary of the work to be presented in the %% article. \begin{abstract} Ultrasound perception typically requires multiple scan views through probe movement to reduce diagnostic ambiguity, mitigate acoustic occlusions, and improve anatomical coverage. However, not all probe views are equally informative. 
Exhaustively acquiring a large number of views can introduce substantial redundancy and increase scanning and processing costs. To address this, we define an active view exploration task for ultrasound and propose SonoSelect, an ultrasound-specific method that adaptively guides probe movement based on current observations. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. Each new 2D ultrasound view is fused into a 3D spatial memory of the observed anatomy, which guides the next probe position. On top of this formulation, we propose an ultrasound-specific objective that favors probe movements with greater organ coverage, lower reconstruction uncertainty, and less redundant scanning. Experiments in an ultrasound simulator show that SonoSelect achieves promising multi-view organ classification accuracy using only 2 out of N views. Furthermore, on a more difficult kidney cyst detection task, it reaches 54.56\% kidney coverage and 35.13\% cyst coverage, with short trajectories consistently centered on the target cyst.
\end{abstract}
%%
%% The code below is generated by the tool at http://dl.acm.org/ccs.cfm.
%% Please copy and paste the code instead of the example below.
%%
\begin{CCSXML}
<ccs2012>
   <concept>
       <concept_id>10010405.10010444.10010449</concept_id>
       <concept_desc>Applied computing~Health informatics</concept_desc>
       <concept_significance>300</concept_significance>
   </concept>
   <concept>
       <concept_id>10010147.10010178.10010224</concept_id>
       <concept_desc>Computing methodologies~Computer vision</concept_desc>
       <concept_significance>500</concept_significance>
   </concept>
</ccs2012>
\end{CCSXML}
\ccsdesc[300]{Applied computing~Health informatics}
\ccsdesc[500]{Computing methodologies~Computer vision}
%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\keywords{Robotic Ultrasound, Multi-view Perception, View Selection}
%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.
\maketitle
\section{Introduction}
\label{sec:intro}
In medical ultrasound, perception typically requires multiple scan views through probe movement to reduce diagnostic ambiguity. As a non-invasive and real-time imaging modality, ultrasound is essential in clinical diagnosis, yet remains highly view-dependent~\cite{munir2025survey, elmekki2025comprehensive}. A single static image often fails to provide sufficient structural information due to acoustic occlusions and a limited field-of-view~\cite{jiang2023robotic, velikova2023lotus}. Consequently, multi-view perception through probe repositioning is necessary to improve anatomical coverage and reduce diagnostic uncertainty~\cite{men2023gaze, dai2021transmed, jiang2023robotic}.
% However, the use of multiple views comes at a high cost: realizing effective multi-view acquisition remains difficult and expensive in practice (e.g., it requires manual operation).
However, not all probe views are equally informative. Exhaustively acquiring a large number of views can introduce substantial redundancy and increase scanning and processing costs. In manual practice, selecting which views to acquire relies on the experience of the operator, requiring repeated repositioning and real-time interpretation, which makes the process time-consuming, operator-dependent, and difficult to standardize~\cite{jiang2023robotic, men2023gaze, munir2025survey}. To reduce this operator burden, existing methods attempt to automate probe navigation~\cite{bi2024machine, jiang2024intelligent}, yet most of these methods optimize probe movement based on immediate geometric or image-quality feedback without maintaining a spatial memory of the observed anatomy~\cite{bi2024machine, jin2023neu}. These approaches may therefore acquire redundant views while missing viewpoints that could resolve occlusions or reveal unseen anatomy. A key question is then how to determine, from partial observations, which probe positions to acquire next so as to maximize diagnostic coverage within a limited scanning budget.
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{images/intro1.pdf}
\vspace{-2em}
\caption{Uninformed exploration vs. active view exploration with SonoSelect. (Left): Uninformed exploration may sample views redundantly and is likely to fail to reach the target cyst occluded behind an overlying organ. (Right): SonoSelect selects the next probe position based on current observations, directing the probe toward the target cyst for diagnosis.}
\label{Fig:teaser}
\end{figure}
% In this paper, we aim to automate view selection in place of manual view picking (this paragraph covers Fig.~1, mainly our task definition).
To address this, we define an active view exploration task for ultrasound. As illustrated in \cref{Fig:teaser}, uninformed exploration samples views redundantly within a local region, leaving the target unobserved when it is occluded behind an overlying organ (\cref{Fig:teaser}, left). In contrast, active view exploration selects the next probe position based on current observations, directing the probe toward the target for diagnosis (\cref{Fig:teaser}, right). This formulation reduces redundant acquisition and increases the likelihood of obtaining the specific viewpoints needed for accurate diagnosis.
% Therefore, we propose SonoSelect (brief introduction, also noting how it differs from vanilla PPO).
We propose SonoSelect, an ultrasound-specific method, to address this task. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. As the probe moves, each new 2D ultrasound view is fused into a 3D spatial memory that represents what has been observed so far. This spatial memory then serves two purposes: it provides the agent with a volumetric summary of the current anatomical coverage, and it identifies regions that remain unobserved or uncertain, guiding where to scan next. Building on this representation, we design a reward that encourages probe movements toward greater organ coverage, lower reconstruction uncertainty, and less redundant scanning.
To optimize this reward, SonoSelect decomposes the task into a view selection module trained with Q-learning for long-horizon routing and an action refinement module trained with PPO for short-range probe control, so that each sub-problem operates at its own temporal scale.
% Our method performs well on multiple ultrasound tasks (better than PPO).
% It is also efficient: a few views are usually sufficient.
A preliminary study on multi-view organ classification shows that a small number of adaptively chosen views can match or exceed the all-view baseline, and that the optimal views vary across patient anatomies. Building on this observation, we further evaluate SonoSelect on a kidney cyst detection task, where the target is small and often occluded behind overlying organs. SonoSelect achieves higher organ and cyst coverage than conventional baselines, with trajectories that consistently converge toward the target rather than exhaustively scanning the volume. In particular, unlike standard RL approaches that show substantial performance degradation on unseen anatomies, SonoSelect maintains its coverage advantage across different patients, suggesting that the spatial memory provides a more generalizable basis for probe guidance than reward-driven exploration alone.
% Future significance: this work is an important step toward a fully autonomous robotic ultrasound system.
The proposed active view exploration approach has practical implications for both current clinical workflows and robotic ultrasound systems that automate probe positioning through mechanical arms. In current practice, the learned policy can serve as a decision support tool, suggesting informative scanning regions to assist sonographers during manual examination. For robotic systems, the selected next-best-view can be converted into target coordinates for a mechanical arm controller, providing the spatial goal that downstream motion planning modules use to reposition the probe.
% Our contributions are: (1) We formulate ultrasound active view exploration as a sequential decision-making problem and show, through a preliminary study, that a small number of adaptively chosen views can match or exceed exhaustive acquisition. (2) We design two evaluation tasks within the SonoGym simulation environment, a multi-view organ classification task using simulated volumes and a kidney cyst detection task using human CT volumes. (3) We propose SonoSelect, which maintains a 3D spatial memory of observed anatomy and decomposes exploration into sector selection and continuous probe control, achieving effective coverage and generalization to unseen anatomies on both tasks. \section{Related Work} \label{sec:related} \textbf{Ultrasound Perception.} Most research in ultrasound perception focuses on analyzing individual 2D views for tasks such as organ segmentation, anatomical classification, lesion detection, and representation learning. Velikova \emph{et~al.}~\cite{velikova2023lotus} proposed LOTUS to learn transferable ultrasound representations for downstream view-level tasks. UltraFedFM~\cite{jiang2025pretraining} scaled this line to federated self-supervised pretraining across institutions, organs, and modalities, substantially improving diagnosis and segmentation. In a visibility-aware setting, Weld \emph{et~al.} identified visible tissue and acoustic shadows by modeling probe-tissue contact confidence in intraoperative ultrasound~\cite{weld2025identifying}. Sonomate~\cite{guo2026visually} further extended ultrasound understanding to vision-language modeling for anatomy detection and question answering. These studies have substantially improved the interpretation of individual ultrasound views, but they generally assume that informative views are already available. To obtain richer anatomical observations beyond a single view, recent studies have explored robotic probe control, autonomous scanning, and target-view acquisition. 
Hase \emph{et~al.}~\cite{hase2020ultrasound} learned probe control for target-view recovery. Men \emph{et~al.}~\cite{men2023gaze} coupled gaze cues with probe guidance for obstetric ultrasound scanning. Jiang \emph{et~al.}~\cite{jiang2025towards} developed autonomous carotid ultrasonography by combining robotic scanning with learned perception, and Jiang \emph{et~al.}~\cite{jiang2024intelligent} further studied intelligent robotic sonography through reward learning from a few demonstrations. Duan \emph{et~al.}~\cite{duan2024safe} introduced safety-aware policy optimization for autonomous scanning, while Su \emph{et~al.}~\cite{su2025tissue} used tissue-view maps to guide structure-specific acquisition. SonoGym~\cite{ao2025sonogym} further provides a scalable simulation platform for learning-based robotic ultrasound. Overall, these methods advance robotic scanning and target-view acquisition, but most of them optimize local probe behavior or predefined targets rather than sequentially selecting complementary views for comprehensive multi-view perception. \textbf{Viewpoint Selection.} Viewpoint selection, often studied as next-best-view planning or active observation, aims to select future views that improve scene understanding, reconstruction quality, or task performance. Isler \emph{et~al.}~\cite{isler2016information} formulated active 3D reconstruction by maximizing volumetric information gain. Di \emph{et~al.}~\cite{di2024learning} learned viewpoint policies for active localization. Chen \emph{et~al.}~\cite{chen2024gennbv} developed generative next-best-view modeling, while Feng \emph{et~al.}~\cite{feng2024naruto} and Xue \emph{et~al.}~\cite{xue2024neural} explored neural view planning for active 3D reconstruction. Hou \emph{et~al.}~\cite{hou2024learning} further showed that multi-view recognition benefits from learning which views to acquire rather than aggregating views exhaustively. 
Collectively, these works show that actively selecting views can substantially improve information efficiency compared with fixed or exhaustive observation. Compared with its broad use in general active vision, the viewpoint-selection perspective has been only sparsely explored in ultrasound-related settings. Existing efforts mainly recover target views from local image feedback~\cite{hase2020ultrasound,men2023gaze}, learn task-specific rewards for robotic sonography~\cite{jiang2024intelligent}, impose safety-aware constraints during autonomous acquisition~\cite{duan2024safe}, or guide scanning with tissue-view maps for predefined structures~\cite{su2025tissue}. While closely related, these methods more commonly optimize image quality, probe safety, or navigation toward predefined views or structures. Relatively few studies explicitly formulate ultrasound scanning as a sequential viewpoint selection problem in which future probe views are chosen based on previously observed anatomy and accumulated spatial memory. Our work follows this viewpoint-selection perspective and instantiates it as active probe exploration for ultrasound. \begin{figure*}[t] \centering \includegraphics[width=\linewidth]{images/fig2.pdf} \caption{\textbf{Active probe exploration framework (SonoSelect) with $T$ scanning steps.} Solid lines indicate the network forward pass and dashed lines indicate the agent-environment interaction. The approach starts from an initial probe pose, and the agent produces an action $a_t$ from the current state $s_t$ through the view selection module $S$ and the action refinement module, then executes $a_t$ to acquire a new ultrasound view, which is fused into the probability map $\hat{V}_{t+1}$ via $U(\cdot)$. This process repeats for $T$ steps, progressively building a dense reconstruction. 
} \label{fig:overview} \end{figure*}
\section{Methodology}
\label{sec:method}
\subsection{SonoSelect Framework}
% One paragraph for the task definition (a large overview figure is needed here).
We formulate active view exploration for ultrasound perception as a sequential decision-making problem under partial observability. The complete 3D anatomy is not directly accessible; the agent can only observe it through partial 2D ultrasound views. The objective is to learn an exploration policy $\pi_\phi(a_t|s_t)$ that maps the current state $s_t$ to continuous kinematic actions $a_t$, maximizing the cumulative coverage of the target anatomical structure within a fixed budget of $T$ steps.
% State
\textbf{State.} Because the number of acquired views grows with each step, directly conditioning the policy on the full observation history is impractical. We instead maintain a 3D probability map $\hat{V}_t$ that fuses all past observations into a fixed-size voxel grid, serving as a spatial memory of the explored anatomy. At each step $t$, the 2D ultrasound view $I_t$ acquired at probe pose $(\mathbf{p}_t, \mathbf{q}_t)$ is fused into $\hat{V}_t$ via the volumetric fusion function $U(\cdot)$. Specifically, $U(\cdot)$ projects the segmentation mask of $I_t$ back into the voxel grid using the known probe pose and updates the occupancy probability of each intersected voxel via Bayesian fusion, so that repeated observations of the same region progressively reduce uncertainty. We formulate the state $s_t$ as:
\begin{equation} \label{eq:state} s_t = (\hat{V}_t, \mathbf{p}_t, \mathbf{q}_t), \end{equation}
where $\hat{V}_t$ aggregates all views observed up to step $t$ into a spatial probability map in which each voxel stores the estimated probability of tissue occupancy. Since $\hat{V}_t$ is defined over a fixed voxel grid, $s_t$ maintains the same dimensionality at every step. $\hat{V}_0$ is initialized to a uniform probability of $0.5$ to represent maximum uncertainty.
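To make the fusion operator $U(\cdot)$ concrete, the per-voxel Bayesian update can be sketched as follows. This is a minimal NumPy illustration under our own simplifying assumptions (the nearest-voxel projection of the image plane is assumed to be done beforehand, and per-view observations are treated as independent); the function and variable names are hypothetical, not the actual implementation:

```python
import numpy as np

def bayes_fuse(V_hat, voxel_idx, mask_prob):
    """Fuse per-pixel tissue probabilities of one view into the map V_hat.

    V_hat     : (X, Y, Z) occupancy-probability grid, initialized to 0.5.
    voxel_idx : (N, 3) integer voxel coordinates intersected by the view
                (obtained by projecting pixels with the known probe pose).
    mask_prob : (N,) tissue probabilities from the view's segmentation mask.
    """
    p_obs = np.clip(mask_prob, 1e-4, 1 - 1e-4)   # numerical safety
    x, y, z = voxel_idx.T
    prior = V_hat[x, y, z]
    # Bayes update in odds form: posterior odds = prior odds * observation odds
    odds = (prior / (1 - prior)) * (p_obs / (1 - p_obs))
    V_hat[x, y, z] = odds / (1 + odds)
    return V_hat

V = np.full((8, 8, 8), 0.5)                      # V_0: maximum uncertainty
idx = np.array([[1, 2, 3], [1, 2, 4]])
V = bayes_fuse(V, idx, np.array([0.9, 0.9]))     # first observation -> 0.9
V = bayes_fuse(V, idx, np.array([0.9, 0.9]))     # repeat -> ~0.988
```

Repeated consistent observations drive voxel probabilities toward certainty, which is the mechanism by which revisiting a region reduces uncertainty over the map.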
The state $s_t$ captures the spatial structure observed so far, but does not indicate how much of the target anatomy has been covered. To quantify exploration progress, we define a coverage ratio:
\begin{equation} \label{eq:coverage} c_t = \frac{\sum_{v} \hat{V}_{t}(v) \cdot g(v)}{\sum_{v} g(v) + \delta}, \quad c_t \in [0,1], \end{equation}
where the summation runs over all voxels $v$ in the reconstruction volume, $g(v)$ is the ground-truth binary mask of the target structure, and $\delta$ is a small constant that prevents division by zero. Since $c_t$ requires $g(v)$, it is available only during training. To leverage this privileged information without leaking it to the deployed policy, we follow the asymmetric actor-critic formulation~\cite{pinto2017asymmetric}: the actor $\pi_\phi(a_t | s_t)$ observes only $s_t$, while the critic $V_\psi(s_t, c_t)$ additionally receives $c_t$; at deployment, the critic is discarded.
% Action
\textbf{Action.} For a given state $s_t$ at time step $t \in \{1,\dots,T\}$, the agent outputs a continuous 4D action $a_t = (\Delta x, \Delta z, \Delta\phi, \Delta\psi)$, where $\Delta x$ and $\Delta z$ are translational displacements along the x and z axes, and $\Delta\phi$ and $\Delta\psi$ are rotational increments for roll and yaw, respectively. The y-axis translation is omitted because the probe maintains surface contact throughout scanning.
\begin{figure*}[t] \includegraphics[width=\linewidth]{images/architecture.pdf} \caption{\textbf{Policy Network Architecture.} The view selection module extracts sector features $f_i$ from the reconstruction volume, computes Q-values $Q(s_t, z_i)$ for each sector, and converts the selected sector $z_t$ into a positional guidance vector $\mathbf{v}_t^{\text{pos}}$. The action refinement module takes $s_t$ and $\mathbf{v}_t^{\text{pos}}$ as input and outputs a kinematic increment $\Delta_t$.
The residual fusion module combines $\mathbf{v}_t^{\text{pos}}$ and the scaled increment $\hat{\Delta}_t$ to produce the final action $a_t$.} \label{fig:sono} \end{figure*}
\textbf{Transition.} Upon executing action $a_t$, the probe pose is updated to $(\mathbf{p}_{t+1}, \mathbf{q}_{t+1})$ via the environment's kinematic function. The environment then returns a new ultrasound view $I_{t+1}$, which is fused into the probability map to produce $\hat{V}_{t+1}$, and the state transitions to $s_{t+1} = (\hat{V}_{t+1}, \mathbf{p}_{t+1}, \mathbf{q}_{t+1})$. The scanning process terminates when the step budget $T$ is exhausted.
% Reward
\textbf{Reward.} We design a dense, multi-objective reward function:
\begin{equation} r_t = w_{cov} \Delta c_t + w_{info} \Delta H_t^{echo} - \ell_t^{path}. \end{equation}
The first term $\Delta c_t$ measures the incremental coverage gain over the anatomical structures of interest, weighted by $w_{cov}$, and provides the main learning signal. Since a single partial view may refine the reconstruction without producing measurable coverage gain, the second term $\Delta H_t^{echo}$, weighted by $w_{info}$, captures the reduction in volumetric Shannon entropy over the target region, providing a learning signal for these intermediate steps. Because the policy maximizes the cumulative sum of all terms, entropy reduction alone cannot sustain high returns; the coverage term steers the policy toward diagnostically relevant regions rather than arbitrary high-uncertainty areas. Finally, to discourage large unnecessary movements that waste the limited step budget without acquiring new information, the path penalty is defined as
\begin{equation} \ell_t^{path} = w_{pos}\|\Delta \mathbf{p}_t\|_2 + w_{ang}\|\Delta \boldsymbol{\theta}_t\|_2, \end{equation}
where $\Delta \mathbf{p}_t$ and $\Delta \boldsymbol{\theta}_t$ denote the translational and rotational displacements between consecutive steps, respectively.
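For clarity, the reward terms above can be assembled as in the following sketch. The weight values are placeholders (the tuned $w_{cov}$, $w_{info}$, $w_{pos}$, $w_{ang}$ are not reproduced here), and the entropy helper assumes a binary per-voxel occupancy model:

```python
import numpy as np

def target_entropy(V_hat, g):
    """Volumetric Shannon entropy of V_hat restricted to target mask g."""
    p = np.clip(V_hat[g > 0], 1e-6, 1 - 1e-6)
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)).sum())

def step_reward(c_prev, c_curr, H_prev, H_curr, dp, dtheta,
                w_cov=1.0, w_info=0.1, w_pos=0.01, w_ang=0.01):
    """r_t = w_cov * dc_t + w_info * dH_t - l_t^path (placeholder weights)."""
    delta_c = c_curr - c_prev                 # incremental coverage gain
    delta_H = H_prev - H_curr                 # entropy *reduction* is rewarded
    path_penalty = w_pos * np.linalg.norm(dp) + w_ang * np.linalg.norm(dtheta)
    return w_cov * delta_c + w_info * delta_H - path_penalty

# A step gaining 10% coverage and 1 nat of entropy reduction, with no motion:
r = step_reward(0.1, 0.2, 5.0, 4.0, np.zeros(2), np.zeros(2))
```

Note that a stationary probe earns no path penalty, while a step that only reduces entropy (no coverage gain) still receives a positive signal through the second term.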
\subsection{Policy Network}
A flat continuous policy would need to simultaneously decide which region to visit and compute the kinematic actions to get there, yet once a target direction is determined, the remaining problem reduces to local navigation. SonoSelect therefore decomposes the task into two coupled modules: a view selection module that discretizes the operational area into a finite set of candidate directions and selects where to explore next, and an action refinement module that translates the selected direction into kinematic probe motion toward the chosen sector. Together, these two modules form the policy network that maps the current state $s_t$ to the final action $a_t$, corresponding to the module marked ``S'' in \cref{fig:overview}.

We partition the operational area around the current probe position into $S$ equiangular sectors. Each sector corresponds to a candidate direction the probe can explore next. To obtain a feature representation $f_i$ for each sector, the reconstruction volume $\hat{V}_t$ is first rearranged by treating elevation views as input channels and then processed by a shared 2D convolutional encoder.
For each sector $i$, a binary sector mask is applied to the encoded feature map, followed by parallel average and max pooling. The concatenated pooling result is then projected through a shared MLP to produce $f_i$. The sector features $\{f_i\}_{i=1}^{S}$ are each passed through a shared Q-network to produce action values $\{Q(s_t, z_i)\}_{i=1}^{S}$, where $Q(s_t, z_i)$ estimates the expected cumulative reward for navigating toward sector $z_i$. The Q-network shares parameters across all sectors, so it generalizes across candidate directions rather than learning separate value estimates for each. During training, the sector is chosen via an $\epsilon$-greedy strategy to balance exploration and exploitation; at deployment, the sector with the highest Q-value is selected. The geometric center of the selected sector $z_t$ is converted into a positional target vector $\mathbf{v}_t^{\text{pos}} \in \mathbb{R}^2$ in the probe's local coordinate frame, representing the translational direction toward the selected sector. This vector serves as the guidance signal for the downstream action refinement module. The action refinement module translates the selected sector into kinematic actions. It is implemented as a PPO-based actor-critic architecture. The actor takes as input the current state $s_t$ concatenated with the sector guidance vector $\mathbf{v}_t^{\text{pos}}$, and outputs a local kinematic increment $\Delta_t = [\Delta_t^{\text{pos}}, \Delta_t^{\text{ang}}] \in \mathbb{R}^4$. A residual scaling factor $\alpha$ is applied to obtain the scaled increment $\hat{\Delta}_t = \alpha \Delta_t$. 
The final action $a_t$ fuses the sector-derived target with this scaled increment through a residual connection: \begin{equation} \label{eq:residual} a_t^{\text{pos}} = \beta_t \mathbf{v}_t^{\text{pos}} + (1-\beta_t) \hat{\Delta}_t^{\text{pos}}, \quad a_t^{\text{ang}} = \hat{\Delta}_t^{\text{ang}}, \end{equation} where $\beta_t$ linearly anneals from an initial value $\beta_0$ to a final value $\beta_f$ over training. In early training, $\beta_t$ is large so that the translational component is dominated by the sector guidance $\mathbf{v}_t^{\text{pos}}$, providing a stable learning signal before the policy has converged. As training progresses, $\beta_t$ decreases and the policy's own output $\hat{\Delta}_t^{\text{pos}}$ gradually takes over, allowing the agent to refine the coarse sector-level direction into precise kinematic adjustments. The angular component $a_t^{\text{ang}}$ is determined entirely by the policy, as the view selection module provides only translational guidance. The critic estimates the state value $V_{\psi}(s_t, c_t)$ using the state (\cref{eq:state}) augmented with the coverage ratio (\cref{eq:coverage}). \subsection{Training Scheme} The view selection module and the action refinement module address different aspects of the problem and require different optimization objectives, but they share the same trajectory data. We employ a rollout-based sequential updating approach to jointly train the action refinement module (via PPO~\cite{schulman2017proximal}) and the view selection module (via Q-learning). Within each iteration, both modules are updated from the same collected rollouts, ensuring consistent learning signals across the two decision levels. The action refinement module is optimized using the standard PPO objective with Generalized Advantage Estimation (GAE)~\cite{schulman2015high}. 
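For reference, the GAE recursion used for the refinement module is standard; the following is a textbook sketch under illustrative $\gamma$ and $\lambda$ values, not the authors' code:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T.

    rewards, dones : length-T arrays from the rollout buffer.
    values         : length-T critic estimates V(s_t, c_t).
    last_value     : bootstrap value for the state after the final step.
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]              # stop bootstrapping at episode end
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        adv[t] = gae
        next_value = values[t]
    returns = adv + values                 # regression targets for the critic
    return adv, returns

adv, ret = gae_advantages(np.array([1.0, 1.0]), np.array([0.0, 0.0]),
                          0.0, np.array([0.0, 1.0]))
```

The termination mask prevents value bootstrapping across episode boundaries, which matters here because each scanning episode ends when the step budget $T$ is exhausted.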
The actor outputs kinematic increments $\Delta_t$ and is updated via clipped surrogate objectives, while the critic estimates $V_\psi(s_t, c_t)$ and provides the baseline for advantage computation. For the view selection module, we train the sector Q-network using Monte Carlo rollout returns as regression targets. The action-value function $Q_{\theta}(s_t, z_t)$ estimates the expected discounted return after selecting sector $z_t$ at state $s_t$:
\begin{equation} Q_{\theta}(s_{t}, z_{t}) = \mathbb{E} \left( \sum_{\tau=t}^{T} \gamma^{\tau-t} r_{\tau} \right), \end{equation}
where $\mathbb{E}(\cdot)$ denotes the expectation and $\gamma \in [0, 1]$ is the discount factor. Although both modules share the same reward signal, they require different value representations. The PPO critic learns a state value $V_\psi(s_t, c_t)$ used to compute advantages for the action refinement module, while the view selection module learns action-conditional values $Q_\theta(s_t, z_i)$ that compare the expected return across candidate sectors. This difference motivates maintaining separate value functions despite the shared reward. The return from the collected rollouts serves as the supervision target for the Q-network:
\begin{equation} \label{eq:return} y_{t} = \begin{cases} r_{t} + \gamma (1 - d_{t}) y_{t+1}, & \text{if } t < T \\ r_{T}, & \text{otherwise} \end{cases}, \end{equation}
where $d_t$ is the termination mask. The Q-network is then optimized using the $L_2$ distance loss:
\begin{equation} \label{eq:qloss} \mathcal{L}_{Q} = \lambda_Q \frac{1}{T} \sum_{t=1}^{T} \text{MSE}(Q_{\theta}(s_{t}, z_{t}), y_{t}), \end{equation}
where $\lambda_Q$ controls the loss weight.
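The recursive target in \cref{eq:return} and the loss in \cref{eq:qloss} reduce to a short computation; the sketch below uses illustrative values for $\gamma$ and $\lambda_Q$:

```python
import numpy as np

def mc_return_targets(rewards, dones, gamma=0.99):
    """Backward recursion y_t = r_t + gamma * (1 - d_t) * y_{t+1}, y_T = r_T."""
    T = len(rewards)
    y = np.zeros(T)
    y[-1] = rewards[-1]
    for t in range(T - 2, -1, -1):
        y[t] = rewards[t] + gamma * (1.0 - dones[t]) * y[t + 1]
    return y

def q_loss(q_values, targets, lambda_q=1.0):
    """L_Q = lambda_Q * (1/T) * sum_t MSE(Q(s_t, z_t), y_t)."""
    return lambda_q * float(np.mean((q_values - targets) ** 2))

y = mc_return_targets(np.array([1.0, 0.0, 2.0]), np.array([0.0, 0.0, 1.0]))
loss = q_loss(np.array([y[0], y[1], y[2]]), y)   # perfect predictions -> zero loss
```

Because the targets come from completed rollouts rather than bootstrapped estimates, this Monte Carlo scheme needs no separate target network for the Q-regression.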
In joint training, the two objectives are optimized in separate backward passes within each iteration. First, the PPO objective $\mathcal{L}_{\text{PPO}}$ updates the action refinement module and the critic. Then, in a separate backward pass, the Q-learning loss $\mathcal{L}_{Q}$ updates the view selection module, including the Q-network and its associated feature encoder. This sequential scheme prevents gradient interference between the two objectives. The complete procedure is summarized in \cref{alg:sonoselect}. \begin{algorithm}[t] \caption{Training Pipeline of SonoSelect} \label{alg:sonoselect} \begin{algorithmic}[1] \small \STATE \textbf{Input}: Env $\mathcal{E}$, budget $T$, exploration rate $\epsilon$, scaling factor $\alpha$, annealing weight $\beta_t$, Q-loss weight $\lambda_Q$. \STATE \textbf{Initialize}: Q-network $Q_{\theta}$, actor $\pi_{\phi}$, critic $V_{\psi}$. \FOR{each training iteration} \STATE Initialize rollout buffers $\mathcal{B}_{\text{PPO}}, \mathcal{B}_{Q} \leftarrow \emptyset$ \STATE Reset environment: $s_1 \leftarrow \mathcal{E}.\text{reset}()$ \FOR{$t = 1$ to $T$} \STATE Extract sector features $\{f_i\}_{i=1}^{S}$ from $\hat{V}_t$ in state $s_t$ (\cref{eq:state}) \STATE Compute Q-values $\{Q_{\theta}(s_t, z_i)\}_{i=1}^{S}$ for all sectors \STATE Select sector via $\epsilon$-greedy: with probability $\epsilon$ sample $z_t$ uniformly from $\{z_1, \dots, z_S\}$, otherwise $z_t = \arg\max_{z_i} Q_{\theta}(s_t, z_i)$ \STATE Compute guidance $\mathbf{v}_t^{\text{pos}} \leftarrow \text{GeometricCenter}(z_t)$ \STATE Sample $\Delta_t \sim \pi_{\phi}(\cdot \mid s_t, \mathbf{v}_t^{\text{pos}})$; scale $\hat{\Delta}_t \leftarrow \alpha \Delta_t$ \STATE Fuse action: $a_t^{\text{pos}} \leftarrow \beta_t \mathbf{v}_t^{\text{pos}} + (1{-}\beta_t) \hat{\Delta}_t^{\text{pos}}$, $a_t^{\text{ang}} \leftarrow \hat{\Delta}_t^{\text{ang}}$ (\cref{eq:residual}) \STATE Execute $a_t$ in $\mathcal{E}$; observe $s_{t+1}, r_t, d_t$ \STATE Update probe pose: 
$(\mathbf{p}_{t+1}, \mathbf{q}_{t+1})$ \STATE Store $(s_t, c_t, a_t, r_t, s_{t+1}, c_{t+1}, d_t)$ in $\mathcal{B}_{\text{PPO}}$, where $c_t$ is computed via \cref{eq:coverage} \STATE Store $(s_t, z_t, r_t, s_{t+1}, d_t)$ in $\mathcal{B}_{Q}$ \ENDFOR \STATE Compute GAE advantages from $\mathcal{B}_{\text{PPO}}$; update $\pi_{\phi}, V_{\psi}$ via $\mathcal{L}_{\text{PPO}}$ \STATE Compute discounted returns $\{y_t\}_{t=1}^{T}$ from $\mathcal{B}_{Q}$ via \cref{eq:return} \STATE Update Q-network $Q_{\theta}$ and shared encoder via \cref{eq:qloss} \ENDFOR \end{algorithmic} \end{algorithm} \section{Experiment} We evaluate our approach in two stages. Sec.~\ref{sec:discrete_classification} studies multi-view classification in a simplified discrete setting, and Sec.~\ref{sec:continuous_detection} evaluates SonoSelect on a continuous kidney cyst detection task. \subsection{Preliminary: Multi-view Classification} \label{sec:discrete_classification} Before evaluating the full scanning pipeline, we first ask a simpler question: given a set of candidate viewpoints, can an adaptive selection policy identify the few informative views for each instance? To answer this, we use a simplified setting where the probe can access any candidate viewpoint without movement cost. As the adaptive selection method, we adopt MVSelect~\cite{hou2024learning}, which sequentially chooses the next view conditioned on previously acquired observations. This setup lets us test whether observation-driven view selection is beneficial before introducing continuous probe control. \textbf{Datasets.} We construct two custom multi-view ultrasound datasets, both adopting a strict 80\%/20\% train-test split and extracting $120 \times 120$ 2D ultrasound views under two distinct viewpoint configurations (12-view and 20-view configurations). 
\begin{itemize} \item \textit{Geometry:} This synthetic dataset comprises 10 distinct categories (sphere, ellipsoid, cube, cuboid, cylinder, capsule, cone, torus, octahedron, and cross), with 150 unique instances per category. \item \textit{Organ:} To move closer to clinical realism, we introduce a more challenging dataset comprising real human anatomical structures sourced from the publicly available TotalSegmentator dataset~\cite{wasserthal2023totalsegmentator}. It contains 6 distinct categories: left kidney, liver, pancreas, spleen, aorta, and stomach, with 100 unique patient instances per category. \end{itemize} \textbf{Task Network.} For both datasets, we employ a ResNet-18~\cite{he2016deep} backbone combined with a max-pooling aggregation module. The network is trained offline on complete multi-view sequences so that the learned representations are not biased toward any particular view subset. \textbf{Quantitative Results.} The classification performance on both datasets is summarized in \cref{tab:combined_results}. We compare five selection strategies: (1) \textit{dataset-level oracle}, which uses the fixed pair of views that achieves the highest average accuracy across all instances in the training set; (2) \textit{instance-level oracle}, which selects the optimal pair for each test instance by exhaustive search; (3) \textit{random selection}, which samples two views uniformly; (4) \textit{validation best policy}, which selects the fixed pair that achieves the highest accuracy on the validation set; and (5) \textit{MVSelect}~\cite{hou2024learning}, which sequentially selects views conditioned on previous observations.
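The instance-level oracle amounts to a brute-force search over view pairs; it can be sketched as follows (an illustrative sketch, where \texttt{score\_fn} is a hypothetical callback returning the classifier's accuracy for a given view subset):

```python
from itertools import combinations

def instance_level_oracle(score_fn, n_views, k=2):
    """Exhaustively evaluate all size-k view subsets for one instance
    and return the best-scoring combination (brute-force oracle)."""
    best_subset, best_score = None, float("-inf")
    for subset in combinations(range(n_views), k):
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

The search evaluates $\binom{N}{2}$ subsets per instance (66 for 12 views, 190 for 20 views), which is feasible offline but clearly impractical during live scanning, motivating the learned selection policy.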
\begin{table}[t] \centering \small \setlength{\tabcolsep}{0.5mm} \renewcommand{\arraystretch}{1.1} \begin{tabular}{l|cc|cc} \multirow{2}{*}{view selection} & \multicolumn{2}{c|}{Geometry} & \multicolumn{2}{c}{Organ} \\ \cline{2-5} & 12 views & 20 views & 12 views & 20 views \\ \hline N/A: all $N$ views & 84.02 & 92.96 & 92.50 & 91.73 \\ \hline dataset-lvl oracle & 79.07 $\pm$ 0.71 & 83.42 $\pm$ 2.23 & 92.75 $\pm$ 0.64 & 90.13 $\pm$ 1.98 \\ instance-lvl oracle & 93.80 $\pm$ 0.47 & 99.23 $\pm$ 0.61 & 98.04 $\pm$ 0.61 & 99.32 $\pm$ 1.06 \\ \hline random selection & 74.61 $\pm$ 2.32 & 68.71 $\pm$ 8.34 & 87.44 $\pm$ 4.17 & 73.91 $\pm$ 11.58 \\ validation best policy& 71.57 $\pm$ 2.46 & 70.87 $\pm$ 5.60 & 88.51 $\pm$ 1.61 & 83.73 $\pm$ 4.06 \\ \hline MVSelect~\cite{hou2024learning}& 79.11 $\pm$ 1.33 & 89.59 $\pm$ 1.61 & 96.23 $\pm$ 1.62 & 97.00 $\pm$ 1.04 \\ \end{tabular} \caption{Classification accuracy (\%) on Geometry and Organ. Each adaptive or fixed policy selects two views per instance from the candidate set.} \label{tab:combined_results} \end{table} We first note that using all $N$ views does not yield the highest accuracy. The instance-level oracle, which selects the best two views per instance via exhaustive search, substantially surpasses the all-view baseline on both datasets. This suggests that not all views contribute positively to the final prediction. However, the dataset-level oracle, which fixes the same two views across all instances, performs considerably worse than the instance-level oracle. This gap confirms that the most informative views vary across instances. Random selection performs the worst overall, with high variance reflecting the inconsistency of uninformed view choices. MVSelect, which selects views conditioned on each instance's observations, approaches the instance-level oracle on both datasets.
This confirms that an adaptive policy can recover near-optimal view combinations without exhaustive search. \cref{fig:Experiment1} visualizes the selected viewpoints for representative instances, showing that the policy chooses different orientations for different objects and anatomies. Together, these results support the two properties that motivate SonoSelect: (1) a small number of well-chosen views can match or exceed the performance of exhaustive acquisition, and (2) the optimal views are instance-dependent and vary across patient anatomies. Building on these findings, SonoSelect further addresses how to acquire such informative views by adaptively guiding probe movement based on a 3D spatial memory of the observed anatomy. \subsection{Kidney Cyst Detection} \label{sec:continuous_detection} Building on the preliminary findings, we now evaluate whether SonoSelect can achieve high diagnostic coverage with fewer, adaptively selected views when the probe moves sequentially along the body surface, where each movement carries a scanning cost and the agent receives only partial observations of the underlying anatomy. We test on a kidney cyst detection task, which requires the agent to balance broad organ coverage with precise localization of small pathological targets, and evaluate generalization to unseen patient anatomies. \textbf{Experimental Setup.} The primary task requires the agent to dynamically scan the left kidney and identify renal cysts. We utilize 3D clinical CT volumes from the TotalSegmentator dataset~\cite{wasserthal2023totalsegmentator}, selecting volumes that contain annotated kidney and cyst segmentations.
The dataset is split at the patient level into a training set of seen anatomies used for policy learning and a test set of entirely unseen patient anatomies that do not appear during training. \textbf{Implementation details.} We train all models using PPO with 64 parallel environments for a total of 500K environment steps. We use the Adam optimizer with a learning rate of $3 \times 10^{-4}$, adjusted by a KL-adaptive scheduler. The discount factor is $\gamma = 0.97$ with GAE parameter $\lambda = 0.95$. For the auxiliary Q-network, we use a lower learning rate of $1 \times 10^{-4}$ and schedule its loss weight $\lambda_Q$ from 0.05 to 0.02 over 150K steps to gradually reduce its influence as the policy matures. A target network updated via Polyak averaging ($\tau = 0.005$) is employed to provide stable regression targets for Q-value training. The residual guidance coefficient $\beta_t$ is linearly annealed from 1.0 to 0.05 over 300K steps, allowing the policy to transition from heuristic-guided exploration to autonomous scanning. The policy and value networks share a lightweight 2D CNN encoder. The encoder receives the reconstructed ultrasound volume by stacking depth slices along the channel dimension, converting the 3D input into a multi-channel 2D representation. This design avoids the computational overhead of 3D convolutions while preserving depth information through channel-wise encoding. Each episode has a step budget of $T = 600$. All experiments are conducted on a single NVIDIA RTX 4090D GPU. \textbf{Baselines.} In this fully continuous setting, we benchmark SonoSelect against baselines representing alternative exploration strategies. \textit{Random} applies uniformly sampled kinematic actions at each step, providing a lower bound on diagnostic yield without any learned or heuristic guidance. 
\textit{PPO}~\cite{schulman2017proximal} trains a single continuous control policy that directly maps observations to probe actions, testing whether end-to-end reinforcement learning can implicitly learn effective exploration without explicit spatial planning. \textit{VIG} (Volumetric Information Gain)~\cite{isler2016information} represents classical Next-Best-View planning driven by entropy maximization, testing whether uncertainty reduction alone provides sufficient guidance for diagnostic exploration. \textit{RND}~\cite{burda2018exploration} provides a state-visitation driven exploration bonus, testing whether encouraging novel state visits improves coverage without task-specific guidance. \begin{figure}[t] \centering \includegraphics[width=\linewidth]{images/Experiment1.pdf} \caption{\textbf{Qualitative results of the discrete view selection policy.} We visualize the selected viewpoints for both the Geometry (top) and Organ (bottom) datasets.} \label{fig:Experiment1} \end{figure} \textbf{Quantitative Results.} \cref{tab:main_results} presents scanning performance on seen and unseen patient anatomies. 
\begin{table*}[t] \centering \footnotesize \setlength{\tabcolsep}{0.5mm} \renewcommand{\arraystretch}{1.2} \begin{tabular}{lc|cccccc|cccccc} \hline \multirow{2}{*}{\textbf{Method}} & \multirow{2}{*}{\textbf{Sectors}} & \multicolumn{6}{c|}{\cellcolor{yellow!50}\textbf{Seen Patient}} & \multicolumn{6}{c}{\cellcolor{orange!20}\textbf{Unseen Patient}} \\ \cline{3-14} & & \textbf{Kidney (\%)} $\uparrow$ & \textbf{Cyst (\%)} $\uparrow$ & \textbf{Dice (\%)} $\uparrow$ & \textbf{IoU (\%)} $\uparrow$ & \textbf{Trans.\ (voxels)} $\downarrow$ & \textbf{Rot.\ ($^\circ$)} $\downarrow$ & \textbf{Kidney (\%)} $\uparrow$ & \textbf{Cyst (\%)} $\uparrow$ & \textbf{Dice (\%)} $\uparrow$ & \textbf{IoU (\%)} $\uparrow$ & \textbf{Trans.\ (voxels)} $\downarrow$ & \textbf{Rot.\ ($^\circ$)} $\downarrow$ \\ \hline Random & - & 13.79 & 8.38 & 22.82 & 14.13 & 925.81 & 2623.35 & 23.41 & 1.88 & 37.41 & 24.50 & 937.55 & 2624.38 \\ PPO & - & 60.46 & 44.94 & 69.76 & 53.69 & 427.52 & 294.24 & 33.14 & 12.10 & 47.97 & 34.62 & 376.92 & 247.93 \\ RND & - & 64.99 & 45.53 & 72.59 & 57.08 & 549.41 & 318.89 & 48.91 & 23.41 & 64.94 & 49.33 & 540.09 & 272.15 \\ VIG & - & 56.95 & 40.35 & 63.88 & 50.36 & 484.74 & 350.06 & 48.62 & 23.12 & 64.84 & 49.03 & 463.56 & 417.47 \\ \hline \textbf{SonoSelect} & 4 & 56.52 & 41.50 & 66.66 & 50.40 & \textbf{398.51} & 258.11 & 23.06 & 3.67 & 37.05 & 24.24 & \textbf{353.37} & 201.96 \\ \textbf{SonoSelect} & 8 & 62.39 & 47.13 & 71.52 & 55.93 & 570.16 & \textbf{186.57} & 50.87 & \textbf{30.91} & 67.53 & 51.50 & 561.80 & \textbf{198.36} \\ \textbf{SonoSelect} & 16 & \textbf{67.55} & \textbf{48.37} & \textbf{73.88} & \textbf{58.62} & 423.40 & 311.65 & \textbf{54.56} & 27.13 & \textbf{70.76} & \textbf{54.78} & 446.98 & 311.12 \\ \textbf{SonoSelect} & 32 & 65.07 & 45.67 & 72.70 & 57.20 & 532.87 & 258.49 & 52.14 & 27.90 & 66.97 & 51.16 & 574.52 & 213.78 \\ \hline \end{tabular} \caption{Quantitative comparison of active scanning performance. SonoSelect shows smaller performance degradation on unseen patient anatomies compared to other learned baselines.} \label{tab:main_results} \end{table*} \begin{figure*}[t] \begin{minipage}[t]{0.32\textwidth} \centering \includegraphics[width=\linewidth]{images/curve_cyst_coverage_t.pdf} \captionof{figure}{Cyst coverage over scanning steps on unseen anatomies.} \label{fig:cyst_curve} \end{minipage}\hfill \begin{minipage}[t]{0.32\textwidth} \centering \includegraphics[width=\linewidth]{images/pareto_tradeoff_allpoints.pdf} \captionof{figure}{Episode-level cyst coverage vs.\ trajectory length on unseen anatomies. Each point represents one episode.} \label{fig:tradeoff} \end{minipage}\hfill \begin{minipage}[t]{0.32\textwidth} \centering \includegraphics[width=\linewidth]{images/kde_1d_cyst_coverage_sonoselect_vs_ppo.pdf} \captionof{figure}{Distribution of per-episode cyst coverage on unseen anatomies.} \label{fig:kde} \end{minipage} \end{figure*} On seen anatomies, SonoSelect (16 sectors) achieves the highest scores across all four diagnostic metrics. The differences among learned methods remain moderate on seen anatomies, since all methods have access to the same training distribution. The more informative comparison is therefore on unseen anatomies, where the methods can no longer rely on patterns encountered during training. All methods degrade on unseen anatomies, but the degree of degradation differs. PPO exhibits the largest drop, with kidney coverage falling from 60.46\% to 33.14\% and cyst coverage from 44.94\% to 12.10\%, suggesting that its policy overfits to training-specific spatial patterns. RND and VIG show better retention than PPO, with cyst coverage reaching 23.41\% and 23.12\%, respectively, on unseen data. RND encourages visiting novel states and VIG maximizes observation entropy, both of which provide some robustness to anatomy changes.
However, both methods still fall behind SonoSelect on all four diagnostic metrics on unseen data. SonoSelect (16 sectors) achieves the highest scores and shows the smallest performance gap between seen and unseen anatomies, suggesting that the spatial memory provides a more transferable basis for probe guidance than novelty or entropy driven exploration alone. We also report translation and rotation errors in \cref{tab:main_results}. These metrics reflect cumulative probe displacement during scanning rather than diagnostic quality. Among the learned methods, translation and rotation errors do not show a consistent ranking, as different exploration strategies produce trajectories of varying lengths and orientations. SonoSelect (16 sectors) achieves moderate translation and rotation values while obtaining the highest diagnostic coverage, indicating that its coverage advantage comes from more targeted probe movement rather than simply longer trajectories. \textbf{Effect of Sector Granularity.} The number of sectors controls the granularity of the routing decision (\cref{tab:main_results}). With only 4 sectors, the routing module partitions the search space too coarsely, and performance on unseen anatomies drops close to the random baseline. Increasing to 8 sectors provides sufficient directional resolution for the routing policy to generalize, yielding a substantial improvement on unseen data. Performance peaks at 16 sectors, where the routing granularity balances expressiveness against the difficulty of learning a reliable policy from limited training episodes. At 32 sectors, performance slightly decreases, indicating that finer partitioning introduces more choices than the policy can reliably distinguish given the available training data. \textbf{Scanning Efficiency.} \cref{fig:cyst_curve} compares how efficiently each method converts scanning steps into diagnostic coverage. 
In the initial phase, RND achieves the fastest coverage growth, followed by VIG and SonoSelect. However, after approximately 150 steps, SonoSelect's growth rate accelerates and surpasses both RND and VIG, with the gap continuing to widen in later stages. By 600 steps, SonoSelect achieves 35\% cyst coverage, compared to 23\% for RND and 23\% for VIG. PPO's curve flattens early and achieves only 12\% cyst coverage by 600 steps, the lowest among all learning-based methods. \textbf{Episode-level Analysis.} The scatter plot in~\cref{fig:tradeoff} examines this efficiency at the episode level. PPO clusters in the bottom-left quadrant, indicating frequent near-zero-coverage episodes with short trajectories; on unseen anatomies, the agent often remains confined to local regions. SonoSelect occupies the upper-right quadrant, where longer trajectories correspond to higher diagnostic coverage. The per-episode distribution in~\cref{fig:kde} further illustrates this contrast: PPO's cyst coverage concentrates near zero, while SonoSelect's distribution shifts toward higher values. \textbf{Qualitative Results.} Representative trajectories in~\cref{fig:qualitative_trajectories} show the same pattern spatially. PPO produces circular movements far from the kidney, with effective scanning ratios of 13.5\%--19.6\%, indicating that the probe spends most of its budget on non-target regions. SonoSelect follows the kidney contours with substantially higher ratios, reflecting that the sector routing module directs the probe toward diagnostically relevant areas. Across all three analyses, the results indicate that SonoSelect achieves higher coverage not by scanning longer, but by allocating its scanning budget more effectively, reducing redundant acquisition while increasing the likelihood of reaching diagnostically relevant viewpoints.
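We read the effective scanning ratio as the fraction of scanning steps whose acquired frame intersects the target region; under that assumption it reduces to a one-line computation (an illustrative sketch, not the exact implementation):

```python
def effective_scanning_ratio(on_target):
    """Fraction of steps whose acquired frame overlaps the target
    (assumed definition; per-step flags would come from the simulator)."""
    on_target = list(on_target)
    if not on_target:
        return 0.0
    return sum(1 for flag in on_target if flag) / len(on_target)
```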
\begin{figure}[t] \centering \includegraphics[width=\linewidth]{images/traj.pdf} \caption{Scanning trajectories on unseen data. Red/purple: on/off-target; percentages: effective scanning ratio.} \label{fig:qualitative_trajectories} \end{figure} \subsection{Ablation Studies} \label{sec:ablation} To validate the core architectural designs of SonoSelect, we conduct ablation experiments on the kidney cyst detection task using unseen patient data. We ablate four designs: the learned routing policy, the per-sector feature representation, the residual control module, and the guidance annealing schedule. Each ablation removes or fixes one component while keeping the rest unchanged. The quantitative comparisons are summarized in \cref{tab:ablation}. \textbf{Effect of Learned Routing.} We first evaluate the high-level decision maker by replacing the learned routing policy with random view selection. Without a task-driven geometric prior, the continuous policy receives arbitrary directional targets, leading to uncoordinated probe movement. As shown in \cref{tab:ablation}, this variant shows a notable drop in cyst coverage, with the score falling by roughly half: without learned routing, the continuous policy can still execute local movements but lacks directional guidance, confirming that the routing policy is necessary to constrain the search space and direct the probe toward diagnostically relevant regions. \textbf{Necessity of Explicit Sector Features.} The w/o Sector Features variant replaces the per-sector feature vectors with uniform values, making all sectors appear identical to the Q-network. Although the Q-network still receives the global observation $s_t$, it cannot distinguish sectors based on their spatial content in the reconstruction volume. As a result, the Q-network selects sectors without considering what each region contains, leading to reduced coverage for both kidney and cyst targets.
This drop confirms that the Q-network relies on per-sector spatial features to make informed routing decisions: without them, it cannot leverage the spatial memory to differentiate among candidate sectors. \begin{table}[t] \centering \small \begin{tabular}{l|cccc} \hline Method & Kidney (\%) & Cyst (\%) & Dice (\%) & IoU (\%) \\ \hline Random Routing & 47.85 & 13.77 & 62.80 & 45.53 \\ w/o Sector Features & 45.32 & 18.39 & 61.35 & 45.24 \\ w/o Residual Control & 49.94 & 16.18 & 59.23 & 44.92 \\ Fixed $\beta=1.0$ & 49.41 & 27.60 & 67.04 & 50.45 \\ \textbf{SonoSelect} & 54.56 & 27.13 & 70.76 & 54.78 \\ \hline \end{tabular} \caption{Ablation study of SonoSelect components on unseen anatomies.} \label{tab:ablation} \vspace{-0.5cm} \end{table} \textbf{Role of Residual Control.} The w/o Residual Control variant removes the low-level kinematic adjustments and relies solely on the sector-level waypoints for probe guidance. This variant achieves the lowest Dice and IoU among all configurations, while its kidney coverage remains comparable to the other ablated variants. This asymmetry indicates that sector-level routing provides sufficient guidance for reaching the target region, but localizing small structures such as cysts requires the finer probe adjustments that the residual control module provides. \textbf{Effect of Guidance Annealing.} In the full model, the guidance coefficient $\beta_t$ is linearly annealed from 1.0 to 0.05 during training, gradually shifting translational control from the sector guidance to the policy's own output. The Fixed $\beta=1.0$ variant keeps $\beta_t$ at 1.0 throughout training and deployment. As shown in \cref{tab:ablation}, this variant achieves cyst coverage comparable to the full model, but kidney coverage and reconstruction quality both decline.
This suggests that the annealing schedule allows the policy to learn fine-grained translational adjustments beyond the sector center, which contributes to more complete coverage of the kidney surface. \section{Conclusion} We propose SonoSelect, an active probe exploration framework for robotic ultrasound that selects informative viewpoints without exhaustive scanning or predefined trajectories. By bridging discrete high-level regional routing with continuous low-level kinematic control, SonoSelect learns to resolve anatomical ambiguities and achieves robust generalization to unseen anatomies where standard reinforcement learning approaches show substantial performance degradation. This approach represents a step toward autonomous robotic ultrasound deployment in clinical workflows. While the current evaluation is conducted in simulation, the hierarchical formulation of coupling discrete region selection with continuous probe control provides a principled way to handle the view-dependent nature of ultrasound imaging. This work suggests that structured, observation-driven exploration can serve as an effective mechanism for multi-view ultrasound perception, reducing the number of views needed for accurate diagnosis while maintaining robust coverage across diverse patient anatomies. %% %% The next two lines define the bibliography style to be used, and %% the bibliography file. \bibliographystyle{ACM-Reference-Format} \bibliography{sample-base} \end{document} \endinput %% %% End of file `sample-sigconf-authordraft.tex'.