%% This is file `sample-sigconf-authordraft.tex', %% generated with the docstrip utility. %% %% The original source files were: %% %% samples.dtx (with options: `all,proceedings,bibtex,authordraft') %% %% IMPORTANT NOTICE: %% %% For the copyright see the source file. %% %% Any modified versions of this file must be renamed %% with new filenames distinct from sample-sigconf-authordraft.tex. %% %% For distribution of the original source see the terms %% for copying and modification in the file samples.dtx. %% %% This generated file may be distributed as long as the %% original source files, as listed above, are part of the %% same distribution. (The sources need not necessarily be %% in the same archive or directory.) %% %% %% Commands for TeXCount %TC:macro \cite [option:text,text] %TC:macro \citep [option:text,text] %TC:macro \citet [option:text,text] %TC:envir table 0 1 %TC:envir table* 0 1 %TC:envir tabular [ignore] word %TC:envir displaymath 0 word %TC:envir math 0 word %TC:envir comment 0 0 %% %% The first command in your LaTeX source must be the \documentclass %% command. %% %% For submission and review of your manuscript please change the %% command to \documentclass[manuscript, screen, review]{acmart}. %% %% When submitting camera ready or to TAPS, please change the command %% to \documentclass[sigconf]{acmart} or whichever template is required %% for your publication.
%% %% \documentclass[sigconf, screen, review, anonymous]{acmart} \usepackage{multirow} \usepackage{algorithmic} \usepackage{algorithm} \usepackage{colortbl} \usepackage{wrapfig} \usepackage{hyperref} \usepackage[capitalize]{cleveref} \usepackage{subcaption} %% %% \BibTeX command to typeset BibTeX logo in the docs \AtBeginDocument{% \providecommand\BibTeX{{% Bib\TeX}}} %% Rights management information. This information is sent to you %% when you complete the rights form. These commands have SAMPLE %% values in them; it is your responsibility as an author to replace %% the commands and values with those provided to you when you %% complete the rights form. \setcopyright{acmlicensed} \copyrightyear{2018} \acmYear{2018} \acmDOI{XXXXXXX.XXXXXXX} %% These commands are for a PROCEEDINGS abstract or paper. \acmConference[Conference acronym 'XX]{Make sure to enter the correct conference title from your rights confirmation email}{June 03--05, 2018}{Woodstock, NY} %% %% Uncomment \acmBooktitle if the title of the proceedings is different %% from ``Proceedings of ...''! %% %%\acmBooktitle{Woodstock '18: ACM Symposium on Neural Gaze Detection, %% June 03--05, 2018, Woodstock, NY} \acmISBN{978-1-4503-XXXX-X/2018/06} %% %% Submission ID. %% Use this when submitting an article to a sponsored event. You'll %% receive a unique submission ID from the organizers %% of the event, and this ID should be used as the parameter to this command. \acmSubmissionID{3405} %% %% For managing citations, it is recommended to use bibliography %% files in BibTeX format. %% %% You can then either use BibTeX with the ACM-Reference-Format style, %% or BibLaTeX with the acmnumeric or acmauthoryear styles, that include %% support for advanced citation of software artefacts from the %% biblatex-software package, also separately available on CTAN. %% %% Look at the sample-*-biblatex.tex files for templates showcasing %% the biblatex styles.
%% %% %% The majority of ACM publications use numbered citations and %% references. The command \citestyle{authoryear} switches to the %% "author year" style. %% %% If you are preparing content for an event %% sponsored by ACM SIGGRAPH, you must use the "author year" style of %% citations and references. %% Uncommenting %% the next command will enable that style. %%\citestyle{acmauthoryear} %% %% end of the preamble, start of the body of the document source. \begin{document} %% %% The "title" command has an optional parameter, %% allowing the author to define a "short title" to be used in page headers. \title{SonoSelect: Efficient Ultrasound Perception via \\ Active Probe Exploration } %% %% The "author" command and its associated commands are used to define %% the authors and their affiliations. %% Of note is the shared affiliation of the first two authors, and the %% "authornote" and "authornotemark" commands %% used to denote shared contribution to the research. \author{Ben Trovato} \authornote{Both authors contributed equally to this research.} \email{trovato@corporation.com} \orcid{1234-5678-9012} \author{G.K.M. Tobin} \authornotemark[1] \email{webmaster@marysville-ohio.com} \affiliation{% \institution{Institute for Clarity in Documentation} \city{Dublin} \state{Ohio} \country{USA} } %% %% By default, the full list of authors will be used in the page %% headers. Often, this list is too long, and will overlap %% other information printed in the page headers. This command allows %% the author to define a more concise list %% of authors' names for this purpose. \renewcommand{\shortauthors}{Trovato et al.} %% %% The abstract is a short summary of the work to be presented in the %% article. \begin{abstract} Ultrasound perception typically requires multiple scan views through probe movement to reduce diagnostic ambiguity, mitigate acoustic occlusions, and improve anatomical coverage. However, not all probe views are equally informative. 
Exhaustively acquiring a large number of views can introduce substantial redundancy and increase scanning and processing costs. To address this, we define an active view exploration task for ultrasound and propose SonoSelect, an ultrasound-specific method that adaptively guides probe movement based on current observations. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. Each new 2D ultrasound view is fused into a 3D spatial memory of the observed anatomy, which guides the next probe position. On top of this formulation, we propose an ultrasound-specific objective that favors probe movements with greater organ coverage, lower reconstruction uncertainty, and less redundant scanning. Experiments in an ultrasound simulator show that SonoSelect achieves promising multi-view organ classification accuracy using only 2 out of N views. Furthermore, for a more difficult kidney cyst detection task, it reaches 54.56\% kidney coverage and 27.13\% cyst coverage, with short trajectories consistently centered on the target cyst. % with more target-focused trajectories. \end{abstract} %% %% The code below is generated by the tool at http://dl.acm.org/ccs.cfm. %% Please copy and paste the code instead of the example below. %% \begin{CCSXML}
<ccs2012>
<concept>
<concept_id>10010405.10010444.10010449</concept_id>
<concept_desc>Applied computing~Health informatics</concept_desc>
<concept_significance>300</concept_significance>
</concept>
<concept>
<concept_id>10010147.10010178.10010224</concept_id>
<concept_desc>Computing methodologies~Computer vision</concept_desc>
<concept_significance>500</concept_significance>
</concept>
</ccs2012>
\end{CCSXML} \ccsdesc[300]{Applied computing~Health informatics} \ccsdesc[500]{Computing methodologies~Computer vision} %% %% Keywords. The author(s) should pick words that accurately describe %% the work being presented. Separate the keywords with commas. \keywords{Robotic Ultrasound, Multi-view Perception, View Selection} %% %% This command processes the author and affiliation and title %% information and builds the first part of the formatted document.
\maketitle \section{Introduction} \label{sec:intro} In medical ultrasound, perception typically requires multiple scan views through probe movement to reduce diagnostic ambiguity. As a non-invasive and real-time imaging modality, ultrasound is essential in clinical diagnosis, yet remains highly view-dependent~\cite{munir2025survey, elmekki2025comprehensive}. A single static image often fails to provide sufficient structural information due to acoustic occlusions and a limited field-of-view~\cite{jiang2023robotic, velikova2023lotus}. Consequently, multi-view perception through probe repositioning is necessary to improve anatomical coverage and reduce diagnostic uncertainty~\cite{men2023gaze, dai2021transmed, jiang2023robotic}. % However, effectively acquiring such multiple views remains an open problem and is currently very costly (it typically requires manual operation). However, not all probe views are equally informative. Exhaustively acquiring a large number of views can introduce substantial redundancy and increase scanning and processing costs. In manual practice, selecting which views to acquire relies on the experience of the operator, requiring repeated repositioning and real-time interpretation, which makes the process time-consuming, operator-dependent, and difficult to standardize~\cite{jiang2023robotic, men2023gaze, munir2025survey}. To reduce this operator burden, existing methods attempt to automate probe navigation~\cite{bi2024machine, jiang2024intelligent}, yet most of these methods optimize probe movement based on immediate geometric or image-quality feedback without maintaining a spatial memory of the observed anatomy~\cite{bi2024machine, jin2023neu}. These approaches may therefore acquire redundant views while missing viewpoints that could resolve occlusions or reveal unseen anatomy. A key question is then how to determine, from partial observations, which probe positions to acquire next so as to maximize diagnostic coverage within a limited scanning budget.
\begin{figure}[t] \centering \includegraphics[width=\linewidth]{images/intro1.pdf} \caption{Uninformed exploration vs. active view exploration with SonoSelect. \textbf{Left:} Uninformed exploration samples views redundantly and fails to reach the target cyst occluded behind an overlying organ. \textbf{Right:} SonoSelect selects the next probe position based on current observations, directing the probe toward the target cyst for diagnosis.} \label{Fig:teaser} \end{figure} % In this paper, we aim to automate view selection in place of manual selection (this paragraph covers Fig. 1, mainly our task definition). To address this, we define an active view exploration task for ultrasound. As illustrated in \cref{Fig:teaser}, uninformed exploration samples views redundantly within a local region, leaving the target unobserved when it is occluded behind an overlying organ (\cref{Fig:teaser}, left). In contrast, active view exploration selects the next probe position based on current observations, directing the probe toward the target for diagnosis (\cref{Fig:teaser}, right). This formulation reduces redundant acquisition and increases the likelihood of obtaining the specific viewpoints needed for accurate diagnosis. % Hence, we propose SonoSelect (introduce it briefly, and note how it differs from vanilla PPO). We propose SonoSelect, an ultrasound-specific method, to address this task. Specifically, we cast ultrasound active view exploration as a sequential decision-making problem. As the probe moves, each new 2D ultrasound view is fused into a 3D spatial memory that represents what has been observed so far. This spatial memory then serves two purposes: it provides the agent with a volumetric summary of the current anatomical coverage, and it identifies regions that remain unobserved or uncertain, guiding where to scan next. Building on this representation, we design a reward that encourages probe movements toward greater organ coverage, lower reconstruction uncertainty, and less redundant scanning.
To optimize this reward, SonoSelect decomposes the task into a sector selection module trained with Q-learning for long-horizon routing and a continuous control policy trained with PPO for short-range navigation, so that each sub-problem operates at its own temporal scale. % Our method performs well on multiple ultrasound tasks (better than PPO). % It is also efficient (usually only a few views are needed). A preliminary study on multi-view organ classification shows that a small number of adaptively chosen views can match or exceed the all-view baseline, and that the optimal views vary across patient anatomies. Building on this observation, we further evaluate SonoSelect on a kidney cyst detection task, where the target is small and often occluded behind overlying organs. SonoSelect achieves higher organ and cyst coverage than conventional baselines, with trajectories that consistently converge toward the target rather than exhaustively scanning the volume. In particular, unlike standard RL approaches that show substantial performance degradation on unseen anatomies, SonoSelect maintains its coverage advantage across different patients, suggesting that the spatial memory provides a more generalizable basis for probe guidance than reward-driven exploration alone. % Looking ahead, this is an important step toward a fully autonomous robotic ultrasound system. The proposed active view exploration approach has practical implications for both current clinical workflows and robotic ultrasound systems that automate probe positioning through mechanical arms. In current practice, the learned policy can serve as a decision support tool, suggesting informative scanning regions to assist sonographers during manual examination. For robotic systems, the selected next-best-view can be converted into target coordinates for a mechanical arm controller, providing the spatial goal that downstream motion planning modules use to reposition the probe.
% Our contributions are: (1) We formulate ultrasound active view exploration as a sequential decision-making problem and show, through a preliminary study, that a small number of adaptively chosen views can match or exceed exhaustive acquisition. (2) We design two evaluation tasks within the SonoGym simulation environment, a multi-view organ classification task using simulated volumes and a kidney cyst detection task using human CT volumes. (3) We propose SonoSelect, which maintains a 3D spatial memory of observed anatomy and decomposes exploration into sector selection and continuous probe control, achieving effective coverage and generalization to unseen anatomies on both tasks. \section{Related Work} \label{sec:related} \textbf{Ultrasound Perception.} Most existing research in robotic ultrasound focuses on improving the perception of individual 2D slices, such as organ segmentation, classification, and lesion detection ~\cite{jiang2023robotic, huang2023review}. While these methods have achieved high accuracy in controlled planes, they inherently suffer from the limited field-of-view and acoustic occlusions characteristic of single-view ultrasound ~\cite{jiang2025towards}. To acquire more comprehensive observations, autonomous scanning systems have been developed, but they typically follow predefined trajectories or optimize for local image quality and acoustic coupling through force-aware control ~\cite{chatelain2017confidence, ning2023inverse, tirindelli2020force}, without selecting views based on their diagnostic contribution. Recent learning-based advances have attempted to navigate toward target views from local observations ~\cite{hase2020ultrasound, jiang2024intelligent}. However, these approaches primarily focus on acquiring a single predefined standard plane rather than comprehensive 3D perception. 
Other recent works explore constraint-aware safe exploration ~\cite{duan2024safe} or build tissue-view maps for specific structures ~\cite{su2025tissue}, but these efforts remain limited to local path planning for individual anatomical targets. As a result, the problem of sequentially selecting views to build up a comprehensive 3D understanding of the scanned region remains open. We formulate it as active multi-view exploration, where the agent selects a sequence of views to maximize diagnostic coverage across the full scanning region rather than navigating to a single target plane. \textbf{Viewpoint Selection.} The concept of Next-Best-View (NBV) planning was originally established in the computer vision community to solve 3D reconstruction and active localization for objects using RGB or RGB-D sensors ~\cite{isler2016information, di2024learning}. These methods, ranging from classical information-theoretic entropy reduction ~\cite{isler2016information} to recent learning-based active vision policies ~\cite{chen2024gennbv, feng2024naruto, xue2024neural}, operate under free-space assumptions: the sensor can be positioned at arbitrary viewpoints around the object, and the imaging process follows predictable optical properties such as known projection geometry and consistent illumination. Translating NBV principles into the ultrasound domain introduces physical challenges that violate these assumptions. The probe is constrained to maintain continuous contact with the body surface, limiting the set of reachable viewpoints. Acoustic shadowing caused by bone or gas occludes structures that would be visible from other orientations, and signal-dependent speckle noise reduces the reliability of pixel-level uncertainty estimates. Together, these factors mean that purely uncertainty-driven exploration strategies, which perform well under free-space conditions, can be misled by imaging artifacts in ultrasound.
Our work adapts the core idea of NBV planning, selecting the next observation to maximize diagnostic gain, to the contact-constrained ultrasound setting. Rather than relying on geometric uncertainty alone, the agent learns to account for anatomical context when choosing where to scan next. We train the exploration policy in high-throughput simulators that support robotic ultrasound tasks ~\cite{makoviychuk2021isaac, ao2025sonogym, schmidgall2024surgical}, where large-scale parallel rollouts provide sufficient experience for the agent to learn anatomy-aware scanning strategies that prioritize diagnostically informative coverage over geometric traversal. \section{Methodology} \label{sec:method} \subsection{Problem Definition} \begin{figure*}[t] \centering \includegraphics[width=\linewidth]{images/fig2.pdf} \caption{\textbf{Active multi-view ultrasound exploration with $T$ scanning steps.} Solid lines indicate the network forward pass and dashed lines indicate the interaction between the agent (SonoSelect) and the environment (ultrasound simulation). At each time step $t$, the sector selection module $S$ evaluates the current state $s_t = (\hat{V}_t, \mathbf{p}_t, \mathbf{q}_t)$ and selects a target sector $z_t$ from the discretized workspace. The agent then navigates to the selected sector, acquires a new ultrasound slice, and the volumetric fusion module $U(\cdot)$ integrates it into the spatial probability map $\hat{V}_{t+1}$. As scanning progresses, the probability map evolves from sparse voxel estimates to a dense reconstruction. After the budget is exhausted, the final coverage reward $r_T = \text{Coverage}(\hat{V}_T, g(v))$ is computed between the accumulated probability map and the ground-truth annotation.} \label{fig:system_overview} \end{figure*} % Task definition in one paragraph (a large overview figure is needed here). We formulate active view exploration for ultrasound perception as a resource-constrained POMDP~\cite{kaelbling1998planning}.
Specifically, the unobservable state $s$ represents the complete 3D anatomy, which the agent can only access through partial 2D ultrasound slices. The objective is to learn an exploration policy $\pi_\phi(a_t|s_t)$ that maps the current state $s_t$ to continuous kinematic actions $a_t$, maximizing the cumulative coverage of the target anatomical structure within a fixed budget of $T$ steps. Concretely, the agent faces three subproblems: (1) estimating, from incomplete observations, how much anatomical coverage each unvisited region would provide; (2) deciding which regions to visit and in what order within the finite budget; and (3) translating each regional decision into a feasible kinematic trajectory. % State \textbf{State}. Because the number of acquired slices grows with each step, directly conditioning the policy on the full observation history is impractical. We instead maintain a fixed-dimensional state $s_t$ that summarizes all spatial information collected up to step $t$. At each step $t$, the agent receives a 2D ultrasound slice $I_t$ at probe pose $(\mathbf{p}_t, \mathbf{q}_t)$ and fuses it into a 3D probability map $\hat{V}_t$ via the volumetric fusion function $U(\cdot)$. We formulate the state $s_t$ as: \begin{equation} s_t = (\hat{V}_t, \mathbf{p}_t, \mathbf{q}_t), \end{equation} where $\hat{V}_t$ aggregates all slices observed up to step $t$ into a spatial probability map, in which each voxel stores the estimated probability of tissue occupancy. $\hat{V}_0$ is initialized to a uniform probability of $0.5$ to represent maximum uncertainty. This representation maintains the same dimensionality across different time steps, allowing the policy to operate on a fixed-size input regardless of the episode length. Although $s_t$ captures the spatial structure observed so far, it does not explicitly indicate how much of the target anatomy has been covered.
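For concreteness, the spatial-memory state above can be sketched in Python. The paper does not specify the exact form of the fusion function $U(\cdot)$; the log-odds update, grid resolution, and function names below are illustrative assumptions only, not the released implementation.

```python
import numpy as np

# Hypothetical grid resolution; the paper does not fix it.
GRID = (32, 32, 32)

def init_state():
    # V_hat_0 is initialized to 0.5 everywhere: maximum uncertainty.
    return np.full(GRID, 0.5, dtype=np.float32)

def fuse_slice(v_hat, slice_probs, voxel_idx):
    """One plausible form of the volumetric fusion U(.): a log-odds
    update of the voxels intersected by the new 2D slice.
    `slice_probs` holds per-pixel tissue-occupancy probabilities and
    `voxel_idx` the (i, j, k) voxel coordinates each pixel maps to."""
    eps = 1e-6
    p = np.clip(slice_probs, eps, 1 - eps)   # new evidence
    q = np.clip(v_hat[voxel_idx], eps, 1 - eps)  # prior belief
    # Combine old and new evidence in log-odds space, then squash back.
    logit = np.log(q / (1 - q)) + np.log(p / (1 - p))
    v_hat[voxel_idx] = 1.0 / (1.0 + np.exp(-logit))
    return v_hat
```

Because unvisited voxels stay at 0.5, the same array doubles as the uncertainty map used later for the entropy term of the reward.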
To provide the critic with a more informative training signal, we define a privileged coverage ratio: \begin{equation} c_t = \frac{\sum_{v} \hat{V}_{t}(v) \cdot g(v)}{\sum_{v} g(v) + \epsilon}, \quad c_t \in [0,1], \end{equation} where the summation runs over all voxels $v$ in the reconstruction volume, and $g(v)$ is the ground-truth binary mask of the target structure. Since $c_t$ requires $g(v)$, it is available only during training in simulation. Following the asymmetric actor-critic formulation~\cite{pinto2017asymmetric}, the actor $\pi_\phi(a_t | s_t)$ sees only $s_t$, while the critic $V_\psi(s_t, c_t)$ additionally receives $c_t$ for more accurate value estimation. This separation ensures that the deployed policy does not rely on any privileged information. % Action \textbf{Action}. For a given state $s_t$ at time step $t \in \{1,\dots,T\}$, the agent outputs a continuous 4D action $a_t = (\Delta x, \Delta z, \Delta\phi, \Delta\psi)$, where $\Delta x$ and $\Delta z$ are translational displacements along the x and z axes, and $\Delta\phi$ and $\Delta\psi$ are rotational increments for roll and yaw, respectively. The y-axis translation is omitted because the probe maintains surface contact throughout scanning. The action space is continuous to allow fine-grained kinematic adjustments. \textbf{Transition}. Upon executing action $a_t$, the probe pose is updated to $(\mathbf{p}_{t+1}, \mathbf{q}_{t+1})$ via the environment's kinematic function. The environment then returns a new ultrasound slice $I_{t+1}$, which is fused into the probability map to produce $\hat{V}_{t+1}$, and the state transitions to $s_{t+1} = (\hat{V}_{t+1}, \mathbf{p}_{t+1}, \mathbf{q}_{t+1})$. The scanning process terminates when the step budget $T$ is exhausted. % Reward \textbf{Reward}. 
We design a dense, multi-objective reward function: \begin{equation} r_t = w_{cov} \Delta C_t + w_{info} \Delta H_t^{echo} - \ell_t^{path}. \end{equation} The first term $\Delta C_t$ measures the incremental coverage gain over the anatomical structures of interest, weighted by $w_{cov}$, and provides the main learning signal. However, a single partial slice may refine the reconstruction without producing measurable coverage gain. To reward such intermediate progress, the second term $\Delta H_t^{echo}$, weighted by $w_{info}$, captures the reduction in volumetric Shannon entropy over the target region, so that steps reducing acoustic uncertainty still receive positive feedback. Because the policy maximizes the cumulative sum of all these terms, entropy reduction alone cannot sustain high returns; the policy is driven toward trajectories that also achieve coverage gains over the structures of interest. This distinguishes our reward from objectives that use entropy reduction as the sole optimization target, where the policy has no incentive to prioritize diagnostically relevant regions over other high-uncertainty areas. Finally, $\ell_t^{path}$ is a conditional kinematic penalty on large translational and rotational displacements when a step produces no coverage gain, discouraging the agent from moving excessively without acquiring new information. \begin{figure*}[t] \centering \includegraphics[width=\linewidth]{images/architecture.pdf} \caption{\textbf{Architecture of SonoSelect.} The scanning region is discretized into $S$ sectors from which learned features $f_i$ are extracted via shared 2D convolutional encoding and masked pooling. The view selection module produces sector-conditioned Q-values $Q(s_t, z_i)$, and selects the optimal sector $z_t$ whose geometric center is converted into a positional guidance vector $\mathbf{v}_t^{\text{pos}}$.
The action refinement module takes as input the volumetric state $s_t$ together with the guidance vector $\mathbf{v}_t^{\text{pos}}$ and outputs the local kinematic increment $\Delta_t$ which is scaled to $\hat{\Delta}_t$. The Residual Fusion module combines $\mathbf{v}_t^{\text{pos}}$ and $\hat{\Delta}_t$ to produce the final continuous action $a_t$, which drives the probe to a new pose and triggers volumetric fusion $U(\cdot)$ to update the spatial memory $\hat{V}_{t+1}$.} \label{fig:sono} \end{figure*} \subsection{SonoSelect Architecture} A flat continuous policy would need to simultaneously decide which region of the anatomy to visit next and compute the kinematic actions to get there. In practice, this joint optimization is difficult because selecting which anatomical region to scan next requires reasoning over the entire observed volume and operates over long horizons with sparse diagnostic feedback, while executing the probe movement toward that region requires dense, short-horizon kinematic adjustments. These two sub-tasks differ in both temporal scale and input granularity. SonoSelect decomposes this problem into two coupled components. A sector selection module handles the long-horizon decision of where to explore. The selected region then provides a directional target for a continuous control policy, which only needs to solve a simpler, short-range navigation task toward the chosen sector. This decomposition constrains the search space for each sub-problem while maintaining the flexibility required for fine-grained kinematic control. % \begin{figure} % \centering % \includegraphics[width=\linewidth]{images/feature.pdf} % \caption{\textbf{Sector feature extraction pipeline.} By treating elevation slices as input channels, a shared 2D convolutional encoder processes the reconstruction volume $\hat{V}_t$ into a 32-channel feature map. For each sector $i$, a sector-specific mask filters this map, followed by parallel average and max pooling. 
% A shared MLP then projects the concatenated 64-dimensional vector, producing the sector feature $f_i$.} % \label{fig:sector_feature} % \end{figure} We discretize the local operational workspace into $S$ equiangular sectors (\cref{fig:sono}). To obtain a feature representation $f_i$ for each sector, the reconstruction volume $\hat{V}_t$ is first rearranged by treating elevation slices as input channels and then processed by a shared 2D convolutional encoder. For each sector $i$, a binary sector mask is applied to the encoded feature map, followed by parallel average and max pooling. The concatenated pooling result is then projected through a shared MLP to produce $f_i$. The sector features $\{f_i\}_{i=1}^{S}$ are each passed through a shared Q-network to produce action values $\{Q(s_t, z_i)\}_{i=1}^{S}$, where $Q(s_t, z_i)$ estimates the cumulative expected reward for navigating toward sector $z_i$. This parameter-sharing design ensures that the Q-network generalizes across all candidate sectors rather than learning separate value estimates for each. During training, the sector is chosen via an $\epsilon$-greedy strategy to balance exploration and exploitation; at deployment, the sector with the highest Q-value is deterministically selected. The geometric center of the selected sector $z_t$ is converted into a positional target vector $\mathbf{v}_t^{\text{pos}} \in \mathbb{R}^2$ in the probe's local coordinate frame, representing the translational direction toward the selected sector. This vector serves as the guidance signal for the downstream continuous control policy. The continuous control policy translates the selected sector into kinematic actions. We employ a PPO-based actor-critic architecture. The actor takes as input the current state $s_t$ concatenated with the sector guidance vector $\mathbf{v}_t^{\text{pos}}$, and outputs a local kinematic increment $\Delta_t = [\Delta_t^{\text{pos}}, \Delta_t^{\text{ang}}] \in \mathbb{R}^4$.
A residual scaling factor $\alpha$ is applied to obtain the scaled increment $\hat{\Delta}_t = \alpha \Delta_t$. The final action $a_t$ fuses the sector-derived target with this scaled increment: \begin{equation} a_t^{\text{pos}} = \beta_t \mathbf{v}_t^{\text{pos}} + (1-\beta_t) \hat{\Delta}_t^{\text{pos}}, \quad a_t^{\text{ang}} = \hat{\Delta}_t^{\text{ang}} \end{equation} where $\beta_t$ linearly anneals from an initial value $\beta_0$ to a final value $\beta_f$ over training. In early training, $\beta_t$ is large so that the translational component is dominated by the sector guidance $\mathbf{v}_t^{\text{pos}}$, providing a stable learning signal before the policy has converged. As training progresses, $\beta_t$ decreases and the policy's own output $\hat{\Delta}_t^{\text{pos}}$ takes over. The angular component $a_t^{\text{ang}}$ is determined entirely by the policy, as the sector selection provides only translational guidance. The critic estimates the state value $V_{\psi}(s_t, c_t)$ using the augmented state. \subsection{Training Scheme} We employ a rollout-based sequential updating approach to jointly train the continuous control policy (via Proximal Policy Optimization, PPO~\cite{schulman2017proximal}) and the sector selection module via Q-learning. This joint training scheme allows both modules to co-adapt within the same trajectory data, ensuring consistent learning signals across the two decision levels. The continuous control policy is optimized using the standard PPO objective with Generalized Advantage Estimation (GAE) ~\cite{schulman2015high}. The actor outputs kinematic increments $\Delta_t$ and is updated via clipped surrogate objectives, while the critic estimates $V_\psi(s_t, c_t)$ and provides the baseline for advantage computation. For the sector selection module, we train the sector Q-network using Monte Carlo rollout returns as regression targets. 
The action-value function \(Q_{\theta}(s_t, z_t)\) estimates the expected discounted return after selecting sector \(z_t\) at state \(s_t\): \begin{equation} Q_{\theta}(s_{t}, z_{t}) = \mathbb{E} \left( \sum_{\tau=t}^{T} \gamma^{\tau-t} r_{\tau} \right), \end{equation} where $\mathbb{E}(\cdot)$ denotes the expectation and $\gamma \in [0, 1]$ is the discount factor. Although both modules share the same reward signal, they require different value representations. The PPO critic learns a state value $V_\psi(s_t, c_t)$ used to compute advantages for the continuous control policy, while the sector selection module learns action-conditional values $Q_\theta(s_t, z_i)$ that compare the expected return of each candidate sector. This difference motivates maintaining separate value functions despite the shared reward. Given this formulation, we compute the return from the collected rollouts as the supervision target: \begin{equation} y_{t} = \begin{cases} r_{t}^{(Q)} + \gamma (1 - d_{t}) y_{t+1}, & \text{if } t < T \\ r_{T}^{(Q)}, & \text{otherwise} \end{cases}, \end{equation} where $d_t$ is the termination mask. The Q-network is then optimized using the mean squared error loss: \begin{equation} \mathcal{L}_{Q} = \lambda_Q \frac{1}{T} \sum_{t=1}^{T} \left( Q_{\theta}(s_{t}, z_{t}) - y_{t} \right)^2, \end{equation} where $\lambda_Q$ controls the loss weight. In joint training, the two objectives are optimized in separate backward passes within each iteration. First, the PPO objective $\mathcal{L}_{\text{PPO}}$ updates the continuous control policy and the critic. Then, in a separate backward pass, the Q-learning loss $\mathcal{L}_{Q}$ updates the sector selection module, including the Q-network and its associated feature encoder.
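The Monte Carlo supervision target admits a direct implementation. The following sketch computes the backward recursion for $y_t$ and the weighted squared-error loss $\mathcal{L}_Q$; the function and variable names are ours, not from a released codebase.

```python
import numpy as np

def mc_returns(rewards, dones, gamma=0.99):
    """Backward recursion y_t = r_t + gamma * (1 - d_t) * y_{t+1},
    with the terminal target y_T = r_T; d_t is the termination mask."""
    T = len(rewards)
    y = np.zeros(T, dtype=np.float64)
    y[-1] = rewards[-1]
    for t in range(T - 2, -1, -1):
        y[t] = rewards[t] + gamma * (1.0 - dones[t]) * y[t + 1]
    return y

def q_loss(q_values, targets, lambda_q=1.0):
    # L_Q = lambda_Q * (1/T) * sum_t (Q(s_t, z_t) - y_t)^2
    q_values = np.asarray(q_values, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    return lambda_q * np.mean((q_values - targets) ** 2)
```

A mid-rollout termination flag ($d_t = 1$) zeroes the bootstrap term, so the recursion is truncated at episode boundaries rather than leaking return across them.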
This sequential scheme prevents gradient interference between the two objectives. The complete procedure is given in~\cref{alg:sonoselect}. \begin{algorithm}[t] \caption{SonoSelect} \label{alg:sonoselect} \begin{algorithmic}[1] \small \STATE \textbf{Input}: Env $\mathcal{E}$, budget $T$, exploration rate $\epsilon$, scaling factor $\alpha$, annealing weight $\beta_t$, Q-loss weight $\lambda_Q$. \STATE \textbf{Update}: Q-network $Q_{\theta}$, actor $\pi_{\phi}$, critic $V_{\psi}$. \FOR{each training iteration} \STATE Initialize rollout buffers $\mathcal{B}_{\text{PPO}}, \mathcal{B}_{Q} \leftarrow \emptyset$ \STATE Reset environment: $s_1 \leftarrow \mathcal{E}.\text{reset}()$ \FOR{$t = 1$ to $T$} \STATE Extract sector features $\{f_i\}_{i=1}^{S}$ from $\hat{V}_t$ \STATE Select sector via $\epsilon$-greedy: with probability $\epsilon$ pick a random sector, otherwise choose $z_t = \arg\max_{z_i} Q_{\theta}(s_t, z_i)$ \STATE Compute guidance $\mathbf{v}_t^{\text{pos}} \leftarrow \text{GeometricCenter}(z_t)$ \STATE Sample $\Delta_t \sim \pi_{\phi}(\cdot \mid s_t, \mathbf{v}_t^{\text{pos}})$; scale $\hat{\Delta}_t \leftarrow \alpha \Delta_t$ \STATE Fuse action: $a_t^{\text{pos}} \leftarrow \beta_t \mathbf{v}_t^{\text{pos}} + (1{-}\beta_t) \hat{\Delta}_t^{\text{pos}}$, $a_t^{\text{ang}} \leftarrow \hat{\Delta}_t^{\text{ang}}$ \STATE Execute $a_t$ in $\mathcal{E}$; observe $s_{t+1}, r_t, d_t$ \STATE Store $(s_t, c_t, a_t, r_t, s_{t+1}, c_{t+1}, d_t)$ in $\mathcal{B}_{\text{PPO}}$ \STATE Store $(s_t, z_t, r_t, s_{t+1}, d_t)$ in $\mathcal{B}_{Q}$ \ENDFOR \STATE Compute GAE advantages from $\mathcal{B}_{\text{PPO}}$; update $\pi_{\phi}, V_{\psi}$ via $\mathcal{L}_{\text{PPO}}$ \STATE Compute discounted returns $\{y_t\}$ from $\mathcal{B}_{Q}$; update $Q_{\theta}$ via $\nabla\mathcal{L}_{Q}$ \ENDFOR \end{algorithmic} \end{algorithm} \section{Experiment} We evaluate our approach in two stages.
The first stage (Sec.~\ref{sec:discrete_classification}) examines whether a small, adaptively chosen subset of views can match or exceed the diagnostic performance of exhaustive acquisition, using a simplified setting where the probe can access any candidate viewpoint without movement cost. The second stage (Sec.~\ref{sec:continuous_detection}) evaluates SonoSelect on a kidney cyst detection task, where the agent adaptively guides probe movement to detect target pathology within a limited scanning budget. \subsection{Preliminary Study: Multi-view Classification} \label{sec:discrete_classification} Before evaluating the full scanning pipeline, we first ask a simpler question: given a set of candidate viewpoints, can an adaptive selection policy identify a few informative views for each instance? To answer this, we remove movement cost entirely and allow the probe to access any candidate viewpoint directly. As the adaptive selection method, we adopt MVSelect~\cite{hou2024learning}, which sequentially chooses the next view conditioned on previously acquired observations. This setup lets us test whether observation-driven view selection is beneficial before introducing continuous probe control. \textbf{Datasets.} We construct two custom multi-view ultrasound datasets, both adopting a strict 80\%/20\% train-test split and extracting $120 \times 120$ 2D slices under two distinct viewpoint configurations (12 and 20 views). \begin{itemize} \item \textit{Geometry:} This synthetic dataset comprises 10 distinct categories (sphere, ellipsoid, cube, cuboid, cylinder, capsule, cone, torus, octahedron, and cross), with 150 unique instances per category. \item \textit{Organ:} To move closer to clinical realism, we introduce a more challenging dataset comprising real human anatomical structures sourced from the publicly available TotalSegmentator dataset~\cite{wasserthal2023totalsegmentator}.
It contains 6 distinct categories: left kidney, liver, pancreas, spleen, aorta, and stomach, with 100 unique patient instances per category. \end{itemize} \textbf{Task Network.} For both datasets, we employ a ResNet-18~\cite{he2016deep} backbone combined with a max-pooling aggregation module. The network is trained offline on complete multi-view sequences so that the learned representations are not biased toward any particular view subset. \textbf{Quantitative Results.} The classification performance on both datasets is summarized in \cref{tab:combined_results}. We compare five selection strategies: (1) \textit{dataset-level oracle}, which uses the same fixed pair of views that achieves the highest average accuracy across all instances in the training set; (2) \textit{instance-level oracle}, which selects the optimal pair for each test instance by exhaustive search; (3) \textit{random selection}, which samples two views uniformly; (4) \textit{validation best policy}, which selects the fixed pair that achieves the highest accuracy on the validation set; and (5) \textit{MVSelect}~\cite{hou2024learning}, which sequentially selects views conditioned on previous observations. \begin{table}[t] \centering \small \caption{Classification accuracy (\%) on the Geometry and Organ datasets.
Each adaptive or fixed policy selects two views per instance from the candidate set.} \label{tab:combined_results} % \resizebox keeps the table within the single-column width \resizebox{\columnwidth}{!}{ \begin{tabular}{l|cc|cc} \multirow{2}{*}{View selection} & \multicolumn{2}{c|}{Geometry} & \multicolumn{2}{c}{Organ} \\ \cline{2-5} % rule under the two dataset group headers & 12 views & 20 views & 12 views & 20 views \\ \hline N/A: all $N$ views & 84.02 & 92.96 & 92.50 & 91.73 \\ \hline dataset-lvl oracle & 79.07 $\pm$ 0.71 & 83.42 $\pm$ 2.23 & 92.75 $\pm$ 0.64 & 90.13 $\pm$ 1.98 \\ instance-lvl oracle & 93.80 $\pm$ 0.47 & 99.23 $\pm$ 0.61 & 98.04 $\pm$ 0.61 & 99.32 $\pm$ 1.06 \\ \hline random selection & 74.61 $\pm$ 2.32 & 68.71 $\pm$ 8.34 & 87.44 $\pm$ 4.17 & 73.91 $\pm$ 11.58 \\ validation best policy & 71.57 $\pm$ 2.46 & 70.87 $\pm$ 5.60 & 88.51 $\pm$ 1.61 & 83.73 $\pm$ 4.06 \\ \hline MVSelect~\cite{hou2024learning} & 79.11 $\pm$ 1.33 & 89.59 $\pm$ 1.61 & 96.23 $\pm$ 1.62 & 97.00 $\pm$ 1.04 \\ \hline \end{tabular} } \end{table} \begin{figure}[t] \centering \includegraphics[width=\linewidth]{images/Experiment1.pdf} \caption{\textbf{Qualitative results of the discrete view selection policy.} We visualize the selected viewpoints for both the Geometry (top) and Organ (bottom) datasets.} \label{fig:Experiment1} \end{figure} We first note that using all $N$ views does not yield the highest accuracy. The instance-level oracle, which selects the best two views per instance via exhaustive search, substantially surpasses the all-view baseline on both datasets. This indicates that redundant or low-quality views introduce noise that degrades the aggregated representation. However, the dataset-level oracle, which fixes the same two best views across all instances, performs considerably worse than the instance-level oracle and in some cases falls below the all-view baseline.
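The noise sensitivity of the aggregated representation is easy to see in a toy sketch of max-pooling aggregation (the feature values and the linear head below are invented purely for illustration): because pooling keeps the element-wise maximum across views, a single low-quality view with spuriously high activations can override the evidence from informative views and flip the prediction.

```python
def aggregate_max(view_features):
    """Element-wise max-pooling across per-view feature vectors."""
    return [max(col) for col in zip(*view_features)]

def predict(view_features, class_weights):
    """Toy linear head over the pooled feature; returns the argmax class index."""
    pooled = aggregate_max(view_features)
    scores = [sum(w * f for w, f in zip(ws, pooled)) for ws in class_weights]
    return max(range(len(scores)), key=lambda k: scores[k])

# Two informative views agree on class 0 ...
clean_views = [[0.9, 0.1], [0.8, 0.2]]
heads = [[1.0, 0.0], [0.0, 1.0]]  # one weight vector per class
# ... but one noisy view with a spuriously high activation flips the result.
noisy_views = clean_views + [[0.0, 1.5]]
```

Here `predict(clean_views, heads)` returns class 0, while adding the noisy view makes `predict(noisy_views, heads)` return class 1, mirroring why the all-view baseline can underperform a well-chosen subset.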
This gap shows that the most informative views vary from one instance to another and cannot be predetermined as a fixed protocol. Random selection performs the worst overall, with high variance reflecting the inconsistency of uninformed view choices. MVSelect, which selects views conditioned on each instance's observations, outperforms every fixed strategy on both datasets and approaches the instance-level oracle on the Organ dataset. This confirms that an adaptive policy can recover near-optimal view combinations without exhaustive search. Together, these results support the two properties that motivate SonoSelect: (1) a small number of well-chosen views can match or exceed the performance of exhaustive acquisition, and (2) the optimal views are instance-dependent and vary across patient anatomies. SonoSelect builds on these findings by addressing how to acquire such informative views in the first place, adaptively guiding probe movement based on a 3D spatial memory of the observed anatomy. \textbf{Qualitative Results.} \cref{fig:Experiment1} visualizes the views selected by MVSelect for representative instances from both datasets. The policy avoids ambiguous cross-sections and orients the probe toward viewpoints that capture discriminative geometric features of each object. \subsection{Kidney Cyst Detection} \label{sec:continuous_detection} The preliminary study established that a small number of adaptively chosen views suffices and that the optimal views vary across instances. We now evaluate whether SonoSelect can realize these benefits when the probe moves sequentially along the body surface, where each movement carries a scanning cost and the agent receives only partial observations of the underlying anatomy.
We test on a kidney cyst detection task, a clinically motivated scenario that requires both broad organ coverage and precise localization of small pathological targets, and evaluate generalization to unseen patient anatomies. \textbf{Experimental Setup.} The primary task requires the agent to dynamically scan the left kidney and identify renal cysts. We utilize 3D clinical CT volumes from the TotalSegmentator dataset~\cite{wasserthal2023totalsegmentator}. To evaluate structural generalization, patient anatomies are strictly partitioned into seen and unseen domains. The local operational workspace is divided into $S=16$ equiangular sectors with a radius of $r=15$ voxels. \textbf{Baselines.} In this fully continuous setting, we benchmark SonoSelect against baselines representing alternative exploration strategies. \textit{Random} applies uniformly sampled kinematic actions at each step, providing a lower bound on diagnostic yield without any learned or heuristic guidance. \textit{PPO}~\cite{schulman2017proximal} represents end-to-end reinforcement learning without hierarchical decomposition, testing whether a flat policy can implicitly learn both regional planning and local control. \textit{VIG} (Volumetric Information Gain)~\cite{isler2016information} represents classical Next-Best-View planning driven by entropy maximization, testing whether uncertainty reduction alone provides sufficient guidance for diagnostic exploration. \textit{RND}~\cite{burda2018exploration} provides a state-visitation driven exploration bonus, testing whether encouraging novel state visits improves coverage without task-specific guidance.
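For illustration, the equiangular sector partition of the local workspace can be sketched as follows. This is a minimal sketch: the sector indexing convention and the use of the half-radius point on the bisector as the sector's geometric center are our assumptions, not the paper's exact definitions.

```python
import math

def sector_index(dx, dy, num_sectors=16):
    """Map a planar displacement from the probe to one of S equiangular sectors."""
    angle = math.atan2(dy, dx) % (2.0 * math.pi)
    return int(angle // (2.0 * math.pi / num_sectors))

def sector_center(index, num_sectors=16, radius=15.0):
    """Representative point of a sector: on its bisector, halfway to the radius."""
    theta = (index + 0.5) * (2.0 * math.pi / num_sectors)
    r = radius / 2.0
    return (r * math.cos(theta), r * math.sin(theta))
```

With $S=16$ each sector spans $22.5^\circ$; the center returned by `sector_center` plays the role of the translational guidance target fed to the low-level policy.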
\begin{table*}[t] \centering \resizebox{\textwidth}{!}{ % Column layout: a vertical rule separates the method columns from the two result blocks \begin{tabular}{lc|cccccc|cccccc} \hline \multirow{2}{*}{\textbf{Method}} & \multirow{2}{*}{\textbf{Sectors}} & \multicolumn{6}{c|}{\cellcolor{yellow!50}\textbf{Seen Patient}} & \multicolumn{6}{c}{\cellcolor{orange!20}\textbf{Unseen Patient}} \\ % \cline draws a rule under columns 3--14 \cline{3-14} & & \textbf{Kidney (\%)} $\uparrow$ & \textbf{Cyst (\%)} $\uparrow$ & \textbf{Dice (\%)} $\uparrow$ & \textbf{IoU (\%)} $\uparrow$ & \textbf{Trans.} $\downarrow$ & \textbf{Rot.} $\downarrow$ & \textbf{Kidney (\%)} $\uparrow$ & \textbf{Cyst (\%)} $\uparrow$ & \textbf{Dice (\%)} $\uparrow$ & \textbf{IoU (\%)} $\uparrow$ & \textbf{Trans.} $\downarrow$ & \textbf{Rot.} $\downarrow$ \\ \hline Random & - & 13.79 & 8.38 & 22.82 & 14.13 & 925.81 & 2623.35 & 23.41 & 1.88 & 37.41 & 24.50 & 937.55 & 2624.38 \\ PPO & - & 60.46 & 44.94 & 69.76 & 53.69 & 427.52 & 294.24 & 33.14 & 12.10 & 47.97 & 34.62 & 376.92 & 247.93 \\ RND & - & 64.99 & 45.53 & 72.59 & 57.08 & 549.41 & 318.89 & 48.91 & 23.41 & 64.94 & 49.33 & 540.09 & 272.15 \\ VIG & - & 56.95 & 40.35 & 63.88 & 50.36 & 484.74 & 350.06 & 48.62 & 23.12 & 64.84 & 49.03 & 463.56 & 417.47 \\ \hline \textbf{SonoSelect} & 4 & 56.52 & 41.50 & 66.66 & 50.40 & 398.51 & 258.11 & 23.06 & 3.67 & 37.05 & 24.24 & 353.37 & 201.96 \\ \textbf{SonoSelect} & 8 & 62.39 & 47.13 & 71.52 & 55.93 & 570.16 & 186.57 & 50.87 & 30.91 & 67.53 & 51.50 & 561.80 & 198.36 \\ \textbf{SonoSelect} & 16 & 67.55 & 48.37 & 73.88 & 58.62 & 423.40 & 311.65 & 54.56 & 27.13 & 70.76 & 54.78 & 446.98 & 311.12 \\ \textbf{SonoSelect} & 32 & 65.07 & 45.67 & 72.70 & 57.20 & 532.87 & 258.49 & 52.14 & 27.90 & 66.97 & 51.16 & 574.52 & 213.78 \\ \hline \end{tabular} } \caption{\label{tab:main_results}Quantitative comparison of active scanning performance.
SonoSelect shows smaller performance degradation on unseen patient anatomies compared to other learned baselines.} \end{table*} \textbf{Quantitative Results.} \cref{tab:main_results} presents scanning performance on seen and unseen patient anatomies. On seen anatomies, the margins between methods are moderate: SonoSelect with 16 sectors attains the best scores on all four diagnostic metrics, while RND and VIG remain competitive. This is consistent with the nature of entropy-based exploration: on training anatomies, the spatial distribution of acoustic uncertainty tends to align with target anatomical structures, so greedy entropy maximization effectively guides the probe toward informative regions. SonoSelect's advantage on seen data is modest because the sector selection module distributes exploration across the scanning workspace based on estimated diagnostic value, producing more uniform spatial coverage rather than concentrating on the regions that happen to contain cysts in the training set. This broader exploration strategy, in turn, favors generalization, as the results on unseen anatomies confirm. For reference, the Random baseline achieves the lowest diagnostic scores across all metrics while consuming substantially more scanning budget, confirming that directed exploration is necessary for this task. All methods degrade on unseen anatomies, but the extent of degradation differs. Among the learned methods, PPO exhibits the largest drop, with kidney coverage falling from 60.5\% to 33.1\% and cyst coverage from 44.9\% to 12.1\%, indicating that the flat policy does not learn transferable exploration behaviors across different anatomies. VIG's cyst coverage drops from 40.4\% to 23.1\%, accompanied by a sharp increase in rotational movement. On seen anatomies, high-entropy regions tend to coincide with target structures, so entropy maximization effectively guides the probe.
On unseen anatomies, this alignment weakens, and the probe spends movement budget pursuing uncertainty reduction in diagnostically uninformative regions. In contrast, SonoSelect's cyst coverage decreases less across seen and unseen anatomies compared to other learned methods, and it achieves the highest scores across all four diagnostic metrics on unseen data. This stability can be attributed to the hierarchical decomposition of the scanning policy. Because the high-level routing operates on sector-level spatial features rather than raw voxel coordinates, its decisions are less tied to the specific geometry of training anatomies. Similarly, the low-level controller only needs to execute short-range navigation toward a given sector, a skill that depends on local kinematics rather than global anatomical layout. As a result, neither level relies on memorizing the full spatial structure of training patients, which explains why SonoSelect's performance degrades less when the anatomy changes. \begin{figure*}[t] \begin{minipage}[t]{0.40\textwidth} \centering \includegraphics[width=\linewidth]{images/pareto_tradeoff_allpoints.pdf} \captionof{figure}{Episode-level cyst coverage vs. trajectory length on unseen anatomies. Each point represents one scanning episode.} \label{fig:tradeoff} \includegraphics[width=\linewidth]{images/kde_1d_cyst_coverage_sonoselect_vs_ppo.pdf} \captionof{figure}{Distribution of per-episode cyst coverage on unseen anatomies for SonoSelect and PPO.} \label{fig:kde} \end{minipage}% \hfill% \begin{minipage}[t]{0.55\textwidth} \centering \includegraphics[width=\linewidth]{images/traj.pdf} \captionof{figure}{Qualitative comparison of scanning trajectories on unseen patient data. Red and blue segments indicate on-target and non-target portions, respectively; percentages report the effective scanning ratio. (a) Trajectories of PPO. 
(b) Trajectories of SonoSelect.} \label{fig:qualitative_trajectories} \vspace{4mm} % Vertical gap between the figure and the table; adjust as needed \small \begin{tabular}{l|cccc} \hline Method & Kidney & Cyst & Dice & IoU \\ \hline Random Routing & 47.85 & 13.77 & 62.80 & 45.53 \\ w/o Sector Features & 45.32 & 18.39 & 61.35 & 45.24 \\ w/o Residual Control & 49.94 & 16.18 & 59.23 & 44.92 \\ \textbf{SonoSelect} & 54.56 & 27.13 & 70.76 & 54.78 \\ \hline \end{tabular} \captionof{table}{Ablation study of SonoSelect components.} \label{tab:ablation} \end{minipage} \end{figure*} \textbf{Episode-level Analysis.} To further examine generalization at the episode level, \cref{fig:tradeoff} plots cyst coverage against trajectory length on unseen anatomies. PPO exhibits a dense cluster in the bottom-left quadrant, indicating frequent near-zero coverage episodes with short, spatially confined trajectories. This pattern is consistent with the limited transferability of the flat policy: when familiar spatial cues from training anatomies are absent, the agent tends to remain confined to local regions rather than exploring broadly. SonoSelect's distribution occupies the upper-right quadrant, where longer trajectories correspond to higher diagnostic coverage. As \cref{tab:main_results} shows, SonoSelect's average trajectory length is longer than that of PPO, yet this additional movement translates into higher scores across all diagnostic metrics, indicating thorough exploration of the target anatomy rather than aimless wandering. The per-episode distribution in \cref{fig:kde} further illustrates this contrast: PPO's cyst coverage concentrates near zero, while SonoSelect's distribution shifts toward higher values. \textbf{Qualitative Results.} \cref{fig:qualitative_trajectories} visualizes representative trajectories generated by PPO and SonoSelect on unseen anatomies. As illustrated in \cref{fig:qualitative_trajectories}a, PPO produces uncoordinated circular movements far from the kidney, with the majority of the trajectory passing through non-target regions. The effective scanning ratio in these examples ranges from 13.5\% to 19.6\%, indicating that the agent spends most of its movement budget on non-informative traversal. This is consistent with the low coverage reported in \cref{tab:main_results}, where the agent fails to direct the probe toward the target anatomy. In contrast, SonoSelect (\cref{fig:qualitative_trajectories}b) produces more structured trajectories that closely follow the contours of the kidney. The effective scanning ratios increase substantially, reflecting that a larger fraction of the trajectory contributes to diagnostic observation. This improvement is attributable to the sector-level routing learned by the high-level module, which directs the probe toward the target region and reduces time spent in non-informative areas. \subsection{Ablation Studies} \label{sec:ablation} To validate the core architectural designs of SonoSelect, we conduct ablation experiments on the kidney cyst detection task using unseen patient data.
We isolate three components: the learned routing policy, the per-sector feature representation, and the residual control module. Each ablation removes one component while keeping the rest unchanged. The quantitative comparisons are summarized in \cref{tab:ablation}. \textbf{Effect of Learned Routing.} We first evaluate the high-level decision maker by replacing the learned routing policy with random sector selection. Without a task-driven geometric prior, the continuous policy receives arbitrary directional targets, leading to uncoordinated probe movement. As shown in \cref{tab:ablation}, this variant shows a notable drop in cyst coverage, confirming that the learned routing policy is necessary to constrain the search space and direct the continuous policy toward diagnostically relevant regions. \textbf{Necessity of Explicit Sector Features.} The w/o Sector Features variant replaces the learned feature vectors with uniform values, making all sectors appear identical to the Q-network. Although the Q-network still receives the global observation $s_t$, it cannot distinguish sectors based on their spatial content in the reconstruction volume. As a result, the Q-network selects sectors without considering what each region contains, leading to reduced coverage for both kidney and cyst targets. The coverage drop confirms that the Q-network relies on per-sector spatial features to make informed routing decisions.
\textbf{Role of Residual Control.} The w/o Residual Control variant removes the low-level kinematic adjustments. This variant achieves the lowest cyst coverage among all configurations, while its kidney coverage remains comparable to the other ablated variants. This asymmetry reveals a clear functional division within the framework: the high-level routing policy is sufficient to guide the probe toward the correct anatomical region, but capturing small targets such as cysts requires the fine-grained probe adjustments that only the residual control module provides. \section{Conclusion} We propose SonoSelect, an active multi-view exploration framework for robotic ultrasound that selects informative viewpoints without exhaustive scanning or predefined trajectories. By bridging discrete high-level regional routing with continuous low-level kinematic control, SonoSelect learns to resolve anatomical ambiguities and achieves robust generalization to unseen anatomies where standard reinforcement learning approaches show substantial performance degradation. This approach represents a step toward autonomous robotic ultrasound deployment in clinical workflows. While the current evaluation is conducted in simulation, the hierarchical formulation of coupling discrete region selection with continuous probe control provides a principled way to handle the view-dependent nature of ultrasound imaging. This work suggests that structured, observation-driven exploration can serve as an effective mechanism for multi-view ultrasound perception, reducing the number of views needed for accurate diagnosis while maintaining robust coverage across diverse patient anatomies. %% %% The next two lines define the bibliography style to be used, and %% the bibliography file. \bibliographystyle{ACM-Reference-Format} \bibliography{sample-base} \end{document} \endinput %% %% End of file `sample-sigconf-authordraft.tex'.