%% %% This is file `sample-sigconf-authordraft.tex', %% generated with the docstrip utility. %% %% The original source files were: %% %% samples.dtx (with options: `all,proceedings,bibtex,authordraft') %% %% IMPORTANT NOTICE: %% %% For the copyright see the source file. %% %% Any modified versions of this file must be renamed %% with new filenames distinct from sample-sigconf-authordraft.tex. %% %% For distribution of the original source see the terms %% for copying and modification in the file samples.dtx. %% %% This generated file may be distributed as long as the %% original source files, as listed above, are part of the %% same distribution. (The sources need not necessarily be %% in the same archive or directory.) %% %% %% Commands for TeXCount %TC:macro \cite [option:text,text] %TC:macro \citep [option:text,text] %TC:macro \citet [option:text,text] %TC:envir table 0 1 %TC:envir table* 0 1 %TC:envir tabular [ignore] word %TC:envir displaymath 0 word %TC:envir math 0 word %TC:envir comment 0 0 %% %% The first command in your LaTeX source must be the \documentclass %% command. %% %% For submission and review of your manuscript please change the %% command to \documentclass[manuscript, screen, review]{acmart}. %% %% When submitting camera ready or to TAPS, please change the command %% to \documentclass[sigconf]{acmart} or whichever template is required %% for your publication. 
%% %% \documentclass[sigconf, screen, review, anonymous]{acmart} \usepackage{multirow} \usepackage{algorithmic} \usepackage{algorithm} \usepackage{colortbl} \usepackage{wrapfig} \usepackage{hyperref} \usepackage{cleveref} %% %% \BibTeX command to typeset BibTeX logo in the docs \AtBeginDocument{% \providecommand\BibTeX{{% Bib\TeX}}} %% Rights management information. This information is sent to you %% when you complete the rights form. These commands have SAMPLE %% values in them; it is your responsibility as an author to replace %% the commands and values with those provided to you when you %% complete the rights form. \setcopyright{acmlicensed} \copyrightyear{2018} \acmYear{2018} \acmDOI{XXXXXXX.XXXXXXX} %% These commands are for a PROCEEDINGS abstract or paper. \acmConference[Conference acronym 'XX]{Make sure to enter the correct conference title from your rights confirmation email}{June 03--05, 2018}{Woodstock, NY} %% %% Uncomment \acmBooktitle if the title of the proceedings is different %% from ``Proceedings of ...''! %% %%\acmBooktitle{Woodstock '18: ACM Symposium on Neural Gaze Detection, %% June 03--05, 2018, Woodstock, NY} \acmISBN{978-1-4503-XXXX-X/2018/06} %% %% Submission ID. %% Use this when submitting an article to a sponsored event. You'll %% receive a unique submission ID from the organizers %% of the event, and this ID should be used as the parameter to this command. %%\acmSubmissionID{123-A56-BU3} %% %% For managing citations, it is recommended to use bibliography %% files in BibTeX format. %% %% You can then either use BibTeX with the ACM-Reference-Format style, %% or BibLaTeX with the acmnumeric or acmauthoryear styles, that include %% support for advanced citation of software artefacts from the %% biblatex-software package, also separately available on CTAN. %% %% Look at the sample-*-biblatex.tex files for templates showcasing %% the biblatex styles. %% %% %% The majority of ACM publications use numbered citations and %% references. 
The command \citestyle{authoryear} switches to the %% "author year" style. %% %% If you are preparing content for an event %% sponsored by ACM SIGGRAPH, you must use the "author year" style of %% citations and references. %% Uncommenting %% the next command will enable that style. %%\citestyle{acmauthoryear} %% %% end of the preamble, start of the body of the document source. \begin{document} %% %% The "title" command has an optional parameter, %% allowing the author to define a "short title" to be used in page headers. \title{SonoSelect: Efficient Ultrasound Perception via \\ Active Probe Exploration } %% %% The "author" command and its associated commands are used to define %% the authors and their affiliations. %% Of note is the shared affiliation of the first two authors, and the %% "authornote" and "authornotemark" commands %% used to denote shared contribution to the research. \author{Ben Trovato} \authornote{Both authors contributed equally to this research.} \email{trovato@corporation.com} \orcid{1234-5678-9012} \author{G.K.M. 
Tobin} \authornotemark[1] \email{webmaster@marysville-ohio.com} \affiliation{% \institution{Institute for Clarity in Documentation} \city{Dublin} \state{Ohio} \country{USA} } \author{Lars Th{\o}rv{\"a}ld} \affiliation{% \institution{The Th{\o}rv{\"a}ld Group} \city{Hekla} \country{Iceland}} \email{larst@affiliation.org} \author{Valerie B\'eranger} \affiliation{% \institution{Inria Paris-Rocquencourt} \city{Rocquencourt} \country{France} } \author{Aparna Patel} \affiliation{% \institution{Rajiv Gandhi University} \city{Doimukh} \state{Arunachal Pradesh} \country{India}} \author{Huifen Chan} \affiliation{% \institution{Tsinghua University} \city{Haidian Qu} \state{Beijing Shi} \country{China}} \author{Charles Palmer} \affiliation{% \institution{Palmer Research Laboratories} \city{San Antonio} \state{Texas} \country{USA}} \email{cpalmer@prl.com} \author{John Smith} \affiliation{% \institution{The Th{\o}rv{\"a}ld Group} \city{Hekla} \country{Iceland}} \email{jsmith@affiliation.org} \author{Julius P. Kumquat} \affiliation{% \institution{The Kumquat Consortium} \city{New York} \country{USA}} \email{jpkumquat@consortium.net} %% %% By default, the full list of authors will be used in the page %% headers. Often, this list is too long, and will overlap %% other information printed in the page headers. This command allows %% the author to define a more concise list %% of authors' names for this purpose. \renewcommand{\shortauthors}{Trovato et al.} %% %% The abstract is a short summary of the work to be presented in the %% article. \begin{abstract} Ultrasound perception normally requires multiple scan views through probe motion to reduce diagnostic ambiguity, mitigate acoustic occlusions, and improve anatomical coverage. However, exhaustively acquiring and processing many views at the same time can be highly inefficient and time-consuming, which limits deployment on autonomous ultrasound systems and in fast-paced clinical workflows. 
To address this, we define a task of active multi-view exploration and propose an approach that analyzes current observations to dynamically guide the probe to the next best diagnostic region. Our framework features a reinforcement-learning-based exploration module, SonoSelect, which not only chooses subsequent views but also supports joint training with the downstream perception network. Experiments on multi-view ultrasound classification and continuous cyst detection tasks show that our approach achieves superior target coverage and perception accuracy, as well as robust generalization to unseen anatomies, by actively prioritizing informative viewpoints over uninformed spatial coverage. \end{abstract} %% %% The code below is generated by the tool at http://dl.acm.org/ccs.cfm. %% Please copy and paste the code instead of the example below. %% \begin{CCSXML} <ccs2012> <concept> <concept_id>10010405.10010444.10010449</concept_id> <concept_desc>Applied computing~Health informatics</concept_desc> <concept_significance>300</concept_significance> </concept> <concept> <concept_id>10010147.10010178.10010224</concept_id> <concept_desc>Computing methodologies~Computer vision</concept_desc> <concept_significance>500</concept_significance> </concept> </ccs2012> \end{CCSXML} \ccsdesc[300]{Applied computing~Health informatics} \ccsdesc[500]{Computing methodologies~Computer vision} %% %% Keywords. The author(s) should pick words that accurately describe %% the work being presented. Separate the keywords with commas. \keywords{Robotic Ultrasound, Multi-view Perception, View Selection} %% %% This command processes the author and affiliation and title %% information and builds the first part of the formatted document. \maketitle \section{Introduction} \label{sec:intro} % Medical ultrasound commonly acquires multiple scan views through probe motion to reduce diagnostic ambiguity (ultrasound diagnosis generally requires multiple views) In medical ultrasound, the acquisition of multiple scan views through probe motion is a common practice to reduce diagnostic ambiguity~\cite{wang2021deep, wu2020deep, zhou2021deep}. 
As a non-invasive and real-time imaging modality, ultrasound is essential in clinical diagnosis, but remains highly view-dependent~\cite{jiang2023robotic, munir2025survey, elmekki2025comprehensive}. A single static image often fails to provide sufficient structural information due to anatomical occlusions and a limited field-of-view. Consequently, multi-view perception is necessary to capture a complete representation of the pathology and to ensure diagnostic accuracy~\cite{dai2021transmed}. % However, acquiring multiple views effectively remains difficult and costly in practice (it requires manual effort, among other things) However, the acquisition of multiple views remains challenging in practice. In manual scanning, adjusting the probe to find optimal planes relies on operator experience and requires repeated repositioning and real-time interpretation, which makes the process tedious and difficult to standardize~\cite{jiang2023robotic, wang2021deep, munir2025survey}. To reduce this operator burden, existing methods attempt to automate probe navigation, yet most rely on uninformed geometric guidance or localized visual servoing~\cite{peralta2020next}. While effective at maintaining surface contact and correcting local positioning errors, these methods optimize probe motion based on immediate geometric or image-quality feedback without considering the broader anatomical context. As a result, the probe tends to repeatedly sample nearby regions that yield diminishing new information, while potentially missing the specific viewpoints needed to resolve diagnostic ambiguities due to anatomical occlusions or limited acoustic windows. \begin{figure}[t] \centering \includegraphics[width=\linewidth]{images/intro1.pdf} \caption{\textbf{Motivation of SonoSelect for efficient robotic ultrasound scanning.} \textbf{Left:} Uninformed exploration samples views redundantly within a local region. The target may remain hidden behind an overlying organ and outside the explored area. 
\textbf{Right:} SonoSelect autonomously selects the next scanning region based on diagnostic value, directing the probe to systematically explore and detect the target through informative views.} \label{fig:teaser} \end{figure} % In this paper, we aim to automate view selection and replace manual probe placement (this paragraph covers Fig. 1 and our task definition) To automate view selection and reduce reliance on manual navigation, we introduce an active multi-view exploration strategy. As illustrated in \cref{fig:teaser}, uninformed exploration methods tend to sample views redundantly within a local region, leaving the target organ unobserved when it lies beyond the explored area (\cref{fig:teaser}, left). In contrast, an active strategy directs the probe toward regions estimated to contribute the most to diagnosis, systematically expanding anatomical coverage rather than concentrating on locally accessible areas (\cref{fig:teaser}, right). This task-driven view selection reduces redundant acquisition and increases the chance of obtaining the specific viewpoints needed for accurate diagnosis. % Therefore, we propose SonoSelect (a brief introduction, also noting the differences from standard PPO) We propose SonoSelect, a view selection module that efficiently identifies diagnostically informative ultrasound viewpoints. Given the current observation, SonoSelect selects a promising scanning region based on its estimated diagnostic value, and then generates continuous probe motions to reach that region. We formulate this process as a partially observable Markov decision process and train SonoSelect with a task-specific reward that encourages maximal coverage of the target anatomical structures. The region selection and motion control modules are jointly optimized through a shared training scheme, keeping the overall framework compact. % Our method performs well on multiple ultrasound tasks (better than PPO) % and is also efficient (a few views are usually enough) Experiments across multiple ultrasound tasks demonstrate that SonoSelect consistently outperforms conventional baselines in both target coverage and generalization to unseen anatomies. 
Unlike standard RL approaches that show substantial performance degradation on unseen anatomies, SonoSelect generates anatomy-aware trajectories that systematically explore diagnostically relevant regions, which helps maintain coverage and reconstruction quality on unseen patient anatomies. % Future significance: this is an important step toward a fully autonomous robotic ultrasound system The proposed active view selection approach has practical implications for both current clinical workflows and future automation. First, the learned view selection policy can be integrated into existing clinical ultrasound systems as a decision support tool, suggesting the next informative scanning region to assist sonographers during examination and reducing operator dependence on experience. Second, for robotic ultrasound systems that need to decide where to scan next, SonoSelect can supply target coordinates based on diagnostic value to guide probe motion planning. \section{Related Work} \label{sec:related} \textbf{Ultrasound Perception: From Single-View to Multi-View.} The majority of existing research in robotic ultrasound focuses on improving the perception of individual 2D slices, such as organ segmentation, classification, and lesion detection~\cite{jiang2023robotic, huang2023review}. While these methods have achieved high accuracy in controlled planes, they inherently suffer from the limited field-of-view and acoustic occlusions characteristic of single-view ultrasound~\cite{jiang2025towards}. To acquire more comprehensive observations, autonomous scanning systems have been developed, but they typically follow predefined trajectories or optimize for local image quality and acoustic coupling through force-aware control~\cite{chatelain2017confidence, ning2023inverse, tirindelli2020force}, without selecting views based on their diagnostic contribution. Recent learning-based advances have attempted to navigate toward target views from local observations~\cite{hase2020ultrasound, jiang2024intelligent}. 
However, these approaches primarily focus on acquiring a single predefined standard plane rather than comprehensive 3D perception. Other recent works explore constraint-aware safe exploration~\cite{duan2024safe} or build tissue-view maps for specific structures~\cite{su2025tissue}, but these efforts remain limited to local path planning for individual anatomical targets. As a result, no existing method addresses the problem of sequentially selecting views to build up a comprehensive 3D understanding of the scanned region. Our work fills this gap by formulating the problem as active multi-view exploration, where the agent selects a sequence of views to maximize diagnostic coverage across the full scanning region rather than navigating to a single target plane. \textbf{Viewpoint Selection: From Computer Vision to Ultrasound.} The concept of Next-Best-View (NBV) planning was originally established in the computer vision community to address 3D reconstruction and active localization for objects using RGB or RGB-D sensors~\cite{isler2016information, di2024learning}. These methods, ranging from classical information-theoretic entropy reduction~\cite{isler2016information} to recent learning-based active vision policies~\cite{chen2024gennbv, feng2024naruto, xue2024neural}, operate under free-space assumptions: the sensor can be positioned at arbitrary viewpoints around the object, and the imaging process follows predictable optical properties such as known projection geometry and consistent illumination. Translating NBV principles into the ultrasound domain introduces physical challenges that violate these assumptions. The probe is constrained to maintain continuous contact with the body surface, limiting the set of reachable viewpoints. Acoustic shadowing caused by bone or gas occludes structures that would be visible from other orientations, and signal-dependent speckle noise reduces the reliability of pixel-level uncertainty estimates. 
Together, these factors mean that purely uncertainty-driven exploration strategies, which perform well under free-space conditions, can be misled by imaging artifacts in ultrasound. Our work adapts the core idea of NBV planning, selecting the next observation to maximize diagnostic gain, to the contact-constrained ultrasound setting. Rather than relying on geometric uncertainty alone, the agent learns to account for anatomical context when choosing where to scan next. We train the exploration policy in high-throughput simulators that support robotic ultrasound tasks~\cite{makoviychuk2021isaac, ao2025sonogym, schmidgall2024surgical}, where large-scale parallel rollouts provide sufficient experience for the agent to discover anatomy-aware scanning strategies that prioritize diagnostically informative coverage over geometric traversal. \section{Methodology} \label{sec:method} \subsection{Problem Formulation} \begin{figure}[t] \centering \includegraphics[width=0.9\linewidth]{images/overview.pdf} \caption{\textbf{Overview of the Active Multi-view Ultrasound Exploration system.} Framed as a resource-constrained POMDP, the hierarchical agent executes kinematic actions $a_t$ conditioned on the current state $s_t$. Through continuous environment interaction, the spatial memory $\hat{V}_t$ is iteratively updated with new ultrasound slices until the scanning budget is exhausted.} \label{fig:system_overview} \end{figure} % One paragraph defining the task (a large overview figure goes here) We formulate Active Multi-view Ultrasound Exploration as a resource-constrained POMDP~\cite{kaelbling1998planning}. The unobservable state $s \in \mathcal{S}$ represents the complete 3D anatomy, which the agent can only access through partial 2D ultrasound slices $o \in \Omega$. 
The objective is to learn an exploration policy $\pi(a_t|h_t)$ that maps the observation history $h_t$ to continuous kinematic actions $a_t \in \mathcal{A}$, maximizing the cumulative coverage of target anatomical structures $\mathcal{R}$ within a fixed budget $T$. Concretely, the agent faces three subproblems: (1) estimating, from incomplete observations, how much anatomical coverage each unvisited region would provide; (2) deciding which regions to visit and in what order within the finite budget; and (3) translating each regional decision into a feasible kinematic trajectory. % State \textbf{State}. Because the raw observation history grows with each step, we compress it into a fused 3D probability map $\hat{V}_t$ that serves as a fixed-dimensional spatial memory. The observable state at step $t$ is: \begin{equation} o_t = (\hat{V}_t, \mathbf{p}_t, \mathbf{q}_t). \end{equation} Here $\hat{V}_t$ is obtained by fusing the current slice $I_t$ at pose $(\mathbf{p}_t, \mathbf{q}_t)$ into the previous map via the volumetric fusion function $U(\cdot)$, and $\hat{V}_0$ is initialized to a uniform probability of $0.5$ to represent maximum uncertainty. To provide the critic with a richer training signal, we further define a privileged coverage signal: \begin{equation} c_t = \frac{\sum_{v} \hat{m}_t(v) \cdot g(v)}{\sum_{v} g(v) + \epsilon}, \quad c_t \in [0,1], \end{equation} where $g(v)$ is the ground-truth binary mask of the target structure and $\hat{m}_t(v)$ is the estimated target occupancy from the current reconstruction. Since $c_t$ requires $g(v)$, it is available only during training in simulation. Following the asymmetric actor-critic formulation~\cite{pinto2017asymmetric}, the actor $\pi_\theta(a_t | o_t)$ sees only the observable state, while the critic $V_\psi(o_t, c_t)$ additionally receives $c_t$ for more accurate value estimation. This separation ensures that the deployed policy does not rely on any privileged information. % Action \textbf{Action}. 
For a given observable state $o_t$ at time step $t \in \{1,\dots,T\}$, the agent outputs a continuous 4D motion action $a_t$ (comprising translations along the x and z axes and rotations about the roll and yaw axes) to adjust the probe pose. The action space is continuous to allow fine-grained kinematic adjustments during scanning. \textbf{Transition}. Upon executing action $a_t$, the probe pose is updated via the environment's kinematic function. The environment returns a new observation slice $I_{t+1}$, which is fused into the reconstruction volume $\hat{V}_{t+1}$. The scanning process terminates when the step budget $T$ is exhausted or the early stopping condition is met. % Reward \textbf{Reward}. We design a dense, multi-objective reward function: \begin{equation} r_t = w_{org} \Delta C_t^{org} + w_{tgt} \Delta C_t^{tgt} + w_{info} \Delta H_t^{echo} - \ell_t^{path}. \end{equation} The first two terms, $\Delta C_t^{org}$ and $\Delta C_t^{tgt}$, measure the incremental coverage gains for the primary organ and the target structure, respectively, providing the main learning signal. However, a single partial slice may refine the reconstruction without producing measurable coverage gain. To reward such intermediate progress, the third term $\Delta H_t^{echo}$ captures the reduction in volumetric Shannon entropy over the target region, so that steps reducing acoustic uncertainty still receive positive feedback. Finally, $\ell_t^{path}$ is a conditional kinematic penalty applied only when a step produces no coverage gain, discouraging excessive translation and rotation in the absence of new information. \begin{figure*}[t] \includegraphics[width=0.9\linewidth]{images/architecture2.pdf} \caption{\textbf{Architecture of the SonoSelect agent.} The scanning region is discretized into 16 sectors from which learned features $f_i$ are extracted via shared 2D convolutional encoding and masked pooling. 
The Sector Selection module produces sector-conditioned Q-values $Q(o_t, z_i)$, and selects the optimal sector $z_t$ whose geometric center is converted into a target guidance vector $\mathbf{v}_t$. Concurrently, the PPO Actor encodes the volumetric state $o_t$ through a convolutional backbone to output the local kinematic increment $\Delta_t$. The Residual Fusion module combines $\mathbf{v}_t$ and $\Delta_t$ to produce the final continuous action $a_t$, which drives the probe to a new pose and triggers volumetric fusion $U(\cdot)$ to update the spatial memory $\hat{V}_{t+1}$.} \label{fig:sono} \end{figure*} \subsection{SonoSelect Architecture} A flat continuous policy would need to simultaneously decide which region of the anatomy to visit next and compute the kinematic actions to get there. In practice, this joint optimization is difficult because regional planning operates over long horizons with sparse diagnostic feedback, while kinematic control requires dense, short-horizon adjustments. SonoSelect decomposes this problem into two coupled components. A sector selection module handles the long-horizon decision of where to explore. The selected region then provides a directional target for a continuous control policy, which only needs to solve a simpler, short-range navigation task toward the chosen sector. This decomposition constrains the search space for each sub-problem while maintaining the flexibility required for fine-grained kinematic control. \begin{figure} \centering \includegraphics[width=\linewidth]{images/feature.pdf} \caption{\textbf{Sector feature extraction pipeline.} By treating elevation slices as input channels, a shared 2D convolutional encoder processes the reconstruction volume $\hat{V}_t$ into a 32-channel feature map. For each sector $i$, a sector-specific mask filters this map, followed by parallel average and max pooling. 
A shared MLP then projects the concatenated 64-dimensional vector, producing the sector feature $f_i$.} \label{fig:sector_feature} \end{figure} We discretize the local operational workspace into $S{=}16$ equiangular sectors (Fig.~\ref{fig:sono}). To obtain a feature representation $f_i$ for each sector, the reconstruction volume $\hat{V}_t$ is first rearranged by treating elevation slices as input channels and processed by a shared 2D convolutional encoder. For each sector $i$, a binary sector mask is applied to the encoded feature map, followed by parallel average and max pooling. The concatenated pooling result is then projected through a shared MLP to produce $f_i$ (Fig.~\ref{fig:sector_feature}). The sector features $\{f_i\}_{i=1}^{S}$ are each passed through a shared Q-network to produce action values $\{Q(o_t, z_i)\}_{i=1}^{S}$, where $Q(o_t, z_i)$ estimates the expected reward for navigating toward sector $z_i$. This parameter-sharing design ensures that the Q-network generalizes across all candidate sectors rather than learning separate value estimates for each. During training, the sector is chosen via an $\epsilon$-greedy strategy to balance exploration and exploitation; at deployment, the sector with the highest Q-value is deterministically selected. The geometric center of the selected sector $z_t$ is then converted into a directional target vector $\mathbf{v}_t$ in the probe's local coordinate frame, which serves as the input to the downstream continuous control policy. The continuous control policy translates the selected sector into kinematic actions. We employ a PPO-based actor-critic architecture where the actor outputs a local kinematic increment $\Delta_t = [\Delta_t^{\text{pos}}, \Delta_t^{\text{ang}}]$. 
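As a minimal illustration of the sector-selection step described above, the following sketch shows $\epsilon$-greedy selection over the sector-conditioned Q-values and the conversion of the chosen sector's geometric center into a directional target $\mathbf{v}_t$; the function names, the planar sector geometry, and all numeric values are illustrative assumptions, not the trained system:

```python
import numpy as np

def select_sector(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Epsilon-greedy choice over the S sector-conditioned Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random sector
    return int(np.argmax(q_values))               # exploit: highest-value sector

def sector_to_vector(sector: int, num_sectors: int = 16, radius: float = 1.0) -> np.ndarray:
    """Map a sector index to a directional target toward the geometric
    center of that equiangular sector, in the probe's local frame."""
    angle = 2.0 * np.pi * (sector + 0.5) / num_sectors  # angle of the sector center
    return radius * np.array([np.cos(angle), np.sin(angle)])

rng = np.random.default_rng(0)
q = np.array([0.1, 0.9, 0.3, 0.2] + [0.0] * 12)  # toy Q-values for S = 16 sectors
z = select_sector(q, epsilon=0.0, rng=rng)        # greedy choice, as at deployment
v = sector_to_vector(z)                           # directional target v_t
```

Setting $\epsilon = 0$ recovers the deterministic greedy choice used at deployment, while a nonzero $\epsilon$ during training occasionally samples a random sector to maintain exploration.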
The final action $a_t$ dynamically fuses the sector-derived target $\mathbf{v}_t$ with the policy's local adjustment $\Delta_t$ via a residual formulation: \begin{equation} a_t^{\text{pos}} = \beta_t \mathbf{v}_t^{\text{pos}} + (1-\beta_t) \alpha \Delta_t^{\text{pos}}, \quad a_t^{\text{ang}} = \alpha \Delta_t^{\text{ang}}, \end{equation} where $\alpha$ is a scaling factor and $\beta_t$ linearly anneals from $1.0$ to near $0$ over the course of training, so that early training is dominated by sector guidance and later training relies on the learned policy. The critic estimates the state value $V_{\psi}(o_t, c_t)$ using the augmented state. \subsection{Training Scheme} We employ a rollout-based sequential updating approach to jointly train the continuous control policy (via Proximal Policy Optimization, PPO~\cite{schulman2017proximal}) and the sector selection module via Q-learning. This joint training scheme allows both modules to co-evolve within the same trajectory data, ensuring consistent learning signals across the two decision levels. The continuous control policy is optimized using the standard PPO objective with Generalized Advantage Estimation (GAE)~\cite{schulman2015high}. The actor outputs kinematic increments $\Delta_t$ and is updated via clipped surrogate objectives, while the critic estimates $V_\psi(o_t, c_t)$ and provides the baseline for advantage computation. For the sector selection module, we employ Q-learning. The action-value function $Q_{\theta}(o_t, z_t)$ estimates the expected cumulative reward after selecting sector $z_t$ at state $o_t$: \begin{equation} Q_{\theta}(o_{t}, z_{t}) = \mathbb{E} \left( \sum_{\tau=t}^{T} \gamma^{\tau-t} r_{\tau}^{(Q)} \right), \end{equation} where $\mathbb{E}(\cdot)$ denotes the expectation and $\gamma \in [0, 1]$ is the discount factor. The Q-network receives the same environment reward as the continuous control policy, i.e., $r^{(Q)}_t = r_t$. 
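To make the expectation above concrete, the discounted return that $Q_{\theta}$ is trained to predict can be computed from a rollout of per-step rewards by a simple backward accumulation (a toy sketch; the reward values and $\gamma$ are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^(tau - t) * r_tau from the current step t to the horizon T,
    i.e. the quantity that Q_theta(o_t, z_t) is trained to estimate."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: g <- r + gamma * g
        g = r + gamma * g
    return g

# Toy rollout of per-step rewards r_t^(Q) collected after selecting a sector.
rewards = [1.0, 0.0, 0.5]
g0 = discounted_return(rewards)  # 1.0 + 0.9 * 0.0 + 0.81 * 0.5
```

The backward accumulation is the same recursion used to build the supervision targets from collected rollouts.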
Although both modules share the same reward signal, they require different value representations: the PPO critic learns a state value $V^{\pi}(o_t, c_t)$ that marginalizes over all possible sector choices and is used to compute advantages for the continuous control policy, whereas the sector selection module requires action-conditional values $Q(o_t, z_i)$ that explicitly compare the expected return of navigating toward each candidate sector $z_i$. This distinction motivates maintaining separate value functions for the two modules despite their shared reward. Given this formulation, we compute the return from the collected rollouts as the supervision target: \begin{equation} y_{t} = \begin{cases} r_{t}^{(Q)} + \gamma (1 - d_{t}) y_{t+1}, & \text{if } t < T \\ r_{T}^{(Q)}, & \text{otherwise} \end{cases}, \end{equation} where $d_t$ is the termination mask. The Q-network is then optimized using the mean squared error (MSE) loss: \begin{equation} \mathcal{L}_{Q} = \lambda_Q \frac{1}{T} \sum_{t=1}^{T} \text{MSE}(Q_{\theta}(o_{t}, z_{t}), y_{t}), \end{equation} where $\lambda_Q$ controls the loss weight. In joint training, the two objectives are optimized in separate backward passes within each iteration. First, the PPO objective $\mathcal{L}_{\text{PPO}}$ updates the continuous control policy and the critic. Then, in a separate backward pass, the Q-learning loss $\mathcal{L}_{Q}$ updates the sector selection module, including the Q-network and its associated feature encoder. This sequential scheme prevents gradient interference between the two objectives. A step-by-step demonstration of this process can be found in~\cref{alg:sonoselect}. \begin{algorithm}[t] \caption{SonoSelect: Hierarchical RL for US Control} \label{alg:sonoselect} \begin{algorithmic}[1] % [1] numbers every line \small \STATE \textbf{Input}: Env $\mathcal{E}$, Steps $T$, Rate $\epsilon$, Weights $\beta_t, \lambda_Q$. \STATE \textbf{Initialize}: $Q_{\theta}$, Actor $\pi_{\phi}$, Critic $V_{\psi}$. 
\FOR{each training iteration} \STATE $\mathcal{B}_{\text{task}}, \mathcal{B}_{Q} \leftarrow \emptyset$ \FOR{$t = 1$ to $T$} \STATE $z_t \leftarrow \epsilon\text{-greedy}(Q_{\theta}, o_t)$ \STATE $\mathbf{v}_t \leftarrow \text{Vec}(z_t)$, $\Delta_t \sim \pi_{\phi}(\cdot | o_t)$ \STATE $a_t \leftarrow \text{Residual}(\mathbf{v}_t, \Delta_t)$ \STATE $s_{t+1}, r_t, d_t, \text{info} \leftarrow \text{Execute}(a_t, \mathcal{E})$ \STATE $r_t^{(Q)} \leftarrow r_t$ \STATE Push $(o_t, c_t, a_t, r_t, o_{t+1}, c_{t+1}, d_t)$ to $\mathcal{B}_{\text{task}}$ \STATE Push $(o_t, z_t, r_t^{(Q)}, o_{t+1}, d_t)$ to $\mathcal{B}_{Q}$ \ENDFOR \STATE Update $\pi_{\phi}, V_{\psi}$ using GAE \& $\mathcal{L}_{\text{PPO}}$ on $\mathcal{B}_{\text{task}}$ \STATE $y_t \leftarrow \text{n-step return}(r_t^{(Q)}, d_t)$ on $\mathcal{B}_{Q}$ \STATE Update $Q_{\theta}$ via $\nabla \mathcal{L}_{Q}(y_t, Q_{\theta})$ \ENDFOR \end{algorithmic} \end{algorithm} \section{Experiment} We systematically benchmark the proposed \textit{Active Multi-view Ultrasound Exploration} formulation. Our evaluation is structured into two phases: a discrete preliminary study followed by a fully continuous evaluation in a dynamic environment. The first phase (Sec. \ref{sec:discrete_classification}) validates the core assumption that diagnostic information concentrates in a small, instance-dependent subset of views, using a discrete setting where the probe can access any candidate viewpoint without kinematic constraints. The second phase (Sec. \ref{sec:continuous_detection}) evaluates our complete framework, \textbf{SonoSelect}, in the fully continuous POMDP environment where the agent jointly plans which regions to explore and how to navigate toward them. 
\subsection{Preliminary Study: Multi-view Ultrasound Classification} \label{sec:discrete_classification} To isolate the view selection problem from continuous navigation, we conduct this study in a discrete setting where the probe can access any candidate viewpoint without kinematic constraints. We employ MVSelect~\cite{hou2024learning}, a sequential view selection method that chooses the next view conditioned on previously acquired observations. MVSelect provides a suitable testbed because it implements adaptive selection without requiring a continuous control policy, allowing us to focus on whether instance-dependent view selection outperforms fixed or random protocols. \textbf{Datasets.} We construct two custom multi-view ultrasound datasets, both adopting a strict 80\%/20\% train-test split and extracting $120 \times 120$ 2D slices under two distinct viewpoint configurations (12 views and 20 views). \begin{itemize} \item \textit{SonoGeom:} This synthetic dataset comprises 10 distinct categories (sphere, ellipsoid, cube, cuboid, cylinder, capsule, cone, torus, octahedron, and cross), with 150 unique instances per category. \item \textit{SonoOrgan:} To bridge the gap toward clinical realism, we introduce a more challenging dataset comprising real human anatomical structures sourced from the publicly available TotalSegmentator dataset~\cite{wasserthal2023totalsegmentator}. It contains 6 distinct categories: left kidney, liver, pancreas, spleen, aorta, and stomach, with 100 unique patient instances per category. \end{itemize} \textbf{Task Network.} For both datasets, we employ a ResNet-18~\cite{he2016deep} backbone combined with a max-pooling aggregation module. The network is trained offline on complete multi-view sequences so that the learned representations are not biased toward any particular view subset. \textbf{Quantitative Results.} The classification performances on both datasets are summarized in \cref{tab:combined_results}. 
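For concreteness, the max-pooling aggregation used by the task network can be sketched as follows; toy feature vectors stand in for per-view ResNet-18 embeddings, and the shapes are illustrative:

```python
import numpy as np

def aggregate_views(view_features: np.ndarray) -> np.ndarray:
    """Max-pooling aggregation: element-wise max over the view axis,
    so the fused descriptor keeps the strongest response per feature."""
    return view_features.max(axis=0)

# Toy stand-in for per-view embeddings: V = 2 selected views, D = 3 features.
feats = np.array([[0.2, 0.9, 0.1],
                  [0.8, 0.1, 0.3]])
fused = aggregate_views(feats)
```

The element-wise max makes the fused descriptor invariant to view order and to the number of selected views, which is what allows the same task network to score arbitrary view subsets.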
We compare four selection strategies: (1) \textit{dataset-level oracle}, which uses the same fixed pair of views that achieves the highest average accuracy across all instances in the training set; (2) \textit{instance-level oracle}, which selects the optimal pair for each test instance by exhaustive search; (3) \textit{random selection}, which samples two views uniformly; and (4) \textit{MVSelect}, which sequentially selects views conditioned on previous observations. We also report the performance of using all $N$ views as a reference.
\begin{table}[t]
\centering
\caption{Classification results on the simple geometry dataset (SonoGeom) and real organ dataset (SonoOrgan) with a selection budget of $T=2$ views.}
\label{tab:combined_results}
\small
\begin{tabular}{l|c|c|c|c} % vertical rules between all columns
\hline
\multirow{2}{*}{Selection ($T=2$)} & \multicolumn{2}{c|}{SonoGeom} & \multicolumn{2}{c}{SonoOrgan} \\
\cline{2-5}
 & 12 views & 20 views & 12 views & 20 views \\
\hline
all $N$ views & 84.0 & 91.3 & 92.5 & 91.7 \\
\hline
dataset-lvl oracle & 79.1 $\pm$ 0.6 & 82.1 $\pm$ 1.2 & 92.8 $\pm$ 0.6 & 90.1 $\pm$ 2.0 \\
instance-lvl oracle & 93.8 $\pm$ 0.5 & 99.2 $\pm$ 0.6 & 98.1 $\pm$ 0.7 & 99.3 $\pm$ 1.1 \\
random selection & 74.6 $\pm$ 2.4 & 76.0 $\pm$ 2.8 & 87.4 $\pm$ 4.2 & 73.9 $\pm$ 11.6 \\
MVSelect & 83.4 $\pm$ 0.9 & 89.6 $\pm$ 1.6 & 96.2 $\pm$ 1.6 & 97.0 $\pm$ 1.0 \\
\hline
\end{tabular}
\end{table}
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{images/Experiment1.pdf}
\caption{\textbf{Qualitative results of the discrete view selection policy ($T=2$).} We visualize the selected viewpoints for both the SonoGeom (top, cube) and SonoOrgan (bottom, anatomical structure) datasets.}
\label{fig:Experiment1}
\end{figure}
We first note that using all $N$ views does not yield the highest accuracy. The instance-level oracle, which selects the best two views per instance via exhaustive search, substantially surpasses the all-view baseline on both datasets.
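The gap between the two oracle protocols comes down to where the exhaustive search runs. A toy Python sketch makes this concrete; the random accuracy tensor is a hypothetical stand-in for per-instance, per-pair accuracies produced by the task network.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: acc[i, j, k] = accuracy of classifying instance i
# from the view pair (j, k); in practice this comes from the task network.
n_inst, n_views = 50, 12
acc = rng.random((n_inst, n_views, n_views))
pairs = list(combinations(range(n_views), 2))

# Instance-level oracle: the best pair is searched per instance.
inst_oracle = np.mean([max(acc[i, j, k] for j, k in pairs)
                       for i in range(n_inst)])

# Dataset-level oracle: one fixed pair, best on average over all instances.
data_oracle = max(np.mean(acc[:, j, k]) for j, k in pairs)
```

By construction the mean of per-instance maxima can never fall below the maximum of per-pair means, which mirrors the instance-level oracle dominating the dataset-level oracle in the table above.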
This indicates that redundant or low-quality views introduce noise that degrades the aggregated representation. However, the dataset-level oracle, which fixes the same two best views across all instances, performs considerably worse than the instance-level oracle and in some cases falls below the all-view baseline. This gap shows that the most informative views vary from one instance to another and cannot be predetermined as a fixed protocol. Random selection performs the worst overall, with high variance reflecting the inconsistency of uninformed view choices. MVSelect, which selects views conditioned on each instance's observations, approaches the instance-level oracle on both datasets. This confirms that an adaptive, observation-driven policy can recover near-optimal view combinations without exhaustive search.

Together, these results support the two properties that motivate SonoSelect: (1) a small number of well-chosen views can match or exceed the performance of exhaustive acquisition, and (2) the optimal views are instance-dependent, requiring an observation-driven selection policy. The following section evaluates whether these properties hold when the selection is embedded in a continuous navigation setting.

\textbf{Qualitative Results.} \cref{fig:Experiment1} visualizes the views selected by MVSelect for representative instances from both datasets. The policy avoids ambiguous cross-sections and orients the probe toward acoustic windows that capture discriminative geometric features of each object.

\subsection{Continuous Kidney Cyst Detection}
\label{sec:continuous_detection}
The preliminary study established that diagnostic information is sparse and instance-dependent, justifying an active selection strategy. Having confirmed that adaptive view selection yields substantial diagnostic gains in the discrete setting, we now evaluate whether these gains transfer to a physically realistic continuous scanning scenario.
In this setting, the agent can no longer teleport between viewpoints but instead navigates the probe through continuous kinematic actions, introducing the joint challenge of regional planning and local motion control. We evaluate SonoSelect, our hierarchical framework designed to address this joint challenge, to test both diagnostic performance and structural generalization.

\textbf{Experimental Setup.} The primary task requires the agent to dynamically scan the left kidney and identify renal cysts. We utilize 3D clinical CT volumes from the TotalSegmentator dataset~\cite{wasserthal2023totalsegmentator}. To evaluate structural generalization, patient anatomies are strictly partitioned into seen and unseen domains.

\textbf{Baselines.} In this fully continuous setting, we benchmark SonoSelect against baselines representing alternative exploration strategies:
\begin{itemize}
\item \textit{Random} applies uniformly sampled kinematic actions at each step, providing a lower bound that quantifies the diagnostic yield achievable without any learned or heuristic guidance.
\item \textit{Pure PPO} represents end-to-end reinforcement learning without hierarchical decomposition, testing whether a flat policy can implicitly learn both regional planning and local control.
\item \textit{VIG} (Volumetric Information Gain)~\cite{isler2016information} represents classical Next-Best-View planning driven by entropy maximization, testing whether uncertainty reduction alone provides sufficient guidance for diagnostic exploration.
\item \textit{RND}~\cite{burda2018exploration} provides a state-visitation driven exploration bonus, testing whether encouraging novel state visits improves coverage without task-specific guidance.
\end{itemize}

\begin{table*}[t]
\centering
\caption{Quantitative comparison of active scanning performance. SonoSelect exhibits superior robustness, effectively bridging the generalization gap that plagues standard RL baselines in unseen environments.}
\label{tab:main_results}
\begin{tabular}{l c c c c c c}
\toprule
\textbf{Method} & \textbf{Kidney Cov. (\%)} & \textbf{Cyst Cov. (\%)} & \textbf{Dice (\%)} & \textbf{IoU (\%)} & \textbf{Trans. (voxels)} & \textbf{Rot. ($^{\circ}$)} \\
\midrule
\rowcolor{yellow!50} \multicolumn{7}{c}{\textit{Seen Patient}} \\
Random & 19.3 & 12.2 & 30.6 & 19.4 & 934.3 & 2621.1 \\
Pure PPO~\cite{schulman2017proximal} & 59.2 & 28.5 & 61.7 & 44.9 & 458.5 & 327.2 \\
RND~\cite{burda2018exploration} & 60.8 & 31.3 & 63.1 & 46.3 & \textbf{430.4} & \textbf{253.7} \\
VIG~\cite{isler2016information} & 63.1 & \textbf{44.5} & \textbf{71.7} & \textbf{56.0} & 473.6 & 318.9 \\
\textbf{SonoSelect (Ours)} & \textbf{64.3} & 31.0 & 65.9 & 49.3 & 452.6 & 296.1 \\
\midrule
\rowcolor{orange!20} \multicolumn{7}{c}{\textit{Unseen Patient}} \\
Random & 27.4 & 2.7 & 42.3 & 27.8 & 927.8 & 2630.0 \\
Pure PPO & 25.7 & 8.2 & 39.1 & 26.5 & \textbf{210.1} & 375.5 \\
RND & 41.1 & 20.7 & 55.2 & 41.2 & 403.8 & \textbf{237.5} \\
VIG & 48.6 & 23.8 & 64.0 & 45.2 & 489.1 & 469.8 \\
\textbf{SonoSelect (Ours)} & \textbf{49.5} & \textbf{30.7} & \textbf{64.2} & \textbf{48.4} & 667.9 & 291.5 \\
\bottomrule
\end{tabular}
\end{table*}

\textbf{Quantitative Results.} \cref{tab:main_results} presents scanning performance on seen and unseen patient anatomies. On seen anatomies, VIG achieves the highest cyst coverage and reconstruction accuracy, while SonoSelect achieves the highest kidney coverage. This is consistent with the nature of entropy-based exploration: on training anatomies, the spatial distribution of acoustic uncertainty tends to align with target anatomical structures, so greedy entropy maximization effectively guides the probe toward informative regions.
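The entropy-maximization principle behind the VIG baseline can be sketched in a few lines of Python. The binary occupancy model, voxel counts, and view footprints below are toy assumptions for illustration, not the configuration of~\cite{isler2016information}.

```python
import numpy as np

def voxel_entropy(p):
    """Shannon entropy of per-voxel binary occupancy probabilities."""
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def information_gain(p, visible):
    """Total entropy over the voxels a candidate view would image."""
    return float(voxel_entropy(p[visible]).sum())

# Toy volume: unexplored voxels sit at p = 0.5 (maximum entropy),
# already-reconstructed voxels are near 0 or 1 (low entropy).
p = np.full(1000, 0.5)
p[:400] = 0.02                    # already reconstructed region

view_a = np.arange(0, 300)        # mostly re-images known voxels
view_b = np.arange(500, 800)      # images unexplored voxels
best = max([view_a, view_b], key=lambda v: information_gain(p, v))
```

Greedy selection favors the view covering unexplored (high-entropy) voxels, which explains both VIG's strength on seen anatomies and its failure mode on unseen ones: when high-entropy regions no longer coincide with target structures, the same rule steers the probe toward diagnostically uninformative space.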
SonoSelect's lower cyst coverage on seen data reflects a trade-off in its hierarchical design: the sector selection module distributes exploration across the scanning workspace based on estimated diagnostic value, producing more uniform spatial coverage rather than concentrating on the regions that happen to contain cysts in the training set. However, this broader exploration strategy favors generalization, as the results on unseen anatomies below confirm. For reference, the Random baseline achieves the lowest diagnostic scores across all metrics while consuming substantially more motion budget, confirming that directed exploration is necessary for this task.

All methods degrade on unseen anatomies, but the extent of degradation differs. Among the learned methods, Pure PPO exhibits the largest performance degradation, with kidney coverage dropping from 59.2\% to 25.7\% and cyst coverage from 28.5\% to 8.2\%, indicating that the flat policy does not learn transferable exploration behaviors across different anatomies. VIG's cyst coverage drops substantially, accompanied by a sharp increase in rotational motion (318.9$^{\circ}$ to 469.8$^{\circ}$). This suggests that on unseen anatomies, the alignment between high-entropy regions and target structures weakens, causing the entropy-driven policy to pursue uncertainty reduction in diagnostically uninformative regions while consuming motion budget on reorientation.

In contrast, SonoSelect's cyst coverage remains stable across seen and unseen anatomies, and it achieves the highest scores across all four diagnostic metrics on unseen data. This stability can be attributed to the hierarchical decomposition of the scanning policy. Because the high-level routing operates on sector-level spatial features rather than raw voxel coordinates, its decisions are less tied to the specific geometry of training anatomies.
Similarly, the low-level controller only needs to execute short-range navigation toward a given sector, a skill that depends on local kinematics rather than global anatomical layout. As a result, neither level relies on memorizing the full spatial structure of training patients, which explains why SonoSelect's performance degrades less when the anatomy changes.

% Place the figure directly above the paragraph that discusses it
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/ppo&sono.pdf}
\caption{Episode-level distribution of Cyst Coverage against Trajectory Length on unseen anatomies. SonoSelect (blue) pushes the Pareto front toward higher diagnostic yields compared to Pure PPO (red).}
\label{fig:tradeoff}
\end{figure}

\textbf{Episode-level analysis.} To further examine the generalization behavior at the episode level, \cref{fig:tradeoff} plots the episode-level distribution of cyst coverage against trajectory length on unseen anatomies. Pure PPO exhibits a dense cluster in the bottom-left quadrant, indicating frequent near-zero coverage episodes with short, spatially confined trajectories. This pattern is consistent with the limited transferability of the flat policy: when familiar spatial cues from training anatomies are absent, the agent tends to remain confined to local regions rather than exploring broadly. SonoSelect's distribution occupies the upper-right quadrant, where longer trajectories correspond to higher diagnostic coverage. As \cref{tab:main_results} shows, SonoSelect's average trajectory length is considerably longer than that of Pure PPO, yet this additional motion translates into higher scores across all diagnostic metrics, indicating thorough exploration of the target anatomy rather than aimless wandering.

\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{images/trajectories.pdf}
\caption{Qualitative comparison of scanning trajectories on unseen patient data.
Red segments indicate trajectory portions where the probe is actively scanning the kidney or cyst, while gray segments represent motion through non-target regions. The percentage below each example records the proportion of the total trajectory spent on effective target scanning. (a) Pure PPO produces uncoordinated trajectories with low effective scanning ratios. (b) SonoSelect achieves structured, anatomy-centered navigation with substantially higher effective scanning ratios.}
\label{fig:qualitative_trajectories}
\end{figure}

\textbf{Qualitative Results.} \cref{fig:qualitative_trajectories} visualizes representative trajectories generated by Pure PPO and SonoSelect on unseen anatomies. As illustrated in \cref{fig:qualitative_trajectories}a, Pure PPO produces uncoordinated circular movements far from the kidney, with the majority of the trajectory passing through non-target regions. The effective scanning ratio in these examples ranges from 13.5\% to 19.6\%, indicating that the agent spends most of its motion budget on non-informative traversal. This is consistent with the low coverage reported in \cref{tab:main_results}, where the agent fails to direct the probe toward the target anatomy. In contrast, SonoSelect (\cref{fig:qualitative_trajectories}b) produces more structured trajectories that closely follow the contours of the kidney. The effective scanning ratios increase substantially, reflecting that a larger fraction of the trajectory contributes to diagnostic observation. This improvement is attributable to the sector-level routing learned by the high-level module, which directs the probe toward the target region and reduces time spent in non-informative areas.

\subsection{Ablation Studies}
\label{sec:ablation}
To validate the core architectural designs of SonoSelect, we conduct ablation experiments on the kidney cyst detection task using unseen patient data.
We isolate three components: the learned routing policy, the per-sector feature representation, and the residual control module. Each ablation removes one component while keeping the rest unchanged. The quantitative comparisons are summarized in \cref{tab:ablation}.

\textbf{Effect of Learned Routing.} We first evaluate the high-level decision maker by replacing the learned routing policy with random sector selection. Without a task-driven geometric prior, the continuous policy receives arbitrary directional targets, leading to uncoordinated probe motion. As shown in \cref{tab:ablation}, this variant suffers a significant drop in cyst coverage, confirming that the learned routing policy is necessary to constrain the search space and direct the continuous policy toward diagnostically relevant regions.

\begin{table}[t]
\centering
\begin{tabular}{lcccc}
\toprule
Method & Kidney & Cyst & Dice & IoU \\ % headers abbreviated to fit the narrow columns
\midrule
Random Routing & 44.4 & 12.5 & 62.2 & 45.5 \\
w/o Sector Features & 44.3 & 18.3 & 61.3 & 45.2 \\
w/o Residual Control & 44.9 & 10.1 & 59.2 & 44.9 \\
\textbf{SonoSelect (Ours)} & \textbf{49.5} & \textbf{30.7} & \textbf{64.2} & \textbf{48.4} \\
\bottomrule
\end{tabular}
\caption{Ablation study of SonoSelect components.}
\label{tab:ablation}
\end{table}

\textbf{Necessity of Explicit Sector Features.} The w/o Sector Features variant replaces the learned feature vectors with uniform values, making all sectors appear identical to the Q-network. Although the Q-network still receives the global state $s_t$, it cannot distinguish sectors based on their spatial content in the reconstruction volume. As a result, the Q-network selects sectors without considering what each region contains, leading to reduced coverage for both kidney and cyst targets. This confirms that per-sector spatial features are necessary for the Q-network to direct exploration toward regions likely to contain the target anatomy.
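Why uniform features collapse the routing decision can be shown with a toy sketch. The linear Q-head and the $8 \times 4$ feature shape below are illustrative assumptions standing in for the actual Q-network and per-sector descriptors.

```python
import numpy as np

rng = np.random.default_rng(2)

W = rng.standard_normal(4)  # stand-in for a learned linear Q-head

def q_values(sector_feats):
    """Score each sector from its feature vector (toy linear Q-network)."""
    return sector_feats @ W

# Per-sector features, e.g. pooled occupancy statistics for 8 sectors.
feats = rng.standard_normal((8, 4))
informed_choice = int(np.argmax(q_values(feats)))

# 'w/o Sector Features' ablation: identical features for every sector.
uniform = np.tile(feats[0], (8, 1))
q_uniform = q_values(uniform)
# Every sector now receives the same score, so argmax over q_uniform
# carries no information and sector selection is effectively arbitrary.
```

With identical inputs the Q-head necessarily emits identical scores, which mirrors the ablation result: the network can no longer prefer sectors whose contents are more likely to contain the target anatomy.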
\textbf{Role of Residual Control.} The w/o Residual Control variant removes the low-level kinematic adjustments. This variant achieves the lowest cyst coverage among all configurations, while its kidney coverage remains comparable to the other ablated variants. This asymmetry reveals a clear functional division within the framework: the high-level routing policy is sufficient to guide the probe toward the correct anatomical region, but capturing small targets such as cysts requires the fine-grained probe adjustments that only the residual control module provides.

\section{Conclusion}
We propose SonoSelect, an active multi-view exploration framework for robotic ultrasound that intelligently seeks informative viewpoints without exhaustive scanning or predefined trajectories. By bridging discrete high-level regional routing with continuous low-level kinematic control, SonoSelect learns to resolve anatomical ambiguities and achieves robust generalization to unseen anatomies where standard reinforcement learning approaches show substantial performance degradation. Experiments in discrete multi-view classification and continuous dynamic cyst detection confirm these gains, demonstrating superior diagnostic accuracy through anatomy-aware exploration. This active exploration approach represents a step toward autonomous robotic ultrasound deployment in clinical workflows. However, because the current evaluation relies on simulated static volumes, the framework has not been validated against complex tissue deformations and dynamic acoustic coupling losses inherent in physical scanning. Future work includes integrating force-aware contact dynamics, modeling realistic soft-tissue deformation, and physical clinical deployment.
%%
%% The next two lines define the bibliography style to be used, and
%% the bibliography file.
\bibliographystyle{ACM-Reference-Format} \bibliography{sample-base} \end{document} \endinput %% %% End of file `sample-sigconf-authordraft.tex'.