Spec Deep-Dive: DroidAgent
Question
Why does coinse/droidagent belong in the spec dataset if it is not a
traditional requirements repository and has almost no public revision history?
Short answer
DroidAgent is valuable because it turns natural-language intent into executable
mobile-GUI testing traces. Its specification-like artifacts are not architecture
Markdown or a canonical spec.md; they are generated tasks, end conditions,
action histories, GUI-state observations, markdown reports, and UIAutomator-style
replay scripts. For spec-dataset-evolution-research-project, it is a compact
post-LLM example of agent-generated behavioral scenario specifications with a
strong code/spec bridge and weak ordinary longitudinal history.
Source basis
| Claim scope | Private corpus source | Public upstream reference | Evidence fields used | Caveat |
|---|---|---|---|---|
| Repository identity | reports/deep-dives/droidagent.md | https://github.com/coinse/droidagent | repo URL, local clone metadata, README retrieval, HEAD, commit count | GitHub REST metadata was rate-limited; evidence is from clone, raw README retrieval, page scrape, and local inspection. |
| Paper and purpose | reports/deep-dives/droidagent.md | arXiv 2311.08649, Autonomous Large Language Model Agents Enabling Intent-Driven Mobile GUI Testing | arXiv API abstract, README, paper-linked evaluation claims | Semantic Scholar returned HTTP 429, so citation counts were not used. |
| Implementation architecture | reports/deep-dives/droidagent.md | paths including droidagent/agent.py, _planner.py, _actor.py, _observer.py, _reflector.py, memories/, scripts/ | local file inventory, component paths, action API, memory implementation | This page paraphrases dossier findings; it does not copy source files or experiment outputs. |
| Spec-like output surface | reports/deep-dives/droidagent.md | generated histories, reports, scripts, notebooks, CSV assessments in the public repo | task text, end conditions, event traces, UIAutomator script generator, report generator, manual assessment rows | Raw evaluation data referenced by README was not present in the clone; Google Drive archives would be separate evidence. |
| Limitations | reports/deep-dives/droidagent.md | same repo and paper | retrieval failures, small commit history, old OpenAI model defaults, missing data, replay caveats | The public repo is publication-shaped; it should not be mined as a rich long-term evolution source. |
What DroidAgent is
DroidAgent is an Android GUI testing agent from the COINSE group. The inspected dossier describes a plan/act/observe/reflect loop:
- generate a realistic user task;
- choose GUI actions through function-calling over available widgets;
- observe and summarize state changes;
- reflect on task success or failure and write lessons into memory.
The implementation separates planner, actor, observer, reflector, persistent memory, possible actions, experiment runner, report generator, and script generator. That structure is directly relevant to tool-execution and evaluation-and-review-loops: the text is not merely explanatory; it becomes part of a test-generation loop.
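The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DroidAgent's code: `Task`, `Memory`, `run_task`, and the four stub callables are hypothetical names, and the stubs return canned strings where the real components would call an LLM and an Android device.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    description: str    # natural-language user intent
    end_condition: str  # when the task counts as finished


@dataclass
class Memory:
    lessons: List[str] = field(default_factory=list)


def run_task(plan, act, observe, reflect, memory, max_actions=5):
    """Run one plan/act/observe/reflect cycle (all callables are hypothetical)."""
    task = plan(memory)
    history = []
    for _ in range(max_actions):
        action = act(task, history)
        if action is None:  # actor signals that the end condition is met
            break
        history.append((action, observe(action)))
    # The reflection is written back to memory so later planning can use it.
    memory.lessons.append(reflect(task, history))
    return task, history


# Canned stand-ins for the planner, actor, observer, and reflector.
SCRIPTED_ACTIONS = ["touch('New note')", "set_text('Title', 'Groceries')", "touch('Save')"]


def plan(memory):
    return Task("Add a note titled 'Groceries'", "note is visible in the list")


def act(task, history):
    return SCRIPTED_ACTIONS[len(history)] if len(history) < len(SCRIPTED_ACTIONS) else None


def observe(action):
    return f"screen changed after {action}"


def reflect(task, history):
    return f"{task.description}: finished in {len(history)} actions"


memory = Memory()
task, history = run_task(plan, act, observe, reflect, memory)
```

The point of the sketch is the data flow: the task text and end condition constrain the actor, each action is paired with an observation, and the reflection lands in memory where the next planning step can read it.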
The spec-like artifacts
DroidAgent does not supply a single stable requirements file. Its corpus value is in generated and derived artifacts:
| Artifact family | Why it is spec-like | Connected code path or mechanism |
|---|---|---|
| Generated task descriptions | They state user intent and target outcome at scenario level. | Planner prompt and task-selection logic. |
| End conditions | They define when a GUI task should stop or count as achieved. | Actor/reflector loop and task state. |
| Action histories | They bind intent to concrete widget actions and observations. | Event records in the exploration data. |
| UIAutomator-style scripts | They turn traces into executable replay code. | scripts/make_script.py. |
| Markdown task reports | They render action/state evidence into human-reviewable summaries. | scripts/make_report.py. |
| Manual assessment CSVs and notebooks | They provide task realism/success labels and evaluation context. | Evaluation notebooks and assessment files. |
That is a high-connectedness pattern: natural-language task, GUI evidence, event trace, replay script, and report can all share provenance. It is not the same shape as llm-readable-spec-files, but it tests the same underlying idea: a useful spec constrains behavior and leaves evidence.
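One way to picture the shared provenance is a single record from which both a replay script and a report are derived. The sketch below is an assumption-laden illustration: `Event`, `ScenarioRecord`, `to_replay_lines`, and `to_report` are hypothetical names, the emitted statements only imitate UIAutomator-style Python syntax, and none of this reflects the actual output format of scripts/make_script.py or scripts/make_report.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    action: str                   # e.g. "touch", "set_text", "back"
    target: Optional[str] = None  # visible widget text, if any
    value: Optional[str] = None   # text typed into the widget
    observation: str = ""         # summarized GUI-state change


@dataclass
class ScenarioRecord:
    task: str
    end_condition: str
    events: List[Event]


def to_replay_lines(record):
    """Render events as UIAutomator-style statements (illustrative syntax only)."""
    lines = [f"# task: {record.task}", f"# end condition: {record.end_condition}"]
    for e in record.events:
        if e.action == "touch":
            lines.append(f"d(text={e.target!r}).click()")
        elif e.action == "set_text":
            lines.append(f"d(text={e.target!r}).set_text({e.value!r})")
        elif e.action == "back":
            lines.append('d.press("back")')
        else:
            lines.append(f"# unsupported action: {e.action}")
    return lines


def to_report(record):
    """Render the same record as a short markdown summary."""
    rows = [f"Task: {record.task}", f"End condition: {record.end_condition}", ""]
    for i, e in enumerate(record.events, start=1):
        rows.append(f"{i}. {e.action} {e.target or ''}: {e.observation}".rstrip())
    return "\n".join(rows)


record = ScenarioRecord(
    task="Add a note titled 'Groceries'",
    end_condition="note is visible in the list",
    events=[
        Event("touch", "New note", observation="editor opened"),
        Event("set_text", "Title", "Groceries", observation="title filled"),
        Event("back", observation="returned to list"),
    ],
)
replay_lines = to_replay_lines(record)
report = to_report(record)
```

Because both renderers read the same record, a reviewer can trace any line of the replay script back to the task text and the observation that accompanied it.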
Evaluation and timing
The paper was published on arXiv in November 2023; the public repository begins in January 2024 with a publication/artifact-shaped history. The dossier records only four public commits, so repo-level spec evolution is weak.
The evaluation claims are more interesting than the git history:
- arXiv abstract: 15 Themis benchmark apps, average activity coverage reported at 61%, and 317 of 374 autonomously created tasks manually judged realistic/relevant;
- local notebooks: coverage comparisons against DroidBot, GPTDroid, Humanoid, and Monkey; task success/failure counts; ablations; and cost accounting;
- caveat: the raw evaluation directories named by the README were not present in the inspected clone.
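For scale, the task-realism figure from the abstract works out as follows. The snippet uses only the two counts quoted above; the variable names are illustrative, and no value is taken from the missing raw evaluation data.

```python
# Counts quoted from the arXiv abstract.
realistic_tasks = 317
total_tasks = 374

realistic_ratio = realistic_tasks / total_tasks
not_realistic = total_tasks - realistic_tasks

print(f"realistic/relevant: {realistic_ratio:.1%}")    # 84.8%
print(f"not judged realistic: {not_realistic} tasks")  # 57 tasks
```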
For the dataset, this is a post-LLM behavioral-spec generation case, not a rich longitudinal repo case.
Pressure and evolution signals
The public git history is too small to support a meaningful pressure timeline. The useful pressures are inside the agent loop:
- unreachable GUI states force action revision;
- repeated actions trigger critique;
- timeout and max-action bounds force abandonment or replanning;
- widget memory changes future prompts by adding learned affordances;
- persona and goal changes alter task selection;
- replay scripts are explicitly not guaranteed to be fully reproducible.
This is pressure from interaction with the world, not from a project roadmap. That distinction matters for harness-engineering because an environment can make its own specs by colliding with reality. Reality remains an uncompromising reviewer; pleasingly, it does not accept vague tickets.
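Several of the in-loop pressures listed above can be expressed as a small guard that runs before each action. This is a hedged sketch: `check_pressure` is a hypothetical helper, and the thresholds (`max_actions`, `timeout_s`, `repeat_limit`) are illustrative assumptions rather than DroidAgent's configured defaults.

```python
import time


def check_pressure(history, started_at, *, max_actions=20, timeout_s=300.0,
                   repeat_limit=3, now=None):
    """Return a pressure signal for the current task, or None to continue.

    All thresholds are illustrative assumptions, not DroidAgent defaults.
    """
    now = time.monotonic() if now is None else now
    if now - started_at > timeout_s:
        return "abandon: timeout"
    if len(history) >= max_actions:
        return "abandon: max actions reached"
    # A trailing run of identical actions triggers a critique, not an abort.
    tail = history[-repeat_limit:]
    if len(tail) == repeat_limit and len(set(tail)) == 1:
        return f"critique: repeated {tail[0]!r}"
    return None


ok_signal = check_pressure(["tap A", "tap B"], 0.0, now=1.0)
repeat_signal = check_pressure(["tap X"] * 3, 0.0, now=1.0)
bound_signal = check_pressure([f"tap {i}" for i in range(20)], 0.0, now=1.0)
timeout_signal = check_pressure([], 0.0, now=301.0)
```

Abandonment and critique are kept as distinct signals because they lead to different follow-ups: a bound being hit ends or replans the task, while a repetition critique asks the actor to revise its next action.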
Limitations and publication boundary
- No raw corpus files, screenshots, CSV contents, or generated reports are copied here.
- GitHub API, Firecrawl-backed search, and Semantic Scholar evidence had retrieval failures or rate limits; those failures constrain confidence.
- The repo is small and artifact-like, so it should be a deep case study rather than a broad mining stratum.
- OpenAI model defaults in the code target older 0613-era chat/function-calling models and may need modernization before reproduction.
- Generated replay scripts are not guaranteed to replay reliably, so they can be flaky and may require manual hardening.
Dataset implication
Classify DroidAgent as behavioral-scenario-spec, executable-gui-test-spec,
and agent-generated-spec. Use it as the canonical contrast to
spec-deep-dive-case-jcode: jcode has high connectedness through living
architecture docs and prompt/instruction files; DroidAgent has high connectedness
through generated task traces and replay scripts.
Deep-dive navigation
- Aggregate index: spec-deep-dive-index
- Priority cases: spec-deep-dive-case-jcode, spec-deep-dive-case-droidagent, spec-deep-dive-case-j8-ambiguity
- Cohort pages: spec-deep-dive-cohort-exact-spec-md-and-standards, spec-deep-dive-cohort-agent-native-spec-kit-kiro, spec-deep-dive-cohort-rfc-adr-executable-contracts