Spec Deep-Dive: DroidAgent

Question

Why does coinse/droidagent belong in the spec dataset if it is not a traditional requirements repository and has almost no public revision history?

Short answer

DroidAgent is valuable because it turns natural-language intent into executable mobile-GUI testing traces. Its specification-like artifacts are not architecture Markdown or a canonical spec.md; they are generated tasks, end conditions, action histories, GUI-state observations, Markdown reports, and UIAutomator-style replay scripts. For spec-dataset-evolution-research-project, it is a compact post-LLM example of agent-generated behavioral scenario specifications with a strong code/spec bridge and little conventional longitudinal history.

Source basis

Claim scope	Private corpus source	Public upstream reference	Evidence fields used	Caveat
Repository identity	`reports/deep-dives/droidagent.md`	`https://github.com/coinse/droidagent`	repo URL, local clone metadata, README retrieval, HEAD, commit count	GitHub REST metadata was rate-limited; evidence is from clone, raw README retrieval, page scrape, and local inspection.
Paper and purpose	`reports/deep-dives/droidagent.md`	arXiv `2311.08649`, Autonomous Large Language Model Agents Enabling Intent-Driven Mobile GUI Testing	arXiv API abstract, README, paper-linked evaluation claims	Semantic Scholar returned HTTP 429, so citation counts were not used.
Implementation architecture	`reports/deep-dives/droidagent.md`	paths including `droidagent/agent.py`, `_planner.py`, `_actor.py`, `_observer.py`, `_reflector.py`, `memories/`, `scripts/`	local file inventory, component paths, action API, memory implementation	This page paraphrases dossier findings; it does not copy source files or experiment outputs.
Spec-like output surface	`reports/deep-dives/droidagent.md`	generated histories, reports, scripts, notebooks, CSV assessments in the public repo	task text, end conditions, event traces, UIAutomator script generator, report generator, manual assessment rows	Raw evaluation data referenced by README was not present in the clone; Google Drive archives would be separate evidence.
Limitations	`reports/deep-dives/droidagent.md`	same repo and paper	retrieval failures, small commit history, old OpenAI model defaults, missing data, replay caveats	The public repo is publication-shaped; it should not be mined as a rich long-term evolution source.

What DroidAgent is

DroidAgent is an Android GUI testing agent from the COINSE group. The inspected dossier describes a plan/act/observe/reflect loop:

generate a realistic user task;
choose GUI actions through function-calling over available widgets;
observe and summarize state changes;
reflect on task success or failure and write lessons into memory.

The implementation separates planner, actor, observer, reflector, persistent memory, possible actions, experiment runner, report generator, and script generator. That structure is directly relevant to tool-execution and evaluation-and-review-loops: the text is not merely explanatory; it becomes part of a test-generation loop.
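The four-step loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DroidAgent's code: `run_task`, `Step`, `TaskResult`, and the `demo_*` stand-ins are hypothetical names, and the real planner, actor, observer, and reflector call an LLM and a device rather than fixed stubs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    action: str        # e.g. "touch:save" (hypothetical action encoding)
    observation: str   # summary of the resulting GUI state change


@dataclass
class TaskResult:
    task: str
    steps: List[Step] = field(default_factory=list)
    succeeded: bool = False
    lesson: str = ""


def run_task(
    plan: Callable[[List[str]], str],
    act: Callable[[str, List[Step]], Optional[str]],
    observe: Callable[[str], str],
    reflect: Callable[[str, List[Step]], Tuple[bool, str]],
    memory: List[str],
    max_actions: int = 5,
) -> TaskResult:
    """Run one plan/act/observe/reflect cycle and store the lesson in memory."""
    task = plan(memory)                      # plan: pick a user task
    result = TaskResult(task=task)
    for _ in range(max_actions):             # bounded number of actions
        action = act(task, result.steps)     # act: choose the next GUI action
        if action is None:                   # actor signals it is done
            break
        result.steps.append(Step(action, observe(action)))  # observe
    result.succeeded, result.lesson = reflect(task, result.steps)  # reflect
    memory.append(result.lesson)             # lesson feeds later planning
    return result


# Hypothetical stand-ins for the LLM-backed components.
SCRIPTED_ACTIONS = ["touch:new_note", "set_text:title=groceries", "touch:save"]


def demo_plan(memory: List[str]) -> str:
    return "Create a note titled 'groceries'"


def demo_act(task: str, steps: List[Step]) -> Optional[str]:
    return SCRIPTED_ACTIONS[len(steps)] if len(steps) < len(SCRIPTED_ACTIONS) else None


def demo_observe(action: str) -> str:
    return f"screen changed after {action}"


def demo_reflect(task: str, steps: List[Step]) -> Tuple[bool, str]:
    done = bool(steps) and steps[-1].action == "touch:save"
    lesson = "Saving requires the save button." if done else "Task not finished."
    return done, lesson


memory: List[str] = []
result = run_task(demo_plan, demo_act, demo_observe, demo_reflect, memory)
```

The point of the sketch is the data flow: the task text, the action/observation pairs, and the reflection all land in one record, which is what makes the output usable as a scenario-level spec.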

The spec-like artifacts

DroidAgent does not supply a single stable requirements file. Its corpus value is in generated and derived artifacts:

Artifact family	Why it is spec-like	Connected code path or mechanism
Generated task descriptions	They state user intent and target outcome at scenario level.	Planner prompt and task-selection logic.
End conditions	They define when a GUI task should stop or count as achieved.	Actor/reflector loop and task state.
Action histories	They bind intent to concrete widget actions and observations.	Event records in the exploration data.
UIAutomator-style scripts	They turn traces into executable replay code.	`scripts/make_script.py`.
Markdown task reports	They render action/state evidence into human-reviewable summaries.	`scripts/make_report.py`.
Manual assessment CSVs and notebooks	They provide task realism/success labels and evaluation context.	Evaluation notebooks and assessment files.

That is a high-connectedness pattern: natural-language task, GUI evidence, event trace, replay script, and report can all share provenance. It is not the same shape as llm-readable-spec-files, but it tests the same underlying idea: a useful spec constrains behavior and leaves evidence.
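To make the shared-provenance idea concrete, the sketch below keeps a task, its end condition, and its event trace in one record and renders that record as replay-script text. It is an assumption-laden illustration, not the behavior of `scripts/make_script.py`: the `Event` and `Scenario` types, the event kinds, and the resource ids are hypothetical, and the emitted lines only imitate the selector style of the `uiautomator2` Python library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    kind: str              # hypothetical kinds: "touch", "set_text", "back"
    resource_id: str = ""
    text: str = ""


@dataclass(frozen=True)
class Scenario:
    task: str              # natural-language intent
    end_condition: str     # when the task counts as achieved
    events: List[Event]    # concrete trace bound to that intent


def to_replay_script(scenario: Scenario) -> str:
    """Render a scenario as UIAutomator-style script text (not executed here)."""
    lines = [
        f"# task: {scenario.task}",
        f"# end condition: {scenario.end_condition}",
        "import uiautomator2 as u2",
        "d = u2.connect()",
    ]
    for event in scenario.events:
        if event.kind == "touch":
            lines.append(f"d(resourceId={event.resource_id!r}).click()")
        elif event.kind == "set_text":
            lines.append(
                f"d(resourceId={event.resource_id!r}).set_text({event.text!r})"
            )
        elif event.kind == "back":
            lines.append('d.press("back")')
        else:
            raise ValueError(f"unsupported event kind: {event.kind}")
    return "\n".join(lines)


scenario = Scenario(
    task="Create a note titled 'groceries'",
    end_condition="A note named 'groceries' appears in the note list",
    events=[
        Event("touch", "com.example:id/new_note"),
        Event("set_text", "com.example:id/title", "groceries"),
        Event("touch", "com.example:id/save"),
    ],
)
script = to_replay_script(scenario)
print(script)
```

Because the intent and end condition travel with the script as comments, a reviewer can check the replay code against what the scenario was supposed to show.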

Evaluation and timing

The paper was published on arXiv in November 2023; the public repository begins in January 2024 with a publication/artifact-shaped history. The dossier records only four public commits, so repo-level spec evolution is weak.

The evaluation claims are more interesting than the git history:

arXiv abstract: 15 Themis benchmark apps, average activity coverage reported at 61%, and 317 of 374 autonomously created tasks manually judged realistic/relevant;
local notebooks: coverage comparisons against DroidBot, GPTDroid, Humanoid, and Monkey; task success/failure counts; ablations; and cost accounting;
caveat: the raw evaluation directories named by the README were not present in the inspected clone.
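As a quick check on the task-realism figure, the share implied by the abstract's counts can be computed directly. The two counts come from the abstract as quoted above; the percentage is derived here and is not a figure taken from the paper.

```python
realistic, total = 317, 374      # counts quoted from the arXiv abstract
share = realistic / total        # fraction judged realistic/relevant
rejected = total - realistic     # tasks not judged realistic/relevant

print(f"{share:.1%} realistic, {rejected} tasks not judged realistic")
# prints "84.8% realistic, 57 tasks not judged realistic"
```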

For the dataset, this is a post-LLM behavioral-spec generation case, not a rich longitudinal repo case.

Pressure and evolution signals

The public git history is too small to support a meaningful pressure timeline. The useful pressures are inside the agent loop:

unreachable GUI states force action revision;
repeated actions trigger critique;
timeout and max-action bounds force abandonment or replanning;
widget memory changes future prompts by adding learned affordances;
persona and goal changes alter task selection;
replay scripts are explicitly not guaranteed to be fully reproducible.

This is pressure from interaction with the world, not from a project roadmap. That distinction matters for harness-engineering because an environment can make its own specs by colliding with reality. Reality remains an uncompromising reviewer; pleasingly, it does not accept vague tickets.
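Two of these in-loop pressures, repeated actions and the max-action bound, can be expressed as a small guard over the action history. This is a sketch under assumed thresholds and labels: `next_pressure`, `max_actions`, `repeat_limit`, and the returned strings are hypothetical and are not taken from the DroidAgent source.

```python
from typing import List, Optional


def next_pressure(
    actions: List[str],
    max_actions: int = 10,   # assumed action budget per task
    repeat_limit: int = 3,   # assumed run length that counts as "stuck"
) -> Optional[str]:
    """Return the pressure the loop should respond to, or None to continue."""
    if len(actions) >= max_actions:
        return "abandon_or_replan"            # budget exhausted
    tail = actions[-repeat_limit:]
    if len(tail) == repeat_limit and len(set(tail)) == 1:
        return "critique_repeated_action"     # same action repeated in a row
    return None


print(next_pressure(["touch:save", "touch:save", "touch:save"]))
# prints "critique_repeated_action"
```

Checking the budget before the repetition test is a design choice in this sketch: once the bound is hit, critique has no remaining actions to influence.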

Limitations and publication boundary

No raw corpus files, screenshots, CSV contents, or generated reports are copied here.
GitHub API, Firecrawl-backed search, and Semantic Scholar evidence had retrieval failures or rate limits; those failures constrain confidence.
The repo is small and artifact-like, so it should be a deep case study rather than a broad mining stratum.
OpenAI model defaults in the code target older 0613-era chat/function-calling models and may need modernization before reproduction.
Generated replay scripts are not guaranteed to reproduce the original run and may require manual hardening.

Dataset implication

Classify DroidAgent as behavioral-scenario-spec, executable-gui-test-spec, and agent-generated-spec. Use it as the canonical contrast to spec-deep-dive-case-jcode: jcode has high connectedness through living architecture docs and prompt/instruction files; DroidAgent has high connectedness through generated task traces and replay scripts.

Aggregate index: spec-deep-dive-index
Priority cases: spec-deep-dive-case-jcode, spec-deep-dive-case-droidagent, spec-deep-dive-case-j8-ambiguity
Cohort pages: spec-deep-dive-cohort-exact-spec-md-and-standards, spec-deep-dive-cohort-agent-native-spec-kit-kiro, spec-deep-dive-cohort-rfc-adr-executable-contracts
