A reverse-vibe-coding workflow for refactoring

in Programming & Dev, 19 days ago (edited)


(Image by Gemini AI)

In my previous blog post I introduced the Reverse Vibe-Coding proto-manifesto. That post was a relatively high-level overview of the proposed workflow, which largely reverses the human/AI work split in order to arrive at a solid workflow suitable for full product lifecycle, production-grade development, geared to product-lifetime productivity, not quick prototyping.

While there actually is some vibe-coding in reverse vibe-coding (mainly for experiments and learning), and while core business logic is hand-coded with a strong focus on maintainability and the DRY principle, in full-lifetime development refactoring is a huge part of the total development effort, and the place where the proposed RVC workflow has its actual AI backbone.

A backbone that revives a merge-based git workflow: rather than the ambiguous rebased blob commits now common in many AI-assisted and vibe-coding workflows, we get a git history with complete provenance again, one that differentiates sharply between human contributions and contributions by AI.

In this post I want to dive into this backbone. Everything stays at a conceptual level, but a deeper, more technical one than what I outlined in the proto-manifesto.

Note that while this post outlines the refactor workflow, the workflows for scaffolding and boilerplate are very similar in most aspects; to keep maximum clarity in this blog post, we only look at the refactoring workflow.

Prompt -> prompt-template -> parallel templates -> DSL

Prompting an LLM or SLM in English is great for tasks that explore novel approaches, but as you use a particular model for a while, you find wordings that work significantly better than others. After a few reproducibility incidents where the proper wording was forgotten, many developers start collecting those really good prompts. When doing a similar task after looking up an old prompt in the collection, a lightbulb moment often occurs: hey, what if I turned this prompt into a template? So now, instead of English, we are prompting with what is basically a bit of code: a template invocation that creates the prompt.

When we then have a collection of templates, sometimes one template doesn't clearly work best, so we make a slightly tuned version of the original prompt into a second template, then a third. And instead of giving the LLM/SLM one prompt, we give it all variants and then, as humans, end up choosing the best of the results.

Then, the more we use template invocation code to prompt, the more inconvenient writing a small bit of Python/Jinja code feels, and a prompting DSL starts emerging, all in small steps, all at the same time. In Reverse Vibe-Coding, we embrace this process. No English prompts for our target project: English prompts are put into templates in a separate central DSL repo, and once they are defined, we extend the DSL to make invocation convenient.
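To make that first evolution step concrete, here is a hypothetical sketch of the template stage, using Python's stdlib `string.Template` rather than Jinja. All names (the placeholder variables, the refactoring prompt wording) are illustrative, not part of any RVC spec.

```python
from string import Template

# A hypothetical English prompt, frozen into a reusable template.
# The $-placeholders (class_name, header, source, members) are illustrative.
REFACTOR_PROMPT = Template(
    "I want to refactor the C++ class $class_name, declared in "
    "$header and defined in $source. Please split out the members "
    "$members into a new class, keeping the public interface intact."
)

def render_prompt(class_name: str, header: str, source: str, members: list) -> str:
    """Turn a template invocation into the actual prompt text."""
    return REFACTOR_PROMPT.substitute(
        class_name=class_name,
        header=header,
        source=source,
        members=", ".join(members),
    )

prompt = render_prompt("Foo", "inc/foo.hpp", "src/foo.cpp", ["_pBax", "_pQuaz"])
```

From here, variants are just sibling templates with tuned wording, invoked with the same arguments.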

As noted, RVC is still a moving target and so is the DSL sub-workflow, but by embracing this sub-workflow, users can start moving away from the imprecise artistic way of prompting to a more deterministic flow.

The bottom line:

"English is not the right language for non-creative AI code assistance tasks, we need a Domain Specific Language"

At first every organization, team or lone developer will have their own DSL, but likely a DSL will eventually emerge that is generic enough to be widely used, and the templating layer may become obsolete because the LLMs/SLMs could be trained to natively use that DSL.

So what could a DSL for prompting look like? It was a hard decision whether or not to put some idealized version of my own current efforts in here, because RVC doesn't want to dictate any specific DSL syntax or semantics, but it is important to have at least some measure of what kind of expressions you would be giving the ambiguous, LLM/SLM-dependent English prompt up for.

Please treat this example as purely directional, if you use it at all; find your own way from template to DSL.

LANGUAGE CPP
CONFORM TO GUIDELINES:
  ALEXANDRESCU
  CORE
FORMAT WITH:
  RATTLIFF
HOUSESTYLE innuendo
REFACTOR Foo FROM src/foo.cpp INTERFACE inc/foo.hpp:
  OWNER Bar FROM src/bar.cpp INTERFACE inc/bar.hpp
  OWNER FooTester FROM test/foo_test.cpp
  TEMPLATE class_proxy_split VARIANTS ALL:
    VAR takeout [_pBax, _pQuaz]
    VAR newclass FooPart
    VAR membership unique_ptr
    MODIFY:
      TEMPLATE constructorspec VARIANTS [move_nocopy, copy_nomove]

The example is a little verbose, but it is specific, and it is complete enough to be atomic in light of the validation baseline that we shall discuss in the next section.
But first, let's give another set of idealized, purely directional template definitions this DSL code could rely on. These are fully fictional and again meant only for illustration:

Context template

IMPLEMENTS:
  TEMPLATE DSL:CONTEXT
  MODEL deepseek
  MODE webchat
MATCH:
  LANGUAGE CPP
RVC_PROMPT_TEMPLATE
I want you to look at some C++ code, but first let me give you some
context.
# formatting
The source code uses {{DSL:FORMAT}} indentation. For clarity let me
outline the rules for this formatting:
{{CONTENT:DSL:FORMAT}}
# coding style
The coding uses the following coding style.
{{FLOW:BEGIN foreach DSL:GUIDELINES AS guideline}}
## {{FLOW:VAR guideline}}
{{CONTENT:FLOW:VAR guideline}}
{{FLOW:END foreach}}
# House style
Next to these conventions, the code uses a few house style rules:
{{CONTENT:DSL:HOUSESTYLE}}

A template like this would pull in relevant context at SLM/LLM interaction start.
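For illustration only: a naive expansion pass over such a template could start out as small as a regex substitution. The placeholder names mirror the fictional template above; flow-control tags like {{FLOW:BEGIN ...}} would of course need real handling, which is why unknown keys are left intact here for a later pass.

```python
import re

def expand(template: str, values: dict) -> str:
    """Replace every {{KEY}} placeholder with its value from the DSL context."""
    def repl(match):
        key = match.group(1)
        # Unknown keys (e.g. flow-control tags) are left untouched.
        return str(values.get(key, match.group(0)))
    return re.sub(r"\{\{([^}]+)\}\}", repl, template)

snippet = "The source code uses {{DSL:FORMAT}} indentation."
expanded = expand(snippet, {"DSL:FORMAT": "RATTLIFF"})
```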

Refactor template

IMPLEMENTS:
  TEMPLATE DSL:REFACTOR
  MODEL deepseek
  MODE webchat
MATCH:
  LANGUAGE CPP
RVC_PROMPT_TEMPLATE
# Refactoring
The class we want to refactor is the {{DSL:REFACTOR}} class that is
declared in the header file {{DSL:INTERFACE}} file and defined in the
file {{DSL:FROM}}. I'll share both files:

{{DSL:INTERFACE}}
```
{{CONTENT:DSL:INTERFACE}}
```

{{DSL:FROM}}
```
{{CONTENT:DSL:FROM}}
```
# Clients
The {{DSL:REFACTOR}} class is used from a number of files, that I'll list:

{{FLOW:BEGIN foreach DSL:OWNER[OWNER,FROM,INTERFACE] AS [owner,from,interface]}}
## client {{CONTENT:FLOW:CNT}}
One client is the class or function {{FLOW:VAR owner}} declared in the
file {{FLOW:VAR interface}} and defined in the file {{FLOW:VAR from}}:

### {{FLOW:VAR interface}}
{{CONTENT:FLOW:VAR interface}}
### {{FLOW:VAR from}}
{{CONTENT:FLOW:VAR from}}
{{FLOW:END foreach}}

# First step

So let's get started. Let me describe the desired refactoring one file at a time.
Please respond to this and following prompts with an updated version of the
refered to file.

A template like this would add the main refactor context at SLM/LLM session initialisation.

The core refactor template

IMPLEMENTS:
  TEMPLATE class_proxy_split
  VARIANT CONTEXT
  MODEL deepseek
  MODE webchat
MATCH:
  LANGUAGE CPP
RVC_PROMPT_TEMPLATE INTERFACE
# {{PARENT:INTERFACE}}
Let's look at {{PARENT:INTERFACE}} first. I want you to create a declaration for
{{DSL::VAR newclass}} and add it to {{PARENT:INTERFACE}} before the declaration of
{{PARENT::REFACTOR}}. The members:
{{FLOW:BEGIN foreach DSL:VAR:takeout AS takeout}}
* {{FLOW:VAR takeout}}
{{FLOW:END foreach}}
Should move from {{PARENT:INTERFACE}} to {{DSL::VAR newclass}}.
{{PARENT::REFACTOR}} should get a new {{DSL::VAR newclass}} member that is initialised
from the {{PARENT::REFACTOR}} constructor initialisation list.
The original {{PARENT::REFACTOR}} should still keep all methods it already had,
proxying to the new {{DSL::VAR newclass}} member. If old {{PARENT::REFACTOR}} methods
access both remaining members of {{PARENT::REFACTOR}} and members now moved to
{{DSL::VAR newclass}}, then the functionality should be split appropriately between
the new {{DSL::VAR newclass}} version of the method, and the {{PARENT::REFACTOR}}
version of the method, so the {{DSL::VAR newclass}} declaration should define these
methods too.
Use the coding guidelines as provided.
Please give me a proposed new version of {{PARENT:INTERFACE}}, without any elaboration.
RVC_PROMPT_TEMPLATE OWNER INTERFACE
Thank you, now we look at the declarations in {{DSL:INTERFACE}} that refer to
{{PARENT::REFACTOR}}, please make sure the instantiation of {{DSL:OWNER}} is correct,
and output an optionally updated version of the header file.
If anything changes, please make sure to adhere to the coding guidelines as provided.
RVC_PROMPT_TEMPLATE OWNER FROM
Thank you, now we look at the definitions in {{DSL:FROM}} that refer to
{{PARENT::REFACTOR}}, please make sure the use of {{DSL:OWNER}} is correct,
and output an optionally updated version of the source file.
If anything changes, please make sure to adhere to the coding guidelines as provided.
RVC_PROMPT_TEMPLATE FROM
Thank you very much. Now, finally we are going to look at the implementation.
Let's look at {{PARENT:FROM}} and create an implementation of {{DSL::VAR newclass}} directly
before the implementation of {{PARENT::REFACTOR}}. Keep to the coding guidelines as provided
and update the relevant methods of {{PARENT::REFACTOR}} according to the specs as far provided.

This template would add all relevant prompting text to complete initialization, and to then start extracting new versions of the relevant files from the LLM/SLM.

A modify template

IMPLEMENTS:
  TEMPLATE class_proxy_split
  VARIANT move_nocopy
  MODEL deepseek
  MODE webchat
  MODIFICATION 1
MATCH:
  LANGUAGE CPP
RVC_PROMPT_TEMPLATE_MOD [INTERFACE]

The new C++ class {{PARENT:VAR newclass}} should **NOT** be copyable, but should
implement a move constructor. Define an exception class that should be thrown if a method
is called on an empty {{PARENT:VAR newclass}}, and implement a throwing guard in each method to assure this behaviour.

A template like this could fine-tune specific parts of the core refactor template.

Together these four types of templates would expand the initial DSL code to a series of prompts within a session.
While purely illustrative, I hope the combination of templates, compared to the DSL pseudo-code, shows just how much
specificity can be expressed. While the DSL is already quite verbose, it isn't close to as verbose as the very
specific prompts that it generates.
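For a feel of how lightweight the tooling behind such a DSL could start out, here is a purely illustrative Python sketch that parses indentation-nested statements like the DSL pseudo-code above into a tree. This is an assumption about one possible shape of the DSL, not a spec; a real RVC DSL would need a proper grammar.

```python
def parse_dsl(text: str) -> list:
    """Parse DSL lines into (keyword, argument, children) tuples by indent."""
    root = []
    stack = [(-1, root)]  # pairs of (indent level, children list)
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        line = raw.strip().rstrip(":")          # trailing ':' opens a block
        keyword, _, arg = line.partition(" ")
        node = (keyword, arg, [])
        while indent <= stack[-1][0]:           # close deeper/equal blocks
            stack.pop()
        stack[-1][1].append(node)
        stack.append((indent, node[2]))
    return root

example = """LANGUAGE CPP
REFACTOR Foo FROM src/foo.cpp INTERFACE inc/foo.hpp:
  OWNER Bar FROM src/bar.cpp INTERFACE inc/bar.hpp
"""
tree = parse_dsl(example)
```

The tree could then drive template selection and placeholder filling in the central DSL repo.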

A validation baseline

Every programming language is different in available tooling, so let us not get hung up on the exact set of validation baseline tools and provisioning. Some languages will need more, others will need less, but at the base, we need to define a validation baseline that helps us figure out one thing:

  • Did the AI break things?

Let's look at one language, Python, as an example. What tools might we run to see if the AI broke something? This is just an illustration of the kind of tools you may want in your validation baseline:

  • Linting, code complexity and coding conventions: pylint
  • Idiomatic coding style : pycodestyle
  • Checks for dead code : vulture
  • Basic code security checks : bandit
  • Property based testing : hypothesis
  • Basic unit tests : unittest

Before code is given to any AI (LLM or SLM), it should pass all these tools as a validation baseline. If the result of any AI action is to be presented to the user, it should first pass the validation baseline too. Consider the validation baseline the handover contract between the human and the AI.
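A hypothetical runner for that Python baseline could be as simple as the sketch below. The tool names match the list above, but the exact flags and the src/ and tests/ paths are assumptions; hypothesis-based property tests would run as part of the unittest step.

```python
import subprocess

# Illustrative validation baseline for a Python project; paths and flags
# are assumptions, not part of any RVC spec.
BASELINE = [
    ["pylint", "src/"],
    ["pycodestyle", "src/"],
    ["vulture", "src/"],
    ["bandit", "-r", "src/"],
    ["python", "-m", "unittest", "discover", "-s", "tests/"],
]

def passes_baseline(commands=BASELINE) -> bool:
    """The handover contract holds only if every tool exits cleanly."""
    for cmd in commands:
        if subprocess.run(cmd, capture_output=True).returncode != 0:
            return False
    return True
```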

Passing the baseline does not mean “correct”, only “good enough to hand back to a human”.

Note also what is missing: no integration tests. We treat the AI like we treat human developers. This is not CICD yet; this is all pre-CICD human/AI handover. And the validation baseline is part of a multi-try setup for the AI: we will give the AI multiple chances to pass the validation baseline before giving up on a branch, and adding long-running processes to such a pipeline would result in undesirable latency in the human/AI interaction.

Protocol-in-a-file

In the refactor workflow of Reverse Vibe-Coding, there are no IDE hooks or integrations for AI. We want full provenance, and we want to do away with history-deleting, rebase-heavy git workflows. Rebase in a git workflow is considered a workflow smell. The only hook that the refactor workflow has is the git push the user does.

Because a git push in itself contains very little information, we need to run a protocol on top of it. For RVC, we call this protocol RVPP (Reverse Vibe-coding Prompt Protocol), and it is implemented in an RVP (Reverse Vibe-Coding Prompt) file. One refactor task, one RVP file, and we number the files sequentially, starting with R1.rvp for refactor prompt files, where the 'R' stands for Refactor.

We define RVP files as append-only files. The file is divided up into assignments, and all assignments from inception until merge are referred to as the task.

An assignment always starts with a chunk of DSL code created by the user. If the assignment gets completed, a report gets added to the assignment and two newlines create an empty line, after which the user can add new assignment DSL code. In the next section we look at how an assignment is processed, broken into sub-assignments, and how topic branches mix into the protocol.
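The append-only discipline itself is trivial to sketch. The two helpers below assume nothing about RVPP syntax beyond what is described here: assignment DSL and reports are only ever appended, and a completed assignment is closed off with an empty line.

```python
# Illustrative append-only RVP file handling; the exact section layout
# is an assumption.
def append_assignment(rvp_path: str, dsl_code: str) -> None:
    """Start a new assignment by appending its DSL chunk."""
    with open(rvp_path, "a", encoding="utf-8") as rvp:
        rvp.write(dsl_code.rstrip() + "\n")

def append_report(rvp_path: str, report: str) -> None:
    """Complete the current assignment: report plus an empty separator line."""
    with open(rvp_path, "a", encoding="utf-8") as rvp:
        rvp.write(report.rstrip() + "\n\n")
```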

On push to trunk after RVP creation.

So what should happen on a git push? The new commits are examined, and if there is a commit to trunk that creates a new RVP file, the following process starts:

  1. Baseline validation
  2. Assignment to assignment prompt-set conversion
  3. Topic-branch creation
  4. A per-topic-branch assignment processor is started

While the implicit RVPP workflow contract states that the repo should meet the baseline, we start off by verifying the baseline. If the baseline isn't met, the hook processing is simply abandoned. The next step is conversion of the chunk of assignment DSL into all the variants of the task startup assignment prompt. Every variant gets its own, likely ephemeral, topic branch, and then for each variant topic-branch/prompt a stateful assignment processor is started. I'm getting a bit ahead of myself, but for clarity's sake: all but the chosen branch are ephemeral by design.

What happens next happens in parallel for each of the variants/topic-branches, and for each variant it happens 1 up to N times, where 8 is suggested as the default for N.

  1. The prompt is given to the SLM (or LLM)
  2. The result is reintegrated in the topic branch code.
  3. Baseline validation is run, if it fails and the try count is less than N, validation errors are fed back to the SLM and we continue once more at 1.
  4. On an Nth failure the topic branch is deleted
  5. On first baseline validation success, a report section is added to the RVP file.
  6. The changes are committed to the topic branch
  7. If trunk has seen commits since the start of the push trigger, trunk is merged into the topic branch and the validation baseline is validated once more.

Now once every variant topic branch has either been deleted or has been updated with the SLM output, the user is supposed to either merge one of the surviving topic branches, or add a new assignment to the RVP file in order to continue on the result from one specific topic branch.
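The per-variant retry loop above can be sketched in a few lines. Here `ask_model`, `apply_result`, and `run_baseline` are stand-ins (assumptions) for the real SLM call, the reintegration of its output into the topic branch, and the validation baseline; branch bookkeeping and the RVP report section are left out.

```python
# A sketch of the per-variant assignment processor (steps 1-7 above).
def process_variant(prompt, ask_model, apply_result, run_baseline, max_tries=8):
    """Give the SLM up to max_tries chances to pass the validation baseline."""
    feedback = ""
    for _ in range(max_tries):
        result = ask_model(prompt + feedback)  # step 1: prompt the SLM
        apply_result(result)                   # step 2: reintegrate the output
        errors = run_baseline()                # step 3: run the baseline
        if not errors:
            return True   # steps 5-7: report, commit, merge trunk if needed
        feedback = "\n" + errors               # feed validation errors back
    return False          # step 4: Nth failure, the topic branch is deleted
```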

On push to topic branch after RVP update

The hooks for user commits to a topic branch are slightly different but still mostly similar to those for trunk. The differences are the topic branch deletion steps (2 and 8).

  1. Baseline validation
  2. Deletion of non-chosen topic branches
  3. Assignment to assignment prompt-set conversion
  4. Topic-branch creation
  5. A per-topic-branch assignment processor is started

And for the per variant processor:

  1. The prompt is given to the SLM (or LLM)
  2. The result is reintegrated in the topic branch code.
  3. Baseline validation is run, if it fails and the try count is less than N, validation errors are fed back to the SLM and we continue once more at 1.
  4. On an Nth failure the topic branch is deleted
  5. On first baseline validation success, a report section is added to the RVP file.
  6. The changes are committed to the topic branch
  7. If trunk has seen commits since the start of the push trigger, trunk is merged into the topic branch and the validation baseline is validated once more.
  8. The parent topic branch is deleted.

As we see, it's basically all the same, except for the topic branch deletions.

Sorting surviving branches with McCabe

This part is still very much experimental, but in order to facilitate maximum efficiency in the AI-to-human hand-over when there are multiple surviving topic branches, we introduce a sorting stage.
By not making the starting of the parallel agents fire-and-forget, but asynchronously waiting for completion of all of them, we allow for a post-agent phase of the git push hook where we can do some work to accommodate the user in choosing a topic branch. Think of it as automated candidate triage. It doesn't need to be perfect, but it can save time.

At that stage we have the trunk and all of the just created surviving topic branches. What we do now is the following:

  1. We run all of the affected files for the trunk and each topic branch through a code complexity tool.
  2. We normalize the output
  3. We use the output to extract a single metric
  4. We create or overwrite an MD file in the airc.d directory with the same number as the RVP file
  5. We commit the MD file to trunk for the user to use

So let's walk through this. For the code complexity, we use an appropriate tool for the programming language at hand that determines the so-called cyclomatic complexity, a.k.a. the McCabe index, for each function or method in the affected files.

Here is an example with Python and some random Python code:

python -m mccabe fsst.py
TryExcept 23 3
If 28 2
32:0: 'runs_in_docker' 1
TryExcept 36 4
112:0: 'wait_for_flureedb_to_terminate' 7
135:4: 'Hooks.__init__' 10
165:4: 'Hooks.before' 2
170:4: 'Hooks.between' 2
175:4: 'Hooks.after' 2
182:4: 'FlushFile.__init__' 1
191:4: 'FlushFile.write' 1
208:4: 'FlushFile.writelines' 1
225:4: 'FlushFile.flush' 1
235:4: 'FlushFile.close' 1
245:4: 'FlushFile.fileno' 1
260:0: 'query_to_clojure' 2

In step 1, we get such output for each affected file in each new topic branch plus the trunk. The integer at the end of each line gives us the cyclomatic complexity of the method or function.

In step 2, we look if there are any methods or functions present in one of the branches but not in the others. If there are, we add that method or function as a fake entry in the result, with a McCabe index of 4. The idea behind this is that we will later use 4.5 as one of the complexity targets, so a value of 4 puts just a tiny bit of a penalty on a missing method or function: its deviation from the target will be matched or exceeded by the deviation of the actual method or function in the other branches. If a method or function has the exact same McCabe index in all branches, that method is removed from all outputs to keep the comparison lean. The result is a smallish set of McCabe indices per branch that we can now condense into a single metric. The normalization ensures all branches are compared over the same function set.
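A sketch of this normalization step, representing each branch (trunk included) as a dict from function name to McCabe index; the fake index of 4 for missing functions follows the rationale above. The data shape is an assumption for illustration.

```python
def normalize(branches: dict) -> dict:
    """Align all branches on one function set; drop functions with no change."""
    all_funcs = set().union(*(b.keys() for b in branches.values()))
    # Missing methods/functions get the fake McCabe index of 4.
    aligned = {
        name: {f: b.get(f, 4) for f in all_funcs}
        for name, b in branches.items()
    }
    # Drop functions whose index is identical across every branch.
    kept = [f for f in all_funcs if len({aligned[n][f] for n in aligned}) > 1]
    return {name: {f: aligned[name][f] for f in kept} for name in aligned}
```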

We now have a list of positive integers we can play with. In step 3 we convert this list, first into three numbers, that we then combine into just one.

  1. σ² : The population variance of the McCabe indices. This submetric rewards a consistent complexity distribution.
  2. e : The square of 1 plus the absolute value of 4.5 minus the population mean μ, i.e. (1 + |4.5 − μ|)². This submetric penalizes both functions of high complexity and excessive fragmentation.
  3. h : The sum of all McCabe numbers higher than 9, divided by the square root of the number of McCabe index numbers. This submetric penalizes very high complexity functions.

Now we take the square root of the sum of these three numbers for each of the surviving topic branches, and we subtract the square root of the sum of these three numbers for the trunk from that. This gives us a number r, the code quality regression number, that can be positive or negative: a negative number indicates the refactor has likely improved overall code quality by this metric, while a positive number indicates it might have degraded code quality.
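The three submetrics and the regression number r can be written down directly. This is a sketch of the experimental metric exactly as described (population variance σ², e = (1 + |4.5 − μ|)², h = the sum of indices above 9 divided by √n), not a final design.

```python
import math

def branch_score(indices: list) -> float:
    """Combine sigma^2, e, and h into one score for a branch's McCabe list."""
    n = len(indices)
    mean = sum(indices) / n
    var = sum((x - mean) ** 2 for x in indices) / n   # population variance
    e = (1 + abs(4.5 - mean)) ** 2                    # distance from target mean
    h = sum(x for x in indices if x > 9) / math.sqrt(n)
    return math.sqrt(var + e + h)

def regression_number(branch: list, trunk: list) -> float:
    """r < 0 suggests the refactor improved quality by this metric."""
    return branch_score(branch) - branch_score(trunk)
```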

In step 4 we sort the topic branches by this code quality regression number and write the results, complete with each branch's number indicating regression or improvement of code quality. This sorted list is written to an MD file with an 'r' prefix and the same number as the RVP file it relates to.

In step 5, this MD file is committed to trunk.

It is important to note that this part is currently experimental; better code quality metrics will probably be possible and could grow dynamically as we gain more experience with the workflow. The idea is that with the MD file, the user can look at the most promising candidates first and discard candidates with a very high quality regression number.

On merge to trunk.

When the chosen topic branch finally concludes the refactoring task, not much is needed anymore. The topic branches get deleted and the normal CICD processing of the merge commences. That part falls outside of the RVC workflow, so we leave it unspecified in this blog post.

Workflow latency streamlining

We don't want to keep the refactoring user waiting, but neither do we want to invite too many merge conflicts. To help with that, the user is expected to start new tasks (RVP files) that touch different parts of the code while waiting for an assignment within an active task. As a rule of thumb, two to four parallel tasks are suggested for an optimal workflow.

Summarizing

In this post we looked at the Reverse Vibe-Coding git workflow for refactoring. The workflow for scaffolding and boilerplate is slightly more involved because of baseline validation bootstrapping, but most of it is quite similar. I hope this outline shows how this part of the RVC workflow is a robust and highly productivity-efficient alternative to IDE-integrated AI, and how it lets us return to a provenance-preserving, merge-based workflow at the same time. I hope this post demonstrates that RVC is the right choice for AI-enhanced productivity from a full product lifecycle perspective, and that moving away from English as a prompting language, and from the IDE as the integration point for AI, are good choices that help move AI-assisted coding for production deployment forward.