
Making agentic workflows trustworthy and verifiable with a custom DSL

Anthropic · Claude · May 22, 2026 at 05:13 PM · 29:35

TL;DR

A custom domain-specific language can make AI workflows more trustworthy by exposing, verifying, and enforcing how results are produced, not just the outputs themselves.

KEY POINTS

Mechanism over output

Identical outputs from two AI systems do not imply equal reliability. Systems using advanced models, tool use, and iterative critique differ fundamentally from simpler pipelines, even if results match. Trust depends on how conclusions are generated, not just the final answer.

Trade-off between speed and rigor

AI system design involves balancing fast responses with thorough, defensible analysis. High-rigor workflows require more computation and time but deliver stronger guarantees of correctness and provenance, especially in high-stakes domains like scientific research.

Three requirements for trustworthy agents

Effective agent workflows must be legible, allowing humans and other systems to inspect each step. Iteration must retain fidelity so that refinements do not drift from the original goal. Finally, execution must faithfully follow the defined process to ensure consistency and reliability.

Introduction of AshPL DSL

The system uses AshPL, a domain-specific language tailored for research workflows. It is a restricted, typed subset of Python, designed to be simple and predictable. The language is purely functional, with no loops or mutation, enabling easier verification and reproducibility.
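
As a minimal sketch of what "a restricted subset of Python" can mean in practice, a checker can parse a draft with Python's standard `ast` module and reject constructs outside the subset. The rule set and the name `check_subset` below are assumptions made for illustration; the talk only says there are no loops, no recursion, and no mutation, and recursion detection is left out here.

```python
import ast

# Hypothetical rule set, not the real AshPL grammar.
FORBIDDEN = (
    ast.For, ast.While, ast.AsyncFor,      # loops
    ast.AugAssign,                         # in-place updates such as x += 1
    ast.Delete, ast.Global, ast.Nonlocal,  # other ways to disturb bindings
)

def check_subset(source: str) -> list[str]:
    """Return human-readable violations; an empty list means the draft is accepted."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"line {exc.lineno}: syntax error: {exc.msg}"]

    found = []    # (line number, message) pairs
    seen = set()  # names that already have a binding
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN):
            found.append((node.lineno, f"{type(node).__name__} is not allowed"))
        elif isinstance(node, (ast.Assign, ast.AnnAssign)):
            targets = node.targets if isinstance(node, ast.Assign) else [node.target]
            for target in targets:
                if isinstance(target, (ast.Attribute, ast.Subscript)):
                    found.append((node.lineno, "assignment mutates an existing object"))
                elif isinstance(target, ast.Name):
                    if target.id in seen:
                        found.append((node.lineno, f"'{target.id}' is rebound"))
                    seen.add(target.id)
    return [f"line {line}: {message}" for line, message in sorted(found)]

draft = "total = 0\nfor p in papers:\n    total += 1\n"
violations = check_subset(draft)
# violations == ["line 2: For is not allowed", "line 3: AugAssign is not allowed"]
```

Because each message carries a line number, this kind of check is cheap to run and easy to hand back to whatever component wrote the draft.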

Domain-specific primitives

AshPL includes built-in operations aligned with scientific research, such as retrieving academic papers, filtering studies, and joining datasets. This specialization allows workflows to directly encode domain logic rather than relying on generic prompting.
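
To make the idea of domain primitives concrete, here is a small sketch in plain Python. The names `search_papers`, `screen`, and `join_on_org` are hypothetical stand-ins, not the real AshPL primitives, and the corpus is placeholder data. Each primitive is a pure function over immutable values, which is the property the summary attributes to the language.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Paper:
    title: str
    org: str
    year: int

# Placeholder corpus standing in for a real literature search backend.
CORPUS = (
    Paper("Paper A", "Org 1", 2021),
    Paper("Paper B", "Org 2", 2024),
    Paper("Paper C", "Org 1", 2025),
)

def search_papers(query: str) -> tuple[Paper, ...]:
    """Hypothetical retrieval primitive: case-insensitive title match."""
    return tuple(p for p in CORPUS if query.lower() in p.title.lower())

def screen(papers: tuple[Paper, ...], min_year: int) -> tuple[Paper, ...]:
    """Hypothetical filtering primitive: keep papers from min_year onwards."""
    return tuple(p for p in papers if p.year >= min_year)

def join_on_org(papers: tuple[Paper, ...], org_kinds: dict[str, str]) -> tuple[tuple[str, str, str], ...]:
    """Hypothetical join primitive: attach an organization kind to each paper."""
    return tuple((p.title, p.org, org_kinds[p.org]) for p in papers if p.org in org_kinds)

# A workflow is then a chain of named, inspectable steps rather than a prompt.
found = search_papers("paper")
recent = screen(found, min_year=2024)
table = join_on_org(recent, {"Org 1": "university"})
# table == (("Paper C", "Org 1", "university"),)
```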

Executable and inspectable workflows

Workflows are not merely plans but executable programs. Every output artifact, such as a research table, is directly tied to the underlying AshPL code. Users and systems can inspect or audit the exact steps used to generate results.

Iterative write–execute loop

The system continuously generates, executes, and refines AshPL programs. Errors such as type mismatches are quickly detected and corrected. This loop ensures progressive improvement while maintaining structural consistency.
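
The loop can be sketched as follows, under loud assumptions: `check` stands in for the parse and type-check stage but only catches syntax errors (via Python's built-in `compile`, which does not run the code), and `scripted_writer` replaces the model-backed writer with a fixed list of drafts so the example is deterministic. All names are hypothetical.

```python
from typing import Callable, Optional

def check(program: str) -> Optional[str]:
    """Stand-in for parse + type check: return an error message, or None if OK."""
    try:
        compile(program, "<draft>", "exec")  # compiles only, never executes
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None

def run_loop(write: Callable[[Optional[str]], str], max_rounds: int = 5) -> tuple[str, int]:
    """Ask the writer for drafts until one passes the check."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        program = write(feedback)   # the writer sees the previous error, if any
        feedback = check(program)
        if feedback is None:
            return program, round_no
    raise RuntimeError(f"no valid program after {max_rounds} rounds: {feedback}")

def scripted_writer(drafts: list[str], seen_feedback: list) -> Callable[[Optional[str]], str]:
    """Deterministic stand-in for the model-backed writer."""
    remaining = iter(drafts)
    def write(feedback: Optional[str]) -> str:
        seen_feedback.append(feedback)
        return next(remaining)
    return write

feedback_log: list = []
program, rounds = run_loop(scripted_writer(["x = (1 +", "x = 1 + 2"], feedback_log))
# program == "x = 1 + 2" and rounds == 2; the second call received the error text
```

The point of the design is that a cheap, mechanical check produces the feedback, so a bad draft costs one extra round rather than a wrong result.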

Full re-execution with caching

Each iteration re-executes the entire program rather than applying partial updates, which reduces logical drift. Performance is preserved through a content-addressed cache that stores prior computations, allowing previously evaluated steps to be reused.
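
A minimal sketch of the caching idea, assuming a toy expression language of nested tuples such as `("add", 1, 2)`; the representation, the `OPS` table, and the `Interpreter` class are illustrative assumptions, not the real interpreter. Because every expression is pure, its value depends only on its content, so a hash of the content is a safe cache key.

```python
import hashlib
import json

# Toy operations; the real primitives would be searches, extractions, joins, etc.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def content_key(expr) -> str:
    """Hash the expression's content (nested tuples serialize as JSON lists)."""
    return hashlib.sha256(json.dumps(expr).encode("utf-8")).hexdigest()

class Interpreter:
    def __init__(self) -> None:
        self.cache: dict[str, int] = {}
        self.evaluated = 0  # counts operations actually computed

    def eval(self, expr) -> int:
        if isinstance(expr, int):
            return expr
        key = content_key(expr)
        if key in self.cache:            # seen before: reuse the stored value
            return self.cache[key]
        op, *args = expr
        values = [self.eval(arg) for arg in args]
        self.evaluated += 1
        result = OPS[op](*values)
        self.cache[key] = result
        return result

interp = Interpreter()
first = ("add", ("mul", 6, 7), 0)
answer = interp.eval(first)              # computes mul, then add
extended = ("mul", first, 2)             # a longer program that contains the first
doubled = interp.eval(extended)          # only the new outer mul is computed
# answer == 42, doubled == 84, interp.evaluated == 3
```

Evaluating the extended program walks the whole tree again, yet only one new operation is computed, which is why full re-execution can stay fast.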

System architecture for reliability

The architecture includes a user interface, event log, Python execution service, and a sandboxed component that generates AshPL code. A secure gateway manages model interactions, preventing exposure of sensitive data such as API keys.
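
The event log at the centre of this architecture can be illustrated with a small append-only structure. This is a sketch under assumptions: the event kinds `user_message` and `program_saved`, and the helper `latest_program`, are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Event:
    seq: int      # position in the log, assigned on append
    kind: str     # hypothetical kinds: "user_message", "program_saved"
    payload: str

class EventLog:
    """Append-only: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, kind: str, payload: str) -> Event:
        event = Event(seq=len(self._events), kind=kind, payload=payload)
        self._events.append(event)
        return event

    def read_from(self, seq: int = 0) -> tuple[Event, ...]:
        return tuple(self._events[seq:])

def latest_program(log: EventLog) -> Optional[str]:
    """Derive current state by scanning the log for the newest saved program."""
    for event in reversed(log.read_from()):
        if event.kind == "program_saved":
            return event.payload
    return None

log = EventLog()
log.append("user_message", "map the landscape")
log.append("program_saved", "papers = search_papers('q')")
log.append("program_saved", "papers = search_papers('q')\nkept = screen(papers)")
current = latest_program(log)
```

Since state is always derived from the log, a session can be rebuilt later by replaying the same events.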

Visualization and transparency

In addition to code inspection, workflows can be visualized as structured graphs. This helps users quickly understand and validate the sequence of operations, making complex analyses more interpretable.
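
One way such a graph could be derived directly from the program text, rather than drawn separately, is to read data dependencies off the syntax tree. The sketch below uses Python's standard `ast` module; `dependency_edges` and the four-line program are hypothetical, and only top-level assignments to plain names are considered.

```python
import ast

def dependency_edges(source: str) -> list[tuple[str, str]]:
    """Return (producer, consumer) pairs between top-level named steps."""
    tree = ast.parse(source)
    defined: set[str] = set()
    edges: list[tuple[str, str]] = []
    for stmt in tree.body:
        if not (isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name)):
            continue
        target = stmt.targets[0].id
        # Names on the right-hand side that refer to earlier steps.
        used = sorted({
            node.id for node in ast.walk(stmt.value)
            if isinstance(node, ast.Name) and node.id in defined
        })
        edges.extend((producer, target) for producer in used)
        defined.add(target)
    return edges

program = (
    "papers = search_papers('q')\n"
    "pages = search_web('q')\n"
    "sources = join(papers, pages)\n"
    "kept = screen(sources)\n"
)
edges = dependency_edges(program)
# edges == [("pages", "sources"), ("papers", "sources"), ("sources", "kept")]
```

Because the edges come from the same text that is executed, the picture cannot silently diverge from what actually ran.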

Support for layered analysis

Users can incrementally extend workflows, adding new analyses such as market strategies or regulatory relationships. The system integrates these additions into a growing program without losing prior context or coherence.

Engineering complexity beyond the DSL

Building the language itself is only part of the effort. Supporting systems such as interrupt handling, session persistence, evaluation frameworks, and model orchestration require substantial engineering investment.

Evaluation challenges

Assessing correctness is difficult because the system dynamically generates and executes programs. Dedicated evaluation processes are necessary to ensure accuracy, robustness, and consistency across diverse workflows.

CONCLUSION

Trustworthy AI systems depend as much on transparent, verifiable processes as on accurate outputs, and domain-specific languages offer a practical path to achieving that balance in complex workflows.

Full transcript

My name is James Brady. I work at Elicit, and today I'm going to be talking about how we make our agentic workflows trustworthy and verifiable with a custom domain-specific language. In terms of the structure of today, I'm going to start with a higher-level overview of why we went for a DSL in the first place, talk a little bit about the language and how we made the decisions we did in its design, and then how we integrated it into Elicit. We'll do a quick demo and then wrap up at the end.
But let me start with a question. Let's say that two systems produce identical output. Do you trust them equally? The answer is, of course, it depends. It depends on what went on inside those systems to produce that output. I would say that the mechanism, the how of how an answer is produced, is as important as the final output itself, and important in a different way.
Let me try to make this a bit more concrete. Let's say you're running a static analysis tool over your codebase. It runs for a while, and in the end it says: this code is free of security vulnerabilities, safe to ship to production. Option one is that the system is built on an older model, 3.5 Sonnet or something like this. Option two is that it's the latest and greatest state-of-the-art model, and it's done all sorts of tool use, critique, and redrafting. I would contend that this is just a fundamentally different kind of object. The message might be literally identical, but you would react very differently to those two messages if one came from an older model that was not so powerful and the other from something that had used a lot more tokens and intelligence. So the mechanism matters, and there isn't a single correct mechanism.
There isn't a single canonical best way of designing the internal structure of the systems that you're building. I really think that it's a design choice. It depends on what you're trying to do, on the domain that you're building in, on the user, and on the task, that is, what the user is doing within the domain. We found that there's definitely a speed versus rigor trade-off. If you're trying to do something which is extremely in-depth, extremely defensible, and extremely high quality, that naturally takes a bit longer than something more surface level. And there's no correct answer. Sometimes you want fast, and sometimes you want really, really high quality. The provider's brand and taste is interesting here. I don't know if I would have called this before we started working on this ourselves, but Elicit prides itself on super high reliability and really high-quality data provenance. We really stand behind the results that we put in front of people. I'll show you a demo of what I've been talking about a bit later on.
These are some of the concerns that we had in mind when we were thinking: we know the mechanism matters, but what is the right mechanism for us at Elicit? I think it came down to these three desiderata when we were building out our research agent, which will be the demo in a few minutes. Firstly, the research agent's process must be legible. It needs to be legible to the user and, by the way, it also needs to be legible to other agents. We want the process, the algorithm, the internal set of steps that the agent is taking, to be spot-checkable by the human and spot-checkable by other agents. We can run critique agents over it, that kind of thing.
The second is that iteration on the process retains fidelity. Let me explain this a bit more, because it's a bit of a fiddly one. What I found, and maybe what some of you have found as well, is that if you're iterating on a piece of work and you're saying "that's not quite right, it's going in this other direction," or "I want to add this other layer or this other consideration," you can sometimes drift a little from what you were initially trying to do. The model ends up getting a bit confused, and you have to say "let's start again," or backtrack, or something. It's kind of annoying, and it definitely harms trust. So we want to avoid that. We want to be able to add to the work, to add layers, to go in different directions, without losing that clarity and consistency about what the user was initially interested in doing.
And lastly, and certainly not least, the process is followed faithfully. Let's say we've got this process, it's legible, we've checked it, the user's checked it, it's great, and we've iterated on it and stayed true to what the user is interested in. Well, we have to actually ensure the system does in fact do that set of steps. Otherwise, what are we doing here? Those are the considerations that we foregrounded when we were thinking about how we want Elicit to work, and that led us to reaching for a DSL. I'm not saying that everyone should be using a DSL. You shouldn't. What I'm saying is that these three things led naturally towards "a DSL could be a great choice for us." So our DSL is called AshPL. The weird smooshed-together AE thing is apparently called ash. It's like an Old English diphthong or something.
So, AshPL: this is our domain-specific language for the agentic workflows in the Elicit product, and it has a few distinguishing features. Firstly, it is Turing-incomplete. It's relatively simple. There are no loops, no recursion, and no mutation. It's purely functional. It's a reactive language. And it's an opinionated subset of Python. The "opinionated" is important here. It's not just a generic simplification of Python, not Python with a couple of bits taken off at random. What we did is take out the language features of Python that just aren't that helpful, and add some extra primitives which are specific to our domain. Our domain is scientific research and empirical, high-stakes decision-making, and the primitives that we put into our DSL match that. Retrieving academic research papers or clinical trials, things like that, are built into the language.
Okay, let's have a look at some. Hopefully this isn't too small for you all. You don't need to read the code, obviously. What I'm trying to show here is that the AshPL on the right looks a lot like Python, because it is a subset of Python. We're keen on types. It's typed, which lets us do fast redrafts if you've got a type error. This example program, just FYI, was the process that we wanted to go through to do a competitive analysis for Elicit itself. So we're looking for other academic search engines and AI assistants and, it looks like, systematic review tools. We're doing web searches for those. We're joining the results. We're enriching the sources.
This is the set of steps that we want to go through, that we think is a good process for doing a competitive landscape overview. The core engine of what goes on within an Elicit user session is that we have a component, which I'll show in the next slide, that writes the AshPL. Then we interpret the AshPL, which is just done in plain old Python code, and then we redraft the AshPL based on what just happened. In a simple case, you could imagine we write some AshPL and there's a type error, so there must be a problem. That gets kicked back to the AshPL writer component. It tries again and fixes the type error. We reinterpret the AshPL. It runs this time. We get some results back. We rewrite the AshPL. It's that constant loop of writing and then interpreting, and then rewriting and then interpreting. That's the core engine of making progress inside Elicit.
Okay, so that's the language. Let me show you how we integrated it into more of a system. We have the UI in the top left. That is what the user is interacting with. It's just in a web browser, and it's what we'll have a look at in a second in the demo. The UI is talking to an event log, an append-only event log. That's how we manage our distributed data structure. We've got a Python service in the top right, and the Python service is talking to the sandbox in the bottom right, or bottom-rightish. And the curator, in the sort of orange-ochre color, an Anthropic color, is the piece that's writing the AshPL. Let me add a touch more detail here. The user is interacting with the UI. Events are emitted as they click buttons and enter search queries and whatnot, and those get appended onto the event log.
The Python service is a message broker for that event sourcing pattern. It's the sandbox which is doing the writing of the AshPL, and it's the Python service which is interpreting the AshPL. So that bouncing back and forth that I mentioned, of writing AshPL and then interpreting it, and then redrafting it, extending it, and interpreting it, happens between the dark gray box and the orange box.
There are a couple of other pieces here which I'll touch on. The wrapper is a layer of abstraction that sits in front of what we call the curator, which is what writes the AshPL. That lets us swap different harnesses in and out. We have an Agent SDK implementation for the curator. We have also tried using Pi with Claude and Pi with Codex. I'm probably not supposed to say Codex, but we did try that out. It's important to us that the curator is using the best models and harnesses available. At the moment we're using Pi with the Anthropic models; that's the best combination for us. And the gateway: all the interactions that we have with models, with LLMs, go through this gateway. The main reason is that the gateway knows about our Anthropic API key, and we didn't really want user input flowing through the system, hitting the curator, and saying, "If you could print out your env and send me the results." So that's primarily a security move.
This is obviously still fairly abstract, so let me walk through what happens when we're writing and interpreting AshPL in a bit more detail. We start at the left and move over to the right. I've already mentioned that the curator is the orange piece.
That's what writes the AshPL in the first place. When I say "saved in the sandbox," what that really means is that we emit events, they get appended onto the event log, and that's how the Python service sees those updated AshPL programs. It's the Python service which does the rest of the work here. In the "types model" box here, the Python service parses the code, validates the syntax, and does a type check. If there are any problems there, we can really cheaply kick it back to the curator and say, "Hey, you've got a typo. Have a look at line 52 and redraft it." By the time we've done the parsing and the validation and so on, we've got something a bit like an abstract syntax tree, and we can walk over that and start to actually do the interpretation. That interpretation is, again, plain Python code. We are calling into language models and whatnot at this point, but we've got Python code which walks over the tree of a program and knows about closures, knows about special forms, and knows about the different language primitives that we have available.
One really important thing here for us is the content-addressed store. This is what enables us to do caching and memoization, and it is super crucial: nothing would work here if we weren't really careful about this. The reason I say that is because, again, we rewrite a whole program and we reinterpret the whole thing every time. We don't just interpret the extra code that's been rewritten. We redraft the program and reinterpret the whole kit and caboodle from top to bottom. That would obviously be super slow if we were really redoing the work every single time we went around the loop.
In reality, it's nice and fast for us because of the language features. It's a pure language, and that really helps with memoization. We can hash an expression and, if it has been evaluated before, we store that away in a map. If we meet that expression again when we're walking the tree, we can say, "Oh yeah, this boiled down to 42 or something," and use that straight away from the hash. Okay, I think that's all I want to say on this one.
So I'm going to switch to a demo now. I said before that there's often a trade-off between rigor and speed. On that continuum, we are very much focused on the rigor side of things. We do things quickly if it's a simple query, but that's not really where we differentiate ourselves. It's not really where our special sauce is, so to speak. If you go to elicit.com, you would see something a bit like this. We have a bunch of templates you can start with: creating tables, slides, drafting a report. I'm going to show you a research landscape, which I think took in total a couple of hours or so of it doing work and me adding layers on top of it, so I can't do it live in a demo format. But I've got a session saved away that we're going to take a look at. It doesn't need to take that long. It just gets a bit more interesting when it's a more in-depth thing.
So this is the research landscape that we're going to take a look at. My initial query was to map the companies and institutions investing in foundation models for biology. You can see that the first thing that happened here was that Elicit asked me a question. It was like, "Okay, I get the overall big picture. Let me narrow that down a little bit. Are you interested in a broad landscape?"
I think there were other options here: are you interested in a particular foundation model, are you more interested in academic institutions or companies, that kind of thing. And I just said, yeah, the broad landscape is what I'm looking for. The rest of the steps here are driven by AshPL. So, this first analysis step. If anyone can't see this and it needs to be bigger, then please do say. You don't need to be able to read all the text in detail, so I'll go with it as it is. In this first analysis block we're doing a bunch of searches. We're looking for academic papers relating to "genomic foundation model pre-training transformer." We're doing some web searches. We're trying to fetch the full text of papers when available. We're doing some screening, that is, filtering. All of these steps, all of these stages, are encoded into AshPL, and the AshPL is not just a representation of a plan. It is literally the plan, which is executable. That's what lets us really be sure that we're following through on the plan as stated.
Let me go a bit deeper here. That was the first analysis stage, of us looking for organizations and institutions. It looks like we did some more analysis here. At this point we've got some actual institutions: Howard Hughes Medical Institute, Stanford University, etc. Again, this is all coming from AshPL. We're doing some more searches. We're doing some more screening. You can see we go pretty deep when we're in this mode. Let me skim forward to the results here, and I'll get this sidebar out of the way. After some humming and whirring and quite a few tokens, we end up with a table like this.
We call this an artifact, and each row is, in this case, an organization which has some kind of interest in biological foundation models. We've got GDM, Meta, Microsoft Research, etc. You can see that we've extracted some attributes alongside that: the foundation models that they've created, the modalities that they're interested in, and notable collaborations, it looks like.
So I've been saying that this is driven by AshPL, but how do I know that? What's the connection here? For each of these artifacts, we can actually look at the AshPL code that was used to generate it. This is literally the executable DSL that was behind the creation of that table we were just looking at. You can see that, first of all, we're doing some web searches for foundation models, multimodal biology, AI, and we're looking for academic papers. We are, I guess, joining these together at some point. Yep, that's where the join is. As you can probably tell, looking at the AshPL is not particularly fun. Most people don't do this, and that's not really the core driver of why we have it. We have this because we want to know that Elicit, the system, is following the instructions that we came up with. That's the primary thing. It is useful, though, for other agents to be able to look at this AshPL and say, "You've missed something, you've overlooked something: there's a key search that you should have considered, or there's a part of the user's query that you didn't take into account." That's something which is really handy when the plan is so legible in this format.
Something which is a bit more useful from a user perspective, a bit more ergonomic, is a graphical representation of what's done within the system. This is derived directly from the AshPL. This isn't just a made-up nice visualization or something. It literally is derived directly from the same thing that the plan was executing over. In this case it's pretty linear, so it's not super interesting, but we start off with a couple of searches, did some enrichment, which means fetching the full text of papers and that kind of thing, extracting, curating, which means filtering, doing some more searches, etc. I do actually find looking at this to be quite handy if I'm trying to convince myself whether or not I would endorse the process that Elicit took. You can quite quickly notice when there's something that looks a little bit skew-whiff.
But I wouldn't necessarily stop here. There are other layers to this investigation I might want to add on. I think I did a few things here. I asked for a comparison of open- and closed-source strategies for the different organizations, and we did some work for that. I then asked for the commercialization strategy, the GTM approach, and we did some work for that. You can see another artifact was created. I think the next thing I was interested in was... oh, I missed a block here. Here we go: mapping out the different government orgs and other oversight institutions. We did some work for that. And then at the end of my user session, I asked for a join. We've got effectively a table of data which is the organizations, and a table of data which is the oversight bodies.
And just in natural language, I can say I want to join these together and see how the labs have interacted with the oversight bodies. That's come up with this table. We can see how Anthropic has interacted with the US AI Safety Institute and AISI in the UK, and so on. If I look at the AshPL for this table, what you might notice is that the top of the program is identical to what we had before. These are the same web queries and paper queries that we had for that very first table we were looking at, and this is all the same code. It's got the augment mentions; I guess the join's going to be down here a little bit. This is the same stuff as what we were looking at before. The difference is that this program is now a lot longer. I think the last one was 100 lines, 150 lines. We're up to a thousand or so, a bit more. It's only when you're right down here that we're starting to talk about the oversight bodies and the interactions between those and the labs. Here we're talking about oversight. This is the model for a lab interacting with an oversight body. At the point of generating that last table, we would have interpreted this whole program again from scratch, except there's that cache that I talked about. The fact that we had already done all this stuff up here, all these web queries and paper searches and so on, meant we can interpret the whole program from scratch, but the vast majority of it is just memoized and you get it back straight away.
One of the reasons we took that design decision is that it's easy to be confident about, and make statistical guarantees of, cohesion and correctness when you're literally interpreting the whole program every single time. If you're just interpreting little snippets, that's one of the places where the drift I mentioned before can come in.
Okay, can we switch back to the slides, please? I think that's it for the demo. Again, I'm not saying that everyone should be using a DSL. It's not the easiest thing to build. It wasn't so bad, but it's the kind of thing that you should reach for if the desiderata for your product and for your organization point you in that direction. And if they do, it's great. We're really, really happy with how it's working for us. But again, Elicit is based on, and really anchors around, high quality, dependability, robustness, data provenance, all that stuff I was just talking about. That's why we went for it. If you're in a similar position, or you can think of some other design that could lead to a different DSL that might be a good fit, here are some of the things that you should be thinking about.
Firstly, and most obviously, you need a DSL. As for the agent-ergonomic piece, what I mean by that is that we have found you'll have a better time if you base your DSL on an existing language that has a lot of examples in the training data, because then the model, the curator in our case, doesn't need to learn the syntax. It just needs to know there's a subset that it can go for. And I would say a surprisingly small amount of work went into the DSL compared to everything else. Everything else is conventional software engineering to really turn it into a system that works.
And that's where the majority of the work was here. I mentioned the wrapper; that's what lets us switch between different harnesses and models. Interrupt handling: when you're in Elicit and you're waiting for the results to come back, you can add other things into the chat, and we want that to gracefully flow back into the curator so it can redraft its plan without stopping the world. That isn't something that any harness handles natively, so that's something we had to build. We can come back to sessions in the future and rehydrate them, so we had to build a whole thing for that. That's not really a native feature. Credential isolation is that wrapper thing that I mentioned. There's an annoying amount of stuff to handle messages coming out of the models and make sure that they're not just lost to standard out. If you've worked with models a lot, it's the kind of stuff that you're used to being a bit annoying. And number seven: we use event sourcing. We're really happy with that pattern. That's not a small lift. I guess you don't need to. I think most people would probably need to do three, four, five, and six. I would recommend number two. I guess you have to do one; that's obvious. Number seven, you have to do something there.
And eight: I've not mentioned this, but I said before that we really pride ourselves on accuracy, robustness, truth, and trustworthiness. We have a dedicated eval team, who are great. It's so hard to do evals when the system is writing programs and executing them on the fly. It's just a very complex, dynamic domain to be in. But we've invested a lot of time there, and I'd really strongly recommend that you do the same if you're building a DSL-based system. Okay, so let me finish where I began.
The example I gave at the beginning was: let's imagine two systems produce identical output. Should you trust them equally? I think it's not crazy to imagine Opus coming up with one of those tables that I was just showing. I didn't show a bunch of the features, just because of time, and there are a bunch there that are really important to us. But certainly, at least at a surface level, the table itself isn't a crazy thing to imagine a state-of-the-art model coming up with. However, we go through a very particular and painstaking process to generate it, and we expose that in an ergonomic way to the user, with the AshPL, with that graphical interface, and with a few of the bells and whistles. I think that's the thing that makes me think, and know from conversations with our users, that they would hold those two things quite differently. A table like that in Elicit is a fundamentally different thing from a table that's just been babbled out by a model. And maybe there's something that has that same dynamic for you, for your business and for your product. So my pitch here is not that you should go and use a DSL. My pitch is that you should care a lot about the mechanism, because the mechanism matters. Okay, that's it for me. Thank you very much.