March 25, 2026 • By Dr. Sadok Derouich
How AI unlocked what clinical judgment alone could not build

Voice dictation with AI formatting became the primary input method for most doctoLys users — a feature born from a single phone call.
Doctors think in narratives.
Software requires structure.
For decades, this mismatch has forced clinicians to adapt to rigid systems — filling in fields that do not reflect how they think, entering data in a format designed for computers rather than consultation rooms.
By late 2023, doctoLys worked. Doctors recognized their daily practice in it. The interface was organized around consultations, not database tables — a design philosophy born from building as a clinician, as we explored in how a doctor-led development team builds better clinical architecture. The infrastructure ran without IT support. The onboarding took minutes.
And I still had a list of requests I could not fulfill.
Not because the features were technically complex. Because they required solving a contradiction I had been staring at for two years — a paradox built into the nature of medical records themselves. Then generative AI APIs became accessible. Within months, three features shipped that I had believed were impossible. One of them accidentally became the foundation of the entire mobile app.
What you will learn in this article:
- The core paradox of medical data that traditional software cannot resolve
- Why the same doctors who rejected structured input were asking for structured output
- How a single phone call led to a feature that changed how most users interact with doctoLys
- The three AI-powered features that resolved the paradox — and the clinical decisions behind each one
- The two non-negotiable constraints every AI medical tool must satisfy
Medical Insight
As a product founder, I had a strong conviction about structured data. I knew that a doctor who enters free text today is a doctor who cannot query their own patient database in five years. The gestational diabetes prevalence in their panel. The real-time view of ongoing pregnancies and expected delivery dates to plan leave. All of it requires structured input. I could not abandon that conviction. But I also could not ignore what I was seeing in the usage data.
The paradox I could not solve
doctoLys was built on forms. Examination forms, biology forms, serology forms, ultrasound report templates — each one designed to capture clinical data in a structured, queryable format. We added autofill buttons to reduce typing. We simplified layouts. We ran iteration after iteration on the input experience.
The usage data told a clear story: most users were bypassing the forms entirely and writing in the free-text field.
I understood why. Doctors who had spent decades with paper records did not think in fields and dropdowns. They thought in prose. A paper consultation note is a personal document — underlined, annotated, structured in a way that reflects how that specific doctor thinks. The forms felt like a foreign language.
But I also knew what abandoning structured input would cost. Every piece of data entered as free text is data that cannot be aggregated, queried, or acted upon. The doctor who wants to know the prevalence of gestational diabetes in their patient panel needs structured data. The dashboard showing ongoing pregnancies and expected delivery dates needs structured data. The clinical alert system needs structured data. Free text is medically rich and computationally inert.
For two years, I had no resolution. The doctors' preference and the product's long-term value pointed in opposite directions, and there was no middle ground — until there was.
The twist — they wanted structured output all along
While monitoring usage, I noticed something that sharpened the paradox considerably.
The same doctors who refused to fill in structured forms were asking, consistently and urgently, for outputs that only a structured database can generate:
- The complete list of investigations performed across an entire pregnancy follow-up, independent of which consultation they were ordered in
- Automatic referral letters summarising the patient's clinical history in a coherent narrative
- Alerts when a clinical rule was missed — a toxoplasmosis serology not prescribed for a non-immune patient, Aspirin not stopped at 34 weeks of amenorrhoea, an evidence-based update relevant to a specific patient's situation
These are not simple features. Every one of them requires that the underlying data be structured, consistent, and queryable. The doctors were asking for the benefits of a structured database while refusing to feed it.
This was not a contradiction in their behaviour. It was a reasonable expectation — one that traditional software architecture simply could not satisfy. The input method and the output value were structurally coupled, and there was no way to decouple them using conventional development tools.
Generative AI decoupled them.
The phone call — and the feature that changed everything
In 2023, a colleague called me. He was a gynaecologist practising in a group office with an older partner who was not comfortable with keyboards. He wanted to migrate from paper to digital records, but the keyboard barrier made it impossible for his colleague. Was there a transcription solution?
I spent a few hours researching. What I found was not just a transcription tool — it was a complete answer to a problem I had not fully formulated yet.
Large language models handle transcription with a capability that purpose-built transcription tools cannot match: they are natively multilingual, they recognise medical terminology when given the right clinical context in the system prompt, and — critically — they can format the transcribed output against a predefined template. Not just transcribe and dump. Transcribe, interpret, and structure.
I built a prompt that mimicked the structure of a real gynaecology consultation: patient history, reason for visit, physical examination, investigations, prescriptions, follow-up plan. The doctor speaks. The LLM transcribes the audio, recognises the clinical content, and formats the output into the corresponding sections of the consultation template automatically.
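To make the template concrete, here is a sketch of how such a system prompt could be assembled. The section names mirror the consultation structure described above; the wording and the `build_transcription_system_prompt` helper are illustrative assumptions, not the production doctoLys prompt.

```python
# Sketch of a template-driven system prompt for transcription + formatting.
# Section names follow the consultation structure in the article; the
# prompt wording is illustrative, not the production doctoLys prompt.

CONSULTATION_SECTIONS = [
    "Patient history",
    "Reason for visit",
    "Physical examination",
    "Investigations",
    "Prescriptions",
    "Follow-up plan",
]

def build_transcription_system_prompt(specialty: str = "gynaecology") -> str:
    """Instruct the model to transcribe dictated audio and slot the content
    into predefined consultation sections, leaving absent sections empty."""
    sections = "\n".join(f"- {s}" for s in CONSULTATION_SECTIONS)
    return (
        f"You are a medical scribe for a {specialty} consultation.\n"
        "Transcribe the dictated note, resolve clinical abbreviations "
        "(e.g. '34 SA' = 34 weeks of amenorrhoea), and place each statement "
        "under exactly one of these sections:\n"
        f"{sections}\n"
        "Leave a section empty if the doctor did not mention it. "
        "Never invent clinical content that was not dictated."
    )

prompt = build_transcription_system_prompt()
```

The clinical knowledge lives entirely in the prompt text: which sections exist, which abbreviations to resolve, and the instruction never to invent content.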
The feature shipped in a few days.
The feedback was unexpected. This was not just a solution for doctors uncomfortable with keyboards. Within weeks, voice dictation with AI formatting had become the primary input method for the majority of doctoLys users — including those who had been using the app for months with typed input. Doctors who had never complained about typing switched voluntarily.
The reason became obvious in retrospect. Voice is how doctors have always recorded their clinical thinking. The paper note was always a transcription of what the doctor observed and decided in the moment. Removing the keyboard did not change the workflow — it restored it.
This feature later became the foundation of the doctoLys mobile app. As we explored in why mobile-first design is critical for clinical adoption, the smartphone is the natural device for a workflow that starts with voice. The AI transcription feature made that architecture not just possible but inevitable.
Medical Insight
When I built the transcription prompt, I spent considerable time on the medical terminology layer. General LLMs trained on broad corpora know what "gestational diabetes" means, but they do not know that "34 SA" means 34 weeks of amenorrhoea, or that abbreviations specific to the Tunisian clinical context carry precise local meanings. The system prompt carries that clinical context. Without a clinician writing it, the output is transcription. With it, the output is a structured consultation note.
From transcription to database — extracting investigation results automatically
The transcription breakthrough clarified something about what generative AI could actually do for a medical record system. It was not just a better input method. It was a translation layer between unstructured clinical language and structured database entries.
That insight led directly to the second feature.
Doctors receive investigation results — laboratory reports, imaging reports, specialist letters — directly in their email inbox, as PDF attachments or image files. The traditional workflow requires manually reading each document and typing the relevant values into the corresponding fields in the app. For a busy practice, this is a real time cost, repeated dozens of times per week.
The AI-powered alternative: the document arrives, the LLM reads it, extracts the clinically relevant values, structures them into a predefined JSON schema aligned with the app's data model, and loads them directly into the patient record. The doctor reviews and confirms. The document becomes a database entry in seconds.
From a product development perspective, this feature required solving a problem that has nothing to do with transcription: the LLM must reliably output structured JSON, not prose. That requires careful prompt engineering — defining the exact schema, specifying how to handle missing fields, controlling for hallucination on numerical values. Every one of those engineering decisions required a clinician to define what the correct output actually looks like. A developer building this without clinical input would produce a technically functional extraction that misses half the edge cases a doctor would catch in thirty seconds.
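To illustrate the guardrail side, here is a sketch of schema validation applied to LLM-extracted values before they are proposed to the doctor. The field names and plausibility ranges are hypothetical examples, not the doctoLys data model.

```python
# Sketch: validate LLM-extracted lab values against a predefined schema
# before they are proposed to the doctor. Field names and plausibility
# ranges are hypothetical; a real data model is far richer.

EXTRACTION_SCHEMA = {
    # field: (type, plausible_min, plausible_max) — out-of-range values
    # are treated as likely hallucinations and flagged for manual entry.
    "haemoglobin_g_dl": (float, 3.0, 25.0),
    "fasting_glucose_mmol_l": (float, 1.0, 30.0),
    "platelets_g_l": (float, 10.0, 1500.0),
}

def validate_extraction(raw: dict) -> tuple[dict, list[str]]:
    """Return (accepted values, issues). Missing fields are reported,
    never silently invented; implausible values are rejected."""
    accepted, issues = {}, []
    for field, (ftype, lo, hi) in EXTRACTION_SCHEMA.items():
        if field not in raw or raw[field] is None:
            issues.append(f"missing: {field}")
            continue
        try:
            value = ftype(raw[field])
        except (TypeError, ValueError):
            issues.append(f"unparseable: {field}")
            continue
        if not (lo <= value <= hi):
            issues.append(f"implausible (possible hallucination): {field}={value}")
            continue
        accepted[field] = value
    return accepted, issues
```

The clinical decisions sit in the schema itself: which fields exist, what counts as a plausible value, and the rule that a missing result is reported rather than invented.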
The consultation summary — and the clinical rules engine
The third feature is the most structurally ambitious, and the one that most directly resolved the original paradox.
Every time a new consultation is added to a patient record — whether typed, dictated, or imported — the consultation text is sent to the LLM with two instructions. First: structure this text against the specialty-specific template and return a clean summary. Second: check the clinical content against a predefined set of rules and return any recommendations alongside the summary.
The rules are clinical, not technical. They were written by a clinician:
- Has the toxoplasmosis serology been prescribed for a non-immune patient at the appropriate gestational age?
- Has Aspirin been stopped at 34 weeks of amenorrhoea for a patient on low-dose prophylaxis?
- Is there a relevant evidence-based update that applies to this patient's specific situation?
The doctor receives the structured summary and the rule-based recommendations together, immediately after saving the consultation. The patient record remains in free-text input if that is how the doctor prefers to work. But the LLM has simultaneously extracted the structured clinical meaning from that free text and checked it against the rules.
The doctor who refused to fill in a gestational diabetes field now has a system that reads their consultation note, identifies the diagnosis, and verifies that the appropriate management steps are in place — without any additional input.
The input is free. The output is structured. The paradox is resolved.
To make this concrete: a clinician writes "Patient non-immune to toxoplasmosis, 12 weeks, prescribed folic acid, next visit in 4 weeks." No form filled. No structured field selected. The LLM reads that note, extracts the immunological status, the gestational age, and the current prescriptions, structures them into the patient record, and simultaneously checks: toxoplasmosis serology has not been prescribed for a non-immune patient at 12 weeks — a clinical gap. The recommendation appears with the consultation summary. The doctor confirms or overrides. The database is updated. The alert is resolved. All from a sentence written the same way it would have been written on paper.

The LLM reads free-text consultation notes, returns a structured summary, and checks clinical rules — resolving the structured input paradox without changing how doctors write.
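The worked example above can be sketched as a single rule in such an engine, running on the structured fields the LLM extracts from the free-text note. The field names and trigger conditions are illustrative simplifications, not the doctoLys rules layer.

```python
# Sketch of one rule from a clinical rules engine, applied to structured
# fields extracted from a free-text note. Field names and the trigger
# logic are illustrative simplifications.

def check_toxoplasmosis_rule(record: dict):
    """Flag a non-immune pregnant patient with no toxoplasmosis serology
    prescribed. Returns a recommendation string, or None if the rule passes."""
    non_immune = record.get("toxoplasmosis_immune") is False
    serology_prescribed = record.get("toxo_serology_prescribed", False)
    if non_immune and not serology_prescribed:
        weeks = record.get("gestational_age_weeks", "?")
        return (f"Toxoplasmosis serology not prescribed for a non-immune "
                f"patient at {weeks} weeks of amenorrhoea.")
    return None

# Structured fields as the LLM might extract them from the example note:
# "Patient non-immune to toxoplasmosis, 12 weeks, prescribed folic acid."
extracted = {
    "toxoplasmosis_immune": False,
    "gestational_age_weeks": 12,
    "prescriptions": ["folic acid"],
}
alert = check_toxoplasmosis_rule(extracted)
```

The rule itself is deterministic code; the LLM's job is only to translate the doctor's sentence into the fields the rule can read.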
The decisions that required a clinician — not a developer
Before any of these features could be built, I had to understand how these models actually work — and two architectural questions needed answers that only clinical judgment could provide.
Using an LLM well is not about picking the right model. It is about designing a system where context is controlled, outputs are constrained to a defined format, and results are verifiable by the clinician before they become part of the record. Each of those three requirements has a clinical definition, not just a technical one.
A brief glossary for clinicians. An LLM — Large Language Model — is the technology behind tools most doctors have already encountered: ChatGPT, Gemini, and similar AI assistants. It is an AI system trained on vast amounts of text that can read, interpret, and generate language. Think of it as an extremely capable assistant that has read most of the internet, including a significant amount of medical literature. It does not store your data or learn from your patients. It reads what you send it, processes it, and returns a structured response.
The key to using an LLM well is what the field calls a system prompt — a set of instructions given to the model before any patient content is sent. The system prompt defines who the model is acting as, what format the output should follow, what terminology is in use, and what rules apply. The user prompt is the actual content being processed — the consultation note, the lab document, the dictated audio. Together, they determine the quality of the output.
The difference context makes is not marginal — it is the difference between a usable feature and a reliable one. Consider the same consultation note sent to an LLM in three different ways:
- No context: the model returns a generic summary that could belong to any medical specialty. It may misinterpret abbreviations, miss clinical significance, and produce output in the wrong format.
- System prompt only: the model knows it is processing a gynaecology consultation and formats the output correctly. It still does not know that "34 SA" means 34 weeks of amenorrhoea (gestational age in the French clinical convention), that a missing toxoplasmosis serology is a gap worth flagging, or that folic acid prescribed in the third trimester signals something different from the first.
- System prompt + patient record context: the model has the patient's history, known immunological status, current gestational age, and existing prescriptions. It recognises that the toxo serology is missing for a non-immune patient at 12 weeks, flags it in the recommendations, and formats the consultation summary with the correct specialty-specific structure.
The third scenario is what doctoLys sends. The patient record context travels with the consultation text. The system prompt was written by a clinician who knows what "correct" looks like in a real gynaecology practice. The output is clinically reliable — not because the model is medical-specific, but because the context is.
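That third scenario can be sketched as the assembly of a chat-style request, with the de-identified patient context travelling alongside the consultation text. The "system"/"user" role convention follows common LLM chat APIs; the context fields and prompt text here are illustrative, not the production payload.

```python
# Sketch: assembling a chat-style request for the third scenario —
# system prompt plus de-identified patient-record context travelling
# with the consultation text. Message roles follow the common LLM
# chat-API convention; all content is illustrative.

def build_request(system_prompt: str, patient_context: dict, note: str) -> list:
    context_lines = "\n".join(f"{k}: {v}" for k, v in patient_context.items())
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": (
            "De-identified patient record context:\n"
            f"{context_lines}\n\n"
            "Consultation note to process:\n"
            f"{note}"
        )},
    ]

messages = build_request(
    system_prompt="You are structuring a gynaecology consultation...",
    patient_context={
        "toxoplasmosis_immune": False,
        "gestational_age": "12 SA",
        "current_prescriptions": "folic acid",
    },
    note="Patient well. Next visit in 4 weeks.",
)
```

The difference between the three scenarios is visible in this structure: scenario one sends only the note, scenario two adds the system prompt, scenario three adds the context block as well.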
Local LLM vs. cloud LLM. The dev team's instinct leaned toward a locally hosted model — no data leaves the system, no cloud dependency, no per-token cost. The clinical reasoning pointed to cloud. The most capable models available through cloud APIs are dramatically more powerful than anything deployable locally at our infrastructure scale. They are natively multilingual — a non-negotiable requirement for a product used in Tunisia, France, West Africa, and the Middle East. And the privacy concern, while real, has a clean technical solution: patient identity is stripped from the text before any request is sent. The LLM receives clinical content without identifying information. The answer was obvious once framed as a clinical question rather than a technical one.
General LLM vs. medical-specific LLM. Medical-specific language models exist, trained on clinical literature and structured health data. The case for using them seems intuitive. The case against them is stronger: general-purpose LLMs have advanced to the point where, combined with well-crafted clinical system prompts, they outperform narrowly trained medical models on the practical tasks a medical office application requires. The system prompt carries the clinical context. The model provides the reasoning capability. The combination — and specifically the quality of the prompt engineering — is where the clinician's knowledge becomes irreplaceable.
Prompt engineering in medicine is not a technical task. It requires knowing that "34 SA" means 34 weeks of amenorrhoea, that the absence of a serology result is clinically different from a normal result, that a prescription for folic acid carries different weight in a first trimester versus a third. These distinctions do not come from a documentation page. They come from years of clinical practice.
Two constraints that are not optional
Every AI-powered feature in doctoLys operates under two non-negotiable constraints.
Human validation before saving. Every LLM output — whether a transcription, an extracted investigation result, or a consultation summary — is presented to the doctor as a proposal, not a saved record. The doctor reviews, corrects if needed, and confirms before anything is written to the database.
The primary fear doctors express when AI is introduced into clinical workflows is hallucination. This is the term used in the field for a specific and well-documented failure mode: an LLM generating content that is plausible in form but factually incorrect. The model does not know it is wrong. It produces a confident, well-structured output — and that output may contain an invented lab value, a misattributed diagnosis, a missed prescription, or a subtly wrong gestational age calculation. In a clinical context, this is not a software bug. It is a patient safety risk.
The concern is legitimate. It is also, with the right design, manageable.
The validation step is the direct answer. The LLM output is a first draft — never a saved record. The physician reads it before anything is written to the database. Errors are caught at the point where they cost nothing: a correction before saving, not a consequence after. The model contributes speed and structure. The doctor contributes judgment and accountability. Neither replaces the other, and the workflow is faster than typing from scratch even after the review step is included.
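The proposal-then-confirm flow can be sketched as a small data structure in which nothing reaches the record except through the physician's explicit save action. The names `DraftEntry` and `PatientRecord` are hypothetical illustrations, not the doctoLys codebase.

```python
# Sketch of the proposal-then-confirm workflow: LLM output is held as a
# draft and only written to the record after explicit physician action.
# Class names are hypothetical illustrations.
from dataclasses import dataclass, field

@dataclass
class DraftEntry:
    proposed: dict                          # what the LLM produced
    corrections: dict = field(default_factory=dict)

    def confirmed(self) -> dict:
        """The entry the physician actually saves: the proposal with
        their corrections applied on top."""
        return {**self.proposed, **self.corrections}

class PatientRecord:
    def __init__(self):
        self.entries: list = []

    def save(self, draft: DraftEntry) -> None:
        # Reached only from the physician's explicit "confirm" action —
        # there is no code path that writes an unreviewed draft.
        self.entries.append(draft.confirmed())

record = PatientRecord()
draft = DraftEntry(proposed={"diagnosis": "gestational diabetes", "ga_weeks": 34})
draft.corrections["ga_weeks"] = 33   # physician fixes a subtle LLM error
record.save(draft)
```

The design point is that the draft and the record are separate types: an error caught at the draft stage costs a keystroke, not a consequence.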
In practice, hallucination rates are significantly lower when the LLM is given rich clinical context — the patient record, the specialty-specific system prompt, the consultation history. A model working with full context makes fewer errors than a model working from a blank prompt. This is another reason why the quality of the system prompt matters as much as the model itself.
This is not just good product design. Under the EU AI Act, medical AI systems classified as high-risk require meaningful human oversight. A clinical decision support tool that writes directly to a medical record without physician confirmation does not meet that standard. The validation step is both a regulatory requirement and a clinical one — the doctor remains responsible for every entry in the record.
Patient identity removal before processing. Before any consultation text, document, or voice transcription is sent to an external LLM API, all patient-identifying information is stripped from the content. Name, date of birth, identification numbers, address — none of it leaves the system. The LLM receives clinical content only. This is an architectural requirement, not a configuration option, and it was designed into the feature pipeline from the first day of development.
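As a minimal illustration of where that step sits in the pipeline, here is a sketch of identifier stripping. Production de-identification is substantially harder than this (free-text names, addresses, rare identifier formats); the helper and its patterns are illustrative only.

```python
# Minimal illustration of stripping direct identifiers before a note is
# sent to an external API. Real de-identification is much harder than
# this sketch; the point here is the architectural placement of the step.
import re

def strip_identifiers(text: str, patient: dict) -> str:
    """Replace known identifiers for this patient with neutral tokens."""
    for value in (patient.get("name"), patient.get("national_id")):
        if value:
            text = text.replace(value, "[REDACTED]")
    # Dates of birth in common numeric formats, e.g. 14/02/1988 or 1988-02-14
    text = re.sub(r"\b\d{2}/\d{2}/\d{4}\b", "[DOB]", text)
    text = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "[DOB]", text)
    return text

clean = strip_identifiers(
    "Mrs Ben Salah, born 14/02/1988, ID 09123456: non-immune to toxoplasmosis.",
    {"name": "Ben Salah", "national_id": "09123456"},
)
```

The clinical content survives intact; only the identity is removed before the text leaves the system.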
Medical Insight
The AI Act classification question was one I researched myself — specifically whether a clinical rule recommendation engine falls under the high-risk AI system definition for medical devices. The answer is nuanced and depends on how the output is framed and used. Designing the validation step as a genuine physician decision point, not a rubber-stamp confirmation, was a deliberate product and compliance choice. It also happens to produce better clinical outcomes.
What this combination actually unlocked
The three features — voice transcription with AI formatting, document-to-database extraction, and the consultation summary with clinical rules engine — are not independent additions to doctoLys. They are three expressions of the same underlying capability: a translation layer between the way doctors naturally work and the structured data a useful medical record system requires.
| Feature | Without AI — requires structured input | With AI — works on unstructured input |
|---|---|---|
| Investigation list across pregnancy follow-up | Only possible if each result entered in structured fields | Extracted automatically from free text and documents |
| Consultation summary | Doctor reads all previous consultations manually | LLM generates structured summary from free-text notes |
| Clinical alerts (missed serology, Aspirin stop) | Only fires if diagnosis coded in structured field | LLM reads consultation text, detects gap, flags recommendation |
| Automatic referral letter | Requires structured data fields to populate template | Generated from free-text consultation history |
| Gestational diabetes prevalence in patient panel | Requires structured diagnosis field — empty if doctor used free text | LLM extracts diagnosis from consultation notes into database |
| Pregnancy dashboard with expected delivery dates | Requires structured date fields entered per consultation | AI-formatted consultation populates fields automatically |
None of them would have been possible with traditional software development. Not because the technology did not exist — transcription APIs existed before LLMs. But because the specific combination of clinical judgment in the prompt engineering, the understanding of what the structured output needs to contain, and the ability to test the features in a live practice environment produced results that a technically competent team without clinical grounding could not have reached.
The doctor who wants to know the gestational diabetes prevalence in their patient panel now has that data — because the LLM extracted it from the consultation notes they were already writing. The pregnancy dashboard with real-time expected delivery dates now updates automatically — because the AI-formatted consultation includes the structured fields that populate it. The clinical alerts fire correctly — because the rules were written by someone who has managed hundreds of high-risk pregnancies and knows exactly where the gaps occur.
Clinical judgment built the product. AI extended what it could do. The combination produced capabilities that neither could have reached alone.
The future of medical software is not more structured forms. It is systems that understand clinicians first — and structure data second.
This is part one of a three-part series on AI in doctoLys. In the next article, we explore AI-assisted coding — how generative AI changed what a clinician-developer can build alone, and what it means for the speed, cost, and quality of health software development.
Frequently Asked Questions
Can AI really replace structured form input in a medical record system?
Not replace — resolve the tension. Generative AI acts as a translation layer: doctors input in natural language (typed or spoken), and the LLM structures that input against a predefined clinical template. The database receives structured data. The doctor never fills in a form. Both requirements are satisfied simultaneously.
Is it safe to send patient data to a cloud LLM API?
With proper design, yes. The critical step is stripping all patient-identifying information before any content is sent externally. The LLM receives clinical content only — no name, no date of birth, no identification number. The clinical content itself is never linked back to an identifiable individual in the external system.
What does the EU AI Act require for AI in medical software?
Medical AI systems classified as high-risk under the EU AI Act — which includes clinical decision support tools that influence medical decisions — require meaningful human oversight. In practice, this means AI outputs must be presented as proposals for physician review and confirmation, not written directly to the medical record. The physician remains legally responsible for every entry.
Why use a general LLM rather than a medical-specific model?
General-purpose LLMs combined with well-crafted clinical system prompts outperform narrowly trained medical models on practical medical office tasks. The system prompt carries the clinical context — terminology, specialty-specific rules, output format requirements. The quality of that prompt engineering, written by a clinician, determines the quality of the output more than the base model choice.
What is the biggest mistake in implementing AI features in medical software?
Building AI features without clinical input on the prompt engineering. The LLM provides reasoning capability. The system prompt provides clinical context. Without a clinician defining what correct output looks like — including edge cases, terminology nuances, and specialty-specific rules — the feature is technically functional and clinically unreliable.
Written by Dr. Sadok Derouich, a practicing gynecologist since 2012, digital health entrepreneur, and CEO of doctoLys — the AI medical office app built for doctors worldwide.

