Evaluating Evidence and Acting on Uncertainty
Academicians’ Most Effective Argument Against Longevity
I started writing this post last week after taking arrows from just about every ivory tower out there, literally almost every one. It’s as if when we leave Yale, Sinai, Columbia, Stanford, etc., our brains fall out of our heads or get pulled out in preparation for mummification. Nothing could be further from the truth. If anything, our eyes are open to fields our educational institutions do not excel at, and to an array of developing science that is not ready for the algorithmic care plans foisted on residents and nurse practitioners. Despite being called mad men and grifters, we are actually just as smart as those who chose to live off a 403(b), HHMI funding, or an R01 every 3 to 5 years.
As you can tell, this is going to be quite a deep dive. This time we will go about 3 ATA into the world of evidence analysis and what to do when the world is unsure what to do for your patient, specifically tilted in favor of longevity. For paid subscribers, I will also highlight a real-world example we have used for rapamycin.
If you haven’t become one, let me explain the value I offer. I regularly share my protocols for practicing Longevity Medicine and offer Longevity Coaching services for patients and physicians through my Substack. That is well worth the two Starbucks drinks a month!
Longevity medicine, i.e. the CLINICAL APPLICATION of geroscience, promises to extend healthspan by targeting aging processes. However, clinicians in this field face a fundamental challenge: how to make patient-care decisions when definitive evidence is sparse or evolving. Unlike traditional single-disease interventions, geroscience interventions (e.g. senolytics, metformin for aging, rapamycin) often lack long-term randomized trial data and rely on emerging biomarkers or animal studies.
This post explores how clinicians can critically evaluate and act on medical evidence in geroscience despite uncertainty. We review established evidence appraisal models (GRADE, evidence hierarchies, Bayesian decision-making, and Realist approaches), critique their use in an aging context, and discuss strategies for communicating uncertainty. In the next post of this series we will address ethical considerations of experimental or preventive anti-aging therapies and draw lessons from analogous fields (early-phase oncology, preventive cardiology, and personalized genomics, which I know all too well as a board member on many personalized medicinal genomics research efforts) to guide practical, transparent, and rigorous decision-making for clinicians.
In short, I’ve been around the block in the acting-with-limited-evidence space since 2007, when the cool kids were spitting in tubes so Google could sell it to foreign governments and eventually to Regeneron. The one thing I learned in the research board meetings I participated in with heads of the departments of Genetics, Cardiology and Oncology at MSKCC, Moffitt, Stanford, UPenn, NYU, Harvard, Yale, etc. is to have a construct to evaluate the evidence. You see, the world was hyping spit parties to discover your future for fun.
And while the world was paying to give Corporations like Google your genome, I was actively researching the effects of this data, even if it, like marketing, had no real evidence. In 2009 I even wrote an article warning everyone about this in Nature Biotechnology.
“I am writing in response to your editorial entitled “In need of counseling?”1, in which you argue that direct-to-consumer (DTC) companies should not be shackled by regulation and that physicians cannot continue to be gatekeepers of genetic information. As a healthcare provider on the front lines of genomic medicine and founder of Helix Health (Stamford, Connecticut, USA)—a company that provides medically validated genetic technology to patients—I believe the editorial contains several incorrect assumptions and fails to reflect the reality of the current situation.”
But don’t worry about that, Chevy Chase was doing it, so it must be cool.
See Your Future, Be Your Future Danny. Spit in the damned tube.
Did it upset me that a bunch of people were hyping something with a minuscule amount of evidence at the time and overpromising results? As a clinical fellow in genetics, yes it did. But what concerned me more was that this was being done without any medical records protection (HIPAA etc.).
So while I appreciate the academic “concern,” you must remember: the safest place to practice Longevity Medicine is with a Longevity Doctor.
Why? Physicians were taught how to review the evidence. Some forgot, some never learned properly, most remember some of this and all have taken an oath to do no harm. Unlike AI or corporations. In this post I will review and teach how to properly review the evidence and give you my personal take on what the teeth gnashing is all about when Academicians scold the boots on the ground and the patients about “How” to practice longevity.
The traditional hierarchy of evidence. Systematic reviews of RCTs sit at the top as the highest-quality “filtered” evidence, whereas expert opinion and small case series form the base.
Established Evidence Evaluation Models in Medicine
Evidence Hierarchies and GRADE
Clinical decisions have traditionally been guided by hierarchies of evidence. The classic Evidence-Based Medicine (EBM) pyramid ranks study designs by validity: at the base lie expert opinion and case reports, ascending through observational studies, with randomized controlled trials (RCTs) and systematic reviews at the apex.
Building on the evidence pyramid, the GRADE framework (Grading of Recommendations, Assessment, Development and Evaluation) provides a structured approach to rating the certainty of a body of evidence and the strength of clinical recommendations. Under GRADE, evidence from well-conducted RCTs starts as “High” quality, whereas non-randomized or observational studies start as “Low” quality by default. Reviewers then consider factors that can downgrade the certainty (risk of bias, inconsistency, indirectness of evidence, imprecision, publication bias) or upgrade it (e.g. large effect size, dose-response relationship) (cdc.gov). The final rating (High, Moderate, Low, or Very Low certainty) reflects confidence that the estimated effect is close to the true effect. GRADE also separates the strength of a recommendation (strong vs. conditional/weak) from evidence quality, allowing for situations where a strong recommendation might be made on low-quality evidence or vice versa, after considering context, patient values, and potential harms/benefits.
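To make the rating logic concrete, here is a minimal sketch in Python. It is my own simplification for illustration, not an official GRADE tool: real GRADE ratings are structured judgments, not arithmetic, and the function name and the idea of counting "levels" as integers are assumptions.

```python
# Simplified, hypothetical sketch of GRADE-style certainty rating.
# Real GRADE assessments are judgments made by reviewers, not a formula.
LEVELS = ["Very Low", "Low", "Moderate", "High"]


def grade_certainty(randomized: bool, downgrades: int = 0, upgrades: int = 0) -> str:
    """Return a certainty label.

    randomized: True if the body of evidence comes from RCTs.
    downgrades: levels removed for risk of bias, inconsistency,
                indirectness, imprecision, or publication bias.
    upgrades:   levels added, e.g. for a large effect or dose-response.
    """
    if downgrades < 0 or upgrades < 0:
        raise ValueError("downgrades and upgrades must be non-negative")
    start = 3 if randomized else 1  # RCTs start "High", observational "Low"
    score = start - downgrades + upgrades
    score = max(0, min(score, len(LEVELS) - 1))  # clamp to the four labels
    return LEVELS[score]


# A well-run 10-year cohort study with no other concerns still starts "Low".
print(grade_certainty(randomized=False))
# The same cohort relying on a surrogate endpoint (indirectness) drops further.
print(grade_certainty(randomized=False, downgrades=1))
# An RCT with imprecise estimates lands at "Moderate".
print(grade_certainty(randomized=True, downgrades=1))
```

Notice how the starting point does most of the work: a cohort study begins two levels below an RCT before anyone has looked at how well it was done.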
Advantages: Evidence hierarchies and GRADE introduce rigor and transparency. They prioritize study designs less prone to bias and encourage clinicians to be cautious about low-quality data. GRADE’s systematic downgrading criteria ensure that limitations (e.g. small sample size or surrogate endpoints) are explicitly considered. This approach has been widely adopted in guideline development across specialties for its consistency and clarity.
Limitations in Geroscience: In the context of geroscience and preventive longevity interventions, strict adherence to traditional hierarchies can be problematic. Because aging outcomes (e.g. extended lifespan or delayed multimorbidity) often require decades of follow-up, long-term RCTs are logistically and ethically challenging. As a result, much geroscience evidence comes from animal studies, human observational cohorts, or short-term trials using surrogate endpoints (like biomarkers of “biological age”). GRADE would categorize such evidence as low certainty from the outset, regardless of relevance. In this great article I reviewed, there are some real downsides to using GRADE. For example, a large 10-year prospective cohort study on a nutritional intervention might be the best available human evidence for longevity, yet GRADE would rate it “Low” merely for being non-randomized. Important findings could thus be dismissed from guideline consideration due to design alone.
Furthermore, GRADE penalizes indirect evidence – and many geroscience surrogates (epigenetic clocks, inflammatory markers, etc.) are indeed indirect proxies for hard endpoints like mortality. This means that even promising interventions targeting aging hallmarks may be graded as very low certainty until definitive morbidity or mortality data accrue. While GRADE’s rigor is vital, its traditional evidentiary requirements seldom align with the realities of longevity research (few lengthy RCTs, evolving biomarkers), necessitating complementary approaches.
Bayesian Frameworks for Evidence and Decision-Making
Another lens to evaluate medical evidence is the Bayesian framework, which is rooted in probability updating and explicitly incorporates prior knowledge. In Bayesian evidence evaluation, clinicians start with a prior belief about an intervention’s likely benefit/risk (informed by mechanistic geroscience insights, animal data, or related human studies), and then update that belief as new evidence emerges, weighting it by the likelihood of observing the trial results if the intervention truly works. The result is a posterior probability that the treatment is effective, given both the prior and the new data. This approach contrasts with the frequentist null-hypothesis paradigm (which yields p-values but no direct probability of efficacy).
Advantages: A Bayesian approach can be very practical in an emerging field like longevity medicine where definitive trials are absent but there is a rich backdrop of mechanistic aging biology. It allows clinicians to formally integrate diverse evidence sources – e.g. prior: “Based on animal studies, there is a high (say 80%) prior probability that rapamycin slows aging”; evidence: “A small human trial showed improved immune function with p≈0.05”; posterior: “Now perhaps a ~90% probability that rapamycin has geroprotective effect, albeit magnitude uncertain.” In Bayesian decision-making, one can also factor in the patient’s unique context to estimate individualized benefit. This framework aligns with clinical intuition (we often come with prior impressions and then refine them) and encourages continuous learning as evidence accumulates.
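For readers who want to see the arithmetic behind that 80% to ~90% move, here is a minimal sketch using the odds form of Bayes’ rule. The Bayes factor of 2.25 is an assumption I picked so the numbers match the example above. It is not derived from any actual trial, and a p-value of about 0.05 does not translate directly into a Bayes factor.

```python
def update_probability(prior: float, bayes_factor: float) -> float:
    """Update a prior probability using the odds form of Bayes' rule.

    bayes_factor is how much more likely the observed data are if the
    intervention works than if it does not (hypothetical input here).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if bayes_factor <= 0.0:
        raise ValueError("bayes_factor must be positive")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Optimistic prior from animal data (80%), modest supportive human trial.
optimist = update_probability(prior=0.80, bayes_factor=2.25)
print(f"optimistic prior 0.80 -> posterior {optimist:.2f}")  # 0.90

# A skeptic starting at 30% sees the same trial and ends up somewhere else.
skeptic = update_probability(prior=0.30, bayes_factor=2.25)
print(f"skeptical prior 0.30 -> posterior {skeptic:.2f}")  # 0.49
```

The same evidence moves both clinicians in the same direction, but they do not land in the same place. That is the whole argument about priors in two lines of output.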
Think about this in terms of the mitochondria and mitochondrial science.
Deep Dive: Mitochondrial Movement and Longevity. Missing Knowledge Revealed
Mitochondria….More Than Just Powerhouses
The science moves probabilities, and we adjust the likelihood of clinical benefit as we learn. If we don’t, we will sit in labs all day and twiddle our thumbs. This is the definition of evidence-based practice.
Bayesian reasoning may also facilitate decision analysis under uncertainty: by estimating an intervention’s probability of benefit and potential harm, a clinician can weigh the expected value (even without certainty) – a concept akin to how preventive cardiologists decide to treat based on calculated risk percentages.
Limitations: The Bayesian method’s flexibility comes with challenges. The choice of prior can be subjective or contentious: overly optimistic priors might overestimate an intervention’s promise (an argument heard from academia), whereas conservative priors might underplay genuine effects. In geroscience, for instance, different experts might assign very different credence to animal model data. Moreover, Bayesian calculations can be complex, and the approach is not yet standard in guideline development (most guidelines still use frequentist trial interpretations and GRADE).
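Here is a bare-bones sketch of that expected-value weighing. Every number is hypothetical and chosen only to show the arithmetic. The units (think quality-adjusted years, or any common scale you and the patient agree on) are an assumption too.

```python
def expected_net_benefit(
    p_benefit: float, benefit: float, p_harm: float, harm: float
) -> float:
    """Probability-weighted benefit minus probability-weighted harm.

    benefit and harm are magnitudes on the same scale (hypothetical units).
    """
    for p in (p_benefit, p_harm):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return p_benefit * benefit - p_harm * harm


# Hypothetical patient: 60% chance of gaining 2 units, 10% chance of losing 5.
net = expected_net_benefit(p_benefit=0.6, benefit=2.0, p_harm=0.1, harm=5.0)
print(f"expected net benefit: {net:.2f}")  # 0.70

# Same drug, but a patient whose comorbidities triple the chance of harm.
net_high_risk = expected_net_benefit(p_benefit=0.6, benefit=2.0, p_harm=0.3, harm=5.0)
print(f"higher-risk patient: {net_high_risk:.2f}")  # -0.30
```

A positive number does not make the decision for you. It is a starting point for the conversation with the patient, and it changes as the probabilities get updated.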
Finally, Bayesian outputs (like a probability distribution of effect size) must be communicated carefully to patients and regulators not used to this paradigm. Despite these challenges, the Bayesian framework is a valuable tool for rational decision-making in longevity medicine when evidence is uncertain, as it makes assumptions explicit and updateable rather than black-and-white.
Realist Approaches to Evidence (Context and Mechanisms)
Traditional evidence hierarchies focus on whether an intervention works on average; Realist approaches ask for whom, in what context, and how an intervention works. A Realist review or evaluation is a form of theory-driven evidence synthesis that seeks to understand the mechanisms by which an intervention produces outcomes and how these mechanisms are modulated by context. Many have defined this as the N of 1 FUNCTIONAL MEDICINE method. Not exactly what ivory towers want to scale nor is it something regulators and policy makers embrace. But it is EXACTLY what patient autonomy needs.
Instead of attempting to eliminate all contextual factors through tightly controlled trials, Realist approaches embrace complexity: they collect evidence from diverse sources (quantitative and qualitative) to construct a narrative of “what works, for whom, under what circumstances, and why.” This method is well-suited to interventions that are complex or highly individualized. The Realist approach, it appears to me, leads to a fundamental misunderstanding by academic researchers of what experienced physicians, and in this case experienced longevity medicine physicians, are doing on a daily basis.
Advantages: For a multifaceted domain like geroscience, a Realist approach can provide insights that a simple efficacy estimate cannot. For example, a Realist analysis of an exercise program for older adults might reveal that it improves outcomes only if social support mechanisms are in place (context), via improving self-efficacy and muscle strength (mechanisms). In longevity medicine, patient heterogeneity is enormous – genetic makeup, lifestyle, comorbidities, psychosocial factors all influence aging.
Realist reviews can integrate evidence ranging from clinical trials to case studies and basic science, linking them through underlying mechanisms (e.g. “intervention X reduces senescent cell burden, which in context Y leads to improved resilience”). This approach acknowledges that an intervention might not have uniform effects: for instance, a senolytic drug might significantly help frail individuals with high senescent cell burden, but do little in robust younger adults – a nuance that a one-size-fits-all RCT may obscure. Realist evaluations thus offer practical guidance by identifying subgroups likely to benefit and optimal implementation conditions.
I think there is no better group working from the Realist Approach than our amazing group of Longevity Docs on our group communication network.
Daily we approach each other with open hearts and minds, using our clinical and scientific knowledge of functional medicine and integrated modalities to work through the cases each of us has, and we apply collective Realist approaches. In fact, it reminds me precisely of the work we did at the Coriell Personalized Medicine Collaborative, where in both the traditional ICOB and the Pharmacogenomics Group we debated the evidence and used a Realist base alongside traditional Bayesian analysis and evidence hierarchies.
Limitations: Realist evidence synthesis does not yield the kind of clear-cut quantitative certainty that clinicians might hope for. It often results in a set of context-dependent recommendations or theories rather than a simple yes/no answer. The methodology can be time- and expertise-intensive, requiring qualitative research skills and iterative theory testing. There is also a risk of bias if the review is not conducted rigorously, since it relies on interpretation of complex data. In geroscience, where mechanisms are still being discovered, a Realist review could be limited by incomplete scientific understanding. Nonetheless, combining a Realist mindset with traditional efficacy data can be powerful: clinicians gain a deeper understanding of why an anti-aging strategy may or may not work in their specific patient’s case, improving personalized care.
This is why a community of physicians dedicated to understanding and using all three of these methods will eventually end up where we need to be. I predict it will happen much quicker than at the research lab level.
Comparative Summary of Frameworks
Each of the above models offers a distinct perspective on evidence. In practice, these approaches are complementary. A clinician in longevity medicine might use GRADE or evidence hierarchies to understand the formal quality of evidence, apply Bayesian reasoning to make a patient-specific decision in light of uncertainty, and use Realist insights to tailor the intervention to the patient’s context and to explain the nuanced rationale. The key is to remain flexible and TRIANGULATE THE EVIDENCE, rather than rigidly adhere to one framework when dealing with a nascent field like aging interventions.
Challenges in Applying Evidence Models to Geroscience
Longevity medicine presents unique evidentiary challenges that test the limits of traditional models. Below we detail specific issues and how they complicate evidence evaluation:
Lack of Long-Term RCTs: The “gold standard” evidence – multi-year randomized trials for hard outcomes (like all-cause mortality or incidence of multiple age-related diseases) – is exceedingly rare in geroscience. Trials like TAME (Targeting Aging with Metformin) are exceptions, and even TAME is a 6-year trial using a composite endpoint. Most interventions cannot be practically tested over decades. As noted in one review, questions requiring >5 years of follow-up or lifelong exposure are not definitively addressable with RCTs. GRADE would therefore relegate the best available evidence (often observational cohorts) to low certainty. This creates a conundrum: should clinicians wait 20+ years for perfect evidence while patients age in the meantime? Similar situations have been faced in nutritional science and preventive medicine. To adapt, researchers have proposed alternative frameworks; for example, the World Cancer Research Fund (WCRF) criteria treat consistent epidemiological data plus mechanistic plausibility as sufficient to infer causality in diet and cancer relationships.
Likewise, Katz et al. (2019) introduced the HEALM approach (Hierarchies of Evidence Applied to Lifestyle Medicine), which first asks “Is the question definitively addressable by RCTs?” If not (e.g. it requires very long duration or unethical randomization), HEALM then weighs evidence from mechanistic studies, shorter interventions, and observational data together. “So diet is not as dangerous as Rapamycin” is what I hear from the chuckleheads in the back of the room. The only answer I have to that is... Are you sure about that?
The evidence is graded not as “High/Low” certainty but as Grade A (strong/decisive), B (moderate/suggestive), or C (inconclusive) based on the totality of proof. Such adaptive schemas may be more suitable for aging research than a strict GRADE approach. Clinicians should be aware that absence of RCT evidence is not absence of effect – it may reflect practical constraints. Thus, they must judiciously interpret lower-tier evidence, sometimes acting on the best available data (with caution), rather than dismissing it outright.
Absence of evidence is never Evidence of Absence... I’m sure Philip Morris and McDonald’s are looking for RCTs and cancer incidence too...
Do you have an alternative way to look at the data? Send me a line!
Biomarkers and Surrogate Endpoints: Because waiting decades for outcomes is impractical, geroscience relies on surrogate markers of aging – from molecular measures (e.g. DNA methylation “aging clocks”) to functional markers (e.g. gait speed, grip strength) that are believed to predict future healthspan. The use of surrogates is not unique to aging; for instance, LDL cholesterol reduction was used as a surrogate for cardiovascular risk to expedite lipid-lowering drug approval. Now I prefer to use ApoB/A1 levels and Lp(a) clinically, but hey, I’m not the FDA...
The critical issue is validation: does a change in the surrogate reliably predict meaningful clinical benefit? In longevity medicine, many proposed biomarkers are still being validated. An NIH Geroscience Network paper highlights that a true biomarker of aging should predict aging-related outcomes and serve as a surrogate endpoint for evaluating interventions.
To date, no surrogate has universal acceptance, though candidates like epigenetic methylation age or inflammatory burden are promising. Evidence frameworks like GRADE treat unvalidated surrogates as indirect evidence, warranting a downgrade in certainty. From a regulatory and scientific standpoint, this is appropriate caution: history has shown that not all surrogates translate into real benefit (e.g. some osteoporosis drugs improved bone density (surrogate) but not fracture rates). For clinicians, this means that improvements in an aging biomarker should be interpreted as hypothesis-generating unless supported by broader data. There is growing recognition, including at policy levels, that we need to accelerate biomarker validation for aging. As one policy forum noted, “the current lack of validated surrogate endpoints for major late-life conditions is a critical bottleneck in clinical research”, and it called for initiatives to qualify aging biomarkers similarly to how LDL was qualified in cardiology.
In practice, clinicians must balance optimism and skepticism: positive changes in surrogate measures (like a patient’s “biological age” trending younger by a clock assay) are encouraging, but decisions should also consider conventional health indicators and the totality of evidence. By creating the specialty of Longevity, we can use the clinical practice of disease-based care and incorporate a deeper understanding of functional cellular pathways and energy-based modalities to enhance Bayesian frameworks.
Extrapolating from Animal and Mechanistic Evidence: Geroscience is bolstered by rich basic science – e.g. compounds like rapamycin, acarbose, or senolytics extend lifespan in mice. Mechanistic studies illuminate pathways (mTOR, AMPK, cellular senescence, etc.) that are conserved in humans. One of the most important takeaways from my time as the clinical fellow in genetics was Dr. McGrath’s insistence on comparative genomics. Jim’s dry humor was off-putting for those who didn’t understand him. Sarcastic and funny, just like my family. He always liked to bring me back to conserved pathways when teaching. Always, conservation matters. Jim died last year, exploring the world of cytoplasm shedding in fertilization, organelle transfer and what happens to Dad’s cytoplasm. He did this in mice, just like he did with Wistar.
Focusing on conserved pathways creates a biologically plausible rationale for interventions. However, evidence hierarchies formally rank such mechanistic evidence at the bottom (being preclinical or theoretical). While GRADE allows “upgrading” evidence quality if, say, there is a very large treatment effect or a clear dose-response relationship, these criteria are hard to satisfy without human data. The Bradford Hill criteria for causation give weight to biological plausibility, but in EBM plausibility serves more to support causality after an association is observed than to prove efficacy on its own. A tension thus exists: many longevity clinicians find mechanistic evidence highly compelling (it underpins the geroscience “hallmarks of aging” paradigm), yet by EBM standards, it’s insufficient. Back in the GWAS heyday, my mentors would always ask, “But is there a biological explanation for that SNP?” If not, we would postulate linkage vs pathway. The problem with this thinking is that we have yet to discover everything under the sun.
The solution is to use mechanistic evidence in a Bayesian or Realist sense – as context to interpret limited clinical data. For example, a small trial of a senolytic in humans might show no overt clinical change in 6 months; a mechanistic perspective reminds us that clearing senescent cells has robust effects in animals, suggesting that perhaps the trial was too short or the dose too low, rather than concluding the intervention is ineffective. In other words, mechanistic geroscience evidence should inform our priors and our understanding of how an intervention might work (or why it might fail in certain conditions), but it doesn’t obviate the need for careful clinical observation. Clinicians should neither ignore high-quality animal data (since it often represents decades of research) nor uncritically assume humans will respond identically to mice. Bridging this gap requires both translational research and pragmatic clinical observation of patients who choose to try such interventions.
Heterogeneity and Complexity: Aging is not one uniform process but a collection of interacting biological and environmental factors. Patients vary widely in genetics (e.g. APOE genotype might influence response to a therapy), in baseline “gerobiology” (one 70-year-old may have high inflammation and another high epigenetic age, etc.), and in goals (one may prioritize maximum lifespan, another quality of life or function). This heterogeneity means that a trial averaging outcomes might dilute important subgroup effects. Traditional evidence evaluation struggles with this: an RCT might show “no average benefit” even if a subset benefited greatly, leading to a blanket judgment of “no evidence of effect.” Realist approaches and precision-medicine frameworks aim to tease out these nuances.
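A quick back-of-the-envelope sketch shows how averaging dilutes a subgroup effect. The subgroup sizes and effect sizes are made up for illustration and are not drawn from any trial.

```python
def average_effect(subgroups: list[tuple[float, float]]) -> float:
    """Population-average effect from (fraction_of_population, effect) pairs."""
    total_fraction = sum(fraction for fraction, _ in subgroups)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("subgroup fractions must sum to 1")
    return sum(fraction * effect for fraction, effect in subgroups)


# Hypothetical: 20% of patients (say, high baseline inflammation) improve
# by 10 units on some outcome scale; the other 80% do not change at all.
diluted = average_effect([(0.2, 10.0), (0.8, 0.0)])
print(diluted)  # 2.0
```

A trial powered to detect a 10-unit change could easily read an average of 2 units as “no benefit,” even though one in five patients responded strongly.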
From an evidence standpoint, this means clinicians should look beyond aggregate results when possible. If a longevity intervention failed to extend lifespan overall, were there signals that a particular subgroup (say, those with high baseline inflammation or those above a certain age) did benefit? This is where detailed data and even N-of-1 trials come in. In geroscience clinics, some physicians conduct single-patient experiments – for example, starting a patient on an off-label aging drug and tracking a panel of biomarkers and functional indicators over time.
While anecdotal, a collection of such data, if systematically recorded, can generate hypotheses or suggest for whom the intervention seems effective. The key challenge is to avoid being misled by random variability. Clinicians should apply critical thinking: is an observed change likely due to the intervention or natural variation? Can it be cross-checked with another metric? Engaging in collaborative data sharing (through patient registries or case series) can elevate such experiences from anecdote to a form of real-world evidence. In our Longevity Docs group we are doing just that with a registry on GLP-1 medicines and biomarkers for aging.
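One simple way to ask “is this change bigger than the patient’s own noise?” is to compare the shift in a marker against its baseline variability. The sketch below is a crude heuristic of my own for illustration, not a validated N-of-1 method. The marker values are invented, and it ignores trends over time, regression to the mean, and assay variability.

```python
from statistics import mean, stdev


def change_vs_baseline_noise(baseline: list[float], follow_up: list[float]) -> float:
    """Shift in the mean, expressed in units of baseline standard deviation.

    Values near 0 suggest the change sits within the patient's usual
    variation; large absolute values suggest something actually moved.
    """
    if len(baseline) < 3:
        raise ValueError("need at least 3 baseline measurements")
    if not follow_up:
        raise ValueError("need at least 1 follow-up measurement")
    baseline_sd = stdev(baseline)
    if baseline_sd == 0:
        raise ValueError("baseline has no variability to compare against")
    return (mean(follow_up) - mean(baseline)) / baseline_sd


# Invented hs-CRP readings (mg/L) before and after starting an intervention.
baseline_crp = [2.1, 2.6, 1.9, 2.4]
follow_up_crp = [1.2, 1.0, 1.3]

shift = change_vs_baseline_noise(baseline_crp, follow_up_crp)
print(f"shift relative to baseline noise: {shift:.1f} SD")
```

If the shift is small relative to baseline noise, hold off on crediting the intervention. If it is large, cross-check it against a second metric before you believe it.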
Geroscience pushes the boundaries of conventional evidence evaluation. To navigate this, clinicians must become comfortable with graded certainty – recognizing shades of gray between proven and disproven. As one longevity physician noted, “responsible longevity medicine requires balanced communication, recognizing distinctions between proven, plausible, and uncertain interventions.”
Longevity Science Deserves Nuance, Not Blanket Dismissals
In a recent YouTube discussion, Dr. Eric Topol and Dr. Mike Varshavski (Doctor Mike) provided a critique of longevity medicine, raising several possibly valid points but also making definitive statem…
Embracing new frameworks (like modified evidence grading or Bayesian analysis) and maintaining scientific humility will allow clinicians to take action against aging when justified, while clearly understanding and conveying the level of uncertainty involved.
Do you want to see an example of how we do this at Concierge Medical and how I propose you do this work in real life? Let me give you a great example!
Subscribe to Longevity Insider with Dr. Murphy to keep reading this post and get 7 days of free access to the full post archives.