Low-data Uncertainty Analysis with Triangular Distribution Monte Carlo

import * as Plot from "npm:@observablehq/plot";
import * as d3 from "npm:d3";
const numSamples = 10000;

We're often asked to make decisions that require analysing situations involving quantities we can't know with certainty. Perhaps we're asked to predict a quantity that can't be known until it happens (like next month's rainfall) or to measure something without the right tools (how much does my couch weigh?). For some of these we might be able to use existing data to arrive at a distribution of potential values for our analysis, but sometimes there's simply no relevant existing data set.

Human psychology is allergic to uncertainty, so for many people there's a strong temptation to rely on intuition in these cases. However, especially in unfamiliar situations or for inexperienced actors, there may be no valid intuition to draw from. Even with experience, plenty of psychological biases undermine those who lean too heavily on intuition.

Slightly more analytical folks often try to eliminate uncertainty instead, using single-point estimation to arrive at scalar quantities they can plug into ordinary arithmetic. They may apply a "margin of safety" or a pessimistic attitude while estimating these single points, which can buy some safety or reliability, but being overly pessimistic leads to inefficient behavior. Single-point estimation may feel rigorous because it involves math, but it's ultimately a poor way to model situations because it discards our uncertainty, which is itself a source of valuable information.

Truly rational folks may have mastery of statistical methods for managing risk in the face of uncertainty, yet still come up empty when the problem is ill-defined or there's very little knowledge of the quantities involved or of how to model them statistically.

Triangular Distribution Three-point Estimation

While we can't always be certain of quantities, we can usually form at least a basic idea of the minimum expected, most-likely, and maximum expected values for them, called a three-point estimate. If we are incredibly uncertain about a quantity these points might be spread quite far apart, but in many cases we can still come up with some values, even with a remarkable lack of prior information.

A triangular distribution models a triangle-shaped probability density function with one point at a minimum value (min), one at the most likely value (mode), and the last at a maximum value (max). We can use it to turn our three-point estimates into probability distributions.

// Produce a sample from a triangular distribution via inverse transform
// sampling: draw a uniform random number and map it through the inverse of
// the triangular CDF.
function triangularSample(min, mode, max) {
    const u = Math.random();
    // Fraction of the distribution's area that lies below the mode.
    const c = (mode - min) / (max - min);

    if (u < c) {
        // Sample falls on the rising edge, between min and mode.
        return min + Math.sqrt(u * (max - min) * (mode - min));
    } else {
        // Sample falls on the falling edge, between mode and max.
        return max - Math.sqrt((1 - u) * (max - min) * (max - mode));
    }
}

Example

const example = new Array(numSamples);
for (let i = 0; i < numSamples; i++) {
    example[i] = triangularSample(1, 10, 100); // Min 1, Mode 10, Max 100
}

And plot:

Plot.rectY(example, Plot.binX({y: "count"}, {x: d => d})).plot()

Monte Carlo Methods

Monte Carlo methods allow for a relatively straightforward translation of a standard mathematical or logical model into one that accounts for uncertainty, where at least one of the quantities involved is a probability distribution of possible values rather than a single number.

The key is repeated sampling. Anything you can calculate or model with scalar inputs, you can calculate with Monte Carlo methods using distributions of possible inputs: sample each input, run the calculation, and repeat until the results form a distribution of possible outputs.
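As a rough sketch of that pattern (the monteCarlo helper below is hypothetical, just to illustrate the shape of the loop), any scalar model can be wrapped like so:

// Hypothetical helper, for illustration only: run `model` once per iteration,
// drawing each input from its own sampler function, and collect the outputs
// into an array we can treat as a distribution of possible results.
function monteCarlo(model, samplers, n = numSamples) {
    const outputs = new Array(n);
    for (let i = 0; i < n; i++) {
        outputs[i] = model(...samplers.map(sample => sample()));
    }
    return outputs;
}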

Example

Imagine you have a die and you're not sure whether it's weighted or fair. The Monte Carlo approach to this uncertain truth is: roll the thing until the data says it's conclusively fair or conclusively unfair.
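As a minimal sketch of that experiment (the die and its weights below are made up purely for illustration), we can simulate the rolls and compare the observed face frequencies against the fair expectation of 1/6 each:

// Roll a die whose faces have the given relative weights (made-up example:
// face 6 is slightly favored), then tally how often each face comes up.
function rollWeightedDie(weights) {
    const total = weights.reduce((sum, w) => sum + w, 0);
    let r = Math.random() * total;
    for (let face = 0; face < weights.length; face++) {
        r -= weights[face];
        if (r < 0) return face + 1;
    }
    return weights.length;
}

const faceCounts = new Array(6).fill(0);
for (let i = 0; i < numSamples; i++) {
    faceCounts[rollWeightedDie([1, 1, 1, 1, 1, 2]) - 1]++;
}
faceCounts.map(count => count / numSamples) // a fair die would give roughly 0.167 for every face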

Application

Perhaps we're trying to figure out how much sweet tea to brew before an upcoming cookout. I've invited 8 friends, each of whom might bring a plus-one, and there's myself.

If I just intuit an amount of tea I'd like to brew in such a situation, trying not to run out, I'd pick maybe 3 gallons.

Is that enough? I'd rather not just hope that it is, so let's take a simple analytical approach we can model with a function like the following:

function teaGallonsRequired(attendees, cupsPerPerson) {
    return Math.ceil(attendees * cupsPerPerson / 16); // 16 cups per gallon
}

If I were to do single-point estimates, I'd guess that maybe half of my invitees show up and half of those bring a plus-one, for 7 people including myself. I'll guess each person will drink maybe 3 cups of tea.
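Spelling that attendance arithmetic out (the variable names below are just for illustration):

// 8 invites, half show up, half of those bring a plus-one, plus myself.
const invited = 8;
const showUp = invited / 2;                          // 4
const plusOnes = showUp / 2;                         // 2
const singlePointAttendees = showUp + plusOnes + 1;  // 7, including me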

teaGallonsRequired(7, 3)

Less than my gut guess. Glad we didn't just settle on brewing 3 gallons.

Even though we were trying to be realistic, thanks to optimism bias and plain human nature this single-point analysis should probably be considered our "Optimistic" single-point estimate, and it's labeled that way in the summary below.

Alternatively, I can go out of my way to be quite pessimistic and assume absolutely everyone shows up, each with a plus-one, and each attendee drinks 8 cups of tea.

teaGallonsRequired(17, 8)

Dang... I don't know if I have that many storage vessels to keep tea in, or the fridge space... but I definitely wouldn't run out! Hopefully we can become confident without needing to brew that much tea, so let's try Monte Carlo.

Given the way we've factored it, we can actually keep the same model function: instead of passing scalar values, we repeatedly sample distributions of possible inputs, call the function on each sample, and end up with a distribution of possible outputs.

For input distributions, we unfortunately don't have any past data about this event that we could sample from or fit a typical distribution shape to, since it's not a recurring event with the same invitees each time. But despite the ill-defined nature of our problem and the absence of evidence, we can lean on triangular distribution models of our three-point estimates:

const teaDistribution = new Array(numSamples);
for (let i = 0; i < numSamples; i++) {
    const attendees = triangularSample(1, 7, 17);    // Min 1, Mode 7, Max 17
    const cupsPerPerson = triangularSample(0, 3, 8); // Min 0, Mode 3, Max 8
    teaDistribution[i] = teaGallonsRequired(attendees, cupsPerPerson);
}

Plot.rectY(teaDistribution, Plot.groupX({y: "count"}, {x: d => d})).plot()

Since we've included and modeled our uncertainty we can now use this distribution to make practical trade-offs between risk (of running out of tea) and effort/storage space (brewing additional tea).

For my friends, I think I can tolerate a small minority of the events I host running out of drinks (maybe 10%), as long as a solid majority of the time there's plenty (90%), so I'll brew the amount that covers the required tea with 90% confidence:

d3.quantile(teaDistribution, 0.9)

If this were an event that someone was paying me to host professionally, I'd maybe go for two nines:

d3.quantile(teaDistribution, 0.99)

Still less than our pessimistic single-point estimate.

Conclusion and Opportunities

In summary, we built a small model of a situation in order to analyse it, and compared it against more popular but more simplistic approaches: raw intuition (3 gallons), an "Optimistic" single-point estimate (2 gallons), a "Pessimistic" single-point estimate (9 gallons), and a Monte Carlo output distribution we can query at whatever confidence level matches our risk tolerance.

If we were, say, a company buying tea for an audience we're trying to win over, we could even weigh the assumed risk and expected consequences of disappointing our guests against our tea expenses explicitly, rather than simply picking a risk tolerance directly.
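As a sketch of what that could look like (the costs below are entirely made up for illustration), we could pick the brewed amount that minimizes an assumed expected cost instead of a fixed percentile:

// For each candidate amount of brewed tea, estimate the chance of running out
// from our simulated distribution, then add an assumed "disappointment" cost
// to the assumed brewing cost and pick the cheapest candidate overall.
const costPerGallon = 4;      // assumed brewing cost per gallon (made up)
const costOfRunningOut = 100; // assumed cost of disappointing guests (made up)
const candidates = d3.range(1, 11).map(gallons => ({
    gallons,
    expectedCost: gallons * costPerGallon
        + d3.mean(teaDistribution, d => d > gallons ? 1 : 0) * costOfRunningOut
}));
d3.least(candidates, d => d.expectedCost)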

Whatever method we choose to address risk and uncertainty here, we wouldn't have had the opportunity to do so if we'd succumbed to the temptation of confidence or become paralysed by a lack of evidence.

Of course, there's no accounting for a bad model or bad estimates. Garbage in, garbage out, as always. But by modeling and embracing uncertainty with a statistical analytical approach, we have a starting point we can improve from, rather than a hope and a prayer.

Some variables will not be well suited to modeling with triangular distributions at all, and may instead be multi-modal, long-tailed, or parabolic in shape. Monte Carlo methods as applied to analytical risk modeling are highly adaptable to almost any probability function or observational dataset as input. Triangular distributions were my first example here for their broad applicability, but when you can, you should model your inputs as more suitable distributions that occur in nature, ideally ones observed in situations similar to the one you're trying to model. And if you have the data, rather than estimating, you should prefer to use real-world evidence directly from your situation or past instances of it.
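As a small sketch of that last point (the pastAttendance numbers below are invented for illustration), resampling observed data directly avoids assuming any particular distribution shape:

// Hypothetical attendance counts from similar past events (made up).
const pastAttendance = [5, 7, 7, 9, 12, 6, 8];

// Draw a sample by picking a past observation uniformly at random.
function sampleFromData(data) {
    return data[Math.floor(Math.random() * data.length)];
}

// e.g. const attendees = sampleFromData(pastAttendance);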

Some problems are analytically intractable, and others simply may not suit this method. If you attempt it in a highly chaotic situation (outputs vary wildly given small variations in the inputs) or a highly dimensional one (too many variables, internal or external), your outputs may never converge into a useful distribution regardless of the number of samples.

This tea situation illustrates, but doesn't really justify, the use of Monte Carlo methods: we could likely have achieved similar predictive ability with a simple three-point estimate of the final distribution, instead of bothering to estimate the input distributions and Monte Carlo simulate a multiplication (multiplication isn't very hard to intuit). However, the method scales to much more complex scenarios than this one: ones with dependencies between variables, conditional branches in logic, iterative processes with effects between steps, or multiple outputs that require trading off between objectives. We'll see these in future articles, written about more practical applications grounded in real-world business challenges and opportunities. For now, you'll have to trust me on the method's ability to scale in complexity.

None of this removes the need to track the predictive performance of the resulting models. One must test the quality of any estimation-driven process in terms of expected versus actual risk: if your tolerance is a 10% failure rate, is your failure rate actually around 10%?
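A minimal sketch of such a calibration check, assuming we had recorded how much tea was actually needed at a handful of past events (the numbers below are invented for illustration):

// Compare what we planned to brew against hypothetical observed demand.
const planned = d3.quantile(teaDistribution, 0.9); // gallons we planned to brew
const actualGallonsNeeded = [3, 2, 4, 2, 5, 3];    // hypothetical past observations
const failureRate = d3.mean(actualGallonsNeeded, d => d > planned ? 1 : 0);
// We'd hope failureRate lands near the 10% we planned for.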

You will also probably want to use sensitivity analysis (which inputs or intermediate values have the greatest impact on the outputs?) to inform the design of your simulation. That will also be covered in a future article.