# RobustMeans

This package implements robust mean estimators in Julia (one-dimensional for now). For recent surveys, see *Mean Estimation and Regression Under Heavy-Tailed Distributions: A Survey* or *The Robust Randomized Quasi Monte Carlo method, applications to integrating singular functions*.

> [!NOTE]
> Computing the empirical mean over a data set is one of the most common operations in data analysis. However, it is not robust to outliers or contaminated samples. Robust mean estimators are estimators of the mean that remain accurate (in some sense) in the presence of such outliers or contamination.
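
As a minimal, illustrative sketch (it assumes the `mean(x, δ, estimator)` interface used in the example below; the exact numbers will vary from run to run), here is how a single gross outlier affects the empirical mean compared with one of the robust estimators:

```julia
using RobustMeans

x = [randn(99); 1_000.0]     # 99 standard normal samples plus one gross outlier
δ = 0.001                    # confidence level passed to the estimators
mean(x, δ, EmpiricalMean())  # pulled towards the outlier (around 10 here)
mean(x, δ, LeeValiant())     # typically much closer to the true mean 0
```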

I am currently experimenting in this package with a "robust moving average".

## Example: Comparing robust estimators with the empirical mean

```julia
using Distributions
using RobustMeans
```

Generate samples

```julia
n = 8 * 7
M = 10^5 # M = 10^7 is used for the plot
α = 3.1
distribution = Pareto(α)
μ = mean(distribution) # True mean
σ = std(distribution) # True std
x = rand(distribution, M, n) # M realizations of samples of size n
```

Estimate the mean with different estimators

```julia
# Store all the realizations into a Dictionary
p = 1 # Parameter of Minsker-Ndaoud
δ = 3exp(-8) # ≈ 0.001
estimators = [EmpiricalMean(), Catoni(σ), Huber(σ), LeeValiant(), MinskerNdaoud(p)]
short_names = ["EM", "CA", "HU", "LV", "MN"]
estimates = Dict{MeanEstimator,Vector}()
for estimator in estimators
    estimates[estimator] = [mean(r, δ, estimator) for r in eachrow(x)]
end
```
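
As an optional follow-up (not part of the original example), one can summarize how far each estimator strays from the true mean across the `M` repetitions, for instance via the root-mean-square error:

```julia
using Statistics  # for mean(f, itr) on plain vectors

for (ns, estimator) in enumerate(estimators)
    rmse = sqrt(mean(abs2, estimates[estimator] .- μ))
    println(short_names[ns], ": RMSE = ", rmse)
end
```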

Results

Code for the plot:

```julia
using StatsPlots, LaTeXStrings
gr()
plot_font = "Computer Modern" # To have nice LaTeX font plots.
default(
    fontfamily = plot_font,
    linewidth = 2,
    label = nothing,
    grid = true,
    framestyle = :default
)
begin
    plot(thickness_scaling = 2, size = (1000, 600))
    plot!(Normal(), label = L"\mathcal{N}(0,1)", c = :black, alpha = 0.6)
    for (ns, s) in enumerate(estimators)
        W = sqrt(n) * (estimates[s] .- μ) / σ
        stephist!(W, alpha = 0.6, norm = :pdf, label = short_names[ns], c = ns)
        vline!([quantile(W, 1 - δ)], s = :dot, c = ns)
    end

    vline!([0], label = :none, c = :black, lw = 1, alpha = 0.9)
    yaxis!(:log10, yminorticks = 9, minorgrid = :y, legend = :topright, minorgridlinewidth = 1.2)
    ylims!((1 / M * 10, 2))
    xlabel!(L"\sqrt{n}(\hat{\mu}_n-\mu)/\sigma", tickfonthalign = :center)
    ylabel!("PDF")
    xlims!((-5, 10))
    ylims!((1e-5, 2))
    yticks!(10.0 .^ (-7:0))
end
```

![Distribution of the normalized estimation error for each estimator](robust_n_56_alpha_3p1_delta_0p001_1000000_EMCAHULVMN.svg)

## Example: Robust nonlinear regression

Let's say you have a nonlinear regression problem $Y = f(u, X) + \epsilon$, where $\epsilon$ represents noise that can be heavy-tailed or contain corrupted samples. The function $f$ is parametrized by the vector of parameters $u$ that you want to fit.

Traditionally, one would try to solve the following optimization problem

$$u^\ast_{\mathrm{EM}} = \mathrm{argmin}_u \dfrac{1}{N}\sum_{i=1}^N (y_i - f(u, x_i))^2$$

However, this empirical mean could be heavily influenced by data outliers. To perform robust regression, one could use

$$u^\ast_{\mathrm{robust}} = \mathrm{argmin}_u \text{RobustMean}\left(\left\{(y_i - f(u, x_i))^2\right\}_{i\in [\![1, N]\!]}\right)$$

> [!NOTE]
> When $f$ is linear, you can use the dedicated package RobustModels.jl, which has many more robust estimators and a better interface. However, it lacks some of the more theoretical estimators implemented here and, most importantly, it is currently limited to linear models.

In the following example, we use the Minsker-Ndaoud robust estimator and $f$ is a $\mathrm{relu}$ function. We choose Minsker-Ndaoud because it is compatible with automatic differentiation. Catoni and Huber should also be easily differentiable; for the MoM-based estimators, I am not sure.
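
Before setting up the regression, here is a minimal, hypothetical sanity check (not from the package documentation) that ForwardDiff can indeed propagate derivatives through the `mean(x, δ, MinskerNdaoud(p))` call used below:

```julia
using RobustMeans, ForwardDiff

x = randn(40)
δ = 3exp(-8)
robust_loss(c) = mean((x .- c) .^ 2, δ, MinskerNdaoud(2))  # robust mean of squared residuals
ForwardDiff.derivative(robust_loss, 0.5)                   # derivative of the loss w.r.t. c
```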

First, here is the setup:

```julia
using RobustMeans
using Plots

using Optimization
using ForwardDiff

relu(x, a, b) = x > b ? a * (x - b) : zero(x)
N = 8 * 5
X = 100rand(N)
a_true = 1
b_true = 20
Y = abs.(relu.(X, a_true, b_true) + 2randn(N))

# We manually corrupt the dataset
percentage_outliers = 0.17
n_outliers = round(Int, percentage_outliers * length(X))
Y[1:n_outliers] = maximum(Y) * rand(n_outliers) .+ minimum(Y)

# just so δ is a multiple of the number of data points
δ = 3exp(-8)

u0 = [0.2, 18]
p = [X, Y]
```

For comparison, let's try the regular regression

```julia
f_err_EM(u, p) = mean((relu.(p[1], u[1], u[2]) - p[2]) .^ 2, δ, EmpiricalMean())

optf_EM = OptimizationFunction(f_err_EM, AutoForwardDiff())
prob_EM = OptimizationProblem(optf_EM, u0, p)

sol_EM = solve(prob_EM, Optimization.LBFGS())
```

Now the robust regression

```julia
f_err_R(u, p) = mean((relu.(p[1], u[1], u[2]) - p[2]) .^ 2, δ, MinskerNdaoud(2))

optf_R = OptimizationFunction(f_err_R, AutoForwardDiff())
prob_R = OptimizationProblem(optf_R, u0, p)

sol_R = solve(prob_R, Optimization.LBFGS())

# Plot the data, the true function, and both fits
Xl = 0:0.1:100
scatter(X, Y, label = "Data")
plot!(Xl, relu.(Xl, a_true, b_true), label = "True function", lw = 2, s = :dash, c = :black)
plot!(Xl, relu.(Xl, sol_EM.u...), lw = 2, label = "Fit EM")
plot!(Xl, relu.(Xl, sol_R.u...), lw = 2, label = "Fit Minsker-Ndaoud")
```
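
As a quick, illustrative check, you can also compare the fitted parameters with the true ones:

```julia
println("True parameters:     a = ", a_true, ", b = ", b_true)
println("Empirical-mean fit:  a = ", sol_EM.u[1], ", b = ", sol_EM.u[2])
println("Minsker-Ndaoud fit:  a = ", sol_R.u[1], ", b = ", sol_R.u[2])
```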

*Robust relu regression*
