The aim of this package is to implement robust mean estimators in Julia (one-dimensional for now). See *Mean Estimation and Regression Under Heavy-Tailed Distributions: A Survey* or *The Robust Randomized Quasi Monte Carlo method, applications to integrating singular functions* for recent surveys.
Note
Computing the empirical mean of a data set is one of the most common operations in data analysis. However, the empirical mean is not robust to outliers or contaminated samples.
Robust mean estimators are estimators of the mean that remain accurate (in some quantifiable sense) in the presence of such outliers or contamination.
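To see concretely why the empirical mean is fragile, here is a minimal, self-contained sketch of one classical robust estimator, the median-of-means. It is written in plain Julia against `Statistics` only; `median_of_means` is an illustrative helper for this note, not part of this package's API:

```julia
using Statistics

# Median-of-means: split the sample into k blocks of equal size,
# average each block, and return the median of the block means.
# A single huge outlier can only corrupt one block mean, and the
# median of the block means ignores it.
function median_of_means(x::AbstractVector, k::Integer)
    m = length(x) ÷ k  # block size (any remainder is dropped)
    block_means = [mean(@view x[(i-1)*m+1:i*m]) for i in 1:k]
    return median(block_means)
end

x = [randn(100); 1_000.0]  # standard normal sample plus one huge outlier
mean(x)                    # dragged far away from 0 by the outlier
median_of_means(x, 10)     # stays close to the true mean 0
```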
I am currently experimenting with a "robust moving average" feature for this package.
```julia
using Distributions
using RobustMeans

n = 8 * 7
M = 10^5 # M = 10^7 is used for the plot
α = 3.1
distribution = Pareto(α)
μ = mean(distribution) # True mean
σ = std(distribution) # True std
x = rand(distribution, M, n) # M realizations of samples of size n

# Store all the realizations into a dictionary
p = 1 # Parameter of Minsker-Ndaoud
δ = 3exp(-8) # ≈ 0.001
estimators = [EmpiricalMean(), Catoni(σ), Huber(σ), LeeValiant(), MinskerNdaoud(p)]
short_names = ["EM", "CA", "HU", "LV", "MN"]
estimates = Dict{MeanEstimator,Vector}()
for estimator in estimators
    estimates[estimator] = [mean(r, δ, estimator) for r in eachrow(x)]
end
```

Code for the plot:
```julia
using StatsPlots, LaTeXStrings
gr()
plot_font = "Computer Modern" # To have nice LaTeX font plots.
default(
    fontfamily = plot_font,
    linewidth = 2,
    label = nothing,
    grid = true,
    framestyle = :default
)

begin
    plot(thickness_scaling = 2, size = (1000, 600))
    plot!(Normal(), label = L"\mathcal{N}(0,1)", c = :black, alpha = 0.6)
    for (ns, s) in enumerate(estimators)
        W = √(n) * (estimates[s] .- μ) / σ
        stephist!(W, alpha = 0.6, norm = :pdf, label = short_names[ns], c = ns)
        vline!([quantile(W, 1 - δ)], s = :dot, c = ns)
    end
    vline!([0], label = :none, c = :black, lw = 1, alpha = 0.9)
    yaxis!(:log10, yminorticks = 9, minorgrid = :y, legend = :topright, minorgridlinewidth = 1.2)
    xlabel!(L"\sqrt{n}(\hat{\mu}_n-\mu)/\sigma", tickfonthalign = :center)
    ylabel!("PDF")
    xlims!((-5, 10))
    ylims!((1e-5, 2))
    yticks!(10.0 .^ (-7:0))
end
```

Let's say you have a nonlinear regression problem.
Traditionally, one would try to solve the following optimization problem:

```math
\min_{a,\,b}\ \frac{1}{N}\sum_{i=1}^{N}\bigl(f(x_i; a, b) - y_i\bigr)^2,
```

where ``f`` is the model (the `relu` function below) and ``(x_i, y_i)`` are the data, i.e. minimize the empirical mean of the squared residuals. However, this empirical mean can be heavily influenced by outliers in the data. To perform robust regression, one can instead minimize a robust mean estimate of the squared residuals:

```math
\min_{a,\,b}\ \widehat{\mu}\Bigl(\bigl(f(x_i; a, b) - y_i\bigr)^2,\ i = 1, \dots, N\Bigr),
```

where ``\widehat{\mu}`` is one of the robust mean estimators above.
Note
When the estimator is `EmpiricalMean()`, this reduces to the classical least-squares problem above.
In the following example, we use the Minsker-Ndaoud robust estimator and compare it with the classical empirical mean.
First, here is the setup:
```julia
using RobustMeans
using Plots
using Optimization
using ForwardDiff

relu(x, a, b) = x > b ? a * (x - b) : zero(x)

N = 8 * 5
X = 100rand(N)
a_true = 1
b_true = 20
Y = abs.(relu.(X, a_true, b_true) + 2randn(N))

# We manually corrupt the dataset
percentage_outliers = 0.17
n_outliers = round(Int, percentage_outliers * length(X))
Y[1:n_outliers] = maximum(Y) * rand(n_outliers) .+ minimum(Y)

# δ chosen so that the implied number of blocks divides the number of data points
δ = 3exp(-8)
u0 = [0.2, 18]
p = [X, Y]
```

For comparison, let's first try the regular regression:
```julia
f_err_EM(u, p) = mean((relu.(p[1], u[1], u[2]) - p[2]).^2, δ, EmpiricalMean())
optf_EM = OptimizationFunction(f_err_EM, AutoForwardDiff())
prob_EM = OptimizationProblem(optf_EM, u0, p)
sol_EM = solve(prob_EM, Optimization.LBFGS())
```

Now the robust regression:
```julia
f_err_R(u, p) = mean((relu.(p[1], u[1], u[2]) - p[2]).^2, δ, MinskerNdaoud(2))
optf_R = OptimizationFunction(f_err_R, AutoForwardDiff())
prob_R = OptimizationProblem(optf_R, u0, p)
sol_R = solve(prob_R, Optimization.LBFGS())
```

Finally, plot the data together with both fits:

```julia
Xl = 0:0.1:100
scatter(X, Y, label = "Data")
plot!(Xl, relu.(Xl, a_true, b_true), label = "True function", lw = 2, s = :dash, c = :black)
plot!(Xl, relu.(Xl, sol_EM.u...), lw = 2, label = "Fit EM")
plot!(Xl, relu.(Xl, sol_R.u...), lw = 2, label = "Fit Minsker-Ndaoud")
```