scipy.interpolate.griddataの内挿方法による違いを比較

はじめに

　以前、3次元のサンプルデータを内挿してmatplotlibでうまくプロットする方法について記事にしました。

xyzの点データを内挿してmeshgridにしmatplotlibでプロットする - 静かなる名辞

　この記事では内挿のアルゴリズムをデフォルトのlinearにして使いましたが、他の方法ではどうなるのか気になったので実験してみました。

使えるアルゴリズム

　選択肢は3つだけです。

method : {‘linear’, ‘nearest’, ‘cubic’}, optional
Method of interpolation. One of
nearest
return the value at the data point closest to the point of interpolation. See NearestNDInterpolator for more details.
linear
tessellate the input point set to n-dimensional simplices, and interpolate linearly on each simplex. See LinearNDInterpolator for more details.
cubic (1-D)
return the value determined from a cubic spline.
cubic (2-D)
return the value determined from a piecewise cubic, continuously differentiable (C1), and approximately curvature-minimizing polynomial surface. See CloughTocher2DInterpolator for more details.
scipy.interpolate.griddata — SciPy v1.3.0 Reference Guide

　なんとなくcubicには1-Dと2-Dの2つがあって「1次キュービック補間と2次キュービック補間？　そんなのあったっけ」と思いがちですが、データが1次元か2次元かで使い分けられるだけで、ユーザが指定できるのは{‘linear’, ‘nearest’, ‘cubic’}のいずれかです。

　それぞれ

線形補間
最近傍補間
キュービック補間

　です。詳しい中身は知らなくても、いずれも名前くらいは聞いたことがあると思います。

実験

　二次元正規分布でサンプル数=128,512とし、それぞれの補間アルゴリズムで内挿します。結果をプロットして確認します。

　また、回帰とみなしてRMSEを出してみました。

　コードを以下に示します。

import numpy as np
import matplotlib.pyplot as plt

from scipy import stats
from scipy import interpolate
from sklearn.metrics import mean_squared_error

def rmse(true, pred):
    return mean_squared_error(true.ravel(), pred.ravel())**(1/2)

def main():
    norm = stats.multivariate_normal(mean=[2.0, 3.0], cov=[[4, 2],[2,4]])

    # samples
    xy128 = np.random.uniform(low=-10, high=10, size=(128, 2))
    z128 = norm.pdf(xy128)
    xy512 = np.random.uniform(low=-10, high=10, size=(512, 2))
    z512 = norm.pdf(xy512)

    # xy meshgrid
    x = y = np.linspace(-10, 10, 500)
    X, Y = np.meshgrid(x, y)
    Z = norm.pdf(np.vstack([X.ravel(), Y.ravel()]).T).reshape(X.shape)

    # plot
    fig, axes = plt.subplots(nrows=2, ncols=5, figsize=(10,5))
    plt.subplots_adjust(hspace=0.6, wspace=0.4)

    axes[0,0].pcolormesh(X, Y, Z, cmap="jet")
    axes[0,0].set_title("true data")
    axes[1,0].pcolormesh(X, Y, Z, cmap="jet")
    axes[1,0].set_title("true data")

    for i, (n_samples, xy, z) in enumerate(
            zip([128, 512], [xy128, xy512], [z128, z512])):
        axes[i,1].scatter(xy[:,0], xy[:,1], c=z, cmap="jet")
        axes[i,1].set_title("samples {}".format(n_samples))

        for j, i_method in enumerate(["nearest", "linear", "cubic"]):
            i_Z = interpolate.griddata(xy, z, (X, Y), method=i_method, 
                                       fill_value=0.0)
            axes[i,j+2].pcolormesh(X, Y, i_Z, cmap="jet")
            axes[i,j+2].set_title("{} {}\nrmse={:.5f}".format(
                i_method, str(n_samples), rmse(Z, i_Z)))
        
    plt.savefig("result.png")

if __name__ == "__main__":
    main()

　なお、RMSEを計算する都合上、fill_value=0.0としています。デフォルトはnanですが、それだと計算できないので……。一応実際にnanの状態でも確認し、nanになるのはグラフの端（このデータではほぼ0.0）の領域だけであることを確認して以上の判断をしました。

結果と考察

　プロットされた結果を示します。

　見ての通り、ダメダメな最近傍補間、まあまあな線形補間、群を抜いて良いキュービック補間という関係です。cubicにしておけば良いのでは？

　ただ、今回は真値そのものが補間が効きやすいなめらかなデータですが、実データはもう少しノイズが乗ったりして暴れることがあると思います。キュービック補間はオーバーシュート・アンダーシュートがあるらしいので、そういう場合でも対応できるように保険としてデフォルトがlinearになっているのかもしれません。まあ、無難なのはそっちでしょう。

　実用的には、両方やってみて大丈夫そうな方を選ぶことになるでしょう。

まとめ

　cubicがよかったです。