[Python] Skipping repeated computations with Joblib's cache

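This entry explains how to use the caching feature of Python's Joblib to skip repeated computations and speed up processing. After reading it, you will know how to make a function cacheable, how to load numpy arrays through a memory map, and how to access data through a reference.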

What is Joblib
Environment
Installing Joblib
Caching computation results with Memory
Speeding up text-to-word-ID conversion with Joblib
Summary

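What is Joblib

Joblib is a Python library for making pipeline processing more efficient, and it has the features listed below. This entry covers one of them, the caching feature.

1. Computation results can be cached

Joblib can memoize Python functions. For example, if numpy.square is memoized as shown below, a call with the same input returns the cached result instead of computing it again.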

>>> from joblib import Memory
>>> cachedir = 'your_cache_dir_goes_here'
>>> mem = Memory(cachedir)
>>> import numpy
>>> a = numpy.vander(numpy.arange(3)).astype(float)
>>> square = mem.cache(numpy.square)
>>> b = square(a)                                   
________________________________________________________________________________
[Memory] Calling square...
square(array([[0., 0., 1.],
       [1., 1., 1.],
       [4., 2., 1.]]))
___________________________________________________________square - 0...s, 0.0min

>>> c = square(a)  # this call does not perform the actual computation

2. Easy parallelization

Joblib lets you write parallel processing concisely.

>>> from joblib import Parallel, delayed
>>> from math import sqrt
>>> Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
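
The example above runs with a single worker. As a minimal sketch, the same computation can be spread over two workers. The helper name parallel_sqrt and the choice of prefer="threads" (used here only to keep the example lightweight) are assumptions for illustration, not something this entry measured.

```python
from math import sqrt

from joblib import Parallel, delayed


def parallel_sqrt(n, n_jobs=2):
    """Compute sqrt(i**2) for i in range(n) using n_jobs workers."""
    # prefer="threads" asks joblib for a thread-based backend;
    # remove it to let joblib choose its default backend.
    return Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(sqrt)(i ** 2) for i in range(n)
    )


roots = parallel_sqrt(10)
print(roots)
```

Parallel returns the results in the same order as the inputs, so the list can be used exactly like the output of a plain loop.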

3. Fast, highly compressed persistence

Joblib can persist even large data efficiently and can be used as a replacement for pickle.
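
A minimal sketch of this persistence feature, using joblib.dump and joblib.load. The dictionary contents, the file name data.joblib, and the compression level 3 are illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical data: a dict holding a numpy array and a plain string.
data = {"weights": np.arange(1000, dtype=np.float64), "name": "example"}

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.joblib")

# compress accepts 0-9; higher values trade speed for smaller files.
joblib.dump(data, path, compress=3)
restored = joblib.load(path)

print(restored["name"], np.array_equal(restored["weights"], data["weights"]))
```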

Environment

The OS is macOS Mojave version 10.14.16. The rest of the environment is as follows.

❯❯❯ python --version
Python 3.7.3
❯❯❯ pip freeze|grep joblib
joblib==0.13.2
❯❯❯ pip freeze|grep numpy
numpy==1.16.2

Installing Joblib

pip install joblib

Caching computation results with Memory

A simple caching example

Create an instance of the Memory class, specifying the directory in which cached results are stored.

>>> from joblib import Memory
>>> cachedir = 'your_cache_location_directory'
>>> memory = Memory(cachedir, verbose=0)

Next, decorate the function whose results you want to cache with this memory.

>>> @memory.cache
... def f(x):
...     print('Running f(%s)' % x)
...     return x

When this cacheable function is called again with the same input, the second call skips the actual computation.

>>> print(f(1))
Running f(1)
1
>>> print(f(1))
1

For an input that has not been seen before, the function is actually executed and its result is returned.

>>> print(f(2))
Running f(2)
2
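
The behavior above can also be checked in a self-contained script. This is a minimal sketch: the temporary cache directory, the function name double, and the calls list used to record real executions are assumptions for illustration.

```python
import tempfile

from joblib import Memory

# Use a throwaway directory so the example starts with an empty cache.
cache_dir = tempfile.mkdtemp()
memory = Memory(cache_dir, verbose=0)

calls = []  # records every time the function body really runs


@memory.cache
def double(x):
    calls.append(x)
    return x * 2


first = double(21)   # computed and stored
second = double(21)  # served from the cache, body is not executed
third = double(5)    # new input, computed again

print(first, second, third, calls)
```

The calls list ends up with one entry per distinct input, which confirms that the repeated call did not execute the function body.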

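Making functions that handle NumPy data cacheable

Reportedly, the original motivation for implementing Memory was to make processing on numpy arrays cacheable. To check whether a result has already been computed for a given input, Memory uses a fast cryptographic hash function.

For example, define two functions: one that takes a value and returns a numpy array, and one that takes a numpy array and performs a computation on it.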

>>> import numpy as np

>>> @memory.cache
... def g(x):
...     print('A long-running calculation, with parameter %s' % x)
...     return np.hamming(x)

>>> @memory.cache
... def h(x):
...     print('A second long-running calculation, using g(x)')
...     return np.vander(x)

As shown below, even when h is called twice on the result of g, the second call does not perform the actual computation.

>>> a = g(3) # first call to g
A long-running calculation, with parameter 3
>>> a
array([0.08, 1.  , 0.08])
>>> g(3)  # second call to g
array([0.08, 1.  , 0.08])
>>> b = h(a)  # first call to h
A second long-running calculation, using g(x)
>>> b2 = h(a)  # second call to h
>>> b2
array([[0.0064, 0.08  , 1.    ],
       [1.    , 1.    , 1.    ],
       [0.0064, 0.08  , 1.    ]])
>>> np.allclose(b, b2)
True
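
The input hashing mentioned earlier in this section can be observed with joblib.hash, a public helper that hashes Python objects including numpy arrays. This is only a sketch of the idea: whether Memory calls this helper in exactly this way internally is an assumption.

```python
import joblib
import numpy as np

a = np.hamming(3)
b = np.hamming(3)  # a distinct object with the same contents
c = np.hamming(4)  # different contents

hash_a = joblib.hash(a)
hash_b = joblib.hash(b)
hash_c = joblib.hash(c)

# Equal contents give equal hashes; different contents do not.
print(hash_a == hash_b)
print(hash_a == hash_c)
```

Because the hash depends on the array contents rather than on object identity, a second array with the same values can hit the same cache entry.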

Fast access to the cache using memory maps

Using a memory map speeds up loading numpy arrays from the cache.

>>> cachedir2 = 'your_cachedir2_location'
>>> memory2 = Memory(cachedir2, mmap_mode='r')  # use the memory map in read-only mode
>>> square = memory2.cache(np.square)
>>> a = np.vander(np.arange(3)).astype(float)
>>> square(a)
________________________________________________________________________________
[Memory] Calling square...
square(array([[0., 0., 1.],
       [1., 1., 1.],
       [4., 2., 1.]]))
___________________________________________________________square - 0.0s, 0.0min
memmap([[ 0.,  0.,  1.],
        [ 1.,  1.,  1.],
        [16.,  4.,  1.]])

When square is called again with the same input, the result is loaded from disk through a memory map and returned.

>>> res = square(a)
>>> print(repr(res))
memmap([[ 0.,  0.,  1.],
        [ 1.,  1.,  1.],
        [16.,  4.,  1.]])

For details on mmap_mode, see numpy.load.
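
As a minimal sketch of what mmap_mode='r' means in numpy.load itself, the following saves an array and maps it back read-only. The file name squares.npy and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "squares.npy")

# Save the same squared Vandermonde matrix used in the example above.
np.save(path, np.square(np.vander(np.arange(3)).astype(float)))

# mmap_mode="r" maps the file read-only instead of reading it into memory.
mapped = np.load(path, mmap_mode="r")
print(type(mapped).__name__)

# Writing to a read-only memory map is rejected.
try:
    mapped[0, 0] = 1.0
    writable = True
except ValueError:
    writable = False
print(writable)
```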

Getting a reference to a cached result

Instead of the computation result itself, you can also obtain a reference to it. To get a reference, call call_and_shelve on the memoized function.

>>> result = g.call_and_shelve(4)
A long-running calculation, with parameter 4
>>> result  
MemorizedResult(location="...", func="...g...", args_id="...")

The result of g is stored on disk and removed from memory. To retrieve the result, use the get method.

>>> result.get()
array([0.08, 0.77, 0.77, 0.08])

To delete this cache entry, use the clear method. This removes the entry from disk, so calling get again raises a KeyError.

>>> result.clear()
>>> result.get()  
Traceback (most recent call last):
...
KeyError: 'Non-existing cache value (may have been cleared).\nFile ... does not exist'
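
The reference workflow above (call_and_shelve, get, clear) can be put together in one script. This is a sketch under assumptions: the function name window and the temporary cache directory are illustrative, and the KeyError handling follows the behavior shown in the traceback above.

```python
import tempfile

import numpy as np
from joblib import Memory

memory = Memory(tempfile.mkdtemp(), verbose=0)


@memory.cache
def window(n):
    return np.hamming(n)


# Compute the result, store it on disk, and keep only a reference to it.
ref = window.call_and_shelve(4)
value = ref.get()  # load the stored array when it is actually needed

# Remove the stored result; a later get() can no longer find it.
ref.clear()
try:
    ref.get()
    cleared = False
except KeyError:
    cleared = True

print(value.shape, cleared)
```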

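Speeding up text-to-word-ID conversion with Joblib

In natural language processing with neural networks, it is common to convert the words appearing in a text into unique integers beforehand. This conversion takes longer as the amount of text grows, yet its result is the same every time. Let's see whether caching it with Joblib speeds up the second run.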
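
To make the conversion concrete, here is a minimal sketch on a toy vocabulary. The six-entry vocabulary, the sample sentence, and the helper name to_ids are assumptions for illustration; words missing from the vocabulary are mapped to the ID of <UNK>.

```python
import numpy as np

# Toy vocabulary: the position of each word is its ID.
vocab = ["</S>", "<S>", "<UNK>", "the", ",", "."]
item2index = {word: index for index, word in enumerate(vocab)}
unk = item2index["<UNK>"]


def to_ids(sentence):
    """Convert a whitespace-tokenized sentence into an array of word IDs."""
    return np.array(
        [item2index.get(word, unk) for word in sentence.split()],
        dtype=np.int32,
    )


ids = to_ids("the cat , the dog .")
print(ids.tolist())  # [3, 2, 4, 3, 2, 5]
```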

The data is downloaded as follows.

wget https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/vocab-2016-09-10.txt
wget http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
tar zxvf 1-billion-word-language-modeling-benchmark-r13output.tar.gz

The contents of each file are as follows.

❯❯❯ head ~/data/vocab-2016-09-10.txt
</S>
<S>
<UNK>
the
,
.
to
of
and
a

❯❯❯ head ~/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00001-of-00100
The U.S. Centers for Disease Control and Prevention initially advised school systems to close if outbreaks occurred , then reversed itself , saying the apparent mildness of the virus meant most schools and day care centers should stay open , even if they had confirmed cases of swine flu .
When Ms. Winfrey invited Suzanne Somers to share her controversial views about bio-identical hormone treatment on her syndicated show in 2009 , it won Ms. Winfrey a rare dollop of unflattering press , including a Newsweek cover story titled " Crazy Talk : Oprah , Wacky Cures & You . "
Elk calling -- a skill that hunters perfected long ago to lure game with the promise of a little romance -- is now its own sport .
Don 't !
Fish , ranked 98th in the world , fired 22 aces en route to a 6-3 , 6-7 ( 5 / 7 ) , 7-6 ( 7 / 4 ) win over seventh-seeded Argentinian David Nalbandian .
Why does everything have to become such a big issue ?
AMMAN ( Reuters ) - King Abdullah of Jordan will meet U.S. President Barack Obama in Washington on April 21 to lobby on behalf of Arab states for a stronger U.S. role in Middle East peacemaking , palace officials said on Sunday .
To help keep traffic flowing the Congestion Charge will remain in operation through-out the strike and TfL will be suspending road works on major London roads wherever possible .
If no candidate wins an absolute majority , there will be a runoff between the top two contenders , most likely in mid-October .
Authorities previously served search warrants at Murray 's Las Vegas home and his businesses in Las Vegas and Houston .

Next, prepare a program that converts the words appearing in the text into unique integers and turns each sentence into a numpy array.

from joblib import Memory
import numpy as np


memory = Memory('cache_dir', mmap_mode='r')


@memory.cache
def load_vocab(vocab_file):
    item2index = {}
    with open(vocab_file, 'r') as f:
        for line in f:
            item = line.strip()
            item2index[item] = len(item2index)
    return item2index


@memory.cache
def convert(data_file, item2index, is_reference):
    if is_reference:
        item2index = item2index.get()
    data = []
    unk = item2index['<UNK>']
    with open(data_file, 'r') as f:
        for line in f:
            words = line.strip().split()
            indices = []
            for word in words:
                if word in item2index:
                    indices.append(item2index[word])
                else:
                    indices.append(unk)
            data.append(np.array(indices, dtype=np.int32))
    return data


def main_with_reference(args):
    start_at1 = time.time()
    item2index = load_vocab.call_and_shelve(args.vocab_file)
    print('Elapsed time of load_vocab', time.time() - start_at1)

    start_at2 = time.time()
    data = convert(args.data_file, item2index, args.with_reference)
    print('Elapsed time of convert', time.time() - start_at2)


def main_without_reference(args):
    start_at1 = time.time()
    item2index = load_vocab(args.vocab_file)
    print('Elapsed of load_vocab', time.time() - start_at1)

    start_at2 = time.time()
    data = convert(args.data_file, item2index, args.with_reference)
    print('Elapsed time of convert', time.time() - start_at2)


if __name__ == '__main__':
    import time
    import argparse

    ap = argparse.ArgumentParser()
    ap.add_argument('--vocab_file')
    ap.add_argument('--data_file')
    ap.add_argument('--with_reference', action='store_true')
    args = ap.parse_args()

    if args.with_reference:
        main_with_reference(args)
    else:
        main_without_reference(args)

For both cases, with and without a reference, the same functions were called twice and the processing time was measured. The results with a reference are as follows.

❯❯❯ for i in 0 1; do echo $i; python run.py --vocab_file ~/data/vocab-2016-09-10.txt --data_file ~/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00001-of-00100 --with_reference; done
0
________________________________________________________________________________
[Memory] Calling __main__.load_vocab...
load_vocab('~/data/vocab-2016-09-10.txt')
_______________________________________________________load_vocab - 5.8s, 0.1min
Elapsed time of load_vocab 8.896415948867798
________________________________________________________________________________
[Memory] Calling __main__.convert...
convert('~/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00001-of-00100',
MemorizedResult(location="cache_dir/joblib", func="__main__/load_vocab", args_id="6bc0fc840982639b22ef42d2a0bbffa9"),
True)
__________________________________________________________convert - 6.2s, 0.1min
Elapsed time of convert 6.167249917984009
1
Elapsed time of load_vocab 0.005745887756347656
Elapsed time of convert 0.0015680789947509766

The first run takes about 8.90 seconds for load_vocab and about 6.17 seconds for convert, whereas the second run takes 0.0057 seconds and 0.0016 seconds respectively.
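
The first-run versus second-run measurement pattern can be reproduced on a small scale. This is a sketch only: slow_square, the 0.5 second sleep standing in for an expensive computation, and the timed helper are all hypothetical, and the absolute numbers will differ from the measurements in this entry.

```python
import tempfile
import time

from joblib import Memory

memory = Memory(tempfile.mkdtemp(), verbose=0)


@memory.cache
def slow_square(x):
    time.sleep(0.5)  # stand-in for an expensive computation
    return x * x


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


first_result, first_elapsed = timed(slow_square, 12)
second_result, second_elapsed = timed(slow_square, 12)

print("first call :", first_elapsed)
print("second call:", second_elapsed)
```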

Next, measure the case without a reference. Beforehand, delete the directory that stores the results above.

❯❯❯ rm -rf cache_dir

The results are as follows.

❯❯❯ for i in 0 1; do echo $i; python run.py --vocab_file ~/data/vocab-2016-09-10.txt --data_file ~/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00001-of-00100; done
0
________________________________________________________________________________
[Memory] Calling __main__-run.load_vocab...
load_vocab('~/data/vocab-2016-09-10.txt')
_______________________________________________________load_vocab - 5.7s, 0.1min
Elapsed of load_vocab 8.70986008644104
________________________________________________________________________________
[Memory] Calling __main__-run.convert...
convert('~/data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00001-of-00100',
{ '!': 278,
  '"': 11,
  '#': 5520,
  '$': 56,
  '%': 202,
  '&': 323,
  "'": 77,
  "'A": 31666,
  "'ALBA": 643448,
  "'ALENE": 268317,
  "'ALL": 643449,
  "'ALOFA": 182060,
  "'AMICO": 552405,
  "'AMPEZZO": 218449,
  "'AN": 185156,
  "'ANANA": 243173,
  "'ANDRATX": 643450,
  "'AQUILA": 61890,
  "'AYIN": 268318,
  "'Abate": 407514,
  "'Abbaye": 290027,
  "'Abbe": 407515,
  "'Abbraccio": 354852,
  "'Abbé": 443501,
  "'Abernon": 643451,
  "'Abitot": 334983,
  "'Abo": 490004,
  "'Aboville": 552406,
  "'Abreu": 643452,
  "'Abruzzo": 213291,
  "'Absence": 643453,
  "'Absinthe": 643454,
  "'Absolu": 643455,
  "'Abym": 643456,
  "'Academie": 378535,
  "'Académie": 552407,
  "'Acampo": 125839,
  "'A...,
False)
run.py:53: UserWarning: Persisting input arguments took 5.06s to run.
If this happens often in your code, it can cause performance problems
(results will be correct in all cases).
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  data = convert(args.data_file, item2index, args.with_reference)
_________________________________________________________convert - 17.1s, 0.3min
Elapsed time of convert 26.944008827209473
1
Elapsed of load_vocab 3.030667781829834
Elapsed time of convert 4.847580909729004

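The first run takes about 8.71 seconds for load_vocab and about 26.94 seconds for convert, whereas the second run takes 3.03 seconds and 4.85 seconds respectively. A warning is also printed saying that large input arguments can cause performance problems.

The results are summarized as follows.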

Function	Before/after caching	With reference (s)	Without reference (s)
load_vocab	Before caching	8.90	8.71
load_vocab	After caching	0.0057	3.03
convert	Before caching	6.17	26.94
convert	After caching	0.0016	4.85
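
The speedup implied by the table can be worked out as the ratio of the time before caching to the time after caching. The sketch below only does arithmetic on the numbers in the table; the dictionary layout and variable names are assumptions.

```python
# (function, mode) -> (seconds before caching, seconds after caching),
# copied from the table above.
timings = {
    ("load_vocab", "with reference"): (8.90, 0.0057),
    ("load_vocab", "without reference"): (8.71, 3.03),
    ("convert", "with reference"): (6.17, 0.0016),
    ("convert", "without reference"): (26.94, 4.85),
}

speedups = {key: before / after for key, (before, after) in timings.items()}

for (function, mode), ratio in speedups.items():
    print("%-10s %-18s %.1fx faster after caching" % (function, mode, ratio))
```

With a reference, the cached run is faster by a factor in the thousands, while without a reference the factor stays in the single digits.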

These measurements show that caching speeds up processing. In addition, especially when the input passed to a function is large as in this case, using a reference is faster than not using one.
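
One plausible contributor to this difference is that every argument has to be hashed for the cache lookup, and hashing a large dictionary costs far more than hashing a small object. The sketch below compares the two with joblib.hash. The synthetic 100,000-entry dictionary and the short string standing in for a small argument are assumptions, and this entry did not verify that hashing accounts for the whole measured gap.

```python
import time

import joblib

# Hypothetical stand-ins: a large vocabulary-like dict and a small argument.
large_vocab = {"word%d" % i: i for i in range(100000)}
small_token = "reference-to-cached-result"


def hash_seconds(obj):
    """Return (digest, elapsed seconds) for hashing obj with joblib.hash."""
    start = time.perf_counter()
    digest = joblib.hash(obj)
    return digest, time.perf_counter() - start


large_digest, large_elapsed = hash_seconds(large_vocab)
small_digest, small_elapsed = hash_seconds(small_token)

print("large dict  :", large_elapsed)
print("small string:", small_elapsed)
```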

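Summary

This entry explained how to skip repeated computations using Joblib's caching feature. For numpy arrays, it also covered loading data from disk through a memory map. It confirmed that, for large input data, using a reference keeps processing from slowing down. Finally, for the task of converting words in a text into IDs, it confirmed that caching makes the second and subsequent runs faster.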


Takuya Makino