Notes on Using Elasticsearch with Python

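This article walks through using Python and Elasticsearch with a dataset about Japanese restaurants: registering the documents in the search engine through the bulk API, and then searching them.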

Installing Elasticsearch

Following the guide "Install Elasticsearch from archive on Linux or MacOS", install it as follows. The commands below fetch the macOS (darwin) archive.

$wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.3.2-darwin-x86_64.tar.gz
$wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.3.2-darwin-x86_64.tar.gz.sha512
$shasum -a 512 -c elasticsearch-7.3.2-darwin-x86_64.tar.gz.sha512 
$tar -xzf elasticsearch-7.3.2-darwin-x86_64.tar.gz
$cd elasticsearch-7.3.2/

Start it as follows, and the installation is complete.

$./bin/elasticsearch
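
To confirm that the node is actually up, one option is to query its root endpoint. The following is a minimal sketch using only the standard library, assuming the default address http://localhost:9200 and no authentication; the helper names fetch_info and node_version are made up for this note.

```python
import json
import urllib.request


def fetch_info(url="http://localhost:9200", timeout=3):
    """Return the root-endpoint JSON as a dict, or None if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except OSError:
        # URLError (connection refused, timeout, ...) is a subclass of OSError.
        return None


def node_version(info):
    """Extract the version string from the root-endpoint response."""
    return info["version"]["number"]
```

Calling fetch_info() right after launching the node may return None for a short while, since the HTTP port only opens once startup has finished.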

Installing the Elasticsearch plugin

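Since the documents are in Japanese, we want the index to store morphemes rather than raw strings. For that, install elasticsearch-analysis-kuromoji. kuromoji is a Japanese morphological analyzer that is commonly used for morphological analysis inside Elasticsearch.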

$./bin/elasticsearch-plugin install analysis-kuromoji
-> Downloading analysis-kuromoji from elastic
[=================================================] 100%  
-> Installed analysis-kuromoji

Behind a proxy, you need to either specify the proxy or install the plugin manually.
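
To see what the plugin does with Japanese text, the _analyze API can be called once the node has been restarted with the plugin loaded. The sketch below uses only the standard library and assumes a node at http://localhost:9200; the helper names are hypothetical and no particular tokenization result is claimed here.

```python
import json
import urllib.request


def build_analyze_request(text, url="http://localhost:9200/_analyze"):
    """Build a POST request asking Elasticsearch to tokenize `text` with kuromoji."""
    body = {"tokenizer": "kuromoji_tokenizer", "text": text}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response):
    """Pull the token strings out of an _analyze response dict."""
    return [t["token"] for t in response["tokens"]]


def analyze(text):
    """Send the request to a running node and return the list of tokens."""
    with urllib.request.urlopen(build_analyze_request(text), timeout=5) as resp:
        return extract_tokens(json.loads(resp.read().decode("utf-8")))
```

With a running node, analyze("some Japanese sentence") returns the tokens that kuromoji_tokenizer produces, which is a quick way to check that the plugin is installed.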

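Installing the Python client

Next, install the Python client for Elasticsearch. The scripts below call Elasticsearch() with no arguments and pass body=, which matches the 7.x client, so a 7.x client is assumed to go with the 7.3.2 server installed above.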

pip install elasticsearch

Data used in this article

This article is based on the Elasticsearch tutorial and, to match it, uses the livedoor/datasets data. Download the data as follows.

git clone https://github.com/livedoor/datasets.git
cd datasets
tar zxvf ldgourmet.tar.gz

The dataset contains the following information.

http://blog.livedoor.jp/techblog/archives/65836960.html

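restaurants.csv: restaurant data
id: restaurant ID
name: restaurant name
property: branch name
alphabet: restaurant name in Latin script
name_kana: restaurant name in hiragana
pref_id: prefecture ID (see prefs.csv)
area_id: area ID (see areas.csv)
station_id1, station_time1, station_distance1: nearest station ID (see stations.csv), time (minutes), distance (m)
station_id2, station_time2, station_distance2: (same as above)
station_id3, station_time3, station_distance3: (same as above)
category_id1: category ID (see categories.csv)
category_id2, category_id3, category_id4, category_id5: (same as above)
zip: postal code
address: address
north_latitude: north latitude
east_longitude: east longitude
description: remarks
purpose: purpose of visit
open_morning: serves breakfast
open_lunch: serves lunch
open_late: open after 23:00
photo_count: number of uploaded photos
special_count: number of feature articles
menu_count: number of menu posts
fan_count: number of fans
access_count: cumulative access count
created_on: creation date
modified_on: last modified date
closed: closed (out of business)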

prefs.csv: prefecture master
id: prefecture ID
name: prefecture name

areas.csv: area master
id: area ID
pref_id: prefecture ID
name: area name

stations.csv: station master
id: station ID
pref_id: prefecture ID
name: station name
name_kana: station name in hiragana
property: railway line name

categories.csv: category master
id: category ID
name: category name
name_kana: category name in hiragana
parent1, parent2: parent category IDs
similar: similar category names

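ratings.csv: review data
id: review ID
restaurant_id: ID of the reviewed restaurant
user_id: user ID
total: overall rating (0-5)
food: food rating (0-5)
service: service rating (0-5)
atmosphere: atmosphere rating (0-5)
cost_performance: value-for-money rating (0-5)
title: review title
body: review text
purpose: purpose of visit
created_on: posting date and time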

ratings_votes.csv: votes on reviews
rating_id: ID of the review voted on
user: user ID
vote date and time

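Preprocessing the data

In restaurants.csv above, prefectures and stations are stored as IDs (other columns are IDs too, but they are left out for simplicity), so we convert those IDs to actual names using prefs.csv and stations.csv.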

import csv
import re


rest = 'datasets/restaurants.csv'
pref = 'datasets/prefs.csv'
sta = 'datasets/stations.csv'

with open(pref) as f:
    # id, name
    reader = csv.reader(f)
    attrs = next(reader)
    pref_id_to_name = {}
    for row in reader:
        pref_id_to_name[row[0]] = row[1]

with open(sta) as f:
    # id, pref_id, name, name_kana, property
    reader = csv.reader(f)
    attrs = next(reader)
    sta_id_to_name = {}
    for row in reader:
        sta_id_to_name[row[0]] = row[2]


output_file = 'data.csv'
with open(rest) as f, open(output_file, 'w', newline='') as g:
    reader = csv.reader(f)
    attrs = next(reader)

    from_to = [
        ('station_id1', 'station_1'),
        ('station_id2', 'station_2'),
        ('station_id3', 'station_3'),
        ('pref_id', 'pref'),
    ]

    for _from, to in from_to:
        idx = attrs.index(_from)
        attrs[idx] = to
    print(attrs)

    writer = csv.writer(g)
    writer.writerow(attrs)

    for row in reader:
        elem = []
        for attr, val in zip(attrs, row):
            if attr == 'pref' and val in pref_id_to_name:
                elem.append(pref_id_to_name[val])
            elif re.match(r'station_\d', attr) and val in sta_id_to_name:
                elem.append(sta_id_to_name[val])
            else:
                elem.append(val)
        writer.writerow(elem)

This writes the restaurant records to data.csv with the prefecture and station IDs replaced by their actual names.
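
The same replacement can be tried on a few rows without downloading the full dataset. The sketch below uses csv.DictReader on in-memory strings; the sample rows, the ID values, and the helper names are all made up for illustration and do not come from the livedoor files.

```python
import csv
import io

# Made-up miniature stand-ins for prefs.csv and restaurants.csv.
PREFS = "id,name\n13,Tokyo\n"
RESTAURANTS = "id,name,pref_id\n1,Sample Diner,13\n2,Unknown Place,99\n"


def load_id_to_name(text):
    """Map the `id` column to the `name` column of a master CSV."""
    return {row["id"]: row["name"] for row in csv.DictReader(io.StringIO(text))}


def replace_pref(rows, pref_id_to_name):
    """Yield rows with `pref_id` replaced by a `pref` column holding the name."""
    for row in rows:
        row = dict(row)
        pref_id = row.pop("pref_id")
        # Unknown IDs are passed through unchanged, as in the script above.
        row["pref"] = pref_id_to_name.get(pref_id, pref_id)
        yield row


prefs = load_id_to_name(PREFS)
converted = list(replace_pref(csv.DictReader(io.StringIO(RESTAURANTS)), prefs))
print(converted)
```

The second row shows the fallback: an ID missing from the master file is kept as-is rather than dropped.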

Registering documents in the search engine

Before actually registering documents in the search engine, we define the structure of the document data (the Mapping).

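What is a Mapping?

A Mapping defines how documents are registered in the index and which fields each document has. For example, a Mapping is used to define things such as:

Which fields are used for full-text search
Which fields hold numbers, dates, or location information
The format of dates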
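
As a rough illustration of those three points, a mapping body might look like the sketch below, written as a Python dict. Most field names are borrowed from restaurants.csv, but the chosen types, the date format, and the location field are assumptions made for this example only.

```python
import json

# Illustrative mapping body; types, date format and `location` are assumptions.
example_mapping = {
    "mappings": {
        "properties": {
            # Full-text search field: analyzed into terms at index time.
            "description": {"type": "text"},
            # Numeric field: usable for range queries and sorting.
            "station_distance1": {"type": "float"},
            # Date field with an explicit format.
            "created_on": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
            # Hypothetical location field holding a latitude/longitude pair.
            "location": {"type": "geo_point"},
        }
    }
}

print(json.dumps(example_mapping, indent=2))
```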

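Creating the Mapping

Write the Mapping for the documents to be registered in data.yaml. A Mapping should be defined to suit the intended use; the example here is searching for restaurants by prefecture name or nearest station name.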

$cat data.yaml
settings:
  index:
    analysis:
      analyzer:
        my_analyzer:
          type: custom
          tokenizer: kuromoji_tokenizer
          filter:
            - kuromoji_baseform
mappings:
  properties:
    description:
      type: text
      index: true
      analyzer: my_analyzer
    name:
      type: text
      index: true
    name_kana:
      type: text
    address:
      type: text
      index: true
      analyzer: my_analyzer
    pref:
      type: text
      index: true
      analyzer: my_analyzer
    station_1:
      type: text
      index: true
    station_distance1:
      type: float

Documents are registered in the index in one batch using the bulk API. Combining it with a generator keeps memory usage low and makes indexing relatively fast.


import argparse
import csv
import os
import sys

from elasticsearch import Elasticsearch, helpers
import yaml


def create_index(args, index='restaurants'):
    es = Elasticsearch()

    # print(es.indices.delete(index=index, ignore=[404]))

    with open(args.mapping_file) as f:
        setting = yaml.load(f, Loader=yaml.SafeLoader)
    properties = setting['mappings']['properties']
    print(setting)
    print(es.indices.create(index=index, body=setting))
    print(es.indices.flush())

    def generate_data():
        with open(args.data_file, 'r') as f:
            reader = csv.reader(f)
            attrs = next(reader)
            for lid, row in enumerate(reader):
                data = {
                    '_op_type': 'index',
                    '_index': index,
                    '_id': lid,
                }
                for j, value in enumerate(row):
                    if attrs[j] in properties:
                        data[attrs[j]] = value
                yield data
    print(helpers.bulk(es, generate_data()))


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('--data_file', required=True)
    ap.add_argument('--mapping_file', required=True)
    args = ap.parse_args()

    create_index(args)


if __name__ == '__main__':
    main()

Run the script to register the documents.

$python add-documents.py --data_file ./data.csv --mapping_file ./data.yaml
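
The memory saving comes from helpers.bulk pulling actions from the generator in chunks instead of needing the whole list up front. The sketch below shows only the generator side on an in-memory CSV so that it runs without a cluster; generate_actions and the sample rows are made up for illustration.

```python
import csv
import io
from itertools import islice

# Made-up sample rows standing in for data.csv.
SAMPLE = "name,pref,zip\nSample Diner,Tokyo,000-0000\nAnother Place,Osaka,111-1111\n"


def generate_actions(f, index, fields):
    """Lazily yield one bulk action per CSV row, keeping only the mapped fields."""
    reader = csv.DictReader(f)
    for lid, row in enumerate(reader):
        action = {"_op_type": "index", "_index": index, "_id": lid}
        action.update({k: v for k, v in row.items() if k in fields})
        yield action


actions = generate_actions(io.StringIO(SAMPLE), "restaurants", {"name", "pref"})
# Nothing has been read yet; rows are consumed only as actions are requested.
first = list(islice(actions, 1))
print(first)
# With a live cluster the generator would be handed straight to the helper:
#   helpers.bulk(Elasticsearch(), actions)
```

Because the zip column is not in the set of mapped fields, it is dropped from the action, which mirrors the filtering on the Mapping's properties in the script above.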

Searching documents

For example, to look up restaurants near Ebisu Station (恵比寿), do the following.

from elasticsearch import Elasticsearch

es = Elasticsearch()
index = "restaurants"

query = {
    "query": {
        "term": {
            "station_1": "恵比寿",
        },
    },
    'sort': [
        {'station_distance1': 'asc'}
    ]
}

for i in es.search(index=index, body=query)["hits"]["hits"]:
    for k, v in i["_source"].items():
        print(k, v)
    print()

Running the search gives results like the following.

$python search.py

name 点
name_kana ともるえびすてん
pref 東京都
station_1 恵比寿
station_distance1 48
description

name ビストロ　ダブダル
name_kana びすとろだぶだる
pref 東京都
station_1 恵比寿
station_distance1 49
description 営業時間情報、定休日情報を追加しました。   (from 東京グルメ 2006/04/22)

name ちか八
name_kana ちかはち
pref 東京都
station_1 恵比寿
station_distance1 49
description

name 蕎麦処 朝日屋
name_kana あさひや
pref 東京都
station_1 恵比寿
station_distance1 50
description JR恵比寿駅前ロータリーに面する、紀伊国屋酒店となり。

name 寿司文
name_kana すしぶんそうほんてん
pref 東京都
station_1 恵比寿
station_distance1 60
description

name 鶴越 夕鶴
name_kana つるこし ゆうづる
pref 東京都
station_1 恵比寿
station_distance1 60
description JR恵比寿駅西口ロータリーから徒歩約1分。　　　　神戸ソウルフードとうどんのお店　　　　饂飩

The results are printed in order of increasing distance from Ebisu Station.
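
One caveat: the Elasticsearch documentation advises against term queries on text fields, because the indexed value is analyzed while the query string is not. A possible alternative, sketched here on the assumption that the same index and field names are used, is a match_phrase query, which analyzes the query text and requires the resulting terms to appear in order. build_station_query is a hypothetical helper, and whether it returns exactly the same hits depends on the analyzer applied to station_1.

```python
def build_station_query(station, size=10):
    """Build a search body: phrase match on the nearest station, nearest first."""
    return {
        "size": size,
        "query": {
            # match_phrase runs the query text through the field's analyzer.
            "match_phrase": {"station_1": station}
        },
        "sort": [{"station_distance1": "asc"}],
    }


query = build_station_query("恵比寿")
print(query["sort"])
# With a 7.x client and a running node:
#   Elasticsearch().search(index="restaurants", body=query)
```

Mapping station_1 as keyword instead of text would be another route if exact matching with a term query is what is wanted.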

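Summary

This entry covered everything from installing Elasticsearch and kuromoji, through defining a Mapping and indexing documents with the bulk API, to running a search. For the search, we confirmed how to sort results by distance from the nearest station.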


Takuya Makino