I am tired of boring Elasticsearch tutorials. Learning should be interactive; it shouldn't feel like reading lengthy technical documentation. Today we are going to learn the basics of Elasticsearch using a movie database. I have a small database of titles ready for you to import.

Don't we need some theory? You didn't click to read a typical technical blog post. If you are stubborn, jump to the workshop section and learn on the fly. Some people learn better this way, and there is nothing wrong with that. If you want to get to know some essentials first, go through the SlideShare deck I created.

Before we begin, clone this repo: git clone git@github.com:ptsdengineer/learn-elasticsearch-with-hollywood-movies.git or download the zip.

Installing Elasticsearch and seeding data

For Mac users: the most comfortable way is to install it with Homebrew: brew install elasticsearch

For Linux users: Please try this tutorial here: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-elasticsearch-on-ubuntu-14-04

Run Elasticsearch

elasticsearch on Mac

sudo service elasticsearch start on Ubuntu

Check if it's running on http://localhost:9200

Run using docker

If you prefer the Docker way, you can use it for this tutorial too. I have a docker-compose file ready.

docker-compose up

Credentials: the user is elastic and the password is changeme

Install API testing tool

Use Postman, curl, or Insomnia to make calls to the Elasticsearch API. I can't recommend Insomnia enough; this piece of software makes life so much easier.

Creating data

Create an index for movies. It will hold all the movie documents that we will import in a minute. Open Insomnia and make your first call.

Now let's import data into our index. I prepared a JSON file with all the documents, ready to be imported with a single command.

curl -s --header "Content-Type:application/json"  -XPOST localhost:9200/_bulk --data-binary @movies.json
Bulk import of movies from json

*Use the -u option to pass the user and password when running with Docker.

If you want to learn more about this feature: Bulk import
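
If you are curious what the bulk endpoint actually reads, here is a minimal Python sketch of the request body format: one action line followed by one document line, each on its own line, with a trailing newline at the end. The helper name to_bulk_ndjson and the two sample documents are made up for illustration, and the real movies.json in the repo may carry extra metadata (such as a type), so treat this as a sketch and not as the file's exact layout.

```python
import json


def to_bulk_ndjson(index, docs):
    """Serialize documents into the newline-delimited body that _bulk expects."""
    lines = []
    for doc_id, doc in enumerate(docs, start=1):
        # Action line: tells Elasticsearch what to do with the line that follows.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk body has to end with a newline character.
    return "\n".join(lines) + "\n"


# Hypothetical sample documents, not taken from movies.json.
sample_docs = [
    {"title": "Scarface"},
    {"title": "Captain America: The First Avenger"},
]
body = to_bulk_ndjson("movies", sample_docs)
print(body)
```

This is also why the curl command uses --data-binary: it sends the file as-is and keeps the newlines that the format depends on.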

Match all

Let's make the most straightforward possible query to our movies index: one that returns all results. It's called the match all query.

GET <name_of_index>/_search
{
    "query": {
        "match_all": {}
    }
}

You should get this type of result in response:

"hits": {
    "total": 306,
    "max_score": 1,

Match all docs
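
If you end up calling Elasticsearch from code instead of Insomnia, reading the total is the first thing you will want. Below is a small Python sketch, assuming the response has already been parsed into a dict. The helper name total_hits is mine. It handles both the plain number shown above and the object form ({"value": ..., "relation": ...}) that Elasticsearch 7 and later return.

```python
# The same match all body as above, written as a Python dict.
MATCH_ALL = {"query": {"match_all": {}}}


def total_hits(response):
    """Return the total hit count from a parsed search response."""
    total = response["hits"]["total"]
    # Elasticsearch 7+ wraps the count in an object; older versions return a number.
    if isinstance(total, dict):
        return total["value"]
    return total


old_style = {"hits": {"total": 306, "max_score": 1}}
new_style = {"hits": {"total": {"value": 306, "relation": "eq"}, "max_score": 1}}
print(total_hits(old_style), total_hits(new_style))
```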

Exercise

Type this query into Insomnia. You will get results for the movies index.

String query

Still a straightforward query: we will only search for a particular string.

GET /_search
{
    "query": {
        "query_string" : {
            "default_field" : "content",
            "query" : "this AND that OR thus"
        }
    }
}

String query documentation
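
If you prefer to build the request body in code, a tiny helper keeps the nesting out of your way. This is only a sketch: the function name query_string_query is hypothetical, and it produces exactly the body shown above so you can paste the printed JSON into Insomnia or send it with any HTTP client.

```python
import json


def query_string_query(text, default_field):
    """Build a query_string search body for one field."""
    return {
        "query": {
            "query_string": {
                "default_field": default_field,
                "query": text,
            }
        }
    }


body = query_string_query("this AND that OR thus", "content")
print(json.dumps(body, indent=4))
```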

Exercise.

Using this knowledge, find the movie Scarface in Elasticsearch. There should be only one result.

Operators

Let's build on this. We want to extend our search capabilities. Elasticsearch supports boolean operators, as in programming. By default it uses OR, so a document matches if it contains any of the terms, but we can use AND to require that all of the terms match.

GET /_search
{
    "query": {
        "query_string" : {
            "default_field" : "content",
            "query" : "Strawberry pie with jello",
            "default_operator": "AND"
        }
    }
}

Now we can be sure that we will only get the recipes we are interested in.
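
To see why the operator matters, here is a toy Python matcher. It is a deliberate simplification: it lowercases and splits on whitespace, which is not how Elasticsearch analyzes text, and the three recipe titles are made up. It only shows the difference between "any term matches" (OR) and "every term matches" (AND).

```python
def matches(text, query, operator="OR"):
    """Toy matcher: compare whitespace-separated, lowercased terms."""
    doc_terms = set(text.lower().split())
    query_terms = query.lower().split()
    found = [term in doc_terms for term in query_terms]
    # AND needs every query term in the document, OR needs at least one.
    return all(found) if operator == "AND" else any(found)


# Hypothetical documents for illustration.
recipes = ["Strawberry pie with jello", "Apple pie", "Strawberry jam"]
query = "Strawberry pie with jello"

or_hits = [r for r in recipes if matches(r, query, "OR")]
and_hits = [r for r in recipes if matches(r, query, "AND")]
print(or_hits)
print(and_hits)
```

With OR, "Apple pie" and "Strawberry jam" sneak in because each shares one term with the query. With AND, only the full match survives.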

Exercise.

Make a query to Elasticsearch that returns only one result for the query Captain America first avenger:

"hits": {
     "total": 1,
     "max_score": 11.263437,
     "hits": [
        {
           "_index": "movies",
           "_type": "movie",
           "_id": "139",
           "_score": 11.263437,
           "_source": {
              "title": "Captain America: The First Avenger",
              "plot": "Predominantly set during World War II, Steve Rogers is a sickly man from Brooklyn who's transformed into super-soldier Captain America to aid in the war effort. Rogers must stop the Red Skull – Adolf Hitler's ruthless head of weaponry, and the leader of an organization that intends to use a mysterious device of untold powers for world domination.",
              "genres": null,

Fuzziness

What about cases when users don't type the query correctly? We should handle those too. Fortunately, Elasticsearch has an answer: the match query with the fuzziness parameter. It's a simpler cousin of the string query.

"query": {
  "match": {
    "text": {
      "query": "jomped over me!",
      "fuzziness": "AUTO",
      "operator":  "and"
    }
  }
}

"fuzziness": "AUTO" generates an edit distance based on the length of the term:

0..2 = must match exactly

3..5 = one edit allowed

>5 = two edits allowed

You could also use number values, like 0, 1, 2. Fuzziness is interpreted as Levenshtein Edit Distance. More about: fuzziness
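
To make the edit distance idea concrete, here is a short Python sketch of the AUTO thresholds and a plain Levenshtein distance. The function names are mine, and this is not Elasticsearch's implementation: by default Elasticsearch also counts swapping two adjacent characters as a single edit, which plain Levenshtein does not.

```python
def auto_fuzziness(term):
    """Allowed edits for a term under "fuzziness": "AUTO"."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(typed, indexed):
    """True if the indexed term is within the AUTO edit budget of the typed term."""
    return levenshtein(typed, indexed) <= auto_fuzziness(typed)


print(levenshtein("jomped", "jumped"), auto_fuzziness("jomped"))
print(fuzzy_match("jomped", "jumped"), fuzzy_match("me", "my"))
```

"jomped" is six characters long, so it gets a budget of two edits and matches "jumped" (one substitution). "me" is too short for any edits, so it does not match "my".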

Exercise.

Write a query that will return all Captain America movies based on a query, which was mistyped: "Captaon America"

Filtering

We are using a range query.

Matches documents with fields that have terms within a certain range. The Lucene query type depends on the field type: for string fields it is a TermRangeQuery, while for number/date fields it is a NumericRangeQuery. The following example returns all documents where age is between 10 and 20:

GET _search
{
    "query": {
        "range" : {
            "age" : {
                "gte" : 10,
                "lte" : 20,
                "boost" : 2.0
            }
        }
    }
}

gte = Greater-than or equal to

gt = Greater-than

lte = Less-than or equal to

lt = Less-than

Range query
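
Since any of the four bounds can be left out, a small builder is handy when the limits come from user input. A minimal Python sketch, with a hypothetical helper name range_query, that only emits the bounds you actually pass:

```python
def range_query(field, gte=None, gt=None, lte=None, lt=None):
    """Build a range search body, skipping bounds that were not given."""
    bounds = {"gte": gte, "gt": gt, "lte": lte, "lt": lt}
    given = {name: value for name, value in bounds.items() if value is not None}
    return {"query": {"range": {field: given}}}


both_sides = range_query("age", gte=10, lte=20)
open_ended = range_query("age", gt=18)
print(both_sides)
print(open_ended)
```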

Exercise.

Create a query that would return movies with a running time between 60 and 90 minutes.
It should return 57 results.

Bool query

The bool query takes a more-matches-is-better approach, so the score from each matching must or should clause will be added together to provide the final _score for each document.

must - The clause (query) must appear in matching documents and will contribute to the score.

filter - Filter clauses are executed in filter context. Scoring is ignored and clauses are considered for caching.

should - The clause (query) should appear in the matching document.

must_not - The clause (query) must not appear in the matching documents.

Example query:

POST _search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user" : "kimchy" }
      },
      "filter": {
        "term" : { "tag" : "tech" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      },
      "should" : [
        { "term" : { "tag" : "wow" } },
        { "term" : { "tag" : "elasticsearch" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}

Bool query
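
Bool queries get deeply nested fast, so here is a minimal Python sketch of a builder that assembles the four clause types. The name bool_query and its parameters are my own. Every clause is emitted as a list, which Elasticsearch accepts for each of them, and empty clauses are left out.

```python
def bool_query(must=(), filters=(), must_not=(), should=(), minimum_should_match=None):
    """Build a bool search body from lists of clauses."""
    clauses = {
        "must": list(must),
        "filter": list(filters),
        "must_not": list(must_not),
        "should": list(should),
    }
    # Keep only the clause types that actually have content.
    body = {name: items for name, items in clauses.items() if items}
    if minimum_should_match is not None:
        body["minimum_should_match"] = minimum_should_match
    return {"query": {"bool": body}}


# The same clauses as the example query above.
example = bool_query(
    must=[{"term": {"user": "kimchy"}}],
    filters=[{"term": {"tag": "tech"}}],
    must_not=[{"range": {"age": {"gte": 10, "lte": 20}}}],
    should=[{"term": {"tag": "wow"}}, {"term": {"tag": "elasticsearch"}}],
    minimum_should_match=1,
)
print(example)
```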

Exercise.

Create a query that will find superhero movies (keywords field: superhero) that are no longer than 120 minutes and no shorter than 60 minutes (field runtime) and must not have Robert Downey Jr. as a starring actor (actors field).

You should get 12 results for this query.

Aggregations

Let's get some interesting stats for analytics. We want an overall view of how a value is distributed across the documents. The stats aggregation gives us the count, minimum, maximum, average, and sum in one go.

{
    "aggs" : {
        "grades_stats" : { "stats" : { "field" : "grade" } }
    }
}

and returns:

{
    ...

    "aggregations": {
        "grades_stats": {
            "count": 6,
            "min": 60,
            "max": 98,
            "avg": 78.5,
            "sum": 471
        }
    }
}

Read more about aggregations here
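
If it helps to see what the aggregation computes, here is the same calculation in plain Python. The six grades are hypothetical values I picked so that the result matches the response above. The real aggregation runs over the field values of all matching documents.

```python
def stats(values):
    """Compute the five numbers that the stats aggregation returns."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


# Hypothetical grades, chosen to reproduce the response above.
grades = [60, 70, 75, 80, 88, 98]
grades_stats = stats(grades)
print(grades_stats)
```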

Exercise.

Get overall data for ratings in movies: min, max, average. Do that using the stats aggregation.

Range Aggregation

A multi-bucket value source-based aggregation enables the user to define a set of ranges - each representing a bucket.

GET products/_search?size=0

{
  "aggs": {
    "weight_ranges": {
      "range": {
        "field": "weight",
        "ranges": [
          {
            "to": 500
          },
          {
            "from": 500,
            "to": 1000
          },
          {
            "from": 1000,
            "to": 1500
          }
        ]
      }
    }
  }
}

Returns aggregated data:

{
    ...

    "aggregations": {
        "weight_ranges" : {
            "buckets": [
                {
                    "to": 500,
                    "doc_count": 20
                },
                {
                    "from": 500,
                    "to": 1000,
                    "doc_count": 4
                },
                {
                    "from": 1000,
                    "to": 1500,
                    "doc_count": 4
                }
            ]
        }
    }
}
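
One detail that is easy to miss: each range includes its from value and excludes its to value, so a document sitting exactly on a boundary lands in only one bucket. A small Python sketch of that counting rule, with a made-up list of weights and a hypothetical helper name range_buckets:

```python
def range_buckets(values, ranges):
    """Count values per range: "from" is inclusive, "to" is exclusive."""
    buckets = []
    for spec in ranges:
        lower = spec.get("from", float("-inf"))
        upper = spec.get("to", float("inf"))
        count = sum(1 for value in values if lower <= value < upper)
        buckets.append({**spec, "doc_count": count})
    return buckets


# Hypothetical weights that sit right on the bucket boundaries.
weights = [120, 499, 500, 999, 1000, 1499, 1500]
ranges = [{"to": 500}, {"from": 500, "to": 1000}, {"from": 1000, "to": 1500}]
weight_ranges = range_buckets(weights, ranges)
print(weight_ranges)
```

Note that the weight of 1500 is not counted anywhere, because no range covers it.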

Exercise.

Using a range aggregation, count how many movies fall into specific run times: below 60 minutes, between 60 and 75 minutes, between 90 and 120 minutes.

Histogram aggregation

We can also use a histogram to bucket data into fixed-size intervals instead of hand-picked ranges. It's useful for prices in shops, so we can see how prices fall into ranges like $0-$10, $10-$20.

POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50
            }
        }
    }
}

Would return:

{
    ...
    "aggregations": {
        "prices" : {
            "buckets": [
                {
                    "key": 0.0,
                    "doc_count": 1
                },
                {
                    "key": 50.0,
                    "doc_count": 1
                },
                {
                    "key": 100.0,
                    "doc_count": 0
                },
                {
                    "key": 150.0,
                    "doc_count": 2
                },
                {
                    "key": 200.0,
                    "doc_count": 3
                }
            ]
        }
    }
}
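
Under the hood, a histogram rounds each value down to the nearest multiple of the interval and uses that as the bucket key. Here is a Python sketch of the rule. The prices are made up, and I use an interval of 50 because the keys in the response above are 50 apart. Empty buckets between the lowest and highest key are filled in with a count of 0, which is what Elasticsearch does by default.

```python
import math


def histogram(values, interval):
    """Bucket values by rounding each one down to a multiple of interval."""
    counts = {}
    for value in values:
        key = math.floor(value / interval) * interval
        counts[key] = counts.get(key, 0) + 1
    if not counts:
        return []
    buckets = []
    key = min(counts)
    # Walk from the lowest to the highest key so empty buckets show up too.
    while key <= max(counts):
        buckets.append({"key": float(key), "doc_count": counts.get(key, 0)})
        key += interval
    return buckets


# Hypothetical prices, chosen to reproduce the response above.
prices = [10, 80, 150, 175, 200, 210, 220]
price_buckets = histogram(prices, 50)
print(price_buckets)
```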

Exercise.

Create a histogram aggregation for the rating field in movies with an interval of 1.

Sorting

Sorting allows you to add one or more sorts on specific fields. Each sort can be reversed as well. The sort is defined on a per-field level, with two special field names: _score to sort by score, and _doc to sort by index order.

GET /my_index/my_type/_search
{
    "sort" : [
        { "post_date" : {"order" : "asc"}},
        "user",
        { "name" : "desc" },
        { "age" : "desc" },
        "_score"
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Sorting
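
The order of the entries in sort matters: the first one decides, and later ones only break ties. Here is a plain Python sketch of that behavior on a made-up list of documents. The helper name multi_sort is mine, and this is only an emulation of the idea, not how Elasticsearch sorts internally.

```python
def multi_sort(docs, sort_spec):
    """Sort docs by several (field, order) pairs, listed by priority."""
    result = list(docs)
    # Python's sort is stable, so sorting by the least important key first
    # and the most important key last gives the combined ordering.
    for field, order in reversed(sort_spec):
        result.sort(key=lambda doc: doc[field], reverse=(order == "desc"))
    return result


# Hypothetical documents for illustration.
docs = [
    {"name": "b", "post_date": "2009-11-16", "age": 20},
    {"name": "a", "post_date": "2009-11-15", "age": 25},
    {"name": "c", "post_date": "2009-11-15", "age": 40},
]
ordered = multi_sort(docs, [("post_date", "asc"), ("age", "desc")])
print([doc["name"] for doc in ordered])
```

The two documents from 2009-11-15 come first, and age in descending order decides which of them leads.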

Exercise.

Sort Captain America movies by release date in ascending order, the oldest film first. You should display only Captain America movies here. Keep results relevant.

Highlighting

Elasticsearch allows highlighting search results in one or more fields. It's useful on the results page, because it visually communicates where the query appears in the searched field.

GET /_search
{
    "query" : {
        "match": { "content": "kimchy" }
    },
    "highlight" : {
        "pre_tags" : ["<tag1>"],
        "post_tags" : ["</tag1>"],
        "fields" : {
            "content" : {}
        }
    }
}

highlight query
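
If you build the body in code, the tags become simple parameters. A minimal Python sketch with a hypothetical helper name highlight_search. The default tags here are <em> and </em>, which is also what Elasticsearch uses when you leave pre_tags and post_tags out.

```python
def highlight_search(field, text, pre_tag="<em>", post_tag="</em>"):
    """Build a match query on one field with highlighting on the same field."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "pre_tags": [pre_tag],
            "post_tags": [post_tag],
            "fields": {field: {}},
        },
    }


body = highlight_search("content", "kimchy", "<tag1>", "</tag1>")
print(body)
```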

Exercise.

Create a highlight for a query that searches for "terrorist attack" in film plots. It should return highlighted fields with tags like this:

"highlight": {
             "plot": [
                "Jack Ryan, as a young covert CIA analyst, uncovers a Russian plot to crash the U.S. economy with a <highlight>terrorist</highlight> <highlight>attack</highlight>."
             ]
          }

Pagination

You can paginate results by passing the size and from parameters to the query. size dictates the number of elements on a page, and from works as the offset.


For pages 1 to 3.

GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10

These parameters can be passed in the request body too.

{
  "query": {
    "match_all": {}
  },
  "size": 5
}
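
The offset is just arithmetic: page 1 starts at 0, and every next page skips one more page worth of results. A small Python sketch, with a hypothetical helper name page_params, that turns a 1-based page number into from and size:

```python
def page_params(page, size):
    """Translate a 1-based page number into from/size parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}


pages = [page_params(page, 5) for page in (1, 2, 3)]
print(pages)
```

Keep in mind that by default Elasticsearch refuses requests where from plus size goes beyond 10,000 (the index.max_result_window setting), so this style of pagination is meant for the first pages of results.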

Exercise.

Create pagination for movies with genre action.

Final task

Put your knowledge to good use and create a movie recommendation query that will take many parameters, including plot, actors, title, release date.

Requirements:

  1. Favor movies with a higher rating. They should have a higher score but stay relevant.
  2. The algorithm should prefer newer movies.
  3. Favor shorter films over longer.

You can also play around with it further and add extra powers to it.

Send me your answers to ptsdengineer@protonmail.com. I will post the best ones in the upcoming blog post. Get creative!

Best regards,

PTSD Engineer