Thursday, 24 July 2025

Integrating Elasticsearch into a Symfony app

G'day:

Here I am trying to solve a problem I had had in the past. We had a back-end web app which was pretty much a UI on some DB tables (for the purposes of this summary anyhow): companies, customers, accounts, that sort of thing. I won't disclose the business model too much as it's not relevant here, and I don't want to tie this back to a specific role I've been in. Anyway, you get the idea: a hierarchy of business entities with backing storage.

Part of the app was a global search (you know, top right of the UI, and one can plug anything one likes in there, and search results ensue). There was no specific backing storage for this, the search basically did a bunch of SELECT / UNIONs straight on the main transactional DB:

SELECT
    someColumns
FROM
    tbl1
WHERE
    col1 LIKE '%search_term%'
    OR
    col2 LIKE '%search_term%'
    OR
    -- etc

UNION

SELECT
    someColumns
FROM
    tbl2
WHERE
    colx LIKE '%search_term%'
    OR
    coly LIKE '%search_term%'
    OR
    -- etc
UNION
-- moar tables etc

For a proof of concept, this works OK. For getting something to market: it'll do. For a database that has scaled up: it stops working. The query itself was taking ages, but at the same time it was messing with other queries going on at the same time: if someone did a global search, it could kill the requests of other users. Oopsy.

We denormalised the data a bit into a dedicated search data table, and kept that up to date with event handlers when data in the source tables changed. From there we used the DB's built-in full-text searching to get the results. This was better, but it was all a bit Heath Robinson, still putting too much load on the transactional DB, and the DB's full-text-search capabilities were… erm… not as helpful as they could have been. There's also a chunk of "getting app devs to do the job of a DB dev" in the mix here.

We realised we needed to get the search out of the transactional DB, but we never had the chance to do anything about it whilst I was still on that team.

Time has passed.

This issue has stuck with me, and I've always wanted to have a look at other ways to solve it. I have some time for investigating stuff at the moment, so over the last coupla days I have turned my mind to solving this.

What I'm going to do is to run an Elasticsearch container alongside my app container, and where in the earlier scenario we built our own search index table to query, I'm gonna fire the data to Elasticsearch and let it look after it.

First of all, I have started with my default PHP8, Nginx, MariaDB container setup, with a Symfony site running on the PHP container. Same baseline I always use, and I won't go over it as it's all in GitHub (php-elasticsearch, and the README.md covers it).

Full disclosure: as a further exercise, I got GitHub Copilot to generate all the code for this. It was all at my guidance and design, but I wanted to see how much of the solution I could get Copilot to write. About 95% of it, in the end: sometimes it was easier for me to tweak something and tell Copilot why I'd done it, rather than explain what I would have wanted it to do. I am completely happy with the code I have ended up with: it's pretty much what I would have keyed in had I done it myself. Pretty much.

First: an Elasticsearch container.

# docker/docker-compose.yml

services:
  # ...

  elasticsearch:
    container_name: elasticsearch

    image: elasticsearch:9.0.3

    environment:
      - discovery.type=single-node
      - network.host=0.0.0.0
      - xpack.security.enabled=false
      - xpack.security.http.ssl.enabled=false

    ports:
      - "9200:9200"
      - "9300:9300"

    stdin_open: true
    tty: true

    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data

    healthcheck:
      test: [ "CMD", "curl", "-fs", "http://localhost:9200/_cluster/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s

volumes:
  elasticsearch-data:

# ...

No need for a specific Dockerfile for this: the standard image is pretty much ready to go. I made some config tweaks:

  • I have NFI what discovery.type=single-node does, other than what it sounds like it does. But it was on the docker run example on Docker Hub, so I ran with it.
  • network.host=0.0.0.0 makes Elasticsearch bind to all network interfaces inside the container rather than just loopback, so other containers and the published port on the host can reach it.
  • xpack.security.enabled=false means I don't need to pass credentials with my queries. This is not appropriate for anything other than dev, OK?
  • xpack.security.http.ssl.enabled=false means it'll work over http instead of https-only. This is OK if the client and server are on the same network (as they are with me), but not if they're across the public wire from each other.
  • We're querying on port 9200 (the HTTP API). 9300 is the transport port Elasticsearch nodes use to talk to each other, so a single-node dev setup doesn't strictly need it published; it's there because it's in the docker run example on Docker Hub.
  • I'm putting its data on an "external" volume so that it persists across container rebuilds. On a serious system this would map to a file system location, but a Docker Volume is fine for my purposes.
  • It's cool that the image has a healthcheck end point built in. And good on Copilot for knowing this.
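
To make that last bullet a bit more concrete: the _cluster/health end point returns JSON with a status of green, yellow or red. Here's a minimal sketch of interpreting that response in PHP. The function name isClusterUsable and the sample payload are made up for illustration (they're not in the repo), and treating yellow as "good enough" is an assumption that only really suits a single-node dev setup like this one.

```php
<?php

// Hypothetical helper: decides whether a _cluster/health response is good enough for dev.
function isClusterUsable(string $healthJson): bool
{
    $health = json_decode($healthJson, true);

    if (!is_array($health) || !isset($health['status'])) {
        return false; // not a health response we recognise
    }

    // Assumption: "yellow" is acceptable on a single node, where replica shards can't be assigned.
    return in_array($health['status'], ['green', 'yellow'], true);
}

// Trimmed-down, made-up sample payload; a real response has more fields than this.
$sample = '{"cluster_name":"dev-cluster","status":"yellow","number_of_nodes":1}';

var_dump(isClusterUsable($sample));             // bool(true)
var_dump(isClusterUsable('{"status":"red"}'));  // bool(false)
```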

I've created an integration test to make sure that works:

// tests/Integration/System/ElasticSearchTest.php

namespace App\Tests\Integration\System;

use App\Tests\Integration\Fixtures\ElasticsearchTestIndexManager;
use Elastic\Elasticsearch\Client;
use PHPUnit\Framework\TestCase;
use Elastic\Elasticsearch\ClientBuilder;

class ElasticSearchTest extends TestCase
{
    private Client $client;
    private string $id = 'test_id';
    private ElasticsearchTestIndexManager $indexManager;

    private const string INDEX = ElasticsearchTestIndexManager::INDEX;

    protected function setUp(): void
    {
        $address = sprintf(
            '%s:%s',
            getenv('ELASTICSEARCH_HOST'),
            getenv('ELASTICSEARCH_PORT')
        );

        $this->client = ClientBuilder::create()
            ->setHosts([$address])
            ->build();

        $this->indexManager = new ElasticsearchTestIndexManager($this->client);
        $this->indexManager->ensureIndexExists();
    }

    protected function tearDown(): void
    {
        $this->indexManager->removeIndexIfExists();
    }

    public function testWriteAndReadDocument(): void
    {
        $doc = ['foo' => 'bar', 'baz' => 42];

        try {
            $this->client->index([
                'index' => self::INDEX,
                'id'    => $this->id,
                'body'  => $doc
            ]);

            $response = $this->client->get([
                'index' => self::INDEX,
                'id'    => $this->id
            ]);

            $this->assertEquals($doc, $response['_source']);
        } finally {
            $this->client->delete([
                'index' => self::INDEX,
                'id'    => $this->id
            ]);
        }
    }
}

The guts of this is that the test indexes (adds) a document to the Elasticsearch DB, reads it back, and then deletes it. I have factored-out some code from this as it's used in another test (qv):

// tests/Integration/Fixtures/ElasticsearchTestIndexManager.php

namespace App\Tests\Integration\Fixtures;

use Elastic\Elasticsearch\Client;

class ElasticsearchTestIndexManager
{
    public const string INDEX = 'test_index';

    private Client $client;

    public function __construct(Client $client)
    {
        $this->client = $client;
    }

    public function ensureIndexExists(): void
    {
        if (!$this->client->indices()->exists(['index' => self::INDEX])->asBool()) {
            $this->client->indices()->create(['index' => self::INDEX]);
        }
    }

    public function removeIndexIfExists(): void
    {
        if ($this->client->indices()->exists(['index' => self::INDEX])->asBool()) {
            $this->client->indices()->delete(['index' => self::INDEX]);
        }
    }
}

All pretty self-explanatory I reckon.

Oh I also needed to install a library for Elasticsearch support:

docker exec php composer require elasticsearch/elasticsearch:^9.0.0

I've also installed Elasticvue on my PC so I can query the data independently of my code.

I'm not going to call the Elasticsearch Client - using its bespoke syntax - directly in my code. That's bad separation of concerns. I'm going to implement an adapter to hide as much as possible from the app:

// src/Service/ElasticsearchAdapter.php

namespace App\Service;

use Elastic\Elasticsearch\Client;

class ElasticsearchAdapter
{
    private const string INDEX = 'search_index';
    private Client $client;

    public function __construct(Client $client)
    {
        $this->client = $client;
    }

    public function indexDocument(string $id, array $body): void
    {
        $this->client->index([
            'index' => self::INDEX,
            'id'    => $id,
            'body'  => $body,
        ]);
    }

    public function getDocument(string $id): array
    {
        $response = $this->client->get([
            'index' => self::INDEX,
            'id'    => $id,
        ]);
        return $response['_source'] ?? [];
    }

    public function deleteDocument(string $id): void
    {
        $this->client->delete([
            'index' => self::INDEX,
            'id'    => $id,
        ]);
    }

    public function searchByString(string $query): array
    {
        $body = [
            'query' => [
                'query_string' => [
                    'query' => $query,
                ],
            ],
        ];
        $response = $this->client->search([
            'index' => self::INDEX,
            'body'  => $body,
        ]);
        return $response['hits']['hits'] ?? [];
    }
}

This wraps those calls away, so the code wanting to "do stuff" with Elasticsearch doesn't need to know how. It's just "do this for me will you? kthxbye".
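
For reference, this is the request body that searchByString ends up sending. It's a small sketch that builds the same array outside the adapter and prints it as JSON. The function name buildQueryStringBody is one I've made up for the illustration; it's not in the repo.

```php
<?php

// Hypothetical stand-alone version of the array built inside searchByString().
function buildQueryStringBody(string $query): array
{
    return [
        'query' => [
            'query_string' => [
                'query' => $query,
            ],
        ],
    ];
}

echo json_encode(buildQueryStringBody('wootywoo')), PHP_EOL;
// {"query":{"query_string":{"query":"wootywoo"}}}
```

Worth knowing: query_string parses its input as Lucene query syntax, so something like fullName:wootywoo or woo* is treated as a field or wildcard query rather than as literal text.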

And I test this:

// tests/Integration/Service/ElasticsearchAdapterTest.php

namespace App\Tests\Integration\Service;

use App\Service\ElasticsearchAdapter;
use App\Tests\Integration\Fixtures\ElasticsearchTestIndexManager;
use Elastic\Elasticsearch\ClientBuilder;
use Elastic\Elasticsearch\Exception\ClientResponseException;
use PHPUnit\Framework\Attributes\TestDox;
use PHPUnit\Framework\TestCase;

class ElasticsearchAdapterTest extends TestCase
{
    private ElasticsearchAdapter $adapter;
    private string $id = 'test_id';
    private array $body = ['foo' => 'bar'];
    private ElasticsearchTestIndexManager $indexManager;

    protected function setUp(): void
    {
        $address = sprintf(
            '%s:%s',
            getenv('ELASTICSEARCH_HOST'),
            getenv('ELASTICSEARCH_PORT')
        );

        $client = ClientBuilder::create()
            ->setHosts([$address])
            ->build();
        $this->adapter = new ElasticsearchAdapter($client);

        $this->indexManager = new ElasticsearchTestIndexManager($client);
        $this->indexManager->ensureIndexExists();
    }

    protected function tearDown(): void
    {
        $this->indexManager->removeIndexIfExists();
    }

    #[TestDox('Indexes a document successfully')]
    public function testIndexDocument(): void
    {
        $this->adapter->indexDocument($this->id, $this->body);
        $result = $this->adapter->getDocument($this->id);
        $this->assertEquals($this->body, $result);
    }

    #[TestDox('Retrieves a document successfully')]
    public function testGetDocument(): void
    {
        $this->adapter->indexDocument($this->id, $this->body);
        $result = $this->adapter->getDocument($this->id);
        $this->assertEquals($this->body, $result);
    }

    #[TestDox('Deletes a document successfully')]
    public function testDeleteDocument(): void
    {
        $this->adapter->indexDocument($this->id, $this->body);
        $this->adapter->deleteDocument($this->id);

        $this->expectException(ClientResponseException::class);
        $this->adapter->getDocument($this->id);
    }
}

A coupla things to note here:

  • This test is hitting Elasticsearch; it'd probably be better to have this as a functional test and mock-out the Client.
  • Especially given it's covering much the same ground as the previous integration test, which is a true integration test. We don't need both.
  • I'm not testing searchByString in here. I probably should be, I forgot about it, and only noticed now.

Right: we can now talk to the Elasticsearch DB. Cool. Now we have to tell Symfony about it.

From a design perspective, I've decided that whenever Symfony calls Doctrine to write to an entity, I'll intercept that somehow, and also fire off a call to the ElasticsearchAdapter to perform the equivalent action. This is done via Doctrine Lifecycle Listeners. There's not much to them:

// src/EventListener/SearchIndexer.php

namespace App\EventListener;

use App\Entity\SyncableToElasticsearch;
use App\Service\ElasticsearchAdapter;
use Doctrine\Bundle\DoctrineBundle\Attribute\AsDoctrineListener;
use Doctrine\ORM\Event\PostPersistEventArgs;
use Doctrine\ORM\Event\PostUpdateEventArgs;
use Doctrine\ORM\Events;
use Symfony\Component\Routing\Generator\UrlGeneratorInterface;

#[AsDoctrineListener(event: Events::postPersist, priority: 500, connection: 'default')]
#[AsDoctrineListener(event: Events::postUpdate, priority: 500, connection: 'default')]
class SearchIndexer
{
    public function __construct(
        private ElasticsearchAdapter $adapter,
        private UrlGeneratorInterface $urlGenerator
    ) {}

    public function postPersist(PostPersistEventArgs $args): void
    {
        $this->sync($args->getObject());
    }

    public function postUpdate(PostUpdateEventArgs $args): void
    {
        $this->sync($args->getObject());
    }

    public function sync(object $entity): void
    {
        if (!$entity instanceof SyncableToElasticsearch) {
            return;
        }

        $doc = $entity->toElasticsearchDocument();
        $doc['body']['_meta'] = [
            'type' => $entity->getShortName(),
            'title' => $entity->getSearchTitle(),
            'url' => $this->urlGenerator->generate(
                $entity->getShortName() . '_view',
                ['id' => $entity->getId()]
            ),
        ];

        $this->adapter->indexDocument($doc['id'], $doc['body']);
    }
}

The docs are clear on this, but basically one tags a class as AsDoctrineListener, specifies which events to listen to, and which DB connection (this is all in the attributes on the class). Then the specified event handlers are called when the given Doctrine events occur. Really easy!

One shortcoming in the system is that one can either use Doctrine entity listeners, which are configured for a single entity; or one can use a Doctrine lifecycle listener - as I have here - which applies to all entities. There's no way of, like, sticking an attribute on an Entity and saying "listen to events on this please". This seems like a lost opportunity. I've solved this via the code in sync:

if (!$entity instanceof SyncableToElasticsearch) {
    return;
}

And I implement that interface on all the entities I want to index. This actually seems reasonable from a design perspective, as I ended up needing a couple methods on them to support the indexing operation anyhow.
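
Here's that filtering idea boiled down to something that runs on its own, with no Doctrine or Elasticsearch involved. It's only a sketch: the classes are cut-down stand-ins that borrow names from this article, and syncIfSearchable is a hypothetical function standing in for the listener's sync method.

```php
<?php

// Stand-ins for the real classes: only the names mirror the article.
interface SyncableToElasticsearch
{
    public function getSearchTitle(): string;
}

final class Student implements SyncableToElasticsearch
{
    public function getSearchTitle(): string
    {
        return 'Jane Example';
    }
}

// An entity with nothing to index, so it doesn't implement the interface.
final class Enrolment
{
}

// Hypothetical listener body: returns what would be indexed, or null if skipped.
function syncIfSearchable(object $entity): ?string
{
    if (!$entity instanceof SyncableToElasticsearch) {
        return null; // not searchable: ignore the event
    }

    return $entity->getSearchTitle();
}

var_dump(syncIfSearchable(new Student()));   // string(12) "Jane Example"
var_dump(syncIfSearchable(new Enrolment())); // NULL
```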

// src/Entity/SyncableToElasticsearch.php

namespace App\Entity;

interface SyncableToElasticsearch
{
    public function toElasticsearchDocument(): array;
    public function getSearchTitle(): string;
}

// src/Entity/AbstractSyncableToElasticsearch.php

namespace App\Entity;

abstract class AbstractSyncableToElasticsearch implements SyncableToElasticsearch
{
    protected ?int $id;

    abstract public function toElasticsearchArray(): array;

    public function toElasticsearchDocument(): array
    {
        return [
            'id' => $this->getElasticsearchId(),
            'body' => $this->toElasticsearchArray(),
        ];
    }

    protected function getElasticsearchId(): string
    {
        $entityType = $this->getShortName();
        return $entityType . '_' . $this->id;
    }

    public function getShortName(): string
    {
        $parts = explode('\\', static::class);
        $entityType = strtolower(end($parts));

        return $entityType;
    }
}

I have this abstract class which all the SyncableToElasticsearch entities extend as there's a coupla bits of code they all need to run as part of making the document for Elasticsearch. I don't like using inheritance when I can avoid it, but it was either this or a trait, and I dislike traits even more than inheritance.
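
To show what those two helper methods produce, here's a sketch that lifts their logic into free functions so it can run without the entity classes. The names shortNameFromClass and elasticsearchId are hypothetical; the string handling is the same as in the abstract class above.

```php
<?php

// Hypothetical free-function version of getShortName().
function shortNameFromClass(string $fqcn): string
{
    $parts = explode('\\', $fqcn);

    return strtolower(end($parts));
}

// Hypothetical free-function version of getElasticsearchId().
function elasticsearchId(string $fqcn, int $id): string
{
    return shortNameFromClass($fqcn) . '_' . $id;
}

echo shortNameFromClass('App\\Entity\\Student'), PHP_EOL;  // student
echo elasticsearchId('App\\Entity\\Student', 42), PHP_EOL; // student_42
echo elasticsearchId('App\\Entity\\Course', 42), PHP_EOL;  // course_42
```

Prefixing the ID with the entity type matters because everything goes into the one index: a Student and a Course with the same numeric ID would otherwise overwrite each other's documents.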

And here's the relevant bits of one of the entities:

// src/Entity/Student.php

namespace App\Entity;

// ...

#[ORM\Entity(repositoryClass: StudentRepository::class)]
class Student extends AbstractSyncableToElasticsearch
{
    // ...

    public function toElasticsearchArray(): array
    {
        return [
            'email' => $this->email,
            'fullName' => $this->fullName,
            'dateOfBirth' => $this->dateOfBirth?->format('Y-m-d'),
            'gender' => $this->gender,
            'enrolmentYear' => $this->enrolmentYear,
            'status' => $this->status?->label(),
        ];
    }

    public function getSearchTitle(): string
    {
        return $this->fullName;
    }
}

Each entity knows how to derive what it needs to give to Elasticsearch, so this is the best place for these.

If we come back to SearchIndexer::sync now, we can see how all this is used:

$doc = $entity->toElasticsearchDocument();
$doc['body']['_meta'] = [
    'type' => $entity->getShortName(),
    'title' => $entity->getSearchTitle(),
    'url' => $this->urlGenerator->generate(
        $entity->getShortName() . '_view',
        ['id' => $entity->getId()]
    ),
];

$this->adapter->indexDocument($doc['id'], $doc['body']);

So for some reason the given entity has been updated in the DB, the postUpdate event has been fired and intercepted and the handler runs; and if it's an entity that's in the search index: we create a document for the search index and fire it off to Elasticsearch. Done. That's it. I mean literally, that's the end of the exercise. All the rest of this article is various support stuff I wrote to load data (into the DB and into Elasticsearch), have a UI for it, etc.
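
The point of that _meta block is that a search hit carries everything needed to render a result link without going back to the DB. Here's a rough sketch of how hits might be boiled down for a results page. The hitsToResults function, the sample hit and the URL in it are all made up for illustration; only the _meta keys (type, title, url) come from the code above.

```php
<?php

// Hypothetical helper: reduce raw search hits to what a results page needs.
function hitsToResults(array $hits): array
{
    $results = [];
    foreach ($hits as $hit) {
        $meta = $hit['_source']['_meta'] ?? null;
        if ($meta === null) {
            continue; // document wasn't written by the indexer: skip it
        }
        $results[] = [
            'type'  => $meta['type'],
            'title' => $meta['title'],
            'url'   => $meta['url'],
        ];
    }

    return $results;
}

// Made-up hit, in the general shape of an entry in the search response's hits array.
$hits = [[
    '_id' => 'student_42',
    '_source' => [
        'email' => 'jane@example.com',
        'fullName' => 'Jane Example',
        '_meta' => ['type' => 'student', 'title' => 'Jane Example', 'url' => '/students/42/view'],
    ],
]];

print_r(hitsToResults($hits));
```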

I needed a script to get all the data into Elasticsearch in the first place. Pretty easy (there's a lot of code, but it's mostly config / boilerplate):

// src/Command/ReindexSearchCommand.php

namespace App\Command;

use App\Entity\Course;
use App\Entity\Department;
use App\Entity\Instructor;
use App\Entity\Institution;
use App\Entity\Student;
use App\Entity\Assignment;
use App\EventListener\SearchIndexer;
use Doctrine\ORM\EntityManagerInterface;
use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Routing\Generator\UrlGeneratorInterface;

#[AsCommand(
    name: 'search:reindex',
    description: 'Reindex all or specific entities into Elasticsearch'
)]
class ReindexSearchCommand extends Command
{
    private const array ENTITY_MAP = [
        'assignment' => Assignment::class,
        'course' => Course::class,
        'department' => Department::class,
        'instructor' => Instructor::class,
        'institution' => Institution::class,
        'student' => Student::class,
    ];

    public function __construct(
        private readonly EntityManagerInterface $em,
        private readonly SearchIndexer $indexer
    ) {
        parent::__construct();
    }

    protected function configure(): void
    {
        $this->addArgument(
            'entity',
            InputArgument::REQUIRED,
            'Entity to reindex (all, student, course, department, instructor, institution, assignment)'
        );
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        $entityArg = strtolower($input->getArgument('entity'));

        if ($entityArg === 'all') {
            $entitiesToIndex = self::ENTITY_MAP;
        } elseif (isset(self::ENTITY_MAP[$entityArg])) {
            $entitiesToIndex = [$entityArg => self::ENTITY_MAP[$entityArg]];
        } else {
            $output->writeln('<error>Unknown entity: ' . $entityArg . '</error>');
            return Command::FAILURE;
        }

        foreach ($entitiesToIndex as $name => $class) {
            $output->writeln("Indexing $name...");
            $this->indexEntity($class, $output);
        }

        $output->writeln('<info>Reindexing complete.</info>');

        return Command::SUCCESS;
    }

    private function indexEntity(string $class, OutputInterface $output): void
    {
        $batchSize = 100;
        $repo = $this->em->getRepository($class);
        $qb = $repo->createQueryBuilder('e');
        $count = (int) $qb->select('COUNT(e.id)')->getQuery()->getSingleScalarResult();

        $output->writeln("  Found $count records.");

        for ($i = 0; $i < $count; $i += $batchSize) {
            $qb = $repo
                ->createQueryBuilder('e')
                ->orderBy('e.id', 'ASC') // stable ordering so batches neither overlap nor skip rows
                ->setFirstResult($i)
                ->setMaxResults($batchSize);

            $results = $qb->getQuery()->getResult();

            foreach ($results as $entity) {
                $this->indexer->sync($entity);
            }

            $this->em->clear();
            gc_collect_cycles();
        }
    }
}

The key bit is the indexEntity method. It loops over batches of entities and passes them to the indexer's sync method (that we looked at before). Done.
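
If the batching arithmetic isn't obvious, this sketch pulls the loop out into a function that just returns the offsets each batch query would start from. The batchOffsets name is hypothetical; the loop condition and step are the same as in indexEntity.

```php
<?php

// Hypothetical helper mirroring the for-loop's arithmetic in indexEntity().
function batchOffsets(int $count, int $batchSize): array
{
    $offsets = [];
    for ($i = 0; $i < $count; $i += $batchSize) {
        $offsets[] = $i;
    }

    return $offsets;
}

echo implode(', ', batchOffsets(250, 100)), PHP_EOL; // 0, 100, 200
echo count(batchOffsets(0, 100)), PHP_EOL;           // 0
```

So 250 records means three queries, the last of which only returns 50 rows; and an empty table means no queries at all.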

What's with the entities in that ENTITY_MAP?

private const array ENTITY_MAP = [
    'assignment' => Assignment::class,
    'course' => Course::class,
    'department' => Department::class,
    'instructor' => Instructor::class,
    'institution' => Institution::class,
    'student' => Student::class,
];

These are the entities in my stub app. Basically I've decided to represent elements of educational institutions:

NB: Enrolments do not have anything to index in the search, so they do not extend AbstractSyncableToElasticsearch. All they do is tie a Student to a Course.

I have built a crude UI to list, view and edit each entity, which one enters via http://localhost:8080/institutions, which will display something like:

Drilling down:

Hey I told you it was crude.

Oh and "wootywoo" is my go-to string to search for. In case you were wondering.

One can add Students via the Courses view page (http://localhost:8080/courses/{id}/view):

There is no capacity to delete entities in this UI. Mostly cos I forgot about it until just now. I can't see there being much more to it than hooking some code up to another event.

I'm not gonna go into the details of the UI, as it's only helper-code, and I'm gonna do some research into Symfony forms and that sort of malarky separately in a week or so. But there are some controllers and some forms and some templates if you want to look.

The last part of this is how I got the test data into the database (not Elasticsearch) in the first place. This is all done with Data Fixtures, and Factories (see the Symfony docs: DoctrineFixturesBundle). Again, I'm not going into this here as it's support code. But go have a look [shrug]. This helped me load a whole bunch of data into the system, based on these constraints:

// tests/Fixtures/FixtureLimits.php

namespace App\Tests\Fixtures;

final class FixtureLimits
{
    public const int INSTITUTIONS_MAX = 100;
    public const int DEPARTMENTS_MIN = 5;
    public const int DEPARTMENTS_MAX = 15;
    public const int INSTRUCTORS_MIN = 5;
    public const int INSTRUCTORS_MAX = 15;
    public const int STUDENTS_MIN = 10;
    public const int STUDENTS_MAX = 50;
    public const int COURSES_MIN = 5;
    public const int COURSES_MAX = 10;
    public const int ASSIGNMENTS_MIN = 1;
    public const int ASSIGNMENTS_MAX = 5;
}

And that's about it. I'm quite happy about how easy it was to get the Elasticsearch part of this done. It took way longer to write the data-loading code and the UI than it did to get the Elasticsearch stuff integrated.

Righto.

--
Adam


PS: I started feeling guilty about the lack of deletion support in this work, so I sorted it out. See Making sure this app also deletes from Elasticsearch when I delete an entity.