Monday, October 23, 2023

SmugMug’s durable search pipelines for Amazon OpenSearch Service

SmugMug operates two very large online photo platforms, SmugMug and Flickr, enabling more than 100 million customers to safely store, search, share, and sell tens of billions of photos. Customers uploading and searching through decades of photos helped turn search into critical infrastructure. Search has grown steadily since SmugMug first used Amazon CloudSearch in 2012, followed by Amazon OpenSearch Service since 2018, after reaching billions of documents and terabytes of search storage.

Here, Lee Shepherd, SmugMug Staff Engineer, shares SmugMug’s search architecture used to publish, backfill, and mirror live traffic to multiple clusters. SmugMug uses these pipelines to benchmark, validate, and migrate to new configurations, including the move from i3.2xlarge to Graviton-based r6gd.2xlarge instances, along with testing Amazon OpenSearch Serverless. We cover three pipelines used for publishing, backfilling, and querying without introducing spiky, unrealistic traffic patterns, and without any impact on production services.

There are two main architectural pieces critical to the process:

  • A durable source of truth for index data. It’s best practice and part of our backup strategy to have a durable store beyond the OpenSearch index, and Amazon DynamoDB provides scalability and integration with AWS Lambda that simplifies a lot of the process. We use DynamoDB for other non-search services, so this was a natural fit.
  • A Lambda function for publishing data from the source of truth into OpenSearch. Using function aliases helps run multiple configurations of the same Lambda function at the same time and is key to keeping data in sync.


The publishing pipeline is driven by events like a user entering keywords or captions, new uploads, or label detection through Amazon Rekognition. These events are processed, combining data from a few other asset stores like Amazon Aurora MySQL-Compatible Edition and Amazon Simple Storage Service (Amazon S3), before writing a single item into DynamoDB.

Writing to DynamoDB invokes a Lambda publishing function, through the DynamoDB Streams Kinesis Adapter, that takes a batch of updated items from DynamoDB and indexes them into OpenSearch. There are other benefits to using the DynamoDB Streams Kinesis Adapter, such as reducing the number of concurrent Lambdas required.

The publishing Lambda function uses environment variables to determine what OpenSearch domain and index to publish to. A production alias is configured to write to the production OpenSearch domain, triggered from the DynamoDB table or Kinesis stream.

When testing new configurations or migrating, a migration alias is configured to write to the new OpenSearch domain but use the same trigger as the production alias. This enables dual indexing of data to both OpenSearch Service domains simultaneously.
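The alias setup can be pictured as the same handler code reading a different environment per alias. The sketch below is a minimal illustration, not SmugMug's implementation: the variable names OPENSEARCH_ENDPOINT and OPENSEARCH_INDEX and the example URLs are assumptions, since the article does not name the real ones.

```javascript
// Hypothetical variable names; the article does not name the real ones.
function resolveTarget(env) {
  const endpoint = env.OPENSEARCH_ENDPOINT;
  const index = env.OPENSEARCH_INDEX;
  if (!endpoint || !index) {
    throw new Error('OPENSEARCH_ENDPOINT and OPENSEARCH_INDEX must both be set');
  }
  return { endpoint, index };
}

// Illustrative values only: one environment behind each alias.
const productionEnv = {
  OPENSEARCH_ENDPOINT: 'https://prod-search.example.com',
  OPENSEARCH_INDEX: 'photos',
};
const migrationEnv = {
  OPENSEARCH_ENDPOINT: 'https://new-search.example.com',
  OPENSEARCH_INDEX: 'photos-v2',
};

// Inside a real handler this would be resolveTarget(process.env).
const productionTarget = resolveTarget(productionEnv);
const migrationTarget = resolveTarget(migrationEnv);
```

Because both aliases share one trigger, every batch of changes reaches both targets, and the handler itself never needs to know which domain is "production".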

Here’s an example of the DynamoDB table schema:

{
 "Id": 123456,  // partition key
 "Fields": {
  "format": "JPG",
  "height": 1024,
  "width": 1536,
 },
 "LastUpdated": 1600107934,
}

The ‘LastUpdated’ value is used as the document version when indexing, allowing OpenSearch to reject any out-of-order updates.
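As a rough illustration of how that versioning could look, the sketch below builds a bulk request body in which each action carries ‘LastUpdated’ as an external version. The item shape follows the schema example above; the function name and the index name passed to it are assumptions, and this is not necessarily how SmugMug's Lambda function is written.

```javascript
// Build an NDJSON _bulk body that uses LastUpdated as an external version.
// With version_type "external", a write is rejected unless its version is
// greater than the version already stored for that document.
function buildBulkBody(items, index) {
  const lines = [];
  for (const item of items) {
    // Action line: which document to index, and at what version.
    lines.push(JSON.stringify({
      index: {
        _index: index,
        _id: String(item.Id),
        version: item.LastUpdated,
        version_type: 'external',
      },
    }));
    // Source line: the document body itself.
    lines.push(JSON.stringify(item.Fields));
  }
  // The bulk API expects a trailing newline after the last line.
  return lines.join('\n') + '\n';
}
```

With this scheme, a version conflict on an individual bulk item usually means a newer copy is already indexed, so a publisher can reasonably treat it as a no-op rather than a failure.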


Now that changes are being published to both domains, the new domain (index) needs to be backfilled with historical data. To backfill a newly created index, a combination of Amazon Simple Queue Service (Amazon SQS) and DynamoDB is used. A script populates an SQS queue with messages that contain instructions for parallel scanning a segment of the DynamoDB table.

The SQS queue launches a Lambda function that reads the message instructions, fetches a batch of items from the corresponding segment of the DynamoDB table, and writes them into an OpenSearch index. New messages are written to the SQS queue to keep track of progress through the segment. After the segment completes, no more messages are written to the SQS queue and the process stops itself.

Concurrency is determined by the number of segments, with additional controls provided by Lambda concurrency scaling. SmugMug is able to index more than 1 billion documents per hour on their OpenSearch configuration while incurring zero impact to the production domain.

A NodeJS AWS-SDK based script is used to seed the SQS queue. Here’s a snippet of the SQS configuration script’s options:

Usage: queue_segments [options]

--search-endpoint <url>  OpenSearch endpoint url
--sqs-url <url>          SQS queue url
--index <string>         OpenSearch index name
--table <string>         DynamoDB table name
--key-name <string>      DynamoDB table partition key name
--segments <int>         Number of parallel segments

Along with the format of the resulting SQS message:

  searchEndpoint: opts.searchEndpoint,
  sqsUrl: opts.sqsUrl,
  table: opts.table,
  keyName: opts.keyName,
  index: opts.index,
  segment: i,
  totalSegments: opts.segments,
  exclusiveStartKey: <lastEvaluatedKey from previous iteration>

As each segment is processed, the ‘lastEvaluatedKey’ from the previous iteration is added to the message as the ‘exclusiveStartKey’ for the next iteration.
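The self-terminating loop described above can be sketched with three small functions. This is an illustrative outline, not SmugMug's script: the function names are hypothetical, and the actual DynamoDB and SQS calls are left out so that only the message handling is shown. The Scan parameter names (TableName, Segment, TotalSegments, ExclusiveStartKey) are the ones DynamoDB's Scan operation uses for parallel scans.

```javascript
// Seed one message per segment, following the message format shown above.
function buildSegmentMessages(opts) {
  const messages = [];
  for (let i = 0; i < opts.segments; i++) {
    messages.push({
      searchEndpoint: opts.searchEndpoint,
      sqsUrl: opts.sqsUrl,
      table: opts.table,
      keyName: opts.keyName,
      index: opts.index,
      segment: i, // segments are zero-based
      totalSegments: opts.segments,
    });
  }
  return messages;
}

// Translate a message into DynamoDB Scan parameters for its segment.
function scanParams(message) {
  const params = {
    TableName: message.table,
    Segment: message.segment,
    TotalSegments: message.totalSegments,
  };
  // Only resume from a key if a previous iteration left one behind.
  if (message.exclusiveStartKey) {
    params.ExclusiveStartKey = message.exclusiveStartKey;
  }
  return params;
}

// Return the follow-up message, or null when the segment is finished.
// DynamoDB omits LastEvaluatedKey once a segment has been fully scanned.
function nextMessage(message, lastEvaluatedKey) {
  if (!lastEvaluatedKey) return null;
  return { ...message, exclusiveStartKey: lastEvaluatedKey };
}
```

A handler built on these would scan with `scanParams(message)`, index the returned items, and enqueue `nextMessage(message, result.LastEvaluatedKey)` only when it is not null, which is what lets the backfill stop on its own.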


Last, our mirrored search queries run by sending each OpenSearch query to an SQS queue, in addition to our production domain. The SQS queue launches a Lambda function that replays the query to the replica domain. The search results from these requests are not sent to any user, but they allow replicating production load on the OpenSearch service under test without impact to production systems or customers.
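A minimal sketch of the replay side might look like the following. The message fields (`index`, `query`) and the injected `search` function are assumptions made for illustration; only the `event.Records[].body` shape comes from the standard SQS-to-Lambda event format.

```javascript
// Wrap a production search request so it can be queued for replay.
// Field names here are illustrative, not SmugMug's actual message format.
function buildMirrorMessage(index, query) {
  return JSON.stringify({ index, query });
}

// Replay queued queries against the domain under test and discard the hits.
// `search` is an injected (hypothetical) function that sends one request.
async function replayHandler(event, search) {
  let replayed = 0;
  for (const record of event.Records) {
    const { index, query } = JSON.parse(record.body);
    await search(index, query); // results are intentionally ignored
    replayed += 1;
  }
  return replayed;
}
```

Injecting `search` keeps the sketch independent of any particular OpenSearch client, and it makes the point that the replay path only generates load: nothing it returns flows back to a user.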


When evaluating a new OpenSearch domain or configuration, the main metrics we are interested in are query latency performance, namely the took latencies (latencies per time), and most importantly latencies for searching. In our move to Graviton R6gd, we saw about 40 percent lower P50-P99 latencies, along with similar gains in CPU usage compared to i3’s (ignoring Graviton’s lower costs). Another welcome benefit was the more predictable and monitorable JVM memory pressure with the garbage collection changes from the addition of G1GC on R6gd and other new instances.

Using this pipeline, we’re also testing OpenSearch Serverless and finding its best use cases. We’re excited about that service and fully intend to have an entirely serverless architecture in time. Stay tuned for results.

About the Authors

Lee Shepherd is a SmugMug Staff Software Engineer

Aydn Bekirov is an Amazon Web Services Principal Technical Account Manager


