
Migrating a billion+ documents from Elasticsearch to AWS S3


Enterprises today need to monetize their data, or at the very least leverage the insights from the enormous amounts of data they hold to generate new income from new products. Building the infrastructure and platforms to gather, clean, prepare, process, standardize and expose that (big) data to heterogeneous stakeholders is a daunting task in an ecosystem that is vast and constantly changing. The open source communities keep churning out new components and frameworks, and cloud providers like AWS, Azure and GCP release new services that let the organizations adopting them focus on their business rather than on building and managing those services. This article focuses on data migration in the cloud and how flexible it can be using managed services.

Requirement

  • A billion+ JSON documents per environment (integration, performance, production) to be moved from Elasticsearch to AWS S3.
  • The JSON documents need to be flattened.
  • The output is to be stored in Parquet format.
  • There are no latency restrictions.

There are multiple solutions to this problem. Some of them are listed below:

Elasticsearch → Logstash → Apache Kafka → Apache Spark → S3

Elasticsearch → Logstash → Apache Kafka → Secor → S3

Kafka Connect could be used as another integration option instead of Spark or Secor; however, there is no officially supported Parquet sink converter/connector that can store data in S3.

Secor is an open-source component from Pinterest that runs as a service ingesting data from Kafka and provides out-of-the-box configuration options to tweak caching, data formats, partitioning and so on.

With both options, standing up and operating the infrastructure for Kafka, Spark and Secor costs money even if the migration is a one-off. Code has to be written to wire these components together, and that effort has its own cost. And, after all, the infrastructure needs to be torn down once it is no longer used.

With Apache Spark, the programmatic control, backed by its distributed processing power, is a boon: it gives fine-grained control over partitioning, ordering, caching and more. However, the developer effort involved, including CI, is unavoidable.

Researching some of the AWS services showed that this one-off activity could be done seamlessly and with little effort:

Elasticsearch → Logstash → AWS Kinesis → AWS Kinesis Data Firehose (Lambda, Glue) → S3

Logstash

Logstash is a product from Elastic that lets you stash data into and out of Elasticsearch. Its plugin-based programming model makes it easy to configure inputs and outputs, so you can move data to other streaming and messaging systems such as Kafka or Kinesis, to NoSQL databases, and so on. The configuration snippet is further down in this article.

Kinesis Firehose

Kinesis Data Firehose is a managed delivery stream that lets you capture and transform data from streams, convert it and store it in a destination. It has a source, a processing/transformation step and a destination. In this case the source is the Kinesis stream created earlier (into which the JSON data from Elasticsearch is ingested through Logstash). Firehose can batch, compress and encrypt data before storing it, and it covers the whole pipeline: Data source → Transform → Data conversion → Store.
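The pipeline in this article was set up through the AWS console, but it can just as well be scripted. Below is a minimal boto3 sketch of the same wiring; the stream names, ARNs, shard count and buffering values are placeholders for illustration, not the values used in the actual migration.

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    firehose = boto3.client("firehose", region_name="us-east-1")

    # Kinesis stream that Logstash writes the Elasticsearch documents into.
    kinesis.create_stream(StreamName="es-migration-stream", ShardCount=8)

    # Firehose delivery stream reading from that Kinesis stream and writing to S3.
    firehose.create_delivery_stream(
        DeliveryStreamName="es-migration-firehose",
        DeliveryStreamType="KinesisStreamAsSource",
        KinesisStreamSourceConfiguration={
            "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/es-migration-stream",
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-migration-role",
        },
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-migration-role",
            "BucketARN": "arn:aws:s3:::es-migration-bucket",
            "Prefix": "documents/",
            # Larger buffers mean fewer, bigger objects on S3.
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        },
    )

The Lambda processor and the Parquet conversion (covered in the next two sections) are attached to the delivery stream as additional configuration.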

Lambda (Transform)

Lambda functions are serverless, managed compute: you write just the code you want to execute without worrying about deployment, management, operations or provisioning of servers. The processing step in the Firehose pipeline, flattening the JSON documents, is run by a Lambda. Python, Node.js, Ruby and .NET are widely used runtimes, and Java is supported as well. Creating a Lambda from the console is simple too: choose the runtime (Python in our case), write the code (below) and fire away.
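The original snippet is not reproduced here, but a Firehose transformation Lambda of this kind generally follows the shape below: decode each record, flatten the nested JSON, re-encode it and return it with a result status. The underscore-joined flattening strategy is an assumption for illustration; the real field handling depends on the documents being migrated.

    import base64
    import json


    def flatten(doc, parent_key="", sep="_"):
        """Recursively flatten nested dicts into a single level of keys."""
        items = {}
        for key, value in doc.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                items.update(flatten(value, new_key, sep=sep))
            else:
                items[new_key] = value
        return items


    def lambda_handler(event, context):
        output = []
        for record in event["records"]:
            # Firehose hands each record to the Lambda base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            flattened = flatten(payload)
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(
                    (json.dumps(flattened) + "\n").encode("utf-8")
                ).decode("utf-8"),
            })
        return {"records": output}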

AWS Glue (Data conversion)

AWS Glue is an ETL service from AWS that includes metadata management (table definitions, schemas) in what is called the Data Catalog. The transformed data from the Lambda needs to be converted to Parquet format. Firehose supports out-of-the-box serde formats to convert to Parquet or ORC, but the conversion needs a schema to conform to. AWS Glue can be used to create that database and schema through the console.
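The database and table can also be created programmatically. A minimal boto3 sketch follows; the database, table and column names are purely illustrative, since the real schema depends on the flattened documents (the console also lets you fill in the remaining serde details Firehose needs for conversion).

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Database and table give Firehose the schema it converts records against.
    glue.create_database(DatabaseInput={"Name": "es_migration"})

    glue.create_table(
        DatabaseName="es_migration",
        TableInput={
            "Name": "documents",
            "StorageDescriptor": {
                # Columns must match the flattened field names emitted by the Lambda.
                "Columns": [
                    {"Name": "id", "Type": "string"},
                    {"Name": "created_at", "Type": "timestamp"},
                    {"Name": "payload_value", "Type": "double"},
                ],
                "Location": "s3://es-migration-bucket/documents/",
            },
        },
    )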

S3 (Store)

S3 (Simple Storage Service) is an object store with 11 9s of durability. Configuring it is as simple as specifying the bucket (root directory), the keys (directory-like paths) where the data has to be stored, buffering and so on. Now that the pipeline is created, it is time to write the Logstash configuration to move the data. There is no officially supported Logstash Kinesis output plugin, but there is an open-source plugin that can be used. A sample Logstash configuration using this plugin is below.
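A minimal sketch of such a configuration, assuming the community logstash-output-kinesis plugin (option names may differ by plugin version); the hosts, index and stream names are placeholders:

    input {
      elasticsearch {
        hosts   => ["http://es-host:9200"]
        index   => "documents-*"
        query   => '{ "query": { "match_all": {} } }'
        size    => 1000
        scroll  => "5m"
        docinfo => true
      }
    }

    output {
      kinesis {
        stream_name => "es-migration-stream"
        region      => "us-east-1"
      }
    }

The elasticsearch input scrolls through the whole index in batches, and docinfo keeps the document id available if it needs to be carried along into S3.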

Monitoring/Diagnosis

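Each managed service in this pipeline publishes operational metrics to CloudWatch: Kinesis exposes incoming and outgoing record counts, Firehose reports delivery successes and failures (and can redirect failed records to an error prefix in the destination bucket), and the Lambda's invocation logs land in CloudWatch Logs. As a quick illustration, a boto3 sketch that pulls the count of records Firehose delivered to S3 over the last hour (the delivery stream name is a placeholder):

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # How many records Firehose delivered to S3 in the last hour, in 5-minute buckets.
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Firehose",
        MetricName="DeliveryToS3.Records",
        Dimensions=[{"Name": "DeliveryStreamName", "Value": "es-migration-firehose"}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Sum"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Sum"])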

Performance/Throughput optimizations

Here are some parameters to consider tweaking to achieve overall throughput or performance efficiency. That is another article!

  • Logstash: pipeline workers, batch sizes, JVM memory.
  • Kinesis Streams: shard count.
  • Lambda: reducing the processing time is key; the maximum number of concurrent Lambda executions equals the shard count.
  • Firehose: buffer sizes and timeouts determine the file sizes on S3.

Instantiating the services took me roughly a day. Ready to migrate!

#aws#cloud data migration#elastic search
