How To Think Simple In Java
AI Agentic 101: Understanding Artificial Intelligence Agents
Low-Code Development
Low code, no code, citizen development, AI automation, scalability — if you work in the tech world, it's likely that you have been encouraged to use tools in at least one of these spaces. And it's for a good reason as Gartner has projected that by 2025, 70% of applications developed within organizations will have been built using low- and/or no-code technologies. So does the practice live up to the hype? Year over year, the answer is a resounding "yes" as the industry continues to evolve. Organizations have an increased demand for more frequent application releases and updates, and with that comes the need for increased efficiencies. And this is where low-code and no-code development practices shine. Sprinkle AI automation into low- and no-code development, and the scalability opportunities are endless. This Trend Report covers the evolving landscape of low- and no-code development by providing a technical exploration of integration techniques into current development processes, the role AI plays in relation to low- and no-code development, governance, intelligent automated testing, and adoption challenges. In addition to findings from our original research, technical experts from the DZone Community contributed articles addressing important topics in the low code space, including scalability, citizen development, process automation, and much more. To ensure that you, the developer, can focus on higher priorities, this Trend Report aims to provide all the tools needed to successfully leverage low code in your tech stack.
Data Pipeline Essentials
Open Source Migration Practices and Patterns
If you're a Java software developer and you weren't living on the planet Mars during these last years, then you certainly know what Quarkus is. And just in case you don't, you may find it out here. With Quarkus, the field of enterprise cloud-native applications development has never been so comfortable and it never took advantage of such a friendly and professional working environment. The Internet abounds with posts and articles explaining why and how Quarkus is a must for the enterprise, cloud-native software developer. And of course, CDK applications aren't on the sidelines: on the opposite, they can greatly take advantage of the Quarkus features to become smaller, faster, and more aligned with requirements nowadays. CDK With Quarkus Let's look at our first CDK with Quarkus example in the code repository. Go to the Maven module named cdk-quarkus and open the file pom.xml to see how to combine specific CDK and Quarkus dependencies and plugins. XML ... <dependency> <groupId>io.quarkus.platform</groupId> <artifactId>quarkus-bom</artifactId> <version>${quarkus.platform.version}</version> <type>pom</type> <scope>import</scope> </dependency> <dependency> <groupId>io.quarkiverse.amazonservices</groupId> <artifactId>quarkus-amazon-services-bom</artifactId> <version>${quarkus-amazon-services.version}</version> <type>pom</type> <scope>import</scope> </dependency> ... In addition to the aws-cdk-lib artifact which represents the CDK API library and is inherited from the parent Maven module, the dependencies above are required in order to develop CDK Quarkus applications. The first one, quarkus-bom, is the Bill of Material (BOM) which includes all the other required Quarkus artifacts. Here, we're using Quarkus 3.11 which is the most recent release as of this writing. The second one is the BOM of the Quarkus extensions required to interact with AWS services. Another mandatory requirement of Quarkus applications is the use of the quarkus-maven-plugin which is responsible for running the build and augmentation process. Let's recall that as opposed to more traditional frameworks like Spring or Jakarta EE where the application's initialization and configuration steps happen at the runtime, Quarkus performs them at build time, in a specific phase called "augmentation." Consequently, Quarkus doesn't rely on Java introspection and reflection, which is one of the reasons it is much faster than Spring, but needs to use the jandex-maven-plugin to build an index helping to discover annotated classes and beans in external modules. This is almost all as far as the Quarkus master POM is concerned. Let's look now at the CDK submodule. But first, we need to recall that, in order to synthesize and deploy a CDK application, we need a specific working environment defined by the cdk.json file. Hence, trying to use CDK commands in a project not having at its root this file will fail. One of the essential functions of the cdk.json file aims to define how to run the CDK application. By default, the cdk init app --language java command, used to scaffold the project's skeleton, will generate the following JSON statement: JSON ... "app": "mvn -e -q compile exec:java" ... This means that whenever we run a cdk deploy ... command, such that to synthesize a CloudFormation stack and deploy it, the maven-exec-plugin will be used to compile and package the code, before starting the associated main Java class. This is the most general case, the one of a classical Java CDK application. 
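For reference, a minimal, hypothetical sketch of such a classical CDK entry point, the kind of main class that the mvn -e -q compile exec:java command would end up running, could look like the following (the class and stack names are illustrative and not taken from the repository):

```java
import software.amazon.awscdk.App;
import software.amazon.awscdk.Stack;
import software.amazon.awscdk.StackProps;
import software.constructs.Construct;

// A plain CDK application: main() builds the App, instantiates the stack(s),
// and synthesizes the CloudFormation template.
public class ClassicCdkApp {

    public static void main(final String[] args) {
        App app = new App();
        new MyStack(app, "MyStack", StackProps.builder().build());
        app.synth();
    }

    // A do-nothing stack, just to make the sketch self-contained.
    static class MyStack extends Stack {
        MyStack(final Construct scope, final String id, final StackProps props) {
            super(scope, id, props);
            // AWS resources (constructs) would be declared here.
        }
    }
}
```

In this classical setup, everything, from building the App to synthesizing the stack, happens inside main(), which is exactly what the Quarkus variant below will restructure.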
But to run a Quarkus application, we need to observe some special conditions. Quarkus packages an application as either a fast or a thin JAR and, if you aren't familiar with these terms, please don't hesitate to consult the documentation which explains them in detail. What interests us here is the fact that, by default, a fast JAR will be generated, under the name of quarkus-run.jar in the target/quarkus-app directory. Unless we're using Quarkus extensions for AWS, in which case a thin JAR is generated, in the target/$finalName-runner.jar file, where $finalName is the value of the same element in pom.xml. In our case, we're using Quarkus extensions for AWS and, hence, a thin JAR will be created by the Maven build process. In order to run a Quarkus thin JAR, we need to manually modify the cdk.json file to replace the line above with the following one: JSON ... "app": "java -jar target/quarkus-app/quarkus-run.jar" ... The other important point to notice here is that, in general, a Quarkus application exposes a REST API whose endpoint is started by the command above. But in our case, the one of a CDK application, there isn't any REST API and, hence, this endpoint needs to be specified in a different way. Look at our main class in the cdk-quarkus-api-gateway module. Java @QuarkusMain public class CdkApiGatewayMain { public static void main(String... args) { Quarkus.run(CdkApiGatewayApp.class, args); } } Here, the @QuarkusMain annotation flags the subsequent class as the application's main endpoint and, further, using the io.quarkus.runtime.Quarkus.run() method will execute the mentioned class until it receives a signal like Ctrl-C, or until one of the exit methods of the same API is called. So, we just saw how the CDK Quarkus application is started and that, once started, it runs the CdkApiGatewayApp until it exits. This class is our CDK one, which implements the App and which we've already seen in the previous post. But this time it looks different, as you may see: Java @ApplicationScoped public class CdkApiGatewayApp implements QuarkusApplication { private CdkApiGatewayStack cdkApiGatewayStack; private App app; @Inject public CdkApiGatewayApp (App app, CdkApiGatewayStack cdkApiGatewayStack) { this.app = app; this.cdkApiGatewayStack = cdkApiGatewayStack; } @Override public int run(String... args) throws Exception { Tags.of(app).add("project", "API Gateway with Quarkus"); Tags.of(app).add("environment", "development"); Tags.of(app).add("application", "CdkApiGatewayApp"); cdkApiGatewayStack.initStack(); app.synth(); return 0; } } The first thing to notice is that this time, we're using the CDI (Contexts and Dependency Injection) implemented by Quarkus, also called ArC, which is a subset of the Jakarta CDI 4.1 specifications. It also has another particularity: it's a build-time CDI, as opposed to the runtime Jakarta EE one. The difference lies in the augmentation process, as explained previously. Another important point to observe is that the class implements the io.quarkus.runtime.QuarkusApplication interface, which allows it to customize and perform specific actions in the context bootstrapped by the CdkApiGatewayMain class. As a matter of fact, it isn't recommended to perform such operations directly in the CdkApiGatewayMain since, at that point, Quarkus isn't completely bootstrapped and started yet. We need to define our class as @ApplicationScoped so that it is instantiated only once.
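The App and the CdkApiGatewayStack that the constructor above receives have to be produced somewhere, and that is the job of a CDI producer. The following is only a rough, hypothetical sketch of what such a producer class might contain; the real CdkApiGatewayProducer in the repository may well look different:

```java
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.inject.Produces;
import software.amazon.awscdk.App;
import software.amazon.awscdk.StackProps;

// Hypothetical producer: exposes the CDK App and the StackProps as CDI beans
// so that CdkApiGatewayApp and CdkApiGatewayStack can simply inject them.
@ApplicationScoped
public class CdkApiGatewayProducer {

    @Produces
    @ApplicationScoped
    App app() {
        return new App();
    }

    @Produces
    @ApplicationScoped
    StackProps stackProps() {
        return StackProps.builder().build();
    }
}
```

With beans like these in place, constructor injection resolves naturally and the CDK plumbing can be swapped out in tests if needed.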
We also used constructor injection and took advantage of the producer pattern, as you may see in the CdkApiGatewayProducer class. We override the io.quarkus.runtime.QuarkusApplication.run() method such that to customize our App object by tagging it, as we already did in the previous example, and to invoke CdkApiGatewayStack, responsible to instantiate and initialize our CloudFormation stack. Last but not least, the app.synth() statement is synthesizing this stack and, once executed, our infrastructure, as defined by the CdkApiGatewayStack, should be deployed on the AWS cloud. Here is now the CdkApiGatewayStack class: Java @Singleton public class CdkApiGatewayStack extends Stack { @Inject LambdaWithBucketConstructConfig config; @ConfigProperty(name = "cdk.lambda-with-bucket-construct-id", defaultValue = "LambdaWithBucketConstructId") String lambdaWithBucketConstructId; @Inject public CdkApiGatewayStack(final App scope, final @ConfigProperty(name = "cdk.stack-id", defaultValue = "QuarkusApiGatewayStack") String stackId, final StackProps props) { super(scope, stackId, props); } public void initStack() { String functionUrl = new LambdaWithBucketConstruct(this, lambdaWithBucketConstructId, config).getFunctionUrl(); CfnOutput.Builder.create(this, "FunctionURLOutput").value(functionUrl).build(); } } This class has changed as well, compared to its previous release. It's a singleton that uses the concept of construct, which was introduced formerly. As a matter of fact, instead of defining the stack structure here, in this class, as we did before, we do it by encapsulating the stack's elements together with their configuration in a construct that facilitates easily assembled cloud applications. In our project, this construct is a part of a separate module, named cdk-simple-construct, such that we could reuse it repeatedly and increase the application's modularity. Java public class LambdaWithBucketConstruct extends Construct { private FunctionUrl functionUrl; public LambdaWithBucketConstruct(final Construct scope, final String id, LambdaWithBucketConstructConfig config) { super(scope, id); Role role = Role.Builder.create(this, config.functionProps().id() + "-role") .assumedBy(new ServicePrincipal("lambda.amazonaws.com")).build(); role.addManagedPolicy(ManagedPolicy.fromAwsManagedPolicyName("AmazonS3FullAccess")); role.addManagedPolicy(ManagedPolicy.fromAwsManagedPolicyName("CloudWatchFullAccess")); IFunction function = Function.Builder.create(this, config.functionProps().id()) .runtime(Runtime.JAVA_21) .role(role) .handler(config.functionProps().handler()) .memorySize(config.functionProps().ram()) .timeout(Duration.seconds(config.functionProps().timeout())) .functionName(config.functionProps().function()) .code(Code.fromAsset((String) this.getNode().tryGetContext("zip"))) .build(); functionUrl = function.addFunctionUrl(FunctionUrlOptions.builder().authType(FunctionUrlAuthType.NONE).build()); new Bucket(this, config.bucketProps().bucketId(), BucketProps.builder().bucketName(config.bucketProps().bucketName()).build()); } public String getFunctionUrl() { return functionUrl.getUrl(); } } This is our construct which encapsulates our stack elements: a Lambda function with its associated IAM role and an S3 bucket. 
As you can see, it extends the software.constructs.Construct class and its constructor, in addition to the standard scope and id parameters, takes a configuration object named LambdaWithBucketConstructConfig which defines, among others, properties related to the Lambda function and the S3 bucket belonging to the stack. Please notice that the Lambda function needs the IAM-managed policy AmazonS3FullAccess in order to read, write, delete, etc. to/from the associated S3 bucket. And since, for tracing purposes, we need to log messages to the CloudWatch service, the IAM-managed policy CloudWatchFullAccess is required as well. These two policies are associated with a role whose naming convention consists of appending the suffix -role to the Lambda function name. Once this role is created, it will be attached to the Lambda function. As for the Lambda function body, please notice how it is created from an asset dynamically extracted from the deployment context. We'll come back in a few moments with more details concerning this point. Last but not least, please notice how, after the Lambda function is created, a URL is attached to it and cached such that it can be retrieved by the consumer. This way we completely decouple the infrastructure logic (i.e., the Lambda function itself) from the business logic; i.e., the Java code executed by the Lambda function, in our case, a REST API implemented as a Quarkus JAX-RS (RESTEasy) endpoint, acting as a proxy for the API Gateway exposed by AWS. Coming back to the CdkApiGatewayStack class, we can see how, using the Quarkus CDI implementation, we inject the configuration object LambdaWithBucketConstructConfig declared externally, as well as how we use the Eclipse MicroProfile Configuration to define its ID. Once the LambdaWithBucketConstruct is instantiated, the only thing left to do is to display the Lambda function URL so that we can call it with different consumers, whether JUnit integration tests, the curl utility, or Postman.
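To give an idea of what this externally declared configuration object could look like, here is a hedged sketch using Quarkus' @ConfigMapping; the accessor names follow the calls made by the construct above, but the actual interface, prefix, and property names in the repository may differ:

```java
import io.smallrye.config.ConfigMapping;

// Hypothetical configuration mapping for the construct: the nested interfaces
// mirror the config.functionProps().* and config.bucketProps().* calls used by
// LambdaWithBucketConstruct. Prefix and property names are illustrative only.
@ConfigMapping(prefix = "app")
public interface LambdaWithBucketConstructConfig {

    FunctionProps functionProps();
    BucketProps bucketProps();

    interface FunctionProps {
        String id();        // logical ID of the Lambda function construct
        String function();  // function name
        String handler();   // fully qualified handler
        int ram();          // memory size in MB
        int timeout();      // timeout in seconds
    }

    interface BucketProps {
        String bucketId();    // logical ID of the S3 bucket construct
        String bucketName();  // physical bucket name
    }
}
```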
We have just seen the whole mechanics which allows us to decouple the two fundamental CDK building blocks, App and Stack. We have also seen how to abstract the Stack building block by making it an external module which, once compiled and built as a standalone artifact, can simply be injected wherever needed. Additionally, we have seen that the code executed by the Lambda function in our stack can be plugged in as well, by providing it as an asset, in the form of a ZIP file, for example, stored in the CDK deployment context. This code, too, is an external module, named quarkus-api, and consists of a REST API having a couple of endpoints allowing us to get some information, like the host IP address or the S3 bucket's associated attributes. It's interesting to notice how Quarkus takes advantage of the Qute templates to render HTML pages. For example, the following endpoint displays the attributes of the S3 bucket that has been created as a part of the stack. Java ... @Inject Template s3Info; @Inject S3Client s3; ... @GET @Path("info/{bucketName}") @Produces(MediaType.TEXT_HTML) public TemplateInstance getBucketInfo(@PathParam("bucketName") String bucketName) { Bucket bucket = s3.listBuckets().buckets().stream().filter(b -> b.name().equals(bucketName)) .findFirst().orElseThrow(); TemplateInstance templateInstance = s3Info.data("bucketName", bucketName, "awsRegionName", s3.getBucketLocation(GetBucketLocationRequest.builder().bucket(bucketName).build()) .locationConstraintAsString(), "arn", String.format(S3_FMT, bucketName), "creationDate", LocalDateTime.ofInstant(bucket.creationDate(), ZoneId.systemDefault()), "versioning", s3.getBucketVersioning(GetBucketVersioningRequest.builder().bucket(bucketName).build())); return templateInstance.data("tags", s3.getBucketTagging(GetBucketTaggingRequest.builder().bucket(bucketName).build()).tagSet()); } This endpoint returns a TemplateInstance whose structure is defined in the file src/main/resources/templates/s3info.html and which is filled with data retrieved by querying the S3 bucket in our stack, using the S3Client class provided by the AWS SDK. A couple of integration tests are provided, and they take advantage of the Quarkus integration with AWS, thanks to which it is possible to run local cloud services using Testcontainers and LocalStack. In order to run them, proceed as follows: Shell $ git clone https://github.com/nicolasduminil/cdk $ cd cdk/cdk-quarkus/quarkus-api $ mvn verify Running the sequence of commands above will produce a quite verbose output and, at the end, you'll see something like this: Shell [INFO] [INFO] Results: [INFO] [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0 [INFO] [INFO] [INFO] --- failsafe:3.2.5:verify (default) @ quarkus-api --- [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 22.344 s [INFO] Finished at: 2024-07-04T17:18:47+02:00 [INFO] ------------------------------------------------------------------------ That's not a big deal - just a couple of integration tests executed against LocalStack running in Testcontainers to make sure that everything works as expected. But if you want to test against real AWS services, provided that you fulfill the requirements, then you should proceed as follows: Shell $ git clone https://github.com/nicolasduminil/cdk $ cd cdk $ ./deploy.sh cdk-quarkus/cdk-quarkus-api-gateway cdk-quarkus/quarkus-api/ Running the script deploy.sh with the parameters shown above will synthesize and deploy your stack. These two parameters are: The CDK application module name: This is the name of the Maven module where your cdk.json file is. The REST API module name: This is the name of the Maven module where the function.zip file is. If you look at the deploy.sh file, you'll see the following: Shell ...cdk deploy --all --context zip=~/cdk/$API_MODULE_NAME/target/function.zip... This command deploys the CDK app after having set the function.zip location in the zip context variable. Do you remember that the Lambda function has been created in the stack (LambdaWithBucketConstruct class) like this? Java IFunction function = Function.Builder.create(this, config.functionProps().id()) ... .code(Code.fromAsset((String) this.getNode().tryGetContext("zip"))) .build(); The statement above gets the asset stored in the deployment context under the zip context variable and uses it as the code that will be executed by the Lambda function.
The output of the deploy.sh file execution (quite verbose as well) will finish by displaying the Lambda function URL: Shell ... Outputs: QuarkusApiGatewayStack.FunctionURLOutput = https://...lambda-url.eu-west-3.on.aws/ Stack ARN: arn:aws:cloudformation:eu-west-3:...:stack/QuarkusApiGatewayStack/... ... Now, in order to test your stack, you may point your preferred browser at https://<generated>.lambda-url.eu-west-3.on.aws/s3/info/my-bucket-8701 and you should see the HTML page rendered by the Qute template, displaying the bucket's attributes. Conclusion Your test is successful, and you now know how to use CDK constructs to create standalone infrastructure modules and assemble them into AWS CloudFormation stacks. But there is more, so stay tuned!
Architecture is often celebrated as a fine art, particularly when a building's aesthetic features stand out. Yet, a beautiful design alone does not guarantee functionality. Architectural design requires a blend of technical precision and artistic vision. The form of a building should directly serve its intended function, illustrating the principle that form should follow function. For example, the Royal Ontario Museum in Toronto, despite its striking appearance, has been criticized as one of the 'worst examples of architecture during the 2000s' due to its impractical interior spaces characterized by awkward corners and slanted walls that compromise usability. Similarly, in software development, while users may admire a well-designed interface, they often overlook the backend architecture that equally influences their experience. Software architects face myriad design decisions that affect an application's performance, scalability, and future adaptability. Their challenge isn't just to create something elegant but to design systems that are both effective and maintainable. This article highlights the often 'invisible' artistry of backend software engineers, acknowledging that creating performant, scalable, and maintainable software architectures is an art form in its own right. We Don’t Look at Software as a Creative Art Software development is often perceived more as a functional craft than a creative art, likely because of its roots in engineering. This background has historically aligned software development more with manufacturing principles — predictable, efficient, and repeatable processes — rather than the nuances of creative design. During the rapid expansion of computer usage in the 1960s and 1970s, businesses and governments needed reliable software quickly, leading to methodologies that mirrored factory assembly lines. This approach emphasized division of labor, where individual programmers focused on specific tasks, thereby limiting the scope for creativity in favor of functionality and bug-free code. However, this view neglects the inherent creativity and dynamism in software engineering. Consider the differences between manufacturing cars and developing software. In car manufacturing, processes are standardized with fixed objectives and requirements, focusing solely on output. You don’t see new requirements introduced mid-assembly line, new machinery being swapped in for the old during production, or users testing very early versions of the car directly on the factory floor. In contrast, software development is dynamic, with requirements, tools, and even end goals shifting significantly during the project lifecycle. What starts as a car might evolve into a pirate ship due to changing business needs, technological advancements, or user feedback. Creativity is essential in software development, influencing all aspects from user interface design to system architecture and complex problem-solving. Recent recognition of the creative aspects in fields like game development, web design, and user experience has begun to balance the narrative, showcasing that artistic design is as crucial as technical proficiency in developing software. We May Recognize a Beautiful UI Design, but What About a Great API Design? A design-oriented approach, which is prevalent in software development, centers around a user-centric problem-solving process. This methodology encourages developers to empathize with users, define problems, ideate solutions, prototype, and test, all of which require creativity. 
While the influence of this approach is readily apparent in user interfaces (UIs) due to their direct interaction with users and visible aesthetic elements, its application extends beyond the front end to other crucial elements like system architecture and API design. For example, APIs (Application Programming Interfaces) play a transformative role in enhancing internet user experiences by streamlining access to services and simplifying complex tasks. A well-crafted API enables seamless interactions between different systems, facilitating a more integrated and smooth user experience across various applications and services. Here are a couple of examples: Stripe API: Revolutionizing online payments, Stripe provides developers with a flexible, customizable, and straightforward API. It allows even small startups to implement sophisticated payment systems that are both robust and secure, significantly improving the e-commerce experience. OpenAI's GPT API: As one of the most popular APIs in recent years, it offers advanced natural language processing capabilities. Developers can integrate features such as chatbots, automated content generation, and more into their applications. This not only enhances user interactions but also creates more intuitive and human-like experiences in customer service and interactive applications. How Backend Design Decisions Shape the User Experience Backend design decisions critically shape the user experience, directly influencing the application loading time, how fast you can search the data, how seamless the user experience is in any given location, etc. In summary, these choices affect everything from the application's responsiveness to its stability across different environments. Key areas impacted by backend decisions include: Tech stack selection: The choice of technology—programming languages, databases, and server architecture—affects how quickly and efficiently data is processed and delivered to the user. Architectural style and patterns: Opting for specific architectural styles (e.g., microservices or serverless) impacts scalability and the ability to update features without disrupting user interactions. Third-party integrations: The integration of legacy systems and third-party services can enhance functionality but requires careful handling to maintain performance and ensure data consistency. Data structure design: How data is organized and managed (especially in distributed systems) directly impacts how quickly data can be accessed and manipulated, affecting everything from search functionality to transaction processing. These backend choices are often invisible to the end-user until an issue arises, and then they become glaringly apparent. Things like slow load times, downtime, or data inconsistency highlight how important robust system design is to ensure continuous user satisfaction. How Backend Design Decisions Shape the Org Backend design decisions extend beyond user interaction, fundamentally shaping an organization's ability to scale its product, grow its user base, and be innovative. The system design of software is not only a cornerstone for technical robustness but also crucial for economic sustainability over time. When designed poorly, systems accumulate architectural technical debt, which is often costlier and more complex to resolve than code-level technical debt, and can lead to severe operational disruptions.
Indeed, the major outages experienced by industry giants like AT&T, Meta, and AWS over the past 18 months illustrate the dire consequences of unchecked system complexity. Furthermore, backend flexibility plays a vital role in an organization's agility—its capacity to swiftly adapt to market changes and seize new opportunities. A great example of this is the explosive user growth experienced by the Cara app, a platform built by volunteers to allow artists to share their portfolios. Cara's backend, designed with serverless architecture, scaled remarkably to support a jump from 40,000 to 650,000 users in just one week. While this rapid scalability enabled Cara to capitalize on a market opportunity, it also led to a substantial, unexpected cost— a $98,000 Vercel bill after a few days of peak usage. This underscores that each architectural decision carries its own set of trade-offs and necessitates a delicate balance between addressing immediate needs and anticipating future demands. It highlights the creative challenge of aligning current requirements with potential future scenarios in a dynamic market landscape. Conclusion: The Unseen Artistry of Backend Engineers In software development, the artistry behind backend architecture is often invisible yet immensely impactful. As we continue to demystify the backend, it becomes clear that these engineers are not just coders; they are the modern-day architects of the digital world, crafting foundational structures that empower businesses and enhance user experiences. Embracing this perspective can transform how organizations value and execute their software initiatives, ensuring that both the visible and hidden elements of our digital solutions are crafted with artistry and foresight.
Are you curious what experienced practitioners are saying about AI and platform engineering — and its growing impact on development workflows? Look no further than DZone’s latest event with PlatformCon 2024 where our global software community answers these vital questions in an expert panel on all things platform engineering, AI, and beyond. What Developers Must Know About AI and Platform Engineering Moderated by DZone Core member and Director of Data and AI at Silk, Kellyn Pot’Vin-Gorman, panelists Ryan Murray, Sandra Borda, and Chiradeep Vittal discussed the most probing questions and deliberations facing AI and platform engineering today. Check out the panel discussion in its entirety here: Important questions and talking points discussed include: How has AI transformed the platform engineering landscape? Examples of how AI has improved developer productivity within organizations. What are some of the challenges you’ve faced when integrating AI into your development workflow, and how have those been addressed? What are some anti-patterns or caveats when integrating GenAI into engineering platforms and the SDLC more broadly? What are some practical steps or strategies for organizations looking to start incorporating AI into their platform engineering efforts? ….and more!
Hello, DZone Community! We have several surveys in progress as part of our research for upcoming Trend Reports. We would love for you to join us by sharing your experiences and insights (anonymously if you choose) — readers just like you drive the content that we cover in our Trend Reports. Check out the details for each research survey below. Over the coming months, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our Trend Reports. Security Research Security is everywhere; you can’t live with it, and you certainly can’t live without it! We are living in an entirely unprecedented world — one where bad actors are growing more sophisticated and are taking full advantage of the rapid advancements in AI. We will be exploring the most pressing security challenges and emerging strategies in this year’s survey for our August Enterprise Security Trend Report. Our 10-12-minute Enterprise Security Survey explores: Building a security-first organization Security architecture and design Key security strategies and techniques Cloud and software supply chain security At the end of the survey, you're also able to enter the prize drawing for a chance to receive one of two $175 (USD) e-gift cards! Join the Security Research Data Engineering Research As a continuation of our annual data-related research, we're consolidating our database, data pipeline, and data and analytics scopes into a single 12-minute survey that will help guide the narratives of our July Database Systems Trend Report and our data engineering report later in the year. Our 2024 Data Engineering Survey explores: Database types, languages, and use cases Distributed database design + architectures Data observability, security, and governance Data pipelines, real-time processing, and structured storage Vector data and databases + other AI-driven data capabilities Join the Data Engineering Research You'll also have the chance to enter the $500 raffle at the end of the survey — five random people will be drawn and will receive $100 each (USD)! Cloud and Kubernetes Research This year, we're combining our annual cloud native and Kubernetes research into one 10-minute survey that dives further into these topics as they relate to one another and intersect with security, observability, AI, and more. DZone's research will be informing these Trend Reports: May – Cloud Native: Championing Cloud Development Across the SDLC September – Kubernetes in the Enterprise Our 2024 Cloud Native Survey covers: Microservices, container orchestration, and tools/solutions Kubernetes use cases, pain points, and security measures Cloud infrastructure, costs, tech debt, and security threats AI for release management + monitoring/observability Join the Cloud Native Research Don't forget to enter the $750 raffle at the end of the survey! Five random people will be selected to each receive $150 (USD). Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help! —The DZone Publications team
This is a continuation of the article Flexible Data Generation With Datafaker Gen about DataFaker Gen. In this section, we will explore the new BigQuery Sink feature for Google Cloud Platform, demonstrating how to utilize different field types based on the DataFaker schema. BigQuery is a fully managed and AI-ready data analytics platform available on Google Cloud Platform that gives anyone the capability to analyze terabytes of data. Let's consider a scenario where we aim to create a dummy dataset aligned with our actual schema, to facilitate executing and testing queries in BigQuery. By using Datafaker Gen, this data can become meaningful and predictable, based on predefined providers, thus allowing for more realistic and reliable testing environments. This solution leverages the BigQuery API Client libraries provided by Google. For more details, refer to the official documentation here: BigQuery API Client Libraries. Quick Start With BigQuery Sink This is a simple example of the BigQuery Sink, just to show that only a few simple actions are required to see the result. This provides clarity on the approach. The other part of this article will cover detailed configuration and the flexibility of this feature. And so, three simple steps need to be done: 1. Download the project here, build it, and navigate to the folder with the BigQuery example: Shell ./mvnw clean verify && cd ./datafaker-gen-examples/datafaker-gen-bigquery 2. Configure the schema in config.yaml: YAML default_locale: en-US fields: - name: id generators: [ Number#randomNumber ] - name: lastname generators: [ Name#lastName ] nullRate: 0.1 - name: firstname locale: ja-JP generators: [ Name#firstName ] Configure the BigQuery Sink in output.yaml with the path to the Service Account JSON (which should be obtained from GCP): YAML sinks: bigquery: project_id: [gcp project name] dataset: datafaker table: users service_account: [path to service account json] 3. Run it: Shell # Format json, number of lines 10000 and new BigQuery Sink bin/datafaker_gen -f json -n 10000 -sink bigquery In-Depth Guide To Using BigQuery Sink To prepare a generator for BigQuery, follow these two steps: Define the DataFaker Schema: The schema defined in config.yaml will be reused for the BigQuery Sink. Configure the BigQuery Sink: In output.yaml, specify the connection credentials, connection properties, and generation parameters. Note: Currently, BigQuery Sink only supports the JSON format. If another format is used, the BigQuery Sink will throw an exception. At the same time, it might be a good opportunity to introduce other formats, such as protobuf. 1. Define the DataFaker Schema One of the most important preparation tasks is defining the schema in the config.yaml file. The schema specifies the field definitions of the record based on the Datafaker provider. It also allows for the definition of embedded fields like array and struct. Consider this example of a schema definition in the config.yaml file. The first step is to define the base locale that should be used for all fields. This should be done at the top of the file in the property default_locale. The locale for a specific field can be customized directly. YAML default_locale: en-US This schema defines the default locale as 'en-US' and lists the fields. Then all required fields should be defined in the fields section. Let’s fill in the details of the field definitions. Datafaker Gen supports three main field types: default, array, and struct.
Default Type This is a simple type that allows you to define the field name and how to generate its value using the generators property. Additionally, there are some optional parameters that allow for customization of the locale and the null rate. YAML default_locale: en-US fields: - name: id generators: [ Number#randomNumber ] - name: lastname generators: [ Name#lastName ] nullRate: 0.1 - name: firstname locale: ja-JP generators: [ Name#firstName ] name: Defines the field name. generators: Defines the Faker provider methods that generate the value. For BigQuery, based on the format provided by the Faker provider generators, it will generate JSON, which will be reused for BigQuery field types. In our example, Number#randomNumber returns a long value from the DataFaker provider, which is then converted to an integer for the BigQuery schema. Similarly, the fields Name#lastName and Name#firstName are Strings and convert to STRING in BigQuery. nullRate: Determines how often this field is missing or has a null value. locale: Defines a specific locale for the current field. Array Type This type allows the generation of a collection of values. It reuses the fields from the default type and extends them with two additional properties: minLength and maxLength. In BigQuery, this type corresponds to a field with the REPEATED mode. The following fields need to be configured in order to enable the array type: type: Specify the array type for this field. minLength: Specify the min length of the array. maxLength: Specify the max length of the array. All these properties are mandatory for the array type. YAML default_locale: en-US fields: - name: id generators: [ Number#randomNumber ] - name: lastname generators: [ Name#lastName ] nullRate: 0.1 - name: firstname generators: [ Name#firstName ] locale: ja-JP - name: phone numbers type: array minLength: 2 maxLength: 5 generators: [ PhoneNumber#phoneNumber, PhoneNumber#cellPhone ] It is also worth noting that the generators property can contain multiple sources of values, as for the phone numbers. Struct Type This type allows you to create a substructure that can contain many nested levels based on all existing types. In BigQuery, this type corresponds to the RECORD type. The struct type doesn’t have a generators property but has a new property called fields, where a substructure based on the default, array, or struct type can be defined. There are two main fields that need to be added for the struct type: type: Specify the struct type for this field. fields: Defines a list of fields in a sub-structure. YAML default_locale: en-US fields: - name: id generators: [ Number#randomNumber ] - name: lastname generators: [ Name#lastName ] nullRate: 0.1 - name: firstname generators: [ Name#firstName ] locale: ja-JP - name: phone numbers type: array minLength: 2 maxLength: 5 generators: [ PhoneNumber#phoneNumber, PhoneNumber#cellPhone ] - name: address type: struct fields: - name: country generators: [ Address#country ] - name: city generators: [ Address#city ] - name: street address generators: [ Address#streetAddress ]
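The generator references in the schema map directly onto Datafaker provider methods. If it helps to see the equivalence, the snippet below calls the same providers from plain Java; it is only an illustration of the providers referenced above, not part of Datafaker Gen itself:

```java
import java.util.Locale;
import net.datafaker.Faker;

// Calls the Datafaker providers referenced by the schema:
// Number#randomNumber, Name#lastName/firstName, PhoneNumber#..., Address#country.
public class ProvidersDemo {
    public static void main(String[] args) {
        Faker faker = new Faker(Locale.US);
        Faker jaFaker = new Faker(Locale.JAPAN); // mirrors the per-field ja-JP locale

        long id = faker.number().randomNumber();
        String lastName = faker.name().lastName();
        String firstName = jaFaker.name().firstName();
        String phone = faker.phoneNumber().phoneNumber();
        String cell = faker.phoneNumber().cellPhone();
        String country = faker.address().country();

        System.out.printf("%d %s %s %s %s %s%n",
                id, lastName, firstName, phone, cell, country);
    }
}
```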
2. Configure BigQuery Sink As previously mentioned, the configuration for sinks can be added in the output.yaml file. The BigQuery Sink configuration allows you to set up credentials, connection properties, and sink properties. Below is an example configuration for a BigQuery Sink: YAML sinks: bigquery: batchsize: 100 project_id: [gcp project name] dataset: datafaker table: users service_account: [path to service account json] create_table_if_not_exists: true max_outstanding_elements_count: 100 max_outstanding_request_bytes: 10000 keep_alive_time_in_seconds: 60 keep_alive_timeout_in_seconds: 60 Let's review the full list of configuration properties you can take advantage of: batchsize: Specifies the number of records to process in each batch. A smaller batch size can reduce memory usage but may increase the number of API calls. project_id: The Google Cloud Platform project ID where your BigQuery dataset resides. dataset: The name of the BigQuery dataset where the table is located. table: The name of the BigQuery table where the data will be inserted. Google Credentials should be configured with sufficient permissions to access and modify BigQuery datasets and tables. There are several ways to pass the service account content: service_account: The path to the JSON file containing the service account credentials. This configuration should be defined in the output.yaml file. SERVICE_ACCOUNT_SECRET: This environment variable should contain the JSON content of the service account. The final option involves using the gcloud configuration from your environment (more details can be found here). This option is implicit and could potentially lead to unpredictable behavior. create_table_if_not_exists: If set to true, the table will be created if it does not already exist. A BigQuery Schema will be created based on the DataFaker Schema. max_outstanding_elements_count: The maximum number of elements (records) allowed in the buffer before they are sent to BigQuery. max_outstanding_request_bytes: The maximum size, in bytes, of the requests allowed in the buffer before they are sent to BigQuery. keep_alive_time_in_seconds: The amount of time (in seconds) to keep the connection alive for additional requests. keep_alive_timeout_in_seconds: The amount of time (in seconds) to wait for additional requests before closing the connection due to inactivity. How to Run The BigQuery Sink example has been merged into the main upstream Datafaker Gen project, where it can be adapted for your use. Running this generator is easy and lightweight. However, it requires several preparation steps: 1. Download the GitHub repository. The datafaker-gen-examples folder includes the example with the BigQuery Sink that we will use. 2. Build the entire project with all modules. The current solution uses the 2.2.3-SNAPSHOT version of the DataFaker library. Shell ./mvnw clean verify 3. Navigate to the 'datafaker-gen-bigquery' folder. This should serve as the working directory for your run. Shell cd ./datafaker-gen-examples/datafaker-gen-bigquery 4. Define the schema for records in the config.yaml file and place this file in the appropriate location where the generator should be run. Additionally, define the sinks configuration in the output.yaml file, as demonstrated previously. Datafaker Gen can be executed through two options: 1. Use the bash script from the bin folder in the parent project: Shell # Format json, number of lines 10000 and new BigQuery Sink bin/datafaker_gen -f json -n 10000 -sink bigquery 2. Execute the JAR directly, like this: Shell java -cp [path_to_jar] net.datafaker.datafaker_gen.DatafakerGen -f json -n 10000 -sink bigquery Query Result and Outcome After applying all the necessary configurations and running the generator in my test environment, it would be nice to check the outcome.
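Before looking at the data in the console, one quick way to sanity-check the load programmatically is with the BigQuery Java client library. The sketch below simply counts the generated rows; it assumes the same default credentials, dataset, and table configured above:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

// Counts the rows that Datafaker Gen inserted into datafaker.users.
public class CountGeneratedRows {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();
        QueryJobConfiguration query = QueryJobConfiguration
                .newBuilder("SELECT COUNT(*) AS total FROM `datafaker.users`")
                .build();
        TableResult result = bigQuery.query(query);
        for (FieldValueList row : result.iterateAll()) {
            System.out.println("Rows in datafaker.users: " + row.get("total").getLongValue());
        }
    }
}
```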
This is the SQL query to retrieve the generated result: SQL SELECT id, lastname, firstname, `phone numbers`, address FROM `datafaker.users`; Here is the result of all our work (the result of the query): Only the first four records are shown here with all the fields defined above. It also makes sense to note that the phone numbers array field contains two or more values depending on the entries. The address structure field has three nested fields. Conclusion This newly added BigQuery Sink feature enables you to publish records to Google Cloud Platform efficiently. With the ability to generate and publish large volumes of realistic data, developers and data analysts can more effectively simulate the behavior of their applications and immediately start testing in real-world conditions. Your feedback allows us to evolve this project. Please feel free to leave a comment. The full source code is available here. I would like to thank Sergey Nuyanzin for reviewing this article. Thank you for reading! Glad to be of help.
Vector databases allow for efficient data storage and retrieval by storing them as points or vectors instead of traditional rows and columns. Two popular vector database options are pgVector extension for PostgreSQL and Amazon OpenSearch Service. This article compares the specifications, strengths, limitations, capabilities, and use cases for pgVector and OpenSearch to help inform decision-making when selecting the best-suited option for various needs. Introduction The rapid advancements in artificial intelligence (AI) and machine learning (ML) have necessitated the development of specialized databases that can efficiently store and retrieve high-dimensional data. Vector databases have emerged as a critical component in this landscape, enabling applications such as recommendation systems, image search, and natural language processing. This article compares two prominent vector database solutions, pgVector extension for PostgreSQL and Amazon OpenSearch Service, directly relevant to your roles as technical professionals, database administrators, and AI and ML practitioners. Technical Background Vector databases store data as vectors, enabling efficient similarity searches and other vector operations. pgVector enhances PostgreSQL's capabilities to handle vectors, while OpenSearch provides a comprehensive solution for storing and indexing vectors and metadata, supporting scalable AI applications. Problem Statement Choosing the proper vector database involves understanding the available options' specific requirements, performance characteristics, and integration capabilities. This article provides a practical and detailed comparison to assist in making an informed decision and instill confidence in the process. Methodology or Approach This analysis reviews current practices, case studies, and theoretical models to compare pgVector and OpenSearch comprehensively. It highlights critical differences in technical specifications, performance, and use cases, ensuring the audience feels well-informed. pgVector Extension for PostgreSQL pgVector is an open-source extension for PostgreSQL that enables storing and querying high-dimensional vectors. It supports various distance calculations and provides functionality for exact and approximate nearest-neighbor searches. Key features include: Vector storage: Supports vectors with up to 16,000 dimensions. Indexing: Supports indexing of vector data using IVFFlat for up to 2000 dimensions. Integration: Seamlessly integrates with PostgreSQL, leveraging its ACID compliance and other features. Amazon OpenSearch Service OpenSearch is an open-source, all-in-one vector database that supports flexible and scalable AI applications. Key features include: Scalability: Handles large volumes of data with distributed computing capabilities. Indexing: Supports various indexing methods, including HNSW and IVFFlat. Advanced features: Provides full-text search, security, and anomaly detection features. 
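To make the difference in programming models concrete before the comparison, here is a minimal sketch of what a pgVector similarity search looks like from Java over plain JDBC. The connection details, table name, and the three-dimensional vectors are illustrative only:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Stores a few embeddings in a pgVector column and runs a nearest-neighbor
// query using the L2 distance operator (<->).
public class PgVectorDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/demo", "demo", "demo")) {

            try (Statement st = conn.createStatement()) {
                st.execute("CREATE EXTENSION IF NOT EXISTS vector");
                st.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))");
                st.execute("INSERT INTO items (embedding) VALUES ('[1,1,1]'), ('[2,2,2]'), ('[1,1,2]')");
            }

            // Find the two vectors closest to [1,1,1].
            String sql = "SELECT id, embedding <-> ?::vector AS distance "
                       + "FROM items ORDER BY distance LIMIT 2";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "[1,1,1]");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("id=%d distance=%f%n",
                                rs.getLong("id"), rs.getDouble("distance"));
                    }
                }
            }
        }
    }
}
```

An OpenSearch k-NN query accomplishes the same goal through a JSON search request against a vector-mapped index rather than SQL, which is part of what the comparison below captures.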
Comparative Analysis Technical Specifications

| Capability | pgVector (PostgreSQL Extension) | Amazon OpenSearch |
| --- | --- | --- |
| Max Vector Dimensions | Up to 16,000 | Up to 16,000 (various indexing methods) |
| Distance Metrics | L2, Inner Product, Cosine | L1, L2, Inner Product, Cosine, L-infinity |
| Database Type | Relational | NoSQL |
| Performance | Optimized for vector operations | Variable; may not match pgVector for intensive vector operations |
| Memory Utilization | High control over memory settings | Limited granularity |
| CPU Utilization | More efficient | Higher CPU utilization |
| Fault Tolerance and Recovery | PostgreSQL mechanisms | Automated backups and recovery |
| Security | PostgreSQL features | Advanced security features |
| Distributed Computing Capabilities | Limited | Built for distributed computing |
| GPU Acceleration | Supported via libraries | Supported by FAISS and NMSLIB |
| Cost | Free with PostgreSQL | AWS infrastructure costs |
| Integration with Other Tools | PostgreSQL extensions and tools | AWS services and tools |

Performance pgVector is designed to optimize vector operations, offering several tuning options for performance improvement. In contrast, OpenSearch's performance can vary, particularly with complex queries or large data volumes. Strengths and Limitations pgVector Strengths Open-source and free Seamless integration with PostgreSQL Efficient handling of high-dimensional vectors Detailed tuning options for performance optimization pgVector Limitations Requires knowledge of PostgreSQL and SQL Limited to vector indexing Scalability depends on the PostgreSQL setup OpenSearch Strengths Highly scalable with distributed computing Versatile data type support Advanced features, including full-text search and security Integration with AWS services OpenSearch Limitations Steeper learning curve Variable performance for high-dimensional vectors Higher latency for complex queries Use Cases pgVector Use Cases E-commerce: Recommendation systems and similarity searches. Healthcare: Semantic search for medical records and genomics research. Finance: Anomaly detection and fraud detection. Biotechnology and genomics: Handling complex genetic data. Multimedia analysis: Similarity search for images, videos, and audio files. OpenSearch Use Cases Marketing: Customer behavior analysis. Cybersecurity: Anomaly detection in network events. Supply chain management: Inventory management. Healthcare: Patient data analysis and predictive modeling. Telecommunications: Network performance monitoring. Retail: Recommendation engines and inventory management. Semantic search: Contextually relevant search results. Multimedia analysis: Reverse image search and video recommendation systems. Audio search: Music recommendation systems and audio-based content discovery. Geospatial search: Optimized routing and property suggestions. Conclusion: Future Trends and Developments The field of vector databases is rapidly evolving, driven by the increasing demand for efficient storage and retrieval of high-dimensional data in AI and ML applications. Future developments may include improved scalability, enhanced performance, and new features to support advanced use cases. Understanding these trends can help you make informed decisions and plan for the future.
With Spring Boot 3.2 and Spring Framework 6.1, we get support for Coordinated Restore at Checkpoint (CRaC), a mechanism that enables Java applications to start up faster. With Spring Boot, we can use CRaC in a simplified way, known as Automatic Checkpoint/Restore at startup. Even though not as powerful as the standard way of using CRaC, this blog post will show an example where the Spring Boot applications startup time is decreased by 90%. The sample applications are from chapter 6 in my book on building microservices with Spring Boot. Overview The blog post is divided into the following sections: Introducing CRaC, benefits, and challenges Creating CRaC-based Docker images with a Dockerfile Trying out CRaC with automatic checkpoint/restore Summary Next blog post Let’s start learning about CRaC and its benefits and challenges. 1. Introducing CRaC, Benefits, and Challenges Coordinated Restore at Checkpoint (CRaC) is a feature in OpenJDK, initially developed by Azul, to enhance the startup performance of Java applications by allowing them to restore to a previously saved state quickly. CRaC enables Java applications to save their state at a specific point in time (checkpoint) and then restore from that state at a later time. This is particularly useful for scenarios where fast startup times are crucial, such as serverless environments, microservices, and, in general, applications that must be able to scale up their instances quickly and also support scale-to-zero when not being used. This introduction will first explain a bit about how CRaC works, then discuss some of the challenges and considerations associated with it, and finally, describe how Spring Boot 3.2 integrates with it. The introduction is divided into the following subsections: 1.1. How CRaC Works 1.2. Challenges and Considerations 1.3. Spring Boot 3.2 integration with CRaC 1.1. How CRaC Works Checkpoint Creation At a chosen point during the application’s execution, a checkpoint is created. This involves capturing the entire state of the Java application, including the heap, stack, and all active threads. The state is then serialized and saved to the file system. During the checkpoint process, the application is typically paused to ensure a consistent state is captured. This pause is coordinated to minimize disruption and ensure the application can resume correctly. Before taking the checkpoint, some requests are usually sent to the application to ensure that it is warmed up, i.e., all relevant classes are loaded, and the JVM HotSpot engine has had a chance to optimize the bytecode according to how it is being used in runtime. Commands to perform a checkpoint: Shell java -XX:CRaCCheckpointTo=<some-folder> -jar my_app.jar # Make calls to the app to warm up the JVM... jcmd my_app.jar JDK.checkpoint State Restoration When the application is started from the checkpoint, the previously saved state is deserialized from the file system and loaded back into memory. The application then continues execution from the exact point where the checkpoint was taken, bypassing the usual startup sequence. Command to restore from a checkpoint: Shell java -XX:CRaCRestoreFrom=<some-folder> Restoring from a checkpoint allows applications to skip the initial startup process, including class loading, warmup initialization, and other startup routines, significantly reducing startup times. For more information, see Azul’s documentation: What is CRaC? 1.2. 
Challenges and Considerations As with any new technology, CRaC comes with a new set of challenges and considerations: State Management Open files and connections to external resources, such as databases, must be closed before the checkpoint is taken. After the restore, they must be reopened. CRaC exposes a Java lifecycle interface that applications can use to handle this, org.crac.Resource, with the callback methods beforeCheckpoint and afterRestore. Sensitive Information Credentials and secrets stored in the JVM’s memory will be serialized into the files created by the checkpoint. Therefore, these files need to be protected. An alternative is to run the checkpoint command against a temporary environment that uses other credentials and replace the credentials on restore. Linux Dependency The checkpoint technique is based on a Linux feature called CRIU, “Checkpoint/Restore In Userspace”. This feature only works on Linux, so the easiest way to test CRaC on a Mac or a Windows PC is to package the application into a Linux Docker image. Linux Privileges Required CRIU requires special Linux privileges, which means that the Docker commands used to build Docker images and create Docker containers also require Linux privileges to be able to run. Storage Overhead Storing and managing checkpoint data requires additional storage resources, and the checkpoint size can impact the restoration time. The original jar file is also required to be able to restart a Java application from a checkpoint. I will describe how to handle these challenges in the section on creating Docker images. 1.3. Spring Boot 3.2 Integration With CRaC Spring Boot 3.2 (and the underlying Spring Framework) helps with the process of closing and reopening connections to external resources. Before the creation of the checkpoint, Spring stops all running beans, giving them a chance to close resources if needed. After a restore, the same beans are restarted, allowing beans to reopen connections to the resources. The only thing that needs to be added to a Spring Boot 3.2-based application is a dependency to the crac-library. Using Gradle, it looks like the following in the build.gradle file: Groovy dependencies { implementation 'org.crac:crac' } Note: The normal Spring Boot BOM mechanism takes care of versioning the crac dependency. The automatic closing and reopening of connections handled by Spring Boot usually works. Unfortunately, when this blog post was written, some Spring modules lacked this support. To track the state of CRaC support in the Spring ecosystem, a dedicated test project, Spring Lifecycle Smoke Tests, has been created. The current state can be found on the project’s status page. If required, an application can register callback methods to be called before a checkpoint and after a restore by implementing the above-mentioned Resource interface. The microservices used in this blog post have been extended to register callback methods to demonstrate how they can be used. The code looks like this: Java import org.crac.*; public class MyApplication implements Resource { public MyApplication() { Core.getGlobalContext().register(this); } @Override public void beforeCheckpoint(Context<? extends Resource> context) { LOG.info("CRaC's beforeCheckpoint callback method called..."); } @Override public void afterRestore(Context<? extends Resource> context) { LOG.info("CRaC's afterRestore callback method called..."); } }
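The callbacks above only log a message. In a real application they would typically be used to honor the State Management consideration from section 1.2, releasing external resources before the checkpoint and re-acquiring them after the restore. The following is a minimal sketch of that pattern, assuming a plain JDBC connection; the connection handling is illustrative and not part of the book's source code:

```java
import java.sql.Connection;
import java.sql.DriverManager;

import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

// Closes a JDBC connection before the checkpoint is taken and reopens it
// after the application is restored from the checkpoint.
public class DatabaseConnectionResource implements Resource {

    private final String url;
    private Connection connection;

    public DatabaseConnectionResource(String url) throws Exception {
        this.url = url;
        this.connection = DriverManager.getConnection(url);
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        connection.close();   // no open sockets may survive the checkpoint
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        connection = DriverManager.getConnection(url);  // reconnect on restore
    }
}
```

Note that for resources managed as Spring beans, Spring Boot's own stop/restart handling described above usually makes such manual callbacks unnecessary.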
Spring Boot 3.2 provides a simplified alternative to take a checkpoint compared to the default on-demand alternative described above. It is called automatic checkpoint/restore at startup. It is triggered by adding the JVM system property -Dspring.context.checkpoint=onRefresh to the java -jar command. When set, a checkpoint is created automatically when the application is started. The checkpoint is created after Spring beans have been created but not started, i.e., after most of the initialization work but before the application starts. For details, see Spring Boot docs and Spring Framework docs. With an automatic checkpoint, we don’t get a fully warmed-up application, and the runtime configuration must be specified at build time. This means that the resulting Docker images will be runtime-specific and contain sensitive information from the configuration, like credentials and secrets. Therefore, the Docker images must be stored in a private and protected container registry. Note: If this doesn’t meet your requirements, you can opt for the on-demand checkpoint, which I will describe in the next blog post. With CRaC and Spring Boot 3.2’s support for CRaC covered, let’s see how we can create Docker images for Spring Boot applications that use CRaC. 2. Creating CRaC-Based Docker Images With a Dockerfile While learning how to use CRaC, I studied several blog posts on using CRaC with Spring Boot 3.2 applications. They all use rather complex bash scripts (depending on your bash experience) using Docker commands like docker run, docker exec, and docker commit. Even though they work, it seems like an unnecessarily complex solution compared to producing a Docker image using a Dockerfile. So, I decided to develop a Dockerfile that runs the checkpoint command as a RUN command in the Dockerfile. It turned out to have its own challenges, as described below. I will begin by describing my initial attempt and then explain the problems I stumbled into and how I solved them, one by one, until I reach a fully working solution. The walkthrough is divided into the following subsections: 2.1. First attempt 2.2. Problem #1, privileged builds with docker build 2.3. Problem #2, CRaC returns exit status 137, instead of 0 2.4. Problem #3, Runtime configuration 2.5. Problem #4, Spring Data JPA 2.6. The resulting Dockerfile Let’s start with a first attempt and see where it leads us. 2.1. First Attempt My initial assumption was to create a Dockerfile based on a multi-stage build, where the first stage creates the checkpoint using a JDK-based base image, and the second stage uses a JRE-based base image for runtime. However, while writing this blog post, I failed to find a base image for a Java 21 JRE supporting CRaC. So I changed my mind and used a regular Dockerfile instead, with a base image from Azul: azul/zulu-openjdk:21-jdk-crac Note: BellSoft also provides base images for CRaC; see Liberica JDK with CRaC Support as an alternative to Azul. The first version of the Dockerfile looks like this: Dockerfile FROM azul/zulu-openjdk:21-jdk-crac ADD build/libs/*.jar app.jar RUN java -Dspring.context.checkpoint=onRefresh -XX:CRaCCheckpointTo=checkpoint -jar app.jar EXPOSE 8080 ENTRYPOINT ["java", "-XX:CRaCRestoreFrom=checkpoint"] This Dockerfile unfortunately cannot be used, since CRaC requires the build to run privileged commands. 2.2. Problem #1, Privileged Builds With Docker Build As mentioned in section 1.2.
As mentioned in section 1.2, Challenges and Considerations, CRIU, which CRaC is based on, requires special Linux privileges to perform a checkpoint. The standard docker build command doesn’t allow privileged builds, so it can’t be used to build Docker images using the above Dockerfile. Note: The --privileged flag that can be used in docker run commands is not supported by docker build. Fortunately, Docker provides an improved builder backend called BuildKit. Using BuildKit, we can create a custom builder that is insecure, meaning it allows a Dockerfile to run privileged commands. To communicate with BuildKit, we can use Docker’s CLI tool buildx. The following command can be used to create an insecure builder named insecure-builder: Shell docker buildx create --name insecure-builder --buildkitd-flags '--allow-insecure-entitlement security.insecure' Note: The builder runs in isolation within a Docker container created by the docker buildx create command. You can run a docker ps command to reveal the container. When the builder is no longer required, it can be removed with the command: docker buildx rm insecure-builder. The insecure builder can be used to build a Docker image with a command like: Shell docker buildx --builder insecure-builder build --allow security.insecure --load . Note: The --load flag loads the built image into the regular local Docker image cache. Since the builder runs in an isolated container, its result will not end up in the regular local Docker image cache by default. RUN commands in a Dockerfile that require privileges must be suffixed with --security=insecure. The --security flag is only in preview and must therefore be enabled in the Dockerfile by adding the following line as the first line in the Dockerfile: Dockerfile # syntax=docker/dockerfile:1.3-labs For more details on BuildKit and docker buildx, see Docker Build architecture. We can now perform the build; however, the way CRaC is implemented stops the build, as we will learn in the next section. 2.3. Problem #2, CRaC Returns Exit Status 137 Instead of 0 On a successful checkpoint, the java -Dspring.context.checkpoint=onRefresh -XX:CRaCCheckpointTo... command is terminated forcefully (like using kill -9) and returns the exit status 137 instead of 0, causing the Docker build command to fail. To prevent the build from stopping, the java command is extended with a test that verifies that 137 is returned and, if so, returns 0 instead. The following is added to the java command: || if [ $? -eq 137 ]; then return 0; else return 1; fi. Note: || means that the following command will be executed if the first command fails. With CRaC working in a Dockerfile, let’s move on and learn about the challenges with runtime configuration and how to handle them. 2.4. Problem #3, Runtime Configuration Using Spring Boot’s automatic checkpoint/restore at startup, there is no way to specify runtime configuration on restore; at least, I haven’t found a way to do it. This means that the runtime configuration has to be specified at build time. Sensitive information from the runtime configuration, such as credentials used for connecting to a database, will be written to the checkpoint files. Since the Docker images will contain these checkpoint files, they also need to be handled in a secure way.
The Spring Framework documentation contains a warning about this, copied from the section Automatic checkpoint/restore at startup: As mentioned above, and especially in use cases where the CRaC files are shipped as part of a deployable artifact (a container image, for example), operate with the assumption that any sensitive data “seen” by the JVM ends up in the CRaC files, and assess carefully the related security implications. So, let’s assume that we can protect the Docker images, for example, in a private registry with proper authorization in place, and that we can specify the runtime configuration at build time. In Chapter 6 of the book, the source code specifies the runtime configuration in the configuration files (application.yml) in a Spring profile named docker. The RUN command, which performs the checkpoint, has been extended to include an environment variable that declares what Spring profile to use: SPRING_PROFILES_ACTIVE=docker. Note: If you have the runtime configuration in a separate file, you can add the file to the Docker image and point to it using an environment variable like SPRING_CONFIG_LOCATION=file:runtime-configuration.yml. With the challenges of proper runtime configuration covered, we have only one problem left to handle: Spring Data JPA’s lack of support for CRaC without some extra work. 2.5. Problem #4, Spring Data JPA Spring Data JPA does not work out of the box with CRaC, as documented in the Smoke Tests project; see the section about Prevent early database interaction. This means that auto-creation of database tables when starting up the application is not possible when using CRaC. Instead, the creation has to be performed outside of the application startup process. Note: This restriction does not apply to embedded SQL databases. For example, the Spring PetClinic application works with CRaC without any modifications since it uses an embedded SQL database by default. To address these deficiencies, the following changes have been made in the source code of Chapter 6: Manual creation of a SQL DDL script, create-tables.sql Since we can no longer rely on the application to create the required database tables, a SQL DDL script has been created. To enable the application to create the script file, a Spring profile create-ddl-script has been added in the review microservice’s configuration file, microservices/review-service/src/main/resources/application.yml. It looks like: YAML spring.config.activate.on-profile: create-ddl-script spring.jpa.properties.jakarta.persistence.schema-generation: create-source: metadata scripts: action: create create-target: crac/sql-scripts/create-tables.sql The SQL DDL file was created by first starting the MySQL database and then the application with the new Spring profile. Once the application had connected to the database, both the application and the database were shut down. Sample commands: Shell docker compose up -d mysql SPRING_PROFILES_ACTIVE=create-ddl-script java -jar microservices/review-service/build/libs/review-service-1.0.0-SNAPSHOT.jar # CTRL/C once "Connected to MySQL: jdbc:mysql://localhost/review-db" is written to the log output docker compose down The resulting SQL DDL script, crac/sql-scripts/create-tables.sql, has been added to Chapter 6’s source code. The Docker Compose file configures MySQL to execute the SQL DDL script at startup. A CRaC-specific version of the Docker Compose file has been created, crac/docker-compose-crac.yml. To create the tables when the database is starting up, the SQL DDL script is used as an init script.
The SQL DDL script is mapped into the init folder /docker-entrypoint-initdb.d with the following volume mapping in the Docker Compose file: YAML volumes: - "./sql-scripts/create-tables.sql:/docker-entrypoint-initdb.d/create-tables.sql" Added a runtime-specific Spring profile in the review microservice’s configuration file. The guidelines in the Smoke Tests project’s JPA section have been followed by adding an extra Spring profile named crac. It looks like the following in the review microservice’s configuration file: YAML spring.config.activate.on-profile: crac spring.jpa.database-platform: org.hibernate.dialect.MySQLDialect spring.jpa.properties.hibernate.temp.use_jdbc_metadata_defaults: false spring.jpa.hibernate.ddl-auto: none spring.sql.init.mode: never spring.datasource.hikari.allow-pool-suspension: true Finally, the Spring profile crac is added to the RUN command in the Dockerfile to activate the configuration when the checkpoint is performed. 2.6. The Resulting Dockerfile Finally, we are done with handling the problems resulting from using a Dockerfile to build a Spring Boot application that can restore quickly using CRaC in a Docker image. The resulting Dockerfile, crac/Dockerfile-crac-automatic, looks like: Dockerfile # syntax=docker/dockerfile:1.3-labs FROM azul/zulu-openjdk:21-jdk-crac ADD build/libs/*.jar app.jar RUN --security=insecure \ SPRING_PROFILES_ACTIVE=docker,crac \ java -Dspring.context.checkpoint=onRefresh \ -XX:CRaCCheckpointTo=checkpoint -jar app.jar \ || if [ $? -eq 137 ]; then return 0; else return 1; fi EXPOSE 8080 ENTRYPOINT ["java", "-XX:CRaCRestoreFrom=checkpoint"] Note: One and the same Dockerfile is used by all microservices to create CRaC versions of their Docker images. We are now ready to try it out! 3. Trying Out CRaC With Automatic Checkpoint/Restore To try out CRaC, we will use the microservice system landscape used in Chapter 6 of my book. If you are not familiar with the system landscape, it consists of the four microservices used in this blog post (product-composite, product, recommendation, and review) backed by MongoDB and MySQL databases. Chapter 6 uses Docker Compose to manage (build, start, and stop) the system landscape. Note: If you don’t have all the tools used in this blog post installed in your environment, you can look into Chapters 21 and 22 for installation instructions. To try out CRaC, we need to get the source code from GitHub, compile it, and create the Docker images for each microservice using a custom insecure Docker builder. Next, we can use Docker Compose to start up the system landscape and run the end-to-end validation script that comes with the book to ensure that everything works as expected. We will wrap up the try-out section by comparing the startup times of the microservices when they start with and without using CRaC. We will go through each step in the following subsections: 3.1. Getting the source code 3.2. Building the CRaC-based Docker images 3.3. Running end-to-end tests 3.4. Comparing startup times 3.1. Getting the Source Code Run the following commands to get the source code from GitHub, jump into the Chapter06 folder, check out the branch SB3.2-crac-automatic, and ensure that a Java 21 JDK is used (Eclipse Temurin is used here): Shell git clone https://github.com/PacktPublishing/Microservices-with-Spring-Boot-and-Spring-Cloud-Third-Edition.git cd Microservices-with-Spring-Boot-and-Spring-Cloud-Third-Edition/Chapter06 git checkout SB3.2-crac-automatic sdk use java 21.0.3-tem
3.2. Building the CRaC-Based Docker Images Start with compiling the microservices source code: Shell ./gradlew build If not already created, create the insecure builder with the command: Shell docker buildx create --name insecure-builder --buildkitd-flags '--allow-insecure-entitlement security.insecure' Now we can build a Docker image for each of the microservices, where the build performs a CRaC checkpoint, with the following commands: Shell docker buildx --builder insecure-builder build --allow security.insecure -f crac/Dockerfile-crac-automatic -t product-composite-crac --load microservices/product-composite-service docker buildx --builder insecure-builder build --allow security.insecure -f crac/Dockerfile-crac-automatic -t product-crac --load microservices/product-service docker buildx --builder insecure-builder build --allow security.insecure -f crac/Dockerfile-crac-automatic -t recommendation-crac --load microservices/recommendation-service docker buildx --builder insecure-builder build --allow security.insecure -f crac/Dockerfile-crac-automatic -t review-crac --load microservices/review-service 3.3. Running End-To-End Tests To start up the system landscape, we will use Docker Compose. Since CRaC requires special Linux privileges, a CRaC-specific Docker Compose file comes with the source code, crac/docker-compose-crac.yml. Each microservice is given the required privilege, CHECKPOINT_RESTORE, by specifying: YAML cap_add: - CHECKPOINT_RESTORE Note: Several blog posts on CRaC suggest using privileged containers, i.e., starting them with docker run --privileged or adding privileged: true in the Docker Compose file. This is a really bad idea since an attacker who gets control over such a container can easily take control of the host that runs Docker. For more information, see Docker’s documentation on Runtime privilege and Linux capabilities. The final addition to the CRaC-specific Docker Compose file is the volume mapping for MySQL to add the init file described above in section 2.5. Problem #4, Spring Data JPA: YAML volumes: - "./sql-scripts/create-tables.sql:/docker-entrypoint-initdb.d/create-tables.sql" Using this Docker Compose file, we can start up the system landscape and run the end-to-end verification script with the following commands: Shell export COMPOSE_FILE=crac/docker-compose-crac.yml docker compose up -d Let’s start with verifying that the CRaC afterRestore callback methods were called: Shell docker compose logs | grep "CRaC's afterRestore callback method called..." Expect something like: Shell ...ReviewServiceApplication : CRaC's afterRestore callback method called... ...RecommendationServiceApplication : CRaC's afterRestore callback method called... ...ProductServiceApplication : CRaC's afterRestore callback method called... ...ProductCompositeServiceApplication : CRaC's afterRestore callback method called... Now, run the end-to-end verification script: Shell ./test-em-all.bash If the script ends with a log output similar to: Shell End, all tests OK: Fri Jun 28 17:40:43 CEST 2024 …it means that all tests ran OK and the microservices behave as expected. Bring the system landscape down with the commands: Shell docker compose down unset COMPOSE_FILE After verifying that the microservices behave correctly when started from a CRaC checkpoint, we can compare their startup times with microservices started without using CRaC.
3.4. Comparing Startup Times Now over to the most interesting part: How much faster does a microservice start when it is restored from a checkpoint compared to a regular cold start? The tests have been run on a MacBook Pro M1 with 64 GB of memory. Let’s start with measuring startup times without using CRaC. 3.4.1. Startup Times Without CRaC To start the microservices without CRaC, we will use the default Docker Compose file. So, we must ensure that the COMPOSE_FILE environment variable is unset before we build the Docker images for the microservices. After that, we can start the database services, MongoDB and MySQL: Shell unset COMPOSE_FILE docker compose build docker compose up -d mongodb mysql Verify that the databases are reporting healthy with the command: docker compose ps. Repeat the command until both report they are healthy. Expect a response like this: Shell NAME ... STATUS ... chapter06-mongodb-1 ... Up 13 seconds (healthy) ... chapter06-mysql-1 ... Up 13 seconds (healthy) ... Next, start the microservices and look in the logs for the startup time (searching for the word Started). Repeat the logs command until logs are shown for all four microservices: Shell docker compose up -d docker compose logs | grep Started Look for a response like: Shell ...Started ProductCompositeServiceApplication in 1.659 seconds ...Started ProductServiceApplication in 2.219 seconds ...Started RecommendationServiceApplication in 2.203 seconds ...Started ReviewServiceApplication in 3.476 seconds Finally, bring down the system landscape: Shell docker compose down 3.4.2. Startup Times With CRaC First, declare that we will use the CRaC-specific Docker Compose file and start the database services, MongoDB and MySQL: Shell export COMPOSE_FILE=crac/docker-compose-crac.yml docker compose up -d mongodb mysql Verify that the databases are reporting healthy with the command: docker compose ps. Repeat the command until both report they are healthy. Expect a response like this: Shell NAME ... STATUS ... crac-mongodb-1 ... Up 10 seconds (healthy) ... crac-mysql-1 ... Up 10 seconds (healthy) ... Next, start the microservices and look in the logs for the startup time (this time searching for the word Restored). Repeat the logs command until logs are shown for all four microservices: Shell docker compose up -d docker compose logs | grep Restored Look for a response like: Shell ...Restored ProductCompositeServiceApplication in 0.131 seconds ...Restored ProductServiceApplication in 0.225 seconds ...Restored RecommendationServiceApplication in 0.236 seconds ...Restored ReviewServiceApplication in 0.154 seconds Finally, bring down the system landscape: Shell docker compose down unset COMPOSE_FILE Now, we can compare the startup times! 3.4.3. Comparing Startup Times Between JVM and CRaC Here is a summary of the startup times, along with calculations of how many times faster the CRaC-enabled microservice starts and the reduction of startup time in percent:

MICROSERVICE | WITHOUT CRAC (s) | WITH CRAC (s) | CRAC TIMES FASTER | CRAC REDUCED STARTUP TIME
product-composite | 1.659 | 0.131 | 12.7 | 92%
product | 2.219 | 0.225 | 9.9 | 90%
recommendation | 2.203 | 0.236 | 9.3 | 89%
review | 3.476 | 0.154 | 22.6 | 96%

Generally, we can see a 10-fold performance improvement in startup times, or a 90% shorter startup time; that’s a lot! Note: The improvement in the Review microservice is even better since it no longer handles the creation of database tables.
However, this extra improvement is not attributable to CRaC itself, so let’s discard the figures for the Review microservice when comparing. 4. Summary Coordinated Restore at Checkpoint (CRaC) is a powerful feature in OpenJDK that improves the startup performance of Java applications by allowing them to resume from a previously saved state, a.k.a. a checkpoint. With Spring Boot 3.2, we also get a simplified way of creating a checkpoint using CRaC, known as automatic checkpoint/restore at startup. The tests in this blog post indicate a 10-fold improvement in startup performance, i.e., a 90% reduction in startup time, when using automatic checkpoint/restore at startup. The blog post also explained how Docker images using CRaC can be built using a Dockerfile instead of the complex bash scripts suggested by most blog posts on the subject. This, however, comes with some challenges of its own, like using custom Docker builders for privileged builds, as explained in the blog post. Using Docker images created with automatic checkpoint/restore at startup comes at a price. The Docker images will contain runtime-specific and sensitive information, such as credentials to connect to a database at runtime. Therefore, they must be protected from unauthorized use. The Spring Boot support for CRaC does not fully cover all modules in Spring’s ecosystem, forcing some workarounds to be applied, e.g., when using Spring Data JPA. Also, when using automatic checkpoint/restore at startup, the JVM HotSpot engine cannot be warmed up before the checkpoint. If optimal execution time for the first requests being processed is important, automatic checkpoint/restore at startup is probably not the way to go. 5. Next Blog Post In the next blog post, I will show you how to use regular on-demand checkpoints to solve some of the considerations with automatic checkpoint/restore at startup. Specifically, I will address the problems of specifying the runtime configuration at build time and storing sensitive runtime configuration in the Docker images, and show how the Java VM can be warmed up before performing the checkpoint.
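As a final illustration of the State Management consideration from section 1.2, the org.crac.Resource callbacks shown earlier can also be used by components that manage their own resources outside of Spring’s lifecycle, closing a connection before the checkpoint and reopening it after the restore. The following is a minimal sketch, not code from the book’s repository; the JDBC wiring and environment variable names are hypothetical:

Java
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class CheckpointAwareConnectionHolder implements Resource {

    private Connection connection;

    public CheckpointAwareConnectionHolder() {
        // Register this instance so CRaC invokes the callbacks below
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) {
        try {
            if (connection != null && !connection.isClosed()) {
                // No open sockets or file handles may be part of the checkpoint
                connection.close();
            }
        } catch (SQLException e) {
            throw new IllegalStateException("Could not close connection before checkpoint", e);
        }
        connection = null;
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) {
        try {
            // Reopen the connection once the process has been restored
            connection = DriverManager.getConnection(
                    System.getenv("DB_URL"),
                    System.getenv("DB_USER"),
                    System.getenv("DB_PWD"));
        } catch (SQLException e) {
            throw new IllegalStateException("Could not reopen connection after restore", e);
        }
    }
}

Remember that resources managed by Spring itself, such as a HikariCP connection pool, are normally closed and reopened automatically, as described in section 1.3; explicit callbacks like these are only needed for resources Spring does not know about.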
Stakeholders often regard Scrum and other Agile teams as cost centers, primarily focused on executing projects within budgetary confines. This conventional view, however, undervalues their strategic potential. If we reconsider Agile teams as investors — carefully allocating their resources to optimize returns — they can significantly impact an organization’s strategic objectives and long-term profitability. This perspective not only redefines their role but also enhances the effectiveness of their contributions to the business by solving the customers’ problems. Strategic Benefits of Viewing Agile Teams as Investors Viewing Agile teams merely as task executors or internal development agencies misses a significant opportunity to harness their strategic potential. Instead, when we envision these Agile teams as investors within the organization’s strategic framework, their role undergoes a radical transformation. This shift in perspective not only emphasizes the intrinsic value Agile teams contribute but also ensures that their daily activities directly support and drive the company’s broader financial and strategic objectives. The following article will explore the multiple strategic benefits of adopting this investor-like viewpoint for Agile teams. For example, by treating each Sprint as a calculated investment with measurable returns, organizations can foster a more dynamic, responsive, and profitable development environment, maximizing operational efficiency and business outcomes. The advantages of such a viewpoint are apparent: Dynamic allocation of resources: Agile teams prioritize work that promises the highest return on investment (ROI), adjusting their focus as market conditions and customer needs evolve. This dynamic resource allocation is akin to managing a flexible investment portfolio where the allocation is continuously optimized in response to changing externalities. Cultivation of ownership and accountability: Teams that view their roles through an investor lens develop a more profound sense of ownership over the products they build. This mindset fosters a culture where every resource expenditure is scrutinized for value, encouraging more thoughtful and result-oriented work and avoiding typical blunders such as gold plating. Alignment with organizational goals: The investor perspective also helps bridge the gap between Agile teams and corporate strategy. It ensures that every Sprint and every project contributes directly to the organization’s overarching goals, aligning day-to-day activities with long-term business objectives. There is a reason why Scrum introduced the Product Goal with the Scrum Guide 2020. Investor Mindset Within Agile Frameworks When Agile teams operate as investors, they manage a portfolio of product development opportunities, each akin to a financial asset. This paradigm shift necessitates a robust understanding of value from a product functionality standpoint and a market and business perspective. Every decision to pursue a new feature, enhance an existing product, or pivot direction is an investment decision with potential returns measured in customer satisfaction, market share, revenue growth, and long-term business viability. Supportive Practices for Agile Teams as Investors To harness the full potential of Agile teams as investors and maximize the returns on their investments, organizations must create a conducive environment that supports this refined role. 
The following practices are crucial for empowering Agile teams to operate effectively within this concept: Autonomy within guided parameters: Similar to how a fund manager operates within the confines of an investment mandate, Agile teams require the freedom to make decisions independently while adhering to the broader strategic objectives set by the organization. This autonomy empowers them to make quick, responsive decisions that align with real-time market conditions and customer feedback. Leaders must trust these teams to navigate the details, allowing them to innovate and adjust their strategies without micromanagement. Agile teams as investors require agency with known constraints. Emphasis on continuous learning: The “investment realm” is dynamic, with continuous shifts that demand ongoing education and adaptability. Agile teams similarly benefit from a continuous learning environment where they can stay updated on the latest technological trends, market dynamics, and customer preferences. This knowledge is critical for making informed decisions, anticipating market needs, and responding proactively. Organizations should facilitate this learning by providing access to training, workshops, and industry conferences and encouraging knowledge sharing within and across teams, for example, by hosting events for the Agile community. Transparent and open communication: Effective communication channels between Agile teams and stakeholders are essential for understanding project expectations, organizational goals, and resource availability. This transparency helps teams make informed decisions about where to allocate their efforts for the best possible returns. Therefore, Agile teams should collaborate with stakeholders and establish regular check-ins, such as Sprint Reviews, Retrospectives, and joint exercises and workshops, to ensure all stakeholders are on the same page and can provide timely feedback that could influence investment decisions. Strategic resource allocation: Just as investors decide how best to distribute assets to maximize portfolio returns, Agile teams must strategically allocate their time and resources. This involves prioritizing tasks based on their potential impact and aligning them with the organization’s key performance indicators (KPIs). Multiple tools, such as value stream mapping or user story mapping, can help identify the most valuable activities that contribute directly to customer satisfaction and business success. Risk management and mitigation: Risk management and mitigation are paramount in the investment world. Agile teams, too, must develop competencies in identifying, assessing, and responding to risks associated with their projects. For example, working iteratively and incrementally in Scrum helps to quickly create feedback loops and adjust course if Increments do not live up to the anticipated response, preventing the team from pouring more time into something less valuable, diluting the potential ROI of the team. (Typically, risk mitigation starts even earlier in the process, based on product discovery and refinement activities.) Performance metrics and feedback loops: To understand the effectiveness of their investment decisions, Agile teams need robust metrics and feedback mechanisms to guide future improvements. Metrics such as return on investment (ROI), customer satisfaction scores, and market penetration rates are valuable in assessing the success of Agile initiatives.
Establishing a culture of feedback where insights and learning from each project cycle are systematically collected and analyzed will enable teams to refine their approaches continually, hence the importance of Sprint Reviews and Retrospectives in Scrum for optimizing a team’s contributions to the company’s strategic goals and ensuring sustained business growth and agility. Top Ten Anti-Patterns Limiting Agile Teams as Investors It could all be so simple if it weren’t for corporate reality. Despite the usefulness of Agile teams as investors concept, teams typically face numerous obstacles. Consequently, identifying and addressing these anti-patterns is crucial for Agile teams to succeed. Here, we explore the top ten anti-patterns that can severely restrict Agile teams from maximizing their investment capabilities and suggest strategies for overcoming them: Siloed operations: When teams operate in silos, they miss critical insights from other parts of the organization that could influence strategic decisions. To break down these silos, promote cross-functional teams and encourage regular interdepartmental meetings where teams can share insights and collaborate on broader organizational goals. Open Spaces or Barcamps are a good starting point. Rigid adherence to roadmaps: While roadmaps help guide development, strict adherence can prevent teams from adapting to new information or capitalizing on emerging opportunities. Implementing a flexible roadmap approach, where adjustments are possible and expected, can help teams stay Agile and responsive. Short-term focus: Focusing solely on short-term outcomes can lead to decisions that sacrifice long-term value. Encourage teams to adopt a balanced scorecard approach that includes short-term and long-term goals, ensuring immediate achievements do not undermine future success. Insufficient stakeholder engagement: Agile teams often lack deep engagement with stakeholders, leading to misalignments and missed opportunities. To combat this, develop structured engagement plans that include regular updates and involvement opportunities for stakeholders throughout the product lifecycle, starting with Sprint Reviews, stakeholder Retrospectives, and collaborative workshops and exercises. Aversion to risk: A culture that penalizes failure stifles innovation and risk-taking. Establishing a risk-tolerant culture that rewards calculated risks and views failures as learning opportunities can encourage teams to pursue higher-return projects. Leadership needs to lead this effort — no pun intended — by sharing their experiences; “failure nights” are suitable for that purpose. Resource hoarding: When teams withhold resources to safeguard against uncertainties, it prevents those resources from being used where they could generate value. Encourage a culture of transparency and shared responsibility where resources are allocated based on strategic priorities rather than preserved for hypothetical needs. Neglect of technical debt: Ignoring technical debt can increase costs and reduce system efficiency in the long run. Task the Agile team to maintain technical excellence and allocate time for debt reduction in each Sprint, treating these efforts as critical investments in the product’s future. There is no business agility without technical excellence. Mismatched incentives: When team incentives, or, worse, personal incentives, are not aligned with organizational goals, it can lead to misdirected efforts. 
Align reward systems with desired outcomes, such as customer satisfaction, market growth, or innovation metrics, to ensure that everyone’s efforts contribute directly to business objectives. Poor market understanding: Teams cannot make informed investment decisions without a strong understanding of the market and customer needs. Invest in market research and customer interaction programs to keep teams informed and responsive to the external environment. All team members must participate in product discovery and customer research activities regularly. Resistance to organizational change: Resistance to new methodologies, practices, or tools can limit a team’s ability to adapt and grow. Foster a culture of continuous improvement and openness to change by regularly reviewing and updating practices and providing training and support for new approaches. By addressing these anti-patterns, organizations can empower their Agile teams as investors, making smarter decisions that align with long-term strategic goals and enhance the company’s overall market position. Conclusion In conclusion, reimagining Scrum and Agile teams as investors is not merely a shift in perspective but a transformative approach that aligns these teams more closely with the organization’s broader objectives. By viewing every Sprint and project through the investment lens, these teams are empowered to prioritize initiatives that promise the best returns regarding customer value and contributions to the organization’s success. This investor mindset encourages Agile teams to operate with an enhanced sense of ownership and accountability, making decisions that are not just beneficial in the short term but are sustainable and profitable over the long haul. It fosters a deeper level of strategic engagement with projects, where Agile teams are motivated to maximize efficiency and effectiveness, understanding their direct impact on the company’s performance. Moreover, the practices that support Agile teams as investors—such as granting autonomy, emphasizing continuous learning, and ensuring open communication—are foundational to creating a culture of innovation and responsiveness. These practices help break down silos, encourage risk-taking, and align team incentives with corporate goals, driving the organization forward in a competitive marketplace. It is critical to address the common anti-patterns that hinder this investment-centric approach. By actively working to eliminate these barriers, organizations can unlock the true potential of their Agile teams, transforming them into critical drivers of business value and strategic advantage. Ultimately, when Scrum and Agile teams are empowered to act as investors, they contribute not only to the immediate product development goals but also to the long-term viability and growth of the organization. This holistic integration of Agile practices with business strategy ensures that the investments made in every Sprint yield substantial and sustained returns, securing a competitive edge in the dynamic business landscape. Do you view your Agile teams as investors? Please share with us in the comments.
ClickHouse is the fastest, most resource-efficient OLAP database which can query billions of rows in milliseconds and is trusted by thousands of companies for real-time analytics. Here are seven tips to help you spin up a production ClickHouse cluster and avoid the most common mistakes. Tip 1: Use Multiple Replicas While testing ClickHouse, it’s natural to deploy a configuration with only one host because you may not want to use additional resources or take on unnecessary expenses. There’s nothing wrong with this in a development or testing environment, but that can come at a cost if you want to use only one host in production. If there’s a failure and you only have one replica and a single host, you’re at risk of losing all your data. For production loads, you should use several hosts and replicate data across them. Not only does it ensure that data remains safe when a host fails, but also allows you to balance the user load on several hosts, which makes resource-intensive queries faster. Tip 2: Don’t Be Shy With RAM ClickHouse is fast, but its speed depends on available resources, especially RAM. You can see great performance when running a ClickHouse cluster with the minimum amount of RAM in a development or testing environment, but that may change when the load increases. In a production environment with a lot of simultaneous read and write operations, a lack of RAM will be more noticeable. If your ClickHouse cluster doesn’t have enough memory, it will be slower, and executing complex queries will take longer. On top of that, when ClickHouse is performing resource-intensive operations, it may compete with the OS itself for RAM, and that eventually leads to OOM, downtime, and data loss. Developers of ClickHouse recommend using at least 16 GB of RAM to ensure that the cluster is stable. You can opt for less memory, but only do so when you know that the load won’t be high. Tip 3: Think Twice When Choosing a Table Engine ClickHouse supports several table engines with different characteristics, but a MergeTree engine will most likely be ideal. Specialized tables are tailored for specific uses, but have limitations that may not be obvious at first glance. Log Family engines may seem ideal for logs, but they don’t support replication and their database size is limited. Table engines in the MergeTree family are the default choice, and they provide the core data capabilities that ClickHouse is known for. Unless you know for sure why you need a different table engine, use an engine from a MergeTree family, and it will cover most of your use cases. Tip 4: Don’t Use More Than Three Columns for the Primary Key Primary keys in ClickHouse don’t serve the same purpose as in traditional databases. They don’t ensure uniqueness, but instead define how data is stored and then retrieved. If you use all columns as the primary key, you may benefit from faster queries. Yet, ClickHouse performance doesn’t only depend on reading data, but on writing it, too. When the primary key contains many columns, the whole cluster slows down when data is written to it. The optimal size of the primary key in ClickHouse is two or three columns, so you can run faster queries but not slow down data inserts. When choosing the columns, think of the requests that will be made and go for columns that will often be selected in filters. Tip 5: Avoid Small Inserts When you insert data in ClickHouse, it first saves a part with this data to a disk. 
It then sorts this data, merges it, and inserts it into the right place in the database in the background. If you insert small chunks of data very often, ClickHouse will create a part for every small insert. This will slow down the whole cluster, and you may get the “Too many parts” error. To insert data efficiently, add data in big chunks and avoid sending more than one insert statement per second. ClickHouse can insert a lot of data at a high pace — even 100K rows per second is okay — but it should be one bulk insert instead of multiple smaller ones (see the sketch at the end of this article). If your data comes in small portions, consider using an external system such as a managed Kafka service to batch the data. ClickHouse is well integrated with Kafka and can efficiently consume data from it. Tip 6: Think of How You Will Get Rid of Duplicate Data Primary keys in ClickHouse don’t ensure that data is unique. Unlike other databases, if you insert duplicate data in ClickHouse, it will be added as is. Thus, the best option would be to ensure that the data is unique before inserting it. You can do this, for example, in a stream-processing application built on a platform like Apache Kafka. If that’s not possible, there are ways to deal with it when you run queries. One option is to use `argMax` to select only the last version of the duplicate row. You can also use the ReplacingMergeTree engine that removes duplicate entries by design. Finally, you can run `OPTIMIZE TABLE ... FINAL` to merge data parts, but that’s a resource-demanding operation, and you should only run it when you know it won’t affect the cluster performance. Tip 7: Don’t Create an Index for Every Column Just like with primary keys, you may want to use multiple indexes to improve performance. This may pay off when you query data with filters that match an index, but overall it won’t make your queries faster. At the same time, you’ll certainly experience the downsides of this strategy. Multiple indexes significantly slow down data inserts because ClickHouse will need to both write the data in the correct place and then update the indexes. When you want to create indexes in a production cluster, select the columns that correlate with the primary key.
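To make Tip 5 concrete, the sketch below accumulates rows on the client and sends them as one bulk insert using plain JDBC. It assumes the ClickHouse JDBC driver is on the classpath and that a hypothetical events table already exists; the URL, credentials, and column names are illustrative only:

Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class BulkInsertExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; adjust host, port, database, and credentials
        String url = "jdbc:clickhouse://localhost:8123/default";

        try (Connection conn = DriverManager.getConnection(url, "default", "");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO events (event_time, user_id, payload) VALUES (?, ?, ?)")) {

            // Accumulate a large batch client-side instead of sending many small inserts
            for (int i = 0; i < 100_000; i++) {
                stmt.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
                stmt.setLong(2, i);
                stmt.setString(3, "sample-payload-" + i);
                stmt.addBatch();
            }

            // One bulk insert means far fewer parts for ClickHouse to merge in the background
            stmt.executeBatch();
        }
    }
}

The exact batch size to aim for depends on your row size and ingest rate, but the principle holds: fewer, larger inserts keep the number of parts, and the background merges they trigger, under control.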
Caching is often implemented as a generic solution when we think about improving the latency and availability characteristics of dependency service calls. Latency improves as we avoid the need to make the network round trip to the dependency service, and availability improves as we don’t need to worry about temporary downtimes of the dependency service given that the cache serves the required response that we are looking for. It is important to note that caching does not help if our requests to a dependency service lead to a distinct response every time, or if a client makes vastly different request types with not much overlap between responses. There are also additional constraints to using caching if our service cannot tolerate stale data. We won’t be delving into caching types, techniques, and applicability as those are covered broadly on the internet. Instead, we will focus on a less talked-about risk with caching that gets ignored as systems evolve and puts the system at risk of a broad outage. When To Use Caching In many cases, caching is deployed to mask known scaling bottlenecks in the dependency service, or it gradually takes over the role of hiding a scaling deficiency of the dependency service over time. For instance, as our service starts making fewer calls to the dependency service, the dependency service's owners start believing that this is the norm for steady-state traffic. If our cache hit rate is 90%, meaning 9 out of 10 requests are served by the cache, then the dependency service only sees 10% of the actual traffic. If client-side caching stops working due to an outage or bug, the dependency service would see a 9x surge in traffic! In almost all cases, this surge in traffic will overload the dependency service, causing an outage. If the dependency service is a data store, this will bring down multiple other services that depend on that data store. To prevent such outages, both the client and the service should consider the following recommendations to protect their systems. Recommendations For clients, it is important to stop treating the cache as a "good-to-have" optimization, and instead treat it as a critical component that needs the same treatment and scrutiny as a regular service. This includes monitoring and alarming on a cache hit ratio threshold as well as the overall traffic that is sent to the dependency service. Any updates or changes to the caching business logic also need to go through the same rigor of testing in development environments and in the pre-production stages. Deployments to servers participating in caching should ensure that the stored state is transferred to new servers that are coming up post-deployment, or that the drop in cache hit rate during deployment is tolerable for the dependency service. If a large number of cache-serving servers are taken down during deployments, it can lead to a proportional drop in cache hit ratio, putting pressure on the dependency service. Clients also need to implement guardrails to control the overall traffic, measured as transactions per second (TPS), sent to the dependency service. Algorithms like token buckets can help restrict TPS from the fleet when the caching fleet goes down (a minimal sketch is included at the end of this article). This needs to be periodically tested by taking down caching instances and seeing how clients send traffic to the dependency service. Clients should also think about implementing a negative caching strategy with a smaller time-to-live (TTL).
Negative caching means that the client will store the error response from the dependency service to ensure the dependency service is not bombarded with retry requests when it is having an extended outage. Similarly, on the service side, load-shedding mechanisms need to be implemented to protect the service from getting overloaded. Overloaded in this case means that the service is unable to respond within the client-side timeout. Note that as the service load increases, it usually manifests as increased latency because server resources are overused, leading to slower responses. We want to respond before the client-side timeout for a request and start rejecting requests if the overall latency starts breaching the client-side timeout. There are different techniques to prevent overloading; one of the simplest is to restrict the number of connections from the Application Load Balancer (ALB) to your service host. However, this could mean indiscriminate dropping of requests, and if that is not desirable, then prioritization techniques could be implemented in the application layer of the service to drop less important requests. The objective of load shedding is to ensure that the service protects the goodput, i.e., requests served within the client-side timeout, as the overall load on the service grows. The service also needs to periodically run load tests to validate the maximum TPS handled by the service host, which allows fine-tuning of the ALB connection limit. We introduced a couple of techniques to protect the goodput of a service that should be widely applicable, but there are more approaches that readers can explore depending on their service's needs. Conclusion Caching offers immediate benefits for availability and latency at a low cost. However, neglecting the areas we discussed above can expose hidden scaling bottlenecks when the cache goes down, potentially leading to system failures. Regular diligence to ensure the proper functioning of the system even when the cache is down is crucial to prevent catastrophic outages that could affect your system's reliability. Here is an interesting read about a large-scale outage triggered by cache misses.
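To make the client-side guardrail from the recommendations above concrete, here is a minimal token-bucket sketch. It is illustrative only; in practice, you would more likely use an existing rate limiter such as Guava's RateLimiter or Resilience4j. The idea is to cap the rate of calls that can reach the dependency service so that a sudden loss of the cache results in rejected or degraded requests rather than an overloaded dependency:

Java
/**
 * Minimal token-bucket guardrail for calls to a dependency service.
 * Tokens refill at a steady rate; each call to the dependency consumes one.
 * When the cache is lost and traffic surges, excess calls are rejected (or
 * served in a degraded way) instead of overwhelming the dependency service.
 */
public class DependencyCallGuard {

    private final long capacity;          // maximum burst size
    private final double refillPerNano;   // tokens added per nanosecond
    private double tokens;
    private long lastRefillNanos;

    public DependencyCallGuard(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Returns true if the caller may invoke the dependency service now. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false; // shed the call: fail fast, serve stale data, or queue for retry
    }
}

A caller checks tryAcquire() before each dependency call and falls back, for example to a stale cache entry or an error response, when it returns false. The capacity and tokensPerSecond values should be derived from the load tests mentioned above so that the limit reflects what the dependency service can actually sustain.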