Enable data analytics with Talend and Amazon Redshift Serverless


This is a guest post co-written with Cameron Davie from Talend.

To accelerate and scale data analytics, companies are looking for ways to minimize infrastructure management and predict computing needs for different types of workloads, including spikes and ad hoc analytics.

The integration of Talend Cloud and Talend Stitch with Amazon Redshift Serverless can help you achieve successful business outcomes without data warehouse infrastructure management.

In this post, we demonstrate how Talend easily integrates with Redshift Serverless to help you accelerate and scale data analytics with trusted data.

About Redshift Serverless

Redshift Serverless makes it simple to run and scale analytics without having to manage your data warehouse infrastructure. Data scientists, developers, and data analysts can access meaningful insights and build data-driven applications with zero maintenance. Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use. You can load your data and start querying in your favorite business intelligence (BI) tools, build machine learning (ML) models in SQL, or combine your data with third-party data for new insights because Redshift Serverless seamlessly integrates with your data landscape. Existing Amazon Redshift customers can migrate their Redshift clusters to Redshift Serverless using the Amazon Redshift console or API without making changes to their applications, and take advantage of this capability.

About Talend

Talend is an AWS ISV Partner with the Amazon Redshift Ready Product designation and AWS Competencies in both Data and Analytics and Migration. Talend Cloud combines data integration, data integrity, and data governance in a single, unified platform that makes it easy to collect, transform, clean, govern, and share your data. Talend Stitch is a fully managed, scalable service that helps you replicate data into your cloud data warehouse and quickly access analytics to make better, faster decisions.

Solution overview

The integration of Talend with Amazon Redshift adds new features and capabilities. As of this writing, Talend has 14 distinct native connectivity and configuration components for Amazon Redshift, which are fully documented in the Talend Help Center.

From the Talend Studio interface, there are no differences or changes required to support or access a Redshift Serverless instance or provisioned cluster.

In the following sections, we detail the steps to integrate the Talend Studio interface with Redshift Serverless.

Prerequisites

To complete the integration, you need a Redshift Serverless data warehouse. For setup instructions, see the Getting Started Guide. You also need a Talend Cloud account and Talend Studio. For setup instructions, see the Talend Cloud installation guide.

Integrate Talend Studio with Redshift Serverless

In the Talend Studio interface, you first create and establish a connection to Redshift Serverless. Then you add an output component for standard loading from your desired source into your Redshift Serverless data warehouse, using the established connection. Alternatively, you can use the tRedshiftBulkExec bulk loading component to load large amounts of data directly into your Redshift Serverless data warehouse. Complete the following steps:

  1. Configure a tRedshiftConnection component to connect to Redshift Serverless:
    • For Database, choose Amazon Redshift.
    • Leave the values for Property Type and Driver version as default.
    • For Host, enter the Redshift Serverless endpoint’s host URL.
    • For Port, enter 5439.
    • For Database, enter your database name.
    • For Schema, enter your preferred schema.
    • For Username and Password, enter your user name and password, respectively.

Follow security best practices by using a strong password policy and regular password rotation to reduce the risk of password-based attacks or exploits.

For more information on how to connect to a database, refer to tDBConnection.
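The connection settings above can be sketched in code to make their relationship concrete. The following Python sketch is illustrative only and is not part of Talend Studio: the class `RedshiftConnectionSettings`, its field names, the example host, and the `REDSHIFT_PASSWORD` environment variable are all hypothetical. It assumes the Host field takes the endpoint host name without a scheme or path, and it uses 5439, the default Amazon Redshift port. The `jdbc_url` method follows the standard Amazon Redshift JDBC URL pattern.

```python
import os
from dataclasses import dataclass

REDSHIFT_DEFAULT_PORT = 5439  # default Amazon Redshift port


@dataclass(frozen=True)
class RedshiftConnectionSettings:
    """Hypothetical container mirroring the connection fields listed above."""

    host: str
    database: str
    schema: str
    username: str
    password: str
    port: int = REDSHIFT_DEFAULT_PORT

    def validate(self) -> None:
        # Assumption: the Host field expects a bare host name (no scheme, no path).
        if "://" in self.host or "/" in self.host:
            raise ValueError("host must be a bare endpoint host name")
        if not 1 <= self.port <= 65535:
            raise ValueError("port must be between 1 and 65535")
        for name in ("host", "database", "schema", "username", "password"):
            if not getattr(self, name):
                raise ValueError(f"{name} must not be empty")

    def jdbc_url(self) -> str:
        # Amazon Redshift JDBC URL shape: jdbc:redshift://host:port/database
        return f"jdbc:redshift://{self.host}:{self.port}/{self.database}"


settings = RedshiftConnectionSettings(
    # Hypothetical workgroup, account ID, and Region.
    host="example-workgroup.123456789012.us-east-1.redshift-serverless.amazonaws.com",
    database="dev",
    schema="public",
    username="analytics_user",
    # Read the secret from the environment instead of hard-coding it.
    password=os.environ.get("REDSHIFT_PASSWORD", "placeholder-only"),
)
settings.validate()
print(settings.jdbc_url())
```

Reading the password from an environment variable, rather than storing it in the job definition, fits the password guidance above.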

After you create the connection object, you can add an output component to your Talend Studio job. The output component specifies that the data processed in the job’s workflow lands in Redshift Serverless. The following examples show standard output and bulk loading output.

  1. Add a tRedshiftOutput database component.

tRedshiftOutput database component

  1. Configure the tRedshiftOutput database component to write, update, or make changes to the connected Redshift Serverless data warehouse.
  2. When using the tRedshiftOutput component, select Use an existing component and choose the connection you created.

This makes sure that this component is preconfigured.

tDBOutput component

For more information on how to set up a tDBOutput component, see tDBOutput.
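Conceptually, standard output writes rows to the target table with ordinary SQL. The following Python sketch shows one way to build a parameterized multi-row INSERT statement. It is an illustration under stated assumptions, not a description of how the Talend output component works internally: the function `build_insert`, the `orders` table, and its columns are hypothetical, and the `%s` placeholders assume a driver that uses the DB-API "format" parameter style.

```python
from typing import Any, List, Sequence, Tuple


def build_insert(
    schema: str,
    table: str,
    columns: Sequence[str],
    rows: Sequence[Sequence[Any]],
) -> Tuple[str, List[Any]]:
    """Return a multi-row INSERT statement and its flattened parameter list."""
    if not rows:
        raise ValueError("rows must not be empty")
    if any(len(row) != len(columns) for row in rows):
        raise ValueError("every row must have one value per column")

    def quote(identifier: str) -> str:
        # Double-quote identifiers and escape embedded double quotes.
        return '"' + identifier.replace('"', '""') + '"'

    column_list = ", ".join(quote(c) for c in columns)
    # One "(%s, %s, ...)" group per row; values travel separately as parameters.
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    values = ", ".join([row_placeholder] * len(rows))
    sql = f"INSERT INTO {quote(schema)}.{quote(table)} ({column_list}) VALUES {values}"
    params = [value for row in rows for value in row]
    return sql, params


# Hypothetical table and data, for illustration only.
sql, params = build_insert(
    "public", "orders", ["order_id", "amount"], [(1, 9.99), (2, 24.50)]
)
print(sql)
print(params)
```

Keeping values out of the SQL text and passing them as parameters avoids quoting mistakes, and grouping several rows into one statement reduces round trips for small to medium loads.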

  1. Alternatively, you can configure a tRedshiftBulkExec database component to run the insert operations on the connected Redshift Serverless data warehouse.

Using the tRedshiftBulkExec database component allows you to mass load data files directly from Amazon Simple Storage Service (Amazon S3) into Redshift Serverless as tables. The following screenshot illustrates that Talend is able to use connection information in a job across multiple components, saving time and effort when establishing connections to both Amazon Redshift and Amazon S3.

  1. When using the tRedshiftBulkExec component, select Use an existing component for Database settings and choose the connection you created.

This makes sure that this component is preconfigured.

  1. For S3 Setting, select Use an existing S3 connection and choose your existing S3 connection, which you configure separately.

tDBBulkExec component

For more information on how to set up a tDBBulkExec component, see tDBBulkExec.
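Bulk loads from Amazon S3 into Amazon Redshift are generally performed with the SQL COPY command. The following Python sketch builds such a statement for a CSV file to show what a bulk load looks like at the SQL level. It makes no claim about the statement the bulk loading component actually issues: the function `build_copy_statement`, the bucket, the table, and the IAM role ARN are hypothetical, and authorizing with an IAM role is an assumption (your S3 connection may use different credentials).

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_copy_statement(
    schema: str,
    table: str,
    s3_uri: str,
    iam_role_arn: str,
    ignore_header_rows: int = 1,
) -> str:
    """Build a Redshift COPY statement that loads a CSV file from Amazon S3."""
    for name in (schema, table):
        # Accept only simple identifiers so the statement cannot be altered.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"invalid identifier: {name!r}")
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    if ignore_header_rows < 0:
        raise ValueError("ignore_header_rows must not be negative")

    def literal(value: str) -> str:
        # Single-quote string literals and escape embedded single quotes.
        return "'" + value.replace("'", "''") + "'"

    return (
        f"COPY {schema}.{table} "
        f"FROM {literal(s3_uri)} "
        f"IAM_ROLE {literal(iam_role_arn)} "
        f"FORMAT AS CSV "
        f"IGNOREHEADER {ignore_header_rows}"
    )


# Hypothetical bucket, table, and role, for illustration only.
statement = build_copy_statement(
    schema="public",
    table="orders",
    s3_uri="s3://example-bucket/exports/orders.csv",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

Because COPY reads the files directly from Amazon S3, it is usually a better fit than row-by-row inserts for large volumes.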

In addition to Talend Cloud for enterprise-level data transformation needs, you can use Talend Stitch to handle data ingestion and data replication to Redshift Serverless. All configuration for ingesting or replicating data from your desired sources to Redshift Serverless is done in a single input screen.

  1. Provide the following parameters:
    • For Display Name, enter your preferred display name for this connection.
    • For Description, enter a description of the connection. This is optional.
    • For Host, enter the Redshift Serverless endpoint’s host URL.
    • For Port, enter 5439.
    • For Database, enter your database name.
    • For Username and Password, enter your user name and password, respectively.

All support documents and information (including diagrams, steps, and screenshots) can be found in the Talend Cloud and Talend Stitch documentation.

Summary

In this post, we demonstrated how the integration of Talend with Redshift Serverless helps you quickly integrate multiple data sources into a fully managed, secure platform and immediately enable business-wide analytics.

Check out AWS Marketplace and sign up for a free trial with Talend. For more information about Redshift Serverless, refer to the Getting Started Guide.


About the Authors

Tamara Astakhova is a Sr. Partner Solutions Architect in Data and Analytics at AWS. She has over 18 years of experience in the architecture and development of large-scale data analytics systems. Tamara works with strategic partners, helping them build complex AWS-optimized architectures.

Cameron Davie is a Principal Solutions Engineer for the Tech Alliances team. He oversees the technical responsibilities of Talend’s most strategic ISV partnerships. Cameron has been with Talend for 6 years in this role, working directly as the primary technical resource for partners such as AWS, Snowflake, and more. Cameron’s role at Talend is primarily focused on technical enablement and evangelism. This includes showcasing key capabilities of our partners’ solutions internally as well as demonstrating Talend’s core technical capabilities to the technical sellers at Talend’s strategic ISV partners. Cameron is a veteran of ISV partnerships and enterprise software, with over 23 years of experience. Before Talend, he spent 14 years at SAP on their OEM/Embedded Solutions partnership team.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.