AWS aims to simplify data management with DataZone and the integration between Redshift, Aurora and Apache Spark

AWS announced several pieces of data management news at re:Invent 2022, its annual conference held in Las Vegas, which we attended. The company introduced a "zero-ETL" integration between Amazon Redshift and Amazon Aurora, and an integration between Redshift and Apache Spark that lets customers use the data analytics and machine learning services provided by AWS. Still on the subject of data management, AWS also announced Amazon DataZone, a service that allows businesses to gain better control over their data.

No ETL between Amazon Aurora and Amazon Redshift

One of the biggest problems that companies face when they use different systems is data transformation: the ETL process ("extract, transform, load") is often one of the longest and most complex parts of integrating applications.

For this reason, AWS's announcement is particularly significant: it will no longer be necessary to build complex ETL pipelines between Amazon Aurora, a relational database service compatible with MySQL and PostgreSQL, and Amazon Redshift, a service for analyzing structured and semi-structured data in databases, data warehouses and data lakes.

As AWS itself writes in the announcement release, "Many companies today rely on a three-part solution to analyze their transactional data: a relational database to store the data, a data warehouse to analyze it, and an ETL pipeline between the relational database and the data warehouse. These pipelines can be expensive to build and difficult to maintain, requiring developers to write custom code and constantly manage the infrastructure to make sure it scales with demand."

The new solution instead allows Aurora’s transactional data to be automatically and continuously replicated in Redshift, so that it can be analyzed using, for example, machine learning techniques.

Amazon Redshift integrates with Apache Spark

Apache Spark is one of the most widely used open source frameworks for so-called "big data" analysis. AWS offers its own version that it says is three times faster than the open source one. However, there was no native integration between Spark and Redshift, and companies had to turn to third parties. AWS has therefore decided to provide its own connector, which makes it easier for enterprises to analyze data in Redshift with Apache Spark, while cutting out the competition.

The new integration allows developers to run queries against Redshift data from Spark-based applications, "within seconds" according to AWS, using popular programming languages (such as Java, Python, R and Scala). The advantage of the new connector is that the intermediate stages are managed automatically by the system, so users don't have to configure and manage them themselves.
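To give an idea of what reading Redshift data from a Spark application can look like, here is a minimal Python sketch. It is not taken from the AWS announcement: the format name and option keys follow the open source spark-redshift community connector and are assumed here, and the endpoint, table, bucket and role are hypothetical placeholders.

```python
from typing import Dict

# Format name of the open source spark-redshift community connector.
# Assumed here: the article does not name the connector class.
REDSHIFT_FORMAT = "io.github.spark_redshift_community.spark.redshift"


def redshift_read_options(
    jdbc_url: str, table: str, temp_dir: str, iam_role: str
) -> Dict[str, str]:
    """Collect the reader options needed to pull one Redshift table."""
    return {
        "url": jdbc_url,           # JDBC endpoint of the Redshift cluster
        "dbtable": table,          # table to read
        "tempdir": temp_dir,       # S3 location used as a staging area
        "aws_iam_role": iam_role,  # role Redshift assumes to access S3
    }


def read_redshift_table(spark, options: Dict[str, str]):
    """Load a Redshift table as a Spark DataFrame.

    Needs a live SparkSession with the connector on the classpath and a
    reachable cluster, so it is only defined here, not called.
    """
    return spark.read.format(REDSHIFT_FORMAT).options(**options).load()


# Hypothetical values, for illustration only.
example_options = redshift_read_options(
    "jdbc:redshift://example-cluster:5439/dev",
    "public.orders",
    "s3://example-bucket/tmp/",
    "arn:aws:iam::123456789012:role/example-role",
)
```

In a real application, `read_redshift_table(spark, example_options)` would return a DataFrame that can be filtered and aggregated like any other Spark dataset.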

Amazon DataZone aims to simplify data management

Companies are finding it increasingly difficult to understand what data they have and where it is stored, partly because of the growing number of places, both physical and virtual, where it can be kept. Traditional on-premises infrastructure has been joined by cloud computing and third-party services. Amazon DataZone aims to help companies find, catalog, share and manage data wherever it is.

Through the service, enterprise data producers can use the DataZone web portal to create a data catalog with its own taxonomy, and to set the corresponding governance policies and the connections to other services (both from AWS, such as S3 and Redshift, and from third parties, such as Salesforce and ServiceNow).

DataZone uses machine learning to collect and suggest metadata that can be used to catalog the data, and then makes it available through its web portal. This way users can search through the data, request access to it, and examine the metadata. A project is then created, which is shared among team members and makes it easier to manage data access. It is also possible to use APIs to integrate DataZone with solutions like Databricks, Snowflake and Tableau.


Shiv
