Skip to main content

How to process data before migrating it to the target database?

How difficult is data migration? This question usually has different answers in different migration scenarios. If it's simply synchronizing the data from the source to the target, it's relatively easy to implement. In another scenario, the data in the source database needs to be processed before being merged into the target, which is very challenging compared to the former.

For data migration scenarios with data processing needs, the steps usually include extracting data from the source database, then transforming it based on user rules, and finally loading the processed data into the target system. This entire process is called ETL (Extract, Transform, Load). In today's internet companies, ETL demand scenarios are numerous and diverse, such as:

  • Data Cleaning: Remove phone numbers that do not meet requirements, remove unnecessary spaces or characters, etc.
  • Data Standardization: Unify data formats, for example, process date fields, discard hours, minutes, and seconds, and only retain year, month, and day.
  • Information Extraction: Extract required information from target fields, such as extracting household registration, gender, and date of birth from ID numbers.
  • JSON Processing: Process JSON fields, for example, convert all fields in the source database JSON into columns.
  • Column Processing: Add or remove columns, batch modify column names.

Under the premise of ETL requirements, the range of available migration products is further narrowed down. Products that are usable and easy to use are rare. In this field, good products are expensive, and there are also some open-source products to choose from. However, due to the lack of support for incremental replication, the business adaptability is not very high.

Most users want a product that meets the needs of low price, supports incremental data ETL, supports a visual interface, high performance, and stable operation. NineData's data replication function has also supported ETL in recent updates, providing more than 300 ETL functions, which can cover almost all common ETL scenarios.

Most importantly, NineData provides 10 free data replication quotas, including one incremental quota, which can meet the free use of most light users. So how can we efficiently perform ETL? Let's demonstrate next.

Take a common ETL scenario as an example. When analyzing data, we often only need to count the user visits on a specific date, and the date usually includes year, month, day, hour, minute, and second information. To optimize query performance, we need to reduce the date precision to reduce the consumption of computing resources. Therefore, retaining only the year, month, and day can achieve our goal.

Step 1: Enter Data Source

Enter all source and target data sources into the NineData platform.

image-20240719152341194

Step 2: Configure Synchronization Chain ETL Rules

  1. Log in to the NineData console, click on Data Replication > Data Replication, and then click on Create Replication.

    image-20240319150313727

  2. When the task configuration process reaches the Mapping Configuration step, click on Mapping and Filtering on the right side of the target table.

    image-20240808094319141

  3. Since we need to truncate the time of the created_date column in the source table to year, month, and day, discarding the hour, minute, and second information, we need to use the built-in function to transform this column. Let's try the following ETL configuration.

    image-20240809103724646

    1. Click on the Field Expression text box in the middle of the target field.
    2. In the pop-up window, find the function you need to use, and copy the function body. In this example, the dateutils_truncate function is selected.
    3. Paste the function body into the expression text box behind the equal sign (=), and configure the parameters. This function accepts two parameters, the first is the name of the time field, and the second is the time precision. The example in the figure dateutils_truncate(created_date, 5) indicates that the hour, minute, and second information of the created_date field is truncated, and the year, month, and day are retained.
    4. Click Submit.
  4. After the configuration is completed, click OK in the lower right corner to complete the ETL configuration. When the task starts, the created_date in the source will be processed according to this ETL configuration and then loaded into the target end. Let's take a look at the final effect after the migration is completed.

    Source image-20240809105230551

    Target image-20240809105300798

Summary

After the migration is completed, it can be seen that the date precision in the target has become year, month, and day, achieving our goal. In the incremental task, the newly written created_date in the business will also be synchronized and migrated to the target end in the same form. Retaining this task can achieve ETL on a long-term, stable, and unattended basis.

All the scenarios and demonstrations mentioned in this article are just the tip of the iceberg of ETL functions. The more than 300 functions provided by NineData each have different capabilities, which can handle various string, time, JSON transformation operations, and can also handle complex mathematical calculations, meeting various tricky ETL scenarios.

iShot_2024-08-09_14.55.36