Codenatives developed a cost-effective data pipeline workflow for our client using Google Cloud Composer, a managed workflow orchestration service on Google Cloud Platform built on the open-source tool Apache Airflow. Using the managed Apache Airflow instance on Google Cloud, we created Directed Acyclic Graphs (DAGs): sets of configurable job steps composed from Airflow's built-in operators.
For this requirement, the PythonOperator, BashOperator, and BigQueryOperator were used to build a robust pipeline that performs ETL from Google Search Console to BigQuery. The data extracted from Search Console was transformed, backed up in Cloud Storage, and stored in different BigQuery schemas according to the requirement specifications.
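To make the shape of such a pipeline concrete, here is a minimal sketch using the Airflow 1.10-era imports bundled with earlier Cloud Composer versions. The DAG id, bucket, project, table names, and the extraction stub are hypothetical placeholders, not the client's actual code:

```python
import json
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator


def extract_search_console(**context):
    """Stub for the Search Analytics API call; the real extraction
    and transformation logic is not reproduced here."""
    rows = []  # e.g. service.searchanalytics().query(...).execute()["rows"]
    with open("/tmp/gsc_extract.json", "w") as f:
        json.dump(rows, f)


default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="gsc_to_bigquery",  # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Extract: pull Search Console rows with a Python callable.
    extract = PythonOperator(
        task_id="extract_gsc_data",
        python_callable=extract_search_console,
    )

    # Back up the raw extract to Cloud Storage before loading.
    backup = BashOperator(
        task_id="backup_to_gcs",
        bash_command="gsutil cp /tmp/gsc_extract.json gs://example-backup-bucket/",
    )

    # Load/transform into the target schema with a SQL statement.
    load = BigQueryOperator(
        task_id="load_to_bigquery",
        sql="SELECT * FROM `example-project.staging.gsc_raw`",  # placeholder
        destination_dataset_table="example-project.analytics.gsc_daily",
        write_disposition="WRITE_TRUNCATE",
        use_legacy_sql=False,
    )

    extract >> backup >> load
```

Chaining the three operators this way keeps the extract, backup, and load steps independently retryable, which is what makes the pipeline robust to transient failures.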
Auditing and monitoring the Composer runs is another critical aspect of the ETL process. Configuring Cloud Logging with triggers and alert thresholds ensures that critical issues do not go unnoticed; a sketch of one such failure-alerting hook follows the list below. The other vital elements of the solution were:
- Promoting code reuse by retrofitting logic from the client's previous version into Airflow DAGs.
- Enabling parallel runs on the DAGs, ensuring much faster data retrieval than the prior version of the code (see the parallelism sketch below).
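As a hedged illustration of the parallel-run setup: `max_active_runs` and `concurrency` are the standard Airflow knobs for this, and fanning out one extraction task per Search Console property lets those tasks run concurrently. The DAG id, site list, and commands below are hypothetical, not the client's actual configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def extract_search_console(site, **context):
    """Hypothetical per-site extraction stub."""
    print("extracting Search Console data for %s" % site)


with DAG(
    dag_id="gsc_to_bigquery_parallel",  # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    max_active_runs=3,  # several DAG runs may proceed at once
    concurrency=8,      # cap on concurrent tasks within this DAG
) as dag:
    backup = BashOperator(
        task_id="backup_to_gcs",
        bash_command="echo 'gsutil cp ... gs://example-backup-bucket/'",
    )

    # Fan out one extraction task per property so they run in parallel,
    # all converging on the single backup step downstream.
    for site in ["site_a", "site_b", "site_c"]:  # hypothetical properties
        PythonOperator(
            task_id="extract_%s" % site,
            python_callable=extract_search_console,
            op_kwargs={"site": site},
        ) >> backup
```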
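For the alerting side, one common pattern (sketched below, assuming Composer's built-in forwarding of Airflow task logs to Cloud Logging) is an `on_failure_callback` that emits a distinctive log line; a log-based metric with an alert threshold in Cloud Monitoring can then fire on it. The callback name and message format here are illustrative, not the client's actual setup:

```python
import logging


def alert_on_failure(context):
    """Emit a distinctive error line when any task fails.

    Composer forwards Airflow task logs to Cloud Logging, so a
    log-based metric matching "ETL_PIPELINE_FAILURE" can back an
    alerting policy with a threshold in Cloud Monitoring.
    """
    ti = context["task_instance"]
    logging.error(
        "ETL_PIPELINE_FAILURE dag=%s task=%s execution_date=%s",
        ti.dag_id,
        ti.task_id,
        context.get("execution_date"),
    )


# Attach the callback via default_args so every task in the DAG uses it.
default_args = {
    "owner": "data-eng",
    "on_failure_callback": alert_on_failure,
}
```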
The customer benefitted from this approach because the entire re-engineering was done in the most cost-effective way. Organizations typically spend tens of thousands of dollars to manage data ingestion and ETL from diverse sources, with added complexity from rigid infrastructure, limited control over performance, and clunky monitoring and logging approaches. Running this process on Cloud Composer significantly reduced operating costs for the customer.