
6. Creating a Crypto Data Pipeline in GCP

Kajahnan Sivakumar

Updated: Aug 11, 2022



In this post, we'll look at creating one or more data pipelines on the Google Cloud Platform (GCP), and how you can do it using the different services available in this cloud environment.


The main workflow of this architecture can be seen above and goes as follows:

  • The Cloud Scheduler would trigger the cloud function to run at whatever time is specified on the scheduler

  • Once triggered, the cloud function would download the data from a REST API, in this case Bitcoin trading data

  • The data would be output as a CSV/JSON file into a Google Cloud bucket, i.e. Cloud Storage

  • Then you would add the file manually into BigQuery to determine the data schema and data types of each column

  • After determining the schema, you can write a pipeline using Apache Beam that can be deployed as a Dataflow job

  • Once the Apache Beam pipeline is deployed and completes successfully, you will have a batch processing pipeline that moves data from Cloud Storage into BigQuery


Cloud Functions:

Like other cloud providers, such as AWS with its Lambda functions, GCP has its own application for bringing in data, and that's Cloud Functions.


What you'll see below is how I start this workflow by building a cloud function in Python that pulls data from CoinAPI's crypto REST API, receiving data points for the BTC/USD trading pair.


main.py
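
For reference, a minimal sketch of such a function could look something like the following; the CoinAPI endpoint, bucket name, environment variable names and entry-point name below are illustrative placeholders rather than the exact values used in my function:

# main.py -- a minimal sketch of the Cloud Function; the endpoint, bucket and
# environment variable names are placeholders for illustration.
import os
import datetime

import requests
import pandas as pd
from google.cloud import storage

COINAPI_URL = "https://rest.coinapi.io/v1/trades/BITSTAMP_SPOT_BTC_USD/latest"  # illustrative trades endpoint
BUCKET_NAME = os.environ.get("BUCKET_NAME", "btcusd-trades")                    # placeholder bucket name


def get_btcusd_trades(request):
    """HTTP-triggered entry point: pull BTC/USD trades and write a CSV to Cloud Storage."""
    headers = {"X-CoinAPI-Key": os.environ["COINAPI_KEY"]}  # CoinAPI key supplied as an env var
    response = requests.get(COINAPI_URL, headers=headers, params={"limit": 1000})
    response.raise_for_status()

    # Flatten the JSON payload into a dataframe and serialise it as CSV
    df = pd.DataFrame(response.json())
    filename = f"btcusd_trades_{datetime.date.today().isoformat()}.csv"

    # Upload the CSV to the Cloud Storage bucket
    bucket = storage.Client().bucket(BUCKET_NAME)
    bucket.blob(filename).upload_from_string(df.to_csv(index=False), content_type="text/csv")

    return f"Uploaded {filename} to gs://{BUCKET_NAME}"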


requirements.txt file:

# Function dependencies, for example:
# package>=version
google-cloud-bigquery
google-cloud-storage
requests
pandas

The file above is needed so that the cloud function can download the packages it requires in order to run.


Testing the function:

Once this has been created, you can deploy the function and test it to see whether a file has been uploaded into GCP's Cloud Storage, which we can see below.

If there are errors in execution, I recommend adding print statements within the function so that you can see where the function went wrong in the Logs.
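
As a quick sanity check (not part of the function itself), you can also list the bucket's contents with the google-cloud-storage client; the bucket name here is a placeholder:

# List what's currently in the bucket to confirm the function's output landed there
from google.cloud import storage

client = storage.Client()
for blob in client.bucket("btcusd-trades").list_blobs():   # placeholder bucket name
    print(blob.name, blob.size, blob.updated)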


Cloud Scheduler:

After the cloud function has been successfully deployed and tested, the next stage is to create a Cloud Scheduler job that runs the cloud function automatically.


This can be done by finding the Cloud Scheduler application in the search bar and creating a new job. The configuration for the above function can be seen below:



This shows that the cloud function will run daily at 7 AM British time. The URL here is the HTTP trigger link of the cloud function, which can be seen in the Trigger tab if we scroll back to the "Testing the Function" section.
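
For reference, roughly the same job could also be created with the google-cloud-scheduler Python client instead of the console; the project, region, job name and trigger URL below are placeholders:

# Sketch of creating the same scheduler job programmatically (placeholder names and URL)
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = "projects/my-project/locations/europe-west2"   # placeholder project and region

job = scheduler_v1.Job(
    name=f"{parent}/jobs/btcusd-daily",                 # placeholder job name
    schedule="0 7 * * *",                               # every day at 7 AM
    time_zone="Europe/London",                          # British time
    http_target=scheduler_v1.HttpTarget(
        uri="https://europe-west2-my-project.cloudfunctions.net/get_btcusd_trades",  # the function's HTTP trigger URL
        http_method=scheduler_v1.HttpMethod.GET,
    ),
)

client.create_job(parent=parent, job=job)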


Dataflow:

After the file has been created and uploaded into a Cloud Storage bucket, the next part is to move the data into a staging table, which in this case sits in BigQuery.


To do that, we can create a pipeline using the Apache Beam library that, when run in the Cloud Shell, moves the data across. For the above example, I wrote the following pipeline and set up the pipeline options with help from this documentation:
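
In rough outline, a pipeline of this kind could look something like the sketch below; the project, bucket paths, table reference, schema and column names are placeholders rather than the exact ones I used:

# btcusd_pipeline.py -- a sketch of the general shape of the Beam pipeline;
# project, bucket paths, table reference and schema are placeholders.
import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                        # placeholder project id
    region="europe-west2",                       # placeholder region
    job_name="btcusdpipeline",                   # the job name referenced later in this post
    temp_location="gs://btcusd-trades/temp",     # placeholder temp location
)


def parse_row(line):
    """Turn one CSV line into a dict matching the BigQuery schema."""
    time_exchange, price, size = next(csv.reader([line]))
    return {"time_exchange": time_exchange, "price": float(price), "size": float(size)}


with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read CSV" >> beam.io.ReadFromText("gs://btcusd-trades/btcusd_trades_*.csv", skip_header_lines=1)
        | "Parse rows" >> beam.Map(parse_row)
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "my-project:crypto.btcusd_trades",   # placeholder dataset and table
            schema="time_exchange:TIMESTAMP,price:FLOAT,size:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )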


Once this is created, it can be executed from the Google Cloud Shell terminal by saving the files in the GCP Cloud Editor, as seen below.


Remember, to execute this you will need your service account/API credentials, which you will need to retrieve from GCP as a JSON key file.


Once you have downloaded it, you should upload it to the Cloud Editor and run the file in the terminal as shown below.
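
One simple way (again, the path here is a placeholder) to point the client libraries and the pipeline at the uploaded key is to set the standard GOOGLE_APPLICATION_CREDENTIALS environment variable before the pipeline runs, for example at the top of the pipeline script:

# Point the Google client libraries at the uploaded service-account key (placeholder path),
# then run the pipeline script from the Cloud Shell terminal, e.g. python btcusd_pipeline.py
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service-account-key.json"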


You can then view the status of the Dataflow job in the Dataflow application of GCP and click on the job name that was used to execute your pipeline. In this particular case, it's "btcusdpipeline", which has run successfully.



The last modified date of the BigQuery table can also be checked to see if the Dataflow job has worked, and again for this particular example, it has been successful:
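
The same check can also be done programmatically with the BigQuery client; the table reference here is a placeholder:

# Check the table's last-modified timestamp and row count after the Dataflow job
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.crypto.btcusd_trades")   # placeholder table reference
print(table.modified, table.num_rows)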



After all these checks, you can comfortably say you have created one or more data pipelines from source to staging.


These steps can also be replicated in Terraform instead of doing each step manually in GCP, but that will be covered in another post.


