Week 6 - Cloud Storage

By the end of this module, you’ll gain experience with storing information on and around the cloud.

Module videos:

Cloud Datatypes [16:38]

Cloud Storage Overview [9:54]

Cloud Storage / Cloud Shell Demo [8:45]

Cloud Storage / Cloud Functions / DLP demo (1/2) [15:21]

Cloud Storage / Cloud Functions / DLP demo (2/2) [16:10]

Cloud Datastore Demo [13:39]

Cloud Spanner Demo [10:56]

SQL warning! [7:53]

Inevitably, you’ll be wanting to “store” things up in the cloud.

Crazy concept, yes, however sometimes we do need persistent data accessible to our various cloud (and even non-cloud) services. Historically, data has been stored either as file-based (such as an FTP server or fGile share) or in databases (relational for a long time and non-relational for a reasonably recent time). That’s not to say that there aren’t other options, but typically it’ll either be files or relational data (e.g., MySQL, Microsoft SQL server, PostGreSQL, etc.).

Figure 1 shows how a “typical” database (monolithic on-premises database) might be migrated to some of the various AWS options for cloud storage.

AWS Database Migration

Figure 1: AWS Database Migration

On the left is the on-prem database that would be typically maintained by your local infrastructure team. On the right are some of the technologies available to turn a database into a cloud-based service. Note: there is not necessarily one “correct” way to store your data in the cloud – there are options!

For instance, you might simply be interested in a direct migration of your files to a cloud storage bucket (container for holding files/data). Or, you may be interested in simply taking your local SQL database and putting it up into a similar database management system hosted on a virtual machine.

You can take it further as well to non-relational databases (NoSQL databases), data warehousing strategies for big data and data analytics, etc. Basically, the decision you need to make is how your data is to be stored and what you want to do with it.

Module video: Cloud Datatypes [16:38]

Storage

Module Video: Cloud Storage Overview [9:54]

Focusing on Google now, let’s take a look at what options are available to you as a developer.

Note: you can interact with all of these products via the Cloud console (web interface) or the Cloud shell (CLI).

Figure 2 shows the currently-available options (c/o Google):

Google Cloud Storage Options

Figure 2: Google Cloud Storage Options

This figure is helpful in that it also identifies the type of storage that each service provides. For instance, you have access to your typical relational databases (Cloud SQL, Cloud Spanner), non-relational (i.e., NoSQL) databases (Datastore, Cloud BigTable), and data warehousing (BigQuery). Cloud Storage is a place to store files and data objects. As with any infrastructure activity, selecting the correct type of storage is an important task.

I will leave the selection process to the more appropriate database/systems administration courses (Google provides a handy flowchart that I’ve included later in this post). Considerations though:

All applications and services will most likely need to create, store, retrieve, and consume data. This includes text files, images, application data, etc. Selecting the appropriate storage option can help you speed up response times, save costs, and enhance flexibility.
Cloud-based storage options have the benefits that other services do in that they are provisionable on-demand and are highly-scalable.
Relational vs. non-relational can be a religious debate to some, however if you have a small-scale application that you never plan to grow it may not make sense to use the full power of a data warehousing strategy. Likewise, a massive social network-type application where data is constantly changing will perform poorly on a standard relational database. Choose the appropriate tool!

For reference, Microsoft Azure has a similar set of available options, as shown in Figure 3 (c/o dremio.com):

Microsoft Azure Storage Options

Figure 3: Microsoft Azure Storage Options

Module video: Cloud Storage / Cloud Shell Demo [8:45]

We also can do some other nifty things with Cloud Storage. In this next demo, we’ll hook up Cloud Storage, Cloud Functions, and the DLP API:

Link to Codelab
Link to Medium article where you can find the correct GitHub repository

Module video: Cloud Storage / Cloud Functions / DLP demo (1/2) [15:21]

Module video: Cloud Storage / Cloud Functions / DLP demo (2/2) [16:10]

Structured vs. Unstructured Data

This discussion is best suited for a proper database course (normalization anybody?), so you get the highlights here. First and foremost, data that you put into a database must be considered very carefully before you begin (or even before you begin transforming an existing database application to a cloud-based application).

Let’s start with relational (i.e., structured) data. This tends to be data best represented in spreadsheet form (rows, columns, etc.). This type of data would be your typical web application, billing information, etc. Structured data is pretty well-understood at this point both from a theoretical perspective and from a programming perspective. Data manipulation in the relational/structural world tends to be straightforward.

Unstructured data, however, has no organization. It is unstructured by nature, meaning it is schema-less. What this means is that there is nothing in the database system enforcing table format. For example, if you create an unstructured object for storing customer information, it is up to you/your application to enforce how usernames, customer names, email addresses, etc., are stored within that object. The database itself does not care!

Common relational database systems include MySQL/MariaDB, PostGreSQL, SQLite, and Microsoft SQL Server (and Access, though we don’t talk about that (ง’̀-‘́)ง). If you are working in industry, more than likely you have come across one of these.

Common non-relational/unstructured/NoSQL databases include MongoDB, CouchDB, Neo4J, and BigTable. These types of databases generally have their own particular methods of access and manipulation, whereas the relational databases all implement some form of SQL. For example, Neo4J uses the Cypher language to manipulate a database graph, whereas MongoDB uses JSON-style objects to store information and can be manipulated via API calls.

Figure 4 (c/o Google) presents a view of structured vs. unstructured data.

Google Cloud - Structured vs. Unstructured Data

Figure 4: Google Cloud - Structured vs. Unstructured Data

To illustrate the difference that you may see with an unstructured database environment, Figure 5 (c/o Neo4j) shows a graph-based database. In this system, the relationships between objects are generally most important, with each object being a self-contained entity.

Sample Neo4j database

Figure 5: Sample Neo4j database

What do I use?

This is the “hard” question, as there are just so many available services and technologies out there to use. Figure 6 shows a decision tree (c/o Google) for picking a Google-specific storage technology. Note that you can use this flowchart to make similar decisions in other cloud provider ecosystems!

Google Cloud Storage Selection Strategy

Figure 6: Google Cloud Storage Selection Strategy

The figure is fairly self-explanatory, however some key decisions you must consider are:

Do I want my data be structured or unstructured (i.e., relational vs. NoSQL)?
Do I need my application to be scalable or not (i.e., how many users am I expecting to have AND will it be bursty or fairly even in usage)?
What sort of performance do I need out of it?
How much data do I plan to store?

Next, we’ll do a couple of examples with the different technologies available in Google Cloud.

Example

As we’ve discussed, you have multiple options for data storage. Here, we’ll focus on examples for the two categories of data: structured and unstructured.

Structured Example

With structured data, typically you’re looking to model something fairly relational in nature (i.e., a normal database). For this example we’ll take a look at Cloud Spanner, a tool that aims to give you the feel of a normal database with the scalability features of a non-relational database.

Module video: Cloud Spanner Demo [10:56]

Links to the demo article: https://cloud.google.com/spanner/docs/query-syntax#appendix_a_examples_with_sample_data

And how to use Python with it: https://cloud.google.com/spanner/docs/getting-started/python

Unstructured Example

Here, you might go for BigTable if you’re looking to manage data considered ‘big data,’ however we’ll focus on Cloud Datastore as that may be more tenable for everyday applications.

Module Video: Cloud Datastore Demo [13:39]

Cloud Datastore - https://codelabs.developers.google.com/codelabs/cloud-spring-datastore/

And last, but not least, a warning!

It turns out that using some of the SQL services up in the cloud are not necessarily the cheapest option! Something to consider as you design your applications - that horizontal scaling tends to drive up the price pretty quickly.

Module Video: SQL warning! [7:53]

Labs

There are a couple of Qwiklabs available in Blackboard for you to go through - working with Bigtable on the command line and creating/populating an instance!

Project!

This is to be done in Blackboard, but you will be building a “term-project-lite” for this class. Blackboard will have the details, but I want you to link together five services for a fully-functioning cloud application. Four of those services are your decision, however the fifth must use one of the following:

Cloud Composer
AI Platform / AutoML / Tensorflow
Kubernetes (Container Orchestration - Week 5 topic)
Firebase (Full-Stack App Development - Week 5 topic)
Propose a “difficult” service if you have a particular interest

This assignment has you proposing your topic for me to approve. The amount of effort I’ll be looking for is approximately 2-3 assignments worth of effort (so a functioning prototype that meets agreed-upon specifications). I’ll also be looking for you to do a small writeup and record a demo video that will be due with your final deliverables.

Additional Resources

Where noted, the original content was provided by Google LLC and modified for the purpose of the course, without input or endorsement from Google LLC.

PREVIOUSWeek 7 - Application Programming Interfaces (APIs)

NEXTWeek 5 - Geographic Concerns