Sunday, December 3, 2023

Post exam prep: things I have learned about learning

I finished the book “The Google Cloud Certified Professional Cloud Architect Study Guide” to prepare for my exam, and I wanted to post about everything I was learning. The start was great, not “chapter-by-chapter describing everything perfectly”-great, but I liked how it was going, and I got some nice summaries of the chapters. Up until chapter 6.

I did start writing about the networking concepts in chapter 6, but, to be honest, I cannot put them into words quickly in a way that I would find easy to understand. As for the rest of the chapters, I got impatient. I know that writing out what I have learnt would have helped me cement my knowledge and find the gaps more easily, but I just really wanted to finish the book. I did learn from the later chapters, even if I was already familiar with most of the material on the software development lifecycle and SRE concepts from my workplace.

Besides the technical and business knowledge I’ve gained, here’s what I’m taking away from the preparation itself.

Note: this is all very subjective.

Learning is a bit like working out – the more you do it, the easier it gets, and the more you will like it.

Sports can teach you a lot about this, because most of the time you don’t feel like working out. Or at least I don’t. But I am aware that once I am there, and I am in the flow, the voice that earlier was so convinced that I was too tired and hungry for pizza will disappear. Once the workout is finished, I feel proud and cannot stop smiling.

I am sure the processes behind learning are very different from sports. But accepting that voice, and knowing it might not be saying the things that will help me in the long term, is the same process. Taking cold showers is another example of how to get better at this.

Once you are doing it, focusing on one thing at a time, it can be quite meditative. I observed this at my workplace as well, as I was struggling with procrastination. Once I applied the same principles to getting started on my tasks, I felt like I got things done in a more timely manner.

If something needs to be done, it can be done even if it takes staying up until late.

I do not advise anyone to lose out on sleep – do not underestimate its power! (see “Why We Sleep” by Matthew Walker) Actually, I advise against doing this in general.

But in my case, I used to put things aside to go to sleep early, just to end up watching TV for three hours, this time from the bed. So now I sat down to learn, even if it meant starting after 10pm.

I do feel a certain satisfaction after finishing a chapter/section.

You know the tiny dopamine hit you get after finishing a task, any task? I got it after finishing a chapter. It made me want to do another chapter, and then I had to remind myself that I value sleep a lot and that I want to be productive at work the next day.

Getting into the flow makes it easier to get through the chapter.

After managing to sit down and start reading and taking notes, it was also important to eliminate distractions. I had to turn off the TV, put my phone away, and maybe even turn on my Pomodoro timer when everything else failed. Actually focusing on what I was reading, trying to understand it, and writing it down pushed me into a focused mood. 1-2 hours went by quickly.

Regular breaks are important – especially to protect my eyes.

My eyes hurt when I sit in front of a monitor the whole day. I try not to overuse them, but working in front of a computer all day, scrolling on my phone and watching TV does make them hurt. So to protect them, I take breaks. At some point, I want to look into other ways to help them, but for now, taking regular breaks away from anything that requires focusing my eyes, and just looking out the window, like an old lady, does help.

Writing down what I’ve learned is still the best way for me to make things stick.

I do not know anyone else who does this. I received doubtful looks from my teachers for it. One of them even asked whether it doesn’t take too long. It does take a long time and requires effort. But the effort is spent on the material you want to learn. You reflect on it, you think about it differently, and you understand it even better. If you’ve never tried it, I would say give it a go. If you have a friend to study with, maybe tell each other what you’ve learnt, ask questions, and challenge your understanding. I usually like to do things on my own, which is why I prefer writing.

Notes don’t have to be perfect or pretty for me – that part does not add to my learning experience.

Okay, I did enjoy using different colours in my notes. I like to look at them; they make me feel good about the notes. But I also like to look at my handwriting. I had to change my handwriting at university because others could not read it. As a result, my handwriting is pretty and resembles printed letters. Different coloured lines and drawings do not add enough to the experience to make them worth it. Unless worthiness is not the point, and I am learning for fun.

Learning from a book and taking physical notes is very nostalgic, and reminds me of the hard-working and high-achieving student I used to be – past Lilla was impressive 🙂

The whole experience reminded me of who I was in school. I have not needed to take physical notes since university, which is completely fine. But doing this made me feel closer to the highly driven person I was, always learning, searching for new books, and soaking up all the knowledge. I liked that person. That person did not need to think about adult things.

Exam Prep 5 – Designing Storage System

The next chapter is on storage solutions. There are a lot of them, to match all the possible needs applications could have. The main categories are:

  • object storage
  • persistent local and attached storage
  • relational and NoSQL databases
Flowchart for decisions: https://cloud.google.com/architecture/storage-advisor#decision_tree

Google Cloud Storage is the object storage solution on GCP. It is not a file system: there is no real structure in it, and files are treated atomically, which means that retrieving only part of a file is not possible. Files are arranged into buckets, and files in a bucket share access controls. Bucket names must be globally unique, so it is advisable to include a unique identifier in them. It has four tiers:

  • Regional – data is frequently accessed, and present in one region
  • Multiregional – data is frequently accessed from multiple regions
  • Nearline – for data accessed less than once per month
  • Coldline – data is accessed once per year or less
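As a memory aid, the tier decision above can be sketched as a tiny helper. This is just my own illustration (the function name and thresholds are mine, following the book’s rules of thumb), not anything from GCP itself:

```python
def pick_storage_class(accesses_per_year: int, multi_region: bool) -> str:
    """Rough sketch of the four-tier decision described above."""
    if accesses_per_year <= 1:
        return "coldline"        # accessed once per year or less
    if accesses_per_year < 12:
        return "nearline"        # less than once per month
    # frequently accessed data: choose by where the readers are
    return "multiregional" if multi_region else "regional"

print(pick_storage_class(1, False))    # archival data
print(pick_storage_class(6, False))    # occasional access
print(pick_storage_class(100, True))   # hot data, global readers
```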

Cloud Filestore is a network-attached storage service. It is used mostly with GCE and GKE, and can be attached to multiple instances.

There are multiple databases available as well. Relational DBs follow the ACID principle.

Cloud SQL is a managed relational database offering, with MySQL, PostgreSQL and Microsoft SQL Server instances.

Cloud Spanner is a globally scalable SQL DB on GCP. Ideal for applications that need to be available in multiple regions of the world.

BigQuery is a data warehouse used for analytics. It supports SQL. You pay based on the data you use, not the data you store.

NoSQL databases use flexible schemas.

Cloud Bigtable is a NoSQL database used for data analytics. Perfect for IoT projects.

Cloud Datastore is a document-based NoSQL DB. Its successor is Firestore, which is advised for web applications requiring flexible schema.

Cloud Memorystore is a managed Redis. Used for caching.

GCP encrypts data at rest. The user has to take care of data retention and lifecycle management. Networking and latency have to be taken into consideration as well when designing an application using cloud storage.

The review questions in this chapter went better than the others so far. I did go through them twice, to make sure that my usual issue of writing down a different letter than what I chose is not happening – I’ll keep doing that for future chapters.

Exam Prep 4

In this post, I will go over two chapters: one on technical requirements and the other on the Compute services.

Designing for Technical Requirements

In this chapter, there were mainly three broader categories of technical requirements discussed: high availability, scalability and reliability.

Availability is the “continuous operation of a system at sufficient capacity to meet the demands of ongoing workloads”, and is measured in percentages. What this means in practice is that requests coming from clients should be responded to promptly. Many types of failures affect the availability of a service, starting from bugs introduced by developers to DNS server hiccups.
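Since availability is expressed in percentages, I find it useful to translate the “nines” into an allowed downtime budget. A quick sketch of the arithmetic (the helper is my own, and it assumes a 30-day month):

```python
def downtime_budget_minutes(availability_pct: float,
                            period_minutes: int = 30 * 24 * 60) -> float:
    """Minutes of allowed downtime per period for a given availability percentage."""
    return period_minutes * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% availability -> {downtime_budget_minutes(pct):.1f} min/month")
```

So moving from two nines to three nines shrinks the monthly budget by a factor of ten, which is why each extra nine gets disproportionately harder.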

For Compute services, Google already takes responsibility for the availability of the infrastructure in most places. They are not responsible for bugs resulting from coding errors, but in case of a physical server issue, they do perform live migration of the VMs. There are still some things that the Cloud Architect or DevOps engineer has to take care of, like making sure to use redundant resources and IaC. The more managed the service is (Compute Engine -> Kubernetes Engine -> App Engine), the less there is a need to care about compute availability.

Storage services on GCP can be set up to have automatic backups and to run in different zones (which correspond to separate physical locations within a region), so Google is already taking care of most of the availability settings. Even for persistent disks, only minimal effort is required to make them as available as needed, by providing a redundant copy or resizing them to fit growing file sizes.

With networks, two things have to be mentioned for availability: redundant connections and Premium Tier networking.

Scalability is the ability of software to grow (or shrink) based on incoming traffic. There’s horizontal scalability – which basically means being able to deploy another instance of the service to make it work with higher loads, and there is vertical scalability – meaning increasing the disk, memory and CPU of an existing instance to be able to respond to more requests. Managed services often offer auto-scaling, Kubernetes has a configurable auto-scaler object type, and managed instance groups can also scale by increasing or decreasing the number of instances in them.
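The horizontal case is essentially what an autoscaler computes: given the current load per instance and a target, how many replicas are needed. A toy sketch mirroring the shape of the Kubernetes desired-replica rule (the clamping defaults are my own choice, not anything standard):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """desired = ceil(current * current_metric / target_metric), clamped to limits."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(3, current_metric=90, target_metric=60))  # scale out
print(desired_replicas(4, current_metric=20, target_metric=60))  # scale in
```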

Reliability is covered by most of the availability practices. Redundancy and following DevOps best practices are important for reliability. SRE practices make sure that monitoring, alerting, incident response and post-mortems are properly set up.

Designing Compute Systems

In this chapter, there were four GCP offerings discussed in more detail. I have heard from colleagues that Anthos has recently been added to the curriculum, so I will definitely have to look into that more.

The four services were Compute Engine, App Engine, Kubernetes Engine and Cloud Functions. I have a few years of experience with both Compute and Kubernetes and at least one year with Cloud Functions.

Compute Engine is an Infrastructure-as-a-Service offering, providing virtual machines running in data centres all around the world. There are numerous machine types, each fitting different use cases, some with more and different vCPUs, some with more memory, and there are also custom/configurable types. Service accounts are used by software that does not have a user account but still needs permissions for operations on GCP; they can and should be associated with VMs. Persistent disks make sure that data survives the virtual machine, and the key management system can be used to secure the app further. Shielded VMs should be used when security is extremely important. Instance groups can be managed (started from a template) or unmanaged (mostly for load balancing).

Kubernetes Engine is the managed Kubernetes offering on GCP. It lets you run your own Kubernetes cluster, with very little need for configuration.

App Engine is a Platform-as-a-Service offering. It can run application code in a container without needing to configure the underlying infrastructure. It comes in Standard (supporting only certain languages) and Flexible (more general) variants.

Cloud Functions are handy when a piece of code needs to run after something happens: in CF, functions are triggered by events. These events can happen in other GCP-managed services, and they can also be configured with webhooks. Even Stackdriver logs can trigger a cloud function via Pub/Sub, the messaging service.
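For example, a Pub/Sub-triggered background function in Python is just a handler that receives the event payload and a context. A minimal sketch (assuming the standard base64-encoded `data` field of a Pub/Sub event; the message content is made up):

```python
import base64

def handle_pubsub_message(event: dict, context=None) -> str:
    """Background-function-style handler: Pub/Sub delivers the payload
    base64-encoded in event["data"]."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    print(f"Received message: {payload}")
    return payload

# Locally, we can simulate the event envelope Pub/Sub would deliver:
fake_event = {"data": base64.b64encode(b"disk usage above 90%").decode("utf-8")}
handle_pubsub_message(fake_event)
```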

It is highly advised to use all this with some sort of Infrastructure-as-Code – Deployment Manager is perfect for it.

I did the tests at the end of both chapters and although I made mistakes, I feel like I am going the right way. I’ve been thinking about signing up for the exam soon, but first I want to finish the book. I do have a date in mind, in any case.

Exam Prep 3 – Business Requirements

This chapter was done in two parts + this post on another day. It was a big one, but not big enough to be done in three parts. However, I do want to give myself time when I need it, and this time I managed to get some quality Christmas feasts in there, and I was also trying some personal things. I’ll talk about them once they have evolved some more. Hopefully, they will.

The title of this chapter is “Designing for Business Requirements”. In the beginning, it is explained that engineers are usually shielded from talking to business colleagues by the architect, which means the architect has to speak two languages. She has to understand the business use cases and the product strategy: the high-level objectives of the project. To be fair, I do think understanding what the product is trying to accomplish is important for the engineers as well, as they will be making decisions throughout the implementation, and they should be able to understand the implications of the different approaches.

The book gives some details about each of the use cases. I will not include them here, as they are incomplete, in the sense that the exam requirements have changed since the book was published. I do want however to write another post where I will try to understand the business requirements of each example exam project, and how they affect the technical decisions that need to be taken.

Application Design and Cost Considerations

In a business, cost should not be the main driver. In my opinion, the main driver should be the value it brings to its users. However, cost and finances are the enablers of the company. They do affect capital and operating expenses.

When developing a piece of software, the total cost of ownership (TCO) is not always evident in the beginning. There are a lot of hidden costs that engineers like myself do not think about, but cloud architects should:

  • software licensing costs
  • cloud computing costs
  • cloud storage costs
  • data ingress & egress charges
  • cost of DevOps personnel
  • cost of third-party services
  • missed SLA charges
  • network connectivity charges

Managed Services

Managed services are a powerful set of GCP features. They are offerings whose operation is much less complicated and less expensive than maintaining virtual machines that provide the same services, or developing the same things in-house. These services are monitored automatically by Google, there’s no need for very low-level fine-tuning (they usually autoscale), and the costs are hard to compete with.

Preemptible VMs

Preemptible VMs are virtual machines with a limited running time, offered as an alternative to conventional virtual machines. They are much cheaper and are meant for workloads that do not require a single machine throughout their lifecycle. Examples are batch jobs, services not requiring high availability, and stateless applications. They might be shut down at any time by GCP, and will surely be shut down after 24 hours, with a 30-second grace period. To replace them automatically, managed instance groups can be used. They can also be used with certain managed services.
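That 30-second grace period arrives as a shutdown signal, so a preemptible-friendly worker should catch SIGTERM and checkpoint before exiting. A minimal sketch of the idea (the work and the "checkpoint" are hypothetical stand-ins):

```python
import signal

shutting_down = False

def on_sigterm(signum, frame):
    """Flip a flag so the work loop can checkpoint and exit within the grace period."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, on_sigterm)

def work_loop(batches):
    done = []
    for batch in batches:
        if shutting_down:
            break                 # checkpoint here and let the VM go
        done.append(batch * 2)    # stand-in for real work
    return done

# Simulate a preemption notice locally by invoking the handler directly:
on_sigterm(signal.SIGTERM, None)
result = work_loop([1, 2, 3])     # stops immediately, nothing processed
```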

Data Lifecycle Management

From what I have seen so far, data and databases and data management are a big part of this exam, and it’s one of the toughest, as GCP offers many data/storage services. You have to be very familiar with what each one is used for, so you can fine-tune your application in the cloud.

There are multiple categories for each storage option:

  • Memorystore: for caching
  • Databases: CloudSQL, Datastore
  • Time-series databases, where the data will be aggregated over time (ex. today we want to see how the app behaves every hour, but in 1-2 weeks, we only care about the day as a whole)
  • Object storage: multiregional and regional are used for frequently accessed data, nearline for data accessed at most once per month, coldline for data accessed once per year or less
  • Data warehouse:
    • BigQuery has a 2-tiered pricing model:
      • active data (more expensive): tables modified within the last 90 days
      • long-term data (cheaper): tables not modified for 90 consecutive days
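To see what the two tiers mean for a bill, here is a back-of-the-envelope sketch. The per-GB prices are illustrative placeholders, not current GCP pricing; the rule modeled is that tables untouched for 90 days fall into the cheaper tier:

```python
def monthly_storage_cost(gb: float, days_since_modified: int,
                         active_price: float = 0.02,
                         longterm_price: float = 0.01) -> float:
    """Tables not modified for 90 consecutive days are billed at the long-term rate.
    Prices are made-up per-GB placeholders, not real GCP pricing."""
    rate = longterm_price if days_since_modified >= 90 else active_price
    return gb * rate

print(monthly_storage_cost(500, days_since_modified=10))   # active tier
print(monthly_storage_cost(500, days_since_modified=120))  # long-term tier
```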

System Integration and Data Management

In this chapter, the book went through all the exam projects and checked what kind of data storage each should use. As homework for myself – besides checking each new exam project for high-level business requirements – I will look at the new projects and compare which storage solution fits which one best. I also plan to do some research on what others think, if time lets me.

Data Management Business Requirements

There are some things that the business colleagues should be asked about when it comes to data management, e.g. lifecycle policies and retention periods:

  • How much data will be collected and stored?
  • How long will it be stored?
  • What processing will be applied to the data?
  • Who will have access to the data?

Compliance and Regulations

For the most part, GCP is already doing a lot to help cloud apps stay secure; for example, it encrypts data at rest. The user should still be aware of how to store protected data securely: IAM, firewalls, Identity-Aware Proxy, the principle of least privilege, and defence in depth – which means assuming that things can break, and protecting every layer in the best possible way.

Some policies that are mentioned in the book as well:

  • Health Insurance Portability and Accountability Act (HIPAA)
  • General Data Protection Regulation (GDPR)
  • Sarbanes-Oxley (SOX)
  • Children’s Online Privacy Protection Act (COPPA)
  • Payment Card Industry Data Security Standard (PCI DSS)

Data Integrity Regulations

Vulnerability scanning and anti-malware applications should be in place. Protection against fraud should be cared about as well.

Security

  • Confidentiality – limiting access
  • Integrity – only the right entities should be allowed to change data
  • Availability – DDOS protection, redundant systems and failover mechanisms

Success Measures

KPIs and ROIs.

I also finished the quiz, 14/19. Which is not a bad score, but not good enough for me either. I made the mistake of writing down the wrong letter again. This mistake cost me 0.25 points once on my 12th-grade final computer science test – otherwise, it would have been a perfect 10. So besides putting more effort into data analytics on GCP, the additions to this exam since the book was published, I have to triple-check the letters I choose and write down. And I also have to take prettier notes so I can post some pictures together with my blog posts.

Exam prep 2

Chapter one was “Introduction to the Google Professional Cloud Architect Exam”. It felt more like a high-level description of the topics of the exam, as the introduction itself already happened at the beginning of the book.

The very first page of the chapter started with a list of all the aspects of software architecting for the cloud.

Designing & Planning

I was hoping to read about each one of these separately, which did not happen, but they were present in the other topics that came up.

The first is analyzing the business requirements. These usually come from non-technical people and require a big-picture mindset. They should be the very first step in planning and analysis, as they are the closest to how the end customers perceive the piece of software. Therefore, they do include SLOs, but can also be related to speed of development, compliance and capital expenditure – which is basically the long-term investment into machines, plants, buildings etc.

The greatest benefit of moving an application to the cloud is the reduction of operational expenses. On GCP, it is very cheap to have your infrastructure in multiple physical data centres, even on different sides of the world. That means fewer people plugging in cables and having to press the On button when someone accidentally “sudo shutdown now”s. It also means having the option to delegate the management of some services (like databases or queueing) to GCP. Or using only the computing resources that are needed, by having autoscaling or allowing preemptible VMs (VMs that GCP can reclaim at any time and shuts down after at most 24 hours) – all of which saves a huge part of the operational expenses.

Another business requirement could be accelerating development. The cloud architect should be familiar with agile methodologies, but also with how managed services, CI/CD and microservices can support developers in pushing out new features and bug fixes faster. Of course, all these things have to be put into context. Sometimes it is better to keep managing a low-effort service and have the engineers focus on new features or on making the existing ones leaner and stronger. The same goes for microservices. And if there is pressure to move to the cloud because the current infrastructure cannot support the user base properly, it can be better to “lift and shift” than to spend double the amount of time rewriting everything to fit GCP.

Reporting on SLOs (service-level objectives) is about monitoring the application and making sure it delivers the level of service the end-users are expecting. And the most general things users usually expect are availability (is the service reachable and does it do what I want it to do?), reliability (does the service work well under higher load, or do timeouts pop up?), scalability (is the service able to serve a larger user base when needed, but also save resources when the users are not online?) and durability (how probable is the accessibility of the stored data after some time?). These four also appear as nonfunctional requirements later on.

Reducing the time to recover from an incident is also an important business requirement. The book defines incidents as a disruption of the service that causes degradation or unavailability. When there is an outage, usually, the users cannot consume the service. There’s not just this direct impact that leads to loss of revenue, but if a service does not come back up fast, there is also a chance of bad press (loss of reputation), loss of users, and loss of data. Setting up proper monitoring, alerting and logging can optimize the time it takes to notice that something is wrong, and, in some cases, it can also prevent incidents.

Compliance means respecting the legal regulations and protecting users as best as possible. There are five industry regulations mentioned:

  • HIPAA (health)
  • COPPA (children’s privacy)
  • SOX (financial)
  • PCI (credit card payments)
  • GDPR (privacy)

Technical requirements can be broken down into functional and non-functional requirements. The non-functional ones should be measured by SLIs (metrics for SLOs, the SRE book has more info on them).

For technical requirements, there are mainly three areas:

  • Compute – virtual machines and containers (Compute Engine, App Engine, Kubernetes Engine)
  • Storage – SQL (Cloud SQL, Spanner, BigQuery), NoSQL (Bigtable, Cloud Datastore) and archival storage or buckets in Cloud Storage
  • Networks – VPCs and hybrid cloud (when the company network has to be connected to a network in the Google datacenters)

This chapter was actually really good at refreshing my memory on storage solutions and giving an overview of what technical things will be needed. It also had the case studies in there, but two of them were missing, so I had to look them up online, and one was taken out this year. I am planning to read them again next time I sit down for chapter two. This time I got 14/15 on the review questions at the end of the chapter. The only thing I got wrong was related to a specific business requirement. I have to make sure to read the question twice and understand the circumstances described better.

Exam prep – part 1


Don’t mind the mess, it’s part of the creative process



I have been working as a software engineer for a while now. I have spent all of my work time writing tools and scripts to support the application developers with their operations.

My very first project was a vulnerability scanner for Docker images. I felt the agility starting from day -1: I talked about Java during my interviews, I signed a contract for front-end development, and when I joined, I was handed an image scanner written in Go. I was excited, as everyone should be about their first real job. And to be fair, I got a great story out of it, and a little lesson about getting used to being flexible. It also gave me female IT friends, a group of amazing people with whom I still catch up from time to time.

During this first experience, I felt appreciated for my backend skills, and even peeked into EC2 on AWS. I have some vague memories of trying to make machines accessible by modifying some IP addresses, but I do not remember much about the exact issue. What I do remember is that I felt intrigued and challenged. I wanted more of it. (I also remember running Kubernetes on Docker and not managing to make minikube work with the VPN – but that’s for another day)

And I did get more of it. In my current position I have been part of the infrastructure team, with multiple internal dev teams as clients. During these years, my exposure to the cloud and cloud providers increased. I have learnt from people with vast knowledge of virtual machines, load balancers, storage, networking and dealing with developers. The internet also helped.

Lately, we have been focusing more and more on the cloud provided by Google. So it was a natural step to try to challenge myself and go for the Professional Cloud Architect exam when the opportunity arose at the company. I signed up for a virtual class and followed the lessons on Coursera and Qwiklabs.

But then either life or my fear of exams or my fear of big books got in the way. I wanted to finish the Study Guide before actually taking the exam. I got to the second chapter and “took a break”. Did I say that it is a big book? Not Knuth-level, but bigger than my usual 120-page, big-lettered fiction books.

Today I did manage to pick it up and start over again. And I do want to recapitulate here what I have learned from the intro. I also tried using flow-based notes.



How do you illustrate feasibility?

The very first thing explained is that this exam is not about a developer’s knowledge of how to get things done in GCP. It is targeted at architects, or people doing architecture work. My knowledge of clicking around will not be the right kind of knowledge, and it will not be enough, but it will certainly help.

The exam measures the capability of solving business problems while keeping technical requirements in mind. On the business side, it is about understanding the role of the solution. On the technical side, it is about finding the right balance between writing pretty code and pushing out new features. And in order to get these right, we must think not just about the design of the application, but also about the infrastructure around it, how we manage the data, and the deployment lifecycle. And about how the software will evolve when requirements change. Which means getting out of the programmer mindset (writing code to solve stories) and doing more software engineering.

Things got more technical

There was also a list of Objectives at the end of the introduction. This part was a more detailed list of the above-mentioned requirements. It started from designing and planning solutions directly fit for the cloud, went through setting up infrastructure, being compliant, and implementation, all the way to keeping the application alive and reliable. All things that need to be kept in mind for cloud applications.

At the end, there was an assessment test. I got 20 out of 25 questions right. I had already done this test some months ago, but still, I am glad to see the knowledge did not completely get away from me. I failed mostly on the data analysis questions, but also once on storage. And there was one question I knew the answer to, but wrote the letter down wrong. This mostly means I have to pay extra attention during the chapters on big data and storage.

I did promise myself to finish one chapter, but I finished the introduction instead. I still count it as a win for today.

Building Resilience

In the beginning of the year we had a conversation with my manager about my yearly goals. Due to the nature of my current project and the ne...