During the weekend I learned the Vert.x toolkit in Java. Vert.x is a toolkit for building scalable, efficient, high-performance applications. It is a reactive toolkit, which means you can create non-blocking applications with a few lines of code. One of the main concepts in Vert.x is the Verticle. A Verticle is like an instance of some process within a Vert.x application: imagine that your Vert.x application is a big box and you can add things inside the box. These things can't communicate with each other directly; they need to communicate through a message bus. This message bus works with a publish/subscribe pattern: you publish a message to a specific address, and any Verticle subscribed to that address receives the message. Another important concept in Vert.x is the event loop. The event loop is a single thread responsible for executing the Verticles' code, which means you can have multiple Verticles running on the same thread; this is a good way to create non-blocking applications. I created a simple HTTP server using Vert.x with an endpoint that receives JSON data and then saves that data into a database, everything in a non-blocking/asynchronous way. Vert.x is a beautiful tool; building this type of application in plain Java would take a great effort just to manage the threads and the non-blocking operations.
Today I needed to work on a curious issue. I had to look at how a JSON document is mapped into a Java object. We use Gson to do this mapping, and we use some custom Gson features to customize how specific JSON fields are mapped to object properties. The problem is that, because of these complex serialization customizations, we don't know where some fields come from; for example, the JSON field is called username but in the Java object it is called ownerName, and you need to review the source code to understand this. I think it is bad practice to have this kind of customization in the serialization process; it is better to use the same name in the JSON and in the Java object, which is clearer and easier to understand.
Today I worked in an intelligent way. I needed to filter a JSON file to check whether some elements meet certain criteria. My first approach was: well, I can write a Python script to do this process with code and avoid the manual work. But then I thought: hey, I can use Spark to read the JSON file into a DataFrame, filter the DataFrame, and get the desired results. It was a very good moment. Spark is a very good tool for these types of tasks; you can easily read a CSV, Parquet, JSON, or other file format into a DataFrame and then apply transformations and filters to it.
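A minimal sketch of that Spark approach, assuming a hypothetical newline-delimited `events.json` file with `status` and `amount` fields (the file name and criteria are made up for the example):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("json-filter").getOrCreate()

# Spark reads each JSON line into a row of a DataFrame.
df = spark.read.json("events.json")

# Keep only the elements that meet the criteria.
matches = df.filter((col("status") == "active") & (col("amount") > 100))
matches.show()
```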
Today I faced an issue with an API. For an unknown reason, I am calling an API following the official documentation, but the response of the call is an error. I don't like very generic errors in APIs like `{"message": "API Error"}`. I remember when I wrote some web APIs in one of my jobs, I always tried to create good response messages with error codes, a common error name, and a good description. For me it is very bad that a big API with a lot of users has these poor error messages.
Today, reading the Software Engineering at Google book, I read about the haunted graveyards concept. At Google, a haunted graveyard is a software system that nobody wants to work on. Most of the time these systems grow over time and become a big problem to maintain: the software is not documented, not tested, not maintained, etc. Google discovered that not writing unit tests is the first signal that a software system will become a graveyard over time.
During the weekend I started a Clojure application just for fun. I wanted to learn a new programming paradigm, and functional programming is a good way to start. I created a simple HTTP server application that receives input from the user, parses the request body as JSON, saves the JSON into a file, and finally saves some data into a SQL table. Clojure is a very interesting language; you can do multiple things with a few lines of code, but the syntax is very strange. Well, I practiced and enjoyed the time creating the application.
Today I designed two SQL tables to store reference data. The idea is to have one SQL table with data that is not meant to be modified, used only as a reference, and a second table that records the updates/events that happen to the data in the first table. I will continue working on this tomorrow.
Today, reading the Software Engineering at Google book, I understood that Google manages external dependencies with much caution. External dependencies are always a risk for any software project, because you are not in control of the code; if a new vulnerability appears, you have to wait until the dependency provider releases a fix. Maybe you start a project and, over time, the maintainers stop maintaining the dependency and you end up with a big issue in your application. I also practiced some Java code, creating a simple project with two internal modules; the idea was to divide the project into modules and manage it with Gradle. The main application uses the module created in the project. I only did this exercise for fun.
Today I continued reading the Software Engineering at Google book. I read the Code Review and Code Analysis sections and learned about the Critique and Tricorder tools. I also wrote some Python code with PySpark to find duplicate elements in a Parquet file.
Today I learned how to predict data with linear regression. [Linear regression](https://www.ibm.com/topics/linear-regression) is a statistical model to predict a variable based on the value of another variable. I used this model to predict how the data grows over time, because we have an issue with a data pipeline that ingests data from an API into a Parquet file, and I want to know if the daily data is correct, since we lost a lot of data in the Parquet file recently. With the predicted values I can quickly check whether the data is correct or not. The linear regression model is very simple to implement: you only need to have the data in a DataFrame and fit the model to it. The model returns the slope and the intercept of the line, and with these values you can predict the next value of the data. I learned new data science things today, but I am not a data scientist, I am a software engineer. I love to write code and solve problems with code; I think that is my passion. Data science is great, but, I don't know, for me it is better to create classes and objects and build amazing software products.
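A minimal sketch of how I used it, with made-up daily record counts and `numpy.polyfit` to get the slope and intercept (the column names and numbers are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical daily record counts from the pipeline.
df = pd.DataFrame({"day": [0, 1, 2, 3, 4],
                   "records": [1000, 1050, 1098, 1152, 1201]})

# Fit a line: records ≈ slope * day + intercept
slope, intercept = np.polyfit(df["day"], df["records"], deg=1)

# Predict the expected count for the next day and compare it with what was ingested.
next_day = df["day"].max() + 1
expected = slope * next_day + intercept
print(f"expected records for day {next_day}: {expected:.0f}")
```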
Today I learned some data engineering advice for working with large quantities of data. In data engineering terminology, a delta file is simply a file that contains the differences from a previous file, to be appended to the historical data.
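A minimal sketch of the idea with PySpark, assuming hypothetical paths for the delta and historical Parquet files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-append").getOrCreate()

# The delta file only contains the new/changed rows since the last load.
delta = spark.read.parquet("/data/sales_delta.parquet")

# Append only the differences to the historical data instead of rewriting everything.
delta.write.mode("append").parquet("/data/sales_history.parquet")
```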
I continue reading the Software Engineering at Google book; I am reading the Build Systems and Build Philosophy section. At work I am facing an interesting issue with some data integrity. The problem itself is not complicated, but we are using a declarative implementation of Airflow pipelines: we write the pipelines in YAML syntax, and Airflow does the magic in the background with Python parsing that syntax into Python code. The problem is expressing the logic in this YAML syntax; with code this problem could be solved easily, but with a custom declarative implementation it is a little hard, because you have to replicate the code with some weird workaround. I will continue working on this issue tomorrow.
Today I continued reading the Software Engineering at Google book; I am reading the chapter about Code Search, the internal Google tool used to search code across the entire Google codebase through a web UI. Reading about the tool, I did some experiments with Python. I have a big text file with more than 1 million words, and I created two implementations to search for a word in the file: the first is basically a for loop with O(N) complexity; the second uses an index to look up the word in O(1), but you first need to create a separate file with the index. I also studied a little Data Science, creating some fake sales data and then building basic statistics graphs like the quantity of products sold per month, the best-selling product, etc. For me Data Science is a curiosity topic; I don't have a lot of experience in this field, but I am learning.
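A minimal sketch of the two implementations (for brevity, the index here is an in-memory set instead of a separate index file; `words.txt` is a hypothetical one-word-per-line file):

```python
def linear_search(path: str, target: str) -> bool:
    """O(N): scan every word in the file."""
    with open(path) as f:
        for line in f:
            if line.strip() == target:
                return True
    return False

def build_index(path: str) -> set[str]:
    """One-time O(N) pass that builds the index."""
    with open(path) as f:
        return {line.strip() for line in f}

def indexed_search(index: set[str], target: str) -> bool:
    """O(1) average lookup once the index exists."""
    return target in index

index = build_index("words.txt")
print(linear_search("words.txt", "spark"), indexed_search(index, "spark"))
```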
Today I worked on a strange issue: for an unknown reason a Parquet file lost a lot of data, let me explain. We have some data in a Parquet file, more than 6 million records, and there is something wrong in the data pipeline, maybe a filter or some bad behavior, because we lost 6.9 million records from the file. The pipeline is very complex: it extracts the data from an API, splits that data into 9 DataFrames, and applies multiple modifications to each dataset to finally upload the data into a Parquet file. Working on this issue is very hard because I need to review the entire pipeline to find the problem. I think the issue is in the data split process; I will continue working on it tomorrow.
Today I read the deprecation chapter of Software Engineering at Google. This chapter is about how to deprecate a feature or an entire software product. Google has a clear view that if you don't deprecate a system, two things can happen: the system becomes expensive to maintain, or the system is forgotten. At work we are currently deprecating some applications in favor of a new one; the process is not easy and requires a lot of time and effort.
Today I continued reading the Software Engineering at Google book; I completed the testing part and started the Deprecation section. Also, at work I attended some classroom sessions related to AI and LLMs. I learned about the RAG concept, explained by other amazing developers. I love learning by listening to other developers; I think it is a good way to learn and share knowledge.
Today I learned about regression testing. Regression testing involves executing tests to ensure the system continues to work after changes, such as bug fixes or new features, are applied. The problem with regression testing is when it's manual: manual regression testing is time-consuming, and the time grows linearly with the number of features/components to test and the journeys (the scenarios to test, like adding a new user, buying a product, etc.). We can express the human time required with the formula T = k * S * J, where k is the time required to test one journey, S the number of components, and J the number of journeys per component. For example, testing a system with 10 components, 3 journeys per component, and 10 minutes per journey requires 300 minutes. This is a lot of time, and for this reason automated testing is better. I continue reading the Software Engineering at Google book; the testing part is very interesting for me.
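The same estimate expressed as a tiny helper, using the hypothetical numbers from the example above:

```python
def manual_test_time(minutes_per_journey: float, components: int, journeys_per_component: int) -> float:
    """Total human time (in minutes) for a full manual regression pass: T = k * S * J."""
    return minutes_per_journey * components * journeys_per_component

# 10 minutes per journey, 10 components, 3 journeys each -> 300 minutes.
print(manual_test_time(10, 10, 3))
```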
Today I learned what Google calls larger tests: a larger test is a test in which many components interact; a component can be a backend server, a web server, a database, etc. The idea of a larger test is to test the system as a whole, with the same dependencies as the production environment. Larger testing for microservices can be a challenge depending on the number of microservices involved, because for each microservice added to the application the tests can grow exponentially. It is interesting how Google requires unit testing, integration testing, and larger testing for any software developed; testing at Google is a must for any code written.
Today I studied testing techniques used to create unit tests for code that has external dependencies. Test doubles are objects used to replace real object implementations; a test double needs to follow the same API and have behavior similar to the real object. Stubs are most of the time created with test libraries like Mockito for Java; creating a unit test with stubs is very easy and fast. The trade-off is that stubs cause brittle tests, and most of the time they are not good for testing behavior, because if the code behavior changes you will need to change the tests to mimic the new behavior. Finally, mocks are normally used when you don't want to build a full test double or use a real object. You can mock any object by following the dependency injection pattern. Being aware of these testing techniques is a good tool for any developer; testing is an important part of software engineering.
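The examples I studied are Java/Mockito oriented; as a rough Python analogue, here is how the stub-versus-mock distinction could look with `unittest.mock` (the `gateway` dependency and `checkout` function are hypothetical names, not from the book):

```python
from unittest.mock import Mock

def checkout(gateway, amount):
    """Code under test: charges the gateway and returns a receipt dict."""
    charge_id = gateway.charge(amount)
    return {"charge_id": charge_id, "amount": amount}

# Stub style: only a canned return value; we assert on the result.
gateway = Mock()
gateway.charge.return_value = "ch_123"
assert checkout(gateway, 50) == {"charge_id": "ch_123", "amount": 50}

# Mock style: we also verify the interaction (behavior) with the dependency.
gateway.charge.assert_called_once_with(50)
```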
Today, I read the Software Engineering at Google book. I learned about some interesting unit testing practices like Behavior Test Driven Development (BTDD). This approach focuses on testing the behaviors of the system rather than implementation details. In the past, I always thought that one method/function equaled one unit test. I now agree that testing this way causes brittle tests and many errors when the code changes. Another good practice is to create complete and concise tests. This means the test needs to have a good name and be comprehensive. You might think about creating objects in a separate method, but sometimes this is a bad practice because it hides the test's complexity. The book has four chapters dedicated to testing. Google really loves testing in software.
This morning I took the exam for my Airflow Fundamentals certification. The exam went well; I passed with an 84% score. I am very happy because I studied a lot about Airflow to pass the exam, but I feel very tired because I decided to take this certification due to a free opportunity to take the exam while studying for another two cloud certifications (AWS and IBM). I don't want to be in that situation again; next time I will think twice before taking three certifications at the same time 😅. Now I am more relaxed and I can focus on continuing to develop some DAGs and my other study topics.
Today I was working on a Spark issue related to parallelism. I was working with a data pipeline that reads data from a Parquet file, applies some modifications to that data, and finally saves it back to the same location in overwrite mode. The problem was that an error happened while saving the data: while Spark is reading the data from the file, other workers are removing and writing data in the same location, like a race condition. The solution was to cache the RDD and then save the data to the location from that cache. This issue was crazy because I had been working on it for more than 6 days.
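A minimal sketch of the workaround as I understand it, assuming a hypothetical Parquet path and filter; `cache()` plus an action like `count()` materializes the data before the overwrite starts deleting the input files:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("overwrite-fix").getOrCreate()
path = "/data/events.parquet"

df = spark.read.parquet(path)
transformed = df.filter(col("status") == "valid")

# Materialize the data so Spark is no longer reading the input files
# while the overwrite deletes them.
transformed.cache()
transformed.count()  # forces the cache to be populated

transformed.write.mode("overwrite").parquet(path)
```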
Today I completed the Airflow 101 Learning Path. I took this course to prepare for the Airflow Fundamentals exam. I learned the basics of Airflow and some advanced concepts. I think I am ready for the exam, and I will take it in the next few days.
Today I continued studying Airflow sensors. I created example DAGs to test the sensor functionality. I learned that by default sensors have a timeout of 7 days to wait for the condition to become true, which can be a problem for some use cases; also, the poke_interval is 60 seconds by default. Sensors are a good feature for waiting until something happens before performing another action. I want to create a pipeline that uses sensors, but for the moment I don't have an idea of what to build with them.
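A minimal example DAG with a sensor, in the spirit of the ones I created; the file path and connection id are hypothetical, and the `timeout`/`poke_interval` values shown are just the defaults made explicit (Airflow 2.x-style imports):

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="wait_for_file",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    wait = FileSensor(
        task_id="wait_for_report",
        filepath="/tmp/report.csv",          # hypothetical file to wait for
        fs_conn_id="fs_default",             # hypothetical filesystem connection
        poke_interval=60,                    # default: check every 60 seconds
        timeout=timedelta(days=7).total_seconds(),  # default: give up after 7 days
        mode="poke",
    )
    process = BashOperator(
        task_id="process_report",
        bash_command="wc -l /tmp/report.csv",
    )

    wait >> process
```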
On Saturday I continued studying the Airflow 101 course. I learned how Airflow sensors work: sensors are a way to wait for a condition to be true before continuing with the next task, like waiting for something to happen before continuing with the work, for example: waiting for a file to be created in a directory, waiting for an HTTP request to return a 200 status code, etc. I also learned some DAG debugging techniques for the most common problems in Airflow, like the DAG not running because the scheduler is not running or is busy processing other tasks, the DAG not running because the start date is in the future, or the DAG not running because the schedule interval is not correct.
Today I learned how to share data between tasks in Airflow using XComs. XCom values are like having two functions where function B uses the value returned by function A. XComs are a good way to share data between tasks, but they are not recommended for sharing large amounts of data. The maximum size of an XCom value depends on the database you are using as the metadata database; for example, with SQLite you can share values up to 2 GB in size. Also, the values shared as XComs need to be JSON serializable.
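A minimal sketch with the TaskFlow API, where the return value of one task is passed to another through XCom; the task names and payload are made up for the example:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def xcom_example():
    @task
    def extract():
        # The return value is pushed to XCom automatically (must be JSON serializable).
        return {"user": "lobo", "score": 42}

    @task
    def load(payload: dict):
        # The argument is pulled from XCom behind the scenes.
        print(f"saving {payload['user']} with score {payload['score']}")

    load(extract())

xcom_example()
```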
Today I practiced multiple ways to filter a DataFrame with PySpark and Python. I needed to do this because I had to remove some unwanted records from a DataFrame. I learned how to filter using PySpark operations like `where` and `filter` together with functions like `col` and comparison expressions, and I also created a temp view in Spark to apply the filter using SQL syntax. I discovered later that with `df.filter()` you can pass the SQL query as a string argument to the `filter` method. PySpark's flexibility to perform these operations is good, but having multiple ways to do the same operation is sometimes a challenge, because you are always discovering new ways to do the same thing.
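A small sketch of those filtering styles side by side, using a hypothetical DataFrame with `country` and `amount` columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("filter-styles").getOrCreate()
df = spark.createDataFrame(
    [("MX", 10), ("US", 250), ("MX", 300)], ["country", "amount"]
)

# 1. where()/filter() with column expressions.
df.where((col("country") == "MX") & (col("amount") > 100)).show()

# 2. filter() with a SQL expression passed as a string.
df.filter("country = 'MX' AND amount > 100").show()

# 3. A temporary view queried with plain SQL.
df.createOrReplaceTempView("sales")
spark.sql("SELECT * FROM sales WHERE country = 'MX' AND amount > 100").show()
```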
Today I learned how the left anti join works in SQL and Spark. I needed to clean some data in a Spark DataFrame using a JOIN; I needed a JOIN because, to filter one DataFrame, I had to use another DataFrame that holds the IDs of the records to clean. I had problems understanding how the left anti join works, but with practice and some drawing I was able to complete the task. In other topics, I continued working on my Airflow certification today.
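A minimal sketch of that cleanup, using two hypothetical DataFrames (`records` with all the data and `to_remove` with the IDs to drop):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("left-anti").getOrCreate()

records = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
to_remove = spark.createDataFrame([(2,)], ["id"])

# Keep only the rows in "records" whose id does NOT appear in "to_remove".
clean = records.join(to_remove, on="id", how="left_anti")
clean.show()
```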
On Monday I finished a DAG that generates a random user every 5 minutes and saves the user in a PostgreSQL database; for the moment the database has 31,920 records. I plan to save the records in a CSV file and upload it to Hugging Face to share the data. In other topics, I completed 5 practice tests for the IBM Cloud Advocate certification, achieving a general score of 90%. I am happy with my current knowledge of IBM Cloud.
I am having problems keeping my two Airflow environments synchronized. I run Airflow with pip in my development environment (personal laptop), and I run Airflow with Docker on my server. For an unknown reason, when I configured a Git repository to keep the two environments in sync, I ran into problems with the Docker environment. I am going to debug the issue tomorrow and maybe consider running both environments the same way.
Today I learned how the BashOperator works in Airflow. The BashOperator is used to execute Bash commands or scripts on the host machine where Airflow is installed. This means that if, for example, you want to save the content of an HTTP request to a file on disk, with the BashOperator you can create a script that performs this action and saves the file in the /tmp directory. I also created a DAG to understand how the BashOperator works: the first task executes a bash script that makes an HTTP request and gets a JSON file, the second task checks that the file was saved correctly, and the third and final task uploads the content of the JSON file into a MongoDB collection. I could not configure MongoDB with the Mongo provider, so I will continue configuring the connection tomorrow.
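A minimal sketch of the first two tasks of that DAG; the URL and file path are placeholders, and the MongoDB task is omitted because I still need to configure the connection:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="fetch_json_with_bash",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    fetch = BashOperator(
        task_id="fetch_json",
        # Hypothetical endpoint; saves the response body to /tmp on the Airflow host.
        bash_command="curl -s https://example.com/data.json -o /tmp/data.json",
    )
    check = BashOperator(
        task_id="check_file",
        # Fails the task if the file is missing or empty, otherwise prints it.
        bash_command="test -s /tmp/data.json && cat /tmp/data.json",
    )

    fetch >> check
```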
Today, I continued studying more about Airflow. I refreshed my knowledge of Airflow concepts: a DAG Run is an instance of an entire DAG that was executed, and a Task Instance is an instance of a Task that was executed; remember, a DAG can be composed of one or more Tasks. Tomorrow I plan to continue studying my Airflow course and finish the prerequisite course before starting with the exam.
Today, I continued doing practice exams for the AWS certification. In other topics, I wrote a bash script that performs an action every ten minutes in an infinite while loop. The idea was to avoid manually executing the command that triggers another process every time the previous run completes; this way, the script attempts to execute the process every ten minutes.
Today, I continued studying for the IBM Cloud Advocate certification. At the moment, I am focusing on answering quizzes for this certification. I needed to revisit this study because the exam is scheduled for the last days of November or the first days of December.
I feel tired—why? Because I am working on achieving four certifications simultaneously:
- IBM Cloud Advocate
- AWS Cloud Practitioner
- Airflow Fundamentals
- Airflow DAG Authoring
The last two were not originally planned in my schedule, but I received a coupon to take both for free (with a 30-day expiration). I have organized a schedule to study for these certifications on specific days, both to avoid burnout and as a study technique. I'll share more details in the future if this strategy proves effective. But please avoid working on multiple software certifications at the same time 😅.
Today I continued studying Airflow; I am working on the next Airflow course. I got the Airflow Fundamentals exam for free with a promotion, so I am studying Airflow to get the Airflow Fundamentals certification. I will need to pause my other current certifications because this promotion has a 30-day expiration, so I only have 30 days to study and prepare for the exam.
Today I continued with the AWS practice exams; for the moment I am getting better results. The key is to memorize the service definitions so I can match a use case with a definition. As for the Software Engineering at Google book, I continue reading the testing part; the book has 4 chapters dedicated to testing, so it looks like Google loves well-tested software.
Today I created a new GitHub repository to store my Dockerfile and docker-compose files. The idea behind this repository is to have all my container files in one place, because I have several Dockerfiles on my personal computer and my two Linux servers. I love working with Docker and setting up my own resources like databases, message brokers, etc., in containerized environments. You can find the repository in the following link. Also, I continued working on configuring my PostgreSQL database; I was setting up a separate container to create a backup of my database every day and store the backup on my server's disk drive.
Today I worked on setting up a PostgreSQL database with Docker. I created a docker-compose file to deploy PostgreSQL with special configurations; I was learning about the best ways to deploy and manage this database within a container, with configurations for performance, security, and health checks. I created and configured the PostgreSQL database to store some data from my Airflow DAG.
Today I continued working on my Udemy course of practice exams for the AWS Certified Cloud Practitioner exam. I also worked on my Airflow DAG; for the moment I am configuring the DAG to load the extracted data into a PostgreSQL database.
Today I discovered some amazing things reading the Software Engineering at Google book. I have been reading the chapter on testing at Google; it looks like testing at Google is a mature process, and any engineer working at Google is required to follow a testing methodology. I discovered the Google Testing Blog, a good place to find useful information about testing. I also learned the Beyoncé rule for testing: "If you liked it, then you shoulda put a test on it"; this rule is amazing xD. In other topics, I continued with my DAG development in Airflow.
Today, I continued working through the Udemy course and completed practice test number 3. Unfortunately, I didn’t pass, scoring 61% (the minimum passing score is 70%). I struggled particularly with questions related to Billing & Pricing and Cloud Concepts. The challenge with AWS is that it has so many services, making it difficult to remember all of them. Since I don’t work with AWS services in my daily activities, it’s harder for me to recall the most common service usages and descriptions. I’ll need to keep practicing and studying to feel fully prepared for the exam.
Today, reading the Software Engineering at Google book, I reviewed an important topic in software engineering: documentation. Documentation focused on software products and code is one of the most important aspects of good code; when you write good documentation for your API code, it is easy for a user to understand how your code works and what it does.
Today I continued working on the 6 Practice Exams | AWS Certified Cloud Practitioner Udemy course. I completed two of the six practice tests: the first with an 82% grade and the second with 72%. The minimum to pass the exam is 70%. Really, this exam is a little crazy; AWS has a lot of services to study. The idea is to study the most important services and have some luck, like in a coding interview where you know how to solve a good number of problems and hope those are the ones that come up.
During this day I only studied a brief time for my AWS certification; for the moment I am only focusing on solving example exams to practice for the real one, since I have completed the required courses. I have also been reading the Software Engineering at Google book; at the moment I am on Chapter 10, Documentation.