Today I took some AWS practice exams. In the first one, with only 30 questions, I got a passing score, but I failed the second one, which had 65 questions. Some of the example questions are very tricky, and you need to know a basic one-line definition of each recommended AWS service. On another topic, I obtained a coupon to take the Airflow Fundamentals certification by Astronomer for free. The problem is that I only have 30 days to study and take the exam; otherwise my free attempt will expire. I am in a rush to obtain my AWS and Airflow certifications.
Today, working in Spark, I needed to create a new column and fill it with the result of a SQL-like expression: for example, take the value of another column, parse it into something else and use the result to fill the new column. I also studied more about these SQL-like expressions in Spark. A SQL expression is a combination of values, variables, operators and keywords that is evaluated to produce a value, much like an expression in a programming language. In Computer Science, an expression is defined the same way: a combination of values, variables, operators and function calls that returns a value.
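As a minimal sketch of that idea, the snippet below fills a new column from a SQL CASE expression based on the value of another column. It uses Python's built-in sqlite3 module as a stand-in so it runs without a Spark cluster; the `people` table, its columns and the age threshold are made up for illustration. In PySpark, the same expression string could be passed to `pyspark.sql.functions.expr` inside a `withColumn` call.

```python
import sqlite3

# Hypothetical expression: derive a label from the value of another column.
AGE_GROUP_EXPR = "CASE WHEN age >= 18 THEN 'adult' ELSE 'minor' END"


def add_age_group(rows):
    """Return (name, age, age_group) tuples; age_group comes from the SQL expression."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
        conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
        # The expression is evaluated once per row and exposed as a new column.
        query = (
            f"SELECT name, age, {AGE_GROUP_EXPR} AS age_group "
            "FROM people ORDER BY name"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


result = add_age_group([("Ana", 34), ("Luis", 12)])
print(result)
```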
Today I studied some AWS billing and pricing services, like AWS Cost Explorer, AWS Budgets and the Support plans (Basic, Developer, Business, Enterprise On-Ramp and Enterprise). AWS offers a variety of services to know how much money you are spending across your entire AWS account and also to forecast the money that you will spend during the next days or months. Some definitions are:
- Billing Dashboard: AWS service to perform billing actions like analyzing the spending within your account, viewing free tier usage and monitoring the money spent on your AWS services.
- AWS Budgets: AWS cost management feature to set spending limits for your services and set up alerts for when you reach those limits.
- AWS Cost Explorer: AWS tool to visualize the cost of your AWS services within your account.
- AWS Marketplace: a place in which you can find and buy third-party services within AWS.
Today I continued studying for the AWS certification. I studied some monitoring services like CloudTrail, CloudWatch and AWS Trusted Advisor, and I also took some AI-generated tests with example exam questions for the topics I studied. It's amazing how AI can help you study better and create exam questions.
Today I have been studying for my AWS certification. This time I studied Compliance and Security services like Artifact, GuardDuty, WAF, Inspector, etc. On another topic, I practiced SQL by executing queries and modifying database schemas.
Today I created a Python script to analyze a SQL table. The idea was to find all the columns in a table that have NULL values in every row. I am working to learn more about SQL, Python, Scala and Spark to work on Data Engineering tasks.
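The script itself is not shown here, so the following is only a sketch of one way to do it, assuming a SQLite database reached through Python's built-in sqlite3 module. The function name `find_all_null_columns` and the `users` table are hypothetical. It relies on the fact that `COUNT(column)` counts only non-NULL values.

```python
import sqlite3


def find_all_null_columns(conn, table):
    """Return the columns of `table` whose value is NULL in every row."""
    # A zero-row SELECT is enough to read the column names from the cursor.
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
    columns = [description[0] for description in cursor.description]

    all_null = []
    for column in columns:
        # COUNT(column) ignores NULLs, so 0 means no row has a value.
        query = f'SELECT COUNT("{column}") FROM "{table}"'
        non_null = conn.execute(query).fetchone()[0]
        if non_null == 0:
            all_null.append(column)
    return all_null


# Hypothetical sample table for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, nickname TEXT, phone TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, None, None), (2, "neo", None)],
)
null_columns = find_all_null_columns(conn, "users")
print(null_columns)
```

Note that with this approach an empty table reports every column, because no row holds a value. The table name is interpolated into the query, so it should come from a trusted source.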
Today I only read a little of the Fundamentals of Data Engineering book. I learned about the roles that interact with a data engineer: first the upstream data producers (software engineers, data architects, DevOps engineers and SREs), then the downstream data consumers (data analysts, data scientists and machine learning engineers). Data engineering is an amazing field that requires skills from various categories and lets me work with my favorite language, Java. Scala is also good to know.
Today I studied some Data Engineering topics. I began reading the Fundamentals of Data Engineering book, which for the moment is teaching me the fundamental concepts of data engineering and the role of a data engineer within a company. I also practiced creating an Airflow DAG; I want to build a test DAG for practice.
Today I completed module 5 of the AWS course, about Storage and Databases. I remember some purpose-built AWS databases like DocumentDB (a NoSQL database compatible with MongoDB), Neptune (a graph database) and Quantum Ledger Database, QLDB (for audit tasks, because this database is immutable). AWS has a good range of databases for specific purposes to meet user requirements. Tomorrow I plan to code something; I really need to code something, maybe an Airflow DAG or anything else.
Today I continued studying for my AWS certification, this time about database services like Amazon RDS, Aurora and DynamoDB. I learned the difference between a single-instance database and a database cluster. Reading the Software Engineering at Google book, I discovered two amazing tools: Error Prone, a static analysis tool to detect errors in Java, and YAPF, a tool to format Python code. Tomorrow I plan to investigate more about these two tools and consider whether using them in my daily activities would be beneficial.
Today I studied the S3, EBS and EFS storage services in AWS. Each one stores data in a different way: S3 stores objects, EBS provides block storage for EC2 instances and EFS works like a traditional file system. Reading the Software Engineering at Google book, I learned a new concept called template metaprogramming. This concept is related to C++: it is the ability to write code that generates other code during compilation. It reminds me of the Java Template Factory pattern, in which you create objects at runtime.
During these days I continued studying AWS services. I studied services related to networking like VPC, Security Groups, ACLs, CloudFront, Route 53, etc. Understanding how a VPC is composed in AWS was a little tiring. A network ACL in an AWS VPC controls the inbound and outbound traffic of a subnet, and it works in a stateless way, which means each packet is evaluated against the rules as if it were unknown. EC2 instances have another layer of security called Security Groups, which work as virtual firewalls with the same purpose of controlling inbound and outbound traffic, but in a stateful way: the response to an allowed request is allowed automatically.
I am back after being sick. During the last week I was very sick, unable to code due to bad bronchitis, and I had to rest in bed. I don't want to be sick again because my wife forbade me to use my computer to code, but at least I could finish reading the Everything is Fucked book, an amazing book in my opinion.
On another topic, today I resumed my AWS study. I studied services related to networking like VPC, Security Groups, Subnets, ACLs, Internet Gateway, etc. I also made some progress on my Airflow DAG: I want to create a DAG that processes some data, extracts some keywords from it and updates a SQL database with the keywords found.
I completed a short tutorial on best practices in SQL for Java Developers: SQL Best Practices Every Java Engineer Must Know.
While reading the Baeldung weekly issue, I found it to be a great resource for staying updated on the most relevant Java topics each week. It's an excellent option for Java Developers to review regularly.
I've been practicing a lot with SQL, including:
- Creating tables
- Inserting records
- Working with joins
- Learning about best practices for GROUP BY
- Using CTEs (Common Table Expressions)
- Working with subqueries
- Understanding the EXPLAIN statement
Today, I performed maintenance activities on my Oracle server running Ubuntu, upgrading it from 20.04 to 22.04 Jammy Jellyfish via SSH. During the process, I cleaned up cache, removed conflicting repositories, upgraded packages, and optimized my Docker environment.
I also set up a PostgreSQL database in Docker to practice SQL commands. While doing so, I noticed that my server was outdated and cluttered with unnecessary files, prompting a thorough cleanup. It was satisfying to get everything back in order!
Today I took a short course (3 hours) to learn some data concepts. The course teaches Big Data concepts like what big data is and the 5 Vs of big data: velocity, veracity, variety, volume and value. I learned interesting things about big data and data engineering, like MapReduce operations and how HDFS works.
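To make the MapReduce idea concrete, here is a small single-process sketch of the classic word count in plain Python. It is not how Hadoop runs it (there the map and reduce phases are distributed across machines); the function names and the sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is valuable"])
print(counts)
```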
Today I worked on modifying some Java code to change the way it deserializes JSON data into Java objects. The code uses Google Gson for the deserialization process. The code changes were fast, but the problem was the testing phase: the application doesn't have unit tests, so the only way to test the changes is to run the entire application.
Today I worked on a hard task: renewing some certificates for a RabbitMQ server. It was amazing because I needed to generate the certificates multiple times, since sometimes they were generated incorrectly. For me, working with certificates and things related to security is fun and interesting, but my knowledge of these topics is limited.
Today I studied the Elastic Load Balancing (ELB) and Auto Scaling services in AWS and learned multiple things about scalability and availability. ELB is an amazing service that automatically distributes incoming traffic across resources such as Amazon EC2 instances. Normally ELB is used together with the Auto Scaling service, which adds or removes EC2 instances according to demand, similar to running a pod with three replicas in Kubernetes.
I continue working on the AWS Cloud Practitioner Essentials course from Amazon. This course complements my knowledge of AWS and lets me continue learning and practicing. On another topic, during this month I will also need to work on a university onboarding course.
During these days I completed the Udemy Ultimate AWS Certified Cloud Practitioner CLF-C02 course. I continue working on my AWS Cloud Practitioner certification and learning about Cloud Computing in general. I also learned the NIST definition of Cloud Computing: the essential characteristics, the service models and the deployment models.
Over the weekend, I continued studying for the AWS Cloud Practitioner certification. I'm working on the final parts of the course to prepare for the exam. Today, I studied the AWS Well-Architected Framework, which helps create a cloud environment that follows enterprise and best practices. AWS has many cloud services, which can be frustrating as it's hard to grasp the basics of each one.
Today I continued reading the Software Engineering at Google book. I learned how Google measures productivity using the GSM (Goals, Signals, Metrics) framework. A goal can be to improve security in the code, a signal can be fewer vulnerabilities introduced in new code, and a metric can be the number of vulnerabilities detected in the code. How do you know when the goal was completed? You review whether the expected metric was achieved. It's interesting how Google tackles some really ambiguous problems with good methodologies and achieves great results.
Today I was investigating other data processing tools like Beam and Flink. Working with Airflow is good, but the problem is that it requires a lot of resources to set up and run a local Airflow environment with Docker. So I am looking for alternatives, like creating my pipeline with Apache Beam and packaging it into a container to be executed in my Kubernetes cluster as a CronJob, with Java obviously. I also want to learn Data Engineering with Java; I know Python is good, but Java is better (my personal opinion).
Today I continued working on my Airflow DAG. Most of my time today went to reading some blogs from Netflix and watching some YouTube videos about data engineering. I am looking for an alternative to Airflow that works with Java code; Python is great, but it is consuming a lot of my compute resources.
Today I couldn't create a Spark session from an Airflow DAG to my local Spark cluster. I have various problems with this approach because my Airflow cluster runs in a container environment while Apache Spark is installed locally on my machine. In the Udemy course that I took last week, I learned how to run a task with the Docker operator in a container environment. Running the task in a container is a good option, but the problem with this approach is that we can't share data between the tasks this way. Well, I will have to keep investigating how to solve this problem and learn different ways to work with Spark and Airflow.
Today I created an Airflow DAG to extract some JSON data from an open API using the HTTP operator. I also created another DAG to monitor the pull requests created in a GitHub repository: if the commit messages match some pattern, the DAG is executed and prints some elements to the console. Building these basic DAGs is helping me understand Airflow concepts and practice Python and Airflow.
Today I continued studying Apache Airflow. I spent some time setting up an Airflow environment on my MacBook using only pip and the Airflow pip package. Running Airflow this way works at first, but it starts to fail over time and I don't know why. Investigating the issues, I discovered that configuring an Airflow environment with pip is complicated and not recommended, so in the end I continued running Airflow on my local machine with the astro CLI tool from Astronomer. I completed the Astronomer Get started with Airflow tutorial, in which I learned how to configure the local environment very easily with the astro CLI and also learned some new Airflow concepts.
I have been installing, setting up and configuring Apache Airflow and Spark on my personal server to start developing DAGs in a production environment. I installed Airflow and Spark with Docker following the official documentation; it was really hard to configure both applications so I could use them on my Ubuntu server. I am going to start some data pipelines to create datasets and publish them on Hugging Face.
Today I studied how to execute a DAG in Apache Airflow for a specific date; this process is called backfilling. Imagine that you change your DAG and you want to repeat your old DAG executions with the new features: with backfill you can rerun the DAG for a specific date or range of dates. I was also looking at the Hugging Face website; I want to create a dataset to store the stock information of the FAANG companies.
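Conceptually, a backfill runs the DAG once for every scheduled date in a past range. The sketch below illustrates only that idea for a daily schedule in plain Python; it is not Airflow's implementation (Airflow has its own backfill command in its CLI), and `fake_dag_run` is a hypothetical placeholder for a real DAG run.

```python
from datetime import date, timedelta


def daily_backfill_dates(start, end):
    """Yield one logical date per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(run_for_date, start, end):
    """Call run_for_date once for each past date and collect the results."""
    return [run_for_date(day) for day in daily_backfill_dates(start, end)]


def fake_dag_run(logical_date):
    # Hypothetical stand-in for executing the whole DAG for one date.
    return f"processed {logical_date.isoformat()}"


results = backfill(fake_dag_run, date(2024, 1, 1), date(2024, 1, 3))
print(results)
```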
Today I fixed the issue when loading the data from a MinIO bucket into a PostgreSQL database: I had configured the connection incorrectly in the Airflow connections section. Once the data was in PostgreSQL, I used Metabase to visualize it. Metabase is a powerful tool to visualize data in a simple way; I created a dashboard that shows the historical stock price of IBM and also the current price of the stock.
I learned how to execute a task inside a Docker container. Airflow has a DockerOperator that allows you to run a task inside a Docker container. This feature is very useful to execute tasks in an isolated environment and avoid installing dependencies in Airflow that would be used only once. I am working on loading a CSV file from a MinIO bucket into a PostgreSQL database with the astro library, and I am having some problems connecting to the MinIO bucket to download the file. I am going to continue working on this task tomorrow.
Today I continued studying Apache Airflow. I keep learning how to create DAGs by building some basic examples of simple data pipelines. A DAG in Airflow contains any number of tasks; these tasks are executed in the order defined by their dependencies, and the dependencies can't form a cycle, so a task can never depend on a task that comes after it. In Airflow, if you have a DAG with 10 tasks, you can execute a single task on its own with the CLI, which is a good way to test the execution of a task. PythonOperator is the way to execute a Python function as a task in Airflow. Two main arguments are necessary to create a PythonOperator: task_id, the name of the task in Airflow, and python_callable, the function that you want to execute as a task. My first impressions of Airflow are very good. I am developing a data pipeline to extract financial stock data from Yahoo Finance using web scraping and then upload the extracted data into a database; for the moment I don't know whether to use a SQL or NoSQL database to store the information.
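As a sketch of how the task order in a DAG works: each task runs only after the tasks it depends on, and the dependencies cannot loop back. The example below shows this with Python's standard graphlib module instead of Airflow itself, so it is an illustration of the concept, not of Airflow's scheduler. The task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(dependencies):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def has_cycle(dependencies):
    """Return True when the dependencies loop back, so no valid order exists."""
    try:
        execution_order(dependencies)
    except CycleError:
        return True
    return False


order = execution_order(pipeline)
print(order)

# Making "extract" depend on "load" closes the loop, so this is not a DAG.
cyclic = {"extract": {"load"}, "transform": {"extract"}, "load": {"transform"}}
print(has_cycle(cyclic))
```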
Today I continued studying for the AWS Cloud Practitioner certification. I studied some AWS services like Application Migration Service, DataSync, Step Functions, Ground Station, Migration Hub, etc.