Below you will find the instructions to set up your computer for the Le Wagon Data Engineering course.
Part of the setup will be done on your local machine, but most of the configuration will be done on a virtual machine.
Please read the instructions carefully and execute all commands in the order given. If you get stuck, don't hesitate to ask a teacher for help.
Let's start!
To be able to interact when we are not in the same physical room, we will be using Zoom, a video conferencing tool.
Go to zoom.us/download.
Under Zoom Client click the Download button.
Open the file you have just downloaded to install the app.
Open the Zoom app.
If you already have a Zoom account, sign in using your credentials.
If not, click on the Sign Up Free link:
You will be redirected to Zoom's website to complete a form.
When it's done, go back to the Zoom app and sign in using your credentials.
You should then see a screen like this:
You can now close the Zoom app.
Slack is a communication platform pretty popular in the tech industry.
Download the Slack app and install it.
Launch the app and sign in to the lewagon-alumni organization.
Make sure you upload a profile picture.
The idea is that you'll have Slack open all day, so that you can share useful links / ask for help / decide where to go to lunch / etc.
To ensure that everything is working fine for video calls, let's test your camera and microphone:
- Open the Slack app
- Click your profile picture in the top right.
- Select Preferences from the menu.
- Click Audio & video in the left-side column.
- Below Troubleshooting, click Run an audio, video and screensharing test. The test will open in a new window.
- Check that your preferred speaker, microphone and camera devices appear in the drop-down menus, then click Start test.
When the test is finished, you should see green "Succeed" messages at least for your microphone and camera.
If not, contact a teacher.
You can also install the Slack app on your phone and sign in to lewagon-alumni!
Have you signed up to GitHub? If not, do it right away.
Upload a picture and put your name correctly on your GitHub account. This is important as we'll use an internal dashboard with your avatar. Please do this now, before you continue with this guide.
Enable Two-Factor Authentication (2FA). GitHub will send you text messages with a code when you try to log in. This is important for security and will also soon be required in order to contribute code on GitHub.
We want to communicate safely with your virtual machine using the SSH protocol, so we need to generate an SSH key to authenticate.
- Open your terminal
Windows tip:
We highly recommend installing Windows Terminal from the Windows Store (installed by default on Windows 11) to perform this operation.
- Create an SSH key
Windows
# replace "[email protected]" with your GCP account email
ssh-keygen.exe -t ed25519 -C "[email protected]"
MacOS & Linux
# replace "[email protected]" with your GCP account email
ssh-keygen -t ed25519 -C "[email protected]"
You should get the following message: "Generating public/private algorithm key pair."
- When you are prompted to "Enter a file in which to save the key", press Enter.
- You should be asked to "Enter a passphrase". This is optional and adds extra security. To continue without a passphrase, press Enter without typing anything.
Note: don't worry if nothing appears on screen as you type; that is perfectly normal, for security reasons.
- You should be asked to "Enter same passphrase again". Do it.
Important: you must remember this passphrase.
Troubleshooting: "/home/your_username/.ssh/id_ed25519 already exists."
If you receive this message, you may already have an SSH key with the same name (if you are a Le Wagon alumnus or already use SSH authentication with GitHub). To create a separate SSH key used exclusively for this bootcamp, run the following:
# replace "[email protected]" with your GCP account email
ssh-keygen -t ed25519 -f ~/.ssh/de-bootcamp -C "[email protected]"
Your new SSH key will be named de-bootcamp. Make sure to remember it for later!
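The choice between the default key name and the bootcamp-specific one can be sketched as a small shell helper. This is only an illustration: `pick_key_path` is a hypothetical function name, not part of the guide or of OpenSSH, and it only suggests a path without creating any key.

```shell
# Hypothetical helper (an assumption for illustration): suggest a key file
# name that does not clash with an existing default key.
pick_key_path() {
  ssh_dir="$1"
  if [ -e "$ssh_dir/id_ed25519" ]; then
    # The default name is already taken: fall back to the bootcamp-specific name.
    printf '%s\n' "$ssh_dir/de-bootcamp"
  else
    printf '%s\n' "$ssh_dir/id_ed25519"
  fi
}

KEY_PATH="$(pick_key_path "$HOME/.ssh")"
echo "Suggested key file: $KEY_PATH"
# Then generate the key as described above, for example:
#   ssh-keygen -t ed25519 -f "$KEY_PATH" -C "<your GCP account email>"
```

Whichever path is suggested, the matching public key is the same path with a `.pub` suffix.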
GCP (Google Cloud Platform) is a cloud solution that you are going to use in order to work on a virtual machine.
Note: Skip to the next section if you already have a GCP project.
- Go to Google Cloud and create an account if you do not already have one
- In the Cloud Console, on the project list, select or create a Cloud project
- Give it a name, for example Wagon Bootcamp
- Notice the ID automatically created for the project, e.g. wagon-bootcamp-123456
To make the instructions easier to follow during the bootcamp, open your GCP account preferences:
https://myaccount.google.com/language
If the preferred language is not:
- English
- United States
Then switch the language to English:
- Click on the edit pen logo
- Select English
- Select United States
- Click on Select
Note: Skip to the next section if you already have a valid billing account.
You will now link your account to your credit card. This step is required; without it you will not be able to use the services provided by GCP. Do not worry: free credits will cover most GCP services throughout the bootcamp.
- Click on Billing
- Click on MANAGE BILLING ACCOUNTS
- Click on ADD BILLING ACCOUNT
- Give a name to your billing account, e.g. My Billing Account
- Click on "I have read..." and agree to the terms of service
- Click on CONTINUE
- Select your account type: Individual
- Fill in your name and address
You should see that you have a free credit of "$300 credits over the next 90 days".
- Click on card details
- Enter your credit card info
- Click on START MY FREE TRIAL
Once this is done, verify that your billing account is linked to your GCP project.
- Select your project
- Go to Billing
- Select LINK A BILLING ACCOUNT
- Select My Billing Account
- Click on SET ACCOUNT
You should now see:
Free trial status: $300 credit and 91 days remaining - with a full account, you'll get unlimited access to all of Google Cloud Platform.
If you do not own a credit card
If you do not own a credit card, an alternative is to set up a Revolut account. Revolut is a financial app that will allow you to create a virtual credit card linked to your mobile phone billing account.
Skip this step if you own a credit card and use your credit card for the setup.
Download the Revolut app, or go to the Revolut website and follow the steps to download the app (enter your mobile phone number and click on Get Started).
- Open the Revolut app
- Enter your mobile phone number
- Enter the verification code received by SMS
- The app will ask for your country, address, first and last name, date of birth, email address
- The app will also ask for a selfie and request your profession
- The app will require a photo of your identification card or passport
Once this is done, select the standard (free) plan. There is no need to add the card to Apple Pay, order a physical card, or add money.
You now have a virtual card which we will use for the GCP setup.
In the main view of the Revolut app:
- Click on Ready to use
- Click on the card
- Click on Show card details
- Note down the references of the virtual credit card and use them in order to proceed with the GCP setup
If you receive an email from Google saying "Urgent: your billing account XXXXXX-XXXXXX-XXXXXX has been suspended"
This may happen, especially if you have just set up a Revolut account.
- Click on PROCEED TO VERIFICATION
- You will be asked to send a picture of your credit card (only the last 4 digits, no other info)
- In case you used Revolut, you can send a screenshot of your virtual credit card (do not forget to remove the validity date from the screenshot)
- Explain that you are attending the Le Wagon bootcamp, do not own a credit card, and have just created a Revolut account in order to setup GCP for the bootcamp using a virtual credit card
You may receive a validation or requests for more information within 30 minutes.
Once the verification goes through, you should receive an email stating that "Your Google Cloud Platform billing account XXXXXX-XXXXXX-XXXXXX has been fully reinstated and is ready to use.".
You will use different GCP services during the bootcamp, which need to be activated and configured.
Go to your project's APIs dashboard; you can see that a bunch of APIs are already enabled.
Note: Skip to the next section if you already have Compute Engine enabled.
- In the search bar, type compute and click on the Compute Engine result
- Click on ENABLE
- Compute Engine is now enabled on your project
Note: Skip to the next section if you already have a VM set up.
Note: The following section requires that you already have a Google Cloud Platform account associated with an active billing account.
- Go to console.cloud.google.com > Compute Engine > VM instances > Create instance
- Name it lewagon-data-eng-vm-<github_username>, replacing <github_username> with your own, e.g. krokrob
- Region europe-west1: choose the closest one among the available regions
- In the section Machine configuration, under the sub-heading Machine type:
- Select General purpose > PRESET > e2-standard-4
- Boot disk > Change
- Open Networking, Disks, ... under Advanced options
- Open Networking
- Go to Network interfaces and click on default default (...) with a downward arrow on the right.
- This opens a box named Edit network interface
- Go to the dropdown External IPv4 address, click on it, then click on RESERVE STATIC EXTERNAL IP ADDRESS
- Give it a name, like "lewagon-data-eng-vm-ip-<github_username>" (replace <github_username> with your own), and the description "Le Wagon - Data Engineering VM IP". This will take a few seconds.
- You will now have a public IP associated with your account, and later with your VM instance. Click on Done at the bottom of the Edit network interface section you were in.
- Open the Security section
- Open the Manage access subsection
- Go to Add manually generated SSH keys and click Add item
- In your terminal, display your public SSH key:
- Windows: navigate to where you created your SSH key and open id_ed25519.pub
- Mac/Linux users can use:
cat ~/.ssh/id_ed25519.pub # OR cat ~/.ssh/de-bootcamp.pub if you created a unique key
- Copy your public SSH key and paste it:
- On the right hand side you should see
- You should be good to go: click CREATE at the bottom
- It will take a few minutes for your virtual machine (VM) to be created. Your instance will show up like below when ready, with a green circled tick, named lewagon-data-eng-vm-krokrob (krokrob being replaced by your GitHub username).
- Click on your instance
- Go down to the section SSH keys, and write down your username (you need it for the next section)
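The naming convention above can be checked before typing it into the console. The sketch below is an illustration only: `vm_name` is a hypothetical helper, and the validation assumes the usual Compute Engine naming rule (a lowercase letter first, then lowercase letters, digits or hyphens, no trailing hyphen, at most 63 characters).

```shell
# Hypothetical helper (an assumption for illustration): build the VM name
# used in this guide from a GitHub username.
vm_name() {
  name="lewagon-data-eng-vm-$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  # Accept only names made of lowercase letters, digits and hyphens,
  # starting with a letter, not ending with a hyphen, 63 characters at most.
  if printf '%s\n' "$name" | grep -Eq '^[a-z]([-a-z0-9]{0,61}[a-z0-9])?$'; then
    printf '%s\n' "$name"
  else
    echo "Invalid VM name: $name" >&2
    return 1
  fi
}

vm_name "KroKrob"
```

The same pattern works for the static IP name by changing the prefix to lewagon-data-eng-vm-ip-.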
Congrats, your virtual machine is up and running. It is time to connect it with VS Code!
Let's install the Visual Studio Code text editor.
Copy (Ctrl + C) the commands below, then paste them in your terminal (Ctrl + Shift + V):
wget -qO- https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > packages.microsoft.gpg
sudo install -o root -g root -m 644 packages.microsoft.gpg /etc/apt/trusted.gpg.d/
sudo sh -c 'echo "deb [arch=amd64,arm64,armhf signed-by=/etc/apt/trusted.gpg.d/packages.microsoft.gpg] https://packages.microsoft.com/repos/code stable main" > /etc/apt/sources.list.d/vscode.list'
rm -f packages.microsoft.gpg
sudo apt update
sudo apt install -y code
These commands will ask for your password: type it in, then press Enter.
Now let's launch VS Code from the terminal:
code
If a VS Code window has just opened, you're good to go.
Otherwise, please contact a teacher.
We need to connect VS Code to a virtual machine in the cloud, so that you only work on that machine during the bootcamp. A pretty useful Remote SSH extension is available on the VS Code Marketplace.
- Open VS Code > Open the command palette > Type Extensions: Install Extensions
- Install the Remote SSH extension
That's the only extension you should install on your local machine; we will install additional VS Code extensions on your virtual machine.
- Open VS Code > Open the command palette > Type Remote-SSH: Connect to Host...
- Click on Add a new host
- Type ssh -i <path/to/your/private/key> <username>@<ip address>. For instance, if my username is somedude, my private SSH key is located at ~/.ssh/id_rsa on my local computer, and my VM has a public IP of 34.77.50.76, I'll type ssh -i ~/.ssh/id_rsa somedude@34.77.50.76
- When prompted to Select SSH configuration file to update, pick the one in your home directory, under the .ssh folder (~/.ssh/config). VS Code will usually pick the best option automatically, so its default should work.
- You should get a pop-up on the bottom right notifying you that the host has been added
- Open the command palette again > Type Remote-SSH: Connect to Host... > Pick your VM IP address
- The first time, VS Code might ask you for a security permission; say yes / continue.
- Open the command palette again > Type Terminal: Create New Terminal (in active workspace) > You now have a Bash terminal in your virtual machine!
- Still on your local computer, let's give your connection a more readable name!
code ~/.ssh/config
You should see something like the following:
Host <machine ip>
  HostName <machine ip>
  IdentityFile <file path for your ssh key>
  User <username>
You can now change Host to whatever name you would like to see for your connection, or to use in the terminal with ssh <Host>!
Warning: it is important that the Host alias does not contain any whitespace.
# For instance
Host "de-bootcamp-vm"
  # replace the HostName value with your VM's public IP address
  HostName 34.77.50.76
  IdentityFile <file path for your ssh key>
  User <username>
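As an illustration of the entry described above, here is a minimal shell sketch that writes such a block and rejects aliases containing whitespace. `add_ssh_host` is a hypothetical helper, the values are the example ones from this guide, and the demo writes to a temporary file rather than to your real ~/.ssh/config.

```shell
# Hypothetical helper (an assumption for illustration): append a Host block
# to an SSH config file, refusing aliases that are empty or contain whitespace.
add_ssh_host() {
  host_alias="$1"; host_ip="$2"; key_file="$3"; user_name="$4"; config_file="$5"
  case "$host_alias" in
    ''|*[[:space:]]*)
      echo "Host alias must be non-empty and contain no whitespace" >&2
      return 1 ;;
  esac
  printf 'Host %s\n  HostName %s\n  IdentityFile %s\n  User %s\n' \
    "$host_alias" "$host_ip" "$key_file" "$user_name" >> "$config_file"
}

# Demo on a temporary file, using the example values from this guide.
demo_config="$(mktemp)"
add_ssh_host "de-bootcamp-vm" "34.77.50.76" "~/.ssh/id_ed25519" "somedude" "$demo_config"
cat "$demo_config"
```

With an entry like this in place, `ssh de-bootcamp-vm` is enough to connect from a terminal.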
The setup of your local machine is over. All following commands will be run from within your virtual machine terminal (via VS Code for instance).
Let's install some useful extensions to VS Code.
- Open your VS Code instance and make sure you're connected to the remote server. At the bottom left, you'll see:
- Open the VS Code terminal (CMD + ` or CTRL + `) then run the following commands:
code --install-extension ms-vscode.sublime-keybindings
code --install-extension emmanuelbeziat.vscode-great-icons
code --install-extension ms-python.python
code --install-extension KevinRose.vsc-python-indent
code --install-extension ms-python.vscode-pylance
code --install-extension redhat.vscode-yaml
code --install-extension ms-azuretools.vscode-docker
code --install-extension tamasfe.even-better-toml
Here is a list of the extensions you are installing:
- Sublime Text Keymap and Settings Importer
- VSCode Great Icons
- Python
- Python Indent
- Pylance
- YAML
- Docker
- Even Better TOML
Instead of using the default bash shell, we will use zsh.
We will also use git, a command-line tool used for version control.
Let's install them, along with other useful tools:
- Open a VS Code terminal connected to your VM
- Copy and paste the following commands:
sudo apt update
sudo apt install -y vim tmux tree git ca-certificates curl jq unzip zsh \
apt-transport-https gnupg software-properties-common direnv sqlite3 make \
postgresql postgresql-contrib build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \
gcc default-mysql-server default-libmysqlclient-dev libpython3-dev openjdk-8-jdk-headless
These commands might ask for your password. If they do, type it in, then press Enter.
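Once the installation finishes, a quick way to confirm the tools are on your PATH is to look each one up. The sketch below is only an illustration; `missing_commands` is a hypothetical helper name, and the list of commands is a sample taken from the packages installed above.

```shell
# Hypothetical helper (an assumption for illustration): print every command
# from the argument list that cannot be found on the PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Sample check for some of the tools installed above; no output means all were found.
missing_commands git zsh jq curl wget tmux tree vim make direnv
```

If a name is printed, re-run the apt install command and read its output for errors.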
Let's now install GitHub's official CLI (Command Line Interface). It is software used to interact with your GitHub account via the command line.
In your terminal, copy-paste the following commands and type in your password if asked:
curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null
sudo apt update
sudo apt install -y gh
To check that gh has been successfully installed on your machine, you can run:
gh --version
If you see gh version X.Y.Z (YYYY-MM-DD), you're good to go.
Otherwise, please contact a teacher.
Let's install Oh My Zsh, a framework for managing your zsh configuration and plugins.
In a terminal, execute the following command:
sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
If asked "Do you want to change your default shell to zsh?", press Y.
At the end your terminal should look like this:
If it does, you can continue.
Otherwise, please ask a teacher.
CLI is the acronym for Command-Line Interface.
In this section, we will use GitHub CLI to interact with GitHub directly from the terminal.
It should already be installed on your machine from the previous commands.
First, in order to log in, copy-paste the following command in your terminal.
Do not edit email in the command: user:email is a literal scope name, not a placeholder.
gh auth login -s 'user:email' -w
gh will ask you a few questions:
What is your preferred protocol for Git operations? With the arrows, choose SSH and press Enter. SSH is a protocol to log in using SSH keys instead of the well-known username/password pair.
Generate a new SSH key to add to your GitHub account? Press Enter to ask gh to generate the SSH keys for you.
If you already have SSH keys, you will instead see Upload your SSH public key to your GitHub account? With the arrows, select your public key file path and press Enter.
Enter a passphrase for your new SSH key (Optional). Type something you want and that you'll remember. It's a password to protect your private key stored on your hard drive. Then press Enter.
Title for your SSH key. You can leave it at the proposed "GitHub CLI"; press Enter.
You will then get the following output:
! First copy your one-time code: 0EF9-D015
- Press Enter to open github.com in your browser...
Select and copy the code (0EF9-D015 in the example), then press Enter.
Your browser will open and ask you to authorize GitHub CLI to use your GitHub account. Accept and wait a bit.
Come back to the terminal, press Enter again, and that's it.
To check that you are properly connected, type:
gh auth status
If you get Logged in to github.com as <YOUR USERNAME>, then all good.
If not, contact a teacher.
Install the gcloud CLI to communicate with Google Cloud Platform through your terminal:
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
sudo apt-get install apt-transport-https ca-certificates gnupg
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
sudo apt-get update && sudo apt-get install google-cloud-sdk
sudo apt-get install google-cloud-sdk-app-engine-python
Note: Skip to the next section if you already have a service account key.
Now that you have created a GCP account and a project (identified by its PROJECT_ID), we are going to configure the actions (API calls) that you want to allow your code to perform.
Why do we need a service account key?
You have created a GCP account linked to your credit card. Your account will be billed according to your usage of the resources of the Google Cloud Platform. Billing will occur if you consume anything once the free trial is over, or if you exceed the amount of spending allowed during the free trial.
In your GCP account, you have created a single GCP project, identified by its PROJECT_ID. GCP projects allow you to organize and monitor more precisely how you consume GCP resources. For the purpose of the bootcamp, we are only going to create a single project.
Now, we need a way to tell which resources within a GCP project our code will be allowed to consume. Our code consumes GCP resources through API calls.
Since API calls are not free, it is important to define with caution how our code will be allowed to use them. During the bootcamp this will not be an issue, and we are going to allow our code to use all the GCP APIs without any restrictions.
In the same way that there may be several projects associated with a GCP account, a project may be composed of several services (any bundle of code, whatever its form factor, that requires GCP API calls in order to fulfill its purpose).
GCP requires that the services of a project that use API calls are registered on the platform, and that their credentials are configured through the access granted to a service account.
For the moment we will only need a single service, and will create the corresponding service account.
Since the service account is what identifies your application (and therefore your GCP billing account and ultimately your credit card), you are going to want to be cautious with the next steps.
- Go to the service accounts page
- Select your project in the list of recent projects if asked to
- Create a service account:
  - Click on CREATE SERVICE ACCOUNT
  - Give a Service account name to that account
  - Click on CREATE AND CONTINUE
  - Click on Select a role and choose Quick access/Basic then Owner, which gives full access to all resources
  - Click on CONTINUE
  - Click on DONE
- Download the service account JSON file:
  - Click on the newly created service account
  - Click on KEYS
  - Click on ADD KEY then Create new key
  - Select JSON and click on CREATE
The browser has now saved the service account JSON file in your downloads directory (it is named according to your service account name, something like le-wagon-data-123456789abc.json).
- Open the service account JSON file with any text editor and copy the key
# It looks like:
{
  "type": "service_account",
  "project_id": "kevin-bootcamp",
  "private_key_id": "1234567890",
  "private_key": "-----BEGIN PRIVATE KEY-----\nXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\n-----END PRIVATE KEY-----\n",
  "client_email": "bootcamp@kevin-bootcamp.iam.gserviceaccount.com",
  "client_id": "1234567890",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/bootcamp%40kevin-bootcamp.iam.gserviceaccount.com"
}
- On your virtual machine, create a ~/.gcp_keys directory, then create a JSON file in it:
mkdir -p ~/.gcp_keys
touch ~/.gcp_keys/le-wagon-de-bootcamp.json
- Open the JSON file and paste the key into it:
code ~/.gcp_keys/le-wagon-de-bootcamp.json
Don't forget to save the file with CMD + S or CTRL + S.
- Authenticate the gcloud CLI using the service account key:
# Replace service_account_name@project_id.iam.gserviceaccount.com with your own
SERVICE_ACCOUNT_EMAIL=service_account_name@project_id.iam.gserviceaccount.com
KEY_FILE=$HOME/.gcp_keys/le-wagon-de-bootcamp.json
gcloud auth activate-service-account $SERVICE_ACCOUNT_EMAIL --key-file=$KEY_FILE
- List your credentialed accounts and check that the service account email is present
gcloud auth list
- Set your current project
# Replace `PROJECT_ID` with the `ID` of your project, e.g. `wagon-bootcamp-123456`
gcloud config set project PROJECT_ID
- List your active account and current project and check your project is present
gcloud config list
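Rather than retyping the service account email and project ID by hand, both can be read from the key file itself. The sketch below is only an illustration: `key_field` is a hypothetical helper based on sed, it assumes simple string values as in the sample key shown earlier, and the demo uses a temporary file with fake values instead of your real key.

```shell
# Hypothetical helper (an assumption for illustration): print the string value
# of a field from a service account key file.
key_field() {
  # $1: path to the JSON key file, $2: field name
  sed -n 's/.*"'"$2"'": *"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Demo on a temporary file with fake values (not a real key).
demo_key="$(mktemp)"
cat > "$demo_key" <<'EOF'
{
  "type": "service_account",
  "project_id": "kevin-bootcamp",
  "client_email": "bootcamp@kevin-bootcamp.iam.gserviceaccount.com"
}
EOF

SERVICE_ACCOUNT_EMAIL="$(key_field "$demo_key" client_email)"
PROJECT_ID="$(key_field "$demo_key" project_id)"
echo "$SERVICE_ACCOUNT_EMAIL / $PROJECT_ID"
```

On the VM you would point the helper at ~/.gcp_keys/le-wagon-de-bootcamp.json. Since jq was installed earlier, `jq -r '.client_email'` on that file is an equivalent option. Because the key grants Owner access, restricting it with `chmod 600` is a sensible precaution.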
Let's pimp your zsh and VS Code by installing Le Wagon's recommended dotfiles on your virtual machine.
There are three options, choose one:
I already attended the Web-Dev or Data-Science bootcamp at Le Wagon on the same virtual machine (highly unlikely!)
This means that you already forked the GitHub repo lewagon/dotfiles, but at that time the configuration was maybe not ready for the new Data Science bootcamp.
Open your terminal and go to your dotfiles project:
cd ~/code/<YOUR_GITHUB_NICKNAME>/dotfiles
code . # Open it in VS Code
In VS Code, open the zshrc file. Replace its content with the newest version of that file that we provide. Save to disk.
Back in the terminal, run a git diff and ask a TA to come and check this configuration change. You should see stuff about Python and pyenv.
Once this is good, commit and push your changes:
git add zshrc
git commit -m "Update zshrc for Data Engineering bootcamp"
git push origin master
OR
I did not attend the Web-Dev or Data-Science bootcamp at Le Wagon
Hackers love to refine and polish their shell and tools. We'll start with a great default configuration provided by Le Wagon, stored on GitHub. As your configuration is personal, you need your own repository storing it, so you first need to fork it to your GitHub account.
Go to the lewagon/dotfiles repository on GitHub and fork it to your account (you'll need to click again on your picture to confirm where you do the fork).
Forking means that it will create a new repo in your GitHub account, identical to the original one. You'll have a new repository on your GitHub account, your_github_username/dotfiles. We need to fork because each of you will need to put specific information (e.g. your name) in those files.
Open your terminal and run the following command:
export GITHUB_USERNAME=`gh api user | jq -r '.login'`
echo $GITHUB_USERNAME
You should see your GitHub username printed. If it's not the case, stop here and ask for help: there seems to be a problem with the previous step (gh auth).
Time to fork the repo and clone it on your virtual machine:
mkdir -p ~/code/$GITHUB_USERNAME && cd $_
gh repo fork lewagon/dotfiles --clone
Run the dotfiles installer.
cd ~/code/$GITHUB_USERNAME/dotfiles && zsh install.sh
Check the emails registered with your GitHub Account. You'll need to pick one at the next step:
gh api user/emails | jq -r '.[].email'
Run the git installer:
cd ~/code/$GITHUB_USERNAME/dotfiles && zsh git_setup.sh
This will prompt you for your name (FirstName LastName) and your email.
Warning: you need to enter one of the emails listed by the previous gh api ... command. If you don't do that, Kitt won't be able to track your progress. Tip: select the @users.noreply.github.com address if you don't want your email to appear in public repositories you may contribute to.
Please now quit all your opened terminal windows.
OR
I already attended Web-Dev or Data-Science bootcamp at Le Wagon but not on this VM
Open your terminal and run the following command:
export GITHUB_USERNAME=`gh api user | jq -r '.login'`
echo $GITHUB_USERNAME
You should see your GitHub username printed. If it's not the case, stop here and ask for help: there seems to be a problem with the previous step (gh auth).
Time to fork the repo and clone it on your virtual machine:
mkdir -p ~/code/$GITHUB_USERNAME && cd $_
gh repo fork lewagon/dotfiles --clone
Run the dotfiles installer.
cd ~/code/$GITHUB_USERNAME/dotfiles && zsh install.sh
Check the emails registered with your GitHub Account. You'll need to pick one at the next step:
gh api user/emails | jq -r '.[].email'
Run the git installer:
cd ~/code/$GITHUB_USERNAME/dotfiles && zsh git_setup.sh
This will prompt you for your name (FirstName LastName) and your email.
Warning: you need to enter one of the emails listed by the previous gh api ... command. If you don't do that, Kitt won't be able to track your progress. Tip: select the @users.noreply.github.com address if you don't want your email to appear in public repositories you may contribute to.
Please now quit all your opened terminal windows.
Set zsh as your default VS Code terminal.
You don't want to be asked for your passphrase every time you communicate with a remote repository. So, you need to add the ssh-agent plugin to Oh My Zsh:
First, open the .zshrc file:
code ~/.zshrc
Then:
- Spot the line starting with plugins=
- Add ssh-agent at the end of the plugins list
Save the .zshrc file with Ctrl + S and close your text editor.
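The manual edit above can also be scripted. The following is a minimal sketch, assuming the plugins list sits on a single line of the form plugins=(...); `add_omz_plugin` is a hypothetical helper, the plugin names in the demo are placeholders, and the demo runs on a temporary file rather than your real ~/.zshrc.

```shell
# Hypothetical helper (an assumption for illustration): append a plugin to a
# single-line "plugins=(...)" entry, doing nothing if it is already listed.
add_omz_plugin() {
  zshrc="$1"; plugin="$2"
  # Already listed? Then leave the file untouched.
  if grep -Eq "^plugins=\((.*[[:space:]])?${plugin}([[:space:]].*)?\)" "$zshrc"; then
    return 0
  fi
  # Keep a .bak copy and add the plugin at the end of the list.
  sed -i.bak "s/^plugins=(\(.*\))/plugins=(\1 ${plugin})/" "$zshrc"
}

# Demo on a temporary file with a placeholder plugin list.
demo_zshrc="$(mktemp)"
echo 'plugins=(git gitfast)' > "$demo_zshrc"
add_omz_plugin "$demo_zshrc" ssh-agent
add_omz_plugin "$demo_zshrc" ssh-agent   # second call changes nothing
cat "$demo_zshrc"
```

Editing the file by hand, as described above, remains the simplest route; the script is mostly useful if you rebuild a VM often.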
Docker is an open platform for developing, shipping, and running applications.
Set up the Docker apt repository:
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
Install the right packages
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Finally give your user permission to use docker
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
Run docker run hello-world, you should see something like:
Troubleshooting: Permission denied while trying to connect to the Docker daemon socket.
If you receive this error, navigate to the GCP Compute Engine console and shut down your VM by selecting the tick box next to your VM instance and clicking STOP (closing and reopening VS Code is not enough).
It will take a few minutes for your VM to turn off. Once it's fully off, turn your VM on again by checking the box next to the VM instance and clicking START. Give the VM a few minutes to fully start up, then connect through VS Code. Once connected, try docker run hello-world again. If you still don't get the expected output, raise a ticket with a teacher.
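A permission error of this kind usually means the current session does not yet see the docker group membership. A small sketch for checking this follows; `has_group` is a hypothetical helper that looks for a group name in a space-separated list such as the output of `id -nG`.

```shell
# Hypothetical helper (an assumption for illustration): succeed when the group
# name $2 appears in the space-separated group list $1.
has_group() {
  printf '%s\n' "$1" | tr ' ' '\n' | grep -qxF -- "$2"
}

# Check the groups seen by the current session.
if has_group "$(id -nG 2>/dev/null)" docker; then
  echo "This session is in the docker group."
else
  echo "This session is not in the docker group yet."
fi
```

If the second message appears after running usermod, the session predates the change, which is why a full stop and start of the VM resolves it.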
Note: Skip to the next section if you already have an Artifact Registry repository.
Artifact Registry is a GCP service you will use to store artifacts such as Docker images. The storage units are called repositories.
- Enable the service within your project using the gcloud CLI:
gcloud services enable artifactregistry.googleapis.com
- Create a new Docker repository:
# Set the repository name
REPOSITORY=docker-hub
# Set the location of the repository. Available locations: gcloud artifacts locations list
LOCATION=europe-west1
gcloud artifacts repositories create $REPOSITORY \
  --repository-format=docker \
  --location=$LOCATION \
  --description="Docker images storage"
You need to grant Docker access to push artifacts to (and pull from) your repository. There are different authentication methods, the gcloud credential helper being the easiest.
- Define the repository hostname matching the repository $LOCATION:
# If $LOCATION is "europe-west1"
HOSTNAME=europe-west1-docker.pkg.dev
- Configure the gcloud credential helper:
gcloud auth configure-docker $HOSTNAME
- Type y to accept the configuration
- Check your credential helper is set:
cat ~/.docker/config.json
You should get:
{ "credHelpers": { "europe-west1-docker.pkg.dev": "gcloud" } }
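To make the relation between location, hostname and repository concrete, here is a small sketch that derives the registry hostname from the location and builds a full image path. `image_uri` and `REGISTRY_HOST` are hypothetical names, `my-image` is a placeholder, and the project ID is the example one used earlier in this guide.

```shell
# Values from the steps above; PROJECT_ID is the example ID used in this guide.
LOCATION=europe-west1
REPOSITORY=docker-hub
PROJECT_ID=wagon-bootcamp-123456

# The Docker registry hostname is derived from the repository location.
REGISTRY_HOST="${LOCATION}-docker.pkg.dev"

# Hypothetical helper (an assumption for illustration): build the full path
# of an image, in the form HOST/PROJECT_ID/REPOSITORY/IMAGE:TAG.
image_uri() {
  printf '%s/%s/%s/%s:%s\n' "$REGISTRY_HOST" "$PROJECT_ID" "$REPOSITORY" "$1" "$2"
}

image_uri my-image latest
```

An image tagged with such a path can then be pushed with docker push once the credential helper is configured.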
Kubernetes (K8s) is a system designed to make deploying auto-scaling containerized applications easy.
kubectl is the CLI for interacting with K8s! The commands below follow the official installation instructions:
https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --client
kubectl version --client --output=yaml
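The checksum step in the commands above compares the SHA-256 digest of the downloaded binary with the published one. The sketch below reproduces that comparison on a dummy file so it can be run without downloading anything; `verify_sha256` is a hypothetical helper, and the file contents are placeholders.

```shell
# Hypothetical helper (an assumption for illustration): succeed when the
# SHA-256 digest of file $1 matches the digest stored in file $2.
verify_sha256() {
  expected="$(cut -d ' ' -f 1 "$2")"
  actual="$(sha256sum "$1" | cut -d ' ' -f 1)"
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Demo with a dummy file standing in for the downloaded binary.
demo_dir="$(mktemp -d)"
printf 'demo payload\n' > "$demo_dir/kubectl"
sha256sum "$demo_dir/kubectl" | cut -d ' ' -f 1 > "$demo_dir/kubectl.sha256"

if verify_sha256 "$demo_dir/kubectl" "$demo_dir/kubectl.sha256"; then
  echo "kubectl: OK"
else
  echo "kubectl: FAILED"
fi
```

A failed check on the real download means the file is incomplete or altered and should be downloaded again before installing.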
Minikube is a way to quickly spin up a local Kubernetes cluster!
https://minikube.sigs.k8s.io/docs/start/
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
To test that you can launch a cluster, run:
minikube start
You should see your cluster booting up.
Then, to check the cluster, run:
kubectl get po -A
You should be able to see your cluster running!
To tear it all down for now:
minikube delete --all
Terraform is an infrastructure as code (IaC) tool used to define the resources to create in the cloud!
Install some basic requirements:
sudo apt-get update && sudo apt-get install -y gnupg software-properties-common
Terraform is not available through apt by default, so we need to make it available!
wget -O- https://apt.releases.hashicorp.com/gpg | \
gpg --dearmor | \
sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null
gpg --no-default-keyring \
--keyring /usr/share/keyrings/hashicorp-archive-keyring.gpg \
--fingerprint
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \
https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \
sudo tee /etc/apt/sources.list.d/hashicorp.list
Now we can install Terraform directly with apt:
sudo apt update
sudo apt-get install terraform
Verify the installation with:
terraform --version
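To confirm in one go that the command-line tools installed so far are on your PATH, a check along these lines can help. It is a sketch only; missing_tools is a hypothetical helper name, and the list of tools in the usage line is taken from the sections above.

```shell
# Hypothetical helper: print each given command that is not on the PATH;
# returns non-zero if at least one is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      rc=1
    fi
  done
  return "$rc"
}

# Usage once the installs above are done:
# missing_tools kubectl minikube terraform
```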
Spark is a data processing framework.
Move to your home directory:
cd ~
Download spark:
wget https://downloads.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
Extract the tarball:
mkdir -p ~/spark && tar -xvzf spark-3.5.3-bin-hadoop3.tgz -C ~/spark
Set the environment variables needed by spark:
echo "export SPARK_HOME=$HOME/spark/spark-3.5.3-bin-hadoop3" >> ~/.zshrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.zshrc
Test it works by running:
exec zsh
spark-shell
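Running the two echo commands above more than once adds duplicate lines to your .zshrc. As a hedged alternative, a tiny helper can append a line only when it is not already present. append_once is a hypothetical name, not a standard command.

```shell
# Hypothetical helper: append line $1 to file $2 only if that exact line is not already there.
append_once() {
  grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Usage, equivalent to the two echo commands above but safe to repeat:
# append_once "export SPARK_HOME=$HOME/spark/spark-3.5.3-bin-hadoop3" ~/.zshrc
# append_once 'export PATH=$PATH:$SPARK_HOME/bin' ~/.zshrc
```

The -x and -F flags make grep match the whole line literally, so the $ signs in the export lines are not treated as regex anchors.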
Ubuntu 22.04 comes with Python pre-installed, but we want a specific version: 3.8.14, a security release of Python 3.8.
Let's install pyenv to manage our Python versions:
git clone https://github.com/pyenv/pyenv.git ~/.pyenv
source ~/.zprofile
exec zsh
Now install 3.8.14:
pyenv install 3.8.14
pyenv global 3.8.14
Now python --version should return 3.8.14.
We'll also install a useful pyenv plugin called pyenv-virtualenv. Although we will be using poetry for package and virtual environment management, pyenv-virtualenv is useful for controlling Python versions locally.
git clone https://github.com/pyenv/pyenv-virtualenv.git $(pyenv root)/plugins/pyenv-virtualenv
exec zsh
Next, we are going to install pipx to install the Python packages we want globally available while still using virtual environments:
pip install --upgrade pip
python -m pip install --user pipx # --user so that each Ubuntu user can have their own 'pipx'
python -m pipx ensurepath
exec zsh
Let's install tldr with pipx:
pipx install tldr
Now tldr should be globally available (for the current user). Test it out with:
tldr ls
Much more readable than the classic man ls (although sometimes you will still need to delve into the man pages to get all of the details!), and it even has pages not included in man, such as tldr gh.
Let's add a few more packages we want globally available.
black, to help format code:
pipx install black
Poetry is a modern Python package manager we will use throughout the bootcamp.
Install Poetry running the following command in your VS Code terminal:
pipx install poetry
Then, let's update the default Poetry behavior so that virtual envs are always created where poetry install is run.
During the bootcamp, you'll see a .venv folder being created inside each challenge folder.
poetry config virtualenvs.in-project true
Finally, update your VS Code settings to tell it that this relative .venv folder path will be your default interpreter!
(Command Palette - Preferences: Open Remote Settings (JSON), then add the following line to the panel that opens on the right)
"python.defaultInterpreterPath": ".venv/bin/python",
Direnv is a great utility that looks for .envrc files in your directories. When you cd into a directory with a .envrc file, paths are automatically updated. In our case, this simplifies our workflow so that we don't have to worry about Poetry-managed Python virtual environments.
- First, set up the direnv hook in your zsh shell so that direnv gets activated anytime a .envrc file exists in the current working directory.
code ~/.zshrc
plugins=(... direnv) # add this direnv to the existing list of plugins
- Second, let's configure what happens anytime a .envrc file is found
code ~/.direnvrc
- Paste the following lines
layout_poetry() {
  if [[ ! -f pyproject.toml ]]; then
    log_error 'No pyproject.toml found. Use `poetry new` or `poetry init` to create one first.'
    exit 2
  fi
  # create venv if it doesn't exist
  poetry run true
  export VIRTUAL_ENV=$(poetry env info --path)
  export POETRY_ACTIVE=1
  PATH_add "$VIRTUAL_ENV/bin"
}
- Save and close the file
Now, anytime you cd into a challenge folder containing a .envrc file with the layout_poetry() command inside, the function gets executed and your virtual env switches to the Poetry one defined by the pyproject.toml!
- No need to prefix all commands with poetry run <my_command>; simply run <my_command>
- Each challenge will have its own virtual env, and it will be seamless for you to switch between challenges/envs
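For a project of your own that does not ship a .envrc, a sketch like the one below creates one. init_envrc is a hypothetical helper, not a direnv command. It relies on the fact that direnv's layout poetry call dispatches to a function named layout_poetry, such as the one defined in ~/.direnvrc.

```shell
# Hypothetical helper: create a .envrc in directory $1 that calls the poetry
# layout, unless a .envrc already exists there.
init_envrc() {
  if [ ! -f "$1/.envrc" ]; then
    # `layout poetry` makes direnv call the layout_poetry function
    echo "layout poetry" > "$1/.envrc"
  fi
}

# Usage inside a project that has a pyproject.toml:
# init_envrc . && direnv allow .
```

The direnv allow step is needed once per .envrc, because direnv refuses to run files you have not explicitly approved.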
Let's clone the challenges onto your virtual machine:
export GITHUB_USERNAME=`gh api user | jq -r '.login'`
echo $GITHUB_USERNAME
Then
mkdir -p ~/code/$GITHUB_USERNAME && cd $_
gh repo fork lewagon/data-engineering-challenges --clone
Check that your remotes match this setup: origin points to your fork of data-engineering-challenges and upstream to lewagon's!
cd data-engineering-challenges
git remote -v
# origin git@github.com:your_github_username/data-engineering-challenges.git (fetch)
# origin git@github.com:your_github_username/data-engineering-challenges.git (push)
# upstream git@github.com:lewagon/data-engineering-challenges.git (fetch)
# upstream git@github.com:lewagon/data-engineering-challenges.git (push)
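If you prefer to check a single remote rather than read the whole listing, a filter over the git remote -v output is enough. This is a sketch, and remote_url is a hypothetical helper name.

```shell
# Hypothetical helper: read `git remote -v` output on stdin and print the
# fetch URL of the remote named $1.
remote_url() {
  awk -v name="$1" '$1 == name && $3 == "(fetch)" { print $2 }'
}

# Usage inside the cloned repository:
# git remote -v | remote_url upstream
```

git remote get-url upstream gives the same answer directly; the filter is mainly useful when you already have the listing in front of you.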
From the challenge folder root on the VM, we'll run make install, which triggers 3 operations:
- make install-poetry: cd inside each challenge folder, and poetry install inside each! (takes a while)
- make allow-envrc: allow direnv to execute inside each folder (otherwise you have to manually "allow" it)
- make own-repo: allows your user to be the Linux "owner" of all files in this challenge folder
Let's make! (You've got time for a coffee, or start the next step during the install)
make install
If you see the message direnv: error .envrc file not found, that is normal and nothing to worry about.
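To get a feel for what a per-folder install involves, the sketch below lists every folder that contains a pyproject.toml. It is an illustration under assumptions, not the contents of the repository's Makefile: list_poetry_projects is a hypothetical helper, and the loop in the usage comment is only one way such a target could be written.

```shell
# Hypothetical helper: print every directory under $1 that contains a
# pyproject.toml, skipping anything inside a .venv folder.
list_poetry_projects() {
  find "$1" -name pyproject.toml ! -path '*/.venv/*' -exec dirname {} \; | sort
}

# Usage: roughly what an "install everywhere" loop could look like.
# list_poetry_projects . | while read -r dir; do (cd "$dir" && poetry install); done
```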
Download and install DBeaver on your local machine. It is a free, open-source tool to connect to any database, explore the schema and even run SQL queries.
If you are unsure about what to do, you can follow this link. If you are already logged in, you can safely skip this section. If you are not logged in, click on Enter Kitt as a Student. If you manage to log in, you can safely skip this step. Otherwise, ask a teacher whether you should have received an email, or follow the instructions below.
Register as a Wagon alumni by going to kitt.lewagon.com/onboarding. Select your batch, sign in with GitHub and enter all your information.
Your teacher will then validate that you are indeed part of the batch. You can ask them to do it as soon as you completed the registration form.
Once the teacher has approved your profile, go to your email inbox. You should have 2 emails:
- One from Slack, inviting you to the Le Wagon Alumni slack community (where you'll chat with your buddies and all the previous alumni). Click on Join and fill the information.
- One from GitHub, inviting you to the lewagon team. Accept it, otherwise you won't be able to access the lecture slides.