Luis Valencia Data Science Portfolio

Creating an LLM

Understanding Word Embeddings: The Foundation of Language Models

In this post, we'll explore one of the key concepts that power large language models (LLMs): word embeddings. Don't worry if it sounds complex right now. By the end of this post, you'll have a clear idea of what word embeddings are and how they transform plain text into a form that computers can understand and learn from.

Efficient Tokenization with Byte Pair Encoding (BPE) for Neural Networks

When working with neural networks for natural language processing (NLP), one of the key challenges is handling words that your model hasn't encountered before. Traditional word-level tokenization, which splits text into individual words, struggles with this: if the model encounters a word that was not in its training vocabulary, it can't process it properly. This is where Byte Pair Encoding (BPE) comes in.

Implementing Byte Pair Encoding (BPE) for Tokenization: A Step-by-Step Guide

In this project, I write a simple custom BPE tokenizer that splits text into subwords the way the original BPE does. The project is intended for learning purposes, not as a replacement for the widely adopted original BPE.
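
As a rough illustration of the idea, here is a minimal sketch of BPE merge learning using only the standard library. It is not the code from the project: the function names (`get_pair_counts`, `merge_pair`, `learn_bpe`) and the toy corpus are assumptions made for this example, and it skips details such as end-of-word markers.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Start from single characters: each word is a tuple of symbols.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Toy corpus (made up for this example).
merges, words = learn_bpe("low low lower lowest", num_merges=2)
print(merges)
print(words)
```

After two merges on this toy corpus, the shared prefix "low" has become a single symbol, which is how BPE lets a model represent unseen words such as "lowly" from known subword pieces.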

How Large Language Models Use Sliding Windows for Next-Word Prediction — With PyTorch

In this project I explain, step by step and in full detail, how the sliding window technique for training LLMs works. The code is easy to read and to modify to your needs.
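
The core of the technique can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `sliding_windows`, the `context_size` and `stride` parameters, and the token IDs are made up here, and the project itself uses PyTorch, where these pairs would typically be wrapped in a dataset and converted to tensors.

```python
def sliding_windows(token_ids, context_size, stride=1):
    """Return (input, target) pairs; each target is its input shifted right by one token."""
    pairs = []
    # Stop early enough that every target window is still full length.
    for start in range(0, len(token_ids) - context_size, stride):
        inputs = token_ids[start:start + context_size]
        targets = token_ids[start + 1:start + context_size + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = [10, 11, 12, 13, 14, 15]  # made-up IDs standing in for tokenizer output
pairs = sliding_windows(token_ids, context_size=4, stride=1)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

At every position the model sees the tokens up to that point in `inputs` and is trained to predict the token at the same position in `targets`, which is the next token in the text.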

LLMs, Langchain, OpenAI, Cognitive Search, Vector DB, Pinecone

Elevating User Engagement: Implementing Real-Time Streaming with RAG Chat and Azure Cognitive Search in Chatbot Backends with Flask and LangChain

Discover how to leverage Flask and multithreading to create a streaming REST API that delivers near real-time token generation (at least in the UI).
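
The threading pattern behind this kind of streaming can be sketched with the standard library alone. The sketch below is an assumption about the general approach, not the project's code: `fake_llm` is a hypothetical stand-in for the real LLM call, and `stream_tokens` is a name invented for this example. A worker thread produces tokens into a queue while a generator yields them as they arrive.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def fake_llm(prompt, on_token):
    """Hypothetical stand-in for an LLM call that reports each token through a callback."""
    for token in prompt.split():
        on_token(token + " ")


def stream_tokens(prompt):
    """Run the model in a worker thread and yield tokens as soon as they arrive."""
    tokens = queue.Queue()

    def worker():
        try:
            fake_llm(prompt, tokens.put)
        finally:
            tokens.put(_DONE)  # always unblock the consumer, even on error

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = tokens.get()  # blocks until the worker produces something
        if item is _DONE:
            break
        yield item


for token in stream_tokens("hello streaming world"):
    print(token, end="", flush=True)
print()
```

In a Flask backend, a generator like `stream_tokens` could be passed as the body of a streamed response so that the client receives tokens while the model is still generating.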

Creating a Llama2 Managed Endpoint in Azure ML and Using it from Langchain

Discover the power of combining Azure ML Studio with the new model catalog: deploy the models as Managed Endpoints and then consume them from Streamlit applications. With this setup you no longer need to rely on OpenAI models.

Elevate Chat & AI Applications: Mastering Azure Cognitive Search with Vector Storage for LLM Applications with Langchain

Discover the synergy of Azure Cognitive Search's custom skills and OpenAI Embedding Generator. Unleash the potential of enhanced data indexing, AI embeddings, and Language Models for enriched search and dynamic interactions. Explore the series that transforms insights into conversations, bridging the gap between data and AI-driven engagement. We use Cognitive Search Vector Storage (Public Preview), Langchain, and OpenAI, along with features like Knowledge Store and Custom Skillsets. It's an exciting portfolio project split into 5 parts:

Creating a Langchain application with Streamlit, OpenAI to talk to your own text documents using Pinecone as Vector DB

Discover the power of Langchain applications! In this blog post, I will explore how to create a cutting-edge Langchain application that enables you to interact with your own text documents in a conversational manner. By harnessing the capabilities of Langchain, OpenAI, and Pinecone as a Vector DB, I'll guide you through the process of building a seamless user experience.

Azure Machine Learning

Azure ML environments

In this project I demonstrate how to properly manage environments, both local environments and Azure ML environments, which can be synced with requirements.txt for training and inference. This helps keep a reproducible working environment across the ML stages.

Deep Learning, PyTorch, Transformers or HuggingFace Projects

Build a Neural Network from Scratch Using NumPy

Learn how to build a basic NN from scratch without PyTorch or TensorFlow.

Transfer Learning for Image Classification with PyTorch

In my previous project I created a CNN from scratch, but in the real world you would rarely do that; instead you would rely on existing pretrained models. Here we take an existing model and modify the classification head for our specific problem.

Building a TinyVGG Model from Scratch to Classify The Simpsons Characters with PyTorch 2.0

In this project I created a CNN from scratch. TinyVGG is a well-known CNN architecture, and I built it so that readers can understand all the steps needed to create, train, and evaluate a deep learning model from the ground up.

Fine-tuning DistilBERT with your own dataset for multi-classification task

HuggingFace and the transformers library have made it very easy to avoid training Large Language Models from scratch. Instead, we can reuse existing models by downloading them from the HF Hub and then fine-tuning them with PyTorch and the Transformers API for our own specific task and labels. The end result is a great model that builds on a state-of-the-art pretrained model but fits your needs. In this case I selected DistilBERT and fine-tuned it to detect hate speech or offensive tweets. In this notebook I start from the very beginning by introducing NLP concepts and continue to the end, when the model is saved to disk, loaded, and then used for inference.

Data Cleaning and Data Preparation

Real Estate Transactions Per Municipality

The government has given us a dataset with aggregated data of real estate transactions per municipality, per quarter, per semester and per year. Our job is to describe the data, clean it, wrangle it and/or prepare it for further processing in an ML forecasting project or for data storytelling.

Classification

Wine Quality Classification Problem

We have a wine dataset with information about chemical ingredients, pH, and a quality score from 0 to 10. We want to classify a wine as good or bad: good wines are those scored 7 or above, and all the rest are bad wines.
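
The labeling rule described above can be written as a small helper. This is a sketch only: the function name `label_wine` and the `"good"`/`"bad"` string labels are assumptions for this example, and the project may encode the target differently (for instance as 1 and 0).

```python
def label_wine(quality, threshold=7):
    """Map a 0-10 quality score to a binary label: 'good' at or above the threshold."""
    if not 0 <= quality <= 10:
        raise ValueError(f"quality must be between 0 and 10, got {quality}")
    return "good" if quality >= threshold else "bad"


scores = [3, 6, 7, 9]  # made-up quality scores
print([label_wine(score) for score in scores])
```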

Recommender Systems

Recommender system for internal trainings

The idea of this project is simple: we have internal trainings created by our company's employees, external trainings that we take via Pluralsight, Udemy, or other platforms like Coursera, and employees who take those internal or external trainings. Employees have attributes such as department, language, and skills. All those attributes need to be taken into account in our recommender system.

For example, suppose you are a new employee with only 1 year of experience in Data Science (a feature) and your skills (another feature) list Statistics and Machine Learning. Another person in the company has 10 years of experience in Data Science and similar skills, and took the "Advanced machine learning specialization" on Coursera. The recommender system would be able to predict, that is, recommend, this training for the new employee.
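
To make the example concrete, here is a minimal sketch of one possible approach: score other employees by the overlap of their skills (Jaccard similarity) and recommend the trainings taken by the most similar ones. It is not the project's model. The function names, the dictionary layout, and the sample employees are assumptions, and it uses only the skills attribute, whereas the project also considers attributes such as department and language.

```python
def jaccard(a, b):
    """Similarity of two skill sets: size of the overlap divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, employees, top_n=3):
    """Rank trainings taken by employees whose skills overlap with the target's."""
    scores = {}
    for other in employees:
        if other["name"] == target["name"]:
            continue
        similarity = jaccard(target["skills"], other["skills"])
        if similarity == 0:
            continue  # no shared skills, so no evidence for a recommendation
        for training in other["trainings"]:
            if training not in target["trainings"]:
                scores[training] = scores.get(training, 0.0) + similarity
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


# Made-up employees mirroring the example above.
new_hire = {"name": "new_hire", "skills": {"Statistics", "Machine Learning"}, "trainings": set()}
senior = {
    "name": "senior",
    "skills": {"Statistics", "Machine Learning", "Deep Learning"},
    "trainings": {"Advanced machine learning specialization"},
}
designer = {"name": "designer", "skills": {"Figma"}, "trainings": {"UX basics"}}

print(recommend(new_hire, [new_hire, senior, designer]))
```

Because the designer shares no skills with the new employee, only the senior data scientist's training is recommended.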

Regression

Real Estate Price Prediction

We are required to build a model to predict house prices in the Belgian real estate market. The idea is that when users want to buy a new house, they can compare the listed price with the model prediction, check whether the prices are similar, and make a decision.

Time Series Forecasting

Forecasting Energy Inflation in Belgium

Belgium's gas and energy prices skyrocketed last year, and recently we received the news that the price has dropped from 300 euros per MWh to 60 euros per MWh. For this project I took the Energy Inflation data from the past years and forecasted the next 4 months to see if our pockets will finally breathe a bit.