Data Engineering Portfolio
News Extraction and Recommendation System
Technologies: ETL, Airflow, Snowflake, Kafka, CI/CD, MongoDB, FastAPI, Beautiful Soup, Spacy, OpenAI, GPT
GitHub Repository: Link to Repository
- Spearheaded the creation of an ETL pipeline to extract over 300 news stories every 10 minutes, embedding news titles into a vector database, summarizing using the GPT model, and loading them into Snowflake
- Established streaming alerts on Kafka for keyword sets, utilizing a Python service to match alerts with user interests stored in MongoDB, and delivering alerts to users via a FastAPI endpoint
Knowledge Retrieval System (RAG Application)
Technologies: RAG, Generative AI, ETL, Airflow, Snowflake, GCP, OpenAI, LLM, Pinecone VectorDB
GitHub Repository: Link to Repository
- Designed and implemented an ETL pipeline using Airflow to extract daily information from web sources, transform and load it into Snowflake, and utilized OpenAI’s GPT model for efficient knowledge retrieval
- Integrated vector embeddings in Airflow, stored them in a Pinecone vector database for rapid retrieval of information based on user queries and hosted the entire system on GCP for scalability
Stock Market Real-Time Data
Technologies: Python, SQL, AWS, EC2, S3, Glue, Athena, Kafka, Docker
GitHub Repository: Link to Repository
- Designed a data pipeline project using Apache Kafka to stream 1000s of real-time stock market data using Python producer and consumer to load streamed data into different folders inside S3 bucket
- Utilized AWS Athena to query data from S3 storage using SQL queries and displaying the results on the Streamlit
Automated PDF Data Extraction and Querying
Technologies: Snowflake, data pipelining, Airflow, DBT, AWS, PyPDF, Data extraction
GitHub Repository: Link to Repository
- Created an application to convert unstructured PDFs to structured data, storing files in S3 and initiating an Airflow pipeline for parsing using PyPDF and loading into Snowflake, supporting 1000+ simultaneous requests
- Executed data transformation in DBT by loading Snowflake data into the DBT model and reloading it into the Snowflake production schema, while providing a Streamlit UI for querying all extracted data
Electric Vehicle Analysis Dashboard
Technologies: Tableau, Data Visualization, Data Analysis, Business Intelligence
GitHub Repository: Link to Repository
- Created an interactive dashboard analyzing over 150,000 electric vehicles, providing insights into market trends and technological advancements
- Enhanced decision-making by visualizing key metrics, including over 200% increase in BEV adoption and state-wise vehicle distribution from 2020, using dynamic charts and maps