Jaewon Shim
Education
University of California, Berkeley
Data Science B.A. | GPA: 3.9 / 4.0
Dec 2025 | Berkeley, CA
Work Experience
Data Scientist Intern, MKS Instruments (Jun 2024 – Dec 2024)
- Designed and deployed an ensemble machine learning model (LightGBM, Random Forest) on 500K+ rows of laser product inspection data, achieving 92% prediction accuracy and reducing false positives by 35%.
- Operated the model for real-time defect risk scoring, contributing to a 12% increase in first-pass yield and $750K annual scrap reduction.
- Led migration of 20+ dashboards from Tableau to Power BI, optimizing DAX data models and ETL pipelines to cut data refresh times by 45% and improve stakeholder usability.
- Automated reporting workflows using Python and Excel VBA, increasing operational efficiency by 30%, and supported quality initiatives with exploratory data analysis that reduced warranty incidents by 23%.
Python and Mathematics Tutor, Tublet (Feb 2023 – Mar 2025)
- Facilitated online programming and statistics tutoring sessions, offering guidance to over 100 students.
- Earned an Honorable IT Tutor Certificate, awarded to the top 1% of tutors in the company, for raising student grades from C or below to A in 96% of tutoring sessions.
Skills
- Python: Pandas, Matplotlib, Seaborn, Scikit-learn, TensorFlow, Keras, LightGBM, PyTorch, Regex, API
- Database Management: SQL, Query Optimization, ETL Process, Datamart, SAP HANA, Snowflake, Smartsheet
- Machine Learning: Supervised/Unsupervised Learning, Ensemble Methods, Evaluation, Pipeline Automation
- Deep Learning: Python Frameworks, CNN (Image Processing), RNN (Time-Series), Transfer Learning
- Mathematics / Statistics: Linear Algebra, Probability, Hypothesis Testing, Regression Analysis, A/B Testing
- Business Intelligence: Excel, Power BI, Tableau, Smartsheet
Projects
Defect Prediction Model for Laser Product Quality Optimization, MKS Instruments
- Developed a LightGBM-based ensemble model to predict final-stage laser product defects using over 500K inspection records, achieving 85% recall and minimizing late-stage failures.
- Automated SMOTE-based resampling and feature generation pipelines in Python, reducing preprocessing time by 40% and ensuring consistent model performance over multiple retraining cycles.
- Integrated predictions into Power BI to visualize risk trends across product lines and inspection stages, enabling proactive quality control and reducing unexpected scrap events by 20%.
Root Cause Analysis Dashboard, MKS Instruments
- Used Power BI to pinpoint manufacturing and design issues by analyzing trends in OBQ, AFR, and WIRR.
- Reduced warranty incidents by 23% in 6 months through quality benchmarking and servicing.
- Enabled data-driven decision-making by aligning OBQ insights with customer reviews to prioritize areas for improvement.
Samsung Stock Forecasting
- Implemented LSTM and GRU architectures in TensorFlow, achieving an R-squared value of 0.95 with the GRU model.
- Provided actionable insights to help stock investors optimize their investment plans.
- Forecasted Samsung stock prices for the following 10 days and advised against investment due to a predicted price decline.
California Housing Cost Modeling
- Performed exploratory data analysis and random forest regression modeling to predict house prices.
- Demonstrated proficiency in regression algorithms, data preprocessing, model evaluation, and hyperparameter tuning.
- Built a final model that explains 80% of the variance in house prices and successfully predicted the price of the target house.
COVID-19 Data Exploration
- Used a PostgreSQL database and advanced SQL queries to perform multivariate analysis.
- Identified the countries with the highest infection rates along with their corresponding death rates.
- Calculated a global correlation coefficient of -0.751 between GDP and infection rate, a strong negative association that suggests a link between GDP per capita and the spread of the pandemic.
Bike Ride Moving Average Dashboard
- Built a moving-average visualization of the London bike rides dataset with three customizable parameters.
- Implemented a heatmap with two bar charts in the tooltip, displaying ride length and weather distribution.
Certificates
- Google Data Analytics Certificate
- DataCamp SQL Certificate
- DataCamp Python Certificate
- IBM Data Science Certificate
Languages
- Korean: Native/Bilingual Proficiency
- English: Native/Bilingual Proficiency