Projects
Martin Luther Bironga
Highlights
Breast cancer diagnosis: causal inference testing
We analyzed the Wisconsin Diagnostic Breast Cancer (WDBC) data using machine learning techniques. This is a classification problem because the WDBC data is class-labeled. The data consists of 32 attributes (features), and each record belongs to one of two classes (B = Benign, M = Malignant). According to all of the causal inference tests we conducted, the "mean" features, and a low radius mean in particular, had a causal effect on the detection of breast cancer.
In this project, I applied the following machine learning and data engineering skills:
Pandas, NumPy, Matplotlib, Seaborn, and other Python libraries: before starting the machine learning and causal inference work, I first analyzed the data and produced visualizations. I used these libraries for the exploratory data analysis (see the EDA sketch after this list).
CausalNex library: after finishing the data analysis, the next step was to uncover the causal relationships between the features and the target variable. To do that I used the widely used CausalNex library, which let me plot causal graphs and compare the relations learned from different fractions of my data.
Jaccard index of similarity: after plotting the ground-truth causal graph, I compared the graphs learned from different fractions of the data against it. I used the Jaccard index of their edge sets for this comparison (see the second sketch after this list).
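A minimal EDA sketch for the first item, assuming the WDBC CSV is saved locally as wdbc.csv with the commonly used Kaggle column names (diagnosis, radius_mean, ...); the file name and column names are assumptions, not necessarily the project's actual layout.

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    # Load the dataset and inspect the class balance (B = Benign, M = Malignant).
    df = pd.read_csv("wdbc.csv")                 # assumed file name
    print(df.info())
    print(df["diagnosis"].value_counts())        # assumed label column

    # Compare the distribution of one "mean" feature across the two classes.
    sns.boxplot(data=df, x="diagnosis", y="radius_mean")
    plt.title("radius_mean by diagnosis")
    plt.show()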
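And a minimal sketch of the causal-graph and Jaccard-comparison steps, assuming df is the WDBC dataframe with all columns numeric (diagnosis label-encoded); the 0.8 edge-weight threshold and the 50% sample fraction are illustrative choices, not the project's actual settings.

    from causalnex.structure.notears import from_pandas

    # Ground-truth causal graph learned from the full dataset.
    sm_full = from_pandas(df)
    sm_full.remove_edges_below_threshold(0.8)    # illustrative threshold

    # Causal graph learned from a fraction of the data.
    sm_frac = from_pandas(df.sample(frac=0.5, random_state=42))
    sm_frac.remove_edges_below_threshold(0.8)

    def jaccard_edges(g1, g2):
        # Jaccard similarity between the edge sets of two causal graphs.
        e1, e2 = set(g1.edges()), set(g2.edges())
        return len(e1 & e2) / len(e1 | e2) if (e1 | e2) else 1.0

    print(jaccard_edges(sm_full, sm_frac))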
Machine Learning
Carried out three types of classification analysis to predict whether a user responds "yes" to brand awareness, namely Logistic Regression, Decision Trees, and XGBoost, then compared the classification models to assess the best-performing one(s).
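A minimal sketch of the model comparison, assuming a prepared feature matrix X and binary target y; accuracy is used here only for brevity, not as the project's chosen metric.

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
    from xgboost import XGBClassifier    # pip install xgboost

    # X and y are assumed to be a prepared feature matrix and binary target.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        print(f"{name}: accuracy = {accuracy_score(y_test, preds):.3f}")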
Speech-to-text data collection
We created a web app that receives audio recordings of people reading a given text displayed on our front end. We found that the amount of data was a crucial factor in how well our deep learning model performs on Amharic speech-to-text conversion, so we collected additional data using Apache tools, applying Apache Spark, Airflow, and Kafka concepts.
APPROACHES
Combined Apache Kafka, Airflow, and Spark for more reliable data collection.
Given the shortage of Amharic-language data, we tried to collect a large corpus of the language through a more robust data pipeline.
For this project, we used Kafka as our message broker, Airflow as our event listener and workflow initiator, and Spark for data transformation and cleaning.
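A minimal sketch of the ingestion side, under assumed broker address, topic name, and record fields: the web app publishes each submitted recording's metadata to Kafka for downstream processing.

    import json
    from kafka import KafkaProducer    # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",    # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Hypothetical metadata for one submitted recording.
    record = {
        "audio_path": "s3://bucket/recordings/sample.wav",
        "transcript": "the sentence displayed on the front end",
        "duration_sec": 4.2,
    }
    producer.send("amharic-stt-recordings", value=record)    # assumed topic name
    producer.flush()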
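And a minimal sketch of the orchestration side, assuming Airflow 2.x and hypothetical landing/output paths: a scheduled DAG runs the Spark cleaning step over the recordings landed from Kafka.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def clean_with_spark():
        # Spark cleaning step sketched inline; paths are hypothetical.
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("amharic-stt-cleaning").getOrCreate()
        raw = spark.read.json("/data/raw/recordings/")
        cleaned = (raw.dropna(subset=["audio_path", "transcript"])
                      .dropDuplicates(["audio_path"]))
        cleaned.write.mode("overwrite").parquet("/data/clean/recordings/")
        spark.stop()

    with DAG(
        dag_id="amharic_stt_pipeline",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="spark_clean", python_callable=clean_with_spark)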
METRICS:
We used Word Error Rate (WER) as our metric and implemented different data pipelines to achieve the smallest WER for our machine learning model.
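A minimal sketch of the WER computation: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words; libraries such as jiwer provide the same metric.

    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # substitution or match
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    # Example: one substitution out of four reference words -> WER = 0.25
    print(wer("the cat sat down", "the cat sat up"))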