Martin Luther Bironga


Breast cancer diagnosis: causal inference testing

We analyze the Wisconsin Diagnostic Breast Cancer (WDBC) data using machine learning techniques. Because the WDBC data is labeled, this is a classification problem. The data consists of 32 attributes, or features, and each sample belongs to one of two classes (B = Benign, M = Malignant). According to all of the tests we conducted using causal inference, the "mean" features and a low radius mean had a causal effect on the detection of breast cancer.

In this project, I applied these main machine learning and data engineering skills:

  • Pandas, NumPy, Matplotlib, Seaborn, and other Python libraries: Before starting the machine learning and causal inference work, I first analyzed the data and produced some visualizations. I used these libraries for the exploratory data analysis.

  • CausalNex library: After finishing my data analysis, I looked for the causal relationships between my features and the target variable. To do that, I used the widely used CausalNex library. I was able to view the causal graphs and infer the relations between different fractions of my data.

  • Jaccard index of similarity: After plotting my ground truth causal graph, I compared the causal graphs learned from different fractions of my data against the ground truth. I used the Jaccard index to measure how similar they were.
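The graph comparison described above can be sketched as follows. This is a minimal illustration, assuming each causal graph is reduced to its set of directed edges; the function name and the example edges are hypothetical and are not results from the project.

```python
def jaccard_similarity(edges_a, edges_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of edges."""
    set_a, set_b = set(edges_a), set(edges_b)
    union = set_a | set_b
    if not union:
        # Two empty graphs are treated as identical (an assumption of this sketch).
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical edge lists, written as (cause, effect) pairs.
ground_truth_edges = [
    ("radius_mean", "diagnosis"),
    ("area_mean", "diagnosis"),
    ("radius_mean", "area_mean"),
]
subset_edges = [
    ("radius_mean", "diagnosis"),
    ("area_mean", "diagnosis"),
    ("texture_mean", "diagnosis"),
]

# Two shared edges out of four distinct edges in total.
score = jaccard_similarity(ground_truth_edges, subset_edges)
print(f"Jaccard similarity: {score:.2f}")
```

A score of 1 means the graph learned from a fraction of the data has exactly the same edges as the ground truth, and a score of 0 means they share no edges.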

Machine Learning

  • Carried out three types of classification analysis to predict whether a user responds yes to brand awareness, namely Logistic Regression, Decision Trees, and XGBoost, then compared the classification models to identify the best-performing one(s).

Speech-to-text data collection

We created a web app that receives audio recordings of people reading a given text displayed on our front end. The amount of data was a crucial factor for our deep learning model in Amharic speech-to-text conversion, so we collected data using Apache tools. We implemented Apache Spark, Airflow, and Kafka concepts.


  • Combined the implementations of Apache Kafka, Airflow, and Spark for better data collection.

  • Having seen the shortage of data in the Amharic language, we tried to collect a large corpus of the language through a more robust data pipeline.

  • For this project, we used Kafka as our broker, Airflow as our event listener and initiator, and Spark for the data transformation and cleaning.


  • We used WER (Word Error Rate) as our metric and implemented different data pipelines to achieve the smallest WER on our machine learning model.
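For reference, WER is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a word-level edit distance. It is a minimal illustration, not the project's evaluation code; the function name and the English example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_words)


# Hypothetical transcripts: one substitution ("sat" -> "sit") and one
# deleted word ("the") against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")
```

Because insertions are counted too, WER can exceed 1 when the hypothesis is much longer than the reference.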