Project Structure

This is the class structure diagram that Hermione relies on:

Below is a brief description of what each class does:

Data Source

DataBase - should be used when retrieving the data requires a connection to a database. Contains methods for opening and closing a connection.
Spreadsheet - should be used when the data lives in spreadsheets or text files. Any joining of the source files into a single “flat table” should be performed in this class.
DataSource - abstract class from which DataBase and Spreadsheet inherit.
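
The relationship between these three classes can be sketched roughly as follows. This is a minimal illustration using only the standard library, not Hermione's actual code: the method names (`get_data`, `open_connection`, `close_connection`), the constructor arguments, and the choice of SQLite and CSV are all assumptions made for the example.

```python
import csv
import io
import sqlite3
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Common interface. `get_data` is a hypothetical method name."""

    @abstractmethod
    def get_data(self):
        """Return the data as a list of dicts, one per row."""


class DataBase(DataSource):
    """Reads rows through a database connection (SQLite assumed here)."""

    def __init__(self, path, query):
        self.path = path
        self.query = query
        self.conn = None

    def open_connection(self):
        self.conn = sqlite3.connect(self.path)
        self.conn.row_factory = sqlite3.Row  # rows behave like mappings

    def close_connection(self):
        if self.conn is not None:
            self.conn.close()
            self.conn = None

    def get_data(self):
        self.open_connection()
        try:
            return [dict(row) for row in self.conn.execute(self.query)]
        finally:
            self.close_connection()  # always release the connection


class Spreadsheet(DataSource):
    """Reads CSV-like files and joins them on a key into one flat table."""

    def __init__(self, files, key):
        self.files = files  # open file-like objects
        self.key = key

    def get_data(self):
        merged = {}
        for handle in self.files:
            for row in csv.DictReader(handle):
                # Rows sharing the same key are merged into one record.
                merged.setdefault(row[self.key], {}).update(row)
        return list(merged.values())


# Usage with in-memory stand-ins for real files and a real database.
passengers = io.StringIO("id,name\n1,Braund\n2,Cumings\n")
tickets = io.StringIO("id,fare\n1,7.25\n2,71.28\n")
flat_table = Spreadsheet([passengers, tickets], key="id").get_data()

db_rows = DataBase(":memory:", "SELECT 1 AS id, 'Braund' AS name").get_data()
```

Keeping the join logic inside the spreadsheet class means the rest of the pipeline only ever sees a single flat table, whichever source it came from.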

Preprocessing

Preprocessing - concentrates all preprocessing steps that must be performed on the data before the model is trained.
Normalization - applies normalization and denormalization to the specified columns. The following normalization algorithms are already implemented: StandardScaler and MinMaxScaler.
TextVectorizer - transforms text into vectors. Implemented methods: bag of words, TF-IDF, and embeddings (mean, median and indexing).
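
To make the two ideas above concrete, here is a small standard-library sketch of min-max normalization with its inverse, and of a bag-of-words vectorizer. It does not reproduce Hermione's classes or scikit-learn's scalers: `MinMaxNormalizer`, `bag_of_words`, and their signatures are hypothetical names chosen for illustration.

```python
from collections import Counter


class MinMaxNormalizer:
    """Scales values to [0, 1] and can map them back (hypothetical class)."""

    def fit(self, values):
        self.min_ = min(values)
        self.max_ = max(values)
        return self

    def _span(self):
        # Avoid division by zero when all values are identical.
        return (self.max_ - self.min_) or 1.0

    def normalize(self, values):
        return [(v - self.min_) / self._span() for v in values]

    def denormalize(self, values):
        return [v * self._span() + self.min_ for v in values]


def bag_of_words(documents):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[token] for token in vocabulary])
    return vocabulary, vectors


ages = [10.0, 20.0, 30.0]
scaler = MinMaxNormalizer().fit(ages)
scaled = scaler.normalize(ages)        # [0.0, 0.5, 1.0]
restored = scaler.denormalize(scaled)  # [10.0, 20.0, 30.0]

vocabulary, vectors = bag_of_words(["the ship sank", "the ship sailed"])
# vocabulary: ['sailed', 'sank', 'ship', 'the']
# vectors:    [[0, 1, 1, 1], [1, 0, 1, 1]]
```

Denormalization matters mostly for regression targets: predictions made on the scaled target have to be mapped back to the original units before they are reported.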

Visualization

Visualization - methods for data visualization. There are methods to make static and interactive plots.
App Streamlit - example Streamlit app that uses the Titanic dataset and includes pandas profiling.

Model

Trainer - module that centralizes the training algorithm classes. Algorithms from the scikit-learn library, for instance, can be used through the provided TrainerSklearn class.
Wrapper - bundles the trained model together with its metrics. This class has built-in integration with MLflow.
Metrics - contains the key metrics calculated when models are trained. Classification, regression and clustering metrics are already implemented.
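
One way to picture how these three pieces fit together is the sketch below. It is an assumption-laden illustration rather than Hermione's implementation: the `train` signature, the `Metrics.classification` helper, and the `MajorityClassifier` (a stand-in for any estimator exposing `fit` and `predict`, such as a scikit-learn model) are all hypothetical, and the MLflow integration is left out.

```python
from collections import Counter


class Metrics:
    """Computes evaluation metrics (only accuracy in this sketch)."""

    @staticmethod
    def classification(y_true, y_pred):
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return {"accuracy": correct / len(y_true)}


class Wrapper:
    """Keeps the trained model and its metrics in one object."""

    def __init__(self, model, metrics):
        self.model = model
        self.metrics = metrics

    def predict(self, X):
        return self.model.predict(X)


class Trainer:
    """Fits an estimator and returns it wrapped with its test metrics."""

    def train(self, estimator, X_train, y_train, X_test, y_test):
        estimator.fit(X_train, y_train)
        metrics = Metrics.classification(y_test, estimator.predict(X_test))
        return Wrapper(estimator, metrics)


class MajorityClassifier:
    """Toy estimator that always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


wrapper = Trainer().train(
    MajorityClassifier(),
    X_train=[[1], [2], [3]], y_train=[0, 0, 1],
    X_test=[[4], [5]], y_test=[0, 1],
)
print(wrapper.metrics)           # {'accuracy': 0.5}
print(wrapper.predict([[6]]))    # [0]
```

Because the trainer only relies on `fit` and `predict`, swapping in a different algorithm does not change the code that consumes the wrapper.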

Tests

test_project - module for unit testing.