Understanding the underlying topology of data

I'm fascinated by machine learning. I started using it back in 2016 in a previous IoT project, where I trained an algorithm to make data correlation more accurate and then used the results as new triggers for actions that other IoT devices would take. A new method caught my eye recently, and I decided to dedicate a blog post to it.

Topological data analysis (TDA) is an unconventional machine learning technique that is used to understand the underlying topology of data. The premise is that data has shape. The two methodologies used in TDA are persistent homology and the mapper algorithm. Traditional machine learning techniques include supervised/unsupervised methods such as clustering, Bayesian networks, neural networks, support vector machines (SVM), and random forests.

Background/Research/Business Need

When TDA is used in conjunction with traditional machine learning models, it can improve the overall accuracy of those models at localized sections of the data. A SymphonyAI (2021) white paper discusses how traditional machine learning models use global optimization, which assumes (or guesses) the shape of the data in order to derive parameters that approximate the dataset, and how this often produces errors in some regions of the data. TDA, in contrast, uses the output network topology to create separate models of the underlying data, each responsible for a different local section. According to the white paper, this produces a better representation than a single global model. The challenge is to test whether this localized modeling methodology is more efficient and improves the accuracy of predictions.
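To make the global-versus-local idea concrete, here is a minimal sketch on toy data. It is not the SymphonyAI method: the V-shaped dataset, the helper names, and the hand-picked split at x = 0 are all assumptions for illustration. In TDA the local regions would come from the network topology rather than from a manual split.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_squared_error(xs, ys, model):
    """Average squared residual of the line `model` on the given data."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data with a V shape (y = |x|): no single straight line fits it well.
xs = [i / 10 for i in range(-20, 21)]
ys = [abs(x) for x in xs]

# One global model for the whole dataset.
global_error = mean_squared_error(xs, ys, fit_line(xs, ys))

# Two local models, one per region (the split at 0 is a hand-picked assumption).
regions = [
    [(x, y) for x, y in zip(xs, ys) if x < 0],
    [(x, y) for x, y in zip(xs, ys) if x >= 0],
]
squared_error_total = 0.0
for region in regions:
    rx = [x for x, _ in region]
    ry = [y for _, y in region]
    squared_error_total += mean_squared_error(rx, ry, fit_line(rx, ry)) * len(region)
local_error = squared_error_total / len(xs)

print("global model error:", global_error)
print("local models error:", local_error)
```

On this toy dataset the single line has a clearly nonzero error, while the two local lines fit their regions almost exactly. Real data will rarely split this cleanly.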

Machine Learning

Machine learning is commonly divided into unsupervised and supervised learning. Unsupervised learning uses methods such as clustering, to segment data into smaller datasets, and dimensionality reduction, to make high-dimensional data easier to visualize. Clustering models include hierarchical clustering and K-Means. Supervised learning consists of regression and classification models. The classification models used to assist in the prioritization effort are neural networks, random forests and single tree models, and SVM.

Supervised Learning Classification Models

Random forests are an ensemble technique analogous to bagging trees. They work by drawing bootstrapped samples of the training data and growing a tree on each sample through recursive partitioning, so that the trees can be treated as independent and identically distributed. Classification is based on a majority vote of the aggregated trees. The beauty of this technique is that it obtains an estimate of the misclassification error from the observations left out of each bootstrap sample. It also performs random feature selection and can estimate the relative importance of the explanatory variables.

Support vector machines are powerful large-margin predictive models that can be used for classification or regression. They are a class of distance-based classifiers that attempt to use hard margins for stability in classification. They can be linear or nonlinear in form. The beauty and utility of SVM lies in kernel methods, which compute the inner products of vectors in a feature space directly from the input vectors, bypassing the explicit calculation of the feature mapping, which would be untenable. This allows the SVM to classify datasets in which the underlying boundaries between classes are not readily clear. Some examples of kernels are the Gaussian radial basis, Laplace radial basis, and hyperbolic tangent kernels. The choice of kernel offers a rich model class with which to tune the SVM.

Neural networks are extremely powerful classifiers because they can be tuned through many different parameters. They are also heavily nonlinear classification models. The activation function that defines the neural net may be the logistic function, the hyperbolic tangent, or the Heaviside step function. The choice of activation function, in conjunction with the size of the hidden layers, offers ways to tune the neural network into a more robust classifier.
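The kernel idea mentioned above can be sketched in a few lines. The example below is illustrative only: the function names and parameterizations (`sigma`, `scale`, `offset`) are assumptions, and libraries define these kernels with slightly different conventions. The quadratic kernel is added here, as an extra that the post does not mention, because its feature map is small enough to write out and compare against.

```python
import math


def dot(x, y):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_rbf(x, y, sigma=1.0):
    """Gaussian radial basis kernel (one common parameterization)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def laplace_rbf(x, y, sigma=1.0):
    """Laplace radial basis kernel, here using the Euclidean distance."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-distance / sigma)


def tanh_kernel(x, y, scale=1.0, offset=0.0):
    """Hyperbolic tangent kernel."""
    return math.tanh(scale * dot(x, y) + offset)


def quadratic_kernel(x, y):
    """Inner product in a 3-D feature space, computed from 2-D inputs."""
    return dot(x, y) ** 2


def quadratic_feature_map(x):
    """Explicit feature map that the quadratic kernel lets us skip."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x = (1.0, 2.0)
y = (3.0, -1.0)

implicit = quadratic_kernel(x, y)
explicit = dot(quadratic_feature_map(x), quadratic_feature_map(y))
print("kernel value:          ", implicit)
print("explicit inner product:", explicit)
```

Both numbers agree up to floating-point rounding, but the kernel never builds the feature vectors. For the Gaussian kernel the corresponding feature space is infinite-dimensional, so the shortcut is the only practical route.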

TDA

TDA is an emerging and exciting form of unsupervised learning. It is based on topology, a branch of mathematics that examines the notion of shape. TDA attempts to analyze highly complex data and draws on the notion that all data has a fundamental shape and that this shape has meaning.

The two methodologies used in TDA are persistent homology and the mapper algorithm. Persistent homology provides a framework and efficient algorithms to quantify the evolution of the topology of a family of nested topological spaces. Persistence diagrams are used to capture and visualize the birth and death of homological features as the scale parameter of that family grows. The mapper algorithm is a tool used to visualize the topology of the data under consideration, and it is the TDA method that will be used for this research. The inputs to the algorithm are a point cloud of data, a filter function, a covering of a metric space, a clustering algorithm, and tuning parameters.
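To give a feel for the "birth and death" language used for persistent homology above, here is a minimal sketch that tracks only connected components (dimension 0). It assumes a Vietoris-Rips style filtration in which two points become connected once the scale reaches their distance. The function name and the toy point cloud are made up for illustration, and real TDA libraries also compute higher-dimensional features such as loops.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return (birth, death) pairs for the connected components of a point cloud.

    Every point is born as its own component at scale 0. A component dies
    at the scale where it merges into another one.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j        # two components merge at this scale
            pairs.append((0.0, length))    # so one of them dies here
    pairs.append((0.0, math.inf))          # the last component never dies
    return pairs


# Two well-separated groups of points (toy data).
cloud = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1)]
diagram = h0_persistence(cloud)
deaths = sorted(death for _, death in diagram)
print(deaths)
```

For this toy cloud, three components die at scale 1, one dies at scale 9 when the two groups finally join, and one lives forever. The long-lived components are the ones that say something about the shape of the data: here, that there are two clusters.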

The output is a network graph that represents the topology of the data. The algorithm proceeds in steps. First, take the point cloud of data as input. Next, identify a filter function. Third, determine the number of overlapping bins into which the input data is mapped. Finally, cluster the points within each bin and create a network topology representation of the original dataset using nodes and edges. The nodes represent the clusters of local regions created by the binning. It is important to note that information from one node can be contained in another node as a result of the overlapping bins. The edges connect clusters to display the overall topology.

The open question is this: can TDA, used in conjunction with traditional machine learning models, improve the accuracy of their predictions compared with using those models without TDA?
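The mapper steps described above can be sketched roughly as follows. This is a simplified illustration, not a reference implementation: the function names, the single-linkage clustering with a distance threshold `eps`, and the way the overlapping bins are laid out are all assumptions made for this example.

```python
import math


def single_linkage(indices, points, eps):
    """Cluster the given point indices: points within eps of each other are linked."""
    unvisited = set(indices)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            near = {j for j in unvisited
                    if math.dist(points[current], points[j]) <= eps}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(cluster)
    return clusters


def mapper(points, filter_fn, n_bins, overlap, eps):
    """Return (nodes, edges); each node is a set of point indices."""
    values = [filter_fn(p) for p in points]
    lo, hi = min(values), max(values)
    # Bin length chosen so that n_bins overlapping bins exactly cover [lo, hi].
    length = (hi - lo) / (n_bins - (n_bins - 1) * overlap)
    step = length * (1 - overlap)

    nodes = []
    for k in range(n_bins):
        start = lo + k * step
        # Pin the last bin to hi so rounding cannot drop the largest value.
        end = hi if k == n_bins - 1 else start + length
        members = [i for i, v in enumerate(values) if start <= v <= end]
        nodes.extend(single_linkage(members, points, eps))

    # Two nodes are joined by an edge when they share at least one point.
    edges = {(a, b)
             for a in range(len(nodes))
             for b in range(a + 1, len(nodes))
             if nodes[a] & nodes[b]}
    return nodes, edges


# Toy point cloud: 24 points evenly spaced on a circle.
circle = [(math.cos(2 * math.pi * i / 24), math.sin(2 * math.pi * i / 24))
          for i in range(24)]

# Filter function: the x coordinate of each point.
nodes, edges = mapper(circle, lambda p: p[0], n_bins=3, overlap=0.25, eps=0.3)
print(len(nodes), "nodes;", sorted(edges))
```

With these settings the circle collapses to four nodes joined in a single loop: one node for the left arc, one for the right arc, and two for the top and bottom pieces of the middle bin. The loop in the graph mirrors the loop in the data, which is the kind of shape information mapper is meant to surface. The result depends heavily on the filter function, the number of bins, the overlap, and `eps`.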

See you in part 2...

Product management expert with domain expertise in Telecom cloud, cloud management systems, Docker, K8S, and IIoT