Description

Rather than characterizing specific correlations between individuals and variables, the data mining approach implemented in Mobipaleo aims at a faster, systematic and objective exploration of raw, complex and heterogeneous databases, in order to highlight meaningful “hidden patterns or associations” (Gilbert et al., 2018, p. 4) and to consider their modelling and potential predictability. More precisely, the local exploration of the databases enabled by this approach aims to identify meaningful key groups of indicators, named Frequent Closed Gradual Patterns (FCGP).

Thus, the overall objective of Mobipaleo is to automate the rapid extraction, from large databases, of frequent closed gradual patterns (Di Jorio et al., 2008), which track order correlations of the form “the more/less taxon X associated with the more/less taxon Y…”. This automatic pattern extraction relies on data-driven modelling, which makes data mining methods complementary to multivariate statistics, the latter allowing user-driven modelling of data.

Gradual pattern mining algorithms currently reported in the literature do not assume any temporal constraint on the data, yet all palaeoecological and ecological data contain temporal relationships between objects (time-scaled data). Applying data mining methods in palaeoecology therefore requires a mining process under temporal constraint. This need for a temporal dimension motivated our creation of a new, specific algorithm for the automatic extraction of co-evolutions between palaeoecological or ecological indicators.

Briefly, the initial database, in tabular form, is a set of objects (the temporal component) described by a set of attributes (multivariate indicators). The table gives the abundance or amount (different units are possible) of each attribute for each object. In this database, (Indicator-type 1=+), for instance, is a gradual item, while {Indicator-type 1=+, Indicator-type 2=+, …} is a gradual pattern, which indicates that these two types are positively correlated (in terms of covariation). The complexity of the data set is related to the order associated with the input objects. An algorithm inspired by Berzal et al. (2007) first transforms the original numerical palaeoecological database into a categorical database. More precisely, the original numerical palaeoecological database D contains a set of attributes (variables) with numerical data, I = {i1, …, in}, and an ordered set of objects (depths), T = {t1, …, tm}, where t[i] gives the value of attribute i for object t.

Our algorithm first builds, from the original numerical palaeoecological or ecological database D, a new database D’ containing a categorical attribute set I’ = {i1+, i1-, i1=, …, in+, in-, in=} such that, for each attribute i of D and for each pair of consecutive objects (tj, tj+1) of D, the attributes i+, i- and i= encode whether the value of i increases, decreases or remains equal from tj to tj+1.

The previous procedure describes the first step of our gradual pattern mining algorithm, namely how to obtain a categorical database from the initial numerical database.
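As an illustration of this first step, the following minimal Python sketch turns an ordered numerical table into categorical transactions. It is not the Mobipaleo implementation: the function name `to_gradual_transactions`, the attribute names `taxon_X` and `taxon_Y`, and the sample values are hypothetical, and the sketch assumes that an attribute is labelled "+", "-" or "=" according to whether its value increases, decreases or stays equal between two consecutive objects.

```python
from typing import Dict, FrozenSet, List


def to_gradual_transactions(rows: List[Dict[str, float]]) -> List[FrozenSet[str]]:
    """Build one categorical transaction per pair of consecutive objects.

    `rows` must already be ordered (for example by depth or time).
    Assumption: the label is chosen by strict comparison of the two values.
    """
    transactions = []
    for previous, current in zip(rows, rows[1:]):
        items = set()
        for attribute in previous:
            if current[attribute] > previous[attribute]:
                items.add(f"{attribute}+")   # value increases
            elif current[attribute] < previous[attribute]:
                items.add(f"{attribute}-")   # value decreases
            else:
                items.add(f"{attribute}=")   # value unchanged
        transactions.append(frozenset(items))
    return transactions


# Hypothetical ordered database D: four objects, two attributes.
rows = [
    {"taxon_X": 10, "taxon_Y": 5},
    {"taxon_X": 12, "taxon_Y": 7},
    {"taxon_X": 12, "taxon_Y": 6},
    {"taxon_X": 15, "taxon_Y": 9},
]
transactions = to_gradual_transactions(rows)
for transaction in transactions:
    print(sorted(transaction))
```

A database of m objects yields m - 1 transactions, because each transaction describes a transition between two consecutive objects rather than an object itself.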

In the second step of our algorithm, we apply a modified version of the APRIORI algorithm (Agrawal & Srikant, 1994), a seminal algorithm for mining frequent itemsets, to the categorical database D’ obtained in the previous step. This modified version extracts from D’ the frequent closed itemsets, which correspond to the frequent closed gradual patterns (FCGP) of the original numerical database D. The resulting gradual patterns are finally post-processed according to the user’s scientific issues and research objectives, in order to identify the patterns relevant to the research questions.
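To make the notion of frequent closed itemsets concrete, here is a small Python sketch of a plain level-wise (Apriori-style) search followed by a closedness filter. It is a textbook illustration only, not the modified APRIORI used in Mobipaleo; the function names, the `taxon_X`/`taxon_Y` items, the sample transactions and the 50% threshold are all hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions: List[FrozenSet[str]],
                      min_support: float) -> Dict[FrozenSet[str], float]:
    """Level-wise search: frequent k-itemsets are joined into (k+1)-candidates."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent: Dict[FrozenSet[str], float] = {}
    while candidates:
        # Keep the candidates whose support reaches the threshold.
        level = {}
        for candidate in candidates:
            s = support(candidate, transactions)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        keys = list(level)
        k = len(keys[0]) + 1 if keys else 0
        joined = {a | b for a, b in combinations(keys, 2) if len(a | b) == k}
        # Apriori pruning: every k-1 subset of a candidate must be frequent.
        candidates = [c for c in joined
                      if all(frozenset(sub) in level for sub in combinations(c, k - 1))]
    return frequent


def closed_itemsets(frequent: Dict[FrozenSet[str], float]) -> Dict[FrozenSet[str], float]:
    """Keep itemsets that have no proper superset with the same support."""
    return {p: s for p, s in frequent.items()
            if not any(p < q and s == sq for q, sq in frequent.items())}


# Hypothetical categorical database D': one transaction per transition.
transactions = [
    frozenset({"taxon_X+", "taxon_Y+"}),
    frozenset({"taxon_X=", "taxon_Y-"}),
    frozenset({"taxon_X+", "taxon_Y+"}),
    frozenset({"taxon_X+", "taxon_Y-"}),
]
closed = closed_itemsets(frequent_itemsets(transactions, min_support=0.5))
for pattern, s in closed.items():
    print(sorted(pattern), s)
```

In this toy example the item taxon_Y+ is frequent but not closed, because it never occurs without taxon_X+; the pattern {taxon_X+, taxon_Y+} carries the same support and therefore represents it without loss of information.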

FCGP are the most concise representation of patterns without any loss of information (Pasquier et al., 1999). Accordingly, the positively correlated FCGP with a support of at least 10% (a low threshold) are retained. The support measures how often a FCGP recurs in the database, and a low support threshold ensures no loss of information. The retained FCGP correspond to the most significant and repeated co-evolutions of bioindicators.
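The retention rule can be sketched as a simple filter in Python. This is only one possible reading, under stated assumptions: the function name `retain_patterns` is hypothetical, the supports are made-up values, and "positively correlated" is interpreted here as a pattern of at least two items that all carry the "+" direction.

```python
from typing import Dict, FrozenSet


def retain_patterns(patterns: Dict[FrozenSet[str], float],
                    min_support: float = 0.10) -> Dict[FrozenSet[str], float]:
    """Keep patterns reaching `min_support` whose items all increase together.

    Assumption: each item ends with its direction symbol ("+", "-" or "=").
    """
    kept = {}
    for pattern, supp in patterns.items():
        directions = {item[-1] for item in pattern}
        if supp >= min_support and len(pattern) >= 2 and directions == {"+"}:
            kept[pattern] = supp
    return kept


# Hypothetical closed patterns with made-up supports.
patterns = {
    frozenset({"taxon_X+", "taxon_Y+"}): 0.32,  # kept
    frozenset({"taxon_X+", "taxon_Y-"}): 0.21,  # opposite directions
    frozenset({"taxon_X+", "taxon_Z+"}): 0.04,  # below the 10% threshold
}
kept = retain_patterns(patterns)
for pattern, supp in kept.items():
    print(sorted(pattern), supp)
```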

Extract from “Supplementary Materials” (Miras et al., 2022).

For more details, please see Lonlac et al. (2018) and Miras et al. (2022). Full references are given in the Publications tab.

References