Symptom Proximity in Diagnostic Problem

An algorithm SP (= Symptom Proximity) is suggested for solving the discrete diagnostic problem. It is based on a probabilistic approach to decision-making under uncertainty; however, it does not use knowledge integration from marginal distributions.


1 Informal analysis
The ability to make decisions belongs to the most important mechanisms built into complex organisms. Optimal strategies for decision-making determine the success or even survival of animal species (Darwin's theory), trading companies or individual patients.
As mentioned e.g. in [5], decision-making can take place on a purely intuitive basis (i.e. an individual decision is provided by a genetically inborn mechanism, or it can be given by experience acquired during the life span of the decision-maker and stored in the neural structures of his brain). Or, decision-making can take place on the basis of a more considered strategy and within the framework of a formalized theory.
When a large number of facts relating to a situation is collected, there appears a point, in a certain phase, where the facts are "abstracted" into a piece of knowledge. (The process is referred to by the Hegelian dialectic concept of "quantitative change leads to qualitative change".) Knowledge is usually expressed in the form of sets of implications. Then, in the context of decision-making, observed facts (called evidences) about an individual object are used as antecedents and the required decision is, hopefully, provided via the laws of logic from the succedents (of the implications). However, even this approach implicitly involves a certain degree of uncertainty (e.g. when to switch from facts to a law/knowledge).
The tolerable precision is domain dependent. In physics, a theory must explain data with a precision of six decimal places. In soft sciences, models are less demanding. Therefore, one can generalize that decision-making cannot be separated from uncertainty. Partially, this is due to the mentioned "ever present lack of data" and partially, there exists an "uncertainty" about which of the formal theories of uncertainty to use as a model. Even in probability, usually taken as a normative theory (for uncertainty), there exist alternatives (see e.g. [4]).
One of the most fruitful lines in probabilistic decision-making (see subsection 2.1) can be divided into four phases: First, select lower-dimensional distributions, considered as marginals of a theoretical joint distribution, as input knowledge.
Second, construct (integrate) an explicit formula for the joint distribution. This can be done via strong assumptions, like conditional independence between marginals, expressed as graph models.
Third, given evidence about an object (i.e. measured on the object), the formula is reduced (marginalized) to a simpler form to which the evidence can be directly applied.
Fourth, the "best" decision is the one that yields the largest conditional probability on the diagnosis variable (i.e. the variable containing the decisions) given the observed evidence.
The topic of this paper is a method/algorithm SP that does not use the marginals (or assumptions about their conditional independence) and finds an approximation of the largest conditional probability directly from the data (called the statistical file in the sequel and defined in subsection 2.3).
Let us suppose that both models (i.e. the "marginal" one and SP) are finite and discrete. (Explicitly, we do not consider e.g. the family of log-linear distributions, and no estimation of their parameters from the statistical file takes place as is usual in standard statistics.)
There is a discrete number of symptom variables, a discrete number of symptoms in the range of each symptom variable, a discrete number of diagnoses/decisions and a finite number of objects in the statistical file (representing knowledge).
It should be stressed that if one uses the term "measured", it should be interpreted as "measured and discretized". E.g. biochemical values, though continuous, are reduced to dichotomies (greater or less than a threshold) or trichotomies ("standard value", "too low", "too high").
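Such a threshold-based discretization can be sketched as follows (the function and the reference range below are made-up illustrations, not taken from the paper):

```python
def trichotomize(value, low, high):
    """Map a continuous measurement to one of three discrete symptoms."""
    if value < low:
        return "too low"
    if value > high:
        return "too high"
    return "standard value"

# e.g. a laboratory value with a hypothetical reference range 3.9-5.6
print(trichotomize(6.2, 3.9, 5.6))  # -> too high
print(trichotomize(4.5, 3.9, 5.6))  # -> standard value
```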
This finiteness and discretization need not be a limiting factor, as in many decision-making situations there is a small number of alternatives as well. In the most frequent situations, the decision-making takes place on the basis of not too many symptom variables.
In the "marginal approach", the joint probability formula is integrated once and for all from the marginals (elicited, in their turn, from a "learning" statistical file). Then, this unique formula is used for each evidence available for the object/patient/case. Whereas "marginal models" construct a unique approximative distribution of a unique theoretical joint distribution, the SP model uses for decision-making an ad hoc approximation of a conditional probability (derived from the unique theoretical joint distribution). (This link is established by the supposition that the statistical file was generated (by nature?) according to a theoretical joint distribution.) This ad hoc approximation of the conditional probability is generated freshly from scratch for each set of symptom variables (carrying the evidence measured on the object/patient). (In other words, the term ad hoc relates to a disclosed set of symptom variables.) The aim of SP is not to "predict" the shape of the a posteriori probabilities of all considered diagnoses, but only to find the most probable diagnosis/decision; even the numerical value of its a posteriori probability is not important. In other words, what matters is just the first place in this ranking.
What is new? In the eighties, it was believed that whereas the data (i.e. the statistical file) is not sufficient for estimating the joint distribution, it should suffice for lower-dimensional tables (i.e. the supposed marginals). Moreover, these marginals were considered as given externally and as reliable as "incorporated truth". Besides being given from outside, the marginals were just few. One might use a marginal for integration or leave it out, but not ask for new ones! The argument that marginals cannot be obtained otherwise than from the data did not receive sufficient attention. If it had been taken into consideration, one could have asked for more marginals, but, at the same time, it would have raised questions like "How many marginals should we ask for?" and "What would be the best composition of those marginals?".
The central notion was the approximation of the joint distribution, whereas it is the approximation of the conditional probability that SP takes as its "flagship". At the same time, the role of the disclosed symptom variables carrying evidence is stressed for SP as well. Then the detour of "integration to the joint distribution" and subsequent marginalization is not needed any more.
A specific methodology based on proximity (of the vector of evidences and the respective vectors constructed from cases in the statistical file) in the space of symptom variables is used in the SP algorithm.
To explain the theory of the SP algorithm, a simple metric is introduced in section 3.
As details matter and redundancy is preferable to ambiguity, the essence of the algorithm SP is described twice in section 4: via a flowchart (see Figure 1) and via code written in a symbolic language. The latter description makes it possible to derive the computational complexity of SP in section 5.
However, the real strength of SP lies in the fact that this approach is fast enough to operate on large statistical files with a large number of symptom variables.
As far as decision quality is concerned, the situation is not so transparent. On the one hand, it is possible to construct examples where SP dominates the "marginal approach". This may happen when the input marginals are "at a wrong position" with respect to the disclosed symptom variables carrying the evidences. However, in general, the dominance of SP cannot be directly proved and one can only compare SP with one of many possible "marginal" algorithms, with different sets of marginals and for different evidence-carrying symptom variables. And it has to be done in many combinations. This "branching expansion" is multiplied by two when we consider two testing techniques. First, the "testing" and "learning" statistical files are identical. Second, the "leave one out" technique, described in section 6, is used.
The final impression (from this multicriterial decision-making) is that SP does very well. This is based on experience from many comparative runs. If it is not dominant, SP belongs to the Pareto optimum at least.
It should be noted that SP does not use the empirical distribution derivable from the statistical file. (The "mass" of the empirical distribution is concentrated only in the atoms of the defining set algebra that appear in cases from the statistical file and is zero anywhere else.)
One may ask why this line of research (i.e. "without marginals") was not followed as a mainstream before.
In my opinion, during the last decades, the external material conditions (computerization and the existence of large databases) have changed a lot. In the meantime, the original paradigm was refined and improved by many researchers. The model became so rich, successful and invested in that inertia prevented an investigation of whether the basic suppositions could be weakened.
2 Specific statement of the problem

2.1 Historical background
The first attempts at machine-assisted decision-making under uncertainty are marked by the rule-based expert systems Mycin and Prospector in the early eighties. The weights in the rules were interpreted as conditional probabilities, but the way the rules were combined was not a probabilistic one. The same held for systems with the fuzzy number approach. At that time, Albert Perez (in [11]) raised the requirement that partial knowledge should be "integrated" intensionally, i.e. using the concept of a theoretical joint distribution P. Knowledge was understood as probability or conditional probability, elicited either from experts or observed from experiments. The best way to keep it, at least partially, complete and homogeneous was to assume that it comes in the form of lower-dimensional distributions that were supposed to be marginals of the theoretical joint P. Thanks to their smaller sizes, the marginals could be estimated from the available data. The main effort in the subsequent research was concentrated on the way how to assemble effective approximations of the joint P. The formulation of the task was known as the marginal problem already in [6] and a specific solution was suggested even before in [2]. Different models, connected with names like Lauritzen, Spiegelhalter, Dempster, Shafer, Pearl, Dawid, were studied with assumptions about the conditional independence of the variables appearing in P that helped to integrate the marginals. At present, there exist professional software packages (e.g. Hugin) supporting decision-making on a commercial basis. As, besides different algorithms, even the selection of proper marginals may be a problem of its own (see e.g. [9] and [10]), this paper tries to study an alternative to the marginal approach.

2.2 The layout of the paper
1. An informal (i.e. without notation) analysis of the diagnostic problem in a probabilistic setting is given in section 1.
2. There are historical reminiscences explaining the position of the suggested method in a broader context in subsection 2.1.
3. Basic notions are defined including the formulation of the diagnostic problem and describing the role of the statistical file F in subsection 2.3.
4. The essential features of the algorithm SP are laid down in section 3.
5. Realisation of SP is described in section 4 via a flowchart (in subsection 4.1) and via symbolic programming language (in subsection 4.2).
6. On the basis of the latter description, the computational complexity C_SP of SP in terms of the "length" l = |F| of the file F and of its "width" n = |{ξ_1, ξ_2, ..., ξ_n}| is estimated and verified experimentally for different values of l and n in section 5.
7. The "discernment power" of SP (i.e. the absolute values or percentage of wrong classifications) is tested for different "apertures" (sets of symptom variables whose values are disclosed to SP as evidence). Testing is performed both via the method "leave one out" and on all data in simulation runs. The results are compared with a simple marginal-based algorithm under the same testing conditions in section 6.
8. In section 7, the features of SP are sorted as "pros" and "cons". Then, two open questions are suggested for further investigation and, finally, there is a summarizing conclusion.
Though the topic is defined in a formal way, the names of the objects in the universe of discourse (e.g. diagnosis, symptoms etc.) are taken from the field of medicine to give them a semantic interpretation and to ease the understanding of the basic notions and the character of their interaction.
The mutual behaviour of all random variables η, ξ_1, ξ_2, ..., ξ_n is described by a theoretical joint probability distribution P_{ηξ_1ξ_2...ξ_n}. Decision-making under uncertainty with a probabilistic background can be interpreted as the diagnostic problem with the following formulation:
Diagnostic problem: Find the diagnosis d(s_1, s_2, ..., s_n) ∈ η that is the most probable (according to P_{ηξ_1ξ_2...ξ_n}) given the observed, arbitrary combination (s_1, s_2, ..., s_n) of values of the symptom variables from the cartesian product Ξ = ξ_1 × ξ_2 × ... × ξ_n. If we wish to predict the values of the diagnostic variable η, the conditional probability P_{η|ξ_1ξ_2...ξ_n} (derivable from P_{ηξ_1ξ_2...ξ_n}) should be used instead of P_{ηξ_1ξ_2...ξ_n}.
Optimal decision: The value of the diagnosis d from η that should be selected, if the values of the symptom variables are (s_1, s_2, ..., s_n), to keep the number of wrong classifications as low as possible, called the Bayes solution, is for each (s_1, s_2, ..., s_n) ∈ ξ_1 × ξ_2 × ... × ξ_n given by the formula

d_opt(s_1, s_2, ..., s_n) = argmax_{d ∈ η} P_{η|ξ_1ξ_2...ξ_n}(d | s_1, s_2, ..., s_n)   (1)

So far the theory. Unfortunately, in the "real world", we are never given the theoretical distribution P_{ηξ_1ξ_2...ξ_n} in full and directly. To compensate for this, we expect to have some indirect information about P_{ηξ_1ξ_2...ξ_n} that will be called the knowledge base and denoted by K. It is given by postulating a set of conditions that we believe the theoretical P_{ηξ_1ξ_2...ξ_n} fulfills.
Marginal problem: Using the concept of the marginal problem, see [6], the knowledge base K is given as a set of "low-dimensional" distributions (i.e. the number of variables in a distribution does not exceed e.g. 10), postulated as theoretical marginal distributions of P_{ηξ_1ξ_2...ξ_n}. Besides the marginals, assumptions are usually made about conditional independence holding between groups of random variables. It is interesting that the topic was so attractive that it was addressed in several waves, usually 20 years apart. Original and interesting ideas were not just the product of the last two decades but go back much deeper; see e.g. [2], [6], [3], [1]. Instead of the unknown P_{ηξ_1ξ_2...ξ_n}, we try to construct (from the marginals) a suitable approximation P̂_{ηξ_1ξ_2...ξ_n} that could play its role in formula (1).
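Formula (1) can be illustrated on a toy joint distribution (a hypothetical two-symptom example, not the paper's data): conditioning reduces to restricting the joint table to the observed evidence and taking the argmax, since the normalizing constant P(evidence) does not affect the ranking.

```python
from collections import defaultdict

# A hypothetical joint distribution P(eta, xi1, xi2) over a diagnosis
# variable eta and two binary symptom variables, given as a table.
joint = {
    # (diagnosis, s1, s2): probability
    ("flu",  1, 1): 0.20, ("flu",  1, 0): 0.10,
    ("flu",  0, 1): 0.05, ("flu",  0, 0): 0.05,
    ("cold", 1, 1): 0.05, ("cold", 1, 0): 0.15,
    ("cold", 0, 1): 0.10, ("cold", 0, 0): 0.30,
}

def d_opt(evidence):
    """Bayes decision (formula (1)): argmax_d P(d | evidence).
    Division by P(evidence) is omitted as it does not change the argmax."""
    score = defaultdict(float)
    for (d, s1, s2), p in joint.items():
        if (s1, s2) == evidence:
            score[d] += p
    return max(score, key=score.get)

print(d_opt((1, 1)))  # -> flu  (0.20 vs 0.05)
print(d_opt((0, 0)))  # -> cold (0.30 vs 0.05)
```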
If the existence of marginals is postulated, it is natural to ask where they come from. Therefore, another notion should be specified.
There exists a tacit assumption that decision-making about a concrete case (patient) should be very fast (about 1 sec per person). On the other hand, a longer time (e.g. hours of CPU time) devoted to selecting and populating the marginals (in the learning phase) is tolerable. This may be one of the reasons why the "marginal approach" is the standard one.
However, using marginals for "integrating" P̂_{ηξ_1ξ_2...ξ_n} and its subsequent conditioning need not be mandatory for solving the diagnostic problem.

3 Basic idea of SP algorithm
An algorithm, called SP (= Symptom Proximity), tries to construct the necessary conditional probabilities directly from the available statistical data file F. The basic idea of SP can be explained by the assumption "Patients with similar symptoms should have a similar diagnosis". Hence the name of the algorithm: SP interprets the similarity as a proximity in the sense of a very natural metric.
Proximity metric ρ: for u, v ∈ Ξ,

ρ(u, v) = n − Σ_{i=1..n} δ((u)_i, (v)_i),

where δ(·,·) is the Kronecker function and (u)_i is the i-th component of the sequence u. The mapping ρ is a metric (i.e. it fulfills reflexivity, symmetry and the triangular inequality) on Ξ that can be used for defining equivalence classes on Ξ. For each fixed v ∈ Ξ, there exist n + 1 sets C_0(v), C_1(v), ..., C_n(v), where C_k(v) = {u ∈ Ξ : ρ(u, v) = k}. The next step is to estimate P(C_k(v)). It can be done, in a natural way, using the available data (i.e. the statistical file F):

P(C_k(v)) ≈ |{j : ((F)_j)_Ξ ∈ C_k(v)}| / |F|,

where ((F)_j)_Ξ is that part of the j-th vector (F)_j that corresponds to the symptom variables, i.e. ((F)_j)_Ξ ∈ Ξ. We are interested in the set C_k(v) with the smallest k such that, at the same time, P(C_k(v)) > 0. Let us denote this optimal k as k_0. Finally, the conditional probability P_{η|C_{k_0}(v)}(d|v) of η on C_{k_0}(v) can be defined.
To shorten the final expression for P_{η|C_{k_0}(v)}(d|v), an auxiliary variable D(F, v, k_0, d) will be introduced. It denotes the number of cases (patients) from the statistical file F that "belong" to the equivalence class C_{k_0}(v) and at the same time have the diagnosis d. (According to the above notation, ((F)_j)_η stands for the value of the diagnosis in the j-th vector of the statistical file F, i.e. ((F)_j)_η ∈ η.) Then P_{η|C_{k_0}(v)}(d|v) is defined by

P_{η|C_{k_0}(v)}(d|v) = D(F, v, k_0, d) / Σ_{d' ∈ η} D(F, v, k_0, d').

If v = (s_1, s_2, ..., s_n) ∈ Ξ, we may approximate the conditional probability P_{η|ξ_1ξ_2...ξ_n}(d | s_1, s_2, ..., s_n) appearing in formula (1) by P_{η|C_{k_0}(v)}(d|v), so that formula (1) can be applied as the decision rule of the SP algorithm. The algorithm SP is presented in a symbolic programming language in subsection 4.2. The complexity of the algorithm SP will be expressed, in section 5, as a function of the size |F| of the data file F and of the number n of symptom variables. The complexity is verified on real data by measuring the time required for making a decision for one person.
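The whole chain (metric ρ, smallest nonempty class C_{k_0}(v), most frequent diagnosis in it) can be sketched in Python; the records and diagnoses below are invented for illustration:

```python
from collections import Counter

def rho(u, v):
    """Proximity metric rho: the number of symptom positions in which
    u and v differ, i.e. n minus the number of coincidences."""
    return sum(ui != vi for ui, vi in zip(u, v))

def sp_decision(F, v):
    """Find the smallest k0 with a nonempty class C_k0(v) in F and
    return the most frequent diagnosis among its cases.
    Each record of F is a pair (diagnosis, symptom_vector)."""
    k0 = min(rho(symptoms, v) for _, symptoms in F)
    votes = Counter(d for d, symptoms in F if rho(symptoms, v) == k0)
    return votes.most_common(1)[0][0]

F = [("flu", (1, 1, 0)), ("flu", (1, 1, 1)), ("cold", (0, 0, 1))]
print(sp_decision(F, (1, 1, 1)))  # exact match (k0 = 0) -> flu
print(sp_decision(F, (0, 0, 0)))  # nearest case at k0 = 1 -> cold
```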
The decision quality (or discernment power) is dealt with in section 6. In principle, it is the number of wrong classifications that is measured. However, it may be defined more formally: let L ⊂ F and let SP(L, v) ∈ η denote the decision of SP when the evidence (about a patient) is v ∈ Ξ and the algorithm SP has the "learning" file L at its disposal. Then the "discernment power" of SP can be measured by the percentage of wrong classifications, either as 100 · |{t : SP(F, ((F)_t)_Ξ) ≠ ((F)_t)_η}| / |F| (learning and testing on the same file), or with the tested case t left out of the learning file, i.e. 100 · |{t : SP(F \ {t}, ((F)_t)_Ξ) ≠ ((F)_t)_η}| / |F|. This second approach is referred to as the "Leave one out" technique. (There are more details on the technique in section 6.) The results will be compared with one simple algorithm using the "marginal approach" in section 6.

4 Detailed description of SP
The theory of SP is given in section 3 and a realization of SP is given in this section 4.

As mentioned above in section 2, the SP algorithm is described twice: via a flowchart (see Figure 1) and via code written in a symbolic language.
Strictly speaking, the realization of an algorithm is an executable file (i.e. with the extension .exe) that is "understood" e.g. by any PC. Less strictly, the realization of an algorithm is a syntactically and semantically correct text file "written" by a human programmer in a general-purpose programming language (like Fortran, C, C++, Visual Basic) that is "understood" by a compiler of the respective language. (Such a text file has an extension like .for, .c, .vbs.) The next level is a symbolic language. It is an abstraction of the general-purpose languages. It is "written" by a man and can be "understood" by another man to the extent that he/she can rewrite it into a correct text in a general-purpose language.
The last common representation of an algorithm is its flowchart. It is a "text" written by a programmer in a graphical language form. It has to be "understood" by another man. It has conventional structures like action blocks (rectangles), decision blocks (rhombus or trapezoid) and lines whose arrows determine the "flow" of the program. Some symbols are overloaded: e.g. the symbol "=" is interpreted as an assignment in action blocks whereas it is a relational operator in decision blocks. To summarize, there is always a potential place left for ambiguity and misunderstanding. Therefore, both descriptions of the realization are used in parallel and they are supplemented by informal comments that explain the semantics of the used structures (in the context of the theoretical foundations of the algorithm).
There is a slight functional difference between the flowchart and the symbolic-language descriptions of the SP algorithm.
The flowchart describes entering one statistical file L (from the problem area) and entering a sequence of several new evidences v for which the optimal diagnoses are calculated.
In the symbolic language, there is the body of a function SP where only one given statistical file L and only one evidence v are entered as parameters.
Then the corresponding diagnosis d_opt(v) is calculated and returned when the function SP ends. The difference is only a formal one and the essence of the SP algorithm is the same.

4.1 Flowchart of SP
The flowchart represents the activity of the algorithm in a simplified form. It is a link between the theory and the realization; therefore, symbols from both "worlds" are used. E.g. the first action block describes the filling of the array L with values from the statistical file denoted as L, to stress that it is used for "learning". The names of the symptoms, usually in the form of text strings, are densely coded into integers. The denotation L[0-n, 1-|L|] says that the two-dimensional array L has n + 1 positions in the first dimension (position "0" is used for storing the diagnoses, positions "1" to "n" for storing the values of the symptom variables). The second dimension is for indexing the individual records from L. The variable maxcount corresponds to k_0 via the definition k_0 = n − maxcount. There are three loops in the flowchart which are realized as for-cycles in the symbolic language. Filling the two-dimensional array LD with zeroes takes place any time a new evidence v is entered. In the end, after running through the whole matrix L, the array element LD(i,j) contains the number D(L, v, n − i, d_j). It is easy to see that the elements of the matrix LD are integers and that their sum is |L|. It would be possible to calculate the approximation of the conditional probability P_{η|C_{k_0}(v)}(d|v) of η on C_{k_0}(v) by normalizing the row LD(maxcount,*) of the array LD, i.e. by dividing the elements in the row LD(maxcount,*) by their sum. However, as we are interested only in the ranking, this normalization is not necessary and we look for d_opt only by looking for the maximal value in the row LD(maxcount,*) of the matrix LD. That way, we save computational time by skipping unnecessary actions.
The semantics of LD (level distance) can be seen from the fact that the row LD(i,*) contains the number of all records/vectors from L that have exactly i symptom variables in common with the evidence v. In other words, they belong to the equivalence class C_{n−i}(v).

4.2 Description of SP in a symbolic language
The SP algorithm can be used in different roles. It may serve for a simple "one-shot" decision-making, for repeated decision-making with different apertures (sets of symptom variables whose values are disclosed to SP as evidence), for use in a general testing scheme, or it may be adapted to specific testing via the "Leave one out" technique.
Instead of using one highly parametrized form of the SP algorithm, it seems better, for didactical reasons, to use several stand-alone modifications. However, only the simplest version, under the name function SP, will be presented in this paper. Specific modifications built on its basis (and entitled SPL and SPA) will be mentioned in other sections. The following symbolic description is kept as simple as possible.
First, though the variables have specific denotations reflecting their semantics, they are coded as integers or arrays of integers to make SP faster.
Second, tests and the resulting exceptions in inconsistent situations, such as |L| = 0 or |η| = 0, are omitted!
Third, this version of the SP algorithm is defined for all n available symptom variables. However, it can easily be modified if the values of not all symptom variables are known, but only some of them are provided as evidences (i.e. in the case of a smaller aperture).
The function SP returns the value d_opt(v) for each given v ∈ Ξ.
Comments to the code of SP:
l.1 expresses that SP is a function SP : Ξ −→ η, i.e. it accepts the vector v as its argument and returns the optimal diagnosis d_opt(v).
l.2 the learning file L is stored in the array L. The value "0" in the first dimension is reserved for the values of η.
l.3 - l.7 set zero values in the array LD (level distance) where the metric will be accumulated in the sequel.
l.9 - l.21 for each l ∈ L, the number of symptom variables with coinciding values (symptoms) is calculated in the variable count. Increasing LD(count, η(l)) by one increases the chance of the diagnosis η(l) to become the optimal d_opt if the decision takes place at the level count.
l.16 - l.18 store in maxcount the maximal number of coincidences achieved up to now.
l.23 - l.30 find in LD(maxcount,j) such a diagnosis d_j that, on the level maxcount, defines the winning diagnosis d_opt.
Naturally, if the number of cases from L is small (which would raise objections from the statistical point of view), it is possible to perform the search for the optimal diagnosis on a level count smaller than maxcount that contains more objects than the level maxcount. It is even possible to sum LD(ct, j) for ct = count to maxcount in an array D(1-|η|) and search for d_opt in this array. (However, this modification of SP is not available in the presented version.) The link of the code with the previous formal description may be made clearer if we realize that the variable maxcount is related to k_0.
However, their roles are reversed in the sense that maxcount is the number of coincidences (of symptoms from u and v) and should be as close to n as possible, whereas k_0 is the proximity (in the sense of the metric ρ) and should tend to zero. Further, v(j) is (v)_j and the array element LD(maxcount, j) stores the value D(L, v, k_0, d_j).
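As the symbolic-language listing itself is not reproduced in this text, the LD/maxcount mechanism described above can be sketched in Python (a hypothetical transcription, not the author's original code; the records and diagnoses are invented):

```python
def SP(L, v, diagnoses):
    """Sketch of the function SP following the LD/maxcount description.
    L is a list of records (diagnosis, symptom_vector), v the evidence,
    diagnoses the range of the diagnosis variable eta."""
    n = len(v)
    # LD[i][j]: number of records sharing exactly i symptom values with v
    # and carrying diagnosis j; such records lie in the class C_{n-i}(v).
    LD = [[0] * len(diagnoses) for _ in range(n + 1)]
    maxcount = 0
    for d, symptoms in L:
        count = sum(1 for i in range(n) if symptoms[i] == v[i])
        LD[count][diagnoses.index(d)] += 1
        if count > maxcount:
            maxcount = count          # k0 = n - maxcount
    # Normalizing the row LD[maxcount] would give the conditional
    # probabilities; for the ranking alone the raw counts suffice.
    row = LD[maxcount]
    return diagnoses[row.index(max(row))]

L = [("flu", (1, 1, 0)), ("flu", (1, 1, 1)), ("cold", (0, 0, 1))]
print(SP(L, (1, 1, 1), ["flu", "cold"]))  # maxcount = 3, k0 = 0 -> flu
```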

5 Computational complexity
It should be mentioned that the experimenting with the algorithms was performed on a statistical file F, from the field of rheumatology, with 1089 patients. The diagnosis variable η contained 4 diagnoses and there were 34 symptom variables whose ranges had cardinalities from 2 to 9. That way, no generation of artificial examples was necessary. Nevertheless, this choice has no influence on the substance of SP. The complexity C_SP of the SP algorithm can be measured with respect to the number n of symptom variables, the number |L| of objects in the learning file L and the number |η| of diagnoses, i.e. C_SP = C(n, |L|, |η|). Due to the simple structure of SP, C_SP can be estimated directly:
l.2: c_1 · n · |L|
l.3 - l.7: c_2 · n · |η|
l.9 - l.15: c_3 · |L| · (n + c_4)
l.23 - l.29: c_5 · |η|
The assumption of linearity (c_1, c_2, ...) is a bit of a simplification and valid only for small ranges of n, |L|, |η|. If the ranges are greater, then effects like "paging" of memory, the way files are stored in a concrete file system (e.g. FAT 32 or NTFS) and the variables used for storing the "coding" numbers may come into play. E.g. the values of η are stored in variables of type integer*1 and therefore should not exceed 256.
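Summing these per-line contributions gives an overall estimate (a sketch, assuming the last contribution, over the |η| diagnoses, is linear in |η|):

```latex
C_{SP}(n, |L|, |\eta|) \;\approx\; c_1\, n\,|L| \;+\; c_2\, n\,|\eta| \;+\; c_3\,|L|\,(n + c_4) \;+\; c_5\,|\eta|
\;=\; O\!\big(n\,(|L| + |\eta|)\big)
```

so one decision is linear both in the length |L| and in the width n of the learning file.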
Therefore, instead of looking for explicit values of c_1, ..., c_5, direct measurements are documented in Table 1, where the length |L| of L varies from 1000 to 20000, and in Table 2, where the width n of L varies from 35 to 300. The corresponding files L (e.g. L(70, 1089) or L(35, 20000)) were generated from the original L (i.e. L(35, 1089)) by repeating the respective rows and columns. In Tables 1 and 2, the column T_read contains the time necessary to read L. The column T_total increases with the square of |L| as it is the time necessary for |L| decisions. When T_total is divided by |L|, the times in the column T_decision are always below 1 sec and therefore completely acceptable. Based both on the analysis and on the direct measurements, the complexity of SP is not a problem. Therefore, the limiting factor for a better discernment power of SP is an externality, i.e. experts should provide bigger data in the form of a file F (or L).

6 Simulation results
This section reflects the decision-making quality of SP via simulation results. The simulation is performed on the data mentioned at the beginning of section 5. Though the following example is very simple, it may reveal interesting facts when comparing SP with the decision-making algorithm A4 (from [7], mentioned also in [8]) that can serve as a simple representative of marginal-based algorithms. Let the knowledge base KB consist of 3 marginals, i.e. KB = {m_1, m_2, m_3}.
The marginals can be defined by their "generating" symptom variables (besides the implicitly supposed diagnosis variable η). Underscoring of the symbol for a marginal denotes the set of its "generating" symptom variables. E.g., if m_1 = {ξ_25, ξ_33}, then m_1 = P_{ηξ_25ξ_33}. The KB used for testing was defined via the generating sets of its marginals. The "testing environment" provides an easy way to manipulate the "inputs" to the decision-making algorithm. First, it makes it possible to remove marginals from KB and, second, not all symptom variables, of the n possible ones, need to be revealed as "evidences" to the decision-making algorithm SP or A4. We tested SP and A4 alongside each other for 8 different situations s_1, s_2, ..., s_8 described in Table 3. E.g., the expression {m_1, m_3} ∩ {ξ_1 - ξ_33} (in the column "marginals ∩ variables" of Table 3) stands for the situation s_2 where KB consists only of the marginals {m_1, m_3} and the values of all 33 symptom variables {ξ_1 - ξ_33} are submitted as "evidences" to SP and A4. (Naturally, {m_1, m_3} has an impact only on A4, whereas {ξ_1 - ξ_33} influences both SP and A4.) The column "active variables" in Table 3 contains the symptom variables whose values have an influence on A4 as a result of both conditions. The column "active space" is the product of their ranges. As all symptom variables here are dichotomical, the values are like 4, 16, 32.
Even the above-mentioned denotation for the individual marginals is a little simplified. E.g. m_1 = P_{ηξ_25ξ_32} is not enough, as it should also be mentioned what data was used for populating the marginal m_1. This can be expressed by adding the source: e.g. m_1 = P_{ηξ_25ξ_32}(L) stands for the marginal filled from the data set L. This denotation would do for the column A4A, but not for calculating the values in the column A4L. There, in fact, there are 1089 different marginals m_1(L\t) = P_{ηξ_25ξ_32}(L\t). Those marginals are populated from 1089 different data files L\t that have to be created just for that purpose.
The "A" (in the column A4A containing the number of wrong classifications) stresses that all data was used both for learning and for testing, i.e. L = T = F. The "L" (in the column A4L) means that the method "Leave one out" was used for the calculation of the "discernment power" of the algorithm A4.

4. Problems associated with the selection of marginals are avoided (by definition!) and only symptom variables are necessary. In general, the values of all symptom variables (present in the learning file L) should be provided as evidences, if available.

5. Testing via the "Leave one out" technique is extremely easy with a small modification of the presented code of SP. It takes approximately the same time as testing on all the data (i.e. when L = T). Marginal-based algorithms require a lot of time for "Leave one out": splitting the data file (|F| times!) and filling the marginals for each split.
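The "Leave one out" loop is indeed only a small wrapper around the decision function; a sketch (with a simple nearest-neighbour stand-in for SP and a toy file, both invented for illustration):

```python
def sp_nn(L, v):
    """A stand-in decision rule in the spirit of SP (invented for this
    sketch): return the diagnosis of the record with the smallest
    number of differing symptoms."""
    return min(L, key=lambda rec: sum(a != b for a, b in zip(rec[1], v)))[0]

def leave_one_out(F, decide):
    """"Leave one out": classify each record t using F without t as the
    learning file; return the percentage of wrong classifications."""
    wrong = sum(decide(F[:t] + F[t + 1:], symptoms) != d
                for t, (d, symptoms) in enumerate(F))
    return 100.0 * wrong / len(F)

F = [("flu", (1, 1, 0)), ("flu", (1, 1, 1)),
     ("cold", (0, 0, 1)), ("cold", (0, 0, 0))]
print(leave_one_out(F, sp_nn))  # -> 0.0 on this toy file
```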
6. If an evidence v ∈ Ξ turns up, as input for SP, that is not present in the learning file L, i.e. v ≠ ((F)_j)_Ξ for j = 1, 2, ..., |F|, then SP recommends the diagnosis with the highest a priori probability, i.e. d_opt = argmax_{d ∈ η} P_η(d), which is what one would expect.
7. If there is a valid piece of knowledge of a logical nature (e.g. implications), it is possible to enforce it just by repeating artificial records representing it in L several times.
SP has several drawbacks as well:
1. SP can be applied only to nominal variables (i.e. not to continuous, not to cardinal and not even to ordinal ones) due to the properties of the proximity metric ρ.
2. As the only testing criterion is the number of wrong classifications, SP, in its presented version, is not a proper choice for risk analysis.
3. With a decreasing number of symptoms (evidences), the discernment power of SP drops as well. (This is similar for marginal-based algorithms.)
4. It is not possible to add additional knowledge about the structure of P (e.g. in the form of graph models expressing conditional independence among marginals). Everything is based on the input data represented by the statistical file F (or L) only.
There are some open questions to study. One would expect that the decision quality decreases with an increasing value k of the proximity metric ρ. This can be observed in testing runs. However, one would also expect a certain invariance, or at least monotonicity, in the recommended d_opt. In general, this does not always happen and the recommended d_opt "oscillates" when moving in the array LD to levels with a smaller symptom coincidence i (or, in terms of ρ, a higher value k). Would it be possible to change this behaviour by introducing a sort of "weighting" on the symptom variables? Up to now, all symptom variables were supposed to be equally informative. And if it were effective, how should it be implemented in the existing realization of SP? Another topic that would deserve a deeper investigation is the behaviour of SP for small sets of disclosed symptom variables (the so-called aperture) used for defining the evidence. Though the improvement in decision quality (with respect to alternative approaches) may be only several percentage points, it would be interesting to find out whether, on average, SP is a reliable tool for decision-making even in this situation.
To summarize, with respect to the above-mentioned arguments, SP can be recommended for decision-making on nominal symptom variables when a sufficient learning data file is available. It may serve as an alternative to well-established marginal-based algorithms for decision-making under uncertainty. It has a sound basic philosophy ("Patients with similar symptoms should have a similar diagnosis") and it is theoretically well founded (conditioning on the equivalence classes induced by the proximity metric ρ and a direct link to the optimal solution represented by formula (1)). According to the simulations (described in section 6) on a realistic case study (mentioned in section 5), it does quite well and it does not require sophisticated suppositions (like the maximal entropy principle or a Markovian blanket) about the structure of the approximating joint distributions. The problem of marginal selection (i.e. "how many" and "what composition") is circumvented in SP.
Remark: The original version (entitled "Diagnostic problem without marginals") was published at the Wupes15 Workshop at Moninec. This is an extended version of that paper.