Volume 179, Issue 2
Original Article
Open Access

What matters in differences between life trajectories: a comparative review of sequence dissimilarity measures

Matthias Studer

Corresponding Author

University of Geneva, Switzerland

Address for correspondence: Matthias Studer, Institute for Demographic and Life Course Studies, University of Geneva, Boulevard du Pont d'Arve 40, CH‐1211 Geneva 4, Switzerland. E‐mail: Matthias.Studer@unige.ch
First published: 14 July 2015
Citations: 94

Summary

This is a comparative study of the multiple ways of measuring dissimilarities between state sequences. The originality of the study is the focus put on the differences between sequences that are sociologically important when studying life courses such as family life trajectories or professional careers. These differences essentially concern the sequencing (the order in which successive states appear), the timing and the duration of the spells in successive states. The study examines the sensitivity of the measures to these three aspects analytically and empirically by means of simulations. Even if some distance measures underperform, the study shows that there is no universally optimal distance index, and that the choice of a measure depends on which aspect we want to focus on. From the review and simulation results, the paper derives guidelines to help the end user to choose the right dissimilarity measure for her or his research objectives. This study also introduces novel ways of measuring dissimilarities that overcome some flaws in existing measures.

1 Introduction

Abbott (1983) stressed the relevance of sequence methods to the social sciences and grounded the use of sequence analysis theoretically in narrative positivism (Abbott, 1992). Since then, sequence analysis has become popular, and particularly so‐called optimal matching (OM) analysis (Abbott and Forrest, 1986; Abbott and Hrycak, 1990). Sequence analysis is now a key method used to study the spans of life trajectories and careers (e.g. Bras et al. (2010), Widmer and Ritschard (2009) and Schumacher et al. (2012)). The strength of the sequence approach is the holistic view that it provides by dealing with whole trajectories. This allows us to determine trajectory patterns that account for all states of interest experienced during the period considered. In contrast, survival or event history analyses focus on the hazard of, or time to, a specific event, and do not give an overall view of how the trajectories are organized.

An OM analysis measures pairwise dissimilarities between sequences and then identifies ‘types’ of pattern by clustering the sequences based on these dissimilarities. Besides clustering analyses, other dissimilarity‐based methods have also proven useful when investigating sequence data. For instance, Abbott (1983) mentioned multi‐dimensional scaling, Massoni et al. (2009) used self‐organizing maps, Studer et al. (2011) showed how to run analysis‐of‐variance like analyses and to grow regression trees on sequence data, and Gabadinho and Ritschard (2013) searched for non‐redundant typical patterns with the densest neighbourhoods.

Despite often being referred to as ‘OM analysis’, so named by Abbott and Forrest (1986) after the edit distance they used, a dissimilarity‐based analysis is in no way restricted to OM distances. The methods also work with other measures of dissimilarity, and, as we shall see, many different distances have been proposed. For example, there are $\chi^2$‐distances that have been adapted for sequence data, which essentially measure differences in state distributions (e.g. Deville and Saporta (1983) and Grelet (2002)). There are also distances based on counts of common attributes, e.g. matching states (Hamming, 1950; Bergroth et al., 2000) or matching subsequences (Elzinga and Studer, 2015), and multiple variants of editing dissimilarity measures, such as OM, that evaluate differences according to the cost of ‘editing’ one sequence into the other (e.g. Levenshtein (1966), Hollister (2009), Halpin (2010), Lesnard (2010) and Biemann (2011)).

Measuring the dissimilarity between sequences (i.e. a pairwise comparison of the sequences) is the common and crucial starting point for all dissimilarity‐based sequence methods. Therefore, in choosing a dissimilarity measure, it is important that we understand what we want to account for before quantitatively evaluating the difference between two sequences. This study contributes to this understanding by identifying the aspects (e.g. constituent states, sequencing, timing and duration) in which sequences may differ, and studying how various dissimilarity measures account for these aspects. This study of dissimilarity measures comprises an organized descriptive review and is original in that it focuses on those aspects of sequence differences that matter in the social sciences. In addition, we conduct a simulation study to examine how these measures behave with respect to those aspects.

Alongside this review, we propose two new distance measures and three original strategies to set OM costs. The first new distance measure is an edit measure (OM between sequences of spells) that consistently accounts for differences in the time that is spent in the distinct successive states (DSSs). The second is a reformulation of the OM of transition sequences that was introduced by Biemann (2011). The variant proposed drastically reduces the number of parameters. With regard to setting the OM costs, we first propose an original solution to set data‐driven insertion and deletion (indel) costs based on state frequencies. Second, we suggest the use of the Gower distance to determine the substitution costs for a mix of quantitative and qualitative state attributes. Finally, we propose to define substitution costs as a $\chi^2$‐distance that stresses the similarity between states sharing the same future.

The remainder of this paper is organized as follows. We first set the framework by specifying the kinds of sequences that we consider, and the different aspects that we may want the dissimilarity measures to reflect. We then present the dissimilarity measures reviewed and their theoretical properties. In the following section, we examine the behaviour of the measures by using artificially generated data, and we empirically study how the measures are related to each other. Finally, we conclude the paper by providing guidelines on how to select an appropriate measure.

All measures that are presented in this paper are available in the latest version of the TraMineR R library. See the TraMineR Web site (http://traminer.unige.ch) for explanations on how to compute these measures.

2 Sequences and distances

2.1 Definitions and notation

We consider categorical sequences, defined as an ordered list of successive elements chosen from a finite alphabet, Σ. For sequences describing life trajectories, the elements in the sequences are usually in chronological order. In addition, in discrete time state sequences, the position in the sequence conveys time information so that the difference between two positions defines a duration. For example, assuming positions correspond to ages in years, knowing that an individual is in state ‘full‐time work’ from positions 20 to 29, we can conclude that the individual worked full time for 10 years.

The natural way to encode a state sequence is to list the successive elements. For example, the trajectory of someone working ‘full time’, F, for 2 years and then ‘part time’, P, for 3 years is represented as F–F–P–P–P. We can also encode the sequence in a more compact way as $F^2$–$P^3$. In other words, we simply list the DSSs and add a duration stamp (in this case, as the superscript) indicating the number of successive positions in that state (i.e. the length of the spell in the state). Apart from being compact, the latter form also facilitates comparing the sequencing and duration of spells in the same state. In the remainder of this study, we shall use the term spell to refer to the whole consecutive time spent in the same state.
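
The two encodings can be converted into each other mechanically. The following Python sketch is only an illustration; the function names `to_spells` and `from_spells` are hypothetical and are not part of TraMineR.

```python
from itertools import groupby


def to_spells(sequence):
    """Compress a state sequence into (state, duration) pairs, one per spell."""
    return [(state, sum(1 for _ in run)) for state, run in groupby(sequence)]


def from_spells(spells):
    """Expand (state, duration) pairs back into the position-by-position form."""
    return [state for state, duration in spells for _ in range(duration)]


sequence = ["F", "F", "P", "P", "P"]   # F-F-P-P-P
spells = to_spells(sequence)           # [("F", 2), ("P", 3)]
dss = [state for state, _ in spells]   # distinct successive states: ["F", "P"]
```

The list `dss` carries the sequencing information alone, whereas `spells` adds the spell durations.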

2.2 Differences between sequences

State sequences are complex objects that provide many different pieces of information, such as total and consecutive time spent in each state, the timing of states and the state order. Kruskal (1983), page 207, distinguished four different ways—or transformation operations—in which sequences may differ: substitutions, indels, compressions and expansions, and transpositions or swaps. These transformation‐based distinctions make sense in fields such as biology, computer science and speech research, and motivated the basic operations that are considered in edit distances. In the social sciences, the life trajectory of one individual can hardly be considered to be the result of a transformation of the trajectory of another person. Therefore, our interest when comparing sequences is not in the transformation of one sequence to another, but in how the sequences differ in socially meaningful aspects. In line with the distinctions that were made by Settersten and Mayer (1997) and Billari et al. (2006), we identify the following important aspects:
  1. experienced states—the distinct elements of the alphabet present in the sequence;
  2. distribution—the within‐sequence state distribution (total time);
  3. timing—the age or date at which each state appears;
  4. duration—the spell lengths in the distinct successive states;
  5. sequencing—the order of the distinct successive states.

The first basic aspect of interest when comparing the trajectories of two individuals is the list of distinct states that each experiences. This is what Dijkstra and Taris (1995) and Elzinga (2003) implicitly referred to when stating that two sequences with no common state are maximally dissimilar. (Note that this claim would not hold if some states can be considered to be more similar than others.) The notion of experienced states is also related to the quantum that was defined by Billari et al. (2006) as the count of experienced events. In addition to the list of experienced states, we may want to examine the total time spent in each distinct state. This tells us the distribution of the states within each sequence. Knowing the distribution is useful, for example, when studying the effect of total exposure times. For instance, we may want to examine the effect that the total amount of time spent unemployed has on a person's health status at retirement. However, differences in the presence or absence of states, or in the distribution of the states within the sequences, do not account for how the states occur along the longitudinal axis. Therefore, in a sequence analysis, these differences should be used in conjunction with other dimensions.

The timing of the states (i.e. the age—or date—at which we are in a given state) or the time that events occur, such as the start of a spell (Settersten and Mayer, 1997) in a given state, is a sociologically important aspect. For instance, life course literature often stresses the role of age norms in the construction of life trajectories (Widmer et al., 2003). Moreover, the social reality that is reflected by a state often depends on its position in the trajectory. For example, Rousset et al. (2012) observed that the effect of unstable employment on the professional integration trajectory increases with age. In addition, in his study on the way that couples use time, Lesnard (2010) claimed that differences between ‘no partner working’ and ‘only one partner working’ reflect a very different reality when observed during the day or night. Then, spell duration, the consecutive time that is spent in the same state, is another way to account for time (see Settersten and Mayer (1997), who even considered the more general concept of spacing to refer to the time between any two events or transitions). Instead of the precise timing, spell duration refers to the time that elapses between the start and the end of a significant spell. The spell durations, such as the time lived alone before marrying, or the duration of a jobless episode, are important aspects within people's life courses. Spell duration is different from the information that is provided by the state distribution in that it gives the consecutive exposure time, rather than the total, but not necessarily consecutive exposure time. Unlike the state distribution, the spell duration allows us, for example, to distinguish between long‐term unemployment and multiple short‐term unemployment episodes.

Finally, sequencing, the order in which states (or events) are experienced, is another socially sound dimension. The role of sequencing norms in the construction of life trajectories is at least as important as the role of age norms and has been emphasized by, for example, Hogan (1978). For example, experiencing childbirth before or after marriage reflects different ways of life. Abbott (1990) identified sequencing as the key concept in sequence analysis, and Billari et al. (2006) emphasized its importance in conjunction with timing and quantum for demographic life course analysis.

The five aforementioned aspects are not independent of each other. For example, by changing the sequencing, we also change the timing. Similarly, changing consecutive times spent in states implies changes in the within‐sequence distribution, and possibly in the sequencing as well. Likewise, modifying the within‐sequence distribution by changing the distinct present states affects the sequencing and duration. From the reverse point of view, two sequences that are similar with regard to one aspect may be quite different in terms of another.

In fact, we do not need all five aspects to characterize a sequence entirely. Specifying the sequencing (the DSS), the duration of the DSS and at least one time (e.g. the start time of the sequence) automatically determines the experienced states, their distribution and their time of occurrence. Likewise, the sequencing and the start time of the successive spells completely define the sequence. Therefore, from here on, we essentially focus on sequencing, duration and timing aspects.
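
As an illustration of this redundancy, the Python sketch below rebuilds a sequence from its DSS, the spell durations and a start time, and then derives the experienced states, the state distribution and the start time of each spell. The function name `describe` and the example values are assumptions made for the illustration.

```python
from collections import Counter


def describe(dss, durations, start):
    """Derive the remaining aspects from sequencing, durations and a start time."""
    # Full position-by-position sequence implied by the spells.
    sequence = [s for s, d in zip(dss, durations) for _ in range(d)]
    experienced = set(dss)            # experienced states
    distribution = Counter(sequence)  # total time spent in each state
    spell_starts = []                 # timing: (state, time at which the spell starts)
    t = start
    for s, d in zip(dss, durations):
        spell_starts.append((s, t))
        t += d
    return sequence, experienced, distribution, spell_starts


# Two years full time from age 20, then three years part time.
sequence, experienced, distribution, spell_starts = describe(["F", "P"], [2, 3], 20)
```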

In practice, we compare two sequences by using a measure of dissimilarity to quantify the level of mismatch between the sequences. The next section reviews existing dissimilarity measures, while stressing their properties and their sensitivity to timing, duration and sequencing. Among the properties, we shall in particular pay attention to the fulfilment of the mathematical conditions of a metric distance. These are required for most applications and especially for sample‐based studies. Discrepancy analyses and many clustering algorithms (such as Ward) require metrics. For instance, the triangle inequality ensures coherence between computed dissimilarities. Without the triangle inequality, the actual dissimilarity between x and y could be smaller than the measured dissimilarity d(x,y) because of a third sequence z. In this case, the actual dissimilarity would depend on the other sequences in the data set (see, for example, Elzinga and Studer (2015)). Among other cases, this is problematic in sample‐based studies, where the actual distance depends on whether the z‐sequence was drawn or not.
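
Whether a computed dissimilarity matrix respects the triangle inequality can be checked directly. The Python sketch below is a brute-force illustration for small matrices; the function name `triangle_violations` and the tolerance argument are assumptions.

```python
from itertools import permutations


def triangle_violations(d, tol=1e-12):
    """List the triples (x, y, z) for which d[x][y] > d[x][z] + d[z][y].

    d is a square matrix (list of lists) of pairwise dissimilarities.
    """
    n = len(d)
    return [
        (x, y, z)
        for x, y, z in permutations(range(n), 3)
        if d[x][y] > d[x][z] + d[z][y] + tol
    ]


# Going through the third sequence (index 2) costs 1 + 1 = 2,
# which is less than the direct dissimilarity of 3 between sequences 0 and 1.
d = [[0, 3, 1],
     [3, 0, 1],
     [1, 1, 0]]
violations = triangle_violations(d)
```

An empty list means that no third sequence offers a shorter path between any pair.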

3 Overview of dissimilarity measures

Table 1 lists the dissimilarities reviewed. The first column gives short names, which will be used later when presenting the results of our empirical evaluations. The dissimilarity measures can be classified into three broad classes, as shown in the next three columns:

  1. distances between distributions, ‘Dis’;
  2. measures based on the count of common attributes between sequences, ‘Att’;
  3. edit distances, which measure the cost of the operations that are necessary to transform one sequence into the other, ‘Edt’.

Table 1. Summary of dissimilarity measures between state sequences
Measure | Type (Dis, Att, Edt) | Description | Properties (marks for Metric, Eucl, T.warp, S.dep, Ctxt) | Subst. | Indels | Others
CHI2, EUCLID (Deville and Saporta, 1983) | Dis | Distance between per‐period state distributions | × × × | | | Number of periods K
CHI2fut (Rousset et al., 2012) | Dis | Positionwise state distances based on shared future | × × × | | | Time lag weighting function
NMS (Elzinga, 2003, 2005) | Att | Based on number of matching subsequences | × × × × | | |
SVRspell (Elzinga and Studer, 2015) | Att | Based on number of matching spell subsequences with spell length weights | × × × × × | User | | Subsequence length weight a; spell duration weight b
HAM (Hamming, 1950) | Att, Edt | Number of mismatches | × ׆ | | |
Generalized Hamming | Edt | Sum of mismatches with state‐dependent weights | ×‡ ׆ § × | User | |
DHD (Lesnard, 2010) | Edt | Sum of mismatches with positionwise state‐dependent weights | × × | Data | |
OM (Abbott and Forrest, 1986) | Edt | Minimum cost for turning x into y by using theoretically defined costs | ×‡ × × | User | Multiple |
LCS, OM(1,2) or Levenshtein‐II | Att, Edt | Based on length of LCS or number of indels | × × | | |
Feature (new) | Edt | Costs based on state features | × × × | Features | Single | State features
Future (new) | Edt | Costs based on similarity between conditional state distributions q periods ahead | × × × | Data | Single | Forward lag q
trate (Rohwer and Poetter, 2005) | Edt | Costs based on transition rates | × × | Data | Single | Transition lag q
opt* (Gauthier et al., 2009) | Edt | Costs adjusted to increase similarity between similar sequences | §§ × × | Data | Single | Similarity rate
indels, indelslog (new) | Edt | State‐dependent indels based on inverse or log‐inverse state frequencies | × × × | | Auto |
OMloc (Hollister, 2009) | Edt | Context‐dependent indel costs | × × × | User | Auto | Expansion cost e; context g
OMslen (Halpin, 2010) | Edt | Costs weighted by spell length | × × × × | User | Multiple* | Spell length weight h
OMspell (new) | Edt | OM between sequences of spells | ×‡ × × × | User | Multiple* | Expansion cost e
OMstran (new) | Edt | OM between sequences of transitions | ×‡ × × × | User | Multiple | Origin–transition trade‐off w; transition indel cost function
  • †Squared Euclidean distance.
  • ‡If costs fulfil the triangle inequality.
  • §If costs are squared Euclidean distances.
  • §§Can generate negative dissimilarities.
  • *Not available in TraMineR.

The next five columns indicate the properties of the measures. ‘Metric’ denotes measures fulfilling the mathematical conditions of distances. Then, ‘Eucl’ denotes Euclidean distances, ‘T.warp’ denotes measures allowing for a time warp when comparing sequences, ‘S.dep’ denotes state‐dependent measures (i.e. measures that allow for differences between states that can vary) and ‘Ctxt’ denotes measures that consider the context of the states.

These properties may help to narrow the set of potentially useful distances. For example, we may want to discard OM with so‐called optimized costs, OM(opt), because of the possible negative values that it can generate. We may also want to discard non‐metric measures such as OM with transition‐based costs, OM(trate), localized OM, OMloc, and dynamic Hamming costs, DHD, with costs derived from transition rates, because of unexpected behaviour that may result from possible violations of the triangle inequality. In addition, Euclidean distances may be preferred if we plan to use multi‐dimensional scaling. For non‐Euclidean distances, multi‐dimensional scaling produces complex co‐ordinates that are associated with negative eigenvalues. These (usually ignored) complex co‐ordinates reflect the distortion that is incurred by embedding sequences in a Euclidean vector space. Therefore, it may be worth studying these in further detail (Laub and Müller, 2004).

The last columns in Table 1 show the available tuning parameters. These parameters are explained in the following subsections, where each measure is briefly described. Here, ‘Subst’ represents the possibility of accounting for state‐dependent substitution—or proximity—costs, with ‘User’ meaning that the costs are set by the user, ‘Data’ that they are data driven and ‘Features’ that they are based on state features. The ‘Indels’ column indicates whether there is a single state‐independent indel cost, ‘Single’, whether state‐dependent user‐defined indel costs are allowed, ‘Multiple’, or whether the indel costs are automatically set by the measure itself, ‘Auto’.

Considering state proximities or substitution costs is of special interest when some states should obviously be considered closer than others. Such distinctions occur, for instance, when the states are ordinal, such as education level, or result when some states share a higher number of common attributes than others. The possibility of considering state‐dependent substitution costs is also of interest in the multichannel case. The method that was adopted by Pollock (2007) for measuring distances between multichannel sequences derives the multichannel costs from the costs that are available for each individual channel. In this case, the costs generated would at least vary with the number of channels concerned by the state mismatch. With the same fixed cost in each channel, for example, the cost will be lower for a difference in one channel only than for a simultaneous mismatch in two channels.
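
The last point can be illustrated with a small Python sketch. It assumes, for the illustration only, that the multichannel cost is the plain sum of the per-channel costs; the channel contents (marital and employment status) and the function names are hypothetical.

```python
def unit_cost(a, b):
    """Same fixed cost in a channel: 0 for a match, 1 for a mismatch."""
    return 0 if a == b else 1


def multichannel_cost(state_a, state_b, channel_costs):
    """Sum the per-channel substitution costs of two multichannel states.

    state_a and state_b hold one state per channel; channel_costs holds
    one cost function per channel (additive combination is an assumption).
    """
    return sum(cost(a, b) for a, b, cost in zip(state_a, state_b, channel_costs))


channel_costs = [unit_cost, unit_cost]
one_channel = multichannel_cost(("married", "employed"), ("married", "unemployed"), channel_costs)
two_channels = multichannel_cost(("married", "employed"), ("single", "unemployed"), channel_costs)
```

Here `one_channel` is lower than `two_channels`, so the derived costs vary with the number of channels concerned by the mismatch.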

We now briefly describe each of the dissimilarity measures that are presented in Table 1. We start by addressing distances between within‐sequence state distributions; then we consider measures based on the count of common attributes. Lastly, we discuss OM and other related edit dissimilarities. In addition, we introduce two new distance measures to overcome problems that are identified in existing measures as well as three new strategies to define the costs in OM.

A more detailed and formalized review of the dissimilarities that are considered here can be found in Studer and Ritschard (2014). This working paper also provides a more thorough discussion of the sociological interpretation of the distances.

3.1 Distances between probability distributions

3.1.1 Distances between state distributions

One approach to measuring the dissimilarity between sequences, propounded by adepts of the French school of data analysis (Deville and Saporta, 1983; Grelet, 2002), focuses on the longitudinal state distribution within each sequence. In other words, the approach focuses on the time that is spent in each state within the sequences. The dissimilarity between sequences is measured by the distance between the distribution vectors by using either the Euclidean distance or the $\chi^2$‐distance. The former accounts for the absolute differences in the proportion of time spent in the states. The squared $\chi^2$‐distance weights the squared differences for each state by the inverse of the overall proportion of time spent in the state, which, for two identical differences, places more importance on a rare state than on a frequent state.

This first distribution‐based measure is, by definition, sensitive to the time spent in the states. However, it is insensitive to the order and exact timing of the states. Following Deville and Saporta (1983), we can overcome this limitation by considering the distribution in K successive (possibly overlapping) periods. The distance is then equal to the sum of the $\chi^2$‐distances for each period. At the limit, when K is equal to the length of the sequences, the distance corresponds to a weighted count of mismatching states. The latter case will be very sensitive to non‐matching timings and, as a result, gains some sensitivity to sequencing.
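
The effect of the number of periods can be seen in a small Python sketch. It is only an illustration under stated assumptions: sequences of equal length, K non-overlapping periods of roughly equal size, weights taken from the pooled state proportions of the sample within each period, and a result reported as the sum over periods of the squared distances. All function names are hypothetical.

```python
from collections import Counter


def state_proportions(seq, alphabet):
    """Proportion of positions spent in each state of the alphabet."""
    counts = Counter(seq)
    return [counts[s] / len(seq) for s in alphabet]


def chi2_sq(x, y, sample, alphabet):
    """Squared chi-square distance between the state distributions of x and y."""
    pooled = Counter(state for seq in sample for state in seq)
    total = sum(pooled.values())
    px = state_proportions(x, alphabet)
    py = state_proportions(y, alphabet)
    # Each squared difference is divided by the overall proportion of the state.
    return sum(
        (a - b) ** 2 / (pooled[s] / total)
        for s, a, b in zip(alphabet, px, py)
        if pooled[s] > 0
    )


def chi2_sq_periods(x, y, sample, alphabet, k):
    """Sum of the squared chi-square distances over k successive periods (k <= length)."""
    n = len(x)
    bounds = [i * n // k for i in range(k + 1)]
    return sum(
        chi2_sq(x[a:b], y[a:b], [s[a:b] for s in sample], alphabet)
        for a, b in zip(bounds, bounds[1:])
    )


x, y = "FFPP", "PPFF"
whole = chi2_sq_periods(x, y, [x, y], ["F", "P"], k=1)   # same distribution overall
halves = chi2_sq_periods(x, y, [x, y], ["F", "P"], k=2)  # the order now matters
```

With a single period the two sequences are indistinguishable because they spend the same time in each state; with two periods the difference in timing becomes visible.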

3.1.2 Distance based on conditional distributions of subsequent states

A related measure is defined as the sum of the position‐dependent distances computed at successive positions. This measure was proposed by Rousset et al. (2012) to measure the dissimilarity between sequences describing professional integration trajectories. The aim of the measure is to stress the similarity of the sequences that are likely to lead to the same future. For example, two different educational trajectories will be considered similar if they are both likely to lead to the same stable professional position. Here, the distance between states at position t is itself defined as the $\chi^2$‐distance between the vectors of the (weighted and normalized) transition rates from the state observed at t to the states observed at the subsequent positions, t+1,t+2,…. Each transition rate is weighted by a decreasing function of the time interval to give more importance to the near future than to the far future when evaluating the distance between the states at position t.

Since this distance is the sum of positionwise distances, it should be sensitive to non‐matching timings and differences in sequencing. However, we can expect this sensitivity to be smoothed somewhat by the introduced link to the future.

The $\chi^2$‐distances are Euclidean, as is the sum of the Euclidean distances over positions. Therefore, all three distances are Euclidean and have all the desired mathematical properties. However, the distances that are defined as the sum of the positionwise distances between states apply only to pairs of sequences of the same length.

3.2 Distances based on counts of common attributes

3.2.1 Simple Hamming distance

Hamming (1950) proposed measuring the dissimilarity between two sequences by using the number of positions with non‐matching states. The Hamming distance also corresponds to the Gower distance with equally weighted states and positions, as considered by Wilson (2006). Since the Hamming distance proceeds by a positionwise comparison, it applies only to pairs of sequences of the same length and is very sensitive to timing mismatches. The square root of the measure is Euclidean and, in its original formulation, the measure is independent of which tokens mismatch.
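
A minimal Python sketch of the simple Hamming distance follows; the function name and the example sequences are chosen for the illustration. The second example shows the sensitivity to timing: shifting an alternating pattern by one position makes every position mismatch.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("the Hamming distance requires sequences of the same length")
    return sum(a != b for a, b in zip(x, y))


close = hamming("FFPPP", "FPPPP")      # one mismatching position
shifted = hamming("ABABAB", "BABABA")  # same pattern shifted by one position
```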

3.2.2 Length of the longest common subsequence

The length of the longest common subsequence (LCS) corresponds to the number of elements in one sequence that can be uniquely matched with elements occurring in the same order in the other sequence (for example see Bergroth et al. (2000)). Letting A(x,y) be the number of elements matching in this way, we obtain the LCS distance by computing $d_{\mathrm{LCS}}(x,y) = |x| + |y| - 2A(x,y)$, where $|x|$ denotes the length of $x$. Since the position in the other sequence with which an element is matched varies with the other sequence, $d_{\mathrm{LCS}}$ is not Euclidean. Moreover, since it is not based on positionwise matches, the LCS distance should not be too sensitive to timing. In this case, we can expect a stronger dependence on differences in the state distribution and sequencing, especially the order of the most frequent states and, to a lesser extent, on differences in the consecutive times spent in the distinct states.
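
The Python sketch below illustrates the computation, assuming the usual form of the LCS distance, namely the number of elements of both sequences that are not part of the LCS, |x| + |y| − 2A(x,y). The function names are hypothetical.

```python
def lcs_length(x, y):
    """Length A(x, y) of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip an element of x or of y
        prev = curr
    return prev[-1]


def lcs_distance(x, y):
    """Count of elements in x and y that are not involved in the LCS."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


shifted = lcs_distance("ABABAB", "BABABA")  # pattern shifted by one position
```

For the shifted alternating pattern the LCS distance is 2, whereas every position mismatches in a positionwise comparison, which illustrates the lower sensitivity to timing.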

3.2.3 Number of matching subsequences

Elzinga (2003, 2005) introduced a dissimilarity measure based on the number of matching subsequences, NMS. (A subsequence is obtained by deleting any number of states in a sequence (Elzinga et al., 2008).) The general idea of the measure is that, the more often a given ordering of tokens in one sequence is observed in the other sequence, the closer the two sequences are to each other.

Elzinga and Studer (2015) proposed a generalization of NMS called the subsequence vector representation‐based metric, SVRspell. This distance is based on the matching subsequences between DSS sequences where the matching subsequences are weighted according to their length and the duration of the spells involved. Two parameters control the behaviour of the measure. The first parameter, a⩾0, is an exponent for the subsequence length weights. The second parameter, b⩾0, is an exponent for the spell durations. In addition to these weighting mechanisms, the subsequence vector representation SVR can account for state proximities.

NMS and SVRspell are Euclidean distances. They should be very sensitive to differences in sequencing and sensitive to differences in duration. NMS is sensitive to duration because lengthening a spell increases the number of embeddings of the subsequences concerned, whereas SVRspell is so because it explicitly considers the duration of spells. In contrast, computing NMS between the DSS sequences, which is equivalent to SVRspell with b=0, should be insensitive to differences in timing and duration.

3.3 Optimal matching

Since Andrew Abbott (Abbott and Forrest, 1986; Abbott and Hrycak, 1990) popularized OM analysis in the social sciences, OM has become the most common way of computing dissimilarities between sequences describing life trajectories. The method borrows from other fields that use similar edit approaches (Kruskal, 1983), such as the Levenshtein distance (Levenshtein, 1966) in computer science and sequence alignment in bioinformatics.

3.3.1 Optimal matching principles and special cases

OM measures the dissimilarity between two sequences, x and y, as the minimum total cost of transforming one sequence, say x, into the other sequence y, by means of indels of tokens or substitutions between tokens. Each operation is assigned a cost, which may vary with the states involved.

The costs can be specified by using a single matrix, denoted as Γ, where the indel costs are specified as a substitution with an additional ‘null’ or ‘empty’ state. The OM distance between the sequences is a metric if Γ defines a metric between the admissible states (Yujian and Bo, 2007). In other words, the costs should be symmetric, fulfil the triangle inequality and be 0 only for the substitution of an element with itself. If the triangle inequality is not satisfied, at least one substitution cost will not make sense, because there will be a path allowing the same substitution result at a lower cost. Moreover, existing algorithms, such as that of Needleman and Wunsch (1970), that are used to compute the OM distance all assume that the costs satisfy the metric properties. Therefore, they could return a solution that does not reflect the minimum cost if these properties are violated. As the solution to a minimization process, the OM distance cannot be expressed as a kernel and, therefore, is not Euclidean (Elzinga, 2007).

The parameterization of OM by using the costs of the elementary operations makes it a very flexible dissimilarity measure that can cope with many situations. It defines a range of distance measures between two extreme cases (Lesnard, 2010): the generalized Hamming distance and the Levenshtein II distance. The former is the weighted sum of positionwise mismatches between two sequences (i.e. OM without indels). Like the simple Hamming distance, it should be mainly sensitive to timing differences. The latter case is the number of indels that are needed to transform one sequence into the other (i.e. OM without substitutions). Note that, for a single‐indel cost of 1, OM without substitutions (Levenshtein II) is equivalent to OM with substitution costs of 2 or more, and to the LCS distance. The Levenshtein II distance can be interpreted as the count of the elements in each sequence that are not involved in the LCS and should be more sensitive to spell durations and sequencing. OM lies between these two distance measures and is the sum of two terms: a weighted sum of time shifts (indels) and a weighted sum of the mismatches (substitutions) remaining after the time shifts. High indel costs render the dissimilarity extremely time sensitive, whereas low indel costs—with respect to substitution costs—downplay the importance of time shifts in sequence comparisons. Costs also allow for state‐dependent dissimilarities between sequences.
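
The role of the costs can be illustrated with a minimal Python sketch of the dynamic programme that underlies OM. It assumes a single state-independent indel cost and a user-supplied substitution cost function; the function name `om_distance` and its arguments are hypothetical, and the sketch is not the TraMineR implementation.

```python
def om_distance(x, y, subst_cost, indel=1.0):
    """Minimum total cost of turning x into y with indels and substitutions.

    subst_cost(a, b) returns the substitution cost between states a and b
    and is expected to return 0 when a == b.
    """
    prev = [j * indel for j in range(len(y) + 1)]  # build y from an empty prefix
    for i in range(1, len(x) + 1):
        curr = [i * indel]                         # delete the first i elements of x
        for j in range(1, len(y) + 1):
            curr.append(min(
                prev[j] + indel,                                 # delete x[i-1]
                curr[j - 1] + indel,                             # insert y[j-1]
                prev[j - 1] + subst_cost(x[i - 1], y[j - 1]),    # substitute or match
            ))
        prev = curr
    return prev[-1]


def constant_cost(value):
    """Substitution cost function with the same cost for every mismatch."""
    return lambda a, b: 0.0 if a == b else value


x, y = "ABABAB", "BABABA"
levenshtein_ii = om_distance(x, y, constant_cost(2.0), indel=1.0)    # substitutions never pay off
hamming_like = om_distance(x, y, constant_cost(1.0), indel=100.0)    # indels never pay off
```

With a substitution cost of 2 and an indel cost of 1, the result for the shifted pattern is 2, the number of elements outside the LCS; with prohibitive indel costs, the result is 6, the count of positionwise mismatches.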

The OM distance can be thought of as based on the longest partially matched subsequence (Lesnard, 2010). From a sociological point of view, this partially matched subsequence can be interpreted as a ‘common backbone’ or ‘common narrative’ between trajectories (Elzinga and Studer, 2015).

OM has been criticized because of the lack of sociological meaning of the transformation operations, and their associated costs (Abbott and Tsay, 2000; Abbott, 2000; Levine, 2000; Wu, 2000; Aisenbrey and Fasang, 2010; Lesnard, 2010). Furthermore, the high number of indel and substitution costs may be seen as an overparameterization (Wu, 2000). Next, we examine the various methods for setting the costs.

3.3.2 Substitution costs

There are essentially three strategies for choosing substitution costs (e.g. Abbott and Tsay (2000) and Hollister (2009)).

3.3.2.1 Theory‐based costs

The first strategy is to determine the costs on theoretical grounds. A priori knowledge often provides an order of magnitude of the similarity of two states, which allows us to rank possible replacements. To illustrate, assume careers coded by using the following statuses: Senior Manager, S, Manager, M, and Employee, E. From the nature of the states, S is closer to M than to E. To reflect this hierarchy, we could, for instance, set the cost of replacing S with E as 1.5 times the substitution cost between S and M. In doing so, we account for the order between the states, although the exact values that are chosen for the ratios between the substitution costs remain quite arbitrary.

3.3.2.2 Costs based on state attributes

A solution that was advocated by Hollister (2009) to make the choice less arbitrary is to specify the list of state attributes on which we want to evaluate the closeness between states. For example, for professional positions we could consider the qualification required, level of responsibility and degree of precariousness, and for cohabitational statuses we could consider the events that should have been lived to reach each situation. By specifying the values of the attributes for each state, we can then derive the pairwise substitution costs from the distances between all pairs of attribute vectors. This distance could be the Euclidean distance when all attributes are numerical. More generally, in the case of nominal, ordinal and symmetric or asymmetric binary characteristics, or even in the presence of a mix of variable types, we suggest the use of the Gower (dis)similarity coefficient (Gower, 1971). Besides explicitly rendering the state comparison criteria, the approach also has the advantage of generating costs that satisfy the triangle inequality.
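As a rough illustration of attribute-based costs in the purely numerical case, the sketch below derives substitution costs as Euclidean distances between attribute vectors and checks the triangle inequality. The attribute values for the states S, M and E are hypothetical, and the Gower coefficient for mixed variable types is not covered here.

```python
import math
from itertools import permutations

# Hypothetical numerical attributes (qualification required, level of responsibility).
attributes = {
    "S": (3.0, 3.0),  # Senior Manager
    "M": (2.0, 2.0),  # Manager
    "E": (1.0, 1.0),  # Employee
}


def attribute_costs(attrs):
    """Substitution costs as Euclidean distances between attribute vectors."""
    return {(a, b): math.dist(attrs[a], attrs[b]) for a in attrs for b in attrs}


def satisfies_triangle(costs, states, tol=1e-12):
    """Check cost(a, c) <= cost(a, b) + cost(b, c) for all triples of distinct states."""
    return all(
        costs[a, c] <= costs[a, b] + costs[b, c] + tol
        for a, b, c in permutations(states, 3)
    )


costs = attribute_costs(attributes)
triangle_ok = satisfies_triangle(costs, list(attributes))
```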

3.3.2.3 Data‐driven costs

A third strategy is to rely on data‐driven methods. Here, a popular solution is to derive the substitution costs from the observed transition rates. The idea is to assign higher costs to substituting between states when the transitions between them are rare, and a low cost when frequent transitions are observed (Rohwer and Poetter, 2005). However, deriving the substitution costs from the transition rates is questionable, as there is no reason for transition rates to reflect state similarities. For example, ‘single’ and ‘divorced’ may be seen as close states, but, by definition, we cannot switch from divorced to single. In addition, switching from single to divorced would suppose that marriage and divorce occur during the same unit of time, which is highly unlikely. Moreover, in practice, observed transition rates are generally low and the resulting substitution costs are all close to 2. Therefore, the OM distances that are based on transition rate costs produce results which are close to those obtained by using fixed state‐independent costs. A solution that generates somewhat higher and more diversified transition rates is to consider the transition between the state at t and the state q (>1) periods ahead, rather than using the transition between two consecutive time units. Whatever the time lag q, the transition‐rate‐based substitution does not ensure that the triangle inequality holds.
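The transition-rate approach can be sketched as follows. The cost formula 2 − p(a→b) − p(b→a) is an assumption here: it is a commonly used form and is consistent with the remark that low transition rates yield costs close to 2, but the text above does not state it explicitly. The `lag` argument plays the role of q.

```python
from collections import Counter


def transition_rates(sequences, states, lag=1):
    """Estimate p(a -> b) between positions t and t + lag, pooled over all t."""
    counts = Counter()
    origin_totals = Counter()
    for seq in sequences:
        for t in range(len(seq) - lag):
            counts[seq[t], seq[t + lag]] += 1
            origin_totals[seq[t]] += 1
    return {
        (a, b): (counts[a, b] / origin_totals[a] if origin_totals[a] else 0.0)
        for a in states
        for b in states
    }


def trate_costs(rates, states):
    """Assumed cost form: 2 - p(a -> b) - p(b -> a) for distinct states."""
    return {
        (a, b): 0.0 if a == b else 2.0 - rates[a, b] - rates[b, a]
        for a in states
        for b in states
    }


rates = transition_rates(["aaab", "aabb"], states="ab", lag=1)
costs = trate_costs(rates, states="ab")
```

Nothing in this construction enforces the triangle inequality, which matches the caveat made above.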

In the spirit of the work of Rousset et al. (2012) that was described earlier, a conceptually better approach considers the two states a and b to be close when there is a high chance that both states will be followed by a common state c, q units of time later. In other words, states a and b are close when they share a common future. For instance, although switching between high education and high vocational school is generally unlikely, both states may be seen as similar because they both have a high probability of leading to a managerial position, and a relatively low probability of leading to joblessness. We propose to operationalize this idea by defining the substitution cost between a and b as the χ²‐distance between the cross‐sectional state distributions expected q time units after the occurrence of state a and state b,
urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0017(1)
where urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0018 is the probability of moving from f to e over q units of time. Using a negative q‐value, we can similarly determine costs in terms of a common past.
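The common-future idea can be sketched as a χ²-type distance between the rows of a lag-q transition matrix. The weighting of each squared difference by the column sum of the transition probabilities is an assumption of this sketch, and the transition rates below are hypothetical values chosen to mimic the education example.

```python
import math


def future_cost(rates, states, a, b):
    """Chi-squared-type distance between the state distributions following a and b.

    rates[f][e] is the probability of being in e, q time units after being in f.
    Assumption: squared differences are weighted by the column sums of `rates`.
    """
    total = 0.0
    for e in states:
        column_sum = sum(rates[f][e] for f in states)
        if column_sum > 0:
            total += (rates[a][e] - rates[b][e]) ** 2 / column_sum
    return math.sqrt(total)


# Hypothetical lag-q transition rates: U = university, V = vocational school,
# M = managerial position, J = jobless.
states = ["U", "V", "M", "J"]
rates = {
    "U": {"U": 0.10, "V": 0.00, "M": 0.80, "J": 0.10},
    "V": {"U": 0.00, "V": 0.10, "M": 0.75, "J": 0.15},
    "M": {"U": 0.00, "V": 0.00, "M": 0.90, "J": 0.10},
    "J": {"U": 0.00, "V": 0.00, "M": 0.30, "J": 0.70},
}

cost_uv = future_cost(rates, states, "U", "V")
cost_uj = future_cost(rates, states, "U", "J")
```

With these illustrative rates, U and V come out closer to each other than U and J, even though direct transitions between U and V never occur.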

Another method of deriving costs from data was proposed by Gauthier et al. (2009). This approach is an ‘optimization’ procedure based on methods that are used in biology (e.g. Henikoff and Henikoff (1992)). The principle is to consider two states as close—and to assign them a low substitution cost—when they tend to occur jointly in pairs of similar sequences. Similarly, the method considers them as dissimilar—and assigns them a high cost—when they rarely co‐occur in pairs of similar sequences. The method works iteratively. At each step, it successively computes each cost by keeping the others unchanged and iterates until the costs converge. Experimenting with the implementation of the method in T‐COFFEE (Notredame et al., 2006), we faced serious issues, such as obtaining negative costs and, as a result, negative dissimilarities. Therefore, we did not include this method of computing substitution costs in our simulation study.

3.3.3 indel costs

Despite the importance of indel costs in controlling time warp, choosing indel costs has, with the notable exceptions of Stovel and Bolan (2004) and Hollister (2009), received far less attention than choosing substitution costs. Note that Stovel and Bolan (2004) suggested lowering the indel cost for incomplete sequences. Such costs that change from one sequence to another would probably result in dissimilarities that violate the triangle inequality.

3.3.3.1 Single‐indel cost

indel is often seen as a gap insertion operator. Thus, most applications use the same indel cost, irrespective of the inserted or deleted state. The only choice then concerns the level of this fixed indel cost. Abbott and Tsay (2000) advocated the use of a low indel cost and suggested a value in the vicinity of 0.1 times the maximum substitution cost. However, as pointed out by Hollister (2009), using such a low value ‘throws out much of the careful consideration a researcher puts into creating substitution costs in the first place’, because an insert and a delete would be used in place of any substitution costing more than twice the indel cost. For the extended Γ‐matrix to fulfil the triangle inequality and if we want indels to serve only to adjust sequence lengths, a unique indel cost urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0019 should be within the range urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0020, where urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0021 is the maximum substitution cost and L is the maximum sequence length.

3.3.3.2 State‐dependent indel costs

Little attention has been paid to state‐dependent indel costs. According to Stovel (2001), more exceptional or rare states should be given a higher cost. As with the resemblance between states, we can determine how exceptional a state is on theoretical grounds, on the basis of the state's attributes, or from the data. As a data‐driven solution, we propose to define the indel cost of state a as a monotonic function (such as a logarithm or square root) of the inverse of the overall observed frequency of state a or, equivalently, of the inverse mean time spent in state a. An alternative could be to use the mean time not spent in a. Such data‐driven solutions for indel costs avoid the criticisms of transition‐rate‐based substitution costs. An alternative to the latter method could be to set each substitution cost as the sum of the indel costs of the two states involved.
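A data-driven indel cost of this kind can be sketched in a few lines. The choice of the square root as default transform and the toy sequences are illustrative assumptions.

```python
import math
from collections import Counter


def state_frequencies(sequences):
    """Overall observed relative frequency of each state."""
    counts = Counter(state for seq in sequences for state in seq)
    total = sum(counts.values())
    return {state: c / total for state, c in counts.items()}


def indel_costs(sequences, transform=math.sqrt):
    """Indel cost of a state: monotonic transform of its inverse frequency."""
    return {s: transform(1.0 / f) for s, f in state_frequencies(sequences).items()}


toy_sequences = ["aaab", "aabb"]          # a occupies 5 of 8 positions, b 3 of 8
sqrt_costs = indel_costs(toy_sequences)                    # square-root variant
log_costs = indel_costs(toy_sequences, transform=math.log)  # logarithm variant
```

In both variants the rarer state b receives the higher indel cost.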

3.4 Variants of optimal matching

Despite the high flexibility of OM with state‐dependent costs, several researchers (Elzinga, 2003; Hollister, 2009; Halpin, 2010; Elzinga and Studer, 2015) have pointed out that OM distances are essentially driven by differences in durations. There are two main reasons for this. First, sequences in social sciences typically comprise a few long spells. Therefore, the LCSs typically include these longest spells or long portions of them (Elzinga and Studer, 2015). Second, OM operations are independently applied on each symbol in the sequence, regardless of the context. OM weights the insertion of state a in sequence aa and in sequence bb equally. However, in the first case, the insertion affects only the time that is spent in the spell in state a, whereas, in the second case, it changes the sequencing (Hollister, 2009; Halpin, 2010). Lesnard (2010) observed that OM does not consider the position (i.e. the age or date) when transformation operations are applied.

The OM variants that are discussed below aim to make edit operations more context sensitive by making them depend either on the position in the sequence where the operation applies or on the surrounding patterns at that position.

3.4.1 Dynamic Hamming distance

State similarities in time‐use analyses—e.g. between sleeping and commuting—can hardly be assumed to remain the same all day, and distinct timings reflect important differences in behaviour. As a result, Lesnard (2010) focused on OM without indels, such as the generalized Hamming distance, and proposed that substitution costs should depend on the position t in the sequence. He operationalized the idea by deriving the substitution cost at t from the cross‐section of the transition rates observed between t−1 and t and between t and t+1.

The dynamic Hamming distance (DHD) shares the strong timing sensitivity of the Hamming distance. Several criticisms can be made. First, the criticism of the validity of the transition‐rate‐based substitution costs applies here also. Second, the number of transition rates to estimate is very high, potentially worsening overparameterization. Furthermore, if the meaning of a state a changes with the time when it occurs, a simpler solution could be to consider state a at time t and state a at time t′ as two distinct states a_t and a_t′.

3.4.2 Localized optimal matching

The OM extension that was proposed by Hollister (2009) aims to make indel costs dependent on the two adjacent states. The motivation is that inserting or deleting a state that is similar to its neighbours would change only the length of the spell in that state, without affecting the sequencing. However, an indel of a state that is different from its neighbours has much more important consequences and should, therefore, be charged a higher cost.

This localized OM is controlled with two user‐defined parameters. The first, e, can be interpreted as a spell expansion cost or time warp penalization. The second parameter, g, penalizes differences with surrounding states measured by the substitution costs. (For indels at one of the sequence ends, the average between the costs of the substitutions with the two surrounding states is replaced by the cost of the substitution with the sole adjacent term.) In her experiments, Hollister (2009) obtained the best results with a small shift penalization e and a g close to 1−2e. As long as parameters e and g fulfil the constraint 1−2e ⩽ g, the method also prevents the OM from using a pair of indels instead of a substitution. Thus, it provides a way to allow for important time warps while preserving the effectiveness of substitution costs.
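The context dependence of the localized indel cost can be sketched as below. The additive form, with e scaled by the maximum substitution cost and g applied to the average substitution cost with the neighbours, is an assumption based on the verbal description; the function name and default parameter values are illustrative.

```python
def localized_indel_cost(state, left, right, sub_cost, max_cost, e=0.1, g=0.8):
    """Indel cost of `state` inserted between `left` and `right`.

    Assumed form: e * max_cost + g * (average substitution cost with neighbours).
    At a sequence end, pass None for the missing neighbour, so that only the
    sole adjacent state is used.
    """
    neighbours = [n for n in (left, right) if n is not None]
    if neighbours:
        context = sum(sub_cost(state, n) for n in neighbours) / len(neighbours)
    else:
        context = 0.0
    return e * max_cost + g * context


def unit_cost(a, b):
    """State-independent substitution cost of 1."""
    return 0.0 if a == b else 1.0


inside_same_spell = localized_indel_cost("a", "a", "a", unit_cost, max_cost=1.0)
inside_other_spell = localized_indel_cost("a", "b", "b", unit_cost, max_cost=1.0)
at_spell_junction = localized_indel_cost("a", "a", "b", unit_cost, max_cost=1.0)
at_sequence_end = localized_indel_cost("a", "b", None, unit_cost, max_cost=1.0)
```

Inserting a into a spell of a costs only the expansion penalty, whereas inserting it into a spell of b, which changes the sequencing, is charged the full context penalty as well.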

By construction, the localized OM should be less sensitive than the classical OM to differences in spell length, while being more sensitive to changes in sequencing. However, the localized OM can generate dissimilarities that do not satisfy the triangle inequality (Halpin, 2014).

3.4.3 Optimal matching sensitive to spell length

The OM sensitive to spell length variant, proposed by Halpin (2010), accounts more explicitly for the spell length, making indel and substitution costs depend on the spell length. Operations inside longer spells cost less than those involving shorter spells. More precisely, the costs are multiplied by a factor of urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0025, where t is the spell length and 0⩽h⩽1 the exponent time weight.

Decreasing the indel cost with the spell length produces the expected effect of favouring indels in longer spells instead of, for instance, indels that would create or suppress spells. However, the decrease in the substitution costs with the lengths of the implied spells has the reverse effect of encouraging the splitting of long spells. These contradicting effects make it difficult to predict the sensitivity of the measure to spell lengths. Moreover, this dissimilarity does not guarantee that the triangle inequality holds (Halpin, 2014).

3.4.4 Optimal matching between sequences of spells

To overcome the limitations of the two previous context‐sensitive dissimilarities, we propose to measure the OM distance between sequences of spells. The general idea is to consider, for each value of t, a spell in state a during t units of time as a distinct element, denoted urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0026, of the alphabet. Doing so considerably increases the size of the alphabet and, as a consequence, the number of indel and substitution costs to be considered. However, the number of parameters can easily be limited if we express the cost urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0027 of the indel of spell urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0028 and the substitution cost urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0029 between spells, urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0030 and urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0031 respectively, in terms of the basic indel and substitution costs (urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0032 and γ(a,b)) of the constituent elements, a and b, and a correction factor function of the spell length. For instance, letting δ⩾0 be a weight factor for the spell length, the costs can be defined as
urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0033(2)
urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0034(3)
The parameter δ is the cost of extending or compressing a sequence by 1 unit of time, and the substitution between two spells urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0035 is the cost of compressing each spell into a 1‐unit‐long spell, plus the substitution between the two states concerned, a and b. For urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0036, inserting an a in an existing spell urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0037 costs less than creating a new spell in a. Therefore, the method favours the expansion (or compression) of existing spells. However, unlike the method of Halpin, it does not encourage breaking long spells. Moreover, as defined by equations 2 and 3, the costs urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0038 and urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0039 satisfy the triangle inequality as long as urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0040 and γ(·) satisfy the inequality. Interestingly, for δ=0, the OM of spell sequences becomes the OM distance between the DSS sequences.

The OM between sequences of spells is, by construction, sensitive to differences in the duration of spells. It is also sensitive to sequencing by considering the DSS sequence and allows some control for the time warp through the expansion–compression penalty factor δ.
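The spell representation and the associated costs can be sketched as follows. The cost functions follow the verbal description given above (a penalty δ per unit of expansion or compression, plus the state-level cost); their exact functional form, in particular the same-state case, is an assumption of this sketch, and all names are illustrative.

```python
from itertools import groupby


def to_spells(seq):
    """Convert a state sequence into a list of (state, duration) spells."""
    return [(state, len(list(run))) for state, run in groupby(seq)]


def spell_indel_cost(spell, state_indel, delta):
    """Assumed form: state indel cost plus delta per unit compressed down to length 1."""
    state, t = spell
    return state_indel(state) + delta * (t - 1)


def spell_sub_cost(spell1, spell2, state_sub, delta):
    """Assumed form: same state pays only for the change in length; different
    states pay for compressing both spells to length 1 plus the state substitution."""
    (a, t1), (b, t2) = spell1, spell2
    if a == b:
        return delta * abs(t1 - t2)
    return delta * (t1 - 1) + delta * (t2 - 1) + state_sub(a, b)


spells = to_spells("aabbb")


def sub2(a, b):
    """State-level substitution cost of 2 between distinct states."""
    return 0.0 if a == b else 2.0


def indel1(a):
    """State-level indel cost of 1."""
    return 1.0
```

With δ=0, spell lengths no longer matter in these functions and the costs reduce to the state-level costs applied to the DSS, in line with the remark above.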

3.4.5 Optimal matching between sequences of transitions

Another way of accounting for the context, as described by Biemann (2011), is to compute the OM distances between sequences of transitions. The transitions in a state sequence are the subsequences of length two obtained by joining each state with its previous state. For example, the transitions in aabb are aa, ab and bb. We could possibly also specify the start of a sequence by using a transition from the start to the first state and, likewise, the end of the sequence by using a transition to the end state.
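The conversion of a state sequence into a sequence of transitions can be sketched as below. The optional start and end markers "^" and "$" are hypothetical symbols chosen for this illustration.

```python
def to_transitions(seq, mark_ends=False):
    """Pair each state with its previous state.

    With mark_ends=True, hypothetical start ("^") and end ("$") markers are
    added so that the first and last states also appear in a transition.
    """
    states = list(seq)
    if mark_ends:
        states = ["^"] + states + ["$"]
    return [prev + cur for prev, cur in zip(states, states[1:])]


plain = to_transitions("aabb")
with_ends = to_transitions("aabb", mark_ends=True)
```

Here `plain` contains the three transitions of aabb, and `with_ends` additionally contains the transition into the first state and out of the last one.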

As noted by Biemann (2011), by considering transitions instead of the states, we increase the size of the alphabet considerably and, hence, the number of indel and substitution costs to be considered. To overcome this limitation, we propose (similarly to the case of sequences of spells) to express the indel urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0041 and substitution urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0042 costs of transitions in terms of the indel and substitution costs of states. Considering that a transition ab comprises an origin state a and a type of transition (e.g. a transition to the same state or a transition to another state), we express the cost of inserting or substituting a transition as a linear combination of respectively the cost of inserting or substituting the origin state and the cost urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0043 of the transition type concerned. Formally, we define the indel and substitution costs as follows:
urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0044(4)
urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0045(5)
with urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0046 the (possibly normalized) indel cost of the origin state a, urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0047 the transition type cost, γ(a,c) the (possibly normalized) substitution cost between the origin states, a and c, and w ∈ [0,1] a coefficient that controls the trade‐off between the cost that is related to the origin state and the cost that is related to the type of transition. A simple parameter‐free solution for the urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0048 function is to set it to 0 when a=b, and 1 otherwise. An alternative, which would make urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0049 state dependent without the need for any additional parameters, is to set urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0050 as the substitution cost γ(a,b) between a and b. Both solutions generate costs urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0051 and urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0052 that satisfy the triangle inequality when the basic costs urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0053 and γ(·) themselves verify the inequality.

The OM of sequences of transitions is, by construction, sensitive to differences in sequencing. With our formulation of the indel and substitution costs of the transitions, we obtain the classical OM for w=1, which shares the properties of OM. Otherwise, by reducing w, we can increase the sensitivity to sequencing. Time warp can be controlled through the origin state indel cost urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0054.

Finally, it is worth mentioning that Dijkstra and Taris (1995) proposed an interesting distance measure that should account for some special aspects such as the number and the order of common states. However, as shown by van Driel and Oosterveld (2001), their algorithm does not produce the expected results.

4 Simulation study

So far, we have reviewed a great number of dissimilarity measures between sequences. Moreover, many dissimilarity measures depend on user‐defined parameter values and, thus, define families of measures. However, in the end, we face the crucial question of choosing between them.

To help in that choice, this section provides empirical insights on how dissimilarity measures behave with regard to the three aspects that are relevant to comparing state sequences that describe life trajectories, namely sequencing, duration and timing. In what follows, we report the main results from a series of simulation strands.

The simulations reported provide an original view of the ability of the dissimilarity measures to render differences in each of the sequencing, duration and timing dimensions. In this regard, our simulations differ from other attempts to compare dissimilarity measures empirically. Several researchers (for a review, see Halpin (2014) and Robette and Bry (2012)) have analysed how results (most often the clusters that are derived from the dissimilarity values) change with each dissimilarity measure used. Such approaches permit us to assess the robustness of the outcome of the dissimilarity‐based analyses against the dissimilarity measure that is used. However, outcome‐oriented simulation analyses do not, stricto sensu, provide indications on the behaviour of the measures, and the generalization of their findings to other data sets and analysis methods, such as clustering algorithms, is subject to debate. The approach by Robette and Bry (2012), which is based on correlations between dissimilarities computed on artificial data, is more illuminating from that point of view. Nevertheless, although the Mantel tests of the correlations that were used by Robette and Bry prove useful in identifying measures that behave similarly, they do not specify what the measures are sensitive to.

4.1 Simulation design

The simulation study consists of different strands, each of which studies the sensitivity of the dissimilarity measures to one of timing, duration or sequencing. Each strand may itself contain a series of simulations run with different specifications of the differences tested.

The general principle of each series of simulations is to generate, in a controlled manner, two groups of sequences that differ in a selected single aspect of interest. In each group, the characteristic evaluated—sequencing, duration or timing—is kept fixed for all sequences in the group, whereas the other aspects are changed randomly across the sequences to allow for non‐systematic differences between sequences on these other non‐evaluated aspects. Doing so, the sequences compared differ systematically in the aspect evaluated, but also differ randomly on all other aspects. Thus, we can evaluate the relative importance that is given by the dissimilarity measures to the selected aspect in the presence of discrepancies on the others. Hence, we can evaluate how well each measure renders differences of the aspect studied.

Let us illustrate with an example. To measure the sensitivity to sequencing, we generate two groups of sequences with a different unique sequencing pattern in each group. Whereas the order of the states remains identical for all sequences inside a group, the timing and time spent in each distinct successive state are changed randomly within the groups. Therefore, a dissimilarity measure that is more sensitive to differences in duration than in sequencing will probably take similar values for pairs of sequences belonging to the same group as it would for pairs with a sequence from each of the two groups. In contrast, a measure that is sensitive to sequencing will typically take higher values for dissimilarities between groups than it would for dissimilarities within groups.

The sensitivity to the criterion considered is measured with the pseudo‐R² that was defined in Studer et al. (2011), which measures the proportion of the discrepancy of the sequences explained by a categorical covariate. In our case, the covariate is the two‐group variable. The discrepancy of the sequences is evaluated from the pairwise dissimilarities in the same way as the variance of a series of values can be derived from the pairwise differences between the observed values. A high R²‐value will reflect strong sensitivity to the considered systematic difference between the two groups. In other words, a high R² means that the measure can discriminate between the groups. In contrast, a low R²‐value indicates that the measure poorly accounts for the tested dimension. To ensure stable R²‐values, we generate 1 million sequences per group in each series of simulations. All simulated sequences are of length 20. Despite the huge number of sequences, all computations could be done relatively quickly by considering only unique sequences and weighting them by their counts (Studer, 2013). Each set of 1 million simulated sequences contained between 80 and 800 unique sequences, except for one set that had around 3000 unique sequences.
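The computation of this measure from a dissimilarity matrix can be sketched as follows. This is an unweighted version, assuming the usual definition of the discrepancy as the sum of pairwise dissimilarities divided by the number of cases; the weighting of unique sequences by their counts that is mentioned above is not implemented.

```python
def sum_of_squares(dist, idx):
    """Discrepancy-based sum of squares: sum over pairs i < j of d_ij, divided by n."""
    n = len(idx)
    pair_sum = sum(dist[i][j] for k, i in enumerate(idx) for j in idx[k + 1:])
    return pair_sum / n


def pseudo_r2(dist, groups):
    """Share of the total discrepancy explained by the grouping variable."""
    all_idx = list(range(len(groups)))
    ss_total = sum_of_squares(dist, all_idx)
    ss_within = sum(
        sum_of_squares(dist, [i for i in all_idx if groups[i] == g])
        for g in set(groups)
    )
    return 1.0 - ss_within / ss_total


groups = [1, 1, 2, 2]
# Dissimilarities that separate the two groups perfectly.
separated = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
# Dissimilarities that ignore the grouping entirely.
uniform = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]

r2_separated = pseudo_r2(separated, groups)
r2_uniform = pseudo_r2(uniform, groups)
```

A measure that only registers the between-group difference reaches the maximum value of 1, whereas a measure that treats all pairs alike obtains a much lower value.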

For each series of simulations, we obtain an R² for each considered dissimilarity measure d. The R²‐values can be compared across dissimilarity measures within each series, where all R² are computed on the same set of sequences. However, they are not comparable between series or strands, since the total variability and the mean R² differ significantly across series. Therefore, we report the standardized value, namely the score. This score reflects the sensitivity of each dissimilarity measure d in comparison with the overall sensitivity of all measures considered. The score is positive for dissimilarity measures that are more sensitive than the average to the tested dimension, and negative otherwise.

4.2 Random‐sequence generation

We ran two sets of simulations, each with a different sequence‐generating model. For the first set, we generated the state sequences directly. For the second set, we postulated assumptions on the occurrences of events and then derived the states from these events.

For clarity, and for brevity, we report only a subset of the series of simulations that we tried (see Studer (2012)). However, the experiments that are reported are representative in that they render all salient findings of the complete set of simulations.

4.2.1 State‐based generating process

The direct generating process is based on the duration‐stamped spell representation of sequences. It first determines the sequence urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0064 of the urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0065 DSS, and then the durations urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0066 of the successive distinct states. The order—the DSS—is chosen randomly from a list of possible sequencings, and the durations are set randomly assuming uniform distributions. For each series, the control of the aspect tested is achieved by means of constraints on the generating process.
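The state-based generating process can be sketched as below. Drawing the spell boundaries uniformly among the possible cut points is one way of obtaining random durations under the fixed-length constraint and is an assumption of this sketch, as are the function names.

```python
import random


def random_durations(n_spells, length, rng):
    """Split `length` time units into n_spells positive durations.

    Assumption: the n_spells - 1 cut points are drawn uniformly without
    replacement among the length - 1 possible positions.
    """
    cuts = sorted(rng.sample(range(1, length), n_spells - 1))
    bounds = [0] + cuts + [length]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def generate_sequence(patterns, length=20, rng=None):
    """Pick an order pattern (DSS) at random, then draw the spell durations."""
    rng = rng or random.Random()
    dss = rng.choice(patterns)
    durations = random_durations(len(dss), length, rng)
    return "".join(state * d for state, d in zip(dss, durations))


rng = random.Random(42)
group1 = [generate_sequence(["abc"], rng=rng) for _ in range(5)]
group2 = [generate_sequence(["cba"], rng=rng) for _ in range(5)]
```

Both groups share the same alphabet and length, and differ systematically only in the order of the states.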

We report three strands of state‐based simulations, the characteristics of which are summarized in Table 2. Each strand comprises several series of simulations. The first strand evaluates the sensitivity to sequencing by completely controlling the order in each of the two groups in each series.

Table 2. Designs for evaluating sensitivity to order, timing and duration of states

| Tested dimension | Description | Group 1 | Group 2 |
|---|---|---|---|
| Sequencing | Order patterns controlled in each group, and duration in each consecutive state left random under the constraint of the fixed sequence length | abc | cba |
| | | abca | acba |
| | | abcda | adcba |
| | | abca | abda |
| | | abab | baba |
| | | abc | abd |
| | | abc | acb |
| | | abcd | cdab |
| Timing | Sequences randomly follow one of the patterns ‘abcde’ or ‘edcba’, and the time point t that the spell in state ‘c’ must cover is controlled | t=7 | t ∈ {9…15} |
| | | t=15 | t ∈ {7…13} |
| Duration | Sequences randomly follow one of the patterns ‘abc’ or ‘cba’ and duration d of the spell in state ‘b’ is controlled | d=4 | d ∈ {6…14} |
| | | d=14 | d ∈ {4…12} |

The second strand evaluates the sensitivity to timing. The order patterns are selected randomly and the durations are set randomly, while controlling the start of the spell in state c in each of the two groups. Several series of simulations are run with varying differences in the time point where we impose to be in state c between the two groups. Finally, the third strand evaluates the sensitivity to duration by controlling the total consecutive amount of time spent in a given state for each group.

We ran an additional strand of simulations to evaluate the sensitivity of the measures to small perturbations (Table 3). The same order is retained for the two groups but, in group 2, the sequences are perturbed by randomly changing the state of an element in the sequence, either for any element or for one element among those at the junction of two successive spells.

Table 3. Designs for evaluating sensitivity to a random change of state

| Description | Order pattern | State change in group 2 |
|---|---|---|
| Controlled order pattern and random durations: sequences in the second group derived from the sequences of the first group by randomly changing one of their elements | abc | Anywhere |
| | ‘abc’ or ‘cba’ | Anywhere |
| | abc | Start or end of spells |
| | ‘abc’ or ‘cba’ | Start or end of spells |

4.2.2 Event‐based generating process

The aim of this second group of simulations is to evaluate the sensitivity of the measures to the underlying events that provoke the change in states.

We consider the occurrences of successive events and define the consequent new state after each event. In our simulations, we consider three events. Each sequence is then characterized by when each of the events urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0067, urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0068 and urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0069 occurs. The sequences are simulated by generating the times of occurrence with an independent uniform distribution over the period of observation for each of the three events.

Three strands of event‐based simulations are considered. The first evaluates the sensitivity to the order in which events occur. Here, we impose the restriction that the first event occurs before the second event in group 1, and after the second event in group 2. The second strand evaluates the sensitivity to the timing of the events by controlling the occurrence of event 1 in each group. To evaluate the sensitivity to the duration between two events, in the last strand, we control the elapsed time between the first two events. The time of occurrence of the third event is left random in all cases. Table 4 summarizes the simulations.

Table 4. Simulations evaluating the sensitivity to the order and timing of events, and to duration between events
Simulation Description Group 1 Group 2
Order Random‐occurrence times urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0070 urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0071
Timing Date urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0072 of event 1 is fixed urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0073 urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0074
urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0075 urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0076
Duration Fixed duration between events urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0077 urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0078
urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0079 urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0080
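
The event‐based generating process for the ‘order’ strand can be sketched as follows. This is only an illustration under stated assumptions, not the code that was used for the simulations: time is taken as discrete with 20 positions, a state is represented by the set of events experienced so far, and the order restriction is imposed by rejection sampling. The function name and its arguments are hypothetical.

```python
import random


def simulate_event_sequence(length=20, order=None, rng=random):
    """Draw three event times uniformly and build the resulting state sequence.

    order: None (no restriction), "1<2" (event 1 before event 2)
           or "2<1" (event 1 after event 2). Assumed encoding, for illustration.
    """
    while True:
        # Independent uniform occurrence times over the observation period.
        times = [rng.randrange(length) for _ in range(3)]
        if order == "1<2" and not times[0] < times[1]:
            continue  # reject draws that break the imposed order
        if order == "2<1" and not times[0] > times[1]:
            continue
        break
    # The state at position t is the set of events that have occurred by t.
    sequence = [
        frozenset(k + 1 for k, tk in enumerate(times) if tk <= t)
        for t in range(length)
    ]
    return times, sequence


times, sequence = simulate_event_sequence(order="1<2", rng=random.Random(1))
```

The timing and duration strands would replace the order check by a fixed value for the time of event 1, or for the gap between the first two events.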

When comparing event‐based sequences, we can, for state‐dependent measures, define the state dissimilarities—substitution costs—by using the number of unshared underlying events. For example, the substitution cost between state ‘has experienced event urn:x-wiley:09641998:media:rssa12125:rssa12125-math-0081 only’ and state ‘has experienced all three events’ is 2, since two events distinguish these states. We use this principle to test the behaviour of measures parameterized with features‐based costs.
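
As a minimal sketch of this principle, assuming that each state is encoded as the set of events experienced so far (the event labels 1, 2 and 3 and the function name are hypothetical), the cost is the size of the symmetric difference between the two sets:

```python
def event_count_cost(state_a, state_b):
    """Number of events experienced in exactly one of the two states."""
    return len(set(state_a) ^ set(state_b))


one_event_only = {1}       # has experienced a single event (label assumed)
all_three = {1, 2, 3}      # has experienced all three events
cost = event_count_cost(one_event_only, all_three)  # two events differ
```

With three events, the costs obtained in this way range from 0 (identical states) to 3 (no event versus all three events).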

4.3 Analysed dissimilarity measures

Most of the dissimilarity measures that are described in Table 1 have been included in the simulation study. For distances that can be parameterized, we consider a selection of parameter configurations to explore the effect of the parameters and the range of behaviour that can be covered. The complete list of dissimilarity measures and parameter configurations studied in the simulations is given in Table 5. The meanings of the parameters that are shown in Table 5 and in the following figures are specified in Table 6.

Table 5. Distances included in the simulation study
Distance Configurations
Distribution based EUCLID(K=1) (Euclidean), CHI2(K=1,2,4,5,10,20) ($\chi^2$‐distance between distributions within K periods), CHI2fut (metric based on distributions of subsequent states)
Hamming HAM (simple and generalized Hamming)
DHD (dynamic Hamming)
OM OM, OM(i=1.5), OM(trate), OM(indelslog), OM(indels), OM(future)
Localized OM (OMloc) OMloc(e=0,0.1,0.25,0.4)
Spell‐length‐sensitive OM (OMslen) OMslen(h=1,i=1,1.5,5), OMslen(i=1,1.5,5)
OM of spell sequences (OMspell) OMspell(e=0,0.1,0.5,1), OMspell(e=0,0.1,0.5,1,i=2)
OM of transition sequences (OMstran) OMstran(w=0,0.1,0.5), OMstran(i=1.5,w=0.1,0.5), OMstran(i=5, tm=sm, w=0.1,0.5), OMstran(tm=raw)
Number of matching subsequences (NMS) NMS
Subsequence vectorial representation (SVRspell) SVRspell(b=0,1,2,3), SVRspell(b=0,1,2,3,a=1)

Table 6. Meaning of parameters
Label Description
K Number of intervals used to compute $\chi^2$‐ and Euclidean distances
i indel cost (single cost of 1 when not specified)
sm† Substitution cost (single cost of 2 when not specified): trate (derived from transition rates), indelslog (derived from log‐state‐frequency‐based indel costs), indels (derived from inverse state‐frequency‐based indel costs), future (common future), ec (based on count of non‐shared experienced events)
e Spell expansion cost (for OMspell and OMloc)
w Weight of origin state versus transition‐type trade‐off
ti Transition indel costs (single cost of 1 when not specified): sm (based on substitution costs), raw (Biemann's method)
a Subsequence length weight exponent (0 when not specified)
b, h Spell duration weight exponent for SVRspell and OMslen respectively (when not specified, b=1 and h=0.5)
†For brevity ‘sm=’ will be omitted and therefore OM arguments without the ‘=’ sign should be interpreted as values of the sm argument.

4.4 Results

Detailed results for each series of simulations are provided as an on‐line appendix. Here, we summarize the outcome of the study by opposing the mean scores that are achieved by each measure for the simulations of the ‘sequencing’ strand to the mean scores that are obtained for the temporality—‘timing’ and ‘duration’—strands on one side, and the duration scores to the timing scores on the other. (The mean temporality scores are computed as the average between the mean timing and mean duration scores, and the mean scores that are reported have been standardized.) The duration and timing axes roughly correspond to the first two robust principal components (Todorov and Filzmoser, 2009) that were found in Studer (2012). However, unlike principal components, the axes here are defined independently from the data. They also have a clearer interpretation. The first axis is defined as the temporality score minus the mean sequencing score. Therefore, it is oriented such that higher sensitivity to sequencing appears on the left and higher sensitivity to temporality dimensions appears on the right. The second axis is defined with higher sensitivity to durations at the bottom and higher sensitivity to timing at the top.

Results from the state‐based group of simulations are displayed graphically in Fig. 1 and the results for the event‐based group are shown in Fig. 2. Fig. 3 reports the results for sensitivity to a random change of one token in the sequence. In each figure, the position of the measures should be interpreted relatively to the others and does not reflect an absolute level of sensitivity.

Fig. 1. Scores for state‐based simulations
Fig. 2. Scores for event‐based simulations
Fig. 3. Sensitivity to a random change of state versus sensitivity to sequencing

In order not to overload Figs 1-3 with too many points, the results for each family of measures are represented by the smallest polytope covering the scores that were obtained for the various parameterizations tested. The labels of inner points have been omitted and only those of configurations that are associated with the vertices of the polytope are displayed. A large polytope area, such as that for OMstran (OM of transitions) in Fig. 1, indicates that the measure allows for very different sensitivities through its parameterization.
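
In two dimensions, the smallest polytope covering a set of points is their convex hull. The sketch below uses Andrew's monotone chain algorithm to find the hull vertices, i.e. the configurations whose labels would be displayed. It is an illustration only; the scores are made‐up values, not results from the study.

```python
def convex_hull(points):
    """Return the vertices of the convex hull of 2-D points, counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b turns counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        chain = []
        for p in sequence:
            # Drop points that would create a clockwise or straight turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower = half_hull(pts)
    upper = half_hull(reversed(pts))
    return lower[:-1] + upper[:-1]


# Hypothetical (axis 1, axis 2) scores for five configurations of one family.
scores = [(-1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, -1.0), (0.0, 0.3)]
vertices = convex_hull(scores)  # the inner point (0.0, 0.3) is left out
```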

We can observe that the measures are distributed within a triangle in Figs 1 and 2. This (unsurprisingly) reflects a higher contrast between duration and timing sensitivities between measures that are sensitive to temporal aspects—on the right—than among measures that are primarily sensitive to the sequencing—on the left. A noticeable general outcome in Fig. 2 is that considering explicit information on the state proximities can significantly affect the behaviour of the measure (e.g. HAM(ec) lies far from HAM).

4.4.1 Results by distance families

Here, we examine each considered family of dissimilarity measures in more detail. Figs 4(a)–12(a) and Figs 4(b)–12(b) respectively give the position that each family occupies in Figs 1 and 2.

Fig. 4. Distribution‐based distances: (a) state; (b) event
Fig. 5. Hamming distance: (a) state; (b) event
Fig. 6. OM: (a) state; (b) event
Fig. 7. Localized OM: (a) state; (b) event
Fig. 8. Spell‐length‐sensitive OM: (a) state; (b) event
Fig. 9. OM of spells: (a) state; (b) event
Fig. 10. OM of transitions: (a) state; (b) event
Fig. 11. NMS: (a) state; (b) event
Fig. 12. SVRspell: (a) state; (b) event

4.4.1.1 Distribution‐based distances

Unsurprisingly, dissimilarity measures based on differences between distributions over the whole period (K=1) appear to be the most sensitive to durations (Fig. 4). They are also the least sensitive to differences in sequencing, with $R^2$ values close to 0. The $\chi^2$‐distance CHI2(K=1) is, among all distances considered, the most sensitive to duration differences for rare states, whereas the Euclidean distance EUCLID(K=1) is the most sensitive to differences for states with high durations. When K increases, the sensitivity of the $\chi^2$‐measure shifts from duration to timing. For K equal to the sequence length (here, 20), CHI2 receives scores that are similar to those of the Hamming family regarding timing but maintains some sensitivity to differences in durations. The detailed results in the on‐line appendix show that the positionwise $\chi^2$‐distance ranks better as a time‐sensitive measure for small time changes than for large differences in timing. Here, CHI2fut, itself a positionwise measure, is closer to the positionwise CHI2 than are CHI2 versions with a smaller K.
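
The shift from duration to timing as K increases can be illustrated with a simplified chi‐square distance between state distributions computed within K periods. The sketch rests on assumptions that are ours: the sequence length is divisible by K, and the same reference proportions (the hypothetical `marginal` argument) weight every period. It is not the exact definition that was used for CHI2 in the simulations.

```python
from math import sqrt


def chi2_distance(x, y, marginal, k=1):
    """Chi-square distance between the state distributions of x and y,
    computed within k consecutive periods of equal width."""
    if len(x) != len(y) or len(x) % k != 0:
        raise ValueError("sequences must have equal length divisible by k")
    width = len(x) // k
    total = 0.0
    for start in range(0, len(x), width):
        px, py = x[start:start + width], y[start:start + width]
        for state, weight in marginal.items():
            # Difference in the proportion of time spent in this state.
            diff = (px.count(state) - py.count(state)) / width
            total += diff * diff / weight
    return sqrt(total)


x, y = "AAAABBBB", "BBBBAAAA"
marginal = {"A": 0.5, "B": 0.5}   # assumed reference proportions
d_whole = chi2_distance(x, y, marginal, k=1)   # same durations overall
d_halves = chi2_distance(x, y, marginal, k=2)  # timing now matters
```

Both sequences spend the same time in each state, so the distance is zero for K=1, whereas splitting the period into two halves reveals that the states are occupied at different times.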

4.4.1.2 Hamming

All variants of the Hamming distance lie in the top right‐hand quadrant, meaning that they are specifically sensitive to timing differences (Fig. 5). They are slightly less insensitive to differences in sequencing than overall distribution‐based distances. This is because sequencing is partly determined from the start and end states, especially when, as in our generated sequences, the number of transitions remains low. The neutral position of HAM and DHD on the vertical timing–duration scale for the event‐based sequences is a consequence of the much higher timing sensitivity that is reached by HAM(ec) by using event‐based substitution costs. The HAM‐ and DHD‐scores are low relative to the HAM(ec)‐score. From the simulations, the time varying costs of the DHD‐metric seem to relax the strong time sensitivity of pure Hamming distance somewhat. As with the positionwise $\chi^2$‐distances, the Hamming distances rank better as time‐sensitive measures for small time changes than for large time differences.

4.4.1.3 Optimal matching

The family of OM distances lies on the right of the plot, which confirms the low sensitivity to differences in sequencing, as pointed out, for instance, by Elzinga (2003), Hollister (2009) and Halpin (2010) (Fig. 6). As expected, we also observe that OM with high indel costs—relative to substitution costs—is more sensitive to timing differences (remember that HAM is OM with an arbitrarily high indel cost). Further, lowering the indel cost increases the sensitivity to duration and seems to reduce the insensitivity to sequencing at the same time. As expected, the scores for OM with data‐driven substitution costs remain very close to those with a single state‐independent cost of 2, the variation in position being essentially determined by the ratio between indel and substitution costs. For example, the data‐driven costs of OM(future)—the unlabelled point inside the OM area slightly below the vertical splitting line—are low in comparison with those used in other OM versions, which, for the same indel value, increase the indel/substitution cost ratio. This explains why OM(future) lies higher in the plot. As also expected, deriving substitution and indel costs from the state frequencies renders OM more sensitive to changes in the duration of rare events (see Fig. 3). Using the logarithm of inverse frequencies seems a better choice than raw inverse frequencies that make OM too sensitive to rare events and small perturbations. Costs that are derived from the count of non‐shared lived events, ec, reduce the sensitivity to duration and ensure that more importance is placed on timing differences. Interestingly, in all our simulations, OM(ec) and OM(ec, i=1.5) receive the same scores as HAM(ec). The reason for this is that the substitution costs are so low globally that indels costing 1 or more are never used.
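
The role of the indel cost can be illustrated with a bare‐bones OM implementation using a single substitution cost. This is a sketch for illustration (standard edit‐distance dynamic programming), not the implementation that was used in the study; the example sequences are made up.

```python
def hamming(x, y, sub_cost=2.0):
    """Positionwise distance: one substitution per mismatching position."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(sub_cost for a, b in zip(x, y) if a != b)


def optimal_matching(x, y, indel=1.0, sub_cost=2.0):
    """Minimal total cost of substitutions and indels turning x into y."""
    prev = [j * indel for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        curr = [i * indel]
        for j, b in enumerate(y, start=1):
            substitute = prev[j - 1] + (0.0 if a == b else sub_cost)
            delete = prev[j] + indel
            insert = curr[j - 1] + indel
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


x, y = "ABABABAB", "BABABABA"   # same pattern shifted by one position
om_cheap_indel = optimal_matching(x, y, indel=1.0)     # one deletion + one insertion
om_costly_indel = optimal_matching(x, y, indel=100.0)  # indels never pay off
ham = hamming(x, y)
```

With cheap indels, the one‐position shift is absorbed by two indels and the timing difference is almost ignored. With a prohibitive indel cost, OM falls back on positionwise substitutions and equals the Hamming distance.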

4.4.1.4 Localized optimal matching

The distances of the localized OM family are, with a few exceptions in the case of costs based on the count of non‐shared events, in the lower portion of the plots (Fig. 7). The horizontal position is determined by the expansion cost e. In this case, lower values of e mean that the position is further to the left, with the measure becoming highly insensitive to temporality as e approaches 0. For some simulation strands, the part of $R^2$ for timing differences that is accounted for by the measure becomes negative. This means that, with OMloc, we can obtain a total within‐group discrepancy that is greater than the overall ‘OMloc’‐discrepancy. This is a consequence of the violation of the triangle inequality and makes OMloc especially unsuited to distinguishing between groups of sequences with different timings.
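
A toy example shows how a within‐group discrepancy that exceeds the overall discrepancy can arise when the triangle inequality is violated. The sketch assumes that the discrepancy of a set of n objects is the sum of their pairwise dissimilarities divided by n, and that the share explained by the groups is 1 minus the ratio of the within‐group to the total discrepancy. The dissimilarity matrix is made up and the function names are hypothetical.

```python
from itertools import combinations


def discrepancy(dist, members):
    """Sum of pairwise dissimilarities among members, divided by their number."""
    return sum(dist[i][j] for i, j in combinations(members, 2)) / len(members)


def explained_share(dist, groups):
    """Share of the total discrepancy accounted for by the grouping."""
    everyone = [i for group in groups for i in group]
    within = sum(discrepancy(dist, group) for group in groups)
    return 1.0 - within / discrepancy(dist, everyone)


def violates_triangle(dist):
    """True if some d(i, j) exceeds d(i, k) + d(k, j)."""
    n = range(len(dist))
    return any(dist[i][j] > dist[i][k] + dist[k][j] for i in n for j in n for k in n)


# Members of the same group are far apart (10) although each of them is
# close (1) to every member of the other group.
dist = [
    [0, 10, 1, 1],
    [10, 0, 1, 1],
    [1, 1, 0, 10],
    [1, 1, 10, 0],
]
groups = [[0, 1], [2, 3]]
share = explained_share(dist, groups)   # 1 - 10/6, i.e. negative
broken = violates_triangle(dist)        # True: 10 > 1 + 1
```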

4.4.1.5 Spell‐length‐sensitive optimal matching

As expected by Halpin (2010), OMslen appears to be less sensitive to differences in durations than classic OM (Fig. 8). However, here again, for several simulation strands (related to timing, duration and random perturbation) we obtained negative $R^2$ values when h=1.

4.4.1.6 Optimal matching of spells

The family of OMspell distances is around a line going from right to left from OM (i.e. classic OM with a single substitution cost) to OM of the DSS sequences, the latter corresponding to OMspell (e = 0) (Fig. 9). A high expansion cost e makes the measure more sensitive to temporality, whereas low values of e give more importance to sequencing. The sensitivity to temporality is attributable primarily to duration rather than to timing.
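
The two representations that bound this family can be sketched as follows: the spell representation keeps each state together with the length of the spell, whereas the DSS sequence (taken here to be the sequence of distinct successive states) drops the durations altogether. The function names are hypothetical.

```python
from itertools import groupby


def to_spells(sequence):
    """Collapse a state sequence into (state, spell length) pairs."""
    return [(state, len(list(run))) for state, run in groupby(sequence)]


def to_dss(sequence):
    """Keep the succession of states only, ignoring spell lengths."""
    return [state for state, _ in groupby(sequence)]


spells = to_spells("AAABBC")   # [('A', 3), ('B', 2), ('C', 1)]
dss = to_dss("AAABBC")         # ['A', 'B', 'C']
```

Two sequences with the same DSS differ only in their spell lengths, which is the information that the expansion cost e brings back into the comparison.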

4.4.1.7 Optimal matching of transitions

The family of OM distances between sequences of transitions covers the largest range of sensitivity combinations (Fig. 10). Sensitivity to temporality increases with the value of the origin–transition trade‐off parameter w. Recall that, for w=1, OMstran is equivalent to classic OM. Lowering the value of w significantly increases the sensitivity to sequencing. As with classic OM, the vertical position is driven mainly by the indel/substitution cost ratio. However, we can observe that the effect of the ratio becomes smaller as w decreases; in other words, when more importance is given to the transition type than to the origin states.

4.4.1.8 NMS

The NMS‐distance occupies a neutral position near the centre of the plots (Fig. 11). Such neutral positions result in other distances exhibiting balanced positive sensitivity to sequencing, temporality and duration. However, this is not so for NMS, which appears to be virtually insensitive to each of them. For instance, in Fig. 3, we observe that NMS is the least sensitive distance to ordering. Counterintuitively, the measure receives its best scores for sensitivity to timing. This strange behaviour is a consequence of the extremely low proportion of subsequences that match among the huge number of subsequences in each sequence. The NMS‐measure exhibits the expected sensitivity to sequencing when applied to the DSS sequences, which corresponds to SVRspell (b=0).

4.4.1.9 SVRspell

The family of SVRspell‐distances, also based on matching subsequences, does not suffer from the NMS lack of sensitivity to our three relevant aspects (Fig. 12). The position of the measure is essentially linked to the spell duration exponent weight, b. For b=0 and a=0, the SVRspell distance becomes the NMS‐distance between the DSS sequences. This is the configuration that is the most sensitive to sequencing. Increasing b makes the measure more sensitive to temporality. Overall, and contrary to what we expected, SVRspell lies in the top half of the state‐based simulation plot and, therefore, looks more sensitive to timing differences than to differences in durations. However, this behaviour is not confirmed by the event‐based simulations. The effect of the a‐parameter that determines the weight that is given to the length of the matching subsequences remains unclear, but limited. From Fig. 3, we observe that the SVRspell‐measures are, after the $\chi^2$‐measures, the most sensitive to small random perturbations.

4.4.2 Small random perturbations

The sensitivity to small random perturbations is shown in Fig. 3, where the scores for a random change of one token in each sequence are plotted against sequencing scores. We observe that the basic Euclidean and $\chi^2$‐distances are, regardless of the breakdown of the covered time interval into periods, quite insensitive to differences in ordering. The $\chi^2$‐distance appears to be, in our simulations, much more sensitive to small perturbations than the basic Euclidean distance is. If we exclude the $\chi^2$‐distance, we observe that parameterizations that make measures more sensitive to ordering at the same time render them more sensitive to small perturbations. The linear correlation between the ordering scores and the scores for random perturbation of all except $\chi^2$‐measures is 0.8.

5 Choosing the right dissimilarity measure

The aim of this section is to provide guidelines on choosing from among the many different possibilities of measuring dissimilarity between sequences. The choice is difficult because it typically is multicriterial. For instance, in the retained social science framework, we expect the measure to reflect differences in timing, duration and sequencing. From our theoretical knowledge and empirical evidence on the behaviour of the various dissimilarity measures, there is no measure that dominates all others in all three dimensions of interest.

However, some distance measures are not recommended, on the basis of our study. These are the NMS‐distance, OMloc (the localized OM) and OMslen (the spell‐length‐sensitive OM). NMS lacks sensitivity to all three aspects, OMloc has strange behaviour resulting from the violation of the triangle inequality and OMslen has counterintuitive and, hence, unexpected behaviour. Although these three measures have interesting characteristics, there are alternatives (SVRspell for NMS, OMstran for OMloc and OMspell for OMslen) that share similar aims without suffering from their drawbacks. Further, remember that we discarded the so‐called ‘optimization cost’ method that was proposed by Gauthier et al. (2009) because of serious mathematical problems that could lead to negative dissimilarities.

Deriving substitution costs from transition rates—OM(trate)—adds complications and could possibly generate violations of the triangle inequality. Moreover, as shown by the simulations, OM(trate) provides results that are very close to those of OM with a single state‐independent substitution cost. The same holds for DHD, which produces results that are similar to the simple Hamming distance HAM. In contrast, when states are structurally organized, as in our event‐based simulations, taking this information on the relationships between states into account—the ec‐cases—can drastically change the outcomes. Together with the questionable justification linking substitution costs to transition rates, these remarks advocate against using such transition‐rate‐based methods.

Now, the choice between the remaining solutions will depend on what the researcher is interested in. For instance, when studying the destandardization of family life, the focus may be on changes in the order of successive family life events, in changes in the age at which people experience events such as marriage or the birth of the first child, or changes in durations, such as the time to the birth of the first child after the first union.

If the focus is on changes in sequencing, measures that are highly sensitive to sequencing should be preferred. Good choices are OMstran—OM of transitions—with low weight (i.e. a low w‐value) on the state of origin, OMspell—OM of spells—with low expansion cost e and SVRspell—the subsequence vectorial representation metric—with low b spell length weight. One of the differences between these three measures is the sensitivity to small perturbations. If we are interested in these small differences, such as small spells of unemployment, SVRspell should be used. In contrast, OMstran appears to be less sensitive to this aspect, whereas OMspell shows an intermediary position. Classic OM is definitely not suited to measuring differences in sequencing.

If we are interested in explaining changes in timing, then we need measures that are sensitive to timing. Positionwise measures, such as those of the Hamming family, are the most time sensitive. Using the CHI2‐ and EUCLID‐distances with the number of periods K equal to the sequence length is also a solution. This K‐parameter offers the advantage of allowing a smooth relaxation of exact timing alignment. This can also be achieved by expressing the Hamming distance as an OM distance with a high indel value, then progressively lowering the indel value. CHI2 is especially interesting when we want to stress the importance of changes involving rare states.

With regard to duration, the CHI2‐ and EUCLID‐distances with K set as 1 are recommended when the interest is primarily in the distribution over the entire period. If the importance of spell lengths needs to be stressed, then OMspell with a high expansion cost would be better. Indeed, LCS and the classic OM distance should also reflect dissimilarities in spell durations reasonably well. Distances of the Hamming, SVRspell‐ and OMstran‐families are less suited to focusing on differences in spell durations.

Whereas the choice of a dissimilarity measure is relatively easy when the focus remains limited to a single dimension, the choice becomes more difficult when we want to consider differences in two or three dimensions simultaneously. Measures such as OMstran, OMspell and SVRspell, which can cover a large mix of sensitivities, look interesting in this multifocus context, as they allow control of the trade‐off between the various dimensions.

In many applications, we may be interested in specific differences that are attributable to each of the timing, duration and sequencing aspects, rather than needing to consider them simultaneously. It could then be useful to use three different dissimilarity measures: one sensitive to timing, one to duration and one to sequencing. This would allow us to identify distinctions stemming from each aspect by comparing the analysis outcomes that are obtained from each measure. For example, when studying the long‐term consequences of professional integration trajectories, we would probably look at differences between the trajectories of those who reach stable professional situations and those who stay more vulnerable. Finding greater differences with sequencing‐sensitive measures than with timing or duration‐sensitive measures would indicate that the effect of the unemployment policy depends more on the order in the trajectory than on temporality. Similarly, when studying differences in family formation trajectories across birth cohorts, we should be able to determine whether differences are due primarily to changes in sequencing. This may reflect, for instance, the emergence of new stages, such as ‘cohabiting couples’. We could also determine whether changes are due to timing differences, perhaps resulting from the postponement of events such as marriage or childbirth, or due to differences in spell durations, such as duration of marriage or the delay between marriage and the birth of the first child.

Running cluster analyses with different dissimilarity measures should also allow us to determine whether the trajectories are primarily structured by timing, duration or sequencing differences. To achieve this, we compare cluster quality measures such as the average silhouette width of the different partitions that are obtained. In a discrepancy analysis, comparing outcomes that are obtained with different measures may also help to identify which covariates best explain sequencing differences, and which best explain timing and duration differences.
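
The comparison of cluster quality across dissimilarity measures can be sketched with a plain implementation of the average silhouette width computed from a dissimilarity matrix and a partition. The two matrices below are made‐up stand‐ins for the dissimilarities that two different measures would produce on the same four sequences; the sketch assumes at least two clusters.

```python
def average_silhouette_width(dist, labels):
    """Mean silhouette value over all objects (singletons count as 0)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # conventional silhouette of 0 for a singleton
        # a: mean dissimilarity to the other members of the own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of another cluster.
        b = min(
            sum(dist[i][j] for j in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


partition = [0, 0, 1, 1]
dist_measure_1 = [[0, 1, 10, 10], [1, 0, 10, 10], [10, 10, 0, 1], [10, 10, 1, 0]]
dist_measure_2 = [[0, 4, 5, 5], [4, 0, 5, 5], [5, 5, 0, 4], [5, 5, 4, 0]]
asw_1 = average_silhouette_width(dist_measure_1, partition)  # clear structure
asw_2 = average_silhouette_width(dist_measure_2, partition)  # weak structure
```

In practice each measure yields its own partition, and the measure that gives the highest average silhouette width points to the aspect that structures the trajectories most clearly.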

However, it is worth recalling that the different dimensions are not completely independent from each other. Therefore, we may observe only minor differences between outcomes that are obtained with different measures. Nevertheless, the use of multiple measures can provide interesting information about borderline cases. For instance, we could learn that a given trajectory looks more like a type A in terms of sequencing, and more like a type B from a timing point of view.

6 Conclusion

Dissimilarity‐based sequence analysis has gained much popularity in life course studies in recent years. Although OM remains the most used dissimilarity measure, many other ways of measuring dissimilarity exist. Thus, the researcher faces the difficult task of choosing a suitable measure for her or his research objectives. Our structured and critical review of current measures, together with our simulation study of the behaviour of many of these variants, is intended to help to make this choice.

This review is original in several respects. First, it is specifically oriented towards the ability of the measures to render timing, duration and sequencing differences, which are important in life course studies. Second, it pays attention to the often overlooked mathematical properties of the dissimilarity measures, showing, for instance, that measures that can potentially violate the triangle inequality may exhibit unexpected behaviour. Third, the review covers a unique list of measures.

In this study, we also proposed new distance measures and original strategies to set the costs in OM. OM between sequences of spells has proven to be valuable, notably when considering duration and sequencing. Our reformulation of the OM of the sequences of transitions that was introduced by Biemann (2011) also gave good results in the simulations, covering a wide range of sensitivities depending on the parameters. The strategies that we proposed to set the costs in OM include data‐driven indel costs based on state frequencies, state property‐based costs by using the Gower distance and common future‐based substitution costs.

A few aspects that were not addressed in this study deserve mention here. For example, we did not study the ability of the dissimilarity measures to cope with sequences of unequal length, or to cope with missing elements in the sequences. Further, we did not consider normalized distances. A finer comprehension of how dissimilarity measures behave in these situations is still needed, and we plan to run such a study using a simulation design that is similar to that described here.

To conclude, note that the entire set of dissimilarity measures that were studied with simulated sequences was implemented in the TraMineR R package.

Acknowledgements

This publication results from research work that was conducted within the framework of the Swiss National Centre of Competence in Research LIVES Overcoming Vulnerability: Life Course Perspectives (IP14), which is financed by the Swiss National Science Foundation. The authors are grateful to the Swiss National Science Foundation for its financial support. The authors also thank the referees for their constructive comments.

      • Stable cohabitational unions increase quality of life: Retrospective analysis of partnership histories also reveals gender differences, Demographic Research, 10.4054/DemRes.2019.40.24, 40, (657-692), (2019).
      • OUP accepted manuscript, Socio-Economic Review, 10.1093/ser/mwz004, (2019).
      • Analysis of High Temporal Resolution Land Use/Land Cover Trajectories, Land, 10.3390/land8020030, 8, 2, (30), (2019).
      • The Heterogeneity of Partnership Trajectories to Childlessness in Germany, European Journal of Population, 10.1007/s10680-019-09519-y, (2019).
      • Diverging Patterns in Women’s Reconciliation Behavior across Family Policies and Educational Groups, Social Politics: International Studies in Gender, State & Society, 10.1093/sp/jxy043, (2019).
      • Initial employment pathways of immigrants in Germany. Why legal contexts of reception matter – an analysis of life-course data, Transfer: European Review of Labour and Research, 10.1177/1024258918818069, (102425891881806), (2019).
      • Visualizing and exploring event databases: a methodology to benefit from process analytics, Operational Research, 10.1007/s12351-018-00447-z, (2019).
      • Normalization of Distance and Similarity in Sequence Analysis, Sociological Methods & Research, 10.1177/0049124119867849, (004912411986784), (2019).
      • Union dissolution and housing trajectories in Britain, Demographic Research, 10.4054/DemRes.2019.41.7, 41, (161-196), (2019).
      • Categorical state sequence analysis and regression tree to identify determinants of care trajectory in chronic disease: Example of end-stage renal disease, Statistical Methods in Medical Research, 10.1177/0962280218774811, 28, 6, (1731-1740), (2018).
      • Use of state sequence analysis for care pathway analysis: The example of multiple sclerosis, Statistical Methods in Medical Research, 10.1177/0962280218772068, 28, 6, (1651-1663), (2018).
      • Union Histories of Dissolution: What Can They Say About Childlessness?, European Journal of Population, 10.1007/s10680-018-9464-6, 35, 1, (101-131), (2018).
      • Holistic Analysis of the Life Course: Methodological Challenges and New Perspectives, Advances in Life Course Research, 10.1016/j.alcr.2018.10.004, (2018).
      • undefined, 2018 2nd European Conference on Electrical Engineering and Computer Science (EECS), 10.1109/EECS.2018.00057, (266-273), (2018).
      • Lone Mothers in Belgium: Labor Force Attachment and Risk Factors, Lone Parenthood in the Life Course, 10.1007/978-3-319-63295-7_12, (257-282), (2018).
      • Extended working lives and late-career destabilisation: A longitudinal study of Finnish register data, Advances in Life Course Research, 10.1016/j.alcr.2018.01.007, 35, (114-125), (2018).
      • Understanding trends in family formation trajectories: An application of Competing Trajectories Analysis (CTA), Advances in Life Course Research, 10.1016/j.alcr.2018.02.003, 36, (1-12), (2018).
      • A New Urban Typology Model Adapting Data Mining Analytics to Examine Dominant Trajectories of Neighborhood Change: A Case of Metro Detroit, Annals of the American Association of Geographers, 10.1080/24694452.2018.1433016, 108, 5, (1313-1337), (2018).
      • Typologies of dyadic mother-infant emotion regulation following immunization, Infant Behavior and Development, 10.1016/j.infbeh.2018.09.007, 53, (5-17), (2018).
      • Unpacking Configurational Dynamics: Sequence Analysis and Qualitative Comparative Analysis as a Mixed-Method Design, Sequence Analysis and Related Approaches, 10.1007/978-3-319-95420-2_10, (167-184), (2018).
      • Multiphase Sequence Analysis, Sequence Analysis and Related Approaches, 10.1007/978-3-319-95420-2_9, (149-166), (2018).
      • Sequence History Analysis (SHA): Estimating the Effect of Past Trajectories on an Upcoming Event, Sequence Analysis and Related Approaches, 10.1007/978-3-319-95420-2_6, (83-100), (2018).
      • Modelling Mortality Using Life Trajectories of Disabled and Non-Disabled Individuals in Nineteenth-Century Sweden, Sequence Analysis and Related Approaches, 10.1007/978-3-319-95420-2_5, (69-81), (2018).
      • Do Different Approaches in Population Science Lead to Divergent or Convergent Models?, Sequence Analysis and Related Approaches, 10.1007/978-3-319-95420-2_2, (15-33), (2018).
      • Sequence Analysis: Where Are We, Where Are We Going?, Sequence Analysis and Related Approaches, 10.1007/978-3-319-95420-2_1, (1-11), (2018).
      • Measuring Sequence Quality, Sequence Analysis and Related Approaches, 10.1007/978-3-319-95420-2_15, (261-278), (2018).
      • From 07.00 to 22.00: A Dual-Earner Couple’s Typical Day in Italy, Sequence Analysis and Related Approaches, 10.1007/978-3-319-95420-2_14, (241-257), (2018).
      • Divisive Property-Based and Fuzzy Clustering for Sequence Analysis, Sequence Analysis and Related Approaches, 10.1007/978-3-319-95420-2_13, (223-239), (2018).
      • Combining Sequence Analysis and Hidden Markov Models in the Analysis of Complex Life Sequence Data, Sequence Analysis and Related Approaches, 10.1007/978-3-319-95420-2_11, (185-200), (2018).
      • Estimating the Relationship between Time-varying Covariates and Trajectories: The Sequence Analysis Multistate Model Procedure, Sociological Methodology, 10.1177/0081175017747122, 48, 1, (103-135), (2018).
      • A multichannel typology of temporary employment careers in the Netherlands: Identifying traps and stepping stones in terms of employment and income security, Social Science Research, 10.1016/j.ssresearch.2018.10.001, (2018).
      • Sequence Analysis of Life History Data, Handbook of Research Methods in Health Social Sciences, 10.1007/978-981-10-2779-6, (1-19), (2018).
      • SADI: Sequence Analysis Tools for Stata, The Stata Journal: Promoting communications on statistics and Stata, 10.1177/1536867X1701700302, 17, 3, (546-572), (2018).
      • Agreement of Self-Reported and Administrative Data on Employment Histories in a German Cohort Study: A Sequence Analysis, European Journal of Population, 10.1007/s10680-018-9476-2, (2018).
      • Smoothness of the School-to-Work Transition: General versus Vocational Upper-Secondary Education, European Sociological Review, 10.1093/esr/jcy043, (2018).
      • Social Disparities in Destandardization—Changing Family Life Course Patterns in Seven European Countries, European Sociological Review, 10.1093/esr/jcx083, 34, 1, (64-78), (2017).
      • Enduring contexts: Segregation by affluence throughout the life course, The Sociological Review, 10.1177/0038026117741051, 66, 3, (645-664), (2017).
      • Early Adversity and Late Life Employment History—A Sequence Analysis Based on SHARE, Work, Aging and Retirement, 10.1093/workar/wax014, 4, 3, (238-250), (2017).
      • Late Life Employment Histories and Their Association With Work and Family Formation During Adulthood: A Sequence Analysis Based on ELSA, The Journals of Gerontology: Series B, 10.1093/geronb/gbx066, 73, 7, (1263-1277), (2017).
      • undefined, Proceedings of the Seventh International Learning Analytics & Knowledge Conference on - LAK '17, 10.1145/3027385.3027391, (128-137), (2017).
      • The Questionable Ecological Validity of Ecological Momentary Assessment: Considerations for Design and Analysis, Research in Human Development, 10.1080/15427609.2017.1340052, 14, 3, (253-270), (2017).
      • Intergenerational determinants of joint labor market and family formation pathways in early adulthood, Advances in Life Course Research, 10.1016/j.alcr.2017.09.001, 34, (10-21), (2017).
      • Subsidized Housing and Residential Trajectories: An Application of Matched Sequence Analysis, Housing Policy Debate, 10.1080/10511482.2017.1316757, 27, 6, (843-874), (2017).
      • Differentiating pathways of neighborhood change in 50 U.S. metropolitan areas, Environment and Planning A, 10.1177/0308518X17722564, 49, 10, (2402-2424), (2017).
      • undefined, 2017 IEEE International Conference on Big Data (Big Data), 10.1109/BigData.2017.8258222, (2620-2627), (2017).
      • Analyzing Dyadic Data Using Grid-Sequence Analysis: Interdyad Differences in Intradyad Dynamics, The Journals of Gerontology: Series B, 10.1093/geronb/gbw160, 73, 1, (5-18), (2016).
      • The Path-Dependency of Low-Income Neighbourhood Trajectories: An Approach for Analysing Neighbourhood Change, Applied Spatial Analysis and Policy, 10.1007/s12061-016-9189-z, 10, 3, (363-380), (2016).
      • Using observed sequence to orient causal networks, Health Care Management Science, 10.1007/s10729-016-9373-3, 20, 4, (590-599), (2016).
      • The Family Life Course and Health: Partnership, Fertility Histories, and Later-Life Physical Health Trajectories in Australia, Demography, 10.1007/s13524-016-0478-6, 53, 3, (777-804), (2016).
      • A Life Course Perspective on Work Stress and Health, Work Stress and Health in a Globalized Economy, 10.1007/978-3-319-32937-6_3, (43-66), (2016).
      • Sequencing the real time of the elderly: Evidence from South Africa, Demographic Research, 10.4054/DemRes.2016.35.25, 35, (711-744), (2016).
      • The diversity in longitudinal partnership trajectories during the transition to adulthood: How is it related to individual characteristics and regional living conditions?, Demographic Research, 10.4054/DemRes.2016.35.37, 35, (1101-1134), (2016).
      • Quo Vadis? Career paths of Brazilian regulators, Regulation & Governance, 10.1111/rego.12348, 0, 0, (undefined).
      • Lifetime patterns of comorbidity in eating disorders: An approach using sequence analysis, European Eating Disorders Review, 10.1002/erv.2767, 0, 0, (undefined).