Hierarchical Exploration of Continuous Seismograms With Unsupervised Learning

Abstract Continuous seismograms contain a wealth of information with a large variety of signals with different origin. Identifying these signals is a crucial step in understanding physical geological objects. We propose a strategy to identify classes of signals in continuous single‐station seismograms in an unsupervised fashion. Our strategy relies on extracting meaningful waveform features based on a deep scattering network combined with an independent component analysis. Based on the extracted features, agglomerative clustering then groups these waveforms in a hierarchical fashion and reveals the process of clustering in a dendrogram. We use the dendrogram to explore the seismic data and identify different classes of signals. To test our strategy, we investigate a two‐day‐long seismogram collected in the vicinity of the North Anatolian Fault, Turkey. We analyze the automatically inferred clusters' occurrence rate, spectral characteristics, cluster size, and waveform and envelope characteristics. At a low level in the cluster hierarchy, we obtain three clusters related to anthropogenic and ambient seismic noise and one cluster related to earthquake activity. At a high level in the cluster hierarchy, we identify a seismic burst that includes around 200 events with similar waveforms and high‐frequent signals with correlating envelopes and an anthropogenic origin. The application shows that the cluster hierarchy helps to identify particular families of signals and to extract subclusters for further analysis. This is valuable when certain types of signals, such as earthquakes, are under‐represented in the data. The proposed method may also successfully discover new types of signals since it is entirely data‐driven.

seismic wavefield that makes the analysis and interpretation of seismic records difficult, especially if seismic data are the only data available.
As a response to this problem, seismologists have developed many processing tools for exploring these complex seismic data. Since the 1970s seismology benefits from artificial intelligence developments, bringing machine-learning-based solutions for exploring seismic data and recognizing patterns (e.g., Allen, 1978). More recently an unsupervised learning strategy called clustering was utilized to explore seismic data and find families of similar signals (Holtzman et al., 2018;Jenkins et al., 2021;C. W.Johnson et al., 2020;Köhler et al., 2010;Mousavi et al., 2019;Seydoux et al., 2020;Snover et al., 2020). In contrast to supervised learning strategies, clustering does not rely on a labeled training set and human expert knowledge (Goodfellow et al., 2016). Thus, clustering seismograms can help identifying families of signals which are not yet discovered or are poorly defined such as non-volcanic tremors.
In the present paper, we introduce a new strategy to use clustering as an exploration tool for continuous seismograms. Our strategy follows the idea that seismic signals are grouped in a hierarchy of classes following a specific similarity measurement, as schematized in Figure 1. Note that this illustration aims at sketching the concept rather than being complete or accurate. We consider the similarity between classes of signals to be measured on a set of signal characteristics that can be human-defined (such as mean frequency and signal duration) or learned with machine-learning tools, as we propose in the present paper. In the first place, one can imagine the seismic signal classes to split into long-term and short-term signals based on the duration of a signal ( Figure 1). In the class of long-term signals, one could use a similarity measure based on frequency content to separate the primary from secondary microseism. We see that building a tree of classes lets us explore the data on different levels and that different signal characteristics may be relevant at each node of the tree.
The sketch presented in Figure 1 also illustrates the problems of designing a class hierarchy by hand. The labels used in this sketch are the ones we created as seismologists based on our domain knowledge. That is problematic for those classes of signal that do not have a proper definition of signal and source properties, such as non-volcanic tremors. Moreover, some splittings, such as between earthquakes and explosions, ask for a more complex similarity measure which is hard to design by hand. Hierarchical clustering produces precisely this kind of tree, called a dendrogram, based on the exploration of the similarity of signals present in the input data. Therefore, we propose to represent continuous seismograms as a dendrogram and utilize it to explore the content of the data and identify different types of seismic signals. We want to emphasize that clustering identifies groups of similar objects, but it does not provide any meaningful labels, such as the labels in Figure 1. In an extra step, the found clusters can be labeled by analyzing the inherent properties of the clusters.
In the following section, we present the workflow to build a dendrogram from continuous single-station data. We introduce the concept of hierarchical clustering and how we transform continuous seismograms to a meaningful input (features) for the hierarchical clustering. In Section 3, we introduce a data set to apply and test the proposed workflow. In Section 4, we show and discuss briefly the resulting dendrogram. Section 5 is about navigating through the dendrogram and interpreting the clusters at different levels.

Method
A sketch of the hierarchical clustering workflow is depicted in Figure 2. In the following lines, we start with the concept of clustering in general and hierarchical clustering in particular. Then, we explain how we transform seismograms into a meaningful input for the cluster analysis.

Hierarchical Clustering
In general, cluster analysis groups objects based on their similarity to each other (e.g., Xu & Wunsch, 2008). Objects in the same cluster are more similar to each other than objects in separated clusters. The similarity between objects is measured on a set of certain characteristics called features. Finding the most relevant features for this task will be discussed later.
Various algorithms exist to find groups of objects in a data set. This study utilizes hierarchical clustering with a bottom up approach, namely agglomerative clustering. Hierarchical clustering relies on a similarity matrix, which defines the similarity (e.g., a specific distance in the feature space) between all objects in a data set (S.

10.1029/2021JB022455
3 of 21 C. Johnson, 1967). With a bottom-up approach, all objects start in a singleton cluster. The clusters start merging based on the similarity matrix until all objects unify in a single global cluster. This process is summarized in a dendrogram, revealing the hierarchical structure of the entire data set. Such a strategy fits very well the nature of seismic data as depicted in Figure 1.
The agglomerative clustering outcome depends mainly on the applied metric, which drives the merging of the cluster. In our approach, we use the Ward's method (Ward Jr, 1963). Given a distance d (here considered Figure 1. Illustration of a possible hierarchy of seismic signals found in seismograms. The different branches represent how a signal class splits into different sub-classes depending on a given similarity measure. Here, the different classes of signals are thought in a hierarchical way, based on arbitrary properties (e.g., duration, frequency range or signal's structure). This scheme aims at illustrating the expected behavior of an optimal clustering algorithm, but does not depict the potential issues related to clustering such as overlapping between different classes of signals or imbalance between classes. Euclidean), the Ward's method aims at grouping objects x i into clusters such that the within-cluster variance remains minimal after merging different clusters. The within-cluster variance σ quantifies the spread of each cluster in the feature space (for more details see Appendix A). By minimizing the overall variance, ∑

=1
with K being the number of clusters, the Ward's method allows for clusters of variable population sizes and variances. Thus, it may highlight clusters of high density located in the vicinity of more spread, low-density clusters. Therefore, Ward's method is suitable for the expected seismic data partition, where often ambient seismic noise outweighs signals with a tectonic origin.
It is worth mentioning that hierarchical clustering especially with the Ward's method can be computationally expensive. However, algorithms have been improved over time and became more efficient. In this study we utilize the python package fastcluster, which has a time complexity of ( 2 ) with N elements in ℝ and a memory complexity of ( ) (Müllner, 2013). More recently, the use of hyperbolic embeddings for preserving the hierarchical structure of the data seems to be a promising way to reduce even further the computational costs (Chami et al., 2020).

Finding an Appropriate Representation of Seismograms: The Deep Scattering Spectrum
In order to detect and identify classes of signals in continuous seismograms with hierarchical clustering, the seismograms have to be transformed into a meaningful input for the cluster analysis. For that purpose, we calculate features for fixed windows of the seismogram. Thus, each window will be assigned a cluster based on the features for this window. Note that this process simplifies the complexity of seismic data, since multiple types of signals can occur simultaneously. Common cluster analysis such as hierarchical clustering neglect this fact and can only assign a single cluster to an object. Besides the choice of the applied metric within hierarchical clustering, the choice of features is another important factor, which determines the outcome of the cluster analysis. Finding the most relevant features should be done according to the task at hand and can be done thanks to prior knowledge on the data or by defining proper algorithms to learn the most relevant features. We distinguish classical machine-learning algorithms that rely on human-defined features (Maggi et al., 2017;Malfante et al., 2018) or representation-learning algorithms where the features are learned from the data to optimize a given task (LeCun et al., 2015;Ross et al., 2018;Rouet-Leduc et al., 2020). While classical machine learning provides less accuracy in most cases, it provides interpretability since the features are known, which is an interesting aspect. Most algorithms that rely on representation learning are less easy to interpret since the features are more abstract, but they also provide more accurate results. In the present paper, we propose to use a hybrid approach between classical and representation learning algorithms that combines the advantages of both.
A time-frequency representation such as the spectrogram is one way to create a set of features for classifying seismic signals (Jenkins et al., 2021;Snover et al., 2020). However, Andén and Mallat (2014) showed that a spectrogram generated by the Fourier transform is not ideal for classification purposes since it is not stable to time-warping deformations, especially at short periods compared with the duration of the analyzing window. They introduce another time-frequency representation called a deep scattering spectrum which is computed by a scattering network. This type of network implements a cascade of convolutions with wavelet filters, modulus function, and pooling operations (see Figures 2a and 2b). Deep scattering spectra are locally translation invariant and preserve transient phenomena such as attack and amplitude modulation. These characteristics are beneficial when it comes to classifying any time series data. In Andén and Mallat (2014) and Peddinti et al. (2014), the authors have successfully classified audio data based on the deep scattering spectrum. The authors of Seydoux et al. (2020) have brought that representation into seismology and showed that small precursory signals of a landslide could be detected and classified in an unsupervised fashion. Other successful deep-learning classifiers inspired by deep scattering networks are presented in Balestriero et al. (2018) and Cosentino and Aazhang (2020).
We use the strategy presented in Seydoux et al. (2020) for calculating the deep scattering spectrum. Considering the continuous input signal ( ) ∈ ℝ (where, C is the number of channels), the scattering coefficients S (ℓ) of order ℓ are obtained from the following cascade of wavelet convolutions and modulus operations (i.e., wavelet transforms): where, ⋆ stands for the temporal convolution, |⋅| represents the modulus operator and ( ) ( ( ) ) is the wavelet filter at the layer i of the scattering network, with center frequency . Here, refers to one of the center frequencies of the layer i indexed by n i = 1 … N i , where N i is the total number of wavelets at layer i. In contrast to the Fourier transform, the center frequencies of the wavelets are placed logarithmically. In this study, we only consider a scattering network with 2 layers (as depicted in Figure 2) since, Andén and Mallat (2014) argued that more layers do not necessarily introduce new valuable information. The first layer in the network creates N 1 scalograms per channel of the seismic station. In the second layer another wavelet transform is applied to each scalogram of the first layer. Thus, the second layer contains N 1 * N 2 scalograms. A maximum pooling operation is then applied over each scalogram to retrieve the scattering spectrum. The entries of the scattering spectrum pooled from the first layer scalogram are called first-order scattering coefficients. The entries of the scattering spectrum pooled from the second layer scalogram are called second-order scattering coefficients, which contain important information about the attack and modulation of a signal. While, the scalograms still have the same sampling rate in time as the input data, the temporal pooling collapses the time dimension of the scalogram and produces a scattering spectrum for each input window. Note that each input channel from the seismic station is treated separately and their deep scattering spectrum are concatenated into a deep scattering spectrum vector with the size 3 * N 1 + 3 * N 1 * N 2 for a three-component seismogram. For each input window, a deep scattering spectrum vector is created, which are then merged into the deep scattering spectrogram. The final sampling rate of the deep scattering spectrogram is defined by the size of the input window. In Seydoux et al. (2020), the authors initialize Gabor wavelets with amplitudes and derivatives on a certain sets of knots and interpolate then with Hermite cubic splines. With respect to the clustering loss, they learn the parameters on these knots governing the shape of the wavelets. In this study, we directly use the initialized Gabor wavelets with zero phase shift and do not apply any learning of the wavelets. This choice was made principally because we do not perform a fixed cluster analysis in our study, but an exploration of the data instead where a loss function is harder to define. For the interested reader we refer to Andén and Mallat (2014) and Seydoux et al. (2020).

Features Extraction From Deep Scattering Spectrogram
The deep scattering spectrum can have more than 1,000 dimensions and, thus, the conditions for clustering are not favorable (Kriegel et al., 2009). Indeed, distances in very high-dimensional spaces give little information about the structure of the data (the so-called curse of dimensionality; Bellman, 1966). In addition, the representation is known to be highly redundant since the wavelet filters of the first layer are often considered with a strong frequency overlap in order to provide a dense first-order representation. Therefore, it is recommended to reduce the dimensions before clustering. In our case, we use an independent component analysis (ICA) to reduce the dimension of the representation. In the following remarks, we explain the basic concept of ICA. For the interested reader we refer to Comon (1994).
ICA is introduced as a statistical tool for blind source separation and feature extraction. The generative model of the ICA can be described as: where, ∈ ℝ × are the N observations of dimension F, ∈ ℝ × is the mixing matrix, and ∈ ℝ × are the unmixed sources (namely, the C unmixed sources obtained from ICA). The observations x are therefore a linear combination of the independent sources s, with the mixing weights gathered in A. A test of statistical independence is required to solve Equation 2 while ensuring the sources s to be independent. This concept is illustrated in Figure 2, where the unmixed sources are considered as features in our workflow (therein called feature matrix). These sources are obtained from applying the unmixing matrix, the pseudo inverse of the mixing matrix A, to the deep scattering spectrogram. Among the different strategies, we can look for a minimum of mutual information, or similarly, a maximization of the non-Gaussianity. In our study, we apply the FastICA algorithm from the scikitlearn Python library, which uses the negentropy as a measure of non-Gaussianity (Hyvärinen & Oja, 2000). This analysis is similar to the principal component analysis, with the difference that the independent components are not orthogonal. In addition, there is no information about the variance explained by the different independent components, and are therefore delivered unsorted by the algorithm.

Data
We test our proposed workflow on continuous three-component seismic data from the station DC06 of the DANA experiment in Turkey (see for instance Poyraz et al., 2015, and the map shown in Figure 3a). Originally, the experiment was conducted to investigate the crustal structure beneath the western segment of the North Anatolian Fault. We choose the data set for mainly two reasons. First of all, the data set contains both seismic and anthropogenic activity, which is a typical situation in most seismological studies. Second of all, an existing template matching catalog provides labels for the seismicity in this area. The catalog was built following the methodology in Beaucé et al. (2019).
We choose to analyze the seismic data from the 25th to the 27th of July 2012. During the period of these two days, a high rate of localized seismicity with 148 cataloged events occurred on and around the northern strand of the North Anatolian fault (see Figures 3a and 3b). In this study, we refer to this high rate of seismicity as a seismic burst. The catalog explains the series of events with 17 templates having their hypocenters close to each other ( Figure 3a, red dots). Since the seismic burst causes a repeating pattern in the seismogram with short time-warping deformations due to slight changes of the hypocenters, it is an interesting study case for our proposed method. Station DC06 is close to the seismic burst and records the time period of interest without data gaps. Thus, we choose the three-component seismograms of this station. The sampling rate of the data is 50 Hz.
The spectrogram of the east component of station DC06 is presented in Figure 3c. The oceanic microseism is visible around 0.2 Hz, where we can observe the dispersive nature of the oceanic gravity waves. At around 1.5 Hz we can identify a nonstationary monochromatic noise source, which seems to be more active during the first day. At frequencies higher than 3 Hz we can see increased activity during daytime, most likely induced by anthropogenic seismic sources. The event with the largest magnitude of the burst is also easy to spot during the evening of the 25th in the spectrogram.

Feature Space
First, we use the continuous three-component seismograms to calculate the deep scattering spectrogram with a two-layered scattering network (as detailed in Equation 1). The network parameters are physics-driven and can be adjusted according to the goal. In this study, the first layer contains 24 Gabor wavelets with center frequencies between the Nyquist frequency of the seismogram (25 Hz) and 0.78 Hz with a spacing of 4 wavelets per octave. The second layer contains 14 Gabor wavelets with center frequencies between 25 and 0.19 Hz with a spacing of 2 wavelets per octave. This setup results in 24 wavelet transforms per channel in the first layer and 336 (24 * 14) wavelet transforms per channel in the second layer. Because the deep scattering spectrum is a concatenation of the first-and second-order scattering coefficient of each input channel, the total number of scattering coefficients is 1,080 (dimension F in Figure 2). For the temporal pooling operation, we apply maximum pooling, since we are interested in detecting and classifying non-stationary events such as the seismic burst. If the focus of classification is the background noise, average pooling might be the better choice (as suggested in Seydoux et al., 2020).
The moving pooling window is 20.48 s large and does not overlap. Hence, the time resolution of the deep scattering spectrogram is also 20.48 s.
For dimensionality reduction, we apply an independent component analysis using the FastICA algorithm from the scikit-learn Python library. In this study, we select the appropriate number of independent components according to the reconstruction loss between the original data and the reconstructed data after compression with an ICA (detailed in Appendix B). We emphasize that we look for a trade-off between keeping the most significant amount of information while using few independent components. From the study of the loss with increasing number of components shown in Appendix B and Figure B1 therein, we conclude that keeping 10 independent components is a good compromise and constitute our choice in the present study. A visual representation of the 10 unmixed sources building the feature space is depicted in Figure B2 in Appendix B.

Dendrogram
After transforming the continuous seismic data into a most relevant set of features, we can use this representation to explore the data with hierarchical clustering. By controlling the distance threshold, we can extract different numbers of clusters. The distance threshold sets the boundaries for the possible distances between points within a cluster. While a larger distance threshold allows larger and fewer clusters to form, a smaller distance threshold extracts smaller but many clusters. Note that the distance threshold is only used to extract different cluster solutions based on the similarity matrix; it is not a hyperparameter affecting the similarity matrix. In Figure 4a we selected a distance threshold of 0.47 in order to show a truncated dendrogram stopping at 16 clusters. At a distance of 0.9, we extract four main clusters labeled as A, B, C, and D. Figure 4b-4e depict random-selected three-component seismograms from each of the cluster. Figure 4f shows the averaged first-order scattering coefficients of these four clusters. These first-order scattering coefficients describe the frequency characteristics of each cluster. Figure 4g presents the normalized cumulative detection rate of each cluster, with the seismic burst detection rate indicated as a reference. The relative size of each cluster compared to the size of the entire data set is depicted in Figure 4h. In the following remarks, we will analyze each of the four main clusters from left to right.
Cluster A contains ca. 27% of the data ( Figure 4h) and is the first cluster to split from the whole data set, that is, cluster A is the furthest away from the center of the data points ( Figure 4a). Compared to the other clusters, its scattering coefficients for all frequencies are relatively low except for a local maximum around 1.5 Hz (Figure 4f). Looking at the corresponding cumulative detection curve (Figure 4g), we see that this cluster is active mainly during the first day until the late afternoon, which seems to correlate with the monochromatic signal around 1.5 Hz we have already identified in the spectrogram (Figure 3c).
Cluster B contains about 19% of the data samples ( Figure 4h) and has relatively large scattering coefficients for frequencies above 10 Hz (Figure 4f). The corresponding cumulative detection curve indicates that this cluster accumulates less detections during the beginning of a day than with later times of a day (Figure 4g). Combining these facts leads to the hypothesis that cluster B might be related to signals with an anthropogenic origin.
Cluster C is the largest cluster with more than 50% of the data points (Figure 4h). Compared to the other clusters, it also has the lowest scattering coefficients at all frequencies (Figure 4f). Looking at the cumulative detection curve (Figure 4g), we see this cluster shows an almost linear increase starting at the afternoon of the first day, exactly when cluster A becomes almost inactive. The cluster size and frequency content suggest that cluster C contains mostly ambient seismic noise data.
Finally, cluster D contains about 4% of data set and is the smallest of the four clusters (Figure 4h). The corresponding first-order scattering coefficients show a local maximum around 5 Hz (Figure 4f). Its cumulative detection curve correlates well with the detections of the seismic burst (Figure 4g), with additional detections before the seismic burst starts. All these observations indicate that cluster D is probably related to nearby seismic activity in general.

Discussion
In this section, we will discuss and interpret the dendrogram's representation and its clustering solution. While, the main focus is on identifying how the seismic burst occurs in the dendrogram, we will also discuss how the general seismicity is observed through this representation, and interpret the remaining clusters with anthropogen- ic activity and ambient seismic noise. To underpin the statement that the deep scattering spectrum is a superior representation for the task at hand than the Fourier-transform spectrum, we also create and interpret a dendrogram based on the Fourier-transform of the same data set (see Appendix D).

Identification of the Seismic Burst Within the Dendrogram
First, we identify all time segments containing onsets of the events of the seismic burst and observe which clusters those time segments belong to. The template matching catalog contains 148 detections related to this seismic burst. However, we only associate 136 samples in the feature space with the seismic burst, since one sample represents about 20 s of waveform data and, thus, can contain multiple events. Figure 5a shows that a large majority of the samples, which contain arrivals of the seismic burst, fall into cluster D (92.6%). On the other hand, only 40% of cluster D is related to the seismic burst, underpinning the statement that this cluster is related to general seismic activity. Cluster B and C share the remaining 7.4% of the burst. Compared to the large population sizes of clusters B and C, the contribution of the burst almost vanishes (0.3 and 0.1%). Cluster A contains no detections of the burst. While cluster D contains the majority of the seismic burst, the interesting aspect is to understand what the remaining 60% samples of this cluster are related to (earthquakes from the same source region, different signals, etc). To answer that question, we investigate the subclusters visible in Figure 4a obtained with a distance threshold of 0.47; in particular, we will narrow the focus on the subclusters of cluster D, namely the four subclusters D.1 to D.4.
First, we look at the distribution of the samples containing the seismic burst across the four subclusters in main cluster D. From Figure 5a, we know that more than 92% of the burst was found in cluster D. We observe in Figure 5b that this amount splits into ca. 71.3% in cluster D.1 and ca. 21.3% in cluster D.4. The subclusters D.2 and D.3 contain no earthquakes from the seismic burst and will be discussed later. If we look at the cumulative detection curve of each subcluster in D (Figure 5c), we see that cluster D.1 and D.4 share a very similar temporal pattern. The corresponding averaged first-order scattering coefficients (Figure 5d) explain why the burst got split into two clusters: across almost all frequencies the larger subcluster D.1 shows significantly smaller scattering coefficients than the smaller subcluster D.4. Hence, the magnitude of the events seems to be the characteristic that separates the burst into two clusters. Besides, we observe that 56% of D.1 and 97% of D.  The first 30 waveforms correspond to subcluster D.4. 29 of them are are also in the catalog (marked orange) while 1 of them is not in the catalog (marked magenta). The following 174 waveforms are from subcluster D.1. 98 of them are are also in the catalog (marked light blue) while 76 of them are not in the catalog (marked blue). The waveforms are very similar to each other on all three channels. This indicates that these new detections are coming from the same source area. Note also that the first 30 waveforms representing subcluster D.4 have a better signal-to-noise ratio than the following waveforms of subcluster D.1. This agrees with our assumption that the burst is split into two subclusters due to magnitude differences. The magnitude estimations of the template matching catalog confirms this assumption (see Figure 6d). While most of the events located in D.1 range between M0.5 and M1, the events located in D.4 range between M1 and M2.2.
By investigating cluster D and its subclusters D.1 and D.4, we are able to identify two subclusters representing the seismic burst. While D.1 contains many events with smaller magnitudes, D.4 contains fewer events with larger magnitudes. Together the two subclusters contain 92.6% of the cataloged events and 77 new events, which have identical P and S wave arrivals as the cataloged ones. The new detections can be explained by the fact that we utilize a single station method and compare it to a catalog based on a multi station method. More details and a comparison with a single station template matching catalog based on station DC06 can be found in Appendix C.
However, 7.4% of the cataloged detections can not be found in subclusters D.1 or D.4. In the following remarks, we want to analyze the misidentified 7.4% of cataloged events, which equal 10 over 136 events. First of all, we want to know where these events are located in the feature space. Therefore, we calculate the Euclidean distance between the misidentified events and the centroids of each cluster in the feature space (see Figure 7a). In magenta, we highlight the distance between the sample and its respective subcluster. In cyan, we highlight the distance between the sample and subcluster D.1 containing the low magnitude events of the burst. In gray, we highlight the distances to all other remaining clusters as a comparison. We sorted the misidentified 10 events according to the distance to the centroid of D.1. We see that for the first six events, the distance to the centroid of D.1 is smaller than to the centroid of its respective cluster. The corresponding waveform data also offer explanations for the misidentification (Figures 7b-7d). Indeed, the P and S arrivals are noisy but visible for the first five events. Thus, some events might be misclassified because samples are grouped with the Ward's method, which solves iteratively an objective function considering the Euclidean distance and the within-cluster variance. In other words, clusters can agglomerate samples which might be closer to the centroids of other clusters if we consider the pure Euclidean distance. After the first five events, when the distance to its respective cluster becomes smaller than the distance to D.1., the P and S arrivals are not visible anymore, or other large-amplitude events are present. Here the problem is related to assigning a single cluster to 20 s of waveform data, which can contain multiple signals.

Neighboring Clusters of the Seismic Burst in the Feature Space
Having identified most of the seismic burst in two neighboring subclusters already shows that the representation of the data and the distances between the data points are meaningful. As a next step, we want to analyze the neighborhood of these two subclusters to get a better understanding of the data representation. Since D.2 and D.3 share the same cluster with D.1 and D.4, we know that they are located next to each other in the feature space. This indicates that subcluster D.2 and D.3 might contain similar signals, such as seismic activity with a different origin than the seismic burst. To verify this assumption, we can compare existing earthquake catalogs with the timestamps of the samples in the subclusters. We extend the local template matching catalog with a regional catalog limited to events within a radius of 5° around station DC06. The regional catalog is downloaded from IRIS. For calculating the seismic phase arrivals at the station, we use the TauP module of ObsPy with the velocity model of Kennett and Engdahl (1991). We consider a sample related to an event of the catalog if the 20 s window of the sample overlaps with the window between the P wave arrival and the decaying coda.
The waveform data of D.2 and D.3 are presented in Figure 8. Figure 8a indicates the samples which can be explained by arrivals of a regional or local event, and Figure 8b shows the samples which can not be explained by arrivals of a regional or local event. Note that one sample in the feature space represents ca. 20 s of waveform data and each horizontal waveform displayed in Figure 8 contains multiple consecutive 20 s windows. Subcluster D.2 contains only nine samples corresponding to two seismic events indicated in blue in Figure 8a. The first event represented by eight consecutive samples at index 0 is a relatively distant M4 event. The other event represented by a single sample is a quarry blast from a local mine mentioned by the template matching catalog. At first sight, it might seem unexpected that these two events are found in the same subcluster. However, subclusters D.2 shows the largest scattering coefficients for frequencies below 5 Hz (see Figure 5d), and its centroid is the furthest away from the remaining data set as we can see from the inter-cluster distance matrix presented in Figure A1 in Appendix A. Moreover, the within-cluster variance σ c in the top panel of Figure A1 indicates that the samples of subcluster D.2 are the most spread out compared to the other subclusters, This suggests that both events are seen as outliers in the feature space due to their high amplitudes at lower frequencies.
Moreover, we observe that the catalog can explain 67% of all samples of D.3 (a random selection of waveforms are shown in Figure 8a). The other 33% are shown in Figure 8b, and some samples also show seismic phase arrivals (in particular, the seismograms shown at index six and nine). It is thus likely that the samples shown in Figure 8b contain uncataloged events. While subcluster D.1 and D.4 represent similar earthquakes from a similar source region, subcluster D.3 shows many kinds of signals, such as earthquakes with different magnitudes and distances to the station. We can interpret subcluster D.3 as an agglomeration of transient signals with increased energy between 1 and 5 Hz (see Figure 5d). Regional and local events also fall into this category. Thus, in the vicinity of the subclusters D.1 and D.4, related to the seismic burst, other subclusters containing seismic activity can be found.

Anthropogenic Signals With High Envelope Correlation
After identifying seismic activity in cluster D, we want to draw attention to the remaining part of the seismic data set. Seismic activity induces short-term signals with a characteristic waveform and envelope shape. However, if we want to classify other types of signals like tremors, anthropogenic noise, or ambient noise, correlating waveforms are unlikely to be suitable for this task. One key feature of the deep scattering spectrum is the representation of the waveform's envelope in the second-order scattering coefficients (Andén & Mallat, 2014). Consequently, we should find clusters with weakly correlating waveforms but strongly correlating envelopes.
For that reason, we investigate the correlation coefficient of the waveform (CC W ) and the envelope (CC E ) for all subclusters. First, a template is defined by the closest sample to the centroid representing the most typical waveform of a cluster. Then, we calculate the correlation coefficient of the waveform data CC W and the correlation coefficient of the smoothed envelope CC E between the template and the remaining samples. The envelope is defined by the modulus of the analytic signal, which is a complex-valued representation of the waveform disregarding the negative frequencies from the Fourier transform. A median-filter smoothens the envelope. The averaged results are depicted in Figure 9a. We first observe that CC E is more significant than CC W for most subclusters. In particular, cluster B.4 shows the most significant discrepancy between CC E and CC W ; this subcluster is part of cluster B, which we related to high-frequent urban noise. In Figures 9b-9d, we align the envelopes for each channel and each sample in B.4 to depict the shared characteristics. We see a very symmetric envelope that lasts around 5 s. The envelopes look very similar on all three components. Figure 9e shows a histogram of detections over the time of the day. We see that this cluster mostly appears during daytime with a clear peak around 14:00 local time. Figure 9f shows the averaged first-order scattering coefficients for all three channels. The frequencies above 5 Hz are very pronounced and peak between 10 and 15 Hz. In summary, we see that subcluster B.4 is relat- ed to non stationary urban noise which produced similar envelopes lasting 5 s. Nearby road traffic could produce these kind of signals.

Long-Lasting Signals With Low Envelope Correlation
As the last example, we want to draw attention towards clusters A and C. Both clusters show relatively low correlation coefficients for the envelopes (see Figure 9). Cluster C contains more than half of the data, and the average scattering coefficients are the lowest for all frequencies compared to the other clusters (see Figures 4f and 4g). Moreover, the subclusters of C have a relatively low distance to each other, and their within-cluster variance is relatively low (see Figure A1 in Appendix A). This indicates that they contain similar signals. Combining these facts, we conclude that this cluster contains ambient noise without any significant activity of transient signals.
Cluster A seems to correlate with the monochromatic noise source around 1.5 Hz (see Figures 3c and 4f). To prove that cluster A contains only data with increased activity around 1.5 Hz we depict the occurrence of cluster A and the Fourier amplitude of the three channels filtered between 1.4 and 1.6 Hz as a function of time in Figure 10. In general, an increased amplitude around 1.5 Hz correlates well with the appearance of cluster A. However, not all samples with an increased monochromatic activity fall into cluster A. As with the misidentified events in Figure 7, the problem is related to assigning a single cluster to 20 s of waveform data containing multiple types of signals. It is also interesting to note that subcluster A.1 and A.3 show larger correlation coefficients for the waveforms than for the envelopes (Figure 9a). This characteristic only applies to these two subclusters and is related to the dominance of the monochromatic signal.
Cluster A and C show that the dendrogram representation based on features from the deep scattering spectrum also finds cluster of noise sources without strong correlation of the waveforms or envelopes.

Conclusion
In this study, we proposed a new way of exploring the content of continuous seismograms and identifying different types of signals present in the data. Our approach is based on hierarchical clustering, which offers many cluster solutions with the dendrogram and, thus, delivers a tool for exploring the data. The hierarchical clustering is applied to a low-dimensional feature space extracted from the deep scattering spectrogram of the continuous seismogram. A primary advantage of the workflow compared to other machine learning algorithms for classifying continuous seismic data is the interpretability at each step and the deep scattering spectrum, which seems to be a promising representation of seismic data for classification purposes.
For an application in this study, we chose a 2-day long three-component seismogram containing a nearby seismic burst with 148 cataloged events with similar waveforms. These labels served as a sanity check for the algorithm. First, we extracted a cluster solution with four main clusters to get a rough overview of the data. With the cluster size, the temporal detection, and averaged first-order scattering coefficients, we delivered an interpretation of each cluster and could identify a cluster containing mostly waveforms related to earthquakes. Inside this specific cluster, we found two subclusters containing almost all cataloged events of the seismic burst. While the events of the seismic burst split into two subclusters due to magnitude differences, 77 uncataloged events with similar waveforms were found. The case of the seismic burst shows that we can identify a repeating pattern with slight variations of the waveforms and low SNR in an unbalanced data set. The few misidentified events highlight the multi-label characteristics of seismograms. Multiple signals can arrive simultaneously and, thus, assigning a single label to a part of the seismogram does not reflect the whole truth. Integrating this issue into clustering seismograms is an interesting aspect for future work. Besides the seismic burst, we also identified signal families with anthropogenic origin and a large cluster containing ambient seismic noise. The different types of signals show that the strategy is able to group signals with correlating waveforms, envelopes or similar frequency characteristics.
We want to emphasize here that hierarchical clustering and the dendrogram itself does not deliver meaningful labels for the clusters. Interpreting the different cluster solutions with certain characteristics such as the temporal detection curve is a crucial step towards understanding and revealing the content of the data. Until the point of hierarchical clustering, the proposed workflow is an unsupervised and data-driven strategy to find groups of similar seismic signals. After that point, we use the output of that strategy to do an interpretation and assign meaningful labels to the retrieved clusters.
As most machine learning algorithms, the proposed strategy relies on a few parameters to tune. The hyperparameters of the deep scattering network are mainly physics-driven and depend on the pre-defined task. As with Fourier spectrograms, we can control the window size and frequencies of interest. For example, low frequent first-order wavelet filters might not be necessary for finding groups of anthropogenic signals. Maximum pooling is more interesting than average pooling if the signals of interest have a transient character such as earthquakes.
After designing the deep scattering network, the number of components in the independent component analysis is an exploratory task. It is a trade-off between keeping crucial information and producing a low-dimensional representation to avoid the curse of dimensionality.
In general, the method can be used for various tasks. It is beneficial to get a general overview of an unknown data set. If there is a particular target of interest (e.g., earthquakes, urban noise sources, tremors), we can navigate the dendrogram and focus the analysis on a specific branch. The temporal detection curves of the clusters can be easily correlated with other time series such as GPS displacement or environmental parameters to search for signal classes related to certain physical processes. A specific interesting application would be the North Anatolian Fault, where seismologists assume the presence of non-volcanic tremors but conventional methods did only deliver null results so far (Bocchini et al., 2021;Pfohl et al., 2015). In contrast to conventional tremor detection algorithm, our approach could identify signals related to tectonic processes without assuming any signal characteristics. In the same sense, the dendrogram can reveal clusters/classes human expert knowledge could not reveal yet and expand the classes of signals we know so far. Moreover, the method can be helpful to extract particular types of noise for performing ambient noise cross-correlation potentially enhancing the signal quality.

Appendix A: Within-Cluster Variance and Inter-Cluster Distance
This section presents the way we calculate the inter-cluster distance d ij between clusters i and j and the within-cluster variance σ i of cluster i. The inter-cluster distance are defined by the Euclidean distances between the centroids of the cluster: where, = 1 ∑ ∈̂ represents the centroid of cluster i with the samples ̂∈ ℝ belonging to cluster i, and where ‖ ⋅ ‖ 2 represents the L2 norm. Similarly, the variance σ i of cluster i is defined as: This analysis is inspired from the silhouette analysis (Rousseeuw, 1987) and helps to understand better the clustering results. The within-cluster variances and the Euclidean distances between the centroids are depicted in Figure A1. Figure B1 depicts the reconstruction loss ϵ(C) for an increasing number of independent components (sources) C. The reconstruction loss decreases rapidly with the first components. With a more significant number of components, the rate of error decrease becomes smaller. The choice of the number of components is a trade-off between keeping the dimensions low and retaining most of the information. Thus, 10 independent components seem like a good compromise to us.
The time series of the 10 unmixed sources calculated from the data set are shown in Figure B2. To see if a single source already shows a clear distinction between the seismic burst and the rest of the data, we marked in blue the samples containing at least one earthquake from the burst. It appears that the ninth unmixed source seems to separate the seismic burst from the rest of the data. This observation raises the question if other trends, such as the background noise, can be correlated with specific unmixed sources. If we compare with the spectrogram of Figure 3c we see that the second unmixed source seems to correlate with the variations around 0.2 Hz and the eighth unmixed source seems to correlate with the monochromatic noise source around 1.5 Hz. This quick visual inspection shows us that the feature space can already be physically interpreted, and the ICA separates different signals on its different unmixed sources, which is favorable for further analysis by clustering algorithms.

Appendix C: Comparison With Single-Station Template Matching
Station DC06 recorded higher signal-to-noise ratio S waves from the seismicity burst than the more proximal stations. Therefore, we are able to detect about twice more events by running the matched-filter search only on station DC06, with respect to the multi-station (10 stations) matched-filter search. The single-station template matching catalog captures a seismicity pattern similar to clusters D.1 and D.4, but reports about 50% more events (see Figure C1). Both the single-station and multi-station template matching catalogs were built with a detection threshold of eight times the root-mean-square of the correlation coefficient time series. The 20 s time resolution of the clustering method presented in this work sets a hard constraint on revealing the details of low magnitude seismicity. Nevertheless, we recall that producing a fine resolution earthquake catalog is not the first goal of our method, which instead aims at unraveling signals of different nature with no prior knowledge of the data set.

Appendix D: Qualitative Comparison With Hierarchical Clustering Based on Spectrograms
In our study, we use a deep scattering spectrum instead of a Fourier-transform spectrum, since it is more suitable for classification purposes (Andén & Mallat, 2014). In the following lines, we create and interpret a dendrogram based on Fourier-transform spectral features to verify this claim for seismograms. For the sake of comparison, the window size of the Fourier-transform equals the pooling window of the scattering network, which is 20.48 s. Moreover, the considered frequency range of the Fourier-transform is adapted to the frequency range of the first order scattering coefficients. The three-component spectrogram with 1,440 spectral coefficients per time step is then used to calculate 10 independent components, which resemble the feature space for the dendrogram. Thus, we only replaced the scattering coefficients with spectral coefficients of comparable time and frequency properties.
To compare the clustering outcome, we retrieve 16 subclusters, which can be grouped into the three main clusters A',B', and C' (see Figure D1a). The time evolution curves and the cluster sizes in Figures D1b and D1c show if the retrieved main clusters are the same as in Figure 4. Cluster A' matches very well with cluster A in terms of cluster size and temporal detection curve. Thus, Cluster A' is also related to the monochromatic signal. Cluster B' matches with the detection curve of Cluster C, however, Cluster B' contains more data than Cluster C. Thus, Cluster B' is also related to ambient signals but possibly contains also additional types of signals. The normalized detection curve of Cluster C' matches with Cluster B, however, Cluster C' is not even half of the size of cluster B. Hence, Cluster C' is probably related to high-frequent urban signals. Cluster D, which is related to general seismicity, does not appear within the main clusters based on spectral coefficients. In fact, most of the seismic burst is within cluster B', which is mainly related to ambient signals (see Figure D1d). Hence, we can assume that Cluster C and D are unified here in Cluster B'. Retrieving subclusters at a lower distance threshold than the three main clusters could possibly reveal a few subclusters related to the seismic burst. However, 12 out of 16 subclusters contain events from the seismic burst (see Figure D1e). It is not possible to identify a few clusters which are purely related to the seismic burst. Subcluster B'.1 and B'.2 contain ca. 20% of the cataloged seismic burst respectively, however, most of the subcluster ( 96%) is not related to the cataloged seismic burst. This example shows that a deep scattering spectrum delivers a better representation for classification purposes than a Fourier transform spectrum. This is particularly true for classifying reoccurring transient signals in a relative large data set such as the events of the seismic burst within the continuous seismogram.

Data Availability Statement
The facilities of IRIS Data Services, and specifically the IRIS Data Management Center, were used for access to waveforms, related metadata, and/or derived products used in this study. The data of the DANA array can be found at DANA (2012). The scattering network which was used in this study can be found at https://doi. org/10.5281/zenodo.5518136. The python packages ObsPy, SciPy, and Scikit-learn were heavily used for processing the data (Beyreuther et al., 2010;Pedregosa et al., 2011;Virtanen et al., 2020). Maps were created with the python package Cartopy (Met Office, 2010. We used map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.