Overview

Research Interests:
Machine Learning, Speech Processing, Speaker Diarization, Signal Processing, Variational Inference, Self-supervised Learning, Graph-based Clustering.
Thesis Work
Graph Clustering Approaches for Speaker Diarization
My research involves self-supervised and supervised clustering approaches for automatic speaker diarization of conversational speech.
What is Speaker Diarization?
It is the task of partitioning an audio recording containing multiple speakers into homogeneous segments and assigning each segment a speaker label.
Different colours represent different speakers. Courtesy: Google Blog
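As a toy illustration (the segment times and speaker names below are invented), a diarization output can be viewed as a list of time-stamped, speaker-labelled segments:

```python
# A diarization output as a list of (start, end, speaker) segments.
# Times and labels here are invented for illustration.
segments = [
    (0.0, 3.2, "spk1"),
    (3.2, 5.0, "spk2"),
    (5.0, 8.7, "spk1"),
]

def speaking_time(segments):
    # Total speaking time per speaker label.
    totals = {}
    for start, end, spk in segments:
        totals[spk] = totals.get(spk, 0.0) + (end - start)
    return totals

print(speaking_time(segments))  # spk1 ~ 6.9 s, spk2 ~ 1.8 s
```

Downstream applications such as rich transcription consume exactly this kind of segment-level output.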
  • Graph Neural Network based speaker diarization [webpage][paper1] [paper2]

    Speaker diarization, the task of segmenting an audio recording based on speaker identity, is an important speech pre-processing step for several downstream applications. The conventional approach to diarization involves multiple steps of embedding extraction and clustering, which are often optimized in isolation. While end-to-end diarization systems attempt to learn a single model for the task, they are often cumbersome to train and require large supervised datasets. In this paper, we propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNN), called End-to-end Supervised HierARchical Clustering (E-SHARC). The embedding extractor is initialized with a pre-trained x-vector model, and the GNN module is first trained on the x-vector embeddings this model produces. Finally, the E-SHARC model takes front-end mel-filterbank features as input and jointly optimizes the embedding extractor and the GNN clustering module, performing representation learning, metric learning, and clustering with end-to-end optimization. Further, with additional input from an external overlap detector, the E-SHARC approach can also predict the speakers in overlapping speech regions. Experimental evaluation on benchmark datasets such as AMI, VoxConverse, and DISPLACE illustrates that the proposed E-SHARC framework provides competitive diarization results using graph-based clustering methods.
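    The hierarchical clustering idea underlying this line of work can be sketched as a plain agglomerative merge over cosine similarities of segment embeddings; E-SHARC replaces the hand-crafted similarity and merge rule below with a learned GNN scoring module, so this is only a minimal sketch on synthetic data:

    ```python
    import numpy as np

    def cosine_sim(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def agglomerative_cluster(embeddings, threshold=0.5):
        # Start with one cluster per segment embedding; repeatedly merge
        # the most similar pair of cluster centroids until no pair is
        # more similar than the threshold.
        clusters = [[i] for i in range(len(embeddings))]
        means = [np.asarray(e, dtype=float) for e in embeddings]
        while len(clusters) > 1:
            best, pair = -np.inf, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = cosine_sim(means[i], means[j])
                    if s > best:
                        best, pair = s, (i, j)
            if best < threshold:
                break
            i, j = pair
            clusters[i].extend(clusters[j])
            means[i] = np.mean([embeddings[k] for k in clusters[i]], axis=0)
            del clusters[j]
            del means[j]
        labels = np.empty(len(embeddings), dtype=int)
        for lab, members in enumerate(clusters):
            labels[members] = lab
        return labels

    # Toy 2-D "x-vectors": four segments near each of two speaker centroids.
    rng = np.random.default_rng(0)
    emb = np.vstack([rng.normal([1.0, 0.0], 0.05, (4, 2)),
                     rng.normal([0.0, 1.0], 0.05, (4, 2))])
    labels = agglomerative_cluster(emb, threshold=0.8)
    print(labels)  # first four segments share one label, last four another
    ```

    The stopping threshold here plays the role that a learned merge/stop decision plays in the supervised hierarchical clustering.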

  • Self-supervised speaker diarization with path integral clustering [paper1][paper2]

    This work learns speaker representations using a clustering-based loss. The task is self-supervised because the output of the clustering algorithm itself supervises representation learning, making the embeddings more speaker-discriminative. We explored graph-structural path integral clustering to encode the embedding space as a graph. Published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing.
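    A simplified sketch of the path integral idea on a transition graph (the matrix values and the damping factor `z` below are illustrative, not the exact formulation used in the paper): a cluster's path integral sums the weights of all paths that stay inside it, and the affinity for merging two clusters is the increment gained by allowing paths through the other cluster.

    ```python
    import numpy as np

    def path_integral(P, sub, universe, z=0.2):
        # Sum the weights of all paths that stay inside `universe` and
        # start/end inside `sub`, via the Neumann series (I - z*P)^-1,
        # normalized by the squared cluster size.
        idx = {n: k for k, n in enumerate(universe)}
        P_u = P[np.ix_(universe, universe)]
        M = np.linalg.inv(np.eye(len(universe)) - z * P_u)
        sel = np.zeros(len(universe))
        sel[[idx[n] for n in sub]] = 1.0
        return float(sel @ M @ sel) / len(sub) ** 2

    def affinity(P, ca, cb, z=0.2):
        # Incremental path integral: how much each cluster's internal
        # connectivity grows when paths through the other are allowed.
        union = ca + cb
        return (path_integral(P, ca, union, z) - path_integral(P, ca, ca, z)
                + path_integral(P, cb, union, z) - path_integral(P, cb, cb, z))

    # Toy transition matrix over six segments: 0-3 are one tightly
    # connected speaker, 4-5 another, with weak cross links.
    A = np.array([
        [0.0, 1.0, 1.0, 1.0, 0.05, 0.05],
        [1.0, 0.0, 1.0, 1.0, 0.05, 0.05],
        [1.0, 1.0, 0.0, 1.0, 0.05, 0.05],
        [1.0, 1.0, 1.0, 0.0, 0.05, 0.05],
        [0.05, 0.05, 0.05, 0.05, 0.0, 1.0],
        [0.05, 0.05, 0.05, 0.05, 1.0, 0.0],
    ])
    P = A / A.sum(axis=1, keepdims=True)

    # Merging two halves of the same speaker scores higher than merging
    # across speakers.
    print(affinity(P, [0, 1], [2, 3]) > affinity(P, [0, 1], [4, 5]))  # True
    ```

    Because the affinity looks at whole path ensembles rather than single edges, it is robust to noisy pairwise similarities between segments.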

  • Self-Supervised Speaker Diarization [paper]

    The proposed approach is based on principles of self-supervised learning, where the self-supervision is derived from the clustering algorithm. The representations are learnt using a triplet-based loss derived from the clustering output of the previous stage. The work was accepted at Interspeech 2020.
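    A minimal sketch of the triplet loss with cluster pseudo-labels (the embedding values below are toy numbers; the actual system operates on neural speaker embeddings):

    ```python
    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=0.2):
        # Hinge loss: the squared distance to the positive should be
        # smaller than the distance to the negative by at least `margin`.
        d_pos = np.sum((anchor - positive) ** 2)
        d_neg = np.sum((anchor - negative) ** 2)
        return max(0.0, d_pos - d_neg + margin)

    # Pseudo-labels from the previous clustering stage: segments 0 and 1
    # were placed in one cluster, segment 2 in another.
    emb = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0]])
    print(triplet_loss(emb[0], emb[1], emb[2]))      # 0.0 -- triplet satisfied
    print(triplet_loss(emb[0], emb[2], emb[1]) > 0)  # True -- violated triplet
    ```

    Alternating between clustering (to refresh the pseudo-labels) and minimizing this loss is what makes the pipeline self-supervised.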

  • Third DIHARD speech diarization challenge [paper]

    Contributed to the baseline system setup for the DIHARD-III challenge. The task is to partition audio into speaker segments in challenging environments where the recording is corrupted by noise, music, babble, etc. and contains short speaker turns; applications include rich transcription of meetings and clinical diagnosis. Our team participated in the challenge and finished among the top 10 teams worldwide. The system combined end-to-end transformer-based diarization for telephone conversations with graph-based clustering for multi-speaker conversations.

  • Speaker Diarization using Posterior Scaled VB-HMM [paper1] [paper2]

    The project involves identifying the speakers present in different segments of a given audio recording from the DIHARD dataset, which covers challenging scenarios such as restaurants, clinical interviews, and mother-child conversations, using a posterior-scaled Variational Bayes Hidden Markov Model (VB-HMM). The work was published at Interspeech 2019.
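    The posterior scaling idea can be sketched as downscaling per-frame log-likelihoods before normalization, which softens overconfident frame posteriors so that the HMM prior and transition model retain more influence (the scale value below is illustrative):

    ```python
    import numpy as np

    def scaled_posteriors(log_lik, scale=0.1):
        # Downscale per-frame log-likelihoods before the softmax-style
        # normalization; this softens overconfident frame posteriors
        # that arise from the frame-independence assumption.
        z = scale * log_lik
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(z)
        return p / p.sum(axis=1, keepdims=True)

    # One frame, two speakers; the frame strongly favours speaker 0.
    log_lik = np.array([[0.0, -20.0]])
    print(scaled_posteriors(log_lik, scale=1.0)[0])  # near-hard: ~[1.0, 2e-9]
    print(scaled_posteriors(log_lik, scale=0.1)[0])  # softer:    ~[0.88, 0.12]
    ```

    With the softer posteriors, single noisy frames are less able to force spurious speaker turns during HMM inference.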

  • Diarization for multi-speaker test conditions in SRE 2018/2019 challenge [paper1] [paper2]

    The SRE 2018/2019 challenge involved test conditions with multiple speakers. We perform diarization to extract individual speaker segments, which are then scored against the enrollment recordings. This work was published at ICASSP 2019.

  • Other Research Work
    Audio Retrieval For Multimodal Design Documents: A New Dataset And Algorithms [webpage]

    We consider and propose a new problem of retrieving audio files relevant to multimodal design document inputs comprising both textual elements and visual imagery, e.g., birthday/greeting cards. In addition to enhancing the user experience, integrating audio that matches the theme/style of these inputs also helps improve the accessibility of these documents (e.g., visually impaired people can listen to the audio instead). While recent work on audio retrieval exists, these methods and datasets are targeted explicitly towards natural images. Our problem, in contrast, considers multimodal design documents (created by users with creative software) that are substantially different from naturally captured photographs. To this end, our first contribution is collecting and curating a new large-scale dataset called Melodic-Design (or MELON), comprising design documents representing various styles, themes, templates, illustrations, etc., paired with music audio.

    Workshops and Conferences
    • Presented in ICASSP 2023, Greece
    • Presented in IISc EECS Symposium, April 2022
    • Presented paper in ASRU 2021
    • Presented in IISc EECS Symposium, May 2021
    • Presented in IEEE-IISc Shannon's Day talk series, April 2021
    • Presented in DIHARD-III challenge workshop 2020
    • Talk on Women in Research in PyConIndia 2020, Online
    • Winter School on Speech and Audio Processing (WiSSAP) 2020, IIT Mandi, India
    • Presented paper and poster in Interspeech 2019, Graz, Austria
    • Summer school on mathematics for data science 2019 organised by IFCAM and IISc
    • Winter School on Speech and Audio Processing (WiSSAP) 2019, Trivandrum, India
    • Interspeech 2018, Hyderabad, India
    • Brain Computation and Learning Workshop, 2018, Bangalore, India
    • International Conference on Signal Processing and Communications (SPCOM), 2018