Streaming keyword spotting is a widely used solution for activating voice assistants. Deep Neural Network with Hidden Markov Model (DNN-HMM) based methods have proven to be efficient and are widely adopted in this space, primarily because of their ability to detect and identify the start and end of the wake-up word at low compute cost. However, such hybrid systems suffer from loss-metric mismatch when the DNN and HMM are trained independently. Sequence discriminative training cannot fully mitigate the loss-metric mismatch due to the inherent Markovian style of the operation. We propose a low-footprint CNN model, called HEiMDaL, to detect and localize keywords in streaming conditions. We introduce an alignment-based classification loss to detect the occurrence of the keyword, along with an offset loss to predict the start of the keyword. HEiMDaL shows a 73% reduction in detection metrics, along with equivalent localization accuracy and the same memory footprint as existing DNN-HMM style models for a given wake-word.