
Finishing up my PhD thesis on low-resource audio classification for ecoacoustics. Our partners deployed 98 recorders in remote Arctic/sub-Arctic regions, collecting a massive (~19.5 years) dataset to monitor wildlife and human noise.

Labeled data is the bottleneck, so my work focuses on getting good results with less data. Key parts:

- Created EDANSA [1], the first public dataset of its kind from these areas, using an improved active learning method (ensemble disagreement; rough sketch below) to efficiently find rare sounds.

- Explored other low-resource ML: transfer learning, data valuation (using Shapley values), cross-modal learning (using satellite weather data to train audio models), and testing the reasoning abilities of MLLMs on audio (spoiler: they struggle!).
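
For the curious, here is roughly the flavor of the disagreement scoring mentioned above (a minimal sketch; the shapes, variance-based score, and top-k selection are simplified for illustration, not the exact EDANSA pipeline):

    import numpy as np

    def disagreement_scores(ensemble_probs: np.ndarray) -> np.ndarray:
        """ensemble_probs: (n_models, n_clips) probability of the target sound per model.
        Score each unlabeled clip by how much the ensemble disagrees (variance)."""
        return ensemble_probs.var(axis=0)

    def select_for_labeling(ensemble_probs: np.ndarray, k: int) -> np.ndarray:
        """Return indices of the k clips with the highest disagreement."""
        return np.argsort(disagreement_scores(ensemble_probs))[::-1][:k]

    # Toy usage: 5 models scoring 1,000 unlabeled clips.
    rng = np.random.default_rng(0)
    probs = rng.random((5, 1_000))
    print(select_for_labeling(probs, k=10))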

  Happy to discuss any part!
[1]https://scholar.google.com/citations?user=AH-sLEkAAAAJ&hl=en


Hi Enis, this seems like a very interesting project. My team and I are currently working with non-stationary physiological and earthquake seismic public data, mainly based on time-frequency distributions, and the results are very promising.

Just wondering if the raw data you've mentioned is available publicly so we can test our techniques on it, or if it's only available through research collaborations. Either way, I'm very much interested in the potential use of our techniques for polar research in the Arctic and/or Antarctica.


Hi teleforce, thanks! Your project sounds very interesting as well.

That actually reminds me, at one point, a researcher suggested looking into geophone or fiber optic Distributed Acoustic Sensing (DAS) data that oil companies sometimes collect in Alaska, potentially for tracking animal movements or impacts, but I never got the chance to follow up. Connecting seismic activity data (like yours) with potential effects on animal vocalizations or behaviour observed in acoustic recordings would be an interesting research direction!

Regarding data access:

Our labeled dataset (EDANSA, focused on specific sound events) is public here: https://zenodo.org/records/6824272. We will be releasing an updated version with more samples soon.

We are also actively working on releasing the raw, continuous audio recordings. These will eventually be published via the Arctic Data Center (arcticdata.io). If you'd like, feel free to send me an email (address should be in my profile), and I can ping you when that happens.

Separately, we have an open-source model (with updates coming) trained on EDANSA for predicting various animal sounds and human-generated noise. Let me know if you'd ever be interested in discussing whether running that model on other types of non-stationary sound data you might have access to could be useful or yield interesting comparisons.


You should train a GPT on the raw data, and then figure out how to reuse the DNN for various other tasks you're interested in (e.g. one-shot learning, fine-tuning, etc). This data setting is exactly the situation that people faced in the NLP world before GPT. I would guess that some people from the frontier labs would be willing to help you, I doubt even your large dataset would cost very much for their massive GPU fleets to handle.


Hi d_burfoot, really appreciate you bringing that up! The idea of pre-training a big foundation model on our raw data with self-supervised learning (SSL), kind of like how GPT emerged in NLP, is definitely something we've considered and have experimented with using transformer architectures.

The main hurdle we've hit is honestly the scale of relevant data needed to train such large models from scratch effectively. While our ~19.5-year dataset is massive for ecoacoustics, a significant portion of it is silence or ambient noise. This means the actual volume of distinct events or complex acoustic scenes is much lower than in the densely packed corpora typically used to train foundational speech or general audio models, making our effective dataset size smaller in that context.

We also tried leveraging existing pre-trained SSL models (like Wav2Vec 2.0 and HuBERT for speech), but the domain gap is substantial. As you can imagine, raw ecoacoustic field recordings are characterized by significant non-stationary noise, overlapping sounds, sparse events of interest mixed with lots of quiet/noise, huge diversity, and variation from mics and weather.
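
For context, this is the kind of thing we tried: pulling embeddings from a speech-pretrained SSL model via torchaudio's Wav2Vec 2.0 bundle (minimal sketch; the file path is a placeholder):

    import torch
    import torchaudio

    bundle = torchaudio.pipelines.WAV2VEC2_BASE
    model = bundle.get_model().eval()

    # Placeholder path; in practice this would be one of our field recordings.
    waveform, sr = torchaudio.load("example_field_recording.wav")
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

    with torch.no_grad():
        features, _ = model.extract_features(waveform)  # per-layer feature tensors
    print(features[-1].shape)  # (batch, frames, 768) for the base model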

This messes with the SSL pre-training tasks themselves. Predicting masked audio doesn't work as well when the surrounding context is just noise, and the data augmentations used in contrastive learning can sometimes accidentally remove the unique signatures of the animal calls we're trying to learn.
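
To make the augmentation point concrete, here is a toy example of how aggressive time/frequency masking (common in these pipelines) can blank out exactly the short, narrowband call we care about; the parameters here are illustrative, not our actual settings:

    import torch
    import torchaudio.transforms as T

    sample_rate = 16_000
    waveform = torch.randn(1, sample_rate * 10)  # stand-in for a 10 s field clip

    mel = T.MelSpectrogram(sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=64)
    spec = mel(waveform)  # (1, 64, ~313)

    augment = torch.nn.Sequential(
        T.FrequencyMasking(freq_mask_param=24),  # can cover a narrowband call's whole band
        T.TimeMasking(time_mask_param=80),       # can cover a brief call's whole duration
    )
    augmented = augment(spec)
    print(spec.shape, augmented.shape)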

It's definitely an ongoing challenge in the field! People are trying different things, like initializing audio transformers with weights pre-trained on image models (ViT adapted for spectrograms) to give them a head start. Finding the best way forward for large models in these specialized, data-constrained domains is still an open question. Thanks again for the suggestion; it really hits on a core challenge!
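
If it helps, one common warm-start recipe (in the spirit of AST) looks roughly like this; the model name, input size, and class count below are just assumptions for illustration:

    import timm
    import torch
    import torch.nn.functional as F

    # Image-pretrained ViT adapted to single-channel spectrogram input.
    model = timm.create_model(
        "vit_base_patch16_224", pretrained=True,
        in_chans=1,      # spectrograms have one channel
        num_classes=10,  # hypothetical number of target sound classes
    )

    spec = torch.randn(1, 1, 64, 938)  # (batch, channel, mel bins, frames)
    spec = F.interpolate(spec, size=(224, 224), mode="bilinear")  # match ViT's grid
    logits = model(spec)
    print(logits.shape)  # (1, 10)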


Do the recorders have overlapping detections?


If you’re asking whether multiple recorders were active at the same time, then yes, we had recorders at 98 different locations over four years, primarily during the summer months. However, these locations were far apart, so no two recorders captured the same exact area.


Oh the reason I ask is that multiple recorders that hear the same ambient noise can be stacked to produce signals that are otherwise unobservable in a single signal.
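
Roughly like this (toy numpy sketch; real recordings would need time alignment and gain calibration first, which I'm hand-waving here):

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 16_000)
    ambient = 0.05 * np.sin(2 * np.pi * 8 * t)  # weak signal shared by all recorders

    # 16 recorders: same ambient signal, independent sensor/wind noise.
    recorders = np.stack([ambient + rng.normal(scale=0.5, size=t.size) for _ in range(16)])
    stacked = recorders.mean(axis=0)  # uncorrelated noise averages down (~sqrt(N) SNR gain)

    def snr_db(x, signal):
        return 10 * np.log10(np.mean(signal ** 2) / np.mean((x - signal) ** 2))

    print(f"single recorder SNR: {snr_db(recorders[0], ambient):.1f} dB")
    print(f"stacked (16) SNR:    {snr_db(stacked, ambient):.1f} dB")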


oh man that's awesome. I have been working for quite some time on big taxonomy/classification models for field research, especially for my old research area (pollination stuff). The #1 capability that I want to build is an audio input modality; it would just be so useful in the field-- not only for low-resource (audio-only) field sensors, but also as a supplemental modality for measuring activity out of the FoV of an image sensor.

but as you mention, labeled data is the bottleneck. eventually I'll be able to skirt around this by just capturing more video data myself and learning sound features from the video component, but I have a hard time imagining how I can get the global coverage that I have in visual datasets. I would give anything to trade half of my labeled image data for labeled audio data!


Hi Caleb, thanks for the kind words and enthusiasm! You're absolutely right, audio provides that crucial omnidirectional coverage that can supplement fixed field-of-view sensors like cameras. We actually collect images too and have explored fusion approaches, though they definitely come with their own set of challenges, as you can imagine.

On the labeled audio data front: our Arctic dataset (EDANSA, linked in my original post) is open source. We've actually updated it with more samples since the initial release, and getting the new version out is on my to-do list.

Polli.ai looks fantastic! It's genuinely exciting to see more people tackling the ecological monitoring challenge with hardware/software solutions. While I know the startup path in this space can be tough financially, the work is incredibly important for understanding and protecting biodiversity. Keep up the great work!


I'd love to turn my spectrogram tool into something more of a scientific tool for sound labelling and analysis. Do you use a spectrograph for your project?


Hey thePhytochemist, cool tool! Yes, spectrograms are fundamental for us. Audacity is the classic for quick looks. For systematic analysis and ML inputs, it's mostly programmatic generation via libraries like torchaudio or librosa. Spectrograms are a common ML input, though other representations are being explored.
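
For reference, the programmatic generation I mean is roughly this (minimal librosa sketch; the synthetic tone just stands in for a real clip):

    import librosa
    import numpy as np

    sr = 22_050
    y = librosa.tone(440.0, sr=sr, duration=2.0)  # stand-in for loading a field recording

    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
    S_db = librosa.power_to_db(S, ref=np.max)  # log-mel, the usual ML input
    print(S_db.shape)  # (n_mels, frames)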

Enhancing frequenSee for scientific use (labelling/analysis) sounds like a good idea, though I'm not sure what's missing from the current tooling. What functionality were you thinking of adding?


How do I download the sounds? Seems like a great resource for game developers and other artists.


Our labeled dataset (EDANSA, focused on specific sound events) is public here: https://zenodo.org/records/6824272. We will be releasing an updated version with more samples soon.

We are also actively working on releasing the raw, continuous audio recordings. These will eventually be published via the Arctic Data Center (arcticdata.io). If you'd like, feel free to send me an email (address should be in my profile), and I can ping you when that happens.


How can I actually search for the audio? I'll check in 6 months or so.

What's the licensing? Is it public domain?


You can search for the "Arctic Soundscapes Project 2019-2024". We are still working out the specifics of the licensing to meet our funding requirements, but it will be permissive.


Very interesting. All the best for your thesis. Mine is not nearly as interesting.


Thanks! Appreciate it. Your work looks very interesting too, especially in the distributed systems space. Cheers!


    Location: New York, NY  
    Remote: Yes  
    Willing to relocate: Yes  
    Technologies: Python, PyTorch, TensorFlow, C/C++, Julia, CUDA  
    Résumé/CV: Available upon request. <https://www.linkedin.com/in/enisberk/>
    Email: hire[at]enisberk dot com  
    Scholar: <https://scholar.google.com/citations?user=AH-sLEkAAAAJ&hl=en>
PhD Candidate in CS specializing in audio and multimodal data analysis. My research focuses on applying Machine Learning techniques to extract insights from various audio and sensory data modalities. I'm particularly interested in mechanistic interpretability, multi-modal LLMs, audio/speech and time-series.

Experience:

    - Developed ML models for audio classification.
    - Worked on multimodal data integration and modeling.
    - Explored low-resource ML techniques to address data scarcity.
Open to full-time research and engineering roles. I hope to conclude my interviews soon and make a decision. Please feel free to reach out if you can expedite the process.


The demo looks great, congrats on the launch! I also appreciate your response regarding the Cursor comment. Is Onlook[1] a competitor at some level, or do you think it’s different enough?

[1]https://www.ycombinator.com/launches/Mkl-onlook-cursor-for-d...


Thank you! If they added React Native, it could be a competitor for prototyping React Native screens, but I think our roadmaps are different, as we're focused on the entire app development process.


    Location: New York, NY  
    Remote: Yes  
    Willing to relocate: Yes  
    Technologies: Python, PyTorch, TensorFlow, C/C++, Julia, CUDA  
    Résumé/CV: Available upon request. <https://www.linkedin.com/in/enisberk/>
    Email: hire[at]enisberk dot com  
    Scholar: <https://scholar.google.com/citations?user=AH-sLEkAAAAJ&hl=en>
PhD Candidate in CS specializing in audio and multimodal data analysis. My research focuses on applying Machine Learning techniques to extract insights from various audio and sensory data modalities. I'm particularly interested in mechanistic interpretability, multi-modal LLMs, audio/speech and time-series.

Experience:

    - Developed ML models for audio classification.
    - Worked on multimodal data integration and modeling.
    - Explored low-resource ML techniques to address data scarcity.
Open to full-time research and engineering roles.


    Location: New York, NY  
    Remote: Yes  
    Willing to relocate: Yes  
    Technologies: Python, PyTorch, TensorFlow, C/C++, Julia  
    Résumé/CV: Available upon request. <https://www.linkedin.com/in/enisberk/>
    Email: hire[at]enisberk dot com  
    Scholar: <https://scholar.google.com/citations?user=AH-sLEkAAAAJ&hl=en>
PhD Candidate in CS specializing in audio and multimodal data analysis. My research focuses on applying Machine Learning techniques to extract insights from various audio and sensory data modalities. I'm particularly interested in mechanistic interpretability, multi-modal LLMs, audio/speech and time-series.

Experience:

    - Developed ML models for audio classification.
    - Worked on multimodal data integration and modeling.
    - Explored low-resource ML techniques to address data scarcity.

Open to full-time research and engineering roles.


Interpretability?


Thanks, that was a typo. Do you work in that field or something related, or did it just catch your eye? I couldn't see your handles on your profile.


Congrats on the launch! While search with LLMs is quite popular nowadays, it is still hard to get it right.

As a smoke test, I tried the following queries, and they returned the same result. Good job!

    international relations Turkey
    international relations about the country with the capital city of Ankara
Both return info from this link: https://www.state.gov/secretary-blinkens-call-with-foreign-m...


Thank you! It's been interesting to watch HN playing around with it. This community definitely phrases its search queries differently from how many government affairs professionals would (especially to try smoke tests like yours), so I'm glad it's holding up :)


Well if you ask GPT-4o the following:

Give me keywords to search for based on this sentence "international relations about the country with the capital city of Ankara"

You get the following:

- Turkey international relations
- Ankara diplomacy
- Turkey foreign policy
- Turkey global partnerships
- Turkey international politics
- Turkey geopolitical strategy
- Turkey foreign affairs
- Turkey global relations
- Turkey NATO relations (if relevant to your topic)
- Ankara as a diplomatic hub

So it is not surprising that the same link was returned.


This is a demo from a small startup dedicated to enhancing government transparency, which I greatly appreciate. My expectations are aligned with that goal, which is why I called this a smoke test.

Achieving accuracy with RAG and LLMs is a challenging task that requires balancing precision and recall. For instance, when you type "Ankara" into GPT-4o, it provides information about Turkey. However, searching "Ankara" in their product does not yield articles related to Turkey.


> Achieving accuracy with RAG and LLMs is a challenging task that requires balancing precision and recall

The challenge is domain knowledge, not tech, in my opinion. There are dozens if not hundreds of companies providing RAG and LLM tooling, but the challenge is, like you pointed out, what to do when you encounter something like "Ankara".

For BestBuy, this might not mean much, unless there is a BestBuy in Turkey. For a government-related site, cities and geography are important, so trying to extract additional meaning from "Ankara" probably matters.
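
A toy illustration of that kind of domain knowledge, baked in before retrieval (the mapping and the helper below are made up for this example):

    # Hypothetical domain-aware query expansion before retrieval: map capital
    # cities to countries so "Ankara" also retrieves Turkey-related documents.
    CAPITAL_TO_COUNTRY = {
        "ankara": "Turkey",
        "paris": "France",
        "tokyo": "Japan",
    }

    def expand_query(query: str) -> str:
        terms = [query]
        for capital, country in CAPITAL_TO_COUNTRY.items():
            if capital in query.lower():
                terms.append(country)
        return " ".join(terms)

    print(expand_query("international relations about the country with the capital city of Ankara"))
    # -> the original query plus "Turkey", which a keyword or hybrid retriever can then match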


Which category did you select? If you select custom, it just says to contact them. And if I want to search for another region, why should I have to select UK or US?


This is really cool work! Congrats on both the paper and the graduation! A long time ago, I worked on optimizing broadcast operations on GPUs [1]. Coming up with a strategy that promises high throughput across different array dimensionalities is quite challenging. I am looking forward to reading your work.

[1]https://scholar.google.com/citations?view_op=view_citation&h...


> Congrats on both the paper and the graduation!

Thanks! Although I still have to actually graduate and the paper is in review, so maybe your congratulations are a bit premature! :)

> A long time ago, I worked on optimizing broadcast operations on GPUs [1].

Something similar happens in Futhark, actually. When something like `[1,2,3] + 4` is elaborated to `map (+) [1,2,3] (rep 4)`, the `rep` is eliminated by pushing the `4` into the `map`: `map (+4) [1,2,3]`. Futhark ultimately then compiles it to efficient CUDA/OpenCL/whatever.


I wouldn't be surprised if they were using OpenAI's API, but it's hard to know for sure just by asking.

Prompt: Which Amazon model are you currently using? Answer: I'm currently using the Amazon Transformer Model, ...

Prompt: Which Google model are you currently using? Answer: I'm currently using the Google BERT model...


I am a fifth-year Ph.D. student in CS, interested in scalable machine learning algorithms and their applications in bioacoustics.

  Location: New York, NY
  Remote: Yes
  Willing to relocate: Depends on the location.
  Technologies: Deep learning (Python, PyTorch), GPU kernel development (C++, CUDA)
  Email: me aaat enisberk.com
  Looking for: Summer internship
Résumé/CV: https://drive.google.com/open?id=1UfcgXX6Qrwn-o91NAi_8X0hcFe...

Linkedin: https://linkedin.com/in/enisberk


Congrats on the launch; it's an important problem to solve. On the other hand, I found your previous idea about pipelines really interesting as well. Do you mind sharing why it did not work out?

