INCF 19.1 An LLM-assisted service for annotating research data with machine-understandable, semantic data dictionaries

Contact details

Full name: Prapti Takawale

Email: praptitakawale@gmail.com

Neurostars ID: Prapti.dt

Location: Pune, India

GitHub: https://github.com/prapti2002

Project details

Project synopsis / summary

What is the project about?

This project aims to enhance Neurobagel's annotation tool with LLM-driven assistance, streamlining dataset harmonization. This is crucial for efficient data integration and analysis in a federated ecosystem.

Why is it important?

Data annotation is crucial for improving the performance and accuracy of AI models, particularly in low-resource and low-shot machine learning scenarios[2]. Automating data annotation with LangChain can significantly improve the efficiency and resource use of AI pipelines, enabling them to operate effectively with limited data samples[2]. Fine-tuning large language models (LLMs) on annotated datasets is essential for achieving good performance on specific tasks[1]. By leveraging LangChain's LLM-based recommendations, human experts can streamline the annotation process, reducing the time and effort required for manual annotation[5]. This automation can lead to better performance, resource optimization, and inclusivity in AI solutions tailored to diverse environments and resource constraints[2].

The benefits of automating data annotation with LangChain include:

  1. Improved Performance: Automated annotation can significantly accelerate the process, enabling organizations to quickly analyze data and make informed decisions[1].

  2. Resource Optimization: Automation allows organizations to handle massive amounts of data without compromising quality[1].

  3. Inclusivity: Automated data annotation facilitates the development of AI solutions tailored for diverse environments and resource constraints[2].

  4. Quality Control: Data annotation enables quality control by ensuring that LLMs generate appropriate and accurate responses[1].

  5. Bias Mitigation: By carefully curating the training data, annotators can minimize biases that may lead to unfair predictions or discriminatory behavior[1].

In short, automating data annotation with LangChain can help address the limitations of LLMs and support fine-tuning them for specific applications[1]. These considerations motivate this proposal's focus on using LLM-based assistance to streamline annotation in the Neurobagel tool.
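To make this concrete, the sketch below shows the kind of LLM-based recommendation I have in mind: a minimal LangChain chain that proposes a plain-language description for a single column of a tabular dataset. This is only an illustrative sketch, not existing Neurobagel code; it assumes the langchain-openai package, an OpenAI API key in the environment, and an example model name, and the prompt wording and the suggest_description helper are hypothetical.

    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_openai import ChatOpenAI

    # Ask the model for a description given the column name and a few sample
    # values; the exact prompt wording is illustrative only.
    prompt = ChatPromptTemplate.from_template(
        "You are helping annotate a neuroimaging dataset.\n"
        "Column name: {column_name}\n"
        "Sample values: {sample_values}\n"
        "Suggest a one-sentence, human-readable description of this column."
    )

    # Assumes OPENAI_API_KEY is set in the environment; the model name is an example.
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    chain = prompt | llm | StrOutputParser()

    def suggest_description(column_name: str, sample_values: list[str]) -> str:
        """Return an LLM-suggested description for one dataset column."""
        return chain.invoke(
            {"column_name": column_name, "sample_values": ", ".join(sample_values)}
        )

    if __name__ == "__main__":
        # Example: a typical participants.tsv column.
        print(suggest_description("group", ["HC", "PD", "HC", "PD"]))

In the tool, such a suggestion would only ever be shown to the human annotator for review, never written to the data dictionary automatically.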

Project in detail

My approach begins with a thorough exploration of LLM agents, examining their capabilities and potential applications in data annotation. This entails studying the architecture of LLM agents, understanding their core components, and analyzing techniques for task decomposition. I will then focus on the requirements of agent-based assistance for data annotation, including challenges such as perception, reasoning, and adaptivity.

With this understanding, I will design and implement a prototype LLM-driven agent tailored for data annotation tasks, considering factors like core logic, memory modules, and planning mechanisms. Through testing and evaluation, I will assess the agent's accuracy, efficiency, and user-friendliness, comparing its output with manually annotated data and gathering feedback for improvement.

Finally, I will iteratively refine the LLM-driven agent and its integration with the Neurobagel annotation tool, maintaining continuous communication with stakeholders to ensure alignment, and documenting the development process for future reference and knowledge transfer.
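To make the planned core logic, memory module, and human-in-the-loop workflow more concrete, here is a minimal, self-contained sketch of the agent loop I have in mind. The names (AnnotationAgent, propose_annotation, confirm) are hypothetical and only illustrate the intended structure; the placeholder return value is where an actual LLM call (for example, a LangChain chain like the one sketched earlier) would plug in.

    from dataclasses import dataclass, field

    @dataclass
    class AnnotationAgent:
        """Sketch of an LLM-driven annotation assistant with a simple memory."""

        # Memory of user-confirmed annotations, keyed by column name, so the
        # agent stays consistent across similar columns and avoids re-asking.
        memory: dict[str, str] = field(default_factory=dict)

        def propose_annotation(self, column: str, sample_values: list[str]) -> str:
            """Return a proposed annotation for one column."""
            if column in self.memory:
                # Reuse a decision the user has already confirmed.
                return self.memory[column]
            # Placeholder for the LLM call; echoing the inputs keeps the
            # sketch runnable without any API key.
            return f"Column '{column}' with values like {sample_values[:3]}"

        def confirm(self, column: str, annotation: str) -> None:
            """Store a user-approved annotation so later proposals reuse it."""
            self.memory[column] = annotation

    if __name__ == "__main__":
        agent = AnnotationAgent()
        proposal = agent.propose_annotation("sex", ["M", "F", "F"])
        print("Proposed:", proposal)
        # In the real tool the user would review or edit the proposal in the
        # UI; here we simply accept it.
        agent.confirm("sex", proposal)
        print("Memory:", agent.memory)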

Project implementation and timeline

Minimal Set of Deliverables:

  • Integration of LLM-driven annotation assistant into the Neurobagel annotation tool.

  • Core logic implementation for initial annotation suggestions and variable mapping recommendations (a brief sketch follows this list).

  • Testing and debugging to ensure reliability and accuracy of the LLM-driven annotation assistant.

  • Documentation of the development process, including design decisions and implementation details.

  • User documentation and guidelines for utilizing the enhanced annotation tool.
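For the variable mapping deliverable referenced above, the sketch below outlines one possible shape of that core logic: the LLM is asked to map a column onto a small controlled list of categories and to return JSON that the annotation tool can surface as a suggestion for the user to accept or reject. The category list is only an example inspired by the kinds of standardized variables Neurobagel works with, not necessarily the tool's exact list, and the prompt, model name, and map_column helper are hypothetical.

    import json

    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_openai import ChatOpenAI

    # Example controlled vocabulary of annotation categories (illustrative only).
    CATEGORIES = ["Subject ID", "Age", "Sex", "Diagnosis", "Assessment Tool", "Other"]

    prompt = ChatPromptTemplate.from_template(
        "Map the dataset column below onto exactly one of these categories: "
        "{categories}.\n"
        "Column name: {column_name}\nSample values: {sample_values}\n"
        'Reply with JSON only, e.g. {{"category": "Age", "confidence": 0.9}}.'
    )
    chain = prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()

    def map_column(column_name: str, sample_values: list[str]) -> dict:
        """Return a category suggestion for one column, for human review."""
        raw = chain.invoke(
            {
                "categories": ", ".join(CATEGORIES),
                "column_name": column_name,
                "sample_values": ", ".join(sample_values),
            }
        )
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Fall back gracefully if the model does not return valid JSON.
            return {"category": "Other", "confidence": 0.0}

    if __name__ == "__main__":
        print(map_column("participant_age", ["23", "31", "45"]))

Returning a structured, low-confidence fallback rather than raising keeps the assistant advisory: the human annotator always makes the final call.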

If Time Allows:

  • Optimization of performance and efficiency of the annotation tool with LLM integration.

  • Further refinement of the LLM-driven annotation assistant based on user feedback and testing results.

  • Implementation of additional features for enhanced user experience, such as advanced heuristics and data visualization.

  • Integration of advanced LLM functionalities for more robust annotation suggestions and decision-making.

  • Exploration of potential extensions or integrations with other tools or platforms for broader usability.

Detailed Timeline:

Week 1-2 (May 1 - May 26): Familiarization and Setup

  • Get acquainted with the Neurobagel annotation tool's codebase.

  • Study existing functionalities and workflows.

  • Research LLM libraries and their applications.

Week 3-4 (May 27 - May 31): Research and Exploration

  • Explore LLM applications in data annotation.

  • Analyze requirements for LLM-driven assistance.

  • Investigate relevant literature and works.

Week 5-6 (June 1 - June 3): Design and Planning

  • Develop strategy for LLM integration.

  • Design core logic and tools.

  • Plan implementation process.

Week 7-8 (June 4 - June 17): Implementation and Testing

  • Implement an LLM-driven assistant.

  • Conduct preliminary testing and debugging.

Week 9-10 (June 18 - July 1): Refinement and Optimization

  • Refine LLM-driven assistant based on feedback.

  • Optimize performance and efficiency.

Week 11-12 (July 2 - July 8): Finalization and Documentation

  • Finalize implementation.

  • Document development process.

  • Prepare user documentation.

Your plan for communication with mentors

I plan to maintain regular communication with my mentor through weekly email updates and scheduled Google Meet calls. These channels will support ongoing discussion of project progress and prompt resolution of any issues, and the frequency of calls can increase as needed to stay aligned and address challenges quickly.

Candidate details

Motivation - Why do you want to do this Project?

I have been an active contributor to several open-source projects since the start of my freshman year, and I have also contributed to several institute projects. I have always wanted to try my hand at the field that lies at the intersection of neuroscience and artificial intelligence. Although I gained some exposure to artificial intelligence and the use of LLMs and related libraries such as LangChain through university coursework and hackathons, I have not yet had the chance to work in neuroscience. I believe this project will give me new insights and introduce me to the fundamentals of neuroscience and its related use cases. Additionally, I participated in GirlScript Summer of Code 2023 as an open-source contributor, working remotely from Mumbai, India, from May 2023 to July 2023. During this time, I implemented a novel charting component using D3.js, giving users an intuitive and customizable way to visualize complex datasets within the library, and I integrated a real-time code collaboration feature using WebSockets and CodeMirror, enabling seamless pair programming and code reviews within the platform.

Match - tell us about something you've worked on in the past that would make you a good candidate for this project

I have a solid foundation in Python, built since the start of my academic journey, along with working knowledge of libraries such as NumPy and SciPy. I am also comfortable in Linux environments, including command-line work, which should help me pick up Docker-related tasks quickly given the extensive documentation and tutorials available. This combination of skills should allow me to get up to speed smoothly and contribute effectively to an international project like this one.

You can apply for up to three projects. Is this the only project that you will apply for?

No, this is not the only project I am applying for.

Working time - how many hours per week do you plan to work, and how will you divide your time?

About 40 hours per week, which works out to roughly 6 to 8 hours per day.

Do you have any other plans for the work period (school work, another job, planned vacation)? If so, how do you plan to combine them with your work?

My 12-week summer vacation begins on 10 May. I do not have any major commitments during the GSoC period, so this project will be my top priority. My next semester starts around the last week of August, and the academic workload is light in that time frame, so I will be able to devote time to the last few weeks of the coding period as well.

References

[1] https://megagon.ai/llms-as-data-annote-p1-challs-opps/

[2] https://www.xenonstack.com/blog/llms-revolutionizing-data-annotation-for-the-ai-age

[3] https://www.larksuite.com/en_us/topics/ai-glossary/data-annotation-for-llms

[4] https://arxiv.org/abs/2402.13446

[5] https://ubiai.tools/revolutionizing-machine-learning-the-role-of-data-annotation-in-llm-projects-success/

[6] https://kili-technology.com/large-language-models-llms/data-labeling-and-large-language-models-training

[7] https://www.superannotate.com/blog/data-annotation-guide#large-language-models-llm-annotation