Winning the Best AI Vision Hack at Antler
Samad Ahmed
I went into this hackathon with no expectations. I was just looking to have a good time and learn something new. I'm glad I went. I ended up meeting many amazing people, hacking on an interesting problem, and winning the Best AI Vision Hack at the end of the weekend.
What Did We Build?
We built a tool that uses AI to help people with visual impairments browse the web more easily. Specifically, we built a Chrome extension that captured snapshots of the current webpage and used AI to highlight the important elements on the page, along with voice-based navigation to perform actions such as clicking and filling out forms.
How Did We Build It?
The extension itself essentially had two parts. First, we needed to take screenshots of the page. Second, we needed to analyze the structure (HTML) of the page. A tiny bit of JavaScript walked the DOM, and Google's Gemini model digested the combined text and image data to highlight the important elements. From there, a series of LLMs, working as a multi-agent system, navigated the page and performed actions as needed.
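To give a rough idea of how those pieces fit together, here is a minimal sketch of the capture-and-analyze step. It is not our exact code: it assumes a Manifest V3 service worker with the activeTab and scripting permissions, and the model name, prompt, and helper names like summarizeDom and analyzePage are illustrative.

```js
// background.js (Manifest V3 service worker) - illustrative sketch, not the shipped code.
const GEMINI_URL =
  'https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent';

// Build a compact summary of interactive elements; this function runs in the page context.
function summarizeDom() {
  return [...document.querySelectorAll('a, button, input, select, textarea, [role="button"]')]
    .slice(0, 200) // keep the prompt small
    .map((el, i) => ({
      id: i,
      tag: el.tagName.toLowerCase(),
      text: (el.innerText || el.value || el.getAttribute('aria-label') || '').trim().slice(0, 80),
      alt: el.getAttribute('alt') || null,
    }));
}

async function analyzePage(tabId, apiKey) {
  // 1. Screenshot of the visible tab.
  const screenshotDataUrl = await chrome.tabs.captureVisibleTab({ format: 'png' });

  // 2. DOM summary, gathered by injecting summarizeDom into the page.
  const [{ result: elements }] = await chrome.scripting.executeScript({
    target: { tabId },
    func: summarizeDom,
  });

  // 3. Ask Gemini which elements matter, combining the text and image parts.
  const body = {
    contents: [{
      parts: [
        { text: 'Given this screenshot and element list, return the ids of the most important interactive elements as JSON.' },
        { text: JSON.stringify(elements) },
        { inline_data: { mime_type: 'image/png', data: screenshotDataUrl.split(',')[1] } },
      ],
    }],
  };

  const res = await fetch(`${GEMINI_URL}?key=${apiKey}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  const data = await res.json();
  return data.candidates?.[0]?.content?.parts?.[0]?.text;
}
```

The model's reply then drives the highlighting on the page and feeds the agents that handle navigation.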
Why Did We Build It?
In the demo, I showed how the extension handled situations where images on sites had faulty or non-descriptive alt text: it filled in the gaps and provided a better experience than existing tools in the space, which mostly operate at the OS level. In addition, actions such as clicking buttons and filling out forms could be performed with natural language or with keyboard shortcuts, and the two could be combined for efficiency. Together, these give people with visual impairments an in-browser experience that is far more efficient than existing tools today.
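For a sense of how the natural-language actions could work, here is another hedged sketch (again, not our exact code): a content script keeps live references to the interactive elements, the spoken command is transcribed with the Web Speech API, and the LLM's structured reply is turned into a click or a form fill. The planAction helper and the action shape are assumptions for the example.

```js
// content.js - illustrative sketch of the voice-to-action loop.
// interactiveEls holds live element references, indexed the same way as the
// summary sent to the model in the earlier sketch.
const interactiveEls = [
  ...document.querySelectorAll('a, button, input, select, textarea, [role="button"]'),
];

// Hypothetical: ask the background worker (which talks to the LLM) to turn a
// transcript into a structured action like { type: 'click', id: 12 } or
// { type: 'fill', id: 7, value: 'jane@example.com' }.
async function planAction(transcript) {
  return chrome.runtime.sendMessage({ type: 'plan', transcript });
}

// Execute the structured action the model chose.
function performAction(action) {
  const el = interactiveEls[action.id];
  if (!el) return;
  if (action.type === 'click') {
    el.click();
  } else if (action.type === 'fill') {
    el.focus();
    el.value = action.value;
    // Fire an input event so frameworks like React notice the change.
    el.dispatchEvent(new Event('input', { bubbles: true }));
  }
}

// Voice input via the Web Speech API (Chrome exposes it as webkitSpeechRecognition).
const recognition = new webkitSpeechRecognition();
recognition.onresult = async (event) => {
  const transcript = event.results[0][0].transcript;
  const action = await planAction(transcript);
  performAction(action);
};
recognition.start();
```

Keyboard shortcuts can trigger the same performAction path directly, which is how combining the two stays efficient.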