Context

Over the last two weeks I found myself in multiple discussions about LLMs, AI, Microsoft Copilot and their use cases, with people ranging from academia to software sales. Last week I ended up investigating the Signal.org code base, and I took the help of an LLM-based tool to speed up discovery. I thought it would be a good idea to document what I was doing, both to showcase a concrete use case amid the hype around the technology and perhaps to help someone.

Task at hand

Signal is an open-source communication tool and arguably one of the most secure messaging tools available. In this case I wanted to investigate whether Signal could meet a secure, private communication requirement where the whole stack, including the back-end instant messaging server and the Android and web clients, runs in a self-hosted environment.

Typically, when investigating software like this, you can read reviews and follow documentation and setup instructions. But Signal ships with no self-hosting documentation at all; by the looks of it, the software is not meant to be used as a self-hosted solution. That means the investigation involves studying the code to understand the underlying technologies, dependent services, and dependencies on third-party products. The server software is written in Java, and my expectation was that a FOSS database such as PostgreSQL, a caching layer and a message queue would be involved, in addition to Google's FCM and Apple's push notification services.

I fired up Visual Studio Code with the AI coding assistant Cody. The tool offers a handful of LLMs, but in the free version I had Claude 3.5 Sonnet available. After browsing through the code, I could query the code base with the help of the assistant.

The tool provided a good summary of the codebase, including the technologies in use. The biggest surprise was the mention of Amazon DynamoDB instead of a FOSS database, which quickly led to the conclusion that setting up the service in a self-hosted environment may not be easy. Looking a little deeper into the codebase made it clear that the software uses the AWS Java SDK and relies heavily on DynamoDB. In other words, one of the biggest dependencies for self-hosting the software was identified in a few minutes with the help of an LLM tool.

With a high-level idea of the project in hand, I decided to look into the tests, which are a good place to find fine-grained details about any project. I was able to run the tests quickly and found errors in the logs. Usually such errors take time to decipher and need a deep understanding of the stack involved.

But this time, I turned to the force:

It took 2 queries against the error log to identify the root cause of the error and the service the source code was expecting.

I could address the dependency in no time. It is important to note that I had never come across this particular dependency (statsd) before, and even then I could find my way around in less than 10 minutes.

Progress

In a matter of a few hours, a high-level listing of the dependencies and an overview of the code base were completed. The LLM was not able to help with queries like finding a FOSS alternative to the Amazon DynamoDB API, but it fared really well when presented with various errors. It also failed to list all of the important functionality and dependencies.

(The Cody code assistant helped with numerous runtime errors along the way.)

At the end of a few hours, the following was achieved:

  • A high-level understanding of the code base
  • A list of dependencies
  • Surprisingly good support in parsing errors and providing feedback

While the code assistant failed to list details like the VoIP services and the extensive use of certain technologies such as gRPC, the support it provided in navigating the code base for the initial assessment was commendable.

Summary

LLM technology, when used effectively, can assist in a myriad of tasks. Code assistants generate queryable environments on top of the code and are very efficient. In this example the productivity gain is clearly visible: tasks like understanding an error log, which could otherwise take a long time and often the collaboration of multiple engineers, were handled in a matter of minutes. For tasks where the tool was asked to analyse existing data such as error logs, it performed really well. There are numerous tools being introduced now which claim to assist with end-to-end software development, but a review of such tools is beyond the scope of this post. LLM-based tools are mature enough to support software development and provide a tangible productivity boost when used efficiently.

Introduction

India has taken the first step towards guiding its citizens, policy makers and businesses in regulating and managing the new era of opportunities opened up by generative-AI-based innovations. We have been working hard to promote innovation, promote our artists and protect our citizens from the threats of misinformation and fake news. The new guidelines are expected to provide the foundation that defines our success in navigating this fast-paced technological landscape. As the very first definition of an emerging technological innovation, the definitions and procedures established in the Digital Media Ethics Code will be widely referenced and reused in other areas of policy making, and are therefore of the utmost importance.

Expectations

A complete, unambiguous definition of the content being handled, along with established context and standard operating procedures for content creators, users, advertisers, content creation tools, social media intermediaries and enforcement.

  1. Simple guidelines for non-tech-savvy consumers and users
  2. Utmost importance given to freedom of expression and the rights of the artists creating the content
  3. Technical accuracy and clearly defined procedures

The definition

“(wa) ‘synthetically generated information’ means information which is artificially or algorithmically created, generated, modified or altered using a computer resource, in a manner that such information reasonably appears to be authentic or true;”

Terms used:

Algorithmically [Terms], expanded in the table below:

Procedures

The procedures must be invoked by various personas, namely the content creator, the tools or platforms used for content creation, and the intermediaries hosting and publishing the content.

  1. Content Creator

The artist, aka the content creator, is the most important person in the room: they decide the idea and intention behind the content, plan and choose how it is created, and are in the best position to adequately categorize the content.

No definition, guidelines or procedures are provided for the original content creator. This means creative artists, photographers, editors and similar professionals, who might be in a better position to provide accurate categorization of synthetic content, are not considered in the process. This gap, combined with the lack of clarity in the definition regarding photographs or videos edited with a computer resource, Computer Generated Imagery (CGI) and so on, makes it very difficult to establish a well-defined categorization of content.

This lack of clarity might adversely impact creative individuals whose content could be incorrectly labelled as "synthetic content". Artists are the least powerful and most vulnerable category of creators; their rights must be protected, yet they are left at the mercy of powerful platforms.

Questions:

  1. Is a photograph that is "modified" with an image editor and "reasonably appears to be authentic or true" synthetic content?
  2. Does an image or video "created using a computer resource" that "reasonably appears to be authentic or true", such as the tigers in Oscar-winning movies like RRR and Life of Pi, fall under the definition of "synthetic content"?

2. Intermediary offering Generation of Content

The procedure itself is well defined for platforms offering content generation services; however, the practical implications seem to have been missed. Who exactly falls under the category of content creation, and which content falls under the category, is also unclear due to the incompleteness and ambiguity of the original definition of synthetic content.

A few practical scenarios and questions:

  1. A platform offers end-to-end Computer Generated Imagery (CGI) as a service and does not use generative AI (LLM) technology. Does it classify as an "intermediary" under the definition of an intermediary offering generative content?
  2. Will end-to-end music generation platforms be required to insert the 10% notice into an hour-long piece of music their systems generate?
  3. Do the same procedures apply to movies and commercials created with CGI?
  4. For full-length movies and video content, does the 10% visual notice need to be shown for the entire duration of the video?

3. Intermediary hosting & publishing

The procedure for the intermediary does not include a Standard Operating Procedure (SOP) or acceptable minimum standards. This, combined with the ambiguity in the definition of synthetic content, creates huge challenges for the intermediary in categorizing content, introducing procedures to detect objectionable content and arriving at actionable conclusions.

Conclusion

Rather than considering the challenge in a silo, a holistic approach is needed: assess the impact on entire industries such as media and advertising, the startup ecosystem and business processes, and come up with a comprehensive plan to establish regulations. The definition of the new content category must be unambiguous, complete and context-free so that it can be used across different scenarios and industries. The guardrails must empower startups and innovation, and must take care of the rights of content creators.

Recommendations

  1. The concerns, rights and freedom of expression of artists must be given the utmost importance in defining the content categories and processes
  2. A complete, unambiguous definition of the technical terminology and context must be established
  3. In the case of intermediaries, well-defined standard operating procedures must be established with participation from subject-matter experts and industry. Precedent for industry participation was established in Re: Prajwala.

References & Terms

  1. https://web.archive.org/web/20251110072231/https://www.meity.gov.in/static/uploads/2025/10/9de47fb06522b9e40a61e4731bc7de51.pdf
  2. CGI: https://en.wikipedia.org/wiki/Computer-generated_imagery
  3. Generative Artificial Intelligence: https://csrc.nist.gov/glossary/term/generative_artificial_intelligence
  4. Re: Prajwala: https://wilmap.stanford.edu/entries/re-prajwala

Now that generative AI has found one of its major use cases in the software development workflow, new tools are entering the market every week. Every SaaS product seems to be adding MCP servers and publishing them; that is a topic for another post! Code editors are adding AI support at various levels. Zed Editor even took the step of adding a flag to disable AI completely!

Editors and plugins run in sandboxes, which provides some level of security. But browser access, CLI access and so on are still possible, and supply-chain attacks on the MCP plugins added to the editors can introduce bugs. Then there are AI hallucinations or mistakes that alter the file system. With developer machines often holding SSH keys and similar credentials, we are indeed in uncharted waters.

How are typos handled?

I have Cline with the Puppeteer MCP installed in VS Code. In my instructions to Cline I used the spelling "pupetteer"; Cline happily passed this on to Claude Sonnet 4 and, needless to say, the MCP was promptly invoked, Chrome was launched and browsing started. (Obviously, it asked me for permission.)

This is arguably the first time development environments have become "stupid" enough to hallucinate and decide on actions of their own.

Where do we go from here?

Make sure to isolate development environments and use only known MCP servers and plugins with the IDE. Watch out for supply-chain attacks.

Stay Safe!

Passkeys are a new brand name for WebAuthn + CTAP2 from the FIDO2 standards. The underlying technology uses FIDO2-compliant public-key cryptography targeted at web applications. Passkeys take the FIDO2 authentication method, previously available only to people using hardware devices like YubiKeys, and make it available to everyone by having mobile phones and popular OSes support storage and syncing of the private keys used for authentication.

Some keywords to quickly understand the technology:

WebAuthn/FIDO2 = Passkey

Discoverable Credential or discoverable WebAuthn/FIDO2 Credential = Passkey

FIDO2 devices and passkeys generate public-private key pairs and store the private key securely. The public key is stored by a FIDO2-compliant backend service, which then allows authentication. When a user wants to access a website on a computer using passkeys, assuming the user has previously signed up using a mobile device, the browser asks the mobile device to authenticate on its behalf. Such communication between mobile devices and computers uses CTAP2, and the transport protocol is Bluetooth in the case of mobile devices (both iOS and Android). In the case of devices like NFC rings, the transport is NFC instead.
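To make the registration half of that flow concrete, here is a minimal browser-side sketch. The relying party name, user handle and challenge below are placeholder values that a real backend would issue; this is an illustration, not a production implementation.

// Registration sketch (browser side): the authenticator generates the key
// pair and hands back the public key. Challenge and user handle are
// placeholders that a real backend would supply.
const challenge = crypto.getRandomValues(new Uint8Array(32));
const userId = new TextEncoder().encode("user-1234");

const credential = await navigator.credentials.create({
  publicKey: {
    challenge,
    rp: { id: "example.com", name: "Example" },           // the relying party (the website)
    user: { id: userId, name: "alice@example.com", displayName: "Alice" },
    pubKeyCredParams: [{ type: "public-key", alg: -7 }],  // -7 = ES256, ECDSA over P-256
    authenticatorSelection: {
      residentKey: "required",      // discoverable credential, i.e. a passkey
      userVerification: "preferred",
    },
  },
});

// The backend stores the credential id and public key for later sign-ins.
console.log("Created credential:", credential?.id);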

What is FIDO2 ?

Fast Identity Online (FIDO) is a standard for handling passwordless authentication on mobile devices, in web browsers and in operating systems.

FIDO2 handles:

  • Authentication
  • Registration

What is WebAuthn ?

A W3C standard for web authentication supported by major browsers. WebAuthn makes it possible to sign in to websites using the FIDO2 authentication method.

What is CTAP2 ?

Client to Authenticator Protocol: it establishes the connection with an external authenticator such as a mobile phone or a security key.

The underlying transports for CTAP are NFC, USB HID and Bluetooth (Bluetooth Smart / BLE).

Passkeys vs YubiKey

Passkeys are copyable, syncable, shareable, multi-device FIDO2/WebAuthn credentials; a YubiKey keeps the private key bound to a single piece of hardware.

Incorporating Passkey 

To incorporate passkeys, we need support for the technology on both sides: authentication in the frontend and a back-end server that is FIDO2 compliant. The back-end essentially stores the public keys corresponding to a device's passkeys and uses them to verify the authentication requests it receives from authenticators such as mobile devices or dedicated multi-factor tools like a YubiKey.

In a nutshell, any application wanting to support passkeys must incorporate frontend support and use a FIDO2-capable backend. Popular services like Auth0 already offer passkey support.
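For the frontend half of authentication, a sign-in with a discoverable credential looks roughly like the sketch below; the rpId and challenge handling are simplified placeholders, and the real verification happens on the backend.

// Authentication sketch (browser side): the relying party issues a fresh
// challenge and the authenticator signs it with the stored private key.
// The challenge here is a placeholder a real backend would supply.
const challenge = crypto.getRandomValues(new Uint8Array(32));

const assertion = await navigator.credentials.get({
  publicKey: {
    challenge,
    rpId: "example.com",            // must match the site the passkey was registered for
    userVerification: "preferred",
    // allowCredentials is omitted: with discoverable credentials (passkeys)
    // the authenticator locates the matching credential itself.
  },
});

// The backend verifies the signed assertion against the stored public key
// and the challenge it issued before creating a session.
console.log("Signed in with credential:", assertion?.id);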

Blockchains and Passkey

Since both blockchains and passkeys use public-key cryptography, one obvious question is whether it is possible to use a passkey to access blockchains. Additionally, is it possible to use a passkey with a popular Ethereum wallet like MetaMask?

The answer depends on the signature algorithm, specifically the elliptic curve used to generate the keys in passkeys (WebAuthn) and in the blockchains.

FIDO2 uses the elliptic curves from NIST FIPS 186-4, "Appendix D: Recommended Elliptic Curves for Federal Government Use". secp256r1, aka NIST P-256, is the suggested curve, but chains like Ethereum and Bitcoin use secp256k1. The EOSIO blockchain, on the other hand, added support for secp256r1.
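To make the mismatch concrete: both curves exist side by side in everyday crypto libraries, yet keys on one curve cannot verify signatures produced on the other. A small, purely illustrative Node.js sketch, not tied to any wallet or FIDO2 library:

import { generateKeyPairSync } from "node:crypto";

// secp256r1 (prime256v1 / NIST P-256) is what FIDO2 authenticators use for
// ES256; secp256k1 is the curve Bitcoin and Ethereum use. Both are available
// here, but they are not interchangeable.
const p256 = generateKeyPairSync("ec", { namedCurve: "prime256v1" });
const k1 = generateKeyPairSync("ec", { namedCurve: "secp256k1" });

console.log(p256.publicKey.asymmetricKeyDetails?.namedCurve); // "prime256v1"
console.log(k1.publicKey.asymmetricKeyDetails?.namedCurve);   // "secp256k1"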

The Polkadot blockchain uses the Schnorrkel signature scheme over Curve25519, which are relatively new and not yet included in the NIST or FIDO2 specifications.

References

  1. https://www.yubico.com/blog/a-yubico-faq-about-passkeys/
  2. https://ethereum.stackexchange.com/questions/82530/why-ethereum-based-dapps-dont-use-webauthn
  3. EOSIO support for secp256r1 :  https://web.archive.org/web/20230509092832/https://github.com/EOSIO/eos/pull/7421 & https://web.archive.org/web/20230508180031/https://github.com/Gimly-Blockchain/eosio-did-spec/issues/5
  4. Safe Curves : https://web.archive.org/web/20230423144205/https://safecurves.cr.yp.to/
  5. End user doc https://fidoalliance.org/how-fido-works/
  6. Awesome Webauthn: https://github.com/herrjemand/awesome-webauthn
  7. Encrypting data with a WebAuthn https://blog.millerti.me/2023/01/22/encrypting-data-in-the-browser-using-webauthn/
  8. largeBlob : https://stackoverflow.com/questions/74705538/webauthn-api-ignores-most-extensions-largeblob
  9. credBlob https://github.com/w3c/webauthn/issues/1613
  10. https://lists.webkit.org/pipermail/webkit-dev/2021-March/031755.html

In a session about digital forensics, we happened to discuss the timestamps used in digital forensic work, and the topic immediately captured my attention.

I had a few questions and thought of writing them down.

  • How is time captured?
  • What is the official source of time?
  • The Bharatiya Nagarik Suraksha Sanhita (BNSS) suggests the use of a mobile phone to record the search and seizure procedure. Which NTP servers should those phones be connected to at the time of recording the videos?
  • The Bharatiya Sakshya Adhiniyam (BSA) allows digital evidence
  • What if the evidence captured has a wrong timestamp?

Time in connected devices

Computers, mobile phones and most devices connected to the internet receive time from the local ISP or via time synchronization servers operated by companies like Apple, Microsoft and others. These Network Time Protocol (NTP) servers are not always reliable.

[An example of an issue with time servers]

CERT-In directions require organisations to synchronise with certain specific NTP servers, as clarified in the FAQs:

Is it required to synchronise clocks only with NTP Servers of NPL and NIC? Is it now required to set system clocks in Indian Standard Time (IST) only?

Ans.: The requirement of synchronising time is stipulated to ensure that only standard time facilities are used across all entities. Organisations may use accurate and standard time source other than National Physical Laboratory (NPL) and National Informatics Centre (NIC) as long as the accuracy of time is maintained by ensuring that the time source used conforms to time provided by NTP Servers of NPL and NIC.
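Whichever servers an SOP mandates, it is easy to check how far a machine's clock has drifted from a given NTP source. Below is a minimal SNTP query sketch; pool.ntp.org is only an illustrative server, and the NPL/NIC servers mentioned above could be substituted.

import dgram from "node:dgram";

// Minimal SNTP query: ask an NTP server for its time and compare it with the
// local clock. No timeout or retry handling, for brevity.
const NTP_UNIX_OFFSET = 2208988800; // seconds between the NTP epoch (1900) and the Unix epoch (1970)

function queryNtp(server: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const packet = Buffer.alloc(48);
    packet[0] = 0x1b; // LI = 0, Version = 3, Mode = 3 (client)
    const socket = dgram.createSocket("udp4");
    socket.once("error", reject);
    socket.once("message", (msg) => {
      socket.close();
      // Transmit Timestamp: seconds at byte 40, fraction at byte 44
      const seconds = msg.readUInt32BE(40) - NTP_UNIX_OFFSET;
      const fraction = msg.readUInt32BE(44) / 2 ** 32;
      resolve((seconds + fraction) * 1000); // ms since the Unix epoch
    });
    socket.send(packet, 123, server);
  });
}

queryNtp("pool.ntp.org").then((serverTimeMs) => {
  const skewMs = serverTimeMs - Date.now();
  console.log(`Clock skew against NTP server: ${skewMs.toFixed(0)} ms`);
});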

Wrong time

Incorrect time on digital devices like mobile phones or computers results in wrong timestamps in artifacts, such as:

  • Wrong file creation timestamps
  • Wrong access or modification timestamp
  • Metadata of photos/videos ends up with wrong timestamp
  • Log files created may have wrong timestamps

Evidence or other data without correct timestamps can place events, and thus the related facts, in the past or the future.
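As a small illustration of how many artifacts simply inherit the system clock, here is what the filesystem itself records for a single file (the path is a made-up example):

import { statSync } from "node:fs";

// Every one of these values comes from the device's own clock at the moment
// the event happened; a skewed clock silently skews them all.
const stats = statSync("/tmp/evidence.jpg");
console.log("Created  (birthtime):", stats.birthtime.toISOString());
console.log("Modified (mtime)    :", stats.mtime.toISOString());
console.log("Accessed (atime)    :", stats.atime.toISOString());
console.log("Changed  (ctime)    :", stats.ctime.toISOString());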

Creating a computing device in the past

Simple cases like clock skew or dead BIOS batteries are well known, and standard operating procedures probably exist to account for them. External anchors can be used to detect clock skew and reconstruct the correct timeline. However, there are more interesting scenarios that can be investigated:

  • Boot a device from an older version of Operating System
  • Use a LiveCD or USB device
  • Make sure there is no BIOS battery
  • Change the MAC address of the device to arbitrary values before connecting to the network
  • Disable NTP

I am deliberately not being specific or elaborate here, but I hope we are very serious about "time"!

Over the last three months or so I have been receiving emails and LinkedIn messages that generally talk about needing help with a software project. The initial emails simply shared a link to a Bitbucket code base and asked for help with it. The profiles involved were all new, "Harvard-educated" individuals and easy to identify as a scam.

The recent ones are a bit more creative and come from LinkedIn profiles which are much more believable, with recommendations and a long history.

Unlike the initial contacts, who shared various code bases in Python or NodeJS on the very first day, the latest messages follow a multi-day approach.

On the third day, I got the code base, and this time they also had a functional specification!

What's in the code?

The code provided seems to be slightly modified boilerplate code (LLM-generated?) in React or JavaScript, with instructions to run it locally. Somewhere in the code base there is an encoded function which is either loaded from an external URL or embedded in one of the source files, as in the first screenshot.

(Screenshots: encoded malicious code being fetched from an external URL; a TypeScript codebase with encoded malicious code.)

I took pains to use a secure environment to run the first of the lot, which was attempting to access a blockchain wallet. The latest ones look different and appear to attempt downloading external code. (This needs further verification.)
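If such a repository must be examined at all, a quick pattern scan from a read-only checkout can flag obfuscated loaders before anything is opened in an IDE or executed. A rough sketch follows; the patterns are illustrative, not exhaustive.

import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";

// Illustrative patterns often seen in obfuscated loaders; tune as needed.
const suspicious = [
  /eval\s*\(/,
  /new Function\s*\(/,
  /atob\s*\(/,
  /String\.fromCharCode/,
  /child_process/,
];

function scan(dir: string): void {
  for (const entry of readdirSync(dir)) {
    if (entry === "node_modules" || entry === ".git") continue;
    const path = join(dir, entry);
    if (statSync(path).isDirectory()) {
      scan(path);
    } else if (/\.(js|jsx|ts|tsx|json)$/.test(entry)) {
      const source = readFileSync(path, "utf8");
      for (const pattern of suspicious) {
        if (pattern.test(source)) console.log(`${path}: matches ${pattern}`);
      }
    }
  }
}

// Scan the directory given on the command line, or the current one.
scan(process.argv[2] ?? ".");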

In any case, if you happen to get emails or messages requesting help with some code base, be careful and ignore them if you are not sure what it is all about. Do not run the code on your computers!

For the past few days a high-severity vulnerability impacting multiple GNU/Linux distributions has been going around and, as expected, it is in the CUPS printing stack.

Details can be found at www.evilsocket.net

Steps for ensuring your Debian GNU/Linux is not impacted

Check for cups-browsed with: systemctl status cups-browsed

root@host:~# systemctl status cups-browsed

cups-browsed.service
Loaded: not-found (Reason: No such file or directory)
Active: inactive (dead)


Let's scan the port: sudo nmap localhost -p 631 --script cups-info

One scan gave a core dump 😳

root@host:~# sudo nmap localhost -p 631 --script cups-info

Starting Nmap 7.01 ( https://nmap.org ) at 2024-09-27 11:40 UTC
Stats: 0:00:00 elapsed; 0 hosts completed (0 up), 0 undergoing Script Pre-Scan
nmap: timing.cc:710: bool ScanProgressMeter::printStats(double, const timeval*): Assertion `ltime' failed.
Aborted (core dumped)

But the port itself is closed

Starting Nmap 7.01 ( https://nmap.org ) at 2024-09-27 11:45 UTC
mass_dns: warning: Unable to determine any DNS servers. Reverse DNS is disabled. Try using --system-dns or specify valid servers with --dns-servers
Nmap scan report for localhost (127.0.0.1)
Host is up (0.000054s latency).
PORT    STATE  SERVICE
631/tcp closed ipp

Inspect the installed packages:

apt list --installed | egrep '(cups-browsed|libcupsfilters|libppd|cups-filters|ipp)'

libcupsfilters1/xenial-infra-security,now 1.8.3-2ubuntu3.5+esm1 amd64 [installed,automatic]

Look for CUPS-related packages: apt list --installed | grep cups

libcups2/xenial-infra-security,now 2.1.3-4ubuntu0.11+esm7 amd64 [installed,automatic]
libcupsfilters1/xenial-infra-security,now 1.8.3-2ubuntu3.5+esm1 amd64 [installed,automatic]
libcupsimage2/xenial-infra-security,now 2.1.3-4ubuntu0.11+esm7 amd64 [installed]

Disable & remove the services:

If printing and document management are not used on the server, remove the related packages as follows.

apt remove libcups2 libcupsfilters1 libcupsimage2 cups-browsed

These steps will make sure that these high-severity (rated 9.1) vulnerabilities are removed from the servers.

This was a trip to the wild after a long time, this time with a mirrorless camera (Nikon Z9, 200-500mm f/5.6). While the trip was hectic due to shuttling between multiple locations, we got a glimpse of quite a few animals.

One troubling observation though is the spread of Senna spectabilis.

The entire collection of photos is here: https://freebird.in/pics/Nagarhole_2024July/

This is a peculiar post about a nice little DNS service I came across a few days ago. While reviewing a pull request I encountered an address along the lines of https://192.168.1.56.nip.io. I couldn't find an immediate clarification; searching turned up the GitHub repo of the project, but I still couldn't understand how it worked.

Later our DevOps engineer had to explain to me in detail what it is and how it works!

The nice little utility service provides wildcard DNS entries covering the entire private IP address range: queries like NOTEXISTING.192.168.2.98.nip.io resolve to 192.168.2.98.

bkb@bkb-dev-server1:~/src/discovery$ dig NOTEXISTING.192.168.2.98.nip.io

; <<>> DiG 9.16.1-Ubuntu <<>> NOTEXISTING.192.168.2.98.nip.io
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12130
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;NOTEXISTING.192.168.2.98.nip.io. IN    A

;; ANSWER SECTION:
NOTEXISTING.192.168.2.98.nip.io. 86400 IN A     192.168.2.98

;; Query time: 583 msec
;; SERVER: 208.66.16.191#53(208.66.16.191)
;; WHEN: Mon Nov 27 03:50:17 AST 2023
;; MSG SIZE  rcvd: 76

This is a very useful trick if we don't want to edit /etc/hosts or its equivalent for tools or software running locally, or for scenarios where a real DNS record is required.

Learn more https://nip.io