Maxim Saplin (@msmxm)

LLMs: The New Wishmasters Of Deception And Cheating?

5m

AI hallucinations and deception are on the rise: o1 models cheat at chess, ignoring fair play to win at all costs. Like the demonic djinn in Wishmaster, AI grants wishes literally, with sinister twists. Can we trust our autonomous systems?

LLMs Struggle With Chess: Limitations And Implications

Maxim Saplin @msmxm

6m

LLMs struggle with complex tasks like chess due to a lack of creative problem-solving skills. They're essentially large look-up tables, incapable of strategic reflection or evaluation. New benchmarks are needed to assess their capabilities.

Autogen's Evolution: Microsoft's Rewrite Vs AG2 Fork

Maxim Saplin @msmxm

6m

Autogen's creators parted ways with Microsoft, which led to new products and a team split. Microsoft introduced a complete rewrite (0.4), while the community maintains the legacy 0.2 version. Autogen 0.4 will be merged into Semantic Kernel in 2025.

Phi-4 14B: Verbose And Error Prone In Real-Life Tests

Maxim Saplin @msmxm

6m

Phi-4 14B has been released, beating GPT-4o in math. Tested on LLM Chess Eval, it scored 0 wins and 30 draws against a random player. Its instruction-following consistency was poor: it used 6x more tokens and made 10x more mistakes than Gemma 9B.

Software Engineering And Web Development: LLM Inference Speed Tests

Maxim Saplin @msmxm

8m

Benchmarking LLM inference speed across a range of hardware, from Apple M1 Pro to AMD Ryzen 7 7840U and NVIDIA GeForce RTX 4090. Results show significant performance differences between CPU and GPU processing.

Flet: A Python Imperative UI Framework, Not Flutter

Maxim Saplin @msmxm

9m

Flet is not Flutter, despite using it behind the scenes. It's a cross-platform, server-driven, imperative UI framework for Python with its own library of controls rather than the standard Flutter UI library.

Python 3.13 Performance Benchmark: Newer Not Always Better

Maxim Saplin @msmxm

10m

Python 3.13RC1, tested on an M1 MacBook Pro, shows inconsistent results versus 3.11 and 3.12 in a CPU-bound Mandelbrot set calculation benchmark.