Rearchitecting to take advantage of the latest in mlx_lm #36

uogbuji · 2025-01-14T00:11:44Z

No description provided.

… generate_step in mlx

… generate_step in mlx for use by Model.completion

…e-integrate-mlxlm

Use mlx_lm.models.cache._BaseCache in signature and update method name to reflect how the stream_generate method might be a better level of abstraction as it returns GenerationResponse objects as token production metadata (which completion tries to do) and how it also takes a draft model for speculative decoding.

…certainly a poor alternative to the new speculative decoding

…mple_utils This is the only convention for a logits processor in mlx_lm AFAICT, which repetition penalties are based on. It looks like the logits are expected to be updated *and* returned (the += operator).

uogbuji · 2025-01-14T14:40:56Z

Bookmark: I had cause to look at some of the data structure bits in llm-structured-output. I see that it has its own search trie implementation, presumably optimized for use with LLM tokens, as i's named TokenTrie. I still wonder whether we can replace that eventually with a C-optimized Trie library (or derivative therefrom).

Perplexity offers:

Yes, there are efficient Python implementations of search tries, including some with C-based optimizations for improved performance.

Python-based Implementations

Pygtrie: This is a pure-Python implementation of a trie data structure, which offers good performance for many use cases [Archived in Jan 2023]

Marisa-trie: While not a pure Python implementation, it provides Python bindings to an efficient C++ trie library. It's particularly memory-efficient and fast for large datasets

C-optimized Implementations

Datrie: This library provides Python bindings to a C-based double-array trie implementation. It's known for its speed and memory efficiency, especially for large datasets [5 years since last commit]

DAWG: Another C-based implementation with Python bindings, DAWG (Directed Acyclic Word Graph) is a data structure similar to a trie but with even better memory efficiency for certain types of data [5 years since last commit]

By leveraging these efficient implementations, developers can create high-performance systems for string-based operations, particularly for applications requiring fast prefix-based searching or autocomplete features.

Weirdly all of P's citations are to https://anvil.works/forum/t/python-trie-implementation-efficiently-search-trie-based-on-prefixes/3074 which doesn't even mention several of these libs

uogbuji · 2025-01-14T16:29:49Z

Now that we've switched the logits biasing from the llm-structured-output custom function util.bitmap.bias_logits to the newer MLX_LM make_logit_bias_processor we're running into a bug with the logic. We did at the same time convert it from numpy to MLX. It seems the following logic

            if accepted_token_count <= highest_token_accepted / 2:

is always true, which ends up making all the tokens impossible (hmm shouldn't this be considered a failure condition in the code?)

I'm currently investigating. I spun up this simple test driver, country_extract.py:

'''
This script demonstrates how to use the toolio library to interact with a model that extracts countries from a sentence
It also shows how you can set a random seed for reproducible results
'''
import sys
import asyncio
import mlx.core as mx
from toolio.llm_helper import local_model_runner

# We'll be needing to print large numbers, so we remove the maximum number of digits
sys.set_int_max_str_digits(0)

RANDOM_SEED = 42

toolio_mm = local_model_runner('mlx-community/Mistral-Nemo-Instruct-2407-4bit')

SCHEMA_PY = {
    'type': 'array',
    'items': {
        'type': 'object',
        'properties': {
            'name': {'type': 'string'},
            'continent': {'type': 'string'}
        },
        'required': ['name', 'continent']
    }
}

async def say_hello(tmm):
    mx.random.seed(RANDOM_SEED)
    sentence = 'Adamma went home to Nigeria for the hols'
    prompt = f'Which countries are mentioned in the sentence \'{sentence}\'?\n'
    prompt += 'Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#'
    # The complete() method accepts a JSON schema in string form or as the equivalent Python dictionary
    print(await tmm.complete([{'role': 'user', 'content': prompt}], json_schema=SCHEMA_PY))

asyncio.run(say_hello(toolio_mm))

Notice the random seed setting. In the main branch of Toolio I'll update util.bitmap.bias_logits to trace state, and do the same for the logit_bias_processoron this branch. That should give us good debug data.

uogbuji · 2025-01-14T16:42:38Z

OK here is the diff for tracing the bias_logits case:

diff --git a/pylib/vendor/llm_structured_output/util/bitmap.py b/pylib/vendor/llm_structured_output/util/bitmap.py
index de0fdc9..0600425 100644
--- a/pylib/vendor/llm_structured_output/util/bitmap.py
+++ b/pylib/vendor/llm_structured_output/util/bitmap.py
@@ -45,7 +45,6 @@ def enumerate_set_bits(bitmap: int) -> Iterable[int]:
         yield highest_bit
         bitmap -= 1 << highest_bit
 
-
 def bias_logits(np, logits, accepted_token_bitmap):
     """
     Apply a -inf bias to tokens that will not be accepted.
@@ -55,6 +54,8 @@ def bias_logits(np, logits, accepted_token_bitmap):
     vocab_size = logits.shape[0]
     highest_token_accepted = highest_bit_set(accepted_token_bitmap)
     accepted_token_count = count_set_bits(accepted_token_bitmap)
+    with open('bias_logits_trace.txt', 'a') as f:
+        f.write(f'{accepted_token_bitmap}, {highest_token_accepted}, {accepted_token_count}\n')
     # Check whether there's more tokens to be rejected or to be allowed, then do what's less work.
     if accepted_token_count <= highest_token_accepted / 2:
         bias = np.full(vocab_size, -inf)

And woooh yeah, when you turn an LLM's entire vocab into a bitmap, you get some huge integers: bias_logits_trace.txt

uogbuji · 2025-01-14T17:13:40Z

Here is the equiv diff from 33-re-integrate-mlxlm:

diff --git a/pylib/schema_helper.py b/pylib/schema_helper.py
index 764922a..3d1d38a 100644
--- a/pylib/schema_helper.py
+++ b/pylib/schema_helper.py
@@ -168,6 +168,8 @@ class Model:
             vocab_size = logits.shape[0]
             highest_token_accepted = highest_bit_set(self.accepted_token_bitmap)
             accepted_token_count = count_set_bits(self.accepted_token_bitmap)
+            with open('logit_bias_processor_trace.txt', 'a') as f:
+                f.write(f'{self.accepted_token_bitmap}, {highest_token_accepted}, {accepted_token_count}
\n')
             # Check whether there's more tokens to be rejected or to be allowed, then do what's less wor
k.
             if accepted_token_count <= highest_token_accepted / 2:
                 bias = mx.full(vocab_size, -inf)
@@ -185,6 +187,7 @@ class Model:
                 indices = mx.array([*enumerate_set_bits(rejected_token_bitmap)])
                 bias[indices] = -inf
             logits += bias
+            # print(f'{logits==mx.full(vocab_size, -inf)}')
             return logits
 
         return logit_bias_processor
@@ -220,7 +223,7 @@ class Model:
         self.curr_token_acceptor = self.json_schema_acceptor_driver_factory(schema, encapsulated) if schema else None
         self.accepted_token_bitmap = self.curr_token_acceptor.select_valid_tokens()
 
-        del kwargs['logits_processors']
+        # del kwargs['logits_processors']
         print(f'{kwargs=}')
         logits_generator = stream_generate(self.model, self.tokenizer, prompt_tokens, **kwargs)

Which results in the following trace: logit_bias_processor_trace.txt

Looking at line 1 alone, they match, and since the advance branch is not succeeding in a single step, that means self.accepted_token_bitmap is not the issue.

The smoking gun seems to be

                indices = mx.array([*enumerate_set_bits(self.accepted_token_bitmap)])

vs

        indices = np.array([*enumerate_set_bits(accepted_token_bitmap)])

I'll adjust the tracing to check enumerate_set_bits(self.accepted_token_bitmap)

uogbuji · 2025-01-14T18:07:09Z

Weird! In both cases I'm seeing the enumerate_set_bits(accepted_token_bitmap) result as [57096, 33966, 4344, 1091].

So in each I did the equivalent of

                with open('logit_bias_processor_trace.txt', 'a') as f:
                    sep = '-' * 80
                    bits = list(enumerate_set_bits(self.accepted_token_bitmap))
                    f.write(f'{self.accepted_token_bitmap}\n{sep}\n{bits}\n{highest_token_accepted}\n{accepted_token_count}\n')
                indices = mx.array([*enumerate_set_bits(self.accepted_token_bitmap)])
                bias[indices] = 0
                print(f'{bits[0]=} {bias[bits[0]]=}')

And indeed in both cases I get, for the first entry, as expected:

bits[0]=57096 bias[bits[0]]=array(0, dtype=float32)

I checked bias[0] in both cases, to be sure I'm not losing my marbles, and got, as expected bias[0]=array(-inf, dtype=float32)

Next marbles check was to see what was happening as the logits were getting bias added:

            bits = list(enumerate_set_bits(self.accepted_token_bitmap))
            if accepted_token_count <= highest_token_accepted / 2:
                bias = mx.full(vocab_size, -inf)
                with open('logit_bias_processor_trace.txt', 'a') as f:
                    sep = '-' * 80
                    f.write(f'{self.accepted_token_bitmap}\n{sep}\n{bits}\n{highest_token_accepted}\n{accepted_token_count}\n')
                indices = mx.array([*enumerate_set_bits(self.accepted_token_bitmap)])
                bias[indices] = 0
                print(f'{bits[0]=} {bias[bits[0]]=} {bias[0]=}')
            else:
                bias = mx.concatenate(
                    [
                        mx.full(highest_token_accepted + 1, 0),
                        # All tokens above the highest accepted token are rejected.
                        mx.full(vocab_size - highest_token_accepted - 1, -inf),
                    ]
                )
                rejected_token_bitmap = bitmap_complement(self.accepted_token_bitmap)
                indices = mx.array([*enumerate_set_bits(rejected_token_bitmap)])
                bias[indices] = -inf
            logits += bias
            print(f'{bits[0]=} {logits[bits[0]]=} {logits[0]=}')

And there was a clue. With bias_logits :

bits[0]=57096 logits[bits[0]]=array(8.64844, dtype=float32) logits[0]=array(-inf, dtype=float32)

With logit_bias_processor:

bits[0]=57096 logits[bits[0]]=array([0, 0, 0, ..., 0, 0, 0], dtype=float32) logits[0]=array([-inf, -inf, -inf, ..., -inf, -inf, -inf], dtype=float32)

I'm not sure why the dimensionality is showing that way. I changed the summing line to mx.add(logits, bias) and from that I get:

bits[0]=57096 logits[bits[0]]=array([0, 0, 0, ..., 0, 0, 0], dtype=float16) logits[0]=array([0.773438, 0.773926, 1.5127, ..., -0.281006, -0.390869, 0.450439], dtype=float16)

This still seems to result in the sampler not selecting a token, though, so I'm at a dead end for now.

uogbuji · 2025-01-15T02:15:41Z

Here's a simple reproduction kit.

Download & unzip logits.npy (Github makes me zip it 🙄):
logits.npy.zip

Then run ipython and paste JUST the following line:

import sys; sys.set_int_max_str_digits(0)

Then paste the rest:

import mlx.core as mx
from toolio.vendor.llm_structured_output.util.bitmap import highest_bit_set, count_set_bits, bitmap_complement, enumerate_set_bits

logits = mx.load('logits.npy')

accepted_token_bitmap = 4060994774593053295078131258417855611202259325478197839908801482945540786663111885057371420178561831999017904078928622613951018540989526269390773887344212981551735700196303585954881909233786262482937776054953551383009356337852738738322390307669912764751735056651053525371307773464418661491597610790195021322837358495478826130701557065576026989608514859840396983740732846526374591856669612076776575264280829218719068864103244613386734920803797369143979572989510113787871619347316284664928282459240323806820000716811334936060269292197354455978819311141546827028584709833763542696161207732042575783711225784153738086419923076712786921567307477196610648670941553279764679180803399270132788326284250927156281798151794704434541452102251234813842407696346334855263799537554836664303911108318572341366552756937250586672431232589688698015184393308118575591367905621638860069754891048821237903929921092195502206197578989434189021162205805888108906366459966139713581861482220189264417631369477026398591796058767478372800478802573645437584112885574240770738015957265770124301927765689767262083332934521030090473535565892029357934280813617807473709019822186832202823432746325393699835281094360639223178083304643344366778577797659708227407152899963031056544843733476083640443310514346070362051163819219810182019358287619601834086557840228242067401918264260622903904147402717554745726367946830708964241944394882697260114492692379366701026554358539239097323261688766469659845345849909553275052515908966279802108278245762170837956675419398948829448038971233038507221317966804455015432662718678254066772040225001302285247976143606830378150412407513710643319520301434287338232771485594684501645286240176059939763334753919980779225484976699431241188225637973473154132202084688021896571833549408253886296275073201550395261185555944373428404578798416489554301734478926226726977722637268180697016118579384465789729421411768618027333422941230654790717351290848391359077891369441460968907171885765458240021783449401325451179069969694637163416940109752643998600642345096781003874443604733247781945555669232175546271173317303425845693254782701235989930502911482206583066035081797868250818932627259746337187626237424491090471149844626759436958423417134011703316278764345762470608035107426405636165016492307142554399768029395470821904638360087166427399293872299779704321228141101161101457158468092210446421510234398047736890430496039122447653301063229192654309619600873647653824703344613845151020607521808842170663539896485776422736182445978893526833430152004806778024576965847780665190592300481174687338486344120729763273094579283692730892345549968743733500677044323803874910739876840136223151366836220160231047135294280593487882301071025137436284761035668208625221685643106412125865880676296297175750962103549438479276656840718972286344913744111226469770934710337611480698300714446774609759940779125243513695680219710205411125721741907726728525100065179038664319493388878112434985097147017762429183877034283334524496496137274916372258741716461395844503817727669528731898495493057862915059829081830126739645754903605574344164372720387346096503624292853205215137927121627249654144994709793771045851736374710132979194800561708538601033035504314136462304241676587531645272782178057309983094209539035911361375362476176396022940042130290481744183840635650127730625924236597934015630514245164693402930098933718563356869309341862174470687563228592315045372754618073315764132534426143922657615197716757261839979252651034869308740805738208614925261598603667475574692703711494793818188472900424692543591921448173165812955986585338951604932445534097359395055586901827999462097665808756675080051449031458871043017000865665126060767635026617796403062461161201500073561279230450762711053097167383115791883799605483089107182899962954462579535743143392722071808950274290572136493108485516289919663271415116927224734672966632793342896522420961255019142994023122205159122535963900955632577785026118139324531085999087178381478354834838286291873890771410716941298248146383575707940271561006857671764368773656582616180771352765405500377886878070014252784774691040108838901405593602762143118800441201283508919211058637673203499414370169924887801361257820150896897833573247193872125268381540248280886613399961494183818653832374388865884709977783582522263255050009651387190299104594159833592513967227854133681057262987127047251279296731667026618123724644875753687442682507569061121326061614534338668607750198854691548021241654255542072975762539947172970306779717126785474872948130965748811970985416612768246786030536973438390010135544023625619920667110126739180109773128153470025832789230542863099906978982605664239800554499843639170585509163701108182948532323345283009323004044402407150721006613463213462123065492209030698325572612920914196717140484227254950154975506149945163216636805691371815829172204689987142523492975860193400145233765748130065379728713997052908822217787062236802863483087189642352745053580421693915090292146819647919832209599222979327120262756562237741063535571205534218992931076240649148195131203232823220418655996576526528105720041381102365745064225848939991271537452744861131675641198349473368087963594939585290412336926670703270327635308529463553216851338541720810403275960709533721033842156310081155111878582466841708957083743595709706064804096679990147901559209583814660551795698901720530986620757053197234008602902702407751836229995856222881719072814765285729284105917274651129013716067215057498765095841436460949680840717521835862180136866963121936463214566641590436050474918284726800163870473593827188716137252294155920194648104356757954220181284470818516762606580608000267815411718960097992790201145964449407883921193049184436568507086931042798708928651532897543675365008532712974426855157264140213727597000862160214387482318242215172242209412646086238982445493375076246798840459691491559359163307916994844829304702606756183123791434333750722134515456502274863368954614718443880017186232387606410592253433758800593493641014575551526879484570374698601108626373819140700832723509688627772097278844553002314411505436354980586717867601518976878913368039460056255025204954531539870149983558699548375043075260517950444914782689231924397485173779754225629701339413751790468914321739816104309557104473567824018427104374225016157523904443409502760351267751949340829648009487169828474932146865262040060048078754817833352791845940265383666566614529057113026468180812301654173139457920704058209667334246761406401540293322071216742354429155440040084504595685514833261409179403692242658187675321712240550496914732556326867389406337286501812532203243290332791612139205003976267642136526888726800242251016702596559350989473292660958290124190978439350648449417355518343789211267654576576386045897662200221824844203407584860606140731100428860793018831310042037765865850557462300493005120716264076229212187106894027208214477817415325590844625140184887748147517332241191758229856021988718608761218075498524862135424821711616316468210627651376238659555969208027033209154928184836208119586219408610406781844199793670828082021303409967424690864602965051524509902155888332055341777533580283876245020763879539444480584961402674709599644242436241266254802238044257364042291660092919909426454016204160517239158071933865761932475339713863737358749391655242957584139250815039760975754753385835155009272417627813753141868092063925884160490204168699054969325874468961686116216555246391441177935037205024897904259923786474107366169750521176012363404064459968269903654814369777023972224002210039340446152730039361528941509458245976837199954731063461244712938231402629088390965067256654476347814901654370448133570539509204459398465016710093519348389418517094971222928202770754162633260895160193205694485129993168243408580412797692444276997866464238178657416858316942553073818141351360688728397058931328013999915952954152768515330721847248015851029089793529448855239312142231232337238507190008821626341274998166142806495166067926158964019681271329217961146329157291633761206633190178992230755894639620997676142370617964418338145445462425336997708635262334213402260943473162324143267780407657613581386940783519332352898845398909524453954117142499628803741148125047622463879268582073333900896588278524393843241413792949296350745381582971833858108750841273721733025987477745761127981438078402008462671148301816888425303823886280267154385742918141827872609113438723164529855458956053823562957437841136391357137606505023729461142539883567508724924990232362313616053678879434379045646528358474238263191750124665725949634782088823120135547177876394260941014522799701405558037342065309322887006256822121236181007696032866475404892033704725570115029776722542844637836387344244964496840990057420581908150990897730828852045493976581440636931427707774594924288115281871070183190926562615353748973432922610669883731252038740662905896207432139873197203538103607915522327567312194981742376354468441967961783088337943445824323590753838693294459219864288259578212044349884231954091612734251318538013326091589787522961734264605537204035618945252776015959894328662372700668501863732629883807589314216399518780000405492747825637990809414616796437221664868641632247075867367154479204007763398446740379384492085212841262628540206293632925161093111858608174453890125246563720758382921175606753998814771159735388039007416642588792837494907035552642267493148336329702175821013350189136550970412388363324884476802631353621177192533946842093922767621910190980280405255716779676330280452073323709084754764651796387649268562923507519479007514391649571447541159603015853078990699410403533371026093953238399390390240754249464167442808233217075114812620114608872259177677458088361023562719048623400763143129764089383599222308862408179292825213543192655691636612677574638229199715781643953289509864630189099943630554029547952482167967635863996161526222214568124534674930573977502588592606934834693389365831775900166547154172068233010071087136914778384075730853569566769450059569512753417071157669811571890494659634987442993821111985454674614514802340955869214393408450526020006907243944892006834454637710858078910524168725727707335413185168833470127658845170216195201791820396183787784460578475211216592645055729531840883145484767723454475384795330782405640668103076458734630359098507419065400516057036753437978656082492709621480489291094210602648687023521124986553777559488143006395159700295950195921589451409813420656559931206516101688275004105637115860164495912688958399124935618082808424376215426984986731328838848287523499317547802888182600991582014105903981371269735086501152347022644395875698491710956689326770910837230465381942932627779384703669443958722773667228445783212052419669641660835614682292389624120102827941211482126641153339997761375444891951446351895007666211150119291752528586338228584373941831745264479268885646195663374932221062989497173453850117206096898254195032124562235576908338997471694126583866007013539671956842948244794342823149638264369175299562328717120690820472386847401945190197315413013047072180852569262360805685199979952383478630320157847822133292842137639270083848160535237048776572066791488624264084357898873850823096778357630564442151074869614031944094825402615541500950980629452546005293954410004424344348758566815365627814169639050627590740299961396150584454819100058331229182618309973364574858970574243688098295873277039758704834557423144907924075558141828784341217591488660575667126724193785224806988704985466674349595276045150456276221212272259276821357974861514607358854834258860227574157393743545924871116174191037818486342225205288539527458027839180501432370276837374406551809162679458109298249575792235025237922262500997765385012764113205513547140995250899908008332785410901889743961922990232152706823251820862790488983650371301947773554489010508885795000702930463698140800861381539084516506299726796156324398935071444252593124430683678492592288443362423312095716890689807325660005753413072394243452099401945228690284609490565066841379409841467361974453034960286683784704994634008215046229212189245313642267605376022962541534304307699387847960273032984176965606685580128349177528425845876634464696505385836403670055778016828908592752539150977264487981872271868329458342203707406330210863087097113842296150874441963946083163366962769626674527966456013878342730688221137086107134658891820039718631034147859777985460198726794700725080481952755762443126768893699407762210732450947222295779453857952840438446967633957129516919260615932773441109310960264830624186176336626485778112102118452164080910385033440408775496297201735486769213547582932446849458687338610155201383902607829636895418341388856994039201577194733894128344722704020857353863448606076347119245520639185289028880787877513775899383611459819957087039658882583221717508664372279603969163841648289910762628376779154797834798156652130472108681687945137841292015650520141769973371821808652200175952058873864990600177029214686329406485852819328176636053119180455660548979671397888719169813824349252249010208987101792556002163411846421398389537067831269192166534100157437199941328918112839496665433552293000175362349632234811014084519502859226925970691382488612469090194556176372777725700999215252295060879738625918334602416296474583220278224475526467858690373936646973757350147302099508601128366382638837661773059355367314076894095911609638889235441089921719591227130912885422870260733299391005720727784895596675136619212072888440941067852512122726743202240349343275633086750099926266284577413287153089094292940568525871155254207567940316819064838864554607778461917492829812779727006102455239806749796097025542720436359595394317905569128668609310848900030534610369782637919510966472540260649926084703633395903098443610280730228305137053630592244813074958635524940754678889655365771346131084490614552693023970475710504508158125205836665968347947172986069431029423264310842800464217888934013684059669350258065714517675234453089018025529099473004257160755923872699194877678679238481117149083477878959417676682657313640971280507599832706385891465252120821205671720487960311458763525566712967388889902598668338099000692490825172189132648664404518948403776405800827429214650462338587213776414794404841182084069018989058325949190520375290079480933976621695417666185546481211837214496054375104412409703346121466586431220011009499945991053759762570398832136179104154472376266023207303114561140987070799993245371514808517315398001341264955491705762945319147734742842629350772572745400414874293281784417600566252333863952426228530658402155169622318345171129731950280409864120120872359079178911789210270565252815819885341399508710325816554479483052597080063159904476056814803419234157328519694708307831405488394872160788185907721858996635674816327686128035634446295748382183569210516927376466511490539867755336967922325981389775447240457195035620164952555844439515188569552064249283991385411701465109915959613800107756784930826738097834612517965600497748859722663072474826405774929836088784308226292047041627297632178912329847411871382682989807696268021466942867147692070294974404355335413000943165613305996790104042522862298320334656967705830824386214068293226813435721054702637436412787916881801388992518106026807556260478686753140584339089586971633445326330253790114398774383826300216103938311680791263376565488460185369349121485995002625059835708132241339543600612583472749708124022250021438833408729027861210777885050873266136579636667030805620324116742450425613813377127027058764214380851953904035054328365623717414387659355474953706650767369511047263284541322448287518432710426701344534732126683837422159167276385400344699998222532133572128405209859355208568059454086829742966139542474201937947137351963490300903289545475432615761113524706778192523594365127077483274262621821827017209575218591850296509382994652819000654844343978315041396888544706013647839987876533980500549121655182785044193226857546131130844313607033398359042896854409590141905978224375576460477640015603238236089055468772336873436259443078013845791763572128061897320549843966623160674458790442206415989155276975314412871978865102204147120120363768527977547666583017801511345890033631622226600057542449559119650683399853070556417251700769346097934003465400051623476101479747731272313351114107068152928428745010204990655203597670763834487031870382534920000111313353415153217929833567442051409821624646375498520193566256586910928690177987369664972733164847439488538889455821456547571155922193173981429939725498767262452118949388936661128055510306015031905615956459345344491760244196309716041179845949562369439288549665760636260062179236108692890151895637206461557231395375812542309073241162424236626740361387788066538938826658308837377440335578915062658424865456878595100849032520580766414319716990528920481901194505914902487713794759175666894532984735213526174404856251186676073150449577781075794924583869989599030952601594174511262540292413956904281765819862183623738875204799514519244581122473641062458670571531321590282577647672838801963423936310758232318869490555006253662511892269341953314063634125414010408316574888488468988489566003899904494158149880712980193349067470376851703162601922828797072113664
bits = list(enumerate_set_bits(accepted_token_bitmap))

Don't know why Python breaks if you don't paste the first line separately.

Anyway at this point you have pretty much all we need to figure out how to get this logits logic right.

BTW if we ever need to save multiple arrays to a file:

>>> a = mx.array([1.0])
>>> b = mx.array([2.0])
>>> mx.savez("arrays", a, b=b)

chimezie · 2025-01-15T05:18:28Z

Thanks for the setup. I tried a few things.

Changing the logit_bias_processor to the following results in stream_generate producing GenerationResponse objects:

        def logit_bias_processor(tokens: mx.array, logits: mx.array) -> mx.array:
            '''
            Apply a -inf bias to tokens that will not be accepted
            '''
            vocab_size = logits.shape[0]
            highest_token_accepted = highest_bit_set(self.accepted_token_bitmap)
            accepted_token_count = count_set_bits(self.accepted_token_bitmap)
            # Check whether there's more tokens to be rejected or to be allowed, then do what's less work.
            if accepted_token_count <= highest_token_accepted / 2:
                indices = mx.array([*enumerate_set_bits(self.accepted_token_bitmap)])
            else:
                bias = mx.concatenate(
                    [
                        mx.full(highest_token_accepted + 1, 0),
                        # All tokens above the highest accepted token are rejected.
                        mx.full(vocab_size - highest_token_accepted - 1, -inf),
                    ]
                )
                rejected_token_bitmap = bitmap_complement(self.accepted_token_bitmap)
                indices = mx.array([*enumerate_set_bits(rejected_token_bitmap)])
                bias[indices] = -inf
            rejected_tokens = mx.array([*enumerate_set_bits(bitmap_complement(self.accepted_token_bitmap))])
            logits[:, rejected_tokens] = mx.full(rejected_tokens.shape[0], -inf)
            return logits

The main difference is directly setting the logits of the rejected tokens to -inf instead of updating all the logits by adding zero or -inf, building out a bias array as large as the vocabulary size to do so. I didn't change the else clause since it was not being used, but I think the same principle of zeroing in on the rejected tokens should apply

uogbuji · 2025-01-15T05:29:25Z

Thanks! This would have taken me a while to work out, for sure. I think the else branch is only not being used by coincidence on what we've tested with. Basically, if the first valid token is more than half way through the vocab set, the second branch would fire, so we do have to make sure it works. I think I'll play around until I do find a token stream / model combo that exercises the second branch so we have a test case.

…work.

…chimezie

…chimezie

uogbuji · 2025-01-15T06:21:39Z

After a closer look, I think @chimezie is right. We're reducing the work already by setting rejected logit values directly, so the bisect approach they were using upstream is probably not worth the hassle. I left in a comment in case we do ever want to go back and figure that out, but for now, I think we might be on to the next problem! 🎉🎉🎉

chimezie · 2025-01-17T02:45:28Z

I was looking at the way make_repetition_penalty in mlx_lm.sample_utils penalizes tokens it doesn't want to repeat, which is the only reference I have for something similar to what we are doing. It uses a scaling factor to reduce the raw logit values instead of setting them to a particular value. I do know that logit operations can make the model sampling process 'unstable', so I had a thought to penalize schema-invalid tokens in the same way, but with a constant value (2 in this case, but it can be higher):

    def make_logit_bias_processor(self) -> Callable[[mx.array, mx.array], mx.array]:
        def logit_bias_processor(tokens: mx.array, logits: mx.array) -> mx.array:
            '''
            Apply a -inf bias to tokens that will not be accepted
            '''
            # Could try to re-apply the upstream logic "Check whether more tokens to reject or allow, then do what's less work."
            # https://github.com/OoriData/Toolio/blob/903aba3a6daac3fce14b8ab84dab1d760da76304/pylib/schema_helper.py#L171
            # But this approach might minimize the array construction enough not to bother
            # We're instead directly setting the logits of rejected tokens to -inf rather than doing a full array add
            # Saves us from building out a vocabulary-sized bias array

            accepted_tokens = [*enumerate_set_bits(self.accepted_token_bitmap)]
            rejected_tokens = [t for t in range(logits.shape[-1])
                               if t not in accepted_tokens]
            rejected_logits = logits[:, rejected_tokens]
            logits = mx.where(
                rejected_logits < 0,
                rejected_logits * 2,
                rejected_logits / 2,
            )
            return logits

When I make that change, I get more emissions from the sampling process

chimezie · 2025-01-17T03:39:20Z

I was pointed to the logic for top_k, which prevents the bottom |Vocab size| - k from being emitted and I see:

    mask_idx = mx.argpartition(-logprobs, kth=top_k - 1, axis=-1)[..., top_k:]
    masked_logprobs = mx.put_along_axis(
        logprobs, mask_idx, mx.array(-float("inf"), logprobs.dtype), axis=-1
    )
    return mx.random.categorical(masked_logprobs, axis=-1)

This works with logprobs instead of logits, but the principle is the same. This suggests that -inf is the right way to go about it, but they instantiate it and put it in the log probs in a slightly different way in mlx_lm (using put_along_axis). So, perhaps (ignoring mx.random.categorical, which just does the sampling):

        def logit_bias_processor(tokens: mx.array, logits: mx.array) -> mx.array:
            '''
            Apply a -inf bias to tokens that will not be accepted
            '''
            # Could try to re-apply the upstream logic "Check whether more tokens to reject or allow, then do what's less work."
            # https://github.com/OoriData/Toolio/blob/903aba3a6daac3fce14b8ab84dab1d760da76304/pylib/schema_helper.py#L171
            # But this approach might minimize the array construction enough not to bother
            # We're instead directly setting the logits of rejected tokens to -inf rather than doing a full array add
            # Saves us from building out a vocabulary-sized bias array

            accepted_tokens = [*enumerate_set_bits(self.accepted_token_bitmap)]
            rejected_tokens = [t for t in range(logits.shape[-1])
                               if t not in accepted_tokens]
            logits = mx.put_along_axis(
                logits, mx.array(rejected_tokens), mx.array(-float("inf"), logits.dtype), axis=-1
            )
            return logits

…p-k sampler

…going on with the sampler's selection

uogbuji · 2025-01-18T16:41:21Z

OK whew! The basic mechanics look in good working order now. The downside is that I think it' seems much slower than the main branch. Main branch:

❯ time python scratch/country_extract.py
[
  {
    "name": "Nigeria"
 ,
    "continent": "Africa"
  }
]
python scratch/country_extract.py  3.92s user 3.80s system 95% cpu 8.125 total

This PR branch:

❯ time python demo/country_extract.py
[
  {
    "name": "Nigeria"
  ,
    "continent": "Africa"
  }
]
python demo/country_extract.py  10.90s user 6.43s system 106% cpu 16.298 total

I've been profiling, using demo/country_extract_cprofile.py and around 6 seconds of the latter is spent in schema_helper.apply_token_mask.

I did some noodling on my own, and some with help of Perplexity and Claude. Came up with this to study the options for spedups:

import timeit
import mlx.core as mx
from math import inf

DEFAULT_TOKEN_MASK_BATCH_SIZE = 1024

def apply_token_mask_batched(logits, accepted_token_bitmap, batch_size=DEFAULT_TOKEN_MASK_BATCH_SIZE):
    '''
    Iterators/generators approach to setting logits of non-accepted tokens to -inf
    Fixed-size batched approach, trading off space/speed by only creating small temporary lists for each batch
    '''
    vocab_size = logits.shape[-1]
    
    # Process tokens in batches
    for start_idx in range(0, vocab_size, batch_size):
        end_idx = min(start_idx + batch_size, vocab_size)
        batch_indices = []

        # Check each token in the current batch
        for token_idx in range(start_idx, end_idx):
            if not accepted_token_bitmap & (1 << token_idx):
                batch_indices.append(token_idx)

        # If we found any tokens to reject in this batch, update logits
        if batch_indices:
            logits = mx.put_along_axis(
                logits,
                mx.array(batch_indices)[None, ...],
                mx.array(-inf, logits.dtype),
                axis=-1
            )

    return logits

def apply_token_mask_vectorized(logits, accepted_token_bitmap):
    vocab_size = logits.shape[-1]

    # Create a boolean mask for the entire vocabulary
    mask = mx.array([(accepted_token_bitmap & (1 << i)) != 0 for i in range(vocab_size)])

    # Invert the mask and convert to the same dtype as logits
    inverted_mask = (~mask).astype(logits.dtype)

    # Multiply the inverted mask by negative infinity
    inf_mask = inverted_mask * mx.array(-mx.inf, dtype=logits.dtype)

    # Apply the mask to logits
    masked_logits = mx.where(mask, logits, inf_mask)

    return masked_logits

def apply_token_mask(logits, accepted_token_bitmap):
    '''
    Iterators/generators approach to setting logits of non-accepted tokens to -inf
    '''
    # Process each position in the logits vocabulary dimension
    for token_idx in range(logits.shape[-1]):
        # Check if this token should be rejected (not in accepted bitmap)
        if not accepted_token_bitmap & (1 << token_idx):
            logits = mx.put_along_axis(
                logits,
                mx.array([token_idx])[None, ...],
                mx.array(-inf, logits.dtype),
                axis=-1
            )
    return logits

if __name__ == '__main__':
    # Setup code that includes all necessary variables
    setup_code = '''
# Generate example data
import mlx.core as mx
from math import inf
vocab_size = 10000
logits = mx.random.normal((1, vocab_size))  # Example logits tensor
BITMAP_WIDTH = int(vocab_size * 0.8)
accepted_token_bitmap = (1 << BITMAP_WIDTH) - 1  # Accept first N tokens

from __main__ import apply_token_mask_batched, apply_token_mask_vectorized, apply_token_mask
'''

    # Benchmark each function
    batch_sizes = [128, 1024, 8192]
    for batch_size in batch_sizes:
        batched_time = timeit.timeit(
            stmt=f'apply_token_mask_batched(logits, accepted_token_bitmap, batch_size={batch_size})',
            setup=setup_code,
            number=100
        )
        print(f'apply_token_mask_batched (batch size {batch_size}): {batched_time:.6f} seconds')

    vectorized_time = timeit.timeit(
        stmt='apply_token_mask_vectorized(logits, accepted_token_bitmap)',
        setup=setup_code,
        number=100
    )
    print(f'apply_token_mask_vectorized: {vectorized_time:.6f} seconds')

    iterative_time = timeit.timeit(
        stmt='apply_token_mask(logits, accepted_token_bitmap)',
        setup=setup_code,
        number=100
    )
    print(f'apply_token_mask: {iterative_time:.6f} seconds')

I'm seeing:

❯ python scratch/apply_token_mask_timeit.py
apply_token_mask_batched (batch size 128): 0.212579 seconds
apply_token_mask_batched (batch size 1024): 0.196801 seconds
apply_token_mask_batched (batch size 8192): 0.195407 seconds
apply_token_mask_vectorized: 0.209955 seconds
apply_token_mask: 1.785793 seconds

Based on this analysis I'd go with apply_token_mask_vectorized (lower memory usage given th elack of list construction), but we need to find a way to make this even faster.

I tried to play around with mx.compile() but didn't get very far. I need to learn that tool better.

Hmm./ It only just occurred to me that the venv using main is on Python 3.12.5 and that for this PR is on 3.11.6. Probably doesn't make a big difference, but not quite apples to apples.

uogbuji · 2025-01-18T16:54:46Z

I think all it needed was an LRU cache

import timeit
from math import inf
from functools import lru_cache

import mlx.core as mx

DEFAULT_TOKEN_MASK_BATCH_SIZE = 1024

def apply_token_mask_batched(logits, accepted_token_bitmap, batch_size=DEFAULT_TOKEN_MASK_BATCH_SIZE):
    '''
    Iterators/generators approach to setting logits of non-accepted tokens to -inf
    Fixed-size batched approach, trading off space/speed by only creating small temporary lists for each batch
    '''
    vocab_size = logits.shape[-1]
    
    # Process tokens in batches
    for start_idx in range(0, vocab_size, batch_size):
        end_idx = min(start_idx + batch_size, vocab_size)
        batch_indices = []

        # Check each token in the current batch
        for token_idx in range(start_idx, end_idx):
            if not accepted_token_bitmap & (1 << token_idx):
                batch_indices.append(token_idx)

        # If we found any tokens to reject in this batch, update logits
        if batch_indices:
            logits = mx.put_along_axis(
                logits,
                mx.array(batch_indices)[None, ...],
                mx.array(-inf, logits.dtype),
                axis=-1
            )

    return logits

@lru_cache(maxsize=128)
def create_mask(accepted_token_bitmap, vocab_size):
    return mx.array([(accepted_token_bitmap & (1 << i)) != 0 for i in range(vocab_size)])

def apply_token_mask_vectorized(logits, accepted_token_bitmap):
    vocab_size = logits.shape[-1]
    
    # Use the memoized function to create or retrieve a boolean mask for the entire vocabulary
    mask = create_mask(accepted_token_bitmap, vocab_size)
    
    # Invert the mask and convert to the same dtype as logits
    inverted_mask = (~mask).astype(logits.dtype)
    
    # Multiply the inverted mask by negative infinity
    inf_mask = inverted_mask * mx.array(-mx.inf, dtype=logits.dtype)
    
    # Apply the mask to logits
    masked_logits = mx.where(mask, logits, inf_mask)
    
    return masked_logits

def apply_token_mask(logits, accepted_token_bitmap):
    '''
    Iterators/generators approach to setting logits of non-accepted tokens to -inf
    '''
    # Process each position in the logits vocabulary dimension
    for token_idx in range(logits.shape[-1]):
        # Check if this token should be rejected (not in accepted bitmap)
        if not accepted_token_bitmap & (1 << token_idx):
            logits = mx.put_along_axis(
                logits,
                mx.array([token_idx])[None, ...],
                mx.array(-inf, logits.dtype),
                axis=-1
            )
    return logits

if __name__ == '__main__':
    # Setup code that includes all necessary variables
    setup_code = '''
# Generate example data
import mlx.core as mx
from math import inf
vocab_size = 10000
logits = mx.random.normal((1, vocab_size))  # Example logits tensor
BITMAP_WIDTH = int(vocab_size * 0.8)
accepted_token_bitmap = (1 << BITMAP_WIDTH) - 1  # Accept first N tokens

from __main__ import apply_token_mask_batched, apply_token_mask_vectorized, apply_token_mask
'''

    # Benchmark each function
    batch_sizes = [128, 1024, 8192]
    for batch_size in batch_sizes:
        batched_time = timeit.timeit(
            stmt=f'apply_token_mask_batched(logits, accepted_token_bitmap, batch_size={batch_size})',
            setup=setup_code,
            number=100
        )
        print(f'apply_token_mask_batched (batch size {batch_size}): {batched_time:.6f} seconds')

    vectorized_time = timeit.timeit(
        stmt='apply_token_mask_vectorized(logits, accepted_token_bitmap)',
        setup=setup_code,
        number=100
    )
    print(f'apply_token_mask_vectorized: {vectorized_time:.6f} seconds')

    iterative_time = timeit.timeit(
        stmt='apply_token_mask(logits, accepted_token_bitmap)',
        setup=setup_code,
        number=100
    )
    print(f'apply_token_mask: {iterative_time:.6f} seconds')

Gets:

apply_token_mask_batched (batch size 128): 0.209994 seconds
apply_token_mask_batched (batch size 1024): 0.196780 seconds
apply_token_mask_batched (batch size 8192): 0.196083 seconds
apply_token_mask_vectorized: 0.002782 seconds
apply_token_mask: 1.808748 seconds

And now on the whole I get

❯ time python demo/country_extract.py
[
  {
    "name": "Nigeria"
  ,
    "continent": "Africa"
  }
]
python demo/country_extract.py  9.83s user 5.05s system 119% cpu 12.403 total

I'll commit next, and that's probably enough of a speedup to release with, but we should still be hunting more speedups.

I will say, though, that I appreciate watching the tokens's progress in real-time in the Toolio demos now.

Use string formatting of bitmap to create string to use for token bit comparisons more efficiently Before and after timings: ```commandline python demo/algebra_tutor.py 10.90s user 3.27s system 89% cpu 15.766 total python demo/algebra_tutor.py 5.67s user 3.22s system 84% cpu 10.550 total ```

chimezie · 2025-01-20T19:21:03Z

I was pointed to casting to a Python string of the bits via string formatting. It lends some speed. I haven't checked memory use, though.

Remove left-sided zero padding

uogbuji · 2025-01-20T22:29:04Z

I was pointed to casting to a Python string of the bits via string formatting. It lends some speed. I haven't checked memory use, though.

Woohoo! Major improvement! Just for completeness I did update the timeit test.

Click to reveal large code listing

import timeit
from math import inf
from functools import lru_cache

import mlx.core as mx

DEFAULT_TOKEN_MASK_BATCH_SIZE = 1024

def apply_token_mask_batched(logits, accepted_token_bitmap, batch_size=DEFAULT_TOKEN_MASK_BATCH_SIZE):
    '''
    Iterators/generators approach to setting logits of non-accepted tokens to -inf
    Fixed-size batched approach, trading off space/speed by only creating small temporary lists for each batch
    '''
    vocab_size = logits.shape[-1]
    
    # Process tokens in batches
    for start_idx in range(0, vocab_size, batch_size):
        end_idx = min(start_idx + batch_size, vocab_size)
        batch_indices = []

        # Check each token in the current batch
        for token_idx in range(start_idx, end_idx):
            if not accepted_token_bitmap & (1 << token_idx):
                batch_indices.append(token_idx)

        # If we found any tokens to reject in this batch, update logits
        if batch_indices:
            logits = mx.put_along_axis(
                logits,
                mx.array(batch_indices)[None, ...],
                mx.array(-inf, logits.dtype),
                axis=-1
            )

    return logits

def create_mask1(accepted_token_bitmap, vocab_size):
    return mx.array([(accepted_token_bitmap & (1 << i)) != 0 for i in range(vocab_size)])

@lru_cache(maxsize=128)
def create_mask2(accepted_token_bitmap, vocab_size):
    return mx.array([(accepted_token_bitmap & (1 << i)) != 0 for i in range(vocab_size)])

def create_mask3(accepted_token_bitmap, vocab_size):
    token_bitmap_str = '{0:b}'.format(accepted_token_bitmap)
    return mx.array([False if i > (len(token_bitmap_str) - 1)
                     else token_bitmap_str[-1 - i] == '1' for i in range(vocab_size)])

@lru_cache(maxsize=128)
def create_mask4(accepted_token_bitmap, vocab_size):
    token_bitmap_str = '{0:b}'.format(accepted_token_bitmap)
    return mx.array([False if i > (len(token_bitmap_str) - 1)
                     else token_bitmap_str[-1 - i] == '1' for i in range(vocab_size)])

def apply_token_mask_vectorized(logits, accepted_token_bitmap, create_mask=create_mask2):
    vocab_size = logits.shape[-1]
    
    # Use the memoized function to create or retrieve a boolean mask for the entire vocabulary
    mask = create_mask(accepted_token_bitmap, vocab_size)
    
    # Invert the mask and convert to the same dtype as logits
    inverted_mask = (~mask).astype(logits.dtype)
    
    # Multiply the inverted mask by negative infinity
    inf_mask = inverted_mask * mx.array(-mx.inf, dtype=logits.dtype)
    
    # Apply the mask to logits
    masked_logits = mx.where(mask, logits, inf_mask)
    
    return masked_logits

def apply_token_mask(logits, accepted_token_bitmap):
    '''
    Iterators/generators approach to setting logits of non-accepted tokens to -inf
    '''
    # Process each position in the logits vocabulary dimension
    for token_idx in range(logits.shape[-1]):
        # Check if this token should be rejected (not in accepted bitmap)
        if not accepted_token_bitmap & (1 << token_idx):
            logits = mx.put_along_axis(
                logits,
                mx.array([token_idx])[None, ...],
                mx.array(-inf, logits.dtype),
                axis=-1
            )
    return logits

if __name__ == '__main__':
    # Setup code that includes all necessary variables
    setup_code = '''
# Generate example data
import mlx.core as mx
from math import inf
vocab_size = 10000
logits = mx.random.normal((1, vocab_size))  # Example logits tensor
BITMAP_WIDTH = int(vocab_size * 0.8)
accepted_token_bitmap = (1 << BITMAP_WIDTH) - 1  # Accept first N tokens

from __main__ import apply_token_mask_batched, apply_token_mask_vectorized, apply_token_mask, create_mask1, create_mask2, create_mask3, create_mask4
'''

    # Benchmark each function
    batch_sizes = [128, 1024, 8192]
    for batch_size in batch_sizes:
        batched_time = timeit.timeit(
            stmt=f'apply_token_mask_batched(logits, accepted_token_bitmap, batch_size={batch_size})',
            setup=setup_code,
            number=100
        )
        print(f'apply_token_mask_batched (batch size {batch_size}): {batched_time:.6f} seconds')

    vectorized_time = timeit.timeit(
        stmt='apply_token_mask_vectorized(logits, accepted_token_bitmap, create_mask=create_mask1)',
        setup=setup_code,
        number=100
    )
    print(f'apply_token_mask_vectorized with create_mask1: {vectorized_time:.6f} seconds')

    vectorized_time = timeit.timeit(
        stmt='apply_token_mask_vectorized(logits, accepted_token_bitmap, create_mask=create_mask2)',
        setup=setup_code,
        number=100
    )
    print(f'apply_token_mask_vectorized with create_mask2: {vectorized_time:.6f} seconds')

    vectorized_time = timeit.timeit(
        stmt='apply_token_mask_vectorized(logits, accepted_token_bitmap, create_mask=create_mask3)',
        setup=setup_code,
        number=100
    )
    print(f'apply_token_mask_vectorized with create_mask3: {vectorized_time:.6f} seconds')

    vectorized_time = timeit.timeit(
        stmt='apply_token_mask_vectorized(logits, accepted_token_bitmap, create_mask=create_mask4)',
        setup=setup_code,
        number=100
    )
    print(f'apply_token_mask_vectorized with create_mask4: {vectorized_time:.6f} seconds')

    iterative_time = timeit.timeit(
        stmt='apply_token_mask(logits, accepted_token_bitmap)',
        setup=setup_code,
        number=100
    )
    print(f'apply_token_mask: {iterative_time:.6f} seconds')

The results are eye-catching already:

❯ python scratch/apply_token_mask_timeit.py
apply_token_mask_batched (batch size 128): 0.210021 seconds
apply_token_mask_batched (batch size 1024): 0.200561 seconds
apply_token_mask_batched (batch size 8192): 0.197152 seconds
apply_token_mask_vectorized with create_mask1: 0.208924 seconds
apply_token_mask_vectorized with create_mask2: 0.002607 seconds
apply_token_mask_vectorized with create_mask3: 0.088991 seconds
apply_token_mask_vectorized with create_mask4: 0.001500 seconds
apply_token_mask: 1.770771 seconds

create_mask3 is your new algorithm without LRU caching. create_mask4 is your new algorithm with LRU caching 🚀🚀🚀.

And the spedup manifests nicely in the less rigorous command line check:

Before your latest update:

❯ time python demo/country_extract.py      
[
  {
    "name": "Nigeria"
  ,
    "continent": "Africa"
  }
]
python demo/country_extract.py  10.15s user 6.32s system 100% cpu 16.341 total

And after:

❯ time python demo/country_extract.py
[
  {
    "name": "Nigeria"
  ,
    "continent": "Africa"
  }
]
python demo/country_extract.py  5.94s user 5.55s system 116% cpu 9.854 total

Not quite as zippy as the main branch:

❯ time python scratch/country_extract.py 
[
  {
    "name": "Nigeria"
 ,
    "continent": "Africa"
  }
]
python scratch/country_extract.py  3.83s user 3.06s system 102% cpu 6.694 total

But easily good enough for the release, when we're ready!

chimezie · 2025-01-22T15:34:40Z

Last night, I threw everything but the kitchen sink at optimization without being able to shave more than fractions of a second. cProfiling showed (as you probably know) that the logit bias function is where the majority of the computation time occurs.

Within that function, I noticed:

prev_tok = tokens.tolist()[-1]

This can be changed to this, which avoids the casting of the mx.array to a list to retrieve its last element and directly returns it as a scalar:

prev_tok = tokens[-1].item()

Within the apply_token_mask method:

    # Invert the mask and convert to the same dtype as logits
    inverted_mask = (~mask).astype(logits.dtype)

    # Multiply the inverted mask by negative infinity
    inf_mask = inverted_mask * mx.array(-mx.inf, dtype=logits.dtype)

    # Apply the mask to logits
    masked_logits = mx.where(mask, logits, inf_mask)

Can be done in a single step, avoiding the additional array multiplication operation, and creating the -mx.inf array natively and directly in mlx (opening the door for low-level optimizations to that heavily used primitive operation that clearly are not evident now):

    # Apply the mask to logits
    masked_logits = mx.where(mask, logits, mx.full(logits.shape, -mx.inf, dtype=logits.dtype))

Also, since the logits second dimension will always be constant (the vocabulary width of the model), I wonder if that -inf array could be created once and not repeatedly each time this method is invoked.

As for create_mask, I found (after some investigation) that converting an integer to its binary string representation can be done in many ways (including the approach I noticed the original vendored code was using: bin([.])), but the most efficient approach is via an f-string.

With this in mind, I also changed the creation of the mask to create the False boolean mask up front and only iterate through the accepted bit mask (in reverse order) to fill in the True values, reducing the computation. The zip was apparently necessary since the zero-padding of the accepted token bitmap can result in having either more or less indices than that of the vocabulary size.

Here a numpy array was used instead of an mlx array and then casted it to an mx.array at the end, which is the pattern I have noticed used in the most performance-sensitive places in mlx (such as training, for example):

def create_mask(accepted_token_bitmap, vocab_size):
    token_bitmap_str = f'{accepted_token_bitmap:b}'
    mask = np.full(vocab_size, False, dtype=bool)
    for (i, bit_char), _ in zip(enumerate(token_bitmap_str[::-1]), range(vocab_size)):
        mask[i] = bit_char == '1'
    return mx.array(mask)

I did try creating the False-filled boolean mask in an mx.array and then update it in place, but this was extremely slow and I suspect this may be related to mlx's lazy evaluation.

In the end, the combinations of these changes only shaved off fractions of a second, so I just thought I would do a brain dump for your reference.

uogbuji · 2025-01-22T15:41:10Z

I really appreciate this thorough work. I think what I'll do is take steps to beef up the test suite this week, and then we can try applying some of these theoretically more efficient measures. That way, we can be poised to take advantage of future mlx core improvements.

In which case my priority for the 0.6.0 release would be:

Restore working tool-calling & HTTP
Beef up test suite
Ask you to implement whichever of the above measures makes sense architecturally (and doesn't immediately kill performance), with the test suite to help guard for correctness.

…te management.

uogbuji and others added 8 commits January 11, 2025 15:15

A little down-payment clean-up

f41354c

Stub out generate_step_with_schema method with similar signature to…

edfd44a

… generate_step in mlx

Stub out generate_step_with_schema method with similar signature to…

25fa0f5

… generate_step in mlx for use by Model.completion

Merge remote-tracking branch 'origin/33-re-integrate-mlxlm' into 33-r…

64f5466

…e-integrate-mlxlm

Remove generate_with_preemptive_decoding, which is fiddly and almost …

7bd6ccf

…certainly a poor alternative to the new speculative decoding

Partway through the process of modernizing our use of MLX

666d043

Some more cleanup, including to local call method signatures

62bc792

uogbuji assigned chimezie and uogbuji Jan 14, 2025

uogbuji linked an issue Jan 14, 2025 that may be closed by this pull request

Re-integrate better into mlx_lm #33

Open

uogbuji and others added 3 commits January 13, 2025 18:12

Incomplete work on response_helper.py

db4eca5

Getting closer but make_logit_bias_processor seems ot be breaking things

06eeee6

Update logits bias adjustment per make_logits_processors in mlx_lm.sa…

03ddce7

…mple_utils This is the only convention for a logits processor in mlx_lm AFAICT, which repetition penalties are based on. It looks like the logits are expected to be updated *and* returned (the += operator).

uogbuji added 2 commits January 14, 2025 23:08

Try a different method to apply logits biases. Spoiler alet: doesn't …

903aba3

…work.

I think this gets us out of logit_bias_processor purgatory! Thanks to @…

9641ee5

…chimezie

Add country extractor demo

6d1b43b

uogbuji added 2 commits January 16, 2025 20:47

Add some debugging logic, and try the token suppression logic from to…

a7f9b52

…p-k sampler

Simplify logit_bias_processor setup

2f1984c

uogbuji added 4 commits January 17, 2025 12:34

Got the prefill situation sorted, but there is still something wonky …

d050e3e

…going on with the sampler's selection

Many enhancements and fixes towards getting this rearchitecture working

deef2d1

Handle eos token. Improvements to iter_print.

c3c8c89

Code cleanup. Variant demo set up for perf profiling the guts of Toolio

c2997fb

uogbuji and others added 3 commits January 18, 2025 09:58

Profiled version of apply_token_mask()

7b8437b

Fix the schemaless case. Other fixes while updating README.

d151201

Generalize from example

5caf8c6

Remove left-sided zero padding

Start working on tool-calling path, though not quite there yet.

0302dd9

uogbuji added 2 commits February 1, 2025 09:19

QUick checkin before resuming work

7ef419a

Got tool-calling working again. Lots of modularization & improved sta…

984508a

…te management.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rearchitecting to take advantage of the latest in mlx_lm #36

Rearchitecting to take advantage of the latest in mlx_lm #36

uogbuji commented Jan 14, 2025

uogbuji commented Jan 14, 2025

Python-based Implementations

C-optimized Implementations

uogbuji commented Jan 14, 2025 •

edited

Loading

uogbuji commented Jan 14, 2025 •

edited

Loading

uogbuji commented Jan 14, 2025

uogbuji commented Jan 14, 2025

uogbuji commented Jan 15, 2025 •

edited

Loading

chimezie commented Jan 15, 2025

uogbuji commented Jan 15, 2025

uogbuji commented Jan 15, 2025

chimezie commented Jan 17, 2025

chimezie commented Jan 17, 2025

uogbuji commented Jan 18, 2025 •

edited

Loading

uogbuji commented Jan 18, 2025 •

edited

Loading

chimezie commented Jan 20, 2025

uogbuji commented Jan 20, 2025

chimezie commented Jan 22, 2025

uogbuji commented Jan 22, 2025

Rearchitecting to take advantage of the latest in mlx_lm #36

Are you sure you want to change the base?

Rearchitecting to take advantage of the latest in mlx_lm #36

Conversation

uogbuji commented Jan 14, 2025

uogbuji commented Jan 14, 2025

Python-based Implementations

C-optimized Implementations

uogbuji commented Jan 14, 2025 • edited Loading

uogbuji commented Jan 14, 2025 • edited Loading

uogbuji commented Jan 14, 2025

uogbuji commented Jan 14, 2025

uogbuji commented Jan 15, 2025 • edited Loading

chimezie commented Jan 15, 2025

uogbuji commented Jan 15, 2025

uogbuji commented Jan 15, 2025

chimezie commented Jan 17, 2025

chimezie commented Jan 17, 2025

uogbuji commented Jan 18, 2025 • edited Loading

uogbuji commented Jan 18, 2025 • edited Loading

chimezie commented Jan 20, 2025

uogbuji commented Jan 20, 2025

chimezie commented Jan 22, 2025

uogbuji commented Jan 22, 2025

uogbuji commented Jan 14, 2025 •

edited

Loading

uogbuji commented Jan 14, 2025 •

edited

Loading

uogbuji commented Jan 15, 2025 •

edited

Loading

uogbuji commented Jan 18, 2025 •

edited

Loading

uogbuji commented Jan 18, 2025 •

edited

Loading