• Allero@lemmy.today
    9 months ago

    Here’s my guess, aside from the tokenization issues others have already pointed out:

    We all know LLMs are trained on human-generated data. And when we ask something like “how many R’s” or “how many L’s” are in a given word, we usually don’t mean a count of every occurrence. We normally mean something like “how many of that letter come in a row here, so I can spell it right”.

    Yes, the word “strawberry” has 3 R’s. But what most people want to know is whether it is spelled “strawberry” or “strawbery”, and their “how many R’s” refers to exactly that spot, not to the entire word.
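
    To make the two readings of the question concrete, here is a minimal Python sketch. The helper names `total_count` and `longest_run` are made up for illustration; they are not from any library.

```python
from itertools import groupby


def total_count(word: str, letter: str) -> int:
    """Literal reading: count every occurrence of the letter in the word."""
    return word.lower().count(letter.lower())


def longest_run(word: str, letter: str) -> int:
    """Spelling reading: length of the longest run of consecutive copies of the letter."""
    target = letter.lower()
    # groupby yields one group per run of identical adjacent characters.
    runs = [len(list(group)) for ch, group in groupby(word.lower()) if ch == target]
    return max(runs, default=0)


print(total_count("strawberry", "r"))  # 3
print(longest_run("strawberry", "r"))  # 2
print(longest_run("strawbery", "r"))   # 1
```

    If most of the human text a model learned from uses “how many R’s” in the second sense, an answer of 2 is less surprising, even though the literal count is 3.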

    • Opisek@lemmy.world
      9 months ago

      But to be fair, as people we would not ask “how many Rs does strawberry have?”, but rather “with how many Rs do you spell strawberry?” or “do you spell strawberry with 1 R or 2 Rs?”

    • jj4211@lemmy.world
      9 months ago

      It doesn’t even see the word ‘strawberry’. The input has been tokenized, so the model no longer sees the ‘text’ that was typed in.

      It’s more like it sees a question like: How many ‘r’s in 草莓?

      And it spits out an answer based not on analysis of the input, but on a model of what people might have said.
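
      A rough sketch of that point in Python. The vocabulary, the token IDs and the `toy_tokenize` helper are all invented for illustration; real tokenizers (byte-pair encoding and similar) split words differently and use much larger vocabularies.

```python
# Hypothetical vocabulary: these pieces and IDs are made up for this example.
TOY_VOCAB = {"str": 101, "aw": 102, "berry": 103}


def toy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match split of text into vocabulary pieces, returned as IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return ids


ids = toy_tokenize("strawberry", TOY_VOCAB)
print(ids)  # [101, 102, 103]
```

      The model works with something like `[101, 102, 103]`, not with ten letters. Nothing in those numbers says how many ‘r’s the original word had, which is why the question looks to it about as opaque as asking for the ‘r’s in 草莓.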