OK, say you have a moderately complex math problem you need to solve. You give the problem to 6 LLMs, all paid versions. All 6 get the same numbers. Would you trust the answer?

  • OwlPaste@lemmy.world
    English
    arrow-up
    5
    ·
    26 days ago

    no, once i tried to do binary calc with ChatGPT and it kept giving me wrong answers. good thing i had some unit tests around that part, so i realised quickly it was lying

    • Pika@sh.itjust.works
      English
      arrow-up
      2
      ·
      26 days ago

      Just yesterday I was fiddling around with a logic test in Python. I wanted to see how well DeepSeek could analyze the intro line to a for loop. It properly identified what the line did in its description, but when it moved on to giving examples it contradicted itself, and it took 3 or 4 replies before it realized that.

    • dan1101@lemmy.world
      English
      arrow-up
      2
      ·
      26 days ago

      Yes, more people need to realize it’s just a search engine with natural language input and output. LLM output should at least include citations.

    • Farmdude@lemmy.worldOP
      arrow-up
      1
      arrow-down
      4
      ·
      26 days ago

      But what if you gave the problem to all the top models and got the same answer? Is it still likely to be incorrect? I checked 6 models. I checked a bunch of times, with different accounts. I was testing it. I’m asking whether, in others’ opinions, the answer can be trusted given all that. I actually checked over a hundred times, and each got the same numbers.

      • Denjin@feddit.uk
        arrow-up
        2
        ·
        26 days ago

        They could get the right answer 9999 times out of 10000 and that one wrong answer is enough to make all the correct answers suspect.

      • porcoesphino@mander.xyz
        arrow-up
        1
        ·
        26 days ago

        What if there is a popular joke that relies on bad math, and it happens to be your question? Then the agreement between models is understandable and is no indication of accuracy. Why use a tool with known issues, plus overhead like querying six models, instead of a decent tool like Wolfram Alpha?

      • OwlPaste@lemmy.world
        English
        arrow-up
        1
        ·
        26 days ago

        my use case was, i expect, easier and simpler, so i was able to write automated tests to validate the logic for incrementing specific parts of a binary number, and i found that the expected test values the LLM produced were wrong.

        so if it's possible to use some kind of automation to verify LLM results for your problem, you can be confident in your answer. but generally LLMs tend to make up shit and sound confident about it
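
To make the "automate the check" idea concrete, here is a minimal Python sketch of that kind of test. It is not the code from the comment above: `increment_field`, `reference_increment`, and the shift/width layout are hypothetical names and assumptions chosen for illustration. The point is that expected values come from a second, independent implementation rather than from numbers an LLM supplied.

```python
def increment_field(value: int, shift: int, width: int) -> int:
    """Add 1 to the `width`-bit field starting at bit `shift`, wrapping on overflow.

    Hypothetical function under test; all other bits of `value` are left unchanged.
    """
    field_max = (1 << width) - 1
    mask = field_max << shift
    field = (value & mask) >> shift
    field = (field + 1) & field_max          # wrap around inside the field
    return (value & ~mask) | (field << shift)


def reference_increment(value: int, shift: int, width: int, total_bits: int = 8) -> int:
    """Slow, independent reference: do the same thing by slicing a bit string."""
    bits = format(value, f"0{total_bits}b")
    start = total_bits - shift - width       # string index of the field's top bit
    end = total_bits - shift                 # string index just past the field
    field = (int(bits[start:end], 2) + 1) % (2 ** width)
    return int(bits[:start] + format(field, f"0{width}b") + bits[end:], 2)


def check_all(shift: int = 2, width: int = 3, total_bits: int = 8) -> None:
    """Compare both implementations on every possible input of `total_bits` bits."""
    for value in range(2 ** total_bits):
        expected = reference_increment(value, shift, width, total_bits)
        actual = increment_field(value, shift, width)
        assert actual == expected, f"mismatch for {value:#010b}"


if __name__ == "__main__":
    check_all()
    print("all values agree")
```

Because the input space here is tiny, the check can be exhaustive. For larger problems the same pattern works with random sampling, and either way a disagreement shows up regardless of how confident the LLM sounded.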