How do I write a simple, Python parsing script?

Unless it's a really big file, why not iterate line by line? If the input file's size is a significant portion of your machine's available memory, then you might want to look into buffered input and other, lower-level abstractions of what the computer is doing. But if you're talking about a few hundred MB or less on a relatively modern machine, let the computer do the computing 😉
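
For instance, a minimal sketch of the two styles (the file names and the process() helper are placeholders, not from your code):

def process(line):
    print(line, end='')  # stand-in for whatever per-line work you need

# Line by line: only one line lives in memory at a time, so this scales
# to files much larger than available memory
with open('big_file.txt', 'r') as f:
    for line in f:
        process(line)

# Slurping: the whole file is read at once, which is fine when the file
# is a small fraction of available memory
with open('small_file.txt', 'r') as f:
    lines = f.read().splitlines()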

Off the bat, you might want to get into the habit of using the built-in context manager, with. For instance, in your snippet, you don't have a call to output.close().

with open('data.txt', 'r') as f_in:
    search_terms = f_in.read().splitlines()

Now search_terms is a list containing each line from data.txt as a string (with the newline characters removed). And data.txt is closed, thanks to with.

In fact, I would do that with the db.txt file, also.

with open('db.txt', 'r') as f_in:
    lines = f_in.read().splitlines()

Context managers are cool.

As a side note, you could open your destination file now, and do your parsing and results-tracking with it open the whole time, but I like leaving as many files closed as possible for as long as possible.
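
If you did want everything open at once, a single with statement can manage all three files together (a sketch, using the same file names as above):

# All three files stay open for the duration of the block, and all are
# closed together when it exits
with open('data.txt', 'r') as f_data, \
        open('db.txt', 'r') as f_db, \
        open('output.txt', 'w') as f_out:
    pass  # parsing and results-tracking would happen here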

I would suggest putting the biggest object on the outside of your loop, which I'm guessing is the contents of db.txt. The outermost iterable only gets traversed once, so you might as well put the biggest thing there.

results = []
for i, line in enumerate(lines):
    for term in search_terms:
        if term in line:
            # Use something not likely to appear in your line as a separator
            # for these second lines. I used three pipe characters, but
            # you could just as easily use something even more random.
            # Guard against a match on the very last line, which has no
            # line after it.
            next_line = lines[i + 1] if i + 1 < len(lines) else ''
            results.append('{}|||{}'.format(line, next_line))

if results:
    with open('output.txt', 'w') as f_out:
        for result in results:
            # Don't forget to replace your custom field separator
            f_out.write('> {}\n'.format(result.replace('|||', '\n')))
else:
    with open('no_results.txt', 'w') as f_out:
        # This will write an empty file to disk
        pass

The nice thing about this approach is that each line in db.txt is checked once for each search term in search_terms. However, the downside is that any line will be recorded once for each search term it contains; i.e., if a line contains three search terms, it will appear in your output.txt three times.
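
If you only want each line recorded once, one small change is to stop scanning terms after the first hit (a sketch, reusing lines and search_terms from above):

results = []
for i, line in enumerate(lines):
    for term in search_terms:
        if term in line:
            next_line = lines[i + 1] if i + 1 < len(lines) else ''
            results.append('{}|||{}'.format(line, next_line))
            break  # record the line once, then move on to the next line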

And all the files are magically closed.

Context managers are cool.

Good luck!

search_terms keeps the whole of data.txt in memory. That's not good in general, but in this case it's not too bad.

Looking line by line is not the most efficient approach, but if the case is simple and the files are not too big, it's not a big deal. If you want more efficiency, you should sort the data.txt file and put it into some tree-like structure. It depends on the data inside.
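
Python has no built-in tree, but sorting the terms and searching them with the standard bisect module gives similar logarithmic lookups. A minimal sketch, assuming you want exact matches against whole words (a different matching rule than substring-in-line):

import bisect

with open('data.txt', 'r') as f:
    search_terms = sorted(f.read().splitlines())

def is_search_term(word):
    # Binary search in the sorted list: O(log n) per lookup instead of
    # scanning every term for every word
    i = bisect.bisect_left(search_terms, word)
    return i < len(search_terms) and search_terms[i] == word

with open('db.txt', 'r') as db:
    for line in db:
        if any(is_search_term(word) for word in line.split()):
            print('Found: {}'.format(line.rstrip()))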

You have to use seek() to move the file pointer back after using next().
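
A minimal sketch of that rewinding. Note that in Python 3, tell() cannot be mixed with next() on a text-mode file, so this uses readline() throughout ('some term' is a placeholder):

with open('db.txt', 'r') as db:
    line = db.readline()
    while line:
        if 'some term' in line:
            pos = db.tell()            # remember where we are
            next_line = db.readline()  # peek at the following line
            db.seek(pos)               # move the pointer back again
        line = db.readline()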

Probably the easiest way here is to generate two lists of lines and search using in, like:

db = open('db.txt').readlines()
db_words = [x.split() for x in db]
data = open('data.txt').readlines()
print('Lines in db {}'.format(len(db)))
for item in data:
    term = item.strip()  # drop the trailing newline from readlines()
    for words in db_words:
        if term in words:
            print('Found {}'.format(term))

Your key issue is that you may be looping in the wrong order. In your code as posted, you'll always exhaust the db looking for the first term, so after the first pass of the outer for loop, db will be at end-of-file: no more lines to read, so no other term will ever be found.
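
In sketch form, the problematic order looks something like this (reconstructed from your description, so I'm guessing at the details):

with open('db.txt', 'r') as db:
    for term in search_terms:
        # The first term reads db all the way to end-of-file, so every
        # later term sees an exhausted file and finds nothing
        for line in db:
            if term in line:
                print('Found {}'.format(term))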

Other improvements include using the with statement to guarantee file closure, and a set to track which search terms were not found. (There are also typos in your posted code, such as opening a file as data but then reading it as ids).

So, for example, something like:

with open('data.txt', 'r') as data:
    search_terms = data.read().splitlines()

missing_terms = set(search_terms)

with open('db.txt', 'r') as db, open('output.txt', 'w') as output:
    for line in db:
        for term in search_terms:
            if term in line:
                missing_terms.discard(term)
                # next() advances the same iterator the for loop uses; note
                # that it raises StopIteration if the match is on the last line
                next_line = next(db)
                output.write('> ' + line + next_line)
                print('Found {}'.format(term))
                break

if missing_terms:
    diagnose_not_found(missing_terms)

where the diagnose_not_found function does whatever you need to do to warn the user about missing terms.
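
A minimal sketch of what diagnose_not_found might look like (the behavior here is just one possibility, and not_found.txt is a name I made up):

import sys

def diagnose_not_found(missing_terms):
    # Warn the user on stderr and keep a record of the terms that
    # never matched anything in db.txt
    for term in sorted(missing_terms):
        print('Warning: {!r} not found'.format(term), file=sys.stderr)
    with open('not_found.txt', 'w') as f:
        f.write('\n'.join(sorted(missing_terms)) + '\n')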

There are assumptions embedded here, such as the fact that you don't care whether some other search term is present in a line where you've already found a previous one, or in the very next line. These assumptions might take substantial work to fix if they don't apply, and that will require you to edit your Q with a very complete and unambiguous list of specifications.

If your db is actually small enough to comfortably fit in memory, slurping it all in as a list of lines once and for all would make it easier to accommodate more demanding specs, since you can easily go back and forth in a list, while iterating on a file means you can only go forward one line at a time. So if your specs are indeed more demanding, please also clarify whether this crucial condition holds, or whether you need this script to process potentially humongous db files (say, gigabyte-plus sizes, too big to comfortably fit in memory, depending on your platform of course).
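
In sketch form, the slurped variant looks like this (db_lines is a name I'm introducing; search_terms is from the code above). Once the whole file is a list, you can move back and forth freely by index:

with open('db.txt', 'r') as db:
    db_lines = db.read().splitlines()

for i, line in enumerate(db_lines):
    for term in search_terms:
        if term in line:
            # Index-based access lets you look backward as well as forward
            before = db_lines[i - 1] if i > 0 else ''
            after = db_lines[i + 1] if i + 1 < len(db_lines) else ''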
