How do I write a simple, Python parsing script?
Unless it's a really big file, why not iterate line by line? If the input file's size is some significant portion of your machine's available resources (memory), then you might want to look into buffered input and other, more low-level abstractions of what the computer is doing. But if you're talking about a few hundred MB or less on a relatively modern machine, let the computer do the computing 😉
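(If you do land in the genuinely huge-file case, a minimal sketch of chunked reading might look like the following; the 1 MiB chunk size and the `process` function are just illustrative placeholders, not anything from your code.)

def process(chunk):
    # Placeholder: do whatever per-chunk work you need here
    print(len(chunk))

with open('db.txt', 'r') as f:
    while True:
        chunk = f.read(1024 * 1024)  # read 1 MiB at a time; size is arbitrary
        if not chunk:
            break
        process(chunk)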
Off the bat, you might want to get into the habit of using the built-in context manager `with`. For instance, in your snippet, you don't have a call to `output.close()`.
with open('data.txt', 'r') as f_in:
    search_terms = f_in.read().splitlines()
Now `search_terms` is a handle to a list that has each line from `data.txt` as a string (but with the newline characters removed). And `data.txt` is closed, thanks to `with`.
In fact, I would do that with the `db.txt` file, also.
with open('db.txt', 'r') as f_in:
    lines = f_in.read().splitlines()
Context managers are cool.
As a side note, you could open your destination file now, and do your parsing and results-tracking with it open the whole time, but I like leaving as many files closed as possible for as long as possible.
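(For completeness, a sketch of that alternative, keeping the destination file open for the duration; the file names are assumed from the snippets above, and the write is a placeholder for your real parsing.)

with open('db.txt', 'r') as f_in, open('output.txt', 'w') as f_out:
    for line in f_in:
        # Placeholder: parse the line and track results here,
        # writing matches as you go instead of collecting them first
        f_out.write(line)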
I would suggest putting the biggest object on the outside of your loop, which I'm guessing is the `db.txt` contents. The outermost loop usually only gets iterated once, so you might as well put the biggest thing there.
results = []
for i, line in enumerate(lines):
    for term in search_terms:
        # The i + 1 check guards against a match on the very last line,
        # where there is no following line to record
        if term in line and i + 1 < len(lines):
            # Use something not likely to appear in your line as a separator
            # for these second lines. I used three pipe characters, but
            # you could just as easily use something even more random
            results.append('{}|||{}'.format(line, lines[i + 1]))

if results:
    with open('output.txt', 'w') as f_out:
        for result in results:
            # Don't forget to replace your custom field separator
            f_out.write('> {}\n'.format(result.replace('|||', '\n')))
else:
    with open('no_results.txt', 'w') as f_out:
        # This will write an empty file to disk
        pass
The nice thing about this approach is that each line in `db.txt` is checked once for each `search_term` in `search_terms`. However, the downside is that any line will be recorded once for each search term it contains, i.e., if it contains three search terms, that line will appear in your `output.txt` three times.
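(If those duplicates matter to you, one small tweak to the loop just shown, sketched below, records each matching line only once by breaking out of the inner loop after the first hit.)

results = []
for i, line in enumerate(lines):
    for term in search_terms:
        if term in line and i + 1 < len(lines):
            results.append('{}|||{}'.format(line, lines[i + 1]))
            break  # stop after the first matching term for this line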
And all the files are magically closed.
Context managers are cool.
Good luck!
`search_terms` keeps the whole of `data.txt` in memory. That's not good in general, but in this case it's not too bad. Searching line-by-line is not efficient, but if the case is simple and the files are not too big, it's not a big deal. If you want more efficiency, you should sort the `data.txt` file and put it into some tree-like structure. It depends on the data inside.
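(As a rough sketch of that sorted-structure idea, assuming exact whole-term lookups are enough, the standard `bisect` module over a sorted list gives O(log n) membership tests; `'some_term'` below is a made-up example.)

import bisect

with open('data.txt') as f:
    terms = sorted(line.strip() for line in f)

def contains(term):
    # Binary search over the sorted list: O(log n) per lookup
    i = bisect.bisect_left(terms, term)
    return i < len(terms) and terms[i] == term

print(contains('some_term'))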
You have to use `seek` to move the pointer back after using `next`.
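(A minimal sketch of that pointer dance; `readline()` is used here because it behaves like `next()` on a file object but plays nicely with `tell()`.)

with open('db.txt', 'r') as db:
    line = db.readline()
    pos = db.tell()            # remember where the next line starts
    next_line = db.readline()  # peek ahead one line
    db.seek(pos)               # move the pointer back, so the peeked
                               # line will be read again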
Probably the easiest way here is to generate two lists of lines and search using `in`, like:
db = open('db.txt').readlines()
db_words = [x.split() for x in db]                  # words of each db line
data = [line.strip() for line in open('data.txt')]  # the search terms
print('Lines in db: {}'.format(len(db)))
for item in data:
    for words in db_words:
        if item in words:
            print('Found {}'.format(item))
Your key issue is that you may be looping in the wrong order: in your code as posted, you'll always exhaust the `db` looking for the first term, so after the first pass of the outer `for` loop, `db` will be at end-of-file, with no more lines to read, and no other term will ever be found.
Other improvements include using the `with` statement to guarantee file closure, and a `set` to track which search terms were not found. (There are also typos in your posted code, such as opening a file as `data` but then reading it as `ids`.)
So, for example, something like:
with open('data.txt', 'r') as data:
    search_terms = data.read().splitlines()

missing_terms = set(search_terms)

with open('db.txt', 'r') as db, open('output.txt', 'w') as output:
    for line in db:
        for term in search_terms:
            if term in line:
                missing_terms.discard(term)
                # Advance the file iterator to grab the following line;
                # this assumes a match never lands on the very last line
                next_line = next(db)
                # 'line' still has its trailing newline, so this writes
                # the matched line and the next line
                output.write('> ' + line + next_line)
                print('Found {}'.format(term))
                break

if missing_terms:
    diagnose_not_found(missing_terms)
where the `diagnose_not_found` function does whatever you need to do to warn the user about missing terms.
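For instance, a minimal sketch of such a function might be (writing to a no_results.txt file is just an assumption, echoing the first answer; adapt to your actual reporting needs):

def diagnose_not_found(missing_terms):
    # One warning line per term that never matched
    with open('no_results.txt', 'w') as f_out:
        for term in sorted(missing_terms):
            f_out.write('Not found: {}\n'.format(term))
    print('{} term(s) not found'.format(len(missing_terms)))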
There are assumptions embedded here, such as the fact that you don't care if some other search term is present in a line where you've already found a previous one, or in the very next line; those assumptions might take substantial work to change if they don't apply, and doing so would require that you edit your question with a very complete and unambiguous list of specifications.
If your `db` is actually small enough to comfortably fit in memory, slurping it all in as a list of lines once and for all would allow easier accommodation of more demanding specs (in that case you can easily go back and forth, while iterating on a file means you can only go forward one line at a time). So, if your specs are indeed more demanding, please also clarify whether this crucial condition holds, or whether you need this script to process potentially humongous `db` files (say, gigabyte-plus sizes, too big to comfortably fit in memory, depending on your platform of course).